Article
Cloud-native Observability: The Many-faceted Benefits of
Structured and Unified Logging - A Case Study
Nane Kratzke
Lübeck University of Applied Sciences; nane.kratzke@th-luebeck.de
Correspondence: nane.kratzke@th-luebeck.de
Abstract: Background: Cloud-native software systems often have a much more decentralized structure and many independently deployable and (horizontally) scalable components, making it more complicated to create a shared and consolidated picture of the overall decentralized system state. Today, observability is often understood as a triad of collecting and processing metrics, distributed tracing data, and logging. The result is often a complex observability system composed of three stovepipes whose data is difficult to correlate. Objective: This study analyzes whether these three historically emerged observability stovepipes of logs, metrics, and distributed traces could be handled in a more integrated way and with a more straightforward instrumentation approach. Method: This study applied an action research methodology used mainly in industry-academia collaboration and common in software engineering. The research design utilized iterative action research cycles, including one long-term use case. Results: This study presents a unified logging library for Python and a unified logging architecture that uses the structured logging approach. The evaluation shows that several thousand events per minute are easily processable. Conclusion: The results indicate that a unification of the current observability triad is possible without the necessity to develop utterly new toolchains.

Keywords: cloud-native; observability; cloud computing; logging; structured logging; logs; metrics; traces; distributed tracing; log aggregation; log forwarding; log consolidation
1. Introduction

A "crypto winter" basically means that the prices of so-called cryptocurrencies such as Bitcoin, Ethereum, Solana, etc. fell sharply on the crypto exchanges and then stayed low. The signs were all around in 2022: the failure of the TerraUSD crypto project in May 2022 sent an icy blast through the market, then the cryptocurrency lending platform Celsius Network halted withdrawals, prompting a sell-off that pushed Bitcoin to a 17-month low.

This study logged such a "crypto winter" on Twitter more by accident than by intention. Twitter was simply selected as an appropriate use case to evaluate a unified logging solution for cloud-native systems, and the decision was made to log Tweets containing stock symbols like $USD or $EUR. It turned out that most symbols used on Twitter are not related to currencies like $USD (US Dollar) or stocks like $AAPL (Apple) but to cryptocurrencies like $BTC (Bitcoin) or $ETH (Ethereum). The Twitter community therefore seems to be quite cryptocurrency-savvy. So, although some data of this 2022 crypto winter will be presented in this paper, the paper focuses on the methodical part and addresses how such and further data could be collected more systematically in distributed cloud-native applications. The paper will at least show that even complex observability of distributed systems can be reached simply by logging events to stdout.

Observability measures how well a system's internal state can be inferred from knowledge of its external outputs. The concept of observability was initially introduced by the Hungarian-American engineer Rudolf E. Kálmán for linear dynamical systems [1,2]. However, observability also applies to information systems and is of particular interest for fine-grained and distributed cloud-native systems that come with their very own set of observability challenges.
Traditionally, the responsibility for observability is (was?) with operations (Ops). With the emergence of DevOps, we can observe a shift of Ops responsibilities to developers. So, observability is evolving more and more into a Dev responsibility. Observability should ideally already be considered during the application design phase and not be regarded as some "add-on" feature for later expansion stages of an application. The current discussion about observability began well before the advent of cloud-native technologies like Kubernetes. A widely cited blog post by Cory Watson from 2013 shows how engineers at Twitter looked for ways to monitor their systems as the company moved from a monolithic to a distributed architecture [3-5]. One of the ways Twitter did this was by developing a command-line tool that engineers could use to create their own dashboards to keep track of the charts they were creating. While CI/CD tools and container technologies often bridge Dev and Ops in one direction, observability solutions close the loop in the opposite direction, from Ops to Dev [4]. Observability is thus the basis for data-driven software development (see Fig. 1 and [6]). As developments around cloud(-native) computing progressed, more and more engineers began to "live in their dashboards." They learned that it is not enough to collect and monitor data points but that it is necessary to address this problem more systematically.
Figure 1. Observability can be seen as a feedback channel from Ops to Dev (adapted from [4] + [6]).
2. Problem description

Today, observability is often understood as a triad. Observability of distributed information systems is typically achieved through the collection and processing of metrics (quantitative data, primarily as time series), distributed tracing data (execution durations of complex system transactions that flow through the services of a distributed system), and logging (qualitative data of discrete system events, often associated with timestamps but encoded as unstructured strings). Consequently, three stacks of observability solutions have emerged, and the following somehow summarizes the current state of the art.

• Metrics: Here, quantitative data is often collected in time series, e.g., how many requests a system is currently processing. The metrics technology stack is often characterized by tools such as Prometheus and Grafana.
• Distributed tracing involves following the path of transactions along the components of a distributed system. The tracing technology stack is characterized by tools such as Zipkin or Jaeger, and the technologies are used to identify and optimize particularly slow or error-prone substeps of distributed transaction processing.

• Logging is probably as old as software development itself, and many developers, because of the log ubiquity, are unaware that logging should be seen as part of holistic observability. Logs are usually stored in so-called log files. Primarily qualitative events are logged (e.g., user XYZ logs in/out). An event is usually appended to a log file as a line of text. Often the implicit and historically justifiable assumption prevails with developers that these log files are read and evaluated primarily by administrators (thus humans). However, that is hardly the case anymore. It is becoming increasingly common for the contents of these log files to be forwarded to a central database through "log forwarders" so that they can be evaluated and analyzed centrally. The technology stack is often characterized by tools such as Fluentd, FileBeat, or LogStash for log forwarding, databases such as ElasticSearch, Cassandra, or simply S3, and user interfaces such as Kibana.
Figure 2. An application is quickly surrounded by a complex observability system when metrics,
tracing and logs are captured with different observability stacks.
Incidentally, all three observability pillars have in common that the software to be developed must be somehow instrumented. This instrumentation is normally done using programming-language-specific libraries. Developers often regard distributed tracing instrumentation in particular as time-consuming. Also, which metric types (counter, gauge, histogram, summary, and more) are to be used in metric observability solutions such as Prometheus often depends on Ops experience and is not always immediately apparent to developers. Certain observability hopes fail simply because of wrongly chosen metric types. Only system metrics such as CPU, memory, and storage utilization can be easily captured in a black-box manner (i.e., without instrumentation in the code). However, these data are often only of limited use for the functional assessment of systems. For example, CPU utilization provides little information about whether conversion rates in an online store are developing in the desired direction.
So, current observability solutions are often based on these three stovepipes for logs, metrics, and traces. The result is an application surrounded by a complex observability system whose isolated datasets can be difficult to correlate. Fig. 2 focuses on the application (i.e., the object to be monitored) and triggers the question of whether it is justified to use three complex subsystems and three types of instrumentation, which always means three times the instrumentation and data analysis effort for isolated data silos.
The tool combination of ElasticSearch, LogStash, and Kibana is often used for logging and has even been given a catchy acronym: the ELK stack [3]. The ELK stack can be used to collect metrics and, using the APM plugin, also for distributed tracing. So, at least for the ELK stack, the three stovepipes are not clearly separable or disjoint. The separateness is more historically "suggested" than technologically given. Nevertheless, this tripartite division into metrics, tracing, and logging is very formative for the industry, as shown, for example, by the OpenTelemetry project [7]. OpenTelemetry is currently in the incubation stage at the Cloud Native Computing Foundation and provides a collection of standardized tools, APIs, and SDKs to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to analyze the performance and behaviour of software systems. OpenTelemetry thus standardizes observability but hardly aims to overcome the columnar separation into metrics, tracing, and logging.
In past and current industrial action research [4,6,8-14], I came across various cloud-native applications and corresponding engineering methodologies like the 12-factor app (see Sec. 4.1) and learned that the discussion around observability is increasingly moving beyond these three stovepipes and taking a more nuanced and integrated view. There is a growing awareness of integrating and unifying these three pillars, and more emphasis is being placed on analytics.
The research question arises whether these three historically emerged observability stovepipes of logs, metrics, and distributed traces could be handled in a more integrated way and with a more straightforward instrumentation approach. The results of this action research study show that this unification potential could be surprisingly easy to realize. This paper presents the methodology in Sec. 3 and its results in Sec. 4 (including a logging prototype in Sec. 4.4 and its evaluation results in Sec. 4.5 as the main contribution of this paper to the field). The results are discussed in Sec. 5. Furthermore, the study presents related work in Sec. 6 and concludes its findings as well as promising future research directions in Sec. 7.
3. Methodology

This study followed the action research methodology as a proven and well-established research methodology model for industry-academia collaboration in the software engineering context to analyze the research question mentioned above. Following the recommendations of Petersen et al. [15], a research design was defined that applied iterative action research cycles (see Fig. 3):

1. Diagnosis (Diagnosing according to [15])
2. Prototyping (Action planning, design and taking according to [15])
3. Evaluation including a possibly required redesign (Evaluation according to [15])
4. Transfer of learning outcomes to further use cases (Specifying learning according to [15])
Figure 3. Action research methodology of this study
With each of the following use cases, insights were transferred from the previous use case into a structured logging prototype (see Fig. 3). The following use cases have been studied and evaluated.

• Use Case 1: Observation of qualitative events occurring in an existing solution (online code editor; https://codepad.th-luebeck.dev; this use case was inspired by our research [11])
• Use Case 2: Observation of distributed events along distributed services (distributed tracing in an existing solution of an online code editor, see UC1)
• Use Case 3: Observation of quantitative data generated by a technical infrastructure (Kubernetes platform; this use case was inspired by our research [14])
• Use Case 4: Observation of a massive online event stream to gain experiences with high-volume event streams (we used Twitter as a data source and tracked worldwide occurrences of stock symbols; this use case was inspired by our research [16,17])
4. Results

The analysis of cloud-native methodologies like the 12-factor app [18] has shown that to build observability, one should take a more nuanced and integrated view to integrate and unify these three pillars of metrics, traces, and logs in order to enable more agile and convenient analytics in the feedback information flow of DevOps cycles (see Fig. 1). Two aspects that gained momentum in cloud-native computing are of interest:

• Recommendations on how to handle log forwarding and log consolidation in cloud-native applications
• Recommendations to apply structured logging

Because both aspects guided the implementation of the logging prototype deeply, they will be explained in more detail to provide the reader with the necessary context.
4.1. Twelve-factor apps

The 12-factor app is a method [18] for building software-as-a-service applications that pays special attention to the dynamics of organic growth of an application over time, the dynamics of collaboration between developers working together on a codebase, and avoiding the cost of software erosion. At its core, 12 rules (factors) should be followed to develop well-operable and evolutionarily developable distributed applications. This methodology harmonizes very well with microservice architecture approaches [3] and cloud-native operating environments like Kubernetes [19], which is why the 12-factor methodology is becoming increasingly popular. Incidentally, the 12-factor methodology does not contain any factor explicitly referring to observability, certainly not in the triad of metrics, tracing, and logging. However, factor XI recommends how to handle logging:
Logs are the stream of aggregated events sorted by time and summarized from the output streams of all running processes and supporting services. Logs are typically a text format with one event per line.

[...]

A twelve-factor app never cares about routing or storing its output stream. It should not attempt to write to or manage log files. Instead, each running process writes its stream of events to stdout. [...] On staging or production deploys, the streams of all processes are captured by the runtime environment, combined with all other streams of the app, and routed to one or more destinations for viewing or long-term archiving. These archiving destinations are neither visible nor configurable to the app - they are managed entirely from the runtime environment.
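To illustrate what factor XI means in practice for a Python process, the following minimal sketch (an illustration, not code from the study) writes all events to stdout and leaves routing and archiving entirely to the runtime environment:

import logging, sys

# Factor XI in practice: the process does not manage log files itself but
# writes every event to stdout; the platform captures and routes the stream.
logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format="%(levelname)s %(asctime)s - %(message)s")
logging.info("User Nane Kratzke tries to log in")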
4.2. From logging to structured logging

The logging instrumentation is quite simple for developers and works mostly programming-language-specific but basically according to the following principle, illustrated here in Python. A logging library must often be imported, defining so-called log levels such as DEBUG, INFO, WARNING, ERROR, FATAL, and others. While the application is running, a log level is usually set via an environment variable, e.g. INFO. All log calls at or above this level are then written to a log file.

1  import logging
2  logging.basicConfig(filename="example.log", level=logging.DEBUG)
3  logging.debug("Performing user check")
4  user = "Nane Kratzke"
5  logging.info(f"User {user} tries to log in.")
6  logging.warning(f"User {user} not found")
7  logging.error(f"User {user} has been banned.")

For example, line 5 would create the following entry in a log file:

INFO 2022-01-27 16:17:58 - User Nane Kratzke tries to log in
In a 12-factor app, this logging would be configured so that events are written directly to stdout (console). The runtime environment (e.g., Kubernetes with a FileBeat service installed) then routes the log data to the appropriate database, taking work away from the developer that they would otherwise have to invest in log processing. This type of logging is well supported across many programming languages and can be consolidated excellently with the ELK stack (or other observability stacks).

Logging (unlike distributed tracing and metrics collection) is often not even perceived as (complex) instrumentation by developers. Often it is done on their own initiative. However, one can systematize this instrumentation somewhat and extend it to so-called "structured logging". Again, the principle is straightforward. One simply does not log lines of text like
of text like 211
1IN FO 20 22 - 01 - 27 16 : 17 :5 8 - Us er Na ne K ra tz ke tr i es t o lo g in 212
but instead, the same information in a structured form, e.g. using JSON: 213
1{ " lo g l e ve l " : " i nf o " , " t im e s ta m p " : " 20 2 2 - 01 - 2 7 1 6: 1 7 :5 8 " , " e ve n t " : " Lo g i n " , 214
" us e r ": " N an e K ra t zk e " , " re s ul t " : " su c ce s s "} 215
In both cases, the text is written to the console. In the second case, however, a structured text-based data format is used that is easier to evaluate. In the case of a typical logging statement like "User Max Mustermann tries to log in", the text must first be analyzed to determine the user. This text parsing is costly on a large scale and can also be very computationally intensive and complex if there is plenty of log data in a variety of formats (which is the common case in the real world).
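The difference becomes tangible when the user has to be extracted from both formats programmatically. The following minimal sketch (not part of the study's prototype; the regular expression and the field name are illustrative assumptions) contrasts the two approaches:

import json
import re

unstructured = "INFO 2022-01-27 16:17:58 - User Nane Kratzke tries to log in"
structured = '{"event": "Login", "user": "Nane Kratzke", "result": "success"}'

# Unstructured: a format-specific regular expression is needed, and it breaks
# as soon as the wording of the log message changes.
match = re.search(r"User (?P<user>.+) tries to log in", unstructured)
user_from_text = match.group("user") if match else None

# Structured: the field can be read directly, independent of the message wording.
user_from_json = json.loads(structured)["user"]

assert user_from_text == user_from_json == "Nane Kratzke"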
However, in the case of structured logging, this information can be easily extracted from the JSON data field "user". In particular, more complex evaluations become much easier with structured logging as a result. However, the instrumentation does not become significantly more complex, especially since there are logging libraries for structured logging. In the logging prototype log12 of this study, the logging looks like this:
import log12
[...]
log12.error("Login", user=user, result="Not found", reason="Banned")
The resulting log files are still readable for administrators and developers (even if a bit more unwieldy) but much better processable and analyzable by databases such as ElasticSearch. Quantitative metrics can also be recorded in this way. Structured logging can thus also be used for the recording of quantitative metrics.
import log12
[...]
log12.info("Open requests", requests=len(requests))

This call results in a log entry like:

{"event": "Open requests", "requests": 42}
What is more, this structured logging approach can also be used to create tracings. In distributed tracing systems, a trace ID is created for each transaction that passes through a distributed system. The individual steps are so-called spans. These are also assigned an ID (span ID). The span ID is then linked to the trace ID, and the runtime is measured and logged. In this way, the time course of distributed transactions can be tracked along the components involved, and, for example, the duration of individual processing steps can be determined.
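As a small thought experiment (this is not the log12 implementation, which follows in Sec. 4.4, and all field names are illustrative assumptions), such trace and span relations can be expressed with nothing but structured events written to stdout:

import json, time, uuid

def emit(event, **fields):
    # A span is just a structured event that carries trace/span IDs and a duration.
    print(json.dumps({"event": event, **fields}))

trace_id = str(uuid.uuid4())        # one ID per distributed transaction
parent_span_id = str(uuid.uuid4())  # span of the calling service

start = time.time()
# ... perform a sub-step of the transaction, e.g. a call to another service ...
emit("db-query", trace_id=trace_id, parent_span_id=parent_span_id,
     span_id=str(uuid.uuid4()), duration_ms=(time.time() - start) * 1000)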
4.3. Resulting and simplified logging architecture

So, if the two principles - print logs simply to stdout and log in a structured, text-based data format - are applied consistently, the resulting observability system complexity reduces from Fig. 2 to Fig. 4, because all system components can emit log, metric, and trace information in the same style, which can be routed seamlessly by a platform-provided log forwarder (already existing technology) to a central analytical database.
Figure 4. An observability system consistently based on structured logging with significantly reduced
complexity.
4.4. Study outcome: Unified instrumentation via a structured logging library (prototype)

This paper will briefly explain below the way to capture events, metrics, and traces using the logging prototype that emerged. The prototype library log12 was developed in Python 3 but could be implemented in other programming languages analogously.
log12 automatically creates additional key-value attributes for each event, such as a unique identifier (that is used to relate child events to parent events and even remote events in distributed tracing scenarios) and start and completion timestamps that can be used to measure the runtime of events (a feature known from distributed tracing libraries but not common for logging libraries). It is explained

• how to create a log stream,
• how an event in a log stream is created and logged,
• how a child event can be created and assigned to a parent event (to trace and record runtimes of more complex and dependent chains of events within the same process),
• and how to make use of the distributed tracing features to trace events that pass through a chain of services in a distributed service-of-services system.

The following lines of code create a log stream with the name "logstream" that is logged to stdout.
Listing 1: Creating an event log stream in log12
import log12
log = log12.logging("logstream",
    general="value", tag="foo", service_mark="test"
)
Each event and child event of this stream is assigned a set of key-value pairs:

• general="value"
• tag="foo"
• service_mark="test"

These log-stream-specific key-value pairs can be used to define selection criteria in analytical databases like ElasticSearch to filter events of a specific service only. The following lines of code demonstrate how to create a parent event and child events.
Listing 2: Event logging in log12 using blocks as structure
# Log events using the with clause
with log.event("Test", hello="World") as event:
    event.update(test="something")
    # adds event specific key-value pairs to the event

    with event.child("Subevent 1 of Test") as ev:
        ev.update(foo="bar")
        ev.error("Catastrophe")
        # Explicit call of log (here on error level)

    with event.child("Subevent 2 of Test") as ev:
        ev.update(bar="foo")
        # Implicit call of ev.info("Success") (at block end)

    with event.child("Subevent 3 of Test") as ev:
        ev.update(bar="foo")
        # Implicit call of ev.info("Success") (at block end)
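The paper does not show the serialized output of Listing 2. Conceptually, each block results in one structured entry on stdout in which child events reference their parent; the following two lines are therefore only an illustrative assumption (all field names and identifiers are invented):

{"event": "Test", "id": "c1d2e3f4", "hello": "World", "test": "something", "general": "value", "tag": "foo", "service_mark": "test", "level": "info", "started": "2022-01-27 16:17:58.120", "completed": "2022-01-27 16:17:58.340"}
{"event": "Subevent 1 of Test", "id": "a5b6c7d8", "parent": "c1d2e3f4", "foo": "bar", "general": "value", "tag": "foo", "service_mark": "test", "level": "error", "started": "2022-01-27 16:17:58.130", "completed": "2022-01-27 16:17:58.150"}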
Furthermore, it is possible to log events in the event stream without the block style. That might be necessary for programming languages that do not support closing resources (here a log stream) at the end of a block. In this case, programmers are responsible for closing events using the .info(), .warn(), or .error() log levels.
Listing 3: Event logging in log12 without blocks
# To log events without with-blocks is possible as well.
ev = log.event("Another test", foo="bar")
ev.update(bar="foo")
child = ev.child("Subevent of Another test", foo="bar")
ev.info("Finished")
# <= However, then you are responsible to log events explicitly.
# If parent events are logged, all subsequent child events
# are assumed to have closed successfully as well.
Using this type of logging to forward events along HTTP-based requests is also possible. This usage of HTTP headers is the usual method in distributed tracing. Two main capabilities are required for this [20]. First, it must be possible to extract header information received by an HTTP service process. Secondly, it must be possible to inject the tracing information into follow-up upstream HTTP requests (in particular, the trace ID and span ID of the process initiating the request).

Listing 4 shows how log12 supports this with an extract attribute at event creation and an inject method of the event that extracts relevant key-value pairs from the event so that they can be passed as header information along an HTTP request.
Listing 4: Extraction and injection of tracing headers in log12
import log12
import requests  # To generate HTTP requests
from flask import request  # To demonstrate header extraction

with log.event("Distributed tracing", extract=request.headers) as ev:

    # Here is how to pass tracing information along remote calls
    with ev.child("Task 1") as event:
        response = requests.get(
            "https://qr.mylab.th-luebeck.dev/route?url=https://google.com",
            headers=event.inject()
        )
        event.update(length=len(response.text), status=response.status_code)
4.5. Evaluation of the logging prototype in the defined use cases

Use Cases 1 and 2: Codepad is an online coding tool to quickly share short code snippets in online and offline teaching scenarios. It was introduced during the Corona pandemic shutdowns to share short code snippets mainly in online educational settings for 1st- or 2nd-semester computer science students. Meanwhile, the tool is used in presence lectures and labs as well. The reader is welcome to try out the tool at https://codepad.th-luebeck.dev. This study used the Codepad tool in steps 1, 2, 3, and 4 of its action research methodology as an instrumentation use case (see Fig. 3) to evaluate the instrumentation of qualitative system events according to Sec. 4.4. Fig. 5 shows the Web-UI on the left and the resulting dashboard on the right. In a transfer step (steps 12, 13, 14, and 15 of the action research methodology, see Fig. 3) the same product was used to evaluate distributed tracing instrumentation (not covered in detail by this report).
Figure 5. Use Cases 1 and 2: Codepad is an online coding tool to quickly share short code snippets in online and offline teaching scenarios. On the left, the Web-UI. On the right, the Kibana dashboard used for observability in this study. Codepad was used as an instrumentation object of investigation.
Use Case 3 (steps 5, 6, 7, and 8 of the research methodology; Fig. 3) observed an institute's infrastructure, the so-called myLab infrastructure. myLab (https://mylab.th-luebeck.dev) is a virtual laboratory that can be used by students and faculty staff to develop and host web applications. This use case was chosen to demonstrate that it is possible to collect primarily metrics-based data over a long term using the same approach as in Use Case 1. A pod tracked mainly the resource consumption of various differing workloads deployed by more than 70 student web projects of different university courses. To observe this resource consumption, the pod simply ran periodically

• kubectl top nodes
• kubectl top pods --all-namespaces

against the cluster. This observation pod parsed the output of both shell commands and printed the parsed results in the structured logging approach presented in Sec. 4.4 (a sketch of this principle follows after Fig. 6). Fig. 6 shows the resulting Kibana dashboard for demonstration purposes.
Figure 6. Use Case 3: The dashboard of the Kubernetes infrastructure under observation (myLab)
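The principle of such an observation pod can be sketched as follows. This is not the study's original observer code, and the column layout of the kubectl top output may differ between kubectl and metrics-server versions:

import json, subprocess

def log_pod_metrics():
    # Periodically executed inside the observer pod; kubectl must be available
    # and the pod's service account needs permission to read pod metrics.
    out = subprocess.run(["kubectl", "top", "pods", "--all-namespaces"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines()[1:]:  # skip the header line
        # Expected columns: NAMESPACE NAME CPU(cores) MEMORY(bytes), e.g. "default web-7d4f 12m 85Mi"
        namespace, pod, cpu, memory = line.split()[:4]
        print(json.dumps({"event": "pod metrics", "namespace": namespace,
                          "pod": pod, "cpu": cpu, "memory": memory}))

log_pod_metrics()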
Use Case 4 (steps 9, 10, and 11 of the research methodology; Fig. 3) left our own ecosystem and observed the public Twitter event stream as a representative of a high-volume and long-term observation of an external system, thus a system that was intentionally not under the direct administrative control of the study investigators. Use Case 4 was designed as a two-phase study. The first screening phase was designed to gain experience in logging high-volume event streams and to provide necessary features and performance optimizations to the structured logging library prototype. The screening phase was designed to screen the complete and representative Twitter traffic as a kind of "ground truth". We were interested in the distribution of languages and stock symbols in relation to the general Twitter "background noise". This screening phase lasted from 20/01/2022 to 02/02/2022 and identified the most used stock symbols. A long-term recording was then done as a second long-term evaluation phase and was used to track and record the most frequently used stock symbols identified in the screening phase. This evaluation phase lasted from February 2022 until mid-August 2022. In this evaluation phase, just one infrastructure downtime occurred due to a shutdown of electricity at the author's institute. However, this downtime was not due to or related to the presented unified logging stack (see Fig. 9).
[Figure 7 shows the recorded events per day (hashtags, mentions, symbols, tweets, users) for the screening phase and for the evaluation phase (February to August 2022); the evaluation panel marks the infrastructure downtime and the LUNA crash.]
Figure 7. Recorded events (screening and evaluation phase of Use Case 4).
The recording was done using the following source code, compiled into a Docker container, that was executed on the Kubernetes cluster that was logged in Use Cases 1, 2, and 3. FileBeat was used as the log-forwarding component to a background ElasticSearch database. The resulting event log has been analyzed and visualized using Kibana. Kibana was also used to collect the data in the form of CSV files for the screening and the evaluation phase. Figs. 7, 8, and 9 have been compiled from that data. This setting followed exactly the unified and simplified logging architecture presented in Fig. 4.
Listing 5: The used logging program to record Twitter stock symbols from the public
Twitter Stream API
import log12, tweepy, os

KEY = os.environ.get("CONSUMER_KEY")
SECRET = os.environ.get("CONSUMER_SECRET")
TOKEN = os.environ.get("ACCESS_TOKEN")
TOKEN_SECRET = os.environ.get("ACCESS_TOKEN_SECRET")

LANGUAGES = [l.strip() for l in os.environ.get("LANGUAGES", "").split(",")]
TRACK = [t.strip() for t in os.environ.get("TRACKS").split(",")]

log = log12.logging("twitter stream")

class Twista(tweepy.Stream):

    def on_status(self, status):
        with log.event("tweet", tweet_id=status.id_str,
            user_id=status.user.id_str, lang=status.lang
        ) as event:
            kind = "status"
            kind = "reply" if status._json['in_reply_to_status_id'] else kind
            kind = "retweet" if 'retweeted_status' in status._json else kind
            kind = "quote" if 'quoted_status' in status._json else kind
            event.update(lang=status.lang, kind=kind, message=status.text)

            with event.child('user') as usr:
                name = status.user.name if status.user.name else "unknown"
                usr.update(lang=status.lang, id=status.user.id_str,
                    name=name,
                    screen_name=f"@{status.user.screen_name}",
                    message=status.text,
                    kind=kind
                )

            for tag in status.entities['hashtags']:
                with event.child('hashtag') as hashtag:
                    hashtag.update(lang=status.lang,
                        tag=f"#{tag['text'].lower()}",
                        message=status.text,
                        kind=kind
                    )

            for sym in status.entities['symbols']:
                with event.child('symbol') as symbol:
                    symbol.update(lang=status.lang,
                        symbol=f"${sym['text'].upper()}",
                        message=status.text,
                        kind=kind
                    )
                    symbol.update(screen_name=f"@{status.user.screen_name}")

            for user_mention in status.entities['user_mentions']:
                with event.child('mention') as mention:
                    mention.update(lang=status.lang,
                        screen_name=f"@{user_mention['screen_name']}",
                        message=status.text,
                        kind=kind
                    )

record = Twista(KEY, SECRET, TOKEN, TOKEN_SECRET)
if LANGUAGES:
    record.filter(track=TRACK, languages=LANGUAGES)
else:
    record.filter(track=TRACK)
According to Fig. 7, just every 100th observed event in the screening phase was a stock symbol. That is simply the "ground truth" on Twitter: if one observes the public Twitter stream without any filter, that is what one gets. So, the second evaluation phase recorded a very specific "filter bubble" of the Twitter stream. The reader should be aware that the data presented in the following is clearly biased and not a representative Twitter event stream; it is a stock-market-focused subset or, to be even more precise, a cryptocurrency-focused subset, because almost all stock symbols on Twitter are related to cryptocurrencies.

It is possible to visualize the resulting effects using the recorded data. Fig. 8 shows the difference in language distributions between the screening phase (unfiltered ground truth) and the evaluation phase (activated symbol filter). While in the screening phase English (en), Spanish (es), Portuguese (pt), and Turkish (tr) are responsible for more than 3/4 of all traffic, in the evaluation phase almost all recorded Tweets are in English. So, on Twitter, the language most related to stock symbols is clearly English.
[Figure 8 shows two charts of the language distribution (ISO codes): the screening phase dominated by en, es, pt, and tr, and the evaluation phase dominated almost entirely by en.]
Figure 8. Observed languages (screening and evaluation phase of Use Case 4).
Although the cryptocurrency logging was used mainly as a use case for technical evaluation purposes of the logging library prototype, some interesting insights could be gained. For example, although Bitcoin (BTC) is likely the most prominent cryptocurrency, it is by far not the most frequently used stock symbol on Twitter. The most prominent stock symbols on Twitter are:

• ETH: Ethereum cryptocurrency
• SOL: Solana cryptocurrency
• BTC: Bitcoin cryptocurrency
• LUNA: Terra Luna cryptocurrency (replaced by a new version after the crash in May 2022)
• BNB: Binance Coin cryptocurrency
What is more, we can see interesting details in the trends (see Fig. 9).

• The ETH usage on Twitter seems to be declining throughout the observed period.
• The SOL usage is, on the contrary, increasing, although we observed a sharp decline in July.
• The LUNA usage has a clear peak that correlates with the LUNA cryptocurrency crash in mid-May 2022 (this crash was heavily reflected in the investor media).

The Twitter usage was not correlated with the currency rates on cryptocurrency stock markets. However, changes in usage patterns of stock market symbols might be of interest to cryptocurrency investors as indicators to observe. As this study shows, these changes can be easily tracked using structured logging approaches. Of course, this can be transferred to other social media streaming or general event streaming use cases like IoT (Internet of Things) as well.
5. Discussion

This style of unified and structured observability was successfully evaluated on several use cases that made use of a FileBeat/ElasticSearch-based observability stack. However, other observability stacks that can forward and parse structured text in a JSON format will likely show the same results. The evaluation included a long-term test over more than six months for a high-volume evaluation use case.

• On the one hand, it could be proven that such a type of logging can easily be used to perform classic metrics collection. For this purpose, black-box metrics such as CPU, memory, and storage for the infrastructure (nodes) but also the "payload" (pods) were successfully collected and evaluated in several Kubernetes clusters (see Fig. 6).
• Second, a high-volume use case was investigated and analyzed in depth. Here, all English-language tweets on the public Twitter stream were logged. About 1 million events per hour were logged over a week and forwarded to an ElasticSearch database using the log forwarder FileBeat. Most systems will generate far fewer events (see Fig. 7).
[Figure 9 plots the recorded symbols per day ($ETH, $SOL, $BTC, $LUNA, $BNB) from February to August 2022, with annotations for the infrastructure downtime, the LUNA crash, and an unclear decline.]
Figure 9. Recorded symbols per day (evaluation phase of Use Case 4).
• In addition, the prototype logging library log12 is meanwhile used in several internal systems, including web-based development environments, QR code services, and e-learning systems, to record access frequencies to learning content and to study the learning behaviour of students.
5.1. Lessons learned

All use cases have shown that structured logging is easy to instrument and harmonizes well with existing observability stacks (esp. Kubernetes, Filebeat, ElasticSearch, Kibana). However, some aspects should be considered:

1. It is essential to apply structured logging because this can be used to log events, metrics, and traces in the same style.
2. Very often, only error-prone situations are logged. However, if you want to act in the sense of DevOps-compliant observability, you should also log normal - completely regular - behaviour. DevOps engineers can gain many insights from how normal users use systems in standard situations. So, the log level should be set to INFO, and not WARNING, ERROR, or above.
3. Cloud-native system components should rely on the log forwarding and log aggregation of the runtime environment. Never implement this on your own. You will duplicate logic and end up with complex and possibly incompatible log aggregation systems.
4. To simplify analysis for engineers, one should push key-value pairs of parent events down to child events. This logging approach simplifies analysis in centralized log analysis solutions - it simply reduces the need to derive event contexts that might be difficult to deduce in JSON document stores. However, this comes at the cost of more extensive log storage.
5. Do not collect aggregated metrics data. The aggregation (mean, median, percentile, standard deviation, sum, count, and more) can be done much more conveniently in the analytical database (see the sketch after this list). The instrumentation should focus on recording metrics data in a point-in-time style. According to our developer experience, developers are glad to be authorized to log only such simple metrics, especially when there is not much background knowledge in statistics.
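To make the last point concrete, a point-in-time event such as {"event": "Open requests", "requests": 42} can be aggregated entirely at query time. The following sketch runs an average and a 95th-percentile aggregation in ElasticSearch; the index name, field names, and host are assumptions for illustration only:

import requests

query = {
    "size": 0,
    "query": {"match": {"event": "Open requests"}},
    "aggs": {
        "avg_open_requests": {"avg": {"field": "requests"}},
        "p95_open_requests": {"percentiles": {"field": "requests", "percents": [95]}}
    }
}
# Aggregation (mean, percentiles, ...) happens in the analytical database,
# not in the instrumented application.
resp = requests.post("http://elasticsearch:9200/logstream/_search", json=query)
print(resp.json()["aggregations"])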
5.2. Threats to validity and limitations of the study design to be considered

Action research is prone to drawing incorrect or non-generalizable conclusions. Logically, the significance is consistently highest within the considered use cases. In order to draw generalizable conclusions, this study defined use cases in such a way that intentionally different classes of telemetry data (logs, metrics, traces) were considered. It should be noted that the study design primarily considered logs and metrics but traces only marginally. Traces were not wholly neglected, however, but were analyzed less intensively.

The long-term acquisition was performed with a high-volume use case to cover certain stress test aspects. However, the reader must be aware that the screening phase generated significantly higher data volumes in Use Case 4 than the evaluation phase. Therefore, to use stress test data from this study, one should look at the event volume of the screening phase of Use Case 4. Here, about ten thousand events per minute were logged for more than a week, giving an impression of the performance of the proposed approach. The study data shows that the saturation limit should be far beyond these ten thousand events per minute. However, the study design did not push the system to its event recording saturation limits.

What is more, this study should not be used to derive any cryptocurrency-related conclusions, although some aspects of Use Case 4 could be of interest for the generation of cryptocurrency trading indicators. However, no detailed analysis of correlations between stock prices and usage frequencies of stock symbols on Twitter has been done.
6. Related work

There are relatively few studies dealing with observability as a main object of investigation in an academic understanding. The field is currently treated somewhat stepmotherly. However, an interesting and recent overview is provided by the survey of Usman et al. [21]. This survey provides a list of microservice-focused managed and unified observability services (Dynatrace, Datadog, New Relic, Sumo Logic, Solar Winds, Honeycomb). The research prototype presented in this study heads in the same direction but tries to pursue the problem primarily on the instrumentation side using a more lightweight and unified approach. Addressing the client side of the problem is obviously harder to exploit economically, which is why the industry might prefer to address the problem on the managed service side.

Of logs, metrics, and distributed traces, distributed tracing is still considered in the most detail. In particular, the papers around Dapper [20] should be mentioned here, which had a significant impact on this field. A black-box approach for distributed tracing that needs no instrumentation is presented by [22]. This study, however, has seen tracing as only one of three aspects of observability and therefore follows a broader approach. A more recent review of current challenges and approaches of distributed tracing is presented by Bento et al. [23].
6.1. Existing instrumentation libraries and observability solutions

Although the academic coverage of the observability field is expandable, in practice there is an extensive set of existing solutions, especially for time series analysis and instrumentation. A complete listing is beyond the scope of this paper. However, from the disproportion between the number of academic papers and the number of actually existing solutions, one quickly recognizes the practical relevance of the topic. Table 1 contains a list of existing database products often used for telemetry data consolidation to give the reader an overview without claiming completeness. This study used ElasticSearch as an analytical database.
Table 1. Often seen databases for telemetry data consolidation. Products used in this study are marked bold. Without claiming completeness.

| Product | Organization | License | Often seen scope |
| APM | Elastic | Apache 2.0 | Tracing (add-on to ElasticSearch database) |
| **ElasticSearch** | Elastic | Apache/Elastic License 2.0 | Logs, Tracing, (rarely Metrics) |
| InfluxDB | Influxdata | MIT | Metrics |
| Jaeger | Linux Foundation | Apache 2.0 | Tracing |
| OpenSearch | Amazon Web Services | Apache 2.0 | Logs, Tracing, (rarely Metrics); fork of ElasticSearch |
| Prometheus | Linux Foundation | Apache 2.0 | Metrics |
| Zipkin | OpenZipkin | Apache 2.0 | Tracing |
Table 2 lists several frequently used forwarding solutions that developers can use to forward data from the point of capture to the databases listed in Table 1. In the context of this study, FileBeat was used as the log forwarding solution. It could be proved that this solution is also capable of forwarding traces and metrics if applied in a structured logging setting.
Table 2. Often seen forwarding solutions for log consolidation. Products used in this study are marked bold. Without claiming completeness.

| Product | Organization | License |
| Fluentd | FluentD Project | Apache 2.0 |
| Flume | Apache | Apache 2.0 |
| LogStash | Elastic | Apache 2.0 |
| **FileBeat** | Elastic | Apache/Elastic License 2.0 |
| Rsyslog | Adiscon | GPL |
| syslog-ng | One Identity | GPL |
An undoubtedly incomplete overview of instrumentation libraries for different products and languages is given in Table 3, incomplete presumably because each programming language comes with its own form of logging in the shape of specific libraries. Avoiding this language binding is hardly possible in the instrumentation context unless one pursues "esoteric approaches" like [22]. The logging library prototype is strongly influenced by the Python standard logging library but also by structlog for structured logging, without actually using these libraries.
Table 3. Often seen instrumentation libraries. Products that inspired the research prototype are marked bold. Without claiming completeness.

| Product | Use Case | Organization | License | Remark |
| APM Agents | Tracing | Elastic | BSD 3 | |
| Jaeger Clients | Tracing | Linux Foundation | Apache 2.0 | |
| log | Logging | Go Standard Library | BSD 3 | Logging for Go |
| log4j | Logging | Apache | Apache 2.0 | Logging for Java |
| **logging** | Logging | Python Standard Library | GPL compatible | Logging for Python |
| Micrometer | Metrics | Pivotal | Apache 2.0 | |
| OpenTracing | Tracing | OpenTracing | Apache 2.0 | |
| prometheus | Metrics | Linux Foundation | Apache 2.0 | |
| Splunk APM | Tracing | Splunk | Apache 2.0 | |
| **structlog** | Logging | Hynek Schlawack | Apache 2.0, MIT | Structured logging for Python |
| winston | Logging | Charlie Robbins | MIT | Logging for node.js |
6.2. Standards

There are hardly any observability standards. However, a noteworthy standardization approach is the OpenTelemetry Specification [7] of the Cloud Native Computing Foundation [24], which tries to standardize the way of instrumentation. This approach corresponds to the core idea that this study also follows. Nevertheless, the standard is still divided into Logs [25], Metrics [26], and Traces [27], which means that the conceptual triad of observability is not questioned. On the other hand, approaches like the OpenTelemetry Operator [28] for Kubernetes enable injecting auto-instrumentation libraries for Java, Node.js, and Python into Kubernetes-operated applications, which is a feature that is currently not addressed by the present study. However, so-called service meshes also use auto-instrumentation. A developing standard here is the so-called Service Mesh Interface (SMI) [29].
7. Conclusions and Future Research Directions

Cloud-native software systems often have a much more decentralized structure and many independently deployable and (horizontally) scalable components, making it more complicated to create a shared and consolidated picture of the overall decentralized system state. Today, observability is often understood as a triad of collecting and processing metrics, distributed tracing data, and logging. But why, except for historical reasons?

This study presents a unified logging library for Python [30] and a unified logging architecture (see Fig. 4) that uses a structured logging approach. The evaluation of four use cases shows that several thousand events per minute are easily processable and can be used to handle logs, traces, and metrics in the same way. At least, this study was able, with a straightforward approach, to log the worldwide Twitter event stream of stock market symbols over a period of six months without any noteworthy problems. As a side effect, some interesting aspects of how cryptocurrencies are reflected on Twitter could be derived. This might be of minor relevance for this study, but it shows the overall potential of a unified and structured-logging-based observability approach.

The presented approach relies on an easy-to-use, programming-language-specific logging library that follows the structured logging approach. The long-term observation results of more than six months indicate that a unification of the current observability triad of logs, metrics, and traces is possible without the necessity to develop utterly new toolchains. The trick is to

• use structured logging and
• apply log forwarding to a central analytical database

in a systematic, infrastructure- or platform-provided way.

Further research should therefore concentrate on the instrumentation and less on the log forwarding and consolidation layer. If we instrument logs, traces, and metrics in the same style using the same log forwarding, we automatically generate correlatable data in a single data source of truth, and we simplify analysis.

So, the observability road ahead may have several paths. On the one hand, we should standardize the logging libraries in a structured style like log12 in this study or the OpenTelemetry project in the "wild". Logging libraries should be implemented comparably in different programming languages and should generate the same structured logging data. So, we have to standardize the logging SDKs and the data format. Both should be designed to cover logs, metrics, and distributed traces in a structured format. To simplify instrumentation further, we should additionally think about auto-instrumentation approaches, as proposed, for instance, by the OpenTelemetry Kubernetes Operator [28] and several service meshes like Istio [31] and corresponding standards like SMI [29].
Funding: This research received no external funding.

Data Availability Statement: The resulting research prototype of the developed structured logging library log12 can be accessed here [30]. However, the reader should be aware that this is prototype software in progress.

Conflicts of Interest: The author declares no conflict of interest.
References 629
1. Kalman, R. On the general theory of control systems. IFAC Proceedings Volumes 1960, 1, 491–502. 1st International IFAC Congress on Automatic and Remote Control, Moscow, USSR, 1960. https://doi.org/10.1016/S1474-6670(17)70094-8.
2. Kalman, R.E. Mathematical Description of Linear Dynamical Systems. Journal of the Society for Industrial and Applied Mathematics Series A Control 1963, 1, 152–192. https://doi.org/10.1137/0301010.
3. Newman, S. Building Microservices, 1st ed.; O'Reilly Media, Inc., 2015.
4. Kim, G.; Humble, J.; Debois, P.; Willis, J.; Forsgren, N. The DevOps handbook: How to create world-class agility, reliability, & security in technology organizations; IT Revolution, 2016.
5. Davis, C. Cloud Native Patterns: Designing change-tolerant software; Simon and Schuster, 2019.
6. Kratzke, N. Cloud-native Computing: Software Engineering von Diensten und Applikationen für die Cloud; Carl Hanser Verlag GmbH Co. KG, 2021.
7. The OpenTelemetry Authors. The OpenTelemetry Specification, 2021.
8. Kratzke, N.; Peinl, R. ClouNS - a Cloud-Native Application Reference Model for Enterprise Architects. In Proceedings of the 2016 IEEE 20th International Enterprise Distributed Object Computing Workshop (EDOCW), 2016, pp. 1–10. https://doi.org/10.1109/EDOCW.2016.7584353.
9. Kratzke, N.; Quint, P.C. Understanding Cloud-native Applications after 10 Years of Cloud Computing - A Systematic Mapping Study. Journal of Systems and Software 2017, 126, 1–16. https://doi.org/10.1016/j.jss.2017.01.001.
10. Kratzke, N. A Brief History of Cloud Application Architectures. Applied Sciences 2018, 8. https://doi.org/10.3390/app8081368.
11. Kratzke, N. How programming students trick and what JEdUnit can do against it. In Computer Supported Education; Lane, H.C.; Zvacek, S.; Uhomoibhi, J., Eds.; Springer International Publishing, 2020; pp. 1–25. CSEDU 2019 - Revised Selected Best Papers (CCIS). https://doi.org/10.1007/978-3-030-58459-7_1.
12. Kratzke, N. Einfachere Observability durch strukturiertes Logging. Informatik Aktuell 2022.
13. Kratzke, N.; Siegfried, R. Towards Cloud-native Simulations - Lessons learned from the front-line of cloud computing. Journal of Defense Modeling and Simulation 2020. https://doi.org/10.1177/1548512919895327.
14. Truyen, E.; Kratzke, N.; Van Landuyt, D.; Lagaisse, B.; Joosen, W. Managing Feature Compatibility in Kubernetes: Vendor Comparison and Analysis. IEEE Access 2020, 8, 228420–228439. https://doi.org/10.1109/ACCESS.2020.3045768.
15. Petersen, K.; Gencel, C.; Asghari, N.; Baca, D.; Betz, S. Action Research as a Model for Industry-Academia Collaboration in the Software Engineering Context. In Proceedings of the 2014 International Workshop on Long-Term Industrial Collaboration on Software Engineering; Association for Computing Machinery: New York, NY, USA, 2014; WISE '14, pp. 55–62. https://doi.org/10.1145/2647648.2647656.
16. Kratzke, N. The #BTW17 Twitter Dataset - Recorded Tweets of the Federal Election Campaigns of 2017 for the 19th German Bundestag. Data 2017, 2. https://doi.org/10.3390/data2040034.
17. Kratzke, N. Monthly Samples of German Tweets, 2022. https://doi.org/10.5281/zenodo.2783954.
18. Wiggins, A. The Twelve-Factor App, 2017. https://12factor.net.
19. The Kubernetes Authors. Kubernetes, 2014. https://kubernetes.io.
20. Sigelman, B.H.; Barroso, L.A.; Burrows, M.; Stephenson, P.; Plakal, M.; Beaver, D.; Jaspan, S.; Shanbhag, C. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical report, Google, Inc., 2010.
21. Usman, M.; Ferlin, S.; Brunstrom, A.; Taheri, J. A Survey on Observability of Distributed Edge & Container-based Microservices. IEEE Access 2022, pp. 1–1. https://doi.org/10.1109/ACCESS.2022.3193102.
22. Chow, M.; Meisner, D.; Flinn, J.; Peek, D.; Wenisch, T.F. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14); USENIX Association: Broomfield, CO, 2014; pp. 217–231.
23. Bento, A.; Correia, J.; Filipe, R.; Araujo, F.; Cardoso, J. Automated Analysis of Distributed Tracing: Challenges and Research Directions. Journal of Grid Computing 2021, 19, 9. https://doi.org/10.1007/s10723-021-09551-5.
24. Linux Foundation. Cloud-native Computing Foundation, 2015. https://cncf.io.
25. The OpenTelemetry Authors. The OpenTelemetry Specification - Logs Data Model, 2021. https://opentelemetry.io/docs/reference/specification/logs/data-model/.
26. The OpenTelemetry Authors. The OpenTelemetry Specification - Metrics SDK, 2021. https://opentelemetry.io/docs/reference/specification/metrics/sdk/.
27. The OpenTelemetry Authors. The OpenTelemetry Specification - Tracing SDK, 2021. https://opentelemetry.io/docs/reference/specification/trace/sdk/.
28. The OpenTelemetry Authors. The OpenTelemetry Operator, 2021. https://github.com/open-telemetry/opentelemetry-operator.
29. Service Mesh Interface Authors. SMI: A standard interface for service meshes on Kubernetes, 2022. https://smi-spec.io.
30. Kratzke, N. log12 - a single and self-contained structured logging library, 2022. https://github.com/nkratzke/log12.
31. Istio Authors. The Istio service mesh, 2017. https://istio.io/.