System Log Clustering Approaches for Cyber Security Applications: A Survey

DOI: 10.1016/j.cose.2020.101739
Computers & Security 92 (2020) 101739
System log clustering approaches for cyber security applications: A survey
Max Landauer a,∗, Florian Skopik a, Markus Wurzenberger a, Andreas Rauber b
a Austrian Institute of Technology, Austria
b Vienna University of Technology, Austria
Article info
Article history:
Received 23 December 2019
Accepted 29 January 2020
Available online 31 January 2020
Keywords:
Log clustering
Cyber security
Log mining
Signature extraction
Anomaly detection
Abstract
Log files give insight into the state of a computer system and enable the detection of anomalous events relevant to cyber security. However, automatically analyzing log data is difficult since it contains massive amounts of unstructured and diverse messages collected from heterogeneous sources. Therefore, several approaches that condense or summarize log data by means of clustering techniques have been proposed. Picking the right approach for a particular application domain is, however, non-trivial, since algorithms are designed towards specific objectives and requirements. This paper therefore surveys existing approaches. It thereby groups approaches by their clustering techniques, reviews their applicability and limitations, discusses trends and identifies gaps. The survey reveals that approaches usually pursue one or more of four major objectives: overview and filtering, parsing and signature extraction, static outlier detection, and sequences and dynamic anomaly detection. Finally, this paper also outlines a concept and tool that support the selection of appropriate approaches based on user-defined requirements.
© 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ).
1. Introduction
Log files contain information about almost all events that take
place in a system, depending on the log level. For this, the de-
ployed logging infrastructure automatically collects, aggregates and
stores the logs that are continuously produced by most compo-
nents and devices, e.g., web servers, databases, or firewalls. The
textual log messages are usually human-readable and attached to
a time stamp that specifies the point in time the log entry was
generated. Especially for large organizations and enterprises, the
benefits of having access to long-term log data are manifold: His-
toric logs enable forensic analysis of past events. Most prominently
applied after faults occurred in the system, forensic analysis gives
system administrators the possibility to trace back the roots of ob-
served problems. Moreover, the logs may help to recover the sys-
tem to a non-faulty state, reset incorrect transactions, restore data,
prevent losses of information, and replicate scenarios that lead to
erroneous states during testing. Finally, logs also allow administra-
tors to validate the performance of processes and discover bottlenecks. In addition to these functional advantages, storing logs is typically inexpensive since log files can effectively be compressed due to a high number of repeating lines.
∗ Corresponding author. E-mail addresses: max.landauer@ait.ac.at (M. Landauer), florian.skopik@ait.ac.at (F. Skopik), markus.wurzenberger@ait.ac.at (M. Wurzenberger), rauber@ifs.tuwien.ac.at (A. Rauber).
A major issue with forensic log analysis is that problems are
only detected in hindsight. Furthermore, it is a time- and resource-
consuming task that requires domain knowledge about the sys-
tem at hand. For these reasons, modern approaches in cyber se-
curity shift from a purely forensic to a proactive analysis ( He et al.,
2017b ). Thereby, real-time fault detection is enabled by constantly
monitoring system logs in an online manner, i.e., as soon as they
are generated. This allows timely responses and in turn reduces the
costs caused by incidents and cyber attacks. On top of that, indi-
cators for upcoming erroneous system behavior can frequently be
observed in advance. Detecting such indicators early enough and
initiating appropriate countermeasures can help to prevent certain
faults altogether.
Unfortunately, this task is hardly possible for humans since log
data is generated in immense volumes and fast rates. When con-
sidering large enterprise systems, it is not uncommon that the
number of daily produced log lines is up in the millions, for exam-
ple, publicly available Hadoop Distributed File System (HDFS) logs
comprise more than 4 million log lines per day ( Xu et al., 2009 )
and small organizations are expected to deal with peaks of 22,000
events per second ( Allen and Richardson, 2019 ). Clearly, this makes
manual analysis impossible and it thus stands to reason to employ
machine learning algorithms that automatically process the lines
and recognize interesting patterns that are then presented to sys-
tem operators in a condensed form.
One method for analyzing large amounts of log data is clus-
tering. Thus, several clustering algorithms that were particularly
designed for textual log data have been proposed in the past.
Since most of the algorithms were mainly developed for certain
application-specific scenarios at hand, their approaches frequently
differ in their overall goals and assumptions on the input data. We
were specifically interested to discover the different strategies the
authors used to pursue the objectives induced by their use-cases.
However, to the best of our knowledge there is no exhaustive sur-
vey on state-of-the-art log data clustering approaches that focuses
on applications in cyber security. Though also concerned with certain types of log files, existing works are either outdated or focus on network traffic classification ( Ewards et al., 2013 ), web clustering ( Carpineto et al., 2009 ), or user profiling ( Facca and Lanzi, 2005; Vakali et al., 2004 ). Other surveys address only log parsers rather than clustering ( Zhu et al., 2019 ).
In this paper we therefore create a survey of current and estab-
lished strategies for log clustering found in scientific literature. This
survey is oriented towards the identification of overall trends and
highlights the contrasts between existing approaches. This sup-
ports analysts in selecting methods that fit the requirements im-
posed by their systems. In addition, with this paper we aim at the
generation of a work of reference that is helpful for all authors
planning to publish in this field. Overall, the research questions we
address with this paper are as follows:
- What are essential properties of existing log clustering algorithms?
- How are these algorithms applied in cyber security?
- On what kind of data do these algorithms operate?
- How were these algorithms evaluated?
The remainder of the paper is structured as follows.
Section 2 outlines the problem of clustering log data and dis-
cusses how log analysis is used in the cyber security domain.
In Section 3 , we explain our method of carrying out the liter-
ature study. The results of the survey are stated and discussed
in Section 4 . We then propose a decision model for selecting
appropriate clustering approaches and demonstrate it based on
the survey results in Section 5 . Finally, Section 6 concludes the
paper.
2. Survey Background
Log data exhibits certain characteristics that have to be taken
into account when designing a clustering algorithm. In the follow-
ing, we therefore discuss important properties of log data, outline
the reasons why log data is suitable to be clustered and look into
application scenarios relevant to cyber security.
2.1. The nature of log data
Despite the fact that log data exists in various forms, some gen-
eral assumptions on their compositions can be made. First, a log
file typically consists of a set of single- or multi-line strings listed
in inherent chronological order. This chronological order is usu-
ally underpinned by a time stamp attached to the log messages.
1 The order and time stamps of messages do not necessarily have to correctly represent the actual generation of log lines due to technological restrictions appearing during log collection, e.g., delays caused by buffering or issues with time synchronization. A thorough investigation of any adverse consequences evoked by such effects is considered out of scope for this paper.
The messages may be highly structured (e.g., a list of comma-
separated values), partially structured (e.g., attribute-value pairs),
unstructured (e.g., free text of arbitrary length) or a combination
thereof. In addition, log messages sometimes include process IDs
(PIDs) that relate to the task (also referred to as thread or case)
that generated them. If this is the case, it is simple to extract log
traces, i.e., sequences of related log lines, and perform workflow
and process mining ( Nandi et al., 2016 ). Other artifacts sometimes
included in log messages are line numbers, an indicator for the
level or severity of the message (TRACE, DEBUG, INFO, WARN, ER-
ROR, FATAL, ALL, or OFF), and a static identifier referencing the
statement printing the message ( Bao et al., 2018 ).
Arguably, log files are fairly different from documents written
in natural language. This is not necessarily the case because the
log messages themselves are different from natural language (since
they are supposed to be human-readable), but rather because of
two reasons: (i) Similar messages repeat over and over. This is
caused by the fact that events are recurring since procedures are
usually executed in loops and the majority of the log lines are gen-
erated by a limited set of print statements, i.e., predefined func-
tions in the code that write formatted strings to some output. (ii)
The appearances of some messages are highly correlated. This is
due to the fact that programs usually follow certain control flows
and components that generate log lines are linked with each other.
For example, two consecutive print statements will always produce
perfectly correlated log messages during normal system behavior
since the execution of the first statement will always be followed
by the execution of the second statement. In practice, it is diffi-
cult to derive such correlations since they often depend on exter-
nal events and are the result of states and conditions.
These properties allow system logs to be clustered in two differ-
ent ways. First, clustering individual log lines by the similarity of
their messages yields an overview of all events that occur in the
system. Second, clustering sequences of log messages gives insight
into the underlying program logic and uncovers otherwise hidden
dependencies of events and components.
2.2. Static clustering
We consider clustering individual log lines as a static proce-
dure, because the order and dependencies between lines is usu-
ally neglected. After such static line-based clustering, the resulting
set of clusters should ideally resemble the set of all log-generating
print statements, where each log line should be allocated to the
cluster representing the statement it was generated by. Examining
these statements in more detail shows that they usually comprise
static strings that are identical in all messages produced by that
statement and variable parts that are dynamically replaced at run
time. Thereby, variable parts are frequently numeric values, iden-
tifiers (e.g., IDs, names, or IP addresses), or categorical attributes.
Note that the generation of logs using mostly fixed statements is
responsible for a skewed word distribution in log files, where few
words from the static parts appear very frequently while the ma-
jority of words appears very infrequently or even just once ( Ning
et al., 2014; Vaarandi, 2003 ).
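This skewed word distribution can be exploited directly: tokens that occur in every line of a group of messages likely stem from the static parts of print statements, while infrequent tokens are likely variable parts. A minimal sketch of this idea in Python (the log lines and the "every line" threshold are illustrative, not taken from any specific surveyed algorithm):

```python
from collections import Counter

# Illustrative log lines in the style of Fig. 1 (not actual survey data).
logs = [
    "User Alice logs in with status 1",
    "User Bob logs in with status 1",
    "User Charlie logs in with status -1",
    "User Alice logs out with status 1",
    "User Bob logs out with status 1",
]

# Count how often each token occurs across all lines.
counts = Counter(token for line in logs for token in line.split())

# Tokens present in every line likely stem from static parts of the
# print statements; infrequent tokens are likely variable parts.
static = {token for token, count in counts.items() if count == len(logs)}
print(sorted(static))  # → ['User', 'logs', 'status', 'with']
```

Frequency-based approaches in the spirit of Vaarandi (2003) build on exactly this observation, though with more refined support thresholds.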
In the following, we demonstrate issues in clustering with the
sample log lines shown in Fig. 1 . In the example, log messages de-
scribe users logging in and out. Given this short log file, a human
would most probably assume that the two statements print(“User ” + name + “ logs in with status ” + status) and print(“User ” + name + “ logs out with status ” + status) generated the lines, and thus allocate lines {1, 2, 4} to the former and lines {3, 5} to the latter cluster. From this clustering, the templates (also referred to as signatures, patterns, or events) “User ∗ logs in with status ∗” and “User ∗ logs out with status ∗” can be derived, where the Kleene star ∗ denotes a wildcard accepting any word at that position.
Fig. 1. Sample log messages for static analysis.
Besides the high resemblance to the original statements, the wildcards appear to be reasonably placed since all other users logging in or out with any status will be correctly allocated, e.g., “User Dave logs in with status 0”.
Other than humans, algorithms lack semantic understanding of
the log messages and might just as well group the lines according
to the user name, i.e., create clusters {1, 3}, {2, 5}, and {4}, or ac-
cording to a state variable, i.e., create clusters {1, 2, 3, 5} and {4}.
In the latter case, the most specific templates corresponding to the clusters are “User ∗ logs ∗ with status 1” and “User Charlie logs in with status -1”. In most scenarios, the quality of these templates is
considered to be poor, since the second wildcard of the first tem-
plate is an over-generalization of a categorical attribute and the
second template is overly specific. Accordingly, newly arriving log
lines would be likely to form outliers, i.e., not match any cluster
template.
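The template derivation sketched above can be illustrated as a position-wise comparison of tokens within a cluster. This toy version assumes equal-length messages (as in the Fig. 1 example) and is not the procedure of any particular surveyed parser:

```python
def derive_template(lines):
    """Derive a template by replacing positions with differing tokens
    by the wildcard '*'. Assumes all lines of the cluster have the
    same number of tokens; real parsers must also handle messages
    of varying length, e.g., via sequence alignment."""
    token_lists = [line.split() for line in lines]
    template = [tokens[0] if len(set(tokens)) == 1 else "*"
                for tokens in zip(*token_lists)]
    return " ".join(template)

# The log-in cluster {1, 2, 4} from the example (illustrative content).
cluster = [
    "User Alice logs in with status 1",
    "User Bob logs in with status 1",
    "User Charlie logs in with status -1",
]
print(derive_template(cluster))  # → User * logs in with status *
```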
With this example in mind we want to point out that there al-
ways exist a multitude of different possible valid clusterings and
judging the quality of the clusters is eventually a subjective deci-
sion that is largely application-specific. For example, investigations
regarding user-behavior may require that all log lines generated by
a specific user end up in the same cluster. In any case, appropriate
cluster quality is highly important since clusters are often the ba-
sis for further analyses that operate on top of the grouped data and
extracted templates. The next section explores dynamic clustering
as such an application that utilizes static cluster allocations.
2.3. Dynamic clustering
As pointed out earlier, log files are suited for dynamic cluster-
ing, i.e., allocation of sequences of log line appearances to patterns.
However, raw log lines are usually not suited for such sequential
pattern recognition, due to the fact that each log line is a uniquely
occurring instance describing a part of the system state at a par-
ticular point in time. Since pattern recognition relies on repeating
behavior, the log lines first have to be allocated to classes that refer
to their originating event. This task is enabled by static clustering
as outlined in the previous section.
In the following, we consider the sample log file shown in Fig. 2
that contains three users logging into the system, performing some
action, and logging out. We assume that these steps are always
carried out in this sequence, i.e., it is not possible to perform an
action or log out without first being logged in.
Fig. 2. Sample log messages and their event allocations for dynamic analysis.
Fig. 3. Sample log events visualized on a timeline.
We assume that the sample log file has been analyzed by a
static clustering algorithm to generate the three templates A = “User ∗ logs in with status ∗”, B = “User ∗ performs action ∗”, and C = “User ∗ logs out with status ∗”. It is then possible to assign each line one
of the events as indicated on the right side of the figure. In such a
setting, the result of a dynamic clustering algorithm could be the
extracted sequence A, B, C since this pattern describes normal user
behavior. However, the events in lines 6 and 7 are switched, thus
interrupting the pattern. Fig. 3 shows that the reason for this is-
sue is caused by interleaved user behavior, i.e., user Charlie logs in
before user Bob logs out.
Since many applications are running in parallel in real sys-
tems, interleaved processes commonly occur in log files
and thus complicate the pattern extraction process. As mentioned
in Section 2.1 , some log files include process IDs that allow analyzing the corresponding logs in isolation from interrupting processes
and thus resolve this issue. In the simple example from Fig. 2 , the
username could have been used for this purpose. In addition to
interleaved event sequences, real systems obviously involve much
more complex patterns, including arbitrarily repeating, optional, al-
ternative, or nested subpatterns.
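Using the username as a process identifier, the de-interleaving described above can be sketched by grouping event-annotated lines per user; the event labels follow the templates A, B, C, while the concrete data is illustrative:

```python
from collections import defaultdict

# Event-annotated lines in the spirit of Fig. 2: (user, event), where
# A = log in, B = perform action, C = log out (data is illustrative).
events = [
    ("Alice", "A"), ("Alice", "B"), ("Alice", "C"),
    ("Bob", "A"), ("Bob", "B"),
    ("Charlie", "A"),  # Charlie logs in before Bob logs out, so the
    ("Bob", "C"),      # global stream interleaves the two sequences.
    ("Charlie", "B"), ("Charlie", "C"),
]

# Grouping by the username (acting as a process ID) restores an
# uninterrupted A, B, C trace for every user.
traces = defaultdict(list)
for user, event in events:
    traces[user].append(event)

for user, trace in traces.items():
    print(user, "".join(trace))  # every user yields the pattern ABC
```

When no such identifier is available, dynamic clustering algorithms must infer which lines belong together, which is considerably harder.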
While sequence mining is common, it is not the only dy-
namic clustering technique. In particular, similar groups of log lines
can be formed by aggregating them in time-windows and analyz-
ing their frequencies, co-occurrences, or correlations. For exam-
ple, clustering could aim at generating groups of log lines that
frequently occur together. Note that in this setting, the ordering
of events is not relevant, but only their occurrence within a cer-
tain time interval. The next section outlines several applications of
static and dynamic clustering for system security.
2.4. Applications in the security domain
Due to the fact that log files contain permanent documentation
of almost all events that take place in a system, they are frequently
used by analysts to investigate unexpected or faulty system behav-
ior in order to find its origin. In some cases, the strange behav-
ior is caused by system intrusions, cyber attacks, malware, or any
other adversarial processes. Since such attacks often lead to high
costs for affected organizations, timely detection and clarification
of consequences is of particular importance.
Independent from whether anomalous log manifestations are
caused by randomly occurring failures or targeted adversarial ac-
tivity, their detection is of great help for administrators and may
prevent or reduce costs. Clustering is able to largely reduce the ef-
fort required to manually analyze log files, for example, by pro-
viding summaries of log file contents, and even provides function-
alities to automatize detection of anomalous behavior. In the fol-
lowing, we outline some of the most relevant types of anomalies
detectable or supported by clustering.
Outliers are single log lines that do not match any of the exist-
ing templates or are dissimilar to all identified clusters that are
known to represent normal system behavior. Outliers are often
new events that have not occurred during clustering or contain
highly dissimilar parameters in the log messages. An example
could be an error log message in a log file that usually only
contains informational and debugging messages.
Frequency anomalies are log events that appear unexpectedly
frequent or rare during a given time interval. This may include
cases where components stop logging, or detection of attacks
that involve the execution of many events, e.g., vulnerability
scans.
Correlation anomalies are log events that are expected to oc-
cur in pairs or groups but fail to do so. This may include simple
co-occurrence anomalies, i.e., two or more events that are ex-
pected to occur together, and implication anomalies, where one
or more events imply that some other event or events have
to occur, but not the other way round. For example, a web
server that logs an incoming connection should imply that cor-
responding log lines on the firewall have occurred earlier.
Inter-arrival time anomalies are caused by deviating time in-
tervals between occurrences of log events. They are related to
correlation anomalies and may provide additional detection ca-
pabilities, e.g., an implied event is expected to occur within a
certain time window.
Sequence anomalies are caused by missing or additional log
events as well as deviating orders in sequences of log events
that are expected to occur in certain patterns.
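A minimal sketch of template-based outlier detection as described for the first anomaly type, with illustrative templates in the wildcard notation of Section 2.2 (not the matching procedure of any particular surveyed approach):

```python
import re

# Illustrative templates from static clustering; '*' stands for
# one arbitrary token.
templates = [
    "User * logs in with status *",
    "User * logs out with status *",
]

def to_regex(template):
    # Escape the static parts and turn each wildcard into "one token".
    parts = [re.escape(part) for part in template.split("*")]
    return re.compile("^" + r"\S+".join(parts) + "$")

patterns = [to_regex(t) for t in templates]

for line in ["User Dave logs in with status 0",
             "Error: database connection lost"]:
    outlier = not any(p.match(line) for p in patterns)
    print(line, "->", "outlier" if outlier else "matched")
```

Here the unexpected error message matches no template and is reported as an outlier, mirroring the example of an error line in a file of informational messages.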
Outliers are based on single log line occurrences and are thus
the only type of anomalies detectable by static cluster algorithms.
All other types of anomalies require dynamic clustering techniques.
In addition, anomalies do not necessarily have to be detected using
strict rules that report every single violation. For example, event
correlations that are expected to occur only in 90% of all cases may
be analyzed with appropriate statistical tests.
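As a simple illustration of such non-strict, statistical detection, a frequency anomaly could be flagged by comparing the event count of the newest time window against a baseline of historic windows; the counts and the three-sigma threshold below are hypothetical:

```python
import statistics

# Hypothetical counts of one log event per one-minute time window;
# the last window contains a sudden burst (e.g., a vulnerability scan).
counts = [48, 52, 50, 49, 51, 47, 50, 53, 49, 180]

baseline = counts[:-1]
mean = statistics.mean(baseline)
std = statistics.stdev(baseline)

# Flag the newest window if its count deviates strongly from the
# historic baseline; three sigma is a common heuristic threshold.
z = (counts[-1] - mean) / std
print("frequency anomaly" if abs(z) > 3 else "normal")  # → frequency anomaly
```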
3. Survey method
In this section we describe our approach to gather and analyze
the existing literature.
3.1. Set of criteria
In order to carry out the literature survey on log clustering ap-
proaches in a structured way, we initially created a set of evalua-
tion criteria that addresses relevant aspects of the research ques-
tions in more detail. The first block of questions in the set of cri-
teria covers purpose, applicability, and usability of the proposed
solutions:
P-1 What is the purpose of the introduced approach?
P-2 Does the method have a broad applicability or are there
constraints, such as requirements for specific logging stan-
dards?
P-3 Is the algorithm a commercial product or has been de-
ployed in industry?
P-4 Is the code of the algorithm publicly accessible?
The next group of questions focuses on the properties of the
introduced clustering algorithms:
C-1a What type of technique is applied for static clustering?
C-1b What type of technique is applied for dynamic clustering?
C-2 Is the algorithm fully unsupervised as opposed to algo-
rithms requiring detailed knowledge about the log structures
or labeled log data for training?
C-3 Is the clustering character-based?
C-4 Is the clustering word- or token-based?
C-5 Are log signatures or templates generated?
C-6 Does the clustering algorithm take dynamic features of log
lines (e.g., sequences) into account?
C-7 Does the algorithm generate new clusters online, i.e., in a
streaming manner, as opposed to approaches that allocate
log lines to a fixed set of clusters generated in a training
phase?
C-8 Is the clustering adaptive to system changes, i.e., are exist-
ing clusters adjusted over time rather than static constructs?
C-9 Is the algorithm designed for fast data processing?
C-10 Is the algorithm designed for parallel execution?
C-11 Is the algorithm deterministic?
Since we were aware that a large number of approaches aim at
anomaly detection, we dedicated the following set of questions to
this topic:
AD-1 Is the approach designed for the detection of outliers, i.e.,
static anomalies?
AD-2 Is the approach designed for the detection of dynamic
anomalies?
AD-3 Is the approach designed for the detection of cyber at-
tacks?
Finally, we defined questions that assess whether and how the
approaches were evaluated in the respective articles:
E-1 Did the evaluation include quantitative measures, e.g., ac-
curacy or true positive rates?
E-2 Did the evaluation involve qualitative reviews, e.g., expert
reviews or discussions of cluster quality?
E-3 Was the algorithm evaluated regarding its time complexity,
i.e., running time and scalability?
E-4 Was at least one existing algorithm used as a benchmark
for validating the introduced approach?
E-5 Was real log data used as opposed to synthetically gener-
ated log data?
E-6 Is the log data used for evaluation publicly available?
The set of evaluation criteria was then completed for every rel-
evant approach. The process of retrieving these articles is outlined
in the following section.
3.2. Literature search
The search for relevant literature was carried out in November
2019. For this, three research databases were used: (i) ACM Digital
Library,
2 a digital library containing more than 500,000 full-text
articles on computing and information technology, (ii) IEEE Xplore
Digital Library,
3 a platform that enables the discovery of scientific
articles within more than 4.5 million documents published in the
fields computer science, electrical engineering and electronics, and
(iii) Google Scholar, 4 a web search engine for all kinds of academic
literature.
The keywords used for searching on these platforms were “log
clustering” (29,383 results on ACM, 2,210 on IEEE, 3,050,000 on Google), “log event mining” (54,833 results on ACM, 621 on IEEE, 1,240,000 on Google), “log data anomaly detection” (207,821 results on ACM, 377 on IEEE, 359,000 on Google). We did not make
any restrictions regarding the date of publication. The titles and
abstracts of the first 300 articles retrieved for each query were ex-
amined and potentially relevant documents were stored for thor-
ough inspection. It should be noted that a rather large number of false positives was retrieved and immediately dismissed. The rea-
son why such unrelated articles appeared is that the keywords in
the queries were sometimes misinterpreted by the engines, e.g., re-
sults related to “logarithm” showed up when searching for “log”.
After removing duplicates, this search yielded 207 potentially rele-
vant articles.
2 https://dl.acm.org/results.cfm .
3 https://ieeexplore.ieee.org/Xplore/home.jsp .
4 https://scholar.google.at/ .
During closer inspection, several of these articles were dis-
carded. The majority of these dismissed approaches focused on
clustering numeric features extracted from highly structured net-
work traffic logs rather than clustering the raw string messages
themselves. This is a broad field of research and there exist numer-
ous papers that apply well-known machine learning techniques for
analyzing, grouping, and classifying the parsed data ( Portnoy et al.,
2001 ). Many other approaches are directed towards process min-
ing from event logs ( Van der Aalst et al., 2004 ), which is an exten-
sive topic considered out of scope for our survey since it relies on
log traces rather than simple log data. Furthermore, we discarded
papers that introduce approaches for analysis and information ex-
traction from log data, but are not fitted for clustering log lines,
such as terminology extraction ( Saneifar et al., 2009 ) and com-
pression ( Balakrishnan and Sahoo, 2006 ). We also dismissed ap-
proaches for clustering search engine query logs ( Beeferman and
Berger, 2000 ) since they are designed to process keywords writ-
ten by users rather than log lines generated by programs as out-
lined in Section 2.1 . Articles on protocol reverse engineering are
discarded, because they are not primarily designed for process-
ing system log lines and surveys on this topic already exist, e.g.,
Narayan et al. (2016) . Finally, we excluded articles that do not
propose a new clustering approach, but apply existing algorithms
without modifications on different data or perform comparisons
(e.g., Makanju et al., 2009a ) as well as surveys. This also includes
articles that propose algorithms for subsequent analyses such as
anomaly detection, alert clustering, or process model mining, that
operate on already clustered log data, but do not apply any log
clustering techniques themselves.
After this stage, 50 articles remained. A snowball search was
conducted with these articles, i.e., articles referenced in the rele-
vant papers as well as articles referencing these papers were in-
dividually retrieved. These articles were examined analogously and
added if they were considered relevant. Eventually, we obtained 59
articles and 2 tools that were analyzed with respect to the afore-
mentioned characteristics stated in the set of evaluation criteria.
We used these criteria to group articles with respect to different
features and discover interesting patterns. The following section
discusses the findings.
4. Survey results
We arranged the articles into groups according to the properties
ascertained in the set of evaluation criteria. We thereby derived
common features that could be found in several articles as well
as interesting concepts and ideas that stood out from the over-
all strategies. In the following, we discuss these insights for every
group of questions.
4.1. Purpose and applicability (P)
Four main categories of overall design goals (P-1) were identi-
fied during the review process:
Overview & filtering . Log data is usually high-volume data that
is tedious to search and analyze manually. Therefore, it is
reasonable to reduce the total number of log messages pre-
sented to system administrators by removing log events that
are frequently repeating without contributing new or any other
valuable information. Clustering is able to provide such com-
pact representations of complex log files by filtering out most
logs that belong to certain (large) clusters, thus only leaving
logs that occur rarely or do not fit into any clusters to be
shown to administrators ( Jain et al., 2009; Reidemeister et al.,
2011 ).
Parsing & signature extraction . These approaches aim at the au-
tomatic generation of log event templates (cf. Section 2.1 ) for
parsing log lines. Parsers enable the allocation of log lines to
particular system events, i.e., log line classification, and the
structured extraction of parameters. These are important fea-
tures for subsequent analyses, such as clustering of event se-
quences or anomaly detection ( He et al., 2017b; Wurzenberger
et al., 2019 ).
Outlier detection . System failures, cyber attacks, or other adverse
system behavior generates log lines that differ from log lines
representing normal behavior regarding their syntax or param-
eter values. It is therefore reasonable to disclose single log lines
that do not fit into the overall picture of the log file. During
clustering, these log lines are identified as lines that have a
high dissimilarity to all existing clusters or do not match any
signatures ( Juvonen et al., 2015; Wurzenberger et al., 2017b ).
Sequences & dynamic anomaly detection . Not all adverse system
behavior manifests itself as individual anomalous log lines, but
rather as dynamic or sequence anomalies (cf. Section 2.4 ). Thus,
approaches that group sequences of log lines or disclose tempo-
ral patterns such as frequent co-occurrence or correlations are
required. Dynamic clustering usually relies on line-based event
classification as an initial step and often has to deal with in-
terleaving processes that cause interrupted sequences ( Aharon
et al., 2009; Jia et al., 2017 ).
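As a simple illustration of the filtering objective, the following sketch keeps only log lines that belong to small (rare) clusters; the cluster_of function is a hypothetical stand-in for any of the clustering techniques surveyed below, and the size threshold is illustrative:

```python
from collections import Counter

def filter_rare_lines(lines, cluster_of, min_cluster_size=5):
    """Keep only log lines whose cluster is small, i.e., hide the
    bulk of routine messages from the administrator's view.
    `cluster_of` maps each line to a cluster label."""
    labels = [cluster_of(line) for line in lines]
    sizes = Counter(labels)
    return [line for line, label in zip(lines, labels)
            if sizes[label] < min_cluster_size]
```

For example, with a trivial clustering by first token, five repetitions of a routine message would be suppressed while a single unusual line is retained.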
Table 1 shows the determined classes for each reviewed ap-
proach. Note that this classification is not mutually exclusive, i.e.,
an approach may pursue multiple goals at the same time. For ex-
ample, He et al. (2017b) introduce an approach for the extraction
of log signatures and then perform anomaly detection on the re-
trieved events.
Table 1
Overview of main goals of reviewed approaches. Categorizations are not mutually exclusive.

Purpose of approach: Approaches

Overview & filtering: Aharon et al. (2009), Aussel et al. (2018), Christensen and Li (2013), Jiang et al. (2008), Joshi et al. (2014), Li et al. (2017, 2005), Reidemeister et al. (2011), Gainaru et al. (2011), Gurumdimma et al. (2015), Hamooni et al. (2016), Jain et al. (2009), Jayathilake et al. (2017), Leichtnam et al. (2017), Makanju et al. (2009b), Nandi et al. (2016), Ning et al. (2014), Qiu et al. (2010), Ren et al. (2018), Salfner and Tschirpke (2008), Schipper et al. (2019), Carasso (2012), Taerat et al. (2011), Xu et al. (2009), Zou et al. (2016)

Parsing & signature extraction: Agrawal et al. (2019), Chuah et al. (2010), Du and Li (2016), Fu et al. (2009), Gainaru et al. (2011), Hamooni et al. (2016), He et al. (2017a,b), Jayathilake et al. (2017), Kimura et al. (2014), Kobayashi et al. (2014), Li et al. (2017), Li et al. (2018), Makanju et al. (2009b), Messaoudi et al. (2018), Menkovski and Petkovic (2017), Mizutani (2013), Nagappan and Vouk (2010), Nandi et al. (2016), Ning et al. (2014), Qiu et al. (2010), Zhen (2014), Shima (2016), Taerat et al. (2011), Tang and Li (2010), Tang et al. (2011), Thaler et al. (2017), Tovarňák and Pitner (2019), Vaarandi (2003, 2004), Vaarandi and Pihelgas (2015), Wurzenberger et al. (2019), Zhang et al. (2017), Zhao and Xiao (2016), Zulkernine et al. (2013)

Outlier detection: Juvonen et al. (2015), Leichtnam et al. (2017), Splunk (Carasso, 2012), Wurzenberger et al. (2017a,b)

Sequences & dynamic anomaly detection: Aharon et al. (2009), Chuah et al. (2010), Du et al. (2017), Du and Cao (2015), Fu et al. (2009), Gurumdimma et al. (2015), He et al. (2017a,b), Jia et al. (2017), Kimura et al. (2014), Li et al. (2018), Lin et al. (2016), Nandi et al. (2016), Salfner and Tschirpke (2008), Splunk (Carasso, 2012), Stearley (2004), Vaarandi (2004), Wang et al. (2018), Xu et al. (2009), Zhang et al. (2017), Zhang et al. (2019), Zou et al. (2016)
As expected, most approaches aim at broad applicability and
do not make any specific assumptions on the input data (P-2).
Although some authors particularly design and evaluate their ap-
proaches in the context of a specific type of log protocol (e.g.,
router syslogs ( Qiu et al., 2010 )), their proposed algorithms are also
suitable for any other logging standard. Only a few approaches re-
quire artifacts specific to some protocol (e.g., Modbus ( Wang et al.,
2018 )) for similarity computation or prevent general applicability
by relying on labeled data ( Reidemeister et al., 2011 ) or category
labels (e.g., start, stop, dependency, create, connection, report, re-
quest, configuration, and other ( Li et al., 2005 )) for model train-
ing, log level information ( Du and Cao, 2015 ) for an improved log
similarity computation during clustering, or process IDs for link-
ing events to sequences ( Lin et al., 2016 ). Other approaches impose
constraints such as the requirement of manually defined parsers
( Tang and Li, 2010 ) or access to binary/source code of the log gener-
ating system in order to parse logs using the respective print state-
ments ( Schipper et al., 2019; Xu et al., 2009; Zhang et al., 2017 ).
We mentioned in Section 3 that we included two approaches
from non-academic literature: Splunk ( Carasso, 2012 ) and Se-
quence ( Zhen, 2014 ). Splunk is a commercial product (P-3) that of-
fers features beyond log clustering and is deployed in numer-
ous organizations. However, authors of scientific papers also
share success stories about real-world applications in their works,
e.g., Lin et al. (2016) describe feedback and results following the
implementation of their approach in a large-scale environment and
Li et al. (2017) evaluate their approach within a case-study carried
out in cooperation with an international company. We appreciate
information about such deployments in real-world scenarios, be-
cause they validate that the algorithms are meeting the require-
ments for practical application. Finally, we could only find the orig-
inal source code of a subset of the approaches ( He et al., 2017a; 2017b;
Makanju et al., 2009b; Messaoudi et al., 2018; Shima, 2016; Thaler
et al., 2017; Vaarandi, 2003; 2004; Vaarandi and Pihelgas, 2015;
Xu et al., 2009; Zhao and Xiao, 2016; Zhen, 2014 ) online (P-4).
In addition, several reimple-
mentations of algorithms provided by other authors exist. We en-
courage authors to make their code available open-source in order
to enable reproducibility.
4.2. Clustering techniques (C)
In the following, we explore different types of applied cluster-
ing techniques with respect to their purpose, their applicability in
live systems, and non-functional requirements.
4.2.1. Types of static clustering techniques
One of the most interesting findings of this research study
turned out to be the large diversity of proposed clustering tech-
niques (C-1a, C-1b). Considering static clustering approaches, a ma-
jority of the approaches employ a distance metric that determines
the similarity or dissimilarity of two or more strings. Based on
the resulting scores, similar log lines are placed in the same clus-
ters, while dissimilar lines end up in different clusters. The cal-
culation of the distance metric may thereby be character-based,
token-based or a combination of both strategies (C-3, C-4). While
token-based approaches assume that the log lines can reasonably
be split by a set of predefined delimiters (most frequently, only
white space is used as a delimiter), character-based approaches
are typically more flexible, but also computationally more ex-
pensive. For example, Juvonen et al. (2015) and Christensen and
Li (2013) compute the amount of common n-grams between two
lines in order to determine their similarity. Du and Cao (2015) ,
Ren et al. (2018) , Salfner and Tschirpke (2008) , and Wurzenberger
et al. (2017a,b) use the Levenshtein metric to compute the
similarity between two lines by counting the character inser-
tions, deletions and replacements needed to transform one string
into the other. Taerat et al. (2011) , Gurumdimma et al. (2015) ,
Jain et al. (2009) , Zou et al. (2016) , and Fu et al. (2009) employ a
similar metric based on the words of a line rather than its char-
acters. Another simple token-based approach for computing the
similarity between two log lines is by summing up the amount
of matching words at each position. In mathematical terms, this
similarity between log lines a and b with their respective tokens
a
1
, a
2
, ..., a
n and b
1
, b
2
, ..., b
m is computed by
min (n,m )
i =1
I (a
i
, b
i
) ,
where I (a
i
, b
i
) is 1 if a
i
is equal to b
i
and 0 otherwise. This met-
ric is frequently normalized ( Aharon et al., 2009; He et al., 2017b;
Li et al., 2018; Mizutani, 2013; Ning et al., 2014 ) and weighted
( Hamooni et al., 2016; Tang and Li, 2010 ). Joshi et al. (2014) use bit
patterns of tokens to achieve a similar result. Li et al. (2017) com-
pute the similarity between log lines after transforming them into
a tree-like structure. Du and Cao (2015) also consider the log
level (e.g., INFO, WARN, ERROR) relevant for clustering and point
out that log lines generated on a different level should not be
grouped together. Finally, token vectors that emphasize the occur-
rence counts of words rather than their positions (i.e., the well-
known bag of words model) may be used to compute the cosine
similarity ( Carasso, 2012; Lin et al., 2016; Shima, 2016 ) or apply
k-means clustering ( Aussel et al., 2018 ).
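The positional token matching formula above can be sketched in a few lines of Python. Normalizing by the token count of the longer line is one possible choice; the cited approaches differ in their exact normalization and weighting schemes:

```python
def token_match_similarity(a: str, b: str, delimiter: str = " ") -> float:
    """Normalized token-based similarity: count positions at which
    the tokens of both lines match, divided by the token count of
    the longer line (one common normalization)."""
    tokens_a = a.split(delimiter)
    tokens_b = b.split(delimiter)
    # zip stops at min(n, m), matching the sum's upper bound
    matches = sum(1 for x, y in zip(tokens_a, tokens_b) if x == y)
    return matches / max(len(tokens_a), len(tokens_b))

# Two log lines differing only in a parameter value score high:
s = token_match_similarity(
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
)
# s == 0.75 (3 of 4 token positions match)
```

A weighted variant would replace the indicator with per-position weights, e.g., to emphasize tokens near the beginning of a line.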
Not all approaches employ distance or similarity metrics. SLCT
( Vaarandi, 2003 ) is one of the earliest published approaches for log
clustering. The idea behind the concept of SLCT is that frequent to-
kens (i.e., tokens that occur more often than a user-defined thresh-
old) represent fixed elements of log templates, while infrequent to-
kens represent variables. Despite being highly efficient, one of the
downsides of SLCT is that clustering requires three passes over the
data: The first pass over all log lines retrieves the frequent tokens,
the second pass generates cluster templates by identifying these
frequent tokens in each line and filling the gaps with wildcards,
and the third pass reports cluster templates that represent suffi-
ciently many log lines. Allocating the log lines to clusters is accom-
plished during the second pass, where each log line is assigned to
an already existing or newly generated template.
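The three passes can be illustrated with the following simplified sketch; it conveys the frequent-token idea only and is not Vaarandi's exact implementation, which counts (word, position) pairs with efficient hashing:

```python
from collections import Counter

def slct_like_templates(lines, support=2):
    """Simplified sketch of SLCT-style clustering (cf. Vaarandi, 2003).
    Pass 1: count how often each (position, token) pair occurs.
    Pass 2: build a template per line by keeping frequent tokens
            and replacing infrequent ones with a '*' wildcard.
    Pass 3: report templates matched by at least `support` lines."""
    # Pass 1: frequent (position, token) pairs
    counts = Counter()
    for line in lines:
        for pos, tok in enumerate(line.split()):
            counts[(pos, tok)] += 1
    # Pass 2: assign each line to an existing or new template
    template_counts = Counter()
    for line in lines:
        template = tuple(
            tok if counts[(pos, tok)] >= support else "*"
            for pos, tok in enumerate(line.split())
        )
        template_counts[template] += 1
    # Pass 3: keep templates covering sufficiently many lines
    return {" ".join(t): n for t, n in template_counts.items() if n >= support}

logs = [
    "sshd login failed for user alice",
    "sshd login failed for user bob",
    "kernel out of memory",
]
print(slct_like_templates(logs, support=2))
# {'sshd login failed for user *': 2}
```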
Density-based clustering appears to be a natural strategy for
generating trees ( Qiu et al., 2010; Tovarňák and Pitner, 2019;
Wurzenberger et al., 2019; Zhao and Xiao, 2016 ), i.e., data struc-
tures that represent the syntax of log data as sequences of nodes
that branch into subsequences to describe different log events.
Thereby, nodes represent fixed or variable tokens and may even
differentiate between data types, e.g., numeric values or IP ad-
dresses. The reason why all of the reviewed approaches leveraging
trees use density-based techniques is likely attributable to the way
trees are built: Log messages are processed token-wise from their
beginning to their end; identical tokens in all lines are frequent to-
kens that result in fixed nodes, tokens with highly diverse values
are infrequent and result in variable nodes, and cases in between
result in branches of the tree.
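The token-wise construction outlined above can be sketched as follows; the branching threshold, the '<*>' node label, and the two-pass structure are illustrative simplifications and do not correspond to any single surveyed approach:

```python
from collections import defaultdict

def build_parser_tree(lines, max_branches=2):
    """Sketch of density-based parser-tree construction: tokens are
    inserted position-wise; once a prefix would branch into more
    than `max_branches` distinct tokens, the values are considered
    highly diverse and merged into a single variable node '<*>'."""
    # Pass 1: distinct tokens observed after each token prefix
    successors = defaultdict(set)
    for line in lines:
        prefix = ()
        for tok in line.split():
            successors[prefix].add(tok)
            prefix += (tok,)

    # Pass 2: build the tree, collapsing high-diversity positions
    def build(prefix):
        toks = successors.get(prefix, set())
        if not toks:
            return {}
        if len(toks) > max_branches:
            merged = {}
            for tok in toks:
                # Shallow merge of subtrees for brevity
                merged.update(build(prefix + (tok,)))
            return {"<*>": merged}
        return {tok: build(prefix + (tok,)) for tok in sorted(toks)}

    return build(())

tree = build_parser_tree([
    "user alice logged in",
    "user bob logged in",
    "user carol logged in",
])
# {'user': {'<*>': {'logged': {'in': {}}}}}
```

The fixed tokens "user", "logged", and "in" become fixed nodes, while the diverse user names collapse into a variable node.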
4.2.2. Types of dynamic clustering techniques
Several approaches pursue the clustering of log sequences
rather than only grouping single log lines (C-6). Thereby, pro-
cess IDs that uniquely identify related log lines may be ex-
ploited to retrieve the sequences ( Lin et al., 2016 ). For exam-
ple, Fu et al. (2009) use these IDs to build a finite state au-
tomaton describing the execution behavior of the monitored sys-
tem. However, logs that do not contain such process IDs require
mechanisms for detecting relations between identified log events.
Du and Cao (2015) and Gurumdimma et al. (2015) first cluster
similar log lines, then generate sequences by grouping events oc-
curring in time windows and finally cluster the identified se-
quences in order to derive behavior patterns. Similarly, Salfner and
Tschirpke (2008) group generated events that occur within a
predefined inter-arrival time and cluster the sequences with a
hidden semi-Markov model. Qiu et al. (2010) also measure the
inter-arrival time of log lines for clustering periodically occurring
events and additionally group the events by derived correlation
rules. Kimura et al. (2014) derive event co-occurrences by factor-
izing a 3-dimensional tensor consisting of the previously iden-
tified templates, hosts and time windows. DeepLog ( Du et al.,
2017 ) extends Spell ( Du and Li, 2016 ) by computing probabil-
ities for transitions between the identified log events in order
to construct a workflow model. Jain et al. (2009) group time-
series derived from cluster appearances in a hierarchical fashion.
LogSed ( Jia et al., 2017 ) and OASIS ( Nandi et al., 2016 ) analyze fre-
quent successors and predecessors of lines for mining a control
flow graph. After first categorizing log messages using probabilis-
tic models ( Li et al., 2005 ) and distance-based strategies ( Li et al.,
2017 ), the authors determine the temporal relationships between
log events by learning the distributions of their lag intervals, i.e.,
time periods between events. Unlike the previous approaches,
Aharon et al. (2009) assume that the order of log lines is mean-
ingless and their algorithm PARIS thus identifies log events that
frequently occur together within certain time windows regardless
of their order.
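A recurring building block in these approaches is the grouping of classified events into time windows, which yields sequences that can then themselves be clustered. The following sketch uses fixed, non-overlapping windows as a simplifying assumption; the surveyed approaches also employ inter-arrival times and sliding windows:

```python
from collections import defaultdict

def window_sequences(events, window=60.0):
    """Group (timestamp, event_id) pairs into fixed-length time
    windows; events in the same window form one sequence."""
    windows = defaultdict(list)
    for ts, event_id in sorted(events):
        windows[int(ts // window)].append(event_id)
    return [tuple(seq) for _, seq in sorted(windows.items())]

events = [(3.0, "A"), (10.5, "B"), (75.0, "A"), (80.1, "C")]
print(window_sequences(events, window=60.0))
# [('A', 'B'), ('A', 'C')]
```

The resulting tuples could be fed into any of the sequence clustering techniques discussed above, e.g., compared via a word-level edit distance.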
We summarize the results in Table 2 . For columns C-1a and C-
1b, we coded distance-based strategies as (1) and density-based
strategies as (2). Note that for static clustering, distances are usu-
ally measured between log lines and densities refer to token fre-
quencies, while for dynamic clustering techniques, distances are
computed between time-series of event occurrences and densities
refer to event frequency counts. Other identified strategies used for
static and dynamic clustering are (3) Neural Networks, which are
useful for signature extraction ( Kobayashi et al., 2014; Menkovski
and Petkovic, 2017; Thaler et al., 2017 ) and event classification
( Ren et al., 2018 ) by Natural Language Processing (NLP) as well as
for detecting sequences in the form of Long Short-Term Memory
(LSTM) recurrent neural networks ( Du et al., 2017; Li et al., 2018;
Zhang et al., 2019 ), (4) iterative partitioning, where groups of log
lines are recursively split into subgroups according to particular to-
ken positions ( Gainaru et al., 2011; He et al., 2017a; Makanju et al.,
2009b ), (5) Longest Common Substring (LCS), which is a measure
for the similarity of log lines ( Agrawal et al., 2019; Du and Li, 2016;
He et al., 2017a; Jayathilake et al., 2017; Reidemeister et al., 2011;
Tang et al., 2011 ) or sequences of log events ( Wang et al., 2018 ), (6)
binary/source code analysis ( Schipper et al., 2019; Xu et al., 2009;
Zhang et al.,