On the Pragmatic Design of Literature Studies in Software Engineering: An Experience-based Guideline

Marco Kuhrmann¹, Daniel Méndez Fernández², Maya Daneva³
¹ University of Southern Denmark, Mærsk Mc-Kinney Møller Institute & SDU Software Engineering, Campusvej 55, DK-5230 Odense M, Denmark
² Technische Universität München, Faculty of Informatics – Software & Systems Engineering, Boltzmannstr. 3, 85748 Garching, Germany
³ University of Twente, Drienerlolaan 5, 7522 AE Enschede, The Netherlands

Corresponding Contact:
E-Mail: Kuhrmann@acm.org
Phone: +45 24 60 14 22

© Springer 2016. Preprint. This is the author's version of the work. The definitive version was accepted (December 12, 2016) in the Empirical Software Engineering journal. Issue assignment pending.
The final version will be available via: http://www.springer.com/computer/swe/journal/10664
Empirical Software Engineering manuscript No.
(will be inserted by the editor)
On the Pragmatic Design of Literature Studies in
Software Engineering: An Experience-based Guideline
Marco Kuhrmann · Daniel Méndez Fernández · Maya Daneva
Received: date / Accepted: date
Abstract Systematic literature studies have received much attention in empir-
ical software engineering in recent years. They have become a powerful tool to
collect and structure reported knowledge in a systematic and reproducible way.
We distinguish systematic literature reviews to systematically analyze reported
evidence in depth, and systematic mapping studies to structure a field of interest
in a broader, usually quantified manner. Due to the rapidly increasing body of
knowledge in software engineering, researchers who want to capture the published
work in a domain often face an extensive amount of publications, which need
to be screened, rated for relevance, classified, and eventually analyzed. Although
there are several guidelines to conduct literature studies, they do not yet help
researchers cope with the specific difficulties encountered in the practical appli-
cation of these guidelines. In this article, we present an experience-based guideline
to aid researchers in designing systematic literature studies with special emphasis
on the data collection and selection procedures. Our guideline aims at providing
a blueprint for a practical and pragmatic path through the plethora of currently
available practices and deliverables capturing the dependencies among the single
steps. The guideline emerges from various mapping studies and literature reviews
M. Kuhrmann
University of Southern Denmark,
Mærsk Mc-Kinney Møller Institute, Section Software Engineering,
Campusvej 55, 5230 Odense M, Denmark
Tel.: +45 2460 1422
E-mail: kuhrmann@acm.org
D. Méndez Fernández
Technical University of Munich,
Institute for Informatics, Software & Systems Engineering
Boltzmannstr. 4, 85748 Garching, Germany
Tel.: +49 89 289 17056
E-mail: daniel.mendez@tum.de
M. Daneva
University of Twente,
Drienerlolaan 5, 7522 AE Enschede, The Netherlands
Tel.: +31 53 4892889
E-mail: m.daneva@utwente.nl
conducted by the authors and provides recommendations for the general study
design, data collection, and study selection procedures. Finally, we share our ex-
periences and lessons learned in applying the different practices of the proposed
guideline.
Keywords Systematic Literature Review · Systematic Mapping Study · Empirical Software Engineering · Guideline Proposal · Lessons Learned
1 Introduction
Systematic literature studies have received much attention in recent years as a
powerful instrument to gather and structure reported knowledge in a systematic
and reproducible way. We distinguish two types of secondary studies:
A Systematic Mapping Study (SMS; Petersen et al. [40]) is a method to build
a classification schema for topics studied in a field of interest. By counting
the number of publications for categories within a schema, the coverage and
maturity of the research field can be determined. Graphical maps showing the
number of publications in the different categories of the schema represent the
study results. Mapping studies usually cover a broader range of publications
as the analysis focuses on the key terms and abstracts of publications.
A Systematic Literature Review (SLR; also: Systematic Review, SR; Kitchenham
et al. [22]) is a means to identify, analyze and interpret reported evidence
related to a set of specific research questions in a way that is unbiased and
(to a degree) repeatable. In contrast to mapping studies, systematic reviews
usually cover a smaller, more specific range of publications while the analysis
focuses on the details of the published contributions.
A mapping study is therefore often used to provide (and visualize) a big picture
of a publication space while the systematic review is additionally concerned with
analyzing and integrating the knowledge contained in the reviewed publications,
as well as identifying inconsistencies among results, and areas that need more
investigation. Both types of secondary studies (also applicable in combination)
allow researchers to share a structured overview of the publications in a specific research area
and a common understanding of the state of reported evidence in topics along a
given (or emerging) classification scheme. Since the initial guidelines for conducting
literature studies in software engineering were proposed [19], we, as a community,
have collected and systematized the required procedures, and we have seen a boost of
secondary studies in the various international evidence-based software engineering
venues. This indicates the value of such studies to the research communities.
Problem Statement Researchers face a variety of challenges for which available
guidelines do not yet give sufficient practical advice: they either comprise
generic workflows, provide methods and techniques in a compendium-like style
[22,41], or elaborate on selected methods only, e.g., the effectiveness of certain se-
lection procedures [1,58]. Hence, conducting a literature study still depends to a
large extent on the expertise of the involved researchers. Furthermore, conducting
literature studies, to a large extent, still lacks tool support [13,5,52], thus making
the research process as such difficult to implement, notably for novices. While
working on a number of literature studies ourselves (Sect. 3), we experienced the
following challenges to be the most critical ones worth deeper examination:
How do we begin a secondary study, how do we build search strings adequate
for given databases, and how can we ensure accurate results given the de-
pendency on the expertise, experiences, and potential subconscious bias of the
researchers?
How do we deal with a large amount of data including hundreds or even thou-
sands of potentially relevant papers to classify and structure, and how do we
efficiently filter relevant results from irrelevant ones?
How do we efficiently work in a distributed team? Which tools can we use to
organize our (potentially distributed) way of working?
We experienced those challenges to concern mainly the design of a study [22] and
the data collection and study selection itself [58], notably independent of whether
it is conducted as a systematic review or a mapping study. The choice of one
particular study approach or a combination thereof (as for instance found in [41]
oftentimes) affects subsequent data analysis where the data is structured, classified,
coded, and analyzed to draw conclusions in tune with the research questions.
Despite the criticality of the initial design and data collection steps, little prac-
tical advice is given on how to effectively cope with the mentioned challenges. Ex-
isting guidelines are either too generic [51], or they focus on what a design should
accomplish rather than on how and why particular practices should be executed
in a cost-eective way, and how these practices are interconnected with each other
(see also our discussion in Sect. 4). In turn, for each literature study, researchers
need to carefully design and outline the process from the beginning again and
again, and they need to work out or even re-invent their own set of best practices.
Contribution In this article, we report on our own experiences in conducting sys-
tematic literature studies and contribute
A detailed blueprint for the design, data collection, and study selection proce-
dures steered by the aforementioned challenges.
A set of practical lessons learned and supporting material readily available for
use by other researchers approaching their own systematic literature studies.
We aim at supporting researchers, who already have a basic knowledge about the
general guidelines, in their literature studies by providing a practical and prag-
matic, experience-based path through the available practices and deliverables
capturing the dependencies among the single steps (Sect. 4). Researchers can di-
rectly reuse our blueprint to design and conduct their own domain-specific litera-
ture study and build the data analysis on top to answer their individual research
questions.
Outline The remainder of this article is organized as follows: In Sect. 2, we present
our experience-based approach to design and set up a literature study. We de-
scribe our procedures as they emerged from our previously conducted studies. We
also outline the handover to the data analysis, which depends on the type of the
respective study (mapping study and/or systematic review) and the research ques-
tions previously defined. Our previously conducted studies from which we distill
the blueprint are discussed in Sect. 3 along with practical lessons we learned while con-
ducting these studies. In Sect. 4, we finally discuss related work and position our
guideline, before concluding our article in Sect. 5. In the article's appendix, we
provide exemplary integrated workflows describing reusable standard workflows,
and further complementing material.
2 Study Design and Data Collection: An Experience-based Approach
We provide an experience-based guideline to support the study design, and to per-
form the data collection, cleaning, and study selection procedures. For each step,
we provide a guideline complemented with small inline examples. The guideline is
organized in the three phases Preparation, Data collection and Dataset cleaning,
and Study selection. Figure 1 provides a big picture of the whole process includ-
ing the most important inputs and outputs for the respective phases. The figure
also outlines the variations in the data analysis procedures that depend (in more
detail) on whether it is a mapping study or a systematic review.
[Figure 1 depicts the overall process: the phases Preparation, Data collection & Dataset cleaning, Study selection, and Data analysis & Visualisation, together with their main inputs and outputs (the protocol with research questions and search strings, the duplicate-free publication set, the relevant publications, the classification with qualitative and quantitative visualizations/maps, and the report). The scope of this contribution is the design and data collection; the individual facets of data analysis differ between mapping studies and literature reviews.]
Fig. 1 Overview of the presented approach and scoping.
Our guideline presented in this article emphasizes the early stages of a literature
study and constitutes a new building block in the methodical instrumentation of
evidence-based software engineering [22]. A detailed discussion on the relation to
existing guidelines and publications is provided in the related work in Sect. 4.
2.1 Preparation
The study preparation phase serves the purpose of setting up the study design
including, inter alia, the definition of appropriate research questions, the choice
of relevant literature databases, or the development of search queries. This phase
relates to the planning step mentioned in [22] where, for instance, the protocol
development is described. To set the scope of the search, inclusion and exclusion
criteria need to be carefully outlined, and, if necessary, preliminary studies can be
carried out to, among other things, support search string development or testing
and improving the study design (see also test-retest procedures as mentioned in
[22], or the quasi-gold standard search approach from [58]). In the following, we
describe the individual and minimum steps to be carried out during the preparation
of a literature study and give examples.
2.1.1 Research Goals and Research Questions
There is no silver bullet to define the goals of a literature study, as this strongly de-
pends on the purpose of the study. In general, the primary goal of literature studies
is to systematically collect reported knowledge in an area of interest. This can be
done in-breadth, usually in scope of mapping studies [40] that quantify selected
aspects reported in literature, or in-depth, usually in scope of systematic reviews
[22] to analyze publications in detail. The purpose of a study eventually dictates
the goals of the study, such as providing an overview of all relevant contributions
dealing with a particular topic.
Independent of the respective goals, we have found some general research ques-
tions particularly worth considering in a literature study, as they help elaborate
a big picture and provide relevant background information about the publica-
tion space. Table 1 summarizes such generic research questions, which could be
answered in every literature study—regardless of the particular study's scope and
selected topic.
Table 1 Exemplary standard research questions for literature studies.
No. Research Question
1 Which/how many publications on [topic] are published?
2 Which/how many publications on [topic] are published over the years?
3 What is the scientific maturity of the publication set?
4 What is the contribution of the publication set?
5 What are observable mainstreams in the publication set?
6 What new approaches for [topic] are available?
The research questions in Table 1 address the general descriptive aspects
present in every result set. Questions 1 and 2 aim at drawing a demographic
picture to outline the current state of a field under investigation, i.e., providing
information about publication quantity and frequency. This information can be
instrumented to show the development over time of the studied domain and to
analyze trends, for example, an emerging or a maturing domain (as exemplarily
depicted in Figure 2). The level of detail and data type (quantitative or qualitative)
further depends on the respective study type1.
To direct the study towards its goal, i.e., a mapping study or a literature re-
view, further standard questions can be asked that support the next steps in the
study selection process. For instance, the scientific maturity addresses the classi-
fication according to the research type facet [55] to work out the level of evidence
1Note that finding the "right" research question is a challenge and highly depends on the
actual study type. For instance, Kitchenham et al. [22] mention (standard) research questions
for systematic reviews usually addressing the evaluation of impact and/or effectiveness of certain
paradigms, while mapping studies usually address more high-level questions with the purpose of
providing some sort of categorization. The questions presented in Table 1 address more
the latter aspect, as this covers information available from all sorts of studies. Nonetheless,
to plan and implement a literature study efficiently, Staples and Niazi [51] make clear that
narrowly defined research questions are key. We therefore recommend using a combination of
generic research questions (e.g., Table 1, to "get a feeling" about the result set) and specific
narrow research questions—even for mapping studies.
[Figure 2 consists of two charts: the top chart shows the number of publications per year (1989–2013), and the bottom chart shows the per-year percentage distribution over the research type facets Opinion, Solution, Philosophical, Evaluation, and Experience.]
Fig. 2 Exemplary demographic distribution of publications over specific facets (addressing
research questions 1, 2, and 4) as taken from [28]. This figure illustrates on top the number of
publications over time and per year depicting publication trends. The bottom part indicates
the maturity of the result set by providing information about the research type facets.
in the publications. A mature field should for example not only contain solution
proposals, but also validation and evaluation research papers, and consequently
experience reports (Figure 2). The question for the result set's contribution aims
at working out the different kinds of contribution type facets [40] and their respec-
tive distribution in the publication population. For instance, does the result set
contain models, theories, lessons learned, or frameworks? The remaining questions
address further general aspects, such as observable streams in the result set. Such
streams can become apparent through certain trends or accumulations of publications,
e.g., an outstanding number of solution proposals and, at the same time, no theories.
Such a discussion can also be supported by applying further specific models, such
as the rigor-relevance model proposed by Ivarsson and Gorschek [16]. Mainstreams
can also be brought to light by studying the contents of the paper in more de-
tail, e.g., by introducing focus type facets [37], which can also direct the in-depth
investigation of a systematic review.
In summary, the standard research questions from Table 1 aim at providing a
demographic overview of the study. Answering these questions shows how many
publications have been published over time, which topics they address, and which
results they provide. These questions already provide a big picture of a research
field, and they allow for getting a better understanding of the studies available
in that field. Finally, these questions also help scope the study and prepare
the collection and selection procedures according to the overall study objectives.
For example, an initial analysis of the demographic information helps check the
suitability of research questions and adjust them if necessary.
2.1.2 Search Strings
Once the scope of the study has been set, researchers need to reflect on proper
search strings, which also depend on the domain under investigation. Depending on
the precision of the search strings, the queries may produce inappropriate results,
too much overhead, or just an incomplete result set. Therefore, search strings must
be defined with care [22], and search queries should always be tested prior to the
actual search2. There exist some strategies to develop proper search strings, e.g.:
Snowballing One way to narrow down the search space in advance is to conduct
a preliminary investigation of the field by relying on snowballing [22]. That is, the
investigation starts by studying publications known in advance and by iteratively
extending the known literature set by following the references provided therein.
This procedure helps provide an initial overview of the publication space and
key contributors, but very much depends on the expertise necessary to select an
appropriate starting point (see also Sect. 3). However, as reported by Badampudi
et al. [2], manual search strategies compared to automatic ones are capable of
producing “competitive” results regarding result set precision while, at the same
time, avoiding vast overhead usually produced by automatic database searches.
Trial-and-Error Search One approach suitable to find and test search queries is
the "Trial-and-Error Search". This approach relies on meta-search engines, e.g.,
Scopus or Google Scholar, and requires initial keywords or (partial) key phrases
that are considered search query candidates for the "real" search. The purpose
is to iteratively narrow down the list of potential candidates by checking
whether:
A search query returns a (potentially) meaningful result set.
A keyword or a combination thereof returns hits (at all).
A search query is of sufficient precision; for instance, if searching a particular
domain, how many hits are not in the domain of interest?
Hence, a trial-and-error search serves two major purposes: First, it can be used
to initially test and develop search queries, e.g., by determining which keywords
might (not) generate useful results. Second, results from such test runs can be
used to harvest reference publications to support manual search strategies (as for
instance exercised in [53]). Although this approach can be seen as anything but
good scientific practice, it still helps take the initial steps into the overall
research design development—especially in domains in which few or no secondary
studies are present to provide structure to the field of interest (as it for instance
was the case in [15]).
2Note that the construction of search strings also depends on the planned search strategy
(see Sect. 2.2), since search strings for automated database searches have a different "layout"
than those used for a curiosity-driven or trial-and-error search, e.g., using Google Scholar.
Regardless of the search strategy, finding the proper keywords is crucial. The most straight-
forward approach to develop appropriate search strings is either to do a trial-and-error search
or to call in domain experts. Alternatively, a preliminary study can be conducted to "test" the
field of interest.
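To make the query construction discussed above concrete, the following Python sketch assembles a boolean search string from OR-connected keyword groups, which can then be adapted to the syntax of the individual libraries. The keyword groups and the rendered query format are hypothetical illustrations, not taken from any of our studies.

```python
# Minimal sketch (assumption, not from the original guideline): building a
# generic boolean query from keyword groups; each group is OR-connected and
# the groups are AND-combined. The keywords below are placeholders.

population = ["software process improvement", "SPI"]
intervention = ["systematic review", "mapping study", "secondary study"]

def or_group(terms):
    """Join alternative terms into one parenthesized OR group."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

def build_query(*groups):
    """AND-combine the OR groups into one boolean query string."""
    return " AND ".join(or_group(g) for g in groups)

generic_query = build_query(population, intervention)
print(generic_query)
# ("software process improvement" OR "SPI") AND ("systematic review" OR ...)
```

Such a generic query would still need to be translated into each library's concrete syntax (field prefixes, wildcard rules, length limits) before the actual search.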
2.1.3 Inclusion and Exclusion Criteria
Depending on the study’s scope, result sets can contain a vast amount of poten-
tially relevant publications. In the worst case, we experienced searches yielding
several thousand hits, and it is beyond question that several tens of thousands of
hits cannot be treated seriously within an acceptable timeframe3. Therefore, re-
searchers need to clean the dataset and to select the relevant studies. In order to
make these procedures rigorous and reproducible, inclusion and exclusion criteria
need to be defined.
Table 2 Exemplary (standard) inclusion (I) and exclusion (E) criteria for literature studies.
No. Criterion
1 I Title, keyword list, and abstract make explicit that the paper is related to [topic].
2 I The paper presents [topic]-related contributions, e.g., [topic list].
3 E The paper is not in English [or any other language of interest].
4 E The paper is not in the domain [domain name(s)].
5 E The paper is a tutorial-, workshop-, or poster summary only.
6 E The paper relates to [topic] in its related work only.
7 E The paper occurs multiple times in the result set.
8 E The paper’s full text is not available for download.
Similar to the standard research questions (Table 1), we experienced some
inclusion and exclusion criteria to be useful in a broad spectrum of studies. These
standard criteria listed in Table 2 allow researchers to obtain an appropriate result
set and to define their requirements on the objective-dependent relevance of publi-
cations retrieved. For instance, experience shows that workshop or tutorial summaries
can contain a lot of relevant keywords, but might not necessarily advance the ac-
tual body of knowledge. Also, since contributions might occur multiple times or
might be out of scope, those have to be eliminated as soon as possible (criterion 7).
Another important criterion is the eighth, i.e., if the full text is not available, the
respective publication is usually of little value (regarding possibilities to analyze
them and eventually draw proper conclusions). In the context of a mapping study, this
issue can be compensated to a certain extent as those studies focus on an early,
abstract-based analysis. However, when it comes to in-depth analyses, e.g., in a
systematic review, the full text needs to be available.
Finally, Kitchenham et al. [22] recommend aligning search strings with the
research questions. We add to this the suggestion to also align the in-/exclusion
criteria with the research questions. This might result in a number of "duplicated"
criteria, i.e., a paper could be relevant to topic A or to topic B if the literature study
aims at synthesizing knowledge thus requiring multiple topics to be addressed and
analyzed together. This furthermore allows for later reconstruction of why a specific
paper was included in or excluded from the study.
3As it is also criticized by Staples and Niazi [51]. In [28], however, we accepted this challenge.
It took us about a year just to clean the data and perform the selection procedures. We do
not recommend this for replication.
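To show how such criteria can be operationalized for large result sets, the sketch below encodes a few checks in the spirit of Table 2 as predicate functions over a publication record, so that every screening decision can be logged and reproduced. The field names and the concrete criteria are hypothetical assumptions, not the fixed set used in our studies.

```python
# Minimal sketch (assumption): in-/exclusion criteria as predicate functions.
# Each record is a plain dict; field names ("title", "abstract", ...) and the
# selected criteria are illustrative only.

def is_english(paper):
    return paper.get("language", "en") == "en"

def is_full_paper(paper):
    return paper.get("type") not in {"tutorial", "workshop summary", "poster"}

def mentions_topic(paper, topic="software process improvement"):
    text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
    return topic in text

CHECKS = [("E3: paper not in English", is_english),
          ("E5: summary only", is_full_paper),
          ("I1/I2: no topic match", mentions_topic)]

def screen(paper):
    """Return (keep, violated_criteria) for one publication record."""
    violated = [name for name, check in CHECKS if not check(paper)]
    return (len(violated) == 0, violated)

keep, reasons = screen({"title": "Software process improvement in small companies",
                        "abstract": "...", "type": "full paper"})
```

Recording the violated criteria per paper makes it possible to report afterwards why each publication was included or excluded.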
2.2 Data Collection and Dataset Cleaning
Once the study is designed, data can be collected. In that stage, the resulting
data needs to be analyzed, cleaned/harmonized, and prepared for the upcoming
investigations.
2.2.1 Data Collection
The data collection is usually conducted as an automated search using different
sources. Automated data search, however, needs careful preparation and poten-
tially extra test runs, as every data source has a slightly different format of the
query strings, or constraints regarding the queries' length and complexity (see also
the discussion in [1,2,4,22]). In practice, we experienced the design of multiple and
overlapping query strings to be beneficial. Although the search procedure must be exe-
cuted several times and produces some overhead, simple queries are usually better
accepted by the search engines (see Sect. 3 for a detailed discussion).
Appropriate Data Sources Depending on the particular disciplines, several stan-
dard databases or collections (so-called baskets4) are available. In the following,
we give an exemplary discussion for software engineering. Apart from specific
conference- and workshop series (so-called restricted approach [22]), a literature
search should address the most common sources. That is, instead of searching
specific proceedings of a conference, search queries should be designed to work
with entire digital libraries. For the more general field of software engineering, the
following libraries can be considered as standard libraries (or subsets thereof if
opting for the restricted approach):
IEEE Digital Library (Xplore)
ACM Digital Library
SpringerLink
ScienceDirect (Elsevier)
Wiley Interscience
IET (also accessible via IEEE)
However, these libraries have their “specialties”, notably, regarding the search
query construction. Another point to take care of when using such digital libraries
is the continuous indexing, i.e., indexes will “evolve” over time, which makes it
hard to reproduce searches (see Sect. 3.1.2).
Checking the Result Set Before conducting the data collection, we recommend
having a set of reference publications available. One criterion we found useful for
checking the appropriateness of a search is whether the result set contains the expected
reference publications (see also, e.g., [58]). If one expects a particular publication
in the result set, e.g., arising from a preliminary search, but it is not contained in
that set, a revision of the search might be advisable. Options to identify
reference publications can be found in Sect. 2.1.2.
4Such as the Senior Scholars Basket, cf. http://home.aisnet.org/displaycommon.cfm?an=
1&subarticlenbr=346
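A simple automated variant of this reference-publication check can be sketched as follows; matching on normalized titles (rather than, e.g., DOIs) is an assumption made for brevity.

```python
# Minimal sketch (assumption): check whether the result set contains a list
# of reference publications known in advance (e.g., from snowballing).

import re

def normalize(title):
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def missing_references(result_set_titles, reference_titles):
    found = {normalize(t) for t in result_set_titles}
    return [t for t in reference_titles if normalize(t) not in found]

# Usage: an empty list suggests the search covers the expected publications;
# otherwise the search strings (or the selected sources) should be revised.
# missing = missing_references(result_titles, ["A known key paper on the topic"])
```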
Primary Search and Backup Search Primary searches should always be conducted
using aforementioned (or comparable) standard libraries. However, for several rea-
sons, those libraries do not always contain all relevant publications. For example,
contributions relevant to the field might result from Ph.D. theses that are not
published in/not indexed by the standard libraries.
Therefore, we experienced it beneficial to complement the primary search with
a backup search utilizing meta-search engines to complete the result set. However,
using a meta-search engine must be done carefully. Besides the standard meta-
search engines5, such as DBLP or Scopus, Google Scholar is often used to get
results quickly. However, the quality of search results obtained from such engines
also depends on search preferences and even trends; thus, searches might be
much less repeatable compared to standard libraries. The results might also
introduce duplicates and extra threats to the validity of a literature
study. A Ph.D. thesis, for example, can be written in a cumulative manner where
parts of it exist separately as peer-reviewed publications already present in the
result set of the primary search. Hence, it is important that the results obtained via
meta-search engines are not included into the main result set without a crosscheck.
To this end, hits produced by such engines should be kept in a category of their own,
and such searches should be discussed as part of the threats to validity to increase
transparency.
(Data) Export Practices Data obtained from a data source must be stored in a
way in which it can be used for further analyses. This part can become time-
consuming since different databases provide different export formats, which later
on need to be joined and integrated. Therefore, data should be exported in at least
two formats:
A format for a literature management tool of choice, such as BibTeX
As plain or (better) comma-separated (CSV) text files
These formats have the advantage that they are easy to process and convert into
spreadsheets to allow for further selection (Sect. 2.2.2), and later on, analysis steps.
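The following sketch illustrates how such CSV exports could be loaded into one common record structure despite library-specific column names. The mappings shown are hypothetical examples; the actual column names differ per library and change over time.

```python
# Minimal sketch (assumption): load CSV exports with library-specific column
# names into one common schema. FIELD_MAP entries are illustrative only.

import csv

FIELD_MAP = {
    "ieee": {"Document Title": "title", "Abstract": "abstract",
             "Author Keywords": "keywords", "Publication Year": "year"},
    "acm":  {"title": "title", "abstract": "abstract",
             "keywords": "keywords", "year": "year"},
}

def load_export(path, source):
    """Read one database export and map its columns to the common schema."""
    mapping = FIELD_MAP[source]
    records = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            record = {common: row.get(original, "").strip()
                      for original, common in mapping.items()}
            record["source"] = source   # keep provenance for later reporting
            records.append(record)
    return records
```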
2.2.2 Dataset Cleaning
Cleaning a result set is a demanding, time-consuming task. Usually, we find two
types of papers to be removed from the result set (cf. Table 2):
1. Contributions that are out of scope, and
2. Duplicates.
Duplicates are easy to find and eliminate, yet it is hard to decide which of the
duplicates should be eliminated. It often happens that one publication is listed
in multiple literature databases (e.g., for cross-indexing reasons). In such cases,
it needs to be decided which paper to consider for inclusion into the result set.
A pragmatic approach is to include the results from the database that provides
the paper for download and to remove the other occurrences; this needs, however,
5Note: Apart from serving the backup search, meta-search engines can also be a useful
instrument in studies that also include (continuous) updates, e.g., to monitor the development
of a field over time [23].
Fig. 3 Example of a word cloud from [28] for visually inspecting the result set. “Outliers” to
be used for excluding further papers from the result set are highlighted.
to be defined in the exclusion criteria for the sake of transparency. Another case
for a duplicate is a conference paper, followed by a journal article, e.g., a special
issue paper. In such cases, it must be decided whether the original or the extended
publication should be selected for inclusion. A criterion could be to always select
the higher-valued publication, i.e., journal over conference, as journal articles are
expected to have a higher maturity [37] and level of detail.
Publications that are out of scope are, on the other hand, easy to remove, yet
they are often hard to identify if part of a large result set. Since the result set
might have been created from an automatic search, even out-of-scope publications
that met at least one selection criterion could be present. Those publications need
to be found manually and removed in the cleaning procedures.
Scoping via Word Clouds To support the identification of out-of-scope papers,
we experienced word clouds (tag clouds) to be a useful tool. Word clouds can
be automatically created using keyword lists or abstracts. A word cloud is an
instrument to visualize the (quantified) occurrence of a word/term in relation to
other terms. They can be easily created using several publicly available tools6,
e.g., Wordle or TagCrowd7.
Word clouds can serve two purposes: First, word clouds can be used to analyze
the appropriateness of a result set. A word cloud, which is based on the keywords,
can be analyzed to work out whether the contained publications’ keywords match
the expectations (Figure 3). Unexpected and/or “wrong” keywords can be easily
detected and used to clean the result set. Depending on the quality, a considerable
share of non-fitting papers can be removed; remaining papers (in a reduced set)
are then removed during the selection phase (Sect. 2.3).
However, word clouds have to be used with care: even though there is research
that shows word clouds providing improvement concerning the clustering and sum-
marizing of descriptive information, such as [36,44,29,48,46], there is still the risk
6Note: Some of the tools have limitations regarding the amount of text they can process.
Furthermore, the tools offer different features, such as thresholds, visualization and export
mechanisms. Those points need to be evaluated prior to usage.
7Both tools are available at: http://www.wordle.net and http://tagcrowd.com/
of eliminating relevant papers; for instance, because those papers might rely on
a rarely used terminology. Therefore, eliminating papers based on word clouds
only might threaten the validity of a study, which is why we recommend that the use of
word clouds be planned with care and in detail in advance, and that resulting
candidates for removal undergo careful inspection.
As a second purpose to be served, a word cloud can support the later analysis
of a result set during, for example, the concept classification conducted as part
of a mapping study. For instance, in our study on method engineering [27], we
analyzed the final word cloud to get a better understanding about which research
type facets to expect from the publication set (e.g., how to interpret terms like
“case study” as used in the respective community). The result of the word cloud
and the result of the classification conducted in the study can then be compared
to analyze the subjective authors’ self-classification and the more objective one
from the reviewers’ classification. In another example [24], we used a word cloud
to support the development of a focus type facet [37] and, furthermore, to conduct
a cluster analysis.
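As a minimal sketch of this practice, the snippet below generates a word cloud from the keyword fields of a cleaned dataset using the third-party wordcloud package; the package choice and field names are assumptions, and browser-based tools such as Wordle or TagCrowd serve the same purpose.

```python
# Minimal sketch (assumption): word cloud over the keyword column of the
# cleaned dataset. Requires: pip install wordcloud matplotlib

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def keyword_cloud(records, out_file="keywords.png"):
    text = " ".join(r.get("keywords", "") for r in records)
    cloud = WordCloud(width=1200, height=600, background_color="white",
                      collocations=False).generate(text)
    plt.figure(figsize=(12, 6))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.savefig(out_file, bbox_inches="tight")
```

Unexpected dominant terms in the resulting image point either to out-of-scope papers or to search strings that need revision.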
Merging and Reducing the Integrated Dataset Depending on the particular search
strategy—especially the search query construction approach—researchers have to
deal with multiple (isolated) datasets. This is especially true if the work during the
data collection is distributed among multiple researchers. To prepare the selection,
the individual result sets need to be integrated into a holistic one. This integration
constitutes a challenging task:
If a literature database was queried multiple times (e.g., for the search string
construction), the individual results need to be joined.
Every literature database provides a slightly different export format and/or
structure, e.g., CSV files obtained from Springer and from ACM have a different
structure. These differences need to be reconciled.
If duplicates were removed on a per-database basis, the integrated result set
may still contain cross-database duplicates. The integrated dataset must then
be cleaned again by identifying and removing duplicates.
If the individual datasets were not yet investigated for duplicates, the respective
cleaning procedures must be performed now.
The aforementioned steps can be (partially) automated [23]. Nevertheless, the in-
clusion and exclusion criteria selected for the study should be consulted to support
the compilation of the integrated dataset as well. We experienced the following
procedure (Figure 4) to be best suited for the stepwise integration:
1. Integrate and clean the data on a per-database level, i.e., if a database was
queried multiple times, integrate the obtained sub-result-sets first.
2. Integrate all sub-result-sets into the integrated dataset and repeat the cleaning.
Eventually, we create an integrated dataset. Appendix B provides an example
illustrating and explaining the minimal required data. Please note that the step of
integrating and reducing the data is crucial and, therefore, needs to be documented
carefully. The particular steps of the applied procedures are valuable information
for other researchers to reproduce the overall study. Furthermore, the outcome
of these steps forms the input for the rest of the study. Hence, researchers must
ensure that no relevant publication is lost during this step.
[Figure 4 shows how the sub-result-sets obtained from each literature database (e.g., Result Set 1.1 and 1.2 from Literature Database 1) are first integrated per database while removing multiple occurrences, and how the per-database result sets are then merged into one integrated result set that is cleaned from cross-database duplicates.]
Fig. 4 Exemplary procedure of stepwise integrating and cleaning literature databases. In each integration step, reporting-relevant information needs to be recorded.
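The stepwise integration shown in Figure 4 can be sketched in a few lines; detecting duplicates via a normalized title key and the simple record structure are simplifying assumptions.

```python
# Minimal sketch (assumption): integrate per-database sub-result-sets first,
# then merge the per-database sets and remove cross-database duplicates.

import re

def key(record):
    return re.sub(r"[^a-z0-9]", "", record.get("title", "").lower())

def dedupe(records):
    """Keep the first occurrence of each (normalized) title."""
    seen, unique = set(), []
    for record in records:
        k = key(record)
        if k and k not in seen:
            seen.add(k)
            unique.append(record)
    return unique

def integrate(per_database_sets):
    """per_database_sets: dict mapping database name -> list of sub-result-sets."""
    integrated = []
    for name, subsets in per_database_sets.items():
        merged = dedupe([r for subset in subsets for r in subset])
        print(f"{name}: {len(merged)} unique records")   # reporting-relevant info
        integrated.extend(merged)
    return dedupe(integrated)                            # cross-database duplicates
```

The per-step counts printed here correspond to the reporting-relevant information mentioned in the figure caption and feed directly into a report such as Table 3.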
Step-wise Dataset Completion Once an integrated dataset is obtained, it should
be analyzed for (sufficient) completeness. Depending on the database and the
individual publications, some information might be missing, e.g., abstracts or key-
words. This information needs to be collected and integrated; however, this step
bears some pitfalls:
There are abstract-free publications, e.g., magazine articles, of which the re-
spective literature databases provide parts of the introduction section as ab-
stract substitute. Such cases require a manual inspection and researchers need
to discuss how to treat them.
There are publications without (electronically available) keywords. These are
publications that have no keywords at all, or publications that may well have
defined keywords, but those were not listed in the exported data structure. For
those publications, it must be defined how to treat them.
For technical reasons, some literature databases do not provide options to
export the abstracts. In such cases, manual work is required to get the abstract
and integrate it into the dataset.
Pieces of required metadata might be missing, e.g., the publication year, pub-
lication vehicle (conference, journal, etc.). This information needs to be com-
pleted.
Apart from this essential information, another aspect needs to be taken into ac-
count: the representation of the authors. Literature databases do not have a uni-
form representation of the author lists; for instance, authors might have varying
affiliations or their first and last names are ordered differently (e.g., "J. J.
Abrams” versus “Abrams, J. J.”). If researchers plan, for example, to conduct some
analyses on the author lists of a study, such as by creating collaboration
networks, cliques, and mainstreams, the author information must be available in
a uniform way. As dataset completion can be extremely fidgety work, it should be
performed iteratively and under continuous quality assurance:
1. Complete the abstracts
2. Complete the keywords
3. Complete all other required metadata
4. Ensure consistency in the author lists
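For the last point, a small normalization routine can already remove most of the inconsistencies; the "Lastname, Initials" target format and the parsing rules below are illustrative assumptions, and real author data (middle names, particles, transliterations) needs additional care.

```python
# Minimal sketch (assumption): normalize author names to "Lastname, F. F."
# so that "J. J. Abrams" and "Abrams, J. J." are treated as the same person.

def normalize_author(name):
    name = " ".join(name.split())
    if "," in name:                       # already "Lastname, Firstname(s)"
        last, first = [p.strip() for p in name.split(",", 1)]
    else:                                 # "Firstname(s) Lastname"
        parts = name.split(" ")
        last, first = parts[-1], " ".join(parts[:-1])
    initials = " ".join(p[0].upper() + "." for p in first.replace(".", " ").split() if p)
    return f"{last}, {initials}".strip().rstrip(",")

print(normalize_author("J. J. Abrams"))     # Abrams, J. J.
print(normalize_author("Abrams, J. J."))    # Abrams, J. J.
```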
Dataset Structure: A Template To support all aforementioned steps, a defined
data structure needs to be in place. The particular data structure depends on the
specific study. However, we recommend the minimal data structure shown in Table 8
(Appendix B) as it emerges from our previously conducted studies. The table
illustrates the recommended minimal data structure to organize the result set.
This table serves the basic purposes and can be extended according to the actual
study's needs, e.g., with extra columns for classifications in mapping studies.
2.3 Study Selection
In the study selection phase, the prepared dataset is analyzed for publications
relevant for the actual study, i.e., researchers systematically select the relevant
papers from the search results (this phase relates to the (primary) study selection
in [22]). Since result sets can comprise several hundreds or even thousands of
papers, this phase requires special attention and, thus, careful planning.
2.3.1 Plan: Defining the Study Selection Procedure
Many factors influence the actual study selection (e.g., number of researchers, de-
gree of distribution, familiarity with the topic, etc.). In case of multiple researchers
conducting the study, we consider the following aspects of the study selection nec-
essary to be planned and agreed on in advance:
Schedule for the study selection including workshops, regular meetings/calls
for discussing intermediate selection/classification results, etc.
Technical infrastructure (tools, data storage, file formats, etc.)
The criteria upon which researchers decide the relevance of a publication
The procedure to infer an agreement and the voting procedure (if applicable)
The last step, the voting, assumes that various researchers vote for in-/exclusion
of a publication independently. The final decision for including the publication
into the final result set then depends on the result of the voting. There are many
practices that can be included into the voting procedure (e.g., veto rights) while
we believe that this also much depends on the research context, e.g., researchers’
experience, expertise, but also their personal preferences to conduct the study (see
also Sect. 2.3.3).
2.3.2 Kick-O: Setting Up the Selection Approach
Assuming a study within a group of multiple researchers, the study selection starts
with a kick-off meeting in which the inclusion and exclusion criteria are recalled
(Table 2), the selection/voting procedure is discussed, and a schedule for subse-
quent meetings is defined. In the following, every participating reviewer gets a copy
of the cleaned result set, which is rated individually. That is, the study selection
procedures are initiated.
2.3.3 Voting Procedure
Voting is essentially a headcount procedure in which a team of researchers works
out a decision whether a particular paper is considered relevant for the study or
not, i.e., to eliminate those papers from the result set that are considered irrelevant.
The relevance can be determined by different measures, which need to be defined
in advance (e.g., title, abstract, and full text). Potential routes towards a decision
are majority votes or relative ratings. The actual classification can be carried out
in a group of researchers or individually, iteratively, round-based or in workshops.
In the following, we focus on an individual, traditional round-based classification.
Majority Voting The voting is a headcount that aims to bring objectivity into
the study selection. Although there are in-/exclusion criteria, the final application
of the criteria to the publications to be selected is in the hands of individuals
thus including individual interpretations of a publication. The reason why we
recommend including multiple researchers in this procedure is to overcome the
inherent threat arising from this subjectivity. Hence, we also consider a majority
vote to be the standard procedure as it is the most straightforward approach:
every reviewer is provided with the integrated result set and reviews the items
individually according to defined criteria, e.g., title and/or abstract. If a reviewer
considers a publication relevant, 1 point is given, 0 otherwise. For nreviewers and
mpublications, the procedure results in an nmvoting matrix, which helps to
select the relevant papers. The (final) selection is then based upon the agreements,
such as a threshold or agreement statistics (e.g., Cohen’s or Fleiss’ ). For example,
if three reviewers participate, the voting procedure could be organized as shown
in Figure 5: two reviewers start individually. To get a paper included in the set
of relevant papers, 2 points are required (threshold approach). Two reviewers can
come up with the following results: 2 points = paper is relevant, 0 points = paper
is irrelevant, and 1 point = paper is not yet decided. In the next step, the third
reviewer8is called in and is presented a reduced list that only contains the papers
yet not decided. The third reviewer then conducts the voting to finally decide
about the papers’ relevance.
Fig. 5 Overview of the
standard majority voting
procedure for a 3-reviewer
team.
Alternative Approaches Instead of calling in a third reviewer to conduct a fully
independent review, a voting workshop can be organized. In such a workshop,
all reviewers involved in the selection process discuss and decide the non-decided
8Please note that a reviewer can be an internal reviewer (e.g., a co-author) as well as an
external researcher or expert not involved at all in the design in case of unknown domains.
[Figure 6 shows an example voting matrix: papers A–G are distributed over five reviewers in overlapping subsets so that each paper receives three 0/1 votes; the integrated vote (threshold = 2) then marks each paper as in or out.]
Fig. 6 Paper selection based on overlapping paper subsets (a reviewer evaluates only a subset of papers; usually just one run is required to find the selection).
papers. We applied this approach for instance in [34,28]. Yet another approach
is to provide reviewers with overlapping subsets of the whole result set, e.g., to
incrementally collect three votes in just one run (Figure 6).
Scaling So far, we performed the majority voting procedure with 2 reviewers in the
workshop model, 3 and 4 reviewers, and two 2-person review teams (see Sect. 3).
However, as we talk about simply summing up points, the approach can be scaled
to an even larger number of reviewers. A paper’s relevance is then simply defined
by a function
$\mathit{relevance}: \mathbb{R}^{+} \times \mathbb{Z} \rightarrow \{0, 1, ?\}$   (1)
that is used to determine the relevance of a paper $p_j$ in relation to a threshold $th$,
and to (de-)select papers or marking them for later decision:
$\mathit{relevance}(\mathit{rating}(p_j), th) = \begin{cases} 1 & \text{if } \mathit{rating}(p_j) > th \\ 0 & \text{if } \mathit{rating}(p_j) < th \\ \mathit{toDecide} & \text{if } \mathit{rating}(p_j) = th \end{cases}$   (2)
The actual threshold $th$ needs to be defined during the initialization of the selection
procedure (Sect. 2.3.1). The rating (simple, unweighted case; Figure 5) of a paper
is then defined by the number of points that a paper received from the $n$ reviewers
involved in the process:
$\mathit{rating}(p_j) = \sum_{i=1}^{n} r_i(p_j)$   (3)
Regardless of the number of stages and reviewers involved, rating statistics need to
be carefully documented in order to be able to reproduce which paper came in in
which stage and to make explicit the inter-rater agreement. Furthermore, we also
suggest to document according to which criteria a paper was included or excluded
after all, which can require extending the data structure of the result set to keep
this information.
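A direct, unweighted implementation of Eq. (2) and Eq. (3) might look as follows; the vote data and the two-reviewer threshold are hypothetical.

```python
# Minimal sketch: rating (Eq. 3) and relevance (Eq. 2) for the simple,
# unweighted majority vote. `votes` maps each paper to the 0/1 votes given.

def rating(paper_votes):
    """Eq. (3): sum of the points a paper received from all reviewers."""
    return sum(paper_votes)

def relevance(paper_rating, threshold):
    """Eq. (2): 1 = include, 0 = exclude, 'toDecide' = needs another vote."""
    if paper_rating > threshold:
        return 1
    if paper_rating < threshold:
        return 0
    return "toDecide"

votes = {"P1": [1, 1], "P2": [0, 0], "P3": [1, 0]}   # hypothetical 2-reviewer round
decisions = {p: relevance(rating(v), threshold=1) for p, v in votes.items()}
print(decisions)   # {'P1': 1, 'P2': 0, 'P3': 'toDecide'} -> P3 goes to the third reviewer
```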
Relative Rating The relative ratings approach9 as illustrated in Figure 7 is similar
to the majority voting where all reviewers are asked to vote on a result set, but with a
9So far, we have not yet applied this method to a complete study, but partially applied it
during sample-based result set testing and evaluation (cf. Sect. 3). As this approach is quite
complex compared to the majority vote, it requires sufficient tool support.
[Figure 7 shows an example of a relative rating: three reviewers rate papers A–G on a 5-point scale, and the integrated vote (threshold: mode = 4) marks papers as in or out, with papers ending up at the neutral value 3 flagged for discussion.]
Fig. 7 Paper selection based on relative votes (final selection is, in this example, made using the mode value, while the "neutral" element 3 serves as marker for papers to be discussed).
difference in the applied metric: Instead of a simple "Yes/No" (1/0) metric, in this
approach, we use Likert scales and thresholds. The basic underlying procedure
remains the same: each reviewer is provided with the integrated result set and
rates a paper, but on a scale, such as a 5-point Likert scale:
5 points: Paper is highly relevant (must be included)
4 points: Paper is (somewhat) relevant
3 points: neutral/no opinion
2 points: Paper is not relevant
1 point: Paper is absolutely irrelevant
Based on the individual ratings, relevance can be determined, e.g., using the mean
value or the mode, and precision can be determined, e.g., using standard deviations
or distance metrics. The inclusion criterion is then a selected value on the used
scale, e.g., 4 (or better). A critical case is papers that end up with the
neutral value; these papers require extra handling.
Balancing Votes How reliable is this way of selecting papers? In the simple case,
which is the majority vote, a democratic headcount is used to in-/exclude a paper.
However, this procedure has some flaws. For instance, consider a situation in which
two reviewers end up in a stalemate. A third reviewer is then called in to make
the decision; now scale this up to 7 reviewers: the 7th reviewer makes the
final decision by outvoting 3 others. To overcome such situations, workshops can
be performed to discuss critical papers (which can be unrealistic if, for instance,
250 papers need to be discussed), thresholds can be defined, or weights can be
introduced, e.g., senior reviewer votes count twice. However, the basic problem
remains: what is the level of agreement, i.e., the reliability of the selection? As a
first step to determine the reliability, the inter-rater reliability can be calculated,
e.g., using Cohen's κ [6] for two reviewers or, more generally, Fleiss' κ [11] for more
than two reviewers10. Furthermore, the basic agreement can be visualized (and
partially automated) as shown in Figure 10 (Appendix B).
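For two reviewers, such an agreement check can be sketched as follows; the use of scikit-learn and the example votes are assumptions, and for more than two reviewers Fleiss' κ (e.g., via statsmodels) would be calculated instead.

```python
# Minimal sketch (assumption): Cohen's kappa for two reviewers' 0/1 votes.
# Requires: pip install scikit-learn

from sklearn.metrics import cohen_kappa_score

reviewer_1 = [1, 0, 1, 1, 0, 0, 1]   # hypothetical votes on seven papers
reviewer_2 = [1, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.2f}")   # agreement beyond chance between the two raters
```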
Yet, the headcount is a fairly simple, but absolute metric. In some cases, we
experienced the need for a more differentiated vote, which can be implemented,
e.g., using relative votes with Likert scales. However, the more differentiating scale
introduces a new challenge: How to find a final and consolidated rating? Approach-
ing a consolidated rating via the mean or the mode might fail, because they are
10 Please note that inter-rater reliability calculations also depend on the scales applied, e.g.,
weighted values when using ordinal data (cf. [22,56]).
easy to trick or because they might not even be applicable; consider, for example,
the mode of {0,0,1,1}, and what a resulting mean of 0.5 even implies in relation
to a $th \in \mathbb{Z}$ (cf. Eq. 1). Again, a simple solution could be to introduce rater-specific
weights. Furthermore, simple weighting methods, such as the 3-point method, can
be applied, with $V_j = \{v^{r_1}_{p_j}, \ldots, v^{r_n}_{p_j}\}$ being the set of $n$ reviewer votes for a paper
$p_j$:
$\mathit{rating}(p_j) = \dfrac{\min(V_j) + 4 \cdot \bar{V}_j + \max(V_j)}{6}$   (4)
The extended weighted rating from Eq. 4 can be used in the determination of the
relevance in Eq. 2.
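A direct implementation of Eq. (4) is shown below; the example votes are hypothetical.

```python
# Minimal sketch: the 3-point weighted rating from Eq. (4) over the Likert
# votes (1-5) of n reviewers for one paper. The result can then be compared
# against a threshold as in Eq. (2).

def weighted_rating(votes):
    """Eq. (4): (min + 4 * mean + max) / 6 over the reviewer votes."""
    mean = sum(votes) / len(votes)
    return (min(votes) + 4 * mean + max(votes)) / 6

print(weighted_rating([5, 4, 2]))   # ~3.61, i.e., below an inclusion threshold of 4
```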
2.3.4 The Gathering: Integrate and Finalize the Paper Selection
Once all individual ratings have been conducted, the study's moderator (Kitchenham et al.
[22] speak of a team leader) collects all individual ratings and starts the integration
of the results. The basic task is to, initially, integrate the individual ratings to work
out the current state of selected and/or undecided papers (see also color-coding in
Figure 10 that is based on Eq. 2 and Eq. 3). Depending on the approach defined in
the initialization of the selection procedure (Sect. 2.3.1), the moderator prepares
the dataset for extra review iterations and/or organizes required workshops. In the
following, the selection procedure is iterated until all papers are finally decided.
Once all papers are decided, the moderator draws a baseline and prepares
the final selection of papers, i.e., a cleaned list that only contains those papers
considered relevant for the study, and finally prepares the clearing work.
2.3.5 Class Dismissed: Analyze the Result Set and Report
When the selection is done, the moderator concludes the selection process and
prepares the handover to the actual analysis. This includes some standard tasks
as well as some optional tasks depending on the eventually targeted study. In
particular, the moderator has to prepare the study selection report and the result-
ing literature database. The literature database must at least contain all papers
that were selected as relevant to the study. The report comprises some statistics,
such as, databases, results per database from search, and elimination statistics (an
example is shown in Table 3).
Depending on the intended study type, just in this step, the moderator can
also provide some extra data to support the later analyses. For example, if ap-
plicable, the inter-rater agreement helps identify those publications that form the
heart of the result set. Furthermore, several outputs can be generated from the
result set that help finding a starting point, e.g., exports of the keyword lists and
abstracts and word clouds generated thereof, and, associated with more eort,
social networks (Sect. 3.1.4).
2.4 Concluding and Handover to Data Analysis
The last step consists in initiating the actual data analysis, which is dictated by
the research questions and eventually the type of secondary study. From the afore-
mentioned steps, the outcomes listed in Table 4 have to be assembled
Table 3 Exemplary search and selection report (excerpt from [28]).

Step                                IEEE    ACM    ...    Total
Step 1: Search
  S1 and (C1 or C2)                 715     43     ...    3,185
  ...                               ...     ...    ...    ...
  S8 and C2                         114     105    ...    8,374
Step 2: Removing Duplicates
  Duplicates per database           1,486   566    ...    16,643
  Duplicates across all databases   916     551    ...    5,315
Step 3: In-depth Filtering
  Applying filters F1 and F2        578     —      ...    1,562
  Unfiltered                        —       551    ...    1,610
  Result set (search process)       578     551    ...    3,172
Step 4: Voting
  Final result set                  283     65     ...    635
and shipped to the in-depth analyses. These deliverables can be properly inte-
grated with the research protocols as, for instance, recommended by Kitchenham
et al. [22].
Table 4 Artifacts to be created in the early stages of literature studies to be shipped to the in-depth data analysis.

Reference    Outcomes and content to be delivered
Sect. 2.1.2  Search terms and resulting search queries (generic terms and queries, as well as database-specific queries)
Sect. 2.1.3  In-/exclusion criteria used in the study
Sect. 2.2.1  List of selected and queried databases, and raw result sets (e.g., CSV files)
Sect. 2.2.2  Cleaned and integrated data sets (including all support instruments used)
Sect. 2.3.1  A documented study selection approach, including team setup, selection procedures, and so forth
Sect. 2.3.4  Decided data set (final result), statistics of the selection, further complementing report data
3 Example Studies and Lessons Learned
The guideline presented in this article emerges from various systematic
reviews and mapping studies we conducted. In this section, we provide an overview of the previously
contributed studies and discuss how we applied the discussed practices and
procedures so far. Table 5 provides an overview of the referred studies and relates
the studies to the respective methods and techniques.
Table 5 Overview of the different studies utilizing the presented practices. Practice columns (in order): Type (r/m), Preliminary Study, Trial-and-Error Search, Snowballing, Search String (1/n), Majority Voting, Relative Rating (s/f), Workshops, Inter-rater Agreement (s/f), Multiple Researcher Teams, Word Clouds, Social Network Analysis, Rigor-Relevance Model [16].

[27] A Mapping Study on the Feasibility of Method Engineering | m | ✓ ✓ ✓ (n) ✓(3) ✓ ✓ ✓
[34] Where Do We Stand in Requirements Engineering Improvement Today? First Results from a Mapping Study | m | ✓ ✓ (n) ✓(3) ✓(f) ✓
[18] Criteria for Software Process Tailoring: A Systematic Review | r | ✓ ✓ (n) ✓(3)
[26] Systematic Software Process Development: Where Do We Stand Today? | r | ✓ ✓ (1) ✓(3)
[28] Software Process Improvement: Where Is the Evidence? | m | ✓ ✓ (n) ✓(2) ✓(s) ✓ ✓(s) ✓
[23] Software process improvement: A systematic mapping study on the state of the art (I) | m | ✓ ✓ (n) ✓(2) ✓ ✓(s)
[24] How does software process improvement address global software engineering? | m/r | ✓ H ✓(2) ✓ ✓ ✓
[17] On the Role of Software Quality Management in Software Process Improvement | m/r | ✓ H ✓(2) ✓ ✓
[25] Towards Artifact Models as Process Interfaces in Distributed Software Projects | m/r | ✓ (n) ✓(2) ✓
[53] Is Water-Scrum-Fall Reality? On the Use of Agile and Traditional Development Practices | r | ✓ ✓ ✓ (1) ✓(2) ✓ ✓(2)
[15] On the Use of Safety Certification Practices in Autonomous Field Robot Software Development: A Systematic Mapping Study | m | ✓ ✓ ✓ (n) ✓(3) ✓
[43] Value Creation by Agile Projects: Methodology or Mystery? | m | ✓ ✓ (n) ✓(3)
[7] A Systematic Mapping Study on Empirical Evaluation of Software Requirements Specifications Techniques | m | ✓ ✓ (n) ✓(4)
[14] A Systematic Literature Review on Agile Requirements Engineering Practices and Challenges | r | ✓ ✓ (n) ✓(3)

Search String (1/n): The study uses 1 large or n smaller search strings
Relative Rating (s/f): Relative rating of the full result set or on samples thereof
✓(*): * = number of search strings, or number of reviewers involved
I: Study update for [28]; H: Detailed study using the dataset from [23]
3.1 Selected Examples and Lessons Learned
Over the last years of working on literature studies, we collected a number of
lessons learned, which we briefly summarize below. In order to illustrate the lessons
learned, we relate them in this section to the studies from Table 5 and provide some
examples. Moreover, the practices listed in Table 5 can, in general, be considered
self-contained building blocks, i.e., they can be combined in different ways. In our
experience, however, some combinations of practices proved especially beneficial. Those are
presented in Appendix A as a blueprint. We also have to note that there might exist
dependencies and/or constraints providing arguments in favor of or against applying
certain practices in a particular context (see also [58]). For example, if
a preliminary study was already conducted to determine the study's scope and a set of
reference publications, the “Trial-and-Error” search approach will not add to the
study. Another example is the combination of selection strategies, i.e., the combination
of majority votes, relative ratings, and voting/rating workshops. Here,
workshops (“expensive” due to the required human resources) should preferably
be scheduled for the late selection phases, when the number of publications
to be decided has been reduced to a manageable size (see Sect. 2.3.3). The rest of
this section is organized according to the stages of this guideline (cf. Figure 1).
3.1.1 Basic Planning
Regarding the general planning activities associated with a literature study, we
consider the following lessons learned the most important.
Make a Cunning Plan that Cannot Fail Given the effort, duration, and the involvement
of various researchers, a literature study should be built upon a concrete
plan of which the research protocol [22] is key. We experienced that involving all
researchers at the beginning is crucial to establish a shared understanding of:
– The basic terms, concepts, and their synergies in the field of interest, and
– The way the classification criteria should be interpreted and applied.
If the relevance or other concepts are classified based on a pre-defined scheme, those
concepts need to be clarified at the beginning.
Watch out! The Technical Infrastructure Matters One of the most important administrative
tasks is to define the technical infrastructure to be used for the study.
The two most important aspects are:
– Use a version control system (VCS).
– Don't mix up Microsoft Excel and OpenOffice.org/LibreOffice.
The VCS is crucial to create baselines of the study, e.g., raw data or tentative
result sets. Furthermore, a VCS allows for distributed, collaborative, and concurrent
work, and it ensures that results are not accidentally overwritten. The second
aspect stems from practical experience: In several studies, some researchers just
took the pre-configured Microsoft Excel file (see Appendix B) and worked on it
with OpenOffice.org/LibreOffice, so that many scripts and auto-formatting configurations
no longer worked, or other researchers could simply not open the file
anymore with the respective other tool (e.g., as happened in [28]). Fixing such
situations is time-intensive and avoidable.
3.1.2 Search Strings and Search Engines
Regarding the construction of proper search strings, we consider the following
lessons learned the most important ones.
One Search String or Multiple Ones? Applying the introduced search strategies
may result in more than one search string, which then can be customized for the
different search engines. A practical problem remains: the length and complexity of
the search strings, and the ability or limitations of literature databases to process
search queries of and above a certain complexity (as observed when trying to
replicate [47]). That is, the major question is which alternative is better: one
integrated long search string or multiple shorter ones, as exemplarily shown in
Table 6.
Table 6 Exemplary search strings for an automated database search (excerpt from [28]).

Search string | Addresses . . .
S1: (life-cycle or lifecycle or life cycle) and (management or administration or development or description or authoring or deployment) | process management: general life cycle
. . . | . . .
S8: (feasibility or experience) and (study or report) | reported knowledge and empirical research
An integrated search string has the advantage of (relatively) high precision. Furthermore,
it allows for capturing the entire domain in only one query. However,
many literature databases, such as IEEE Xplore, have some limitations regarding
length and complexity. Furthermore, the syntax of the search queries differs from
database to database, thus requiring database-specific instances of the query anyway.
In contrast, multiple shorter search queries bypass database limitations by
providing simpler structures (also recommended by [22]) and, furthermore, such
strings are easier to adapt to specific database requirements. On the other hand, in
order to ensure search precision, multiple search strings require more effort in their
design. For instance, to get a maximum of publications, multiple search strings require
some overlap to avoid “losses at the borders”. This, however, may cause
some overhead in the result set and publications occurring multiple times that have to
be identified and removed later on. Furthermore, due to their simpler structure,
such search strings are prone to attract unwanted publications [58], thus requiring
extra context selectors and filter constructs [28].
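The following minimal sketch illustrates the "multiple shorter queries" option: grouped terms (loosely following Table 6) are combined into several simpler query strings plus a context selector. The terms and the Boolean syntax are simplified illustrations and would need to be adapted to each database's query language.

```python
# Minimal sketch: compose several shorter queries from grouped search terms
# (loosely following Table 6) instead of one monolithic search string. The
# terms, the context selector, and the Boolean syntax are simplified
# illustrations and must be adapted to each database's query language.

LIFECYCLE = ["life-cycle", "lifecycle", "life cycle"]
ACTIVITIES = ["management", "development", "deployment"]
CONTEXT = ["software process"]  # context selector to filter unwanted hits

def or_group(terms):
    """Render a list of terms as an OR-connected group."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

def build_queries():
    """One shorter query per activity term keeps each string simple."""
    return [f'{or_group(LIFECYCLE)} AND "{activity}" AND {or_group(CONTEXT)}'
            for activity in ACTIVITIES]

if __name__ == "__main__":
    for query in build_queries():
        print(query)
```

The deliberate overlap between such queries later shows up as duplicates, which is why the cleaning step (Sect. 2.2.2) remains necessary.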
Don't Trust Old Result Sets When it comes to updating or replicating a literature
study, one problem is the literature database as such. For example, in a student
study activity, we aimed at replicating and updating a previously conducted SLR
[47] for which we had the full research protocol available. The replication package
also included text files containing the database-specific search strings. In an initial
test run, we encountered the following: IEEE Xplore rejected the search
query, stating that it was too complex, having more than 50 terms. Transferring the
(general master) search string to Scopus (to test whether it would return any papers at all)
and configuring the search properly (limiting the venues and publishers etc.), we
found 215 instead of 125 papers matching the search criteria. So far, we could not
sufficiently elaborate what exactly happened, but argue that this is one of the effects
coming along with continuously updated indexes (see also Brereton et al. [4], who
consider the indexing of current digital libraries inappropriate). In short, over
time, search queries age and literature databases evolve. There is no guarantee
that a result set obtained at one point in time will be reconstructible some time
later. There is no mitigation strategy for this problem, except to increase the
transparency of the data collection by reporting a timestamp for the searches to
support the reproducibility and thereby the validity. Therefore, search queries as
well as raw result sets (Sect. 2.2) should be stored, at least to reproduce the
findings from the raw data.
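A lightweight way to implement this recommendation is to archive every raw export together with its query and a timestamp, as in the following sketch; the paths and metadata fields are illustrative assumptions.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Minimal sketch: archive each raw export together with the query string and a
# timestamp so that the findings remain reproducible from the raw data even if
# the literature database evolves. Paths and field names are illustrative.

def archive_raw_result(export_file, database, query, archive_dir="raw_results"):
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
    target = Path(archive_dir) / f"{database}_{stamp}"
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy(export_file, target / Path(export_file).name)
    metadata = {"database": database, "query": query, "retrieved_at": stamp}
    (target / "search_metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8")
    return target

if __name__ == "__main__":
    archive_raw_result("ieee_export.csv", "IEEEXplore",
                       '("life cycle") AND ("management")')
```

Placing the archive directory under version control additionally gives the baselines recommended in Sect. 3.1.1.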
3.1.3 Data Collection and Cleaning
Regarding the data collection and cleaning, we consider the following lessons
learned the most important ones.
Find the Right Scope In some studies, we saw an explicit and intentional limitation
of the search; for instance, instead of searching a whole library, authors
of a study limited themselves to particular conferences or journals [47]. Such an
approach promises the advantage of a more focused result set by avoiding
overhead [22]. However, this possibly comes at the price of information
loss, because many relevant publications might not be found. Such a procedure is
of course possible, but not recommended; yet, if conducted that way, it should be
explicitly mentioned in the threats to validity to increase the transparency and reproducibility.
If the ultimate goal is a systematic mapping study, however,
this approach cannot be applied, as the limitation of the search scope hampers the
overall result set quality and also the quality and reliability of the conclusions.
What Publication Types to Include? Besides the used search engines, researchers
need to clarify what types of publications can/cannot be included in the result
set. We consider, for example, including textbooks and edited chapters a viable
option in case the study is about the analysis of definitions, e.g., to understand the
meaning of a particular concept as used by authors in a field. The choice of certain
books can and should be justified based on their popularity in a community, for
example by including well-established textbooks as used for teaching, or books
that have a high number of citations in empirical papers in the area. Master's
theses, in turn, should be avoided given the missing peer-review process. Involving
Ph.D. theses, however, depends on various contextual characteristics; for instance,
whether they passed a peer-review process or whether they are cumulative ones
(which might, of course, lead to duplicates in the result set given that the content
is previously published material, see also Sect. 2.2.1).
How Valid is the Paper Selection Process? In the previous sections, we described
different voting procedures that can be applied. With every voting procedure come
different ways of increasing the validity of the methods applied and the results
obtained. The least common denominator of all procedures, however, is the inter-rater
agreement [22]. We postulate the use of inter-rater agreements especially
if used in a multi-staged voting procedure, as they serve as a constructive quality
assurance measure between the stages; for example, to clarify misconceptions, misinterpretations
of research questions, misinterpretations of classification schemes,
and different understandings of the relevance of publications. Besides the value
of inter-rater agreements for constructive quality assurance, they also increase the
transparency to the reader and, therefore, the conclusion validity. However, such
an agreement only makes sense if the voting procedure is not conducted iteratively
over incomplete result sets, whereby it would be impossible to use the agreement as
a means to improve the classification between stages (unless used in a training/test
phase). Hence, there is a trade-off regarding the purpose and the effort of using
the inter-rater agreement, which needs to be clarified in advance.
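For two reviewers, the inter-rater agreement can, for example, be quantified with Cohen's kappa [6]; the following is a minimal, self-contained sketch with fabricated example votes.

```python
from collections import Counter

# Minimal sketch: Cohen's kappa [6] for two reviewers' include/exclude
# decisions on the same papers. The example votes are fabricated.

def cohen_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: share of papers both reviewers labelled identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement from the reviewers' marginal label distributions.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_expected = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    return 1.0 if p_expected == 1 else (p_observed - p_expected) / (1 - p_expected)

if __name__ == "__main__":
    votes_a = ["in", "in", "out", "out", "in", "out"]
    votes_b = ["in", "out", "out", "out", "in", "in"]
    print(f"kappa = {cohen_kappa(votes_a, votes_b):.2f}")
```

For more than two raters, a multi-rater statistic such as Fleiss' kappa [11] would be the corresponding choice.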
How Much is Enough? As a matter of fact, there is no meaningful metric that
could be used to indicate whether the result set is sufficiently large or not, not least
because the size of a dataset provides no indication of the quality of its content [57,
58,2]. For example, in [18] and [15], we performed the data search, but then capped
the result sets to include only the first 50 hits per query result. Is this enough?
What is the risk of losing relevant papers? As there is no common ground, such
a decision must be taken on a per-study basis. Yet, it needs to be ensured that
the result set of papers obtained is of high quality, i.e., representative of the field
of investigation and the research questions formulated. This means to ensure an
accurate result set and a detailed and validated review protocol, including a search
string potentially adapted to the particularities of the search engines, and detailed
inclusion and exclusion criteria.
3.1.4 Preparing the Handover
Although a study selection might be completed, more activities can be carried out
before entering the in-depth analysis. The final dataset already provides data that
can be used early in the overall literature study process to help researchers find
appropriate points to start the analysis with. From the studies we have conducted so far,
we consider the following lessons learned helpful.
Exporting Keyword Lists, Abstracts, and Word Clouds From the result set, keyword
lists and abstracts can be easily harvested and prepared to support the
beginning of the analysis. We can create, for example, word clouds from these
lists to get a quick visual overview in which a striking keyword could point to a
set of publications to start with. However, what seems easy to generate and use
can eventually turn out to be difficult or even misleading: several tools for word
cloud generation have limitations regarding the amount of text they can process.
A solution is to perform a keyword coding, which serves three purposes (as used
in [28]): first, the list of keywords is shortened; second, the used terminology is
harmonized (e.g., “small-to-medium-sized companies” and “small and medium enterprises”
are coded to “SME”); third, the keyword coding can be considered a
first step towards full coding, which is normally performed in the context of a
mapping study to work out the classification schemas. If a keyword (and/or abstract)
coding is performed, the outcomes of the activities comprise
the respective keyword lists, abstract lists, the mapping files containing the codes
and all synonyms, and optionally generated visuals.
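The following minimal sketch illustrates such a keyword coding; the coding table is an illustrative excerpt, not the actual mapping used in [28].

```python
from collections import Counter

# Minimal sketch: harmonize raw author keywords via a coding table before
# generating word clouds or classification schemas. The mapping is an
# illustrative excerpt, not the actual coding table used in [28].

CODES = {
    "small-to-medium-sized companies": "SME",
    "small and medium enterprises": "SME",
    "software process improvement": "SPI",
    "spi": "SPI",
}

def code_keywords(raw_keywords):
    """Map each keyword to its code; unknown keywords are kept as they are."""
    return [CODES.get(kw.strip().lower(), kw.strip().lower())
            for kw in raw_keywords]

if __name__ == "__main__":
    raw = ["Small and Medium Enterprises", "SPI", "agile methods"]
    print(Counter(code_keywords(raw)))
```

The resulting coded keyword frequencies are both smaller input for word cloud tools and a first draft of the later classification schema.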
Utilizing Social Network Analysis as a Means for Pre-Selection A social network
is a graph that provides an overview of subjects and their relationships (see for instance
[12,49,54]). Right in the early stages, even before the actual study begins,
a social network graph can be generated from the result set. Such a graph can
serve multiple purposes. For instance, a social network graph highlights cooperation
cliques, i.e., authors who collaborate and contribute a considerable share of
the result set, thus forming the “community leaders”. When it eventually comes
to beginning the result set analysis, researchers can face the problem of finding a
proper starting point. Potentially identified clusters can provide some guidance
through the result set. Another option is to look for domain-shaping key contributions,
which are potentially highlighted by a citation network11. Beyond the
analysis preparation, a social network is also a supportive means within a study.
For example, in [27], we used a collaboration network to study whether an observed trend
in the publication space is just an effect of the result set's background noise. To this end,
we generated the social network to identify the key contributors, created
a sub-map based on the respective publications only, and compared
whether the general trends differed.
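As an illustration, the following sketch builds a co-authorship graph with the third-party networkx package and lists the best-connected authors as candidate community leaders; the paper data is fabricated, and networkx is only one possible tool choice.

```python
from itertools import combinations
import networkx as nx  # third-party package: pip install networkx

# Minimal sketch: build a co-authorship graph from the result set and list the
# best-connected authors as candidate "community leaders". The papers below
# are fabricated placeholders.

papers = [
    {"title": "Paper A", "authors": ["Smith", "Lee"]},
    {"title": "Paper B", "authors": ["Smith", "Kumar", "Lee"]},
    {"title": "Paper C", "authors": ["Garcia"]},
]

G = nx.Graph()
for paper in papers:
    G.add_nodes_from(paper["authors"])
    for a, b in combinations(paper["authors"], 2):
        G.add_edge(a, b)  # co-authorship relation

# Authors with the highest degree collaborate most within the result set.
leaders = sorted(G.degree, key=lambda item: item[1], reverse=True)[:3]
print(leaders)
```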
4 Related Work and Discussion
This article complements a number of existing guidelines and initiatives for conducting
literature studies. In this section, we provide an overview of related work
including approaches, methods, experiences, and tools to support literature studies,
and we position our contribution in the context of the current publication landscape.
Table 7 summarizes the body of knowledge in existing guidelines we found particularly
relevant and adds how our contribution at hand deviates (i.e., adapts/extends)
from existing contributions.
Approaches We deliberately use the term “approaches” to subsume all the different
processes and methods utilized in literature studies. One prominent approach
in the context of literature studies is the systematic review process as initiated for software
engineering by Kitchenham [19] and continuously improved, e.g., by Kitchenham
and Charters [21], eventually leading to a consolidated guideline [22], as well as
the systematic mapping study made popular for software engineering by Petersen
et al. [40] (updated in [41]).
These general guidelines, despite their value in providing a common
structure and consistent terminology, have been experienced as too generic for
direct practical application [41,51]. They still serve as an umbrella, and a multitude
of fine-grained methods and models, advice, and best practices can be
11 This approach needs to be considered with care, as, for instance, newer publications may
have a high-quality contribution but not a high citation count (e.g., compared to a
10-year-old publication). Therefore, citation networks only deliver an initial indication, and trends
shouldn't be taken for granted.
Table 7 Relation of the present guideline with further established guidelines.

Ref. | Key Contributions | Adaptation/Extension

[22] | Kitchenham et al. provide a well-elaborated overview of the systematic literature study processes. To this end, they introduce a conceptual description of what to do in a systematic review or in a systematic mapping study, and an explanation of why these steps should be carried out. The aim is to provide a generalized view on what to do, while concrete advice on how to operationalize the respective steps in a specific context is out of scope. | Our guideline emphasizes the operationalization of the particular steps in the data collection and study selection phase, and it provides examples and a critical discussion of the lessons we learned. Furthermore, our guideline describes activities as building blocks and offers exemplary workflow templates for literature studies of different complexity and size.

[41] | Petersen et al. propose a guideline, which extends their original one [40], grounded in evidence obtained from analyzing 52 mapping studies and comparing the guidelines used therein. The guideline provides a checklist of activities and refers to articles that used those to select data for the study. It further proposes a more detailed classification schema (compared to [40,55]) and comprises small examples for illustration. | Our guideline has a different scope compared to [41], as we focus on the relatively unexplored early stages only. That is, our guideline focuses on the data collection and study selection process, whereas we pay little attention to the data extraction and analysis, which we believe to be already well elaborated. Yet, our guideline provides a more detailed perspective, e.g., on the different practices and how to combine them, and on how to utilize techniques such as word clouds or social networks to aid the selection process (both not mentioned in [41]). Therefore, our guideline is a pragmatic complementation of the study identification phase from [41].

[58] | Zhang et al. describe a “quasi-gold standard” to find an effective study selection strategy. Among other things, Zhang et al. define a search process to achieve high sensitivity and precision of the searches. | Similar to Zhang et al., our guideline recommends utilizing different search engines. Yet, our guideline provides more details regarding actual practices to analyze and clean the result (sub-)sets obtained from different search runs, and we also provide recommendations to develop an integrated result set to be evaluated in the actual study selection process. Therefore, our guideline complements [58] and provides recommendations to fill gaps, such as missing information concerning the steps required to get from step 4 (conduct automated search) to step 5 (evaluate search performance).

[55] | The work by Wieringa et al. has become representative for developing classification schemas based on a well-elaborated reference (see also [40,37] or [41]). | In the present guideline, we explicitly do not aim to support schema development. However, when providing a data structure template, we leave room for classification schemas. Furthermore, grounded in our experience, we also propose considering free metadata to be collected, since we found strict classification schemas not well-applicable in all setups.

[16] | The rigor-relevance model by Ivarsson and Gorschek provides a scale-based approach to determine the relevance to industry and the rigorousness of the research conducted. Hence, this model can support the paper selection process. | In our guideline, we utilize the rigor-relevance model exactly as proposed, as an explicit extra dimension to support the classification, because we experienced it to be of particularly high value. We therefore recommend using a combination of “standard schemas” (e.g., [55,16,40,37]) complemented with study-specific schemas, e.g., those developed from free metadata.
embodied by the guidelines. For example, a challenge in literature studies is the
development of proper classification schemas. In the literature, we find, for instance,
the research type facet classification schema developed by Wieringa et al. [55] and
the contribution type facet schema as illustrated by Petersen et al. [40] (adopted
from Shaw's work [50]), serving as generic classification patterns for studies [41].
Another perspective is provided by Paternoster et al. [37], who utilize a focus type
facet and a pertinence facet. Furthermore, Paternoster et al. [37] include a model
for determining the rigor and relevance of the involved studies (based on a model proposed
by Ivarsson and Gorschek [16]) to support the determination of the result
set's reliability. However, Petersen et al. [41] regard those classification schemas
as critical. The reason is that such schemas, as the one by Wieringa et al. [55], leave
room for interpretation. As a matter of fact, we can find “tailored” variants of
this schema in a number of studies (see also Wohlin et al. [57]). It also remains a
challenge to construct a schema in a proper and efficient manner, and a number
of strategies are available for this purpose [39]. For instance, in our study [28],
we used the focus type facet concept, finding the described construction procedure
from [37] inappropriate for the following reasons: if one has to deal with a very
large number of papers, a manual coding-based schema construction is too costly.
Moreover, it is challenging to clearly define the elements of such a schema, as
indicated by Portillo-Rodríguez et al. [42]. This is because not all papers have
sufficient information in title, keywords, and abstract to conduct a proper and
fine-grained classification [4], and if the purpose of the study is to capture an entire
domain, developing a precise classification is close to impossible, as many publications
address multiple topics, which makes a unique classification hard or even
impossible. Therefore, in previous work [28], we started collecting “free” metadata
instead, providing a big picture of the domain but leaving the full classification
to the fine-grained analyses of selected topics. As an outcome, in [23,24] we used the
metadata to generate heat maps (as also done in [38]) to work out trends worth
further investigation.
Constructing a classification schema requires data to which the schema can be applied. In this
respect, Petersen et al. [41] found 15 ways to collect and identify relevant studies.
Data search is mainly done using manual and database searches, and snowballing.
Yet, it is currently subject to discussion which of the practices (or combinations
thereof) result in datasets of sufficient quality and what is considered a sufficient
dataset after all [57]. Ali and Petersen [1] review strategies to select studies in
systematic reviews and formulate a selection process. They conclude that a good-enough
sample could be obtained by following a less inclusive but more efficient
strategy. Zhang et al. [58] present a “quasi-gold standard” to identify relevant
studies, and Badampudi et al. [2] show that snowballing also leads to an appropriate
result set. That is, all the different search strategies used so far produce sufficient
datasets. Up to now, however, little has been reported on the complementary
use of the different search strategies and the costs and benefits associated with such a
combination. In the present article, similar to Dybå et al. [8], we stress this aspect
by presenting the combined use, and we also demonstrate how a search can be
complemented by further techniques, such as social network analysis [49,54] or
word clouds [28,24], to support pre-selection, analysis scoping, and dataset/result
visualization.
The search and selection procedures also include the definition and use of inclusion
and exclusion criteria. However, Petersen et al. [41] found only five out
of 10 guidelines explicitly addressing this topic, and there has so far been no attempt
to craft a set of standard in-/exclusion criteria. Similarly to standard research
questions, standard data collection workflows, and standard study selection procedures,
we have proposed a set of standard inclusion and exclusion criteria to
support a quick start of the study and to lay the foundation for the development
of further study-specific criteria.
Experiences Regarding the (generic) guidelines used by empirical software engineering
researchers, Petersen et al. [41] found and compared in total 10 guidelines,
whereas the (more general) ones by Kitchenham and Charters [21] and Petersen
et al. [40] were identified as the most frequently used. Furthermore, their
findings include identified gaps in the individual guidelines, such as missing practical
advice on how to do self-evaluation, on justifying and motivating the research
questions chosen with regard to the demographic overview of a planned study, or missing
shared practices from personal accounts of designing systematic reviews and
mapping studies by following specific guidelines. Petersen et al. [41] add to a series
of meta-studies that aim to monitor the guidelines' application and to collect
lessons learned and best practices, which is a required step to consolidate experience. For
instance, Kitchenham and Brereton [20] analyzed 68 studies and found that the
time required to conduct a systematic review and difficulties regarding quality assessment
are problematic. This finding provides extra arguments for sophisticated
tool support. In their study, the authors also found current digital libraries not appropriate
for broad literature searches. This is also supported by Brereton et al. [4],
who specifically found the indexing of those digital libraries inadequate and also
mention that the quality of paper abstracts is too poor, e.g., to judge the
relevance of a paper based on its abstract only. This provides a rationale for different
search and selection strategies [1,2,58]. A more general discussion is raised
by Staples and Niazi [51], who generally recommend using guidelines, but also
mention a need to optimize the process as such (e.g., narrowly defined research
questions, improved selection procedures, and improved data extraction) to reduce
the effort needed to conduct such a study. However, exemplary research questions
to start a literature study are only provided by Petersen et al. [41] as part of the
analysis of other studies, thus being focused on the respective study subjects;
the presented list of quoted research questions does not lend itself to generalization.
Dybå et al. [8] consider “normal” meta-analytic approaches to be of limited use
for software engineering and, hence, report their experience from applying
diverse study types in a systematic review; a mixed-method approach similar to
the practices reported in the present article. Riaz et al. [45] provide a different
perspective in their report and mention that experts and novices have a different
perception of the systematic review process and its challenges. The present article
also addresses this point by providing examples, reusable assets like research questions
or in-/exclusion criteria, and a detailed elaboration on selected practices and
a demonstration of their use. Such challenges are also addressed by Fabbri et al.
[10], who provide an experience-based guideline that comes as an integrated process
with the purpose of externalizing tacit knowledge about the process and its implementation.
In contrast to Fabbri et al. [10], the present article is not supposed
to be a self-contained comprehensive guideline covering the process of conducting
a literature study as a whole. Instead, we focus on the early stages and provide
a limited, but interlinked set of practices illustrated by examples and reusable
building blocks, which we also compile into reference workflows to follow.
Tools The body of knowledge in software engineering is growing and, thus, lit-
erature studies are likely to grow in size and complexity as well. Tool support
has therefore become crucial to collect, manage, and evaluate data. However, the
question of what can be considered proper tool support has puzzled researchers
for years [13,52]. A group around Marshall conducted research on tool support for
literature studies [30,31,32,33]. Among other things, they provided a feature analysis to
define basic requirements [32], and in [5], the authors found a strong need to provide
support for planning and teamwork when conducting a literature study. In [33],
the same author group compiled a recommended list of requirements, which was
generated based on 13 semi-structured interviews. Yet, the requirements list only
provides a high-level overview of features that opens a fairly large design space, which
should be carefully considered when designing tools. The challenges coming along
with this large design space were explicitly addressed in [52], in which we, based
on a shared set of requirements, independently developed two tools; both realizations
emphasize different features and implement different work and
collaboration patterns. Over the years, few tools dedicated to supporting researchers
in performing systematic reviews have been proposed; notable examples are SLuRp
[3], SESRA [35], and StArt [9]. These tools were analyzed in [32], yet none of them
passed with flying colors. Still, the classic spreadsheet application, quite often in
combination with so-called reference managers (e.g., EndNote, Mendeley, Papers,
and Zotero), seems to be the standard tooling for literature studies.
Summary of Related Work The present article contributes to the body of knowledge
by stressing the need for more concrete advice to complement the generic
guidelines, and by offering an experience-based guideline especially for performing the
steps in the early stages. Although, for instance, Petersen et al. [41] provide a comprehensive
selection of practices used for these stages, a streamlined approach to
presenting, explaining, and linking these steps to each other is not in the scope of their
contribution. In a nutshell, most of the available guidelines focus on what
a design should accomplish rather than on how and why a particular step should
be executed in a cost-effective way. For example, we found no guideline explaining
what pieces of information are worth including and what justifies particular
configurations of descriptive data to be taken care of by the researchers.
Our recommended minimal data structure (Table 8) can be directly used by researchers
facing this question. Furthermore, no guideline so far has discussed in detail
the ways to run a voting procedure. We provided an operationalized description
of how to do this in a systematic way, along with a discussion of a research team
model, a scaling/vote calculation schema, and a demonstration of a potential technical
realization based on the suggested minimal data structure (Figure 10). Based
on our reported experience, we also provide a description of the work deliverables
that are produced during a literature study process and the dependencies among
the deliverables (Sect. 2.4), and we shared our lessons learned regarding the issues
coming along with handling search engines, which are barely discussed in available
guidelines.
5 Conclusion
Systematic literature studies have become a powerful means to elaborate and struc-
ture the state of reported knowledge. Especially in the software engineering com-
munity, they have received much attention in recent years. Despite their relevance
to the community and first valuable proposals of guidelines, they are difficult to
conduct, require a lot of effort, and depend on the experience and expertise of the researchers
involved. Especially the latter often decides over the success of a study,
depending on aspects such as
– Appropriateness of the research questions and value to the community,
– Accuracy of the design, or
– Reproducibility of the data collection.
When conducting literature studies, there are various challenges, which concern
the initial stages of the data collection rather than the particularities of the later
analytical phase, and there are challenges that concern the organization of such a
study.
In this article, we reported on the experiences we made in the course of various
literature studies and contributed an experience-based guideline that puts strong
emphasis on tackling some practical challenges. Our aim was to specifically support
young scholars facing their first literature study and to provide them with a
pragmatic and easy-to-enact guideline. To this end, we collected and structured
our experiences, and we also shared our experiences in utilizing different tools to
support the data collection, the dataset cleaning, or the study selection procedures.
Furthermore, we provided some generalized blueprint-style workflows to follow in a
particular study, which also increase the efficiency with which study designs can be reported
within the space limitations of conference submissions, so that the
approaches used do not have to be justified from scratch every time, drowning out the
presentation of the results.
While compiling this guideline, we also realized again the need for fine-grained
guidelines and, moreover, the need for sophisticated tool support. As a matter
of fact, all our studies were conducted utilizing fairly simple tools, such as spreadsheets
or plain text files to feed further external tools, e.g., word cloud generators.
However, having conducted the data collection, research teams have rich data available,
which could be used for extensive tool support. Yet, comprehensive tools are
not yet publicly available or, if at all, are in the early stages of their development, as
for instance [52]. This indicates a strong need to (1) increase the effort spent
on developing applicable procedures and fine-grained reference workflows from the
available knowledge and experience, and (2) put effort into the development
of tools to support literature studies. These tools need to support the collection
of data, their storage and organization, the management of in-/exclusion criteria,
the implementation of workflows for paper selection and classification (including
the management of classification schemas), and, eventually, the connection
to further tools, e.g., word cloud generators, statistics software,
and social network analysis tools.
Acknowledgements
We want to thank Roel Wieringa for fruitful discussions on previous versions of
this article and all our students, especially those who contributed to our previously
conducted literature studies over the past years. Finally, we are grateful for the
constructive feedback provided by the anonymous reviewers of this article, who
helped to improve it substantially.
A Study Workflow Templates
In this appendix, we provide selected workflow templates, which we inferred from our experiences
(Table 5), for simple reuse in the research method descriptions of scientific papers. The provided
templates can be used to inspire or shorten the description of research methods, which,
especially in conference papers, consumes much precious space. For each model in the subsequent
sections, we provide a brief context description, an exemplary workflow, and a textual
description.
A.1 Template 1: 2 Researcher Workshop Model with Snowballing
Context: This model addresses smaller literature studies in which just two researchers collaborate,
thus having no option to implement more comprehensive study selection procedures,
such as majority votes. Our experience shows this model to be well-applicable in settings with
up to approximately 50 papers, two senior or one senior and one junior researcher, and in
distributed settings. Apart from an initial research objective and/or a set of research questions
and a (small) set of reference publications, no extra entry conditions need to be fulfilled.
Workflow: Figure 8 illustrates the basic workflow for this model, including some notes emphasizing
the most relevant points to be considered.
Fig. 8 Exemplary workflow for the 2 researcher workshop model with a snowballing-based
preliminary study.
Workflow Description: The 2 Researcher Workshop Model with Snowballing is implemented
as follows: Right at the beginning of the study, a snowballing-based preliminary study is conducted.
For this pre-study, a set of reference papers is selected to lay the foundation for an
(incremental) snowballing search. When the snowballing is done, the obtained papers are analyzed
for keywords, which are used to construct the search queries for an automated database
search. As the last preparation steps, the data sources of interest are selected and the inclusion
and exclusion criteria are defined.
The data collection is performed (according to the search strategy, Sect. 2.2.1). After
the search, the dataset is cleaned (Sect. 2.2.2), e.g., by a stepwise integration of individual
datasets. The kick-off meeting, on the one hand, closes the data collection and cleaning
phase and, on the other hand, starts the study selection phase. In the kick-off meeting, both
researchers reflect on all the criteria, inspect and prepare the dataset for the rating, and agree
on a schedule. According to the procedure illustrated in Figure 5, each researcher gets a copy
of the dataset and carries out the individual rating. When the rating is done, both datasets
are integrated and checked for consensus. In a rating workshop (or multiple workshops), both
researchers iterate through the dataset, discussing all items that are not yet decided, to find an
agreement. When the concluding integration is done, the study selection phase is closed and
the result set is transferred to the main study (Sect. 2.4). For handing over the result set, a
copy of the fully rated result set is created for archiving, and the actual result set is reduced,
i.e., those dataset items that were rated as irrelevant for the main study are removed from the
dataset so that only relevant data finds its way into the analysis.
A.2 Template 2: 3 Researcher Voting-only Model
Context: This model addresses literature studies in which three researchers collaborate and
implement a voting-based study selection procedure. Our experience shows this model to be
well-applicable in the majority of all literature study settings. This model supports mixed and
distributed teams, where at least one senior researcher has to be involved to guide the study
project. Our standard implementation of the 3 Researcher Voting-only Model follows the 2+1
approach (Figure 5, p. 15), i.e., the voting procedure to select relevant papers is organized by
two researchers carrying out the full voting independently and calling in a third researcher to
make the final decisions. In order to set up a study following this model, research objectives
and questions, keyword lists, and accordingly derived search queries have to be in place; optionally,
a (small) set of reference publications is available.
Workflow: Figure 9 illustrates the basic workflow for this model, including some notes emphasizing
the most relevant points to be considered.
Workflow Description: The 3 Researcher Voting-only Model is implemented as follows:
After defining the search queries, the data sources of interest, and the required inclusion and
exclusion criteria, the actual data collection is performed (Sect. 2.2.1). After the data collection, the
data sets are cleaned (Sect. 2.2.2), e.g., via a stepwise integration of individual datasets.
In the kick-off meeting, the team of researchers nominates the two researchers who will conduct
the initial rating. According to the procedure illustrated in Figure 5, each of the two
selected researchers gets a copy of the integrated dataset for carrying out the individual rating.
When both researchers have rated the dataset, one of them integrates both ratings and analyzes
the integrated result set for agreement. Those dataset items that are not yet decided are
selected and exported into a reduced dataset, which is given to the third reviewer. The third
reviewer then performs a rating on the reduced dataset and, eventually, integrates the outcome
with the full dataset. After this third rating, the dataset is fully decided and
can be prepared to be transferred to the main analysis (Sect. 2.4). If using a tool-supported
approach as, for instance, shown in Figure 10, the different stages can be supported by simple
calculations, scripts, and conditional formatting (color coding).
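The hand-off between the integrated two-reviewer dataset and the third reviewer can also be scripted instead of (or in addition to) using conditional formatting. The following is a minimal sketch of this 2+1 step, assuming a CSV export with hypothetical column names.

```python
import csv

# Minimal sketch of the 2+1 approach: after integrating the ratings of
# reviewers 1 and 2, papers without consensus are exported for reviewer 3,
# whose decision finalizes the dataset. The column names ("no", "vote_r1",
# "vote_r2", "vote_r3") are hypothetical.

def split_by_consensus(integrated_file):
    """Separate papers decided by reviewers 1 and 2 from undecided ones."""
    decided, undecided = [], []
    with open(integrated_file, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            (decided if row["vote_r1"] == row["vote_r2"] else undecided).append(row)
    return decided, undecided

def merge_third_vote(decided, undecided, third_votes):
    """third_votes: {paper_no: final vote of reviewer 3} for undecided papers."""
    for row in undecided:
        row["vote_r3"] = third_votes[row["no"]]
    return decided + undecided

if __name__ == "__main__":
    decided, undecided = split_by_consensus("integrated_votes.csv")
    print(f"{len(undecided)} papers go to reviewer 3")
```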
B Recommended Data Structure
In this section, we present a recommendation of a data structure to store data obtained by a
manual/automatic literature search. Table 8 presents this recommended data structure, which
emerges from several literature studies (Table 5), and the table explains the meaning of the
different fields.
Fig. 9 Exemplary workflow for a data collection and study selection approach for 3 reviewers
using a voting-only approach.
Note: We consider the presented data structure to be minimal, i.e., specific
studies will require further data fields. However, due to the absence of comprehensive and
mature tools to support mapping studies, the usual approach is to set up a simple spreadsheet.
Examples of such spreadsheets (Figure 10) can be obtained from http://goo.gl/PBylsn.
Fig. 10 Example of a color-coded voting spreadsheet. The sheet shows different combinations
of a 3-person majority vote (2 reviewers + 1 extra reviewer for final decisions).
The data structure as presented in Table 8 only contains a minimal set of data, which
needs to be extended according to the study's scope. For systematic mapping studies, the
following extra data should be contained:
– Generic/reused classification schemas, such as the research/contribution type facets (Wieringa et al. [55], Petersen et al. [40])
– Study-specific classification schemas, such as focus type facets (Paternoster et al. [37]) or rigor/relevance models (Ivarsson and Gorschek [16])
– In-/exclusion criteria to document why a paper was in-/excluded (cf. Table 2)
Furthermore, grounded in our experience from [28], we also recommend adding “dynamic
metadata” to the data structure (as already mentioned in Table 8). Such metadata can be
Table 8 Recommended minimal data structure.

Field | Cardinality | Description
No. | 1 | The overall publication number in the integrated dataset.
DB-No. | 1 | The database-specific number of a paper from the individual literature database to allow for linking an entry to the originating dataset.
Title | 1 | Title of the publication.
Authors | 1, 1..n | Authors of the publication; either integrated in one cell and separated by special characters (e.g., “;”), or converted into a one-author-per-cell pattern, i.e., there are n columns to represent the author list.
Keywords | 1 | List of keywords separated by special characters (e.g., “,” or “;”).
Abstract | 1 | Abstract of the paper.
Year | 1 | Year of publication (note: e.g., for journals, there might be multiple dates, such as accepted, online available, preprint, published, etc.; it is required to define which of these is the one that makes it into the dataset).
Publisher/Database | 1 | Which database created this item? In case of cross-indexing, publisher and originating database can differ, e.g., IEEE Xplore also lists IET papers.
Source/Venue | 1 | Which source or venue published this paper? In case of a conference, this field should contain the conference name and/or acronym; in case of a journal, the name/acronym of the journal should be contained, and so forth.
Publication Vehicle | 1..n | For every publication vehicle, an individual column should be present, e.g., journal, magazine, conference, workshop, book, chapter, misc, and so forth. Experience shows individual columns to be beneficial for later analyses.
General Comments | 1 | Provide some space for general comments.
Metadata Classes (optional) | 0..n | It was shown beneficial to provide some space for metadata, for example, this is a survey, a literature review, this deals with Agile, and so forth. The number of metadata classes is not limited and can be extended during analysis. Furthermore, metadata should allow for categorization, that is, one column per metadata class should be provided.
added on-the-fly and can support the enhancement of the dataset. From our experience [23],
we recommend collecting metadata at least for the dimensions Study and Context.
The dimension Study covers the overall research approach followed in a particular paper,
e.g., whether a particular paper is a primary study, a replication, or even a secondary study, and it
can even contain the research methods used, such as interview research or grounded theory
analyses. Metadata from this category supports a more detailed classification and analysis
of papers regarding the research and contribution type facets. The dimension Context aims
at collecting as much context information from the selected papers as possible, such as the
software engineering lifecycle phase addressed by a paper (e.g., design, coding, test), the organizational
context in which the research was conducted (e.g., SMEs, global players, etc.), and
the application domain of a paper (e.g., automotive software or software for the healthcare
domain).
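If the spreadsheet is later processed programmatically, the minimal data structure from Table 8 can be mirrored as a typed record, for instance as in the following sketch; the field spellings and the example values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Minimal sketch: the data structure from Table 8 as a typed record. Field
# spellings and the example values are assumptions for illustration.

@dataclass
class PaperRecord:
    no: int                      # overall number in the integrated dataset
    db_no: str                   # database-specific identifier
    title: str
    authors: List[str]
    keywords: List[str]
    abstract: str
    year: int                    # define which date counts (e.g., published)
    publisher_database: str      # originating database (beware cross-indexing)
    source_venue: str            # conference/journal name or acronym
    publication_vehicle: str     # journal, conference, workshop, ...
    general_comments: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)  # dynamic metadata

record = PaperRecord(
    no=1, db_no="IEEE-123456", title="An Example Paper",
    authors=["Smith, J.", "Lee, K."], keywords=["SPI"], abstract="...",
    year=2015, publisher_database="IEEE Xplore", source_venue="ICSSP",
    publication_vehicle="conference",
    metadata={"study": "secondary study", "context": "SME"},
)
```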
References
1. Ali, N.B., Petersen, K.: Evaluating strategies for study selection in systematic literature
studies. In: Proceedings of the International Symposium on Empirical Software Engi-
neering and Measurement, ESEM, pp. 45:1–45:4. ACM, New York, NY, USA (2014).
DOI 10.1145/2652524.2652557
2. Badampudi, D., Wohlin, C., Petersen, K.: Experiences from using snowballing and
database searches in systematic literature studies. In: Proceedings of the International
Conference on Evaluation and Assessment in Software Engineering, EASE, pp. 17:1–17:10.
ACM, New York, NY, USA (2015). DOI 10.1145/2745802.2745818
3. Bowes, D., Hall, T., Beecham, S.: SLuRp: a tool to help large complex systematic literature
reviews deliver valid and rigorous results. In: Proceedings of the International Workshop
on Evidential Assessment of Software Technologies, pp. 33–36. ACM, New York, NY, USA
(2012)
4. Brereton, P., Kitchenham, B.A., Budgen, D., Turner, M., Khalil, M.: Lessons from applying
the systematic literature review process within the software engineering domain. Journal
of Systems and Software 80(4), 571–583 (2007). DOI 10.1016/j.jss.2006.07.009
5. Carver, J.C., Hassler, E., Hernandes, E., Kraft, N.A.: Identifying barriers to the systematic
literature review process. In: Proceedings of the International Symposium on Empirical
Software Engineering and Measurement, ESEM, pp. 203–212. IEEE, Washington, DC,
USA (2013). DOI 10.1109/ESEM.2013.28
6. Cohen, J.: Weighted kappa: Nominal scale agreement provision for scaled disagreement or
partial credit. Psychological Bulletin 70(4), 213–220 (1968)
7. Condori-Fernandez, N., Daneva, M., Sikkel, K., Wieringa, R., Dieste, O., Pastor, O.: A
systematic mapping study on empirical evaluation of software requirements specifications
techniques. In: Proceedings of the International Symposium on Empirical Software En-
gineering and Measurement, ESEM, pp. 502–505. IEEE, Washington, DC, USA (2009).
DOI 10.1109/ESEM.2009.5314232
8. Dybå, T., Dingsøyr, T., Hanssen, G.K.: Applying systematic reviews to diverse study
types: An experience report. In: Proceedings of the International Symposium on Empirical
Software Engineering and Measurement, ESEM, pp. 225–234. IEEE, Washington, DC,
USA (2007). DOI 10.1109/ESEM.2007.21
9. Fabbri, S., Silva, C., Hernandes, E., Octaviano, F., Di Thommazo, A., Belgamo, A.:
Improvements in the start tool to better support the systematic review process. In:
Proceedings of the International Conference on Evaluation and Assessment in Soft-
ware Engineering, EASE, pp. 21:1–21:5. ACM, New York, NY, USA (2016). DOI
10.1145/2915970.2916013
10. Fabbri, S.C.P.F., Felizardo, K.R., Ferrari, F.C., Hernandes, E.C.M., Octaviano, F.R., Nak-
agawa, E.Y., Maldonado, J.C.: Externalising tacit knowledge of the systematic review
process. IET Software 7(6), 298–307 (2013). DOI 10.1049/iet-sen.2013.0029
11. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychological Bulletin
76(5), 378–382 (1971)
12. Hanneman, A., Riddle, M.: Introduction to social network methods. Online:
http://faculty.ucr.edu/~hanneman/ (2005)
13. Hassler, E., Carver, J.C., Hale, D., Al-Zubidy, A.: Identification of slr tool needs – results
of a community workshop. Information and Software Technology 70, 122–129 (2016).
DOI http://dx.doi.org/10.1016/j.infsof.2015.10.011
14. Inayat, I., Salim, S.S., Marczak, S., Daneva, M., Shamshirband, S.: A systematic literature
review on agile requirements engineering practices and challenges. Computers in Human
Behavior 51, Part B, 915–929 (2015). DOI http://dx.doi.org/10.1016/j.chb.2014.10.046
15. Ingibergsson, J., Schultz, U., Kuhrmann, M.: On the use of safety certification practices in
autonomous field robot software development: A systematic mapping study. In: Proceed-
ings of the International Conference on Product Focused Software Development and Pro-
cess Improvement, Lecture Notes in Computer Science, vol. 9459, pp. 335–352. Springer
Berlin Heidelberg (2015)
16. Ivarsson, M., Gorschek, T.: A method for evaluating rigor and industrial relevance of
technology evaluations. Empirical Software Engineering 16(3), 365–395 (2011). DOI
10.1007/s10664-010-9146-4
17. Jacobson, J.W., Kuhrmann, M., Münch, J., Diebold, P., Felderer, M.: On the role of
software quality management in software process improvement. In: Proceedings of the In-
ternational Conference on Product-Focused Software Process Improvement, Lecture Notes
in Computer Science, vol. 10027, pp. 327–343. Springer, Berlin, Heidelberg (2016)
18. Kalus, G., Kuhrmann, M.: Criteria for software process tailoring: A systematic review. In:
Proceedings of the International Conference on Software and System Process, ICSSP, pp.
171–180. ACM Press, New York, NY, USA (2013)
19. Kitchenham, B.: Procedures for Performing Systematic Reviews. Tech. Rep. TR/SE-0401,
Keele University (2004)
20. Kitchenham, B., Brereton, P.: A systematic review of systematic review process research
in software engineering. Information and Software Technology 55(12), 2049–2075 (2013).
DOI 10.1016/j.infsof.2013.07.010
21. Kitchenham, B., Charters, S.: Guidelines for performing systematic literature reviews in
software engineering. Tech. Rep. EBSE-2007-01, Keele University (2007)
22. Kitchenham, B.A., Budgen, D., Brereton, P.: Evidence-Based Software Engineering and
Systematic Reviews. CRC Press (2015)
23. Kuhrmann, M., Diebold, P., Münch, J.: Software process improvement: A systematic map-
ping study on the state of the art. PeerJ Computer Science 2(e62) (2016)
24. Kuhrmann, M., Diebold, P., Münch, J., Tell, P.: How does software process improvement
address global software engineering? In: International Conference on Global Software
Engineering, ICGSE, pp. 89–98. IEEE, Washington, DC, USA (2016)
25. Kuhrmann, M., Fernández, D.M., Gröber, M.: Towards artifact models as process inter-
faces in distributed software projects. In: Proceedings of the International Conference on
Global Software Engineering, ICGSE, pp. 11–20. IEEE, Washington, DC, USA (2013)
26. Kuhrmann, M., Fernández, D.M., Steenweg, R.: Systematic software process development:
Where do we stand today? In: Proceedings of the International Conference on Software
and System Process, ICSSP, pp. 166–170. ACM Press, New York, NY, USA (2013)
27. Kuhrmann, M., Fernández, D.M., Tiessler, M.: A mapping study on the feasibility of
method engineering. Journal of Software: Evolution and Process 26(12), 1053–1073 (2014)
28. Kuhrmann, M., Konopka, C., Nellemann, P., Diebold, P., Münch, J.: Software process
improvement: Where is the evidence? In: Proceedings of the International Conference on
Software and Systems Process, ICSSP, pp. 107–116. ACM, New York, NY, USA (2015)
29. Kuo, B.Y.L., Hentrich, T., Good, B.M., Wilkinson, M.D.: Tag clouds for summarizing
web search results. In: Proceedings of the International Conference on World Wide Web,
WWW, pp. 1203–1204. ACM, New York, NY, USA (2007). DOI 10.1145/1242572.1242766
30. Marshall, C., Brereton, P.: Tools to support systematic literature reviews in software engi-
neering: A mapping study. In: Proceedings of the International Symposium on Empirical
Software Engineering and Measurement, ESEM, pp. 296–299. IEEE, Washington, DC,
USA (2013). DOI 10.1109/ESEM.2013.32
31. Marshall, C., Brereton, P.: Systematic review toolbox: A catalogue of tools to support
systematic reviews. In: Proceedings of the International Conference on Evaluation and
Assessment in Software Engineering, EASE, pp. 23:1–23:6. ACM, New York, NY, USA
(2015)
32. Marshall, C., Brereton, P., Kitchenham, B.: Tools to support systematic reviews in soft-
ware engineering: A feature analysis. In: Proceedings of the International Conference on
Evaluation and Assessment in Software Engineering, EASE, pp. 13:1–13:10. ACM, New
York, NY, USA (2014)
33. Marshall, C., Brereton, P., Kitchenham, B.: Tools to support systematic reviews in software
engineering: A cross-domain survey using semi-structured interviews. In: Proceedings
of the International Conference on Evaluation and Assessment in Software Engineering,
EASE, pp. 26:1–26:6. ACM, New York, NY, USA (2015)
34. Méndez Fernández, D., Ognawala, S., Wagner, S., Daneva, M.: Where do we stand in
requirements engineering improvement today? First results from a mapping study. In:
Proceedings of the International Symposium on Empirical Software Engineering and Mea-
surement, ESEM, pp. 58:1–58:4. ACM, New York, NY, USA (2014)
35. Molléri, J.S., Benitti, F.B.V.: SESRA: A web-based automated tool to support the sys-
tematic literature review process. In: Proceedings of the International Conference on
Evaluation and Assessment in Software Engineering, EASE, pp. 24:1–24:6. ACM, New
York, NY, USA (2015)
36. Oosterman, J., Cockburn, A.: An empirical comparison of tag clouds and tables. In:
Proceedings of the Conference of the Computer-Human Interaction Special Interest Group
of Australia on Computer-Human Interaction, OZCHI, pp. 288–295. ACM, New York, NY,
USA (2010). DOI 10.1145/1952222.1952284
37. Paternoster, N., Giardino, C., Unterkalmsteiner, M., Gorschek, T., Abrahamsson, P.: Soft-
ware development in startup companies: A systematic mapping study. Information and
Software Technology 56(10), 1200–1218 (2014). DOI 10.1016/j.infsof.2014.04.014
38. Penzenstadler, B., Raturi, A., Richardson, D., Calero, C., Femmer, H., Franch, X.: Sys-
tematic mapping study on software engineering for sustainability (SE4S). In: Proceedings
of the International Conference on Evaluation and Assessment in Software Engineering,
EASE, pp. 14:1–14:14. ACM, New York, NY, USA (2014). DOI 10.1145/2601248.2601256
39. Petersen, K., Ali, N.B.: Identifying strategies for study selection in systematic reviews and
maps. In: Proceedings of the International Symposium on Empirical Software Engineering
and Measurement, ESEM, pp. 351–354. IEEE, Washington, DC, USA (2011). DOI 10.
1109/ESEM.2011.46
40. Petersen, K., Feldt, R., Mujtaba, S., Mattsson, M.: Systematic mapping studies in software
engineering. In: Proceedings of the International Conference on Evaluation and Assessment
in Software Engineering, EASE, pp. 68–77. ACM, New York, NY, USA (2008)
41. Petersen, K., Vakkalanka, S., Kuzniarz, L.: Guidelines for conducting systematic mapping
studies in software engineering: An update. Information and Software Technology 64,
1–18 (2015)
42. Portillo-Rodríguez, J., Vizcaíno, A., Piattini, M., Beecham, S.: Tools used in global soft-
ware engineering: A systematic mapping review. Information and Software Technology
54(7), 663–685 (2012). DOI 10.1016/j.infsof.2012.02.006
43. Racheva, Z., Daneva, M., Sikkel, K.: Value creation by agile projects: Methodology or
mystery? In: Product-Focused Software Process Improvement, Lecture Notes in Business
Information Processing, vol. 32, pp. 141–155. Springer Berlin Heidelberg (2009). DOI
10.1007/978-3-642-02152-7_12
44. Ramage, D., Dumais, S., Liebling, D.: Characterizing microblogs with topic models. In:
Proceedings of the International AAAI Conference on Weblogs and Social Media, pp.
130–137. Association for the Advancement of Artificial Intelligence (2010)
45. Riaz, M., Sulayman, M., Salleh, N., Mendes, E.: Experiences conducting systematic reviews
from novices’ perspective. In: Proceedings of the International Conference on Evaluation
and Assessment in Software Engineering, EASE, pp. 44–53. British Computer Society,
Swinton, UK (2010)
46. Rivadeneira, A.W., Gruen, D.M., Muller, M.J., Millen, D.R.: Getting our head in the
clouds: Toward evaluation studies of tagclouds. In: Proceedings of the SIGCHI Conference
on Human Factors in Computing Systems, CHI, pp. 995–998. ACM, New York, NY, USA
(2007). DOI 10.1145/1240624.1240775
47. Schramm, J., Dohrmann, P., Rausch, A., Ternité, T.: Process model engineering lifecycle:
Holistic concept proposal and systematic literature review. In: Proceedings of the Euromi-
cro Conference on Software Engineering and Advanced Applications, SEAA, pp. 127–130.
IEEE, Washington, DC, USA (2014)
48. Schrammel, J., Leitner, M., Tscheligi, M.: Semantically structured tag clouds: An empirical
evaluation of clustered presentation approaches. In: Proceedings of the SIGCHI Conference
on Human Factors in Computing Systems, CHI, pp. 2037–2040. ACM, New York, NY, USA
(2009). DOI 10.1145/1518701.1519010
49. Scott, J.: Social Network Analysis: A Handbook, 2nd edn. ISBN-13: 978-0761963394. SAGE
Publications (2000)
50. Shaw, M.: Writing good software engineering research papers: Minitutorial. In: Interna-
tional Conference on Software Engineering, ICSE, pp. 726–736. IEEE, Washington, DC,
USA (2003)
51. Staples, M., Niazi, M.: Experiences using systematic review guidelines. Journal of Systems
and Software 80(9), 1425–1437 (2007). DOI 10.1016/j.jss.2006.09.046
52. Tell, P., Cholewa, J., Nellemann, P., Kuhrmann, M.: Beyond the spreadsheet: Reflections
on tool support for literature studies. In: Proceedings of the International Conference on
Evaluation and Assessment in Software Engineering, EASE, pp. 22:1–22:5. ACM, New
York, NY, USA (2016)
53. Theocharis, G., Kuhrmann, M., Münch, J., Diebold, P.: Is Water-Scrum-Fall reality? On
the use of agile and traditional development practices. In: Proceedings of the International
Conference on Product Focused Software Development and Process Improvement, Lecture
Notes in Computer Science, vol. 9459, pp. 149–166. Springer, Berlin, Heidelberg (2015)
54. Wasserman, S., Faust, K.: Social network analysis: Methods and applications, 2nd edn.
Cambridge University Press (1994)
55. Wieringa, R., Maiden, N., Mead, N., Rolland, C.: Requirements engineering paper classi-
fication and evaluation criteria: A proposal and a discussion. Requirements Engineering
11(1), 102–107 (2005). DOI 10.1007/s00766-005-0021-6
56. Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimen-
tation in Software Engineering. Springer (2012)
57. Wohlin, C., Runeson, P., da Mota Silveira Neto, P.A., Engström, E., do Carmo Machado, I.,
de Almeida, E.S.: On the reliability of mapping studies in software engineering. Journal
of Systems and Software 86(10), 2594–2610 (2013). DOI 10.1016/j.jss.2013.04.076
58. Zhang, H., Babar, M.A., Tell, P.: Identifying relevant studies in software engineering.
Information and Software Technology 53(6), 625–637 (2011). DOI 10.1016/j.infsof.2010.
12.010
