Evaluation and Measurement of Software
Process Improvement - A Systematic Literature
Review
Michael Unterkalmsteiner, Student Member, IEEE, Tony Gorschek, Member, IEEE, A. K. M. Moinul Islam,
Chow Kian Cheng, Rahadian Bayu Permadi, Robert Feldt, Member, IEEE
Abstract—BACKGROUND—Software Process Improvement (SPI) is a systematic approach to increase the efficiency and
effectiveness of a software development organization and to enhance software products. OBJECTIVE—This paper aims to identify and
characterize evaluation strategies and measurements used to assess the impact of different SPI initiatives. METHOD—The systematic
literature review includes 148 papers published between 1991 and 2008. The selected papers were classified according to SPI
initiative, applied evaluation strategies and measurement perspectives. Potential confounding factors interfering with the evaluation of
the improvement effort were assessed. RESULTS—Seven distinct evaluation strategies were identified, wherein the most common
one, “Pre-Post Comparison”, was applied in 49% of the inspected papers. Quality was the most measured attribute (62%), followed by
Cost (41%) and Schedule (18%). Looking at measurement perspectives, “Project” represents the majority with 66%.
CONCLUSION—The evaluation validity of SPI initiatives is challenged by the scarce consideration of potential confounding factors,
particularly given that “Pre-Post Comparison” was identified as the most common evaluation strategy, and the inaccurate descriptions
of the evaluation context. Measurements to assess the short and mid-term impact of SPI initiatives prevail, whereas long-term
measurements in terms of customer satisfaction and return on investment tend to be less used.
1 INTRODUCTION
WITH the increasing importance of software products
in industry as well as in our every day’s life [1], the
process of developing software has gained major attention
by software engineering researchers and practitioners in the
last three decades [2]–[5]. Software processes are human-
centered activities and as such prone to unexpected or
undesired performance and behaviors [6]. It is generally
accepted that software processes need to be continuously
assessed and improved in order to fulfill the requirements
of the customers and stakeholders of the organization [6].
Software Process Improvement (SPI) encompasses the as-
sessment and improvement of the processes and practices
involved in software development [7]. SPI initiatives are
henceforth referred to as activities aimed at improving the soft-
ware development process (see Section 3.4.3 for a definition
of the different types of initiatives).
M. Unterkalmsteiner, T. Gorschek and R. Feldt are with the Software Engineering Research Lab, School of Computing, Blekinge Institute of Technology, SE-371 79 Karlskrona, Sweden. E-mail: {mun, tgo, rfd}@bth.se.
A. K. M. Moinul Islam is with the Software Engineering: Process and Measurement Research Group, Department of Computer Science, University of Kaiserslautern, PO Box 3049, 67653 Kaiserslautern, Germany. E-mail: moinul.islam@cs.uni-kl.de.
C. K. Cheng is with General Electric Healthcare, Healthcare IT, Munzinger Straße 5, 79111 Freiburg, Germany. E-mail: ChowKian.Cheng@ge.com.
R. B. Permadi is with Amadeus S. A. S., Product Marketing and Development, 485 Route du Pin Montard, Boite Postale 69, 06902 Sophia Antipolis Cedex, France. E-mail: rahadian-bayu.permadi@amadeus.com.
Manuscript received May 11, 2010; revised February 8, 2011; accepted February 15, 2011.
The measurement of the software process is a substantial component in the endeavor to reach predictable per-
formance and high capability, and to ensure that process
artifacts meet their specified quality requirements [8], [9]. As
such, software measurement is acknowledged as essential in
the improvement of software processes and products since,
if the process (or the result) is not measured and evaluated,
the SPI effort could address the wrong issue [10].
Software measurement is a necessary component of ev-
ery SPI program or change effort, and empirical results
indicate that measurement is an important factor for the
initiatives’ success [11], [12]. The feedback gathered by
software measurement and the evaluation of the effects of
the improvement provide at least two benefits. By making
the outcome visible, it motivates and justifies the effort put
into the initiative. Furthermore, it enables assessment of
SPI strategies and tactics [13]. However, at the same time,
it is difficult to establish and implement a measurement
program which provides relevant and valid information
on which decisions can be based [13], [14]. There is little
agreement on what should be measured, and the absence
of a systematic and reliable measurement approach is re-
garded as a factor that contributes to the high failure rate of
improvement initiatives [15]. Regardless of these problems
in evaluating SPI initiatives, a plethora of evidence exists
to show that improvement efforts provide the expected
benefits [16]–[25].
An interesting question that arises from that is how
these benefits are actually assessed. A similar question was
raised by Gorschek and Davis [26], where it was criticized
how changes / improvements in requirements engineering
processes are evaluated for their success. Inspired by the
search for dependent variables [26], we conducted a Sys-
tematic Literature Review (SLR) to explore how the success
of SPI initiatives is determined, and if the approach is dif-
ferent depending on the particular initiative. Furthermore,
we investigated which types of measures are used and,
based on the categorization by Gorschek and Davis [26],
which perspectives (project, product or organization) are
used to assess improvement initiatives. Following the idea
of Evidence-Based Software Engineering (EBSE) [11] we
collect and analyze knowledge from both research and
practical experience. To this end we adopted the approach
for conducting SLRs proposed by Kitchenham [27].
This paper is organized as follows. Background and
related work is presented in Section 2 and our research
methodology is presented in Section 3. In Section 4 we
describe the results and answer our four major research
questions. We present our conclusions in Section 5.
2 BACKGROUND AND RELATED WORK
2.1 Software process improvement
Software process research is motivated by the common
assumption that process quality is directly related to
the quality of the developed software [1], [6], [28]. The
aim of software process improvement is therefore to in-
crease product quality, but also to reduce time-to-market
and production costs [28]. The mantra of many soft-
ware process improvement frameworks and models origi-
nates in the Shewhart-Deming cycle [29]: establish an im-
provement plan, implement the new process, measure the
changed process, and analyze the effect of the implemented
changes [30]–[33].
The Capability Maturity Model (CMM) [34] is an early
attempt to guide organizations to increase their software
development capability and process maturity [35]. Although
software and process measurement is an integral part of the
lower maturity levels (repeatable and defined) and central
for the managed level [36], the model only suggests concrete
measurements since the diversity of project environments
may evoke varying measurement needs [37]. Similarly, the
Capability Maturity Model Integration (CMMI) [38]–[40]
and ISO / IEC 15504 [41], [42] (also known as SPICE), pro-
pose various measurements. The CMMI reference documen-
tation, both for the staged and the continuous representa-
tion [39], [40], provides measurement suggestions for each
process area as an informative supplement to the required
components of the model. The ISO / IEC 15504 standard
documentation [43], on the other hand, prescribes that the
process improvement has to be confirmed and defines a
process measurement framework. The informative part of
the ISO standard provides some rather limited examples of
process measures without showing how the measurement
framework is applied in practice.
A common characteristic of the above-mentioned im-
provement initiatives is their approach to identify the to-
be-improved processes: the actual processes are compared
against a set of “best practice” processes. In case of sig-
nificant divergences, improvement opportunities are iden-
tified and the elimination of the differences constitutes the
actual process improvement [44]. This approach is com-
monly referred to as top-down [44] or prescriptive [33]
improvement. In conceptual opposition to this idea are
the bottom-up [44] or inductive [33] approaches to process
improvement. The main principle of bottom-up improve-
ment is a process change driven by the knowledge of the
development organization and not by a set of generalized
“best practices” [44]. The Quality Improvement Paradigm
(QIP) / Experience Factory [45], [46] is one instance in this
category of improvement initiatives. As in the prescriptive
approaches, measurement to control process change and to
confirm goal achievement is a central part of QIP.
2.2 Related work
Gorschek and Davis present a conceptual framework for
assessing the impact of requirements process changes [26].
Their central idea is that the effect of a change in the
requirements process can be observed and measured at dif-
ferent levels: (1) Effort and quality of requirements related
activities and artifacts in the requirements phase, (2) project
success in terms of meeting time, budget and scope con-
straints, (3) product success in terms of meeting both the
customers’ and the company’s expectations, (4) company
success in terms of product portfolio and market strategies,
and (5) the influence on society.
Although these concepts are described from the perspec-
tive of requirements engineering, the essence to evaluate
a process change on different levels to understand its im-
pact more thoroughly, is conveyable to software process
improvement in general.
By looking at the recent literature one can find several
endeavors to systematically collect and analyze the current
knowledge in software measurement.
Gomez et al. [47] conducted a SLR on measurement
in software engineering. The study considered in total 78
publications and tried to answer three questions: “What
to measure?”, “How to Measure?” and “When to Mea-
sure?” The criteria for inclusion in the review were that
the publication presents current and useful measurements.
To answer the first question, the study accumulated the
metrics based on entities where the measures are collected
and the measured attributes. The most measured entity
was “Product” (79%), followed by “Project” (12%) and
“Process” (9%), and the most measured attributes were
“Complexity” (19%), “Size” (16%) and “Inheritance” (8%).
The second question is answered by identifying metrics that
have been validated empirically (46%), theoretically (26%)
and both empirically / theoretically (28%). Furthermore the
measurement focus, e. g. object-orientation, process, quality,
was analyzed. The answer for the third question, when
to measure, is presented by mapping the metrics onto the
waterfall lifecycle phases. The identified product metrics are
found in the design (42%), development (27%), maintenance
(14%), testing (12%) and analysis (5%) phase.
Bellini et al. [48] systematically reviewed the literature
in twenty Software Engineering and Information Systems
journals with the aim of describing current and future trends
in software measurement. The study identifies and discusses
five key software measurement topics in the reviewed
literature: measurement theory, software metrics, develop-
ment and identification of metrics, measure collection, and
evaluation and analysis of measures. The authors conclude
that, besides traditional software measures like code com-
plexity and developer productivity, developments from or-
ganizational studies, marketing and human resources man-
agement are gaining interest in the area of Software Engi-
neering / Information Systems due to the human-intensive
nature of software development. Measures used in practice
should be developed based upon a common agreement on
the relationship between the empirical object of interest
and its mathematical representation. Furthermore, for the
practical analysis of measures, a more flexible interpretation
of the admissible transformations of measurement scales is
advocated.
Kitchenham [49] conducted a systematic mapping study
to describe the state-of-the-art in software metrics research.
The study assesses 103 papers published between 2000 and
2005 and includes an analysis on their influence (in terms
of citation counts) on research. Kitchenham concludes that
papers presenting empirical validations of metrics have
the highest impact on metrics research although she has
also identified several issues with this type of study. For
example, 5 out of 7 papers, which empirically validated
the object oriented metrics proposed by Chidamber and
Kemerer [50], included Lack of Cohesion (LCOM) in the val-
idation. Kitchenham [49] pointed out that LCOM has been
demonstrated to be theoretically invalid [51] and that continued attempts to validate LCOM empirically therefore seem futile.
The aim of this SLR differs from the above reviews
in two aspects. First, the focus of this review is on mea-
surement of software process improvement initiatives, i. e.
what to measure, and is therefore more specific than the
reviews of Bellini et al. and Gomez et al. Second, this review
investigates also how the measures are used to evaluate
and analyze the process improvement. Given our different
focus, only 1 ([52]) of our 148 reviewed papers was also
covered by Bellini et al. [48]. Gomez et al. [47] did not report
the reviewed papers which impedes a coverage assessment
with our SLR.
3 RESEARCH METHODOLOGY
In this section we describe the design and the execution of
the SLR. Furthermore, we discuss threats to the validity of
this review. Figure 1 outlines the research process we have
used and its steps are described in detail in the following
sub-sections.
The need for this systematic review (Step 1, Figure 1) was
motivated in the introduction of this paper. In order to
determine if similar work had already been performed, we
searched the Compendex, Inspec and Google Scholar digital
libraries1. We used the following search string to search
within keywords, title and abstracts, using synonyms for
“systematic review” defined by Biolchini et al. [53]:
((SPI OR "Software process improvement") AND
("systematic review" OR "research review" OR
"research synthesis" OR "research integration"
OR "systematic overview" OR "systematic re-
search synthesis" OR "integrative research re-
view" OR "integrative review"))
None of the retrieved publications (see [54]) were related to
our objectives which are expressed in the research questions (Step 2).
1. performed on 2008/11/20
Figure 1. Systematic review steps (adapted from [27])
The research questions (Table 1) define what should
be extracted from the selected publications (see Section 3.4).
For example, RQ1 pertains to how the success (or failure)
of SPI initiatives is evaluated, that is, to methods which
show the impact of the initiative. Note that with “evaluation
strategy” we do not refer to SPI appraisals, such as CBA-
IPI [55], SCE [56] or SCAMPI [57], where the organization’s
maturity is assessed by its conformity to a certain industrial
standard [12]. We rather aim to identify the evaluation
strategies which are used to effectively show the impact of
a process change.
RQ3 investigates the measurement perspectives from
which SPI initiatives are evaluated. The perspectives
(project, product and organization) are an abstraction based
on the identified metrics from RQ2. Finally, RQ4 aims to
elicit factors which may impede an accurate evaluation of
the initiative.
The aim of the review protocol (Step 3) is to reduce
potential researcher bias and to permit a replication of
the review in the future [27]. The protocol was evaluated
(Step 4) by an independent researcher with experience in
conducting systematic reviews. According to his feedback
and our own gathered experiences during the process, we
iteratively improved the design of the review. A summary
of the final protocol is given in Sections 3.1 to 3.5.
Table 1
Research questions for the systematic review
ID | Question | Aim
RQ1 | What types of evaluation strategies are used to evaluate SPI initiatives? | To identify which concrete evaluation strategies are used and how they are applied in practice to assess SPI initiatives.
RQ2 | What are the reported metrics for evaluating the SPI initiatives? | To identify the metrics which are commonly collected and used to evaluate SPI initiatives.
RQ3 | What measurement perspectives are used in the evaluation and to what extent are they associated with the identified SPI initiatives? | To determine from which measurement perspective SPI initiatives are evaluated. Furthermore, to analyze any relationship between SPI initiatives and measurement perspectives.
RQ4 | What are the confounding factors in relation to the identified evaluation strategies? | To identify the reported factors that can distort and hence limit the validity of the results of the SPI evaluation. To determine if these issues are addressed and to identify possible remedies.
Figure 2. Search strategy
Table 2
Search keywords
Population: “process improvement” OR “process enhancement” OR “process innovation” OR SPI
Intervention: measur* OR metric* OR success* OR evaluat* OR assess* OR roi OR investment* OR value* OR cost* OR effect* OR goal* OR result*
3.1 Search strategy
We followed the process depicted in Figure 2 for the iden-
tification of papers. Figure 3 shows the selected databases
and the respective number of publications that we retrieved
from each.
From our research questions we derived the keywords
for the search. The search string is composed of the terms
representing the population AND intervention (Table 2).
In order to verify the quality of the search string, we
conducted a trial search on Inspec and Compendex. We
manually identified relevant publications from the journal
“Software Process: Improvement and Practice” (SPIP) and
compared them with the result-set of the trial search. The
search string captured 24 out of 31 reference publications.
Three papers were not in the result-set because Inspec and
Compendex did not index, at the time of the search2, issues
of SPIP prior to 1998. In order to capture the remaining
four publications we added the term “result*” to the search
string.
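For illustration, the search string can be assembled from the population and intervention terms in Table 2. The following sketch (the term lists are transcribed from Table 2; the helper name is our own and not part of the review protocol) shows the construction in Python:

# Sketch: compose the database search string as population AND intervention
# (terms taken from Table 2, including the "result*" refinement described above).
population = ['"process improvement"', '"process enhancement"',
              '"process innovation"', 'SPI']
intervention = ['measur*', 'metric*', 'success*', 'evaluat*', 'assess*',
                'roi', 'investment*', 'value*', 'cost*', 'effect*',
                'goal*', 'result*']

def or_group(terms):
    # Join a list of terms into a parenthesized OR-expression.
    return "(" + " OR ".join(terms) + ")"

search_string = or_group(population) + " AND " + or_group(intervention)
print(search_string)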
Due to the high number of publications we had to
handle (10817, see Figure 3) we decided to use a reference
management system. We opted for Zotero3, mainly due to
its integrated capability to share and synchronize references.
3.2 Study selection criteria
The main criterion for inclusion as a primary study is the pre-
sentation of empirical data showing how the discussed SPI
initiative is assessed and therefore answering the research
questions (Table 1). Both studies conducted in industry and
in an academic environment are included. Since the focus of
this review is the measurement and evaluation of SPI (see
our research questions in Table 1), general discussions on
improvement approaches and comparisons of frameworks
or models were excluded. For the same reason, descrip-
tions of practices or tools without empirical evaluation of
their application were also not considered. Furthermore
we excluded reports of “lessons learned” and anecdotal
evidence of SPI benefits, books, presentations, posters and
non-English texts.
3.3 Study selection procedure
The systematic review procedure was first piloted (Step 5)
in order to establish a homogeneous interpretation of the
selection criteria among the four researchers who con-
ducted the review. The selection criteria were applied on
the title and abstract, and if necessary, on the introduction
and conclusion of the publication. For the pilot, we as-
sessed individually 50 randomly selected publications from
a search conducted in Inspec and Compendex.
2. performed on 2008/11/28
3. http://www.zotero.org
Figure 3. Primary studies selection
The Fleiss Kappa [58] value showed a very low agreement (0.2) among
the review team. We conducted a post-mortem analysis to
unveil the causes for the poor result. As a main reason we
identified the imprecise definition of the selection criteria
and research questions, on which the decision for inclusion
was mainly based on. After a refinement of these defini-
tions, we conducted a second pilot on 30 randomly selected
publications from a search in SCOPUS. Furthermore, we
introduced an “Unsure” category to classify publications
that should be assessed by all researchers until a consen-
sus was reached. Fleiss’ Kappa increased to a moderate
agreement (0.5), and, after the “Unsure” publications
were discussed, the inter-rater agreement improved to 0.7
(substantial agreement according to Landis and Koch [59]),
which we considered as an acceptable level to start the
selection procedure. Figure 3 illustrates in detail how the
publications retrieved from the databases were reduced to
the final primary studies (Step 6) on which we applied the
data extraction.
As can be seen in Figure 3, from the 10817 retrieved
papers, we first discarded duplicates (by ordering them
alphabetically by their title and authors) and studies not
published in the English language. After applying the inclu-
sion / exclusion criteria, a total of 6321 papers were found
not to be relevant and for 234 publications we were not able
to obtain a copy of the text. This diminished the pool of
papers for full-text reading to 362 papers. In the final pool
of primary studies, 148 papers remained after filtering out
studies that we found to be irrelevant after assessing the
full-text and those that reported on the same industry case
studies.
3.4 Data extraction
Similarly to the study selection, we distributed the work-
load among four researchers. The 148 publications accepted
for data extraction (Step 7) were randomly assigned to the
extraction team (37 publications for each member).
We performed the data extraction in an iterative manner.
Based on the experiences reported by Staples and Niazi [60],
we expected that it would be difficult to establish a-priori an
Table 3
Extracted properties
ID Property Research question(s)
P1 Research method Overview of the studies
P2 Context Overview of the studies
P3 SPI initiative RQ1, RQ2, RQ3
P4 Success indicator and metric RQ2
P5 Measurement perspective RQ3
P6 Evaluation strategy RQ1, RQ4
P7 Confounding factors RQ4
exhaustive set of values for all the properties. We therefore
prepared an initial extraction form with the properties listed
in Table 3, which shows also the mapping to the respective
research questions answered by the property. For properties
P1, P2, P3 and P5 a list of expected values was established,
whereas properties P4, P6 and P7 should be extracted
from the studies. Before starting the second iteration, we
reviewed the compiled extraction forms in joint meetings
and consolidated the extracted data into the categorization
given in Sections 3.4.1 to 3.4.7. In a second data extraction
iteration we confirmed the established categorization and
used it for data synthesis (Step 9) and drawing conclusions
(Step 10).
3.4.1 Research method (P1)
We categorized the studies according to the applied re-
search method. Our initial strategy for the categorization
was simple and straightforward: extract the mentioned re-
search method without interpreting the content of the study.
However, we discovered two issues with this approach.
First, the mentioned research methods were inconsistent,
i. e. one study fulfilled our understanding of a “Case study”
while another did not. Second, in some papers the research method was not mentioned at all.
Therefore, we defined the following categories and crite-
ria to classify the studies consistently:
Case study if one of the following criteria applies:
1) The study declares one or more research ques-
tions which are answered (completely or par-
tially) by applying a case study [61], [62].
2) The study empirically evaluates a theoreti-
cal concept by applying it in a case study
(without necessarily explicitly stating re-
search questions, but having a clearly defined
goal [62]).
Industry report if the focus of the study is directed
towards reporting industrial experiences without
stating research questions or a theoretical concept
which is then evaluated empirically. Usually these
studies do not mention any research method explic-
itly. Therefore, instead of creating a category “N/A”
(research method is not applicable), we added this
category as it complies with the “Project monitoring”
method described by Zelkowitz and Wallace [62].
Experiment if the study conducts an experiment [61],
[62] and clearly defines its design.
Survey if the study collects quantitative and/or qual-
itative data by means of a questionnaire or inter-
views [61], [63], [64].
Action research if the study states this research
method explicitly [61], [65].
Not stated if the study does not define the applied
research method and it can not be derived or inter-
preted from reading the paper.
3.4.2 Context (P2)
We categorized the studies into industry and non-industry
cases. The industry category contains studies in which
the research was performed in collaboration or embedded
within industry. The non-industry category is comprised
of studies which were performed in an academic setting
or for which the research environment was not properly
described.
For industrial studies we extracted the company size fol-
lowing the European Recommendation 2003/361/EC [66],
the customer type (internal or external to the company)
of the developed product, the product type (pure software
or embedded), and the application domain. Furthermore,
the number of projects in which the SPI initiative was
implemented and the staff-size were recorded.
Based on this information, we assessed the study quality
from the perspective of the presented research context (see
QA4 in Table 5 in Section 3.5).
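As an illustration, the size categories derived from Recommendation 2003/361/EC can be sketched as follows (simplified to staff headcount only; the recommendation additionally defines turnover and balance-sheet criteria, and the function name is ours):

def company_size(headcount):
    # Simplified classification by staff headcount; Recommendation
    # 2003/361/EC also defines financial criteria, ignored in this sketch.
    if headcount < 50:
        return "small"    # includes micro enterprises (< 10 employees)
    if headcount < 250:
        return "medium"
    return "large"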
3.4.3 SPI initiative (P3)
We categorized the studies according to the presented SPI
initiative as follows:
Framework: this group contains frame-
works/models like CMM, international standards
like ISO/IEC 15504 (SPICE) and business
management strategies like Six Sigma. For the
analysis, we further broke down this category into:
Established frameworks - CMM, CMMI,
ISO/IEC 15504 (SPICE), Six-Sigma, PSP, TSP,
QIP, TQM, IDEAL, PDCA.
Combined frameworks - two or more estab-
lished frameworks are used in combination to
implement the SPI initiative.
Derived frameworks - an established frame-
work is extended or refined to fulfill the spe-
cific needs.
Own framework - the study proposes a new
framework without reference to one of the
established frameworks.
Limited framework - the framework targets
only a specific process area.
Practices: software engineering practices which can
be applied in one or more phases of the software
development life-cycle (e.g. inspections, test-driven
development, etc.).
Tools: software applications that support software
engineering practices.
3.4.4 Success indicator and metric (P4)
From the inspected studies we extracted the metrics which
were used to measure the described SPI initiative. In order
to get an overview of what is actually measured, the metrics
were categorized according to “success indicators”. We did
not define the classification scheme a-priori but it emerged
and evolved during the data extraction (it was stabilized
after the first iteration of the data extraction).
We use the term “success indicator” in order to describe
the improvement context in which the measurement takes
place. Therefore, a “success indicator” is an attribute of
an entity (e. g. process, product, organization) which can
be used to evaluate the improvement of that entity. The
categories of success indicators are shown in Section 4.3
(Table 8). The identified metrics were categorized as in
the following example: (1) The metric “Number of defects
found in peer reviews” is mapped to the “Process quality”
category as it describes the effectiveness of the peer review
process (e. g. [67]–[69]). (2) The metric “Number of defects
identified after shipment / KLOC” (e. g. [67], [70], [71]) is
mapped to the “Product quality” category as the object of
study is the product itself and not the processes from which
the product originates.
The categorization of the metric is dependent on the
context of the study. The use of the metric is interpreted
by understanding which attribute is actually measured and
with which intention. In some cases this was not possible
due to missing information in the description of the metric.
For example the “Defects” category contains those defect
metrics for which the given information could not be used
to justify a classification into one of the predefined quality
categories (neither product nor process).
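The two examples above can be summarized as a simple mapping from metric to measured entity and success indicator (an illustrative sketch only; the dictionary layout is ours and was not part of the extraction form):

# Sketch: example categorization of extracted metrics (property P4).
metric_categories = {
    "Number of defects found in peer reviews":
        {"entity": "process", "success indicator": "Process quality"},
    "Number of defects identified after shipment / KLOC":
        {"entity": "product", "success indicator": "Product quality"},
    # Defect metrics lacking context information fall back to the
    # generic "Defects" category.
}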
3.4.5 Measurement perspective (P5)
We use the concept of “measurement perspective” to define
and categorize how the improvement is being assessed.
Concretely, a measurement perspective describes the view
on the improvement, i.e. which entities are measured in
order to make the change visible in either a quantitative or
qualitative manner. We derived from which measurement
perspective an initiative is evaluated by interpreting the
metrics which were described in the study and from the
attributes they are supposed to measure. We defined the
following measurement perspectives, based on the five soft-
ware entity types proposed by Buglione and Abran [72] (the
entity types process, project and resources were bundled
under the project perspective due to the difficulty of consistently interpreting the measures identified in the reviewed
studies and to avoid mis-categorization):
Project perspective
The measurement is conducted during the project
where the SPI initiative takes place. Examples of
metrics that are used to measure from this per-
spective are productivity during the development
phase, defect rates per development phase, etc. These
measures assess the entity types process, project and
resources.
Product perspective
The evaluation of the SPI initiatives’ impact is con-
ducted by measuring the effect on the delivered
products. An example of a metric that is used to measure from this perspective is the number of customer complaints after product release.
Figure 4. The influence of confounding factors
Organization perspective
The measurement and evaluation of the SPI initia-
tives’ impact is conducted organization-wide. An
example of a metric that is used to measure from this
perspective is return on investment. Other qualita-
tive measurements such as employee satisfaction and
improved business opportunities are also measured
from this perspective.
3.4.6 Evaluation strategy (P6)
During the first iteration of the data extraction we discov-
ered that many publications do not describe or define the
adopted evaluation strategy explicitly. To solve this prob-
lem, we established a categorization of evaluation strategies
based on their common characteristics (see Section 4.2,
Table 7). The categorization grew while extracting the data
from the studies and was consolidated after the first itera-
tion of the process. In some cases we could not identify an
evaluation strategy and the publication was categorized as
"Not Stated".
3.4.7 Confounding factors (P7)
In the context of experimental design, Wohlin et al. [73]
defined confounding factors as “variables that may affect
the dependent variables without the knowledge of the re-
searcher”. They represent a threat to the internal validity of
the experiment and to the causal inferences that could be
drawn since the effect of the treatment cannot be attributed
solely to the independent variable. As shown in Figure 4,
both independent variables (treatments) and confounding
factors represent the input to the experiment and the as-
sessment validity of the dependent variables (effects) is
threatened [74].
Assuming that in the evaluation of software process im-
provements the change is assessed by comparing indicators
which represent an attribute before and after the initiative
has taken place, it is apparent that the problem of confound-
ing factors, as it is encountered in an experimental setting,
is also an issue in the evaluation of SPI initiatives. We argue
therefore that it is of paramount importance to identify
potential confounding factors in the field of software process
improvement.
Kitchenham et al. [75] identified several confounding
factors in the context of the evaluation of software
engineering methods and tools through case studies (see
Table 4). Similarly, we extracted from the reviewed publica-
tions any discussion that addresses the concept of confound-
ing factors in the context of SPI initiatives and, if given, the
chosen remedies to control the issues.
3.5 Study quality assessment
The study quality assessment (Step 8) can be used to guide the
interpretation of the synthesis findings and to determine the
strength of the elaborated inferences [27]. However, as also
experienced by Staples and Niazi [60], we found it difficult
to assess to what extent the authors of the studies were
actually able to address validity threats. Indeed, the quality
assessment we have performed is a judgment of reporting
rather than study quality. We answered the questions given
in Table 5 for each publication during the data extraction
process.
With QA1 we assessed if the authors of the study clearly
state the aims and objectives of the carried out research.
This question could be answered positively for all of the
reviewed publications. With QA2 we asked if the study pro-
vides enough information (either directly or by referencing
to the relevant literature) to give the presented research
the appropriate context and background. For almost all
publications (98%) this could be answered positively. QA3
was checked with “Yes” if validity threats were explicitly
discussed, adopting the categorization proposed by Wohlin
et al. [76]. The discussion on validity threats of an empirical
study increases its credibility [77]. A conscious reflection on
potential threats and an explicit reporting of validity threats
from the researcher increases the trustworthiness of and the
confidence in the reported results. Therefore, if the study
just mentioned validity threats without properly explaining
how they are identified or addressed, the question was
answered with “Partially”. The result of QA3 confirms the
observation in [5] that in empirical studies the scope of
validity is scarcely discussed. QA4 was answered with “Yes”
if we could compile the data in the context property of the
data extraction form to a major degree (see Section 3.4.2).
As it was pointed out by Petersen and Wohlin [78], context
has a large impact on the conclusions that are drawn from
the evidence in industrial studies. However, 51.7% of the
reviewed studies did not, or only partially, describe the con-
text of the research. With QA5 we assessed if the outcome of
the research was properly documented. As with QA1, this
question could be answered positively for all but one study.
3.6 Validity Threats
We identified three potential threats to the validity (Step 11) of
the systematic review and its results.
3.6.1 Publication bias
Publication bias refers to the general problem that positive
research outcomes are more likely to be published than
negative ones [27]. We regard this threat as moderate, since
the research questions in this review are not geared towards
the performance of a specific software process improvement
initiative for the purpose of a comparison. The same rea-
soning applies to the threat of sponsoring in which certain
methods are promoted by influential organizations [27], and
Table 4
Confounding factors and remedies (adapted from [75])
Confounding factor | Description | Remedy
Learning bias | The initial effort to acquire the knowledge for using the method or tool interferes with the evaluation of its benefits. | Separate activities aimed at learning the new technology from those aimed at evaluating it.
Participant bias | Attitude of the case study participants towards the new technology (enthusiasm versus skepticism). | Select participants according to a standard staff-allocation method.
Project bias | The projects on which the technology is evaluated differ in application domains, e.g. embedded systems and information systems. | Select projects within the same application domain.
Table 5
Quality assessment
ID Quality assessment question Yes Partially No
QA1 Is the aim of the research sufficiently explained? 138 (93.2%) 10 (6.8%) 0 (0.0%)
QA2 Is the presented idea/approach clearly explained? 115 (77.7%) 30 (20.3%) 3 (2.0%)
QA3 Are threats to validity taken into consideration? 16 (10.8%) 19 (12.8%) 113 (76.4%)
QA4 Is it clear in which context the research was carried out? 73 (49.3%) 54 (36.5%) 21 (14.2%)
QA5 Are the findings of the research clearly stated? 117 (79.0%) 30 (20.3%) 1 (0.7%)
negative research outcomes regarding this method are not
published. We did not restrict the sources of information to
a certain publisher, journal or conference such that it can be
assumed that the breadth of the field is covered sufficiently.
However, we had to balance the trade-off between covering as much literature as possible and, at the same time, accu-
mulating reliable information. Therefore we decided not to
include grey literature (technical reports, work in progress,
unpublished or not peer-reviewed publications) [27].
3.6.2 Threats to the identification of primary studies
The strategy to construct the search string aimed to retrieve
as many documents as possible related to measurement and
evaluation of software process improvements. Therefore, the
main metric to decide about the quality of the search string
should be the recall of the search result. Recall is expressed
as the ratio of the retrieved relevant items and all existing
relevant items [79]. Since it is impossible to know all existing
relevant items, the recall of the search string was estimated
by conducting a pilot search as described in Section 3.1.
This showed an initial recall of 88%, and after a
refinement of the search string, a recall of 100%. Although
the search string was exercised on a journal (SPIP) of high
relevance for this systematic review, the threat of missing
relevant articles still exists. Inconsistent terminology, in
particular in software measurement research [80], or use of
different terminology with respect to the exercised search
string (see Table 2) may have biased the identification of
primary studies.
Precision, on the other hand, expresses how well the
search identifies only relevant items. Precision is defined
as the ratio of retrieved relevant items and all retrieved
items [79]. We did not attempt to optimize the search string
for precision. This is clearly reflected by the final, very low,
precision of 2.2% (considering 6683 documents after the
removal of duplicates and 148 selected primary studies).
This is however an expected result since recall and precision
are competing goals, i. e. optimizing to retrieve more relevant items (increase recall) usually implies the retrieval of
more irrelevant items too (decrease precision) [81]. The low
precision itself represents a moderate threat to the validity of
the systematic review since it induced a considerably higher
effort in selecting the final primary studies. We addressed
this threat as explained in Section 3.6.3.
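Following the definitions of [79] as paraphrased above, the two measures can be written as follows, with the figures reported in this section inserted for precision:

\mathrm{recall} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{relevant}|}, \qquad
\mathrm{precision} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{retrieved}|} = \frac{148}{6683} \approx 2.2\%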
We followed two additional strategies in order to further
decrease the probability of missing relevant papers. First,
during the testing of the search string (see Section 3.1),
we discovered that the bibliographic databases (Inspec and
Compendex) did not index studies published in "Software
Process: Improvement and Practice" prior to 1998. There-
fore we decided to include a third bibliographic database
(SCOPUS) and also individual publishers in the data sources
(IEEE Xplore and ACM Digital Library). This led to a
high number of duplicates (3893) which we could however
reliably identify by sorting the documents alphabetically
by their title and authors. Secondly, the systematic review
design was assessed for completeness and soundness by
an independent researcher with experience in conducting
systematic literature reviews.
We could not retrieve the full-text for 234 studies within
the scheduled time-frame for the systematic review. This
however represents a minor threat since this set, recalling
the retrieval precision of 2.2%, would have contained ap-
proximately only five relevant studies.
3.6.3 Threats to selection and data extraction consistency
Due to the scope of the systematic review, we had to
develop efficient (in terms of execution time) and effec-
tive (in terms of selection and data extraction consistency)
strategies. One of the main aims of defining a review pro-
tocol is to reduce researcher bias [27] by defining explicit
inclusion/exclusion criteria and a data extraction strategy. A
Table 6
Inter- and Intra-rater agreement
Property Inter-rater Intra-rater
Research method (P1) 0.56 0.78
Context (P2) 0.90 0.90
SPI initiative (P3) 0.83 0.91
Success indicator (P4) 0.54 0.58
Measurement perspective (P5) 0.06 0.25
Evaluation strategy (P6) 0.77 0.47
Confounding factors (P7) -0.05 -0.1
well defined protocol increases the consistency in selection
of primary studies and in the following data extraction
if the review is conducted by multiple researchers. One
approach to further increase the validity of the review
results is to conduct selection and data extraction in parallel
by several researchers and cross-check the outcome after
each phase. In the case of disagreements they should be
discussed until a final decision is achieved. Due to the large
amount of initially identified studies (10817) we found this
strategy impossible to implement within the given time-
frame. Therefore, as proposed by Brereton et al. [82] and
illustrated in Section 3.3 and 3.4, we piloted the paper
selection and data extraction and improved the consensus
iteratively. By piloting we addressed two issues: first, the
selection criteria and the data extraction form were tested
for appropriateness, e. g. are the inclusion / exclusion cri-
teria too restrictive or liberal, should fields be added or
removed, are the provided options in the fields exhaustive?
Second, the agreement between the researchers could be
assessed and discrepancies streamlined, e. g. by increasing
the precision of the definitions of terms. Although it can be
argued that this strategy is weaker in terms of consistency
than the previously mentioned cross-checking approach, it
was a necessary trade-off in order to fulfill the schedule and
the targeted breadth of the systematic review.
In order to assess data extraction consistency, we per-
formed a second extraction on a randomly selected sample
of the included primary studies. Each researcher extracted
data from 15 papers, which is slightly more than 10% of
the total number of included studies and approximately
50% of the studies each researcher was assigned in the first
extraction.
Table 6 shows the Fleiss’ Kappa [58] value of each
property that was extracted from the primary studies. The
inter-rater agreement denotes thereby the data extraction
consistency between the researchers. The intra-rater agree-
ment gives an indication of the repeatability of the process
(the second extraction was performed eighteen months after
the original one).
Landis and Koch [59] propose the following interpreta-
tion for Fleiss’ Kappa: Almost perfect (1.0 - 0.81), Substan-
tial (0.80 - 0.61), Moderate (0.60 - 0.41), Fair (0.40 - 0.21),
Slight (0.20 - 0), and Poor (< 0).
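For reproducibility, a minimal sketch of the standard Fleiss’ Kappa computation [58] underlying Table 6 is given below (our own illustration, not the authors’ tooling; it assumes that every paper was rated by the same number of raters). The resulting value can then be mapped onto the Landis and Koch scale above.

def fleiss_kappa(counts):
    # counts[i][j]: number of raters assigning paper i to category j;
    # each paper must be rated by the same number of raters n.
    N = len(counts)                       # number of rated papers
    n = sum(counts[0])                    # raters per paper
    k = len(counts[0])                    # number of categories
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N                  # mean observed agreement
    P_e = sum(p * p for p in p_j)         # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)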
The analysis shown in Table 6 indicates that for properties P5 and P7 we achieved only slight and poor agreement, respectively, in the data extraction validation. A potential reason for
this result on property P7 may be that confounding factors
are not explicitly mentioned in the selected primary studies
and therefore difficult to identify. In rare cases, confounding
factors are mentioned in the validity threats of the study
(e. g. [71]) or, more frequently, in the results discussion (e. g.
[8], [83]). A consistent extraction of property P7 is therefore
rather challenging and may be biased.
We agreed, however, on the identified confounding factors (P7) and the measurement perspective (P5) categorization since, after the original data extraction, all involved
researchers jointly discussed the results until a consensus
was reached. Hence we are confident that the reported
results in Section 4.4 and 4.5 are internally consistent.
4 RESULTS AND ANALYSIS
A total of 148 studies discuss the measurement and eval-
uation of SPI initiatives. Prior to presenting the results and
analysis for each research question we give a short overview
of the general characteristics of the studies.
4.1 Overview of the studies
4.1.1 Publication year
The reviewed papers were published between 1991 and
2008. A first increased interest in evaluating SPI initiatives
appears in the period between 1998 and 2000 (35, 24%). A
second spike can be observed between 2005 and 2008 (55,
37%). This seems to indicate an increased interest in SPI
and success measurement, pointing to the relevance of the
area. In addition, since a substantial part of the publications falls within the four years before this review was conducted (2008), the likelihood that the results of the studies are still relevant increases, elevating the potential value obtained in this systematic review.
4.1.2 Research method
The inspected publications were classified according to the
applied research methods as defined in Section 3.4.1. Case
studies (66, 45%) and industry reports (53, 36%) constitute
a clear majority of the studies, followed by experiments
(8, 5%), surveys (7, 4%), action research (1, 1%) and a
combination of action research and experiment (1, 1%).
It is also interesting to observe that for some studies (12, 8%) the lack of an adequate description of the applied research methodology prevented a categorization.
4.1.3 Study context
The study settings were categorized in industry and non-
industry cases (see Section 3.4.2). The majority of the papers
(126, 85%) are situated in the industry category, indicating
that the results obtained from this review are based on
realistic settings.
Remarkably, about 50% of the industry studies do not
provide any information on the size of the organization
where the research was carried out. The fact that consid-
erable research effort exists to explore how to introduce
software process improvement into small and medium sized
companies [84], [85], suggests that company size and the
available resources should be taken into account when
choosing and embarking on an SPI initiative. Omitting that
information therefore hampers the judgment of whether such an
initiative is feasible in a different setting [78]. In those
studies which reported the organization’s size, large (> 250
employees) organizations dominate (34, 27%) over medium
(13, 10%) or small (< 50 employees) organizations (13, 10%).
Many publications only provide the name of the company
but they seldom provide its size in terms of the number
of employees. For well-known organizations, this could be because the authors consider this information obvious. Another reason could be that the information was
not considered as important to report. Furthermore, confi-
dentiality concerns are not a valid argument for omitting
context information since it is possible to anonymize the
published data [86]. Indeed there are several reasons why
context information such as size, not only of the organiza-
tion, but also of the unit under study, can be considered crucial. Consider for example “A practical view of soft-
ware measurement and implementation experiences within
Motorola” [67]. The paper does not mention the size of the
company. Since Motorola is a well-known company, it is
possible to get the information about Motorola’s size (at
the end of 2008 it had 64000 employees [87]). Even if the
organizations’ size at the publication date of the study (1992)
would be known, it is still difficult to judge the scope of
SPI implementation since the paper does not specify the
size of, nor in which business units the SPI initiative was
implemented.
In order to improve context documentation, future SPI
research should consider adopting the guidelines developed
by Petersen and Wohlin [78].
4.1.4 Identified SPI initiatives
Figure 5 shows the distribution of the SPI initiatives accord-
ing to the definition given in Section 3.4.3. A detailed list
of all identified initiatives can be found in the extended
material of the systematic review (see [54]). Combinations
of SPI initiatives (e.g. a certain practice was applied in
the context of a framework) are recorded explicitly. The
“Framework” category is predominant (91, 61%), followed
by “Practices” (29, 20%) and “Tools” (9, 6%).
The scope of this systematic review is to capture any
kind of process improvement initiative and their respective
approaches to evaluate it. The holistic approach is cap-
tured by the “Framework” category while the initiatives
targeted at a limited or specific area of software develop-
ment are represented by the “Practices” and “Tools” cate-
gories. Adding up the latter categories (i. e. the categories
“Practices”, “Tools” and “Practices + Tool” sum up to 42)
shows that compared to frameworks (91 studies), they are
underrepresented. This suggests that it is less common to
measure and evaluate the impact of practices and tools in
the context of software process improvement research.
Figure 6 shows the distribution of the established frame-
works. It is of no surprise that CMM is the most reported
framework (42, 44%) since it was introduced almost 20 years
ago. The influence of the Software Engineering Institute
(SEI) can be seen here, which is also the sponsor of the
CMMI, Team and Personal Software Process (TSP/PSP) and
IDEAL. SPICE (ISO/IEC 15504) and BOOTSTRAP, software
process improvement and assessment proposals originating
in Europe, are rather underrepresented. We extracted the
Figure 5. SPI initiative distribution of the publications
Figure 6. Established framework distribution of the publications
geographic location from the papers where the authors
explicitly stated where the study was conducted. Looking
at the studies on CMM, North America is represented 15,
and Europe 9 times. On the other hand, none of the studies
on SPICE were conducted in North America. Considering
that of all identified SPI initiatives 27 studies were located
in North America and 38 in Europe, this may indicate the
existence of a locality principle, i.e. that companies adopt
SPI initiatives developed in their geographic vicinity.
However, since the focus of the research questions is to
elicit evaluation strategies and measurements in SPI initia-
tives, the conclusion that SPICE is generally less commonly
used in industry cannot be drawn from the picture; it rather
means that the evaluation strategies and measurements
used in SPICE are less frequently reported by the scientific
literature.
In the following sections we answer the research ques-
tions stated in Table 1, Section 3.
4.2 Types of evaluation strategies used to evaluate SPI
initiatives (RQ1)
4.2.1 Results
The purpose of this research question was to identify the
evaluation strategies that are applied to assess the impact of
an SPI initiative. As stated in Section 3.4.6, we categorized
the strategies according to their common characteristics and
established seven categories (see Table 7). The strategies
are discussed in more detail in Section 4.2.2. The predom-
inant evaluation strategy that we identified was “Pre-Post
Comparison” (72, 49%), followed by “Statistical Analysis”
(23, 15%). We also encountered papers where we could
not identify an evaluation strategy (21, 14%). They were
however included in the review as they provided data
points relevant to the other research questions.
4.2.2 Analysis and Discussion
“Pre-Post Comparison” is the most common evaluation
strategy. However, the validity of this strategy, in terms of
whether the assessed results are in causal relationship with
the SPI initiative, is rarely discussed (see Section 3.4.7 for a
more detailed discussion).
Most of the identified evaluation strategies are not
specifically designed for evaluating the outcome of SPI ini-
tiatives. However, an exception is given by the Philip Crosby
Associates’ Approach, which suggests explicitly what to
evaluate [216]. The majority of the found evaluation strate-
gies are very generic in nature and different organizations
applied those methods for measuring different success indi-
cators based on the organizational needs and contexts. This
indicates that there is a shortcoming in the used methods
to evaluate the outcome of SPI initiatives in a consistent
and appropriate way, and supports the demand [15] for a
comprehensive measurement framework for SPI.
Pre-Post Comparison: The outcome of SPI initiatives
is evaluated by comparing the success indicators’ values
before and after the SPI initiatives took place. Hence, for the
“Pre-Post Comparison” of success indicators it is necessary
to set up a baseline from which the improvements can be
measured [217]. The major difficulty here is to identify
reasonable baseline values. One strategy could be to use
the values from a very successful project or product (either
internal or external to the organization) and benchmark
the improvement against those. Accordingly, the baseline
would represent the target that is aimed for in the im-
provement. Benchmarking in this way is useful if no his-
torical data of successful projects or products is available.
However, the performance of the improvement initiative
cannot be deduced by comparing against a target baseline
since the previous status is unknown and therefore the
target may merely serve as an indication. Therefore, for
evaluating the effect of improvement initiatives, historical
data against which the actual performance can be compared
is essential. An example that illustrates how a baseline for
organizational performance can be constructed is given by
Paulish and Carleton [8]. Organizations with an established
measurement program will have less difficulty to establish
a baseline than organizations with a newly instantiated or
even not yet started program [8].
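To make the mechanics of a pre-post comparison concrete, the following minimal sketch (in Python) computes the relative improvement of a success indicator against a baseline derived from historical projects. The project names and defect densities are hypothetical and not taken from any reviewed study.

```python
# Minimal sketch of a pre-post comparison against a historical baseline.
# All project names and defect-density values are hypothetical.

baseline_projects = {"P1": 4.2, "P2": 3.8, "P3": 4.5}   # defects/KLOC before the SPI initiative
post_spi_projects = {"P7": 2.9, "P8": 3.1}              # defects/KLOC after the SPI initiative

baseline = sum(baseline_projects.values()) / len(baseline_projects)
post = sum(post_spi_projects.values()) / len(post_spi_projects)

improvement = (baseline - post) / baseline * 100
print(f"Baseline defect density: {baseline:.2f} defects/KLOC")
print(f"Post-SPI defect density: {post:.2f} defects/KLOC")
print(f"Relative improvement:    {improvement:.1f}%")
```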
Baselines are also essential in statistical process control
(SPC) where the variation of a specific process attribute rel-
ative to a baseline is interpreted as instability and therefore
a possible cause of quality issues of the resulting product.
Hollenbach and Smith [181] exemplify the establishment of
baselines for SPC. Furthermore, the statistical techniques
presented by Henry et al. [166] can be used to create base-
lines of quality and productivity measurements.
Statistical Analysis and Statistical Process Control
(SPC): Statistical analysis includes descriptive statistics
where data are summarized numerically (e. g. mean, me-
dian, mode) or graphically (e. g. charts and graphs). Sta-
tistical analysis can also be done by inferential statistics
by drawing inferences about the larger population through
hypothesis testing, estimates of numerical characteristics
(estimation), descriptions of association (correlation), or
modeling of relationships (regression). One application of
statistical techniques is to strengthen the validity of the
collected measurements [218]. Another common application
is found in SPC, whose aim is to measure and analyze the
variation in processes. Time series analysis, as promoted by
SPC, can indicate when an improvement should be carried
out and can determine the efficacy of the process
changes [219].
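As an illustration of how SPC-style control limits can be derived from a baseline, the sketch below computes the limits of an individuals (XmR) chart for a hypothetical process attribute. The observations and the choice of attribute are illustrative assumptions; only the standard XmR constant 2.66 is fixed by the technique.

```python
# Minimal sketch of statistical process control limits (individuals / XmR chart)
# for a process attribute such as review effort per KLOC. Data are hypothetical.

observations = [3.1, 2.8, 3.4, 3.0, 2.9, 3.6, 3.2, 5.1, 3.0, 2.7]

mean = sum(observations) / len(observations)
moving_ranges = [abs(b - a) for a, b in zip(observations, observations[1:])]
mr_bar = sum(moving_ranges) / len(moving_ranges)

ucl = mean + 2.66 * mr_bar   # upper control limit
lcl = mean - 2.66 * mr_bar   # lower control limit

for i, x in enumerate(observations, start=1):
    flag = "out of control" if x > ucl or x < lcl else "in control"
    print(f"obs {i:2d}: {x:4.1f}  ({flag})")
```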
As proposed by Henry et al. [166], several statistical
techniques can be applied to evaluate the effectiveness of
software process improvement in terms of increased esti-
mation accuracy, product quality and customer satisfaction.
The described methods are multiple regression, rank cor-
relation and chi-square tests of independence in two-way
contingency tables, which, when applied repeatedly over
time can show the effectiveness of process improvements
statistically [166]. However, care must be taken when ap-
plying these techniques since a single method alone may
not show the true impact of the initiative and wrong conclu-
sions could be drawn [166]. Furthermore, Henry et al. [166]
cautioned that in some cases the process improvement must
be very effective in order to show significant changes
in the statistical evaluation results. Statistical methods are
also used to assess process stability which is regarded as an
important aspect of organizational capability [161]. In order
to evaluate stability, the authors propose trend, change and
shape metrics which can be used in the short- and long-
term and are analyzed by visual inspection of the data
summarized by descriptive statistics (e. g. histograms and
trend diagrams).
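A minimal sketch of two of the techniques mentioned above, rank correlation and a chi-square test of independence, is given below. It assumes scipy is available, and both data sets are hypothetical rather than drawn from [166].

```python
from scipy import stats

# Hypothetical data: assessed maturity score per project vs. relative
# schedule deviation (a smaller deviation means better estimation accuracy).
maturity_score = [1, 2, 2, 3, 3, 4, 4, 5]
schedule_deviation = [0.40, 0.35, 0.30, 0.28, 0.20, 0.15, 0.12, 0.10]

rho, p_value = stats.spearmanr(maturity_score, schedule_deviation)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")

# Hypothetical two-way contingency table: before/after SPI vs. customer
# satisfaction (satisfied / dissatisfied counts).
table = [[30, 10],
         [45, 5]]
chi2, p, dof, _ = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```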
Ramil and Lehman [156] discuss the assessment of pro-
cess improvement from the viewpoint of software evolution.
The authors propose a statistical technique to determine
whether productivity (or any other process or product at-
tribute) changes significantly over a long period of time. The
aim of the presented CUSUM (cumulative sum) test is to
systematically explore data points which highlight changes
in the evolutionary behavior. Although this can also be
done by visual inspection of trends (as it was proposed by
Schneidewind [161]), a change detection algorithm is con-
sidered as less error-prone and is particularly useful when
assessing the impact of process improvement initiatives and
when analyzing whether the performance of processes has
Table 7
Evaluation strategies
Name Studies Frequency
Pre-Post Comparison [8], [20], [52], [68]–[70], [88]–[152] 72
Statistical Analysis [18], [94], [153]–[173] 23
Pre-Post Comparison &
Survey [15], [16], [71], [174]–[180] 10
Statistical Process
Control [24], [67], [181]–[186] 8
Cost-Benefit Analysis [25], [187]–[190] 5
Statistical Analysis &
Survey [191], [192] 2
Philip Crosby
Associates’ Approach [193], [194] 2
Pre-Post Comparison &
Cost-Benefit Analysis [83], [195] 2
Survey [196] 1
Software Productivity
Analysis Method [197] 1
Cost-Benefit Analysis &
Survey [198] 1
Not stated [19], [21], [23], [26], [199]–[215] 21
changed [156].
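The following sketch illustrates the general idea of a one-sided CUSUM change-detection test on a productivity series. The data, the reference level, the allowance k and the decision interval h are illustrative assumptions, not the parameterization used by Ramil and Lehman [156].

```python
# Minimal sketch of a one-sided (upper) CUSUM change-detection test on a
# productivity series (hypothetical values, function points per month).

series = [10.1, 9.8, 10.3, 10.0, 9.9, 11.5, 11.8, 12.1, 11.9, 12.4]

target = sum(series[:5]) / 5   # reference level taken from the early observations
k = 0.5                        # allowance (half the shift of interest)
h = 2.0                        # decision interval

cusum_pos = 0.0
for i, x in enumerate(series, start=1):
    cusum_pos = max(0.0, cusum_pos + (x - target) - k)
    if cusum_pos > h:
        print(f"Upward shift signalled at observation {i} (CUSUM = {cusum_pos:.2f})")
        break
else:
    print("No significant shift detected")
```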
An interesting approach to address the issue of certain
confounding factors using statistical techniques is presented
by Schalken et al. [172]. The authors illustrate how Cost-
Model Comparison, based on a linear regression equation,
can account for the factor of project size when evaluating
the effect of a process improvement on productivity (the
same method is also proposed by Alagarsamy et al. [140]).
A second issue, namely the comparison of projects from
different departments to assess productivity improvement
is addressed by the Hierarchical Model Approach. Projects
originating from different departments in an organization
are not directly comparable since they are either specialized
on a group of products, a specific technology or have
employees with different skills [172]. Both the Cost-Model
Comparison and the Hierarchical Model Approach can be
used to prevent erroneous conclusions about the impact of
the process improvement initiative by considering context.
Unfortunately, as we have shown in Section 4.1.3, the con-
text in which the improvement initiatives are evaluated, is
seldom presented completely. It is therefore difficult to judge
in such cases if the reported improvement can be attributed
to the initiative.
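The sketch below illustrates the general idea behind such a cost-model comparison: a linear model of effort as a function of project size with a before/after indicator, so that size is held constant when judging the effect of the initiative. The data, the chosen variables and the interpretation of the coefficient are illustrative assumptions, not the exact model of [172].

```python
# Minimal sketch of a cost-model comparison that accounts for project size
# when evaluating a productivity effect. All project data are hypothetical.

import numpy as np

size = np.array([100, 250, 400, 120, 300, 450, 150, 280])       # function points
effort = np.array([520, 1250, 2050, 500, 1280, 1900, 560, 1100])  # person-hours
after = np.array([0, 0, 0, 0, 1, 1, 1, 1])                       # 1 = after the SPI initiative

# Design matrix: intercept, project size, after-SPI indicator.
X = np.column_stack([np.ones_like(size), size, after])
coeffs, *_ = np.linalg.lstsq(X, effort, rcond=None)
intercept, size_coef, spi_effect = coeffs

print(f"Effort per function point:      {size_coef:.2f} person-hours")
print(f"Estimated SPI effect on effort: {spi_effect:.1f} person-hours "
      "(negative = improvement, with size held constant)")
```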
Survey: In the context of this work, a survey is defined as
any method to collect, compare and evaluate quantitative or
qualitative data from human subjects. A survey can be con-
ducted by interviews or questionnaires, targeting employees
affected by the process improvement initiative or customers
of the organization. Surveys can be an effective means to
assess the changes introduced in an improvement effort since
after all, the development of software is a human-intensive
task. The feedback provided by employees can therefore be
used to improve the understanding of the effects caused by
the introduced changes and to steer future improvements.
Gathering information from customers, on the other hand,
can provide insight into how the improvement affects the quality
of products or services as perceived by their respective
users. This can be valuable to assess external quality char-
acteristics, such as integrity, reliability, usability, correctness,
efficiency and interoperability [220], which otherwise would
be difficult to evaluate. The analysis of the improvement
participants’ feedback can be valuable if historical data for
comparison is not available or if its quality / completeness
limits the evaluability of the improvement. A systematic
method to assess the effects caused by an improvement
initiative is described by Pettersson [33]. The approach can
be useful if no or only limited historical data is available to
construct a baseline which can serve as a reference point for
the improvement evaluation. The post-evaluation is based
on the expert opinion of the directly involved personnel,
who compare the improved process with the previous
one. This lightweight process improves the visibility of the
effects of the undertaken improvement initiative and also
provides information on how the change was experienced
by the involved roles. The method could be enhanced by
integrating the concept of "contribution percentages" as it
was proposed by van Solingen [198]. The idea is to let the ex-
perts assess how much the initiative actually contributed to
the improvement, i. e. provide the possibility to express that
only a fraction of the change is attributable to the initiative
and other factors have also contributed to the enhancement.
Such an approach could also support the identification of
potential confounding factors (see Section 4.5).
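A minimal sketch of the contribution-percentage idea: experts estimate which fraction of an observed change they credit to the initiative, and only that fraction is attributed to it. All figures below are hypothetical.

```python
# Minimal sketch of "contribution percentages" in the spirit of [198]:
# experts estimate how much of an observed change is attributable to the
# SPI initiative. The measured change and the estimates are hypothetical.

observed_reduction_in_rework_hours = 400   # measured pre/post difference

expert_contribution_estimates = [0.6, 0.5, 0.7]   # fraction credited to the initiative
avg_contribution = sum(expert_contribution_estimates) / len(expert_contribution_estimates)

attributed = observed_reduction_in_rework_hours * avg_contribution
print(f"Average contribution credited to the SPI initiative: {avg_contribution:.0%}")
print(f"Rework reduction attributed to the initiative: {attributed:.0f} hours")
```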
Besides the expert opinion of employees, it is also pos-
sible to evaluate the effects of the improvement by querying
customers. Quality of service surveys could be sent period-
ically to customers, illustrating the effects of the adapted or
new process from the customer perspective [175].
Cost-Benefit Analysis: Evaluating an improvement ini-
tiative with a cost-benefit measure is important since the
allocated budget for the program must be justifiable in
order not to risk its continuation [1], [198]. Furthermore,
it is necessary to avoid loss of money and to identify
the most efficient investment opportunities [198]. When
assessing cost, organizations should also consider other
resources than pure effort (which can be relatively easily
measured), e. g. office space, travel, computer infrastruc-
ture [198], training, coaching, additional metrics, additional
management activities, process maintenance [190]. Activity
Based Costing helps to relate certain activities with the
actually spent effort [190]. Since cost and effort data can be
collected in projects, they do not need to be estimated [190]. On
the other hand, the values obtained in this way are still an
approximation, and estimations of both costs and benefits
are inevitable [198]. Since it is usually enough to know the
ROI's relative value (positive, balanced or negative), perfect
accuracy is not required as long as the involved stakeholders
agree on the procedure for assessing it [198]. Direct bene-
fits and especially indirect and intangible benefits are best
assessed by multiple stakeholders [198]; some of the difficult
to quantify benefits are: customer satisfaction, improved
market share due to improved quality, reduced time-to-
deliver and accuracy, feature-cost reduction, opportunity
costs, reduced maintenance in follow-up projects, better
reusability, employee satisfaction, increased resource avail-
ability [190]. A useful technique to support the estimation is
the so-called "what-if-not" analysis [198]. Project managers
could be asked to estimate how much effort was saved due
to the implemented improvement in follow-up projects. The
saved effort would then be counted as a benefit. Another
strategy would be to estimate the “worth” of a certain
improvement, e. g. asking managers how many training
days they would invest to increase employee motivation
and quantifying the cost of such a training program [198].
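The following sketch shows a simple cost-benefit calculation along these lines; the cost and benefit categories echo the ones listed above, but all amounts are hypothetical and the exact ROI formula used in an organization may differ.

```python
# Minimal sketch of a cost-benefit / ROI calculation for an SPI initiative.
# Categories follow the discussion above ([190], [198]); figures are hypothetical.

costs = {
    "training": 12000,
    "coaching": 8000,
    "additional metrics": 3000,
    "process maintenance": 5000,
}
benefits = {
    "saved rework effort (what-if-not estimate)": 30000,
    "reduced maintenance in follow-up projects": 15000,
}

total_cost = sum(costs.values())
total_benefit = sum(benefits.values())
roi = (total_benefit - total_cost) / total_cost

print(f"Total cost:    {total_cost}")
print(f"Total benefit: {total_benefit}")
# Often the sign (positive / balanced / negative) is all that is needed [198].
print(f"ROI: {roi:.2f} ({'positive' if roi > 0 else 'balanced or negative'})")
```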
Philip Crosby Associates’ Approach: This method is
derived from Philip Crosby’s Cost of Quality idea [216]. It
is based on distinguishing the cost of doing it right the
first time (performance costs) from the cost of rework (non-
conformance costs). The cost of quality is determined by the
sum of appraisal, prevention and rework costs [193]. The
improvement is evaluated by a reduction of rework costs
over a longer period of time (several years, as shown in [193]
and [194]). This method is similar to Cost-Benefit Analysis
but particularly tailored to software process improvement
evaluation.
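A minimal sketch of the underlying cost-of-quality arithmetic, with hypothetical yearly figures: the cost of quality is the sum of appraisal, prevention and rework costs, and the improvement is read from the trend of the rework share over several years.

```python
# Minimal sketch of the cost-of-quality breakdown underlying the Philip Crosby
# Associates' approach: cost of quality = appraisal + prevention + rework costs.
# All yearly figures (in kUSD) are hypothetical.

yearly_costs = {
    # year: (appraisal, prevention, rework)
    2005: (120, 60, 400),
    2006: (130, 80, 310),
    2007: (125, 95, 220),
}

for year, (appraisal, prevention, rework) in sorted(yearly_costs.items()):
    cost_of_quality = appraisal + prevention + rework
    print(f"{year}: cost of quality = {cost_of_quality:4d} kUSD "
          f"(rework share {rework / cost_of_quality:.0%})")
```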
Software Productivity Analysis Method (SPAM):
SPAM provides a way of defining productivity models and
evaluation algorithms to calculate the productivity of all
possible combinations of an observed phenomenon (pro-
cess, project size, technology etc.) [197], [221].
4.3 Reported metrics for evaluating the SPI initiatives
(RQ2)
4.3.1 Results
The purpose of this research question was to identify the metrics
and success indicators (see Section 3.4.4) used in SPI evaluations.
Table 8 and Table 9 show the frequency of the identified
success indicators in the inspected studies. “Process Qual-
ity” (57, 39%) was the most observed success indicator, fol-
lowed by “Estimation Accuracy” (56, 38%), “Productivity”
(52, 35%) and “Product Quality” (in total 47 papers, 32%,
considering also those from Table 10).
We differentiated the “Product Quality” success indica-
tors based on the ISO 9126-1 standard. The identified studies
are shown in Table 10. Two points have to be noted. First,
we added “Reusability”, which is not defined as a product
quality attribute by ISO 9126-1, to the quality attributes.
Second, if the study did not explicitly state, or sufficiently
describe, which quality attribute is measured, we mapped
the study to the general “Product Quality” category (see
Table 8).
“Reliability” was the most observed success indicator for
the product quality characteristics, followed by “Maintain-
ability” and “Reusability”.
Table 11 shows the categorization of estimation accuracy
indicators. The “Others” category contains again estimation
accuracy metrics which could not be mapped to the specific
categories. “Schedule” (37, 25%) is by far the most observed
success indicator for estimation accuracy. On the other hand,
assuming that “Cost” can be expressed in terms of “Effort”
and vice versa, combining them shows that their number
of observations (35, 24%) is comparable to that of
“Schedule”. “Size” (10, 7%), “Productivity” and “Quality”
(2 papers each, 1%) fall behind.
We also distinguished how customer satisfaction is as-
sessed (Table 9). Qualitative customer satisfaction is largely
assessed by questionnaires, while quantitative customer sat-
isfaction is recorded by objective measures (e.g. New open
problems = total new post-release problems opened during
the month).
The “Other Qualitative/Quantitative Success Indicator”
categories contain indicators such as “Team morale”, “Em-
ployee motivation” or “Innovation” which were explicitly
mentioned in the studies as indicators for improvement but
could not be mapped into the classification.
4.3.2 Analysis and Discussion
The main incentive for embarking on an SPI
initiative is to increase quality and to decrease cost and
schedule [224]–[226]. In order to evaluate the success of such
an initiative it is crucial to assess the improvement’s effects.
Table 8 and Table 9 list the success indicators we identified in
this systematic review. We mapped the improvement goals
of quality, cost and schedule with these success indicators:
Quality (“Process Quality” & “Product Quality” & “Other Quality Attributes”) was found in 92 papers (62%)
Cost (“Effort” & “Cost”) was found in 61 papers (41%)
Schedule (“Time-to-market”) was found in 27 papers (18%)
This shows that quality is the most measured attribute,
followed by cost and schedule. Drawing an analogy with the
time-cost-performance triangle [227], [228], which reflects
that the three properties are interrelated and cannot all be
optimized at the same time, the unbalanced distribution of
the identified success indicators suggests that this trade-off
also applies to what is actually measured in SPI initiatives.
Table 8
Success indicators
Success indicator Description Studies Frequency
Process Quality The quality that indicates process
performance and is not related to the
product. The metrics for this success
indicator are dependent on the type of
process they measure.
[18]–[20], [24]–[26], [67]–[69], [83], [96]–[100], [105],
[109], [112], [114], [119], [122]–[124], [127], [129],
[133], [136], [141], [142], [146], [147], [151], [154],
[155], [158], [159], [161], [162], [165], [166], [168],
[171], [174], [176]–[179], [182]–[185], [199], [203],
[213], [214], [222]
57
Estimation
Accuracy The deviation between the actual and
planned values of other attributes
measurements. Examples of attributes
commonly estimated are schedule,
effort, size and productivity.
[8], [15], [16], [20], [23], [26], [52], [67], [69]–[71],
[88], [92], [93], [98], [100], [101], [106], [112], [113],
[115], [118], [123], [128], [131], [132], [136], [138],
[139], [141], [144], [146], [147], [149], [151], [153],
[162], [166]–[168], [170], [178], [180]–[182], [186],
[188], [194], [195], [199], [201], [202], [209], [211],
[212], [215]
56
Productivity The performance of the development
team in terms of its efficiency in
delivering the required output.
[8], [15], [16], [52], [67], [70], [71], [83], [93], [98],
[101], [103], [108], [113], [115], [120], [121], [124],
[128], [130]–[132], [135], [137], [140], [143], [147],
[149]–[151], [153], [154], [156], [167], [172], [178],
[182], [186], [188], [192], [194], [197], [200]–[202],
[205], [208]–[210], [215], [222], [223]
52
Product Quality This list shows studies in which
we identified measures for product
quality in general. Studies, in which a
quality attribute according to the ISO
9126-1 standard was mentioned, are
shown in Table 10.
[16], [70], [71], [88], [100], [103], [104], [108]–[110],
[118], [121], [126], [128], [131], [135], [136], [139],
[141], [146], [154], [157], [174], [177], [192], [201],
[206], [209]
47 (28 + Table 10)
Effort The effort of the development team in
developing the product. [8], [20], [23], [25], [26], [70], [90], [91], [93], [96],
[100], [103], [104], [109]–[111], [113], [115], [117],
[123], [129], [136], [137], [142], [145], [147], [151],
[152], [154], [155], [158], [160], [162], [164], [171],
[185], [188], [199], [200], [211], [214]
41
Defects This success indicator is to group
metrics that are solely intended to
measure the defects without relating
them to quality.
[8], [15], [18], [23], [67], [90], [94], [101], [107], [115],
[116], [124], [130], [134], [136], [138], [143]–[145],
[148], [150], [151], [160], [162], [169], [171], [173],
[176], [182], [184], [185], [188], [199], [200], [202]
35
Furthermore, in order to accurately calculate the finan-
cial benefits of an SPI initiative, it is necessary to take
all three attributes into account [224]. The low occurrence
of “Return-on-investment” (22, 15%) as a success indicator
suggests that it is seldom used to increase the visibility of
the improvement efforts. It has been shown, however, that
“Return-on-investment” can be used to communicate the
results of an SPI initiative to the various stakeholders [229]
(see Section 4.2.2, Cost-Benefit Analysis for a more in-
depth discussion about “Return-on-Investment”).
Product Quality: As shown in Table 10, we categorized
success indicators according to ISO 9126-1 product quality
attributes. The main incentive to analyze the success indi-
cators from this perspective is that those attributes may
have a different weight, depending on the stakeholder. A
developer may rate “Maintainability”, “Reusability” and
“Portability” (internal quality attributes) higher than the
product's customer. “Reliability”, “Usability”, “Functional-
ity” and “Efficiency” on the other hand are the external
quality attributes of the product which are potentially more
important to the customer [230], [231]. The measurement of
internal quality attributes can be applied efficiently, with a
low error frequency and cost [231]. It is therefore of no sur-
prise that internal attributes are measured more frequently
than external ones (see Table 10). Interestingly, “Reliability”
is measured far more often than the other three
external attributes. This is explained by looking at the
measures used in these studies to express “Reliability”, which in
the majority are based on product failures reported by the
customer and therefore relatively easy to collect and evalu-
ate. On the other hand, “Usability” which is considered as
difficult to measure [232]–[234], is also seldom assessed in
the context of process improvement (see Table 10).
Customer Satisfaction: Customer satisfaction can be
used to determine software quality [235] since it is com-
monly considered as an accomplishment of quality man-
agement [236]. An increased product quality could therefore
also be assessed by examining customer satisfaction. Never-
theless, we identified only a few papers (20, 14%) that use
qualitative means, and even fewer papers (7, 5%) in which
quantitative means are described to determine a change
in customer satisfaction (see Table 9). Although measur-
ing customer satisfaction by a questionnaire can provide a
more complete view on software quality, it is an intrusive
measurement that needs the involvement and cooperation
of the customer [237]. On the other hand, quantitative
measurements such as the number of customer-reported failures
need to be put into relation with other, possibly unknown,
variables in order to be a valid measure for software quality.
A decrease in product sales, an increased knowledge of the
Table 9
Success Indicators (Continued)
Success indicator Description Studies Frequency
Cost The cost in terms of the resources that is
required in developing the product
(monetary expenses).
[23], [26], [67], [95], [100], [104], [107], [108], [114],
[127], [129]–[131], [133], [134], [142], [159], [175],
[177], [184], [193]–[195], [197], [200], [205], [206]
27
Time-to-Market The time that it takes to deliver a
product to the market from its
conception time.
[8], [20], [23], [83], [94], [97], [107], [108],
[112]–[114], [117], [125], [128], [130], [136], [139],
[145], [149], [152], [177], [191], [200], [202], [206],
[209], [214]
27
Other Qualitative
Success Indicators Examples are: staff morale, employee
satisfaction, quality awareness [15], [19], [21], [23], [25], [71], [92], [102], [125],
[126], [149], [150], [162], [177], [188], [194], [196],
[201], [202], [204], [207], [211], [215]
24
Return-On-
Investment The value quantified by considering the
benefit and cost of software process
improvement.
[8], [16], [26], [83], [92], [102], [109], [123], [125],
[127], [137], [150], [152], [184], [187], [190], [195],
[198], [200], [203], [206], [207]
22
Customer
Satisfaction
(Qualitative)
The level of customer expectation
fulfillment by the organization’s
product and service. Customer
satisfaction measurement is divided into
two types, qualitative and quantitative.
[15], [16], [23], [52], [124], [125], [149], [157], [162],
[168], [175], [177], [179], [180], [192], [200], [201],
[206], [212], [215]
20
Customer
Satisfaction
(Quantitative)
[67], [93], [111], [128], [133], [168], [175] 7
Other Quantitative
Success Indicators This success indicator is to group
metrics that measure context-specific
attributes which are not part of any of
the above success indicators (e.g.
employee satisfaction, innovation)
[160], [177], [179], [189], [191], [199] 6
Other Quality
Attributes [52], [128], [163], [195] 4
Table 10
ISO-9126-1 Product Quality Attributes
Quality attribute Studies Frequency
(abs/rel)
Reliability [18], [67], [92], [129], [153],
[166], [181], [212], [214] 9/0.47
Maintainability [117], [159], [160], [164], [182],
[212] 6/0.32
Reusability [129], [148], [149], [159], [208],
[211] 6/0.32
Usability [149], [212] 2/0.10
Portability [150], [212] 2/0.10
Efficiency [212] 1/0.05
Functionality [212] 1/0.05
customer on how to circumvent problems or a shift in the
user base can all cause a reduction in reported failures,
making the measurement of software quality from this angle
more complex [238].
Estimation accuracy: In Table 11 the success indica-
tors for estimation accuracy are shown. It is interesting
that estimating quality seems very uncommon, although the
improvement of quality is one of the main interests of
SPI initiatives [1], [239] and quality is the most measured
success indicator (Table 8). The identified
quality estimation metric instances cover process quality,
e. g. actual/estimated number of Quality Assurance re-
views ([202]) and actual/estimated number of defects re-
moved per development phase ([146]). Quality estimation
metrics should be given equal importance as the other
estimation metrics as they can be used to assess the stability
of the software process. On the other hand, “Schedule” (37,
25%) and “Cost and Effort” (34, 24%) represent the bulk of
the estimation accuracy measures. These two factors may
be presumed as important constraints during the project
planning [228] and are therefore preferably selected for
estimation.
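As an illustration, the following sketch computes a simple estimation-accuracy indicator, the relative deviation between planned and actual schedule, for a few hypothetical projects; the reviewed studies use various concrete formulas, so this is only one plausible variant.

```python
# Minimal sketch of an estimation-accuracy indicator: relative deviation
# between planned and actual schedule. Project data are hypothetical.

projects = [
    # (project name, planned duration in weeks, actual duration in weeks)
    ("A", 20, 26),
    ("B", 14, 15),
    ("C", 30, 29),
]

for name, planned, actual in projects:
    deviation = (actual - planned) / planned
    print(f"project {name}: schedule deviation {deviation:+.0%}")
```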
Validity of measurements: Overall we extracted an
overwhelming list of metric instances from the publica-
tions. However, many of the metric instances are actually
measuring the same attribute but in different measurement
units, e. g. defect density which is measured by taking the
number of defects over size, where size can be expressed
in either LOC, FP, etc. Even more interesting is that the
definition of basic measures deviates considerably. For the
success indicator “Productivity” there are examples where
the metric was defined as the ratio of effort over size ([208],
[209]), and reversely, as the ratio of size over effort ([128],
[131]). Another example can be found for the metric "Defect
Density", that is interpreted as "Process Quality" ([142]) but
classified as "Defect" in [185], [188].
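The practical consequence of such inconsistent definitions can be shown with a small sketch: the same hypothetical project yields two non-comparable "productivity" numbers depending on whether the ratio is taken as effort over size or as size over effort.

```python
# Minimal sketch of the inconsistency discussed above: "productivity" reported
# as effort/size in some studies ([208], [209]) and as size/effort in others
# ([128], [131]). The project values are hypothetical.

size_fp = 400        # project size in function points
effort_ph = 2000     # effort in person-hours

productivity_a = effort_ph / size_fp   # person-hours per function point
productivity_b = size_fp / effort_ph   # function points per person-hour

print(f"Definition A (effort/size): {productivity_a:.2f} person-hours/FP")
print(f"Definition B (size/effort): {productivity_b:.2f} FP/person-hour")
```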
A potential reason for these inconsistencies is the
lack of a reference terminology for software measurement
that is acknowledged by both researchers and practitioners [80]. Impre-
cise terminology can lead to inadequate assessment, com-
Table 11
Estimation Accuracy Success Indicators
Success
Indicator Studies Frequency
(abs/rel)
Schedule [8], [15], [16], [20], [23], [26],
[52], [67], [88], [92], [93], [100],
[112], [113], [115], [123], [128],
[132], [138], [139], [144], [146],
[149], [151], [162], [167], [168],
[170], [171], [180]–[182], [194],
[199], [201], [212], [215]
37/0.66
Cost [15], [16], [23], [26], [52], [69],
[92], [93], [131], [147], [149],
[168], [180], [181], [194], [201],
[212], [215]
18/0.32
Effort [67], [70], [71], [93], [101], [112],
[118], [136], [141], [144], [151],
[153], [166], [178], [199], [209],
[211]
17/0.30
Size [98], [101], [106], [115], [118],
[136], [144], [151], [153], [195] 10/0.18
Others [93], [123], [146] 3/0.05
Productivity [146], [153] 2/0.04
parison and reporting of measurement results and impede
learning [240] and therefore improvement. Besides the lack
of agreement on measurement terminology and concepts,
there are doubts about the validity of certain measures. The
poor definition of measures leads to broad margins of inter-
pretation as, for example, shown by Kaner and Bond [241]
for the reliability metric mean time to failure (MTTF). As
pointed out by Carbone et al. [242] it is necessary to un-
derstand better the abstract concepts behind the measured
quantities and to construct precise operational definitions in
order to improve the validity of measurements.
4.4 Identified measurement perspectives in the evalua-
tion of SPI initiatives (RQ3)
4.4.1 Results
The purpose of this research question was to assess from
which measurement perspective (project, product or or-
ganization) SPI initiatives are evaluated (see Section 3.4.5
for the definition of the perspectives). Figure 7 shows the
frequencies of the identified measurement perspectives. The
“Project” perspective (98, 66%) represents the majority, fol-
lowed by the “Project and Product” perspective (30, 20%)
and the “Project, Product and Organization” perspective (8,
5%). These numbers show that measurement and evaluation
at the project level is the most common approach to assess
SPI initiatives.
The SPI initiatives and the corresponding measurement
perspectives are mapped in Table 12 and Table 13 respec-
tively.
We identified the organizational measurement perspec-
tive mostly in studies with a CMM-based initiative (row A
in Table 12). We did not identify any study with the product
perspective alone within the established SPI framework cat-
egory; however, rows A, B, E, F and G in Table 12 show that it
is common to combine the project and product perspectives.
Figure 7. Measurement perspective
4.4.2 Analysis and Discussion
A considerable amount (98, 66%) of the total 148 papers in
this review reported only measurements for the project per-
spective. This indicates that the measurement perspective
to evaluate the SPI initiatives’ outcome is strongly biased
towards the project perspective. The dominance of the project
perspective and the very low number of studies with an
organization perspective may indicate a potential problem in
communicating the evaluation results of the SPI initiatives to all the orga-
nization’s stakeholders, assuming that they have different
information needs. On the other hand, it can be argued that
measuring the project is easier since probably fewer confounding
factors are involved [26].
At the corporate level, business benefits realized by the
improvement initiative need to be visible, whereas the ini-
tiatives’ impact on a certain project is of more relevance for
the involved developers, project or product managers [243].
Hence, it may be beneficial to consider and assess infor-
mation quality of software measurements in terms of their
fitness for purpose [244].
It can also be observed that, whenever the product
perspective is considered it is often accompanied by the
project perspective. The combination of these measurement
perspectives seems reasonable, especially when considering
the project success definition by Baccarini [245]: overall
project success is the combination of project management
success and project product success.
Relying exclusively on the project perspective makes it
difficult to span the evaluation over several projects and
thus to look beyond the goals of a single
project [26]. For example, Babar and Gorton [246] have
observed in a survey among practitioners that software ar-
chitecture reviews are often performed in an ad-hoc manner,
without a dedicated role or team responsible for the review.
As such, this initiative may be beneficial for the current
project, but fail to provide the expected financial benefits in
the long term [246]. That would, however, remain unobserved
if the improvement initiative is only evaluated from the
project perspective. It is therefore important to assess the
effect of SPI initiatives from perspectives beyond the project,
that is, consider also the impact on the product and the
organization [26].
Table 12
Measurement perspectives identified in established frameworks
ID SPI initiatives
Project (Prj)
Product (Prd)
Organization (Org)
Prj & Prd
Prj & Org
Prd & Org
Prj & Prd & Org
A CMM [70], [83], [115], [130], [131], [136], [147],
[161], [172], [185], [188], [190], [193], [209],
[223]
- [21] [23], [111], [128],
[133], [175], [180],
[214]
[15] - [8], [201],
[207]
B CMMI [151], [181] - - [121] - - -
C SPICE [93], [215] - - - - - -
D PSP [118], [141]–[143], [153] - - - [195] - -
E TSP [144] - - [146] - - -
F Six-Sigma [169], [183] - - [18], [126] - - -
G QIP [129] - - - - - -
H TQM - - - [149] - - -
I IDEAL [135] - - - - - -
J PDCA [122] - - - - - -
Frequency 30 0 1 12 2 0 3
Looking at Table 12 and rows K to N in Table 13, it can be
seen that 77 out of 91 (85%) initiatives that are supported by
a framework are evaluated from the project and/or product
perspective. This indicates a discrepancy between the initiatives'
aim, i. e. to establish an organization-wide improvement
(e. g. at CMM level 3 the improvement is extended to
organizational issues [247]), and how the achievement of
this aim is assessed. From the indications gathered in this
review, the organizational measurement perspective is the
least reported one.
SPI initiatives that involve Six Sigma are mostly focused
on the “Project” and “Project & Product” perspectives. In
9 out of 10 studies ([18], [126], [169], [183] from Table 12
and [24], [67], [120], [167], [184] from Table 13) these perspec-
tives are considered while only [206] from Table 13 covers
the organizational measurement perspective. This could be
ascribed to the emphasis given by Six Sigma on product
quality [248] and the implied focus on evaluating the impact
on the project and on the produced goods.
Finally, if we look at the measurement perspectives iden-
tified in the tools and practices category (Table 13, rows Q, R
and S), we can identify some interesting patterns. Only [92],
[102], [150] consider the organizational measurement per-
spective. In particular, SPI initiatives in the “Tools” and
“Practices + Tools” categories do not consider the organi-
zation perspective in the measurement. A potential expla-
nation is that tools and practices are mostly applied at the
project or product level and not at the organization level.
For the “Practice” category, the most prominent measure-
ment perspective is the project perspective. The reason is
that these initiatives are mostly addressing the project level.
The introduction of a tool as an SPI initiative can, however,
have far-reaching consequences, not only for the project [197],
[249], but also for product quality [75], [250]–[252]
and the organization [253]–[256].
4.5 Confounding factors in evaluating SPI initiatives
(RQ4)
4.5.1 Results
The purpose of this research question was to determine
which confounding factors (see Section 3.4.7) need to be
taken into consideration when evaluating SPI initiatives. As
Table 14 shows, we could identify only a few hints regarding
these factors. This might indicate that confounding factors
are seldom explicitly taken into consideration when evalu-
ating process improvement.
4.5.2 Analysis and Discussion
From the results presented above we can identify several
issues regarding confounding factors and their role in eval-
uating SPI initiatives. The first is that we could only identify
19 studies (out of 148) which discuss potential validity
problems when evaluating SPI initiatives. It is therefore
difficult to generalize assumptions or to relate a finding
to a certain evaluation strategy. Second, the authors of
Table 13
Measurement perspectives identified in framework variations, practices and tools initiatives
ID SPI initiatives
Project (Prj)
Product (Prd)
Organization (Org)
Prj & Prd
Prj & Org
Prd & Org
Prj & Prd & Org
K Two or more SPI
frameworks [20], [68], [69], [98], [99], [117], [120], [123],
[138], [186], [196] - [187],
[198] [16], [67], [168],
[184], [212] [194],
[202] - [179],
[200]
L Derived SPI frame-
work [24], [101], [119], [140], [162], [174] - - - - - [206]
M Own SPI
framework [19], [25], [104], [191] - - [52], [100] - - -
N Limited framework [90], [112], [155], [160], [164], [203] - - [88] - [177] -
O SPI framework &
Practice [134], [148], [167], [208], [222] - - [94], [108] - - -
P SPI framework &
Tool - - - [199] - - -
Q Practices [91], [105], [107], [113], [116], [127], [132],
[137], [139], [156], [158], [159], [165], [170],
[171], [176], [178], [182], [205], [213]
- [102] [71], [95], [124],
[145], [154], [192] [150] - [92]
R Tool [89], [96], [97], [106], [152], [197], [204] [125] - [103] - - -
S Practices & Tool [109], [110], [114], [211] - - - - - -
T Not stated [157], [163], [166], [173], [210] - [189] - - - [26]
Frequency 68 1 4 18 3 1 5
the publications seldom use the term “confounding fac-
tor” or “confounding variable”; often we had to interpret
the descriptions of study designs, executions and results
to discover if the authors considered confounding factors.
We identified several synonyms instead: “influencing factors” [16],
“influences” [115], “state variables” [96], “uncontrolled
independent variables” [107] and “environmental
influences” [8], [109].
What can be learned from the identified studies is that
the identification, characterization and control of confound-
ing factors is a challenging endeavor. In [15], [25], [109],
[195] they are described in an abstract and general way
without discussing remedies to overcome them. The authors
in [25] pointed out that it is possible to measure product
quality improvement effected by specific process actions.
They also cautioned that it is necessary to study the condi-
tions under which the relationship between process action
and improvement are observed in order to increase the
knowledge on these relationships. Unfortunately, in many
cases the context in which the improvement is evaluated is
described unsatisfactorily (see Section 4.1.3), which makes
the identification of confounding factors more difficult.
Generally, the effect of confounding factors on the de-
pendent variable can be controlled by designing the study
appropriately, e. g. by a random allocation of the treatment
and control groups [259]. The fundamental assumption behind
such a design is that the confounding variables are equally
distributed in each group, i. e. that the probability is high
that the groups have similar properties. Therefore, if the
distribution of the dependent variable is similar in both the
control and treatment group, it can be concluded that the
treatment has no effect.
The concept of randomization is also discussed in [73],
[260], [261] in the context of software engineering experi-
ments. Pfleeger [260] points out that the major difference
between experiments and case studies is the degree of con-
trol. In order to control a potential confounding variable, the
experiment can be designed in such a way that the exper-
imental units within the distinct groups are homogeneous
(blocking). Additionally, if the number of experimental units
is the same in each group, the design is balanced.
Unfortunately, random sampling of projects or subjects
is seldom an option in the evaluation of improvement ini-
tiatives; awareness of potential confounding factors is
therefore needed in order to apply techniques that
compensate for confounding ef-
fects [259]. The matching technique, for example, leads to an
evaluation design that satisfies the ceteris paribus condition
by selecting groups with similar properties with respect
to confounding factors [259]. By looking at the proposed
solutions, several studies apply some sort of matching, e. g.
by selecting similar projects in terms of size and application
domain, technology or staff size (see [16], [83], [90], [93],
[107], [115], [145], [181] in Table 14).
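A minimal sketch of such a matching step, with hypothetical project records: pre- and post-SPI projects are paired only if they share the application domain and have a similar size, approximating the ceteris paribus condition.

```python
# Minimal sketch of the matching idea [259]: select pre- and post-SPI projects
# with similar properties (here size and application domain). Records are hypothetical.

pre_spi = [
    {"id": "P1", "size_kloc": 50, "domain": "telecom"},
    {"id": "P2", "size_kloc": 12, "domain": "finance"},
]
post_spi = [
    {"id": "P9", "size_kloc": 48, "domain": "telecom"},
    {"id": "P10", "size_kloc": 60, "domain": "automotive"},
]

def matches(a, b, size_tolerance=0.2):
    # Same domain and size within a relative tolerance of the pre-SPI project.
    same_domain = a["domain"] == b["domain"]
    similar_size = abs(a["size_kloc"] - b["size_kloc"]) <= size_tolerance * a["size_kloc"]
    return same_domain and similar_size

pairs = [(a["id"], b["id"]) for a in pre_spi for b in post_spi if matches(a, b)]
print("Comparable project pairs:", pairs)
```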
Table 14
Identified confounding factors
Study Confounding factors Solutions proposed in the study
[90] When choosing projects for evaluation, “it is impossible to find
identical projects as they are always of differing size and nature”. Selection of similar projects in size and nature for
evaluation.
[181] Development phase, Measurement unit, Data collection process Group projects according to project categories and
evaluate categories individually.
[93] In consecutive assessments, projects may have different application
domains and scope. When comparing two evaluation results, select projects
with similar application domain and scope.
[8] Environmental influences like staff size and turnover, capability
maturity level, staff morale. Collect those environmental data (influences) which
help to identify and understand influences on
performance.
[195] The result of measuring the impact of personal software process
training depends on the length of project time being measured and the
number of data samples used to measure improvement.
No solution provided.
[96] The authors identify seven state variables which can influence the
result of their study: programming and testing experience of the
developers, application domain of the tested component, functional
area of the classes involved in the tested component, familiarity of the
developers with other tools, scale of the project, size of the project
team, and number of iterations previously completed.
Besides the statement that these variables have to be
taken into consideration when interpreting the result of
their study, no solution is provided.
[83] Project domain, Project size, Technology changes, code reuse Project domain - Select similar projects for cycle-time
baselining.
Project size - Normalize size to "assembly-equivalent
lines of code".
[107] The authors mention uncontrolled independent variables and the
Hawthorne effect [257], [258]. Evaluated projects are grouped according to potential
confounding factors (cultural differences, skill
background of employees) in non-overlapping sets.
[172] Projects of different size and from different departments. Linear regression models and hierarchical linear models.
[71] Organizational changes (management), product maturity, process
changes unrelated to the evaluated improvement initiative, and the
Hawthorne effect.
No solution provided except a reasoning why these
factors are minor threats to the internal validity of the
study.
[108] Staff size, staff training / learning curve, fixed ("overhead") costs such as
program management, and configuration management and regression
testing for multi-platform development.
Staff size - Production rates normalized to staff size.
Fixed ("overhead") costs - these costs need to be
considered in cost reduction improvement.
[109] Changes in the environment that might influence the experiment
results: company restructuring, change of development platform,
changes in product release frequency
No solution provided.
[115] Project nature (sustainment or new development), manual data
collection, different programming languages, and employee education
and experience level
Project nature - Group projects according to project
categories and evaluate the categories individually.
[15] “Conflicts about measurement goals can often influence perceptions of
success or failure on SPI initiatives”. No solution provided.
[25] The authors state that measuring the effect of a specific process action
on product quality is possible. However, the missing knowledge on
relationships between process actions and product quality makes the
measurement unreliable and therefore it cannot be generalized to all
situations.
No solution provided.
[124] The authors state that the availability and quality of historical data can
affect the result of applying their method of defect-related
measurement (BiDefect).
Data from a stable process is required if no high quality
historical data is available.
[16] Several factors that can influence the productivity values such as
language, project size, tools and technical issues. Measuring projects that use the same language, tools,
development environment and normalizing the
productivity by size (function points) can help to reduce
the influence of those factors.
[171] “The preparation rate is known to be a main independent variable for
inspection quality.” Measure preparation rates in software inspections and
take them into account when evaluating the efficiency
and effectiveness of software inspections.
[145] Development team / Test (QA) team, technology, customer Staffing, technology and the platform used are similar in
all projects. The customer is the same organizational division.
There exists no systematic way to identify confound-
ing variables [74] and as shown by the examples above,
their identification depends on the context in which the
study is conducted and on the background knowledge of
the researcher. It is therefore difficult to assure that all
confounding variables are eliminated or controlled, since
their determination relies on assumptions and sound logical
reasoning. An interesting discussion about the identification
of a confounding factor can be found in the comments by
Evanco [262], which refers to the validity of the assumption
by El Emam et al. that size is a confounding variable for
object oriented metrics [263]. El Emam et al. demonstrate
empirically that class size confounds the validity of object
oriented metrics as indicators of the fault-proneness of a
class. The comments [262], however, show that the identifi-
cation and recognition of certain confounding factors is still
disputed [264], [265].
5 CONCLUSION
This paper presents a systematic literature review that in-
vestigates how the impact of software process improve-
ment initiatives (as defined in Section 3.4.3) is measured
and evaluated. The aim is to identify and characterize the
different approaches used in realistic settings, i. e. to pro-
vide a comprehensive outline and discussion of evaluation
strategies and measurements used in the field to assess
improvement initiatives. The major findings of this review
and their implications for research are:
Incomplete context descriptions: Seventy-five out
of 148 studies did not describe, or only partially described,
the context in which the study was carried out (see
Section 3.5). In the area of process improvement it is
however critical to describe the process change and
its environment in order to provide results which
have the potential to be reused or to be transferred
into different settings. Since a considerable body of
knowledge on the impact of improvement initiatives
is provided by industry reports (53, 36%), a precise
and informative context description would be bene-
ficial for both practitioners and researchers.
Evaluation validity: In more than 50% of the stud-
ies in which improvement initiatives are evaluated,
“Pre-Post Comparison” is used individually or in
combination with another method (see Section 4.2).
Considering that confounding factors are rarely dis-
cussed (19 out of 148 studies, see Section 4.5), the
accuracy of the evaluation results can be questioned.
The severity of confounding is further increased by un-
satisfactory context descriptions. A grounded judg-
ment by the reader on the validity of the evaluation
is prevented by the absence of essential information.
Measurement validity: Kaner and Bond [241] il-
lustrated how important it is to define exactly the
semantics of a metric and the pitfalls that arise if
it is not commonly agreed what the metric actu-
ally means, i. e. which attribute it actually measures.
This issue is related to farther-reaching questions
than process improvement measurement and evalua-
tion, and concerns fundamental problems of software
measurement validity. Nevertheless, measurement
definition inconsistencies, as shown in Section 4.3.2,
inhibit the process of improvement itself since the
comparison and communication of results is hampered.
The implication for research is that it is dif-
ficult to identify and use the appropriate measures
for improvement evaluation. A better support for
defining, selecting and validating measures could
enable a comparable and meaningful evaluation of
SPI initiatives.
Measurement scope: The analysis on what is actually
measured during or after an improvement initiative
shows a focus on process and product quality (see
Section 4.3). From the software process improvement
perspective this measurement goal might be ade-
quate and sufficient. It is, however, crucial to extend the
horizon of improvement measurement beyond
the level of projects (see Section 4.4) in order to
confirm the relatively short-term measurements at
the project or product level. Since the information
needs for the different stakeholders vary, appropriate
improvement indicators need to be implemented. At
the corporate level for example, business benefits
realized by projects which encompass a wider scope
than pilot improvement implementations are of in-
terest.
Indicators for these long-term effects can be customer
satisfaction, to assess quality improvement, and re-
turn on investment to evaluate the economic benefits
of improvement. The data presented in this review
(see Section 4.3.2) suggests that these indicators tend
to be used less in the evaluation of process im-
provement than other, easier-to-collect indicators. The
implication for research is to integrate the success in-
dicators into a faceted view on process improvement
which captures its short- and long-term impact.
Confounding factors: In a majority (129, 87%) of the
reviewed studies we could not identify a discussion
on confounding factors that might affect the perfor-
mance of SPI initiatives and thus their evaluation.
Since process improvement affects many aspects of a
development project, its results and effect on the or-
ganization, there are many such potential confound-
ing factors that threaten validity. Even though study
design can often be used to limit their effects, it is often
not practical to fully control the studied context.
Thus future research on SPI should always consider
and discuss confounding factors. However, we note
that no good conceptual model or framework for
such a discussion is currently available.
The results of this review encourage further research on the
evaluation of process improvement, particularly on the con-
ception of structured guidelines which support practitioners
in the endeavor of measuring, evaluating and communicat-
ing the impact of improvement initiatives.
ACKNOWLEDGMENT
The authors thank the anonymous reviewers whose detailed
and judicious comments improved the paper considerably.
This work was partially funded by the Industrial Excellence
Center EASE - Embedded Applications Software Engineer-
ing (http://ease.cs.lth.se).
REFERENCES
[1] B. Kitchenham and S. Pfleeger, “Software quality: the elusive
target,” IEEE Softw., vol. 13, no. 1, pp. 12–21, Jan. 1996.
[2] N. Wirth, “A brief history of software engineering,” IEEE Ann.
Hist. Comput., vol. 30, no. 3, pp. 32–39, Jul. 2008.
[3] M. Shaw, “Prospects for an engineering discipline of software,”
IEEE Softw., vol. 7, no. 6, pp. 15–24, Nov. 1990.
[4] W. Scacchi, “Process models in software engineering,” Encyclope-
dia of Software Engineering, pp. 993–1005, 2001.
[5] D. I. K. Sjoberg, T. Dyba, and M. Jorgensen, “The future of
empirical methods in software engineering research,” in Future
of Software Engineering (FOSE), Minneapolis, 2007, pp. 358–378.
[6] A. Fuggetta, “Software process: a roadmap,” in Proceedings Con-
ference on The Future of Software Engineering, Limerick, Ireland,
2000, pp. 25–34.
[7] D. N. Card, “Research directions in software process improve-
ment,” in Proceedings 28th Annual International Computer Soft-
ware and Applications Conference (COMPSAC), Hong Kong, China,
2004, p. 238.
[8] D. J. Paulish and A. D. Carleton, “Case studies of software-
process-improvement measurement,” Computer, vol. 27, no.9, pp.
50–57, Sep. 1994.
[9] W. A. Florac and A. D. Carleton, Measuring the software process.
Boston: Addison-Wesley, 1999.
[10] T. Hall, N. Baddoo, and D. Wilson, “Measurement in software
process improvement programmes: An empirical study,” in New
Approaches in Software Measurement, ser. Lecture Notes in Com-
puter Science. Berlin, Germany: Springer, 2001, vol. 2006, pp.
73–82.
[11] T. Dyba, “An empirical investigation of the key factors for suc-
cess in software process improvement,” IEEE Trans. Softw. Eng.,
vol. 31, no. 5, pp. 410–424, May 2005.
[12] D. Goldenson, K. E. Emam, J. Herbsleb, and C. Deephouse,
“Empirical studies of software process assessment methods,”
Kaiserslautern: Fraunhofer - Institute for Experimental Software
Engineering, Tech. Rep. ISERN-97-09, 1996. [Online]. Available:
http://www.ehealthinformation.ca/documents/isern-97-09.pdf
[13] L. Mathiassen, O. Ngwenyama, and I. Aaen, “Managing change
in software process improvement,” IEEE Softw., vol. 22, no. 6, pp.
84–91, Nov. 2005.
[14] M. Brown and D. Goldenson, “Measurement and analysis: What
can and does go wrong?” in Proceedings 10th International Sympo-
sium on Software Metrics (METRICS), Chicago, 2004, pp. 131–138.
[15] J. Iversen and O. Ngwenyama, “Problems in measuring effec-
tiveness in software process improvement: A longitudinal study
of organizational change at Danske Data,” International Journal of
Information Management, vol. 26, no. 1, pp. 30–43, Feb. 2006.
[16] A. I. F. Ferreira, G. Santos, R. Cerqueira, M. Montoni, A. Barreto,
A. R. Rocha, A. O. S. Barreto, and R. C. Silva, “ROI of software
process improvement at BL Informatica: SPIdex is really worth
it,” Software Process Improvement and Practice, vol. 13, no. 4, pp.
311–318, Jul. 2008.
[17] P. Mohagheghi and R. Conradi, “An empirical investigation
of software reuse benefits in a large telecom product,” ACM
Transactions on Software Engineering and Methodology, vol. 17, no. 3,
pp. 1–31, Jun. 2008.
[18] C. Redzic and J. Baik, “Six sigma approach in software quality
improvement,” in Proceedings 4th International Conference on Soft-
ware Engineering Research, Management and Applications (SERA),
Seattle, 2006, pp. 396–406.
[19] G. Canfora, F. Garcia, M. Piattini, F. Ruiz, and C. Visaggio,
“Applying a framework for the improvement of software process
maturity,” Software - Practice and Experience, vol. 36, no. 3, pp. 283–
304, Mar. 2006.
[20] I. Sommerville and J. Ransom, “An empirical study of indus-
trial requirements engineering process assessment and improve-
ment,” ACM Transactions on Software Engineering and Methodology,
vol. 14, no. 1, pp. 85–117, Jan. 2005.
[21] K. Hyde and D. Wilson, “Intangible benefits of CMM-based
software process improvement,” Software Process Improvement and
Practice, vol. 9, no. 4, pp. 217–28, Oct. 2004.
[22] D. Goldenson and D. Gibson, “Demonstrating the
impact and benefits of CMMI: an update and
preliminary results,” Software Engineering Institute,
Tech. Rep. CMU/SEI-2003-SR-009, 2003. [Online]. Available:
http://www.sei.cmu.edu/library/abstracts/reports/03sr009.cfm
[23] R. Achatz and F. Paulisch, “Industrial strength software and
quality: software and engineering at Siemens,” in Proceedings 3rd
International Conference on Quality Software (QSIC), Dallas, 2003,
pp. 321–6.
[24] M. Murugappan and G. Keeni, “Blending CMM and six sigma to
meet business goals,” IEEE Softw., vol. 20, no. 2, pp. 42–8, Mar.
2003.
[25] J. Trienekens, R. Kusters, and R. van Solingen, “Product focused
software process improvement: concepts and experiences from
industry,” Software Quality Journal, vol. 9, no. 4, pp. 269–81, Dec.
2001.
[26] T. Gorschek and A. Davis, “Requirements engineering: In search
of the dependent variables,” Information and Software Technology,
vol. 50, no. 1-2, pp. 67–75, Jan. 2008.
[27] B. Kitchenham and S. Charters, “Guidelines for performing
systematic literature reviews in software engineering,” Software
Engineering Group, Keele University and Department of Com-
puter Science, University of Durham, United Kingdom, Technical
Report EBSE-2007-01, 2007.
[28] G. Cugola and C. Ghezzi, “Software processes: a retrospective
and a path to the future,” Software Process: Improvement and
Practice, vol. 4, no. 3, pp. 101–123, Sep. 1998.
[29] W. E. Deming, Out of the crisis. Cambridge: MIT Press, 1986.
[30] W. S. Humphrey, “Introduction to software process
improvement,” Software Engineering Institute, Tech.
Rep. CMU/SEI-92-TR7, 1993. [Online]. Available:
ftp://ftp.sei.cmu.edu/public/documents/92.reports/pdf/tr07.92.pdf
[31] C. Fox and W. Frakes, “The quality approach: is it delivering?”
Comm. ACM, vol. 40, no. 6, pp. 24–29, Jun. 1997.
[32] T. Gorschek and C. Wohlin, “Packaging software process im-
provement issues: a method and a case study,” Software: Practice
and Experience, vol. 34, no. 14, pp. 1311–1344, Nov. 2004.
[33] F. Pettersson, M. Ivarsson, T. Gorschek, and P. Öhman, “A
practitioner’s guide to light weight software process assessment
and improvement planning,” The Journal of Systems and Software,
vol. 81, no. 6, pp. 972–995, Jun. 2008.
[34] M. C. Paulk, C. V. Weber, B. Curtis, and M. B. Chrissis, The
Capability Maturity Model: Guidelines for Improving the Software
Process. Boston: Addison-Wesley, 1995.
[35] B. Boehm, “A view of 20th and 21st century software engi-
neering,” in Proceedings 28th International Conference on Software
Engineering (ICSE), Shanghai, China, 2006, pp. 12–29.
[36] M. C. Paulk, B. Curtis, M. B. Chrissis, and
C. V. Weber, “Capability maturity model for
software version 1.1,” Software Engineering Institute,
Carnegie Mellon University, Pittsburgh, USA, Technical
Report CMU/SEI-93-TR-024, 1993. [Online]. Available:
ftp://ftp.sei.cmu.edu/pub/documents/93.reports/pdf/tr24.93.pdf
[37] M. C. Paulk, C. V. Weber, S. M. Garcia, M. B. Chrissis,
and M. Bush, “Key practices of the capability maturity
model SM, version 1.1,” Software Engineering Institute,
Tech. Rep. CMU/SEI-93-TR-025, 1993. [Online]. Available:
ftp://ftp.sei.cmu.edu/pub/documents/93.reports/pdf/tr25.93.pdf
[38] D. M. Ahern, R. Turner, and A. Clouse, CMMI(SM) Distilled: A
Practical Introduction to Integrated Process Improvement. Boston:
Addison-Wesley, 2001.
[39] “Capability maturity model integration (CMMI), version
1.1 (Staged representation),” Carnegie Mellon Software
Engineering Institute, Pittsburgh, USA, Technical
Report CMU/SEI-2002-TR-012, 2002. [Online]. Available:
http://www.sei.cmu.edu/library/abstracts/reports/02tr012.cfm
[40] “Capability maturity model integration (CMMI),
version 1.1 (Continuous representation),” Soft-
ware Engineering Institute, Technical Report
CMU/SEI-2002-TR-011, 2002. [Online]. Available:
http://www.sei.cmu.edu/library/abstracts/reports/02tr011.cfm
[41] K. El Emam, J. Drouin, and W. Melo, SPICE: The Theory and Prac-
tice of Software Process Improvement and Capability Determination.
Los Alamitos: IEEE Comput. Soc., 1998.
[42] “ISO/IEC TR2 15504 - software process assessment - part 7:
Guide for use in process improvement,” ISO, Geneva, Switzer-
land, Technical Report ISO/IEC TR2 15504, 1998.
[43] “ISO/IEC TR2 15504 - software process assessment: Part 1 - part
9,” ISO, Geneva, Switzerland, Technical Report ISO/IEC TR2
15504, 1998.
[44] M. Thomas and F. McGarry, “Top-down vs. bottom-up process
improvement,” IEEE Softw., vol. 11, no. 4, pp. 12–13, Jul. 1994.
[45] V. Basili, “The experience factory and its relationship to other
improvement paradigms,” in Software Engineering - ESEC ’93, ser.
Lecture Notes in Computer Science. London, UK: Springer, 1993,
vol. 717, pp. 68–83.
[46] V. Basili and G. Caldiera, “Improve software quality by reusing
knowledge and experience,” Sloan Management Review, vol. 37,
no. 1, pp. 55–64, Oct. 1995.
[47] O. Gómez, H. Oktaba, M. Piattini, and F. Garcia, “A systematic
review measurement in software engineering: State-of-the-Art in
measures,” in Software and Data Technologies, ser. Communica-
tions in Computer and Information Science. Berlin, Germany:
Springer, 2008, vol. 10, pp. 165–176.
[48] C. G. P. Bellini, R. C. F. Pereira, and J. L. Becker, “Measurement
in software engineering: From the roadmap to the crossroads,”
International Journal of Software Engineering and Knowledge Engi-
neering, vol. 18, no. 1, pp. 37–64, Feb. 2008.
[49] B. Kitchenham, “What’s up with software metrics? - a prelim-
inary mapping study,” Journal of Systems and Software, vol. 83,
no. 1, pp. 37–51, Jan. 2010.
[50] S. Chidamber and C. Kemerer, “A metrics suite for object oriented
design,” IEEE Trans. Softw. Eng., vol. 20, no. 6, pp. 476–493, Jun.
1994.
[51] M. Hitz and B. Montazeri, “Chidamber and Kemerer’s metrics
suite: a measurement theory perspective,” IEEE Trans. Softw. Eng.,
vol. 22, no. 4, pp. 267–271, Apr. 1996.
[52] J. Iversen and L. Mathiassen, “Cultivation and engineering of a
software metrics program,” Information Systems Journal, vol. 13,
no. 1, pp. 3–19, Jan. 2003.
[53] J. C. de Almeida Biolchini, P. G. Mian, A. C. C. Natali, T. U. Conte,
and G. H. Travassos, “Scientific research ontology to support
systematic review in software engineering,” Advanced Engineering
Informatics, vol. 21, no. 2, pp. 133–151, Apr. 2007.
[54] M. Unterkalmsteiner, T. Gorschek, A. K. M. M. Islam, C. K.
Cheng, R. B. Permadi, and R. Feldt, “Extended material to
"Evaluation and measurement of software process improvement
- a systematic literature review",” 2010. [Online]. Available:
http://www.bth.se/com/mun.nsf/pages/spi-sysrev-material
[55] D. K. Dunaway and S. Masters, “CMM®-Based appraisal
for internal process improvement (CBA IPI) version 1.2
method description,” Software Engineering Institute, Carnegie
Mellon, Technical Report CMU/SEI-2001-TR-033, 2001. [Online].
Available: http://www.sei.cmu.edu/reports/01tr033.pdf
[56] P. Byrnes and M. Phillips, “Software capability
evaluation version 3.0 method description,” Software
Engineering Institute, Carnegie Mellon, Technical
Report CMU/SEI-96-TR-002, 1996. [Online]. Available:
ftp://ftp.sei.cmu.edu/public/documents/96.reports/pdf/tr002.96.pdf
[57] “Appraisal requirements for CMMI, version 1.2
(ARC, v1.2),” Software Engineering Institute,
Carnegie Mellon, Pittsburgh, USA, Technical Report
CMU/SEI-2006-TR-011, 2006. [Online]. Available:
http://www.sei.cmu.edu/library/abstracts/reports/06tr011.cfm
[58] J. Fleiss, “Measuring nominal scale agreement among many
raters,” Psychological Bulletin, vol. 76, no. 5, pp. 378–382, Nov.
1971.
[59] J. R. Landis and G. G. Koch, “The measurement of observer
agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–
174, Mar. 1977.
[60] M. Staples and M. Niazi, “Experiences using systematic review
guidelines,” Journal of Systems and Software, vol. 80, no. 9, pp.
1425–1437, Sep. 2007.
[61] S. Easterbrook, J. Singer, M. Storey, and D. Damian, “Selecting
empirical methods for software engineering research,” in Guide to
Advanced Empirical Software Engineering. London, UK: Springer,
2008, pp. 285–311.
[62] M. V. Zelkowitz and D. Wallace, “Experimental validation in soft-
ware engineering,” Information and Software Technology, vol. 39,
no. 11, pp. 735–743, 1997.
[63] S. L. Pfleeger and B. Kitchenham, “Principles of survey research:
part 1: turning lemons into lemonade,” ACM SIGSOFT Software
Engineering Notes, vol. 26, no. 6, pp. 16–18, Nov. 2001.
[64] C. Seaman, “Qualitative methods in empirical studies of software
engineering,” IEEE Trans. Softw. Eng., vol. 25, no. 4, pp. 557–72,
Jul. 1999.
[65] R. Davison, M. G. Martinsons, and N. Kock, “Principles of
canonical action research,” Information Systems Journal, vol. 14,
no. 1, pp. 65–86, Jan. 2004.
[66] “Enterprise - SME definition,” Aug. 2009. [Online]. Available:
http://ec.europa.eu/enterprise/enterprise_policy/sme_definition/index_en.htm
[67] M. K. Daskalantonakis, “A practical view of software measure-
ment and implementation experiences within Motorola,” IEEE
Trans. Softw. Eng., vol. 18, no. 11, pp. 998–1010, Nov. 1992.
[68] M. Russ and J. McGregor, “A software development process for
small projects,” IEEE Softw., vol. 17, no. 5, pp. 96–101, Sep. 2000.
[69] A. Ferreira, G. Santos, R. Cerqueira, M. Montoni, A. Barreto, A. S.
Barreto, and A. Rocha, “Applying ISO 9001:2000, MPS.BR and
CMMI to achieve software process maturity: BL Informatica’s
pathway,” in Proceedings 29th International Conference on Software
Engineering (ICSE), Minneapolis, 2007, pp. 642–651.
[70] K. Sakamoto, N. Niihara, T. Tanaka, K. Nakakoji, and K. Kishida,
“Analysis of software process improvement experience using the
project visibility index,” in Proceedings 3rd Asia-Pacific Software
Engineering Conference (APSEC), Seoul, South Korea, 1996, pp.
139–48.
[71] D. Damian and J. Chisan, “An empirical study of the complex
relationships between requirements engineering processes and
other processes that lead to payoffs in productivity, quality, and
risk management,” IEEE Trans. Softw. Eng., vol. 32, no. 7, pp. 433–
453, Jul. 2006.
[72] L. Buglione and A. Abran, “ICEBERG: a different look at software
project management,” in Proceedings 12th International Workshop
on Software Measurement (IWSM), Magdeburg, Germany, 2002, pp.
153–167.
[73] C. Wohlin, M. Höst, and K. Henningsson, “Empirical research
methods in software engineering,” in Empirical Methods and
Studies in Software Engineering, ser. Lecture Notes in Computer
Science. Berlin, Germany: Springer, 2003, vol. 2765, pp. 7–23.
[74] J. Pearl, “Why there is no statistical test for confounding, why
many think there is, and why they are almost right,” Department
of Statistics, UCLA, Jul. 1998.
[75] B. Kitchenham, L. Pickard, and S. Pfleeger, “Case studies for
method and tool evaluation,” IEEE Softw., vol. 12, no. 4, pp. 52–
62, Jul. 1995.
[76] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and
A. Wesslén, Experimentation in software engineering: an introduction.
Norwell: Kluwer Academic Publishers, 2000.
[77] D. E. Perry, A. A. Porter, and L. G. Votta, “Empirical studies of
software engineering: a roadmap,” in Proceedings Conference on
The Future of Software Engineering, Limerick, Ireland, 2000, pp.
345–355.
[78] K. Petersen and C. Wohlin, “Context in industrial software en-
gineering research,” in Proceedings 3rd International Symposium on
Empirical Software Engineering and Measurement, Orlando, 2009,
pp. 401–404.
[79] T. Saracevic, “Evaluation of evaluation in information retrieval,”
in Proceedings 18th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, Seattle, 1995, pp.
138–146.
[80] F. Garcia, M. F. Bertoa, C. Calero, A. Vallecillo, F. Ruiz, M. Piattini,
and M. Genero, “Towards a consistent terminology for software
measurement,” Information and Software Technology, vol. 48, no. 8,
pp. 631–644, Aug. 2006.
[81] V. Raghavan, P. Bollmann, and G. S. Jung, “A critical investi-
gation of recall and precision as measures of retrieval system
performance,” ACM Transactions on Information Systems, vol. 7,
no. 3, pp. 205–229, Jul. 1989.
[82] P. Brereton, B. A. Kitchenham, D. Budgen, M. Turner, and
M. Khalil, “Lessons from applying the systematic literature re-
view process within the software engineering domain,” Journal of
Systems and Software, vol. 80, no. 4, pp. 571–583, Apr. 2007.
[83] M. Diaz and J. Sligo, “How software process improvement
helped Motorola,” IEEE Softw., vol. 14, no. 5, pp. 75–80, Sep. 1997.
[84] G. Santos, M. Montoni, J. Vasconcellos, S. Figueiredo, R. Cabral,
C. Cerdeiral, A. Katsurayama, P. Lupo, D. Zanetti, and A. Rocha,
“Implementing software process improvement initiatives in
small and medium-size enterprises in Brazil,” in 6th International
Conference on the Quality of Information and Communications Tech-
nology (QUATIC), Lisbon, Portugal, 2007, pp. 187–198.
[85] “Improving processes in small settings (IPSS) - a white
paper,” The International Process Research Consortium
(IPRC), Pittsburgh, USA, Tech. Rep., 2006. [Online]. Available:
http://www.sei.cmu.edu/iprc/ipss-white-paper-v1-1.pdf
[86] P. Runeson and M. Höst, “Guidelines for conducting and re-
porting case study research in software engineering,” Empirical
Software Engineering, vol. 14, no. 2, pp. 131–164, Apr. 2009.
[87] “2008 annual report,” Motorola, Inc., Annual Report
Motorola, Inc. 2008 Form 10-K, 2009. [Online]. Available:
http://investor.motorola.com/annuals.cfm
[88] B. Regnell, P. Beremark, and O. Eklundh, “A market-driven
requirements engineering process: results from an industrial pro-
cess improvement programme,” Requirements Engineering, vol. 3,
no. 2, pp. 121–9, Jun. 1998.
[89] M. Visconti and L. Guzman, “A measurement-based approach
for implanting SQA and SCM practices,” in Proceedings 20th
International Conference of the Chilean Computer Science Society
(SCCC), Santiago, Chile, 2000, pp. 126–34.
[90] D. Karlström, P. Runeson, and S. Norden, “A minimal test prac-
tice framework for emerging software organizations,” Software
Testing, Verification and Reliability, vol. 15, no. 3, pp. 145–66, Sep.
2005.
[91] O. Salo and P. Abrahamsson, “An iterative improvement process
for agile software development,” Software Process Improvement and
Practice, vol. 12, no. 1, pp. 81–100, Jan. 2007.
[92] A. Roan and P. Hebrard, “A PIE one year after: APPLY,” in Inter-
national Conference on Product Focused Software Process Improvement
(VTT Symposium), Oulu, Finland, 1999, pp. 606–19.
[93] S. Hwang and H. Kim, “A study on metrics for supporting the
software process improvement based on SPICE,” in Software Engi-
neering Research and Applications, ser. Lecture Notes in Computer
Science. Berlin, Germany: Springer, 2005, vol. 3647, pp. 71–80.
[94] C. Hollenbach, R. Young, A. Pflugrad, and D. Smith, “Combining
quality and software improvement,” Comm. ACM, vol. 40, no. 6,
pp. 41–5, Jun. 1997.
[95] S. Morad and T. Kuflik, “Conventional and open source software
reuse at Orbotech - an industrial experience,” in Proceedings Inter-
national Conference on Software - Science, Technology and Engineering
(SWSTE), Herzlia, Israel, 2005, pp. 110–17.
[96] G. Giraudo and P. Tonella, “Designing and conducting an empir-
ical study on test management automation,” Empirical Software
Engineering, vol. 8, no. 1, pp. 59–81, Mar. 2003.
[97] C. Ebert and J. D. Man, “e-R&D - effectively managing process
diversity,” Annals of Software Engineering, vol. 14, no. 1-4, pp. 73–
91, Dec. 2002.
[98] J. Jarvinen and R. van Solingen, “Establishing continuous assess-
ment using measurements,” in Proceedings 1st Internation Confer-
ence on Product Focused Software Process Improvement (PROFES),
Oulu, Finland, 1999, pp. 49–67.
[99] G. Spork and U. Pichler, “Establishment of a performance
driven improvement programme,” Software Process Improvement
and Practice, vol. 13, no. 4, pp. 371–382, Jul. 2008.
[100] C. von Wangenheim, S. Weber, J. Hauck, and G. Trentin, “Expe-
riences on establishing software processes in small companies,”
Information and Software Technology, vol. 48, no. 9, pp. 890–900,
Sep. 2006.
[101] G. Cuevas, J. C. Manzano, T. S. Feliu, J. Mejia, M. Munoz, and
S. Bayona, “Impact of TSPi on software projects,” in 4th Congress
of Electronics, Robotics and Automotive Mechanics (CERMA), Cuer-
navaca, Mexico, 2007, pp. 706–711.
[102] A. Borjesson, “Improve by improving software process im-
provers,” International Journal of Business Information Systems,
vol. 1, no. 3, pp. 310–38, Jan. 2006.
[103] F. Titze, “Improvement of a configuration management system,”
in Proceedings 22nd International Conference on Software Engineering
(ICSE), Limerick, Ireland, 2000, pp. 618–25.
[104] T. Tanaka, K. Sakamoto, S. Kusumoto, K. Matsumoto, and
T. Kikuno, “Improvement of software process by process de-
scription and benefit estimation,” in Proceedings 17th International
Conference on Software Engineering (ICSE), Seattle, 1995, pp. 123–
32.
[105] H. Leung, “Improving defect removal effectiveness for software
development,” in Proceedings 2nd Euromicro Conference on Software
Maintenance and Reengineering (CSMR), Florence, Italy, 1998, pp.
157–64.
[106] B. Anda, E. Angelvik, and K. Ribu, “Improving estimation prac-
tices by applying use case models,” in Product Focused Software
Process Improvement, ser. Lecture Notes in Computer Science.
Berlin, Germany: Springer, 2002, vol. 2559, pp. 383–97.
[107] C. Ebert, C. H. Parro, R. Suttels, and H. Kolarczyk, “Improv-
ing validation activities in a global software development,” in
Proceedings 23rd International Conference on Software Engineering
(ICSE), Toronto, Canada, 2001, pp. 545–554.
[108] J. A. Lane and D. Zubrow, “Integrating measurement with im-
provement: An action-oriented approach,” in Proceedings 19th
International Conference on Software Engineering (ICSE), Boston,
1997, pp. 380–389.
[109] J. Larsen and H. Roald, “Introducing ClearCase as a process
improvement experiment,” in System Configuration Management,
ser. Lecture Notes in Computer Science. Berlin, Germany:
Springer, 1998, vol. 1439, pp. 1–12.
[110] J. Dick and E. Woods, “Lessons learned from rigorous sys-
tem software development,” Information and Software Technology,
vol. 39, no. 8, pp. 551–560, Aug. 1997.
[111] C. Debou and A. Kuntzmann-Combelles, “Linking software
process improvement to business strategies: experiences from
industry,” Software Process Improvement and Practice, vol. 5, no. 1,
pp. 55–64, Mar. 2000.
[112] J. Zettell, F. Maurer, J. Münch, and L. Wong, “LIPE: a lightweight
process for e-business startup companies based on extreme pro-
gramming,” in Product Focused Software Process Improvement, ser.
Lecture Notes in Computer Science. Berlin, Germany: Springer,
2001, vol. 2188, pp. 255–70.
[113] K. Kautz, “Making sense of measurement for small organiza-
tions,” IEEE Softw., vol. 16, no. 2, pp. 14–20, Mar. 1999.
[114] M. Winokur, A. Grinman, I. Yosha, and R. Gallant, “Measuring
the effectiveness of introducing new methods in the software de-
velopment process,” in Proceedings 24th EUROMICRO Conference
(EUROMICRO), Västerås, Sweden, 1998, pp. 800–7.
[115] R. Grable, J. Jernigan, C. Pogue, and D. Divis, “Metrics for small
projects: Experiences at the SED,” IEEE Softw., vol. 16, no. 2, pp.
21–9, Mar. 1999.
[116] J. Haugh, “Never make the same mistake twice - using configuration control and error analysis to improve software quality,” in
Proceedings 10th Digital Avionics Systems Conference, Los Angeles,
1991, pp. 220–5.
[117] J. Jarvinen, D. Hamann, and R. van Solingen, “On integrating
assessment and measurement: towards continuous assessment of
software engineering processes,” in Proceedings 6th International
Software Metrics Symposium (METRICS), Boca Raton, 1999, pp. 22–
30.
[118] P. Abrahamsson and K. Kautz, “The personal software process:
Experiences from Denmark,” in Proceedings 28th Euromicro Con-
ference (EUROMICRO), Dortmund, Germany, 2002, pp. 367–74.
[119] A. Cater-Steel, M. Toleman, and T. Rout, “Process improvement
for small firms: An evaluation of the RAPID assessment-based
method,” Information and Software Technology, vol. 48, no. 5, pp.
323–334, May 2006.
[120] Z. Xiaosong, H. Zhen, Zhang Min, W. Jing, and Y. Dainuan, “Process integration of six sigma and CMMI,” in Proceedings 6th International Conference on Industrial Informatics (INDIN), Singapore, 2008, pp. 1650–1653.
[121] K. Taneike, H. Okada, H. Ishigami, and H. Mukaiyama, “Quality
assurance activities for enterprise application software pack-
ages,” Fujitsu Scientific and Technical Journal, vol. 44, no. 2, pp.
106–113, Apr. 2008.
[122] H. Kihara, “Quality assurance activities in the software devel-
opment center, Hitachi Ltd,” in Proceedings 16th Annual Pacific
Northwest Software Quality Conference Joint ASQ Software Division’s
8th International Conference on Software Quality, Portland, 1998, pp.
372–84.
[123] A. Kuntzmann-Combelles, “Quantitative approach to software
process improvement,” in Objective Software Quality, ser. Lecture
Notes in Computer Science. Berlin, Germany: Springer, 1995,
vol. 926, pp. 16–30.
[124] L. Gou, Q. Wang, J. Yuan, Y. Yang, M. Li, and N. Jiang, “Quan-
titatively managing defects for iterative projects: An industrial
experience report in China,” in Making Globally Distributed Soft-
ware Development a Success Story, ser. Lecture Notes in Computer
Science. Berlin, Germany: Springer, 2008, vol. 5007, pp. 369–380.
[125] J. Momoh and G. Ruhe, “Release planning process improvement
- an industrial case study,” Software Process Improvement and
Practice, vol. 11, no. 3, pp. 295–307, May 2006.
[126] Z. Xiaosong, H. Zhen, G. Fangfang, and Z. Shenqing, “Research
on the application of six sigma in software process improve-
ment,” in Proceedings 4th International Conference on Intelligent
Information Hiding and Multimedia Signal Processing (IIH-MSP),
Harbin, China, 2008, pp. 937–940.
[127] L. Damm and L. Lundberg, “Results from introducing
component-level test automation and Test-Driven development,”
Journal of Systems and Software, vol. 79, no. 7, pp. 1001–14, Jul.
2006.
[128] H. Wohlwend and S. Rosenbaum, “Schlumberger’s software im-
provement program,” IEEE Trans. Softw. Eng., vol. 20, no. 11, pp.
833–9, Nov. 1994.
[129] V. Basili, M. Zelkowitz, F. McGarry, J. Page, S. Waligora, and
R. Pajerski, “SEL’s software process improvement program,”
IEEE Softw., vol. 12, no. 6, pp. 83–7, Nov. 1995.
[130] C. Buchman, “Software process improvement at AlliedSignal
Aerospace,” in Proceedings 29th Hawaii International Conference on
System Sciences (HICSS), Wailea, 1996, pp. 673–80.
[131] T. J. Haley, “Software process improvement at Raytheon,” IEEE
Softw., vol. 13, no. 6, pp. 33–41, Nov. 1996.
[132] A. Ahmed, M. Fraz, and F. Zahid, “Some results of experimenta-
tion with extreme programming paradigm,” in Proceedings 7th
International Multi-Topic Conference (INMIC), Lahore, Pakistan,
2004, pp. 387–90.
[133] J. Batista and A. D. de Figueiredo, “SPI in a very small team: a
case with CMM,” Software Process Improvement and Practice, vol. 5,
no. 4, pp. 243–50, Dec. 2000.
[134] K. Nelson, M. Buche, and H. Nelson, “Structural change and
change advocacy: a study in becoming a software engineering
organization,” in Proceedings 34th Annual Hawaii International
Conference on System Sciences (HICSS), Maui, 2001, 9 pp.
[135] J. W. Lee, S. H. Jung, S. C. Park, Y. J. Lee, and Y. C. Jang, “System
based SQA and implementation of SPI for successful projects,”
in Proceedings International Conference on Information Reuse and
Integration (IRI), Las Vegas, 2005, pp. 494–499.
[136] C. Ebert, “Technical controlling and software process improve-
ment,” Journal of Systems and Software, vol. 46, no. 1, pp. 25–39,
Apr. 1999.
[137] T. Nishiyama, K. Ikeda, and T. Niwa, “Technology transfer
macro-process. a practical guide for the effective introduction
of technology,” in Proceedings 22nd International Conference on
Software Engineering (ICSE), Limerick, Ireland, 2000, pp. 577–86.
[138] L. Pracchia, “The AV-8B team learns synergy of EVM and TSP
accelerates software process improvement,” CrossTalk, no. 1, pp.
20–22, 2004.
[139] C. Ebert, “The impacts of software product management,” Journal
of Systems and Software, vol. 80, no. 6, pp. 850–861, Jun. 2007.
[140] K. Alagarsamy, S. Justus, and K. Iyakutti, “The knowledge based
software process improvement program: A rational analysis,” in
International Conference on Software Engineering Advances (ICSEA),
Cap Esterel, France, 2007, p. 61.
[141] P. Abrahamsson and K. Kautz, “Personal software process: class-
room experiences from Finland,” in Software Quality - ESCQ
2002, ser. Lecture Notes in Computer Science. Berlin, Germany:
Springer, 2002, vol. 2349, pp. 175–85.
[142] C. Wohlin and A. Wesslen, “Understanding software defect
detection in the personal software process,” in Proceedings 9th
International Symposium on Software Reliability Engineering (ISSRE),
Paderborn, Germany, 1998, pp. 49–58.
[143] W. Humphrey, “Using a defined and measured personal software
process,” IEEE Softw., vol. 13, no. 3, pp. 77–88, May 1996.
[144] N. Davis, J. Mullaney, and D. Carrington, “Using measurement
data in a TSP project,” in Software Process Improvement, ser.
Lecture Notes in Computer Science. Berlin, Germany: Springer,
2004, vol. 3281, pp. 91–101.
[145] F. Downey and G. Coleman, “Using SPI to achieve delivery ob-
jectives in e-commerce software development,” Software Process
Improvement and Practice, vol. 13, no. 4, pp. 327–333, Jul. 2008.
[146] G. Seshagiri and S. Priya, “Walking the talk: building quality
into the software quality management tool,” in Proceedings 3rd
International Conference on Quality Software (QSIC), Dallas, 2003,
pp. 67–74.
[147] F. McGarry, “What is a level 5?” in Proceedings 26th Annual NASA
Goddard Software Engineering Workshop (SEW), Greenbelt, 2002,
pp. 83–90.
[148] C. Tischer, A. Müller, M. Ketterer, and L. Geyer, “Why does it take
that long? Establishing product lines in the automotive domain,”
in Proceedings 11th International Software Product Line Conference
(SPLC), Kyoto, Japan, 2007, pp. 269–274.
[149] S. Shah and J. Sutton, “Crafting a TQM-oriented software de-
velopment lifecycle: program experience,” in Proceedings National
Aerospace and Electronics Conference (NAECON), Dayton, 1992, pp.
643–9.
[150] B. Shen and D. Ju, “On the measurement of agility in software
process,” in Software Process Dynamics and Agility, ser. Lecture
Notes in Computer Science. Berlin, Germany: Springer, 2007,
vol. 4470, pp. 25–36.
[151] R. Xu, Y. Xue, P. Nie, Y. Zhang, and D. Li, “Research on CMMI-
based software process metrics,” in Proceedings 1st International Multi-Symposium on Computer and Computational Sciences (IMSCCS), Hangzhou,
China, 2006, pp. 391–397.
[152] J. Hössler, O. Kath, M. Soden, M. Born, and S. Saito, “Significant
productivity enhancement through model driven techniques: a
success story,” in Proceedings 10th International Enterprise Dis-
tributed Object Computing Conference (EDOC), Hong Kong, China,
2006, pp. 367–373.
[153] L. Prechelt and B. Unger, “An experiment measuring the effects
of personal software process (PSP) training,” IEEE Trans. Softw.
Eng., vol. 27, no. 5, pp. 465–472, May 2001.
[154] M. T. Baldassarre, A. Bianchi, D. Caivano, and G. Visaggio,
“An industrial case study on reuse oriented development,” in
Proceedings 21st International Conference on Software Maintenance
(ICSM), Budapest, Hungary, 2005, pp. 283–92.
[155] G. Visaggio, P. Ardimento, M. Baldassarre, and D. Caivano,
“Assessing multiview framework (MF) comprehensibility and
efficiency: A replicated experiment,” Information and Software
Technology, vol. 48, no. 5, pp. 313–22, May 2006.
[156] J. Ramil and M. Lehman, “Defining and applying metrics in
the context of continuing software evolution,” in Proceedings
7th International Software Metrics Symposium (METRICS), London,
UK, 2000, pp. 199–209.
[157] K. El Emam and N. Madhavji, “Does organizational maturity
improve quality?” IEEE Softw., vol. 13, no. 5, pp. 109–10, Sep.
1996.
[158] D. Winkler, B. Thurnher, and S. Biffl, “Early software product
improvement with sequential inspection sessions: an empirical
investigation of inspector capability and learning effects,” in
Proceedings 33rd Euromicro Conference on Software Engineering and
Advanced Applications (EUROMICRO), Lübeck, Germany, 2007,
pp. 245–54.
[159] K. Nelson and M. Ghods, “Evaluating the contributions of a
structured software development and maintenance methodol-
ogy,” Information Technology & Management, vol. 3, no. 1-2, pp.
11–23, Jan. 2002.
[160] L. Suardi, “How to manage your software product life cycle with
MAUI,” Comm. ACM, vol. 47, no. 3, pp. 89–94, Mar. 2004.
[161] N. Schneidewind, “Measuring and evaluating maintenance pro-
cess using reliability, risk, and test metrics,” IEEE Trans. Softw.
Eng., vol. 25, no. 6, pp. 769–81, Nov. 1999.
[162] Q. Wang and M. Li, “Measuring and improving software process
in China,” in Proceedings International Symposium on Empirical
Software Engineering, Noosa Heads, Australia, 2005, pp. 183–192.
[163] R. Seacord, J. Elm, W. Goethert, G. Lewis, D. Plakosh, J. Robert,
L. Wrage, and M. Lindvall, “Measuring software sustainability,”
in Proceedings International Conference on Software Maintenance
(ICSM), Amsterdam, The Netherlands, 2003, pp. 450–9.
[164] J. H. Hayes, N. Mohamed, and T. H. Gao, “Observe-mine-adopt
(OMA): an agile way to enhance software maintainability,” Journal of Software Maintenance and Evolution, vol. 15, no. 5, pp. 297–
323, Sep. 2003.
[165] L. He and J. Carver, “PBR vs. checklist: A replication in the n-fold
inspection context,” in Proceedings 5th International Symposium on
Empirical Software Engineering, Rio de Janeiro, Brazil, 2006, pp.
95–104.
[166] J. Henry, A. Rossman, and J. Snyder, “Quantitative evaluation of
software process improvement,” Journal of Systems and Software,
vol. 28, no. 2, pp. 169–177, Feb. 1995.
[167] S. I. Hashmi and J. Baik, “Quantitative process improvement
in XP using six sigma tools,” in Proceedings 7th International
Conference on Computer and Information Science (ICIS), Portland,
2008, pp. 519–524.
[168] B. Moreau, C. Lassudrie, B. Nicolas, O. l’Homme,
C. d’Anterroches, and G. L. Gall, “Software quality improvement
in France Telecom research center,” Software Process Improvement
and Practice, vol. 8, no. 3, pp. 135–44, Jul. 2003.
[169] T. Galinac and Z. Car, “Software verification process improve-
ment proposal using six sigma,” in Product Focused Software
Process Improvement, ser. Lecture Notes in Computer Science.
Berlin, Germany: Springer, 2007, vol. 4589, pp. 51–64.
[170] C. Ebert, “Understanding the product life cycle: four key require-
ments engineering techniques,” IEEE Softw., vol. 23, no. 3, pp.
19–25, May 2006.
[171] M. van Genuchten, C. van Dijk, H. Scholten, and D. Vogel, “Using
group support systems for software inspections,” IEEE Softw.,
vol. 18, no. 3, pp. 60–5, May 2001.
[172] J. Schalken, S. Brinkkemper, and H. van Vliet, “Using linear
regression models to analyse the effect of software process im-
provement,” in Product-Focused Software Process Improvement, ser.
Lecture Notes in Computer Science. Berlin, Germany: Springer,
2006, vol. 4034, pp. 234–248.
[173] B. Freimut, C. Denger, and M. Ketterer, “An industrial case
study of implementing and validating defect classification for
process improvement and quality management,” in Proceedings
11th International Software Metrics Symposium (METRICS), Como,
Italy, 2005, pp. 165–174.
[174] S. Otoya and N. Cerpa, “An experience: a small software com-
pany attempting to improve its process,” in Proceedings 9th In-
ternational Workshop Software Technology and Engineering Practice
(STEP), Pittsburgh, 1999, pp. 153–60.
[175] K. A. McKeown and E. G. McGuire, “Evaluation of a metrics
framework for product and process integrity,” in Proceedings 33rd
Hawaii International Conference on System Sciences (HICSS), Maui,
2000, p. 4046.
[176] M. Höst and C. Johansson, “Evaluation of code review methods
through interviews and experimentation,” Journal of Systems and Software, vol. 52, no. 2-3, pp. 113–20, Jun. 2000.
[177] B. List, R. Bruckner, and J. Kapaun, “Holistic software process
performance measurement from the stakeholders’ perspective,”
in Proceedings 16th International Workshop on Database and Expert
Systems Applications (DEXA), Copenhagen, Denmark, 2005, pp.
941–7.
[178] D. Damian, J. Chisan, L. Vaidyanathasamy, and Y. Pal, “Re-
quirements engineering and downstream software development:
Findings from a case study,” Empirical Software Engineering,
vol. 10, no. 3, pp. 255–283, Jul. 2005.
[179] E. Savioja and M. Tukiainen, “Measurement practices in financial
software industry,” Software Process Improvement and Practice,
vol. 12, no. 6, pp. 585–595, Nov. 2007.
[180] A. Johnson, “Software process improvement experience in the
DP/MIS function,” in Proceedings 16th International Conference on
Software Engineering (ICSE), Sorrento, Italy, 1994, pp. 323–9.
[181] C. Hollenbach and D. Smith, “A portrait of a CMMI level 4
effort,” Systems Engineering, vol. 5, no. 1, pp. 52–61, 2002.
[182] V. French, “Applying software engineering and process im-
provement to legacy defence system maintenance: an experience
report,” in Proceedings 11th International Conference on Software
Maintenance (ICSM), Opio (Nice), France, 1995, pp. 337–43.
[183] D. Macke and T. Galinac, “Optimized software process for fault
handling in global software development,” in Making Globally
Distributed Software Development a Success Story, ser. Lecture Notes
in Computer Science. Berlin, Germany: Springer, 2008, vol. 5007,
pp. 395–406.
[184] M. Murugappan and G. Keeni, “Quality improvement-the six
sigma way,” in Proceedings 1st Asia-Pacific Conference on Quality
Software (APAQS), Hong Kong, China, 2000, pp. 248–57.
[185] K. Sargut and O. Demirors, “Utilization of statistical process
control (SPC) in emergent software organizations: pitfalls and
suggestions,” Software Quality Journal, vol. 14, no. 2, pp. 135–57,
Jun. 2006.
[186] B. R. von Konsky and M. Robey, “A case study: GQM and TSP in a
software engineering capstone project,” in Proceedings 18th Soft-
ware Engineering Education Conference (CSEET), Ottawa, Canada,
2005, pp. 215–222.
[187] A. Birk, P. Derks, D. Hamann, J. Hirvensalo, M. Oivo, E. Roden-
bach, R. van Solingen, and J. Taramaa, “Applications of measure-
ment in product-focused process improvement: a comparative
industrial case study,” in Proceedings 5th International Software
Metrics Symposium (METRICS), Bethesda, 1998, pp. 105–8.
[188] J. Rooijmans, H. Aerts, and M. van Genuchten, “Software quality
in consumer electronics products,” IEEE Softw., vol. 13, no. 1, pp.
55–64, Jan. 1996.
[189] W. Harrison, D. Raffo, J. Settle, and N. Eicklemann, “Technology
review: adapting financial measures: making a business case for
software process improvement,” Software Quality Journal, vol. 8,
no. 3, pp. 211–30, Nov. 1999.
[190] C. Ebert, “The quest for technical controlling,” Software Process
Improvement and Practice, vol. 4, no. 1, pp. 21–31, Mar. 1998.
[191] D. Weiss, D. Bennett, J. Payseur, P. Tendick, and P. Zhang, “Goal-
oriented software assessment,” in Proceedings 24th International
Conference on Software Engineering (ICSE), Orlando, 2002, pp. 221–
31.
[192] A. Nolan, “Learning from success,” IEEE Softw., vol. 16, no. 1, pp.
97–105, Jan. 1999.
[193] R. Dion, “Elements of a process-improvement program,” IEEE
Softw., vol. 9, no. 4, pp. 83–5, Jul. 1992.
[194] ——, “Process improvement and the corporate balance sheet,”
IEEE Softw., vol. 10, no. 4, pp. 28–35, Jul. 1993.
[195] T. Lee, D. Baik, and H. In, “Cost benefit analysis of personal
software process training program,” in Proceedings 8th Interna-
tional Conference on Computer and Information Technology Workshops
(CITWORKSHOPS), Sydney, Australia, 2008, pp. 631–6.
[196] L. Scott, R. Jeffery, L. Carvalho, J. D’Ambra, and P. Rutherford,
“Practical software process improvement - the IMPACT project,”
in Proceedings 13th Australian Software Engineering Conference
(ASWEC), Canberra, Australia, 2001, pp. 182–9.
[197] T. Bruckhaus, N. H. Madhavji, I. Janssen, and J. Henshaw, “The
impact of tools on software productivity,” IEEE Softw., vol. 13,
no. 5, pp. 29–38, Sep. 1996.
[198] R. van Solingen, “Measuring the ROI of software process im-
provement,” IEEE Softw., vol. 21, no. 3, pp. 32–38, May 2004.
[199] D. Escala and M. Morisio, “A metric suite for a team PSP,” in Pro-
ceedings 5th International Software Metrics Symposium (METRICS),
Bethesda, 1998, pp. 89–92.
[200] P. Miller, “An SEI process improvement path to software quality,”
in Proceedings 6th International Conference on the Quality of Informa-
tion and Communication Technology (QUATIC), Lisbon, Portugal,
2007, pp. 12–18.
[201] J. D. Herbsleb and D. R. Goldenson, “A systematic survey of
CMM experience and results,” in Proceedings 18th International
Conference on Software Engineering (ICSE), Berlin, Germany, 1996,
pp. 323–330.
[202] F. McGarry and B. Decker, “Attaining level 5 in CMM process
maturity,” IEEE Softw., vol. 19, no. 6, pp. 87–96, Nov. 2002.
[203] L. Lazic and N. Mastorakis, “Cost effective software test metrics,”
WSEAS Transactions on Computers, vol. 7, no. 6, pp. 599–619, Jun.
2008.
[204] J. Andrade, J. Ares, O. Dieste, R. Garcia, M. Lopez, S. Rodriguez,
and L. Verde, “Creation of an automated management software
requirements environment: A practical experience,” in Proceed-
ings 10th International Workshop on Database and Expert Systems
Applications (DEXA), Florence, Italy, 1999, pp. 328–35.
[205] S. A. Ajila and D. Wu, “Empirical study of the effects of open
source adoption on software development economics,” Journal of
Systems and Software, vol. 80, no. 9, pp. 1517–1529, Sep. 2007.
[206] S. Golubic, “Influence of software development process capabil-
ity on product quality,” in Proceedings 8th International Conference
on Telecommunications (ConTEL), Zagreb, Croatia, 2005, pp. 457–
63.
[207] H. Krasner and G. Scott, “Lessons learned from an initiative for
improving software process, quality and reliability in a semicon-
ductor equipment company,” in Proceedings 29th Annual Hawaii
International Conference on System Sciences (HICSS), Maui, 1996,
pp. 693–702.
[208] S. Pfleeger, “Maturity, models, and goals: how to build a metrics
plan,” Journal of Systems and Software, vol. 31, no. 2, pp. 143–55,
Nov. 1995.
[209] F. McGarry, S. Burke, and B. Decker, “Measuring the impacts
individual process maturity attributes have on software prod-
ucts,” in Proceedings 5th International Software Metrics Symposium
(METRICS), Bethesda, 1998, pp. 52–60.
[210] J. D. Valett, “Practical use of empirical studies for maintenance
process improvement,” Empirical Software Engineering, vol. 2,
no. 2, pp. 133–142, Jun. 1997.
[211] A. Calio, M. Autiero, and G. Bux, “Software process improve-
ment by object technology (ESSI PIE 27785 - SPOT),” in Proceed-
ings 22nd International Conference on Software Engineering (ICSE),
Limerick, Ireland, 2000, pp. 641–647.
[212] J. Kuilboer and N. Ashrafi, “Software process improvement
deployment: an empirical perspective,” Journal of Information
Technology Management, vol. 10, no. 3-4, pp. 35–47, 1999.
[213] S. Biffl and M. Halling, “Software product improvement with
inspection. a large-scale experiment on the influence of inspection
processes on defect detection in software requirements docu-
ments,” in Proceedings 26th Euromicro Conference, Maastricht, The
Netherlands, 2000, pp. 262–9.
[214] J. Trienekens, R. Kusters, M. van Genuchten, and H. Aerts,
“Targets, drivers and metrics in software process improvement:
results of a survey in a multinational organization,” Software
Quality Journal, vol. 15, no. 2, pp. 135–53, Jun. 2007.
[215] K. El Emam and A. Birk, “Validating the ISO/IEC 15504 measure
of software requirements analysis process capability,” IEEE Trans.
Softw. Eng., vol. 26, no. 6, pp. 541–66, Jun. 2000.
[216] P. B. Crosby, Quality Without Tears. New York: McGraw-Hill,
1984.
[217] J. A. Rozum, “Concepts on measuring the benefits of
software process improvements,” Software Engineering Institute,
Tech. Rep. CMU/SEI-93-TR-009, 1993. [Online]. Available:
http://www.sei.cmu.edu/reports/93tr009.pdf
[218] D. J. Rocha, “Strengthening the validity of software process
improvement measurements through statistical analysis: A
case study at Ericsson AB.” [Online]. Available: http://hdl.handle.net/2077/10529
[219] D. Caivano, “Continuous software process improvement through
statistical process control,” in Proceedings 9th European Conference
on Software Maintenance and Reengineering (CSMR), Manchester,
UK, 2005, pp. 288–293.
[220] R. Fitzpatrick and C. Higgins, “Usable software and its attributes:
A synthesis of software quality,” in Proceedings of HCI on People
and Computers XIII, Sheffield, UK, 1998, pp. 3–21.
[221] T. Bruckhaus, “A quantitative approach for analyzing the impact
of tools on software productivity,” Ph.D. dissertation, McGill
University, 1997.
[222] E. Lee and M. Lee, “Development system security process of
ISO/IEC TR 15504 and security considerations for software pro-
cess improvement,” in Computational Science and its Applications,
ser. Lecture Notes in Computer Science. Berlin, Germany:
Springer, 2005, vol. 3481, pp. 363–372.
[223] E. Bellini and C. lo Storto, “CMM implementation and orga-
nizational learning: findings from a case study analysis,” in
Proceedings PICMET 2006-Technology Management for the Global
Future (PICMET), Istanbul, Turkey, 2006, pp. 1256–71.
[224] D. Raffo, “The role of process improvement in delivering cus-
tomer and financial value,” in Portland International Conference on
Management and Technology (PICMET), Portland, 1997, pp. 589–
592.
[225] R. van Solingen, D. F. Rico, and M. V. Zelkowitz, “Calculating
software process improvement’s return on investment,” in Ad-
vances in Computers. Elsevier, 2006, vol. 66, pp. 1–41.
[226] M. Staples and M. Niazi, “Systematic review of organizational
motivations for adopting CMM-based SPI,” Information and Soft-
ware Technology, vol. 50, no. 7-8, pp. 605–620, Jun. 2008.
[227] R. Atkinson, “Project management: cost, time and quality, two
best guesses and a phenomenon, it’s time to accept other success
criteria,” International Journal of Project Management, vol. 17, no. 6,
pp. 337–342, Dec. 1999.
[228] H. Kerzner, Project Management: A Systems Approach to Planning,
Scheduling, and Controlling, 10th ed. Hoboken: John Wiley, 2009.
[229] D. F. Rico, ROI of software process improvement. Fort Lauderdale:
J. Ross Publishing, 2004.
[230] N. Bevan, “Quality in use: Meeting user needs for quality,”
Journal of Systems and Software, vol. 49, no. 1, pp. 89–96, Dec. 1999.
[231] D. Stavrinoudis and M. Xenos, “Comparing internal and external
software quality measurements,” in Proceedings 2008 Conference
on Knowledge-Based Software Engineering, Pireaus, Greece, 2008,
pp. 115–124.
[232] A. Seffah, M. Donyaee, R. Kline, and H. Padda, “Usability mea-
surement and metrics: A consolidated model,” Software Quality
Journal, vol. 14, no. 2, pp. 159–178, Jun. 2006.
[233] N. Bevan and M. MacLeod, “Usability measurement in context,”
Behaviour & Information Technology, vol. 13, no. 1, pp. 132–45, 1994.
[234] N. McNamara and J. Kirakowski, “Functionality, usability, and
user experience: three areas of concern,” Interactions, vol. 13,
no. 6, pp. 26–28, Nov. 2006.
[235] R. S. Pressman, Software engineering: a practitioner’s approach,
5th ed. New York: McGraw-Hill, 2001.
[236] H.-W. Jung, S.-G. Kim, and C.-S. Chung, “Measuring
software product quality: a survey of ISO/IEC 9126,” IEEE Softw.,
vol. 21, no. 5, pp. 88–92, Sep. 2004.
[237] J. McColl-Kennedy and U. Schneider, “Measuring customer sat-
isfaction: why, what and how,” Total Quality Management, vol. 11,
no. 7, pp. 883–896, Sep. 2000.
[238] A. Mockus, P. Zhang, and P. L. Li, “Predictors of customer
perceived software quality,” in Proceedings 27th International Con-
ference on Software Engineering (ICSE), St. Louis, 2005, pp. 225–233.
[239] E. Gray and W. Smith, “On the limitations of software process
assessment and the recognition of a required re-orientation for
global process improvement,” Software Quality Journal, vol. 7,
no. 1, pp. 21–34, Mar. 1998.
[240] S. Grimstad, M. Jørgensen, and K. Moløkken-Østvold, “Software
effort estimation terminology: The tower of babel,” Information
and Software Technology, vol. 48, no. 4, pp. 302–310, Apr. 2006.
[241] C. Kaner and W. P. Bond, “Software engineering metrics: What
do they measure and how do we know,” in Proceedings 10th
International Software Metrics Symposium (METRICS), Chicago,
2004.
[242] P. Carbone, L. Buglione, L. Mari, and D. Petri, “A comparison
between foundations of metrology and software measurement,”
IEEE Trans. Instrum. Meas., vol. 57, no. 2, pp. 235–241, Feb. 2008.
[243] P. Abrahamsson, “Measuring the success of software
process improvement: The dimensions,” in Proceedings
European Software Process Improvement (EuroSPI2000)
Conference, Copenhagen, Denmark, 2000. [Online]. Available:
http://www.iscn.at/select_newspaper/measurement/oulu.html
[244] M. Berry, R. Jeffery, and A. Aurum, “Assessment of software
measurement: an information quality study,” in Proceedings 10th
International Symposium on Software Metrics (METRICS), Chicago,
2004, pp. 314–325.
[245] D. Baccarini, “The logical framework method for defining project
success,” Project Management Journal, vol. 30, no. 4, pp. 25–32, Dec.
1999.
[246] M. A. Babar and I. Gorton, “Software architecture review: The
state of practice,” Computer, vol. 42, no. 7, pp. 26–32, Jul. 2009.
[247] B. Fitzgerald and T. O’Kane, “A longitudinal study of software
process improvement,” IEEE Softw., vol. 16, no. 3, pp. 37–45, May
1999.
[248] R. Biehl, “Six sigma for software,” IEEE Softw., vol. 21, no. 2, pp.
68–70, Mar. 2004.
[249] D. Flynn, J. Vagner, and O. D. Vecchio, “Is CASE technology
improving quality and productivity in software development?”
Logistics Information Management, vol. 8, no. 2, pp. 8–21, 1995.
[250] S. Jarzabek and R. Huang, “The case for user-centered CASE
tools,” Comm. ACM, vol. 41, no. 8, pp. 93–99, Aug. 1998.
[251] G. Low and V. Leenanuraksa, “Software quality and CASE tools,”
in Proceedings on Software Technology and Engineering Practice
(STEP), Pittsburgh, 1999, pp. 142–150.
[252] R. Patnayakuni and A. Rai, “Development infrastructure charac-
teristics and process capability,” Comm. ACM, vol. 45, no. 4, pp.
201–210, Apr. 2002.
[253] W. Humphrey, “CASE planning and the software process,” Jour-
nal of Systems Integration, vol. 1, no. 3, pp. 321–337, Nov. 1991.
[254] W. J. Orlikowski, “CASE tools as organizational change: Inves-
tigating incremental and radical changes in systems develop-
ment,” MIS Quarterly, vol. 17, no. 3, pp. 309–340, Sep. 1993.
[255] G. Premkumar and M. Potter, “Adoption of computer aided
software engineering (CASE) technology: an innovation adoption
perspective,” ACM SIGMIS Database, vol. 26, no. 2-3, pp. 105–124,
May 1995.
[256] S. Sharma and A. Rai, “CASE deployment in IS organizations,”
Comm. ACM, vol. 43, no. 1, pp. 80–88, Jan. 2000.
[257] V. Basili and D. Weiss, “A methodology for collecting valid
software engineering data,” IEEE Trans. Softw. Eng., vol. 10, no. 6,
pp. 728–738, Nov. 1984.
[258] J. P. Campbell, V. A. Maxey, and W. A. Watson, “Hawthorne
effect: Implications for prehospital research,” Annals of Emergency
Medicine, vol. 26, no. 5, pp. 590–594, Nov. 1995.
[259] S. Anderson, A. Auquier, W. W. Hauck, D. Oakes, W. Vandaele,
and H. I. Weisberg, Statistical Methods for Comparative Studies:
Techniques for Bias Reduction. New York: John Wiley, 1980.
[260] S. Pfleeger, “Experimentation in software engineering,” in Ad-
vances in Computers. San Diego: Academic Press, 1997, vol. 44,
pp. 127–167.
[261] B. Kitchenham, S. L. Pfleeger, L. M. Pickard, P. W. Jones, D. C.
Hoaglin, K. E. Emam, and J. Rosenberg, “Preliminary guidelines
for empirical research in software engineering,” IEEE Trans. Softw.
Eng., vol. 28, no. 8, pp. 721–734, Aug. 2002.
[262] W. Evanco, “Comments on "The confounding effect of class size
on the validity of object-oriented metrics",” IEEE Trans. Softw.
Eng., vol. 29, no. 7, pp. 670–672, Jul. 2003.
[263] K. El Emam, S. Benlarbi, N. Goel, and S. N. Rai, “The confound-
ing effect of class size on the validity of Object-Oriented metrics,”
IEEE Trans. Softw. Eng., vol. 27, no. 7, pp. 630–650, Jul. 2001.
[264] M. Bruntink and A. van Deursen, “An empirical study into class
testability,” Journal of Systems and Software, vol. 79, no. 9, pp. 1219–
1232, Sep. 2006.
[265] Y. Zhou, H. Leung, and B. Xu, “Examining the potentially con-
founding effect of class size on the associations between Object-
Oriented metrics and Change-Proneness,” IEEE Trans. Softw. Eng.,
vol. 35, no. 5, pp. 607–623, Sep. 2009.
Michael Unterkalmsteiner is a PhD student
at the Blekinge Institute of Technology (BTH)
where he is with the Software Engineering Re-
search Lab. His research interests include soft-
ware repository mining, software measurement
and testing, process improvement, and require-
ments engineering. His current research fo-
cuses on the co-optimization of requirements
engineering and verification & validation pro-
cesses. He received the B. Sc. degree in applied
computer science from the Free University of
Bolzano / Bozen (FUB) in 2007 and is currently completing the M. Sc.
degree in software engineering at BTH.
Tony Gorschek is a professor of software
engineering at Blekinge Institute of Technol-
ogy (BTH) with over ten years of industrial experience. He also manages his own industry
consultancy company, works as a CTO, and
serves on several boards in companies devel-
oping cutting edge technology and products. His
research interests include requirements engi-
neering, technology and product management,
process assessment and improvement, qual-
ity assurance, and innovation. Contact him at
tony.gorschek@bth.se or visit www.gorschek.com.
A. K. M. Moinul Islam is a researcher at the
Technical University of Kaiserslautern, Germany.
He is with the Software Engineering: Process
and Measurement Research Group. His re-
search interests include global software en-
gineering, software process improvement and
evaluation, and empirical software engineer-
ing. He received his double master’s degree,
M. Sc. in Software Engineering, in 2009 jointly
from the University of Kaiserslautern, Germany and the Blekinge Institute of Technology, Sweden within the framework of the European Union's Erasmus Mundus Programme. Prior
to his master’s degree, he worked for 3 years in the IT and Telecommu-
nication industry.
Chow Kian Cheng is a software engineer
at General Electric International Inc. based in
Freiburg, Germany. He is responsible for the de-
velopment of clinical software in the healthcare
industry. He holds a joint master's degree, M. Sc. in Software Engineering, from the Blekinge Institute of Technology, Sweden and the Free University of Bolzano / Bozen, Italy. Prior to his master's degree, he worked for 4 years with Motorola Inc.
and Standard Chartered Bank.
Rahadian Bayu Permadi received his Bachelor's degree in Informatics from the Bandung Institute of Technology, Indonesia. In 2009 he obtained a double-degree master's in software engineering from the Free University of Bolzano / Bozen, Italy and the Blekinge Institute of Technology, Sweden. Currently, he is working as a software engineer at Amadeus S.A.S., France. His interests are software measurement & process improvement, software architecture, and software project management. He was a Java technology researcher in Indonesia before being awarded the Erasmus Mundus scholarship for the European Master in Software Engineering programme.
Robert Feldt (M’98) is an associate professor
of software engineering at Chalmers University
of Technology (CTH) as well as at Blekinge In-
stitute of Technology. He has also worked as
an IT and Software consultant for more than
15 years. His research interests include soft-
ware testing and verification and validation, auto-
mated software engineering, requirements engi-
neering, user experience, and human-centered
software engineering. Most of the research is
conducted in close collaboration with industry
partners such as Ericsson, RUAG Space and SAAB Systems. Feldt has
a PhD (Tekn. Dr.) in software engineering from CTH.