A cost effectiveness indicator for software development


Abstract and Figures

Product quality, development productivity, and staffing needs are main cost drivers in software development. The paper proposes a cost-effectiveness indicator that combines these drivers using an economic criterion.
NRC Publications Archive (NPArC)
Cost-Effectiveness Indicator for Software Development
Erdogmus, Hakan
Cost-Effectiveness Indicator for Software
Development *
Erdogmus, H.
* Proceedings of the International Symposium on Empirical Software
Engineering and Measurement (ESEM 2007). September 20, 2007. NRC
Copyright 2007 by
National Research Council of Canada
Permission is granted to quote short excerpts and to reproduce figures and tables
from this report, provided that the source of such material is fully acknowledged.
A cost effectiveness indicator for software development
Hakan Erdogmus
NRC Institute for Information Technology
Ottawa, Canada
Product quality, development productivity, and
staffing needs are main cost drivers in software
development. The paper proposes a one-stop cost-
effectiveness indicator that combines these three cost
drivers through an economic criterion.
1. Introduction
Tradeoffs between development productivity and
product quality make it hard to assess the cost-
effectiveness of software development, both across
software development projects and across development
techniques and practices. Previous research confirms
the variability in quality and productivity [1], and the
tension between them [2]. This paper proposes an
indicator that reconciles this tension by aggregating
software development’s main cost drivers [3] -- team
productivity, staffing needs, and product quality -- into
a single coherent quantity. The indicator, called
breakeven multiple, allows comparison among projects
and development techniques based on their relative
cost-effectiveness. The indicator incorporates
productivity through its impact on direct development
costs and product quality through its impact on indirect
or downstream costs associated with rework [4].
Economic metrics for software development have
existed since the late nineties. Erdogmus [5]
developed a cost-benefit model based on net present
value for comparing software initiatives. Muller and
Padberg [6] adapted this model to evaluate extreme
programming projects. Erdogmus and Williams [7]
later combined net present value with breakeven
analysis to derive an economic feasibility metric for
pair programming. Padberg and Muller [8] used a
similar approach in their own analysis of the same
practice. Wagner [4] recently proposed an economic
efficiency model for quality that aggregates costs and
benefits of quality activities into a return-on-
investment metric.
The work presented here builds on the metric
defined by Erdogmus and Williams [7] for comparing
two practices. It both generalizes and simplifies this
metric, allowing more robust, multi-way comparison.
2. Basic Concepts
A project is work undertaken by a team. A project’s
output is a partial or complete software product with
working features and no known issues that require
resolution. A project comprises production and rework
activities. Production refers to all work that leads to
the initial external release of parts or whole of a
usable, but not necessarily perfect, product. Production
results in a product that may contain issues requiring
resolution. The output of production is the project’s
nominal output. Rework refers to all work that resolves
any identified issues in the nominal output. Rework
transforms a released imperfect product into a finished
product free of such issues. After rework, nominal
output becomes the project’s real output.
Product quality, or simply quality, refers to absence
of issues in a project’s output. Think of an issue as a
defect or an undesirable property or behavior that
incurs some latent cost, or prevents the benefits of a
product from being realized as intended. Issues are
discovered post-production and require resolution.
They may relate to functionality, reliability, usability,
maintainability or other external attributes. Rework
captures cost of poor quality.
Schedule is the duration of an activity, measured in
calendar time. Effort is the labor cost of an activity,
measured in person-time.
3. Derived Measures
The following derived measures can be obtained
from the base measures of nominal output, production
effort, rework effort, issue count and staffing profile
(salary loading of project as a function of schedule):
Load factor (L) quantifies a project's average
staffing load based on the staffing profile, expressed
as a multiple of a base salary.
Production speed (p) captures the production
component of team productivity. It is the average
delivery speed of nominal output by the project:
$$p = \frac{\text{Nominal output}}{\text{Development effort}} \times L$$
Issue density (d) captures the level of rework that
the released nominal product requires. It is the average
issue count of a unit of nominal output.
Resolution speed (r) captures the rework
component of team productivity. It is the average rate
at which the project resolves issues in a nominal output.
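The derived measures above can be sketched in code. This is a minimal
illustration, not the paper's own tooling: the `BaseMeasures` container,
the representation of the staffing profile as (duration, salary load)
segments, and the sample numbers are all hypothetical. The formula used
for r (issue count over rework effort, scaled by L) is an assumption made
by analogy with the formula for p.

```python
from dataclasses import dataclass


@dataclass
class BaseMeasures:
    """Hypothetical container for the base measures of one project."""
    nominal_output: float     # e.g. stories released by production
    production_effort: float  # person-time spent on production
    rework_effort: float      # person-time spent resolving issues
    issue_count: int          # issues found in the nominal output
    # Staffing profile as (duration, salary load) segments, where the
    # load is a multiple of the base salary (assumed representation).
    staffing_profile: list


def load_factor(profile):
    """Time-weighted average salary load over the schedule."""
    total_time = sum(duration for duration, _ in profile)
    return sum(duration * load for duration, load in profile) / total_time


def derived_measures(m):
    """Return (L, p, d, r) computed from the base measures."""
    L = load_factor(m.staffing_profile)
    p = m.nominal_output / m.production_effort * L  # output per unit schedule
    d = m.issue_count / m.nominal_output            # issues per unit output
    # Assumption: r mirrors p, with issues resolved in place of output.
    r = m.issue_count / m.rework_effort * L         # issues per unit schedule
    return L, p, d, r


# Hypothetical project: a team of 4 at base salary for 250 hours.
example = BaseMeasures(
    nominal_output=40,
    production_effort=800,
    rework_effort=200,
    issue_count=50,
    staffing_profile=[(100, 4.0), (150, 4.0)],
)
L, p, d, r = derived_measures(example)
```

Multiplying effort-based rates by L converts them from person-time to
calendar time, which is why p and r come out per unit schedule.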
4. Derivation of the indicator
Production efficiency is the ratio of production
effort to the total effort. A project that is 100%
efficient does not perform any rework, and its nominal
productivity effectively equals its real productivity. A
project having a production speed of p output units per
unit schedule, an issue density of d issues per unit
output, a resolution speed of r issues per unit schedule,
has a production efficiency, $\varepsilon$, of r/(r + pd). If V
denotes the hypothetical value earned by a single unit
of real output, then for each unit schedule the project
on average earns a value of $Vp\varepsilon$.
Now suppose S is the base salary of a developer. If
the project has a load factor of L persons, it incurs for
each unit of schedule a cost of LS. Then the average
net value, NV, earned by the project per unit schedule is
$Vp\varepsilon - LS$. Of interest is the minimum level of the
quantity V that allows the project to break even.
Solving the equation NV = 0 for V yields this
breakeven unit value. Thus
$$\mathrm{BUV} = \min\{\, V \mid Vp\varepsilon - LS = 0 \,\} = \frac{LS}{p\varepsilon}$$
BUV combines productivity and quality as desired,
but it still depends on S. Normalizing the base salary S
with respect to BUV results in a more compact
indicator called the breakeven multiple, or BM, where:
$$\mathrm{BM} = \frac{S}{\mathrm{BUV}} = \frac{p\varepsilon}{L}$$
BM expresses the base salary S in terms of a
multiple of BUV, but it does not depend on S. Since S
is invariant within and across projects in the same
context, if a project’s BM increases, the project
requires a lower unit value to break even, and the
project’s cost-effectiveness and profitability increase
as a result. A more intuitive interpretation of BM relies
on its unit. BM is measured in output per person-time,
the same unit as resource productivity. BM is indeed
nominal calendar productivity adjusted by efficiency
and de-normalized with respect to resource load.
Therefore, it can be thought of as the real resource
productivity of a production process.
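The derivation can be checked numerically with a short sketch. It assumes
the relationships read as follows: production efficiency is r/(r + pd),
net value per unit schedule is V times p times efficiency minus LS, and
BM is S divided by BUV. The function names and the sample values are
hypothetical and chosen only for illustration.

```python
def production_efficiency(p, d, r):
    """Share of schedule spent on production: r / (r + p*d)."""
    return r / (r + p * d)


def net_value(V, p, d, r, L, S):
    """Average net value earned per unit schedule at unit value V."""
    return V * p * production_efficiency(p, d, r) - L * S


def breakeven_unit_value(p, d, r, L, S):
    """Unit value V at which net value per unit schedule is zero."""
    return L * S / (p * production_efficiency(p, d, r))


def breakeven_multiple(p, d, r, L):
    """BM = S / BUV; the base salary S cancels out."""
    return p * production_efficiency(p, d, r) / L


# Hypothetical project: 0.2 stories/hour, 1.25 issues/story,
# 1 issue resolved per hour, 4 persons, base salary of 50 per hour.
p, d, r, L, S = 0.2, 1.25, 1.0, 4.0, 50.0
buv = breakeven_unit_value(p, d, r, L, S)
bm = breakeven_multiple(p, d, r, L)
```

Here `S / buv` and `bm` agree, and changing S alters `buv` but leaves
`bm` unchanged, which is the property that motivates the normalization.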
5. Advantages, Limitations, and Uses
BM is an indicator that aggregates productivity,
quality, and staffing needs into a single, simple
quantity. It makes it possible to compare projects with
opposite productivity and quality characteristics, thus
reconciling the underlying trade-offs. BM is
empirically determined through combining
interdependent measures, but does not express a
natural relationship among these measures.
Through alternative derivations, BM captures both
cost-effectiveness and real (as opposed to nominal)
productivity, both of which admit intuitive
interpretations. It is also sound with respect to standard
financial theory under the assumption of continuous
incremental delivery [7].
BM requires simple base measures to be collected
about a project. It can be customized for a given
context by appropriately choosing the underlying base
measures. A serious limitation of BM is its dependence
on the unit of the particular output measure used. Thus
projects having different output measures are not
comparable by this indicator. The base measures of
output and issue count should be interpretable on a
ratio scale for realistically large ranges. Particularly
problematic is the situation when base measures are
highly variable. Software unfortunately does not admit
a universal and uniform output measure. Although the
ideal output measure is delivered business value, either
size measures such as lines of code (low-level) and
function points (high-level) or requirements-oriented
measures like use-cases and stories are adopted as
proxies. However, each proxy has advantages and
disadvantages [9]. Finding portable, meaningful, sound
measures of size, functionality, productivity and
quality has been an elusive endeavor.
The breakeven multiple has two intended uses: (1)
as a high-level, one-stop performance indicator inside
a portfolio of projects; and (2) as a one-stop dependent
variable in empirical studies of software development
practices. In experimental contexts, BM’s limitations
can be alleviated through study design.
6. Application Example
As an example, consider test-driven development
(TDD), a coding technique in which development tasks
are driven by unit tests written before production code.
The example demonstrates BM’s use in conjunction
with sensitivity analysis.
An empirical study by Erdogmus, Morisio, and
Torchiano [10] evaluated the effects of writing unit
tests before production code (Test-First) relative to
writing unit tests after production code (Test-Last).
The study measured the average nominal productivity
and product quality of two groups performing a
programming task with a set of incremental
requirements. The study measured external program
quality (through failing acceptance tests) and
production effort, but not rework productivity.
[Figure 2 chart. Horizontal axis: production efficiency pairs
(Test-Last | Test-First), from 0% | 0% to 65% | 62%. Vertical axis:
breakeven multiple.]
Figure 2. Breakeven multiples for the TDD study as a
function of the resolution speed.
To calculate the two groups’ BM values, we treat
them as two projects, setting the output measure to
number of completed stories. The measure of
production speed is stories per hour, which is readily
adoptable. For the quality measure, we equate a failing
acceptance test to an issue, and calculate issue density
in failures per story. The load is constant since the two
techniques were executed by single programmers.
Since the study did not measure rework
productivity, we fix the production speed of Test-Last,
and estimate the resolution speed of Test-First by
applying its observed 28% nominal productivity speed-
up. Subsequently, we vary Test-Last’s resolution
speed, determine the corresponding Test-First
resolution speed, compute the corresponding
production efficiencies, and finally plot the BM values
against the resulting production efficiency pairs. The
chart in Figure 2 shows this analysis. The analysis
suggests an increasing cost-effectiveness for the
Test-First group relative to the Test-Last group as
efficiency increases.
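The sensitivity analysis can be sketched as a sweep over Test-Last's
resolution speed. Only the 28% speed-up comes from the text. The
production speed and the two issue densities below are hypothetical
placeholders, not the study's data, and the sketch assumes the speed-up
applies to both the production speed and the resolution speed of
Test-First. BM is computed as production speed times production
efficiency divided by load, with a load of one programmer.

```python
SPEEDUP = 1.28  # 28% nominal productivity speed-up of Test-First (from the text)


def efficiency(p, d, r):
    """Production efficiency r / (r + p*d)."""
    return r / (r + p * d)


def bm(p, d, r, load=1.0):
    """Breakeven multiple for a single-programmer project by default."""
    return p * efficiency(p, d, r) / load


def sweep(p_last, d_last, d_first, r_last_values):
    """For each Test-Last resolution speed, return a row of
    (eff_last, eff_first, bm_last, bm_first)."""
    p_first = SPEEDUP * p_last
    rows = []
    for r_last in r_last_values:
        r_first = SPEEDUP * r_last  # assumed: same speed-up for rework
        rows.append((
            efficiency(p_last, d_last, r_last),
            efficiency(p_first, d_first, r_first),
            bm(p_last, d_last, r_last),
            bm(p_first, d_first, r_first),
        ))
    return rows


# Hypothetical inputs: 1 story/hour, 1.0 vs 1.1 failures per story.
rows = sweep(p_last=1.0, d_last=1.0, d_first=1.1,
             r_last_values=[0.0, 0.25, 0.5, 1.0, 2.0])
for eff_last, eff_first, bm_last, bm_first in rows:
    print(f"{eff_last:5.0%} | {eff_first:5.0%}   "
          f"BM Test-Last={bm_last:.3f}  BM Test-First={bm_first:.3f}")
```

With these placeholder inputs the gap between the two BM values widens
as resolution speed, and with it production efficiency, grows. Different
issue densities could change that picture, so the sweep should be rerun
with measured values.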
7. Summary
The breakeven multiple is an aggregate economic
indicator for software development. It reduces what
would ordinarily be multi-criteria comparisons based
on separate quality, productivity, and staffing measures
into single-criterion comparisons based on
cost-effectiveness. It is intended for use as a high-level
performance indicator for software projects and as a
dependent variable in empirical studies of software
development practices.
BM does not express a functional-empirical
relationship among the base measures. Sensitivity
analyses should be conducted with the recognition of
the base measures’ mutual dependence in mind.
Measurement issues constitute BM’s main limitation.
Availability of proper and meaningful base measures,
ability to accurately capture them, and dependence on
the output measure limit BM’s applicability and
8. References
[1] K. Maxwell and P. Forselius, "Benchmarking
software development productivity," IEEE
Software, pp. 80-88, 2000.
[2] A. MacCormack, C. F. Kemerer, M.
Cusumano, and B. Crandall, "Trade-offs
between productivity and quality in selecting
software development practices," IEEE
Software, vol. Sep/Oct, pp. 78-85, 2003.
[3] B. W. Boehm and P. N. Papaccio,
"Understanding and controlling software
costs," IEEE Transactions on Software
Engineering, vol. 14, pp. 1462-1477, 1988.
[4] S. Wagner, "A literature survey of the quality
economics of defect-detection techniques,"
presented at International Symposium on
Empirical Software Engineering, 2006.
[5] H. Erdogmus, "Comparative evaluation of
software development strategies based on Net
Present Value," presented at First ICSE
Workshop on Economics-Driven Software
Engineering Research, Los Angeles,
California, 1999.
[6] M. Müller and F. Padberg, "On the economic
evaluation of XP projects," presented at
Joint 9th European Software Engineering
Conference and 11th ACM SIGSOFT Int'l
Symposium on Foundations of Software
Engineering, Helsinki, Finland, 2003.
[7] H. Erdogmus and L. Williams, "The
Economics of Software Development by Pair
Programmers," The Engineering Economist,
vol. 48, 2003.
[8] F. Padberg and M. Müller, "Analyzing cost
and benefits of pair programming," presented
at 9th International Software Metrics
Symposium, 2003.
[9] M. Asmild, J. C. Paradi, and A. Kulkarni,
"Using data envelopment analysis in software
development productivity measurement,"
Software Process Improvement and Practice,
vol. 11, pp. 561-572, 2006.
[10] H. Erdogmus, M. Morisio, and M. Torchiano,
"On the Effectiveness of the Test-First
Approach to Programming," IEEE
Transactions on Software Engineering, vol.
31, pp. 226-237, 2005.