Standardized Code Quality Benchmarking for Improving
Software Maintainability
Robert Baggen · José Pedro Correia · Katrin Schill · Joost Visser
Abstract We provide an overview of the approach developed by the Software Improvement Group (SIG) for code analysis and quality consulting focused on software maintainability. The approach uses a standardized measurement model based on the ISO/IEC 9126 definition of maintainability and source code metrics. Procedural standardization in evaluation projects further enhances the comparability of results. Individual assessments are stored in a repository that allows any system at hand to be compared to the industry-wide state of the art in code quality and maintainability. When a minimum level of software maintainability is reached, the certification body of TÜV Informationstechnik GmbH (TÜViT) issues a Trusted Product Maintainability certificate for the software product.
Keywords Software product quality · benchmarking · certification · standardization
Robert Baggen
TÜV Informationstechnik GmbH
Essen, Germany

José Pedro Correia
Software Improvement Group
Amsterdam, The Netherlands

Katrin Schill
TÜV Informationstechnik GmbH
Essen, Germany

Joost Visser
Software Improvement Group
Amsterdam, The Netherlands
Fig. 1 Evaluation, benchmarking and certification of software product quality.
1 Introduction
The technical quality of source code (how well written it is) is an important determinant of software maintainability. When a change is needed in the software, the quality of its source code has an impact on how easy it is: 1) to determine where and how that change can be performed; 2) to implement that change; 3) to avoid unexpected effects of that change; and 4) to validate the changes performed.
However, many projects fail to assess code quality and to control it in the same way as classical project management KPIs such as timeline or budget. This is often because projects lack a standardized frame of reference when working with source code measurements. As a result, the quality of the product remains unknown until the (final) testing and problem fixing phase begins.
In this paper, we describe an approach developed by the Software Im-
provement Group (SIG) for code analysis and quality consulting focused on
software maintainability. The approach uses a standardized measurement pro-
cedure based on the ISO/IEC 9126 definition of maintainability and source
code metrics. Measurement standardization greatly facilitates the collection of
individual assessments in a structured repository. Based on the repository, any
system at hand can be compared to the industry-wide state of the art in code
quality and maintainability. Procedural standardization in evaluation projects
further enhances the comparability of results. When a minimum level of software maintainability is reached, TÜV Informationstechnik GmbH (TÜViT) issues a Trusted Product Maintainability certificate for the software product. An illustration of the approach is provided in Figure 1.
This paper is organized as follows. In Section 2 we start with an explanation
of the measurement model and its calibration against a benchmark database
of measurement results. In Section 3 we describe the standardized evaluation
procedure in which the model is used to arrive at quality judgments in an
evaluator-independent manner. This evaluation procedure is used as part of the
software product certification scheme and software quality consulting services,
as explained in Section 4. Examples of the usefulness of the standardized approach in practice are presented in Section 5. In Section 6 we address some potential criticisms of our approach. Section 7 discusses related
work. Finally, in Section 8 we present some concluding remarks and directions
for future work.
2 Measuring software via code metrics
The application of objective metrics for the measurement and improvement
of code quality has a tradition of more than 40 years. Today, code measure-
ment is seen as pragmatic work – the goal is to find the right indicator for
a given quality aspect and a given development environment. However, being
too creative about the measurements may preclude helpful comparisons with
other projects. Therefore, metrics have to be chosen with clear reference to an
agreed standard – e.g. the ISO/IEC 9126 international standard for software
product quality [17].
In the following subsections we summarize the measurement model and
its calibration against a benchmark repository. Full details can be found in
previous publications on the design of the model [13], its calibration [1], its
evaluation [8, 29, 3], its application to open-source software [7] and the under-
lying benchmark [6]. For self-containment, some details are also available in
Appendix A.
2.1 Software code metrics for maintainability
Conceiving maintainability as a function of code quality leads to a number of code metrics as candidates in maintainability assessments. SIG chose six source code properties as key metrics for the quality assessments, namely:

Volume the larger a system, the more effort it takes to maintain, since there is more information to be taken into account;
Redundancy duplicated code has to be maintained in all places where it occurs;
Unit size units, as the lowest-level pieces of functionality, should be kept small to be focused and easier to understand;
Complexity simple systems are easier to comprehend and test than complex ones;
Unit interface size units with many parameters can be a symptom of bad encapsulation;
Coupling tightly coupled components are more resistant to change.

The first four properties were introduced in [13], whereas the last two have been added recently.
These indicators assess clearly defined aspects of maintainability. They can
be calculated at least down to the unit level. This allows detailed analysis of
the system when drilling down into the results later on.
2.2 Measurements aggregation
In order to support a reproducible evaluation, software quality measurement
needs a clear mapping to an agreed standard. For that reason, the measure-
ments are interpreted in the framework of a hierarchical quality model with
dimensions according to the ISO/IEC 9126. In the ISO/IEC 9126 standard,
maintainability is seen as a general quality characteristic of a software product
and is decomposed into the sub-characteristics of analyzability, changeability,
stability and testability [17].
In the model, the sub-characteristics are made quantifiable with the above source code metrics [13]. For this, the raw metrics have to be aggregated to the level of the whole system. This is done either by using a grand total (as for Volume and Duplication) or by using so-called quality profiles. The latter summarize the distribution of a metric (e.g. cyclomatic complexity) at a certain level (e.g. per unit) by classifying measurements into risk categories based on a set of thresholds. The outcome is the percentage of code in the low, moderate, high and very high risk categories.
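As an illustration, the construction of such a quality profile can be sketched as follows. The complexity thresholds used here are invented for the sake of the example; they are not the calibrated values of the SIG model:

```python
# Sketch of a quality profile: classify each unit by its cyclomatic
# complexity and report the percentage of code (in LOC) per risk category.
# The thresholds below are illustrative assumptions, not calibrated values.

RISK_THRESHOLDS = [(10, "low"), (20, "moderate"), (50, "high")]  # above 50: very high

def risk_category(complexity):
    for limit, category in RISK_THRESHOLDS:
        if complexity <= limit:
            return category
    return "very high"

def quality_profile(units):
    """units: list of (cyclomatic complexity, lines of code), one pair per unit."""
    loc_per_risk = {"low": 0, "moderate": 0, "high": 0, "very high": 0}
    for complexity, loc in units:
        loc_per_risk[risk_category(complexity)] += loc
    total = sum(loc_per_risk.values())
    return {category: 100.0 * loc / total for category, loc in loc_per_risk.items()}

print(quality_profile([(3, 40), (8, 25), (15, 60), (60, 75)]))
# {'low': 32.5, 'moderate': 30.0, 'high': 0.0, 'very high': 37.5}
```

A system-level profile such as ⟨32.5, 30.0, 0.0, 37.5⟩ is the summary that is subsequently rated against another set of thresholds.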
The aggregated measurements are used to determine a rating for each source code property, based on the application of another set of thresholds. These are further combined to calculate ratings for the sub-characteristics and the general maintainability score for a given system [8]. The ratings correspond to 5 quality levels, represented by a number of stars, from ★ to ★★★★★. Details can be found in Appendix A.
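Continuing the sketch, a property rating can be derived by comparing a quality profile against maximum allowed percentages per risk category. The cut-off values below are made up for illustration; the calibrated thresholds are the subject of Section 2.4 and Appendix A:

```python
# Map a quality profile (percentages of code per risk category) to a 1-5
# star property rating. A rating is awarded when the moderate-, high- and
# very-high-risk percentages all stay within that rating's allowances.
# The cut-off values are illustrative assumptions, not the calibrated ones.

RATING_CUTOFFS = [          # (stars, max % moderate, max % high, max % very high)
    (5, 25, 0, 0),
    (4, 30, 5, 0),
    (3, 40, 10, 0),
    (2, 50, 15, 5),
]

def property_rating(profile):
    for stars, max_moderate, max_high, max_very_high in RATING_CUTOFFS:
        if (profile["moderate"] <= max_moderate
                and profile["high"] <= max_high
                and profile["very high"] <= max_very_high):
            return stars
    return 1

print(property_rating({"low": 70, "moderate": 28, "high": 2, "very high": 0}))  # 4
```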
A validation study on open source systems has shown that ratings as
awarded by the quality model correlate positively with the speed with which
defects are resolved by the system’s maintainers [29]. Also, the impact of tech-
nical quality on various external issue handling indicators was quantified and
the results corroborate the validity of the model (see [3]). For example, for
systems rating 4 stars, issues were found to be resolved 3 times faster than for
systems rating 2 stars.
As depicted in Figure 1, the layered quality model provides the criteria for
the evaluation procedure.
2.3 The role of a software benchmark repository
Even with quality dimensions derived from an international standard, quality
indices calculated from source code measurements remain arbitrary as long as
no comparison with other systems is available. To provide such information,
SIG maintains a benchmark repository [6] holding the results from several
hundreds of standard system evaluations carried out so far. As shown in Fig-
ure 1, the benchmark database is updated with the results of each evaluation
that is carried out.
Currently (May 2010), the benchmark repository holds results for over 500 evaluations (measurements computed for a language within a system) encompassing around 200 systems. These comprise proprietary systems (about 85%) as well as open source systems. A total of 45 different computer languages is used by the systems in the repository, with Java, C, COBOL, C#, C++, and ABAP as the largest contributors in terms of lines of code.

Fig. 2 Overview of the contents of the software benchmark repository per programming paradigm (a) and per functionality group (b). For the latter, we employ a taxonomy used by ISBSG [28].
The repository entries are described with meta-data to characterize the
systems along several dimensions. Information like functionality, development
model or architecture (among others), is added by the analysts involved in
each project. In Table 1, average values for some key metrics are listed per
represented programming paradigm. In Figure 2 we show some examples of
the heterogeneity of the contents of this repository in terms of programming
paradigm and functionality.
With this extensive statistical basis, SIG can compare any system to the whole repository or to similar products in terms of size, programming languages or industry branch. Furthermore, by interpreting the ranking, an evaluator can direct his scrutiny to the parts of the code that really need improvement, rather than curing minor symptoms. Such comparisons and interpretations are performed in the context of the quality consultancy services described in Section 4.2.
2.4 Calibration of the quality model
The evaluation data accumulated in the benchmark repository is also used for
calibrating the quality model.
Calibration is performed on two different levels, namely to determine thresholds for i) the raw metrics and ii) the aggregated quality profiles. The first level of calibration is performed by analyzing the statistical distributions of the raw metrics among the different systems. Thresholds are then determined based on the variability between the systems, making it possible to pinpoint the more uncommon (thus considered riskier) range of values for the metric. More details on the methodology used for the first-level calibration can be found in [1].

Table 1 Average values for redundancy and complexity

  Paradigm or group            Redundant lines   Decision density (McCabe / LOC)
  OOP (e.g. Java, C#)          12.3 %            19.6 %
  Web (e.g. JSP, ASP, PHP)     32.5 %            15.5 %
  Procedural (e.g. C, COBOL)   20.2 %            10.0 %
  DB (e.g. PL/SQL, T-SQL)      28.2 %            7.9 %
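A minimal sketch of this first calibration level: pool the metric over the benchmark systems, then place thresholds at high quantiles of the pooled distribution, so that only the uncommon upper range counts as risky. The 70/80/90 percent quantiles and the nearest-rank method are assumptions chosen for illustration; the published methodology [1] is more involved (it weights measurements by volume, for instance):

```python
# First-level calibration sketch: derive risk thresholds for a raw metric
# (e.g. cyclomatic complexity) from its pooled distribution over many
# benchmark systems. Quantile choices are illustrative assumptions.

def quantile(sorted_values, q):
    """Nearest-rank quantile of an ascending list, 0 < q <= 1."""
    index = max(0, round(q * len(sorted_values)) - 1)
    return sorted_values[index]

def calibrate_thresholds(pooled_metric_values, quantiles=(0.70, 0.80, 0.90)):
    values = sorted(pooled_metric_values)
    return [quantile(values, q) for q in quantiles]

# Pooled per-unit complexity values collected from many systems:
pooled = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 8, 10, 14, 14, 18, 22, 30, 45, 80]
print(calibrate_thresholds(pooled))  # [14, 18, 30] -> upper bounds of low/moderate/high
```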
For the second level, the aggregated quality profiles per system are used, and thresholds for those are tuned in such a way that for each lowest-level source code property a desired symmetrical distribution of systems over quality ratings is achieved. Concretely, the model is calibrated such that systems follow a ⟨5, 30, 30, 30, 5⟩ percentage-wise distribution over the 5 levels of quality. This means that if a system is awarded 5 stars, it is comparable to the 5% best systems in the benchmark in terms of maintainability. Note that it would be possible to select any other distribution, since it is a parameter of the calibration algorithm. We chose this one in particular so that only very good systems attain 5 stars, hence promoting excellence.
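The second level can be sketched as choosing rating boundaries at the cumulative percentiles implied by the target ⟨5, 30, 30, 30, 5⟩ distribution. For simplicity, each system is reduced here to a single risk summary value (lower is better), which glosses over the fact that the actual tuning operates on full quality profiles:

```python
# Second-level calibration sketch: pick rating boundaries so that systems
# distribute <5, 30, 30, 30, 5> percent over 5..1 stars. Each system is
# reduced to one risk summary value (a simplification of profile tuning).

def calibrate_rating_boundaries(system_risk_values, target=(5, 30, 30, 30, 5)):
    """Return 4 ascending boundaries; risk at or below the first earns 5 stars."""
    values = sorted(system_risk_values)      # best (lowest-risk) systems first
    boundaries, cumulative = [], 0
    for share in target[:-1]:                # the last bucket needs no boundary
        cumulative += share
        index = min(len(values) - 1, round(cumulative / 100 * len(values)) - 1)
        boundaries.append(values[index])
    return boundaries

def rate(risk, boundaries):
    for stars, boundary in zip((5, 4, 3, 2), boundaries):
        if risk <= boundary:
            return stars
    return 1

risks = [float(v) for v in range(1, 101)]    # 100 hypothetical benchmark systems
boundaries = calibrate_rating_boundaries(risks)
print(boundaries, rate(3.0, boundaries), rate(50.0, boundaries))
# [5.0, 35.0, 65.0, 95.0] 5 3
```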
It would be possible to rely on expert opinions to define the thresholds, but calibration against a large set of real-world systems brings some advantages:
a) the process is more objective, since it is based solely on data;
b) it can be done almost automatically, thus allowing the model to be updated easily;
c) it creates a realistic scale, since it is constructed to represent the full range of quality achieved by real-world systems.
For the calibration, a subset of the repository is selected according to cer-
tain criteria. Namely, only evaluations for modern programming languages
that pertain to recently developed systems (in the past decade) are taken into
account. This ensures that the quality model remains a reflection of the state
of the art in software engineering. That subset is then manually inspected and
purged of outliers to ensure the reliability of the obtained results. The high
number of included systems and the heterogeneity in terms of domains, lan-
guages, architectures, owners and/or producers (SIG’s clients and their suppli-
ers, as well as open source communities), help to guard the representativeness
of the calibration set.
Such calibration is performed with an updated set of systems at least once
per year. The changes in the data set are, nevertheless, kept small enough not
to cause abrupt changes after re-calibration.
3 Standardized evaluation procedure
A standard evaluation procedure has been defined in which the SIG quality
model is applied to software products [7]. The procedure consists of several
steps, starting with the take-in of the source code by secure transmission to
the evaluation laboratory and ending with the delivery of an evaluation report
(see Figure 3).
Fig. 3 Evaluation procedure (image adapted from [7]).
Fig. 4 Conformance to the ISO/IEC 14598 standard (image adapted from [7]).
Intake The source code is received via a secure upload and copied to a stan-
dard location. A checksum is calculated to allow for future identification
of the original source.
Scope In this step, the scope of the evaluation is determined. As a result, an
unambiguous description of which software artifacts are to be covered by
the evaluation becomes available.
This scope description includes information such as: the identification of the
software system (name, version, etc.), a characterization of the technology
footprint of the system (which programming languages and the number of
files analyzed for each), as well as a description of specific files excluded
from the scope of the evaluation and why.
Measure In this step, a range of measurement values is determined for the
software artifacts that fall within the evaluation scope. Each measurement
is determined automatically by processing the software artifacts with an
appropriate algorithm.
This results in a large collection of measurement values at the level of
source code units, which are then aggregated to the level of properties of
the system as a whole, as described in Appendix A.
Rate Finally, the values obtained in the measurement step are combined and
subsequently compared against target values in order to determine quality
sub-ratings and the final rating for the system under evaluation.
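The intake and scope steps above can be sketched as follows. The choice of SHA-256 and the file extensions are illustrative assumptions; the procedure prescribes a checksum and an unambiguous scope description, not a particular algorithm or technology footprint:

```python
# Sketch of the intake and scope steps: compute one checksum over the
# received sources so the evaluated snapshot can be identified later, and
# count analyzed files per language. SHA-256 and the extension list are
# illustrative assumptions; the procedure does not prescribe them.
import hashlib
import os

def intake_checksum(root):
    """One stable checksum over all files under `root` (relative paths, sorted)."""
    digest = hashlib.sha256()
    for directory, _, files in sorted(os.walk(root)):
        for name in sorted(files):
            path = os.path.join(directory, name)
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            with open(path, "rb") as handle:
                digest.update(handle.read())
    return digest.hexdigest()

def scope_description(root, included_extensions=(".java", ".sql")):
    """Number of files per analyzed language extension; all others are excluded."""
    counts = {}
    for directory, _, files in os.walk(root):
        for name in files:
            extension = os.path.splitext(name)[1]
            if extension in included_extensions:
                counts[extension] = counts.get(extension, 0) + 1
    return counts
```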
The procedure conforms to the guidelines of the ISO/IEC 14598 standard
for software product evaluation [16], which is a companion standard to the
ISO/IEC 9126. This is illustrated in Figure 4.
To further ensure the objectivity and traceability of the evaluation, the evaluation laboratory of SIG that carries out the procedure conforms to the guidelines of the ISO/IEC 17025 international standard for evaluation laboratories [20]. Among other things, this standard requires a quality management system to be in place that strictly separates the role of the evaluator (who operates source code analysis tools and applies the quality model to produce an evaluation report) from the role of the quality officer (who performs independent quality control on the activities and results of the evaluator).

Fig. 5 The quality mark Trusted Product - Maintainability.
4 Applications
The standardized procedure for measuring the maintainability of source code
is used both in evaluation projects leading to certification, and in consultancy
projects leading to (management-level) recommendations.
4.1 Certification of maintainability
The availability of a benchmark repository provides the means for an objective
comparison of software systems in terms of their maintainability. It is thus
possible to assign an objective measure of maintainability to every system
that undergoes the standardized evaluation procedure. This measure reflects
the relative status of the system within the population of all systems evaluated
so far.
Based on this system rating, TÜViT has set up a certification scheme. In this scheme, systems with maintainability scores above a certain threshold are eligible for the certificate called “TÜViT Trusted Product Maintainability” [33] (see Figure 5). To achieve a certificate, a system must score at least 2 stars (★★) on all sub-characteristics and at least 3 stars (★★★) on the overall maintainability score. Besides reaching these minimum ratings, a system description is required to document at least the top-level components. The SIG software laboratory was accredited by the TÜViT certification body to function within this scheme as an evaluation laboratory for software code quality according to ISO/IEC Guide 65 [15].
As indicated in Figure 1, the issued certificates are published in an online registry. For full traceability, the certificates and the underlying evaluation
reports are always annotated with the version of the quality model and source
code analysis tools that were used for evaluation.
4.2 Standardized consulting approach
Based on experience from its assessment and monitoring projects in code qual-
ity and maintainability [10, 25], SIG has enhanced the evaluation procedure
described in Section 3 with activities to collect complementary information
about a system under evaluation [5].
Usually, the projects start, after an initial meeting, with the source code submission. Next, a technical session is held together with the development staff of the customer to find out how the code is organized, what decisions were taken during development, and so on. During a second, strategic session, SIG collects information from the management, e.g. the reason for the evaluation, the history of the system, future plans, etc. Using a method of structured business scenarios for the future use of the system, SIG, together with the customer, attaches risk figures to the various development options the management has for the system under test. This information can be used later on to prioritize investments in code quality, linking the quality assessment and the business plan for the system.
In parallel to these sessions, the standardized evaluation procedure is car-
ried out. The results of the evaluation are communicated to the customers in
several ways. To begin with, a validation session is scheduled to resolve results
that may contradict the initial briefings. When the results are consolidated,
SIG presents its findings in a management session. Finally, an assessment
report is provided with the detailed results and all recommendations for im-
provement established during the assessment.
4.3 Other applications
Standardized software evaluations have several other possible applications.
The evaluation of software maintainability provides managers in ongoing
development projects with valuable information about the quality of the code
they are producing, thus allowing direct monitoring of that quality [26, 27, 4].
Decision-makers purchasing software, in a scenario where the future maintenance burden will be on them, will reach better decisions when the code quality is evaluated for the products on their shortlist.
In tendering or outsourcing procedures, software providers may prove the
quality of their product with a certificate.
Reaching a certifiable level of maintainability may become part of develop-
ment contracts and thus raise the quality level of individual software projects.
5 Case examples
In this section, we recount some cases where the standardized approach described in this paper played a central role in the improvement of software quality.
5.1 Ministry of Justice, NRW, Germany
For its prison regime, the Ministry of Justice in the German state of Nordrhein-Westfalen uses BASIS-Web, a complete prison management system implemented in Java client-server technology. Since 2005, BASIS-Web computers have been used for the management of prisoners' data, the correct calculation of periods of detention, as well as the treatment of prisoners' cash balances.

SIG was asked, together with TÜViT, to assess the future-proofness of this system. The assessment had two goals: first, to give insight into the maintainability of the system; secondly, to determine whether any future performance risks could be identified from the source code. The system analysis was performed according to the standardized consulting approach described in Section 4.2.
The main result was insight into the overall maintainability of the system, based on the ISO/IEC 9126 dimensions of analyzability, changeability, testability and stability. Additionally, the system was compared against a benchmark of other systems of similar size and architecture. This provided the Ministry of Justice, as well as the development contractor, with insights into the strengths and weaknesses of the system.
lower than expected by the Ministry and the contractor and also somewhat
lower than the average of comparable systems. Although this result may be ex-
plainable from the project’s basic parameters (e.g. design phase started before
standardized Java components for session control became available and thus
had to be developed by the project) it clearly opens the stage for improvement
because it replaced a number of subjective expert opinions about code quality
with an objective assessment.
More in-depth analysis of the results on component level, and partly on unit level, revealed particular places in the code on which to concentrate further effort. SIG and TÜViT consultants proposed a number of concrete measures for improvements in BASIS-Web. Every suggested improvement was validated in the repository for relevance in current state-of-the-art Java software systems. This made it possible to prioritize the improvements in a roadmap and to avoid costly cosmetic work in the source code without effect on later maintainability.
To help the customer relate possible improvement work to the overall business strategy for the system, three different scenarios were considered:
1. Continue using the system as is, i.e. with the current user base and functionality. However, even without enhancements in users or functions, a baseline of maintenance effort resulting from bug fixing or adaptation to changing legislation has to be expected.

Table 2 Quantification of savings due to monitoring (reproduced from [14]). The two systems are of comparable functional size and were developed and maintained within the same KAS BANK development team.

             Developed with   Rebuild value    Hours spent   Number of
             monitoring       in man-months    on defects    defects
  System A   Yes              34               <20           2
  System B   No               89               500           25
2. Extend the user base for the system to more German states or to foreign countries. This scenario would mean more maintenance effort beyond the baseline from scenario 1, since more users demand bug fixes, updates and enhancements. Additionally, the system performance would have to be improved.
3. Cover more areas of prison management than today, thus enhancing functionality. For this scenario, SIG and TÜViT foresaw the same challenge in maintenance terms as in scenario 2, this time with more functionality as the driver for the increase in code size and complexity.
For the first scenario (maintaining the status quo), only quick wins and easy-to-implement improvements from the roadmap were suggested (e.g. fixing empty Java exceptions). For scenario 2 (extension of the user base) and scenario 3 (extension of the functionality), however, major and costly reworkings of the code base have to be considered in order to yield a future-proof prison management system (e.g. partly changing the allocation of functions to modules, or using a different paradigm for session control).
5.2 KAS BANK

KAS BANK is a European specialist in wholesale security services that had to meet the challenge of updating its IT systems. These systems had gradually become legacy, with two undesirable consequences, namely i) they posed a risk when new services were introduced, and ii) they were costly with regard to maintenance [14]. KAS BANK decided to gradually modernize this application
portfolio by migration to the .NET platform, making optimal use of its existing
employees and their know-how.
KAS BANK decided to make use of SIG's portfolio monitoring service to safeguard the maintainability of the new software in the long run. In addition, the maintainability of the modernized software was certified by TÜViT. For example, the Tri-Party Collateral Management system of KAS BANK achieved a certificate with 4 stars in April 2009.
KAS BANK analyzed the savings that were obtained through monitoring the systems during their development, in terms of maintenance effort and the number of defects occurring in the post-production stage. A summary is given in Table 2. Although these results are not based on a controlled experiment, they give an indication of the positive effect of monitoring, both in the reduction of the number of hours that needed to be spent on solving defects and in the reduction of the number of defects that occurred.
6 Discussion of the approach
In the following, we discuss a number of potential limitations of our approach.
6.1 Repository bias
The benchmark repository that supports calibration consists of systems ana-
lyzed by SIG. This could lead to a bias in the repository, which would have
an impact on the generalizability of the quality model. An important type of
bias could be that only systems with quality problems require SIG’s services,
thus resulting in low standards for quality.
We think that a systematic bias of this kind is unlikely, since the repository is populated with systems analyzed in different contexts, for which varying levels of quality are to be expected. It contains systems analyzed in the context of i) one-off assessments, ii) monitoring projects, iii) certification requests, and iv) assessments of open-source systems. Systems analyzed for i) may actually have quality problems, while those analyzed for ii) are typically steered to improve their quality. In the context of iii), good-quality systems are to be expected, and iv) are performed on SIG's own initiative. Thus, systems analyzed in the context of ii), iii) and iv) should display state-of-the-art code quality, whereas only systems analyzed in the context of i) might suffer from quality problems leading to the involvement of SIG for improvement.
Every time calibration is performed, we conduct a systematic investigation of the bias in the set of systems used. This is done using statistical analysis to compare the ratings obtained by different groups of systems. Unfortunately, it is virtually impossible to investigate all potential dimensions of bias. Currently we inspect bias with respect to volume (is there a significant difference between big and small systems?), programming languages (e.g. do Java systems score differently from C# systems?), SIG project type (one-off assessments, monitoring projects, certification requests and assessments of open-source), and development environment (industrial development versus open-source), among others.
6.2 Quality model soundness
An important part of the approach presented in this paper is the quality model
used to aggregate source code metrics to ratings. This model was created to be
pragmatic, easy to explain, technology independent and to enable root-cause
analysis (see [13]). Like any model, it is not complete and provides only an estimation of the modeled variable, in this case maintainability.
Even though the model was created and developed through years of expe-
rience in assessing maintainability, by experts in the field, it is important to
scientifically assess its soundness. We have been conducting empirical studies
in order to build up more confidence in the model and its relationship with
actual maintainability of a system.
In [8] a study was conducted on the connections between source code prop-
erties and ISO/IEC 9126 sub-characteristics, as defined in the model. The
relationship was found to be mostly consistent with expert opinions. In [29]
another study was performed, this time to investigate the relationship be-
tween the ratings calculated by our quality model, and the time taken to solve
defects. All ratings, except for one, were found to correlate with defect resolu-
tion time, which is a proxy for actual maintenance performed. This study was
further extended in a Master’s project [3] to include enhancements, 3 more in-
dicators of issue handling performance, and quantification of the relationship.
The results were consistent with [29] in supporting the relationship between
issue handling performance and ratings.
All three studies further developed our confidence in the model, but also
helped reveal some limitations. We continue to extend these studies and per-
form new ones, as well as progressively improving the model.
6.3 Quality model stability
As described in Section 2.4, re-calibration is performed periodically, at least
once a year. The objective is to ensure that the quality model reflects the state
of the art in software engineering.
An important issue stemming from re-calibration could be that updating
the set of systems would cause the model to change dramatically, thus reducing
its reliability as a standard. As mentioned in Section 2.4, this is taken into
consideration when determining the modifications (addition of new systems,
removal of old ones) to the set.
A recent (January 2010) re-calibration was evaluated for its impact on 25
monitored systems. This was performed by calculating, for each system, the
dierences in ratings obtained by applying the existing model, versus the ones
obtained by applying the newly calibrated one. The result was an average
difference of −0.17 in the overall maintainability rating per system,
ranging from −0.56 to +0.09. Ratings are calculated on a continuous scale
over the range [0.5; 5.5[, so these values correspond to −3.4%, −11.2% and
+1.8% of the possible range (respectively). We consider such a small change
per re-calibration to be what one should expect from the repository, given
real industry-wide improvements in software quality.
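The percentage figures above follow directly from the width of the rating scale. A minimal sketch of the conversion, using the delta values reported in this section:

```python
# Convert a rating difference to a percentage of the continuous
# rating scale [0.5, 5.5[, whose total width is 5.0.
SCALE_WIDTH = 5.5 - 0.5

def delta_as_percentage(delta):
    """Express a rating delta as a percentage of the scale width."""
    return 100.0 * delta / SCALE_WIDTH

# The deltas observed in the January 2010 re-calibration:
for delta in (-0.17, -0.56, 0.09):
    print(f"{delta:+.2f} rating points = {delta_as_percentage(delta):+.1f}% of the scale")
```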
It could be argued that re-calibration weakens the role of the quality model
as a standard for software quality because there is no clear reference to the
particular parameter set used for an evaluation. To counter this, the quality
model is explicitly versioned and any document related to evaluation results
identifies the version of the model used. If there is doubt that a particular cur-
rent quality model is correct, evaluation can also be done with the parameter
set of an earlier model and results can be compared.
6.4 Focus
The standardized evaluation procedure described in this paper has its focus on
assessing a software product’s maintainability. As described in the ISO/IEC
9126 [17], this is just one aspect of a software product’s quality, thus it is possi-
ble and even desirable to use it in combination with other quality instruments.
Various kinds of software testing, such as unit testing, functional testing,
integration testing, and acceptance testing are essential for software product
quality, both functional and technical. However, evaluation of technical quality
as in the described approach does not replace testing, but operates on a higher
level, is less labor-intensive, and can be performed independently.
Methodologies for software process improvement (SPI), such as the Ca-
pability Maturity Model Integration (CMMI) concern the production process,
rather than the product. SPI works under the assumption that better processes
lead to better products. Since this relationship is not a perfect one, improving
software via an objective assessment of source code quality is an independent
approach usable in a complementary way.
7 Related work
7.1 Software product quality improvement
The idea of improving software quality with the use of source code metrics has
a long history. An interesting recent work is the Code Quality Management
(CQM) framework proposed by Simon, Seng and Mohaupt [32]. Besides apply-
ing code metrics in large software projects, the authors introduce the idea of a
benchmark repository for comparing between projects and identifying the best
practices across the software industry. However, current emphasis in CQM is
given to the definition of new creative quality indicators for object oriented
programming rather than to setting up a universal benchmark standard for
comparison across the different software development paradigms.
7.2 Quality assessment methodologies
Some methodologies have been proposed for the assessment of software prod-
ucts, namely targeted at open source projects. These include OSMM [11],
QSOS [2] and OpenBRR [31]. A comparison of the latter two can be found
in [9]. These methods mainly focus on the community contribution, activity
and other “environmental” characteristics of the product. Although these
approaches usually include an assessment of technical quality, no concrete
definition of measurements or norms is provided.
Currently there are several research projects related to quality of open
source software, e.g. FLOSSMetrics, QualOSS, or SQO-OSS. Each of these
three projects aims to provide a platform for gathering information regard-
ing open source projects and possibly to provide some automated analysis of
that information.
7.3 Certification of functional quality
Here we briefly discuss work related to the application of our standardized
evaluation procedure for software product certification.
Heck et al [12] have developed a method for software certification where
five levels of verification are distinguished. At the highest level, the software
product is verified using formal methods, where properties are proven not
only about an abstract model, but about the software itself.
ISO/IEC 9126 lists security as one of the sub-characteristics of functional-
ity. The most well-known software standard regarding security is the ISO/IEC
15408 standard on evaluation of IT security criteria [19]. It is also published
under the title Common Criteria for Information Technology Security Eval-
uation (CC) and is the basis for an elaborate software product certification
scheme. Besides specific product security requirements, this standard defines
a framework for specification, implementation, and evaluation of security as-
pects. In many industrial countries, there are national certification schemes
for software product security based on the CC.
The ISO/IEC 25051 [21] specifies requirements for functionality, documen-
tation and usability of Commercial Off-The-Shelf (COTS) products. COTS
products are standard software packages sold to consumers “as is”, i.e. with-
out a consulting service or other support. ISO/IEC 25051 requires that any
claim made in the product documentation is tested, thus assuring the func-
tional correctness of the product. The fulfillment of this standard can also be
certified by many certification bodies.
Finally, the international standard ISO 9241 [22] describes requirements
for software product usability. ISO 9241 pursues the concept of usability-in-
context, i.e. usability is not a generic property but must always be evaluated in
the context of use of the product. In Germany, there is a certification scheme
for software product usability based on ISO 9241-110, -11.
7.4 Software process quality improvement
The issue of software quality can be addressed not only from the point of
view of the product, but also by evaluating and improving the development
process. Originating from the so-called software crisis and the high demands on
software quality in the defense sector, process approaches to software quality
have been around for some twenty years.
Process approaches like Capability Maturity Model Integration (CMMI)
or Software Process Improvement and Capability Determination (SPICE) [18]
can best be seen as collections of best practices for organizations aiming at
the development of excellent software. Both approaches arrange best practices
in reference process models.
A basic concept is the maturity of an organization in implementing the
quality practices and the reference models. The maturity concept motivates
organizations to keep improving once an initial level is reached. In several
industrial areas, a minimal maturity level is required for contractors to
compete for software development business. Thus, together with the models,
audit and certification schemes are
available to assess the process maturity of an organization. Training institu-
tions provide insight into the models and support candidate organizations in
improving their maturity level.
Although they improve the odds, process approaches to software quality
cannot guarantee that a software product reaches a certain technical quality
level, because the impact of the process on the actual programming work is
only indirect. Additionally, implementing process approaches usually
requires some investment, since a number of roles have to be staffed with
trained personnel who are withdrawn from the creative development process.
7.5 Benchmarking
Software benchmarking is usually associated with productivity rather than
code quality. Jones [24] provides a treatment of benchmarking software projects.
The focus is not on the software product, though the functional size of systems
in terms of function points and the technical volume in terms of lines of code,
are taken into account.
The International Software Benchmarking Standards Group (ISBSG) col-
lects data about software productivity and disseminates the collected data
for benchmarking purposes. Apart from function points and lines of code, no
software product measures are taken into account.
In [23], Izquierdo-Cortazar et al use a set of 1400 open-source projects to
determine thresholds for a number of metrics regarding the level of activity of
communities. This is comparable to how we calibrate our quality model, but
their work differs in terms of focus (community quality), repository composi-
tion (restricted to open-source) and methodology.
8 Concluding remarks and future work
We have provided an overview of the standardized models and procedures
used by SIG and TÜViT for evaluation, benchmarking, and certification of
software products. Standardization is achieved by following the terminology
and requirements of several relevant international standards.
We have explained the role of a central benchmark repository in which
evaluation results are accumulated to be used in annual calibration of the
measurement model. Such calibration enables comparison of software prod-
ucts against industry-wide levels of quality. In combination with standardized
procedures for evaluation, the calibrated model is a stable basis for the pre-
sented software product certification scheme.
We have shown, with real-world examples, the value of the approach as a
tool for improving and managing technical quality.
We are continuously working on evaluating and validating the quality
model in various ways (see [8, 29]). We also plan to extend the model to
encompass more dimensions of technical quality, such as implemented archi-
tecture [5], or even other dimensions of software quality besides maintainabil-
ity. Furthermore, given the importance of the software benchmark repository
and its size and heterogeneity, we aim to continuously extend it with new and
more systems.
References

1. Alves TL, Ypma C, Visser J (2010) Deriving metric thresholds from benchmark data. In: 26th IEEE International Conference on Software Maintenance (ICSM 2010), September 12-18, 2010, Timisoara, Romania
2. Atos Origin (2006) Method for qualification and selection of open source
software (QSOS), version 1.6
3. Bijlsma D (2010) Indicators of issue handling efficiency and their relation to software maintainability. Master's thesis, University of Amsterdam
4. Bouwers E, Vis R (2008) Multidimensional software monitoring applied to ERP. In: Makris C, Visser J (eds) Proc. 2nd Int. Workshop on Software Quality and Maintainability, Elsevier, ENTCS, to appear
5. Bouwers E, Visser J, van Deursen A (2009) Criteria for the evaluation
of implemented architectures. In: 25th IEEE International Conference on
Software Maintenance (ICSM 2009), September 20-26, 2009, Edmonton,
Alberta, Canada, IEEE, pp 73–82
6. Correia J, Visser J (2008) Benchmarking technical quality of software
products. In: WCRE ’08: Proceedings of the 2008 15th Working Confer-
ence on Reverse Engineering, IEEE Computer Society, Washington, DC,
USA, pp 297–300, DOI
7. Correia JP, Visser J (2008) Certification of technical quality of software
products. In: Barbosa L, Breuer P, Cerone A, Pickin S (eds) International
Workshop on Foundations and Techniques bringing together Free/Libre
Open Source Software and Formal Methods (FLOSS-FM 2008) & 2nd
International Workshop on Foundations and Techniques for Open Source
Certification (OpenCert 2008), United Nations University - International
Institute for Software Technology (UNU-IIST), Research Report 398, pp
8. Correia JP, Kanellopoulos Y, Visser J (2009) A survey-based study of the mapping of system properties to ISO/IEC 9126 maintainability characteristics. In: 25th IEEE International Conference on Software Maintenance (ICSM 2009), September 20-26, 2009, Edmonton, Alberta, Canada, IEEE, pp 61–70
9. Deprez JC, Alexandre S (2008) Comparing assessment methodologies for
free/open source software: OpenBRR and QSOS. In: PROFES
10. van Deursen A, Kuipers T (2003) Source-based software risk assessment.
In: ICSM ’03: Proc. Int. Conference on Software Maintenance, IEEE Com-
puter Society, p 385
11. Golden B (2005) Making open source ready for the enterprise: The open
source maturity model, whitepaper available from
12. Heck P, Eekelen Mv (2008) The LaQuSo software product certification
model: (LSPCM). Tech. Rep. 08-03, Tech. Univ. Eindhoven
13. Heitlager I, Kuipers T, Visser J (2007) A practical model for measur-
ing maintainability. In: 6th Int. Conf. on the Quality of Information and
Communications Technology (QUATIC 2007), IEEE Computer Society,
pp 30–39
14. van Hooren M (2009) KAS BANK and SIG - from legacy to software certified by TÜViT. Banking and Finance
15. International Organization for Standardization (1996) ISO/IEC Guide 65:
General requirements for bodies operating product certification systems
16. International Organization for Standardization (1999) ISO/IEC 14598-1:
Information technology - software product evaluation - part 1: General
17. International Organization for Standardization (2001) ISO/IEC 9126-1:
Software engineering - product quality - part 1: Quality model
18. International Organization for Standardization (2004) ISO/IEC 15504: In-
formation technology - process assessment
19. International Organization for Standardization (2005) ISO/IEC 15408: Information technology - security techniques - evaluation criteria for IT security
20. International Organization for Standardization (2005) ISO/IEC 17025: General requirements for the competence of testing and calibration laboratories
21. International Organization for Standardization (2006) ISO/IEC 25051: Software engineering - software product quality requirements and evaluation (SQuaRE) - requirements for quality of commercial off-the-shelf (COTS) software product and instructions for testing
22. International Organization for Standardization (2008) ISO/IEC 9241: Er-
gonomics of human-system interaction
23. Izquierdo-Cortazar D, Gonzalez-Barahona JM, Robles G, Deprez JC, Au-
vray V (2010) FLOSS communities: Analyzing evolvability and robustness
from an industrial perspective. In: Proceedings of the 6th International
Conference on Open Source Systems (OSS 2010)
24. Jones C (2000) Software Assessments, Benchmarks, and Best Practices.
25. Kuipers T, Visser J (2004) A tool-based methodology for software portfolio
monitoring. In: Piattini M, Serrano M (eds) Proc. 1st Int. Workshop on
Software Audit and Metrics, (SAM 2004), INSTICC Press, pp 118–128
26. Kuipers T, Visser J (2004) A tool-based methodology for software portfolio
monitoring. In: Piattini M, et al (eds) Proc. 1st Int. Workshop on Software
Audit and Metrics, (SAM 2004), INSTICC Press, pp 118–128
27. Kuipers T, Visser J, de Vries G (2007) Monitoring the quality of out-
sourced software. In: van Hillegersberg J, et al (eds) Proc. Int. Work-
shop on Tools for Managing Globally Distributed Software Develop-
ment (TOMAG 2007), Center for Telematics and Information Technology,
28. Lokan C (2008) The Benchmark Release 10 - project planning edition. Tech. rep., International Software Benchmarking Standards Group Ltd.
29. Luijten B, Visser J (2010) Faster defect resolution with higher technical
quality of software. In: 4th International Workshop on Software Quality
and Maintainability (SQM 2010), March 15, 2010, Madrid, Spain
30. McCabe TJ (1976) A complexity measure. In: ICSE ’76: Proceedings of
the 2nd international conference on Software engineering, IEEE Computer
Society Press, Los Alamitos, CA, USA, p 407
31. OpenBRRorg (2005) Business readiness rating for open source, request for
comment 1
32. Simon F, Seng O, Mohaupt T (2006) Code Quality Management: Tech-
nische Qualit¨at industrieller Softwaresysteme transparent und vergleichbar
gemacht. dpunkt-Verlag, Heidelberg, Germany
33. Software Improvement Group (SIG) and TÜV Informationstechnik GmbH (TÜViT) (2009) SIG/TÜViT evaluation criteria – Trusted Product Maintainability, version 1.0
34. Software Productivity Research (2007) Programming Languages Table
(version 2007d)
A The quality model
The SIG has developed a layered model for measuring and rating the technical quality of
a software system in terms of the quality characteristics of ISO/IEC 9126 [13]. The layered
structure of the model is illustrated in Figure 6. This appendix section describes the current
state of the quality model, which has been improved and further operationalized since [13].
[Figure 6: layered structure of the model — source code measurements (unit complexity, unit size, unit interfacing, module coupling, among others) are aggregated into product properties, which map onto ISO/IEC 9126 sub-characteristics.]
Fig. 6 Relation between source code metrics and system sub-characteristics of maintainability (image taken from [29]).
Source code metrics are used to collect facts about a software system. The measured
values are combined and aggregated to provide information on properties at the level of the
entire system, which are then mapped into higher level ratings that directly relate to the
ISO/IEC 9126 standard. These ratings are presented using a five star system (from ★ to
★★★★★), where more stars mean better quality.
A.1 Source code measurements
In order to make the product properties measurable, the following metrics are calculated:
Estimated rebuild value The software product’s rebuild value is estimated from the
number of lines of code. This value is calculated in man-years using the Programming
Languages Table of the Software Productivity Research [34]. This metric is used to
evaluate the volume property;
Percentage of redundant code A line of code is considered redundant if it is part of
a code fragment (larger than 6 lines of code) that is repeated literally (modulo white-
space) in at least one other location in the source code. The percentage of redundant
lines of code is used to evaluate the duplication property;
Lines of code per unit The number of lines of code in each unit. The notion of unit is
defined as the smallest piece of invokable code, excluding labels (for example a function
or procedure). This metric is used to evaluate the unit size property;
Cyclomatic complexity per unit McCabe's cyclomatic complexity [30] for each
unit. This metric is used to evaluate the unit complexity property;
Number of parameters per unit The number of parameters declared in the interface
of each unit. This metric is used to evaluate the unit interfacing property;
Number of incoming calls per module The number of incoming invocations for each
module. The notion of module is defined as a delimited group of units (for example a
class or file). This metric is used to evaluate the module coupling property.
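To illustrate the duplication measurement described above, the following sketch (not SIG's actual implementation; the fixed 6-line window matching is a simplifying assumption) estimates the percentage of redundant lines by finding 6-line fragments that recur, modulo leading and trailing whitespace:

```python
# Illustrative sketch only (not SIG's actual implementation): estimate
# the percentage of redundant lines by finding 6-line windows that
# occur, modulo leading/trailing whitespace, in more than one place.
from collections import defaultdict

BLOCK = 6  # minimal fragment length, per the definition above

def redundant_line_percentage(lines):
    normalized = [line.strip() for line in lines]
    windows = defaultdict(list)  # window content -> starting indices
    for i in range(len(normalized) - BLOCK + 1):
        windows[tuple(normalized[i:i + BLOCK])].append(i)
    redundant = set()
    for starts in windows.values():
        if len(starts) > 1:  # this window repeats elsewhere
            for s in starts:
                redundant.update(range(s, s + BLOCK))
    return 100.0 * len(redundant) / len(lines) if lines else 0.0

# A 13-line file whose first six lines are repeated at the end:
code = ["a=1", "b=2", "c=3", "d=4", "e=5", "f=6", "x=0",
        "a=1", "b=2", "c=3", "d=4", "e=5", "f=6"]
print(round(redundant_line_percentage(code), 1))  # 12 of 13 lines are redundant
```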
A.2 From source code measurements to source code property ratings
To evaluate measurements at the source code level as property ratings at the system level,
we make use of just a few simple techniques. In case the metric is more relevant as a single
value for the whole system, we use thresholding to calculate the rating. For example, for
duplication we use the amount of duplicated code in the system, as a percentage, and perform
thresholding according to the following values:

rating duplication
★★★★★ 3%
★★★★ 5%
★★★ 10%
★★ 20%

The interpretation of this table is that the values on the right are the maximum values the
metric can have that still warrant the rating on the left. Thus, to be rated as ★★★★★ a
system can have no more than 3% duplication, and so forth.
In case the metric is more relevant at the unit level, we make use of so-called quality
profiles. As an example, let us take a look at how the rating for unit complexity is calculated.
First the cyclomatic complexity index [30] is calculated for each code unit (where a unit
is the smallest piece of code that can be executed and tested individually, for example a
Java method or a C function). The values for individual units are then aggregated into four
risk categories (following a similar categorization of the Software Engineering Institute), as
indicated in the following table:
cyclomatic complexity risk category
1-10 low risk
11-20 moderate risk
21-50 high risk
>50 very high risk
For each category, the relative volumes are computed by summing the lines of code of the
units that fit in that category, and dividing by the total lines of code in all units. These
percentages are finally rated using a set of thresholds, defined as in the following example:

maximum relative volume
rating moderate high very high
★★★★★ 25% 0% 0%
★★★★ 30% 5% 0%
★★★ 40% 10% 0%
★★ 50% 15% 5%
Note that this rating scheme is designed to progressively give more importance to cat-
egories with more risk. The first category (“low risk”) is not shown in the table since it is
the complement of the sum of the other three, adding up to 100%. Other properties have
similar evaluation schemes relying on different categorization and thresholds. The particular
thresholds are calibrated per property, against a benchmark of systems.
Such quality profiles have as an advantage over other kinds of aggregation (such as
summary statistics like mean or median value) that sufficient information is retained to
make significant quality differences between systems detectable (see [1] for a more detailed
discussion).
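The quality-profile calculation for unit complexity can be sketched as follows; the input representation (per-unit pairs of lines of code and cyclomatic complexity) is chosen here for illustration:

```python
# Sketch of the quality-profile rating for unit complexity, using the
# risk categories and thresholds tabulated above.
RISK_BINS = [(10, "low"), (20, "moderate"), (50, "high")]

def risk_category(cc):
    for limit, name in RISK_BINS:
        if cc <= limit:
            return name
    return "very high"

# rating -> maximum relative volume (%) for moderate/high/very high risk
PROFILE_THRESHOLDS = [(5, (25, 0, 0)), (4, (30, 5, 0)),
                      (3, (40, 10, 0)), (2, (50, 15, 5))]

def unit_complexity_rating(units):
    """units: list of (lines_of_code, cyclomatic_complexity) pairs."""
    total = sum(loc for loc, _ in units)
    volume = {"low": 0, "moderate": 0, "high": 0, "very high": 0}
    for loc, cc in units:
        volume[risk_category(cc)] += loc
    rel = {k: 100.0 * v / total for k, v in volume.items()}
    for stars, (mod, high, very) in PROFILE_THRESHOLDS:
        if (rel["moderate"] <= mod and rel["high"] <= high
                and rel["very high"] <= very):
            return stars
    return 1

# 80 LOC of low-risk units, 15 LOC moderate risk, 5 LOC high risk:
print(unit_complexity_rating([(80, 5), (15, 15), (5, 30)]))  # 4 stars
```

Note how the 5 LOC of high-risk code already rules out a five-star rating, even though 95% of the volume is low or moderate risk.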
The evaluation of source code properties is first done separately for each different pro-
gramming language, and subsequently aggregated into a single property rating by weighted
average, according to the relative volume of each programming language in the system.
The specific thresholds used are calculated and calibrated on a periodic basis based on
a large set of software systems, as described in Section 2.4.
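The per-language aggregation amounts to a weighted average; a minimal sketch, assuming per-language ratings and volumes are already known:

```python
# Weighted average of per-language property ratings, weighted by each
# language's share of the code volume.
def aggregate_rating(per_language):
    """per_language: list of (rating, lines_of_code) per language."""
    total = sum(loc for _, loc in per_language)
    return sum(rating * loc for rating, loc in per_language) / total

# e.g. 80k LOC of Java rated 4.0 plus 20k LOC of SQL rated 2.0:
print(aggregate_rating([(4.0, 80_000), (2.0, 20_000)]))  # 3.6
```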
A.3 Continuous scale
The calculation of ratings from source code metrics is described in terms of discrete quality
levels. These values will need to be further combined and aggregated and for that, a discrete
scale is not adequate. We thus use the discrete scale for describing the evaluation schemes,
but make use of interpolation to adapt them in order to obtain ratings in a continuous scale
in the interval [0.5,5.5[. An equivalence between the two scales is established so that the
behavior as described in terms of the discrete scale is preserved.
Let us consider a correspondence of the discrete scale to a continuous one where ★
corresponds to 1, ★★ to 2 and so forth. Thresholding as it was described can then be seen
ISO 9126          volume  duplication  unit size  unit complexity  unit interfacing  inward coupling
analyzability        ×         ×           ×
changeability                  ×                         ×                                   ×
stability                                                                   ×                ×
testability                                ×             ×
Fig. 7 Mapping of source code properties to ISO/IEC 9126 sub-characteristics.
as a step function, defined, for the example of duplication (d), as:

rating(d) = 5 if d ≤ 3%
            4 if 3% < d ≤ 5%
            3 if 5% < d ≤ 10%
            2 if 10% < d ≤ 20%
            1 if d > 20%

This step function can be converted into a continuous piecewise linear function as follows:
1. In order for the function to be continuous, the value for a point on the limit between
two steps (say, for example, point 3%, which is between the steps with values 4 and
5) should be between the two steps' values (in the case of point 3% it would then be
(4 + 5)/2 = 4.5). Thus, for example, rating(5%) = 3.5 and rating(10%) = 2.5;
2. Values between limits are computed by linear interpolation using the limit values. For
example, rating(5.1%) = 3.48 and rating(7.5%) = 3.
The equivalence to the discrete scale can be established by arithmetic, round half up round-
ing.
This approach has the advantage of providing more precise ratings. Namely, with the
first approach we have, for example, rating(5.1%) = rating(10%) = 3, whereas in the
second approach we have rating(5.1%) = 3.48 and rating(10%) = 2.5. Thus,
one can distinguish a system with 5.1% duplication from another one with 10%, while still
preserving the originally described behavior.
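The interpolation scheme can be sketched as follows; how to handle values outside the published thresholds is not specified above, so the clamping at the edges is an assumption:

```python
# Piecewise linear rating for duplication: the thresholds become
# breakpoints, with the value at each breakpoint halfway between the
# adjacent steps (e.g. rating(3%) = 4.5, rating(5%) = 3.5).
BREAKPOINTS = [(3.0, 4.5), (5.0, 3.5), (10.0, 2.5), (20.0, 1.5)]

def continuous_rating(d):
    if d <= BREAKPOINTS[0][0]:
        return BREAKPOINTS[0][1]   # clamped; edge behavior is an assumption
    if d >= BREAKPOINTS[-1][0]:
        return BREAKPOINTS[-1][1]  # clamped
    for (x0, y0), (x1, y1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if x0 <= d <= x1:
            return y0 + (d - x0) * (y1 - y0) / (x1 - x0)

print(round(continuous_rating(5.1), 2))  # 3.48
print(continuous_rating(7.5))            # 3.0
```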
The technique is also applied to the evaluation schemes for quality profiles of a certain
property. Namely, interpolation is performed per risk category, resulting in three provisional
ratings of which the minimum is taken as the final rating for that property.
A.4 From source code property ratings to ISO/IEC 9126 ratings
Property ratings are mapped to ratings for ISO/IEC 9126 sub-characteristics of maintain-
ability following dependencies summarized in a matrix (see Figure 7). In this matrix, a × is
placed whenever a property is deemed to have an important impact on a certain sub-char-
acteristic. These impacts were decided upon by a group of experts and have further been
studied in [8].
The sub-characteristic rating is obtained by averaging the ratings of the properties where
a × is present in the sub-characteristic's line in the matrix. For example, changeability is
represented in the model as affected by duplication, unit complexity and inward coupling,
thus its rating will be computed by averaging the ratings obtained for those properties.
Finally, all sub-characteristic ratings are averaged to provide the overall maintainability
rating.
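The matrix-based mapping and the two averaging steps can be sketched as below; the row contents encoded here are reconstructed from Figure 7 and the changeability example in the text, so treat them as an approximation of the published matrix:

```python
# Sketch of the mapping from property ratings to sub-characteristic
# ratings and the overall maintainability rating. The MAPPING rows are
# a reconstruction, not a verbatim copy of the published matrix.
MAPPING = {
    "analyzability": ["volume", "duplication", "unit size"],
    "changeability": ["duplication", "unit complexity", "inward coupling"],
    "stability": ["unit interfacing", "inward coupling"],
    "testability": ["unit size", "unit complexity"],
}

def sub_characteristic_ratings(properties):
    # each sub-characteristic is the plain average of its marked properties
    return {sub: sum(properties[p] for p in props) / len(props)
            for sub, props in MAPPING.items()}

def maintainability(properties):
    # the overall rating is the average of the sub-characteristic ratings
    subs = sub_characteristic_ratings(properties)
    return sum(subs.values()) / len(subs)

ratings = {"volume": 4.0, "duplication": 3.0, "unit size": 4.0,
           "unit complexity": 2.0, "unit interfacing": 3.0,
           "inward coupling": 4.0}
print(sub_characteristic_ratings(ratings)["changeability"])  # (3+2+4)/3 = 3.0
```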
... With our new method-level benchmark of code metrics and change evolution, we reproduce three major prior observations: 1) Similar to some previous studies (e.g., Johnson et al. 2019;Tiwari and Kumar 2014;Subandri and Sarno 2017;Romano and Pinzger 2011), we first ignore size as a confounding factor, and show that code metrics are good maintenance predictors. 2) By dividing a metric value by size-a common (Suh and Neamtiu 2010;Shepperd 1988;Robert et al. 2012), but inaccurate approach (Gil and Lalouche 2017) for size normalization-we reproduce the claim that code metrics are good maintenance predictors. 3) We then show that the widely adopted size normalization approach fails to neutralize the size influence, and the maintenance impact of code metrics can still be explained by their correlation with size. ...
... Although this approach should reduce the confounding impact of size to some extent, analyzing all methods with SLOC > 60 (for example) in one group can not eliminate the problem completely. A more common approach is to calculate metric density per lines of code (Suh and Neamtiu 2010;Shepperd 1988;Robert et al. 2012; Lalouche 2017)i.e., 100×McCabe/Size. Unfortunately, Gil and Lalouche (2017) argued that this approach is inaccurate and questions some of the previous claims of validity for different code metrics. ...
... From this we conclude that without size normalization we do not know the true effectiveness of code metrics. The most common approach for size normalization takes the density of a metric and divides its measure in a component by the size of the component (Suh and Neamtiu 2010;Shepperd 1988;Robert et al. 2012;Gil and Lalouche 2017). For example, 100 × McCabe/Size gives the McCabe value per 100 lines of code, so we should have a normalized McCabe measure completely independent of size. ...
Full-text available
Evaluating and predicting software maintenance effort using source code metrics is one of the holy grails of software engineering. Unfortunately, previous research has provided contradictory evidence in this regard. The debate is still open: as a community we are not certain about the relationship between code metrics and maintenance impact. In this study we investigate whether source code metrics can indeed establish maintenance effort at the previously unexplored method level granularity. We consider \(\sim \)730K Java methods originating from 47 popular open source projects. After considering seven popular method level code metrics and using change proneness as a maintenance effort indicator, we demonstrate why past studies contradict one another while examining the same data. We also show that evaluation context is king. Therefore, future research should step away from trying to devise generic maintenance models and should develop models that account for the maintenance indicator being used and the size of the methods being analyzed. Ultimately, we show that future source code metrics can be applied reliably and that these metrics can provide insight into maintenance effort when they are applied in a judiciously context-sensitive manner.
... Similarly to software systems [9], [10], poor code quality and code heterogeneity in the technologies and coding practice used have the potential to impact the overall reliability and quality of an IoT system. The problem of code quality has been addressed in various software studies and an established set of metrics has been proposed [11]- [13] and consolidated [9], [10], [14], [15]. However, such a consolidated view has not been provided for IoT systems, despite code quality's relevance in this field. ...
... In more recent studies, Heitlager et al. [20] is discussing metrics that impact the maintainability of software systems. The list provided by Heitlager et al. was further extended by Baggen et al. [15] in a study that also focuses mainly on system maintainability aspects. The established metrics such as code volume, code redundancy, unit size, code complexity, unit interface size, and component coupling are discussed [15]. ...
... The list provided by Heitlager et al. was further extended by Baggen et al. [15] in a study that also focuses mainly on system maintainability aspects. The established metrics such as code volume, code redundancy, unit size, code complexity, unit interface size, and component coupling are discussed [15]. This list partially overlaps with a selection by Pantiuchina et al., who adds cohesion and code readability to the discussion [9]. ...
Full-text available
Software code is present on multiple levels within the current Internet of Things (IoT) systems. The quality of this code impacts system reliability, safety, maintainability, and other quality aspects. In this paper, we provide a comprehensive overview of code quality-related metrics, specifically revised for the context of IoT systems. These metrics are divided into main code quality categories: Size, redundancy, complexity, coupling, unit test coverage and effectiveness, cohesion, code readability, security, and code heterogeneity. The metrics are then linked to selected general quality characteristics from the ISO/IEC 25010:2011 standard by their possible impact on the quality and reliability of an IoT system, the principal layer of the system, the code levels, and the main phases of the project to which they are relevant. This analysis is followed by a discussion of code smells and their relation to the presented metrics. The overview presented in the paper is the result of a thorough analysis and discussion of the author’s team with the involvement of external subject-matter experts in which a defined decision algorithm was followed. The primary result of the paper is an overview of the metrics accompanied by applicability notes related to the quality characteristics, the system layer, the level of the code, and the phase of the IoT project.
... This was painfully visible in a study of 15 large software organizations where the benefits of paying down TD is not always clear to managers, and consequently some managers would not grant the necessary budget, nor priorities, for refactoring [38]. The lack of clear and quantifiable benefits makes it hard to build a business case for code quality and, hence, easy to trade short-term wins for long-term sustainability and software maintenance; there are no standard Key Performance Indicators (KPIs) for code quality that are relevant to companies in the way that financial KPIs are [5]. Consequently, in a study involving 1,831 participants, only 10% reported that their business manager was actively managing TD [21]. ...
... There are numerous studies on code quality metrics [5,52]. Although primarily used to communicate how comprehensive a software solution is or how much work a development task would require, Lines of Code (LoC) metrics are often involved also in code quality metrics [2]. ...
... Prior research by Lenarduzzi et al. have calculated the variation in lead time for resolving Jira issues as an initial data-driven approach for TD estimation and to improve the prioritization when deciding to remove TD [36]. In their study, Lenarduzzi et al. measured code TD using SonarQuebe's coding rules 5 . They conclude that "we have not found evidence that either the presence of a single violation or a number of violations affects the lead time for resolving issues. ...
Full-text available
Code quality remains an abstract concept that fails to get traction at the business level. Consequently, software companies keep trading code quality for time-to-market and new features. The resulting technical debt is estimated to waste up to 42% of developers' time. At the same time, there is a global shortage of software developers, meaning that developer productivity is key to software businesses. Our overall mission is to make code quality a business concern, not just a technical aspect. Our first goal is to understand how code quality impacts 1) the number of reported defects, 2) the time to resolve issues, and 3) the predictability of resolving issues on time. We analyze 39 proprietary production codebases from a variety of domains using the CodeScene tool based on a combination of source code analysis, version-control mining, and issue information from Jira. By analyzing activity in 30,737 files, we find that low quality code contains 15 times more defects than high quality code. Furthermore, resolving issues in low quality code takes on average 124% more time in development. Finally, we report that issue resolutions in low quality code involve higher uncertainty manifested as 9 times longer maximum cycle times. This study provides evidence that code quality cannot be dismissed as a technical concern. With 15 times fewer defects, twice the development speed, and substantially more predictable issue resolution times, the business advantage of high quality code should be unmistakably clear.
... SIG offers another commercial version on BCH called Sigrid [1,8,22,45], which considers only eight guidelines, omitting i) and j) from above. For each guideline, a five-star rating is computed by comparing the results in each area with the results of analyzing many hundreds of undisclosed systems. ...
... In particular, as they seem to be a moving target. Over the history of the underlying SIG model (SIG/TÜViT Evaluation Criteria Trusted Product Maintainability), the number of measured source code properties increased from five, see Heitlager et al. [22], to six, see Baggen et al. [8], and then to nine in the current version of the SIG model [1] and ten in Visser [45] and BCH. ...
... 1. argument count (too many arguments per unit), 2. complex logic (too long Boolean expressions), 3. file length (too many lines in a file), 4. identical blocks of code (syntactic code clones), 5. unit complexity (units with too high cognitive complexity [11]), 6. unit count (too many units per module), 7. method length (too many lines per unit), 8. nested control flow (too deeply nested control structures), 9. return statements (too many return statements per unit), and 10. similar blocks of code (structural code clones). ...
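For illustration, a few of these guideline checks could be automated along the following lines. This is a minimal sketch using Python's `ast` module; the thresholds are assumed for the example and are not the actual defaults of BCH or Sigrid:

```python
import ast

# Hypothetical thresholds chosen for illustration only.
MAX_ARGS = 4          # guideline 1: argument count
MAX_UNIT_LINES = 25   # guideline 7: method length
MAX_RETURNS = 4       # guideline 9: return statements

def check_unit(fn: ast.FunctionDef) -> list[str]:
    """Flag guideline violations for a single function ('unit')."""
    findings = []
    n_args = len(fn.args.args)
    if n_args > MAX_ARGS:
        findings.append(f"{fn.name}: too many arguments ({n_args})")
    length = fn.end_lineno - fn.lineno + 1
    if length > MAX_UNIT_LINES:
        findings.append(f"{fn.name}: unit too long ({length} lines)")
    returns = sum(isinstance(n, ast.Return) for n in ast.walk(fn))
    if returns > MAX_RETURNS:
        findings.append(f"{fn.name}: too many return statements ({returns})")
    return findings

def check_source(source: str) -> list[str]:
    """Run the unit-level checks over every function in a source file."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            findings.extend(check_unit(node))
    return findings
```

Real tools additionally compute clone detection and nesting-depth checks, which require more machinery than this sketch shows.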
The technical state of software, i.e., its technical debt (TD) and maintainability, is of increasing interest as ever more software is developed and deployed. Since TD and maintainability are neither uniformly defined, nor easy to understand, nor directly measurable, practitioners are likely to apply readily available tools to assess TD or maintainability, and they may rely on the reported results without properly understanding what they embody. In this paper, we: a) methodically identify 11 readily available tools that measure TD or maintainability, b) present an in-depth investigation of how each of these tools measures and computes TD or maintainability, and c) compare these tools and their characteristics. We find that contemporary tools focus mainly on internal qualities of software, i.e., the quality of source code; that they define and measure TD or maintainability in widely different ways; that most of the tools measure TD or maintainability opaquely; and that it is not obvious why the measure of one tool is more trustworthy or representative than that of another.
... There are many prominent compliance concerns that have been addressed by numerous regulations, standards, laws, and acts. The most important concerns are as follows: security [15], safety [16][17][18], privacy [19,20], data protection [21], accountability [22], responsibility [23], transparency [24], competency [25], anti-piracy [26], anti-corruption [27], antitrust [27], accessibility [28], HCI, quality management and assurance [29][30][31], environmental management [32], sustainability [33], usability [34], human comfort [35], ethics [36], accommodation of people with disabilities [37], suitability for children [38] and the elderly [39], simplicity [40], and ease of use. ...
... A benchmark is the common or standard infrastructure employed to analyze, evaluate, and compare the reality of solutions, tools, or systems by their executions (for a few definitions, see [63][64][65][66][67][68][69][70][71]; for some instances in other fields, see [30][72][73][74][75][76]). Sometimes, a measure can simply be used as a benchmark [77]. ...
The problem of compliance checking and assessment is to ensure that the design or implementation of a system meets some desired properties and complies with some rules or regulations. This problem is a key issue in several human and engineering application domains, such as organizational management and e-governance, the software and IT industries, and software and systems quality engineering. To deal with this problem, different approaches and methods have been proposed. In addition to approaches such as formal methods, mathematical proofs, and logical evaluations, benchmarking can be used for compliance assessment. Naturally, a set of benchmarks can shape an applied solution to compliance assessment. In this paper we propose the KARB solution system, i.e., Keeping away compliance Anomalies through Rule-based Benchmarking. In our proposed method, rule-based benchmarking means evaluating the conformity of an under-compliance system to a set of rules. In this solution approach, the under-compliance system is specified symbolically (using formal and logical descriptions), and the desired rules are specified formally as the semantic logic in the evaluation process. After presenting the proposed method, a case study is conducted to demonstrate and analyze the KARB solution. The IR-QUMA study (Iranian Survey on Quality in Messenger Apps) was then conducted to evaluate the quality of some messenger applications. According to the evaluation results, the hybrid DD-KARB method (which combines semantics-awareness and data-drivenness) is more effective than the solo methods and computes a good estimation of messenger application user quality scores. Therefore, DD-KARB can be considered a method for quality benchmarking in this technical context.
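The rule-based idea can be pictured in miniature: a system is described as a set of facts, and each rule is a predicate over those facts. The rules and fields below are invented for illustration and do not come from the KARB paper, which uses formal and logical specifications rather than this simplified form:

```python
# Toy rule-based compliance benchmark: each rule is a predicate over a
# symbolic description of the under-compliance system (here, a dict).

def rule_encrypted_storage(system: dict) -> bool:
    # Hypothetical rule: stored data must use strong encryption.
    return system.get("storage_encryption") == "AES-256"

def rule_privacy_policy(system: dict) -> bool:
    # Hypothetical rule: a privacy policy must be published.
    return bool(system.get("privacy_policy"))

RULES = {
    "encrypted storage": rule_encrypted_storage,
    "privacy policy published": rule_privacy_policy,
}

def benchmark(system: dict) -> dict:
    """Return per-rule compliance results plus an overall conformance score."""
    results = {name: rule(system) for name, rule in RULES.items()}
    results["score"] = sum(results.values()) / len(RULES)
    return results
```

A fully compliant description yields a score of 1.0; each failed rule lowers the score proportionally.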
... McCabe [52], or cyclomatic complexity, is a measure of the number of independent paths in a source code component. The assumption is that a source code component with a high McCabe score would be hard to maintain; it would be more change- and bug-prone [5,6,21,47,61,62,68,84,93]. Another popular code metric is the Halstead metric, which is mainly based on the number of operators and operands in a code component [7,21,35,43,63]. ...
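As a rough illustration of the McCabe metric, the sketch below approximates cyclomatic complexity for a Python function by counting decision points; real analyzers use more precise, language-specific counting rules:

```python
import ast

# Constructs that open an extra branch. This is a simplified
# approximation of McCabe's metric, not any tool's exact definition.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(fn: ast.FunctionDef) -> int:
    complexity = 1  # a unit with no branches has exactly one path
    for node in ast.walk(fn):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # each extra operand of and/or adds a short-circuit branch
            complexity += len(node.values) - 1
    return complexity
```

For example, a function with one `if` guarded by `a and b` has complexity 3: the straight-line path, the `if` branch, and the short-circuit branch.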
Code metrics have been widely used to estimate software maintenance effort. Metrics have generally been used to guide developer effort to reduce or avoid future maintenance burdens. Size is the simplest and most widely deployed metric. The size metric is pervasive because size correlates with many other common metrics (e.g., McCabe complexity, readability, etc.). Given the ease of computing a method's size, and the ubiquity of these metrics in industrial settings, it is surprising that no systematic study has been performed to provide developers with meaningful method size guidelines with respect to future maintenance effort. In this paper we examine the evolution of around 785K Java methods and show that developers should strive to keep their Java methods under 24 lines in length. Additionally, we show that decomposing larger methods to smaller methods also decreases overall maintenance efforts. Taken together, these findings provide empirical guidelines to help developers design their systems in a way that can reduce future maintenance.
LaQuSo performs verification and validation of software systems and their intermediate artifacts. If an organization wants certainty about, or confidence in, a software artifact, a LaQuSo certificate can be requested. Being part of universities, LaQuSo is able to perform the independent evaluator role in many certification projects. Certification is a check that the artifact fulfills a well-defined set of requirements. These requirements are defined by the customer or a third party; LaQuSo performs the check. The certificate always refers to the requirements against which the artifact was checked. The certificate covers only product quality, not the management and development processes. The quality certificate consists of a diagnosis report plus a verdict document, and will grow from a diagnosis report with a LaQuSo certificate into a diagnosis with an official, widely recognized certificate. To achieve a recognized certificate, LaQuSo has defined a Software Product Certification Model, which is presented in this document. We have also defined five standard certificate types, which are included in the appendix.
Outsourcing application development or maintenance, especially offshore, creates a greater need for hard facts to manage by. We have developed a tool-based method for software monitoring which has been deployed over the past few years in a diverse set of outsourcing situations. In this paper we outline the method and underlying tools, and through several case reports we recount our experience with their application.
We performed an empirical study of the relation between the technical quality of software products and the defect resolution performance of their maintainers. In particular, we tested the hypothesis that ratings for source code maintainability, as employed by the SIG quality model, are correlated with ratings for defect resolution speed. This study revealed that all but one of the metrics of the SIG quality model show a significant positive correlation.
Many organizations using Free/Open Source Software (FlOSS) are dealing with the major problem of selecting the software product most appropriate to their needs. Most of these companies currently select FlOSS projects using ad-hoc techniques. However, in the last couple of years, two methodologies for assessing FlOSS projects have emerged, namely QSOS and OpenBRR. The objective of this work is, through a detailed and rigorous comparison, to give companies a better understanding of the content and limitations of these two assessment methodologies. This work compares both methodologies on several aspects, among others their overall approaches, their scoring procedures, and their evaluation criteria.
The ISO/IEC 9126 international standard for software product quality is a widely accepted reference for terminology regarding the multi-faceted concept of software product quality. Based on this standard, the Software Improvement Group has developed a pragmatic approach for measuring technical quality of software products. This quality model introduces another level below the hierarchy defined by ISO/IEC 9126, which consists of system properties such as volume, duplication, unit complexity and others. A mapping between system properties and ISO/IEC 9126 characteristics is defined in a binary fashion: a property either influences a characteristic or not. This mapping embodies consensus among three experts based, in an informal way, on their experience in software quality assessment. We have conducted a survey-based experiment to study the mapping between system properties and quality characteristics. We used the Analytic Hierarchy Process as a formally structured method to elicit the relative importance of system properties and quality characteristics from a group of 22 software quality experts. We analyzed the results of the experiment with two objectives: (i) to validate the original binary mapping and (ii) to refine the mapping using the elicited relative weights.
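The binary mapping described here can be pictured as a small table from quality characteristics to the system properties that influence them, with property ratings aggregated upward. The subset of properties, the mapping shown, and the aggregation by simple averaging are illustrative assumptions for this sketch, not the full SIG model:

```python
# Illustrative binary mapping: a characteristic is influenced by a
# property (listed) or not (absent). Subset assumed for the example.
MAPPING = {
    "analysability": ["volume", "duplication", "unit size"],
    "changeability": ["duplication", "unit complexity"],
    "testability":   ["unit complexity", "unit size"],
}

def characteristic_ratings(property_ratings: dict) -> dict:
    """Aggregate 1-5 star property ratings into characteristic ratings
    by averaging the properties each characteristic depends on."""
    return {
        char: sum(property_ratings[p] for p in props) / len(props)
        for char, props in MAPPING.items()
    }
```

The survey described in the abstract effectively replaces this uniform averaging with expert-elicited relative weights per (property, characteristic) pair.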