Effects of developer experience on learning and
applying Unit Test-Driven Development
Roberto Latorre
Abstract—Unit Test-Driven Development (UTDD) is a software development practice where unit test cases are specified iteratively and
incrementally before production code. In recent years, researchers have conducted several studies within academia and industry on
the effectiveness of this software development practice. They have investigated its utility as compared to other development techniques,
focusing mainly on code quality and productivity. This quasi-experiment analyzes the influence of the developers’ experience level on
the ability to learn and apply UTDD. The ability to apply UTDD is measured in terms of process conformance and development time.
From the research point of view, our goal is to evaluate how difficult it is for professionals without any prior experience in this technique to learn UTDD. From the industrial point of view, the goal is to evaluate the possibility of using this software development practice as an effective solution to consider in real projects. Our results suggest that skilled developers are able to quickly learn the UTDD concepts and, after practicing them for a short while, become as effective at performing small programming tasks as with more traditional test-last development techniques. Junior programmers differ only in their ability to discover the best design, and this
translates into a performance penalty since they need to revise their design choices more frequently than senior programmers.
Index Terms—Test-Driven Development, Test-First Design, Software Engineering Process, Software Quality/SQA, Software Construc-
tion, Process Conformance, Programmer Productivity
1 INTRODUCTION
TEST-DRIVEN Development (TDD) is a technique to
incrementally develop software where test cases are
specified before functional code [1], [2], [3]. Originally,
TDD referred to creating unit tests before production
code. However, recently, another technique applying a
test-driven development strategy at the acceptance test
level has been gaining attention [4], [5]. Thus, in recent years, it has become usual to distinguish between Unit Test-Driven
Development (UTDD), which targets unit tests to ensure
the system is performing correctly; and Acceptance Test-
Driven Development (ATDD), a technique focused on
the business level. Here we aim our attention at UTDD.
Despite its name, UTDD is not a testing technique, but
a programming/design practice [6], [7], [8]. In recent
years, several studies and experiments have analyzed
the influence of this practice on software in terms of
software quality and productivity within academia and
industry (e.g., see [9], [10], [11], [12]). In the literature, UTDD is sometimes presented as one of the most efficient software
development techniques to write clean and flexible code
on time [13], [14], [15], [16]. Nevertheless, these studies
report conflicting results (positive, negative and neutral
about the use of UTDD) depending on several factors.
Thus, no definite conclusions can be drawn, which limits
the industrial adoption of this development practice [17].
One of the main problems with UTDD is the difficulty
in isolating its effects from other context variables. So,
the influence of these context variables must be analyzed
R. Latorre is with Dpto. Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid, c/ Francisco Tomás y Valiente, 11, 28049 Madrid, Spain; and with NeuroLogic Soluciones Informáticas, c/ Arcos de Jalón, 28037, Madrid, Spain
E-mail: roberto.latorre@uam.es
in detail to determine the real benefits and cost of
adopting this technique. For example, Causevic et al. [17]
identified seven possible factors limiting the industrial
adoption of this development technique. One of these
factors is the programmers’ experience. Nevertheless, as
far as we know, just a few empirical studies investigate
directly or indirectly the effect of developer experience
on applying UTDD. In the context of these studies,
experience (or knowledge) usually refers to the degree of
practice or theoretical insights in UTDD, and the qualifi-
cation of the participants ranges between UTDD novices
and UTDD experts, where novices are often students.
These studies and reports, for example, analyze the
correlation between programming experience and code
quality [18], highlight some aspects contributing to the
adoption of a TDD strategy in an industrial project [19],
or compare the characteristics of experts’ and novices’
UTDD development process [20], [21]. Results here are
conflicting again.
In this paper, we are interested in the impact of
developer experience on learning (with demonstrated
ability to apply) UTDD. Therefore, all the participants
in our study are UTDD novices. By experience, here we
do not refer to the particular level of experience of a
programmer in UTDD, but to his/her general experience
in the professional development context. We argue that
the success of adopting UTDD is significantly related to the
developer’s experience level. It seems obvious that the
success of learning and properly using UTDD in an
industrial project depends on the skills and previous
experience of the team members in areas such as pro-
gramming, testing, designing and/or refactoring.
The general perception about UTDD is that it is dif-
ficult to learn and apply, requiring much mastery and
a higher level of discipline from the developers than
other traditional software development practices [13],
[14], [22], [15], [23]. In this paper, we report on a con-
trolled quasi-experiment involving 25 industrial pro-
grammers with different skills, assessing how difficult
it is to actually learn and properly apply UTDD in a
first exposure to this software development practice.
After an initial training, the research subjects have to
individually implement a set of requirements by using
UTDD. The efficiency of the learning process is mea-
sured as a trade-off between the correct application of
the UTDD concepts and the additional time for the
development due to the learning curves with UTDD.
Then, we study the evolution during the experiment of
the development process conformance to UTDD, and
the effective development time and the self-training
needed to completely implement a set of requirements.
In our analysis, we first verify that the functional code
produced by the subjects during the experiment satisfies
the requirements and has the expected behavior. Then,
we estimate the extent to which the development process
remains faithful to the principles of UTDD, and calculate
and analyze the learning curves for the different subjects
and developer experience levels. Finally, we evaluate
the effect of these learning curves on performance. Our
results point out that the knowledge of UTDD improves as participants progress through the experiment and suggest that experienced professional developers are able to quickly learn the UTDD concepts and, after practicing them for a short while, become as effective at performing small programming tasks as with traditional test-last techniques. Although at the
end of the experiment all the subjects but one are able
to properly apply UTDD, junior programmers show a
performance penalty mainly due to limitations in their
ability to design.
The remainder of the paper is organized as follows. In
Section 2 we formalize the UTDD process and explain
how we measure UTDD process conformance. Section 3
gives details of the experimental design. Results are
reported in Section 4. Section 5 addresses threats to
validity. Finally, Section 6 summarizes our findings
and draws the conclusions.
2 FORMALIZATION OF THE UTDD PROCESS
Any software development process consists of multiple
programming tasks. Evaluating the degree to which
a process conforms to UTDD is not easy because it
requires the analysis of multiple process measures. A
formal definition of the UTDD practice is then needed
to facilitate measurement of process conformance.
UTDD is based on a minute-by-minute iteration be-
tween writing unit tests and writing functional code.
Before writing functional code, the developers write unit
test cases (preferably automated unit tests) for the new
functionality they are about to implement. Just after that,
they implement the production code needed to pass
these test cases. The following iterative and incremental
cycle defines a typical UTDD implementation:
1) Write a unit test.
2) Run all tests and see the new one fails.
3) Implement just enough code to pass the new test.
4) Run the new test and all other unit test cases
written for the code. Repeat the process from step
3 until all tests are passed.
5) Refactor the code. Refactoring includes modifica-
tions in the code to reduce its complexity and/or to
improve its understandability and maintainability.
6) If refactoring was performed, re-run all tests and correct any errors.
7) Go to the next functionality and repeat the process
from step 1. New functionalities are not considered
as proper implementations unless the new unit test
cases and all the prior tests are passed.
According to Beck [1], such a development process
contains three different types of programming tasks:
Red tasks, related to writing a test that fails (steps 1 and 2).
Green tasks, to make the test work quickly (steps 3 and 4).
Refactor tasks, which eliminate all duplications from both test code and application code (steps 5 and 6).
Obviously, the order of these tasks is crucial to the
conformance of the development process to UTDD,
which has given rise to the so-called UTDD mantra:
red/green/refactor. Note that refactoring may be omitted if
unnecessary. A UTDD implementation process consists
of a sequence of UTDD cycles maintaining a contin-
uous red/green/refactor rhythm. Therefore, to determine
whether the research subjects actually learned and ap-
plied this practice during the experiment, we evaluate
the existence of red/green/refactor sequences during the
development.
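To make the red/green/refactor rhythm concrete, the following minimal JUnit sketch illustrates a single UTDD cycle; the Task/Subtask example and its names are hypothetical, not part of the study material, and JUnit 4 is assumed.

```java
// Step 1 (red): write a unit test for functionality that does not exist yet.
// Running it first should fail (or not even compile) before any production code is written.
import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.List;

import org.junit.Test;

public class TaskTest {
    @Test
    public void totalEffortIsTheSumOfSubtaskEstimates() {
        Task task = new Task("Write report");
        task.addSubtask(new Subtask("Draft", 3));
        task.addSubtask(new Subtask("Review", 2));
        assertEquals(5, task.totalEffort());
    }
}

// Steps 3-4 (green): implement just enough production code to make the test pass.
class Subtask {
    private final String name;
    private final int estimate;
    Subtask(String name, int estimate) { this.name = name; this.estimate = estimate; }
    int estimate() { return estimate; }
}

class Task {
    private final String name;
    private final List<Subtask> subtasks = new ArrayList<>();
    Task(String name) { this.name = name; }
    void addSubtask(Subtask s) { subtasks.add(s); }
    int totalEffort() {
        int total = 0;
        for (Subtask s : subtasks) total += s.estimate();
        return total;
    }
    // Steps 5-6 (refactor): once the test passes, the structure may be cleaned up
    // while re-running all tests to confirm the observable behavior is unchanged.
}
```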
Taking into account the above definitions, we estimate
the extent to which the research subjects remained faith-
ful to the principles of UTDD by using a metric proposed
by Müller and Höfer [20]:
conformance = (|UTDD changes| + |refactorings|) / |all changes|    (1)
The larger the value of the metric, the higher the
conformance to the UTDD rules.
To classify the changes made by a programmer during the development, we distinguish:
UTDD changes: changes for which the unit tests were written before the related pieces of the application code. We include in UTDD changes those cases where the subject does not validate that the test fails after a test code change (weak UTDD change). To identify UTDD changes we apply the same rules proposed by Müller and Höfer [20].
Refactorings: changes to the structure of the code that do not alter its observable behavior.
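For illustration, Eq. 1 can be computed directly from a list of already classified changes. The sketch below is our own, with an assumed ChangeType classification that mirrors the categories defined above.

```java
import java.util.Arrays;
import java.util.List;

// Categories used to classify every change recorded during a development session.
enum ChangeType { UTDD_CHANGE, REFACTORING, OTHER }

public class ConformanceMetric {

    /** Eq. 1: (|UTDD changes| + |refactorings|) / |all changes|. */
    public static double conformance(List<ChangeType> changes) {
        if (changes.isEmpty()) {
            return 0.0;  // no changes recorded, no evidence of UTDD
        }
        long utdd = changes.stream().filter(c -> c == ChangeType.UTDD_CHANGE).count();
        long refactorings = changes.stream().filter(c -> c == ChangeType.REFACTORING).count();
        return (double) (utdd + refactorings) / changes.size();
    }

    public static void main(String[] args) {
        // Example: 6 UTDD changes, 1 refactoring and 3 other changes -> conformance = 0.70
        List<ChangeType> changes = Arrays.asList(
                ChangeType.UTDD_CHANGE, ChangeType.UTDD_CHANGE, ChangeType.UTDD_CHANGE,
                ChangeType.UTDD_CHANGE, ChangeType.UTDD_CHANGE, ChangeType.UTDD_CHANGE,
                ChangeType.REFACTORING,
                ChangeType.OTHER, ChangeType.OTHER, ChangeType.OTHER);
        System.out.printf("conformance = %.2f%n", conformance(changes));
    }
}
```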
3 STUDY DEFINITION
The research question this study aims to address is: How
does the developer’s experience affect the effort required to
learn and properly use UTDD?
This section describes our study design, planning and
execution following well-known guidelines on experi-
mental software engineering [24], [25]. We define the
study to be able to compare from different perspectives
the learning curves with UTDD of professional develop-
ers with different levels of experience, but all of them
inexperienced in test-first development practices. The
collected data include, among others, all the changes
made to the functional and test code during the whole
learning process and the time-stamp and the result of
each unit test invocation. These data allow us to ana-
lyze the effort required for adapting the programmers’
mindset to UTDD.
3.1 Study subjects
All the participants in the study were industrial de-
velopers with computing degrees and several years of
experience using traditional methodologies based on
test-last development strategies. However, since our
main goal was to compare their learning curves with
UTDD, not having a previous experience with test-
driven development strategies was mandatory. The sub-
jects belonged to three different companies and volun-
tarily participated in the study in response to a personal invitation.
In an experiment with students, Kollanus and Isomöttönen [26] classify the difficulties the subjects find with the adoption of UTDD into three main categories: UTDD approach difficulties, test design difficulties, and technical difficulties. This and other studies point
out that the adoption of UTDD requires skilled program-
mers. In particular, a prior experience in unit testing and
automated testing tools contributes to the adoption [14],
[27], [28]; while a lack of domain and/or language exper-
tise hinders its use [9], [22], [11]. In our case, we were in-
terested in evaluating the subjects’ learning curves with
UTDD, i.e., following Kollanus and Isomöttönen's classification, only in UTDD approach difficulties. Dur-
ing the experiment, the research subjects did not need
to spend any time solving technical issues related either to the development language (Java) or to the automated unit testing framework (JUnit¹). Therefore, all
of the subjects had testing skills and a wide previous
knowledge of Java and JUnit (see threats to validity in
Section 5).
Initially, we contacted 30 programmers to perform the
experiment. We divided them into three groups of ten
programmers each according to their general software
development experience within a professional context.
In order to increase the variability between groups and
avoid overlapping, we considered a one-year gap be-
tween neighboring groups:
Juniors. Developers with one to two years of experience in the professional development context.
Intermediates. Developers with three to five years of experience.
Seniors. Developers with more than six years of experience.
1. http://www.junit.org
For the sake of ensuring that the subjects’ skills were
the ones required and that no significant differences
existed between the groups of subjects, we performed
pre-study surveys evaluating their programming, de-
signing and testing expertise; and their knowledge and
experience in the particular development/testing envi-
ronment and programming language. After that, we
excluded five participants (one junior, two intermediates
and two seniors) for the following reasons: the junior programmer did not have enough knowledge of JUnit, while the intermediates and the seniors were not currently working in a technical job. Therefore, the study
involved 25 participants: 9 juniors (J1-J9), 8 intermediates
(I1-I8) and 8 seniors (S1-S8).
3.2 Study material
The experiment consisted of individually designing and
programming a set of back-end requirements for a Java
application, together with their corresponding JUnit au-
tomated unit tests. Note that UTDD is not easily appli-
cable to front-end development [6]. The application we
selected for the study was a simple project planning tool
used during the course of “Project of Software Analysis
and Design” taught by the author in the Computer Science degree at Universidad Autónoma de Madrid.
This allowed us to use the knowledge acquired during the course with the students to select the requirements used as study material and classify them into three categories
according to their complexity level: easy requirements,
moderate requirements and difficult requirements. In the
Appendix we include three examples of requirements.
During the experiment, requirements were provided
to the subjects in Concordion² by using a textual descrip-
tion and a set of acceptance tests as described in [19].
The requirements were tested during the experiment by
running the corresponding acceptance tests included in
the Concordion specification (see below).
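As an illustration of how such requirements can be specified, a Concordion requirement pairs an HTML description containing acceptance assertions with a plain Java fixture. The sketch below is hypothetical (the project-planning names are ours, not the actual study material) and assumes Concordion's JUnit 4 integration.

```java
// Hypothetical Concordion fixture for a project-planning requirement.
// The matching HTML specification (ProjectDuration.html) would contain markup such as:
//   <span concordion:set="#tasks">3</span> tasks of
//   <span concordion:set="#days">2</span> days each take
//   <span concordion:assertEquals="totalDuration(#tasks, #days)">6</span> days.
import org.concordion.integration.junit4.ConcordionRunner;
import org.junit.runner.RunWith;

@RunWith(ConcordionRunner.class)
public class ProjectDurationFixture {
    // Called from the HTML specification; delegates to the production code under development.
    public int totalDuration(int tasks, int daysPerTask) {
        return new ProjectPlanner().totalDuration(tasks, daysPerTask);
    }
}

// Minimal production class exercised by the fixture (it would normally be developed with UTDD).
class ProjectPlanner {
    int totalDuration(int tasks, int daysPerTask) {
        return tasks * daysPerTask;
    }
}
```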
3.3 Previous training
As none of the research subjects had a previous exposure
to test-first development practices, before starting the
development they received a brief course about UTDD.
In all cases, the instructor was the same person to avoid
differences in the training. The training consisted of two
short sessions of three hours each. The time required
for these sessions was not included in our analysis. The
first session was dedicated to introducing the main UTDD topics. The second was a practical session dedicated to presenting examples of UTDD incremental cycles and
typical UTDD patterns. There are several patterns that
can be used during a UTDD process (see [1], Section III).
Some of them are tricks, some are design patterns and some are refactorings. During training we focused on implementation strategies, also called green bar patterns (see [29], Section 3.3.1.1), and refactorings.
2. http://www.concordion.org
Empirical studies suggest an initial resistance to use
UTDD due to inexperience [30], [9], [31]. They high-
light that when no support for UTDD is available, in-
experienced UTDD developers tend to return to more
traditional techniques. In our case, we were interested
in quantifying this initial resistance within the learning
curves with UTDD. Therefore, during the training, we
insisted on the importance of properly applying UTDD throughout the experiment, but the participants did not receive any additional training afterwards. They could search for information about UTDD on their own, both in written and online literature, but no further support was provided. This self-training time was tracked and analyzed during the subsequent analysis.
3.4 Study design
Although there was not actually an explicit time limit
to finish the implementation, we adopted a very simple
study design (see Fig. 1) for the participants to be able to
finish the experiment in around one month during their
compressed working day of summer 2012. After the ini-
tial training, each subject had to individually implement
the same 24 requirements. For privacy purposes, results of the experiment were anonymized.
Three of the seniors and two juniors worked on the
development during their official workday in their own
office. The rest of the participants worked at home
during their free time. All of them were free to work
whenever they wanted. The only requirement was to record
their working sessions (see Section 3.5 for details).
In order to minimize the effect of task difficulty,
requirements were organized in eight groups of three
requirements (G1-G8). Each group comprised an easy
requirement (R1), a moderate one (R2) and a difficult
one (R3), that the subjects had to implement in that order
(R1-R2-R3). After the UTDD training, the same sequence
of groups of three requirements was presented to every
subject. They had to complete the development starting
by G1 and ending by G8 (Fig. 1). The participants
had to work on each group of requirements until all
written unit test cases were properly passed and they
thought they had finished the job. At this moment, the
Concordion acceptance tests were executed to validate the
new implementations. A new set of requirements was
not started until the acceptance tests associated with the three requirements of the prior set, together with all previous acceptance tests, had passed. Hereafter, we call the process of completely developing a set of three requirements an iteration. Each iteration includes several
UTDD cycles.
To compare UTDD and non-UTDD metrics and es-
tablish baselines for comparisons between the different
research subjects, we instructed them to develop the
groups of requirements G1 and G2 (a total of six require-
ments) with a test-last strategy without any constraint.
That is, each developer could use the method he/she
preferred. This provided an estimation of the design,
development and testing abilities of each subject. The
other six groups of requirements (G3-G8) had to be
developed by using UTDD. Note that although UTDD
was not used with the groups of requirements G1 and
G2, developers had to implement the JUnit tests for these
requirements too. The only difference was when these
tests were written (“after” vs. “before” the production
code under test).
3.5 Study variables and data collection
The dependent variables of our study were:
The task correctness and completeness, i.e., the appli-
cation code generated by the subjects during the
experiment should work and do what it was sup-
posed to do.
The process conformance to UTDD, i.e., the extent to
which the subjects followed the UTDD rules during
the experiment (see Section 2 for details).
And the programming performance, i.e., the time re-
quired to correctly develop a given set of three
requirements. During the experiment, the resting time between consecutive programming sessions varied from two hours to five days, and, consequently, there were significant differences in the total amount of time required by the different subjects to develop the 24 requirements (from two weeks to a little more than a month). In this regard, it is important to
take into account that we were not interested in
this overall development time, but in the effective
development time spent in each set of requirements.
In this time we included both active (described
as typing and producing the code) and passive
development time (spent on reading the source
code, self-training, looking for a bug, etc) [32]. This
effective development time was not correlated with
the time needed to finish the experiment.
To collect the data needed to measure these variables,
we combined the use of questionnaires the subjects had
to fill in during the experiment (Section 3.5.1), and
automatic and manual tracking tools (Section 3.5.2). Both
during the development and the analysis phase, we
reviewed and validated coherence between qualitative
and quantitative data.
3.5.1 Questionnaires
After the implementation of each set of three require-
ments (see Section 3.4 for details), the subjects had to
fill in a form with six questions about
the requirements just implemented. From here on, we
refer to these forms as intermediate questionnaires. They
consisted of the following two questions for each re-
quirement (R1, R2 and R3):
1) How difficult has it been for you to understand the requirement? (1-5)
2) How many changes/corrections in the structure of the existing code have you needed to successfully implement the requirement?
Fig. 1. Schematic representation of the study design. After a UTDD training, the research subjects implemented eight groups of requirements (G1-G8), with three requirements per group (R1-R2-R3). The groups of requirements were completed always in the same order. G1 and G2 were developed with a test-last development practice (control strategy); and G3-G8 by using UTDD.
The first question was expressed on a five-point Lik-
ert scale from 1 (“Extremely easy”) to 5 (“Extremely
difficult”). It aimed at better explaining the experiment
results, addressing the clarity of the requirements and
the ability of subjects to understand them. The second
question aimed at gaining insights about the subjects’
behavior during the experiment. The answers to this
question helped us to classify the changes in the func-
tional code during the analysis phase.
At the end of the experiment, the subjects had to
answer two more questions that tried to measure the
acceptance of UTDD:
Q1. How difficult has it been for you to learn
UTDD? (1-5)
where 1 = Extremely easy; 3 = Neither easy nor difficult; and 5 = Extremely difficult.
Q2. How useful do you feel UTDD is? (1-5)
where 1 = Not at all useful; 2 = Slightly useful; 3 = Moderately useful; 4 = Very useful; and 5 = Extremely useful.
3.5.2 Tracking tools
The tracking tools were a set of Eclipse plugins that
collected data about the software development process
for each subject during the programming sessions. This
data collection mechanism requires a running Eclipse programming environment³ in which to install the plugins, but it has no other prerequisites. Using
a similar strategy to the one used in earlier empirical
studies about UTDD [20], [33], some of the tracking tools
ran in the background of the Eclipse IDE and collected
data transparently for the participants. The main goal of
these data was to detect deviations from UTDD.
To integrate new code during the experiment, we
used a Subversion repository (SVN)⁴. We assumed that
participants were not going to commit changes in this
repository as frequently as we needed for our analysis.
3. http://www.eclipse.org
4. http://subversion.apache.org
Therefore, we used a first Eclipse plugin to transparently commit to a different SVN repository a new revision of a
file (both functional and test code) each time the subject
saved it locally. Additionally, a second plugin automat-
ically saved in a database information about the JUnit
test invocations. This information included the start and
end time-stamp of all the invocations along with their
result. All these data allowed us (i) to determine whether
the new unit test cases were written after or before the
application code and (ii) to calculate deltas of all the files
in the project.
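The rule behind point (i) can be sketched as a simple ordering check over the collected data: a production-code revision counts as a UTDD change only if some test-code revision was saved before it, and only as a weak UTDD change when no failing (red) test run was observed in between. The following is a simplified illustration with hypothetical data types of our own, not the exact rule set of [20].

```java
import java.time.Instant;
import java.util.List;

public class UtddChangeDetector {

    // Simplified records for the automatically collected data: a file revision saved by the
    // commit plugin, and a JUnit invocation logged by the second plugin (hypothetical types).
    record Revision(String file, boolean isTestCode, Instant savedAt) {}
    record TestRun(Instant startedAt, boolean failed) {}

    enum Kind { UTDD, WEAK_UTDD, OTHER }

    /**
     * Classifies a production-code revision: it is a UTDD change when some test-code revision
     * was saved before it; if no failing (red) test run was observed beforehand, it is only a
     * weak UTDD change, which the study still counts as a UTDD change.
     */
    static Kind classify(Revision productionChange, List<Revision> history, List<TestRun> runs) {
        boolean testWrittenBefore = history.stream()
                .anyMatch(r -> r.isTestCode() && r.savedAt().isBefore(productionChange.savedAt()));
        if (!testWrittenBefore) {
            return Kind.OTHER;   // code written without a preceding test change
        }
        boolean redBarObserved = runs.stream()
                .anyMatch(t -> t.failed() && t.startedAt().isBefore(productionChange.savedAt()));
        return redBarObserved ? Kind.UTDD : Kind.WEAK_UTDD;
    }
}
```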
In addition to the transparently collected data, the sub-
jects had to manually track their development activities
and the required time to develop each requirement. This
time included self-training (in the case of using UTDD),
requirement comprehension, designing, unit tests and
functional code implementation, refactoring and testing.
To keep track of the time the subjects spent on each of these tasks, we used BaseCamp⁵.
Automatically collected data and BaseCamp data were checked daily for consistency during the development.
4 ANALYZING THE STUDY RESULTS
4.1 Task correctness and completeness
We first evaluate how correct and complete the programs produced by the research subjects are relative to the requirements. If the application code generated by a subject does not do what it is supposed to do, it makes no sense to further analyze that subject's learning curve.
For this purpose, we use the Concordion acceptance
tests. As we indicate above, during the development,
after the implementation of each set of three require-
ments, the corresponding acceptance tests and all the
previous ones were run to check that the new code had
the expected behavior. If some of these tests failed, the
subject had to solve the problem in a new UTDD cycle. In
this manner, we ensure that at the end of the experiment
all the research subjects’ application code passes all the
acceptance tests.
4.2 Learning process
Table 1 shows the distribution of the subjects’ answers to
the first question in the final questionnaire (“Q1. How difficult has it been for you to learn UTDD?”).
5. http://www.basecamphq.com
TABLE 1
Distribution of the subjects’ responses to questions belonging to the final questionnaire (Q1 and Q2). See Section 3.5.1 for details. The questions are expressed on a Likert scale as follows. Q1: 1 = Extremely easy; 2 = Easy; 3 = Neither easy nor difficult; 4 = Difficult; and 5 = Extremely difficult. Q2: 1 = Not at all useful; 2 = Slightly useful; 3 = Moderately useful; 4 = Very useful; and 5 = Extremely useful.
Response:          1  2  3  4  5
Q1  Juniors        0  8  0  1  0
    Intermediates  0  8  0  0  0
    Seniors        2  6  0  0  0
Q2  Juniors        1  0  3  4  1
    Intermediates  0  0  3  5  0
    Seniors        1  1  1  3  2
Fig. 2. Comparison of the average percentage of self-training time during the development of the sets of requirements G3-G8 for juniors, intermediates and seniors.
These
answers allow us to analyze the subjects’ personal per-
ception of their own learning processes. Qualitative evi-
dence points out that, independently of their experience
level, the subjects found UTDD easy to learn, since 24
out of 25 subjects answered that the learning process was
easy (22 subjects, 88%, gave response 2) or extremely
easy (2 subjects, 8%, gave response 1). Only one subject
(J9) considered it difficult (response 4). This result is a
promising starting point, but it does not
reflect whether the subjects actually learned and applied
UTDD during the experiment.
During the development, we tried to ensure that
UTDD was properly applied by continuously examining
deltas and code samples to confirm that JUnit tests were
written in step with the functional code. These observa-
tions supported that, in general, a test-first development
approach was being employed, but they allowed neither the process conformance to UTDD to be measured nor the learning curves to be quantified.
To start addressing a more quantitative analysis of
the learning process, Fig. 2 compares the average self-
training time during the development of the sets of re-
quirements G3-G8 for juniors, intermediates and seniors.
Note that the UTDD development started with G3 and
finished with G8. Sets G1 and G2 are not included in
the figure because they were not developed by using
UTDD and, therefore, there was no self-training during
their implementation. In terms of effort, during the first set of requirements developed with UTDD, juniors devoted 11% of their effort to self-training, intermediates
17% and seniors 9%. Independently of the program-
mer experience level, this effort dropped quickly in the
subsequent iterations, reaching 0 in just three or four iterations. Note that intermediate programmers spent
more time for self-training than junior programmers.
These data are in line with the results presented in
Table 1 and suggest that the knowledge of UTDD quickly improved as the development progressed, although they do not yet reflect whether the technique was actually used during the experiment.
To quantitatively assure that the research subjects
remained faithful to UTDD during the development,
we compare the process conformance (Eq. 1) in our
experiment with the values reported by Müller and Höfer for UTDD experts and UTDD novices [20]. The
evolution of this metric along the development of the dif-
ferent groups of three requirements (G3-G8) also allows us to quantify the learning curves with UTDD of a given
subject or group of subjects (see below). As a first step
for this analysis, we had to classify all the changes made
by the subjects in the functional code into three types
according to their nature: UTDD changes, refactorings and
other changes. This classification was carried out semi-
automatically. On one hand, UTDD changes could be
easily identified automatically by using collected data
and deltas from the SVN repository. On the other hand,
the rest of changes might belong to a refactoring and had
to be inspected and classified manually. Manual classifi-
cation of changes was a hard task because, unlike [20],
we performed lots of manual inspections (cf. 34 vs. 332
changes to be classified manually for a subject in the
worst case).
Left panels in Fig. 3 show the classification of all
the changes made by each individual research subject
during the implementation of the distinct sets of re-
quirements developed by using UTDD. Initially, there
was great variability in the ratios of changes for the
three groups of programmers. For example, during the
development of G3, there were subjects with a ratio of
UTDD changes greater than 80% (or even equal to 100%)
in the three groups. But, also in the three groups, there
were subjects with a ratio of other changes greater than
50%. As the development progressed, this variability
decreased mainly due to the increasing number of UTDD
changes. Nevertheless, note that those subjects with the
larger ratios of UTDD changes during the initial require-
ments tended to increase the ratio of other changes as
the experiment progressed (e.g., see J6 or S1).
Fig. 3. Left panels: Classification of the changes made individually by the subjects during the development of each set of requirements. Right panels: Classification of changes grouped by developer's experience level (junior, intermediate and senior). Sets G1 and G2 are not included because they were not developed by using UTDD.
In the case of intermediate and senior programmers, each ratio
of changes reached a steady similar value for all the
subjects belonging to the same group (cf. G7 and G8).
In the case of juniors, this did not happen, although a
few more iterations would likely lead to an equivalent
result. This general behavior is also observed when we
globally analyze the juniors’, intermediates’ and seniors’
development process. Right panels in Fig. 3 show the
classification of changes by group of programmers and
set of requirements. These panels lead to the following
observations:
All the ratios of changes tended to stabilize at a nearly constant value as the study progressed.
In general, the ratio of UTDD changes grew; while
the ratio of other changes dropped. This result con-
firms that self-training during the development of
the initial sets of requirements (Fig. 2) translated into
a better knowledge in UTDD.
On average, the number of UTDD changes was larger
(and similar) for intermediate and senior program-
mers than for junior developers.
The ratio of refactorings for seniors was nearly con-
stant during the whole experiment; but, at the end,
juniors made more refactorings.
An interesting result observed in our experiment is
that UTDD conformance (Eq. 1) is 0 for all the subjects
during the development of G1 and G2 (the control
groups of requirements developed with a non-TDD strat-
egy). This implies that no refactorings were made during
the development of these requirements. In the same way,
during the early stages of the UTDD implementation,
refactorings were not very usual (Fig. 3), and even some
subjects did almost no refactoring of their code during
the whole experiment (e.g. I6 and I7). These results sug-
gest that refactorings may not be a commonly performed
type of change.
Table 2 shows the conformance to the UTDD rules
for each subject and set of requirements developed with
UTDD (columns G3-G8). To investigate whether the
subjects actually learned the development technique, we
focus on the last group of requirements (column G8). In
the experiments by M¨
uller and H¨
ofer, conformance to
UTDD process for a group of UTDD experts varies from
64% to 96%; while for a group of UTDD novices it
varies from 0% to 94%. Note that both ranges overlap,
so, when an individual developer has a conformance
greater than 64%, it is difficult to identify whether
he/she behaves as an UTDD expert or as an UTDD
novice. In our case, during the development of the last
set of requirements, all the research subjects but one
(J9) achieved a process conformance in the range 78%-
95%. These values are high enough to consider that
the corresponding subjects remained faithful to UTDD.
Therefore, 24 out of 25 subjects (96%) were able to
quickly learn the UTDD concepts during the study and
effectively use them when this concluded. These results
corroborate the qualitative evidence pointing out that,
independently of the experience level, UTDD was easy
TABLE 2
G3-G8: UTDD process conformance for each subject
and set of requirements developed by using UTDD.
C-Cluster: Subjects classification according to the
learning curves with UTDD.
Subject G3 G4 G5 G6 G7 G8 C-Cluster
J1 58.2 60.4 65.0 75.8 80.8 80.3 slow
J2 44.2 58.4 71.4 80.5 65.9 81.3 slow
J3 56.6 59.4 78.7 63.4 84.4 81.9 slow
J4 59.1 60.8 64.0 74.7 77.9 78.7 slow
J5 92.3 84.8 86.6 87.1 85.1 82.9 none
J6 100 93.1 90.2 90.9 95.4 93.7 none
J7 46.7 48.0 70.1 46.0 84.7 80.0 slow
J8 37.7 39.2 82.5 79.7 53.4 82.3 slow
J9 27.6 32.9 41.7 37.6 44.7 46.1 outlier
I1 100 98.6 94.5 76.0 74.1 83.1 none
I2 74.6 85.1 88.0 93.3 83.1 86.3 fast
I3 62.7 63.8 78.3 77.1 84.7 82.8 slow
I4 30.6 37.3 62.3 71.0 78.0 84.5 slow
I5 79.0 78.1 81.2 86.0 88.3 85.2 fast
I6 61.4 62.0 70.8 78.1 82.0 84.1 slow
I7 52.9 52.5 71.7 74.0 79.4 78.7 slow
I8 45.3 58.7 68.2 76.4 86.8 85.1 slow
S1 100 100 88.7 91.2 87.2 89.1 none
S2 66.6 77.0 86.4 85.7 83.5 85.1 fast
S3 43.5 54.2 69.2 78.6 94.5 82.2 slow
S4 83.3 89.4 85.8 87.2 87.3 86.8 fast
S5 80.3 80.6 87.8 93.1 86.0 86.6 fast
S6 74.3 83.3 89.1 95.6 94.9 94.5 fast
S7 94.1 94.0 92.8 87.1 88.3 86.2 none
S8 68.4 85.9 93.9 86.9 95.7 83.7 fast
to learn. The question now is “how easy” was the
learning process. To answer this question, we analyze
and compare the evolution during the experiment of
the UTDD process conformance between different sub-
jects and groups of programmers. We can consider that
this evolution characterizes the learning curves, since it
shows how the knowledge in UTDD evolved during
the study. For example, the subject I4 started out the
development with 31% process conformance for the
set of requirements G3, and ended with 85% for G8.
The increase in conformance indicates that this subject adapted his mindset to UTDD during the experiment.
Figure 4 shows the juniors’, intermediates’ and se-
niors’ learning curves with UTDD. In the case of
groups of programmers, these learning curves also allow us to characterize the behavior of the group during the experiment by comparing our results with the average UTDD conformance reported by Müller and Höfer [20] for a group of UTDD experts (82%) and for a group of student UTDD novices (67%). As expected after
analyzing the ratios of changes, in general, UTDD con-
formance grew as the development advanced, i.e. knowl-
edge of UTDD improved. Thus, at the end of the
experiment, the three groups of programmers achieved
a high conformance to UTDD: 79% for juniors, 84% for
intermediates and 87% for seniors.
Fig. 4. Learning curves with UTDD of juniors, intermediates and seniors. Note that G1 and G2 are the control sets of requirements.
If we compare these values with the ones reported by Müller and Höfer,
we can consider that the three groups of programmers
reached the goal of learning and applying UTDD in
a similar manner to an expert group. The difference
between the three groups was in how they achieved this final development process conformance to UTDD (learning process). If we consider G3 as the starting point of the learning process, junior and intermediate programmers started out with a UTDD conformance similar to that reported for a UTDD novice group (cf. 58% for juniors and 63% for intermediates, and 67% reported by Müller and Höfer for a group of student UTDD novices). Furthermore, in these cases, there was a high
variability between the different subjects, a property
also observed in an UTDD novice group [20]. Therefore,
junior and intermediate programmers needed a few iter-
ations to practice and consolidate the knowledge before
effectively applying the TDD concepts. They started the
development with a low TDD conformance and became comparably effective as the experiment progressed. In contrast,
senior programmers showed a high and nearly constant
average UTDD conformance from the beginning (cf.
77% for G3 and 87% for G8). Even for the first set of
requirements, the seniors’ UTDD conformance was near
the value reported by Müller and Höfer for a UTDD expert group (cf. 77% and 82%). Thus, although they did
not have a prior UTDD experience, senior programmers
remained faithful to the principles of UTDD even during
the development of the initial requirements.
The variability in the ratios of changes observed dur-
ing the development of the initial sets of requirements,
mainly in the case of juniors and intermediates (left
panels in Fig. 3), translates into a great variability in
the UTDD process conformance for the distinct subjects
of a same group (cf. columns G3 and G4 in Table 2).
This produces important differences in the individual
learning curves regarding the starting conformance and
the number of iterations needed to conform to UTDD.
It seems that these differences are independent of the
programmer’s experience level, i.e. there was no
clear correlation between how the subjects learned and
applied the UTDD rules and our a priori classification of
the 25 participants. To study the real effect of developer
experience on learning UTDD, we have used the k-means
clustering algorithm [34], [35] to find an a posteriori clas-
sification of subjects according to their learning curves.
The optimal number of clusters⁶ for a dataset with the 25
learning curves is four. Column “C-Cluster” in Table 2
indicates the result of the k-means classification. Figure 5
shows the characteristic learning curves of each of these
clusters:
Subjects in the cluster labeled as “none” showed a
high UTDD conformance from the beginning of the
development. These subjects were able to properly
use UTDD just after the initial training. They are se-
niors, intermediates and juniors. Interestingly, some
of them achieved 100% UTDD conformance during
the implementation of the first set of three require-
ments (e.g. J6, I1 and S1). Therefore, unlike previous
studies suggesting that this technique requires sev-
eral months of intense use to be learned [22], in our
case, there were subjects (5 out of 25, 20%) that just
needed a theoretical knowledge and a brief practice
(initial training) to effectively apply UTDD (effi-
ciency is analyzed in the next section). Note that, in
these cases, UTDD conformance slightly decreased
during the development, although we can consider
that the development process always conformed to
UTDD. This behavior is analyzed below.
Subjects in the cluster labeled as “fast” were able to
apply UTDD with a good enough process confor-
mance in just one or two iterations.
Those subjects belonging to the cluster labeled as
“slow” started the development with a low or
medium UTDD conformance (in the range 30%-
63%) and required four or five iterations to conform
to UTDD.
And, finally, the only subject in the cluster labeled
as “outlier” is the subject with the worst results
and the only one that said that UTDD is difficult
to learn (first question of the final questionnaire). In
this sense, we can consider him as an outlier.
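For reference, the following is a minimal sketch of the k-means procedure used for this a posteriori classification, assuming each learning curve is encoded as a six-element vector of conformance values (G3-G8); the initialization and fixed iteration count are our own simplifications, and the optimal k would be selected by minimizing the sum-of-squares error over several runs, as noted in the footnote below.

```java
import java.util.Random;

// Minimal k-means over learning curves: each curve is a 6-element vector of
// conformance values for G3-G8 (values as in Table 2).
public class LearningCurveClustering {

    static int[] kMeans(double[][] curves, int k, int iterations, long seed) {
        Random rnd = new Random(seed);
        int n = curves.length, dim = curves[0].length;
        // Initialize centroids with k randomly chosen curves.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = curves[rnd.nextInt(n)].clone();

        int[] assignment = new int[n];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: attach each curve to its nearest centroid.
            for (int i = 0; i < n; i++) assignment[i] = nearest(curves[i], centroids);
            // Update step: move each centroid to the mean of its assigned curves.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += curves[i][d];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
        }
        return assignment;
    }

    // Squared Euclidean distance is enough to find the closest centroid.
    static int nearest(double[] curve, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < curve.length; d++) {
                double diff = curve[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```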
Note that the final UTDD conformance is similar
and greater than 80% for all the clusters except for
the “outlier” cluster (Fig. 5). Then, again, quantitative
evidence points out that 24 out of 25 subjects easily
learned UTDD during the experiment. The only differ-
ence between them is that those subjects belonging to
the clusters labeled as “none” or “fast”, mostly seniors,
followed the UTDD rules almost from the beginning of
the experiment (12 out of 25 subjects). The rest of subjects
spent a few iterations practicing UTDD until they were able to use it properly (12 out of 25 subjects).
6. The optimal number of clusters is calculated by minimizing the sum-of-squares error over multiple executions of the algorithm with different coherent values of the parameter k (number of clusters).
Fig. 5. Different characteristic learning curves with UTDD found in our experiment. They are obtained by classifying the distinct subjects' learning curves with k-means.
The a poste-
riori classification, together with our personal perception
about the subjects, suggests that the learning curves with
UTDD depended on a trade-off between experience and
discipline (i.e., how businesslike the programmer is).
Discipline is a tacit knowledge that may be acquired
through experience, but may also be innate [36]. Those subjects
with a high level of discipline, not only the experienced,
were the subjects able to follow the UTDD rules from
the initial sets of requirements.
Finally, another interesting result is observed during
the development of the last sets of requirements, where
none of the participants achieved a 100% UTDD con-
formance, unlike what happened in the initial ones.
This produces the decrease in conformance in the learning
curve of the “none” cluster (Fig. 5). A similar decrease
effect was usually observed when a subject achieved
a close to 100% UTDD conformance (e.g. see S5), as
illustrated in the learning curve of seniors (Fig. 4) and of
the “fast” cluster (Fig. 5). After the analysis, we asked the
subjects about a possible explanation for this behavior.
In the initial stages, they followed the UTDD rules at all times because of our insistence on the importance
of properly using UTDD. However, as the experiment
advanced and the knowledge improved (and though
they were instructed to not do so), they felt that some
developments could be more efficiently implemented
with a test-last approach. Note that, except in the case
of the subject J9, the ratio of other changes in the last
iterations ranged between 5% and 45%. These re-
sults are consistent with Müller and Höfer’s results [20]
and suggest a trade-off between test-first practices and
other development techniques during a typical UTDD
development process, i.e., not all the changes are made
following the UTDD rules.
4.3 Programming efficiency
Finally, we evaluate the subjects’ efficiency during the
learning process in order to investigate whether the learning curves generated additional development time.
Performance is analyzed based on the effective time
taken to develop a group of three requirements. The
possible performance penalties may be a consequence
of self-training and/or of the developer’s learning curve
with UTDD. To quantitatively assess the real effort re-
quired by the subjects to properly use this development
technique, we compare their performance with UTDD
and a non-UTDD development strategy. This com-
parison is uneven in terms of absolute development
times, since, obviously, the time required by each sub-
ject to implement a set of requirements depended on
his/her design, development and/or testing ability.
Therefore, we need to define measurements that allow
the development efficiency to be compared taking into
account this issue.
Here, we do not assume any relationship in terms
of efficiency or productivity between UTDD and other
software development practices. We use the sets of re-
quirements G1 and G2, implemented using a non-UTDD
control strategy, to calculate baselines for normalized
comparisons between subjects. The ability of each subject
to implement a set of three requirements (an easy one, a
moderate one and a difficult one) may be estimated as
the average development time for G1 and G2. Formally,
for a subject s, this estimation is given by the following
equation:
bl_s = ( Σ_{i=1}^{2} ( D^s_{G_i} + J^s_{G_i} + JU^s_{G_i} + T^s_{G_i} ) ) / 2    (2)
where G_i identifies the sets of requirements G1 and G2; and D^s_{G_i}, J^s_{G_i}, JU^s_{G_i} and T^s_{G_i} denote, respectively, the time registered for designing, programming/refactoring the Java functional code, programming the JUnit tests, and testing the three requirements included in the corresponding set of requirements. We consider bl_s as a
baseline quantifying the development ability of subject
s. Thus, for our performance analysis, we normalize the
development times of the sets of requirements imple-
mented with UTDD (G3-G8) using the corresponding
value of bl_s (see threats to validity in Section 5). In this
vein, for example, when we say that the normalized
development time of a given subject is equal to 1.25, it
means that the subject’s efficiency during the implemen-
tation of the corresponding set of requirements is 25%
worse than his/her efficiency when using the control
strategy. Table 3 shows the normalized development
time for each subject and set of requirements (G3-G8)
with and without self-training.
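As a small numerical sketch of Eq. 2 and the normalization, the helper below computes bl_s from the G1 and G2 times and then the normalized time of a UTDD iteration; the data structure and the figures in main are hypothetical, not taken from the study data.

```java
public class PerformanceNormalization {

    // Time (in minutes) registered for one set of three requirements:
    // designing, Java functional code, JUnit tests and testing.
    record IterationTimes(double design, double javaCode, double junitTests, double testing) {
        double total() { return design + javaCode + junitTests + testing; }
    }

    /** Eq. 2: baseline bl_s = average total time over the control sets G1 and G2. */
    static double baseline(IterationTimes g1, IterationTimes g2) {
        return (g1.total() + g2.total()) / 2.0;
    }

    /** Normalized development time of a UTDD iteration (G3-G8) for subject s. */
    static double normalized(IterationTimes utddIteration, double baseline) {
        return utddIteration.total() / baseline;
    }

    public static void main(String[] args) {
        // Hypothetical figures: control iterations take 200 minutes on average,
        // and the first UTDD iteration takes 250 minutes -> normalized time 1.25,
        // i.e., 25% slower than with the control (test-last) strategy.
        IterationTimes g1 = new IterationTimes(30, 90, 50, 30);   // total 200
        IterationTimes g2 = new IterationTimes(25, 95, 50, 30);   // total 200
        IterationTimes g3 = new IterationTimes(40, 110, 60, 40);  // total 250
        double bl = baseline(g1, g2);
        System.out.printf("bl_s = %.1f, normalized G3 time = %.2f%n", bl, normalized(g3, bl));
    }
}
```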
Note that the requirement comprehension time is not
included in Eq. 2. Table 4 shows the distribution of the
subjects’ answers to the questions in the intermediate
questionnaires regarding the difficulty of understanding
the requirements. Only in five cases did the response differ from 2 (“Easy”) or 1 (“Extremely easy”), and in all these cases the subject gave response 3 (“Neither easy nor difficult”).
TABLE 3
Normalized development time to implement the sets of requirements G3-G8. Left value in each column does not
include self-training time. Right value includes self-training.
Subject G3 G4 G5 G6 G7 G8 P-Cluster
J1 1.38 - 1.50 1.34 - 1.34 1.12 - 1.20 1.22 - 1.22 1.16 - 1.16 1.18 - 1.18 no
J2 1.23 - 1.29 1.28 - 1.28 1.04 - 1.04 1.09 - 1.09 1.02 - 1.02 1.09 - 1.09 yes
J3 1.18 - 1.23 1.20 - 1.29 1.30 - 1.30 1.25 - 1.25 1.14 - 1.14 1.05 - 1.05 yes
J4 1.35 - 1.46 1.37 - 1.41 1.31 - 1.31 1.28 - 1.28 1.31 - 1.31 1.22 - 1.22 no
J5 1.27 - 1.46 1.29 - 1.41 1.22 - 1.22 1.34 - 1.34 1.19 - 1.19 1.04 - 1.04 no
J6 1.24 - 1.41 1.17 - 1.25 1.03 - 1.03 1.00 - 1.07 1.04 - 1.04 1.18 - 1.18 no
J7 1.28 - 1.36 1.12 - 1.12 1.04 - 1.04 1.21 - 1.21 1.27 - 1.27 1.08 - 1.08 no
J8 1.19 - 1.28 1.20 - 1.20 1.14 - 1.14 1.16 - 1.21 1.12 - 1.12 1.21 - 1.21 no
J9 1.32 - 1.45 1.24 - 1.30 1.22 - 1.22 1.20 - 1.20 1.21 - 1.21 1.27 - 1.27 no
I1 1.26 - 1.47 1.22 - 1.22 1.14 - 1.22 0.95 - 0.95 1.06 - 1.06 1.02 - 1.02 yes
I2 1.25 - 1.37 1.11 - 1.11 0.97 - 0.97 1.03 - 1.03 0.91 - 0.91 0.98 - 0.98 yes
I3 1.29 - 1.37 1.19 - 1.26 1.04 - 1.04 0.98 - 0.98 0.96 - 0.96 1.11 - 1.11 yes
I4 1.17 - 1.39 1.23 - 1.25 1.08 - 1.08 1.04 - 1.04 1.09 - 1.09 1.01 - 1.01 yes
I5 1.30 - 1.47 1.33 - 1.33 1.22 - 1.27 1.24 - 1.24 1.13 - 1.13 1.20 - 1.20 no
I6 1.19 - 1.37 1.18 - 1.21 1.23 - 1.28 1.15 - 1.15 1.12 - 1.12 1.03 - 1.03 yes
I7 1.29 - 1.51 1.13 - 1.24 1.14 - 1.24 1.06 - 1.06 1.11 - 1.11 0.89 - 0.89 yes
I8 1.20 - 1.36 1.24 - 1.38 1.21 - 1.33 1.17 - 1.17 1.06 - 1.06 0.95 - 0.95 yes
S1 1.22 - 1.30 1.18 - 1.28 0.91 - 0.91 1.05 - 1.05 0.88 - 0.88 0.92 - 0.92 yes
S2 1.17 - 1.23 1.21 - 1.21 1.01 - 1.01 1.01 - 1.01 1.04 - 1.04 0.94 - 0.94 yes
S3 1.26 - 1.39 1.23 - 1.23 1.15 - 1.15 1.12 - 1.12 1.12 - 1.12 1.06 - 1.06 yes
S4 1.12 - 1.27 1.10 - 1.14 0.96 - 0.96 0.95 - 0.95 1.03 - 1.03 0.95 - 0.95 yes
S5 1.04 - 1.04 1.09 - 1.09 1.03 - 1.03 1.08 - 1.08 1.05 - 1.05 0.92 - 0.92 yes
S6 1.17 - 1.22 1.22 - 1.28 1.16 - 1.16 1.14 - 1.14 1.16 - 1.16 1.11 - 1.11 yes
S7 1.26 - 1.44 1.28 - 1.28 1.12 - 1.12 0.98 - 0.98 1.09 - 1.09 1.01 - 1.01 yes
S8 1.18 - 1.31 1.14 - 1.17 1.10 - 1.18 0.98 - 0.98 1.03 - 1.03 0.94 - 0.94 yes
TABLE 4
Distribution of the responses to the questions regarding
the difficulty of understanding the requirements in the
intermediate questionnaires.
1 2 3 4 5
Juniors
Easy requirements 68 3 1 0 0
Moderate requirements 19 53 0 0 0
Difficult requirements 12 57 3 0 0
Interm.
Easy requirements 58 6 0 0 0
Moderate requirements 39 25 0 0 0
Difficult requirements 26 38 0 0 0
Seniors
Easy requirements 58 6 0 0 0
Moderate requirements 52 12 0 0 0
Difficult requirements 41 22 1 0 0
understanding was easy for the research subjects, so we can consider the learning curves with UTDD independent of requirement comprehension (see the threats to validity in Section 5 for details).
Figure 6 compares the average normalized
development time for each group of programmers
and set of requirements. Dark bars on top correspond to
the effort used in self-training. These measures estimate
the efficiency with UTDD of junior, intermediate and
senior programmers as compared to the control strategy.
For the initial groups of requirements, efficiency with UTDD is very low. For example, during the development of G3, performance decreased by 41% for juniors, 42% for intermediates, and 29% for seniors. Initially, most of the subjects needed to look for information about UTDD (self-training), which had a negative impact on performance. As shown above, self-training quickly dropped as the development progressed (see Fig. 2), and its performance penalty fell close to 0 in just a few iterations (no subject spent any time on self-training at the end of the experiment). If we now exclude self-training, during the development of the first set of requirements developed with UTDD (G3), performance decreased by 30% for juniors, 25% for intermediates, and 20% for seniors. In a sense, these percentages can be considered the effort of applying UTDD. Note that, once self-training is excluded, performance is better for intermediates than for juniors. Independently of the experience level, the UTDD effort dropped as the development progressed. The better the knowledge of UTDD, the better the performance (cf. Figures 4 and 6). Therefore, the learning curves had an impact on efficiency, which improved through practice. Initially, senior programmers performed better than intermediates. However, as the intermediates’ development process better conformed to UTDD, both groups displayed a similar efficiency.
Fig. 6. Comparison of the average development time
during the study for each programmer category. Dark bars
on top correspond to self-training time.
Juniors’ performance was always the worst.
These results indicate that, unlike process conformance, efficiency mainly correlates with expertise. If we now classify the subjects with the k-means algorithm according to their performance evolution during the experiment, the optimal number of clusters we find is two. Column “P-Cluster” in Table 3 shows this classification (a minimal sketch of this clustering step is given after the list below):
The cluster labeled “yes” contains the subjects that were able to apply UTDD continuously and efficiently by the time the experiment concluded. All the seniors and all the intermediates but one are in this cluster. During the initial iterations, these subjects spent some extra effort on UTDD, but this effort dropped quickly in the subsequent iterations until they achieved a performance similar to the one observed with the control strategy (values close to 1). Seniors needed to implement three sets of requirements to reach a good enough performance, while intermediates needed four iterations (see Fig. 6).
The subjects belonging to the “no” cluster are mainly junior programmers. Here we have to distinguish between two subgroups. On one hand, if we individually analyze the development times of these subjects, we observe that some of them alternated good and bad performances (e.g., see J6 or J7). In these cases the performance penalty at the end of the experiment was not related to training on UTDD: the iterations with a higher development time corresponded to those in which the number of structural changes in the existing design was also higher. The correction of these bad design decisions was the main factor increasing the development time for juniors. On the other hand, a different situation occurs for subjects J1 and I5. Although they were able to properly apply UTDD, they could not do it efficiently (12% of the subjects if we include the outlier).
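The paper does not detail how this clustering was run; the following is a minimal, self-contained sketch (in Java, the language used throughout the study) of how a k-means classification with k = 2 could be applied to each subject’s normalized times for G3-G8. The Euclidean distance, the deterministic initialization, the number of iterations and the four example rows (copied from the left values in Table 3) are illustrative assumptions, not the actual experimental setup.

```java
import java.util.Arrays;

// Minimal k-means sketch (k = 2) over each subject's normalized times for G3-G8.
public final class PerformanceClustering {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    static int[] kMeans(double[][] points, int k, int iterations) {
        // Deterministic initialization: the first k points act as initial centroids.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[c].clone();
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: attach each subject to the nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(points[p], centroids[c]) < distance(points[p], centroids[best])) best = c;
                }
                assignment[p] = best;
            }
            // Update step: recompute each centroid as the mean of its assigned subjects.
            for (int c = 0; c < k; c++) {
                double[] mean = new double[points[0].length];
                int members = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] != c) continue;
                    for (int d = 0; d < mean.length; d++) mean[d] += points[p][d];
                    members++;
                }
                if (members > 0) {
                    for (int d = 0; d < mean.length; d++) mean[d] /= members;
                    centroids[c] = mean;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Four illustrative rows from Table 3 (values without self-training):
        // J1 and J4 belong to the "no" cluster, I2 and S1 to the "yes" cluster.
        double[][] subjects = {
            {1.38, 1.34, 1.12, 1.22, 1.16, 1.18},  // J1
            {1.35, 1.37, 1.31, 1.28, 1.31, 1.22},  // J4
            {1.25, 1.11, 0.97, 1.03, 0.91, 0.98},  // I2
            {1.22, 1.18, 0.91, 1.05, 0.88, 0.92},  // S1
        };
        // Prints [1, 1, 0, 0]: J1 and J4 end up in one cluster, I2 and S1 in the
        // other (the numeric cluster labels themselves are arbitrary).
        System.out.println(Arrays.toString(kMeans(subjects, 2, 10)));
    }
}
```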
Another interesting result highlighting the difference between juniors and seniors is the response to the last question of the final questionnaire (“Q2. How useful do you feel UTDD is?”). Only three subjects (see Table 1) gave response 1 (“Not at all useful”, J9 and S5) or 2 (“Slightly useful”, S1); they were the only subjects with a poor acceptance of UTDD. Despite this, subjects S1 and S5 achieved a good performance (with normalized development times even below 1 in some cases), while subject J9 always had a poor performance. The rest of the subjects considered UTDD moderately or very useful (the mean value on the Likert scale is always greater than 3.4), and no significant differences in performance related to UTDD acceptance were found among them.
Finally, if we look for a correlation between UTDD conformance and efficiency, we observe that, as the development advanced and their knowledge of UTDD improved, some subjects reached a trade-off between the use of UTDD and efficiency. As described above, Fig. 5 shows that when the subjects acquired a good knowledge of UTDD, the corresponding process conformance usually decreased to a nearly stable value that depended on the specific programmer. These small UTDD deviations usually came with a performance improvement (e.g., compare the conformance and performance evolution of subjects I2 or S3).
In brief, our results show that during the study not only did effectiveness with UTDD improve as the subjects progressed through the experiment (Fig. 5), but efficiency also improved through practice with UTDD (Fig. 6). However, for junior programmers, although the UTDD concepts were clear, efficiency was lower due to what appears to be a weaker design ability.
4.4 Retention ability
When we talk about the adoption of a new software development technique, an important feature to take into account is the retention ability, not only the learning process. As far as we know, some of the intermediate and senior programmers that participated in our study have included UTDD (and also ATDD) in the arsenal of development techniques of their teams and nowadays use them in projects within the industrial environment. Nevertheless, we could not conduct a controlled experiment to study in detail the retention and later application of UTDD within the professional environment. In this sense, we can only provide a simple estimation of retention in the case of three of the original research subjects (I2, I5 and S5). Six months after finishing the experiment described so far, we conducted a very simple empirical study with these three subjects to evaluate their ability to continue using UTDD. None of them had used UTDD again after our initial experiment. In the new quasi-experiment, they had to develop, using UTDD, two new groups of requirements (G1’ and G2’), each comprising an easy requirement, a moderate one and a difficult one. Figure 7 shows their UTDD conformance (top panel) and performance (bottom panel) during the initial development (G3-G8) and the later one (G1’-G2’).
Fig. 7. Estimation of the retention ability of subjects I2, I5 and S5. G3-G8 are the groups of requirements included in the initial experiment. G1’ and G2’ are two new groups of requirements, with three requirements per group, developed six months after finishing the experiment.
The three subjects reached a UTDD conformance similar to the one they achieved during the development of requirement groups G7 and G8. From the performance perspective, they showed a slight decline, although in all cases performance was close to the non-UTDD control value. These results indicate that after six months the three subjects were still able to use UTDD effectively and efficiently.
5 THREATS TO VALIDITY
As in most empirical studies, the validity of our results is subject to several threats. In order to limit the scope of our claims as well as to explain their utility, the first important fact to take into account is that this is an empirical study presenting results based on a small number of developers. To some extent, they knew that we were measuring their development capabilities, leading to a possible Hawthorne effect. Furthermore, the lack of time pressure during the development could help the participants focus on using UTDD in the right way instead of on performance, which facilitates learning and increases UTDD conformance.
A crucial internal validity threat is related to how
we calculate the baseline for the time required by each
research subject to implement a set of three requirements
(Eq. 2). Note that we use these baselines to quantify
the designing, programming and testing abilities of the
subjects, not to compare UTDD and non-UTDD perfor-
mances. UTDD forces developers to write unit tests, be-
cause the tests are an essential part of the development.
With more traditional approaches unit testing is some-
times less detailed or even disregarded [37], [13], [14],
[38]. To allow the performance comparison in an equiv-
alent scenario, we instructed the subjects to implement
JUnit tests with the control strategy. Intermediates and
seniors created approximately the same mean number of
unit tests with the UTDD and the non-UTDD approach.
In the case of juniors, the number of unit tests slightly
increased with UTDD. In order to analyze whether the method depended on the sets of requirements used in the estimation, we verified that the level of complexity and the development time of all the sets of three requirements were similar. Before the study, three professional programmers (one for each of the developer categories considered in the study) developed the eight sets of three requirements, validating that performance was equivalent across sets and programmers. After that, we chose the order of the requirements and the two groups used to calculate the baselines.
In our analysis we have omitted the requirement comprehension time because of the subjects’ responses to the intermediate questionnaires. If we include this time within the effective development time, small differences appear in the performance results. However, this is an expected effect, since different requirements of the same category require different amounts of time to be read and understood.
Finally, as with all empirical studies, our analysis should be repeated in different environments and contexts before the results can be generalized. In particular, all the participants in our study are skilled programmers. They have experience in unit testing and know the development language and the automated testing tool. Our results are only applicable to similar scenarios; a lack of such experience or knowledge would produce different learning curves with UTDD. Threats to external validity are also related to the application we used as study material. Although it is a real-world application, requirements were adapted to the time available for the study and, therefore, all the development tasks were simple. Shallower learning curves (mainly from the efficiency point of view) are possible when considering more complex tasks. In this vein, for example, our results are the same whether we consider the developers’ general software development experience, their Java and JUnit knowledge, or their design experience. However, in more complex real-world developments, testing or design experience, or knowledge of a specific programming language, may be key factors with a significant impact on UTDD adoption.
6 CONCLUSIONS
The principal conclusion drawn from our study is that
skilled developers with the appropriate knowledge and
without any prior experience in test-first development
can quickly learn the UTDD rules and, after practicing
them for a short while, properly apply them in small
programming tasks. Interestingly, there even existed
some subjects whose development process conformed to
UTDD from the beginning of the experiment. In other
words, they did not need a training or learning period
to apply UTDD in the right way. Furthermore, it also
seems that the research subjects were able to retain the
UTDD knowledge and even use it in their companies
within the industrial environment.
All the participants in our study except one (96% of
the research subjects) believed that UTDD was easy to
learn. The quantitative results corroborate this belief. The research subjects wrote correct and complete functional code and, at the end of the experiment, only one of them (the outlier) was not able to properly apply this software development practice. If we do not take into account the outlier, the participants followed three possible learning processes (Fig. 5), although all of them adapted their mindset to UTDD in just a few iterations.
Analyzing the learning curves by experience level (Fig. 4), the development process of seniors looks like a UTDD expert process from the beginning of the experiment, while the processes of juniors and intermediates initially look like a UTDD novice process. Knowledge of UTDD improved as the participants advanced through the development. Not only did self-training drop (Fig. 2), but UTDD conformance also grew until reaching a steady level (Fig. 4). When the study concluded, the three groups of programmers behaved similarly to a UTDD expert team. If we consider 0% conformance the starting point of the learning process, the learning curves are steeper for intermediate programmers than for juniors (although it is important to keep in mind that they are always steep in any case), but juniors needed more self-training along the learning process. Nevertheless, the learning curves with UTDD do not seem to depend only on the programmer’s experience level. Some juniors and intermediates were also able to stay faithful to UTDD with just a little practice. This fact points to the dependence of the learning process on how disciplined the programmer is. Therefore, we can conclude that the learning curves with UTDD were related to a trade-off between the developer’s experience and discipline.
During the experiment, performance improved
through practice too. Unlike development process
conformance, programming efficiency mainly depended
on the programmer’s experience level. With just a little practice, both intermediate and senior programmers were able to use UTDD efficiently, with a performance similar to more traditional test-last techniques. In these cases, the use of UTDD had a minimal impact on productivity. Although this is not the main goal of our
study, this is an interesting result that suggests that
UTDD performance in the industrial environment could
be similar to the performance with other traditional
techniques. Junior programmers, in the same way as
intermediates and seniors, were able to quickly learn
and apply the UTDD concepts, but, in general, they
had limitations regarding the design decisions they
made in each UTDD cycle. Therefore, although they
could properly use UTDD, in general they had a worse
performance due to the time needed to modify and/or
correct the existing code.
Another interesting result of our study is that only two subjects achieved a UTDD conformance greater than 90% during the development of the final requirements, unlike what happened during the initial requirements, where some of the participants even showed 100% process conformance (Table 2). This also happens in UTDD expert groups [18]. In the same manner, a similar effect can be observed once a subject reached a high knowledge level in UTDD (i.e., a close to 100% UTDD conformance). These subjects usually started combining the use of UTDD with other development practices because they believed that some developments could be more efficiently completed with other approaches. Therefore, they showed small UTDD deviations. When this change in behavior happened, performance improved (cf. Table 2 and Table 3), so the developers’ observation seems to be true. A possible explanation is that UTDD may not be a suitable or effective solution for all kinds of requirements [6]. These results suggest that experienced programmers are able to dynamically adapt their development process in order to reach an optimal balance between conformance and performance and, therefore, a strict conformance to the UTDD rules may not be observable in practice within expert groups.
The general perception of UTDD is that it is difficult to learn and apply [13], [14], [15], [26]. It is commonly thought that, although it seems easy at first glance, it requires a high level of discipline to be carried out effectively and efficiently [22], [23]. In this sense, previous studies show that, at a first exposure to this software development practice, there often exists an initial resistance to use it due to inexperience and a potential growth in the amount of work [9], [30]. It is therefore usually highlighted that UTDD-inexperienced developers require support, at least in the early stages of the learning process [31]. In a context like the industrial environment, where software professionals often work under time pressure and financial constraints, all these factors may make UTDD novices revert to well-known traditional techniques. In contrast to this general perception, recent empirical results indicate that, in a UTDD novice development team, problems related to lack of experience can be limited with a UTDD mentor or by following a pair programming approach [14], [39], [21], [19]. Our results are aligned with these last findings. The difference with previous studies likely lies
in the subjects’ experience. While in most of the previous UTDD studies the subjects are students, in our case they are industrial developers with testing skills and several years of experience. This gives them the ability to write efficient and effective unit test cases [40], [41], which for us is one of the key factors for UTDD success.
From the industrial point of view, our results indicate that UTDD can be considered an effective software development practice even if the development team has no experience with this technique. Intermediate and senior programmers can quickly take advantage of UTDD in those cases where the adoption of this practice benefits the project (see [42] for a detailed analysis).
Finally, our results suggest that junior programmers often lack good design abilities. In this regard, an interesting question already raised by Janzen and Saiedian [7] is the possibility of incorporating UTDD into the undergraduate curriculum to improve students’ ability to design. Along this line, promising results have recently been obtained in introductory programming courses with WebIDE [43], [44].
APPENDIX
The application to be implemented during the study was a simple project planning tool. The requirements used as study material were classified into three categories according to their complexity level. In this Appendix we present an example requirement from each category: easy, moderate and difficult.
Before providing these examples, we need to introduce some concepts. A project has different properties (not important here) and contains zero or more tasks. A task belongs to one and only one project. It consumes time and requires resources (persons or materials) to be carried out. A task has the following attributes: Name, Description, Minimum Start Date (the start date if the task has no dependencies) and Time (the time required to accomplish the task). Each association between a resource and a task has an availability attribute indicating the percentage of the resource availability allocated to the associated task. Each resource can have different costs and percentages of availability and, in the case of resources of type “person”, different percentages of over-allocation and costs of over-allocation in different time periods. Now we are ready to provide the example requirements. Details of the different user roles are not specified:
Easy requirement: Create a new task.
The authorized users must be able to create a new
task within the current project and to specify its
properties. If the user does not indicate a minimum
start date for the task, the minimum start date is the
start date of the current project. The task name must
be unique within the current project.
Moderate requirement: Calculate the project cost.
The authorized users must be able to calculate the
cost of a given project. The cost of a project is
defined as the sum of the cost of all its tasks. The
cost of each task is calculated as the sum of the cost
of the resources associated with the task and all its
subtasks.
Difficult requirement: Evaluate whether a resource is over-allocated.
A resource is over-allocated in a given time period if the sum of the availability attributes of all the corresponding resource-task associations is greater than the resource’s availability in that period. In the case of resources of type “person”, this sum cannot be higher than the corresponding percentage of over-allocation.
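To illustrate how a UTDD iteration on this study material might begin, the following is a hypothetical sketch for the moderate requirement above (“Calculate the project cost”): a JUnit test written first, followed by just enough production code to make it pass. All class and method names (Project, Task, Resource, getCost, ...) are invented for this example; the actual code produced in the study is not published, and details such as subtask costs and availability percentages are deliberately omitted.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Hypothetical UTDD sketch for "Calculate the project cost". In UTDD this test
// would be written (and seen to fail) before the production code below it exists.
public class ProjectCostTest {

    @Test
    public void projectCostIsTheSumOfItsTaskCosts() {
        Project project = new Project("Sample project");
        Task design = new Task("Design");
        design.addResource(new Resource("Analyst", 300.0));
        Task coding = new Task("Coding");
        coding.addResource(new Resource("Developer", 500.0));
        coding.addResource(new Resource("Workstation", 120.0));
        project.addTask(design);
        project.addTask(coding);

        // 300 + (500 + 120) = 920
        assertEquals(920.0, project.getCost(), 0.001);
    }
}

// Minimal production code, just enough to make the test pass.
class Resource {
    private final String name;
    private final double cost;
    Resource(String name, double cost) { this.name = name; this.cost = cost; }
    double getCost() { return cost; }
}

class Task {
    private final String name;
    private final java.util.List<Resource> resources = new java.util.ArrayList<>();
    Task(String name) { this.name = name; }
    void addResource(Resource r) { resources.add(r); }
    double getCost() { return resources.stream().mapToDouble(Resource::getCost).sum(); }
}

class Project {
    private final String name;
    private final java.util.List<Task> tasks = new java.util.ArrayList<>();
    Project(String name) { this.name = name; }
    void addTask(Task t) { tasks.add(t); }
    double getCost() { return tasks.stream().mapToDouble(Task::getCost).sum(); }
}
```

In an actual UTDD cycle, the test would first be run and seen to fail, the minimal production code would then be added, and the design would be refined in subsequent iterations.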
ACKNOWLEDGMENT
The author would like to thank Mónica Cabanillas for her comments and suggestions that helped to improve the final version of the manuscript.
REFERENCES
[1] K. Beck, Test Driven Development: By Example. Addison-Wesley
Professional, 2003.
[2] D. Astels, Test Driven development: A Practical Guide. Prentice Hall
Professional Technical Reference, 2003.
[3] H. Erdogmus, G. Melnik, and R. Jeffries, “Test-driven
development,” in Encyclopedia of Software Engineering, 2010,
pp. 1211–1229.
[4] L. Koskela, Test driven: practical tdd and acceptance tdd for java
developers. Greenwich, CT, USA: Manning Publications Co., 2007.
[5] E. Hendrickson, “Acceptance test driven development (atdd):
an overview,” in Seventh Software Testing Australia/New Zealand
(STANZ), Wellington, 2008.
[6] K. Beck, “Aim, fire,” IEEE Softw., vol. 18, no. 5, pp. 87–89, Sep.
2001.
[7] D. Janzen and H. Saiedian, “Test-driven development: Concepts,
taxonomy, and future direction,” Computer, vol. 38, no. 9, pp. 43–
50, Sep. 2005.
[8] ——, “Does test-driven development really improve software
design quality?” IEEE Software, vol. 25, no. 2, pp. 77–84, Mar.
2008.
[9] E. M. Maximilien and L. A. Williams, “Assessing Test-Driven
Development at IBM.” in ICSE, L. A. Clarke, L. Dillon, and W. F.
Tichy, Eds. IEEE Computer Society, 2003, pp. 564–569.
[10] H. Erdogmus, M. Morisio, and M. Torchiano, “On the effective-
ness of the test-first approach to programming,” IEEE Trans. Softw.
Eng., vol. 31, no. 3, pp. 226–237, Mar. 2005.
[11] T. Bhat and N. Nagappan, “Evaluating the efficacy of test-driven
development: industrial case studies,” in Proceedings of the 2006
ACM/IEEE international symposium on Empirical software engineer-
ing, ser. ISESE ’06. New York, NY, USA: ACM, 2006, pp. 356–363.
[12] N. Nagappan, E. M. Maximilien, T. Bhat, and L. Williams, “Re-
alizing quality improvement through test driven development:
results and experiences of four industrial teams,” Empirical Softw.
Engg., vol. 13, no. 3, pp. 289–302, Jun. 2008.
[13] B. George and L. Williams, “An initial investigation of test driven
development in industry,” in Proceedings of the 2003 ACM sympo-
sium on Applied computing, ser. SAC ’03. New York, NY, USA:
ACM, 2003, pp. 1135–1139.
[14] ——, “A structured experiment of test-driven development,”
Information and Software Technology, vol. 46, no. 5, pp. 337 – 342,
2004.
[15] L. Crispin, “Driving software quality: How test-driven
development impacts software quality.” IEEE Software, vol. 23,
no. 6, pp. 70–71, 2006.
[16] R. Jeffries and G. Melnik, “Guest editors’ introduction: Tdd–the
art of fearless programming,” IEEE Softw., vol. 24, no. 3, pp. 24–30,
May 2007.
[17] A. Causevic, D. Sundmark, and S. Punnekkat, “Factors limiting
industrial adoption of test driven development: A systematic
review,” in Software Testing, Verification and Validation (ICST), 2011
IEEE Fourth International Conference on. IEEE, 2011, pp. 337–346.
[18] M. Müller and F. Padberg, “An empirical study about the feelgood factor in pair programming,” in Software Metrics, 2004. Proceedings. 10th International Symposium on, 2004, pp. 151–158.
[19] R. Latorre, “A successful application of a test-driven
development strategy in the industrial environment,” Empirical
Software Engineering, pp. 1–21, 2013. [Online]. Available:
http://dx.doi.org/10.1007/s10664-013-9281-9
[20] M. M. Müller and A. Höfer, “The effect of experience on the test-driven development process,” Empirical Softw. Engg., vol. 12, no. 6, pp. 593–615, Dec. 2007.
[21] A. Höfer and M. Philipp, “An empirical study on the tdd conformance of novice and expert pair programmers,” in XP, 2009, pp. 33–42.
[22] P. Abrahamsson, A. Hanhineva, and J. Jäälinoja, “Improving busi-
ness agility through technical solutions: A case study on test-
driven development in mobile software development,” in Business
Agility and Information Technology Diffusion, ser. IFIP International
Federation for Information Processing, R. Baskerville, L. Mathi-
assen, J. Pries-Heje, and J. DeGross, Eds. Springer US, 2005, vol.
180, pp. 227–243.
[23] L. Cao and B. Ramesh, “Agile requirements engineering practices:
An empirical study,” Software, IEEE, vol. 25, no. 1, pp. 60–67, 2008.
[24] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in software engineering: an introduction. Norwell, MA, USA: Kluwer Academic Publishers, 2000.
[25] P. Runeson and M. Höst, “Guidelines for conducting and reporting case study research in software engineering,” Empirical Softw. Engg., vol. 14, no. 2, pp. 131–164, Apr. 2009.
[26] S. Kollanus and V. Isomöttönen, “Understanding tdd in academic environment: experiences from two experiments,” in Proceedings of the 8th International Conference on Computing Education Research, ser. Koli ’08. New York, NY, USA: ACM, 2008, pp. 25–31.
[27] A. Geras, M. Smith, and J. Miller, “A prototype empirical eval-
uation of test driven development,” in Software Metrics, 2004.
Proceedings. 10th International Symposium on, 2004, pp. 405–416.
[28] P. Sfetsos, L. Angelis, and I. Stamelos, “Investigating the extreme programming system: an empirical study,” Empirical Software Engineering, vol. 11, no. 2, pp. 269–301, 2006.
[29] L. Madeyski, Test-Driven Development: An Empirical Evaluation of
Agile Practice. Springer, 2010.
[30] M. M. Müller and O. Hagner, “Experiment about test-first programming,” IEE Proceedings - Software, vol. 149, no. 5, pp. 131–136, 2002.
[31] J. Rasmusson, “Introducing xp into greenfield projects: lessons
learned,” Software, IEEE, vol. 20, no. 3, pp. 21–28, 2003.
[32] L. Madeyski and L. Szala, “The impact of test-driven development
on software development productivity - an empirical study,” in
Software Process Improvement, ser. Lecture Notes in Computer Sci-
ence, P. Abrahamsson, N. Baddoo, T. Margaria, and R. Messnarz,
Eds. Springer Berlin Heidelberg, 2007, vol. 4764, pp. 200–211.
[33] H. Kou, P. Johnson, and H. Erdogmus, “Operational definition and automated inference of test-driven development with zorro,” Automated Software Engineering, vol. 17, no. 1, pp. 57–85, 2010. [Online]. Available: http://dx.doi.org/10.1007/s10515-009-0058-8
[34] J. B. MacQueen, “Some methods for classification and analysis of
multivariate observations,” in Proc. of the fifth Berkeley Symposium
on Mathematical Statistics and Probability. University of California
Press, 1967, pp. 281–297.
[35] R. Xu and D. Wunsch II, “Survey of clustering algorithms,” Neural Networks, IEEE Transactions on, vol. 16, no. 3, pp. 645–678, 2005.
[36] S. Ryan and R. V. O’Connor, “Development of a team measure for tacit knowledge in software development teams,” Journal of Systems and Software, vol. 82, no. 2, pp. 229–240, 2009.
[37] R. A. Ynchausti, “Integrating unit testing into a software development team’s process,” in Intl. Conf. eXtreme Programming and Flexible Processes in Software Engineering, 2001, pp. 79–83.
[38] L. Williams, E. M. Maximilien, and M. Vouk, “Test-driven
development as a defect-reduction practice,” in Proceedings of the
14th International Symposium on Software Reliability Engineering, ser.
ISSRE ’03. Washington, DC, USA: IEEE Computer Society, 2003,
pp. 34–45.
[39] M. Domino, R. Collins, and A. Hevner, “Controlled experimenta-
tion on adaptations of pair programming,” Information Technology
and Management, vol. 8, no. 4, pp. 297–312, 2007.
[40] A. van Deursen, “Program comprehension risks and opportu-
nities in extreme programming,” in Reverse Engineering, 2001.
Proceedings. Eighth Working Conference on, 2001, pp. 176–185.
[41] A. Van Deursen, L. Moonen, A. van den Bergh, and G. Kok,
Refactoring test code. CWI, 2001.
[42] Y. Rafique and V. B. Misic, “The effects of test-driven development
on external quality and productivity: A meta-analysis,” IEEE
Transactions on Software Engineering, vol. 99, no. PrePrints, 2012.
[43] T. Dvornik, D. S. Janzen, J. Clements, and O. Dekhtyar, “Sup-
porting introductory test-driven labs with webide,” in Software
Engineering Education and Training (CSEE&T), 2011 24th IEEE-CS
Conference on. IEEE, 2011, pp. 51–60.
[44] M. Hilton and D. S. Janzen, “On teaching arrays with test-
driven learning in webide,” in Proceedings of the 17th ACM annual
conference on Innovation and technology in computer science education.
ACM, 2012, pp. 93–98.
Roberto Latorre is Profesor Ayudante Doctor of Computer Science at Dpto. Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain. He is also a software engineer and researcher at NeuroLogic Soluciones Informáticas. His research interests include different topics in software engineering (e-business software engineering, quality software development) and neurocomputing (from the generation of motor patterns and information coding to pattern recognition and artificial neural networks). He received his Ph.D. in Computer Science and Telecommunications from Universidad Autónoma de Madrid in 2008.