How Effective is Test-Driven Development?
Burak Turhan
Lucas Layman
Madeline Diep
Hakan Erdogmus
Forrest Shull
Test-Driven Development (TDD) [Beck 2002] is one of the most referenced, yet least used agile
practices in industry. Its neglect is due mostly to our lack of understanding of its effects on
people, processes, and products. Although most people agree that writing a test case before
code promotes more robust implementation and a better design, the unknown costs associated
with TDD’s effects and the inversion of the ubiquitous programmer “code-then-test” paradigm
have impeded TDD’s adoption.
To provide an overview of current evidence on the effects of TDD, we conducted a systematic
review of TDD research in online databases and scientific publications. Systematic review is a
research method popularized in the medical community for aggregating and analyzing the
results of clinical trials. A systematic review seeks to answer the general question, “What does
the published evidence say about the effects of using technique X?” In medicine, systematic reviews
play a critical role in evaluating the effectiveness of pharmaceutical drugs and alternative
treatments for illnesses. Empirical software engineering researchers have adopted this
approach for summarizing and analyzing evidence about the effects of software development
practices. It is covered in Chapter 3, What We Can Learn From Systematic Reviews, by Barbara
Kitchenham, and in [Kitchenham 2004] and [Dybå et al. 2005].
In this chapter, we treat TDD as an imaginary medical pill and describe its effects with a
narrative from a pharmacological point of view, instead of providing a formal systematic review
report. We invite the reader to imagine that the rest of this chapter is a medical fact sheet for
the TDD “pill” and to continue reading with the following question in mind:
“If TDD were a pill, would you take it to improve your health?”
The TDD Pill—What is It?
The ingredients of the TDD pill are as follows and should be prepared in the given order:
1. Choose a small task.
2. Write a test for that task.
3. Run all the tests to verify that the new test fails.
4. Write minimal production code to complete the task.
5. Run all tests (including the new one) to verify that they pass.
6. Refactor the code as necessary.
7. Repeat from step 1.
The active ingredient in the TDD pill is the authoring of test cases before production code.
Authoring test cases before code requires the patient to consider the design of the solution,
how information will flow, the possible outputs of the code, and exceptional scenarios that
might occur. Running the newly written test case before writing production code helps to verify
that the test is written correctly (a passing test case at this point is not testing the intended
effects) and that the system compiles. The TDD pill also involves writing just enough production
code to pass the test case, which encourages an uncluttered, modular design. Furthermore,
TDD users create a growing library of automated test cases that can be executed at any time to
verify the correctness of the existing system whenever changes are made.
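One pass through the cycle above can be sketched in Python with the standard unittest module. The task (a leap-year check) and every name in the block are hypothetical, chosen only for illustration; the point is the order of work, not the function itself.

```python
import unittest


# Step 4: minimal production code, written only after the tests below had
# been written and seen to fail (step 3). Before this function existed,
# each test raised NameError, which confirmed the tests were really running.
def is_leap_year(year: int) -> bool:
    """Return True if `year` is a leap year in the Gregorian calendar."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Step 2: the tests for the small task chosen in step 1.
class LeapYearTest(unittest.TestCase):
    def test_year_divisible_by_four_is_leap(self):
        self.assertTrue(is_leap_year(2024))

    def test_century_is_leap_only_if_divisible_by_400(self):
        self.assertFalse(is_leap_year(1900))
        self.assertTrue(is_leap_year(2000))

    def test_ordinary_year_is_not_leap(self):
        self.assertFalse(is_leap_year(2023))


def run_all_tests() -> bool:
    """Step 5: run the whole suite and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LeapYearTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    run_all_tests()
```

After step 5 succeeds, refactoring (step 6) can proceed with the suite as a safety net, and the next small task starts the cycle again.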
Like many drugs, the TDD pill has some official variants, including ATDD (Acceptance-Test-
Driven Development), BDD (Behavior-Driven Development) and STDD (Story-Test-Driven
Development). ATDD replaces the “small task” step with “functional level business logic tasks,”
whereas BDD uses “behavioral specifications” instead. The ordering of tasks in TDD
differentiates it from other treatments, but official varieties of TDD pills may also contain sets
of other ingredients, such as breaking work down into simple tasks, refactoring, keeping the
test-code cycles short, and relentless regression testing.
Except for refactoring, some key ingredients may be missing in many “generic”
TDD pills due to different interpretations in practice and the dominance of the
active ingredient.
Summary of Clinical TDD Trials
The focus of our review was to gather quantitative evidence on the effects of the TDD pill on
internal code quality (see “Measuring Code Quality”), external quality,
productivity, and test quality. The evaluation of the TDD pill is based on data gathered from
32 clinical trials. In the first quarter of 2009, the authors gathered 325 TDD research reports
from online comprehensive indices, major scientific publishers (ACM, IEEE, Elsevier), and
“gray literature” (technical reports, theses). The initial set of 325 reports was narrowed down
to 22 reports through a two-level screening process. Four researchers filtered out studies
conducted prior to 2000, qualitative studies, surveys, and wholly subjective analyses of the
TDD pill. Some of these reports contained multiple or overlapping trials (i.e., the same trial
was reported in multiple papers); in such cases, the trial was counted only once. A team of five
researchers then extracted key information from the reports regarding study design, study
context, participants, treatments and controls, and study results. In total, the research team
analyzed 22 reports containing 32 unique trials.
Measuring Code Quality
The internal quality of a system is related to its design quality, usually with the interpretation that
good designs are simple, modular, and easy to maintain and understand. Though TDD is primarily
interpreted as a development practice, it is considered as a design practice as well. An incremental
and simple design is expected to emerge when using the TDD pill. The simple design is driven by
the modularity needed to make code testable, by writing minimal production code to complete
simple tasks, and by constant refactoring. To assess the internal quality of a system, the TDD trials
use one or more of the following measures:
Object-oriented metrics. These include weighted methods per class (WMC), depth
of inheritance tree (DIT), etc. [Chidamber et al. 1994]
Cyclomatic complexity
Code density (e.g., lines of code per method)
Code size per feature
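To make two of these measures concrete, the following is a minimal sketch that approximates cyclomatic complexity and code density for Python source using the standard ast module. It is not the tooling used in any of the trials; the counting rule (one plus the number of decision points) is a common simplification, and the sample function is hypothetical.

```python
import ast

# Node types that each add one decision point (a simplified McCabe count).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands and adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


def lines_per_function(source: str) -> dict[str, int]:
    """Code density, measured as source lines per function definition."""
    return {
        node.name: node.end_lineno - node.lineno + 1
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


# Hypothetical sample code to measure.
SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in (2, 3):
        if n % d == 0 and n != d:
            return "composite"
    return "other"
'''

print(cyclomatic_complexity(SAMPLE))  # 5: base 1 + two ifs + one for + one `and`
print(lines_per_function(SAMPLE))     # {'classify': 7}
```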
The external quality of a system is usually measured by the number of pre-release or post-release
defects. TDD is associated with the important claim that it increases external quality because it
encourages writing lots of test cases, developers work on simple tasks that are easy to comprehend,
the system is under frequent regression testing, and errors due to changes can be easily detected
by the fine-grained tests. In TDD trials, external quality is reported by one or more of the following:
Test cases passed
Number of defects
Defect density
Defects per test
Effort required to fix defects
Change density
Percentage of preventative changes
The 32 trials were conducted in academic or industrial settings in the form of controlled
experiments, pilot studies, or commercial projects. Controlled experiments were conducted in
academic laboratories or controlled industry environments with a defined research protocol,
pilot studies were carried out using less-structured experimental tasks, and commercial projects
described industry teams using TDD as part of their everyday work. Participants in these trials
had different experience levels, ranging from undergraduate students to graduate students to
professionals. The number of participants per trial ranged from a single individual to 132
persons. The effort spent on the trials spans a wide interval, ranging from a few person-hours
to 21,600 person-hours. Each trial compares the effects of the TDD pill with respect to another
treatment—usually traditional test-last development. The subjects in the treatment groups
consisted of various units, such as individuals, pairs, teams, and projects.
We have classified the 32 trials into four levels based on the experience of participants, the
detail of the experimental construct, and the scale of the trial. The experience of participants
is determined by whether they were undergraduate students, graduate students, or professionals.
The descriptions of the dynamics of TDD and the control treatment were used to evaluate the
construct of the trial as either good, adequate, poor, or unknown. A good construct enforced
all prescribed TDD ingredients from the prior section, an adequate construct prescribed writing
tests first but not all of the TDD ingredients, a poor construct did not enforce the TDD steps,
and an unknown construct did not specify whether the TDD steps were enforced or not.
Finally, the scale of a trial is recorded as small, medium or large, depending on the reported
effort or estimated effort based on duration and number of participants. Small projects involved
less than 170 person-hours of total effort across all subjects, whereas large projects ranged from
3,000 to 21,600 person-hours. A simple clustering algorithm was used to categorize the scale
of the projects, while the experience of participants and the details of the construct were based
on descriptive data found in the trial reports.
Our confidence that the results of using the TDD pill will generalize to “real-life” cases increases
as the level of the trial increases. The lowest level, L0, contains all small-scale trials. These trials
report less than 170 person-hours of effort or fewer than 11 participants. The next level, L1,
consists of medium- or large-scale trials with unknown or poor constructs. The L2 level consists
of medium- or large-scale trials with adequate or good constructs and undergraduate student
participants. Finally, the highest level, L3, contains medium- or large-scale trials with adequate
or good constructs and graduate student or professional participants.
Table 12-1 summarizes the attributes we used to classify trials into levels, and Table 12-2 shows
how many trials we examined at each level.
TABLE 12-1. Levels of clinical TDD trials
             L0      L1                L2                      L3
Experience   Any     Any               Undergraduate student   Graduate student or professional
Construct    Any     Poor or unknown   Adequate or good        Adequate or good
Scale        Small   Medium or large   Medium or large         Medium or large
TABLE 12-2. Types of clinical TDD trials
Type                    L0   L1   L2   L3   Total
Controlled experiment   2    0    2    4    8
Pilot study             2    0    5    7    14
Industrial use          1    7    0    2    10
Total                   5    7    7    13   32
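The level rules in Table 12-1 amount to a short decision procedure. The sketch below encodes them as a Python function; the function name, the string labels, and the example trials are hypothetical and are not taken from the reviewed reports.

```python
def trial_level(scale: str, construct: str, experience: str) -> str:
    """Classify a trial as L0-L3 following the rules in Table 12-1.

    scale:      "small", "medium", or "large"
    construct:  "good", "adequate", "poor", or "unknown"
    experience: "undergraduate", "graduate", or "professional"
    """
    if scale == "small":
        return "L0"  # any construct, any experience
    if construct in ("poor", "unknown"):
        return "L1"  # medium or large scale, weak or unreported construct
    if experience == "undergraduate":
        return "L2"  # medium or large scale, adequate or good construct
    return "L3"      # as L2, but graduate students or professionals


# Hypothetical trials, one per level.
examples = [
    ("small", "good", "professional"),
    ("large", "unknown", "professional"),
    ("medium", "adequate", "undergraduate"),
    ("medium", "good", "graduate"),
]
print([trial_level(*trial) for trial in examples])
# ['L0', 'L1', 'L2', 'L3']
```

Note that scale is checked first: a small trial is L0 even when it has a good construct and professional participants.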
The Effectiveness of TDD
We analyzed the TDD trials that reported quantitative results of the TDD pill’s effects on
productivity, internal and external quality, and test quality. Direct comparison of the
quantitative results across trials was impossible, since the trials measured TDD’s effectiveness
in different ways. Instead, we assign each trial a summary value of “better,” “worse,” “mixed,”
or “inconclusive/no-difference.” The summary value is determined by the quantitative results
reported for the TDD pill compared with a control. The summary value also incorporates the
report author’s interpretation of the trial results. In trials with a summary value of “better,” a
majority of quantitative measures favor the TDD pill in comparison to the control treatment.
In trials with a summary value of “worse,” a majority of measures favor the control treatment.
Trials with a summary value of “inconclusive/no-difference” were inconclusive or reported no
observed differences. Finally, in trials with a summary value of “mixed,” some measures favor
TDD while others don’t. In all cases, the summary assignment was guided by the report author’s
interpretation of the study findings because, in many cases, the reports omitted details of the
trial that would have enabled an objective external evaluation.
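The quantitative part of this rule can be sketched as a small function. This is only one possible reading: it ignores the report author's interpretation, which the actual assignment also took into account, and the function and parameter names are hypothetical.

```python
def summary_value(favor_tdd: int, favor_control: int, no_difference: int) -> str:
    """Assign a summary value from counts of a trial's quantitative measures.

    Assumption: "majority" means more than half of all reported measures.
    """
    total = favor_tdd + favor_control + no_difference
    if total == 0:
        return "inconclusive/no-difference"
    if favor_tdd > total / 2:
        return "better"
    if favor_control > total / 2:
        return "worse"
    if favor_tdd > 0 and favor_control > 0:
        return "mixed"  # measures point both ways, with no majority
    return "inconclusive/no-difference"


print(summary_value(3, 1, 0))  # better
print(summary_value(0, 2, 1))  # worse
print(summary_value(1, 1, 0))  # mixed
print(summary_value(0, 0, 2))  # inconclusive/no-difference
```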
In the following sections we do our best to draw some conclusions about the value of TDD
from the trials.
Internal Quality
Available evidence from the trials suggests that TDD does not have a consistent effect on
internal quality. Although TDD appears to yield better results over the control group for certain
types of metrics (complexity and reuse), other metrics (coupling and cohesion) are often worse
in the TDD treatment. Another observation from the trial data is that TDD yields production
code that is less complex at the method/class level, but more complex at the package/project
level. This inconsistent effect is more visible in more rigorous trials (i.e., L2 and L3 trials). The
differences in internal quality may be due to other factors, such as motivation, skill, experience,
and learning effects. Table 12-3 classifies the trials according to internal quality metrics.
In the following tables, the first number in each cell reports all trials, whereas the
number in parentheses reports only L2 and L3 trials.
TABLE 12-3. Effects on internal quality
Type                    Better   Worse   Mixed   Inconclusive   Total
Controlled experiment   1 (0)    0 (0)   0 (0)   2 (2)          3 (2)
Pilot study             1 (1)    1 (1)   3 (1)   2 (2)          7 (5)
Industrial use          3 (1)    1 (1)   0 (0)   0 (0)          4 (2)
Total                   5 (2)    2 (2)   3 (1)   4 (4)          14 (9)
External Quality
There is some evidence to suggest that TDD improves external quality. Although the outcomes
of controlled experiments are mostly inconclusive, industrial use and pilot studies strongly
favor TDD. However, the supporting evidence from industrial use and controlled experiments
disappears after filtering out the less rigorous studies (i.e., L0 and L1 trials). Furthermore, the
evidence from pilot studies and controlled experiments is contradictory once L0 and L1 trials
are filtered out. If all studies are counted equally, however, the evidence suggests that the TDD
pill can improve external quality. Table 12-4 classifies the trials according to external quality.
TABLE 12-4. Effects on external quality
Type                    Better   Worse   Mixed   Inconclusive   Total
Controlled experiment   1 (0)    2 (2)   0 (0)   3 (3)          6 (5)
Pilot study             6 (5)    1 (1)   0 (0)   2 (2)          9 (8)
Industrial use          6 (0)    0 (0)   0 (0)   1 (1)          7 (1)
Total                   13 (5)   3 (3)   0 (0)   6 (6)          22 (14)
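The contrast between all trials and the more rigorous subset can be read off the Total row of Table 12-4 with a little arithmetic. The sketch below assumes the columns are in the order better, worse, mixed, inconclusive, total, matching the four summary values defined earlier; the helper names are hypothetical.

```python
import re

CELL = re.compile(r"(\d+)\s*\((\d+)\)")


def parse_row(row: str) -> list[tuple[int, int]]:
    """Turn cells like '13 (5)' into (all_trials, l2_l3_trials) pairs."""
    return [(int(a), int(b)) for a, b in CELL.findall(row)]


# Total row of Table 12-4 (external quality), copied from the table.
total_row = "13 (5) 3 (3) 0 (0) 6 (6) 22 (14)"
better, worse, mixed, inconclusive, total = parse_row(total_row)

# Sanity check: the four outcome columns add up to the Total column.
outcomes = (better, worse, mixed, inconclusive)
assert sum(cell[0] for cell in outcomes) == total[0]
assert sum(cell[1] for cell in outcomes) == total[1]

share_all = better[0] / total[0]        # 13 / 22
share_rigorous = better[1] / total[1]   # 5 / 14
print(f"better, all trials:     {share_all:.0%}")       # 59%
print(f"better, L2 and L3 only: {share_rigorous:.0%}")  # 36%
```

Under that column reading, a majority of all trials favor TDD, but only about a third of the L2 and L3 trials do, which is the weakening of evidence described above.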
Productivity
The productivity dimension engenders the most controversial discussion of TDD. Although
many admit that adopting TDD may involve a steep learning curve that decreases
productivity initially, there is no consensus on the long-term effects. One line of argument
expects productivity to increase with TDD; reasons include easy context switching from one
simple task to another, improved external quality (i.e., there are few errors and errors can be
detected quickly), improved internal quality (i.e., fixing errors is easier due to simpler design),
and improved test quality (i.e., the chance of introducing new errors is low due to automated
tests). The opposite line argues that TDD incurs too much overhead and will negatively impact
productivity because too much time and focus may be spent on authoring tests as opposed to
adding new functionality. The different measures used in TDD trials for evaluating productivity
included development and maintenance effort, the amount of code or features produced over
time, and the amount of code or features produced per unit of development effort.
The available evidence from the trials suggests that TDD does not have a consistent effect on
productivity. The evidence from controlled experiments suggests an improvement in
productivity when TDD is used. However, the pilot studies provide mixed evidence, some in
favor of and others against TDD. In the industrial studies, the evidence suggests that TDD yields
worse productivity. Even when considering only the more rigorous studies (L2 and L3), the
evidence is equally split for and against a positive effect on productivity. Table 12-5 classifies
the trials according to effects on productivity.
TABLE 12-5. Effects on productivity
Type                    Better   Worse   Mixed   Inconclusive   Total
Controlled experiment   3 (1)    0 (0)   0 (0)   1 (1)          4 (2)
Pilot study             6 (5)    4 (4)   0 (0)   4 (3)          14 (12)
Industrial use          1 (0)    5 (1)   0 (0)   1 (0)          7 (1)
Total                   10 (6)   9 (5)   0 (0)   6 (4)          25 (15)
Test Quality
Because test cases precede all development activities with TDD, testing the correctness of an
evolving system is expected to be made easier by a growing library of automated tests. Further,
the testing process is expected to be of high quality due to the fine granularity of the tests
produced. In the trials, test quality is captured by test density, test coverage, test productivity,
or test effort.
There is some evidence to suggest that TDD improves test quality. Most of the evidence comes
from pilot studies and is in favor of TDD, even after filtering out less rigorous studies. Controlled
experiments suggest that TDD fares at least as well as the control treatments. There is
insufficient evidence from industrial use to reach a conclusion.
Therefore, the test quality associated with TDD seems at least not worse and often better than
alternative approaches. Here we would have expected stronger results: since encouraging test
case development is one of the primary active ingredients of TDD, the overall evidence should
have favored TDD in promoting the test quality measures reported in these studies.
Table 12-6 classifies the trials according to test quality.
TABLE 12-6. Effects on test quality
Type                    Better   Worse   Mixed   Inconclusive   Total
Controlled experiment   2 (1)    0 (0)   0 (0)   3 (3)          5 (4)
Pilot study             7 (5)    1 (1)   0 (0)   1 (1)          9 (7)
Industrial use          1 (0)    1 (1)   0 (0)   1 (0)          3 (1)
Total                   10 (6)   2 (2)   0 (0)   5 (4)          17 (12)
Enforcing Correct TDD Dosage in Trials
Although most of the trials did not measure or control the amount of the TDD pill taken (which
in software parlance translates into a lack of attention to process conformance), we believe
that the dosage ended up being variable across trials and subjects. Trials with poor or unknown
constructs may not have strictly enforced TDD usage, and we believe it is highly likely that the
trial participants customized the pill with a selection of ingredients rather than following the
strict textbook definition of TDD. This issue poses a serious threat to drawing generalized
conclusions. In the medical context, not enforcing or measuring TDD usage is analogous to
failing to ensure that the patients took a pill for some treatment or not knowing which dosage
the patient took. Thus, the observed effects of the TDD pill may be due to process conformance
or other factors that are not adequately described or controlled. In future trials, conformance
to the treatment and the control should be carefully monitored.
Regardless of the reporting quality of the TDD trials, a related question is raised: “Should the
textbook definition of TDD be followed in all real-life cases?” Sometimes patients get better
even with a half-sized or quarter-sized pill modified for their specific work context and personal
style. Micro-level logging tools for development activity are available and can be used to
investigate these issues. Such logging tools can be helpful both for controlling the conformance
to TDD processes and for understanding real-life, practical implementations of TDD.
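As a rough illustration of how such a log might be used to estimate dosage, the sketch below computes the share of development episodes that follow the test-first order. The event names, the definition of an episode (everything up to a passing test run), and the conformance rule are all assumptions made for this example; they do not describe any particular logging tool.

```python
from typing import Iterable

# Hypothetical event vocabulary for a development activity log.
# "edit_test", "edit_prod": a test file or production file was changed
# "run_fail", "run_pass":   the test suite was run and failed or passed


def split_episodes(events: Iterable[str]) -> list[list[str]]:
    """Cut the event stream into episodes, each ending at a passing run."""
    episodes, current = [], []
    for event in events:
        current.append(event)
        if event == "run_pass":
            episodes.append(current)
            current = []
    return episodes  # trailing events without a passing run are ignored


def is_test_first(episode: list[str]) -> bool:
    """True if a test was edited and seen to fail before production code changed."""
    if "edit_prod" not in episode:
        return False  # no production change; not counted as test-first here
    before = episode[: episode.index("edit_prod")]
    if "edit_test" not in before:
        return False
    return "run_fail" in before[before.index("edit_test"):]


def conformance(events: Iterable[str]) -> float:
    """Fraction of episodes that follow the test-first order."""
    episodes = split_episodes(events)
    if not episodes:
        return 0.0
    return sum(is_test_first(e) for e in episodes) / len(episodes)


LOG = [
    "edit_test", "run_fail", "edit_prod", "run_pass",  # test-first
    "edit_prod", "edit_test", "run_pass",              # test-last
    "edit_test", "run_fail", "edit_prod", "run_fail", "edit_prod", "run_pass",
    "edit_test", "edit_prod", "run_pass",              # test never seen to fail
]
print(conformance(LOG))  # 0.5
```

A stricter or looser rule (for example, one that tolerates refactoring-only episodes) would give a different number for the same log, which is one reason a trial report needs to state how conformance was measured.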
Cautions and Side Effects
In this section we pose several questions about the TDD pill that may temper TDD’s
effectiveness in different contexts.
Is it reactive to the environment?
There is no recommended best context for the use of TDD. We do not know whether it is
applicable to all domains, to all kinds of tasks within a domain, or to projects of all sizes
and complexities. For example, the trials do not make it clear whether TDD is an applicable
practice for developing embedded systems or for developing highly decentralized systems
where incremental testing may not be feasible. Furthermore, it is often considered a
challenge to use TDD for legacy systems that may require considerable refactoring of
existing code to become testable.
Is it for everyone?
One basic fact on which almost everyone agrees is that TDD is difficult to learn. It involves
a steep learning curve that requires skill, maturity, and time, particularly when developers
are entrenched in the code-then-test paradigm. Better tool support for test-case generation
and early exposure in the classroom to a test-then-code mentality may encourage TDD adoption.
Could it be addictive?
Personal communications with TDD developers suggest that it is an addictive practice. It
changes the way people think and their approach to coding in a way that is difficult to roll
back. Therefore, leaving TDD practices may be as difficult as adopting them.
Does it interact with other medications?
No studies focus specifically on whether TDD performs better or worse when used with
other medications. In one trial, it is suggested that, when coupled with up-front design,
TDD results in a 40% improvement in external quality [Williams et al. 2003]. Another
trial compares solo and pair developers who practice TDD and incremental test-last
development [Madeyski 2005]. That trial reports no difference in the external quality of
software produced by solo or pair programmers using TDD. It is not known which practices
go well or poorly with TDD. Although there may be practices that stimulate its desired
effects, there also may be some that inhibit them. The examples just mentioned are
probably case-specific, but they point out the need to investigate further TDD’s interaction
with other medications.
The effects of TDD still involve many unknowns. Indeed, the evidence is not undisputedly
consistent regarding TDD’s effects on any of the measures we applied: internal and external
quality, productivity, or test quality. Much of the inconsistency likely can be attributed to
internal factors not fully described in the TDD trials. Thus, TDD is bound to remain a
controversial topic of debate and research.
For practitioners looking for some actionable advice, our expert panel recommends taking the
TDD pill, carefully monitoring its interactions and side effects, and increasing or decreasing the
dosage accordingly. So we end with some specific prescriptions from individual members of
our team, after reviewing the data:
We’ve been able to compile the evidence, but each reader has to make up his or her own mind.
First, decide which qualities matter most to you. For example, do you care more about
productivity or external quality? Can you justify spending more effort to create higher-quality
tests? The evidence in this chapter is useful only for making decisions based on each reader’s
specific goals.
I have taken the TDD pill and become hooked. My personal experience has been that TDD
improves productivity, although evidence from our study is lacking in this regard. Perhaps mine
was simply a perception. Based on these results, especially based on the evidence regarding its
conservatively positive impact on external quality, if I weren’t already using TDD, I’d start having
my team take it in small doses and see whether they find a long-term productivity improvement
of their own. If there are no adverse reactions, I’d increase the dosage gradually and keep
Although TDD is promising, its adoption can be impeded by uncertainties about its effectiveness
and by high up-front adoption cost. Still, its ingredients seem to encourage good programming
and development habits to flourish, yielding better-quality programmers and tests in the long run.
TDD seems promising, but let’s face it, it tastes bad when you first start. A lot of people like the
old stuff better. After all, it’s hard to feel productive when you spend a large amount of your
time writing test cases that fail. On the other hand, I’ve never written cleaner code in my life,
and it feels great to make a change to that old code, hit the “Run Tests” button, and be confident
that I didn’t break anything.
The evidence packed into this chapter shows that TDD might be a cure for you, yet you should
not try to use it as a panacea. Your TDD adventure is likely to vary with certain factors, including
your experience and the context you are working in. As a practitioner, developing an insight
about when to expect improvements from TDD would be a valuable asset.
Dr. Janice Singer was one of the researchers who participated in the screening of the studies
in the initial stages of the systematic review. We gratefully acknowledge her contributions to
this work.
General References
[Beck 2002] Beck, Kent. 2002. Test-Driven Development: By Example. Boston: Addison-Wesley.
[Chidamber et al. 1994] Chidamber, S.R., and C.F. Kemerer. 1994. A Metrics Suite for Object
Oriented Design. IEEE Transactions on Software Engineering 20(6): 476-493.
[Dybå et al. 2005] Dybå, Tore, Barbara Kitchenham, and Magne Jørgensen. 2005. Evidence-
Based Software Engineering for Practitioners. IEEE Software 22(1): 58-65.
[Kitchenham 2004] Kitchenham, Barbara. 2004. Procedures for Performing Systematic
Reviews. Keele University Technical Report TR/SE0401.
Clinical TDD Trial References
[Canfora et al. 2006] Canfora, Gerardo, Aniello Cimitile, Felix Garcia, Mario Piattini, and
Corrado Aaron Visaggio. 2006. Evaluating advantages of test-driven development: a
controlled experiment with professionals. Proceedings of the ACM/IEEE international symposium
on Empirical software engineering: 364-371.
[Erdogmus et al. 2005] Erdogmus, Hakan, Maurizio Morisio, and Marco Torchiano. 2005. On
the Effectiveness of the Test-First Approach to Programming. IEEE Transactions on Software
[Flohr et al. 2006] Flohr, Thomas, and Thorsten Schneider. 2006. Lessons Learned from an XP
Experiment with Students: Test-First Needs More Teachings. In Product-Focused Software
Process Improvement: 7th International Conference, PROFES 2006, Proceedings, ed. J. Münch and
M. Vierimaa, 305–318. Berlin: Springer-Verlag.
[George 2002] George, Boby. 2002. Analysis and Quantification of Test-Driven Development
Approach. MS thesis, North Carolina State University.
[Geras 2004] Geras, Adam. 2004. The effectiveness of test-driven development. MSc thesis, University
of Calgary.
[Geras et al. 2004] Geras, A., M. Smith, and J. Miller. 2004. A Prototype Empirical Evaluation
of Test-Driven Development. Proceedings of the 10th International Symposium on Software
Metrics: 405-416.
[Gupta et al. 2007] Gupta, Atul, and Pankaj Jalote. 2007. An Experimental Evaluation of the
Effectiveness and Efficiency of the Test-Driven Development. Proceedings of the First
International Symposium on Empirical Software Engineering and Measurement: 285-294.
[Huang et al. 2009] Huang, Liang, and Mike Holcombe. 2009. Empirical investigation towards
the effectiveness of Test First programming. Information & Software Technology 51(1):
[Janzen 2006] Janzen, David Scott. 2006. An Empirical Evaluation of the Impact of Test-Driven
Development on Software Quality. PhD thesis, University of Kansas.
[Kaufmann et al. 2003] Kaufmann, Reid, and David Janzen. 2003. Implications of test-driven
development: a pilot study. Companion of the 18th annual ACM SIGPLAN conference on Object-
oriented programming, systems, languages, and applications: 298-299.
[Madeyski 2005] Madeyski, Lech. 2005. Preliminary Analysis of the Effects of Pair
Programming and Test-Driven Development on the External Code Quality. Proceedings of the
2005 conference on Software Engineering: Evolution and Emerging Technologies: 113-123.
[Madeyski 2006] Madeyski, Lech. 2006. The Impact of Pair Programming and Test-Driven
Development on Package Dependencies in Object-Oriented Design—An Experiment. In
Product-Focused Software Process Improvement: 7th International Conference, PROFES 2006,
Proceedings, ed. J. Münch and M. Vierimaa, 278–289. Berlin: Springer-Verlag.
[Madeyski et al. 2007] Madeyski, Lech, and Lukasz Szala. 2007. The Impact of Test-Driven
Development on Software Development Productivity — An Empirical Study. Software Process
Improvement, 4th European Conference, EuroSPI 2007, Proceedings, ed. P. Abrahamsson, N.
Baddoo, T. Margaria, and R. Messnarz, 200–211. Berlin: Springer-Verlag.
[Muller et al. 2002] Muller, M. M., and O. Hagner. 2002. Experiment about test-first
programming. Software, IEEE Proceedings 149(5): 131-136.
[Nagappan et al. 2008] Nagappan, Nachiappan, E. Michael Maximilien, Thirumalesh Bhat, and
Laurie Williams. 2008. Realizing quality improvement through test-driven development:
results and experiences of four industrial teams. Empirical Software Engineering 13(3):
[Pancur et al. 2003] Pancur, M., M. Ciglaric, M. Trampus, and T. Vidmar. 2003. Towards
empirical evaluation of test-driven development in a university environment. The IEEE
Region 8 EUROCON Computer as a Tool (2): 83-86.
[Siniaalto et al. 2008] Siniaalto, Maria, and Pekka Abrahamsson. 2008. Does Test-Driven
Development Improve the Program Code? Alarming Results from a Comparative Case
Study. In Balancing Agility and Formalism in Software Engineering, ed. B Meyer, J. Nawrocki,
and B. Walter, 143–156. Berlin: Springer-Verlag.
[Slyngstad et al. 2008] Slyngstad, Odd Petter N., Jingyue Li, Reidar Conradi, Harald Ronneberg,
Einar Landre, and Harald Wesenberg. 2008. The Impact of Test Driven Development on the
Evolution of a Reusable Framework of Components—An Industrial Case Study. Proceedings
of the Third International Conference on Software Engineering Advances: 214-223.
[Vu et al. 2009] Vu, John, Niklas Frojd, Clay Shenkel-Therolf, and David Janzen. 2009.
Evaluating Test-Driven Development in an Industry-sponsored Capstone Project.
Proceedings of the 2009 Sixth International Conference on Information Technology: New
Generations: 229-234.
[Williams et al. 2003] Williams, Laurie, E. Michael Maximilien, and Mladen Vouk. 2003. Test-
Driven Development as a Defect-Reduction Practice. Proceedings of the 14th International
Symposium on Software Reliability Engineering: 34.
[Yenduri et al. 2006] Yenduri, Sumanth, and Louise A. Perkins. 2006. Impact of Using Test-
Driven Development: A Case Study. Proceedings of the International Conference on Software
Engineering Research and Practice & Conference on Programming Languages and Compilers, SERP
2006: 126-129.
[Zhang et al. 2006] Zhang, Lei, Shunsuke Akifuji, Katsumi Kawai, and Tsuyoshi Morioka.
2006. Comparison Between Test-Driven Development and Waterfall Development in a
Small-Scale Project. Extreme Programming and Agile Processes in Software Engineering, 7th
International Conference, XP 2006, Proceedings, ed. P. Abrahamsson, M. Marchesi, and G. Succi,
211–212. Berlin: Springer-Verlag.
... Scholar -It is a great idea. The scientific consensus says that TDD increases external quality [2,6,8,12,16]. ...
... S -TDD does not look like a practice on fashion anymore. The problem could be that TDD drops productivity [2,12,16]. ...
... S -Not sure. Probably the professionals' motivation has impacted experiment results [16]. Learning ability and skills [16], or test case design knowledge [3], could influence productivity too. ...
Conference Paper
Full-text available
Software Engineering (SE) experiments typically have small, hard to acquire sample sizes. Recruiting subjects is an issue for SE progress. However, we argue that characterization, i.e., understanding the population traits, is the key problem. We propose a roadmap to address the problem. CCS CONCEPTS • General and reference → Empirical studies.
... A number of primary studies, like experiments or case studies, have been conducted on TDD (e.g., [8,9,12,13,14]). Their results, gathered and combined in a number of secondary studies (e.g., [6,15,16,17,18,19]), do not fully support the claimed benefits of TDD. Therefore, some researchers have recommended taking a longitudinal perspective when investigating such a development approach (e.g., [16,18,20,21])-i.e., studying TDD over a time span. ...
... The effects of TDD on several outcomes, including the ones of interest for this study-i.e., functional quality and productivity-is the topic of several empirical studies, summarized in Systematic Literature Reviews (SLRs) and meta-analyses (e.g., [15,17,18,19]). The SLR by Turhan et al. [17] includes 32 primary studies (e.g., controlled experiments and case studies) published from 2000 to 2009. The gathered evidence shows a moderate effect in favor of TDD on functional quality while the evidence about productivity is inconclusive. ...
... The effects of TDD on several outcomes, including the ones of interest for this study-i.e., functional quality and productivity-is the topic of several empirical studies, summarized in Systematic Literature Reviews (SLRs) and meta-analyses (e.g., [15,17,18,19]). The SLR by Turhan et al. [17] includes 32 primary studies (e.g., controlled experiments and case studies) published from 2000 to 2009. The gathered evidence shows a moderate effect in favor of TDD on functional quality while the evidence about productivity is inconclusive. ...
In this paper, we investigate the effect of TDD, as compared to a non-TDD approach, as well as its retainment (or retention) over a time span of (about) six months. To pursue these objectives, we conducted a (quantitative) longitudinal cohort study with 30 novice developers (i.e., third-year undergraduate students in Computer Science). We observed that TDD affects neither the external quality of software products nor developers' productivity. However, we observed that the participants applying TDD produced significantly more tests, with a higher fault-detection capability than those using a non-TDD approach. As for the retainment of TDD, we found that TDD is retained by novice developers for at least six months.
... Kent Beck, one of the founders of XP, is credited with popularizing TDD in his book "Test-Driven Development by Example" [1]. TDD gained wider acceptance in the software development community in the early 2000s and has since become an essential practice for many development teams [3] with reduced defects, increased productivity, and improvement in the external quality of the software. According to the State of Agile report (2018), 33% of teams follow this approach in their work [4]. ...
... TDD can be integrated into agile development processes by educating the team, starting with small and low-risk features or modules, creating test cases before coding, automating testing, refactoring code, and fostering collaboration and communication between team members. By following these steps, teams can gradually incorporate TDD practices and benefit from improved code quality, reduced defects, and increased productivity [3]. TDD can also be integrated into existing development processes and methodologies by aligning it with agile methodologies like XP, applying it as a micro process within process models such as Scrum or DevOps, incrementally adopting TDD practices, promoting collaboration and communication among team members, leveraging TDD to support refactoring and code improvement, and utilizing TDD-friendly tools and frameworks. Integration requires a mindset shift, education, and gradual implementation [20]. ...
Test-Driven Development (TDD) is a methodology in software development that necessitates tests to be written before the production code. This approach can be used in any software development paradigm that involves writing code, including Agile, Scrum, XP, and Lean. This research paper surveys the impact of TDD on software development with a specific focus on its effects on code coverage, productivity, internal and external software quality, and the affective reactions associated with TDD. The paper also identifies potential challenges and drawbacks of implementing TDD, such as increased overhead and time consumption, a learning curve for developers, and difficulty in testing certain types of code. The studies' results suggest that TDD can improve code coverage and code quality, reduce defects, increase productivity and developer satisfaction, improve internal and external software quality, and ultimately lead to higher customer satisfaction.
... The code is refactored for a higher quality before ending an iteration. TDD has been applied at all levels from acceptance testing to unit testing [23,24]. TDD has demonstrated successes among early adopters in IoT development (e.g., [4]). ...
... For example, the behaviors of transferring money will be different, respectively, depending on the source account having a sufficient or an insufficient balance before a transfer. TDD has been applied at all levels from acceptance testing to unit testing [22,23]. TDD has demonstrated successes among early adopters in IoT development (e.g., [4]). ...
The Internet of Things (IoT) systems are inherently distributed with many concurrent behaviors. In order to apply behavior-driven development (BDD), a proven agile practice of software development that brings many benefits, we must ensure that the specification of sequential and concurrent behaviors is supported at the specification level and that tool support is in place to execute the specification. This study proposes a minimal semantic enhancement to the Gherkin language, the most popular specification language in BDD, to distinguish sequential and concurrent behaviors. At the same time, a tool called concurrentSpec is developed to support the correct execution of specifications written in the enhanced Gherkin language. With two IoT examples involving both sequential and concurrent behaviors, it is shown that the enhanced Gherkin with concurrentSpec can correctly specify and execute the specifications, while the original Gherkin with existing tools is unable to do so. Hence, the contribution of this study is to eliminate a technical impediment for the IoT development community to adopt BDD and receive its benefits.
... Broader reviews of current research on the topic of TDD effectiveness may also be found in a book by Madeyski [6], which in general focuses on the analysis of impact of programming practices like test-first programming, test-last programming, pair programming, solo programming and other closely related approaches. Also, further discussion can be found in systematic reviews such as [31] or [32]. Munir et al. [3] present a systematic review which additionally classifies presented studies according to rigor and relevance. ...
... 2.1). Also, with about 3040 person-hours, this experiment is classified as large using the scale proposed in [31]. ...
Context: There is still little evidence on differences between Test-Driven Development and Test-Last Development, especially for real-world projects, so their impact on code/test quality is an ongoing research trend. An empirical comparison is presented, with 19 participants working on an industrial project developed for an energy market software company, implementing real-world requirements for one of the company's customers. Objective: Examine the impact of TDD and TLD on quality of the code and the tests. The aim is to evaluate if there is a significant difference in external code quality and test quality between these techniques. Method: The experiment is based on a randomized within-subjects block design, with participants working for three months on the same requirements using different techniques, changed from week to week, within three different competence blocks: Intermediate, Novice and Mixed. The resulting code was verified for process conformance. The participants developed only business logic and were separated from infrastructural concerns. A separate group of code repositories was used to work without unit tests, to verify that the requirements were not too easy for the participants. Also, it was analysed if there is any difference between the code created by shared efforts of developers with different competences and the code created by participants isolated in the competence blocks. The resulting implementations had LOC order of magnitude of 10k. Results: Statistically significant advantage of TDD in terms of external code quality (1.8 fewer bugs) and test quality (5 percentage points higher) than TLD. Additionally, TDD narrows the gap in code coverage between developers from different competence blocks. At the same time, TDD proved to have a considerable entry barrier and was hard to follow strictly, especially by Novices. Still, no significant difference w.r.t. code coverage has been observed between the Intermediate and the Novice developers - as opposed to TLD, which was easier to follow. Lastly, isolating the Intermediate developers from the Novices had significant impact on the code quality. Conclusion: TDD is a recommended technique for software projects with a long horizon or when it is critical to minimize the number of bugs and achieve high code coverage.
The research on the claimed effects of Test-Driven Development (TDD) on software quality and developers’ productivity has shown inconclusive results. Some researchers have ascribed such results to the negative affective reactions that TDD would provoke when developers apply it. In this paper, we studied whether and in which phases TDD influences the affective states of developers, who are new to this development approach. To that end, we conducted a baseline experiment and two replications, and analyzed the data from these experiments both individually and jointly. Also, we performed methodological triangulation by means of an explanatory survey, whose respondents were experienced with TDD. The results of the baseline experiment suggested that developers like TDD significantly less, compared to a non-TDD approach. Also, developers who apply TDD like implementing production code significantly less than those who apply a non-TDD approach, while testing production code makes TDD developers significantly less happy. These results were not confirmed in the replicated experiments. We found that the moderator that better explains these differences across experiments is experience (in months) with unit testing, practiced in a test-last manner. The higher the experience with unit testing, the more negative the affective reactions caused by TDD. The results from the survey seem to confirm the role of this moderator.
HydroShare is an online collaborative system under development to support the open sharing of hydrologic data, analytical tools, and computer models. With HydroShare, scientists can easily discover, access, and analyze hydrologic data and thereby enhance the production and reproducibility of hydrologic scientific results. HydroShare also takes advantage of emerging social media functionality to enable users to enhance information about and collaboration around hydrologic data and models. HydroShare is being developed by an interdisciplinary collaborative team of domain scientists, university software developers, and professional software engineers from ten institutions located across the United States. While the combination of non-co-located, diverse stakeholders presents communication and management challenges, the interdisciplinary nature of the team is integral to the project’s goal of improving scientific software development and capabilities in academia. This chapter describes the challenges faced and lessons learned with the development of HydroShare, as well as the approach to software development that the HydroShare team adopted on the basis of the lessons learned. The chapter closes with recommendations for the application of modern software engineering techniques to large, collaborative, scientific software development projects, similar to the National Science Foundation (NSF)–funded HydroShare, in order to promote the successful application of the approach described herein by other teams for other projects.
Conference Paper
In this paper, we conduct an experimental study over two groups of undergraduate students (seniors) who develop software using the conventional way of performing unit testing after development and also by extracting test cases before implementation as in Agile Programming. Both groups developed the same software using an incremental and iterative approach. The results showed that the software had fewer faults when developed using Agile Programming. Also, the quality of the software was better and the productivity increased.
While software metrics are a generally desirable feature in the software management functions of project planning and project evaluation, they are of especial importance with a new technology such as the object-oriented approach. This is due to the significant need to train software engineers in generally accepted object-oriented principles. This paper presents theoretical work that builds a suite of metrics for object-oriented design. In particular, these metrics are based upon measurement theory and are informed by the insights of experienced object-oriented software developers. The proposed metrics are formally evaluated against a widely accepted list of software metric evaluation criteria.
Test-driven development (TDD) has gained recent attention with the popularity of the Extreme Programming agile software development methodology. Advocates of TDD rely primarily on anecdotal evidence with relatively little empirical evidence of the benefits of the practice. This research is the first comprehensive evaluation of how TDD affects software architecture and internal design quality. Formal controlled experiments were conducted in undergraduate and graduate academic courses, in a professional training course, and with in-house professional development projects in a Fortune 500 company. The experiments involved over 230 student and professional programmers working on almost five hundred software projects ranging in size from one hundred to over 30,000 lines of code. The research also included a case study of fifteen software projects developed over five years in a Fortune 500 corporation. This research demonstrates that software developers applying a test-first (TDD) approach are likely to improve some software quality aspects at minimal cost over a comparable test-last approach. In particular this research has shown statistically significant differences in the areas of code complexity, size, and testing. These internal quality differences can substantially improve external software quality (defects), software maintainability, software understandability, and software reusability. Further this research has shown that mature programmers who have used both the test-first and test-last development approaches prefer the test-first approach. In addition, this research contributes a pedagogical approach called test-driven learning (TDL) that integrates TDD instruction at all levels. TDL was partially applied at all academic levels from early programming instruction through professional continuing education. Results indicate some differences between beginning and mature developers including reluctance by early programmers to adopt the TDD approach. 
By providing the first substantial empirical evidence on TDD and internal software quality, this research establishes a benchmark and framework for future empirical studies. By focusing on both software design and software testing, this research is already raising awareness of TDD as both a design and testing approach through publications and international awards.
Conference Paper
Test-Driven Development (TDD) is an agile development process wherein automated tests are created before production code is designed or constructed in short, rapid iterations. This paper discusses an experiment conducted with undergraduate students in a year-long software engineering capstone course. In this course the students designed, implemented, deployed, and maintained a software system to meet the requirements of an industry sponsor who served as the customer. The course followed an incremental process in which features were added incrementally under the direction of the industry sponsor and the professor. The fourteen students observed in the study were divided into three teams. Among the three teams were two experimental groups. One group consisted of two teams that applied a Test-First (TDD) methodology, while a control group applied a traditional Test-Last methodology. Unlike Test-First, the tests in Test-Last are written after the design and construction of the production code being tested. Results from this experiment differ from many previous studies. In particular, the Test-Last team was actually more productive and wrote more tests than their Test-First counterparts. Anecdotal evidence suggests that factors other than development approach, such as individual ambition and team motivation, may have more effect than the development approach applied. Although more students indicated a preference for the Test-First approach, concerns regarding learning and applying TDD with unfamiliar technologies are noted.
Conference Paper
To popularize the Test-Driven Development (TDD) practice in Chinese offshore companies, an experiment was first conducted to compare TDD with traditional waterfall development in a small-scale project. Although the project was small and all the subjects were students, the experiment was designed strictly to support a reliable evaluation of the efficacy of TDD. It is also the first experimental evaluation of the maintainability and flexibility of TDD.
Conference Paper
A Spring 2003 experiment examines the claims that test-driven development or test-first programming improves software quality and programmer confidence. The results indicate support for these claims and inform larger future experiments.