Test-driven development (TDD) involves more than just testing before coding. This article examines how (and whether) TDD has lived up to its promises.
What Do We (Really)
Know about Test-Driven Development?
Itir Karac and Burak Turhan
Test-driven development (TDD) is one of the most
controversial agile practices in terms of its
impact on software quality and pro-
grammer productivity. After more
than a decade’s research, the jury is
still out on its effectiveness. TDD
promised all: increased quality and
productivity, along with an emerg-
ing, clean design supported by the
safety net of a growing library
of tests. What’s more, the recipe
sounded surprisingly simple: Don’t
write code without a failing test.
Here, we revisit the evidence of
the promises of TDD.1 But, before
we go on, just pause and think of an
answer to the following core ques-
tion: What is TDD?
Let us guess: your response is
most likely along the lines of, “TDD
is a practice in which you write
tests before code.” This emphasis
on its test-rst dynamic, strongly
implied by the name, is perhaps the
root of most, if not all, of the con-
troversy about TDD. Unfortunately,
it’s a common misconception to use
“TDD” and “test-first” interchangeably.
Test-first is only one part of
TDD. There are many other cogs
in the system that potentially make
TDD tick.
How about working on small
tasks, keeping the red–green–refactor
cycles short and steady, writing only
the code necessary to pass a fail-
ing test, and refactoring? What if
we told you that some of these cogs
contribute more toward fulfilling
the promises of TDD than the order
of test implementation? (Hint: you
should ask for evidence.)
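To make these cogs concrete, here is a minimal sketch of one red–green–refactor cycle on a deliberately small task. The task (a leap-year check) and the names `is_leap_year`, `LeapYearTest`, and `run_cycle` are our own illustrative assumptions; they don't come from any of the studies discussed here.

```python
import unittest


# Step 1 (red): the tests are written first. Before is_leap_year existed,
# running them failed with a NameError.
class LeapYearTest(unittest.TestCase):
    def test_year_divisible_by_4_is_leap(self):
        self.assertTrue(is_leap_year(2024))

    def test_century_not_divisible_by_400_is_not_leap(self):
        self.assertFalse(is_leap_year(1900))

    def test_century_divisible_by_400_is_leap(self):
        self.assertTrue(is_leap_year(2000))


# Step 2 (green): write only as much code as the failing tests demand.
# Step 3 (refactor): the separate conditions were then folded into a single
# expression while the tests stayed green.
def is_leap_year(year: int) -> bool:
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def run_cycle() -> bool:
    """Run the suite once and report whether the bar is green."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LeapYearTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Calling `run_cycle()` after each small change gives the short feedback loop described above; the order in which the tests were written is only one ingredient of that loop.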
15 Years of (Contradictory) Evidence
Back in 2003, when the software
development paradigm started to
change irrevocably (for the bet-
ter?), Kent Beck posed a claim based
on anecdotal evidence and paved
the way for software engineering research on the practice:
“No studies have categorically
demonstrated the difference between
TDD and any of the many
alternatives in quality, productivity,
or fun. However, the anecdotal
evidence is overwhelming, and the
secondary effects are unmistakable.”2
Since then, numerous studies—
for example, experiments and case
studies—have investigated TDD’s
effectiveness. These studies are pe-
riodically synthesized in secondary
studies (see Table 1), only to reveal
contradictory results across the pri-
mary studies. This research has also
demonstrated no consistent overall
benet from TDD, particularly for
overall productivity and within sub-
groups for quality.
Why the inconsistent results? Be-
sides the reasons listed in Table 1,
other likely reasons are that
• TDD has too many cogs,
• its effectiveness is highly influenced
by the context (for example,
the tasks at hand or skills
of individuals),
• the cogs highly interact with
each other, and
• most studies have focused on
only the test-rst aspect.
Identifying the inconsistencies’
sources is important for designing
further studies that control for those sources.

Table 1. Systematic literature reviews on test-driven development (TDD).

Study | Overall conclusion for quality with TDD | Overall conclusion for productivity with TDD | Inconsistent results in the study
Bissi et al.3 | Improvement | Inconclusive | Productivity: Academic vs. industrial setting
Munir et al.4 | Improvement or no difference | Degradation or no difference | Quality: Low vs. high rigor; Low vs. high relevance; Low vs. high rigor; Low vs. high relevance
Rafique and Mišić5 | Improvement | Inconclusive | Quality: Waterfall vs. iterative test-last; Waterfall vs. iterative test-last; Academic vs. industrial
Turhan et al.6 and Shull et al.1 | Improvement | Inconclusive | Quality: Among controlled experiments; Among studies with high rigor; Among pilot studies; Controlled experiments vs. industrial case studies; Among studies with high rigor
Kollanus7 | Improvement | Degradation | Quality: Among academic studies; Among semi-industrial studies
Siniaalto8 | Improvement | Inconclusive | Productivity: Among academic studies; Among semi-industrial studies

Matjaž Pančur and Mojca
Ciglarič speculated that the results of
studies showing TDD’s superiority
over a test-last approach were due to
the fact that most of the experiments
employed a coarse-grained test-last
process closer to the waterfall
approach as a control group.9 This
created a large differential in granu-
larity between the treatments, and
sometimes even a complete lack of
tests in the control, resulting in un-
fair, misleading comparisons. In the
end, TDD might perform better only
when compared to a coarse-grained
development process.
Industry Adoption
(or Lack Thereof)
Discussions on TDD are common
and usually heated. But how com-
mon is the use of TDD in practice?
Not very—at least, that’s what the
evidence suggests.
For example, after monitoring
the development activity of 416 de-
velopers over more than 24,000
hours, researchers reported that the
developers followed TDD in only
12 percent of the projects that
claimed to use it.10 We’ve observed
similar patterns in our work with
professional developers. Indeed, if it
were possible to reanalyze all exist-
ing evidence considering this facet
only, the shape of things might
change signicantly (for better or
worse). We’ll be the devil’s advocate
and ask, what if the anecdotal evi-
dence from TDD enthusiasts is based
on misconceived personal experience
from non-TDD activities?
Similarly, a recent study analyzed
a September 2015 snapshot of all the
(Java) projects in GitHub.11 Using
heuristics for identifying TDD-like
repositories, the researchers found
that only 0.8 percent of the projects
adhered to TDD protocol. Further-
more, comparing those projects to
a control set, the study reported no
difference between the two groups in
terms of
• the commit velocity as a measure
of productivity,
• the number of bug-xing com-
mits as an indicator of the num-
ber of defects, and
• the number of issues reported
for the project as a predictor of quality.
Additionally, a comparison of the
number of pull requests and the dis-
tribution of commits per author
didn’t indicate any effect on devel-
oper collaboration.
Adnan Causevic and his colleagues
identified seven factors limiting
TDD’s use in the industry:12
• increased development time
(productivity hits),
• insufcient TDD experience or
insufcient design,
• insufcient developer testing
• insufcient adherence to TDD
• domain- and tool-specic limita-
tions, and
• legacy code.
It’s not surprising that three of these
factors are related to the developers’
capacity to follow TDD and their
rigor in following it.
What Really Makes TDD Tick?
A more rened look into TDD is
concerned with not only the order
in which production code and test
code are written but also the average
duration of development cycles, that
duration’s uniformity, and the refac-
toring effort. A recent study of 39
professionals reported that a steady
rhythm of short development cycles
was the primary reason for improved
quality and productivity.13 Indeed,
the effect of test-rst completely di-
minished when the effects of short
and steady cycles were considered.
These ndings are consistent with
earlier research demonstrating that
TDD experts had much shorter and
less variable cycle lengths than novices
did.14 The significance of short
development cycles extends beyond
TDD; Alistair Cockburn, in explain-
ing the Elephant Carpaccio concept,
states that “agile developers apply
micro-, even nano-incremental de-
velopment in their work.”15
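One way to make “short and steady” measurable is sketched below: the average cycle duration captures shortness, and the coefficient of variation (standard deviation divided by the mean) captures steadiness. This is our own illustrative choice of metrics, not necessarily the measures used in the cited studies. The function names and the assumption that timestamps are minutes marking cycle boundaries (for example, successive green test runs) are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def cycle_durations(timestamps: Sequence[float]) -> List[float]:
    """Durations between consecutive cycle boundaries, in the same unit."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def rhythm(timestamps: Sequence[float]) -> Tuple[float, float]:
    """Return (average cycle duration, coefficient of variation).

    A low average means short cycles; a low coefficient of variation
    means the cycles are uniform in length.
    """
    durations = cycle_durations(timestamps)
    if not durations:
        raise ValueError("need at least two timestamps")
    average = mean(durations)
    if average <= 0:
        raise ValueError("timestamps must be increasing")
    return average, pstdev(durations) / average
```

For boundaries at minutes 0, 5, 10, 15, and 20, `rhythm` reports an average of 5 with zero variation. For boundaries at 0, 2, 20, 22, and 40, the average is 10 and the variation is 0.8, a much less steady rhythm.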
Another claim of Elephant Car-
paccio, related to the TDD concept
of working on small tasks, is that
agile developers can deliver fast
“not because we’re so fast we can
[develop] 100 times as fast as other
people, but rather, we have trained
ourselves to ask for end-user-visible
functionality 100 times smaller than
most other people.”15 To test this,
we conducted experiments in which
we controlled for the framing of task
descriptions (ner-grained user sto-
ries versus coarser-grained generic
descriptions). We observed that the
type of task description and the task
itself are signicant factors affect-
ing software quality in the context
of TDD.
In short, working on small,
well-dened tasks in short, steady
development cycles has a more
positive impact on quality and
productivity than the order of test implementation.
Deviations from the
Test-First Mantra
Even if we consider the studies that
focus on only the test-first nature
of TDD, there’s still the problem of
conformance to the TDD process.
TDD isn’t a dichotomy in which
you either religiously write tests
first every time or always test after
the fact. TDD is a continuous spec-
trum between these extremes, and
developers tend to dynamically span
this spectrum, adjusting the TDD
process as needed. In industrial settings,
time pressure, lack of discipline,
and insufficient realization of
TDD’s benefits have been reported
to cause developers to deviate from
the process.12
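If conformance is a spectrum rather than a yes-or-no property, it can be expressed as a number between 0 and 1. The sketch below is a hypothetical measure of our own, not one taken from the cited studies: it assumes each development cycle has already been labeled as test-first or not, and it weights cycles by their duration. The names `Cycle` and `share_test_first` are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Cycle:
    test_first: bool  # True if the test was written before the production code
    minutes: float    # how long the cycle took


def share_test_first(cycles: Sequence[Cycle]) -> float:
    """Share of development time spent in test-first cycles (0.0 to 1.0)."""
    total = sum(cycle.minutes for cycle in cycles)
    if total <= 0:
        raise ValueError("need at least one cycle with a positive duration")
    in_test_first = sum(cycle.minutes for cycle in cycles if cycle.test_first)
    return in_test_first / total
```

A session with two 5-minute test-first cycles and one 10-minute test-last cycle scores 0.5: neither extreme of the spectrum, which is where the observations above suggest most developers end up.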
To gain more insight, in an ethno-
graphically informed study, research-
ers monitored and documented the
TDD development process more
closely by means of artifacts includ-
ing audio recordings and notes.16
They concluded that developers per-
ceived implementation as the most
important phase and didn’t strictly
follow the TDD process. In par-
ticular, developers wrote more pro-
duction code than necessary, often
omitted refactoring, and didn’t keep
test cases up to date in accordance
with the progression of the production
code. Even when the developers
followed the test-first principle,
they thought about how the produc-
tion code (not necessarily the design)
should be before they wrote the test
for the next feature. In other words,
perhaps we should simply name this
phenomenon “code-driven testing”?
TDD’s internal and external
dynamics are more complex
than the order in which tests
are written. There’s no convincing
evidence that TDD consistently fares
better than any other development
method, at least those methods that
are iterative. And enough evidence exists
to question whether TDD fulfills
its promises.
How do you decide whether and
when to use TDD, then? And what
about TDD’s secondary effects?
As always, context is the key, and
any potential benet of TDD is likely
not due to whatever order of writing
tests and code developers follow. It
makes sense to have realistic expecta-
tions rather than worship or discard
TDD. Focus on the rhythm of devel-
opment; for example, tackle small
tasks in short, steady development
cycles, rather than bother with the
test order. Also, keep in mind that
some tasks are better (suited) than
others with respect to “TDD-bility.”
This doesn’t mean you should
avoid trying TDD or stop using it.
For example, if you think that TDD
offers you the self-discipline to write
tests for each small functionality,
following the test-rst principle will
certainly prevent you from taking
shortcuts that skip tests. In this case,
there’s value in Beck’s suggestion,
“Never write a line of functional code
without a broken test case.”2 How-
ever, you should primarily consider
those tests’ quality (without obsessing
over coverage),17 instead of fixating
on whether you wrote them before
the code. Although TDD does result
in more tests,1,6 the lack of attention
to testing quality,12 including main-
tainability and coevolution with pro-
duction code,16 could be alarming.
As long as you’re aware of and
comfortable with the potential trad-
eoff between productivity and test-
ability and quality (perhaps paying
off in the long term?), using TDD
is ne. If you’re simply having fun
and feeling good while performing
TDD without any significant drawbacks,
that’s also fine. After all, the
evidence shows that happy develop-
ers are more productive and produce
better code!18
