Context: Test-driven development (TDD) is an agile software development approach that has been widely claimed to improve software quality. However, the extent to which TDD improves quality appears to depend largely on the characteristics of the study in which it is evaluated (e.g., the research method, participant type, programming environment, etc.). The particularities of each study make the aggregation of results untenable. Objectives: The goal of this paper is to: increase the accuracy and generalizability of the results achieved in isolated experiments on TDD, provide joint conclusions on the performance of TDD across different industrial and academic settings, and assess the extent to which the characteristics of the experiments affect the quality-related performance of TDD. Method: We conduct a family of 12 experiments on TDD in academia and industry. We aggregate their results by means of meta-analysis. We perform exploratory analyses to identify variables impacting the quality-related performance of TDD. Results: TDD novices achieve slightly higher code quality with iterative test-last development (i.e., ITL, the reverse approach of TDD) than with TDD. The task being developed largely determines quality. The programming environment, the order in which TDD and ITL are applied, and the learning effects from one development approach to another do not appear to affect quality. The quality-related performance of professionals using TDD drops more than that of students. We hypothesize that this may be because professionals are more resistant to change and potentially less motivated than students. Conclusion: Previous studies seem to provide conflicting results on TDD performance (i.e., some positive, some negative). We hypothesize that these conflicting results may be due to different study durations, experiment participants being unfamiliar with the TDD process, or case studies comparing the performance of TDD vs. the control approach (e.g., the waterfall model), each applied to develop a different system. Further experiments with TDD experts are needed to validate these hypotheses.
Test-Driven Development (TDD), an agile development approach that builds software systems through successive micro-iterative testing and coding cycles, has been widely claimed to increase external software quality. In view of this, some managers at Paf (a Nordic gaming entertainment company) were interested in knowing how TDD would perform at their premises. Eventually, if TDD outperformed their traditional way of coding (i.e., YW, short for Your Way), it would be possible to switch to TDD on the strength of the empirical evidence achieved at the company level. We conduct an experiment at Paf to evaluate the performance of TDD, YW and the reverse approach of TDD (i.e., ITL, short for Iterative Test-Last) on external quality. TDD outperforms YW and ITL at Paf. Despite the encouraging results, we cannot recommend that Paf immediately adopt TDD, as the difference in performance between YW and TDD is small. However, as TDD looks promising at Paf, we suggest moving some developers to TDD and running a future experiment to compare the performance of TDD and YW. TDD slightly outperforms ITL in controlled experiments with TDD novices. However, more industrial experiments are still needed to evaluate the performance of TDD in real-life contexts.
Test-Driven Development (TDD) has been claimed to increase external software quality. However, the extent to which TDD increases external quality has seldom been studied in industrial experiments. We conduct four industrial experiments in two different companies to evaluate the performance of TDD on external quality. We study whether the performance of TDD holds across premises within the same company and across companies. We identify participant-level characteristics impacting results. Iterative Test-Last (ITL), the reverse approach of TDD, outperforms TDD in three out of four premises. ITL outperforms TDD in both companies. The larger the experience with unit testing and testing tools, the larger the difference in performance between ITL and TDD (in favour of ITL). The technological environment (i.e., programming language and testing tool) does not seem to impact results. Evaluating the participant-level characteristics that impact results in industrial experiments may ease the understanding of the performance of TDD in realistic settings.
Context: Researchers from different groups and institutions are collaborating towards the construction of groups of interrelated replications. Applying unsuitable techniques to aggregate interrelated replications' results may impact the reliability of joint conclusions. Objectives: Comparing the advantages and disadvantages of the techniques applied to aggregate interrelated replications' results in Software Engineering (SE). Method: We conducted a literature review to identify the techniques applied to aggregate interrelated replications' results in SE. We analyze a prototypical group of interrelated replications in SE with the techniques that we identified. We check whether the advantages and disadvantages of each technique (according to mature experimental disciplines such as medicine) materialize in the SE context. Results: Narrative synthesis and Aggregation of p-values do not take advantage of all the information contained within the raw data when providing joint conclusions. Aggregated Data (AD) meta-analysis provides visual summaries of results and allows assessing experiment-level moderators. Individual Participant Data (IPD) meta-analysis allows interpreting results in natural units and assessing both experiment-level and participant-level moderators. Conclusion: All the information contained within the raw data should be used to provide joint conclusions. AD and IPD, when used in tandem, seem suitable for analyzing groups of interrelated replications in SE.
Context: Families of experiments (i.e., groups of experiments with the same goal) are on the rise in Software Engineering (SE). Selecting unsuitable aggregation techniques to analyze families may undermine their potential to provide in-depth insights from experiments' results. Objectives: Identifying the techniques used to aggregate experiments' results within families in SE. Raising awareness of the importance of applying suitable aggregation techniques to reach reliable conclusions within families. Method: We conduct a systematic mapping study (SMS) to identify the aggregation techniques used to analyze families of experiments in SE. We outline the advantages and disadvantages of each aggregation technique according to mature experimental disciplines such as medicine and pharmacology. We provide preliminary recommendations for analyzing and reporting families of experiments in view of families' common limitations with regard to joint data analysis. Results: Several aggregation techniques have been used to analyze SE families of experiments, including Narrative synthesis, Aggregated Data (AD), Individual Participant Data (IPD, mega-trial or stratified), and Aggregation of p-values. The rationale used to select aggregation techniques is rarely discussed within families. Families of experiments are commonly analyzed with aggregation techniques that are unsuitable according to the literature of mature experimental disciplines. Conclusion: Data-analysis reporting practices should be improved to increase the reliability and transparency of joint results. AD and IPD stratified appear to be suitable for analyzing SE families of experiments.
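The core distinction among the aggregation techniques above is granularity: AD meta-analysis pools only per-experiment summary statistics, whereas IPD works on the raw participant data. The following is a minimal sketch of fixed-effect, inverse-variance AD pooling; the effect sizes and variances are invented for illustration and are not results from any of the studies described here.

```python
# Sketch of Aggregated Data (AD) meta-analysis: each experiment contributes
# a summary effect size and its variance, pooled with inverse-variance
# weights under a fixed-effect model. Numbers below are hypothetical.
import math

# (effect_size, variance) per experiment -- illustrative values only
experiments = [(0.30, 0.04), (-0.10, 0.02), (0.05, 0.05)]

weights = [1.0 / var for _, var in experiments]
pooled = sum(w * d for (d, _), w in zip(experiments, weights)) / sum(weights)
se_pooled = math.sqrt(1.0 / sum(weights))

# 95% confidence interval for the pooled effect
ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
print(f"pooled d = {pooled:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

An IPD stratified analysis would instead fit a single model over all participants' raw observations, with experiment as a stratum, which is what makes participant-level moderators assessable.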
This paper shares our experience with the initial negotiation and topic-elicitation process for conducting industry experiments in six software development organizations in Finland. The process involved interaction with company representatives in the form of both multiple group discussions and separate face-to-face meetings. Fitness criteria developed by the researchers were applied to the list of generated topics to decide on a common topic. The challenges we faced include the diversity of proposed topics, communication gaps, skepticism about research methods, an initial disconnect between research and industry needs, and the lack of a prior working relationship. Lessons learned include allowing enough time to establish trust with partners, leveraging the benefits of training and skill development that are inherent in the experimental approach, uniquely positioning the experimental approach within the landscape of other validation approaches more familiar to industrial partners, and introducing the fitness criteria early in the process.
Existing empirical studies on test-driven development (TDD) report different conclusions about its effects on quality and productivity. Very few of those studies are experiments conducted with software professionals in industry. We aim to analyse the effects of TDD on the external quality of the work done and the productivity of developers in an industrial setting. We conducted an experiment with 24 professionals from three different sites of a software organization. We chose a repeated-measures design and asked subjects to implement TDD and incremental test-last development (ITLD) in two simple tasks and a realistic application close to real-life complexity. To analyse our findings, we applied a repeated-measures general linear model procedure and a linear mixed-effects procedure. We did not observe a statistical difference between the quality of the work done by subjects under the two treatments. We observed that subjects are more productive when they implement TDD on a simple task compared to ITLD, but productivity drops significantly when applying TDD to a complex brownfield task. Thus, task complexity significantly obscured the effect of TDD. Further evidence is necessary to conclude whether TDD is better or worse than ITLD in terms of external quality and productivity in an industrial setting. We found that experimental factors such as the selection of tasks could dominate the findings in TDD studies.
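The repeated-measures design above means every subject is measured under both treatments, so the analysis ultimately rests on within-subject differences rather than between-group comparisons. A minimal sketch of that core idea, using a hand-computed paired t statistic on hypothetical quality scores (not data from the experiment; a full analysis would use a repeated-measures GLM or mixed-effects model as the abstract describes):

```python
# Repeated-measures sketch: each subject has a quality score under both
# TDD and ITLD, so we test the within-subject differences directly.
# All scores below are hypothetical, for illustration only.
import math

tdd  = [72.0, 65.0, 80.0, 58.0, 69.0, 75.0]   # quality under TDD, per subject
itld = [70.0, 68.0, 77.0, 60.0, 66.0, 74.0]   # quality under ITLD, same subjects

diffs = [a - b for a, b in zip(tdd, itld)]
n = len(diffs)
mean_d = sum(diffs) / n
var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)   # sample variance
t_stat = mean_d / math.sqrt(var_d / n)                    # paired t statistic

print(f"mean difference = {mean_d:.2f}, t({n - 1}) = {t_stat:.2f}")
```

Pairing each subject with themselves removes between-subject variability from the error term, which is why repeated-measures designs are attractive with small professional samples such as the 24 subjects here.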