The effectiveness of pair programming: A meta-analysis
Jo E. Hannay
, Tore Dybå
, Erik Arisholm
, Dag I.K. Sjøberg
Simula Research Laboratory, Department of Software Engineering, Pb. 134, NO-1325 Lysaker, Norway
University of Oslo, Department of Informatics, Pb. 1080 Blindern, 0316 Oslo, Norway
SINTEF Information and Communication Technology, NO-7465 Trondheim, Norway
Received 20 October 2008
Received in revised form 9 February 2009
Accepted 9 February 2009
Available online 20 February 2009
Evidence-based software engineering
Several experiments on the effects of pair versus solo programming have been reported in the literature.
We present a meta-analysis of these studies. The analysis shows a small signiﬁcant positive overall effect
of pair programming on quality, a medium signiﬁcant positive overall effect on duration, and a medium
signiﬁcant negative overall effect on effort. However, between-study variance is signiﬁcant, and there are
signs of publication bias among published studies on pair programming. A more detailed examination of
the evidence suggests that pair programming is faster than solo programming when programming task
complexity is low and yields code solutions of higher quality when task complexity is high. The higher
quality for complex tasks comes at a price of considerably greater effort, while the reduced completion
time for the simpler tasks comes at a price of noticeably lower quality. We conclude that greater attention
should be given to moderating factors on the effects of pair programming.
Ó2009 Elsevier B.V. All rights reserved.
1. Introduction ........................................................................................................ 1111
2. Method . . . . ........................................................................................................ 1111
2.1. Inclusion and exclusion criteria . . . . . . . . ............................. .............................................. 1111
2.2. Data sources and search strategy . . . . . . . ............................................. .............................. 1111
2.3. Study identification and selection . . . . . . ....................................... .................................... 1111
2.4. Data extraction and checking . . . . . . . . . . ............................................. .............................. 1111
2.5. Statistical analysis ............................................. ................................................. 1112
2.6. Synthesis . . . . . . . ....... ...................................................................... ................. 1112
2.7. Assumption of parametric normality . . . . .... ..................................................................... .. 1113
3. Results. . . . . ........................................................................................................ 1113
3.1. Description of studies . . . . . . . . . . . . . . . . ................................................ ........................... 1113
3.2. Effects of pair programming . . . . . . . . . . . .......... ................................................................. 1114
3.2.1. Effects of pair programming on Quality . . . . . ................................................................ 1114
3.2.2. Effects of pair programming on Duration . . . . ................................................................ 1114
3.2.3. Effects of pair programming on Effort. . . . . . . ................................................................ 1115
3.3. Subgroup analyses................ ...................................................................... ........ 1115
3.4. Publication bias . . ................................................ .............................................. 1116
3.5. Moderating effects of task complexity and expertise . . . . . . . .......................................... ................. 1118
4. Discussion. . ........................................................................................................ 1120
5. Conclusion . ........................................................................................................ 1120
Acknowledgements . . . . . . . . . . ........................................................................................ 1121
Appendix . . ........................................................................................................ 1121
References . ........................................................................................................ 1121
0950-5849/$ - see front matter Ó2009 Elsevier B.V. All rights reserved.
*Corresponding author. Address: Simula Research Laboratory, Department of Software Engineering, Pb. 134, NO-1325 Lysaker, Norway.
E-mail address: email@example.com (J.E. Hannay).
Information and Software Technology 51 (2009) 1110–1122
Contents lists available at ScienceDirect
Information and Software Technology
journal homepage: www.elsevier.com/locate/infsof
Much of the current interest in pair programming is likely due
to the popularity of extreme programming (XP), of which pair
programming is one of the core practices . Common-sense,
but scientiﬁcally unwarranted claims as to both the beneﬁts
and the adversities of pair programming abound. Advocates of
pair programming claim that it has many beneﬁts over individual
programming when applied to new code development or when
used to maintain and enhance existing code. Stated beneﬁts in-
clude higher-quality code in about half the time as individuals,
happier programmers, improved teamwork, improved knowledge
transfer, and enhanced learning . There are also expectations
with respect to the beneﬁts and drawbacks of various kinds of
pairing, e.g., that ‘‘expert–expert pairings seem to be especially
accelerated” (ibid., p. 102) and that ‘‘novice–novice is the most
problematic” (ibid., p. 120). Stephens and Rosenberg  claim
that pair programming is clearly the least beneﬁcial of XP prac-
tices and also that novice–novice pairing is obviously an undesir-
able combination, because it is akin to ‘‘the blind leading the
With the considerable interest in pair programming by the
software industry and academia, it is important to determine sci-
entiﬁcally, whether, and if so when, pairing two programmers is
beneﬁcial in terms of important cost drivers such as duration, ef-
fort, and quality. The ﬁnding across decades of small group re-
search is that groups usually fall short of reasonable
expectations to improved performance [48,35,33]. An important
question, therefore, is whether the claims regarding pair pro-
gramming can be substantiated by empirical evidence and how
pair programming relates to such group research. Several empir-
ical studies have now been conducted that set out to examine the
effects of pair programming in a systematic manner. This article
provides a systematic review in terms of a meta-analysis of all
published experiments on the effectiveness of pair programming
and subsequently offers recommendations for evidence-based
practice . This review extends the intermediate analysis re-
ported in , which brieﬂy summarized pair programming
experiments published up and until 2006. In the present full-scale
analysis, we take into account between-study variance, subgroup
differences and publication bias. We also take into account stud-
ies published in 2007.
Section 2presents the methods for systematic review and sta-
tistical meta-analysis. Section 3presents the results from the anal-
ysis, and Section 4discusses implications for theory and practice.
Informed by the general procedures for performing systematic
reviews  and the established methods of meta-analysis
[39,45], we undertook the meta-analysis in distinct stages: identi-
ﬁcation of inclusion and exclusion criteria, search for relevant
studies, identiﬁcation and selection of studies, data extraction,
and statistical analysis and synthesis, see details below.
The meta-analysis focused on combining quantitative effects
on three central outcome constructs that were investigated in
the included studies. We did not assess the quality of the in-
cluded studies in terms of, e.g., the appropriateness of the chosen
effect size measures , the appropriateness of randomization
procedures , subject/task selection and validity issues ,
statistical power , the use of theory , the approach to real-
ism , etc. Future meta-analyses might incorporate study qual-
ity, but at present, it is not clear how to aggregate this
multifaceted concept into a single measure to be used in a
2.1. Inclusion and exclusion criteria
Studies were eligible for inclusion in the meta-analysis if they
presented quantitative data on the effectiveness of pair program-
ming in which a comparison was made between pairs and individ-
uals, possibly in a team context. The subjects could be either
students or professional software developers. Included studies
had to report one of the primary outcomes Quality,Duration,orEf-
fort. We did not want to put any restrictions on the operationaliza-
tion of these outcome constructs. Furthermore, the studies had to
be reported in English.
2.2. Data sources and search strategy
The search strategy included electronic databases and hand
searches of conference proceedings. We searched in the following
electronic databases: ACM Digital Library, Compendex, IEEE
Xplore, and ISI Web of Science. We did not perform separate
searches in the SE-speciﬁc databases Kluwer Online, ScienceDirect,
SpringerLink, and Wiley Inter Science Journal Finder, because pre-
vious experience with systematic search strategies has shown that
articles retrieved from these databases are also returned by either
ISI Web of Science or Compendex . In addition to the electronic
databases, we hand-searched all volumes of the following thematic
conference proceedings: XP, XP/Agile Universe, Agile, and Agile
Development Conference. We used the basic search string ‘‘‘pair
programming’ OR ‘collaborative programming’” to conduct the
2.3. Study identiﬁcation and selection
The identiﬁcation and selection process consisted of three
stages. At Stage 1, the second author applied the search string to
the titles, abstracts, and keywords of the articles in the included
electronic databases and conference proceedings. All retrieved arti-
cles were published, or accepted for publication, before or in 2007.
Excluded from the search were editorials, prefaces, article summa-
ries, interviews, news items, correspondence, discussions, com-
ments, reader’s letters, and summaries of tutorials, workshops,
panels, and poster sessions. This search strategy resulted in a total
of 236 unique citations.
At Stage 2, the ﬁrst and second authors went through the titles
and abstracts of all the studies resulting from stage 1 for relevance
to the review. If it was unclear from the title, abstract, and key-
words whether a study conformed to our inclusion criteria, it
was included for a detailed review. This screening process resulted
in 57 citations that were passed on to the next stage.
At Stage 3, the full text of all 57 citations from Stage 2 were re-
trieved and reviewed in detail by the ﬁrst and second authors. This
resulted in 23 included articles according to the inclusion criteria.
Five of these did not report enough information to compute stan-
dardized effect sizes and were excluded. Thus, 18 studies (all
experiments) met the inclusion criteria and were included in the
review (see Appendix).
2.4. Data extraction and checking
We collected data from the 18 articles, including type of treat-
ment, type of system, type of tasks, duration of the study, number
of groups, group assignment, type of subjects and their experience
with pair programming, number of pairs, number of individuals,
outcome variable, means, standard deviations, counts, percentages,
and p-values. Every article included in the review was read in de-
tail and the data was extracted and cross-checked by the ﬁrst, sec-
ond and third authors. Discrepancies were resolved by discussion
among all four authors.
J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122 1111
2.5. Statistical analysis
We used Comprehensive Meta-Analysis v2 to calculate
effect size estimates for all the tests in the 18 articles.
order to be comparable across studies, effect sizes must be standard-
ized. In this meta-analysis, we used Hedges’ gas the standardized
measure of effect size. Like Cohen’s dand Glass’ D, Hedges’ gis sim-
ply the difference between the outcome means of the treatment
groups, but standardized with respect to the pooled standard devia-
tion, sp, and corrected for small sample bias :
The pooled standard deviation is based on the standard deviations
in both groups, s
Hedges’ g, Cohen’s d, and Glass’
have the same properties in
large samples (i.e., they are equivalent in the limit
Þ!1), but Hedges’ ghas the best properties for small
samples when multiplied by a correction factor that adjusts for
small sample bias :
Correction factor for Hedges’ g¼13
where N= total sample size.
An effect size of .5 thus indicates that the mean of the treatment
group is half a standard deviation larger than the mean of the con-
trol group. Effect sizes larger than 1.00 can be assumed to be large,
effect sizes of .38–1.00 medium, and effect sizes of 0–.37 small .
We conducted separate meta-analyses for the three outcome
constructs Quality,Duration, and Effort. Some studies applied sev-
eral tests on the same outcome construct. In theses cases, we used
the mean of the effect sizes over these tests to give only one effect
size per outcome per study. Comprehensive Meta-Analysis calcu-
lates these mean effect sizes as an option.
Because we expected considerable heterogeneity, we decided to
calculate the resulting meta-analytic effect sizes both under the
assumption of the random-effects model and under the assump-
tion of the ﬁxed-effects model. These models lead to different sig-
niﬁcance test and conﬁdence intervals for meta-analytic results
The ﬁxed-effects model assumes an unknown and ﬁxed popula-
tion effect size that is estimated by the studies in the meta-analy-
sis. All the studies in the meta-analysis are seen as drawn from the
same population, and variances in effect sizes between individual
studies are viewed as due to subject variability .
The random-effects model, on the other hand, assumes an un-
known and stochastic population effect-size distribution. That is,
the true effect size of pair programming varies around a mean
This caters for the view that the effect of pair programming de-
pends on situational variables and other factors (both known and
unknown) that are not taken into consideration in the analysis.
Variance in effect sizes are then seen as due to subject variability,
and also to inter-study variability, since each study is seen as
approximating a different part of the true effect size distribution
Both models relate to an unknown population parameter. The
approaches are hence referred to as unconditional . The choice
of which model to use is made prior to the analysis based on the-
ory, past empirical ﬁndings, and on insights as to what the in-
cluded studies describe.
However, one may also use the analysis techniques associated
with the two models merely to characterize the studies relatively
to each other without any reference to a population effect size.
Which model to use in this case, is determined from the observed
data based on heterogeneity measures, which are calculated under
the assumption of a ﬁxed-effects model. If heterogeneity is non-
signiﬁcant, a ﬁxed-effects model is an appropriate characterization
of data. Otherwise, a random-effects model best characterizes the
data. The results from such conditional approaches should, how-
ever, not be confounded with statements regarding population
We conducted our analysis from both an unconditional and a
conditional perspective. For the unconditional perspective, we
chose the random-effects model. Hedges and Vevea state that
‘‘In the case of random-effects models, for example, some individ-
ual effect-size parameters may be negative even though
sitive. That corresponds to the substantive idea that some
realizations of the treatment may actually be harmful even if
the average effect is beneﬁcial” . Results in  suggest that
the effects of pair programming may be positive or negative
dependent on other factors (e.g., expertise and task complexity).
Also, the outcome constructs Quality,Duration, and Effort are
not yet well-deﬁned in software engineering. These constructs
are operationalized in very diverse ways (Section 3.1), and for
the time being, it is reasonable to view these diverse operational-
izations as different aspects of a construct (a so-called formative
measurement model [3,6,12,40]). Under these circumstances,
and until the constructs are better understood, it is reasonable
to relate to a random-effects model.
In the conditional approach, we tested whether there were gen-
uine differences underlying the results of the studies (heterogene-
ity), or whether the variation in the ﬁndings was compatible with
In the following, we give an overview of the technical details
of the meta-analysis. For further elaborations, see e.g., [27,4,39].
Let kbe the number of studies in the meta-analysis. Let T
the standardized effect size estimate (in our case, Hedges’ g)of
study i. In the ﬁxed-effects model, the estimate T
of the as-
sumed ﬁxed population effect size, and the estimate’s variance
is the weight assigned to study i, i.e., the reciprocal
of the variance
for study i. Thus, T
is a weighted mean over the
effect sizes of the individual studies, where studies with less vari-
ance are given greater weight. In the random-effects model, the
weights are based on between-study variance in addition to with-
. Speciﬁcally, the estimate T
of the mean
the assumed population effect size distribution, and the estimate’s
is the additional be-
where the degrees of freedom df ¼k1, and Qrepresents the total
Comprehensive Meta-Analysis is a trademark of Biostat Inc.
1112 J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122
In Eq. (6),Cis simply a scaling factor that ensures that
same denomination as within-study variance, i.e.,
In fact, Qis a statistic that indicates heterogeneity, and one that
we used for this purpose. A signiﬁcant Qrejects the null hypothesis
of homogeneity and indicates that the variability among the effect
sizes is greater than what is likely to have resulted from subject-le-
vel variability alone . We also calculated the I
indicates heterogeneity in percentages:
A value of 0 indicates no observed heterogeneity, 25% low, 50%
moderate, and 75% high heterogeneity .
2.7. Assumption of parametric normality
Hedges’ gis a parametric effect size measure that is based,
ﬁrstly, on the assumption that one wishes to relate to a population
distribution, and, secondly, that each sub-population (solo, pair) is
normally distributed on the response variable. This assumption
also underlies the meta-analytic procedures and estimates that
we used. It is well-known that Duration and Effort measures have
Gamma distributions, and the various Quality measures in this
study may or may not have normal distributions. Thus, one may
question whether other standardized effect-size measures should
have been used for the meta-analysis. Non-parametric effect-size
measures would merely allow one to characterize data and not al-
low one to relate to a population. On the other hand, specifying
other distributions (e.g., Gamma) would demand that we had ac-
cess to the individual studies’ raw data. Since means are meaning-
ful measures for all three outcome constructs, we therefore
decided to use Hedges’ gas a best compromise for conducting
the meta-analysis. Note also that even if the assumed population
distribution is not evident in a (small) sample, this reason alone
should not lead one to abandon the model (unless the sole aim is
to characterize data). Small samples will of course, lead to less con-
ﬁdent parameter estimates but these conﬁdence estimates will be
calculated correctly according the assumed population
We ﬁrst present characteristics of the 18 studies that were in-
cluded in the review and meta-analysis. We then give the meta-
analytic results in terms of overall effects, subgroup effects and
3.1. Description of studies
Characteristics of the 18 studies included in the meta-analysis
are summarized in Table 1 in alphabetical order, while full cita-
tions are provided in the Appendix.
Of the 18 studies, 11 were from Europe and seven from North
America. The number of subjects in the studies varied from 12 to
295. Thirteen of the studies used students as subjects, while four
used professionals. One used both professionals and students. Five
studies made the comparison within a team context, that is, teams
of pairs versus teams of individuals (marked with an asterisk in the
ﬁrst column). The studies often administered several tests and the
number of data points may have varied across tests (numbers in
parentheses in Table 1).
All studies used programming tasks as the basis for comparison.
In addition, Madeyski (2006) and Madeyski (2007) included test-
Characteristics of the included studies.
Study Subjects N
Arisholm et al.
Professionals 295 98 99 10 sessions with individuals over 3 months and 17 sessions with pairs over 5 months (each of
1 day duration, with different subjects). Modiﬁed 2 systems of about 200–300 Java LOC each
Baheti et al. (2002) Students 134 16 9 Teams had 5 weeks to complete a curricular OO programming project. Distinct projects per
Canfora et al. (2005) Students 24 12 24 2 applications each with 2 tasks (run1 and run2)
Canfora et al. (2007) Professionals 18 5(4) 8(10) Study session and 2 runs (totalling 390 min) involving 4 maintenance tasks (grouped in 2
assignments) to modify design documents (use case and class diagrams)
Domino et al. (2007) Professionals
88 28 32 Run as several sessions during a period of two months. Pseudocode on ‘‘Create-Design” tasks
Heiberg et al. (2003) Students 84(66) 23(16) 19(17) 4 sessions over 4 weeks involving 2 programming tasks to implement a component for a larger
Madeyski (2006) Students 188 28 31(35) 8 laboratory sessions involving 1 initial programming task in a ﬁnance accounting system (27
Madeyski (2007) Students 98 35 28 Java course project of developing a 27 user story accounting system over 8 laboratory sessions
of 90 min each. Test-driven development
Müller (2005) Students 38 19 23 2 runs of 1 programming session each on 2 initial programming tasks (Polynomial and Shufﬂe-
Puzzle) producing about 150 LOC
Müller (2006) Students 18(16) 4(5) 6 1 session involving initial design + programming tasks on an elevator system
Students 15 5 5 4 lab sessions over a winter semester, as part of a University course. Wrote four C/C++
programs ranging from 150–400 LOC
Nosek (1998) Professionals 15 5 5 45 min to solve 1 programming task (database consistency check script)
Students 95 7 7 12 weeks to complete 4 phases of development + inspection
Students 36 5 4 Part of a team project to extend a system. 13 weeks to complete 4 phases of
development + inspection
Rostaher & Hericko
Professionals 16 6 4 6 small user stories ﬁlling 1 day
Students 20 4 8 9-week student project in which each subject spent a total of 100 h (400 h per team). A total of
1500–4000 LOC was written
Williams et al.
Students 41 14 13 6-week course where the students had to deliver 4 programming assignments
Xu & Rajlich (2006) Students 12 4 4 2 sessions with pairs and 1 session with individuals. 1 initial programming task producing
around 200–300 LOC
J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122 1113
driven development; Müller (2005), Phongpaibul and Boehm
(2006), and Phongpaibul and Boehm (2007) included inspections;
and Müller (2006) included design tasks.
Quality was typically reported as the number of test cases
passed or number of correct solutions of programming tasks, but
student grades, delivered functionality, and metrics for code com-
plexity were also used as measures of Quality.Duration was re-
ported mainly in two modes: as the total time taken to complete
all tasks considered (all solutions), or as the total time taken to
complete the tasks that had been assessed as having passed a cer-
tain quality standard (checked solutions). For comparisons be-
tween pairs and solo programmers, pair Effort was reported as
twice the duration of each individual in the pair. For team-based
comparisons, e.g., teams of individuals versus teams of pairs, Effort
was reported as the total effort spent by the respective groups.
Thus, the studies included in the meta-analysis do not all apply
the same measures or have similar context variables. Rather, they
investigate the effectiveness of pair programming with respect to
different aspects of the constructs Quality,Duration, and Effort.As
such, the studies may be seen as differentiated replications ,
and any measure from a particular study in the meta-analysis is
but one indicator of, perhaps, many indicators regarding one of these
3.2. Effects of pair programming
Table 2 summarizes the meta-analysis. The gcolumn shows the
meta-analytic estimates T
, in terms of Hedges’ g, of the
population effect size parameter in the ﬁxed-effects model and
random-effects model, respectively, along with 95% conﬁdence
intervals and p-values. Also given are heterogeneity measures cal-
culated under the assumption of a ﬁxed-effects model. The Overall
effects analysis will be described in detail in this section, and the
Subgroup analysis will be given in detail in Section 3.3.
Overall, 14 studies used Quality as an outcome construct, 11
used Duration, and 11 used Effort.Fig. 1 shows Forest plots
the standardized effects for each of the three outcome constructs.
The studies are sorted according to the relative weight that a study’s
effect size receives in the meta-analysis. Relative weights are nor-
malized versions of the weights wior w
iused in calculating the
meta-analytic estimates Tor T
. The rightmost columns in Fig. 1
show these weights according to the ﬁxed-effects and random-ef-
fects models. Estimates from larger studies will usually be more pre-
cise than the estimates from smaller studies; hence, larger studies
will generally be given greater weight.
The squares indicate the effect size estimate for each study. The
size of each square is proportional to the relative weight of the
study according to the random-effects model. The relative weights
of a random-effects model are generally more uniform than those
of ﬁxed-effects models, due to the incorporation of between-study
variance into all the studies’ weights. The horizontal lines indicate
the 95% conﬁdence intervals for each study’s effect size estimate
according to the random-effects model.
The diamonds at the bottom give the meta-analytic effect size
estimate according to the ﬁxed-effects and the random-effects
model, i.e., T
, respectively. The diamonds’ centers and
widths indicate the estimates and their 95% conﬁdence interval,
Fig. 2 shows one-study-removed analyses for each of the three
outcome constructs. The plots show the meta-analytic effect size
estimate when each study is removed from the meta-analysis.
The resulting deviation from the full analysis indicates the sensitiv-
ity of the full analysis with respect to each study, that is, how much
difference a given study makes to the meta-analysis.
3.2.1. Effects of pair programming on Quality
Fourteen studies compared the effects of pair programming on
Quality in a total of 38 tests. These studies used a total of 1160 sub-
jects, although in some studies, not all subjects were used in these
particular tests. The subjects were distributed to study units (pairs,
solo, teams of pairs, teams of solo) in various ways (Section 3.1).
The meta-analytic effect size estimate is .23 in the ﬁxed-effects
model and .33 in the random-effects model.
Both the ﬁxed-effects model and the random-effects model sug-
gest that there is a small
positive overall effect of pair program-
ming on Quality compared with solo programming. Only one study
showed a negative effect of pair programming on Quality (Domino
et al., 2007). All the other studies showed mostly small to medium
positive effects. The three studies by Domino et al. (2007), Arisholm
et al. (2007), and Madeyski (2006) contribute more than 50% of the
total weight in the meta-analysis for Quality. The one-study-re-
moved analysis shows that the meta-analysis is most sensitive to
the inclusion/exclusion of Williams et al. (2000). Heterogeneity is
signiﬁcant at a medium level (Q¼35:97; p<:01; I2¼63:86%).
3.2.2. Effects of pair programming on Duration
Eleven studies reported effects on Duration in a total of 21 tests.
These studies used a total of 669 subjects. Both the ﬁxed-effects
model and the random-effects model suggest that there is a med-
ium positive overall effect of pair programming on Duration.
Summary of meta-analysis.
Analysis kModel Effect Size Heterogeneity
g95% CI pQ I
Quality 14 ﬁxed .23 .09 .37 .00 35.97 63.86 .14 .00
random .33 .07 .60 .01
Duration 11 ﬁxed .40 .21 .59 .00 33.57 70.21 .28 .00
random .54 .13 .94 .01
Effort 11 ﬁxed .73 .94 .51 .00 66.36 84.93 .87 .00
random .52 1.18 .13 .12
Quality 11 ﬁxed .22 .06 .38 .01 32.97 69.66 .18 .00
random .32 .01 .65 .06
Duration 7 ﬁxed .58 .33 .84 .00 12.28 51.13 .13 .06
random .63 .24 1.02 .00
Effort 8 ﬁxed .59 .88 .30 .00 48.85 85.67 1.17 .00
random .04 .82 .91 .92
Quality 3 ﬁxed .26 .05 .57 .10 2.97 32.56 .07 .23
random .37 .10 .85 .12
Duration 4 ﬁxed .16 .13 .46 .28 16.87 82.22 .83 .00
random .50 .55 1.54 .35
Effort 3 ﬁxed .90 1.22 .58 .00 15.48 87.08 1.54 .00
random 1.99 3.56 .41 .01
Quality 3 ﬁxed .19 .22 .60 .36 .23 .00 .00 .89
random .19 .22 .60 .36
Duration 2 ﬁxed .34 .13 .81 .16 1.46 31.55 .06 .23
random .31 .27 .89 .30
Effort 2 ﬁxed .74 .13 1.61 .09 11.25 91.11 4.14 .00
random .99 1.96 3.94 .51
Quality 11 ﬁxed .24 .09 .39 .00 35.70 71.99 .18 .00
random .38 .05 .70 .02
Duration 9 ﬁxed .41 .20 .63 .00 32.02 75.02 .37 .00
random .63 .13 1.13 .01
Effort 9 ﬁxed .82 1.05 .60 .00 43.36 81.55 .61 .00
random .85 1.48 .23 .01
The plots were generated with PSTricks postscript macros in LaTeX within an
Excel spreadsheet using data produced by Comprehensive Meta-Analysis.
The effect size is small compared with the effect sizes reported in other software
engineering experiments .
1114 J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122
Compared with Quality, the studies on Duration show a more
mixed picture; two of the 11 studies show a negative effect, while
the remaining nine show a positive effect. In addition, the smaller
studies show medium to large contradictory effects. The three
studies by Arisholm et al. (2007), Canfora et al. (2005), and Naw-
rocki and Wojciechowski (2001) contribute more than 60% of the
total weight in the meta-analysis of Duration. The one-study-re-
moved analysis shows that the meta-analysis is most sensitive to
the inclusion/exclusion of Nosek (1998). Heterogeneity is signiﬁ-
cant at a medium level (Q¼33:57; p<:01; I
3.2.3. Effects of pair programming on Effort
Eleven studies reported effects on Effort in a total of 18 tests.
These studies used a total of 586 subjects. Both the ﬁxed-effects
model and the random-effects model suggest that there is a med-
ium negative overall effect of pair programming on Effort com-
pared with solo programming. All the included studies show a
negative effect on Effort, apart from the two studies by Phongpaibul
and Boehm (2006,2007). However, the results of those studies are
not directly comparable, because the researchers compared pair
programming teams with teams of individuals who also performed
inspections. The three studies by Arisholm et al. (2007), Canfora
et al. (2005), and Müller (2005) contribute almost 70% of the total
weight in the meta-analysis of Effort. The one-study-removed anal-
ysis shows that the meta-analysis is most sensitive to the inclu-
sion/exclusion of either of Phongpaibul and Boehm (2006,2007).
Heterogeneity is signiﬁcant and high (Q¼66:36; p<:01;
3.3. Subgroup analyses
Because of medium to high heterogeneity, we decided to con-
duct subgroup analyses as a step to identify possible immediate
sources of heterogeneity. Two subgroup types stand out due to
surface dissimilarities: the type of developers (students or profes-
sionals) and the type of comparison (isolated pairs vs. individuals,
or teams of pairs vs. teams of individuals). The results of these
analyses are summarized in Table 2.
The most dramatic result of the subgroup analysis is the rever-
sal of effect on Effort for the Teams subgroup (from .52 in the
Fig. 1. Forest plots for meta-analysis of Quality,Duration and Effort.
J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122 1115
overall analysis to .99 (non-signiﬁcant) in the subgroup analysis,
for the random-effects model). This is due to the two studies
Phongpaibul and Boehm (2006, 2007), which we remarked upon
above. Without these two studies (No teams subgroup), the effect
size increases to from .52 to .85 in the random-effects model.
Apart from this reversal, effect sizes in the subgroups remain in
the same order of magnitude as in the overall analysis, except for
the following cases: For Quality, the No teams subgroup reaches a
medium level in the random model’s estimate (.38). For Duration,
the Professionals subgroup declines to a small level in the ﬁxed
model’s estimate (.16). For Effort, the Students subgroup declines
to small (.04) (but non-signiﬁcant, p¼:92), and the Professionals
subgroup increases to a large effect size (1.99) in their respective
Heterogeneity levels remain medium to high. Professionals and
No teams increase to high heterogeneity for Duration (82.22) and
Teams decreases to small (.00) (but non-signiﬁcant, p¼:89) for
Signiﬁcance levels for both effect size estimates and heteroge-
neity decrease dramatically in several instances in the subgroups,
probably due to small k. This is particularly the case in the Profes-
sionals and Teams subgroups. Note however, that the effect on Ef-
fort actually turns signiﬁcant in the Professionals subgroup. Note
that the Professionals subgroup excludes Phongpaibul and Boehm
(2006, 2007) since they have student subjects.
3.4. Publication bias
Publication bias captures the idea that studies that report rela-
tively large treatment effects are the ones that were most likely to
be initiated in the ﬁrst place, and/or submitted and accepted for
publication. The effect size estimated from a biased collection of
Fig. 2. Forest plots for one-study-removed meta-analysis of Quality,Duration and Effort.
1116 J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122
studies will tend to overestimate the true effect. We used Compre-
hensive Meta-Analysis to assess the likely extent of publication
bias, and its potential impact on the conclusions.
Figs. 3–5 show funnel plots for Quality,Duration and Effort,
respectively. These graphs plot each study according to a measure
of study size called precision (which is simply 1=StdError) on the
vertical axis, and according to effect size (Hedges’ g) on the hori-
Hence, large studies tend to appear toward the top of the graph,
and smaller studies tend toward the bottom of the graph. Since
there is more inter-study variance in effect size estimates among
the smaller studies, these studies will be dispersed wider, while
larger studies will tend to cluster near the meta-analytic estimate
of effect size.
In the absence of publication bias, studies should be distributed
symmetrically on either side of the meta-analytic estimate of effect
size. In the presence of bias, the bottom part of the plot should
show a higher concentration of studies on one side of the meta-
analytic estimate than the other. This would reﬂect the fact that
smaller studies are more likely to be published if they have larger
than average effects, which makes them more likely to meet the
criterion for statistical signiﬁcance. Various statistical measures
can be used to complement the picture given by the funnel plots.
Duval and Tweedie’s trim and ﬁll method [13,14], takes the basic
idea behind the funnel plot one step further and imputes (com-
putes and inserts) assumed missing studies to obtain symmetry
if the funnel plot is asymmetric and then recomputes the meta-
analytic estimate of effect size. The method initially trims the
asymmetric studies from the biased side to locate the unbiased ef-
fect (in an iterative procedure), and then ﬁlls the plot by re-insert-
ing the trimmed studies on the biased side as well as their imputed
counterparts on the opposite side of the meta-analytic estimate of
effect size. Figs. 6–8 show the plots of Figs. 3–5 with imputed stud-
ies (black dots) and adjusted meta-analytic estimates of effect size
These analyses suggest that there is indeed some publication
bias in our sample. For example, six studies are imputed to obtain
symmetry for Quality. This might be taken as an incentive to pub-
lish more high-quality studies that, perhaps, exhibit low effects or
opposite effects to those that are expected. This might include pro-
moting gray literature to be readily accessible in journals and
However, it is important to note that publication bias may not
be the only cause of funnel plot asymmetry. For instance, asymme-
try may be caused by between-study heterogeneity, study quality,
or other factors .
In the presence of heterogeneity (which we are assuming),
there are three meaningful ways to impute the assumed missing
studies. (1) A random-effects model is used to trim and ﬁll, and
then the adjusted meta-analytic estimate of effect size is calculated
from the ﬁlled data using a random-effects model (a so-called ran-
dom–random approach) [13,14]. However, in a random-effects
model meta-analysis, smaller studies are given added weight in
the synthesis. Thus, if publication bias exists, the meta-analytic
estimate is likely to be more biased than that obtained from a
ﬁxed-effects model meta-analysis . Thus, an alternative is:
(2) a ﬁxed-effects model is used to trim and ﬁll, and then the ad-
justed meta-analytic estimate of effect size is calculated from the
ﬁlled data using a ﬁxed-effects model (ﬁxed–ﬁxed). It is recom-
mended to report both ﬁxed–ﬁxed and random–random analyses
. The ﬁxed–ﬁxed approach may be unsatisfying if one wishes
to view adjustments relative to the original random-effects model
meta-analysis. (3) A third approach is to use a ﬁxed-effects model
to trim and ﬁll the meta-analysis, but a random-effects model to
calculate the meta-analytic effect size estimate from the ﬁlled data
(ﬁxed–random). The ﬁxed-effects trim and ﬁll process is less likely
to be inﬂuenced by any publication bias than the random-effects
model approach, but the resulting estimate from the random-ef-
fects model is likely to be more conservative than if a ﬁxed-effects
model is used . The plots in Figs. 6–8 were generated using the
Fig. 3. Funnel plot quality.
Fig. 4. Funnel plot duration.
Fig. 5. Funnel plot effort.
J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122 1117
Table 3 presents the results from all perspectives. For example,
for Quality a ﬁxed-effects model approach to trim and ﬁll yields six
imputed studies and adjusts the estimate in the ﬁxed-effects mod-
el from .23 to .03 (ﬁxed–ﬁxed), while it adjusts the estimate in the
random-effects model from .33 to .07 (ﬁxed–random).
The trim and ﬁll procedure searches for missing studies on one
side at a time. For Qualtiy and Duration missing studies were found
on the left (Table 3), i.e., in the opposite direction of the estimated
effect, which is consistent with the assumption of publication bias.
For Effort, the ﬁxed approach found a missing study on the left,
while the random approach found missing studies on the right.
The missing study on the left is not consistent with the assumption
of publication bias and is due to the overly large effort estimate of
one of the studies in the opposite direction of the estimate.
Sterne and Egger  note that trim and ﬁll plots merely detect
a relationship between sample size and effect size, not a causal
mechanism between the two. The effect size may be larger in small
studies due to publication bias. However, it is also possible that
small studies estimate a different part of the population effect size
distribution than do large studies, e.g., because smaller studies use
different operationalizations of the outcome constructs, or have
different situational variables than large studies. Because one does
not know for certain whether funnel plot asymmetry is really
caused by publication bias, it is recommended to use the trim
and ﬁll method mainly as a form of sensitivity analysis .
3.5. Moderating effects of task complexity and expertise
Due to the interdisciplinary and complex nature of industrial
software engineering, it is usually not reasonable to test an
hypothesis that considers only one independent (predictor) vari-
able and one dependent (criterion) variable. Hypotheses related
to software engineering should typically include additional vari-
ables and test more complex relationships in order to provide a
more accurate description of reality. Indeed, the relatively small
overall effects and large between-study variance (heterogeneity)
indicate that one or more moderator variables might play a signif-
Only two of the studies in the review, Vanhanen and Lassenius
(2005) and Arisholm et al. (2007), tested explicitly for moderator
effects, while only one of the other studies discussed the potential
inﬂuence of such effects; Williams et al. (2000) suggested that the
relative improvement by pairs after the ﬁrst programming task in
their experiment was due to ‘‘pair jelling”.
The literature on group dynamics (e.g., [5,21]) suggests that the
extent to which group performance exceeds that of individuals,
and the mechanisms by which such gains in performance may be
achieved, depend upon the composition of the group and the char-
acteristics of the tasks. Vanhanen and Lassenius (2005) found that
task complexity did not affect the differences in effort between
Fig. 7. Funnel plot duration trim and ﬁll ﬁxed–random.
Fig. 8. Funnel plot effort trim and ﬁll ﬁxed–random.
Fig. 6. Funnel plot quality trim and ﬁll ﬁxed–random.
Trim and ﬁll analysis.
Imputed Fixed Random Q
g95% C.I. g95% C.I.
Observed .23 .09 .37 .33 .07 .60 35.97
ﬁxed 6 .03 .09 .16 .07 .23 .37 81.42 Left
random 0 .23 .09 .37 .33 .07 .60 35.97 Left
Observed .40 .21 .59 .53 .13 .94 33.57
ﬁxed 2 .34 .15 .54 .35 .10 .80 49.77 Left
random 1 .37 .18 .56 .44 .01 .88 45.01 Left
Observed .73 .94 .51 .52 1.18 .13 66.36
ﬁxed 1 .75 .97 .54 .74 1.46 .01 86.01 Left
random 2 .58 .78 .37 .07 .79 .65 104.46 Right
1118 J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122
solo and pair programming; the Pearson correlation between task
complexity and effort difference was as low as r¼:02.
On the other hand, Arisholm et al. (2007) found moderating ef-
fects of both task complexity and expertise. The results are shown
in Fig. 9. Overall, the results showed that the pairs had an 8% de-
crease in duration (g¼:21) with a corresponding 84% increase in
effort (g¼:68) and a 7% increase in correctness (g¼:11)
(Fig. 9a). However, the main effects of pair programming were
masked by the moderating effect of system complexity, in that
simpler designs had shorter duration, while more complex designs
had increased correctness (Fig. 9e).
Furthermore, when considering the moderating effect of pro-
grammer expertise, junior pairs had a small (5%) increase in dura-
tion and thus a large increase in effort (111%), and a 73% increase in
correctness (Fig. 9b). Intermediate pairs had a 28% decrease in
duration (43% increase in effort) and a negligible (4%) increase in
correctness (Fig. 9c). Senior pairs had a 9% decrease in duration
(83% increase in effort) and an 8% decrease in correctness
(Fig. 9d). Thus, the juniors beneﬁted from pair programming in
terms of increased correctness, the intermediates in terms of de-
creased duration, while there were no overall beneﬁts of pair pro-
gramming for seniors. When considering the combined
moderating effect of system complexity and programmer expertise
on pair programming, there appears to be an interaction effect:
Among the different treatment combinations, junior pairs assigned
to the complex design had a remarkable 149% increase on correct-
Total Effect of PP
Difference from individuals
Moderating Effect of System Complexity on PP
Difference from individuals
Moderating Effect of System Complexity for Juniors
Difference from individuals
Moderating Effect of System Complexity for Seniors
Difference from individuals
Moderating Effect of System Complexity for Intermediates
-39 % -29 %
Difference from individuals
Effect of PP for Juniors
Difference from individuals
Effect of PP for Intermediates
Difference from individuals
Effect of PP for Seniors
Duration Effort Correctness
Duration Effort Correctness Duration Effort Correctness
Duration Effort Correctness
Duration Effort Correctness
Duration Effort Correctness Duration Effort Correctness
Duration Effort Correctness
Difference from individuals
Fig. 9. The moderating effects of programmer expertise (a)–(d) and system complexity (e)–(h) on the relation of pair programming on duration, effort, and correctness
(Arisholm et al., 2007).
J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122 1119
ness compared with individuals (Fig. 9f). Furthermore, intermedi-
ates and seniors experienced an effect of pair programming on
duration on the simpler design, with a 39% (Fig. 9g) and 23%
(Fig. 9h) decrease, respectively. However, the cost of this shorter
duration was a corresponding decrease in correct solutions by
29% and 13%, respectively.
The unconditional part of the meta-analysis suggests that the
population effect size mean
for each of the outcome constructs
are small to medium.
More interestingly, the conditional part of the meta-analysis
showed large and partly contradictory differences in the reported
overall effects of pair programming, speciﬁcally with respect to
Duration and Effort. Our subgroup analyses do not suggest that
these differences are due to differences in samples (e.g., students
or professionals). Differences in organization (e.g., teams or no
teams) were apparent from the subgroup analysis, but these differ-
ences were due to two particular studies that, when removed, did
not decrease overall heterogeneity.
At the outset, we anticipated between-study variance (hetero-
geneity) due to moderating factors such as expertise and task com-
plexity and to the fact that the outcome constructs of the meta-
analysis are operationalized by indicators that are different aspects
of the constructs. Such heterogeneity was taken into account by
using a random-effects model for the unconditional interpretation.
Nevertheless, contradictory differences still manifested them-
selves. Thus, it seems clear that moderating factors play an impor-
tant role and should be investigated further. The study in 
corroborates this conclusion for expertise and task complexity,
but other moderating factors, such as amount of training in pair
programming, motivation, team climate, etc., are likely to be rele-
vant as well.
Consequently, the question of whether pair programming is
better than solo programming is not precise enough to be mean-
ingful, since the answer to that question in the present context is
both ‘‘yes” and ‘‘no”. On the basis of the evidence from this review,
the answer is that ‘‘it depends”: It depends on other factors, for
example, the expertise of the programmers and on the complexity
of the system and tasks to be solved.
In the literature, expertise and task complexity are perhaps the
most central situation-independent predictors of performance.
(Situation-dependent factors, on the other hand, include more dy-
namic factors such as motivation, team climate, organizational is-
sues, etc.). Theory predicts that experts perform better on complex
tasks than do novices because experts’ level of understanding cor-
responds to the deep structure [10,8,9,19] of a complex task. That
is, experts perceive (objectively) complex tasks as subjectively less
complex, i.e., less variable and more analyzable [43,38]. Conversely,
experts perceive (objectively) less complex tasks as more variable
and less analyzable since such tasks do not match the expert’s
deep-level understanding of the problem. Novices, on the other
hand, do have an understanding that matches the surface structure
of a non-complex task, and are expected to do better on non-com-
plex tasks than on complex tasks. Moreover, they may be expected
to even outperform experts on non-complex tasks .
These effects are evident in , in which the levels of correct-
ness for individual juniors, intermediates, and seniors on the
non-complex system were 63%, 87%, and 86%, respectively,
whereas the levels on the complex system were 34%, 41%, and
81%, respectively. Thus, the performance drop was much higher
for juniors than for intermediates and seniors, when moving from
the non-complex to the complex system, although juniors did not
outperform higher expertise groups on the non-complex system,
see also .
In our context, one of the most interesting observations is that
the pairing up of individuals seems to elevate the junior pairs up to
near senior pair performance. Thus, pair collaboration might com-
pensate for juniors’ lack of deep understanding, for example, by
inducing an expert-like strategy.
Change tasks rely heavily on program comprehension. To com-
prehend code, experts use a top–down model  that short-cuts
the available information by only investigating details as dictated
by domain knowledge. This approach to comprehension is more
efﬁcient than bottom–up comprehension, which builds under-
standing from details, and which is the approach found to be used
by programmers encountering totally unfamiliar code [41,42].
By forcing junior peers to rationalize their ideas to each other,
junior pairs might adopt a top–down strategy to comprehension,
rather than getting lost in code on their own. The mere act of think-
ing aloud whilst solving problems has been shown to increase
performance, when the verbalization is intended to reason or
explain action, e.g.,  (Type 3 verbalization, in Ericsson and Si-
mon’s terminology) . Several studies have concluded that
apparent successes of pair programming are not due to the partic-
ularities of pair programming (such as the speciﬁc roles of driver
and navigator), but rather to the shear amount of verbalization that
the pair programming situation necessitates [11,22].
Group performance not only relies on task complexity but also
on the collaborative nature of a task. In fact, the appropriateness of
each of the familiar adages ‘‘two heads are better than one”, ‘‘many
hands make light work”, and ‘‘a chain is only as strong as its weak-
est link” depends on whether a task is additive, compensatory, dis-
junctive or conjunctive . For example, the chain analogy is
appropriate for conjunctive tasks, where all group members must
contribute to the solution, but is inappropriate for disjunctive tasks
for which it sufﬁces that one group member has the ability to com-
plete the task. It is not obvious what sort of task pair programming
is in this respect.
The precise collaborative nature of pair programming also
inﬂuences what social mechanisms (social loaﬁng, social labour-
ing, social facilitation, social inhibition, social compensation, etc.)
are applicable. However, these social mechanisms also depend on
a host of other factors. In a meta-analysis of social loaﬁng (the
phenomenon that individuals tend to expend less effort when
working collectively than when working individually), Karau
and Williams  identiﬁed several conditions in which such
loaﬁng is eliminated (e.g., by high group cohesion) and some in
which the opposite phenomenon, social laboring , could be
observed (i.e., greater effort on group tasks). Social laboring seems
to occur when complex or highly involving tasks are performed, or
when the group is considered important for its members, or if
the prevailing values favor collectivism rather than individualism
Our meta-analysis suggests that pair programming is not uni-
formly beneﬁcial or effective, that inter-study variance is high,
and that perhaps publication bias is an issue. Hence, if further
investigations are to be undertaken on pair programming, then
the focus should be on untangling the moderating factors of the ef-
fect of pair programming.
However, with respect to the central factors expertise and task
complexity, the current state of knowledge suggest that pair pro-
gramming is beneﬁcial for achieving correctness on highly com-
plex programming tasks. Pair programming may also have a time
gain on simpler tasks. By cooperating, programmers may complete
tasks and attain goals that would be difﬁcult or impossible if they
worked individually. Junior pair programmers, for example, seem
able to achieve approximately the same level of correctness in
1120 J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122
about the same amount of time (duration) as senior individuals.
However, the higher quality for complex tasks comes at a price
of a considerably higher effort (cost), while the reduced completion
time for the simpler tasks comes at a price of a noticeably lower
quality. This fact conﬁrms Voas’  contention that you cannot
expect faster and better and cheaper. These relationships give rise
to a few evidence-based guidelines for the use of pair program-
ming for professional software developers. If you do not know
the seniority or skill levels of your programmers, but do have a
feeling for task complexity, then employ pair programming either
when task complexity is low and time is of the essence, or when
task complexity is high and correctness is important.
In the future, we intend to investigate deeper into the theoret-
ical and empirical underpinnings of collaboration in pairs, e.g., by
studying group dynamics and analyzing pair dialogues to obtain
insights into subjects’ learning and reasoning processes. Only by
understanding what makes pairs work, and what makes pairs less
efﬁcient, can steps be taken to provide beneﬁcial conditions for
work and to avoid detrimental conditions; or to avoid pairing alto-
gether when beneﬁcial conditions cannot be provided.
The authors are grateful to the anonymous referees for insight-
The following articles were included in the meta-analysis:
Arisholm et al. (2007). E. Arisholm, H. Gallis, T. Dybå, D.I.K. Sjøberg,
Evaluating pair programming with respect to system complex-
ity and programmer expertise, IEEE Trans. Software Eng. 33
(2007) 65–86 (Feb.)
Baheti et al. (2002). P. Baheti, E. Gehringer, D. Stotts, Exploring the
efﬁcacy of distributed pair programming, in: Proc. XP/Agile Uni-
verse 2002, ser. Lecture Notes in Computer Science, vol. 2418,
Springer Verlag, 2002, pp. 208–220.
Canfora et al. (2007). G. Canfora, A. Cimitlie, F. Garcia, M. Piattini,
C.A. Visaggio, Evaluating performances of pair designing in
industry, J. Systems and Software 80(8) (2007) 1317–1327 (Aug.)
Canfora et al. (2005). G. Canfora, A. Cimitlie, C.A. Visaggio, Empiri-
cal study on the productivity of the pair programming, in: Proc.
XP 2005, ser. Lecture Notes in Computer Science,vol. 3556,
Springer Verlag, 2005, pp. 92–99.
Domino et al. (2007). M.A. Domino, R. Webb Collins, A.R. Hevner,
Controlled experimentation on adaptations of pair program-
ming, Information Technology and Management 8(4) (2007)
Heiberg et al. (2003). S. Heiberg, U. Puus, P. Salumaa, A. Seeba, Pair-
programming effect on developers productivity, in: Proc. XP
2003, ser. Lecture Notes in Computer Science, vol. 2675.
Springer Verlag, 2003, pp. 215–224.
Madeyski (2006). L. Madeyski, The impact of pair programming
and test-driven development on package dependencies in
object-oriented design – an experiment, in: Proc. PROFES
2006, ser. Lecture Notes in Computer Science, vol. 4034,
Springer Verlag, 2006, pp. 278–289.
Madeyski (2007). L. Madeyski, On the effects of pair programming
on thoroughness and fault-ﬁnding effectiveness of unit tests, in:
Proc. PROFES 2007, ser. Lecture Notes in Computer Science, vol.
4589. Springer Verlag, 2007, pp. 207–221.
Müller (2005). M. Müller, Two controlled experiments concerning
the comparison of pair programming to peer review, J. Systems
and Software 78(2) (2005) 169–179
Müller (2006). M. Müller, A preliminary study on the impact of a
pair design phase on pair programming and solo programming,
Information and Software Technology 48(5) (2006) 335–344
Nawrocki and Wojciechowski (2001). J. Nawrocki, A. Wojciechow-
ski, Experimental evaluation of pair programming, in: Proc.
European Software Control and Metrics Conference
(ESCOM’01), 2001, pp. 269–276.
Nosek (1998). J. Nosek, The case for collaborative programming,
Comm. ACM 41(3) (1998) 105–108 (Mar.)
Phongpaibul and Boehm (2006). M. Phongpaibul, B. Boehm, An
empirical comparison between pair development and software
inspection in Thailand, in: Proc. Int’l Symposium on Empirical
Software Engineering (ISESE’06), 2006, pp. 85–94.
Phongpaibul and Boehm (2007). M. Phongpaibul, B. Boehm, A rep-
licate empirical comparison between pair development and
software development with inspection, in: Proc. 1st Int’l Symp.
Empirical Software Engineering and Measurement (ESEM’07),
IEEE Computer Society, 2007, pp. 265–274.
Rostaher and Herick (2002). M. Rostaher, M. Hericko, Tracking test
ﬁrst pair programming – an experiment, in: Proc. XP/Agile Uni-
verse 2002, ser. Lecture Notes in Computer Science, vol. 2418.
Springer Verlag, 2002, pp. 174–184.
Vanhanen and Lassenius (2005). J. Vanhanen, C. Lassenius, Effects
of pair programming at the development team level: an exper-
iment, in: Proc. Int’l Symposium on Empirical Software Engi-
neering (ISESE’05), 2005, pp. 336–345.
Williams et al. (2000). L. Williams, R.R. Kessler, W. Cunningham, R.
Jeffries, Strengthening the case for pair programming, IEEE Soft-
ware 17(4) (2000) 19–25
Xu and Rajlich (2006). S. Xu, V. Rajlich, Empirical validation of test-
driven pair programming in game development, in:Proc. Int’l
Conference on Computer and Information Science and Int’l
Workshop on Component-Based Software Engineering, Soft-
ware Architecture and Reuse (ICIS-COMSAR’06), 2006, pp.
 E. Arisholm, H. Gallis, T. Dybå, D.I.K. Sjøberg, Evaluating pair programming
with respect to system complexity and programmer expertise, IEEE Trans.
Software Eng. 33 (Feb.) (2007) 65–86.
 K. Beck, C. Andres, Extreme Programming Explained: Embrace Change, second
ed., Addison-Wesley, 2003.
 K. Bollen, R. Lennox, Conventional wisdom on measurement: a structural
equation perspective, Psychol. Bull. 110 (2) (1991) 305–314.
 M. Borenstein, L.V. Hedges, H. Rothstein, Meta-analysis – ﬁxed effects versus
random effects, Biostat Inc., Tech. Rep., 2007, <meta-analysis.com/downloads/
 R. Brown, Group Processes: Dynamics within and between Groups, second ed.,
 C. Burke jarvis, S.B. Mackenzie, P.M. Podsakoff, A critical review of construct
indicators and measurement model misspeciﬁcation in marketing and
consumer research, J. Consum. Res. 30 (Sept.) (2003) 199–218.
 M.T.H. Chi, N. de Leeuw, M.H. Chiu, C. LaVancher, Eliciting self-explanations
improves understanding, Cognitive Sci. 18 (1994) 439–477.
 M.T.H. Chi, P.J. Feltovich, R. Glaser, Categorization and representation of
physics problems by experts and novices, Cognitive Sci. (1981) 121–152.
 M.T.H. Chi, R. Glaser, E. Rees, Expertise in problem solving, in: R. Sternberg
(Ed.), Advances in the Psychology of Human Intelligence, Lawrence Erlbaum
Associates Inc., 1982, pp. 17–76.
 N. Chomsky, Syntactic Structures, second ed., Mouton, 2002.
 J. Chong, T. Hurlbutt, The social dynamics of pair programming, in: Proc. 29th
Int’l Conf. Software Engineering, 2007.
 A. Diamantopoulos, J.A. Siguaw, Formative versus reﬂective indicators in
organizational measure development: a comparison and empirical illustration,
Brit. J. Manag. 17 (2006) 263–282.
 S. Duval, R. Tweedie, A nonparametric ‘‘trim and ﬁll” method of accounting for
publication bias in meta-analysis, J. Am. Stat. Assoc. 95 (445) (2000) 89–98.
 S. Duval, R. Tweedie, Trim and ﬁll: a simple funnel-plot-based method of
testing and adjusting for publication bias in meta-analysis, Biometrics 56 (2)
J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122 1121
 T. Dybå, E. Arisholm, D.I.K. Sjøberg, J.E. Hannay, F. Shull, Are two heads better
than one? on the effectiveness of pair programming, IEEE Software, November/
December 2007, pp. 12–15.
 T. Dybå, T. Dingsøyr, G.K.S. Hanssen, Applying systematic reviews to diverse
study types: an experience report, in: Proc. 1st Int’l Symp. Empirical Software
Engineering and Measurement (ESEM’07), IEEE Computer Society, 2007, pp.
 T. Dybå, V.B. Kampenes, D.I.K. Sjøberg, A systematic review of statistical power
in software engineering experiments, J. Inform. Software Technol. 48 (2006).
 T. Dybå, B.A. Kitchenham, M. Jørgensen, Evidence-based software engineering
for practitioners, IEEE Software 22 (2005) 58–65.
 K.A. Ericsson, N. Charness, Expert performance – its structure and acquisition,
Am. Psychol. 49 (1994) 725–747.
 K.A. Ericsson, H.A. Simon, Protocol Analysis, revised ed., The MIT Press, 1993.
 D.R. Forsyth, Group Dynamics, fourth ed., Thomson Wadsworth, 2006.
 S. Freudenberg (née Bryant), P. Romero, B. du Boulay, ‘Talking the talk’: is
intermediate-level conversation the key to the pair programming success
story? in: Proc. AGILE 2007, 2007.
 T. H
rem, D. Rau, The inﬂuence of degree of expertise and objective task
complexity on perceived task complexity and performance, J. Appl. Psychol. 92
(5) (2007) 1320–1331.
 J.E. Hannay, M. Jørgensen, The role of deliberate artiﬁcial design elements in
software engineering experiments, IEEE Trans. Software Eng. 34 (Mar/Apr)
 J.E. Hannay, D.I.K. Sjøberg, T. Dybå, A systematic review of theory use in
software engineering experiments, IEEE Trans. Software Eng. 33 (Feb.) (2007)
 L.V. Hedges, I. Olkin, Statistical Methods for Meta-Analysis, Academic Press
 L.V. Hedges, J.L. Vevea, Fixed- and random-effects models in meta-analysis,
Psychol. Meth. 3 (4) (1998) 486–504.
 J.P. Higgins, S.G. Thompson, J.J. Deeks, D.G. Altman, Measuring inconsistency in
meta-analyses, Brit. Med. J. 327 (4) (2003) 557–560.
 J.E. Hunter, F.L. Schmidt, Fixed effects vs. random effects meta-analysis
models: implications for cumulative research knowledge, Int. J. Select. Assess.
327 (8) (2000) 272–292.
 V.B. Kampenes, T. Dybå, J.E. Hannay, D.I.K. Sjøberg, A systematic review of
effect size in software engineering experiments, Inform. Software Technol. 49
(11–12) (2007) 1073–1086.
 V.B. Kampenes, T. Dybå, J.E. Hannay, D.I.K. Sjøberg, A systematic review of
quasi-experiments in software engineering, Inform. Software Technol. 49 (11–
12) (2007) 1073–1086.
 S.J. Karau, K.D. Williams, Social loaﬁng: a meta-analytic review and theoretical
integration, J. Pers. Soc. Psychol. 65 (4) (1993) 681–706.
 N.L. Kerr, R.S. Tindale, Group performance and decision making, Ann. Rev.
Psychol. 55 (2004) 623–655.
 B.A. Kitchenham, S. Charters, Guidelines for Performing Systematic Literature
Reviews in Software Engineering, Keele University, EBSE Technical Report,
EBSE-2007-01, Tech. Rep., 2007.
 J.M. Levine, R.L. Moreland, Progress in small group research, Ann. Rev. Psychol.
41 (1990) 585–634.
 R.M. Lindsay, A.S.C. Ehrenberg, The design of replicated studies, Am.
Statistician 47 (3) (1993) 217–228.
 M.W. Lipsey, D.B. Wilson, Practical Meta-Analysis, Sage, 2001.
 J.G. March, H.A. Simon, Organizations, second ed., Wiley-Blackwell, 1993.
 J. Miller, Applying meta-analytical procedures to software engineering
experiments, J. Syst. Software 54 (1–2) (2000) 29–39.
 J.C. Nunnally, I. Bernstein, Psychometric Theory, third ed., McGraw-Hill Inc.,
 N. Pennington, Comprehension strategies in programming, in: Proc. Second
Workshop Empirical Studies of Programmers, 1987, pp. 100–113.
 N. Pennington, Stimulus structures and mental representations in expert
comprehension of computer programs, Cognitive Psychol. 19 (1987) 295–341.
 C. Perrow, A framework for the comparative analysis of organizations, Am. Soc.
Rev. 32 (1967) 194–208.
 J.L. Peters, A.J. Sutton, D.R. Jones, K.R. Abrams, L. Rushton, Performance of the
trim and ﬁll method in the presence of publication bias and between-study
heterogeneity, Stat. Med. 26 (2007) 4544–4562.
 R. Rosenthal, M.R. DiMatteo, Meta-analysis: recent developments in
quantitative methods for literature reviews, Ann. Rev. Psychol. 52 (2001)
 D.I.K. Sjøberg, J.E. Hannay, O. Hansen, V.B. Kampenes, A. Karahasanovic
Liborg, A.C. Rekdal, A survey of controlled experiments in software
engineering, IEEE Trans. Software Eng. 31 (9) (2005) 733–753.
 E. Soloway, B. Adelson, K. Ehrlich, Knowledge and processes in the
comprehension of computer programs, in: M. Chi, R. Glaser, M. Farr (Eds.),
The Nature of Expertise, Lawrence Erlbaum Assoc., 1988, pp. 129–152.
 I.D. Steiner, Group Process and Productivity, Academic Press, New York,
 M. Stephens, D. Rosenberg, Extreme Programming Refactored: The Case
Against XP, APress, 2003.
 J.A.C. Sterne, M. Egger, Regression methods to detect publication and other
bias in meta-analysis, in: H.R. Rothstein, A.J. Sutton, M. Borenstein (Eds.),
Publication Bias in Meta-analysis: Prevention, Assessment and Adjustments,
Wiley, 2005, pp. 99–110 (Chapter 6).
 A.J. Sutton, Evidence concerning the consequences of publication and related
biases, in: H.R. Rothstein, A.J. Sutton, M. Borenstein (Eds.), Publication Bias in
Meta-analysis: Prevention, Assessment and Adjustments, Wiley, 2005, pp.
175–192 (Chapter 10).
 J. Voas, Faster, better, and cheaper, IEEE Software 18 (3) (2001) 96–99.
 L. Williams, R.R. Kessler, Pair Programming Illuminated, Addison-Wesley,
1122 J.E. Hannay et al. / Information and Software Technology 51 (2009) 1110–1122