A controlled experiment
on the effects of PSP training:
Detailed description and evaluation
Lutz Prechelt (prechelt@ira.uka.de)
Barbara Unger (unger@ira.uka.de)
Fakultät für Informatik
Universität Karlsruhe
D-76128 Karlsruhe, Germany
+49/721/608-4068, Fax: +49/721/608-7343
http://wwwipd.ira.uka.de/EIR/
Technical Report 1/1999
April 8, 1999
Abstract
The Personal Software Process (PSP) is a methodology for systematic and continuous improve-
ment of an individual software engineer’s software production capabilities. The proponents of the
PSP claim that the PSP methods improve in particular the program quality and the capability for
accurate estimation of the development time, but do not impair productivity.
We have performed a controlled experiment for assessing these and related claims. The experiment
compares the performance of a group of students that have just previously participated in a PSP
course to a comparable set of students from a “normal” programming course. This report presents
in detail the experiment design and setup, the results of the experiment, and our interpretation of
the results.
The results indicate that the claims are basically correct, but the improvements may be a lot smaller
than expected. However, we found an important additional benefit from PSP that is not usually
mentioned by the PSP proponents: The performance in the PSP group was consistently less vari-
able for most of the many variables we investigated. Less variable performance in a software team
greatly reduces the risk in software projects.
Contents

1 Introduction ... 4
  1.1 What is the PSP? ... 4
  1.2 Experiment overview ... 5
  1.3 Related work ... 6
  1.4 Why such an experiment? ... 6
  1.5 How to use this report ... 6

2 Description of the experiment ... 8
  2.1 Experiment design ... 8
  2.2 Hypotheses ... 9
  2.3 Experiment format and conduct ... 10
  2.4 Experimental subjects ... 11
    2.4.1 Overview ... 11
    2.4.2 Education and experience ... 12
    2.4.3 The PSP course (experiment group) ... 14
    2.4.4 The alternative courses (control group) ... 14
  2.5 Task ... 15
    2.5.1 Goals for choosing the task ... 15
    2.5.2 Task description and consequences ... 15
    2.5.3 Task infrastructure provided to the subjects ... 16
    2.5.4 The acceptance test ... 16
    2.5.5 The "gold" program ... 17
  2.6 Internal validity ... 17
    2.6.1 Control ... 17
    2.6.2 Accuracy of data gathering and processing ... 18
  2.7 External validity ... 18
    2.7.1 Experience as a software engineer ... 18
    2.7.2 Experience with psp use ... 18
    2.7.3 Kinds of work conditions or tasks ... 19

3 Experiment results and discussion ... 20
  3.1 Statistical methods ... 20
    3.1.1 One-dimensional statistics ... 20
    3.1.2 Two-dimensional statistics ... 23
    3.1.3 Presentation of results ... 25
  3.2 Group formation ... 25
  3.3 Estimation ... 27
  3.4 Reliability and robustness ... 31
    3.4.1 Black box analysis and white box analysis ... 31
    3.4.2 The test inputs: a, m, and z ... 32
    3.4.3 Reliability measures ... 32
    3.4.4 Inputs with nonempty encodings ... 33
    3.4.5 Arbitrary inputs ... 35
    3.4.6 Influence of the programming language ... 36
    3.4.7 Summary ... 39
  3.5 Release maturity ... 39
  3.6 Documentation ... 42
  3.7 Trivial mistakes ... 42
  3.8 Productivity ... 47
  3.9 Quality judgement ... 48
  3.10 Efficiency ... 49
  3.11 Simplicity ... 51
  3.12 Analysis of correlations ... 51
    3.12.1 How time is spent ... 52
    3.12.2 Better documentation saves trivial mistakes ... 53
    3.12.3 The urge to finish ... 53
  3.13 Subjects' experiences ... 54
  3.14 Mean/median/iqr overview table ... 55

4 Conclusion ... 59
  4.1 Summary of results ... 59
  4.2 Possible reasons ... 59
  4.3 Consequences ... 61

Appendix ... 62

A Experiment materials ... 62
  A.1 Experiment procedure ... 63
  A.2 Questionnaire – personal information ... 64
  A.3 Task description ... 66
  A.4 Questionnaire – estimation ... 70
  A.5 Questionnaire – self-evaluation ... 73
  A.6 Versuchsablauf (experiment procedure, German version) ... 76
  A.7 Fragebogen – persönliche Angaben (questionnaire – personal information, German version) ... 77
  A.8 Aufgabenstellung (task description, German version) ... 79
  A.9 Fragebogen – Selbsteinschätzung (questionnaire – self-assessment, German version) ... 84
  A.10 Fragebogen – Eigenbeurteilung (questionnaire – self-evaluation, German version) ... 87

B Glossary ... 90

Bibliography ... 92
One item could not be deleted because it was missing.
Apple Macintosh System 7 OS error message
Chapter 1
Introduction
Everybody has opinions. I have data.
Watts S. Humphrey
The present report is the definitive and detailed description and evaluation of a controlled experiment comparing
students that received PSP (Personal Software Process) training to other students that received other software
engineering education.
In this first chapter we will discuss the general topic of the experiment, then give a broad overview of the purpose and setup of the experiment, and finally describe related work.
Chapter 2 describes the subjects, setup, and execution of the experiment, relying partially on the original experiment materials as printed in the appendix. It also discusses possible threats to the internal and external validity of the experiment.
Chapter 3 presents and interprets in detail the results obtained in the experiment, and Chapter 4 presents conclusions. The appendix contains the handouts used in the experiment: personal data questionnaire, estimation questionnaire, task description, work time logging sheet, and postmortem questionnaire.
1.1 What is the PSP?
The Personal Software Process (PSP) is a methodology for structuring the work of an individual software
engineer introduced by Watts Humphrey in 1995 [3]. At its core is the notion of an individual’s software
process, that is, the set of procedures used by a single software engineer to do his or her work. The PSP has
several goals:
- Reliable planning capability, i.e., the ability to accurately predict the delivery time of a piece of work performed by a single software engineer.
- Effective quality management, i.e., the ability to avoid introducing defects or other quality-reducing properties into the work products, to detect and remove those that have been introduced anyway, and to improve both capabilities over time. These other quality attributes can be, for instance, the ease of maintenance, reuse, or testing (internal view of the product), or the suitability, flexibility, and ease of use (external view), etc.
- Defining and documenting the software process, i.e., laying down in writing the abstract principles and concrete procedures by which one generally creates software. The purpose of process definition is improving the ability of the process to be traced, understood, communicated, measured, or improved.
- Continuous process improvement, i.e., the capability to continuously identify the relatively weakest points in one's own software process, to develop alternative solutions, to evaluate these solutions, and to subsequently incorporate the best one into the process.
The PSP may also lead to improved productivity but this is not a primary goal; part of the productivity gains may
be offset by the overhead introduced by the PSP, because the means to the above ends are process definition,
process measurement, and data analysis, which lead to a number of additional tasks.
The PSP methodology is taught by means of a PSP course. In its standard form, this is a 15-week training
program requiring roughly one full day per week. According to the experience of both Watts Humphrey and
ourselves, the PSP can hardly be learned without that course, because under the pressure of real-world working
conditions, programmers will only be able to accept and execute the overhead tasks once they have experienced their benefits; but the benefits will only be experienced after the overhead tasks have been performed for quite a while. Hence, the course is needed to provide a pressure-free “playground” for learning about the
usefulness of PSP techniques.
The practical consequence of the PSP course for an individual software engineer is to obtain a personal software process (psp, in non-capital letters). The course provides a set of example methods that serve as a starting point for the development of an individual psp. The methods are reasonable default instantiations of the PSP principles and can be tailored to one's individual preferences and work conditions during later psp usage.
1.2 Experiment overview
The question asked by this experiment is the following:
What (if any) differences in behavior or capabilities can be found when comparing software engi-
neers that have received PSP training to software engineers that have received an equivalent amount
of “conventional” technical training?
The approach used to answer this question is the following:
- Find participants with similar capabilities and backgrounds, except that one group has had previous PSP training and the other has not.
- Let each participant solve the same non-trivial programming task.
- Observe as many features of behavior (process) and result (product) as possible. Examples: total working time, number of compilations, program reliability when the program was first considered functional (acceptance test), final program reliability, program efficiency, etc.
- Formulate hypotheses describing which differences might be expected between the groups with respect to the features that were observed. Example: PSP-trained participants produce more reliable programs.
- Analyze the data in order to test the hypotheses. Describe additional interesting structure found in the data, if any. Interpret the results.
In our case, the participants were graduate students and the programming task involved designing and implementing a rather uncommon search and encoding algorithm. On average, the task size was effectively more than 1 person-day (between 3 and 50 work hours). A total of 55 persons participated in the experiment between August 1996 and October 1998.
1.3 Related work
It is one of the most important principles of the PSP methodology to base decisions on objective measurement data (as opposed to intuitive judgement). Consequently, the PSP course (and also later practice of the PSP) builds a collection of data for each participant from which the development of several attributes of process quality and effective developer capability can be seen. Such data has been described and discussed by Humphrey in several articles and reports, e.g. [4]. Most of these data show the development of a certain metric over time, such as the decreasing density of defects inserted in a program. The main and unavoidable drawback of such data is a lack of control: it is impossible to say how much of the effect comes from each of the possible sources, such as the particular programming problem solved at each time, maturation that would also occur without PSP training (at least given some other training), details of the measurement that cannot be made objective, and, finally, real and unique PSP/psp benefits.
The purpose of the present experiment is to provide data with a much higher degree of comparability: measuring psp effects in a controlled fashion.
We know of no other evaluation work specifically targeting the PSP methodology or the PSP course.
1.4 Why such an experiment?
There are basically two reasons why we need this experiment. First, any methodology, even one as convincing as the PSP, should undergo a sound scientific validation, not only to see whether it works, but also to understand the structure of its effects: Which consequences are visible at all? How strong is their influence? How do they interact?
The second reason is more pessimistic: based on our observations of about a hundred German informatics students, we estimate that only about one third of them will regularly use PSP techniques in their normal work after the course and will really form a psp. Roughly another third appears to be unable to regularly exercise the self-control required for building and using a psp. For the rest, we consider the prospects to be unclear; their PSP future may depend on the kind of environment in which they will work.
Given this estimation it is not clear whether PSP-trained students will be superior to others, even if one is
willing to believe that a PSP education in principle has this effect. The purpose of the experiment is to assess
the average effect as well as look for the results of the above-mentioned dichotomy, if any. If it exists, we may
for instance see a larger variance of performance in the PSP group or maybe even a bimodal distribution having
two peaks instead of just one.
1.5 How to use this report
This report is meant to provide a maximally detailed documentation of the experiment and its results. This has several consequences:
- You need not read the whole report from front to back. The contents are logically structured and it should be easy to find specific information of interest using the table of contents.
- When you encounter a term whose definition you have skipped, refer to the table of contents or the glossary to find it.
- The main text does not try to describe the tasks or questionnaires in any detail but instead relies on the original experiment materials that are printed in the appendix. Please consult the appendix where necessary.
- The results section is rather detailed. We recommend sticking to the text and referring to the diagrams and their captions only at points of particular interest.
Chapter 2
Description of the experiment
Good judgement comes from experience,
and experience comes from bad judgement.
Anonymous
This chapter will describe the experiment design, the hypotheses to be investigated by the experiment, the
experiment procedure, the motivation and background of the participants, and the task to be solved. In the final
sections we will discuss possible threats to the internal and external validity of the experiment.
2.1 Experiment design
As mentioned before, the general goal of the experiment is to investigate differences in performance or behavior
between
- persons that have received PSP training shortly before, and
- "other" people.
The resulting basic experimental design is very simple: it is a two-group, posttest-only, inter-subject design with a single binary independent variable, namely "subject has received PSP training". (We will use the terms "subject" and "participant" interchangeably to refer to the persons who participated in our experiment as experimental subjects.) We will call the two groups P (for "PSP-trained") and N (for "not PSP-trained"). The set of dependent variables to be used is not at all simple, though. To avoid unnecessary repetition, these variables will be introduced step by step during the presentation of the results in Chapter 3.
In order to maximize the power of the experiment in view of the large variations in individual performance
that are to be expected, one would ideally want to pair participants with similar expected performance (based
on available knowledge about the participants’ background) and put one person of each pair into each group.
Unfortunately, this is not possible in the present setting because group membership is determined by the previ-
ous university career of each participant that can not be controlled by the experimenter. See also the following
section.
2.2 Hypotheses
As mentioned in the overview in Section 1.2, the general purpose of the experiment is investigating behavior differences (and their consequences) between PSP-trained and non-PSP-trained subjects. To satisfy such a broad ambition, we must collect and analyze data as comprehensively as is feasible (from both a technical/organizational and a psychological point of view).
However, to guide the evaluation of this data it is useful to formulate some explicit expectations about the
differences that might occur. These expectations are formulated in this section in the form of hypotheses. The
experiment shall investigate the following hypotheses:
Hypothesis H1: Estimation. Since effort estimation is a major component of the PSP course, we expect that
the deviations of actual from expected development time are smaller in group P compared to group N.
(Note that this hypothesis assumes that the task can be considered to be from a familiar domain, because
otherwise the PSP planning may break down and the results become unpredictable.)
Hypothesis H2: Reliability. Since defect prevention and early defect removal are major goals throughout the
PSP course, we expect the reliability of the program for “normal” inputs to be higher in group P compared to
group N.
Hypothesis H3: Robustness. Since producing defect-free programs is an important goal during the PSP
course, we also expect the reliability of the program for “surprising” sorts of inputs to be higher in group P
compared to group N.
(Note that the term “robustness” is somewhat misleading here because robustness against illegal inputs was
explicitly not a requirement in the experiment.)
Hypothesis H4: Release maturity. We expect that group P will typically deliver programs in a relatively more
mature state compared to group N. When the requirements are invariant, release maturity can be represented by
the fraction of overall development time that comes after the first program release.
(In the experiment, releasing a program will be represented by the request of the participant that an acceptance
test be performed with the program.)
Hypothesis H5: Documentation. Since the PSP course places considerable weight on making a proper design and design review, and on product quality in general, we expect that there will be more documentation in the final product in group P compared to group N.
Hypothesis H6: Trivial mistakes. Quality management as taught in the PSP course is based on the principle of taking even trivial defects seriously; hence we expect a lower number of simple-to-correct defects in group P compared to group N.
Hypothesis H7: Productivity. For programs that are difficult to get right, the PSP focus on early defect
detection saves a lot of testing and debugging effort. The overhead implied by planning and defect logging
usually does not outweigh these savings. What might outweigh the savings, though, is if participants produce
a thorough design documentation that accompanies the program, e.g. in the form of long comments in the
source code. Such behavior is also expected to be more likely for PSP-trained participants. Note that since
all programs were built to conform to the same requirements, productivity can be defined simply as 1 divided
by the total work time. Now for the actual hypothesis: We expect group P to complete the program faster
compared to group N, at least if one subtracts the effort spent for documentation.
Hypothesis H8: Quality judgement. The PSP quality management tracks the density of defects in a program
as seen during development and even as predicted before development starts. We speculate that this might also
lead to more realistic estimates of the defect content and reliability of a final program. Hence, we expect to see
more accurate estimates of final program reliability in group P compared to group N.
Speculative hypothesis H9: Efficiency. PSP quality management suggests preferring simpler solutions over "clever" ones. This might lead to programs that run more efficiently or use less memory, but might also lead to programs that run more slowly or require more memory. We hypothesize that we may find differences in speed and memory consumption between groups P and N, although we cannot say in advance which form these differences will take.
Speculative hypothesis H10: Simplicity. The more careful design phase presumably performed by PSP-trained subjects might lead to simpler and shorter code in group P compared to group N.
Note that this set of hypotheses is quite bold and the odds are not the same for each hypothesis: For estimation
and reliability, for example, we clearly expect to see some advantage of the P group, while for efficiency
expecting any differences is pure speculation. The results for documentation and productivity must be treated with care, because the experiment instructions first and foremost called for reliability, not for productivity or maintainability.
All of these hypotheses are expected to hold more strongly if we look at only the better half of the participants in each group, because then, according to the remark in Section 1.4, the PSP group presumably consists mostly of subjects that indeed use a psp. In Section 3.2 on page 25, we will form various subgroups for such comparisons.
As mentioned above, the purpose of the experiment is not limited to formally testing these hypotheses, but includes all other appropriate analyses that might provide insights towards understanding the behavior differences and their effects. Consequently, we will use the hypotheses somewhat loosely in order not to be distracted from possibly more important other observations.
2.3 Experiment format and conduct
After some trial runs in August 1996, the experiment started in February 1997 and finished in October 1998.
With a few exceptions, the participants worked during the semester breaks from mid-February to mid-April
or mid-July to mid-October. Subjects announced their participation by email and agreed on an appointment,
usually starting at 9:30 in the morning.
Each subject could choose the programming language to be C, C++, Java, Modula-2 or Sather-K; two subjects
would have preferred Pascal. The compilers used were gcc, g++, JDK, mocka, and sak, respectively. An
account was set up for each subject on a Sun Unix workstation. At most three such accounts were in use
at any given time. The account was set up so as to log activity, in particular each version of the source code that was submitted to the compiler. A subject was told about this monitoring on request. Due to a
mistake made by one experimenter (Prechelt) when adapting the setup of the participant accounts to a change
in the software environment of our workstations (switch to a new version of the Java development kit), the
monitoring mechanism was corrupted and nonfunctional during a significant part of the experiment and many
of these compilation protocols were lost. In order to provide as natural a working environment as possible and
avoid irritating the participants, no direct observation, video recording, or other kind of visible monitoring was
performed.
A subject was allowed to fetch auxiliary tools or data by FTP and install them for the experiment. For instance, a few subjects brought their own editor or re-used small parts of previously written programs (e.g. file handling routines). Some, but not all, PSP subjects brought their estimation data and/or PSP tools.
The subject was then given the first three parts of the experiment materials: first the personal information
questionnaire, then the task description, and then the estimation questionnaire (see Appendix A). After filling in the latter, the subject was left alone and worked according to a schedule he could choose freely. Only three restrictions were made: first, to always work in this special account; second, to use only one source code file;
and third, to log the working time on the special time logging sheet we had provided along with the other materials. Each of these restrictions was violated by a few (but not many) of the participants.
The subject was told to ask if he encountered technical problems with the setup or if he felt something was
ambiguous in the requirements. Technical problems occurred frequently and were then resolved on the spot.
In order not to bias the results, the time required for resolving the problems (between 5 and 30 minutes per
participant) is included in the work time. Inclusion is required because some subjects may have chosen to
resolve the problem alone. Questions about the requirements were asked by about a dozen participants; in all cases they were told to read the requirements more closely, because the apparent ambiguity was indeed properly resolved in the description they had received.
The participant was asked not to cooperate with other participants working at the same time or with earlier
participants; we have not found any evidence of such cooperation and believe that essentially none has occurred.
The subject was further told to ask for an acceptance test at any time if he felt his program would now work
correctly.
If the acceptance test failed, the subject was encouraged to analyze the problems, correct the program, and try
again. Once the acceptance test was passed (or the subject gave up), the participant was given the postmortem
self-evaluation questionnaire. After finishing and returning the questionnaire, the subject was paid (see Section 2.4.1) and all data was copied from the subject’s experiment account to a safe place. In a few cases, the
subjects asked for (and were granted) some additional time for improving the program after the acceptance
test was passed. Typically this was used for cleaning up unused code sections and inserting comments into the
program.
2.4 Experimental subjects
This section will describe the background of the experimental subjects. Often in this report we refer to a few of
the subjects individually by name: these names are letter/number-combinations in the range from s12 to s102.
2.4.1 Overview
The experiment had 50 participants, 30 of them in the PSP group and 20 in the control group. Our 50 participants fall into the following categories with respect to their motivation: 40 of them were obliged to participate in the experiment as part of a lab course they took (29 from two PSP courses and 11 from a Java course, see the description in Sections 2.4.3 and 2.4.4). Note that the obligation included only participation, not success: even those participants who gave up during the experiment passed their course. One PSP participant, s045, retracted all of his materials from the experiment and will not be included in any of the analyses.
The other 10 participants were volunteers. Eight of them came from other lab courses in our department. Four of these were highly capable students who expressly came to “prove that PSP-trained people are not better than others” (s020, s023, s025, and s034; interestingly, two of these four, s020 and s034, later participated in the PSP course). Two participants (s081, s102) are actually “repeaters”: they had already participated in the experiment in February/March 1997 as non-PSP subjects and voluntarily participated again in June 1998 and March 1998, respectively, after they had taken the PSP course. Quite obviously, these differences need to be taken into account during data analysis; see Section 3.2 for a description of the actual groups compared in the analysis.
All participants were motivated towards high performance by the following reward system: They would be
paid DM 50 (approximately 30 US dollars) for successfully participating in the experiment (i.e., passing the
acceptance test). However, each failed acceptance test reduced the payment by DM 10.
2.4.2 Education and experience
The following information excludes the two repeater participants, the one participant from the PSP group who
retracted all of his materials from the experiment, and those participants for which the particular piece of
information was not available (“missing”).
The participants were in their 4th to 17th semester at the university (median 8, see Figure 2.1; please refer
to Section 3.1.1 on page 20 for an explanation of the plot). With the exception of two highly capable fourth-semester students (s023 and s050), all of the students were graduate students (after the “Vordiplom”).
[Figure 2.1: Distribution of semester number of subjects in the PSP group (P), the non-PSP group (N), and both together (all). x-axis: current semester number, 4 to 16.]
The participants had programming experience of 3 to 14 years (median 8 years, see Figure 2.2), with an estimated 10(sic!) to 15000(sic!) actual programming work hours beyond the programming exercises performed in their university education (median 600 hours, see Figure 2.3).
[Figure 2.2: Distribution of years of programming experience in the PSP group (P), the non-PSP group (N), and both together (all). x-axis: total programming experience in years, 4 to 14.]
[Figure 2.3: Distribution of hours of non-university programming experience in the PSP group (P), the non-PSP group (N), and both together (all). x-axis: total programming experience in work hours, 0 to 5000. There is one point at 15000 in N.]
During that time, each of them had written an estimated total of 4 to 2000(sic!) KLOC (one KLOC is a thousand lines of code; for roughly half of the participants this includes comments, for the others it includes only statements), with a median of 20, see Figure 2.4. The estimated total number of lines the subjects had written in the language they had chosen for the experiment was from 0.5 to 100 KLOC (median 5 KLOC, see Figure 2.6). The few extremely high values
that occur in most of these variables show that there are a few quite extraordinary subjects in the sample, in
particular in the non-PSP group.
[Figure 2.4: Distribution of total KLOC written in the PSP group (P), the non-PSP group (N), and both together (all). x-axis: total programming experience in KLOC, 0 to 500. There is one point at 2000 in N.]
[Figure 2.5: Distribution of size of largest program written in the PSP group (P), the non-PSP group (N), and both together (all). x-axis: size of largest program written in KLOC, 0 to 60.]
[Figure 2.6: Distribution of programming experience (in KLOC) in the programming language used during the experiment by each individual subject, in the PSP group (P), the non-PSP group (N), and both together (all). x-axis: programming language experience in KLOC, 0 to 100.]
Looking at these data, our two main groups appear to be reasonably balanced. The apparently somewhat larger
values in the N group for total experience in KLOC and size of largest program may be spurious, because the
non-PSP participants are more likely to over-estimate their past productivity, as we will see in Section 3.3. Two subjects (s034 and s043) appear among the top three for five and for three of the five programming experience measures, respectively; both are in the N group.
Note that the distribution of programming languages differs between the N and P group, see Figure 2.7. There
is a relatively larger fraction of Java users in the N group.
[Figure 2.7: Number of participants in the N group (left) and the P group (right) that have used each programming language (C, C++, Java, Modula-2, Sather-K).]
2.4.3 The PSP course (experiment group)
The PSP methodology is taught by means of a PSP course. In its standard form and as taught in our department,
this is a 15-week training program consisting of 15 lectures of ninety minutes each, 10 programming exercises,
and 5 process exercises. This course requires roughly one full day per week.
Each exercise is submitted to a teaching assistant who carefully checks the correctness of the materials (with respect to process) and marks any problems s/he finds. The materials are handed back to the students, who have to resubmit them in corrected form until everything is OK. The focus of the course is on the following
topics: working towards (and then according to) a well-defined, written-out software process, learning and
using systematic planning and estimation based on personal historical data, and defect prevention and removal
by means of managed personal reviews, defect logging, defect data analysis, and performance-data-controlled
process changes.
The participants of the P group are from the second and third times we taught the PSP course. For the most part, we used the agenda defined by Humphrey in his book [3] on page 746. The participants needed to submit all
exercises in good shape (with resubmissions if corrections were necessary) in order to pass the course. The few participants who did not pass the course dropped out voluntarily during the semester; nobody was explicitly expelled.
2.4.4 The alternative courses (control group)
The volunteers of the N group (s014, s017, s020, s023, s025, s028, s031, s034) came from various other lab
courses.
The non-volunteers of the N group all came from an advanced Java course (“component software in Java”);
many of them had previously also participated in a basic Java course we had taught the year before. The course followed a compressed schedule, running over only 6 weeks of the 13-week semester, but requiring a very high time investment during that time. In terms of the total amount of code produced, this
course is quite similar to the PSP course, although it had only 5 larger exercises instead of 10 smaller ones. The
course content was highly technical, covering the then-new Swing GUI classes, localization and internationalization, serialization and persistence, reflection and JavaBeans, and distributed programming (Remote Method
Invocation). The programs were submitted to the course teachers and tested in a black-box fashion. Participants
needed to score 70% of all possible points (based on correctly implemented functionality) in order to pass the
course.
2.5 Task
This section will briefly describe the task to be solved in the experiment and explain why we chose it. For details, please refer to the original task description on page 66 in the appendix.
2.5.1 Goals for choosing the task
The tasks to be used in our experiment should have the following properties:
1. Suitable size. Obviously, the task should not be too small, in order to provide enough interesting data; a trivial task would have too little generalizability. The task was planned to take about 4 or 5 hours for
a good programmer, so that most participants would be able to finish within one day, when they started
in the morning. (It later turned out that only 28% of the participants were able to finish on the day they
started and 46% took more than two days.)
2. Suitable difficulty. Most if not all of the participants should be able to complete the task successfully.
In particular, it must be possible to solve the task without inventing an algorithm or data structure that
requires high creativity.
Furthermore, the application domain of the task had to be readily understandable by all subjects.
On the other hand, it must be possible to make subtle mistakes or ruin the efficiency so that there can be
sufficient differences in the work products among even the successful solutions.
3. Automatic testability. In order to test the quality of the solutions thoroughly and objectively it must be
possible to run a rather large number of tests without human intervention. In particular, the acceptance test
should be automatic and entirely objective.
2.5.2 Task description and consequences
From these requirements, we chose the following task:
Given a list of long “telephone numbers” and a “dictionary” (list of words), encode each of the telephone numbers by one word or a sequence of multiple words in every possible way according to a fixed, prescribed letter-to-digit mapping. A single digit may stand for itself in the encoding between two words under certain circumstances.
Read the phone numbers and the dictionary from two files and print each resulting encoding to standard output in an exactly prescribed format.
Please see the exact task description on page 66 for the details and for input/output examples.
The above functionality can be coded in about 150 statements in any programming language that has reasonable string handling capability. Understanding the requirements exactly and producing an appropriate search algorithm is not trivial, but certainly within the capabilities of the participants. Various details give enough room for gross or subtle mistakes, e.g. handling the special characters allowed in the phone numbers (slash, dash) or the words (quote, dash), always producing the correct output format, or handling all cases of digit-insertion correctly. The algorithmic nature of the problem is simple to understand for all subjects, regardless of specific backgrounds, and the search algorithm gives room for enormous differences in the resource consumption (both space and time) of the resulting program. The batch job character of the program makes automatic testing possible, and the simple structure of the input data even allows for fully automatic generation of test cases once a correct “gold” program has been implemented. This allowed for generating new data for each acceptance test on the fly.
During the evaluation of the experiment it turned out that the differences among the solutions were even larger
than expected.
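
To make the structure of the search concrete, the following sketch shows one plausible recursive solution in Python. It is illustrative only, not the authors' gold program: the letter-to-digit mapping shown and the rule "a single digit may stand for itself exactly where no dictionary word fits" are stand-ins reconstructed in the spirit of the task description in Appendix A.3, and details such as dashes and slashes in the phone numbers are omitted.

    # Illustrative sketch of the encoding task (not the gold program).
    # Assumptions: the letter-to-digit mapping below and the digit-insertion
    # rule are simplified stand-ins for the rules in the task description;
    # non-digit characters in phone numbers (dash, slash) are ignored here.

    CHARMAP = {c: str(d)
               for d, letters in enumerate(
                   ["e", "jnq", "rwx", "dsy", "ft", "am", "civ", "bku", "lop", "ghz"])
               for c in letters}

    def word_digits(word):
        """Digit string a word encodes; unmapped characters (quote, dash)
        are ignored."""
        return "".join(CHARMAP[c] for c in word.lower() if c in CHARMAP)

    def encodings(digits, dictionary, prev_was_digit=False):
        """Yield every encoding of a digit string as a list of words, with a
        single digit allowed only where no dictionary word matches."""
        if not digits:
            yield []
            return
        word_found = False
        for word in dictionary:
            enc = word_digits(word)
            if enc and digits.startswith(enc):
                word_found = True
                for rest in encodings(digits[len(enc):], dictionary, False):
                    yield [word] + rest
        if not word_found and not prev_was_digit:
            for rest in encodings(digits[1:], dictionary, True):
                yield [digits[0]] + rest

    # Tiny usage example with a made-up dictionary:
    for enc in encodings("562482", ["mir", "Mix", "Tor", "so"]):
        print(" ".join(enc))     # prints "mir Tor" and "Mix Tor"

Note that this sketch scans the whole dictionary at every position; that is the slow end of the spectrum of solutions hinted at in Section 2.5.5.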
2.5.3 Task infrastructure provided to the subjects
Along with the task description, the following task-specific infrastructure was provided to the participants:
- the miniature dictionary test.w and the small phone number input file test.t used in the example presented in the task description,
- a file test.out containing the correct output for these inputs,
- a large dictionary called woerter2 containing 73220 words.
The same large dictionary was also used during the evaluation of all programs presented in this report; a fact
which we told the participants upon request.
2.5.4 The acceptance test
The acceptance test worked as follows: For each test, a new set of 500 phone numbers was created and the
corresponding correct output computed using the gold program. This took only a few seconds. The dictionary
used was a random but fixed subset of 20946 words from the woerter2 dictionary.
Then the candidate program was run with these inputs and the outputs were collected by an evaluation Perl
script. This script matches the outputs of the candidate program to the correct outputs and computes the
reliability of the candidate program. The evaluation script stops the candidate program if it is too slow: a cumulative timeout of 30 seconds per output was applied, plus a 5-minute bonus for loading the dictionary at the start. This means that, for instance, the 40th output must be produced before 25 minutes of wall clock
time are over or else the program will be stopped and its reliability judged based on the outputs produced so
far. Many programs did indeed run for half an hour or more during the acceptance test; the number of expected
outputs varied from 25 to 248 with a typical range of 40 to 80.
At the end of the acceptance test the following data was printed by the evaluation script: The sorted actual
output of the candidate program, the sorted expected output (i.e., the output of the gold program), a list of
differences in the form of missing correct outputs and additional incorrect outputs, and the resulting reliability
in percent.
The exact notion of reliability will be defined in Section 3.4.3 under the name of output reliability. A minimum
output reliability of 95 percent was required for passing the acceptance test. However, in their final acceptance test, with only two exceptions, all programs either achieved 100 percent or failed entirely.
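
The matching step of the evaluation script can be pictured as follows. This is a hedged sketch in Python rather than the original Perl, and the formula shown (share of correct lines among all expected or produced lines) is only one plausible reading of the output reliability definition given in Section 3.4.3.

    # Sketch of the acceptance-test scoring. Assumptions: output order is
    # irrelevant (both outputs are sorted before comparison), and
    # reliability penalizes both missing and superfluous lines.

    def output_reliability(actual_lines, expected_lines):
        """Return reliability in percent plus the two difference lists."""
        actual, expected = set(actual_lines), set(expected_lines)
        union = actual | expected
        if not union:
            return 100.0, [], []
        missing = sorted(expected - actual)       # correct outputs not produced
        superfluous = sorted(actual - expected)   # additional incorrect outputs
        reliability = 100.0 * len(actual & expected) / len(union)
        return reliability, missing, superfluous

    # Hypothetical file names, for illustration only:
    rel, missing, superfluous = output_reliability(
        open("candidate.out").read().splitlines(),
        open("gold.out").read().splitlines())
    print("reliability: %.1f%% (pass at >= 95%%)" % rel)

The timeout budget can be checked in the same spirit: the n-th output must appear within 300 + 30n seconds of wall clock time, e.g. 300 + 30*40 = 1500 seconds (25 minutes) for the 40th output, as in the example above.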
2.5.5 The “gold” program
Given this style of acceptance test, the aforementioned “gold” program obviously plays a rather important role in this experiment. The gold program was developed by Lutz Prechelt together with the development of the requirements. The initial requirements turned out to be too simple, so the rules for allowing or forbidding digits in the encoding were made up and added to the requirements during the implementation of the gold program.
The gold program, called phonewrd, was developed using a psp in July 1996 during three sessions of about six hours total. The total time splits into 126 minutes of design (including global design, pseudocode development, and test development), 93 minutes of design review, 72 minutes of coding, 38 minutes of code review, and 19 minutes of compilation. The program ran correctly upon the first attempt and no defect was ever found after completion; this despite numerous claims of participants that “the acceptance test program is wrong. My program works correctly!”. 19 defects were originally introduced in the design and pseudocode, 11 of which were found during design review; 6 defects were introduced during coding and found during code review or compilation. These values include trivial mistakes such as syntactical errors. The program was written in C with refinements (http://wwwipd.ira.uka.de/˜prechelt/sw/#crefine), which turned out to be a superbly suitable basis for this problem.
The initial program was only modestly efficient (trying 10% of the dictionary for each digit). A few days
later it was improved into a more efficient version (called phonewrd2, trying only 0.01% of the dictionary for each digit), which also worked correctly right from the start. This second version was used throughout the experiment.
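
The report states only the hit rates (10% vs. 0.01% of the dictionary tried per digit), not how phonewrd2 achieves its speed. One standard technique, sketched below under that assumption, is to index the dictionary by each word's digit encoding, so that at every position only the prefixes of the remaining digit string need to be probed (word_digits as in the earlier sketch in Section 2.5.2):

    from collections import defaultdict

    # Assumed optimization, not taken from the report: index words by
    # their full digit encoding.
    def build_index(dictionary, word_digits):
        index = defaultdict(list)
        for word in dictionary:
            enc = word_digits(word)
            if enc:
                index[enc].append(word)
        return index

    def candidates(index, digits, max_enc_len):
        """Words whose encoding is a prefix of the remaining digits; only
        prefixes up to the longest encoding are probed, instead of
        scanning the whole dictionary."""
        for length in range(1, min(max_enc_len, len(digits)) + 1):
            yield from index.get(digits[:length], ())

With such an index, the per-position cost depends on the number of distinct prefix lengths rather than on the dictionary size, which is consistent with the reported drop in the fraction of the dictionary tried.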
2.6 Internal validity
There are two sources of threats to the internal validity of an experiment: insufficient control of relevant variables, or inaccurate data gathering or processing. (Definition from [1]: internal validity refers to the extent to which we can accurately state that the independent variable produced the observed effect.)
2.6.1 Control
Controlling the independent variable means holding all other influential variables constant and varying only
the one under scrutiny. Controlling the dozens of possibly relevant variables in human-related experiments is
usually done by random sampling of participants into the experiment groups and subsequent averaging over
these groups: variation in any other than the controlled variable is then expected to cancel out.
However, random sampling is difficult for software engineering experiments, because they require such a high
level of knowledge. The problem becomes particularly pronounced if the independent variable is a specific
difference in education: neither can we randomly sample from a large group of possible participants, nor can
we freely assign each of them into a group chosen at random. Instead, we are confined to a small number
of available subjects with the proper background and, worse yet, typically each of them fits into only one of
the groups, because we cannot impose a certain education on the subjects and withhold the other; they choose
themselves what they want to learn.
In the given experiment this means that the preferences that let the subjects choose one course and not the other
could in principle be related to the results observed. We cannot prove that there is no such effect, but based on
our personal knowledge of the individuals in both courses, we submit that we cannot see any severe difference
in their average capabilities. In fact, because they liked us as teachers, several participants from the PSP course
later also took the other course and vice versa.
2.6.2 Accuracy of data gathering and processing
Inaccurate data gathering or processing is unlikely as there was very little manual work involved in this respect.
Instead, most data gathering and almost all data processing was automated. The scripts and programs were
carefully developed and tested and their results again scrutinized. Many consistency checks were applied for
detecting possible mistakes. One remaining problem is missing data, which is almost inevitable in any large-scale experiment. The consequences of missing data, if any, will be discussed for each dependent variable in the results sections below.
2.7 External validity
Three major factors limit the generalizability (external validity) of this experiment: longer experience as a software engineer, longer experience with psp use, and other kinds of work conditions or tasks. (You may perhaps want to skip the rest of this section until you have seen the results and their discussion.)
2.7.1 Experience as a software engineer
One of the most frequent critiques applied to controlled experiments performed with student subjects is that the
results cannot be transferred to software professionals.
The validity of this critique depends on the actual task performed in the experiment: if the task requires very specific knowledge, such as the capability to properly use complex tools, esoteric notations, or uncommon processes, then the critique is probably valid. If, on the other hand, only very general software production abilities are required in the task, a graduate student group does not perform much differently from a group of professionals: it is known that experience is a rather weak predictor of performance within a group of professionals, and in fact these same students will soon be professionals themselves.
The task in this experiment is very general, requiring only knowledge that is taught during undergraduate university education. Hence, we can expect to find relatively little difference in the behavior of our student subjects compared to professionals. We can imagine only two differences that might be relevant.
First, professionals taking a PSP course after some professional experience may often be much more motivated towards actually using PSP techniques, because due to previous negative experiences they have a much clearer conception of the possible benefits than students have. Student background often involves only relatively small projects with comparatively little schedule pressure and little need for relying on the work of colleagues. This motivation difference, if present, should accentuate the differences between PSP and non-PSP groups for professionals.
Second, some of the less gifted students may later pick a non-programming job, resulting in a sort of clean-up
(smaller variance of performance in the lower part) in a group of professionals compared to a group of students.
The possible consequences of this effect, if it exists, on the difference between PSP and non-PSP groups are
unclear.
2.7.2 Experience with psp use
The present experiment investigates performance and behavior differences shortly after a PSP course. One should be rather careful when generalizing these results to persons that have been using a psp for some longer time, say, two years.
It is plausible that in those cases where differences between the PSP group and the non-PSP group were found,
these differences will become more pronounced over time. However, in a PSP-averse work environment it is also conceivable that differences wear off over time (because PSP techniques are no longer used), and it is unclear whether differences may emerge over time where none were observed in the experiment.
It would definitely be important to run a similar experiment much longer after a PSP (or other) training.
2.7.3 Kinds of work conditions or tasks
As mentioned above, it is plausible that differences due to PSP training may be reduced by a working environment that discourages the sort of data gathering implied by PSP techniques; the level of actual PSP use may simply drop. Conversely, the effects might also become more pronounced, for instance if the tasks are very difficult to get right, if the work conditions demand accurate communication of technical decisions, or if accurate planning can reduce the stress due to schedule pressure.
Furthermore, some of the PSP participants may have taken the experiment task too lightly and may have under-used their psp in comparison to their standard professional working behavior. For instance, some of them did not bring their PSP tools or PSP estimation data.
All of this is unknown, however, so adequate care must be exercised when applying the results of this experi-
ment to such different situations.
Chapter 3
Experiment results and discussion
I don’t know the key to success,
but the key to failure is to please everybody.
Bill Cosby
This chapter presents and interprets the results of the experiment. The first section explains the means of statistical analysis and result presentation that we use and why they were chosen. Section 3.2 describes
how, exactly, the groups to be compared in the analysis were derived from the raw groups of PSP and non-PSP
subjects. The following sections present the results (objective performance); the analysis is organized along the
hypotheses listed in Section 2.2. (Warning: The amount of detail in the diagrams and captions may overwhelm
you. The main text, however, is short and easy to read.) Two final sections describe findings from an analysis of
correlations between variables and findings from analyzing the answers of the subjects given in the postmortem
questionnaire.
3.1 Statistical methods
In this section we will describe the individual statistical techniques (including the graphical presentations) used
in this work for assessing the results. For each technique, we will describe its purpose, the meaning of its
results, and its caveats and limitations. The analyses and plots were made with S-Plus 3.4 on Solaris, and we will briefly indicate the names of the relevant S-Plus functions in the descriptions as well.
3.1.1 One-dimensional statistics
For most of this report, we will simply compare the values of a single measurement for all of the participants
in the PSP group against the participants in the non-PSP group. The simplest form of such a comparison is
comparing the arithmetic mean of the values in one group against the mean of the values in the other. However,
the mean can be very misleading if the data contains a few values that are very different from the rest. A more
robust basis for a comparison is hence the median, that is, the value chosen such that half of the values in the
group are smaller or equal and the other half are greater or equal. In contrast to the mean, the median is not
influenced by how far away from the rest the most extreme values are located.
Possibly we are not only interested in the average behavior of the groups, but also in the variation (variability, variance) within each group. In our context, smaller variation is usually preferable, because it means more predictable software development. One way of assessing variation is the standard deviation. If the underlying data follows a normal distribution (the Gaussian bell curve), about two thirds of the data (68%) will lie within plus or minus one standard deviation from the mean. However, if the data does not follow a normal distribution, the standard deviation is plagued by the same problem as the mean: a few far-away values will influence the result heavily, and the resulting standard deviations can be very misleading. Software engineering data often has such values, and hence the standard deviation is not a reliable measure of variation. Instead, we will often use the interquartile range and similar measures, which we explain now in a graphical context.
A flexible and robust way of comparing two groups of values for both average (statisticians speak of “location”)
and variation (called “spread”) is the boxplot, more fully called box-and-whisker plot (S-Plus: bwplot()).
You can find an example in Figure 3.4 on page 28. The data for the PSP group is shown in the upper part,
the data for the non-PSP group in the lower part. The small circles indicate the individual values, one per
participant. Only their horizontal location is important, the vertical “jittering” was added artificially to allow
for discriminating values that happen to be at the same horizontal position. The width and location of the
rectangle (the “box”) and the T-shaped lines on its left and right (the “whiskers”) are determined from the data
values as follows. The left edge of the box is located such that 25% of the data values are less than or equal
to its position, the right edge is chosen such that 75% of the data values are less than or equal to its position (which means that 25% are greater than or equal to that value). The position of the left edge is called the 25-percentile or 25% quantile or first quartile; the right edge is correspondingly called the 75% quantile or third quartile.
Similarly, the left and right whiskers indicate values such that exactly 10% of the values are smaller or equal
(left whisker, 10-percentile) or 10% are larger or equal (right whisker, 90-percentile), respectively. The fat dot
within the box marks the median, which could also be called 50-percentile, 50% quantile, or second quartile.
Note that different percentiles can be the same if there are several identical data values (called ties), so that, for
instance, whiskers can be missing because they are identical with the edge of the box or the median dot can lie
on an edge of the box etc.
Boxplots allow for easy comparison of both spread and location of several groups of data. For comparing spread, one can concentrate either on the width of the boxes (called the inter-quartile range or iqr) or on the width of the whole boxplots; for comparing different aspects of location (namely the location of the lower half, upper half, or middle half of the data points), one can concentrate on particular box edges or on the median dots.
Note that for distributions that have only few distinct values (typically all small integers) and therefore contain
many ties, differences in the width of the box or the location of any of the quartiles can be misleading because
it may change a lot if only a single data value changes. Figure 3.26 on page 41 shows a simple example. The
two distributions are similar, but the boxplots look quite different. A similar caveat applies when the number
of data values plotted is small, e.g. less than ten.
Our boxplots have one additional feature: the letter M in the plot indicates the location of the mean and the
dashed line around it indicates a range of plus or minus one standard error of the mean. The latter quantifies
the uncertainty with which the mean is estimated from the data and decreases with decreasing standard deviation
of the data and with an increasing number of data points. For about 68% of all data samples of the given size
taken from the same population, the sample mean will lie within this standard error band. For symmetric
distributions, the mean is equal to the median. However, in our data, many distributions are skewed, i.e., the
data is less dense on one end of the distribution than on the other. In this case, the mean will lie closer to the
less dense end than the median.
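To make these summary statistics concrete, here is a minimal sketch in modern R (the report itself used S-Plus 3.4, whose syntax is nearly identical; the data vector is invented for illustration):

  x <- c(3, 4, 4, 5, 6, 6, 7, 60)   # invented data with one far-away value
  mean(x)                           # 11.875: pulled up strongly by the outlier
  median(x)                         # 5.5:    robust against the outlier
  quantile(x, c(0.25, 0.75))        # first and third quartile (the box edges)
  IQR(x)                            # inter-quartile range (the box width)
  sd(x) / sqrt(length(x))           # standard error of the mean (the dashed band around M)

The contrast between mean and median on this small vector illustrates why we prefer robust measures for skewed software engineering data.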
A string such as 2 missing (e.g. on the left edge of Figure 3.5) indicates that two of the data points in the
sample had missing values and hence are not shown in the plot.
When comparing two distributions, say in a boxplot, it is often unclear whether observed differences in location should be considered accidental or real. This question can be assessed by a statistical hypothesis test. Such a test computes the probability that a difference as large as the observed one (of, say, the means) would occur if the underlying distributions in fact had the same mean.
The classical statistical test for comparing the means of two samples is the t-test (S-Plus: t.test()). Unfortunately, this test assumes that the two samples each come from a normal distribution and that these distributions have the same standard deviation. Most of our data, however, has a distribution that is non-normal in many ways. In particular, our distributions are often asymmetric. For such data, the t-test may produce misleading results and should thus not be used. Sometimes asymmetric data can be transformed into normally distributed data by taking e.g. the logarithm, and the t-test will then produce valid results, but this still requires postulating a certain distribution underlying the data, which is often not warranted for software engineering data, because too little is known about its composition.
We will use two replacements for the t-test that do not require assumptions about the actual distribution of the data: Bootstrap resampling of mean differences and the Wilcoxon test.
The Wilcoxon rank sum test, also known as the Mann-Whitney U test (S-Plus: wilcox.test()), compares the medians of two samples and computes the probability that the medians of the underlying distributions are in fact equal. This probability is called the p-value and is usually presented as the test result, a number between 0 and 1. When the p-value is sufficiently small, i.e., the difference is probably not due to chance alone, we call the difference significant. We will call a difference significant if p < 0.05. The Wilcoxon test does not make any assumptions about the distribution of the data, except that the data must come from a continuous distribution, so that there are never any ties. Fortunately, there is a modified test (the Wilcoxon test with Lehmann normal approximation) that can cope with ties as well, and we will use this extension where necessary.
Note that a Wilcoxon test may find the estimated median of a sample b to be larger than that of a sample a even if the actual sample median of b is smaller! Here is an example:

a: 1 2 3 4 5 6 7 8 9 10 21 22 23 24 25 26 27 28 29 30 31
b: 10 11 12 13 14 15 16 17 18 19 20 30 31 32 33 34 35 36 37 38 39

Obviously b should be considered larger than a, and in fact the Wilcoxon test will find exactly this with a small p-value. (We will write a test result such as this in the following form: Wilcoxon test p = 0.01.) However, the actual sample median happens to be 21 for a but only 20 for b; for other samples this difference could even be arbitrarily large.
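The effect is easy to reproduce. Here is a minimal sketch in modern R (the call is essentially the same in S-Plus):

  a <- c(1:10, 21:31)
  b <- c(10:20, 30:39)
  median(a)                                   # 21
  median(b)                                   # 20, i.e. smaller than median(a)
  # one-sided test that b tends to be larger than a; because the samples
  # contain ties, R falls back to a normal approximation and warns about it
  wilcox.test(b, a, alternative = "greater")  # yields a small p-value

Despite median(b) < median(a), the test correctly reports that b tends to be larger.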
If we have no preconception about which sample must have the larger median, if any, we use a so-called two-sided test. If, as in the example above, we want to assume that, say, either b is larger or both are equal, but want to neglect (in advance) the possibility that a could be larger than b, we will use a one-sided test, which is in fact exactly the same thing, except that the p-value is halved. In this experiment, we will usually use one-sided tests, because our testing is hypothesis-driven.
Our second robust replacement for the t-test is Bootstrap resampling of differences of means. The idea of bootstrapping is quite simple: simulation. The only assumption required is that the samples seen are representative of the underlying distribution with respect to the statistic that is being tested; this assumption is of course implicit in all statistical tests. With a computer we can now generate lots of further samples that correspond to the two given ones, by sampling with replacement (S-Plus: sample(x, replace=T)). This process is called resampling. For instance, resampling of the above sample a might yield

a': 1 3 3 4 4 6 7 8 9 10 21 22 23 24 25 25 25 27 27 30 30

Such a resample can (and usually will) have a different mean than the original sample, and by drawing hundreds or thousands of such resamples a' from a and b' from b we can compute the so-called bootstrap distribution of all the differences "mean of a' minus mean of b'". Now we can compute what fraction of these differences is, say, greater than zero. Let's assume we have computed 1000 resamples of both a and b and found that only 4 of the differences were greater than zero (which is a realistic value; see Figure 3.1 for an actual example). Then 4/1000
or 0.004 is the p-value for the hypothesis that the mean of the distribution underlying a is actually larger than the mean of the distribution underlying b. From this bootstrap-based test, we can clearly reject the hypothesis.[1]

[1] By the way, the t-test would produce the more pessimistic p-value 0.010 when comparing a and b, an error that is not relevant in this case of a very large difference, but that is important in other cases. If the deviations of the samples from normality are large, the t-test becomes insensitive and can often not detect the differences (statisticians say it has small power).
Instead of p-values, we can also read arbitrary confidence intervals from the bootstrap distribution. In the example, 90% of all bootstrap differences are left of the value -4, hence a left 90% confidence interval for the size of the difference would be (-infinity, -4]; in other words: the difference is 4 or larger with a probability of 0.9.
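The whole procedure takes only a few lines. Here is a minimal sketch in modern R (again, essentially identical in S-Plus), reusing the samples a and b from above and 1000 resamples as in the text; the exact numbers vary slightly from run to run:

  set.seed(1)   # fix the random numbers for reproducibility
  diffs <- replicate(1000, mean(sample(a, replace = TRUE)) -
                           mean(sample(b, replace = TRUE)))
  mean(diffs > 0)       # bootstrap p-value; close to the 0.004 from the text
  quantile(diffs, 0.9)  # left 90% confidence bound; close to -4
  hist(diffs)           # reproduces the shape of Figure 3.1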
[Figure 3.1: Bootstrap distribution of the difference of means of the two samples a and b as given in the text. Histogram of mean(a)-mean(b); y-axis: Count.]
When we use Bootstrap, we will usually compute 1000 pairs of resamples and indicate the resulting p-value with two digits. For the example above we would write: mean bootstrap p = 0.00. Occasionally we will use 10000 resamples and then indicate three digits (mean bootstrap p = 0.004).
Sometimes we would like to compare not only means and medians, but also the variability (spread) of two
samples. The conventional method of doing this is the F-test, which compares the standard deviations. It is
related to the t-test and, like the latter, assumes the two samples to come from a normal distribution. Unlike the
t-test, which is quite robust against deviations from normality and whose results deteriorate only slowly with
increasingly pathological inputs, the F-test is very sensitive to deviations from normality. Therefore, the F-test is completely unsuitable for our purposes. Instead, we resort to resampling again and compare a
robust measure of spread, namely the inter-quartile range mentioned above. We generate pairs of resamples
and compute the differences of their interquartile ranges. This way, we compute a Bootstrap resampling of
differences of inter-quartile ranges in order to arrive at a test for inequality of variability. Note that the inter-
quartile range nicely corresponds to the box width in our boxplots. Much like for the test on means, we write the result of a bootstrap-based test on inter-quartile ranges (iqr) like this: iqr bootstrap p = 0.02.
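The test on spread is the same procedure with a different statistic. A minimal R sketch, again reusing a and b from above:

  diffs.iqr <- replicate(1000, IQR(sample(a, replace = TRUE)) -
                               IQR(sample(b, replace = TRUE)))
  mean(diffs.iqr > 0)   # p-value for the one-sided hypothesis that
                        # the iqr underlying a is larger than that of b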
3.1.2 Two-dimensional statistics
Sometimes we do not only want to compare groups of individual measurements, but rather want to understand whether or how one variable depends on another. Such a question is usually best assessed by the familiar x-y-coordinate plot (S-Plus: xyplot()); see Figure 3.38 on page 46. Here, each participant is represented by one point, and we possibly make two (or more) such plots side by side for comparing different groups of participants.
We will often plot some sort of trend line together with the raw data in an x-y plot in order to show the
relationship between the two variables plotted. If the assumed relationship is linear, the trend line is a straight
line. The most common form of this is the least squares linear regression line (S-Plus: lsfit()), a line
chosen such that the sum of the squares of the so-called residual errors (residuals) is minimized. The residual
error of a point is the vertical distance from the point to the regression line. While the least squares regression
line is very simple to compute, it is sensitive to individual far-away points, much like the mean in the one-
dimensional case. Therefore, we will sometimes add two other sorts of trend lines in order to avoid spurious or
misleading results. These other trend lines are less common and more difficult to compute, but are often more
appropriate because they are more robust against a small number of far-away points. The first of these robust
regression lines is the least distance regression line or L1 regression line (S-Plus: l1fit()). It minimizes the sum of the absolute values of the residual errors instead of the sum of their squares, which reduces the influence of the largest residuals. The L1 regression is computed by linear programming.
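In modern R, lsfit() still exists; the report's l1fit() has no base-R equivalent, but median regression from the CRAN package quantreg computes the same kind of L1 line. The following sketch on invented data shows how strongly a single far-away point affects least squares but not L1 (package choice and data are our illustration, not part of the report):

  x <- 1:20
  y <- 2 * x + rnorm(20)
  y[20] <- 100                       # one far-away point

  lsfit(x, y)$coefficients           # least squares: slope distorted by the outlier
  library(quantreg)                  # L1 (least absolute deviations) regression
  rq(y ~ x, tau = 0.5)$coefficients  # robust: slope stays near 2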
The second kind of robust regression is the trimmed least squares regression (S-Plus: ltsfit()), which computes a standard least squares regression after leaving out those, say, 10% of all points that would result in the largest residual errors. The operation of leaving out a certain fraction of extreme points is called trimming. Since the identity of these worst points depends on the regression line chosen and the regression line depends on which points are left out, the trimmed regression is extremely difficult to compute; it is implemented by a genetic algorithm. The fraction that is trimmed