A scale for rating the quality of psychological trials for pain
Shona L. Yatesa, Stephen Morleya,*, Christopher Ecclestonb, Amanda C. de C. Williamsc
aAcademic Unit of Psychiatry and Behavioural Sciences, University of Leeds, 15 Hyde Terrace, Leeds LS2 9JT, UK
bPain Management Unit, University of Bath, Bath, UK
cDepartment of Psychology, University College London, London, UK
Received 1 February 2005; received in revised form 19 April 2005; accepted 20 June 2005
This paper reports the development of a scale for assessing the quality of reports of randomised controlled trials of psychological
treatments. The Delphi method was used, in which a panel of experts (15 in the first round, 12 in the later rounds) generated statements relating to treatment and design components
of trials. After three rounds, statements with high consensus agreement were reviewed by a second expert panel and rewritten as a scale.
Evidence to support the reliability and validity of the scale is reported. Three expert and five novice raters assessed sets of 31 and 25
published trials to establish scale reliability (ICCs ranged from 0.91 to 0.41 for expert and novice raters, respectively) and item reliability (Kappa
and inter-rater agreement). The total scale score discriminated between trials globally judged as good and poor by experts, and trial quality
was shown to be a function of year of publication. Uses for the scale are suggested.
© 2005 Published by Elsevier B.V. on behalf of International Association for the Study of Pain.
1. Introduction

It is widely agreed that interpretation of the results of a
randomised controlled trial (RCT) should be informed by
the quality of the trial: the better the quality, the greater the
confidence one may have in the validity and utility of the
results. There are several guidelines to aid the critical
appraisal of reports of RCTs, e.g. (Davidson et al., 2003)
and other authors have developed scales by which the
quality of a study may be quantified (Chalmers et al., 1981;
Downs and Black, 1998; Harbour and Miller, 2001; Jadad
et al., 1996; Sindhu et al., 1997). Quantification can be used
to inform the conduct and analysis of systematic reviews
and meta-analysis either by setting a cut-off score to
determine the exclusion of trials that do not meet a pre-
defined criterion or by examining the influence of quality
parameters on standardised trial outcomes.
For many purposes judgment of quality is indexed by
methods used to control bias. Quality is a multi-dimensional
construct and most current scales have been constructed
around features of the internal validity of trials to identify
potential sources of bias. To this end most scales assessing
quality have focused on the design features of trials. Table 1
summarises published scales identified in a literature
search.¹ These scales were mostly designed for medical
trials in which pharmacological treatments can be delivered
in a double blind manner. Furthermore, the major aspect of
the quality of treatment delivered in medical trials (the drug)
is controlled via manufacturing quality control processes,
although the context in which therapy may be delivered
within trials may vary considerably. In contrast, delivering
psychological treatments in controlled trials poses a number
of other problems (Schwartz et al., 1997; Waltz et al., 1993).
For example, it is improbable that delivery can be double
blind as skilled therapists will know what they are
delivering and participants will also be aware of treatment
content. Other steps must, therefore, be taken to ensure
equivalence between the treatment arms on potentially
confounding variables, e.g. expectation of improvement.
Furthermore, treatment integrity needs to be maintained
throughout the study as the treatment is essentially
Pain 117 (2005) 314–325
*Corresponding author. Tel.: +44 113 343 2733; fax: +44 113 243
E-mail address: email@example.com (S. Morley).
¹ The scales in Table 1 were retrieved by a systematic search of the
literature in January 2003.
[Table 1. Content analysis of published quality scales. For each of 17 scales (including Chalmers 1981; Cho 1994; de Vet 1997; Detsky 1992; Downs 1998; Evans 1985; Gotzsche 1989; Jadad 1996; Turlik 2000; and others), the table records the type of trial covered (general, drug, pain), the number of items, the response format (Y = yes; N = no; P = partial definition; U = unknown), and whether general items, methodological quality and treatment quality are covered. Footnotes record that one scale measured quality of report rather than methodological quality, and that one scale also had a separate 25-item scale assessing quality of report.]
‘manufactured’ by the therapist at each session. Trialists
must, therefore, ensure that therapists deliver only the
prescribed treatment components at an acceptable level of
competence. A content analysis of current quality scales
(Table 1) clearly identifies a lacuna in the scales around the
issue of treatment implementation. Psychosocial trials may
be unduly penalised because of the problem of double
blinding (Guzman et al., 2001) whereas other potentially
compensatory methodological refinements in the studies
may be overlooked.
The purpose of the current study was to develop a
scale to assess the quality of trials of psychological
treatments that could be used both to assess individual
trials (a critical appraisal tool) and to provide
quantification of quality for use in meta-analytic studies.
The present study reports the development of a scale
using Delphi methodology (Jones and Hunter, 1999;
Linstone and Turoff, 2002) to develop a consensus from
a panel of experts to ensure that the items generated
were not merely a function of the small team of
individuals represented by the authors.
2. Method

The study comprised several phases: (1) generation of a pool
of statements by a Delphi panel; (2) a panel of experts to write
the scale items; (3) use of the scale by expert and novice raters
to establish reliability; (4) assessment of the scale’s validity;
(5) a preliminary analysis of the influence of trial quality on the
magnitude of effects size. Two sets of published RCTs were
used in the reliability and validity phases, one from which data
(effect sizes) was already available, and an additional sample of
six trials established through a newly written search strategy.
To aid the reader an overview of the phases is represented
diagrammatically Fig. 1.
2.1. Delphi panel
2.1.1. Recruitment of panel
We identified possible participants for the Delphi panel if they
met two of the following criteria: (1) previous involvement in a
published randomised controlled trial of a psychological treatment
for chronic pain; (2) two or more published articles on
psychological treatment for chronic pain; (3) two or more
conference presentations on the same subject; or (4) possession
of a professional qualification in a relevant discipline, e.g. clinical psychology.
A list of 62 eligible candidates from Australia, Europe and
North America was compiled from several sources. The second
author identified approximately 25 eligible participants through his
own knowledge of the field. Further candidates for the survey panel
were identified as the authors of the randomised controlled trials
included in meta-analysis by Morley et al. (1999) and from a
search of the following electronic databases; Medline, Embase,
PsycINFO using a search strategy developed to identify
randomised controlled trials of psychological therapy for chronic
pain published subsequent to the trials included in the Morley et al. (1999) review.
Electronic mail addresses for 44 of those eligible were obtained
from published articles, the World Wide Web and personal
knowledge: 21 of the experts were located within Europe; 18 in
North America and 5 in Australia and New Zealand. The experts
were individually contacted via e-mail and invited to take part in
the study. The anonymity between participants was maintained
throughout the study.
2.1.2. Development of consensus agreed statement
The Delphi survey was conducted over three rounds. In the first
round, participants were invited to contribute as many ideas as they
wished in response to two open ended questions regarding quality
in research trials: “What factors do you consider are important for
assessing the quality of treatments used in randomised controlled
trials of cognitive behaviour therapy and behaviour therapy for
chronic pain?” and, “What factors do you consider are important
for assessing the methodological quality of randomised controlled
trials of cognitive behaviour therapy and behaviour therapy for
chronic pain?”
In the second round, the responses obtained in round one
were collated and grouped together under a number of
semantically related headings by the first author in discussion
with the second author. The statements were organised into two
categories: those relating to treatment quality and those relating
to the quality of the design and methods of a trial. Participants
were asked to consider which of the statements would be
essential to include in a rating scale designed to measure the
quality of randomised controlled trials of cognitive behaviour
therapy for chronic pain. For each statement, the degree of
necessity for its inclusion in a final quality rating scale was
rated on a seven-point Likert scale (1 = completely unnecessary
to 7 = completely necessary).
The median score (representing group level of agreement) and
the inter-quartile range (indicating the degree of consensus) for
each statement were computed. This information was then
incorporated into the round three questionnaire with the addition
of the participant’s own ratings for each statement as a reminder.
Thus, separate round three questionnaires were developed for each
participant. The participants reviewed and re-rated the statements
in the light of the new information about the opinion of the group
as a whole. A list of statements, which achieved consensus
agreement, was prepared by the first author. Consensus for
inclusion was pre-defined as a median rating of six or above and
an inter-quartile range (IQR) of 1.5 or less.
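The pre-defined consensus rule (median rating of at least six and IQR of at most 1.5) is straightforward to apply programmatically. The sketch below is illustrative, with invented ratings; note that the IQR estimate depends on the quantile method chosen, which the paper does not specify:

```python
import statistics

def meets_consensus(ratings, median_cut=6.0, iqr_cut=1.5):
    """Apply the pre-defined Delphi consensus rule to one statement:
    median rating >= 6 (necessity) and inter-quartile range <= 1.5
    (agreement). The quantile method ('exclusive', the default) is an
    implementation choice."""
    q1, _, q3 = statistics.quantiles(ratings, n=4)
    return statistics.median(ratings) >= median_cut and (q3 - q1) <= iqr_cut

# Invented ratings for two statements (6-rater examples for brevity):
# meets_consensus([6, 6, 7, 7, 7, 7])  -> consensus (median 7, IQR 1.0)
# meets_consensus([2, 3, 6, 6, 7, 7])  -> no consensus (IQR too wide)
```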
2.2. Expert panel
The expert panel comprised three of the authors (SM, CE and
AW). Their credentials as experts were that they had previously
conducted systematic reviews and meta-analyses of psychological
treatments for chronic pain for both adult (Morley et al., 1999) and
child and adolescent (Eccleston et al., 2002) populations and for
irritable bowel symptoms (Lackner et al., 2004). One (AW) had
also conducted an RCT (Williams et al., 1996) and all were
thoroughly familiar with the field. The panel was presented with
the output of the third round from the Delphi panel to consider
prior to meeting face-to-face for one day. The panel meeting was
chaired by the first author (SY). The main task of the panel was to
draft a quality scale from the output generated by the Delphi panel.
The expert panel aimed to produce a new scale of a reasonable
length that would be a practical tool.² Definitions for each item
were drafted and a response scale for each item prepared. The draft
scale was then circulated between members of the expert panel for
further comment and editing.
2.3. Reliability

Three experienced raters (authors: SY, SM and CE) rated the 25
trials included in the meta-analysis reported by Morley et al.
(1999, Table 3) and an additional six trials randomly selected from
those published subsequent to the 1999 study (Basler et al., 1997;
Ersek et al., 2003; Johansson et al., 1998; Marhold et al., 2001;
Sharpe et al., 2001; Thieme et al., 2003). Two of the raters (SM and
CE) were familiar with the first 25 trials, whereas one rater (SY)
was not familiar with any of the trials. Inter-rater reliability for the
total scale score and the two sub-scale scores (intra class
correlation, ICC), and by item (Kappa and agreement ratio) was
computed. A further test of reliability for the total scale and
Fig. 1. Diagrammatic representation of the sequence of tasks in the study.
² The expert panel aimed to produce a new scale of a reasonable length
that would be a practical tool. The coding sheets developed in a previous
study Morley et al. (1999) were found to be too exhaustive for practical use.
subscale scores was obtained by recruiting five novice raters who
were unfamiliar with the scale and did not have detailed and
extensive knowledge of the trials. These five raters were all
psychologists; four of these with some experience in providing
cognitive behaviour therapy for chronic pain. They were given one
brief training session (approximately 1.5 h) by the first author.
Each rater rated 10 of the 25 trials from the Morley et al. (1999)
article in a balanced order so that each trial was rated by two raters.
A set of intra-class correlations was computed for each of the five pairs of raters.
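The ICC model used in the reliability analyses (two-way, absolute agreement, average measures; McGraw and Wong's ICC(A,k)) can be computed from the ANOVA mean squares. A minimal sketch with invented ratings; a complete subjects × raters matrix is assumed:

```python
import numpy as np

def icc_a_k(x):
    """ICC(A,k): two-way model, absolute agreement, average of k raters
    (McGraw and Wong, 1996). x is an (n subjects, k raters) array."""
    n, k = x.shape
    grand = x.mean()
    rows = x.mean(axis=1)            # per-trial means
    cols = x.mean(axis=0)            # per-rater means
    msr = k * ((rows - grand) ** 2).sum() / (n - 1)   # between-subjects MS
    msc = n * ((cols - grand) ** 2).sum() / (k - 1)   # between-raters MS
    sse = ((x - rows[:, None] - cols[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                   # residual MS
    return (msr - mse) / (msr + (msc - mse) / n)

# Invented quality scores: six trials rated by three raters
scores = np.array([[10, 11, 10], [14, 15, 14], [20, 19, 21],
                   [8, 9, 8], [17, 17, 18], [12, 12, 13]])
```

For these invented scores the ICC is close to 1 because rater disagreements are small relative to differences between trials.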
2.4. Validity

We sought to establish validity using two methods. First, we
followed the method reported by Jadad et al. (1996), where
published articles were initially allocated to three grades by raters
with knowledge of the field. In this study, the second and fourth
authors were presented with the abstracts of the 25 trials analysed
by Morley et al. (1999). They categorised each study as high,
medium or low quality. The two judges were familiar with the
contents and methodology of the trials. (These broad category
judgments were made before the second author reread the articles
in depth to code them for the reliability study.) The validity of the
quality scale was assessed by testing whether the quality scale
discriminated between these categories. The first author’s ratings
were used for this test and these had been completed prior to and
independently from the expert rankings of the studies produced by
the two judges. As a second test of validity we assumed that the
quality of the trial might be expected to improve over time. We,
therefore, regressed the total quality score onto the year of publication.
2.5. Quality and outcome
In the final set of analyses, we conducted exploratory analyses
of the influence of trial quality on outcome. We examined the
relationship of the total score and the two sub-scale scores
(treatment and design) by regressing the scores onto a measure of
outcome for the 31 trials in the data set. The major issue to consider
here was the selection of the outcome measure because the trials
have multiple treatments (trial arms) and multiple outcome
measures and there was no single measure that could be regarded
as the ‘primary endpoint’ that was also common across trials. It is
probable that most trialists with a cognitive-behavioural allegiance
expect that treatment should have a broad impact across a range of
measures. We, therefore, aggregated the effect sizes (ES) across
outcomes within trials. To take into account the fact that the
outcomes are not independent we used the algorithm devised by
Wampold et al. (1997) to compute the mean ES across measures
within trial arms, assuming that the average inter-correlation
between measures was 0.5. The selection of 0.5 as the average
inter-correlation was a ‘guesstimate’ and followed the precedent
set by Wampold et al. (1997). The aggregated ES using this method
is monotonically related to the estimated intercorrelations thus
preserving the order of trials across the range of possible
correlations. As the focus of the analysis was the relationship
between quality and the relative differences of ESs across trials the
exact estimate of the aggregated ES is of secondary interest.
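A composite-score formulation consistent with this description (our reconstruction, not necessarily Wampold et al.'s exact algorithm): the effect size of the average of m standardized outcomes with common inter-correlation ρ is the mean of the per-measure effect sizes divided by the composite's standard deviation, which makes the aggregate monotonically decreasing in ρ:

```python
import math

def composite_es(effect_sizes, rho=0.5):
    """Effect size of the mean of m standardized outcome measures that
    share a common pairwise inter-correlation rho (the 'guesstimate'
    of 0.5). SD of the composite = sqrt((1 + (m - 1) * rho) / m)."""
    m = len(effect_sizes)
    mean_d = sum(effect_sizes) / m
    return mean_d / math.sqrt((1 + (m - 1) * rho) / m)
```

With ρ = 1 the composite reduces to the plain mean of the effect sizes; smaller assumed inter-correlations inflate it, but the ordering of trials is preserved across the range of ρ, as noted above.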
3. Results

3.1. Delphi panel
Emails were sent to 44 of the 62 individuals eligible for
inclusion in the Delphi survey. Fifteen experts responded to
the invitation and completed round 1, and 12 also completed
rounds 2 and 3. Reasons for the attrition of experts are given
in Fig. 1. No further participants were sought as it has been
suggested that between 12 and 20 participants is an optimal
size for a Delphi study (Henry et al., 1987).
Table 2 shows a summary of the statements generated and
retained over the three rounds. For ease of summary, the
statements have been aggregated into recurrent themes. (A
full list of all items at each stage of development is available
from the authors on request.) In round 1, the panel generated
a total of 234 statements, each person generating on average
15.13 statements (SD = 3.33). Removal of duplicate
statements resulted in 150 statements that were equally
distributed between the two main categories: treatment, and
design and methods. Consensus was defined as a median
rating of six or more (indicating necessity of inclusion) and
an inter-quartile range (IQR) of 1.5 or less (indicating
agreement amongst the Delphi panel).

[Table 2. Summary of statements generated for each theme at each round, under headings including relevance of methods and design and method quality. The numbers in parentheses for rounds 2 and 3 are the percentage of items retained from the first round.]
After round two, 22 of the 75 statements in the
Treatment Quality section (in supplementary Appendix 1 –
online only) of the questionnaire achieved consensus; 30
statements had median ratings in the middle of the range
(median = 3–5), of which only one also obtained consensus.
Only one statement was judged by the panel to be
unnecessary; ‘duration of therapy at least three months’.
In the Design and method section, 32 statements achieved
consensus; 11 statements had median ratings between 3 and
5 and there was a consensus level of agreement for only one
of these items. None of the statements in this section of the
questionnaire were considered unnecessary by the Delphi
panel (i.e. median ≤ 2).
Table 2 shows the number of statements in each category
gaining consensus for rounds 2 and 3. Eight treatment
quality statements obtained absolute agreement (median = 7,
IQR = 0) and three had stable absolute agreement across
rounds 2 and 3. Twenty-six statements relating to
methodological quality obtained absolute agreement of
which 14 had absolute agreement across rounds 2 and 3.
After round three, 45 of the 75 statements relating to
Treatment Quality were judged as necessary (median rating
≥ 6); 32 of these achieved consensus and were
included in the pool for the quality scale. In the Design and
Method Quality section, 52 statements achieved consensus
for inclusion in the statement pool for the quality scale.
3.2. Expert panel
The expert panel considered the 84 statements generated
by the Delphi panel and distilled the statements into 13 main
topics; each referring to a major theme identified by the
Delphi panel. A number of topics contained two or more
parts resulting in a total of 26 items for the quality scale. For
example, the topic of treatment manuals contained items
referring both to the presence of a treatment manual and
whether there was evidence that therapists had adhered to
the manual. The final scale comprised two sections with six
items in the section on treatment quality (supplementary
Appendix 1) and 20 items in the section on design and
method quality (supplementary Appendix 1). A brief coding
guide (manual) detailing the criteria for each item and the
associated scale points was also produced as a result of the
panel’s deliberations. The final version of the scale (‘The
quality rating scale’) is shown in Table 3 and the coding
guide is reproduced in supplementary Appendix 1.
3.3. Reliability

The intra-class correlation (two-way mixed effects,
absolute agreement model for average measures; McGraw
and Wong (1996)) for three raters was 0.91 (95% CI =
0.76–0.96) for the full scale and 0.91 (95% CI = 0.76–0.96) for
the design and methods subscale. The multiple-rater Kappa
coefficients for each item are shown in Table 4; they ranged
from 0.74 to −0.07. The median Kappa value for all items
was 0.405 (IQR = 0.21–0.66). As Kappa is sensitive to the
marginal distributions of ratings, it can be misleadingly low
when raters agree at a high level and where most of their
agreement lies within one cell. We, therefore, computed
agreement coefficients for each item, shown in Table 4. We
used two criteria of agreement: the strict criterion was
defined as complete agreement between all three raters and
the relaxed criterion as agreement between any rater pair.
The median value for the strict agreement criterion was
72% (IQR = 50–80%), indicating a good level of
agreement across most items. As expected, the relaxed
criterion gave higher values of agreement: 90% (IQR = 80–…%).
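The two agreement criteria can be stated precisely. A sketch with invented ratings, under our reading of the definitions (strict = all three raters give the same score on an item; relaxed = at least one pair of raters agrees):

```python
def agreement_rates(triples):
    """triples: list of (rater1, rater2, rater3) scores for one item,
    one triple per trial. Returns (strict %, relaxed %): strict requires
    all three raters identical, relaxed requires any agreeing pair."""
    n = len(triples)
    strict = sum(1 for a, b, c in triples if a == b == c)
    relaxed = sum(1 for a, b, c in triples if a == b or b == c or a == c)
    return 100.0 * strict / n, 100.0 * relaxed / n

# Invented ratings for one item across five trials
strict_pct, relaxed_pct = agreement_rates(
    [(1, 1, 1), (0, 1, 1), (1, 1, 1), (0, 0, 1), (1, 0, 0)])
```

Here 2 of 5 triples are unanimous (strict = 40%) while every triple contains an agreeing pair (relaxed = 100%).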
Intra-class correlations (model as above) were computed
for all pairings of ‘novice’ raters (five pairs) for the total
score and the two subscale scores. The median of the
average-rater ICCs for the total scale score was 0.81 (ICC
coefficients for pairs 1–5 were 0.89, 0.69, 0.91, 0.47, 0.81).
The corresponding values for the treatment subscale were
median = 0.57 (ICC coefficients for pairs 1–5 were 0.94,
0.57, 0.99, 0.50, 0.53), and for the methods subscale
median = 0.76 (ICC coefficients for pairs 1–5 were 0.72,
0.76, 0.50, 0.42, 0.91).
The final two columns of Table 4 also display the
percentage of the trials entered into the analysis which met
the quality criterion represented in each of the items as
given by both the strict and relaxed agreement criteria.
There is marked variation between items which suggests
that for this sample of trials there is significant variation in
the degree to which the reports of the trials meet the quality
criteria. The trials are, generally, strong in reporting
treatment content, sample criteria (inclusion/exclusion)
characteristics and equivalence between groups, details of
outcomes and analysis reporting. In contrast, items relating
to the controlled delivery of treatments (e.g. manuals) show
only a modest attainment of the criteria, and there are clear
limitations with respect to the reporting of aspects of
experimental design (power calculations, intention-to-treat
analyses, sample sizes, and randomisation procedures).
3.4. Validity

The two raters achieved consensus agreement that five of
the 25 sample trials were ‘excellent’, seven were ‘average’,
and five were ‘poor’ quality. The remaining eight trials,
where there was no consensus agreement, were removed
from the analysis to ensure a clear unambiguous criterion.
[Table 4. Reliability and criterion attainment by item: Kappa, agreement coefficients (% strict and % relaxed), and the frequency (%) with which trials met each quality criterion under the strict and relaxed agreement criteria. Treatment quality items (response range in parentheses): treatment content (0–2), treatment duration (0–1), manual adherence (0–1), therapist training (0–2), patient engagement (0–1). Quality of design and methods items: sample criteria (0–1), evidence criteria met (0–1), rates of attrition (0–1), sample characteristics (0–1), group equivalence (0–1), allocation bias (0–1), measurement bias (0–1), treatment expectations (0–1), justification of outcomes (0–2), validity of outcomes (0–2), power calculation (0–1), sample size (0–1), data analysis (0–1), statistics reporting (0–1), intention-to-treat analysis (0–1), control group (0–2).]
The mean overall score for the 17 trials with expert
consensus judgements of quality was 17.94 with a range of
8.5–25. The mean quality scores (SD in parentheses) for the
excellent, average and poor trials were 22.7 (1.95), 18.71
(2.25) and 12.10 (3.17), respectively. Comparisons between
all pairs of means were made with one-sided t-tests with
predicted directional differences and alpha set at 0.01. Using
this criterion all means were significantly different from
each other. Despite the small number of trials, the post hoc
power for the comparisons always exceeded 83%.
A regression analysis (quality score against year of
publication) included the 25 trials from Morley et al. (1999)
plus the sample of six additional trials published since 1996.
The mean quality score for trials published up to 1996 was
18.24 (SD 4.88) and for those published after 1996 was …
(F(1,29) = 9.52, P < 0.01, adj. R² = 0.221, β = 0.497), such that a
later year of publication predicted a higher quality score, as
shown in Fig. 2. The adj. R² indicated that year of
publication accounted for just over 20% of the variance in total
quality score, suggesting that the quality of trials (or their
reporting) has increased over time. The analysis was repeated
with just …, resulting in very similar regression statistics
(F(1,29) = 7.41, P < 0.05, adj. R² = 0.211, β = 0.494).
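The year-of-publication analysis is an ordinary least-squares regression of total quality score on year. This sketch (with invented scores, not the study's data) shows how the slope and adjusted R² are obtained:

```python
import numpy as np

def ols_year(years, scores):
    """OLS of quality score on year of publication; returns the slope
    and adjusted R^2 (a single predictor, so p = 1)."""
    years, scores = np.asarray(years, float), np.asarray(scores, float)
    n = len(years)
    slope, intercept = np.polyfit(years, scores, 1)
    resid = scores - (slope * years + intercept)
    r2 = 1 - (resid ** 2).sum() / ((scores - scores.mean()) ** 2).sum()
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return slope, adj_r2

# Invented quality scores drifting upwards over two decades
slope, adj_r2 = ols_year([1983, 1987, 1990, 1994, 1997, 2000, 2003],
                         [12.0, 14.5, 13.0, 17.0, 18.5, 20.0, 22.5])
```

A positive slope with a non-trivial adjusted R², as in the paper, indicates that later trials tend to score higher on the quality scale.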
3.5. Quality and outcome
We regressed the quality score onto the averaged effect
size of each trial, i.e. averaged across trial arms. Fig. 3
shows the scatter plot of the averaged effect sizes and the
total quality scores. Inspection of the plot suggested a trend
of a negative correlation and also indicated the presence of
an outlier. The presence of the outlier was confirmed by
diagnostic statistics in a regression analysis. We, therefore,
excluded the outlier and regressed the quality score onto the
averaged effect size of each trial. The regression line is
shown in Fig. 3. The resulting regression did not meet
conventional significance criteria (F(1,28) = 3.93, P < 0.057,
adj. R² = 0.092, β = −0.351). When the separate components
of the quality scale were considered there was a
significant impact of the quality of design on the magnitude
of the effect size (F(1,28) = 5.39, P < 0.05, adj. R² = 0.131,
β = −0.402) but no effect of the quality of treatment
implementation (β = −0.147, t < 1.0, P = ns).
The preceding analyses depend on averaging the effect
sizes for treatment arms within each trial, and the resultant
averages were given equal weight across trials. This may
introduce bias. We, therefore, repeated the analyses using
grouped regression in which effect sizes within each trial are
regarded as replicates (Buchan, 2000). This method
incorporates all the effect sizes and weights the trial by
the number of ‘replicates’, but makes the assumption that
the replicates are independent. In these analyses, the same
trial was identified as an outlier and excluded from the
analysis. There was a significant regression of effect size on
the total quality score (F(1,40) = 8.19, P < 0.01), a marginally
significant effect for the quality of design and method
(F(1,40) = 4.06, P < 0.057) and no significant relationship
between treatment quality and effect size (F(1,40) = 1.02, ns).
4. Discussion

The purpose of this study was to develop a scale to
measure the quality of randomised controlled trials for
psychological interventions. A Delphi panel generated
statements with consensus validity that were then used to
construct a scale. In two reliability studies, with experts and
non-experts, the total scale score and the subscale scores
achieved good levels of reliability. There was variation in
the inter-rater reliability across items. This may be
attributed to the degree of inference required of the raters
Fig. 2. Scatter plot of total quality score against year of publication. The
dotted line represents the fitted regression line.
Fig. 3. Scatter plot of averaged effect size within trial against total quality
score. The dotted line represents the fitted regression line. The circle data
point is the identified outlier.
when making judgments. For example, the presence of a
power calculation is either clearly stated or not whereas
evidence that patients have actively engaged in the
treatment is more a question of interpretation on the part
of the rater. In the absence of a gold standard, we tested
the scale against two criteria: its ability to discriminate
between expert nominated trials of different quality and the
assumption that trial quality has improved over time (the two
decades between 1982 and 2003). Both of these tests
indicated preliminary support for the validity of the scale.
There are potential biases in the methodology used in the
study. First, the statements may not be an exhaustive
inventory of every aspect of methodology that could impact
on trial quality. We obtained consensus opinion from those
with direct experience of conducting RCTs of psychological
treatments for chronic pain as knowledge of the subject
matter is considered the most significant assurance of a valid
outcome using the Delphi method (Stone Fish and Busby,
1996). The experts in the Delphi panel and the reliability
study were predominantly behavioural scientists and
research clinicians, in contrast to the statisticians and
medical clinical trialists involved in other psychosocial
and community-based trials. Differences between these two
groups might be reflected in the content of the various
scales. Nevertheless, the validity of the current scale items is
supported because of the considerable overlap between the
items in the design and methods section (supplementary
Appendix 1) and similar items reported in other quality
scales (Tables 1 and 3). Second, the involvement of the
authors in both the expert panel and as raters in the
reliability study may have inflated the reliability and
validity coefficients. We attempted to minimise these biases
by sequencing the order of the tasks and by temporal
separation of the tasks, e.g. the expert panel meeting
occurred between 3 and 8 months before rating the trials.
The results from the novice raters provided additional
support for the potential usability and reliability of the scale.
Three areas are not covered by the scale: therapist
allegiance, credibility of therapy and the reporting of
adverse events. As far as we can ascertain there is no
evidence to link reporting of adverse events to bias in the
estimation of the effectiveness of therapy but documentation
of such effects in pharmacological trials is required.
Adverse events are rarely reported in psychological trials
although the fact that some patients deteriorate in
psychotherapy has long been documented (Lambert and
Bergin, 1994). The absence of an item directly assessing
therapist allegiance might be rectified in any revision of this
scale. There is substantial evidence that the allegiance of
therapists to a particular model of psychological treatment is
associated with larger effect sizes (Berman et al., 1985;
Wampold, 2001). However, these findings come from a
literature in which therapy is delivered by a single therapist
to a single patient, and caution should be exercised in
generalising this finding to chronic pain treatment, which is
typically delivered by a team. Nevertheless, we suggest that an attempt should be made
to estimate the influence of therapist allegiance on response
to treatment and its importance as a vehicle for therapeutic
change recognised and given due weight. The omission of
an item assessing equivalence of treatment credibility across
arms of a trial might also be rectified. Non-equivalence of
treatment credibility has long been recognised as a potential
source of differential expectations of treatment
gain (Kazdin and Wilcoxon, 1976), a potential placebo
mechanism (Kirsch, 1985; Price et al., 1999). There is some
evidence that initial expectations of treatment gain influence
outcomes in treatments for chronic pain (Goossens et al.,
2005). In mitigation, the scale does include an item to assess
treatment expectations and a credibility assessment might,
therefore, be redundant.
Assessment of trial quality is necessarily intertwined
with the quality of the trial report (Juni et al., 2001). This
can potentially lead to the situation where a well reported
but biased trial could be judged to be of high quality while a
trial that is well designed but poorly reported is judged to be
of low quality (Jadad, 1998). It has been argued that poor
reporting is indeed reflective of poor methods generally
(Schulz et al., 1995). The CONSORT statement (Moher
et al., 2001) was developed with the aim of improving the
standard of reporting of randomised controlled trials for
medical interventions. More recently, additions have been
made to the statement that reflect more accurately the
pertinent design features of randomised controlled trials of
psychological interventions, e.g. treatment adherence
(Davidson et al., 2003).
The final two columns of Table 4 provide an overview of
the relative strengths and weaknesses of trials of cognitive-
behavioural treatments published between 1982 and 2003
and indicate where improvements in design or reporting of
trials are necessary. Despite the sophisticated data analysis
of many trials there are lacunae in either design or reporting,
e.g. participant allocation to treatment (allocation bias),
randomisation, power calculations, adequate sample sizes
and intention-to-treat analysis. Whether all these criteria
should be applied to psychological trials merits further
debate. More attention could perhaps be given to the
selection and design of control groups as only a minority of
trials appear to include control groups that are matched to
the general structure of treatment groups. This is a complex
issue (Schwartz et al., 1997), but an important source of
potential bias if the magnitude of treatment effects is to be
estimated (Baskin et al., 2003). Structural equivalence of
control groups is of primary importance for explanatory trials
but may not be relevant in pragmatic trials. This distinction
is not often made by authors of psychological trials;
nevertheless users of the scale should consider the use of
this scale item in the light of their aims. In contrast to the
apparent shortfalls in design, many trialists have developed
manualised protocols, assessed the integrity of implementation
and justified the selection of outcomes. A recent
meta-analytic review of psychological interventions for
S.L. Yates et al. / Pain 117 (2005) 314–325
irritable bowel syndrome revealed a similar pattern of
strengths and weaknesses (Lackner et al., 2004).
The quality scale developed in this study offers some
advantages for assessing the quality of psychological trials:
its content was developed through the consensus of experts,
it captures features of trial design that are widespread in this
field, and there is preliminary evidence of its validity. The
scale can be used to assess trials in systematic reviews and
to explore the influence of trial quality or particular design
features on the estimated effect size. Although more
comprehensive than existing tools, e.g. Jadad et al. (1996),
it remains concise and easy to use. Its use should provide
greater validity, and a correction to the emphasis in existing
scales on specific methods of bias control, e.g. blinding,
that may not pertain to psychological interventions.³ Clearly,
caution should be exercised in interpreting the results if
single items are used as they are likely to be measured less
reliably. Users are encouraged to consider the addition of
further items, e.g. to assess credibility, but as with all rating
scales it is necessary to establish coding reliability on each
occasion of use. The scale can be used to assess trials of
psychological interventions in the general field of behavioural
medicine as
none of the items are specific to chronic pain. We also note
that many of the items concerning treatment may apply
equally to pharmacological and other interventions where
the competence of the therapist and adherence to the
treatment protocols are also important but perhaps some-
what neglected by current quality scales. Finally, we note
that the current scale might be adapted to appraise trials
where different modalities of treatment are being
compared, e.g. pharmacotherapy vs. psychological treatment.
Acknowledgements

Shona Yates was supported by the West Yorkshire
Workforce Development Confederation. We would like
to thank: the members of the Delphi panel, who gave
their time freely and generously and without whom this
study would not have been possible; Bruce E. Wampold
of the University of Wisconsin–Madison, who kindly
provided the necessary SPSS syntax file for computing
the aggregated effect sizes; the five 'novice' raters, who
gave their time in the presence of competing demands;
Sylvia Bickley, for advice on the search strategy; and
finally Chris Yates.
Appendix 1. Supplementary material
Supplementary data associated with this article can be
found, in the online version, at doi:10.1016/j.pain.2005.06.

References

Baskin TW, Tierney SC, Minami T, Wampold BE. Establishing specificity
in psychotherapy: a meta-analysis of structural equivalence of placebo
controls. J Consult Clin Psychol 2003;71:973–9.
Basler HD, Jakle C, Kroner-Herwig B. Incorporation of cognitive-
behavioral treatment into the medical care of chronic low back
patients: a controlled randomized study in German pain treatment
centers. Patient Educ Couns 1997;31:113–24.
Berman JS, Miller RC, Massman PJ. Cognitive therapy versus systematic
desensitization: Is one treatment superior? Psychol Bull 1985;97:
Buchan IE. StatsDirect—software program. Sale, Cheshire: StatsDirect
Chalmers TC, Smith H, Blackburn B, Silverman B, Schroeder B,
Reitman D, Ambroz A. A method for assessing the quality of a
randomized control trial. Control Clin Trials 1981;2:31–49.
Davidson KW, Goldstein M, Kaplan RM, Kaufmann PG, Knatterud GL,
Orleans CT, Springs B, Trudeau KJ, Whitlock EP. Evidence-based
behavioral medicine: what is it and how do we achieve it? Ann Behav
Downs SH, Black N. The feasibility of creating a checklist for the
assessment of the methodological quality both of randomised and non-
randomised studies of health care interventions. J Epidemiol Commu-
nity Health 1998;52:377–84.
Eccleston C, Morley S, Williams A, Yorke L, Mastroyannopoulou K.
Systematic review of randomised controlled trials of psychological
therapy for chronic pain in children and adolescents, with a subset meta-
analysis of pain relief. Pain 2002;99:157–65.
Ersek M, Turner JA, McCurry SM, Gibbons L, Kraybill BM. Efficacy of a
self-management group intervention for elderly persons with chronic
pain. Clin J Pain 2003;19:156–67.
Goossens MEJB, Vlaeyen JWS, Hidding A, Kole-Snijders A, Evers S.
Treatment expectancy affects the outcome of cognitive-behavioral
interventions in chronic pain. Clin J Pain 2005;21:18–26.
Guzman J, Esmail R, Karjalainen K, Malmivaara A, Irvin E, Bombardier C.
Multidisciplinary rehabilitation for chronic low back pain: systematic
review. Br Med J 2001;322:1511–6.
Harbour R, Miller J. A new system for grading recommendations in
evidence based guidelines. Br Med J 2001;323:334–6.
Henry B, Moody LE, Pendergast JF, O’Donnell J, Hutchinson SA,
Scullyl G. Delineation of nursing administration research priorities.
Nurs Res 1987;36:309–14.
Jadad AR, Moore A, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan DJ,
McQuay HJ. Assessing the quality of reports of randomized clinical
trials: Is blinding necessary? Control Clin Trials 1996;17:1–12.
Jadad AR. Randomised controlled trials. A user's guide. London: BMJ
Johansson C, Dahl J, Jannert M, Melin L, Andersson G. Effects of a
cognitive-behavioral pain-management program. Behav Res Ther
Jones J, Hunter D. Using the Delphi and nominal group technique in health
services research. In: Pope C, Mays N, editors. Qualitative research in
health care. BMJ Books; 1999.
Juni P, Altman DG, Egger M. Systematic reviews in health care: assessing
the quality of randomised controlled trials. Br Med J 2001;323:42–6.
Kazdin AE, Wilcoxon LA. Systematic desensitization and nonspecific
treatment effects: a methodological evaluation. Psychol Bull 1976;83:
³ Although the majority of CBT trials cannot be blinded, there are some
psychological treatments, e.g. those delivering biofeedback, where the
treatment can be delivered blind to both participant and therapist. Meta-
analyses of these trials should consider incorporating a ‘blinding’ item from
Kirsch I. Response expectancy as a determinant of experience and
behavior. Am Psychol 1985;40:1189–202.
Lackner JM, Morley S, Dowzer C, Mesmer C, Hamilton S. Psychological
treatments for irritable bowel syndrome: a systematic review and meta-
analysis. J Consult Clin Psychol 2004;72:1100–13.
Lambert MJ, Bergin AE. The effectiveness of psychotherapy. In: Bergin A
E, Garfield SL, editors. Handbook of psychotherapy and behavior
change. New York: Wiley; 1994. p. 143–89.
Linstone HA, Turoff M. The Delphi method. Techniques and applications
Marhold C, Linton SJ, Melin L. A cognitive-behavioral return-to-work
program: effects on pain patients with a history of long-term versus
short-term sick leave. Pain 2001;91:155–63.
McGraw KO, Wong SP. Forming inferences about some intraclass
correlation coefficients. Psychol Methods 1996;1:30–46.
Moher D, Schulz KF, Altman DG, for the CONSORT Group. The
CONSORT statement: revised recommendations for improving the
quality of reports of parallel group randomized trials. J Am Med Assoc
Morley S, Eccleston C, Williams A. Systematic review and meta-analysis
of randomized controlled trials of cognitive behaviour therapy and
behaviour therapy for chronic pain in adults, excluding headache. Pain
Price DD, Milling LS, Kirsch I, Duff A, Montgomery GH, Nicholls SS. An
analysis of factors that contribute to the magnitude of placebo analgesia
in an experimental paradigm. Pain 1999;83:147–56.
Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias:
dimensions of methodological quality associated with estimates of
treatment effects in controlled trials. J Am Med Assoc 1995;273:408.
Schwartz CE, Chesney MA, Irvine MJ, Keefe FJ. The control group
dilemma in clinical research: applications for psychosocial and
behavioral medicine trials. Psychosom Med 1997;59:362–71.
Sharpe L, Sensky T, Timberlake N, Ryan B, Brewin C, Allard S. A blind,
randomized, controlled trial of cognitive-behavioural intervention for
patients with recent onset rheumatoid arthritis: Preventing psychologi-
cal and physical morbidity. Pain 2001;89:275–83.
Sindhu F, Carpenter L, Seers K. Development of a tool to rate the quality
assessment of randomized controlled trials using a Delphi technique.
J Adv Nurs 1997;25:1262–8.
Stone Fish L, Busby DM. The Delphi method. In: Sprenkle DH, Moon SM,
editors. Research methods in family therapy. New York: Guilford Press;
Thieme K, Gromnica-Ihle E, Flor H. Operant behavioral treatment of
fibromyalgia: a controlled study. Arthritis Rheum 2003;49:314–20.
Waltz J, Addis ME, Koerner K, Jacobson NS. Testing the integrity of a
psychotherapy protocol: assessment of adherence and competence.
J Consult Clin Psychol 1993;61:620–30.
Wampold BE. The great psychotherapy debate: models, methods, and
findings. vol. xiii. Mahwah, NJ: Lawrence Erlbaum Associates; 2001.
Wampold BE, Mondin GW, Moody M, Stich F, Benson K, Ahn H. A meta-
analysis of outcome studies comparing bona fide psychotherapies:
empirically, ‘all must have prizes’. Psychol Bull 1997;122:203–15.
Williams A, Richardson P, Nicholas M, Pither C, Harding V, Ridout K,
Ralphs J, Richardson I, Justins D, Chamberlain J. Inpatient vs.
outpatient pain management: results of a randomised controlled trial.