Assessing the Quality of Reports of Randomized
Clinical Trials: Is Blinding Necessary?
Alejandro R. Jadad, MD, DPhil; R. Andrew Moore, DPhil;
Dawn Carroll, RGN; Crispin Jenkinson, DPhil;
D. John M. Reynolds, DPhil; David J. Gavaghan, DPhil;
and Henry J. McQuay DM
Oxford Regional Pain Relief Unit (A.R.J., R.A.M., D.C., H.J.M.); Nuffield Department
of Anaesthetics (A.R.J., R.A.M., D.C., D.J.G., H.J.M.); Department of Public Health
and Primary Care (C.J.); and University Department of Clinical Pharmacology
(D.J.M.R.), University of Oxford, Oxford, UK
ABSTRACT: It has been suggested that the quality of clinical trials should be assessed by blinded
raters to limit the risk of introducing bias into meta-analyses and systematic reviews, and
into the peer-review process. There is very little evidence in the literature to substantiate
this. This study describes the development of an instrument to assess the quality of reports
of randomized clinical trials (RCTs) in pain research and its use to determine the effect of
rater blinding on the assessments of quality. A multidisciplinary panel of six judges produced
an initial version of the instrument. Fourteen raters from three different backgrounds as-
sessed the quality of 36 research reports in pain research, selected from three different
samples. Seven were allocated randomly to perform the assessments under blind conditions.
The final version of the instrument included three items. These items were scored consis-
tently by all the raters regardless of background and could discriminate between reports
from the different samples. Blind assessments produced significantly lower and more consis-
tent scores than open assessments. The implications of this finding for systematic reviews,
meta-analytic research and the peer-review process are discussed. Controlled Clin Trials
1996;17:1-12
KEY WORDS: Pain, meta-analysis, randomized controlled trials, quality, health technology assess-
ment
INTRODUCTION
The use of reliable data to support medical and public health decisions is essen-
tial if the growing demand for health care is to be met from limited resources.
Determining the effectiveness of medical interventions from clinical research data
is not an easy task, especially if studies addressing the same therapeutic problem
produce conflicting results. The assessment of the validity of the primary studies
Address reprint requests to: Alejandro R. Jadad, Department of Clinical Epidemiology and Biosta-
tistics, McMaster University, 1200 Main Street West, Hamilton, Ontario, Canada L8N 3Z5.
As of January 1995, Dr. Jadad is at the Department of Clinical Epidemiology and Biostatistics,
McMaster University, Hamilton, Ontario, Canada.
Controlled Clinical Trials 17:1-12 (1996)
© Elsevier Science Inc. 1996
655 Avenue of the Americas, New York, NY 10010 0197-2456/96/$15.00
SSDI 0197-2456(95)00134-3
has been identified as one of the most important steps of the peer-review process
[1] and as one of the key components of systematic reviews [2, 3]. For more than
10 years it has been suggested that the validity or quality of primary trials should
be assessed under blind conditions in order to reduce or avoid the introduction
of selection bias into meta-analyses and systematic reviews [4]. Similar suggestions
have been made in relation to the peer-review process [5], but there is no empirical
evidence to substantiate any of these claims [5, 6].
There are three methods to assess the quality of clinical trials: individual mark-
ers, checklists, and scales [7]. Scales have the theoretical advantage over the other
methods in that they provide quantitative estimates of quality that could be repli-
cated easily and incorporated formally into the peer review process and into sys-
tematic reviews. The main disadvantage of quality scales is that there is a dearth
of evidence to support the inclusion or exclusion of items and to support the
numerical scores attached to each of those items. In a recent search of the literature,
25 scales designed to assess the quality of primary trials were identified, but only
one had been developed following established methodological procedures [8]. In
this paper we describe the development of such a scale and its use to evaluate the
effect of blinding on the assessments of quality.
METHODS
Established methodological procedures suggested for the development and vali-
dation of any other health measurement tool were followed. They included prelimi-
nary conceptual decisions; item generation and assessment of face validity (sensi-
bility); field trials to assess consistency, frequency of endorsement, and construct
validity; and the generation of a refined instrument [9, 10].
Preliminary Conceptual Decisions
During the development of an instrument, it is important to define the entity
to be measured and the framework within which the instrument will be used. In
this particular case, the purpose of the instrument was to assess quality, defined
as the likelihood of the trial design to generate unbiased results and approach the
“therapeutic truth.” This has also been described as “scientific quality” [ll]. Other
trial characteristics such as clinical relevance of the question addressed, data analy-
sis and presentation, literary quality of the report, or ethical implications of the
study were not encompassed by our definition. The aims of the instrument were
(1) to assess the scientific quality of any clinical trial in which pain is an outcome
measure or in which analgesic interventions are compared for outcomes other
than pain (e.g., a study looking at the adverse effect profile of different opioids),
and (2) to allow consistent and reliable assessment of quality by raters with differ-
ent backgrounds, including researchers, clinicians, and professionals from other
disciplines and members of the general public.
Item Generation and Assessment of Item Face Validity
A multidisciplinary panel of judges with an interest in pain research and/or
experience in instrument development was assembled. The definition of quality
and the purposes of the instrument were discussed with each of the judges. They
were given 2 weeks to produce a list with preliminary items to be considered for
inclusion in the instrument. To generate the items, the judges referred to the criteria
published in previous instruments and used their own judgment. Once they had
generated the items, they sent them to one of the investigators (ARJ) who produced
a single list with all the items nominated by each of the judges.
Using a modified nominal group approach to reach consensus [12], the judges
assessed the face validity of each of the items according to established criteria [9].
Those items associated with low face validity were deleted. An initial instrument
was created from the remaining items.
The initial instrument was pretested by three raters on 13 study reports. The
raters identified problems in clarity and/or application of each of the items. The
panel of judges then modified the wording of the items accordingly and produced
detailed instructions describing how each of the items should be assessed and
scored. The items were classified by their ability to reduce bias (directly or indirectly)
and individual scores were allocated to them by consensus.
Assessment of Frequency of Endorsement, Consistency, and Validity
Frequency of Endorsement
The frequency of endorsement was calculated by dividing the number of times
each item was scored by the maximum possible number of times each of the items
could have been scored, multiplied by 100. Items with very high or very low
endorsement rates were eliminated because they provided little discriminative
power. Items which scored similarly on excellent quality reports and poor quality
reports would not help to separate excellent from poor and would just make the
test more time consuming. Items with frequency of endorsement below 15% or
above 85% were excluded. These values were selected a priori from the recom-
mended range [10].
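As a minimal sketch of the calculation and the a priori cutoffs just described (in Python; the function names are ours, not the paper's):

```python
def endorsement_frequency(scores):
    """Percentage of assessments in which an item was endorsed.

    `scores` holds one 0/1 entry per report-rater assessment; in this
    study each item was scored 504 times (36 reports x 14 raters).
    """
    return 100.0 * sum(scores) / len(scores)

def keep_item(scores, low=15.0, high=85.0):
    """Apply the a priori cutoffs: items endorsed in fewer than 15% or
    more than 85% of assessments provide little discriminative power
    and are excluded."""
    return low <= endorsement_frequency(scores) <= high
```

For example, an item endorsed in 10% of assessments (such as the sample-size item in Table 3) would be rejected by `keep_item`, while one endorsed in 63% would be retained.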
Consistency and Blinding of the Assessments
Consistency (also known as reliability), the prime requirement of scientific
information [9], refers to the level of agreement between different observations
of the same entity by the same rater (intrarater consistency) or different raters
(interrater consistency), or under different conditions. In this study, interrater
reliability was evaluated by assessing the degree to which different individuals
agreed on the scientific quality of a set of reports.
Raters were included in three categories defined a priori: researchers, clinicians,
and others. An individual was considered a researcher if she/he had participated
as an investigator in five or more randomized controlled trials (RCTs) in pain relief.
Clinicians were defined as individuals involved in the management of patients with
acute and chronic pain conditions for more than a year but who had participated
in fewer than five RCTs in pain relief. Those raters who were neither defined as
researchers nor clinicians were described as “other.” All raters were recruited from
the staff of the Oxford Regional Pain Relief Unit, visiting staff, and related profes-
sionals. The selection was made on the basis of interest in the study and time
availability. Each rater was given the same set of reports as follows.
The raters were allocated randomly (by using a random numbers table) to open
or blind assessment of the quality of the reports. Those raters allocated to blind
assessments were given reports in which the authors’ names and affiliation, the
names of the journals, the date of publication, the sources of financial support
for the study, and the acknowledgments were deleted. The raters were asked to
assess the quality of the reports independently. No special training was given in
how to score the items. Raters were told that there were no right or wrong answers
and that it should take them less than 10 minutes to score each report.
Intraclass correlation coefficients (ICCs) and their 95% confidence intervals
(95% CI) were used to measure the agreement between raters. ICCs and 95% CI
were calculated according to the method described by Shrout and Fleiss [13].
Values for ICCs range from 0 to 1; the closer the value is to 1, the better the agreement.
Although any cutoff value is arbitrary, it was decided a priori that the value of the ICC
should be greater than 0.5 for the criterion to be considered sufficiently reliable and
greater than 0.65 to represent a high level of agreement [11].
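The paper does not state which of the Shrout and Fleiss coefficients was used; assuming the two-way random-effects, single-rater form ICC(2,1), the point estimate (without its confidence interval) can be sketched as:

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random-effects, single-rater intraclass
    correlation in the sense of Shrout & Fleiss. `ratings` has one row
    per report and one column per rater; every rater scores every
    report."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ssb = k * sum((m - grand) ** 2 for m in row_means)   # between reports
    ssc = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    sst = sum((x - grand) ** 2 for row in ratings for x in row)
    msr = ssb / (n - 1)
    msc = ssc / (k - 1)
    mse = (sst - ssb - ssc) / ((n - 1) * (k - 1))        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

On the worked example in Shrout and Fleiss's paper (six targets rated by four judges), this returns approximately 0.29.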
Validity
Validity was defined as the ability of the instrument to measure what it is
believed to measure. Assessing the accuracy with which an instrument measures
a construct such as quality involves making predictions and testing them [lo].
Study reports were selected from three different samples. Efforts were made
to locate studies previously judged as excellent or poor, through personal contact
with members of the panel of judges, external researchers, and clinicians. Given
the lack of a single standard, the decision on the judged quality was made a priori
using the definition of quality just described. The rest of the studies were selected
randomly from a set of controlled studies published between 1966 and 1991, which
had been identified by a high-yield MEDLINE strategy [15]. The articles were
presented to the raters in an order determined using a random number table.
Three different overall scores were calculated for each study report: the first
score was obtained by adding the individual scores of all the items of the initial
instrument. The second value was obtained by adding the scores of items with
frequency of endorsement between 15 and 85%. The third score was calculated
by adding the scores only of those items directly related to bias reduction. The
primary outcome was the score obtained with items directly related to bias reduc-
tion and whose frequency of endorsement was between 15 and 85%.
Two predictions were made before the reports were given to the raters in order
to test construct validity: (1) the mean overall scores for reports judged as excellent
would be higher than for those selected randomly, and (2) the mean overall scores
of reports regarded as excellent and those selected randomly would be higher than
the overall score of studies regarded as poor.
The mean overall scores were compared using an unpaired t test. Probability
values of less than 0.05 were considered significant. Data were expressed as mean
and standard error of the mean.
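A pooled-variance (Student's) unpaired t statistic, as used for these comparisons, can be sketched as follows; the p-value would then be read from a t distribution at the returned degrees of freedom (the function and variable names below are ours):

```python
from math import sqrt

def unpaired_t(a, b):
    """Student's unpaired t test with pooled variance: returns the t
    statistic and degrees of freedom for two independent groups of
    overall quality scores."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    t = (ma - mb) / sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2
```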
Table 1 Details of Judges and Raters

Rater No.  Sex  Background  Assessment  Comments
1          F    Clinician   Open        Rater
2          F    Clinician   Blind       Rater
3          F    Researcher  Blind       Judge and rater
4          F    Clinician   Blind       Rater
5          M    Researcher  Open        Judge and rater
6          F    Other       Open        Rater
7          M    Other       Blind       Judge and rater
8          F    Clinician   Open        Rater
9          M    Researcher  Open        Judge and rater
10         M    Researcher  Open        Judge and rater
11         M    Clinician   Blind       Rater
12         M    Other       Open        Judge and rater
13         F    Other       Blind       Rater
14         M    Clinician   Open        Rater (excluded)
15         M    Other       Blind       Rater
Generation of a Refined Instrument
A refined instrument would be produced only if (1) overall agreement was
good (ICC > 0.5), and (2) it was possible to differentiate between the three types
of study reports.
If appropriate, the final version of the instrument would include a list of instruc-
tions to score the items.
RESULTS
Judges, Raters, and Reports
The six judges were a psychologist, a clinical pharmacologist, a biochemist,
two anaesthetists, and a research nurse with full-time involvement in pain relief
activities. Thirty-six reports were selected for scoring. Seven had been judged
previously as excellent, 6 as poor, and the remaining 23 were chosen randomly.
Fifteen raters, eight men and seven women, were recruited to score the 36
reports. Four of the raters were regarded as researchers, six as clinicians, and five
as “others.” As was the case during other instrument development exercises [14],
all the judges participated in the scoring process (four as researchers and two as
others). Seven raters performed open assessments and eight scored the reports
under blind conditions (Table 1). One rater, a clinician allocated to open assess-
ment, was excluded from analysis because he recorded the scores incorrectly and
it was impossible to determine to which report each score referred.
Initial Instrument
The judges separately selected 49 nonredundant items (Table 2). Thirty-eight
items were excluded during the consensus meeting because of poor face validity.
The remaining 11 items were transformed to questions and included in the initial
instrument (Table 3). Each affirmative answer was given one point. If the trial
was described as randomized and/or double-blind additional points could be
awarded (one extra point in each case) if the method of randomization and/or
Table 2 Items Considered by Individual Judges
1. Random allocation (5)
2. Blinding (5)
3. Clear/validated outcomes (5)
4. Description of withdrawals and dropouts (5)
5. Clear hypothesis and objectives (4)
6. Clear inclusion/exclusion criteria (4)
7. Power calculation (4)
8. Appropriate size (3)
9. Intention to treat (3)
10. Single observer (3)
11. Adequate follow-up (3)
12. Negative/positive controls (3)
13. Controlled cointerventions (3)
14. Appropriate analysis (3)
15. Randomization method explained (2)
16. Description of investigators and assessors (2)
17. Description of interventions (2)
18. Raw data available (2)
19. Compliance check (2)
20. Adverse effects documented clearly (2)
21. Comparable groups (2)
22. Clinical relevance (1)
23. Protocol is followed (1)
24. Informed consent (1)
25. Adequate analysis (1)
26. Appropriate outcome measures (1)
27. Data supporting conclusions (1)
28. Paper clear and simple to understand (1)
29. Ethical approval (1)
30. Appropriate study (1)
31. Independent study (1)
32. Overall impression (1)
33. Prospective study (1)
34. More than 1 assessment time (1)
35. Attempt to demonstrate dose response with new agents (1)
36. Appropriate duration of study (1)
37. Description of selection method (1)
38. Definition of method to record adverse effects (1)
39. Definition of methods for adverse effect management (1)
40. Objective outcome measurements (1)
41. Avoidance of data unrelated to the question addressed (1)
42. Representative sample (1)
43. Statistics, central tendency, and dispersion measures reported (1)
44. Blinding testing (1)
45. Results of randomization reported (1)
46. Analysis of impact of withdrawals (1)
47. Clear tables (1)
48. Clear figures (1)
49. Clear retrospective analysis (1)
The number in parentheses indicates the number of judges who suggested each of the items.
double blinding was appropriate. Conversely, points could be deducted (one point
in each case) if the study was described as randomized or double blind, but the
methods were inappropriate. An instruction sheet was appended to the initial
instrument.
Table 3 Initial Instrument and Frequency of Endorsement

Related Directly to the Control of Bias

Items                                                          Endorsement Frequency (%)
1. Was the study described as randomized?^a                    63
2. Was the study described as double-blind?^a                  35
3. Was there a description of withdrawals and dropouts?        54

Other Markers Not Related Directly to the Control of Bias

Items                                                          Endorsement Frequency (%)
4. Were the objectives of the study defined?                   91
5. Were the outcome measures defined clearly?                  88
6. Was there a clear description of the inclusion and
   exclusion criteria?                                         71
7. Was the sample size justified (e.g., power calculation)?    10
8. Was there a clear description of the interventions?         87
9. Was there at least one control (comparison) group?          92
10. Was the method used to assess adverse effects described?   41
11. Were the methods of statistical analysis described?        73

^a The endorsement frequency for the appropriateness of the method to generate the sequence of random-
ization was 15%, and for double-blinding it was 34%. The frequency of endorsement for concealment
of randomization was evaluated separately, and it was 6%.
Field Trial
Frequency of Endorsement
Each item was scored 504 times. The frequency of endorsement of individual
items ranged from 10 to 92% (Table 3). Four items (definition of the objectives
of the study, definition of the outcome measures, description of the interventions,
and presence of a control group) were excluded because of the high frequency
of endorsement. Only one item was excluded because of the low frequency of
endorsement (justification of sample size). The remaining six items had frequencies
of endorsement ranging from 15 to 73%. Three of those items, randomization,
double blinding, and description of withdrawals and dropouts, were considered
as directly related to bias reduction (Table 3).
The maximum possible score was 13 points for the initial instrument
(11 items), 8 points for the 6 items with adequate frequency of endorsement, and
5 points for the 3 items directly related to bias reduction (Table 3).
Scores were calculated with all the items in the initial instrument (11-item score),
with the 6 items selected after assessment of frequency of endorsement (6-item
score), and with the 3 items directly related to bias reduction (3-item score).
Inter-rater Consistency
The overall agreement among the 14 raters was high for scores calculated with
either 11, 6, or 3 items (Table 4). All groups of raters produced reliable scores.
However, researchers produced more consistent scores than clinicians. Both of
these groups were more consistent than the “others” (Table 4). The 3-item scale
showed the highest levels of agreement, overall and within groups.
Table 4 Interrater Agreement and Construct Validity
Interrater Agreement [ICC (95% CI)]
Raters 11 Items 6 Items 3 Items
Researchers (n = 4) 0.69 (0.48, 0.83) 0.75 (0.58, 0.84) 0.77 (0.60, 0.86)
Clinicians (n = 5) 0.63 (0.44, 0.78) 0.66 (0.47, 0.80) 0.67 (0.48, 0.80)
Others (n = 5) 0.50 (0.26, 0.71) 0.56 (0.32, 0.76) 0.56 (0.36, 0.75)
All (n = 14)        0.59 (0.46, 0.74)  0.65 (0.51, 0.77)  0.66 (0.53, 0.79)
Construct Validity [Mean (Standard Error of the Mean)]
Overall Score
Report Sample 11 Items 6 Items 3 Items
Previously judged as excellent
(n = 7)                        9.9 (0.2)^a  5.7 (0.2)^a  3.4 (0.1)^a
Selected at random
(n = 23)                       8.3 (0.1)^b  4.5 (0.1)^b  2.7 (0.1)^b
Previously judged as poor
(n = 6)                        5.0 (0.2)    2.0 (0.1)    0.7 (0.1)
All (n = 36)                   8.0 (0.1)    4.3 (0.1)    2.5 (0.1)
^a Significantly higher than randomly selected and poor study reports (p < 0.001).
^b Significantly higher than poor study reports (p < 0.001).
Construct Validity
The mean overall score for the 36 reports using the 3-item instrument was 2.5;
the measurements ranged from 0 to 5. The scores for reports regarded as excellent
were significantly higher than for reports selected at random and both groups of
studies received significantly higher scores than those reports judged as poor, with
the 11-, 6-, and 3-item instruments (Table 4). The individual scores given to the
reports covered the whole spectrum, from 0 to the maximum possible, regardless
of the total number of items used.
All the reports judged as poor scored four points or less with the 6-item instru-
ment and 99% scored two points or less on the 3-item scale. For reports judged
excellent, 77% scored more than four points with the 6-item instrument and 71%
more than two points with the 3-item tool.
Final Version of the Instrument
The final version of the instrument contained the three items related directly
to the reduction of bias and whose frequency of endorsement was between 15
and 85% (Appendix). The items were presented as questions to elicit yes or no
answers. Points awarded for items 1 and 2 depended on the quality of the descrip-
tion of the methods to generate the sequence of randomization and/or on the
quality of the description of the method of double blinding. If the trial had been
described as randomized and/or double blind, but there was no description of
the methods used to generate the sequence of randomization or the double-blind
conditions, one point was awarded in each case. If the method of generating the
sequence of randomization and/or blinding had been described, one additional
Table 5 Open vs. Blind Assessments (6- and 3-Item Instruments)

              Overall Score                ICC (95% CI)
Assessment    6-Item       3-Item          6-Item             3-Item
Open          4.6 (0.1)^a  2.7 (0.1)^b     0.58 (0.44, 0.73)  0.56 (0.39, 0.58)
Blind         4.1 (0.1)    2.3 (0.1)       0.72 (0.61, 0.84)  0.76 (0.65, 0.86)

Overall score is expressed as mean (standard error of the mean) and the numbers have been rounded.
^a p < 0.001.
^b p < 0.01.
point was given to each item if the method was appropriate. A method to generate
randomization sequences was regarded as adequate if it allowed each study partici-
pant to have the same chance of receiving each intervention, and if the investigators
could not predict which intervention was next. Double blinding was considered
appropriate if it was stated or implied that neither the person doing the assessment
nor the study participant could identify the intervention being assessed. Con-
versely, if the method of generating the sequence of randomization and/or blinding
was described but not appropriate, the relevant item was given zero points (Appen-
dix). The third item, withdrawals and dropouts, was awarded zero points for a
negative answer and one point for a positive. For a positive answer, the number
of withdrawals and dropouts and the reasons had to be stated in each of the
comparison groups. If there were no withdrawals, it should have been stated in
the report (Appendix).
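The scoring rules for the final three-item instrument can be sketched as a small function (a non-authoritative rendering: the function name and argument encoding are ours, and "appropriate"/"inappropriate" follow the guidelines in the Appendix):

```python
def three_item_score(randomized, double_blind, withdrawals_described,
                     randomization_method=None, blinding_method=None):
    """Score a trial report on the final 3-item instrument (0-5 points).

    The method arguments take None (not described), "appropriate", or
    "inappropriate". Deductions apply only when the trial is described
    as randomized and/or double blind, so items 1 and 2 each contribute
    0-2 points and item 3 contributes 0-1.
    """
    score = 0
    if randomized:
        score += 1                                     # item 1: described as randomized
        if randomization_method == "appropriate":      # e.g. random-number table
            score += 1
        elif randomization_method == "inappropriate":  # e.g. alternation, birth date
            score -= 1
    if double_blind:
        score += 1                                     # item 2: described as double blind
        if blinding_method == "appropriate":           # e.g. identical or double-dummy placebo
            score += 1
        elif blinding_method == "inappropriate":       # e.g. tablet vs. injection, no dummy
            score -= 1
    if withdrawals_described:
        score += 1                                     # item 3: numbers and reasons per group
    return score
```

A report with appropriate randomization and double-blinding methods and a description of withdrawals would score the maximum of 5; one merely described as "randomized" with an inappropriate method would score 0 on that item.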
Open vs. Blind Assessments
Blind assessment of the reports produced significantly lower and more consistent
scores than open assessments using either the 6- or the 3-item scale (Table 5).
DISCUSSION
This study describes the development of an instrument to assess the quality of
clinical reports in pain relief and its use to evaluate the impact that blinding the
raters could have on the assessments. The instrument is simple, short, reliable,
and apparently valid. Given that none of the three items included in the final
version of the instrument is specific to pain reports, it may have applications in
other areas of medicine. All these items are very similar to the components of a
scale used extensively to assess the effectiveness of interventions during pregnancy
and childbirth [16] and are also part of most of the other available scales. There
is empirical evidence to support the role of randomization and double blinding
in bias reduction. It has been shown, for instance, that nonrandomized trials or
RCTs that do not use a double-blind design are more likely to show advantage
of an innovation over a standard treatment [17]. In a more recent study, the
analysis of 250 trials from 33 meta-analyses showed that RCTs in which treatment
allocation was inadequately concealed, or in which concealment of allocation was
unclear, yielded significantly larger estimates of treatment effects than those trials
in which concealment was adequate (p < 0.001) [18]. In the same study, trials
not using double blinding also yielded significantly larger estimates of treatment
effects (p < 0.01).
The instrument could be used by researchers and referees to assess study proto-
cols; by editors and readers of journals to identify scientifically sound reports;
by researchers to monitor the likelihood of bias in research reports, including
their own; by individuals (not necessarily with research or clinical experience in
pain relief) involved in systematic reviews or meta-analyses to perform differen-
tial analysis based on the quality of the individual primary studies; and by pa-
tients to evaluate the validity of the evidence presented to them by health profes-
sionals.
It was suggested more than 10 years ago that the quality of clinical reports
should be assessed under blind conditions to reduce the likelihood of selection
bias into systematic reviews and meta-analyses [4]. Blind assessment of the validity
of clinical research reports has also been suggested as part of the peer-review
process [6]. The lack of evidence to support this practice has been highlighted
recently [5, 61. We found that the blinded assessment produced significantly lower
and more consistent scores than open assessment. This may be very important
for editors of journals to reduce bias in manuscript selection which could be intro-
duced by open peer review, and for researchers if cutoff scores are recommended
for inclusion and exclusion of trials from systematic reviews or if quality scores
are used to weight the results of primary studies for use in subsequent meta-analysis
[19-211.
A major disadvantage of the instrument described in this paper and of most
others is that assessments of quality depend on the information available in the
reports. Space constraints in most printed journals, the referral of readers to previ-
ous publications as the sources for detailed description of the methods of the trial,
and the publication of trials in abstract form could lead to the assumption that
the trial was methodologically deficient, even when the trial had been designed,
conducted, and analyzed appropriately. Such a disadvantage could be avoided
in the future if journals adopted more uniform requirements for trial reporting
[7, 22].
We wish to thank Iain Chalmers and Andy Oxman for their advice during the development of the
instrument and Mansukh Popat, Josephine Fagan, Sarah Booth, Theresa Beynon, Susan McGarrity,
Juan Carlos Tellez, Karen Rose, Neal Thurley, and Shibee Jamal for taking part in the study as raters.
APPENDIX
Instrument to Measure the Likelihood of Bias in Pain Research Reports
This is not the same as being asked to review a paper. It should not take more than 10
minutes to score a report and there are no right or wrong answers.
Please read the article and try to answer the following questions (see attached instructions):
1. Was the study described as randomized (this includes the use of words such
as randomly, random, and randomization)?
2. Was the study described as double blind?
3. Was there a description of withdrawals and dropouts?
Scoring the items:
Either give a score of 1 point for each “yes” or 0 points for each “no.” There are no in-between
marks.
Give 1 additional point if: For question 1, the method to generate the sequence of
  randomization was described and it was appropriate (table of random numbers,
  computer generated, etc.)
and/or: For question 2, the method of double blinding was described and it was
  appropriate (identical placebo, active placebo, dummy, etc.)
Deduct 1 point if: For question 1, the method to generate the sequence of
  randomization was described and it was inappropriate (patients were allocated
  alternately, or according to date of birth, hospital number, etc.)
and/or: For question 2, the study was described as double blind but the method of
  blinding was inappropriate (e.g., comparison of tablet vs. injection with no
  double dummy)
Guidelines for Assessment
1. Randomization
A method to generate the sequence of randomization will be regarded as appropriate if it
allowed each study participant to have the same chance of receiving each intervention and
the investigators could not predict which treatment was next. Methods of allocation using
date of birth, date of admission, hospital numbers, or alternation should not be regarded
as appropriate.
2. Double blinding
A study must be regarded as double blind if the word “double blind” is used. The method
will be regarded as appropriate if it is stated that neither the person doing the assessments
nor the study participant could identify the intervention being assessed, or if in the absence
of such a statement the use of active placebos, identical placebos, or dummies is mentioned.
3. Withdrawals and dropouts
Participants who were included in the study but did not complete the observation period
or who were not included in the analysis must be described. The number and the reasons
for withdrawal in each group must be stated. If there were no withdrawals, it should be
stated in the article. If there is no statement on withdrawals, this item must be given no
points.
REFERENCES
1. Kassirer JP, Campion EW: Peer review: crude and understudied, but indispensable.
JAMA 272:96-97, 1994
2. Chalmers I: Evaluating the effects of care during pregnancy and childbirth. In Effective
Care in Pregnancy and Childbirth, Chalmers I, Enkin M, Keirse MJNC, eds: Oxford,
Oxford University Press, 1989
3. Oxman AD, Guyatt GH: Guidelines for reading literature reviews. Can Med Assoc
J 138:697-703, 1988
4. Chalmers TC, Smith H, Blackburn B, Silverman B, Schroeder B, Reitman D, Ambroz
A: A method for assessing the quality of a randomized control trial. Controlled Clin
Trials 2:31-49, 1981
5. Irwig L, Tosteson ANA, Gatsonis C, Lau J, Colditz G, Chalmers TC, Mosteller F:
Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med 120:667-
676, 1994
6. Fisher M, Friedman SE, Strauss B: The effects of blinding on acceptance of research
papers by peer review. JAMA 272:143-146, 1994
7. Moher D, Jadad AR, Tugwell P: Assessing the quality of randomized controlled trials:
current issues and future directions. Int J Technol Assess Health Care (in press), 1995
8. Moher D, Jadad AR, Nichol G, Penman M, Tugwell P, Walsh S: Assessing the quality
of randomized controlled trials: an annotated bibliography of scales and checklists.
Controlled Clin Trials 16:62-73, 1995
9. Feinstein AR: Clinimetrics. New Haven, Yale University Press, 1987.
10. Streiner DL, Norman GR: Health measurement scales: a practical guide to their develop-
ment and use. New York, Oxford University Press, 1989.
11. Oxman AD, Guyatt GH: Validation of an index of the quality of review articles. J
Clin Epidemiol 44:1271-1278, 1991
12. Fink A, Kosecoff J, Chassin M, Brook RH: Consensus methods: characteristics and
guidelines for use. Am J Public Health 74:979-983, 1984
13. Shrout PE, Fleiss JL: Intraclass correlations: uses in assessing rater reliability. Psychol
Bull 86:420-428, 1979
14. Oxman AD, Guyatt GH, Singer J, Goldsmith CH, Hutchison BG, Milner RA, Streiner
DL: Agreement among reviewers of review articles. J Clin Epidemiol 44:91-98, 1991
15. Jadad AR, McQuay HJ: A high-yield strategy to identify randomized controlled trials
for systematic reviews. Online J Curr Clin Trials Doc No 33, Feb 27, 1993
16. Chalmers I, Enkin M, Keirse MJNC: Effective care in pregnancy and childbirth. Oxford,
Oxford University Press, 1989
17. Colditz GA, Miller JN, Mosteller F: How study design affects outcomes in comparisons
of therapy. I: Medical. Stat Med 8:441-454, 1989
18. Schulz KF, Chalmers I, Hayes RJ, Altman DG: Empirical evidence of bias: dimensions
of methodological quality associated with estimates of treatment effects in controlled
trials. JAMA 273:408-412, 1995
19. McNutt RA, Evans AT, Fletcher RH, Fletcher SW: The effects of blinding on the quality
of peer review: a randomized trial. JAMA 263:1371-1376, 1990
20. Nurmohamed MT, Rosendaal FR, Buller HR, Dekker E, Hommes DW, Vandenbroucke
JP, Briet E: Low molecular-weight heparin versus standard heparin in general and
orthopaedic surgery: a meta-analysis. Lancet 340:152-156, 1992
21. Fleiss JL, Gross AJ: Meta-analysis in epidemiology, with special reference to studies
of the association between exposure to environmental tobacco smoke and lung cancer:
a critique. J Clin Epidemiol 44:127-139, 1991
22. Detsky AS, Naylor CD, O'Rourke K, McGeer AJ, L'Abbé KA: Incorporating variations
in the quality of individual randomized trials into meta-analysis. J Clin Epidemiol 45:
255-265, 1992
A system has been constructed to evaluate the design, implementation, and analysis of randomized control trials (RCT). The degree of quadruple blinding (the randomization process, the physicians and patients as to therapy, and the physicians as to ongoing results) is considered to be the most important aspect of any trial. The analytic techniques are scored with the same emphasis as is placed on the control of bias in the planning and implementation of the studies. Description of the patient and treatment materials and the measurement of various controls of quality have less weight. An index of quality of a RCT is proposed with its pros and cons. If published papers were to approximate these principles, there would be a marked improvement in the quality of randomized control trials. Finally, a reasonable standard design and conduct of trials will facilitate the interpretation of those with conflicting results and help in making valid combinations of undersized trials.