Testing a tool for the classification of study designs in systematic reviews of interventions and exposures showed moderate reliability and low accuracy

Department of Pediatrics, Alberta Research Center for Health Evidence and the University of Alberta Evidence-based Practice Center, University of Alberta, 11402 University Avenue, Edmonton, Alberta, Canada.
Journal of Clinical Epidemiology (Impact Factor: 3.42). 04/2011; 64(8):861-71. DOI: 10.1016/j.jclinepi.2011.01.010
Source: PubMed


Objective: To develop and test a study design classification tool.
Study Design and Setting: We contacted relevant organizations and individuals to identify tools used to classify study designs and ranked these using predefined criteria. The highest-ranked tool was a design algorithm developed, but no longer advocated, by the Cochrane Non-Randomized Studies Methods Group; this was modified to include additional study designs and decision points. We developed a reference classification for 30 studies; 6 testers applied the tool to these studies. Interrater reliability (Fleiss' κ) and accuracy against the reference classification were assessed. The tool was further revised and retested.
Results: Initial reliability was fair both among the testers (κ=0.26) and among the reference standard raters (κ=0.33). Testing after revisions showed improved reliability (κ=0.45, moderate agreement) and improved, but still low, accuracy. The most common disagreements concerned whether the study design was experimental (5 of 15 studies) and whether there was a comparison of any kind (4 of 15 studies). Agreement was higher among testers who had completed graduate-level training than among those who had not.
Conclusion: The moderate reliability and low accuracy may reflect a lack of clarity and comprehensiveness in the tool, inadequate reporting of the studies, and variability in tester characteristics. The results may not be generalizable to all published studies, as the test studies were selected because their design classification had challenged previous reviewers. Application of such a tool should be accompanied by training, pilot testing, and context-specific decision rules.
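The reliability figures above are chance-corrected agreement statistics for multiple raters (Fleiss' κ). As a rough illustration only, here is a minimal Python sketch of computing Fleiss' κ for several testers classifying the same set of studies; the ratings are hypothetical and the statsmodels package is assumed to be available.

```python
# Minimal sketch: Fleiss' kappa for multiple raters (hypothetical data).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = studies, columns = testers; values are assigned design categories
# (e.g., 0 = RCT, 1 = cohort, 2 = controlled before-after). Purely illustrative.
ratings = np.array([
    [0, 0, 1, 0, 0, 1],
    [1, 1, 1, 2, 1, 1],
    [2, 2, 2, 2, 0, 2],
    [0, 1, 1, 1, 1, 1],
])

# Convert rater-by-study assignments into a studies x categories count table,
# then compute the chance-corrected agreement across all raters.
table, _ = aggregate_raters(ratings)
print(fleiss_kappa(table))
```

On the commonly used Landis and Koch scale, values of 0.21-0.40 are read as fair agreement and 0.41-0.60 as moderate agreement, which is how the κ=0.26 and κ=0.45 results above are interpreted.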

    • "There are also other problems with using NRSs in a systematic review. NRSs are more difficult to locate with a search because there is no agreed nomenclature [8]. In case of a systematic review, this means that searches that are sensitive are nonspecific and yield a very large number of references. "
    ABSTRACT: Nonrandomized studies (NRS) are considered to provide less reliable evidence for intervention effects; nevertheless, they are included in Cochrane reviews despite being discouraged. There has been no evaluation of when and how these designs are used, so we conducted an overview of current practice. We included all Cochrane reviews that considered NRS, with study selection and data extraction performed in duplicate. Of the 202 included reviews, 114 (56%) did not cite a reason for including NRS. The reasons given fell into two major categories: NRS were included either because randomized controlled trials (RCTs) were wanted but were not feasible, were lacking, or were insufficient on their own (N = 81, 92%), or because RCTs were not needed (N = 7, 8%). A range of designs was included, with controlled before-after studies the most common. Most interventions were nonpharmaceutical and most settings nonmedical. For risk of bias assessment, the Cochrane Effective Practice and Organisation of Care Group's checklists were used most often (38%), whereas other reviewers used a variety of checklists and self-constructed tools. Most Cochrane reviews do not justify including NRS; when they do, most reasons are not in line with Cochrane recommendations. Risk of bias assessment varies across reviews and needs improvement.
    Journal of Clinical Epidemiology 04/2014; 67(6). DOI:10.1016/j.jclinepi.2014.01.001 · 3.42 Impact Factor
    • "It is interesting to note that although Gwet proved that the AC1 is better than Cohen’s Kappa in 2001, a finding subsequently confirmed by biostatisticians [18], few researchers have used AC1 as a statistical tool, or are even aware of it, especially in the medical field. Most recently published articles that have assessed inter-rater reliability have used Cohen’s Kappa exclusively [19-26], and a recent review of the current methods used for inter-rater reliability does not even mention AC1 [27]. During our research of PubMed (up to February 2013), we found only 2 published articles that mention using Gwet’s AC1 method as part of a study [28,29]. "
    ABSTRACT: Background: Rater agreement is important in clinical research, and Cohen's Kappa is a widely used method for assessing inter-rater reliability; however, there are well documented statistical problems associated with the measure. In order to assess its utility, we evaluated it against Gwet's AC1 and compared the results. Methods: This study was carried out across 67 patients (56% males) aged 18 to 67, with a mean ± SD age of 44.13 ± 12.68 years. Nine raters (7 psychiatrists, a psychiatry resident, and a social worker) participated as interviewers for either the first or the second interview, the two interviews being held 4 to 6 weeks apart. The interviews were held in order to establish a personality disorder (PD) diagnosis using DSM-IV criteria. Cohen's Kappa and Gwet's AC1 were used, and the level of agreement between raters was assessed in terms of a simple categorical diagnosis (i.e., the presence or absence of a disorder). Data were also compared with a previous analysis in order to evaluate the effects of trait prevalence. Results: Gwet's AC1 showed higher inter-rater reliability coefficients for all the PD criteria, ranging from .752 to 1.000, whereas Cohen's Kappa ranged from 0 to 1.00. Cohen's Kappa values were high and close to the percentage of agreement when the prevalence was high, whereas Gwet's AC1 values changed little with prevalence and remained close to the percentage of agreement. For example, a Schizoid sample yielded a mean Cohen's Kappa of .726 and a Gwet's AC1 of .853, which fall into different levels of agreement according to the criteria developed by Landis and Koch, and by Altman and Fleiss. Conclusions: Based on the different formulae used to calculate the level of chance-corrected agreement, Gwet's AC1 provided a more stable inter-rater reliability coefficient than Cohen's Kappa. It was also less affected by prevalence and marginal probability than Cohen's Kappa, and therefore should be considered for use in inter-rater reliability analysis.
    BMC Medical Research Methodology 04/2013; 13(1). DOI:10.1186/1471-2288-13-61 · 2.27 Impact Factor
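Both statistics in the abstract above have the form (observed agreement − chance agreement) / (1 − chance agreement); they differ only in how chance agreement is estimated. The sketch below (illustrative data, not from the study; plain NumPy, no dedicated agreement library assumed) implements the two-rater formulas: Cohen's κ estimates chance agreement from the product of each rater's marginal proportions, whereas Gwet's AC1 uses their average, which is why the two can diverge sharply when one category dominates.

```python
# Minimal sketch: Cohen's kappa vs. Gwet's AC1 for two raters (hypothetical data).
import numpy as np

def cohen_kappa(a, b):
    a, b = np.asarray(a), np.asarray(b)
    cats = np.unique(np.concatenate([a, b]))
    po = np.mean(a == b)                                        # observed agreement
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in cats)   # chance agreement: product of marginals
    return (po - pe) / (1 - pe)

def gwet_ac1(a, b):
    a, b = np.asarray(a), np.asarray(b)
    cats = np.unique(np.concatenate([a, b]))
    po = np.mean(a == b)
    pi = np.array([(np.mean(a == c) + np.mean(b == c)) / 2 for c in cats])
    pe = np.sum(pi * (1 - pi)) / (len(cats) - 1)                # AC1 chance agreement: average marginals
    return (po - pe) / (1 - pe)

# Skewed example: both raters call the diagnosis "present" for nearly every patient.
rater1 = [1] * 18 + [0, 1]
rater2 = [1] * 18 + [1, 0]
print(cohen_kappa(rater1, rater2))  # near zero despite 90% raw agreement
print(gwet_ac1(rater1, rater2))     # stays close to the raw agreement (~0.89)
```

With these skewed ratings the raters agree on 18 of 20 cases, yet Cohen's κ is close to zero because its expected chance agreement is nearly as large as the observed agreement, while AC1 remains near the percentage agreement; this is the kind of marginal-probability effect the authors describe.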
    • "Studies in which observations are made in two periods, before and after introducing the intervention of interest in some but not all participants, may be known as cohort controlled before and after studies (education and sociology), controlled before and after studies (health organization) or difference in differences studies (economics). Reliance on simple design labels can lead to potential confusion either among review author teams or among readers of reviews (Hartling et al., 2011). Design labels have been used to classify studies within hierarchies of evidence, such that they can be ranked according to perceived risk of bias. "
    ABSTRACT: Non-randomized studies may provide valuable evidence on the effects of interventions. They are the main source of evidence on the intended effects of some types of interventions and often provide the only evidence about the effects of interventions on long-term outcomes, rare events, or adverse effects. Therefore, systematic reviews on the effects of interventions may include various types of non-randomized studies. In this second paper in a series, we address how review authors might articulate the particular non-randomized study designs they will include and how they might evaluate, in general terms, the extent to which a particular non-randomized study is at risk of important biases. We offer guidance for describing and classifying different non-randomized designs based on specific features of the studies, in place of using non-informative study design labels. We also suggest criteria to consider when deciding whether to include non-randomized studies. We conclude that a taxonomy of study designs based on study design features is needed. Review authors need new tools specifically to assess the risk of bias for some non-randomized designs that involve a different inferential logic compared with parallel group trials. Copyright © 2012 John Wiley & Sons, Ltd.
    Research Synthesis Methods 03/2013; 4(1):12-25. DOI:10.1002/jrsm.1056