Testing a tool for the classification of study designs in systematic reviews of interventions and exposures showed moderate reliability and low accuracy

Department of Pediatrics, Alberta Research Center for Health Evidence and the University of Alberta Evidence-based Practice Center, University of Alberta, 11402 University Avenue, Edmonton, Alberta, Canada.
Journal of clinical epidemiology (Impact Factor: 3.42). 04/2011; 64(8):861-71. DOI: 10.1016/j.jclinepi.2011.01.010
Source: PubMed

ABSTRACT: To develop and test a study design classification tool.
We contacted relevant organizations and individuals to identify tools used to classify study designs and ranked these using predefined criteria. The highest ranked tool was a design algorithm developed, but no longer advocated, by the Cochrane Non-Randomized Studies Methods Group; this was modified to include additional study designs and decision points. We developed a reference classification for 30 studies; 6 testers applied the tool to these studies. Interrater reliability (Fleiss' κ) and accuracy against the reference classification were assessed. The tool was further revised and retested.
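The sketch below illustrates the kind of decision-point logic such a design algorithm relies on (Was allocation controlled by the investigators? Was it randomized? Is there a comparison group?). The branches, category labels, and function name are illustrative assumptions for this page, not the published tool.

```python
# Toy sketch of decision-point logic for classifying study designs.
# The branches and category labels are illustrative assumptions; they are
# not the algorithm evaluated in the paper above.
def classify_design(investigator_assigned: bool,
                    randomized: bool,
                    has_comparison_group: bool,
                    measured_before_and_after: bool) -> str:
    if investigator_assigned:              # experimental vs. observational
        if randomized:
            return "randomized controlled trial"
        return "non-randomized (quasi-)experimental study"
    if has_comparison_group:               # observational with a comparison
        return "comparative observational study (e.g., cohort, case-control)"
    if measured_before_and_after:
        return "uncontrolled before-after study"
    return "descriptive study (e.g., case series)"

# Example: investigator-assigned intervention without random allocation
print(classify_design(True, False, True, False))
```

A misjudgment at the first branch (experimental or not) relabels everything downstream, which is consistent with the disagreement patterns reported in the results below.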
Initial reliability was fair among the testers (κ=0.26) and the reference standard raters (κ=0.33). Testing after revisions showed improved reliability (κ=0.45, moderate agreement) and improved, but still low, accuracy. The most common disagreements were whether the study design was experimental (5 of 15 studies) and whether there was a comparison of any kind (4 of 15 studies). Agreement was higher among testers who had completed graduate-level training than among those who had not.
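For readers unfamiliar with the statistic, here is a minimal sketch of how Fleiss' κ, the multi-rater chance-corrected agreement coefficient quoted above, is computed; the rating matrix is hypothetical, not the study's data. The verbal labels follow the usual Landis and Koch bands (0.21-0.40 fair, 0.41-0.60 moderate).

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a (studies x categories) count matrix, where
    counts[i, j] is the number of raters who put study i in category j.
    Assumes every study was rated by the same number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_studies = counts.shape[0]
    n_raters = counts.sum(axis=1)[0]

    # Observed agreement: mean pairwise agreement within each study
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Chance agreement from the marginal category proportions
    p_j = counts.sum(axis=0) / (n_studies * n_raters)
    p_e = np.sum(p_j ** 2)

    return (p_bar - p_e) / (1.0 - p_e)

# Hypothetical example: 5 studies, 6 raters, 3 design categories
ratings = [
    [6, 0, 0],
    [4, 2, 0],
    [2, 3, 1],
    [0, 5, 1],
    [1, 1, 4],
]
print(round(fleiss_kappa(ratings), 2))   # ~0.31 for this made-up matrix
```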
The moderate reliability and low accuracy may reflect a lack of clarity and comprehensiveness in the tool, inadequate reporting of the studies, and variability in tester characteristics. The results may not be generalizable to all published studies, as the test studies were selected because they had posed design-classification challenges for previous reviewers. Application of such a tool should be accompanied by training, pilot testing, and context-specific decision rules.

    • "There are also other problems with using NRSs in a systematic review. NRSs are more difficult to locate with a search because there is no agreed nomenclature [8]. In case of a systematic review, this means that searches that are sensitive are nonspecific and yield a very large number of references. "
    ABSTRACT: Nonrandomized studies (NRSs) are considered to provide less reliable evidence for intervention effects. However, these are included in Cochrane reviews, despite discouragement. There has been no evaluation of when and how these designs are used. Therefore, we conducted an overview of current practice. We included all Cochrane reviews that considered NRS, conducting inclusions and data extraction in duplicate. Of the included 202 reviews, 114 (56%) did not cite a reason for including NRS. The reasons were divided into two major categories: NRS were included because randomized controlled trials (RCTs) are wanted (N = 81, 92%) but not feasible, lacking, or insufficient alone or because RCTs are not needed (N = 7, 8%). A range of designs were included with controlled before-after studies as the most common. Most interventions were nonpharmaceutical and the settings nonmedical. For risk of bias assessment, Cochrane Effective Practice and Organisation of Care Group's checklists were used by most reviewers (38%), whereas others used a variety of checklists and self-constructed tools. Most Cochrane reviews do not justify including NRS. When they do, most are not in line with Cochrane recommendations. Risk of bias assessment varies across reviews and needs improvement.
    Journal of clinical epidemiology 04/2014; 67(6). DOI:10.1016/j.jclinepi.2014.01.001 · 3.42 Impact Factor
    • "It is interesting to note that although Gwet proved that the AC1 is better than Cohen’s Kappa in 2001, a finding subsequently confirmed by biostatisticians [18], few researchers have used AC1 as a statistical tool, or are even aware of it, especially in the medical field. Most recently published articles that have assessed inter-rater reliability have used Cohen’s Kappa exclusively [19-26], and a recent review of the current methods used for inter-rater reliability does not even mention AC1 [27]. During our research of PubMed (up to February 2013), we found only 2 published articles that mention using Gwet’s AC1 method as part of a study [28,29]. "
    ABSTRACT: Background: Rater agreement is important in clinical research, and Cohen’s Kappa is a widely used method for assessing inter-rater reliability; however, there are well-documented statistical problems associated with the measure. In order to assess its utility, we evaluated it against Gwet’s AC1 and compared the results. Methods: This study was carried out across 67 patients (56% males) aged 18 to 67, with a mean ± SD age of 44.13 ± 12.68 years. Nine raters (7 psychiatrists, a psychiatry resident, and a social worker) participated as interviewers, either for the first or the second interviews, which were held 4 to 6 weeks apart. The interviews were held in order to establish a personality disorder (PD) diagnosis using DSM-IV criteria. Cohen’s Kappa and Gwet’s AC1 were used, and the level of agreement between raters was assessed in terms of a simple categorical diagnosis (i.e., the presence or absence of a disorder). Data were also compared with a previous analysis in order to evaluate the effects of trait prevalence. Results: Gwet’s AC1 was shown to have higher inter-rater reliability coefficients for all the PD criteria, ranging from .752 to 1.000, whereas Cohen’s Kappa ranged from 0 to 1.00. Cohen’s Kappa values were high and close to the percentage of agreement when the prevalence was high, whereas Gwet’s AC1 values did not change much with a change in prevalence and remained close to the percentage of agreement. For example, a Schizoid sample revealed a mean Cohen’s Kappa of .726 and a Gwet’s AC1 of .853, which fall within different levels of agreement according to the criteria developed by Landis and Koch, and by Altman and Fleiss. Conclusions: Based on the different formulae used to calculate the level of chance-corrected agreement, Gwet’s AC1 was shown to provide a more stable inter-rater reliability coefficient than Cohen’s Kappa. It was also found to be less affected by prevalence and marginal probability than Cohen’s Kappa, and therefore should be considered for use in inter-rater reliability analysis.
    BMC Medical Research Methodology 04/2013; 13(1). DOI:10.1186/1471-2288-13-61 · 2.27 Impact Factor
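To make the prevalence effect described in the abstract above concrete, here is a minimal sketch that computes both coefficients for two raters making a binary (present/absent) diagnosis; the 2x2 table is hypothetical and was chosen only so that one category dominates.

```python
import numpy as np

def cohen_kappa(table):
    """Cohen's kappa for two raters; table[i][j] counts subjects rated
    category i by rater A and category j by rater B."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_o = np.trace(t) / n
    # Chance agreement from the product of each rater's marginal proportions
    p_e = np.sum(t.sum(axis=1) * t.sum(axis=0)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

def gwet_ac1(table):
    """Gwet's AC1 for two raters and q categories."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    q = t.shape[0]
    p_o = np.trace(t) / n
    # Chance agreement from the mean category proportions across both raters
    pi = (t.sum(axis=1) + t.sum(axis=0)) / (2 * n)
    p_e = np.sum(pi * (1 - pi)) / (q - 1)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical high-prevalence table: the raters agree on 58 of 67 subjects,
# but almost everyone falls in the dominant ("absent") category.
table = [[55, 5],
         [4, 3]]
print(round(cohen_kappa(table), 3), round(gwet_ac1(table), 3))  # ~0.325 vs ~0.832
```

On the same table, with the same raw agreement, Cohen's Kappa is dragged down by the skewed marginals while Gwet's AC1 stays close to the percentage agreement, which is the prevalence effect the authors describe.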
    • "The following items were changed during the research process; therefore, the study protocol [33] should be adjusted to reflect these changes: the inclusion criteria was clarified according to the EPOC Data Collection Checklist[37] (Addition File 2); the data extraction process was modified to include a research design algorithm [38] (Additional File 3) to be used in place of the study design component of the EPOC Data Collection Checklist; and the methodological quality assessment tool for qualitative studies that was described in the protocol was replaced with the Quality Assessment Tool for Qualitative Studies[43] (Additional File 6). "
    ABSTRACT: Background: Knowledge translation (KT) aims to close the research-practice gap in order to realize and maximize the benefits of research within the practice setting. Previous studies have investigated KT strategies in nursing and medicine; however, the present study is the first systematic review of the effectiveness of a variety of KT interventions in five allied health disciplines: dietetics, occupational therapy, pharmacy, physiotherapy, and speech-language pathology. Methods: A health research librarian developed and implemented search strategies in eight electronic databases (MEDLINE, CINAHL, ERIC, PASCAL, EMBASE, IPA, Scopus, CENTRAL) using language (English) and date restrictions (1985 to March 2010). Other relevant sources were manually searched. Two reviewers independently screened the titles and abstracts, reviewed full-text articles, performed data extraction, and performed quality assessment. Within each profession, evidence tables were created, grouping and analyzing data by research design, KT strategy, targeted behaviour, and primary outcome. The published descriptions of the KT interventions were compared to the Workgroup for Intervention Development and Evaluation Research (WIDER) Recommendations to Improve the Reporting of the Content of Behaviour Change Interventions. Results: A total of 2,638 articles were located and the titles and abstracts were screened. Of those, 1,172 full-text articles were reviewed and subsequently 32 studies were included in the systematic review. A variety of single (n = 15) and multiple (n = 17) KT interventions were identified, with educational meetings being the predominant KT strategy (n = 11). The majority of primary outcomes were identified as professional/process outcomes (n = 25); however, patient outcomes (n = 4), economic outcomes (n = 2), and multiple primary outcomes (n = 1) were also represented. Generally, the studies were of low methodological quality. Outcome reporting bias was common and precluded clear determination of intervention effectiveness. In the majority of studies, the interventions demonstrated mixed effects on primary outcomes, and only four studies demonstrated statistically significant, positive effects on primary outcomes. None of the studies satisfied the four WIDER Recommendations. Conclusions: Across five allied health professions, equivocal results, low methodological quality, and outcome reporting bias limited our ability to recommend one KT strategy over another. Further research employing the WIDER Recommendations is needed to inform the development and implementation of effective KT interventions in allied health.
    Implementation Science 07/2012; 7(70). DOI:10.1186/1748-5908-7-70 · 4.12 Impact Factor