High agreement but low kappa: I. The problems of two paradoxes.

Yale University School of Medicine, New Haven, CT 06510.
Journal of Clinical Epidemiology (Impact Factor: 5.48). 02/1990; 43(6):543-9. DOI: 10.1016/0895-4356(90)90158-L
Source: PubMed

ABSTRACT In a fourfold table showing binary agreement of two observers, the observed proportion of agreement, p0, can be paradoxically altered by the chance-corrected ratio that creates kappa as an index of concordance. In one paradox, a high value of p0 can be drastically lowered by a substantial imbalance in the table's marginal totals either vertically or horizontally. In the second paradox, kappa will be higher with an asymmetrical rather than symmetrical imbalance in marginal totals, and with imperfect rather than perfect symmetry in the imbalance. An adjustment that substitutes kappa max for kappa does not repair either problem, and seems to make the second one worse.
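
Both paradoxes can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `kappa_2x2`, the cell labels a, b, c, d, and the four example tables are assumptions chosen for this illustration and are not presented as the paper's own data. It uses the standard definition kappa = (p0 - pe) / (1 - pe), where pe is the chance agreement computed from the marginal totals.

```python
def kappa_2x2(a, b, c, d):
    """Return (p0, kappa) for a fourfold table [[a, b], [c, d]].

    a = both observers positive, d = both negative,
    b and c = the two kinds of disagreement (hypothetical labels).
    """
    n = a + b + c + d
    p0 = (a + d) / n  # observed proportion of agreement
    # Chance-expected agreement from the row and column marginal totals.
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return p0, (p0 - pe) / (1 - pe)


# First paradox: same p0, but imbalanced marginals drag kappa down.
balanced = kappa_2x2(40, 9, 6, 45)     # marginals close to 50/50
imbalanced = kappa_2x2(80, 10, 5, 5)   # both observers mostly "positive"

# Second paradox: same p0, but kappa is higher when the two observers'
# marginal imbalances point in opposite directions.
symmetrical = kappa_2x2(45, 15, 25, 15)   # positives: 60 vs 70
asymmetrical = kappa_2x2(25, 35, 5, 35)   # positives: 60 vs 30

for name, (p0, k) in [("balanced", balanced), ("imbalanced", imbalanced),
                      ("symmetrical", symmetrical), ("asymmetrical", asymmetrical)]:
    print(f"{name:12s} p0={p0:.2f} kappa={k:.2f}")
```

In the first pair, p0 is 0.85 for both tables while kappa falls from about 0.70 to about 0.32. In the second pair, p0 is 0.60 for both tables while kappa rises from about 0.13 to about 0.26.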

  • Source
    ABSTRACT: Variation in rib numbering has been noted in adolescent idiopathic scoliosis (AIS), but its effect on the reporting of fusion levels has not been studied. We hypothesized that vertebral numbering variations can lead to differing documentation of fusion levels.
    Journal of children's orthopaedics. 11/2014;
  • Source
    ABSTRACT: Background. Tracheal hypoplasia is a congenital condition described in mainly brachycephalic breeds and is one component of the brachycephalic obstructive airway syndrome (BOAS). Two radiographic methods have been described to evaluate the dimensions of the tracheal diameter in dogs and to distinguish between hypoplastic and non-hypoplastic tracheas: the tracheal lumen diameter to thoracic inlet distance ratio (TD/TI) and the ratio between the thoracic tracheal luminal diameter and the width of the proximal third of the third rib (TT/3R). The purpose of this study was to compare these two published radiographic methods between observers, different measuring occasions and to investigate the effect on classification of dogs as having hypoplastic or non-hypoplastic tracheas using four previously published mean ratios as cut-offs (<0.11, <0.127 and <0.144 for the TD/TI and <2.0 for the TT/3R method). Three observers evaluated right and left lateral recumbent radiographs from 56 adult English Bulldogs independently on three different occasions. TD/TI and TT/3R ratios were calculated and correlated between measuring occasions. Kappa, observed, positive, and negative agreements were calculated between observers and measuring occasions. Number of hypoplastic and non-hypoplastic dogs for each method and occasion was determined using <0.11, <0.127 and <0.144 as cut-offs for TD/TI and <2.0 for TT/3R. Results. Intraobserver agreement varied with kappa between 0.45-0.94 for the TD/TI and 0.20-0.86 for the TT/3R method. Interobserver kappa varied between 0.27-0.70 for the TD/TI method and between 0.05-0.57 for the TT/3R method. There was poor agreement in classifying English Bulldogs as tracheal hypoplastic or non-hypoplastic, depending on measuring method, cut-off value and observer. Conclusions. The diagnostic value of both the TD/TI and TT/3R methods with such poor agreement is questionable, and significantly impacts their reliability for both clinical evaluation of dogs and use in health screening programs.
    Acta veterinaria Scandinavica. 12/2014; 56(1):79.
  • Source
    ABSTRACT: Objectives. To measure inter-rater agreement of overall clinical appearance of febrile children aged less than 24 months and to compare methods for doing so. Study Design and Setting. We performed an observational study of inter-rater reliability of the assessment of febrile children in a county hospital emergency department serving a mixed urban and rural population. Two emergency medicine healthcare providers independently evaluated the overall clinical appearance of children less than 24 months of age who had presented for fever. They recorded the initial 'gestalt' assessment of whether or not the child was ill appearing or if they were unsure. They then repeated this assessment after examining the child. Each rater was blinded to the other's assessment. Our primary analysis was graphical. We also calculated Cohen's κ, Gwet's agreement coefficient and other measures of agreement and weighted variants of these. We examined the effect of time between exams and patient and provider characteristics on inter-rater agreement. Results. We analyzed 159 of the 173 patients enrolled. Median age was 9.5 months (lower and upper quartiles 4.9-14.6), 99/159 (62%) were boys and 22/159 (14%) were admitted. Overall 118/159 (74%) and 119/159 (75%) were classified as well appearing on initial 'gestalt' impression by both examiners. Summary statistics varied from 0.223 for weighted κ to 0.635 for Gwet's AC2. Inter-rater agreement was affected by the time interval between the evaluations and the age of the child but not by the experience levels of the rater pairs. Classifications of 'not ill appearing' were more reliable than others. Conclusion. The inter-rater reliability of emergency providers' assessment of overall clinical appearance was adequate when described graphically and by Gwet's AC. Different summary statistics yield different results for the same dataset.
    PeerJ. 01/2014; 2:e651.
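
Several of the abstracts listed above report positive agreement, negative agreement, and Gwet's agreement coefficient alongside kappa. As a rough sketch of how these differ on one fourfold table, the hypothetical helper `agreement_indices` below computes the usual two-rater, two-category forms: positive agreement 2a / (2a + b + c), negative agreement 2d / (2d + b + c), and the unweighted Gwet AC1 (not the weighted AC2 mentioned above). The function name, cell labels, and example counts are assumptions for illustration and are not taken from any of the listed studies.

```python
def agreement_indices(a, b, c, d):
    """Agreement indices for a fourfold table [[a, b], [c, d]].

    a = both raters positive, d = both negative, b and c = disagreements.
    Assumes the denominators below are non-zero (no empty table).
    """
    n = a + b + c + d
    p0 = (a + d) / n                   # observed agreement
    p_pos = 2 * a / (2 * a + b + c)    # positive agreement
    p_neg = 2 * d / (2 * d + b + c)    # negative agreement

    # Gwet's AC1: chance agreement uses the mean "positive" prevalence
    # across the two raters rather than the product of the marginals.
    pi = ((a + b) + (a + c)) / (2 * n)
    pe = 2 * pi * (1 - pi)
    ac1 = (p0 - pe) / (1 - pe)
    return {"p0": p0, "p_pos": p_pos, "p_neg": p_neg, "ac1": ac1}


# Illustrative table with strongly imbalanced marginals.
result = agreement_indices(80, 10, 5, 5)
print({name: round(value, 3) for name, value in result.items()})
```

For this table, positive agreement is high (about 0.91) and negative agreement is low (0.40), so a single summary index can hide where the raters actually disagree.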