Article

High agreement but low kappa: I. The problems of two paradoxes.

Yale University School of Medicine, New Haven, CT 06510.
Journal of Clinical Epidemiology (Impact Factor: 5.48). 02/1990; 43(6):543-9. DOI: 10.1016/0895-4356(90)90158-L
Source: PubMed

ABSTRACT: In a fourfold table showing binary agreement of two observers, the observed proportion of agreement, p0, can be paradoxically altered by the chance-corrected ratio that creates kappa as an index of concordance. In one paradox, a high value of p0 can be drastically lowered by a substantial imbalance in the table's marginal totals, either vertically or horizontally. In the second paradox, kappa will be higher with an asymmetrical rather than symmetrical imbalance in marginal totals, and with imperfect rather than perfect symmetry in the imbalance. An adjustment that substitutes kappa_max for kappa does not repair either problem, and seems to make the second one worse.
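Both paradoxes follow from the defining ratio kappa = (p0 - pe) / (1 - pe), where pe is the agreement expected by chance from the marginal totals. The minimal sketch below, with hypothetical cell counts chosen for illustration rather than taken from the paper's own tables, shows the first paradox: the table with imbalanced marginals has the higher observed agreement but the far lower (here even negative) kappa.

```python
# Sketch of paradox 1 with hypothetical cell counts (not the paper's tables).
# For a 2x2 table with agreement cells a, d and disagreement cells b, c:
#   p0    = (a + d) / n                      observed agreement
#   pe    = (r1*c1 + r2*c2) / n^2            chance agreement from the marginals
#   kappa = (p0 - pe) / (1 - pe)

def kappa_2x2(a, b, c, d):
    n = a + b + c + d
    p0 = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return p0, (p0 - pe) / (1 - pe)

# Balanced marginals: p0 = 0.80 gives kappa = 0.60.
print(kappa_2x2(40, 10, 10, 40))

# Heavily imbalanced marginals: p0 rises to 0.90, yet kappa drops below zero.
print(kappa_2x2(90, 5, 5, 0))
```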

  • ABSTRACT: The goals of the current study were to expand the content domain and further validate the Coercion Assessment Scale (CAS), a measure of perceived coercion for criminally involved substance abusers being recruited into research. Unlike the few existing measures of this construct, the CAS identifies specific external sources of pressure that may influence one's decision to participate. In Phase 1, we conducted focus groups with criminal justice clients and stakeholders to expand the instrument by identifying additional sources of pressure. In Phase 2, we evaluated the expanded measure (i.e., endorsement rates, reliability, validity) in an ongoing research trial. Results identified new sources of pressure and provided evidence supporting the CAS's utility and reliability over time as well as convergent and discriminative validity. © The Author(s) 2014.
    Journal of Empirical Research on Human Research Ethics 10/2014; 9(4):60-70. DOI:10.1177/1556264614544100 · 1.22 Impact Factor
  • NAACL; 01/2013
  • ABSTRACT: Evaluation of inter-rater agreement (IRA) or inter-rater reliability (IRR), either as a primary or a secondary component of a study, is common in disciplines such as medicine, psychology, education, anthropology and marketing, where the use of raters or observers as a method of measurement is prevalent. The concept of IRA/IRR is fundamental to the design and evaluation of research instruments. However, many methods for comparing variations and statistical tests exist, and as a result there is often confusion about their appropriate use, which may lead to incomplete and inconsistent reporting of results. Consequently, a set of guidelines for reporting reliability and agreement studies has recently been developed to improve the scientific rigor with which IRA/IRR studies are conducted and reported (Gisev, Bell & Chen, 2013; Kottner, Audige, & Brorson, 2011). The objective of this technical note is to present the key concepts relating to IRA/IRR and to describe commonly used approaches for their evaluation. The emphasis is on the practical aspects of their use in behavioral and social research rather than on the mathematical derivation of the indices. Although practitioners, researchers and policymakers often use the terms IRA and IRR interchangeably, there is a technical distinction between agreement and reliability (LeBreton & Senter, 2008; de Vet, Terwee, Tinsley & Weiss, 2000). In general, IRR is a generic term for rater consistency: it relates to the extent to which raters can consistently distinguish different items on a measurement scale, and some measurement experts define it as the consistency between evaluators regardless of the absolute value of each evaluator's rating. In contrast, IRA measures the extent to which different raters assign the same precise value to each item being observed; in other words, IRA is the degree to which two or more evaluators using the same scale assign the same rating to an identical observable situation. Thus, unlike IRR, IRA is a measure of the consistency between the absolute values of evaluators' ratings. The distinction between IRR and IRA is further illustrated in the hypothetical example in
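    One concrete illustration of that distinction, using assumed rating data for illustration rather than the note's own hypothetical example, is a pair of raters whose scores are perfectly consistent in ordering but offset by a constant: reliability (indexed here by the Pearson correlation) is perfect, while exact agreement is zero.

```python
# Assumed ratings for illustration only; not the note's hypothetical example.
rater_a = [1, 2, 3, 4, 5]
rater_b = [3, 4, 5, 6, 7]          # same ordering, every score shifted up by 2

n = len(rater_a)
mean_a, mean_b = sum(rater_a) / n, sum(rater_b) / n

# Inter-rater reliability: consistency of the ratings (Pearson correlation).
cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(rater_a, rater_b))
var_a = sum((x - mean_a) ** 2 for x in rater_a)
var_b = sum((y - mean_b) ** 2 for y in rater_b)
reliability = cov / (var_a * var_b) ** 0.5   # 1.0: perfectly consistent

# Inter-rater agreement: proportion of items given the same absolute rating.
agreement = sum(x == y for x, y in zip(rater_a, rater_b)) / n   # 0.0: no matches

print(reliability, agreement)
```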