Some Practical Issues of Experimental Design and Data Analysis in Radiological ROC Studies

Department of Radiology, University of Chicago, IL 60637.
Investigative Radiology (Impact Factor: 4.44). 04/1989; 24(3):234-45. DOI: 10.1097/00004424-198903000-00012
Source: PubMed


Receiver operating characteristic (ROC) analysis has been used in a broad variety of medical imaging studies during the past 15 years, and its advantages over more traditional measures of diagnostic performance are now clearly established. But despite the essential simplicity of the approach, workers in the field often find--sometimes only after an ROC study is under way--that a number of subtle issues related to experimental design and data analysis must be confronted in practice. Many of these issues have not been discussed in the literature in detail, and most are not well known. The purposes of this paper are to make users of ROC methodology in medical imaging aware of potential problems that should be confronted before an ROC study is begun and to indicate, at least broadly, how those problems may be dealt with, given the present state of the art. Some of the issues raised here can be addressed adequately by easily prescribed techniques, whereas others remain difficult and will be resolved fully only by new methodologic developments.

    • "To determine the diagnostic accuracy of each modality for the two reviewers, the area under the ROC curve (Az) value was evaluated. Factors with Az values greater than 0.80 were regarded as good diagnostic accuracy (25). The Az values of each imaging modality acquired from the ROC curves were statistically compared using the paired Z-test."
    ABSTRACT: To compare the diagnostic performance of high-resolution ultrasound (HRUS) with contrast-enhanced CT and contrast-enhanced magnetic resonance imaging (MRI) with MR cholangiopancreatography (MRCP) to differentiate between adenomyomatosis (ADM) and gallbladder cancer (GBCA). Forty patients with surgically proven ADM (n = 13) or GBCA at stage T2 or lower (n = 27) who previously underwent preoperative HRUS, contrast-enhanced CT, and contrast-enhanced MRI with MRCP were retrospectively included in this study. According to the well-known diagnostic criteria, two reviewers independently analyzed the images from each modality separately with a five-point confidence scale. The interobserver agreement was calculated using weighted κ statistics. A receiver operating characteristic curve analysis was performed and the sensitivity, specificity, and accuracy were calculated for each modality when scores of 1 or 2 indicated ADM. The interobserver agreement between the two reviewers was good to excellent. The mean Az values for HRUS, multidetector CT (MDCT), and MRI were 0.959, 0.898, and 0.935, respectively, without any statistically significant differences between any of the modalities (p > 0.05). The mean sensitivity of MRI with MRCP (80.8%) was significantly higher than that of MDCT (50.0%) (p = 0.0215). However, the mean sensitivity of MRI with MRCP (80.8%) was not significantly different from that of HRUS (73.1%) (p > 0.05). The mean specificities and accuracies among the three modalities were not significantly different (p > 0.05). High-resolution ultrasound and MRI with MRCP have comparable sensitivity and accuracy and MDCT has the lowest sensitivity and accuracy for the differentiation of ADM and GBCA.
    Full-text · Article · Mar 2014 · Korean journal of radiology: official journal of the Korean Radiological Society
    • "The key of ROC analysis in radiology is that the participants rate the confidence of judgment or the likelihood of malignancy etc. instead of giving binary answer (i.e. present or absent) (Obuchowski 2003; Metz 1978; Hanley & McNeil 1982; Berbaum et al. 1989; Metz 1989; Gur et al. 1989). The fundamental problem of rating is that the decision needs to be "not obvious", and "should be of borderline difficulty" (Metz 1978)."
    ABSTRACT: Detection performance of radiologists for "obvious" targets should be evaluated by a visual search task instead of ROC analysis, but visual search tasks have not been applied to radiology studies. The aim of this study was to set up an environment that allows a visual search task in radiology, to evaluate its feasibility, and to preliminarily investigate the effect of career on performance. In a darkroom, ten radiologists were asked to indicate the type of lesion by pressing buttons when images without lesions or with a bulla, ground-glass nodule, or solid nodule were randomly presented on a display. Differences in accuracy and reaction time depending on board certification were investigated. The visual search task was performed successfully. Radiologists in both the non-board and board groups had high sensitivity, specificity, positive predictive values, and negative predictive values. Reaction time was under 1 second for all target types in both groups. Board radiologists were significantly faster in answering for bulla, but there were no significant differences for other targets and values. We developed an experimental system that allows visual search experiments in radiology. Reaction time for detection of bulla was shortened with experience.
    Full-text · Article · Nov 2013 · SpringerPlus
    • "Two centimeters of PMMA plates were placed on top of the ALVIM phantom to generate images simulating an X-ray beam spectrum as used for images of dense breasts[12,13]. It is known that failures in the detection and characterization of cancer may be attributed to technical factors or operational limitations[1,2,9]. Large breast thickness and density can be responsible for an increase in the false-positive rate[1,19]."
    ABSTRACT: Our goal was to evaluate mammography systems based on the detection of microcalcifications and fibers in readings of images of a statistical phantom (ALVIM, model TRM 18-209, Nuclear Associates). ALVIM phantom images were acquired under diverse exposure conditions with various equipment, and 5 radiologists with similar expertise reported their findings. The reading performance in the detection of microcalcifications and fibers of different sizes was measured by simulation of equivalent breast tissue with 4.5 and 6.5 cm thicknesses. We determined kappa values, ROC curves, and kappa probability density and detection rates with dedicated software developed locally. The statistical results generated three kappa (K) ranges that allowed quantification of the detection performance at three quality levels: unacceptable (K ≤ 0.64), acceptable (0.64 < K < 0.70), and achievable (K ≥ 0.70). An extensive database permitted a comparison of the reading performance with 99.5% reliability (p < 0.005). The comparison showed a larger dispersion of the kappa values for the low-contrast images generated with mammography equipment that was not properly calibrated, showing that the method is able to detect the performance changes associated with the loss of image quality.
    Full-text · Article · May 2012 · Radiological Physics and Technology