Article

Agreement Between Experts and an Untrained Crowd for Identifying Dermoscopic Features Using a Gamified App: Reader Feasibility Study

Abstract

Background: Dermoscopy is commonly used for the evaluation of pigmented lesions, but agreement between experts for identification of dermoscopic structures is known to be relatively poor. Expert labeling of medical data is a bottleneck in the development of machine learning (ML) tools, and crowdsourcing has been demonstrated as a cost- and time-efficient method for the annotation of medical images.

Objective: The aim of this study is to demonstrate that crowdsourcing can be used to label basic dermoscopic structures from images of pigmented lesions with similar reliability to a group of experts.

Methods: First, we obtained labels of 248 images of melanocytic lesions with 31 dermoscopic “subfeatures” labeled by 20 dermoscopy experts. Because interrater reliability (IRR) for the subfeatures was low, these were collapsed into 6 dermoscopic “superfeatures” based on structural similarity: dots, globules, lines, network structures, regression structures, and vessels. These images were then used as the gold standard for the crowd study. The commercial platform DiagnosUs was used to obtain annotations from a nonexpert crowd for the presence or absence of the 6 superfeatures in each of the 248 images. We replicated this methodology with a group of 7 dermatologists to allow direct comparison with the nonexpert crowd. The Cohen κ value was used to measure agreement across raters.

Results: In total, we obtained 139,731 ratings of the 6 dermoscopic superfeatures from the crowd. Agreement was relatively low for the identification of dots and globules (median κ values of 0.526 and 0.395, respectively), whereas network structures and vessels showed the highest agreement (median κ values of 0.581 and 0.798, respectively). This pattern was also seen among the expert raters, who had median κ values of 0.483 and 0.517 for dots and globules, respectively, and 0.758 and 0.790 for network structures and vessels. The median κ values between nonexperts and thresholded average–expert readers were 0.709 for dots, 0.719 for globules, 0.714 for lines, 0.838 for network structures, 0.818 for regression structures, and 0.728 for vessels.

Conclusions: This study confirmed that IRR for different dermoscopic features varied among a group of experts; a similar pattern was observed in a nonexpert crowd. There was good or excellent agreement for each of the 6 superfeatures between the crowd and the experts, highlighting the similar reliability of the crowd for labeling dermoscopic images. This confirms the feasibility and dependability of using crowdsourcing as a scalable solution to annotate large sets of dermoscopic images, with several potential clinical and educational applications, including the development of novel, explainable ML tools.
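The Cohen κ used throughout the study compares two raters' labels while correcting for chance agreement. As a minimal, illustrative sketch (the labels below are hypothetical, not study data), pairwise agreement on one superfeature could be computed as follows:

```python
# Hedged sketch: pairwise Cohen's kappa for one superfeature, assuming
# binary presence/absence labels per image (1 = present, 0 = absent).
# The label values are made up for illustration.
from sklearn.metrics import cohen_kappa_score

expert_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
crowd_labels  = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(expert_labels, crowd_labels)
print(f"Cohen kappa: {kappa:.3f}")
```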

Article
Full-text available
Biology has become a prime area for the deployment of deep learning and artificial intelligence (AI), enabled largely by the massive data sets that the field can generate. Key to most AI tasks is the availability of a sufficiently large, labeled data set with which to train AI models. In the context of microscopy, it is easy to generate image data sets containing millions of cells and structures. However, it is challenging to obtain large-scale high-quality annotations for AI models. Here, we present HALS (Human-Augmenting Labeling System), a human-in-the-loop data labeling AI, which begins uninitialized and learns annotations from a human, in real-time. Using a multi-part AI composed of three deep learning models, HALS learns from just a few examples and immediately decreases the workload of the annotator, while increasing the quality of their annotations. Using a highly repetitive use-case—annotating cell types—and running experiments with seven pathologists—experts at the microscopic analysis of biological specimens—we demonstrate a manual work reduction of 90.60%, and an average data-quality boost of 4.34%, measured across four use-cases and two tissue stain types.
Article
Full-text available
Rapid advances in image processing capabilities have been seen across many domains, fostered by the application of machine learning algorithms to "big-data". However, within the realm of medical image analysis, advances have been curtailed, in part, due to the limited availability of large-scale, well-annotated datasets. One of the main reasons for this is the high cost often associated with producing large amounts of high-quality meta-data. Recently, there has been growing interest in the application of crowdsourcing for this purpose; a technique that has proven effective for creating large-scale datasets across a range of disciplines, from computer vision to astrophysics. Despite the growing popularity of this approach, there has not yet been a comprehensive literature review to provide guidance to researchers considering using crowdsourcing methodologies in their own medical imaging analysis. In this survey, we review studies applying crowdsourcing to the analysis of medical images, published prior to July 2018. We identify common approaches, challenges and considerations, providing guidance of utility to researchers adopting this approach. Finally, we discuss future opportunities for development within this emerging domain.
Article
Full-text available
Prior skin image datasets have not addressed patient-level information obtained from multiple skin lesions from the same patient. Though artificial intelligence classification algorithms have achieved expert-level performance in controlled studies examining single images, in practice dermatologists base their judgment holistically from multiple lesions on the same patient. The 2020 SIIM-ISIC Melanoma Classification challenge dataset described herein was constructed to address this discrepancy between prior challenges and clinical practice, providing for each image in the dataset an identifier allowing lesions from the same patient to be mapped to one another. This patient-level contextual information is frequently used by clinicians to diagnose melanoma and is especially useful in ruling out false positives in patients with many atypical nevi. The dataset represents 2,056 patients (20.8% with at least one melanoma, 79.2% with zero melanomas) from three continents with an average of 16 lesions per patient, consisting of 33,126 dermoscopic images and 584 (1.8%) histopathologically confirmed melanomas compared with benign melanoma mimickers.
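As a hedged sketch of how the patient-level identifier described above might be used in practice (the column names below are assumptions for illustration, not the challenge's official schema), lesions can be grouped per patient with a few lines of pandas:

```python
# Hypothetical metadata table: one row per dermoscopic image, with an
# identifier that maps lesions from the same patient to one another.
import pandas as pd

meta = pd.DataFrame({
    "image_name": ["img_001", "img_002", "img_003", "img_004"],
    "patient_id": ["IP_001", "IP_001", "IP_002", "IP_002"],
    "target":     [0, 1, 0, 0],  # 1 = melanoma, 0 = benign
})

# Number of lesions per patient and whether the patient has >=1 melanoma,
# i.e. the kind of patient-level context a clinician would consider.
per_patient = meta.groupby("patient_id")["target"].agg(["count", "max"])
per_patient.columns = ["n_lesions", "any_melanoma"]
print(per_patient)
```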
Article
Full-text available
In this paper, we present and discuss a novel reliability metric to quantify the extent to which a ground truth generated in multi-rater settings can serve as a reliable basis for the training and validation of machine learning predictive models. To define this metric, three dimensions are taken into account: agreement (that is, how much a group of raters mutually agree on a single case); confidence (that is, how certain a rater is of each rating expressed); and competence (that is, how accurate a rater is). Therefore, this metric produces a reliability score weighted for the raters’ confidence and competence, but it only requires the former to actually be collected, as the latter can be estimated from the ratings themselves if no further information is available. We found that our proposal was both more conservative and more robust to known paradoxes than other existing agreement measures, by virtue of a more articulated notion of agreement due to chance, which was based on an empirical estimation of the reliability of the individual raters involved. We discuss the above metric within a realistic annotation task that involved 13 expert radiologists in labeling the MRNet dataset. We also provide a nomogram by which to assess the actual accuracy of a classification model, given the reliability of its ground truth. In this respect, we also make the point that theoretical estimates of model performance are consistently overestimated if ground truth reliability is not properly taken into account.
Article
Full-text available
Background: Crowdsourcing is used increasingly in health and medical research. Crowdsourcing is the process of aggregating crowd wisdom to solve a problem. The purpose of this systematic review is to summarize quantitative evidence on crowdsourcing to improve health. Methods: We followed Cochrane systematic review guidance and systematically searched seven databases up to September 4th 2019. Studies were included if they reported on crowdsourcing and related to health or medicine. Studies were excluded if recruitment was the only use of crowdsourcing. We determined the level of evidence associated with review findings using the GRADE approach. Results: We screened 3508 citations, accessed 362 articles, and included 188 studies. Ninety-six studies examined effectiveness, 127 examined feasibility, and 37 examined cost. The most common purposes were to evaluate surgical skills (17 studies), to create sexual health messages (seven studies), and to provide layperson cardio-pulmonary resuscitation (CPR) out-of-hospital (six studies). Seventeen observational studies used crowdsourcing to evaluate surgical skills, finding that crowdsourcing evaluation was as effective as expert evaluation (low quality). Four studies used a challenge contest to solicit human immunodeficiency virus (HIV) testing promotion materials and increase HIV testing rates (moderate quality), and two of the four studies found this approach saved money. Three studies suggested that an interactive technology system increased rates of layperson initiated CPR out-of-hospital (moderate quality). However, studies analyzing crowdsourcing to evaluate surgical skills and layperson-initiated CPR were only from high-income countries. Five studies examined crowdsourcing to inform artificial intelligence projects, most often related to annotation of medical data. Crowdsourcing was evaluated using different outcomes, limiting the extent to which studies could be pooled. Conclusions: Crowdsourcing has been used to improve health in many settings. Although crowdsourcing is effective at improving behavioral outcomes, more research is needed to understand effects on clinical outcomes and costs. More research is needed on crowdsourcing as a tool to develop artificial intelligence systems in medicine. Trial registration: PROSPERO: CRD42017052835. December 27, 2016.
Article
Full-text available
The incidence of skin tumors has steadily increased. Although most are benign and do not affect survival, some of the more malignant skin tumors present a lethal threat if a delay in diagnosis permits them to become advanced. Ideally, an inspection by an expert dermatologist would accurately detect malignant skin tumors in the early stage; however, it is not practical for every single patient to receive intensive screening by dermatologists. To overcome this issue, many studies are ongoing to develop dermatologist-level, computer-aided diagnostics. Although many systems that can classify dermoscopic images at this dermatologist-equivalent level have been reported, far fewer systems that can classify conventional clinical images have been reported thus far. Recently, the introduction of deep-learning technology, a method that automatically extracts a set of representative features for further classification, has dramatically improved classification efficacy. This new technology has the potential to improve the computer classification accuracy of conventional clinical images to the level of skilled dermatologists. In this review, this new technology and the present development of computer-aided skin tumor classifiers will be summarized.
Article
Full-text available
Background: An atypical vascular pattern is one of the most important features for differentiating between benign and malignant pigmented skin lesions. Detection and analysis of vascular structures is a necessary initial step in skin mole assessment and a prerequisite for providing an accurate outcome with the widely used 7-point checklist diagnostic algorithm. Methods: In this research we present a fully automated machine learning approach for segmenting vascular structures in dermoscopy colour images. The U-Net architecture is based on convolutional networks and designed for fast and precise segmentation of images. After preprocessing, the images are randomly divided into 146,516 patches of 64 × 64 pixels each. Results: On the independent validation dataset of 74 images, our implemented method showed high segmentation accuracy. For the U-Net convolutional neural network, an average DSC of 0.84, sensitivity of 0.85, and specificity of 0.81 were achieved. Conclusion: Vascular structures, due to their small size and similarity to other local structures, create enormous difficulties during the segmentation and assessment process. The use of advanced segmentation methods such as deep learning, especially convolutional neural networks, has the potential to improve the accuracy of advanced local structure detection.
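The DSC (Dice similarity coefficient) reported above measures overlap between a predicted mask and a reference mask. A minimal sketch of how it could be computed for a binary vessel mask (toy data, not the study's images):

```python
import numpy as np

def dice_coefficient(pred, truth, eps=1e-7):
    """DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)

# Toy 64 x 64 masks standing in for one patch of the segmentation output.
rng = np.random.default_rng(0)
truth = rng.random((64, 64)) > 0.7
pred = truth.copy()
pred[:8, :] = False  # simulate some missed vessel pixels
print(f"DSC: {dice_coefficient(pred, truth):.3f}")
```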
Article
Full-text available
Background: Melanoma has one of the fastest rising incidence rates of any cancer. It accounts for a small percentage of skin cancer cases but is responsible for the majority of skin cancer deaths. Although history-taking and visual inspection of a suspicious lesion by a clinician are usually the first in a series of 'tests' to diagnose skin cancer, dermoscopy has become an important tool to assist diagnosis by specialist clinicians and is increasingly used in primary care settings. Dermoscopy is a magnification technique using visible light that allows more detailed examination of the skin compared to examination by the naked eye alone. Establishing the additive value of dermoscopy over and above visual inspection alone across a range of observers and settings is critical to understanding its contribution to the diagnosis of melanoma and to future understanding of the potential role of the growing number of other high-resolution image analysis techniques. Objectives: To determine the diagnostic accuracy of dermoscopy alone, or when added to visual inspection of a skin lesion, for the detection of cutaneous invasive melanoma and atypical intraepidermal melanocytic variants in adults. We separated studies according to whether the diagnosis was recorded face-to-face (in person) or based on remote (image-based) assessment. Search methods: We undertook a comprehensive search of the following databases from inception up to August 2016: CENTRAL; MEDLINE; Embase; CINAHL; CPCI; Zetoc; Science Citation Index; US National Institutes of Health Ongoing Trials Register; NIHR Clinical Research Network Portfolio Database; and the World Health Organization International Clinical Trials Registry Platform. We studied reference lists and published systematic review articles. Selection criteria: Studies of any design that evaluated dermoscopy in adults with lesions suspicious for melanoma, compared with a reference standard of either histological confirmation or clinical follow-up. Data on the accuracy of visual inspection, to allow comparisons of tests, were included only if reported in the included studies of dermoscopy. Data collection and analysis: Two review authors independently extracted all data using a standardised data extraction and quality assessment form (based on QUADAS-2). We contacted authors of included studies where information related to the target condition or diagnostic threshold was missing. We estimated accuracy using hierarchical summary receiver operating characteristic (SROC) methods. Analysis of studies allowing direct comparison between tests was undertaken. To facilitate interpretation of results, we computed values of sensitivity at the point on the SROC curve with 80% fixed specificity and values of specificity with 80% fixed sensitivity. We investigated the impact of in-person test interpretation; use of a purposely developed algorithm to assist diagnosis; observer expertise; and dermoscopy training.

Main results: We included a total of 104 study publications reporting on 103 study cohorts with 42,788 lesions (including 5700 cases), providing 354 datasets for dermoscopy. The risk of bias was mainly low for the index test and reference standard domains and mainly high or unclear for participant selection and participant flow. Concerns regarding the applicability of study findings were largely scored as 'high' concern in three of four domains assessed. Selective participant recruitment, lack of reproducibility of diagnostic thresholds, and lack of detail on observer expertise were particularly problematic. The accuracy of dermoscopy for the detection of invasive melanoma or atypical intraepidermal melanocytic variants was reported in 86 datasets: 26 for evaluations conducted in person (dermoscopy added to visual inspection) and 60 for image-based evaluations (diagnosis based on interpretation of dermoscopic images). Analyses of studies by prior testing revealed no obvious effect on accuracy; analyses were hampered by the lack of studies in primary care, lack of relevant information, and the restricted inclusion of lesions selected for biopsy or excision. Accuracy was higher for in-person diagnosis compared to image-based evaluations (relative diagnostic odds ratio (RDOR) 4.6, 95% confidence interval (CI) 2.4 to 9.0; P < 0.001). We compared accuracy for (a) in-person evaluations of dermoscopy (26 evaluations; 23,169 lesions and 1664 melanomas) versus visual inspection alone (13 evaluations; 6740 lesions and 459 melanomas), and (b) image-based evaluations of dermoscopy (60 evaluations; 13,475 lesions and 2851 melanomas) versus image-based visual inspection (11 evaluations; 1740 lesions and 305 melanomas). For both comparisons, meta-analysis found dermoscopy to be more accurate than visual inspection alone, with RDORs of (a) 4.7 (95% CI 3.0 to 7.5; P < 0.001) and (b) 5.6 (95% CI 3.7 to 8.5; P < 0.001). For (a), the predicted difference in sensitivity at a fixed specificity of 80% was 16% (95% CI 8% to 23%; 92% for dermoscopy + visual inspection versus 76% for visual inspection), and the predicted difference in specificity at a fixed sensitivity of 80% was 20% (95% CI 7% to 33%; 95% for dermoscopy + visual inspection versus 75% for visual inspection). For (b), the predicted difference in sensitivity at a fixed specificity of 80% was 34% (95% CI 24% to 46%; 81% for dermoscopy versus 47% for visual inspection), and the predicted difference in specificity at a fixed sensitivity of 80% was 40% (95% CI 27% to 57%; 82% for dermoscopy versus 42% for visual inspection). Using the median prevalence of disease in each set of studies ((a) 12% for in-person and (b) 24% for image-based) for a hypothetical population of 1000 lesions, an increase in sensitivity of (a) 16% (in person) and (b) 34% (image based) from using dermoscopy at a fixed specificity of 80% equates to a reduction in the number of melanomas missed of (a) 19 and (b) 81, with (a) 176 and (b) 152 false positive results. An increase in specificity of (a) 20% (in person) and (b) 40% (image based) at a fixed sensitivity of 80% equates to a reduction in the number of unnecessary excisions from using dermoscopy of (a) 176 and (b) 304, with (a) 24 and (b) 48 melanomas missed. The use of a named or published algorithm to assist dermoscopy interpretation (as opposed to no reported algorithm or reported use of pattern analysis) had no significant impact on accuracy either for in-person (RDOR 1.4, 95% CI 0.34 to 5.6; P = 0.17) or image-based (RDOR 1.4, 95% CI 0.60 to 3.3; P = 0.22) evaluations. This result was supported by subgroup analysis according to the algorithm used. We observed higher accuracy for observers reported as having high experience and for those classed as 'expert consultants' in comparison to those considered to have less experience in dermoscopy, particularly for image-based evaluations. Evidence for the effect of dermoscopy training on test accuracy was very limited but suggested associated improvements in sensitivity.

Authors' conclusions: Despite the observed limitations in the evidence base, dermoscopy is a valuable tool to support the visual inspection of a suspicious skin lesion for the detection of melanoma and atypical intraepidermal melanocytic variants, particularly in referred populations and in the hands of experienced users. Data to support its use in primary care are limited; however, it may assist in triaging suspicious lesions for urgent referral when employed by suitably trained clinicians. Formal algorithms may be of most use for dermoscopy training purposes and for less expert observers; however, reliable data comparing approaches using dermoscopy in person are lacking.
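The hypothetical 1000-lesion arithmetic quoted above follows directly from prevalence, sensitivity, and specificity. A small sketch reproducing it, using only the figures given in the abstract:

```python
def impact(prevalence, sens_gain, spec_gain, n=1000):
    """Translate gains in sensitivity/specificity into counts for n lesions."""
    melanomas = prevalence * n
    benign = n - melanomas
    fewer_missed = sens_gain * melanomas      # gain at a fixed 80% specificity
    false_positives = 0.20 * benign           # the 20% of benign lesions flagged
    fewer_excisions = spec_gain * benign      # gain at a fixed 80% sensitivity
    still_missed = 0.20 * melanomas           # the 20% of melanomas still missed
    return fewer_missed, false_positives, fewer_excisions, still_missed

print("In-person (12% prevalence):", impact(0.12, 0.16, 0.20))
# -> roughly (19, 176, 176, 24), matching the abstract's figures
print("Image-based (24% prevalence):", impact(0.24, 0.34, 0.40))
# -> roughly (81, 152, 304, 48)
```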
Article
Full-text available
Vascular structures of the skin are important biomarkers in the diagnosis and assessment of cutaneous conditions. The presence and distribution of lesional vessels are associated with specific abnormalities. Therefore, detection and localization of cutaneous vessels provide critical information towards diagnosis and staging of disease. However, cutaneous vessels are highly variable in shape, size, color, and architecture, which complicates the detection task. Considering the large variability of these structures, conventional vessel detection techniques lack the generalizability to detect different vessel types and require separate algorithms to be designed for each type. Furthermore, such techniques are highly dependent on precise hand-crafted features, which are time-consuming and computationally inefficient. As a solution, we propose a data-driven feature learning framework based on stacked sparse auto-encoders (SSAE) for comprehensive detection of cutaneous vessels. Each training image is divided into small patches that either contain or do not contain vasculature. A multilayer SSAE is designed to learn hidden features of the data in hierarchical layers in an unsupervised manner. The high-level learned features are subsequently fed into a classifier which categorizes each patch by the absence or presence of vasculature and localizes vessels within the lesion. Over a test set of 3095 patches derived from 200 images, the proposed framework demonstrated superior performance with 95.4% detection accuracy over a variety of vessel patterns, outperforming other techniques by achieving the highest positive predictive value of 94.7%. The proposed Computer-Aided Diagnosis (CAD) framework can serve as a decision support system assisting dermatologists in more accurate diagnosis, especially in teledermatology applications in remote areas.
Article
Full-text available
Two parallel phenomena are gaining attention in human-computer interaction research: gamification and crowdsourcing. Because crowdsourcing's success depends on a mass of motivated crowdsourcees, crowdsourcing platforms have increasingly been imbued with motivational design features borrowed from games; a practice often called gamification. While the body of literature and knowledge of the phenomenon have begun to accumulate, we still lack a comprehensive and systematic understanding of conceptual foundations, knowledge of how gamification is used in crowdsourcing, and whether it is effective. We first provide a conceptual framework for gamified crowdsourcing systems in order to understand and conceptualize the key aspects of the phenomenon. The paper's main contributions are derived through a systematic literature review that investigates how gamification has been examined in different types of crowdsourcing in a variety of domains. This meticulous mapping, which focuses on all aspects in our framework, enables us to infer what kinds of gamification efforts are effective in different crowdsourcing approaches as well as to point to a number of research gaps and lay out future research directions for gamified crowdsourcing systems. Overall, the results indicate that gamification has been an effective approach for increasing crowdsourcing participation and the quality of the crowdsourced work; however, differences exist between different types of crowdsourcing: the research conducted in the context of crowdsourcing of homogenous tasks has most commonly used simple gamification implementations, such as points and leaderboards, whereas crowdsourcing implementations that seek diverse and creative contributions employ gamification with a richer set of mechanics.
Article
Full-text available
Importance: The comparative diagnostic performance of dermoscopic algorithms and their individual criteria are not well studied. Objectives: To analyze the discriminatory power and reliability of dermoscopic criteria used in melanoma detection and compare the diagnostic accuracy of existing algorithms. Design, setting, and participants: This was a retrospective, observational study of 477 lesions (119 melanomas [24.9%] and 358 nevi [75.1%]), which were divided into 12 image sets that consisted of 39 or 40 images per set. A link on the International Dermoscopy Society website from January 1, 2011, through December 31, 2011, directed participants to the study website. Data analysis was performed from June 1, 2013, through May 31, 2015. Participants included physicians, residents, and medical students, and there were no specialty-type or experience-level restrictions. Participants were randomly assigned to evaluate 1 of the 12 image sets. Main outcomes and measures: Associations with melanoma and intraclass correlation coefficients (ICCs) were evaluated for the presence of dermoscopic criteria. Diagnostic accuracy measures were estimated for the following algorithms: the ABCD rule, the Menzies method, the 7-point checklist, the 3-point checklist, chaos and clues, and CASH (color, architecture, symmetry, and homogeneity). Results: A total of 240 participants registered, and 103 (42.9%) evaluated all images. The 110 participants (45.8%) who evaluated fewer than 20 lesions were excluded, resulting in data from 130 participants (54.2%), 121 (93.1%) of whom were regular dermoscopy users. Criteria associated with melanoma included marked architectural disorder (odds ratio [OR], 6.6; 95% CI, 5.6-7.8), pattern asymmetry (OR, 4.9; 95% CI, 4.1-5.8), nonorganized pattern (OR, 3.3; 95% CI, 2.9-3.7), border score of 6 (OR, 3.3; 95% CI, 2.5-4.3), and contour asymmetry (OR, 3.2; 95% CI, 2.7-3.7) (P < .001 for all). Most dermoscopic criteria had poor to fair interobserver agreement. Criteria that reached moderate levels of agreement included comma vessels (ICC, 0.44; 95% CI, 0.40-0.49), absence of vessels (ICC, 0.46; 95% CI, 0.42-0.51), dark brown color (ICC, 0.40; 95% CI, 0.35-0.44), and architectural disorder (ICC, 0.43; 95% CI, 0.39-0.48). The Menzies method had the highest sensitivity for melanoma diagnosis (95.1%) but the lowest specificity (24.8%) compared with any other method (P < .001). The ABCD rule had the highest specificity (59.4%). All methods had similar areas under the receiver operating characteristic curves. Conclusions and relevance: Important dermoscopic criteria for melanoma recognition were revalidated by participants with varied experience. Six algorithms tested had similar but modest levels of diagnostic accuracy, and the interobserver agreement of most individual criteria was poor.
Article
Full-text available
Background: Evolving dermoscopic terminology motivated us to initiate a new consensus. Objective: We sought to establish a dictionary of standardized terms. Methods: We reviewed the medical literature, conducted a survey, and convened a discussion among experts. Results: Two competing terminologies exist: a more metaphoric terminology that includes numerous terms and a descriptive terminology based on 5 basic terms. In a survey among members of the International Dermoscopy Society (IDS), 23.5% (n = 201) of participants preferentially use descriptive terminology, 20.1% (n = 172) use metaphoric terminology, and 56.5% (n = 484) use both. More participants who had been initially trained in metaphoric terminology prefer using descriptive terminology than vice versa (9.7% vs 2.6%, P < .001). Most new terms that were published since the last consensus conference in 2003 were unknown to the majority of the participants. There was uniform consensus that both terminologies are suitable, that metaphoric terms need definitions, that synonyms should be avoided, and that the creation of new metaphoric terms should be discouraged. The expert panel proposed a dictionary of standardized terms taking account of metaphoric and descriptive terms. Limitations: A consensus seeks a workable compromise but does not guarantee its implementation. Conclusion: The new consensus provides a revised framework of standardized terms to enhance the consistent use of dermoscopic terminology.
Article
Full-text available
The lack of publicly available ground-truth data has been identified as the major challenge for transferring recent developments in deep learning to the biomedical imaging domain. Though crowdsourcing has enabled annotation of large-scale databases for real-world images, its application for biomedical purposes requires a deeper understanding and, hence, a more precise definition of the actual annotation task. The fact that expert tasks are being outsourced to non-expert users may lead to noisy annotations introducing disagreement between users. Despite crowd annotations being a valuable resource for learning annotation models, conventional machine-learning methods may have difficulties dealing with noisy annotations during training. In this manuscript, we present a new concept for learning from crowds that handles data aggregation directly as part of the learning process of the convolutional neural network (CNN) via an additional crowdsourcing layer (AggNet). In addition, we present an experimental study on learning from crowds designed to answer the following questions: (i) Can a deep CNN be trained with data collected from crowdsourcing? (ii) How can the CNN be adapted to train on multiple types of annotation datasets (ground truth and crowd-based)? (iii) How does the choice of annotation and aggregation affect the accuracy? Our experimental setup involved Annot8, a self-implemented web platform based on the Crowdflower API realizing image annotation tasks for a publicly available biomedical image database. Our results give valuable insights into the functionality of deep CNN learning from crowd annotations and prove the necessity of integrating data aggregation.
Article
Full-text available
Citizen science, scientific research conducted by non-specialists, has the potential to facilitate biomedical research using available large-scale data; however, validating the results is challenging. The Cell Slider is a citizen science project that intends to share images from tumors with the general public, enabling them to score tumor markers independently through an internet-based interface. From October 2012 to June 2014, 98,293 Citizen Scientists accessed the Cell Slider web page and scored 180,172 sub-images derived from images of 12,326 tissue microarray cores labeled for estrogen receptor (ER). We evaluated the accuracy of the Citizen Scientists' ER classification, and the association between ER status and prognosis, by comparing their test performance against trained pathologists. The area under the ROC curve was 0.95 (95% CI 0.94 to 0.96) for cancer cell identification and 0.97 (95% CI 0.96 to 0.97) for ER status. ER-positive tumors scored by Citizen Scientists were associated with survival in a similar way to those scored by trained pathologists. Survival probabilities at 15 years were 0.78 (95% CI 0.76 to 0.80) for ER-positive and 0.72 (95% CI 0.68 to 0.77) for ER-negative tumors based on the Citizen Scientists' classification. Based on the pathologists' classification, survival probabilities were 0.79 (95% CI 0.77 to 0.81) for ER-positive and 0.71 (95% CI 0.67 to 0.74) for ER-negative tumors. The hazard ratio for death was 0.26 (95% CI 0.18 to 0.37) at diagnosis and became greater than one after 6.5 years of follow-up for ER scored by Citizen Scientists, and 0.24 (95% CI 0.18 to 0.33) at diagnosis, increasing thereafter to one after 6.7 (95% CI 4.1 to 10.9) years of follow-up, for ER scored by pathologists. Crowdsourcing of the general public to classify cancer pathology data for research is viable, engages the public, and provides accurate ER data. Crowdsourced classification of research data may offer a valid solution to problems of throughput requiring human input.
Article
Full-text available
This study presents a detection algorithm for the "pigment network", one of the most relevant indicators in the diagnosis of melanoma, in dermoscopic images. The design of the algorithm consists of two blocks. In the first, a machine learning process is carried out, generating a set of rules which, when applied to the image, permit the construction of a mask of the candidate pixels that may form part of the pigment network. In the second block, an analysis of the structures over this mask is carried out, searching for those corresponding to the pigment network, making the diagnosis of whether a pigment network is present or not, and generating the mask corresponding to this pattern, if any. The method was tested against a database of 220 images, obtaining 86% sensitivity and 81.67% specificity, which demonstrates the reliability of the algorithm.
Article
Full-text available
The kappa statistic is frequently used to test interrater reliability. The importance of rater reliability lies in the fact that it represents the extent to which the data collected in the study are correct representations of the variables measured. Measurement of the extent to which data collectors (raters) assign the same score to the same variable is called interrater reliability. While there have been a variety of methods to measure interrater reliability, traditionally it was measured as percent agreement, calculated as the number of agreement scores divided by the total number of scores. In 1960, Jacob Cohen critiqued the use of percent agreement due to its inability to account for chance agreement. He introduced Cohen's kappa, developed to account for the possibility that raters actually guess on at least some variables due to uncertainty. Like most correlation statistics, the kappa can range from -1 to +1. While the kappa is one of the most commonly used statistics to test interrater reliability, it has limitations. Judgments about what level of kappa should be acceptable for health research are questioned. Cohen's suggested interpretation may be too lenient for health-related studies because it implies that a score as low as 0.41 might be acceptable. Kappa and percent agreement are compared, and levels for both kappa and percent agreement that should be demanded in healthcare studies are suggested.
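To make the chance correction described above concrete, here is a minimal from-scratch sketch contrasting percent agreement with Cohen's kappa, κ = (p_o − p_e)/(1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance; the ratings are invented for illustration:

```python
from collections import Counter

def percent_agreement(r1, r2):
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    n = len(r1)
    p_o = percent_agreement(r1, r2)                      # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum((c1[k] / n) * (c2[k] / n)                  # chance agreement
              for k in set(r1) | set(r2))
    return (p_o - p_e) / (1 - p_e)

# Two raters who agree 80% of the time, mostly by both saying "absent" (0).
rater_a = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
rater_b = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
print(percent_agreement(rater_a, rater_b))  # 0.80
print(cohens_kappa(rater_a, rater_b))       # about 0.375, far below 0.80
```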
Article
Full-text available
Computer-aided detection (CAD) systems have been shown to improve the diagnostic performance of CT colonography (CTC) in the detection of premalignant colorectal polyps. Despite the improvement, the overall system is not optimal. CAD annotations on true lesions are incorrectly dismissed, and false positives are misinterpreted as true polyps. Here, we conduct an observer performance study utilizing distributed human intelligence in the form of anonymous knowledge workers (KWs) to investigate human performance in classifying polyp candidates under different presentation strategies. We evaluated 600 polyp candidates from 50 patients, each case having at least one polyp ⩾6 mm, from a large database of CTC studies. Each polyp candidate was labeled independently as a true or false polyp by 20 KWs and an expert radiologist. We asked each labeler to determine whether the candidate was a true polyp after looking at a single 3D-rendered image of the candidate and after watching a video fly-around of the candidate. We found that distributed human intelligence improved significantly when presented with the additional information in the video fly-around. We noted that performance degraded with increasing interpretation time and increasing difficulty, but distributed human intelligence performed better than our CAD classifier for "easy" and "moderate" polyp candidates. Further, we observed numerous parallels between the expert radiologist and the KWs. Both showed similar improvement in classification when moving from single-image to video interpretation. Additionally, difficulty estimates obtained from the KWs using an expectation maximization algorithm correlated well with the difficulty rating assigned by the expert radiologist. Our results suggest that distributed human intelligence is a powerful tool that will aid in the development of CAD for CTC.
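Expectation maximization over crowd votes, as mentioned above, typically alternates between estimating each item's true label and each worker's reliability. The study estimated candidate difficulty; the sketch below is a simplified, hypothetical "one-coin" variant that estimates per-worker accuracy and per-candidate posteriors instead, only to show the general mechanism:

```python
import numpy as np

def em_aggregate(votes, n_iter=50):
    """votes: (n_items, n_workers) array of 0/1 labels from the crowd."""
    post = votes.mean(axis=1)  # initial posterior P(item is a true polyp)
    for _ in range(n_iter):
        # M-step: each worker's accuracy = expected fraction of correct votes.
        correct = post[:, None] * votes + (1 - post[:, None]) * (1 - votes)
        acc = correct.mean(axis=0).clip(1e-3, 1 - 1e-3)
        # E-step: update posteriors from worker accuracies (uniform prior).
        log_pos = (votes * np.log(acc) + (1 - votes) * np.log(1 - acc)).sum(axis=1)
        log_neg = ((1 - votes) * np.log(acc) + votes * np.log(1 - acc)).sum(axis=1)
        post = 1.0 / (1.0 + np.exp(log_neg - log_pos))
    return post, acc

votes = np.array([[1, 1, 0, 1],   # 3 polyp candidates rated by 4 workers
                  [0, 0, 0, 1],
                  [1, 0, 1, 1]])
posteriors, worker_accuracy = em_aggregate(votes)
print(posteriors.round(2), worker_accuracy.round(2))
```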
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Article
Full-text available
'Crowdsourcing' is a relatively recent concept that encompasses many practices. This diversity leads to the blurring of the limits of crowdsourcing that may be identified virtually with any type of internet-based collaborative activity, such as co-creation or user innovation. Varying definitions of crowdsourcing exist, and therefore some authors present certain specific examples of crowdsourcing as paradigmatic, while others present the same examples as the opposite. In this article, existing definitions of crowdsourcing are analysed to extract common elements and to establish the basic characteristics of any crowdsourcing initiative. Based on these existing definitions, an exhaustive and consistent definition for crowdsourcing is presented and contrasted in 11 cases.
Article
Full-text available
Objective: To compare the reliability of a new 7-point checklist based on simplified epiluminescence microscopy (ELM) pattern analysis with the ABCD rule of dermatoscopy and standard pattern analysis for the diagnosis of clinically doubtful melanocytic skin lesions. Design: In a blind study, ELM images of 342 histologically proven melanocytic skin lesions were evaluated for the presence of 7 standard criteria that we called the "ELM 7-point checklist." For each lesion, "overall" and "ABCD scored" diagnoses were recorded. From a training set of 57 melanomas and 139 atypical nonmelanomas, odds ratios were calculated to create a simple diagnostic model based on identification of major and minor criteria for the "7-point scored" diagnosis. A test set of 60 melanomas and 86 atypical nonmelanomas was used for model validation and was then presented to 2 less experienced ELM observers, who recorded the ABCD and 7-point scored diagnoses. Settings: University medical centers. Patients: A sample of patients with excised melanocytic lesions. Main outcome measures: Sensitivity, specificity, and accuracy of the models for diagnosing melanoma. Results: From the total combined sets, the 7-point checklist gave a sensitivity of 95% and a specificity of 75% compared with 85% sensitivity and 66% specificity using the ABCD rule and 91% sensitivity and 90% specificity using standard pattern analysis (overall ELM diagnosis). Compared with the ABCD rule, the 7-point method allowed less experienced observers to obtain higher diagnostic accuracy values. Conclusions: The ELM 7-point checklist provides a simplification of standard pattern analysis because of the low number of features to identify and the scoring diagnostic system. As with the ABCD rule, it can be easily learned and easily applied and has proven to be reliable in diagnosing melanoma.
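As an illustration of the scoring logic this checklist uses, here is a minimal sketch based on the commonly cited weighting of the published 7-point checklist (major criteria score 2 points, minor criteria 1 point, and a total of 3 or more raises suspicion for melanoma); the criterion names and threshold come from that published scheme rather than from this abstract, so treat them as an assumption for illustration:

```python
# Hypothetical, simplified scoring of the ELM 7-point checklist.
MAJOR = {"atypical pigment network", "blue-whitish veil", "atypical vascular pattern"}
MINOR = {"irregular streaks", "irregular pigmentation",
         "irregular dots/globules", "regression structures"}

def seven_point_score(found_criteria):
    """Return (total score, melanoma suspected?) for a set of observed criteria."""
    score = sum(2 for c in found_criteria if c in MAJOR)
    score += sum(1 for c in found_criteria if c in MINOR)
    return score, score >= 3

print(seven_point_score({"atypical pigment network", "regression structures"}))
# -> (3, True): one major plus one minor criterion reaches the threshold
```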
Article
Full-text available
To describe the different vascular structures seen by dermoscopy and to evaluate their association with various melanocytic and nonmelanocytic skin tumors in a large series of cases. Digital dermoscopic images of the lesions were evaluated for the presence of various morphologic types of vessels. Specialized university clinic. From a larger database, 531 excised lesions (from 517 patients) dermoscopically showing any type of vascular structures were included. The frequency and positive predictive value of the different vascular structures seen in various tumors were calculated, and the differences were evaluated by the χ² or Fisher exact test. Arborizing vessels were seen in 82.1% of basal cell carcinomas, with a 94.1% positive predictive value (P<.001). Dotted vessels were generally predictive for a melanocytic lesion (90.0%, P<.001), and were especially seen in Spitz nevi (77.8% of lesions). In melanoma, linear-irregular, dotted, and polymorphous/atypical vessels were the most frequent vascular structures, whereas milky-red globules/areas were the most predictive ones (77.8%, P = .003). The presence of erythema was most predictive for Clark nevus, whereas comma, glomerular, crown, and hairpin vessels were significantly associated with dermal/congenital nevi, Bowen disease, sebaceous hyperplasia, and seborrheic keratosis, respectively (P<.001 for all). Different morphologic types of vessels are associated with different melanocytic or nonmelanocytic skin tumors. Therefore, the recognition of distinctive vascular structures may be helpful for diagnostic purposes, especially when the classic pigmented dermoscopic structures are lacking.
Article
Purpose: Blood vessels called telangiectasia are visible in skin lesions with the aid of dermoscopy. Telangiectasia are a pivotal identifying feature of basal cell carcinoma. These vessels appear thready, serpiginous, and may also appear arborizing, that is, wide vessels branch into successively thinner vessels. Due to these intricacies, their detection is not an easy task, neither with manual annotation nor with computerized techniques. In this study, we automate the segmentation of telangiectasia in dermoscopic images with a deep learning U-Net approach. Methods: We apply a combination of image processing techniques and a deep learning-based U-Net approach to detect telangiectasia in digital basal cell carcinoma skin cancer images. We compare loss functions and optimize the performance by using a combination loss function to manage class imbalance of skin versus vessel pixels. Results: We establish a baseline method for pixel-based telangiectasia detection in skin cancer lesion images. An analysis and comparison for human observer variability in annotation is also presented. Conclusion: Our approach yields Jaccard score within the variation of human observers as it addresses a new aspect of the rapidly evolving field of deep learning: automatic identification of cancer-specific structures. Further application of DL techniques to detect dermoscopic structures and handle noisy labels is warranted.
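The combination loss mentioned above for handling the skin-versus-vessel class imbalance is often built from binary cross-entropy plus a soft Dice term. A hedged PyTorch sketch of that general pattern (not the authors' implementation) applied to U-Net logits:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, bce_weight=0.5, eps=1e-7):
    """logits, target: (N, 1, H, W) tensors; target holds 0/1 vessel labels."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice_loss = 1 - ((2 * intersection + eps) / (union + eps)).mean()
    return bce_weight * bce + (1 - bce_weight) * dice_loss

logits = torch.randn(2, 1, 64, 64)                  # toy U-Net outputs
target = (torch.rand(2, 1, 64, 64) > 0.9).float()   # sparse vessel pixels
print(float(bce_dice_loss(logits, target)))
```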
Article
The benefits of explainable artificial intelligence are not what they appear
Article
Background Software systems using artificial intelligence for medical purposes have been developed in recent years. The success of deep neural networks (DNN) in 2012 in the image recognition challenge ImageNet LSVRC 2010 fuelled expectations of the potential for using such systems in dermatology. Objective To evaluate the ways in which machine learning has been utilised in dermatology to date and provide an overview of the findings in current literature on the subject. Methods We conducted a systematic review of existing literature, identifying the literature through a systematic search of the PubMed database. Two doctors assessed screening and eligibility with respect to pre-determined inclusion and exclusion criteria. Results 2,175 publications were identified, and 64 publications were included. We identified eight major categories where machine learning tools were tested in dermatology. Most systems involved image recognition tools that were primarily aimed at binary classification of malignant melanoma (MM). Short system descriptions and results of all included systems are presented in tables. Conclusion We present a complete overview of artificial intelligence implemented in dermatology. Impressive outcomes were reported in all of the identified eight categories, but head-to-head comparison proved difficult. The many areas of dermatology where we identified machine learning tools indicate the diversity of machine learning.
Article
Artificial intelligence (AI) is quickly making inroads into medical practice, especially in forms that rely on machine learning, with a mix of hope and hype.¹ Multiple AI-based products have now been approved or cleared by the US Food and Drug Administration (FDA), and health systems and hospitals are increasingly deploying AI-based systems.² For example, medical AI can support clinical decisions, such as recommending drugs or dosages or interpreting radiological images.² One key difference from most traditional clinical decision support software is that some medical AI may communicate results or recommendations to the care team without being able to communicate the underlying reasons for those results.
Article
In the past decade, machine learning and artificial intelligence have made significant advancements in pattern analysis, including speech and natural language processing, image recognition, object detection, facial recognition, and action categorization. Indeed, in many of these applications, accuracy has reached or exceeded human levels of performance. Subsequently, a multitude of studies have begun to examine the application of these technologies to health care, and in particular, medical image analysis. Perhaps the most difficult subdomain involves skin imaging because of the lack of standards around imaging hardware, technique, color, and lighting conditions. In addition, unlike radiological images, skin image appearance can be significantly affected by skin tone as well as the broad range of diseases. Furthermore, automated algorithm development relies on large high-quality annotated image data sets that incorporate the breadth of this circumstantial and diagnostic variety. These issues, in combination with unique complexities regarding integrating artificial intelligence systems into a clinical workflow, have led to difficulty in using these systems to improve sensitivity and specificity of skin diagnostics in health care networks around the world. In this article, we summarize recent advancements in machine learning, with a focused perspective on the role of public challenges and data sets on the progression of these technologies in skin imaging. In addition, we highlight the remaining hurdles toward effective implementation of technologies to the clinical workflow and discuss how public challenges and data sets can catalyze the development of solutions.
Article
Background Estimating the extent of affected skin is an important unmet clinical need both for research and practical management in many diseases. In particular, cutaneous burden of chronic graft‐vs‐host disease (cGVHD) is a primary outcome in many trials. Despite advances in artificial intelligence and 3D photography, progress toward reliable automated techniques is hindered by limited expert time to delineate cGVHD patient images. Crowdsourcing may have potential to provide the requisite expert‐level data. Materials and methods Forty‐one three‐dimensional photographs of three cutaneous cGVHD patients were delineated by a board‐certified dermatologist. 410 two‐dimensional projections of the raw photos were each annotated by seven crowd workers, whose consensus performance was compared to the expert. Results The consensus delineation by four of seven crowd workers achieved the highest agreement with the expert, measured by a median Dice index of 0.7551 across all 410 images, outperforming even the best worker from the crowd (Dice index 0.7216). For their internal agreement, crowd workers achieved a median Fleiss's kappa of 0.4140 across the images. The time a worker spent marking an image had only weak correlation with the surface area marked, and very low correlation with accuracy. Percent of pixels selected by the consensus exhibited good correlation (Pearson R = 0.81) with the patient's affected surface area. Conclusion Crowdsourcing may be an efficient method for obtaining demarcations of affected skin, on par with expert performance. Crowdsourced data generally agreed with the current clinical standard of percent body surface area to assess cGVHD severity in the skin.
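The consensus rule reported above (a pixel counts as crowd-delineated when at least 4 of 7 workers marked it, then compared to the expert with a Dice index) can be sketched in a few lines of numpy; the masks below are random stand-ins, not patient data:

```python
import numpy as np

def consensus_mask(worker_masks, min_votes=4):
    """worker_masks: (n_workers, H, W) binary array of crowd delineations."""
    return worker_masks.sum(axis=0) >= min_votes

def dice(a, b, eps=1e-7):
    a, b = a.astype(bool), b.astype(bool)
    return (2 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

rng = np.random.default_rng(1)
expert = rng.random((128, 128)) > 0.6                 # expert delineation (toy)
workers = np.stack([expert ^ (rng.random(expert.shape) > 0.9) for _ in range(7)])
consensus = consensus_mask(workers, min_votes=4)
print(f"Dice of 4-of-7 consensus vs expert: {dice(consensus, expert):.3f}")
```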
Article
Dermoscopy is a non-invasive skin imaging technique that permits visualization of features of pigmented melanocytic neoplasms that are not discernable by examination with the naked eye. While studies on the automated analysis of dermoscopy images date back to the late 1990s, because of various factors (lack of publicly available datasets, open-source software, computational power, etc.), the field progressed rather slowly in its first two decades. With the release of a large public dataset by the International Skin Imaging Collaboration (ISIC) in 2016, development of open-source software for convolutional neural networks, and the availability of inexpensive graphics processing units, dermoscopy image analysis has recently become a very active research field. In this paper, we present a brief overview of this exciting subfield of medical image analysis, primarily focusing on three aspects of it, namely segmentation, feature extraction, and classification. We then provide future directions for researchers.
Article
Accurate segmentations in medical images are the foundations for various clinical applications. Advances in machine learning-based techniques show great potential for automatic image segmentation, but these techniques usually require a huge amount of accurately annotated reference segmentations for training. The guiding hypothesis of this paper was that crowd-algorithm collaboration could evolve as a key technique in large-scale medical data annotation. As an initial step toward this goal, we evaluated the performance of untrained individuals to detect and correct errors made by three-dimensional (3-D) medical segmentation algorithms. To this end, we developed a multistage segmentation pipeline incorporating a hybrid crowd-algorithm 3-D segmentation algorithm integrated into a medical imaging platform. In a pilot study of liver segmentation using a publicly available dataset of computed tomography scans, we show that the crowd is able to detect and refine inaccurate organ contours with a quality similar to that of experts (engineers with domain knowledge, medical students, and radiologists). Although the crowds need significantly more time for the annotation of a slice, the annotation rate is extremely high. This could render crowdsourcing a key tool for cost-effective large-scale medical image annotation. © 2018 Society of Photo-Optical Instrumentation Engineers (SPIE).
Article
Artificial intelligence and its machine learning (ML) capabilities are very promising technologies for dermatology and other visually oriented fields due to their power in pattern recognition. Understandably, many physicians distrust replacing clinical finesse with unsupervised computer programs. Here we describe convolutional neural networks and discuss how this method of ML will impact the field of dermatology. ML is a form of artificial intelligence well suited for pattern recognition in visual applications. Many dermatologists are wary of such unsupervised algorithms and their future implications. Herein we discuss these fears.
Article
Background: Inadequate dermoscopy training represents a major barrier to proper dermoscopy use. Objective: To better understand the status of dermoscopy training in US residency programs. Methods: A survey was sent to 417 dermatology residents and 118 program directors of dermatology residency programs. Results: Comparing different training times for the same training type, residents with 1–10 hours of dedicated training had confidence using dermoscopy in general (p = 1.000) and satisfaction with training (p = .3224) similar to residents with >10 hours of dedicated training. Comparing similar training times for different training types, residents with 1–10 hours of dedicated training had significantly increased confidence using dermoscopy in general (p = .0105) and satisfaction with training (p = .0066) compared with residents with 1–10 hours of only bedside training. Lastly, residents with 1–10 hours of dedicated training and >10 hours of dedicated training had significantly increased confidence using dermoscopy in general (p = .0002, p = .2471) and satisfaction with training (p < .0001, p < .0001) compared with residents with no dermoscopy training at all. Conclusions: Dermoscopy training in residency should include formal dermoscopy training that is overseen by the program director and is possibly supplemented by outside dermoscopy training.
Article
Annotating unstructured texts in Electronic Health Records data is usually a necessary step for conducting machine learning research on such datasets. Manual annotation by domain experts provides data of the best quality, but has become increasingly impractical given the rapid increase in the volume of EHR data. In this article, we examine the effectiveness of crowdsourcing with unscreened online workers as an alternative for transforming unstructured texts in EHRs into annotated data that are directly usable in supervised learning models. We find the crowdsourced annotation data to be just as effective as expert data in training a sentence classification model to detect the mentioning of abnormal ear anatomy in radiology reports of audiology. Furthermore, we have discovered that enabling workers to self-report a confidence level associated with each annotation can help researchers pinpoint less-accurate annotations requiring expert scrutiny. Our findings suggest that even crowd workers without specific domain knowledge can contribute effectively to the task of annotating unstructured EHR datasets.
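As a hedged sketch of the kind of pipeline described above (not the study's actual setup; the field names, threshold, and model choice are assumptions), crowd-annotated sentences can be filtered by self-reported confidence and used to train a simple sentence classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical crowd annotations of radiology-report sentences:
# label 1 = mentions abnormal ear anatomy, 0 = does not.
annotations = [
    {"text": "The right ear canal appears atretic.", "label": 1, "confidence": 0.9},
    {"text": "Middle ear structures are unremarkable.", "label": 0, "confidence": 0.8},
    {"text": "There is a malformation of the ossicles.", "label": 1, "confidence": 0.85},
    {"text": "No abnormality of the ear is identified.", "label": 0, "confidence": 0.6},
]

# Keep only annotations whose self-reported confidence clears a threshold;
# low-confidence items could instead be routed to an expert for review.
kept = [a for a in annotations if a["confidence"] >= 0.75]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit([a["text"] for a in kept], [a["label"] for a in kept])
print(clf.predict(["The ear anatomy is abnormal with canal atresia."]))
```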
Article
Skin cancer, the most common human malignancy, is primarily diagnosed visually, beginning with an initial clinical screening and followed potentially by dermoscopic analysis, a biopsy and histopathological examination. Automated classification of skin lesions using images is a challenging task owing to the fine-grained variability in the appearance of skin lesions. Deep convolutional neural networks (CNNs) show potential for general and highly variable tasks across many fine-grained object categories. Here we demonstrate classification of skin lesions using a single CNN, trained end-to-end from images directly, using only pixels and disease labels as inputs. We train a CNN using a dataset of 129,450 clinical images-two orders of magnitude larger than previous datasets-consisting of 2,032 different diseases. We test its performance against 21 board-certified dermatologists on biopsy-proven clinical images with two critical binary classification use cases: keratinocyte carcinomas versus benign seborrheic keratoses; and malignant melanomas versus benign nevi. The first case represents the identification of the most common cancers, the second represents the identification of the deadliest skin cancer. The CNN achieves performance on par with all tested experts across both tasks, demonstrating an artificial intelligence capable of classifying skin cancer with a level of competence comparable to dermatologists. Outfitted with deep neural networks, mobile devices can potentially extend the reach of dermatologists outside of the clinic. It is projected that 6.3 billion smartphone subscriptions will exist by the year 2021 (ref. 13) and can therefore potentially provide low-cost universal access to vital diagnostic care.
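The approach described above (a convolutional network trained end-to-end on images and disease labels) is commonly reproduced today by fine-tuning an ImageNet-pretrained backbone. A generic, hedged transfer-learning sketch, not the paper's actual architecture or training code:

```python
import torch
import torch.nn as nn
from torchvision import models

# Replace the classifier head of a pretrained network for a binary task,
# e.g. malignant melanoma versus benign nevus.
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One toy training step on random tensors standing in for a batch of images.
images = torch.randn(4, 3, 224, 224)
labels = torch.tensor([0, 1, 0, 1])
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```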
Conference Paper
We developed an easy-to-use and widely accessible crowdsourcing tool for rapidly training humans to perform biomedical image diagnostic tasks, and demonstrated the platform by having middle and high school students in South Korea diagnose malaria-infected red blood cells (RBCs) in Giemsa-stained thin blood smears imaged under light microscopes. We previously used the same platform (i.e., BioGames) to crowdsource diagnoses of individual RBC images, marking them as malaria positive (infected), negative (uninfected), or questionable (insufficient information for a reliable diagnosis). Using a custom-developed statistical framework, we combined the diagnoses from both expert diagnosticians and the minimally trained human crowd to generate a gold standard library of malaria-infection labels for RBCs. Using this library of labels, we developed a web-based training and educational toolset that provides a quantified score for diagnosticians/users to compare their performance against their peers and to review misdiagnosed cells. We have since demonstrated the ability of this platform to quickly train humans without prior training to reach high diagnostic accuracy compared with expert diagnosticians. Our initial trial group of 55 middle and high school students collectively played for more than 170 hours, with each student demonstrating significant improvement after only 3 hours of training games and diagnostic scores matching those of expert diagnosticians. Next, through a national-scale educational outreach program in South Korea, we recruited >1660 students who demonstrated a similar performance level after 5 hours of training. We plan to further demonstrate this tool's effectiveness for other diagnostic tasks involving image labeling and aim to provide an easily accessible and quickly adaptable framework for online training of new diagnosticians.
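The custom statistical framework used to fuse expert and crowd diagnoses is not detailed in this abstract; the sketch below shows one simple, hypothetical way such a fusion could work, using a weighted majority vote in which expert votes count more than crowd votes.

```python
# Illustrative sketch only: the paper's statistical framework for fusing expert
# and crowd diagnoses is not reproduced here. This shows a simple weighted
# majority vote in which expert votes carry more weight than crowd votes.
from collections import Counter

def fuse_labels(expert_votes, crowd_votes, expert_weight=3.0):
    """Return the label with the highest weighted vote for one RBC image.

    expert_votes / crowd_votes: lists of labels such as
    "positive", "negative", or "questionable".
    """
    tally = Counter()
    for label in expert_votes:
        tally[label] += expert_weight
    for label in crowd_votes:
        tally[label] += 1.0
    return tally.most_common(1)[0][0]

# Example: two experts say positive, most crowd players disagree.
print(fuse_labels(["positive", "positive"],
                  ["negative", "negative", "positive", "negative", "negative"]))
```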
Article
Objective: To compare the reliability of a new 7-point checklist based on simplified epiluminescence microscopy (ELM) pattern analysis with the ABCD rule of dermatoscopy and standard pattern analysis for the diagnosis of clinically doubtful melanocytic skin lesions. Design: In a blind study, ELM images of 342 histologically proven melanocytic skin lesions were evaluated for the presence of 7 standard criteria that we called the "ELM 7-point checklist." For each lesion, "overall" and "ABCD scored" diagnoses were recorded. From a training set of 57 melanomas and 139 atypical non-melanomas, odds ratios were calculated to create a simple diagnostic model based on identification of major and minor criteria for the "7-point scored" diagnosis. A test set of 60 melanomas and 86 atypical non-melanomas was used for model validation and was then presented to 2 less experienced ELM observers, who recorded the ABCD and 7-point scored diagnoses. Settings: University medical centers. Patients: A sample of patients with excised melanocytic lesions. Main Outcome Measures: Sensitivity, specificity, and accuracy of the models for diagnosing melanoma. Results: From the total combined sets, the 7-point checklist gave a sensitivity of 95% and a specificity of 75% compared with 85% sensitivity and 66% specificity using the ABCD rule and 91% sensitivity and 90% specificity using standard pattern analysis (overall ELM diagnosis). Compared with the ABCD rule, the 7-point method allowed less experienced observers to obtain higher diagnostic accuracy values. Conclusions: The ELM 7-point checklist provides a simplification of standard pattern analysis because of the low number of features to identify and the scoring diagnostic system. As with the ABCD rule, it can be easily learned and easily applied and has proven to be reliable in diagnosing melanoma.
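For readers unfamiliar with how the resulting checklist is applied, the snippet below scores a lesion using the commonly cited form of the 7-point checklist (major criteria worth 2 points, minor criteria worth 1 point, a total of 3 or more suggesting melanoma); the exact odds-ratio-derived weights from the study are not reproduced here.

```python
# Sketch of the classic 7-point checklist scoring as it is commonly described
# (major criteria = 2 points, minor criteria = 1 point, total >= 3 suggestive
# of melanoma); the study's odds-ratio-derived weights are not reproduced.
MAJOR = {"atypical pigment network", "blue-whitish veil", "atypical vascular pattern"}
MINOR = {"irregular streaks", "irregular pigmentation",
         "irregular dots/globules", "regression structures"}

def seven_point_score(observed_features):
    score = sum(2 for f in observed_features if f in MAJOR)
    score += sum(1 for f in observed_features if f in MINOR)
    return score, score >= 3  # (score, suspicious for melanoma)

print(seven_point_score({"blue-whitish veil", "irregular streaks"}))  # (3, True)
```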
Article
Objectives: To create a simple diagnostic method for invasive melanoma with in vivo cutaneous surface microscopy (epiluminescence microscopy, dermoscopy, dermatoscopy) and to analyze the incidence and characteristics of those invasive melanomas that had no diagnostic features by means of hand-held surface microscopes. Design: Pigmented skin lesions were photographed in vivo with the use of immersion oil. All were excised and reviewed for histological diagnosis. A training set of 62 invasive melanomas and 159 atypical nonmelanomas and a test set of 45 invasive melanomas and 119 atypical nonmelanomas were used. Images from the training set were scored for 72 surface microscopic features. Those features with a low sensitivity (0%) and high specificity (>85%) were used to create a simple diagnostic model for invasive melanoma. Setting: All patients were recruited from the Sydney (Australia) Melanoma Unit (a primary case and referral center). Patients: A random sample of patients whose lesions were excised, selected from a larger database. Main Outcome Measures: Sensitivity and specificity of the model for diagnosis of invasive melanoma. Results: The model gave a sensitivity of 92% (98/107) and specificity of 71%. Of the 9 "featureless" melanomas the model failed to detect, 6 were pigmented and thin and had a pigment network. The other 3 were thicker, hypomelanotic lesions lacking a pigment network, some with prominent telangiectases, and all with only small areas of pigment. All featureless melanomas noted by the patients had a history of change in color, shape, or size. Conclusions: Surface microscopy does not allow 100% sensitivity in diagnosing invasive melanoma and therefore cannot be used as the sole indicator for excision. Clinical history is an important consideration when featureless lesions are diagnosed. Arch Dermatol. 1996;132:1178-1182.
Article
"Degree of certainty" refers to the subjective belief, prior to feedback, that a decision is correct. A reliable estimate of certainty is essential for prediction, learning from mistakes, and planning subsequent actions when outcomes are not immediate. It is generally thought that certainty is informed by a neural representation of evidence at the time of a decision. Here we show that certainty is also informed by the time taken to form the decision. Human subjects reported simultaneously their choice and confidence about the direction of a noisy display of moving dots. Certainty was inversely correlated with reaction times and directly correlated with motion strength. Moreover, these correlations were preserved even for error responses, a finding that contradicts existing explanations of certainty based on signal detection theory. We also contrived a stimulus manipulation that led to longer decision times without affecting choice accuracy, thus demonstrating that deliberation time itself informs the estimate of certainty. We suggest that elapsed decision time informs certainty because it serves as a proxy for task difficulty. Copyright © 2014 Elsevier Inc. All rights reserved.
Article
Skin self-examination (SSE) is one method for identifying atypical nevi among members of the general public. Unfortunately, past research has shown that SSE has low sensitivity in detecting atypical nevi. The current study investigates whether crowdsourcing (collective effort) can improve SSE identification accuracy. Collective effort is potentially useful for improving people's visual identification of atypical nevi during SSE because, even when a single person has low reliability at a task, the pattern of the group can overcome the limitations of each individual. Adults (N=500) were recruited from a shopping mall in the Midwest. Participants viewed educational pamphlets about SSE and then completed a mole identification task. For the task, participants were asked to circle mole images that appeared atypical. Forty nevi images were provided; nine of the images were of nevi that were later diagnosed as melanoma. Consistent with past research, individual effort exhibited modest sensitivity (.58) for identifying atypical nevi in the mole identification task. As predicted, collective effort overcame the limitations of individual effort. Specifically, a 19% collective effort identification threshold exhibited superior sensitivity (.90). The results of the current study suggest that limitations of SSE can be countered by collective effort, a finding that supports the pursuit of interventions promoting early melanoma detection that contain crowdsourced visual identification components.
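To make the thresholding idea concrete, the following sketch flags a nevus as atypical when at least 19% of raters circled it and then computes sensitivity against the histologic ground truth; the ratings and truth labels are invented for illustration.

```python
# Illustrative sketch of the 19% collective-effort threshold: flag a nevus as
# atypical when at least 19% of raters circled it, then compute sensitivity
# against ground truth. Ratings and truth labels below are made up.
import numpy as np

ratings = np.array([          # rows = lesions, columns = raters (1 = circled)
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
])
truth = np.array([1, 1, 1, 0])   # 1 = later diagnosed as melanoma

crowd_flag = ratings.mean(axis=1) >= 0.19
sensitivity = (crowd_flag & (truth == 1)).sum() / (truth == 1).sum()
print(crowd_flag, sensitivity)
```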
Article
Dermoscopy is a non-invasive skin imaging technique, which permits visualization of features of pigmented melanocytic neoplasms that are not discernable by examination with the naked eye. One of the most important features for the diagnosis of melanoma in dermoscopy images is the blue-white veil (irregular, structureless areas of confluent blue pigmentation with an overlying white "ground-glass" film). In this article, we present a machine learning approach to the detection of blue-white veil and related structures in dermoscopy images. The method involves contextual pixel classification using a decision tree classifier. The percentage of blue-white areas detected in a lesion combined with a simple shape descriptor yielded a sensitivity of 69.35% and a specificity of 89.97% on a set of 545 dermoscopy images. The sensitivity rises to 78.20% for detection of blue veil in those cases where it is a primary feature for melanoma recognition.
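The snippet below sketches contextual pixel classification with a decision tree in the spirit of this approach; the colour features, training pixels, and the use of the veil-pixel fraction as a lesion-level feature are illustrative assumptions, not the paper's actual feature set.

```python
# Sketch of pixel classification with a decision tree, in the spirit of the
# approach described above; features and training data are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: one row per pixel, columns are simple colour
# features (e.g., R, G, B and a local blue-chromaticity ratio).
X_train = np.array([
    [ 60,  80, 160, 0.53],   # bluish, veil-like pixel
    [150, 120,  90, 0.25],   # brown network pixel
    [ 70,  90, 170, 0.52],
    [200, 180, 150, 0.28],
])
y_train = np.array([1, 0, 1, 0])          # 1 = blue-white veil pixel

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Classify every pixel of a (toy) lesion given as an N x 4 feature array, then
# use the fraction of veil pixels as a lesion-level feature.
lesion_pixels = np.array([[65, 85, 165, 0.52], [180, 150, 120, 0.27]])
veil_fraction = clf.predict(lesion_pixels).mean()
print(veil_fraction)
```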
Article
This paper presents a general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies. The procedure essentially involves the construction of functions of the observed proportions which are directed at the extent to which the observers agree among themselves and the construction of test statistics for hypotheses involving these functions. Tests for interobserver bias are presented in terms of first-order marginal homogeneity and measures of interobserver agreement are developed as generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis example from the epidemiological literature.
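The generalized kappa-type statistics developed here build on the familiar two-rater Cohen kappa, κ = (p_o − p_e)/(1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance; a minimal computation is sketched below with invented ratings.

```python
# Minimal illustration of the two-rater Cohen kappa that kappa-type agreement
# statistics generalize: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed
# agreement and p_e is chance-expected agreement. Ratings are made up.
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Example: two raters judging presence (1) or absence (0) of a feature.
print(cohen_kappa([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1]))  # 0.33
```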
Article
The difficulties in accurately assessing pigmented skin lesions are ever present in practice. The recently described ABCD rule of dermatoscopy (skin surface microscopy at ×10 magnification), based on the criteria asymmetry (A), border (B), color (C), and differential structure (D), improved diagnostic accuracy when applied retrospectively to clinical slides. A study was designed to evaluate the prospective value of the ABCD rule of dermatoscopy in melanocytic lesions. In 172 melanocytic pigmented skin lesions, the criteria of the ABCD rule of dermatoscopy were analyzed with a semiquantitative scoring system before excision. According to the retrospectively determined threshold, tumors with a score higher than 5.45 (64/69 melanomas [92.8%]) were classified as malignant, whereas lesions with a lower score were considered benign (93/103 melanocytic nevi [90.3%]). The negative predictive value for melanoma (TN/[TN+FN]) was 95.8%, whereas the positive predictive value (TP/[TP+FP]) was 85.3%. Diagnostic accuracy for melanoma (TP/[TP+FP+FN]) was 80.0%, compared with 64.4% by the naked eye. Melanoma showed a mean final dermatoscopy score of 6.79 (SD ±0.92), significantly differing from melanocytic nevi (mean score 4.27 ±0.99; p < 0.01, U test). The ABCD rule can be easily learned and rapidly calculated, and has proven to be reliable. It should be routinely applied to all equivocal pigmented skin lesions to reach a more objective and reproducible diagnosis and to obtain this assessment preoperatively.
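For reference, the total dermoscopy score (TDS) behind the 5.45 threshold is typically computed with the weights below (asymmetry ×1.3, border ×0.1, colors ×0.5, differential structures ×0.5); the snippet is a sketch using these commonly cited weights and an invented example lesion.

```python
# Sketch of the ABCD total dermoscopy score (TDS) with the weights commonly
# cited for this rule; a TDS above 5.45 is classified as malignant, matching
# the threshold used in the study above. The example lesion is invented.
def total_dermoscopy_score(asymmetry, border, colors, structures):
    """asymmetry 0-2, border 0-8, colors 1-6, differential structures 1-5."""
    return 1.3 * asymmetry + 0.1 * border + 0.5 * colors + 0.5 * structures

tds = total_dermoscopy_score(asymmetry=2, border=5, colors=5, structures=4)
print(tds, "malignant" if tds > 5.45 else "benign/equivocal")   # 7.6 malignant
```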
Article
Dermoscopy (dermatoscopy, epiluminescence microscopy) is an additional measure for making the diagnosis of pigmented skin lesions more accurate. It enables the clinician to visualize features not discernible by the naked eye. By applying enhanced digital dermoscopy and a standardized gross pathology protocol to pigmented skin lesions, a precise clinicopathological correlation of relevant dermoscopic features can be made. Histological specimens of four pigmented skin lesions (melanoma in situ, Clark's nevus, Reed's nevus, seborrheic keratosis) were processed using a standardized gross pathology protocol and viewed along with the clinical photographs and digital dermoscopic images that were magnified and enhanced to better visualize the corresponding dermoscopic structures. Furthermore, measurements of dermoscopic structures using digital equipment were correlated with histometric findings. Our understanding of dermoscopic features, especially the broadened pigment network - a specific dermoscopic criterion for melanoma - was refined by this detailed case-by-case correlation. In addition, some not yet fully characterized dermoscopic features, such as black lamella, radial streaks, and exophytic papillary structures, were described in detail dermoscopically and histopathologically. Moreover, measurements of these dermoscopic features and the underlying histological structures were found to be similar. Linking dermoscopy more closely with cutaneous pathology may help refine the definitions and diagnostic criteria of pigmented skin lesions for dermatologists as well as dermatopathologists.
Article
There is a need for better standardization of the dermoscopic terminology in assessing pigmented skin lesions. The virtual Consensus Net Meeting on Dermoscopy was organized to investigate reproducibility and validity of the various features and diagnostic algorithms. Dermoscopic images of 108 lesions were evaluated via the Internet by 40 experienced dermoscopists using a 2-step diagnostic procedure. The first-step algorithm distinguished melanocytic versus nonmelanocytic lesions. The second step in the diagnostic procedure used 4 algorithms (pattern analysis, ABCD rule, Menzies method, and 7-point checklist) to distinguish melanoma versus benign melanocytic lesions. Kappa values, log odds ratios, sensitivity, specificity, and positive likelihood ratios were estimated for all diagnostic algorithms and dermoscopic features. Interobserver agreement was fair to good for all diagnostic methods, but it was poor for the majority of dermoscopic criteria. Intraobserver agreement was good to excellent for all algorithms and features considered. Pattern analysis allowed the best diagnostic performance (positive likelihood ratio: 5.1), whereas alternative algorithms revealed comparable sensitivity but less specificity. Interobserver agreement on management decisions made by dermoscopy was fairly good (mean kappa value: 0.53). The virtual Consensus Net Meeting on Dermoscopy represents a valid tool for better standardization of the dermoscopic terminology and, moreover, opens up a new territory for diagnosing and managing pigmented skin lesions.
Article
Simplified algorithms for dermoscopy in melanoma diagnosis were developed in order to facilitate the use of this technique by non-experts. However, little is known about their reliability compared with classic pattern analysis when taught to untrained observers. We therefore investigated the diagnostic performance of three different methods, i.e. classic pattern analysis and two of the most widely used algorithms (the ABCD rule of dermoscopy and the seven-point check-list), when used by newly trained residents in dermatology to diagnose melanocytic lesions. Five residents in dermatology (University of Florence Medical School) undertook a teaching programme in dermoscopy based on both formal lessons and training and self-assessment using a newly developed, interactive CD-ROM on dermoscopy. The performance of the three diagnostic methods was analysed in a series of 200 clinically equivocal melanocytic lesions including 44 early melanomas (median thickness 0.30 mm; 25th-75th percentile 0.00-0.58 mm). Pattern analysis yielded the best mean diagnostic accuracy (68.7%), followed by the ABCD rule (56.1%) and the seven-point check-list (53.4%, P = 0.06). The best sensitivity was associated with the use of the seven-point check-list (91.9%), which, however, provided the worst specificity (35.2%) of the methods tested. The interobserver reproducibility, as shown by kappa statistics, was low for all the methods (range 0.27-0.33) and did not show any statistical difference among them. Pattern analysis, i.e. simultaneous assessment of the diagnostic value of all dermoscopy features shown by the lesion, proved to be the most reliable procedure for melanoma diagnosis to be taught to residents in dermatology.
Article
Dermatoscopy, also known as dermoscopy or epiluminescence microscopy (ELM), is a non-invasive, in vivo technique, which permits visualization of features of pigmented melanocytic neoplasms that are not discernable by examination with the naked eye. ELM offers a completely new range of visual features. One such prominent feature is the pigment network. Two texture-based algorithms are developed for the detection of pigment network. These methods are applicable to various texture patterns in dermatoscopy images, including patterns that lack fine lines such as cobblestone, follicular, or thickened network patterns. Two texture algorithms, Laws energy masks and the neighborhood gray-level dependence matrix (NGLDM) large number emphasis, were optimized on a set of 155 dermatoscopy images and compared. Results suggest superiority of Laws energy masks for pigment network detection in dermatoscopy images. For both methods, a texel width of 10 pixels or approximately 0.22 mm is found for dermatoscopy images.
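As an illustration of the Laws texture-energy approach named above, the sketch below builds a 5×5 mask from two classic 1D Laws vectors, filters a stand-in image, and averages the absolute response over a local window; the particular mask, window size, and random image are assumptions, not the optimized settings from the paper.

```python
# Sketch of a Laws texture-energy computation: build a 2D mask from two of the
# classic 1D Laws vectors, filter the image, and take a local absolute-energy
# average. Mask choice, window size, and the random image are illustrative.
import numpy as np
from scipy.ndimage import convolve, uniform_filter

L5 = np.array([1.0, 4.0, 6.0, 4.0, 1.0])    # level
E5 = np.array([-1.0, -2.0, 0.0, 2.0, 1.0])  # edge

mask_LE = np.outer(L5, E5)                  # 5x5 Laws mask (level x edge)

image = np.random.rand(64, 64)              # stand-in for a grayscale lesion image
response = convolve(image, mask_LE, mode="reflect")
energy = uniform_filter(np.abs(response), size=15)   # local texture-energy map
print(energy.shape, float(energy.mean()))
```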