Chapter

A Standardised Approach for Preparing Imaging Data for Machine Learning Tasks in Radiology: Opportunities, Applications and Risks

Authors: Harvey and Glocker

Abstract

Medical imaging data is now extremely abundant due to over two decades of digitisation of imaging protocols and data storage formats. However, clean, well-curated data, that is amenable to machine learning, is relatively scarce, and AI developers are paradoxically data starved. Imaging and clinical data is also heterogeneous, often unstructured and unlabelled, whereas current supervised and semi-supervised machine learning techniques rely on homogeneous and carefully annotated data. While imaging biobanks contain small volumes of well-curated data, it is the leveraging of ‘big data’ from the front-line of healthcare that is the focus of many machine learning developers hoping to train and validate computer vision algorithms. The quest for sufficiently large volumes of clean data that can be used for training, validation and testing involves several hurdles, namely ethics and consent, security, the assessment of data quality, ground truth data labelling, bias reduction, reusability and generalisability. In this chapter we propose a new medical imaging data readiness (MIDaR) scale. The MIDaR scale is designed to objectively clarify data quality for both researchers seeking imaging data and clinical providers aiming to share their data. It is hoped that the MIDaR scale will be used globally during collaborative academic and business conversations, so that everyone can more easily understand and quickly appraise the relevant stages of data readiness for machine learning in relation to their AI development projects. We believe that the MIDaR scale could become essential in the design, planning and management of AI medical imaging projects, and significantly increase chances of success.


... Development of any algorithm requires robust training data, in both quantity and quality. The performance of ML models improves logarithmically with the volume of training data available [122][123][124]. As the algorithm 'learns' through feature recognition, the quality of the training cohort fundamentally shapes its performance. ...
... The lack of high-quality labelled training data is a limitation throughout all domains of ML research. Carefully preparing, validating and labelling training data often form the bulk of the development work [124]. ...
... All patient identifiable information needs to be carefully removed from any imaging data set prior to use. Although standards exist for medical imaging data such as DICOM, they are only loosely adhered to, with wide variation in the metadata [124]. Patient information can be difficult to remove, and at times hard coded into the imaging data. ...
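The de-identification step described above can be sketched with a minimal, stdlib-only example. The tag denylist below is illustrative only: a real pipeline would use a DICOM library and follow the full confidentiality profile of the standard, which covers far more attributes (and, as the snippet notes, identifiers hard-coded into pixel data cannot be removed this way).

```python
# Hypothetical denylist of patient-identifiable DICOM attributes.
# Real de-identification must follow the DICOM standard's confidentiality
# profile, which enumerates many more attributes than shown here.
IDENTIFIABLE_TAGS = {
    "PatientName", "PatientID", "PatientBirthDate",
    "PatientAddress", "OtherPatientIDs", "InstitutionName",
}

def deidentify(header: dict) -> dict:
    """Return a copy of a DICOM-like header with identifiable tags blanked."""
    return {tag: ("" if tag in IDENTIFIABLE_TAGS else value)
            for tag, value in header.items()}

header = {"PatientName": "DOE^JANE", "PatientID": "12345", "Modality": "CT"}
clean = deidentify(header)  # Modality survives; identifiers are blanked
```

Blanking rather than deleting tags keeps the header structurally valid for downstream tooling that expects the attributes to be present.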
Article
Full-text available
Accurate phenotyping of patients with pulmonary hypertension (PH) is an integral part of informing disease classification, treatment, and prognosis. The impact of lung disease on PH outcomes and response to treatment remains a challenging area with limited progress. Imaging with computed tomography (CT) plays an important role in patients with suspected PH when assessing for parenchymal lung disease, however, current assessments are limited by their semi-qualitative nature. Quantitative chest-CT (QCT) allows numerical quantification of lung parenchymal disease beyond subjective visual assessment. This has facilitated advances in radiological assessment and clinical correlation of a range of lung diseases including emphysema, interstitial lung disease, and coronavirus disease 2019 (COVID-19). Artificial Intelligence approaches have the potential to facilitate rapid quantitative assessments. Benefits of cross-sectional imaging include ease and speed of scan acquisition, repeatability and the potential for novel insights beyond visual assessment alone. Potential clinical benefits include improved phenotyping and prediction of treatment response and survival. Artificial intelligence approaches also have the potential to aid more focused study of pulmonary arterial hypertension (PAH) therapies by identifying more homogeneous subgroups of patients with lung disease. This state-of-the-art review summarizes recent QCT developments and potential applications in patients with PH with a focus on lung disease.
... Furthermore, successful attempts to construct mammographic datasets fulfilled requirements for validating a mammographic dataset. The current work met the following requirements, which were adopted from research [35][36][37]. Figure 2 shows a diagram of the process of creating the dataset. The annotation of the images was provided by three different radiologists: Dr. Sawsan Ashoor, Dr. Samia Alamoud, and Dr. Gawaher Al Ahadi. The dataset contains five folders divided based on BI-RADS categories and includes DICOM and JPG image formats in separate folders. ...
... Finally, the proposed dataset satisfied most of the ideal medical image dataset criteria described in [36,37,41]. It has adequate data volume, curation, annotation, ground truth, reusability, and generalizability. ...
Article
Full-text available
The current era is characterized by the rapidly increasing use of computer-aided diagnosis (CAD) systems in the medical field. These systems need a variety of datasets to help develop, evaluate, and compare their performances fairly. Physicians indicated that breast anatomy, especially dense ones, and the probability of breast cancer and tumor development, vary highly depending on race. Researchers reported that breast cancer risk factors are related to culture and society. Thus, there is a massive need for a local dataset representing breast cancer in our region to help develop and evaluate automatic breast cancer CAD systems. This paper presents a public mammogram dataset called King Abdulaziz University Breast Cancer Mammogram Dataset (KAU-BCMD) version 1. To our knowledge, KAU-BCMD is the first dataset in Saudi Arabia that deals with a large number of mammogram scans. The dataset was collected from the Sheikh Mohammed Hussein Al-Amoudi Center of Excellence in Breast Cancer at King Abdulaziz University. It contains 1416 cases. Each case has two views for both the right and left breasts, resulting in 5662 images based on the breast imaging reporting and data system. It also contains 205 ultrasound cases corresponding to a part of the mammogram cases, with 405 images in total. The dataset was annotated and reviewed by three different radiologists. Our dataset is a promising dataset that contains different imaging modalities for breast cancer with different cancer grades for Saudi women.
... Data and methods constitute the most visible items within the biomedical analytics ecosystem; metadata is, however, progressively gaining a more relevant role for AI/ML in Precision Medicine, as it contains, in many cases, hints for the automated labeling or classification (even if approximate) tasks that will be further improved by the use of computational intelligence and statistical learning approaches (87,267). We will further discuss this issue in the next subsection. ...
... For this reason, aiming for high quality, well-formatted and standardized metadata has become quite relevant (268). Indeed, a number of biomedical data analysis teams and consortia are encouraging the use of standardized metadata guidelines, exemplified, for instance by a checklist of relevant issues to consider when building and publishing companion metadata (250,269,270); since such metadata could be instrumental to implement data analytics, as well as AI/ML toward a precision medicine approach (267,271). ...
Article
Full-text available
A main goal of Precision Medicine is that of incorporating and integrating the vast corpora on different databases about the molecular and environmental origins of disease into analytic frameworks, allowing the development of individualized, context-dependent diagnostics and therapeutic approaches. In this regard, artificial intelligence and machine learning approaches can be used to build analytical models of complex disease aimed at prediction of personalized health conditions and outcomes. Such models must handle the wide heterogeneity of individuals in both their genetic predisposition and their social and environmental determinants. Computational approaches to medicine need to be able to efficiently manage, visualize, and integrate large datasets combining structured and unstructured formats. This needs to be done while constrained by different levels of confidentiality, ideally doing so within a unified analytical architecture. Efficient data integration and management is key to the successful application of computational intelligence approaches to medicine. A number of challenges arise in the design of successful approaches to medical data analytics under the currently demanding performance conditions of personalized medicine, while also subject to time, computational power, and bioethical constraints. Here, we will review some of these constraints and discuss possible avenues to overcome current challenges.
... In particular, many vision applications in medical image analysis [2] require annotations from clinical experts, which incur high costs and commonly suffer from high inter-reader variability [364,365,366,60], e.g., average variability in the range 74-85% has been reported for glioblastoma segmentation [367]. While medical imaging data is now extremely abundant due to over two decades of digitisation, the world still remains relatively short of access to clean data with well-curated labels that is amenable to machine learning [368], necessitating an intelligent method to learn robustly from noisy annotations. ...
... Further aggravated by differences in biases and levels of expertise, segmentation annotations of structures in medical images suffer from high annotation variations [401]. In consequence, despite the present abundance of medical imaging data thanks to over two decades of digitisation, the world still remains relatively short of access to data with curated labels [368], that is amenable to machine learning, necessitating intelligent methods to learn robustly from such noisy annotations. ...
Thesis
Full-text available
Deep learning is now ubiquitous in the research field of medical image computing. As such technologies progress towards clinical translation, the question of safety becomes critical. Once deployed, machine learning systems unavoidably face situations where the correct decision or prediction is ambiguous. However, the current methods disproportionately rely on deterministic algorithms, lacking a mechanism to represent and manipulate uncertainty. In safety-critical applications such as medical imaging, reasoning under uncertainty is crucial for developing a reliable decision making system. Probabilistic machine learning provides a natural framework to quantify the degree of uncertainty over different variables of interest, be it the prediction, the model parameters and structures, or the underlying data (images and labels). Probability distributions are used to represent all the uncertain unobserved quantities in a model and how they relate to the data, and probability theory is used as a language to compute and manipulate these distributions. In this thesis, we explore probabilistic modelling as a framework to integrate uncertainty information into deep learning models, and demonstrate its utility in various high-dimensional medical imaging applications. In the process, we make several fundamental enhancements to current methods. We categorise our contributions into three groups according to the types of uncertainties being modelled: (i) predictive; (ii) structural and (iii) human uncertainty. Firstly, we discuss the importance of quantifying predictive uncertainty and understanding its sources for developing a risk-averse and transparent medical image enhancement application. We demonstrate how a measure of predictive uncertainty can be used as a proxy for the predictive accuracy in the absence of ground-truths. 
Furthermore, assuming the structure of the model is flexible enough for the task, we introduce a way to decompose the predictive uncertainty into its orthogonal sources, i.e. aleatoric and parameter uncertainty. We show the potential utility of such decoupling in providing quantitative “explanations” of model performance. Secondly, we introduce our recent attempts at learning model structures directly from data. One work proposes a method based on variational inference to learn a posterior distribution over connectivity structures within a neural network architecture for multi-task learning, and share some preliminary results in the MR-only radiotherapy planning application. Another work explores how the training algorithm of decision trees could be extended to grow the architecture of a neural network to adapt to the given availability of data and the complexity of the task. Lastly, we develop methods to model the “measurement noise” (e.g., biases and skill levels) of human annotators, and integrate this information into the learning process of the neural network classifier. In particular, we show that explicitly modelling the uncertainty involved in the annotation process not only leads to an improvement in robustness to label noise, but also yields useful insights into the patterns of errors that characterise individual experts.
... There are varying degrees of medical imaging data readiness (MIDaR), which were elegantly described by Harvey and Glocker in their MIDaR scale (10). This four-point MIDaR scale ranges from level D to level A. Level D (the lowest level of data readiness on the scale), or what can be referred to as "dirty" or "raw" data, represents data that contain patient-identifiable information, unverified in quantity and quality, and inaccessible to researchers. ...
... This four-point MIDaR scale ranges from level D to level A. Level D (the lowest level of data readiness on the scale), or what can be referred to as "dirty" or "raw" data, represents data that contain patient-identifiable information, unverified in quantity and quality, and inaccessible to researchers. In contradistinction, a level A dataset is "structured, fully annotated, has minimal noise and, most importantly, is contextually appropriate and ready for a specific machine learning task (10)." Level A data (data veracity) are quite elusive, laborious to curate, and exist in low volumes. ...
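The ordering of the scale described above can be captured in a few lines. This is only a sketch: the excerpt characterises levels D and A, so the comments for the intermediate levels B and C are deliberately left as placeholders rather than invented.

```python
from enum import IntEnum

class MIDaR(IntEnum):
    """MIDaR readiness levels, lowest (D) to highest (A), as an ordered enum."""
    D = 1  # "dirty"/raw: identifiable, unverified in quantity and quality, inaccessible
    C = 2  # intermediate level (defined in the chapter, not in this excerpt)
    B = 3  # intermediate level (defined in the chapter, not in this excerpt)
    A = 4  # structured, fully annotated, minimal noise, ready for a specific ML task
```

Using `IntEnum` makes readiness comparisons direct, e.g. `MIDaR.A > MIDaR.D`, which is convenient when filtering candidate datasets by minimum readiness.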
... Such data can be analyzed and interpreted using careful image annotation and artificial intelligence approaches, such as neural networks. [27][28][29] This fits the paradigm of variability of data type in the Big Data framework. Cardiac magnetic resonance imaging yields large data sets, both for image analysis as well as incorporation with other clinical data into registries. ...
... Work on data readiness related to other forms of data includes that of Nazabal et al. (2020), who address data wrangling issues from a general standpoint using a set of case studies, as well as the work by van Ooijen (2019) and Harvey and Glocker (2019), which both deal with data quality in medical imaging. We have not found any work that focuses specifically on data readiness in the context of NLP. ...
Preprint
This document concerns data readiness in the context of machine learning and Natural Language Processing. It describes how an organization may proceed to identify, make available, validate, and prepare data to facilitate automated analysis methods. The contents of the document are based on the practical challenges and frequently asked questions we have encountered in our work as an applied research institute, helping organizations and companies in both the public and private sectors to use data in their business processes.
... An enticing prospect is mining physician expertise by collecting retrospective data from picture archiving and communication systems (PACSs), but the current generation of PACSs do not properly address the curation of large-scale data for machine learning. In PACSs, DICOM tags regarding scan descriptions are typically entered by hand, non-standardized, and often incomplete, which leads to the need for extensive data curation [5]. These limitations frequently produce high mislabeling rates, e.g., the 15% rate reported by Gueld et al., meaning that simply selecting the scans of interest (SOIs) from a large set of studies can be prohibitively laborious. ...
Chapter
Full-text available
As the demand for more descriptive machine learning models grows within medical imaging, bottlenecks due to data paucity will exacerbate. Thus, collecting enough large-scale data will require automated tools to harvest data/label pairs from messy and real-world datasets, such as hospital picture archiving and communication systems (PACSs). This is the focus of our work, where we present a principled data curation tool to extract multi-phase computed tomography (CT) liver studies and identify each scan’s phase from a real-world and heterogenous hospital PACS dataset. Emulating a typical deployment scenario, we first obtain a set of noisy labels from our institutional partners that are text mined using simple rules from DICOM tags. We train a deep learning system, using a customized and streamlined 3D squeeze and excitation (SE) architecture, to identify non-contrast, arterial, venous, and delay phase dynamic CT liver scans, filtering out anything else, including other types of liver contrast studies. To exploit as much training data as possible, we also introduce an aggregated cross entropy loss that can learn from scans only identified as “contrast”. Extensive experiments on a dataset of 43K scans of 7680 patient imaging studies demonstrate that our 3DSE architecture, armed with our aggregated loss, can achieve a mean F1 of 0.977 and can correctly harvest up to 92.7% of studies, which significantly outperforms the text-mined and standard-loss approach, and also outperforms other, and more complex, model architectures.
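The first stage of the pipeline described above, text mining noisy labels from free-text DICOM tags with simple rules, can be sketched as follows. The keyword patterns here are illustrative assumptions; the actual rules used by the authors' institutional partners are not given in the abstract.

```python
import re

# Illustrative keyword rules mapping free-text series descriptions to
# contrast-phase labels; real institutional rules would be more extensive.
PHASE_RULES = [
    ("arterial", r"arterial|late\s*art"),
    ("venous", r"venous|portal"),
    ("delay", r"delay"),
    ("non-contrast", r"non[-\s]?contrast|w/?o\s+contrast|plain"),
]

def noisy_phase_label(series_description: str) -> str:
    """Text-mine a free-text DICOM SeriesDescription into a noisy phase label."""
    text = series_description.lower()
    for label, pattern in PHASE_RULES:
        if re.search(pattern, text):
            return label
    return "unknown"
```

For example, `noisy_phase_label("Liver ARTERIAL phase 3mm")` yields `"arterial"`. Labels produced this way are exactly the noisy supervision the paper's deep learning system is then trained to clean up.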
... 10 Given the speed by which 8 A central reason for why automated image recognition has seen such great progress in recent years is in large part due to the high quality of imaging data. For recent information about the progress being made to improve the quality of imaging data sets even further, see Harvey and Glocker (2019) and van Ooijen (2019). 9 Note that the reported result does not show that deep learning systems generally outperform clinicians and experts. ...
Article
Full-text available
Advanced AI systems are rapidly making their way into medical research and practice, and, arguably, it is only a matter of time before they will surpass human practitioners in terms of accuracy, reliability, and knowledge. If this is true, practitioners will have a prima facie epistemic and professional obligation to align their medical verdicts with those of advanced AI systems. However, in light of their complexity, these AI systems will often function as black boxes: the details of their contents, calculations, and procedures cannot be meaningfully understood by human practitioners. When AI systems reach this level of complexity, we can also speak of black-box medicine. In this paper, we want to argue that black-box medicine conflicts with core ideals of patient-centered medicine. In particular, we claim, black-box medicine is not conducive for supporting informed decision-making based on shared information, shared deliberation, and shared mind between practitioner and patient.
... Yet, most of these databases are collected retrospectively from hospital picture archiving and communication systems (PACSs), which house the medical image and text reports from daily radiological workflows. While harvesting PACSs will likely be essential toward truly obtaining largescale medical imaging data [6], their data are entirely ill-suited for training machine learning systems [7] as they are not curated from a machine learning perspective. As a result, popular large-scale medical imaging datasets suffer from uncertainties, mislabellings [3], [8], [9] and incomplete annotations [5], a trend that promises to increase as more and more PACS data is exploited. ...
Preprint
Full-text available
Acquiring large-scale medical image data, necessary for training machine learning algorithms, is frequently intractable, due to prohibitive expert-driven annotation costs. Recent datasets extracted from hospital archives, e.g., DeepLesion, have begun to address this problem. However, these are often incompletely or noisily labeled, e.g., DeepLesion leaves over 50% of its lesions unlabeled. Thus, effective methods to harvest missing annotations are critical for continued progress in medical image analysis. This is the goal of our work, where we develop a powerful system to harvest missing lesions from the DeepLesion dataset at high precision. Accepting the need for some degree of expert labor to achieve high fidelity, we exploit a small fully-labeled subset of medical image volumes and use it to intelligently mine annotations from the remainder. To do this, we chain together a highly sensitive lesion proposal generator and a very selective lesion proposal classifier. While our framework is generic, we optimize our performance by proposing a 3D contextual lesion proposal generator and by using a multi-view multi-scale lesion proposal classifier. These produce harvested and hard-negative proposals, which we then re-use to finetune our proposal generator by using a novel hard negative suppression loss, continuing this process until no extra lesions are found. Extensive experimental analysis demonstrates that our method can harvest an additional 9,805 lesions while keeping precision above 90%. To demonstrate the benefits of our approach, we show that lesion detectors trained on our harvested lesions can significantly outperform the same variants only trained on the original annotations, with boost of average precision of 7% to 10%. We open source our code and annotations at https://github.com/JimmyCai91/DeepLesionAnnotation.
... The quality and amount of the images vary with the target task and domain. The next step is to structure the data in homogenized and machine-readable formats (24). The last step is to link the images to ground-truth information, which can be one or more labels, segmentations, or electronic phenotype (eg, biopsy or laboratory results). ...
Article
Artificial intelligence (AI) continues to garner substantial interest in medical imaging. The potential applications are vast and include the entirety of the medical imaging life cycle from image creation to diagnosis to outcome prediction. The chief obstacles to development and clinical implementation of AI algorithms include availability of sufficiently large, curated, and representative training data that includes expert labeling (eg, annotations). Current supervised AI methods require a curation process for data to optimally train, validate, and test algorithms. Currently, most research groups and industry have limited data access based on small sample sizes from small geographic areas. In addition, the preparation of data is a costly and time-intensive process, the results of which are algorithms with limited utility and poor generalization. In this article, the authors describe fundamental steps for preparing medical imaging data in AI algorithm development, explain current limitations to data curation, and explore new approaches to address the problem of data availability.
... Item 9. Preprocessing converts raw data from various sources into a well-defined, machine-readable format for analysis (20,21). Describe preprocessing steps fully and in sufficient detail so that other investigators could reproduce them. ...
... It is commonly recommended that image datasets used for training should have been acquired from systems from different vendors. 37 This is particularly relevant for multislice imaging systems (CT/MRI) in which differences in acquisition protocols may have more impact than in x-ray images. Finally, clinical experts and researchers may be unaware of certain biases, for example, differences in local practice. ...
Article
Although artificial intelligence (AI) has been a focus of medical research for decades, in the last decade, the field of radiology has seen tremendous innovation and also public focus due to development and application of machine-learning techniques to develop new algorithms. Interestingly, this innovation is driven simultaneously by academia, existing global medical device vendors, and-fueled by venture capital-recently founded startups. Radiologists find themselves once again in the position to lead this innovation to improve clinical workflows and ultimately patient outcome. However, although the end of today's radiologists' profession has been proclaimed multiple times, routine clinical application of such AI algorithms in 2020 remains rare. The goal of this review article is to describe in detail the relevance of appropriate imaging data as a bottleneck for innovation, provide insights into the many obstacles for technical implementation, and give additional perspectives to radiologists who often view AI solely from their clinical role. As regulatory approval processes for such medical devices are currently under public discussion and the relevance of imaging data is transforming, radiologists need to establish themselves as the leading gatekeepers for evolution of their field and be aware of the many stakeholders and sometimes conflicting interests.
... Further aggravated by differences in biases and levels of expertise, segmentation annotations of structures in medical images suffer from high annotation variations [7]. In consequence, despite the present abundance of medical imaging data thanks to over two decades of digitisation, the world still remains relatively short of access to data with curated labels [8], that is amenable to machine learning, necessitating intelligent methods to learn robustly from such noisy annotations. ...
Preprint
Full-text available
Recent years have seen increasing use of supervised learning methods for segmentation tasks. However, the predictive performance of these algorithms depends on the quality of labels. This problem is particularly pertinent in the medical image domain, where both the annotation cost and inter-observer variability are high. In a typical label acquisition process, different human experts provide their estimates of the 'true' segmentation labels under the influence of their own biases and competence levels. Treating these noisy labels blindly as the ground truth limits the performance that automatic segmentation algorithms can achieve. In this work, we present a method for jointly learning, from purely noisy observations alone, the reliability of individual annotators and the true segmentation label distributions, using two coupled CNNs. The separation of the two is achieved by encouraging the estimated annotators to be maximally unreliable while achieving high fidelity with the noisy training data. We first define a toy segmentation dataset based on MNIST and study the properties of the proposed algorithm. We then demonstrate the utility of the method on three public medical imaging segmentation datasets with simulated (when necessary) and real diverse annotations: 1) MSLSC (multiple-sclerosis lesions); 2) BraTS (brain tumours); 3) LIDC-IDRI (lung abnormalities). In all cases, our method outperforms competing methods and relevant baselines particularly in cases where the number of annotations is small and the amount of disagreement is large. The experiments also show strong ability to capture the complex spatial characteristics of annotators' mistakes.
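The core idea above, modelling each annotator by a confusion matrix applied to the true label distribution, can be illustrated with a small hand-worked example. This is a numerical sketch of the forward noise model only, not the authors' coupled-CNN training procedure; the two-class numbers are hypothetical.

```python
def apply_confusion(cm, true_dist):
    """Noisy label distribution implied by an annotator's confusion matrix:
    p_noisy[i] = sum_j cm[i][j] * p_true[j], where cm[i][j] is the
    probability the annotator reports class i when the true class is j."""
    k = len(true_dist)
    return [sum(cm[i][j] * true_dist[j] for j in range(k)) for i in range(k)]

# Hypothetical 2-class pixel (0 = background, 1 = lesion): an annotator who
# over-segments, reporting "lesion" for 20% of true background pixels.
cm = [[0.8, 0.1],   # probability of reporting class 0, given true class 0 or 1
      [0.2, 0.9]]   # probability of reporting class 1, given true class 0 or 1
true_dist = [1.0, 0.0]          # pixel is certainly background
noisy = apply_confusion(cm, true_dist)  # -> [0.8, 0.2]
```

Treating the observed annotations as samples from such a distribution, rather than as ground truth, is what lets the method separate annotator reliability from the underlying segmentation.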
... Yet, most of these databases are collected retrospectively from hospital picture archiving and communication systems (PACSs), which house the medical image and text reports from daily radiological workflows. While harvesting PACSs will likely be essential toward truly obtaining largescale medical imaging data [6], their data are entirely ill-suited for training machine learning systems [7] as they are not curated from a machine learning perspective. As a result, popular large-scale medical imaging datasets suffer from uncertainties, mislabelings [3], [8], [9] and incomplete annotations [5], a trend that promises to increase as more and more PACS data is exploited. ...
... To see this, note that many CXR datasets are collected using natural language processing (NLP) approaches applied to hospital picture archiving and communication systems (PACSs) (Wang et al., 2017;Irvin et al., 2019). This is a trend that will surely increase given that PACSs remain the most viable source of large-scale medical data (Kohli et al., 2017;Harvey and Glocker, 2019). In such cases, it may not always be possible to extract fine-grained labels with confidence. ...
Article
Full-text available
Chest X-rays (CXRs) are a crucial and extraordinarily common diagnostic tool, leading to heavy research for computer-aided diagnosis (CAD) solutions. However, both high classification accuracy and meaningful model predictions that respect and incorporate clinical taxonomies are crucial for CAD usability. To this end, we present a deep hierarchical multi-label classification (HMLC) approach for CXR CAD. Different than other hierarchical systems, we show that first training the network to model conditional probability directly and then refining it with unconditional probabilities is key in boosting performance. In addition, we also formulate a numerically stable cross-entropy loss function for unconditional probabilities that provides concrete performance improvements. Finally, we demonstrate that HMLC can be an effective means to manage missing or incomplete labels. To the best of our knowledge, we are the first to apply HMLC to medical imaging CAD. We extensively evaluate our approach on detecting abnormality labels from the CXR arm of the Prostate, Lung, Colorectal and Ovarian (PLCO) dataset, which comprises over 198,000 manually annotated CXRs. When using complete labels, we report a mean area under the curve (AUC) of 0.887, the highest yet reported for this dataset. These results are supported by ancillary experiments on the PadChest dataset, where we also report significant improvements, 1.2% and 4.1% in AUC and average precision, respectively over strong "flat" classifiers. Finally, we demonstrate that our HMLC approach can much better handle incompletely labelled data. These performance improvements, combined with the inherent usefulness of taxonomic predictions, indicate that our approach represents a useful step forward for CXR CAD.
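The conditional-to-unconditional refinement described above rests on a simple identity: the unconditional probability of a label in a taxonomy is the product of the conditional probabilities along its path from the root, which is computed stably as a sum of logs. The sketch below shows only this identity with hypothetical path probabilities; the paper's numerically stable loss formulation itself is not reproduced here.

```python
import math

def unconditional_log_prob(conditional_probs):
    """Log unconditional probability of a leaf label: the sum of the log
    conditional probabilities along its path from the taxonomy root.
    Summing logs avoids the underflow of multiplying raw probabilities."""
    return sum(math.log(p) for p in conditional_probs)

# Hypothetical CXR taxonomy path:
# P(abnormal) = 0.9, P(opacity | abnormal) = 0.5, P(nodule | opacity) = 0.4
path = [0.9, 0.5, 0.4]
p_leaf = math.exp(unconditional_log_prob(path))  # 0.9 * 0.5 * 0.4 = 0.18
```

Working in log space matters for deep taxonomies, where products of many small conditionals would otherwise underflow to zero.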
... PACSs will likely be essential toward truly obtaining largescale medical imaging data [6], their data are entirely ill-suited for training machine learning systems [7] as they are not curated from a machine learning perspective. As a result, popular large-scale medical imaging datasets suffer from uncertainties, mislabelings [3], [8], [9] and incomplete annotations [5], a trend that promises to increase as more and more PACS data is exploited. ...
Article
Full-text available
The acquisition of large-scale medical image data, necessary for training machine learning algorithms, is hampered by associated expert-driven annotation costs. Mining hospital archives can address this problem, but labels are often incomplete or noisy, e.g., 50% of the lesions in DeepLesion are left unlabeled. Thus, effective label harvesting methods are critical. This is the goal of our work, where we introduce Lesion-Harvester, a powerful system to harvest missing annotations from lesion datasets at high precision. Accepting the need for some degree of expert labor, we use a small fully-labeled image subset to intelligently mine annotations from the remainder. To do this, we chain together a highly sensitive lesion proposal generator (LPG) and a very selective lesion proposal classifier (LPC). Using a new hard negative suppression loss, the resulting harvested and hard-negative proposals are then employed to iteratively finetune our LPG. While our framework is generic, we optimize our performance by proposing a new 3D contextual LPG and by using a global-local multi-view LPC. Experiments on DeepLesion demonstrate that Lesion-Harvester can discover an additional 9,805 lesions at a precision of 90%. We publicly release the harvested lesions, along with a new test set of completely annotated DeepLesion volumes. We also present a pseudo 3D IoU evaluation metric that corresponds much better to the real 3D IoU than current DeepLesion evaluation metrics. To quantify the downstream benefits of Lesion-Harvester we show that augmenting the DeepLesion annotations with our harvested lesions allows state-of-the-art detectors to boost their average precision by 7 to 10%.
... To see this, note that many CXR datasets are collected using natural language processing (NLP) approaches applied to hospital picture archiving and communication systems (PACSs) (Wang et al., 2017;Irvin et al., 2019). This is a trend that will surely increase given that PACSs remain the most viable source of large-scale medical data (Kohli et al., 2017;Harvey and Glocker, 2019). ...
Preprint
CXRs are a crucial and extraordinarily common diagnostic tool, leading to heavy research into CAD solutions. However, both high classification accuracy and meaningful model predictions that respect and incorporate clinical taxonomies are crucial for CAD usability. To this end, we present a deep HMLC approach for CXR CAD. Unlike other hierarchical systems, we show that first training the network to model conditional probability directly and then refining it with unconditional probabilities is key to boosting performance. In addition, we also formulate a numerically stable cross-entropy loss function for unconditional probabilities that provides concrete performance improvements. Finally, we demonstrate that HMLC can be an effective means to manage missing or incomplete labels. To the best of our knowledge, we are the first to apply HMLC to medical imaging CAD. We extensively evaluate our approach on detecting abnormality labels from the CXR arm of the PLCO dataset, which comprises over 198,000 manually annotated CXRs. When using complete labels, we report a mean AUC of 0.887, the highest yet reported for this dataset. These results are supported by ancillary experiments on the PadChest dataset, where we also report significant improvements, 1.2% and 4.1% in AUC and AP, respectively, over strong "flat" classifiers. Finally, we demonstrate that our HMLC approach can much better handle incompletely labelled data. These performance improvements, combined with the inherent usefulness of taxonomic predictions, indicate that our approach represents a useful step forward for CXR CAD.
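The conditional-then-unconditional refinement this abstract describes rests on the chain rule over a label taxonomy: an unconditional label probability is the product of conditional probabilities along the path to the root. A minimal sketch, using a toy taxonomy and made-up probabilities rather than the paper's labels or values:

```python
from typing import Dict, Optional

# Toy CXR label taxonomy: child -> parent (None marks a root).
# Labels and probabilities are illustrative, not from the paper.
PARENT: Dict[str, Optional[str]] = {
    "opacity": None,
    "nodule": "opacity",
    "mass": "opacity",
}

def unconditional(label: str, cond_prob: Dict[str, float]) -> float:
    """Chain-rule roll-up: P(label) = P(label | parent) * P(parent | ...) * ..."""
    p = 1.0
    node: Optional[str] = label
    while node is not None:
        p *= cond_prob[node]
        node = PARENT[node]
    return p

cond = {"opacity": 0.8, "nodule": 0.5, "mass": 0.1}
print(round(unconditional("nodule", cond), 3))  # 0.5 * 0.8 = 0.4
```

A consequence worth noting: the roll-up guarantees a child's unconditional probability never exceeds its parent's, which is what makes the predictions respect the taxonomy.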
... Therefore, in the same way that physicians are familiar with planning protocols or delineation guidelines, clinical teams should start becoming familiar with guiding principles for data management and curation in the era of AI. The FAIR (Findability, Accessibility, Interoperability, and Reusability) Data Principles [231] are the most popular and general ones, but the medical community should focus efforts on adapting those principles to the specificities of the medical domain [232][233][234]. Only in this way will we manage to achieve a safe and efficient clinical implementation of AI methods. ...
Article
Artificial intelligence (AI) has recently become a very popular buzzword, as a consequence of disruptive technical advances and impressive experimental results, notably in the field of image analysis and processing. In medicine, specialties where images are central, like radiology, pathology or oncology, have seized the opportunity and considerable efforts in research and development have been deployed to transfer the potential of AI to clinical applications. With AI becoming a more mainstream tool for typical medical imaging analysis tasks, such as diagnosis, segmentation, or classification, the key for a safe and efficient use of clinical AI applications relies, in part, on informed practitioners. The aim of this review is to present the basic technological pillars of AI, together with the state-of-the-art machine learning methods and their application to medical imaging. In addition, we discuss the new trends and future research directions. This will help the reader to understand how AI methods are now becoming an ubiquitous tool in any medical image analysis workflow and pave the way for the clinical implementation of AI-based solutions.
... Potential biases or errors in the data have the potential to be propagated further by these techniques. Furthermore, studies that do not use appropriate experts to label data have the potential to introduce errors [16] and reduce data quality [21]. ...
Article
Full-text available
Objective There has been a large amount of research in the field of artificial intelligence (AI) as applied to clinical radiology. However, these studies vary in design and quality, and systematic reviews of the entire field are lacking. This systematic review aimed to identify all papers that used deep learning in radiology, to survey the literature and to evaluate their methods. We aimed to identify the key questions being addressed in the literature and to identify the most effective methods employed. Methods We followed the PRISMA guidelines and performed a systematic review of studies of AI in radiology published from 2015 to 2019. Our published protocol was prospectively registered. Results Our search yielded 11,083 results. Seven hundred sixty-seven full texts were reviewed, and 535 articles were included. Ninety-eight percent were retrospective cohort studies. The median number of patients included was 460. Most studies involved MRI (37%). Neuroradiology was the most common subspecialty. Eighty-eight percent used supervised learning. The majority of studies undertook a segmentation task (39%). Performance comparison was with a state-of-the-art model in 37%. The most used established architecture was UNet (14%). The median performance for the most utilised evaluation metrics was a Dice of 0.89 (range 0.49–0.99), AUC of 0.903 (range 0.61–1.00) and accuracy of 89.4 (range 70.2–100). Of the 77 studies that externally validated their results and allowed for direct comparison, performance on average decreased by 6% at external validation (range: increase of 4% to decrease of 44%). Conclusion This systematic review has surveyed the major advances in AI as applied to clinical radiology. Key Points
• While there are many papers reporting expert-level results by using deep learning in radiology, most apply only a narrow range of techniques to a narrow selection of use cases.
• The literature is dominated by retrospective cohort studies with limited external validation and high potential for bias.
• The recent advent of AI extensions to systematic reporting guidelines and prospective trial registration, along with a focus on external validation and explanations, shows potential for translating the hype surrounding AI from code to clinic.
Chapter
One of the major hurdles for the development and clinical deployment of machine learning (ML) and deep learning (DL) models is the availability of structured, well-curated, large medical imaging data sets with high-quality labels. For most researchers and companies, access to medical imaging data is limited to small datasets from a narrow geographic area. We describe the fundamental steps that need to be taken in the process of preparing cardiovascular imaging data for the development of ML/DL models.
Chapter
Recent years have seen an increasing use of supervised learning methods for segmentation tasks. However, the predictive performance of these algorithms depends on the quality of labels, especially in the medical imaging domain, where both the annotation cost and inter-observer variability are high. In a typical annotation collection process, different clinical experts provide their estimates of the “true” segmentation labels under the influence of their levels of expertise and biases. Treating these noisy labels blindly as the ground truth can adversely affect the performance of supervised segmentation models. In this work, we present a neural network architecture for jointly learning, from noisy observations alone, both the reliability of individual annotators and the true segmentation label distributions. The separation of the annotators’ characteristics and the true segmentation label is achieved by encouraging the estimated annotators to be maximally unreliable while achieving high fidelity with the training data. Our method can also be viewed as a translation of STAPLE, an established label aggregation framework proposed in Warfield et al. [1], to the supervised learning paradigm. We demonstrate first on a generic segmentation task using MNIST data and then adapt for use with MRI scans of multiple sclerosis (MS) patients for lesion labelling. Our method shows considerable improvement over the relevant baselines on both datasets in terms of segmentation accuracy and estimation of annotator reliability, particularly when only a single label is available per image. An open-source implementation of our approach can be found at https://github.com/UCLBrain/MSLS.
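As a point of contrast with the learned annotator model described above, the naive baseline such methods improve on can be sketched as plain per-pixel majority voting, which implicitly treats every annotator as equally reliable. A minimal illustration on flattened binary masks (toy data, not the paper's method):

```python
from collections import Counter
from typing import List

def majority_vote(masks: List[List[int]]) -> List[int]:
    """Per-pixel majority vote over binary masks from several annotators.

    Naive baseline only: unlike STAPLE or a jointly learned reliability
    model, every annotator here carries equal weight.
    """
    n_pixels = len(masks[0])
    fused = []
    for i in range(n_pixels):
        votes = Counter(mask[i] for mask in masks)
        fused.append(votes.most_common(1)[0][0])
    return fused

# Three annotators disagree on a 5-pixel "image"
annotators = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 0, 1],
]
print(majority_vote(annotators))  # [1, 1, 0, 0, 1]
```

The failure mode motivating the paper is visible even here: a systematically biased annotator drags the vote with the same weight as an expert, and with a single label per image there is nothing to vote over at all.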
Article
Objective Quality gaps in medical imaging datasets lead to profound errors in experiments. Our objective was to characterize such quality gaps in public pancreas imaging datasets (PPIDs), to evaluate their impact on previously published studies, and to provide post-hoc labels and segmentations as a value-add for these PPIDs. Methods We scored the available PPIDs on the medical imaging data readiness (MIDaR) scale, and evaluated for associated metadata, image quality, acquisition phase, etiology of pancreas lesion, sources of confounders, and biases. Studies utilizing these PPIDs were evaluated for awareness of, and any impact of, quality gaps on their results. Volumetric pancreatic adenocarcinoma (PDA) segmentations were performed for non-annotated CTs by a junior radiologist (R1) and reviewed by a senior radiologist (R3). Results We found three PPIDs with 560 CTs and six MRIs. The NIH dataset of normal pancreas CTs (NIH-PCT) (n = 80 CTs) had optimal image quality and met MIDaR A criteria, but parts of the pancreas were excluded in the provided segmentations. The TCIA-PDA (n = 60 CTs; 6 MRIs) and MSD (n = 420 CTs) datasets were categorized as MIDaR B due to incomplete annotations, limited metadata, and insufficient documentation. A substantial proportion of CTs from the TCIA-PDA and MSD datasets were found unsuitable for AI due to biliary stents [TCIA-PDA: 10 (17%); MSD: 112 (27%)] or other factors (non-portal venous phase, suboptimal image quality, non-PDA etiology, or post-treatment status) [TCIA-PDA: 5 (8.5%); MSD: 156 (37.1%)]. These quality gaps were not accounted for in any of the 25 studies that have used these PPIDs (NIH-PCT: 20; MSD: 1; both: 4). PDA segmentations were done by R1 in 91 eligible CTs (TCIA-PDA: 42; MSD: 49). Of these, corrections were made by R3 in 16 CTs (18%) (TCIA-PDA: 4; MSD: 12) [mean (standard deviation) Dice: 0.72 (0.21) and 0.63 (0.23), respectively].
Conclusion Substantial quality gaps, sources of bias, and a high proportion of CTs unsuitable for AI characterize the limited available PPIDs. Published studies on these PPIDs do not account for these quality gaps. We complement these PPIDs through post-hoc labels and segmentations for public release on the TCIA portal. Collaborative efforts leading to large, well-curated PPIDs supported by adequate documentation are critically needed to translate the promise of AI to clinical practice.
Article
Artificial intelligence (AI) has seen an explosion in interest within nuclear medicine. This interest is driven by the rapid progress and eye-catching achievements of machine learning algorithms. The growing foothold of AI in molecular imaging is exposing nuclear medicine personnel to new technology and terminology. Clinicians and researchers can be easily overwhelmed by numerous architectures and algorithms that have been published. This article dissects the backbone of most AI algorithms: the convolutional neural network. The algorithm training workflow and the key ingredients and operations of a convolutional neural network are described in detail. Finally, the ubiquitous U-Net is explained step-by-step.
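Since this article centres on the convolutional neural network, its core operation can be shown in a few lines of plain Python. This is an illustrative "valid"-mode convolution (strictly, cross-correlation, as implemented in most deep learning frameworks), not code from the article or any particular library:

```python
from typing import List

Matrix = List[List[float]]

def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """'Valid'-mode 2D convolution: slide the kernel over the image
    and sum element-wise products at each position. No padding, so the
    output shrinks by (kernel size - 1) in each dimension."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
    return out

# A vertical-edge detector on a tiny "image" with a bright right half:
# the response peaks exactly where the dark-to-bright boundary sits.
img = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [[-1, 1], [-1, 1]]
print(conv2d(img, edge_kernel))
```

A trained CNN stacks many such filters, learning the kernel values from data rather than hand-crafting them as done here.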
Preprint
Full-text available
In this paper, we identify the state of data as an important reason for failure in applied Natural Language Processing (NLP) projects. We argue that there is a gap between academic research in NLP and its application to problems outside academia, and that this gap is rooted in poor mutual understanding between academic researchers and their non-academic peers who seek to apply research results to their operations. To foster the transfer of research results from academia to non-academic settings, and the corresponding influx of requirements back to academia, we propose a method for improving the communication between researchers and external stakeholders regarding the accessibility, validity, and utility of data, based on Data Readiness Levels (Lawrence, 2017). While still in its infancy, the method has been iterated on and applied in multiple innovation and research projects carried out with stakeholders in both the private and public sectors. Finally, we invite researchers and practitioners to share their experiences, and thus contribute to a body of work aimed at raising awareness of the importance of data readiness for NLP.
Article
Full-text available
Purpose: A fully automated system for interpreting abdominal computed tomography (CT) scans with multiple phases of contrast enhancement requires an accurate classification of the phases. Current approaches to classifying the CT phases are commonly based on three-dimensional (3D) convolutional neural network (CNN) approaches with high computational complexity and high latency. This work aims at developing and validating a precise, fast multiphase classifier to recognize three main types of contrast phases in abdominal CT scans. Methods: We propose in this study a novel method that uses a random sampling mechanism on top of deep CNNs for the phase recognition of abdominal CT scans of four different phases: noncontrast, arterial, venous, and others. The CNNs work as a slice-wise phase predictor, while random sampling selects input slices for the CNN models. Afterward, majority voting synthesizes the slice-wise results of the CNNs to provide the final prediction at the scan level. Results: Our classifier was trained on 271,426 slices from 830 phase-annotated CT scans, and when combined with majority voting on 30% of slices randomly chosen from each scan, achieved a mean F1 score of 92.09% on our internal test set of 358 scans. The proposed method was also evaluated on two external test sets: CPTAC-CCRCC (N = 242) and LiTS (N = 131), which were annotated by our experts. Although a drop in performance was observed, the model remained highly accurate, with mean F1 scores of 76.79% and 86.94% on the CPTAC-CCRCC and LiTS datasets, respectively. Our experimental results also showed that the proposed method significantly outperformed the state-of-the-art 3D approaches while requiring less computation time for inference. Conclusions: In comparison to state-of-the-art classification methods, the proposed approach shows better accuracy with significantly reduced latency.
Our study demonstrates the potential of a precise, fast multiphase classifier based on a two-dimensional deep learning approach combined with a random sampling method for contrast phase recognition, providing a valuable tool for extracting multiphase abdominal studies from low-veracity, real-world data.
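The scan-level decision rule described above — slice-wise predictions on a random ~30% of slices, fused by majority vote — is straightforward to sketch. The slice model here is a stand-in stub, not the paper's CNN, and `fraction` and `seed` are illustrative parameters:

```python
import random
from collections import Counter
from typing import Callable, List

def predict_scan_phase(
    slices: List[str],
    slice_model: Callable[[str], str],
    fraction: float = 0.3,
    seed: int = 0,
) -> str:
    """Classify a scan's contrast phase from a random subset of slices.

    Runs the (assumed) slice-wise classifier on roughly `fraction` of
    the slices chosen at random, then majority-votes the slice-level
    predictions to produce one scan-level label.
    """
    rng = random.Random(seed)
    k = max(1, int(len(slices) * fraction))
    sample = rng.sample(slices, k)
    votes = Counter(slice_model(s) for s in sample)
    return votes.most_common(1)[0][0]

# Stand-in "model": pretend each slice carries its true phase as a string.
fake_slices = ["venous"] * 95 + ["arterial"] * 5
print(predict_scan_phase(fake_slices, slice_model=lambda s: s))  # venous
```

The design trade-off the abstract highlights falls out directly: inference cost scales with the sampled fraction, while the vote makes the scan-level label robust to occasional slice-level mistakes.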
Article
Full-text available
At the first annual Conference on Machine Intelligence in Medical Imaging (C-MIMI), held in September 2016, a conference session on medical image data and datasets for machine learning identified multiple issues. The common theme from attendees was that everyone participating in medical image evaluation with machine learning is data starved. There is an urgent need to find better ways to collect, annotate, and reuse medical imaging data. Unique domain issues with medical image datasets require further study, development, and dissemination of best practices and standards, and a coordinated effort among medical imaging domain experts, medical imaging informaticists, government and industry data scientists, and interested commercial, academic, and government entities. High-level attributes of reusable medical image datasets suitable to train, test, validate, verify, and regulate ML products should be better described. NIH and other government agencies should promote and, where applicable, enforce access to medical image datasets. We should improve communication among medical imaging domain experts, medical imaging informaticists, academic clinical and basic science researchers, government and industry data scientists, and interested commercial entities.
Article
Full-text available
Big Data (BD), with their potential to ascertain valued insights for an enhanced decision-making process, have recently attracted substantial interest from both academics and practitioners. Big Data Analytics (BDA) is increasingly becoming a trending practice that many organizations are adopting with the purpose of constructing valuable information from BD. The analytics process, including the deployment and use of BDA tools, is seen by organizations as a way to improve operational efficiency, though it also has the strategic potential to drive new revenue streams and gain competitive advantages over business rivals. However, there are different types of analytic applications to consider. Therefore, prior to hasty use and the purchase of costly BD tools, organizations need to first understand the BDA landscape. Given the significant nature of BD and BDA, this paper presents a state-of-the-art review offering a holistic view of the BD challenges and BDA methods theorized/proposed/employed by organizations, to help others understand this landscape with the objective of making robust investment decisions. In doing so, the authors systematically analyse and synthesize the extant research published on BD and BDA. More specifically, the authors seek to answer the following two principal questions: Q1 – What are the different types of BD challenges theorized/proposed/confronted by organizations? and Q2 – What are the different types of BDA methods theorized/proposed/employed to overcome BD challenges? This systematic literature review (SLR) is carried out by observing and understanding the past trends and extant patterns/themes in the BDA research area, evaluating contributions, summarizing knowledge, and identifying limitations, implications and potential further research avenues to support the academic community in exploring research themes/patterns.
Thus, to trace the implementation of BD strategies, a profiling method is employed to analyze articles (published in English-language peer-reviewed journals between 1996 and 2015) extracted from the Scopus database. The analysis presented in this paper identifies relevant BD research studies that have contributed both conceptually and empirically to the expansion and accrual of intellectual wealth in the BDA, technology and organizational resource management disciplines.
Article
Full-text available
There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.
Article
Full-text available
At Brown University, there was excitement about having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus - along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions - captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks - if only we knew how to extract the model from the data.
Article
Smith and Nichols discuss “big data” human neuroimaging studies, with very large subject numbers and amounts of data. These studies provide great opportunities for making new discoveries about the brain but raise many new analytical challenges and interpretational risks.
Article
Application of models to data is fraught. Data-generating collaborators often only have a very basic understanding of the complications of collating, processing and curating data. Challenges include: poor data collection practices, missing values, inconvenient storage mechanisms, intellectual property, security and privacy. All these aspects obstruct the sharing and interconnection of data, and the eventual interpretation of data through machine learning or other approaches. In project reporting, a major challenge is in encapsulating these problems and enabling goals to be built around the processing of data. Project overruns can occur due to failure to account for the amount of time required to curate and collate. But to understand these failures we need to have a common language for assessing the readiness of a particular data set. This position paper proposes the use of data readiness levels: it gives a rough outline of three stages of data preparedness and speculates on how formalisation of these levels into a common language for data readiness could facilitate project management.
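The three bands of readiness outlined in this position paper can be caricatured as a simple audit object. The band letters follow Lawrence's C/B/A outline (C: accessibility; B: faithfulness/validity; A: appropriateness for a specific task), but the attached questions are paraphrases, not quotations, and the `DatasetAudit` fields are illustrative:

```python
from dataclasses import dataclass

# Minimal sketch of Lawrence-style data readiness bands; the questions
# are illustrative paraphrases of the three stages, not the paper's text.
BANDS = {
    "C": "Accessibility: does the data exist, and can the team legally load it?",
    "B": "Faithfulness: is it representative, with noise and missing values handled?",
    "A": "Appropriateness: is it labelled and structured for the specific task?",
}

@dataclass
class DatasetAudit:
    accessible: bool   # band C cleared
    validated: bool    # band B cleared
    task_ready: bool   # band A cleared

    def band(self) -> str:
        """Return the highest readiness band the dataset has cleared."""
        if not self.accessible:
            return "below C"
        if not self.validated:
            return "C"
        if not self.task_ready:
            return "B"
        return "A"

audit = DatasetAudit(accessible=True, validated=True, task_ready=False)
print(audit.band())  # B
```

Even this toy version makes the paper's project-management point: a dataset stuck at band C cannot be budgeted as if it were at band A, and stating the band up front surfaces the curation work that causes overruns.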
Conference Paper
To advance and/or ease computer-aided diagnosis (CAD) systems, chest X-ray (CXR) image view information is required. In other words, separating CXR image views, frontal and lateral, can be considered a crucial step toward effective subsequent processing, since techniques that work for frontal CXRs may not work equally well for lateral ones. With this motivation, in this paper we present a novel machine learning technique to classify frontal and lateral CXR images, where we introduce a force histogram to extract features and apply three different state-of-the-art classifiers: support vector machine (SVM), random forest (RF) and multi-layer perceptron (MLP). We validated our fully automatic technique on a set of 8100 images hosted by the National Library of Medicine (NLM), National Institutes of Health (NIH), and achieved an accuracy close to 100%.
Article
Radiological reporting has generated large quantities of digital content within the electronic health record, which is potentially a valuable source of information for improving clinical care and supporting research. Although radiology reports are stored for communication and documentation of diagnostic imaging, harnessing their potential requires efficient and automated information extraction: they exist mainly as free-text clinical narrative, from which it is a major challenge to obtain structured data. Natural language processing (NLP) provides techniques that aid the conversion of text into a structured representation, and thus enables computers to derive meaning from human (ie, natural language) input. Used on radiology reports, NLP techniques enable automatic identification and extraction of information. By exploring the various purposes for their use, this review examines how radiology benefits from NLP. A systematic literature search identified 67 relevant publications describing NLP methods that support practical applications in radiology. This review takes a close look at the individual studies in terms of tasks (ie, the extracted information), the NLP methodology and tools used, and their application purpose and performance results. Additionally, limitations, future challenges, and requirements for advancing NLP in radiology will be discussed. http://pubs.rsna.org/doi/abs/10.1148/radiol.16142770 © RSNA, 2016
Article
The widely used DICOM 3.0 imaging protocol specifies optional tags to store specific information on modality and body region within the header: Body Part Examined and Anatomic Structure. We investigate whether this information can be used for the automated categorization of medical images, as this is an important first step for medical image retrieval. Our survey examines the headers generated by four digital image modalities (2 CTs, 2 MRIs) in clinical routine at the Aachen University Hospital within a period of four months. The manufacturing dates of the modalities range from 1995 to 1999, with software revisions from 1999 and 2000. Only one modality sets the DICOM tag Body Part Examined. 90 out of 580 images (15.5%) contained false tag entries causing a wrong categorization. This result was verified during a second one-month evaluation period one year later (562 images, 15.3% error rate). The main reason is the dependency of the tag on the examination protocol of the modality, which controls all relevant parameters of the imaging process. In routine practice, clinical personnel often apply an examination protocol outside its normal context to improve imaging quality. This is, however, done without manually adjusting the categorization-specific tag values. The values specified by DICOM for the tag Body Part Examined are insufficient to encode the anatomic region precisely. Thus, an automated categorization relying on DICOM tags alone is impossible.
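The failure mode described here — a missing or out-of-context Body Part Examined value — suggests a simple header sanity check as a first curation pass. A sketch with headers modelled as plain dicts; real code would read the tag with a DICOM library, and the allowed-value set below is a small illustrative subset, not the full DICOM vocabulary of defined terms:

```python
# Sanity check on the DICOM "Body Part Examined" tag (0018,0015).
# ALLOWED_BODY_PARTS is an illustrative subset of defined terms only.
ALLOWED_BODY_PARTS = {"HEAD", "CHEST", "ABDOMEN", "PELVIS", "SPINE"}

def check_body_part(header: dict) -> str:
    """Classify a header's Body Part Examined value for triage."""
    value = header.get("BodyPartExamined")
    if value is None:
        return "missing"       # tag is optional and often unset
    if value.upper() not in ALLOWED_BODY_PARTS:
        return "unrecognised"  # e.g. a protocol reused out of context
    return "ok"

print(check_body_part({"BodyPartExamined": "CHEST"}))  # ok
print(check_body_part({}))                             # missing
print(check_body_part({"BodyPartExamined": "SKULL"}))  # unrecognised
```

Note the limit the study establishes: a check like this catches absent or malformed values, but it cannot catch the ~15% of entries that are well-formed yet wrong for the image, which is why the authors conclude tags alone cannot drive categorization.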