Refresher Corner

Avoiding big data pitfalls

Pablo Lamata, PhD
Department of Biomedical Engineering, King’s College London, UK
Correspondence: Pablo Lamata, PhD, Dept of Biomedical Engineering – 5th Floor Becket House, 1 Lambeth Palace Road, London SE1 7EU, UK
E-mail: Pablo.Lamata@kcl.ac.uk

Heart Metab. 2020;82:33-35

Abstract: Clinical decisions are based on a combination of inductive inference built on experience (ie, statistical models) and on deductions provided by our understanding of the workings of the cardiovascular system (ie, mechanistic models). In a similar way, computers can be used to discover new hidden patterns in the (big) data and to make predictions based on our knowledge of physiology or physics. Surprisingly, unlike humans through history, computers seldom combine inductive and deductive processes. An explosion of expectations surrounds the computer’s inductive method, fueled by the “big data” and popular trends. This article reviews the risks and potential pitfalls of this computer approach, where the lack of generality, selection or confounding biases, overfitting, or spurious correlations are among the commonplace flaws. Recommendations to reduce these risks include an examination of data through the lens of causality, the careful choice and description of statistical techniques, and an open research culture with transparency. Finally, the synergy between mechanistic and statistical models (ie, the digital twin) is discussed as a promising pathway toward precision cardiology that mimics the human experience.

Keywords: artificial intelligence; big data; digital twin
“Big data” is the computer-enhanced version of inductive reasoning
There is an exciting future for medicine where decisions are informed by precise patient-specific data and risk models. Exploiting the “big data” in health care is one of the main engines working toward this future. We have learned from promising case studies, for example, the importance of access to the right source of evidence to select the right therapy for a pediatric lupus patient,1 but also from epic failures, such as the unmet expectation to predict seasonal flu from Internet searches.2
It is useful to conceptualize “big data” as an evolution of traditional statistical methods, now able to harness the value of much larger and heterogeneous sources of information. As such, its ability to learn new biomarkers and predictors will always be subject to the same fundamental limitations of inductive reasoning: can we generalize the findings; are they really true? “Big data” is simply the computer-enhanced version of human inductive reasoning. The scientific method teaches us that the interplay between inductive and deductive reasoning is the way forward. We need to build a hypothesis from the observations and then advance to make predictions and to run more experiments to verify them.
In this context, this article reviews the risks and potential pitfalls of our computer-enhanced ability to reveal the hidden patterns in the data that predict cardiovascular outcomes. It then sets out some recommendations to minimize the risks of spurious patterns: here the main message is the need to examine the revelations through the lens of causality; do they make sense? We need to interpret the experimental findings and see if they fit our framework of a plausible mechanistic explanation.
How to learn from data and how to fail in that endeavor
Our goal is to improve our future clinical decisions by learning from our past experiences, old-fashioned clinical observation. Our best tool to implement this collective knowledge is the use of clinical guidelines, the compendium of current best evidence mixed with opinion (class of recommendation and level of evidence). The future potential is to evolve and accelerate beyond this model by generating evidence using the “big data” that is becoming available. And this task is shared between humans and computers; there is a continuum between human and machine interactions to build predictive models.3
The main pitfall is to take the generality of our findings for granted and assume that past experience predicts events in other cohorts and future patients. This can only be verified with external validation, a new cohort of patients, ideally from different clinical centers and/or geographic regions with a different mix of patients. Unfortunately, most studies do not include this critical step.4,5
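As a toy illustration of why this step matters (not part of the original article; all cohorts and variable names are synthetic assumptions, using scikit-learn), the sketch below develops a model in one simulated cohort where a shortcut feature happens to track the outcome. The model looks excellent on an internal test split of the same cohort, and only the external cohort, where the shortcut is absent, reveals the inflated estimate.

```python
# Minimal sketch with synthetic data (hypothetical cohorts, not real patients).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_cohort(n, shortcut=True):
    """Simulate a cohort: 20 candidate predictors, outcome driven by the first two."""
    X = rng.normal(size=(n, 20))
    signal = 0.8 * X[:, 0] - 0.6 * X[:, 1]
    y = (signal + rng.normal(size=n) > 0).astype(int)
    if shortcut:
        # In the development centre, feature 2 happens to track the outcome
        # (eg, a protocol- or scanner-related artefact); in other centres it does not.
        X[:, 2] = y + rng.normal(scale=0.5, size=n)
    return X, y

X_dev, y_dev = make_cohort(600, shortcut=True)    # development cohort
X_ext, y_ext = make_cohort(600, shortcut=False)   # external cohort, different centre

X_tr, X_te, y_tr, y_te = train_test_split(X_dev, y_dev, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("Internal test AUC  :", round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 2))
print("External cohort AUC:", round(roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]), 2))
```

Internal validation cannot detect a cohort-wide shortcut of this kind; only the external cohort exposes it.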
Methodologically, the risk is that of “garbage in, garbage out” (GIGO): the validity of our findings strongly depends on the quality of the data we learn from. Confounding biases will lead to surprising associations between health scores and risk factors that were actually driven by a lurking variable. Selection biases may lead to false conclusions and to ethical risks of models that create or exacerbate existing racial or societal biases in health care systems.6
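A hedged numerical sketch of such a lurking variable (hypothetical variable names and coefficients, using statsmodels): the outcome is driven by age alone, a biomarker merely rises with age, and the crude association between biomarker and outcome vanishes once age is adjusted for.

```python
# Sketch only: a lurking variable (here labelled "age") drives both a
# biomarker and the outcome, creating a spurious crude association.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000

age = rng.normal(60, 10, n)                      # the confounder
biomarker = 0.05 * age + rng.normal(0, 1, n)     # biomarker rises with age
risk = 1 / (1 + np.exp(-0.08 * (age - 60)))      # outcome depends on age only
event = rng.binomial(1, risk)

# Crude model: the biomarker looks "predictive" of the event...
crude = sm.Logit(event, sm.add_constant(biomarker)).fit(disp=0)
# ...but adjusting for the confounder removes the association.
adjusted = sm.Logit(event, sm.add_constant(np.column_stack([biomarker, age]))).fit(disp=0)

print("Crude biomarker coefficient   : %.2f (p=%.3f)" % (crude.params[1], crude.pvalues[1]))
print("Adjusted biomarker coefficient: %.2f (p=%.3f)" % (adjusted.params[1], adjusted.pvalues[1]))
```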
Beyond these traditional statistical risks, the numerous comparisons and searches within the big data exacerbate other potential issues. We have many more chances of finding spurious correlations, such as the high-school basketball searches to predict seasonal flu burden.2 And we suffer from the “dimensionality curse”: the more variables you combine, the higher the chances of a spurious positive finding (eg, a false positive of a dimension in which healthy and diseased subjects appear to differ).
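A short simulation (not from the article) makes the point concrete: with 1000 purely random candidate variables and two groups that do not differ at all, dozens of variables still cross the conventional p < 0.05 threshold, while a multiplicity correction removes essentially all of them.

```python
# Sketch only: with enough candidate variables, pure noise yields
# "significant" group differences at the conventional 0.05 threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_per_group, n_features = 50, 1000

healthy = rng.normal(size=(n_per_group, n_features))
diseased = rng.normal(size=(n_per_group, n_features))   # no true difference at all

_, p_values = stats.ttest_ind(healthy, diseased, axis=0)

print("Noise features with p < 0.05:", int(np.sum(p_values < 0.05)))       # roughly 50
print("After Bonferroni correction :", int(np.sum(p_values < 0.05 / n_features)))
```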
Big data is also characterized by three more features7: its heterogeneity (where inferring the underlying mixture model is a challenge), noise accumulation (so selecting features is better than trying to include them all), and incidental endogeneity (which makes variable selection particularly challenging). The reader is referred to previous works8 for their detailed description.
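As a rough illustration of noise accumulation, and of why feature selection helps, the sketch below (synthetic data; a simple univariate screening step stands in for the independence-screening ideas discussed in ref 8) compares a classifier fed all 500 candidate variables with one that first screens down to the 10 strongest univariate candidates.

```python
# Sketch only: noise accumulation. A classifier using a few informative
# variables degrades as uninformative ones are added, unless a screening
# step filters them out first.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=500, n_informative=5,
                           n_redundant=0, random_state=3)

all_features = LogisticRegression(max_iter=2000)
screened = make_pipeline(SelectKBest(f_classif, k=10),
                         LogisticRegression(max_iter=2000))

print("All 500 features, CV accuracy:", round(cross_val_score(all_features, X, y, cv=5).mean(), 3))
print("Screened to 10, CV accuracy  :", round(cross_val_score(screened, X, y, cv=5).mean(), 3))
```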
Recommendations to avoid the pitfalls
The best attitude when reading “big data” studies is to be cautiously skeptical: a positive finding of predictive value obtained while working with thousands of variables is at best a first step in the right direction. The immediate next step is to prepare the external validation tests.4,5
When conducting research in this area, the obvious recommendation is to apply adequate techniques. Selection bias can be overcome by weighting. The high-dimensionality curse and its dire consequences can be addressed by dimensionality-reduction techniques, where principal component analysis is the most common approach. And there are specific solutions for each of the challenges of big data: penalized quasi-likelihood, the sparsest solution in a high-confidence set, or independence screening, among others (see ref 8 for further details).
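A minimal sketch of two of these counter-measures on synthetic data (illustrative only; the parameter choices are assumptions): principal component analysis to reduce dimensionality before fitting, and an L1 (lasso-type) penalty as an embedded variable-selection step. Inverse-probability weights to counter selection bias would enter, for example, through the sample_weight argument of the estimator's fit method.

```python
# Sketch only: dimensionality reduction and penalization applied to
# synthetic high-dimensional data; names and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=200, n_informative=8,
                           random_state=4)

# (1) Dimensionality reduction: project onto a few principal components.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=10),
                          LogisticRegression(max_iter=2000))

# (2) Penalization: an L1 penalty shrinks most coefficients to zero,
#     acting as an embedded variable-selection step.
lasso_model = make_pipeline(StandardScaler(),
                            LogisticRegression(penalty="l1", solver="liblinear", C=0.5))

for name, model in [("PCA + logistic", pca_model), ("L1-penalized logistic", lasso_model)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```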
The main challenge of induction is the generality of the findings, which is especially hard to establish given the difficulty of obtaining stable and comparable measurements across cases and over time. The subtle differences in the appearance of an echocardiographic or magnetic resonance image across manufacturers are a well-known bottleneck in the imaging community. The homogenization of techniques and protocols is indeed one of the main strategies to alleviate this.5
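Purely as a hedged illustration of what homogenization can mean in practice (a simplistic stand-in for proper protocol standardization or dedicated harmonization methods such as ComBat; all column names and numbers below are invented), z-scoring a measurement within each centre before pooling removes a systematic vendor offset in this toy pandas example.

```python
# Sketch only: crude per-centre standardisation of a measurement with a
# systematic vendor/scanner offset, before pooling data across centres.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "centre": np.repeat(["A", "B"], 100),
    # Centre B's scanner reads systematically higher for the same quantity.
    "measurement": np.concatenate([rng.normal(50, 5, 100), rng.normal(58, 5, 100)]),
})

df["harmonised"] = df.groupby("centre")["measurement"].transform(
    lambda x: (x - x.mean()) / x.std())

print(df.groupby("centre")[["measurement", "harmonised"]].mean().round(2))
```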
As a community, the strongest recommendation in order to accelerate the generality of findings, and to eventually make an impact on patients, is the promotion of a culture of transparency2 and open research. The effort to recruit and follow up a cohort of patients is huge, as is the development of the information infrastructure to allow access to the electronic health records of a large population. There are indeed ethical and societal barriers to releasing these data for research purposes, but we must learn to give adequate value and credit to these contributions so that clinicians and researchers do not feel they are losing a competitive advantage.
The most difficult decision is when to include findings in clinical guidelines. The minimum requirement is that the positive result has been subjected to external validation. Even here one can critically challenge the generality of findings, and the recommendation is to take a practical, skeptical approach: adopt while monitoring real-world results.
The challenge of generality will only be addressed by the formulation of the mechanistic hypothesis that offers a plausible explanation of the findings, closing the induction step of the scientific method. In this context, computers can also be used to enhance our deductive reasoning skills: they can make predictions based on mechanistic simulations of our cardiovascular system.9 The opportunity is thus to exploit the synergy between mechanistic and statistical computational models that is the core of the vision of the digital twin10 and mimics the way clinicians have worked for millennia.
Disclosure/Acknowledgments: Pablo Lamata was commissioned to write this article by, and has received honoraria from, Servier. Pablo Lamata sits on the advisory board of Ultromics Ltd and Cardesium Inc. Support from the Wellcome Trust Senior Research Fellowship (209450/Z/17/Z) is acknowledged. No conflict of interest is reported.
REFERENCES
1. Frankovich J, Longhurst CA, Sutherland SM. Evidence-based medicine in the EMR era. N Engl J Med. 2011;365(19):1758-1759. doi:10.1056/NEJMp1108726.
2. Lazer D, Kennedy R, King G, Vespignani A. Big data. The parable of Google Flu: traps in big data analysis. Science. 2014;343(6176):1203-1205. doi:10.1126/science.1248506.
3. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018;319(13):1317. doi:10.1001/jama.2017.18391.
4. Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1(6):e271-e297. doi:10.1016/S2589-7500(19)30123-2.
5. Dey D, Slomka PJ, Leeson P, et al. Artificial intelligence in cardiovascular imaging: JACC state-of-the-art review. J Am Coll Cardiol. 2019;73(11):1317-1335. doi:10.1016/j.jacc.2018.12.054.
6. Nordling L. A fairer way forward for AI in health care. Nature. 2019;573(7775):S103-S105. doi:10.1038/d41586-019-02872-2.
7. Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manage. 2015;35(2):137-144. doi:10.1016/j.ijinfomgt.2014.10.007.
8. Fan J, Han F, Liu H. Challenges of big data analysis. Natl Sci Rev. 2014;1(2):293-314. doi:10.1093/nsr/nwt032.
9. Niederer SA, Lumens J, Trayanova NA. Computational models in cardiology. Nat Rev Cardiol. 2019;16(2):100-111. doi:10.1038/s41569-018-0104-y.
10. Corral-Acero J, Margara F, Marciniak M, et al. The “Digital Twin” to enable the vision of precision cardiology. Eur Heart J. March 4, 2020. Epub ahead of print. doi:10.1093/eurheartj/ehaa159.