Heart Metab. 2020;82:33-35
Refresher Corner

Avoiding big data pitfalls

Pablo Lamata, PhD
Department of Biomedical Engineering, King's College London, UK
Correspondence: Pablo Lamata, PhD, Dept of Biomedical Engineering – 5th floor Becket House, 1 Lambeth Palace Road, London SE1 7EU, UK
E-mail: Pablo.Lamata@kcl.ac.uk

Abstract: Clinical decisions are based on a combination of inductive inference built on experience (ie, statistical models) and deductions provided by our understanding of the workings of the cardiovascular system (ie, mechanistic models). In a similar way, computers can be used to discover new hidden patterns in the (big) data and to make predictions based on our knowledge of physiology or physics. Surprisingly, unlike humans throughout history, computers seldom combine inductive and deductive processes. An explosion of expectations surrounds the computer's inductive method, fueled by "big data" and popular trends. This article reviews the risks and potential pitfalls of this computational approach, where lack of generality, selection or confounding biases, overfitting, and spurious correlations are among the commonplace flaws. Recommendations to reduce these risks include an examination of data through the lens of causality, the careful choice and description of statistical techniques, and an open research culture with transparency. Finally, the synergy between mechanistic and statistical models (ie, the digital twin) is discussed as a promising pathway toward precision cardiology that mimics the human experience.

Keywords: artificial intelligence; big data; digital twin

"Big data" is the computer-enhanced version of inductive reasoning

There is an exciting future for medicine where decisions are informed by precise patient-specific data and risk models. Exploiting the "big data" in health care is one of the main engines working toward this future. We have learned from promising case studies, for example, the importance of access to the right source of evidence to select the right therapy for a pediatric lupus patient,1 but also from epic failures, such as the unmet expectation to predict seasonal flu from Internet searches.2

It is useful to conceptualize "big data" as an evolution of traditional statistical methods, now able to harness the value of much larger and more heterogeneous sources of information. As such, its ability to learn new biomarkers and predictors will always be subject to the same fundamental limitations of inductive reasoning: can we generalize the findings, and are they really true? "Big data" is simply the computer-enhanced version of human inductive reasoning. The scientific method teaches us that the interplay between inductive and deductive reasoning is the way forward. We need to build a hypothesis from the observations and then advance to make predictions and to run more experiments to verify them.

In this context, this article reviews the risks and potential pitfalls of our computer-enhanced ability to reveal the hidden patterns in the data that predict cardiovascular outcomes. It then sets out some recommendations to minimize the risks of spurious patterns: here the main message is the need to examine the revelations through the lens of causality; do they make sense? We need to interpret the experimental findings and see if they fit our framework of a plausible mechanistic explanation.
How to learn from data and how to fail in that endeavor

Our goal is to improve our future clinical decisions by learning from our past experiences: old-fashioned clinical observation. Our best tool to implement this collective knowledge is the use of clinical guidelines, the compendium of current best evidence mixed with opinion (class of evidence and level of recommendation). The future potential is to evolve and accelerate beyond this model by generating evidence using the "big data" that is becoming available. And this task is shared between humans and computers; there is a continuum between human and machine interactions to build predictive models.3
The main pitfall is to take the generality of our findings for granted, assuming that past experience predicts events in other cohorts and future patients. This can only be verified with external validation: testing in a new cohort of patients, ideally from different clinical centers and/or geographic regions with a different mix of patients. Unfortunately, most studies do not include this critical step.4,5
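The value of external validation can be illustrated with a minimal numerical sketch (Python with numpy; all data are synthetic and all names invented for the example): a "biomarker" selected for its apparent predictive power in one cohort of pure noise fails to replicate in an independent cohort.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_features = 50, 1000

def cohort():
    """A cohort of purely random 'biomarkers' and labels: no true signal exists."""
    X = rng.standard_normal((n_subjects, n_features))
    y = rng.integers(0, 2, n_subjects)  # 0 = healthy, 1 = diseased
    return X, y

def corr_with_label(X, y):
    """Pearson correlation of every feature with the outcome label."""
    Xc = (X - X.mean(0)) / X.std(0)
    yc = (y - y.mean()) / y.std()
    return Xc.T @ yc / len(y)

# "Discovery": pick the feature most associated with outcome in cohort A.
X_a, y_a = cohort()
r_a = corr_with_label(X_a, y_a)
best = int(np.argmax(np.abs(r_a)))

# External validation: re-test that same feature in an independent cohort B.
X_b, y_b = cohort()
r_b = corr_with_label(X_b, y_b)
print(f"feature #{best}: |r| = {abs(r_a[best]):.2f} in cohort A, "
      f"|r| = {abs(r_b[best]):.2f} in cohort B")
```

The winning feature looks convincingly "predictive" in cohort A yet shows only a chance-level association in cohort B, which is exactly what external validation is designed to expose.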
Methodologically, the risk is that of "garbage in, garbage out" (GIGO): the validity of our findings depends strongly on the quality of the data we learn from. Confounding biases will lead to surprising associations between health scores and risk factors that are actually driven by a lurking variable. Selection biases may lead to false conclusions, and to the ethical risk of models that create or exacerbate existing racial or societal biases in health care systems.6
Beyond these traditional statistical risks, the numerous comparisons and searches within big data exacerbate other potential issues. We have many more chances of finding spurious correlations, such as the high-school basketball searches used to predict seasonal flu burden.2 And we suffer from the "curse of dimensionality": the more variables you combine, the higher the chance of a spurious positive finding (eg, a false-positive dimension in which healthy and diseased subjects appear to differ).
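The multiplicity problem is easy to reproduce with a minimal sketch (numpy only; all data synthetic): screening a thousand purely random "biomarkers" against a random healthy/diseased label yields dozens of nominally "significant" associations by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_features = 50, 1000

# Purely random "biomarkers" and a random healthy/diseased label:
# by construction there is no real signal anywhere in this data set.
X = rng.standard_normal((n_subjects, n_features))
y = rng.integers(0, 2, n_subjects)

# Pearson correlation of each candidate biomarker with the label.
Xc = (X - X.mean(0)) / X.std(0)
yc = (y - y.mean()) / y.std()
r = Xc.T @ yc / n_subjects

# |r| > 0.28 corresponds roughly to p < 0.05 (two-sided) for n = 50,
# so about 5% of the 1000 pure-noise features will cross it by chance.
n_hits = int(np.sum(np.abs(r) > 0.28))
print(f"'significant' biomarkers found in pure noise: {n_hits} of {n_features}")
```

Without a multiple-comparisons correction or an external validation cohort, each of these chance hits could be reported as a discovery.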
Big data is also characterized by three more features7: heterogeneity (where inferring the mixture model is a challenge), noise accumulation (so selecting features is better than trying to include them all), and incidental endogeneity (which makes variable selection quite challenging). The reader is referred to previous work8 for a detailed description.
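Noise accumulation can be sketched with a hypothetical nearest-centroid classifier (numpy only; all data synthetic, and the parameter values are arbitrary choices for the illustration): with ten genuinely informative features the classifier performs well, but piling on thousands of pure-noise features drags its accuracy toward chance.

```python
import numpy as np

rng = np.random.default_rng(3)

def accuracy(n_noise, n=200, n_signal=10, delta=0.6):
    """Nearest-centroid accuracy with 10 informative + n_noise pure-noise features."""
    p = n_signal + n_noise
    y = rng.integers(0, 2, n)
    mu = np.zeros((2, p))
    mu[1, :n_signal] = delta  # classes differ only in the first 10 features
    X = mu[y] + rng.standard_normal((n, p))
    # Split half/half into train and test sets.
    Xtr, ytr, Xte, yte = X[:n // 2], y[:n // 2], X[n // 2:], y[n // 2:]
    centroids = np.stack([Xtr[ytr == k].mean(0) for k in (0, 1)])
    # Assign each test subject to the nearest class centroid.
    pred = np.argmin(((Xte[:, None] - centroids) ** 2).sum(-1), axis=1)
    return (pred == yte).mean()

acc_clean = accuracy(0)
acc_noisy = accuracy(5000)
print(f"10 informative features only: accuracy = {acc_clean:.2f}")
print(f"+ 5000 noise features:        accuracy = {acc_noisy:.2f}")
```

The signal has not changed between the two runs; only the accumulated noise in the estimated centroids has, which is why feature selection beats including everything.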
Recommendations to avoid the pitfalls

The best attitude when reading "big data" studies is cautious skepticism: a positive finding of predictive value obtained from thousands of variables is at best a first step in the right direction. The immediate next step is the preparation of external validation tests.4,5
When conducting research in this area, the obvious recommendation is to apply adequate techniques. Selection bias can be mitigated by weighting. The curse of dimensionality and its dire consequences can be addressed by dimensionality-reduction techniques, of which principal component analysis is the most common. And there are specific solutions for each of the challenges of big data: penalized quasi-likelihood, the sparsest solution in a high-confidence set, or independence screening, among others (see ref 8 for further details).
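Principal component analysis, the most common of these dimensionality-reduction techniques, can be sketched in a few lines via the singular value decomposition (numpy only; the synthetic data set, with its two latent factors, is invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 500  # many more variables than subjects

# Synthetic data whose real variation lives in just 2 latent factors.
latent = rng.standard_normal((n, 2))
loadings = rng.standard_normal((2, p))
X = latent @ loadings + 0.1 * rng.standard_normal((n, p))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(f"variance explained by first 2 components: {explained[:2].sum():.1%}")

# The 500 noisy variables collapse to a 2-dimensional representation,
# which is a far safer input for any downstream predictive model.
scores = Xc @ Vt[:2].T
```

Two components recover essentially all of the structure, so a model fitted on `scores` faces a 2-dimensional problem rather than a 500-dimensional one.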
The main challenge of induction is the generality of the findings, which is especially hard given the difficulty of obtaining stable and comparable measurements across cases and time. The subtle differences in the appearance of an echocardiographic or magnetic resonance image across manufacturers are a well-known bottleneck in the imaging community. The homogenization of techniques and protocols is indeed one of the main strategies to alleviate this.5
As a community, the strongest recommendation to accelerate the generality of findings, and to eventually make an impact on patients, is the promotion of a culture of transparency2 and open research. The effort to recruit and follow up a cohort of patients is huge, as is the development of the information infrastructure to allow access to the electronic health records of a large population. There are indeed ethical and societal barriers to releasing these data for research purposes, but we must learn to give adequate value and credit to these contributions so that clinicians and researchers do not feel they are losing a competitive advantage.
The most difficult decision is when to include findings in clinical guidelines. The minimum requirement is to have the positive result subjected to external validation. Even then one can critically challenge the generality of the findings, and the recommendation is to take a pragmatically sceptical approach: adopt while monitoring real-world results.
The challenge of generality will only be addressed by the formulation of a mechanistic hypothesis that offers a plausible explanation of the findings, closing the induction step of the scientific method. In this context, computers can also be used to enhance our deductive reasoning skills: they can make predictions based on mechanistic simulations of our cardiovascular system.9 The opportunity is thus to exploit the synergy between mechanistic and statistical computational models, which is the core of the vision of the digital twin10 and mimics the way clinicians have worked for millennia.
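As a flavor of what such mechanistic simulation means in practice, here is a toy two-element Windkessel model of arterial pressure (numpy only). This is a deliberately simplistic sketch, not a validated clinical model: the compliance, resistance, heart rate, and stroke volume values are illustrative assumptions, and the inflow waveform is an idealized half-sine.

```python
import numpy as np

# Toy two-element Windkessel model of the systemic circulation:
#   C * dP/dt = Q_in(t) - P / R
C = 1.5   # mL/mmHg, arterial compliance (assumed value)
R = 1.0   # mmHg*s/mL, peripheral resistance (assumed value)
dt, t_end = 1e-3, 10.0
t = np.arange(0.0, t_end, dt)

def q_in(time, hr=75, sv=70):
    """Pulsatile inflow: half-sine ejection during systole (idealized)."""
    period = 60.0 / hr
    phase = time % period
    systole = 0.35 * period
    return np.where(phase < systole,
                    np.pi * sv / (2 * systole) * np.sin(np.pi * phase / systole),
                    0.0)

# Forward-Euler integration of arterial pressure from an initial guess.
P = np.empty_like(t)
P[0] = 80.0
for i in range(1, len(t)):
    dPdt = (q_in(t[i - 1]) - P[i - 1] / R) / C
    P[i] = P[i - 1] + dt * dPdt

# Report the simulated pressure over the last cardiac cycle.
last = t > t_end - 60.0 / 75
print(f"diastolic ~ {P[last].min():.0f} mmHg, systolic ~ {P[last].max():.0f} mmHg")
```

Where a statistical model would learn pressure patterns from data, this deductive model predicts them from assumed physiology; the digital twin vision is precisely the combination of the two.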
Disclosure/Acknowledgments: Pablo Lamata was commissioned to write this article by, and has received honoraria from, Servier. Pablo Lamata sits on the advisory board of Ultromics Ltd and Cardesium Inc. Support from the Wellcome Trust Senior Research Fellowship (209450/Z/17/Z) is acknowledged. No conflict of interest is reported.
REFERENCES
1. Frankovich J, Longhurst CA, Sutherland SM. Evidence-based medicine in the EMR era. N Engl J Med. 2011;365(19):1758-1759. doi:10.1056/NEJMp1108726.
2. Lazer D, Kennedy R, King G, Vespignani A. Big data. The parable of Google Flu: traps in big data analysis. Science. 2014;343(6176):1203-1205. doi:10.1126/science.1248506.
3. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018;319(13):1317. doi:10.1001/jama.2017.18391.
4. Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1(6):e271-e297. doi:10.1016/S2589-7500(19)30123-2.
5. Dey D, Slomka PJ, Leeson P, et al. Artificial intelligence in cardiovascular imaging: JACC state-of-the-art review. J Am Coll Cardiol. 2019;73(11):1317-1335. doi:10.1016/j.jacc.2018.12.054.
6. Nordling L. A fairer way forward for AI in health care. Nature. 2019;573(7775):S103-S105. doi:10.1038/d41586-019-02872-2.
7. Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manage. 2015;35(2):137-144. doi:10.1016/j.ijinfomgt.2014.10.007.
8. Fan J, Han F, Liu H. Challenges of big data analysis. Natl Sci Rev. 2014;1(2):293-314. doi:10.1093/nsr/nwt032.
9. Niederer SA, Lumens J, Trayanova NA. Computational models in cardiology. Nat Rev Cardiol. 2019;16(2):100-111. doi:10.1038/s41569-018-0104-y.
10. Corral-Acero J, Margara F, Marciniak M, et al. The "Digital Twin" to enable the vision of precision cardiology. Eur Heart J. March 4, 2020. Epub ahead of print. doi:10.1093/eurheartj/ehaa159.