Article

The Cardiovascular Phenotype of Chronic Obstructive Pulmonary Disease (COPD): Applying Machine Learning to the Prediction of Cardiovascular Comorbidities

Authors:
  • The Organizational Neuroscience Laboratory | University of Surrey | Warwick University

Abstract

Background Chronic Obstructive Pulmonary Disease (COPD) is a heterogeneous group of lung conditions that are challenging to diagnose and treat. As the presence of comorbidities often exacerbates this scenario, the characterization of patients with COPD and cardiovascular comorbidities may allow early intervention and improve disease management and care. Methods We analysed a 4-year observational cohort of 6,883 UK patients who were ultimately diagnosed with COPD and at least one cardiovascular comorbidity. The cohort was extracted from the UK Royal College of General Practitioners Research and Surveillance Centre database. The COPD phenotypes were identified prior to diagnosis and their reproducibility was assessed following COPD diagnosis. We then developed four classifiers for predicting cardiovascular comorbidities. Results Three subtypes of the COPD cardiovascular phenotype were identified prior to diagnosis. Phenotype A was characterised by a higher prevalence of severe COPD, emphysema, and hypertension. Phenotype B was characterised by a larger male majority, a lower prevalence of hypertension, and the highest prevalence of diabetes and of the other cardiovascular comorbidities. Finally, phenotype C was characterised by universal hypertension, a higher prevalence of mild COPD and a low prevalence of COPD exacerbations. These phenotypes were reproduced after diagnosis with 92% accuracy. The random forest model was highly accurate for predicting hypertension while ruling out less prevalent comorbidities. Conclusions This study identified three subtypes of the COPD cardiovascular phenotype that may generalize to other populations. Among the four models tested, the random forest classifier was the most accurate at predicting cardiovascular comorbidities in COPD patients with the cardiovascular phenotype.
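The random forest used for comorbidity prediction is, at its core, a bootstrap ensemble of decision trees that vote on the outcome. The study's actual features and model are not available here, so the sketch below is a minimal, illustrative pure-Python version using depth-one trees (stumps); the inputs (age and FEV1 % predicted as predictors of hypertension) are hypothetical toy values, not the cohort data:

```python
import random

def fit_stump(X, y):
    """Pick the (feature, threshold) split with the fewest misclassifications."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [yi for xi, yi in zip(X, y) if xi[f] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[f] > t]
            if not left or not right:
                continue
            lp = round(sum(left) / len(left))    # majority label on each side
            rp = round(sum(right) / len(right))
            err = sum((lp if xi[f] <= t else rp) != yi for xi, yi in zip(X, y))
            if best is None or err < best[0]:
                best = (err, f, t, lp, rp)
    if best is None:                             # degenerate sample: constant predictor
        maj = round(sum(y) / len(y))
        return lambda x: maj
    _, f, t, lp, rp = best
    return lambda x: lp if x[f] <= t else rp

def random_forest(X, y, n_trees=25, seed=0):
    """Fit each stump on a bootstrap resample; predict by majority vote."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample WITH replacement
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda x: round(sum(s(x) for s in stumps) / len(stumps))

# Hypothetical toy cohort: [age, FEV1 % predicted] -> hypertension (1) or not (0)
X = [[48, 90], [52, 85], [55, 80], [62, 50], [70, 45], [75, 40]]
y = [0, 0, 0, 1, 1, 1]
predict = random_forest(X, y)
```

A production model would use a library implementation with deeper trees and per-split feature subsampling, which this sketch omits for brevity.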


... Phenotype C received less treatment, with only one third of patients treated with LAMA. 41 Therefore, these data confirm that the greatest burden of cardiometabolic comorbidities is observed mainly in patients with moderate COPD and is a risk factor for poor prognosis. 37,41 ...
... 2,3 Thus, in silico models are a great opportunity to identify biomarkers and to predict disease progression and treatment outcome. [4][5][6] While machine learning models can combine multiscale data from large cohorts and might identify pathological phenotypes, [7][8][9] biomechanical modelling provides a rich tool to explore hypotheses obtained from statistical analysis and gain understanding about underlying mechanisms. ...
... The least squares optimization approach presented in the original work was applied with the PSC. 32 It was optimized for one ex vivo heart, for both resolution settings, at 14 sampling points within the parameter space given by SNR [10,30] and the number of short-axis input slices [3,9]. To reduce the computational cost, a 2D second-order polynomial was fit through the samples, for each spatial direction, providing the diagonal components of H for all imaging parameter settings. ...
Article
Full-text available
Cardiac electrophysiology and cardiac mechanics both depend on the average cardiomyocyte long-axis orientation. In the realm of personalized medicine, knowledge of the patient-specific changes in cardiac microstructure plays a crucial role. Patient-specific computational modelling has emerged as a tool to better understand disease progression. In vivo cardiac diffusion tensor imaging (cDTI) is a vital tool to non-destructively measure the average cardiomyocyte long-axis orientation in the heart. However, cDTI suffers from long scan times, rendering volumetric, high-resolution acquisitions challenging. Consequently, interpolation techniques are needed to populate bio-mechanical models with patient-specific average cardiomyocyte long-axis orientations. In this work, we compare five interpolation techniques applied to in vivo and ex vivo porcine input data. We compare two tensor interpolation approaches, one rule-based approximation, and two data-driven, low-rank models. We demonstrate the advantage of tensor interpolation techniques, which result in lower interpolation errors than low-rank models and rule-based methods adapted to cDTI data. In an ex vivo comparison, we study the influence of three imaging parameters that can be traded off against acquisition time: in-plane resolution, signal-to-noise ratio, and number of acquired short-axis imaging slices.
... The RF model, by integrating multiple decision trees, effectively reduced the risk of overfitting, demonstrating greater robustness [29]. RF models have shown advantages in predicting survival rates in chronic obstructive pulmonary disease (COPD) patients and assessing cardiovascular disease risks, showcasing their strength in processing multidimensional clinical data [35]. Moreover, the model's effectiveness in managing nonlinear relationships and noisy data further supports its applicability and reliability in gait analysis [36]. ...
Article
Full-text available
Background: Three-dimensional gait analysis, supported by advanced sensor systems, is a crucial component in the rehabilitation assessment of post-stroke hemiplegic patients. However, the sensor data generated from such analyses are often complex and challenging to interpret in clinical practice, requiring significant time and complicated procedures. The Gait Deviation Index (GDI) serves as a simplified metric for quantifying the severity of pathological gait. Although isokinetic dynamometry, utilizing sophisticated sensors, is widely employed in muscle function assessment and rehabilitation, its application in gait analysis remains underexplored. Objective: This study aims to investigate the use of sensor-acquired isokinetic muscle strength data, combined with machine learning techniques, to predict the GDI in hemiplegic patients. This study utilizes data captured from sensors embedded in the Biodex dynamometry system and the Vicon 3D motion capture system, highlighting the integration of sensor technology in clinical gait analysis. Methods: This study was a cross-sectional, observational study that included a cohort of 150 post-stroke hemiplegic patients. The sensor data included measurements such as peak torque, peak torque/body weight, maximum work of repeated actions, coefficient of variation, average power, total work, acceleration time, deceleration time, range of motion, and average peak torque for both flexor and extensor muscles on the affected side at three angular velocities (60°/s, 90°/s, and 120°/s) using the Biodex System 4 Pro. The GDI was calculated using data from a Vicon 3D motion capture system. This study employed four machine learning models—Lasso Regression, Random Forest (RF), Support Vector Regression (SVR), and BP Neural Network—to model and validate the sensor data. Model performance was evaluated using mean squared error (MSE), the coefficient of determination (R2), and mean absolute error (MAE).
SHapley Additive exPlanations (SHAP) analysis was used to enhance model interpretability. Results: The RF model outperformed others in predicting GDI, with an MSE of 16.18, an R2 of 0.89, and an MAE of 2.99. In contrast, the Lasso Regression model yielded an MSE of 22.29, an R2 of 0.85, and an MAE of 3.71. The SVR model had an MSE of 31.58, an R2 of 0.82, and an MAE of 7.68, while the BP Neural Network model exhibited the poorest performance with an MSE of 50.38, an R2 of 0.79, and an MAE of 9.59. SHAP analysis identified the maximum work of repeated actions of the extensor muscles at 60°/s and 120°/s as the most critical sensor-derived features for predicting GDI, underscoring the importance of muscle strength metrics at varying speeds in rehabilitation assessments. Conclusions: This study highlights the potential of integrating advanced sensor technology with machine learning techniques in the analysis of complex clinical data. The developed GDI prediction model, based on sensor-acquired isokinetic dynamometry data, offers a novel, streamlined, and effective tool for assessing rehabilitation progress in post-stroke hemiplegic patients, with promising implications for broader clinical application.
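The three evaluation metrics reported above are straightforward to compute from the residuals. The sketch below is a minimal pure-Python illustration (the GDI values shown are made-up toy numbers, not the study's data):

```python
def regression_metrics(y_true, y_pred):
    """Mean squared error, coefficient of determination (R2) and mean absolute error."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r ** 2 for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (sum(r ** 2 for r in residuals) / ss_tot)  # 1 - SS_res / SS_tot
    return mse, r2, mae

# Illustrative GDI-style targets and model predictions
gdi_true = [70.0, 82.5, 90.0, 65.0]
gdi_pred = [72.0, 80.0, 91.0, 68.0]
mse, r2, mae = regression_metrics(gdi_true, gdi_pred)
```

Note that MSE and MAE are in the units of the target (squared units for MSE), while R2 is dimensionless, which is why all three are reported together when comparing models.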
... The ability of these algorithms to capture intricate interrelations and patterns within clinical data ensures a trustworthy risk prognosis. In the context of mutation analysis, it is imperative that the decisions made by machine learning models be easily understandable [22]. ...
Article
Full-text available
Acute lymphoblastic leukemia (ALL), a pervasive and lethal malignancy, subjects numerous pediatric patients globally to terminal conditions. It is a rapidly progressive disease that exposes patients to complications such as Tumor Lysis Syndrome, which often occurs early after induction chemotherapy; contemporary research focuses primarily on techniques for the early diagnosis of ALL, leaving a gap in the literature. This study examines the application of machine learning techniques for predicting the mutation rate of cancer cells in pediatric patients with ALL, using clinical data from patients who have undergone testing with Next Generation Sequencing (NGS) technology. An overview of the clinical data utilized is provided, with a comprehensive workflow encompassing data analysis, dimensionality reduction, the classification and regression tree (CART) algorithm, and neural networks. The results demonstrate the efficiency with which these methods target and decipher cancer cell proliferation in pediatric patients suffering from ALL. Valuable insights into the relationships between key factors and conversion rates were also derived through data mining. The tree-based classification and regression algorithms and neural networks used herein indicate the flexibility and power of machine learning models in accurately predicting the recurrence of cancer cells. The results affirm previous findings, providing clinical evidence of mutational drivers among pediatric ALL patients and adding value through an applicable utility in medical practice. Principally, this study denotes a substantial advancement in leveraging machine learning workflows for the mutation rate analysis of cancer cells. By appraising clinical corroboration, emphasizing explainability and interpretability, and building upon these findings, future research can contribute to improving patient care and outcomes in the field of leukaemia.
... [13] [14] [15], hypertension, e.g. [16] [17], anxiety e.g. [18] [19], and heart disease, e.g. ...
Article
Full-text available
This work leveraged predictive modeling techniques in machine learning (ML) to predict heart disease using a dataset sourced from the Center for Disease Control and Prevention in the US. The dataset was preprocessed and used to train five machine learning models: random forest, support vector machine, logistic regression, extreme gradient boosting and light gradient boosting. The goal was to use the best performing model to develop a web application capable of reliably predicting heart disease based on user-provided data. The extreme gradient boosting classifier provided the most reliable results with precision, recall and F1-score of 97%, 72%, and 83% respectively for Class 0 (no heart disease) and 21% (precision), 81% (recall) and 34% (F1-score) for Class 1 (heart disease). The model was further deployed as a web application.
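The per-class precision, recall, and F1 figures quoted above come directly from confusion-matrix counts. A small self-contained sketch (the labels below are made up for illustration, not drawn from the CDC dataset):

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class metrics from true-positive, false-positive and false-negative counts."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative labels: 1 = heart disease, 0 = no heart disease
labels = [1, 1, 0, 0, 0, 1]
preds  = [1, 0, 0, 1, 0, 1]
p1, r1, f1 = precision_recall_f1(labels, preds, positive=1)
```

As the abstract's Class 1 figures illustrate, high recall can coexist with low precision on an imbalanced outcome, which is why all three numbers are reported separately for each class.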
... Devising a generalized approach that can target all different ethnicities is itself a major challenge that demands an in-depth analysis of data representativeness, proper risk-factor utilization, feature engineering, and validation of techniques across different groups. Machine learning techniques can inadvertently inherit biases present in the evaluation data, leading to over-fitted or under-fitted models and biased results [27]. This challenge is critical and can be addressed by highlighting the major bias variables in the data, including gender, age, race and socioeconomic factors, to ensure unbiased and fair prediction in CVD risk prediction and prevention systems. ...
Research
Full-text available
Over the last 10 years, a significant surge in cardiovascular diseases has been observed around the world. The cruciality of the disease has prompted rapid action toward the development of accurate cardiovascular disease risk prediction. However, existing methods often fail to predict cardiovascular disease risk in patients who could have benefited from preventive treatment, while in other cases patients undergo dispensable intervention. Machine learning techniques for cardiovascular disease prediction not only detect disease risk with high accuracy and precision but also exploit complex interactions for better disease prognosis. Effective and timely prediction of cardiovascular disease using patients' health records not only assists in rapid diagnosis but also reduces the mortality rate. In this article, we present a detailed comparative analysis of existing machine-learning techniques for cardiovascular disease prediction and prevention. Our research provides an extensive analysis of around 35 papers related to machine-learning-based cardiovascular disease prediction. This study not only summarizes the existing up-to-date approaches but will also assist doctors in predicting heart disease risk ahead of time, providing ample time for precautionary actions.
... Data were extracted from the different data sources over time spans ranging from one month to 23 years. Six circulatory health conditions were identified in 23 studies [29,36,37,40,45,[53][54][55][56][57][58][59][60][61][62][63][64][65][66][67][68][69][70]. These conditions were hypertension (I10-I15) (n = 5), heart failure (I50) (n = 5), atrial fibrillation (I48) (n = 2), stroke (I64) (n = 2), atherosclerosis (I70) (n = 1), myocardial infarction (I21) ...
Article
Full-text available
With the advances in technology and data science, machine learning (ML) is being rapidly adopted by the health care sector. However, there is a lack of literature addressing the health conditions targeted by the ML prediction models within primary health care (PHC) to date. To fill this gap in knowledge, we conducted a systematic review following the PRISMA guidelines to identify health conditions targeted by ML in PHC. We searched the Cochrane Library, Web of Science, PubMed, Elsevier, BioRxiv, Association of Computing Machinery (ACM), and IEEE Xplore databases for studies published from January 1990 to January 2022. We included primary studies addressing ML diagnostic or prognostic predictive models that were supplied completely or partially by real-world PHC data. Studies selection, data extraction, and risk of bias assessment using the prediction model study risk of bias assessment tool were performed by two investigators. Health conditions were categorized according to international classification of diseases (ICD-10). Extracted data were analyzed quantitatively. We identified 106 studies investigating 42 health conditions. These studies included 207 ML prediction models supplied by the PHC data of 24.2 million participants from 19 countries. We found that 92.4% of the studies were retrospective and 77.3% of the studies reported diagnostic predictive ML models. A majority (76.4%) of all the studies were for models’ development without conducting external validation. Risk of bias assessment revealed that 90.8% of the studies were of high or unclear risk of bias. The most frequently reported health conditions were diabetes mellitus (19.8%) and Alzheimer’s disease (11.3%). Our study provides a summary on the presently available ML prediction models within PHC. We draw the attention of digital health policy makers, ML models developer, and health care professionals for more future interdisciplinary research collaboration in this regard.
Article
Background and objective: Chronic obstructive pulmonary disease (COPD) is a leading cause of death worldwide that frequently presents with concomitant cardiovascular diseases. Despite the pathological distinction between individual COPD phenotypes such as emphysema and chronic bronchitis, there is a lack of knowledge about the impact of COPD phenotype on cardiovascular disease risk. Thus, this study aimed to utilize a nationally representative sample to investigate cardiovascular disease prevalence in patients with COPD with emphysema and chronic bronchitis phenotypes. Methods: Data from 31,560 adults including 2504 individuals with COPD, collected as part of the National Health and Nutrition Examination Survey (1999-2018), were examined. Results: A significantly increased cardiovascular disease risk, including coronary heart disease, heart failure, myocardial infarction and stroke, was identified in patients with COPD among all disease phenotypes. Particularly, compared to those without COPD, individuals with chronic bronchitis presented with 1.76 (95% CI: 1.41-2.20) times greater odds, individuals with emphysema with 2.31 (95% CI: 1.80-2.96) times greater odds, while those with a concurrent phenotype (combined chronic bronchitis and emphysema) exhibited 2.98 (95% CI: 2.11-4.21) times greater odds of reporting cardiovascular diseases. Conclusion: Our data confirms that patients with COPD present an elevated risk of developing cardiovascular disease among all phenotypes, with the most marked increase being in those with concurrent chronic bronchitis and emphysema phenotypes. These findings emphasize the need for awareness and appropriate cardiovascular screening in COPD.
Article
Objective: Disease comorbidity is a major challenge in healthcare affecting the patient's quality of life and costs. AI-based prediction of comorbidities can overcome this issue by improving precision medicine and providing holistic care. The objective of this systematic literature review was to identify and summarise existing machine learning (ML) methods for comorbidity prediction and evaluate the interpretability and explainability of the models. Materials and methods: The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework was used to identify articles in three databases: Ovid Medline, Web of Science and PubMed. The literature search covered a broad range of terms for the prediction of disease comorbidity and ML, including traditional predictive modelling. Results: Of 829 unique articles, 58 full-text papers were assessed for eligibility. A final set of 22 articles with 61 ML models was included in this review. Of the identified ML models, 33 models achieved relatively high accuracy (80-95%) and AUC (0.80-0.89). Overall, 72% of studies had high or unclear concerns regarding the risk of bias. Discussion: This systematic review is the first to examine the use of ML and explainable artificial intelligence (XAI) methods for comorbidity prediction. The chosen studies focused on a limited scope of comorbidities ranging from 1 to 34 (mean = 6), and no novel comorbidities were found due to limited phenotypic and genetic data. The lack of standard evaluation for XAI hinders fair comparisons. Conclusion: A broad range of ML methods has been used to predict the comorbidities of various disorders. With further development of explainable ML capacity in the field of comorbidity prediction, there is a significant possibility of identifying unmet health needs by highlighting comorbidities in patient groups that were not previously recognised to be at risk for particular comorbidities.
Preprint
Full-text available
Aim: With the rapid advances in technology and data science, machine learning (ML) is being adopted by the health care sector; but there is a lack of literature addressing the health conditions targeted by the ML prediction models within primary health care (PHC). To fill this gap in knowledge, we conducted a systematic review following the PRISMA guidelines to identify the health conditions targeted by ML in PHC. Methods: We searched the Cochrane Library, Web of Science, PubMed, Elsevier, BioRxiv, Association of Computing Machinery (ACM), and IEEE Xplore databases for studies published from January 1990 to January 2022. We included any primary study addressing ML diagnostic or prognostic predictive models that were supplied completely or partially by real-world PHC data. We performed literature screening, data extraction, and risk of bias assessment. Health conditions were categorized according to international classification of diseases. Extracted data were analyzed quantitatively and qualitatively. Results: We identified 109 studies investigating 42 health conditions. These studies included 273 ML prediction models supplied by the PHC data of 24.2 million participants from 19 countries. We found that 82% of the studies were retrospective. 76.6% of the studies reported diagnostic predictive ML models. 77% of all reported models aimed for models’ development without external validation. Risk of bias assessment revealed that 90.8% of the studies were of high or unclear risk of bias. The most frequently reported health conditions were Alzheimer’s disease and diabetes mellitus. Conclusions: To the best of our knowledge, this is the first review to investigate the extent of the health conditions targeted by the ML prediction models within PHC settings. Our study provides an important summary on the presently available ML models in PHC, which can be used in further research and implementation efforts.
Article
The field of machine learning (ML) is sufficiently young that it is still expanding at an accelerated pace, lying at the crossroads of computer science and statistics, and at the core of artificial intelligence (AI) and data science. Recent progress in ML has been driven both by the development of new learning algorithms and theory, and by the ongoing explosion in the availability of vast amounts of data (often referred to as "Big data") and low-cost computation. The adoption of ML-based approaches can be found throughout science, technology and industry, leading to more evidence-based decision-making across many walks of life, including healthcare, biomedicine, manufacturing, education, financial modeling, data governance, policing, and marketing. Although the past decade has seen increased interest in these fields, we are just beginning to tap the potential of these ML algorithms for studying systems that improve with experience. In this manuscript, we present a comprehensive view of worldwide geographic trends (taking into account China, USA, Israel, Italy, UK, and the Middle East) in ML-based approaches, highlighting rapid growth in the last 5 years attributable to the introduction of related national policies. Furthermore, based on the literature review, we also discuss the potential research directions in this field, summarizing some popular application areas of machine learning technology, such as healthcare, cyber-security systems, sustainable agriculture, data governance, and nanotechnology, and noting that the "dissemination of research" in the ML scientific community has undergone exceptional growth in the period 2018–2020, reaching 16,339 publications. Finally, we report the challenges and the regulatory standpoints for managing ML technology. Overall, we hope that this work will help to explain the geographic trends of ML approaches and their applicability in various real-world domains, as well as serve as a reference point for both academia and industry professionals, particularly from a technical, ethical and regulatory point of view.
Article
Full-text available
Background: COPD is a highly heterogeneous disease composed of different phenotypes with different aetiological and prognostic profiles, and current classification systems do not fully capture this heterogeneity. In this study we sought to discover, describe and validate COPD subtypes using cluster analysis on data derived from electronic health records. Methods: We applied two unsupervised learning algorithms (k-means and hierarchical clustering) in 30,961 current and former smokers diagnosed with COPD, using linked national structured electronic health records in England available through the CALIBER resource. We used 15 clinical features, including risk factors and comorbidities, and performed dimensionality reduction using multiple correspondence analysis. We compared the association between cluster membership and COPD exacerbations and respiratory and cardiovascular death with 10,736 deaths recorded over 146,466 person-years of follow-up. We also implemented and tested a process to assign unseen patients into clusters using a decision tree classifier. Results: We identified and characterized five COPD patient clusters with distinct patient characteristics with respect to demographics, comorbidities, risk of death and exacerbations. Four of the subgroups were associated with 1) anxiety/depression; 2) severe airflow obstruction and frailty; 3) cardiovascular disease and diabetes and 4) obesity/atopy. A fifth cluster was associated with low prevalence of most comorbid conditions. Conclusions: COPD patients can be sub-classified into groups with differing risk factors, comorbidities, and prognosis, based on data included in their primary care records. The identified clusters confirm findings of previous clustering studies and draw attention to anxiety and depression as important drivers of the disease in young, female patients.
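K-means, one of the two unsupervised algorithms used above, alternates between assigning each record to the nearest cluster centre and recomputing the centres as cluster means. A minimal one-dimensional pure-Python sketch (the toy values stand in for dimensionality-reduced features; they are not from the CALIBER data):

```python
import random

def kmeans_1d(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm on scalar features: assign, then re-average."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # initialise centres from the data
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # empty clusters keep their old centre
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated toy groups
centers = kmeans_1d([0.9, 1.0, 1.1, 4.9, 5.0, 5.1], k=2)
```

In practice the study's features are multivariate; the same loop applies with a vector distance, and the number of clusters is chosen with a validity measure rather than fixed in advance.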
Article
Full-text available
The ability to extract knowledge from data has been the driving force of Data Mining since its inception, and of statistical modeling long before even that. Actionable knowledge often takes the form of patterns, where a set of antecedents can be used to infer a consequent. In this paper we offer a solution to the problem of comparing different sets of patterns. Our solution allows comparisons between sets of patterns that were derived from different techniques (such as different classification algorithms), or made from different samples of data (such as temporal data or data perturbed for privacy reasons). We propose using the Jaccard index to measure the similarity between sets of patterns by converting each pattern into a single element within the set. Our measure focuses on providing conceptual simplicity, computational simplicity, interpretability, and wide applicability. The results of this measure are compared to prediction accuracy in the context of a real-world data mining scenario.
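The proposed measure treats each discovered pattern as a single set element and compares two pattern sets with the Jaccard index (size of the intersection over size of the union). A minimal sketch with hypothetical rule patterns (antecedents plus a consequent, encoded as tuples):

```python
def jaccard_index(patterns_a, patterns_b):
    """|A intersect B| / |A union B|; 1.0 for two identical (or two empty) sets."""
    a, b = set(patterns_a), set(patterns_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Each pattern becomes one hashable element of the set
rules_model_1 = {("age>65", "smoker", "copd"), ("bmi>30", "diabetes"), ("male", "mi")}
rules_model_2 = {("age>65", "smoker", "copd"), ("bmi>30", "diabetes"), ("female", "stroke")}
similarity = jaccard_index(rules_model_1, rules_model_2)  # 2 shared of 4 distinct
```

Because the measure only needs set membership, it applies equally to patterns mined by different algorithms or from different data samples, which is the comparison scenario the paper targets.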
Article
Full-text available
Purpose The Royal College of General Practitioners Research and Surveillance Centre (RCGP RSC) is one of the longest established primary care sentinel networks. In 2015, it established a new data and analysis hub at the University of Surrey. This paper evaluates the representativeness of the RCGP RSC network against the English population. Participants and method The cohort includes 1 042 063 patients registered in 107 participating general practitioner (GP) practices. We compared the RCGP RSC data with English national data in the following areas: demographics; geographical distribution; chronic disease prevalence, management and completeness of data recording; and prescribing and vaccine uptake. We also assessed practices within the network participating in a national swabbing programme. Findings to date We found a small over-representation of people in the 25–44 age band, under-representation of white ethnicity, and of less deprived people. Geographical focus is in London, with fewer practices in the southwest and east of England. We found differences in the prevalence of diabetes (national: 6.4%, RCGP RSC: 5.8%), learning disabilities (national: 0.44%, RCGP RSC: 0.40%), obesity (national: 9.2%, RCGP RSC: 8.0%), pulmonary disease (national: 1.8%, RCGP RSC: 1.6%), and cardiovascular diseases (national: 1.1%, RCGP RSC: 1.2%). Data completeness in risk factors for the diabetic population is high (77–99%). We found differences in prescribing rates and costs for infections (national: 5.58%, RCGP RSC: 7.12%), and for nutrition and blood conditions (national: 6.26%, RCGP RSC: 4.50%). Differences in vaccine uptake were seen in patients aged 2 years (national: 38.5%, RCGP RSC: 32.8%). Owing to large numbers, most differences were significant (p<0.00015). Future plans The RCGP RSC is a representative network, having only small differences from the national population, which have now been quantified and can be assessed for clinical relevance for specific studies.
This network is a rich source for research into routine practice.
Article
Full-text available
In COPD patients, mortality risk is influenced by age, severity of respiratory disease, and comorbidities. With an unbiased statistical approach we sought to identify clusters of COPD patients and to examine their mortality risk. Stable COPD subjects (n = 527) were classified using hierarchical cluster analysis of clinical, functional and imaging data. The relevance of this classification was validated using prospective follow-up of mortality. The most relevant patient classification was that based on three clusters (phenotypes). Phenotype 1 included subjects at very low risk of mortality, who had mild respiratory disease and low rates of comorbidities. Phenotype 2 and 3 were at high risk of mortality. Phenotype 2 included younger subjects with severe airflow limitation, emphysema and hyperinflation, low body mass index, and low rates of cardiovascular comorbidities. Phenotype 3 included older subjects with less severe respiratory disease, but higher rates of obesity and cardiovascular comorbidities. Mortality was associated with the severity of airflow limitation in Phenotype 2 but not in Phenotype 3 subjects, and subjects in Phenotype 2 died at younger age. We identified three COPD phenotypes, including two phenotypes with high risk of mortality. Subjects within these phenotypes may require different therapeutic interventions to improve their outcome.
Article
Full-text available
Gradient boosting constructs additive regression models by sequentially fitting a simple parameterized function (base learner) to current “pseudo”-residuals by least squares at each iteration. The pseudo-residuals are the gradient of the loss functional being minimized, with respect to the model values at each training data point evaluated at the current step. It is shown that both the approximation accuracy and execution speed of gradient boosting can be substantially improved by incorporating randomization into the procedure. Specifically, at each iteration a subsample of the training data is drawn at random (without replacement) from the full training data set. This randomly selected subsample is then used in place of the full sample to fit the base learner and compute the model update for the current iteration. This randomized approach also increases robustness against overcapacity of the base learner.
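The procedure described above (fit a base learner to the pseudo-residuals on a fresh random subsample at each iteration) can be sketched in a few lines for squared loss with one-dimensional regression stumps. All data and settings below are illustrative, not from the paper:

```python
import random

def fit_reg_stump(x, r):
    """Best single-threshold split of the residuals r by squared error."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def stochastic_gradient_boost(x, y, n_iter=100, lr=0.1, subsample=0.5, seed=0):
    rng = random.Random(seed)
    f0 = sum(y) / len(y)                         # initial constant model
    stumps = []
    for _ in range(n_iter):
        pred = [f0 + lr * sum(s(xi) for s in stumps) for xi in x]
        resid = [yi - pi for yi, pi in zip(y, pred)]  # pseudo-residuals (squared loss)
        # the randomized step: subsample WITHOUT replacement at every iteration
        idx = rng.sample(range(len(x)), max(2, int(subsample * len(x))))
        stumps.append(fit_reg_stump([x[i] for i in idx], [resid[i] for i in idx]))
    return lambda xi: f0 + lr * sum(s(xi) for s in stumps)

# Illustrative step-function data
xs = list(range(10))
ys = [0.0] * 5 + [10.0] * 5
model = stochastic_gradient_boost(xs, ys)
```

Setting subsample to 1.0 recovers deterministic gradient boosting; the fractional subsample is the paper's modification, which both speeds up each iteration and regularizes the fit.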
Article
Full-text available
A new graphical display is proposed for partitioning techniques. Each cluster is represented by a so-called silhouette, which is based on the comparison of its tightness and separation. This silhouette shows which objects lie well within their cluster, and which ones are merely somewhere in between clusters. The entire clustering is displayed by combining the silhouettes into a single plot, allowing an appreciation of the relative quality of the clusters and an overview of the data configuration. The average silhouette width provides an evaluation of clustering validity, and might be used to select an ‘appropriate’ number of clusters.
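For each object the silhouette compares a, its average dissimilarity to the rest of its own cluster, with b, its average dissimilarity to the nearest other cluster, via s = (b - a) / max(a, b). A minimal sketch on one-dimensional toy data:

```python
def avg_dist(p, cluster, dist):
    return sum(dist(p, q) for q in cluster) / len(cluster)

def silhouette_width(p, own, other_clusters, dist):
    """s(p) in [-1, 1]: near 1 = well inside own cluster, near 0 = between clusters."""
    rest_of_own = [q for q in own if q != p]
    if not rest_of_own:
        return 0.0                                # convention for singleton clusters
    a = avg_dist(p, rest_of_own, dist)            # tightness
    b = min(avg_dist(p, c, dist) for c in other_clusters)  # separation
    return (b - a) / max(a, b)

d = lambda x, y: abs(x - y)
cluster_a, cluster_b = [1.0, 2.0], [8.0, 9.0]
s = silhouette_width(1.0, cluster_a, [cluster_b], d)
```

Averaging s over all objects gives the validity score described above, and comparing that average across different values of k is the suggested way to pick an appropriate number of clusters.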
Article
Full-text available
Chronic obstructive pulmonary disease (COPD) is a complex condition with pulmonary and extra-pulmonary manifestations. This study describes the heterogeneity of COPD in a large, well-characterised and controlled COPD cohort (ECLIPSE). We studied 2164 clinically stable COPD patients, 337 smokers with normal lung function and 245 never smokers. In these individuals, we measured clinical parameters, nutritional status, spirometry, exercise tolerance, and the amount of emphysema by computed tomography. COPD patients were slightly older than controls and had more pack-years of smoking than smokers with normal lung function. Co-morbidities were more prevalent in COPD patients than in controls, and occurred to the same extent irrespective of GOLD stage. The severity of airflow limitation in COPD patients was poorly related to the degree of breathlessness, health status, presence of co-morbidity, exercise capacity and number of exacerbations reported in the year before the study. The distribution of these variables within each GOLD stage was wide. Even in subjects with severe airflow obstruction, a substantial proportion did not report symptoms, exacerbations or exercise limitation. The amount of emphysema increased with GOLD severity. The prevalence of bronchiectasis was low (4%) but also increased with GOLD stage. Some gender differences were also identified. The clinical manifestations of COPD are highly variable, and the degree of airflow limitation does not capture the heterogeneity of the disease.
Article
Full-text available
Multivariate Imputation by Chained Equations (MICE) is the name of software for imputing incomplete multivariate data by Fully Conditional Specification (FCS). MICE V1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. MICE V1.0 introduced predictor selection, passive imputation and automatic pooling. This article presents MICE V2.0, which extends the functionality of MICE V1.0 in several ways. In MICE V2.0, the analysis of imputed data is made completely general, and the range of models under which pooling works is substantially extended. MICE V2.0 adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling and model selection. Imputation of categorical data is improved in order to bypass problems caused by perfect prediction. Special attention is given to transformations, sum scores, indices and interactions using passive imputation, and to the proper setup of the predictor matrix. MICE V2.0 is freely available from CRAN as the R package mice. This article provides a hands-on, stepwise approach to solving incomplete data problems with mice in real data.
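The chained-equations idea can be sketched without the R package. The toy below is a deliberate simplification of FCS, not MICE itself: it fills missing values by cycling deterministic conditional regressions (x given y, then y given x) starting from mean imputation, whereas MICE draws imputations from predictive distributions and pools over multiple imputed data sets. All data and names here are illustrative.

```python
import statistics

def linreg(xs, ys):
    """Least-squares slope and intercept for y ~ x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def fcs_impute(x, y, n_cycles=10):
    """Fill None entries by cycling conditional models (x | y, then y | x),
    starting from mean imputation: the core loop of chained equations."""
    miss_x = [i for i, v in enumerate(x) if v is None]
    miss_y = [i for i, v in enumerate(y) if v is None]
    def fill(col):
        m = statistics.mean(v for v in col if v is not None)
        return [m if v is None else float(v) for v in col]
    x, y = fill(x), fill(y)
    for _ in range(n_cycles):
        s, b = linreg(y, x)            # regress x on the current completed y
        for i in miss_x:
            x[i] = s * y[i] + b
        s, b = linreg(x, y)            # regress y on the current completed x
        for i in miss_y:
            y[i] = s * x[i] + b
    return x, y

x = [1.0, 2.0, 3.0, 4.0, None]
y = [2.0, 4.0, 6.0, None, 10.0]        # roughly y = 2x
xi, yi = fcs_impute(x, y)
```

After a few cycles the imputed entries settle near the values implied by the approximately linear relationship between the two columns.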
Article
Full-text available
This article provides an investigation of cluster validation indices that relates four of these indices to the L. Hubert and P. Arabie (1985) adjusted Rand index, the cluster validation measure of choice (G. W. Milligan & M. C. Cooper, 1986). It is shown how these other indices can be "roughly" transformed onto the same scale as the adjusted Rand index. Furthermore, in-depth explanations are given of why classification rates should not be used in cluster validation research. The article concludes by summarizing several properties of the adjusted Rand index across many conditions and provides a method for testing the significance of observed adjusted Rand indices.
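The adjusted Rand index itself is a short pair-counting computation. A minimal sketch (toy labellings are illustrative): count agreeing pairs via the contingency table, then correct for the agreement expected by chance, so that identical partitions score 1 regardless of how the cluster labels are named.

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """Hubert & Arabie's adjusted Rand index: chance-corrected pair-counting
    agreement between two partitions of the same items."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))   # contingency-table cells
    rows = Counter(labels_a)
    cols = Counter(labels_b)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)   # chance agreement
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # relabelled copy
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # no agreement
```

Unlike a raw classification rate, the index is invariant to label permutations and is zero in expectation under random labelling, which is why the article treats it as the reference measure.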
Article
Chronic Obstructive Pulmonary Disease (COPD) is a highly heterogeneous condition projected to become the third leading cause of death worldwide by 2030. To better characterize this condition, clinicians have classified patients sharing certain symptomatic characteristics, such as symptom intensity and history of exacerbations, into distinct phenotypes. In recent years, the growing use of machine learning algorithms, and cluster analysis in particular, has promised to advance this classification through the integration of additional patient characteristics, including comorbidities, biomarkers, and genomic information. This combination would allow researchers to more reliably identify new COPD phenotypes, as well as better characterize existing ones, with the aim of improving diagnosis and developing novel treatments. Here, we systematically review the last decade of research progress, which uses cluster analysis to identify COPD phenotypes. Collectively, we provide a systematized account of the extant evidence, describe the strengths and weaknesses of the main methods used, identify gaps in the literature, and suggest recommendations for future research.
Chapter
Principal components analysis (PCA) is a commonly used descriptive multivariate method for handling quantitative data and can be extended to deal with mixed measurement level data. For the extended PCA with such a mixture of quantitative and qualitative data, we require the quantification of qualitative data in order to obtain optimal scaling data. PCA with optimal scaling is referred to as nonlinear PCA (Gifi, Nonlinear Multivariate Analysis. Wiley, Chichester, 1990). Nonlinear PCA with optimal scaling alternates between estimating the parameters of PCA and quantifying the qualitative data. The alternating least squares (ALS) algorithm is used for nonlinear PCA and can find least squares solutions by minimizing two types of loss functions: a low-rank approximation and homogeneity analysis with restrictions. PRINCIPALS of Young et al. (Principal components of mixed measurement level multivariate data: an alternating least squares method with optimal scaling features, 43:279–281, 1978) and PRINCALS of Gifi (Nonlinear Multivariate Analysis. Wiley, Chichester, 1990) are used for the computation.
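The alternating-least-squares principle behind these algorithms can be illustrated on the simplest case, the low-rank-approximation loss (this sketch omits the optimal-scaling quantification step entirely, and the matrix is a made-up example): to minimize ||X − uvᵀ||², fix v and solve for u in closed form, then fix u and solve for v, and repeat.

```python
def als_rank1(X, n_iter=50):
    """Alternating least squares for the rank-1 approximation
    min ||X - u v^T||^2: fix v, solve for u; then fix u, solve for v."""
    m, n = len(X), len(X[0])
    v = [1.0] * n
    u = [0.0] * m
    for _ in range(n_iter):
        vv = sum(x * x for x in v)
        u = [sum(X[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        uu = sum(x * x for x in u)
        v = [sum(X[i][j] * u[i] for i in range(m)) / uu for j in range(n)]
    return u, v

X = [[1, 2], [2, 4], [3, 6]]          # exactly rank one by construction
u, v = als_rank1(X)
resid = sum((X[i][j] - u[i] * v[j]) ** 2
            for i in range(3) for j in range(2))
```

Each half-step is an ordinary least-squares problem, so the loss is monotonically non-increasing; PRINCIPALS and PRINCALS wrap this same alternation around an extra step that re-quantifies the qualitative variables.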
Article
Objective: We aim to make use of clinical spirometry data in order to identify individual COPD patients with divergent trajectories of lung function over time. Study design and setting: A hospital-based COPD cohort (N = 607) was followed for an average of 4.6 years. Each patient had a mean of 8.4 spirometries available. We used a Hierarchical Bayesian Model (HBM) to identify individuals presenting constant trends in lung function. Results: At a probability level of 95%, one third of the patients (180/607) presented rapidly declining FEV1 (mean -78 ml/year, 95% CI -73 to -83 ml) compared with the rest of the patients (mean -26 ml/year, 95% CI -23 to -29 ml, p ≤ 2.2 × 10⁻¹⁶). Constant improvement of FEV1 was very rare. The rapid decliners more frequently suffered from exacerbations as measured by various outcome markers. Conclusion: Clinical data of individual patients can be utilized to identify diverging trajectories of FEV1 with a high probability. Frequent exacerbations were more prevalent in FEV1 decliners than in the rest of the patients. The results confirmed the previously reported association between FEV1 decline and exacerbation rate and further suggested that in clinical practice HBM could improve the identification of high-risk individuals at early stages of the disease.
Article
The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. Two algorithms are found in the literature and software, both announcing that they implement the Ward clustering method. When applied to the same distance matrix, they produce different results. One algorithm preserves Ward’s criterion, the other does not. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward’s hierarchical clustering method.
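One of the two implementation routes discussed above runs agglomeration through the Lance-Williams recurrence. A minimal sketch (toy points are illustrative, and this naive O(n³) version is for exposition only): start from squared Euclidean distances, repeatedly merge the cheapest pair, and update the merged cluster's distances to every other cluster with the Ward coefficients.

```python
def ward_cluster(points, n_clusters):
    """Agglomerative clustering under the Ward criterion, via the
    Lance-Williams recurrence on squared Euclidean distances."""
    active = list(range(len(points)))
    size = {i: 1 for i in active}
    members = {i: [i] for i in active}
    d = {(i, j): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
         for i in active for j in active if i < j}
    def get(i, j):
        return d[(min(i, j), max(i, j))]
    while len(active) > n_clusters:
        # merge the pair with the smallest Ward cost
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda ab: get(*ab))
        for k in active:
            if k in (i, j):
                continue
            ni, nj, nk = size[i], size[j], size[k]
            # Lance-Williams update with the Ward coefficients
            d[(min(i, k), max(i, k))] = ((ni + nk) * get(i, k)
                                         + (nj + nk) * get(j, k)
                                         - nk * get(i, j)) / (ni + nj + nk)
        members[i] += members[j]
        size[i] += size[j]
        active.remove(j)
    return [sorted(members[i]) for i in active]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = ward_cluster(pts, 2)
```

The discrepancy the article analyses comes down to exactly this input convention: feeding the recurrence squared versus unsquared distances yields different dendrograms, only one of which preserves Ward's original criterion.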
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ∗∗∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
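The two sources of randomness named in the abstract, an independent bootstrap sample per tree and a random feature subset per split, can be sketched with one-split trees. This is an illustrative toy (real random forests grow deep trees and use √p features per split; the data here are fabricated, with a redundant second feature for simplicity):

```python
import random

def fit_stump(rows, n_features, rng):
    """One-split classifier; the split feature is chosen from a random
    subset of features, the key randomization in random forests."""
    best = None
    for f in rng.sample(range(n_features), max(1, n_features // 2)):
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if not left or not right:
                continue
            lm = max(set(left), key=left.count)
            rm = max(set(right), key=right.count)
            err = sum(y != lm for y in left) + sum(y != rm for y in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    _, f, t, lm, rm = best
    return lambda x: lm if x[f] <= t else rm

def random_forest(rows, n_trees=25, seed=0):
    """Each tree sees its own bootstrap sample; prediction is a majority vote."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    trees = [fit_stump([rng.choice(rows) for _ in rows], n_features, rng)
             for _ in range(n_trees)]
    def predict(x):
        votes = [t(x) for t in trees]
        return max(set(votes), key=votes.count)
    return predict

# toy data: label is 1 when the (redundant) features exceed 5
rows = [((i, i), int(i > 5)) for i in range(10)]
model = random_forest(rows)
```

Individual stumps vary with their bootstrap sample, but the majority vote is stable; this is the strength-versus-correlation trade-off the abstract analyses.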
Book
An accessible introduction and essential reference for boosting, an approach to machine learning based on the idea of creating a highly accurate predictor by combining many weak and inaccurate "rules of thumb." A remarkably rich theory has evolved around boosting, with connections to a range of topics including statistics, game theory, convex optimization, and information geometry. Boosting algorithms have also enjoyed practical success in fields such as biology, vision, and speech processing. At various times in its history, boosting has been perceived as mysterious, controversial, even paradoxical. This book, written by the inventors of the method, brings together, organizes, simplifies, and substantially extends two decades of research on boosting, presenting both theory and applications in a way that is accessible to readers from diverse backgrounds while also providing an authoritative reference for advanced researchers. With its introductory treatment of all material and its inclusion of exercises in every chapter, the book is appropriate for course use as well. The book begins with a general introduction to machine learning algorithms and their analysis; then explores the core theory of boosting, especially its ability to generalize; examines some of the myriad other theoretical viewpoints that help to explain and understand boosting; provides practical extensions of boosting for more complex learning problems; and finally presents a number of advanced theoretical topics. Numerous applications and practical illustrations are offered throughout.
Article
The pathology of cardiovascular disease (CVD) is complex; multiple biological pathways have been implicated, including, but not limited to, inflammation and oxidative stress. Biomarkers of inflammation and oxidative stress may serve to help identify patients at risk for CVD, to monitor the efficacy of treatments, and to develop new pharmacological tools. However, due to the complexities of CVD pathogenesis there is no single biomarker available to estimate absolute risk of future cardiovascular events. Furthermore, not all biomarkers are equal; the functions of many biomarkers overlap, some offer better prognostic information than others, and some are better suited to identify/predict the pathogenesis of particular cardiovascular events. The identification of the most appropriate set of biomarkers can provide a detailed picture of the specific nature of the cardiovascular event. The following review provides an overview of existing and emerging inflammatory biomarkers, pro-inflammatory cytokines, anti-inflammatory cytokines, chemokines, oxidative stress biomarkers, and antioxidant biomarkers. The functions of each biomarker are discussed, and prognostic data are provided where available.
Cardiovascular comorbidity in COPD: systematic literature review
  • H. Müllerova, A. Agusti, S. Erqou, D.W. Mapel, Chest 144 (4) (2013) 1163–1178.
EBK-means: a clustering technique based on elbow method and k-means in WSN
  • P. Bholowalia, A. Kumar, Int. J. Comput. Appl. 105 (9) (2014).
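The elbow method cited above chooses the number of clusters where the within-cluster sum of squares (WCSS) stops dropping sharply. A minimal sketch (toy 1-D data and a plain Lloyd's-algorithm k-means, both illustrative rather than the EBK-means variant): compute the WCSS curve over candidate k and look for the bend.

```python
import random

def kmeans_wcss(points, k, n_iter=20, seed=0):
    """Plain 1-D k-means (Lloyd's algorithm); returns the within-cluster
    sum of squares for the fitted clustering."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        groups = {i: [] for i in range(k)}
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            groups[nearest].append(p)
        # recompute centers; keep the old center if a cluster goes empty
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in groups.items()]
    return sum(min((p - c) ** 2 for c in centers) for p in points)

# two well-separated 1-D blobs: WCSS drops sharply from k=1 to k=2,
# then flattens, giving an elbow at k=2
points = [1.0, 1.1, 1.2, 9.0, 9.1, 9.2]
curve = {k: kmeans_wcss(points, k) for k in (1, 2, 3)}
```

Plotting `curve` against k would show the large drop between k=1 and k=2 followed by a nearly flat segment, which is the visual criterion the elbow method formalizes.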