Abstract
Additive feature explanations using Shapley values have become popular for providing transparency into the relative importance of each feature to an individual prediction of a machine learning model. While Shapley values provide a unique additive feature attribution in cooperative game theory, the Shapley values that can be generated for even a single machine learning model are far from unique, with theoretical and implementational decisions affecting the resulting attributions. Here, we consider the application of Shapley values for explaining decision tree ensembles and present a novel approach to Shapley value-based feature attribution that can be applied to random forests and boosted decision trees. This new method provides attributions that accurately reflect details of the model prediction algorithm for individual instances, while being computationally competitive with one of the most widely used current methods. We explain the theoretical differences between the standard and novel approaches and compare their performance using synthetic and real data.
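In practice, attributions of this kind are most often computed with the shap library's TreeExplainer. As a minimal, hedged sketch (the model and data below are illustrative placeholders, not the authors' setup):

```python
# Illustrative sketch: Shapley-value feature attributions for a random forest
# using the shap library's TreeExplainer (model and data are placeholders).
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
# One additive attribution per feature, per instance; the attributions plus
# the explainer's expected value sum to the model's prediction.
shap_values = explainer.shap_values(X[:10])
```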
... This results in the so-called missingness property of the corresponding explainers [37]. We shall exploit this property in our treatment of tree ensembles to reduce the complexity of computing marginal feature attributions; compare with [6]. Another well-known game value of form (1.3) that will come up in this paper is the Banzhaf value [4]. ...
... Figure 1). The implementation invariance axiom also fails for the "eject" variant of TreeSHAP introduced in [6]; see Appendix B. ...
... After a brief review of TreeSHAP in §2.6, we compare its outputs with marginal feature attributions in §3, where we expose certain shortcomings of TreeSHAP and discuss how calculating marginal feature attributions for tree-based models can be done more efficiently. [Footnote 6: i.e., the union of R_1, ..., R_k covers B except for perhaps a measure-zero subset, and the intersection of any two of them is of measure zero.] ...
Due to their power and ease of use, tree-based machine learning models have become very popular. To interpret these models, local feature attributions based on marginal expectations e.g. marginal (interventional) Shapley, Owen or Banzhaf values may be employed. Such feature attribution methods are true to the model and implementation invariant, i.e. dependent only on the input-output function of the model. By taking advantage of the internal structure of tree-based models, we prove that their marginal Shapley values, or more generally marginal feature attributions obtained from a linear game value, are simple (piecewise-constant) functions with respect to a certain finite partition of the input space determined by the trained model. The same is true for feature attributions obtained from the famous TreeSHAP algorithm. Nevertheless, we show that the "path-dependent" TreeSHAP is not implementation invariant by presenting two (statistically similar) decision trees computing the exact same function for which the algorithm yields different rankings of features, whereas the marginal Shapley values coincide. Furthermore, we discuss how the fact that marginal feature attributions are simple functions can potentially be utilized to compute them. An important observation, showcased by experiments with XGBoost, LightGBM and CatBoost libraries, is that only a portion of all features appears in a tree from the ensemble; thus the complexity of computing marginal Shapley (or Owen or Banzhaf) feature attributions may be reduced. In particular, in the case of CatBoost models, the trees are oblivious (symmetric) and the number of features in each of them is no larger than the depth. We exploit the symmetry to derive an explicit formula with improved complexity for marginal Shapley (and Banzhaf and Owen) values which is only in terms of the internal parameters of the CatBoost model.
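The distinction this abstract draws between marginal (interventional) attributions and path-dependent TreeSHAP maps onto two modes of the shap library's TreeExplainer. A hedged sketch under illustrative data and model (not the authors' experiments):

```python
# Contrast of the two TreeExplainer modes discussed above (illustrative setup).
import shap
import xgboost
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=6, random_state=0)
model = xgboost.XGBRegressor(n_estimators=50, random_state=0).fit(X, y)

# Marginal (interventional) Shapley values: true to the model and
# implementation invariant; they require a background dataset.
marginal = shap.TreeExplainer(
    model, data=X[:100], feature_perturbation="interventional"
).shap_values(X[:5])

# Path-dependent TreeSHAP: uses node cover statistics stored in the trees,
# and (per the abstract above) is not implementation invariant.
path_dep = shap.TreeExplainer(
    model, feature_perturbation="tree_path_dependent"
).shap_values(X[:5])
```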
... As an initial step in this stage, we apply AE to generate the Reconstruction Loss (RL) and a threshold. In this regard, we used the steps proposed by (Campbell et al., 2022; Yang, 2021; Lundberg et al., 2017). ...
... The second is that the contribution of each feature is the difference between the model prediction with and without the feature. The third is that the contribution of each feature is the difference between the model prediction with and without the feature, weighted by the feature's value (Campbell et al., 2022; Yang, 2021; Lundberg et al., 2017). Therefore, we propose the following three steps for identifying the main variables related to the fault and consequent system downtime: ...
... Naively, computing all n Shapley values according to Equation 1 requires O(2^n) evaluations of v (each of which involves the evaluation of a learned model) and O(2^n) time. This cost can be reduced in certain special cases, e.g. when computing feature attributions for linear models or decision trees (Lundberg et al., 2018; Campbell et al., 2022; Amoukou et al., 2022; Chen et al., 2018). ...
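"Equation 1" in this excerpt is presumably the standard Shapley value formula; for a value function $v$ on the feature set $N = \{1, \dots, n\}$, the attribution of feature $i$ is

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr),$$

and the sum over the $2^{n-1}$ subsets not containing $i$ is the source of the exponential cost quoted above.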
Originally introduced in game theory, Shapley values have emerged as a central tool in explainable machine learning, where they are used to attribute model predictions to specific input features. However, computing Shapley values exactly is expensive: for a general model with n features, 2^n model evaluations are necessary. To address this issue, approximation algorithms are widely used. One of the most popular is the Kernel SHAP algorithm, which is model agnostic and remarkably effective in practice. However, to the best of our knowledge, Kernel SHAP has no strong non-asymptotic complexity guarantees. We address this issue by introducing Leverage SHAP, a lightweight modification of Kernel SHAP that provides provably accurate Shapley value estimates with just O(n log n) model evaluations. Our approach takes advantage of a connection between Shapley value estimation and agnostic active learning by employing leverage score sampling, a powerful regression tool. Beyond theoretical guarantees, we show that Leverage SHAP consistently outperforms even the highly optimized implementation of Kernel SHAP available in the ubiquitous SHAP library [Lundberg & Lee, 2017].
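For context, a hedged sketch of the baseline Kernel SHAP estimator that Leverage SHAP modifies, via the shap library (the model and data are illustrative; Leverage SHAP itself is not shown here):

```python
# Illustrative Kernel SHAP usage: model-agnostic Shapley value estimation by
# weighted regression over sampled feature coalitions (placeholder model/data).
import shap
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = SVC(probability=True, random_state=0).fit(X, y)

# A small background sample stands in for "missing" features in coalitions.
explainer = shap.KernelExplainer(model.predict_proba, shap.sample(X, 50))
shap_values = explainer.shap_values(X[:3], nsamples=200)
```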
... We further investigated the results obtained by our best decision tree using SHAP (SHapley Additive exPlanations) to reveal how features were related to the prediction outcomes. SHAP is a widely used tool to explain machine learning models by deconstructing their predictions into the contributions of individual features, as exemplified by recent studies using SHAP on decision trees (Rodrigo et al. 2021), including boosted trees (Nohara et al. 2022) or ensembles (Campbell et al. 2022). SHAP supports local interpretability because it helps to understand individual predictions rather than how the model works (global interpretability). ...
Fanfictions are a popular literary genre in which writers reuse a universe, for example to transform heteronormative relationships with queer characters or to bring romance into shows focused on horror and adventure. Fanfictions have been the subject of numerous studies in text mining and network analysis, which used Natural Language Processing (NLP) techniques to compare fanfictions with the original scripts or to make various predictions. In this paper, we use NLP to predict the popularity of a story and examine which features contribute to popularity. This endeavor is important given the rising use of AI assistants and the ongoing interest in generating text with desirable characteristics. We used the two main websites for collecting fan stories (Fanfiction.net and Archives Of Our Own) on Supernatural, which has been the subject of numerous scholarly works. We extracted high-level features such as the main character and sentiments from 79,288 of these stories and used the features in a binary classification supported by tree-based methods, ensemble methods (random forest), neural networks, and Support Vector Machines. Our optimized classifiers correctly identified popular stories in four out of five cases. By relating features to classification outcomes using SHAP values, we found that fans prefer longer stories with a wider vocabulary, which can inform the prompts of AI chatbots to continue generating such successful stories. However, we also observed that fans wanted stories unlike the original material (e.g., favoring romance and disliking when characters are hurt), hence AI-powered stories may be less popular if they strictly follow the original material of a show.
... For that last reason, SHAP values have received increasing attention and focus in the data valuation community in recent years, especially for detecting high-value data in training predictive models [5]. Previous studies have focused either on detecting which features are the most influential for machine learning models' output [12,16-18,20-22] or on choosing the best datum or subset of data to improve these models' performance during the training process [5,13]. ...
This paper investigates the impact of data valuation metrics (variability and coefficient of variation) on feature importance in classification models. Data valuation is an emerging topic in the fields of data science, accounting, data quality, and information economics concerned with methods to calculate the value of data. Feature importance or ranking is important for explaining how black-box machine learning models make predictions as well as for selecting the most predictive features while training these models. Existing feature importance algorithms are either computationally expensive (e.g. SHAP values) or biased (e.g. Gini importance in tree-based models). No previous investigation of the impact of data valuation metrics on feature importance has been conducted. Five popular machine learning models (eXtreme Gradient Boosting (XGB), Random Forest (RF), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and Naive Bayes (NB)) have been used, as well as six widely implemented feature ranking techniques (including SHAP values), to investigate the relationship between feature importance and data valuation metrics for a clinical use case. XGB outperforms the other models with a weighted F1-score of 79.72%. The findings suggest that features with variability greater than 0.4 or a coefficient of variation greater than 23.4 have little to no value; therefore, these features can be filtered out during feature selection. This result, if generalisable, could simplify feature selection and data preparation.
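A minimal sketch of the filtering rule these findings suggest; the thresholds are the paper's, while the reading of "variability" as variance and the helper functions below are my assumptions:

```python
# Hypothetical filter based on the reported thresholds: drop features with
# variability > 0.4 or coefficient of variation > 23.4 ("variability" is
# taken here as variance; the paper's exact definition may differ).
import numpy as np
import pandas as pd

def coefficient_of_variation(col: pd.Series) -> float:
    # CV = standard deviation / mean (undefined when the mean is zero)
    mean = col.mean()
    return np.inf if mean == 0 else col.std() / mean

def keep_valuable_features(df: pd.DataFrame,
                           var_threshold: float = 0.4,
                           cv_threshold: float = 23.4) -> list:
    return [c for c in df.columns
            if df[c].var() <= var_threshold
            and abs(coefficient_of_variation(df[c])) <= cv_threshold]
```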
... The algorithm's performance was assessed in two scenarios: with and without the target variable. SHAP values were then used to track each feature's impact on the prediction process [50,51]. ...
This work aims to propose a simplified decision tool for the design of side-lit spaces that accounts for the impacts of climate and surroundings. The framework was developed using the smart optimization algorithm NSGA-III in combination with the CatBoost ensemble machine learning technique and simulation. WWRs and overhang depths were optimized to maximize daylight penetration, minimize glare risk, and reduce energy demand across different climates, orientations, space proportions, and surrounding obstruction angles. Besides, optimal solutions were used to determine the range of attainable targets for daylight, glare, and energy metrics in each climate, which can be used as requirements of national codes and standards. For example, for WDR = 3:2 and an OA of 20, energy demand varied between 260 and 266 kWh/m², ASE remained at 40-50%, and sDA was reported as 100% for south-oriented cases in Tehran. A sensitivity analysis was also performed to provide insights into how various design parameters affected daylight and energy performance in architectural spaces. The results revealed that daylight availability metrics were highly sensitive to the Window-to-Wall Ratio (WWR), while glare metrics were primarily affected by obstruction angle. Energy consumption was mainly influenced by room depth, WWR, window orientation, and obstruction angle. Notably, these parameters ranked similarly across all considered climates, albeit with varying degrees of significance. Results were presented in the form of guide charts that offer a practical tool for designing buildings in highly obstructed contexts and enable non-programmer architects and designers to make informed decisions.
... One possible criticism of these earlier results is that some model agnostic explainers do not aim to select subsets of features relevant to the prediction, but target instead finding an absolute/relative order of feature importance. In contrast, a number of authors have reported pitfalls with the use of SHAP and Shapley values as a measure of feature importance [105,60,95,74,37,104,78,2,101,59,20]. However, these earlier works do not identify fundamental flaws with the use of Shapley values in explainability. ...
This paper develops a rigorous argument for why the use of Shapley values in explainable AI (XAI) will necessarily yield provably misleading information about the relative importance of features for predictions. Concretely, this paper demonstrates that there exist classifiers, and associated predictions, for which the relative importance of features determined by the Shapley values will incorrectly assign more importance to features that are provably irrelevant for the prediction, and less importance to features that are provably relevant for the prediction. The paper also argues that, given recent complexity results, the existence of efficient algorithms for the computation of rigorous feature attribution values in the case of some restricted classes of classifiers should be deemed unlikely at best.
... To provide better interpretations of environmental conditions and knowlesi malaria risk, we applied SHapley Additive exPlanations (SHAP) to disseminate and interpret the output of XGBoost model (Campbell et al., 2022). SHAP values were generated to evaluate the relative importance of covariates in the model. ...
The emergence of potentially life-threatening zoonotic malaria caused by Plasmodium knowlesi nearly two decades ago has continued to challenge Malaysian healthcare. From a total of 376 P. knowlesi infections notified in 2008, the number increased to 2,609 cases nationwide in 2020. Numerous studies have been conducted in Malaysian Borneo to determine the association between environmental factors and knowlesi malaria transmission. However, there is still a lack of understanding of the environmental influence on knowlesi malaria transmission in Peninsular Malaysia. Therefore, our study aimed to investigate the ecological distribution of human P. knowlesi malaria in relation to environmental factors in Peninsular Malaysia. A total of 2,873 records of human P. knowlesi infections in Peninsular Malaysia from 1st January 2011 to 31st December 2019 were collated from the Ministry of Health Malaysia and geolocated. Three machine learning-based models, maximum entropy (MaxEnt), extreme gradient boosting (XGBoost), and an ensemble modeling approach, were applied to predict the spatial variation of P. knowlesi disease risk. Multiple environmental parameters including climate factors, landscape characteristics, and anthropogenic factors were included as predictors in both predictive models. Subsequently, an ensemble model was developed based on the output of both MaxEnt and XGBoost. Comparison between models indicated that XGBoost had higher performance than MaxEnt and the ensemble model, with AUCROC values of 0.933 ± 0.002 and 0.854 ± 0.007 for train and test datasets, respectively. Key environmental covariates affecting human P. knowlesi occurrence were distance to the coastline, elevation, tree cover, annual precipitation, tree loss, and distance to the forest. Our models indicated that the disease risk areas were mainly distributed in low-elevation (75-345 m above mean sea level) areas along the Titiwangsa mountain range and the inland central-northern region of Peninsular Malaysia. The high-resolution risk map of human knowlesi malaria constructed in this study can be further utilized for multi-pronged interventions targeting communities at risk, macaque populations, and mosquito vectors.
... However, to make important and meaningful decisions it is necessary to seek alternative methods [2]. One of these is the decision tree, which has the attractive property of being simple to interpret and easy to understand [3]. ...
Figuring out the characteristics of urban residents' travel mode choices is the key to forecasting residents' travel demand as well as an important basis for transportation system management and planning. Integrated learning models based on the Boosting framework have high prediction accuracy and strong feature selection and combination ability, and have become the preferred algorithms for building travel demand prediction models. In this article, the authors use resident travel survey data from Kunming City, choose four integrated learning classifiers, XGBoost, LightGBM, CatBoost, and GBDT, to predict the travel mode of residents, select the best model parameters using grid search and five-fold cross-validation, analyze the importance of the features of the prediction model using TreeSHAP, and finally explore the selection of travel modes under the interaction of important feature variables. The results of the study show that (1) the XGBoost model performs better than the other models: its accuracy, precision, recall, and F1 value all reach 90%; the prediction accuracy for the four modes of travel, namely walking, two-wheeled electric motorcycle, public transportation, and car, reaches 94%, 90%, 85%, and 90%, respectively; and the corresponding AUC values reach 0.99, 0.97, 0.96, and 0.98, respectively. (2) Compared with household size and annual income, the actual distance of travel paths, ownership of cars and two-wheeled electric motorcycles, age and gender of travelers, and the built environment are more important factors influencing the prediction of residents' travel choices. (3) The characteristics of travel mode choice under the interaction of several factors are obvious: except for the group over 55 years old, the ownership of travel means of transportation in the family significantly affects residents' choice of travel mode; men between 20 and 55 years old make more medium-distance and long-distance trips, and they are the main group of people who use cars; when the travel distance is less than 15 km, two-wheeled electric motorcycles and cars have a certain mutual substitution effect. In order to comprehensively promote the high-quality development of transportation, it is necessary to focus on the travel needs of women and the elderly while controlling the number of motor vehicles per household, introducing policies to encourage the use of two-wheeled electric motorcycles, and improving the city's public transportation and commercial support facilities.
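A hedged sketch of the tuning step described above (grid search with five-fold cross-validation over an XGBoost classifier); the parameter grid and data are illustrative, since the excerpt does not give the paper's actual grid:

```python
# Illustrative grid search with five-fold CV for a four-class travel-mode task.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           n_classes=4, random_state=0)

param_grid = {  # hypothetical search space
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 300],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```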
Recent work demonstrated the existence of Boolean functions for which Shapley values provide misleading information about the relative importance of features in rule-based explanations. Such misleading information was broadly categorized into a number of possible issues. Each of those issues relates with features being relevant or irrelevant for a prediction, and all are significant regarding the inadequacy of Shapley values for rule-based explainability. This earlier work devised a brute-force approach to identify Boolean functions, defined on small numbers of features, and also associated instances, which displayed such inadequacy-revealing issues, and so served as evidence to the inadequacy of Shapley values for rule-based explainability. However, an outstanding question is how frequently such inadequacy-revealing issues can occur for Boolean functions with arbitrary large numbers of features. It is plain that a brute-force approach would be unlikely to provide insights on how to tackle this question. This paper answers the above question by proving that, for any number of features, there exist Boolean functions that exhibit one or more inadequacy-revealing issues, thereby contributing decisive arguments against the use of Shapley values as the theoretical underpinning of feature-attribution methods in explainability.
Rationale:
Prognostic tools for aiding in the treatment of hospitalized COVID-19 patients could help improve outcomes by identifying patients at higher or lower risk of severe disease. The study objective was to develop models to stratify patients by risk of severe outcomes during COVID-19 hospitalization using readily available information at hospital admission.
Methods:
Hierarchical ensemble classification models were trained on a set of 229 patients hospitalized with COVID-19 to predict severe outcomes, including ICU admission, development of acute respiratory distress syndrome, or intubation, using easily attainable attributes including basic patient characteristics, vital signs at admission, and basic lab results collected at time of presentation. Each test stratifies patients into groups of increasing risk. An additional cohort of 330 patients was used for blinded, independent validation. Shapley value analysis evaluated which attributes contributed most to the models' predictions of risk.
Main results:
Test performance was assessed using precision (positive predictive value) and recall (sensitivity) of the final risk groups. All test cut-offs were fixed prior to blinded validation. In development and validation, the tests achieved precision in the lowest risk groups near or above 0.9. The proportion of patients with severe outcomes significantly increased across increasing risk groups. While the importance of attributes varied by test and patient, C-reactive protein, lactate dehydrogenase, and D-dimer were often found to be important in the assignment of risk.
Conclusions:
Risk of severe outcomes for patients hospitalized with COVID-19 infection can be assessed using machine learning-based models based on attributes routinely collected at hospital admission.
Tree-based machine learning models such as random forests, decision trees and gradient boosted trees are popular nonlinear predictive models, yet comparatively little attention has been paid to explaining their predictions. Here we improve the interpretability of tree-based models through three main contributions. (1) A polynomial time algorithm to compute optimal explanations based on game theory. (2) A new type of explanation that directly measures local feature interaction effects. (3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to (1) identify high-magnitude but low-frequency nonlinear mortality risk factors in the US population, (2) highlight distinct population subgroups with shared risk characteristics, (3) identify nonlinear interaction effects among risk factors for chronic kidney disease and (4) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model’s performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains. Tree-based machine learning models are widely used in domains such as healthcare, finance and public services. The authors present an explanation method for trees that enables the computation of optimal local explanations for individual predictions, and demonstrate their method on three medical datasets.
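Contribution (2), local interaction effects, is exposed in the shap library as interaction values; a minimal sketch with placeholder data and model:

```python
# Illustrative SHAP interaction values for a tree ensemble (placeholder setup).
import shap
import xgboost
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
model = xgboost.XGBRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
# For each instance, an n_features x n_features matrix: diagonal entries are
# main effects, off-diagonal entries are pairwise interaction effects, and
# the matrix (plus the base value) sums to the model's prediction.
interactions = explainer.shap_interaction_values(X[:10])
```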
Infections have become the major cause of morbidity and mortality among patients with chronic lymphocytic leukemia (CLL) due to immune dysfunction and cytotoxic CLL treatment. Yet, predictive models for infection are missing. In this work, we develop the CLL Treatment-Infection Model (CLL-TIM) that identifies patients at risk of infection or CLL treatment within 2 years of diagnosis as validated on both internal and external cohorts. CLL-TIM is an ensemble algorithm composed of 28 machine learning algorithms based on data from 4,149 patients with CLL. The model is capable of dealing with heterogeneous data, including the high rates of missing data to be expected in the real-world setting, with a precision of 72% and a recall of 75%. To address concerns regarding the use of complex machine learning algorithms in the clinic, for each patient with CLL, CLL-TIM provides explainable predictions through uncertainty estimates and personalized risk factors. Chronic lymphocytic leukemia is an indolent disease, and many patients succumb to infection rather than the direct effects of the disease. Here, the authors use medical records and machine learning to predict the patients that may be at risk of infection, which may enable a change in the course of their treatment.
Highly specific Cas9 nucleases derived from SpCas9 are valuable tools for genome editing, but their wide application is hampered by a lack of knowledge governing guide RNA (gRNA) activity. Here, we perform a genome-scale screen to measure gRNA activity for two highly specific SpCas9 variants (eSpCas9(1.1) and SpCas9-HF1) and wild-type SpCas9 (WT-SpCas9) in human cells, and obtain indel rates of over 50,000 gRNAs for each nuclease, covering ~20,000 genes. We evaluate the contribution of 1,031 features to gRNA activity and develop models for activity prediction. Our data reveal that a combination of an RNN with important biological features outperforms other models for activity prediction. We further demonstrate that our model outperforms other popular gRNA design tools. Finally, we develop an online design tool, DeepHF, for the three Cas9 nucleases. The database, as well as the designer tool, is freely accessible via a web server, http://www.DeepHF.com/ .
Background
Modern molecular profiling techniques are yielding vast amounts of data from patient samples that could be utilized with machine learning methods to provide important biological insights and improvements in patient outcomes. Unsupervised methods have been successfully used to identify molecularly-defined disease subtypes. However, these approaches do not take advantage of potential additional clinical outcome information. Supervised methods can be implemented when training classes are apparent (e.g., responders or non-responders to treatment). However, training classes can be difficult to define when assessing relative benefit of one therapy over another using gold standard clinical endpoints, since it is often not clear how much benefit each individual patient receives.
Results
We introduce an iterative approach to binary classification tasks based on the simultaneous refinement of training class labels and classifiers towards self-consistency. As training labels are refined during the process, the method is well suited to cases where training class definitions are not obvious or noisy. Clinical data, including time-to-event endpoints, can be incorporated into the approach to enable the iterative refinement to identify molecular phenotypes associated with a particular clinical variable. Using synthetic data, we show how this approach can be used to increase the accuracy of identification of outcome-related phenotypes and their associated molecular attributes. Further, we demonstrate that the advantages of the method persist in real world genomic datasets, allowing the reliable identification of molecular phenotypes and estimation of their association with outcome that generalizes to validation datasets. We show that at convergence of the iterative refinement, there is a consistent incorporation of the molecular data into the classifier yielding the molecular phenotype and that this allows a robust identification of associated attributes and the underlying biological processes.
Conclusions
The consistent incorporation of the structure of the molecular data into the classifier helps to minimize overfitting and facilitates not only good generalization of classification and molecular phenotypes, but also reliable identification of biologically relevant features and elucidation of underlying biological processes.
Technologies for data production and collection have advanced rapidly, and as a result, data storage and accumulation have become automatic. Data mining is the tool for extracting unobserved, useful information from these huge amounts of data; without it, we have rich data but poor information, and that information may be incorrect. This paper presents a review of data mining that surveys data mining techniques and focuses on the popular decision tree algorithms (C4.5 and ID3) together with their learning tools. Different datasets have been used in experiments to demonstrate their precision.
Understanding why a model makes a certain prediction can be as crucial as the prediction's accuracy in many applications. However, the highest accuracy for large modern datasets is often achieved by complex models that even experts struggle to interpret, such as ensemble or deep learning models, creating a tension between accuracy and interpretability. In response, various methods have recently been proposed to help users interpret the predictions of complex models, but it is often unclear how these methods are related and when one method is preferable over another. To address this problem, we present a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations). SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties. The new class unifies six existing methods, notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, we present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.
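In the paper's notation, the class of additive feature attribution methods assigns, for simplified binary inputs $z' \in \{0,1\}^M$ over $M$ features, an explanation model

$$g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i,$$

where $\phi_0$ is a base value and $\phi_i$ is the importance attributed to feature $i$; SHAP values are the unique $\phi_i$ in this class satisfying the paper's desirable properties.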
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
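A minimal usage sketch of the uniform estimator API the abstract describes (the dataset and estimator choices are illustrative):

```python
# Illustrative scikit-learn workflow: the same fit/predict API applies to
# essentially every estimator in the library.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```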
Bridging the gap between animal or in vitro models and human disease is essential in medical research. Researchers often suggest that a biological mechanism is relevant to human cancer from the statistical association of a gene expression marker (a signature) of this mechanism, discovered in an experimental system, with disease outcome in humans. We examined this argument for breast cancer. Surprisingly, we found that gene expression signatures unrelated to cancer (of the effect of postprandial laughter, of social defeat in mice, and of skin fibroblast localization) were all significantly associated with breast cancer outcome. We next compared 47 published breast cancer outcome signatures to signatures made of random genes. Twenty-eight of them (60%) were not significantly better outcome predictors than random signatures of identical size, and 11 (23%) were worse predictors than the median random signature. More than 90% of random signatures of >100 genes were significant outcome predictors. We next derived a metagene, called meta-PCNA, by selecting the 1% of genes most positively correlated with the proliferation marker PCNA in a compendium of normal tissue expression. Adjusting breast cancer expression data for meta-PCNA abrogated almost entirely the outcome association of published and random signatures. We also found that, in the absence of adjustment, the hazard ratio of a signature's outcome association correlated strongly with meta-PCNA (R² = 0.9). This relation also applied to single-gene expression markers. Moreover, >50% of the breast cancer transcriptome was correlated with meta-PCNA. A corollary was that purging cell cycle genes from a signature failed to rule out the confounding effect of proliferation. Hence, it is questionable to suggest that a mechanism is relevant to human breast cancer from the finding that a gene expression marker for this mechanism predicts human breast cancer outcome, because most markers do. The methods we present help to overcome this problem.
Non-biological experimental variation, or "batch effects", is commonly observed across multiple batches of microarray experiments, often rendering the task of combining data from these batches difficult. The ability to combine microarray data sets is advantageous to researchers to increase statistical power to detect biological phenomena from studies where logistical considerations restrict sample size, or in studies that require the sequential hybridization of arrays. In general, it is inappropriate to combine data sets without adjusting for batch effects. Methods have been proposed to filter batch effects from data, but these are often complicated and require large batch sizes (>25) to implement. Because the majority of microarray studies are conducted using much smaller sample sizes, existing methods are not sufficient. We propose parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples. We illustrate our methods using two example data sets and show that our methods are justifiable, easy to apply, and useful in practice. Software for our method is freely available at: http://biosun1.harvard.edu/complab/batch/.
Many COVID-19 patients infected by the SARS-CoV-2 virus develop pneumonia (called novel coronavirus pneumonia, NCP) and rapidly progress to respiratory failure. However, rapid diagnosis and identification of high-risk patients for early intervention are challenging. Using a large computed tomography (CT) database from 4,154 patients, we developed an AI system that can diagnose NCP and differentiate it from other common pneumonia and normal controls. The AI system can assist radiologists and physicians in performing a quick diagnosis, especially when the health system is overloaded. Significantly, our AI system identified important clinical markers that correlated with the NCP lesion properties. Together with the clinical data, our AI system was able to provide an accurate clinical prognosis that can aid clinicians in considering appropriate early clinical management and allocating resources appropriately. We have made this AI system available globally to assist clinicians in combating COVID-19.
Study reveals rampant racism in decision-making software used by US hospitals — and highlights ways to correct it.
Racial bias in health algorithms
The U.S. health care system uses commercial algorithms to guide health decisions. Obermeyer et al. find evidence of racial bias in one widely used algorithm, such that Black patients assigned the same level of risk by the algorithm are sicker than White patients (see the Perspective by Benjamin). The authors estimated that this racial bias reduces the number of Black patients identified for extra care by more than half. Bias occurs because the algorithm uses health costs as a proxy for health needs. Less money is spent on Black patients who have the same level of need, and the algorithm thus falsely concludes that Black patients are healthier than equally sick White patients. Reformulating the algorithm so that it no longer uses costs as a proxy for needs eliminates the racial bias in predicting who needs extra care.
Science, this issue p. 447; see also p. 421.
In this paper, a brief survey of data mining classification using machine learning techniques is presented. Machine learning techniques like decision trees and support vector machines play an important role in all applications of artificial intelligence. Decision trees work efficiently with discrete data, and SVMs are capable of building nonlinear boundaries among the classes. Both of these techniques have their own sets of strengths, which make them suitable for almost all classification tasks.
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
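A hedged sketch of basic XGBoost usage via its native DMatrix/train interface (data and parameters are illustrative, not from the paper's benchmarks):

```python
# Illustrative gradient-boosted trees with XGBoost's native API, including
# early stopping on a validation split (placeholder data and parameters).
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=200,
                    evals=[(dvalid, "valid")], early_stopping_rounds=20)
```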
A decision tree is a tree whose internal nodes can be taken as tests (on input data patterns) and whose leaf nodes can be taken as categories (of these patterns). These tests are filtered down through the tree to get the right output for the input pattern. Decision tree algorithms can be applied in various fields: as a replacement for statistical procedures to find data, to extract text, to find missing data in a class, to improve search engines, and in various medical applications. Many decision tree algorithms have been formulated, differing in accuracy and cost-effectiveness, so it is important to know which algorithm is best to use. ID3 is one of the oldest decision tree algorithms. It is very useful for making simple decision trees, but as complexity increases, its accuracy in producing good decision trees decreases. Hence, the IDA (intelligent decision tree algorithm) and C4.5 algorithms have been formulated.
The primary aim of this paper is to show how graphical models can be used as a mathematical language for integrating statistical and subject-matter information. In particular, the paper develops a principled, nonparametric framework for causal inference, in which diagrams are queried to determine if the assumptions available are sufficient for identifying causal effects from nonexperimental data. If so, the diagrams can be queried to produce mathematical expressions for causal effects in terms of observed distributions; otherwise, the diagrams can be queried to suggest additional observations or auxiliary experiments from which the desired inferences can be obtained.
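As one standard instance of such a query (my illustrative choice, not a quotation from this abstract): when a set of observed covariates $Z$ satisfies the back-door criterion relative to $(X, Y)$, the causal effect is identified from nonexperimental data by the back-door adjustment formula

$$P(y \mid \mathrm{do}(x)) = \sum_{z} P(y \mid x, z)\,P(z).$$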
Composed in honour of the sixty-fifth birthday of Lloyd Shapley, this volume makes accessible the large body of work that has grown out of Shapley's seminal 1953 paper. Each of the twenty essays concerns some aspect of the Shapley value. Three of the chapters are reprints of the 'ancestral' papers: Chapter 2 is Shapley's original 1953 paper defining the value; Chapter 3 is the 1954 paper by Shapley and Shubik applying the value to voting models; and Chapter 19 is Shapley's 1969 paper defining a value for games without transferable utility. The other seventeen chapters were contributed especially for this volume. The first chapter introduces the subject and the other essays in the volume, and contains a brief account of a few of Shapley's other major contributions to game theory. The other chapters cover the reformulations, interpretations and generalizations that have been inspired by the Shapley value, and its applications to the study of coalition formation, to the organization of large markets, to problems of cost allocation, and to the study of games in which utility is not transferable.
In this paper, we present a novel method for explaining the decisions of an arbitrary classifier, independent of the type of classifier. The method works at the instance level, decomposing the model’s prediction for an instance into the contributions of the attributes’ values. We use several artificial data sets and several different types of models to show that the generated explanations reflect the decision-making properties of the explained model and approach the concepts behind the data set as the prediction quality of the model increases. The usefulness of the method is justified by a successful application on a real-world breast cancer recurrence prediction problem.
Machine learning approaches have wide applications in bioinformatics, and decision tree is one of the successful approaches applied in this field. In this chapter, we briefly review decision tree and related ensemble algorithms and show the successful applications of such approaches on solving biological problems. We hope that by learning the algorithms of decision trees and ensemble classifiers, biologists can get the basic ideas of how machine learning algorithms work. On the other hand, by being exposed to the applications of decision trees and ensemble algorithms in bioinformatics, computer scientists can get better ideas of which bioinformatics topics they may work on in their future research directions. We aim to provide a platform to bridge the gap between biologists and computer scientists.
Matplotlib is a 2D graphics package for Python used for application development, interactive scripting, and publication-quality image generation across user interfaces and operating systems. The latest release of matplotlib runs on all major operating systems, with binaries for Macintosh's OS X, Microsoft Windows, and the major Linux distributions. Matplotlib has a Matlab emulation environment called PyLab, which is a simple wrapper of the matplotlib API. Matplotlib provides access to basic GUI events such as button_press_event and mouse_motion_event; code can register with these events to receive callbacks, and event-handling code written in matplotlib works across many different GUIs. It supports toolkits for domain-specific plotting functionality that is either too big or too narrow in purpose for the main distribution. Matplotlib has three basic API classes: FigureCanvasBase, RendererBase, and Artist.
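A minimal sketch of the event-handling API mentioned above, registering a callback for button_press_event (the figure contents are illustrative):

```python
# Illustrative GUI-neutral event handling in matplotlib.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

def on_press(event):
    # The event object carries GUI-independent data such as data-space coords.
    print(f"clicked at x={event.xdata}, y={event.ydata}")

fig.canvas.mpl_connect("button_press_event", on_press)
plt.show()
```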