Article

Interpretable Classification Models for Recidivism Prediction

Authors: Jiaming Zeng, Berk Ustun, Cynthia Rudin

Abstract

We investigate a long-debated question: how to create predictive models of recidivism that are sufficiently accurate, transparent, and interpretable to use for decision-making. The question is complicated because these models are used to support different decisions, from sentencing, to determining release on probation, to allocating preventative social services. Each use case might have an objective other than classification accuracy, such as a desired true positive rate (TPR) or false positive rate (FPR). Each (TPR, FPR) pair is a point on the receiver operating characteristic (ROC) curve. We use popular machine learning methods to create models along the full ROC curve on a wide range of recidivism prediction problems. We show that many methods (SVM, Ridge Regression) produce equally accurate models along the full ROC curve. However, methods designed for interpretability (CART, C5.0) cannot be tuned to produce models that are both accurate and interpretable. To handle this shortcoming, we use a new method known as SLIM (Supersparse Linear Integer Models) to produce accurate, transparent, and interpretable models along the full ROC curve. These models can be used for decision-making across many different use cases, since they are just as accurate as the most powerful black-box machine learning models, yet completely transparent and highly interpretable.
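As a rough illustration of how a model family can be tuned to different (TPR, FPR) operating points along the ROC curve, the sketch below re-weights the positive class during training; the synthetic data, weight grid, and choice of logistic regression are placeholders, not the paper's setup. The same class-weighting idea applies to SVM or CART via their respective class-weight parameters.

```python
# Minimal sketch: tracing out (FPR, TPR) operating points by re-weighting classes.
# Synthetic placeholder data; only the mechanism is the point.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                                          # placeholder features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)   # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

operating_points = []
for w in [0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 10.0]:   # relative weight on the positive class
    clf = LogisticRegression(class_weight={0: 1.0, 1: w}, max_iter=1000)
    clf.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
    operating_points.append((fp / (fp + tn), tp / (tp + fn)))            # (FPR, TPR)

for fpr, tpr in operating_points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```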


... Fortunately, interpretable machine learning algorithms and theories of fairness surrounding AI and ML have advanced considerably over the previous few years. Multiple research papers have demonstrated that publicly available interpretable machine learning algorithms can perform at similar levels of efficacy as black-box machine learning algorithms [10,11,12]. Moreover, high-dimensional data sets on criminal recidivism have become easy to access and use in academic studies. ...
... Just like Zeng et al. [10], this study employs machine learning techniques optimized for interpretability, and addresses 12 different prediction problems. However, this study builds and improves upon the Zeng et al. [10] paper by predicting probabilities of recidivism instead of making binary predictions. Also, the Zeng et al. [10] study compares IML models only with other machine learning methods, while this study also compares against the performance of existing risk assessments: COMPAS and the Arnold PSA. ...
Thesis
With the increasing use of machine learning and artificial intelligence solutions across domains, there is also a growing reaction against black-box models. Especially for algorithms employed in critical sectors, there has been growing criticism of the black-box nature of the algorithms and their failure to satisfy competing notions of fairness. In response to these criticisms, the field of interpretable machine learning has grown significantly in recent years and has produced simple yet effective algorithms that are capable of competing with their commercial black-box alternatives. At the same time, the question of fairness remains to be explored. In this project, I enquire into the intersection of interpretability and fairness of machine learning approaches by revisiting the recidivism prediction problem. I employ some of the latest algorithms from the field of interpretable machine learning and compare them against commercially used recidivism prediction algorithms and other machine learning approaches. The models are assessed for performance, interpretability, and fairness. The commercially used risk assessment methods I compare against are COMPAS and the Arnold PSA. In this study, I present multiple models that beat these risk assessments in performance, and I provide a fairness analysis of these models. The best performing interpretable models outperform the Arnold PSA by an average performance increase of 6.50%, and COMPAS by 3.71%, while satisfying the fairness constraints. The results further imply that machine learning models should be trained separately for separate locations and updated over time to ensure the best results under the defined parameters.
... Both costs are relatively small for most datasets, but very large for a few. Second, there are datasets for which the comprehensible models perform as well as or better than the black-box models, supporting the view that one should not forgo trying comprehensible models [17]. We call these datasets "comprehensible datasets", as opposed to datasets where the black box is strictly better, which we call "opaque datasets". ...
... Trepan: We use the Python package Skater to implement TreeSurrogates,[17] which is based on [54]. The base estimator (oracle) can be any supervised learning model. ...
... com/imoscovitz/wittgenstein. [17] A. Kramer et al., Skater Python package. URL: https://github. ...
Article
Full-text available
A key challenge in Artificial Intelligence (AI) has been the potential trade-off between the accuracy and comprehensibility of machine learning models, as this also relates to their safe and trusted adoption. While there has been much discussion of this trade-off, there is no systematic study that assesses to what extent it exists, how often it occurs, and for what types of datasets. Based on the analysis of 90 benchmark classification datasets, we find that this trade-off exists for most (69%) of the datasets, but that, somewhat surprisingly, for the majority of cases it is rather small, while for only a few it is very large. Comprehensibility can be enhanced by adding yet another algorithmic step, that of surrogate modelling using so-called ‘explainable’ models. Such models can improve the accuracy-comprehensibility trade-off, especially in cases where the black box was initially better. Finally, we find that dataset characteristics related to the complexity required to model the dataset, and the level of noise, can significantly explain this trade-off and thus the cost of comprehensibility. These insights lead to specific guidelines on how and when to apply AI algorithms when comprehensibility is required.
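As a generic illustration of the surrogate-modelling step described above (not the specific pipeline or Skater/TreeSurrogate API used in the cited works), a shallow decision tree can be fit to a black-box model's predictions and its fidelity to the black box measured:

```python
# Sketch: training a comprehensible surrogate (a shallow decision tree) to mimic
# a black-box model's predictions.  Generic sklearn code on toy data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 6))                         # placeholder data
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 > 1).astype(int)

black_box = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# The surrogate is trained on the black box's *labels*, not the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=1)
surrogate.fit(X, black_box.predict(X))

fidelity = accuracy_score(black_box.predict(X), surrogate.predict(X))
print(f"fidelity to the black box: {fidelity:.3f}")
print(export_text(surrogate, feature_names=[f"x{j}" for j in range(6)]))
```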
... Fortunately, techniques for interpretable ML and theories of fairness have advanced considerably over the last few years. Multiple works have demonstrated that publicly available interpretable ML algorithms can perform as well as black-box ML algorithms (Zeng et al. 2017; Angelino et al. 2018; Lou et al. 2013). Moreover, high-dimensional data sets on criminal recidivism have become increasingly available. ...
... Similar to Zeng et al. (2017), we use ML techniques optimized for interpretability, and address multiple prediction problems. This work is an improvement in the following ways. ...
... This work is an improvement in the following ways. We use interpretable ML techniques to create risk scores representing probabilities of recidivism rather than making binary predictions, techniques which were not available at the time of publication for Zeng et al. (2017). We compare with COMPAS and the Arnold Public Safety Assessment (PSA), two models currently used in the justice system, whereas Zeng et al. (2017) compared only with other ML methods. ...
Article
Full-text available
Objectives We study interpretable recidivism prediction using machine learning (ML) models and analyze performance in terms of prediction ability, sparsity, and fairness. Unlike previous works, this study trains interpretable models that output probabilities rather than binary predictions, and uses quantitative fairness definitions to assess the models. This study also examines whether models can generalize across geographic locations. Methods We generated black-box and interpretable ML models on two different criminal recidivism datasets from Florida and Kentucky. We compared predictive performance and fairness of these models against two methods that are currently used in the justice system to predict pretrial recidivism: the Arnold PSA and COMPAS. We evaluated predictive performance of all models on predicting six different types of crime over two time spans. Results Several interpretable ML models can predict recidivism as well as black-box ML models and are more accurate than COMPAS or the Arnold PSA. These models are potentially useful in practice. Similar to the Arnold PSA, some of these interpretable models can be written down as a simple table. Others can be displayed using a set of visualizations. Our geographic analysis indicates that ML models should be trained separately for separate locations and updated over time. We also present a fairness analysis for the interpretable models. Conclusions Interpretable ML models can perform just as well as non-interpretable methods and currently-used risk assessment scales, in terms of both prediction accuracy and fairness. ML models might be more accurate when trained separately for distinct locations and kept up-to-date.
... Approaches like interpretability and explainability have been proposed as a way to bridge the gap between ML models and human understanding. These include models that are inherently interpretable (e.g., decision trees [89], simple point systems [50,124] or generalized additive models [18,37]) and post-hoc explanations for the predictions made by complex models (e.g., LIME [92], SHAP [69]). Tools that implement interpretability and explainability approaches have also been made available for public use. ...
... These properties in turn promote trustworthiness, accountability, and fair and ethical decision-making [24,65]. At a high level, interpretability approaches can be categorized into glassbox models (e.g., [18,37,50,58,89,124]) or post-hoc explanations for blackbox models (e.g., [6,69,92,99,102]). As these approaches are instantiated in user-facing tools, the static explanations output by mathematical representations of interpretability are now joined by interactive visuals output by explainable AI. ...
Preprint
Full-text available
Understanding how ML models work is a prerequisite for responsibly designing, deploying, and using ML-based systems. With interpretability approaches, ML can now offer explanations for its outputs to aid human understanding. Though these approaches rely on guidelines for how humans explain things to each other, they ultimately solve for improving the artifact: an explanation. In this paper, we propose an alternate framework for interpretability grounded in Weick's sensemaking theory, which focuses on who the explanation is intended for. Recent work has advocated for the importance of understanding stakeholders' needs; we build on this by providing concrete properties (e.g., identity, social context, environmental cues, etc.) that shape human understanding. We use an application of sensemaking in organizations as a template for discussing design guidelines for Sensible AI, AI that factors in the nuances of human cognition when trying to explain itself.
... Despite their unprecedented success in performing machine learning tasks accurately and fast, these trained models are often described as black boxes because they are so complex that their output is not easily explainable in terms of their inputs. As a result, in many cases, no explanation of decisions based on these models can be provided to those affected by them [51]. ...
... Additionally, laws and regulations have been proposed to require decisions made by the machine learning models to be accompanied with clear explanations for the individuals affected by the decisions [43]. Several methods have been developed to explain the outputs of models simpler than deep learning models to non-expert users such as administrators or clinicians [18,27,34,51]. In contrast, existing interpretation methods for deep learning models either lack the ability to directly communicate with non-expert users or have limitations in their scope, computational ability, or accuracy, as we will explain in the next section. ...
Article
Full-text available
Deep learning models have been criticized for their lack of easy interpretation, which undermines confidence in their use for important applications. Nevertheless, they are consistently utilized in many applications, consequential to humans’ lives, usually because of their better performance. Therefore, there is a great need for computational methods that can explain, audit, and debug such models. Here, we use flip points to accomplish these goals for deep learning classifiers used in social applications. A trained deep learning classifier is a mathematical function that maps inputs to classes. By way of training, the function partitions its domain and assigns a class to each of the partitions. Partitions are defined by the decision boundaries which are expected to be geometrically complex. This complexity is usually what makes deep learning models powerful classifiers. Flip points are points on those boundaries and, therefore, the key to understanding and changing the functional behavior of models. We use advanced numerical optimization techniques and state-of-the-art methods in numerical linear algebra, such as rank determination and reduced-order models to compute and analyze them. The resulting insight into the decision boundaries of a deep model can clearly explain the model’s output on the individual level, via an explanation report that is understandable by non-experts. We also develop a procedure to understand and audit model behavior towards groups of people. We show that examining decision boundaries of models in certain subspaces can reveal hidden biases that are not easily detectable. Flip points can also be used as synthetic data to alter the decision boundaries of a model and improve their functional behaviors. We demonstrate our methods by investigating several models trained on standard datasets used in social applications of machine learning. We also identify the features that are most responsible for particular classifications and misclassifications. Finally, we discuss the implications of our auditing procedure in the public policy domain.
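A rough sketch of the flip-point idea, finding the closest input at which a trained classifier's predicted probability crosses 0.5, is given below; it uses a generic constrained optimizer on toy data and is not the authors' numerical procedure (which relies on more advanced optimization and linear-algebra tools).

```python
# Sketch of a flip point: the nearest point to a given input x0 that lies on the
# decision boundary of a trained classifier.  Illustrative only.
import numpy as np
from scipy.optimize import minimize
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] - X[:, 1] ** 2 + 0.5 > 0).astype(int)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=2).fit(X, y)

x0 = X[0]                                 # the individual input we want to explain

def boundary(x):                          # zero exactly on the decision boundary
    return clf.predict_proba(x.reshape(1, -1))[0, 1] - 0.5

res = minimize(lambda x: np.sum((x - x0) ** 2), x0 + 0.1,
               method="SLSQP", constraints=[{"type": "eq", "fun": boundary}])
flip_point = res.x
print("input:      ", np.round(x0, 2))
print("flip point: ", np.round(flip_point, 2))
print("features that changed most:", np.argsort(-np.abs(flip_point - x0))[:2])
```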
... In the field of Explainable AI, a myriad of research has aimed at increasing system transparency of machine learning systems. Popular domains in this body of work include recommender systems [34,64,77,98], healthcare applications [16,18,52,93], finance [12,29,42,45], hiring [39,72], and criminal justice [94,97,102]. Explanations in all these domains have aimed to make the systems more interpretable and to explain the outcomes to the end-users. ...
... Explanations in all these domains have aimed to make the systems more interpretable and to explain the outcomes to the end-users. Prior work has explored a range of explanation approaches including: input influence [5,30,102] (the degree of influence of each input on the system output); sensitivity based [5,87,91] (how much the value of an input would have to differ to change the output); demographic-based [1,5,98] (aggregate statistics on the outcome classes for people in the same demographic categories as the decision subject); case-based [5,14,79] (using an example instance from the training data to explain the outcome); white-box [20] (showing the internal workings of an algorithm); and visual explanations [50,60,96] (explaining the outcomes or the model through a visual analytics interface). Except for case-based explanations, most of these approaches have focused on explaining the decision process or the decision factors. ...
... Relaxing the assumption of correct specification connects this paper to classification problems with exogenous constraints. Such problems are studied in machine learning and statistics, and include interpretable classification (e.g., Zeng et al. (2017) and Zhang et al. (2018)), fair classification (e.g., Dwork et al. (2012)), and monotone classification (e.g., Cano et al. (2019)). Some works in the existing literature adopt a surrogate loss approach. ...
... Decision-makers may prefer simple decision or classification rules that are easily understood or explained, even at the cost of harming prediction accuracy. This concept, often referred to as interpretable machine learning, has been pursued, for instance, in the prediction analysis of recidivism (Zeng et al. (2017)) and the decision on medical intervention protocol (Zhang et al. (2018)). An example is a linear classification rule, in which G is a class of half-spaces with linear boundaries in X, ...
Preprint
Full-text available
Modern machine learning approaches to classification, including AdaBoost, support vector machines, and deep neural networks, utilize surrogate loss techniques to circumvent the computational complexity of minimizing empirical classification risk. These techniques are also useful for causal policy learning problems, since estimation of individualized treatment rules can be cast as a weighted (cost-sensitive) classification problem. Consistency of the surrogate loss approaches studied in Zhang (2004) and Bartlett et al. (2006) crucially relies on the assumption of correct specification, meaning that the specified set of classifiers is rich enough to contain a first-best classifier. This assumption is, however, less credible when the set of classifiers is constrained by interpretability or fairness, leaving the applicability of surrogate loss based algorithms unknown in such second-best scenarios. This paper studies consistency of surrogate loss procedures under a constrained set of classifiers without assuming correct specification. We show that in the setting where the constraint restricts the classifier's prediction set only, hinge losses (i.e., $\ell_1$-support vector machines) are the only surrogate losses that preserve consistency in second-best scenarios. If the constraint additionally restricts the functional form of the classifier, consistency of a surrogate loss approach is not guaranteed even with hinge loss. We therefore characterize conditions for the constrained set of classifiers that can guarantee consistency of hinge risk minimizing classifiers. Exploiting our theoretical results, we develop robust and computationally attractive hinge loss based procedures for a monotone classification problem.
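For reference, the hinge surrogate discussed above replaces the 0-1 classification risk by a convex upper bound (standard definitions, with labels $y \in \{-1,+1\}$ and score function $f$, not the cited paper's exact notation):

```latex
\phi_{\mathrm{hinge}}(z) = \max\{0,\, 1 - z\},
\qquad
R_{\phi}(f) = \mathbb{E}\!\left[\phi_{\mathrm{hinge}}\!\big(Y f(X)\big)\right]
\;\ge\; \Pr\!\big(\operatorname{sign} f(X) \neq Y\big).
```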
... Integer programming (IP) solvers are the most widely-used tools for solving discrete optimization problems. They have numerous applications in machine learning, operations research, and many other fields, including MAP inference [22], combinatorial auctions [33], natural language processing [23], neural network verification [11], interpretable classification [37], training of optimal decision trees [9], and optimal clustering [30], among many others. ...
Preprint
Full-text available
The incorporation of cutting planes within the branch-and-bound algorithm, known as branch-and-cut, forms the backbone of modern integer programming solvers. These solvers are the foremost method for solving discrete optimization problems and thus have a vast array of applications in machine learning, operations research, and many other fields. Choosing cutting planes effectively is a major research topic in the theory and practice of integer programming. We conduct a novel structural analysis of branch-and-cut that pins down how every step of the algorithm is affected by changes in the parameters defining the cutting planes added to the input integer program. Our main application of this analysis is to derive sample complexity guarantees for using machine learning to determine which cutting planes to apply during branch-and-cut. These guarantees apply to infinite families of cutting planes, such as the family of Gomory mixed integer cuts, which are responsible for the main breakthrough speedups of integer programming solvers. We exploit geometric and combinatorial structure of branch-and-cut in our analysis, which provides a key missing piece for the recent generalization theory of branch-and-cut.
... For example, previous investigations indicated the risk of discrimination against particular minority groups in terms of biases regarding race and gender (Angwin et al., 2016; Hardt et al., 2016). However, it shall be noted that there are new advances in this field, with promising findings of easier-to-interpret and more transparent ML models which can be applied for binary decisions as well as for a number of risk categories (Rudin & Ustun, 2018; Zeng et al., 2017; Zhang et al., 2021). ...
Preprint
Full-text available
Actuarial risk assessment instruments (ARAIs) are widely used for the prediction of recidivism in individuals convicted of sexual offenses. Although many studies supported the use of ARAIs because they outperformed unstructured judgments, improving their predictive performance remains an ongoing challenge. Machine learning (ML) algorithms, like random forests, are able to detect patterns in data useful for prediction purposes without being explicitly programmed. In contrast to logistic regression, random forests are able to consider nonlinear effects between risk factors and the criterion in order to enhance predictive validity. Therefore, the current study compares conventional logistic regression analyses with the random forest algorithm on a sample of N = 511 adult male individuals convicted of sexual offenses. Data were collected at the Federal Evaluation Center for Violent and Sexual Offenders (FECVSO) in Austria within a prospective-longitudinal research design, and participants were followed up for an average of M = 8.2 years. The Static-99, containing static risk factors, and the Stable-2007, containing stable dynamic risk factors, were included as predictors. The results demonstrated no superior predictive performance of the random forest compared to logistic regression; furthermore, methods of interpretable machine learning did not point to any robust nonlinear effects. Altogether, the results supported the statistical use of logistic regression for the development and clinical application of ARAIs.
... It has been shown that decision support tools that build on rounded regression coefficients (e.g., those from logistic regression models) do not achieve the maximum accuracy possible [47]. Supersparse Linear Integer Models (SLIM) were introduced as an optimal way to create point scoring mechanisms [48,49] in a format that is familiar to physicians [50][51][52]. ...
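As an illustration of the point-score format mentioned above, a SLIM-style scoring sheet reduces prediction to adding a few small integer points by hand; the features, point values, and threshold below are hypothetical, not taken from any published model.

```python
# Illustrative SLIM-style scoring sheet (hypothetical points and threshold):
# the user adds integer points by hand and compares the total to a threshold.
SCORE_SHEET = {
    "prior_arrests>=3":        2,
    "age_at_release<25":       2,
    "prior_arrest_for_felony": 1,
    "age_at_release>=40":     -1,
}
THRESHOLD = 2   # predict the positive outcome when total points >= THRESHOLD

def predict(person: dict) -> tuple[int, bool]:
    """Return (total points, predicted positive?) for one individual."""
    total = sum(points for feature, points in SCORE_SHEET.items() if person.get(feature, False))
    return total, total >= THRESHOLD

example = {"prior_arrests>=3": True, "age_at_release<25": False, "age_at_release>=40": True}
print(predict(example))   # -> (1, False)
```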
Article
Full-text available
The opioid epidemic is a major policy concern. The widespread availability of opioids, which is fueled by physician prescribing patterns, medication diversion, and the interaction with potential illicit opioid use, has been implicated as proximal cause for subsequent opioid dependence and mortality. Risk indicators related to chronic opioid therapy (COT) at the point of care may influence physicians’ prescribing decisions, potentially reducing rates of dependency and abuse. In this paper, we investigate the performance of machine learning algorithms for predicting the risk of COT. Using data on over 12 million observations of active duty US Army soldiers, we apply machine learning models to predict the risk of COT in the initial months of prescription. We use the area under the curve (AUC) as an overall measure of model performance, and we focus on the positive predictive value (PPV), which reflects the models’ ability to accurately target military members for intervention. Of the many models tested, AUC ranges between 0.83 and 0.87. When we focus on the top 1% of members at highest risk, we observe a PPV value of 8.4% and 20.3% for months 1 and 3, respectively. We further investigate the performance of sparse models that can be implemented in sparse data environments. We find that when the goal is to identify patients at the highest risk of chronic use, these sparse linear models achieve a performance similar to models trained on hundreds of variables. Our predictive models exhibit high accuracy and can alert prescribers to the risk of COT for the highest risk patients. Optimized sparse models identify a parsimonious set of factors to predict COT: initial supply of opioids, the supply of opioids in the month being studied, and the number of prescriptions for psychotropic medications. Future research should investigate the possible effects of these tools on prescriber behavior (e.g., the benefit of clinician nudging at the point of care in outpatient settings).
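The positive predictive value among the top 1% highest-risk cases, the targeting metric reported above, can be computed directly from risk scores; a minimal sketch on synthetic scores follows (only the computation is the point, the data are made up):

```python
# Sketch: PPV among the top k% highest-risk cases.
import numpy as np

def ppv_at_top_k(y_true, risk_scores, k=0.01):
    """Fraction of true positives among the k highest-scoring fraction of cases."""
    n_top = max(1, int(np.ceil(k * len(risk_scores))))
    top_idx = np.argsort(-np.asarray(risk_scores))[:n_top]
    return float(np.mean(np.asarray(y_true)[top_idx]))

rng = np.random.default_rng(3)
risk = rng.uniform(size=100_000)
outcome = rng.binomial(1, p=0.02 + 0.15 * risk)   # higher score -> higher base rate
print(f"PPV at top 1%: {ppv_at_top_k(outcome, risk, k=0.01):.3f}")
```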
... As Machine Learning (ML) models become increasingly accurate and accessible, they are now used in a wide variety of domains to support humans in making important decisions, such as making medical diagnoses [35], assessing recidivism risks for prisoners [38], or filtering applicants for a managing position [33]. Yet, a more widespread adoption of ML models is for now limited by the difficulty of providing explanations about the rationale behind their predictions [5,36]. ...
Conference Paper
Full-text available
The increasing usage of complex Machine Learning models for decision-making has raised interest in explainable artificial intelligence (XAI). In this work, we focus on the effects of providing accessible and useful explanations to non-expert users. More specifically, we propose generic XAI design principles for contextualizing and allowing the exploration of explanations based on local feature importance. To evaluate the effectiveness of these principles for improving users' objective understanding and satisfaction, we conduct a controlled user study with 80 participants using 4 different versions of our XAI system, in the context of an insurance scenario. Our results show that the contextualization principles we propose significantly improve users' satisfaction and come close to having a significant impact on users' objective understanding. They also show that the exploration principles we propose improve users' satisfaction. On the other hand, the interaction of these principles does not appear to bring improvement on both dimensions of users' understanding.
... Explainability methods are usually used in one of two ways. AI models that are designed to be inherently interpretable, often because of their simplicity, e.g., generalised additive models (GAMs) (Caruana et al., 2015) or simple point systems (Jung et al., 2017; Zeng et al., 2017), are naturally explainable because they allow us to calculate the contribution of each feature to the final prediction in an additive, piece-by-piece way, which makes it easier for people to understand the degree of influence of each feature and allows us to obtain useful information about the predictions of the model. The second group of explainability techniques provides post hoc explanations for the predictions made by complex models. ...
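The per-feature contributions referred to above are simply the additive terms of such a model; for a linear (logistic) model each term is a coefficient times the feature value, as in this toy sketch (feature names are illustrative):

```python
# Sketch: per-feature contributions of an inherently interpretable (linear) model.
# Prediction (log-odds) = intercept + sum_j coef_j * x_j, so each term is directly readable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = (1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=500) > 0).astype(int)
clf = LogisticRegression().fit(X, y)

x = X[0]
contributions = clf.coef_[0] * x                 # one additive term per feature (log-odds scale)
for name, c in zip(["priors", "age", "charge_degree"], contributions):
    print(f"{name:15s} {c:+.2f}")
print(f"{'intercept':15s} {clf.intercept_[0]:+.2f}")
print("log-odds =", contributions.sum() + clf.intercept_[0])
```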
Article
Full-text available
The influence of Artificial Intelligence is growing, as is the need to make it as explainable as possible. Explainability is one of the main obstacles that AI faces today on the way to more practical implementation. In practice, companies need to use models that balance interpretability and accuracy to make more effective decisions, especially in the field of finance. The main advantages of the multi-criteria decision-making principle (MCDM) in financial decision-making are the ability to structure complex evaluation tasks that allow for well-founded financial decisions, the application of quantitative and qualitative criteria in the analysis process, the possibility of transparency in evaluation, and the introduction of improved, universal and practical academic methods to the financial decision-making process. This article presents a review and classification of multi-criteria decision-making methods that help to achieve the goal of forthcoming research: to create artificial intelligence-based methods that are explainable, transparent, and interpretable for most investment decision-makers.
... Considering that these less favorable findings for ML methods might result from improper optimizations during the training phase of model building (Berk & Bleich, 2013), we tuned the models to maximize predictive accuracy after seeking optimal tuning parameters via systematic cross-validation procedures. In doing so, we attempted to make the ML procedures as transparent and interpretable as possible for both researchers and practitioners to address the "black box" issues inherent in the application and interpretation of ML methods (Zeng et al., 2017). ...
Article
Full-text available
Although machine learning (ML) methods have recently gained popularity in both academia and industry as alternative risk assessment tools for efficient decision-making, inconsistent patterns are observed in the existing literature regarding their competitiveness and utility in predicting various outcomes. Drawing on a sample of the general youth population in the U.S., we compared the predictive accuracy of logistic regression (LR) and neural networks (NNs), which are the most widely applied approaches in conventional statistics and contemporary ML methods, respectively, by adopting many theoretically relevant predictors of the future arrest outcome. Even after fully implementing rigorous ML protocols for model tuning and up-sampling and down-sampling procedures recommended in recent literature to optimize learning algorithms, NNs did not yield substantially improved performance over LR if we still rely on a conventional dataset with relatively small sample sizes and a limited number of predictors. Nonetheless, we encourage more rigorous, comprehensive, and diverse evaluation research for a complete understanding of the ML potential in predictive capacity and the contingencies in which modern ML methods can perform better than conventional parametric statistical models.
... An unknown data point $x$ is mapped to the class $c_{\omega_{s(\xi_{NG}(x,W))}}$ according to the WTA rule (1). This generalized matrix LVQ (GMLVQ) is mathematically proven to be a large-margin classifier and robust, while keeping the interpretability according to the prototype reference principle [21,25]. Its cost function $E_{GMLVQ}(W) = \sum_x E_L(x, W, \mathbf{W})$ approximates the overall classification error with local errors $E_L(x, W, \mathbf{W}) = f(\mu(\xi(x, W), \mathbf{W}))$. ...
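For orientation, the local error in the excerpt has the standard GLVQ/GMLVQ form, written here in the usual notation of that literature rather than the cited paper's exact symbols:

```latex
\mu(x) \;=\; \frac{d^{+}(x) - d^{-}(x)}{d^{+}(x) + d^{-}(x)} \;\in\; [-1, 1],
\qquad
E_{\mathrm{GMLVQ}} \;=\; \sum_{x} f\big(\mu(x)\big),
```

where $d^{+}(x)$ and $d^{-}(x)$ are the (adaptive-metric) distances from $x$ to the closest prototype of the correct class and of any other class, respectively, and $f$ is a monotone squashing function such as the sigmoid.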
... is the main research subject of the study. When the graphical findings in Figure 2 are evaluated, it can be seen that positive relationships, including a sigmoid trend, between DBH and TTH and TTV can be obtained for the simulation data set. Managing the network structure of DLA models, which are described as "black boxes" by some researchers (Zeng et al., 2017; Angelino et al., 2018; Rudin et al., 2020), in such a way that they do not produce overfitted predictions has allowed such predictions to be obtained in accordance with biological laws. The management of the network structure of DLA models could be achieved by customizing these hyper-parameters. In this respect, it is of great impo ...
Conference Paper
Full-text available
Satisfying the various statistical assumptions required when applying regression models, whose use in forestry dates back to the 1930s, is the most challenging problem in the use of these models. In the literature on forestry biometrics, "nonlinear mixed effect regression models" and "autoregressive models" are techniques that have been proposed and used to solve the estimation problems that occur when these assumptions are not met. Artificial intelligence models, one family of machine learning techniques, have been proposed as another alternative when these statistical assumptions cannot be satisfied. In studies of artificial intelligence models, the success of these AI models is evaluated by certain statistics, such as bias, RMSE, AIC, and BIC. However, these evaluations do not consider how well the prediction models satisfy the biological realism expected from tree growth theories, which is a very important issue in forestry. This study aims to evaluate the ability of machine learning techniques such as Deep Learning Algorithms to provide the expected biological realism in predicting individual tree volume. Moreover, it aims to analyze the network parameters that make up the network topologies of these machine learning techniques and to identify the network parameter values that best provide this biological realism for individual tree volume growth.
... Some have made headway in developing inherently interpretable models. Such models need to be designed and tailored for individual tasks, e.g., financial lending [12], embryo selection [1], recidivism prediction [55], etc. This judicious use of models is advantageous, especially for high-stakes decision-making. ...
Preprint
Many applications affecting human lives rely on models that have come to be known under the umbrella of machine learning and artificial intelligence. These AI models are usually complicated mathematical functions that map from an input space to an output space. Stakeholders are interested to know the rationales behind models' decisions and functional behavior. We study this functional behavior in relation to the data used to create the models. On this topic, scholars have often assumed that models do not extrapolate, i.e., they learn from their training samples and process new input by interpolation. This assumption is questionable: we show that models extrapolate frequently; the extent of extrapolation varies and can be socially consequential. We demonstrate that extrapolation happens for a substantial portion of datasets, more than one would consider reasonable. How can we trust models if we do not know whether they are extrapolating? Given a model trained to recommend clinical procedures for patients, can we trust the recommendation when the model considers a patient older or younger than all the samples in the training set? If the training set is mostly Whites, to what extent can we trust its recommendations about Black and Hispanic patients? Along which dimensions (race, gender, or age) does extrapolation happen? Even if a model is trained on people of all races, it still may extrapolate in significant ways related to race. The leading question is, to what extent can we trust AI models when they process inputs that fall outside their training set? This paper investigates several social applications of AI, showing how models extrapolate without notice. We also look at different sub-spaces of extrapolation for specific individuals subject to AI models and report how these extrapolations can be interpreted, not mathematically, but from a humanistic point of view.
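One common way to make "the model is extrapolating" operational is to test whether a query point lies inside the convex hull of the training data; this is an illustrative proxy (the paper's own definition may differ) and reduces to a linear feasibility problem:

```python
# Sketch: flagging extrapolation as "query point outside the convex hull of the
# training set".  Hull membership is a linear-programming feasibility check.
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(query, points):
    """True if `query` can be written as a convex combination of rows of `points`."""
    n = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, n))])      # sum_i w_i x_i = q,  sum_i w_i = 1
    b_eq = np.concatenate([query, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success

rng = np.random.default_rng(5)
train = rng.uniform(size=(200, 3))
print(in_convex_hull(np.array([0.5, 0.5, 0.5]), train))   # interior point -> True
print(in_convex_hull(np.array([2.0, 2.0, 2.0]), train))   # far outside    -> False
```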
... Alongside audits, techniques for improving the explainability of algorithmic outputs to attain transparency and accountability are also being developed. For example, Zeng, Ustun and Rudin (2015) used an approach known as Supersparse Linear Integer Models to demonstrate how to develop transparent, interpretable risk assessment algorithms. Parent and colleagues (2020: 52) have also developed an approach to creating an explainable predictive policing algorithm whose predictions can be explained using 'past event information, weather, and socio-demographic information'. ...
Article
Full-text available
Data-driven digital technologies are playing a pivotal role in shaping the global landscape of criminal justice across several jurisdictions. Predictive algorithms in particular, now inform decision making at almost all levels of the criminal justice process. As the algorithms continue to proliferate, a fast-growing multidisciplinary scholarship has emerged to challenge their logics and highlight their capacity to perpetuate historical biases. Drawing on insights distilled from critical algorithm studies and the sociological scholarship on digital technologies, this paper outlines the limits of prevailing tech-reformist remedies. The paper also builds on the interstices between the two scholarships to make the case for a broader structural framework for understanding the conduits of algorithmic bias.
... However, it is obvious that we could take an arbitrary probabilistic machine learning model, e.g. a deep network with softmax output [16]. Although those networks frequently provide superior performance, their disadvantage is the lack of interpretability [17]. Although many attempts have been made to explain deep models, the inherent requirement of interpretability should determine the network design [18,19]. ...
Article
Full-text available
In the present contribution we investigate the mathematical model of the trade-off between optimum classification and the reject option. The model provides a threshold value that depends on the classification, rejection, and error costs. The model is extended to the case in which the training data are affected by label noise. We consider the respective mathematical model and show that the optimum threshold value does not depend on the presence/absence of label noise. We explain how this knowledge could be used for probabilistic classifiers in machine learning.
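A classical reference point for such a threshold is Chow's rule, with correct classifications costing nothing, errors costing $c_e$, and rejections costing $c_r$ (stated here in its textbook form, not necessarily the cited paper's exact model):

```latex
\text{reject } x
\quad\Longleftrightarrow\quad
\max_{k}\, p(k \mid x) \;<\; \theta,
\qquad
\theta \;=\; 1 - \frac{c_r}{c_e},
\quad 0 \le c_r \le c_e .
```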
... In other cases, γ might be set arbitrarily according to an implicit assumption on the values of the features (Ustun and Rudin, 2016; Zeng et al., 2017). With this setting at hand, if example i is misclassified, the value of the right-hand side of inequality (14a) is positive. ...
Preprint
Scoring systems, as simple classification models, have significant advantages in interpretability and transparency when making predictions. They facilitate human decision-making by allowing a quick prediction to be made by hand, through adding and subtracting a few point scores, and thus have been widely used in various fields such as medical diagnosis in Intensive Care Units. However, the (un)fairness issues in these models have long been criticized, and the use of biased data in the construction of scoring systems heightens this concern. In this paper, we propose a general framework to create data-driven fairness-aware scoring systems. Our approach is first to develop a social welfare function that incorporates both efficiency and equity. Then, we translate the social welfare maximization problem in economics into the empirical risk minimization task in the machine learning community to derive a fairness-aware scoring system with the help of mixed integer programming. We show that the proposed framework provides practitioners or policymakers great flexibility to select their desired fairness requirements and also allows them to customize their own requirements by imposing various operational constraints. Experimental evidence on several real data sets verifies that the proposed scoring system can achieve the optimal welfare of stakeholders and balance the interpretability, fairness, and efficiency issues.
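To make the margin-with-indicator construction (as in inequality (14a) above) and the mixed-integer-programming step concrete, here is a minimal, fairness-free sketch of fitting a small integer scoring system; it uses a big-M formulation on toy data, with PuLP only as a convenient open-source solver interface, not the paper's implementation.

```python
# Sketch: learning a tiny integer scoring system via mixed integer programming.
# Misclassification indicators z_i are forced to 1 whenever the margin constraint
# y_i * score_i >= 1 is violated (big-M formulation).  Illustrative only.
import numpy as np
import pulp

rng = np.random.default_rng(6)
n, d = 60, 3
X = rng.integers(0, 2, size=(n, d)).astype(float)        # binary features
y = np.where(X[:, 0] + X[:, 1] - X[:, 2] > 0, 1, -1)      # labels in {-1, +1}

prob = pulp.LpProblem("toy_scoring_system", pulp.LpMinimize)
lam = [pulp.LpVariable(f"lam_{j}", lowBound=-5, upBound=5, cat="Integer") for j in range(d)]
lam0 = pulp.LpVariable("lam_0", lowBound=-5, upBound=5, cat="Integer")
z = [pulp.LpVariable(f"z_{i}", cat="Binary") for i in range(n)]
M = 1 + 5 * (d + 1)                                       # big-M valid for these variable bounds

for i in range(n):
    score = lam0 + pulp.lpSum(float(X[i, j]) * lam[j] for j in range(d))
    prob += int(y[i]) * score >= 1 - M * z[i]             # z_i = 1 allows a violation

prob += pulp.lpSum(z)                                     # objective: minimize number of errors
prob.solve(pulp.PULP_CBC_CMD(msg=False))

print("points per feature:", [int(v.varValue) for v in lam], "intercept:", int(lam0.varValue))
print("training errors   :", int(sum(v.varValue for v in z)))
```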
... For example, it is difficult to graphically display how alternative decision trees "grow" as the random forest algorithm progresses. The black-box mystery aspect associated with machine learning approaches may be a serious problem, but perhaps not a deadly one; one way of addressing this issue may be by improving the transparency of the algorithm development process (e.g., see Zeng et al., 2017). Future research will explore different machine learning approaches. ...
Chapter
Full-text available
Same as: Hu, X., Zhang, X., & Lovrich, N. (2021). Forecasting identity theft victims: Analyzing characteristics and preventive actions through machine learning approaches. Victims & Offenders, 16(4), 465-494. Reprinted in the book "The New Technology of Financial Crime: New Crime Commission Technology, New Victims, New Offenders, and New Strategies for Prevention and Control"
... More information about this instrument is detailed by VanNostrand and Rose (2009). We believe researchers may be interested in this analysis given the substantial focus on improving risk assessment instruments used by the criminal-legal system (e.g., Zeng et al. (2017); Wang et al. (2022)). ...
Preprint
Full-text available
Objectives: When someone is arrested and charged with a crime, they may be released on bail or required to participate in a community supervision program while awaiting trial. These 'pre-trial programs' are common throughout the United States, but very little research has demonstrated their effectiveness. Researchers have qualified these findings by emphasizing the need for more rigorous program evaluation methods, which we introduce in this article. Here, we (1) describe a program evaluation pipeline that uses novel state-of-the-art machine learning techniques, and (2) demonstrate these techniques on a case study of a pre-trial program in Durham, North Carolina. Methods: We used a quasi-experimental design that compared people who took part in the program to those who did not take part in the program and were instead released back into the community while awaiting trial. We tested whether the program significantly reduced the probability of new criminal charges using new and old evaluation techniques. Results: We found no evidence that the program either significantly increased or decreased the probability of new criminal charges. Conclusions: If these findings replicate, the criminal-legal system needs to either improve these pre-trial programs or consider alternatives to them. The simplest option is to release low-risk individuals back into the community without subjecting them to any restrictions or conditions. Another option is to assign individuals to pre-trial programs that incentivize pro-social behavior. Before making these changes however, more rigorous program evaluation is needed. We believe the techniques introduced here can provide researchers the rigorous tools they need to do that.
... Interpretable machine learning has recently witnessed a strong increase in attention [26], both within and outside the scientific community, driven by the increased use of machine learning in industry and society. This is especially true for application domains where decision making is crucial and requires transparency, such as health care [81,68] and societal problems [67,126]. ...
Thesis
Full-text available
In this work, we attempt to answer the question: "How to learn robust and interpretable rule-based models from data for machine learning and data mining, and define their optimality?". Rules provide a simple form of storing and sharing information about the world. As humans, we use rules every day, such as the physician who diagnoses someone with flu, represented by "if a person has either a fever or a sore throat (among others), then she has the flu." Even though an individual rule can only describe simple events, several aggregated rules can represent more complex scenarios, such as the complete set of diagnostic rules employed by a physician. The use of rules spans many fields in computer science, and in this dissertation, we focus on rule-based models for machine learning and data mining. Machine learning focuses on learning the model that best predicts future (previously unseen) events from historical data. Data mining aims to find interesting patterns in the available data. To answer our question, we use the Minimum Description Length (MDL) principle, which allows us to define the statistical optimality of rule-based models. Furthermore, we empirically show that this formulation is highly competitive for real-world problems.
... Several approaches were proposed in the literature to learn inherently interpretable models such as rule lists [75,78], decision trees and decision lists [45], decision sets [41], prototype-based models [16], and generalized additive models [17,49]. However, complex models such as deep neural networks often achieve higher accuracy than simpler models [59]. ...
Preprint
Full-text available
As post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to ensure that the quality of the resulting explanations is consistently high across various population subgroups including the minority groups. For instance, it should not be the case that explanations associated with instances belonging to a particular gender subgroup (e.g., female) are less accurate than those associated with other genders. However, there is little to no research that assesses if there exist such group-based disparities in the quality of the explanations output by state-of-the-art explanation methods. In this work, we address the aforementioned gaps by initiating the study of identifying group-based disparities in explanation quality. To this end, we first outline the key properties which constitute explanation quality and where disparities can be particularly problematic. We then leverage these properties to propose a novel evaluation framework which can quantitatively measure disparities in the quality of explanations output by state-of-the-art methods. Using this framework, we carry out a rigorous empirical analysis to understand if and when group-based disparities in explanation quality arise. Our results indicate that such disparities are more likely to occur when the models being explained are complex and highly non-linear. In addition, we also observe that certain post hoc explanation methods (e.g., Integrated Gradients, SHAP) are more likely to exhibit the aforementioned disparities. To the best of our knowledge, this work is the first to highlight and study the problem of group-based disparities in explanation quality. In doing so, our work sheds light on previously unexplored ways in which explanation methods may introduce unfairness in real world decision making.
... Moreover, the very need to resort to opaque machine learning models in order to obtain adequate results has recently been called into question. Thus, rather than laboriously trying to interpret hermetic algorithms, it would suffice to use instruments that are interpretable by design whenever this does not entail notable losses of effectiveness (Zeng, Ustun and Rudin, 2017; Rudin, 2018; Rudin, Wang and Coker, 2018; and Rudin and Ustun, 2019). ...
Article
Full-text available
This article aims to reflect on a small part of the Law-AI relationship, namely that which concerns judicial decision-making. To this end, two hypotheses are examined: one that envisages the replacement of the human judge by an artificial intelligence, and another that understands the use of the latter as a complement or support for the judge throughout the decision-making process. Regarding the first issue, the article reflects on the difficulties posed by the concept of «reason» when applied to the human and artificial decision-maker, and how this impacts the jurisdictional function as a peer-to-peer activity. In relation to the second question, the paper highlights that the use of AI as a support can come with its own difficulties. Reflecting through the example of criminal justice and recidivism predictions, it is argued that the use of such tools can produce «argumentation voids», i.e., blind spots in the justification of the judicial decision.
... A typical example is birds' feathers, which probably first evolved to keep the birds warm, and only later turned out to be useful for flying. [Footnote 27] (Zeng et al., 2017; Su et al., 2015; Tan et al., 2017; Guidotti et al., 2018) [Footnote 28] For example, the General Data Protection Regulation (GDPR) of the European Union imposes a general "right to explanation" for almost any decision made by an algorithm about an individual (Goodman and Flaxman, 2017). One reason for such a requirement is to make sure that the AI did not discriminate against applicants based on gender, race, or similar characteristics, an objective called "fair AI". ...
Preprint
Full-text available
This book uses the modern theory of artificial intelligence (AI) to understand human suffering or mental pain. Both humans and sophisticated AI agents process information about the world in order to achieve goals and obtain rewards, which is why AI can be used as a model of the human brain and mind. This book intends to make the theory accessible to a relatively general audience, requiring only some relevant scientific background. The book starts with the assumption that suffering is mainly caused by frustration. Frustration means the failure of an agent (whether AI or human) to achieve a goal or a reward it wanted or expected. Frustration is inevitable because of the overwhelming complexity of the world, limited computational resources, and scarcity of good data. In particular, such limitations imply that an agent acting in the real world must cope with uncontrollability, unpredictability, and uncertainty, which all lead to frustration. Fundamental in such modelling is the idea of learning, or adaptation to the environment. While AI uses machine learning, humans and animals adapt by a combination of evolutionary mechanisms and ordinary learning. Even frustration is fundamentally an error signal that the system uses for learning. This book explores various aspects and limitations of learning algorithms and their implications regarding suffering. At the end of the book, the computational theory is used to derive various interventions or training methods that will reduce suffering in humans. The amount of frustration is expressed by a simple equation which indicates how it can be reduced. The ensuing interventions are very similar to those proposed by Buddhist and Stoic philosophy, and include mindfulness meditation. Therefore, this book can be interpreted as an exposition of a computational theory justifying why such philosophies and meditation reduce human suffering.
... An interpretable classifier has the ability to "explain" its classifications to users through, for example, the presentation of classification rules in the format: IF <condition> THEN <class label(s)> [Freitas 2014, Parmentier and Vidal 2021]. In some important application domains of classification, such as medical diagnosis [Luo et al. 2015], bioinformatics [Fabris et al. 2017], and legal-criminal problems [Zeng et al. 2017], the ability to interpret the result of a classification is as important as the predictive performance of the model. This also holds in the context of public administration, a sector in which data-driven predictions are normally used to support managers in decision-making processes that can have a profound effect on people [Varshney 2015]. ...
Conference Paper
Full-text available
Decision Trees (DTs) are widely used in public administration, a sector in which data-driven predictions are used to support managers in decision-making processes that can have a profound effect on people. This work carries out a comparative analysis of three different open-source implementations, in Python and R, of two popular DT learning algorithms (C4.5 and CART). The generated models were compared in terms of predictive performance, training and classification time, and interpretability. The results of the study are intended to provide important contributions to the use of these implementations in the public service, as well as in other areas in which the use of interpretable classification models is desirable.
... De-Arteaga et al. (2018) and propose techniques for imputing missing labels using feedback from human experts. Zeng et al. (2017) and Lakkaraju and Rudin (2017) propose statistical techniques for assigning missing labels. ...
Preprint
We consider an online learning problem with one-sided feedback, in which the learner is able to observe the true label only for positively predicted instances. On each round, $k$ instances arrive and receive classification outcomes according to a randomized policy deployed by the learner, whose goal is to maximize accuracy while deploying individually fair policies. We first extend the framework of Bechavod et al. (2020), which relies on the existence of a human fairness auditor for detecting fairness violations, to instead incorporate feedback from dynamically-selected panels of multiple, possibly inconsistent, auditors. We then construct an efficient reduction from our problem of online learning with one-sided feedback and a panel reporting fairness violations to the contextual combinatorial semi-bandit problem (Cesa-Bianchi & Lugosi, 2009; György et al., 2007). Finally, we show how to leverage the guarantees of two algorithms in the contextual combinatorial semi-bandit setting: Exp2 (Bubeck et al., 2012) and the oracle-efficient Context-Semi-Bandit-FTPL (Syrgkanis et al., 2016), to provide multi-criteria no regret guarantees simultaneously for accuracy and fairness. Our results eliminate two potential sources of bias from prior work: the "hidden outcomes" that are not available to an algorithm operating in the full information setting, and human biases that might be present in any single human auditor, but can be mitigated by selecting a well chosen panel.
... Intrinsic and post hoc. Intrinsic explainability refers to the explainability provided by simple, structurally transparent machine learning models that are inherently explainable, such as short decision trees [22] or sparse linear models [33]. Post hoc explainability [21] refers to the explainability provided by additional explanation methods applied after the training of black-box models (ensemble methods or neural networks). ...
Article
Full-text available
Knowledge graphs have gained significant popularity in recent years. As one of the W3C standards, SPARQL has become the de facto standard query language to retrieve the desired data from various knowledge graphs on the Web. Therefore, accurately measuring the similarity between different SPARQL queries is an important and fundamental task for many query-based applications, such as query suggestion, query rewriting, and query relaxation. However, conventional SPARQL similarity computation models only provide poorly interpretable results, i.e., simple similarity scores for pairs of queries. Explaining the computed similarity scores amounts to explaining why a specific computation model offers such scores. This helps users and machines understand the result of similarity measures in different query scenarios and can be used in many downstream tasks. We thus focus on providing explanations for typical SPARQL similarity measures in this paper. Specifically, given the similarity scores of existing measures, we implement four explainable models based on Linear Regression, Support Vector Regression, Ridge Regression, and Random Forest Regression to provide quantitative weights to different dimensional SPARQL features, i.e., our models are able to explain different kinds of SPARQL similarity computation models by presenting the weights of the different dimensional SPARQL features captured by them. Deep insight analysis and extensive experiments on real-world datasets are conducted to illustrate the effectiveness of our explainable models.
... Inherently Interpretable Models and Post hoc Explanations. Many approaches learn inherently interpretable models such as rule lists [67,64], decision trees and decision lists [39], and others [37,8,43,9]. However, complex models such as deep neural networks often achieve higher accuracy than simpler models [51]. ...
Preprint
Full-text available
While several types of post hoc explanation methods (e.g., feature attribution methods) have been proposed in recent literature, there is little to no work on systematically benchmarking these methods in an efficient and transparent manner. Here, we introduce OpenXAI, a comprehensive and extensible open source framework for evaluating and benchmarking post hoc explanation methods. OpenXAI comprises the following key components: (i) a flexible synthetic data generator and a collection of diverse real-world datasets, pre-trained models, and state-of-the-art feature attribution methods, (ii) open-source implementations of twenty-two quantitative metrics for evaluating faithfulness, stability (robustness), and fairness of explanation methods, and (iii) the first ever public XAI leaderboards to benchmark explanations. OpenXAI is easily extensible, as users can readily evaluate custom explanation methods and incorporate them into our leaderboards. Overall, OpenXAI provides an automated end-to-end pipeline that not only simplifies and standardizes the evaluation of post hoc explanation methods, but also promotes transparency and reproducibility in benchmarking these methods. OpenXAI datasets and data loaders, implementations of state-of-the-art explanation methods and evaluation metrics, as well as leaderboards are publicly available at https://open-xai.github.io/.
... The present work focuses only on post-hoc explanation methods and, although the performance of the three methods considered here was similar, it remains possible that other modalities of offering explanations might be more useful for this particular task. Future work in the same context might want to explore other explainable ML methods such as inherently-interpretable models (4,12,39), counterfactual explanations (19,40), and example-based methods (such as prototypes and critics) (41). It is possible that the fraud detection setting is not conducive to post-hoc feature-based explanations as the review band consists of transactions where the model is less confident. ...
Preprint
Machine Learning (ML) models now inform a wide range of human decisions, but using "black box" models carries risks such as relying on spurious correlations or errant data. To address this, researchers have proposed methods for supplementing models with explanations of their predictions. However, robust evaluations of these methods' usefulness in real-world contexts have remained elusive, with experiments tending to rely on simplified settings or proxy tasks. We present an experimental study extending a prior explainable ML evaluation experiment and bringing the setup closer to the deployment setting by relaxing its simplifying assumptions. Our empirical study draws dramatically different conclusions than the prior work, highlighting how seemingly trivial experimental design choices can yield misleading results. Beyond the present experiment, we believe this work holds lessons about the necessity of situating the evaluation of any ML method and choosing appropriate tasks, data, users, and metrics to match the intended deployment contexts.
... Consequently, practitioners have often turned to inherently interpretable machine learning models for these applications, which people can more easily understand. There exist several methods in the literature that learn inherently interpretable models, such as decision lists and sets [28,7,66,73], prototype-based models [52,13,25] and generalized additive models [34,4,12,11,74]. ...
Preprint
Machine Learning (ML) models are increasingly used to make critical decisions in real-world applications, yet they have also become more complex, making them harder to understand. To this end, several techniques to explain model predictions have been proposed. However, practitioners struggle to leverage explanations because they often do not know which to use, how to interpret the results, and may have insufficient data science experience to obtain explanations. In addition, most current works focus on generating one-shot explanations and do not allow users to follow up and ask fine-grained questions about the explanations, which can be frustrating. In this work, we address these challenges by introducing TalkToModel: an open-ended dialogue system for understanding machine learning models. Specifically, TalkToModel comprises three key components: 1) a natural language interface for engaging in dialogues, making understanding ML models highly accessible, 2) a dialogue engine that adapts to any tabular model and dataset, interprets natural language, maps it to appropriate operations (e.g., feature importance explanations, counterfactual explanations, showing model errors), and generates text responses, and 3) an execution component that runs the operations and ensures explanations are accurate. We carried out quantitative and human subject evaluations of TalkToModel. We found the system understands user questions on novel datasets and models with high accuracy, demonstrating the system's capacity to generalize to new situations. In human evaluations, 73% of healthcare workers (e.g., doctors and nurses) agreed they would use TalkToModel over baseline point-and-click systems, and 84.6% of ML graduate students agreed TalkToModel was easier to use.
... While we have described the SSClass problem in the context of assessing disease risk, score classification is also used in other contexts, such as assigning letter grades to students, giving a quality rating to a product, or deciding whether a person charged with a crime should be released on bail. In Machine Learning, the focus is on learning the score classification function [12,18,[20][21][22]. In contrast, here our focus is on reducing the cost of evaluating the classification function. ...
Article
Full-text available
Consider the following Stochastic Score Classification problem. A doctor is assessing a patient's risk of developing a disease and can perform n different binary tests on the patient. The probability that test i is positive is p_i and the outcomes of the n tests are independent. A patient's score is the total number of positive tests. Possible scores thus range between 0 and n. This range is divided into subranges, corresponding to risk classes (e.g., LOW, MEDIUM, or HIGH risk). Each test has an associated cost. To reduce testing cost, instead of performing all tests and determining an exact score, the doctor can perform tests sequentially and stop testing when it is possible to determine the patient's risk class. The problem is to determine the order in which the doctor should perform the tests, so as to minimize expected testing cost. We address the unit-cost case of the Stochastic Score Classification problem, and provide polynomial-time approximation algorithms for adaptive and non-adaptive versions of the problem. We also pose a number of open questions.
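A minimal simulation of the adaptive idea, under assumed risk-class cutoffs and a fixed test order (not the paper's approximation algorithms), is sketched below: testing stops as soon as every still-achievable score lies in the same risk class.

```python
# Illustrative sketch of adaptive stopping for score classification
# (unit costs; cutoffs and probabilities are made up).
import random

def risk_class(score, cutoffs=(3, 6)):          # 0-2 LOW, 3-5 MEDIUM, 6+ HIGH (assumed)
    return sum(score >= c for c in cutoffs)

def adaptive_tests_used(p, order, cutoffs=(3, 6)):
    positives, remaining = 0, len(p)
    for idx in order:
        lo, hi = positives, positives + remaining      # range of still-achievable scores
        if risk_class(lo, cutoffs) == risk_class(hi, cutoffs):
            break                                       # risk class already determined
        positives += random.random() < p[idx]           # perform test idx
        remaining -= 1
    return len(p) - remaining                           # number of tests actually performed

p = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1, 0.05, 0.05]
random.seed(0)
avg = sum(adaptive_tests_used(p, range(len(p))) for _ in range(10_000)) / 10_000
print(f"average tests used: {avg:.2f} of {len(p)}")
```

Choosing a good test order is exactly what drives the expected cost; the fixed order here is only for illustration.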
... Machine learning focuses on predicting the output value for new observations, which is considered a relatively easier task than inference, using flexible functional forms that are well suited to finding relationships between features and the outcome (Shmueli, 2010). Machine learning has been actively explored and used to inform risk-stratified decision making in various fields for decades, from optimizing medical scoring systems (Ustun & Rudin, 2016) to predicting criminal justice recidivism (Berk & Bleich, 2013; Zeng et al., 2017). ...
Article
Background: Youth who exit the nation's foster care system without permanency are at high risk of experiencing difficulties during the transition to adulthood. Objective: To present an illustrative test of whether an algorithmic decision aid could be used to identify youth at risk of exiting foster care without permanency. Methods: For youth placed in foster care between ages 12 and 14, we assessed the risk of exiting care without permanency by age 18 based on their child welfare service involvement history. To develop predictive risk models, 28 years (1991-2018) of child welfare service records from California were used. Performance was evaluated using F1, AUC, and precision and recall scores at k%. Algorithmic racial bias and fairness were also examined. Results: The gradient boosting decision tree and random forest showed the best performance (F1 score = .54-.55, precision score = .62, recall score = .49). Among the top 30% of youth the model identified as high risk, half of all youth who exited care without permanency were accurately identified four to six years prior to their exit, with a 39% error rate. Although racial disparities between Black and White youth were observed in imbalanced error rates, calibration and predictive parity were satisfied. Conclusions: Our study illustrates the manner in which potential applications of predictive analytics, including those designed to achieve universal goals of permanency through more targeted allocations of resources, can be tested. It also assesses the model using metrics of fairness.
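The "precision and recall at k%" metrics used in that evaluation can be computed as in the following sketch (synthetic scores and labels, not the study's records):

```python
# Sketch of precision/recall at the top-k% of predicted risk (synthetic data).
import numpy as np

def precision_recall_at_k(y_true, y_score, k=0.30):
    n_top = int(np.ceil(k * len(y_score)))
    top = np.argsort(-y_score)[:n_top]            # highest-risk k% of cases
    tp = y_true[top].sum()
    return tp / n_top, tp / y_true.sum()          # precision@k, recall@k

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.15, size=5000)         # 1 = outcome of interest occurred
y_score = np.clip(0.15 + 0.5 * (y_true - 0.15) + 0.3 * rng.standard_normal(5000), 0, 1)
print(precision_recall_at_k(y_true, y_score, k=0.30))
```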
... Therefore, we focus on applying prototype-based methods using alignment-free dissimilarity measures for sequence comparison. In fact, prototype-based machine learning models for data classification and representation are known to be interpretable and robust [6,82,93]. Using such methods for the SARS-CoV-2 sequence data, first we verify the classification results for the GISAID data. ...
Article
Full-text available
We present an approach to discriminate SARS-CoV-2 virus types based on their RNA sequence descriptions, avoiding a sequence alignment. For that purpose, sequences are preprocessed by feature extraction and the resulting feature vectors are analyzed by prototype-based classification to remain interpretable. In particular, we propose to use variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The respective matrix LVQ provides additional knowledge about the classification decisions, such as discriminant feature correlations, and can additionally be equipped with easy-to-realize reject options for uncertain data. Those options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not sufficient. This model is first trained using a GISAID dataset with given virus types detected according to the molecular differences in coronavirus populations by phylogenetic tree clustering. In a second step, we apply the trained model to another but unlabeled SARS-CoV-2 virus dataset. For these data, we can either assign a virus type to the sequences or reject atypical samples. Those rejected sequences allow us to speculate about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Last but not least, the presented approach has lower computational complexity compared to methods based on (multiple) sequence alignment. Supplementary information: The online version contains supplementary material available at 10.1007/s00521-021-06018-2.
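A toy sketch of the reject-option idea is shown below; it uses a plain nearest-prototype rule with a distance threshold, not a trained matrix LVQ, so it only illustrates how a prototype-based classifier can refuse to classify atypical inputs.

```python
# Minimal nearest-prototype classifier with a distance-based reject option
# (prototypes and threshold are assumed, not learned as in LVQ/GLVQ).
import numpy as np

def nearest_prototype_with_reject(x, prototypes, labels, reject_radius):
    d = np.linalg.norm(prototypes - x, axis=1)
    best = np.argmin(d)
    if d[best] > reject_radius:
        return None                       # refuse to classify: evidence insufficient
    return labels[best]

prototypes = np.array([[0.0, 0.0], [3.0, 3.0]])    # one prototype per class (assumed)
labels = np.array(["type_A", "type_B"])
print(nearest_prototype_with_reject(np.array([0.2, -0.1]), prototypes, labels, 1.5))   # type_A
print(nearest_prototype_with_reject(np.array([10.0, 10.0]), prototypes, labels, 1.5))  # None (rejected)
```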
... We learn checklists from data by solving an integer program. Our problem can be viewed as a special case of an IP proposed by Ustun and Rudin [69], Zeng et al. [78] to learn sparse linear classifiers with small integer coefficients. The problem that we consider is considerably easier to solve from a computational standpoint because it restricts coefficients to binary values and does not include ℓ0-regularization. ...
Preprint
Checklists are simple decision aids that are often used to promote safety and reliability in clinical applications. In this paper, we present a method to learn checklists for clinical decision support. We represent predictive checklists as discrete linear classifiers with binary features and unit weights. We then learn globally optimal predictive checklists from data by solving an integer programming problem. Our method allows users to customize checklists to obey complex constraints, including constraints to enforce group fairness and to binarize real-valued features at training time. In addition, it pairs models with an optimality gap that can inform model development and determine the feasibility of learning sufficiently accurate checklists on a given dataset. We pair our method with specialized techniques that speed up its ability to train a predictive checklist that performs well and has a small optimality gap. We benchmark the performance of our method on seven clinical classification problems, and demonstrate its practical benefits by training a short-form checklist for PTSD screening. Our results show that our method can fit simple predictive checklists that perform well and that can easily be customized to obey a rich class of custom constraints.
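A hedged sketch of learning such a checklist as a small mixed-integer program is given below. It uses PuLP (assumed available), fixes the decision threshold M in advance, and runs on toy binary data, so it illustrates the flavor of the formulation rather than the paper's exact method or constraints.

```python
# Sketch: learn a unit-weight checklist ("flag if at least M selected items are
# checked") by minimizing misclassifications plus a small sparsity penalty.
import numpy as np
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

rng = np.random.default_rng(0)
n, d, M = 60, 5, 2
X = rng.integers(0, 2, size=(n, d))               # binary features
y = (X[:, 0] + X[:, 1] >= 2).astype(int)          # toy ground truth

prob = LpProblem("predictive_checklist", LpMinimize)
w = [LpVariable(f"w_{j}", cat=LpBinary) for j in range(d)]   # is item j on the checklist?
z = [LpVariable(f"z_{i}", cat=LpBinary) for i in range(n)]   # error indicator for example i
prob += lpSum(z) + 0.01 * lpSum(w)                           # errors + small sparsity penalty
for i in range(n):
    score = lpSum(int(X[i, j]) * w[j] for j in range(d))
    if y[i] == 1:
        prob += score >= M - d * z[i]                        # a miss is allowed only if z_i = 1
    else:
        prob += score <= M - 1 + d * z[i]                    # a false alarm only if z_i = 1
prob.solve(PULP_CBC_CMD(msg=False))
print("selected checklist items:", [j for j in range(d) if w[j].value() > 0.5])
```

The optimality gap reported by the solver plays the role the abstract describes: it tells you how far the best checklist found is from the best achievable one.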
Article
Full-text available
In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
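For orientation, the standard entropy definitions underlying these mutual information variants are as follows (the paper's "resolved" mutual information functions refine these; they are shown here only as background):

```latex
% Shannon, Rényi, and Tsallis entropies of a distribution P = (p_1, ..., p_k)
\begin{aligned}
H_{\mathrm{Shannon}}(P) &= -\sum_{i} p_i \log p_i, \\
H_{\alpha}^{\mathrm{R\'enyi}}(P) &= \frac{1}{1-\alpha} \log \sum_{i} p_i^{\alpha},
  \qquad \alpha > 0,\ \alpha \neq 1, \\
S_{q}^{\mathrm{Tsallis}}(P) &= \frac{1}{q-1} \Bigl(1 - \sum_{i} p_i^{q}\Bigr),
  \qquad q \neq 1.
\end{aligned}
```

Both generalized entropies recover the Shannon entropy in the limit α → 1 and q → 1.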
Preprint
Recent strides in interpretable machine learning (ML) research reveal that models exploit undesirable patterns in the data to make predictions, which potentially causes harms in deployment. However, it is unclear how we can fix these models. We present our ongoing work, GAM Changer, an open-source interactive system to help data scientists and domain experts easily and responsibly edit their Generalized Additive Models (GAMs). With novel visualization techniques, our tool puts interpretability into action -- empowering human users to analyze, validate, and align model behaviors with their knowledge and values. Built using modern web technologies, our tool runs locally in users' computational notebooks or web browsers without requiring extra compute resources, lowering the barrier to creating more responsible ML models. GAM Changer is available at https://interpret.ml/gam-changer.
Article
Risk assessment instruments are used across the criminal justice system to estimate the probability of some future event, such as failure to appear for a court appointment or re‐arrest. The estimated probabilities are then used in making decisions at the individual level. In the past, there has been controversy about whether the probabilities derived from group‐level calculations can meaningfully be applied to individuals. Using Bayesian hierarchical models applied to a large longitudinal dataset from the court system in the state of Kentucky, we analyse variation in individual‐level probabilities of failing to appear for court and the extent to which it is captured by covariates. We find that individuals within the same risk group vary widely in their probability of the outcome. In practice, this means that allocating individuals to risk groups based on standard approaches to risk assessment, in large part, results in creating distinctions among individuals who are not meaningfully different in terms of their likelihood of the outcome. This is because uncertainty about the probability that any particular individual will fail to appear is large relative to the difference in average probabilities among any reasonable set of risk groups.
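One common specification of such a hierarchical model, shown here only as a sketch and not necessarily the authors' exact formulation, is a random-intercept logistic regression:

```latex
% Sketch of a Bayesian hierarchical (random-intercept) logistic model
\begin{aligned}
y_i \mid p_i &\sim \mathrm{Bernoulli}(p_i), \\
\operatorname{logit}(p_i) &= x_i^{\top}\beta + u_{j[i]}, \\
u_j &\sim \mathcal{N}(0, \sigma_u^{2}),
\end{aligned}
```

The spread of the individual probabilities p_i that remains after conditioning on the covariates x_i is exactly the within-risk-group variation the abstract describes.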
Book
Full-text available
Research repeatedly confirms that information about offenders' criminal histories is an important part of what is known about them, including for the purpose of assessing the possibility of future criminal recidivism. One of the studies conducted at the Institute of Criminology and Social Prevention (IKSP), which mapped the area of serious violent crime, focused precisely on the criminal histories of the offenders who commit this type of crime. The monograph, published in the STUDIE series and presenting selected results of this empirical research, addresses, among other things, the problem of defining which criminal conduct can, for the purposes of nationwide criminological research (in the Czech context), be designated as serious violent offences under criminal law, as well as the information about convicted perpetrators of these offences (with respect to their criminal history) that can be obtained from official statistics. A separate part consists of an analysis of a research sample of more than two thousand convicted perpetrators of serious violence, whose criminal histories were mapped in terms of prior final convictions.
Article
Machine learning-based classification models are ubiquitous in a wide variety of classification tasks. These models usually rely on complex machine learning methods and vast historical data to make accurate predictions. Recently, concern has grown that these models lead to unfair classifications, i.e., model predictions that hurt or benefit particular groups of people based on sensitive features (e.g., race, age). To achieve fair classification, a popular machine learning method, AdaBoost, is extended into the Fair-AdaBoost method in this study. Fair-AdaBoost can achieve fair classification while preserving the advantages (i.e., interpretability, scalability, and accuracy) of the basic AdaBoost. To further enhance the performance of Fair-AdaBoost, the non-dominated sorting genetic algorithm-II is extended in this study to optimize the hyper-parameters of the base classifiers in Fair-AdaBoost. In the experiments, the proposed method and algorithm were tested on three standard benchmark datasets widely used in the fair classification literature, demonstrating their superior performance.
Thesis
The sequencing of the human genome has transformed biology and opened the way to better identification and interpretation of genetic variations, which reflect our diversity but can also cause rare genetic diseases. The objective of my thesis was to develop tools to better characterize genetic variations involved in rare genetic diseases. My work was organized around two major axes. First, the development of MISTIC (MISsense deleTeriousness predICtor), a new artificial-intelligence-based tool aimed at predicting the impact of missense variants. MISTIC's high performance stems from an original architecture and a careful choice of the integrated descriptors. Second, the creation of duxt (differential usage across tissues), a metric to better characterize variations located in alternative exons. Applying duxt made it possible to identify exons that are strongly or weakly used in certain tissues and to explore their relationship with variations involved in certain atypical phenotypes of rare genetic diseases.
Preprint
Machine learning (ML) interpretability techniques can reveal undesirable patterns in data that models exploit to make predictions--potentially causing harms once deployed. However, how to take action to address these patterns is not always clear. In a collaboration between ML and human-computer interaction researchers, physicians, and data scientists, we develop GAM Changer, the first interactive system to help domain experts and data scientists easily and responsibly edit Generalized Additive Models (GAMs) and fix problematic patterns. With novel interaction techniques, our tool puts interpretability into action--empowering users to analyze, validate, and align model behaviors with their knowledge and values. Physicians have started to use our tool to investigate and fix pneumonia and sepsis risk prediction models, and an evaluation with 7 data scientists working in diverse domains highlights that our tool is easy to use, meets their model editing needs, and fits into their current workflows. Built with modern web technologies, our tool runs locally in users' web browsers or computational notebooks, lowering the barrier to use. GAM Changer is available at the following public demo link: https://interpret.ml/gam-changer.
Article
Current adoption of machine learning in industrial, societal and economical activities has raised concerns about the fairness, equity and ethics of automated decisions. Predictive models are often developed using biased datasets and thus retain or even exacerbate biases in their decisions and recommendations. Removing the sensitive covariates, such as gender or race, is insufficient to remedy this issue since the biases may be retained due to other related covariates. We present a regularization approach to this problem that trades off predictive accuracy of the learned models (with respect to biased labels) for the fairness in terms of statistical parity, i.e. independence of the decisions from the sensitive covariates. In particular, we consider a general framework of regularized empirical risk minimization over reproducing kernel Hilbert spaces and impose an additional regularizer of dependence between predictors and sensitive covariates using kernel-based measures of dependence, namely the Hilbert-Schmidt Independence Criterion (HSIC) and its normalized version. This approach leads to a closed-form solution in the case of squared loss, i.e. ridge regression. We also provide statistical consistency results for both risk and fairness bound for our approach. Moreover, we show that the dependence regularizer has an interpretation as modifying the corresponding Gaussian process (GP) prior. As a consequence, a GP model with a prior that encourages fairness to sensitive variables can be derived, allowing principled hyperparameter selection and studying of the relative relevance of covariates under fairness constraints. Experimental results in synthetic examples and in real problems of income and crime prediction illustrate the potential of the approach to improve fairness of automated decisions.
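A sketch of the regularized objective described here, with notation assumed rather than taken from the paper, is:

```latex
% Fairness-regularized empirical risk minimization with an HSIC penalty
% (f in an RKHS H, S the sensitive covariate, K_A and K_B kernel matrices on A and B)
\begin{aligned}
\min_{f \in \mathcal{H}} \;& \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^{2}
  \;+\; \lambda \,\|f\|_{\mathcal{H}}^{2}
  \;+\; \mu \,\widehat{\mathrm{HSIC}}\bigl(f(X), S\bigr), \\
\widehat{\mathrm{HSIC}}(A, B) \;&=\; \frac{1}{(n-1)^{2}}\,\operatorname{tr}\bigl(K_A H K_B H\bigr),
  \qquad H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}.
\end{aligned}
```

Because the squared loss, the RKHS norm, and the HSIC penalty are all quadratic in the function's coefficients for a ridge-type model, the first-order conditions are linear, which is why a closed-form solution exists in that case.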
Article
Full-text available
Arguably the most important decision at an arraignment is whether to release an offender until the date of his or her next scheduled court appearance. Under the Bail Reform Act of 1984, threats to public safety can be a key factor in that decision. Implicitly, a forecast of “future dangerousness” is required. In this article, we consider in particular whether usefully accurate forecasts of domestic violence can be obtained. We apply machine learning to data on over 28,000 arraignment cases from a major metropolitan area in which an offender faces domestic violence charges. One of three possible post-arraignment outcomes is forecasted within two years: (1) a domestic violence arrest associated with a physical injury, (2) a domestic violence arrest not associated with a physical injury, and (3) no arrests for domestic violence. We incorporate asymmetric costs for different kinds of forecasting errors so that very strong statistical evidence is required before an offender is forecasted to be a good risk. When an out-of-sample forecast of no post-arraignment domestic violence arrests within two years is made, it is correct about 90 percent of the time. Under current practice within the jurisdiction studied, approximately 20 percent of those released after an arraignment for domestic violence are arrested within two years for a new domestic violence offense. If magistrates used the methods we have developed and released only offenders forecasted not to be arrested for domestic violence within two years after an arraignment, as few as 10 percent might be arrested. The failure rate could be cut nearly in half. Over a typical 24-month period in the jurisdiction studied, well over 2,000 post-arraignment arrests for domestic violence perhaps could be averted.
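One simple, hedged way to encode asymmetric error costs in a tree-ensemble forecaster is via class weights, as sketched below on synthetic data; this illustrates the general idea of making false negatives costlier than false positives, not the authors' exact procedure.

```python
# Sketch: asymmetric error costs via class weights in a random forest (synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Treat missing a future arrest (false negative) as, say, 5x as costly as a false alarm.
clf = RandomForestClassifier(n_estimators=300, class_weight={0: 1, 1: 5}, random_state=0)
clf.fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))
```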
Article
Full-text available
Scoring systems are linear classification models that only require users to add, subtract and multiply a few small numbers in order to make a prediction. These models are in widespread use by the medical community, but are difficult to learn from data because they need to be accurate and sparse, have coprime integer coefficients, and accommodate operational constraints. We present a new method for creating data-driven scoring systems called Supersparse Linear Integer Models (SLIM). SLIM scoring systems are built by solving a discrete optimization problem that directly encodes measures of accuracy (the 0–1 loss) and sparsity (the L0-seminorm) while restricting coefficients to coprime integers. SLIM can seamlessly incorporate a wide range of operational constraints that are difficult for other methods to accommodate. We provide bounds on the testing and training accuracy of SLIM scoring systems, as well as a new data reduction technique that can improve scalability by discarding a portion of the training data. We present results from an ongoing collaboration with the Massachusetts General Hospital Sleep Apnea Laboratory, where SLIM is being used to construct a highly tailored scoring system for sleep apnea screening.
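The SLIM optimization problem is roughly of the following form (reconstructed from the description above; see the paper for the exact constraint set and constants):

```latex
% Sketch of the SLIM objective: 0-1 loss plus sparsity penalties over integer coefficients
\min_{\lambda \in \mathcal{L}} \;
  \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\bigl[y_i \,\lambda^{\top} x_i \le 0\bigr]
  \;+\; C_0 \,\|\lambda\|_0
  \;+\; \epsilon \,\|\lambda\|_1,
```

where the feasible set L restricts the coefficients to small coprime integers, the first term is the 0–1 loss, and the ℓ0 and ℓ1 terms encourage few and small coefficients, respectively.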
Article
Full-text available
Forecasts of prospective criminal behavior have long been an important feature of many criminal justice decisions. There is now substantial evidence that machine learning procedures will classify and forecast at least as well, and typically better, than logistic regression, which has to date dominated conventional practice. However, machine learning procedures are adaptive. They "learn" inductively from training data. As a result, they typically perform best with very large datasets. There is a need, therefore, for forecasting procedures with the promise of machine learning that will perform well with small to moderately-sized datasets. Kernel methods provide precisely that promise. In this paper, we offer an overview of kernel methods in regression settings and compare such a method, regularized with principal components, to stepwise logistic regression. We apply both to a timely and important criminal justice concern: a failure to appear (FTA) at court proceedings following an arraignment. A forecast of an FTA can be an important factor in a judge's decision to release a defendant while awaiting trial and can influence the conditions imposed on that release. Forecasting accuracy matters, and our kernel approach forecasts far more accurately than stepwise logistic regression. The methods developed here are implemented in the R package kernReg currently available on CRAN.
Article
Full-text available
Theories of procedural justice suggest that individuals who experience respectful and fair legal decision-making procedures are more likely to believe in the legitimacy of the law and, in turn, are less likely to reoffend. However, few studies have examined these relationships in youth. To begin to fill this gap in the literature, in the current study, the authors studied 92 youth (67 male, 25 female) on probation regarding their perceptions of procedural justice and legitimacy, and then monitored their offending over the subsequent 6 months. Results indicated that perceptions of procedural justice predicted self-reported offending at 3 months but not at 6 months, and that youths' beliefs about the legitimacy of the law did not mediate this relationship. Furthermore, procedural justice continued to account for unique variance in self-reported offending over and above the predictive power of well-established risk factors for offending (i.e., peer delinquency, substance abuse, psychopathy, and age at first contact with the law). Theoretically, the current study provides evidence that models of procedural justice developed for adults are only partially replicated in a sample of youth; practically, this research suggests that by treating adolescents in a fair and just manner, justice professionals may be able to reduce the likelihood that adolescents will reoffend, at least in the short term. (PsycINFO Database Record (c) 2013 APA, all rights reserved).
Article
Full-text available
Many scholars and political leaders denounce racism as the cause of disproportionate incarceration of black Americans. All players in this system have been blamed including the legislators who enact laws that disproportionately harm blacks, police who unevenly arrest blacks, prosecutors who overcharge blacks, and judges that fail to release and oversentence black Americans. Some scholars have blamed the police and judges who make arrest and release decisions based on predictions of whether defendants will commit future crimes. They claim that prediction leads to minorities being treated unfairly. Others complain that racism results from misused discretion. This article explores where racial bias enters the criminal justice system through an empirical analysis that considers the impact of discretion and prediction. With a close look at the numbers and consideration of factors ignored by others, this article confirms some conventional wisdom but also makes several surprising findings. This article confirms what many commentators have suspected — that police arrest black defendants more often for drug crimes than white defendants. It also finds, contrary to popular belief, that there is little evidence to support the belief that drugs are linked to violent crime. Also, judges actually detain white defendants more than similarly-situated black defendants for all types of crimes. The important and surprising findings in this article challenge long-held conventions of race and help mitigate racial disparity in criminal justice.
Article
Full-text available
In sentencing research, significant negative coefficients on age have been interpreted as evidence that actors in the criminal justice system discriminate against younger people. This interpretation is incomplete. Criminal sentencing laws generally specify punishment in terms of the number of past events in a defendant's criminal history. Doing so inadvertently makes age a meaningful variable because older people have had more time to accumulate criminal history events. Therefore, two people of different ages with the same criminal history are not in fact equal. This is true for pure retributivists, as the fact that the younger offender has been committing crimes at a higher rate of offending may make the younger offender more culpable, and is also true for those with some utilitarian aims for sentencing. Simulation results illustrate the stakes. To a certain extent, the interests of low-rate older offenders are opposed to those of high-rate younger offenders.
Technical Report
Full-text available
Mining frequent itemsets and association rules is a popular and well researched approach for discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.
Article
We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree‐based models are briefly described.
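The lasso estimate described here can be written, in its standard constrained form, as:

```latex
% The lasso: least squares subject to an L1 budget on the coefficients
\hat{\beta}^{\mathrm{lasso}}
  = \arg\min_{\beta} \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,
```

or equivalently in penalized form as least squares plus λ Σ_j |β_j|; the ℓ1 constraint is what drives some coefficients exactly to zero and yields interpretable models.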
Article
Recent studies have examined racial disparities in stop-and-frisk, a widely employed but controversial policing tactic. The statistical evidence, however, has been limited and contradictory. We investigate by analyzing three million stops in New York City over five years, focusing on cases where officers suspected the stopped individual of criminal possession of a weapon (CPW). For each CPW stop, we estimate the ex ante probability that the detained suspect has a weapon. We find that in more than 40% of cases, the likelihood of finding a weapon (typically a knife) was less than 1%, raising concerns that the legal requirement of “reasonable suspicion” was often not met. We further find that blacks and Hispanics were disproportionately stopped in these low hit rate contexts, a phenomenon that we trace to two factors: (1) lower thresholds for stopping individuals—regardless of race—in high-crime, predominately minority areas, particularly public housing; and (2) lower thresholds for stopping minorities relative to similarly situated whites. Finally, we demonstrate that by conducting only the 6% of stops that are statistically most likely to result in weapons seizure, one can both recover the majority of weapons and mitigate racial disparities in who is stopped. We show that this statistically informed stopping strategy can be approximated by simple, easily implemented heuristics with little loss in efficiency.
Article
This study extended previous research comparing a set of widely employed actuarial risk assessment schemes as well as a new instrument, the Static-2002, in a sample of 468 sex offenders followed for an average of 5.9 years. All of the risk assessment instruments (Violence Risk Appraisal Guide [VRAG], Sex Offender Risk Appraisal Guide [SORAG], Rapid Risk Assessment for Sex Offense Recidivism [RRASOR], Static-99, Static-2002, and Minnesota Sex Offender Screening Tool-Revised [MnSOST-R]) were found to predict the recidivism outcomes for which they were designed. Although significant, indices of accuracy were generally lower than those reported by the developers of these instruments, even under conditions that have been shown to optimize predictive performance. For serious recidivism, the predictive accuracy of the Static-2002 and SORAG was significantly superior to that of the RRASOR, and the SORAG was significantly superior to the MnSOST-R as well. There were no significant differences among instruments in accuracy of predicting sexual recidivism.
Article
Despite record levels of incarceration and much discussion about the role that incarceration plays in influencing criminal activity, there does not yet exist a sound knowledge base about the extent to which incarceration exhibits a criminogenic, deterrent, or null effect on subsequent individual offending trajectories. This is an unfortunate happenstance since classic criminological theories make vastly different predictions about the role of punishment in altering criminal activity, and life-course criminologists suggest that life events can materially influence subsequent criminal activity. Using arrest histories of a sample of prisoners released from state prisons in 1994 and followed for three years post-release, this Article seeks to address the impact of incarceration on subsequent offending trajectories. Results indicate that a comparison of the counterfactual and actual offending patterns suggests that most releasees were either deterred from future offending (40%) or merely incapacitated by their incarceration (56%). Only about 4% had a criminogenic effect. Future theoretical and empirical research directions are outlined.
Chapter
Data preprocessing techniques generally refer to the addition, deletion, or transformation of the training set data. Preprocessing data is a crucial step prior to modeling since data preparation can make or break a model’s predictive ability. To illustrate general preprocessing techniques, we begin by introducing a cell segmentation data set (Section 3.1). This data set contains common predictor problems such as skewness, outliers, and missing values. Sections 3.2 and 3.3 review predictor transformations for single predictors and multiple predictors, respectively. In Section 3.4 we discuss several approaches for handling missing data. Other preprocessing steps may include removing (Section 3.5), adding (Section 3.6), or binning (Section 3.7) predictors, all of which must be done carefully so that predictive information is not lost or erroneous information is added to the data. The computing section (3.8) provides R syntax for the previously described preprocessing steps. Exercises are provided at the end of the chapter to solidify concepts.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ∗∗∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
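A minimal sketch of the procedure described, using scikit-learn's implementation on synthetic data, with random feature selection at each split, the internal out-of-bag generalization estimate, and variable importances:

```python
# Sketch: a random forest with sqrt-feature splits, OOB error, and variable importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
forest = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",     # random subset of features tried at each split
    oob_score=True,          # internal (out-of-bag) generalization estimate
    random_state=0,
).fit(X, y)
print("OOB accuracy:", round(forest.oob_score_, 3))
print("top variables by importance:", forest.feature_importances_.argsort()[::-1][:5])
```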
Book
Applied Predictive Modeling covers the overall predictive modeling process, beginning with the crucial steps of data preprocessing, data splitting and foundations of model tuning. The text then provides intuitive explanations of numerous common and modern regression and classification techniques, always with an emphasis on illustrating and solving real data problems. The text illustrates all parts of the modeling process through many hands-on, real-life examples, and every chapter contains extensive R code for each step of the process. This multi-purpose text can be used as an introduction to predictive models and the overall modeling process, a practitioner's reference handbook, or as a text for advanced undergraduate or graduate level predictive modeling courses. To that end, each chapter contains problem sets to help solidify the covered concepts and uses data available in the book's R package. This text is intended for a broad audience as both an introduction to predictive models as well as a guide to applying them. Non-mathematical readers will appreciate the intuitive explanations of the techniques while an emphasis on problem-solving with real data across a wide variety of applications will aid practitioners who wish to extend their expertise. Readers should have knowledge of basic statistical ideas, such as correlation and linear regression analysis. While the text is biased against complex equations, a mathematical background is needed for advanced topics. © Springer Science+Business Media New York 2013. All rights reserved.
Article
Using criminal population conviction histories of recent offenders, prediction models are developed that predict three types of criminal recidivism: general recidivism, violent recidivism and sexual recidivism. The research question is whether prediction techniques from modern statistics, data mining and machine learning provide an improvement in predictive performance over classical statistical methods, namely logistic regression and linear discriminant analysis. These models are compared on a large selection of performance measures. Results indicate that classical methods do equally well as or better than their modern counterparts. The predictive performance of the different techniques differs only slightly for general and violent recidivism, while differences are larger for sexual recidivism. For the general and violent recidivism data we present the results of logistic regression and, for sexual recidivism, of linear discriminant analysis.
Article
We present a new comprehensive approach to create accurate and interpretable linear classification models through mixed-integer programming. Unlike existing approaches to linear classification, our approach can produce models that incorporate a wide range of interpretability-related qualities, and achieve a precise, understandable and pareto-optimal balance between accuracy and interpretability. We demonstrate the value of our approach by using it to train scoring systems, M-of-N rule tables, and "personalized" classification models for applications in medicine, marketing, and crime. In addition, we propose novel methods to train interpretable classification models on large-scale datasets using off-the-shelf mixed-integer programming software. Our paper includes theoretical bounds to assess the predictive accuracy of our models.
Article
The vast majority of the literature evaluates the performance of classification models using only the criterion of predictive accuracy. This paper reviews the case for considering also the comprehensibility (interpretability) of classification models, and discusses the interpretability of five types of classification models, namely decision trees, classification rules, decision tables, nearest neighbors and Bayesian network classifiers. We discuss both interpretability issues which are specific to each of those model types and more generic interpretability issues, namely the drawbacks of using model size as the only criterion to evaluate the comprehensibility of a model, and the use of monotonicity constraints to improve the comprehensibility and acceptance of classification models by users.
Article
Computational algorithms for selecting subsets of regression variables are discussed. Only linear models and the least-squares criterion are considered. The use of planar-rotation algorithms, instead of Gauss-Jordan methods, is advocated. The advantages and disadvantages of a number of "cheap" search methods are described for use when it is not feasible to carry out an exhaustive search for the best-fitting subsets. Hypothesis testing for three purposes is considered, namely (i) testing for zero regression coefficients for remaining variables, (ii) comparing subsets and (iii) testing for any predictive value in a selected subset. Three small data sets are used to illustrate these test. Spjøtvoll's (1972a) test is discussed in detail, though an extension to this test appears desirable. Estimation problems have largely been overlooked in the past. Three types of bias are identified, namely that due to the omission of variables, that due to competition for selection and that due to the stopping rule. The emphasis here is on competition bias, which can be of the order of two or more standard errors when coefficients are estimated from the same data as were used to select the subset. Five possible ways of handling this bias are listed. This is the area most urgently requiring further research. Mean squared errors of prediction and stopping rules are briefly discussed. Competition bias invalidates the use of existing stopping rules as they are commonly applied to try to produce optimal prediction equations.
Article
The vast majority of real-world classification problems are imbalanced, meaning there are far fewer data from the class of interest (the positive class) than from other classes. We propose two machine learning algorithms to handle highly imbalanced classification problems. The classifiers constructed by both methods are created as unions of axis-parallel rectangles around the positive examples, and thus have the benefit of being interpretable. The first algorithm uses mixed integer programming to optimize a weighted balance between positive and negative class accuracies. Regularization is introduced to improve generalization performance. The second method uses an approximation in order to assist with scalability. Specifically, it follows a "characterize then discriminate" approach, where the positive class is characterized first by boxes, and then each box boundary becomes a separate discriminative classifier. This method has the computational advantages that it can be easily parallelized, and considers only the relevant regions of feature space.
Article
Statistically based risk assessment devices are widely used in criminal justice settings. Their promise remains largely unfulfilled, however, because assumptions and premises requisite to their development and application are routinely ignored and/or violated. This article provides a brief review of the most salient of these assumptions and premises, addressing the base rate and selection ratios, methods of combining predictor variables and the nature of criterion variables chosen, cross-validation, replicability, and generalizability. The article also discusses decision makers’ choices to add or delete items from the instruments and suggests recommendations for policy makers to consider when adopting risk assessments. Suggestions for improved practice, practical and methodological, are made.
Article
Objectives Recent legislation in Pennsylvania mandates that forecasts of "future dangerousness" be provided to judges when sentences are given. Similar requirements already exist in other jurisdictions. Research has shown that machine learning can lead to usefully accurate forecasts of criminal behavior in such settings. But there are settings in which there is insufficient IT infrastructure to support machine learning. The intent of this paper is to provide a prototype procedure for making forecasts of future dangerousness that could be used to inform sentencing decisions when machine learning is not practical. We consider how classification trees can be improved so that they may provide an acceptable second choice. Methods We apply a version of classification trees available in R, with some technical enhancements to improve tree stability. Our approach is illustrated with real data that could be used to inform sentencing decisions. Results Modest-sized trees grown from large samples can forecast well and in a stable fashion, especially if the small fraction of indecisive classifications are found and accounted for in a systematic manner. But machine learning is still to be preferred when practical. Conclusions Our enhanced version of classification trees may well provide a viable alternative to machine learning when machine learning is beyond local IT capabilities.
Article
Using sentencing data from 1994 to 2002, spanning two different sentencing policies, this study examines the complex relationship between felony offenders' prior record, race/ethnicity, current offense, and sentencing outcomes. Expanding on past research, this study incorporates multiple dimensions of prior record and analyzes the differential impact of these dimensions across race/ethnicity and offense type. Unlike previous research, the current study examines these complex effects across different sentencing policies. The findings suggest that sentencing authorities' calculations of risk and dangerousness may not be based solely on legal considerations such as prior record and offense type or on extralegal factors such as race and ethnicity. Instead, it appears that a complex interplay exists between these legal and extralegal characteristics. Furthermore, the importance of policy change with regard to prior record and race/ethnicity is highlighted by the findings.
Article
This article compares the effects of indeterminate and determinate sentencing models on recidivism using a measure of parole board discretionary release and mandatory parole release under each sentencing model. Data collected from Recidivism of Prisoners Released in 1994: United States are used to conduct a state-specific comparison of the two release programs in six mixed-sentencing states. The results indicate that the effects of different sentencing models significantly vary across the six states. Whereas mandatory parole release was more likely to have a deterrent effect on recidivism in Maryland and Virginia, parole board discretionary release was more effective in New York and North Carolina. Release programs in Oregon and Texas showed no significant differences in their effects on recidivism.
Article
Risk dimensions used in guidelines systems have been implicated as contributing to racial (and gender) disproportionalities in America's prison and jail populations. Developers of some systems dealt with invidious risk predictors by purposely ignoring them, resulting in misspecification while not eliminating their effects. Eliminating all variables correlated with suspect factors would greatly attenuate power, rendering practical decision-making tools useless. This article shows that one risk-prediction device forming the basis of an operational guidelines system is, in fact, correlated with race and gender but that the approach suggested by some critics does not overcome this: Control factors will remain correlated with the final model even after second-order policy controls are implemented. Further, although the suggested approach is agnostic with respect to the nature of policy controls, these will have considerable practical importance. Illustration is provided. Unbiased models can be estimated with little appreciable loss of predictive utility, and this is demonstrated.
Article
This review highlights the importance of recognizing the possibility for doing harm when intentions are good. It describes several examples showing that well-planned and adequately executed programs provide no guarantee for safety or efficacy. The author concludes with recommendations for scientifically credible evaluations to promote progress in the field of crime prevention.
Article
Although blacks compose only 12 percent of the national population, they account for almost 50 percent of the prison population. Many states have adopted the use of guidelines for sentencing, parole, and decisions concerning the level of probationer supervision. Some argue that use of certain factors in guidelines systematically adversely affects minority offenders. The extent to which commonly used guideline factors are correlated with race and recidivism was established using data on over 16,500 offenders convicted of felonies in California in 1980. Race and recidivism correlations were calculated for all convicted felons, for probationers, and for prisoners. When all factors in the data base were used, accuracy in predicting rearrests was seldom greater than a 20 percent improvement over chance. The use only of factors that were not racially correlated increased predictive accuracy from 3 to 9 percent above chance; including racially correlated factors increased predictive accuracy another 5-12 percent. When status factors related to race are excluded, the guidelines identify high-risk criminals about as well as they do now, but racially correlated factors that reflect seriousness of crimes cannot be omitted unless society is willing to treat serious offenders less severely because many of them are black.
Article
This article reviews the literature of the past 20 years on offender classification. Early developments represented a convergence of professional, legal, and political demands. Recent progress in several areas is noted: risk assessment and correctional supervision, classification based on psychological characteristics, and needs assessment. An integration of trends argues for a systems approach to classification, connecting it more specifically to intervention. Current efforts that warrant further attention are discussed.
Article
The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from ROC analysis to help with classifier design. We present results from two studies in which we dealt with skewed data sets and unequal, but unknown costs of error. We also compare for one domain these results to those obtained by over-sampling and under-sampling the data set. The operations of sampling, moving the decision threshold, and adjusting the cost matrix produced sets of classifiers that fell on the same ROC curve.
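A brief sketch of the "moving the decision threshold" operation, which also connects to this article's theme of choosing operating points along the ROC curve, is shown below on synthetic, imbalanced data:

```python
# Sketch: pick a decision threshold that hits a target FPR on the ROC curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)
target_fpr = 0.10
idx = np.searchsorted(fpr, target_fpr)          # fpr is nondecreasing along the curve
print(f"threshold {thresholds[idx]:.3f} gives FPR {fpr[idx]:.3f}, TPR {tpr[idx]:.3f}")
```

The same trained model thus yields a family of classifiers, one per threshold, all lying on its ROC curve; sampling or cost-matrix adjustments move along that curve in a comparable way.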