
The Scientific Validity of Current Approaches to Violence and Criminal Risk Assessment

Seena Fazel

Submitted version
Introduction
Criminal justice systems in many high-income countries use some form of structured
risk assessment tool or instrument to inform decisions about sentencing, parole,
release and probation. These tools typically consider two aspects: an individual's future risk of reoffending, and the criminogenic needs that, if addressed, could mitigate this future
risk. One estimate is that there are more than 300 such risk assessment tools (Singh,
Desmarais et al. 2014), many of which are heavily marketed and sold commercially.
In the US alone, one report based on a review from 1970 to 2012 documented that 39
states have their own risk assessment tools (Desmarais, Johnson et al. 2016). In
contrast, in England and Wales, there is one risk tool in place for prisons and
probation, called OASys (Offender Assessment System), which has been revised because
its first edition was found to have poor predictive performance (Howard and Dixon
2012). Typically, such tools include a set of risk factors, which may or may not be
weighted, to provide a classification of risk (such as high, medium or low), or a
probabilistic score (i.e. a percentage probability of reoffending within a certain time-
frame), or both. At its most basic, a small number of static (or unchangeable) risk
factors, such as sex, age, and previous offending, are used to determine high, medium
or low risk but without any information as to what these categories actually mean in
terms of probabilities, data on accuracy, or how these risk factors translate into one of
these categories. The increasing use of these tools has been driven by the need to
provide more consistent and defensible estimates of future risk and, in tools that are more focused on needs, better matching of treatments and interventions to the limited resources of criminal justice systems. The needs-based approaches attempt to assess
individual factors that are thought to be related to offending, such as certain attitudes,
stable accommodation, relationship problems, and family support. The uptake of these
tools can also be explained by research findings, which suggest, in general terms, that
they are better at prediction than human beings (Ægisdóttir, White et al. 2006), and
that unstructured clinical judgement (or the subjective judgement of individuals
without any explicit framework of assessment) may be biased for many different
reasons, including recent experience, prejudice against minority groups, and attitudes
towards certain offences.
This chapter will present a brief overview of performance measures for risk
assessment instruments, and then summarise a number of recent systematic reviews
examining the accuracy of commonly used instruments. I will then identify some gaps
in the field, and discuss whether the current tools are fit for purpose.
Measuring the statistical performance of risk assessment
tools
There are two approaches to testing the performance of such instruments:
discrimination and calibration. Discrimination measures a particular tool’s ability to
distinguish between those who have offended and those who have not by assigning a
higher risk score or category to those who offend. Discrimination is tested by
reporting sensitivity, specificity, positive predictive value and negative predictive
value (see definitions below), which can only be calculated at specific risk cut-offs. In
addition, an overall measure of discrimination across all possible cut-offs is the area
under the curve (AUC; reported as a c statistic or c-index in some studies), which
tests the probability that a randomly selected offender has a higher score on a tool
than a randomly selected non-offender. The curve is the Receiver Operating Characteristic curve (or ROC curve), which plots the true positive rate (sensitivity) against the false positive rate (1-specificity) at every possible cut-off. To take one example, an AUC of 0.70 is the equivalent of saying that a
tool will correctly assign a higher score 70% of the time to a randomly selected
offender than a randomly selected non-offender. Many studies rely on simply
presenting discrimination statistics, and even then, only the AUC, which on its own is
uninformative. For example, a tool may discriminate well between higher and lower risk groups when averaged across all possible cut-offs, but be used in practice at only one specific cut-off, where its discrimination is much poorer. This can be exemplified in the case of a risk
assessment tool that has 30 items, and is scored from 0 to 30. If the tool is tested in a
research study, and it correctly assigns all the offenders with a score of 2 compared to
all non-offenders who score 0 and 1, then it will have a perfect AUC of 1. However,
the guidelines for the use of the tool state that a score of 5 and above should be used
to determine high risk of offending, and therefore the AUC statistic masks its poor
intended performance. If used as intended with a cut-off of 5, everyone in the sample would be assigned to the low risk category, even though some of these individuals are offenders; at that cut-off, the tool's ability to separate offenders from non-offenders is no better than chance. AUCs below 0.5 are worse than chance; in other words, such models are systematically wrong. This is one of the
the reasons why presenting a range of performance measures is important, particularly
true and false positives and negatives. Indicative values of good discrimination
measures have been discussed but there is no clear consensus (Singh 2013).
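To make the worked example above concrete, here is a minimal Python sketch (the scores and outcomes are invented purely for illustration) that computes the AUC as the probability of correct ranking and then shows what happens at the recommended cut-off of 5.

```python
import numpy as np
from itertools import product

# Hypothetical data reproducing the example in the text: on a tool scored 0-30,
# all offenders happen to score 2 and all non-offenders score 0 or 1.
scores_offenders = np.array([2, 2, 2, 2, 2])
scores_nonoffenders = np.array([0, 1, 0, 1, 0])

# AUC: the probability that a randomly chosen offender scores higher than a
# randomly chosen non-offender (ties count as 0.5).
pairs = product(scores_offenders, scores_nonoffenders)
auc = np.mean([1.0 if o > n else 0.5 if o == n else 0.0 for o, n in pairs])
print(f"AUC across all possible cut-offs: {auc:.2f}")   # 1.00, 'perfect' discrimination

# But the tool's guidance says a score of 5 or more means high risk.
cutoff = 5
sensitivity = np.mean(scores_offenders >= cutoff)             # offenders flagged high risk
false_positive_rate = np.mean(scores_nonoffenders >= cutoff)  # non-offenders flagged high risk
print(f"Sensitivity at cut-off {cutoff}: {sensitivity:.2f}")                    # 0.00
print(f"False positive rate at cut-off {cutoff}: {false_positive_rate:.2f}")    # 0.00
# Every person is classified as low risk, including all the offenders, so the
# headline AUC of 1.0 masks useless performance at the cut-off actually used.
```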
Further, an instrument may be accurate in identifying risk groups but do so in a
way that is very different to their real offending rates. In such a case, a tool may
estimate offending rates of 10% for higher risk offenders compared with 9% for lower risk offenders, and hence discriminates between these two groups. But if the higher risk offenders actually offend at rates of around 40% and the lower
risk offenders at 1%, then it is very poorly calibrated and has little if any practical
utility as a prediction model (Lindhiem, Petersen et al. 2018). Calibration refers to the
agreement between observed outcomes (i.e. offending) and predictions from a
particular tool. For example, if there is a prediction of a 30% reoffending risk
following release from prison in 1 year, the observed frequency for reoffending
should be around 30 out of 100 released prisoners with such a prediction.
Sensitivity (the proportion of people who have offended that an instrument correctly classifies as high risk) needs to be high if the aim is to screen individuals, as when screening for a disease (e.g. before further costly or more invasive investigations), and it is important from a public policy perspective because the consequences of 'missing' an individual who offends need to be considered. The corollary of sensitivity is the false negative rate (calculated as 1-sensitivity): the proportion of individuals who commit crimes that the
tool misses. A false negative rate of, say, 5% is equivalent to the tool not correctly
identifying 5 out of every 100 individuals who have offended. Specificity (the
proportion of individuals who have not offended that are correctly identified) should
be high if the implications of being labeled high risk are harmful (e.g. longer
sentences or preventative detention). The false positive rate is the inverse of specificity (1-specificity): the proportion of people that the tool incorrectly predicts
will commit crimes. The relative proportion of true and false positive and negative
rates will be determined by a range of legal, ethical and political concerns. Low false
negative and positive rates will clearly be preferred, but a high false positive rate
could be acceptable if the consequences of being labelled higher risk are not harmful.
To exemplify this, if a tool does not miss individuals who reoffend on release (low
false negative rate) but also identifies many people as high risk who do not reoffend
(high false positive rate), this is less concerning if the consequences for those
incorrectly identified as high risk are not harmful, such as additional support on
release. Where it will be problematic is if the high risk group have their prison
sentences extended.
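The definitions above can be collected into a short sketch; the counts are hypothetical and are chosen only to show how each measure is derived from a 2x2 classification table.

```python
def classification_metrics(tp, fp, fn, tn):
    """Discrimination measures from a 2x2 table of tool decisions against outcomes.

    tp: offenders correctly labelled high risk
    fp: non-offenders incorrectly labelled high risk
    fn: offenders incorrectly labelled low risk (the tool's 'misses')
    tn: non-offenders correctly labelled low risk
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_negative_rate": 1 - sensitivity,   # offenders the tool misses
        "false_positive_rate": 1 - specificity,   # non-offenders flagged high risk
        "ppv": tp / (tp + fp),                    # high-risk group who actually offend
        "npv": tn / (tn + fn),                    # low-risk group who do not offend
    }

# Hypothetical counts: 100 offenders and 400 non-offenders assessed at one cut-off.
print(classification_metrics(tp=80, fp=150, fn=20, tn=250))
```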
These decisions require some comparators, such as the balance of errors that would occur without using any tool, or a direct comparison between two approaches. Some tools have tried to maximise the combination of sensitivity and specificity by adjusting cut-off points, for example by identifying the point on the ROC curve that offers the best trade-off between the true positive rate (on the y axis) and the false positive rate (on the x axis). This point is then translated into a whole-number cut-off that gives the best discrimination in the particular sample being studied, as illustrated in the sketch below. The problem with this approach is that such a cut-off is unlikely to generalise to other samples, and pre-specifying a cut-off is methodologically preferable. In other words, a cut-off identified statistically in this way will likely apply only to the specific sample being studied rather than to new populations.
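The following sketch illustrates this approach with invented scores and outcomes, using the Youden index (sensitivity + specificity - 1) as one way of formalising the 'best trade-off' point; as noted above, a cut-off chosen this way is tuned to the sample at hand and may not transfer to new populations.

```python
import numpy as np

def best_cutoff(scores, offended):
    """Return the whole-number cut-off that maximises the Youden index
    (sensitivity + specificity - 1) in this particular sample."""
    scores = np.asarray(scores)
    offended = np.asarray(offended, dtype=bool)
    best_c, best_j = None, -np.inf
    for c in range(int(scores.min()), int(scores.max()) + 2):
        high = scores >= c
        sensitivity = np.mean(high[offended])
        specificity = np.mean(~high[~offended])
        j = sensitivity + specificity - 1
        if j > best_j:
            best_c, best_j = c, j
    return best_c, best_j

# Hypothetical derivation sample: tool scores and whether each person reoffended.
scores   = [1, 3, 4, 4, 6, 7, 8, 9, 11, 12]
offended = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(best_cutoff(scores, offended))   # a cut-off optimised for this sample only
```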
Some commentators have suggested that positive predictive value (PPV; the
proportion of people that a tool identifies as high risk that actually offend) and
negative predictive value (NPV; the proportion identified as low risk who do not offend) are more relevant to criminal justice, as this reflects how these tools are used in practice
(Buchanan and Leese 2001, Coid, Ullrich et al. 2013). The main limitation with this
approach is that these two measures, alongside sensitivity and specificity, are also
sensitive to the base rate: the PPV will be low if the rate of offending in the population of interest is low, and the NPV will be high. Nevertheless, the NPV is increasingly important in some countries where decarceration is a public policy priority, as it provides information on the proportion of prisoners who can be safely released (i.e. who will not reoffend within a specified time period). It is also important for some
populations such as juveniles and women, where prison should be avoided if possible,
due to secondary effects on education, work, family and social networks, and mental
health (Abram, Zwecker et al. 2015). Sensitivity, specificity, PPV and NPV will change if a tool's cut-off changes: if the threshold for high risk increases, then sensitivity and NPV will decrease, and correspondingly specificity and PPV will increase. This is one reason why the AUC is often presented as a summary statistic, as
it presents measures of discrimination (sensitivity and 1-specificity) at all possible
cut-offs. At the same time, using AUCs to compare risk tools is problematic as very
different numbers of false negative and false positive predictions resulting from
different shapes of receiver operating curve may have the same overall AUC (Mallett,
Halligan et al. 2012).
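The dependence of PPV and NPV on the base rate can be made explicit with a simple calculation from Bayes' theorem. In the sketch below, sensitivity and specificity are held fixed at illustrative values (0.75 and 0.70, not taken from any particular tool) and only the offending rate is varied.

```python
def ppv_npv(sensitivity, specificity, base_rate):
    """Positive and negative predictive values implied by a given sensitivity,
    specificity and base rate of offending (all expressed as proportions)."""
    tp = sensitivity * base_rate
    fp = (1 - specificity) * (1 - base_rate)
    fn = (1 - sensitivity) * base_rate
    tn = specificity * (1 - base_rate)
    return tp / (tp + fp), tn / (tn + fn)

# The same hypothetical tool applied to populations with different offending rates.
for base_rate in (0.05, 0.20, 0.50):
    ppv, npv = ppv_npv(0.75, 0.70, base_rate)
    print(f"base rate {base_rate:.0%}: PPV {ppv:.0%}, NPV {npv:.0%}")
# As offending becomes rarer, the PPV falls and the NPV rises, even though the
# input sensitivity and specificity are held constant in this calculation.
```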
The other key measure of a tool’s performance is calibration. This asks how
closely the tool’s predicted risk matches the observed risk. For example, a tool that
predicts a 20% chance of offending in a particular sample, but only 10% actually
offended, is poorly calibrated. Calibration can be examined graphically by plotting
predicted risk versus observed offending behaviour, or through statistics that measure the typical level of miscalibration, such as the Brier score or the Hosmer-Lemeshow (HL) statistic (Lindhiem, Petersen et al. 2018). Calibration is the key performance measure when a tool produces only probability scores, as cut-off-specific discrimination measures can only be calculated once a limited number of cut-offs has been defined. One important area of contention relevant to calibration is the G2I ('group to individual') problem. Its proponents argue that group information cannot be applied to individuals because of a lack of precision: when an actuarial tool provides a probability score of 30%, applying this to an individual is subject to the potentially large variation underlying the probability score, so that 30% may in effect mean anything from 10% to 50% for that individual and is therefore not informative. However, this view is based on a misunderstanding of statistics: all
individual predictions are based on group data, and their precision will be a
consequence of sample size (Imrey and Dawid 2015). A probability score of 30% from a risk assessment tool can be interpreted as stating that individuals with the same risk factor profile will, on average, reoffend at a rate of 30%.
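A minimal sketch of how calibration might be checked is given below. The predicted probabilities and outcomes are simulated and deliberately miscalibrated; the Brier score, the overall comparison of mean predicted with observed rates, and the grouped comparison by decile of predicted risk stand in for the fuller graphical assessment described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: predicted reoffending probabilities and observed outcomes.
# The true risk is deliberately set lower than the predictions (miscalibration).
predicted = rng.uniform(0.05, 0.60, size=1000)
observed = rng.binomial(1, predicted * 0.6)

# Brier score: mean squared difference between prediction and outcome (lower is better).
print(f"Brier score: {np.mean((predicted - observed) ** 2):.3f}")

# Calibration-in-the-large: overall mean prediction against the overall observed rate.
print(f"mean predicted {predicted.mean():.2f} vs observed {observed.mean():.2f}")

# Grouped calibration: within deciles of predicted risk, compare the mean predicted
# probability with the observed offending rate (the basis of a calibration plot).
edges = np.quantile(predicted, np.linspace(0.1, 0.9, 9))
deciles = np.digitize(predicted, edges)
for d in range(10):
    group = deciles == d
    print(f"decile {d + 1}: predicted {predicted[group].mean():.2f}, "
          f"observed {observed[group].mean():.2f}")
```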
The overall performance of currently used risk assessment
tools
So what do we know about the performance of currently used tools in criminal
justice? There have been a number of systematic reviews that have outlined their
performance. Interestingly, none of them has reported calibration statistics as it seems
that this is very rarely reported in the research literature. In fact, one 2013 review of
how AUCs were presented in 50 studies did not find a single calibration metric (Singh,
Desmarais et al. 2013). The review by Yang and colleagues in 2010 looked at head-to-head comparisons of 9 violence risk assessment tools, and identified 28 studies, in no more than 7,221 individuals, that reported AUCs and a measure of effect size (Cohen's d). It concluded that there was little difference between the included risk assessment measures, whose AUCs varied between 0.65 and 0.71 (Yang, Wong et al. 2010). A later and more comprehensive review of an overlapping but different set of 9
instruments identified 73 studies including 24,827 people (Fazel, Singh et al. 2012).
This review presented a broader range of discrimination statistics, reported separately for violent offending and for any criminal offending. The findings differed by type of predicted outcome: for violent crime, sensitivity was high (0.92) and specificity low (0.36), with moderate PPV (41%) and high NPV (91%). For any offending, sensitivity was low (0.41) and specificity high (0.80), with moderate PPV (52%) and NPV (76%). In terms of AUCs, the figure was 0.72 for violent offending and 0.66 for criminal offending. Overall, these are mixed discrimination metrics, with moderate AUCs and NPVs, and they suggest that the use of these tools in practice needs to reflect these
differing performance metrics. One possibility is to screen out low risk offenders.
Another is to solely use these tools as adjuncts in the decision-making process due to
positive predictive values of around 40-50%. Finally, due to the low specificity of
violence risk assessment, they should only be used when the consequences of high
risk categories are non-harmful interventions, such as additional management or
treatment. Another way of looking at these findings is to focus on false negative and false positive rates: for tools predicting violent outcomes, these were 8% and 64%, respectively; for tools predicting any criminal outcomes (such as the LSI-R), they were 59%
false negative and 20% false positive. If the implications of false positive rates are not
harmful, this would suggest that instruments predicting violent outcomes should be
prioritised over those focusing on any crime. In other words, this review found that
the balance between false negatives (low for tools focusing on violent crime but more
than 50% for tools with any crime outcomes) and false positives (high for tools
focusing on violent crime but lower in those predicting any crime) favours the violent
risk assessment tools if the consequences of false positive (i.e. being labelled high risk
and not reoffending) are not harmful. The 59% false negative rate for tools predicting
any crime is arguably too high for their widespread use in criminal justice.
A third notable review summarised research on the predictive validity of 19
instruments used in US corrections from 1970 to 2012 (Desmarais, Johnson et al.
2016). This review underscores the problems with the reporting of this literature. It
found that only summary statistics were presented and solely for general recidivism
(as distinct from violent recidivism). The median AUCs of these tools ranged from 0.64 to 0.71 for new offences, and in real-life settings, the Level of Service Inventory-Revised (LSI-R), which is a commonly used tool, had an AUC of 0.63 and the RMS
an AUC of 0.66. As with the other reviews, no information on calibration was
reported, which is problematic as all the 19 included tools provide probabilistic scores
of reoffending (and, in some cases, parole violations).
Overall, based on these recent systematic reviews of current risk assessment
tools, there are problems in how these instruments are reported, with insufficient
information on their performance. In addition, there are other problems with
transparency. The statistical contribution of individual risk factors to the final model, and the process by which they were chosen and categorised, should be outlined. This transparency is important as it allows experts to critically appraise the models, including the nature of the sample in which a tool was derived, the choice of predictors and how they were categorised, the statistical power of the study, and the precision of the performance measures. This is particularly important if harm follows from a tool's use, such as longer sentences, certain interventions, and more restrictions in the community. Another problem is the potential financial and non-financial conflicts of interest among researchers in this field: many studies of these tools are conducted by the individuals who developed or translated them (Singh,
Grann et al. 2013). Such potential conflicts need to be disclosed, which currently
rarely occurs.
Scalability and cost need to be considered: some of the tools have commercial licences (such as COMPAS, the Correctional Offender Management Profiling for Alternative Sanctions), and can take up to 60 minutes to complete. Many of these tools
also assess individual needs and treatment (and linked to responsivity, which is the
extent to which an intervention is responsive to the individual needs identified), and
their predictive validity is one element in their potential value. However, conflating
risk and needs can lead to a loss of predictive performance for risk, and empirically robust risk calculators are required before more careful assessment of needs and treatment.
Further, there have been some recent attempts to focus on causal risk factors, on the basis that addressing these will lead to reductions in recidivism (Howard and Dixon 2013).
However, one problem with this approach is that the most predictive factors (eg age,
previous crime) are not causal, and excluding such factors will lead to poorer
performance in terms of prediction. If the next stage of any risk management process
is needs assessment, then identifying causal risk factors will be informative but will
require different approaches (such as quasi-experimental designs and treatment trials
rather than correlational studies of risk factors). Another issue is that the performance
of current tools shrinks when used in real-world settings as distinct from research
studies. A recent example was reported for the commonly used Psychopathy
Checklist, revised edition (PCL-R). In a field trial in Belgium, its predictive validity
was poor with an overall AUC of 0.63 for general recidivism and 0.57 for violent
recidivism (Jeandarme, Edens et al. 2017), which compares unfavourably to mostly
research studies that have reported higher AUCs of 0.66-0.67 (Singh, Grann et al.
2011) (Yang, Wong et al. 2010). The Level of Service Inventory (LSI-R), when used
prospectively in over 22,000 prisoners in Washington State, USA, was associated
with an AUC of 0.64 for violent recidivism (Barnoski and Aos 2003), which is lower
than its performance in psychiatric samples and research studies. This shrinkage is a
consequence of a number of methodological weaknesses in the design of these tools
(see below for LSI-R).
A practical guide to evaluate risk assessment tools
So what to make of this in practice? How can individuals in criminal justice and
public policy determine whether a tool is fit for purpose? We have proposed a ten-
point guide (Fazel and Wolf 2018), which I will summarise. I will start with criteria
relevant to the derivation (or discovery or development) study, and then move on to
criteria relating to the validation of risk assessment tools. The relevant criteria are:
1. Did the study deriving the tool follow a protocol?
This is a key component if a study is to provide an accurate representation of a tool’s
performance. Without a protocol, the likelihood of creating a tool that reports strong
performance measures but performs poorly in practice is very high. The sample,
candidate variables, outcome(s), follow-up periods, statistical analyses, and output
should all be pre-specified before any data analysis is performed. This protocol
should be published, and any deviations from it in any particular study be clearly
explained and justified (such as a predictor being dropped because of large
proportions of missing data).
2. How were candidate variables selected for the tool?
The more variables that are tested in a derivation study, particularly if the sample is not sufficiently large, the more likely it is that chance associations are found and that the resulting model will not perform well in external validation. One
rule of thumb is that for each variable tested the derivation sample should have at
least 10 outcomes (Royston and Sauerbrei 2008). Further, the choice of which
variables to test and how they are categorized should have followed a protocol, and
multivariable regression should have been conducted to determine their independent
association with the outcome before inclusion in a model. Otherwise, tools will
include variables that do not add incremental predictive accuracy, and lead to over-
complicated and time-consuming instruments.
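As a rough arithmetic illustration of the events-per-variable rule of thumb just mentioned (the 10-events-per-variable figure is a heuristic rather than a fixed requirement):

```python
def minimum_events(n_candidate_variables, events_per_variable=10):
    """Rough minimum number of outcome events (e.g. reoffenders) needed in a
    derivation sample under the events-per-variable rule of thumb."""
    return n_candidate_variables * events_per_variable

# Example: testing 42 candidate variables would call for roughly 420 outcome
# events, so a derivation sample with only 191 reoffenders falls well short.
print(minimum_events(42))   # 420
```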
3. How were variables weighted?
Many tools in criminal justice give equal weighting to all included items. This
makes two assumptions: first, that all included predictors have the same association
with the outcome, and second, that the variables are all independently related to the
outcome. In terms of weighting, previous violent crime and living in a poor
neighbourhood are both associated with higher risk of crime, but they are not
equally important. Tools that have not weighted individual items will perform worse
(Hamilton, Neuilly et al. 2015).
4. How were other parameters selected?
Other key aspects of any research study should be determined beforehand, such as
the follow-up period for the tool. If this has not been done, to take an example, a
particular tool may perform better at 3 years rather than 1 or 2 years, and the
researchers might decide that 3 years is the primary outcome. The problem with this
approach is that it is a form of multiple testing and the consequence will be that the
tool performs considerably worse in real-world settings.
5. Has internal validation been done?
This is typically done using a method called bootstrapping, which draws repeated random samples (with replacement) from the derivation dataset to estimate how much the apparent performance measures are inflated by overfitting, as sketched below.
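A compact sketch of what such a bootstrap (optimism-corrected) internal validation can look like is given below; it uses simulated data and scikit-learn's logistic regression purely for illustration and does not reproduce any particular published tool.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Simulated derivation data: five candidate predictors, two of them uninformative.
n = 1000
X = rng.normal(size=(n, 5))
true_logit = -1.0 + X @ np.array([0.8, 0.5, 0.3, 0.0, 0.0])
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

model = LogisticRegression().fit(X, y)
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Bootstrap optimism: refit on each resampled dataset, then compare its apparent
# performance (on the bootstrap sample) with its performance on the original data.
optimism = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)            # sample rows with replacement
    boot = LogisticRegression().fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], boot.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, boot.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)

corrected_auc = apparent_auc - np.mean(optimism)
print(f"apparent AUC {apparent_auc:.3f}, optimism-corrected AUC {corrected_auc:.3f}")
```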
6. Has the tool been externally validated?
This question examines whether the tool’s performance has been investigated in a
new sample. In many ways, this is the most important question as tools tend to
perform considerably better in the sample in which they were derived (Khoury,
Gwinn et al. 2010, Monahan and Skeem 2016) and an external validation is
necessary to test how accurate it is. Splitting the original derivation sample into two
random groups is a form of internal validation, but is not external validation due to
the equal distribution of predictor variables. Such a split will lead to comparable
performance because the predictors will have a very similar distribution in the
derivation and the randomly split samples. To achieve external validation, the
sample should be split on other variables, which are not related to the outcome
(Fazel, Chang et al. 2016).
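The difference between a random split and a split on a variable unrelated to the outcome can be sketched as follows. The cohort, the 'region' variable and the model are all simulated; in this toy example the two estimates will be similar because region is generated independently of everything else, whereas in real data, where case mix differs across regions or time periods, the non-random split is the more demanding test.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

# Simulated cohort: predictors, outcome, and a region label unrelated to the outcome.
n = 4000
X = rng.normal(size=(n, 4))
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + X[:, 0] + 0.5 * X[:, 1]))))
region = rng.choice(["north", "south"], size=n)

# (a) Random split: predictor distributions are near-identical in both halves,
# so this is a form of internal validation.
random_half = rng.random(n) < 0.5
m = LogisticRegression().fit(X[random_half], y[random_half])
auc_random = roc_auc_score(y[~random_half], m.predict_proba(X[~random_half])[:, 1])

# (b) Split on a variable not related to the outcome (here, region): derive in one
# region and validate in the other, which is closer to external validation.
derive = region == "north"
m = LogisticRegression().fit(X[derive], y[derive])
auc_regional = roc_auc_score(y[~derive], m.predict_proba(X[~derive])[:, 1])

print(f"random-split AUC {auc_random:.3f}, regional-split AUC {auc_regional:.3f}")
```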
7. Has this validation been done in the population of interest?
Here the key issue is whether the new population for which the tool will be used has similar characteristics, risk factors, baseline risk, and outcomes to the sample in which the tool was created. This may explain why some tools, such as the PCL-R, which was not developed to predict violence risk but to identify a form of personality disturbance, perform among the worst of commonly used tools (Singh,
Grann et al. 2011). In addition, this is problematic for some tools developed in
selected samples of high-risk offenders (which appears to have been the case for
LSI-R) that are then used in general criminal justice samples, such as all individuals
in prison or on probation.
8. Has the validation been conducted using robust methods?
Validation studies should stay true to the original model and be based on a protocol, with anticipated changes specified beforehand (e.g. whether recalibration will be considered if the underlying base rate of offending is different, and how this recalibration of the model will be tested). Otherwise, what appears to be a validation
is no longer an external validation, but the derivation of a new model. The sample
size is also important, and should have ensured at least 100 events (or outcomes) for
statistical power (Collins, Ogundimu et al. 2016). Results should be published in
peer-reviewed journals, but, on its own, this is not a marker of methodological
quality. Studies should provide sufficient methodological detail to be replicable.
9. Has the validation study reported essential information?
As described above, tools should report both measures of discrimination (especially
rates of false positives and negatives) and calibration (ideally with a graphical plot
that compares observed with predicted risks).
10. Is the risk assessment tool useful, feasible, and acceptable?
The tool should provide useful information including a relevant outcome (eg
positive prediction of reoffending), and clearly defined risk categories. The tools and
their constituent predictors should also be easy to complete, reliable, and clearly defined. For example, rating scales (e.g. 1-5 Likert scales) may vary between
raters. The tool should have face validity by including essential items (for example
age and sex), and justify the inclusion of other items. There are advantages in having
interview-independent tools to reduce the possibility of observer bias.
If a particular tool has not been externally validated, we argue that it should not be
used in practice apart from rare circumstances when alternatives are not appropriate
or available, and external validation is ongoing (Fazel and Wolf 2018). And even if it
has been externally validated, instruments should undergo prospective validation after
implementation to monitor their ongoing accuracy.
Applying quality criteria to individual risk assessment tools
The extent to which risk assessment tools currently used in criminal justice meet these
10 criteria needs to be systematically evaluated, but few of them appear to meet more than one
or two. To take some examples of commonly used tools, on these five criteria for
derivation discussed above, two such instruments, the HCR-20 (Historical Clinical
Risk Management-20) and VRAG (Violence Risk Appraisal Guide), meet few
criteria. The HCR-20 chose its 20 predictors based on expert opinion in 1997 rather
than a systematic review of the evidence or testing them in multivariable models (an
approach the authors reported in the following way: ‘What variables might clinicians
and administrators consider as they attempt evaluations of risk of violence in cases
where psychiatric disorders are thought to be involved?’) (Webster, Douglas et al.
1997, p. 251). The derivation did not include any statistical performance measures.
Each item is scored as ‘0’ (item not present), ‘1’ (item possibly present), or ‘2’ (item
definitely present) rather than assigning any weighting to them (Douglas and Reeves
2010). Age and sex, two of the strongest predictors of violence and considered
important for face validity, were not included. In developing the VRAG, 42 candidate variables were collected from a single sample of 618 mentally disordered Canadian offenders (of whom 191 reoffended). Of those, 332 individuals had been received into a maximum-security prison, and the remaining 286 had been admitted to a secure hospital for a brief pre-trial psychiatric assessment. With regard to the outcome, 191
reoffenders does not provide sufficient statistical power for 42 candidate variables
(Harris, Rice et al. 1993), and good practice would suggest that at least double the
number of reoffenders would be required for derivation. The VRAG’s derivation
study reports performance measures at five different cut-offs (which were not pre-
specified), and does not provide an overall performance measure. As with the HCR-
20, the offender's sex was not one of the variables considered and hence was not included in the final model, which consists of 12 weighted items.
Two other widely used tools are difficult to evaluate due to lack of published
information about certain aspects of their derivation and original validation. The LSI-R is based on 54 dynamic items; the OASys Violence Predictor (OVP) in England and Wales, which is completed for all individuals who receive sentences of 12 months or more, is derived from the 100-item OASys (Howard and Dixon 2012). The LSI-R, however, does not include some of the most powerful predictors, such as age or gender, and has items that appear to be psychometrically unreliable (such as 'could make better use of time', 'very few prosocial friends' and four items on current
attitudes). Importantly, the original derivation study has not been published to my
knowledge. The OASys is better reported and has some selected publications
explaining aspects of its derivation, but lacks detail on some key areas (Howard and
Dixon 2011, Howard and Dixon 2013). At the same time, both the LSI-R and OASys weight individual predictors that were tested using logistic regression when the models were developed, use relatively simple scoring methods, and have been subject to external validation.
Putting this all together, I would argue that the most commonly used tools in criminal justice are not fit for purpose as prediction instruments. To my knowledge, none of them meets all 10 of the tests outlined above, and few meet more than one or two of
the criteria outlined. At the same time, some of these instruments may provide a
useful framework for organising information, act as a reminder for those working in
criminal justice to assess certain risk factors and domains, and match individuals for
treatment based on needs. The first two of these justifications are arguably too high a
price to pay for those instruments that are resource-intensive.
OxRec model
After reviewing this literature, I have been part of a team that has developed OxRec
(Oxford Risk of Recidivism tool), using Swedish national data, which provides a
probabilistic score for violent and any reoffending in 1 and 2 years post-release from
prison, and also low/medium and high categories based on pre-specified levels. It can
be completed in 5-10 minutes using 14 routinely collected predictors, via a freely available online calculator (Fazel, Chang et al. 2016). The weighting of the individual predictors and how they are combined to create a probability score have been published (with the original protocol), together with a full range of discrimination and
calibration statistics, making it a fully transparent risk prediction model. It has been
externally validated in Sweden in more than 10,000 individuals leaving prison, with
ongoing validations in the Netherlands and some other countries (Fazel 2019), and provides a methodologically rigorous approach with which to develop risk assessment instruments. The probability score is relatively precise as the model was derived in 37,100
released prisoners.
Summary
In summary, I have outlined some key ways of evaluating the performance of risk
assessment instruments in criminal justice, and highlighted the importance of both
investigating measures of discrimination and calibration. I have outlined some
systematic reviews of the field, which suggest that many current tools, such as the
LSI-R and PCL-R, have at best moderate performance in discrimination with no
information on calibration. Most tools currently used in criminal justice have not been
included in these reviews because research on their external validation has not been
published. Further, the development of risk assessment tools in criminal justice has
lagged behind methodological improvements in prognostic models in science, and
particularly in medicine.
Finally, I have provided a ten-point checklist that can be used to evaluate any
risk tool. On this basis, I have argued that current widely used tools should probably not be used for prediction. At the very least, their use should be reviewed in the light of the ten tests outlined, and information that is lacking should be requested from these tools' developers and the commercial entities marketing them. In terms of the implications for predictive sentencing, risk predictions from these tools, whether as categories such as high, medium or low, or as probability scores, do not have a sufficient evidence base to support their current use in court. As I have shown, current risk
assessment tools have not met some basic criteria in how they were derived or in
subsequent validations of their performance. Furthermore, when empirically tested on a range of measures, mostly in research studies, they typically produce unacceptably high false positive or false negative rates, particularly in tools aimed
at any recidivism. I have also discussed the development and validation of a new
scalable prediction tool, OxRec, which represents a methodological advance and
provides a model for transparent reporting of such tools.
References
Abram, KM, Zwecker, NA, Welty, LJ, Hershfield, JA, Dulcan, MK and Teplin, LA
(2015) 'Comorbidity and Continuity of Psychiatric Disorders in Youth after
Detention: A Prospective Longitudinal Study' 72 JAMA Psychiatry: 84.
Ægisdóttir, S, White, MJ, Spengler, PM, Maugherman, AS, Anderson, LA, Cook, RS,
Nichols, CN, Lampropoulos, GK, Walker, BS and Cohen, G (2006) 'The Meta-
Analysis of Clinical Judgment Project: Fifty-Six Years of Accumulated Research on
Clinical Versus Statistical Prediction' 34 The Counseling Psychologist: 341.
Barnoski, R and Aos, S (2003) 'Washington’s Offender Accountability Act: An
Analysis of the Department of Corrections’ Risk Assessment' (Olympia, Washington
State Institute for Public Policy).
Buchanan, A and Leese, M (2001) 'Detention of People with Dangerous Severe
Personality Disorders: A Systematic Review' 358 Lancet: 1955.
Coid, JW, Ullrich, S and Kallis, C (2013) 'Predicting Future Violence among
Individuals with Psychopathy' 203 The British Journal of Psychiatry: 387.
Collins, GS, Ogundimu, EO and Altman, DG (2016) 'Sample Size Considerations for
the External Validation of a Multivariable Prognostic Model: A Resampling Study' 35
Statistics in Medicine: 214.
Desmarais, SL, Johnson, KL and Singh, JP (2016) 'Performance of Recidivism Risk
Assessment Instruments in Us Correctional Settings' 13 Psychological Services: 206.
Douglas, KS and Reeves, KA (2010) Historical-Clinical-Risk Management-20 (HCR-
20) Violence Risk Assessment Scheme: Rationale, Application, and Empirical
Overview (Abingdon, Routledge/Taylor & Francis Group).
Fazel, S, Chang, Z, Fanshawe, T, Långström, N, Lichtenstein, P, Larsson, H and
Mallett, S (2016) 'Prediction of Violent Reoffending on Release from Prison:
Derivation and External Validation of a Scalable Tool' 3 Lancet Psychiatry: 535.
Fazel, S, Singh, JP, Doll, H and Grann, M (2012) 'Use of Risk Assessment
Instruments to Predict Violence and Antisocial Behaviour in 73 Samples Involving 24
827 People: Systematic Review and Meta-Analysis' 345 British Medical Journal
e4692.
Fazel, S and Wolf, A (2018) 'Selecting a Risk Assessment Tool to Use in Practice: A
10-Point Guide' 21 Evidence-based Mental Health: 41.
Fazel, S, Wolf, A, Vazquez-Montes, MDLA and Fanshawe, TR (2019) 'Prediction of Violent Reoffending in Prisoners and Individuals on Probation: A Dutch Validation Study (OxRec)' 9 Scientific Reports: 841.
Hamilton, Z, Neuilly, M-A, Lee, S and Barnoski, R (2015) 'Isolating Modeling
Effects in Offender Risk Assessment' 11 Journal of Experimental Criminology: 299.
Harris, GT, Rice, ME and Quinsey, VL (1993) 'Violent Recidivism of Mentally
Disordered Offenders: The Development of a Statistical Prediction Instrument' 20
Criminal Justice and Behavior: 315.
Howard, P and Dixon, L (2011) 'Developing an Empirical Classification of Violent
Offences for Use in the Prediction of Recidivism in England and Wales' 3 Journal of
Aggression, Conflict and Peace Research: 141.
Howard, PD and Dixon, L (2012) 'The Construction and Validation of the Oasys
Violence Predictor: Advancing Violence Risk Assessment in the English and Welsh
Correctional Services' 39 Criminal Justice and Behavior: 287.
Howard, PD and Dixon, L (2013) 'Identifying Change in the Likelihood of Violent
Recidivism: Causal Dynamic Risk Factors in the Oasys Violence Predictor' 37 Law
and Human Behavior: 163.
Imrey, PB and Dawid, AP (2015) 'A Commentary on Statistical Assessment of
Violence Recidivism Risk' 2 Statistics and Public Policy: 1.
Jeandarme, I, Edens, JF, Habets, P, Bruckers, L, Oei, K and Bogaerts, S (2017) 'Pcl-R
Field Validity in Prison and Hospital Settings' 41 Law and Human Behavior: 29.
Khoury, MJ, Gwinn, M and Ioannidis, JP (2010) 'The Emergence of Translational
Epidemiology: From Scientific Discovery to Population Health Impact' 172 American
Journal of Epidemiology: 517.
Lindhiem, O, Petersen, IT, Mentch, LK and Youngstrom, EA (2018) 'The Importance
of Calibration in Clinical Psychology' Assessment. (epub:
https://doi.org/10.1177/1073191117752055)
Mallett, S, Halligan, S, Thompson, M, Collins, GS and Altman, DG (2012)
'Interpreting Diagnostic Accuracy Studies for Patient Care' 345 British Medical
Journal e3999.
Monahan, J and Skeem, JL (2016) 'Risk Assessment in Criminal Sentencing' 12
Annual Review of Clinical Psychology: 489.
Royston, P and Sauerbrei, W (2008) Multivariable Model-Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables (Chichester, John Wiley & Sons).
Singh, JP (2013) 'Predictive Validity Performance Indicators in Violence Risk
Assessment: A Methodological Primer' 31 Behavioral Sciences & the Law: 8.
Singh, JP, Desmarais, SL, Hurducas, C, Arbach-Lucioni, K, Condemarin, C, Dean, K,
Doyle, M, Folino, JO, Godoy-Cervera, V, Grann, M, Ho, RMY, Large, MM, Nielsen,
LH, Pham, TH, Rebocho, MF, Reeves, KA, Rettenberger, M, de Ruiter, C, Seewald,
K and Otto, RK (2014) 'International Perspectives on the Practical Application of
Violence Risk Assessment: A Global Survey of 44 Countries' 13 International
Journal of Forensic Mental Health: 193.
Singh, JP, Desmarais, SL and Van Dorn, RA (2013) 'Measurement of Predictive
Validity in Violence Risk Assessment Studies: A Second-Order Systematic Review'
31 Behavioral Sciences & the Law: 55.
Singh, JP, Grann, M and Fazel, S (2011) 'A Comparative Study of Violence Risk
Assessment Tools: A Systematic Review and Metaregression Analysis of 68 Studies
Involving 25,980 Participants' 31 Clinical Psychology Review: 499.
Singh, JP, Grann, M and Fazel, S (2013) 'Authorship Bias in Violence Risk
Assessment? A Systematic Review and Meta-Analysis' 8 PloS One: e72484.
Webster, CD, Douglas, KS, Eaves, D and Hart, SD (1997) 'Assessing Risk of
Violence to Others' in C Webster and M Jackson (eds), Impulsivity: Theory, Assessment, and Treatment (New York, NY, US: Guilford Press).
Yang, M, Wong, SC and Coid, J (2010) 'The Efficacy of Violence Prediction: A
Meta-Analytic Comparison of Nine Risk Assessment Tools' 136 Psychological
Bulletin: 740.
13
... As the use of risk assessment instruments has increased, so has the diversity of professionals providing violence risk expert evidence in court (Storey et al., 2013). As a result, the quality of violence risk assessments may fluctuate: not all evaluators are equally competent and not all instruments have a strong empirical basis or have been appropriately validated (e.g., Fazel, 2019;Hopton et al., 2018). Despite these issues, research to date shows that courts generally accept the findings of risk assessments without evidentiary challenges being raised (Cox et al., 2018;Neal et al., 2019). ...
... Most studies into the HCR-20 V3 to date have only examined the predictive validity of the numerical codings. A few studies have studied numeric totals of scales as well as summary risk ratings of the HCR-20 V3 finding good predictive validity for both (see Hogan & Olver, 2016, 2019Persson et al., 2017). A smaller number of studies tested incremental validity and found that the summary risk ratings added incrementally to numerical scores (Neil et al., 2020;Strub et al., 2014). ...
... Authors or translators who know the tool very well may have better coding skills and show more enthusiasm and fidelity in the use of it, although no clear evidence has been found for authorship bias so far (Singh et al., 2013). Still, more research from independent research groups and in different samples, settings and countries is highly needed and is currently emerging (e.g., Brookstein et al., 2020;Hogan & Olver, 2016, 2019Penney et al., 2016). To date, most of the studies with the HCR-20 V3like with the HCR-20 Version 2 -have been conducted in Western countries. ...
... These tools are increasingly used in Western criminal justice to inform decisions regarding sentencing, release, probation and parole, despite significant limitations. Almost all have been developed without predetermined protocols, are not externally validated, do not report a range of recommended performance measures, and rarely include modifiable risk factors (i.e., substance misuse and mental illness) (7,8). ...
... To evaluate the model's discrimination, the area under the receiver operating characteristic curve (AUC), or c index, was used (22). The AUC takes values between 0.5 and 1, and represents the probability that individuals who commit violent crimes will be given a higher-risk score than those who do not reoffend (7). We also calculated sensitivity, specificity, positive and negative predictive values (PPV and NPV) for various prespecified risk thresholds to inform potential benefits (and harms) for intended management purposes. ...
... We present key performance measures of discrimination (i.e., true and false positives and negatives) for various risk thresholds, in addition to the AUC. Those are often overlooked in validation studies, despite being important in terms of potential consequences for justice-involved individuals and the wider society, as they are likely to inform decisions relating to rehabilitation and public safety (7). The study design is another strength. ...
Article
Full-text available
Background Although around 70% of the world's prison population live in low- and middle-income countries (LMICs), risk assessment tools for criminal recidivism have been developed and validated in high-income countries (HICs). Validating such tools in LMIC settings is important for the risk management of people released from prison, development of evidence-based intervention programmes, and effective allocation of limited resources. Methods We aimed to externally validate a scalable risk assessment tool, the Oxford Risk of Recidivism (OxRec) tool, which was developed in Sweden, using data from a cohort of people released from prisons in Tajikistan. Data were collected from interviews (for predictors) and criminal records (for some predictors and main outcomes). Individuals were first interviewed in prison and then followed up over a 1-year period for post-release violent reoffending outcomes. We assessed the predictive performance of OxRec by testing discrimination (area under the receiver operating characteristic curve; AUC) and calibration (calibration statistics and plots). In addition, we calculated sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) for different predetermined risk thresholds. Results The cohort included 970 individuals released from prison. During the 12-month follow-up, 144 (15%) were reincarcerated for violent crimes. The original model performed well. The discriminative ability of OxRec Tajikistan was good (AUC = 0.70; 95% CI 0.66–0.75). The calibration plot suggested an underestimation of observed risk probabilities. However, after recalibration, model performance was improved (Brier score = 0.12; calibration in the large was 1.09). At a selected risk threshold of 15%, the tool had a sensitivity of 60%, specificity of 65%, PPV 23% and NPV 90%. In addition, OxRec was feasible to use, despite challenges to risk prediction in LMICs. Conclusion In an external validation in a LMIC, the OxRec tool demonstrated good performance in multiple measures. OxRec could be used in Tajikistan to help prioritize interventions for people who are at high-risk of violent reoffending after incarceration and screen out others who are at lower risk of violent reoffending. The use of validated risk assessment tools in LMICs could improve risk stratification and inform the development of future interventions tailored at modifiable risk factors for recidivism, such as substance use and mental health problems.
... Only for few instruments information about validation analyses conducted by researchers besides the instrument's authoring team were available. The lack of external evaluation of instruments could lead to an overestimation of the instruments' assessment quality (Fazel, 2019). Therefore, more independent evaluations of the psychometric properties of all instruments presented in this review could be considered as highly recommendable from a methodological standpoint. ...
... Evaluation should be evidence-based, psychometrically appropriate, and independent evaluations should be supported, if possible, by publishing and sharing relevant information like manuals or guidelines. Another argument can be made for validation studies that are conducted independently from an instrument's developer to reduce authoring bias and analyze external validity (Fazel, 2019;Singh et al., 2013). Especially, the risk communication represents an important challenge and is the foundation of effective management and intervention planning (Ritter et al., 2023). ...
Article
Full-text available
Frontline law enforcement, police, and security personnel of various backgrounds have the challenging task to identify extremists who have a high risk for committing violent acts, describe driving risk trajectories, prioritize the use of scarce resources, and develop individualized risk management plans. In this line of work, risk and threat assessment instruments are frequently used to standardize the development of individual risk profiles and guide decision-making processes. The scope of this article is to provide an overview of the current state-of-the-art risk and threat assessment instruments for violent extremism by conducting a systematic literature research. Comparisons of the following instruments’ characteristics, development, application, and validation are reported: Violent Extremism Risk Assessment, Version 2–Revised (VERA-2R), Terrorist Radicalization Assessment Protocol (TRAP-18), Extremism Risk Guidelines 22+ (ERG 22+), Multi-Level Guidelines Version 2 (MLG Version 2), Islamic Radicalization (IR-46), Structured Assessment of Violent Extremism (SAVE), Radicalisation Awareness Network Center of Excellence Returnee 45 (RAN CoE Returnee 45), Regelbasierte Analyse potentiell destruktiver Täter zur Einschätzung des akuten Risikos—islamistischer Terrorismus (in English: rule-based analysis of potentially destructive perpetrators to assess the acute risk—Islamist terrorism; RADAR-iTE), and Investigative Search for Graph-Trajectories (INSiGHT). Most instruments are applied to violent extremism in general without specification of ideological phenomena; however, some are specifically developed for Islamism or right-wing extremism or certain subtypes of extremists like returnees. The number of factors, factor structures, and final risk evaluation varied substantially between instruments. The development of the instruments was regularly based on scientific theories and empirical data analysis approaches. However, data about the predictive validity was seldom available. Finally, future challenges and existing uncertainties within the approaches were discussed.
... Another rationale for their widespread use is that these tools may provide a more accurate and reliable risk assessment than unstructured clinical judgement (AEgisdóttir et al., 2006), and provide consistency within and across services. Applications of these tools have been advocated as an evidence-based approach towards treatment allocation, particularly for people in prison and on release from prison with psychiatric disorders and substance misuse where there are treatments that could modify risk (Fazel, 2019;Yu et al., 2022). ...
... We examined several model performance indicators to determine the predictive ability of the model in terms of discrimination (the model's ability to separate out individuals who have reoffended from those who have not), and calibration (the level of agreement between observed and expected outcomes). These indicators included the area under the receiver operating characteristic curve (AUC-ROC), or c index, as well as sensitivity, specificity, and positive and negative predictive values (PPV and NPV) (for discrimination); and the Brier score, calibration slope and calibration-in-the-large (CITL) (for calibration), defined as the ratio of prevalence of observed to predicted events (Fazel, 2019;Steyerberg, 2009). We selected cut-off scores that were easy to interpret and close to the baseline rates to calculate sensitivity, specificity, NPV, and PPV values. ...
... Various studies have shown that risk predictions carried out by humans are often far from accurate (see e.g. [14,15]). Similarly, meta-studies of algorithmic risk assessments have found only moderate levels of predictive accuracy [16,17]. ...
Article
Full-text available
Artificial intelligence is currently supplanting the work of humans in many societal contexts. The purpose of this article is to consider the question of when algorithmic tools should be regarded as performing sufficiently well to replace human judgements and decision-making at sentencing. More precisely, the question as to which are the ethically plausible criteria for the comparative performance assessments of algorithms and humans is considered with regard to both risk assessment algorithms that are designed to provide predictions of recidivism and sentencing algorithms designed to determine sentences in individual criminal cases. It is argued, first, that the prima facie most obvious assessment criteria do not stand up to ethical scrutiny. Second, that ethically plausible criteria presuppose ethical theory on penal distribution which currently has not been sufficiently developed. And third, that the current lack of assessment criteria has comprehensive implications regarding when algorithmic tools should be implemented in criminal justice practice.
... Diagnosticity, however, suggests that it is entirely possible that attending to statistical evidenceand, by extension, replacing case-by-case human judgment with algorithmic judgmentcould actually enhance the legal system's ability to track the truth. In fact, sensitivity, specificity and predictive value are regularly used to assess the performance of algorithmic prediction tools (Fazel 2020;Hester 2020). A significant part of tuning algorithmic risk assessment tools consists in trying to find the appropriate balance between sensitivity and specificity, with that relationship represented by the ROC (Hester 2020). ...
Article
Full-text available
The rapidly increasing role of automation throughout the economy, culture and our personal lives has generated a large literature on the risks of algorithmic decision-making, particularly in high-stakes legal settings. Algorithmic tools are charged with bias, shrouded in secrecy, and frequently difficult to interpret. However, these criticisms have tended to focus on particular implementations, specific predictive techniques, and the idiosyncrasies of the American legal-regulatory regime. They do not address the more fundamental unease about the prospect that we might one day replace judges with algorithms, no matter how fair, transparent, and intelligible they become. The aim of this paper is to propose an account of the source of that unease, and to evaluate its plausibility. I trace foundational unease with algorithmic decision-making in the law to the powerful intuition that there is a basic moral and legal difference between showing that something is true of many people just like you and showing that it is true of you . Human judgment attends to the exception; automation insists on blindly applying the rule. I show how this intuitive thought is connected to both epistemological arguments about the value of statistical evidence, as well as to court-centered conceptions of the rule of law. Unease with algorithmic decision-making in the law thus draws on an intuitive principle that underpins a disparate range of views in legal philosophy. This suggests the principle is deeply ingrained. Nonetheless, I argue that the powerful intuition is not as decisive as it may seem, and indeed runs into significant epistemological and normative challenges. At an epistemological level, I show how concerns about statistical evidence's ability to track the truth can be resolved by adopting a probabilistic, rather than modal, conception of truth-tracking. At a normative level, commitment to highly individualized decision-making co-exists with equally ingrained and competing principles, such as consistent application of law. This suggests that the “rule of law” may not identify a discrete set of institutional arrangements, as proponents of a court-centric conception would have it, but rather a more loosely defined set of values that could potentially be operationalized in multiple ways, including through some level of algorithmic adjudication. Although the prospect of replacing judges with algorithms is indeed unsettling, it does not necessarily entail unreasonable verdicts or an attack on the rule of law.
... Also, current violence risk assessment tools have been developed for heterogeneous and non-psychopathic populations. 15 However, the clinical use of violence risk assessment tools is not widely adopted, especially in Asia. 16 Different violence risk assessment tools have significant differences in their predictive rates, and the same tool may have different predictive rates for violence in different regions. ...
Article
Full-text available
Objective Prevention, de-escalation, and management of violence in the acute psychiatric ward is essential. Few studies have focused on differences in the duration of high-violence risk between different profiles of high-violence risk. This study aimed to analyze the data of high-violence patients and duration of high-violence risk to provide a new perspective on violence prevention, de-escalation and management. Methods This retrospective observational cohort study included 171 patients who were treated in the acute psychiatric ward of Keelung Chang Gung Memorial Hospital between January 2016 and June 2020, and who were assessed daily as having high violence risk. All patient data were collected from electronic hospital records (eg, age, gender, diagnosis, violence history, self-harm history, and admission condition (involuntary admission, discharged against medical advice). Between-group differences in disease severity, use of antipsychotics and benzodiazepine, and duration of high violence risk were analyzed using regression analysis. Results Only patients’ age was significantly associated with duration of high-violence risk (P = 0.028), making it predictive of longer duration of high-violence risk. In patients with schizophrenia spectrum disorder or bipolar disorder, higher severity was significantly associated with longer duration of high-violence risk (P = 0.007, P = 0.001, respectively). Conclusion Only age is a predictor of longer duration of violence risk in psychiatric patients, although higher severity is associated with higher violence risk. Study results may help management and healthcare staff better understand how quickly or slowly violence risk will decrease and may improve efficient use of healthcare resources and individualized patient-centered care.
... This limitation is even more serious cross-culturally, as the accuracy of prediction relies upon the representativeness of samples (Gottfredson and Moriarty, 2006), which are often predominantly Euro-American, while validations in non-Euro-American samples are rare (e.g., Douglas, Pugh, Singh, Savulescu, & Fazel, 2017; Shepherd & Lewis-Fernandez, 2016). This challenges the utility of commonly used assessment tools for prediction purposes (Fazel, 2019). For example, the Canadian Supreme Court determined that "the Correctional Service of Canada had not taken reasonable steps to make sure its tools gave accurate and complete results for Indigenous offenders", as it continued using them despite being aware of concerns about their validity, resulting in ethnic discrimination (Ewert vs. Canada, 2018). ...
Article
The Static-99, Static-99R, and STABLE-2007 are internationally well-established instruments for predicting static and dynamic risks of sexual recidivism in individuals convicted of sexual offenses. Previous meta-analyses assessed their predictive and incremental validity, but none has yet compared the two Static versions and the Static–STABLE combinations. Here, we implemented diagnostic test accuracy network meta-analysis (DTA-NMA) to compare all tests and identify optimal cutoffs in one comprehensive analysis. The DTA-NMA included 32 samples comprising 45,224 adult male individuals. More information was available on the Static-99 (22 samples; 34,316 individuals) and the Static-99R (13 samples; 27,243 individuals), compared to the Static-99/STABLE-2007 (three samples; 762 individuals), the Static-99R/STABLE-2007 (two samples; 2,972 individuals), and the STABLE-2007 (three samples; 816 individuals). The primary outcome was the area under the receiver operating characteristic curve (AUC). The secondary outcomes were sensitivity and specificity. Optimal cutoffs were determined using the Youden index. The AUC suggested moderate predictive validity for Static-99 and Static-99R, whereas STABLE-2007 had no predictive value. The optimal cutoff of Static-99R was suggested to have higher specificity than that of Static-99, whereas sensitivity was comparable between instruments. The notion of incremental validity for STABLE-2007 could not be confirmed. This work represents the first meta-analysis to compare Static-99, Static-99R, STABLE-2007, and their combinations in one analysis. Static-99R demonstrated the highest specificity in predicting the risk of sexual recidivism, indicating a potential advantage in detecting true nonrecidivists. The findings are discussed, considering the current recommendations for assessing the risk of sexual recidivism in the criminal justice system.
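The cutoff analysis described above can be illustrated with a short, hedged sketch: simulated scores and outcomes (not Static-99 or STABLE-2007 data) are used to compute the AUC and to pick the cutoff that maximizes the Youden index (sensitivity + specificity − 1). All variable names and numbers below are illustrative assumptions.

```python
# Hedged sketch: choosing an "optimal" cutoff with the Youden index, as in the
# meta-analysis above. Scores and outcomes are simulated, not Static-99/99R data.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
n = 1000
recidivated = rng.binomial(1, 0.15, size=n)          # simulated outcome (1 = reoffended)
score = recidivated * 2 + rng.normal(0, 2, size=n)   # simulated risk score

auc = roc_auc_score(recidivated, score)              # discrimination (AUC)
fpr, tpr, thresholds = roc_curve(recidivated, score)
youden = tpr - fpr                                   # sensitivity + specificity - 1
best = int(np.argmax(youden))
print(f"AUC={auc:.2f}, optimal cutoff={thresholds[best]:.2f}, "
      f"sensitivity={tpr[best]:.2f}, specificity={1 - fpr[best]:.2f}")
```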
Article
Evidence-based sentencing (EBS) is a new name for an aspiration that has deep roots in criminal law: to apply the sentence most appropriate to each offender's risk of reoffending, in order to reduce that risk as far as possible. This modern version of the traditional sentencing goals of rehabilitation and incapacitation fits into the broader approach of so-called “evidence-based public policy.” It takes the view that the best existing evidence for reducing reoffending comes from modern structured risk assessment tools, and claims to be able to achieve several goals at once: reducing reoffending, maintaining high levels of public safety, making more efficient use of public resources, and moving criminal policy away from ideological battles by basing it on the objective knowledge provided by the best available scientific evidence. However, despite the success of this approach in recent years, it is not clear to what extent it succeeds in correctly assessing the risk of individual offenders, nor whether it achieves its intended effect of reducing recidivism. This paper aims to critically examine these two issues: the quality of the scientific evidence on which EBS is based, and the available data on the extent to which it achieves (or does not achieve) its intended goals.
Article
Scalable and transparent methods for risk assessment are increasingly required in criminal justice to inform decisions about sentencing, release, parole, and probation. However, few such approaches exist and their validation in external settings is typically lacking. A total national sample of all offenders (9072 individuals released from prison and 6329 individuals on probation) from 2011–2012 in the Netherlands was followed up for violent and any reoffending over 2 years. The sample was mostly male (n = 574 [6%] were female prisoners and n = 784 [12%] were female probationers), and median ages were 30 in the prison sample and 34 in those on probation. Predictors for a scalable risk assessment tool (OxRec) were extracted from a routinely collected dataset used by criminal justice agencies, and outcomes from official criminal registers. OxRec’s predictive performance in terms of discrimination and calibration was tested. Reoffending rates in the Dutch prisoner cohort were 16% for 2-year violent reoffending and 44% for 2-year any reoffending, with lower rates in the probation sample. Discrimination as measured by the c-index was moderate, at 0.68 (95% CI: 0.66–0.70) for 2-year violent reoffending in prisoners and between 0.65 and 0.68 for other outcomes and the probation sample. The model required recalibration, after which calibration performance was adequate (e.g. calibration in the large was 1.0 for all scenarios). A recalibrated model for OxRec can be used in the Netherlands for individuals released from prison and individuals on probation to stratify their risk of future violent and any reoffending. The approach that we outline can be considered for external validations of criminal justice and clinical risk models.
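For readers unfamiliar with the external-validation measures mentioned above, the following hedged sketch shows one common way to compute them: the c-index (equivalent to the AUC for a binary outcome), calibration-in-the-large expressed as an observed/expected ratio, and a simple intercept-only recalibration. The predicted risks are simulated and are not OxRec output; only standard numpy, scipy and scikit-learn calls are used.

```python
# Hedged sketch of external validation checks: c-index, calibration-in-the-large
# (observed/expected), and intercept recalibration. Data are simulated, not OxRec.
import numpy as np
from scipy.optimize import brentq
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 5000
true_lp = rng.normal(-1.8, 1.0, size=n)                    # simulated linear predictor
reoffended = rng.binomial(1, 1 / (1 + np.exp(-true_lp)))   # simulated 2-year outcome
pred_risk = 1 / (1 + np.exp(-(true_lp + 0.5)))             # deliberately miscalibrated predictions

c_index = roc_auc_score(reoffended, pred_risk)             # c-index = AUC for binary outcomes
oe_ratio = reoffended.mean() / pred_risk.mean()            # calibration-in-the-large (O/E)

# Intercept-only recalibration: shift every prediction on the logit scale by a
# constant so the mean predicted risk matches the observed rate in the new setting.
logit = np.log(pred_risk / (1 - pred_risk))
shift = brentq(lambda a: (1 / (1 + np.exp(-(logit + a)))).mean() - reoffended.mean(), -5, 5)
recal_risk = 1 / (1 + np.exp(-(logit + shift)))
print(f"c-index={c_index:.2f}, O/E before={oe_ratio:.2f}, "
      f"O/E after recalibration={reoffended.mean() / recal_risk.mean():.2f}")
```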
Article
With the increase in the number of risk assessment tools and clinical algorithms in many areas of science and medicine, this Perspective article provides an overview of research findings that can assist in informing the choice of an instrument for practical use. We take the example of violence risk assessment tools in criminal justice and forensic psychiatry, where there are more than 200 such instruments and their use is typically mandated. We outline 10 key questions that researchers, clinicians and other professionals should ask when deciding what tool to use, which are also relevant for public policy and commissioners of services. These questions are based on two elements: research underpinning the external validation, and the derivation or development of a particular instrument. We also recommend some guidelines for reporting, drawn from consensus guidelines for research on prognostic models.
Article
Background: More than 30 million people are released from prison worldwide every year, a group that includes many individuals at high risk of perpetrating interpersonal violence. Because there is considerable inconsistency and inefficiency in identifying those who would benefit from interventions to reduce this risk, we developed and validated a clinical prediction rule to determine the risk of violent offending in released prisoners. Methods: We did a cohort study of a population of released prisoners in Sweden. Through linkage of population-based registers, we developed predictive models for violent reoffending for the cohort. First, we developed a derivation model to determine the strength of prespecified, routinely obtained criminal history, sociodemographic, and clinical risk factors using multivariable Cox proportional hazard regression, and then tested them in an external validation. We measured discrimination and calibration for prediction of our primary outcome of violent reoffending at 1 and 2 years using cutoffs of 10% for 1-year risk and 20% for 2-year risk. Findings: We identified a cohort of 47 326 prisoners released in Sweden between 2001 and 2009, with 11 263 incidents of violent reoffending during this period. We developed a 14-item derivation model to predict violent reoffending and tested it in an external validation (assigning 37 100 individuals to the derivation sample and 10 226 to the validation sample). The model showed good measures of discrimination (Harrell's c-index 0.74) and calibration. For risk of violent reoffending at 1 year, sensitivity was 76% (95% CI 73-79) and specificity was 61% (95% CI 60-62). Positive and negative predictive values were 21% (95% CI 19-22) and 95% (95% CI 94-96), respectively. At 2 years, sensitivity was 67% (95% CI 64-69) and specificity was 70% (95% CI 69-72). Positive and negative predictive values were 37% (95% CI 35-39) and 89% (95% CI 88-90), respectively. Of individuals with a predicted risk of violent reoffending of 50% or more, 88% had drug and alcohol use disorders. We used the model to generate a simple, web-based, risk calculator (OxRec) that is free to use. Interpretation: We have developed a prediction model in a Swedish prison population that can assist with decision making on release by identifying those who are at low risk of future violent offending, and those at high risk of violent reoffending who might benefit from drug and alcohol treatment. Further assessments in other populations and countries are needed. Funding: Wellcome Trust, the Swedish Research Council, and the Swedish Research Council for Health, Working Life and Welfare.
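The cutoff-based accuracy measures reported above (sensitivity, specificity, PPV and NPV at a fixed predicted-risk threshold) follow directly from a 2x2 table, as in this hedged sketch. The data are simulated, and the 20% threshold simply mirrors the 2-year cutoff mentioned in the abstract.

```python
# Hedged sketch: sensitivity, specificity, PPV and NPV at a 20% predicted-risk cutoff.
# Simulated data; not the Swedish cohort described above.
import numpy as np

rng = np.random.default_rng(2)
n = 10000
pred_risk = rng.beta(2, 8, size=n)            # simulated predicted 2-year risks (mean ~20%)
reoffended = rng.binomial(1, pred_risk)       # simulated outcomes

flagged = pred_risk >= 0.20                   # classified "high risk" at the 20% cutoff
tp = np.sum(flagged & (reoffended == 1))
fp = np.sum(flagged & (reoffended == 0))
fn = np.sum(~flagged & (reoffended == 1))
tn = np.sum(~flagged & (reoffended == 0))

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                          # positive predictive value
npv = tn / (tn + fn)                          # negative predictive value
print(f"sens={sensitivity:.2f} spec={specificity:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
```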
Article
The past several years have seen a surge of interest in using risk assessment in criminal sentencing, both to reduce recidivism by incapacitating or treating high-risk offenders and to reduce prison populations by diverting low-risk offenders from prison. We begin by sketching jurisprudential theories of sentencing, distinguishing those that rely on risk assessment from those that preclude it. We then characterize and illustrate the varying roles that risk assessment may play in the sentencing process. We clarify questions regarding the various meanings of "risk" in sentencing and the appropriate time to assess the risk of convicted offenders. We conclude by addressing four principal problems confronting risk assessment in sentencing: conflating risk and blame, barring individual inferences based on group data, failing adequately to distinguish risk assessment from risk reduction, and ignoring whether, and if so, how, the use of risk assessment in sentencing affects racial and economic disparities in imprisonment.
Article
Objectives Recent evolutions in actuarial research have revealed the potential for increased utility of machine learning and data-mining strategies to develop statistical models such as classification/decision-tree analysis and neural networks, which are said to mimic the decision-making of practitioners. The current article compares such actuarial modeling methods with a traditional logistic regression risk-assessment development approach. Methods Utilizing a large purposive sample of Washington State offenders (N = 297,600), the current study examines and compares the predictive validity of the currently used Washington State Static Risk Assessment (SRA) instrument to classification tree analysis/random forest and neural network models. Results Overall findings varied depending on the outcome of interest, with the best model for each method resulting in AUCs ranging from 0.732 to 0.762. Findings reveal some predictive performance improvements with advanced machine-learning methodologies, yet the logistic regression models demonstrate comparable predictive performance. Conclusions The study concluded that while data-mining techniques hold potential for improvements over traditional methods, regression-based models demonstrate comparable, and often improved, prediction performance with noted parsimony and greater interpretability.
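The comparison described above, fitting a regression model and machine-learning models to the same predictors and comparing held-out discrimination, can be sketched roughly as follows. The data are simulated and the hyperparameters are arbitrary assumptions; this is not the Washington State SRA dataset or model specification.

```python
# Hedged sketch: logistic regression vs random forest on the same simulated predictors,
# compared by held-out AUC. Not the Washington State SRA data or specification.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 20000, 10
X = rng.normal(size=(n, p))                              # simulated risk factors
lin = -2 + X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-lin)))              # simulated reoffending outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

auc_lr = roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1])
auc_rf = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
print(f"logistic regression AUC={auc_lr:.3f}, random forest AUC={auc_rf:.3f}")
```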
Article
Increasing integration and availability of data on large groups of persons has been accompanied by proliferation of statistical and other algorithmic prediction tools in banking, insurance, marketing, medicine, and other fields (see, e.g., Steyerberg (2009a, b)). Controversy may ensue when such tools are introduced to fields traditionally reliant on individual clinical evaluations. Such controversy has arisen about "actuarial" assessments of violence recidivism risk, i.e., the probability that someone found to have committed a violent act will commit another during a specified period. Recently, Hart et al. (2007a) and subsequent papers from these authors in several reputable journals have claimed to demonstrate that statistical assessments of such risks are inherently too imprecise to be useful, using arguments that would seem to apply to statistical risk prediction quite broadly. This commentary examines these arguments from a technical statistical perspective, and finds them seriously mistaken in many particulars. They should play no role in reasoned discussions of violence recidivism risk assessment.
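One element of the precision debate referenced above concerns how wide the confidence interval around a risk category's observed reoffending rate really is. A minimal, hedged illustration (with made-up numbers, not taken from any instrument) shows how a 95% Wilson interval for a group rate narrows as the validation sample grows.

```python
# Hedged illustration: 95% Wilson intervals for an assumed 30% group reoffending rate
# at different validation-sample sizes. Figures are illustrative, not from any tool.
import math

def wilson_ci(events, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = events / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

for n in (50, 500, 5000):                 # size of the risk category in validation data
    events = round(0.3 * n)               # assumed 30% observed reoffending rate
    lo, hi = wilson_ci(events, n)
    print(f"n={n}: observed rate 30%, 95% CI {lo:.2f}-{hi:.2f}")
```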
Chapter
With the population of adults under correctional supervision in the United States at an all-time high, psychologists and other professionals working in U.S. correctional agencies face mounting pressures to identify offenders at greater risk of recidivism and to guide treatment and supervision recommendations. Risk assessment instruments are increasingly being used to assist with these tasks; however, relatively little is known regarding the performance of these tools in U.S. correctional settings. In this review, we synthesize the findings of studies examining the predictive validity of assessments completed using instruments designed to predict general recidivism risk, including committing a new crime and violating conditions of probation or parole, among adult offenders in the United States. We searched for studies conducted in the United States and published between January 1970 and December 2012 in peer-reviewed journals, government reports, Master's theses, and doctoral dissertations using PsycINFO, the U.S. National Criminal Justice Reference Service Abstracts, and Google. We identified 53 studies (72 samples) conducted in U.S. correctional settings examining the predictive validity of 19 risk assessment instruments. The instruments varied widely in the number, type, and content of their items. For most instruments, predictive validity had been examined in one or two studies conducted in the United States that were published during the reference period. Only two studies reported on inter-rater reliability. No instrument emerged as producing the "most" reliable and valid risk assessments. Findings suggest the need for continued evaluation of the performance of instruments used to predict recidivism risk in U.S. correctional agencies.
Article
After developing a prognostic model, it is essential to evaluate the performance of the model in samples independent from those used to develop the model, which is often referred to as external validation. However, despite its importance, very little is known about the sample size requirements for conducting an external validation. Using a large real data set and resampling methods, we investigate the impact of sample size on the performance of six published prognostic models. Focussing on unbiased and precise estimation of performance measures (e.g. the c-index, D statistic and calibration), we provide guidance on sample size for investigators designing an external validation study. Our study suggests that externally validating a prognostic model requires a minimum of 100 events and ideally 200 (or more) events.
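The resampling logic behind the 100-to-200-events recommendation above can be sketched as follows: repeatedly draw external validation samples of different sizes from a simulated population and watch how the sampling variability of the estimated c-index shrinks as the number of outcome events increases. Population parameters and sample sizes here are assumptions for illustration, not the data set used in the study.

```python
# Hedged sketch: variability of the estimated c-index as a function of the number of
# events in the external validation sample. Simulated data, illustrative settings only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
N = 200_000
lp = rng.normal(-2, 1, size=N)
risk = 1 / (1 + np.exp(-lp))                      # "published model" predictions
y = rng.binomial(1, risk)                         # simulated population outcomes

for target_events in (50, 100, 200):
    n = int(target_events / y.mean())             # sample size giving roughly this many events
    aucs = []
    for _ in range(200):
        idx = rng.choice(N, size=n, replace=False)
        if 0 < y[idx].sum() < n:                  # both classes needed to compute the AUC
            aucs.append(roc_auc_score(y[idx], risk[idx]))
    print(f"~{target_events} events: SD of c-index across samples = {np.std(aucs):.3f}")
```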
Article
Psychiatric disorders and comorbidity are prevalent among incarcerated juveniles. To date, no large-scale study has examined the comorbidity and continuity of psychiatric disorders after youth leave detention. Objective: To determine the comorbidity and continuity of psychiatric disorders among youth 5 years after detention. Design: Prospective longitudinal study of a stratified random sample of 1829 youth (1172 male and 657 female; 1005 African American, 296 non-Hispanic white, 524 Hispanic, and 4 other race/ethnicity) recruited from the Cook County Juvenile Temporary Detention Center, Chicago, Illinois, between November 20, 1995, and June 14, 1998, and who received their time 2 follow-up interview between May 22, 2000, and April 3, 2004. Measures: At baseline, the Diagnostic Interview Schedule for Children Version 2.3; at follow-ups, the Diagnostic Interview Schedule for Children Version IV (child and young adult versions) and the Diagnostic Interview Schedule Version IV (substance use disorders and antisocial personality disorder). Results: Five years after detention, when participants were 14 to 24 years old, almost 27% of males and 14% of females had comorbid psychiatric disorders. Although females had significantly higher rates of comorbidity when in detention (odds ratio, 1.3; 95% CI, 1.0-1.7), males had significantly higher rates than females at follow-up (odds ratio, 2.3; 95% CI, 1.6-3.3). Substance use plus behavioral disorders was the most common comorbid profile among males, affecting 1 in 6. Participants with more disorders at baseline were more likely to have a disorder approximately 5 years after detention, even after adjusting for demographic characteristics. We found substantial continuity of disorder. However, some baseline disorders predicted alcohol and drug use disorders at follow-up. Conclusions: Although prevalence rates of comorbidity decreased in youth after detention, rates remained substantial and were higher than rates in the most comparable studies of the general population. Youth with multiple disorders at baseline are at highest risk for disorder 5 years later. Because many psychiatric disorders first appear in childhood and adolescence, primary and secondary prevention of psychiatric disorders offers the greatest opportunity to reduce costs to individuals, families, and society. Only a concerted effort to address the many needs of delinquent youth will help them thrive in adulthood.
Article
Outline: Introduction; Background; Using the Bootstrap to Explore Model Stability; Example 1: Glioma Data; Example 2: Educational Body-Fat Data; Example 3: Breast Cancer Diagnosis; Model Stability for Functions; Example 4: GBSG Breast Cancer Data; Discussion.
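The bootstrap approach to model stability outlined above is often illustrated by refitting a variable-selection procedure on bootstrap resamples and recording how often each candidate predictor is retained. The hedged sketch below uses simulated data and an L1-penalized logistic regression as the selection step; it is not the glioma, body-fat or GBSG analysis from the chapter.

```python
# Hedged sketch: bootstrap inclusion frequencies as a simple measure of model
# (selection) stability. Simulated data; L1-penalized logistic regression stands in
# for any variable-selection procedure.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n, p = 500, 8
X = rng.normal(size=(n, p))
lin = -1 + 1.0 * X[:, 0] + 0.6 * X[:, 1]            # only the first two predictors matter
y = rng.binomial(1, 1 / (1 + np.exp(-lin)))

B = 200
inclusion = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)                 # bootstrap resample (with replacement)
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    model.fit(X[idx], y[idx])
    inclusion += (model.coef_[0] != 0)               # record which predictors were retained

for j, freq in enumerate(inclusion / B):
    print(f"x{j}: selected in {freq:.0%} of bootstrap samples")
```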