EXPLAINABLE UNSUPERVISED ANOMALY DETECTION WITH
RANDOM FOREST
A PREPRINT
Joshua S. Harvey
Prospect 33, LLC, USA
New York, NY 10006
joshua.harvey@prospect33.com
Joshua Rosaler
BlackRock, Inc, USA
New York, NY 10001
joshua.rosaler@blackrock.com
Mingshu Li
Prospect 33, LLC, USA
New York, NY 10006
mingshu.li@prospect33.com
Dhruv Desai
BlackRock, Inc, USA
New York, NY 10001
dhruvdesai@alumni.upenn.edu
Dhagash Mehta
BlackRock, Inc, USA
New York, NY 10001
dhagash.mehta@blackrock.com
April 23, 2025
ABSTRACT
We describe the use of an unsupervised Random Forest for similarity learning and improved unsupervised anomaly detection. By training a Random Forest to discriminate between real data and synthetic data sampled from a uniform distribution over the real data bounds, a distance measure is obtained that anisometrically transforms the data, expanding distances at the boundary of the data manifold. We show that using distances recovered from this transformation improves the accuracy of unsupervised anomaly detection, compared to other commonly used detectors, demonstrated over a large number of benchmark datasets. As well as improved performance, this method has advantages over other unsupervised anomaly detection methods, including minimal requirements for data preprocessing, native handling of missing data, and potential for visualizations. By relating outlier scores to partitions of the Random Forest, we develop a method for locally explainable anomaly predictions in terms of feature importance.
Keywords: Anomaly detection · Random Forest · Similarity learning · Explainable Machine Learning
1 Introduction
Ensuring data quality is the first required step upstream of all data-driven functions. Integral to this is the detection of anomalies in datasets: records so dissimilar from the bulk of the data that they are best explained as having been generated by some other, potentially erroneous, process. Detecting anomalies is therefore critical both for rectifying faulty data input processes and for identifying signals of interest in veridical data.
Popular anomaly detection algorithms typically require extensive, hands-on preprocessing of datasets to be successful.
This is because ultimately such methods, whether they be distance-based such as k-nearest neighbor (kNN), or density-
based such as kernel density estimation (KDE), assume either implicitly or explicitly some distance metric over the
data. As such, they are sensitive to preprocessing operations such as the scaling of numerical features and the encoding
of categorical features. Particularly in contexts where a high volume of diverse datasets must be absorbed, such
preprocessing can present a bottleneck, requiring dataset-specific subject matter expertise.
Fundamentally, anomalies are those data points that show markedly low similarity to the rest of the dataset, and a
critical aspect underpinning anomaly detection techniques is the ability to measure, or learn to measure, the similarity
or dissimilarity between data points. Some notion of similarity is integral to all learning and discriminating systems
[Tversky, 1977], including recommendation systems, image recognition, and anomaly detection. Yet there are many
potential ways to calculate similarity, which may be more or less useful depending on the task at hand. Similarity learning is a subfield of machine learning that optimizes distance functions, that is, functions that compute the similarity between observations from their features [Santini and Jain, 1999], toward achieving some goal.
One approach to computing similarity is to employ a machine learning model to learn a distance function (which may
not be a distance metric in the strict sense) over a dataset, so-called similarity learning. While this presents some
added computational demands for data quality pipelines, it obviates the need for arbitrary dataset-specific preprocessing
decisions, allowing the pipeline to be fully automated. Random Forests present an attractive candidate model for
this task, due to their versatility in handling high-dimensional datasets of mixed variable types. Depending on their
implementation, they may also have native support for missing data as well as un-encoded categorical features. As
noted by Breiman and Cutler, distance (or proximity) discovery is “one of the most useful tools in Random Forests”
[Breiman and Cutler, 2001].
While typically used according to a supervised learning paradigm, where some feature of interest in the dataset serves
as the target for classification or regression, Random Forests can also be used in an unsupervised manner. This may be
achieved either by completely randomizing the splitting decision function (extremely randomized trees, or ExtraTrees)
[Geurts et al., 2006], or by training the Random Forest to discriminate between real, original data and synthetic data
generated to retain some but not all statistical properties of the original [Shi and Horvath, 2006]. Such approaches have
previously been explored for their application to anomaly detection [Baron and Poznanski, 2016, Mensi et al., 2022,
Puggini et al., 2015, Rhodes et al., 2023a], but have not been directly compared.
In this paper, we examine the distance measure learned by a Random Forest trained to discriminate between original data and synthetic data sampled uniformly over the bounds of the original (RFuni). We show that while the ExtraTrees distance function approximates Euclidean distances over numerical features, RFuni learns a representation that exaggerates distances at the boundary of the data manifold between outliers and inliers, making it particularly attractive for anomaly detection. Over a large number of anomaly detection benchmark datasets [Han et al., 2022], we show its superior performance over the use of distances obtained from the ExtraTrees model, and highly favorable performance compared to other state-of-the-art unsupervised anomaly detection algorithms. We also show how outlier predictions can be related to feature importance of the Random Forest, for locally explainable anomaly detection.
2 Previous work and our approach
Our approach can be reduced to four steps: (i) train a Random Forest with unsupervised learning on the dataset; (ii)
compute distances from the trained Random Forest; (iii) compute outlier scores from distances; and (iv) evaluate
anomaly detection performance. Each of these steps has multiple potential methods and has been the subject of
significant research effort. In this section, we review this background and describe our chosen approach.
2.1 Unsupervised Learning with Random Forests
Typically, random forests are trained to predict a particular target variable within the data. While one can calculate
distances from a random forest trained in this supervised manner, the model will only learn to represent relationships in
the data that can be exploited to predict the target variable. This may be helpful, but in the limiting case (where a target
variable shows complete conditional independence over all combinations of all other variables) could be completely
uninformative. An alternative approach, again originally described by Breiman and Cutler, is to train a random forest
to discriminate between the real dataset and some synthetic data, generated to share some but not all of the statistical
properties of the real data [Breiman and Cutler, 2001].
Shi and Horvath investigated the use of unsupervised Random Forest for similarity learning on datasets without inherent
class structure [Shi and Horvath, 2006]. They identified two different methods for generating the artificial data, which
the Random Forest model must learn to discriminate from the real data. The first is that proposed by Breiman and Cutler,
of sampling each feature from its univariate distribution in the real data, in effect shuffling each feature independently.
This method, originally called 'Addcl1', creates a synthetic dataset that retains the marginal distributions of the real
data, but not the joint distributions, and has been the most widely adopted approach for implementing unsupervised
Random Forests [Auret and Aldrich, 2010, Madhyastha et al., 2019, Seligson et al., 2005].
The second method was to sample each feature from a uniform distribution over the bounds of the real data. Considerably
less attention has been paid to this method, originally called 'Addcl2', although we argue it has attractive properties
for anomaly detection. While Addcl1 only learns similarity structure that is predictive of the joint distribution of the
data, anomalies may affect features that exhibit no conditional dependence on other features, and Addcl1-determined
transformations may not be sensitive to such deviations. Addcl2, on the other hand, will be sensitive to any features
whose distributions differ from that of a uniform distribution. We also consider the limiting case for each method.
For Addcl1, this is when no conditional independence is present for any combinations of features in the data. The
resultant Random Forest then partitions the space proportionally to the density of the data, in effect transforming the
space to one approximating even density, which is deleterious to anomaly detection.
The limiting case for Addcl2 occurs when all features approximate uniform distributions. In this case, the Random Forest partitions would be drawn from a uniform distribution, and the model is therefore equivalent to the ExtraTrees model [Geurts et al., 2006], learning a mapping that approximates Euclidean distances. We therefore apply the Addcl2 method, referring to unsupervised Random Forest models built with this approach as RFuni.
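To make the Addcl2 construction concrete, the following is a minimal sketch of the training step, assuming a purely numerical feature matrix; the function name fit_rf_uni and its defaults are our own, not the paper's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_rf_uni(X_real, n_synthetic=None, n_trees=500, random_state=0):
    """Fit an Addcl2-style unsupervised RF (RFuni): real data vs. uniform synthetic data."""
    rng = np.random.default_rng(random_state)
    n, d = X_real.shape
    n_synthetic = n_synthetic or n
    # Synthetic observations: each feature sampled uniformly between its observed min and max
    lo, hi = X_real.min(axis=0), X_real.max(axis=0)
    X_synth = rng.uniform(lo, hi, size=(n_synthetic, d))
    # Train a Random Forest to discriminate real (label 1) from synthetic (label 0) observations
    X_train = np.vstack([X_real, X_synth])
    y_train = np.concatenate([np.ones(n), np.zeros(n_synthetic)])
    return RandomForestClassifier(n_estimators=n_trees, random_state=random_state).fit(X_train, y_train)
```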
2.2 Random Forest Distances
Random Forest, as an adaptive nearest neighbor algorithm, offers a robust solution for local distance learning across disparate datasets from diverse domains. Central to this approach is first the fitting of a random forest model, optimized for a particular task, followed by the calculation of pairwise distances from that model, represented in an N × N distance matrix, given N observations in the dataset. Many methods have been proposed for calculating pairwise distances from the leaf indices of a Random Forest, the original being simply the proportion of trees in which points do not coterminate, regardless of in-bag or out-of-bag status [Breiman and Cutler]. While attractive in its simplicity, the distances recovered from a Random Forest with this method do not match up with its predictive performance. Instead, "geometry and accuracy preserving" (GAP) proximities can be computed by differentially weighting in-bag and out-of-bag observations when tallying their colocations to the model's leaves [Rhodes et al., 2023b]:
$$p^{\mathrm{GAP}}_{i,j} = \frac{1}{|S_i|} \sum_{t \in S_i} \frac{c_j(t)\, I[j \in J_i(t)]}{|M_i(t)|} \qquad (1)$$

where $S_i$ is the set of trees in the RF for which observation $i$ is out of bag, $M_i(t)$ is the multiset¹ of bagged points in the same leaf as $i$ in tree $t$, $J_i(t)$ is the corresponding set (i.e., without repetitions) of bagged points in the same leaf as $i$ in tree $t$, and $c_j(t)$ is the multiplicity of the index $j$ in the bootstrap sample.
From GAP proximities, we can then compute symmetrized GAP distances as such:

$$d^{\mathrm{GAP}}_{i,j} = \begin{cases} 0, & \text{if } i = j \\ \left(0.5 \cdot \left(p^{\mathrm{GAP}}_{i,j} + p^{\mathrm{GAP}}_{j,i}\right)\right)^{-1}, & \text{otherwise} \end{cases} \qquad (2)$$
With GAP distances, it is possible to exactly reconstruct the predictions of the Random Forest as weighted averages of training target labels. This makes it a particularly natural, effective, and interpretable notion of distance from Random Forest representations (as given by leaf indices). While other more complex distance calculations have been developed, such as those considering not only leaf colocations but wider notions of path similarity, these have been found to have little impact on anomaly detection performance for ExtraTrees models [Mensi et al., 2022]. As such, we use the GAP method for calculating RF distances.
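Equations (1) and (2) can be implemented compactly if the bootstrap sample of each tree is known. The sketch below is our own simplified construction, not the authors' code: it draws bootstrap samples explicitly and fits one decision tree per sample, so that the in-bag multiplicities c_j(t) are available when tallying leaf colocations; pairs of points that never share a leaf receive infinite distance in this simplified version.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gap_proximities(X, y, n_trees=100, random_state=0):
    """Sketch of RF-GAP proximities (Eq. 1) from manually bootstrapped trees."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    prox = np.zeros((n, n))
    n_oob_trees = np.zeros(n)                            # |S_i|: trees where i is out of bag
    for t in range(n_trees):
        boot = rng.integers(0, n, size=n)                # bootstrap sample for tree t
        counts = np.bincount(boot, minlength=n)          # c_j(t): in-bag multiplicities
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=t).fit(X[boot], y[boot])
        leaves = tree.apply(X)                           # leaf index of every observation in tree t
        oob = counts == 0                                # out-of-bag observations for this tree
        n_oob_trees += oob
        for i in np.where(oob)[0]:
            in_leaf = (leaves == leaves[i]) & ~oob       # bagged points sharing i's leaf (J_i(t))
            m_size = counts[in_leaf].sum()               # |M_i(t)|: leaf multiset size
            if m_size > 0:
                prox[i, in_leaf] += counts[in_leaf] / m_size
    return prox / np.maximum(n_oob_trees, 1)[:, None]    # divide each row by |S_i|

def gap_distances(prox):
    """Symmetrized GAP distances (Eq. 2); zero-proximity pairs become infinitely distant."""
    sym = 0.5 * (prox + prox.T)
    with np.errstate(divide="ignore"):
        dist = 1.0 / sym
    np.fill_diagonal(dist, 0.0)
    return dist
```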
2.3 Computing outlier scores from distances
There are many possible ways to compute outlier scores from inter-point distances². Common approaches include variants of kNN, such as the distance to the k-th neighbor of each point, or the average distance to the k nearest neighbors [Ramaswamy et al., 2000]. While often performing well, such methods require setting the hyperparameter k. Our tests showed that the performance of these methods on Random Forest distances is highly sensitive to the choice of k, and that the optimal value of k varies significantly between datasets.
A method that requires no hyperparameter tuning is the use of the median distance to all other points as a measure of centrality. One drawback of this method is its sensitivity to the presence and proportion of outliers in the dataset. To mitigate this, we propose to first identify the most central 50% of the data, and then only consider distances to these central points. Given a dataset $X = \{x_1, x_2, \ldots, x_n\}$ of $n$ points, we find the subset of observations with the lowest median GAP distances to all other observations:
¹ Recall that a multiset is a generalization of the concept of a set, allowing for repetition among the elements of the set, where the number of repetitions of a unique element in the multiset is known as its multiplicity.
² Besides distance-based anomaly detectors, any other method can be applied (e.g. density-based detectors) after first embedding distances in Euclidean space. This may be done with Principal Coordinate Analysis [Gower, 1966] (metric multidimensional scaling), or other embedding methods.
$$X_{\mathrm{central}} = \left\{ x_i \in X : \mathrm{Rank}\!\left(d^{\mathrm{GAP}}_{\mathrm{med}}(x_i)\right) \leq \lfloor 0.5\,n \rfloor \right\}$$

where $d^{\mathrm{GAP}}_{\mathrm{med}}(x_i)$ is the median GAP Random Forest distance from observation $x_i$ to all other observations. For each observation we then compute its outlier score, $O(x_i)$, as the median GAP distance to observations in $X_{\mathrm{central}}$:

$$O(x_i) = \mathrm{med}\left\{ d^{\mathrm{GAP}}_{i,j} : j \in X_{\mathrm{central}} \right\}$$
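A short sketch of this scoring rule (the function and variable names are ours), starting from a precomputed pairwise distance matrix:

```python
import numpy as np

def outlier_scores(dist):
    """Outlier scores from an (n, n) distance matrix via the median distance to the central half."""
    n = dist.shape[0]
    # Median distance of each point to all other points (ignore the zero self-distance)
    off_diag = np.where(np.eye(n, dtype=bool), np.nan, dist)
    med_all = np.nanmedian(off_diag, axis=1)
    # X_central: the 50% of observations with the lowest median distances
    central = np.argsort(med_all)[: n // 2]
    # O(x_i): median distance to the central observations
    return np.median(dist[:, central], axis=1)
```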
2.4 Evaluating the performance of anomaly detectors
Anomaly detection is a central field of research within applied statistics and machine learning, with various techniques
developed to identify unusual instances in data. For a comprehensive treatment of the evolution and methodologies
in anomaly detection, including robust statistical methods and modern machine learning approaches, readers are
encouraged to consult the meta-survey by Olteanu et al. [Olteanu et al., 2023]. Recent research has focused on deep
learning-based outlier detection techniques (e.g. DeepAnT framework for time-series data [Munir et al., 2019b]) and
various hybrid models that combine deep learning with traditional machine learning techniques to enhance anomaly
detection capabilities, e.g. [Munir et al., 2019a].
A notable contribution to this field includes the development of libraries such as PyOD (Python Outlier Detection)
[Zhao et al., 2019], an open-source Python library specifically designed for detecting outliers in multivariate data.
PyOD includes a comprehensive collection of anomaly detection algorithms (e.g. kNN, Isolation Forest, autoencoders
etc.). ADBench (Anomaly Detection Benchmark) [Han et al., 2022] is another significant contribution to the field of
anomaly detection. It provides a standardized set of benchmark datasets for evaluating the performance of different
anomaly detection algorithms. ADBench includes a diverse set of real-world datasets from numerous domains, each
with ground-truth labels of whether observations are outliers, facilitating the evaluation of anomaly detectors as binary
classifiers via methods such as the receiver operating characteristic (ROC) curve.
ADBench contains 47 datasets for benchmarking anomaly detection algorithms, gathered across diverse domains such
as healthcare, finance, sociology, image processing, and the physical sciences. The advantage of using ADBench lies
in immediate access to multiple, harmonized datasets, which have already been extensively tested and evaluated for
anomaly detection. However, it also presents drawbacks. Documentation for datasets can be cumbersome to locate,
such that it can be difficult to trace back to their original reference. More importantly, the datasets have already
been preprocessed, with numerical values being scaled and categorical features being encoded. Therefore, testing on
ADBench datasets is not completely representative of end-to-end anomaly detection in the real world. One of the key
benefits of the RF distance-based method described here is the lack of required preprocessing for diverse datasets,
particularly with respect to native handling of null values and outliers; this benefit will not be demonstrated by using
datasets that are already preprocessed.
As there is no universally applicable definition of what constitutes an anomaly, the labeling of ground truth anomalies
in datasets is somewhat subjective. In several of the ADBench datasets, the ‘anomaly’ subset of data actually represents
a distinct class within the dataset, with observations of this class being highly similar to one another. One example is
the optdigits dataset, consisting of 5,216 8-by-8 pixel images of hand-drawn digits. In this dataset, observations are
labeled as anomalous if they depict a zero, one of the 10 digits featured. We argue it does not make sense to consider
such a class as anomalous—although it is a small fraction of the dataset, such observations are to be expected, and are
in no sense ‘deviations’ from the majority of the data, which are themselves divided amongst the other nine categories
of digit. Similarly, other datasets may be strongly enriched for ‘anomalous’ observations, such as medical datasets
containing a proportion of pathological samples far higher than would be expected in the general population. While one
approach would be to train models only on inlier points (the approach taken in [Mensi et al., 2022]), such an approach
cannot be considered truly unsupervised, requiring anomaly labels for training. For such reasons, we excluded nine of
the 47 real-world datasets from experiments (see App. 6.1).
For all experiments, we randomly sampled 1,000 observations from each dataset five times, preserving the balance of inliers and outliers. Datasets with fewer than 1,000 observations were excluded, leaving 26 remaining datasets. Smaller datasets pose problems for evaluating anomaly detection performance, as metrics such as AUCROC can be overly optimistic for limited test set sizes [Novello et al., 2024]. When evaluated on datasets excluded from the main analysis, we found no significant differences in the performance of different anomaly detection algorithms, with the exception of OCSVM performing worse than kNN (p = 0.0218).
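A minimal sketch of this evaluation protocol, assuming stratified subsampling that approximately preserves the inlier/outlier balance followed by AUCROC scoring; the helper name and defaults are ours:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def stratified_subsample(y, size=1000, seed=0):
    """Sample `size` indices, approximately preserving the inlier/outlier balance of labels y."""
    rng = np.random.default_rng(seed)
    idx = []
    for label in np.unique(y):
        members = np.where(y == label)[0]
        k = int(round(size * len(members) / len(y)))
        idx.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(idx)

# Example: evaluate a detector's outlier scores as a binary classifier on one subsample
# idx = stratified_subsample(y_true); print(roc_auc_score(y_true[idx], scores[idx]))
```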
3 Results
3.1 Visualizing Unsupervised Random Forest Transformations
As there are multiple ways of training a Random Forest for unsupervised learning on a dataset, it is of value to understand what different approaches achieve, and how they transform the data from measurement space to a fitted model's representation. For the purposes of visualization, we explored this with simulated data, drawn from a two-dimensional Gaussian distribution, with points in the 90th percentile of distances from the distribution center designated as outliers (Fig. 1.a.i). A distance matrix showing inter-point Euclidean distances is shown in Figure 1.a.ii, with points sorted by their distance from the origin. We then fitted two Random Forest models, one tasked with discriminating between the real data and synthetic data drawn from a uniform distribution over its bounds (RFuni), and one using a completely randomized splitting function (ExtraTrees). From each fitted model, we then computed GAP distances (Fig. 1.b).
Of note, the RFuni model significantly reshapes the inter-point distance relationships in the data, giving GAP distances with a Spearman rank correlation of ρ = 0.77 to the original Euclidean distances in measurement space (Fig. 1.c.i). The ExtraTrees model, on the other hand, largely preserves the distance relationships of the data (ρ = 0.95, Fig. 1.c.ii). Histograms of inter-point distances reveal that the RFuni model tends to exaggerate distances for outlier points, while the distances for inlier points tend to decrease, with respect to the ExtraTrees model (Fig. 1.d). This increasing isolation of outlier points suggests the approach may be useful for outlier detection.
To visualize these transformations, we embedded GAP distances with multidimensional scaling (MDS) [Gower, 1966], which attempts to accommodate inter-point distances in a low-dimensional embedding with minimal distortion ('stress'). While the ExtraTrees distances could be accommodated with little stress in two dimensions, reflecting the original measurement space of the data, RFuni distances could not (Figure 1.e).
Visualizing three-dimensional MDS embeddings, it can be seen that RFuni applies a significant, anisometric reshaping of the data, with innermost points being pulled closer together and outermost points being pushed away through a higher dimension from the more central data (Figure 1.f). This is in contrast to the ExtraTrees model, which approximately retains the two-dimensional input space distances of the data, albeit with some curvature (Figure 1.b.ii).
Two-dimensional embedding with MDS shows again how the RFuni model pushes outlier points to the outer reaches of the mapping while drawing inliers closer together (Figure 1.g.i). The ExtraTrees model, on the other hand, does not distort the mapping of points from their measurement space (Figure 1.g.ii).
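For readers who wish to reproduce this kind of visualization, a brief sketch using scikit-learn's metric MDS on a precomputed (and finite) distance matrix; the toy matrix below is only a stand-in for the GAP distances of Section 2.2, and the reported stress corresponds to the distortion discussed above:

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for a GAP distance matrix; replace with the (n, n) matrix from Section 2.2
X_toy = np.random.default_rng(0).normal(size=(200, 2))
dist = squareform(pdist(X_toy))

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
embedding = mds.fit_transform(dist)   # (n, 2) coordinates for plotting
print("stress:", mds.stress_)         # residual distortion of the embedding
```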
3.2 Evaluating Anomaly Detection Performance
The tendency of RFuni to exaggerate the isolation of outlier points makes it an attractive candidate for use in anomaly detection. To test this, we first directly compared the performance of detectors using either RFuni or ExtraTrees distances, aggregated across the 26 ADBench datasets we included in our analysis. We found a significant improvement in anomaly detection performance when using RFuni distances, when evaluated according to the ranking of the detector (measured against other unsupervised detectors³), the AUCROC score, and the detector's AUCROC score expressed as a percentage of the maximum AUCROC score achieved by the best-performing detector for each dataset (Figure 2.a).
We then compared the performance of RFuni against other popular unsupervised anomaly detectors, using the ADBench benchmark datasets (Fig. 2.b). A boxplot of ranked accuracy shows little difference in the performance of detectors aggregated across all datasets, with the exception of RFuni, which has a median performance ranking of three out of 15 (Fig. 2.b.i). This is reflected in the findings of a Conover post-hoc test, which shows that while most unsupervised detector comparisons exhibit no statistically significant differences in performance, RFuni is significantly better than half of the other detectors (p < 0.1, adjusted for multiple comparisons via the Holm–Bonferroni method) (Fig. 2.b.ii). These comparisons are also represented in a critical difference diagram [Demšar, 2006] (Fig. 2.b.iii).
To confirm that the elevated performance of RFuni could not be attributed to our method of converting inter-point distances to outlier scores, we also applied this method on Euclidean distances. As expected, this detector performed similarly to the ExtraTrees detector, confirming that the RFuni data transformation accounted for its superior anomaly detection performance over the benchmark datasets.
³ We evaluated the performance of the following unsupervised detectors: OCSVM [Schölkopf et al., 1999], KNN [Ramaswamy et al., 2000], LOF [Breunig et al., 2000], COF [Tang et al., 2002], PCA [Shyu et al., 2003], CBLOF [He et al., 2003], IForest [Liu et al., 2008], KDE [Latecki et al., 2007], SOD [Kriegel et al., 2009], HBOS [Goldstein and Dengel, 2012], LODA [Pevný, 2016], COPOD [Li et al., 2020], and ECOD [Li et al., 2023].
Figure 1: Embedding two-dimensional Gaussian data with unsupervised RF distances. A) i) Data simulated from a two-dimensional Gaussian distribution. Points with distances from the origin above the 90th percentile are colored red (outliers). ii) A matrix of Euclidean distances, sorted by each point's distance from the origin. B) Distance matrices for Random Forest GAP distances obtained from the (i) RFuni and (ii) ExtraTrees models. C) Spearman rank correlation, ρ, between distances in measurement space (Euclidean) and RF distances. D) Histograms of inter-point RF distances for inliers (blue) and outliers (red). E) Stress plot for multidimensional scaling (MDS) of RF distances. F) MDS embedding in three dimensions for (i) RFuni and (ii) ExtraTrees RF distances. G) MDS embedding in two dimensions for RF distances.
3.3 Anomaly Detection with Missing Values
Across many sectors and industries, datasets often contain a significant proportion of missing values. To evaluate the stability of unsupervised Random Forest anomaly detection on missing data, we tested its performance on benchmark datasets for varying levels of missingness in the data (missing completely at random, MCAR), and compared its stability over increasing levels of missingness to that of other detectors. For some datasets, RFuni maintains a superior performance without prior imputation of missing data (e.g. the Campaign dataset, Fig. 3.a). However, we find that in general its performance is improved by first imputing missing values with the mean.
Aggregating results across all datasets, we find that RFuni maintains its competitive performance over other unsupervised detectors, up to missingness levels of 60% (Fig. 3.b).
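A small sketch of the missing-data protocol described here, assuming MCAR masking followed by mean imputation; the helper name and the stand-in data are ours:

```python
import numpy as np
from sklearn.impute import SimpleImputer

def mask_mcar(X, fraction, seed=0):
    """Set a given fraction of entries to NaN, missing completely at random (MCAR)."""
    rng = np.random.default_rng(seed)
    X_missing = X.astype(float).copy()
    X_missing[rng.random(X.shape) < fraction] = np.nan
    return X_missing

X = np.random.default_rng(1).normal(size=(1000, 5))                  # stand-in for a benchmark dataset
X_missing = mask_mcar(X, fraction=0.4)                               # 40% of entries missing
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)  # mean imputation
```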
3.4 Explainability
While there are several model-agnostic explainability frameworks that can be leveraged for anomaly detectors (e.g.
SHAP [Lundberg and Lee, 2017], LIME [Ribeiro et al., 2016]), here we explore a method that capitalizes on the
explainable properties of Random Forest estimators. Our method is inspired by the popular approach of constructing
saliency maps to visualize feature importance for image class prediction models [Simonyan et al., 2013]. While saliency
maps visualize input features that impact model performance as a function of neural network gradients, here we develop
an approach for counterfactual explanations, i.e. minimal differences in inputs that impact model predictions [Guidotti,
2024].
Given a dataset, $X$, the RFuni outlier score of each observation is given by $O(x_i)$. For each point $x_i$, we want to find $\nabla O(x_i)$, the gradient of $O$ at $x_i$. We first compute local gradients of the outlier score in measurement space, solving
Figure 2: Anomaly detection performance on benchmark datasets. A) Direct comparison between anomaly detectors using distances from either the ExtraTrees or RFuni Random Forest model, aggregated across benchmark datasets. Comparisons of i) ranking amongst other unsupervised detectors, ii) AUCROC score, and iii) percent of AUCROC score achieved by the best performing detector for each dataset. p-values indicate the result of a Wilcoxon signed-rank test. B) Results for all unsupervised anomaly detectors. i) Boxplots of ranked performance for each detector, sorted left to right by superior performance. ii) Pairwise comparisons computed with a Conover post-hoc test, with p-values adjusted for multiple comparisons via the Holm–Bonferroni method. iii) A critical difference diagram connecting detectors without significant differences at α = 0.1. Dashed colored lines indicate no critical difference between specific detectors, despite critical differences between their intermediately ranked detectors.
Figure 3: Anomaly detection performance with missing data. A) The performance of RFuni and other top-performing unsupervised anomaly detectors with missing data on the Campaign dataset, evaluated with AUCROC. All detectors were tested on data imputed with the mean (dashed lines), while RFuni was also applied directly on missing data (solid line). Shaded area indicates standard deviation over 5 repeats. B) Aggregate performance across all datasets for varying levels of missingness, following mean imputation. AUCROC scores for each dataset are normalized to that of the best-performing detector on complete data.
the least squares solution of $\nabla O$ over neighborhoods of the dataset, with $N_k(x_i)$ being the set of $k$ nearest neighbors of point $x_i$:
$$A_i = \sum_{j \in N_k(x_i)} (x_j - x_i)(x_j - x_i)^{T}, \qquad b_i = \sum_{j \in N_k(x_i)} (x_j - x_i)\left(O(x_j) - O(x_i)\right), \qquad \nabla O(x_i) = A_i^{-1} b_i$$
Once the gradient field $\nabla O$ is computed, a trajectory can be charted given a learning rate, $l$:

$$p_i = x_i + l \cdot \nabla O(x_i), \qquad x_{i+1} = \underset{x \in X}{\arg\min} \, \lVert x - p_i \rVert_2$$
Such a trajectory can be interpreted as a sequence of counterfactual explanations, minimal changes in inputs that
result in the largest possible impact on the outlier score. Figure 4.a shows the outlier gradient field computed over the
two-dimensional toy dataset from Figure 1, with an example counterfactual trajectory. The same trajectory is also
shown in an MDS embedding of GAP distances obtained from the RFuni model (Fig. 4.b).
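A rough sketch of the local least-squares gradient estimate and the trajectory update defined above; the k-d tree neighbor lookup, the fixed k, and the negative learning rate (so that steps descend the outlier score) are our own choices, not specified by the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_gradient(X, O, i, k=10):
    """Least-squares estimate of the outlier-score gradient at X[i] from its k nearest neighbors."""
    tree = cKDTree(X)
    _, nbrs = tree.query(X[i], k=k + 1)        # includes the point itself
    nbrs = nbrs[1:]
    dX = X[nbrs] - X[i]                        # (k, d) displacement vectors
    dO = O[nbrs] - O[i]                        # (k,) outlier-score differences
    A = dX.T @ dX                              # A_i
    b = dX.T @ dO                              # b_i
    return np.linalg.lstsq(A, b, rcond=None)[0]  # grad O(x_i) = A_i^{-1} b_i (least squares)

def trajectory(X, O, start, steps=8, lr=-0.5):
    """Counterfactual trajectory: step along the gradient, then snap to the nearest data point."""
    path, i = [start], start
    for _ in range(steps):
        p = X[i] + lr * local_gradient(X, O, i)            # proposal point p_i
        i = int(np.argmin(np.linalg.norm(X - p, axis=1)))  # nearest observation to p_i
        path.append(i)
    return path
```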
We then compute the feature importance of each step along the counterfactual trajectory, with respect to the RFuni model. This is accomplished by identifying intersections made by the straight line segment connecting $x_i$ and $x_{i+1}$, through the hyperrectangular partitions of the Random Forest (described in App. 6.2). In this way, we are able to explain RFuni outlier scores with respect to an observation's feature values (Fig. 4.c).
Figure 4: Explainability of RFuni anomaly detector outlier scores. A) A trajectory through the outlier gradient field on two-dimensional Gaussian data. Gradients are shown in red arrows. B) The same trajectory visualized in an MDS embedding of GAP distances computed from the RFuni outlier detection model. C) Cumulative tally of which RFuni partitions are crossed along the trajectory.
As we cannot meaningfully interpret explanations on the ADBench datasets (due to both preprocessing of the data and required domain expertise), we turned to the MNIST dataset [Deng, 2012]. For our experiment, we took a sample of the dataset comprising 90% digit 9s (inliers) and 10% digit 4s (anomalies). On this dataset, RFuni achieved an AUCROC of 0.76 (for reference, Isolation Forest achieved 0.75). Figure 5.a shows a trajectory computed for a digit 4, plotted in a t-SNE embedding of the original data [van der Maaten and Hinton, 2008]. We also view this trajectory in an MDS embedding of GAP distances from the RFuni model (Fig. 5.b). The trajectory successfully identifies steps between similar observations in the MNIST dataset that reduce outlier score predictions (Fig. 5.c).
Figure 5.d visualizes the importance of each pixel to the RFuni model, as a function of how many Random Forest partitions target each pixel. Partitions crossed over the course of the counterfactual trajectory are then used to explain reductions in outlier score, either going directly from the first to the last observation (Fig. 5.e), or by integrating over the entire counterfactual trajectory (Fig. 5.f). Critically, these visualizations show not just differences in observations, but the salience of these differences to the RFuni anomaly detection model.
Figure 5: Explainable anomaly detection with the MNIST dataset. A) The trajectory from an anomaly (digit 4, red) to inliers (digit 9, blue) in the MNIST dataset, visualized in a t-SNE embedding of the input data. B) The same trajectory visualized in an MDS embedding of GAP distances computed from the RFuni outlier detection model. C) Outlier scores of each data point along the trajectory from outlier to inliers. The corresponding image for each data point is shown, and its ground-truth anomaly label indicated by marker color. D) Visualization of feature importance of each pixel for the RFuni model, as a tally of how many partitions target each pixel. E) Counterfactual explanation of an outlier, showing the number of RFuni model partitions crossed (due to either increases or decreases in pixel values) to move directly from the first image to the last image in the trajectory. Blue indicates pixels where decreases in value explain reductions in outlier score, while red indicates where increases in the pixel value explain reductions in outlier score. F) Visualization of total RFuni partitions crossed, integrated over the counterfactual trajectory. Same color coding as in (E).
4 Discussion
The transformation that RFuni applies to data during similarity learning appears to be useful for anomaly detection. Compared to the ExtraTrees model, which we show approximates a mapping of Euclidean distances, RFuni exhibits improved performance for the vast majority of benchmark datasets evaluated here. RFuni also compares very favorably to other unsupervised detectors such as KNN, Isolation Forest, local outlier factor, and PCA (Fig. 2). The more popular approach of training an unsupervised Random Forest by generating synthetic data with the same marginal distribution, lacking its joint distribution (such as by shuffling features independently), performed significantly worse than RFuni.
A notable property of supervised Random Forests is their insensitivity to coordinate transformations, monotone transformations of ordered features in the data (p. 57, [Breiman et al., 1984]). This property is preserved for unsupervised learning in the Addcl1 case, where synthetic data are generated with the same marginal distribution as the real data, but lacking any joint distribution. This is because coordinate transformations applied to real data will be carried over and reflected in the synthetic data. Reducing feature values to effective rankings in this manner may be deleterious in many anomaly detection applications. The RFuni model uses the Addcl2 method for synthetic data generation, where the Random Forest is trained to discriminate between real data and synthetic data generated from a uniform distribution over the real data bounds. The resultant model is therefore sensitive to the relative values of features, not just their order, as the volume of synthetic data generated between real observations will be proportional to the difference in their relative values. While this is likely to be advantageous in many applications, some applications may benefit from pre-processing data with coordinate transformations, as shown to be the case for other unsupervised anomaly detection methods [Kasieczka et al., 2023].
Another attractive feature of RFuni is that, using a Random Forest, it can readily be applied to datasets of different modalities. This may include features of the measurement space, but also (and in tandem) engineered features, such as those from a convolutional neural network (CNN) for image datasets. For example, while RFuni achieved an AUCROC of 0.76 on the 4s and 9s subset of MNIST, using the bottleneck features from a CNN autoencoder improved performance to an AUCROC of 0.86⁴.
In critical applications, model explainability is of particular value. Explainability improves transparency and trustworthiness, can safeguard against unfair and discriminatory processes, and provides actionable insights for users. RFuni, by virtue of using an ensemble of decision trees, occupies a sweet spot between more expressive models such as neural networks, and more interpretable models such as multivariate regression. As we show here, feature importance can be read out both in terms of overall model functioning (Fig. 5.d) and to make sense of counterfactual explanations (Fig. 5.e,f).
A limitation of the RFuni model is that contamination of the training set with anomalies can significantly reduce performance, particularly if anomalies conform to a self-similar, class-like distribution. However, this is a limitation for all unsupervised anomaly detection methods, and the reason we restrict experiments to datasets within ADBench where unsupervised detectors are appropriate. Although we do not explore it here, RFuni could be sensitive to outliers, as extreme values will determine the bounds over which the synthetic data is uniformly generated. In such cases, the majority of synthetic data may be situated far away from the real data, reducing the resolution with which RFuni can define the data manifold. While this can be mitigated by increasing the volume of synthetic data generated for training, a more robust approach would be to use not the bounds of the real data, but rather upper and lower percentiles.
5 Conclusion
We describe an approach for anomaly detection using an unsupervised Random Forest for similarity learning. While
similarity learning with unsupervised Random Forests is not new, here we focus on using a uniform distribution
to generate the synthetic data that must be discriminated from the real data during training. Through analysis and
visualizations we show that the resultant model anisometrically reshapes the data from measurement space, by expanding
inter-point distances at the boundary of the data manifold. We show that this learned representation, and the exaggerated
isolation of outliers in it, is particularly useful for anomaly detection, evaluated over a large collection of benchmark
datasets. Finally, we demonstrate how predictions of this model can be rendered locally explainable, by interpreting
counterfactuals through the lens of Random Forest feature importance.
6 Appendix
6.1 ADBench datasets excluded from analysis
Datasets were excluded due to insufficient size (# Samples), over-represented anomalies (% Anomalies), or a dataset-specific reason (Dataset)⁵.
6.2 Identifying Random Forest partition intersections for counterfactual trajectories
To interpret counterfactual trajectory explanations in relation to RFuni feature importance, we identify which partitions of the Random Forest are intersected by the line segment connecting points in the trajectory. This is solved by an extension of the Liang–Barsky line clipping algorithm to higher dimensions [Liang and Barsky, 1984].

For points $x_i$ and $x_{i+1}$ in the counterfactual trajectory, we first parameterize the line segment connecting them as $L$:

$$L = \left\{ x(t) = x_i + t\,(x_{i+1} - x_i) \mid t \in [0,1] \right\} \subset \mathbb{R}^n$$
Considering Random Forest partitions as a set of hyperrectangles, $R = \{R_1, R_2, \ldots, R_m\}$, we want to find $R^{*}$, the subset of intersected partitions:

$$R^{*} = \left\{ R_j \in R \mid L \cap R_j \neq \emptyset \right\}$$
⁴ For this experiment we trained a two-layer Keras CNN autoencoder [Chollet et al., 2015]. The encoder consisted of a Conv2D layer with 32 3x3 pixel filters, a 2x2 pixel MaxPooling2D layer, a second Conv2D layer with 32 3x3 pixel filters, and a second 2x2 pixel MaxPooling2D layer. The decoder consisted of two Conv2DTranspose layers, each with 32 3x3 pixel filters and a stride of two, and a Conv2D layer with a single filter with a 3x3 pixel convolution window.
⁵ Optdigits: Images of the digit zero vs. nine other digits, which would not be naturally considered as 'inliers'. ALOI: Almost all unsupervised detectors performed below chance on this dataset, despite good performance of unsupervised detectors in prior literature [Kriegel et al.].
Dataset           # Samples  # Features  % Anomalies  COPOD  IForest  KNN   LOF   PCA   RFuni
ALOI              49534      27          3.04         0.50   0.49     0.50  0.58  0.51  0.51
Breastw           683        9           34.99        0.99   0.99     0.98  0.38  0.96  0.99
Cardiotocography  2114       21          22.04        0.67   0.71     0.61  0.62  0.75  0.69
Fault             1941       27          34.67        0.45   0.57     0.73  0.60  0.48  0.46
Glass             214        7           4.21         0.76   0.79     0.86  0.81  0.69  0.80
Hepatitis         80         19          16.25        0.80   0.70     0.55  0.59  0.75  0.80
InternetAds       1966       1555        18.72        0.67   0.68     0.69  0.65  0.62  0.56
Ionosphere        351        32          35.90        0.79   0.85     0.93  0.86  0.78  0.84
Landsat           6435       36          20.71        0.43   0.49     0.62  0.54  0.38  0.56
Lymphography      148        18          4.05         1.00   1.00     1.00  0.98  1.00  0.98
Magic.gamma       19020      10          35.16        0.67   0.72     0.77  0.74  0.66  0.69
Optdigits         5216       64          2.88         0.70   0.65     0.50  0.39  0.52  0.45
Pima              768        8           34.90        0.65   0.67     0.62  0.54  0.63  0.66
Skin              245057     3           20.75        0.50   0.67     0.72  0.39  0.44  0.35
SpamBase          4207       57          39.91        0.70   0.63     0.73  0.57  0.55  0.49
Stamps            340        9           9.12         0.93   0.88     0.82  0.69  0.90  0.87
Vertebral         240        6           12.50        0.33   0.36     0.33  0.49  0.38  0.29
WBC               223        9           4.48         0.99   1.00     0.99  0.83  0.99  0.99
WDBC              367        30          2.72         0.99   0.99     1.00  1.00  0.99  0.98
Wine              129        13          7.75         0.87   0.78     1.00  1.00  0.82  0.93
WPBC              198        33          23.74        0.52   0.49     0.52  0.52  0.48  0.57
Yeast             1484       8           34.16        0.38   0.40     0.40  0.46  0.42  0.39

Table 1: Anomaly detection performance for ADBench datasets excluded from analysis. AUCROC for top performing unsupervised detectors. Bold values show dominance over other unsupervised detectors.
Each partition hyperrectangle is defined by:

$$R_j = \left\{ x \in X \mid \forall k \in \{1, \ldots, n\} : a^{k}_{j} \leq x^{k} \leq b^{k}_{j} \right\}$$

where $a^{k}_{j}$ and $b^{k}_{j}$ are the $k$-th coordinates of the minimum and maximum corners of $R_j$, respectively, for each dimension $k$ of $X \subset \mathbb{R}^n$. We can immediately filter for hyperrectangles whose partition feature value, $a^{f} = b^{f}$, is crossed by $L$:

$$R^{*} = \left\{ R_j \in R \mid a^{f}_{j} = b^{f}_{j}, \;\; \min(x^{f}_{i}, x^{f}_{i+1}) \leq a^{f}_{j} \leq \max(x^{f}_{i}, x^{f}_{i+1}) \right\}$$
For remaining cases, we find the values of $t$ where the line segment would intersect the hyperrectangle bounds:

$$t^{k}_{a} = \frac{a^{k}_{j} - x^{k}_{i}}{x^{k}_{i+1} - x^{k}_{i}}, \qquad t^{k}_{b} = \frac{b^{k}_{j} - x^{k}_{i}}{x^{k}_{i+1} - x^{k}_{i}}$$
The intervals of $t$ where the segment lies within the bounds for each dimension $k$ are given by:

$$t^{k}_{\min} = \min(t^{k}_{a}, t^{k}_{b}), \qquad t^{k}_{\max} = \max(t^{k}_{a}, t^{k}_{b})$$
A partition intersection occurs if and only if the maximal value of $t_{\min}$ is less than or equal to the minimal value of $t_{\max}$ over all dimensions, and these values fall within $[0,1]$:

$$L \cap R_j \neq \emptyset \iff \max_{k=1}^{n}\left(t^{k}_{\min}\right) \leq \min_{k=1}^{n}\left(t^{k}_{\max}\right) \;\wedge\; \left[\max_{k=1}^{n}\left(t^{k}_{\min}\right), \min_{k=1}^{n}\left(t^{k}_{\max}\right)\right] \cap [0,1] \neq \emptyset$$
The importance of each feature for a counterfactual is then given as $|R^{*}_{k}|$, the cardinality of the subset of hyperrectangles that take a single value for that dimension:

$$R^{*}_{k} = \left\{ R_j \in R^{*} \mid a^{k}_{j} = b^{k}_{j} \right\}$$
Feature importance can be further broken down with respect to the partition criteria. For numerical features, this can
simply be a readout of whether partitions are intersected due to an increase or decrease in that feature, as shown in
Figure 5.e-f.
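As a concrete rendering of the procedure above, a short sketch (our own naming and array conventions) of the extended Liang–Barsky test for a segment against a set of axis-aligned hyperrectangles, followed by the per-feature tally used for feature importance. We assume the partitions are supplied as two (m, n) arrays of lower and upper corners, with degenerate partitions having equal corners in their split dimension:

```python
import numpy as np

def segment_intersects_boxes(x_start, x_end, lows, highs, eps=1e-12):
    """Boolean mask of hyperrectangles [lows, highs] intersected by the segment x_start -> x_end."""
    d = x_end - x_start                                 # segment direction
    d_safe = np.where(np.abs(d) < eps, eps, d)          # avoid division by zero on axis-parallel segments
    t_a = (lows - x_start) / d_safe                     # t at the lower bound of each dimension
    t_b = (highs - x_start) / d_safe                    # t at the upper bound of each dimension
    t_min = np.minimum(t_a, t_b).max(axis=1)            # latest entry across dimensions
    t_max = np.maximum(t_a, t_b).min(axis=1)            # earliest exit across dimensions
    # Intersection if the entry precedes the exit and the [t_min, t_max] window overlaps [0, 1]
    return (t_min <= t_max) & (t_max >= 0.0) & (t_min <= 1.0)

def feature_crossing_counts(x_start, x_end, lows, highs):
    """Per-feature count of intersected degenerate partitions (lows == highs), i.e. |R*_k|."""
    hit = segment_intersects_boxes(x_start, x_end, lows, highs)
    degenerate = lows == highs                          # partitions fixing a single value in a dimension
    return degenerate[hit].sum(axis=0)
```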
References
L. Auret and C. Aldrich. Unsupervised process fault detection with random forests. Industrial & Engineering Chemistry
Research, 49(19):9184–9194, 10 2010. doi:10.1021/ie901975c.
D. Baron and D. Poznanski. The weirdest sdss galaxies: results from an outlier detection algorithm. Monthly Notices of
the Royal Astronomical Society, 465(4):4530–4555, 11 2016. ISSN 0035-8711. doi:10.1093/mnras/stw3021.
L. Breiman and A. Cutler. "random forests".
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_
home.htm#prox. Accessed: 2024-05-27.
L. Breiman and A. Cutler. Random forests.
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_
home.htm, 2001.
L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. Taylor & Francis, 1984. ISBN
9780412048418.
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Lof: identifying density-based local outliers. SIGMOD Rec., 29
(2):93–104, May 2000. ISSN 0163-5808. doi:10.1145/335191.335388.
F. Chollet et al. Keras. https://keras.io, 2015.
J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine learning research, 7
(1):1–30, 2006.
L. Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal
Processing Magazine, 29(6):141–142, 2012. doi:10.1109/MSP.2012.2211477.
P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
doi:10.1007/s10994-006-6226-1.
M. Goldstein and A. Dengel. Histogram-based outlier score (hbos): A fast unsupervised anomaly detection algorithm.
09 2012.
J. C. Gower. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53
(3-4):325–338, 12 1966. ISSN 0006-3444. doi:10.1093/biomet/53.3-4.325.
R. Guidotti. Counterfactual explanations and how to find them: literature review and benchmarking. Data Mining and
Knowledge Discovery, 38(5):2770–2824, 2024. doi:10.1007/s10618-022-00831-6.
S. Han, X. Hu, H. Huang, M. Jiang, and Y. Zhao. ADBench: Anomaly detection benchmark. Advances in Neural Information Processing Systems (NeurIPS 2022), 35:32142–32159, 2022.
Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9):1641–1650,
2003. ISSN 0167-8655. doi:https://doi.org/10.1016/S0167-8655(03)00003-5.
G. Kasieczka, R. Mastandrea, V. Mikuni, B. Nachman, M. Pettee, and D. Shih. Anomaly detection under coordinate
transformations. Phys. Rev. D, 107:015009, Jan 2023. doi:10.1103/PhysRevD.107.015009.
H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and Unifying Outlier Scores, pages 13–24.
doi:10.1137/1.9781611972818.2.
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Outlier detection in axis-parallel subspaces of high dimensional
data. In T. Theeramunkong, B. Kijsirikul, N. Cercone, and T.-B. Ho, editors, Advances in Knowledge Discovery and
Data Mining, pages 831–838, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. ISBN 978-3-642-01307-2.
L. J. Latecki, A. Lazarevic, and D. Pokrajac. Outlier detection with kernel density functions. In P. Perner, editor,
Machine Learning and Data Mining in Pattern Recognition, pages 61–75, Berlin, Heidelberg, 2007. Springer Berlin
Heidelberg. ISBN 978-3-540-73499-4.
Z. Li, Y. Zhao, N. Botta, C. Ionescu, and X. Hu. COPOD: Copula-Based Outlier Detection . In 2020 IEEE International
Conference on Data Mining (ICDM), pages 1118–1123, Los Alamitos, CA, USA, Nov. 2020. IEEE Computer Society.
doi:10.1109/ICDM50108.2020.00135.
Z. Li, Y. Zhao, X. Hu, N. Botta, C. Ionescu, and G. H. Chen. Ecod: Unsupervised outlier detection using empirical
cumulative distribution functions. IEEE Transactions on Knowledge and Data Engineering, 35(12):12181–12193,
2023. doi:10.1109/TKDE.2022.3159580.
Y.-D. Liang and B. A. Barsky. A new concept and method for line clipping. ACM Trans. Graph., 3(1):1–22, Jan. 1984.
ISSN 0730-0301. doi:10.1145/357332.357333.
F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining,
pages 413–422, 2008. doi:10.1109/ICDM.2008.17.
S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 30. Curran Associates, Inc., 2017.
M. Madhyastha, P. Li, J. Browne, V. Strnadova-Neeley, C. E. Priebe, R. Burns, and J. T. Vogelstein. Geodesic learning
via unsupervised decision forests, 2019.
A. Mensi, F. Cicalese, and M. Bicego. Using Random Forest Distances for Outlier Detection, pages 75–86. 05 2022.
ISBN 978-3-031-06432-6. doi:10.1007/978-3-031-06433-3_7.
M. Munir, S. A. Siddiqui, M. A. Chattha, A. Dengel, and S. Ahmed. Fusead: Unsupervised anomaly detection in
streaming sensors data by fusing statistical and deep learning models. Sensors, (11):1991–2005, 2019a.
M. Munir, S. A. Siddiqui, A. Dengel, and S. Ahmed. Deepant: A deep learning approach for unsupervised anomaly
detection in time series. IEEE Access, 7:1991–2005, 2019b. doi:10.1109/ACCESS.2018.2886457.
P. Novello, J. Dalmau, and L. Andeol. Out-of-distribution detection should use conformal prediction (and vice-versa?),
2024.
M. Olteanu, F. Rossi, and F. Yger. Meta-survey on outlier and anomaly detection. Neurocomputing, 555, 2023.
T. Pevný. Loda: Lightweight on-line detector of anomalies. Machine Learning, 102(2):275–304, 2016.
doi:10.1007/s10994-015-5521-0.
L. Puggini, J. Doyle, and S. McLoone. Fault detection using random forest similarity distance. IFAC-PapersOnLine, 48
(21):583–588, 2015. ISSN 2405-8963. doi:https://doi.org/10.1016/j.ifacol.2015.09.589. 9th IFAC Symposium on
Fault Detection, Supervision and Safety for Technical Processes SAFEPROCESS 2015.
S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. SIGMOD Rec.,
29(2):427–438, May 2000. ISSN 0163-5808. doi:10.1145/335191.335437.
J. S. Rhodes, A. Cutler, and K. R. Moon. Geometry- and accuracy-preserving random forest proximities. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 45(9):10947–10959, 2023a.
J. S. Rhodes, A. Cutler, and K. R. Moon. Geometry- and accuracy-preserving random forest proximities, 2023b.
M. T. Ribeiro, S. Singh, and C. Guestrin. "why should i trust you?": Explaining the predictions of any classifier. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD
’16, page 1135–1144, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450342322.
doi:10.1145/2939672.2939778.
S. Santini and R. Jain. Similarity measures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9):
871–883, 1999.
B. Schölkopf, R. C. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt. Support vector method for novelty detection.
In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT
Press, 1999.
D. B. Seligson, S. Horvath, T. Shi, H. Yu, S. Tze, M. Grunstein, and S. K. Kurdistani. Global histone modification
patterns predict risk of prostate cancer recurrence. Nature, 435(7046):1262–1266, 2005. doi:10.1038/nature03672.
T. Shi and S. Horvath. Unsupervised learning with random forest predictors. Journal of Computational and Graphical
Statistics, 15(1):118–138, 2006.
M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Chang. A novel anomaly detection scheme based on principal
component classifier. 01 2003.
K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification
models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
J. Tang, Z. Chen, A. W.-c. Fu, and D. W. Cheung. Enhancing effectiveness of outlier detections for low density patterns.
In M.-S. Chen, P. S. Yu, and B. Liu, editors, Advances in Knowledge Discovery and Data Mining, pages 535–548,
Berlin, Heidelberg, 2002. Springer Berlin Heidelberg. ISBN 978-3-540-47887-4.
A. Tversky. Features of similarity. Psychological Review, 84(4):327–352, 1977.
L. van der Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):
2579–2605, 2008.
Y. Zhao, Z. Nasrullah, and Z. Li. PyOD: A python toolbox for scalable outlier detection. Journal of Machine Learning Research, pages 1–7, 2019.
Article
Full-text available
PyOD is an open-source Python toolbox for performing scalable outlier detection on multivariate data. Uniquely, it provides access to a wide range of outlier detection algorithms, including established outlier ensembles and more recent neural network-based approaches, under a single, well-documented API designed for use by both practitioners and researchers. With robustness and scalability in mind, best practices such as unit testing, continuous integration, code coverage, maintainability checks, interactive examples and parallelization are emphasized as core components in the toolbox's development. PyOD is compatible with both Python 2 and 3 and can be installed through Python Package Index (PyPI) or https://github.com/yzhao062/pyod.
Article
Full-text available
The need for robust unsupervised anomaly detection in streaming data is increasing rapidly in the current era of smart devices, where enormous data are gathered from numerous sensors. These sensors record the internal state of a machine, the external environment, and the interaction of machines with other machines and humans. It is of prime importance to leverage this information in order to minimize downtime of machines, or even avoid downtime completely by constant monitoring. Since each device generates a different type of streaming data, it is normally the case that a specific kind of anomaly detection technique performs better than the others depending on the data type. For some types of data and use-cases, statistical anomaly detection techniques work better, whereas for others, deep learning-based techniques are preferred. In this paper, we present a novel anomaly detection technique, FuseAD, which takes advantage of both statistical and deep-learning-based approaches by fusing them together in a residual fashion. The obtained results show an increase in area under the curve (AUC) as compared to state-of-the-art anomaly detection methods when FuseAD is tested on a publicly available dataset (Yahoo Webscope benchmark). The obtained results advocate that this fusion-based technique can obtain the best of both worlds by combining their strengths and complementing their weaknesses. We also perform an ablation study to quantify the contribution of the individual components in FuseAD, i.e., the statistical ARIMA model as well as the deep-learning-based convolutional neural network (CNN) model.
Article
Full-text available
Traditional distance and density based anomaly detection techniques are unable to detect periodic and seasonality related point anomalies which occur commonly in streaming data, leaving a big gap in time series anomaly detection in the current era of IoT. To address this problem, we present a novel deep learning based anomaly detection approach (DeepAnT) for time series data which is equally applicable to the non-streaming cases. DeepAnT is capable of detecting a wide range of anomalies i.e. point anomalies, contextual anomalies, and discords in time series data. In contrast to the anomaly detection methods where anomalies are learned, DeepAnT uses unlabeled data to capture and learn the data distribution, that is used to forecast the normal behavior of a time series. DeepAnT consists of two modules: Time Series Predictor and Anomaly Detector. The Time Series Predictor module uses Deep Convolutional Neural Network (CNN) to predict the next time stamp on the defined horizon. This module takes a window of time series (used as a context) and attempts to predict the next time stamp. The predicted value is then passed to the Anomaly Detector module, which is responsible for tagging the corresponding time stamp as normal or abnormal. DeepAnT can be trained even without removing the anomalies from the given data set. Generally, in deep learning based approaches, a lot of data are required to train a model. Whereas in DeepAnT, a model can be trained on relatively small data set while achieving good generalization capabilities due to the effective parameter sharing of the CNN. As the anomaly detection in DeepAnT is unsupervised, it doesn’t rely on anomaly labels at the time of model generation. Therefore, this approach can be directly applied to real life scenarios where it is practically impossible to label a big stream of data coming from heterogeneous sensors comprising of both normal as well as anomalous points. We have performed a detailed evaluation of 15 algorithms on 10 anomaly detection benchmarks, which contain a total of 433 real and synthetic time series. Experiments show that DeepAnT outperforms the state-of-the-art anomaly detection methods in most of the cases, while performing on par with others.
Chapter
In recent years, a great variety of outlier detectors have been proposed in the literature, many of which are based on pairwise distances or derived concepts. However, in such methods, most of the efforts have been devoted to the outlier detection mechanisms, not paying attention to the distance measure – in most cases the basic Euclidean distance is used. Instead, in the clustering field, data-dependent measures have shown to be very useful, especially those based on Random Forests: actually, Random Forests are partitioners of the space able to naturally encode the relation between two objects. In the outlier detection field, these informative distances have received scarce attention. This manuscript is aimed at filling this gap, studying the suitability of these measures in the identification of outliers. In our scheme, we build an unsupervised Random Forest model, from which we extract pairwise distances; these distances are then input to an outlier detector. In particular, we study the impact of several Random Forest-based distances, including advanced and recent ones, on different outlier detectors. We evaluate thoroughly our methodology on nine benchmark datasets for outlier detection, focusing on different aspects of the pipeline, such as the parametrization of the forest, the type of distance-based outlier detector, and most importantly, the impact of the adopted distance.KeywordsOutlier detectionRandom forest distancesData-dependent distances