DIVERSE COUNTERFACTUAL EXPLANATIONS FOR ANOMALY
DETECTION IN TIME SERIES
Déborah Sulem
Department of Statistics
University of Oxford
deborah.sulem@stats.ox.ac.uk
Michele Donini
Amazon Web Services
donini@amazon.com
Muhammad Bilal Zafar
Amazon Web Services
zafamuh@amazon.com
François-Xavier Aubet
Amazon Web Services
aubetf@amazon.com
Jan Gasthaus
Amazon Web Services
gasthaus@amazon.com
Tim Januschowski
Zalando Research
tim.januschowski@zalando.de
Sanjiv Das
Amazon Web Services
Santa Clara University
sanjivda@amazon.com
Krishnaram Kenthapadi
Fiddler AI
krishnaram@fiddler.ai
Cédric Archambeau
Amazon Web Services
cedrica@amazon.com
March 22, 2022
ABSTRACT
Data-driven methods that detect anomalies in time series data are ubiquitous in practice, but they are in general unable to provide helpful explanations for the predictions they make. In this work we propose a model-agnostic algorithm that generates counterfactual ensemble explanations for time series anomaly detection models. Our method generates a set of diverse counterfactual examples, i.e., multiple perturbed versions of the original time series that are not considered anomalous by the detection model. Since the magnitude of the perturbations is limited, these counterfactuals represent an ensemble of inputs similar to the original time series that the model would deem normal. Our algorithm is applicable to any differentiable anomaly detection model. We investigate the value of our method on univariate and multivariate real-world datasets and two deep-learning-based anomaly detection models, under several explainability criteria previously proposed in other data domains, such as Validity, Plausibility, Closeness and Diversity. We show that our algorithm can produce ensembles of counterfactual examples that satisfy these criteria and, thanks to a novel type of visualization, can convey a richer interpretation of a model's internal mechanism than existing methods. Moreover, we design a sparse variant of our method to improve the interpretability of counterfactual explanations for high-dimensional time series anomalies. In this setting, our explanation is localized on only a few dimensions and can therefore be communicated more efficiently to the model's user.
1 Introduction
Anomaly detection in time series is a common data analysis task that can be defined as identifying outliers, i.e.,
observations that do not belong to a reference distribution and call for further investigation. For instance, anomaly
detection is leveraged to localize a defect in computing systems, disclose a fraud in financial transactions, or diagnose
a disease from health records [7]. However, taking appropriate actions in the presence of anomalies typically requires
an understanding of why the model marked the data as anomalous. Therefore, providing explanations for models that
detect anomalies can guide the decision process of their users and provide a justification for a chosen action. In spite
of its practical relevance, explaining anomalies detected by ML models is still an understudied problem, especially in the challenging setting of multivariate time series data.
More precisely, a model that detects anomalies classifies each timestamp of a time series as anomalous or not. Common
examples of anomalies are spike outliers, change-points or drifts [7]. However, in practice anomalies cannot be easily
categorised and require ad hoc analysis. Moreover, several state-of-the-art models involve complex deep learning
(DL) classifiers, such as LSTMs [21], RNNs [3] or TCNs [4, 10], whose internal mechanisms to detect anomalies are
opaque. This lack of transparency can prevent these models from being deployed in consequential contexts [8, 6]. To
improve their comprehensibility, prior work has proposed to include interpretable blocks in machine learning models
(e.g., attention mechanism in RNNs [8]) or design model-specific explainability methods (e.g., feature-importance
scores for Isolation Forests [9]).
In the larger spectrum of methods for explaining time series models, most existing ones consist of feature-saliency
estimation [12, 24], which attributes scores to features in terms of their relative contribution to the model’s prediction.
Although these techniques have provided valuable insight in image classification tasks [15], they often constitute only a weak form of explanation for anomalies in time series, since they essentially indicate that the features of the anomalous subsequence are salient. In Figure 1b, we show an example of applying the Dynamic Masks method [12] to explain an anomaly prediction score. The saliency scores output by this method are high on the timestamps containing the anomalous observations (highlighted in red in Figure 1a) and on a few timestamps preceding the anomaly. In particular, they do not explain why the model classified the input time series as anomalous (or not) and what it has learnt as the normal data distribution. In practice, a user of an anomaly detection model is interested in (a) knowing what can be changed in the input data to avoid encountering the anomaly again in the future (preferably with minimal cost), and (b) understanding the model's sensitivity to a particular anomaly.
With these properties in mind, we aim to generate counterfactual samples for explaining anomalies. Counterfactual
explanations have been previously proposed for time series classifiers [2, 14, 18] and can achieve criterion (a). A
counterfactual example (or for short, a counterfactual) is an instance-based explanation in the form of a perturbed input
on which the model’s prediction value is different from the original model output. It thus indicates what modifications
of the input must be made to obtain a dissimilar prediction. It is often defined as an instance $X'$ minimizing a cost function such as [31]:
$$\mathcal{L}(X, X', y', \lambda) = \lambda\,\big(f(X') - y'\big)^2 + d(X, X'),$$
where $f$ is the prediction model, $X$ is the original input, $y'$ is a desired output value (e.g., a different predicted label in classification contexts), $d(\cdot, \cdot)$ is an appropriate distance on the input space, and $\lambda$ is a trade-off parameter. Counterfactual explanations have notably been utilized for models that accept or reject loan applications, as they provide actionable feedback on personal records [30]. In their basic definition, they are closely related to adversarial examples [30]; however, their properties and their utility are distinct. Adversarial examples are often weakly constrained and used as hard instances to train more robust models, whereas counterfactuals are post-hoc explanations that need to be plausible examples to be interpretable.
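To make this formulation concrete, here is a minimal gradient-descent sketch of the above objective, using a squared Euclidean distance for $d$ and a stand-in logistic model; the function and parameter names are illustrative assumptions and do not correspond to the algorithm introduced later in this paper.

```python
import torch

def wachter_counterfactual(f, x, y_target, lam=1.0, lr=0.05, steps=500):
    """Minimize lambda * (f(x') - y')^2 + ||x' - x||^2 over x' by gradient descent."""
    x_cf = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = lam * (f(x_cf) - y_target) ** 2 + torch.sum((x_cf - x) ** 2)
        loss.backward()
        optimizer.step()
    return x_cf.detach()

# Toy usage with a stand-in logistic "model" on a 5-dimensional input.
torch.manual_seed(0)
w = torch.randn(5)
f = lambda x: torch.sigmoid(x @ w)
x = torch.randn(5)
x_cf = wachter_counterfactual(f, x, y_target=torch.tensor(0.0))
print(f(x).item(), f(x_cf).item())
```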
Counterfactual explanations are particularly useful for explaining anomalies in time series detected by a given model,
in which case a counterfactual is another time series (or subsequence) that does not contain observations detected
as anomalous and is similar to the anomalous subsequence. It thus corresponds to the closest normal or expected
behaviour according to the model. For example, if the anomalous time series is a temporal record of the blood
glucose level of a patient at risk, a counterfactual could be an alternative series of values contained in a non-critical
interval. Hence, counterfactual explanations can reveal the boundaries of the normal time series distribution according
to the model and, consequently, its sensitivity to possible issues. Unfortunately, as often observed in distinct contexts
[26], a single counterfactual is only a partial explanation, satisfying a particular trade-off between predefined criteria.
One general solution is therefore to provide an ensemble of diverse counterfactuals [26, 23, 13]. However, for time
series anomaly detection, there is no existing approach for generating these ensembles, nor a strategy to effectively
communicate these more complex explanations to the model’s user for general data types.
In this work, we make the following contributions in this under-explored domain:
• We introduce a model-agnostic method that generates counterfactual ensemble explanations for anomalies
detected by an algorithm in a univariate or multivariate time series. Our ensemble explanation is a set of coun-
terfactual examples that can be used individually, or analysed together to investigate the model’s sensitivity
and the range of perturbation that can be applied to the original input.
• We propose an interpretable visualization of the counterfactual ensemble explanation, in particular by asso-
ciating the counterfactual examples and their prediction scores under the model.
Figure 1 summarizes our propositions, in contrast to existing explainability methods for time series models. The advantages of our explanation are that it is (a) sparse, in the sense that it is localized on a few time series features, (b) optimal, in that it minimally modifies these features, and (c) rich, as it diversifies the possible perturbations. Our method
is applicable to any differentiable anomaly detection model and comes in two variants, whose respective uses depend on prior knowledge of the data distribution. Dynamically Perturbed Ensembles (DPEs) leverage dynamic perturbation operators [12], which induce a modification of a time series according to a pre-defined mechanism. Interpretable Counterfactual Ensembles (ICEs) do not use such a perturbation mechanism and can be applied in the absence of domain expertise. Additionally, for high-dimensional time series, we design a sparse version of our method that provides a more parsimonious and readable explanation.
After succinctly reviewing existing work in the field of model explainability for time series and counterfactual ex-
planations in Section 2, we describe the general set-up in Section 3. In Section 4, we present the technical details
of our method. Then in Section 5, we demonstrate the effectiveness of our method on DL-based detection models
and benchmark datasets. We notably adapt metrics that have been previously used for counterfactual explanations in
distinct data domains, such as Validity, Plausibility, Closeness and Diversity [23, 25, 17]. Moreover, we qualitatively
show on several case studies that our novel visualization can provide a comprehensible insight into the model’s local
decision boundary and sensitivity. Finally, we conclude in Section 6 with a summary of our results and possible future
developments.
2 Related work
Explainability methods for users of machine learning models have developed along two paradigms: building models
with interpretable blocks or designing model-agnostic methods that can be applied to any model already deployed.
For time series data, RETAIN [11] incorporates an attention-mechanism in an RNN-based model while Dynamic
Masks [12] is a model-agnostic algorithm that produces sparse feature-importance masks on time series using dynamic
perturbation operators. In fact, many methods for time series adapt algorithms designed for tabular or image data: for
instance, TimeSHAP [5] extends SHAP, a feature-attribution method that approximates the local behaviour of a model
with a linear model using a subset of features. Another interesting line of work interprets CNNs for time series models
using Shapelet Learning [20]. Shapelets are subsequences that are learnt from a dataset to build interpretable time
series decompositions.
Nonetheless, the previously cited works for time series are feature-saliency estimation methods. Although they are notably helpful to localize the important parts of a time series (in terms of their contribution to the model's prediction), they can only weakly explain anomaly detection models. Moreover, instance- or example-based explanations can be more easily interpreted by a non-expert person [31]. These methods explain a prediction on a single instance by comparing it to another real or generated example, e.g., the most typical exemplar of the observed phenomenon (a prototype [16]) or a contrastive exemplar related to a distinct behaviour (a counterfactual [2, 14, 18]). For time series classifiers,
counterfactuals can be generated by swapping the values of the most discriminative dimensions with those from an-
other training instance [2]. Unfortunately, this approach can yield implausible subsequences, that do not belong to
the data manifold [9], e.g., by breaking correlations between the dimensions of multivariate time series. The Native
Guide algorithm [14] does not suffer from the previous issue but uses a perturbation mechanism on the Nearest Unlike
Neighbor in the training set using the model's internal feature vector. Lastly, for k-NN and Random Shapelet Forest classifiers, [18] design a tweaking mechanism to produce counterfactual time series.
However, these methods necessitate knowledge of the model’s internal mechanism and/or access to its training dataset,
which can be expensive. Additionally, these counterfactual explanations suffer from the so-called Rashomon effect
[22], i.e., the fact that several equally-good perturbed examples might exist and be informative for the model’s user.
In this case, one might benefit from knowing multiple ones, before choosing the most helpful example in a specific
context [23]. For linear classifiers of tabular data, a set of diverse counterfactuals can be obtained by sequentially
adding constraints along the optimization iterations of the perturbation algorithm [26], whereas the Multi-Objective
Counterfactuals algorithm [13] records multiple perturbed examples generated along the iterations of a genetic al-
gorithm. These counterfactual sets therefore contain different trade-offs between conflicting criteria. While in the
previous methods, diversity is not explicitly enforced, the DiCE algorithm [23] includes a penalization on counterfac-
tuals’ similarity based on Determinantal Point Processes. In a similar fashion, for image classifiers, DiVE [25] perturbs
the latent features in a Variational Auto Encoder and penalises pairwise similarity between perturbations, while [17]
propose a general framework for generating counterfactual examples with diversity constraints in heterogeneous data.
Nevertheless, to our knowledge, there is no method to provide diverse counterfactual explanations for time series
data, a fortiori in the context of anomaly detection. Moreover, previous works that have enriched counterfactual
explanations with diverse examples have not discussed the additional challenge of communicating efficiently a set of
instances rather than a single one. Before exposing the technicalities of our method, we describe the general set-up in
the next section.
3 General set-up
In this work, we assume that anomalies in a time series are unpredictable and out-of-distribution subsequences. Hence,
an anomaly is a significant deviation from a given reference behaviour. In the remainder, we will not make a distinction
between anomaly, outlier and anomalous/abnormal/atypical observation. Not-anomalous data points will be consid-
ered as belonging to the data distribution, and denoted as the reference/normal/typical/expected behaviour. We will
also refer to the latter as the context.
For the description of the general set-up, we introduce the following notation: for an integer $k \in \mathbb{N}$, $[k]$ denotes the set $\{i;\ 1 \leq i \leq k\}$, and for $x \in \mathbb{R}$, let $x_+ = \max(0, x)$. For a vector $v \in \mathbb{R}^n$, we denote by $v_i$ its $i$-th coordinate, and for a matrix or multivariate time series $X \in \mathbb{R}^{m \times n}$, $X_i$ denotes respectively its $i$-th row or its $i$-th observation.
Anomaly detection model  We assume that we are given an anomaly detection model which we can use to predict – or rather, in this context, detect – anomalies on a time series of any given length. We consider a general setting where time series are multivariate and the model processes all dimensions (or channels) jointly. More precisely, we denote by $X \in \mathbb{R}^{T \times D}$ a time series with $T$ timestamps and $D$ dimensions. The prediction function of the model, denoted by $f$, is used to classify each timestamp $t \in [1, T]$ of $X$ as "anomalous" (i.e., label 1) or "not-anomalous" (i.e., label 0). In fact, the prediction $f(X) \in \mathbb{R}^T$ is a vector of anomaly scores for each timestamp (e.g., probability scores of being anomalous), which is transformed into a vector of 0-1 labels using the model's classification rule (e.g., a threshold on these scores). Note that the dimension of the vector $f(X)$ might be smaller than $T$ if the model needs a warm-up interval.
In practice, these models often detect anomalous timestamps by subdividing time series into smaller time windows and classifying the latter (therefore classifying each timestamp, or a subset of them, in these sub-windows). In other words, to output a prediction on a single timestamp, the "receptive field" of a model is generally a fixed-size (typically small) window. Let us denote by $W \in \mathbb{R}^{L \times D}$ a window of size $L$ and consider the following general set-up: the window $W = [W_C, W_S]$ is subdivided by the model into two parts, with $W_C \in \mathbb{R}^{(L-S) \times D}$ a context part (which can be empty if the context is implicit once the model is trained) and $W_S \in \mathbb{R}^{S \times D}$ a suspect part, for which the model makes a prediction. More precisely, $f(W) \in \mathbb{R}^S$ is the anomaly score of the window $W_S$ and, without loss of generality, we suppose that $f(W) \in [0,1]^S$. We also denote by $\theta \in [0,1]$ the anomaly detection threshold, i.e., a label 1 is given to $W_S$ if for some $i \in [S]$, $f(W)_i > \theta$.
Examples of anomaly detection models with the previously described mechanism are NCAD [10], where the context window typically has thousands of timestamps and the suspect window has 1 to 5 timestamps, and USAD [3], where $W = W_S$ and $L = 5$ or $10$. In the latter case, the context is implicit: the whole training set is considered as normal data and thus serves as the context of the anomalies detected in a test time series.
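To fix ideas, the toy sketch below mimics this interface: a window of length L is split into a context part and a suspect part of length S, the detector returns S scores in [0, 1], and the suspect window is labelled anomalous if any score exceeds θ. The z-score detector is only a placeholder for illustration, not one of the models discussed here.

```python
import numpy as np

L, S, D, theta = 125, 10, 1, 0.5

def f(W):
    """Toy detector: returns S anomaly scores in [0, 1] for the suspect part of W (shape (L, D))."""
    W_C, W_S = W[: L - S], W[L - S:]
    mu, sigma = W_C.mean(axis=0), W_C.std(axis=0) + 1e-8
    z = np.abs((W_S - mu) / sigma)          # deviation of each suspect timestamp from the context
    return 1.0 - np.exp(-z.max(axis=1))     # squash to [0, 1], one score per suspect timestamp

rng = np.random.default_rng(0)
W = rng.normal(size=(L, D))
W[-3] += 6.0                                # inject a spike in the suspect part
scores = f(W)
print(scores.round(2), "anomalous:", bool((scores > theta).any()))
```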
Counterfactual explanation  In most cases, a single anomaly is a short subsequence, and can therefore be contained in one or a few contiguous subwindows $W_S$. For ease of exposition, we suppose that an anomaly is contained in one suspect window. An example is shown in Figure 1a, where a suspect window $W_S$ (highlighted in red) contains an anomaly. A counterfactual example for the model $f$ detecting an anomaly in $W_S$ (i.e., for some $i \in [S]$, $f(W)_i > \theta$) is an alternative window $\widetilde{W} = [W_C, \widetilde{W}_S]$ such that all predicted labels are 0 (i.e., for any $i \in [S]$, $f(\widetilde{W})_i < \theta$). Since the context of the anomaly is also key to its detection by the model, if $W$ does not contain a context window $W_C$, we choose to add to the counterfactual example $\widetilde{W}$ a fixed-size window $W_C$ that immediately precedes $W$ in the time series. Note that we implicitly suppose that anomalies are not too close to each other, so that the additional context window does not contain any anomaly. With a slight abuse of notation, we still denote by $\widetilde{W}$ the obtained counterfactual example.
Properties of counterfactual explanations  There are four largely consensual properties that convey value and utility to counterfactual explanations in the context of model elicitation [30]:
1. Validity or Correctness: achieving a desired model output, e.g., changing the predicted class label in classification; this is the key goal of a contrastive explanation.
2. Parsimony or Closeness: minimally and sparsely changing the original input; this is motivated by the practical feasibility of the counterfactual if the input features are actionable, and by the readability of the information communicated to the model's user.
3. Plausibility: counterfactual explanations need to contain realistic examples of normal subsequences.
4. Computability within a reasonable amount of time and with acceptable computing resources.
(a) Original time series window
(b) Feature-saliency map
(c) Counterfactual example
(d) Our counterfactual ensemble explanation
(e) Counterfactual examples from our ensemble explanation
Figure 1: Comparison between existing explainability methods for time series and ours, in the context of anomaly detection. The original input (1a) is a univariate time series window containing an anomalous subsequence (highlighted in red). Explanations from a feature-importance method (Dynamic Masks [12]) (1b), an instance-based method (counterfactual example) (1c) and our method (1d) are represented. In 1b, the salient timestamps have saliency scores close to one and are highlighted in green. In 1d, all the examples from our counterfactual time series ensemble are plotted on the anomalous sub-window, and the orange color map indicates their anomaly probability scores (between 0 and 1) given by the model. In 1e, we also plot five examples from this ensemble.
In the context of an anomaly detected in a time series, property (1) is equivalent to flipping the anomaly detection model's prediction label from 1 to 0 (i.e., achieving an anomaly prediction score below the classifier threshold). Property (2) can be enforced by restricting the perturbation of the input to a small window containing the anomaly (i.e., the suspect window $W_S$) and to a few dimensions of the time series (if the anomalous features are only located on
some channels). Property (3) requires that the counterfactual belongs to the normal data distribution. If the latter
is not known or estimated, this criterion can be complicated to evaluate, but some prior knowledge such as the time
series’ regularity, seasonality, or bounds can be leveraged. Property (4) potentially depends on the specific setting, in
particular the cost of using the model’s prediction function or its gradient, and the size of the dataset. However, in
our context, we assume that accessing the training set of the detection model is particularly expensive, since the latter
often decomposes the time series into small windows, leading to a large number of actual training inputs for long time
series.
Unfortunately, those properties are often conflicting (e.g. parsimony and plausibility in the context of a spike outlier),
therefore a single counterfactual example can only achieve a particular trade-off between them. In the next paragraph,
we motivate the use of counterfactual ensembles (or sets) as more comprehensive explanations.
Diversity as an additional property A classical counterfactual explanation is a single counterfactual example that
combines properties (1)-(4). However, the best trade-off might depend on the particular anomaly or the user's range of action. In the absence of this prior knowledge, previous work [23, 26] has added the notion of diversity, or range of perturbation, to the list of informative criteria. In particular, it can increase the likelihood of finding a helpful explanation
[25]. We also argue that this additional complexity should be adequately communicated to the explanation’s recipient,
e.g., with a suitable visualization. An example of our proposed representation is shown in Figure 1d: the range of time
series values and prediction scores spanned by the different counterfactuals in our ensemble explanation effectively
informs the user of the possible perturbations and the model’s sensitivity.
4 Methodology
In this section, we present our method to generate a counterfactual ensemble explanation for differentiable models, as well as their sparse versions for high-dimensional time series. Additionally, we propose a gradient-free sampling algorithm, called Forecasting Set (FS), that is applicable to black-box models and relies on an auxiliary probabilistic forecasting model.
4.1 Gradient-based counterfactual ensemble explanations
Most counterfactual algorithms (e.g., Native Guide [14], Growing Spheres [19], DiCE [23]) rely on adequately perturbing the input $W$ and optimising the perturbation to enforce some properties of the perturbed example. In our method, using the notation of Section 3, we first define a penalized objective function over a single counterfactual example $\widetilde{W} = [W_C, \widetilde{W}_S]$, then use a gradient-descent algorithm to minimize it. The ensemble of examples is built along the optimization path by collecting adequate perturbations. We define two variants of our method: the first one, called Interpretable Counterfactual Ensemble (ICE), is an end-to-end method that does not require any domain knowledge input; the second one, Dynamically Perturbed Ensemble (DPE), uses an explicit dynamic perturbation mechanism [12].
Interpretable Counterfactual Ensemble (ICE)  In this variant, the loss function on a counterfactual example is defined as follows:
$$\mathcal{L}_{ICE}(\widetilde{W}) = \mathcal{L}_{pred}(\widetilde{W}) + \mathcal{L}_{c}(\widetilde{W}) + \mathcal{L}_{s}(\widetilde{W}), \qquad (1)$$
where the first term accounts for the Validity property via a hinge loss on the prediction score of $\widetilde{W}$, i.e.,
$$\mathcal{L}_{pred}(\widetilde{W}) = \big(f(\widetilde{W}) - c\big)_+,$$
where $c \in [0,1]$ is a margin parameter. The second term in (1) enforces the Closeness constraint via a penalty similar to the elastic net [33], here using the Frobenius and the $L_1$ matrix distances:
$$\mathcal{L}_{c}(\widetilde{W}) = \frac{\lambda_1}{S\sqrt{D}} \|\widetilde{W} - W\|_1 + \frac{\lambda_2}{SD} \|\widetilde{W} - W\|_F,$$
where $\lambda_1, \lambda_2 > 0$ are regularization parameters. Finally, the third term of (1) enforces Plausibility through temporal smoothness:
$$\mathcal{L}_{s}(\widetilde{W}) = \frac{\lambda_T}{(S-1)D} \sum_{i=1}^{D} \sum_{t=1}^{S-1} \Big| [\widetilde{W}_S]_{(t+1)i} - [\widetilde{W}_S]_{ti} \Big|,$$
with $\lambda_T > 0$. The assumption behind this constraint is that normal time series are not too rough and are smoother than abnormal windows; realistic perturbations should therefore also be fairly smooth.
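As an illustration, a minimal PyTorch sketch of the ICE objective (1) under these definitions could look as follows; `model` is assumed to map an (L, D) window to its S anomaly scores, the hinge term is summed over the S scores, and all names and defaults are illustrative rather than the exact implementation.

```python
import torch

def ice_loss(model, W_tilde, W, S, c=0.0, lam1=0.01, lam2=0.01, lam_T=0.01):
    """Objective (1): hinge prediction loss + elastic-net closeness + temporal smoothness."""
    D = W.shape[1]
    # Validity: hinge loss on the anomaly scores of the counterfactual window.
    l_pred = torch.clamp(model(W_tilde) - c, min=0.0).sum()
    # Closeness: L1 and Frobenius distances to the original window.
    diff = W_tilde - W
    l_close = lam1 / (S * D ** 0.5) * diff.abs().sum() + lam2 / (S * D) * torch.norm(diff)
    # Plausibility: total variation of the suspect part over time.
    W_S = W_tilde[-S:]
    l_smooth = lam_T / ((S - 1) * D) * (W_S[1:] - W_S[:-1]).abs().sum()
    return l_pred + l_close + l_smooth
```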
Dynamically Perturbed Ensemble (DPE)  Contrary to ICE, this variant uses an explicit perturbation mechanism based on a dynamic perturbation operator [12] and a map that spatially and temporally modulates this perturbation. More precisely, a map is a matrix $M \in [0,1]^{S \times D}$ that accounts for the amount of change applied to a timestamp and a dimension in the suspect window $W_S$. A value close to 1 in $M$ indicates a big change, while a value close to 0 indicates a small change. Our dynamic perturbation operator is a Gaussian blur which takes as input a time series window $W$, a timestamp $t \in [L-S, L]$, a dimension $i \in [D]$ and a weight $m \in [0,1]$, and is defined as:
$$\pi_G(W, t, i, m) = \frac{\sum_{t'=1}^{L} W_{t'i}\, \exp\!\big(-(t-t')^2 / 2(\sigma_{\max}(1-m))^2\big)}{\sum_{t'=1}^{L} \exp\!\big(-(t-t')^2 / 2(\sigma_{\max}(1-m))^2\big)},$$
with $\sigma_{\max} \geq 0$ a hyperparameter tuning the blur's temporal bandwidth. We note that the bigger this parameter is, the larger the smoothing effect of the perturbation. The latter is called dynamic in the sense that it modifies a timestamp using its neighbouring times. We also refer to [12] for more examples of dynamic perturbation operators.
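A NumPy sketch of this operator is given below. It is applied directly with the map values: since the operator is later called with m = 1 − M_ti, the effective bandwidth is σ_max·M_ti; a small constant avoids a zero bandwidth. Names are illustrative.

```python
import numpy as np

def perturb_suspect_window(W, M, sigma_max=5.0):
    """Gaussian-blur perturbation of the suspect part of W (shape (L, D)).

    M has shape (S, D): entries close to 1 give a wide blur (large change),
    entries close to 0 leave the original value essentially unchanged.
    """
    L, D = W.shape
    S = M.shape[0]
    t_all = np.arange(L)
    out = np.empty((S, D))
    for s in range(S):
        t = L - S + s                                # absolute index of the perturbed timestamp
        for i in range(D):
            sigma = sigma_max * M[s, i] + 1e-8       # pi_G called with m = 1 - M[s, i]
            weights = np.exp(-((t - t_all) ** 2) / (2.0 * sigma ** 2))
            out[s, i] = np.dot(weights, W[:, i]) / weights.sum()
    return out
```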
Finally, for a given map $M$, a perturbed suspect window is given by $[\widetilde{W}_S(M)]_{ti} = \pi_G(W, L-S+t, i, 1-M_{ti})$, $t \in [S]$, $i \in [D]$. The loss function is then written in terms of the perturbation map as:
$$\mathcal{L}_{DPE}(M) = \mathcal{L}_{pred}(\widetilde{W}(M)) + \frac{\lambda_1}{S\sqrt{D}} \|M\|_1 + \frac{\lambda_2}{SD} \|W - \widetilde{W}(M)\|_F + \frac{\lambda_T}{(S-1)D} \sum_{i=1}^{D} \sum_{t=1}^{S-1} \big| M_{(t+1)i} - M_{ti} \big|, \qquad (2)$$
where the first term is the hinge loss, and the second and fourth terms account for the sparsity and smoothness constraints, in this case applied to $M$ rather than $\widetilde{W}$ as in (1).
Optimization  We initialize the counterfactual $\widetilde{W}$ at $W$ and minimize the objective function (1) or (2) using a Stochastic Gradient Descent (SGD) algorithm. Along the iterations of the latter, if we find an example $\widetilde{W}$ such that $\forall i \in [S],\ f(\widetilde{W})_i < \theta$, we add $\widetilde{W}$ to a set $\mathcal{I}$. At the end of the iterations, we subsample $N$ counterfactuals from the set $\mathcal{I}$ to obtain a diverse counterfactual ensemble. For simplicity, the subsampling scheme is a regular grid over the generation rank of the examples.
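The collection step can be sketched as follows; `loss_fn` is assumed to wrap one of the objectives above (closing over the model and the original window), `model` returns the S anomaly scores of a candidate window, and the names and defaults are illustrative.

```python
import torch

def build_counterfactual_ensemble(model, loss_fn, W, S, theta,
                                  n_iter=1000, n_keep=100, lr=0.1):
    """Minimize the objective with SGD and collect valid counterfactuals along the path."""
    W_tilde = W.clone().detach().requires_grad_(True)
    optimizer = torch.optim.SGD([W_tilde], lr=lr)
    valid = []
    for _ in range(n_iter):
        optimizer.zero_grad()
        loss_fn(W_tilde).backward()
        optimizer.step()
        with torch.no_grad():
            W_tilde[:-S] = W[:-S]                  # keep the context part fixed
            if (model(W_tilde) < theta).all():     # all S labels flipped to "not anomalous"
                valid.append(W_tilde.detach().clone())
    if not valid:
        return []
    # Subsample N counterfactuals on a regular grid over their generation rank.
    idx = torch.linspace(0, len(valid) - 1, steps=min(n_keep, len(valid))).long()
    return [valid[i] for i in idx]
```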
4.2 Sparse counterfactual explanations for high-dimensional time series
In high-dimensional settings (typically $D > 10$), restricting the perturbations to act on short temporal windows (fewer than 10 timestamps) is not enough to obtain an interpretable explanation if the counterfactual ensemble explanation spans all dimensions. In fact, the model's user is likely to prefer explanations that are low-dimensional, since they are easier to visualize and allow changing a minimal number of system units if the features are actionable. Besides, it is often the case in multi-object systems that an anomaly only affects a few dimensions (e.g., a small subset of monitoring metrics taking abnormal values in a server network [28]), hence its explanation should also reflect this low-dimensional property. For this purpose, we design a sparse version of our gradient-based method that constrains the counterfactual ensemble explanation to be spatially sparse (i.e., sparse in the perturbed dimensions, therefore parsimonious in dimensions).
Sparse ICE  In the sparse version of ICE, we restrict the number of perturbed dimensions by introducing a vector $w \in [0,1]^D$ and a matrix $Z \in \mathbb{R}^{S \times D}$, and defining $\widetilde{W}_S(w, Z) = (w \otimes \mathbf{1})\, Z + ((1-w) \otimes \mathbf{1})\, W_S$. The role of $w$ is to select the dimensions of $W_S$ that are perturbed with $Z$. We then consider an objective function in terms of $(Z, w)$:
$$\mathcal{L}_{ICE,SP}(w, Z) = \big(f(\widetilde{W}(w, Z)) - c\big)_+ + \frac{\lambda_1}{\sqrt{D}} \|w\|_1 + \frac{\lambda_2}{SD} \|W - \widetilde{W}(w, Z)\|_F + \frac{\lambda_T}{(S-1)D} \sum_{i=1}^{D} \sum_{u=1}^{S-1} \big| Z_{(u+1)i} - Z_{ui} \big|. \qquad (3)$$
Contrary to (1), where the sparsity penalization is applied globally (i.e., both temporally and spatially), the previous objective enforces spatial sparsity through the $L_1$-penalisation on $w$. Another way to see this is to re-interpret objective (1) as objective (3) with $w = (1, 1, \dots, 1)$, $Z = \widetilde{W}_S$, and the $L_1$-penalisation on $w$ replaced by $\frac{\lambda_1}{S\sqrt{D}} \|Z - W_S\|_1$.
Sparse DPE  We apply the same idea to the DPE variant by enforcing the perturbation maps to be spatially sparse. More precisely, we define $M(w, t) = t \otimes w$ with $w \in [0,1]^D$ and $t \in [0,1]^S$, and a loss function in terms of $(w, t)$:
$$\mathcal{L}_{DPE,SP}(w, t) = \big(f(\widetilde{W}(w, t)) - c\big)_+ + \frac{\lambda_1}{\sqrt{D}} \|w\|_1 + \frac{\lambda_2}{SD} \|W - \widetilde{W}(w, t)\|_F + \frac{\lambda_T}{S-1} \sum_{u=1}^{S-1} |t_{u+1} - t_u|. \qquad (4)$$
Here the smoothness constraint is applied to $t$ to guarantee that $M$ is also smooth in the temporal dimension.
4.3 Gradient-free approach: Forecasting Set
If the anomaly detection model is non-differentiable, we propose an alternative approach that generates a counterfactual ensemble explanation using an appropriate sampling mechanism. Machine learning models for time series data sometimes rely on sampling in the context of probabilistic forecasting. Here, we will train an auxiliary probabilistic forecasting method and use it as a generative model of counterfactual subsequences. More precisely, given an input window $W_C \in \mathbb{R}^{(L-S) \times D}$, our auxiliary model $g$ outputs a distribution over a forecast horizon of $S$ timestamps, $g(W_C)$, from which one can sample forecast paths. We therefore sample $N$ windows $W_F^{(i)} \sim g(W_C)$, $i \in [N]$, then select the ones that are not anomalous according to the anomaly detection model, i.e., our counterfactual ensemble is given by:
$$\mathcal{I}_{FS} = \big\{ W_F^{(i)};\ i \in [N] \ \text{s.t.}\ \forall t \in [S],\ f([W_C, W_F^{(i)}])_t < \theta \big\}.$$
Intuitively, since the probabilistic forecasting model is trained to learn the data distribution, it generates realistic forecast samples. However, the sampling model is oblivious to the original input $W_S$, and the forecast samples are therefore not constrained to be minimally distant from it. In Section 5, we construct and evaluate this approach with a Feed Forward Neural Network (FFNN) for univariate data and a DeepVAR model [27] for multivariate data, both from the GluonTS package [1] (https://ts.gluon.ai/index.html).
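A sketch of this rejection-sampling construction is shown below; `sample_forecast` stands in for drawing a forecast path from the auxiliary model g (here replaced by a naive Gaussian forecaster fitted on the recent context), while `f` and `theta` denote the detector and its threshold as before. All names are illustrative.

```python
import numpy as np

def naive_gaussian_forecast(W_C, S, rng):
    """Placeholder for g: i.i.d. Gaussian samples around the recent context statistics."""
    mu = W_C[-20:].mean(axis=0)
    sigma = W_C[-20:].std(axis=0) + 1e-8
    return rng.normal(mu, sigma, size=(S,) + mu.shape)

def forecasting_set(f, theta, W_C, S, sample_forecast, n_samples=100, seed=0):
    """Sample forecast paths over the suspect horizon and keep the non-anomalous ones."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_samples):
        W_F = sample_forecast(W_C, S, rng)              # candidate suspect window, shape (S, D)
        candidate = np.concatenate([W_C, W_F], axis=0)
        if (f(candidate) < theta).all():                # rejection step
            ensemble.append(W_F)
    return ensemble
```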
5 Experiments
In this section, we test and compare the performance of our method on two differentiable models, and the relative advantages of its variants (i.e., ICE, DPE, FS, Sparse ICE, Sparse DPE) in multiple contexts. For this analysis, we have considered two DL anomaly detection models, NCAD [10] and USAD [3], and four benchmark time series datasets. We report in Section 5.4 a qualitative evaluation of our counterfactual ensemble explanations and their visualization, and in Section 5.5, a quantitative analysis under the previously defined criteria. Note that this study does not include a comparison to existing baselines, since counterfactual ensemble explanations have not been previously considered for time series data. Although some algorithms such as DiCE [23] exist in the context of tabular data, we do not deem it appropriate to use them in our context, since perturbation methods are adapted to each data domain [12]. Nonetheless, for the sake of comparison, we also include a naive baseline, whose mechanism is described in Section 5.1. Section 5.2 and Section 5.3 provide additional details on the explainability metrics and the hyperparameter selection procedure.
5.1 Experimental set-up
Anomaly detection models  In this experimental evaluation, we use two differentiable models with distinct temporal neural network mechanisms. The first one, Neural Contextual Anomaly Detection (NCAD) [10], uses a temporal convolutional network and subdivides time series into windows that include a context part. The second one, UnSupervised Anomaly Detection (USAD) [3], is based on an LSTM Auto-Encoder and predicts anomalies on suspect windows without explicit context windows. Neither of these models is interpretable-by-design, but both have SOTA performance on the benchmark anomaly detection datasets and reasonable training times (around 90 min). Before evaluating our explainability method, we train these models using the procedure described in their respective papers. More details on these models and their detection performance on the benchmark datasets are reported in Appendix A.
Datasets  The four benchmark datasets selected in our experiments are the following:
• KPI: this dataset contains 29 univariate time series. It was released in the AIOPS data competition (https://github.com/NetManAIOps/KPI-Anomaly-Detection) and consists of Key Performance Indicator curves from different internet companies at 1-minute intervals.
• Yahoo: this dataset was published by Yahoo labs (https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70) and consists of 367 real and synthetic univariate time series.
• Server Machine Dataset (SMD): this dataset contains 28 time series with 38 dimensions, collected from server machines in a large internet company [29].
• Soil Moisture Active Passive satellite (SMAP): this NASA dataset published by Hundman et al. (2018) contains 55 time series with 25 dimensions.
The main properties of these datasets are given in Table 1. For our evaluation, we use the test sets of each dataset,
which correspond to the last 50% timestamps of each time series [10]. When needed, the training and validation sets
contain respectively the first 30% and subsequent 20% timestamps. We note that all these datasets have ground-truth
anomaly labels on the test set, and in our evaluation, we only compute counterfactual ensemble explanations for the
ground-truth anomalies detected by each model (i.e., the True Positives). Since in practice, our method could be
applied on all the detected anomalies, including the False Positives (i.e., the observations with anomalous predicted
labels that are not ground-truth anomalies), we have also performed a complementary analysis on the latter. The
numerical results on the KPI dataset, available in Appendix B.2, seem to show that good counterfactual explanations
on False Positives are easier to obtain than on True Positives. Consequently, we did not include the former in our
numerical evaluation in Section 5.5 since they could eclipse the relative advantages of our method and its variants.
Naive counterfactual ensemble explanation As a simple baseline, we propose a procedure that generates a set of
counterfactual examples using a basic sampling scheme without requiring any training. The main idea is similar to
the Forecasting Set approach, but here the sampling mechanism is "naive". For each tested window $W$ containing an anomaly in $W_S$, we draw a sample by interpolating the anomalous window $W_S$ and a constant window with a random weight. The constant window repeats the observation from the timestamp immediately before the anomaly, i.e., $[W_C]_{L-S}$. Thus, for $i \in [N]$, a sample $\widetilde{W}_S^{naive,i}$ is defined as:
$$\widetilde{W}_S^{naive,i} = w_i\, W_S + (1 - w_i)\, X_{-1}, \qquad (5)$$
where $w_i \overset{i.i.d.}{\sim} \mathcal{U}[0,1]$ and $X_{-1} = [[W_C]_{L-S}, \dots, [W_C]_{L-S}] \in \mathbb{R}^{S \times D}$. As in Section 4.3, we also select the samples that are not anomalous under the model, i.e., the naive counterfactual ensemble is finally:
$$\mathcal{I}_{N} = \big\{ \widetilde{W}^{naive,i} = [W_C, \widetilde{W}_S^{naive,i}];\ i \in [N] \ \text{s.t.}\ \forall t \in [S],\ f(\widetilde{W}^{naive,i})_t < \theta \big\}.$$
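Under the notation above, this naive sampler can be sketched as follows (f and θ are the detector and its threshold; names are illustrative).

```python
import numpy as np

def naive_ensemble(f, theta, W_C, W_S, n_samples=100, seed=0):
    """Interpolate the anomalous suspect window with a constant window and keep valid samples."""
    rng = np.random.default_rng(seed)
    S = W_S.shape[0]
    X_const = np.repeat(W_C[-1:], S, axis=0)      # repeat the last pre-anomaly observation
    ensemble = []
    for _ in range(n_samples):
        w = rng.uniform()                         # random interpolation weight in [0, 1]
        W_S_naive = w * W_S + (1.0 - w) * X_const
        if (f(np.concatenate([W_C, W_S_naive], axis=0)) < theta).all():
            ensemble.append(W_S_naive)
    return ensemble
```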
5.2 Explainability metrics
To evaluate the utility of our method, we compute the following metrics as proxies of the criteria defined in Section 3:
• Failure rate: accounting for Validity or algorithm Correctness, this metric corresponds to the percentage of anomalies for which our method fails to output at least one counterfactual example (for the gradient-based methods), and to the rejection rate of the samples from the generative scheme (for the Forecasting Set approach and the naive baseline).
•Distance: The Closeness criterion is measured in terms of the Dynamic Time Warping (DTW) distances be-
tween each example of the counterfactual ensemble and the original anomalous window. The DTW distance
is generally more adapted to time series data than the Euclidean distance.
• Implausibility: since the Plausibility property is not easy to evaluate without expert knowledge of the particular data domain, we decompose it into the three following proxy metrics, which cover different notions of deviation from an estimated normal behaviour (a computational sketch of some of these metrics follows this list):
– DTW distance to a reference time series, here the median sample from the Forecasting Set approach (Implausibility 1);
– Temporal Smoothness (Implausibility 2), defined as
$$\sum_{i=1}^{D} \sum_{t=1}^{S-1} \Big| [\widetilde{W}_S]_{(t+1)i} - [\widetilde{W}_S]_{ti} \Big|;$$
– Negative log-likelihood under the probabilistic forecasting distribution $g$, if available (Implausibility 3).
We compute these metrics for each example of the counterfactual ensemble explanation.
•Diversity: the range of values spanned in a counterfactual ensemble is evaluated by the variance of the
counterfactual examples at each timestamp.
•Sparsity correctness: for multivariate time series, if additional information on the anomalous dimensions
in the ground-truth anomalies is available, we compute the precision and recall scores of the sparse variants
of DPE and ICE in identifying the dimensions to perturb.
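As announced above, here is a small sketch of how the temporal-smoothness proxy (Implausibility 2) and the Diversity score can be computed. The DTW-based metrics would rely on an external DTW implementation and are omitted, and summing the per-timestamp variances into a single Diversity number is one possible aggregation choice.

```python
import numpy as np

def temporal_smoothness(W_S_tilde):
    """Implausibility 2: sum of absolute first differences over time, summed over dimensions."""
    return np.abs(np.diff(W_S_tilde, axis=0)).sum()

def diversity(ensemble):
    """Variance of the counterfactual examples at each timestamp, aggregated by summation."""
    stacked = np.stack(ensemble, axis=0)    # shape (N, S, D)
    return stacked.var(axis=0).sum()
```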
5.3 Hyperparameter selection
The hyperparameters of our counterfactual explanation method with the gradient-based approaches are selected by testing all configurations of $\lambda_1 = \lambda_2$ and $\lambda_T$ in the set $\{0.001, 0.01, 0.1, 1.0\}$, $\sigma_{\max}$ in $\{3, 5, 10\}$, and the learning rate of the SGD algorithm in $\{0.01, 0.1, 1.0, 10.0, 1000.0, 10000.0\}$. As an explainability method can be finely tuned on a particular problem and dataset, the configurations could be evaluated on all the anomalies in the test set. However, for computational efficiency reasons, we run this evaluation on 100 randomly chosen anomalies, then evaluate the final performance of the chosen configuration on the entire test set. An exception holds for the SMAP dataset, which contains fewer than 100 anomalies detected by the models; in this case we run the configurations' evaluation on the whole test set. For each dataset and detection model, we select the set of hyperparameters with the minimal Implausibility 2, given that the failure rate is kept under a pre-defined level (see Figures 7, 8, 9 and 10 and tables in Appendix D). Moreover, we run the SGD algorithm for 1000 iterations and select a maximum of $N = 100$ counterfactual examples along the optimization path. The hyperparameters of the probabilistic models in the Forecasting Set approach are reported in Table 8 in Appendix D. Finally, in order to provide a ready-to-use method, we also suggest a default set of hyperparameters in Table 13 in Appendix D. For all datasets, models and approaches, we use suspect windows of $S = 10$ timestamps and margin parameter $c = 0$.
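The selection rule can be written compactly as a grid search; `evaluate` is a placeholder returning the failure rate and average Implausibility 2 of a configuration on the sampled anomalies, and the failure-rate cap is an assumed constant standing in for the pre-defined level mentioned above.

```python
from itertools import product

LAMBDAS = [0.001, 0.01, 0.1, 1.0]
SIGMAS = [3, 5, 10]
LEARNING_RATES = [0.01, 0.1, 1.0, 10.0, 1000.0, 10000.0]

def select_config(evaluate, max_failure_rate=0.2):
    """evaluate(config) -> (failure_rate, mean_implausibility_2) on a sample of anomalies."""
    best_config, best_score = None, float("inf")
    for lam12, lam_T, sigma_max, lr in product(LAMBDAS, LAMBDAS, SIGMAS, LEARNING_RATES):
        config = {"lam1": lam12, "lam2": lam12, "lam_T": lam_T,
                  "sigma_max": sigma_max, "lr": lr}
        failure_rate, implausibility_2 = evaluate(config)
        # Keep the smoothest configuration among those under the failure-rate cap.
        if failure_rate <= max_failure_rate and implausibility_2 < best_score:
            best_config, best_score = config, implausibility_2
    return best_config
```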
5.4 Qualitative analysis
As in image classification [32], visualizations in the time series domain are human-friendly tools for communicating model explanations, in particular in univariate or low-dimensional settings. In our context, we propose to visualize our counterfactual ensemble explanation on the original observations for which a prediction was made, possibly with an added context window (see Section 3) and a restricted number of channels. Since the anomaly prediction score is a scalar, we can add the score of each counterfactual example to its representation using an appropriate color scale. The explanation's recipient can therefore observe how the model's prediction score changes in the counterfactual ensemble, guess the shape of the model's local decision boundary, and evaluate the range of values spanned by the examples.
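For a univariate window, this visualization can be sketched with matplotlib as follows, assuming a list of counterfactual suspect windows and one scalar anomaly score per example (both in [0, 1]); names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_counterfactual_ensemble(W, S, ensemble, scores):
    """Overlay counterfactual suspect windows on the original window, colored by anomaly score."""
    L = W.shape[0]
    t_all, t_suspect = np.arange(L), np.arange(L - S, L)
    cmap, norm = plt.get_cmap("Oranges"), plt.Normalize(0.0, 1.0)
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(t_all, W[:, 0], color="tab:blue", label="original window")
    for W_S_tilde, score in zip(ensemble, scores):
        ax.plot(t_suspect, W_S_tilde[:, 0], color=cmap(norm(score)), alpha=0.7)
    fig.colorbar(plt.cm.ScalarMappable(norm=norm, cmap=cmap), ax=ax, label="anomaly score")
    ax.legend(loc="upper left")
    plt.show()
```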
In Figure 2, we present a visualization of our method on anomalies from the KPI dataset, detected by NCAD and USAD. We observe that the counterfactual ensemble explanations from DPE (red color scale), ICE (green), and FS (purple) are quite dissimilar, although they all modify only the spike outliers' features and globally lessen their amplitude. On the one hand, DPE produces counterfactual ensembles that are less diverse than the other approaches, and relatively close to the original input. On the other hand, ICE's counterfactual sets cover a much larger range of values and therefore allow the user to visualize more clearly how the anomaly score evolves for different magnitudes of the spikes. In contrast, the counterfactual ensembles generated by FS do not have the aforementioned interpretation but seem to visually correspond to the expected behaviour given the shape of the context windows. Additional visualizations of our explanations can be found in Appendix C.
In summary, our counterfactual ensemble explanations effectively contain diverse perturbations of the input time
series that change the detected label of the anomalous subsequence, with a small number of altered features. The three
approaches, ICE, DPE and FS, bring different insights on the model’s prediction, the time series distribution and the
possible perturbations to apply to change the former. Their relative advantages may therefore depend on the particular
time series context and usage of the counterfactual explanation.
Dataset | Dimensions | Number of time series | Total number of timestamps | Total number of anomalies in test set
KPI | 1 | 29 | 5922913 | 54560
Yahoo | 1 | 367 | 609666 | 2963
SMD | 38 | 28 | 1416825 | 29444
SMAP | 25 | 55 | 584860 | 57079
Table 1: Succinct description of the four benchmark datasets.
Figure 2: Time series windows containing an anomaly, together with our counterfactual ensemble explanations obtained with DPE (left column), ICE (middle column) and FS (right column) on the KPI dataset. The first (resp. second) row corresponds to an anomaly detected by NCAD (resp. USAD). Each window includes a context part of 115 timestamps and an abnormal part of 10 timestamps at the end of the window. The original observations are plotted in blue, while the counterfactual examples appear in red, green or purple color scales for DPE, ICE and FS, respectively.
5.5 Numerical evaluation
The numerical results discussed in this section are obtained in the set-up described in Section 5.1. However, due to space limitations, some of these results have been moved to Appendix B. We also add a partial sensitivity analysis of our method in Appendix E.
The results on the univariate datasets (see Table 2 and Table 5 in Appendix B.1) show that our method has fairly small failure rates (except for the Yahoo dataset with the USAD model). In particular, a rate smaller than 10% can be achieved with at least one variant for most (model, dataset) pairs, a substantial improvement over the naive procedure. We note that while the DPE variant seems to be valid more often than ICE on the NCAD model, the opposite holds for the USAD model; this difference is possibly due to the distinct internal mechanisms of these models.
Moreover, the analysis of the other explainability metrics supports the qualitative interpretation from Section 5.4. The Distance metric confirms that the gradient-based approaches, DPE and ICE, provide in almost all cases the closest counterfactuals on average, i.e., the least perturbed examples. The naive baseline occasionally attains a small distance, but its high failure rate suggests that this is a side-effect of the way its examples are selected. Moreover, the Implausibility metrics validate the observation that FS generates the most realistic counterfactual examples on average, in particular in terms of Implausibility 1 (distance to the median forecast sample) and Implausibility 3 (NLL under the probabilistic forecasting distribution). This is in fact expected, since these quantities are directly derived from the forecasting sampling scheme. However, these counterfactuals are less smooth (higher score in Implausibility 2) than those of DPE and ICE, which regularize the time series smoothness in the objective functions (2) and (1).
Finally, DPE and ICE provide a more diverse counterfactual ensemble in most cases, but their relative ranking is not clear from these experiments. We conjecture that this metric is particularly sensitive to the learning rate of the SGD algorithm and to the subsampling procedure after the objective minimization (see Section 4.1). In Appendix E, we test the first hypothesis on a small sample of anomalies. We observe in this case that the Diversity criterion is consistently higher for ICE, and greatly increases with the learning rate, at the cost of a higher failure rate.
NCAD on Yahoo
Method | Failures (%) | Distance | Implausibility 1 | Implausibility 2 | Implausibility 3 | Diversity
DPE | 9.2 | 2.49 (4.91) | 1.23 (1.37) | 1.42 (2.19) | 2.21 (4.76) | 0.01
ICE | 17.4 | 1.54 (1.21) | 0.78 (1.37) | 2.26 (1.67) | 1.40 (5.17) | 0.05
FS | 56.6 | 6.06 (16.38) | 0.27 (0.22) | 3.36 (1.99) | -0.29 (0.79) | 0.10
Naive | 72.2 | 2.69 (5.29) | 1.04 (1.26) | 1.32 (1.74) | 1.89 (3.34) | 0.05

USAD on Yahoo
Method | Failures (%) | Distance | Implausibility 1 | Implausibility 2 | Implausibility 3 | Diversity
DPE | 29.1 | 5.20 (18.00) | 6.42 (26.81) | 0.42 (2.00) | 3.74 (6.96) | 0.05
ICE | 25.5 | 6.66 (25.54) | 2.68 (11.46) | 0.40 (1.16) | 2.48 (4.61) | 3.23
FS | 65.1 | 14.48 (46.03) | 0.48 (0.72) | 0.55 (0.58) | -0.11 (1.12) | 0.61
Naive | 45.8 | 4.82 (18.36) | 2.85 (16.31) | 0.52 (1.60) | 3.25 (6.00) | 3.19

Table 2: Performance of our explainability method and the naive baseline in terms of Validity, Closeness, Plausibility and Diversity on the Yahoo dataset and the NCAD (first panel) and USAD (second panel) anomaly detection models. We report the average scores and standard deviations (in brackets) over the counterfactual ensembles. We recall that Implausibility 1 is the DTW distance to the median forecasting sample, Implausibility 2 is the temporal smoothness, and Implausibility 3 is the negative log-likelihood under the probabilistic forecasting output distribution. For all metrics except Diversity, we assume that a lower value is better, and the best score is highlighted in bold.
The multivariate experiments (see Table 3 and Table 6 in Appendix B.1) showcase that our method also generates valid counterfactual ensemble explanations in this more challenging setting, with even a failure rate of 0% for the USAD model. Our method fails more frequently on the NCAD model; however, the sparse variants are more often successful. This seems to show that imposing a sparsity constraint on the modified dimensions also helps to find valid counterfactuals. Consistently with the univariate datasets, FS produces the most realistic counterfactual examples while the gradient-based approaches achieve a better Distance score. We note that in this case the Implausibility 3 metric is not available, since the likelihood function of the forecast distribution in the DeepVAR model is not accessible (https://ts.gluon.ai/api/gluonts/gluonts.model.deepvar.html). Moreover, the sparse variants seem to correctly identify some of the anomalous channels (precision greater than 0.6 for the USAD model).
Nonetheless, we note the greater difficulty of tuning the hyperparameters of our method and ranking its variants on these high-dimensional datasets compared to univariate data. For the latter, the default set of hyperparameters achieves an acceptable performance and allows us to quickly compare the relative advantages of an approach for a specific (detection model, dataset) pair. We therefore conclude by recalling that example-based explainability methods for multivariate time series are still in their early developments, and providing general methods and tuning procedures to generate useful explanations over the instances of a dataset is still an open problem.
6 Concluding remarks
This work proposed a novel method for generating explanations for time series anomalies and detection models. Our real-world experiments show that the counterfactual framework, augmented with an ensemble approach, improves the interpretability of time series anomaly detection models, and can help their users identify the possible actions to take in response. Since there is generally little a priori knowledge on the possible anomalies, we compared several approaches that leverage domain-specific perturbations and anomaly sparsity in high-dimensional settings. Additionally, we have proposed a gradient-free approach that uses probabilistic forecasting techniques as a generative scheme. Although our model-agnostic method offers greater flexibility, better explainability performance might be achieved if more assumptions are put on the detection model. In particular, similarly to [25], we could adapt our gradient-based approach to use the internal representations of the model rather than the raw time series.
NCAD on SMD
Method | Failures (%) | Precision / Recall | Distance | Implausibility 1 | Implausibility 2 | Diversity
DPE | 17.1 | - | 8.46 (13.07) | 50.76 (110.40) | 12.12 (28.13) | 1.21
ICE | 42.9 | - | 79.62 (120.27) | 23.69 (28.24) | 47.73 (58.32) | 4639.11
Sparse DPE | 20.0 | 0.22 / 0.10 | 36.12 (77.87) | 29.03 (54.36) | 5.61 (11.30) | 4687.05
Sparse ICE | 20.0 | 0.20 / 0.33 | 26.01 (37.07) | 62.65 (107.25) | 10.88 (16.72) | 174.39
FS | 30.0 | - | 78.46 (157.57) | 1.62 (2.33) | 1.49 (1.31) | 35.88
Naive | 79.8 | - | 25.06 (53.66) | 45.45 (93.03) | 9.42 (18.79) | 3255.92

USAD on SMD
Method | Failures (%) | Precision / Recall | Distance | Implausibility 1 | Implausibility 2 | Diversity
DPE | 0.0 | - | 139.02 (261.44) | 258.31 (464.18) | 41.19 (79.11) | 23339.20
ICE | 0.0 | - | 31.81 (9.27) | 342.19 (708.88) | 22.64 (45.16) | 0.52
Sparse DPE | 0.0 | 0.68 / 0.07 | 115.48 (206.46) | 293.70 (679.88) | 19.84 (34.98) | 105.45
Sparse ICE | 0.0 | 0.61 / 0.28 | 216.44 (316.05) | 172.58 (475.15) | 8.35 (17.52) | 477.43
FS | 0.0 | - | 366.57 (672.44) | 18.10 (48.25) | 8.57 (20.63) | 12175.65
Naive | 73.4 | - | 49.83 (60.13) | 475.42 (879.38) | 27.45 (45.43) | 649.44

Table 3: Performance of our explainability method and the naive baseline in terms of Validity, Closeness, Plausibility and Diversity on the SMD dataset and the NCAD (first panel) and USAD (second panel) anomaly detection models. We report the average scores and standard deviations (in brackets) over the counterfactual ensemble. We recall that Implausibility 1 is the DTW distance to the median forecasting sample and Implausibility 2 is the temporal smoothness. For all metrics except Diversity, Precision and Recall, we assume that a lower value is better, and the best score is highlighted in bold.

References
[1] Alexander Alexandrov, Konstantinos Benidis, Michael Bohlke-Schneider, Valentin Flunkert, Jan Gasthaus, Tim Januschowski, Danielle C. Maddix, Syama Rangapuram, David Salinas, Jasper Schulz, Lorenzo Stella, Ali Caner Türkmen, and Yuyang Wang. "GluonTS: Probabilistic and Neural Time Series Modeling in Python". In: Journal of Machine Learning Research 21.116 (2020), pp. 1–6. URL: http://jmlr.org/papers/v21/19-820.html (cit. on p. 8).
[2] Emre Ates, Burak Aksar, Vitus J. Leung, and Ayse K. Coskun. "Counterfactual Explanations for Multivariate Time Series". In: 2021 International Conference on Applied Artificial Intelligence (ICAPAI) (2021). DOI: 10.1109/icapai49758.2021.9462056. URL: http://dx.doi.org/10.1109/ICAPAI49758.2021.9462056 (cit. on pp. 2, 3).
[3] Julien Audibert, Pietro Michiardi, Frédéric Guyon, Sébastien Marti, and Maria A. Zuluaga. "USAD: UnSupervised Anomaly Detection on Multivariate Time Series". In: KDD 2020 - The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. San Diego, USA, 2020 (cit. on pp. 2, 4, 8, 16).
[4] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. 2018. arXiv: 1803.01271 [cs.LG] (cit. on p. 2).
[5] João Bento, Pedro Saleiro, André F. Cruz, Mário A. T. Figueiredo, and Pedro Bizarro. "TimeSHAP: Explaining Recurrent Models through Sequence Perturbations". In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (2021). DOI: 10.1145/3447548.3467166. URL: http://dx.doi.org/10.1145/3447548.3467166 (cit. on p. 3).
[6] Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José M. F. Moura, and Peter Eckersley. "Explainable machine learning in deployment". In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 2020, pp. 648–657 (cit. on p. 2).
[7] Ane Blázquez-García, Angel Conde, Usue Mori, and Jose A. Lozano. "A Review on Outlier/Anomaly Detection in Time Series Data". In: ACM Comput. Surv. 54.3 (2021). ISSN: 0360-0300. DOI: 10.1145/3444690. URL: https://doi.org/10.1145/3444690 (cit. on pp. 1, 2).
[8] Andy Brown, Aaron Tuor, Brian Hutchinson, and Nicole Nichols. "Recurrent Neural Network Attention Mechanisms for Interpretable System Log Anomaly Detection". In: Proceedings of the First Workshop on Machine Learning for Computing Systems. MLCS'18. Tempe, AZ, USA: Association for Computing Machinery, 2018. ISBN: 9781450358651. DOI: 10.1145/3217871.3217872. URL: https://doi.org/10.1145/3217871.3217872 (cit. on p. 2).
[9] Mattia Carletti, Matteo Terzi, and Gian Antonio Susto. Interpretable Anomaly Detection with DIFFI: Depth-
based Isolation Forest Feature Importance. 2021. arXiv: 2007.11117 [cs.LG] (cit. on pp. 2, 3).
[10] Chris U. Carmona, François-Xavier Aubet, Valentin Flunkert, and Jan Gasthaus. Neural Contextual Anomaly Detection for Time Series. 2021. arXiv: 2107.07702 [cs.LG] (cit. on pp. 2, 4, 8, 9, 15).
[11] Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. "RETAIN: An Interpretable Predictive Model for Healthcare Using Reverse Time Attention Mechanism". In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS'16. Barcelona, Spain: Curran Associates Inc., 2016, pp. 3512–3520. ISBN: 9781510838819 (cit. on p. 3).
[12] Jonathan Crabbe and Mihaela van der Schaar. “Explaining Time Series Predictions with Dynamic Masks”. In:
ICML. 2021 (cit. on pp. 2, 3, 5–8).
[13] Susanne Dandl, Christoph Molnar, Martin Binder, and Bernd Bischl. "Multi-Objective Counterfactual Explanations". In: Lecture Notes in Computer Science (2020), pp. 448–469. ISSN: 1611-3349. DOI: 10.1007/978-3-030-58112-1_31. URL: http://dx.doi.org/10.1007/978-3-030-58112-1_31 (cit. on pp. 2, 3).
[14] Eoin Delaney, Derek Greene, and Mark T. Keane. "Instance-Based Counterfactual Explanations for Time Series Classification". In: Case-Based Reasoning Research and Development. Ed. by Antonio A. Sánchez-Ruiz and Michael W. Floyd. Cham: Springer International Publishing, 2021, pp. 32–47. ISBN: 978-3-030-86957-1 (cit. on pp. 2, 3, 6).
[15] Ruth Fong, Mandela Patrick, and Andrea Vedaldi. “Understanding Deep Networks via Extremal Perturba-
tions and Smooth Masks”. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019,
pp. 2950–2958. DOI:10.1109/ICCV.2019.00304 (cit. on p. 2).
[16] Ville Hautamaki, Pekka Nykanen, and Pasi Franti. "Time-series clustering by approximate prototypes". In: 2008 19th International Conference on Pattern Recognition. 2008, pp. 1–4. DOI: 10.1109/ICPR.2008.4761105 (cit. on p. 3).
[17] Amir-Hossein Karimi, Gilles Barthe, Borja Balle, and Isabel Valera. "Model-Agnostic Counterfactual Explanations for Consequential Decisions". In: Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. Ed. by Silvia Chiappa and Roberto Calandra. Vol. 108. Proceedings of Machine Learning Research. PMLR, 2020, pp. 895–905. URL: https://proceedings.mlr.press/v108/karimi20a.html (cit. on p. 3).
[18] Isak Karlsson, Jonathan Rebane, Panagiotis Papapetrou, and Aristides Gionis. "Locally and globally explainable time series tweaking". In: Knowledge and Information Systems 62.5 (2020), pp. 1671–1700. DOI: 10.1007/s10115-019-01389-4 (cit. on pp. 2, 3).
[19] Thibault Laugel, Marie-Jeanne Lesot, Christophe Marsala, Xavier Renard, and Marcin Detyniecki. "Comparison-based Inverse Classification for Interpretability in Machine Learning". In: 17th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2018). Ed. by Jesús Medina, Manuel Ojeda-Aciego, José Luis Verdegay, David A. Pelta, Inma P. Cabrera, Bernadette Bouchon-Meunier, and Ronald R. Yager. Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations. Cadix, Spain: Springer Verlag, June 2018, pp. 100–111. DOI: 10.1007/978-3-319-91473-2_9. URL: https://hal.sorbonne-universite.fr/hal-01905982 (cit. on p. 6).
[20] Qianli Ma, Wanqing Zhuang, Sen Li, Desen Huang, and Garrison W. Cottrell. “Adversarial Dynamic Shapelet
Networks”. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second
Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Edu-
cational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press,
2020, pp. 5069–5076. UR L:https://aaai.org/ojs/index.php/AAAI/article/view/5948 (cit. on p. 3).
[21] Pankaj Malhotra, Lovekesh Vig, Gautam M. Shroff, and Puneet Agarwal. “Long Short Term Memory Networks
for Anomaly Detection in Time Series”. In: ESANN. 2015 (cit. on p. 2).
[22] Christoph Molnar. Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. 2019
(cit. on p. 3).
[23] Ramaravind K. Mothilal, Amit Sharma, and Chenhao Tan. “Explaining machine learning classifiers through diverse counterfactual explanations”. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (2020). DOI: 10.1145/3351095.3372850. URL: http://dx.doi.org/10.1145/3351095.3372850 (cit. on pp. 2, 3, 6, 8).
[24] Qingyi Pan, Wenbo Hu, and Jun Zhu. “Series Saliency: Temporal Interpretation for Multivariate Time Series Forecasting”. In: ArXiv abs/2012.09324 (2020) (cit. on p. 2).
[25] Pau Rodriguez, Massimo Caccia, Alexandre Lacoste, Lee Zamparo, Issam Laradji, Laurent Charlin, and David Vazquez. Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations. 2021. arXiv: 2103.10226 [cs.LG] (cit. on pp. 3, 6, 12).
[26] Chris Russell. “Efficient Search for Diverse Coherent Explanations”. In: Proceedings of the Conference on Fairness, Accountability, and Transparency. FAT* ’19. Atlanta, GA, USA: Association for Computing Machinery, 2019, pp. 20–28. ISBN: 9781450361255. DOI: 10.1145/3287560.3287569. URL: https://doi.org/10.1145/3287560.3287569 (cit. on pp. 2, 3, 6).
[27] David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. “High-Dimensional Multivariate Forecasting with Low-Rank Gaussian Copula Processes”. In: NeurIPS. 2019 (cit. on p. 8).
[28] Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. “Robust anomaly detection for multivariate time series through stochastic recurrent neural network”. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019, pp. 2828–2837 (cit. on p. 7).
[29] Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. “Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network”. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD ’19. Anchorage, AK, USA: Association for Computing Machinery, 2019, pp. 2828–2837. ISBN: 9781450362016. DOI: 10.1145/3292500.3330672. URL: https://doi.org/10.1145/3292500.3330672 (cit. on p. 9).
[30] Sahil Verma, John Dickerson, and Keegan Hines. Counterfactual Explanations for Machine Learning: A Review. 2020. arXiv: 2010.10596 [cs.LG] (cit. on pp. 2, 4).
[31] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR. 2018. arXiv: 1711.00399 [cs.AI] (cit. on pp. 2, 3).
[32] Matthew Zeiler and Rob Fergus. “Visualizing and Understanding Convolutional Neural Networks”. In: vol. 8689. Nov. 2013. ISBN: 978-3-319-10589-5. DOI: 10.1007/978-3-319-10590-1_53 (cit. on p. 10).
[33] Hui Zou and Trevor Hastie. “Regularization and Variable Selection via the Elastic Net”. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 67.2 (2005), pp. 301–320. ISSN: 13697412, 14679868. URL: http://www.jstor.org/stable/3647580 (cit. on p. 6).
A Technical details and performance of the selected anomaly detection models
In this section, we provide some technical details on the two anomaly detection models selected for the evaluation of our explainability method reported in Section 5. In Table 4, we report their anomaly detection performance on the benchmark datasets, after training with the hyperparameter sets reported in their respective papers when available. Otherwise, we select the models' hyperparameters on a validation set (20% of the time series) using the best adjusted F1-score.
Neural Contextual Anomaly Detection (NCAD) [10]: This method splits time series into subwindows $(W^i)_i$ and embeds them using a temporal convolutional network (TCN). Each $W^i$ is subdivided into a context part and a suspect part (typically much smaller than the former), i.e., $W^i = [W^i_C, W^i_S]$. An embedding of the context window $W^i_C$ is also computed by the TCN, and the distance between the embedding of $W^i$, denoted $z^i$, and that of $W^i_C$, denoted $z^i_C$, is then evaluated. The algorithm finally labels $W^i_S$ as anomalous if this distance is greater than a chosen threshold, i.e., if $d(z^i, z^i_C) > \eta$, where $d(\cdot, \cdot)$ is for instance the Euclidean distance and $\eta > 0$. The intuition behind this method is that a large distance between the embeddings of a window and of its context part means that the suspect part induces a significant shift of $z^i_C$ in the embedding space. Since the embedding of the context window should reflect the normal behaviour, this deviation indicates the presence of an anomaly in $W^i_S$. For our experiments, we use the open-source implementation.5
5 https://github.com/Francois-Aubet/gluon-ts/tree/adding_ncad_to_nursery/src/gluonts/nursery/ncad
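To make the decision rule above concrete, the following is a minimal sketch in Python, assuming a generic `tcn_embed` callable standing in for the trained TCN encoder; the function and variable names are hypothetical and this is not the NCAD implementation linked above.

```python
import numpy as np

def ncad_detect(window, context_len, tcn_embed, threshold):
    """Sketch of the NCAD decision rule: flag the suspect part W_S as anomalous
    when the embedding of the full window W = [W_C, W_S] is far from the
    embedding of its context part W_C."""
    context = window[:context_len]                 # W_C: context part
    z_full = tcn_embed(window)                     # z: embedding of the full window
    z_context = tcn_embed(context)                 # z_C: embedding of the context part
    distance = np.linalg.norm(z_full - z_context)  # d(z, z_C), here Euclidean
    return distance > threshold                    # anomaly iff d(z, z_C) > eta

# Toy usage with a dummy "embedder" (mean over time), for illustration only:
# a 125-timestamp univariate window with a 115-timestamp context part.
rng = np.random.default_rng(0)
window = rng.normal(size=(125, 1))
window[-10:] += 5.0                                # inject a level shift in the suspect part
print(ncad_detect(window, 115, lambda w: w.mean(axis=0), threshold=0.3))
```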
NCAD on KPI
Method | Failures (%) | Distance | Implausibility 1 | Implausibility 2 | Implausibility 3 | Diversity
DPE | 3.9 | 5.94 (15.78) | 2.16 (4.71) | 3.21 (29.62) | 2.74 (2.57) | 1.18
ICE | 19.6 | 3.08 (1.21) | 15.31 (115.12) | 31.67 (206.70) | 2.07 (2.06) | 0.26
FS | 6.0 | 32.05 (173.76) | 0.21 (0.20) | 2.42 (1.97) | -0.56 (1.14) | 0.12
Naive | 53.4 | 11.82 (74.03) | 2.90 (4.49) | 4.57 (7.64) | 3.48 (2.51) | 0.54

USAD on KPI
Method | Failures (%) | Distance | Implausibility 1 | Implausibility 2 | Implausibility 3 | Diversity
DPE | 5.0 | 25.22 (121.60) | 9.40 (65.02) | 1.03 (8.16) | 3.13 (3.47) | 13.10
ICE | 3.5 | 6.52 (7.30) | 4.99 (63.31) | 0.50 (4.43) | 1.27 (1.98) | 0.28
FS | 6.8 | 38.56 (189.88) | 0.33 (0.29) | 0.38 (0.28) | -0.08 (1.12) | 0.26
Naive | 45.4 | 31.93 (154.62) | 2.77 (3.93) | 1.42 (6.01) | 2.81 (2.48) | 69.88
Table 5: Performance of our explainability method and the naive baseline in terms of Validity, Closeness, Plausibility
and Diversity on the KPI dataset and the NCAD (first panel) and USAD (second panel) anomaly detection models.
We report the average scores and standard deviations (in brackets) over the counterfactual ensemble. We recall that
Implausibility 1 is the DTW distance to the median forecasting sample, Implausibility 2 is the temporal smoothness,
and Implausibility 3 is the negative log-likelihood under the probabilistic forecasting output distribution. For all
metrics except Diversity, we assume that a lower value is better, and the best score is highlighted in bold.
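The three Implausibility scores recalled above can be computed, for a univariate counterfactual and a set of probabilistic forecasting samples, along the lines of the following sketch. This is a hedged approximation rather than the exact code used in our experiments: the DTW is a plain quadratic-time implementation, the smoothness proxy is taken as the mean squared first difference, and the negative log-likelihood assumes a Gaussian fitted per timestamp to the forecast samples.

```python
import numpy as np

def dtw_distance(a, b):
    """Plain O(n*m) dynamic-programming DTW between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def implausibility_1(counterfactual, forecast_samples):
    """DTW distance to the median of the forecasting samples (samples x time)."""
    return dtw_distance(counterfactual, np.median(forecast_samples, axis=0))

def implausibility_2(counterfactual):
    """Temporal smoothness, taken here as the mean squared first difference."""
    return float(np.mean(np.diff(counterfactual) ** 2))

def implausibility_3(counterfactual, forecast_samples):
    """Negative log-likelihood under a Gaussian fitted per timestamp to the samples."""
    mu = forecast_samples.mean(axis=0)
    sigma = forecast_samples.std(axis=0) + 1e-6
    nll = 0.5 * np.log(2 * np.pi * sigma ** 2) + (counterfactual - mu) ** 2 / (2 * sigma ** 2)
    return float(nll.mean())
```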
UnSupervised Anomaly Detection (USAD) [3]: This reconstruction model splits time series into subwindows that are reconstructed by an LSTM-based autoencoder. The latter contains a neural network, called the encoder, that embeds each window into a latent representation, and another neural network, called the decoder, that maps the embedding back into the original input space. The reconstruction error, i.e., the distance in the time series domain between the original input and the reconstructed output, is used as an anomaly score: a high value of this error leads to the corresponding window being labelled as anomalous. We use the open-source implementation provided by the authors6 and the hyperparameters provided in the paper for the two multivariate datasets, i.e., SMD and SMAP. For the KPI dataset, the final USAD model is trained for 80 epochs with a window size of 5, a hidden size of 10 and a downsampling rate of 0.01. For the Yahoo data, the window size is 10, the hidden size is 10 and the downsampling rate is 0.05.
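As with NCAD, the scoring rule described above can be sketched as follows, assuming a generic PyTorch `autoencoder` module; the names are hypothetical and this is an illustration of the reconstruction-error criterion, not the implementation used in our experiments.

```python
import torch

def reconstruction_anomaly_scores(windows, autoencoder, threshold):
    """Sketch of the reconstruction-error criterion: a window is labelled as
    anomalous when the distance between the input and its reconstruction by
    the autoencoder exceeds a chosen threshold.

    `windows` is a (batch, window_length, dim) tensor and `autoencoder` is any
    torch.nn.Module mapping it to a reconstruction of the same shape.
    """
    autoencoder.eval()
    with torch.no_grad():
        reconstruction = autoencoder(windows)
        # Mean squared reconstruction error per window, used as the anomaly score.
        scores = ((windows - reconstruction) ** 2).mean(dim=(1, 2))
    return scores, scores > threshold

# Toy usage with an identity "autoencoder" (for illustration only): all scores are zero.
toy = torch.randn(4, 5, 3)
print(reconstruction_anomaly_scores(toy, torch.nn.Identity(), threshold=0.1))
```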
Model | KPI | Yahoo | SMD | SMAP
NCAD | 0.789 | 0.772 | 0.806 | 0.922
USAD | 0.946 | 0.741 | 0.643 | 0.972
Table 4: F1-scores of the two anomaly detection models, i.e., NCAD and USAD, on the four benchmark datasets.
B Additional numerical results
In this section, we report quantitative evaluations of our explainability method that could not be included in the main text due to space limitations. This section notably contains the results on two benchmark datasets obtained with the procedure described in Section 5, and an additional analysis of False Positives.
B.1 Numerical evaluation on the KPI and SMAP datasets
The results on the KPI and SMAP datasets are reported in Table 5 and Table 6, respectively. Note that these results are included in the discussion in Section 5.5.
6 https://curiousily.com/posts/time-series-anomaly-detection-using-lstm-autoencoder-with-pytorch-in-python/
NCAD
Method | Failures (%) | Diversity | Distance | Implausibility 1 | Implausibility 2
DPE | 41.7 | 0.002 | 0.19 (0.40) | 0.21 (0.28) | 0.01 (0.03)
DPE sparse | 27.8 | 0.004 | 0.22 (0.42) | 0.29 (0.38) | 0.03 (0.04)
ICE | 5.6 | 0.067 | 0.26 (0.14) | 0.39 (0.37) | 0.15 (0.09)
ICE sparse | 23.6 | 0.016 | 0.15 (0.08) | 0.22 (0.21) | 0.09 (0.06)
FS | 87.6 | 0.012 | 0.56 (0.77) | 0.05 (0.04) | 0.05 (0.04)
Naive | 84.5 | 0.003 | 0.06 (0.08) | 0.09 (0.03) | 0.02 (0.03)

USAD
Method | Failures (%) | Diversity | Distance | Implausibility 1 | Implausibility 2
DPE | 0.0 | 0.02 | 0.62 (0.65) | 0.96 (0.53) | 0.06 (0.05)
DPE sparse | 0.0 | 0.02 | 0.78 (0.85) | 0.82 (0.50) | 0.05 (0.06)
ICE | 0.0 | 0.17 | 0.74 (0.75) | 0.87 (0.43) | 0.04 (0.03)
ICE sparse | 0.0 | 0.18 | 0.72 (0.76) | 0.88 (0.42) | 0.06 (0.02)
FS | 56.8 | 0.02 | 2.23 (1.14) | 0.09 (0.01) | 0.10 (0.03)
Naive | 46.9 | 0.01 | 0.14 (0.04) | 0.23 (0.02) | 0.07 (0.02)
Table 6: Performance of our explainability method and the naive baseline in terms of Validity, Closeness, Plausibility and Diversity on the SMAP dataset and the NCAD (first panel) and USAD (second panel) anomaly detection models. We report the average scores and standard deviations (in brackets) over the counterfactual ensemble. We recall that Implausibility 1 is the DTW distance to the median forecasting sample and Implausibility 2 is the temporal smoothness. For all metrics except Diversity, we assume that a lower value is better, and the best score is highlighted in bold.
B.2 Numerical evaluation on False Positives
In the practical use of anomaly detection models, explanations can also be needed when the model wrongly detects an anomaly in a time series. We recall that we call False Positives the anomalies detected by the model that are not ground-truth anomalies. We present here a numerical evaluation on the False Positives detected by NCAD in the KPI benchmark dataset. The results in Table 7 can be compared to those obtained on True Positives (i.e., the ground-truth, detected anomalies) reported in the first panel of Table 5. We observe that in this case ICE achieves a 0% failure rate (instead of almost 20%), and the naive method also fails significantly less often. Moreover, all methods seem to perform better in terms of the Distance and Implausibility metrics. This is probably due to the fact that False Positives need less perturbation to become non-anomalous for the model, e.g., if they lie close to the model's local decision boundary. They may therefore be inherently less distant from the normal behaviour than True Positives, and thus easier instances for our counterfactual explanation method. Besides, the Diversity metric is smaller for DPE and ICE, likely another effect of the smaller amount of perturbation needed.
NCAD
Method | Failures (%) | Distance | Implausibility 1 | Implausibility 2 | Implausibility 3 | Diversity
DPE | 8.8 | 2.22 (1.87) | 2.44 (2.50) | 2.36 (2.05) | 2.15 (2.22) | 0.02
ICE | 0.0 | 4.36 (3.00) | 0.28 (0.45) | 0.61 (0.34) | 0.22 (0.93) | 0.12
FS | 6.6 | 4.17 (2.91) | 0.30 (0.31) | 3.54 (3.61) | -0.22 (0.95) | 0.43
Naive | 33.3 | 2.74 (2.22) | 2.19 (2.43) | 2.97 (2.56) | 2.79 (2.29) | 0.16
Table 7: Performance of our explainability method and the naive baseline in terms of Validity, Closeness, Plausibility
and Diversity on the false positives in the KPI data detected by the NCAD model. We report the average scores and
standard deviations (in brackets) over the counterfactual ensemble. We recall that Implausibility 1 is the DTW distance
to the median forecasting sample, Implausibility 2 is the temporal smoothness, and Implausibility 3 is the negative log-
likelihood under the probabilistic forecasting output distribution. For all metrics except Diversity, we assume that a
lower value is better, and the best score is highlighted in bold.
C Complementary visualizations of the explanations
In this section, we report additional visualizations of our counterfactual explanations, as well as illustrations of the sparsity induced by the sparse variants of DPE and ICE. Figures 3 and 4 show visualizations for the univariate datasets and the NCAD and USAD models, respectively.
Figure 3: Anomalous windows and counterfactual ensemble explanations obtained with DPE (left column), ICE
(middle column) and FS (right column) on anomalies in the KPI and Yahoo datasets detected by the NCAD model.
The rows correspond to different anomalies. The windows include a context part of 115 timestamps and an abnormal
part of 10 timestamps. The original subsequence is plotted in blue, while the explanations are in red, green or purple
colors for the different variants.
The advantage of Sparse ICE over the plain ICE version is shown in Figure 5, where only four channels of the multi-dimensional time series window are plotted. For this anomaly, only one of these dimensions contains an anomalous observation, but the counterfactual explanation obtained with plain ICE perturbs all four of them. In contrast, the Sparse ICE variant keeps two of the dimensions without anomalous features unchanged, leading to a more accurate and readable explanation for this particular anomaly. Similarly, Figure 6 shows two perturbation maps corresponding to examples generated by DPE and its sparse variant. While plain DPE produces globally sparse maps (i.e., sparse across the temporal and dimensional features), Sparse DPE is sparse in the dimensions, leading to perturbed examples with few modified channels.
D Illustration of the hyperparameter selection
In this section, we illustrate the hyperparameter selection procedure for our gradient-based method. For each dataset and model, we run our algorithm with several configurations, as described in Section 5.3, and select the final one using the failure rate and the Implausibility 1 metric. More precisely, we select a threshold of acceptable failure rate (e.g., 10% or 20%); then, amongst the configurations achieving a failure rate below this threshold, we select the one with the lowest Implausibility 1 value (see the sketch below). Figures 7 to 10 show the values of these metrics for all explored configurations, for each model and dataset. Lastly, in Tables 9, 10, 11 and 12, we report the selected configurations for DPE, ICE, Sparse DPE and Sparse ICE, respectively, on the benchmark datasets. Besides, the hyperparameters of the gradient-free approach can be found in Table 8.
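The selection rule above can be summarised by the following sketch; the structure of the configuration records and the field names are hypothetical and do not reflect our actual experiment code.

```python
def select_configuration(results, max_failure_rate=0.10):
    """Keep the configurations whose failure rate on the validation set is below
    the chosen threshold, then return the one with the lowest Implausibility 1."""
    admissible = [r for r in results if r["failure_rate"] <= max_failure_rate]
    if not admissible:
        raise ValueError("No configuration satisfies the failure-rate threshold.")
    return min(admissible, key=lambda r: r["implausibility_1"])["config"]

# Toy usage with made-up numbers, for illustration only.
candidates = [
    {"config": {"learning_rate": 0.01, "lambda_T": 0.1}, "failure_rate": 0.05, "implausibility_1": 2.1},
    {"config": {"learning_rate": 0.1, "lambda_T": 1.0}, "failure_rate": 0.25, "implausibility_1": 1.4},
]
print(select_configuration(candidates))  # selects the first configuration
```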
Figure 4: Anomalous windows and counterfactual ensemble explanations obtained with DPE (left column), ICE
(middle column) and FS (right column) on anomalies in the KPI and Yahoo datasets detected by the USAD model.
The rows correspond to different anomalies. The windows include a context part of 115 timestamps and an abnormal
part of 10 timestamps. The original subsequence is plotted in blue, while the explanations are in red, green or purple
colors for the different variants.
Dataset | Model type | Number of layers | Hidden size | Training epochs | Learning rate | Prediction length
KPI | FFNN | 1 | 32 | 100 | 0.001 | 10
Yahoo | FFNN | 1 | 32 | 100 | 0.001 | 10
SMD | DeepVAR | 4 | 40 | 150 | 0.001 | 10
SWaT | DeepVAR | 4 | 40 | 150 | 0.001 | 10
Table 8: Hyperparameters of the Probabilistic Forecasting models used in the gradient-free approach on the four
benchmark datasets.
Dataset | Perturbation | σ_max | Learning rate | λ2 | λT
NCAD-KPI | Gaussian blur | 3.0 | 0.01 | 0.01 | 0.1
NCAD-Yahoo | Gaussian blur | 10.0 | 0.01 | 0.001 | 0.1
NCAD-SMD | Gaussian blur | 20.0 | 0.01 | 0.0 | 1.0
NCAD-SMAP | Gaussian blur | 10.0 | 0.01 | 1.0 | 1.0
USAD-KPI | Gaussian blur | 3.0 | 0.01 | 0.001 | 1.0
USAD-Yahoo | Gaussian blur | 10.0 | 0.01 | 0.001 | 1.0
USAD-SMD | Gaussian blur | 20.0 | 0.1 | 0.001 | 0.01
USAD-SMAP | Gaussian blur | 20.0 | 0.01 | 0.01 | 0.1
Table 9: Hyperparameters of the DPE algorithm on the four benchmark datasets.
(a) ICE (b) Sparse ICE
Figure 5: Counterfactual explanation obtained with ICE (5a) and its sparse variant (5b). The rows correspond respectively to the first, third, ninth and twelfth dimensions of a subsequence in the SMD dataset. Amongst them, only the fourth row (the twelfth dimension) contains an anomalous observation, located in the last timestamp of the displayed window and detected by the NCAD model. While ICE (5a) modifies all the plotted dimensions, Sparse ICE only perturbs the third and fourth rows (i.e., the ninth and twelfth dimensions).
(a) DPE (b) Sparse DPE
Figure 6: Perturbation maps of counterfactual examples in the explanations generated by DPE (left) and its sparse variant (right) on one anomaly in the SMD dataset detected by NCAD. We recall that the rows of each mask correspond to the different dimensions of the time series and the columns to the successive timestamps in the suspect window (see Section 4). The color bar on the right side of each map indicates the values (between 0 and 1) of the map over the time series features.
Figure 7: Implausibility measures 1 (left column) and 2 (right column) versus failure rates for different sets of hyperparameters of the ICE and DPE algorithms and their sparse variants, applied to the NCAD (first row) and USAD (second row) models on the KPI dataset. The metrics are computed over a validation set of 5 time series and the failure rate threshold is 10% (red dotted line).
Dataset | Learning rate | λ1 | λ2 | λT
NCAD-KPI | 0.1 | 0.01 | 0.01 | 1.0
NCAD-Yahoo | 0.1 | 0.01 | 0.01 | 1.0
NCAD-SMD | 0.1 | 0.01 | 0.01 | 0.1
NCAD-SMAP | 0.1 | 0.1 | 0.1 | 1.0
USAD-KPI | 0.1 | 0.001 | 0.001 | 1.0
USAD-Yahoo | 0.1 | 0.001 | 0.001 | 1.0
USAD-SMD | 1000.0 | 0.01 | 0.01 | 1.0
USAD-SMAP | 1.0 | 0.001 | 0.001 | 1.0
Table 10: Hyperparameters of the ICE algorithm on the four benchmark datasets.
Dataset | Perturbation | σ_max | Learning rate | λ1 | λ2 | λT
NCAD-SMD | Gaussian blur | 20.0 | 0.1 | 0.01 | 0.01 | 0.1
NCAD-SMAP | Gaussian blur | 10.0 | 0.01 | 0.1 | 0.1 | 0.1
USAD-SMD | Gaussian blur | 20.0 | 0.01 | 0.01 | 0.01 | 1.0
USAD-SMAP | Gaussian blur | 20.0 | 0.1 | 0.01 | 0.01 | 0.1
Table 11: Hyperparameters of the Sparse DPE algorithm on the two benchmark multivariate datasets.
Figure 8: Implausibility measures 1 (left column) and 2 (right column) versus failure rates for different sets of hyperparameters of the ICE and DPE algorithms and their sparse variants, applied to the NCAD (first row) and USAD (second row) models on the Yahoo dataset. The metrics are computed over a validation set of 15 time series and the failure rate threshold is 25% (red dotted line).
Dataset | Learning rate | λ1 | λ2 | λT
NCAD-SMD | 0.1 | 0.01 | 0.01 | 0.1
NCAD-SMAP | 0.1 | 0.1 | 0.1 | 1.0
USAD-SMD | 10000.0 | 0.01 | 0.01 | 0.1
USAD-SMAP | 1.0 | 0.001 | 0.001 | 1.0
Table 12: Hyperparameters of the Sparse ICE algorithm on the two benchmark multivariate datasets.
Variant | Perturbation | σ_max | Learning rate | λ1 | λ2 | λT | N
ICE | - | - | 0.1 | 0.01 | 0.01 | 0.01 | 100
DPE | Gaussian blur | 3.0 | 0.01 | - | 0.1 | 0.01 | 100
Table 13: Default set of hyperparameters for our gradient-based counterfactual ensemble method.
Figure 9: Implausibility measures 1 (left column) and 2 (right column) versus failure rates for different sets of hyperparameters of the ICE and DPE algorithms and their sparse variants, applied to the NCAD (first row) and USAD (second row) models on the SMAP dataset. The metrics are computed over a validation set of 40 time series and the failure rate threshold is 25% (red dotted line).
E Sensitivity of the Diversity criterion to the learning rate parameter
In this section, we report a small-scale study of the influence of the learning rate of the SGD algorithm on the Diversity metric in our gradient-based approach. We evaluate this metric on 10 anomalies detected by the NCAD model in the KPI dataset, using DPE and ICE with learning rates in the set {0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000}; the other hyperparameters of our method are the same as in Section 5.5. Figure 11 shows the evolution of the Diversity score (left panel) and of the failure rate (right panel) as the learning rate increases. We observe that the diversity is always higher for ICE than for DPE, and that it increases dramatically for ICE when the learning rate is greater than 1. However, the failure rate also increases sharply at high learning rates. A sketch of the sweep procedure is given below.
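The sweep itself amounts to the following loop; `run_counterfactual_method` is a placeholder for our algorithm's interface and is assumed to return the average Diversity score and failure rate over the selected anomalies.

```python
LEARNING_RATES = [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]

def learning_rate_sweep(anomalies, run_counterfactual_method, variant):
    """For each learning rate, run the counterfactual method (DPE or ICE) on the
    selected anomalies and record the Diversity score and the failure rate."""
    records = []
    for lr in LEARNING_RATES:
        diversity, failure_rate = run_counterfactual_method(
            anomalies, variant=variant, learning_rate=lr
        )
        records.append(
            {"learning_rate": lr, "diversity": diversity, "failure_rate": failure_rate}
        )
    return records
```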
Figure 10: Implausibility measures 1 (left column) and 2 (right column) versus failure rates for different sets of hyperparameters of the ICE and DPE algorithms and their sparse variants, applied to the NCAD (first row) and USAD (second row) models on the SMD dataset. The metrics are computed over a validation set of 6 time series and the failure rate threshold is 40% for NCAD and 20% for USAD (red dotted lines).
Figure 11: Diversity of the counterfactual ensemble (left) and failure rate of our counterfactual method (right) versus
the learning rate of the SGD algorithm for the two variants of our method, ICE and DPE.