Information Geometry (2023) 6:385–412
https://doi.org/10.1007/s41884-023-00107-y
RESEARCH PAPER
Anomaly detection in the probability simplex under
different geometries
Uriel Legaria1·Sergio Mota1·Sergio Martinez1·Alfredo Cobá2·
Argenis Chable2·Antonio Neme3
Received: 4 August 2022 / Revised: 24 April 2023 / Accepted: 13 May 2023 /
Published online: 30 May 2023
© The Author(s) 2023
Abstract
An open problem in data science is that of anomaly detection. Anomalies are instances
that do not maintain a certain property that is present in the remaining observations in a
dataset. Several anomaly detection algorithms exist, since the process itself is ill-posed, mainly because the criteria that separate common or expected vectors from anomalies are not unique. In the most extreme case, data is not labelled and the algorithm has to
identify the vectors that are anomalous, or assign a degree of anomaly to each vector.
The majority of anomaly detection algorithms do not make any assumptions about
the properties of the feature space in which observations are embedded, which may
affect the results when those spaces present certain properties. For instance, compo-
sitional data such as normalized histograms, that can be embedded in a probability
simplex, constitute a particularly relevant case. In this contribution, we address the
problem of detecting anomalies in the probability simplex, relying on concepts from
Information Geometry, mainly by focusing our efforts on the distance functions commonly applied in that context. We report the results of a series of experiments and
conclude that when a specific distance-based anomaly detection algorithm relies on
Information Geometry-related distance functions instead of the Euclidean distance,
the performance is significantly improved.
Keywords Anomaly detection ·Probability simplex ·Information geometry
Communicated by Nihat Ay.
Antonio Neme
antonio.neme@iimas.unam.mx
1Posgrado en Ciencia e Ingeniería de la Computación, Universidad Nacional Autónoma de
México, Mexico City, Mexico
2Facultad de Matemáticas, Universidad Autonoma de Yucatan, Mérida, Yucatan, Mexico
3Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas (IIMAS), Unidad Académica en
el Estado de Yucatán, Universidad Nacional Autónoma de México, Tablaje Catastral 6998 Carretera
Mérida - Tetiz, 97357 Mérida, Yucatan, Mexico
1 Introduction
The majority of processes, structures and phenomena that are of scientific or techno-
logical interest generate large amounts of data. Each observation can be an ensemble of
multiple attributes, which makes the data high-dimensional. The dataset representing
the structure or phenomena under study contains relevant information that has to be
extracted by the application of pertinent algorithms. The nature of those algorithms is
aligned with the scientific questions that are of relevance for a particular case. Efforts
to systematically study datasets to find answers to particular scientific questions are found in several disciplines [1]. Each field of science that has to deal with large datasets has contributed its own (partial) solutions and techniques, and in many cases those solutions are similar to each other, or follow the same ideas.
In a recent development of scientific ideas, the field of data science has emerged with the objective of joining existing techniques from several fields, and of offering new tools to understand data under a rational perspective [2,3]. Those new tools are firmly based on Mathematics, Computer Science, and Statistics, and elements of each field are taken in order to provide new lines of thought [4].
Data science has as one of its goals the identification of patterns or structures defined
by observations, commonly referred to as vectors or points, located in usually high-
dimensional feature or input spaces. Based on those patterns researchers may grasp
some properties of the data under analysis, and make decisions regarding the next steps
of analyses. The starting point of data science is usually an unbiased and preliminary
exploration of data. This stage is known as exploratory data analysis [5], and it aims
to identify structure in the data to pinpoint possible relevant patterns present in it.
In exploratory data analysis, algorithms usually do not assume external labels or
classes assigned to the vectors in the datasets. Indeed, one of its objectives is to actually
compute a label and assign it to each vector, based on intrinsic properties of the data.
Some of the most common label-generating tasks include the identification of clusters
[6,7], the detection of anomalies [8], the identification of manifolds [9], and the
generation of centrality-oriented metrics [10].
An anomaly is a point that does not resemble the rest of the elements in a dataset [11].
Vectors in a dataset are characterized in terms of a certain property, and then, those
elements that do not fulfill that characterization are labelled as potential anomalies
[12–15]. There are a few loose elements in this definition, which allows for the existence of a large number of anomaly detection algorithms. Since each element in a dataset can be thought of as a point in a high-dimensional space, the problem can be framed from a more geometric perspective. A geometric approach allows, for instance, the use
of certain aspects from Information Geometry in order to improve the capabilities of
certain anomaly detection algorithms.
Since in general the generating process of the data is not known, it is implicitly
approximated by one or several attributes. In particular, the concept of neighborhood
is heavily relied on, since it allows for a characterization of each vector in terms of
its surroundings [16,17]. By describing the neighborhood of a vector, for example, in
terms of density, a direct comparison between that vector and a subset of the remaining
vectors may offer an idea of how different that vector is with respect to other vectors
[18]. One of the assumptions behind the anomaly detection algorithms is that the
majority of vectors in a dataset are usual, and thus, the probability of detecting by
chance an anomaly is rather low. Thus, anomaly detection has to be attained under
rational perspectives [19].
Anomaly detection is an unsupervised learning task whose main goal is the identi-
fication of vectors that significantly differ from the rest of the observations in a dataset
[11,19]. The final objective of anomaly detection algorithms is to label, or classify,
each vector in the dataset as either usual or anomalous [12], or to grade its anomaly
level. The candidate anomalies do not satisfy a certain property derived from the entire
group of observations, whereas the usual observations comply with that property [11].
Again, it is generally assumed that the number of anomalies in a dataset is small in
comparison to the number of expected, usual, or common observations.
The identification of observations that differ from the majority of instances within a
dataset is a rather important task. Vectors that are different from the majority of elements under study are called anomalies, outliers or novelties [20]. Anomalies appear in
several contexts, and a prompt detection is always desirable. In biology, for example,
certain genes expressed in specific tissues have been detected as anomalies, which
highlight specific metabolic pathways involved in diseases [21]. In human health, for
instance, some heart diseases show anomalies in the signal produced by the heart as an early sign of their onset. An automatic detection of those early anomalies could be of enormous benefit [22]. In aviation, anomalies appear when there is a mismatch between the information measured by sensors and the actions taken by human or robotic actuators [23,24]. An early detection of such anomalies would prevent the aircraft from performing an unsafe maneuver. In all cases, the common denominator is that specialists do not know beforehand which observations are to be considered anomalies.
Traditionally, anomalies have been identified as noise in several scientific disci-
plines [25]. Although some anomalies may indeed be noise caused by a malfunction in the sensors, by human error along the analysis stages, or by an error in some other stage of the data processing pipeline, not all anomalies are the result of errors in data processing. The common action was to discard those anomalies, relabelling them
as outliers, a synonym of something undesirable [15,20]. In a more recent perspective,
anomalies are considered relevant observations that may reveal hidden or changing
aspects of the phenomena under study [19].
Of particular interest to data science is a type of data known as compositional. In
compositional data the attributes represent the relative frequency or proportion of the
components of the system [26]. The sum of all components adds up to a constant value. This value is fixed for all elements in the sample. For example, all foods can be characterized in terms of their content of fat, carbohydrates and protein (disregarding other components such as water). In all cases, the composition of these three constituents in food adds up to 100%. The property that the relative abundance or frequency of the components must add to a fixed value is referred to as the closure constraint [27]. This constraint makes compositional data peculiar in geometrical terms, which leads to the need for specific statistical tools for its analysis and interpretation. Compositional
data can be embedded in the probability simplex [28], and by taking into account its
properties, algorithms are expected to attain more reliable and interpretable results.
The field of Information Geometry offers relevant analysis methods and tools that
can be applied to compositional data [29]. Compositional data, such as normalized
histograms, may also contain anomalous observations. Anomaly detection in com-
positional data is often performed by algorithms that do not take into account the
aforementioned peculiarities of such data. However, the geometry of compositional
data can be considered in anomaly detection algorithms in order to improve the identi-
fication of true positives, that is, of anomalies. In a first step to study anomaly detection
when the constraints of compositional data are explicitly considered, we studied the
impact of distance functions in the identification of anomalies. For that, we selected a
distance-based anomaly detection algorithm.
The vast majority of anomaly detection algorithms assume no peculiarities about
the distribution of data in the high-dimensional attribute space. This is of course of
great benefit since it allows an extended application of those algorithms in any con-
text. However, the constraints of compositional data may be exploited to improve the
identification of anomalies in a more efficient way. In this contribution, we report a study of the impact of Information Geometry-related aspects on anomaly detection in compositional data. From Information Geometry, we applied several distance functions, and, as we will show in the next sections, when anomaly detection algorithms rely on some of those functions, the results improve. In Information Geometry, originally developed to understand the links between statistical manifolds and geometry, the concept of distance functions and their inherent geometries is of particular relevance [30].
The core idea developed in here can be described as follows. Given a set of points
in the probability simplex, can a subset of them be identified as anomalies? Several anomaly detection algorithms are based, directly or indirectly, on the concept of distance. A distance function compares objects [31]. Based on that comparison, some anomaly detection algorithms compute an anomaly index for each vector in the dataset. In general, vectors are ranked according to that index or, following a binary classification, labelled as either anomalies or as common or usual vectors. The hypothesis we aim to prove in this contribution is that anomalies in the probability simplex can be more accurately detected by relying on distance functions commonly applied in Information Geometry, rather than the usual Euclidean or L1 distances.
Several families of anomaly detection algorithms have been created in more than
two decades of active research. Of particular interest are those algorithms that rely on
the characterization of vectors in terms of their nearest neighbors. This is of relevance
since the relative size of the neighborhood is determined by the applied distance
function. Since we are interested in characterizing the effects of distance functions in
anomaly detection, it is a natural choice to focus on this family of algorithms. Local
Outlier Factor (LOF) [18] is one of the best-known anomaly detection algorithms.
LOF takes into account the vicinity of each vector in order to compute an anomaly
index. Here, a vector v is characterized in terms of its k nearest neighbors. Each of those k neighbors is in turn characterized in terms of its own k nearest neighbors. Once the characterizations are concluded, the descriptions obtained from v are compared to those obtained from its k neighbors. In the original version of the algorithm, the chosen distance function is the Euclidean distance. We implemented de novo a version of LOF in which the distance function can be selected by the user. More details of LOF will be offered in Sect. 3.
In order to define our aim in more concise words, suppose there is a group of
normalized histograms, each one representing, for example, an approximation of a
certain probability function. If all histograms in that group come from a unique distribution described by the same parameters, then there are no anomalies within that
group. On the other hand, if a minority of elements in that group come from a different
distribution, the elements in that reduced subset should be identified as anomalies. The
path to conclude that in the former case there are no anomalies, whereas in the latter
case there are anomalies, is the problem we tackle here. Once again, the hypothesis
we aim to prove is that anomalies in the probability simplex can be more accurately
detected by relying on distance functions commonly applied in Information Geometry.
The vast majority of distance-based anomaly detection algorithms assume, directly
or indirectly, that the feature space can be described by Euclidean geometry. Derived
from that geometry, the Euclidean distance is the most commonly applied one. Here, we ask whether other geometries may be more suitable to detect anomalies in compositional data. We explore in this contribution the impact of the applied distance on a well-known anomaly detection algorithm, LOF, for the case of points in the probability simplex. We are most interested in exploring the capabilities of LOF on compositional data. In particular, since the algorithm is based on detecting changes as a function of
distance, we systematically explored the effects caused by different distance functions
in the detection rates of anomalies.
In the present contribution, we present a direct application of Information Geometry
to data science. We base our approach on some of the most prominent results found in the field of Information Geometry. In this contribution, we apply some of the outstanding results from the field [29] in an area in which, to our knowledge, they have not been applied explicitly. We are interested in verifying whether the distances that operate under the Riemannian and other geometries are a better choice to work with when detecting anomalies within compositional data.
The rest of the contribution continues as follows. In the next section we briefly
describe some of the relevant aspects of compositional data. In Sect. 3 we describe the anomaly detection algorithm to be applied along this contribution. In Sect. 4 the geometries and distances that are to be tested with the anomaly detection algorithm
are described. We present the main results of applying anomaly detection algorithms
in the probability simplex in Sect. 5, and we offer some conclusions and discuss the
main aspects of our approach in Sect. 6.
2 Compositional data and the probability simplex
Compositional data has attracted attention in research since the seminal works by
Aitchison [27,32]. In compositional data analysis, points with N components are constrained to a region of the N-dimensional space, referred to as the probability simplex. Figure 1 shows an example of the simplex for N = 3 components, displaying in pink a subset of possible foods in terms of their percentage of protein, carbohydrates and fat. Three specific groups of food are shown as an example:
Red meat, fish and seafood, and vegetables. The visual capabilities of the probability
simplex, putting aside the formal aspects of it, are clear in this type of visualizations.
Fig. 1 A depiction of compositional data and its representation in the probability simplex. All instances of compositional data can be represented as points in the probability simplex Δ3. In the example, points in pink represent possible foods in terms of their percentage of protein, carbohydrates and fat. Red meat foods are represented by red squares, the blue circles indicate vegetables and green triangles show some types of fish and seafood
For instance, it is possible to appreciate how these three types of food scatter over the
simplex.
In a recent perspective, the peculiarities of the probability simplex have been
considered when computing clusters in compositional data [33]. There, authors sys-
tematically study the effects of different distances involved in the clustering of
normalized histograms. Histograms can be useful as a visual representation of data,
where the range of a data sample is divided into a discrete number of non-overlapping
intervals called bins. A frequency is assigned for each bin, by counting the number of
points in the dataset that lie within the corresponding interval. Such frequencies may be visualized as a bar plot, where the height of each bar indicates the frequency of a bin.
The application of Information Geometry to compositional data has been approached in [34]. There, the authors make use of the Bregman divergence to normalize data and ultimately propose a generalization of PCA able to deal with compositional data.
When working with histograms, it is common practice to normalize the frequencies.
In this way, the sum of all frequencies will be equal to 1, and the compositional or
closure property is achieved. That is, the sum of all frequencies f_i adds up to 1.0 (see Eq. 1).

$$\sum_{i=1}^{N} f_i = 1 \qquad (1)$$
By convention, when referring to histograms, the assumption in this work is that frequencies are normalized. As a result of Eq. (1), the representation of a sample as a histogram provides a method for viewing data as compositional, which is the type of data that the methods investigated in this study focus on. Under this perspective, each histogram can be thought of as a point in an N-dimensional space. Histograms are useful because, according to the frequentist view of probability, as the sample size grows large f_i converges to the probability that a point sampled from the population
lies in the i-th bin or interval; that is, a histogram is an approximation of the distribution of the data [35].
The probability simplex is a geometric object that represents all possible probability
distributions over a finite set of outcomes. For a set of N outcomes, the probability simplex is an (N−1)-dimensional convex polytope in N-dimensional Euclidean space. The probability simplex is defined as the set of all non-negative vectors x = (x_1, x_2, ..., x_N) whose components sum to one. Thus, the probability simplex is a subset of the N-dimensional Euclidean space, and it is bounded by the hyperplanes defined by x_i = 0 for i = 1, 2, ..., N.
The geometry of the probability simplex is characterized by its shape and structure.
The simplex has N vertices, where N is also the number of bins in a histogram. Each vertex V_i corresponds to a probability distribution that assigns 1 to a single outcome and 0 to all the others. That is, there is a bin in the histogram with probability 1, and
all the remainder bins have a probability of 0. The edges of the probability simplex
connect the vertices and represent the possible mixtures of these pure distributions.
The interior of the simplex represents all other possible probability distributions over
the set of outcomes.
The probability simplex, denoted as Δ_d, is a convex polytope, which means that
any two points on the simplex can be connected by a line segment that lies entirely
within the simplex [27]. This property has important implications for optimization and
statistical inference, as many optimization problems and statistical models involve
finding the point on the probability simplex that maximizes or minimizes a certain
criterion [36].
3 Anomaly detection algorithms: the case of local outlier factor
An anomaly is an observation, or a small subset of observations, which is different from the remainder of the set to which it belongs. The identification or detection of those particular instances or peculiar observations is studied with the tools and perspectives known, collectively, as anomaly detection. In data science, a given dataset, which gathers observations from the process, structure or phenomenon of interest, is to be dissected in particular ways. That dissection includes the identification of anomalies.
There are various approaches followed by anomaly detection algorithms, such
as statistical methods [15], machine learning algorithms [20], and pattern recogni-
tion techniques [37]. Some common approaches include clustering-based methods,
density-based methods, distance-based methods, and machine learning-based meth-
ods. The choice of method depends on the specific application and the characteristics
of the dataset under analysis [11,19,38].
In particular, anomaly detection aims to answer whether there are peculiar vectors
or observations within a collection of instances. Given a dataset D, it is of interest to identify two mutually exclusive subsets U and A. The elements in U constitute the common, normal, or expected observations, whereas the elements in A are known as
anomalies, outliers, or novelties, or any other synonym. The development of algorithms
that allow the identification of A and U is an open field of research, since many assumptions may vary in specific datasets.
An anomaly detection task comprises two stages. In the first stage, the goal is
to identify a certain description that is common to the majority of the data. Then,
in the subsequent stage, all vectors or observations are compared to the description
obtained in the first stage [38]. Depending on the nature of the existing data, there are
two instances of anomaly detection. The first one is closely related to the problem of
classification under unbalanced classes. In this scenario, each observation is labelled
as either common, normal or expected, or, on the other hand, labelled as an anomaly
or an outlier. In other words, each observation is known to be in either set U or A. The former class is in general much more abundant than the latter, and thus, there is an imbalance between the classes. The classification algorithm computes a specific description
that is common to observations in the normal class (U), and at the same time, that
description is not present in the vectors of the second or anomalous class (A). Once the
algorithm has been trained, it can be tested over the same data, or over new instances.
The description of a vector is inspected by the function inferred by the algorithm in
the training stage, and a decision of whether it belongs to U or A is taken.
In the second scenario for anomaly detection, it is not known what vectors, if
any, are anomalies. In technical terms, there is no external label or class (U or A) assigned to the elements in D, as opposed to what is found in the first scenario [39]. In
such conditions of lack of external label, the algorithm has to infer a certain property
that is common in the majority of the observations, and based on it, decide if those
observations that do not fit into that property, are indeed anomalies [40]. This scenario
of unlabelled data is of high relevance, since in many applications of data science,
it is not known beforehand what instances constitute anomalies and which ones are
common observations [41,42]. In this scenario, referred to as unsupervised anomaly
detection, the parameters and rules obtained by the algorithm assign a degree of anomaly to each element v ∈ D. The anomaly level of v, a(v), is one of the parameters from which the algorithm takes the decision to assign v to U or to A. An additional parameter to reach a decision may be the expected anomaly level of all elements in D. This allows for a global comparison of v to the rest of the elements in D [39]. Another possibility to reach a decision concerning the class assigned to v is the comparison of a(v) with the corresponding characterization of a certain subset of D. For instance, when v is compared to its k-nearest neighbors, the approach is focused on the identification of local anomalies [38].
We are interested in the unsupervised scenario for anomaly detection. Again, what
characterizes this scenario is that vectors are not labelled and thus, the algorithm has
to infer the most likely class of the vectors, or assign an anomaly degree to them,
based on undisclosed properties of the data [43,44]. Since the properties of the data
that are to be taken into consideration for telling apart anomalies from common vectors
are not unique, several alternatives exist. Some vectors can be identified as anomalies
under certain assumptions, and not under a different set of premises.
Several families of anomaly detection algorithms for the unlabelled case have been
created in more than two decades of active research [21,45–52]. In particular, those
focused on the analysis of nearest neighbors are of particular relevance, since the
relative size of the neighborhood is affected by the properties of the selected distance
function [38]. This makes this approach relevant to explore different geometries and
distances. LOF is one of the best-known anomaly detection algorithms that take into
account the surroundings of each vector in order to compute an anomaly index. Here, a vector v is characterized in terms of its k nearest neighbors. Each of those k neighbors is in turn characterized in terms of its own k nearest neighbors. Once the characterizations are concluded, the descriptions obtained from v are compared to those obtained from its k neighbors.
In more technical terms, a vector v is described by means of its k-distance. Let k-distance(v) be the distance from v to its k-th nearest neighbor. The set of neighbors within reach of v based on k-distance(v) is denoted as N_k(v), and it is also referred to as the context of v. The reachability distance from v to a second vector w is given by:

$$\text{reachability-distance}_k(v, w) = \max\{k\text{-distance}(w),\, d(v, w)\} \qquad (2)$$

where d is assumed to be the Euclidean distance. The reachability distance is the maximum between the actual distance from v to w and the k-distance of vector w. Note that for the k-distance, it is the context (neighborhood) of w that is considered. It may be the case that k-distance(w) > d(v, w).
As stated above, the distance d is assumed to be Euclidean. However, it is in this
parameter that we invoke Information Geometry. By considering alternative geome-
tries and distances, we aim to understand their effect on the performance of anomaly
detection algorithms based on distances.
Continuing with our description of LOF, all k-neighbors of w will be characterized by the same reachability distance. It should be noted that the reachability distance may
be greater than the actual distance. The benefit of this substitution is that it offers more
stability for certain distributions.
From the reachability distance, vector v is further described by its local reachability density, defined as:

$$\mathrm{lrd}_k(v) = 1 \Big/ \frac{\sum_{w \in N_k(v)} \text{reachability-distance}_k(v, w)}{|N_k(v)|} \qquad (3)$$
lrd_k(v) is a measure of the density of points around v: it is the inverse of the expected reachability distance over all the elements in N_k(v), that is, its k-neighbors. From this quantity, the local outlier factor of vector v, denoted as lof_k(v), is computed:

$$\mathrm{lof}_k(v) = \frac{\sum_{w \in N_k(v)} \mathrm{lrd}_k(w)}{|N_k(v)| \times \mathrm{lrd}_k(v)} \qquad (4)$$
Algorithm 1 displays the LOF algorithm. It takes as input the dataset to be inspected, and the parameter k, which indicates the number of nearest neighbors to be considered. In our contribution, it also takes the distance d to be applied. The output of the algorithm is the local outlier factor for each vector x in the dataset, denoted as lof(x). For simplicity, the parameter k is not displayed when the context allows it. Note that LOF(X, k, d) refers to the Local Outlier Factor algorithm, with parameters X, k, and d (with k sometimes not shown), whereas lof(x) refers to the actual value assigned to vector x by the algorithm.
Algorithm 1 Local Outlier Factor (LOF)
Require: Dataset X with N observations. The parameter k ≤ N defines the k-nearest neighbors. d is the distance function.
Ensure: Outlier scores for all instances in X
1: for each instance x in X do
2:   Find the k-nearest neighbors of x based on d.
3:   Compute the reachability distance of x to each of its k-nearest neighbors.
4:   Compute the local reachability density (lrd) of x as the inverse of the average reachability distance to its k-nearest neighbors.
5: end for
6: for each instance x in X do
7:   Compute the local outlier factor (LOF) of x as the average lrd of its k-nearest neighbors divided by its own lrd.
8: end for
9: Output the LOF scores for all instances x ∈ X, lof(x).
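To make the role of the distance parameter d concrete, the following is a minimal, brute-force sketch of Algorithm 1 with a user-selected distance function (the helper names and the quadratic-time neighbor search are our own illustrative choices, not the authors' de novo implementation):

```python
import numpy as np

def lof_scores(X, k, dist):
    """Minimal LOF sketch: X is an (n, d) array, dist(a, b) is a distance function."""
    n = len(X)
    # Pairwise distances (brute force).
    D = np.array([[dist(X[i], X[j]) for j in range(n)] for i in range(n)])
    # Indices of the k nearest neighbors of each point (excluding the point itself).
    nbrs = [np.argsort(D[i])[1:k + 1] for i in range(n)]
    # k-distance: distance to the k-th nearest neighbor.
    k_dist = np.array([D[i][nbrs[i][-1]] for i in range(n)])
    # Local reachability density, Eq. (3).
    lrd = np.empty(n)
    for i in range(n):
        reach = np.maximum(k_dist[nbrs[i]], D[i][nbrs[i]])  # Eq. (2)
        lrd[i] = 1.0 / reach.mean()
    # Local outlier factor, Eq. (4).
    return np.array([lrd[nbrs[i]].mean() / lrd[i] for i in range(n)])

# Any of the distances discussed in Sect. 4 can be passed as `dist`;
# the Euclidean distance reproduces the original LOF behaviour.
euclidean = lambda a, b: np.linalg.norm(a - b)
```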
When lof_k(v) > 1, the local density of v is lower than that of its neighbors N_k(v). On the other hand, if lof_k(v) < 1, it means that vector v presents a higher density than the expected density of its neighbors. The former case defines v as an outlier, whereas the latter defines it as an inlier. In this contribution we will refer to both cases as anomalies. The more distant from 1, the higher the anomaly level. The control parameter k allows for an increase of the neighborhood size; in the extreme case, when k equals the number of elements in the dataset, it leads to a global comparison. There is not, however, a formal criterion to identify the correct value of k. As in any other anomaly detection algorithm, if the criterion, in this case defined by the neighborhood size, changes, the outcome can also change. This leads to instabilities, but that is a problem not tackled in this contribution.
Figure 2 shows the k-neighborhood of a vector, based on the k-distance, for k = 3. A vector v is characterized in terms of its context or neighborhood. That characterization is compared to the context of the neighbors of v. The result of the comparison is a measure of the similitude of v with its neighbors.
4 Anomaly detection in the probability simplex
The probability simplex Δ_d is a (d−1)-dimensional object embedded in a d-dimensional space. Thus, existing algorithms can be applied to the points in Δ_d to identify those that constitute anomalies. However, since the geometry in Δ_d is sensitive to the selection of the distance function, a relevant question is: what is the most adequate geometry to consider, and, derived from it, what is the most suitable distance function to be applied in order to identify anomalies in the probability simplex? In this contribution we explore the effect of the distance function in the LOF algorithm when applied to compositional data, that is, to points in the probability simplex Δ_d.
In Fig. 4, upper panel, the blue points are expected or normal vectors, since they are part of the annulus. The red points are anomalies, precisely because they do not fall into the definition of an annulus. However, since the defining procedure of usual or expected data is not to be assumed known, an alternative description is to be computed so that
Fig. 2 The k-neighborhood of a vector. LOF identifies the k-nearest neighbors for each vector based on the k-distance. Vector v has as its neighbors vectors w1, w2, and w3. The expected distance from v to its neighbors serves as the basis for the characterization of v. The neighbors of v have to be characterized in terms of their own neighbors. The characterization of v and those in its context (neighborhood) are to be compared to compute the local outlier factor of v
the red points are shown as anomalies. LOF does this by finding a description of
the neighborhood of each vector and comparing it to the corresponding one from
the neighbors of that vector. LOF relies on the characterization of neighborhoods,
which are defined by distances. Since we are interested in the way LOF is affected by
different distance functions, we introduce in the next paragraphs those distances that
were considered in the algorithm.
We focus our efforts on several relevant distances, both metric and non-metric, and their associated geometries. The first one is Information Geometry, represented by the Jensen-Shannon distance. The second approach of interest is Riemannian geometry, represented by the Fisher-Hotelling-Rao metric distance. A third geometry of interest is the Hilbert projective geometry, achieved by the Hilbert metric distance. The fourth geometry is the norm geometry, represented by the L1 metric distance. Besides these distances and divergences, we explored the effect of the Hellinger, Aitchison, and Wasserstein distances in the anomaly detection algorithm.
The Aitchison distance is defined as the Euclidean distance once all points are log-ratio transformed and centered. Formally, the Aitchison distance between points A and B is [27,53]:

$$D_{\mathrm{Aitch}}(A, B) = D_{\mathrm{Eucl}}(\mathrm{clr}(A), \mathrm{clr}(B)) \qquad (5)$$

where D_Eucl is the Euclidean distance and the clr operation is defined as:

$$\mathrm{clr}(A_0, A_1, \ldots, A_{N-1}) = \left(\log(A_0/g(A)), \log(A_1/g(A)), \ldots, \log(A_{N-1}/g(A))\right) \qquad (6)$$

This last operation is the inverse of the softmax function, commonly applied in machine learning. A_i is the i-th component of the compositional point A, and g(A) = (∏_i A_i)^{1/N} is the geometric mean of A.
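A minimal NumPy sketch of Eqs. (5)–(6) (helper names are ours; it assumes strictly positive components, since the clr transform is undefined when a component is zero):

```python
import numpy as np

def clr(a):
    """Centered log-ratio transform of a strictly positive composition (Eq. 6)."""
    g = np.exp(np.mean(np.log(a)))  # geometric mean of the components
    return np.log(a / g)

def aitchison_distance(a, b):
    """Aitchison distance: Euclidean distance between clr-transformed points (Eq. 5)."""
    return np.linalg.norm(clr(a) - clr(b))
```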
The Wasserstein distance, also known as the Earth Mover’s Distance, compares
pairs of distributions in terms of the effort of transforming one into the other [54,55].
At a high level of abstraction, the Wasserstein distance measures the minimum amount
of work required to transform one probability distribution into another. This quantity
can be thought of as the amount of soil that needs to be moved from one distribution
to another to transform it. Since the effort is affected by the size of the column of
soil and by how far it requires to be transported (from the first bin to the last one, for
example), it is a natural way to compare histograms, and thus, compositional data.
In more technical words, given two probability distributions A and B, the Wasserstein distance between them is defined as the minimum amount of work required to transform A into B, where work is defined as the product of the distance between each point in A and its corresponding point in B, weighted by the amount of probability mass being moved [56].
The Wasserstein distance for the continuous case is expressed as:

$$W_p(A, B) = \left( \inf_{\gamma \in \Gamma(A, B)} \int_{X \times Y} d(x, y)^p \, d\gamma(x, y) \right)^{1/p} \qquad (7)$$

where p is a parameter that determines the order of the distance (usually p = 1 or p = 2), d(x, y) is the distance between points x and y in the probability distributions, and γ is a transport plan in the set Γ(A, B) of couplings of A and B, which specifies how much mass is moved from each point in A to each point in B.
The 1-Wasserstein distance provides a metric for the comparison of probability
distributions. It is computed as the minimal cost of transport expended in transforming
one distribution into a second one [56]. This transformation can be computed by means
of optimal transport.
The optimal transport problem can be stated in the following manner. Let A and B be two given points in the probability simplex. In the probability simplex, every point can be represented by a histogram. Let x be the histogram associated to A, with bins indexed by i. Let y be the histogram associated to B, with bins indexed by j. If f_ij is the amount being transported from bin i to bin j, we want to find the values for the flows that minimize the cost shown in Eq. (8).

$$\sum_{i} \sum_{j} f_{ij} \, d_{ij} \qquad (8)$$

where d_ij is the distance between the values of the random variables for bins i and j in their respective histograms.
The optimal coefficients f_ij, subject to mass conservation constraints, are computed in practice using the network simplex algorithm. Once they are found, the 1-Wasserstein distance is obtained by Eq. (9).

$$W(x, y) = \frac{\sum_{ij} f_{ij} \, d_{ij}}{\sum_{ij} f_{ij}} \qquad (9)$$
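For one-dimensional normalized histograms, the 1-Wasserstein distance of Eq. (9) can be computed without setting up the full network simplex problem; a sketch using SciPy, assuming the bins sit at positions 0, 1, ..., N−1 (an assumption of ours, not stated in the text):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_hist(p, q):
    """1-Wasserstein distance between two normalized histograms p and q.

    Bins are assumed to sit at positions 0, 1, ..., N-1; pass the actual
    bin centers instead of np.arange(len(p)) if they differ."""
    positions = np.arange(len(p))
    return wasserstein_distance(positions, positions, u_weights=p, v_weights=q)
```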
The Hellinger distance, which is closely related to the Bhattacharyya coefficient, is a measure of distance between two probability distributions. It belongs to the family of f-divergences. f-divergences are statistical divergences or information divergences, and constitute a family of mathematical functions that measure the difference between two probability distributions. They were introduced by Csiszár in the 1960s and have since been widely used in information theory, statistics, and machine learning [57].
The general form of an f-divergence between two probability distributions A and B is:

$$D_f(A \| B) = \int f\!\left(\frac{dA}{dB}(x)\right) dB(x) \qquad (10)$$
The Hellinger distance, for discrete distributions, is defined as

$$D_{\mathrm{Hellinger}}(A, B) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=0}^{d} \left(\sqrt{A_i} - \sqrt{B_i}\right)^2} \qquad (11)$$
The Hellinger distance has been widely applied in data science. For example, it has
been used in the comparison of species distribution maps [58]. One of its advantages in
certain contexts is that D_Hellinger emphasizes the differences in individual attributes. This is a rather relevant aspect for compositional data.
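A direct NumPy sketch of Eq. (11) (helper name ours):

```python
import numpy as np

def hellinger(a, b):
    """Hellinger distance between two normalized histograms (Eq. 11)."""
    return np.linalg.norm(np.sqrt(a) - np.sqrt(b)) / np.sqrt(2)
```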
The Jensen–Shannon distance is a measure of the similarity between two probability distributions. It can be seen as a symmetrized version of the Kullback–Leibler divergence. The Jensen–Shannon distance is calculated as the square root of the Jensen–Shannon divergence, which is in turn defined as the average of the Kullback–Leibler divergences between the two distributions and their average distribution. The Jensen–Shannon distance is given by the equation:

$$D_{\mathrm{JS}}(A, B) = \left( \frac{1}{2} \mathrm{KL}(A \| C) + \frac{1}{2} \mathrm{KL}(B \| C) \right)^{\frac{1}{2}} \qquad (12)$$

where C = ½(A + B) and KL is the Kullback–Leibler divergence.
In simpler terms, the Jensen–Shannon distance measures the distance between two
probability distributions by comparing how much they differ from their average. It
has a value between 0 (indicating that the distributions are identical) and 1 (indicating
that the distributions are completely dissimilar) [59].
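SciPy provides this distance directly; a short sketch equivalent to Eq. (12), using base-2 logarithms so that the value lies in [0, 1] (the example vectors are illustrative):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

a = np.array([0.5, 0.3, 0.2])
b = np.array([0.1, 0.4, 0.5])
d_js = jensenshannon(a, b, base=2)  # square root of the JS divergence, in [0, 1]
```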
The Hilbert distance is a useful distance measure for high-dimensional data, where
traditional Euclidean distance measures can become less informative due to the curse
of dimensionality. The Hilbert distance is defined as [33]:

$$D_{\mathrm{Hilbert}}(A, B) = \log \frac{\max_{i \in 1..d} \frac{A_i}{B_i}}{\min_{j \in 1..d} \frac{A_j}{B_j}} \qquad (13)$$
This distance is defined over convex domains, which makes it convenient to work with in the probability simplex. The Hilbert distance is a relatively fast and efficient
way to compute distances, especially compared to other distance measures that have
a higher computational complexity, such as the Euclidean distance [60].
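A sketch of Eq. (13), assuming strictly positive components (the ratios are undefined otherwise; helper name ours):

```python
import numpy as np

def hilbert_distance(a, b):
    """Hilbert projective distance between strictly positive simplex points (Eq. 13)."""
    ratios = a / b
    return np.log(ratios.max() / ratios.min())
```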
The Fisher–Hotelling–Rao is a statistical distance used to compare two multivariate distributions. It is defined as [33]:

$$D_{\mathrm{FHR}}(A, B) = 2 \times \arccos \left( \sum_{i=0}^{d} \sqrt{A_i B_i} \right) \qquad (14)$$

The FHR distance approximates the manifold in which points seem to be embedded [61]. This metric distance is an instance of a Riemannian metric in the space of probability distributions. In the probability simplex, it allows for a comparison between points, since it provides a geodesic distance.
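A sketch of Eq. (14); the argument of arccos is clipped to [−1, 1] as a safeguard against floating-point round-off (the safeguard is our own addition):

```python
import numpy as np

def fhr_distance(a, b):
    """Fisher-Hotelling-Rao geodesic distance on the simplex (Eq. 14)."""
    bc = np.clip(np.sum(np.sqrt(a * b)), -1.0, 1.0)  # guard against round-off
    return 2.0 * np.arccos(bc)
```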
The effect of the distance function is shown in the probability simplex Δ2 in Fig. 3. Several concentric annuli were created so that the number of points on each of them is constant. Although the definition of an annulus does not vary, the effect of the distance function is clear, since each distance function yields annuli of rather distinct shapes. Since several anomaly detection algorithms rely on the concept of distance, it is only natural to wonder what is the most adequate choice of distance function, if any.
5 Results
Anomaly detection algorithms identify a certain attribute that is present in the majority of the vectors in a dataset while being absent in a small proportion of the observations. Several anomaly detection algorithms rely, as discriminant attributes, on distance-related aspects, such as density and neighborhood. In order to examine the effect of the distance function on the capabilities of LOF (see Sect. 3) to detect anomalies in compositional data, we conducted three sets of experiments.
In the first group of experiments, points describing an annulus in the n-dimensional simplex were created. Then, a few additional points not fulfilling the criterion of being part of the annulus were added. The latter are to be identified as anomalies (see Fig. 4, top). In the second set of experiments, several normalized histograms derived from a fixed probability distribution were created and embedded in the probability simplex. Then, a few histograms from a different probability distribution were added to the dataset (see Fig. 4, bottom). Again, the latter points are to be identified as anomalies. This line of testing follows the ideas of Hawkins in his relevant book [62]. A third group of experiments comes from the codon usage problem. The core
idea is that the DNA of organisms can be represented as a histogram of the relative
Fig. 3 Annuli under different distances and divergences. The shape of the obtained annulus is affected by the applied distance
Fig. 4 A sketch of the test datasets. Top: in the first group, several annuli were created within the probability simplex (blue points). A few points were added to the probability simplex without fulfilling the pattern criterion (red). The latter are to be identified as anomalies. Bottom: several histograms were generated from a fixed probability function. Each histogram consists of n bins and is embedded into the probability simplex. A few histograms obtained from a different probability function are included to function as anomalies. In both groups, the usual or regular class is from 5 to 10 times more abundant than the anomaly class
frequency of use of each of the 64 possible triplets or codons. Organisms from the same
family, say, primates, define the base or usual set. A few organisms from a different
category, viruses, for instance, are to be considered anomalies. From the three sets of
experiments, we are interested in quantifying the impact of the assumed geometry of
the data in order to detect anomalies. Since we know beforehand the label of each vector, we are in a position to evaluate the impact of the geometry, relying on a fixed anomaly detection algorithm.
In all three sets of experiments, LOF was tested under the distance functions
described in the previous section, but to remind the reader, the considered distances are:
Euclidean, L1, Wasserstein, Cosine, Aitchison, Hellinger, Jensen–Shannon, Fisher–
Hotelling–Rao (FHR), and Hilbert. From the lof score, a further step is needed in order
to decide whether a point is an anomaly or not. Since in our controlled experiments vectors are either anomalies or expected (usual) vectors, the decision is based on the expected value of the lof score. Let U be the set of all common or usual vectors, and A be the set of all anomalies. Let E(U) be the expected lof score of vectors in U, and E(A) be the expected lof score for vectors in A. Both are computed in the traditional manner. Now, in order to compute the true positive TP, true negative TN, false positive FP and false negative FN counts, we need to compare the lof score of vector v, namely lof(v), with both E(A) and E(U).
Let d(v, A) be the absolute difference between the lof score of vector v and the expected value for anomalous vectors (A): d(v, A) = |E(A) − lof(v)|. Correspondingly, let d(v, U) be the absolute difference between the lof score of vector v and the expected value for usual or common vectors (U): d(v, U) = |E(U) − lof(v)|. The smallest difference gives the estimated class of vector v. If v ∈ A, we should expect that d(v, A) < d(v, U). If this is the case, then the number of TP is increased; otherwise, FN is incremented. If, on the other hand, v ∈ U, it is expected that d(v, U) < d(v, A). If this inequality is satisfied, TN is incremented; otherwise, FP is increased. From these counts, the performance metrics are computed as follows. Precision is given by TP/(TP+FP), recall is given by TP/(TP+FN), and accuracy is given by (TP+TN)/(TP+TN+FP+FN).
Algorithm 2 displays the steps to evaluate the capabilities of LOF to detect anomalies under different distance functions.
The first group of experiments was conducted as follows. Points in Δ_d were generated with similar statistical properties, such as density, and a few points that do not fulfill that property were added. The common or expected data were obtained from an annulus. A simple case is shown in Fig. 4, top, where the majority of the points define an annulus, and a few additional points are included, located in different regions of the simplex. The points in the annulus have the common property of being located within a certain distance range to a fixed point, namely, the center. The additional points are to be considered anomalies, since their distance to the center falls outside that range. We created different geometric structures in several dimensions, as explained below.
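As an illustration of how such a dataset could be generated (the exact construction used by the authors is not specified here; this sketch samples Dirichlet-distributed points and keeps as usual those whose distance to the barycenter lies in a given band, with anomalies taken outside the band):

```python
import numpy as np

def annulus_in_simplex(d, n_usual, n_anom, r_min, r_max, rng=None):
    """Hypothetical generator: usual points in a distance band around the
    barycenter of the simplex, anomalies outside that band."""
    rng = rng or np.random.default_rng()
    center = np.full(d, 1.0 / d)
    usual, anomalies = [], []
    while len(usual) < n_usual or len(anomalies) < n_anom:
        p = rng.dirichlet(np.ones(d))
        r = np.linalg.norm(p - center)
        if r_min < r < r_max and len(usual) < n_usual:
            usual.append(p)
        elif not (r_min < r < r_max) and len(anomalies) < n_anom:
            anomalies.append(p)
    return np.array(usual), np.array(anomalies)
```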
Figure 5 shows the results for the first set of experiments: the expected precision, recall and accuracy for points defining an annulus in the probability simplex Δ_d, with d = 3, 5, 10, 20. The number of points in the annulus was 100 × d, that is, #(U) = 100 × d, and the number of anomalies was #(A) = d for each case. In all cases, the parameter k for LOF was fixed to k = #(U) + 2. This choice was made for two reasons. The first one was to reduce the search space; the second, and most important one, was to focus our efforts on the effect of the assumed geometry in the detection of anomalies, and not on identifying the best choice of parameters. The latter would be an extension of this contribution, but it is not relevant at this point.
What is relevant, for this first set, is how the metrics are affected by the distance. It is
observed that the Fisher–Hotelling–Rao and the Jensen-Shannon distances present the
highest recall, followed by the Hellinger, Aitchison and cosine divergence. Specificity
has a less stable pattern along the four considered cases. Specificity was higher in
general than precision and recall, and again, Jensen-Shannon presented the highest
rates. The same pattern is maintained for precision.
Algorithm 2 Anomaly detection in compositional data using LOF under different distance functions
Require: Dataset X with N instances. U ⊂ X is the set of usual or normal vectors. A ⊂ X is the set of anomalies. U ∩ A = ∅. The parameter k defines the k-nearest neighbors. |U| is the number of common observations, |A| is the number of anomalies. d is the distance function, with d ∈ {Euclidean, L1, Wasserstein, Cosine, Aitchison, Hellinger, Jensen-Shannon, Fisher-Hotelling-Rao, Hilbert}.
Ensure: Label/class for each instance in X
1: Apply LOF(X, k, d). For each instance x ∈ X, its local outlier factor lof(x) is computed.
2: Obtain the mean lof for elements in U: μ_U = (1/|U|) Σ_{x∈U} lof(x).
3: Obtain the mean lof for elements in A: μ_A = (1/|A|) Σ_{x∈A} lof(x).
4: TP = 0, TN = 0, FP = 0, FN = 0. P (positive) denotes anomalies, N (negative) denotes expected or normal vectors.
5: for each instance x in X do
6:   if x ∈ U then
7:     Class = N
8:   else
9:     Class = P
10:  end if
11:  dist2Anom = |lof(x) − μ_A|
12:  dist2Usual = |lof(x) − μ_U|
13:  if dist2Anom < dist2Usual then
14:    if Class = P then
15:      TP = TP + 1
16:    else
17:      FP = FP + 1
18:    end if
19:  else
20:    if Class = N then
21:      TN = TN + 1
22:    else
23:      FN = FN + 1
24:    end if
25:  end if
26: end for
27: Precision = TP/(TP + FP)
28: Recall = TP/(TP + FN)
29: Accuracy = (TP + TN)/(TP + TN + FP + FN)
30: Report precision, recall, accuracy for distance function d.
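A short Python sketch of this evaluation, which could be combined with the lof_scores helper sketched earlier (all names are ours, not the authors' code):

```python
import numpy as np

def evaluate(lof, is_anomaly):
    """lof: array of LOF scores; is_anomaly: boolean ground-truth labels."""
    mu_A = lof[is_anomaly].mean()   # expected lof of anomalies, E(A)
    mu_U = lof[~is_anomaly].mean()  # expected lof of usual vectors, E(U)
    pred = np.abs(lof - mu_A) < np.abs(lof - mu_U)  # closest mean decides the class
    tp = np.sum(pred & is_anomaly)
    tn = np.sum(~pred & ~is_anomaly)
    fp = np.sum(pred & ~is_anomaly)
    fn = np.sum(~pred & is_anomaly)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy
```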
An interesting observation is that the Euclidean distance shows the lowest capabilities. This is hardly a surprise, since it is well known that in many machine learning and data science applications the assumption of a Euclidean geometry is not the best choice. However, it is relevant to observe how far the performance of this distance lags behind more adequate choices.
Note that for the case of the annulus, it is quite clear what attribute tells apart anomalies from expected or regular vectors. An annulus is defined as the set of points located between two concentric metric balls of radii θ1 and θ2, respectively. That is, all points located at a distance r from a center, such that θ1 < r < θ2, assuming the center is at the origin, are part of the annulus. In the vast majority of relevant applications, however, it is almost never clear which property of the data should be considered to tell apart anomalies from usual or regular data. What anomaly detection algorithms aim to do is approximate such a property via a discriminant function,
Fig. 5 Precision, recall and accuracy of anomaly detection for the dataset of points defining an annulus in the probability simplex Δ_d, for d = 3, 5, 10, 20
which is implicitly represented in the assumptions behind the algorithm. Following this criterion, two more demanding experiments were conducted.
The second set of experiments, following [62], consisted of creating histograms from a pure or base distribution. This distribution is fixed, and each histogram is created by sampling from it. More than 50 such histograms were created. Then, a few histograms from a different distribution were added to the dataset. The latter are to be identified as anomalies. Figure 4, lower panel, depicts the creation of this dataset. The probability functions that were considered are Gaussian, geometric, uniform, Gamma, Gumbel, and Weibull, with different parameters, except for the uniform distribution.
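A minimal sketch of how such a dataset could be assembled (the distribution choices, sample sizes and bin edges here are illustrative assumptions, not the exact settings used in the experiments):

```python
import numpy as np

def histogram_dataset(n_usual, n_anom, n_bins, rng=None):
    """Hypothetical generator: usual histograms from a Gaussian, anomalous
    histograms from a Gumbel distribution, all normalized to the simplex."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(-5, 5, n_bins + 1)

    def hist(sample):
        counts, _ = np.histogram(sample, bins=edges)
        return counts / counts.sum()

    usual = np.array([hist(rng.normal(0, 1, 500)) for _ in range(n_usual)])
    anomalies = np.array([hist(rng.gumbel(0, 1, 500)) for _ in range(n_anom)])
    return usual, anomalies
```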
The results for the second set of experiments are shown in Fig. 6. There, the base or pure probability function was sampled and a normalized histogram of b bins was created, with b = 10, 20, 30, 40. The parameter b defines the dimension of the probability simplex. The number of histograms for each case in U (the expected or usual class) was 5 × b, whereas the number of anomalies was set to b. The rationale behind this selection of parameters is that we are interested in comparing the effects of the assumed geometry and not in the algorithm itself. That is, since the algorithm is held constant and only the distance function is varied, it is fair to vary the relative sizes of U and A as specified. The parameter k was set to |U| + 2. By making the number
of anomalies equal to the number of bins, that is, by making the number of anomalies
equal to the dimension of the probability simplex, we reduce the search space and
focus on what we aim to prove: whether or not the assumed geometry has an effect on the anomaly detection process.
The distributions that were considered in this experiment were Gaussian, geometric, Gumbel, Weibull, Gamma, and uniform. For each case (except for the uniform), the relevant parameters were selected at random for every experiment. Figure 6 shows the aggregate results over all cases.
Figure 6 summarizes the effect of the distance function on the performance of LOF. Several hundred experiments were conducted. For each one, specific U and A sets were randomly selected. When the probability distribution was the same for the base and anomalous cases, different parameters were forced (except for the uniform distribution). As before, U defines the base class (non-anomaly) and A defines the anomaly class. For a fixed U class, several cases were conducted varying A and its relevant parameters. LOF was applied to the union of sets A and U, and from the lof value of each vector, the accuracy, recall and precision were computed. Then, for each base distribution, the expected recall, precision and accuracy were computed over all cases and bins.
It is observed that the uniform distribution reached the highest performance metrics for all considered distances. This is expected, since this distribution is the least similar to the rest and comes from a completely different family. Since anomaly detection algorithms aim to identify data that are not similar to the rest of the dataset, when the base case consists of histograms from a uniform distribution, almost any histogram from almost any other distribution will be detected as dissimilar to the base ones.
In other words, detecting as anomalies those histograms derived from a distribution
different from the uniform is a relatively easy task under the considered algorithm and
for all the considered distances. On the other hand, the Gaussian distribution shows
the lowest performance metrics for all distances.
We are not interested in the specific details of telling apart distributions, but, once
again, we aim to identify if a given distance is in general more suitable to be applied
in order to detect anomalies. In that sense, the Fisher–Hotelling–Rao and Jensen–
Shannon distances are the ones with the best performances for all distributions. In
short, the anomalies can be better spotted, at least in this context, by distances that are
derived from Information Geometry.
The third experiment is perhaps the most interesting one. We applied LOF to the
set of codon usage profiles of thousands of organisms. In this experiment, a relatively large group of elements (organisms) was sampled from the same phylum and assigned to the expected set U, whereas the anomaly candidates that constitute the set A were obtained from a different group. The expected or usual organisms may be, for example,
the codon usage of several primates, whereas the anomalies are obtained from bacteria.
By doing this, we assume that the second group is the anomaly class, and we are thus
able to test the performance of LOF based on different distance functions.
For this group of experiments, we show the results of the analysis of the codon usage database [63, 64]. In nature, there are 20 amino acids and four nucleotides. Amino acids are the building blocks of proteins [65]. In order for organisms to code all possible amino acids, sequences of length three are needed.
Fig. 6 Precision, recall and accuracy for anomaly detection of the normalized histograms dataset. Each point in the simplex represents a histogram computed from a given probability function. A collection of U histograms is selected from a base or pure distribution (shown as the label of each panel), and a few histograms from a different distribution define the anomaly class A. LOF is then applied to the points in the simplex (A + U). The expected precision, recall and accuracy over all numbers of bins and all cases are shown
In this way, up to 64 (4³) possible sequences can be formed from an alphabet of four symbols (nucleotides) and length three, which is enough to represent all 20 amino acids. Shorter sequences cannot represent all 20 amino acids: sequences of length one can code for only four amino acids, and sequences of length two can code for up to 16, so the minimum length of nucleotide sequences is three [66].
The equivalence between the nucleotide code and the amino acid code is known as the genetic code [63]. Each of the 64 sequences is described by the relative frequency of its appearance within the genome. For example, the sequence ATC, which indicates adenine followed by thymine followed by cytosine, appears on average 20.8 times for every 1000 triplets in the human genome. From this perspective, each organism can be thought of as a point in a probability simplex embedded in a space of dimension 64. The relative frequency of appearance of each possible codon or triplet is known as codon usage [67]. Phylogenetically related organisms tend to have a similar codon usage, whereas unrelated organisms are more likely to show rather different codon usage. Since codon usage can be represented by normalized histograms, this dataset allows us to validate the approach described in this contribution.
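Making this representation explicit, a single normalization step maps the raw codon counts of one organism to a point of the 64-dimensional simplex. The sketch below is illustrative only; the dictionary layout is hypothetical and the public codon usage tables use their own format.

import numpy as np

def codon_usage_point(codon_counts):
    """Map a dict {codon: count} over the 64 triplets to a point in the probability simplex."""
    codons = sorted(codon_counts)                       # fixed ordering of the triplets
    counts = np.array([codon_counts[c] for c in codons], dtype=float)
    return counts / counts.sum()                        # relative frequencies sum to one

# usage = codon_usage_point(counts_for_one_organism)    # counts_for_one_organism: {"AAA": 312, ...}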
Several relevant questions arise once biology is put in the framework of data science and IG. In this contribution, however, we focus our efforts on detecting the geometries that are best suited to identify anomalies within the codon usage of several organisms. Figure 7 shows the performance metrics of applying LOF under the specified distances to the codon usage dataset.
Four groups of organisms were considered: primates, bacteria, viruses, and invertebrates, with 418, 4918, 4097, and 3536 organisms, respectively. Several Monte Carlo iterations were performed. For each iteration and base or usual class, selected from the four groups, a sample of size one tenth of the size of that group was drawn; this defined the set U. Then, between |U|/20 and |U|/10 organisms from any of the remaining three classes were selected at random to be included in the set A. LOF was applied to the union of A and U, under the different distances, as in the two previous experiments. In total, 25 Monte Carlo experiments were performed for each of the four categories of organisms, and the expected performance metrics are shown in the figure.
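A plausible reading of this Monte Carlo protocol is sketched below. The group labels and helper names are hypothetical, and evaluate is any function returning (accuracy, precision, recall), such as the LOF evaluation sketched earlier.

import numpy as np

def monte_carlo_codon(groups, base_name, evaluate, n_iter=25, seed=0):
    """groups: dict mapping a group name to an (n_organisms, 64) array of codon usage points."""
    rng = np.random.default_rng(seed)
    others = [g for g in groups if g != base_name]
    scores = []
    for _ in range(n_iter):
        base = groups[base_name]
        U = base[rng.choice(len(base), size=len(base) // 10, replace=False)]
        other = groups[others[rng.integers(len(others))]]            # one of the remaining groups
        n_anom = int(rng.integers(len(U) // 20, len(U) // 10 + 1))   # between |U|/20 and |U|/10
        A = other[rng.choice(len(other), size=n_anom, replace=False)]
        scores.append(evaluate(U, A))
    return np.mean(scores, axis=0)       # expected accuracy, precision, recall over iterations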
In line with the previous two datasets, the Jensen–Shannon and Fisher–Hotelling–Rao distances achieved the best results. For this case, the Hellinger and Wasserstein distances offered results almost as good as those achieved by the Jensen–Shannon and Fisher–Hotelling–Rao distances. Interestingly, the Euclidean distance does not perform as well as the rest. Also interesting is the observation that primates as the base or expected class yielded the lowest performance metrics for all distances. This is of biological relevance, but it is also a good opportunity to show what we think is an important corroboration via a mathematical approach: the primates are distributed in the probability simplex in such a manner that organisms belonging to other families cannot be identified as anomalies, no matter what geometrical assumptions are maintained. This is also the case, although to a lesser extent, for invertebrates.
Fig. 7 Precision, recall and accuracy for anomaly detection of the codon usage dataset. Each point in the simplex represents a codon usage histogram. A collection of U histograms is selected from a base or usual group of organisms, and a few histograms from a different group define the anomaly class A. LOF is then applied to the points in the simplex (A + U). The expected precision, recall and accuracy over all cases are shown

Figure 8 shows the results of applying Principal Component Analysis (PCA) to the codon usage data of all 12,969 organisms, after a log-centered transformation. It is observed that primates are indeed located in at least two groups, which makes it difficult for LOF to properly detect anomalies when the base class U is composed of heterogeneous instances.
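The log-centered (clr) transformation followed by PCA mentioned here can be reproduced along the following lines; this is a sketch assuming scikit-learn, with the pseudo-count and the number of components chosen by us rather than taken from the original analysis.

import numpy as np
from sklearn.decomposition import PCA

def clr(X, eps=1e-9):
    """Centered log-ratio transform: log of each part minus the row mean of the logs."""
    logX = np.log(np.asarray(X, dtype=float) + eps)
    return logX - logX.mean(axis=1, keepdims=True)

def clr_pca(X, n_components=2):
    """Project compositional rows (points of the simplex) onto the first principal components."""
    return PCA(n_components=n_components).fit_transform(clr(X))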
The results described in this section illustrate the capabilities of LOF as an anomaly detection algorithm under different distances. We have shown that the performance metrics of the same algorithm vary substantially depending on the selected distance function. In particular, as evidence for our initial hypothesis, the Euclidean distance is not the best choice when trying to identify anomalies in compositional data.
6 Discussion and conclusions
What we know nowadays as data science has been around for at least three decades, but its origins can be traced back at least to the works of Pearson in Statistics [68], Poincare in Geometry and Topology [69], Lloyd in vector quantization [70], Shannon in Information Theory [71], and many more contributors who have enriched the field and applied it to several practical problems. What is causing, in our opinion, a deep change in data science is the inclusion of a different set of mathematical tools with two main objectives. The first one is to conduct exploratory data analysis on more robust grounds. The second goal is to achieve more explainable models.

Fig. 8 PCA, after a log-centered transformation, of codon usage data. All organisms in the dataset reported in [63] were included in the visualization
Exploratory data analysis is an instance of unsupervised learning. Unsupervised learning explores the distribution of vectors in the (high-dimensional) attribute space with the aim of finding relevant patterns and structures. One such pattern is the set of vectors that do not fulfill a certain criterion that is common to the vast majority of the analyzed vectors. The vectors that are different from the majority are candidate anomalies. The anomaly detection problem is an open task basically because it is an ill-posed problem: the number of solutions exceeds the number of free parameters and, moreover, those parameters are not known. A vector can be an anomaly under certain assumptions and not under a different set of constraints. In other words, it is not clear what the correct criterion is to compare vectors, nor how to identify the threshold to decide whether vectors are different enough to be considered anomalies.
Equally critical, the majority of anomaly detection algorithms operate under certain assumptions about the underlying geometry of the feature space, assumptions that do not always hold. Almost all anomaly detection algorithms that operate in terms of distance-based criteria assume a Euclidean geometry, regardless of the nature of the data. In this contribution, we have investigated the role of different geometries and the associated distance functions for one anomaly detection algorithm applied to compositional data. Compositional data is defined as a collection of elements whose descriptions or attributes add up to a fixed quantity, such as normalized histograms. Compositional data can be embedded in a probability simplex and, derived from the properties of that structure, different geometries can be assumed. We tested Local Outlier Factor, a well-known anomaly detection algorithm, under different distance functions, over different compositional datasets.
We investigated the effect of several distances and divergences on the detection capabilities of Local Outlier Factor. This is a well-known anomaly detection algorithm and, since it is distance-based, it served our purpose of quantifying the effect of the chosen distance on anomaly detection. In particular, we tested the performance of this algorithm under different geometries and distances or divergences. First, we considered Information Geometry (IG) through the Jensen–Shannon distance. The second approach of interest was Riemannian geometry, represented by the Fisher–Hotelling–Rao metric distance. A third geometry of interest was the Hilbert projective geometry, via the Hilbert metric distance. The fourth geometry was the norm geometry, represented by the L1 metric distance. Besides these distances and divergences, we explored the effect of the Hellinger, Aitchison, and Wasserstein geometries.
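The remaining simplex distances admit similarly compact expressions; the sketch below gives plain implementations of the Hilbert projective metric and the Aitchison distance (again our own illustrations for points of the simplex, not the released code).

import numpy as np

def hilbert_simplex_distance(p, q, eps=1e-12):
    """Hilbert projective metric on the open simplex: log-ratio of extreme coordinate ratios."""
    r = (np.asarray(p, float) + eps) / (np.asarray(q, float) + eps)
    return float(np.log(r.max() / r.min()))

def aitchison_distance(p, q, eps=1e-12):
    """Euclidean distance between the centered log-ratio transforms of two compositions."""
    clr = lambda x: np.log(np.asarray(x, float) + eps) - np.mean(np.log(np.asarray(x, float) + eps))
    return float(np.linalg.norm(clr(p) - clr(q)))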
We conducted three sets of experiments in which the task was to identify anomalies via Local Outlier Factor under different geometries. In particular, the datasets consisted of compositional data, which can be embedded in the probability simplex. From there, we answered the question of which points within the probability simplex are to be considered anomalies. In all three groups of experiments, it was observed that the Jensen–Shannon and the Fisher–Hotelling–Rao distances led to the best performance metrics. The Wasserstein, Hellinger, and Aitchison distances also displayed good results, almost all better than those obtained when the Euclidean distance was considered.
Our hypothesis in this contribution was that distances obtained from Information Geometry are a better choice than the usual Euclidean distance to detect anomalies in compositional data. We have provided evidence that the hypothesis can be accepted, although, of course, a more formal proof is the next step.
Acknowledgements The authors thank the anonymous reviewers for their useful suggestions and comments.
Part of this research was supported by grant IA103921 from PAPIIT DGAPA, Universidad Nacional
Autónoma de México.
Data availability The data and software described in this manuscript are available in Github: https://github.
com/antonioneme/anomDet_IG.
Declarations
Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of
interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included
in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons licence and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
1. Desai, J., Watson, D., Wang, V., Tadeo,M., Floridi, L.: The epistemological foundations of data science:
a critical review. Synthese 200, 469 (2022). https://doi.org/10.1007/s11229-022-03933-2
2. Carmichael, I., Marron, J.S.: Data science vs. statistics: two cultures? Jpn. J. Stat. Data Sci. 1, 117–138 (2018). https://doi.org/10.1007/s42081-018-0009-3
3. Daoud, A., Dubhashi, D.: Statistical modeling: the three cultures. Harvard Data Sci. Rev. (2023). https://doi.org/10.1162/99608f92.89f6fe66
4. Liberti, L.: Distance geometry and data science. TOP 28(2), 271–339 (2020). https://doi.org/10.1007/s11750-020-00563-0
5. Tukey, J.: Exploratory Data Analysis. Pearson, London (1977)
6. Steinbach, M., Ertöz, L., Kumar, V.: The challenges of clustering high-dimensional data. In: New
Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition
7. Epstein, C., Carlsson, G., Edelsbrunner, H.: Topological data analysis. Inverse Probl. 27(12), 120201
(2011). https://doi.org/10.1088/0266-5611/27/12/120201
8. Goldstein, M., Uchida, S.: A comparative evaluation of unsupervised anomaly detection algorithms for
multivariate data. PLoS ONE 11(4), e0152173 (2016). https://doi.org/10.1371/journal.pone.0152173
9. Tenenbaum, J.B., Silva, V., Langford, C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000). https://doi.org/10.1126/science.290.5500.2319
10. Lee, J., Verleysen, M.: Nonlinear Dimensionality Reduction. Springer, New York (2007)
11. Aguayo, L., Barreto, G.: Novelty detection in time series using self-organizing neural networks: a comprehensive evaluation. Neural Proc. Lett. 1, 1 (2017). https://doi.org/10.1007/s11063-017-9679
12. Zimek, A., Schubert, E., Kriegel, P.: A survey on unsupervised outlier detection in high-dimensional
numerical data. Stat. Anal. Data Min. (2012)
13. Grubbs, F.E.: Sample criteria for testing outlying observations. Ann. Math. Stat. 21(1), 27–58 (1950).
https://doi.org/10.1214/aoms/1177729885
14. Barnett, V., Lewis, T.: Outliers in Statistical Data. Wiley, New York (1978)
15. Markou, M., Singh, M.: Novelty detection: a review-Part 1, statistical approaches. Signal Process.
83(12), 2481–2497 (2003). https://doi.org/10.1016/j.sigpro.2003.07.0
16. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U. (eds.) Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226–231. AAAI Press, Washington (1996)
17. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315(5814), 972–976 (2007). https://doi.org/10.1126/science.1136800
18. Breunig, M., Kriegel, H.P., Ng, R., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104. SIGMOD (2000). https://doi.org/10.1145/335191.335388. ISBN 1-58113-217-4
19. Pimentel, M., Clifton, D., Clifton, L., Tarassenko, L.: A review on novelty detection. Signal Process.
99, 215–249 (2014)
20. Markou, M., Singh, M.: Novelty detection: a review-Part 2, neural network based approaches. Signal
Process. 83(12), 2499–2521 (2003). https://doi.org/10.1016/j.sigpro.2003.07.019
21. Selicato, L., Esposito, F., Gargano, G., Vegliante, M.C., Opinto, G., Zaccaria, G.M., Ciavarella, S., Guarini, A., Del Buono, N.: A new ensemble method for detecting anomalies in gene expression matrices. Mathematics 9, 882 (2021). https://doi.org/10.3390/math9080882
22. Li, H.Z., Boulanger, P.: A survey of heart anomaly detection using ambulatory electrocardiogram
(ECG). Sensors (Basel) 20(5), 1461 (2020). https://doi.org/10.3390/s20051461
23. Basora, L., Olive, X., Dubot, T.: Recent advances in anomaly detection methods applied to aviation.
Aerospace 6(11), 117 (2019). https://doi.org/10.3390/aerospace6110117
24. Schwabacher, M., Oza, N., Matthews, B.: Unsupervised anomaly detection for liquid-fueled rocket
propulsion health monitoring. J. Aerosp. Comput. Inf. Commun. 6, 7 (2009)
25. Yepmo, G., Smits, G., Pivert, O.: Anomaly explanation: a review. Data Knowl. Eng. 137, 101946
(2022)
26. Greenacre, M.: Compositional Data Analysis in Practice. CRC Press, London (2018)
27. Aitchison, J.: The statistical analysis of compositional data. J. R. Stat. Soc. B 44(2), 139–177 (1982)
28. Nielsen, F.: An elementary introduction to information geometry. Entropy 22(10), 1100 (2020). https://
doi.org/10.3390/e22101100
29. Nielsen, F.: The many faces of information geometry. Notices AMS 69, 36–45 (2022)
30. Rao, C.R.: Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta
Math. Soc. 37, 81–91 (1945)
31. Deza, M., Deza, E.: Encyclopedia of Distances. Springer, New York (2018)
32. Aitchison, J.: Principal component analysis of compositional data. Biometrika 70(1), 57–65 (1983)
33. Nielsen, F., Sun, K.: Clustering in Hilbert's projective geometry: the case studies of the probability simplex and the elliptope of correlation matrices. In: Nielsen, F. (ed.) Geometric Structures of Information. Signals and Communication Technology. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-02520-5_11
34. Avalos-Fernandez, M., Nock, R., Ong, C.S., Rouar, J., Sun, K.: Representation learning of composi-
tional data. NIPS 18, 6680–6690 (2018). https://doi.org/10.5555/3327757.3327774
35. Bulmer, M.: Principles of Statistics. Dover Publications, New York (1979)
36. Li, Q., McKenzie, D., Yin, W.: From the simplex to the sphere: faster constrained optimization using the Hadamard parametrization. arXiv:2112.05273 (2022). https://doi.org/10.48550/arXiv.2112.05273
37. Mehrotra, K., Mohan, C., Huang, H.: Anomaly Detection, Principles and Algorithms. Springer, New York (2017)
38. Schubert, E., Zimek, A., Kriegel, H.P.: Local outlier detection reconsidered: a generalized view on
locality with applications to spatial, video, and network outlier detection. Data Min. Knowl. Discov.
28, 190–237 (2014). https://doi.org/10.1007/s10618-012-0300-z
39. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: Eighth IEEE International Conference on Data Mining, pp. 413–422 (2008). https://doi.org/10.1109/ICDM.2008.17. ISBN 978-0-7695-3502-9. S2CID 6505449
40. Knorr, E., Ng, R., Tucakov, V.: Distance-based outliers: algorithms and applications. VLDB J. 8,
237–253 (2000). https://doi.org/10.1007/s007780050006
41. Iglewicz, B., Hoaglin, D.: How to Detect and Handle Outliers. American Society for Quality Control,
New York (1993)
42. Aguayo, L., Barreto, G.: Novelty detection in time series using self-organizing neural networks: a comprehensive evaluation. Neural Process. Lett. 47, 1 (2017). https://doi.org/10.1007/s11063-017-9679
43. Neme, A., Lugo, B., Cervera, A.: Authorship attribution as a case of anomaly detection: a neural
network model. Int. J. Hybrid Intell. Syst. 8(4), 225–235 (2011)
44. Neme, A., Gutierrez-Pulido, J., Muñoz, A., Hernández, S., Dey, T.: Stylistics analysis and authorship
attribution algorithms based on self-organizing maps. Neurocomputing 147, 147–159 (2015)
45. Forrest, S., Perelson, A.S., Allen, L., Cherukuri, R.: Self-nonself discrimination in a computer. In:
Proceedings of the 1994 IEEE Symposium on Research in Security and Privacy, Los Alamitos, pp.
202–212 (1994)
46. Wang, K., Langevin, S., Shattuck, M., Ogle, S., Kirby, M.: Anomaly detection in host signaling
pathways for the early prognosis of acute infection. PLOS (2016). https://doi.org/10.1371/journal.
pone.0160919
47. Wang, G., Yang, J., Li, R.: Imbalanced SVM-based anomaly detection algorithm for imbalanced training datasets. Electron. Telecommun. Res. Inst. J. 39(5), 621–631 (2017). https://doi.org/10.4218/etrij.17.0116.0879
48. Zhao, W., Li, L., Alam, S., Wang, Y.: An incremental clustering method for anomaly detection in flight
data. Transport. Res. Part C Emerg. Technol. 132, 103406 (2021). https://doi.org/10.1016/j.trc.2021.
103406
49. Evangelou, M., Adams, N.: An anomaly detection framework for cyber-security data. Comput. Secur.
97, 101941 (2021). https://doi.org/10.1016/j.cose.2020.101941
50. Novikova, E., Kotenko, I.: Visual analytics for detecting anomalous activity in mobile money transfer
services. In: International Cross-Domain Conference and Workshop on Availability, Reliability, and Security (CD-ARES), Fribourg, pp. 63–78 (2014). https://doi.org/10.1007/978-3-319-10975-65
51. Garrard, P., Maloney, L., Hodges, J., Patterson, K.: The effects of very early Alzheimer’s disease on
the characteristics of writing by a renowned author. Brain 128(2), 250–260 (2005). https://doi.org/10.
1093/brain/awh341
52. Close, L., Kashef, R.: Combining artificial immune system and clustering analysis: a stock mar-
ket anomaly detection model. J. Intell. Learn. Syst. Appl. (2020). https://doi.org/10.4236/jilsa.2020.
124005
53. Colignatus, T.: Comparing the Aitchison Distance and the Angular Distance for Use as Inequality or
Disproportionality Measures for Votes and Seats (2018)
54. Villani, C.: Optimal Transport, Old and New. Springer, New York. ISBN 978-3-540-71050-9 (2008)
55. Bigot, J.: Statistical data analysis in the Wasserstein space. ESAIM Proc. Surv. 68, 1–19 (2020). https://doi.org/10.1051/proc/202068001
56. Peyre, G., Cuturi, M.: Computational Optimal Transport. arXiv:1803.00567 (2018)
57. Aler, R., Valls, J., Bostrom, H.: Study of Hellinger distance as a splitting metric for random forests in balanced and imbalanced classification datasets. Expert Syst. Appl. 1, 113264 (2020). https://doi.org/10.1016/j.eswa.2020.113264
58. Lavigne, C., Ricci, B., Franck, P., Senoussi, R.: Spatial analyses of ecological count data: a density
map comparison approach. Basic Appl. Ecol. 11, 734–742 (2010)
59. Menendez, M.L., Pardo, J.A., Pardo, M.: The Jensen–Shannon divergence. J. Franklin Inst. 334(2),
307–318 (1997). https://doi.org/10.1016/S0016-0032(96)00063-4
60. Coles, P., Cerezo, M., Cincio, L.: Strong bound between trace distance and Hilbert-Schmidt distance for
low-rank states. Phys. Rev. A. 100(2), 022103 (2019). https://doi.org/10.1103/PhysRevA.100.022103
61. Gattone, S., Sanctis, A., Russo, T., Pulcini, D.: A shape distance based on the Fisher-Rao metric and its
application for shapes clustering. Phys. A Stat. Mech. Appl. (2017). https://doi.org/10.1016/j.physa.
2017.06.014
62. Hawkins, D.: Identification of Outliers. Springer, New York (1980)
63. Nakamura, Y., Gojobori, T., Ikemura, T.: Codon usage tabulated from the international DNA sequence
databases: status for the year 2000. Nucl. Acids Res. 28, 292 (2000)
64. Khomtchouk, B.B.: Codon usage bias levels predict taxonomic identity and genetic composition.
bioRxiv (2020). https://doi.org/10.1101/2020.10.26.356295
65. Nelson, D.L., Cox, M.M.: Principles of Biochemistry, 4th edn. W. H. Freeman, New York. ISBN
0-7167-4339-6 (2005)
66. Parvathy, S.T., Udayasuriyan, V., Bhadana, V.: Codon usage bias. Mol. Biol. Rep. 49, 539–565 (2022). https://doi.org/10.1007/s11033-021-06749-4
67. Prat, Y., Fromer, M., Linial, N.: Codon usage is associated with the evolutionary age of genes in metazoan genomes. BMC Evol. Biol. 9, 285 (2009). https://doi.org/10.1186/1471-2148-9-285
68. Pearson, K.: A First Study of the Statistics of Pulmonary Tuberculosis. Dalau, London (1907)
69. Poincare, H.: Analysis Situs. Translated version from French (1895)
70. Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
71. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423, 623–656 (1948). https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.