Outliers in compositional time series data
Sevvandi Kandanaarachchi1, Patricia Menéndez2,
Rubén Loaiza-Maya2, Ursula Laa2,3
1School of Science, Mathematical Sciences, RMIT University, Melbourne VIC 3000, Australia.
2Department of Econometrics and Business Statistics, Monash University, Clayton, VIC 3800, Australia.
3School of Physics and Astronomy, Monash University, Clayton, VIC 3800, Australia.
Abstract
This paper proposes an outlier detection ensemble for time series compositional data that can also be applied to multivariate and univariate time series. In addition, we propose a coordinate transformation from the simplex $\mathcal{S}^n$ to $\mathbb{R}^{n-1}$, which allows zero-valued composites and preserves Euclidean distances and angles. We test this framework on simulated compositional time series and three real applications. The simulation study confirms that our approach accurately identifies the compositional outliers, while the empirical applications illustrate how the approach helps to identify the COVID-19 outbreaks in India and Spain. The R package composits implements the outlier detection ensemble and the coordinate transformation presented in this paper.
Key words— compositional time series, outlier detection, anomaly detection, multivariate
time series, ensembling, outlier detection ensembles
1 Introduction
Outliers tell a story different from the norm. In our data-rich world, outlier detection methods are used in
diverse societal applications such as detecting intrusions, identifying emerging terrorist plots in social media
and detecting fetal anomalies in pregnancies. Outlier detection methods are constantly developed to cater for
varied aspects of data such as non-stationarity in data streams (Gu et al. 2020). The steady growth of outlier
detection methods has also contributed to a growing body of literature on outlier detection ensembles (Zimek
et al. 2014, Unwin 2019).
In this study we investigate time series outliers in compositional data and propose an ensemble method to
detect them. Compositional data refers to quantitative data that describes parts of a whole; specifically a
vector of positive elements that adds up to a constant, typically one. Although compositional data can arise
in the context of a Dirichlet distribution, it also appears naturally in geological, financial, economic and
biological applications to name a few. Examples include the study of geochemical composition of rocks and
sediments, the proportion of tourist arrivals from different countries, the study of shares portfolio composition
and the abundances of microbes in microbiome data. Compositional data appears in any problem where the
composition relative to the total is of importance.
Karl Pearson was the first statistician to look at this kind of data. In his study (Pearson 1897) he analyzed
spurious correlation between proportions and raised awareness about the problems of using standard statistical
methods for analyzing proportions. After that, other authors explored different aspects of compositional data
(Chayes 1960). However, it was not until the 1980s, when John Aitchison published two main works in this area (Aitchison 1982, 1983), that a framework and the principles for the analysis of compositional data were set up. One of the many important contributions that Aitchison brought to the field of compositional data analysis
was the representation via the simplex. That is, for compositional data with $n$ components, these components can be understood as elements of a simplex space $\mathcal{S}^n$ of dimension $n$, where
$$\mathcal{S}^n = \left\{ \boldsymbol{x} = (x_1, x_2, \ldots, x_n) \;\middle|\; x_i \in \mathbb{R}_{>0},\ i = 1, 2, \ldots, n,\ \sum_{i=1}^{n} x_i = c,\ c \in \mathbb{R} \right\}.$$
The vector $\boldsymbol{x}$ represents an $n$-part composition within the simplex $\mathcal{S}^n$, which is an $(n-1)$-dimensional affine subspace of $\mathbb{R}^n$. This forms the foundation of what is known as Aitchison geometry (Aitchison 1982).
Working in the simplex with that geometry poses challenges when trying to use many of the well established
multivariate techniques that are designed to work with unconstrained data in R𝑛. Therefore, Aitchison intro-
duced transformations (Aitchison 1982) so that the simplex geometric space where compositions are naturally
represented can be mapped into an unconstrained space. Three different transformations from the simplex $\mathcal{S}^n$ to $\mathbb{R}^{n-1}$ or $\mathbb{R}^n$ using logratios were proposed: the additive logratio transformation (alr), the centred logratio transformation (clr) and the isometric logratio transformation (ilr). For a point on the simplex $\boldsymbol{x} = (x_1, x_2, \ldots, x_n) \in \mathcal{S}^n$ and non-zero $x_j$, the additive logratio transformation (alr) is given by
$$\boldsymbol{x}' = \left( x'_1, \ldots, x'_{n-1} \right) = \left( \log \frac{x_1}{x_j}, \cdots, \log \frac{x_{j-1}}{x_j}, \log \frac{x_{j+1}}{x_j}, \cdots, \log \frac{x_n}{x_j} \right).$$
Similarly, the centred logratio transformation (clr) is given by
$$\boldsymbol{x}' = \left( x'_1, \ldots, x'_n \right) = \left( \log \frac{x_1}{\sqrt[n]{\prod_{j=1}^{n} x_j}}, \;\cdots,\; \log \frac{x_n}{\sqrt[n]{\prod_{j=1}^{n} x_j}} \right),$$
and the isometric logratio transformation (ilr) is given by
$$\boldsymbol{x}' = \left( x'_1, \ldots, x'_{n-1} \right) \quad \text{with} \quad x'_i = \sqrt{\frac{i}{i+1}} \, \log \frac{\sqrt[i]{\prod_{j=1}^{i} x_j}}{x_{i+1}} \quad \text{for } i < n.
$$
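To make the three transformations concrete, the following short sketch implements them directly from the formulas above. It is written in Python purely for illustration (the function names are ours, not from any compositional data package), and it assumes a strictly positive composition, since the logratios are undefined at zero.

```python
import math

def alr(x, j=None):
    # Additive logratio: log of every other part relative to the j-th part.
    j = len(x) - 1 if j is None else j
    return [math.log(x[i] / x[j]) for i in range(len(x)) if i != j]

def clr(x):
    # Centred logratio: log of each part relative to the geometric mean.
    gm = math.prod(x) ** (1.0 / len(x))
    return [math.log(xi / gm) for xi in x]

def ilr(x):
    # Isometric logratio: x'_i = sqrt(i/(i+1)) * log(gmean(x_1..x_i) / x_{i+1}).
    out = []
    for i in range(1, len(x)):
        gm_i = math.prod(x[:i]) ** (1.0 / i)
        out.append(math.sqrt(i / (i + 1)) * math.log(gm_i / x[i]))
    return out

x = [0.2, 0.3, 0.5]          # a 3-part composition summing to one
print(alr(x))                # n - 1 coordinates in R^{n-1}
print(clr(x))                # n coordinates that sum to zero
print(ilr(x))                # n - 1 coordinates in R^{n-1}
```

Note that all three transformations fail on zero-valued composites, which is precisely the limitation the nullspace transformation of Section 2.1 avoids.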
While these transformations set the foundations and have vastly contributed to the advancement of compo-
sitional data analysis, they are not without limitations. An obvious limitation is the inability to incorporate
zeros in the data. The issue of compositions with zeros was also discussed in Aitchison (1982) and solutions
were proposed depending on the “cause of the zeros”. The solutions included amalgamation, replacement
and Box-Cox transformations. Amalgamation comprises reducing the number of compositional components
by grouping different components together so that there are no zeros in the resulting composition. From an
outlier detection perspective, this approach can mask outliers because certain components are added together.
Other proposed solutions include replacing zeros by very small values or applying Box-Cox transformations
to the proportions. However, both these approaches have an adverse effect on outlier detection because they
introduce artificial outliers as a result of using logarithms and Box-Cox transformations on very small values
(Templ et al. 2017). A different approach to handle zeros was developed by Scealy & Welsh (2011), which
involved square-root transformations so that compositional data is mapped onto a hypersphere and distributions
for directional data can be used. Their focus was on spherical regression using Kent distributions. However,
taking the square root may dampen the large outliers.
The outlier detection literature, which includes contributions from statisticians and computer scientists, is constantly expanding. The reader is referred to Wang et al. (2019) for a recent survey on the subject. Outlier
detection in time series overlaps with change point detection, which also has a rich history and a wealth of ap-
proaches to tackle the problem (Aminikhanghahi & Cook 2017). In particular, outlier detection in the context
of compositional data has also received some attention in the last few years. For example, Brunsdon & Smith
(1998) proposed methods to analyze compositional time series and Filzmoser & Hron (2008) studied outlier
detection in compositional time series using robust methods.
Of particular interest is the work by Templ et al. (2017), which proposes methods for detecting outliers in
compositional data with zeros. In their study they explore the zero structure separately using subcompositions
determined by their zero patterns. Their main focus is on the subset of observations that have zeros. As part of
their analysis, they preprocess that subset such that the non-zero components are assigned 1, while zeros are
left as they are. Then they compute the binary Principal Component (PC) space and plot the observations in this
2-dimensional space. In addition to the binary PC space, they also use imputation methods and Mahalanobis
distances on these subcompositions.
In this paper, we propose a compositional coordinate transformation from the simplex $\mathcal{S}^n$ to $\mathbb{R}^{n-1}$, which preserves Euclidean distances and angles between points and is agnostic to zeros. Furthermore, we propose a time series outlier detection ensemble for compositional data that uses this transformation. Even though our focus is on compositional time series, the outlier ensemble presented in this paper can be used on univariate and multivariate time series that are non-compositional. Additionally, we make this work available in the R package composits (Kandanaarachchi et al. 2020).
The remainder of the paper is organised as follows: We discuss the coordinate transformation from the simplex $\mathcal{S}^n$ to $\mathbb{R}^{n-1}$ in Section 2.1. We note that this coordinate transformation can also be used on compositional data that is not time dependent. After transforming the data to $\mathbb{R}^{n-1}$, we proceed to find outliers using an ensemble of time series outlier detection methods. We use the time series outlier detection methods available in the R packages forecast (Hyndman & Khandakar 2008), tsoutliers (de Lacalle 2019), otsad (Iturria et al. 2019) and anomalize (Dancho & Vaughan 2019) to build our ensemble. The compositional time series outlier
ensemble is discussed in Section 2.2, which contains the univariate and multivariate ensembles. The multi-
variate and compositional time series outlier detection methods comprise decomposing the multivariate data to
univariate by using Principal Component Analysis (PCA), Independent Component Analysis (ICA) (Comon
1994), Invariant Coordinate Selection (ICS) (Tyler et al. 2009) and DOBIN (Kandanaarachchi & Hyndman
2020), and applying the univariate time series outlier ensemble to these decomposed series. After finding the
outlying time points we apportion the outlier scores back to the compositions as explained in Section 2.3. In
Section 2.4 we explore visualization methods including animations using the R package tourr (Wickham et al.
2011). Then, we test our compositional time series outlier ensemble on simulated compositional time series
explained in Section 3. In Section 4 we apply our outlier ensemble to three real world datasets: international
tourism data from the World Bank, COVID-19 data in India from Kaggle and Spanish daily mortality data
from the Spanish ministry of Science and Innovation under the daily deaths monitoring program (Ministry of
Science and Innovation, Spain 2020). Finally, we discuss our conclusions in Section 5.
2 Methodology
2.1 Nullspace coordinates on the simplex
Consider the point $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)$ on the simplex $\mathcal{S}^n$, i.e. $\sum_{j=1}^{n} x_j = 1$. Although there are $n$ coordinates to describe $\boldsymbol{x}$, these are constrained as they add up to a constant. We remove this constraint by finding a new set of basis vectors for the simplex.
Figure 1: The hyperplane given by the equation $(\boldsymbol{x} - \boldsymbol{a}) \cdot \boldsymbol{\nu} = 0$.
Consider a hyperplane given by the equation $\boldsymbol{\nu} \cdot (\boldsymbol{x} - \boldsymbol{a}) = 0$, where $\boldsymbol{\nu}$ is the normal (perpendicular) vector to the hyperplane and $\boldsymbol{a}$ is a point on the hyperplane. Let $\boldsymbol{\nu} = (\nu_1, \ldots, \nu_n)^T$ and $\boldsymbol{a} = (a_1, \ldots, a_n)^T$ represent the aforementioned vectors. Then, expanding the equation of the hyperplane, we get
$$\boldsymbol{\nu} \cdot \boldsymbol{x} = \boldsymbol{\nu} \cdot \boldsymbol{a}, \qquad \nu_1 x_1 + \nu_2 x_2 + \cdots + \nu_n x_n = \tilde{c}, \qquad (1)$$
where $\boldsymbol{\nu} \cdot \boldsymbol{a}$ equals a constant $\tilde{c}$. Since $\boldsymbol{x}$ satisfies $x_1 + x_2 + \cdots + x_n = 1$, by comparing with equation (1), we see that $\boldsymbol{\nu} = (1, 1, \ldots, 1)^T$. We choose the barycenter $\boldsymbol{c} = \left( \frac{1}{n}, \frac{1}{n}, \cdots, \frac{1}{n} \right)^T$ as the point $\boldsymbol{a}$ for the remainder of the computation. We note that any point on the simplex $\mathcal{S}^n$ would be appropriate as the point $\boldsymbol{a}$. Thus, for every
point $\boldsymbol{x} \in \mathcal{S}^n$,
$$\boldsymbol{x} - \boldsymbol{c} \perp \boldsymbol{\nu}, \qquad \boldsymbol{\nu}^T (\boldsymbol{x} - \boldsymbol{c}) = 0, \qquad (\boldsymbol{x} - \boldsymbol{c}) \in \mathrm{Null}(\boldsymbol{\nu}^T),$$
where Null denotes the null space. Hence, a basis for the null space of $\boldsymbol{\nu}^T$ can be used to describe the vectors $\boldsymbol{x} - \boldsymbol{c}$ for all $\boldsymbol{x} \in \mathcal{S}^n$. In order to find a basis for the null space of $\boldsymbol{\nu}^T$, we consider the equation
$$\boldsymbol{\nu}^T \boldsymbol{y} = 0, \quad \text{i.e.} \quad y_1 + \ldots + y_n = 0.$$
As there are $n - 1$ free parameters, by considering $y_j = s_j$, where $s_j$ denotes a free parameter for $j \in \{1, \ldots, n-1\}$, we have
$$y_n = -\sum_{j=1}^{n-1} s_j,$$
giving us
$$\boldsymbol{y} = \begin{pmatrix} s_1 \\ s_2 \\ \vdots \\ -\sum_{j=1}^{n-1} s_j \end{pmatrix} = s_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ -1 \end{pmatrix} + s_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ -1 \end{pmatrix} + \cdots + s_{n-1} \begin{pmatrix} 0 \\ \vdots \\ 1 \\ -1 \end{pmatrix}.$$
Therefore the set
$$\mathcal{B}_1 = \left\{ \begin{pmatrix} 1 \\ 0 \\ \vdots \\ -1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ \vdots \\ -1 \end{pmatrix}, \cdots, \begin{pmatrix} 0 \\ \vdots \\ 1 \\ -1 \end{pmatrix} \right\}$$
gives a basis for the null space of $\boldsymbol{\nu}^T$. We can then perform a Gram-Schmidt orthogonalisation (Anton & Rorres 2013) to make this basis orthonormal. Let $\mathcal{B}_1 = \{\boldsymbol{u}_1, \boldsymbol{u}_2, \ldots, \boldsymbol{u}_{n-1}\}$ be the current basis for the null space as above. Then the Gram-Schmidt orthogonalisation is performed by defining
$$\boldsymbol{w}_1 = \frac{\boldsymbol{u}_1}{\|\boldsymbol{u}_1\|},$$
and letting $\tilde{\boldsymbol{u}}_2 = \boldsymbol{u}_2 - \langle \boldsymbol{u}_2, \boldsymbol{w}_1 \rangle \boldsymbol{w}_1$. This makes $\tilde{\boldsymbol{u}}_2 \perp \boldsymbol{w}_1$, giving the second normalized basis vector
$$\boldsymbol{w}_2 = \frac{\tilde{\boldsymbol{u}}_2}{\|\tilde{\boldsymbol{u}}_2\|}.$$
Similarly, letting $\tilde{\boldsymbol{u}}_3 = \boldsymbol{u}_3 - \langle \boldsymbol{u}_3, \boldsymbol{w}_1 \rangle \boldsymbol{w}_1 - \langle \boldsymbol{u}_3, \boldsymbol{w}_2 \rangle \boldsymbol{w}_2$, we have $\tilde{\boldsymbol{u}}_3 \perp \boldsymbol{w}_1$ and $\tilde{\boldsymbol{u}}_3 \perp \boldsymbol{w}_2$, giving the third normalized basis vector
$$\boldsymbol{w}_3 = \frac{\tilde{\boldsymbol{u}}_3}{\|\tilde{\boldsymbol{u}}_3\|}.$$
After computing $\boldsymbol{w}_i$, the next basis vector is found by
$$\tilde{\boldsymbol{u}}_{i+1} = \boldsymbol{u}_{i+1} - \langle \boldsymbol{u}_{i+1}, \boldsymbol{w}_1 \rangle \boldsymbol{w}_1 - \cdots - \langle \boldsymbol{u}_{i+1}, \boldsymbol{w}_i \rangle \boldsymbol{w}_i,$$
and defining
$$\boldsymbol{w}_{i+1} = \frac{\tilde{\boldsymbol{u}}_{i+1}}{\|\tilde{\boldsymbol{u}}_{i+1}\|}.$$
Consequently, we obtain an orthonormal basis $\mathcal{B} = \{\boldsymbol{w}_1, \boldsymbol{w}_2, \ldots, \boldsymbol{w}_{n-1}\}$ derived from $\mathcal{B}_1$. Software such as the R package pracma (Borchers 2019) can be used to find an orthonormal basis when computing the null space. Once we find an orthonormal basis for $\mathrm{Null}(\boldsymbol{\nu}^T)$, we can compute the coordinates of all vectors $\boldsymbol{x} - \boldsymbol{c}$ in this new basis. Let $\boldsymbol{y} \in \mathrm{Null}(\boldsymbol{\nu}^T)$, and let $\boldsymbol{y}_O$ denote the coordinates of $\boldsymbol{y}$ in the original basis and $\boldsymbol{y}_B$ the coordinates of $\boldsymbol{y}$ in the null space basis $\mathcal{B}$. Then, as illustrated in Figure 2,
$$\boldsymbol{y} = \langle \boldsymbol{y}, \boldsymbol{w}_1 \rangle \boldsymbol{w}_1 + \langle \boldsymbol{y}, \boldsymbol{w}_2 \rangle \boldsymbol{w}_2 + \cdots + \langle \boldsymbol{y}, \boldsymbol{w}_{n-1} \rangle \boldsymbol{w}_{n-1},$$
giving the coordinates of $\boldsymbol{y}$ in the null space basis $\mathcal{B}$:
$$\boldsymbol{y}_B = \left( \langle \boldsymbol{y}, \boldsymbol{w}_1 \rangle, \langle \boldsymbol{y}, \boldsymbol{w}_2 \rangle, \ldots, \langle \boldsymbol{y}, \boldsymbol{w}_{n-1} \rangle \right)^T.$$
Figure 2: The coordinates of $\boldsymbol{y}$ in the orthonormal basis $\{\boldsymbol{w}_1, \boldsymbol{w}_2\}$ are $(\langle \boldsymbol{y}, \boldsymbol{w}_1 \rangle, \langle \boldsymbol{y}, \boldsymbol{w}_2 \rangle)$, i.e. $\boldsymbol{y} = \langle \boldsymbol{y}, \boldsymbol{w}_1 \rangle \boldsymbol{w}_1 + \langle \boldsymbol{y}, \boldsymbol{w}_2 \rangle \boldsymbol{w}_2$.
Let us denote by $X - C$ the matrix which contains the vectors $\boldsymbol{x} - \boldsymbol{c}$ as column vectors; that is, $X - C$ is an $n \times N$ matrix. Let $B$ denote the matrix containing the basis vectors of $\mathcal{B}$ as column vectors, i.e. $B$ is an $n \times (n-1)$ matrix. Then the null space coordinates are given by the column vectors of
$$\tilde{X} = B^T (X - C).$$
We summarize the computation of the null space coordinates for $\boldsymbol{x} \in \mathcal{S}^n$ below:
1. For all $\boldsymbol{x} \in \mathcal{S}^n$, compute $\boldsymbol{x} - \boldsymbol{c}$, where $\boldsymbol{c} = \left( \frac{1}{n}, \frac{1}{n}, \cdots, \frac{1}{n} \right)^T$.
2. Construct an orthonormal matrix $B$, which contains the basis vectors for the null space of $\boldsymbol{\nu}^T$.
3. For a point $\boldsymbol{x}$ on the simplex, transform the coordinates as $\tilde{\boldsymbol{x}} = B^T (\boldsymbol{x} - \boldsymbol{c})$.
Next, we show that the Euclidean distances and angles between points are not changed by this coordinate
transformation.
Lemma 2.1. For any two points $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ on the simplex $\mathcal{S}^n$, we have $\|\boldsymbol{x}_i - \boldsymbol{x}_j\| = \|\boldsymbol{x}'_i - \boldsymbol{x}'_j\|$, where we consider the Euclidean norm and $\boldsymbol{x}'$ denotes the coordinates of the point after the transformation. That is, the Euclidean distance between points before and after the transformation is the same.
Proof. Let $\bar{B} = [B \;\; \hat{\boldsymbol{\nu}}]$, where $\hat{\boldsymbol{\nu}}$ denotes the unit vector in the direction of $\boldsymbol{\nu}$ and $B$ is the matrix containing the basis vectors of $\mathcal{B}$. As $\boldsymbol{\nu}$ is orthogonal to the column vectors in $B$, $\bar{B}$ is an $n \times n$ orthogonal matrix. Let
$$\boldsymbol{x}'' = \bar{B}^T (\boldsymbol{x} - \boldsymbol{c}) \qquad (2)$$
denote the coordinates of $\boldsymbol{x} - \boldsymbol{c}$ in $\mathbb{R}^n$ using $\bar{B}$. Given that
$$\boldsymbol{x}' = B^T (\boldsymbol{x} - \boldsymbol{c}),$$
we have
$$\boldsymbol{x}'' = \begin{pmatrix} \boldsymbol{x}' \\ 0 \end{pmatrix} \qquad (3)$$
because the last coordinate of $\boldsymbol{x}''$ is equal to $\hat{\boldsymbol{\nu}} \cdot (\boldsymbol{x} - \boldsymbol{c})$, which equals zero as $\boldsymbol{\nu}$ is perpendicular to $\boldsymbol{x} - \boldsymbol{c}$. Using equation (2) we have
$$\boldsymbol{x}''_i - \boldsymbol{x}''_j = \bar{B}^T (\boldsymbol{x}_i - \boldsymbol{x}_j),$$
giving us
$$\begin{aligned}
\|\boldsymbol{x}''_i - \boldsymbol{x}''_j\|^2 &= \left\langle \boldsymbol{x}''_i - \boldsymbol{x}''_j, \boldsymbol{x}''_i - \boldsymbol{x}''_j \right\rangle \\
&= \left( \boldsymbol{x}''_i - \boldsymbol{x}''_j \right)^T \left( \boldsymbol{x}''_i - \boldsymbol{x}''_j \right) \\
&= \left( \boldsymbol{x}_i - \boldsymbol{x}_j \right)^T \bar{B} \bar{B}^T \left( \boldsymbol{x}_i - \boldsymbol{x}_j \right) \\
&= \left( \boldsymbol{x}_i - \boldsymbol{x}_j \right)^T \left( \boldsymbol{x}_i - \boldsymbol{x}_j \right) \\
&= \|\boldsymbol{x}_i - \boldsymbol{x}_j\|^2,
\end{aligned}$$
where we have used $\bar{B} \bar{B}^T = I$: as $\bar{B}^T \bar{B} = I$ and $\bar{B}$ is an $n \times n$ matrix, we have $\bar{B}^{-1} = \bar{B}^T$, giving $\bar{B} \bar{B}^T = I$. From equation (3) we know that $\|\boldsymbol{x}''\| = \|\boldsymbol{x}'\|$, which completes the proof.
Lemma 2.2. For any three points $\boldsymbol{x}_i$, $\boldsymbol{x}_j$ and $\boldsymbol{x}_k$ on the simplex $\mathcal{S}^n$, we have
$$\left\langle \boldsymbol{x}'_i - \boldsymbol{x}'_k, \boldsymbol{x}'_j - \boldsymbol{x}'_k \right\rangle = \left\langle \boldsymbol{x}_i - \boldsymbol{x}_k, \boldsymbol{x}_j - \boldsymbol{x}_k \right\rangle,$$
where we consider the standard Euclidean inner product and $\boldsymbol{x}'$ denotes the coordinates of the point after the transformation.
Proof. Similar to Lemma 2.1 we work with $\bar{B} = [B \;\; \hat{\boldsymbol{\nu}}]$. Using equation (2) and the fact that $\bar{B} \bar{B}^T = I$ we obtain
$$\begin{aligned}
\left\langle \boldsymbol{x}''_i - \boldsymbol{x}''_k, \boldsymbol{x}''_j - \boldsymbol{x}''_k \right\rangle &= \left( \boldsymbol{x}''_i - \boldsymbol{x}''_k \right)^T \left( \boldsymbol{x}''_j - \boldsymbol{x}''_k \right) \\
&= \left( \boldsymbol{x}_i - \boldsymbol{x}_k \right)^T \bar{B} \bar{B}^T \left( \boldsymbol{x}_j - \boldsymbol{x}_k \right) \\
&= \left( \boldsymbol{x}_i - \boldsymbol{x}_k \right)^T \left( \boldsymbol{x}_j - \boldsymbol{x}_k \right) \\
&= \left\langle \boldsymbol{x}_i - \boldsymbol{x}_k, \boldsymbol{x}_j - \boldsymbol{x}_k \right\rangle.
\end{aligned}$$
As $\left\langle \boldsymbol{x}''_i - \boldsymbol{x}''_k, \boldsymbol{x}''_j - \boldsymbol{x}''_k \right\rangle = \left\langle \boldsymbol{x}'_i - \boldsymbol{x}'_k, \boldsymbol{x}'_j - \boldsymbol{x}'_k \right\rangle$ from equation (3), we get the desired result.
Lemmata 2.1 and 2.2 together tell us that the angles between points are preserved by this proposed coordinate
transformation. This completes the discussion on the coordinate transformation. Next we look at the outlier
ensemble.
2.2 Compositional time series outlier ensemble (CTSOE)
The compositional time series outlier ensemble consists of the following components: 1. the null space coor-
dinate transformation, 2. the multivariate outlier ensemble and 3. the univariate outlier ensemble. A schematic
diagram of the compositional time series outlier ensemble is given in Figure 3.
Using the null space coordinate transformation discussed in Section 2.1, we transform the compositional time series data to $\mathbb{R}^{n-1}$, resulting in a multivariate time series without any constraints. We further decompose this multivariate time series into univariate time series using four decomposition methods and identify outliers using four time series outlier detection techniques, which are listed on the CRAN task view:
1. tsoutliers by de Lacalle (2019) implements the algorithms by Chen & Liu (1993) for detecting outliers
in time series. They consider a combination of ARIMA models with hypothesis testing on the residuals
and find additive outliers, level shifts, temporary changes, innovational outliers and seasonal level shifts.
2. forecast by Hyndman & Khandakar (2008) identifies outliers in the residuals of the whitened time
series. They use Friedman’s super smoother supsmu for non-seasonal time series and a periodic seasonal
trend decomposition using LOESS (STL) for seasonal time series. Points are labeled as outliers if they lie outside $\pm 1.5 \times$ the interquartile range (IQR).
3. anomalize by Dancho & Vaughan (2019) implements outlier detection using remainders from trend
and/or seasonal decomposition of time series based on the interquartile range or the generalized extreme
studentized deviation tests.
4. otsad by Iturria et al. (2019) implements online fault detectors for time series using two state-shift detection methods, which identify candidate outliers using different algorithms such as shift-detection-based exponentially weighted moving average (SD-EWMA). These candidate outliers are then further tested using a Kolmogorov-Smirnov test before being labeled as outliers.
Figure 3: Schematic diagram of the compositional time series outlier ensemble CTSOE with the nullspace coordinate transformation NSC and the multivariate time-series outlier ensemble MTSOE. The light orange colored rectangles denote processes and the gray colored parallelograms denote input and output data.
As the univariate time series are used to identify outliers, we discuss the univariate outlier ensemble next.
2.2.1 Univariate time series outlier ensemble (UTSOE)
Consider a univariate time series $\{x_t\}_{t=1}^{N}$ for which we use the four outlier detection methods tsoutliers, forecast, anomalize and otsad to identify outliers. Let us denote the outlier indicator variable of the $j$th method by $\{y_t^j\}_{t=1}^{N}$, where
$$y_t^j = \begin{cases} 1 & \text{if } x_t \text{ is identified as an outlier by the } j\text{th method}, \\ 0 & \text{otherwise}. \end{cases}$$
Some outlier detection methods may identify outliers sparingly while others identify a string of outliers. Outlier
detection methods that identify fewer outliers are generally preferred, as outliers are considered rare observa-
tions. As such, to construct an ensemble score, we want to give a higher weight to methods that identify few
outliers and a lower weight to methods that identify more outliers. To achieve this, we assign weights to the outlier detection methods based on both their level of agreement with other methods and the total number of outliers identified by each method. Suppose $O_j$ denotes the set of outliers identified by method $j$. Then the weight for method $j$, $\xi_j$, is
$$\xi_j = \frac{\left| O_j \cap \bigcup_{k \neq j} O_k \right|}{|O_j|} = \frac{\text{Number of common outliers with other methods}}{\text{Number of outliers identified by method } j}. \qquad (4)$$
Using the method weights $\xi_j$, we obtain the univariate time series outlier score as
$$y_t = \sum_{j=1}^{4} \xi_j y_t^j.$$
That is, the outlier score of each time point corresponds to a weighted average score of the methods that have
identified it as an outlier, with larger scores corresponding to higher ranked outliers.
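To make the weighting concrete, here is a small Python sketch of equation (4) and the weighted combination. It is illustrative pseudocode for the ensemble arithmetic only; the actual flags come from the four R packages above, and the indicator vectors below are hypothetical.

```python
def utsoe_scores(indicators):
    # indicators[j][t] = 1 if outlier method j flags time t, else 0.
    N = len(indicators[0])
    outlier_sets = [{t for t in range(N) if ind[t] == 1} for ind in indicators]
    weights = []
    for j, O_j in enumerate(outlier_sets):
        others = set().union(*(O for k, O in enumerate(outlier_sets) if k != j))
        # Equation (4): xi_j = |O_j intersect (other methods' outliers)| / |O_j|.
        weights.append(len(O_j & others) / len(O_j) if O_j else 0.0)
    # Weighted score y_t = sum_j xi_j * y_t^j.
    return [sum(w * ind[t] for w, ind in zip(weights, indicators)) for t in range(N)]

# Four toy detectors over N = 6 time points: all agree on t = 2, while the
# fourth detector also flags t = 4 and t = 5 on its own.
flags = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 1],
]
scores = utsoe_scores(flags)
print(scores)   # t = 2 scores 10/3; t = 4 and t = 5 score only the diluted 1/3
```

The prolific fourth detector receives weight $1/3$, so the outliers it flags alone rank well below the point all four methods agree on.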
In addition to $\{y_t\}_{t=1}^{N}$, we also provide the raw scores
$$Y_t = \left( y_t^1, y_t^2, y_t^3, y_t^4 \right),$$
with $Y_t \in \mathbb{R}^4$ and $y_t^j \in \mathbb{R}$. As we will show, these scores are useful in a multivariate setting. We refer to
this univariate time series outlier ensemble as UTSOE in the remainder of the paper. Next we look at the
multivariate time series outlier ensemble.
2.2.2 Multivariate time series outlier ensemble (MTSOE)
To find outliers of a multivariate time series, we first decompose it into univariate time series.
Decomposition of multivariate time series to univariate
We employ the following four dimension reduction techniques to decompose a multivariate time series to univariate components:
1. Principal Component Analysis (PCA)
2. Independent Component Analysis (ICA)
ICA (Comon 1994) is originally a signal processing technique that separates or unmixes a number of
signals that are collected together. ICA finds independent components of the mixed signal. A well
known application of ICA is the “cocktail party problem”, where ICA is used to separate the mixed noise
of many people talking.
3. Invariant Coordinate Selection (ICS)
ICS (Tyler et al. 2009) is a method for exploring data by comparing different estimates of scatter that
identifies a system of independent components used to represent the data in a lower dimensional space.
4. DOBIN
DOBIN (Kandanaarachchi & Hyndman 2020) is a dimension reduction method specifically targeted for
outlier detection.
While none of these methods are specially designed for time series data, they can be used to decompose multi-
variate time series to univariate components (Baragona & Battaglia 2007, Aires et al. 2000). After decomposing
the multivariate time series, we apply UTSOE to each univariate component. We then consider the first $q$ univariate series resulting from each decomposition method, where $q$ is set to two by default in our algorithm and can be changed by the user.
Using the univariate ensemble UTSOE
Let $\{\boldsymbol{x}_t\}_{t=1}^{N}$ denote the original multivariate time series, i.e. for each $t$, $\boldsymbol{x}_t \in \mathbb{R}^n$. So, we can represent our original multivariate time series as a matrix of dimensions $N \times n$. As we decompose this multivariate time series to univariate using four methods and use the first $q$ components, this results in a three-dimensional array of decomposed time series of dimension $N \times 4 \times q$. We denote this time series object by $Z_{N \times 4 \times q}$, where the first dimension denotes time, the second the decomposition methods and the third the decomposed components. Thus, we have $4q$ univariate time series as a result of this decomposition.
Let $\boldsymbol{z}_t^k$ denote the multivariate decomposed time series using the $k$th decomposition method for $k \in \{1, 2, 3, 4\}$, i.e. for each pair of $k$ and $t$, $\boldsymbol{z}_t^k \in \mathbb{R}^q$ with $q \leq n$. Let $z_t^{k,\ell} \in \mathbb{R}$ denote the $\ell$th value of $\boldsymbol{z}_t^k$ for fixed $k$ and $t$, with $\ell \leq q$. Therefore, $\{z_t^{k,\ell}\}_{t=1}^{N}$ is a univariate time series, where $k \in \{1, 2, 3, 4\}$ and $\ell \in \{1, \ldots, q\}$.
We use UTSOE to identify outliers in the $4q$ univariate time series $\{z_t^{k,\ell}\}_{t=1}^{N}$ for $k \in \{1, 2, 3, 4\}$ and $\ell \in \{1, \ldots, q\}$. As we use four outlier detection methods, our outlier scores will comprise a four-dimensional array of dimension $N \times 4 \times q \times 4$. Let us call this array $Y_{N \times 4 \times q \times 4}$. The array $Y$ contains outlier scores for $Z$, with the first dimension of $Y$ denoting time, the second the decomposition method, the third the decomposed components and the fourth the outlier detection method.
Expanding $Y_{N \times 4 \times q \times 4}$ in the fourth dimension we have $Y_{N \times 4 \times q \times 4} = \left[ y_t^{k,\ell,1} \; y_t^{k,\ell,2} \; y_t^{k,\ell,3} \; y_t^{k,\ell,4} \right]$, with $y_t^{k,\ell,j}$ for fixed $j$ denoting a three-dimensional array of dimension $N \times 4 \times q$, containing the outlier scores of $Z$ for the $j$th outlier detection method. From here onward, the index $t$ denotes time, $j$ the outlier method, $k$ the decomposition method and $\ell$ the decomposed component.
Combination of univariate time series outlier scores
Next we want to combine the UTSOE scores given by the 4-dimensional array $Y_{N \times 4 \times q \times 4}$ into an $N \times 1$ vector. That is, we need to combine the scores relating to i) the 4 decomposition methods, ii) the $q$ components of each decomposition method and iii) the 4 outlier detection methods. First we combine the scores of the 4-dimensional array $Y_{N \times 4 \times q \times 4}$ by outlier detection method and obtain a 3-dimensional array $Y_{N \times 4 \times q}$. For this task we use the outlier detection method weights described in equation (4). For $t \in \{1, \ldots, N\}$, $k \in \{1, 2, 3, 4\}$, $\ell \in \{1, \ldots, q\}$, and fixed $j = j_0$, the outlier scores $y_t^{k,\ell,j_0}$ comprise an $N \times 4 \times q$ array. The array when $j_0 = 1$ gives the scores of the outlier detection method tsoutliers. Similarly $j_0 = 2$ relates to forecast scores, $j_0 = 3$ to anomalize scores and $j_0 = 4$ to otsad scores. We find $\xi_{j_0}$ for each 3-dimensional $y_t^{k,\ell,j_0}$ as in equation (4). That is, $\xi_{j_0}$ does not depend on $k$ or $\ell$; it only depends on $j_0$. So, each $\xi_{j_0}$ is computed over all decomposition methods and components. Using the weights $\xi_j$ we combine the univariate time series scores in $Y_{N \times 4 \times q \times 4}$ to $Y_{N \times 4 \times q}$ as follows:
$$y_t^{k,\ell} = \sum_{j=1}^{4} \xi_j \, y_t^{k,\ell,j}, \qquad (5)$$
where $y_t^{k,\ell}$ are elements of the 3-dimensional array $Y_{N \times 4 \times q}$ of dimension $N \times 4 \times q$, having outlier scores for $N$ time points, 4 decomposition methods and $q$ components for each decomposition method.
Next we recognize that for the decomposition methods PCA, ICS and DOBIN, the information contained in the components decreases as we move to later components; i.e. the 50th component is not as important as the 1st component. However, for ICA this is not the case because each component is independent. Therefore we use different weighting mechanisms to combine the $q$ decomposition scores of each method.
Let the weights for the $q$ components of the decomposition method $k = k_0$ be given by $w_1^{k_0}, w_2^{k_0}, \ldots, w_q^{k_0}$. Then for each decomposition method $k = k_0$ we choose weights such that $\sum_{\ell=1}^{q} w_\ell^{k_0} = 1$. We use the weighting schemes given below for the four decomposition methods:
1. For PCA and ICS we choose the weights $w_\ell^{k_0} = \frac{\lambda_\ell}{\sum_\ell \lambda_\ell}$, where $\lambda_\ell$ denotes the eigenvalue associated with the $\ell$th principal component/eigenvector, and $k_0 = 1$ for PCA and $k_0 = 3$ for ICS.
2. For DOBIN we use decreasing weights proportional to $1, \frac{1}{2}, \cdots, \frac{1}{q}$, i.e. $w_\ell^{k_0} = \frac{1/\ell}{\sum_\ell 1/\ell}$, with $k_0 = 2$.
3. For ICA we use a constant weight for all components, i.e. $w_\ell^{k_0} = \frac{1}{q}$, with $k_0 = 4$.
Using these weights $w_\ell^k$, we combine the outlier scores as follows:
$$y_t^k = \sum_{\ell=1}^{q} w_\ell^k \, y_t^{k,\ell},$$
where $y_t^k$ denotes the outlier scores using the $k$th decomposition method. The sum of these four vectors gives the final ensemble outlier scores,
$$y_t = y_t^1 + y_t^2 + y_t^3 + y_t^4,$$
where $\boldsymbol{y} = \{y_t\}_{t=1}^{N}$ is a vector of length $N$ containing the outlier scores of the multivariate time series $\{\boldsymbol{x}_t\}_{t=1}^{N}$. We call this outlier ensemble MTSOE.
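Putting equation (5), the component weights and the final sum over decomposition methods together, the combination step can be sketched as follows. The score array and weights below are toy inputs of our own invention; in the ensemble the scores come from UTSOE and the weights from the schemes above.

```python
def mtsoe_combine(Y, xi, w):
    # Y[t][k][l][j]: UTSOE score at time t, decomposition method k,
    # component l, outlier detection method j.
    # xi[j]: outlier-method weights (equation (4)); w[k][l]: component weights.
    out = []
    for t in range(len(Y)):
        y_t = 0.0
        for k in range(4):                      # 4 decomposition methods
            for l in range(len(w[k])):          # q components each
                # Equation (5): combine over the 4 outlier detection methods.
                y_kl = sum(xi[j] * Y[t][k][l][j] for j in range(4))
                y_t += w[k][l] * y_kl           # component weight, summed over k
        out.append(y_t)
    return out

q = 2
xi = [1.0, 1.0, 0.5, 0.5]          # toy outlier-method weights
w = [[0.7, 0.3],                   # PCA: eigenvalue-proportional
     [2 / 3, 1 / 3],               # DOBIN: proportional to 1, 1/2
     [0.6, 0.4],                   # ICS: eigenvalue-proportional
     [0.5, 0.5]]                   # ICA: constant
# Two time points; only t = 1 is flagged, by every method on every series.
Y = [[[[0] * 4 for _ in range(q)] for _ in range(4)],
     [[[1] * 4 for _ in range(q)] for _ in range(4)]]
print(mtsoe_combine(Y, xi, w))     # approximately [0.0, 12.0]
```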
Acknowledging multiple testing
As we consider 4 decomposition methods, the univariate outlier ensemble UTSOE finds outliers on $4q$ time series. For each time series, UTSOE employs 4 outlier detection methods. Therefore, each time point gets tested for outlyingness $4 \times 4q$ times. As a result of this multiple testing procedure, we get unwanted outliers: time points which are actually non-outlying marked as outliers.
In order to ascertain the true outliers we compare the outlier scores of the time series with an outlier-removed version of the time series. This 'comparison' time series is constructed in the following way. First we count the number of outliers $M$ identified by MTSOE, which comprise the union of outliers identified by UTSOE. If $M \leq N/10$, where $N$ is the total number of time points in the time series, we remove the observations at these outlying time points from the time series and interpolate the resulting time series linearly at the missing time points so that there are no sudden jumps. If $M > N/10$, we only remove the top $\lceil N/10 \rceil$ outlying time points according to the outlier score. Again, the resulting time series is interpolated linearly at the missing time points. Once we have this comparison time series, we compute outlier scores using MTSOE. We compute the 95th percentile and the maximum outlier score for the comparison time series and define the difference as the gap $g$. For the identified outliers of the original time series, we compute the gap score $gs$ as
$$gs(y_t) = \left[ \frac{y_t - \max_m c_m}{g} \right]_+ \qquad (6)$$
where $\boldsymbol{y} = \{y_t\}_{t=1}^{N}$ represents the outlier scores of the original time series, $\boldsymbol{c}$ represents the outlier scores of the comparison time series, and $[x]_+$ equals $x$ if $x$ is positive and $0$ otherwise. Thus, we have a set of gap scores resulting from the comparison time series. The outlier scores with higher gap scores can be considered more significant compared to the others.
However, we add a word of caution regarding the gap scores. Outliers are difficult to define (Unwin 2019) and
a single definition does not suit all applications. As such, the gap scores should not be taken as the ‘ideal’ truth
in determining the significance of the outliers. Rather, they should be taken for what they are – comparison
scores from a similar time series without outliers. Consequently, while we give a shorter outlier list containing
outliers with positive gap scores, we make all non-zero outlier scores available, so that a user-defined cut-off
can be employed.
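Given the outlier scores of the original series and of its outlier-removed comparison, the gap score of equation (6) can be sketched as below. The 95th percentile here is a deliberately simple nearest-rank estimate, the construction of the comparison series itself (removal of up to $\lceil N/10 \rceil$ points and linear interpolation) is not reproduced, and the score vectors are toy values.

```python
def gap_scores(y, c):
    # y: outlier scores of the original series; c: scores of the comparison
    # series. Equation (6): gs(y_t) = [(y_t - max_m c_m) / g]_+ with
    # g = max(c) - 95th percentile of c.
    c_sorted = sorted(c)
    p95 = c_sorted[int(0.95 * (len(c) - 1))]   # crude nearest-rank percentile
    g = max(c) - p95
    return [max((y_t - max(c)) / g, 0.0) if g > 0 else 0.0 for y_t in y]

# Comparison scores are low and tightly bunched; the original series has a
# single score far above the comparison maximum.
c = [0.1, 0.2, 0.15, 0.1, 0.3, 0.25, 0.1, 0.2, 0.1, 0.15,
     0.2, 0.1, 0.15, 0.1, 0.2, 0.1, 0.25, 0.1, 0.15, 0.4]
y = [0.1, 0.2, 3.0, 0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1,
     0.15, 0.1, 0.2, 0.1, 0.25, 0.1, 0.1, 0.2, 0.1, 0.3]
gs = gap_scores(y, c)
print(gs[2])   # approximately 26: only t = 2 receives a positive gap score
```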
2.3 Apportioning time outliers to covariates
At this juncture we have discussed the methodology to identify outliers in time. In a multivariate or a com-
positional setting we are also interested in recognising the variables or compositional units that contribute to
time-outliers. We achieve this by apportioning the outlier scores to the multivariate or compositional coordi-
nates by performing the inverse of the coordinate transformations discussed in Section 2.2.2.
For compositional data on the simplex $\mathcal{S}^n$, the unconstrained coordinates lie in $\mathbb{R}^{n-1}$. Let $B$ denote the $n \times (n-1)$ matrix containing the nullspace basis coordinates, where each column contains a basis vector. Similarly, let $P_{\mathrm{PCA}}$, $P_{\mathrm{DOB}}$, $P_{\mathrm{ICS}}$ and $P_{\mathrm{ICA}}$ denote the $(n-1) \times q$ matrices containing the PCA, DOBIN, ICS and ICA basis vectors, respectively. Then the compositional coordinates $\boldsymbol{x} \in \mathbb{R}^n$ get transformed to the PCA space as
$$\tilde{\boldsymbol{x}} = \boldsymbol{x} - \boldsymbol{c}, \qquad \boldsymbol{z}_{\mathrm{PCA}} = P_{\mathrm{PCA}}^T B^T \tilde{\boldsymbol{x}}, \qquad \text{giving} \quad Z_{\mathrm{PCA}} = P_{\mathrm{PCA}}^T B^T \tilde{X} = (B P_{\mathrm{PCA}})^T \tilde{X},$$
where $\tilde{X}$ is an $n \times N$ matrix with each column corresponding to a compositional data point. Similarly, we obtain
the coordinates after performing the DOBIN, ICS and ICA decomposition methods as follows:
$$Z_{\mathrm{DOB}} = P_{\mathrm{DOB}}^T B^T \tilde{X} = (B P_{\mathrm{DOB}})^T \tilde{X}, \qquad Z_{\mathrm{ICS}} = P_{\mathrm{ICS}}^T B^T \tilde{X} = (B P_{\mathrm{ICS}})^T \tilde{X}, \qquad Z_{\mathrm{ICA}} = P_{\mathrm{ICA}}^T B^T \tilde{X} = (B P_{\mathrm{ICA}})^T \tilde{X}. \qquad (7)$$
The univariate outlier ensemble UTSOE finds outliers from these coordinates. As such, we can associate the
outlier scores with these coordinate spaces.
The 3-dimensional array 𝑌𝑁×4×𝑞discussed in equation (5) contains outlier scores weighted by the outlier
method but not weighted by the decomposition component weights. It comprises four 𝑁×𝑞matrices stacked
in the second dimension, each containing the outlier scores for 𝑞PC, DOBIN, ICS and ICA components re-
spectively. Let us call these 𝑁×𝑞matrices 𝑌PCA,𝑌DOB,𝑌ICS, and 𝑌ICA. We recall that the weights of 𝑞PC,
DOBIN, ICS and ICA components are 𝒘1,𝒘2,𝒘3, and 𝒘4respectively, where each 𝒘𝑘is a 𝑞vector. Then the
17
weighted outlier scores W are obtained by

W_PCA = Y_PCA diag(w_1),
W_DOB = Y_DOB diag(w_2),
W_ICS = Y_ICS diag(w_3),
W_ICA = Y_ICA diag(w_4),

where diag(w_k) represents the q × q matrix with the weights w_k on the diagonal, and all W matrices are of size N × q. By associating the weighted outlier scores with the decomposition space, we can transform the outlier scores to the original compositional space by performing the appropriate coordinate transformation as follows:
A_PCA = B P_PCA W_PCA^T,
A_DOB = B P_DOB W_DOB^T,
A_ICS = B P_ICS W_ICS^T,   (8)
A_ICA = B P_ICA W_ICA^T,
where B is an n × (n−1) matrix, P an (n−1) × q matrix, W an N × q matrix and A an n × N matrix. We note that this is the back transformation of multiplying by the matrix (B P_xxx)^T considered in equation (7), where xxx denotes the dimension reduction method. We associate the matrix A with the compositional space, with each column of A representing the transformed scores of one data point. However, the transformed scores can be positive or negative. The sign of the scores depends on the choice of basis vectors and their sign: for example, if a basis vector of B or P is multiplied by −1, then the sign of the resulting coordinates in A will change. As such, the magnitude of the transformed scores is important, not their sign. Using these transformed scores we obtain
A_TOT = |A_PCA| + |A_DOB| + |A_ICS| + |A_ICA|,

where |A_xxx| denotes the matrix obtained by taking the absolute value of each element of A_xxx. We call A_TOT the matrix of apportioned scores. By inspecting A_TOT we can see which composites/covariates contribute to the time-outliers.
Removing the matrix B from equations (7) and (8) gives the apportioned scores for the multivariate setting.
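The apportioning step can be sketched as follows. This is illustrative Python only: the stand-in basis P and the weights w are arbitrary placeholders, not values produced by the ensemble.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
n, N, q = 5, 50, 2

B = null_space(np.ones((1, n)))                         # n x (n-1) null space basis
# Stand-in (n-1) x q decomposition basis with orthonormal columns.
P = np.linalg.qr(rng.standard_normal((n - 1, q)))[0]
Y = rng.random((N, q))                                  # outlier scores per observation/component
w = np.array([0.7, 0.3])                                # component weights (a q-vector)

W = Y @ np.diag(w)                                      # weighted scores, N x q
A = B @ P @ W.T                                         # apportioned scores, n x N, as in (8)

# Only the magnitude of the apportioned scores is meaningful.
A_abs = np.abs(A)
print(A_abs.shape)                                      # one score per composite per time point
```

Summing such |A| matrices over the four decomposition methods yields A_TOT, whose columns attribute each time point's outlyingness to the individual composites.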
2.4 Visualization
Visualization of the results provides an important cross-check of the algorithm, and is necessary for the in-
terpretation of the tagged outliers in the context of the full dataset. The multivariate nature of the time series
considered makes this challenging, and it can be important to explore a range of diagnostic graphics, corre-
sponding to the different stages in the ensemble algorithm.
First, we highlight that we are working with three types of coordinate systems: the original compositional
coordinates, the null space coordinates corresponding to the unconstrained representation of the time series,
and finally the different coordinate systems obtained from the four decomposition methods. Generally the
preferred option for diagnostics will be to visualize the decomposed time series components, because it reduces
dimensionality and corresponds to the input series analyzed by the UTSOE. However, the other two coordinate
representations may be important for the interpretation of the results.
To deal with the high-dimensional nature of the time series, we consider four different approaches:
1. Univariate time series displays: selecting a single coordinate representation and one or more components, mapping components to color; or using faceting to display multiple coordinate representations.
2. “Biplot” displays: showing the first two components selected by any of the decomposition methods,
together with the axes representation of the corresponding projection matrix from the original or null
space coordinate system. Each observation in time is represented as a scatter point, and tagged outliers
can be highlighted. Using a biplot we can compare the outlying points to the overall distribution, and understand the connection with the original coordinates by reading the axes display. However, the scatter plot display means that temporal patterns are lost in this visualization. One option is to use lines connecting the points in time to show the temporal pattern, but this can lead to overloading of the graph. An alternative is to only connect tagged outliers to the two neighboring time points.
3. Tour display: tour methods (Asimov 1985, Buja et al. 2005) allow the visualization of high-dimensional
data through an animated sequence of low-dimensional projections. Here we use the tourr package (Wickham et al. 2011) to generate a sequence of two-dimensional projections. As with biplots,
observations in time are shown as scatter points in the multivariate space, and we highlight tagged outliers and allow them to be connected to their neighbors to convey the temporal pattern. This display can be used with any coordinate representation of the data; using the decomposed time series components may still be preferred for high-dimensional input, but requires more than the default two components.
4. Scores over time: it is useful to visualize the apportioned scores, to interpret and understand them in
context. Here we consider the special case of working with spatial data and show the apportioned scores
for selected outliers on a map.
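As an illustration of the biplot display described in point 2, the following sketch projects a toy series onto its first two principal components, highlights two injected outliers, connects each one to its neighbouring time points, and draws the axis arrows. This is a matplotlib sketch under stated assumptions, not the package's plotting code.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # headless backend for scripting
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
N, d = 60, 4

# Stand-in null space coordinates (rows = time points) with two injected outliers.
G = rng.standard_normal((N, d))
G[[20, 45]] += 6.0
outliers = [20, 45]

# First two principal components and their loadings for the biplot.
Gc = G - G.mean(axis=0)
U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
scores = Gc @ Vt[:2].T                     # N x 2 projected observations

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=10, color="grey")
ax.scatter(scores[outliers, 0], scores[outliers, 1], color="red")
# Connect each tagged outlier to its two neighbouring time points.
for t in outliers:
    ax.plot(scores[t - 1:t + 2, 0], scores[t - 1:t + 2, 1], color="red", lw=0.5)
# Axis arrows showing how the original coordinates map into the projection.
for j in range(d):
    ax.annotate(f"X{j + 1}", xy=(0, 0), xytext=(Vt[0, j], Vt[1, j]),
                arrowprops=dict(arrowstyle="<-", lw=0.5))
fig.savefig("biplot.png")
```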
3 Simulation exercise
We use a simulation to test the method and understand its behavior. In this exercise we generate a compositional time series vector z_t = (z_t^1, ..., z_t^n)^T, for t = 1, ..., N, with z_t^i ∈ [0, 1] and Σ_{i=1}^n z_t^i = 1. This process has both time series persistence that resembles real compositional data, and an outlier generation process, whose outliers will be detected using the proposed outlier detection ensemble. To specify the true data generating process (DGP) of z_t, we first consider the multiple time series vector x_t = (x_t^1, ..., x_t^n)^T ∈ R^n. This vector has the state-space dynamics
x_t = A r_t,   (9)
r_t = μ + B r_{t−1} + D ε_t + C b_t,   (10)
where r_t = (r_t^1, ..., r_t^K)^T is a K-dimensional vector of underlying factors (or states) driving x_t, A is an n × K matrix of factor loadings, ε_t ~ N(0, I_K) and b_t = (b_{t,1}, ..., b_{t,K})^T, with b_{t,k} ~ Bernoulli(p). In Equation (9) the dimensionality of the problem is reduced from the n × 1 vector x_t to the K × 1 vector r_t, where K < n. The first three terms on the right-hand side of Equation (10) specify an autoregressive process of order one for r_t. The fourth term, C b_t, has the role of inducing outliers in the dynamics of r_t, with the magnitude of those outliers determined by the K × K matrix C, and their probability of occurrence determined by the scalar p ∈ [0, 1]. The autoregressive process in r_t and the term C b_t allow x_t to have time series persistence and an outlier generation process. The compositional time series vector z_t is constructed from x_t so that

z_t^i = exp(x_t^i) / Σ_{j=1}^n exp(x_t^j).
In this exercise we consider the particular case where n = 30 and K = 2. The matrices required in the true DGP are selected to be

μ = (0.3, 0.7)^T;   B = diag(0.8, 0.5);   C = diag(5, 4);   D = diag(0.4, 0.4).
The elements of A are independently generated from a univariate normal distribution with mean zero and a standard deviation of 0.3, and finally p = 0.005. From this specification of the true DGP we generate M = 1000 time series, each of length N = 500. In addition to the random persistent outliers produced through C b_t, we also add two discretionary outliers to each time series by setting x_117^2 = log(10) and x_40^8 = log(200).
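The DGP above can be sketched as follows. This is a minimal Python translation for illustration only; the paper's simulation uses the R package composits.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, N, p = 30, 2, 500, 0.005

A = rng.normal(0.0, 0.3, size=(n, K))      # factor loadings, elements ~ N(0, 0.3^2)
mu = np.array([0.3, 0.7])
B = np.diag([0.8, 0.5])                    # AR(1) persistence
C = np.diag([5.0, 4.0])                    # outlier magnitudes
D = np.diag([0.4, 0.4])                    # innovation scale

X = np.empty((N, n))
r = np.zeros(K)
for t in range(N):
    eps = rng.standard_normal(K)
    b = rng.binomial(1, p, size=K)         # Bernoulli(p) outlier indicators
    r = mu + B @ r + D @ eps + C @ b       # equation (10)
    X[t] = A @ r                           # equation (9)

# Two discretionary outliers (paper's 1-based indices, 0-based here).
X[116, 1] = np.log(10.0)                   # x_117^2 = log(10)
X[39, 7] = np.log(200.0)                   # x_40^8 = log(200)

# Map each row to the simplex.
Z = np.exp(X) / np.exp(X).sum(axis=1, keepdims=True)
```

Each row of Z is a composition summing to one, so Z is a compositional time series of the same shape as the simulated series in the paper.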
Figure 4 displays one of the simulated compositional time series. Each colored line corresponds to a particular composite. The two dashed lines represent the locations of the discretionary outliers, which only affect the time series at single time points. The dotted lines represent the time locations of the persistent outliers. Notice that, unlike the discretionary outliers, the effect of this type of outlier decays slowly over time. Even though these outliers persist over time, we only consider the first of these persisting time points as outlying in our labeled data.
We apply the compositional time series outlier ensemble CTSOE to every generated time series. To account for the persisting outliers we adjust the CTSOE scores as follows. For each time series, detected observations that belong to a run of consecutive time points with decreasing outlier scores are not predicted to be outliers. For instance, if the time points t = 20, t = 21 and t = 22 are all detected as outliers by CTSOE, and y_20 > y_21 > y_22, then only t = 20 is predicted to be an outlier for the purpose of this exercise. Using the true and predicted outlier time locations for each simulated time series, we measure the predictive accuracy in terms of the area under the receiver operating characteristic curve (AUC). Figure 5 presents the histogram of the AUC based on the M = 1000 iterations. It provides strong evidence that the proposed outlier detection method accurately identifies compositional outliers, as most of the probability mass in the histogram is located at values greater than 0.9.
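The score adjustment for persistent outliers can be expressed as a small helper. This is an illustrative Python sketch; the function name is ours, not part of composits.

```python
def first_of_decreasing_runs(times, scores):
    """Keep only the first time point of each run of consecutive detections
    whose outlier scores are strictly decreasing.

    times: sorted detected time points; scores: their outlier scores."""
    keep = []
    for i, t in enumerate(times):
        consecutive = i > 0 and times[i - 1] == t - 1
        if consecutive and scores[i] < scores[i - 1]:
            continue                  # part of a decaying persistent outlier
        keep.append(t)
    return keep

# Example from the text: t = 20, 21, 22 detected with decreasing scores,
# so only t = 20 is kept; the isolated detection at t = 40 survives.
print(first_of_decreasing_runs([20, 21, 22, 40], [5.0, 4.0, 3.0, 6.0]))  # → [20, 40]
```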
Figure 4: One of the simulated compositional time series. The vertical dashed lines represent the time locations of the discretionary outliers, while the dotted lines represent the time locations of the persistent outliers.
Figure 5: Histogram of the area under the ROC curve in the simulation exercise.
4 Applications
4.1 International tourism data
We use the World Bank data on the number of international tourist arrivals, which is available at https://data.worldbank.org/indicator/ST.INT.ARVL. This dataset contains yearly data from 1995 to 2018. We use the seven geographically aggregated time series for the regions East Asia & Pacific (EAP), Europe & Central Asia (ECA), Latin America & Caribbean (LAC), Middle East & North Africa (MENA), Sub-Saharan Africa (SSA), South Asia (SA) and North America (NA). This data is shown in Figure 6.
Figure 6: International tourism arrivals for each region.
We make this a compositional time series by dividing the seven-dimensional time series by the total number of arrivals for each year. That is, if the original data for a certain year t is given by x_t = (x_t^1, x_t^2, ..., x_t^7)^T, then we compute the compositional time series as

z_t = (z_t^1, z_t^2, ..., z_t^7)^T = ( x_t^1 / Σ_j x_t^j, x_t^2 / Σ_j x_t^j, ..., x_t^7 / Σ_j x_t^j )^T.
We use the compositional time series outlier ensemble CTSOE, illustrated in Figure 3, on this data. First we transform z_t using the nullspace coordinate transformation to obtain unconstrained data. The new coordinates γ_t = W z_t are computed using equation (11), where z_t = (z_t^1, z_t^2, ..., z_t^7)^T and

W = ( −0.3779   0.8963  −0.1036  −0.1036  −0.1036  −0.1036  −0.1036
      −0.3779  −0.1036   0.8963  −0.1036  −0.1036  −0.1036  −0.1036
      −0.3779  −0.1036  −0.1036   0.8963  −0.1036  −0.1036  −0.1036
      −0.3779  −0.1036  −0.1036  −0.1036   0.8963  −0.1036  −0.1036
      −0.3779  −0.1036  −0.1036  −0.1036  −0.1036   0.8963  −0.1036
      −0.3779  −0.1036  −0.1036  −0.1036  −0.1036  −0.1036   0.8963 ).   (11)
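As a quick numerical check, taking the first-column and off-diagonal entries of the matrix in equation (11) as negative makes its rows orthonormal and zero-sum, which is what preserves Euclidean distances between compositions. A numpy sketch, for verification only:

```python
import numpy as np

# Rebuild the 6 x 7 matrix of equation (11): row i has -0.3779 in column 1,
# 0.8963 in column i+1 and -0.1036 elsewhere (entries to 4 decimal places).
n = 7
W = np.full((n - 1, n), -0.1036)
W[:, 0] = -0.3779
W[np.arange(n - 1), np.arange(1, n)] = 0.8963

# Rows are (approximately) orthonormal and sum to (approximately) zero ...
print(np.round(W @ W.T, 3))            # close to the 6 x 6 identity
print(np.round(W.sum(axis=1), 3))      # close to the zero vector

# ... so distances between compositions are preserved, since differences of
# compositions lie in the null space of the ones vector spanned by the rows.
rng = np.random.default_rng(5)
z1, z2 = rng.random(n), rng.random(n)
z1, z2 = z1 / z1.sum(), z2 / z2.sum()
d_simplex = np.linalg.norm(z1 - z2)
d_null = np.linalg.norm(W @ z1 - W @ z2)
print(abs(d_simplex - d_null) < 1e-2)  # True, up to the rounding of the entries
```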
Then CTSOE uses γ_t as input to our multivariate time series outlier ensemble MTSOE. The output of CTSOE is given in Figure 7. It gives the breakdown of identified outliers in terms of decomposition methods and outlier detection methods. The highlighted cells in columns 2–5 contain the non-zero scores of identified outliers obtained from each decomposition method. The total score is the sum of these scores. The column headed Num_Coords gives the number of decomposition methods that have contributed to the identification of outliers. The four columns headed forecast through anomalize give the weighted scores of each outlier detection method. The next column, Num_Methods, gives the number of outlier detection methods that identified each observation as an outlier. The last column gives the Total Score, with the highest score highlighted. The column Gap_Score_2 gives the gaps as computed by equation (6).
Figure 7: The output of CTSOE on international tourism data from 1995 to 2018 showing outliers.
From Figure 7 we see that the year 2003 – the year of the SARS outbreak – was the most outlying year for international tourism from 1995 to 2018 from a compositional perspective. The gap score for 2003 is 43, much higher than that of the other outlying years. The time series plots of the null space coordinates and their decompositions, shown in Figures 8 and 9, confirm this finding.
To better understand the connection between the null space coordinates and the decompositions, we can look
at biplots; see Figure 10 for DOBIN, PCA and ICS biplots. We see that the outlying points, highlighted
Figure 8: Null space coordinates of international tourism data, with a dashed line at 2003.
Figure 9: DOBIN, ICA, ICS and PCA coordinates of international tourism data, with outlying time points
shown by vertical lines.
in red, are associated with long edges that show the temporal connection between the points. In particular,
we see that ICS, which has tagged all three outliers, reveals long edges for each of them. Reading the axes
displays across the plots, we find that X1 and X4 are important for tagging the first outlier (2003), while X5
is associated with the additional two outlying points. This is confirmed with the null space time series shown
Table 1: Outlier scores apportioned to geographical regions
Region 2003 2010 2014
East Asia & Pacific 0.737 0.229 0.108
Europe & Central Asia 0.859 0.336 0.100
Latin America & Caribbean 0.281 0.100 0.144
Middle East & North Africa 0.291 0.112 0.075
Sub-Saharan Africa 0.171 0.055 0.127
South Asia 0.357 0.073 0.099
North America 0.355 0.120 0.096
in Figure 8. A more comprehensive overview of these connections is obtained with the tour display, which is
available at https://uschilaa.github.io/animations/composits1.html. This tour plot also reveals
the importance of X3, and shows that X6 does not contribute relevant information.
Figure 10: DOBIN, PCA and ICS biplots of international tourism data, with outlying time points shown in red. Outlying points are associated with long edges in the biplot.
Table 1 shows the apportioned outlier scores for the outlying years. For 2003, we see that the apportioned scores for East Asia & Pacific and Europe & Central Asia are much higher than those of the other regions, confirming our understanding of the SARS outbreak.
Figure 11 shows the total number of international arrivals, which is a univariate time series with outliers identi-
fied by UTSOE drawn using dashed lines. Again we see that 2003 is a global outlier along with 2007, 2008 and
2009. However, 2007 - 2009 do not come up as compositional outliers in CTSOE in Figure 7. This is because
a global change may not necessarily result in changes at a compositional level. Clearly in 2003 the global dip
in international tourism arrivals had an impact on the compositions, with tourist arrivals in East Asia & Pacific and Europe & Central Asia severely impacted compared to the rest of the world. But during 2007–2009 the compositional structure did not change enough for CTSOE to identify these years as compositional outliers. Even though the global financial crisis in 2008 caused economies to contract everywhere, leading to a decrease in tourism in 2009, its effect on the compositional structure is not large enough to generate a compositional outlier. This is further
validated by the null space coordinates in Figure 8.
Figure 11: Total international tourist arrivals with univariate outliers identified by UTSOE shown in dashed
lines.
4.2 COVID-19 data in India
For this analysis we use the dataset from Kaggle (Kaggle, SRK 2020), which contains daily COVID-19 data in
India. This dataset contains daily cases for 36 states and union territories in India from the 30th of January till
the 2nd of August 2020. For each day the counts are given by x_t = (x_t^1, x_t^2, ..., x_t^36)^T, of which many entries are zero for the initial time period. As in the previous example, we make this data compositional by considering

z_t = (z_t^1, z_t^2, ..., z_t^36)^T = ( x_t^1 / Σ_j x_t^j, x_t^2 / Σ_j x_t^j, ..., x_t^36 / Σ_j x_t^j )^T.
Then we transform z_t using the nullspace coordinate transformation so that the coordinates are unconstrained. Suppose the i-th basis vector is w_i = (w_i^1, w_i^2, ..., w_i^36)^T. Then its components are given by

w_i^j = −0.16667 if j = 1,
w_i^j = 0.97619 if j = i + 1,
w_i^j = −0.02381 otherwise,

for i ∈ {1, ..., 35}. The use of our transformation is of particular importance in this example, as there is a large
number of days with zero COVID-19 cases. We input the original coordinates to CTSOE, our compositional
time series outlier ensemble. Figure 12 gives the outlying dates according to CTSOE using 2 decomposition components. We see that the 3rd of March is the most outlying date, followed by the 2nd and the 4th of March.
Figure 12: The output of CTSOE on COVID-19 India data from 30 Jan 2020 to 2 August 2020 with the highest
total score highlighted.
Figure 13 shows 8 null space coordinates, selected using the biplots, with a dashed line on March 4. We see
that from March 2 to March 4, the selected null space coordinates exhibit a significant increase, which aligns
with the CTSOE outliers.
Figure 13: Eight null space coordinates of COVID-19 India data, as suggested by the biplot results, with a dashed line on March 4.
Figure 14 shows the four decomposition coordinates with two components for each decomposition method.
Again, we see big vertical jumps at outlying time points. The corresponding biplots are shown in Figure 15.
They show that both PCA and ICA combine X3 with X4, and X7 with X9 along roughly opposing directions.
Comparison with the null space coordinate time series in Figure 13 shows that these coordinates are similar
to each other: both X3 and X4 show a sharp peak followed by a quick drop and approximately flat behavior,
while X7 and X9 capture more of the variation at later times. It is interesting to note the different projections
found for the other two decomposition methods, for example the second component identified by DOBIN is
dominated by a single null space coordinate, X1. The biplot for ICS shows strong correlation, and Figure 14
confirms that both components have very similar temporal patterns. We also see that the end-on-the-line point
is not tagged as an outlier in the biplots. This point corresponds to the initial time steps with zero entries,
and is clearly visible in the biplot due to the connecting edges. Finally we note that the tour view, available at
https://uschilaa.github.io/animations/composits2.html, is not so useful for this data, as the scale
is dominated by the highly outlying time points and most views are not informative.
Figure 14: DOBIN, ICA, ICS and PCA coordinates of COVID-19 India data with outlying time points shown by vertical lines.
Table 2 gives the apportioned outlier scores for the states and union territories on the outlying dates. By
inspection we see that Kerala, Telangana, Rajasthan and Uttar Pradesh have relatively high outlying scores for
these dates. Figure 16 shows the map of India with the apportioned scores for the outlying dates.
We also examine the univariate time series outliers on total cases. Figure 17 gives the output of the univariate outlier ensemble UTSOE on daily log totals. March 2nd and March 4th are identified as outliers by 3 out of 4 methods. From Figure 17 we see there are big jumps on the 2nd and the 4th of March. As such, the changes in
Figure 15: PCA, DOBIN, ICA and ICS biplots of Indian COVID-19 data, with outlying time points shown in red. Outlying points are associated with long edges in the biplot.
Figure 16: Outlier scores apportioned to different states and union territories in India.
compositions have affected the total, giving rise to outliers both in total values and at a compositional level.
Table 2: Outlier scores apportioned to Indian states and union territories
State 2020-03-02 2020-03-03 2020-03-04
Kerala 1.961 2.009 1.938
Telangana 0.940 0.755 0.543
Delhi 0.241 0.222 0.194
Rajasthan 0.712 0.793 0.773
Uttar Pradesh 0.607 0.548 0.431
Haryana 0.381 0.377 0.352
Ladakh 0.278 0.238 0.183
Tamil Nadu 0.265 0.283 0.254
Karnataka 0.376 0.313 0.236
Maharashtra 0.588 0.662 0.639
Punjab 0.295 0.214 0.120
Jammu and Kashmir 0.117 0.093 0.064
Andhra Pradesh 0.116 0.107 0.090
Uttarakhand 0.209 0.153 0.087
Odisha 0.450 0.317 0.158
Puducherry 0.135 0.090 0.038
West Bengal 0.253 0.205 0.136
Chhattisgarh 0.260 0.186 0.099
Chandigarh 0.213 0.147 0.071
Gujarat 0.360 0.306 0.217
Himachal Pradesh 0.368 0.258 0.131
Madhya Pradesh 0.424 0.329 0.212
Bihar 0.150 0.125 0.087
Manipur 0.095 0.073 0.047
Mizoram 0.345 0.239 0.117
Andaman and Nicobar Islands 0.298 0.207 0.102
Goa 0.555 0.382 0.182
Unassigned 0.332 0.233 0.118
Assam 0.034 0.039 0.039
Jharkhand 0.467 0.323 0.155
Arunachal Pradesh 0.098 0.068 0.034
Tripura 0.094 0.069 0.039
Nagaland 0.339 0.233 0.111
Meghalaya 0.263 0.180 0.083
Dadra and Nagar Haveli and Daman and Diu 0.107 0.073 0.034
Sikkim 0.037 0.027 0.015
4.3 Spanish deaths
As our last application, we use the daily mortality counts in Spain organized by autonomous communities
(Ministry of Science and Innovation, Spain 2020). This dataset records mortality counts from the 18th of April
2018 until the 31st of July 2020 and is shown in Figure 18.
Similar to the previous example, we divide the daily mortality counts x_t = (x_t^1, x_t^2, ..., x_t^19)^T by their sum and
Figure 17: Log total of COVID-19 cases in India with univariate outliers identified by 3 methods in UTSOE
shown in dashed lines.
Figure 18: Mortality proportions in Spanish states from April 2018 to July 2020.
make it compositional:
z_t = (z_t^1, z_t^2, ..., z_t^19)^T = ( x_t^1 / Σ_j x_t^j, x_t^2 / Σ_j x_t^j, ..., x_t^19 / Σ_j x_t^j )^T.
Then we transform 𝒛𝑡using the nullspace coordinate transformation. The null space transformation has 18
basis vectors, with the i-th vector w_i = (w_i^1, w_i^2, ..., w_i^19)^T having the components

w_i^j = −0.22942 if j = 1,
w_i^j = 0.95719 if j = i + 1,
w_i^j = −0.04281 otherwise,
for i ∈ {1, ..., 18}. Again, this dataset contains zeros, which can be handled by the null space coordinate transformation. Figure 19 gives the outlying dates according to CTSOE using 2 decomposition components per decomposition method.
Figure 19: The output of CTSOE on Spanish mortality proportion data from April 18 2018 to July 31 2020
with the highest outlier score highlighted.
Figure 20 shows the null space coordinates with a dashed line on March 19, 2020. We see that there is a
visible spike around this date for many coordinates. Figure 21 shows the decomposition plots for the four
decomposition methods using two components. Again, we see that there is a spike around March 19 for all
four decomposition methods.
The biplots for all decomposition methods are shown in Figure 22. They show that the majority of time points
are normally distributed, apart from a group of outlying points that include most tagged outliers. This can also
be observed when looking at the null space coordinates in a tour, which is available at https://uschilaa.
Figure 20: Null space coordinates of Spanish mortality proportion data with a dashed line at March 19.
Figure 21: DOBIN, ICA, ICS and PCA coordinates of daily mortality proportions in Spain.
Figure 22: PCA, DOBIN, ICA and ICS biplots of Spanish mortality data, with outlying time points shown in red. Outlying points are associated with long edges in the biplot.
github.io/animations/composits3.html. Interestingly, DOBIN has captured outliers only in the first
component, with the second component dominated only by X1 and without interesting patterns (as confirmed
by the time series in Figure 21). Note that the DOBIN decomposition did not contribute to the tagged outliers
as confirmed in Figure 19. We again observe similarities between PCA and ICA decompositions: both methods
contrast X12 with X9 in one direction, and pick X8 as the other important direction.
Table 3 gives the apportioned scores for different autonomous communities for the three most outlying dates
as per total scores. By inspection we see that Madrid, Catalonia and Andalusia have higher outlier scores com-
pared to others. Figure 23 shows the map of mainland Spain with outlier scores apportioned to the autonomous
communities at these outlying time points.
Next we examine the univariate time series outliers. Figure 24 gives the output of the univariate outlier ensemble UTSOE on daily totals. Consecutive days from March 21 to March 28 are identified as outliers by 3 or 4 methods. We see that the univariate time series outliers have affected the compositions in this example. However, the most outlying compositional outlier, which was on March 19, was not identified as an outlier using the daily totals by 3 or more outlier methods. This shows that the changes in compositions contributed to
Table 3: Outlier scores apportioned to Spanish states
State 2020-03-19 2020-03-22 2020-03-20
Andalusia 0.448 0.484 0.431
Aragon 0.103 0.069 0.068
Principality of Asturias 0.077 0.074 0.067
Balearic Islands 0.054 0.050 0.046
Canary Islands 0.216 0.188 0.177
Cantabria 0.387 0.282 0.280
Castile and León 0.290 0.229 0.222
Castile-La Mancha 0.367 0.309 0.291
Catalonia 0.474 0.594 0.492
Valencian Community 0.295 0.255 0.238
Extremadura 0.140 0.107 0.105
Galicia 0.316 0.279 0.260
Community of Madrid 0.729 0.570 0.529
Region of Murcia 0.218 0.175 0.169
Navarre 0.050 0.039 0.038
Basque Country 0.177 0.138 0.134
La Rioja 0.115 0.084 0.083
Ceuta 0.145 0.106 0.106
Melilla 0.085 0.063 0.063
Figure 23: Outlier scores apportioned to different Spanish autonomous communities.
a compositional outlier before it was picked up as an outlier by the daily totals.
Figure 24: Total deaths in Spain with univariate outliers identified by UTSOE using 3 or more methods shown
in dashed lines.
5 Conclusions
Working with compositional data presents challenges as well as opportunities, as this is an area of research that
still contains unexplored territories. In this paper, we tackled the problem of outlier detection in the context
of compositional time series data and presented an ensemble method for outlier detection that is applicable
to univariate, multivariate, compositional and non-compositional time series. Our ensemble method copes
well with the characteristics of time series data and the peculiarities of compositional data. In addition, our
method is able to deal with the presence of zeros in the composites, a well-known problem when working with compositional data. For that, we presented a coordinate transformation that maps points from the constrained simplex S^n, where compositional data naturally live, to the unconstrained space R^{n−1} while preserving Euclidean distances and angles between points. The advantage of this transformation is that it can handle zero entries in compositional data, unlike the logratio transformations.
In summary, the ensemble method presented in this paper starts by transforming the compositional coordinates from the simplex to the unconstrained real space of dimension n − 1. From there, four dimension reduction methods (PCA, DOBIN, ICS and ICA) are applied to the data, so that the dimension of the original compositional dataset is reduced to a dimension that can be set by the user. In this reduced space, the original dataset is represented by only a few components computed by each dimension reduction technique. Next, we apply four existing time series outlier detection methods (tsoutliers, forecast, anomalize and otsad) to each component obtained from the four dimension reduction methods. This provides us with a list of time outliers, and a score is calculated for each outlier. The scores are then combined into an ensemble score using appropriate weights for each of the outlier detection methods, decomposition components and decomposition methods. After finding the outlying time points, we apportion the scores back to the compositions so that we have a better understanding of the outlying constituents. In other words, we can identify which time series in the original data contributed to the detected outliers.
The ensemble method is validated using a simulation as well as three real-world datasets, which showed promising results. Furthermore, visualization techniques for the different steps in the ensemble method are provided, including the dimension reduction results, as well as the time outliers and corresponding time series. The visualization methods, including the animations, are designed to aid interpretability of the final results and to help better understand the intermediate steps of the ensemble method. The ensemble method, the coordinate transformation and the visualization methods are all implemented and available in the accompanying R package
composits.
Acknowledgments
This work was supported in part by the Australian Government through the Australian Research Council.
6 Supplementary Material
R package composits: This package contains functionality for the null space basis construction, the outlier
ensembles CTSOE, MTSOE and UTSOE, the outlier score apportionment, visualization methods and the com-
positional data simulation.
Animations: The animations for the three real datasets are available at https://uschilaa.github.io/
animations/.
Datasets: The synthetic dataset can be generated by using functionality of composits. The World Bank
dataset is available at the website https://data.worldbank.org/indicator/ST.INT.ARVL. The Kaggle
dataset is available at https://www.kaggle.com/sudalairajkumar/covid19-in-india and the Spanish
dataset is included in the package composits and is available at https://momo.isciii.es/public/momo/dashboard/momo_dashboard.html#datos.
Scripts: The scripts Figures_For_Paper_1.R and Figures_For_Paper_5.R contain the code used in Section 3. The
script Figures_For_Paper_2.R contains the code used for the international tourism data example. The scripts
Figures_For_Paper_3.R and Figures_For_Paper_4.R contain the code used for the Indian and Spanish examples, respectively. These scripts are available at our GitHub repository https://github.com/sevvandi/
composits_paper.
References
Aires, F., Chédin, A. & Nadal, J.-P. (2000), ‘Independent component analysis of multivariate time series: Ap-
plication to the tropical SST variability’, Journal of Geophysical Research: Atmospheres 105(D13), 17437–
17455.
Aitchison, J. (1982), ‘The statistical analysis of compositional data’, Journal of the Royal Statistical Society:
Series B (Methodological) 44(2), 139–160.
Aitchison, J. (1983), ‘Principal component analysis of compositional data’, Biometrika 70(1), 57–65.
Aminikhanghahi, S. & Cook, D. J. (2017), ‘A survey of methods for time series change point detection’, Knowledge and Information Systems 51(2), 339–367.
Anton, H. & Rorres, C. (2013), Elementary Linear Algebra: Applications Version, John Wiley & Sons.
Asimov, D. (1985), ‘The Grand Tour: A Tool for Viewing Multidimensional Data’, SIAM Journal on Scientific and Statistical Computing 6(1), 128–143.
Baragona, R. & Battaglia, F. (2007), ‘Outliers detection in multivariate time series by independent component
analysis’, Neural Computation 19(7), 1962–1984.
Borchers, H. W. (2019), pracma: Practical Numerical Math Functions. R package version 2.2.9.
URL: https://CRAN.R-project.org/package=pracma
Brunsdon, T. M. & Smith, T. (1998), ‘The time series analysis of compositional data’, Journal of Official
Statistics 14(3), 237.
Buja, A., Cook, D., Asimov, D. & Hurley, C. (2005), Computational methods for high-dimensional rotations in data visualization, Vol. 24 of Handbook of Statistics, Elsevier, pp. 391–413.
URL: http://www.sciencedirect.com/science/article/pii/S0169716104240147
Chayes, F. (1960), ‘On correlation between variables of constant sum’, Journal of Geophysical Research 65(12), 4185–4193.
Chen, C. & Liu, L.-M. (1993), ‘Joint estimation of model parameters and outlier effects in time series’, Journal
of the American Statistical Association 88(421), 284–297.
Comon, P. (1994), ‘Independent component analysis, a new concept?’, Signal Processing 36(3), 287–314.
Dancho, M. & Vaughan, D. (2019), anomalize: Tidy Anomaly Detection. R package version 0.2.0.
URL: https://CRAN.R-project.org/package=anomalize
de Lacalle, J. L. (2019), tsoutliers: Detection of Outliers in Time Series. R package version 0.6-8.
URL: https://CRAN.R-project.org/package=tsoutliers
Filzmoser, P. & Hron, K. (2008), ‘Outlier detection for compositional data using robust methods’, Mathemati-
cal Geosciences 40(3), 233–248.
Gu, M., Fei, J. & Sun, S. (2020), ‘Online anomaly detection with sparse Gaussian processes’, Neurocomputing 403, 383–399.
Hyndman, R. J. & Khandakar, Y. (2008), ‘Automatic time series forecasting: the forecast package for R’,
Journal of Statistical Software 26(3), 1–22.
Iturria, A., Carrasco, J., Herrera, F., Charramendieta, S. & Intxausti, K. (2019), otsad: Online Time Series
Anomaly Detectors. R package version 0.2.0.
URL: https://CRAN.R-project.org/package=otsad
Kaggle, SRK (2020), ‘COVID-19 in India’.
URL: https://www.kaggle.com/sudalairajkumar/covid19-in-india
Kandanaarachchi, S. & Hyndman, R. J. (2020), ‘Dimension reduction for outlier detection using DOBIN’, Journal of Computational and Graphical Statistics (Accepted).
URL: https://robjhyndman.com/publications/dobin/
Kandanaarachchi, S., Menendez, P., Laa, U. & Loaiza-Maya, R. (2020), composits: Compositional, Multivari-
ate and Univariate Time Series Outlier Ensemble. R package version 0.0.0.9000.
URL: https://github.com/sevvandi/composits
Ministry of Science and Innovation, Spain (2020), ‘Deaths from all causes in excess, by population group.
Spain’.
URL: https://momo.isciii.es/public/momo/dashboard/momo_dashboard.html
Pearson, K. (1897), ‘Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs’, Proceedings of the Royal Society of London 60(359-367), 489–498.
Scealy, J. L. & Welsh, A. H. (2011), ‘Regression for compositional data by using distributions defined on the
hypersphere’, Journal of the Royal Statistical Society. Series B: Statistical Methodology 73(3), 351–375.
Templ, M., Hron, K. & Filzmoser, P. (2017), ‘Exploratory tools for outlier detection in compositional data with
structural zeros’, Journal of Applied Statistics 44(4), 734–752.
Tyler, D. E., Critchley, F., Dümbgen, L. & Oja, H. (2009), ‘Invariant co-ordinate selection’, Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 71(3), 549–592.
Unwin, A. (2019), ‘Multivariate outliers and the O3 plot’, Journal of Computational and Graphical Statistics 28(3), 635–643.
Wang, H., Bah, M. J. & Hammad, M. (2019), ‘Progress in Outlier Detection Techniques: A Survey’, IEEE
Access 7, 107964–108000.
Wickham, H., Cook, D., Hofmann, H. & Buja, A. (2011), ‘tourr: An R package for exploring multivariate data
with projections’, Journal of Statistical Software 40(2), 1–18.
URL: http://www.jstatsoft.org/v40/i02/
Zimek, A., Campello, R. J. & Sander, J. (2014), ‘Ensembles for unsupervised outlier detection: challenges and
research questions a position paper’, ACM SIGKDD Explorations Newsletter 15(1), 11–22.