
Outliers in compositional time series data

Sevvandi Kandanaarachchi1, Patricia Menéndez2,

Rubén Loaiza-Maya2, Ursula Laa2,3

1School of Science, Mathematical Sciences, RMIT University, Melbourne VIC 3000, Australia.

2Department of Econometrics and Business Statistics, Monash University, Clayton, VIC 3800, Australia.

3School of Physics and Astronomy, Monash University, Clayton, VIC 3800, Australia.

Abstract

This paper proposes an outlier detection ensemble for compositional time series data that can also be applied to multivariate and univariate time series. In addition, we propose a coordinate transformation from the simplex $S^n$ to $\mathbb{R}^{n-1}$ that allows zero-valued composites and preserves Euclidean distances and angles. We test this framework on simulated compositional time series and three real applications. The simulation study confirms that our approach accurately identifies the compositional outliers, while the empirical applications illustrate how the approach helps to identify the COVID-19 outbreaks in India and Spain. The R package composits implements the outlier detection ensemble and the coordinate transformation presented in this paper.

Key words — compositional time series, outlier detection, anomaly detection, multivariate time series, ensembling, outlier detection ensembles

1 Introduction

Outliers tell a story different from the norm. In our data-rich world, outlier detection methods are used in

diverse societal applications such as detecting intrusions, identifying emerging terrorist plots in social media

and detecting fetal anomalies in pregnancies. Outlier detection methods are constantly developed to cater for

varied aspects of data such as non-stationarity in data streams (Gu et al. 2020). The steady growth of outlier

detection methods has also contributed to a growing body of literature on outlier detection ensembles (Zimek

et al. 2014, Unwin 2019).

In this study we investigate time series outliers in compositional data and propose an ensemble method to

detect them. Compositional data refers to quantitative data that describes parts of a whole; speciﬁcally a

vector of positive elements that adds up to a constant, typically one. Although compositional data can arise

in the context of a Dirichlet distribution, it also appears naturally in geological, financial, economic and

biological applications to name a few. Examples include the study of geochemical composition of rocks and

sediments, the proportion of tourist arrivals from different countries, the study of shares portfolio composition

and the abundances of microbes in microbiome data. Compositional data appears in any problem where the

composition relative to the total is of importance.

Karl Pearson was the ﬁrst statistician to look at this kind of data. In his study (Pearson 1897) he analyzed

spurious correlation between proportions and raised awareness about the problems of using standard statistical

methods for analyzing proportions. After that, other authors explored different aspects of compositional data

(Chayes 1960). However, it was not until the 1980s, when John Aitchison published two seminal works in this area (Aitchison 1982, 1983), that a framework and the principles for the analysis of compositional data were set up.

One of the many important contributions that Aitchison brought to the field of compositional data analysis was the representation via the simplex. That is, for compositional data with $n$ components, these components can be understood as elements of the simplex $S^n$, where

$$S^n = \left\{ \boldsymbol{x} = (x_1, x_2, \ldots, x_n) \;\middle|\; x_i \in \mathbb{R}_{>0},\ i = 1, 2, \ldots, n,\ \sum_{i=1}^{n} x_i = c,\ c \in \mathbb{R} \right\}.$$

The vector $\boldsymbol{x}$ represents an $n$-part composition within the simplex $S^n$, which is an $(n-1)$-dimensional affine subspace of $\mathbb{R}^n$. This forms the foundation of what is known as Aitchison geometry (Aitchison 1982).

Working in the simplex with that geometry poses challenges when trying to use many of the well-established multivariate techniques that are designed to work with unconstrained data in $\mathbb{R}^n$. Therefore, Aitchison introduced transformations (Aitchison 1982) so that the simplex, where compositions are naturally represented, can be mapped into an unconstrained space. Three different transformations from the simplex $S^n$ to $\mathbb{R}^{n-1}$ or $\mathbb{R}^n$ using logratios were proposed: the additive logratio transformation (alr), the centred logratio transformation (clr) and the isometric logratio transformation (ilr). For a point on the simplex $\boldsymbol{x} = (x_1, x_2, \ldots, x_n) \in S^n$ and non-zero $x_j$, the additive logratio transformation (alr) is given by

$$\boldsymbol{x}' = \left(x'_1, \ldots, x'_{n-1}\right) = \left( \log\frac{x_1}{x_j}, \ldots, \log\frac{x_{j-1}}{x_j}, \log\frac{x_{j+1}}{x_j}, \ldots, \log\frac{x_n}{x_j} \right).$$

Similarly, the centred logratio transformation (clr) is given by

$$\boldsymbol{x}' = \left(x'_1, \ldots, x'_n\right) = \left( \log\frac{x_1}{\sqrt[n]{\prod_{j=1}^{n} x_j}}, \ldots, \log\frac{x_n}{\sqrt[n]{\prod_{j=1}^{n} x_j}} \right),$$

and the isometric logratio transformation (ilr) is given by

$$\boldsymbol{x}' = \left(x'_1, \ldots, x'_{n-1}\right) \quad \text{with} \quad x'_i = \sqrt{\frac{i}{i+1}}\, \log\frac{\sqrt[i]{\prod_{j=1}^{i} x_j}}{x_{i+1}} \quad \text{for } i < n.$$
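For illustration, the three logratio transformations can be sketched numerically. The paper's accompanying software is in R; below is a minimal Python/NumPy sketch (the function names `alr`, `clr` and `ilr` are ours) that follows the formulas above for strictly positive compositions.

```python
import numpy as np

def alr(x, j=-1):
    """Additive logratio: log of each remaining part over the reference part x_j."""
    x = np.asarray(x, dtype=float)
    j = j % len(x)
    return np.log(np.delete(x, j) / x[j])

def clr(x):
    """Centred logratio: log of each part over the geometric mean of all parts."""
    x = np.asarray(x, dtype=float)
    gmean = np.exp(np.mean(np.log(x)))
    return np.log(x / gmean)

def ilr(x):
    """Isometric logratio: x'_i = sqrt(i/(i+1)) * log(gmean(x_1..x_i) / x_{i+1})."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    out = np.empty(n - 1)
    for i in range(1, n):  # i = 1, ..., n-1
        gmean_i = np.exp(np.mean(np.log(x[:i])))
        out[i - 1] = np.sqrt(i / (i + 1)) * np.log(gmean_i / x[i])
    return out

x = np.array([0.2, 0.3, 0.5])   # a strictly positive point on the simplex S^3
```

Note that all three sketches fail for zero-valued components, which is exactly the limitation discussed next.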

While these transformations set the foundations and have vastly contributed to the advancement of compositional data analysis, they are not without limitations. An obvious limitation is the inability to incorporate zeros in the data. The issue of compositions with zeros was also discussed in Aitchison (1982), and solutions were proposed depending on the "cause of the zeros". The solutions included amalgamation, replacement and Box-Cox transformations. Amalgamation comprises reducing the number of compositional components by grouping different components together so that there are no zeros in the resulting composition. From an outlier detection perspective, this approach can mask outliers because certain components are added together. Other proposed solutions include replacing zeros by very small values or applying Box-Cox transformations to the proportions. However, both these approaches have an adverse effect on outlier detection because they introduce artificial outliers as a result of applying logarithms and Box-Cox transformations to very small values (Templ et al. 2017). A different approach to handle zeros was developed by Scealy & Welsh (2011), which involved square-root transformations so that compositional data is mapped onto a hypersphere and distributions for directional data can be used. Their focus was on spherical regression using Kent distributions. However, taking the square root may dampen large outliers.

The outlier detection literature, which includes contributions from statisticians and computer scientists, is constantly expanding. The reader is referred to Wang et al. (2019) for a recent survey on the subject. Outlier detection in time series overlaps with change point detection, which also has a rich history and a wealth of approaches to tackle the problem (Aminikhanghahi & Cook 2017). In particular, outlier detection in the context of compositional data has also received some attention in the last few years. For example, Brunsdon & Smith (1998) proposed methods to analyze compositional time series, and Filzmoser & Hron (2008) studied outlier detection in compositional data using robust methods.

Of particular interest is the work by Templ et al. (2017), which proposes methods for detecting outliers in

compositional data with zeros. In their study they explore the zero structure separately using subcompositions

determined by their zero patterns. Their main focus is on the subset of observations that has zeros. As part of

their analysis, they preprocess that subset such that the non-zero components are assigned 1, while zeros are

left as they are. Then they compute the binary Principal Component (PC) space and plot the observations in this

2-dimensional space. In addition to the binary PC space, they also use imputation methods and Mahalanobis

distances on these subcompositions.

In this paper, we propose a compositional coordinate transformation from the simplex $S^n$ to $\mathbb{R}^{n-1}$, which preserves Euclidean distances and angles between points and is agnostic to zeros. Furthermore, we propose a time series outlier detection ensemble for compositional data that uses this transformation. Even though our focus is on compositional time series, the outlier ensemble presented in this paper can also be used on univariate and multivariate time series that are non-compositional. Additionally, we make this work available in the R package composits (Kandanaarachchi et al. 2020).

The remainder of the paper is organised as follows. We discuss the coordinate transformation from the simplex $S^n$ to $\mathbb{R}^{n-1}$ in Section 2.1. We note that this coordinate transformation can also be used on compositional data that is not time dependent. After transforming the data to $\mathbb{R}^{n-1}$, we proceed to find outliers using an ensemble of time series outlier detection methods. We use the time series outlier detection methods available in the R packages forecast (Hyndman & Khandakar 2008), tsoutliers (de Lacalle 2019), otsad (Iturria et al. 2019) and anomalize (Dancho & Vaughan 2019) to build our ensemble. The compositional time series outlier ensemble is discussed in Section 2.2, which contains the univariate and multivariate ensembles. The multivariate and compositional time series outlier detection methods comprise decomposing the multivariate data into univariate series using Principal Component Analysis (PCA), Independent Component Analysis (ICA) (Comon 1994), Invariant Coordinate Selection (ICS) (Tyler et al. 2009) and DOBIN (Kandanaarachchi & Hyndman 2020), and applying the univariate time series outlier ensemble to these decomposed series. After finding the outlying time points, we apportion the outlier scores back to the compositions as explained in Section 2.3. In Section 2.4 we explore visualization methods, including animations using the R package tourr (Wickham et al. 2011). Then, we test our compositional time series outlier ensemble on simulated compositional time series in Section 3. In Section 4 we apply our outlier ensemble to three real-world datasets: international tourism data from the World Bank, COVID-19 data in India from Kaggle, and Spanish daily mortality data from the Spanish Ministry of Science and Innovation under the daily deaths monitoring program (Ministry of Science and Innovation, Spain 2020). Finally, we discuss our conclusions in Section 5.

2 Methodology

2.1 Nullspace coordinates on the simplex

Consider the point $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)$ on the simplex $S^n$, i.e. $\sum_{j=1}^{n} x_j = 1$. Although there are $n$ coordinates to describe $\boldsymbol{x}$, these are constrained as they add up to a constant. We remove this constraint by finding a new set of basis vectors for the simplex.

Figure 1: The hyperplane given by the equation $(\boldsymbol{x} - \boldsymbol{a}) \cdot \boldsymbol{\nu} = 0$.


Consider a hyperplane given by the equation $\boldsymbol{\nu} \cdot (\boldsymbol{x} - \boldsymbol{a}) = 0$, where $\boldsymbol{\nu}$ is the normal (perpendicular) vector to the hyperplane and $\boldsymbol{a}$ is a point on the hyperplane. Let $\boldsymbol{\nu} = (\nu_1, \ldots, \nu_n)^T$ and $\boldsymbol{a} = (a_1, \ldots, a_n)^T$ represent the aforementioned vectors. Then, expanding the equation of the hyperplane, we get

$$\boldsymbol{\nu} \cdot \boldsymbol{x} = \boldsymbol{\nu} \cdot \boldsymbol{a}, \qquad \nu_1 x_1 + \nu_2 x_2 + \cdots + \nu_n x_n = \tilde{c}, \qquad (1)$$

where $\boldsymbol{\nu} \cdot \boldsymbol{a}$ equals a constant $\tilde{c}$. Since $\boldsymbol{x}$ satisfies $x_1 + x_2 + \cdots + x_n = 1$, by comparing with equation (1) we see that $\boldsymbol{\nu} = (1, 1, \ldots, 1)^T$. We choose the barycenter $\boldsymbol{c} = \left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right)^T$ as the point $\boldsymbol{a}$ for the remainder of the computation. We note that any point on the simplex $S^n$ would be appropriate as the point $\boldsymbol{a}$. Thus, for every point $\boldsymbol{x} \in S^n$,

$$\boldsymbol{x} - \boldsymbol{c} \perp \boldsymbol{\nu}, \qquad \boldsymbol{\nu}^T(\boldsymbol{x} - \boldsymbol{c}) = 0, \qquad (\boldsymbol{x} - \boldsymbol{c}) \in \operatorname{Null}(\boldsymbol{\nu}^T),$$

where $\operatorname{Null}$ denotes the null space. Hence, a basis for the null space of $\boldsymbol{\nu}^T$ can be used to describe the vectors $\boldsymbol{x} - \boldsymbol{c}$ for all $\boldsymbol{x} \in S^n$. In order to find a basis for the null space of $\boldsymbol{\nu}^T$, we consider the equation

$$\boldsymbol{\nu}^T \boldsymbol{y} = 0, \quad \text{i.e.} \quad y_1 + \cdots + y_n = 0.$$

As there are $n-1$ free parameters, by considering $y_j = s_j$, where $s_j$ denotes a free parameter for $j \in \{1, \ldots, n-1\}$, we have

$$y_n = -\sum_{j=1}^{n-1} s_j,$$

giving us

$$\boldsymbol{y} = \begin{pmatrix} s_1 \\ s_2 \\ \vdots \\ -\sum_{j=1}^{n-1} s_j \end{pmatrix} = s_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ -1 \end{pmatrix} + s_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ -1 \end{pmatrix} + \cdots + s_{n-1} \begin{pmatrix} 0 \\ \vdots \\ 1 \\ -1 \end{pmatrix}.$$

Therefore, the set

$$\mathcal{B}_1 = \left\{ \begin{pmatrix} 1 \\ 0 \\ \vdots \\ -1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ \vdots \\ -1 \end{pmatrix}, \ldots, \begin{pmatrix} 0 \\ \vdots \\ 1 \\ -1 \end{pmatrix} \right\}$$

gives a basis for the null space of $\boldsymbol{\nu}^T$. We can then perform a Gram-Schmidt orthogonalisation (Anton & Rorres 2013) to make this basis orthonormal. Let $\mathcal{B}_1 = \{\boldsymbol{u}_1, \boldsymbol{u}_2, \ldots, \boldsymbol{u}_{n-1}\}$ be the current basis for the null space as above. Then the Gram-Schmidt orthogonalisation is performed by defining

$$\boldsymbol{w}_1 = \frac{\boldsymbol{u}_1}{\|\boldsymbol{u}_1\|},$$

and letting $\tilde{\boldsymbol{u}}_2 = \boldsymbol{u}_2 - \langle \boldsymbol{u}_2, \boldsymbol{w}_1 \rangle \boldsymbol{w}_1$. This makes $\tilde{\boldsymbol{u}}_2 \perp \boldsymbol{w}_1$, giving the second normalized basis vector

$$\boldsymbol{w}_2 = \frac{\tilde{\boldsymbol{u}}_2}{\|\tilde{\boldsymbol{u}}_2\|}.$$

Similarly, letting $\tilde{\boldsymbol{u}}_3 = \boldsymbol{u}_3 - \langle \boldsymbol{u}_3, \boldsymbol{w}_1 \rangle \boldsymbol{w}_1 - \langle \boldsymbol{u}_3, \boldsymbol{w}_2 \rangle \boldsymbol{w}_2$, we have $\tilde{\boldsymbol{u}}_3 \perp \boldsymbol{w}_1$ and $\tilde{\boldsymbol{u}}_3 \perp \boldsymbol{w}_2$, giving the third normalized basis vector

$$\boldsymbol{w}_3 = \frac{\tilde{\boldsymbol{u}}_3}{\|\tilde{\boldsymbol{u}}_3\|}.$$

After computing $\boldsymbol{w}_i$, the next basis vector is found by letting

$$\tilde{\boldsymbol{u}}_{i+1} = \boldsymbol{u}_{i+1} - \langle \boldsymbol{u}_{i+1}, \boldsymbol{w}_1 \rangle \boldsymbol{w}_1 - \cdots - \langle \boldsymbol{u}_{i+1}, \boldsymbol{w}_i \rangle \boldsymbol{w}_i,$$

and defining

$$\boldsymbol{w}_{i+1} = \frac{\tilde{\boldsymbol{u}}_{i+1}}{\|\tilde{\boldsymbol{u}}_{i+1}\|}.$$

Consequently, we obtain an orthonormal basis $\mathcal{B} = \{\boldsymbol{w}_1, \boldsymbol{w}_2, \ldots, \boldsymbol{w}_{n-1}\}$ derived from $\mathcal{B}_1$. Software such as the R package pracma (Borchers 2019) can be used to find an orthonormal basis when computing the null space. Once we find an orthonormal basis for $\operatorname{Null}(\boldsymbol{\nu}^T)$, we can compute the coordinates of all vectors $\boldsymbol{x} - \boldsymbol{c}$ in this new basis. Let $\boldsymbol{y} \in \operatorname{Null}(\boldsymbol{\nu}^T)$, and let $\boldsymbol{y}_O$ denote the coordinates of $\boldsymbol{y}$ in the original basis and $\boldsymbol{y}_B$ the coordinates of $\boldsymbol{y}$ in the null space basis $\mathcal{B}$. Then, as illustrated in Figure 2,

$$\boldsymbol{y} = \langle \boldsymbol{y}, \boldsymbol{w}_1 \rangle \boldsymbol{w}_1 + \langle \boldsymbol{y}, \boldsymbol{w}_2 \rangle \boldsymbol{w}_2 + \cdots + \langle \boldsymbol{y}, \boldsymbol{w}_{n-1} \rangle \boldsymbol{w}_{n-1},$$

giving the coordinates of $\boldsymbol{y}$ in the null space basis $\mathcal{B}$ as

$$\boldsymbol{y}_B = \left(\langle \boldsymbol{y}, \boldsymbol{w}_1 \rangle, \langle \boldsymbol{y}, \boldsymbol{w}_2 \rangle, \ldots, \langle \boldsymbol{y}, \boldsymbol{w}_{n-1} \rangle\right)^T.$$

Figure 2: The coordinates of $\boldsymbol{y}$ in the orthonormal basis $\{\boldsymbol{w}_1, \boldsymbol{w}_2\}$ are $(\langle \boldsymbol{y}, \boldsymbol{w}_1 \rangle, \langle \boldsymbol{y}, \boldsymbol{w}_2 \rangle)$, i.e. $\boldsymbol{y} = \langle \boldsymbol{y}, \boldsymbol{w}_1 \rangle \boldsymbol{w}_1 + \langle \boldsymbol{y}, \boldsymbol{w}_2 \rangle \boldsymbol{w}_2$.

Let us denote by $X - C$ the matrix which contains the vectors $\boldsymbol{x} - \boldsymbol{c}$ as column vectors; that is, $X - C$ is an $n \times N$ matrix. Let $B$ denote the matrix containing the basis vectors of $\mathcal{B}$ as column vectors, i.e. $B$ is an $n \times (n-1)$ matrix. Then the null space coordinates are given by the column vectors of

$$\tilde{X} = B^T(X - C).$$

We summarize the computation of the null space coordinates for $\boldsymbol{x} \in S^n$ below:

1. For all $\boldsymbol{x} \in S^n$, compute $\boldsymbol{x} - \boldsymbol{c}$, where $\boldsymbol{c} = \left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right)^T$.

2. Construct an orthonormal matrix $B$, which contains the basis vectors for the null space of $\boldsymbol{\nu}^T$.

3. For a point $\boldsymbol{x}$ on the simplex, transform the coordinates as $\tilde{\boldsymbol{x}} = B^T(\boldsymbol{x} - \boldsymbol{c})$.
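The steps above can be sketched numerically. The composits package implements this in R; the following Python/NumPy sketch (function names are ours) obtains an orthonormal null space basis via the SVD rather than explicit Gram-Schmidt, which yields a different but equally valid orthonormal basis. Note that zero-valued components pose no problem here.

```python
import numpy as np

def nullspace_basis(n):
    """Orthonormal basis (columns of B) for the null space of nu^T = (1, ..., 1)."""
    nu = np.ones((1, n))
    _, _, vt = np.linalg.svd(nu)   # rows 2..n of V^T span Null(nu^T)
    return vt[1:].T                # n x (n - 1) matrix with orthonormal columns

def nullspace_coords(X):
    """Map points on the simplex (rows of X) to R^{n-1} via x_tilde = B^T (x - c)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    c = np.full(n, 1.0 / n)        # barycenter of the simplex
    B = nullspace_basis(n)
    return (X - c) @ B             # N x (n - 1) null space coordinates

# Zero-valued components are allowed, unlike with the logratio transformations.
X = np.array([[0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6],
              [1/3, 1/3, 1/3]])
Xt = nullspace_coords(X)
```

By construction the barycenter maps to the origin, and pairwise Euclidean distances are unchanged, as the lemmata below establish.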

Next, we show that the Euclidean distances and angles between points are not changed by this coordinate transformation.

Lemma 2.1. For any two points $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ on the simplex $S^n$, we have $\|\boldsymbol{x}_i - \boldsymbol{x}_j\| = \|\boldsymbol{x}'_i - \boldsymbol{x}'_j\|$, where we consider the Euclidean norm and $\boldsymbol{x}'$ denotes the coordinates of the point after the transformation. That is, the Euclidean distance between points before and after the transformation is the same.

Proof. Let $\bar{B} = [B\ \hat{\boldsymbol{\nu}}]$, where $\hat{\boldsymbol{\nu}}$ denotes the unit vector in the direction of $\boldsymbol{\nu}$ and $B$ is the matrix containing the basis vectors of $\mathcal{B}$. As $\boldsymbol{\nu}$ is orthogonal to the column vectors in $B$, $\bar{B}$ is an $n \times n$ orthogonal matrix. Let

$$\boldsymbol{x}'' = \bar{B}^T(\boldsymbol{x} - \boldsymbol{c}) \qquad (2)$$

denote the coordinates of $\boldsymbol{x} - \boldsymbol{c}$ in $\mathbb{R}^n$ using $\bar{B}$. Given that $\boldsymbol{x}' = B^T(\boldsymbol{x} - \boldsymbol{c})$, we have

$$\boldsymbol{x}'' = \begin{pmatrix} \boldsymbol{x}' \\ 0 \end{pmatrix} \qquad (3)$$

because the last coordinate of $\boldsymbol{x}''$ is equal to $\hat{\boldsymbol{\nu}} \cdot (\boldsymbol{x} - \boldsymbol{c})$, which equals zero as $\boldsymbol{\nu}$ is perpendicular to $\boldsymbol{x} - \boldsymbol{c}$. Using equation (2) we have

$$\boldsymbol{x}''_i - \boldsymbol{x}''_j = \bar{B}^T(\boldsymbol{x}_i - \boldsymbol{x}_j),$$

giving us

$$\begin{aligned}
\|\boldsymbol{x}''_i - \boldsymbol{x}''_j\|^2 &= \left\langle \boldsymbol{x}''_i - \boldsymbol{x}''_j,\ \boldsymbol{x}''_i - \boldsymbol{x}''_j \right\rangle \\
&= \left(\boldsymbol{x}''_i - \boldsymbol{x}''_j\right)^T \left(\boldsymbol{x}''_i - \boldsymbol{x}''_j\right) \\
&= \left(\boldsymbol{x}_i - \boldsymbol{x}_j\right)^T \bar{B}\bar{B}^T \left(\boldsymbol{x}_i - \boldsymbol{x}_j\right) \\
&= \left(\boldsymbol{x}_i - \boldsymbol{x}_j\right)^T \left(\boldsymbol{x}_i - \boldsymbol{x}_j\right) \\
&= \|\boldsymbol{x}_i - \boldsymbol{x}_j\|^2,
\end{aligned}$$

where we have used $\bar{B}\bar{B}^T = I$: as $\bar{B}^T\bar{B} = I$ and $\bar{B}$ is an $n \times n$ matrix, we have $\bar{B}^{-1} = \bar{B}^T$, giving $\bar{B}\bar{B}^T = I$. From equation (3) we know that $\|\boldsymbol{x}''\| = \|\boldsymbol{x}'\|$, which completes the proof.

Lemma 2.2. For any three points $\boldsymbol{x}_i$, $\boldsymbol{x}_j$ and $\boldsymbol{x}_k$ on the simplex $S^n$, we have

$$\left\langle \boldsymbol{x}'_i - \boldsymbol{x}'_k,\ \boldsymbol{x}'_j - \boldsymbol{x}'_k \right\rangle = \left\langle \boldsymbol{x}_i - \boldsymbol{x}_k,\ \boldsymbol{x}_j - \boldsymbol{x}_k \right\rangle,$$

where we consider the standard Euclidean inner product and $\boldsymbol{x}'$ denotes the coordinates of the point after the transformation.

Proof. Similar to Lemma 2.1, we work with $\bar{B} = [B\ \hat{\boldsymbol{\nu}}]$. Using equation (2) and the fact that $\bar{B}\bar{B}^T = I$, we obtain

$$\begin{aligned}
\left\langle \boldsymbol{x}''_i - \boldsymbol{x}''_k,\ \boldsymbol{x}''_j - \boldsymbol{x}''_k \right\rangle &= \left(\boldsymbol{x}''_i - \boldsymbol{x}''_k\right)^T \left(\boldsymbol{x}''_j - \boldsymbol{x}''_k\right) \\
&= \left(\boldsymbol{x}_i - \boldsymbol{x}_k\right)^T \bar{B}\bar{B}^T \left(\boldsymbol{x}_j - \boldsymbol{x}_k\right) \\
&= \left(\boldsymbol{x}_i - \boldsymbol{x}_k\right)^T \left(\boldsymbol{x}_j - \boldsymbol{x}_k\right) \\
&= \left\langle \boldsymbol{x}_i - \boldsymbol{x}_k,\ \boldsymbol{x}_j - \boldsymbol{x}_k \right\rangle.
\end{aligned}$$

As $\left\langle \boldsymbol{x}''_i - \boldsymbol{x}''_k,\ \boldsymbol{x}''_j - \boldsymbol{x}''_k \right\rangle = \left\langle \boldsymbol{x}'_i - \boldsymbol{x}'_k,\ \boldsymbol{x}'_j - \boldsymbol{x}'_k \right\rangle$ from equation (3), we get the desired result.

Lemmata 2.1 and 2.2 together tell us that the angles between points are preserved by this proposed coordinate

transformation. This completes the discussion on the coordinate transformation. Next we look at the outlier

ensemble.

2.2 Compositional time series outlier ensemble (CTSOE)

The compositional time series outlier ensemble consists of the following components: 1. the null space coordinate transformation, 2. the multivariate outlier ensemble and 3. the univariate outlier ensemble. A schematic diagram of the compositional time series outlier ensemble is given in Figure 3.

Using the null space coordinate transformation discussed in Section 2.1, we transform the compositional time series data to $\mathbb{R}^{n-1}$, resulting in a multivariate time series without any constraints. We further decompose this multivariate time series into univariate time series using 4 decomposition methods and identify outliers using 4 time series outlier detection techniques, which are listed on the CRAN task view:

1. tsoutliers by de Lacalle (2019) implements the algorithms by Chen & Liu (1993) for detecting outliers

in time series. They consider a combination of ARIMA models with hypothesis testing on the residuals

and ﬁnd additive outliers, level shifts, temporary changes, innovational outliers and seasonal level shifts.

2. forecast by Hyndman & Khandakar (2008) identiﬁes outliers in the residuals of the whitened time

series. They use Friedman’s super smoother supsmu for non-seasonal time series and a periodic seasonal

trend decomposition using LOESS (STL) for seasonal time series. Points are labeled as outliers if they lie outside ±1.5 × the interquartile range (IQR).

3. anomalize by Dancho & Vaughan (2019) implements outlier detection using remainders from trend

and/or seasonal decomposition of time series based on the interquartile range or the generalized extreme

studentized deviation tests.

4. otsad by Iturria et al. (2019) implements online fault detectors for time series using two-stage shift-detection methods, which identify candidate outliers using different algorithms such as shift-detection based on the exponentially weighted moving average (SD-EWMA). They then further test these candidate outliers using a Kolmogorov-Smirnov test before labeling them as outliers.


[Figure 3 diagram: the compositional (MV) time series is mapped by the nullspace coordinate transformation (NSC) to an unconstrained MV time series, which is decomposed into univariate series by PCA, DOBIN, ICS and ICA; UTSOE is applied to each of the $q$ components, the component scores are combined into PC, DOBIN, ICS and ICA scores, and these are combined into the final outlier scores (MTSOE).]

Figure 3: Schematic diagram of the compositional time series outlier ensemble CTSOE with the nullspace coordinate transformation NSC and the multivariate time series outlier ensemble MTSOE. The light orange colored rectangles denote processes and gray colored parallelograms denote input and output data.


As the univariate time series are used to identify outliers, we discuss the univariate outlier ensemble next.

2.2.1 Univariate time series outlier ensemble (UTSOE)

Consider a univariate time series $\{x_t\}_{t=1}^{N}$, for which we use the four outlier detection methods tsoutliers, forecast, anomalize and otsad to identify outliers. Let us denote the outlier indicator variable of the $j$th method by $\{y^j_t\}_{t=1}^{N}$, where

$$y^j_t = \begin{cases} 1 & \text{if } x_t \text{ is identified as an outlier by the } j\text{th method}, \\ 0 & \text{otherwise}. \end{cases}$$

Some outlier detection methods may identify outliers sparingly, while others identify a string of outliers. Outlier detection methods that identify fewer outliers are generally preferred, as outliers are considered rare observations. As such, to construct an ensemble score, we want to give a higher weight to methods that identify few outliers and a lower weight to methods that identify more outliers. To achieve this, we assign weights to the outlier detection methods based on both their level of agreement with other methods and the total number of outliers identified by that method. Suppose $O_j$ denotes the set of outliers identified by method $j$. Then the weight for method $j$, $\xi_j$, is

$$\xi_j = \frac{\left|O_j \cap \bigcup_{k \neq j} O_k\right|}{|O_j|} = \frac{\text{number of common outliers with other methods}}{\text{number of outliers identified by method } j}. \qquad (4)$$

Using the method weights $\xi_j$, we obtain the univariate time series outlier score as

$$y_t = \sum_{j=1}^{4} \xi_j\, y^j_t.$$

That is, the outlier score of each time point corresponds to a weighted average score of the methods that have identified it as an outlier, with larger scores corresponding to higher ranked outliers.
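As a minimal illustration of equation (4) and the resulting ensemble score, consider the following Python/NumPy sketch (the indicator matrix and the function name are hypothetical; the paper's implementation is in the R package composits).

```python
import numpy as np

def utsoe_scores(indicators):
    """Combine binary outlier indicators from several methods into one score.

    indicators: N x J array, indicators[t, j] = 1 if method j flags time t.
    Weight xi_j = |O_j intersect (union of other O_k)| / |O_j|, as in eq. (4).
    """
    Y = np.asarray(indicators, dtype=float)
    N, J = Y.shape
    xi = np.zeros(J)
    for j in range(J):
        O_j = Y[:, j] > 0
        if O_j.sum() == 0:
            continue                          # method flagged nothing: weight 0
        others = (np.delete(Y, j, axis=1) > 0).any(axis=1)
        xi[j] = (O_j & others).sum() / O_j.sum()
    return Y @ xi                             # y_t = sum_j xi_j * y^j_t

# Four hypothetical methods flagging outliers in a 10-point series
ind = np.zeros((10, 4), dtype=int)
ind[5, :] = [1, 1, 1, 0]   # time 5 is flagged by three agreeing methods
ind[2, 3] = 1              # method 4 flags time 2 alone, so it gets weight 0
scores = utsoe_scores(ind)
```

A method that flags outliers no other method agrees with receives weight zero, so its lone flags do not contribute to the ensemble score.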

In addition to $\{y_t\}_{t=1}^{N}$, we also provide the raw scores

$$Y_t = \left(y^1_t, y^2_t, y^3_t, y^4_t\right),$$

with $Y_t \in \mathbb{R}^4$ and $y^j_t \in \mathbb{R}$. As we will show, these scores are useful in a multivariate setting. We refer to this univariate time series outlier ensemble as UTSOE in the remainder of the paper. Next we look at the multivariate time series outlier ensemble.

2.2.2 Multivariate time series outlier ensemble (MTSOE)

To find outliers in a multivariate time series, we first decompose it into univariate time series.

Decomposition of multivariate time series to univariate

We employ the following 4 dimension reduction techniques to decompose a multivariate time series into univariate components:

1. Principal Component Analysis (PCA)

2. Independent Component Analysis (ICA)

ICA (Comon 1994) is originally a signal processing technique that separates or unmixes a number of

signals that are collected together. ICA ﬁnds independent components of the mixed signal. A well

known application of ICA is the “cocktail party problem”, where ICA is used to separate the mixed noise

of many people talking.

3. Invariant Coordinate Selection (ICS)

ICS (Tyler et al. 2009) is a method for exploring data by comparing different estimates of scatter that

identiﬁes a system of independent components used to represent the data in a lower dimensional space.

4. DOBIN

DOBIN (Kandanaarachchi & Hyndman 2020) is a dimension reduction method speciﬁcally targeted for

outlier detection.

While none of these methods is specially designed for time series data, they can be used to decompose multivariate time series into univariate components (Baragona & Battaglia 2007, Aires et al. 2000). After decomposing the multivariate time series, we apply UTSOE to each univariate component. We then consider the first $q$ univariate series resulting from each decomposition method, where $q$ is set to two dimensions by default in our algorithm and can be changed by the user.


Using the univariate ensemble UTSOE

Let $\{\boldsymbol{x}_t\}_{t=1}^{N}$ denote the original multivariate time series, i.e. for each $t$, $\boldsymbol{x}_t \in \mathbb{R}^n$. So, we can represent our original multivariate time series as a matrix of dimensions $N \times n$. As we decompose this multivariate time series into univariate series using four methods and use the first $q$ components, this results in a three-dimensional array of decomposed time series of dimension $N \times 4 \times q$. We denote this time series object by $Z_{N \times 4 \times q}$, where the first dimension denotes time, the second denotes the decomposition methods and the third the decomposed components. Thus, we have $4q$ univariate time series as a result of this decomposition.

Let $\boldsymbol{z}^k_t$ denote the multivariate decomposed time series using the $k$th decomposition method for $k \in \{1,2,3,4\}$, i.e. for each pair of $k$ and $t$, $\boldsymbol{z}^k_t \in \mathbb{R}^q$ with $q \leq n$. Let $z^{k,\ell}_t \in \mathbb{R}$ denote the $\ell$th value of $\boldsymbol{z}^k_t$ for fixed $k$ and $t$, with $\ell \leq q$. Therefore, $\{z^{k,\ell}_t\}_{t=1}^{N}$ is a univariate time series, where $k \in \{1,2,3,4\}$ and $\ell \in \{1, \ldots, q\}$.

We use UTSOE to identify outliers in the $4q$ univariate time series $\{z^{k,\ell}_t\}_{t=1}^{N}$ for $k \in \{1,2,3,4\}$ and $\ell \in \{1, \ldots, q\}$. As we use 4 outlier detection methods, our outlier scores will comprise a four-dimensional array of dimensions $N \times 4 \times q \times 4$. Let us call this array $Y_{N \times 4 \times q \times 4}$. The array $Y$ contains outlier scores for $Z$, with the first dimension of $Y$ denoting time, the second denoting the decomposition method, the third denoting the decomposed components and the fourth the outlier detection method.

Expanding $Y_{N \times 4 \times q \times 4}$ in the fourth dimension, we have $Y_{N \times 4 \times q \times 4} = \left[y^{k,\ell,1}_t\ \ y^{k,\ell,2}_t\ \ y^{k,\ell,3}_t\ \ y^{k,\ell,4}_t\right]$, with $y^{k,\ell,j}_t$ for fixed $j$ denoting a three-dimensional array of dimension $N \times 4 \times q$ containing the outlier scores of $Z$ for the $j$th outlier detection method. From here onward, the index $t$ denotes the time, $j$ the outlier method, $k$ the decomposition method and $\ell$ the decomposed component.

Combination of univariate time series outlier scores

Next we want to combine the UTSOE scores given by the 4-dimensional array $Y_{N \times 4 \times q \times 4}$ into an $N \times 1$ vector. That is, we need to combine the scores relating to i) the 4 decomposition methods, ii) the $q$ components of each decomposition method and iii) the 4 outlier detection methods. First we combine the scores of the 4-dimensional array $Y_{N \times 4 \times q \times 4}$ by outlier detection method and obtain a 3-dimensional array $Y_{N \times 4 \times q}$. For this task we use the outlier detection method weights described in equation (4). For $t \in \{1, \ldots, N\}$, $k \in \{1,2,3,4\}$, $\ell \in \{1, \ldots, q\}$ and fixed $j = j_0$, the outlier scores $y^{k,\ell,j_0}_t$ comprise an $N \times 4 \times q$ array. The array for $j_0 = 1$ gives the scores of the outlier detection method tsoutliers. Similarly, $j_0 = 2$ relates to forecast scores, $j_0 = 3$ to anomalize scores and $j_0 = 4$ to otsad scores. We find $\xi_{j_0}$ for each 3-dimensional array $y^{k,\ell,j_0}_t$ as in equation (4). That is, $\xi_{j_0}$ does not depend on $k$ or $\ell$; it only depends on $j_0$. So, each $\xi_{j_0}$ is computed over all decomposition methods and components. Using the weights $\xi_j$, we combine the univariate time series scores in $Y_{N \times 4 \times q \times 4}$ to $Y_{N \times 4 \times q}$ as follows:

$$y^{k,\ell}_t = \sum_{j=1}^{4} \xi_j\, y^{k,\ell,j}_t, \qquad (5)$$

where $y^{k,\ell}_t$ are elements of the 3-dimensional array $Y_{N \times 4 \times q}$, which has outlier scores for $N$ time points, 4 decomposition methods and $q$ components for each decomposition method.

Next we recognize that for the decomposition methods PCA, ICS and DOBIN, the information contained in the components decreases for later components, i.e. the 50th component is not as important as the 1st component. However, for ICA this is not the case, because each component is independent. Therefore we use different weighting mechanisms to combine the $q$ decomposition scores of each method.

Let the weights for the $q$ components of the decomposition method $k = k_0$ be given by $w^{k_0}_1, w^{k_0}_2, \ldots, w^{k_0}_q$. Then for each decomposition method $k = k_0$ we choose weights such that $\sum_{\ell=1}^{q} w^{k_0}_\ell = 1$. We use the weighting schemes given below for the four decomposition methods:

1. For PCA and ICS we choose the weights $w^{k_0}_\ell = \lambda_\ell / \sum_\ell \lambda_\ell$, where $\lambda_\ell$ denotes the eigenvalue associated with the $\ell$th principal component/eigenvector, and $k_0 = 1$ for PCA and $k_0 = 3$ for ICS.

2. For DOBIN we use decreasing weights proportional to $1, \frac{1}{2}, \ldots, \frac{1}{q}$, i.e. $w^{k_0}_\ell = \frac{1}{\ell} \left( \sum_{\ell=1}^{q} \frac{1}{\ell} \right)^{-1}$, with $k_0 = 2$.

3. For ICA we use a constant weight for all components, i.e. $w^{k_0}_\ell = \frac{1}{q}$, with $k_0 = 4$.

Using these weights $w^k_\ell$, we combine the outlier scores as follows:

$$y^k_t = \sum_{\ell=1}^{q} w^k_\ell\, y^{k,\ell}_t,$$

where $y^k_t$ denotes the outlier scores using the $k$th decomposition method. The sum of these four vectors gives the final ensemble outlier scores,

$$y_t = y^1_t + y^2_t + y^3_t + y^4_t,$$

where $\boldsymbol{y} = \{y_t\}_{t=1}^{N}$ is a vector of length $N$ containing the outlier scores of the multivariate time series $\{\boldsymbol{x}_t\}_{t=1}^{N}$. We call this outlier ensemble MTSOE.
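The weighting schemes and the score combination can be sketched as follows, assuming the method-weighted score array $Y_{N \times 4 \times q}$ of equation (5) has already been computed (the function names and the toy eigenvalues are ours; the paper's implementation is in R).

```python
import numpy as np

def component_weights(q, method, eigvals=None):
    """Component weights w^k_l for each decomposition method, summing to 1."""
    if method in ("pca", "ics"):           # eigenvalue-proportional weights
        lam = np.asarray(eigvals, dtype=float)
        return lam / lam.sum()
    if method == "dobin":                  # weights proportional to 1, 1/2, ..., 1/q
        w = 1.0 / np.arange(1, q + 1)
        return w / w.sum()
    if method == "ica":                    # constant weights
        return np.full(q, 1.0 / q)
    raise ValueError(method)

def mtsoe_combine(Y, weights):
    """Combine an N x 4 x q score array into the final scores y_t.

    Y[t, k, l]: method-weighted score of component l of decomposition k;
    weights: list of 4 q-vectors, one per decomposition method.
    """
    Y = np.asarray(Y, dtype=float)
    per_method = np.stack([Y[:, k, :] @ weights[k] for k in range(4)], axis=1)
    return per_method.sum(axis=1)          # y_t = y_t^1 + y_t^2 + y_t^3 + y_t^4

N, q = 8, 2
weights = [component_weights(q, "pca", eigvals=[3.0, 1.0]),
           component_weights(q, "dobin"),
           component_weights(q, "ics", eigvals=[2.0, 2.0]),
           component_weights(q, "ica")]
Y = np.zeros((N, 4, q))
Y[5] = 1.0                                 # every component flags time 5
y = mtsoe_combine(Y, weights)
```

Because each weight vector sums to one, a time point flagged with score 1 by every component of every decomposition method attains the maximum combined score of 4.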


Acknowledging multiple testing

As we consider 4 decomposition methods, the univariate outlier ensemble UTSOE finds outliers in $4q$ time series. For each time series, UTSOE employs 4 outlier detection methods. Therefore, each time point gets tested for outlyingness $4 \times 4q$ times. As a result of this multiple testing procedure, we get unwanted outliers; that is, time points which are actually non-outlying get marked as outliers.

In order to ascertain the true outliers, we compare the outlier scores of the time series with an outlier-removed version of the time series. This 'comparison' time series is constructed in the following way. First we count the number of outliers $M$ identified by MTSOE, which comprise the union of outliers identified by UTSOE. If $M \leq N/10$, where $N$ is the total number of time points in the time series, we remove the observations at these outlying time points from the time series and interpolate the resulting time series linearly at the missing time points so that there are no sudden jumps. If $M > N/10$, we only remove the top $\lceil N/10 \rceil$ outlying time points according to the outlier score. Again, the resulting time series is interpolated linearly at the missing time points. Once we have this comparison time series, we compute its outlier scores using MTSOE. We compute the 95th percentile and the maximum outlier score of the comparison time series and define the difference as the gap $g$. For the identified outliers of the original time series, we compute the gap score $g_s$ as

$$g_s(y_t) = \left[\frac{y_t - \max_m c_m}{g}\right]_+ \qquad (6)$$

where $\boldsymbol{y} = \{y_t\}_{t=1}^{N}$ represents the outlier scores of the original time series, $c$ represents the outlier scores of the comparison time series, and $[x]_+$ equals $x$ if $x$ is positive and $0$ otherwise. Thus, we have a set of gap scores resulting from the comparison time series. Outlier scores with higher gap scores can be considered more significant than the others.

However, we add a word of caution regarding the gap scores. Outliers are difﬁcult to deﬁne (Unwin 2019) and

a single deﬁnition does not suit all applications. As such, the gap scores should not be taken as the ‘ideal’ truth

in determining the signiﬁcance of the outliers. Rather, they should be taken for what they are – comparison

scores from a similar time series without outliers. Consequently, while we give a shorter outlier list containing

outliers with positive gap scores, we make all non-zero outlier scores available, so that a user-deﬁned cut-off

can be employed.
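A minimal Python sketch of the gap score of equation (6), assuming the outlier scores of the original and comparison series have already been computed by the ensemble (the toy scores are hypothetical, and the degenerate case $g = 0$ is not handled):

```python
import numpy as np

def gap_scores(y, c):
    """Gap scores of equation (6): y are the MTSOE scores of the original
    series, c those of the outlier-removed comparison series."""
    y = np.asarray(y, dtype=float)
    c = np.asarray(c, dtype=float)
    g = c.max() - np.percentile(c, 95)    # the gap g of the comparison series
    gs = (y - c.max()) / g
    return np.where(gs > 0, gs, 0.0)      # [x]_+ : keep only the positive part

# Hypothetical ensemble scores: the original series has one strong outlier
y = np.array([0.1, 0.2, 4.0, 0.1, 0.3])
c = np.array([0.1, 0.2, 0.35, 0.1, 0.3])  # comparison scores, outlier removed
gs = gap_scores(y, c)
```

Only scores exceeding the maximum comparison score receive a positive gap score; all others are truncated to zero, matching the shorter outlier list described above.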


2.3 Apportioning time outliers to covariates

At this juncture we have discussed the methodology to identify outliers in time. In a multivariate or a compositional setting we are also interested in recognising the variables or compositional units that contribute to the time-outliers. We achieve this by apportioning the outlier scores to the multivariate or compositional coordinates by performing the inverse of the coordinate transformations discussed in Section 2.2.2.

For compositional data on the simplex $S^n$, the unconstrained coordinates lie in $\mathbb{R}^{n-1}$. Let $B$ denote the $n \times (n-1)$ matrix containing the nullspace basis coordinates, where each column contains a basis vector. Similarly, let $P_{\text{PCA}}$, $P_{\text{DOB}}$, $P_{\text{ICS}}$ and $P_{\text{ICA}}$ denote the $(n-1) \times q$ matrices containing the PCA, DOBIN, ICS and ICA basis vectors, respectively. Then the compositional coordinates $\boldsymbol{x} \in \mathbb{R}^n$ get transformed to the PCA space as

$$\tilde{\boldsymbol{x}} = \boldsymbol{x} - \boldsymbol{c}, \qquad \boldsymbol{z}_{\text{PCA}} = P_{\text{PCA}}^T B^T \tilde{\boldsymbol{x}},$$

giving

$$Z_{\text{PCA}} = P_{\text{PCA}}^T B^T \tilde{X} = (B P_{\text{PCA}})^T \tilde{X},$$

where $\tilde{X}$ is an $n \times N$ matrix with each column corresponding to a compositional data point. Similarly, we obtain the coordinates after performing the DOBIN, ICS and ICA decomposition methods as follows:

$$Z_{\text{DOB}} = P_{\text{DOB}}^T B^T \tilde{X} = (B P_{\text{DOB}})^T \tilde{X}, \qquad Z_{\text{ICS}} = P_{\text{ICS}}^T B^T \tilde{X} = (B P_{\text{ICS}})^T \tilde{X}, \qquad Z_{\text{ICA}} = P_{\text{ICA}}^T B^T \tilde{X} = (B P_{\text{ICA}})^T \tilde{X}. \qquad (7)$$

The univariate outlier ensemble UTSOE ﬁnds outliers from these coordinates. As such, we can associate the

outlier scores with these coordinate spaces.

The 3-dimensional array $Y_{N \times 4 \times q}$ discussed in equation (5) contains outlier scores weighted by the outlier method but not weighted by the decomposition component weights. It comprises four $N \times q$ matrices stacked along the second dimension, each containing the outlier scores for the $q$ PC, DOBIN, ICS and ICA components respectively. Let us call these $N \times q$ matrices $Y_{\text{PCA}}$, $Y_{\text{DOB}}$, $Y_{\text{ICS}}$ and $Y_{\text{ICA}}$. We recall that the weights of the $q$ PC, DOBIN, ICS and ICA components are $\boldsymbol{w}_1$, $\boldsymbol{w}_2$, $\boldsymbol{w}_3$ and $\boldsymbol{w}_4$ respectively, where each $\boldsymbol{w}_k$ is a $q$-vector. Then the weighted outlier scores $W$ are obtained by

$$
W_{\text{PCA}} = Y_{\text{PCA}} \operatorname{diag}(\boldsymbol{w}_1), \quad
W_{\text{DOB}} = Y_{\text{DOB}} \operatorname{diag}(\boldsymbol{w}_2), \quad
W_{\text{ICS}} = Y_{\text{ICS}} \operatorname{diag}(\boldsymbol{w}_3), \quad
W_{\text{ICA}} = Y_{\text{ICA}} \operatorname{diag}(\boldsymbol{w}_4),
$$

where $\operatorname{diag}(\boldsymbol{w}_k)$ represents the $q \times q$ matrix with the weights $w_k^\ell$ on its diagonal, and all $W$ matrices are of size $N \times q$. By associating the weighted outlier scores with the decomposition space we can transform the outlier scores to the original compositional space by performing the appropriate coordinate transformation as follows:

$$
A_{\text{PCA}} = B P_{\text{PCA}} W_{\text{PCA}}^T, \quad
A_{\text{DOB}} = B P_{\text{DOB}} W_{\text{DOB}}^T, \quad
A_{\text{ICS}} = B P_{\text{ICS}} W_{\text{ICS}}^T, \quad
A_{\text{ICA}} = B P_{\text{ICA}} W_{\text{ICA}}^T, \tag{8}
$$

where $B$ is an $n \times (n-1)$ matrix, $P$ an $(n-1) \times q$ matrix, $W$ an $N \times q$ matrix and $A$ an $n \times N$ matrix. We note that this is the back transformation corresponding to multiplying by the matrix $(B P_{XXX})^T$ in equation (7), where $XXX$ denotes the dimension reduction method. We associate the matrix $A$ with the compositional space, with each column of $A$ representing the transformed scores of one data point. However, the transformed scores can be positive or negative. The sign of the scores depends on the choice of basis vectors and their signs: for example, if a basis vector of $B$ or $P$ is multiplied by $-1$, then the sign of the resulting coordinates in $A$ changes. As such, the magnitude of the transformed scores is important, not their sign. Using these transformed scores we obtain

$$
A_{\text{TOT}} = |A_{\text{PCA}}| + |A_{\text{DOB}}| + |A_{\text{ICS}}| + |A_{\text{ICA}}|,
$$

where $|A_{XXX}|$ denotes the matrix obtained by taking the absolute value of each element of $A_{XXX}$. We call $A_{\text{TOT}}$ the matrix of apportioned scores. By inspecting $A_{\text{TOT}}$ we can see which composites or covariates contribute to the time-outliers.

Removing the matrix $B$ in equations (7) and (8) gives the apportioned scores for the multivariate setting.
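The apportionment steps, weighting $Y$ by $\operatorname{diag}(\boldsymbol{w}_k)$, mapping back through $B P$ as in equation (8), and summing absolute values, can be sketched end to end. All matrices below are synthetic stand-ins, not the composits implementation:

```python
import numpy as np

# Weight outlier scores Y (N x q) by component weights w, map them back to the
# compositional space via A = B P W^T, and accumulate absolute values across
# the four decomposition methods, as in the A_TOT definition.
rng = np.random.default_rng(2)
n, N, q = 5, 10, 2

B, _ = np.linalg.qr(rng.normal(size=(n, n - 1)))   # stand-in nullspace basis

A_total = np.zeros((n, N))
for _ in range(4):                                 # four decomposition methods
    P, _ = np.linalg.qr(rng.normal(size=(n - 1, q)))
    Y = rng.uniform(size=(N, q))                   # outlier scores per component
    w = rng.uniform(size=q)                        # component weights
    W = Y @ np.diag(w)                             # weighted scores, N x q
    A = B @ P @ W.T                                # apportioned scores, n x N
    A_total += np.abs(A)                           # only magnitudes matter

assert A_total.shape == (n, N)
assert np.all(A_total >= 0)
```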

2.4 Visualization

Visualization of the results provides an important cross-check of the algorithm, and is necessary for the in-

terpretation of the tagged outliers in the context of the full dataset. The multivariate nature of the time series

considered makes this challenging, and it can be important to explore a range of diagnostic graphics, corre-

sponding to the different stages in the ensemble algorithm.

First, we highlight that we are working with three types of coordinate systems: the original compositional

coordinates, the null space coordinates corresponding to the unconstrained representation of the time series,

and ﬁnally the different coordinate systems obtained from the four decomposition methods. Generally the

preferred option for diagnostics will be to visualize the decomposed time series components, because it reduces

dimensionality and corresponds to the input series analyzed by the UTSOE. However, the other two coordinate

representations may be important for the interpretation of the results.

To deal with the high-dimensional nature of the time series, we consider four different approaches:

1. Univariate time series displays: selecting a single coordinate representation and showing one component, or several components mapped to color; or using faceting to display multiple coordinate representations.

2. “Biplot” displays: showing the ﬁrst two components selected by any of the decomposition methods,

together with the axes representation of the corresponding projection matrix from the original or null

space coordinate system. Each observation in time is represented as a scatter point, and tagged outliers

can be highlighted. Using a biplot we can compare the outlying points to the overall distribution, and

understand the connection with the original coordinates from reading the axes display. However, the

scatter plot display means that temporal patterns are lost in this visualization. One option is to use lines

connecting the points in time to show the temporal pattern, but this can lead to overloading of the graph.

An alternative is to only connect tagged outliers to the two neighboring time points.

3. Tour display: tour methods (Asimov 1985, Buja et al. 2005) allow the visualization of high-dimensional

data through an animated sequence of low-dimensional projections. Here we use the tourr package (Wickham et al. 2011) to generate a sequence of two-dimensional projections. As with biplots,

observations in time are shown as scatter points in the multivariate space, and we highlight tagged outliers, allowing them to be connected to their neighbors to give an understanding of the temporal pattern. This

display can be used with any coordinate representation of the data, and using the decomposed time series

components may still be preferred for high-dimensional input, but requires more than the default two

components.

4. Scores over time: it is useful to visualize the apportioned scores, to interpret and understand them in

context. Here we consider the special case of working with spatial data and show the apportioned scores

for selected outliers on a map.

3 Simulation exercise

We use a simulation to test the method and understand its behavior. In this exercise we generate a compositional time series vector $\boldsymbol{z}_t = (z_t^1, \ldots, z_t^n)^T$, for $t = 1, \ldots, N$, with $z_t^i \in [0, 1]$ and $\sum_{i=1}^n z_t^i = 1$. This process has both time series persistence, which resembles real compositional data, and an outlier generation process, whose outliers will be detected using the proposed outlier detection ensemble. To specify the true data generating process (DGP) of $\boldsymbol{z}_t$, we first consider the multiple time series vector $\boldsymbol{x}_t = (x_t^1, \ldots, x_t^n)^T \in \mathbb{R}^n$. This vector has the state-space dynamics

$$
\boldsymbol{x}_t = A \boldsymbol{r}_t, \tag{9}
$$
$$
\boldsymbol{r}_t = \boldsymbol{\mu} + B \boldsymbol{r}_{t-1} + D \boldsymbol{\varepsilon}_t + C \boldsymbol{b}_t, \tag{10}
$$

where $\boldsymbol{r}_t = (r_t^1, \ldots, r_t^K)^T$ is a $K$-dimensional vector of underlying factors (or states) driving $\boldsymbol{x}_t$, $A$ is an $n \times K$ matrix of factor loadings, $\boldsymbol{\varepsilon}_t \sim N(0, I_K)$ and $\boldsymbol{b}_t = (b_{t,1}, \ldots, b_{t,K})^T$, with $b_{t,k} \sim \text{Bernoulli}(p)$. In Equation (9)

the dimensionality of the problem is reduced from the $n \times 1$ vector $\boldsymbol{x}_t$ to the $K \times 1$ vector $\boldsymbol{r}_t$, where $K < n$. The first three terms on the right-hand side of Equation (10) specify an autoregressive process of order one for $\boldsymbol{r}_t$. The fourth term $C\boldsymbol{b}_t$ has the role of inducing outliers in the dynamics of $\boldsymbol{r}_t$, with the magnitude of those outliers determined by the $K \times K$ matrix $C$, and their probability of occurrence determined by the scalar $p \in [0, 1]$. The autoregressive process in $\boldsymbol{r}_t$ and the term $C\boldsymbol{b}_t$ allow $\boldsymbol{x}_t$ to have time series persistence and an outlier generation process. The compositional time series vector $\boldsymbol{z}_t$ is constructed from $\boldsymbol{x}_t$ so that

$$
z_t^i = \frac{\exp(x_t^i)}{\sum_{j=1}^n \exp(x_t^j)}.
$$

In this exercise we consider the particular case where $n = 30$ and $K = 2$. The matrices required in the true DGP are selected to be

$$
\boldsymbol{\mu} = \begin{pmatrix} 0.3 \\ 0.7 \end{pmatrix}; \quad
B = \begin{pmatrix} 0.8 & 0 \\ 0 & 0.5 \end{pmatrix}; \quad
C = \begin{pmatrix} 5 & 0 \\ 0 & 4 \end{pmatrix}; \quad
D = \begin{pmatrix} 0.4 & 0 \\ 0 & 0.4 \end{pmatrix}.
$$

The elements of $A$ are independently generated from a univariate normal distribution with mean zero and a standard deviation of 0.3, and finally, $p = 0.005$. From this specification of the true DGP we generate $M = 1000$ time series, each of length $N = 500$. In addition to the random persistent outliers produced through $C\boldsymbol{b}_t$, we also add two discretionary outliers to each time series by setting $x_{117}^2 = \log(10)$ and $x_{40}^8 = \log(200)$.
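A sketch of this DGP with the parameter values above; the loadings $A$ and the shocks are freshly drawn, so this reproduces the mechanism rather than the exact simulated series used in the paper:

```python
import numpy as np

# Simulate equations (9)-(10): AR(1) factors with rare Bernoulli outlier
# shocks, mapped through loadings A and then to the simplex via a softmax.
rng = np.random.default_rng(3)
n, K, N, p = 30, 2, 500, 0.005

mu = np.array([0.3, 0.7])
Bm = np.diag([0.8, 0.5])
C = np.diag([5.0, 4.0])
D = np.diag([0.4, 0.4])
A = rng.normal(scale=0.3, size=(n, K))

r = np.zeros(K)
X = np.zeros((n, N))
for t in range(N):
    eps = rng.normal(size=K)
    b = rng.binomial(1, p, size=K)       # rare outlier indicator
    r = mu + Bm @ r + D @ eps + C @ b    # AR(1) states with outlier shocks
    X[:, t] = A @ r

Z = np.exp(X) / np.exp(X).sum(axis=0)    # softmax maps each column to the simplex
assert Z.shape == (n, N)
assert np.allclose(Z.sum(axis=0), 1.0)
assert np.all((Z >= 0) & (Z <= 1))
```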

Figure 4 displays one of the simulated compositional time series. Each colored line corresponds to a particular

composite. The two dashed lines represent the locations of the discretionary outliers, which only affect the time

series at single time points. The dotted lines represent the time location of the persistent outliers. Notice that

unlike the discretionary outliers, the effect of these type of outliers decay slowly over time. Even though these

outliers persist over time, we only consider the ﬁrst of these persisting time points as outyling in our labeled

data.

We apply the compositional time series outlier ensemble CTSOE to every generated time series. To account for

the persisting outliers we adjust the CTSOE scores as follows. For each time series, the detected observations

that are part of a time sequence that is decreasing in outlier score are not predicted to be outliers. For instance, if the time points $t = 20$, $t = 21$ and $t = 22$ are all detected as outliers by CTSOE, and $y_{20} > y_{21} > y_{22}$, then only $t = 20$ is predicted to be an outlier for the purpose of this exercise. Using the true and predicted outlier

time locations for each simulated time series, we measure the predictive accuracy in terms of the area under

the receiver operating characteristic curve (AUC). Figure 5 presents the histogram of the AUC based on the

𝑀=1000 iterations. It provides strong evidence that the proposed outlier detection method provides accurate

identiﬁcation of compositional outliers, as most of the probability mass in the histogram is located at values

greater than 0.9.
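The persistence adjustment and the AUC computation can be sketched as follows. The helper names and toy scores are our own, and the AUC is written in its rank (Mann-Whitney) form rather than as the authors' implementation:

```python
# Persistence adjustment: among consecutive detected time points whose outlier
# scores are decreasing, keep only the first one as a predicted outlier.
def first_of_decreasing_runs(times, scores):
    keep = []
    for i, t in enumerate(times):
        consecutive = i > 0 and times[i - 1] == t - 1
        if consecutive and scores[i] < scores[i - 1]:
            continue                      # later point in a decaying sequence
        keep.append(t)
    return keep

# AUC as the probability that a random true outlier outscores a random inlier.
def auc(labels, scores):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# t = 20, 21, 22 detected with decreasing scores: only t = 20 survives
assert first_of_decreasing_runs([20, 21, 22], [0.9, 0.7, 0.5]) == [20]
# a separate, non-consecutive detection is kept
assert first_of_decreasing_runs([20, 21, 40], [0.9, 0.7, 0.8]) == [20, 40]
# here every outlier outranks every inlier, so the AUC is 1
assert auc([0, 0, 1, 0, 1], [0.1, 0.4, 0.9, 0.2, 0.8]) == 1.0
```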


Figure 4: One of the simulated compositional time series. The vertical dashed lines represent the time locations of the discretionary outliers, while the dotted lines represent the time locations of the persistent outliers.

Figure 5: Histogram of the area under the ROC curve in the simulation exercise.


4 Applications

4.1 International tourism data

We use the World Bank data on the number of international tourist arrivals, which is available at https://data.worldbank.org/indicator/ST.INT.ARVL. This dataset contains yearly data from 1995 to 2018.

We use the seven geographically aggregated time series on regions East Asia & Paciﬁc (EAP), Europe &

Central Asia (ECA), Latin America & Caribbean (LAC), Middle East & North Africa (MENA), Sub-Saharan

Africa (SSA), South Asia (SA) and North America (NA). This data is shown in Figure 6.

Figure 6: International tourism arrivals for each region.

We make this a compositional time series by dividing the seven-dimensional time series by the total number of arrivals for each year. That is, if the original data for a certain year $t$ is given by $\boldsymbol{x}_t = (x_t^1, x_t^2, \ldots, x_t^7)^T$, then we compute the compositional time series as

$$
\boldsymbol{z}_t = (z_t^1, z_t^2, \ldots, z_t^7)^T
= \left( \frac{x_t^1}{\sum_j x_t^j}, \frac{x_t^2}{\sum_j x_t^j}, \ldots, \frac{x_t^7}{\sum_j x_t^j} \right)^T.
$$
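The closure step amounts to dividing each year's column of counts by its total; a minimal sketch with made-up arrival numbers, not World Bank data:

```python
import numpy as np

# Rows are regions, columns are years; dividing each column by its sum turns
# counts into compositional shares that sum to one.
x = np.array([[120.0, 130.0],
              [300.0, 310.0],
              [ 80.0,  90.0]])
z = x / x.sum(axis=0, keepdims=True)

assert z.shape == (3, 2)
assert np.allclose(z.sum(axis=0), 1.0)   # each year's shares sum to one
assert np.all(z >= 0)
```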

We use the compositional time series outlier ensemble CTSOE illustrated in Figure 3 on this data. First we

transform 𝒛𝑡using the nullspace coordinate transformation to obtain unconstrained data. The new coordinates

$\boldsymbol{\gamma}_t$ are computed using equation (11):

$$
\boldsymbol{\gamma}_t =
\begin{pmatrix}
-0.3779 & 0.8963 & -0.1036 & -0.1036 & -0.1036 & -0.1036 & -0.1036 \\
-0.3779 & -0.1036 & 0.8963 & -0.1036 & -0.1036 & -0.1036 & -0.1036 \\
-0.3779 & -0.1036 & -0.1036 & 0.8963 & -0.1036 & -0.1036 & -0.1036 \\
-0.3779 & -0.1036 & -0.1036 & -0.1036 & 0.8963 & -0.1036 & -0.1036 \\
-0.3779 & -0.1036 & -0.1036 & -0.1036 & -0.1036 & 0.8963 & -0.1036 \\
-0.3779 & -0.1036 & -0.1036 & -0.1036 & -0.1036 & -0.1036 & 0.8963
\end{pmatrix}
\begin{pmatrix}
z_t^1 \\ z_t^2 \\ z_t^3 \\ z_t^4 \\ z_t^5 \\ z_t^6 \\ z_t^7
\end{pmatrix}. \tag{11}
$$

Then CTSOE uses 𝜸𝑡as input to our multivariate time series outlier ensemble MTSOE. The output of CTSOE

is given in Figure 7. It gives the breakdown of identiﬁed outliers in terms of decomposition methods and outlier

detection methods. The highlighted cells in columns 2-5 contain non-zero scores of identiﬁed outliers obtained

from each decomposition method. The total score is the sum of these scores. The column headed Num_Coords

gives the number of decomposition methods that have contributed to the identiﬁcation of outliers. The four

columns with headings forecast to anomalize give the weighted scores of each outlier detection method. The

next column Num_Methods gives the number of outlier detection methods that identiﬁed each observation as

an outlier. The last column gives the Total Score, with the highest score highlighted. The column Gap_Score_2

gives the gaps as computed by equation (6).

Figure 7: The output of CTSOE on international tourism data from 1995 to 2018 showing outliers.

From Figure 7 we see that the year 2003 – the year of the SARS outbreak – was the most outlying year for

international tourism from 1995 to 2018 from a compositional perspective. The gap score for 2003 is 43, which

is much higher compared with other outlying years. The time series plots of the null space coordinates and

their decompositions shown in Figures 8 and 9 conﬁrm this ﬁnding.

To better understand the connection between the null space coordinates and the decompositions, we can look

at biplots; see Figure 10 for DOBIN, PCA and ICS biplots. We see that the outlying points, highlighted


Figure 8: Null space coordinates of international tourism data, with a dashed line at 2003.

Figure 9: DOBIN, ICA, ICS and PCA coordinates of international tourism data, with outlying time points

shown by vertical lines.

in red, are associated with long edges that show the temporal connection between the points. In particular,

we see that ICS, which has tagged all three outliers, reveals long edges for each of them. Reading the axes

displays across the plots, we ﬁnd that X1 and X4 are important for tagging the ﬁrst outlier (2003), while X5

is associated with the additional two outlying points. This is conﬁrmed with the null space time series shown


Table 1: Outlier scores apportioned to geographical regions

Region 2003 2010 2014

East Asia & Pacific 0.737 0.229 0.108

Europe & Central Asia 0.859 0.336 0.100

Latin America & Caribbean 0.281 0.100 0.144

Middle East & North Africa 0.291 0.112 0.075

Sub-Saharan Africa 0.171 0.055 0.127

South Asia 0.357 0.073 0.099

North America 0.355 0.120 0.096

in Figure 8. A more comprehensive overview of these connections is obtained with the tour display, which is

available at https://uschilaa.github.io/animations/composits1.html. This tour plot also reveals

the importance of X3, and shows that X6 does not contribute relevant information.

Figure 10: DOBIN, PCA and ICS biplots of international tourism data, with outlying time points shown in red. Outlying points are associated with long edges in the biplot.

Table 1 shows the apportioned outlier scores for the outlying years. For 2003, we see that the apportioned scores for East Asia & Pacific and Europe & Central Asia are much higher than those of the other regions, confirming our understanding of the SARS outbreak.

Figure 11 shows the total number of international arrivals, which is a univariate time series with outliers identified by UTSOE drawn using dashed lines. Again we see that 2003 is a global outlier, along with 2007, 2008 and 2009. However, 2007–2009 do not come up as compositional outliers in CTSOE in Figure 7. This is because a global change may not necessarily result in changes at a compositional level. Clearly, in 2003 the global dip in international tourism arrivals had an impact on the compositions, with tourism arrivals in East Asia & Pacific and Europe & Central Asia severely impacted compared to the rest of the world. But during 2007–2009, the compositional structure did not change enough for CTSOE to identify these years as compositional outliers. Even though the global financial crisis in 2008 caused economies to contract everywhere, decreasing tourism in 2009,


its effect on the compositional structure is not large enough to generate a compositional outlier. This is further

validated by the null space coordinates in Figure 8.

Figure 11: Total international tourist arrivals with univariate outliers identiﬁed by UTSOE shown in dashed

lines.

4.2 COVID-19 data in India

For this analysis we use the dataset from Kaggle (Kaggle, SRK 2020), which contains daily COVID-19 data in India. The dataset covers daily cases for the 36 states and union territories of India from the 30th of January to the 2nd of August 2020. For each day the counts are given by $\boldsymbol{x}_t = (x_t^1, x_t^2, \ldots, x_t^{36})^T$, of which many are zero entries for the initial time period. As in the previous example we make this data compositional by considering

$$
\boldsymbol{z}_t = (z_t^1, z_t^2, \ldots, z_t^{36})^T
= \left( \frac{x_t^1}{\sum_j x_t^j}, \frac{x_t^2}{\sum_j x_t^j}, \ldots, \frac{x_t^{36}}{\sum_j x_t^j} \right)^T.
$$

Then we transform $\boldsymbol{z}_t$ using the nullspace coordinate transformation so that the coordinates are unconstrained. Suppose the $i$th basis vector is $\boldsymbol{w}_i = (w_i^1, w_i^2, \ldots, w_i^{36})^T$. Then it has components given by

$$
w_i^j =
\begin{cases}
-0.16667 & j = 1, \\
0.97619 & j = i + 1, \\
-0.02381 & \text{otherwise},
\end{cases}
$$

for $i \in \{1, \ldots, 35\}$. The use of our transformation is of particular importance in this example, as there is a large number of days with zero COVID-19 cases. We input the original coordinates to CTSOE, our compositional time series outlier ensemble. Figure 12 gives the outlying dates according to CTSOE using 2 decomposition components. We see that the 3rd of March is the most outlying date, followed by the 2nd and the 4th of March.
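The stated components can be assembled into the full $36 \times 35$ basis matrix directly. The following sketch uses the rounded constants printed above, so orthogonality to the sum constraint holds only to rounding precision:

```python
import numpy as np

# Build the nullspace basis from the stated component formula for n = 36.
n = 36
basis = np.full((n, n - 1), -0.02381)   # the "otherwise" entries
basis[0, :] = -0.16667                  # the j = 1 row
for i in range(n - 1):                  # 0.97619 at j = i + 1 for vector i
    basis[i + 1, i] = 0.97619

# Each basis vector sums (approximately) to zero, i.e. it is orthogonal to the
# all-ones vector defining the unit-sum constraint of the simplex.
col_sums = np.ones(n) @ basis
assert basis.shape == (36, 35)
assert np.all(np.abs(col_sums) < 1e-3)
```

No logarithms appear anywhere in this construction, which is why zero-valued composites pose no problem.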

Figure 12: The output of CTSOE on COVID-19 India data from 30 Jan 2020 to 2 August 2020 with the highest

total score highlighted.

Figure 13 shows 8 null space coordinates, selected using the biplots, with a dashed line on March 4. We see

that from March 2 to March 4, the selected null space coordinates exhibit a signiﬁcant increase, which aligns

with the CTSOE outliers.

Figure 13: Eight null space coordinates of COVID-19 India data as suggested by the biplot results, with a dashed line on March 4.

Figure 14 shows the four decomposition coordinates with two components for each decomposition method.

Again, we see big vertical jumps at outlying time points. The corresponding biplots are shown in Figure 15.

They show that both PCA and ICA combine X3 with X4, and X7 with X9 along roughly opposing directions.

Comparison with the null space coordinate time series in Figure 13 shows that these coordinates are similar

to each other: both X3 and X4 show a sharp peak followed by a quick drop and approximately ﬂat behavior,


while X7 and X9 capture more of the variation at later times. It is interesting to note the different projections

found for the other two decomposition methods, for example the second component identiﬁed by DOBIN is

dominated by a single null space coordinate, X1. The biplot for ICS shows strong correlation, and Figure 14

conﬁrms that both components have very similar temporal patterns. We also see that the end-on-the-line point

is not tagged as an outlier in the biplots. This point corresponds to the initial time steps with zero entries,

and is clearly visible in the biplot due to the connecting edges. Finally we note that the tour view, available at

https://uschilaa.github.io/animations/composits2.html, is not so useful for this data, as the scale

is dominated by the highly outlying time points and most views are not informative.

Figure 14: DOBIN, ICA, ICS and PCA coordinates of COVID-19 India data with outlying time points shown by vertical lines.

Table 2 gives the apportioned outlier scores for the states and union territories on the outlying dates. By

inspection we see that Kerala, Telangana, Rajasthan and Uttar Pradesh have relatively high outlying scores for

these dates. Figure 16 shows the map of India with the apportioned scores for the outlying dates.

We also examine the univariate time series outliers on total cases. Figure 17 gives the output of the univariate

outlier ensemble UTSOE on daily log totals. March 2nd and March 4th get identiﬁed as outliers by 3 methods

out of 4. From Figure 17 we see there are big jumps on the 2nd and the 4th of March. As such, the changes in


Figure 15: PCA, DOBIN, ICA and ICS biplots of Indian COVID-19 data, with outlying time points shown in red. Outlying points are associated with long edges in the biplot.

Figure 16: Outlier scores apportioned to different states and union territories in India.

compositions have affected the total, giving rise to outliers in total values as well as at a compositional level.


Table 2: Outlier scores apportioned to Indian states and union territories

State 2020-03-02 2020-03-03 2020-03-04

Kerala 1.961 2.009 1.938

Telangana 0.940 0.755 0.543

Delhi 0.241 0.222 0.194

Rajasthan 0.712 0.793 0.773

Uttar Pradesh 0.607 0.548 0.431

Haryana 0.381 0.377 0.352

Ladakh 0.278 0.238 0.183

Tamil Nadu 0.265 0.283 0.254

Karnataka 0.376 0.313 0.236

Maharashtra 0.588 0.662 0.639

Punjab 0.295 0.214 0.120

Jammu and Kashmir 0.117 0.093 0.064

Andhra Pradesh 0.116 0.107 0.090

Uttarakhand 0.209 0.153 0.087

Odisha 0.450 0.317 0.158

Puducherry 0.135 0.090 0.038

West Bengal 0.253 0.205 0.136

Chhattisgarh 0.260 0.186 0.099

Chandigarh 0.213 0.147 0.071

Gujarat 0.360 0.306 0.217

Himachal Pradesh 0.368 0.258 0.131

Madhya Pradesh 0.424 0.329 0.212

Bihar 0.150 0.125 0.087

Manipur 0.095 0.073 0.047

Mizoram 0.345 0.239 0.117

Andaman and Nicobar Islands 0.298 0.207 0.102

Goa 0.555 0.382 0.182

Unassigned 0.332 0.233 0.118

Assam 0.034 0.039 0.039

Jharkhand 0.467 0.323 0.155

Arunachal Pradesh 0.098 0.068 0.034

Tripura 0.094 0.069 0.039

Nagaland 0.339 0.233 0.111

Meghalaya 0.263 0.180 0.083

Dadra and Nagar Haveli and Daman and Diu 0.107 0.073 0.034

Sikkim 0.037 0.027 0.015

4.3 Spanish deaths

As our last application, we use the daily mortality counts in Spain organized by autonomous communities

(Ministry of Science and Innovation, Spain 2020). This dataset records mortality counts from the 18th of April

2018 until the 31st of July 2020 and is shown in Figure 18.

Similar to the previous example, we divide the daily mortality counts $\boldsymbol{x}_t = (x_t^1, x_t^2, \ldots, x_t^{19})^T$ by their sum and

Figure 17: Log total of COVID-19 cases in India with univariate outliers identiﬁed by 3 methods in UTSOE

shown in dashed lines.

Figure 18: Mortality proportions in Spanish states from April 2018 to July 2020.

make it compositional:

$$
\boldsymbol{z}_t = (z_t^1, z_t^2, \ldots, z_t^{19})^T
= \left( \frac{x_t^1}{\sum_j x_t^j}, \frac{x_t^2}{\sum_j x_t^j}, \ldots, \frac{x_t^{19}}{\sum_j x_t^j} \right)^T.
$$

Then we transform $\boldsymbol{z}_t$ using the nullspace coordinate transformation. The null space transformation has 18 basis vectors, with the $i$th vector $\boldsymbol{w}_i = (w_i^1, w_i^2, \ldots, w_i^{19})^T$ having the components

$$
w_i^j =
\begin{cases}
-0.22942 & j = 1, \\
0.95719 & j = i + 1, \\
-0.04281 & \text{otherwise},
\end{cases}
$$

for $i \in \{1, \ldots, 18\}$. Again, this dataset contains zeros, which can be handled by the null space coordinate transformation. Figure 19 gives the outlying dates according to CTSOE using 2 decomposition components per decomposition method.

Figure 19: The output of CTSOE on Spanish mortality proportion data from April 18 2018 to July 31 2020

with the highest outlier score highlighted.

Figure 20 shows the null space coordinates with a dashed line on March 19, 2020. We see that there is a

visible spike around this date for many coordinates. Figure 21 shows the decomposition plots for the four

decomposition methods using two components. Again, we see that there is a spike around March 19 for all

four decomposition methods.

The biplots for all decomposition methods are shown in Figure 22. They show that the majority of time points

are normally distributed, apart from a group of outlying points that include most tagged outliers. This can also

be observed when looking at the null space coordinates in a tour, which is available at https://uschilaa.github.io/animations/composits3.html.

Figure 22: PCA, DOBIN, ICA and ICS biplots of Spanish mortality data, with outlying time points shown in red. Outlying points are associated with long edges in the biplot.

Interestingly, DOBIN has captured outliers only in the first

component, with the second component dominated only by X1 and without interesting patterns (as conﬁrmed

by the time series in Figure 21). Note that the DOBIN decomposition did not contribute to the tagged outliers

as conﬁrmed in Figure 19. We again observe similarities between PCA and ICA decompositions: both methods

contrast X12 with X9 in one direction, and pick X8 as the other important direction.

Table 3 gives the apportioned scores for different autonomous communities for the three most outlying dates

as per total scores. By inspection we see that Madrid, Catalonia and Andalusia have higher outlier scores com-

pared to others. Figure 23 shows the map of mainland Spain with outlier scores apportioned to the autonomous

communities at these outlying time points.

Next we examine the univariate time series outliers. Figure 24 gives the output of the univariate outlier ensem-

ble UTSOE on daily totals. We get consecutive days from March 21 to March 28 identiﬁed as outliers by 3

or 4 methods. We see that the univariate time series outliers have affected the compositions in this example.

However, the most outlying compositional outlier, which occurred on March 19, was not identified as an outlier

using the daily totals by 3 or more outlier methods. This shows that the changes in compositions contributed to


Table 3: Outlier scores apportioned to Spanish states

State 2020-03-19 2020-03-22 2020-03-20

Andalusia 0.448 0.484 0.431

Aragon 0.103 0.069 0.068

Principality of Asturias 0.077 0.074 0.067

Balearic Islands 0.054 0.050 0.046

Canary Islands 0.216 0.188 0.177

Cantabria 0.387 0.282 0.280

Castile and León 0.290 0.229 0.222

Castile-La Mancha 0.367 0.309 0.291

Catalonia 0.474 0.594 0.492

Valencian Community 0.295 0.255 0.238

Extremadura 0.140 0.107 0.105

Galicia 0.316 0.279 0.260

Community of Madrid 0.729 0.570 0.529

Region of Murcia 0.218 0.175 0.169

Navarre 0.050 0.039 0.038

Basque Country 0.177 0.138 0.134

La Rioja 0.115 0.084 0.083

Ceuta 0.145 0.106 0.106

Melilla 0.085 0.063 0.063

Figure 23: Outlier scores apportioned to different Spanish autonomous communities.

a compositional outlier before it was picked up as an outlier by the daily totals.

Figure 24: Total deaths in Spain with univariate outliers identiﬁed by UTSOE using 3 or more methods shown

in dashed lines.


5 Conclusions

Working with compositional data presents challenges as well as opportunities, as this is an area of research that

still contains unexplored territories. In this paper, we tackled the problem of outlier detection in the context

of compositional time series data and presented an ensemble method for outlier detection that is applicable

to univariate, multivariate, compositional and non-compositional time series. Our ensemble method copes

well with the characteristics of time series data and the peculiarities of compositional data. In addition, our

method is able to deal with the presence of zeros in the composites, a well known problem when working with

compositional data. For that, we presented a coordinate transformation that is able to map points from the

constrained simplex plane S𝑛, where compositional data naturally lives, to an unconstrained space R𝑛−1while

preserving Euclidean distances and angles between points. The advantage of this transformation is that it can handle zero entries in compositional data, unlike the logratio transformations.

In summary, the ensemble method presented in the paper starts by ﬁrst transforming the compositional co-

ordinates from the simplex to the unconstrained real space of dimension 𝑛−1. From there, four dimension

reduction methods (PCA, DOBIN, ICS and ICA) are applied to the data so that the dimension of the original

compositional dataset gets reduced to a dimension that can be set by the user. In this reduced space, the original

dataset is represented by only a few components computed by each dimension reduction technique. Next, we

apply four existing time series outlier detection methods (tsoutliers,forecast,anomalize and ostad)

to each component obtained from the four dimension reduction methods. That provides us with a list of time

outliers and for each outlier a score is calculated. Then the scores are combined to an ensemble score by using

appropriate weights for each of the outlier detection methods, decomposition components and decomposition

methods. After ﬁnding the outlying time points, we apportion the scores back to the compositions so that we

have a better understanding of outlying constituents. In other words, we can identify which time series in the

original data contributed to the detected outliers.

The ensemble method is validated using a simulation as well as three real world datasets, which showed promis-

ing results. Furthermore, visualization techniques for the different steps in the ensemble method are provided,

including the dimension reduction results, as well as the time outliers and corresponding time series. The vi-

sualization methods including the animations are designed to aid interpretability of the ﬁnal results and also to

better undertand the intermediate steps of the ensemble method. The ensemble method, the coordinate trans-

37

formation and the visualization methods are all implemented and available in the accompanying R package

composits.

Acknowledgments

This work was supported in part by the Australian Government through the Australian Research Council.

6 Supplementary Material

R package composits: This package contains functionality for the null space basis construction, the outlier

ensembles CTSOE, MTSOE and UTSOE, the outlier score apportionment, visualization methods and the com-

positional data simulation.

Animations: The animations for the three real datasets are available at https://uschilaa.github.io/

animations/.

Datasets: The synthetic dataset can be generated by using functionality of composits. The World Bank

dataset is available at the website https://data.worldbank.org/indicator/ST.INT.ARVL. The Kaggle

dataset is available at https://www.kaggle.com/sudalairajkumar/covid19-in-india and the Spanish

dataset is included in the package composits and is also available at https://momo.isciii.es/public/momo/dashboard/momo_dashboard.html#datos.

Scripts:

The scripts Figures_For_Paper_1.R and Figures_For_Paper_5.R contain the code used in Section 3. The

script Figures_For_Paper_2.R contains the code used for the international tourism data example. The scripts

Figures_For_Paper_3.R and Figures_For_Paper_4.R contain the code used for the Indian and Spanish examples respectively. These scripts are available at our GitHub repository https://github.com/sevvandi/composits_paper.


References

Aires, F., Chédin, A. & Nadal, J.-P. (2000), ‘Independent component analysis of multivariate time series: Ap-

plication to the tropical SST variability’, Journal of Geophysical Research: Atmospheres 105(D13), 17437–

17455.

Aitchison, J. (1982), ‘The statistical analysis of compositional data’, Journal of the Royal Statistical Society:

Series B (Methodological) 44(2), 139–160.

Aitchison, J. (1983), ‘Principal component analysis of compositional data’, Biometrika 70(1), 57–65.

Aminikhanghahi, S. & Cook, D. J. (2017), ‘A survey of methods for time series change point detection’,

Knowledge and Information Systems 51(2), 339–367.

Anton, H. & Rorres, C. (2013), Elementary Linear Algebra: Applications Version, John Wiley & Sons.

Asimov, D. (1985), 'The Grand Tour: A Tool for Viewing Multidimensional Data', SIAM Journal on Scientific and Statistical Computing 6(1), 128–143.

Baragona, R. & Battaglia, F. (2007), ‘Outliers detection in multivariate time series by independent component

analysis’, Neural Computation 19(7), 1962–1984.

Borchers, H. W. (2019), pracma: Practical Numerical Math Functions. R package version 2.2.9.

URL: https://CRAN.R-project.org/package=pracma

Brunsdon, T. M. & Smith, T. (1998), ‘The time series analysis of compositional data’, Journal of Ofﬁcial

Statistics 14(3), 237.

Buja, A., Cook, D., Asimov, D. & Hurley, C. (2005), Computational Methods for High-Dimensional Rotations in Data Visualization, Vol. 24 of Handbook of Statistics, Elsevier, pp. 391–413.

URL: http://www.sciencedirect.com/science/article/pii/S0169716104240147

Chayes, F. (1960), 'On correlation between variables of constant sum', Journal of Geophysical Research 65(12), 4185–4193.

Chen, C. & Liu, L.-M. (1993), ‘Joint estimation of model parameters and outlier effects in time series’, Journal

of the American Statistical Association 88(421), 284–297.


Comon, P. (1994), ‘Independent component analysis, a new concept?’, Signal Processing 36(3), 287–314.

Dancho, M. & Vaughan, D. (2019), anomalize: Tidy Anomaly Detection. R package version 0.2.0.

URL: https://CRAN.R-project.org/package=anomalize

de Lacalle, J. L. (2019), tsoutliers: Detection of Outliers in Time Series. R package version 0.6-8.

URL: https://CRAN.R-project.org/package=tsoutliers

Filzmoser, P. & Hron, K. (2008), ‘Outlier detection for compositional data using robust methods’, Mathemati-

cal Geosciences 40(3), 233–248.

Gu, M., Fei, J. & Sun, S. (2020), 'Online anomaly detection with sparse Gaussian processes', Neurocomputing 403, 383–399.

Hyndman, R. J. & Khandakar, Y. (2008), ‘Automatic time series forecasting: the forecast package for R’,

Journal of Statistical Software 26(3), 1–22.

Iturria, A., Carrasco, J., Herrera, F., Charramendieta, S. & Intxausti, K. (2019), otsad: Online Time Series

Anomaly Detectors. R package version 0.2.0.

URL: https://CRAN.R-project.org/package=otsad

Kaggle, SRK (2020), ‘COVID-19 in India’.

URL: https://www.kaggle.com/sudalairajkumar/covid19-in-india

Kandanaarachchi, S. & Hyndman, R. J. (2020), ‘Dimension reduction for outlier detection using DOBIN’,

Journal of Computational and Graphical Statistics (Accepted).

URL: https://robjhyndman.com/publications/dobin/

Kandanaarachchi, S., Menendez, P., Laa, U. & Loaiza-Maya, R. (2020), composits: Compositional, Multivari-

ate and Univariate Time Series Outlier Ensemble. R package version 0.0.0.9000.

URL: https://github.com/sevvandi/composits

Ministry of Science and Innovation, Spain (2020), ‘Deaths from all causes in excess, by population group.

Spain’.

URL: https://momo.isciii.es/public/momo/dashboard/momo_dashboard.html


Pearson, K. (1897), ‘Mathematical contributions to the theory of evolution. — On a form of spurious correlation

which may arise when indices are used in the measurement of organs’, Proceedings of the Royal Society of

London 60(359-367), 489–498.

Scealy, J. L. & Welsh, A. H. (2011), ‘Regression for compositional data by using distributions deﬁned on the

hypersphere’, Journal of the Royal Statistical Society. Series B: Statistical Methodology 73(3), 351–375.

Templ, M., Hron, K. & Filzmoser, P. (2017), ‘Exploratory tools for outlier detection in compositional data with

structural zeros’, Journal of Applied Statistics 44(4), 734–752.

Tyler, D. E., Critchley, F., Dümbgen, L. & Oja, H. (2009), ‘Invariant co-ordinate selection’, Journal of the

Royal Statistical Society: Series B (Statistical Methodology) 71(3), 549–592.

Unwin, A. (2019), 'Multivariate outliers and the O3 plot', Journal of Computational and Graphical Statistics 28(3), 635–643.

Wang, H., Bah, M. J. & Hammad, M. (2019), ‘Progress in Outlier Detection Techniques: A Survey’, IEEE

Access 7, 107964–108000.

Wickham, H., Cook, D., Hofmann, H. & Buja, A. (2011), ‘tourr: An R package for exploring multivariate data

with projections’, Journal of Statistical Software 40(2), 1–18.

URL: http://www.jstatsoft.org/v40/i02/

Zimek, A., Campello, R. J. & Sander, J. (2014), 'Ensembles for unsupervised outlier detection: challenges and research questions [position paper]', ACM SIGKDD Explorations Newsletter 15(1), 11–22.
