Outlier detection for skewed data
Mia Hubert¹ and Stephan Van der Veeken
December 7, 2007
Abstract
Most outlier detection rules for multivariate data are based on the assumption of
elliptical symmetry of the underlying distribution. We propose an outlier detection
method which does not need the assumption of symmetry and does not rely on visual
inspection. Our method is a generalization of the Stahel-Donoho outlyingness. The
latter approach assigns to each observation a measure of outlyingness, which is obtained
by projection pursuit techniques that only use univariate robust measures of location
and scale. To allow skewness in the data, we adjust this measure of outlyingness
by using a robust measure of skewness as well. The observations corresponding to
an outlying value of the adjusted outlyingness are then considered as outliers. For
bivariate data, our approach leads to two graphical representations. The first one is a
contour plot of the adjusted outlyingness values. We also construct an extension of the
boxplot for bivariate data, in the spirit of the bagplot [1] which is based on the concept
of half space depth. We illustrate our outlier detection method on several simulated
and real data.
Keywords: Outlier detection, boxplot, bagplot, skewness, outlyingness
¹Corresponding author
Authors' affiliation: Department of Mathematics - LSTAT, Katholieke Universiteit Leuven, Celestijnenlaan 200B, B-3001 Leuven, Belgium.
Email addresses: mia.hubert@wis.kuleuven.be, stephan.vanderveeken@wis.kuleuven.be
1 Introduction
To detect outliers in multivariate data, it is common practice to estimate the location
and scatter of the data by means of robust estimators. Well-known high-breakdown
and affine equivariant estimators of location and scatter are e.g. the MCD-estimator [2],
the Stahel-Donoho estimator [3, 4], S-estimators [5, 6] and MM-estimators [7]. Their
high-breakdown property implies that the estimators can resist up to 50% of outliers,
whereas their affine equivariance allows for any affine transformation of the data (such
as rotations, rescaling, translations).
To classify the observations into regular points and outliers, one can then compute
robust Mahalanobis-type distances, and use a cutoff value based on the distribution
of these distances, see e.g. [8, 9, 10]. All these estimators assume that the data are
generated from an elliptical distribution, among which the multivariate gaussian is the
most popular one.
Consequently these outlier detection methods will not work appropriately when
data are skewed. A typical way to circumvent this problem is then to apply a sym-
metrizing transformation on some (or all) of the individual variables. Common examples are the logarithmic transformation or, more generally, a Box-Cox transformation,
see e.g. [11]. This is certainly often a very useful approach, especially when the trans-
formed variables also have a physical meaning. However, this procedure requires extra preprocessing, is not affine invariant, and leads to new variables that are not always easy to interpret. Moreover, the standard Box-Cox transformation is based on maximum likelihood estimation and is consequently not robust to outliers.
In this paper we propose an automatic outlier detection method for skewed multi-
variate data, which is applied on the raw data. Our method is inspired by the Stahel-
Donoho estimator [12]. This estimator is based on the outlyingness of the data points,
which are essentially obtained by projecting the observations on many univariate di-
rections and computing a robust center and scale in each projection. The observations
are then weighted according to their outlyingness and the robust Stahel-Donoho esti-
mates are obtained as a weighted mean and covariance matrix (see Section 2.4 for the
details).
In the first step of our procedure we adjust the Stahel-Donoho outlyingness to allow
for asymmetry, which leads to the so-called adjusted outlyingness (AO). The method is
based on the adjusted boxplot for skewed data [13] and essentially defines for univariate
data a different scale on each side of the median. This scale is obtained by means of a
robust measure of skewness [14].
In the second step of our outlier detection method, we declare an observation as
outlying when its adjusted outlyingness is 'too' large. As the distribution of the AOs
is in general not known, we apply again the adjusted boxplot outlier rule. All details
are provided in Section 2.
In Section 3 we show how our approach can be used to easily obtain two graphical
representations of bivariate data that well reflect their center and shape. Section 4
is devoted to a simulation study. Finally we show in the appendix that the adjusted
outlyingness of univariate data has a bounded influence function, which reflects its
robustness towards outliers.
It is well known that skewness is only an issue in small dimensions. As the dimen-
sionality increases, the data are more and more concentrated in an outside shell of the
distribution, see e.g. [15]. Hence, in this paper we only consider low-dimensional data
sets with, say, at most 10 variables. Of course, it is possible that data are represented
in a high-dimensional space, but in fact lie close to a low-dimensional space. Dimen-
sion reduction methods are then very helpful preprocessing techniques. One could for
example first apply a robust PCA method (e.g. [16]), and then apply our new out-
lier detection method on the principal component scores. A somewhat more refined approach was recently proposed in [17], based on the work presented here.
2 Outlier detection for skewed data
2.1 Outlier detection for skewed univariate data
Since our proposal is based on looking for outliers in one-dimensional projections,
we first describe how we detect outliers in skewed univariate data. This problem
has been addressed in [13], where a skewness-adjusted boxplot is proposed. If X_n = {x_1, x_2, ..., x_n} is a univariate (continuous, unimodal) data set, the standard boxplot [18] is constructed by drawing a line at the sample median med_n, a box from the first quartile Q_1 to the third quartile Q_3, and whiskers w_1 and w_2 from the box to the furthest non-outlying observations. These observations are defined as all cases inside the interval

[Q_1 - 1.5 IQR, Q_3 + 1.5 IQR]   (1)

with the interquartile range IQR = Q_3 - Q_1.
For data coming from a normal distribution, the probability to lie beyond the whiskers is approximately 0.7%. However, if the data are skewed, this percentage can be much higher. For example, in the case of the lognormal distribution (with μ = 0 and σ = 1), this probability is almost 7%. In [13] the whiskers w_1 and w_2 are adjusted such that for skewed data, far fewer regular data points fall outside the whiskers. This is obtained by replacing the interval (1) by

[Q_1 - 1.5 e^{-4 MC} IQR, Q_3 + 1.5 e^{3 MC} IQR]   (2)

if MC > 0 and

[Q_1 - 1.5 e^{-3 MC} IQR, Q_3 + 1.5 e^{4 MC} IQR]

if MC < 0. Here, MC stands for the medcouple, which is a robust measure of skewness
[14]. It is defined as

MC(X_n) = med_{x_i < med_n < x_j} h(x_i, x_j)

with med_n the sample median, and

h(x_i, x_j) = ((x_j - med_n) - (med_n - x_i)) / (x_j - x_i).
Remark that at symmetric distributions, MC = 0 and hence equation (2) reduces to
equation (1) from the standard boxplot. It has been shown in [14] that the MC on
one hand has a good ability to detect skewness, and on the other hand attains a high
resistance to outliers. It has a 25% breakdown value, and a bounded influence function.
This means that up to 25% of the regular data can be replaced by contamination before
the estimator breaks down, whereas adding a small probability mass at a certain point
has a bounded influence on the estimate. Moreover, the medcouple can be computed
fast by an O(n log n) algorithm.
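To make the fence rule above concrete, here is a minimal sketch in Python/NumPy (our own illustration, not the LIBRA implementation): a naive O(n²) medcouple, used to compute the whisker interval (2). Points tied with the median are ignored for simplicity, and the fast O(n log n) algorithm of [14] is not reproduced.

```python
import numpy as np

def medcouple(x):
    """Naive O(n^2) medcouple; points tied with the median are ignored."""
    x = np.sort(np.asarray(x, dtype=float))
    med = np.median(x)
    xm, xp = x[x < med], x[x > med]
    # kernel h(xi, xj) = ((xj - med) - (med - xi)) / (xj - xi)
    h = ((xp[None, :] - med) - (med - xm[:, None])) / (xp[None, :] - xm[:, None])
    return float(np.median(h))

def adjusted_fences(x):
    """Whisker interval (2) of the adjusted boxplot."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mc = medcouple(x)
    if mc >= 0:
        return (q1 - 1.5 * np.exp(-4 * mc) * iqr,
                q3 + 1.5 * np.exp(3 * mc) * iqr)
    return (q1 - 1.5 * np.exp(-3 * mc) * iqr,
            q3 + 1.5 * np.exp(4 * mc) * iqr)
```

For a symmetric sample the medcouple vanishes and the fences reduce to the standard 1.5 IQR rule; for a right-skewed sample the upper fence moves outward.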
To illustrate the difference between the standard and the adjusted boxplot, we
consider an example from geochemistry. The data set comes from a geological survey on
the composition in agricultural soils from 10 countries surrounding the Baltic Sea [19].
Top soil (0-25 cm) and bottom soil (50-75 cm) samples from 768 sites were analysed.
As an example, we consider the MgO concentration, which is apparently quite skewed (MC = 0.39). The original and the adjusted boxplot are shown in Figure 1. We see that the standard boxplot marks many observations as possible outliers, whereas the adjusted boxplot finds no cases with an abnormally high concentration of magnesium oxide. There are 15 observations that lie below the lower whisker, but they are clearly boundary cases.
2.2 From the adjusted boxplot to the adjusted outlyingness
The adjusted boxplot introduced in the previous section now allows us to define
a skewness-adjusted outlyingness for univariate data. According to Stahel [3] and
Donoho [4], the outlyingness of a univariate point x_i is defined as

SDO_i = SDO^{(1)}(x_i, X_n) = |x_i - med(X_n)| / mad(X_n)

where med(X_n) = med_n is the sample median, and mad(X_n) = b med_i |x_i - med_n| is the median absolute deviation. The constant b = 1.483 is a correction factor which makes the MAD unbiased at the normal distribution. Note that instead of the median and the MAD, other robust estimators of location and scale can be used as well [20, 16].
The outlyingness of a data point tells us how far the observation lies from the centre
of the data, standardized by means of a robust scale. In this definition, it does not
matter whether the data point is smaller or larger than the median. However, when
the distribution is skewed, we propose to apply a different scale on each side of the
median. More precisely the adjusted outlyingness is defined as:
AO_i = AO^{(1)}(x_i, X_n) =
   (x_i - med(X_n)) / (w_2 - med(X_n))   if x_i > med(X_n)
   (med(X_n) - x_i) / (med(X_n) - w_1)   if x_i < med(X_n)        (3)

with w_1 and w_2 the lower and upper whisker of the adjusted boxplot applied to the data set X_n. Again note that AO_i reduces to SDO_i at symmetric distributions.
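Definition (3) can be sketched as follows (again a Python/NumPy illustration of our own, repeating a naive medcouple so the snippet stays self-contained): the whiskers w_1 and w_2 are the furthest observations inside the fences (2), and each side of the median gets its own scale.

```python
import numpy as np

def medcouple(x):
    # naive O(n^2) medcouple; points tied with the median are ignored
    x = np.sort(np.asarray(x, dtype=float))
    med = np.median(x)
    xm, xp = x[x < med], x[x > med]
    h = ((xp[None, :] - med) - (med - xm[:, None])) / (xp[None, :] - xm[:, None])
    return float(np.median(h))

def adjusted_outlyingness_1d(x):
    """AO^(1) of definition (3): a different robust scale on each side
    of the median, built from the adjusted-boxplot whiskers w1 and w2."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    iqr, mc = q3 - q1, medcouple(x)
    if mc >= 0:
        lo, hi = q1 - 1.5 * np.exp(-4 * mc) * iqr, q3 + 1.5 * np.exp(3 * mc) * iqr
    else:
        lo, hi = q1 - 1.5 * np.exp(-3 * mc) * iqr, q3 + 1.5 * np.exp(4 * mc) * iqr
    w1 = x[x >= lo].min()     # furthest non-outlying observation below the box
    w2 = x[x <= hi].max()     # ... and above the box
    return np.where(x >= med, (x - med) / (w2 - med), (med - x) / (med - w1))
```

On a symmetric sample both scales coincide and the AO behaves like a standardized absolute distance to the median.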
This adjusted outlyingness is illustrated in Figure 2. Observation x_1 has AO_1 = d_1/s_1 = (med(X_n) - x_1)/(med(X_n) - w_1), whereas for x_2 we have AO_2 = d_2/s_2 = (x_2 - med(X_n))/(w_2 - med(X_n)). So although x_1 and x_2 are located at the same distance from the median, x_1 has a higher value of outlyingness, because the scale on the lower side of the median is smaller than the scale on the upper side. Note that SDO^{(1)} and AO^{(1)} are location and scale invariant, hence they are not affected by changing the center and/or the scale of the data.
As the AO is based on robust measures of location, scale and skewness, it is resis-
tant to outliers. In theory, a resistance up to 25% of outliers can be achieved, although
we noticed in practice that the medcouple often has a substantial bias when the con-
tamination is more than 10%. Moreover, it can be shown that the influence function
[21] of the AO is bounded. We refer to the appendix for a formal proof.
2.3 Outlier detection for multivariate data
Consider now a p-dimensional sample X_n = (x_1, ..., x_n)^T with x_i = (x_i1, ..., x_ip)^T. The Stahel-Donoho outlyingness of x_i is then defined as

SDO_i = SDO(x_i, X_n) = sup_{a ∈ ℝ^p} SDO^{(1)}(a^T x_i, X_n a).   (4)
Definition (4) can be interpreted as follows: for every univariate direction a ∈ ℝ^p we consider the standardized distance of the projection a^T x_i of observation x_i to the robust center of all the projected data points. Suppose now that SDO(x_i, X_n) is large; then there exists a direction in which the projection of x_i lies far away from the bulk of the other projections. As such, one might suspect x_i of being an outlier.
It is clear from its definition that the SD outlyingness again does not account for any skewness, and hence it is only suited for elliptically symmetric data. To allow skewness, we analogously define the adjusted outlyingness of a multivariate observation x_i as

AO_i = AO(x_i, X_n) = sup_{a ∈ ℝ^p} AO^{(1)}(a^T x_i, X_n a).   (5)
Note that in practice the AO cannot be computed by projecting the observations on all univariate vectors a. Hence, we restrict ourselves to a finite set of random directions. Many simulations have shown that considering m = 250p directions yields a good balance between 'efficiency' and computation time. Random directions are generated as the direction perpendicular to the subspace spanned by p observations, randomly drawn from the data set (as in [12]). As such, the AO is invariant to affine transformations of the data. Moreover, in our implementation we always take ‖a‖ = 1, although this is not required as AO^{(1)} is scale invariant.
Once the AO is computed for every observation, we can use this information to decide whether an observation is outlying or not. Except for normal distributions, for which the AOs (or SDOs) are asymptotically χ²_p distributed, the distribution of the AO is in general unknown (but typically right-skewed, as the AO values are bounded below by zero). Hence we compute the adjusted boxplot of the AO values and declare a multivariate observation outlying if its AO_i exceeds the upper whisker of the adjusted boxplot. More precisely, our outlier cutoff value equals

cutoff = Q_3 + 1.5 e^{3 MC} IQR   (6)

where Q_3 is the third quartile of the AO_i, and similarly for IQR and MC.
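The full procedure of this section (directions perpendicular to random p-subsets, a univariate AO per projection, the supremum (5), and the cutoff (6)) can be sketched as follows. This is a hedged Python/NumPy illustration with a naive medcouple; the function names are ours and the LIBRA implementation differs in its details.

```python
import numpy as np

def medcouple(x):
    # naive O(n^2) medcouple; points tied with the median are ignored
    x = np.sort(np.asarray(x, dtype=float))
    med = np.median(x)
    xm, xp = x[x < med], x[x > med]
    h = ((xp[None, :] - med) - (med - xm[:, None])) / (xp[None, :] - xm[:, None])
    return float(np.median(h))

def adjusted_whiskers(x):
    # whiskers w1, w2: furthest observations inside the fences (2)
    q1, q3 = np.percentile(x, [25, 75])
    iqr, mc = q3 - q1, medcouple(x)
    if mc >= 0:
        lo, hi = q1 - 1.5 * np.exp(-4 * mc) * iqr, q3 + 1.5 * np.exp(3 * mc) * iqr
    else:
        lo, hi = q1 - 1.5 * np.exp(-3 * mc) * iqr, q3 + 1.5 * np.exp(4 * mc) * iqr
    return x[x >= lo].min(), x[x <= hi].max()

def ao_univariate(x):
    # definition (3), with a small guard against a degenerate scale
    med = np.median(x)
    w1, w2 = adjusted_whiskers(x)
    up, dn = max(w2 - med, 1e-12), max(med - w1, 1e-12)
    return np.where(x >= med, (x - med) / up, (med - x) / dn)

def adjusted_outlyingness(X, rng):
    # definition (5) over m = 250p directions, each perpendicular to the
    # hyperplane through p randomly drawn observations (p >= 2 assumed)
    n, p = X.shape
    ao = np.zeros(n)
    for _ in range(250 * p):
        pts = X[rng.choice(n, size=p, replace=False)]
        a = np.linalg.svd(pts[1:] - pts[0])[2][-1]   # unit normal vector
        ao = np.maximum(ao, ao_univariate(X @ a))
    return ao

def ao_outliers(X, rng):
    # flag observations whose AO exceeds cutoff (6)
    ao = adjusted_outlyingness(X, rng)
    q1, q3 = np.percentile(ao, [25, 75])
    cutoff = q3 + 1.5 * np.exp(3 * medcouple(ao)) * (q3 - q1)
    return ao, ao > cutoff
```

Because the directions are tied to random subsets of the data, the resulting AO values inherit the affine invariance discussed above.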
Remark 1 Note that the construction of the adjusted boxplot and the adjusted out-
lyingness does not assume any particular underlying skewed distribution (only uni-
modality), hence it is a distribution-free approach. For univariate skewed data, sev-
eral more refined robust estimators and outlier detection methods are available, see
e.g. [22, 23, 24], but then one needs to assume that the data are sampled from a spe-
cific class of skewed distributions (such as the gamma distribution). Our approach is
in particular very useful when no information about the data distribution is available
and/or when an automatic and fast outlier detection method is required.
Remark 2 A similar outlier detection method has also been proposed in [25] to robustify independent component analysis (ICA). However, in [25] a different definition of the adjusted outlyingness was used, replacing the constants 3 and 4 in (2) by 4 and 3.5, yielding

[Q_1 - 1.5 e^{-3.5 MC} IQR, Q_3 + 1.5 e^{4 MC} IQR]   (7)

for right-skewed distributions (and similarly for left-skewed data).
Definition (7) yields a larger fence than our current definition (2). This affects both the scale estimates in (3) and the cutoff value (6) which separates the regular points from the outliers. When the proportion of contamination is small, which is typical in the context of ICA, such a rule will work very well. Compared to our current approach, it will often even misclassify fewer regular observations as outliers. However, when the contamination percentage is larger, say 5-10%, the medcouple will show more bias and the factor e^{4 MC} might become too large, resulting in whiskers that might mask some or all of the outliers. Therefore, in the general setting considered here, we prefer to work with the new rules.
Remark 3 Note that the concept of ’robustness towards outliers’ can become ambigu-
ous in the context of skewed distributions. Assume that a large majority of observations
is sampled from a symmetric distribution, and that some smaller group (at most 25%)
is outlying. When the outliers are located far from the regular points, a robust es-
timator of skewness should be able to detect the symmetry of the main group. An
outlyingness approach based on such a robust estimator of skewness, combined with robust estimators of location and scale, will then be able to flag the outlying measurements. If the same methodology were used with non-robust estimators of location, scale and skewness, the outlyingness values would be affected by the outliers (e.g. yielding a high value of skewness and an inflated scale) such that the outlying
group could be masked. This difference between a robust and non-robust approach also
applies when the majority group has an asymmetric distribution. In such a situation,
outliers could for example give the impression that the whole distribution is highly
asymmetric, whereas this might not hold for the large majority. If on the other hand
there are no outliers and the whole distribution is indeed skewed, a robust estimator of
skewness should also be able to detect the asymmetry. This is why we prefer to work
with the medcouple. In [26], it is shown that the MC is not too conservative (such
that asymmetry of the main group can be found) but robust enough (asymmetry due
to outliers is detected when the outliers are far enough in the tails of the distribution).
However, when the outliers are located not very far in the tails of the main dis-
tribution, the distinction between the regular and outlying points might become very
small. From our point of view, no estimator (robust or not) will then be able to make
the correct distinction. If one then presumes that the asymmetry is caused by the
outliers, and that the main group has a symmetric distribution, we advise to compare
the AO-values with the SD-values (or any other outlier detection method for symmet-
ric data). If the conclusions are very different, it is then up to the analyst to decide
whether he/she believes in the symmetry of the main group or not.
2.4 Example
We reconsider the geological data set of Section 2.1, and now consider the variables
that measure the concentrations of MgO, MnO, Fe2O3 and TiO2. Hence n = 768 and p = 4. The medcouples of the individual variables are 0.39, 0.2, 0.26 and 0.14 respectively, which clearly indicates the presence of skewness in this data set. Moreover, the adjusted
boxplots of the four variables marked several observations as (univariate) outliers.
When we apply our outlier detection method based on the AO, we find 9 observa-
tions that exceed the outlier cutoff. Figure 3 plots the AO-values on the vertical axis,
together with the adjusted boxplot cutoff (6). We see that two cases are really far
outlying, whereas five observations have a somewhat larger AO, and the other two are
merely boundary cases.
For this data set we also computed the robust distances

RD_i = sqrt( (x_i - μ̂)^T Σ̂^{-1} (x_i - μ̂) )   (8)

with μ̂ and Σ̂ the Stahel-Donoho estimates of location and scatter. The SD estimator is defined by assigning a weight to every data point, inversely proportional to its
outlyingness, and computing the weighted mean and covariance matrix. According to [20], we applied the Gaussian weights

w_i = φ(SDO_i^2 / c) / φ(1)

with φ the Gaussian density and c = χ²_{p,0.9} the 90% quantile of the χ² distribution with p degrees of freedom. This weight function decreases exponentially for SDO_i^2 > c and accords relatively high weights to (squared) SDO values smaller than c.
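A sketch of this reweighting step (Python/NumPy, our own illustration; the quantile c = χ²_{p,0.9} must be supplied by the caller, e.g. c ≈ 7.78 for p = 4, so that the snippet needs no quantile routine):

```python
import numpy as np

def gaussian_phi(u):
    """Standard Gaussian density."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def sd_weights(sdo, c):
    """Gaussian weights w_i = phi(SDO_i^2 / c) / phi(1); the weights
    decay exponentially once SDO_i^2 exceeds c = chi^2_{p,0.9}."""
    sdo = np.asarray(sdo, dtype=float)
    return gaussian_phi(sdo**2 / c) / gaussian_phi(1.0)

def weighted_location_scatter(X, w):
    """Weighted mean and covariance matrix defining the SD estimates."""
    X, w = np.asarray(X, dtype=float), np.asarray(w, dtype=float)
    mu = np.average(X, axis=0, weights=w)
    Xc = X - mu
    sigma = (w[:, None] * Xc).T @ Xc / w.sum()
    return mu, sigma
```

An observation with squared outlyingness exactly equal to c receives weight 1, and more outlying points are rapidly downweighted.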
Figure 3(a) shows the robust SD distances on the horizontal axis, together with the common cutoff value sqrt(χ²_{4,0.99}) (since the robust distances are approximately χ²_p distributed at normal data). We see that the SD estimator detects four clear outliers (indicated with a large dot), but also yields a huge number of observations outside the outlier cutoff value. From the χ²_4 quantile plot of the robust SD distances in Figure 3(b), we can deduce that the robust distances are not χ²_4 distributed (as the data are skewed), and hence the cutoff value is not appropriate.
In Figure 4 we show several pairwise scatterplots indicating the observations with
outlying AO value. The four outliers with highest robust SD distance are marked with
a large dot. The remaining five observations with outlying AO are marked with a star.
These scatterplots show the multivariate skewness in the data, and illustrate why these
nine cases are different from the others. Figures 4(a) and (c) are the most informative
ones, and demonstrate that the outliers merely have outlying (x1, x2) and/or (x2, x4)
measurements.
3 Graphical representations for bivariate data
For bivariate data, the AO-values can be used to easily obtain two graphical represen-
tations of the data that well reflect their center and shape.
3.1 Contour plot
The first representation consists of a contour plot of the adjusted outlyingness values.
To illustrate such a contour plot, we consider the bloodfat data from [27]. For 371 male
patients, data were collected on the concentration of plasma cholesterol and plasma
triglycerides. The units are mg/dl. For 51 patients, there was no evidence of heart
disease; for the remaining 320 patients there was evidence of narrowing of the arteries.
Only those last 320 data points are used in the analysis. Both the SD and the ad-
justed outlyingness of the data are computed. Using cubic interpolation (by means of
the Matlab function interp2), contour plots of the two outlyingness measures are con-
structed. These plots are shown in Figure 5. We see that the contours of the AO show
the underlying skewed distribution very well. On the other hand, the inner contours
of the SDO values are closer to elliptical.
3.2 Bagplot
The bagplot is introduced in [1] as an extension of the boxplot for bivariate data. Just
as the boxplot, the construction of the bagplot relies on a ranking of the data points.
This ranking is based on the concept of halfspace depth, which was introduced in [28].
The halfspace depth of a bivariate point x is defined as the smallest number of data points lying in a closed halfplane bounded by a line through x. Using this halfspace
depth, a bivariate equivalent of the median can be defined as the point (not necessarily
an observation) with the highest depth, called the Tukey median. If this point is not
unique, the center of gravity of the deepest depth region is taken (see [1] for more
details). The bagplot consists of the Tukey median, the bag and the fence. The bag
contains the 50% data with highest depth. The fence is defined by inflating the bag
(relative to the Tukey median) by a factor 3. All observations outside the fence are
considered to be outliers. The outer loop consists of the convex hull of the non-outlying
observations. In Figure 6(a) the bagplot of the bloodfat data is shown. We clearly see
the skewness in the data, as the Tukey median (indicated with the + symbol) does not
lie in the center of the (dark-colored) bag, which itself is not elliptically shaped. Also
the light-colored loop is skewed and separates the three outliers (with star symbol) from
the other observations. As illustrated in this example, the bagplot is very useful to show
the shape of bivariate data as the halfspace depth does not make any distributional
assumptions. Moreover the bagplot is equivariant to affine transformations. Its only
drawback is its computational complexity, which is O(n2(log n)2). For larger datasets,
the computation time can be reduced by drawing a random subset from the data
and performing the computations on this smaller data set. This approach has been
proposed and applied in [1]. This explains why the bagplot of the bloodfat data in [1],
based on a random subset of size 150, is slightly different from Figure 6(a) which uses
all observations.
The concept of adjusted outlyingness allows us to make a similar bagplot in much
lower computation time. Instead of the Tukey median we mark the observation with
lowest adjusted outlyingness, and we define the bag as the convex hull of the half
sample with lowest outlyingness. If we look at the bagplot based on AO in Figure 6(b)
we see that it is very similar to the depth-based bagplot and the same observations are
classified as outliers. As the AO-values can be computed in O(mnp log n) time with m
the number of directions considered, and as we usually set m= 250p, this approach
thus yields a fast alternative to the depth-based bagplot.
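The AO-based bag itself is cheap to construct once the AO values are available: take the half sample with lowest AO and form its convex hull. A sketch (Python, our own helpers; Andrew's monotone-chain hull stands in for whatever hull routine is at hand, and the AO values are assumed precomputed):

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices of a set of
    2-d points in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    hull = []
    for seq in (pts, pts[::-1]):          # lower hull, then upper hull
        part = []
        for p in seq:
            while len(part) >= 2 and cross(part[-2], part[-1], p) <= 0:
                part.pop()
            part.append(p)
        hull.extend(part[:-1])
    return hull

def ao_bag(X, ao):
    """Deepest point and bag of the AO-bagplot: the observation with
    lowest AO, and the convex hull of the half sample with lowest AO."""
    X, ao = np.asarray(X, dtype=float), np.asarray(ao, dtype=float)
    order = np.argsort(ao)
    deepest = X[order[0]]
    bag = convex_hull(X[order[: len(X) // 2]])
    return deepest, bag
```

The expensive part is the AO computation itself; the sorting and hull steps add only O(n log n).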
Note that there exist alternative graphical representations of bivariate data, such
as those based on kernel density estimation. As kernel methods concentrate on local
properties, they are in particular suitable to detect multimodality. However, the notion
of outlier is different from what we have used in this paper. Kernel methods will
consider isolated points as outliers, whereas we try to detect observations which are far
away from the bulk of the data. We refer to [1] for an overview of alternative graphs
and more discussion.
The AO-based bagplot can easily be extended to higher dimensions, as long as the
software accurately supports high-dimensional graphs. To visualize multivariate data,
we can alternatively also construct a bagplot matrix (as in [1]). This is illustrated in
Figure 7 for the geological data of Section 2.4. On the diagonal we have plotted the
adjusted boxplot of each variable, whereas the other cells of the matrix contain the
AO-bagplot of each pair of variables. Note that as the number of observations in the
bag is quite large, we have not drawn all these observations.
4 Simulation study
In this section we study the outlier detection ability of our approach by means of a
simulation study. To this end we have generated data from a multivariate skew-normal
distribution [29]. A p-dimensional random variable X is said to be multivariate skew-normal distributed if its density function is of the form

f(x) = 2 φ_p(x; Ω) Φ(α^T x)   (9)

where φ_p(x; Ω) is the p-dimensional normal density with zero mean and correlation matrix Ω, Φ is the standard normal distribution function and α is a p-dimensional vector that regulates the shape. Note that if α = 0, the skew-normal density reduces to the standard normal density. In our simulations we set Ω = I_p, the identity matrix, and α a vector with elements equal to 10 or 4. For p = 2 we used α = (10, 4)^T, for p = 5 we set α = (10, 10, 4, 4, 4)^T, whereas for p = 10 we took α = (10, 10, 10, 10, 10, 4, 4, 4, 4, 4)^T.
Outliers are randomly generated from a normal distribution with I_p/20 as covariance matrix and a center located along the 1_p direction (all components equal to -1). This is deliberately not the direction of maximal directional skewness [30], but just a direction in which there is a considerable amount of skewness. The contamination was chosen to be clustered since, from the simulation study in [25], this setting appeared to be
in data sets of size n= 200,500 and 1000. An example of such a simulation data set
with 10% contamination is illustrated in Figure 8.
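The simulation design can be reproduced with the selection representation of the skew-normal: draw Z ~ N_p(0, I) and an independent U ~ N(0, 1), and flip the sign of Z whenever U ≥ α^T Z. A hedged Python/NumPy sketch (our own code, not the authors'; parametrizing the outlier center as -dist · 1_p is our assumption, the paper fixes all components at -1):

```python
import numpy as np

def rvs_skew_normal(alpha, n, rng):
    """Sample from density (9) with Omega = I_p via the selection
    representation: keep Z ~ N_p(0, I) if U < alpha^T Z, else flip its sign."""
    alpha = np.asarray(alpha, dtype=float)
    Z = rng.standard_normal((n, alpha.size))
    U = rng.standard_normal(n)
    Z[U >= Z @ alpha] *= -1.0        # the sign flip yields the skew-normal law
    return Z

def simulate_contaminated(alpha, n, eps, dist, rng):
    """Regular skew-normal data plus a cluster of outliers drawn from
    N(-dist * 1_p, I_p / 20), mimicking the simulation design."""
    p = len(alpha)
    n_out = int(round(eps * n))
    X = rvs_skew_normal(alpha, n - n_out, rng)
    out = -dist * np.ones(p) + rng.standard_normal((n_out, p)) / np.sqrt(20.0)
    return np.vstack([X, out])
```

The flip rule is exact: the density of the returned variable is 2 φ_p(z) Φ(α^T z), so no rejection step is needed.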
We compare two methods for outlier detection. The first is our approach based on
the AO-values, as introduced in Section 2.3. For comparison, the second approach is
based on the SD outlyingness. It would have been possible to use the robust distances
from the SD estimator. However, as we have noticed in the previous sections and in
our simulations, this method always yields a huge number of observations that are
(erroneously) indicated as outliers. This stems from the fact that the SD method
assumes symmetry in the definition of the outlyingness, as well as in the use of the
χ²_p cutoff value. To eliminate the effect of the cutoff value, we therefore consider
another outlier detection approach, obtained by applying our adjusted boxplot rule to
the SD outlyingness. So the two methods used in this simulation study only differ in
the definition of the outlyingness, and not in how they define outliers. This makes it easier to quantify the improvements arising from the skewness adjustment in the
outlyingness.
In Figures 9-11 we report some results of our simulation study. The left figures
present the percentage of outliers that were detected by the two methods, as a function
of the distance of the center of the outliers from the origin. The figures to the right
show the percentage of regular observations that were erroneously classified as outliers.
In two dimensions (Figure 9), it is clear that the AO method outperforms the
SD approach considerably with respect to the detection of the correct outliers. The
improvement becomes even more apparent as the sample size increases. Both methods
are comparable in misclassifying regular observations.
In five dimensions (Figure 10), the gain of the skewness adjustment is still present
and again more pronounced when nincreases. In ten dimensions (Figure 11) on the
other hand, both methods are comparable. This is again because the data considered
here do not exhibit much skewness in 10 dimensions. To illustrate, Figure 12(a)
shows for one of our simulated data sets (n = 1000) in 10 dimensions a histogram of the (absolute) MC values on 10000 projections. For a two-dimensional data set, we obtain Figure 12(b). We see that the skewness on average is much smaller when p = 10. Consequently the AO-values will be very similar to the SDO-values.
5 Conclusion
In this paper we have proposed an outlier detection method for multivariate skewed
data. The procedure is based on the skewness-adjusted outlyingness, is distribution-
free and easy to compute. Moreover, we have presented contourplots and a bagplot
based on the AO to visualise the distribution of bivariate data. Simulations and ex-
amples on real data have illustrated that our method outperforms robust methods
that are designed for elliptical data. Software to compute these AO-values and to
draw the bagplot (based on the AO or on the halfspace depth) will become avail-
able at wis.kuleuven.be/stat/robust as part of LIBRA: Matlab Library for Robust
Analysis [31].
Appendix: Influence function
In this section we derive the influence function of the adjusted outlyingness of a univariate continuous distribution F with density f. This influence function describes the effect on the adjusted outlyingness of a point x ∈ ℝ when we put an infinitesimally small contamination in a point z ∈ ℝ [21]. More precisely, consider the contaminated distribution

F_{ε,z} = (1 - ε) F + ε Δ_z

for small ε. The distribution Δ_z is the Dirac distribution which puts all probability mass at the point z. Then the influence function of an estimator T at the distribution F is defined as

IF(z; T, F) = lim_{ε→0} (T(F_{ε,z}) - T(F)) / ε.   (10)
Here, T is the univariate adjusted outlyingness in some x. Therefore, the influence function depends both on the position of the contamination and on the position of the observation in which the adjusted outlyingness is computed. We compute the influence function at a skew-normal distribution with, according to [29], density function

f_α(z) = 2 φ(z) Φ(α z).

Its distribution function is then given by Φ_α(z) = Φ(z) - 2 T_α(z) with the T-function defined as

T_α(z) = (1 / 2π) ∫_0^α exp(-z²(1 + x²)/2) / (1 + x²) dx.
We derive the IF at the skew-normal distribution F = F_1 with the skewness parameter α equal to 1. This distribution has Med(F) = Φ_1^{-1}(1/2), with Φ_α(z) the cumulative distribution function given above. Another choice of α could have been considered as well, but then the median can only be obtained by numerical integration. The theoretical value of the medcouple can be found as the solution of

MC_F = H_F^{-1}(0.5)

with

H_F(u) = 4 ∫_{Med_F}^{∞} F( (x_2(u - 1) + 2 Med_F) / (u + 1) ) dF(x_2).

Solving this equation shows that the population medcouple equals 0.021.
To compute the influence function, two different cases now have to be considered:
points located on the lower side of the median and points on the upper side. Consider
first $x < \mathrm{Med}(F)$. The adjusted outlyingness is then defined as
$$AO^{(1)}(x, F) = \frac{\mathrm{Med}_F - x}{\mathrm{Med}_F - Q_1 + 1.5\, e^{-4\,MC}\, IQR}$$
(since the skew-normal has $MC > 0$). When we contaminate $F$, we may assume that $\epsilon$ is sufficiently small such that $x < \mathrm{Med}(F_{\epsilon,z})$ and $MC(F_{\epsilon,z}) > 0$. Since
$$IF(z; AO^{(1)}(x, F), F) = \frac{\partial}{\partial \epsilon}\, AO^{(1)}(x, F_{\epsilon,z})\Big|_{\epsilon=0}$$
we can easily derive that
$$IF(z; AO^{(1)}(x, F), F) = \frac{1}{4.43}\Big\{2.105\, IF(z; \mathrm{Med}, F) - (\mathrm{Med}(F) - x)\big[IF(z; \mathrm{Med}, F) - IF(z; Q_1, F) + 1.41\, IF(z; IQR, F) - 4.67\, IF(z; MC, F)\big]\Big\}. \qquad (11)$$
Expressions for the influence function of quantiles can be found, e.g., in [32], whereas the influence function of the medcouple is given in [14]. The influence function for points located on the upper side of the median is calculated in a similar way. The
resulting function is plotted in Figure 13. Since all the influence functions that appear in expression (11) are bounded, the influence function of the adjusted outlyingness is bounded (in $z$) as well, showing its robustness. Note that the adjusted outlyingness $AO^{(1)}(x, F)$ is not bounded in $x$, but when $x$ is fixed, the effect of contamination in any point (even in $z = x$) is bounded. Mathematically, the derivative with respect to $z$ tends to a constant.
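For reference, the univariate adjusted outlyingness analysed in this appendix can also be sketched at the sample level. This is a minimal illustration with a naive $O(n^2)$ medcouple and the whisker constants of the adjusted boxplot [13], not the LIBRA implementation:

```python
import math
from statistics import median, quantiles

def medcouple(data):
    """Naive O(n^2) sample medcouple (see [14]). Pairs with x_i = x_j are
    skipped; for continuous data these occur with probability zero."""
    m = median(data)
    lower = [v for v in data if v <= m]
    upper = [v for v in data if v >= m]
    kernels = [((xj - m) - (m - xi)) / (xj - xi)
               for xi in lower for xj in upper if xj > xi]
    return median(kernels)

def adjusted_outlyingness(x, data):
    """Sample version of the univariate AO; whisker constants follow the
    adjusted boxplot of [13]."""
    m = median(data)
    q1, _, q3 = quantiles(data, n=4)   # simple sample quartiles
    iqr = q3 - q1
    mc = medcouple(data)
    if mc >= 0:
        w1 = q1 - 1.5 * math.exp(-4.0 * mc) * iqr
        w2 = q3 + 1.5 * math.exp(3.0 * mc) * iqr
    else:
        w1 = q1 - 1.5 * math.exp(-3.0 * mc) * iqr
        w2 = q3 + 1.5 * math.exp(4.0 * mc) * iqr
    if x <= m:
        return (m - x) / (m - w1)
    return (x - m) / (w2 - m)

data = list(range(-20, 21))              # a symmetric toy sample: MC = 0
print(medcouple(data))                   # 0.0
print(adjusted_outlyingness(0, data))    # 0.0 at the median
print(adjusted_outlyingness(30, data))   # larger for points further out
```

For a symmetric sample the medcouple vanishes and the AO reduces to the classical outlyingness; skewed samples inflate one whisker and deflate the other.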
Acknowledgment
This research has been supported by grant GOA/2007/04 from K.U.Leuven.
References
[1] P.J. Rousseeuw, I. Ruts, and J.W. Tukey. The bagplot: A bivariate boxplot. The
American Statistician, 53:382–387, 1999.
[2] P.J. Rousseeuw. Least median of squares regression. Journal of the American
Statistical Association, 79:871–880, 1984.
[3] W.A. Stahel. Robuste Schätzungen: infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen. PhD thesis, ETH Zürich, 1981.
[4] D.L. Donoho. Breakdown properties of multivariate location estimators. Qualify-
ing paper, Harvard University, Boston, 1982.
[5] P.J. Rousseeuw and V.J. Yohai. Robust regression by means of S-estimators.
In J. Franke, W. Härdle, and R.D. Martin, editors, Robust and Nonlinear Time
Series Analysis, pages 256–272, New York, 1984. Lecture Notes in Statistics No.
26, Springer-Verlag.
[6] L. Davies. Asymptotic behavior of S-estimators of multivariate location parame-
ters and dispersion matrices. The Annals of Statistics, 15:1269–1292, 1987.
[7] K.S. Tatsuoka and D.E. Tyler. On the uniqueness of S-functionals and M-
functionals under nonelliptical distributions. The Annals of Statistics, 28:1219–
1243, 2000.
[8] P.J. Rousseeuw and A.M. Leroy. Robust Regression and Outlier Detection. Wiley-
Interscience, New York, 1987.
[9] R.A. Maronna, R.D. Martin, and V.J. Yohai. Robust Statistics: Theory and
Methods. Wiley, New York, 2006.
[10] P.J. Rousseeuw, M. Debruyne, S. Engelen, and M. Hubert. Robustness and outlier
detection in chemometrics. Critical Reviews in Analytical Chemistry, 36:221–242,
2006.
[11] W.S. Rayens and C. Srinivasan. Box-Cox transformations in the analysis of com-
positional data. Journal of Chemometrics, 5:227–239, 1991.
[12] R.A. Maronna and V.J. Yohai. The behavior of the Stahel-Donoho robust mul-
tivariate estimator. Journal of the American Statistical Association, 90:330–341,
1995.
[13] M. Hubert and E. Vandervieren. An adjusted boxplot for skewed distributions.
Computational Statistics and Data Analysis, 2008. In press.
[14] G. Brys, M. Hubert, and A. Struyf. A robust measure of skewness. Journal of
Computational and Graphical Statistics, 13:996–1017, 2004.
[15] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
Springer, New York, 2001.
[16] M. Hubert, P.J. Rousseeuw, and K. Vanden Branden. ROBPCA: a new approach
to robust principal components analysis. Technometrics, 47:64–79, 2005.
[17] M. Hubert, P.J. Rousseeuw, and T. Verdonck. Robust PCA for skewed data. 2007.
Submitted.
[18] J.W. Tukey. Exploratory Data Analysis. Reading (Addison-Wesley), Mas-
sachusetts, 1977.
[19] C. Reimann, U. Siewers, T. Tarvainen, L. Bityukova, J. Eriksson, A. Gilucis, V. Gregorauskiene, V. Lukashev, N.N. Matinian, and A. Pasieczna. Baltic soil survey: total concentrations of major and selected trace elements in arable soils from 10 countries around the Baltic Sea. The Science of the Total Environment, 257:155–170, 2000.
[20] D. Gervini. The influence function of the Stahel–Donoho estimator of multivariate
location and scatter. Statistics and Probability Letters, 60:425–435, 2002.
[21] F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel. Robust Statis-
tics: The Approach Based on Influence Functions. Wiley, New York, 1986.
[22] A. Marazzi and C. Ruffieux. The truncated mean of an asymmetric distribution.
Computational Statistics and Data Analysis, 32:79–100, 1999.
[23] M. Markatou, A. Basu, and B.G. Lindsay. Weighted likelihood equations with
bootstrap root search. Journal of the American Statistical Association, 93:740–
750, 1998.
[24] M.-P. Victoria-Feser and E. Ronchetti. Robust methods for personal-income dis-
tribution models. The Canadian Journal of Statistics, 22:247–258, 1994.
[25] G. Brys, M. Hubert, and P.J. Rousseeuw. A robustification of Independent Com-
ponent Analysis. Journal of Chemometrics, 19:364–375, 2005.
[26] G. Brys, M. Hubert, and A. Struyf. A comparison of some new measures of
skewness. In R. Dutter, P. Filzmoser, U. Gather, and P.J. Rousseeuw, editors,
Developments in Robust Statistics: International Conference on Robust Statistics
2001, volume 114, pages 98–113. Physika Verlag, Heidelberg, 2003.
[27] D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski. A Handbook of Small Data Sets. Chapman and Hall, London, 1994.
[28] J.W. Tukey. Mathematics and the picturing of data. In Proceedings of the In-
ternational Congress of Mathematicians, volume 2, pages 523–531, Vancouver,
1975.
[29] A. Azzalini and A. Dalla Valle. The multivariate skew-normal distribution.
Biometrika, 83:715–726, 1996.
[30] J.T. Ferreira and M.F. Steel. On describing multivariate skewed distributions: A
directional approach. Canadian Journal of Statistics, 34:411–429, 2006.
[31] S. Verboven and M. Hubert. LIBRA: a Matlab library for robust analysis. Chemo-
metrics and Intelligent Laboratory Systems, 75:127–136, 2005.
[32] P.J. Huber. Robust Statistics. Wiley, New York, 1981.
List of Figures
Figure 1: Geological data: (a) Standard boxplot; (b) Adjusted boxplot. (Both panels show the MgO concentration.)
Figure 2: Illustration of the adjusted outlyingness.
Figure 3: (a) Adjusted outlyingness versus Stahel-Donoho robust distances; (b) $\chi^2_4$-quantile plot of the SD distances.
Figure 4: Several scatterplots of the geological data with outliers marked.
Figure 5: Contourplots of the (a) adjusted outlyingness and (b) Stahel-Donoho outlyingness of the bloodfat data.
Figure 6: Bagplots of the bloodfat data based on (a) halfspace depth and (b) adjusted outlyingness.
Figure 7: Bagplot matrix of the geological data.
Figure 8: Density plot of simulated data from a skew-normal distribution and 10% outliers.
Figure 9: Simulation results for two-dimensional data of size n = 200 and n = 500. (Panels show the percentage of outliers detected and the percentage of regular points classified as outliers, for the Stahel-Donoho and adjusted methods.)
Figure 10: Simulation results for 5-dimensional data of size n = 200 and n = 500. (Panels as in Figure 9.)
Figure 11: Simulation results for 10-dimensional data of size n = 200 and n = 1000. (Panels as in Figure 9.)
Figure 12: Histogram of the absolute MC values on all projections for a simulated data set of dimension (a) 10 and (b) 2.
Figure 13: Influence function of the univariate adjusted outlyingness at a skew-normal distribution.