Error Propagation in Isometric Log-ratio Coordinates for Compositional Data: Theoretical and Practical Considerations

Math Geosci
DOI 10.1007/s11004-016-9646-x
ORIGINAL PAPER
Mehmet Can Mert¹ · Peter Filzmoser¹ · Karel Hron²
Received: 27 August 2015 / Accepted: 15 June 2016
© The Author(s) 2016. This article is published with open access at Springerlink.com
Abstract Compositional data, as they typically appear in geochemistry in terms of concentrations of chemical elements in soil samples, need to be expressed in log-ratio coordinates before applying the traditional statistical tools if the relative structure of the data is of primary interest. There are different possibilities for this purpose, like centered log-ratio coefficients or isometric log-ratio coordinates. In both approaches, geometric means of the compositional parts are involved, and it is unclear how measurement errors or detection limit problems affect their presentation in coordinates. This problem is investigated theoretically by making use of the theory of error propagation. Due to certain limitations of this approach, the effect of error propagation is also studied by means of simulations. This makes it possible to provide recommendations for practitioners on the amount of error and on the expected distortion of the results, depending on the purpose of the analysis.
Keywords Aitchison geometry · Orthonormal coordinates · Taylor approximation · Compositional differential calculus · Detection limit
Mehmet Can Mert
mehmet.mert@tuwien.ac.at; p.filzmoser@tuwien.ac.at
Karel Hron
hronk@seznam.cz
¹ Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Wiedner Hauptstrasse 8-10, 1040 Vienna, Austria
² Department of Mathematical Analysis and Applications of Mathematics, Faculty of Science, Palacký University, 17. listopadu 12, 771 46 Olomouc, Czech Republic
1 Introduction
Compositional data analysis is concerned with analyzing the relative information
between the variables, the so-called compositional parts, of a multivariate data set.
Here, relative information refers to the log-ratio methodology (Aitchison 1986) and,
therefore, in fact, to an analysis of logarithms of ratios between the compositional
parts. It has been demonstrated that the sample space of compositions is not the usual Euclidean space, but the simplex with the so-called Aitchison geometry (Pawlowsky-Glahn et al. 2015). For a composition $x=(x_1,\ldots,x_D)$ with $D$ parts, the simplex sample space is defined as
$$\mathcal{S}^D=\Big\{x=(x_1,\ldots,x_D)\ \text{such that}\ x_j>0\ \forall j,\ \sum_{j=1}^{D}x_j=\kappa\Big\}$$
for an arbitrary constant $\kappa$. Nevertheless, according to recent developments, the sample space of compositional data is even more general (Pawlowsky-Glahn et al. 2015): A vector $x$ is a $D$-part composition when all its components are strictly positive real numbers and carry only relative information. Note that the term relative information means that the information lies in the ratios between the components, not in the absolute values. As a consequence, the actual sample space of compositional data is formed by equivalence classes of proportional positive vectors. Therefore, any constant sum constraint is just a proper representation of compositions that honors the scale invariance principle of compositions: the information in a composition does not depend on the particular units in which the composition is expressed (Egozcue 2009). In practical terms, the choice of the constant $\kappa$ is irrelevant, since it does not alter the results from a log-ratio-based analysis. In that sense, a discussion on whether the values of an observation sum up to the same constant is needless, since this would not make any difference for the analysis considered in this paper. For the purpose of better interpretability or visualization, however, one could also express compositions with the closure operator
$$\mathcal{C}(x)=\left(\frac{\kappa x_1}{\sum_{j=1}^{D}x_j},\ldots,\frac{\kappa x_D}{\sum_{j=1}^{D}x_j}\right),$$
whose parts then sum up to the constant $\kappa$.
The Aitchison geometry defines a vector space structure of the simplex by the basic operations of perturbation and powering. Given two compositions $x=(x_1,\ldots,x_D)$ and $y=(y_1,\ldots,y_D)$ in $\mathcal{S}^D$, perturbation refers to vector addition, and is defined as
$$x\oplus y=\mathcal{C}(x_1y_1,\ldots,x_Dy_D).$$
Powering refers to a multiplication of a composition $x=(x_1,\ldots,x_D)\in\mathcal{S}^D$ by a scalar $\alpha\in\mathbb{R}$, and is defined as
$$\alpha\odot x=\mathcal{C}(x_1^{\alpha},x_2^{\alpha},\ldots,x_D^{\alpha}).$$
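The closure operator, perturbation, and powering can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `closure`, `perturb`, and `power`, the default $\kappa=1$, and the example compositions are assumptions made here and are not part of the paper.

```python
def closure(x, kappa=1.0):
    """Rescale a positive vector so that its parts sum to kappa."""
    total = sum(x)
    return [kappa * xi / total for xi in x]


def perturb(x, y, kappa=1.0):
    """Perturbation: closed component-wise product of two compositions."""
    return closure([xi * yi for xi, yi in zip(x, y)], kappa)


def power(alpha, x, kappa=1.0):
    """Powering: closed component-wise power of a composition."""
    return closure([xi ** alpha for xi in x], kappa)


# Hypothetical 3-part compositions, chosen only for illustration.
x = [0.2, 0.3, 0.5]
y = [0.1, 0.6, 0.3]

# Neutral element of perturbation: the closed vector of ones.
neutral = closure([1.0, 1.0, 1.0])

x_plus_y = perturb(x, y)
# Perturbing x with its inverse (-1 powering) gives the neutral element.
back_to_neutral = perturb(x, power(-1.0, x))
```

Because every operation ends with a closure, rescaling an input (for example passing `[2.0, 3.0, 5.0]` instead of `x`) leaves the result unchanged, which mirrors the scale invariance discussed above.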
Furthermore, the Aitchison inner product, the Aitchison norm, and the Aitchison distance have been defined, and they lead to a Euclidean vector space structure (Pawlowsky-Glahn et al. 2015). All these definitions employ log-ratios between the compositional parts; for instance, the Aitchison inner product between the compositions $x$ and $y$ is given as
$$\langle x,y\rangle_A=\frac{1}{2D}\sum_{j=1}^{D}\sum_{k=1}^{D}\ln\frac{x_j}{x_k}\,\ln\frac{y_j}{y_k},$$
which leads to the Aitchison norm and distance
$$\|x\|_A=\sqrt{\langle x,x\rangle_A},\qquad d_A(x,y)=\|x\oplus(-1)\odot y\|_A,$$
respectively. Working directly in the simplex sample space is not straightforward. Rather, it is common to express compositional data in the usual Euclidean geometry. In the literature, one frequently refers to transformations; here, it is preferred to use the terminology of expressing the compositions in appropriate coordinates with respect to the Aitchison geometry (Pawlowsky-Glahn and Egozcue 2001), which allows compositions to be analyzed in the usual Euclidean geometry.
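As an illustration of the inner product, norm, and distance defined above, the following sketch evaluates them directly from the double sum over log-ratios. The function names and the example compositions are assumptions introduced here for illustration only.

```python
import math


def aitchison_inner(x, y):
    """Aitchison inner product via the double sum over all pairs of parts."""
    D = len(x)
    total = 0.0
    for j in range(D):
        for k in range(D):
            total += math.log(x[j] / x[k]) * math.log(y[j] / y[k])
    return total / (2 * D)


def aitchison_norm(x):
    """Aitchison norm: square root of the inner product of x with itself."""
    return math.sqrt(aitchison_inner(x, x))


def aitchison_distance(x, y):
    """Aitchison distance: norm of x perturbed by the inverse of y.

    The difference has parts proportional to x_j / y_j. No closure is
    applied because the inner product only uses ratios of parts.
    """
    diff = [xj / yj for xj, yj in zip(x, y)]
    return aitchison_norm(diff)


# Hypothetical compositions, chosen only for illustration.
x = [0.2, 0.3, 0.5]
y = [0.1, 0.6, 0.3]

d_xy = aitchison_distance(x, y)
# Expressing x in other units (here: multiplied by 10) changes nothing.
d_scaled = aitchison_distance([10.0 * v for v in x], y)
```

The comparison of `d_xy` and `d_scaled` shows in numbers that the choice of the constant $\kappa$ does not affect a log-ratio-based distance.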
The focus in this paper is on isometric log-ratio (ilr) coordinates (Egozcue et al. 2003), which allow a composition $x\in\mathcal{S}^D$ to be expressed in the real space $\mathbb{R}^{D-1}$. A particular choice for ilr coordinates is
$$z_j=\mathrm{ilr}_j(x)=\sqrt{\frac{D-j}{D-j+1}}\,\ln\frac{x_j}{\sqrt[D-j]{\prod_{k=j+1}^{D}x_k}},\quad j=1,\ldots,D-1,\qquad(1)$$
and the coordinates $z=(z_1,\ldots,z_{D-1})$ indeed correspond to an orthonormal basis in $\mathbb{R}^{D-1}$ (Egozcue et al. 2003). The particular choice of the ilr coordinates in (1) allows for an interpretation of the first coordinate $z_1$ as the one expressing all relative information about part $x_1$, since $x_1$ is not included in any other ilr coordinate.
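A minimal sketch of the coordinates in Eq. (1) follows, together with a numerical check that the Euclidean norm of the coordinates matches the Aitchison norm of the composition. The function names and the example composition are hypothetical, and the Aitchison norm is computed from the double-sum definition of the inner product given earlier.

```python
import math


def ilr(x):
    """ilr coordinates z_1, ..., z_{D-1} following Eq. (1)."""
    D = len(x)
    z = []
    for j in range(1, D):            # j = 1, ..., D-1 as in Eq. (1)
        rest = x[j:]                 # parts x_{j+1}, ..., x_D
        m = D - j                    # number of parts in the geometric mean
        log_gm = sum(math.log(v) for v in rest) / m
        z.append(math.sqrt(m / (m + 1)) * (math.log(x[j - 1]) - log_gm))
    return z


def aitchison_norm(x):
    """Aitchison norm from the double sum over squared log-ratios."""
    D = len(x)
    total = sum(math.log(a / b) ** 2 for a in x for b in x)
    return math.sqrt(total / (2 * D))


# Hypothetical 4-part composition, for illustration only.
x = [0.1, 0.2, 0.3, 0.4]
z = ilr(x)

euclidean_norm_z = math.sqrt(sum(v * v for v in z))
aitchison_norm_x = aitchison_norm(x)
```

The two norms agree up to floating-point error, which is what one expects from coordinates with respect to an orthonormal basis.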
The definition of ilr coordinates (1) reveals that geometric means of (subsets of) the parts are involved. Note that the geometric mean of $x$ can also be expressed as
$$\mathrm{gm}(x)=\Big(\prod_{j=1}^{D}x_j\Big)^{1/D}=\exp\Big(\frac{1}{D}\sum_{j=1}^{D}\ln x_j\Big),$$
involving the arithmetic mean of the log-transformed values. It is well known that the arithmetic mean is sensitive to data outliers (Maronna et al. 2006). Consequently, data imprecision in one or some compositional parts (that are usually measured without respecting the relative nature of compositional data), or detection limit problems, may also act like outliers and lead to a distortion of the geometric mean. The resulting ilr coordinates will suffer from data quality problems, and subsequent analyses based on these coordinates can be biased.
This unwanted effect is investigated here under the terminology of error propagation, where the effect of the errors on the output of a function is analyzed. Propagation of error can be performed by a calculus-based approach, or by simulation studies. A calculus-based approach makes use of the Taylor series expansion and calculates the first two statistical moments of the error of the output, the mean and the variance, under the assumption that the errors are statistically independent (Ku 1966). With few exceptions, almost all analyses of error propagation with the calculus-based approach use the first-order Taylor approximation, and neglect the higher order terms (Birge 1939). This approach is briefly reviewed in Sect. 2. Section 3 starts with a motivating example about the effect of the errors on ilr coordinates and applies the concept of Taylor approximation to error propagation in the simplex. While this is done in a general form for any function (transformation), particular emphasis is given to error propagation for ilr coordinates, which is one source of distortion of outputs in practical geochemical problems (Filzmoser et al. 2009b).
Determining error propagation only for the first two moments is unsatisfactory, because it is also of interest how the data structure is changed in the case of data problems like detection limits or imprecision of the measurements. Thus, simulation-based methods for error propagation are considered as well. The Monte Carlo method is adaptable and simple for the propagation of errors (Feller and Blaich 2001; Cox and Siebert 2006), and various applications of this method can be found (Liu 2008). The simulation-based approach in Sect. 4 makes use of a practical data set and shows the effect of imprecision and detection limit effects on the ilr coordinates. The interest lies particularly in error propagation on the first ilr coordinate, because this contains all relative information about the first compositional part, and on error propagation on all ilr coordinates jointly, because they contain the full multivariate information. The final Sect. 5 discusses the findings and concludes.
2 Error Propagation in the Standard Euclidean Geometry
Consider a $p$-dimensional random variable $x=(x_1,\ldots,x_p)$, and a function $f:\mathbb{R}^p\to\mathbb{R}$ that gives the output $y$ as a result of $y=f(x)$. The propagation of the errors of each variable through the function $f$ on the output can be derived using Taylor approximation (Ku 1966). This yields a linear approximation of the function $f$ by the tangent plane where the slopes in $x_1,\ldots,x_p$ are described by the partial derivatives $\frac{\partial y}{\partial x_1},\ldots,\frac{\partial y}{\partial x_p}$ at a single point. One can express the random variables $x_1,\ldots,x_p$ as the sum of their expected values $\mu=(\mu_1,\ldots,\mu_p)$ and random deviations from the expected value $\varepsilon=(\varepsilon_1,\ldots,\varepsilon_p)$, so that $x=\mu+\varepsilon$, assuming that the errors have mean zero. Taking the first-order Taylor approximation of $f(x)$ results in
$$\begin{aligned}y&=f(x_1,\ldots,x_p)=f(\mu_1+\varepsilon_1,\ldots,\mu_p+\varepsilon_p)\\&\approx f(\mu_1,\ldots,\mu_p)+\frac{\partial f}{\partial x_1}(\mu_1)\,\varepsilon_1+\cdots+\frac{\partial f}{\partial x_p}(\mu_p)\,\varepsilon_p.\end{aligned}\qquad(2)$$
In the framework of error propagation, it is common to assume that $(x_1,\ldots,x_p)$ follow a known distribution, in most cases a multivariate normal distribution (Ku 1966). If the distribution is known, the partial derivatives are evaluated at the true means; if not, the sample averages are used for the estimation. The approximation in Eq. (2) can now be used to calculate the mean and variance of $y$, which both depend on the function $f$. The second central moment, the variance $\mathrm{Var}(y)$, describes the uncertainty, which is mainly used to investigate the effect of error propagation, and is given as
$$\mathrm{Var}(y)\approx\sum_{j=1}^{p}\Big(\frac{\partial f}{\partial x_j}(\mu_j)\Big)^2E(\varepsilon_j^2)+\sum_{j\neq k}\frac{\partial f}{\partial x_j}(\mu_j)\,\frac{\partial f}{\partial x_k}(\mu_k)\,E(\varepsilon_j\varepsilon_k).\qquad(3)$$
Equation (3) reveals how the variability of the output $y$ depends on the errors and on the function $f$.
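Equation (3) can be evaluated numerically once the partial derivatives and the second moments of the errors are available. The sketch below is an assumption-laden illustration, not the authors' implementation: the derivatives are approximated by central finite differences, the errors are described by a hypothetical covariance matrix `cov`, and the test function is a simple product chosen for easy verification by hand.

```python
def numerical_gradient(f, mu, h=1e-6):
    """Central finite-difference approximation of the partial derivatives at mu."""
    grad = []
    for j in range(len(mu)):
        up = list(mu)
        down = list(mu)
        up[j] += h
        down[j] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def propagated_variance(f, mu, cov):
    """First-order variance of y = f(x) following Eq. (3).

    cov[j][k] holds E(eps_j * eps_k). The double sum over all j, k covers
    both the squared terms (j == k) and the cross terms (j != k).
    """
    g = numerical_gradient(f, mu)
    p = len(mu)
    return sum(g[j] * g[k] * cov[j][k] for j in range(p) for k in range(p))


def product(x):
    """Hypothetical example function y = x1 * x2."""
    return x[0] * x[1]


mu = [2.0, 3.0]
cov = [[0.01, 0.002],
       [0.002, 0.04]]

var_y = propagated_variance(product, mu, cov)
# By hand: 3^2 * 0.01 + 2^2 * 0.04 + 2 * (3 * 2 * 0.002) = 0.274
```

For independent errors the off-diagonal entries of `cov` are zero and only the first sum of Eq. (3) remains.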
3 Error Propagation on the Simplex
As a motivating example, the composition of sand, silt, and clay in agricultural soils in Europe is considered. The data are reported in Reimann et al. (2014). From the ternary diagram (Fig. 1a), it can be seen that the clay concentrations can be very small, but data artifacts are not immediately visible. The resulting ilr coordinates $z_1$ and $z_2$ are shown in Fig. 1b. Here, the small clay values are visible in the form of a band that deviates clearly from the joint data structure. In fact, small values of clay have been rounded in the laboratory, which already causes a distortion of the multivariate data structure. Thus, the imprecision here is visible as a rounding effect in the part clay. Variables with values below a detection limit can result in similar artifacts, since usually the values below detection are set to some constant, like 2/3 of the detection limit (Martín-Fernández et al. 2003). This is still the usual practice in geosciences rather than employing more sophisticated algorithms for their imputation (Martín-Fernández et al. 2012).
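The constant replacement described above can be sketched as follows. The clay values, the detection limit of 1, and the function name are hypothetical and are not taken from the GEMAS data; the sketch only shows why such a replacement removes variation in the affected part.

```python
def replace_below_dl(values, detection_limit, factor=2.0 / 3.0):
    """Set every value below the detection limit to factor * detection_limit."""
    return [v if v >= detection_limit else factor * detection_limit
            for v in values]


# Hypothetical clay concentrations (in %), for illustration only.
clay = [0.4, 0.7, 0.9, 2.5, 11.0, 23.0]
detection_limit = 1.0

clay_replaced = replace_below_dl(clay, detection_limit)

# All values below the limit collapse to a single constant, so the
# logarithm of this part no longer varies among the affected samples.
distinct_small = {v for v in clay_replaced if v < detection_limit}
```

Since the affected samples share one value for this part, their log-ratios with the remaining parts vary only through those other parts, which is one way a band-like artifact can arise in coordinates.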
As in Sect. 2, error propagation is derived for a general function using the first-order Taylor approximation. However, since this is done directly on the simplex, the Taylor approximation also needs to be done on the simplex. The theoretical background for the differential calculus on the simplex can be found in Barceló-Vidal and Martín-Fernández (2002) and Barceló-Vidal et al. (2011). Here, the tools necessary to carry out the Taylor approximation are recalled.
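Fig. 1 Composition of sand, silt, and clay in agricultural soils of Europe. Ternary diagram (a), representation in ilr coordinates $z_1$ and $z_2$ (b)

Let $f:U\to\mathbb{R}^m$ be a vector-valued function defined on a subset $U\subseteq\mathbb{R}^D_+$. Let $\underline{U}=\{\mathcal{C}(w),\ w\in U\}$, the compositional closure of $U$, be a subset of $\mathcal{S}^D$. If $f$ is scale invariant, that is $f(w)=f(kw)$ for any $k>0$, it induces a vector-valued function $\underline{f}:\underline{U}\to\mathbb{R}^m$. It suffices to define
$$\underline{f}(x)=f(w),\quad w\in U,$$
where $\mathcal{C}(w)=x$ (Barceló-Vidal et al. 2011). The function $\underline{f}$ is C-differentiable at $x\in\underline{U}$ if there exists an $m\times D$ matrix $A=(a_{ij})$, satisfying $A1_D=0_m$ (defining a linear transformation from $\mathbb{R}^D$ to $\mathbb{R}^m$), such that
$$\lim_{u\,\xrightarrow{C}\,n}\frac{\|\underline{f}(x\oplus u)-\underline{f}(x)-A\ln u\|}{\|u\|_A}=0$$
for $u\in\underline{U}$, where $1_D=(1,\ldots,1)$ with length $D$, and $0_m=(0,\ldots,0)$ with length $m$. Note that $n=\mathcal{C}(1,\ldots,1)$ is the neutral element of $(\mathcal{S}^D,\oplus)$ and $u\xrightarrow{C}n$ denotes that $u$ converges to $n$ on the simplex. From the definitions above, the first-order Taylor approximation of a real-valued function $\underline{f}$ can be written as
$$\underline{f}(x\oplus u)\approx\underline{f}(x)+\sum_{j=1}^{D}\ln(u_j)\,\frac{\partial_C\underline{f}}{\partial x_j}(x),\qquad(4)$$
where the C-derivative of $\underline{f}$ exists and is equal to
$$\frac{\partial_C\underline{f}}{\partial x_j}(x)=x_j\left(\frac{\partial\underline{f}}{\partial x_j}(x)-\sum_{i=1}^{D}x_i\frac{\partial\underline{f}}{\partial x_i}(x)\right)\quad\text{for } j=1,\ldots,D.\qquad(5)$$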
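The following sketch illustrates Eqs. (4) and (5) numerically. It rests on several assumptions made here: compositions are closed to sum 1, the ordinary partial derivatives are approximated by central finite differences, the C-derivative is taken as the product of $x_j$ with the difference between the $j$th partial derivative and the $x$-weighted sum of all partial derivatives, and the test function `f` is a hypothetical choice.

```python
import math


def closure(x):
    """Rescale a positive vector so that its parts sum to 1."""
    total = sum(x)
    return [v / total for v in x]


def partials(f, x, h=1e-6):
    """Central finite-difference approximation of the ordinary partial derivatives."""
    grads = []
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += h
        down[j] -= h
        grads.append((f(up) - f(down)) / (2.0 * h))
    return grads


def c_derivative(f, x):
    """C-derivatives as in Eq. (5), for a composition x closed to sum 1."""
    grads = partials(f, x)
    weighted = sum(xi * gi for xi, gi in zip(x, grads))
    return [xj * (gj - weighted) for xj, gj in zip(x, grads)]


def f(x):
    """Hypothetical real-valued function of a closed composition."""
    return x[0] * x[1]


x = closure([2.0, 3.0, 5.0])
cd = c_derivative(f, x)

# Small perturbation close to the neutral element.
u = closure([1.01, 0.99, 1.00])
exact = f(closure([xi * ui for xi, ui in zip(x, u)]))
# First-order approximation as in Eq. (4).
linear = f(x) + sum(math.log(uj) * cj for uj, cj in zip(u, cd))
```

In this example the C-derivatives sum to zero, in line with the condition $A1_D=0_m$, and `linear` differs from `exact` only by a term of second order in the perturbation.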
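A $D$-part composition $x=(x_1,\ldots,x_D)\in\mathcal{S}^D$ can be expressed as a perturbation of its center $\mu=(\mu_1,\ldots,\mu_D)$ (Pawlowsky-Glahn and Egozcue 2002) by random deviations $\varepsilon=(\varepsilon_1,\ldots,\varepsilon_D)$ from the center, so that $x=\mu\oplus\varepsilon$. Then (4) can be rewritten as
$$\underline{f}(\mu\oplus\varepsilon)\approx\underline{f}(\mu)+\sum_{j=1}^{D}\ln(\varepsilon_j)\,\frac{\partial_C\underline{f}}{\partial\mu_j}(\mu).\qquad(6)$$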
One can proceed as in Sect. 2 to derive the variance of the components of $\underline{f}(\mu\oplus\varepsilon)$. As for the Taylor expansion (2) from Sect. 2, the approximation here is also valid just for small perturbations. Moreover, in contrast to the previous case, the error is now multiplicative. Although this fits well with the nature of compositional data, particularly with their scale invariance, in practice, error terms are often additive (van den Boogaart et al. 2015). This fact should be considered for an error propagation analysis of compositional data.
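In the case of ilr coordinates, however, the investigation of the error propagation simplifies. Considering (6) with the ilr coordinate $\mathrm{ilr}_i(x)$ as the $i$th component of $\underline{f}$ gives
$$\frac{\partial_C\,\mathrm{ilr}_i}{\partial\mu_j}=\begin{cases}0&\text{if } j<i,\\[4pt]\sqrt{\dfrac{D-i}{D-i+1}}&\text{if } j=i,\\[4pt]-\sqrt{\dfrac{D-i}{D-i+1}}\,\dfrac{1}{D-i}&\text{if } j>i,\end{cases}\qquad(7)$$
where $i=1,\ldots,D-1$. With these coefficients, the sum in (6) corresponds exactly to the logcontrast (Aitchison 1986) that defines the $i$th ilr coordinate of the compositional error $\varepsilon$, and consequently
$$\mathrm{ilr}_i(x)=\mathrm{ilr}_i(\mu\oplus\varepsilon)=\mathrm{ilr}_i(\mu)+\mathrm{ilr}_i(\varepsilon),\quad i=1,\ldots,D-1.$$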
In the context of error propagation this shows that the ilr coordinates are additive
with respect to multiplicative errors. On the other hand, for other forms of errors,
a non-linear behavior can be expected. This issue is further investigated within the
simulation study in Sect. 4.
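The contrast between multiplicative and additive errors can be made concrete with a small numerical sketch. The compositions, the error vectors, and the helper names below are hypothetical; the `ilr` function follows Eq. (1), and no closure is applied because the coordinates depend only on ratios of parts.

```python
import math


def ilr(x):
    """ilr coordinates following Eq. (1)."""
    D = len(x)
    z = []
    for j in range(1, D):
        m = D - j
        log_gm = sum(math.log(v) for v in x[j:]) / m
        z.append(math.sqrt(m / (m + 1)) * (math.log(x[j - 1]) - log_gm))
    return z


def shift(x_new, x_old):
    """Change in the ilr coordinates caused by moving from x_old to x_new."""
    return [a - b for a, b in zip(ilr(x_new), ilr(x_old))]


# Two hypothetical compositions with different centers.
x1 = [0.2, 0.3, 0.5]
x2 = [0.05, 0.15, 0.8]

# Multiplicative error (perturbation): the shift equals ilr(eps) for both.
eps = [1.10, 0.95, 1.00]
ilr_eps = ilr(eps)
mult_shift_1 = shift([a * e for a, e in zip(x1, eps)], x1)
mult_shift_2 = shift([a * e for a, e in zip(x2, eps)], x2)

# Additive error in the first part: the shift depends on the composition.
add = [0.01, 0.0, 0.0]
add_shift_1 = shift([a + e for a, e in zip(x1, add)], x1)
add_shift_2 = shift([a + e for a, e in zip(x2, add)], x2)
```

In this example the same additive error of 0.01 moves the first coordinate of `x2`, whose first part is small, much further than that of `x1`, whereas the multiplicative error moves both by the same amount.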
In addition, this leads to an alternative verification of the linearity of ilr coordinates,
$$z=\mathrm{ilr}(x)=\mathrm{ilr}(\mu\oplus\varepsilon)=\mathrm{ilr}(\mu)+\mathrm{ilr}(\varepsilon),$$
which is commonly shown directly with the definitions from Sect. 1. Moreover, ilr coordinates represent an isometry, which means that all metric concepts in the simplex are maintained after taking the ilr coordinates (Pawlowsky-Glahn et al. 2015). The variance can now be considered component-wise; for example, for the $j$th component $z_j$ of $z$ one obtains
$$\mathrm{Var}(z_j)=\mathrm{Var}(\mathrm{ilr}_j(x))=\mathrm{Var}(\mathrm{ilr}_j(\varepsilon)).$$
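This variance can be expressed by log-ratios of the compositional parts, as shown in Fišerová and Hron (2011), as
$$\begin{aligned}\mathrm{Var}(z_j)&=A-B\quad\text{with}\\A&=\frac{1}{D-j+1}\sum_{k=j+1}^{D}\mathrm{Var}\Big(\ln\frac{\varepsilon_j}{\varepsilon_k}\Big),\\B&=\frac{1}{2(D-j)(D-j+1)}\sum_{k=j+1}^{D}\sum_{l=j+1}^{D}\mathrm{Var}\Big(\ln\frac{\varepsilon_k}{\varepsilon_l}\Big).\end{aligned}\qquad(8)$$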
The contributions of log-ratio variances in this linear combination are clearly higher for terms in $A$ that include $\varepsilon_j$, and lower for terms in $B$ where $\varepsilon_j$ is not involved, and their magnitude depends on the number of parts $D$. In particular, if $D$ is large and contamination (imprecision, detection limit problem) is expected only in one compositional part, the effect on the variance of $z_j$ will be small. Note, however, that for a multivariate analysis, the focus is on all coordinates $z_1,\ldots,z_{D-1}$ simultaneously, and thus, it is not so straightforward to investigate the effect, since there may also be dependencies among the error terms. There is a simple exception: suppose that an error is to be expected only in log-ratios with one compositional part. From a practical perspective, it would then appear that only one compositional part is erroneous. If this part is taken as the first one, the ilr coordinates from Eq. (1) allow this error to be assigned exclusively to $z_1$, but not to the other coordinates.
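The decomposition in Eq. (8) can be checked on simulated errors. The sketch below is only an illustration under assumptions made here: the errors are generated as independent log-normal values with a hypothetical larger spread in the first part, sample variances from the standard library are used throughout, and all function names are invented for this example.

```python
import math
import random
import statistics


def ilr_j(x, j):
    """The j-th ilr coordinate of Eq. (1); j is 1-based."""
    m = len(x) - j
    log_gm = sum(math.log(v) for v in x[j:]) / m
    return math.sqrt(m / (m + 1)) * (math.log(x[j - 1]) - log_gm)


def logratio_var(sample, a, b):
    """Sample variance of ln(eps_a / eps_b); a and b are 1-based."""
    return statistics.variance([math.log(e[a - 1] / e[b - 1]) for e in sample])


def variance_terms(sample, j):
    """Terms A and B of Eq. (8) for the j-th coordinate."""
    D = len(sample[0])
    rest = range(j + 1, D + 1)
    A = sum(logratio_var(sample, j, k) for k in rest) / (D - j + 1)
    B = sum(logratio_var(sample, k, l) for k in rest for l in rest)
    B /= 2 * (D - j) * (D - j + 1)
    return A, B


random.seed(1)
D = 4
# Hypothetical error model: larger log-scale spread in the first part only.
sd = [0.3, 0.05, 0.05, 0.05]
errors = [[math.exp(random.gauss(0.0, s)) for s in sd] for _ in range(200)]

var_direct = []
var_formula = []
for j in range(1, D):
    var_direct.append(statistics.variance([ilr_j(e, j) for e in errors]))
    A, B = variance_terms(errors, j)
    var_formula.append(A - B)
```

Both lists agree up to floating-point error, and with the first part carrying the larger error, the variance of the first coordinate is much larger than that of the remaining ones in this simulated sample.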
Besides investigating the variance of the coordinates, it is also important to know
how the errors affect distances between different compositions, that is between obser-
vations of a compositional data set, and how the multivariate data structure is affected.
All these aspects will be investigated in more detail by simulations in the next
section.
4 Simulation-Based Investigations of Error Propagation
For a simulation-based analysis of error propagation, a real data set is used, namely the GEMAS data mentioned in Sect. 1, described in Reimann et al. (2014). More than 2000 samples of agricultural soils have been analyzed in an area covering 5.6 million km² of Europe across 33 countries, and for the simulations, the concentrations of the elements Al, Ba, Ca, Cr, Fe, K, Mg, Mn, Na, Nb, P, Pb, Rb, Si, Sr, Ti, V, Y, Zn, and Zr are considered. Precision or detection limit problems of these elements are rather small or even nonexistent (Reimann et al. 2014), and thus, these elements form a good