Austrian Journal of Statistics
January 2021, Volume 50, 1–15.
http://www.ajs.or.at/
doi:10.17713/ajs.v50i2.1069
Impact of Covariates in Compositional Models and
Simplicial Derivatives
Joanna Morais
Avisia, Bordeaux, France
Christine Thomas-Agnan
Toulouse School of Economics
Abstract
In the framework of Compositional Data Analysis, vectors carrying relative information,
also called compositional vectors, can appear in regression models either as dependent
or as explanatory variables. In some situations, they can be on both sides of the regression
equation. Measuring the marginal impacts of covariates in these types of models is not
straightforward since a change in one component of a closed composition automatically
affects the rest of the composition.
Previous work by the authors has shown how to measure, compute and interpret these
marginal impacts in the case of linear regression models with compositions on both sides
of the equation. The resulting natural interpretation is in terms of an elasticity, a quantity
commonly used in econometrics and marketing applications. They also demonstrate the
link between these elasticities and simplicial derivatives.
The aim of this contribution is to extend these results to other situations, namely
when the compositional vector is on a single side of the regression equation. In these
cases, the marginal impact is related to a semi-elasticity and also linked to some
simplicial derivative. Moreover, we consider the possibility that a total variable is used as an
explanatory variable, with several possible interpretations of this total, and we derive the
elasticity formulas in that case.
Keywords: compositional regression model, marginal effects, simplicial derivative, elasticity,
semi-elasticity.
1. Introduction and literature review
We consider regression models involving compositional vectors, i.e. vectors carrying relative
information. When relative information is the focus, meaningful functions are functions of
ratios of the vector’s components; using traditional regression models in such cases is
therefore not correct. Regression models that respect the compositional nature of such data have
been proposed in the literature, for example those introduced by Aitchison (1986) based on
log-ratio transformations. Theory for inference in these models is developed for example in
Pawlowsky-Glahn and Buccianti (2011), Van Den Boogaart and Tolosana-Delgado (2013),
Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado (2015b), and Filzmoser, Hron, and Templ
(2018).
When the compositional vectors only appear as dependent variable, we will say that the
model is of the ‘Y-compositional’ type (see e.g. Egozcue, Daunis-I-Estadella,
Pawlowsky-Glahn, Hron, and Filzmoser (2012)). When they only appear as explanatory variables, we
will say that the model is of the ‘X-compositional’ type (see e.g. Hron, Filzmoser, and
Thompson (2012)). Finally, when they appear on both sides, we will say that the model is of
the ‘YX-compositional’ type, see e.g. Kynclova, Filzmoser, and Hron (2015), Chen, Zhang,
and Li (2017), Morais, Thomas-Agnan, and Simioni (2018a) and Morais, Thomas-Agnan,
and Simioni (2018b). A simplified version of the YX-compositional type is presented in
Wang, Shangguan, Wu, and Guan (2013), and Morais, Thomas-Agnan, and Simioni (2018b)
later show that this model is equivalent to the so-called MCI (multiplicative competitive
interaction) model introduced earlier in the marketing literature by Nakanishi and Cooper
(1982). It may also be relevant to include in the model the total of the different parts
involved in the composition, and we will consider each of the above models with
or without a total variable, see e.g. Coenders, Martín-Fernández, and Ferrer-Rosell (2017)
and Coenders, Ferrer-Rosell, Mateu-Figueras, and Pawlowsky-Glahn (2015). Extensions with
compositional functional predictors are presented in Sun, Xu, Cong, and Chen (2018), Bui,
Loubes, Risser, and Balaresque (2018) and Combettes and Muller (2019). Case studies using
some of these models are presented in Hron, Filzmoser, and Thompson (2012) and Trinh, Morais,
Thomas-Agnan, and Simioni (2018) for the X-compositional type, and in Morais, Thomas-Agnan,
and Simioni (2017) for the YX-compositional type.
The focus of the present work is on the definition and interpretation of impacts of covariates
in these models, a question addressed by far fewer papers. Muller, Hron, Fiserova, Smahaj,
Cakirpaloglu, and Vancakova (2018) propose an interpretation for models of X-compositional
or Y-compositional types based on using a specific type of orthogonal coordinates (called
pivot coordinates, see e.g. Filzmoser, Hron, and Templ (2018)). Moreover, they promote the
replacement of the natural logarithm by the base-2 logarithm for enhancing the
interpretability. The first drawback is that the resulting interpretation requires rerunning the model once
for each component in the Y-compositional case. Moreover, changes in log-ratios correspond
to a multiplicative increase (of the dependent or independent variables) in terms of relative
dominance, i.e. the ratio of one component to the geometric mean of the others (while
keeping all other log-ratios constant), which is not a very intuitive notion. This point of view is
extended in Coenders and Pawlowsky-Glahn (2020) by considering changes in more general
log-ratios leading to changes in any subset of components by a common factor (while reducing
the remaining components accordingly).
Morais, Thomas-Agnan, and Simioni (2018b) show that a natural interpretation tool in the
YX-compositional model is the notion of elasticity. The notion of elasticity is frequently used
in econometrics: it corresponds to the percent increase of the dependent variable induced by
a percent increase of the explanatory variable and is natural in a log-log regression model
because it coincides with the explanatory variable parameter (see Section 3). Indeed,
elasticities are commonly computed for the MCI model in the marketing literature (see Nakanishi
and Cooper (1982)). Morais, Thomas-Agnan, and Simioni (2018b) relate it to the notion of
simplicial derivatives introduced in Egozcue, Jarauta-Bragulat, and Díaz-Barrero (2011) and
Barceló-Vidal, Martín-Fernández, and Mateu-Figueras (2011).
With a graphical approach, Nguyen, Laurent, Thomas-Agnan, and Ruiz-Gazen (2018) shed
a different light on the evaluation of these impacts by plotting the predicted components
as a function of the explanatory variables, but this graphing tool is limited to compositional
dependent or explanatory variables with three components.
Finally, for the X-compositional model, Coenders and Pawlowsky-Glahn (2020) consider the
introduction of the total variable among the explanatory variables and adapt the resulting
interpretations, still in terms of log-ratio changes.
The objective of this paper is to extend Morais, Thomas-Agnan, and Simioni (2018b) to the
Y-compositional and the X-compositional models and to allow inclusion of the total variable
in the models. In Section 2, we introduce notations and define the different specifications
of the considered models. In Section 3, we demonstrate the equations linking elasticities or
semi-elasticities (depending on the considered model) with simplicial derivatives. Section 4
establishes the formulas for the elasticities and semi-elasticities in terms of model parameters in
the simplex as well as in coordinate space. Finally, Section 5 provides examples of applications
to the X-compositional and to the Y-compositional models and Section 6 concludes.
2. Compositional model specifications
Using the notation $x'$ for the transpose of a vector $x$, let us denote by
$\check{X} = (\check{X}_1, \cdots, \check{X}_{D_X})' \in \mathbb{R}_+^{D_X}$
a vector of $D_X$ positive components corresponding to the components of a compositional
vector expressed in original units: we call these components volumes as opposed to shares. For
example, in the case studied in Morais, Thomas-Agnan, and Simioni (2017), the volumes are
numbers of cars sold during a given month by the different brands of cars whereas the shares
represent the corresponding proportion of cars sold during that month by each brand relative
to the other brands in the study. The closure of the vector $\check{X}$ of volumes is the corresponding
vector of shares
$$X = \mathcal{C}(\check{X}_1, \cdots, \check{X}_{D_X})' = \left( \frac{\check{X}_1}{\sum_{i=1}^{D_X} \check{X}_i}, \cdots, \frac{\check{X}_{D_X}}{\sum_{i=1}^{D_X} \check{X}_i} \right)' = (X_1, \cdots, X_{D_X})'$$
and belongs to the simplex space $\mathcal{S}^{D_X}$ of positive vectors in $\mathbb{R}^{D_X}$ with sum equal to 1.
In some cases, it may be relevant to include in the regression model a variable measuring a
total (hence not scale-invariant) which may be T(X) or T(Y). Pawlowsky-Glahn, Egozcue,
and Lovell (2015a) argue that different formulas can be used for this total, for example one
of the following two:
Arithmetic total: $T_A(\check{X}) = \sum_{i=1}^{D} \check{X}_i$
Geometric total: $T_G(\check{X}) = \left( \prod_{i=1}^{D} \check{X}_i \right)^{1/D}$
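As an illustration of the closure operation and of the two totals defined above, here is a minimal sketch in plain Python. The function names and the volume values are hypothetical and chosen only for this example; they do not come from the data sets used in this paper.

```python
import math

def closure(volumes):
    """Rescale positive volumes so that they sum to 1 (vector of shares)."""
    total = sum(volumes)
    return [v / total for v in volumes]

def arithmetic_total(volumes):
    """Arithmetic total T_A: sum of the volumes."""
    return sum(volumes)

def geometric_total(volumes):
    """Geometric total T_G: geometric mean of the volumes."""
    D = len(volumes)
    return math.exp(sum(math.log(v) for v in volumes) / D)

# Hypothetical volumes for a 3-part composition (illustrative values only).
volumes = [120.0, 60.0, 20.0]
shares = closure(volumes)
t_a = arithmetic_total(volumes)
t_g = geometric_total(volumes)
```

The shares are scale invariant (multiplying all volumes by a constant leaves them unchanged), whereas both totals scale with the volumes, which is why a total carries information that the composition alone does not.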
The general principle of simplicial regression is to use transformations to transport the
simplex space $\mathcal{S}^D$, equipped with the Aitchison geometry, into the Euclidean space $\mathbb{R}^{D-1}$, thus
eliminating the simplex constraints problem. It is generally agreed upon to use log-ratio
orthonormal coordinates (Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado (2015b)). We
recall that to each $D \times (D-1)$ contrast matrix $V$, constructed from an orthonormal
basis of $\mathcal{S}^D$, corresponds an isometric transformation traditionally called $\mathrm{ilr}_V$. As advocated
recently by Martín-Fernández (2019), we will rather use the name olr (orthogonal log ratio)
for these transformations. We then have $z^* = \mathrm{olr}_V(z) = V' \log(z)$, where the natural
logarithm (denoted by log) is understood componentwise, and the inverse transformation is
$\mathrm{olr}_V^{-1}(z^*) = \mathcal{C}(\exp(V z^*))$. Note that olr coordinates take the same value regardless of whether
shares or volumes are used. However, the inverse transformation always returns shares.
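The olr transformation and its inverse can be sketched as follows. This is only an illustrative sketch: the contrast matrix is built here with a Helmert-type construction, which is one possible orthonormal choice and an assumption of this example (the paper's own computations rely on the R package compositions), and all names and values are hypothetical.

```python
import math

def helmert_contrast(D):
    """One possible D x (D-1) contrast matrix: orthonormal columns summing to zero."""
    V = [[0.0] * (D - 1) for _ in range(D)]
    for j in range(1, D):
        norm = math.sqrt(j * (j + 1))
        for i in range(j):
            V[i][j - 1] = 1.0 / norm
        V[j][j - 1] = -j / norm
    return V

def closure(v):
    s = sum(v)
    return [x / s for x in v]

def olr(z, V):
    """olr coordinates z* = V' log(z)."""
    D = len(z)
    logz = [math.log(x) for x in z]
    return [sum(V[i][j] * logz[i] for i in range(D)) for j in range(D - 1)]

def olr_inv(zstar, V):
    """Inverse olr: closure of exp(V z*), always a vector of shares."""
    D = len(V)
    raw = [math.exp(sum(V[i][j] * zstar[j] for j in range(D - 1))) for i in range(D)]
    return closure(raw)

V = helmert_contrast(3)
volumes = [120.0, 60.0, 20.0]          # hypothetical volumes
shares = closure(volumes)
coords_from_volumes = olr(volumes, V)
coords_from_shares = olr(shares, V)    # same coordinates as for the volumes
back = olr_inv(coords_from_volumes, V) # returns the shares, not the volumes
```

Because every column of the contrast matrix sums to zero, the scale factor separating volumes from shares cancels in the coordinates, which is the invariance noted above.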
Using the traditional notations for the simplex operations (see Pawlowsky-Glahn, Egozcue,
and Tolosana-Delgado (2015b)) and denoting by $\langle \cdot, \cdot \rangle_A$ the Aitchison scalar product, the first
row of Table 1 presents the formulation of the regression models explaining a collection of
$n$ i.i.d. random variables (simplex valued or not) by corresponding explanatory variables
which may be simplex valued or not. The observations are indexed by $t$, $t = 1, \cdots, n$. Because
marginal effects only involve one explanatory variable at a time, if we had a model explaining a simplex
valued variable by both types of explanatory variables, we would use the first and last columns
of this table. The second row of Table 1 presents the corresponding model formulations in
coordinate space for a given choice of olr transformation $\mathrm{olr}_V$. Parameters $a^*$, $b^*$ or $B^*$ are
then estimated by maximum likelihood in coordinate space where the regression is classical.
Formulas to compute the corresponding parameter estimates in the simplex $a$, $b$ or $B$ are
available and it is known that these estimated parameters in the simplex are independent of
the particular choice of $\mathrm{olr}_V$, i.e. of the particular choice of contrast matrix. In Table 1 the
different formulations may involve a total variable $T(X)$ or $T(Y)$ and it is printed in grey
to indicate that it is an option. Finally, we included in the formulations the particular case
of the MCI model obtained when $D_X = D_Y$ and the matrix $B$ is a multiple of the identity
resulting in $B \boxdot X = b \odot X$.
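Table 1: Specifications of the compositional models and notations (optional total terms, printed in grey in the original layout, are enclosed in square brackets)

Simplex space $\mathcal{S}^D$
Y-compositional model: $Y_t = a \oplus \check{X}_t \odot b \oplus \varepsilon_t \; [\oplus\, T(\check{Y})_t \odot c]$
X-compositional model: $\check{Y}_t = a + \langle b, X_t \rangle_A + \varepsilon_t \; [+\, d\, T(\check{X})_t]$
YX-compositional model, ‘CODA’ model: $Y_t = a \oplus B \boxdot X_t \oplus \varepsilon_t \; [\oplus\, T(\check{X})_t \odot d \oplus T(\check{Y})_t \odot c]$
YX-compositional model, ‘MCI’ model: $Y_t = a \oplus b \odot X_t \oplus \varepsilon_t \; [\oplus\, T(\check{X})_t \odot d \oplus T(\check{Y})_t \odot c]$

Coordinate space $\mathbb{R}^{D-1}$
Y-compositional model: $Y^*_t = a^* + b^* \check{X}_t + \varepsilon^*_t \; [+\, c^* T(\check{Y})_t]$
X-compositional model: $\check{Y}_t = a + \sum_{k=1}^{D_X - 1} b^*_k X^*_{t,k} + \varepsilon_t \; [+\, d\, T(\check{X})_t]$
YX-compositional model, ‘CODA’ model: $Y^*_t = a^* + B^* X^*_t + \varepsilon^*_t \; [+\, d^* T(\check{X})_t + c^* T(\check{Y})_t]$
YX-compositional model, ‘MCI’ model: $Y^*_t = a^* + b X^*_t + \varepsilon^*_t \; [+\, d^* T(\check{X})_t + c^* T(\check{Y})_t]$

Notations
Y-compositional model: $Y_t, a, b, \varepsilon_t \in \mathcal{S}^{D_Y}$, $\check{X}_t \in \mathbb{R}$; $Y^*_t, a^*, b^*, \varepsilon^*_t \in \mathbb{R}^{D_Y - 1}$
X-compositional model: $X_t, b \in \mathcal{S}^{D_X}$; $\check{Y}_t, \check{X}_t, a, \varepsilon_t \in \mathbb{R}$; $X^*_t, b^* \in \mathbb{R}^{D_X - 1}$
YX-compositional model: $B \in \mathbb{R}^{D_Y, D_X}$, $b \in \mathbb{R}$; $B^* \in \mathbb{R}^{D_Y - 1, D_X - 1}$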
3. Elasticities, semi-elasticities and simplicial partial derivatives
A marginal impact in a linear regression model is usually understood as the change in the
expected value of the dependent variable $Y$ induced by an additive increase of the explanatory
variable of interest $X$. In nonlinear models, it is rather understood as the infinitesimal equivalent,
i.e. the derivative of the expected value of $Y$ with respect to $X$, and it may be non constant
throughout the range of $X$. In some nonlinear models, an elasticity or a semi-elasticity may
be more natural. Indeed, in a log-log model, if $E(\log(Y))$ depends linearly on $\log(X)$, then the
parameter of $\log(X)$ is equal to the derivative of $E(\log(Y))$ with respect to $\log(X)$ (also called
the logarithmic derivative) and can be interpreted as the percent increase of $E(Y)$ induced by
a one percent increase of $X$: this quantity is called the elasticity of $Y$ with respect to $X$. Finally,
if the model is a semi-log model, the natural quantity is either the partial derivative of $E(Y)$
with respect to $\log(X)$ (if the logarithm is on the right hand side of the regression equation)
or symmetrically the partial derivative of $E(\log(Y))$ with respect to $X$ in the other case (if
the logarithm is on the left hand side of the regression equation): in both cases it is called
a semi-elasticity. Because log-log models and semi-log models are frequent in econometrics,
elasticities and semi-elasticities are often used to measure the impact of covariates.
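The interpretation of an elasticity as a percent change can be checked numerically. The sketch below assumes a hypothetical deterministic log-log relationship, with a scale of 2 and a slope of 0.5 chosen only for illustration, and compares the logarithmic derivative with the percent change induced by a one percent increase of the explanatory variable.

```python
import math

def expected_y(x, scale=2.0, beta=0.5):
    """Hypothetical log-log relationship: log E(Y) = log(scale) + beta * log(x)."""
    return scale * x ** beta

def log_derivative(f, x, h=1e-6):
    """Central finite difference of log f with respect to log x (the elasticity)."""
    up = math.log(f(x * math.exp(h)))
    down = math.log(f(x * math.exp(-h)))
    return (up - down) / (2 * h)

elasticity = log_derivative(expected_y, 10.0)
# Percent change of E(Y) when x increases by one percent.
pct_change = 100 * (expected_y(10.0 * 1.01) / expected_y(10.0) - 1)
```

The logarithmic derivative recovers the slope parameter, and the percent change after a one percent increase is close to it, the small gap being the usual first order approximation error.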
This supports the idea that, in a simplicial regression model, one should turn attention to
simplicial derivatives to evaluate the impacts of explanatory variables. Adapting the definition
of derivative to the case where a function is simplex valued or is defined on the simplex stems
from the fact that a change in a vector of shares cannot be just reduced to a change in one
of the components since they are linked by their sum constraint: in other words, it is due to
the fact that one of the variables lies in a subspace of $\mathbb{R}^D$.
More precisely, the quantities of interest are
$\frac{\partial^{\oplus} E^{\oplus} Y}{\partial X}$ in the case of the Y-compositional model,
$\frac{\partial EY}{\partial^{\oplus} X}$ in the case of the X-compositional model,
$\frac{\partial^{\oplus} E^{\oplus} Y}{\partial^{\oplus} X}$ in the case of the YX-compositional model,
where $E^{\oplus}$ denotes the expectation of a simplex valued random variable (see
Pawlowsky-Glahn and Buccianti (2011)) and where the symbol $\partial^{\oplus}$ indicates that the derivative should
be understood in the simplicial derivative sense with respect to that variable (see
Barceló-Vidal, Martín-Fernández, and Mateu-Figueras (2011) and Egozcue, Jarauta-Bragulat, and
Díaz-Barrero (2011)).
For the Y-compositional and X-compositional models, we are first going to express the relevant
simplicial derivatives in terms of semi-elasticities.
Indeed, for the case of the X-compositional model, let us consider a homogeneous function of
degree zero $\check{f}$ defined from $\mathbb{R}_+^D$ to $\mathbb{R}$ inducing a function $f$ on $\mathcal{S}^D$ by $f(x) = f(\mathcal{C}(\check{x})) = \check{f}(\check{x})$.
Propositions (13.10) and (13.13) in Barceló-Vidal, Martín-Fernández, and Mateu-Figueras
(2011) imply that the part-C derivatives of $f$, which we denote here by $\frac{\partial f(x)}{\partial^{\oplus} x}$, are given by:
$$\frac{\partial f(x)}{\partial^{\oplus} x} = \frac{\partial \check{f}(\check{x})}{\partial \log(\check{x})}$$
Therefore the derivative of a function $f$ of a simplex valued variable $x = \mathcal{C}(\check{x})$ corresponds to
the ordinary semi-log derivative of the corresponding homogeneous function $\check{f}$ of the volumes
$\check{x}$. Applying this result to the function expressing $EY$ as a function of the share vector $X$, we
obtain the link between the simplicial derivative of this function and the semi-elasticity (or
semi-log derivative) in the classical sense of the corresponding function of the volume vector
$\check{X}$.
Similarly, for the case of the Y-compositional model, for a simplex-valued function $f$ of a
real variable $x \in \mathbb{R}$, Theorem 12.2.6 in Egozcue, Jarauta-Bragulat, and
Díaz-Barrero (2011) implies that:
$$\frac{\partial^{\oplus} f(x)}{\partial x} = \mathcal{C}\left( \exp\left( \frac{\partial \log f(x)}{\partial x} \right) \right)',$$
where $\partial^{\oplus} f$ denotes the simplicial derivative of $f$ at $x$. This result links the simplicial derivatives
of a simplex-valued function $f$ to the semi-log derivatives (in the ordinary sense) of this
function. Applying this result to the function expressing $E^{\oplus} Y$ as a function of $X$, we obtain
the link between the simplicial derivative of this function and the semi-elasticity (or semi-log
derivative) in the classical sense of $E^{\oplus} Y$ as a function of $X$.
For the YX-compositional model, Morais, Thomas-Agnan, and Simioni (2018a) linked
simplicial derivatives to elasticities in the case of a model without a total and in the particular
case where the number of components $D_Y$ of the Y composition is the same as that of the X
composition ($D_X$). The limitation $D_Y = D_X$ in Morais, Thomas-Agnan, and Simioni (2018a)
was simply due to the particular application framework of this work but there is no additional
mathematical difficulty to extend the result to $D_Y \neq D_X$. The corresponding formulas are
recalled in Table 2 for completeness.
Finally, considering models including a total, one would need to define infinitesimal paths
in the T-space. Instead we consider three types of infinitesimal variations as described in
Section 4.3.
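Table 2: Simplicial derivative and (semi-)elasticities
Y-compositional model: $\frac{\partial^{\oplus} E^{\oplus} Y}{\partial X} = \mathcal{C}\left( \exp \frac{\partial \log E^{\oplus} Y}{\partial X} \right)'$
X-compositional model: $\frac{\partial EY}{\partial^{\oplus} X} = \frac{\partial EY}{\partial \log \check{X}}$
YX-compositional model: $\frac{\partial^{\oplus} E^{\oplus} Y}{\partial^{\oplus} X} = \mathcal{C}\left( \exp \frac{\partial \log E^{\oplus} Y}{\partial \log \check{X}} \right)'$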
For upcoming interpretations, it is interesting to consider first order Taylor approximations
of such functions (of a simplex variable or simplex valued). For a function $f$ from $\mathcal{S}^D$ to $\mathbb{R}$,
consider as in Barceló-Vidal, Martín-Fernández, and Mateu-Figueras (2011) the generating
system $u_1, \cdots, u_D$ of $\mathcal{S}^D$ defined by
uj=D1
Dµj
µj= (1,· · · ,1,exp(1),1,· · · ,1),
where $\exp(1)$ is at the $j$th position. From Barceló-Vidal, Martín-Fernández, and
Mateu-Figueras (2011), the first order Taylor approximation is given by
$$f(x \oplus \delta \odot u_j) \approx f(x) + \delta \frac{\partial \check{f}(\check{x})}{\partial \log(\check{x}_j)}. \qquad (1)$$
This additive (in the simplex sense) increase of $\delta \odot u_j$ corresponds to a multiplicative increase
of the $j$th component while holding constant all other ratios of remaining components. It
is also equivalent in coordinate space, for a proper choice of olr transformation, to increasing
additively one olr component while keeping all others constant. To summarize, note that the
increment is given by the product of $\delta$ by the classical semi-elasticity, i.e., a semi-log derivative
in the ordinary sense of the corresponding function of the volumes. As we will see in Section
5, $\delta$ is proportional to the rate of change of $x$.
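For a function $f$ from $\mathbb{R}$ to $\mathcal{S}^D$, Egozcue, Jarauta-Bragulat, and Díaz-Barrero (2011) obtain
the following first order Taylor approximation for a small additive increase $\delta > 0$ of $x \in \mathbb{R}$:
$$f(x + \delta) \approx f(x) \oplus \delta \odot \frac{\partial^{\oplus} f(x)}{\partial x}$$
As in Morais (2017), let us go one step further in the approximation. Indeed,
$$f(x) \oplus \delta \odot \frac{\partial^{\oplus} f(x)}{\partial x} = f(x) \oplus \exp\left( \delta \frac{\partial \log f(x)}{\partial x} \right).$$
Combining with a first order approximation of the exponential in a neighborhood of zero,
$\exp\left( \delta \frac{\partial \log f(x)}{\partial x} \right) \approx 1 + \delta \frac{\partial \log f(x)}{\partial x}$, we get the following approximation for the $m$th component of
$f(x + \delta)$:
$$f_m(x + \delta) \approx f_m(x) \left( 1 + \delta \frac{\partial \log f_m(x)}{\partial x} \right). \qquad (2)$$
Taking the derivative of $\sum_{m=1}^{D} f_m(x) = 1$, we get $\sum_{m=1}^{D} f_m(x) \frac{\partial \log f_m(x)}{\partial x} = 0$. Therefore the
RHS vector in equation (2) belongs to $\mathcal{S}^D$. To summarize, note that in this case the percent
increase of each component of $f(x)$ is given by the classical semi-elasticity, i.e., a semi-log
derivative in the ordinary sense of the function.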
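Approximation (2) and the fact that its right hand side stays in the simplex can be illustrated on a toy simplex-valued function. The sketch below assumes the hypothetical form f(x) = C(exp(a + b x)) with arbitrary illustrative values for a and b; for this form the semi-log derivative of the m-th share is b_m minus the share-weighted mean of b.

```python
import math

def shares(x, a, b):
    """Hypothetical simplex-valued function f(x) = C(exp(a + b x))."""
    raw = [math.exp(am + bm * x) for am, bm in zip(a, b)]
    total = sum(raw)
    return [r / total for r in raw]

def semi_log_derivatives(x, a, b):
    """d log f_m / dx = b_m - sum_k f_k(x) b_k for the function above."""
    f = shares(x, a, b)
    weighted = sum(fm * bm for fm, bm in zip(f, b))
    return [bm - weighted for bm in b]

def taylor_shares(x, delta, a, b):
    """Right hand side of approximation (2): f_m(x) (1 + delta * semi-elasticity)."""
    f = shares(x, a, b)
    se = semi_log_derivatives(x, a, b)
    return [fm * (1.0 + delta * sm) for fm, sm in zip(f, se)]

a = [0.2, -0.1, 0.0]   # illustrative values only
b = [0.5, -0.3, 0.1]
x0, delta = 1.0, 0.01
approx = taylor_shares(x0, delta, a, b)
exact = shares(x0 + delta, a, b)
max_error = max(abs(p - q) for p, q in zip(approx, exact))
```

The approximated shares sum to one because the share-weighted semi-elasticities sum to zero, and the gap with the exact shares is of second order in the increment.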
Finally, for a function $f$ from $\mathcal{S}^{D_X}$ to $\mathcal{S}^{D_Y}$, a similar approximation has been obtained in Morais
(2017) for the particular case $D_X = D_Y$. Combining the above two results, we easily obtain
that the Taylor approximation of a function $f$ from $\mathcal{S}^{D_X}$ to $\mathcal{S}^{D_Y}$ is given by
$$f_m(x \oplus \delta \odot u_j) \approx f_m(x) \left( 1 + \delta \frac{\partial \log \check{f}_m(\check{x})}{\partial \log \check{x}_j} \right),$$
showing that a percent increase of the components of $x$, proportional to $\delta$, induces a percent
increase of each component of $f(x)$ given by the classical elasticity of the corresponding
component, $\frac{\partial \log \check{f}_m(\check{x})}{\partial \log \check{x}_j}$.
4. Elasticities and semi-elasticities in terms of model parameters
The aim is now to relate the elasticities/semi-elasticities of the previous section to the model
parameters. The results of this section will be based on the following two lemmas which
establish the formulas for the semi-log derivatives of an olr transformation and its inverse.
Lemma 4.1 If $z$ is a $D$-composition which is the closure of the vector $\check{z}$ of $\mathbb{R}_+^D$, and if
$z^* = \mathrm{olr}_V(z) = V' \log(z)$ is the olr-transformed vector associated to the contrast matrix $V$, then
$$\frac{\partial\, \mathrm{olr}_V(z)}{\partial \log \check{z}} = V'$$
This first lemma just results from the definition of the olr which is linear with respect to $\log \check{z}$,
and could be used for any other log-ratio linear transformation.
Lemma 4.2 If $z$ is a $D$-composition which is the closure of the vector $\check{z}$ of $\mathbb{R}_+^D$, and if
$z^* = \mathrm{olr}_V(z) = V' \log(z)$ is the olr-transformed vector associated to the contrast matrix $V$, then
$$\frac{\partial \log(\mathrm{olr}_V^{-1}(z^*))}{\partial z^*} = W_z V,$$
where $z = \mathrm{olr}_V^{-1}(z^*)$ and where $W_z = I_D - 1_D z'$ with $I_D$ the $D \times D$ identity matrix and
$1_D$ the $D \times 1$ vector of ones.
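Let $v_{ij}$ ($i = 1, \cdots, D$ and $j = 1, \cdots, D-1$) be the general term of the matrix $V$. To prove
Lemma 4.2, using the formula for the inverse transformation of an olr, one representative of
$\log \check{z} = \log(\mathrm{olr}_V^{-1}(z^*)) = \log \mathcal{C}(\exp(V z^*))$ is given by $V z^*$ and therefore its derivative with
respect to $z^*$ is $V$. Since $\log(z) = \log(\check{z}) - \log(S) 1_D$, where $S = T_A(\check{z}) = \sum_{i=1}^{D} \check{z}_i$, and since
$$\frac{\partial S}{\partial z^*_j} = \sum_{k=1}^{D} \frac{\partial \log(\check{z}_k)}{\partial z^*_j} \check{z}_k = \sum_{k=1}^{D} v_{kj} \check{z}_k,$$
we have
$$\frac{\partial \log(S)}{\partial z^*_j} = \frac{1}{S} \frac{\partial S}{\partial z^*_j} = \sum_{k=1}^{D} v_{kj} z_k.$$
Combining first and second terms yields, for $i = 1, \cdots, D$ and $j = 1, \cdots, D-1$,
$$\frac{\partial \log(z_i)}{\partial z^*_j} = v_{ij} - \sum_{k=1}^{D} v_{kj} z_k,$$
and this is the general term of the matrix $W_z V$.
If we define $W^*_z = W_z V$, note that $W^*_z V' = W_z$ (this will be used later on).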
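Lemma 4.2 can be checked numerically by comparing the matrix $W_z V$ with a finite-difference Jacobian of the logarithm of the inverse olr transformation. The following sketch is an illustration only: the Helmert-type contrast matrix, the dimension and the coordinate values are assumptions made for this example.

```python
import math

def helmert_contrast(D):
    """One possible D x (D-1) contrast matrix: orthonormal columns summing to zero."""
    V = [[0.0] * (D - 1) for _ in range(D)]
    for j in range(1, D):
        norm = math.sqrt(j * (j + 1))
        for i in range(j):
            V[i][j - 1] = 1.0 / norm
        V[j][j - 1] = -j / norm
    return V

def olr_inv(zstar, V):
    """Inverse olr: closure of exp(V z*)."""
    raw = [math.exp(sum(v * c for v, c in zip(row, zstar))) for row in V]
    total = sum(raw)
    return [r / total for r in raw]

def analytic_jacobian(zstar, V):
    """W_z V with W_z = I_D - 1_D z': entries v_ij - sum_k v_kj z_k."""
    z = olr_inv(zstar, V)
    D = len(V)
    weighted = [sum(V[k][j] * z[k] for k in range(D)) for j in range(D - 1)]
    return [[V[i][j] - weighted[j] for j in range(D - 1)] for i in range(D)]

def numeric_jacobian(zstar, V, h=1e-6):
    """Central finite differences of log(olr_inv(z*)) with respect to each z*_j."""
    D = len(V)
    J = [[0.0] * (D - 1) for _ in range(D)]
    for j in range(D - 1):
        up = list(zstar)
        down = list(zstar)
        up[j] += h
        down[j] -= h
        z_up = olr_inv(up, V)
        z_down = olr_inv(down, V)
        for i in range(D):
            J[i][j] = (math.log(z_up[i]) - math.log(z_down[i])) / (2 * h)
    return J

V = helmert_contrast(4)
zstar = [0.3, -0.5, 0.8]   # arbitrary coordinates, for illustration
A = analytic_jacobian(zstar, V)
N = numeric_jacobian(zstar, V)
max_gap = max(abs(A[i][j] - N[i][j]) for i in range(4) for j in range(3))
```

The two Jacobians agree up to finite-difference error, which is what the lemma states.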
4.1. Semi-elasticities for Y-compositional models and X-compositional models
In the case of Y-compositional and X-compositional models, the natural tool is semi-elasticities.
However the formulas differ in the two cases:
X-compositional case: $se(Y, \check{X}) = \frac{\partial EY}{\partial \log \check{X}}$
Y-compositional case: $se(Y, \check{X}) = \frac{\partial \log E^{\oplus} Y}{\partial X}$
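Let us denote by $V_X$, respectively $V_Y$, the contrast matrices used for X, respectively Y.
The computation in the X-compositional case uses Lemma 4.1. Indeed, for $j = 1, \cdots, D_X$,
$$\frac{\partial EY}{\partial \log \check{X}_j} = \sum_{k=1}^{D_X - 1} \frac{\partial EY}{\partial X^*_k} \frac{\partial X^*_k}{\partial \log \check{X}_j} = \sum_{k=1}^{D_X - 1} b^*_k v^X_{jk}$$
The result is reported in Table 3 with a matrix formulation
$$\frac{\partial E(\check{Y})}{\partial \log \check{X}} = V_X b^* = V_X V_X' \log b = \mathrm{clr}(b). \qquad (3)$$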
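The computation in the Y-compositional case uses Lemma 4.2 since $E^{\oplus} Y = \mathrm{olr}_V^{-1}(EY^*)$. We
have
$$\frac{\partial \log E^{\oplus} Y}{\partial \check{X}} = \frac{\partial \log E^{\oplus} Y}{\partial EY^*} \frac{\partial EY^*}{\partial \check{X}} = W^*_z b^* = W^*_z V_Y' \log b = W_z V_Y V_Y' \log b = W_z\, \mathrm{clr}(b) \qquad (4)$$
where $z = \mathrm{olr}_V^{-1}(E(\mathrm{olr}_V Y)) = E^{\oplus} Y$.
Expressions (3) and (4) underline the fact that the semi-elasticities are independent of the
particular contrast matrix. They are observation dependent in the Y-compositional case
through $z = E^{\oplus} Y$.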
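Expressions (3) and (4) can be evaluated directly from simplex parameters, without any contrast matrix. The sketch below uses hypothetical parameter vectors and expected shares chosen only for illustration: it computes clr(b) for the X-compositional case and W_z clr(b) for the Y-compositional case, where applying W_z amounts to subtracting the z-weighted mean.

```python
import math

def clr(b):
    """Centered log-ratio: log(b) minus the mean of log(b)."""
    logs = [math.log(v) for v in b]
    mean = sum(logs) / len(logs)
    return [l - mean for l in logs]

def apply_w(z, vec):
    """Apply W_z = I_D - 1_D z' to a vector: subtract its z-weighted mean."""
    weighted = sum(zk * vk for zk, vk in zip(z, vec))
    return [v - weighted for v in vec]

# X-compositional case, expression (3): semi-elasticities equal clr(b).
b_x = [0.5, 0.3, 0.2]                 # hypothetical simplex parameter
se_x = clr(b_x)

# Y-compositional case, expression (4): semi-elasticities equal W_z clr(b).
b_y = [0.2, 0.5, 0.3]                 # hypothetical simplex parameter
expected_shares = [0.6, 0.3, 0.1]     # hypothetical value of z, the expected shares
se_y = apply_w(expected_shares, clr(b_y))
```

In the X-compositional case the semi-elasticities sum to zero across components; in the Y-compositional case it is their share-weighted sum that vanishes, consistent with the shares continuing to sum to one.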
4.2. Elasticities for the YX-compositional model
For the YX-compositional model, Morais, Thomas-Agnan, and Simioni (2018a) have obtained
the expressions of the elasticities when the dimension of the Y composition is the same as
that of the X composition. Let us extend this result to the case $D_X \neq D_Y$ using the above
two lemmas.
We can see the relationship between $\log \check{X}$ and $\log E^{\oplus} Y$ as the composition of three functions
(listed from inside to outside):
the function which maps $\log \check{X} \in \mathbb{R}_+^{D_X}$ to $X \in \mathcal{S}^{D_X}$
the function which maps $X \in \mathcal{S}^{D_X}$ to $E^{\oplus} Y \in \mathbb{R}_+^{D_Y}$
the function which maps $E^{\oplus} Y \in \mathbb{R}_+^{D_Y}$ to $\log E^{\oplus} Y \in \mathcal{S}^{D_Y}$
Using the generalized chain rule for functions of several variables, which states that the
Jacobian matrix of the composite function is the product of the Jacobian matrices of the composed
functions evaluated at appropriate points, we get
$$\frac{\partial \log E^{\oplus} Y}{\partial \log \check{X}} = \frac{\partial \log E^{\oplus} Y}{\partial EY^*} \frac{\partial EY^*}{\partial X^*} \frac{\partial X^*}{\partial \log \check{X}} \qquad (5)$$
The rightmost term on the right hand side of (5) is obtained using Lemma 4.1:
$$\frac{\partial X^*}{\partial \log \check{X}} = V_X'.$$
The central term yields the matrix $B^*$ of parameters in coordinate space since the relationship
between $EY^*$ and $X^*$ is linear. The leftmost term on the right hand side is obtained using
Lemma 4.2: $\frac{\partial \log E^{\oplus} Y}{\partial EY^*} = W_z V_Y$,
where $z = E^{\oplus} Y$. We finally get
$$\frac{\partial \log E^{\oplus} Y}{\partial \log \check{X}} = W_z V_Y B^* V_X' = W_z V_Y V_Y' B = W_z B, \qquad (6)$$
using the relationships between $B$ and $B^*$ (see e.g. Nguyen, Laurent, Thomas-Agnan, and
Ruiz-Gazen (2018)), and using the fact that the matrix $B$ satisfies the zero-sum property
(the sums of its rows and the sums of its columns are all equal to 0).
Note that the elasticity is observation dependent through $z = E^{\oplus} Y$.
For the MCI model, we have that $B = b V_Y V_Y'$, and therefore
$$W_{E^{\oplus} Y} B = b W_{E^{\oplus} Y} V_Y V_Y' = b W_{E^{\oplus} Y}$$
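For the MCI model, the elasticity matrix $b W_{E^{\oplus} Y}$ has a simple entrywise form: the elasticity of share m with respect to volume j equals b times (1 if m = j, else 0) minus b times the expected share of component j. The sketch below evaluates it for hypothetical values of b and of the expected shares, which are assumptions made only for this illustration.

```python
def mci_elasticities(b, expected_shares):
    """Elasticity matrix b * W_z with W_z = I_D - 1_D z' (rows: shares of Y, columns: volumes of X)."""
    D = len(expected_shares)
    return [
        [b * ((1.0 if m == j else 0.0) - expected_shares[j]) for j in range(D)]
        for m in range(D)
    ]

b = 0.8                               # hypothetical MCI parameter
expected_shares = [0.5, 0.3, 0.2]     # hypothetical expected shares z
E = mci_elasticities(b, expected_shares)
direct_elasticity = E[0][0]   # effect of volume 1 on share 1
cross_elasticity = E[1][0]    # effect of volume 1 on share 2
```

With these values the direct elasticity is positive and the cross elasticity negative, and each row of the matrix sums to zero, reflecting that multiplying all volumes by a common factor leaves the shares unchanged.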
Table 3 summarizes the different formulas for semi-elasticities and elasticities for the three
types of models as a function of parameter estimates, in the simplex or in coordinate space.
Both expressions are important to keep in mind: the expression as a function of the simplex
parameters makes it clear that these are intrinsically simplex quantities independent of any
transformation. The expression as a function of coordinate space parameters is handy for
computations.
Table 3: (Semi-)elasticities without total
Y-compositional model: $\frac{\partial \log E^{\oplus} Y}{\partial X} = W^*_{E^{\oplus} Y} b^* = W_{E^{\oplus} Y}\, \mathrm{clr}(b)$
X-compositional model: $\frac{\partial E(\check{Y})}{\partial \log \check{X}} = V_X b^* = \mathrm{clr}(b)$
YX-compositional model, ‘CODA’ model: $\frac{\partial \log E^{\oplus} Y}{\partial \log \check{X}} = W^*_{E^{\oplus} Y} B^* V_X' = W_{E^{\oplus} Y} B$
YX-compositional model, ‘MCI’ model: $\frac{\partial \log E^{\oplus} Y}{\partial \log \check{X}} = b W_{E^{\oplus} Y}$
Notations: $W_z = I_D - 1_D z'$, $W^*_z = W_z V_Y$
4.3. Models including a total
The presence of the total variable has to be taken into account in the partial impact measure
computations. We consider including among the explanatory variables
a total of Y in the Y-compositional model (model A)
a total of X in the X-compositional model (model B)
a total of X and/or a total of Y in the YX-compositional model (model C)
The right hand sides of the model equations from Table 1 are modified as follows
model A: add $\oplus\, T(Y_t) \odot c$, where $c \in \mathcal{S}^{D_Y}$ is the parameter corresponding to the total
effect of Y
model B: add $+\, d\, T(X_t)$, where $d \in \mathbb{R}$ is the parameter corresponding to the total effect
of X
model C: add $\oplus\, T(Y_t) \odot c \oplus T(X_t) \odot d$, where $c \in \mathcal{S}^{D_Y}$ and $d \in \mathcal{S}^{D_Y}$ are the parameters
corresponding to the two total effects.
In the presence of a total, as mentioned in Section 3, we need to distinguish three types of
infinitesimal variations for a compositional variable X. The three types are as follows
Type 1: the total T(X) remains constant and we look at infinitesimal variations of the
composition X. Such variations correspond to considering derivatives in the direction of
one of the unitary vectors of an orthonormal basis of SDX. With a proper choice of basis
and of contrast matrix as in Hron, Filzmoser, and Thompson (2012), this corresponds
to an infinitesimal change in one component, along a linear path in the simplex, keeping
all but the first olr coordinate constant.
Type 2: the composition X remains constant while the total is subject to an infinitesimal
variation. Such variations correspond to considering ordinary derivatives with respect
to the total T(X).
Type 3: one of the components of X varies together with the total T(X).
In model A, the impact of additively increasing the total of Y is the same question as the
impact of a non-compositional variable and therefore the formula of Table 3 can be applied
with $c$ instead of $b$.
Type 1 variations of X. In model B, the impact of a type 1 variation of X with fixed total
can be computed as in the X-compositional model in Table 3. The impact of a type 1 variation
of X in model C can be computed as in the YX-compositional model in Table 3.
Type 2 variations of X. Type 2 variations of X correspond to ordinary derivatives with
respect to the total of X.
For model B, a type 2 variation of X results in an ordinary derivative
$$\frac{\partial EY}{\partial T} = d$$
In model C, a type 2 variation of X can be computed as in a Y-compositional model treating
the total T(X) as an ordinary variable, and the formula from Table 3 can be applied with $d$ instead
of $b$.
Type 3 variations of X. In this case, evaluating the effect of the variation of X or of T(X) is
equivalent since they are linked together, therefore one of the two formulas is enough. For type
3 variations of X, since both total and composition vary, the easiest way out is to express the
dependent variable as a function of the volumes and use ordinary derivatives of the ensuing function
of the volumes.
In model B, for computing the effect of a type 3 variation of X, we need to adapt equation (3)
by adding an extra term taking into account the fact that the total depends upon the volumes,
and we get
$$\frac{\partial EY}{\partial \log \check{X}} = V_X b^* + \frac{\partial EY}{\partial T} \frac{\partial T}{\partial \log \check{X}} = V_X b^* + d \frac{\partial T}{\partial \log \check{X}}$$
This result shows that the derivatives of the total with respect to the volumes play a role in
the final expression of this semi-elasticity (hence we get a different formula for an arithmetic
or a geometric total).
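To make the difference between the two totals explicit, the derivative of each total with respect to the log-volumes can be worked out from the definitions of Section 2. The short derivation below is a sketch added for illustration, using the notation of equation (3) and writing the totals as functions of $\log \check{X}$; the resulting componentwise expressions for model B follow by substitution.

```latex
% Arithmetic total: T_A = \sum_i \exp(\log \check{X}_i)
\frac{\partial T_A(\check{X})}{\partial \log \check{X}_j} = \check{X}_j
\quad\Longrightarrow\quad
\frac{\partial EY}{\partial \log \check{X}_j} = \mathrm{clr}(b)_j + d\, \check{X}_j

% Geometric total: T_G = \exp\Big( \tfrac{1}{D_X} \sum_i \log \check{X}_i \Big)
\frac{\partial T_G(\check{X})}{\partial \log \check{X}_j} = \frac{T_G(\check{X})}{D_X}
\quad\Longrightarrow\quad
\frac{\partial EY}{\partial \log \check{X}_j} = \mathrm{clr}(b)_j + d\, \frac{T_G(\check{X})}{D_X}
```

With an arithmetic total the extra term depends on which component varies, whereas with a geometric total it is the same for all components.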
In model C, for a type 3 variation of X, the derivative with respect to X of the first term
$B \boxdot X$ is obtained as in the YX-compositional model without total and the derivative
of the second term $T(X) \odot d$ is obtained as in the X-compositional model with a T(X) total
(equation (6)) yielding overall
$$\frac{\partial \log E^{\oplus} Y}{\partial \log \check{X}} = W_{E^{\oplus} Y} \left( B + \log(d) \frac{\partial T}{\partial \log \check{X}} \right)$$
Once again, the result involves the derivatives of the total with respect to the volumes.
No additional complexity is introduced if we consider models involving a total of Y as an
additional dependent variable, which makes sense in the Y-compositional or YX-compositional
cases. In that case consequently, there would be no total of Y among the explanatory
variables. The impact of variations of a classical (resp. compositional) explanatory variable on
the compositional part of Y would be studied as in the Y-compositional model (resp.
YX-compositional) and on the total of Y as in an ordinary linear model (resp. X-compositional
model).
5. Illustration
Let us give two toy examples of interpretation to illustrate our approach. We focus on
the X-compositional and the Y-compositional models since the case of the YX-compositional
model was already illustrated in Morais, Thomas-Agnan, and Simioni (2018a). Both examples
involve time series data and would justify a compositional time series model. However since
the focus is just on illustrating the impacts evaluation, we will ignore the time series aspect
and pretend the observations are i.i.d. The subsequent computations were done using the R
package compositions.
5.1. Economic context and automobile market: Y-compositional model
In Morais and Thomas-Agnan (2020), the impact of the socioeconomic context on
the demand of new cars by segments is investigated with a data set coming from the French
Renault company for market shares and from publicly available data bases. The data coming
from Renault has been blurred with a small noise for confidentiality reasons. The automobile
market is divided into five segments, from the smallest vehicles (A segment) to the largest
vehicles (E segment). The available explanatory variables are consumption expenditure, an
economic sentiment indicator, Gross Fixed Capital Formation of household, Gross Domestic
Product, diesel price and short term interest rate. The data is recorded monthly from 2003
to 2015 (167 observations). The model explaining the market shares of each segment by
the above explanatory is therefore a Y-compositional model in our terminology. We use the
following sequential binary partition: B versus A, C versus A and B, D versus A, B and C, and
E versus A, B, C and D to construct an orthonormal basis of the simplex and an associated
olr transformation. Figure 1displays the observed and predicted segments shares over time
and we can see that the compositional model catches the general tendency and smoothes the
jiggly patterns without suffering from overfitting, but not all the variance of this data which
differs across shares. The quality of fit of this model can be assessed by the multivariate
adjusted coefficient of determination based on the proportion of metric variance explained
by the model and which is equal to 0.86. Table 4contains the average semi-elasticities of
segments shares with respect to GDP.
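The construction of an orthonormal basis from the sequential binary partition above can be sketched as follows. This is a minimal illustration in Python rather than the R code used by the authors. The function names, the sign convention (+1 for the segment named first in each step) and the example share vector are assumptions made for the sketch.

```python
import numpy as np


def sbp_to_contrast_matrix(signs):
    """Turn a sequential binary partition sign matrix into an orthonormal contrast matrix.

    Each row of `signs` holds +1 (first group), -1 (second group) or 0 (part not involved).
    """
    signs = np.asarray(signs, dtype=float)
    contrasts = np.zeros_like(signs)
    for i, row in enumerate(signs):
        r = np.sum(row > 0)  # number of parts in the +1 group
        s = np.sum(row < 0)  # number of parts in the -1 group
        contrasts[i, row > 0] = np.sqrt(s / (r * (r + s)))
        contrasts[i, row < 0] = -np.sqrt(r / (s * (r + s)))
    return contrasts


def olr(shares, contrasts):
    """olr coordinates: the contrast matrix applied to the log of the parts."""
    return contrasts @ np.log(np.asarray(shares, dtype=float))


# Columns are segments A, B, C, D, E; rows follow the partition given in the text.
# The sign convention is an assumption of this sketch.
SBP_SIGNS = [
    [-1, +1, 0, 0, 0],     # B versus A
    [-1, -1, +1, 0, 0],    # C versus A and B
    [-1, -1, -1, +1, 0],   # D versus A, B and C
    [-1, -1, -1, -1, +1],  # E versus A, B, C and D
]
V = sbp_to_contrast_matrix(SBP_SIGNS)

example_shares = np.array([0.10, 0.35, 0.30, 0.15, 0.10])  # hypothetical shares
coords = olr(example_shares, V)
```

Since each row of the contrast matrix sums to zero, the coordinates do not depend on the scale of the input, so shares and volumes give the same olr vector.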
Assuming the fitted model is correct, let us interpret, for example, the effect of a small increase of GDP on the market share of small cars (A segment). From formula (2), a small additive increase δ = 1 billion euros (this amount representing 0.6% of the average monthly GDP) results on average in a multiplicative increase of 0.0028% of the A segment market share. Instead of focusing on average semi-elasticities, we could concentrate on a given point in time and compute the impact on the whole share vector of such a small increase in GDP. We could then check easily that the new shares vector is indeed in the simplex.

[Figure 1: line plot of market shares (vertical axis, 0.0 to 0.4) over time (horizontal axis, 2005 to 2015); legend: Segment (A, B, C, D, E), Serie (Fit, Obs).]
Figure 1: Observed (in dotted line) and predicted (in solid line) segments shares over time

Table 4: Average semi-elasticities of segments shares with respect to GDP
Segment   $se(S_t, GDP_t)$
A          2.88e-05
B         -0.17e-05
C         -0.96e-05
D          0.99e-05
E          1.18e-05
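The interpretation above can be checked numerically with a short sketch. The first part only rescales the values of Table 4. The second part illustrates the point-in-time computation under an assumption: that in a Y-compositional model the semi-elasticity of share j equals the log-coefficient of component j minus the share-weighted average of the log-coefficients, which is our reading of formula (2), not reproduced in this section. The share vector and the log-coefficients `log_b` are hypothetical values chosen for the illustration.

```python
import numpy as np

segments = ["A", "B", "C", "D", "E"]

# Average semi-elasticities of segment shares with respect to GDP (Table 4).
avg_semi_elasticity = np.array([2.88e-05, -0.17e-05, -0.96e-05, 0.99e-05, 1.18e-05])
delta = 1.0  # additive GDP increase, in the unit used for GDP in the text

# First-order relative change of each share, in percent.
pct_change = 100.0 * avg_semi_elasticity * delta


def semi_elasticities(shares, log_b):
    """Semi-elasticities at a given share vector (assumed form, see the lead-in)."""
    return log_b - np.sum(shares * log_b)


def perturbed_shares(shares, log_b, delta):
    """Exact new shares: shift the log shares by delta * log_b, then close to sum 1."""
    raw = shares * np.exp(delta * log_b)
    return raw / raw.sum()


# Hypothetical shares and log-coefficients for one point in time.
shares = np.array([0.10, 0.35, 0.30, 0.15, 0.10])
log_b = np.array([2.0e-05, -1.0e-05, -2.0e-05, 0.5e-05, 0.5e-05])

se = semi_elasticities(shares, log_b)
first_order = shares * (1.0 + delta * se)     # shares updated with semi-elasticities
exact = perturbed_shares(shares, log_b, delta)  # shares updated in the simplex
```

Under this assumption the share-weighted semi-elasticities sum to zero, so the first-order update stays on the simplex and agrees with the exact perturbed shares up to second-order terms in δ.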
5.2. French GDP and job market: X-compositional model
In this second illustration, we are interested in the impact on French GDP of the structure (composition) and the volume (total) of the French job market in the three main sectors of activity: Agriculture (primary), Industry (secondary), and Services (tertiary). GDP is expressed in million euros (current prices) and total employment in thousands of people. The data is collected quarterly from 2004 to 2018¹. We use the olr transformation corresponding to the sequential binary partition: Agriculture versus Industry and Services, and Industry versus Services. We consider the model explaining GDP as a function of total employment and the two olr coordinates associated with the above olr transformation. It is therefore an X-compositional model including a total, in this case the simple arithmetic total employment. The adjusted R-squared of this model is 0.92. Table 5 reports the semi-elasticities of GDP with respect to the three sectors at the mean value of the sector composition, corresponding to 788, 9196 and 19385 thousand employees for Agriculture, Industry and Services respectively. To apply formula (1) in the neighborhood of the mean sector composition in volumes, we consider a small $\delta > 0$ and a variation $\delta u_j$ of $x$, where $u_j$ is the unit vector in the direction of the component Services. This variation of $x$ is equivalent, when $\delta$ is small, to a relative variation of $\sqrt{3/2}\,\delta$ (i.e. multiplying the Services component of $x$ by $1 + \sqrt{3/2}\,\delta$). The factor $\sqrt{3/2}$ is $\sqrt{D_X/(D_X - 1)}$ in the general case, corresponding to $\log(u_j)$ in the Taylor expansion in Barceló-Vidal, Martín-Fernández, and Mateu-Figueras (2011). Taking $\delta = 0.01\%$ results in an increase of around $\sqrt{3/2} \times 19385 \times 0.0001$ thousand, that is around 2450 people, in Services employment, while the ratio between Agriculture and Industry employment remains constant. Assuming the fit is correct, we can see in Table 5 that the model predicts that GDP should increase by 84 million euros. The marginal effect of the size of the job market, assuming that its composition stays the same, is given by the parameter estimate of total employment in the model, which is equal to 26.52. When total employment increases by 1000 people, the predicted GDP increase is 26.5 million euros.

¹ https://data.oecd.org/emp/employment-by-activity.htm
Table 5: Semi-elasticities of GDP with respect to employment sectors composition
Sector   $se(GDP, \check{EmplSect})$
AGR       -10157.26
INDU      -51706.00
SERV      841030.75
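The move along the Services direction and the resulting GDP change can be sketched numerically. The contrast matrix below is the usual orthonormal one for the partition Agriculture versus Industry and Services, then Industry versus Services. Multiplying the reported semi-elasticity by δ is our reading of how the figure of 84 million euros is obtained, and the variable names are assumptions of the sketch.

```python
import numpy as np

# Mean employment in thousands for Agriculture, Industry, Services (from the text).
x = np.array([788.0, 9196.0, 19385.0])

# Orthonormal contrasts: Agriculture vs (Industry, Services), then Industry vs Services.
V = np.array([
    [np.sqrt(2.0 / 3.0), -np.sqrt(1.0 / 6.0), -np.sqrt(1.0 / 6.0)],
    [0.0, np.sqrt(1.0 / 2.0), -np.sqrt(1.0 / 2.0)],
])


def olr(volumes):
    """olr coordinates of the employment composition."""
    return V @ np.log(volumes)


delta = 0.0001              # 0.01%
scale = np.sqrt(3.0 / 2.0)  # sqrt(D / (D - 1)) with D = 3 sectors

# Step of length delta along the Services direction: only Services is rescaled,
# by exp(scale * delta), which is close to 1 + scale * delta for small delta.
x_new = x.copy()
x_new[2] *= np.exp(scale * delta)

# Length of the step in olr coordinates (Aitchison distance).
aitchison_step = np.linalg.norm(olr(x_new) - olr(x))

# Predicted GDP change in million euros, using the Services value of Table 5.
se_services = 841030.75
gdp_change = se_services * delta

# Effect of 1 thousand additional employees at constant composition.
total_coef = 26.52
gdp_change_total = total_coef * 1.0
```

The step has Aitchison length δ, which is why the factor $\sqrt{3/2}$ appears in the relative variation of Services employment, and the Agriculture to Industry ratio is left unchanged.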
Note that using a base 2 logarithm as in Muller, Hron, Fiserova, Smahaj, Cakirpaloglu, and
Vancakova (2018) is not useful in our approach and would rather introduce an unnecessary
constant.
6. Conclusion
This contribution highlights the fact that elasticities or semi-elasticities are well-adapted to interpret the impacts of explanatory variables in all types of compositional regression models. It also links these elasticities or semi-elasticities to the simplicial derivatives of the expected response with respect to the considered explanatory variable. The models may contain compositional variables on the right hand side and/or on the left hand side of the regression equation, and may or may not contain total variables (relative to the dependent or the explanatory variables). Further work should be done on confidence intervals for (semi-)elasticities, which can be computed by the Delta method or simply by a bootstrap approach. Extensions to time series compositional models and to spatial compositional models involve more complex elasticity computations, which take into account the time-lag and spatial-lag operators, as can be seen in Thomas-Agnan, Laurent, Ruiz-Gazen, Nguyen, Chakir, and Lungarska (2020) for the spatial case.
An alternative but more complex tool used in Wang, Shangguan, Wu, and Guan (2013) and in Morais, Thomas-Agnan, and Simioni (2018a) is the elasticity of a ratio of shares. In the framework of an MCI model, it would directly correspond to a parameter of the model, which is attractive, but it relates to the rate of change of a ratio of components and not of a single component, and is therefore more difficult to explain to a non-specialist audience.
Acknowledgements
We acknowledge funding from the French National Research Agency (ANR) under the Investments for the Future (Investissements d’Avenir) program, grant ANR-17-EURE-0010.
References
Aitchison J (1986). The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman and Hall. Reprinted in 2003 with additional material by Blackburn Press.
Barceló-Vidal C, Martín-Fernández JA, Mateu-Figueras G (2011). “Compositional Differential Calculus on the Simplex.” Compositional Data Analysis: Theory and Applications. John Wiley & Sons.
Bui TTT, Loubes JM, Risser L, Balaresque P (2018). “Distribution Regression Model with a Reproducing Kernel Hilbert Space Approach.” arXiv preprint arXiv:1806.10493.
Chen J, Zhang X, Li S (2017). “Multiple Linear Regression with Compositional Response and Covariates.” Journal of Applied Statistics, 44(12), 2270–2285.
Coenders G, Ferrer-Rosell B, Mateu-Figueras G, Pawlowsky-Glahn V (2015). “MANOVA of Compositional Data with a Total.” CODAWORK2015.
Coenders G, Martín-Fernández JA, Ferrer-Rosell B (2017). “When Relative and Absolute Information Matter: Compositional Predictor with a Total in Generalized Linear Models.” Statistical Modelling, 17(6), 494–512.
Coenders G, Pawlowsky-Glahn V (2020). “On Interpretations of Tests and Effect Sizes in Regression Models with a Compositional Predictor.” SORT, 44(1).
Combettes PL, Muller CL (2019). “Regression Models for Compositional Data: General Log-contrast Formulations, Proximal Optimization, and Microbiome Data Applications.” URL https://arxiv.org/abs/1903.01050.
Egozcue JJ, Daunis-I-Estadella J, Pawlowsky-Glahn V, Hron K, Filzmoser P (2012). “Simplicial Regression. The Normal Model.” Journal of Applied Probability and Statistics.
Egozcue JJ, Jarauta-Bragulat E, Díaz-Barrero J (2011). “Calculus of Simplex-valued Functions.” Compositional Data Analysis: Theory and Applications.
Filzmoser P, Hron K, Templ M (2018). Applied Compositional Data Analysis, With Worked Examples in R. Springer Series in Statistics. Springer.
Hron K, Filzmoser P, Thompson K (2012). “Linear Regression with Compositional Explanatory Variables.” Journal of Applied Statistics, 39(5), 1115–1128.
Kynclova P, Filzmoser P, Hron K (2015). “Modeling Compositional Time Series with Vector Autoregressive Models.” Journal of Forecasting, 34(4), 303–314.
Martín-Fernández JA (2019). “Comments on: Compositional Data: The Sample Space and Its Structure.” TEST, 28, 653–657. URL https://doi.org/10.1007/s11749-019-00672-4.
Morais J (2017). Impact of Media Investments on Brands’ Market Shares: A Compositional Data Analysis Approach. Ph.D. thesis, Toulouse School of Economics (TSE).
Morais J, Thomas-Agnan C (2020). “Impact of the Economic Context on the Automobile Market Segment Shares: A Compositional Approach.” Preprint.
Morais J, Thomas-Agnan C, Simioni M (2017). “Impact of Advertising on Brand’s Market-shares in the Automobile Market: A Multi-channel Attraction Model with Competition and Carryover Effects.” URL https://hal.archives-ouvertes.fr/hal-01666853/.
Morais J, Thomas-Agnan C, Simioni M (2018a). “Interpretation of Explanatory Variables Impacts in Compositional Regression Models.” Austrian Journal of Statistics, 47(5), 1–25.
Morais J, Thomas-Agnan C, Simioni M (2018b). “Using Compositional and Dirichlet Models for Market Share Regression.” Journal of Applied Statistics, 45(9), 1670–1689.
Muller I, Hron K, Fiserova E, Smahaj J, Cakirpaloglu P, Vancakova J (2018). “Interpretation of Compositional Regression with Application to Time Budget Analysis.” Austrian Journal of Statistics, 47(2), 3–19. doi:10.17713/ajs.v47i2.652. URL https://www.ajs.or.at/index.php/ajs/article/view/vol47-2-1.
Nakanishi M, Cooper LG (1982). “Simplified Estimation Procedures for MCI Models.” Marketing Science, 1(3), pp. 314–322. ISSN 07322399. URL http://www.jstor.org/stable/183931.
Nguyen THA, Laurent T, Thomas-Agnan C, Ruiz-Gazen A (2018). “Analyzing the Impacts of Socio-economic Factors on French Departmental Elections with CODA Methods.” TSE Working paper 18-961.
Pawlowsky-Glahn V, Buccianti A (2011). Compositional Data Analysis: Theory and Applications. John Wiley & Sons.
Pawlowsky-Glahn V, Egozcue JJ, Lovell D (2015a). “Tools for Compositional Data with a Total.” Statistical Modelling, 15(2), 175–190.
Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2015b). Modeling and Analysis of Compositional Data. John Wiley & Sons.
Sun Z, Xu W, Cong X, Chen K (2018). “Log-Contrast Regression with Functional Compositional Predictors: Linking Preterm Infant’s Gut Microbiome Trajectories in Early Postnatal Period to Neurobehavioral Outcome.” arXiv preprint arXiv:1808.02403.
Thomas-Agnan C, Laurent T, Ruiz-Gazen A, Nguyen T, Chakir R, Lungarska A (2020). “Spatial Simultaneous Autoregressive Models for Compositional Data: Application to Land Use.” TSE Working Paper, 20(1098).
Trinh HT, Morais J, Thomas-Agnan C, Simioni M (2018). “Relations between Socio-economic Factors and Nutritional Diet in Vietnam from 2004 to 2014: New Insights Using Compositional Data Analysis.” Statistical Methods in Medical Research, p. 0962280218770223.
Van Den Boogaart KG, Tolosana-Delgado R (2013). Analysing Compositional Data with R. Springer.
Wang H, Shangguan L, Wu J, Guan R (2013). “Multiple Linear Regression Modeling for Compositional Data.” Neurocomputing, 122, 490–500.
Affiliation:
Christine Thomas-Agnan
Toulouse School of Economics
Esplanade de l’université
31080 Toulouse Cedex 06 France
E-mail: christine.thomas@tse-fr.eu
URL: https://www.tse-fr.eu/fr/people/christine-thomas-agnan
Austrian Journal of Statistics http://www.ajs.or.at/
published by the Austrian Society of Statistics http://www.osg.or.at/
Volume 50 Submitted: 2019-12-02
January 2021 Accepted: 2020-07-18