JMLR: Workshop and Conference Proceedings 12 (2011) 95–118 Causality in Time Series
Causal Search in Structural Vector Autoregressive Models
Alessio Moneta
moneta@econ.mpg.de
Max Planck Institute of Economics
Jena, Germany
Nadine Chlaß
nadine.chlass@uni-jena.de
Friedrich Schiller University of Jena, Germany
Doris Entner
doris.entner@cs.helsinki.fi
Helsinki Institute for Information Technology, Finland
Patrik Hoyer
patrik.hoyer@helsinki.fi
Helsinki Institute for Information Technology, Finland
Editors: Florin Popescu and Isabelle Guyon
Abstract
This paper reviews a class of methods to perform causal inference in the framework of a structural vector autoregressive model. We consider three different settings. In the first setting the underlying system is linear with normal disturbances and the structural model is identified by exploiting the information incorporated in the partial correlations of the estimated residuals. Zero partial correlations are used as input of a search algorithm formalized via graphical causal models. In the second, semiparametric, setting the underlying system is linear with non-Gaussian disturbances. In this case the structural vector autoregressive model is identified through a search procedure based on independent component analysis. Finally, we explore the possibility of causal search in a nonparametric setting by studying the performance of conditional independence tests based on kernel density estimations.
Keywords: Causal inference, econometric time series, SVAR, graphical causal models,
independent component analysis, conditional independence tests
1. Introduction
1.1. Causal inference in econometrics
Applied economic research is pervaded by questions about causes and effects. For example, what is the effect of a monetary policy intervention? Is energy consumption causing growth or the other way around? Or does causality run in both directions? Are economic fluctuations mainly caused by monetary, productivity, or demand shocks? Does foreign aid improve living standards in poor countries? Does firms' expenditure in R&D causally influence their profits? Are recent rises in oil prices in part caused by speculation? These are seemingly heterogeneous questions, but they all require some knowledge of the causal process by which variables came to take the values we observe.
A traditional approach to address such questions hinges on the explicit use of a priori economic theory. The gist of this approach is to partition a causal process into a deterministic and a random part, and to articulate the deterministic part so as to reflect the causal dependencies dictated by economic theory. If the formulation of the deterministic part is accurate and reliable enough, the random part is expected to display properties that can easily be analyzed by standard statistical tools. The touchstone of this approach is represented by the work of Haavelmo (1944), which inspired the research program subsequently pursued by the Cowles Commission (Koopmans, 1950; Hood and Koopmans, 1953). Therein, the causal process is formalized by means of a structural equation model (SEM), that is, a system of equations with endogenous variables, exogenous variables, and error terms, first developed by Wright (1921). Its coefficients were given a causal interpretation (Pearl, 2000).
This approach was strongly criticized in the 1970s for being ineffective in both policy evaluation and forecasting. Lucas (1976) pointed out that the economic theory included in the SEM fails to take economic agents' (rational) motivations and expectations into consideration. Agents, according to Lucas, are able to anticipate policy interventions and act contrary to the predictions derived from the structural equation model, since the model usually ignores such anticipations. Sims (1980) put forth another critique which runs parallel to Lucas's. It explicitly addresses the status of exogeneity which the Cowles Commission approach attributes (arbitrarily, according to Sims) to some variables so that the structural model can be identified. Sims argues that theory is not a reliable source for deeming a variable exogenous. More generally, the Cowles Commission approach, with its strong a priori commitment to theory, risks falling into a vicious circle: if causal information (even if only about direction) can exclusively be derived from background theory, how do we obtain an empirically justified theory? (Cf. Hoover, 2006, p. 75.)
An alternative approach has been pursued since Wiener (1956) and Granger’s (1969)
work. It aims at inferring causal relations directly from the statistical properties of the
data relying only to a minimal extent on background knowledge. Granger (1980) proposes
a probabilistic concept of causality, similar to Suppes (1970). Granger deﬁnes causality in
terms of the incremental predictability (at horizon one) of a time series variable $\{Y_t\}$ (given the present and past values of $\{Y_t\}$ and of a set $\{Z_t\}$ of possibly relevant variables) when another time series variable $\{X_t\}$ (in its present and past values) is not omitted. More formally, $\{X_t\}$ Granger-causes $\{Y_t\}$ if
$$P(Y_{t+1} \mid X_t, X_{t-1}, \ldots, Y_t, Y_{t-1}, \ldots, Z_t, Z_{t-1}, \ldots) \neq P(Y_{t+1} \mid Y_t, Y_{t-1}, \ldots, Z_t, Z_{t-1}, \ldots). \quad (1)$$
As pointed out by Florens and Mouchart (1982), testing the hypothesis of Granger noncausality corresponds to testing conditional independence. Given lags $p$, $\{X_t\}$ does not Granger-cause $\{Y_t\}$ if
$$Y_{t+1} \;\perp\!\!\!\perp\; (X_t, X_{t-1}, \ldots, X_{t-p}) \mid (Y_t, Y_{t-1}, \ldots, Y_{t-p}, Z_t, Z_{t-1}, \ldots, Z_{t-p}). \quad (2)$$
To test Granger noncausality, researchers often specify linear vector autoregressive (VAR) models:
$$Y_t = A_1 Y_{t-1} + \ldots + A_p Y_{t-p} + u_t, \quad (3)$$
in which $Y_t$ is a $k \times 1$ vector of time series variables $(Y_{1,t}, \ldots, Y_{k,t})'$, where $(\cdot)'$ denotes the transpose, the $A_j$ ($j = 1, \ldots, p$) are $k \times k$ coefficient matrices, and $u_t$ is the $k \times 1$ vector of random disturbances. In this framework, testing the hypothesis that $\{Y_{i,t}\}$ does not Granger-cause $\{Y_{j,t}\}$ reduces to testing whether the $(j,i)$ entries of the matrices $A_1, \ldots, A_p$ vanish simultaneously. Granger noncausality tests have been extended to nonlinear settings by Baek and Brock (1992), Hiemstra and Jones (1994), and Su and White (2008), using nonparametric tests of conditional independence (more on this topic in section 4).
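As a toy illustration of this reduction, one can simulate a bivariate VAR(1) and read Granger noncausality off the estimated coefficient matrix. This is a minimal sketch with simulated, hypothetical data; in practice one would use a formal Wald or F test on the relevant coefficients rather than inspect point estimates:

```python
import numpy as np

# In a bivariate VAR(1), Y_t = A Y_{t-1} + u_t, {Y_2t} does not
# Granger-cause {Y_1t} exactly when the (1,2) entry of A is zero.
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.0],    # Y_1 depends only on its own past
              [0.4, 0.3]])   # Y_2 depends on past Y_1 and past Y_2
T = 4000
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t] = A @ Y[t - 1] + rng.normal(size=2)

X, Z = Y[:-1], Y[1:]                           # lagged regressors, targets
A_hat = np.linalg.lstsq(X, Z, rcond=None)[0].T  # OLS estimate of A
# The estimated (1,2) entry is close to zero (no Granger causality from
# Y_2 to Y_1), while the (2,1) entry is clearly nonzero.
print(np.round(A_hat, 2))
```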
The concept of Granger causality has been criticized for failing to capture 'structural causality' (Hoover, 2008). Suppose one finds that a variable A Granger-causes another variable B. This does not necessarily imply that an economic mechanism exists by which A can be manipulated to affect B. The existence of such a mechanism in turn does not necessarily imply Granger causality either (for a discussion see Hoover 2001, pp. 150-155). Indeed, the analysis of Granger causality is based on coefficients of reduced-form models, like those incorporated in equation (3), which are unlikely to reliably represent actual economic mechanisms. For instance, in equation (3) the simultaneous causal structure is not modeled, in order to facilitate estimation. (However, note that Eichler (2007) and White and Lu (2010) have recently developed and formalized richer structural frameworks in which Granger causality can be fruitfully analyzed.)
1.2. The SVAR framework
Structural vector autoregressive (SVAR) models constitute a middle way between the Cowles Commission approach and the Granger-causality approach. SVAR models aim at recovering the concept of structural causality, but at the same time eschew the strong 'apriorism' of the Cowles Commission approach. The idea is, as in the Cowles Commission approach, to articulate an unobserved structural model, formalized as a dynamic generative model: at each time unit the system is affected by unobserved innovation terms by which, once filtered by the model, the variables come to take the values we observe. But, differently from the Cowles Commission approach, and similarly to the Granger-VAR model, the data generating process is articulated generally enough that the time series variables are not distinguished a priori as exogenous or endogenous. A linear SVAR model is in principle a VAR model 'augmented' by the contemporaneous structure:
$$\Gamma_0 Y_t = \Gamma_1 Y_{t-1} + \ldots + \Gamma_p Y_{t-p} + \varepsilon_t. \quad (4)$$
This is easily obtained by premultiplying each side of the VAR model
$$Y_t = A_1 Y_{t-1} + \ldots + A_p Y_{t-p} + u_t, \quad (5)$$
by a matrix $\Gamma_0$, so that $\Gamma_i = \Gamma_0 A_i$ for $i = 1, \ldots, p$, and $\varepsilon_t = \Gamma_0 u_t$. Note, however, that not just any matrix $\Gamma_0$ will be suitable. The appropriate $\Gamma_0$ is the matrix corresponding to the 'right' rotation of the VAR model, that is, the rotation compatible both with the contemporaneous causal structure of the variables and with the structure of the innovation term. Let us consider the matrix $B_0 = I - \Gamma_0$. If the system is normalized such that the matrix $\Gamma_0$ has all the elements of its principal diagonal equal to one (which can be done straightforwardly), the diagonal elements of $B_0$ will be equal to zero. We can write:
$$Y_t = B_0 Y_t + \Gamma_1 Y_{t-1} + \ldots + \Gamma_p Y_{t-p} + \varepsilon_t \quad (6)$$
from which we see that $B_0$ (and thus $\Gamma_0$) determines in which form the values of a variable $Y_{i,t}$ depend on the contemporaneous value of another variable $Y_{j,t}$. The 'right' rotation will also be the one which makes $\varepsilon_t$ a vector of authentic innovation terms, which are expected to be independent (not only over time, but also contemporaneously) sources or shocks.
In the literature, different methods have been proposed to identify the SVAR model (4) on the basis of the estimation of the VAR model (5). Notice that there are more unobserved parameters in (4), whose number amounts to $k^2(p+1)$, than parameters that can be estimated from (5), which are $k^2 p + k(k+1)/2$, so one has to impose at least $k(k-1)/2$ restrictions on the system. One solution to this problem is to choose a rotation of (5) such that the covariance matrix of the SVAR residuals, $\Sigma_\varepsilon$, is diagonal, using the Cholesky factorization of the covariance matrix of the estimated residuals, $\Sigma_u$. That is, let $P$ be the lower-triangular Cholesky factor of $\Sigma_u$ (i.e. $\Sigma_u = PP'$), let $D$ be a $k \times k$ diagonal matrix with the same diagonal as $P$, and let $\Gamma_0 = DP^{-1}$. By premultiplying (5) by $\Gamma_0$, it turns out that $\Sigma_\varepsilon = E[\Gamma_0 u_t u_t' \Gamma_0'] = DD'$, which is diagonal. A problem with this method is that $P$ changes if the ordering of the variables $(Y_{1t}, \ldots, Y_{kt})'$ in $Y_t$, and consequently the order of the residuals in $\Sigma_u$, changes. Since researchers who estimate an SVAR are often exclusively interested in tracking down the effect of a structural shock $\varepsilon_{it}$ on the variables $Y_{1,t}, \ldots, Y_{k,t}$ over time (impulse response functions), Sims (1981) suggested investigating to what extent the impulse response functions remain robust under changes of the order of variables.
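The Cholesky scheme just described can be sketched in a few lines (a minimal numerical illustration; the covariance matrix below is hypothetical):

```python
import numpy as np

# Given a residual covariance matrix Sigma_u, form Gamma_0 = D P^{-1} and
# check that the implied structural-shock covariance
# Gamma_0 Sigma_u Gamma_0' = D D' is diagonal.
Sigma_u = np.array([[1.0, 0.4, 0.2],
                    [0.4, 2.0, 0.5],
                    [0.2, 0.5, 1.5]])     # hypothetical estimate
P = np.linalg.cholesky(Sigma_u)           # lower triangular, Sigma_u = P P'
D = np.diag(np.diag(P))                   # diagonal matrix with P's diagonal
Gamma_0 = D @ np.linalg.inv(P)            # unit diagonal by construction
Sigma_eps = Gamma_0 @ Sigma_u @ Gamma_0.T
assert np.allclose(Sigma_eps, np.diag(np.diag(Sigma_eps)))  # diagonal
assert np.allclose(np.diag(Gamma_0), 1.0)
```

The two assertions verify algebraically obvious facts: $\Gamma_0 \Sigma_u \Gamma_0' = D P^{-1} P P' P^{-1\prime} D' = DD'$, and the diagonal of $DP^{-1}$ is one by construction.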
Popular alternatives to the Cholesky identification scheme are based either on the use of a priori, theory-based restrictions or on the use of long-run restrictions. The former solution consists in imposing economically plausible constraints on the contemporaneous interactions among variables (Blanchard and Watson, 1986; Bernanke, 1986) and has the drawback of ultimately depending on the a priori reliability of economic theory, similarly to the Cowles Commission approach. The second solution is based on the assumption that certain economic shocks have long-run effects on some variables but do not influence the long-run level of other variables (see Shapiro and Watson, 1988; Blanchard and Quah, 1989; King et al., 1991). This approach has been criticized as not being very reliable unless strong a priori restrictions are imposed (see Faust and Leeper, 1997).
In the rest of the paper, we first present a method, based on the graphical causal model framework, to identify the SVAR (section 2). This method is based on conditional independence tests among the estimated residuals of the VAR model. Such tests rely on the assumption that the shocks affecting the model are Gaussian. We then relax the Gaussianity assumption and present a method to identify the SVAR model based on independent component analysis (section 3). Here the main assumption is that shocks are non-Gaussian and independent. Finally (section 4), we explore the possibility of extending the framework for causal inference to a nonparametric setting. In section 5 we wrap up the discussion and conclude by formulating some open questions.
2. SVAR identiﬁcation via graphical causal models
2.1. Background
A data-driven approach to identifying the structural VAR is based on the analysis of the estimated residuals $\hat{u}_t$. Notice that when a basic VAR model is estimated (equation 3), the information about contemporaneous causal dependence is incorporated exclusively in the residuals (it is not modeled among the variables). Graphical causal models, as originally developed by Pearl (2000) and Spirtes et al. (2000), represent an efficient method to recover, at least in part, the contemporaneous causal structure starting from the analysis of the conditional independencies among the estimated residuals. Once the contemporaneous causal structure is recovered, the estimation of the lagged autoregressive coefficients permits us to identify the complete SVAR model.

This approach was initiated by Swanson and Granger (1997), who proposed to test whether a particular causal order of the VAR is in accord with the data by testing all the partial correlations of order one among the error terms and checking whether some partial correlations vanish. Reale and Wilson (2001), Bessler and Lee (2002), Demiralp and Hoover (2003), and Moneta (2008) extended the approach by using the partial correlations of the VAR residuals as input to graphical causal model search algorithms.
In graphical causal models, the structural model is represented as a causal graph (a Directed Acyclic Graph if the presence of causal loops is excluded), in which each node represents a random variable and each edge a causal dependence. Furthermore, a set of assumptions or 'rules of inference' are formulated, which regulate the relationship between causal and probabilistic dependencies: the causal Markov and the faithfulness conditions (Spirtes et al., 2000). The former restricts the joint probability distribution of the modeled variables: each variable is independent of its graphical nondescendants conditional on its graphical parents. The latter makes causal discovery possible: all of the conditional independence relations among the modeled variables follow from the causal Markov condition. Thus, for example, if the causal structure is $Y_{1,t} \rightarrow Y_{2,t} \rightarrow Y_{3,t}$, it follows from the Markov condition that $Y_{1,t} \perp\!\!\!\perp Y_{3,t} \mid Y_{2,t}$. If, on the other hand, the only (conditional) independence relation among $Y_{1,t}, Y_{2,t}, Y_{3,t}$ is $Y_{1,t} \perp\!\!\!\perp Y_{3,t}$, it follows from the faithfulness condition that $Y_{1,t} \rightarrow Y_{2,t} \leftarrow Y_{3,t}$.
Constraint-based algorithms for causal discovery, like, for instance, PC, SGS, FCI (Spirtes et al. 2000), or CCD (Richardson and Spirtes 1999), use tests of conditional independence to constrain the possible causal relationships among the model variables. The first step of such an algorithm typically involves the formation of a complete undirected graph among the variables, so that they are all connected by undirected edges. In a second step, conditional independence relations (or d-separations, which are the graphical characterization of conditional independence) are used to erase edges and, in further steps, to direct edges. The output of such algorithms is not necessarily a single graph, but a class of Markov-equivalent graphs.
There is nothing in the Markov or faithfulness conditions, nor in the constraint-based algorithms, that limits them to linear and Gaussian settings. Graphical causal models do not per se require any a priori specification of the functional dependence between variables. However, in applications of graphical models to SVARs, conditional independence is ascertained by testing vanishing partial correlations (Swanson and Granger, 1997; Bessler and Lee, 2002; Demiralp and Hoover, 2003; Moneta, 2008). Since the normal distribution guarantees the equivalence between zero partial correlation and conditional independence, these applications deal de facto with linear and Gaussian processes.
2.2. Testing residuals' zero partial correlations

There are alternative methods to test zero partial correlations among the error terms $\hat{u}_t = (u_{1t}, \ldots, u_{kt})'$. Swanson and Granger (1997) use the partial correlation coefficient. That is, in order to test, for instance, $\rho(u_{it}, u_{kt} \mid u_{jt}) = 0$, they use the standard $t$ statistic from a least squares regression of the model:
$$u_{it} = \alpha_j u_{jt} + \alpha_k u_{kt} + \varepsilon_{it}, \quad (7)$$
on the basis that $\alpha_k = 0 \Leftrightarrow \rho(u_{it}, u_{kt} \mid u_{jt}) = 0$. Since Swanson and Granger (1997) impose the partial correlation constraints looking only at the set of partial correlations of order one (that is, conditioned on only one variable), in order to run their tests they consider regression equations with only two regressors, as in equation (7).
Bessler and Lee (2002) and Demiralp and Hoover (2003) use Fisher's $z$, which is incorporated in the software TETRAD (Scheines et al., 1998):
$$z(\rho_{XY.K}, T) = \frac{1}{2}\sqrt{T - |K| - 3}\,\log\frac{|1 + \rho_{XY.K}|}{|1 - \rho_{XY.K}|}, \quad (8)$$
where $|K|$ equals the number of variables in the conditioning set $K$ and $T$ the sample size. If the variables (for instance $X = u_{it}$, $Y = u_{kt}$, $K = (u_{jt}, u_{ht})$) are normally distributed, we have that
$$z(\rho_{XY.K}, T) - z(\hat{\rho}_{XY.K}, T) \sim N(0, 1) \quad (9)$$
(see Spirtes et al., 2000, p. 94).
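As a concrete sketch of this test, the following function computes the Fisher $z$ statistic of equation (8), obtaining the partial correlation from the inverse of the correlation matrix. This is a minimal illustration assuming numpy/scipy; the function name is ours:

```python
import numpy as np
from scipy import stats

def fisher_z_test(X, Y, K):
    """Fisher z test of rho(X, Y | K) = 0, as in equation (8).

    X, Y: 1-D arrays of length T; K: 2-D array (T x |K|) of conditioning
    variables (it may have zero columns). Returns (z statistic, p-value).
    """
    T = len(X)
    data = np.column_stack([X, Y] + ([K] if K.size else []))
    corr = np.corrcoef(data, rowvar=False)
    prec = np.linalg.inv(corr)               # precision matrix
    # partial correlation of X and Y given all columns of K
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    m = K.shape[1] if K.ndim == 2 else 0     # |K|
    z = 0.5 * np.sqrt(T - m - 3) * np.log(abs(1 + r) / abs(1 - r))
    p = 2 * (1 - stats.norm.cdf(abs(z)))     # two-sided p-value
    return z, p
```

The partial correlation of $X$ and $Y$ given all conditioning variables is read off the inverse correlation matrix, a standard identity that avoids running explicit regressions.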
A different approach, which takes into account the fact that correlations are obtained from residuals of a regression, is proposed by Moneta (2008). In this case it is useful to write the VAR model of equation (3) in a more compact form:
$$Y_t = \Pi' X_t + u_t, \quad (10)$$
where $X_t' = [Y_{t-1}', \ldots, Y_{t-p}']$, which has dimension $(1 \times kp)$, and $\Pi' = [A_1, \ldots, A_p]$, which has dimension $(k \times kp)$. In the case of a stable VAR process (see next subsection), the conditional maximum likelihood estimate of $\Pi$ for a sample of size $T$ is given by
$$\hat{\Pi}' = \left[\sum_{t=1}^{T} Y_t X_t'\right] \left[\sum_{t=1}^{T} X_t X_t'\right]^{-1}.$$
Moreover, the $i$th row of $\hat{\Pi}'$ is
$$\hat{\pi}_i' = \left[\sum_{t=1}^{T} Y_{it} X_t'\right] \left[\sum_{t=1}^{T} X_t X_t'\right]^{-1},$$
which coincides with the estimated coefficient vector from an OLS regression of $Y_{it}$ on $X_t$ (Hamilton 1994: 293). The maximum likelihood estimate of the covariance matrix of the error terms, $\Sigma_u$, turns out to be $\hat{\Sigma}_u = (1/T)\sum_{t=1}^{T} \hat{u}_t \hat{u}_t'$, where $\hat{u}_t = Y_t - \hat{\Pi}' X_t$. Therefore, the maximum likelihood estimate of the covariance between $u_{it}$ and $u_{jt}$ is given by the $(i,j)$ element of $\hat{\Sigma}_u$: $\hat{\sigma}_{ij} = (1/T)\sum_{t=1}^{T} \hat{u}_{it} \hat{u}_{jt}$. Denoting by $\sigma_{ij}$ the $(i,j)$ element of $\Sigma_u$, let us first define the following matrix transform operators: vec, which stacks the columns of a $k \times k$ matrix into a vector of length $k^2$, and vech, which vertically stacks the elements of a $k \times k$ matrix on or below the principal diagonal into a vector of length $k(k+1)/2$. For example:
$$\mathrm{vec}\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{12} \\ \sigma_{22} \end{pmatrix}, \qquad \mathrm{vech}\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{22} \end{pmatrix}.$$
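The estimators above are straightforward to compute; the following is a minimal numpy sketch (function and variable names are ours) of $\hat{\Pi}'$ and $\hat{\Sigma}_u$ obtained via equation-by-equation OLS:

```python
import numpy as np

def estimate_var(Y, p):
    """OLS/MLE estimation of a VAR(p): returns (Pi_hat, residuals, Sigma_u).

    Y: (T_total x k) data matrix. Pi_hat has shape (k x kp), matching
    Pi' = [A_1, ..., A_p] in equation (10).
    """
    T_total, k = Y.shape
    # regressor X_t = [Y_{t-1}', ..., Y_{t-p}']' stacked for t = p, ..., T_total-1
    X = np.hstack([Y[p - j:T_total - j] for j in range(1, p + 1)])  # (T x kp)
    Yt = Y[p:]                                                       # (T x k)
    Pi_hat = np.linalg.lstsq(X, Yt, rcond=None)[0].T                 # (k x kp)
    resid = Yt - X @ Pi_hat.T                 # hat{u}_t = Y_t - Pi' X_t
    Sigma_u = resid.T @ resid / len(resid)    # MLE (divides by T, not T-1)
    return Pi_hat, resid, Sigma_u
```

For a stable process the OLS coefficients coincide with the conditional maximum likelihood estimates, as noted in the text.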
The process being stationary and the error terms Gaussian, it turns out that:
$$\sqrt{T}\,[\mathrm{vech}(\hat{\Sigma}_u) - \mathrm{vech}(\Sigma_u)] \xrightarrow{d} N(0, \Omega), \quad (11)$$
where $\Omega = 2 D_k^+ (\Sigma_u \otimes \Sigma_u)(D_k^+)'$, $D_k^+ \equiv (D_k' D_k)^{-1} D_k'$, $D_k$ is the unique $(k^2 \times k(k+1)/2)$ matrix satisfying $D_k\,\mathrm{vech}(\Omega) = \mathrm{vec}(\Omega)$, and $\otimes$ denotes the Kronecker product (see Hamilton 1994: 301). For example, for $k = 2$, we have
$$\sqrt{T}\begin{pmatrix} \hat{\sigma}_{11} - \sigma_{11} \\ \hat{\sigma}_{12} - \sigma_{12} \\ \hat{\sigma}_{22} - \sigma_{22} \end{pmatrix} \xrightarrow{d} N\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 2\sigma_{11}^2 & 2\sigma_{11}\sigma_{12} & 2\sigma_{12}^2 \\ 2\sigma_{11}\sigma_{12} & \sigma_{11}\sigma_{22} + \sigma_{12}^2 & 2\sigma_{12}\sigma_{22} \\ 2\sigma_{12}^2 & 2\sigma_{12}\sigma_{22} & 2\sigma_{22}^2 \end{pmatrix}\right).$$
Therefore, to test the null hypothesis that $\rho(u_{it}, u_{jt}) = 0$ from the estimated VAR residuals, it is possible to use the Wald statistic:
$$\frac{T(\hat{\sigma}_{ij})^2}{\hat{\sigma}_{ii}\hat{\sigma}_{jj} + \hat{\sigma}_{ij}^2} \approx \chi^2(1).$$
The Wald statistic for testing vanishing partial correlations of any order is obtained by applying the delta method: if $X_T$ is an $(r \times 1)$ sequence of vector-valued random variables with $[\sqrt{T}(X_{1T} - \theta_1), \ldots, \sqrt{T}(X_{rT} - \theta_r)] \xrightarrow{d} N(0, \Sigma)$, and $h_1, \ldots, h_r$ are real-valued functions of $\theta = (\theta_1, \ldots, \theta_r)$, $h_i: \mathbb{R}^r \to \mathbb{R}$, defined and continuously differentiable in a neighborhood $\omega$ of the parameter point $\theta$ and such that the matrix $B = \|\partial h_i / \partial \theta_j\|$ of partial derivatives is nonsingular in $\omega$, then:
$$[\sqrt{T}(h_1(X_T) - h_1(\theta)), \ldots, \sqrt{T}(h_r(X_T) - h_r(\theta))] \xrightarrow{d} N(0, B \Sigma B')$$
(see Lehmann and Casella 1998: 61).
Thus, for $k = 4$, suppose one wants to test $\mathrm{corr}(u_{1t}, u_{3t} \mid u_{2t}) = 0$. First, notice that $\rho(u_1, u_3 \mid u_2) = 0$ if and only if $\sigma_{22}\sigma_{13} - \sigma_{12}\sigma_{23} = 0$ (by definition of partial correlation). One can define a function $g: \mathbb{R}^{k(k+1)/2} \to \mathbb{R}$ such that $g(\mathrm{vech}(\Sigma_u)) = \sigma_{22}\sigma_{13} - \sigma_{12}\sigma_{23}$. Thus,
$$\nabla g' = (0, -\sigma_{23}, \sigma_{22}, 0, \sigma_{13}, -\sigma_{12}, 0, 0, 0, 0).$$
Applying the delta method:
$$\sqrt{T}\,[(\hat{\sigma}_{22}\hat{\sigma}_{13} - \hat{\sigma}_{12}\hat{\sigma}_{23}) - (\sigma_{22}\sigma_{13} - \sigma_{12}\sigma_{23})] \xrightarrow{d} N(0, \nabla g'\, \Omega\, \nabla g).$$
The Wald test of the null hypothesis $\mathrm{corr}(u_{1t}, u_{3t} \mid u_{2t}) = 0$ is given by:
$$\frac{T(\hat{\sigma}_{22}\hat{\sigma}_{13} - \hat{\sigma}_{12}\hat{\sigma}_{23})^2}{\nabla g'\, \Omega\, \nabla g} \approx \chi^2(1).$$
Tests for higher-order correlations and for $k > 4$ follow analogously (see also Moneta, 2003). This testing procedure has the advantage, with respect to the alternative methods, that it can be applied straightforwardly to the case of cointegrated data, as will be explained in the next subsection.
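The whole procedure for this example can be sketched numerically. This is a minimal illustration for $\mathrm{corr}(u_{1t}, u_{3t} \mid u_{2t}) = 0$ with $k = 4$; the duplication-matrix construction and the function name are ours, and $\Omega$ is evaluated at the estimates:

```python
import numpy as np
from scipy import stats

def wald_partial_corr_test(resid):
    """Wald test of corr(u_1t, u_3t | u_2t) = 0 from VAR residuals.

    resid: (T x k) array, k >= 3. Delta-method statistic
    T * (s22*s13 - s12*s23)^2 / (grad' Omega grad) ~ chi2(1), with
    Omega = 2 D_k^+ (Sigma (x) Sigma) D_k^+'.
    """
    T, k = resid.shape
    S = resid.T @ resid / T                       # MLE of Sigma_u
    # duplication matrix D_k: D_k vech(A) = vec(A) (column-major vec)
    pairs = [(i, j) for j in range(k) for i in range(j, k)]
    D = np.zeros((k * k, len(pairs)))
    for c, (i, j) in enumerate(pairs):
        D[j * k + i, c] = 1.0
        D[i * k + j, c] = 1.0
    D_plus = np.linalg.inv(D.T @ D) @ D.T         # D_k^+
    Omega = 2.0 * D_plus @ np.kron(S, S) @ D_plus.T
    # g(vech(Sigma)) = s22*s13 - s12*s23 and its gradient in vech order
    g_hat = S[1, 1] * S[0, 2] - S[0, 1] * S[1, 2]
    dmap = {(1, 1): S[0, 2], (2, 0): S[1, 1],
            (1, 0): -S[1, 2], (2, 1): -S[0, 1]}
    grad = np.array([dmap.get(ij, 0.0) for ij in pairs])
    wald = T * g_hat ** 2 / (grad @ Omega @ grad)
    return wald, 1.0 - stats.chi2.cdf(wald, df=1)
```

The vech ordering used for the gradient matches the $\nabla g'$ vector given above for $k = 4$.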
2.3. Cointegration case
A typical feature of economic time series data in which there is some form of causal depen
dence is cointegration. This term denotes the phenomenon that nonstationary processes
can have linear combinations that are stationary. That is, suppose that each component
Y
it
of Y
t
= (Y
1t
, . . . , Y
kt
)
0
, which follows the VAR process
Y
t
= A
1
Y
t−1
+ . . . + A
p
Y
t−p
+ u
t
,
is nonstationary and integrated of order one (∼ I(1)). This means that the VAR process
Y
t
is not stable, i.e. det(I
k
− A
1
z − A
p
z
p
) is equal to zero for some z ≤ 1 (L¨utkepohl,
2006), and that each component ΔY
it
of ΔY
t
= (Y
t
− Y
t−1
) is stationary (I(0)), that
is it has timeinvariant means, variances and covariance structure. A linear combination
between between the elements of Y
t
is called a cointegrating relationship if there is a linear
combination c
1
Y
1t
+ . . . + c
k
Y
kt
which is stationary (I(0)).
If the VAR process is unstable and cointegrating relationships are present, it is more appropriate (Lütkepohl, 2006; Johansen, 2006) to estimate the following reparametrization of the VAR model, called the Vector Error Correction Model (VECM):
$$\Delta Y_t = F_1 \Delta Y_{t-1} + \ldots + F_{p-1} \Delta Y_{t-p+1} - G Y_{t-p} + u_t, \quad (12)$$
where $F_i = -(I_k - A_1 - \ldots - A_i)$, for $i = 1, \ldots, p-1$, and $G = I_k - A_1 - \ldots - A_p$. The $(k \times k)$ matrix $G$ has rank $r$, and thus $G$ can be written as $HC$ with $H$ and $C'$ of dimension $(k \times r)$ and of rank $r$. $C \equiv [c_1, \ldots, c_r]'$ is called the cointegrating matrix.
Let $\tilde{C}$, $\tilde{H}$, and $\tilde{F}_i$ be the maximum likelihood estimators of $C$, $H$, and $F_i$ according to Johansen's (1988, 1991) approach. Then the asymptotic distribution of $\tilde{\Sigma}_u$, the maximum likelihood estimator of the covariance matrix of $u_t$, is:
$$\sqrt{T}\,[\mathrm{vech}(\tilde{\Sigma}_u) - \mathrm{vech}(\Sigma_u)] \xrightarrow{d} N(0,\, 2 D_k^+ (\Sigma_u \otimes \Sigma_u)(D_k^+)'), \quad (13)$$
which is equivalent to equation (11) (see it again for the definition of the various operators).
Thus, it turns out that the asymptotic distribution of the maximum likelihood estimator $\tilde{\Sigma}_u$ is the same as that of the OLS estimator $\hat{\Sigma}_u$ in the case of a stable VAR. The method described above for testing zero partial correlations among residuals can therefore be applied straightforwardly to cointegrated data: the model is estimated as a VECM using Johansen's (1988, 1991) approach, correlations are tested exploiting the asymptotic distribution of $\tilde{\Sigma}_u$, and the model can finally be parameterized back into its VAR form of equation (3).
2.4. Summary of the search procedure
The graphical causal models approach to SVAR identification, which we suggest in the case of Gaussian and linear processes, can be summarized as follows.

Step 1 Estimate the VAR model $Y_t = A_1 Y_{t-1} + \ldots + A_p Y_{t-p} + u_t$, with the usual specification tests for normality, zero autocorrelation of the residuals, lag order, and unit roots (see Lütkepohl, 2006). If the hypothesis of nonstationarity is rejected, estimate the VAR model via OLS (equivalent to maximum likelihood estimation under the assumption of normality of the errors). If unit root tests do not reject I(1) nonstationarity in the data, specify the model as a VECM, testing for the presence of cointegrating relationships. If tests suggest the presence of cointegrating relationships, estimate the model as a VECM; if cointegration is rejected, estimate the VAR model in first differences.
Step 2 Run tests for zero partial correlations between the elements $u_{1t}, \ldots, u_{kt}$ of $u_t$ using the Wald statistics based on the asymptotic distribution of the covariance matrix of $u_t$. Note that not all possible partial correlations $\rho(u_{it}, u_{jt} \mid u_{ht}, \ldots)$ need to be tested, but only those necessary for step 3.
Step 3 Apply a causal search algorithm to recover the causal structure among $u_{1t}, \ldots, u_{kt}$, which is equivalent to the causal structure among $Y_{1t}, \ldots, Y_{kt}$ (cf. section 1.2 and see Moneta 2003). In the case of an acyclic (no feedback loops) and causally sufficient (no latent variables) structure, the suggested algorithm is the PC algorithm of Spirtes et al. (2000, pp. 84-85). Moneta (2008) suggested a few modifications to the PC algorithm in order to make the orientation of edges compatible with as many conditional independence tests as possible. This increases the computational time of the search algorithm, but considering that VAR models deal with a small number of time series variables (rarely more than six to eight; see Bernanke et al. 2005), this slowing down does not create a serious concern in this context. Table 1 reports the modified PC algorithm. In the case of an acyclic structure without causal sufficiency (i.e. possibly including latent variables), the suggested algorithm is FCI (Spirtes et al. 2000, pp. 144-145). In the case of no latent variables but in the presence of feedback loops, the suggested algorithm is CCD (Richardson and Spirtes, 1999). There is no algorithm in the literature which is consistent for search when both latent variables and feedback loops may be present. If the goal of the study is only impulse response analysis (i.e. tracing out the effects of the structural shocks $\varepsilon_{1t}, \ldots, \varepsilon_{kt}$ on $Y_t, Y_{t-1}, \ldots$) and neither contemporaneous feedbacks nor latent variables can be excluded a priori, a possible solution is to apply only steps (A) and (B) of the PC algorithm. If the resulting set of possible causal structures (represented by an undirected graph) contains a manageable number of elements, one can study the characteristics of the impulse response functions which are robust across all the possible causal structures, where the presence of both feedbacks and latent variables is allowed (Moneta, 2004).
Step 4 Calculate structural coefficients and impulse response functions. If the output of Step 3 is a set of causal structures, run a sensitivity analysis to investigate the robustness of the conclusions under the different possible causal structures. Bootstrap procedures may also be applied to determine which is the most reliable causal order (see simulations and applications in Demiralp et al., 2008).
3. Identification via independent component analysis

The methods considered in the previous section use tests for zero partial correlation on the VAR residuals to obtain (partial) information about the contemporaneous structure in an SVAR model with Gaussian shocks. In this section we show how non-Gaussian and independent shocks can be exploited for model identification by using the statistical method of Independent Component Analysis (ICA; see Comon, 1994; Hyvärinen et al., 2001). The method is again based on the VAR residuals $u_t$, which can be obtained as in the Gaussian case by estimating the VAR model using, for example, ordinary least squares or least absolute deviations, and can be tested for non-Gaussianity using any normality test (such as the Shapiro-Wilk or Jarque-Bera test).
To motivate, we note that, from equations (3) and (4) (with matrix $\Gamma_0$) or the Cholesky factorization in section 1.2 (with matrix $PD^{-1}$), the VAR disturbances $u_t$ and the structural shocks $\varepsilon_t$ are connected by
$$u_t = \Gamma_0^{-1} \varepsilon_t = PD^{-1} \varepsilon_t \quad (14)$$
with square matrices $\Gamma_0$ and $PD^{-1}$, respectively. Equation (14) has two important properties: First, the vectors $u_t$ and $\varepsilon_t$ are of the same length, meaning that there are as many residuals as structural shocks. Second, the residuals $u_t$ are linear mixtures of the shocks $\varepsilon_t$, connected by the 'mixing matrix' $\Gamma_0^{-1}$. This resembles the ICA model, when placing certain assumptions on the shocks $\varepsilon_t$.
In short, the ICA model is given by $x = As$, where $x$ are the mixed components, $s$ the independent, non-Gaussian sources, and $A$ a square invertible mixing matrix (meaning that there are as many mixtures as independent components). Given samples from the mixtures $x$, ICA estimates the mixing matrix $A$ and the independent components $s$ by linearly transforming $x$ in such a way that the dependencies among the components of $s$ are minimized. The solution is unique up to ordering, sign, and scaling (Comon, 1994; Hyvärinen et al., 2001).
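The ICA estimation itself can be sketched with a bare-bones symmetric FastICA (tanh nonlinearity, whitening by eigendecomposition). This is a minimal illustration under our own implementation choices, not the exact estimator used in the literature cited above, and all names are ours:

```python
import numpy as np

def fast_ica(X, n_iter=200, seed=0):
    """Minimal symmetric FastICA (tanh nonlinearity) for the model x = A s.

    X: (T x k) mixed observations (e.g. VAR residuals u_t). Returns
    (W_total, S_hat), where S_hat = Xc @ W_total.T are the estimated
    independent components, unique only up to ordering, sign, and scaling.
    """
    T, k = X.shape
    Xc = X - X.mean(axis=0)
    # whiten via eigendecomposition of the covariance matrix
    d, E = np.linalg.eigh(Xc.T @ Xc / T)
    K = E @ np.diag(d ** -0.5) @ E.T          # symmetric whitening matrix
    Z = Xc @ K.T                              # whitened data, cov = I
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(k, k))
    for _ in range(n_iter):
        WZ = Z @ W.T                          # current components (T x k)
        G = np.tanh(WZ)
        Gp = 1.0 - G ** 2                     # derivative of tanh
        # fixed-point update: w <- E[z g(w'z)] - E[g'(w'z)] w  (row-wise)
        W = (G.T @ Z) / T - np.diag(Gp.mean(axis=0)) @ W
        # symmetric decorrelation: W <- (W W')^{-1/2} W
        d2, E2 = np.linalg.eigh(W @ W.T)
        W = E2 @ np.diag(d2 ** -0.5) @ E2.T @ W
    return W @ K, Z @ W.T
```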
By comparing the ICA model $x = As$ and equation (14), we see a one-to-one correspondence of the mixtures $x$ to the residuals $u_t$ and of the independent components $s$ to the shocks $\varepsilon_t$. Thus, to be able to apply ICA, we need to assume that the shocks are non-Gaussian and mutually independent. We want to emphasize that no specific non-Gaussian distribution is assumed for the shocks, only that they cannot be Gaussian.^1 For the shocks to be mutually independent, their joint distribution has to factorize into the product of the marginal distributions. In the non-Gaussian setting, this implies zero partial correlation, but the converse is not true (as opposed to the Gaussian case, where the two statements are equivalent). Thus, for non-Gaussian distributions, independence is a much stronger requirement than uncorrelatedness.
Under the assumption that the shocks $\varepsilon_t$ are non-Gaussian and independent, equation (14) follows exactly the ICA model, and applying ICA to the VAR residuals $u_t$ yields a unique solution (up to ordering, sign, and scaling) for the mixing matrix $\Gamma_0^{-1}$ and the

1. Actually, the requirement is that at most one of the residuals can be Gaussian.

Table 1: Search algorithm (adapted, with modifications, from the PC algorithm of Spirtes et al. (2000: 84-85)). Under the assumption of Gaussianity, conditional independence is tested by zero partial correlation tests.
(A): (connect everything):
Form the complete undirected graph G on the vertex set u
1t
, . . . , u
kt
so that each
vertex is connected to any other vertex by an undirected edge.
(B)(cut some edges):
n = 0
repeat :
repeat :
select an ordered pair of variables u
ht
and u
it
that are adjacent in G
such that the number of variables adjacent to u
ht
is equal or greater
than n + 1. Select a set S of n variables adjacent to u
ht
such that
u
ti
/∈ S. If u
ht
⊥⊥ u
it
S delete edge u
ht
— u
it
from G;
until all ordered pairs of adjacent variables u
ht
and u
it
such that the
number of variables adjacent to u
ht
is equal or greater than n + 1 and all sets
S of n variables adjacent to u
ht
such that u
it
/∈ S have been checked to see if
u
ht
⊥⊥ u
it
S;
n = n + 1;
until for each ordered pair of adjacent variables u
ht
, u
it
, the number of adjacent
variables to u
ht
is less than n + 1;
(C)(build colliders):
for each triple of vertices u
ht
, u
it
, u
jt
such that the pair u
ht
, u
it
and the pair u
it
, u
jt
are
each adjacent in G but the pair u
ht
, u
jt
is not adjacent in G, orient u
ht
— u
it
— u
jt
as
u
ht
−→ u
it
<— u
jt
if and only if u
it
does not belong to any set of variables S such
that u
ht
⊥⊥ u
jt
S;
(D)(direct some other edges):
repeat :
if u
at
−→ u
bt
, u
bt
and u
ct
are adjacent, u
at
and u
ct
are not adjacent and
u
bt
belongs to every set S such that u
at
⊥⊥ u
ct
S, then orient u
bt
— u
ct
as
u
bt
−→ u
ct
;
if there is a directed path from u
at
to u
bt
, and an edge between u
at
and u
bt
,
then orient u
at
— u
bt
as u
at
−→ u
bt
;
until no more edges can be orien
ted.
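The edge-cutting stage (B) of the algorithm in Table 1 can be sketched in Python, testing zero partial correlations with Fisher's z. This is a rough illustration of ours, not the authors' implementation; the function names and the significance level are arbitrary.

```python
import numpy as np
from itertools import combinations
from math import erf, sqrt, log

def partial_corr(R, h, i, S):
    """Partial correlation of variables h and i given the set S,
    obtained by inverting the relevant block of the correlation matrix."""
    idx = [h, i] + list(S)
    P = np.linalg.inv(R[np.ix_(idx, idx)])
    return -P[0, 1] / sqrt(P[0, 0] * P[1, 1])

def skeleton(U, alpha=0.05):
    """Stage (B): start from the complete graph on the residual series
    (the columns of U) and delete the edge h - i whenever zero partial
    correlation given some adjacency set S cannot be rejected."""
    n, k = U.shape
    R = np.corrcoef(U, rowvar=False)
    adj = {h: set(range(k)) - {h} for h in range(k)}
    cond = 0                                  # size of the conditioning set S
    while any(len(adj[h]) - 1 >= cond for h in range(k)):
        for h in range(k):
            for i in list(adj[h]):
                for S in combinations(adj[h] - {i}, cond):
                    r = partial_corr(R, h, i, S)
                    # Fisher's z test of H0: rho_{hi.S} = 0
                    z = 0.5 * log((1 + r) / (1 - r))
                    stat = sqrt(n - len(S) - 3) * abs(z)
                    pval = 2 * (1 - 0.5 * (1 + erf(stat / sqrt(2))))
                    if pval > alpha:
                        adj[h].discard(i)
                        adj[i].discard(h)
                        break
        cond += 1
    return adj
```

On residuals where, say, u_3 is driven by two independent series u_1 and u_2, the procedure keeps the edges u_1 — u_3 and u_2 — u_3 and cuts u_1 — u_2.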
independent components ε_t (i.e. the structural shocks in our case). However, the ambiguities of ICA make it hard to directly interpret the shocks found by ICA, since without further analysis we cannot relate the shocks directly to the measured variables.
Hence, we assume that the residuals u_t follow a linear non-Gaussian acyclic model (Shimizu et al., 2006), which means that the contemporaneous structure is represented by a DAG (directed acyclic graph). In particular, the model is given by

    u_t = B_0 u_t + ε_t    (15)

with a matrix B_0 whose diagonal elements are all zero and which, if permuted according to the causal order, is strictly lower triangular. By rewriting equation (15) we see that

    Γ_0 = I − B_0.    (16)
From this equation it follows that the matrix B_0 describes the contemporaneous structure of the variables Y_t in the SVAR model, as shown in equation (6). Thus, if we can identify the matrix Γ_0, we also obtain the matrix B_0 for the contemporaneous effects. As pointed out above, the matrix Γ_0^{-1} (and hence Γ_0) can be estimated using ICA up to ordering, scaling, and sign. With the restriction that B_0 represents an acyclic system, we can resolve these ambiguities and fully identify the model. For simplicity, let us assume that the variables are arranged according to a causal ordering, so that the matrix B_0 is strictly lower triangular. From equation (16) it then follows that the matrix Γ_0 is lower triangular with all ones on the diagonal. Using this information, the ambiguities of ICA can be resolved in the following way.
The lower triangularity of B_0 allows us to find the unique permutation of the rows of Γ_0 which yields all nonzero elements on the diagonal of Γ_0, meaning that we replace the matrix Γ_0 with Q_1 Γ_0, where Q_1 is the uniquely determined permutation matrix. Finding this permutation resolves the ordering ambiguity of ICA and links the shocks ε_t to the components of the residuals u_t in a one-to-one manner. The sign and scaling ambiguity is now easy to fix: simply divide each row of Γ_0 (the row-permuted version from above) by the corresponding diagonal element, yielding all ones on the diagonal, as implied by equation (16). This ensures that the connection strength of the shock ε_t on the residual u_t is fixed to one in our model (equation (15)).
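Assuming, as in the text, that some causal order makes B_0 strictly lower triangular, the two steps can be sketched as follows. This is our illustration, not the paper's code; the brute-force search over row permutations is only viable for a small number of variables k.

```python
import numpy as np
from itertools import permutations

def fix_ica_ambiguities(Gamma0_ica):
    """Given an ICA estimate of Gamma_0 with arbitrarily ordered, scaled
    and signed rows: (i) permute the rows so that the diagonal has no
    zeros (unique under acyclicity), then (ii) divide each row by its
    diagonal entry so the diagonal is all ones, as implied by
    Gamma_0 = I - B_0."""
    k = Gamma0_ica.shape[0]
    # pick the row permutation whose smallest diagonal entry (in
    # absolute value) is largest -- under acyclicity only one
    # permutation gives a diagonal free of zeros
    best = max(permutations(range(k)),
               key=lambda p: np.min(np.abs(Gamma0_ica[list(p), np.arange(k)])))
    G = Gamma0_ica[list(best), :]
    G = G / np.diag(G)[:, None]        # rescale rows: unit diagonal
    B0 = np.eye(k) - G                 # contemporaneous effects
    return G, B0
```

Scrambling the rows of a known Γ_0 with an arbitrary permutation and diagonal rescaling and then applying the function recovers the original matrix exactly.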
For the general case where B_0 is not arranged in the causal order, the above arguments for solving the ambiguities still apply. Furthermore, we can find the causal order of the contemporaneous variables by performing simultaneous row and column permutations on Γ_0, yielding the matrix closest to lower triangular, in particular Γ̃_0 = Q_2 Γ_0 Q_2' with an appropriate permutation matrix Q_2. In case none of these permutations leads to a close-to-lower-triangular matrix, a warning is issued.
Essentially, the assumption of acyclicity allows us to uniquely connect the structural shocks ε_t to the components of u_t and fully identify the contemporaneous structure. Details of the procedure can be found in (Shimizu et al., 2006; Hyvärinen et al., 2010). In the sense of the Cholesky factorization of the covariance matrix explained in Section 1 (with PD^{-1} = Γ_0^{-1}), full identifiability means that a causal order among the contemporaneous variables can be determined.
In addition to yielding full identification, a further benefit of using the ICA-based procedure when shocks are non-Gaussian is that it does not rely on the faithfulness assumption, which was necessary in the Gaussian case.
We note that there are many ways of exploiting non-Gaussian shocks for model identification as alternatives to directly using ICA. One such approach was introduced by Shimizu et al. (2009). Their method relies on iteratively finding an exogenous variable and regressing out its influence on the remaining variables. An exogenous variable is characterized by being independent of the residuals when regressing any other variable in the model on it. Starting from the model in equation (15), this procedure returns a causal ordering of the variables u_t, and then the matrix B_0 can be estimated using the Cholesky approach.
One relatively strong assumption of the above methods is the acyclicity of the contemporaneous structure. In (Lacerda et al., 2008) an extension was proposed where feedback loops are allowed. In terms of the matrix B_0 this means that it is not restricted to being lower triangular (in an appropriate ordering of the variables). While in general this model is not identifiable, because we cannot uniquely match the shocks to the residuals, Lacerda et al. (2008) showed that the model is identifiable when assuming stability of the generating model in (15) (the absolute value of the biggest eigenvalue of B_0 is smaller than one) and disjoint cycles.
Another restriction of the above model is that all relevant variables must be included in the model (causal sufficiency). Hoyer et al. (2008b) extended the above model by allowing for hidden variables. This leads to an overcomplete-basis ICA model, meaning that there are more independent non-Gaussian sources than observed mixtures. While there exist methods for estimating overcomplete-basis ICA models, those methods which achieve the required accuracy do not scale well. Additionally, the solution is again only unique up to ordering, scaling, and sign, and when including hidden variables the ordering ambiguity cannot be resolved and in some cases leads to several observationally equivalent models, just as in the cyclic case above.
We note that it is also possible to combine the approach of section 2 with that described here. That is, if some of the shocks are Gaussian or close to Gaussian, it may be advantageous to use a combination of constraint-based search and non-Gaussianity-based search. Such an approach was proposed in Hoyer et al. (2008a). In particular, the proposed method does not make any assumptions on the distributions of the VAR residuals u_t. Basically, the PC algorithm (see Section 2) is run first, followed by utilization of whatever non-Gaussianity there is to further direct edges. Note that there is no need to know in advance which shocks are non-Gaussian, since finding such shocks is part of the algorithm.

Finally, we need to point out that while the basic ICA-based approach does not require the faithfulness assumption, the extensions discussed at the end of this section do.
4. Nonparametric setting
4.1. Theory
Linear systems dominate VAR, SVAR, and more generally, multivariate time series models
in econometrics. However, it is not always the case that we know how a variable X may
cause another variable Y . It may be the case that we have little or no a priori knowledge
about how Y depends on X. In its most general form, we want to know whether X is independent of Y conditional on the set of potential graphical parents Z, i.e.

    H_0 : Y ⊥⊥ X | Z,    (17)

where Y, X, Z is a set of time series variables. Thereby, we do not per se require an a priori specification of how Y possibly depends on X. However, constraint-based algorithms typically specify conditional independence in a very restrictive way. In continuous settings, they simply test for nonzero partial correlations, or in other words, for linear (in)dependencies. Hence, these algorithms will fail whenever the data generation process (DGP) includes nonlinear causal relations.
In search of a more general specification of conditional independence, Chlaß and Moneta (2010) suggest a procedure based on nonparametric density estimation. Therein, neither the type of dependency between Y and X, nor the probability distributions of the variables need to be specified. The procedure exploits the fact that if Y and X are independent conditional on Z, the density of Y conditional on both X and Z equals the density of Y conditional on Z alone. Hence, hypothesis test (17) translates into:

    H_0 : f(Y, X, Z) / f(X, Z) = f(Y, Z) / f(Z).    (18)

If we define h_1(·) := f(Y, X, Z) f(Z) and h_2(·) := f(Y, Z) f(X, Z), we have:

    H_0 : h_1(·) = h_2(·).    (19)

We estimate h_1 and h_2 using a kernel smoothing approach (see Wand and Jones, 1995, ch. 4). Kernel smoothing has the outstanding property that it is insensitive to autocorrelation phenomena and is, therefore, immediately applicable to longitudinal or time series settings (Welsh et al., 2002).
In particular, we use a so-called product kernel estimator:

    ĥ_1(x, y, z; b) = (1 / (n² b^{m+d})) {Σ_{i=1}^n K((X_i − x)/b) K((Y_i − y)/b) K_p((Z_i − z)/b)} {Σ_{i=1}^n K_p((Z_i − z)/b)}

    ĥ_2(x, y, z; b) = (1 / (n² b^{m+d})) {Σ_{i=1}^n K((X_i − x)/b) K_p((Z_i − z)/b)} {Σ_{i=1}^n K((Y_i − y)/b) K_p((Z_i − z)/b)},    (20)

where X_i, Y_i, and Z_i are the i-th realization of the respective time series, K denotes the kernel function, b indicates a scalar bandwidth parameter, and K_p represents a product kernel.²
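Equation (20) can be transcribed directly in numpy, using the kernel K(u) = (3 − u²)φ(u)/2 and the rule-of-thumb bandwidth b = n^{−1/8.5} from footnote 2. The function names are ours; the common normalizing constant cancels when ĥ_1 is compared with ĥ_2.

```python
import numpy as np

def K(u):
    """Kernel of footnote 2: K(u) = (3 - u^2) * phi(u) / 2."""
    return 0.5 * (3.0 - u**2) * np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def Kp(U, b):
    """Product kernel K_p over the d columns of an (n, d) array."""
    return np.prod(K(U / b), axis=1)

def h_hats(X, Y, Z, x, y, z, b=None):
    """Estimates of h1 = f(y,x,z) f(z) and h2 = f(y,z) f(x,z) at the
    point (x, y, z), following equation (20); X and Y are (n,) samples,
    Z is (n, d)."""
    n, d = Z.shape
    if b is None:
        b = n ** (-1 / 8.5)                 # rule-of-thumb bandwidth
    m = 2                                   # dimension of (X, Y)
    kx, ky = K((X - x) / b), K((Y - y) / b)
    kz = Kp(Z - z, b)
    c = 1.0 / (n**2 * b**(m + d))           # common factor, cancels in tests
    h1 = c * np.sum(kx * ky * kz) * np.sum(kz)
    h2 = c * np.sum(kx * kz) * np.sum(ky * kz)
    return h1, h2
```

As a sanity check, if the Y-sample is degenerate at the evaluation point y, every K((Y_i − y)/b) equals K(0), and the two estimates coincide algebraically.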
So far, we have shown how to estimate h_1 and h_2. To see whether these are different, we require some similarity measure between both conditional densities. There are different ways to measure the distance between a product of densities:
2. I.e. K_p((Z_i − z)/b) = ∏_{j=1}^d K((Z_i^j − z^j)/b). For our simulations (see next section) we choose the kernel K(u) = (3 − u²)φ(u)/2, with φ(u) the standard normal probability density function. We use a "rule-of-thumb" bandwidth: b = n^{−1/8.5}.
(i) The weighted Hellinger distance proposed by Su and White (2008):

    d_H = (1/n) Σ_{i=1}^n {1 − √( h_2(X_i, Y_i, Z_i) / h_1(X_i, Y_i, Z_i) )}² a(X_i, Y_i, Z_i),    (21)

where a(·) is a nonnegative weighting function. Both the weighting function a(·) and the resulting test statistic are specified in Su and White (2008).
(ii) The Euclidean distance proposed by Szekely and Rizzo (2004) in their 'energy test':

    d_E = (1/n) Σ_{i=1}^n Σ_{j=1}^n ||h_1^i − h_2^j|| − (1/2n) Σ_{i=1}^n Σ_{j=1}^n ||h_1^i − h_1^j|| − (1/2n) Σ_{i=1}^n Σ_{j=1}^n ||h_2^i − h_2^j||,    (22)

where h_1^i = h_1(X_i, Y_i, Z_i), h_2^i = h_2(X_i, Y_i, Z_i), and ||·|| is the Euclidean norm.³
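Since the ĥ's are scalar values here, the Euclidean norms in (22) reduce to absolute values. A direct numpy sketch of ours:

```python
import numpy as np

def energy_distance(h1, h2):
    """Euclidean 'energy' distance of equation (22) between the n scalar
    values h1_i = h1(X_i, Y_i, Z_i) and h2_i = h2(X_i, Y_i, Z_i)."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    n = len(h1)
    cross = np.abs(h1[:, None] - h2[None, :]).sum()      # sum |h1_i - h2_j|
    within1 = np.abs(h1[:, None] - h1[None, :]).sum()    # sum |h1_i - h1_j|
    within2 = np.abs(h2[:, None] - h2[None, :]).sum()    # sum |h2_i - h2_j|
    return cross / n - within1 / (2 * n) - within2 / (2 * n)
```

The statistic is zero when the two samples coincide and positive when they differ, which is what makes it usable as a test statistic under the bootstrap described below.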
Given these test statistics and their distributions, we compute the type-I error, or p-value, of our test problem (19). If Z = ∅, the tests are available in the R packages energy and cramer. The Hellinger distance is not suitable here, since one can only test for Z ≠ ∅.

For Z ≠ ∅, our test problem (19) requires higher-dimensional kernel density estimation. The more dimensions, i.e. the more elements in Z, the scarcer the data, and the greater the distance between two subsequent data points. This so-called curse of dimensionality strongly reduces the accuracy of a nonparametric estimation (Yatchew, 1998). To circumvent this problem, we calculate the type-I errors for Z ≠ ∅ by a local bootstrap procedure, as described in Su and White (2008, pp. 840-841) and Paparoditis and Politis (2000, pp. 144-145). The local bootstrap draws repeatedly with replacement from the sample and counts how many times the bootstrap statistic is larger than the test statistic of the entire sample. Details on the local bootstrap procedure can be found in the appendix.

Now, let us see how this procedure fares in those time series settings where other testing procedures failed: the case of nonlinear time series.
4.2. Simulation Design
Our simulation design should allow us to see how the search procedures of 4.1 perform in terms of size and power. To identify size properties (type-I error), H_0 (19) must hold everywhere. We call data generation processes for which H_0 holds everywhere size-DGPs. We induce a system of time series {V_{1,t}, V_{2,t}, V_{3,t}}_{t=1}^n whereby each time series follows an autoregressive process AR(1) with a_1 = 0.5 and error term e_t ∼ N(0, 1), for instance V_{1,t} = a_1 V_{1,t−1} + e_{V_1,t}. These time series may cause each other as illustrated in Fig. 1. Therein, V_{1,t} ⊥⊥ V_{2,t} | V_{1,t−1}, since V_{1,t−1} d-separates V_{1,t} from V_{2,t}, while V_{2,t} ⊥⊥ V_{3,s} for any t and s. Hence, the set of variables Z, conditional on which two sets of variables X and
3. An alternative Euclidean distance is proposed by Baringhaus and Franz (2004) in their Cramer test. This distance turns out to be d_E/2. The only substantial difference from the distance proposed in (ii) lies in the method to obtain the critical values (see Baringhaus and Franz 2004).
[Figure 1 displays the time series DAG over the vertices V_{1,t−1}, V_{2,t−1}, V_{3,t−1} and V_{1,t}, V_{2,t}, V_{3,t}.]

Figure 1: Time series DAG.
Y are independent of each other, contains zero elements, i.e. V_{2,t} ⊥⊥ V_{3,t−1}; contains one element, i.e. V_{1,t} ⊥⊥ V_{2,t} | V_{1,t−1}; or contains two elements, i.e. V_{1,t} ⊥⊥ V_{2,t} | V_{1,t−1}, V_{3,t−1}.
In our simulations, we vary two aspects. The first aspect is the functional form of the causal dependency. To systematically vary nonlinearity and its impact, we characterize the causal relation between, say, V_{1,t−1} and V_{2,t} in a polynomial form, i.e. via V_{2,t} = f(V_{1,t−1}) + e, where f = Σ_{j=0}^p b_j V_{1,t−1}^j. Herein, j reflects the degree of nonlinearity, while b_j captures the impact nonlinearity exerts. For polynomials of any degree, we set only b_p ≠ 0. An additive error term e completes the specification.

The second aspect is the number of variables in Z conditional on which X and Y can be independent. Either zero, one, or maximally two variables may form the set Z = {Z_1, . . . , Z_d} of conditioned variables; hence Z has cardinality #Z ∈ {0, 1, 2}.
To identify power properties, H_0 must not hold anywhere, i.e. X ⊥̸⊥ Y | Z. We call data generation processes where H_0 does not hold anywhere power-DGPs. Such DGPs can be induced by (i) a direct path between X and Y which does not include Z, (ii) a common cause for X and Y which is not an element of Z, or (iii) a "collider" between X and Y belonging to Z.⁴ As before, we vary the functional form f of these causal paths polynomially, where again only b_p ≠ 0. Third, we investigate different cardinalities #Z ∈ {0, 1, 2} of the set Z conditional on which X and Y become dependent.
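One such DGP can be sketched as follows (our reading of Figure 1: each series is AR(1) in its own lag, and V_{1,t−1} additionally enters V_{2,t} through the polynomial term; coefficients other than a_1 = 0.5 are illustrative):

```python
import numpy as np

def simulate_dgp(n, p=1, b_p=0.5, a1=0.5, burn=100, seed=0):
    """Simulate {V1, V2, V3}: each series is AR(1) with coefficient a1
    and N(0,1) errors; additionally V1_{t-1} affects V2_t through the
    polynomial term b_p * V1_{t-1}**p (the degree p tunes nonlinearity).
    V3 stays causally disconnected from V1 and V2, so e.g.
    H0: V2_t independent of V3_{t-1} holds (a size-DGP)."""
    rng = np.random.default_rng(seed)
    T = n + burn
    e = rng.normal(size=(T, 3))
    V = np.zeros((T, 3))
    for t in range(1, T):
        V[t, 0] = a1 * V[t - 1, 0] + e[t, 0]
        V[t, 1] = a1 * V[t - 1, 1] + b_p * V[t - 1, 0] ** p + e[t, 1]
        V[t, 2] = a1 * V[t - 1, 2] + e[t, 2]
    return V[burn:]                    # drop burn-in, shape (n, 3)
```

Power-DGPs follow the same pattern, with the offending path (direct link, common cause, or collider) wired in so that the tested independence fails.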
4.3. Simulation Results
Let us start with #Z = 0, that is, H_0 : X ⊥⊥ Y. Table 2 reports our simulation results for both size and power DGPs. Rejection frequencies are reported for three different tests, for a theoretical level of significance of 0.05 and 0.1.

Take the first line of Table 2. For size DGPs, H_0 holds everywhere. A test performs accurately if it rejects H_0 in accordance with the respective theoretical significance level. We see that the energy test rejects H_0 slightly more often than it should (0.065 > 0.05; 0.122 > 0.1), whereas the Cramer test does not reject H_0 often enough (0.000 < 0.05, 0.000 < 0.1). In comparison to the standard parametric Fisher's z, we see that the latter rejects H_0 much more often than it should. The energy test keeps the type-I error most accurately. Contrary to both nonparametric tests, the parametric procedure leads us to suspect a lot more causal relationships than there actually are, if #Z = 0.
How well do these tests perform if H_0 does not hold anywhere? That is, how accurately do they reject H_0 if it is false (power-DGPs)? For linear time series, we see that the nonparametric energy test has nearly as much power as Fisher's z. For nonlinear time
4. An example of a collider is displayed in Figure 1: V_{2,t} forms a collider between V_{1,t−1} and V_{2,t−1}.
Table 2: Proportion of rejection of H_0 (no conditioned variables)

                              level of significance 5%       level of significance 10%
                              Energy   Cramer   Fisher       Energy   Cramer   Fisher
Size DGPs
S0.1 (ind. time series)       0.065    0.000    0.151        0.122    0.000    0.213
Power DGPs
P0.1 (time series linear)     0.959    0.308    0.999        0.981    0.462    1
P0.2 (time series quadratic)  0.986    0.255    0.432        0.997    0.452    0.521
P0.3 (time series cubic)      1        0.905    1            1        0.975    1
P0.4 (time series quartic)    1        0.781    0.647        1        0.901    0.709

Note: length of series (n) = 100; number of iterations = 1000.
series, the energy test clearly outperforms Fisher's z.⁵ As it did for size, the Cramer test generally underperforms in terms of power. Interestingly, its power appears to be higher for higher degrees of nonlinearity. In summary, if one wishes to test for marginal independence without any information on the type of a potential dependence, one would opt for the energy test. It has a size close to the theoretical significance level, and has power similar to a parametric specification.
Let us turn to #Z = 1, where H_0 : X ⊥⊥ Y | Z, for which the results are shown in Table 3. Starting with size DGPs, tests based on the Hellinger and Euclidean distances slightly underreject H_0, whereas for the highest polynomial degree, the Hellinger test strongly overrejects H_0. The parametric Fisher's z slightly overrejects H_0 in case of linearity and, for higher degrees, starts to underreject H_0.
Table 3: Proportion of rejection of H_0 (one conditioned variable)

                              level of significance 5%            level of significance 10%
                              Hellinger   Euclid   Fisher         Hellinger   Euclid   Fisher
Size DGPs
S1.1 (time series linear)     0.035       0.035    0.062          0.090       0.060    0.103
S1.2 (time series quadratic)  0.040       0.020    0.048          0.065       0.035    0.104
S1.3 (time series cubic)      0.010       0.010    0.050          0.020       0.015    0.093
S1.4 (time series quartic)    0.13        0        0.023          0.2         0.1      0.054
Power DGPs
P1.1 (time series linear)     0.875       0.910    0.999          0.925       0.950    1
P1.2 (time series quadratic)  0.905       0.895    0.416          0.940       0.950    0.504
P1.3 (time series cubic)      0.990       1        1              1           1        1
P1.4 (time series quartic)    0.84        0.995    0.618          0.91        0.995    0.679

Note: n = 100; number of iterations = 200; number of bootstrap iterations (I) = 200.
Turning to power DGPs, Fisher's z suffers a dramatic loss in power for those polynomial degrees which depart most from linearity, i.e. quadratic and quartic relations. Nonparametric tests which do not require linearity have high power in absolute terms, and nearly twice as much as compared to Fisher's z. The power properties of the nonparametric procedures indicate that our local bootstrap succeeds in mitigating the curse of dimensionality. In sum, nonparametric tests exhibit good power properties for #Z = 1, whereas Fisher's z would fail to discover underlying quadratic or quartic relationships in some 60% and 40% of the cases, respectively.

5. For cubic time series, Fisher's z performs as well as the energy test does. This may be due to the fact that a cubic relation resembles a line more than other polynomial specifications do.
The results for #Z = 2 are presented in Table 4. We find that both nonparametric tests have a size which is notably smaller than the theoretical significance level we induce. Hence, both have a strong tendency to underreject H_0. Turning to power DGPs, we find that the Euclidean test still has over 90% power to correctly reject H_0. For those polynomial degrees which depart most from linearity, i.e. quadratic and quartic, the Euclidean test has three times as much power as Fisher's z. However, the Hellinger test performs even worse than Fisher's z. Here, it may be the curse of dimensionality which starts to show an impact.
Table 4: Proportion of rejection of H_0 (two conditioned variables)

                              level of significance 5%            level of significance 10%
                              Hellinger   Euclid   Fisher         Hellinger   Euclid   Fisher
Size DGPs
S2.1 (time series linear)     0.006       0.020    0.050          0.033       0.046    0.102
S2.2 (time series quadratic)  0.000       0.010    0.035          0.000       0.040    0.087
S2.3 (time series cubic)      0           0.007    0.056          0           0.007    0.109
S2.4 (time series quartic)    0.006       0        0.031          0.013       0        0.067
Power DGPs
P2.1 (time series linear)     0.28        0.92     1              0.4         0.973    1
P2.2 (time series quadratic)  0.170       0.960    0.338          0.250       0.980    0.411
P2.3 (time series cubic)      0.667       1        1              0.754       1        1
P2.4 (time series quartic)    0.086       0.946    0.597          0.133       0.966    0.665

Note: n = 100; number of iterations = 150; number of bootstrap iterations (I) = 100.
To sum up, we can say that both marginal independencies and higher-dimensional conditional independencies, i.e. #Z = 1, 2, are best tested for using Euclidean tests. The Hellinger test seems to be more affected by the curse of dimensionality. We see that our local bootstrap procedure mitigates the latter, but we admit that the number of variables our nonparametric procedure can deal with is very small. Here, it might be promising to opt for semiparametric procedures (Chu and Glymour, 2008), which combine parametric and nonparametric approaches, rather than fully nonparametric ones.
5. Conclusions
The difficulty of learning causal relations from passive, that is non-experimental, observations is one of the central challenges of econometrics. Traditional solutions involve the distinction between the structural and the reduced-form model. The former is meant to formalize the unobserved data generating process, whereas the latter aims to describe a simpler transformation of that process. The structural model is articulated hinging on a priori economic theory. The reduced-form model is formalized in such a way that it can be estimated directly from the data. In this paper, we have presented an approach to identify the structural model which minimizes the role of a priori economic theory and emphasizes the need for an appropriate and rich statistical model of the data. Graphical causal models, independent component analysis, and tests of conditional independence are the tools we propose for structural identification in vector autoregressive models. We conclude with an overview of some important issues which are left open in this domain.
1. Specification of the statistical model. Data-driven procedures for SVAR identification depend upon the specification of the (reduced-form) VAR model. Therefore, it is important to make sure that the estimated VAR model is an accurate description of the dynamics of the included variables (whereas the contemporaneous structure is intentionally left out, as seen in section 1.2). The usual criterion for accuracy is to check that the model's estimated residuals conform to white noise processes (although serial independence of residuals is not a sufficient criterion for model validation). This implies stable dependencies captured by the relationships among the modeled variables, and an unsystematic noise. It may be the case, as in many empirical applications, that different VAR specifications pass the model-checking tests equally well. For example, a VAR with Gaussian errors and p lags may fit the data as well as a VAR with non-Gaussian errors and q lags, and these two specifications justify two different causal search procedures. So far, we do not know how to adjudicate among alternative and seemingly equally accurate specifications.
2. Background knowledge and assumptions. Search algorithms are based on different assumptions, such as, for example, causal sufficiency, acyclicity, the Causal Markov Condition, Faithfulness, and/or the existence of independent components. Background knowledge could possibly justify some of these assumptions and reject others. For example, institutional or theoretical knowledge about an economic process might inform us that Faithfulness is a plausible assumption in some contexts rather than in others, or instead, that one should expect feedback loops if data are collected at certain levels of temporal aggregation. Yet, if background information could inform us here, this might again provoke the problem of circularity mentioned at the outset of the paper.
3. Search algorithms in nonparametric settings. We have provided some information on
which nonparametric test procedures might be more appropriate in certain circumstances.
However, it is not clear which causal search algorithms are most eﬃcient in exploiting the
nonparametric conditional independence tests proposed in Section 4. The more variables
the search algorithm needs to be informed about at the same point of the search, the higher
the number of conditioned variables, and hence, the slower, or the more inaccurate, the
test.
4. Number of shocks and number of variables. To conserve degrees of freedom, SVARs rarely model more than six to eight time series variables (Bernanke et al., 2005, p. 388). It is an open question how the procedures for causal inference we reviewed can be applied to large-scale systems such as dynamic factor models (Forni et al., 2000).
5. Simulations and empirical applications. Graphical causal models for identifying SVARs, equivalent or similar to the search procedures described in section 2, have been applied to several sets of macroeconomic data (Swanson and Granger, 1997; Bessler and Lee, 2002; Demiralp and Hoover, 2003; Moneta, 2004; Demiralp et al., 2008; Moneta, 2008; Hoover et al., 2009). Demiralp and Hoover (2003) present Monte Carlo simulations to evaluate the performance of the PC algorithm for such an identification. There are no simulation results so far about the performance of the alternative tests on residual partial correlations presented in section 2.2. Moneta et al. (2010) applied independent component analysis, as described in section 3, to microeconomic US data on firms' expenditures on R&D and performance, as well as to macroeconomic US data on monetary policy and its effects on the aggregate economy. Hyvärinen et al. (2010) assess the performance of independent component analysis for identifying SVAR models. It is yet to be established how independent component analysis applied to SVARs fares compared to graphical causal models (based on the appropriate conditional independence tests) in non-Gaussian settings. Nonparametric tests of conditional independence, as those proposed in section 4, have been applied to test for Granger non-causality (Su and White, 2008), but there are not yet any applications where these test results inform a graphical causal search algorithm. Overall, there is a need for more empirical applications of the procedures described in this paper. Such applications will be useful to test, compare, and improve different search procedures, to suggest new problems, and to obtain new causal knowledge.
6. Appendix

6.1. Appendix 1 - Details of the bootstrap procedure from 4.1.

(1) Draw a bootstrap sample Z*_t (for t = 1, . . . , n) from the estimated kernel density f̂(z) = n^{-1} b^{-d} Σ_{t=1}^n K_p((Z_t − z)/b).

(2) For t = 1, . . . , n, given Z*_t, draw X*_t and Y*_t independently from the estimated kernel densities f̂(x | Z*_t) and f̂(y | Z*_t), respectively.

(3) Using X*_t, Y*_t, and Z*_t, compute the bootstrap statistic S*_n using one of the distances defined above.

(4) Repeat steps (1) to (3) I times to obtain I statistics {S*_{ni}}_{i=1}^I.

(5) The p-value is then obtained by

    p ≡ (Σ_{i=1}^I 1{S*_{ni} > S_n}) / I,

where S_n is the statistic obtained from the original data using one of the distances defined above, and 1{·} denotes an indicator function taking value one if the expression between brackets is true and zero otherwise.
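Steps (1)-(5) can be sketched compactly as follows. This is our simplification: the resampling uses Gaussian kernel noise, and X*, Y* are drawn via kernel weights in Z as a crude stand-in for exact draws from the estimated conditional densities; the statistic is any caller-supplied function of (X, Y, Z).

```python
import numpy as np

def local_bootstrap_pvalue(X, Y, Z, statistic, b, I=200, seed=0):
    """Local bootstrap p-value following steps (1)-(5). X, Y are (n,)
    samples, Z is (n, d); `statistic` maps (X, Y, Z) to a scalar."""
    rng = np.random.default_rng(seed)
    n = len(X)
    S_n = statistic(X, Y, Z)                     # statistic on the data
    count = 0
    for _ in range(I):
        # (1) draw Z* from the kernel density estimate of f(z):
        # resample data points and add b-scaled kernel noise
        idx = rng.integers(0, n, size=n)
        Z_star = Z[idx] + b * rng.normal(size=Z.shape)
        # (2) draw X* and Y* independently given each Z*, weighting
        # observations by their kernel distance to Z* in Z-space
        X_star = np.empty(n)
        Y_star = np.empty(n)
        for t in range(n):
            w = np.exp(-0.5 * ((Z - Z_star[t]) / b) ** 2).prod(axis=1)
            w /= w.sum()
            X_star[t] = X[rng.choice(n, p=w)] + b * rng.normal()
            Y_star[t] = Y[rng.choice(n, p=w)] + b * rng.normal()
        # (3)-(5) bootstrap statistic and exceedance count
        if statistic(X_star, Y_star, Z_star) > S_n:
            count += 1
    return count / I
```

Because X* and Y* are drawn independently given Z*, the bootstrap samples satisfy the null hypothesis by construction, so the exceedance frequency in step (5) estimates the p-value of the observed statistic.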
References
E. Baek and W. Brock. A general test for nonlinear Granger causality: Bivariate model. Discussion paper, Iowa State University and University of Wisconsin, Madison, 1992.
L. Baringhaus and C. Franz. On a new multivariate twosample test. Journal of Multivariate
Analysis, 88(1):190–206, 2004.
B. S. Bernanke. Alternative explanations of the money-income correlation. In Carnegie-Rochester Conference Series on Public Policy, volume 25, pages 49–99. Elsevier, 1986.
B.S. Bernanke, J. Boivin, and P. Eliasz. Measuring the Eﬀects of Monetary Policy: A Factor
Augmented Vector Autoregressive (FAVAR) Approach. Quarterly Journal of Economics,
120(1):387–422, 2005.
D. A. Bessler and S. Lee. Money and prices: US data 1869–1914 (a study with directed graphs). Empirical Economics, 27:427–446, 2002.
O. J. Blanchard and D. Quah. The dynamic effects of aggregate demand and supply disturbances. The American Economic Review, 79(4):655–673, 1989.
O. J. Blanchard and M. W. Watson. Are business cycles all alike? The American business
cycle: Continuity and change, 25:123–182, 1986.
N. Chlaß and A. Moneta. Can Graphical Causal Inference Be Extended to Nonlinear
Settings? EPSA Epistemology and Methodology of Science, pages 63–72, 2010.
T. Chu and C. Glymour. Search for additive nonlinear time series causal models. The
Journal of Machine Learning Research, 9:967–991, 2008.
P. Comon. Independent component analysis, a new concept? Signal processing, 36(3):
287–314, 1994.
S. Demiralp and K. D. Hoover. Searching for the causal structure of a vector autoregression.
Oxford Bulletin of Economics and Statistics, 65:745–767, 2003.
S. Demiralp, K. D. Hoover, and D. J. Perez. A bootstrap method for identifying and evaluating a structural vector autoregression. Oxford Bulletin of Economics and Statistics, 65:745–767, 2008.
M. Eichler. Granger causality and path diagrams for multivariate time series. Journal of
Econometrics, 137(2):334–353, 2007.
J. Faust and E. M. Leeper. When do longrun identifying restrictions give reliable results?
Journal of Business & Economic Statistics, 15(3):345–353, 1997.
J. P. Florens and M. Mouchart. A note on noncausality. Econometrica, 50(3):583–591,
1982.
M. Forni, M. Hallin, M. Lippi, and L. Reichlin. The generalized dynamic-factor model: Identification and estimation. Review of Economics and Statistics, 82(4):540–554, 2000.
C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, 37(3):424–438, 1969.
C. W. J. Granger. Testing for causality: A personal viewpoint. Journal of Economic Dynamics and Control, 2:329–352, 1980.
T. Haavelmo. The probability approach in econometrics. Econometrica, 12:1–115, 1944.
C. Hiemstra and J. D. Jones. Testing for linear and nonlinear Granger causality in the stock price-volume relation. Journal of Finance, 49(5):1639–1664, 1994.
W. C. Hood and T. C. Koopmans. Studies in econometric method, Cowles Commission
Monograph, No. 14. New York: John Wiley & Sons, 1953.
K. D. Hoover. Causality in macroeconomics. Cambridge University Press, 2001.
K. D. Hoover. The methodology of econometrics. New Palgrave Handbook of Econometrics,
1:61–87, 2006.
K. D. Hoover. Causality in economics and econometrics. In The New Palgrave Dictionary
of Economics. London: Palgrave Macmillan, 2008.
K. D. Hoover, S. Demiralp, and S. J. Perez. Empirical Identification of the Vector Autoregression: The Causes and Effects of US M2. In The Methodology and Practice of Econometrics. A Festschrift in Honour of David F. Hendry, pages 37–58. Oxford University Press, 2009.
P. O. Hoyer, A. Hyvärinen, R. Scheines, P. Spirtes, J. Ramsey, G. Lacerda, and S. Shimizu.
Causal discovery of linear acyclic models with arbitrary distributions. In Proceedings of
the 24th Conference on Uncertainty in Artificial Intelligence, 2008a.
P. O. Hoyer, S. Shimizu, A. J. Kerminen, and M. Palviainen. Estimation of causal effects
using linear non-Gaussian causal models with hidden variables. International Journal of
Approximate Reasoning, 49:362–378, 2008b.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001.
A. Hyvärinen, K. Zhang, S. Shimizu, and P. O. Hoyer. Estimation of a structural vector
autoregression model using non-Gaussianity. Journal of Machine Learning Research, 11:
1709–1731, 2010.
S. Johansen. Statistical analysis of cointegrating vectors. Journal of Economic Dynamics
and Control, 12:231–254, 1988.
S. Johansen. Estimation and hypothesis testing of cointegrating vectors in Gaussian vector
autoregressive models. Econometrica, 59:1551–1580, 1991.
S. Johansen. Cointegration: An Overview. In Palgrave Handbook of Econometrics. Volume
1. Econometric Theory, pages 540–577. Palgrave Macmillan, 2006.
R. G. King, C. I. Plosser, J. H. Stock, and M. W. Watson. Stochastic trends and economic
ﬂuctuations. American Economic Review, 81:819–840, 1991.
T. C. Koopmans. Statistical Inference in Dynamic Economic Models, Cowles Commission
Monograph, No. 10. New York: John Wiley & Sons, 1950.
G. Lacerda, P. Spirtes, J. Ramsey, and P. O. Hoyer. Discovering cyclic causal models by
independent components analysis. In Proceedings of the 24th Conference on Uncertainty
in Artificial Intelligence (UAI-2008), Helsinki, Finland, 2008.
R. E. Lucas. Econometric policy evaluation: A critique. In Carnegie-Rochester Conference
Series on Public Policy, volume 1, pages 19–46. Elsevier, 1976.
H. Lütkepohl. Vector autoregressive models. In Palgrave Handbook of Econometrics. Volume 1.
Econometric Theory, pages 477–510. Palgrave Macmillan, 2006.
A. Moneta. Graphical Models for Structural Vector Autoregressions. LEM Papers Series,
Sant’Anna School of Advanced Studies, Pisa, 2003.
A. Moneta. Identification of monetary policy shocks: a graphical causal approach. Notas
Económicas, 20:39–62, 2004.
A. Moneta. Graphical causal models and VARs: an empirical assessment of the real business
cycles hypothesis. Empirical Economics, 35(2):275–300, 2008.
A. Moneta, D. Entner, P. O. Hoyer, and A. Coad. Causal inference by independent component
analysis with applications to micro- and macroeconomic data. Jena Economic
Research Papers, 2010:031, 2010.
E. Paparoditis and D. N. Politis. The local bootstrap for kernel estimators under general
dependence conditions. Annals of the Institute of Statistical Mathematics, 52(1):139–159,
2000.
J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge,
2000.
M. Reale and G. T. Wilson. Identification of vector AR models with recursive structural
errors using conditional independence graphs. Statistical Methods and Applications, 10:
49–65, 2001.
T. Richardson and P. Spirtes. Automated discovery of linear feedback models. In Computation,
Causation, and Discovery. AAAI Press and MIT Press, Menlo Park, 1999.
R. Scheines, P. Spirtes, C. Glymour, C. Meek, and T. Richardson. The TETRAD project:
Constraint-based aids to causal model specification. Multivariate Behavioral Research,
33(1):65–117, 1998.
M. D. Shapiro and M. W. Watson. Sources of business cycle fluctuations. NBER Macroeconomics
Annual, 3:111–148, 1988.
S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic
model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.
S. Shimizu, A. Hyvärinen, Y. Kawahara, and T. Washio. A direct method for estimating
a causal ordering in a linear non-Gaussian acyclic model. In Proceedings of the 25th
Conference on Uncertainty in Artificial Intelligence, 2009.
C. A. Sims. Macroeconomics and reality. Econometrica, 48:1–48, 1980.
C. A. Sims. An autoregressive index model for the U.S. 1948–1975. In J. Kmenta and
J. B. Ramsey, editors, Large-Scale Macroeconometric Models: Theory and Practice, pages
283–327. North-Holland, 1981.
P. Spirtes, C. Glymour, and R. Scheines. Causation, prediction, and search. MIT Press,
Cambridge MA, 2nd edition, 2000.
L. Su and H. White. A nonparametric Hellinger metric test for conditional independence.
Econometric Theory, 24(4):829–864, 2008.
P. Suppes. A probabilistic theory of causation. Acta Philosophica Fennica, XXIV, 1970.
N. R. Swanson and C. W. J. Granger. Impulse response function based on a causal approach
to residual orthogonalization in vector autoregressions. Journal of the American
Statistical Association, 92:357–367, 1997.
G. J. Székely and M. L. Rizzo. Testing for equal distributions in high dimension. InterStat,
5, 2004.
M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman & Hall, London, 1995.
A. H. Welsh, X. Lin, and R. J. Carroll. Marginal Longitudinal Nonparametric Regression.
Journal of the American Statistical Association, 97(458):482–493, 2002.
H. White and X. Lu. Granger Causality and Dynamic Structural Systems. Journal of
Financial Econometrics, 8(2):193, 2010.
N. Wiener. The theory of prediction. Modern Mathematics for Engineers, Series 1:125–139,
1956.
S. Wright. Correlation and causation. Journal of Agricultural Research, 20(7):557–585, 1921.
A. Yatchew. Nonparametric regression techniques in economics. Journal of Economic
Literature, 36(2):669–721, 1998.