JMLR: Workshop and Conference Proceedings 12 (2011) 95–118    Causality in Time Series

Causal Search in Structural Vector Autoregressive Models

Alessio Moneta

Max Planck Institute of Economics

Jena, Germany

moneta@econ.mpg.de

Nadine Chlaß

Friedrich Schiller University of Jena, Germany

nadine.chlass@uni-jena.de

Doris Entner

Helsinki Institute for Information Technology, Finland

doris.entner@cs.helsinki.fi

Patrik Hoyer

Helsinki Institute for Information Technology, Finland

patrik.hoyer@helsinki.fi

Editors: Florin Popescu and Isabelle Guyon

Abstract

This paper reviews a class of methods to perform causal inference in the framework of a

structural vector autoregressive model. We consider three different settings. In the first

setting the underlying system is linear with normal disturbances and the structural model is

identified by exploiting the information incorporated in the partial correlations of the estimated

residuals. Zero partial correlations are used as input of a search algorithm formalized

via graphical causal models. In the second, semi-parametric, setting the underlying system

is linear with non-Gaussian disturbances. In this case the structural vector autoregressive

model is identified through a search procedure based on independent component analysis.

Finally, we explore the possibility of causal search in a nonparametric setting by studying

the performance of conditional independence tests based on kernel density estimations.

Keywords: Causal inference, econometric time series, SVAR, graphical causal models,

independent component analysis, conditional independence tests

1. Introduction

1.1. Causal inference in econometrics

Applied economic research is pervaded by questions about causes and effects. For example,

what is the effect of a monetary policy intervention? Is energy consumption causing

growth or the other way around? Or does causality run in both directions? Are economic

fluctuations mainly caused by monetary, productivity, or demand shocks? Does foreign

aid improve living standards in poor countries? Does firms’ expenditure in R&D causally

influence their profits? Are recent rises in oil prices in part caused by speculation? These

are seemingly heterogeneous questions, but they all require some knowledge of the causal

process by which variables came to take the values we observe.

A traditional approach to address such questions hinges on the explicit use of a priori
economic theory. The gist of this approach is to partition a causal process into a deterministic
and a random part and to articulate the deterministic part so as to reflect the causal
dependencies dictated by economic theory. If the formulation of the deterministic part is
accurate and reliable enough, the random part is expected to display properties that can easily
be analyzed by standard statistical tools. The touchstone of this approach is represented by
the work of Haavelmo (1944), which inspired the research program subsequently pursued
by the Cowles Commission (Koopmans, 1950; Hood and Koopmans, 1953). Therein, the
causal process is formalized by means of a structural equation model (SEM), that is, a system of
equations with endogenous variables, exogenous variables, and error terms, first developed
by Wright (1921). Its coefficients were given a causal interpretation (Pearl, 2000).

© 2011 A. Moneta, N. Chlaß, D. Entner & P. Hoyer.

This approach was strongly criticized in the 1970s for being ineffective in both policy
evaluation and forecasting. Lucas (1976) pointed out that the economic theory included
in the SEM fails to take economic agents' (rational) motivations and expectations into
consideration. Agents, according to Lucas, are able to anticipate policy interventions and
act contrary to the predictions derived from the structural equation model, since the model
usually ignores such anticipations. Sims (1980) puts forth another critique which runs
parallel to that of Lucas. It explicitly addresses the status of exogeneity which the Cowles
Commission approach attributes (arbitrarily, according to Sims) to some variables so that
the structural model can be identified. Sims argues that theory is not a reliable source for
deeming a variable exogenous. More generally, the Cowles Commission approach, with its
strong a priori commitment to theory, risks falling into a vicious circle: if causal information
(even if only about direction) can exclusively be derived from background theory, how do
we obtain an empirically justified theory? (Cf. Hoover, 2006, p. 75).

An alternative approach has been pursued since the work of Wiener (1956) and Granger
(1969). It aims at inferring causal relations directly from the statistical properties of the
data, relying only to a minimal extent on background knowledge. Granger (1980) proposes
a probabilistic concept of causality, similar to Suppes (1970). Granger defines causality in
terms of the incremental predictability (at horizon one) of a time series variable $\{Y_t\}$ (given
the present and past values of $\{Y_t\}$ and of a set $\{Z_t\}$ of possibly relevant variables) when
another time series variable $\{X_t\}$ (in its present and past values) is not omitted. More
formally, $\{X_t\}$ Granger-causes $\{Y_t\}$ if

$$P(Y_{t+1} \mid X_t, X_{t-1}, \ldots, Y_t, Y_{t-1}, \ldots, Z_t, Z_{t-1}, \ldots) \neq P(Y_{t+1} \mid Y_t, Y_{t-1}, \ldots, Z_t, Z_{t-1}, \ldots). \tag{1}$$

As pointed out by Florens and Mouchart (1982), testing the hypothesis of Granger
noncausality corresponds to testing conditional independence. Given lags $p$, $\{X_t\}$ does not
Granger-cause $\{Y_t\}$ if

$$Y_{t+1} \perp\!\!\!\perp (X_t, X_{t-1}, \ldots, X_{t-p}) \mid (Y_t, Y_{t-1}, \ldots, Y_{t-p}, Z_t, Z_{t-1}, \ldots, Z_{t-p}). \tag{2}$$

To test Granger noncausality, researchers often specify linear vector autoregressive (VAR)
models:

$$Y_t = A_1 Y_{t-1} + \ldots + A_p Y_{t-p} + u_t, \tag{3}$$

in which $Y_t$ is a $k \times 1$ vector of time series variables $(Y_{1,t}, \ldots, Y_{k,t})'$, where $(\cdot)'$ denotes the
transpose, the $A_j$ $(j = 1, \ldots, p)$ are $k \times k$ coefficient matrices, and $u_t$ is the $k \times 1$ vector
of random disturbances. In this framework, testing the hypothesis that $\{Y_{i,t}\}$ does not
Granger-cause $\{Y_{j,t}\}$ reduces to testing whether the $(j,i)$ entries of the matrices $A_1, \ldots, A_p$
are vanishing simultaneously. Granger noncausality tests have been extended to nonlinear
settings by Baek and Brock (1992), Hiemstra and Jones (1994), and Su and White (2008),
using nonparametric tests of conditional independence (more on this topic in section 4).
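To illustrate the linear test just described, the following minimal Python sketch estimates a VAR(p) by OLS and computes a Wald statistic for the joint hypothesis that the (j,i) entries of A_1,...,A_p are zero. It is only a sketch under stated assumptions: the function names `fit_var_ols` and `granger_wald`, the absence of an intercept, the homoskedastic covariance formula, and the simulated bivariate VAR(1) are illustrative choices that do not come from the paper.

```python
import numpy as np

def fit_var_ols(Y, p):
    """OLS estimate of a VAR(p) without intercept (assumption). Y has shape (T, k)."""
    T, k = Y.shape
    # Row t of X is [Y_{t-1}', ..., Y_{t-p}'], so X has shape (T - p, k * p)
    X = np.hstack([Y[p - j:T - j] for j in range(1, p + 1)])
    Z = Y[p:]
    coef, *_ = np.linalg.lstsq(X, Z, rcond=None)  # shape (k * p, k); column i is equation i
    resid = Z - X @ coef
    return X, coef, resid

def granger_wald(Y, cause, effect, p):
    """Wald statistic for H0: variable `cause` does not Granger-cause variable `effect`."""
    X, coef, resid = fit_var_ols(Y, p)
    n, k = resid.shape
    # Homoskedastic OLS covariance of the coefficients of the `effect` equation
    sigma2 = resid[:, effect] @ resid[:, effect] / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    # Positions of the lagged `cause` variable in the regressor vector
    idx = [lag * k + cause for lag in range(p)]
    b = coef[idx, effect]
    return float(b @ np.linalg.solve(cov[np.ix_(idx, idx)], b))

# Hypothetical example: variable 0 drives variable 1, but not the other way around
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.0],
              [0.4, 0.5]])
Y = np.zeros((2000, 2))
for t in range(1, 2000):
    Y[t] = A @ Y[t - 1] + rng.standard_normal(2)

w_01 = granger_wald(Y, cause=0, effect=1, p=1)  # tests the (1,0) entry of A_1
w_10 = granger_wald(Y, cause=1, effect=0, p=1)  # tests the (0,1) entry of A_1
print(w_01, w_10)
```

In this stationary setting the statistic is asymptotically chi-squared with p degrees of freedom under the null of Granger noncausality, so for the simulated system `w_01` should be large and `w_10` small.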

The concept of Granger causality has been criticized for failing to capture ‘structural

causality’ (Hoover, 2008). Suppose one finds that a variable A Granger-causes another

variable B. This does not necessarily imply that an economic mechanism exists by which

A can be manipulated to affect B. The existence of such a mechanism in turn does not

necessarily imply Granger causality either (for a discussion see Hoover 2001, pp. 150-155).

Indeed, the analysis of Granger causality is based on coefficients of reduced-form models,

like those incorporated in equation (3), which are unlikely to reliably represent actual economic

mechanisms. For instance, in equation (3) the simultaneous causal structure is not

modeled in order to facilitate estimation. (However, note that Eichler (2007) and White

and Lu (2010) have recently developed and formalized richer structural frameworks in which

Granger causality can be fruitfully analyzed.)

1.2. The SVAR framework

Structural vector autoregressive (SVAR) models constitute a middle way between the Cowles

Commission approach and the Granger-causality approach. SVAR models aim at recovering

the concept of structural causality, but eschew at the same time the strong ‘apriorism’ of

the Cowles Commission approach. The idea is, like in the Cowles Commission approach,

to articulate an unobserved structural model, formalized as a dynamic generative model:

at each time unit the system is affected by unobserved innovation terms, by which, once

filtered by the model, the variables come to take the values we observe. But, differently
from the Cowles Commission approach, and similarly to the Granger-VAR model, the data
generating process is specified generally enough that time series variables are not
distinguished a priori as exogenous or endogenous. A linear SVAR model is in principle
a VAR model 'augmented' by the contemporaneous structure:

$$\Gamma_0 Y_t = \Gamma_1 Y_{t-1} + \ldots + \Gamma_p Y_{t-p} + \varepsilon_t. \tag{4}$$

This is easily obtained by pre-multiplying each side of the VAR model

$$Y_t = A_1 Y_{t-1} + \ldots + A_p Y_{t-p} + u_t \tag{5}$$

by a matrix $\Gamma_0$ so that $\Gamma_i = \Gamma_0 A_i$, for $i = 1, \ldots, p$, and $\varepsilon_t = \Gamma_0 u_t$. Note, however, that
not just any matrix $\Gamma_0$ will be suitable. The appropriate $\Gamma_0$ will be that matrix corresponding
to the 'right' rotation of the VAR model, that is, the rotation compatible both with the
contemporaneous causal structure of the variables and the structure of the innovation term.
Let us consider a matrix $B_0 = I - \Gamma_0$. If the system is normalized such that the matrix $\Gamma_0$ has
all the elements of the principal diagonal equal to one (which can be done straightforwardly),
the diagonal elements of $B_0$ will be equal to zero. We can write:

$$Y_t = B_0 Y_t + \Gamma_1 Y_{t-1} + \ldots + \Gamma_p Y_{t-p} + \varepsilon_t \tag{6}$$

from which we see that $B_0$ (and thus $\Gamma_0$) determines in which form the values of a variable
$Y_{i,t}$ will be dependent on the contemporaneous value of another variable $Y_{j,t}$. The 'right'
rotation will also be the one which makes $\varepsilon_t$ a vector of authentic innovation terms, which
are expected to be independent (not only over time, but also contemporaneously) sources
or shocks.

In the literature, different methods have been proposed to identify the SVAR model
(4) on the basis of the estimation of the VAR model (5). Notice that there are more
unobserved parameters in (4), whose number amounts to $k^2(p + 1)$, than parameters that
can be estimated from (5), which are $k^2 p + k(k + 1)/2$, so one has to impose at least
$k(k - 1)/2$ restrictions on the system. One solution to this problem is to get a rotation
of (5) such that the covariance matrix of the SVAR residuals $\Sigma_\varepsilon$ is diagonal, using the
Cholesky factorization of the estimated residual covariance matrix $\Sigma_u$. That is, let $P$ be the
lower-triangular Cholesky factor of $\Sigma_u$ (i.e. $\Sigma_u = PP'$), let $D$ be a $k \times k$ diagonal matrix with
the same diagonal as $P$, and let $\Gamma_0 = DP^{-1}$. By pre-multiplying (5) by $\Gamma_0$, it turns out
that $\Sigma_\varepsilon = E[\Gamma_0 u_t u_t' \Gamma_0'] = DD'$, which is diagonal. A problem with this method is that
$P$ changes if the ordering of the variables $(Y_{1t}, \ldots, Y_{kt})'$ in $Y_t$ and, consequently, the order
of residuals in $\Sigma_u$, changes. Since researchers who estimate a SVAR are often exclusively
interested in tracking down the effect of a structural shock $\varepsilon_{it}$ on the variables $Y_{1,t}, \ldots, Y_{k,t}$
over time (impulse response functions), Sims (1981) suggested investigating to what extent
the impulse response functions remain robust under changes of the order of variables.

Popular alternatives to the Cholesky identification scheme are based either on the use
of a priori, theory-based restrictions or on the use of long-run restrictions. The former
solution consists in imposing economically plausible constraints on the contemporaneous
interactions among variables (Blanchard and Watson, 1986; Bernanke, 1986) and has the
drawback of ultimately depending on the a priori reliability of economic theory, similarly
to the Cowles Commission approach. The second solution is based on the assumption that
certain economic shocks have long-run effects on some variables, but do not influence in the
long run the level of other variables (see Shapiro and Watson, 1988; Blanchard and Quah,
1989; King et al., 1991). This approach has been criticized as not being very reliable unless
strong a priori restrictions are imposed (see Faust and Leeper, 1997).

In the rest of the paper, we first present a method, based on the graphical causal model
framework, to identify the SVAR (section 2). This method is based on conditional
independence tests among the estimated residuals of the VAR model. Such tests
rely on the assumption that the shocks affecting the model are Gaussian. We then relax
the Gaussianity assumption and present a method to identify the SVAR model based on
independent component analysis (section 3). Here the main assumption is that shocks are
non-Gaussian and independent. Finally (section 4), we explore the possibility of extending
the framework for causal inference to a nonparametric setting. In section 5 we wrap up the
discussion and conclude by formulating some open questions.
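The Cholesky identification scheme discussed in section 1.2 can be sketched numerically as follows. This is a minimal illustration, not a full estimation routine: the helper name `cholesky_gamma0` and the covariance matrix `sigma_u` are made-up assumptions, and in practice the covariance would be estimated from VAR residuals. The sketch checks that Γ0 = DP^{-1} has a unit diagonal and yields a diagonal Σε = DD', and that a different ordering of the variables gives a different Γ0.

```python
import numpy as np

def cholesky_gamma0(sigma_u):
    """Return (Gamma_0, D) with Gamma_0 = D P^{-1}, where sigma_u = P P'."""
    P = np.linalg.cholesky(sigma_u)   # lower-triangular Cholesky factor
    D = np.diag(np.diag(P))           # diagonal matrix with the same diagonal as P
    gamma0 = D @ np.linalg.inv(P)
    return gamma0, D

# Hypothetical residual covariance matrix (assumption, for illustration only)
sigma_u = np.array([[1.0, 0.5, 0.2],
                    [0.5, 2.0, 0.3],
                    [0.2, 0.3, 1.5]])

gamma0, D = cholesky_gamma0(sigma_u)
sigma_eps = gamma0 @ sigma_u @ gamma0.T   # covariance of eps_t = Gamma_0 u_t

# Same computation after reversing the variable ordering
order = [2, 1, 0]
gamma0_perm, _ = cholesky_gamma0(sigma_u[np.ix_(order, order)])
# Express the result in the original ordering for comparison
gamma0_reordered = gamma0_perm[np.ix_(order, order)]

print(np.round(sigma_eps, 6))
print(np.round(gamma0, 3))
print(np.round(gamma0_reordered, 3))
```

With the first ordering Γ0 is lower triangular; with the reversed ordering, expressed in the original variable order, it is upper triangular. This is the order dependence that motivates the robustness checks suggested by Sims (1981).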

2. SVAR identification via graphical causal models

2.1. Background

A data-driven approach to identify the structural VAR is based on the analysis of the
estimated residuals $\hat{u}_t$. Notice that when a basic VAR model is estimated (equation 3), the
information about contemporaneous causal dependence is incorporated exclusively in the
residuals (not being modeled among the variables). Graphical causal models, as originally
developed by Pearl (2000) and Spirtes et al. (2000), represent an efficient method to recover,
at least in part, the contemporaneous causal structure starting from the analysis of the
conditional independencies among the estimated residuals. Once the contemporaneous causal
structure is recovered, the estimation of the lagged autoregressive coefficients permits us to
identify the complete SVAR model.

This approach was initiated by Swanson and Granger (1997), who proposed to test

whether a particular causal order of the VAR is in accord with the data by testing all

the partial correlations of order one among error terms and checking whether some partial

correlations are vanishing. Reale and Wilson (2001), Bessler and Lee (2002), Demiralp and

Hoover (2003), and Moneta (2008) extended the approach by using the partial correlations

of the VAR residuals as input to graphical causal model search algorithms.

In graphical causal models, the structural model is represented as a causal graph (a

Directed Acyclic Graph if the presence of causal loops is excluded), in which each node

represents a random variable and each edge a causal dependence. Furthermore, a set of

assumptions or ‘rules of inference’ are formulated, which regulate the relationship between

causal and probabilistic dependencies: the causal Markov and the faithfulness conditions

(Spirtes et al., 2000). The former restricts the joint probability distribution of modeled

variables: each variable is independent of its graphical non-descendants conditional on its

graphical parents. The latter makes causal discovery possible: all of the conditional independence

relations among the modeled variables follow from the causal Markov condition.

Thus, for example, if the causal structure is represented as Y1,t → Y2,t → Y3,t, it follows from

the Markov condition that Y1,t ⊥⊥ Y3,t | Y2,t. If, on the other hand, the only (conditional)

independence relation among Y1,t, Y2,t, Y3,t is Y1,t ⊥⊥ Y3,t, it follows from the faithfulness

condition that Y1,t → Y2,t ← Y3,t.

Constraint-based algorithms for causal discovery, such as PC, SGS, FCI (Spirtes
et al. 2000), or CCD (Richardson and Spirtes 1999), use tests of conditional independence
to constrain the possible causal relationships among the model variables. The first step of
the algorithm typically involves the formation of a complete undirected graph among the
variables so that they are all connected by an undirected edge. In a second step,
conditional independence relations (or d-separations, which are the graphical characterization of
conditional independence) are used to erase edges and, in further steps, to direct
edges. The output of such algorithms is not necessarily one single graph, but a class of
Markov equivalent graphs.

There is nothing in either the Markov and faithfulness conditions or the constraint-based
algorithms that limits them to linear and Gaussian settings. Graphical causal models
do not require per se any a priori specification of the functional dependence between
variables. However, in applications of graphical models to SVAR, conditional independence is
ascertained by testing vanishing partial correlations (Swanson and Granger, 1997; Bessler
and Lee, 2002; Demiralp and Hoover, 2003; Moneta, 2008). Since the normal distribution
guarantees the equivalence between zero partial correlation and conditional independence, these
applications deal de facto with linear and Gaussian processes.

2.2. Testing residuals zero partial correlations

There are alternative methods to test zero partial correlations among the error terms
$\hat{u}_t = (u_{1t}, \ldots, u_{kt})'$. Swanson and Granger (1997) use the partial correlation coefficient. That is,
in order to test, for instance, $\rho(u_{it}, u_{kt} \mid u_{jt}) = 0$, they use the standard t statistic from a
least squares regression of the model:

$$u_{it} = \alpha_j u_{jt} + \alpha_k u_{kt} + \varepsilon_{it}, \tag{7}$$

on the basis that $\alpha_k = 0 \Leftrightarrow \rho(u_{it}, u_{kt} \mid u_{jt}) = 0$. Since Swanson and Granger (1997) impose
the partial correlation constraints looking only at the set of partial correlations of order
one (that is, conditioned on only one variable), in order to run their tests they consider
regression equations with only two regressors, as in equation (7).

Bessler and Lee (2002) and Demiralp and Hoover (2003) use Fisher's z, which is incorporated
in the software TETRAD (Scheines et al., 1998):

$$z(\rho_{XY.K}, T) = \frac{1}{2} \sqrt{T - |K| - 3} \, \log\left(\frac{|1 + \rho_{XY.K}|}{|1 - \rho_{XY.K}|}\right), \tag{8}$$

where $|K|$ equals the number of variables in $K$ and $T$ the sample size. If the variables (for
instance $X = u_{it}$, $Y = u_{kt}$, $K = (u_{jt}, u_{ht})$) are normally distributed, we have that

$$z(\rho_{XY.K}, T) - z(\hat{\rho}_{XY.K}, T) \sim N(0, 1) \tag{9}$$

(see Spirtes et al., 2000, p. 94).
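As a rough illustration of equations (8) and (9), the sketch below computes a sample partial correlation from a covariance matrix and the corresponding Fisher's z statistic with a two-sided p-value. The function names `partial_corr` and `fisher_z`, the use of the inverse covariance matrix to obtain the partial correlation, and the simulated chain u1 → u2 → u3 are illustrative assumptions, not part of the procedures cited above.

```python
import math
import numpy as np

def partial_corr(cov, i, j, cond):
    """Partial correlation of variables i and j given the variables in `cond`."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(cov[np.ix_(idx, idx)])   # precision matrix of the subset
    return -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])

def fisher_z(rho, T, n_cond):
    """Fisher's z of equation (8) and a two-sided standard normal p-value for H0: rho = 0."""
    z = 0.5 * math.sqrt(T - n_cond - 3) * math.log(abs(1 + rho) / abs(1 - rho))
    p_value = math.erfc(abs(z) / math.sqrt(2))    # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical Gaussian residuals with contemporaneous chain u1 -> u2 -> u3
rng = np.random.default_rng(0)
T = 2000
e = rng.standard_normal((T, 3))
u1 = e[:, 0]
u2 = 0.8 * u1 + e[:, 1]
u3 = 0.8 * u2 + e[:, 2]
cov_hat = np.cov(np.column_stack([u1, u2, u3]), rowvar=False)

rho_marginal = partial_corr(cov_hat, 0, 2, [])       # corr(u1, u3), not zero
rho_given_u2 = partial_corr(cov_hat, 0, 2, [1])      # rho(u1, u3 | u2), zero in population
print(fisher_z(rho_marginal, T, 0))
print(fisher_z(rho_given_u2, T, 1))
```

In the simulated chain the marginal correlation between u1 and u3 is clearly rejected as zero, while the partial correlation given u2 should be close to zero, which is the pattern a constraint-based search exploits.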

A different approach, which takes into account the fact that correlations are obtained
from residuals of a regression, is proposed by Moneta (2008). In this case it is useful to
write the VAR model of equation (3) in a more compact form:

$$Y_t = \Pi' X_t + u_t, \tag{10}$$

where $X_t' = [Y_{t-1}', \ldots, Y_{t-p}']$, which has dimension $(1 \times kp)$, and $\Pi' = [A_1, \ldots, A_p]$, which
has dimension $(k \times kp)$. In the case of a stable VAR process (see next subsection), the conditional
maximum likelihood estimate of $\Pi$ for a sample of size $T$ is given by

$$\hat{\Pi}' = \left[ \sum_{t=1}^{T} Y_t X_t' \right] \left[ \sum_{t=1}^{T} X_t X_t' \right]^{-1}.$$

Moreover, the $i$th row of $\hat{\Pi}'$ is

$$\hat{\pi}_i' = \left[ \sum_{t=1}^{T} Y_{it} X_t' \right] \left[ \sum_{t=1}^{T} X_t X_t' \right]^{-1},$$

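which coincides with the estimated coefficient vector from an OLS regression of $Y_{it}$ on
$X_t$ (Hamilton 1994: 293). The maximum likelihood estimate of the matrix of variance
and covariance among the error terms $\Sigma_u$ turns out to be $\hat{\Sigma}_u = (1/T) \sum_{t=1}^{T} \hat{u}_t \hat{u}_t'$, where
$\hat{u}_t = Y_t - \hat{\Pi}' X_t$. Therefore, the maximum likelihood estimate of the covariance between $u_{it}$
and $u_{jt}$ is given by the $(i,j)$ element of $\hat{\Sigma}_u$: $\hat{\sigma}_{ij} = (1/T) \sum_{t=1}^{T} \hat{u}_{it} \hat{u}_{jt}$. Denoting by $\sigma_{ij}$ the
$(i,j)$ element of $\Sigma_u$, let us first define the following matrix transform operators: vec, which
stacks the columns of a $k \times k$ matrix into a vector of length $k^2$, and vech, which vertically
stacks the elements of a $k \times k$ matrix on or below the principal diagonal into a vector of
length $k(k + 1)/2$. For example:

$$\mathrm{vec} \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix} = \begin{bmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{12} \\ \sigma_{22} \end{bmatrix}, \qquad \mathrm{vech} \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix} = \begin{bmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{22} \end{bmatrix}.$$

The process being stationary and the error terms Gaussian, it turns out that:

$$\sqrt{T} \, [\mathrm{vech}(\hat{\Sigma}_u) - \mathrm{vech}(\Sigma_u)] \xrightarrow{d} N(0, \Omega), \tag{11}$$

where $\Omega = 2 D_k^{+} (\Sigma_u \otimes \Sigma_u)(D_k^{+})'$, $D_k^{+} \equiv (D_k' D_k)^{-1} D_k'$, $D_k$ is the unique $(k^2 \times k(k+1)/2)$
duplication matrix satisfying $D_k \mathrm{vech}(M) = \mathrm{vec}(M)$ for any symmetric $k \times k$ matrix $M$, and $\otimes$
denotes the Kronecker product (see Hamilton 1994: 301). For example, for $k = 2$, we have,

$$\sqrt{T} \begin{bmatrix} \hat{\sigma}_{11} - \sigma_{11} \\ \hat{\sigma}_{12} - \sigma_{12} \\ \hat{\sigma}_{22} - \sigma_{22} \end{bmatrix} \xrightarrow{d} N \left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 2\sigma_{11}^2 & 2\sigma_{11}\sigma_{12} & 2\sigma_{12}^2 \\ 2\sigma_{11}\sigma_{12} & \sigma_{11}\sigma_{22} + \sigma_{12}^2 & 2\sigma_{12}\sigma_{22} \\ 2\sigma_{12}^2 & 2\sigma_{12}\sigma_{22} & 2\sigma_{22}^2 \end{bmatrix} \right).$$

Therefore, to test the null hypothesis that $\rho(u_{it}, u_{jt}) = 0$ from the VAR estimated residuals,
it is possible to use the Wald statistic:

$$\frac{T (\hat{\sigma}_{ij})^2}{\hat{\sigma}_{ii} \hat{\sigma}_{jj} + \hat{\sigma}_{ij}^2} \approx \chi^2(1).$$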

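The Wald statistic for testing vanishing partial correlations of any order is obtained by
applying the delta method, which suggests that if $X_T$ is a $(r \times 1)$ sequence of vector-valued
random variables and if $[\sqrt{T}(X_{1T} - \theta_1), \ldots, \sqrt{T}(X_{rT} - \theta_r)] \xrightarrow{d} N(0, \Sigma)$ and $h_1, \ldots, h_r$
are $r$ real-valued functions of $\theta = (\theta_1, \ldots, \theta_r)$, $h_i : \mathbb{R}^r \to \mathbb{R}$, defined and continuously
differentiable in a neighborhood $\omega$ of the parameter point $\theta$ and such that the matrix
$B = ||\partial h_i / \partial \theta_j||$ of partial derivatives is nonsingular in $\omega$, then:

$$[\sqrt{T}[h_1(X_T) - h_1(\theta)], \ldots, \sqrt{T}[h_r(X_T) - h_r(\theta)]] \xrightarrow{d} N(0, B \Sigma B')$$

(see Lehmann and Casella 1998: 61).

Thus, for $k = 4$, suppose one wants to test $\mathrm{corr}(u_{1t}, u_{3t} \mid u_{2t}) = 0$. First, notice that
$\rho(u_1, u_3 \mid u_2) = 0$ if and only if $\sigma_{22}\sigma_{13} - \sigma_{12}\sigma_{23} = 0$ (by definition of partial correlation). One
can define a function $g : \mathbb{R}^{k(k+1)/2} \to \mathbb{R}$, such that $g(\mathrm{vech}(\Sigma_u)) = \sigma_{22}\sigma_{13} - \sigma_{12}\sigma_{23}$. Thus,

$$\nabla g' = (0, \, -\sigma_{23}, \, \sigma_{22}, \, 0, \, \sigma_{13}, \, -\sigma_{12}, \, 0, \, 0, \, 0, \, 0).$$

Applying the delta method:

$$\sqrt{T} \, [(\hat{\sigma}_{22}\hat{\sigma}_{13} - \hat{\sigma}_{12}\hat{\sigma}_{23}) - (\sigma_{22}\sigma_{13} - \sigma_{12}\sigma_{23})] \xrightarrow{d} N(0, \nabla g' \Omega \nabla g).$$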

The Wald test of the null hypothesis $\mathrm{corr}(u_{1t}, u_{3t} \mid u_{2t}) = 0$ is given by:

$$\frac{T (\hat{\sigma}_{22}\hat{\sigma}_{13} - \hat{\sigma}_{12}\hat{\sigma}_{23})^2}{\nabla g' \Omega \nabla g} \approx \chi^2(1).$$

Tests for higher order correlations and for $k > 4$ follow analogously (see also Moneta, 2003).
This testing procedure has the advantage, with respect to the alternative methods, of being
straightforwardly applicable to the case of cointegrated data, as will be explained in the next
subsection.
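The Wald test for a first-order partial correlation derived above can be sketched as follows. This is a minimal illustration under assumptions: the function names (`duplication_matrix`, `vech_index`, `asymptotic_cov_vech`, `wald_partial_corr`) are hypothetical, indices are 0-based, Ω is evaluated at the estimated covariance matrix, and the residuals are simulated rather than obtained from a fitted VAR.

```python
import numpy as np

def duplication_matrix(k):
    """D_k with D_k vech(M) = vec(M) for symmetric k x k M (column-major vec)."""
    D = np.zeros((k * k, k * (k + 1) // 2))
    col = 0
    for j in range(k):
        for i in range(j, k):
            D[j * k + i, col] = 1.0   # position of M[i, j] in vec(M)
            D[i * k + j, col] = 1.0   # position of M[j, i] in vec(M)
            col += 1
    return D

def vech_index(i, j, k):
    """0-based position of sigma_ij within vech of a k x k symmetric matrix."""
    i, j = max(i, j), min(i, j)
    return j * k - j * (j - 1) // 2 + (i - j)

def asymptotic_cov_vech(sigma):
    """Omega = 2 D_k^+ (Sigma kron Sigma) D_k^+' as in equation (11)."""
    Dk = duplication_matrix(sigma.shape[0])
    Dk_plus = np.linalg.solve(Dk.T @ Dk, Dk.T)
    return 2.0 * Dk_plus @ np.kron(sigma, sigma) @ Dk_plus.T

def wald_partial_corr(sigma_hat, T, i, j, h):
    """Wald statistic for H0: rho(u_i, u_j | u_h) = 0, i.e. s_hh s_ij - s_ih s_jh = 0."""
    k = sigma_hat.shape[0]
    s = sigma_hat
    grad = np.zeros(k * (k + 1) // 2)          # gradient of g with respect to vech(Sigma)
    grad[vech_index(h, h, k)] += s[i, j]
    grad[vech_index(i, j, k)] += s[h, h]
    grad[vech_index(i, h, k)] -= s[j, h]
    grad[vech_index(j, h, k)] -= s[i, h]
    g = s[h, h] * s[i, j] - s[i, h] * s[j, h]
    return T * g ** 2 / (grad @ asymptotic_cov_vech(s) @ grad)

# Hypothetical Gaussian residuals: chain u1 -> u2 -> u3 plus an unrelated u4 (k = 4)
rng = np.random.default_rng(1)
T = 2000
e = rng.standard_normal((T, 4))
u1 = e[:, 0]
u2 = 0.8 * u1 + e[:, 1]
u3 = 0.8 * u2 + e[:, 2]
U = np.column_stack([u1, u2, u3, e[:, 3]])
sigma_hat = U.T @ U / T                        # ML-type estimate with divisor T

w_zero = wald_partial_corr(sigma_hat, T, 0, 2, 1)     # rho(u1, u3 | u2): null is true
w_nonzero = wald_partial_corr(sigma_hat, T, 0, 1, 2)  # rho(u1, u2 | u3): null is false
print(w_zero, w_nonzero)
```

For i = 0, j = 2, h = 1 and k = 4 the gradient built in `wald_partial_corr` coincides with the vector ∇g' given above, and each statistic is compared with a χ²(1) critical value.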

2.3. Cointegration case

A typical feature of economic time series data in which there is some form of causal
dependence is cointegration. This term denotes the phenomenon that nonstationary processes
can have linear combinations that are stationary. That is, suppose that each component
$Y_{it}$ of $Y_t = (Y_{1t}, \ldots, Y_{kt})'$, which follows the VAR process

$$Y_t = A_1 Y_{t-1} + \ldots + A_p Y_{t-p} + u_t,$$

is nonstationary and integrated of order one ($\sim I(1)$). This means that the VAR process
$Y_t$ is not stable, i.e. $\det(I_k - A_1 z - \ldots - A_p z^p)$ is equal to zero for some $|z| \leq 1$ (Lütkepohl,
2006), and that each component $\Delta Y_{it}$ of $\Delta Y_t = (Y_t - Y_{t-1})$ is stationary ($I(0)$), that
is, it has time-invariant means, variances and covariance structure. A linear combination
$c_1 Y_{1t} + \ldots + c_k Y_{kt}$ of the elements of $Y_t$ is called a cointegrating relationship if it is
stationary ($I(0)$).

If the VAR process is unstable with the presence of cointegrating relationships, it is more
appropriate (Lütkepohl, 2006; Johansen, 2006) to estimate the following
re-parametrization of the VAR model, called Vector Error Correction Model (VECM):

$$\Delta Y_t = F_1 \Delta Y_{t-1} + \ldots + F_{p-1} \Delta Y_{t-p+1} - G Y_{t-p} + u_t, \tag{12}$$

where $F_i = -(I_k - A_1 - \ldots - A_i)$, for $i = 1, \ldots, p - 1$, and $G = I_k - A_1 - \ldots - A_p$. The
$(k \times k)$ matrix $G$ has rank $r$ and thus $G$ can be written as $HC$ with $H$ and $C'$ of dimension
$(k \times r)$ and of rank $r$. $C \equiv [c_1, \ldots, c_r]'$ is called the cointegrating matrix.

Let $\tilde{C}$, $\tilde{H}$, and $\tilde{F}_i$ be the maximum likelihood estimators of $C$, $H$, and $F_i$ according to
Johansen's (1988, 1991) approach. Then the asymptotic distribution of $\tilde{\Sigma}_u$, that is, the
maximum likelihood estimator of the covariance matrix of $u_t$, is:

$$\sqrt{T} \, [\mathrm{vech}(\tilde{\Sigma}_u) - \mathrm{vech}(\Sigma_u)] \xrightarrow{d} N(0, \, 2 D_k^{+} (\Sigma_u \otimes \Sigma_u) D_k^{+\prime}), \tag{13}$$

which is equivalent to equation (11) (see there for the definition of the various operators).
Thus, it turns out that the asymptotic distribution of the maximum likelihood estimator
$\tilde{\Sigma}_u$ is the same as that of the OLS estimator $\hat{\Sigma}_u$ for the case of stable VAR.

Thus, the method described for testing zero partial correlations among residuals can be
applied straightforwardly to cointegrated data. The model is estimated as
a VECM using Johansen's (1988, 1991) approach, correlations are
tested exploiting the asymptotic distribution of $\tilde{\Sigma}_u$, and finally the model can be
parameterized back in its VAR form of equation (3).
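The re-parametrization from the VAR to the VECM form in section 2.3 can be checked numerically with a short sketch. The helper name `var_to_vecm` and all numerical values are assumptions made for illustration. The sketch verifies that both forms imply the same ΔY_t for an arbitrary history, and shows a simple made-up VAR(1) for which G has reduced rank.

```python
import numpy as np

def var_to_vecm(A_list):
    """Map VAR matrices A_1..A_p to VECM matrices F_1..F_{p-1} and G."""
    k = A_list[0].shape[0]
    I = np.eye(k)
    cum = np.cumsum(np.stack(A_list), axis=0)   # cum[i] = A_1 + ... + A_{i+1}
    F_list = [-(I - cum[i]) for i in range(len(A_list) - 1)]
    G = I - cum[-1]
    return F_list, G

rng = np.random.default_rng(0)
k, p = 2, 3
A_list = [0.3 * rng.standard_normal((k, k)) for _ in range(p)]
F_list, G = var_to_vecm(A_list)

# Arbitrary history Y_{t-1}, Y_{t-2}, Y_{t-3} and disturbance u_t
lags = [rng.standard_normal(k) for _ in range(p)]
u_t = rng.standard_normal(k)

# Delta Y_t implied by the VAR form
y_t = sum(A @ y for A, y in zip(A_list, lags)) + u_t
delta_var = y_t - lags[0]

# Delta Y_t implied by the VECM form
delta_vecm = -G @ lags[-1] + u_t
for i, F in enumerate(F_list):
    delta_vecm = delta_vecm + F @ (lags[i] - lags[i + 1])

# Hypothetical VAR(1) with one unit root: G = I - A_1 has rank r = 1
A1 = np.array([[0.5, 0.5],
               [0.0, 1.0]])
_, G1 = var_to_vecm([A1])
rank_G1 = np.linalg.matrix_rank(G1)

print(np.allclose(delta_var, delta_vecm), rank_G1)
```

In the last example the second variable is a random walk and the first adjusts towards it, so G1 has a single non-zero row, proportional to a cointegrating vector (1, -1).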

2.4. Summary of the search procedure

The graphical causal models approach to SVAR identification, which we suggest in case of

Gaussian and linear processes, can be summarized as follows.

Step 1. Estimate the VAR model $Y_t = A_1 Y_{t-1} + \ldots + A_p Y_{t-p} + u_t$ with the usual
specification tests about normality, zero autocorrelation of residuals, lags, and unit roots
(see Lütkepohl, 2006). If the hypothesis of nonstationarity is rejected, estimate the VAR
model via OLS (equivalent to MLE under the assumption of normality of the errors). If unit
root tests do not reject I(1) nonstationarity in the data, specify the model as VECM, testing
the presence of cointegrating relationships. If tests suggest the presence of cointegrating
relationships, estimate the model as VECM. If cointegration is rejected, estimate the VAR
model in first differences.

Step 2. Run tests for zero partial correlations between the elements $u_{1t}, \ldots, u_{kt}$ of $u_t$ using
the Wald statistics on the basis of the asymptotic distribution of the covariance matrix of
$u_t$. Note that not all possible partial correlations $\rho(u_{it}, u_{jt} \mid u_{ht}, \ldots)$ need to be tested, but
only those necessary for step 3.

Step 3. Apply a causal search algorithm to recover the causal structure among $u_{1t}, \ldots, u_{kt}$,
which is equivalent to the causal structure among $Y_{1t}, \ldots, Y_{kt}$ (cf. section 1.2 and see
Moneta 2003). In case of an acyclic (no feedback loops) and causally sufficient (no latent
variables) structure, the suggested algorithm is the PC algorithm of Spirtes et al. (2000,
pp. 84-85). Moneta (2008) suggested a few modifications to the PC algorithm in order to
make the orientation of edges compatible with as many conditional independence tests as
possible. This increases the computational time of the search algorithm, but considering
the fact that VAR models deal with a small number of time series variables (rarely more
than six to eight; see Bernanke et al. 2005), this slowing down does not create a serious
concern in this context. Table 1 reports the modified PC algorithm. In case of an acyclic
structure without causal sufficiency (i.e. possibly including latent variables), the suggested
algorithm is FCI (Spirtes et al. 2000, pp. 144-145). In the case of no latent variables
and in the presence of feedback loops, the suggested algorithm is CCD (Richardson and
Spirtes, 1999). There is no algorithm in the literature which is consistent for search when
both latent variables and feedback loops may be present. If the goal of the study is only
impulse response analysis (i.e. tracing out the effects of structural shocks $\varepsilon_{1t}, \ldots, \varepsilon_{kt}$ on
$Y_t, Y_{t+1}, \ldots$) and neither contemporaneous feedbacks nor latent variables can be excluded
a priori, a possible solution is to apply only steps (A) and (B) of the PC algorithm. If the
resulting set of possible causal structures (represented by an undirected graph) contains a
manageable number of elements, one can study the characteristics of the impulse response
functions which are robust across all the possible causal structures, where the presence of
both feedbacks and latent variables is allowed (Moneta, 2004).

Step 4. Calculate structural coefficients and impulse response functions. If the output of
Step 3 is a set of causal structures, run sensitivity analysis to investigate the robustness of
the conclusions under the different possible causal structures. Bootstrap procedures may
also be applied to determine which is the most reliable causal order (see simulations and
applications in Demiralp et al., 2008).
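Steps (A) and (B) of the search algorithm reported in Table 1 (forming the complete graph and removing edges) can be sketched as below. This is a simplified illustration and not the modified PC algorithm itself: the names `is_independent` and `pc_skeleton` are hypothetical, conditional independence is tested with Fisher's z of equation (8) rather than the Wald test of section 2.2, only the first separating set found for each pair is stored, and the orientation steps (C) and (D) are omitted.

```python
import itertools
import math
import numpy as np

def is_independent(cov, T, i, j, cond, alpha=0.05):
    """Fisher's z test of zero partial correlation between i and j given `cond`."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(cov[np.ix_(idx, idx)])
    rho = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    rho = max(min(rho, 0.999999), -0.999999)       # guard against log(0)
    z = 0.5 * math.sqrt(T - len(cond) - 3) * math.log((1 + rho) / (1 - rho))
    return math.erfc(abs(z) / math.sqrt(2)) > alpha

def pc_skeleton(cov, T, alpha=0.05):
    """Steps (A) and (B): start from the complete graph and cut edges."""
    k = cov.shape[0]
    adj = {v: set(range(k)) - {v} for v in range(k)}   # (A) complete undirected graph
    sepset = {}
    n = 0
    # (B) increase the size n of the conditioning sets until none is large enough
    while any(len(adj[h]) >= n + 1 for h in range(k)):
        for h in range(k):
            for i in sorted(adj[h]):
                others = adj[h] - {i}
                if len(others) < n:
                    continue
                for S in itertools.combinations(sorted(others), n):
                    if is_independent(cov, T, h, i, S, alpha):
                        adj[h].discard(i)
                        adj[i].discard(h)
                        sepset[frozenset((h, i))] = set(S)
                        break
        n += 1
    return adj, sepset

# Hypothetical population covariance of the chain u1 -> u2 -> u3 (0-based indices 0, 1, 2)
cov_chain = np.array([[1.0, 0.8, 0.64],
                      [0.8, 1.64, 1.312],
                      [0.64, 1.312, 2.0496]])
adjacency, separating_sets = pc_skeleton(cov_chain, T=1000)
print(adjacency, separating_sets)
```

For the chain used here, the edge between the first and third variables is removed once the middle variable enters the conditioning set, leaving the undirected skeleton that the orientation steps would then work on.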

3. Identification via independent component analysis

The methods considered in the previous section use tests for zero partial correlation on
the VAR-residuals to obtain (partial) information about the contemporaneous structure
in an SVAR model with Gaussian shocks. In this section we show how non-Gaussian
and independent shocks can be exploited for model identification by using the statistical
method of 'Independent Component Analysis' (ICA, see Comon (1994); Hyvärinen et al.
(2001)). The method is again based on the VAR-residuals $u_t$, which can be obtained as in
the Gaussian case by estimating the VAR model using for example ordinary least squares
or least absolute deviations, and can be tested for non-Gaussianity using any normality test
(such as the Shapiro-Wilk or Jarque-Bera test).

To motivate, we note that, from equations (3) and (4) (with matrix $\Gamma_0$) or the Cholesky
factorization in section 1.2 (with matrix $PD^{-1}$), the VAR-disturbances $u_t$ and the structural
shocks $\varepsilon_t$ are connected by

$$u_t = \Gamma_0^{-1} \varepsilon_t = P D^{-1} \varepsilon_t \tag{14}$$

with square matrices $\Gamma_0$ and $PD^{-1}$, respectively. Equation (14) has two important
properties: First, the vectors $u_t$ and $\varepsilon_t$ are of the same length, meaning that there are as many
residuals as structural shocks. Second, the residuals $u_t$ are linear mixtures of the shocks
$\varepsilon_t$, connected by the 'mixing matrix' $\Gamma_0^{-1}$. This resembles the ICA model, when placing
certain assumptions on the shocks $\varepsilon_t$.

In short, the ICA model is given by x = As, where x are the mixed components, s the

independent, non-Gaussian sources, and A a square invertible mixing matrix (meaning that

there are as many mixtures as independent components). Given samples from the mixtures

x, ICA estimates the mixing matrix A and the independent components s, by linearly

transforming x in such a way that the dependencies among the independent components

s are minimized. The solution is unique up to ordering, sign and scaling (Comon, 1994;

Hyvärinen et al., 2001).
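To make the ICA model x = As concrete, the sketch below mixes independent non-Gaussian shocks and recovers them with a basic symmetric FastICA iteration written in NumPy. It is an illustrative sketch under assumptions, not the estimation procedure of any cited work: the function names `sym_decorrelate` and `fastica`, the tanh nonlinearity, the uniform shocks, and the matrix `gamma0` are all choices made here for the example.

```python
import numpy as np

def sym_decorrelate(W):
    """Symmetric decorrelation: W <- (W W')^{-1/2} W."""
    eigval, eigvec = np.linalg.eigh(W @ W.T)
    return eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T @ W

def fastica(X, n_iter=200, tol=1e-8, seed=0):
    """Basic symmetric FastICA. X has shape (n_samples, k).

    Returns (A_hat, S_hat) such that X - mean(X) = S_hat @ A_hat.T.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    n, k = Xc.shape
    # Whitening: Z = Xc @ K.T has identity sample covariance
    eigval, eigvec = np.linalg.eigh(Xc.T @ Xc / n)
    K = (eigvec / np.sqrt(eigval)).T
    Z = Xc @ K.T
    W = sym_decorrelate(rng.standard_normal((k, k)))
    for _ in range(n_iter):
        WZ = Z @ W.T                                  # current source estimates
        g = np.tanh(WZ)
        g_prime = 1.0 - g ** 2
        W_new = (g.T @ Z) / n - np.diag(g_prime.mean(axis=0)) @ W
        W_new = sym_decorrelate(W_new)
        converged = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    unmix = W @ K                                     # s = unmix @ x
    return np.linalg.inv(unmix), Xc @ unmix.T

# Hypothetical example: uniform (non-Gaussian) shocks mixed by Gamma_0^{-1}
rng = np.random.default_rng(0)
gamma0 = np.array([[1.0, 0.0, 0.0],
                   [-0.8, 1.0, 0.0],
                   [0.3, -0.5, 1.0]])
eps = rng.uniform(-1.0, 1.0, size=(5000, 3))
U = eps @ np.linalg.inv(gamma0).T                     # rows are u_t' = (Gamma_0^{-1} eps_t)'
A_hat, S_hat = fastica(U)

# Absolute correlations between estimated and true shocks (rows: estimated, columns: true)
corr = np.abs(np.corrcoef(S_hat.T, eps.T)[:3, 3:])
print(np.round(corr, 2))
```

Each true shock should be almost perfectly correlated (in absolute value) with exactly one estimated component, but in an arbitrary order and with arbitrary sign and scale, which is the indeterminacy referred to above.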

By comparing the ICA model x = As and equation (14), we see a one-to-one
correspondence of the mixtures x to the residuals $u_t$ and the independent components s to the shocks
$\varepsilon_t$. Thus, to be able to apply ICA, we need to assume that the shocks are non-Gaussian
and mutually independent. We want to emphasize that no specific non-Gaussian
distribution is assumed for the shocks, but only that they cannot be Gaussian.¹ For the shocks
to be mutually independent their joint distribution has to factorize into the product of the
marginal distributions. In the non-Gaussian setting, this implies zero partial correlation,
but the converse is not true (as opposed to the Gaussian case where the two statements
are equivalent). Thus, for non-Gaussian distributions conditional independence is a much
stronger requirement than uncorrelatedness.

Under the assumption that the shocks εtare non-Gaussian and independent, equation

(14) follows exactly the ICA-model and applying ICA to the VAR residuals ut yields a

unique solution (up to ordering, sign, and scaling) for the mixing matrix Γ−1

0

and the

1. Actually, the requirement is that at most one of the residuals can be Gaussian.

104

Page 11

Causal Search in SVAR

Table 1: Search algorithm (adapted from the PC Algorithm of Spirtes et al. (2000: 84-85);
in bold character the modifications).

Under the assumption of Gaussianity, conditional independence is tested by zero partial
correlation tests.

(A) (connect everything):
    Form the complete undirected graph G on the vertex set $u_{1t}, \ldots, u_{kt}$ so that each
    vertex is connected to any other vertex by an undirected edge.

(B) (cut some edges):
    n = 0
    repeat:
        repeat:
            select an ordered pair of variables $u_{ht}$ and $u_{it}$ that are adjacent in G
            such that the number of variables adjacent to $u_{ht}$ is equal to or greater
            than n + 1. Select a set S of n variables adjacent to $u_{ht}$ such that
            $u_{it} \notin S$. If $u_{ht} \perp\!\!\!\perp u_{it} \mid S$, delete edge $u_{ht} - u_{it}$ from G;
        until all ordered pairs of adjacent variables $u_{ht}$ and $u_{it}$ such that the
        number of variables adjacent to $u_{ht}$ is equal to or greater than n + 1 and all sets
        S of n variables adjacent to $u_{ht}$ such that $u_{it} \notin S$ have been checked to see if
        $u_{ht} \perp\!\!\!\perp u_{it} \mid S$;
        n = n + 1;
    until for each ordered pair of adjacent variables $u_{ht}$, $u_{it}$, the number of adjacent
    variables to $u_{ht}$ is less than n + 1;

(C) (build colliders):
    for each triple of vertices $u_{ht}$, $u_{it}$, $u_{jt}$ such that the pair $u_{ht}$, $u_{it}$ and the pair
    $u_{it}$, $u_{jt}$ are each adjacent in G but the pair $u_{ht}$, $u_{jt}$ is not adjacent in G, orient
    $u_{ht} - u_{it} - u_{jt}$ as $u_{ht} \rightarrow u_{it} \leftarrow u_{jt}$ if and only if $u_{it}$ does not belong to any
    set of variables S such that $u_{ht} \perp\!\!\!\perp u_{jt} \mid S$;

(D) (direct some other edges):
    repeat:
        if $u_{at} \rightarrow u_{bt}$, $u_{bt}$ and $u_{ct}$ are adjacent, $u_{at}$ and $u_{ct}$ are not adjacent, and
        $u_{bt}$ belongs to every set S such that $u_{at} \perp\!\!\!\perp u_{ct} \mid S$, then orient $u_{bt} - u_{ct}$ as
        $u_{bt} \rightarrow u_{ct}$;
        if there is a directed path from $u_{at}$ to $u_{bt}$, and an edge between $u_{at}$ and $u_{bt}$,
        then orient $u_{at} - u_{bt}$ as $u_{at} \rightarrow u_{bt}$;
    until no more edges can be oriented.
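Steps (A) and (B) of Table 1 can be illustrated with a minimal Python sketch. It assumes Gaussian residuals, so conditional independence is checked with a Fisher z-test on partial correlations. The function names (`partial_corr`, `is_independent`, `pc_skeleton`), the significance level `alpha`, and the use of NumPy are illustrative assumptions, not part of the original procedure.

```python
import itertools
import math
import numpy as np

def partial_corr(corr, i, j, cond):
    """Partial correlation of variables i and j given the indices in cond."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(corr[np.ix_(idx, idx)])  # precision matrix of the subset
    return -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])

def is_independent(corr, n_obs, i, j, cond, alpha=0.05):
    """Fisher z-test of the null hypothesis of zero partial correlation."""
    r = max(min(partial_corr(corr, i, j, cond), 0.999999), -0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n_obs - len(cond) - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p_value > alpha

def pc_skeleton(residuals, alpha=0.05):
    """Steps (A) and (B): return adjacency sets and separating sets."""
    n_obs, k = residuals.shape
    corr = np.corrcoef(residuals, rowvar=False)
    adj = {i: set(range(k)) - {i} for i in range(k)}     # (A) complete graph
    sepset = {}
    n = 0
    while any(len(adj[h]) >= n + 1 for h in range(k)):   # (B) cut some edges
        for h in range(k):
            for i in sorted(adj[h]):
                if i not in adj[h]:
                    continue  # edge was removed earlier in this sweep
                # all sets S of n variables adjacent to h that exclude i
                for S in itertools.combinations(sorted(adj[h] - {i}), n):
                    if is_independent(corr, n_obs, h, i, S, alpha):
                        adj[h].discard(i)
                        adj[i].discard(h)
                        sepset[frozenset((h, i))] = set(S)
                        break
        n += 1
    return adj, sepset
```

The separating sets returned by `pc_skeleton` are what steps (C) and (D) would consult when orienting edges; those steps are not covered by this sketch.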


independent components $\varepsilon_t$ (i.e. the structural shocks in our case). However, the ambiguities

of ICA make it hard to directly interpret the shocks found by ICA since without further

analysis we cannot relate the shocks directly to the measured variables.

Hence, we assume that the residuals $u_t$ follow a linear non-Gaussian acyclic model
(Shimizu et al., 2006), which means that the contemporaneous structure is represented
by a DAG (directed acyclic graph). In particular, the model is given by

$$u_t = B_0 u_t + \varepsilon_t \qquad (15)$$

with a matrix $B_0$ whose diagonal elements are all zero and which, if permuted according to the
causal order, is strictly lower triangular. By rewriting equation (15) as
$(I - B_0) u_t = \varepsilon_t$ we see that

$$\Gamma_0 = I - B_0. \qquad (16)$$

From this equation it follows that the matrix $B_0$ describes the contemporaneous structure
of the variables $Y_t$ in the SVAR model as shown in equation (6). Thus, if we can identify the
matrix $\Gamma_0$, we also obtain the matrix $B_0$ for the contemporaneous effects. As pointed out
above, the matrix $\Gamma_0^{-1}$ (and hence $\Gamma_0$) can be estimated using ICA up to ordering, scaling,
and sign. With the restriction that $B_0$ represents an acyclic system, we can resolve these
ambiguities and are able to fully identify the model. For simplicity, let us assume that the
variables are arranged according to a causal ordering, so that the matrix $B_0$ is strictly lower
triangular. From equation (16) it then follows that the matrix $\Gamma_0$ is lower triangular with all
ones on the diagonal. Using this information, the ambiguities of ICA can be resolved in the
following way.

The lower triangularity of $B_0$ allows us to find the unique permutation of the rows of
$\Gamma_0$ which yields all non-zero elements on the diagonal of $\Gamma_0$, meaning that we replace the
matrix $\Gamma_0$ with $Q_1 \Gamma_0$, where $Q_1$ is the uniquely determined permutation matrix. Finding
this permutation resolves the ordering-ambiguity of ICA and links the shocks $\varepsilon_t$ to the
components of the residuals $u_t$ in a one-to-one manner. The sign- and scaling-ambiguity is
now easy to fix by simply dividing each row of $\Gamma_0$ (the row-permuted version from above)
by the corresponding diagonal element, yielding all ones on the diagonal, as implied by
Equation (16). This ensures that the connection strength of the shock $\varepsilon_t$ on the residual $u_t$
is fixed to one in our model (Equation (15)).
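The two steps just described (row permutation, then row normalization) can be sketched in Python as follows. The sketch assumes that an ICA routine has already returned an unmixing estimate `W` equal to $\Gamma_0$ up to row order and row scaling. The function name `resolve_ica_ambiguities`, the brute-force search over permutations, and the criterion used to rank permutations (small values of the sum of $1/|W_{ii}|$) are illustrative assumptions and only practical for a handful of variables.

```python
import itertools
import numpy as np

def resolve_ica_ambiguities(W):
    """Turn an ICA unmixing estimate W into (Gamma0, B0).

    W is assumed to equal Gamma0 up to a row permutation and row scaling.
    """
    k = W.shape[0]
    # Pick the row permutation whose diagonal is farthest from zero:
    # zeros on the diagonal are penalized heavily through 1/|W_ii|.
    best = min(
        itertools.permutations(range(k)),
        key=lambda p: sum(1.0 / max(abs(W[p[r], r]), 1e-12) for r in range(k)),
    )
    gamma0 = W[list(best), :]                    # row-permuted matrix Q1 * W
    gamma0 = gamma0 / np.diag(gamma0)[:, None]   # unit diagonal fixes sign and scale
    b0 = np.eye(k) - gamma0                      # equation (16): Gamma0 = I - B0
    return gamma0, b0
```

With estimated (noisy) matrices no entry is exactly zero, which is why the sketch ranks permutations by how far the diagonal is from zero instead of testing for exact zeros.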

For the general case where $B_0$ is not arranged in the causal order, the above arguments
for solving the ambiguities still apply. Furthermore, we can find the causal order of the
contemporaneous variables by performing simultaneous row- and column-permutations on
$\Gamma_0$ yielding the matrix closest to lower triangular, in particular $\tilde{\Gamma}_0 = Q_2 \Gamma_0 Q_2^{\top}$ with an
appropriate permutation matrix $Q_2$. In case none of these permutations leads to a close to
lower triangular matrix, a warning is issued.

Essentially, the assumption of acyclicity allows us to uniquely connect the structural
shocks $\varepsilon_t$ to the components of $u_t$ and fully identify the contemporaneous structure. Details
of the procedure can be found in (Shimizu et al., 2006; Hyvärinen et al., 2010). In the
sense of the Cholesky factorization of the covariance matrix explained in Section 1 (with
$PD^{-1} = \Gamma_0^{-1}$), full identifiability means that a causal order among the contemporaneous
variables can be determined.
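The search for the causal order by simultaneous row and column permutations can be sketched as follows. The function name `causal_order`, the squared-mass criterion for "closest to lower triangular", and the tolerance `tol` that triggers the warning are assumptions made for this illustration. The exhaustive search is feasible only for small systems.

```python
import itertools
import warnings
import numpy as np

def causal_order(gamma0, tol=1e-8):
    """Return the variable ordering that makes gamma0 closest to lower triangular."""
    k = gamma0.shape[0]

    def upper_mass(order):
        # apply the same permutation to rows and columns: Q2 * Gamma0 * Q2'
        permuted = gamma0[np.ix_(order, order)]
        return float(np.sum(np.triu(permuted, k=1) ** 2))

    best = min((list(p) for p in itertools.permutations(range(k))), key=upper_mass)
    if upper_mass(best) > tol:
        warnings.warn("no permutation yields a (nearly) lower triangular matrix")
    return best
```

On estimated matrices the entries above the diagonal are never exactly zero, so in practice `tol` would have to be chosen relative to the estimation noise.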


Besides yielding full identification, the ICA-based procedure has the further benefit,
when shocks are non-Gaussian, that it does not rely on the faithfulness assumption,
which was necessary in the Gaussian case.

We note that there are many ways of exploiting non-Gaussian shocks for model identification
as alternatives to directly using ICA. One such approach was introduced by Shimizu
et al. (2009). Their method relies on iteratively finding an exogenous variable and regressing
out its influence on the remaining variables. An exogenous variable is characterized by
being independent of the residuals when regressing any other variable in the model on it.
Starting from the model in equation (15), this procedure returns a causal ordering of the
variables $u_t$, and then the matrix $B_0$ can be estimated using the Cholesky approach.
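Once a causal ordering is available, the last step can be sketched in Python. The sketch takes "the Cholesky approach" to mean the relation $PD^{-1} = \Gamma_0^{-1}$ quoted earlier in this section, where $P$ is the lower triangular Cholesky factor of the residual covariance matrix and $D$ is the diagonal matrix holding the diagonal of $P$. The function name `b0_from_order` and its arguments are hypothetical.

```python
import numpy as np

def b0_from_order(residuals, order):
    """Estimate B0 from VAR residuals (shape n x k) given a causal ordering.

    order lists the variable indices from first (exogenous) to last.
    """
    k = residuals.shape[1]
    u = residuals[:, order]                  # arrange variables in causal order
    sigma = np.cov(u, rowvar=False)          # residual covariance matrix
    P = np.linalg.cholesky(sigma)            # lower triangular, sigma = P P'
    D = np.diag(np.diag(P))
    gamma0 = D @ np.linalg.inv(P)            # from P D^{-1} = Gamma0^{-1}
    b0_ordered = np.eye(k) - gamma0          # equation (16)
    b0 = np.zeros((k, k))
    b0[np.ix_(order, order)] = b0_ordered    # map back to the original positions
    return b0
```

By construction the estimate is strictly lower triangular in the supplied ordering, so any error in the ordering carries over to the estimated coefficients.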

One relatively strong assumption of the above methods is the acyclicity of the contemporaneous
structure. Lacerda et al. (2008) proposed an extension in which feedback
loops are allowed. In terms of the matrix $B_0$ this means that it is not restricted to being
lower triangular (in an appropriate ordering of the variables). While in general this model
is not identifiable because we cannot uniquely match the shocks to the residuals, Lacerda
et al. (2008) showed that the model is identifiable when assuming stability of the generating
model in (15) (the largest eigenvalue of $B_0$ in absolute value is smaller than one) and
disjoint cycles.

Another restriction of the above model is that all relevant variables must be included in

the model (causal sufficiency). Hoyer et al. (2008b) extended the above model by allowing

for hidden variables. This leads to an overcomplete basis ICA model, meaning that there

are more independent non-Gaussian sources than observed mixtures. While there exist

methods for estimating overcomplete basis ICA models, those methods which achieve the

required accuracy do not scale well. Additionally, the solution is again only unique up to

ordering, scaling, and sign, and when including hidden variables the ordering-ambiguity

cannot be resolved and in some cases leads to several observationally equivalent models,

just as in the cyclic case above.

We note that it is also possible to combine the approach of section 2 with that described

here. That is, if some of the shocks are Gaussian or close to Gaussian, it may be advan-

tageous to use a combination of constraint-based search and non-Gaussianity-based search.

Such an approach was proposed in Hoyer et al. (2008a). In particular, the proposed method

does not make any assumptions on the distributions of the VAR-residuals ut. Basically, the

PC algorithm (see Section 2) is run first, followed by utilization of whatever non-Gaussianity

there is to further direct edges. Note that there is no need to know in advance which shocks

are non-Gaussian since finding such shocks is part of the algorithm.

Finally, we need to point out that while the basic ICA-based approach does not require

the faithfulness assumption, the extensions discussed at the end of this section do.

4. Nonparametric setting

4.1. Theory

Linear systems dominate VAR, SVAR, and more generally, multivariate time series models

in econometrics. However, it is not always the case that we know how a variable X may

cause another variable Y . It may be the case that we have little or no a priori knowledge


about how Y depends on X. In its most general form, we want to know whether
X is independent of Y conditional on the set of potential graphical parents Z, i.e.

$$H_0: Y \perp\!\!\!\perp X \mid Z, \qquad (17)$$

where Y, X, Z is a set of time series variables. Thereby, we do not per se require an a priori
specification of how Y possibly depends on X. However, constraint-based algorithms typically
specify conditional independence in a very restrictive way. In continuous settings, they
simply test for nonzero partial correlations, or in other words, for linear (in)dependencies.
Hence, these algorithms will fail whenever the data generation process (DGP) includes
nonlinear causal relations.

In search of a more general specification of conditional independence, Chlaß and Moneta
(2010) suggest a procedure based on nonparametric density estimation. Therein, neither
the type of dependency between Y and X, nor the probability distributions of the variables
need to be specified. The procedure exploits the fact that if Y is independent of X
conditional on Z, the density of Y conditional on X and Z equals the density of Y
conditional on Z alone. Hence, hypothesis test (17) translates
into:

$$H_0: \frac{f(Y,X,Z)}{f(X,Z)} = \frac{f(Y,Z)}{f(Z)}. \qquad (18)$$

If we define $h_1(\cdot) := f(Y,X,Z) f(Z)$ and $h_2(\cdot) := f(Y,Z) f(X,Z)$, we have:

$$H_0: h_1(\cdot) = h_2(\cdot). \qquad (19)$$
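The chain from (17) to (19) can be written out step by step. The derivation below is a sketch that assumes all densities exist and are strictly positive wherever they appear in a denominator; this assumption is added here for the illustration.

```latex
\begin{align*}
Y \perp\!\!\!\perp X \mid Z
  &\iff f(Y \mid X, Z) = f(Y \mid Z) \\
  &\iff \frac{f(Y, X, Z)}{f(X, Z)} = \frac{f(Y, Z)}{f(Z)} \\
  &\iff \underbrace{f(Y, X, Z)\, f(Z)}_{h_1(\cdot)}
        = \underbrace{f(Y, Z)\, f(X, Z)}_{h_2(\cdot)} .
\end{align*}
```

The last line has the advantage that no density appears in a denominator, so $h_1$ and $h_2$ can be estimated as plain products of density estimates.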

We estimate $h_1$ and $h_2$ using a kernel smoothing approach (see Wand and Jones, 1995, ch. 4).

Kernel smoothing has the outstanding property that it is insensitive to autocorrelation

phenomena and, therefore, immediately applicable to longitudinal or time series settings

(Welsh et al., 2002).

In particular, we use a so-called product kernel estimator:

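$$\hat{h}_1(x,y,z;b) = \frac{1}{N^2 b^{m+d}} \left\{ \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{b}\right) K\!\left(\frac{Y_i - y}{b}\right) K_p\!\left(\frac{Z_i - z}{b}\right) \right\} \left\{ \sum_{i=1}^{n} K_p\!\left(\frac{Z_i - z}{b}\right) \right\},$$

$$\hat{h}_2(x,y,z;b) = \frac{1}{N^2 b^{m+d}} \left\{ \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{b}\right) K_p\!\left(\frac{Z_i - z}{b}\right) \right\} \left\{ \sum_{i=1}^{n} K\!\left(\frac{Y_i - y}{b}\right) K_p\!\left(\frac{Z_i - z}{b}\right) \right\}, \qquad (20)$$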

where $X_i$, $Y_i$, and $Z_i$ are the $i$th realization of the respective time series, K denotes the
kernel function, b indicates a scalar bandwidth parameter, and $K_p$ represents a product
kernel.²
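A minimal Python sketch of these estimators, evaluated at the sample points themselves, is given below. It uses the kernel $K(u) = (3 - u^2)\phi(u)/2$ and the rule-of-thumb bandwidth $b = n^{-1/8.5}$ from footnote 2. The function names, the restriction to scalar X and Y with a d-dimensional Z, and the normalizing constant $n^2 b^{2d+2}$ (the usual kernel density normalization, applied identically to both estimators) are assumptions of this sketch.

```python
import numpy as np

def kernel(u):
    """Fourth-order kernel K(u) = (3 - u^2) * phi(u) / 2."""
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return (3.0 - u ** 2) * phi / 2.0

def estimate_h1_h2(X, Y, Z, b=None):
    """Evaluate h1-hat and h2-hat at every sample point (X_j, Y_j, Z_j).

    X, Y: arrays of shape (n,); Z: array of shape (n, d).
    """
    n, d = Z.shape
    if b is None:
        b = n ** (-1.0 / 8.5)                        # rule-of-thumb bandwidth
    # entry [i, j] is the kernel weight of sample i at evaluation point j
    kx = kernel((X[:, None] - X[None, :]) / b)
    ky = kernel((Y[:, None] - Y[None, :]) / b)
    kz = np.prod(kernel((Z[:, None, :] - Z[None, :, :]) / b), axis=2)  # product kernel
    f_xyz = (kx * ky * kz).sum(axis=0) / (n * b ** (d + 2))
    f_z = kz.sum(axis=0) / (n * b ** d)
    f_xz = (kx * kz).sum(axis=0) / (n * b ** (d + 1))
    f_yz = (ky * kz).sum(axis=0) / (n * b ** (d + 1))
    return f_xyz * f_z, f_xz * f_yz                  # (h1-hat, h2-hat)
```

Because this kernel takes negative values for $|u| > \sqrt{3}$, the resulting density estimates are not guaranteed to be positive at every point, which matters for any statistic that takes ratios or square roots of them.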

So far, we have shown how we can estimate $h_1$ and $h_2$. To see whether these are different,
we require some measure of the distance between them. There are different
ways to measure the distance between two products of densities:

2. I.e. $K_p((Z_i - z)/b) = \prod_{j=1}^{d} K((Z_{ji} - z_j)/b)$. For our simulations (see next section) we choose the
kernel: $K(u) = (3 - u^2)\phi(u)/2$, with $\phi(u)$ the standard normal probability density function. We use a
“rule-of-thumb” bandwidth: $b = n^{-1/8.5}$.


(i) The weighted Hellinger distance proposed by Su and White (2008):

$$d_H = \frac{1}{n} \sum_{i=1}^{n} \left[ 1 - \sqrt{\frac{h_2(X_i,Y_i,Z_i)}{h_1(X_i,Y_i,Z_i)}} \right]^2 a(X_i,Y_i,Z_i), \qquad (21)$$

where a(∙) is a nonnegative weighting function. Both the weighting function a(∙), and

the resulting test statistic are specified in Su and White (2008).

(ii) The Euclidean distance proposed by Szekely and Rizzo (2004) in their ‘energy test’:

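$$d_E = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \|h_{1i} - h_{2j}\| - \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{n} \|h_{1i} - h_{1j}\| - \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{n} \|h_{2i} - h_{2j}\|, \qquad (22)$$

where $h_{1i} = h_1(X_i,Y_i,Z_i)$, $h_{2i} = h_2(X_i,Y_i,Z_i)$, and $\|\cdot\|$ is the Euclidean norm.³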
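Given arrays holding the values $h_{1i}$ and $h_{2i}$, the two distances (21) and (22) can be computed as in the following Python sketch. The function names are hypothetical, and the weighting function $a(\cdot)$ defaults to one here purely for illustration; the weighting actually used is the one specified in Su and White (2008).

```python
import numpy as np

def hellinger_distance(h1, h2, weights=None):
    """Equation (21). Assumes h2 / h1 is nonnegative at every sample point."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    if weights is None:
        weights = np.ones_like(h1)   # a(.) = 1: an assumption of this sketch
    return float(np.mean((1.0 - np.sqrt(h2 / h1)) ** 2 * weights))

def energy_distance(h1, h2):
    """Equation (22) for scalar h-values, where the Euclidean norm is |.|."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    n = len(h1)
    cross = np.abs(h1[:, None] - h2[None, :]).sum()     # sum over i, j of |h1i - h2j|
    within1 = np.abs(h1[:, None] - h1[None, :]).sum()   # sum over i, j of |h1i - h1j|
    within2 = np.abs(h2[:, None] - h2[None, :]).sum()   # sum over i, j of |h2i - h2j|
    return float(cross / n - within1 / (2 * n) - within2 / (2 * n))
```

Both functions return zero when the two arrays coincide, which corresponds to $H_0$ in (19) holding exactly at the sample points.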

Given these test statistics and their distributions, we compute the type-I error, or p-value

of our test problem (19). If Z = ∅, the tests are available in R-packages energy and cramer.

The Hellinger distance is not suitable here, since one can only test for $Z \neq \emptyset$.

For $Z \neq \emptyset$, our test problem (19) requires higher dimensional kernel density estimation.
The more dimensions, i.e. the more elements in Z, the scarcer the data, and the greater
the distance between two subsequent data points. This so-called curse of dimensionality
strongly reduces the accuracy of a nonparametric estimation (Yatchew, 1998). To circumvent
this problem, we calculate the type-I errors for $Z \neq \emptyset$ by a local bootstrap procedure,
as described in Su and White (2008, pp. 840-841) and Paparoditis and Politis (2000, pp.
144-145). Local bootstrap draws repeatedly with replacement from the sample and counts
how many times the bootstrap statistic is larger than the test statistic of the entire sample.
Details on the local bootstrap procedure can be found in appendix A.

Now, let us see how this procedure fares in those time series settings, where other testing

procedures failed - the case of nonlinear time series.

4.2. Simulation Design

Our simulation design should allow us to see how the search procedures of Section 4.1 perform
in terms of size and power. To identify size properties (type-I error), $H_0$ (19) must hold
everywhere. We call data generation processes for which $H_0$ holds everywhere size-DGPs.
We induce a system of time series $\{V_{1,t}, V_{2,t}, V_{3,t}\}_{t=1}^{n}$ whereby each time series follows an
autoregressive process AR(1) with $a_1 = 0.5$ and error term $e_t \sim N(0,1)$, for instance,
$V_{1,t} = a_1 V_{1,t-1} + e_{V_1,t}$. These time series may cause each other as illustrated in Fig. 1.
Therein, $V_{1,t} \perp\!\!\!\perp V_{2,t} \mid V_{1,t-1}$, since $V_{1,t-1}$ d-separates $V_{1,t}$ from $V_{2,t}$, while $V_{2,t} \perp\!\!\!\perp V_{3,s}$, for any
t and s. Hence, the set of variables Z, conditional on which two sets of variables X and

3. An alternative Euclidean distance is proposed by Baringhaus and Franz (2004) in their Cramer test.

This distance turns out to be dE/2. The only substantial difference from the distance proposed in (ii)

lies in the method to obtain the critical values (see Baringhaus and Franz 2004).
