Identifiability and Consistent Estimation of the Gaussian
Chain Graph Model
Ruixuan Zhao†, Haoran Zhang‡ and Junhui Wang‡
†School of Data Science
City University of Hong Kong
‡Department of Statistics
The Chinese University of Hong Kong
Abstract
The chain graph model admits both undirected and directed edges in one graph, where symmetric conditional dependencies are encoded via undirected edges and asymmetric causal relations are encoded via directed edges. Though frequently encountered in practice, the chain graph model has been largely under-investigated in the literature, possibly due to the lack of identifiability conditions between undirected and directed edges. In this paper, we first establish a set of novel identifiability conditions for the Gaussian chain graph model, exploiting a low-rank plus sparse decomposition of the precision matrix. Further, an efficient learning algorithm is built upon the identifiability conditions to fully recover the chain graph structure. Theoretical analysis of the proposed method is conducted, establishing its asymptotic consistency in recovering the exact chain graph structure. The advantage of the proposed method is also supported by numerical experiments on both simulated examples and a real application to the Standard & Poor's 500 index data.
Keywords: Causal inference, tangent space, directed acyclic graph, Gaussian graphical model,
low-rank plus sparse decomposition
1 Introduction
Graphical models have attracted tremendous attention in recent years, as they provide an efficient modeling framework to characterize various relationships among multiple objects of interest. They find applications in a wide spectrum of scientific domains, ranging from finance [33], information systems [35], genetics [12], and neuroscience [8] to public health [22].
In the literature, two types of graphical models have been extensively studied. The first is the undirected graphical model, which encodes conditional dependencies among the collected nodes via undirected edges. Various learning methods have been proposed to reconstruct the undirected graph, especially under the well-known Gaussian graphical model [12, 3], where conditional dependencies are encoded via the zero pattern of the precision matrix. Another well-studied graphical model is the directed acyclic graphical model, which uses directed edges to represent causal relationships among the collected nodes in a directed acyclic graph (DAG). To reconstruct the DAG structure, the linear Gaussian structural equation model (SEM) has been popularly considered in the literature, where causal relations are encoded via the sign pattern of the coefficient matrix. Various identifiability conditions [30, 27] have been established for the linear Gaussian SEM, leading to a number of DAG learning methods [6, 27].
Another, more flexible, graphical model, known as the chain graph model, can be traced back to the early work of [19, 39]. It admits both undirected and directed edges in one graph, where symmetric conditional dependencies are encoded via undirected edges and asymmetric causal relations are encoded via directed edges. Further, it is often assumed that no semi-directed cycles are allowed in the chain graph. As a direct consequence, the chain graph model can be seen as a special DAG model with multiple chain components, where each chain component is a subset of nodes connected via undirected edges, and directed edges are only allowed across different chain components.
The chain graph model is frequently encountered in practice [5, 15], but has been largely under-investigated in the literature. In fact, the chain graph model admits various interpretations, including the Lauritzen-Wermuth-Frydenberg (LWF) interpretation [19, 13], the multivariate regression (MVR) interpretation [9] and the Andersson-Madigan-Perlman (AMP) interpretation [1]. Each interpretation implies a different set of independence relationships for the chain graph structure, and has led to several structure learning methods, including the IC-like algorithm [36], the CKES algorithm [29] and the decomposition-based algorithm [23] for LWF chain graphs, the PC-like algorithm [34] and the decomposition-based algorithm [16] for MVR chain graphs, and the PC-like algorithm [28] and the decomposition-based algorithm [17] for AMP chain graphs. Yet, all these methods can only estimate some Markov equivalence class of the chain graph model, and provide no guarantee for the reconstruction of the exact chain graph structure, mostly due to the lack of identifiability conditions between undirected and directed edges. Only very recently did [38] extend the equal noise variance assumption for DAGs [30] to establish the identifiability of the chain graph model under the AMP interpretation. Yet, the extended identifiability condition in [38] is rather artificial and difficult to verify in practice. It is also worth mentioning that if the chain components and their causal ordering are known a priori, then the chain graph model degenerates to a sequence of multivariate regression models, and various methods [10, 25, 15] have been developed to recover the graphical structure in this setting.
In this paper, we establish a set of novel identifiability conditions for the Gaussian chain graph model under the AMP interpretation, exploiting a low-rank plus sparse decomposition of the precision matrix. Further, an efficient learning algorithm is developed to recover the exact chain graph structure, including both undirected and directed edges. Specifically, we first reconstruct the undirected edges by estimating the precision matrix of the noise vector through a regularized likelihood optimization. Then, we identify each chain component and determine its causal ordering based on the conditional variances of its nodes. Finally, the directed edges are reconstructed via multivariate regression coupled with truncated singular value decomposition (SVD). Theoretical analysis shows that the proposed method consistently reconstructs the exact chain graph structure, which, to the best of our knowledge, is the first asymptotic consistency result for the chain graph model in the literature. The advantage of the proposed method is supported by numerical experiments on both simulated examples and a real application to the Standard & Poor's 500 index data, which reveals some interesting impacts of the COVID-19 pandemic on the stock market.
The rest of the paper is organized as follows. Section 2 introduces some preliminaries on the chain graph model. Section 3 proposes the identifiability conditions for the linear Gaussian chain graph model, and develops an efficient learning algorithm to reconstruct the exact chain graph structure. The asymptotic consistency of the proposed method is established in Section 4. Numerical experiments on both simulated and real examples are presented in Section 5. Section 6 concludes the paper, and technical proofs are provided in the Appendix. Auxiliary lemmas and further computational details are deferred to a separate Supplementary File.
Before moving to Section 2, we define some notation. For an integer $m$, denote $[m] = \{1, \ldots, m\}$. For a real value $x$, denote $\lceil x \rceil$ as the smallest integer greater than or equal to $x$. For two nonnegative sequences $a_n$ and $b_n$, $a_n \lesssim b_n$ means that there exists a constant $c > 0$ such that $a_n \le c b_n$ when $n$ is sufficiently large. Further, $a_n \lesssim_P b_n$ means that there exists a constant $c > 0$ such that $\Pr(a_n \le c b_n) \to 1$ as $n$ grows to infinity. For a vector $x$, the sub-vector corresponding to an index subset $S$ is denoted as $x_S = (x_i)_{i \in S}$. For a matrix $A = (a_{ij})_{p \times p}$, the sub-matrix corresponding to rows in $S_1$ and columns in $S_2$ is denoted as $A_{S_1,S_2} = (a_{ij})_{i \in S_1, j \in S_2}$, and let $A^{-1}_{S_1,S_2}$ denote the corresponding sub-matrix of $A^{-1}$. Also, let $\|A\|_{1,\mathrm{off}} = \sum_{i \neq j} |a_{ij}|$ and $\|A\|_{\max} = \max_{ij} |a_{ij}|$, let $\|A\|_2$ denote the spectral norm, $\|A\|_*$ the nuclear norm, and $v(A) \in \mathbb{R}^{p^2}$ the vectorization of $A$.
2 Chain graph model
Suppose the joint distribution of $x = (x_1, \ldots, x_p)^\top$ can be depicted as a chain graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where $\mathcal{N} = \{1, \ldots, p\}$ represents the node set and $\mathcal{E} \subset \mathcal{N} \times \mathcal{N}$ represents the edge set containing all undirected and directed edges. To differentiate, we denote $(i - j)$ for an undirected edge between nodes $i$ and $j$, and $(i \to j)$ for a directed edge pointing from node $i$ to node $j$, and suppose that at most one edge is allowed between two nodes. Then, there exists a positive integer $m$ such that $\mathcal{N}$ can be uniquely partitioned into $m$ disjoint chain components $\mathcal{N} = \bigcup_{k=1}^m \tau_k$, where each $\tau_k$ is a connected component of nodes via undirected edges. Suppose that only undirected edges exist within each chain component, and directed edges are only allowed across different chain components [24, 10]. Further suppose that there exists a permutation $\pi = (\pi_1, \ldots, \pi_m)$ such that for $i \in \tau_{\pi_k}$ and $j \in \tau_{\pi_l}$, if $(i \to j) \in \mathcal{E}$, then $k < l$. This excludes the existence of semi-directed cycles in $\mathcal{G}$ [24]. We call such a permutation $\pi$ the causal ordering of the chain components; directed edges can only point from nodes in an upper chain component to nodes in a lower one.
Let $\mathrm{pa}(i) = \{j \in \mathcal{N} : (j \to i) \in \mathcal{E}\}$, $\mathrm{ch}(i) = \{j \in \mathcal{N} : (i \to j) \in \mathcal{E}\}$ and $\mathrm{ne}(i) = \{j \in \mathcal{N} : (j - i) \in \mathcal{E}\}$ denote the parents, children and neighbors of node $i$, respectively. Further, let $\mathrm{pa}(\tau_k) = \bigcup_{i \in \tau_k} \mathrm{pa}(i)$ be the parent set of chain component $\tau_k$. Suppose the joint distribution of $x$ satisfies the Andersson-Madigan-Perlman (AMP) Markov property [1, 20] with respect to $\mathcal{G}$, and follows the linear structural equation model (SEM),
$$x = B x + \epsilon, \qquad (1)$$
where $B = (\beta_{ij})_{p \times p}$ is the coefficient matrix, $\epsilon = (\epsilon_1, \ldots, \epsilon_p)^\top \sim \mathcal{N}(0, \Omega^{-1})$, and $\Omega = (\omega_{ij})_{p \times p}$ is the precision matrix of $\epsilon$. Further, suppose that $\beta_{ij} \neq 0$ if and only if $j \in \mathrm{pa}(i)$, and $\omega_{ij} \neq 0$ if and only if $j \in \mathrm{ne}(i)$. Therefore, the undirected and directed edges in $\mathcal{G}$ can be directly read off from the zero patterns of $\Omega$ and $B$, respectively. The joint density of $x$ can then be factorized as
$$P(x) = \prod_{k=1}^m P(x_{\tau_k} \mid x_{\mathrm{pa}(\tau_k)}), \qquad (2)$$
where $x_{\tau_k} \mid x_{\mathrm{pa}(\tau_k)} \sim \mathcal{N}(B_{\tau_k, \mathrm{pa}(\tau_k)} x_{\mathrm{pa}(\tau_k)}, \Omega_{\tau_k,\tau_k}^{-1})$ for $k \in [m]$, and $\Omega_{\tau_k,\tau_k}$ is not necessarily a diagonal matrix. This is a key feature of the chain graph model that allows undirected edges within each $\tau_k$, and it differs from most existing SEM models in the literature, which assume a diagonal $\Omega$ [31, 30, 27, 6].
To assure the acyclicity among chain components in $\mathcal{G}$, we say $(\Omega, B)$ is CG-feasible if there exists a permutation matrix $P$ such that $P\Omega P^\top$ and $PBP^\top$ share the same block structure, where $P\Omega P^\top$ is a block diagonal matrix and $PBP^\top$ is a block lower triangular matrix with zero diagonal blocks. Figure 1 shows a toy chain graph, as well as the supports of the original and permuted $(\Omega, B)$. Let $\Theta$ denote the precision matrix of $x$; it then follows from (1) that
$$\Theta = (I_p - B)^\top \Omega (I_p - B) =: \Omega + L, \qquad (3)$$
where $L = B^\top \Omega B - B^\top \Omega - \Omega B$.
Figure 1: The left panel displays a toy chain graph with colors indicating different chain components, and the right panel displays the supports of the original $(\Omega, B)$ in the first column and the permuted $(\Omega, B)$ in the second column.
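To make the decomposition in (3) concrete, the following minimal sketch verifies the identity numerically on an assumed toy 3-node graph (one chain component $\{1, 2\}$ with an undirected edge, pointing to the singleton component $\{3\}$); the matrices below are our own illustrative choices.

```python
import numpy as np

# Toy chain graph: undirected edge 1 - 2 within one chain component,
# directed edge 1 -> 3 across components (encoded as beta_{31} != 0).
Omega = np.array([[1.5, 0.4, 0.0],
                  [0.4, 1.5, 0.0],
                  [0.0, 0.0, 1.0]])
B = np.zeros((3, 3))
B[2, 0] = 0.8

I = np.eye(3)
Theta = (I - B).T @ Omega @ (I - B)
L = B.T @ Omega @ B - B.T @ Omega - Omega @ B

assert np.allclose(Theta, Omega + L)   # the identity in (3)
print(np.linalg.matrix_rank(L))        # prints 2: L is low-rank
```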
3 Proposed method
3.1 Identifiability of G
A key challenge in the chain graph model is the identifiability of the graph structure $\mathcal{G}$, due to the fact that $\Omega$ and $B$ are intertwined in the SEM in (1). To proceed, we assume that $\Omega$ is sparse with $\|\Omega\|_0 = S$ and $L$ is a low-rank matrix with $\mathrm{rank}(L) = K$. The sparsity of $\Omega$ implies the sparsity of undirected edges in $\mathcal{G}$, which has been widely adopted in the Gaussian graphical model literature [26, 12, 3]. The low-rankness of $L$ is inherited from that of $B$ [11], which essentially assumes the presence of hub nodes in $\mathcal{G}$, i.e., nodes with multiple children or parents.
Let $L = U_1 D_1 U_1^\top$ be the eigen-decomposition of $L$, where $U_1^\top U_1 = I_K$ and $D_1$ is a $K \times K$ diagonal matrix. We define two linear subspaces,
$$\mathcal{S}(\Omega) = \{S \in \mathbb{R}^{p \times p} : S^\top = S, \text{ and } s_{ij} = 0 \text{ if } \omega_{ij} = 0\}, \qquad (4)$$
$$\mathcal{T}(L) = \{U_1 Y + Y^\top U_1^\top : Y \in \mathbb{R}^{K \times p}\}, \qquad (5)$$
where $\mathcal{S}(\Omega)$ is the tangent space, at the point $\Omega$, of the manifold of symmetric matrices with at most $S$ non-zero entries, and $\mathcal{T}(L)$ is the tangent space, at the point $L$, of the manifold of symmetric matrices with rank at most $K$ [4].
Assumption 1. $\mathcal{S}(\Omega)$ and $\mathcal{T}(L)$ intersect only at the origin; that is, $\mathcal{S}(\Omega) \cap \mathcal{T}(L) = \{0_{p \times p}\}$.

Assumption 1 is the same as the transversality condition in [4], which assures the identifiability of $(\Omega, L)$ in the sense that $\Theta$ can be uniquely decomposed as the sum of a matrix in $\mathcal{S}(\Omega)$ and one in $\mathcal{T}(L)$.

Assumption 2. The $K$ non-zero eigenvalues of $L$ are distinct.

Assumption 2 is necessary to identify the eigen-space of the low-rank matrix $L$, and has been commonly assumed in the matrix perturbation literature [40]. Let $\mathcal{Q}$ be the parameter space of CG-feasible $(\Omega, B)$ such that $\Omega \succ 0$, $\Omega + L \succ 0$, $\|\Omega\|_0 \le S$, $\mathrm{rank}(L) \le K$, and Assumptions 1 and 2 are met. Let $(\Omega^*, B^*)$ denote the true parameters of the linear SEM (1), satisfying $\|\Omega^*\|_0 = S$ and $\mathrm{rank}(L^*) = K$.
Theorem 1. Suppose $(\Omega^*, B^*) \in \mathcal{Q}$. Then, there exists a small $\epsilon > 0$ such that for any $(\Omega, B) \in \mathcal{Q}$ satisfying $\|\Omega - \Omega^*\|_{\max} < \epsilon$ and $\|B - B^*\|_{\max} < \epsilon$, if
$$(I_p - B)^\top \Omega (I_p - B) = (I_p - B^*)^\top \Omega^* (I_p - B^*),$$
then $(\Omega, B) = (\Omega^*, B^*)$.

Theorem 1 establishes the local identifiability of $(\Omega^*, B^*)$ in (1), which further implies the local identifiability of the chain graph $\mathcal{G}$. It essentially states that $(\Omega^*, B^*) \in \mathcal{Q}$ can be uniquely determined, within its neighborhood in $\mathcal{Q}$, by the precision matrix $\Theta^* = (I_p - B^*)^\top \Omega^* (I_p - B^*)$, which can be consistently estimated from the observed sample.
Remark 1 (Identifiability for DAGs). When $\Omega$ is a diagonal matrix, each chain component contains exactly one node and thus $\mathcal{G}$ reduces to a DAG. By Theorem 1, it is identifiable as long as the eigenvalues of $L^*$ are distinct and $e_j \notin \mathrm{span}(U_1^*)$ for any $j \in [p]$, where $U_1^* \in \mathbb{R}^{p \times K}$ contains the eigenvectors of $L^*$ and $\{e_j\}_{j=1}^p$ is the standard basis of $\mathbb{R}^p$. This provides an alternative identifiability condition for DAGs, in contrast to the popularly employed equal error variance condition [30].
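For intuition, the two conditions in Remark 1 can be checked numerically for a given $L^*$. Below is a small sketch under our own naming (`dag_identifiability_check` is a hypothetical helper, not from the paper): it tests distinctness of the $K$ leading eigenvalues and whether any standard basis vector $e_j$ lies in $\mathrm{span}(U_1^*)$, using the fact that $e_j \in \mathrm{span}(U_1^*)$ exactly when its projection onto that span has unit norm.

```python
import numpy as np

def dag_identifiability_check(L_star, K, tol=1e-8):
    """Check the Remark 1 conditions on L*: distinct non-zero eigenvalues,
    and e_j not in span(U1*) for every j. Illustrative sketch only."""
    eigvals, eigvecs = np.linalg.eigh(L_star)
    idx = np.argsort(-np.abs(eigvals))[:K]      # K leading eigenpairs
    lam, U1 = eigvals[idx], eigvecs[:, idx]
    distinct = K < 2 or np.min(np.abs(np.diff(np.sort(lam)))) > tol
    # column j of U1 U1^T equals P_{U1} e_j; e_j is in span(U1) iff its norm is 1
    proj_norms = np.linalg.norm(U1 @ U1.T, axis=0)
    no_basis_vector = np.all(proj_norms < 1 - tol)
    return distinct and no_basis_vector
```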
3.2 Learning algorithm
We now develop a learning algorithm to estimate $(\Omega^*, B^*)$ and reconstruct the chain graph $\mathcal{G}$. Suppose we observe independent copies $x_1, \ldots, x_n \in \mathbb{R}^p$ and denote $X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times p}$. We first estimate $\Omega^*$ via the following regularized likelihood,
$$(\widehat{\Omega}, \widehat{L}) = \operatorname*{argmin}_{\Omega, L} \; -l(\Omega + L) + \lambda_n \big(\|\Omega\|_{1,\mathrm{off}} + \gamma \|L\|_*\big), \quad \text{subject to } \Omega \succ 0 \text{ and } \Omega + L \succ 0, \qquad (6)$$
where $l(\Theta) = -\mathrm{tr}(\Theta \widehat{\Sigma}) + \log\{\det(\Theta)\}$ is the Gaussian log-likelihood with $\widehat{\Sigma} = \frac{1}{n} X^\top X$, and $\lambda_n$ and $\gamma$ are tuning parameters. Here $\|\Omega\|_{1,\mathrm{off}} = \sum_{i \neq j} |\omega_{ij}|$ induces sparsity in $\Omega$, $\|L\|_*$ induces low-rankness of $L$, and the constraints reflect the fact that both $\Omega$ and $\Theta = \Omega + L$ are precision matrices. Note that the optimization task in (6) is convex, and can be efficiently solved via the alternating direction method of multipliers (ADMM; [2]). More computational details are deferred to the Supplementary File.
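To illustrate, the convex program (6) can be prototyped with an off-the-shelf solver. The sketch below uses the cvxpy package rather than the ADMM solver described in the Supplementary File; the function name and the small positive-definiteness margin `eps` are our own choices.

```python
import cvxpy as cp
import numpy as np

def estimate_omega_L(Sigma_hat, lam, gamma, eps=1e-6):
    """Prototype of (6): sparse-plus-low-rank penalized Gaussian likelihood."""
    p = Sigma_hat.shape[0]
    Omega = cp.Variable((p, p), symmetric=True)
    L = cp.Variable((p, p), symmetric=True)
    Theta = Omega + L
    # l(Theta) = -tr(Theta Sigma_hat) + log det(Theta); we minimize -l + penalty
    neg_loglik = cp.trace(Theta @ Sigma_hat) - cp.log_det(Theta)
    off_l1 = cp.norm1(Omega - cp.diag(cp.diag(Omega)))   # ||Omega||_{1,off}
    objective = cp.Minimize(neg_loglik + lam * (off_l1 + gamma * cp.normNuc(L)))
    # Omega > 0 enforced explicitly; Theta > 0 is implied by the log_det domain
    problem = cp.Problem(objective, [Omega >> eps * np.eye(p)])
    problem.solve()
    return Omega.value, L.value
```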
Once $\widehat{\Omega}$ is obtained, we connect nodes $i$ and $j$ with an undirected edge if $\widehat{\omega}_{ij} \neq 0$, which leads to multiple estimated chain components, denoted as $\widehat{\tau}_1, \ldots, \widehat{\tau}_{\widehat{m}}$. To determine the causal ordering of the estimated chain components, for each $\widehat{\tau}_k$ and any $\mathcal{C} \subset [p] \setminus \widehat{\tau}_k$, we define
$$\widehat{D}(\widehat{\tau}_k, \mathcal{C}) = \max_{i \in \widehat{\tau}_k} \Big\{ \widehat{\Sigma}_{ii} - \widehat{\Sigma}_{i\mathcal{C}} \widehat{\Sigma}_{\mathcal{C}\mathcal{C}}^{-1} \widehat{\Sigma}_{\mathcal{C}i} - \widehat{\Omega}^{-1}_{ii} \Big\},$$
where $\widehat{\Sigma}_{ii} - \widehat{\Sigma}_{i\mathcal{C}} \widehat{\Sigma}_{\mathcal{C}\mathcal{C}}^{-1} \widehat{\Sigma}_{\mathcal{C}i}$ is the estimated conditional variance of node $i \in \widehat{\tau}_k$ given the nodes in $\mathcal{C}$, and $\widehat{\Omega}^{-1}_{ii}$ is the estimated variance of node $i \in \widehat{\tau}_k$ given its parent chain components. It is thus clear that $\widehat{D}(\widehat{\tau}_k, \mathcal{C})$ shall be close to 0 if $\mathcal{C}$ contains all upper chain components of $\widehat{\tau}_k$. We start with $\widehat{D}(\widehat{\tau}_k, \emptyset) = \max_{i \in \widehat{\tau}_k} \widehat{\Sigma}_{ii} - \widehat{\Omega}^{-1}_{ii}$ for each chain component $\widehat{\tau}_k$, and select the first chain component by $\widehat{\pi}_1 = \operatorname*{argmin}_{l \in [\widehat{m}]} \widehat{D}(\widehat{\tau}_l, \emptyset)$. Suppose the first $s$ chain components $\widehat{\tau}_{\widehat{\pi}_1}, \ldots, \widehat{\tau}_{\widehat{\pi}_s}$ have been selected; let $\widehat{\mathcal{C}}_s = \bigcup_{k=1}^s \widehat{\tau}_{\widehat{\pi}_k}$ and $\widehat{\pi}_{s+1} = \operatorname*{argmin}_{l \in [\widehat{m}] \setminus \{\widehat{\pi}_1, \ldots, \widehat{\pi}_s\}} \widehat{D}(\widehat{\tau}_l, \widehat{\mathcal{C}}_s)$. We repeat this procedure until the causal ordering of all $\widehat{\tau}_k$'s is determined, denoted as $\widehat{\pi} = (\widehat{\pi}_1, \ldots, \widehat{\pi}_{\widehat{m}})$.
Finally, to estimate $B$, we first compute an intermediate estimate $\widehat{B}^{\mathrm{reg}}$, whose sub-matrix $\widehat{B}^{\mathrm{reg}}_{\widehat{\tau}_{\widehat{\pi}_k}, \widehat{\mathcal{C}}_{k-1}}$ is obtained via a multivariate regression of $x_{\widehat{\tau}_{\widehat{\pi}_k}}$ on $x_{\widehat{\mathcal{C}}_{k-1}}$, as directed edges are only allowed from upper chain components to lower ones. Given $\widehat{B}^{\mathrm{reg}}$, we conduct the singular value decomposition (SVD) $\widehat{B}^{\mathrm{reg}} = \widehat{U}^{\mathrm{reg}} \widehat{D}^{\mathrm{reg}} (\widehat{V}^{\mathrm{reg}})^\top$ with $\widehat{D}^{\mathrm{reg}} = \mathrm{diag}(\widehat{\sigma}^{\mathrm{reg}}_1, \ldots, \widehat{\sigma}^{\mathrm{reg}}_p)$, and then truncate the small singular values to obtain $\widehat{B}^{\mathrm{svd}} = \widehat{U}^{\mathrm{reg}} \widehat{D}^{\mathrm{svd}} (\widehat{V}^{\mathrm{reg}})^\top$, where $\widehat{D}^{\mathrm{svd}} = \mathrm{diag}(\widehat{\sigma}^{\mathrm{svd}}_1, \ldots, \widehat{\sigma}^{\mathrm{svd}}_p)$ with $\widehat{\sigma}^{\mathrm{svd}}_j = 0$ if $\widehat{\sigma}^{\mathrm{reg}}_j \le \kappa_n$ and $\widehat{\sigma}^{\mathrm{svd}}_j = \widehat{\sigma}^{\mathrm{reg}}_j$ if $\widehat{\sigma}^{\mathrm{reg}}_j > \kappa_n$, for some pre-specified $\kappa_n > 0$. The final estimate $\widehat{B} = (\widehat{\beta}_{ij})_{p \times p}$ is obtained by truncating the diagonal and upper triangular blocks to 0, and applying hard thresholding to the lower triangular blocks with some pre-specified $\nu_n > 0$, where $\widehat{\beta}_{ij} = 0$ if $|\widehat{\beta}^{\mathrm{svd}}_{ij}| \le \nu_n$ and $\widehat{\beta}_{ij} = \widehat{\beta}^{\mathrm{svd}}_{ij}$ if $|\widehat{\beta}^{\mathrm{svd}}_{ij}| > \nu_n$. The non-zero elements of $\widehat{B}$ then give the estimated directed edges.
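Continuing the sketch above, the directed-edge step regresses each chain component on all previously ordered components, truncates small singular values at $\kappa_n$, and hard-thresholds the allowed blocks at $\nu_n$; again the function name and layout below are our own.

```python
import numpy as np

def estimate_B(X, ordered_comps, kappa_n, nu_n):
    """Blockwise multivariate regression, SVD truncation, hard thresholding."""
    n, p = X.shape
    B_reg = np.zeros((p, p))
    C = np.array([], dtype=int)
    for comp in ordered_comps:
        if len(C) > 0:
            # OLS of x_comp on x_C fills the block B_reg[comp, C]
            coef, *_ = np.linalg.lstsq(X[:, C], X[:, comp], rcond=None)
            B_reg[np.ix_(comp, C)] = coef.T
        C = np.concatenate([C, comp])
    # Truncated SVD: zero out singular values at or below kappa_n
    U, s, Vt = np.linalg.svd(B_reg)
    s[s <= kappa_n] = 0.0
    B_svd = (U * s) @ Vt
    # Keep only entries allowed by the estimated ordering, then hard-threshold
    B_hat = np.zeros((p, p))
    C = np.array([], dtype=int)
    for comp in ordered_comps:
        if len(C) > 0:
            block = B_svd[np.ix_(comp, C)]
            B_hat[np.ix_(comp, C)] = np.where(np.abs(block) > nu_n, block, 0.0)
        C = np.concatenate([C, comp])
    return B_hat
```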
4 Asymptotic theory
This section quantifies the asymptotic behavior of $(\widehat{\Omega}, \widehat{B})$, and establishes its consistency in reconstructing the chain graph $\mathcal{G}^* = (\mathcal{N}, \mathcal{E}^*)$. Let $\Theta^* = (I_p - B^*)^\top \Omega^* (I_p - B^*)$, and the Fisher information matrix takes the form
$$\mathcal{I}^* = -\mathrm{E}\Big[\frac{\partial^2 l(\Theta^*)}{\partial \Theta^2}\Big] = (\Theta^*)^{-1} \otimes (\Theta^*)^{-1},$$
where $\otimes$ denotes the Kronecker product. For a linear subspace $\mathcal{M}$, let $\mathcal{P}_{\mathcal{M}}$ denote the projection onto $\mathcal{M}$, and $\mathcal{M}^\perp$ denote the orthogonal complement of $\mathcal{M}$. We further define two linear operators $\mathcal{F}: \mathcal{S}(\Omega^*) \times \mathcal{T}(L^*) \to \mathcal{S}(\Omega^*) \times \mathcal{T}(L^*)$ and $\mathcal{F}^\perp: \mathcal{S}(\Omega^*) \times \mathcal{T}(L^*) \to \mathcal{S}(\Omega^*)^\perp \times \mathcal{T}(L^*)^\perp$ such that
$$\mathcal{F}(\Delta_\Omega, \Delta_L) = \big(\mathcal{P}_{\mathcal{S}(\Omega^*)}(\mathcal{I}^* v(\Delta_\Omega + \Delta_L)), \; \mathcal{P}_{\mathcal{T}(L^*)}(\mathcal{I}^* v(\Delta_\Omega + \Delta_L))\big),$$
$$\mathcal{F}^\perp(\Delta_\Omega, \Delta_L) = \big(\mathcal{P}_{\mathcal{S}(\Omega^*)^\perp}(\mathcal{I}^* v(\Delta_\Omega + \Delta_L)), \; \mathcal{P}_{\mathcal{T}(L^*)^\perp}(\mathcal{I}^* v(\Delta_\Omega + \Delta_L))\big).$$
According to Assumption 1 and an auxiliary lemma in the Supplementary File, $\mathcal{F}$ is invertible and thus $\mathcal{F}^{-1}$ is well defined.

Let $g_\gamma(\Omega, L) = \max\{\|\Omega\|_{\max}, \|L\|_2/\gamma\}$, where $\gamma > 0$ is the same as in (6). Let $L^* = U_1^* D_1^* (U_1^*)^\top$ be the eigen-decomposition of $L^*$.
Assumption 3. $g_\gamma\big(\mathcal{F}^\perp \mathcal{F}^{-1}(\mathrm{sign}(\Omega^*), \gamma U_1^* \mathrm{sign}(D_1^*) (U_1^*)^\top)\big) < 1$.

Assumption 3 is essential for establishing the selection and rank consistency of $(\widehat{\Omega}, \widehat{L})$ through the penalties in (6). For example, in the special case where $\Theta^* = I_p$ and $\mathcal{S}(\Omega^*) \perp \mathcal{T}(L^*)$, Assumption 3 simplifies to $\max\{\gamma \|U_1^* \mathrm{sign}(D_1^*) (U_1^*)^\top\|_{\max}, \|\mathrm{sign}(\Omega^*)\|_2/\gamma\} < 1$, implying that $U_1^*$ is not sparse and $\mathrm{sign}(\Omega^*)$ is not a low-rank matrix. Similar technical conditions have also been assumed in the literature [4, 7].
Theorem 2. Suppose $(\Omega^*, B^*) \in \mathcal{Q}$ and Assumption 3 holds. Let $\lambda_n = n^{-1/2+\eta}$ with a sufficiently small positive constant $\eta$. Then, with probability approaching 1, (6) has a unique solution $(\widehat{\Omega}, \widehat{L})$. Furthermore, we have
$$\|\widehat{\Omega} - \Omega^*\|_{\max} \lesssim_P n^{-1/2+2\eta}, \quad \Pr\big(\mathrm{sign}(\widehat{\Omega}) = \mathrm{sign}(\Omega^*)\big) \to 1,$$
$$\|\widehat{L} - L^*\|_{\max} \lesssim_P n^{-1/2+2\eta}, \quad \Pr\big(\mathrm{rank}(\widehat{L}) = \mathrm{rank}(L^*)\big) \to 1. \qquad (7)$$
Theorem 2 shows that $\widehat{\Omega}$ attains both estimation and sign consistency, which implies that the undirected edges in $\mathcal{G}^*$ can be exactly reconstructed with high probability. It can also be shown that $\widehat{B}$ attains both estimation and selection consistency, implying the exact recovery of the directed edges in $\mathcal{G}^*$. Furthermore, given $(\widehat{\Omega}, \widehat{B})$, we reconstruct the chain graph as $\widehat{\mathcal{G}} = (\mathcal{N}, \widehat{\mathcal{E}})$, where $(i \to j) \in \widehat{\mathcal{E}}$ if and only if $\widehat{\beta}_{ji} \neq 0$, and $(i - j) \in \widehat{\mathcal{E}}$ if and only if $\widehat{\omega}_{ij} \neq 0$. The following Theorem 3 establishes the consistency of $\widehat{\mathcal{G}}$.

Theorem 3. Suppose all the conditions of Theorem 2 are satisfied, and set $\kappa_n = n^{-1/2+\eta}$ and $\nu_n = n^{-1/2+2\eta}$. Then, $\Pr(\widehat{\mathcal{G}} = \mathcal{G}^*) \to 1$ as $n \to \infty$.

Theorem 3 shows that the proposed method exactly recovers the chain graph with high probability, in sharp contrast to existing methods that only recover some Markov equivalence class of the chain graph [36, 29, 23, 34, 16, 28, 17].
5 Numerical experiments
5.1 Simulated examples
We examine the numerical performance of the proposed method and compare it against existing structure learning methods for chain graphs, including the decomposition-based algorithm (LCD, [17]) and the PC-like algorithm (PC-like, [28, 17]), as well as the PC algorithm for DAGs (PC, [18]). The implementations of LCD and PC-like are available at https://github.com/majavid/AMPCGs2019. We implement the PC algorithm through the R package pcalg, and then convert the resulting partial DAG to a DAG via pdag2dag. The significance level of all tests in LCD, PC-like and PC is set as $\alpha = 0.05$.
We evaluate the numerical performance of all four methods in terms of the estimation accuracy of the undirected edges, the directed edges, and the overall chain graph. Specifically, we report recall, precision and the Matthews correlation coefficient (MCC) as evaluation metrics for the estimated undirected and directed edges, respectively. Furthermore, we employ the Structural Hamming Distance (SHD) [37, 17] to evaluate the estimated chain graph, which is the number of edge insertions, deletions or flips needed to change the estimated chain graph into the true one. Note that large values of recall, precision and MCC and small values of SHD indicate good estimation performance.
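For concreteness, recall, precision and MCC for an estimated edge set can be computed from the usual confusion-matrix counts, as in the sketch below (our own helper, comparing binary edge-indicator matrices; not the code used in the experiments).

```python
import numpy as np

def edge_metrics(E_hat, E_true):
    """Recall, precision and MCC for binary edge-indicator matrices."""
    est, true = E_hat.astype(bool), E_true.astype(bool)
    tp = np.sum(est & true)
    fp = np.sum(est & ~true)
    fn = np.sum(~est & true)
    tn = np.sum(~est & ~true)
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return recall, precision, mcc
```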
Example 1. We consider a classic two-layer Gaussian graphical model [21, 25] with two layers $A_1 = \{1, \ldots, \lceil 0.1p \rceil\}$ and $A_2 = \{\lceil 0.1p \rceil + 1, \ldots, p\}$, whose structure is illustrated in Figure 2(a). Within each layer, we randomly connect each pair of nodes by an undirected edge with probability 0.02, and note that one layer may contain multiple chain components. Then, we generate directed edges from nodes in $A_1$ to nodes in $A_2$ with probability 0.8. Furthermore, the non-zero values of $\omega_{ij}$ and $\beta_{ij}$ are uniformly generated from $[-1.5, -0.5] \cup [0.5, 1.5]$. To guarantee the positive definiteness of $\Omega$, each diagonal element is set as $\omega_{ii} = \sum_{j=1, j \neq i}^{p} |\omega_{ji}| + 0.1$.
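A sketch of the Example 1 data-generating process is given below; the function name and random seed are our own choices, and the design follows the description above (diagonal dominance makes $\Omega$ positive definite).

```python
import numpy as np

def simulate_example1(p=50, n=500, seed=0):
    """Two-layer chain graph: undirected edges within layers A1, A2;
    directed edges from A1 to A2; data generated via x = (I - B)^{-1} eps."""
    rng = np.random.default_rng(seed)
    p1 = int(np.ceil(0.1 * p))

    def rand_coef():
        # non-zero values drawn uniformly from [-1.5, -0.5] U [0.5, 1.5]
        return rng.uniform(0.5, 1.5) * rng.choice([-1.0, 1.0])

    Omega = np.zeros((p, p))
    for layer in (range(0, p1), range(p1, p)):
        idx = list(layer)
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                if rng.random() < 0.02:
                    Omega[idx[a], idx[b]] = Omega[idx[b], idx[a]] = rand_coef()
    # diagonal dominance guarantees positive definiteness
    np.fill_diagonal(Omega, np.abs(Omega).sum(axis=0) + 0.1)

    B = np.zeros((p, p))
    for i in range(p1):            # directed edge i -> j encoded as beta_{ji}
        for j in range(p1, p):
            if rng.random() < 0.8:
                B[j, i] = rand_coef()

    eps = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)
    X = eps @ np.linalg.inv(np.eye(p) - B).T
    return X, Omega, B
```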
Example 2. The structure of the second chain graph is illustrated in Figure 2(b). Particularly, we randomly connect each pair of nodes by an undirected edge with probability 0.03, and read off the chain components $\{\tau_1, \ldots, \tau_m\}$ from the set of undirected edges. Then, we set the causal ordering of the chain components as $(\pi_1, \ldots, \pi_m) = (1, \ldots, m)$. For each chain component $\tau_k$, we randomly select nodes as hubs with probability 0.2, and let each hub node point to the nodes in $\bigcup_{i=k+1}^{m} \tau_i$ with probability 0.8. Similarly, the non-zero values of $\omega_{ij}$ and $\beta_{ij}$ are uniformly generated from $[-1.5, -0.5] \cup [0.5, 1.5]$, and $\omega_{ii} = \sum_{j=1, j \neq i}^{p} |\omega_{ji}| + 0.1$.
Figure 2: The chain graph structures in Examples 1 and 2.
For each example, we consider four cases with $(p, n) = (50, 500), (50, 1000), (100, 500)$ and $(100, 1000)$, and the averaged performance of all four methods over 50 independent replications is summarized in Tables 1 and 2. As PC only outputs DAGs with no undirected edges, its evaluation metrics on $\widehat{\Omega}$ are all reported as NA.
Table 1: The averaged evaluation metrics of all the methods in Example 1, together with their standard errors in parentheses.

(p, n) Method Recall($\widehat{\Omega}$) Precision($\widehat{\Omega}$) MCC($\widehat{\Omega}$) Recall($\widehat{B}$) Precision($\widehat{B}$) MCC($\widehat{B}$) SHD
(50,500) Proposed 0.6482 (0.0255) 0.8493 (0.0195) 0.7327 (0.0203) 0.2789 (0.0080) 0.4971 (0.0133) 0.3361 (0.0097) 187.8600 (2.4026)
LCD 0.1000 (0.0120) 0.1174 (0.0128) 0.0934 (0.0118) 0.1350 (0.0047) 0.5862 (0.0125) 0.2579 (0.0077) 192.7600 (1.5514)
PC-like 0.0441 (0.0075) 0.0424 (0.0084) 0.0270 (0.0076) 0.0055 (0.0007) 0.0679 (0.0090) -0.0006 (0.0025) 225.9800 (1.1569)
PC NA NA NA 0.0244 (0.0019) 0.0835 (0.0059) 0.0067 (0.0033) 238.1600 (1.1691)
(50,1000) Proposed 0.6729 (0.0237) 0.8380 (0.0208) 0.7425 (0.0196) 0.3060 (0.0095) 0.4663 (0.0124) 0.3386 (0.0107) 194.1200 (2.9303)
LCD 0.1174 (0.0139) 0.1394 (0.0113) 0.1122 (0.0115) 0.1869 (0.0054) 0.6762 (0.0113) 0.3324 (0.0078) 181.1600 (1.8179)
PC-like 0.0391 (0.0067) 0.0303 (0.0062) 0.0161 (0.0063) 0.0076 (0.0009) 0.0909 (0.0101) 0.0054 (0.0029) 231.7400 (1.0909)
PC NA NA NA 0.0296 (0.0019) 0.0898 (0.0051) 0.0109 (0.0031) 242.6000 (1.3613)
(100,500) Proposed 0.3475 (0.0092) 0.9987 (0.0009) 0.5832 (0.0080) 0.3480 (0.0051) 0.5383 (0.0068) 0.3984 (0.0059) 732.9800 (5.7756)
LCD 0.0279 (0.0028) 0.0860 (0.0089) 0.0396 (0.0048) 0.0751 (0.0019) 0.5693 (0.0093) 0.1882 (0.0043) 794.2800 (3.0063)
PC-like 0.0167 (0.0020) 0.0411 (0.0058) 0.0152 (0.0034) 0.0011 (0.0001) 0.0307 (0.0039) -0.0084 (0.0008) 859.3400 (2.4988)
PC NA NA NA 0.0057 (0.0004) 0.0421 (0.0030) -0.0114 (0.0011) 885.5600 (2.5778)
(100,1000) Proposed 0.4088 (0.0101) 0.9988 (0.0008) 0.6334 (0.0081) 0.3631 (0.0053) 0.4876 (0.0066) 0.3825 (0.0061) 775.5400 (7.2310)
LCD 0.0236 (0.0022) 0.0780 (0.0082) 0.0340 (0.0040) 0.0900 (0.0022) 0.6349 (0.0088) 0.2210 (0.0045) 779.8200 (3.0436)
PC-like 0.0193 (0.0020) 0.0349 (0.0037) 0.0131 (0.0026) 0.0016 (0.0002) 0.0377 (0.0042) -0.0072 (0.0009) 872.3000 (2.4148)
PC NA NA NA 0.0077 (0.0005) 0.0500 (0.0032) -0.0090 (0.0013) 896.0600 (2.5596)
From Tables 1 and 2, it is clear that the proposed method outperforms all competitors in most scenarios. In Example 1, the proposed method produces a much better estimate of the undirected edges than all other methods.
Table 2: The averaged evaluation metrics of all the methods in Example 2, together with their standard errors in parentheses. Here ** indicates that the corresponding method takes too long to produce any results.

(p, n) Method Recall($\widehat{\Omega}$) Precision($\widehat{\Omega}$) MCC($\widehat{\Omega}$) Recall($\widehat{B}$) Precision($\widehat{B}$) MCC($\widehat{B}$) SHD
(50,500) Proposed 0.5229 (0.0263) 0.7843 (0.0157) 0.6255 (0.0215) 0.3272 (0.0178) 0.5651 (0.0249) 0.4063 (0.0210) 128.8400 (7.2189)
LCD 0.4510 (0.0290) 0.5184 (0.0261) 0.4662 (0.0271) 0.1646 (0.0102) 0.5469 (0.0192) 0.2764 (0.0108) 132.3600 (7.9336)
PC-like 0.4548 (0.0331) 0.4296 (0.0257) 0.4225 (0.0292) 0.0231 (0.0024) 0.1668 (0.0146) 0.0439 (0.0052) 149.3600 (8.9086)
PC NA NA NA 0.1637 (0.0149) 0.2490 (0.0140) 0.1655 (0.0130) 160.1200 (8.2415)
(50,1000) Proposed 0.5704 (0.0293) 0.7719 (0.0166) 0.6481 (0.0231) 0.3568 (0.0204) 0.5571 (0.0252) 0.4195 (0.0225) 128.7800 (7.8260)
LCD 0.4723 (0.0301) 0.4873 (0.0218) 0.4609 (0.0255) 0.1885 (0.0110) 0.5721 (0.0173) 0.3052 (0.0119) 128.7800 (7.8959)
PC-like 0.4584 (0.0309) 0.4108 (0.0239) 0.4138 (0.0269) 0.0205 (0.0025) 0.1529 (0.0181) 0.0379 (0.0059) 149.9800 (8.8576)
PC NA NA NA 0.1848 (0.0176) 0.2599 (0.0142) 0.1828 (0.0156) 159.1600 (8.5205)
(100,500) Proposed 0.3560 (0.0090) 0.9984 (0.0010) 0.5880 (0.0076) 0.4124 (0.0255) 0.7447 (0.0270) 0.5438 (0.0256) 155.0600 (4.4310)
LCD ** ** ** ** ** ** **
PC-like ** ** ** ** ** ** **
PC NA NA NA 0.2710 (0.0269) 0.1013 (0.0087) 0.1472 (0.0125) 231.6000 (5.1892)
(100,1000) Proposed 0.5035 (0.0103) 0.9977 (0.0009) 0.7016 (0.0073) 0.5882 (0.0262) 0.8817 (0.0155) 0.7115 (0.0218) 114.5800 (4.1264)
LCD ** ** ** ** ** ** **
PC-like ** ** ** ** ** ** **
PC NA NA NA 0.2875 (0.0264) 0.1065 (0.0091) 0.1562 (0.0127) 228.1400 (4.9927)
For directed edges, the proposed method achieves the highest Recall($\widehat{B}$) and MCC($\widehat{B}$). It is interesting to note that LCD attains a higher Precision($\widehat{B}$) than the proposed method, possibly because LCD tends to produce fewer estimated directed edges, resulting in large Precision($\widehat{B}$) but small Recall($\widehat{B}$). In Example 2, the proposed method outperforms all competitors in terms of almost all evaluation metrics. Note that LCD and PC-like take too long to produce any results when $p = 100$, due to their expensive computational cost when many hub nodes are present.
5.2 Standard & Poor's 500 index data
We apply the proposed method to study the relationships among stocks in the Standard & Poor's 500 index, and to analyze the impact of the COVID-19 pandemic on the stock market. A chain graph can accurately reveal various relationships among stocks, with undirected edges representing symmetric competitive or cooperative relationships between stocks and directed edges representing asymmetric causal relations from one stock to another.
To proceed, we select the $p = 100$ stocks with the largest market capitalizations in the Standard & Poor's 500 index, and retrieve their adjusted closing prices during the pre-pandemic period, August 2017 to February 2020, and the post-pandemic period, March 2020 to September 2022. The data is publicly available on many finance websites and has been packaged in some standard software, such as the R package quantmod. For each period, we first calculate the daily returns of each stock based on its adjusted closing prices, and then apply the proposed method to construct the corresponding chain graph.
Figure 3 displays the undirected edges between stocks in both estimated chain graphs, which consist of 39 and 21 undirected edges in the pre-pandemic and post-pandemic periods, respectively. It is clear that there are more estimated undirected edges in the pre-pandemic chain graph than in the post-pandemic one, which echoes the empirical finding that business expansion was more active, company cooperation was closer and competition was fiercer before the COVID-19 pandemic. Furthermore, there are 13 common undirected edges in both chain graphs, and all 13 connected stock pairs are from the same sector, including VISA (V) and MASTERCARD (MA), JPMORGAN CHASE (JPM) and BANK OF AMERICA (BAC), MORGAN STANLEY (MS) and GOLDMAN SACHS (GS), and HOME DEPOT (HD) and LOWE'S (LOW). All these pairs share the same type of business, and their competition or cooperation was less affected by the COVID-19 pandemic. In Figure 3, it is also interesting to note that the number of undirected edges between stocks from different sectors has decreased in the post-pandemic chain graph. This concurs with the fact that diversified business transactions between companies decreased and only essential business contacts were maintained during the COVID-19 pandemic.
Figure 3: The left and right panels display all the estimated undirected edges for the pre-pandemic and post-pandemic periods, respectively. Stocks from the same sector are marked with the same color, and the common undirected edges in both chain graphs are boldfaced.

Figure 4 displays the boxplots of the causal orderings of all stocks within each sector in both the pre-pandemic and post-pandemic periods, where the causal ordering of a stock is set as that of the corresponding chain component. It is generally believed that causal ordering reflects the imbalance of social demand and supply; that is, if a sector is in greater demand, its causal ordering tends to move upstream. Evidently, Energy and Materials are always at the top of the causal ordering in both periods, as they are upstream industries and provide inputs for most other sectors. The median causal ordering of Telecommunication Services moved from downstream to upstream after the outbreak of the COVID-19 pandemic, since people traveled less and relied more
on telecommunication for business communication. The median causal ordering of Financials moved down during the pandemic, as commercial entities became more cautious about credit expansion and demand for financial services likely declined amid the financial uncertainty. It is somewhat surprising that the median causal ordering of Health Care appears invariant, as many pharmaceutical and biotechnology corporations in this sector actually moved from downstream to upstream, due to the rapid development of vaccines and treatments during the pandemic.
In addition, the estimated chain graphs in the pre-pandemic and post-pandemic periods consist of 149 and 190 directed edges, respectively. While many directed edges remain unchanged, there are some stocks whose roles have changed dramatically in the chain graphs. In particular, some stocks with no child but multiple parents in the pre-pandemic chain graph become ones with no parent but multiple children in the post-pandemic chain graph, such as COSTCO (COST), APPLE (AAPL), ACCENTURE (ACN), INTUIT (INTU), AT&T (T) and CHUBB (CB). This finding appears reasonable, as most of these stocks correspond to industries in high demand during the pandemic, such as
Figure 4: The left and right panels display the boxplots of the estimated causal orderings of the top 100 stocks in each sector for the pre-pandemic and post-pandemic periods, respectively. The sectors are ordered according to the median causal ordering of their stocks in the post-pandemic period.
COSTCO for stocking up on groceries, AT&T for remote communication, and APPLE for providing communication and online-learning equipment. On the other hand, some other stocks with no parent but multiple children in the pre-pandemic chain graph become ones with no child but multiple parents in the post-pandemic chain graph, including TESLA (TSLA), TJX (TJX), BRISTOL-MYERS SQUIBB (BMY), PAYPAL (PYPL), AUTOMATIC DATA PROCESSING (ADP) and BOEING (BA). Many of these companies were severely impacted during the pandemic, such as BOEING due to minimized travel and TESLA due to shrunken consumer purchasing power.
6 Conclusion
In this paper, we establish a set of novel identifiability conditions for the Gaussian chain graph model under the AMP interpretation, exploiting a low-rank plus sparse decomposition of the precision matrix. An efficient learning algorithm is developed to recover the exact chain graph structure, including both undirected and directed edges. Theoretical analysis shows that the proposed method consistently reconstructs the exact chain graph structure. Its advantage is also supported by various numerical experiments on both simulated and real examples. It would also be interesting to extend the proposed identifiability conditions and learning algorithm to accommodate non-linear chain graph models with non-Gaussian noise.
Acknowledgment
This work is supported in part by HK RGC Grants GRF-11304520, GRF-11301521 and GRF-
11311022.
References
[1] Andersson, S. A., Madigan, D., and Perlman, M. D. (2001). Alternative Markov properties for chain graphs. Scandinavian Journal of Statistics, 28(1):33–85.
[2] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
[3] Cai, T., Liu, W., and Luo, X. (2011). A constrained $\ell_1$ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607.
[4] Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. S. (2011). Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596.
[5] Chen, C., Chang, K. C.-C., Li, Q., and Zheng, X. (2018). Semi-supervised learning meets factorization: Learning to recommend with chain graph model. ACM Transactions on Knowledge Discovery from Data (TKDD), 12(6):1–24.
[6] Chen, W., Drton, M., and Wang, Y. (2019). On causal discovery with an equal-variance assumption. Biometrika, 106(4):973–980.
[7] Chen, Y., Li, X., Liu, J., and Ying, Z. (2016). A fused latent and graphical model for multivariate binary data. arXiv preprint arXiv:1606.08925.
[8] Cole, M. W., Reynolds, J. R., Power, J. D., Repovs, G., Anticevic, A., and Braver, T. S. (2013). Multi-task connectivity reveals flexible hubs for adaptive task control. Nature Neuroscience, 16(9):1348–1355.
[9] Cox, D. R. and Wermuth, N. (1993). Linear dependencies represented by chain graphs. Statistical Science, 8(3):204–218.
[10] Drton, M. and Eichler, M. (2006). Maximum likelihood estimation in Gaussian chain graph models under the alternative Markov property. Scandinavian Journal of Statistics, 33(2):247–257.
[11] Fang, Z., Zhu, S., Zhang, J., Liu, Y., Chen, Z., and He, Y. (2020). Low rank directed acyclic graphs and causal structure learning. arXiv preprint arXiv:2006.05691.
[12] Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
[13] Frydenberg, M. (1990). The chain graph Markov property. Scandinavian Journal of Statistics, 17(4):333–353.
[14] Golub, G. H. and Van Loan, C. F. (2013). Matrix Computations. JHU Press.
[15] Ha, M. J., Stingo, F. C., and Baladandayuthapani, V. (2021). Bayesian structure learning in multilayered genomic networks. Journal of the American Statistical Association, 116(534):605–618.
[16] Javidian, M. A. and Valtorta, M. (2018). Structural learning of multivariate regression chain graphs via decomposition. arXiv preprint arXiv:1806.00882.
[17] Javidian, M. A., Valtorta, M., and Jamshidi, P. (2020). AMP chain graphs: Minimal separators and structure learning algorithms. Journal of Artificial Intelligence Research, 69:419–470.
[18] Kalisch, M. and Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(3):613–636.
[19] Lauritzen, S. L. and Wermuth, N. (1989). Graphical models for associations between variables, some of which are qualitative and some quantitative. Annals of Statistics, 17(1):31–57.
[20] Levitz, M., Perlman, M. D., and Madigan, D. (2001). Separation and completeness properties for AMP chain graph Markov models. Annals of Statistics, 29(6):1751–1784.
[21] Lin, J., Basu, S., Banerjee, M., and Michailidis, G. (2016). Penalized maximum likelihood estimation of multi-layered Gaussian graphical models. Journal of Machine Learning Research, 17:1–51.
[22] Luke, D. A. and Harris, J. K. (2007). Network analysis in public health: history, methods, and applications. Annual Review of Public Health, 28:69–93.
[23] Ma, Z., Xie, X., and Geng, Z. (2008). Structural learning of chain graphs via decomposition. Journal of Machine Learning Research, 9(95):2847–2880.
[24] Maathuis, M., Drton, M., Lauritzen, S., and Wainwright, M. (2018). Handbook of Graphical Models. CRC Press.
[25] McCarter, C. and Kim, S. (2014). On sparse Gaussian chain graph models. Advances in Neural Information Processing Systems, 27.
[26] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436–1462.
[27] Park, G. (2020). Identifiability of additive noise models using conditional variances. Journal of Machine Learning Research, 21(75):1–34.
[28] Peña, J. M. (2014). Learning marginal AMP chain graphs under faithfulness. In European Workshop on Probabilistic Graphical Models, pages 382–395. Springer.
[29] Peña, J. M., Sonntag, D., and Nielsen, J. (2014). An inclusion optimal algorithm for chain graph structure learning. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 778–786. PMLR.
[30] Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228.
[31] Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA.
[32] Ravikumar, P., Wainwright, M., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980.
[33] Sanford, A. and Moosa, I. (2012). A Bayesian network structure for operational risk modelling in structured finance operations. Journal of the Operational Research Society, 63(4):431–444.
[34] Sonntag, D. and Peña, J. M. (2012). Learning multivariate regression chain graphs under faithfulness. In Sixth European Workshop on Probabilistic Graphical Models, pages 299–306.
[35] Stanton-Salazar, R. D. and Dornbusch, S. M. (1995). Social capital and the reproduction of inequality: Information networks among Mexican-origin high school students. Sociology of Education, 68(2):116–135.
[36] Studený, M. (1997). A recovery algorithm for chain graphs. International Journal of Approximate Reasoning, 17(2-3):265–293.
[37] Tsamardinos, I., Brown, L., and Aliferis, C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78.
[38] Wang, Y. and Bhattacharyya, A. (2022). Identifiability of linear AMP chain graph models. Proceedings of the AAAI Conference on Artificial Intelligence, 36(9):10080–10089.
[39] Wermuth, N. and Lauritzen, S. L. (1990). On substantive research hypotheses, conditional independence graphs and graphical chain models. Journal of the Royal Statistical Society: Series B (Methodological), 52(1):21–50.
[40] Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323.