Data-driven discovery of interpretable causal relations for deep learning material laws with uncertainty propagation

Granular Matter manuscript No.
(will be inserted by the editor)
Xiao Sun · Bahador Bahmani · Nikolaos N. Vlassis · WaiChing Sun · Yanxun Xu
Received: June 21, 2021 / Accepted: date
Abstract This paper presents a computational framework that generates ensemble predictive mechanics models with uncertainty quantification (UQ). We first develop a causal discovery algorithm to infer causal relations among time-history data measured during each representative volume element (RVE) simulation through a directed acyclic graph (DAG). With multiple plausible sets of causal relationships estimated from multiple RVE simulations, the predictions are propagated in the derived causal graph while using a deep neural network equipped with dropout layers as a Bayesian approximation for uncertainty quantification. We select two representative numerical examples (traction-separation laws for frictional interfaces, elastoplasticity models for granular assemblies) to examine the accuracy and robustness of the proposed causal discovery method for the common material law predictions in civil engineering applications.
Keywords causal discovery, granular materials, knowledge graph, path-dependent responses, soil mechanics
1 Introduction
Computer simulations for mechanics problems often require material (constitutive) laws that replicate the local constitutive responses of the materials. These material laws can be used to replicate the responses of an interface (e.g., traction-separation laws or cohesive zone models) or bulk materials (e.g., elastoplasticity models for solids, porosity-permeability relationships and water retention curves). A computer model is then completed by incorporating these local constitutive laws into a discretized form of the balance principles (balance of mass, linear momentum and energy), for which discretized numerical solutions can be sought by a proper solver.
Constitutive laws, such as stress-strain relationships for bulk materials, traction-separation laws for interfaces, and porosity-permeability relationships for porous media, are often derived following a set of axioms and rules [Truesdell and Noll, 2004]. In these hand-crafted models, phenomenological observations are incorporated into constitutive laws (e.g., critical state theory for soil mechanics [Schofield and Wroth, 1968, Sun, 2013, Bryant and Sun, 2019, Na et al., 2019], void-growth theory for ductile damage [Gurson, 1977]). While the earlier, simpler models are often amended by newer and more comprehensive models [Dafalias, 1984] in order to improve the performance (e.g., accuracy, more realistic interpretation of mechanisms), these improvements are often a trade-off that may unavoidably increase the number of parameters, leading to increasing difficulty for the calibration, verification and validation processes, as well as the uncertainty quantification [Dafalias, 1984, Borja and Sun, 2007, 2008, Clément et al., 2013, Wang and Sun, 2019a, Wang et al., 2016].
Xiao Sun, Yanxun Xu (corresponding author)
Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD. Tel.: 410-516-7341, E-mail: yanxun.xu@jhu.edu
Bahador Bahmani, Nikolaos N. Vlassis, WaiChing Sun (corresponding author)
Department of Civil Engineering and Engineering Mechanics, Columbia University, New York, NY
The rise of big data and the great promises of machine learning have led to a new generation of approaches that either bypass the use of constitutive laws via model-free data-driven methods (e.g., Kirchdoerfer and Ortiz [2016], He and Chen [2020], Bahmani and Sun [2021], Karapiperis et al. [2021]) or replace parts of the modeling efforts/components with models generated from supervised learning (e.g., Furukawa and Yagawa [1998], Lefik and Schrefler [2003], Wang and Sun [2018, 2019b], Zhang et al. [2020], Vlassis et al. [2020a], Vlassis and Sun [2021], Logarzo et al. [2021], Masi et al. [2021]). However, one critical issue of these machine learning and data-driven approaches is the lack of sufficient interpretability of predictions. While there is no universally accepted definition of interpretability, we will herein employ the definition used in Miller [2019], which refers to interpretability as the degree to which a human can understand the cause of the prediction.
One possible way to boost the interpretability is to introduce a proper medium to represent knowledge that can be understood by humans [Sussillo and Barak, 2013]. A causal graph, also known as a causal Bayesian network, is one such medium in which the causal relations among different entities are mathematically represented by a directed graph. In computational mechanics applications, Wang and Sun [2018, 2019b] and Heider et al. [2020] have derived a mathematical framework to decompose a complex prediction task into multiple easier predictions represented by subgraphs within graphs. More recent work such as Wang and Sun [2019a], Wang et al. [2019] introduces a deep reinforcement learning approach that employs the Monte Carlo Tree Search (MCTS) to assemble a directed graph that generates a sequence of interconnected predictions of physical quantities to emulate a hand-crafted constitutive law. However, these directed graphs are generated to optimize a given performance metric (e.g., accuracy, calculation speed and forward prediction robustness), and do not necessarily reveal the underlying causal relations among physical quantities.
Discovering causal relations from observational data is an important problem with applications in many fields of science, such as social science [Oktay et al., 2010], finance [Gong et al., 2017], and biomedicine [Shen et al., 2020]. The standard way to discover causality is through randomized controlled experiments. However, conducting such experiments can be impractical, unethical, and/or very expensive in many disciplines [Xu et al., 2012, Xie and Xu, 2020]. For mechanics problems, the major issues include the time and labor cost for physical experiments, the lack of facilities or equipment to complete the required tests and the difficulties in obtaining specimens [Mitchell et al., 2005, Powrie, 2018, Wood, 1990]. As a result, an alternative approach, which is adopted in this study, is to use sub-scale simulations as the digital representation that generates auxiliary data sets to build material laws or forecast engines for the macroscopic material responses [Liu et al., 2016, Ma and Sun, 2020, Frankel et al., 2019, Wang and Sun, 2019b]. Classic methods for causal discovery are based on probabilistic graphical modeling [Pearl, 2000], the structure of which is a directed acyclic graph (DAG) with nodes representing random variables and edges representing conditional dependencies between variables. Learning a DAG from observational data is highly challenging since the number of possible DAGs is super-exponential in the number of nodes. There are two main approaches for causal discovery: the constraint-based approach and the score-based approach. The constraint-based approach aims to recover a Markov equivalence class through inferring conditional independence relationships among the variables, and the resulting Markov equivalence class may contain multiple DAGs that indicate the same conditional independence relationships [Le et al., 2016, Cui et al., 2016]. On the other hand, the score-based approach uses a scoring function, such as the Bayesian Information Criterion (BIC), to search for the DAG that best fits the data [Heckerman et al., 1995, Huang et al., 2018].
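
To give a sense of the super-exponential growth mentioned above, the number of labeled DAGs on a given number of nodes can be counted with Robinson's recurrence. The short sketch below is only an illustration and is not part of the proposed framework; the function name `count_dags` is our own.

```python
from math import comb


def count_dags(n: int) -> int:
    """Count labeled DAGs on n nodes with Robinson's recurrence:
    a(m) = sum_{k=1..m} (-1)^(k+1) * C(m, k) * 2^(k(m-k)) * a(m-k), a(0) = 1.
    """
    a = [1]  # a[0]: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            sign = 1 if k % 2 == 1 else -1
            total += sign * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
        a.append(total)
    return a[n]


if __name__ == "__main__":
    for n in (3, 5, 11):
        print(n, count_dags(n))
```

Even a five-vertex problem admits 29281 candidate DAGs, which is why an exhaustive search quickly becomes impractical as the vertex set grows.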
In this paper, we aim to discover causal relations that can explain the underlying mechanism of a history-dependent macroscopic constitutive law upscaled from direct numerical simulations at the meso-scale. The most common method for constructing causal relations from time-series data is Granger causality [Granger, 1969], which assumes a number of lagged effects and analyzes the data in a unit no more than that number of lags. See Runge [2018] for a review of causal discovery methods on time-series data. However, most of these causal discovery methods assume that the data are generated from a stationary process, meaning that the data are generated by a distribution that does not change with time. Such an assumption does not hold in many physical processes, in which the mechanisms or parameters in the causal model may change over time. Several methods have been proposed recently to tackle time-varying causal relations in non-stationary processes [Ghassami et al., 2018, Huang et al., 2019a, 2020]. However, they either assume linear causal models or do not offer the flexibility of incorporating known physical knowledge, limiting their applicability to the nonlinear path-dependent relations in learning material constitutive laws.
In this work, we offer two major innovations. First, we introduce a new decoupled discovery/training approach where the discovery of causal relations represented by a DAG is enabled by a causal discovery algorithm that deduces plausible causal relations from non-stationary time series data and incorporates known physical knowledge. Second, we leverage the obtained causal graph as the representation of mechanics knowledge and adopt a Bayesian approximation using the dropout layer technique first introduced by Gal and Ghahramani [2016a] to propagate epistemic uncertainty in the causal graph and generate quantitative predictions with uncertainty quantification.
The rest of the paper is organized as follows. Section 2 first introduces the two data sets (learning traction-separation law and hypo-plasticity of granular materials) used for our numerical experiments. This is followed by the description of the theory and implementation of the proposed causal discovery algorithm used to deduce the causal relations from non-stationary time series data (Section 3). The setup of the deep neural network model for the prediction tasks and the uncertainty propagation are included in Sections 4 and 5, respectively. The proposed framework is then tested against two numerical experiments (Section 6), followed by the conclusion.
2 Causal relations and constitutive laws
As demonstrated in previous studies such as Wang and Sun [2018, 2019a,b], Wang et al. [2019], Heider et al. [2020], Vlassis et al. [2020a], the relationships in a constitutive model can be represented by a network of unidirectional information flow, i.e., a DAG $G = (V, E)$ where $V$ represents a vertex set and $E$ denotes an edge set. With appropriate assumptions that will be discussed later, the DAG can be identified as a causal graph [Pearl, 2000]. The causal relations are not only useful to explain the underlying mechanism of a process but also provide us a basis to formulate multi-step transfer learning to predict constitutive responses. This strategy can be beneficial because one can leverage more data gathered from physical numerical experiments to train the prediction model. For instance, while a black-box prediction of stress-strain curves only leverages the stress-strain pair for supervised learning, the introduction of knowledge graphs may introduce multiple supervised learning tasks where measurements of porosity, fabric tensors or any other physical and geometrical attributes measured during the experiments can be leveraged to improve the training. For completeness, we briefly describe the procedure to consider the data set as vertex sets in graphs and the causal discovery process used to create the directed edge set in a knowledge graph through two examples.
Note that many of the physical quantities that become the vertices in the knowledge graphs are graph metrics obtained from analyzing the connectivity topology of the granular system. For brevity, we will not provide a review on the applications of graph theory for granular matter here. Interested readers may refer to Appendix B for the definitions of the graph metrics and Satake [1978], Bagi [1996], Walker and Tordesillas [2010], Tordesillas et al. [2010], O’Sullivan [2011], Kuhn et al. [2015] and Vlassis et al. [2020a] for reviews on the graph theory applied to particulate and granular systems.
2.1 Dataset for traction-separation law
In the first example, our goal is to conduct a numerical experiment to verify whether the causal discovery algorithm is able to re-discover the well-known causal relation that links the plastic dilatancy and contraction to the frictional behaviors [Scholz, 1998, Popov, 2010] with a small data set.
Following Wang and Sun [2018, 2019a,b], Wang et al. [2019], we consider a vertex set that consists of five elements: the displacement jump/separation $U$, the traction $T$, and three geometric measures, i.e.,
1. Displacement jump $U$, the relative displacement across an interface between two bodies in contact.
2. Porosity $\phi$, the ratio between the volume of the void and the total volume of the RVE.
3. Coordination number (averaged) $CN = N_{contact}/N_{particle}$, where $N_{contact}$ is the number of particle contacts and $N_{particle}$ is the number of particles in the RVE.
4. Fabric tensor $A_f = \frac{1}{N_{contact}} \sum_{c=1}^{N_{contact}} n^c \otimes n^c$, where $n^c$ is the normal vector of a particle contact $c$ in the RVE. The symbol '$\otimes$' denotes a juxtaposition of two vectors (e.g., $a \otimes b = a_i b_j$) or two symmetric second-order tensors [e.g., $(\alpha \otimes \beta)_{ijkl} = \alpha_{ij} \beta_{kl}$].
5. Traction $T$, the traction vector acting on the interface.
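
The geometric measures listed above can be evaluated from particle-scale data in a few lines. The following is a minimal sketch assuming that contact normals are available as an array of shape (N_contact, 3) and that the particles are non-overlapping spheres; the function names are illustrative and are not taken from YADE or from the original implementation.

```python
import numpy as np


def porosity(particle_radii, total_volume):
    """Void volume over total RVE volume, assuming non-overlapping spheres."""
    radii = np.asarray(particle_radii, dtype=float)
    solid_volume = np.sum(4.0 / 3.0 * np.pi * radii ** 3)
    return 1.0 - solid_volume / total_volume


def coordination_number(n_contacts, n_particles):
    """Averaged coordination number CN = N_contact / N_particle."""
    return n_contacts / n_particles


def fabric_tensor(contact_normals):
    """A_f = (1 / N_contact) * sum_c n^c (x) n^c for an (N_contact, 3) array."""
    normals = np.asarray(contact_normals, dtype=float)
    # Normalize defensively so that each n^c is a unit vector.
    unit = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    return np.einsum("ci,cj->ij", unit, unit) / unit.shape[0]
```

Because every $n^c$ is a unit vector, the resulting fabric tensor is symmetric with unit trace, which offers a quick sanity check on the contact data.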
To generate a machine learning based traction-separation law, we identify the displacement jump as the root and the traction as the leaf of the causal graph. The causal graph is a DAG $G = (V, E)$ where $V$ is the set consisting of the physical quantities $U$, $\phi$, $CN$, $A_f$ and $T$. Meanwhile, $E \subseteq V \times V$ is a set of directed edges that connect any two elements from $V$, and $E$ is determined from the causal discovery algorithm outlined in Section 3.
The dataset is generated using the open-source code YADE. In total, 100 traction-separation law simulations are run with different loading paths on the same RVE. This RVE consists of spherical particles with radii of 1 ± 0.3 mm, uniformly distributed. The RVE has a height of 20 mm in the normal direction of the frictional surface and is initially consolidated to an isotropic pressure of 10 MPa. The inter-particle interaction is controlled by Cundall’s elastic-frictional contact model [Cundall and Strack, 1979] with an inter-particle elastic modulus of $E_{eq} = 1$ GPa, a ratio between shear and normal stiffness of $k_s/k_n = 0.3$, a frictional angle of $\varphi = 30^\circ$, a density of $\rho = 2600$ kg/m$^3$, and a Cundall damping coefficient of $\alpha_{damp} = 0.2$. For brevity, the generation and setup of the simulations are not included in this paper. Interested readers may refer to Wang and Sun [2019a] for more information. The data required to replicate the results of this paper and for third-party validation can be found in the Mendeley Data repository [Sun and Wang, 2019].
2.2 Dataset for hypo-plasticity of granular materials
While the first data set is used to determine whether the causal graph algorithm may re-discover known physical relations in the literature, the second problem is designed to test whether the causal graph algorithm may successfully uncover new plausible causal relations not known a priori in the literature.
For this purpose, we run 60 discrete element simulations and use 30 of them for calibration and 30 for blind forward predictions. In addition to the conventional microstructural attributes (e.g., porosity and fabric tensor) typically used for hand-crafted constitutive laws [Manzari and Dafalias, 1997, Dafalias and Manzari, 2004, Sun, 2013, Bryant and Sun, 2019, Na et al., 2019], we have also recorded the evolution of the particle contact pairs in each incremental time step of the discrete element simulations. The particle contact connectivity is itself an undirected graph $G_{contact} = (V_{particle}, E_{contact})$ where $V_{particle}$ is the set of particles and $E_{contact}$ is the set of undirected edges, one for each contact between two convex particles. To facilitate new discovery, we compute 15 different graph metrics of $G_{contact}$ (see Appendix B for definitions) that have not been used for composing constitutive laws and examine (1) whether the causal discovery algorithm may discover causal relations among these new physical quantities and (2) whether the new discovery helps improve the accuracy, robustness and consistency of the forward predictions enabled by neural networks trained according to the discovered causal relations.
In total, there are 11 types of time-history data, of which 3 are second-order tensors (strain, stress and the strong fabric tensor), and the rest are scalars (porosity, coordination number, graph density, graph local efficiency, graph average clustering, graph degree assortativity coefficient, graph transitivity and graph clique number). As such, there are 11 elements in the vertex set and the goal of the causal discovery is to establish the edge set to complete the causal graph. A sequence of supervised learning tasks is then used to generate predictions via deep learning.
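
As an illustration of how a few of the scalar graph measures named above can be extracted from the particle contact pairs recorded at one time step, the sketch below computes the graph density, the average clustering coefficient and the transitivity with plain Python. It assumes that the contacts are given as pairs of particle identifiers and that particles without any contact are simply absent from the graph; the helper names are hypothetical, and the definitions used in this study are those of Appendix B.

```python
from itertools import combinations


def contact_graph(contacts):
    """Build adjacency sets from an iterable of (particle_i, particle_j) pairs."""
    adj = {}
    for i, j in contacts:
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)
    return adj


def graph_density(adj):
    """Number of edges divided by the number of possible edges."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2.0 * m / (n * (n - 1)) if n > 1 else 0.0


def _closed_and_possible_pairs(adj, v):
    """Neighbour pairs of v that are themselves in contact, and all neighbour pairs."""
    nbrs = adj[v]
    closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    possible = len(nbrs) * (len(nbrs) - 1) // 2
    return closed, possible


def average_clustering(adj):
    """Mean of the local clustering coefficients (0 for degree < 2)."""
    coeffs = []
    for v in adj:
        closed, possible = _closed_and_possible_pairs(adj, v)
        coeffs.append(closed / possible if possible else 0.0)
    return sum(coeffs) / len(coeffs)


def transitivity(adj):
    """3 * (number of triangles) / (number of connected triples)."""
    closed_total = possible_total = 0
    for v in adj:
        closed, possible = _closed_and_possible_pairs(adj, v)
        closed_total += closed
        possible_total += possible
    return closed_total / possible_total if possible_total else 0.0
```

Evaluating such functions at every recorded time step turns the evolving contact network into scalar time histories that can serve as vertices of the causal graph.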
3 Causal discovery and knowledge graph constructions
3.1 Notations and assumptions
Let $G = (V, E)$ be a DAG that contains only directed edges and has no directed cycles. For each $V_i \in V$, let $PA_i$ denote the set of parents of $V_i$ in $G$. Since our data are time-history dependent, we assume that the joint probability distribution of $V$ at each time point according to $G$ can factorize as $p(V) = \prod_{i=1}^{m} p(V_i \mid PA_i)$, where $m$ is the number of vertices in $G$. Here $p(V_i \mid PA_i)$ can be regarded as a "causal mechanism." For non-stationary time series data, the causal mechanism $p(V_i \mid PA_i)$ can change over time, and the changes may be due to the involved functional models or the causal strengths.
Throughout this section, we use the example of the traction-separation law to illustrate the proposed causal discovery method without loss of generality. Therefore, $V$ is the set consisting of displacement jump $U$, porosity $\phi$, coordination number $CN$, fabric tensor $A_f$, and traction $T$. Huang et al. [2019b] developed a constraint-based causal discovery algorithm for non-stationary time series data to identify changing causal modules and recover the causal structure. In this paper, we extend the algorithm in Huang et al. [2019b] such that the proposed causal discovery algorithm not only handles non-stationary time-history data but also incorporates certain physical constraints. For example, in constructing the traction-separation law, we have the prior knowledge that the dynamic changes in $U$ can cause changes in other variables, not vice versa. Therefore, if there exists a directed edge between the displacement jump $U$ and any other variable $V_i$, then $U \to V_i$.
Denote $V \setminus U$ to include all other variables in $V$ excluding $U$ (e.g., porosity, fabric tensor). Since the causal mechanism can change over time, we assume that the changes can be explained by certain time-varying confounders, which can be written as functions of time. As we have the prior knowledge that $U$ itself is a time-dependent variable and could affect all other variables, we regard $U$ as such a confounder and assume that the causal relation for each $V_i \in V \setminus U$ can be represented by the following structural equation model:
$$V_i = g_i(PA_i, \theta_i(U), e_i), \tag{1}$$
where $PA_i$ includes $U$ if the changes in $U$ can affect the changes in $V_i$, $\theta_i(U)$ denotes a function of $U$ that influences $V_i$ as effective parameters, and $e_i$ is a noise term that is independent of $U$ and $PA_i$. The $e_i$'s are assumed to be independent. As we treat $U$ as a random variable, there is a joint distribution over $V \cup \{\theta_i(U)\}_{i: V_i \in V \setminus U}$. Denote $G_{aug}$ to be the graph obtained by adding $\{\theta_i(U)\}_{i: V_i \in V \setminus U}$ to $G$ and, for each $i$, adding an arrow from $\theta_i(U)$ to $V_i$. Note that $G$ is the induced subgraph of $G_{aug}$ over $V$. Denote the joint distribution of $G_{aug}$ to be $p_{aug}$.
In order to apply any conditional independence test on the variable set $V$ for recovering the causal structure, we set the following assumptions [Spirtes et al., 2000].
Assumption 1. (Causal Markov condition) $G_{aug}$ and the joint distribution $p_{aug}$ on $V \cup \{\theta_i(U)\}_{i: V_i \in V \setminus U}$ satisfy the causal Markov condition if and only if a vertex of $G_{aug}$ is probabilistically independent of all its non-descendants in $G_{aug}$ given the set of all its parents.
Assumption 2. (Faithfulness) $G_{aug}$ and the joint distribution $p_{aug}$ satisfy the faithfulness condition if and only if no conditional independence holds unless entailed by the causal Markov condition.
Assumption 3. (Causal sufficiency) The common causes of all variables in $V \cup \{\theta_i(U)\}_{i: V_i \in V \setminus U}$ are measured.
3.2 Recovery of the causal skeleton
In this section, we propose a constraint-based method building upon the PC algorithm [Spirtes et al., 2000] to first identify the skeleton of $G$, defined as the undirected graph obtained if we ignore the directions of edges in a DAG $G$. We prove that given Assumptions 1-3, we can apply conditional independence tests to $V$ to recover the skeleton of $G$. Algorithm 1 describes the proposed method, which is supported by Theorem 1. The proof is provided in Appendix A following Huang et al. [2019b].
Theorem 1 Given Assumptions 1-3, for every $V_i, V_j \in V \setminus U$, $V_i$ and $V_j$ are not adjacent in $G$ if and only if they are independent conditional on some subset of $\{V_k \mid V_k \in V \setminus U, k \neq i, k \neq j\} \cup \{U\}$.
In lines 3-7 of Algorithm 1, we determine whether the changes in $U$ cause changes in $V_i$. If not, $U$ is not in the parent set of $V_i$ and there is no edge between $U$ and $V_i$ in $G$. Lines 8-12 of Algorithm 1 aim to identify the causal skeleton between variables in $V$ except $U$. Since how other variables change with $U$ and the relations between these variables are usually unknown and potentially very complex, we use a nonparametric conditional independence test, the kernel-based conditional independence (KCI) test developed by Zhang et al. [2012], to determine the dependence between variables throughout this paper. This nonparametric approach can not only capture the linear/nonlinear correlations between variables by testing for zero Hilbert-Schmidt norm of the partial cross-covariance operator, but also handle multidimensional data that are common in mechanics problems.
Algorithm 1 Obtain the undirected skeleton of $G$
1: Objective: To obtain the undirected skeleton of $G$
2: Build a complete undirected graph $U_G$ with variables $V$
3: for each node $V_i \in V \setminus U$ do
4:   if $V_i$ and $U$ are independent given a subset of $\{V_k \mid V_k \in V \setminus U, k \neq i\}$ then
5:     Remove the edge between $V_i$ and $U$
6:   end if
7: end for
8: for every $V_i, V_j \in V \setminus U$ do
9:   if $V_i$ and $V_j$ are independent given a subset of $\{V_k \mid V_k \in V \setminus U, k \neq i, k \neq j\} \cup \{U\}$ then
10:     Remove the edge between $V_i$ and $V_j$
11:   end if
12: end for
13: return $U_G$
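
The control flow of Algorithm 1 can be sketched as follows. To keep the example short and self-contained, the KCI test is replaced here by a linear-Gaussian partial-correlation (Fisher z) test, which is an assumption of this sketch and cannot detect the nonlinear dependencies that KCI is designed for; any test with the same call signature can be passed through the `ci_test` argument. The function names and the choice of significance level are illustrative.

```python
import math
from itertools import combinations

import numpy as np


def fisher_z_independent(data, i, j, cond, alpha=0.05):
    """Stand-in CI test: True if columns i and j look independent given cond."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)
    # Partial correlation of i and j given cond, read off the precision matrix.
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = max(min(r, 0.999999), -0.999999)
    n_samples = data.shape[0]
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n_samples - len(cond) - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return p_value > alpha


def recover_skeleton(data, names, driver="U", ci_test=fisher_z_independent):
    """Return the undirected skeleton as a set of frozenset({a, b}) edges."""
    col = {name: k for k, name in enumerate(names)}
    others = [v for v in names if v != driver]
    edges = {frozenset(pair) for pair in combinations(names, 2)}

    def separable(a, b, pool):
        # Search all subsets of the pool for a separating set.
        for size in range(len(pool) + 1):
            for cond in combinations(pool, size):
                if ci_test(data, col[a], col[b], [col[c] for c in cond]):
                    return True
        return False

    # Lines 3-7: does the driver U affect V_i?
    for v in others:
        if separable(v, driver, [w for w in others if w != v]):
            edges.discard(frozenset((v, driver)))
    # Lines 8-12: skeleton among the remaining variables, U allowed in the set.
    for a, b in combinations(others, 2):
        pool = [w for w in others if w not in (a, b)] + [driver]
        if separable(a, b, pool):
            edges.discard(frozenset((a, b)))
    return edges
```

The exhaustive subset search mirrors the statement of Algorithm 1 and is affordable only for small vertex sets such as the five-element set of the traction-separation example.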
3.3 Determination of causal directions
After obtaining the skeleton $U_G$, we need to determine the causal directions of edges. Meek [1995] provided a set of orientation rules to determine the directions of undirected edges in a graph based on conditional independence tests. However, the Meek rule [Meek, 1995] is only applicable to edges that satisfy its conditions. In this section, we first introduce the Meek rule [Meek, 1995], then propose an algorithm to orient the edges that are not covered by the Meek rule after incorporating known physical knowledge.
Denote $-$ to be an undirected edge. The Meek rule has the following principles:
1. For all triples $V_i - V_j - V_k$, if $V_i$ and $V_k$ are marginally independent but conditionally dependent given $V_j$, then $V_i \to V_j \leftarrow V_k$;
2. If $V_i \to V_j - V_k$ and there is no edge between $V_i$ and $V_k$, then orient $V_j \to V_k$;
3. If $V_i \to V_j \to V_k$ and there is an edge between $V_i$ and $V_k$, then orient $V_i \to V_k$;
4. If ViVjVk,ViVkVk, and VkVj, then VkVj.
Now we describe our algorithm for determining the edge directions in the obtained skeleton $U_G$. Firstly, for any node $V_i$ adjacent to $U$, we orient $U \to V_i$ due to the prior physical knowledge that only $U$ affects other variables, not vice versa. Then we apply the Meek rule to the obtained graph after orienting the edges from $U$ to its neighbours. For instance, suppose $U \to V_i - V_j$; if $V_j$ and $U$ are independent given a set of variables including $V_i$, then we orient $V_i \to V_j$; if $V_j$ and $U$ are independent given a set of variables excluding $V_i$, then we have $V_j \to V_i$.
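
The purely graphical part of this step, namely the second and third principles listed above, can be sketched as a fixed-point loop over the remaining undirected edges. The sketch assumes that the edges from $U$ have already been oriented and that directed edges are stored as (cause, effect) tuples while undirected edges are stored as frozensets; the first principle is omitted because it additionally requires conditional independence test results. The function names are illustrative.

```python
def adjacent(a, b, directed, undirected):
    """True if a and b are joined by any edge, directed or undirected."""
    return (a, b) in directed or (b, a) in directed or frozenset((a, b)) in undirected


def propagate_orientations(directed, undirected):
    """Repeatedly apply the second and third orientation principles."""
    directed, undirected = set(directed), set(undirected)
    changed = True
    while changed:
        changed = False
        for edge in list(undirected):
            first, second = tuple(edge)
            for j, k in ((first, second), (second, first)):
                # Principle 2: Vi -> Vj - Vk with Vi, Vk non-adjacent => Vj -> Vk
                rule2 = any(
                    b == j and a != k and not adjacent(a, k, directed, undirected)
                    for a, b in directed
                )
                # Principle 3: Vj -> Vm -> Vk and Vj - Vk => Vj -> Vk
                rule3 = any(a == j and (b, k) in directed for a, b in directed)
                if rule2 or rule3:
                    undirected.discard(edge)
                    directed.add((j, k))
                    changed = True
                    break
    return directed, undirected
```

For the pattern $U \to V_i - V_j$ with $U$ and $V_j$ non-adjacent, this loop returns $V_i \to V_j$, in agreement with the first case of the example above.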
Next, we discuss how to determine the edge direction between two adjacent variables if they are both adjacent to $U$, i.e., $V_i - V_j$, $U \to V_i$, and $U \to V_j$, since such a scenario is not covered by the Meek rule. The modularity property of causal systems [Pearl, 2000] states that if there are no confounders for cause and effect, then $p(\text{cause})$ and $p(\text{effect} \mid \text{cause})$ are either fixed or change independently. Based on this principle, since both $V_i$ and $V_j$ change with $U$, we can test the conditional independence between $p(V_i \mid \theta_i(U))$ and $p(V_j \mid V_i, \theta_j(U))$, as well as between $p(V_j \mid \theta_j(U))$ and $p(V_i \mid V_j, \theta_i(U))$, to determine the direction between $V_i$ and $V_j$. That is, if $p(V_i \mid \theta_i(U))$ and $p(V_j \mid V_i, \theta_j(U))$ are conditionally independent but $p(V_j \mid \theta_j(U))$ and $p(V_i \mid V_j, \theta_i(U))$ are not, then $V_i \to V_j$. Huang et al. [2019b] developed a kernel embedding of non-stationary conditional distributions and extended the Hilbert-Schmidt Independence Criterion (HSIC, Gretton et al. [2008]) to measure the dependence between distributions, based on which the causal directions can be determined. For example, if we have two random variables $V_1$ and $V_2$, we can compute the dependence between $p(V_1)$ and $p(V_2 \mid V_1)$ using the normalized HSIC, denoted by $\hat{\Delta}_{V_1 \to V_2}$. By the same token, we can compute the dependence between $p(V_2)$ and $p(V_1 \mid V_2)$ using the normalized HSIC, denoted by $\hat{\Delta}_{V_2 \to V_1}$. If $\hat{\Delta}_{V_1 \to V_2} < \hat{\Delta}_{V_2 \to V_1}$, we orient $V_1 \to V_2$; otherwise $V_2 \to V_1$. After orienting all possible edges, we can get the Markov equivalence class of the DAG $G$. Algorithm 2 summarizes the proposed method for determining causal directions.
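
For readers unfamiliar with HSIC, the sketch below computes the standard biased sample estimate of HSIC with Gaussian kernels, together with one common normalization. It is only a building block: it measures dependence between two samples, whereas the method described above applies an extended HSIC to kernel embeddings of the changing distributions $p(\text{cause})$ and $p(\text{effect} \mid \text{cause})$, which is not reproduced here. The bandwidth heuristic, the normalization and the function names are assumptions of this sketch and may differ from those of Huang et al. [2019b].

```python
import numpy as np


def gaussian_gram(x, bandwidth=None):
    """Gaussian kernel Gram matrix for samples stored row-wise in x."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:
        # Median heuristic: 2 * bandwidth^2 equals the median squared distance.
        bandwidth = np.sqrt(0.5 * np.median(sq_dist[sq_dist > 0]))
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def hsic(x, y):
    """Biased HSIC estimate trace(K H L H) / n^2 with centering matrix H."""
    n = len(x)
    K, L = gaussian_gram(x), gaussian_gram(y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2


def normalized_hsic(x, y):
    """HSIC rescaled to [0, 1] by the self-dependence of each sample."""
    return hsic(x, y) / np.sqrt(hsic(x, x) * hsic(y, y))
```

A value close to zero suggests that the two samples are nearly independent, while larger values indicate stronger dependence, which is the sense in which the smaller of the two directional scores is preferred above.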
Algorithm 2 Obtain the Markov equivalence class of the DAG $G$
1: Objective: To orient the directions in the causal skeleton $U_G$
2: Input: The undirected skeleton output from Algorithm 1
3: for any node $V_i$ adjacent to $U$ do
4:   Orient $U \to V_i$
5: end for
6: for all other undirected edges do
7:   Apply the Meek rule
8: end for
9: Find all nodes that are adjacent to $U$ and have undirected edges with other nodes, denoted by $\mathcal{S}$
10: if $\mathcal{S}$ is empty then
11:   return $G$
12: else
13:   repeat
14:     for each node $V \in \mathcal{S}$ do
15:       Consider the set $Z$ of nodes that either are directed parents of $V$ or have undirected edges to $V$
16:       Calculate the normalized HSIC using the node $V$ as the effect and the set of nodes $Z$ as the cause
17:     end for
18:     Pick the node $V$ with the smallest normalized HSIC
19:     Orient all edge directions from the nodes in $Z$ to node $V$
20:     Remove the node $V$ from $\mathcal{S}$
21:   until $\mathcal{S}$ is empty
22: end if
23: return $G$
Remark 1 In the numerical examples, we set up a threshold inclusion probability (20%) below which the causality relation is not included in the hierarchical neural network models. This treatment allows us to ensure that the causalities with sufficient likelihoods are included while the less prominent relationships are omitted to improve the efficiency and simplicity of the resultant model. This threshold can be viewed as a hyperparameter. A higher threshold may yield a DAG with fewer vertices and therefore reduce the total number of required supervised training tasks at the expense of being less precise on the causality relations among the data.
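
The thresholding described in Remark 1 can be sketched as follows, under the assumption that the inclusion probability of a causal relation is estimated as the fraction of discovered graphs (one per RVE simulation) that contain the corresponding directed edge. The function names and the data layout are illustrative.

```python
from collections import Counter


def edge_inclusion_probabilities(graphs):
    """Fraction of graphs containing each (cause, effect) edge.

    graphs: list of edge collections, one per simulation.
    """
    counts = Counter(edge for graph in graphs for edge in set(graph))
    return {edge: count / len(graphs) for edge, count in counts.items()}


def threshold_edges(graphs, threshold=0.2):
    """Keep only the edges whose inclusion probability is not below the threshold."""
    probabilities = edge_inclusion_probabilities(graphs)
    return {edge: p for edge, p in probabilities.items() if p >= threshold}
```

Raising `threshold` prunes more of the rarely discovered relations and hence reduces the number of supervised learning tasks to be trained.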
4 Supervised learning for path-dependent material laws
Once causal relations are identified, a directed graph $G = (V, E)$ can be established where there is an edge $e_{ij} \in E$ from the node $V_i \in V$ to $V_j \in V$ if $V_i$ is a direct cause of $V_j$. Denote the leaf node to be a vertex that is not the cause of any other vertex, and the root node to be a vertex that is not the target of any other vertex. Figure 1 shows a directed graph whose information flow relates the root node(s) to the leaf node(s) via some intermediate nodes; e.g., in Figure 1, $\{V_1, V_2\}$, $\{V_6\}$, and $\{V_3, V_4, V_5\}$ are the sets of root, leaf, and intermediate nodes, respectively.
Along with the similar idea introduced in Wang and Sun [2019a], we aim to discover all sub-graphs
that sequentially pass the information from the root to leaf, see Algorithm 3. Each of these subgraphs
will contain leaf and root but without any intermediate nodes. As such, supervised learning can help us
8 Xiao Sun et al.
Supervised ML
Fig. 1: (a) shows a directed graph partitioned into three sub-graphs. (b) indicates how each sub-graph is used in a separate supervised machine learning task to predict the downstream node(s) from the upstream node(s).
train neural networks that predict the root of each sub-graph with the corresponding leaf (or leaves) as input(s). To identify all sub-graphs, we traverse the graph backward from the leaf nodes to the root nodes. A leaf together with its immediate predecessors forms a directed tree, which is added to the list of potential sub-graphs. We then remove the edges of this tree from the graph $G$, select a new leaf node from the updated $G$, and repeat the process until no edges remain in $G$. For the case shown in Figure 1, we have the following potential sub-graphs:
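$$G_a = (\{V_2, V_3, V_5, V_6\}, \{e_{26}, e_{36}, e_{56}\}), \tag{2}$$
$$G_b = (\{V_1, V_4, V_5\}, \{e_{14}, e_{15}\}), \tag{3}$$
$$G_c = (\{V_2, V_4, V_5\}, \{e_{24}, e_{25}\}), \tag{4}$$
$$G_d = (\{V_1, V_3\}, \{e_{13}\}). \tag{5}$$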
Those sub-graphs (directed trees) that share common upstream nodes are merged into a bigger sub-graph. For the graph in Figure 1, the final sub-graphs are:
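$$G_1 = G_a = (\{V_2, V_3, V_5, V_6\}, \{e_{26}, e_{36}, e_{56}\}), \tag{6}$$
$$G_2 = G_b \cup G_c = (\{V_1, V_2, V_4, V_5\}, \{e_{14}, e_{15}, e_{24}, e_{25}\}), \tag{7}$$
$$G_3 = G_d = (\{V_1, V_3\}, \{e_{13}\}). \tag{8}$$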
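The extraction and merging of sub-graphs described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `extract_learning_tasks` and the edge-list representation are hypothetical, and the leaf is chosen in sorted order for reproducibility rather than at random.

```python
from collections import defaultdict


def extract_learning_tasks(edges):
    """Return (input_nodes, output_nodes) pairs from the edge list of a DAG."""
    remaining = set(edges)
    potential = []  # one (parents, leaf) pair per extracted directed tree
    while remaining:
        sources = {u for u, _ in remaining}
        targets = {v for _, v in remaining}
        # A leaf has incoming edges but no outgoing edges in the current graph.
        leaf = sorted(targets - sources)[0]
        parents = frozenset(u for u, v in remaining if v == leaf)
        potential.append((parents, leaf))
        # Remove the edges of the tree that was just found.
        remaining = {(u, v) for u, v in remaining if v != leaf}
    # Merge the trees that share exactly the same upstream (input) nodes.
    merged = defaultdict(set)
    for parents, leaf in potential:
        merged[parents].add(leaf)
    return [(sorted(p), sorted(out)) for p, out in merged.items()]


# Edges of the directed graph in Figure 1, written as (upstream, downstream).
figure1_edges = [
    ("V2", "V6"), ("V3", "V6"), ("V5", "V6"),
    ("V1", "V4"), ("V1", "V5"), ("V2", "V4"), ("V2", "V5"),
    ("V1", "V3"),
]
tasks = extract_learning_tasks(figure1_edges)
```

For the graph of Figure 1, the resulting input-output pairs correspond to the three final sub-graphs: $V_2, V_3, V_5$ predict $V_6$; $V_1, V_2$ predict $V_4, V_5$; and $V_1$ predicts $V_3$.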
For each sub-graph, we have a separate supervised machine learning (ML) task. In each ML task, the input and output features are the upstream and downstream nodes of the directed sub-graph, respectively, as shown in Figure 1(b).
In the training step, we use the same architecture for all ML tasks, consisting of five layers: GRU-Dropout-GRU-Dropout-Dense. Each GRU layer has 32 neurons, and the linear activation function is used for the Dense layer. The Gated Recurrent Unit (GRU) is a type of Recurrent Neural Network (RNN) that models history dependence [Chung et al., 2014]. We use the GRU to strengthen the robustness of the ML black-box in dealing with any possible path-dependence [Wang et al., 2019]. The dropout rate in the GRU is controlled to quantify uncertainty in the model prediction, which will be detailed in Section 5. All the input and output feature columns for each supervised learning task are normalized to zero mean and unit standard deviation. The loss function is defined as the mean squared root error between the output prediction and the ground truth. This loss is minimized by the Adam optimizer inside the Keras library with TensorFlow 2.0
Algorithm 3 Obtain all supervised learning input-output pairs
1: Input: Directed graph $G = (V, E)$. ▷ result of causal graph
2: $S \leftarrow \emptyset$ ▷ a set of 2-tuples (input nodes, output nodes) for ML tasks
3: while $E \neq \emptyset$ do
4:     $V_l \leftarrow$ get a leaf node of $G$ ▷ if $G$ has multiple leaf nodes, returns a random leaf
5:     $\bar{V} \leftarrow \{V_l\}$
6:     $\hat{V} \leftarrow \{V_i \mid e_{il} \in E\}$
7:     $S \leftarrow S \cup \{(\hat{V}, \bar{V})\}$ ▷ $\hat{V}$ is the input node(s), and $\bar{V}$ is the output node for a potential ML task
8:     $E \leftarrow E \setminus \{e_{il} \mid V_i \in \hat{V}\}$
9: end while
10: Modification: Elements $i$ and $j$ of $S$, known as $(\hat{V}_i, \bar{V}_i)$ and $(\hat{V}_j, \bar{V}_j)$, respectively, are merged into one element $(\hat{V}_i, \bar{V}_i \cup \bar{V}_j)$ if $\hat{V}_i = \hat{V}_j$.
11: return $S$
as its backend. The optimization is performed by the mini-batch stochastic gradient descent algorithm with a batch size of 256 over 1000 epochs. The described neural network architecture and hyperparameters are chosen to be as close as possible to the ones used in Wang and Sun [2019a].
Note that the tuning of the hyperparameters (e.g., number of neurons, number of layers, type of activation) may have a significant effect on the performance of the neural network models. The best combination of hyperparameters can be estimated via a variety of approaches such as greedy search, random search [Bergstra and Bengio, 2012], random forest [Probst et al., 2019], Bayesian optimization [Klein et al., 2017], meta-gradient iteration, or deep reinforcement learning [Wang et al., 2020, Fuchs et al., 2021]. In this work, we adopt the random search approach of Bergstra and Bengio [2012] to fine-tune the hyperparameters (cf. Sec. 6.1). A rigorous hyperparameter study that compares different hyperparameter tuning strategies for neural networks that generate constitutive laws may provide further insights into the optimal setup of the hyperparameters but is out of the scope of this study.
For the blind prediction, after training, we start from the root (e.g., $U$ in the traction-separation law) and sequentially predict the intermediate nodes via the neural networks (NN) trained on their corresponding sub-graphs until reaching the leaf node. For example, in the case shown in Figure 1: NN 3 predicts $V_3$ from input $V_1$; $V_4$ and $V_5$ are predicted by NN 2 from inputs $V_1$ and $V_2$; and finally NN 1 is used to predict the target variable $V_6$ from input $V_2$ and the obtained intermediate nodes $V_3$ and $V_5$.
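The sequential prediction just described amounts to evaluating the sub-graph networks in an order in which all inputs of a network are available before it is called. A minimal sketch follows, assuming a hypothetical helper `predict_through_graph` and simple placeholder functions in place of the trained networks.

```python
def predict_through_graph(tasks, models, known):
    """Chain sub-graph predictors from the root nodes down to the leaf node.

    tasks  : list of (input_names, output_names) tuples, one per sub-graph
    models : dict mapping output_names to a callable returning one value per output
    known  : dict holding the values of the root nodes
    """
    values = dict(known)
    pending = list(tasks)
    while pending:
        # A task is ready once all of its inputs are observed or already predicted.
        ready = [t for t in pending if all(name in values for name in t[0])]
        if not ready:
            raise ValueError("remaining tasks depend on values that are never produced")
        for inputs, outputs in ready:
            predictions = models[outputs](*(values[name] for name in inputs))
            values.update(zip(outputs, predictions))
            pending.remove((inputs, outputs))
    return values


# Sub-graphs of Figure 1; the lambdas are placeholders for NN 1, NN 2, and NN 3.
tasks = [
    (("V2", "V3", "V5"), ("V6",)),
    (("V1", "V2"), ("V4", "V5")),
    (("V1",), ("V3",)),
]
models = {
    ("V6",): lambda v2, v3, v5: (v2 + v3 + v5,),
    ("V4", "V5"): lambda v1, v2: (v1 - v2, v1 * v2),
    ("V3",): lambda v1: (2.0 * v1,),
}
values = predict_through_graph(tasks, models, {"V1": 1.0, "V2": 2.0})
```

In this toy setting the networks for $V_3$, $V_4$, and $V_5$ are evaluated first, and the network for $V_6$ is evaluated only once its intermediate inputs exist.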
5 Uncertainty propagation in causal graph with dropout layers
As described in Section 4, we use a deep learning method, the GRU, in the training and prediction steps to handle path-dependent predictions. However, the GRU itself is not designed to capture prediction uncertainty, which is of crucial importance in learning material laws. In machine learning and statistics, Bayesian methods, as probabilistic models, provide a natural way to quantify model uncertainty by computing the posterior distribution of the unknown parameters [Xie and Xu, 2020]. However, these methods often suffer from a prohibitively high computational cost. In this paper, we show that the dropout technique [Srivastava et al., 2014] used in the GRU can quantify uncertainty in prediction as a Bayesian approximation.
Dropout, a regularization method that randomly masks or ignores neurons during training, has been widely used in many deep learning models to avoid over-fitting and improve prediction [Hinton et al., 2012, Li et al., 2016, Boluki et al., 2020]. Gal and Ghahramani [2016a] first proved the link between dropout and a well-known probabilistic model, the Gaussian process [Rasmussen, 2003], and showed that the use of dropout in feed-forward neural networks can be interpreted as a Bayesian approximation of Gaussian processes. In the context of RNNs, Gal and Ghahramani [2016b] treated RNNs as probabilistic models by assuming the network weights to be random variables with a Gaussian mixture prior (with one component fixed at zero with a small variance). Such a technique is similar to the spike-and-slab prior in Bayesian statistics for variable selection [Ishwaran et al., 2005]. Gal and Ghahramani [2016b] then showed that optimizing the objective in variational inference [Blei et al., 2017] for approximating the posterior distribution over the weights is equivalent to conducting dropout in the respective RNNs, and demonstrated the implementation in one commonly used RNN model, the long short-term memory [Hochreiter and Schmidhuber, 1997]. In this section, we propose to extend the technique developed in Gal and Ghahramani [2016b] to GRUs for uncertainty quantification (UQ) in the prediction.
Given training inputs $X$ and the corresponding output $Y$, suppose that we aim to predict an output $y$ for a new input $x$. From the Bayesian point of view, the prediction uncertainty can be characterized by the posterior predictive distribution of $y$ as follows:
$$p(y \mid x, X, Y) = \int p(y \mid x, \omega)\, p(\omega \mid X, Y)\, d\omega, \tag{9}$$
where $\omega$ includes all unknown model parameters and $p(\omega \mid X, Y)$ is the posterior distribution of $\omega$. In the GRU, all unknown weights can be viewed as $\omega$. As the posterior distribution $p(\omega \mid X, Y)$ is generally intractable, the variational inference method approximates it by proposing a variational distribution $q(\omega)$ and then finding the optimal parameters in the variational distribution through minimizing the Kullback-Leibler (KL) divergence between the approximating distribution and the full posterior distribution:
$$\mathrm{KL}(q(\omega) \,\|\, p(\omega \mid X, Y)) \propto -\int q(\omega) \log p(Y \mid X, \omega)\, d\omega + \mathrm{KL}(q(\omega) \,\|\, p(\omega)), \tag{10}$$
where $p(\omega)$ is the prior distribution of $\omega$.
Given an input sequence $X = [x_1, \ldots, x_T]$ of length $T$, the hidden state $h_t$ at time step $t$ in the GRU neural network can be generated as follows:
$$\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z), \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r), \\
i_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h), \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot i_t,
\end{aligned} \tag{11}$$
where $\sigma$ denotes the sigmoid function and $\odot$ denotes the element-wise product. Also, we assume that the model output at time step $t$ can be written as $f_Y(h_t) = h_t W_Y + b_Y$. Then the unknown parameters in the GRU are $\omega = \{W_Y, W_z, U_z, W_r, U_r, W_h, U_h, b_Y, b_z, b_r, b_h\}$. We write $h_t = f_h^{\omega}(x_t, h_{t-1})$ and $f_Y^{\omega}$ for the output in order to make the dependence on $\omega$ clear.
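To make the recursion in (11) concrete, the following NumPy sketch evaluates one GRU step and the output map $f_Y$. It is an illustration only: the dictionary layout of the parameters and the helper `random_params` are assumptions made for this example and do not reflect how the weights are stored in Keras.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, p):
    """One GRU update following (11); p maps names such as "Wz" to arrays."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])        # z_t
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])        # r_t
    i = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])  # i_t
    return z * h_prev + (1.0 - z) * i                              # h_t


def gru_outputs(xs, h0, p):
    """Run the recursion over a sequence and apply f_Y(h_t) = h_t W_Y + b_Y."""
    h, ys = h0, []
    for x_t in xs:
        h = gru_step(x_t, h, p)
        ys.append(h @ p["WY"] + p["bY"])
    return np.array(ys)


def random_params(n_in, n_hidden, n_out, rng):
    """Hypothetical initializer, used only for this illustration."""
    shapes = {
        "Wz": (n_hidden, n_in), "Wr": (n_hidden, n_in), "Wh": (n_hidden, n_in),
        "Uz": (n_hidden, n_hidden), "Ur": (n_hidden, n_hidden),
        "Uh": (n_hidden, n_hidden),
        "bz": (n_hidden,), "br": (n_hidden,), "bh": (n_hidden,),
        "WY": (n_hidden, n_out), "bY": (n_out,),
    }
    return {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}


rng = np.random.default_rng(0)
params = random_params(n_in=2, n_hidden=4, n_out=1, rng=rng)
xs = rng.standard_normal((5, 2))  # a sequence of T = 5 input vectors
ys = gru_outputs(xs, np.zeros(4), params)
```

Because $h_t$ is an element-wise convex combination of $h_{t-1}$ and a $\tanh$ output, the hidden state stays within $[-1, 1]$ whenever $h_0$ does.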
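Then the right-hand side of (10) can be written as follows:
$$-\int q(\omega) \log p\big(Y \mid f_Y^{\omega}(x_1, \ldots, x_T, f_h^{\omega}(x_T, f_h^{\omega}(\ldots f_h^{\omega}(x_1, h_0) \ldots)))\big)\, d\omega + \mathrm{KL}(q(\omega) \,\|\, p(\omega)), \tag{12}$$
which can be approximated by Monte Carlo integration, i.e., by generating samples $\hat{\omega}_b \sim q(\omega)$ and plugging the sampled $\hat{\omega}_b$'s into (12).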
Following Gal and Ghahramani [2016b], we use a mixture of Gaussian distributions as the variational distribution for every weight matrix row $\omega_k$:
$$q(\omega) = \prod_{k=1}^{K} q(\omega_k), \qquad q(\omega_k) = \pi\, \mathcal{N}(\omega_k; 0, \tau^2 I) + (1 - \pi)\, \mathcal{N}(\omega_k; m_k, \tau^2 I), \tag{13}$$
where $\pi$ is the dropout probability, $m_k$ is the variational parameter (row vector), and $\tau^2$ is a small variance. We optimize over $m_k$ by minimizing the KL divergence in (12). Sampling each row of $\hat{\omega}_b$ is equivalent to randomly masking rows in each weight matrix, i.e., conducting dropout. Then the predictive posterior distribution can be approximated by
$$p(y \mid x, X, Y) = \int p(y \mid x, \omega)\, p(\omega \mid X, Y)\, d\omega \approx \frac{1}{B} \sum_{b=1}^{B} p(y \mid x, \hat{\omega}_b), \tag{14}$$
where $\hat{\omega}_b \sim q(\omega)$ and $B$ is the total number of generated samples.
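The link between the mixture in (13) and dropout can be illustrated with a small NumPy sketch. The functions below are hypothetical and use a single linear layer instead of a GRU; they only show that, for a small $\tau$, sampling a row from (13) either keeps the row $m_k$ or replaces it by (almost) zero, and that averaging the resulting outputs gives a Monte Carlo estimate of the predictive mean associated with (14).

```python
import numpy as np


def sample_weight_rows(m, pi, tau, rng):
    """Draw one weight matrix from the row-wise Gaussian mixture q(omega)."""
    n_rows = m.shape[0]
    # True where a row is drawn from the zero-centered component (a dropped row).
    dropped = rng.random(n_rows) < pi
    centers = np.where(dropped[:, None], 0.0, m)
    return centers + tau * rng.standard_normal(m.shape), dropped


def predictive_moments(x, m, pi, n_samples, rng, tau=1e-6):
    """Mean and standard deviation of x @ omega_b over sampled weight matrices."""
    outputs = [x @ sample_weight_rows(m, pi, tau, rng)[0] for _ in range(n_samples)]
    return np.mean(outputs, axis=0), np.std(outputs, axis=0)


rng = np.random.default_rng(0)
m = rng.standard_normal((6, 2))  # variational parameters: 6 rows, 2 outputs
x = rng.standard_normal(6)       # one input vector
mean, std = predictive_moments(x, m, pi=0.2, n_samples=200, rng=rng)
```

Setting `tau` to zero recovers plain dropout, in which each row of the weight matrix is either kept exactly or zeroed out.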
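To implement the dropout in the GRU, we re-parametrize (11) as follows:
$$\begin{pmatrix} z_t \\ r_t \\ i_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \tanh \end{pmatrix} \left( \begin{pmatrix} x_t \odot m_x \\ h_{t-1} \odot m_h \end{pmatrix} \cdot \omega \right), \tag{15}$$
where $m_x$ and $m_h$ denote random masks repeated at all time steps.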
We take the prediction tasks in Fig. 1 as an example. The posterior predictive distribution of $V_3$ given $V_1$ follows directly from (14) since only root and leaf nodes are involved. When intermediate nodes are involved, e.g., in the sub-graph $G_1$, the posterior predictive distribution of $V_6$ given the root nodes ($V_1$ and $V_2$) can be computed by
$$p(V_6 \mid V_1, V_2) = \int p(V_6 \mid V_2, V_3, V_5, \omega)\, p(V_3 \mid V_1, \omega)\, p(V_5 \mid V_1, V_2, \omega)\, p(\omega \mid V_1, V_2)\, dV_3\, dV_5\, d\omega. \tag{16}$$
Specifically, when we generate $B$ samples from the posterior distribution of $\omega$, we also generate $B$ samples for the intermediate nodes. Those $B$ samples of $\omega$ and the corresponding intermediate nodes can then be used in Monte Carlo integration to calculate (16). Throughout this paper, we set $B = 200$.
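The Monte Carlo evaluation of (16) can be sketched as follows. The function `toy_net` is a hypothetical stand-in for a trained GRU whose dropout remains active at prediction time; it is not the authors' model. Each of the $B$ passes draws fresh dropout masks, samples the intermediate nodes $V_3$ and $V_5$, and feeds them to the network that predicts $V_6$.

```python
import random
import statistics


def toy_net(weights, rate):
    """Hypothetical stochastic predictor with dropout active at prediction time."""
    def forward(inputs, rng):
        kept = [w for w in weights if rng.random() >= rate]  # fresh mask per call
        return sum(inputs) * sum(kept) / ((1.0 - rate) * len(weights))
    return forward


def propagate(v1, v2, nets, n_samples=200, seed=0):
    """Monte Carlo mean and empirical 95% band of V6 given the roots V1 and V2."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        v3 = nets["V3"]([v1], rng)                   # sample of V3 given V1
        v5 = nets["V5"]([v1, v2], rng)               # sample of V5 given V1, V2
        draws.append(nets["V6"]([v2, v3, v5], rng))  # sample of V6
    draws.sort()
    lower = draws[int(0.025 * n_samples)]
    upper = draws[int(0.975 * n_samples) - 1]
    return statistics.fmean(draws), (lower, upper)


nets = {name: toy_net([1.0] * 10, rate=0.2) for name in ("V3", "V5", "V6")}
mean, (lower, upper) = propagate(1.0, 2.0, nets)
```

The uncertainty of the intermediate nodes is carried into the final prediction because each draw of $V_6$ is conditioned on its own draws of $V_3$ and $V_5$, rather than on their averages.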
6 Numerical Examples
In this section, we conduct two numerical experiments to test our proposed framework that combines causal discovery with deep learning to build constitutive laws for granular materials. In Section 6.1, causal discovery is conducted to determine the constitutive relationships for an RVE interface composed of spherical grains. The subsequent supervised machine learning then leverages the causal relations learned by the causal discovery algorithm to establish a series of supervised learning tasks that constitute a forecast engine for traction. The propagation of uncertainty is enabled by the dropout technique, which approximates Monte Carlo simulations to determine the confidence intervals for a given dropout rate. In Section 6.2, the same exercise is repeated for another data set to generate a hypoplasticity surrogate model for a discrete element assembly, where new topological measures are computed and incorporated into the proposed framework to (1) discover new physical mechanisms and (2) determine the benefit of the new discovery for the accuracy, robustness, and consistency of the forward predictions on unseen events.
6.1 Numerical Example 1: Machine Learning traction-separation law
Traction-separation laws are one of the main ingredients of cohesive fracture models used for brittle materials [Pandolfi et al., 2000, Park and Paulino, 2011]. Generally, a traction-separation law constitutes a relation between the traction and displacement jump fields over the fracture surface. Many hand-crafted models have been developed by experts for different applications [Park and Paulino, 2011], while no unified framework had been developed until recently, when Wang and Sun [2019a] suggested an approach based on reinforcement learning. Also, in some applications such as granular materials, more descriptors, e.g., porosity or fabric tensor, should be considered in these constitutive laws [Sun et al., 2013] to derive more predictive models. The lack of robustness in adding more descriptors is another weakness of classical models.
Our first test for the data-driven causal discovery model with dropout UQ is on the traction-separation law data publicly available in the Mendeley Data repository (cf. Sun and Wang [2019]). This dataset has also been used in Wang and Sun [2019a], where the traction-separation law is determined from reinforcement learning. Our major point of departure is three-fold. First, we develop a causal discovery algorithm to identify causal relations among history-dependent physical quantities in RVE simulations. Second, we decouple the causal discovery from the training of the neural network, such that we first discover causal relations and then utilize the discovered relationships to generate quantitative predictions using the method detailed in Section 4. Third, we introduce the Bayesian approximation using the dropout technique to propagate the uncertainty in the causal graph, building upon the theoretical framework established in Gal and Ghahramani [2016b].
The database includes 100 DEM experiments. In each DEM experiment, the time histories of all the variables included in the DAG are recorded. Each experiment is conducted with a different ratio of normal to tangential loading rate and different loading-unloading cycles on the same representative volume element of granular materials. As such, the total number of time-history data points in these experiments varies from 51 to 111.
The interested reader is referred to the appendix in Wang and Sun [2019a] for more information. In our study, the feature space consists of the displacement jump vector, traction vector, coordination number, symmetric part of the fabric tensor, and porosity. We use half of the experiments for causal discovery and training artificial neural networks, and the rest is used for testing and validation. In the causal discovery step, as different experimental setups may lead to different causal relations among variables, we apply the proposed causal discovery algorithm (Algorithms 1 and 2 in Section 3) to each of the training experiments, and then report the final causal graph by calculating the inclusion probabilities of the directed edges appearing in all training experiments. The inclusion probability of an edge is defined as the proportion of causal graphs containing this edge. The directed edges with inclusion probabilities larger than a pre-defined threshold (20% in our paper) are kept in the final causal graph. When both edge directions between two variables appear with positive inclusion probabilities (e.g., $V_i \to V_j$ and $V_j \to V_i$ both exist), we keep the edge direction that has the higher inclusion probability. The goal of the resultant model is to predict how the same granular assembly responds to a different cyclic loading path unseen in the training. As such, the focus of this model is to generate a surrogate for one representative volume element.
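The aggregation of the per-experiment causal graphs into a final graph can be sketched as below. The function name `final_causal_graph` and the edge-list representation are assumptions made for illustration. The treatment of ties between the two directions of an edge is not specified in the text; this sketch keeps both directions in that case.

```python
from collections import Counter


def final_causal_graph(graphs, threshold=0.2):
    """Keep directed edges whose inclusion probability exceeds the threshold."""
    counts = Counter(edge for graph in graphs for edge in set(graph))
    probability = {edge: n / len(graphs) for edge, n in counts.items()}
    kept = {}
    for (u, v), p in probability.items():
        if p <= threshold:
            continue  # not frequent enough across the training experiments
        if probability.get((v, u), 0.0) > p:
            continue  # the opposite direction appears more often
        kept[(u, v)] = p
    return kept


# Five hypothetical causal graphs, one per training experiment.
graphs = [
    [("U", "Poro"), ("Poro", "CN")],
    [("U", "Poro"), ("Poro", "CN")],
    [("U", "Poro"), ("CN", "Poro")],
    [("U", "Poro"), ("Poro", "CN")],
    [("U", "Poro"), ("CN", "Poro"), ("U", "T")],
]
final_edges = final_causal_graph(graphs)
```

In this toy input, the edge from `Poro` to `CN` (inclusion probability 0.6) is preferred over its reverse (0.4), and the edge from `U` to `T` is discarded because its inclusion probability of 0.2 does not exceed the threshold.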
Fig. 2 plots the final causal graph on the training data sets with edge inclusion probabilities. The strong confidence (96%) in the edge from the displacement jump vector to porosity is consistent with common field knowledge, i.e., the immediate consequence of a displacement jump is a volume change. The displacement jump vector, as the only control variable, affects all the intermediate physical quantities and the traction vector. This observation may seem trivial, but it is not always the case, as will be shown in the next example. The causal effects of fabric and coordination number on traction are aligned with the modern Critical State Theory [Li and Dafalias, 2012], and are obtained by the causal algorithm without expert interpretation. Note that fabric, due to its tensorial nature, encodes microstructural information such as directional dependence in more detail than porosity, which smears out information into one scalar quantity. Therefore, it is reasonable to see that fabric contributes considerably to describing material behavior with a complex arrangement of force chains at the microstructural level.
Remark 2 To make sure the suggested neural network architecture in Sec. 4 works satisfactorily in this problem, we trained several neural networks with different hyperparameters for each sub-graph learning task in Fig. 2. We used the random search approach [Bergstra and Bengio, 2012] implemented in the Keras Tuner package [O'Malley et al., 2019] for this study. The number of GRU layers is kept fixed and equal to two, and in the training stage, the dropout rate is set to zero. The number of epochs is set to 200. The parameters under study are as follows: the number of units in the GRU layers is sampled from the set {8, 16, 32, 64}, the Adam learning rate is sampled from the set {0.01, 0.001, 0.0001}, and the batch size for the SGD algorithm is sampled from the set {32, 64, 128, 256, 512}. Based on these hyperparameter ranges, each sub-graph learning task has 240 different configurations; however, in the random search algorithm, we set the number of trials to 100 for each sub-graph training task to reduce the overall computational time. For this hyperparameter tuning task, we choose 50 data sets as the training set and another 50 data sets as the validation set. Our metric for selecting the best configuration is the minimum validation loss. We found that the learning rate 0.001 and batch size 32 are common among all the best configurations of the sub-graphs. In Table 1, we study the effect of the number of units in the GRU layers when the learning rate and batch size take their optimal values.
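The count of 240 configurations can be reproduced under the assumption that the number of units is chosen independently for each of the two GRU layers, which gives 4 × 4 × 3 × 5 = 240. The snippet below enumerates this grid and draws 100 distinct trials from it; it is a plain-Python illustration and does not use the Keras Tuner API.

```python
from itertools import product
import random

gru_units = [8, 16, 32, 64]
learning_rates = [0.01, 0.001, 0.0001]
batch_sizes = [32, 64, 128, 256, 512]

# One entry per (units in layer 1, units in layer 2, learning rate, batch size).
grid = list(product(gru_units, gru_units, learning_rates, batch_sizes))

# Random search: 100 distinct configurations drawn without replacement.
trials = random.Random(0).sample(grid, k=100)

# Configurations sharing the learning rate 0.001 and batch size 32.
shared = [cfg for cfg in grid if cfg[2] == 0.001 and cfg[3] == 32]
```

Under the same assumption, at most 16 configurations share the learning rate 0.001 and batch size 32, which is compatible with the configuration counts reported in Table 1.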
Applying Algorithm 3 to Fig. 2, we need to perform four supervised learning tasks: 1) predict Poro from the input $U$; 2) predict CN from the input $U$ and the intermediate node Poro; 3) predict fabric from the input $U$ and the intermediate nodes Poro and CN; and 4) predict the target variable $T$ from the input $U$ and the obtained intermediate nodes Poro, CN, and fabric. We then use the GRU to train each sub-graph with a dropout rate of 0.2 for both training and feed-forward predictions. Fig. 3 confirms that the neural network architecture proposed in Section 4 yields satisfactory performance for all supervised tasks. To examine the generalization performance of the trained neural networks, we study the empirical cumulative distribution functions (eCDFs) for the training and test data sets following Wang and Sun [2019a].
Fig. 2: Final causal graph for the traction-separation law deduced from time-history of displacement, traction, porosity, coordination number, and fabric tensor. The number on each edge represents the edge inclusion probabilities among all possible causal relations from the training data sets.
Subgraph    Mean       Standard deviation    Number of configurations    Suggested NN
Porosity    1.6e-6     7.36e-7               10                          3.7e-6
CN          1.04e-3    2.3e-5                5                           1.07e-3
Fabric      2.2e-3     3.98e-4               7                           -2.4e-3
Traction    5.1e-5     1.01e-5               8                           9.1e-5
Table 1: Mean and standard deviation of the validation loss among configurations that differ in the number of units used for each GRU layer, when the optimal learning rate 0.001 and batch size 32 are chosen. Note that we randomly conduct 100 trials for each sub-graph with different hyperparameters. The last column shows the validation loss when the neural network has the architecture suggested in Sec. 4. Based on the standard deviation values, we observe that the number of units in the GRU layers has a marginal effect on the performance. The fixed neural network architecture suggested in Sec. 4 for all sub-graphs has almost the same performance as the best configurations.
We define the point-wise scaled mean squared error (MSE) between a set of ground-truth values of size $N$ and its corresponding approximation set as:
$$e_i = \frac{1}{N} \sum_{i=1}^{N} \left( S(y_i^{\text{true}}) - S(y_i^{\text{appx}}) \right)^2, \tag{17}$$
where $S$ is a scaling function. In this paper, the scaling function linearly transforms a set of values into a new set where all values are in the range [0, 1]. We perform 200 feed-forward predictions to obtain the distribution of each feature output at a specified load step. For the eCDF calculation only, we use the average of these 200 predictions to approximate the feature output. In this way, the discrete eCDF of a target output feature, such as porosity, at data point $i$ is defined as $F_N(e_i) = \frac{1}{M} \sum_{j=1}^{M} \mathbb{1}(e_i \ge e_j)$, where $e_i$ is the point-wise scaled MSE between the feature ground-truth value and the average of its predictions, $M$ is the total number of instances (i.e., the total number of data points across the 50 training data sets) used for the eCDF calculations, and $\mathbb{1}(\cdot)$ is the indicator function. Fig. 4 plots eCDFs for all feature outputs in training and testing modes.
Fig. 3: Training loss convergence behavior for four supervised learning tasks deduced from the causal graph. Top left: fabric is predicted based on displacement, porosity, and coordination number; Top right: porosity is predicted based on displacement; Bottom left: coordination number is predicted based on displacement and porosity; Bottom right: traction is predicted based on displacement, coordination number, fabric, and porosity.
In these plots, the eCDFs for the test and training cases are almost the same, indicating that no under-fitting or over-fitting issue exists. Note that the use of dropout in the GRU is not only for uncertainty quantification in prediction, but also for improving the generalization performance of the model.
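The eCDF of the scaled errors can be computed with a few lines of Python. This is a sketch under two assumptions that the text does not spell out: the scaling function is a min-max scaling based on the range of the ground-truth values, and the error is evaluated point by point against the average of the feed-forward predictions. The function names are hypothetical.

```python
def scaled_squared_errors(y_true, y_pred_mean):
    """Point-wise squared error after min-max scaling with the range of y_true."""
    span = max(y_true) - min(y_true)
    # The offset of the linear scaling cancels in the difference of scaled values.
    return [((t - p) / span) ** 2 for t, p in zip(y_true, y_pred_mean)]


def ecdf(errors):
    """F(e_i): fraction of the errors e_j that do not exceed e_i."""
    m = len(errors)
    return [sum(e_j <= e_i for e_j in errors) / m for e_i in errors]


errors = scaled_squared_errors([0.0, 5.0, 10.0], [0.0, 6.0, 10.0])
cdf_values = ecdf([0.1, 0.4, 0.2, 0.4])
```

Plotting the sorted errors against their eCDF values for the training and test sets gives curves of the kind compared in Fig. 4; nearly overlapping curves indicate similar error distributions on both sets.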
In the following, we present prediction results for one of the test cases, whose applied normal and shear displacements are plotted in Fig. 5. The normal and shear displacement jumps follow a cyclic loading-unloading path and are kept equal in magnitude.
We focus on the average of the model predictions in Figs. 6 and 7. In Fig. 6, we see that the initial friction angle is close to 16.7 degrees, which is almost half of the inter-particle friction angle. This reduction in the overall friction angle might be due to the induced dilation in the normal displacement. Another reason could be related to the initial confining pressure: the higher the confining pressure, the lower the friction angle. In each loading-unloading branch, the behavior is almost linear without any energy dissipation, but further loading beyond a certain level makes the behavior nonlinear. If we only follow the loading path, we observe strain softening, which is the dominant mechanism of a dense granular assemblage; see Figs. 6 and 7(b). In other words, the material shows an unstable peak shear strength, which is followed by softening behavior until it reaches the critical state. The signs of the changes in normal traction (Fig. 9(a)) and shear traction (Fig. 9(b)) are in agreement with those of the fabric normal (Fig. 10(a)) and shear (Fig. 10(b)) components, respectively. This confirms the tendency of the fabric tensor to trace the load direction [Li and Li, 2009, Li and Dafalias, 2012, Wang and Sun, 2016]. Overall, the proposed data-driven scheme can replicate the main features of a realistic experiment, and there exists good agreement between the model and the experiment. However,
Fig. 4: Empirical Cumulative Distribution Function (eCDF) for prediction on training data sets and mean value of predictions on test data sets. We use 50 data sets for training and 50 other data sets for testing purposes. For each test case, we perform 200 feed-forward predictions with the dropout rate 0.2.
Fig. 5: Applied normal and shear displacements in one of the experimental cases. An increment in the normal jump indicates more compression.
there is an issue in the second loading-unloading cycle, where hysteresis is predicted by the model while the experiment shows almost zero energy dissipation. This is mainly due to the neural network capacity and design and can be resolved by enriching the neural network architecture with more neurons per layer or more layers, or by hyperparameter tuning. Note that one needs to be aware of the over-fitting issue when the model complexity increases with the number of neurons. Generally, a more complex neural network should be trained with more data.
Fig. 6: Comparison of normal-shear traction between model and experiment in one case.
Fig. 7: Comparison of traction-displacement between model and experiment in one case.
The uncertainty in the traction vector prediction is shown in Fig. 8. Density distributions of the traction vector at three loading steps are plotted in Fig. 9. In this figure, steps 25, 47, and 100 belong to the first unloading, second peak, and last peak conditions, respectively (see Fig. 8). Fig. 8 suggests that the model is able to track the path-dependent behavior of the experiments with narrow variation bands in most of the loading steps. This figure also suggests that the uncertainty for shear traction is higher than for normal traction and increases at peak loads. We know from mechanics that the shear mode of deformation is more complex and nonlinear than the normal mode and is consequently expected to carry higher uncertainty, which agrees with these results (also see steps 47 and 100 in Fig. 9 for a quantitative comparison). At peak values, the complexity is more pronounced due to the cyclic loading or softening, so more uncertainty is expected.
Model predictions for the fabric tensor are plotted in Fig. 10. Density distributions of the fabric tensor at three loading steps are plotted in Fig. 11. The uncertainty in the fabric tensor has narrow variation bands in most of the loading steps. Similar to the traction prediction, the uncertainties in the normal, shear, and mixed modes are higher at peak loads due to the cyclic loading or softening. However, comparing the normal, shear, and
Fig. 8: Model predictions for normal (a) and shear (b) traction values. The shaded area includes predictions within the 95% confidence interval.
Fig. 9: Box plots of density distributions of normal (a) and shear (b) traction at three load steps. The top line is the maximum value and the bottom line is the minimum value. The three thick lines of each box are the first quartile, median, and third quartile, respectively.
mixed modes, we do not observe significant differences in uncertainty at the three loading steps (Fig. 11). Interestingly, we observe that the traction predictions have less uncertainty than the fabric predictions at the initial load steps (steps 0 to 20), even though fabric is an intermediate node for the traction prediction. This means that the traction prediction is potentially less dependent on the fabric at the initial loading steps, and that the neural network weights are automatically adjusted to make predictions with as high confidence as possible through an appropriate combination of porosity, displacement, and fabric. We observe an almost linear correlation between the normal displacement jump and porosity, so we do not present the porosity prediction results due to their simplicity. Such a correlation is expected since the normal displacement is the boundary condition in this problem, and dilation is explicitly controlled during the experiment.
6.2 Numerical Example 2: Machine Learning hypoplasticity
In the second numerical experiment, we attempt to generate a predictive surrogate model for one numerical granular assembly undergoing monotonic true triaxial compression loading. For convenience,
Fig. 10: Model prediction for components of the symmetric fabric tensor $A$. The normal (a), shear (b), and mixed (c) components of the fabric tensor are $A_{nn}$, $A_{ss}$, and $A_{ns}$, respectively. The shaded area includes predictions within the 95% confidence interval.
Fig. 11: Box plots of density distributions of fabric’s components.
discrete element simulations are used as a replacement for physical tests. These discrete element simulations are run via the open-source software YADE [Šmilauer et al., 2010].
In total, 60 true triaxial compression tests with loading paths that vary the principal stresses $\sigma_1$, $\sigma_2$, and $\sigma_3$ are performed on the same numerical specimen. Before the shearing phase, the material is subjected to hydrostatic loading to compress the assembly until it reaches the initial confining pressure. Following this step, a vertical compression or extension, or a change of the applied tractions on the side walls, is prescribed to generate different stress paths. To facilitate third-party validation and reproduction of the simulation results, the data used for the causal discovery are made publicly available via Mendeley Data [Vlassis et al., 2020b].
6.2.1 Data-driven causal relations of granular matter
Fig. 12 shows the final causal graph of the causal discovery algorithm applied to the true triaxial test data generated from discrete element simulations. The number on each edge represents the edge inclusion probability in the calibration experimental data sets.
The causal discovery driven by the small set of calibration data reveals a number of key observations that are worth noting. First, the causal discovery algorithm does rediscover conventional wisdom, such as the facts that 1) the change of the coordination number is due to the expansion of the void space; 2)
Data-driven discovery of interpretable causal relation for material laws 19
Fig. 12: Final Causal graph for the hypoplasticity relations deduced from time-history of strain, stress, and
9 other measures of microstructural and topological properties. The number on each edge represents the
edge inclusion probabilities among all possible causal relations from the training data sets.
both the coordination number and the porosity changes may cause changes on the fabric tensors; and 3)529
the dominate role of the strong fabric tensors on the resultant stress. These observations are consistent with530
previous findings in a number of discrete element simulation literature [Wang et al.,2017,Sun et al.,2013,531
Kuhn et al.,2015,Shi et al.,2018] and the anisotropic critical state theory [Li and Dafalias,2012,Fu and532
Dafalias,2015,Zhao and Guo,2013].533
In addition to rediscovering known relations, the causal discovery algorithm also finds a few causal relationships that, to the best knowledge of the authors, are not reported in the existing literature. For instance, the algorithm establishes a causal relationship in which changes in the average clustering coefficient may affect the local efficiency of the particle connectivity, whereas the degree assortativity coefficient, a measure of the similarity of the connections in the graph, may affect the graph transitivity. Interestingly, the algorithm also finds that changes of the strong fabric tensor may be caused by changes of the strain (93%), porosity (60%), coordination number (73%), graph density (73%), local efficiency (70%), graph transitivity (70%), degree assortativity (70%), and graph clique number (73%). This discovery indicates that the changes of the strong fabric tensor are driven by changes in the underlying connectivity topology and by volume changes of the void space.
Another interesting discovery is that, in the conditional-independence sense, the changes of the stress tensor are caused only by the changes of the strong fabric tensor. This result is consistent with the previous finding for 2D granular materials reported in Shi et al. [2018], where it is shown that (1) the principal direction of the strong fabric tensor (but not necessarily of other fabric tensors) is coaxial with the homogenized Cauchy stress, and (2) the fabric tensor and the stress tensor are related by a scalar coefficient that may vary according to the mean pressure.
6.2.2 Predictions based on discovered causal relationships
Here we investigate the accuracy, robustness, and limitations of the machine learning predictions generated based on the deduced causal relations. For comparison purposes, we train two sets of neural networks: one incorporates the newly discovered causal relationships into the predictions, while the other employs only the strain, fabric tensor, and porosity to predict the stress. The latter neural network serves as the control experiment for the former. The supervised learning procedures used to train the two models are identical.
We first do not use dropout layers in the GRU, and hence the dropout rate is zero. The hyperparameters are obtained by repeated trial and error and are summarized in Table 2. All the sub-graph predictions, regardless of the number of input variables, are made by neural networks with the identical architecture listed in Table 2. Afterwards, we conduct a cross-validation in which the trained neural networks are tasked to predict the homogenized Cauchy stress obtained from both the calibration and the testing simulation data. The results are shown in Fig. 13. Unlike the traction-separation law examples, the predicted stress-strain curves for the true triaxial tests exhibit profound over-fitting regardless of whether the additional graph metrics are used for the predictions.
NN setting description         Abbreviation          Values
Neuron type subset             NeuronType            GRU
Hidden layers subset           numHiddenLayers       3
Number of neurons per layer    numNeuronsPerLayer    32
Dropout rate subset            DropOutRate           0.0
Optimizer type subset          Optimizer             Adam
Activation functions subset    Activation            relu
Batch sizes subset             BatchSize             128
Minimum Learning rate          ReduceLROnPlateau     0.95
Table 2: Hyperparameters used to train the neural network
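The sub-graph predictions described above are evaluated in sequence, with each variable predicted from its parents in the causal graph. A minimal sketch of that ordering is given below, assuming the graph is acyclic and that each non-root node has its own trained predictor. The node names, the three-level graph, and the placeholder lambdas (standing in for the trained GRU networks) are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def propagate(parents, models, root_values):
    """Evaluate every node after its parents, following a topological order.

    parents: dict mapping each non-root node to the list of its parent nodes.
    models: dict mapping each non-root node to a callable on parent values.
    root_values: dict of prescribed values for the root nodes.
    """
    values = dict(root_values)
    for node in TopologicalSorter(parents).static_order():
        if node in values:  # root node: its value is prescribed
            continue
        inputs = [values[p] for p in parents[node]]
        values[node] = models[node](inputs)
    return values

# Hypothetical graph: strain -> porosity -> fabric -> stress.
parents = {
    "porosity": ["strain"],
    "fabric": ["strain", "porosity"],
    "stress": ["fabric"],
}
# Placeholder callables standing in for trained sub-networks.
models = {
    "porosity": lambda x: 0.4 + 0.1 * x[0],
    "fabric": lambda x: x[0] + x[1],
    "stress": lambda x: 2.0 * x[0],
}
result = propagate(parents, models, {"strain": 0.5})
print(result)
```

Because each node consumes the predictions of its parents, any error or uncertainty in an upstream prediction is carried downstream to the stress.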
The roughly two-orders-of-magnitude difference in the stress predictions suggests that either a regularization strategy or more data is needed to close the gap in accuracy between the calibration and the blind prediction data. Notice that, while expanding the data set is not difficult for discrete element simulations, it is certainly very difficult to conduct 60 true triaxial tests physically in a typical laboratory. As such, the results indicate the difficulty of creating a forecast engine that predicts stress responses for unconventional stress paths, even when the simulations are free of the issues, noise, and errors exhibited in physical experiments.
Interestingly, the neural network with the new graph measures does not significantly reduce the mean prediction errors. However, a closer examination of the tails of the eCDFs of the two testing curves in Fig. 13 does indicate that the neural network armed with the new knowledge produces a less catastrophic worst-case scenario. Figs. 14, 15, and 16 show the predicted principal stress differences against the benchmark data, in which $q_1 = \sigma_1 - \sigma_2$, $q_2 = \sigma_1 - \sigma_3$, and $q_3 = \sigma_2 - \sigma_3$, where $\sigma_1 \geq \sigma_2 \geq \sigma_3$ are the principal stresses. In both the calibration and the testing cases, the discrepancies in the principal stress differences are minor.
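The comparison of eCDF tails can be made concrete with a short sketch. It assumes one scalar error per simulation, which may differ from the error measure actually plotted in Fig. 13; the error values are made up for illustration.

```python
import numpy as np

def ecdf(samples):
    """Return sorted samples x and the empirical CDF values y = P(X <= x)."""
    x = np.sort(np.asarray(samples, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Hypothetical per-simulation errors for one model.
errors = [0.02, 0.5, 0.01, 0.03]
x, y = ecdf(errors)
worst_case = x[-1]  # the right end of the eCDF tail
print(x, y, worst_case)
```

Two models can share a similar mean error while differing in `worst_case`, which is the situation described for the two testing curves.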
For brevity, we do not present all 30 forward prediction results. Here, we pick four samples, two calibrations (Test No. 23 and 29) and two blind predictions (Test No. 50 and 56), for close examination. Figs. 14, 15, and 16 compare the three principal stress differences inferred from the recurrent neural network with those obtained from the discrete element simulations. In these figures, TXC and TXE denote triaxial compression and extension tests. For simplicity, these tests do not contain cyclic loading; as a result, the prediction task is much simpler. Nevertheless, despite the relatively small data set, the trained neural networks in the causal graph are capable of properly predicting important characteristics such as hardening and softening. The predictions also exhibit more fluctuation, which is undesirable. However, this can presumably be suppressed with a different set of activation functions and other regularization strategies.
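For reference, the principal stress differences compared in Figs. 14 to 16 can be computed from a symmetric stress tensor as sketched below. The sketch assumes the definitions $q_1 = \sigma_1 - \sigma_2$, $q_2 = \sigma_1 - \sigma_3$, $q_3 = \sigma_2 - \sigma_3$ with $\sigma_1 \geq \sigma_2 \geq \sigma_3$, and a compression-positive sign convention; the function name and the sample tensor are illustrative only.

```python
import numpy as np

def principal_stress_differences(sigma):
    """Return (q1, q2, q3) for a symmetric 3x3 stress tensor.

    q1 = s1 - s2, q2 = s1 - s3, q3 = s2 - s3, with s1 >= s2 >= s3.
    """
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    s3, s2, s1 = np.linalg.eigvalsh(np.asarray(sigma, dtype=float))
    return float(s1 - s2), float(s1 - s3), float(s2 - s3)

# Hypothetical stress state in kPa, already aligned with the principal axes.
sigma = np.diag([300.0, 500.0, 400.0])
q1, q2, q3 = principal_stress_differences(sigma)
print(q1, q2, q3)
```

By construction $q_2 = q_1 + q_3$, so only two of the three differences are independent.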
The second important characteristic that warrants attention is the state path of void ratio vs. logarithm of mean pressure. Here, we consider compressive pressure as positive, and the results are shown in Fig. 17. Again, the predictions indicate that the trained neural network is able to predict the elastic compression followed by plastic dilatancy in the triaxial compression (TXC) cases and the elastoplastic expansion in the triaxial extension (TXE) cases.
Fig. 13: Empirical Cumulative Distribution Function (eCDF) for predictions on the training data sets and mean value of predictions on the test data sets. There are 60 simulations, 30 used for calibration and 30 for blind forward testing. Blue curves are predictions made by neural networks generated according to the causal graph; the black curve is the control experiment counterpart, generated from predictions that take strain, fabric tensor, and porosity as inputs to predict the Cauchy stress.
Next, we examine the strong fabric tensor and its relationships with the graph measures. Here, the fabric tensor $F$ is computed as the sum of the dyadic products of the branch vectors $n$ divided by the number of grain contacts $n_c$, i.e.,
$$F = \frac{1}{n_c} \sum_{i=1}^{n_c} n \otimes n. \quad (18)$$
The strong fabric tensor is obtained by considering only the subset of contacts whose normal contact force is larger than a threshold value. In this work, this threshold value is set to the average contact force. For brevity, we only show the normalized fabric anisotropy variable, which measures the alignment between the fabric tensor and the normalized deviatoric component of the stress $n_{dev}$,
$$A = \frac{1}{\sqrt{F : F}} F : n_{dev}, \quad (19)$$
in Fig. 18. Recall that a normalized fabric anisotropy variable $A = 1$ is a necessary condition for a material to reach the critical state [Fu and Dafalias, 2011, Li and Dafalias, 2012, Zhao and Guo, 2013]; hence the predictions of $A$ may indicate how accurately the neural network in the causal graph predicts the critical state. Comparing the predictions of $A$ in the calibration cases and the blind tests indicates that the neural network tends to delay the predicted onset of the critical state. This may explain the over-fitting exhibited in Fig. 13.
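The quantities in Eqs. (18) and (19) can be sketched numerically as follows. The sketch assumes that the vectors $n$ are supplied as unit vectors, one per contact, that the strong subset keeps contacts whose normal force exceeds the mean force, and that Eq. (19) normalizes by the Frobenius norm $\sqrt{F : F}$. All function names and inputs are hypothetical and are not the authors' implementation.

```python
import numpy as np

def fabric_tensor(normals):
    """Eq. (18): average of the dyadic products n (x) n over all contacts."""
    n = np.asarray(normals, dtype=float)          # shape (n_c, 3)
    return np.einsum("ci,cj->ij", n, n) / n.shape[0]

def strong_fabric_tensor(normals, normal_forces):
    """Fabric tensor restricted to contacts with above-average normal force."""
    n = np.asarray(normals, dtype=float)
    f = np.asarray(normal_forces, dtype=float)
    return fabric_tensor(n[f > f.mean()])

def fabric_anisotropy(F, sigma):
    """Eq. (19): A = (F : n_dev) / sqrt(F : F)."""
    sigma = np.asarray(sigma, dtype=float)
    s = sigma - np.trace(sigma) / 3.0 * np.eye(3)  # deviatoric stress
    n_dev = s / np.linalg.norm(s)                  # unit-norm direction
    F = np.asarray(F, dtype=float)
    # tensordot with its default axes=2 performs the double contraction.
    return float(np.tensordot(F, n_dev) / np.sqrt(np.tensordot(F, F)))

# Two hypothetical contacts with unit vectors along x and y.
normals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
F_all = fabric_tensor(normals)
F_strong = strong_fabric_tensor(normals, normal_forces=[1.0, 3.0])

# A purely deviatoric tensor aligned with the deviatoric stress gives A = 1.
dev = np.diag([2.0, -1.0, -1.0])
A = fabric_anisotropy(dev, dev + 5.0 * np.eye(3))
print(F_all, F_strong, A)
```

The last lines check only the limiting case of perfect alignment; they say nothing about how $A$ evolves in the simulations.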
Finally, we examine the predictions of the graph measures most likely to influence the predictions of the fabric tensors. Fig. 19 shows that the graph density decreases during the shear phase in both the triaxial compression and extension tests. According to the causal graph, the deformation changes the graph density, which in turn affects the fabric tensors. These causal relationships seem reasonable, as the deformation during the shear phase is likely to cause plastic dilatancy and therefore reduce the number of contacts, which explains the drop in the graph density and the resultant changes in the fabric tensors.
Similar reasoning can also explain the drops in the average clustering shown in Fig. 20, where the shear deformation reduces the tendency of the particles to cluster together, which in turn leads to the evolution of the fabric tensor.
These results indicate that, while the causal discovery may reveal potential causal relations not apparent to domain experts, the knowledge from causal relationships does not necessarily lead to more accurate predictions. The choice of the supervised machine learning method and the availability of data are also key factors that affect the usefulness of the new knowledge for predictions.
(a) Calibration No. 23 (TXE) (b) Calibration No. 29 (TXE)
(c) Blind Test No. 50 (TXC) (d) Blind Test No. 56 (TXC)
Fig. 14: Difference between the major and minor principal stress vs. axial strain. Compressive strain has
positive sign convention.
6.2.3 Uncertainty propagation with dropout layer
As the final numerical experiment, we activate dropout layers to collect the results of stochastic forward passes through the model. This gives us a Monte Carlo estimate of the predictions. Note that the activation of the dropout layers leads to a different set of neuron weights even when the data used for training the neural network are identical.
Figs. 21, 22, and 23 show the confidence intervals for 200 Monte Carlo predictions of the principal stress differences $q_1$, $q_2$, and $q_3$ vs. axial strain for 4 selected triaxial extension and compression loading paths.
In most of the cases shown in Figs. 21, 22, and 23, the mean paths of the stochastic predictions generated with the dropout layers match the experimental benchmarks qualitatively. Furthermore, in most cases, the principal stress differences observed in the experiments are within the 95% confidence interval. It should nevertheless be noted that the blind tests are less accurate than the calibration cases, indicating that the neural networks may have been over-fitted.
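The Monte Carlo procedure can be sketched with a toy network in plain NumPy. The sketch below is only an illustration under stated assumptions: the weights are random and untrained, the dropout rate of 0.2 is a placeholder, a single dense hidden layer replaces the GRU, and the 95% interval is taken from empirical percentiles, which may differ from how the shaded areas in the figures were computed.

```python
import numpy as np

# Fixed, untrained placeholder weights for a one-hidden-layer network.
_rng = np.random.default_rng(0)
W1, b1 = _rng.normal(size=(1, 32)), np.zeros(32)
W2, b2 = _rng.normal(size=(32, 1)) / 32.0, np.zeros(1)

def stochastic_forward(x, rate, rng):
    """One forward pass with dropout kept active at prediction time."""
    h = np.maximum(x @ W1 + b1, 0.0)      # ReLU hidden layer
    keep = rng.random(h.shape) >= rate    # random dropout mask
    h = h * keep / (1.0 - rate)           # inverted-dropout rescaling
    return h @ W2 + b2

def mc_dropout(x, n_passes=200, rate=0.2, seed=1):
    """Mean and 95% percentile band over repeated stochastic passes."""
    rng = np.random.default_rng(seed)
    samples = np.stack(
        [stochastic_forward(x, rate, rng) for _ in range(n_passes)]
    )
    mean = samples.mean(axis=0)
    lower, upper = np.percentile(samples, [2.5, 97.5], axis=0)
    return mean, lower, upper

x = np.linspace(0.0, 1.0, 5).reshape(-1, 1)  # hypothetical strain inputs
mean, lower, upper = mc_dropout(x)
print(mean.ravel(), lower.ravel(), upper.ravel())
```

Each pass draws a new dropout mask, so the spread of the 200 outputs at a given input serves as the uncertainty estimate for that input.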
To examine how uncertainty is propagated in the causal graph, we plot the diagonal components of the fabric tensor; the results are shown in Figs. 24, 25, and 26. For brevity, the off-diagonal components of the fabric tensor, which are much smaller than the diagonal components, are not provided here. Comparing the 95% confidence interval of the fabric tensor with that of the principal stress differences, one can see that the stress predictions tend to be more accurate when the fabric tensor is determined more precisely, i.e., with a narrower confidence interval.
(a) Calibration No. 23 (TXE) (b) Calibration No. 29 (TXE)
(c) Blind Test No. 50 (TXC) (d) Blind Test No. 56 (TXC)
Fig. 15: Difference between the major and intermediate principal stress vs. axial strain. Compressive strain has positive sign convention.
7 Conclusions
In this paper, we introduce, for the first time, a data-driven framework that combines 1) a causal discovery algorithm that detects unknown causal relations, 2) a Bayesian approximation for uncertainty quantification enabled by the dropout technique, and 3) the recurrent neural network technique to analyze, interpret, and forecast the path-dependent responses of granular materials. Numerical experiments conducted on idealized granular systems indicate that the data-driven framework is able to discover hidden causal relationships and to propagate the uncertainty generated from a sequence of structured neural network predictions within a causal graph. This approach has the potential to help modelers and experimentalists spot hidden mechanisms not apparent to the human eye, as well as deduce complex causal relationships in a high-dimensional parametric space where intuition and domain knowledge are not sufficient due to the dimensionality of the data. Future work may include improvements and comparisons of different causal inference methods, extensions to recover causal relations when both instantaneous and lagged causal relations exist, as well as applications to more complex granular systems where particles have different shapes and properties.
(a) Calibration No. 23 (TXE) (b) Calibration No. 29 (TXE)
(c) Blind Test No. 50 (TXC) (d) Blind Test No. 56 (TXC)
Fig. 16: Difference between the intermediate and minor principal stress vs. axial strain. Compressive strain has positive sign convention.
8 Availability of data, material, and code for reproducing results
The causal discovery algorithm can be found at Sun et al. [2020b]. The recurrent neural network is built with TensorFlow, and the code to complete the training and generate the forecast engine can be found at Sun et al. [2020a]. The discrete element simulation data can be found in the Mendeley Data repositories [Sun and Wang, 2019, Vlassis et al., 2020b].
9 Acknowledgments
The members of the Columbia research group involved in this research are supported by the National Science Foundation under grant contracts CMMI-1846875 and OAC-1940203, the Earth Materials and Processes program from the US Army Research Office under grant contract W911NF-18-2-0306, and the Dynamic Materials and Interactions Program from the Air Force Office of Scientific Research under grant contract FA9550-17-1-0169. The members of the Johns Hopkins University group are supported by the National Science Foundation under grant contract 1940107. This support is gratefully acknowledged. The authors would also like to thank Dr. Kun Wang from Los Alamos National Laboratory for providing the data for the traction-separation law.
(a) Calibration No. 23 (TXE) (b) Calibration No. 29 (TXE)
(c) Blind Test No. 50 (TXC) (d) Blind Test No. 56 (TXC)
Fig. 17: State path (void ratio vs. logarithm of mean pressure). Compressive strain has positive sign convention.
The views and conclusions contained in this document are those of the authors, and should not be interpreted as representing the official policies, either expressed or implied, of the sponsors, including the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
9.1 Author statement
Xiao Sun and Bahador Bahmani contributed equally as first authors. All authors have contributed to the planning, writing, reviewing, and editing of this manuscript.
9.2 Declaration of Competing Interest
The authors confirm that there are no relevant financial or non-financial competing interests to report.
(a) Calibration No. 23 (TXE) (b) Calibration No. 29 (TXE)
(c) Blind Test No. 50 (TXC) (d) Blind Test No. 56 (TXC)
Fig. 18: Normalized fabric anisotropy variable vs. axial strain. Compressive strain has positive sign convention.
A Appendix: Proof of Theorem 1
Theorem 1 Given Assumptions 1-3, for every $V_i, V_j \in V \setminus U$, $V_i$ and $V_j$ are not adjacent in $G$ if and only if they are independent conditional on some subset of $\{V_k \mid V_k \in V \setminus U, k \neq i, k \neq j\} \cup \{U\}$.
Proof. From equation (1), any variable $V_i$ in $V \setminus U$ can be written as a function of $\{\theta_i(U)\}_{i=1}^{m-1}$ and $\{e_i\}_{i=1}^{m-1}$, where $m-1$ is the number of vertices included in $V \setminus U$ since $V$ includes $m$ vertices. Therefore, the distribution of $V \setminus U$ at each value of $U$ is determined by the distribution of $e_1, \ldots, e_{m-1}$ and the values of $\{\theta_i(U)\}_{i=1}^{m-1}$. For any $V_i, V_j \in V \setminus U$ and $S \subseteq \{V_k \mid V_k \in V \setminus U, k \neq i, k \neq j\}$, $p(V_i, V_j \mid S \cup \{U\})$ is determined by $\prod_{i=1}^{m-1} p(e_i)$ and $\{\theta_i(U)\}_{i=1}^{m-1}$. Since $\prod_{i=1}^{m-1} p(e_i)$ does not change with $U$, we have
$$p(V_i, V_j \mid S \cup \{\theta_i(U)\}_{i=1}^{m-1} \cup \{U\}) = p(V_i, V_j \mid S \cup \{\theta_i(U)\}_{i=1}^{m-1}). \quad (20)$$
Using $\perp\!\!\!\perp$ to denote independence, it follows that
$$U \perp\!\!\!\perp (V_i, V_j) \mid S \cup \{\theta_i(U)\}_{i=1}^{m-1}. \quad (21)$$
Applying the weak union property of conditional independence, we have $U \perp\!\!\!\perp V_i \mid \{V_j\} \cup S \cup \{\theta_i(U)\}_{i=1}^{m-1}$.
(a) Calibration No. 23 (TXE) (b) Calibration No. 29 (TXE)
(c) Blind Test No. 50 (TXC) (d) Blind Test No. 56 (TXC)
Fig. 19: Graph density vs. axial strain. Compressive strain has positive sign convention.
Suppose that $V_i$ and $V_j$ are not adjacent in $G$; then they are not adjacent in $G^{aug}$. There exists a set $S \subseteq \{V_k \mid V_k \in V \setminus U, k \neq i, k \neq j\}$ such that $S \cup \{\theta_i(U)\}_{i=1}^{m-1}$ d-separates $V_i$ and $V_j$. Because of Assumption 1, we have
$$V_i \perp\!\!\!\perp V_j \mid S \cup \{\theta_i(U)\}_{i=1}^{m-1}. \quad (22)$$
Since all $\theta_i(U)$ are deterministic functions of $U$, we have $p(V_i, V_j \mid S \cup U) = p(V_i, V_j \mid S \cup \{\theta_i(U)\}_{i=1}^{m-1} \cup U)$. Equations (21) and (22) imply that $V_i \perp\!\!\!\perp (U, V_j) \mid S \cup \{\theta_i(U)\}_{i=1}^{m-1}$. By the weak union property of conditional independence, we have $V_i \perp\!\!\!\perp V_j \mid S \cup \{\theta_i(U)\}_{i=1}^{m-1} \cup \{U\}$. Since all $\theta_i(U)$ are deterministic functions of $U$, it follows that $V_i \perp\!\!\!\perp V_j \mid S \cup \{U\}$.
Now we prove that if $V_i$ and $V_j$ are conditionally independent given a subset $S$ of $\{V_k \mid V_k \in V \setminus U, k \neq i, k \neq j\} \cup \{U\}$, then $V_i$ and $V_j$ are not adjacent in $G$. Because of Assumption 2 (faithfulness), $V_i$ and $V_j$ are not adjacent in $G^{aug}$. Therefore, they are not adjacent in $G$. $\square$
(a) Calibration No. 23 (TXE) (b) Calibration No. 29 (TXE)
(c) Blind Test No. 50 (TXC) (d) Blind Test No. 56 (TXC)
Fig. 20: Average Clustering vs. axial strain. Compressive strain has positive sign convention.
B Appendix: Graph metric definitions
In this section, we provide a brief review of the terminology of the graph measures obtained from the grain connectivity graph generated at each time step of a discrete element simulation. These graph measures are used to create the knowledge graph for the machine learning constitutive law in Section 6.2. The following graph metrics were calculated using the open-source software NetworkX [Hagberg et al., 2008] for the exploration and analysis of graph networks.
Definition 1 The degree assortativity coefficient measures the similarity of the connections in a graph with respect to the node degree.
Definition 2 The graph transitivity is the fraction of all possible triangles that are present in the graph. Possible triangles are identified by the number of triads (two edges with a shared vertex).
Definition 3 The density for undirected graphs is defined as:
$$d = \frac{2m}{n(n-1)}, \quad (23)$$
where $n$ is the number of nodes and $m$ is the number of edges of the graph.
(a) Calibration No. 23 (TXE) (b) Calibration No. 29 (TXE)
(c) Blind Test No. 50 (TXC) (d) Blind Test No. 56 (TXC)
Fig. 21: Difference between the major and minor principal stress vs. axial strain. Results are obtained for active dropout layers. Shaded area includes predictions within 95% confidence interval. Compressive strain has positive sign convention.
Definition 4 The average clustering coefficient of the graph $G$ is defined as:
$$C = \frac{1}{n} \sum_{v \in G} c_v, \quad (24)$$
where $n$ is the number of nodes and $c_v$ is the clustering coefficient of node $v$, defined as:
$$c_v = \frac{2T(v)}{\deg(v)(\deg(v) - 1)}, \quad (25)$$
where $T(v)$ is the number of triangles passing through node $v$ and $\deg(v)$ is the degree of node $v$.
Definition 5 A clique is a subset of nodes of an undirected graph such that every two distinct nodes in the clique are adjacent. The graph clique number is the size of the largest clique in the graph.
Definition 6 The efficiency of a pair of nodes is defined as the reciprocal of the shortest path distance between the nodes. The local efficiency of a node in the graph is the average global efficiency of the subgraph induced by the neighbours of the node. The average local efficiency, used in this work, is the average of the local efficiency calculated for every node in the graph.
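To make the density, clustering, and transitivity definitions concrete, the sketch below evaluates them from an adjacency matrix with plain NumPy, so that each formula is visible term by term. This is an illustrative alternative to the NetworkX routines mentioned above, not the authors' code; it assumes an undirected simple graph, assigns a clustering coefficient of zero to nodes of degree below two, and uses a made-up four-node graph.

```python
import numpy as np

def graph_metrics(adj):
    """Density, average clustering, and transitivity of an undirected graph.

    adj: symmetric 0/1 adjacency matrix with a zero diagonal.
    """
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    m = A.sum() / 2.0                    # each edge is counted twice
    deg = A.sum(axis=1)
    # diag(A^3) counts closed 3-walks, i.e. twice the triangles at a node.
    tri = np.diag(A @ A @ A) / 2.0
    density = 2.0 * m / (n * (n - 1))    # density formula, Eq. (23)
    pairs = deg * (deg - 1.0)
    # Node clustering coefficient; zero where the degree is below two.
    c = np.divide(2.0 * tri, pairs, out=np.zeros(n), where=pairs > 0)
    avg_clustering = c.mean()            # average over all nodes
    # 3 * (number of triangles) / (number of triads), simplified.
    transitivity = 2.0 * tri.sum() / pairs.sum() if pairs.sum() > 0 else 0.0
    return density, avg_clustering, transitivity

# Hypothetical graph: triangle 0-1-2 plus a pendant node 3 attached to 2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
density, avg_clustering, transitivity = graph_metrics(adj)
print(density, avg_clustering, transitivity)
```

For this example the graph holds one triangle and five triads, so the transitivity is 3/5, while the average clustering coefficient is (1 + 1 + 1/3 + 0)/4.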
(a) Calibration No. 23 (TXE) (b) Calibration No. 29 (TXE)
(c) Blind Test No. 50 (TXC) (d) Blind Test No. 56 (TXC)
Fig. 22: Difference between the major and intermediate principal stress vs. axial strain. Results are obtained for active dropout layers. Shaded area includes predictions within 95% confidence interval. Compressive strain has positive sign convention.
C Appendix: Loading conditions in bulk plasticity experiments
The database used for causal discovery and training of the neural network forecast engines for the numerical example in Section 6.2 includes 60 true triaxial numerical experiments conducted via the YADE DEM simulator. These experiments differ according to the applied axial strain rate $\dot{e}_{11}$, initial confining pressure $p_0$, initial void ratio $e_0$, and a parameter $b = \frac{\sigma_{22} - \sigma_{33}}{\sigma_{11} - \sigma_{33}}$ that controls the applied stress conditions. In all 60 cases we set $\dot{\sigma}_{33} = \dot{\sigma}_{12} = \dot{\sigma}_{23} = \dot{\sigma}_{13} = 0$. The setups are listed below. The tests in bold font are the ones discussed in Section 6.2. The first 30 tests (labelled T0-T29) are used to train the neural network, while T30-T59 are used for forward predictions.
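The calibration entries T0 to T29 in the list follow a regular pattern: five values of $b$, two signs of the axial strain rate, and three pairs of initial confining pressure and void ratio. The sketch below regenerates that pattern and evaluates $b$ from the stress components, assuming $b = (\sigma_{22} - \sigma_{33}) / (\sigma_{11} - \sigma_{33})$. The dictionary keys and function names are hypothetical, and the blind-test entries are deliberately left out.

```python
from itertools import product

B_VALUES = [0.0, 0.5, 0.1, 0.25, 0.75]               # order used in the list
RATE_SIGNS = ["<0", ">0"]                            # sign of the axial strain rate
STATES = [(300, 0.539), (400, 0.536), (500, 0.534)]  # (p0 in kPa, e0)

def calibration_tests():
    """Enumerate T0-T29; the rightmost factor of product varies fastest."""
    tests = {}
    combos = product(B_VALUES, RATE_SIGNS, STATES)
    for k, (b, sign, (p0, e0)) in enumerate(combos):
        tests[f"T{k}"] = {"rate_sign": sign, "b": b, "p0_kPa": p0, "e0": e0}
    return tests

def b_parameter(s11, s22, s33):
    """Stress ratio b = (s22 - s33) / (s11 - s33)."""
    return (s22 - s33) / (s11 - s33)

tests = calibration_tests()
print(tests["T23"])
print(b_parameter(500.0, 400.0, 300.0))
```

With this convention, $b = 0$ corresponds to $\sigma_{22} = \sigma_{33}$ and $b = 1$ to $\sigma_{22} = \sigma_{11}$.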
T0 $\dot{e}_{11} < 0$, $b = 0$, $p_0 = 300$ kPa, $e_0 = 0.539$.
T1 $\dot{e}_{11} < 0$, $b = 0$, $p_0 = 400$ kPa, $e_0 = 0.536$.
T2 $\dot{e}_{11} < 0$, $b = 0$, $p_0 = 500$ kPa, $e_0 = 0.534$.
T3 $\dot{e}_{11} > 0$, $b = 0$, $p_0 = 300$ kPa, $e_0 = 0.539$.
T4 $\dot{e}_{11} > 0$, $b = 0$, $p_0 = 400$ kPa, $e_0 = 0.536$.
T5 $\dot{e}_{11} > 0$, $b = 0$, $p_0 = 500$ kPa, $e_0 = 0.534$.
T6 $\dot{e}_{11} < 0$, $b = 0.5$, $p_0 = 300$ kPa, $e_0 = 0.539$.
T7 $\dot{e}_{11} < 0$, $b = 0.5$, $p_0 = 400$ kPa, $e_0 = 0.536$.
(a) Calibration No. 23 (TXE) (b) Calibration No. 29 (TXE)
(c) Blind Test No. 50 (TXC) (d) Blind Test No. 56 (TXC)
Fig. 23: Difference between the intermediate and minor principal stress vs. axial strain. Results are obtained for active dropout layers. Shaded area includes predictions within 95% confidence interval. Compressive strain has positive sign convention.
T8 ˙
e11 <0, b=0.5, p0=500kPa,e0=0.534.732
T9 ˙
e11 >0, b=0.5, p0=300kPa,e0=0.539.733
T10 ˙
e11 >0, b=0.5, p0=400kPa,e0=0.536.734
T11 ˙
e11 >0, b=0.5, p0=500kPa,e0=0.534.735
T12 ˙
e11 <0, b=0.1, p0=300kPa,e0=0.539.736
T13 ˙
e11 <0, b=0.1, p0=400kPa,e0=0.536.737
T14 ˙
e11 <0, b=0.1, p0=500kPa,e0=0.534.738
T15 ˙
e11 >0, b=0.1, p0=300kPa,e0=0.539.739
T16 ˙
e11 >0, b=0.1, p0=400kPa,e0=0.536.740
T17 ˙
e11 >0, b=0.1, p0=500kPa,e0=0.534.741
T18 ˙
e11 <0, b=0.25, p0=300kPa,e0=0.539.742
T19 ˙
e11 <0, b=0.25, p0=400kPa,e0=0.536.743
T20 ˙
e11 <0, b=0.25, p0=500kPa,e0=0.534.744
T21 ˙
e11 >0, b=0.25, p0=300kPa,e0=0.539.745
T22 ˙
e11 >0, b=0.25, p0=400kPa,e0=0.536.746
T23 ˙
e11 >0, b=0.25, p0=500kPa,e0=0.534.747
T24 ˙
e11 <0, b=0.75, p0=300kPa,e0=0.539.748
T25 ˙
e11 <0, b=0.75, p0=400kPa,e0=0.536.749
32 Xiao Sun et al.
(a) Calibration No. 23 (TXE) (b) Calibration No. 29 (TXE)
(c) Blind Test No. 50 (TXC) (d) Blind Test No. 56 (TXC)
Fig. 24: Component 11 of fabric tensor vs. axial strain. Results are obtained for active dropout layers.
Shaded area includes predictions within 95% confidence interval. Compressive strain has positive sign
convention.
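T26 $\dot{e}_{11} < 0$, $b = 0.75$, $p_0 = 500$ kPa, $e_0 = 0.534$.
T27 $\dot{e}_{11} > 0$, $b = 0.75$, $p_0 = 300$ kPa, $e_0 = 0.539$.
T28 $\dot{e}_{11} > 0$, $b = 0.75$, $p_0 = 400$ kPa, $e_0 = 0.536$.
T29 $\dot{e}_{11} > 0$, $b = 0.75$, $p_0 = 500$ kPa, $e_0 = 0.534$.
T30 $\dot{e}_{11} < 0$, $b = 0$, $p_0 = 350$ kPa, $e_0 = 0.539$.
T31 $\dot{e}_{11} < 0$, $b = 0$, $p_0 = 450$ kPa, $e_0 = 0.536$.
T32 $\dot{e}_{11} < 0$, $b = 0$, $p_0 = 550$ kPa, $e_0 = 0.534$.
T33 $\dot{e}_{11} > 0$, $b = 0$, $p_0 = 350$ kPa, $e_0 = 0.539$.
T34 $\dot{e}_{11} > 0$, $b = 0$, $p_0 = 450$ kPa, $e_0 = 0.536$.
T35 $\dot{e}_{11} > 0$, $b = 0$, $p_0 = 550$ kPa, $e_0 = 0.534$.
T36 $\dot{e}_{11} < 0$, $b = 0.5$, $p_0 = 350$ kPa, $e_0 = 0.539$.
T37 $\dot{e}_{11} < 0$, $b = 0.5$, $p_0 = 450$ kPa, $e_0 = 0.536$.
T38 $\dot{e}_{11} < 0$, $b = 0.5$, $p_0 = 550$ kPa, $e_0 = 0.534$.
T39 $\dot{e}_{11} > 0$, $b = 0.5$, $p_0 = 350$ kPa, $e_0 = 0.539$.
T40 $\dot{e}_{11} > 0$, $b = 0.5$, $p_0 = 450$ kPa, $e_0 = 0.536$.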
T41 ˙
e11 >0, b=0.5, p0=<