Article

Graphical models for recovering probabilistic and causal queries from missing data

Authors: Karthika Mohan, Judea Pearl

Abstract

We address the problem of deciding whether a causal or probabilistic query is estimable from data corrupted by missing entries, given a model of the missingness process. We extend the results of Mohan et al. [2013] by presenting more general conditions for recovering probabilistic queries of the form P(y|x) and P(y,x) as well as causal queries of the form P(y|do(x)). We show that causal queries may be recoverable even when the factors in their identifying estimands are not recoverable. Specifically, we derive graphical conditions for recovering causal effects of the form P(y|do(x)) when Y and its missingness mechanism are not d-separable. Finally, we apply our results to problems of attrition and characterize the recovery of causal effects from data corrupted by attrition.
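The d-separation test underlying these recoverability conditions can be sketched directly. Below is a minimal Python illustration, assuming a toy m-graph encoded as a child-to-parents dictionary; the graph, variable names, and helper functions are illustrative assumptions, not the paper's notation or software:

```python
def ancestors(parents, nodes):
    """All ancestors of `nodes` (inclusive) in a DAG given as child -> parents."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, ()))
    return seen

def d_separated(parents, xs, ys, zs):
    """True iff xs is d-separated from ys given zs (moralized ancestral graph test)."""
    keep = ancestors(parents, set(xs) | set(ys) | set(zs))
    adj = {n: set() for n in keep}
    for child, pars in parents.items():
        if child not in keep:
            continue
        pars = [p for p in pars if p in keep]
        for p in pars:                        # undirected parent-child edges
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(pars):          # moralize: marry co-parents
            for q in pars[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    blocked = set(zs)                         # delete conditioned nodes, then
    frontier = [x for x in xs if x not in blocked]
    reached = set(frontier)                   # test plain reachability
    while frontier:
        n = frontier.pop()
        for m in adj[n] - blocked:
            if m not in reached:
                reached.add(m)
                frontier.append(m)
    return not (reached & set(ys))

# MAR-like m-graph: X -> Y and X -> R_y, so Y is separated from R_y given X and
# P(y | x) = P(y | x, R_y = 0) is recoverable from complete cases.
parents = {"Y": ["X"], "R_y": ["X"]}
print(d_separated(parents, {"Y"}, {"R_y"}, {"X"}))       # True

# MNAR-like m-graph: R_y depends on Y itself; the separation fails.
parents_mnar = {"Y": ["X"], "R_y": ["Y"]}
print(d_separated(parents_mnar, {"Y"}, {"R_y"}, {"X"}))  # False
```

In the first graph the missingness of Y depends only on the fully observed X, which licenses estimating P(y|x) from complete cases; in the second, the abstract's point applies: the query may still be recoverable by other means even though Y and R_y are not d-separable.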


... The problem of identification of the target distribution from the observed distribution in missing data DAG models bears many similarities to the problem of identification of interventional distributions from the observed distribution in causal DAG models with hidden variables. This observation prompted recent work [3,4,13] on adapting identification methods from causal inference to identifying target distributions in missing data models. ...
... In this paper we show that the most general currently known methods for identification in missing data DAG models retain a significant gap, in the sense that they fail to identify the target distribution in many models where it is identified. We show that methods used to obtain a complete characterization of identification of interventional distributions, via the ID algorithm [14,16], or their simple generalizations [3,4,13], are insufficient on their own for obtaining a similar characterization for missing data problems. We describe, via a set of examples, that in order to be complete, an identification algorithm for missing data must recursively simplify the problem by removing sets of variables, rather than single variables, and these must be removed according to a partial order, rather than a total order. ...
... Following a total order to fix is not always sufficient to identify the target law, as noted in [4,3,13]. Consider the model represented by DAG in Fig. 2(d). ...
Preprint
Missing data is a pervasive problem in data analyses, resulting in datasets that contain censored realizations of a target distribution. Many approaches to inference on the target distribution using censored observed data rely on missing data models represented as a factorization with respect to a directed acyclic graph. In this paper we consider the identifiability of the target distribution within this class of models, and show that the most general identification strategies proposed so far retain a significant gap in that they fail to identify a wide class of identifiable distributions. To address this gap, we propose a new algorithm that significantly generalizes the types of manipulations used in the ID algorithm, developed in the context of causal inference, in order to obtain identification.
... There is a vast literature on dealing with missing data in diverse fields. We refer to [1,2] for a review of related work. Most work in machine learning assumes data are missing at random (MAR) [3,4], under which likelihood-based inference (as well as Bayesian inference) can be carried out while ignoring the mechanism that leads to missing data. ...
... Several sufficient graphical conditions have been derived under which probability queries of the form P(x, y) or P(y|x) are estimable [1]. Mohan and Pearl [2] extended those results and further developed conditions for recovering causal queries of the form P(y|do(x)). Shpitser et al. [11] formulated the problem as a causal inference problem and developed a systematic algorithm for estimating the joint distribution when the model contains no unobserved latent variables. ...
... In this paper we develop an algorithm for systematically determining the recoverability of the joint distribution from missing data in m-graphs that could contain latent variables. The result is significantly more general than the sufficient conditions in [1,2]. Compared to the result in [11] we allow latent variables in the model, and treat the problem in a purely probabilistic framework without appealing to causality theory. ...
Article
A probabilistic query may not be estimable from observed data corrupted by missing values if the data are not missing at random (MAR). It is therefore of theoretical interest and practical importance to determine in principle whether a probabilistic query is estimable from missing data or not when the data are not MAR. We present an algorithm that systematically determines whether the joint probability is estimable from observed data with missing values, assuming that the data-generation model is represented as a Bayesian network containing unobserved latent variables that not only encodes the dependencies among the variables but also explicitly portrays the mechanisms responsible for the missingness process. The result significantly advances the existing work.
... We now study causal identifiability on the M-CEG when there is floret-dependent missingness. This is analogous to the causal identifiability on the BN with missingness indicators [35], which is called an m-graph [25,26,27]. ...
... Mohan and Pearl [25,26] defined recoverability of a joint probability p(X) under different missingness mechanisms on an m-graph. Here a joint probability distribution is recoverable whenever it can be consistently estimated despite the missingness mechanism. ...
... The recoverability of the joint distribution is complicated to analyse when data are MNAR because the missing indicators are not independent of R_O and R_M. However, even when the joint distribution cannot be consistently estimated, the probability p(y|do(x)) may still be estimable from the dataset with missing data [26]. The recoverability of this probability is a sufficient condition for the identifiability of the causal query in the subgraph comprising vertices corresponding to R_O and R_M [27]. ...
Preprint
Various graphical models are widely used in reliability to provide a qualitative description of domain experts hypotheses about how a system might fail. Here we argue that the semantics developed within standard causal Bayesian networks are not rich enough to fully capture the intervention calculus needed for this domain and a more tree-based approach is necessary. We instead construct a Bayesian hierarchical model with a chain event graph at its lowest level so that typical interventions made in reliability models are supported by a bespoke causal calculus. We then demonstrate how we can use this framework to automate the process of causal discovery from maintenance logs, extracting natural language information describing hypothesised causes of failures. Through our customised causal algebra we are then able to make predictive inferences about the effects of a variety of types of remedial interventions. The proposed methodology is illustrated throughout with examples drawn from real maintenance logs.
... In the following subsection we define the notion of Ordered factorization which leads to a criterion for sequentially recovering conditional probability distributions (Mohan et al. (2013); Mohan and Pearl (2014a)). ...
... The following theorem (Mohan et al. (2013); Mohan and Pearl (2014a)) formalizes the recoverability scheme exemplified above. ...
... Thus far, we dealt with recovering statistical properties and parameters. Similar results for recovering causal effects are available in Mohan and Pearl (2014a) and Shpitser (2016). ...
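The ordered-factorization idea referenced above — recover a joint quantity factor by factor, estimating each factor from cases where the required variables are observed — can be illustrated numerically. The simulation below is a hedged sketch under an assumed MAR-style mechanism, not the theorem's general criterion:

```python
# Sketch of sequential (ordered-factorization) recovery of P(Y = 1) when the
# missingness of Y depends only on the fully observed X. The simulation setup
# and variable names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.binomial(1, 0.5, n)                       # fully observed cause
y = rng.binomial(1, np.where(x == 1, 0.8, 0.2))   # outcome; true P(Y=1) = 0.5
r_y = rng.binomial(1, np.where(x == 1, 0.9, 0.3)) # 1 = Y observed, driven by X
y_obs = np.where(r_y == 1, y, -1)                 # -1 marks a missing entry

# Naive complete-case estimate is biased: observed rows over-represent X = 1.
naive = y_obs[r_y == 1].mean()

# Ordered factorization: P(y) = sum_x P(y | x, R_y = 1) * P(x), where each
# conditional factor is estimable from the rows in which Y is observed.
recovered = sum(
    y_obs[(x == v) & (r_y == 1)].mean() * (x == v).mean() for v in (0, 1)
)

print(round(naive, 3), round(recovered, 3))  # naive overshoots; recovered ≈ 0.5
```

The conditional factor P(y|x) is recoverable here because, given X, observing Y carries no extra information about its value; chaining such factors in the right order is what the cited theorem formalizes.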
Article
This paper reviews recent advances in missing data research using graphical models to represent multivariate dependencies. We first examine the limitations of traditional frameworks from three different perspectives: \textit{transparency, estimability and testability}. We then show how procedures based on graphical models can overcome these limitations and provide meaningful performance guarantees even when data are Missing Not At Random (MNAR). In particular, we identify conditions that guarantee consistent estimation in broad categories of missing data problems, and derive procedures for implementing this estimation. Finally we derive testable implications for missing data models in both MAR (Missing At Random) and MNAR categories.
... It is known that when the data are MAR, the underlying distribution is estimable from observed data with missing values; a causal effect is then estimable if it is identifiable from the observed distribution [10]. However, if the data are MNAR, whether a probability distribution or a causal effect is estimable from missing data depends closely on both the query and the exact missing data mechanisms. ...
... M-graphs provide a general framework for inference with arbitrary types of missing data mechanisms including MNAR. Sufficient conditions for determining whether probabilistic queries (e.g., P(y | x) or P(x, y)) are estimable from missing data are provided in [11,10]. General algorithms for identifying the joint distribution have been developed in [19,23]. ...
... The problem of identifying causal effects P(y | do(x)) from missing data in the causal graphical model settings has not been well studied. To the best of our knowledge the only results are the sufficient conditions given in [10]. The goal of this paper is to provide general conditions under which causal effects can be identified from missing data using the covariate adjustment formula, which is the most pervasive method in practice for causal effect estimation under confounding bias. ...
Preprint
Full-text available
Confounding bias, missing data, and selection bias are three common obstacles to valid causal inference in the data sciences. Covariate adjustment is the most pervasive technique for recovering causal effects from confounding bias. In this paper, we introduce a covariate adjustment formulation for controlling confounding bias in the presence of missing-not-at-random data and develop a necessary and sufficient condition for recovering causal effects using the adjustment. We also introduce an adjustment formulation for controlling both confounding and selection biases in the presence of missing data and develop a necessary and sufficient condition for valid adjustment. Furthermore, we present an algorithm that lists all valid adjustment sets and an algorithm that finds a valid adjustment set containing the minimum number of variables, which are useful for researchers interested in selecting adjustment sets with desired properties.
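As a numerical illustration of covariate adjustment with missing outcome data, the sketch below assumes a simple data-generating process in which the outcome's missingness depends only on treatment; the setup and names are assumptions for illustration, not the paper's necessary-and-sufficient condition:

```python
# Covariate adjustment applied to complete cases. Because R_y depends only on
# the treatment X, the factor P(y | x, z) is estimable from observed rows,
# while P(z) uses all rows. Setup is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)
n = 300_000
z = rng.binomial(1, 0.5, n)                        # observed confounder
x = rng.binomial(1, np.where(z == 1, 0.7, 0.3))    # treatment
y = rng.binomial(1, 0.2 + 0.3 * x + 0.3 * z)       # outcome
r_y = rng.binomial(1, np.where(x == 1, 0.9, 0.5))  # 1 = Y observed, driven by X

# Adjustment formula: P(y=1 | do(x=1)) = sum_z P(y=1 | x=1, z, R_y=1) * P(z)
effect = sum(
    y[(x == 1) & (z == v) & (r_y == 1)].mean() * (z == v).mean() for v in (0, 1)
)
print(round(effect, 3))  # true value is 0.5*0.5 + 0.5*0.8 = 0.65
```

The key step is that Y is independent of R_y given X and Z in this assumed graph, so restricting the conditional factor to complete cases introduces no bias.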
... Another recently developed branch of the literature considers a third group of definitions relying on the existence of always-observed auxiliary information (Mohan et al., 2013). These definitions are termed variable-based or graph-based as they enable the use of graphical tools like missingness-graphs, that are directed acyclic graphs including missingness indicators in their set of nodes (Mohan & Pearl, 2014a; Mohan et al., 2013). In this paper, we denote these definitions by adding the prefix VB-, which stands for "variable-based". ...
... Mohan et al. (2013) use missingness-graphs to explore recoverability of probabilistic queries. Mohan & Pearl (2014a) and Daniel et al. (2012) focus also on recoverability of causal relations. Potthoff et al. (2006) show that whether data are missing according to their variable-based definition of missing at random, which they term MAR+, can be tested when at least two variables have missing values. ...
Article
Full-text available
Recent work (Seaman et al., 2013; Mealli & Rubin, 2015) attempts to clarify the not always well-understood difference between realised and everywhere definitions of missing at random (MAR) and missing completely at random. Another branch of the literature (Mohan et al., 2013; Pearl & Mohan, 2013) exploits always-observed covariates to give variable-based definitions of MAR and missing completely at random. In this paper, we develop a unified taxonomy encompassing all approaches. In this taxonomy, the new concept of ‘complementary MAR’ is introduced, and its relationship with the concept of data observed at random is discussed. All relationships among these definitions are analysed and represented graphically. Conditional independence, both at the random variable and at the event level, is the formal language we adopt to connect all these definitions. Our paper covers both the univariate and the multivariate case, where attention is paid to monotone missingness and to the concept of sequential MAR. Specifically, for monotone missingness, we propose a sequential MAR definition that might be more appropriate than both everywhere and variable-based MAR to model dropout in certain contexts.
... A syntax and semantics of neuron diagrams were formalized to identify the causal effects in [11]. A graphical representation of the missing data mechanism was presented in [12]. A causal model was then proposed to address the problem of estimating causal relationships from data with missing entries [13]. ...
... One has to solve the resulting set of equations, in which the kernel function (usually a Gaussian kernel) of the j-th factor appears and I is an identity matrix of order n. Then, we have the regression model (12). By repeating the process above, the future gas tank levels can be computed to verify the effectiveness of the scheduling solution. ...
Article
Rational use of blast furnace gas (BFG) in the steel industry can raise economic profit, save fossil energy resources and alleviate environmental pollution. In this paper, a causality diagram is established to describe the causal relationships among the decision objective and the variables of the scheduling process for the industrial system, based on which the total scheduling amount of the BFG system can be computed by using a causal fuzzy C-means clustering (CFCM) algorithm. In this algorithm, not only the distances among the historical samples but also the effects of different solutions on the gas tank level are considered. The scheduling solution can be determined based on the proposed causal probability of the causality diagram, calculated from the total amount and the conditions of the adjustable units. The causal probability quantifies the impact of different allocation schemes of the total scheduling amount on the BFG system. An evaluation method is then proposed to evaluate the effectiveness of the scheduling solutions. Experiments using practical data from a steel plant in China indicate that the proposed approach can effectively improve the scheduling accuracy and reduce the gas diffusion.
... Using an m-graph, one can determine whether or not an effect is identifiable [1]. While this has received some attention in graphical causality [41,42,28], which aims to recover identifiability in MAR and MNAR settings or perform structure learning despite missingness [43,44], m-graphs remain relatively underexplored in a potential outcomes setting, as we have. Note that none of the aforementioned works consider partial imputation, specifically to correctly identify a causal effect. ...
Preprint
Full-text available
Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population in distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.
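A hedged sketch of the selective-imputation idea: impute only the columns whose missingness is unrelated to treatment, and preserve (plus flag) treatment-determined gaps. The column names, the mean-imputation rule, and the split of columns are illustrative assumptions; the paper's MCM analysis is what decides which variables belong in each group:

```python
# Selective imputation sketch. "age" has missingness unrelated to treatment,
# so it is imputed; "biomarker" is only measured under treatment, so imputing
# it would destroy information the estimator needs. All names are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age":       rng.normal(50, 10, 8),
    "biomarker": rng.normal(1.0, 0.2, 8),
    "treated":   rng.binomial(1, 0.5, 8),
})
df.loc[[1, 4], "age"] = np.nan                  # missingness unrelated to treatment
df.loc[df.treated == 0, "biomarker"] = np.nan   # missingness determined by treatment

impute_cols = ["age"]                           # only treatment-unrelated columns
imputed = df.copy()
imputed[impute_cols] = imputed[impute_cols].fillna(df[impute_cols].mean())
imputed["biomarker_missing"] = df["biomarker"].isna().astype(int)  # keep the flag

print(imputed["age"].isna().sum())        # 0: fully imputed
print(imputed["biomarker"].isna().sum())  # treatment-determined gaps remain
```

The untouched gaps effectively split the population into the subpopulations the abstract mentions, which downstream learners can then model separately.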
... For example, P(X, Y, Z, R_z) is recoverable in Figure 4(a) since the graph is in (M) (it is also in MAR) and this distribution advertises the conditional independence Z ⊥⊥ R_z | X, Y. Yet, Z ⊥⊥ R_z | X, Y is not testable by any data in which the probability of observing Z is non-zero (for all x, y) [33,37]. Any such data can be construed as if generated by the model in Figure 4(a), where the independence holds. ...
... Thus far we dealt with the recoverability of joint and conditional probabilities. Extensions to causal relationships are discussed in Mohan and Pearl [37]. ...
Article
This paper reviews concepts, principles, and tools that have led to a coherent mathematical theory that unifies the graphical, structural, and potential outcome approaches to causal inference. The theory provides solutions to a number of pending problems in causal analysis, including questions of confounding control, policy analysis, mediation, missing data, and the integration of data from diverse studies.
... The problem of missing data in causal inference is being studied in the literature quite extensively. [14] derive graphical conditions for recovering joint and conditional distributions and sufficient conditions for recovering causal queries. [22] consider different missingness mechanisms and present graphical representations of those. ...
... where the conditioning on variable X_1 is due to the graph structure in Fig. 3. Given the conditional probability definitions in Eqs. (8)-(13) and the parent set definition in (14), we propose to obtain the predicted values of missing entries using nonparametric regressions. For N observed data samples X_{i,j} with i = 1, . . . ...
Chapter
Real-world datasets often contain many missing values due to several reasons. This is usually an issue since many learning algorithms require complete datasets. In certain cases, there are constraints in the real world problem that create difficulties in continuously observing all data. In this paper, we investigate if graphical causal models can be used to impute missing values and derive additional information on the uncertainty of the imputed values. Our goal is to use the information from a complete dataset in the form of graphical causal models to impute missing values in an incomplete dataset. This assumes that the datasets have the same data generating process. Furthermore, we calculate the probability of each missing data value belonging to a specified percentile. We present a preliminary study on the proposed method using synthetic data, where we can control the causal relations and missing values.
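A minimal sketch of the proposed idea: impute a variable from its causal parents, with the regression learned on a complete dataset, and attach a percentile probability to each imputed value. The linear model, Gaussian residual assumption, and variable names are illustrative assumptions, not the chapter's method:

```python
# Impute Y from its assumed causal parent Z, fitting on complete data and
# quantifying uncertainty via the residual distribution (Gaussian assumption).
from math import erf, sqrt
import numpy as np

rng = np.random.default_rng(3)
# Complete dataset from the assumed structure Z -> Y.
z_full = rng.normal(0, 1, 5000)
y_full = 2.0 * z_full + rng.normal(0, 0.5, 5000)

# Fit Y ~ parents(Y) by least squares on the complete data.
A = np.column_stack([np.ones_like(z_full), z_full])
coef, *_ = np.linalg.lstsq(A, y_full, rcond=None)
resid_sd = np.std(y_full - A @ coef)

# Impute missing Y values in new data from the same data-generating process.
z_new = np.array([0.0, 1.0, -2.0])
y_hat = coef[0] + coef[1] * z_new

def p_below(cut, mu, sd):
    """P(imputed value < cut) under the Gaussian residual model."""
    return 0.5 * (1 + erf((cut - mu) / (sd * sqrt(2))))

print(np.round(y_hat, 2))                      # predictions from the parent
print(round(p_below(0.0, y_hat[0], resid_sd), 2))  # ≈ 0.5 for a zero parent
```

The percentile probability is what the chapter uses to report uncertainty on each imputed entry; here it reduces to a normal CDF because of the assumed linear-Gaussian model.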
... In this experiment, we evaluate the performance of OGM in statistical inference tasks. Specifically, we hope to use OGM to discover significant correlations [26,44-46] between variables from the online data. ...
Article
Full-text available
Gaussian Graphical Model is widely used to understand the dependencies between variables from high-dimensional data and can enable a wide range of applications such as principal component analysis, discriminant analysis, and canonical analysis. With respect to the streaming nature of big data, we study a novel Online Gaussian Graphical Model (OGM) that can estimate the inverse covariance matrix over the high-dimensional streaming data, in this paper. Specifically, given a small number of samples to initialize the learning process, OGM first estimates a low-rank estimation of inverse covariance matrix; then, when each individual new sample arrives, it updates the estimation of inverse covariance matrix using a low-complexity updating rule, without using the past data and matrix inverse. The significant edges of Gaussian graphical models can be discovered through thresholding the inverse covariance matrices. Theoretical analysis shows the convergence rate of OGM to the true parameters is guaranteed under Bernstein-style with mild conditions. We evaluate OGM using extensive experiments. The evaluation results backup our theory.
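OGM's exact low-rank updating rule is not reproduced here, but the core idea — updating a precision (inverse covariance) estimate per arriving sample without matrix inversion — can be sketched with the Sherman-Morrison identity. The zero-mean setup and initialization batch are assumptions of this sketch:

```python
# Inverse-free streaming update of a precision matrix. When the sample
# covariance changes as S <- ((n-1)*S + x x^T)/n, its inverse P follows from
# the Sherman-Morrison identity without re-inverting any matrix.
import numpy as np

def sm_update(P, x, n):
    """Update P = inv(S) for S <- ((n-1)*S + x x^T)/n (rank-1, no inversion)."""
    c = (n - 1) / n
    b = 1.0 / n
    Pu = P @ x
    return P / c - (b / c**2) * np.outer(Pu, Pu) / (1 + (b / c) * (x @ Pu))

rng = np.random.default_rng(4)
d = 4
true_prec = np.diag([1.0, 2.0, 3.0, 4.0])
X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(true_prec), size=5000)

# Initialize from a small batch, then stream the remaining samples one by one.
n0 = 50
P = np.linalg.inv(X[:n0].T @ X[:n0] / n0)   # the only inversion performed
for i in range(n0, len(X)):
    P = sm_update(P, X[i], i + 1)

print(np.round(np.diag(P), 1))  # approaches diag(1, 2, 3, 4)
```

Thresholding the entries of P would then reveal the significant edges of the Gaussian graphical model, as the abstract describes; OGM additionally maintains a low-rank representation, which this sketch omits.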
... """ self.l.value = losses self.q.value = weights self.prob.solve() return self.X.value process, the core effort in extending our method simply involves using off-the-shelf estimators to characterize the probability distributions (see [24] for example. ...
Preprint
Full-text available
Learning invariant representations is an important requirement when training machine learning models that are driven by spurious correlations in the datasets. These spurious correlations, between input samples and the target labels, wrongly direct the neural network predictions resulting in poor performance on certain groups, especially the minority groups. Robust training against these spurious correlations requires the knowledge of group membership for every sample. Such a requirement is impractical in situations where the data labeling efforts for minority or rare groups are significantly laborious or where the individuals comprising the dataset choose to conceal sensitive information. On the other hand, the presence of such data collection efforts results in datasets that contain partially labeled group information. Recent works have tackled the fully unsupervised scenario where no labels for groups are available. Thus, we aim to fill the missing gap in the literature by tackling a more realistic setting that can leverage partially available sensitive or group information during training. First, we construct a constraint set and derive a high probability bound for the group assignment to belong to the set. Second, we propose an algorithm that optimizes for the worst-off group assignments from the constraint set. Through experiments on image and tabular datasets, we show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
... Here are some potential things one could do to mitigate risks by approaching human data science as a science. First, one can work with existing tools to think about some of these issues. For example, latent variable models are a convenient framework to estimate and correct for data issues, 5 causal modeling is a useful tool in dealing with selection error, 7 and both causal modeling and latent variables are useful tools when investigating model output's fairness (see Boeschoten et al. 8 and references therein). Second, human data scientists need to routinely take a much wider view of ''model evaluation''; a successful data science project must not only predict well in out-of-sample data, but also generalize to other contexts in which it might be applied, generalize to the concept it was intended to study, reflect uncertainty accurately, improve on what was already there, have benefits that outweigh its costs, and actually achieve its stated goals in social reality. ...
Article
Full-text available
Most data science is about people, and opinions on the value of human data differ. The author offers a synthesis of overly optimistic and overly pessimistic views of human data science: it should become a science, with errors systematically studied and their effects mitigated—a goal that can only be achieved by bringing together expertise from a range of disciplines.
... To model data missingness in a principled manner, we will use the causal graph framework proposed by (Mohan and Pearl 2020; Mohan, Pearl, and Tian 2013; Mohan and Pearl 2014; Shpitser, Mohan, and Pearl 2015). The main idea in this framework is to model the missingness mechanism as a separate binary variable. ...
Preprint
Full-text available
Training datasets for machine learning often have some form of missingness. For example, to learn a model for deciding whom to give a loan, the available training data includes individuals who were given a loan in the past, but not those who were not. This missingness, if ignored, nullifies any fairness guarantee of the training procedure when the model is deployed. Using causal graphs, we characterize the missingness mechanisms in different real-world scenarios. We show conditions under which various distributions, used in popular fairness algorithms, can or can not be recovered from the training data. Our theoretical results imply that many of these algorithms can not guarantee fairness in practice. Modeling missingness also helps to identify correct design principles for fair algorithms. For example, in multi-stage settings where decisions are made in multiple screening rounds, we use our framework to derive the minimal distributions required to design a fair algorithm. Our proposed algorithm decentralizes the decision-making process and still achieves similar performance to the optimal algorithm that requires centralization and non-recoverable distributions.
... (Proof fragment: the marginal and conditional distributions are derived using conditional independence properties and consistency; the conditional and marginal distributions in m-DAGs E and F reuse the proofs given for m-DAG A. Applying Corollary 1 of [9], the joint distributions in m-DAGs B and C can be expressed in terms of the observed data.) ...
... For instance, in the current example, missingness only occurs in the outcome variable Y , a fact that is not represented in the diagram of Figure 1a. Missingness graphs [5], on the other hand, allow us to formally encode this distinction, as shown in Figure 1b. ...
... Although graphical models such as structural equation models (SEMs) have long been used to mathematically describe biological networks for various quantitative analysis tasks like parameter estimation [9][10][11], only a few computing methods and tools for graphical models take feedback loops into consideration because of various challenges associated with such topological structures [12][13][14][15]. In this study, we make an attempt to more efficiently address the structural identifiability analysis problem for time-invariant biological networks with feedback loops; and for simplicity, here we only consider the SEM representation of biological networks although graphical models refer to a broad range of different model forms [11,[16][17][18]. ...
Article
Quantitative analyses of biological networks such as key biological parameter estimation necessarily call for the use of graphical models. While biological networks with feedback loops are common in reality, the development of graphical model methods and tools that are capable of dealing with feedback loops is still in its infancy. Particularly, inadequate attention has been paid to the parameter identifiability problem for biological networks with feedback loops such that unreliable or even misleading parameter estimates may be obtained. In this study, the structural identifiability analysis problem of time-invariant linear structural equation models (SEMs) with feedback loops is addressed, resulting in a general and efficient solution. The key idea is to combine Mason's gain with Wright's path coefficient method to generate identifiability equations, from which identifiability matrices are then derived to examine the structural identifiability of every single unknown parameter. The proposed method does not involve symbolic or expensive numerical computations, and is applicable to a broad range of time-invariant linear SEMs with or without explicit latent variables, presenting a remarkable breakthrough in terms of generality. Finally, a subnetwork structure of the C. elegans neural network is used to illustrate the application of the authors' method in practice.
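As a much-reduced illustration of identifiability equations derived from path rules (no feedback loops here, so Mason's gain is not needed), the sketch below writes and solves Wright-style equations for a two-edge chain SEM; the model and coefficient values are assumed for illustration, not taken from the paper:

```python
# Wright's path rules for the chain Z -> X -> Y with coefficients a and b:
#   cov(Z, X) = a * var(Z),   cov(Z, Y) = a * b * var(Z)
# so both parameters are identifiable from observable second moments.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
a, b = 1.5, -0.7                      # true path coefficients (assumed)
z = rng.normal(0, 1, n)
x = a * z + rng.normal(0, 1, n)       # unit-variance exogenous noise
y = b * x + rng.normal(0, 1, n)

# Solve the identifiability equations from sample moments.
a_hat = np.cov(z, x)[0, 1] / np.var(z)
b_hat = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(round(a_hat, 2), round(b_hat, 2))  # close to (1.5, -0.7)
```

Each parameter appears in a moment equation with a unique solution, which is exactly the kind of identifiability matrix check the paper automates; with feedback loops the moment equations require Mason's gain formula rather than simple path tracing.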
... There have been studies that utilize graphs to analyze missing data. Mohan et al. (2013); Mohan and Pearl (2014); Tian (2015); Mohan and Pearl (2018) proposed methods to test missing data assumptions under graphical model frameworks. ...
Preprint
Full-text available
We introduce the concept of pattern graphs: directed acyclic graphs representing how response patterns are associated. A pattern graph represents an identifying restriction that is nonparametrically identified/saturated and is often a missing not at random restriction. We introduce selection model and pattern mixture model formulations based on pattern graphs and show that they are equivalent. A pattern graph leads to an inverse probability weighting estimator as well as an imputation-based estimator. Asymptotic theories of the estimators are studied, and we provide a graph-based recursive procedure for computing both estimators. We propose three graph-based sensitivity analyses and study the equivalence class of pattern graphs.
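The inverse probability weighting idea mentioned above can be illustrated with a deliberately simple, hypothetical pattern, missingness in Y driven only by a fully observed X (this is a sketch of generic IPW under an assumed-known propensity, not the paper's pattern-graph estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical setup: X always observed; Y missing with a probability
# that depends on X only (so R is independent of Y given X).
X = rng.normal(size=n)
Y = 2.0 * X + rng.normal(size=n)
p_obs = 1.0 / (1.0 + np.exp(-X))     # assumed-known response propensity P(R=1|X)
R = rng.random(n) < p_obs            # R = True: Y is observed

naive = Y[R].mean()                  # complete-case mean: biased upward here
ipw = np.sum(Y[R] / p_obs[R]) / n    # IPW mean: consistent for E[Y] = 0

print(naive, ipw)
```

Units with high X (and hence high Y) are over-represented among complete cases, so the naive mean is pulled above the true E[Y] = 0, while weighting each observed Y by 1/P(R=1|X) removes the distortion.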
... It is, therefore, possible to use the BN as a framework for identifying when causal hypotheses are identifiable in this rather restricted setting. The associated analyses use various graphically stated criteria, such as the front-door and back-door criteria; see e.g. [11][12][13]. However, unfortunately, the types of missingness that routinely occur in reliability, and in particular those associated with the data we collect when performing routine maintenance, are rarely missing across the original random vector associated with the system in this sort of symmetric way. ...
Article
Full-text available
Graph-based causal inference has recently been successfully applied to explore system reliability and to predict failures in order to improve systems. One popular causal analysis following Pearl and Spirtes et al. to study causal relationships embedded in a system is to use a Bayesian network (BN). However, certain causal constructions that are particularly pertinent to the study of reliability are difficult to express fully through a BN. Our recent work demonstrated the flexibility of using a Chain Event Graph (CEG) instead to capture causal reasoning embedded within engineers’ reports. We demonstrated that an event tree rather than a BN could provide an alternative framework that could capture most of the causal concepts needed within this domain. In particular, a causal calculus for a specific type of intervention, called a remedial intervention, was devised on this tree-like graph. In this paper, we extend the use of this framework to show that not only remedial maintenance interventions but also interventions associated with routine maintenance can be well-defined using this alternative class of graphical model. We also show that the complexity in making inference about the potential relationships between causes and failures in a missing data situation in the domain of system reliability can be elegantly addressed using this new methodology. Causal modelling using a CEG is illustrated through examples drawn from the study of reliability of an energy distribution network.
... In practice, all branches of experimental science are plagued by data with missing values [8], [9], e.g., failure of sensors or drop-outs of subjects in a longitudinal study. In this paper, we aim to generalize the PC algorithm to settings where the data are still assumed to be drawn from a Gaussian copula model, but with some missing values. ...
Conference Paper
Full-text available
We consider the problem of causal structure learning from data with missing values, assumed to be drawn from a Gaussian copula model. First, we extend the ‘Rank PC’ algorithm, designed for Gaussian copula models with purely continuous data (so-called nonparanormal models), to incomplete data by applying rank correlation to pairwise complete observations and replacing the sample size with an effective sample size in the conditional independence tests to account for the information loss from missing values. The resulting approach works when the data are missing completely at random (MCAR). Then, we propose a Gibbs sampling procedure to draw correlation matrix samples from mixed data under missingness at random (MAR). These samples are translated into an average correlation matrix and an effective sample size, resulting in the ‘Copula PC’ algorithm for incomplete data. A simulation study shows that: 1) the use of the effective sample size significantly improves the performance of ‘Rank PC’ and ‘Copula PC’; 2) ‘Copula PC’ estimates a more accurate correlation matrix and causal structure than ‘Rank PC’ under MCAR and, even more so, under MAR. Also, we illustrate our methods on a real-world data set about gene expression.
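A minimal sketch of the ‘Rank PC’-style ingredients described above, rank correlation on pairwise-complete observations plus an effective sample size, under an assumed MCAR toy model (variable names and the setup are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
rho = 0.6

# Latent bivariate Gaussian; the observed variables are monotone transforms
# (a "nonparanormal" toy model), which leaves rank correlation unchanged.
Z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
X, Y = np.exp(Z[:, 0]), Z[:, 1] ** 3

# Delete about 20% of Y completely at random (MCAR).
Y[rng.random(n) < 0.2] = np.nan

# Spearman correlation from pairwise-complete observations only...
mask = ~np.isnan(Y)
rank = lambda a: np.argsort(np.argsort(a)).astype(float)
rho_s = np.corrcoef(rank(X[mask]), rank(Y[mask]))[0, 1]

# ...mapped back to the latent Pearson correlation via 2*sin(pi/6 * rho_s);
# the number of complete pairs serves as the effective sample size in tests.
rho_hat = 2.0 * np.sin(np.pi / 6.0 * rho_s)
n_eff = int(mask.sum())
print(rho_hat, n_eff)
```

The recovered rho_hat is close to the latent correlation 0.6 despite the missing entries, and n_eff (rather than n) is what a conditional independence test should use to calibrate its statistic.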
... For instance, Mohan et al. (2018) suggested such an approach mainly for a self-masked mechanism, i.e., the missingness depends only on the missing variable itself, in a regression framework. In recent work (Mohan and Pearl, 2014; Mohan et al., 2013, 2018), missing data have been treated as a causal inference problem, and graph-based procedures for consistently estimating parameters have been proposed to efficiently handle the MNAR case. Further extension to high-dimensional model selection in the MNAR case remains to be explored. ...
Thesis
The problem of missing data has existed since the beginning of data analysis, as missing values arise in the very process of obtaining and preparing data. In applications of modern statistics and machine learning, where data collection is becoming increasingly complex and multiple sources of information are combined, large databases often have an extraordinarily high number of missing values. These data therefore present important methodological and technical challenges for analysis: from visualization to modeling, including estimation, variable selection, prediction, and software implementation. Moreover, although high-dimensional data with missing values are a common difficulty in statistical analysis today, only a few solutions are available. The objective of this thesis is to provide new methodologies for performing statistical inference with missing data, in particular for high-dimensional data. The most important contribution is a comprehensive framework for dealing with missing values, from estimation to model selection, based on likelihood approaches. The proposed method does not rely on a specific pattern of missingness and strikes a good balance between quality of inference and computational efficiency. The contribution of the thesis consists of three parts. In Chapter 2, we focus on performing logistic regression with missing values in a joint modeling framework, using a stochastic approximation of the EM algorithm. We discuss parameter estimation, variable selection, and prediction for incomplete new observations. Through extensive simulations, we show that the estimators are unbiased and have good confidence interval coverage properties, outperforming the popular imputation-based approach. The method is then applied to pre-hospital data to predict the risk of hemorrhagic shock, in collaboration with medical partners, the Traumabase group of Paris hospitals.
Indeed, the proposed model improves the prediction of bleeding risk compared to the prediction made by physicians. In Chapters 3 and 4, we focus on model selection for high-dimensional incomplete data, aiming in particular to control false discoveries. For linear models, the adaptive Bayesian version of SLOPE (ABSLOPE) proposed in Chapter 3 addresses these issues by embedding the sorted l1 regularization within a Bayesian spike-and-slab framework. Alternatively, in Chapter 4, targeting models more general than linear regression, we consider these questions in a model-X framework, where the conditional distribution of the response as a function of the covariates is not specified. To do so, we combine the knockoff methodology with multiple imputation. Through extensive simulations, we demonstrate satisfactory performance in terms of power, FDR, and estimation bias for a wide range of scenarios. In an application to the medical data set, we build a model to predict patients' platelet levels from pre-hospital and hospital data. Finally, we provide two open-source software packages with tutorials, in order to support decision making in the medical field and help users facing missing values.
Article
Causal inference is often phrased as a missing data problem: for every unit, only the response to the observed treatment assignment is known; the responses to other treatment assignments are not. In this paper, we extend the converse approach of [7], which represents missing data problems as causal models in which only interventions on missingness indicators are allowed. We further use this representation to leverage techniques developed for the identification of causal effects, giving a general criterion for cases where a joint distribution containing missing variables can be recovered from the data actually observed, given assumptions on the missingness mechanisms. This criterion is significantly more general than the commonly used "missing at random" (MAR) criterion, and generalizes past work which also exploits a graphical representation of missingness. In fact, the relationship of our criterion to MAR is not unlike the relationship between the ID algorithm for identification of causal effects [22, 18] and conditional ignorability [13].
Chapter
Confounding bias, missing data, and selection bias are three common obstacles to valid causal inference in the data sciences. Covariate adjustment is the most pervasive technique for recovering causal effects in the presence of confounding bias. In this paper we introduce a covariate adjustment formulation for controlling confounding bias in the presence of missing-not-at-random data and develop a necessary and sufficient condition for recovering causal effects using the adjustment. We also introduce an adjustment formulation for controlling both confounding and selection biases in the presence of missing data and develop a necessary and sufficient condition for valid adjustment. Furthermore, we present an algorithm that lists all valid adjustment sets and an algorithm that finds a valid adjustment set containing the minimum number of variables, which are useful for researchers interested in selecting adjustment sets with desired properties.
Conference Paper
This paper applies graph based causal inference procedures for recovering information from missing data. We establish conditions that permit and prohibit recoverability. In the event of theoretical impediments to recoverability, we develop graph based procedures using auxiliary variables and external data to overcome such impediments. We demonstrate the perils of model-blind recovery procedures both in determining whether or not a query is recoverable and in choosing an estimation procedure when recoverability holds.
Article
Full-text available
With incomplete data, the missing at random (MAR) assumption is widely understood to enable unbiased estimation with appropriate methods. The need to assess the plausibility of MAR and to perform sensitivity analyses considering missing not at random (MNAR) scenarios have been emphasized, but the practical difficulty of these tasks is rarely acknowledged. What MAR means with multivariable missingness is difficult to grasp, while in many MNAR scenarios unbiased estimation is possible using methods commonly associated with MAR. Directed acyclic graphs (DAGs) have been proposed as an alternative framework for specifying practically accessible assumptions beyond the MAR-MNAR dichotomy. However, there is currently no general algorithm for deciding how to handle the missing data given a specific DAG. We construct "canonical" DAGs capturing typical missingness mechanisms in epidemiological studies with incomplete exposure, outcome and confounders. For each DAG, we determine whether common target parameters are "recoverable", meaning that they can be expressed as functions of the observed data distribution and thus estimated consistently, or if sensitivity analyses are necessary. We investigate the performance of available case and multiple imputation procedures. Using the Longitudinal Study of Australian Children, we illustrate how our findings can guide the treatment of missing data in point-exposure studies.
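The notion of a "recoverable" parameter can be illustrated with a hypothetical point-exposure scenario: if the assumed DAG implies that the missingness indicator of Y is independent of Y given a fully observed X, then P(y | x) = P(y | x, R=1), and complete-case analysis is consistent. A small simulation sketch (our own toy setup, not drawn from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Toy DAG: X -> Y and X -> R_Y (missingness of Y caused by X only).
X = rng.binomial(1, 0.5, size=n)
Y = rng.binomial(1, np.where(X == 1, 0.8, 0.3))
R = rng.random(n) < np.where(X == 1, 0.9, 0.4)   # Y recorded iff R is True

# Since R _||_ Y | X, P(y | x) is recoverable: P(y | x) = P(y | x, R=1).
for x in (0, 1):
    truth = Y[X == x].mean()                 # from the full (unseen) data
    complete_case = Y[(X == x) & R].mean()   # from the observed rows only
    print(x, truth, complete_case)
```

The complete-case conditional means match the true P(Y=1 | X=x) even though Y is far from MCAR; had an arrow Y -> R_Y been present instead, the same estimator would be biased and a sensitivity analysis would be needed.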
Article
The subject of this paper is the elucidation of effects of actions from causal assumptions represented as a directed graph, and statistical knowledge given as a probability distribution. In particular, we are interested in predicting conditional distributions resulting from performing an action on a set of variables and, subsequently, taking measurements of another set. We provide a necessary and sufficient graphical condition for the cases where such distributions can be uniquely computed from the available information, as well as an algorithm which performs this computation whenever the condition holds. Furthermore, we use our results to prove completeness of do-calculus [Pearl, 1995] for the same identification problem.
Conference Paper
A fundamental aspect of rating-based recommender systems is the observation process, the process by which users choose the items they rate. Nearly all research on collaborative filtering and recommender systems is founded on the assumption that missing ratings are missing at random. The statistical theory of missing data shows that incorrect assumptions about missing data can lead to biased parameter estimation and prediction. In a recent study, we demonstrated strong evidence for violations of the missing at random condition in a real recommender system. In this paper we present the first study of the effect of non-random missing data on collaborative ranking, and extend our previous results regarding the impact of non-random missing data on collaborative prediction.
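The bias described above is easy to reproduce in a toy simulation (our own illustrative setup, not the study's data): if users are more likely to rate items they like, the mean of the observed ratings overestimates the true mean, while inverse-probability weighting with the observation probabilities (assumed known here) corrects it.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical true ratings on a 1-5 scale; higher ratings are more
# likely to be observed (missing not at random).
ratings = rng.integers(1, 6, size=n)
p_obs = ratings / 10.0                    # P(rating observed | rating value)
observed = rng.random(n) < p_obs

naive = ratings[observed].mean()          # mean of observed ratings: biased up
ipw = np.sum(ratings[observed] / p_obs[observed]) / n  # bias-corrected mean

print(ratings.mean(), naive, ipw)
```

In practice the observation probabilities are unknown and must themselves be modeled, which is exactly why the missingness mechanism matters for both collaborative prediction and ranking.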
Article
Estimating causal effects from incomplete data requires additional and inherently untestable assumptions regarding the mechanism giving rise to the missing data. We show that using causal diagrams to represent these additional assumptions both complements and clarifies some of the central issues in missing data theory, such as Rubin's classification of missingness mechanisms (as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR)) and the circumstances in which causal effects can be estimated without bias by analysing only the subjects with complete data. In doing so, we formally extend the back-door criterion of Pearl and others for use in incomplete data examples. These ideas are illustrated with an example drawn from an occupational cohort study of the effect of cosmic radiation on skin cancer incidence.
Article
Field experiments in the social sciences were increasingly used in the 20th century. This article briefly reviews some important lessons in design, analysis, and theory of field experiments emerging from that experience. Topics include the importance of ensuring that selection into experiments and assignment to conditions occurs properly, how to prevent and analyze attrition, the need to attend to power and effect size, how to measure and take partial treatment implementation into account in analyses, modern analyses of quasi-experimental and multilevel data, Rubin's model, and the role of internal and external validity. The article ends with observations on the computer revolution in methodology and statistics, convergences in theory and methods across disciplines, the need for an empirical program of methodological research, the key problem of selection bias, and the inevitability of increased specialization in field experimentation in the years to come.
Article
In biostatistical applications interest often focuses on the estimation of the distribution of a failure-time variable T. If one only observes whether or not T exceeds an observed monitoring time C, then the data structure is called current status data, also known as interval-censored data, case I. We extend the data structure by allowing the presence of a possibly time-dependent covariate process which is observed up to the monitoring time C. We follow the approach of Robins and Rotnitzky (1992) by modeling the hazard of C conditional on the failure-time variable and the covariate process, i.e. the missingness or censoring process, under the restriction that the missingness (monitoring) process satisfies coarsening at random. Because of the curse of dimensionality, no globally efficient nonparametric estimators with good practical performance at moderate sample sizes exist. We introduce an inverse probability ...