
A Causal Research Pipeline and Tutorial for Psychologists and Social Scientists


Abstract

Causality is a fundamental part of the scientific endeavor to understand the world. Unfortunately, causality is still taboo in much of psychology and social science. Motivated by a growing number of recommendations for the importance of adopting causal approaches to research, we reformulate the typical approach to research in psychology to harmonize inevitably causal theories with the rest of the research pipeline. We present a new process which begins with the incorporation of techniques from the confluence of causal discovery and machine learning for the development, validation, and transparent formal specification of theories. We then present methods for reducing the complexity of the fully specified theoretical model into the fundamental submodel relevant to a given target hypothesis. From here, we establish whether or not the quantity of interest is estimable from the data, and if so, propose the use of semi-parametric machine learning methods for the estimation of causal effects. The overall goal is the presentation of a new research pipeline which can (a) facilitate scientific inquiry compatible with the desire to test causal theories, (b) encourage transparent representation of our theories as unambiguous mathematical objects, (c) tie our statistical models to specific attributes of the theory, thus reducing under-specification problems frequently resulting from the theory-to-model gap, and (d) yield results and estimates which are causally meaningful and reproducible. The process is demonstrated through didactic examples with real-world data, and we conclude with a summary and discussion of limitations.
Psychological Methods
A Causal Research Pipeline and Tutorial for Psychologists and Social Scientists
Matthew James Vowels
Online First Publication, January 6, 2025. https://dx.doi.org/10.1037/met0000673

CITATION
Vowels, M. J. (2025). A causal research pipeline and tutorial for psychologists and social scientists. Psychological Methods. Advance online publication. https://dx.doi.org/10.1037/met0000673
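The pipeline described in the abstract above can be illustrated end to end in a few lines of Python. The example below is a minimal sketch, not the paper's own code: it assumes a deliberately simple hypothetical theory (a single confounder Z of the effect of treatment T on outcome Y), specifies it as a DAG, checks identifiability via the backdoor criterion, and estimates the average treatment effect with a doubly-robust (AIPW) estimator standing in for the more flexible semi-parametric learners the paper recommends.

```python
import networkx as nx
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# 1. Specify the (hypothetical) theory as a DAG: Z confounds T -> Y.
g = nx.DiGraph([("Z", "T"), ("Z", "Y"), ("T", "Y")])

# 2. Identification: is {Z} a valid backdoor set for the effect of T on Y?
#    Test d-separation of T and Y given Z after removing T's outgoing edges.
#    (In networkx >= 3.3 the call is nx.is_d_separator.)
g_back = g.copy()
g_back.remove_edges_from(list(g.out_edges("T")))
assert nx.d_separated(g_back, {"T"}, {"Y"}, {"Z"})  # effect is identifiable

# 3. Estimation: doubly-robust (AIPW) estimate of the average treatment
#    effect, with simple parametric nuisance models as stand-ins.
rng = np.random.default_rng(0)
n = 5_000
z = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-z)))      # treatment assignment
y = 1.5 * t + z + rng.normal(size=n)               # true ATE = 1.5

Z = z.reshape(-1, 1)
ps = LogisticRegression().fit(Z, t).predict_proba(Z)[:, 1]     # propensity
mu1 = LinearRegression().fit(Z[t == 1], y[t == 1]).predict(Z)  # E[Y | T=1, Z]
mu0 = LinearRegression().fit(Z[t == 0], y[t == 0]).predict(Z)  # E[Y | T=0, Z]
aipw = mu1 - mu0 + t * (y - mu1) / ps - (1 - t) * (y - mu0) / (1 - ps)
print(aipw.mean())  # approximately 1.5
```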
... Causal models, on the other hand, may have less predictive power under a stable joint distribution, but maintain this predictive power under covariate shift precisely because they leverage only those relationships which are invariant under such shift. Interested readers are directed to surveys and introductions by [28,58,60,68,87,88] for more information on causality. ...
Preprint
Artificial Neural Networks (ANNs), including fully-connected networks and transformers, are highly flexible and powerful function approximators, widely applied in fields like computer vision and natural language processing. However, their inability to inherently respect causal structures can limit their robustness, making them vulnerable to covariate shift and difficult to interpret/explain. This poses significant challenges for their reliability in real-world applications. In this paper, we introduce Causal Fully-Connected Neural Networks (CFCNs) and Causal Transformers (CaTs), two general model families designed to operate under predefined causal constraints, as specified by a Directed Acyclic Graph (DAG). These models retain the powerful function approximation abilities of traditional neural networks while adhering to the underlying structural constraints, improving robustness, reliability, and interpretability at inference time. This approach opens new avenues for deploying neural networks in more demanding, real-world scenarios where robustness and explainability are critical.
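One way to make "operating under predefined causal constraints" concrete is to mask a fully-connected layer's weight matrix with the DAG's adjacency matrix, so each output can depend only on its permitted parents. The PyTorch sketch below is an illustrative reading of that mechanism; the class name MaskedLinear and the example graph are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Linear):
    """Linear layer whose weight matrix is masked so that output j only
    depends on inputs i with mask[j, i] == 1, e.g. the parents of j in
    a DAG. A hypothetical sketch, not the paper's implementation."""

    def __init__(self, in_features, out_features, mask):
        super().__init__(in_features, out_features)
        self.register_buffer("mask", mask.float())  # shape (out, in)

    def forward(self, x):
        # Zero out weights that would violate the causal structure.
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

# Example DAG over three variables: x0 -> x1, x0 -> x2, x1 -> x2.
# Row j lists the permitted parents of output j.
adj = torch.tensor([[0, 0, 0],   # x0 has no parents
                    [1, 0, 0],   # x1 <- x0
                    [1, 1, 0]])  # x2 <- x0, x1
layer = MaskedLinear(3, 3, adj)
out = layer(torch.randn(8, 3))   # batch of 8 samples, structure respected
```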
Article
Data collection and research methodology represent a critical part of the research pipeline. On the one hand, it is important that we collect data in a way that maximises the validity of what we are measuring, which may involve the use of long scales with many items. On the other hand, collecting a large number of items across multiple scales results in participant fatigue, and expensive and time-consuming data collection. It is therefore important that we use the available resources optimally. In this work, we consider how the representation of a theory as a causal/structural model can help us to streamline data collection and analysis procedures by not wasting time collecting data for variables which are not causally critical for answering the research question. This not only saves time and enables us to redirect resources to attend to other variables which are more important, but also increases research transparency and the reliability of theory testing. To achieve this, we leverage structural models and the Markov conditional independency structures implicit in these models, to identify the substructures which are critical for a particular research question. To demonstrate the benefits of this streamlining we review the relevant concepts and present a number of didactic examples, including a real-world example.
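To see what a causally critical substructure can look like in practice, consider the Markov blanket of a target variable: its parents, children, and the children's other parents. Given the blanket, remaining variables are conditionally independent of the target and need not be collected for that question. A minimal networkx sketch with a hypothetical theory:

```python
import networkx as nx

def markov_blanket(g: nx.DiGraph, node):
    """Parents, children, and co-parents of `node` in DAG `g`.
    Under the Markov assumption, variables outside this set carry no
    additional information about `node` once the blanket is measured."""
    parents = set(g.predecessors(node))
    children = set(g.successors(node))
    co_parents = {p for c in children for p in g.predecessors(c)} - {node}
    return parents | children | co_parents

# Hypothetical theory: stress -> burnout <- workload,
# burnout -> turnover <- support.
g = nx.DiGraph([("stress", "burnout"), ("workload", "burnout"),
                ("burnout", "turnover"), ("support", "turnover")])
print(markov_blanket(g, "burnout"))
# {'stress', 'workload', 'turnover', 'support'}
```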
Article
Many students of statistics and econometrics express frustration with the way a problem known as “bad control” is treated in the traditional literature. The issue arises when the addition of a variable to a regression equation produces an unintended discrepancy between the regression coefficient and the effect that the coefficient is intended to represent. Avoiding such discrepancies presents a challenge to all analysts in the data intensive sciences. This note describes graphical tools for understanding, visualizing, and resolving the problem through a series of illustrative examples. By making this “crash course” accessible to instructors and practitioners, we hope to avail these tools to a broader community of scientists concerned with the causal interpretation of regression models.
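A short simulation makes the "bad control" discrepancy tangible: conditioning on a collider (a variable caused by both treatment and outcome) distorts an otherwise unbiased regression coefficient. The data-generating process below is hypothetical and chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)            # treatment
y = 2.0 * x + rng.normal(size=n)  # outcome; true effect of x on y is 2
z = x + y + rng.normal(size=n)    # collider: caused by both x and y

def ols(features, target):
    """Least-squares coefficients with an intercept column."""
    X = np.column_stack([np.ones(len(target))] + list(features))
    return np.linalg.lstsq(X, target, rcond=None)[0]

print(ols([x], y)[1])     # ~2.0: unadjusted regression recovers the effect
print(ols([x, z], y)[1])  # biased: "controlling for" the collider z
```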
Preprint
Causal identification is at the core of the causal inference literature, where complete algorithms have been proposed to identify causal queries of interest. The validity of these algorithms hinges on the restrictive assumption of having access to a correctly specified causal structure. In this work, we study the setting where a probabilistic model of the causal structure is available. Specifically, the edges in a causal graph are assigned probabilities which may, for example, represent degree of belief from domain experts. Alternatively, the uncertainty about an edge may reflect the confidence of a particular statistical test. The question that naturally arises in this setting is: Given such a probabilistic graph and a specific causal effect of interest, what is the subgraph which has the highest plausibility and for which the causal effect is identifiable? We show that answering this question reduces to solving an NP-hard combinatorial optimization problem which we call the edge ID problem. We propose efficient algorithms to approximate this problem, and evaluate our proposed algorithms against real-world networks and randomly generated graphs.
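A brute-force caricature of the search (feasible only for tiny graphs, which is the point of the NP-hardness result) scores each candidate subgraph by the product of its edge beliefs and keeps the most plausible one passing an identifiability check. The check below is a crude stand-in, not a complete identification algorithm, and the edge probabilities are invented for illustration.

```python
from itertools import combinations

# Uncertain edges with inclusion probabilities (degrees of belief).
# U is taken to be an unobserved confounder of the treatment T.
uncertain_edges = {("U", "T"): 0.3, ("X", "T"): 0.9, ("X", "Y"): 0.7}

def plausibility(included):
    """Product of beliefs: p for included edges, (1 - p) for excluded."""
    score = 1.0
    for edge, p in uncertain_edges.items():
        score *= p if edge in included else (1.0 - p)
    return score

def identifiable(included):
    # Stand-in criterion: no unobserved-confounder edge into T.
    return ("U", "T") not in included

candidates = []
edges = list(uncertain_edges)
for k in range(len(edges) + 1):
    for subset in combinations(edges, k):
        if identifiable(set(subset)):
            candidates.append((plausibility(set(subset)), subset))

print(max(candidates))  # most plausible identifiable subgraph
```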
Article
The study of causal inference has seen recent momentum in machine learning and artificial intelligence (AI), particularly in the domains of transfer learning, reinforcement learning, automated diagnostics, and explainability (among others). Yet, despite its increasing application to address many of the boundaries in modern AI, causal topics remain absent in most AI curricula. This work seeks to bridge this gap by providing classroom-ready introductions that integrate into traditional topics in AI, suggests intuitive graphical tools for the application to both new and traditional lessons in probabilistic and causal reasoning, and presents avenues for instructors to impress upon students the merit of climbing the "causal hierarchy" to address problems at the levels of associational, interventional, and counterfactual inference. Finally, this study shares anecdotal instructor experiences, successes, and challenges integrating these lessons at multiple levels of education.
Article
Common tasks encountered in epidemiology, including disease incidence estimation and causal inference, rely on predictive modelling. Constructing a predictive model can be thought of as learning a prediction function (a function that takes as input covariate data and outputs a predicted value). Many strategies for learning prediction functions from data (learners) are available, from parametric regressions to machine learning algorithms. It can be challenging to choose a learner, as it is impossible to know in advance which one is the most suitable for a particular dataset and prediction task. The super learner (SL) is an algorithm that alleviates concerns over selecting the one 'right' learner by providing the freedom to consider many, such as those recommended by collaborators, used in related research or specified by subject-matter experts. Also known as stacking, SL is an entirely prespecified and flexible approach for predictive modelling. To ensure the SL is well specified for learning the desired prediction function, the analyst does need to make a few important choices. In this educational article, we provide step-by-step guidelines for making these decisions, walking the reader through each of them and providing intuition along the way. In doing so, we aim to empower the analyst to tailor the SL specification to their prediction task, thereby ensuring their SL performs as well as possible. A flowchart provides a concise, easy-to-follow summary of key suggestions and heuristics, based on our accumulated experience and guided by SL optimality theory.
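Since the article notes that SL is also known as stacking, a minimal scikit-learn sketch conveys the core idea of combining a prespecified library of learners via cross-validated meta-learning; the learner library and simulated data here are illustrative choices, not the article's recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# A small library of candidate learners; a real SL specification would be
# tailored to the prediction task, as the article's guidelines describe.
library = [
    ("ols", LinearRegression()),
    ("ridge", Ridge(alpha=1.0)),
    ("forest", RandomForestRegressor(n_estimators=200, random_state=0)),
]

# Stacking: out-of-fold predictions from each learner are combined by a
# meta-learner, mirroring the SL's cross-validated weighting of candidates.
sl = StackingRegressor(estimators=library, final_estimator=LinearRegression(), cv=5)
print(cross_val_score(sl, X, y, cv=5, scoring="r2").mean())
```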
Article
Learning about cause and effect is arguably the main goal in applied econometrics. In practice, the validity of these causal inferences is contingent on a number of critical assumptions regarding the type of data that has been collected and the substantive knowledge that is available about the phenomenon under investigation. For instance, unobserved confounding factors threaten the internal validity of estimates; data availability is often limited to non-random, selection-biased samples; causal effects need to be learned from surrogate experiments with imperfect compliance; and causal knowledge has to be extrapolated across structurally heterogeneous populations. A powerful and flexible causal inference framework is required in order to tackle all of these challenges, which plague essentially any data analysis to varying degrees. Building on the structural perspective on causality introduced by Haavelmo (1943) and the graph-theoretic approach proposed by Pearl (1995), the artificial intelligence (AI) literature has developed a wide array of techniques for causal inference that allow researchers to leverage information from various imperfect, heterogeneous, and biased data sources (Bareinboim and Pearl, 2016). In this paper, we review recent advances made in this literature that have the potential to contribute to econometric methodology along three broad dimensions. First, they provide a unified and comprehensive framework for causal learning, in which the above-mentioned problems can be addressed in generality. Second, due to their origin in AI, they come together with sound, efficient, and complete (to be formally defined) algorithmic criteria for automation of the corresponding identification task. And third, because of the nonparametric description of structural models that graph-theoretic approaches build on, they combine the analytical rigor of structural econometrics with the flexibility of the potential outcomes framework, and thus offer a valuable complement to these two literature streams.
Article
A large-scale comparison of experimentally estimated advertising effects with those obtained using two state-of-the-art methods.
Article
Much research has been devoted to the problem of estimating treatment effects from observational data; however, most methods assume that the observed variables only contain confounders, i.e., variables that affect both the treatment and the outcome. Unfortunately, this assumption is frequently violated in real-world applications, since some variables only affect the treatment but not the outcome, and vice versa. Moreover, in many cases only the proxy variables of the underlying confounding factors can be observed. In this work, we first show the importance of differentiating confounding factors from instrumental and risk factors for both average and conditional average treatment effect estimation, and then we propose a variational inference approach to simultaneously infer latent factors from the observed variables, disentangle the factors into three disjoint sets corresponding to the instrumental, confounding, and risk factors, and use the disentangled factors for treatment effect estimation. Experimental results demonstrate the effectiveness of the proposed method on a wide range of synthetic, benchmark, and real-world datasets.
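The three-way factorization described above can be caricatured architecturally: encode the observed proxies into instrumental, confounding, and risk blocks, then let the treatment model see only the first two and the outcome model only the last two (plus treatment). The PyTorch sketch below is a hypothetical skeleton of that idea; the variational machinery (priors, ELBO, disentanglement penalties) central to the actual method is omitted.

```python
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    """Hypothetical sketch of the three-way factorization: proxies x are
    encoded into instrumental (z_i), confounding (z_c), and risk (z_r)
    blocks; treatment t is modelled from (z_i, z_c) and outcome y from
    (z_c, z_r, t). Training objectives are deliberately omitted."""

    def __init__(self, x_dim, z_dim=4):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(),
                                    nn.Linear(64, 3 * z_dim))
        self.t_head = nn.Linear(2 * z_dim, 1)       # t | z_i, z_c
        self.y_head = nn.Linear(2 * z_dim + 1, 1)   # y | z_c, z_r, t

    def forward(self, x, t):
        z_i, z_c, z_r = self.encode(x).chunk(3, dim=-1)
        t_logit = self.t_head(torch.cat([z_i, z_c], dim=-1))
        y_hat = self.y_head(torch.cat([z_c, z_r, t], dim=-1))
        return t_logit, y_hat

model = DisentangledEncoder(x_dim=10)
x = torch.randn(32, 10)                              # observed proxies
t = torch.bernoulli(torch.full((32, 1), 0.5))        # binary treatment
t_logit, y_hat = model(x, t)
```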