Qingyuan Zhao’s research while affiliated with University of Cambridge and other places


Publications (62)


Counterfactual explainability of black-box prediction models
  • Preprint
  • File available

November 2024

·

9 Reads

Zijun Gao

·

Qingyuan Zhao

It is crucial to be able to explain black-box prediction models to use them effectively and safely in practice. Most existing tools for model explanations are associational rather than causal, and we use two paradoxical examples to show that such explanations are generally inadequate. Motivated by the concept of genetic heritability in twin studies, we propose a new notion called counterfactual explainability for black-box prediction models. Counterfactual explainability has three key advantages: (1) it leverages counterfactual outcomes and extends methods for global sensitivity analysis (such as functional analysis of variance and Sobol's indices) to a causal setting; (2) it is defined not only for the totality of a set of input factors but also for their interactions (indeed, it is a probability measure on a whole "explanation algebra"); (3) it also applies to dependent input factors whose causal relationship can be modeled by a directed acyclic graph, thus incorporating causal mechanisms into the explanation.


PathGPS: discover shared genetic architecture using GWAS summary data

July 2024

·

24 Reads

·

1 Citation

Biometrics

The increasing availability and scale of biobanks and “omic” datasets bring new horizons for understanding biological mechanisms. PathGPS is an exploratory data analysis tool to discover genetic architectures using Genome Wide Association Studies (GWAS) summary data. PathGPS is based on a linear structural equation model where traits are regulated by both genetic and environmental pathways. PathGPS decouples the genetic and environmental components by contrasting the GWAS associations of “signal” genes with those of “noise” genes. From the estimated genetic component, PathGPS then extracts genetic pathways via principal component and factor analysis, leveraging the low-rank and sparse properties. In addition, we provide a bootstrap aggregating (“bagging”) algorithm to improve stability under data perturbation and hyperparameter tuning. When applied to a metabolomics dataset and the UK Biobank, PathGPS confirms several known gene–trait clusters and suggests multiple new hypotheses for future investigations.



Figure 1: Model overview. The causal directed graph associated with structural equations (1). X_1, ..., X_{K−1} are exposure traits in temporally increasing order and Y is the outcome trait. The blue arrows represent the causal effects of the genetic variants Z on the traits. The red arrows represent the effects of the unmeasured non-heritable confounders U.
Figure 2: Model identifiability under three scenarios
Figure 3: Illustration of causal relationships across traits allowing multiple traits at each time point.
Figure 4: Simulation scenarios
Figure 6: Evaluation of the effect of body size on breast cancer at different ages. a) Estimated effects of childhood body size (from UK Biobank) and adult BMI on breast cancer risk as estimated by MV-IVW, GRAPPLE, and our Bayesian approach. b) Estimated effects of childhood BMI and adult BMI on breast cancer risk from different methods. c) Estimated causal DAG from our Bayesian approach with selection p-value threshold at 10^{−2}. The black arrows indicate significant direct effects.


Causal mediation analysis for time-varying heritable risk factors with Mendelian Randomization

February 2024

·

25 Reads

Understanding the causal pathogenic mechanisms of diseases is crucial in clinical research. When randomized controlled experiments are not available, Mendelian Randomization (MR) offers an alternative, leveraging genetic mutations as a natural "experiment" to mitigate environmental confounding. However, most MR analyses treat the risk factors as static variables, potentially oversimplifying dynamic risk factor effects. The framework of life-course MR has been introduced to address this issue. However, current methods face challenges, especially when the age-specific GWAS datasets have limited cohort sizes and there are substantial correlations between time points for a single trait. This study proposes a novel approach, estimating a unified system of structural equations for a sequence of temporally ordered heritable traits, requiring only GWAS summary statistics. The method facilitates statistical inference on direct, indirect, and path-wise causal effects and demonstrates superior efficiency and reliability, particularly with noisy GWAS data. By incorporating a spike-and-slab prior for genetic effects, the approach can address extreme polygenicity and weak instrument bias. Through this methodology, we uncovered a protective effect of BMI on breast cancer during a confined period of childhood development. We also analyzed how BMI, systolic blood pressure (SBP), and low-density cholesterol levels influence stroke risk across childhood and adulthood, and identified the intriguing relationships between these risk factors.



Summary of multiple testing procedures combined with RANC p-values.
FDR and power analysis. We compare the validity and the power of three variants of the BH procedure (FDR level q = 0.2) across 6 joint distributions of the baseline p-values. The standard deviations of the FDP and the true positive rate are reported in brackets. All settings are repeated 10^4 times.
Simultaneous Hypothesis Testing Using Ranks Among Negative Controls

March 2023

·

71 Reads

Consider a simultaneous hypothesis testing problem where each hypothesis is associated with a test statistic. Suppose it is difficult to obtain the null distribution of the test statistics, but some null hypotheses--referred to as the internal negative controls--are known to be true. When it is reasonable to assume that the test statistics associated with the negative controls are exchangeable with those associated with the unknown true null hypotheses, we propose to use a statistic's Rank Among Negative Controls (RANC) as a p-value for the corresponding hypothesis. We provide two theoretical perspectives on this proposal. First, we view the empirical distribution of the negative control statistics as an estimate of the null distribution. We use this to show that, when the test statistics are exchangeable, the RANC p-values are individually valid and have a positive regression dependence on the subset of true nulls. Second, we study the empirical processes of the test statistics indexed by the rejection threshold. We use this to show that the Benjamini-Hochberg procedure applied to the RANC p-values may still control the false discovery rate when the test statistics are not exchangeable. The practical performance of our method is illustrated using numerical simulations and a real proteomic dataset.
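
The rank-based idea in this abstract can be illustrated with a minimal sketch. It assumes that larger statistics indicate stronger evidence against the null and uses the common "add one" rank formula p = (1 + number of negative controls at least as large) / (1 + m); the abstract does not state the exact formula, so this convention, the function names `ranc_p_values` and `benjamini_hochberg`, and the toy numbers are illustrative assumptions rather than the authors' implementation.

```python
from bisect import bisect_left


def ranc_p_values(test_stats, control_stats):
    """Rank-among-negative-controls p-values (assumed 'add one' convention).

    Assumes larger statistics are more extreme under the alternative.
    """
    controls = sorted(control_stats)
    m = len(controls)
    p_values = []
    for t in test_stats:
        # bisect_left counts controls strictly below t, so the remainder are >= t.
        n_at_least = m - bisect_left(controls, t)
        p_values.append((1 + n_at_least) / (1 + m))
    return p_values


def benjamini_hochberg(p_values, q):
    """Return sorted indices rejected by the standard BH step-up procedure at level q."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / n:
            k_max = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k_max])


# Toy data (hypothetical): nine negative-control statistics, three test statistics.
controls = [0.1, 0.5, 0.9, 1.2, 1.5, 2.0, 2.2, 2.5, 3.0]
tests = [4.0, 0.3, 2.1]
p = ranc_p_values(tests, controls)
rejected = benjamini_hochberg(p, q=0.5)
```

With only m negative controls the smallest attainable p-value under this convention is 1/(1 + m), so the number of controls limits how small a level the procedure can usefully target.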


Mendelian randomization

December 2022

·

653 Reads

·

1,055 Citations

Nature Reviews Methods Primers

Eleanor Sanderson

·

M. Maria Glymour

·

·

[...]

·

George Davey Smith

Mendelian randomization (MR) is a term that applies to the use of genetic variation to address causal questions about how modifiable exposures influence different outcomes. The principles of MR are based on Mendel’s laws of inheritance and instrumental variable estimation methods, which enable the inference of causal effects in the presence of unobserved confounding. In this Primer, we outline the principles of MR, the instrumental variable conditions underlying MR estimation and some of the methods used for estimation. We go on to discuss how the assumptions underlying an MR study can be assessed and describe methods of estimation that are robust to certain violations of these assumptions. We give examples of a range of studies in which MR has been applied, the limitations of current methods of analysis and the outlook for MR in the future. The differences between the assumptions required for MR analysis and other forms of epidemiological studies mean that MR can be used as part of a triangulation across multiple sources of evidence for causal inference. Mendelian randomization is a technique for using genetic variation to examine the causal effect of a modifiable exposure on an outcome such as disease status. This Primer by Sanderson et al. explains the concepts of and the conditions required for Mendelian randomization analysis, describes key examples of its application and looks towards applying the technique to growing genomic datasets.


Small data, big time—A retrospect of the first weeks of COVID‐19

November 2022

·

14 Reads

·

9 Citations

Journal of the Royal Statistical Society Series A (Statistics in Society)

This article reviews some early investigations and research studies in the first weeks of the coronavirus disease 2019 (COVID‐19) pandemic from a statistician's perspective. These investigations were based on very small datasets but were momentous in the initial global reactions to the pandemic. The article discusses the initial evidence of high infectiousness of COVID‐19 and why that conclusion was not reached faster than in reality. Further reanalyses of some published COVID‐19 studies show that the epidemic growth was dramatically underestimated by compartmental models, and the lack of fit could have been clearly identified by simple data visualization. Finally, some lessons for statisticians are discussed.



A unified analysis of regression adjustment in randomized experiments

October 2022

·

39 Reads

Regression adjustment is broadly applied in randomized trials under the premise that it usually improves the precision of a treatment effect estimator. However, previous work has shown that this is not always true. To further understand this phenomenon, we develop a unified comparison of the asymptotic variance of a class of linear regression-adjusted estimators. Our analysis is based on the classical theory for linear regression with heteroscedastic errors and thus does not assume that the postulated linear model is correct. For a completely randomized binary treatment, we provide sufficient conditions under which some regression-adjusted estimators are guaranteed to be more asymptotically efficient than others. We explore other settings such as general treatment assignment mechanisms and generalized linear models, and find that the variance dominance phenomenon no longer occurs.


Citations (39)


... Fourth, the weighted model approach adjusts for the effect of differences in genotype frequencies on the results by weighting different genotypes (Hartwig, Davey Smith, and Bowden 2017). Fifth, MR-RAPS could correct the horizontal pleiotropy and thus reduce the bias (Zhao et al. 2019). The MR results are expressed as ORs and corresponding 95% confidence intervals (CIs). ...

Reference:

Association Between Cognitive Impairment and Dysphagia: A Two‐Sample Mendelian Randomization Study
Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score
  • Citing Preprint
  • January 2018

... Considering the highly polygenic nature of the complex biomarkers, large-scale GWASs typically identify the associated significant loci across the genome. Further disentangling the underlying biological mechanisms captured by polygenic IVs-oriented heterogeneous effects can help underpin the potential drug targets 46,48 . For example, body mass index (BMI), a polygenic biomarker, has been implicated in the risk of type 2 diabetes (T2D) and coronary heart disease (CHD), as demonstrated in RCTs and MR 49,50 . ...

PathGPS: discover shared genetic architecture using GWAS summary data
  • Citing Article
  • July 2024

Biometrics

... Though the replication of screened biomarkers in Stage 1 using large samples provides a more straightforward approach to triangulating the robust and consistent biomarker-outcome associations, understanding the underlying downstream biological mechanisms from the biomarker to the health outcome through various genetic analyses is crucial for prioritizing and identifying the putative drug targets. To achieve this, several genetic analyses are employed: (1) Elucidation of possible heterogeneous genetic effects oriented by polygenic IVs across the genome for the biomarker, utilizing methodologies such as Genome-wide mR Analysis under Pervasive PLEiotropy (GRAPPLE) 46 and MR-PATH 47 . ...

A latent mixture model for heterogeneous causal mechanisms in Mendelian randomization
  • Citing Article
  • June 2024

The Annals of Applied Statistics

... The ordinary least squares (OLS) estimator is then applied to the treatment coefficient to obtain the ATE estimate. ANCOVA is appealing due to its intuitive structure: by projecting the outcome onto the covariates, it accounts for variability explained by those covariates, thereby improving precision (Reluga et al., 2024). Notably, this remains true even when the actual relationship between covariates and outcomes is more complex than a linear model (Wang et al., 2019; Frank E Harrell, 2024; Harrell, 2017). ...

A unified analysis of regression adjustment in randomized experiments
  • Citing Article
  • January 2024

Electronic Journal of Statistics

... But from a statistical perspective, it is much easier to estimate r. Moreover, an initial doubling time of 2-3 days, which is what most countries in Europe and elsewhere experienced when the pandemic first hit (Pellis et al., 2021; Zhao, 2021) also highlights the urgency of the matter. ...

Small data, big time—A retrospect of the first weeks of COVID‐19
  • Citing Article
  • November 2022

Journal of the Royal Statistical Society Series A (Statistics in Society)

... Our work differs from the existing ML literature both in that we are concerned with inference problems broadly (not restricted to treatment effects), and in that we provide a means to impose externally-known constraints on shifts such as selection bias. Related issues have recently been considered in the epidemiology and biostatistics literature, motivated by the growing use of biobank-style datasets which have known selection biases [24,25]. Our work differs in that we explicitly consider the algorithmic properties of computing the resulting bounds, and in our analysis of estimators which themselves require fitting a model. ...

Sample-constrained partial identification with application to selection bias

Biometrika

... An external validation of the models was carried out, using metrics, such as q^2 and r^2_test, as well as other parameters, like r^2_0, r'^2_0, k, and k' (see Table 1). In addition, a Y-random test was performed (Table S4), ruling out the possibility that the developed models are the result of chance correlation [11]. The experimental and predictive pIC50 values (−log IC50), with the predictions made by the selected CoMFA and CoMSIA models, are presented in Table S3. ...

What is a Randomization Test?

... 20 To address these limitations, Lin 21 proposed including interactions between the treatment and covariates in the model. This method, known as analysis of heterogeneous covariance (ANHECOVA), 22,23 allows the regression model to capture heterogeneity in the treatment effect across covariates. Although ANHECOVA has historical roots, 22, 24 Lin's work was instrumental in establishing its advantages through a design-based lens. ...

Toward Better Practice of Covariate Adjustment in Analyzing Randomized Clinical Trials
  • Citing Article
  • March 2022

... Mendelian Randomization (MR) is a genetics-based approach that employs variations in the genome as instrumental variables (IVs) to reveal the causality between exposures and outcomes. It offers the advantages of reducing confounding bias, avoiding reverse causality and enabling the evaluation of exposures' long-term effects [33]. Given the existing evidence indicating inconsistencies in the correlation between lipid levels, CAD and BC, as well as the uncertainty surrounding causality, we conducted two-sample MR (TSMR) analyses to clarify the relationship between genetically elevated levels of four lipid traits and BC, as well as the causal association between CAD and BC. ...

Mendelian randomization
  • Citing Article
  • December 2022

Nature Reviews Methods Primers

... , IV_p, containing p eligible IVs, of which n_c are common SNPs and n_r are rare SNPs, simultaneously correlated with the K exposures under the preset P value threshold. For the j-th IV, we allow the presence of DIE (δ_j), IIE (κ_{xj}) and CHP (κ_{yj}) for the primary exposure, as well as the potential uUHP (θ_j) [21,28]. Based on the above definitions, the IV-to-outcome effect can be written as: ...

Profile‐likelihood Bayesian model averaging for two‐sample summary data Mendelian randomization in the presence of horizontal pleiotropy

Statistics in Medicine