Biostatistics (2009), 10, 3, pp. 468–480
Advance Access publication on March 29, 2009
Estimating equation–based causality analysis with
application to microarray time series data
Department of Biostatistics, Division of Quantitative Sciences,
University of Texas M. D. Anderson Cancer Center, Houston, TX, USA
Department of Statistics, University of Virginia, Charlottesville, VA, USA
Microarray time-course data can be used to explore interactions among genes and infer gene network. The
crucial step in constructing gene network is to develop an appropriate causality test. In this regard, the ex-
pression profile of each gene can be treated as a time series. A typical existing method establishes the
Granger causality based on Wald type of test, which relies on the homoscedastic normality assumption
of the data distribution. However, this assumption can be seriously violated in real microarray experi-
ments and thus may lead to inconsistent test results and false scientific conclusions. To overcome the
drawback, we propose an estimating equation–based method which is robust to both heteroscedasticity
and nonnormality of the gene expression data. In fact, it only requires the residuals to be uncorrelated.
We will use simulation studies and a real-data example to demonstrate the applicability of the proposed method.
Keywords: Chi-square approximation; Estimating equation; F-test; False-positive rate; Granger causality; Time series.

1. INTRODUCTION
Microarray technologies allow us to gain biological insight at the genomic scale by monitoring activ-
ities of thousands of genes simultaneously in a wide range of tissues, organs, and cell lines. In recent
years, time-course experiments (Spellman and others, 1998; Cho and others, 2001; Whitfield and others,
2002) have generated gene expression data measured repeatedly over a time period, making it possible
to explore biological functions of genes and interactions among genes and their products. Several
typical types of temporal patterns and associations among genes have been studied using time-course
experiments, including periodicity, co-expression detection, gene clustering, and causality. For exam-
ple, Filkov and others (2002) and Wichert and others (2004) have focused on periodicity and phase
∗To whom correspondence should be addressed.
© The Author 2009. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: email@example.com.
detection. The correlation coefficient and its variants have been used as a primary tool of detecting gene
co-expression and interactions (Schäfer and Strimmer, 2005; Zhu and others, 2005). There is also a large
literature on model-based or nonparametric gene clustering such as Peddada and others (2003), Schliep
and others (2003), and Song and others (2007).
Another essential problem in studying functional gene–gene interaction and constructing gene net-
work is causality detection. The general idea is to derive pairwise causality relationship among genes
based on a certain test that will be used later to construct the network. Developing an appropriate
causality test is therefore the crucial step in network construction, and it is the focus of this paper.
As introduced by Mukhopadhyay and Chatterjee (2007), the rough definition of causality relationship
between 2 genes is that gene 1 is a cause of gene 2 if expression of gene 1 is predictive of expression
of gene 2 at a future time period. In real life, 2 genes can have either a direct or an indirect causality
relationship. The indirect causality implies that at least one intermediate gene exists between the 2 genes
in the connection chain, which is commonly observed in gene regulatory network.
The causality relationship has been studied extensively in the literature. A class of causal models is
the marginal structural models (Robins and others, 2000; Hernán and others, 2001), which are used in ob-
servational studies with exposures or treatments varying over time, for example, in the logistic regression
setting. It provides the consistent inverse-probability-of-treatment-weighted estimators in the presence of
unmeasured confounding factors. In our problem of studying the causality relationship between 2 time
series, another class is probably more relevant that uses coherence- and partial coherence–based graphical
models as in Dahlhaus (2000), Butte and others (2001), and Salvador and others (2005). However, it has
been shown that the coherence-based approaches have several shortcomings: (1) they are sensitive to measure-
ment errors (Albo and others, 2004) and (2) they cannot detect the time-precedence relationship (Kaminski
and others, 2001; Baccala and Sameshima, 2001) and tolerate little additive random noise.
In contrast, Winterhalder and others (2005) demonstrated that Granger causality is more appropriate for
detecting the type of causality relationship of interest.
Mukhopadhyay and Chatterjee (2007) used a vector autoregressive (VAR) framework to test the
Granger causality via an F-test. However, validity of the F-test requires the data to be independently and
identically normally distributed. It is known that this strong distributional assumption is often
violated in gene expression data in 2 ways: (1) Gene expression intensities are often not normally
distributed. Some researchers used data transformation procedures (Rocke and Durbin, 2003; Durbin and
Rocke, 2004), typically logarithm transformation, to stabilize the variance so as to better approximate the
normal distribution, which is still not satisfactory in some circumstances. (2) The errors are not necessarily
homogeneous in real life or the variances of expression intensities in different groups or at different time
points are not identical. For example, earlier studies (Geller and others, 2003; Hu and Wright, 2007) have
shown that the variance is positively associated with the mean expression intensities. Another limitation
of the F-test is that it is only applicable to least squares–type parameter estimates, which is restrictive
in solving real problems. For example, the F-test is not valid for parameter estimates based on an
L1-distance objective function associated with robust (i.e. median) regression models, which is practically
necessary in some scenarios to account for the presence of outliers.
To overcome the shortcomings of an F-test, we propose a test based on estimating equations. The
proposed method only requires the data to be uncorrelated, which allows valid statistical inference
and hypothesis tests that are robust to a wide range of data distributions and to heteroscedasticity of the
errors that may be related to the mean function of the models. Moreover, it is also generally applicable
to parameter estimates obtained from a wide range of objective functions. Aside from the capability of
maintaining the appropriate significance level, empirical studies also illustrate its advantage in terms of
false-positive (FP) rate in comparison to the F-test. The method and theory are described in Section 2.
We discuss simulation studies extensively in Section 3. A real human cell cycle time-course data set will
be used for demonstration in Section 4.
J. HU AND F. HU
2. MOTIVATION AND METHOD
We consider the following autoregressive model:
y_1(t) = c + \alpha_1 y_1(t-1) + \cdots + \alpha_q y_1(t-q) + \beta_1 y_2(t-1) + \cdots + \beta_q y_2(t-q) + \epsilon_t,    (2.1)

where q is the autoregressive lag length and \epsilon_t, t = 1, ..., n, are uncorrelated with mean 0 and
variance \sigma_t^2. Gene 2 is said to Granger-cause gene 1 if at least one \beta_i \neq 0, i = 1, ..., q. The task
is to test

H_0: \beta_1 = \beta_2 = \cdots = \beta_q = 0.    (2.2)
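As an illustration, the following is a minimal sketch (in Python) of simulating data from model (2.1) with q = 1; the parameter values, seed, and the AR(1) process driving gene 2 are our own arbitrary choices, not values from the paper.

```python
import numpy as np

# Minimal sketch of simulating model (2.1) with q = 1; the parameter
# values and the driving AR(1) process for gene 2 are arbitrary choices.
rng = np.random.default_rng(0)
n, c, alpha, beta = 100, 0.1, 0.5, 0.4

y1, y2 = np.zeros(n), np.zeros(n)
for t in range(1, n):
    y2[t] = 0.3 * y2[t - 1] + rng.normal()                # gene 2
    # gene 1 depends on its own lag and on the lag of gene 2; since
    # beta != 0, gene 2 Granger-causes gene 1 in this simulated example
    y1[t] = c + alpha * y1[t - 1] + beta * y2[t - 1] + rng.normal()
```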
Hamilton (1994) and Mukhopadhyay and Chatterjee (2007) used F-test in a VAR framework assuming
the independent and identically distributed \epsilon_t. In gene expression time series data, it is observed that
the variance of the expression levels of a gene varies with the mean expression intensity at stages (or time
points) in the cell cycle. In this case, the F-test would lead to inconsistent results in the presence of
nonhomogeneous errors, mainly due to inconsistent variance estimation of the parameters.
To overcome the drawbacks of the F-test, we propose a test based on estimating equations, which only
requires \epsilon_t in model (2.1) to be uncorrelated, without additional assumptions on the distributional form.
In addition, it also allows the unequal variances \sigma_t^2. Let Y_1(t) = (y_1(t), ..., y_1(t-q)) and Y_2(t) = (y_2(t), ..., y_2(t-q)). We define the parameter vector \theta = (\beta, c, \alpha). The estimate of \theta is generally considered to be
the solution of the following estimating equation:
S(y, \theta) = n^{-1/2} \sum_{t=1}^{n} g(Y_1(t), Y_2(t), \theta) = 0    (2.3)

for some given function g satisfying

E g(Y_1(t), Y_2(t), \theta) = 0.
The normalization constant n^{-1/2} is chosen for convenience in expressing asymptotic results. The
estimating equation (2.3) is typically obtained from minimization (maximization) of some objective
function SS(y, \theta) = \sum_{t=1}^{n} G_t(Y_1(t), Y_2(t), \theta); that is, if the likelihood or least squares is used, then we have

g_t(Y_1(t), Y_2(t), \theta) = \partial G_t(Y_1(t), Y_2(t), \theta) / \partial\theta.

It is worth emphasizing that the inferential development on estimating equations in our problem
differs from the main body of existing research, which focuses on independent data (see Liang and Zeger,
1986; Godambe and Kale, 1991; Boos, 1992; Hu and Kalbfleisch, 2000). In our problem, the time series
data are dependent, and some explanatory variables are in fact lagged forms of the outcome variable.
Based on model (2.1), the least squares method results in the estimating function

g(Y_1(t), Y_2(t), \theta) = (y_2(t-1), ..., y_2(t-q), 1, y_1(t-1), ..., y_1(t-q))^T \times r_t,

a (2q + 1)-vector, where

r_t = y_1(t) - (c + \alpha_1 y_1(t-1) + \cdots + \alpha_q y_1(t-q) + \beta_1 y_2(t-1) + \cdots + \beta_q y_2(t-q)).
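The least squares estimating function above can be sketched in a few lines for q = 1; the function name and array layout here are our own illustrative choices.

```python
import numpy as np

# Sketch of the least squares estimating function for q = 1: each row is
# g_t = (y2(t-1), 1, y1(t-1))^T * r_t with theta = (beta, c, alpha).
def estimating_function(y1, y2, theta):
    x = np.column_stack([y2[:-1],                # y2(t-1), paired with beta
                         np.ones(len(y1) - 1),   # intercept c
                         y1[:-1]])               # y1(t-1), paired with alpha
    r = y1[1:] - x @ np.asarray(theta)           # residuals r_t
    return x * r[:, None]                        # n x (2q+1) array of g_t
```

At the least squares solution the columns of this array sum to zero, which is exactly the set of normal equations.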
We define the following notation prior to making inference on \theta. Let

V(\theta) = Var S(y, \theta) = n^{-1} \sum_{t=1}^{n} Var(g(Y_1(t), Y_2(t), \theta))

and

W(\theta) = E[ \partial g(Y_1(t), Y_2(t), \theta) / \partial\theta^T ].

Both V(\theta) and W(\theta) are (2q + 1) \times (2q + 1) matrices. In practice, reasonable estimates of V(\theta) and
W(\theta) for a given \theta are

V(y, \theta) = n^{-1} \sum_{t=1}^{n} (g(Y_1(t), Y_2(t), \theta) - \bar{g}(y, \theta))(g(Y_1(t), Y_2(t), \theta) - \bar{g}(y, \theta))^T,

where \bar{g}(y, \theta) = n^{-1} \sum_{t=1}^{n} g(Y_1(t), Y_2(t), \theta), and

W(y, \theta) = n^{-1} \sum_{t=1}^{n} \partial g(Y_1(t), Y_2(t), \theta) / \partial\theta^T.
Consequently, we have

V(y, \theta) \to V(\theta) and W(y, \theta) \to W(\theta)

in probability for both homogeneous and nonhomogeneous errors, which follows straightforwardly
from the weak law of large numbers.
We start with a general testing problem that has wide application,

H_0: h(\theta) = 0,

for a set of differentiable functions h, with the length of the vector h denoted by r. Note that the test in
(2.2) is just a special case. Let

H(\theta) = \partial h(\theta) / \partial\theta^T,

U(\theta) = H^T (H W^{-1} H^T)^{-1} H W^{-1} V W^{-1} H^T (H W^{-1} H^T)^{-1} H,

whose generalized (Moore–Penrose) inverse is then

U^-(\theta) = W^{-1} H^T (H W^{-1} V W^{-1} H^T)^{-1} H W^{-1},

where the arguments y and \theta are suppressed.
Let \tilde{\theta} be the estimate of \theta under the restriction h(\theta) = 0, obtained from the estimating equations. The test
statistic based on the estimating equations is then

\tilde{Q}_{h=0} = S(y, \tilde{\theta})^T U^-(\tilde{\theta}) S(y, \tilde{\theta}).
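Putting the pieces together for q = 1, a self-contained sketch of computing the test statistic follows; the function name is our own, \theta = (\beta, c, \alpha), and the restricted estimate under H_0: \beta = 0 is obtained by least squares on the reduced model.

```python
import numpy as np
from scipy.stats import chi2

# Sketch of the estimating equation-based causality test for q = 1.
def ee_granger_test(y1, y2):
    n = len(y1) - 1
    x = np.column_stack([y2[:-1], np.ones(n), y1[:-1]])

    # restricted fit under H0: regress y1(t) on the intercept and y1(t-1)
    coef_r, *_ = np.linalg.lstsq(x[:, 1:], y1[1:], rcond=None)
    theta_tilde = np.r_[0.0, coef_r]              # (beta = 0, c~, alpha~)

    r = y1[1:] - x @ theta_tilde                  # residuals under H0
    g = x * r[:, None]                            # rows are g_t
    S = g.sum(axis=0) / np.sqrt(n)                # estimating function
    gc = g - g.mean(axis=0)
    V = gc.T @ gc / n                             # estimate of V(theta)
    W = -(x.T @ x) / n                            # estimate of W(theta)
    H = np.array([[1.0, 0.0, 0.0]])               # h(theta) = beta

    Wi = np.linalg.inv(W)
    Uinv = Wi @ H.T @ np.linalg.inv(H @ Wi @ V @ Wi @ H.T) @ H @ Wi
    Q = float(S @ Uinv @ S)                       # ~ chi-square(1) under H0
    return Q, chi2.sf(Q, df=1)
```

Note that the intercept and \alpha components of S vanish exactly at the restricted least squares fit (the restricted normal equations), so Q effectively measures the \beta component of the estimating function scaled by its heteroscedasticity-robust variance.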
We derive its asymptotic distribution in the following theorem, with the proof given in the
supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).

THEOREM 2.1 Suppose the Lindeberg condition holds for g(Y_1(t), Y_2(t), \theta); that is, for every \epsilon > 0,

s_n^{-2} \sum_{t=1}^{n} E[ \| g(Y_1(t), Y_2(t), \theta) \|^2 I(\| g(Y_1(t), Y_2(t), \theta) \| > \epsilon s_n) ] \to 0,

where s_n^2 = \sum_{t=1}^{n} Var(g(Y_1(t), Y_2(t), \theta)). Then \tilde{Q}_{h=0} follows a chi-square distribution with r degrees of
freedom under H_0. Therefore, H_0 can be rejected at the significance level \alpha if \tilde{Q}_{h=0} > \chi^2_{r, 1-\alpha}.
Coming back to the test (2.2), we have the q \times (2q + 1) matrix

H(\theta) = (I_q, 0_{q \times (q+1)}),

where the element in the ith row and ith column equals 1, i = 1, ..., q, corresponding to \beta. So \tilde{Q}_{h=0}
follows the chi-square distribution with q degrees of freedom under H_0.
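For general q, the restriction matrix H described above is just an identity block followed by zeros; a one-line sketch (q = 2 chosen arbitrarily):

```python
import numpy as np

# H selects the first q components (the betas) of
# theta = (beta_1, ..., beta_q, c, alpha_1, ..., alpha_q):
# a q x (2q+1) block matrix [I_q | 0].
q = 2
H = np.hstack([np.eye(q), np.zeros((q, q + 1))])
```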
Because our focus is on the causality test, we will not discuss selection of the autoregressive lag length
in this paper. Throughout the empirical investigation, we take q = 1 without loss of generality. We
also study the application of the gene-pair causality relationship in network construction, adopting
the same strategy as Mukhopadhyay and Chatterjee (2007): first perform pairwise causality tests
and then use multiple-testing adjustment to select the most associated gene pairs for constructing the
network. Although this way of constructing the network is ad hoc, it suffices for the main purpose of
comparing the estimating equation–based test and the F-test in the high-dimensional setting.
3. SIMULATION STUDY
We conducted an extensive series of simulation studies to evaluate the performance of the estimating
equation–based causality detection method (EE) and the F-test (F).
It is essential for a test to preserve the appropriate type-I error rate, which is investigated in the
first set of simulations. Additionally, we empirically examine the distribution of the test statistic \tilde{Q}_{h=0} to
verify the theoretical results stated in Theorem 2.1. We focus on testing the causality relationship between
2 variables X_1 and X_2 with x_{1t} = 0.7 x_{1(t-1)} + \epsilon_{1t} and x_{2t} = 0.3 x_{2(t-1)} + \beta x_{1(t-1)} + \epsilon_{2t}. The interest is
to test the null hypothesis H_0: \beta = 0. We conducted 5000 simulations with the data generated under the
null hypothesis H_0. We considered the number of time points to be n = 30 or 100, and applied the 2 methods
to test H_0 in each simulation. Three homogeneous distributions of both \epsilon_{1t} and \epsilon_{2t} are considered: N(0, 1),
the t-distribution with 3 degrees of freedom, and the centered mean-1 exponential distribution. We report the
proportion of p-values no larger than 0.05 among the 5000 simulations in the left 3 columns
of Table 1. In most cases, the 2 methods yield similarly reasonable results that are very close to 0.05.
However, the EE method has discernibly better performance when the residuals follow the t-distribution.
We also considered a nonhomogeneous case where \epsilon_{1t} \sim N(0, 1) and the variance of \epsilon_{2t} is positively
associated with the absolute mean value of x_{1(t-1)}. Specifically, \epsilon_{2t} \sim N(0, x_{1(t-1)}^2). It serves sufficiently
as an example to demonstrate the inconsistency of the F-test, shown in the right column of Table 1. We can
see that the EE method performs consistently well while the F-test yields a much inflated type-I error rate.
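The nonhomogeneous design can be sketched as follows; the seed and series length are our own choices, and the standard deviation of \epsilon_{2t} is set to |x_{1(t-1)}| so that Var(\epsilon_{2t}) = x_{1(t-1)}^2.

```python
import numpy as np

# Sketch of the nonhomogeneous simulation design: eps_1t ~ N(0, 1) and
# eps_2t ~ N(0, x1(t-1)^2), i.e. sd(eps_2t) = |x1(t-1)|; beta = 0 (H0).
rng = np.random.default_rng(7)
n = 100
x1, x2 = np.zeros(n), np.zeros(n)
for t in range(1, n):
    x1[t] = 0.7 * x1[t - 1] + rng.normal()
    x2[t] = 0.3 * x2[t - 1] + rng.normal(scale=abs(x1[t - 1]))
```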
Table 1. Type-I error rate results in simulation study
Method Homogeneous \epsilon_{2t}