Enhancing the discovery of rare disease variants through hierarchical modeling

Article (PDF Available)inBMC proceedings 5 Suppl 9(Suppl 9):S16 · November 2011with14 Reads
DOI: 10.1186/1753-6561-5-S9-S16 · Source: PubMed
  • 37.71 · University of Southern California
Abstract
Advances in next-generation sequencing technology are enabling researchers to capture a comprehensive picture of genomic variation across large numbers of individuals with unprecedented levels of efficiency. The main analytic challenge in disease mapping is how to mine the data for rare causal variants among a sea of neutral variation. To achieve this goal, investigators have proposed a number of methods that exploit biological knowledge. In this paper, I propose applying a Bayesian stochastic search variable selection algorithm in this context. My multivariate method is inspired by the combined multivariate and collapsing method. In this proposed method, however, I allow an arbitrary number of different sources of biological knowledge to inform the model as prior distributions in a two-level hierarchical model. This allows rare variants with similar prior distributions to share evidence of association. Using the 1000 Genomes Project single-nucleotide polymorphism data provided by Genetic Analysis Workshop 17, I show that through biologically informative prior distributions, some power can be gained over noninformative prior distributions.
PROCEEDINGS Open Access
Enhancing the discovery of rare disease variants
through hierarchical modeling
Gary K Chen
From Genetic Analysis Workshop 17
Boston, MA, USA. 13-16 October 2010
Abstract
Advances in next-generation sequencing technology are enabling researchers to capture a comprehensive picture
of genomic variation across large numbers of individuals with unprecedented levels of efficiency. The main analytic
challenge in disease mapping is how to mine the data for rare causal variants among a sea of neutral variation. To
achieve this goal, investigators have proposed a number of methods that exploit biological knowledge. In this
paper, I propose applying a Bayesian stochastic search variable selection algorithm in this context. My multivariate
method is inspired by the combined multivariate and collapsing method. In this proposed method, however, I
allow an arbitrary number of different sources of biological knowledge to inform the model as prior distributions in
a two-level hierarchical model. This allows rare variants with similar prior distributions to share evidence of
association. Using the 1000 Genomes Project single-nucleotide polymorphism data provided by Genetic Analysis
Workshop 17, I show that through biologically informative prior distributions, some power can be gained over
noninformative prior distributions.
Background
Genome-wide association studies (GWAS) have been a
powerful method for revealing common variants that
confer a modest increase in disease risk in carriers. In
general, the single-nucleotide polymorphisms (SNPs)
that show the strongest evidence for association in
GWAS do not perfectly tag the putative causal variant
(s) nearby because of ancestral r ecombination events;
therefore resequencing i n these regions is necessary to
resolve the precise location of the causal variant(s).
Dickson et al. [1] postulated one possible explanation
for wh y many fine-mapping efforts have failed to map a
single causal SNP in the region tagged by the original
genome-wide association sign al: multiple rare variants
(MRVs) residing on multiple haplotypes at the regi on of
the genome-wide association signal are generating a
synthetic association when these haplotypes share a
common allele that is observed more in case subjects
than in control subjects. In support of the MRV
hypothesis, several investigators have recentl y developed
a number of popular burden-type methods [2-4]. These
methods are predicated on the notion that presence of
or an increase in the number of mutations for a person
at a particular pathway, region, gene, or any other biolo-
gical unit can serve as a reasonable proxy for his/her
risk of developing disease. The common theme among
these methods is that the genotypes for MRVs that map
to these bi ological units, called bins, are collapsed into a
single vector of scor es, a technique that can potentially
improve statistical power to detec t disease assoc iation.
For example, in the combined multivariate and collap-
sing (CMC) m ethod of Li and Leal [2], a score for an
individual is assigned 1 if at least one mutation is
observed across all SNPs within a bin, or 0 otherwise.
The significa nce of a gene, fo r example, can then be
tested by jointly modeling all bins that map within the
gene using a multivariate method such as Hotelling s
multivariate T-test, logistic regression, or linear
regression.
In this paper, I describe how I adapted the concept of
the CMC method into a Bayesian variable selection
algorithm with the notion that common SNPs may also
Correspondence: gary.k.chen@usc.edu
Division of Biostatistics, Department of Preventive Medicine, University of
Southern California, 2001 North Soto Street, SSB 202Q, MC 9234, Los
Angeles, CA 90089-9234, USA
Chen BMC Proceedings 2011, 5(Suppl 9):S16
http://www.biomedcentral.com/1753-6561/5/S9/S16
© 2011 Chen; licensee BioM ed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which perm its unre stricted use, distribution, and reproduction in
any medium, provided the original work is prope rly cited.
contribute valuable information to nearby causal rare
variants, assuming that the shared haplotype model [1]
is true. The exon resequencing data set provided by the
organizers of Genetic Ana lysis Workshop 17 (GAW17)
provides an ideal opportunity for evaluating the perfor-
mance of this new approach.
Methods
Details of the simulated GAW17 data set can be found in
this same issue [5]. I defined variants that had a minor
allele frequency (MAF) less than 0.01 to be rare but
potential ly the mos t bi ologically i nteresting, because
extremely rare mutations are expected to have the great-
est deleterious effects on phenotype. Of all the SNPs pro-
vided in the data set, 73% (18,131) fall within th is MAF
range. For each gene, I applied the collapsing procedure,
as described in the CMC method [2], by grouping rare
SNPs into one of two bins defined by their predicted
impact on protein (i.e., synonymous or nonsynonymous
variant). Any bin with a MAF less than 0.01 after the col-
lapsing proce dure was not included for further analysis.
Common SNPs, defined as those having a MAF 0.01,
were not collapsed with any other SNPs in the gene. For
conciseness, I use the term variable to define either a sin-
gle common SNP or a SNP bin. The final marker panel
included 7,385 variables: 1,029 bins containing collapsed
rare variants and 6,356 bins containing common SNPs. I
experimented with higher threshold values for bin defini-
tion (e.g., MAF = 0.05), but this strategy did not recover
an appreciable number of bins from the filtering step
because most genes in the data set were small and har-
bored private mutations. True log relative risks (denoted
b) for each SNP are provided in the simulation answer
key, which quantifies ea ch SNPs effect on the quantita-
tive traits Q1 and Q2. Thus, to assess how accurately my
method can recover the true values of b at each SNP, I
constrained the analyses only to models where the out-
come phenotype was either trait Q1 or Q2.
The statistical model I used was a two-level hierarchi-
cal model, described in detail by Chen and Thomas [6].
One property of a hierarchical model that is appealing
when analyzing variant s of low frequency, where maxi-
mum-likelihood estimates (MLEs) of association
ˆ
bb
can
be highly unstable, is the ability t o smooth these point
estimates (and their variances) toward prior distributions
defined in a second level. At the first level, I apply
ordinary least-squares regression, which produces MLEs
of associ ation between a continuous trait (i.e., either Q1
or Q2) and a random set of m model variables. A design
matrix X stores the variable values, and the vector Y
stores values of Q1 or Q1 across all individuals:
YX=+bbbb
01,,
.
m
(1)
I defi ne a prior distribution on b in Eq. (1) using the
annotation information provided by GAW17. For vari-
able k, b
k
is distributed as a mixed-effects model, origin-
ally defined by Besag et al. [7] as:
k
T
kk k
pqj
Z ++,
(2)
where the latent fixed effect is π and the random
effects components are:
qs
jj
t
n
k
kk
k
N
N
~(, ), ()
~,. ()
03
3
2
2
a
b
−−
The Z matrix stores external knowledge about each of
the m variables currently in the model. To e ncode my
belief that deleterious mutations would have higher or
lower values of b relative to other types of mutations, I
assigned a value of 1 to the nonsynonymous mutation
in the second column (after reserving the first column
as the intercept) of the m × 2 d esign matrix Z and a
value of 0 for any other SNP category. The term π, esti-
mated using ordinary linear regressi on, relates the mag-
nitude of b in Eq. (1) to values in Z.Furthermore,to
encode my belief tha t mutations within the same gene
should have similar effects on disease, I specified an
indicator encoding whether predictor k and any other
model variables are in the same gene by means of a k ×
k adjacency matrix A. Specifically, the parameter
j
k
stores the mean of the MLE
ˆ
bb
from the first le vel,
taken across neighbors of variable k (i.e., all other vari-
ables that are in the same gene) defined by means of A.
Thevariancetermτ
2
is inversely scaled by v
k
,thenum-
ber of neighbors of k to weight the uncertainty about τ
2
.
Finally, θ
k
soaks up any remaining variation i n the sec-
ond level of the model through the variance term s
2
.
A posterior density is defined on the basis of the likeli-
hood and normal density function corresponding to the
first (Eq. (1)) and second (Eq. (2)) levels of the hierarchi-
cal model. I use the product of this density function and
a model transition function as the objective function of a
reversible jump Monte C arlo Markov chain (MCMC)
algorithm to stochastically explore the search space,
fitting all possible sets of model variables to the data. The
model transition kernel itself is informed through empiri-
cal Bayes estimates of the hyperparameters (e.g., π), so
that regions of the search space that have strong empiri-
cal support and prior evidence are prioritized. Further
details on how the variable selection algorithm works can
be found in Chen and Thomas [6].
In the next section I present results between a more
conventional method and my proposed method. The
Chen BMC Proceedings 2011, 5(Suppl 9):S16
http://www.biomedcentral.com/1753-6561/5/S9/S16
Page 2 of 6
first method is an ordinary least-squares regression
between the quantitative trait (i.e., Q1 or Q2) a nd each
vector of variable scores, which I denote as the MLE
method. This approach is equivalent to a conventional
genome-wide association scan, testing for marginal
effects. I compared this to four variations of the multi-
variate MCMC method. Specifically, I varied the degree
of informativeness in the prior distributio n by modifying
the definition of the matrices A and Z.Themostinfor-
mative prior distribution (denoted FULL) stores both
gene membership and SNP mutation type information
in the A and Z matrices, respectively. In the second var-
iant of the prior distr ibution (denoted Z only), I
removed gene membership information so that matrix
A was simply the identit y matrix. Conversely, in the
third variant of the prior distribution (denoted A only),
I removed mutation class information so that the Z
matrix included only the intercept. The last variant of
the prior distribution (denoted UNINF) includes both
the uninformative Z and A matrices and is equivalent to
a the ridge style prior distribution (i.e., b ~ N(0, s
2
)).
For each of the MCMC analyses, I s ampled 2 million
realizations from t he posterior distribution, retaining
statistics on only the last million realizations to mini-
mize any correlation to the initial parameter values. Run
time on a 2-GHz Xeon processor was approximately 8
h. I verified that the retained statistics reached conver-
gence by comparing their distributions across multiple
chains using a nonsignificant p-value extracted from the
Kolmogorov-Smirnov test.
To qua ntify evidence for any specific variable (either
common SNP or SNP bin), I empirically estimated
Bayes factors (BFs) for each variable by dividing the pos-
terior odds by the prior odds, as described by Chen and
Thomas [6]. BFs quantify the increase in evidence for a
hypothesis (in this case, inclusion o f a variable i nto the
model) in light of observed data relative to a prior
hypothesis [8].
Results
Table 1 lists the posterior estimates of the various
hyperparameters of the hierarchical model under the
FULL prior distribution specification. F or either of the
two q uantitative traits, the residual variance (τ
2
)inthe
random effects component was smaller than the residual
var iance from the fixed effects component (s
2
), indicat-
ing a good fit between the gene-membership prior
distribution and the observed data. The posterior esti-
mates for the prior mean (π) indicate a slightly positive
correlation (0.03) between disease risk and presence of a
nonsynonymous mutation, although the evide nce is
weak considering the large standard errors (0.06).
As alluded to earlier, hierarchical modeling shrinks
unstable MLEs toward means informed through either
informative or noninformative prior distributions. I con-
sideredtwometricsthatmeasuretheaccuracyofa
method s estimation of the true effect size: the mean
coverage rate (MCR) and the root mean-square error
(RMSE). I defined the MCR as the proportion across all
causal SNPs and simulation replicates where the true
value of b falls within the 95% confi dence interval of the
estimator. Thus a perfect estimator would have a value
of 1. Hierarchical modeling achieved an MCR of 0.91
under the Q1 disease model, in contrast to an MCR of
0.56 when applying maximum likelihood. The second
metric I considered, RMSE, is cal culated by takin g the
square root of the average squared difference (also taken
across all markers and replicates) between the estimated
and true v alues of b. A smaller value of the RMSE indi-
cates a more precise estimation of the true effect size.
Under the Q1 disease model, the RMSE for the hier-
archical model was 0.17, whereas for the maximum-like-
lihood model it was 0.38. When Q2 was the disease
model, the RMSE and MSE were similar (within ±0.01),
approximately 0.17 and 0.94, respectively, regardless of
which method was used. Table 2 presents a list of causal
variables under the Q1 disease model, indicating that
several SNPs at the FLT1 gene were poorly estimated
using maximum likelihood.
I next evaluated the ability of the M CMC sampler to
perform variable s election by comparing sensitivity and
specificity across the four variants of the prio r distribu-
tion. The receiver operating characteristic (ROC) curves
in Figures 1 and 2 illustrate power across various false
discovery rates for t raits Q1 and Q2, respectively. As
Table 1 Posterior estimates of hyperparameters
Parameter Trait Q1 Trait Q1
τ
2
0.006 0.006
s
2
0.01 0.01
π (SE) 0.03 (0.06) 0.02 (0.06)
Table 2 Accuracy of estimates of b for trait Q1 between
maximum-likelihood estimate (MLE) and hierarchical
modeling (HM) estimates
Variable Mean square error Mean coverage rate
a
MLE HM MLE HM
C1S6521 0.196 0.046 0.68 0.94
C13S398 0.069 0.036 0.90 0.93
C13S515 0.300 0.029 0.07 0.92
C13S522 0.126 0.037 0.06 0.71
HFE, nonsynonymous 0.218 0.033 0.63 0.98
KCTD14, nonsynonymous 0.007 0.009 0.91 0.84
C4S1878 0.102 0.024 0.7 0.95
a
Proportion of replicates where true b falls within the 95% confidence
interval.
Chen BMC Proceedings 2011, 5(Suppl 9):S16
http://www.biomedcentral.com/1753-6561/5/S9/S16
Page 3 of 6
one might expect, introducing informative prior distri-
butions into the model improves power to det ect causal
variants. Gene membership information as encoded in
matrix A proved to be the most critical component for
power overal l. When Q2 was used as the outcome phe-
notype, the method showed greater sensitivity than the
MLE method across all false discovery range (FDR)
values, regardless of the prior distribution specification.
Q1 performed slightly worse than the MLE at low FDRs
when gene membership information was omitted from
the prior specification. Table 3 summarizes the relative
differences in power at FDR = 0.05 between the MLE
and my approach.
I noted a wide range of evidence across the variables
considered. Tables 4 and 5 present a compari son of BFs
across the various prior specifications in the variable
selection algorithm for each causal variable that was
included in the analysis. Although guidelines for BF
interpretation [8] deem several variables to be barely
worth mentioning (BF range, 1 to 3), others could be
considered deci sive (BF > 100). Under Q1, evidence of
association was strongest for the C13S431, C13S522,
and C 13S523 SNPs in the FLT1 gene, which had more
common MAFs of 0.02, 0.03, and 0.07, respectively.
These same SNPs also h ad fairly large simulated odds
ratios (2.1, 1.9, and 1.9, respectively), which most likely
explain t he improved overall performance of all the
methods under the Q1 model, as shown in Figures 1
and 2, in contrast to the Q2 model, whose disease
model was more challenging. The o nly SNP under the
Q1 model that was more common than these three
SNPs was C4S1878 (MAF = 0.16). A relatively moderate
BF of 107 at this SNP reflects its modest simulated odds
ratio of 1.1. The A matrix info rmation, which help s dis-
tribute evidence of association across a gene, was advan-
tageous for SNPs within FLT1.Incontrast,forSNPsin
other genes, the Z matrix, which enables variables of the
same mutation type to share a common mean, improved
the methods power to detect causal variants, as seen in
the higher BFs in column 2 versus column 3 in Tables 4
and 5. This observation was not too surprising, consid-
ering the fact that the simulation model considered only
nonsynonymous mutations to be causal.
Discussion
In response to the missing heritability mystery plaguing
the field of complex trait genetics, there is understand-
ably massive interest in developing methods that can
effectively investigate the relationship between rare var-
iants and disease. In the methods described by Madsen
Figure 1 Receiver operating characteristic curve under
polygenic disease model for trait Q1. The proportion of causal
variants is plotted as a function of the proportion of noncausal
variants, taken across 200 replicates.
Figure 2 Receiver operating characteristic curve under
polygenic disease model for traitQ2. The proportion of causal
variants is plotted as a function of the proportion of noncausal
variants, taken across 200 replicates.
Table 3 Relative power (in relation to the maximum-
likelihood estimate) of hierarchical modeling method at
FDR = 0.05
Variation Trait Q1 Trait Q2
UNINF 0.94 1.14
Z only 0.98 1.17
A only 1.04 1.17
FULL 1.05 1.19
FDR, false discovery range.
Chen BMC Proceedings 2011, 5(Suppl 9):S16
http://www.biomedcentral.com/1753-6561/5/S9/S16
Page 4 of 6
and Browning [3] and a more recent refinement
described by Price et al. [4], common S NPs are down-
weighted on the assumption that their effect sizes are
expected to be smaller than their rarer neighbors.
Details on these approaches are found in Dering et al.
[9]. A one degree of freedom test is carried out at t he
gene leve l or other biological unit rather than at the
SNP level. These methods are appealing because power
can be increased as a result of fewer multiple hypotheses
to adjust for. I to ok a s omewhat different approach that
was closer in spirit to the CMC method [2]. L ike the
CMC appr oach, my method operates within a multivari-
ate framework so that multiple bins within a gene can
be considered; t his allows one to test multiple hypoth-
eses and to refine the signal, albeit at a statistical cost
resulting from multiple comparisons. In contrast to the
Madsen and B rowning [3] and Price et al. [4] methods,
I do not down-weight SNPs of higher frequency. In fact,
I believe that if there is a shared haplotype effect among
case subj ects, then these common SNPs can aid in dis-
covery of rarer neighbors through an appropriate prior
specification (e.g., the A matrix in the hierarchical
model). With any type of collapsing strategy, including
mine, the choice of how bins are defined is arbitrary
and some type of permutation procedure is necessary to
alleviate an increase in type I error from overfitting the
data. My Bayesian method, while also computationally
expensive, does not involve permutation. Through Baye s
model a veraging and reporting of BFs, the problem of
model o verfitting is handled naturally. I pr eviously
demonstrated through simulations that the model is
robust in light of multiple comparisons within the con-
text of discovering interactions [6].
The results from the analyses show that in certain
cases, such as when Q1 is modeled as the ou tcome, rare
variants can make accurate estimates of effect si ze diffi-
cult when operating under a conventional MLE frame-
work. Hierarchical modeling can be particularly helpful
here, even if the prior distributions are not particularly
informative. However, I must provide an important
caveat that the method, which still operates under a
standard multivariate regression framework at the first
level of the model, does not appear to work particular ly
well when rare v ariants (i.e., omitting a collapsing strat-
egy), such as singleton mutations, are directly tested;
convergence problems usually emerge when the design
matrix becomes numerically singular. Thus I was unable
to directly evaluate the method s performance on any
one specific SNP among the extremely rare causal var-
iants. The LASSO (least absolute shrinkage and selec-
tion ope rator) method [10], another flavor of penalized
regression that provides variable selection, has recently
been extended to allow one to directly test any rare var-
iant by defining bins (e.g., genes) that relax the global
penalization parameter [11]. Although my approach is
more limited in this sense, my model allows the inv esti-
gator to include an arbitrary numbe r of prior knowledge
sources through columns in a Z matrix, as demon-
strated in the se nsitivity analyses presented in Figures 1
and 2. I found that defining a richer prior distribution
on b basedonbiologycouldindeedimprovepowerto
detect variants. On closer inspection, I learned that
mutation type (synonymous vs. nonsynonymous) infor-
mation was more beneficial than gene-membership
information for most of the SNP bins, but the opposite
was tr ue for the FLT1 gene. Thus I recommend provid-
ing as much external knowledge as possible in the
model (e.g., adding additional columns i n Z). Because
my method is bas ed on empir ical Bayes est imates, it is
robust to poor specification of the prior distribution,
because this only leads to increased uncertainty
Table 4 Bayes factors for each causal variable under the
Q1 trait model
Causal variable
a
UNINF Z only A only FULL
ARNT, nonsynonymous 1.14 1.95 1.85 3.99
C4S1884 36.33 43.31 36.12 47.65
HIF1A, nonsynonymous 0.58 1.14 0.57 1.22
C13S522 527.13 600.4 764.92 773.35
C1S6533 109.38 149.45 119.39 162.42
C4S1878 57.23 92.15 69.16 107.15
C14S1734 1.47 2.42 1.47 2.58
C13S431 299.20 327.49 572.87 551.06
FLT1, nonsynonymous 17.25 27.80 81 103.17
C13S523 998.33 999.07 999.87 999.7
a
Defined as either a bin of SNPs (shown with convention gene name and
mutation class) or a single SNP. Only variables with MAF 0.01 were included
for analyses.
Table 5 Bayes factors for each causal variable under the
Q2 trait model
Causal variable
a
UNINF Z only A only FULL
C6S5441 53.96 77.22 65.32 88.21
SIRT1, nonsynonymous 22.36 34.60 23.92 34.82
C2S354 11.17 15.29 11.04 16.46
C8S442 62.14 88.72 64.60 87.26
C6S5449 64.43 85.17 73.89 94.77
SREBF1, nonsynonymous 60.71 83.47 60.57 85.76
PDGFD, nonsynonymous 49.64 71.03 49.12 70.54
C6S5426 0.87 1.48 1.25 2.10
C6S5380 212.1 264.3 210.2 263.0
PLAT, nonsynonymous 9.28 14.54 8.89 14.44
VLDLR, nonsynonymous 17.51 26.85 17.15 26.93
BCHE, nonsynonymous 38.11 55.60 37.50 55.62
LPL, nonsynonymous 1.34 2.43 1.62 2.63
a
Defined as either a bin of SNPs (shown with convention gene name and
mutation class) or a single SNP. Only variables with MAF 0.01 were included
for analyses.
Chen BMC Proceedings 2011, 5(Suppl 9):S16
http://www.biomedcentral.com/1753-6561/5/S9/S16
Page 5 of 6
(modeled in the prior variances τ
2
and s
2
), asymptoti-
cally reducing the prior distribution on b to a ridge
prior distribution.
Clearly, there is a need to develop methods to effec-
tively mine the data fo r rare v ariants t hat confer disease
risk. I am optimistic that my approach is more effective
than other methods in many cases, but it does have the
same lim itations as those shared by collapsing-style
methods, particularly the strong a ssumption that effect
sizes will point in the same direction among SNPs inside
a bin. I am considering other variations of the hierarchi-
cal model that might more flexibly a ccommodate this
type of heterogeneity. One appealing idea is to include a
new stochastic layer into the algorithm that randomly
groups SNPs into bins (and consequently compresses
the A and Z matrices accordingly). My method cur-
rently permits one to perform a global test of associa-
tion (i.e., ar e any rare va riants associated?) by testing
fixed bins. An important property o f enabling flexibility
in bin assignment is that one can additionally perform
local tests of association (i.e., how often does this SNP
appear in any bin?).
Conclusions
I have presented a computationally efficient Bayesian
method that simultaneously provides additional power
to discover rare disease variants and enhances estima-
tion of true effect sizes. Users interested in the algo-
rithm can download C++ source code and binaries from
my website (http://www-hsc.usc.edu/~garykche/).
Acknowledgments
I would like to thank the organizers of Genetic Analysis Workshop 17, the
anonymous manuscript reviewers, and the copy editor Mimi Braverman for
improving the paper. The Workshop is supported by National Institutes of
Health grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5
Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the
supplement are available online at http://www.biomedcentral.com/1753-
6561/5?issue=S9.
Authors contributions
GKC conceived of the study, carried out the statistical analyses, and drafted
the manuscript.
Competing interests
I have no competing interests to declare.
Published: 29 November 2011
References
1. Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB: Rare variants
create synthetic genome-wide associations. PLoS Biol 1000, 8:e294.
2. Li B, Leal SM: Methods for detecting associations with rare variants for
common diseases: application to analysis of sequence data. Am J Hum
Genet 2008, 83:311-321.
3. Madsen BE, Browning SR: A groupwise association test for rare mutations
using a weighted sum statistic. PLoS Genet 2009, 5 :e1000384.
4. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR:
Pooled association tests for rare variants in exon-resequencing studies.
Am J Hum Genet 2010, 86:832-838.
5. Almasy LA, Dyer TD, Peralta JM, Kent JW Jr, Charlesworth JC, Curran JE,
Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC
Proc 2011, 5(suppl 9):S2.
6. Chen GK, Thomas DC: Using biological knowledge to discover higher
order interactions in genetic association studies. Genet Epidemiol 2010,
34:863-878.
7. Besag J, York J, Mollie A: Bayesian image restoration, with two
applications in spatial statistics. Ann Inst Stat Math 1991, 43:1-20.
8. Kass RE, Raftery AE: Bayes factors. J Am Stat Assoc 1995, 90:773-795.
9. Dering C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants:
an overview of collapsing methods. Genet Epidemiol 2011, X(suppl X):X-X.
10. Tibshirani R: Regression shrinkage and selection via the Lasso. J R Stat
Soc Ser B Stat Methodol 1996, 58:267-288.
11. Zhou H, Sehl ME, Sinsheimer JS, Sobel EM, Lange K: Association screening
of common and rare genetic variants by penalized regression.
Bioinformatics 2010, 26(19):2375-82.
doi:10.1186/1753-6561-5-S9-S16
Cite this article as: Chen: Enhancing the discovery of rare disease
variants through hierarchical modeling. BMC Proceedings 2011 5(Suppl 9):
S16.
Submit your next manuscript to BioMed Central
and take full advantage of:
Convenient online submission
Thorough peer review
No space constraints or color figure charges
Immediate publication on acceptance
Inclusion in PubMed, CAS, Scopus and Google Scholar
Research which is freely available for redistribution
Submit your manuscript at
www.biomedcentral.com/submit
Chen BMC Proceedings 2011, 5(Suppl 9):S16
http://www.biomedcentral.com/1753-6561/5/S9/S16
Page 6 of 6
    • "etc. Although several methods that use prior information have been proposed17181920 , further research is needed to utilize prior knowledge more effi- ciently [21] and to expand statistical tools available for researchers. In this article we propose a method that incorporates prior information into the region-based score test. "
    [Show abstract] [Hide abstract] ABSTRACT: The interest of the scientific community in investigating the impact of rare variants on complex traits has stimulated the development of novel statistical methodologies for association studies. The fact that many of the recently proposed methods for association studies suffer from low power to identify a genetic association motivates the incorporation of prior knowledge into statistical tests. In this article we propose a methodology to incorporate prior information into the region-based score test. Within our framework prior information is used to partition variants within a region into several groups, following which asymptotically independent group statistics are constructed and then combined into a global test statistic. Under the null hypothesis the distribution of our test statistic has lower degrees of freedom compared with those of the region-based score statistic. Theoretical power comparison, population genetics simulations and results from analysis of the GAW17 sequencing data set suggest that under some scenarios our method may perform as well as or outperform the score test and other competing methods. An approach which uses prior information to improve the power of the region-based score test is proposed. Theoretical power comparison, population genetics simulations and the results of GAW17 data analysis showed that for some scenarios power of our method is on the level with or higher than those of the score test and other methods.
    Full-text · Article · Jan 2014
  • [Show abstract] [Hide abstract] ABSTRACT: We summarize the methodological contributions from Group 3 of Genetic Analysis Workshop 17 (GAW17). The overarching goal of these methods was the evaluation and enhancement of state-of-the-art approaches in integration of biological knowledge into association studies of rare variants. We found that methods loosely fell into three major categories: (1) hypothesis testing of index scores based on aggregating rare variants at the gene level, (2) variable selection techniques that incorporate biological prior information, and (3) novel approaches that integrate external (i.e., not provided by GAW17) prior information, such as pathway and single-nucleotide polymorphism (SNP) annotations. Commonalities among the findings from these contributions are that gene-based analysis of rare variants is advantageous to single-SNP analysis and that the minor allele frequency threshold to identify rare variants may influence power and thus needs to be carefully considered. A consistent increase in power was also identified by considering only nonsynonymous SNPs in the analyses. Overall, we found that no single method had an appreciable advantage over the other methods. However, methods that carried out sensitivity analyses by comparing biologically informative to noninformative prior probabilities demonstrated that integrating biological knowledge into statistical analyses always, at the least, enabled subtle improvements in the performance of any statistical method applied to these simulated data. Although these statistical improvements reflect the simulation model assumed for GAW17, our hope is that the simulation models provide a reasonable representation of the underlying biology and that these methods can thus be of utility in real data.
    Full-text · Article · Jan 2011