# Enhancing the discovery of rare disease variants through hierarchical modeling

**Abstract**

Advances in next-generation sequencing technology are enabling researchers to capture a comprehensive picture of genomic variation across large numbers of individuals with unprecedented levels of efficiency. The main analytic challenge in disease mapping is how to mine the data for rare causal variants among a sea of neutral variation. To achieve this goal, investigators have proposed a number of methods that exploit biological knowledge. In this paper, I propose applying a Bayesian stochastic search variable selection algorithm in this context. My multivariate method is inspired by the combined multivariate and collapsing method. In this proposed method, however, I allow an arbitrary number of different sources of biological knowledge to inform the model as prior distributions in a two-level hierarchical model. This allows rare variants with similar prior distributions to share evidence of association. Using the 1000 Genomes Project single-nucleotide polymorphism data provided by Genetic Analysis Workshop 17, I show that through biologically informative prior distributions, some power can be gained over noninformative prior distributions.

PROCEEDINGS Open Access


Gary K Chen

From Genetic Analysis Workshop 17

Boston, MA, USA. 13-16 October 2010


Background

Genome-wide association studies (GWAS) have been a powerful method for revealing common variants that confer a modest increase in disease risk in carriers. In general, the single-nucleotide polymorphisms (SNPs) that show the strongest evidence for association in GWAS do not perfectly tag the putative causal variant(s) nearby because of ancestral recombination events; therefore resequencing in these regions is necessary to resolve the precise location of the causal variant(s). Dickson et al. [1] postulated one possible explanation for why many fine-mapping efforts have failed to map a single causal SNP in the region tagged by the original genome-wide association signal: multiple rare variants (MRVs) residing on multiple haplotypes at the region of the genome-wide association signal are generating a “synthetic” association when these haplotypes share a common allele that is observed more in case subjects than in control subjects. In support of the MRV hypothesis, several investigators have recently developed a number of popular burden-type methods [2-4]. These methods are predicated on the notion that presence of or an increase in the number of mutations for a person at a particular pathway, region, gene, or any other biological unit can serve as a reasonable proxy for his/her risk of developing disease. The common theme among these methods is that the genotypes for MRVs that map to these biological units, called bins, are collapsed into a single vector of scores, a technique that can potentially improve statistical power to detect disease association. For example, in the combined multivariate and collapsing (CMC) method of Li and Leal [2], a score for an individual is assigned 1 if at least one mutation is observed across all SNPs within a bin, or 0 otherwise. The significance of a gene, for example, can then be tested by jointly modeling all bins that map within the gene using a multivariate method such as Hotelling’s multivariate T-test, logistic regression, or linear regression.
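The CMC-style collapsing rule described above can be illustrated with a minimal sketch. The function name `collapse_bin` and the toy genotype matrix are hypothetical and serve only to show the 0/1 scoring rule; they are not taken from the CMC software or from the GAW17 data.

```python
from typing import List, Sequence


def collapse_bin(genotypes: Sequence[Sequence[int]]) -> List[int]:
    """Collapse per-SNP minor-allele counts into one 0/1 score per individual.

    genotypes[i][j] is the minor-allele count (0, 1, or 2) of individual i
    at SNP j of the bin. The score is 1 if at least one SNP in the bin
    carries a minor allele, and 0 otherwise.
    """
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]


# Toy bin (made-up data): 4 individuals x 3 rare SNPs.
toy_bin = [
    [0, 0, 0],  # no mutation anywhere in the bin
    [0, 1, 0],  # heterozygous at the second SNP
    [2, 0, 0],  # homozygous at the first SNP
    [0, 0, 1],  # heterozygous at the third SNP
]
scores = collapse_bin(toy_bin)
print(scores)  # [0, 1, 1, 1]
```

Each bin then contributes a single column of scores to the multivariate model, regardless of how many rare SNPs it contains.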

In this paper, I describe how I adapted the concept of the CMC method into a Bayesian variable selection algorithm with the notion that common SNPs may also contribute valuable information to nearby causal rare variants, assuming that the shared haplotype model [1] is true. The exon resequencing data set provided by the organizers of Genetic Analysis Workshop 17 (GAW17) provides an ideal opportunity for evaluating the performance of this new approach.

Correspondence: gary.k.chen@usc.edu

Division of Biostatistics, Department of Preventive Medicine, University of Southern California, 2001 North Soto Street, SSB 202Q, MC 9234, Los Angeles, CA 90089-9234, USA

Chen BMC Proceedings 2011, 5(Suppl 9):S16

http://www.biomedcentral.com/1753-6561/5/S9/S16

© 2011 Chen; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Methods

Details of the simulated GAW17 data set can be found in this same issue [5]. I defined variants that had a minor allele frequency (MAF) less than 0.01 to be rare but potentially the most biologically interesting, because extremely rare mutations are expected to have the greatest deleterious effects on phenotype. Of all the SNPs provided in the data set, 73% (18,131) fall within this MAF range. For each gene, I applied the collapsing procedure, as described in the CMC method [2], by grouping rare SNPs into one of two bins defined by their predicted impact on protein (i.e., synonymous or nonsynonymous variant). Any bin with a MAF less than 0.01 after the collapsing procedure was not included for further analysis. Common SNPs, defined as those having a MAF ≥ 0.01, were not collapsed with any other SNPs in the gene. For conciseness, I use the term variable to define either a single common SNP or a SNP bin. The final marker panel included 7,385 variables: 1,029 bins containing collapsed rare variants and 6,356 bins containing common SNPs. I experimented with higher threshold values for bin definition (e.g., MAF = 0.05), but this strategy did not recover an appreciable number of bins from the filtering step because most genes in the data set were small and harbored private mutations. True log relative risks (denoted b) for each SNP are provided in the simulation answer key, which quantifies each SNP’s effect on the quantitative traits Q1 and Q2. Thus, to assess how accurately my method can recover the true values of b at each SNP, I constrained the analyses only to models where the outcome phenotype was either trait Q1 or Q2.
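The variable-construction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the record keys (`id`, `gene`, `cls`, `counts`) and the function names are hypothetical, and the frequency of a collapsed bin is assumed here to be the fraction of individuals with a score of 1, which the text does not specify.

```python
from collections import defaultdict

MAF_THRESHOLD = 0.01  # rare/common cutoff stated in the text


def maf(counts):
    """Minor allele frequency from per-individual minor-allele counts (0, 1, or 2)."""
    return sum(counts) / (2 * len(counts))


def build_variables(snps):
    """Turn SNP records into analysis variables.

    snps: list of dicts with hypothetical keys 'id', 'gene', 'cls', 'counts'.
    Returns a dict mapping variable name -> per-individual score vector.
    """
    variables = {}
    rare_bins = defaultdict(list)
    for snp in snps:
        if maf(snp["counts"]) >= MAF_THRESHOLD:
            variables[snp["id"]] = list(snp["counts"])  # common SNP stays on its own
        else:
            # rare SNPs are grouped by gene and mutation class
            rare_bins[(snp["gene"], snp["cls"])].append(snp["counts"])
    for (gene, cls), members in rare_bins.items():
        # 1 if the individual carries a minor allele at any rare SNP in the bin
        score = [1 if any(c > 0 for c in person) else 0 for person in zip(*members)]
        # Assumption: bin frequency = fraction of individuals scored 1
        if sum(score) / len(score) >= MAF_THRESHOLD:
            variables[f"{gene}, {cls}"] = score
    return variables
```

Under this reading, a bin made up only of a few private mutations falls below the threshold and is dropped, which matches the observation that many small genes did not yield usable bins.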

The statistical model I used was a two-level hierarchical model, described in detail by Chen and Thomas [6]. One property of a hierarchical model that is appealing when analyzing variants of low frequency, where maximum-likelihood estimates (MLEs) of association $\hat{b}$ can be highly unstable, is the ability to smooth these point estimates (and their variances) toward prior distributions defined in a second level. At the first level, I apply ordinary least-squares regression, which produces MLEs of association between a continuous trait (i.e., either Q1 or Q2) and a random set of m model variables. A design matrix X stores the variable values, and the vector Y stores values of Q1 or Q2 across all individuals:

$$Y = b_0 + X\,b_{1,\ldots,m}. \tag{1}$$
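As a rough illustration of the first-level fit in Eq. (1), the sketch below solves the least-squares normal equations with Gauss-Jordan elimination using only the Python standard library. The function name `ols` and the singularity tolerance of 1e-12 are assumptions made for this example; the analysis itself used a C++ implementation whose internals are not described here.

```python
def ols(X, y):
    """Least-squares fit of y = b0 + X b; returns [b0, b1, ..., bm].

    X is a list of rows (one per individual), y a list of trait values.
    """
    n, m = len(X), len(X[0])
    D = [[1.0] + list(row) for row in X]  # prepend the intercept column
    p = m + 1
    # Normal equations (D^T D) b = D^T y, stored as an augmented matrix
    A = [
        [sum(D[i][r] * D[i][c] for i in range(n)) for c in range(p)]
        + [sum(D[i][r] * y[i] for i in range(n))]
        for r in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:  # tolerance is an arbitrary choice
            raise ValueError("design matrix is numerically singular")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * c for a, c in zip(A[r], A[col])]
    return [A[r][p] / A[r][r] for r in range(p)]


# Toy data generated exactly from y = 1 + 2*x1 - 3*x2
X_toy = [[0, 0], [1, 0], [0, 1], [1, 1]]
y_toy = [1.0, 3.0, -2.0, 0.0]
print(ols(X_toy, y_toy))
```

The explicit singularity check mirrors a practical limitation of this level of the model: a variable that is constant or nearly constant across individuals makes the normal equations unsolvable.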

I define a prior distribution on b in Eq. (1) using the annotation information provided by GAW17. For variable k, $b_k$ is distributed as a mixed-effects model, originally defined by Besag et al. [7] as:

$$b_k = Z_k^{T}\pi + \theta_k + j_k, \tag{2}$$

where the latent fixed effect is π and the random effects components are:

$$\theta_k \sim N(0,\, s^2), \tag{3a}$$

$$j_k \sim N\!\left(\bar{j}_{-k},\, \frac{\tau^2}{v_k}\right). \tag{3b}$$

The Z matrix stores external knowledge about each of the m variables currently in the model. To encode my belief that deleterious mutations would have higher or lower values of b relative to other types of mutations, I assigned a value of 1 to the nonsynonymous mutation in the second column (after reserving the first column as the intercept) of the m × 2 design matrix Z and a value of 0 for any other SNP category. The term π, estimated using ordinary linear regression, relates the magnitude of b in Eq. (1) to values in Z. Furthermore, to encode my belief that mutations within the same gene should have similar effects on disease, I specified an indicator encoding whether predictor k and any other model variables are in the same gene by means of an m × m adjacency matrix A. Specifically, the parameter $\bar{j}_{-k}$ stores the mean of the MLE $\hat{b}$ from the first level, taken across neighbors of variable k (i.e., all other variables that are in the same gene) defined by means of A. The variance term $\tau^2$ is inversely scaled by $v_k$, the number of neighbors of k, to weight the uncertainty about $\tau^2$. Finally, $\theta_k$ soaks up any remaining variation in the second level of the model through the variance term $s^2$.
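The neighbor quantities used in the second level can be computed directly from gene labels, as in the sketch below. The function name `neighbor_summary` is hypothetical, and returning a mean of 0.0 for a variable without neighbors is an assumption of this example, not a rule stated in the text.

```python
def neighbor_summary(genes, b_hat):
    """For each variable k, return (v_k, mean of b_hat over its same-gene neighbors).

    genes[k] is the gene label of variable k; b_hat[k] is its first-level MLE.
    """
    m = len(genes)
    # Adjacency matrix A: A[k][l] = 1 if k and l are different variables in the same gene
    A = [[1 if k != l and genes[k] == genes[l] else 0 for l in range(m)]
         for k in range(m)]
    summary = []
    for k in range(m):
        v_k = sum(A[k])  # number of neighbors of variable k
        if v_k:
            mean_k = sum(A[k][l] * b_hat[l] for l in range(m)) / v_k
        else:
            mean_k = 0.0  # assumption: no neighbors, so no gene-level information
        summary.append((v_k, mean_k))
    return summary


# Toy example: three variables in FLT1 and one in HFE (estimates are made up)
toy_genes = ["FLT1", "FLT1", "FLT1", "HFE"]
toy_b_hat = [0.2, 0.4, 0.6, 1.0]
print(neighbor_summary(toy_genes, toy_b_hat))
```

A variable with many neighbors has a small prior variance $\tau^2 / v_k$, so its estimate is pulled more strongly toward the gene-level mean than that of a variable with a single neighbor.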

A posterior density is defined on the basis of the likeli-

hood and normal density function corresponding to the

first (Eq. (1)) and second (Eq. (2)) levels of the hierarchi-

cal model. I use the product of this density function and

a model transition function as the objective function of a

reversible jump Markov chain Monte Carlo (MCMC)

algorithm to stochastically explore the search space,

fitting all possible sets of model variables to the data. The

model transition kernel itself is informed through empiri-

cal Bayes estimates of the hyperparameters (e.g., π), so

that regions of the search space that have strong empiri-

cal support and prior evidence are prioritized. Further

details on how the variable selection algorithm works can

be found in Chen and Thomas [6].

In the next section I compare results from a more

conventional method and my proposed method. The


first method is an ordinary least-squares regression between the quantitative trait (i.e., Q1 or Q2) and each vector of variable scores, which I denote as the MLE method. This approach is equivalent to a conventional genome-wide association scan, testing for marginal effects. I compared this to four variations of the multivariate MCMC method. Specifically, I varied the degree of informativeness in the prior distribution by modifying the definition of the matrices A and Z. The most informative prior distribution (denoted FULL) stores both gene membership and SNP mutation type information in the A and Z matrices, respectively. In the second variant of the prior distribution (denoted Z only), I removed gene membership information so that matrix A was simply the identity matrix. Conversely, in the third variant of the prior distribution (denoted A only), I removed mutation class information so that the Z matrix included only the intercept. The last variant of the prior distribution (denoted UNINF) includes both the uninformative Z and A matrices and is equivalent to a ridge-style prior distribution (i.e., $b \sim N(0, s^2)$).
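The four prior specifications differ only in how Z and A are filled in, which the following sketch makes explicit. The function name `build_prior_matrices` and the label strings for mutation class are hypothetical. For the informative A matrix, neighbors are assumed to be the other variables in the same gene (zero diagonal), whereas the uninformative A is the identity matrix as stated in the text.

```python
PRIOR_VARIANTS = ("FULL", "Z only", "A only", "UNINF")


def build_prior_matrices(variant, genes, classes):
    """Return (Z, A) as nested lists for one of the four prior variants."""
    if variant not in PRIOR_VARIANTS:
        raise ValueError(f"unknown prior variant: {variant}")
    m = len(genes)
    use_z = variant in ("FULL", "Z only")  # mutation class goes into Z
    use_a = variant in ("FULL", "A only")  # gene membership goes into A
    Z = []
    for k in range(m):
        row = [1]  # first column is the intercept
        if use_z:
            row.append(1 if classes[k] == "nonsynonymous" else 0)
        Z.append(row)
    if use_a:
        # Assumption: neighbors are the other variables mapped to the same gene
        A = [[1 if k != l and genes[k] == genes[l] else 0 for l in range(m)]
             for k in range(m)]
    else:
        # Uninformative case: identity matrix
        A = [[1 if k == l else 0 for l in range(m)] for k in range(m)]
    return Z, A


# Toy annotation for three variables (made-up labels)
genes_toy = ["FLT1", "FLT1", "HFE"]
classes_toy = ["nonsynonymous", "synonymous", "nonsynonymous"]
for name in PRIOR_VARIANTS:
    print(name, build_prior_matrices(name, genes_toy, classes_toy))
```

Further sources of prior knowledge would enter this sketch as extra columns appended to each row of Z.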

For each of the MCMC analyses, I sampled 2 million realizations from the posterior distribution, retaining statistics on only the last million realizations to minimize any correlation to the initial parameter values. Run time on a 2-GHz Xeon processor was approximately 8 h. I assessed convergence by comparing the distributions of the retained statistics across multiple chains with the Kolmogorov-Smirnov test, taking a nonsignificant p-value as evidence of convergence.
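A cross-chain comparison of this kind can be sketched with a two-sample Kolmogorov-Smirnov statistic. The example below is only an approximation of the check described above: instead of computing a p-value, it compares the statistic with the standard large-sample critical value at the 0.05 level (constant 1.358), and the function names are hypothetical. It also ignores the autocorrelation that is typical of MCMC output.

```python
import bisect
import math


def ks_statistic(x, y):
    """Largest gap between the empirical CDFs of samples x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / n  # empirical CDF of x at v
        fy = bisect.bisect_right(ys, v) / m  # empirical CDF of y at v
        d = max(d, abs(fx - fy))
    return d


def chains_agree(x, y, c_alpha=1.358):
    """True if the KS statistic is below the asymptotic 5% critical value."""
    n, m = len(x), len(y)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(x, y) < critical


# Toy "chains": identical draws agree, a shifted copy does not
chain_a = list(range(100))
chain_b = list(range(100))
chain_c = [v + 50 for v in chain_a]
print(chains_agree(chain_a, chain_b), chains_agree(chain_a, chain_c))
```

In practice the retained draws would usually be thinned before such a comparison so that the independence assumption behind the critical value is less badly violated.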

To quantify evidence for any specific variable (either common SNP or SNP bin), I empirically estimated Bayes factors (BFs) for each variable by dividing the posterior odds by the prior odds, as described by Chen and Thomas [6]. BFs quantify the increase in evidence for a hypothesis (in this case, inclusion of a variable into the model) in light of observed data relative to a prior hypothesis [8].
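The ratio of posterior odds to prior odds can be written out as a short sketch. It assumes that the posterior inclusion probability of a variable is estimated by the fraction of retained MCMC samples whose model contains that variable; the function names and the toy samples are hypothetical, and the exact estimator is the one given by Chen and Thomas [6].

```python
def inclusion_frequency(samples, variable):
    """Fraction of retained MCMC samples whose model contains `variable`."""
    return sum(variable in model for model in samples) / len(samples)


def bayes_factor(posterior_prob, prior_prob):
    """Posterior odds of inclusion divided by prior odds of inclusion."""
    for p in (posterior_prob, prior_prob):
        if not 0.0 < p < 1.0:
            raise ValueError("probabilities must lie strictly between 0 and 1")
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds


# Toy example: each sampled model is the set of variables it includes
toy_samples = [{"a", "b"}, {"a"}, {"b"}, {"a"}]
post = inclusion_frequency(toy_samples, "a")  # 0.75
print(bayes_factor(post, prior_prob=0.25))
```

A variable that is present in every retained sample has infinite posterior odds under this estimator, which is why the sketch rejects probabilities of exactly 0 or 1 rather than returning a value.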

Results

Table 1 lists the posterior estimates of the various hyperparameters of the hierarchical model under the FULL prior distribution specification. For either of the two quantitative traits, the residual variance ($\tau^2$) in the random effects component was smaller than the residual variance from the fixed effects component ($s^2$), indicating a good fit between the gene-membership prior distribution and the observed data. The posterior estimates for the prior mean (π) indicate a slightly positive correlation (0.03) between disease risk and presence of a nonsynonymous mutation, although the evidence is weak considering the large standard errors (0.06).

As alluded to earlier, hierarchical modeling shrinks unstable MLEs toward means informed through either informative or noninformative prior distributions. I considered two metrics that measure the accuracy of a method’s estimation of the true effect size: the mean coverage rate (MCR) and the root mean-square error (RMSE). I defined the MCR as the proportion across all causal SNPs and simulation replicates where the true value of b falls within the 95% confidence interval of the estimator. Thus a perfect estimator would have a value of 1. Hierarchical modeling achieved an MCR of 0.91 under the Q1 disease model, in contrast to an MCR of 0.56 when applying maximum likelihood. The second metric I considered, RMSE, is calculated by taking the square root of the average squared difference (also taken across all markers and replicates) between the estimated and true values of b. A smaller value of the RMSE indicates a more precise estimation of the true effect size. Under the Q1 disease model, the RMSE for the hierarchical model was 0.17, whereas for the maximum-likelihood model it was 0.38. When Q2 was the disease model, the RMSE and MCR were similar (within ±0.01), approximately 0.17 and 0.94, respectively, regardless of which method was used. Table 2 presents a list of causal variables under the Q1 disease model, indicating that several SNPs at the FLT1 gene were poorly estimated using maximum likelihood.
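Both accuracy metrics are simple to compute once estimates, standard errors, and true effects are pooled across causal variables and replicates. The sketch below assumes that the 95% interval is the estimate plus or minus 1.96 standard errors (a normal approximation); the function names and toy numbers are hypothetical.

```python
import math


def rmse(estimates, truths):
    """Root mean-square error between estimated and true effect sizes."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


def mean_coverage_rate(estimates, std_errors, truths, z=1.96):
    """Proportion of estimates whose interval (estimate +/- z * SE) covers the truth."""
    covered = [abs(e - t) <= z * se
               for e, se, t in zip(estimates, std_errors, truths)]
    return sum(covered) / len(covered)


# Toy values pooled over variables and replicates (made-up numbers)
est = [1.0, 2.0]
se = [0.1, 0.5]
true_b = [1.0, 4.0]
print(rmse(est, true_b), mean_coverage_rate(est, se, true_b))
```

The two metrics capture different failures: an estimator can have a small RMSE and still cover poorly if its standard errors are too narrow.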

I next evaluated the ability of the MCMC sampler to perform variable selection by comparing sensitivity and specificity across the four variants of the prior distribution. The receiver operating characteristic (ROC) curves in Figures 1 and 2 illustrate power across various false discovery rates for traits Q1 and Q2, respectively. As

Table 1 Posterior estimates of hyperparameters

| Parameter | Trait Q1 | Trait Q2 |
|---|---|---|
| $\tau^2$ | 0.006 | 0.006 |
| $s^2$ | 0.01 | 0.01 |
| π (SE) | 0.03 (0.06) | 0.02 (0.06) |

Table 2 Accuracy of estimates of b for trait Q1 between maximum-likelihood estimate (MLE) and hierarchical modeling (HM) estimates

| Variable | Mean square error, MLE | Mean square error, HM | Mean coverage rateᵃ, MLE | Mean coverage rateᵃ, HM |
|---|---|---|---|---|
| C1S6521 | 0.196 | 0.046 | 0.68 | 0.94 |
| C13S398 | 0.069 | 0.036 | 0.90 | 0.93 |
| C13S515 | 0.300 | 0.029 | 0.07 | 0.92 |
| C13S522 | 0.126 | 0.037 | 0.06 | 0.71 |
| HFE, nonsynonymous | 0.218 | 0.033 | 0.63 | 0.98 |
| KCTD14, nonsynonymous | 0.007 | 0.009 | 0.91 | 0.84 |
| C4S1878 | 0.102 | 0.024 | 0.7 | 0.95 |

ᵃ Proportion of replicates where true b falls within the 95% confidence interval.


one might expect, introducing informative prior distributions into the model improves power to detect causal variants. Gene membership information as encoded in matrix A proved to be the most critical component for power overall. When Q2 was used as the outcome phenotype, the method showed greater sensitivity than the MLE method across all false discovery rate (FDR) values, regardless of the prior distribution specification. Under Q1, the method performed slightly worse than the MLE method at low FDRs when gene membership information was omitted from the prior specification. Table 3 summarizes the relative differences in power at FDR = 0.05 between the MLE and my approach.
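One way to read a single point off such a curve is sketched below. It follows the axes described in the figure captions, where the proportion of causal variants detected is plotted against the proportion of noncausal variants selected, and it treats the latter as the error rate on the horizontal axis; that reading, the function name, and the toy scores are assumptions of this example. Ties between scores are not handled specially.

```python
def sensitivity_at(scores, is_causal, max_noncausal_fraction):
    """Fraction of causal variables recovered before the allowed fraction
    of noncausal variables is exceeded, ranking by decreasing score."""
    ranked = sorted(zip(scores, is_causal), key=lambda pair: -pair[0])
    n_causal = sum(1 for c in is_causal if c)
    n_null = len(is_causal) - n_causal
    true_pos = false_pos = 0
    best = 0.0
    for _, causal in ranked:
        if causal:
            true_pos += 1
        else:
            false_pos += 1
        if false_pos / n_null > max_noncausal_fraction:
            break  # error budget exhausted
        best = true_pos / n_causal
    return best


# Toy scores (e.g., Bayes factors or test statistics) with made-up truth labels
toy_scores = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0]
toy_truth = [True, False, True, False, False, False]
hm_power = sensitivity_at(toy_scores, toy_truth, 0.25)
mle_power = sensitivity_at(toy_scores, toy_truth, 0.0)
print(hm_power, mle_power, hm_power / mle_power)
```

A relative power such as those in Table 3 would then be the ratio of two such sensitivities, one per method, evaluated at the same error threshold.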

I noted a wide range of evidence across the variables considered. Tables 4 and 5 present a comparison of BFs across the various prior specifications in the variable selection algorithm for each causal variable that was included in the analysis. Although guidelines for BF interpretation [8] deem several variables to be “barely worth mentioning” (BF range, 1 to 3), others could be considered “decisive” (BF > 100). Under Q1, evidence of association was strongest for the C13S431, C13S522, and C13S523 SNPs in the FLT1 gene, which had more “common” MAFs of 0.02, 0.03, and 0.07, respectively. These same SNPs also had fairly large simulated odds ratios (2.1, 1.9, and 1.9, respectively), which most likely explain the improved overall performance of all the methods under the Q1 model, as shown in Figures 1 and 2, in contrast to the Q2 model, whose disease model was more challenging. The only SNP under the Q1 model that was more common than these three SNPs was C4S1878 (MAF = 0.16). A relatively moderate BF of 107 at this SNP reflects its modest simulated odds ratio of 1.1. The A matrix information, which helps distribute evidence of association across a gene, was advantageous for SNPs within FLT1. In contrast, for SNPs in other genes, the Z matrix, which enables variables of the same mutation type to share a common mean, improved the method’s power to detect causal variants, as seen in the higher BFs in column 2 (Z only) versus column 3 (A only) in Tables 4 and 5. This observation was not too surprising, considering the fact that the simulation model considered only nonsynonymous mutations to be causal.

Discussion

In response to the missing heritability mystery plaguing

the field of complex trait genetics, there is understand-

ably massive interest in developing methods that can

effectively investigate the relationship between rare var-

iants and disease. In the methods described by Madsen

Figure 1 Receiver operating characteristic curve under polygenic disease model for trait Q1. The proportion of causal variants is plotted as a function of the proportion of noncausal variants, taken across 200 replicates.

Figure 2 Receiver operating characteristic curve under polygenic disease model for trait Q2. The proportion of causal variants is plotted as a function of the proportion of noncausal variants, taken across 200 replicates.

Table 3 Relative power (in relation to the maximum-likelihood estimate) of hierarchical modeling method at FDR = 0.05

| Variation | Trait Q1 | Trait Q2 |
|---|---|---|
| UNINF | 0.94 | 1.14 |
| Z only | 0.98 | 1.17 |
| A only | 1.04 | 1.17 |
| FULL | 1.05 | 1.19 |

FDR, false discovery rate.


and Browning [3] and a more recent refinement described by Price et al. [4], common SNPs are down-weighted on the assumption that their effect sizes are expected to be smaller than those of their rarer neighbors. Details on these approaches are found in Dering et al. [9]. A one degree of freedom test is carried out at the gene level or other biological unit rather than at the SNP level. These methods are appealing because power can be increased as a result of fewer multiple hypotheses to adjust for. I took a somewhat different approach that was closer in spirit to the CMC method [2]. Like the CMC approach, my method operates within a multivariate framework so that multiple bins within a gene can be considered; this allows one to test multiple hypotheses and to refine the signal, albeit at a statistical cost resulting from multiple comparisons. In contrast to the Madsen and Browning [3] and Price et al. [4] methods, I do not down-weight SNPs of higher frequency. In fact, I believe that if there is a shared haplotype effect among case subjects, then these common SNPs can aid in discovery of rarer neighbors through an appropriate prior specification (e.g., the A matrix in the hierarchical model). With any type of collapsing strategy, including mine, the choice of how bins are defined is arbitrary, and collapsing methods typically rely on some type of permutation procedure to alleviate an increase in type I error from overfitting the data. My Bayesian method, while also computationally expensive, does not involve permutation. Through Bayesian model averaging and reporting of BFs, the problem of model overfitting is handled naturally. I previously demonstrated through simulations that the model is robust in light of multiple comparisons within the context of discovering interactions [6].

The results from the analyses show that in certain cases, such as when Q1 is modeled as the outcome, rare variants can make accurate estimates of effect size difficult when operating under a conventional MLE framework. Hierarchical modeling can be particularly helpful here, even if the prior distributions are not particularly informative. However, I must provide an important caveat that the method, which still operates under a standard multivariate regression framework at the first level of the model, does not appear to work particularly well when rare variants, such as singleton mutations, are tested directly (i.e., without a collapsing strategy); convergence problems usually emerge when the design matrix becomes numerically singular. Thus I was unable to directly evaluate the method’s performance on any one specific SNP among the extremely rare causal variants. The LASSO (least absolute shrinkage and selection operator) method [10], another flavor of penalized regression that provides variable selection, has recently been extended to allow one to directly test any rare variant by defining bins (e.g., genes) that relax the global penalization parameter [11]. Although my approach is more limited in this sense, my model allows the investigator to include an arbitrary number of prior knowledge sources through columns in a Z matrix, as demonstrated in the sensitivity analyses presented in Figures 1 and 2. I found that defining a richer prior distribution on b based on biology could indeed improve power to detect variants. On closer inspection, I learned that mutation type (synonymous vs. nonsynonymous) information was more beneficial than gene-membership information for most of the SNP bins, but the opposite was true for the FLT1 gene. Thus I recommend providing as much external knowledge as possible in the model (e.g., adding columns in Z). Because my method is based on empirical Bayes estimates, it is robust to poor specification of the prior distribution, because this only leads to increased uncertainty

Table 4 Bayes factors for each causal variable under the Q1 trait model

| Causal variableᵃ | UNINF | Z only | A only | FULL |
|---|---|---|---|---|
| ARNT, nonsynonymous | 1.14 | 1.95 | 1.85 | 3.99 |
| C4S1884 | 36.33 | 43.31 | 36.12 | 47.65 |
| HIF1A, nonsynonymous | 0.58 | 1.14 | 0.57 | 1.22 |
| C13S522 | 527.13 | 600.4 | 764.92 | 773.35 |
| C1S6533 | 109.38 | 149.45 | 119.39 | 162.42 |
| C4S1878 | 57.23 | 92.15 | 69.16 | 107.15 |
| C14S1734 | 1.47 | 2.42 | 1.47 | 2.58 |
| C13S431 | 299.20 | 327.49 | 572.87 | 551.06 |
| FLT1, nonsynonymous | 17.25 | 27.80 | 81 | 103.17 |
| C13S523 | 998.33 | 999.07 | 999.87 | 999.7 |

ᵃ Defined as either a bin of SNPs (shown with convention gene name and mutation class) or a single SNP. Only variables with MAF ≥ 0.01 were included for analyses.

Table 5 Bayes factors for each causal variable under the Q2 trait model

| Causal variableᵃ | UNINF | Z only | A only | FULL |
|---|---|---|---|---|
| C6S5441 | 53.96 | 77.22 | 65.32 | 88.21 |
| SIRT1, nonsynonymous | 22.36 | 34.60 | 23.92 | 34.82 |
| C2S354 | 11.17 | 15.29 | 11.04 | 16.46 |
| C8S442 | 62.14 | 88.72 | 64.60 | 87.26 |
| C6S5449 | 64.43 | 85.17 | 73.89 | 94.77 |
| SREBF1, nonsynonymous | 60.71 | 83.47 | 60.57 | 85.76 |
| PDGFD, nonsynonymous | 49.64 | 71.03 | 49.12 | 70.54 |
| C6S5426 | 0.87 | 1.48 | 1.25 | 2.10 |
| C6S5380 | 212.1 | 264.3 | 210.2 | 263.0 |
| PLAT, nonsynonymous | 9.28 | 14.54 | 8.89 | 14.44 |
| VLDLR, nonsynonymous | 17.51 | 26.85 | 17.15 | 26.93 |
| BCHE, nonsynonymous | 38.11 | 55.60 | 37.50 | 55.62 |
| LPL, nonsynonymous | 1.34 | 2.43 | 1.62 | 2.63 |

ᵃ Defined as either a bin of SNPs (shown with convention gene name and mutation class) or a single SNP. Only variables with MAF ≥ 0.01 were included for analyses.


(modeled in the prior variances $\tau^2$ and $s^2$), asymptotically reducing the prior distribution on b to a ridge prior distribution.

Clearly, there is a need to develop methods to effectively mine the data for rare variants that confer disease risk. I am optimistic that my approach is more effective than other methods in many cases, but it does have the same limitations as those shared by collapsing-style methods, particularly the strong assumption that effect sizes will point in the same direction among SNPs inside a bin. I am considering other variations of the hierarchical model that might more flexibly accommodate this type of heterogeneity. One appealing idea is to include a new stochastic layer into the algorithm that randomly groups SNPs into bins (and consequently compresses the A and Z matrices accordingly). My method currently permits one to perform a global test of association (i.e., are any rare variants associated?) by testing fixed bins. An important property of enabling flexibility in bin assignment is that one can additionally perform local tests of association (i.e., how often does this SNP appear in any bin?).

Conclusions

I have presented a computationally efficient Bayesian

method that simultaneously provides additional power

to discover rare disease variants and enhances estima-

tion of true effect sizes. Users interested in the algo-

rithm can download C++ source code and binaries from

my website (http://www-hsc.usc.edu/~garykche/).

Acknowledgments

I would like to thank the organizers of Genetic Analysis Workshop 17, the

anonymous manuscript reviewers, and the copy editor Mimi Braverman for

improving the paper. The Workshop is supported by National Institutes of

Health grant R01 GM031575.

This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.

Authors’ contributions

GKC conceived of the study, carried out the statistical analyses, and drafted

the manuscript.

Competing interests

I have no competing interests to declare.

Published: 29 November 2011

References

1. Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB: Rare variants create synthetic genome-wide associations. PLoS Biol 2010, 8:e1000294.

2. Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 2008, 83:311-321.

3. Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 2009, 5:e1000384.

4. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR: Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 2010, 86:832-838.

5. Almasy LA, Dyer TD, Peralta JM, Kent JW Jr, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc 2011, 5(suppl 9):S2.

6. Chen GK, Thomas DC: Using biological knowledge to discover higher order interactions in genetic association studies. Genet Epidemiol 2010, 34:863-878.

7. Besag J, York J, Mollie A: Bayesian image restoration, with two applications in spatial statistics. Ann Inst Stat Math 1991, 43:1-20.

8. Kass RE, Raftery AE: Bayes factors. J Am Stat Assoc 1995, 90:773-795.

9. Dering C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants: an overview of collapsing methods. Genet Epidemiol 2011, X(suppl X):X-X.

10. Tibshirani R: Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B Stat Methodol 1996, 58:267-288.

11. Zhou H, Sehl ME, Sinsheimer JS, Sobel EM, Lange K: Association screening of common and rare genetic variants by penalized regression. Bioinformatics 2010, 26(19):2375-82.

doi:10.1186/1753-6561-5-S9-S16

Cite this article as: Chen: Enhancing the discovery of rare disease

variants through hierarchical modeling. BMC Proceedings 2011 5(Suppl 9):

S16.
