
September 22, 20109:39WSPC - Proceedings Trim Size: 11in x 8.5in EZ-JA˙PSB2011-revision

AN EVALUATION OF POWER TO DETECT LOW-FREQUENCY VARIANT

ASSOCIATIONS USING ALLELE-MATCHING TESTS THAT ACCOUNT

FOR UNCERTAINTY

E. ZEGGINI∗ and J.L. ASIMIT

Wellcome Trust Sanger Institute, Hinxton, CB10 1HH, UK

∗E-mail: Eleftheria@sanger.ac.uk

There is growing interest in the role of rare variants in multifactorial disease etiology, and increasing

evidence that rare variants are associated with complex traits. Single SNP tests are underpowered

in rare variant association analyses, so locus-based tests must be used. Quality scores at both the

SNP and genotype level are available for sequencing data, but they are rarely accounted for. A

locus-based method that has high power in the presence of rare variants is extended to incorporate

such quality scores as weights, and its power is compared with the original method via a simulation

study. Preliminary results suggest that taking uncertainty into account does not improve the power.

Keywords: Allele-Matching; Rare variants; Locus-based method; Quality scores; Sequencing

1. Introduction

There is an increasing interest in the role of rare variants in multifactorial disease etiology,

while the evidence that rare variants are associated with complex traits is steadily expanding.

Although any individual rare variant occurs at low frequency, the frequency with which at least one rare variant is present makes them collectively common. Under the multiple rare variant

hypothesis (MRV), the effects of multiple rare variants with moderate to high penetrance

combine to increase the risk of most common inherited diseases [1]. At the other extreme is

the common disease common variant (CDCV) hypothesis, which states that most common

complex diseases are due to a few common variants with moderately small effects [2]. The

most likely scenario is that a combination of both common and rare variants contributes to

disease risk.

In most genome-wide association (GWA) studies only variants with minor allele frequency

(MAF) greater than 1-5% are followed up, and the focus tends to be on identifying common

disease variants that are associated with complex diseases. However, this approach is limited

since only 5-10% of the heritable component of disease is explained by the many genetic

variations identified as having strong evidence of disease association in GWA studies. This suggests that a fruitful direction is to search for associations with multiple rare variants [3].

By design, SNP genotyping panels often focus on common SNPs, so that they only contain

a relatively small number of rare variants. This leads to a common issue in rare variant

analyses, in that on most platforms there is an insufficient number of rare variants (Table 1).

There appears to be a clear difference in the effects of rare variants in comparison to SNPs

of higher frequency, with rare variants having stronger effects. According to the odds ratios

(OR) for common and rare variants identified in published studies, most common-disease

associated variants have ORs between 1.1 and 1.4 with only a few above 2, while the majority

of the identified rare variants to date have an OR greater than 2 and a mean of 3.74 [1].

Table 1: Approximate low frequency/rare variant GWAS platform content.

Platform          MAF < 0.05   MAF < 0.01
Affymetrix 500k   55k          17k
Affymetrix 6.0    106k         35k
Illumina 370k     9k           1k
Illumina 550k     32k          7k
Illumina 610k     35k          8k
Illumina 1.2M     62k          22k

In addition, causality may more easily be fine-tuned by identifying rare variants. For most GWA-identified loci, there is difficulty in assigning causality since high LD complicates the use of

association mapping to precisely determine which variant is functionally relevant. There are

even more complications when elucidating the effects of SNPs that map to genomic regions

with no clear role. The problem may be simplified by searching for disease-associated rare

variants in known functional genomic regions, such as genes. In addition, it might be easier

to at least infer causality at a locus that contains both common and rare disease-associated

variants.

In the analysis of the association of rare variants and disease, there is a loss of power due

to genotype misspecification. Quality scores are available for genotype and sequence-derived

data, but in rare variant analyses, the information is not usually put to use. In addition,

the 1000 Genomes reference set contains variants with MAF as low as .01, which makes the

imputation of rare variants now possible. A probability distribution for the genotype at each

variant may be estimated using the imputation method of choice. We propose methods for

rare variant analyses that take advantage of the extra information contained in quality scores

derived from sequencing and probability distributions resulting from imputation.

In section 2 we introduce an Allele Matching Empirical Locus-specific Integrated Association test (AMELIA), which is a nonparametric and robust test that accounts for genotype

uncertainty. It is an extension of a Kernel-Based Association Test (KBAT) [4], which has been

demonstrated to have high power in the presence of rare variants. In section 3 the powers of

AMELIA and KBAT are briefly compared in a short simulation study, while a concluding

discussion is provided in section 4.

2. Allele-Matching Tests

Before providing the details of AMELIA, we first discuss the original method, KBAT. The

kernel-based association test (KBAT) [4] tests for a joint association of multiple SNPs (correlated or independent) with a categorical phenotype, without any assumptions on the directions of individual SNP effects. In simulation studies done by the authors, KBAT was found to generally have more power than other multi-marker approaches ($Z_{global}$ [5] and MDMR [6]), especially in the presence of rare causal SNPs. First, similarity scores $y_{l(ij)}$ between individuals $i$ and $j$ in group $l$ (e.g. 1 = cases, 2 = controls) are determined by using a kernel, such as the Allele Match (AM) kernel, which is the count of common alleles between the genotypes of two individuals. Let $g_i$ be the genotype score at a specific SNP, which is conveniently defined as the number of reference alleles at the SNP, since knowledge of the risk allele is irrelevant. At a given SNP, for individuals $i \neq j$ in group $l$ with respective genotypes $g_{l(i)}$ and $g_{l(j)}$, the


similarity score is defined by

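\[
y_{l(ij)} =
\begin{cases}
4, & \text{if } g_{l(i)} = g_{l(j)}, \\
2, & \text{if } g_{l(i)} = 1,\ g_{l(j)} \in \{0,2\} \ \text{or}\ g_{l(j)} = 1,\ g_{l(i)} \in \{0,2\}, \\
0, & \text{otherwise.}
\end{cases}
\tag{1}
\]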

By defining the kernel in this way, there is no need to have knowledge of the risk allele at

each SNP. Similarity scores that depend on knowledge of the risk allele are also explored in

[4]. The approach generalizes to any number of groups $L \geq 2$, where group $l$ consists of $n_l$ individuals.
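As a minimal sketch of equation (1), the AM score for genotype scores coded 0, 1 or 2 can be written as 4 minus twice the absolute genotype difference. The function names below are illustrative and are not taken from the KBAT software.

```python
import numpy as np

def am_similarity(g_i, g_j):
    """Allele Match score of equation (1) for genotype scores in {0, 1, 2}."""
    # identical genotypes -> 4, heterozygote vs homozygote -> 2, opposite homozygotes -> 0
    return 4 - 2 * np.abs(np.asarray(g_i) - np.asarray(g_j))

def pairwise_am(genotypes):
    """Matrix of AM scores between all pairs of individuals at one SNP."""
    g = np.asarray(genotypes, dtype=int)
    return am_similarity(g[:, None], g[None, :])

# Four individuals with genotype scores 0, 1, 2, 1 at a single SNP
scores = pairwise_am([0, 1, 2, 1])
```

Only the entries with i < j within the same group enter the ANOVA model; the diagonal (an individual compared with itself) is not used.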

The similarity scores $y_{l(ij)}$ between individuals $i$ and $j$ in group $l$ are modelled using a one-way ANOVA model at each SNP:
\[
y_{l(ij)} = \mu + \alpha_l + \varepsilon_{l(ij)}, \quad i < j = 1, \dots, n_l; \quad l = 1, 2,
\]
where $\mu$ is the general effect for pairs of individuals and $\alpha_l$ is the group-specific treatment effect. To test for disease association, the null hypothesis is $H_0: \alpha_1 = \alpha_2$. The single SNP test statistic at marker $k$ is the ratio of the between-group sum of squares $SSB_k$ and the within-group sum of squares $SSW_k$, and the K-marker KBAT test statistic is

\[
\frac{\sum_{k=1}^{K} SSB_k}{\sum_{k=1}^{K} SSW_k}. \tag{2}
\]
Rather than summing over the K single SNP test statistics (ratios), the K-marker test statistic takes the form of (2), which was found to have a higher power when the SNPs are correlated (see [4]). Clearly the similarity scores $y_{l(ij)}$ are not independent Normal random variables, so that neither the single SNP test statistics nor the KBAT test statistic (2) may be approximated by an F-distribution. Thus, permutation is required to obtain the p-value for each locus.

Our extensions that incorporate genotype uncertainty due to quality scores at the SNP and genotype level or imputation are introduced as AMELIA. Here, we focus on the incorporation of the two levels of quality scores. Quality scores of SNPs and genotypes can be accounted for by using weights. Phred quality scores at both the SNP and genotype level are transformed into the probability of a correct call, $1 - 10^{-q/10}$, where $q$ is the quality score. This transformation is employed in order to account for the fact that the phred quality scores are not linear and to avoid down-weighting SNPs that are actually of acceptable quality. For example, quality scores of 30 and 90 both translate to probabilities near 1, whereas if the raw phred scores were used as weights, the SNP with score 30 would contribute relatively little weight even though it is not really of poor quality.

First, (transformed) genotype quality scores are incorporated into the analysis by fitting a weighted ANOVA model at each SNP $k$, where the weight for the pair of individuals $(i,j)$ in group $l$ is a function of the genotype quality scores $q^k_{l(i)}$ and $q^k_{l(j)}$, with the simplest weight function being $w^k_{l(ij)} = q^k_{l(i)} + q^k_{l(j)}$. Note that, for a more suggestive notation for the quality incorporation into the analysis, we use $q^k_{l(i)}$ to denote the transformed genotype quality score. In the original method, KBAT, each of the similarity scores contributes a unit weight to the SNP-level test statistic. However, with the simple weighting scheme that we consider, similarity scores for which both genotype calls have a high probability of being correct are assigned a weight above 1, while those with two poor scores are down-weighted to contribute a weight below 1. At marker $k$ the weighted sum of squares within groups $wSSW_k$ and between groups $wSSB_k$ may be computed as follows, where for simplicity we have dropped the $k$ superscript, $\bar{T}_{l\cdot}$ is the weighted group mean of the similarity scores, $\bar{T}_{\cdot\cdot}$ is the weighted grand mean, and $m_l = n_l(n_l - 1)/2$ is the number of similarity scores in group $l$:

\[
wSSW = \sum_{l=1}^{L} \sum_{i=2}^{n_l} \sum_{j<i} w_{l(ij)} \left( y_{l(ij)} - \bar{T}_{l\cdot} \right)^2, \tag{3}
\]
\[
wSSB = \sum_{l=1}^{L} m_l \left( \bar{T}_{l\cdot} - \bar{T}_{\cdot\cdot} \right)^2. \tag{4}
\]
Components of SNP test statistic $k$ in the sums of the K-marker test statistic can be weighted by the SNP quality score(s) of SNP $k$. In the case that there is a common SNP quality score $Q_k$ across all individuals (score at a SNP is based on reads from all individuals), the weight for SNP $k$ in the sums is simply the (transformed) single SNP quality score $Q_k$. If the quality scores at a SNP differ among individuals (score at a SNP based on multiple reads from a single individual), then the weight may be taken as the sum of these scores at the SNP. In the latter case, the K-marker test statistic is
\[
\frac{\sum_{k=1}^{K} Q_k SSB_k}{\sum_{k=1}^{K} Q_k SSW_k}. \tag{5}
\]
In this form, SNPs that have a low probability of being a true variant contribute a lower weight than the others.
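The weighting steps of this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the AMELIA implementation: the weighted group and grand means are taken to be weight-normalised averages of the similarity scores (the text does not give their formulas), the weighted sums of squares are plugged into the K-marker ratio, each group is assumed to contain at least two individuals, and all function names are hypothetical.

```python
import numpy as np

def phred_to_prob(q):
    """Probability of a correct call, 1 - 10^(-q/10), for phred score(s) q."""
    return 1.0 - 10.0 ** (-np.asarray(q, dtype=float) / 10.0)

def weighted_ss(genotypes, phred, groups):
    """Weighted within- and between-group sums of squares at one SNP."""
    genotypes = np.asarray(genotypes, dtype=int)
    groups = np.asarray(groups)
    p = phred_to_prob(phred)                      # transformed genotype qualities
    per_group = []
    for label in np.unique(groups):
        idx = np.flatnonzero(groups == label)
        i, j = np.triu_indices(len(idx), k=1)     # all pairs within the group
        y = 4 - 2 * np.abs(genotypes[idx][i] - genotypes[idx][j])  # AM kernel
        w = p[idx][i] + p[idx][j]                 # simplest weight: sum of qualities
        per_group.append((y, w))
    all_y = np.concatenate([y for y, _ in per_group])
    all_w = np.concatenate([w for _, w in per_group])
    grand_mean = np.sum(all_w * all_y) / np.sum(all_w)   # assumed weighted grand mean
    wssw = wssb = 0.0
    for y, w in per_group:
        group_mean = np.sum(w * y) / np.sum(w)           # assumed weighted group mean
        wssw += np.sum(w * (y - group_mean) ** 2)
        wssb += len(y) * (group_mean - grand_mean) ** 2  # len(y) equals m_l
    return wssw, wssb

def k_marker_statistic(ssb, ssw, snp_phred):
    """Ratio of SNP-quality-weighted sums over the K markers."""
    Q = phred_to_prob(snp_phred)
    return np.sum(Q * np.asarray(ssb)) / np.sum(Q * np.asarray(ssw))

# Two cases with identical genotypes, two controls with opposite homozygotes
wssw, wssb = weighted_ss([0, 0, 0, 2], [20, 20, 20, 20], [1, 1, 2, 2])
```

With a phred score of 20 the transformed quality is 0.99, so each pair weight here is 1.98, slightly below the weight of 2 obtained when both calls are certain.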

2.1. Implementation

In order to increase the speed of the permutations, as suggested in [4], the similarity scores

between all possible pairs of individuals are computed, regardless of which cohort they belong

to. Then, in the permutation stage, the similarity scores for the permuted case-control samples

may be quickly extracted without further computation. However, for large cohorts (N > 1000),

this causes both AMELIA and KBAT to be memory-intensive, requiring additional memory

allocation to run. For example, when N = 1000 there are 499,500 similarity scores between

all possible pairs of individuals, which requires manipulation of a 499,500 × 499,500 array.

The time requirement for both methods also increases with the number of SNPs since a test

statistic must be computed at each SNP for each permutation.
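The precompute-then-permute idea can be illustrated with the following sketch, which uses the unweighted KBAT statistic (2) for brevity. The storage layout (one N x N matrix per SNP), the add-one p-value estimate and the function names are assumptions made for this example rather than details of the actual software.

```python
import numpy as np

def pairwise_scores(G):
    """AM scores for all pairs of the N individuals at each of K SNPs; shape (K, N, N)."""
    G = np.asarray(G, dtype=int)                 # N x K matrix of genotype scores 0/1/2
    return 4 - 2 * np.abs(G.T[:, :, None] - G.T[:, None, :])

def kbat_statistic(S, is_case):
    """Ratio of summed between- and within-group sums of squares, read from S."""
    ssb = ssw = 0.0
    for S_k in S:
        ys = []
        for mask in (is_case, ~is_case):
            block = S_k[np.ix_(mask, mask)]      # scores among members of one group
            ys.append(block[np.triu_indices(block.shape[0], k=1)])
        grand = np.concatenate(ys).mean()
        ssb += sum(len(y) * (y.mean() - grand) ** 2 for y in ys)
        ssw += sum(((y - y.mean()) ** 2).sum() for y in ys)
    return ssb / ssw

def permutation_pvalue(G, is_case, n_perm=1000, seed=0):
    """Permutation p-value that reuses the precomputed similarity scores."""
    rng = np.random.default_rng(seed)
    is_case = np.asarray(is_case, dtype=bool)
    S = pairwise_scores(G)                       # computed once, before permuting
    observed = kbat_statistic(S, is_case)
    null = [kbat_statistic(S, rng.permutation(is_case)) for _ in range(n_perm)]
    # add-one estimate (an assumption here) keeps the p-value above zero
    return (1 + sum(t >= observed for t in null)) / (n_perm + 1)
```

Each permutation only re-indexes the stored scores, which is where the speed gain comes from; the cost is that the stored array grows with the square of the cohort size.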

3. Simulation Study

A brief simulation study has been run to compare the powers of KBAT [4] and our version

of AMELIA that accounts for quality scores. Genotype and quality score data are simulated

based on data from the pilot study of 1000 Genomes (68 individuals). More specifically, we

use the haplosim function of the hapsim [7] R package to simulate a population of haplotypes

that possess the same allele frequencies and pairwise LD structure as a specified chromosomal

region from the 1000 Genomes data. This approach produces realistic data that includes

variants with MAFs down to .01. A cohort of N individuals is formed by randomly pairing up


2N haplotypes sampled from a population of 40000 simulated haplotypes. SNP and genotype

quality scores were generated by randomly sampling with replacement from the scores observed

in the 1000 Genomes data. In the simulations considered there is only one causal SNP, which

has a MAF close to a certain frequency, and is chosen randomly among the possible SNPs

that satisfy this criterion. More complicated simulations involving multiple causal SNPs are

to be explored in the near future.

Case-control status is generated by using a multiplicative model for the genotype relative

risks to compute the probability of disease given the genotype at the causal SNP and its

relative risk (RR) (for details see [4]). This probability is then used to generate a Bernoulli

random variable that ascertains an individual as a case when its value is 1, and a control

otherwise. For this reason, it is necessary to over-sample (say, 5N) the number of individuals

to ensure that the desired number of cases is attained.
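A rough sketch of this ascertainment step is given below. It assumes that the probability of disease is a baseline penetrance multiplied by RR raised to the number of risk alleles; the baseline value, the Hardy-Weinberg genotype draw (used here in place of the simulated haplotypes) and the function name are all assumptions made for illustration.

```python
import numpy as np

def simulate_case_control(n_cases, n_controls, maf=0.02, rr=2.0,
                          baseline=0.05, oversample=5, seed=0):
    """Draw causal-SNP genotypes until enough cases and controls are ascertained."""
    rng = np.random.default_rng(seed)
    cases, controls = [], []
    batch = oversample * (n_cases + n_controls)   # over-sample, e.g. 5N individuals
    while len(cases) < n_cases or len(controls) < n_controls:
        g = rng.binomial(2, maf, size=batch)      # risk-allele counts (assumed HWE)
        p_disease = np.minimum(baseline * rr ** g, 1.0)   # multiplicative model
        affected = rng.random(batch) < p_disease  # Bernoulli draw per individual
        cases.extend(g[affected].tolist())
        controls.extend(g[~affected].tolist())
    return np.array(cases[:n_cases]), np.array(controls[:n_controls])

cases, controls = simulate_case_control(50, 50, seed=1)
```

The loop draws further batches if one over-sampled batch does not yield enough cases, which can happen when the baseline penetrance is low.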

In order to obtain the p-value in an efficient manner, we first obtained p-values based on

1000 permutations. If this p-value was below .02, additional permutations were run to update

the p-value on the basis of 10,000 permutations. This procedure of updating the p-value

continues up to a maximum of 1,000,000 permutations, if necessary.
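The staged procedure might be sketched as follows. The text specifies the .02 cut-off only for the first stage, so reusing it at every later stage is an assumption, as are the intermediate stage of 100,000 permutations and the `draw_null` callback, a hypothetical function that returns one permuted test statistic.

```python
import numpy as np

def adaptive_pvalue(observed, draw_null,
                    stages=(1_000, 10_000, 100_000, 1_000_000),
                    threshold=0.02, seed=0):
    """Permutation p-value that adds permutations only while the estimate is small."""
    rng = np.random.default_rng(seed)
    exceed = done = 0
    p = 1.0
    for target in stages:
        for _ in range(target - done):            # only the additional permutations
            exceed += int(draw_null(rng) >= observed)
        done = target
        p = exceed / done
        if p >= threshold:                        # same cut-off at each stage (assumed)
            break
    return p, done
```

Permutations from earlier stages are kept and counted, so each update costs only the difference between consecutive stage sizes.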

In order to compare the two tests in a scenario similar to that of [4], rather than testing the

whole region we also test regions of 11 SNPs formed from the causal SNP and 10 randomly

selected SNPs among the 20 SNPs that form a neighborhood around the causal SNP (10

upstream and 10 downstream from the causal SNP) (termed the neighborhood region).
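Selecting a neighborhood region could look like the sketch below. SNPs are indexed by position within the region; truncating the window at the region boundaries is an assumption, and the function name is hypothetical.

```python
import numpy as np

def neighborhood_region(causal_index, n_snps, flank=10, n_pick=10, seed=0):
    """Indices of the causal SNP plus n_pick SNPs drawn from its flanking window."""
    rng = np.random.default_rng(seed)
    lo = max(0, causal_index - flank)             # window truncated at region edges
    hi = min(n_snps - 1, causal_index + flank)
    neighbours = [k for k in range(lo, hi + 1) if k != causal_index]
    # sampling without replacement; raises if fewer than n_pick neighbours exist
    picked = rng.choice(neighbours, size=n_pick, replace=False)
    return np.sort(np.append(picked, causal_index))

region = neighborhood_region(100, 342)
```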

3.1. Results

In this brief simulation study, a 150 kb region from chromosome 1 of the 1000 Genomes data was considered, which contains 342 SNPs. This region was chosen somewhat arbitrarily, but also because it has a genome-average recombination rate of approximately 1 cM/Mb. All SNPs

were retained, except for those with a SNP quality of 0. We assumed a single low frequency

causal SNP (MAF=.02, RR=2), and 500 cases and 500 controls were simulated over 1000

replications.

Table 2: Power results (5% level of significance) for AMELIA and KBAT when there is one rare causal SNP and there are 500 cases and 500 controls.

Region         AMELIA   KBAT
whole          .0871    .0953
neighborhood   .1731    .2161

When jointly testing all SNPs within a region there is a slight loss of power with the use of

AMELIA in comparison to KBAT. However, both methods have a relatively low power when

there are many SNPs in the region. In a comparable scenario examined in [4], where the region

contains only 10 SNPs and the causal SNP has a MAF of .108 with RR=1.25, the power of

KBAT was .323. In our neighborhood simulations comparing AMELIA and KBAT we obtain

powers of similar magnitude (see Table 2). Thus the low powers for the entire region tests are