
arXiv:1208.5809v2 [stat.AP] 30 Aug 2012

Mixture Models for Single-Cell Assays with Application to Vaccine Studies

Greg Finak¹, Andrew McDavid¹, Pratip Chattopadhyay³, Maria Dominguez³, Steve De Rosa¹,², Mario Roederer³, and Raphael Gottardo¹

¹Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center (FHCRC), Seattle, WA

²HIV Vaccine Trials Network, Fred Hutchinson Cancer Research Center (FHCRC), Seattle, WA

³Vaccine Research Center, NIAID, NIH, 40 Convent Drive, Rm 5509, Bethesda, MD 20892

August 31, 2012

Abstract

Blood and tissue are composed of many functionally distinct cell subsets. In immunological studies, these can only be measured accurately using single-cell assays. The characterization of these small cell subsets is crucial to decipher system-level biological changes. For this reason, an increasing number of studies rely on assays that provide single-cell measurements of multiple genes and proteins from bulk cell samples. A common problem in the analysis of such data is to identify biomarkers (or combinations of biomarkers) that are differentially expressed between two biological conditions (e.g., before/after vaccination), where expression is defined as the proportion of cells expressing that biomarker (or biomarker combination) in the cell subset(s) of interest. Here, we present a Bayesian hierarchical framework based on a beta-binomial mixture model for testing for differential biomarker expression using single-cell assays. Our model allows the inference to be subject specific, as is typically required when assessing vaccine responses, while borrowing strength across subjects through common prior distributions. We propose two approaches for parameter estimation: an empirical-Bayes approach using an Expectation-Maximization algorithm and a fully Bayesian one based on a Markov chain Monte Carlo algorithm. We compare our method against frequentist approaches for single-cell assays, including Fisher's exact test, a likelihood ratio test, and basic log-fold changes. Using several experimental assays measuring proteins or genes at the single-cell level, as well as simulated data, we show that our method has higher sensitivity and specificity than alternative methods. Additional simulations show that our framework is also robust to model misspecification. Finally, we demonstrate how our approach can be extended to testing multivariate differential expression across multiple biomarker combinations using a Dirichlet-multinomial model, and we illustrate this multivariate approach using single-cell gene expression data and simulations.


1 Introduction

Cell populations, particularly in the immune system, are never truly homogeneous; individual cells may be in different biochemical states that define functional and measurable differences between them. This single-cell heterogeneity is informative, but it is lost in assays that measure cell mixtures. For this reason, endpoints in vaccine and immunological studies are measured through a variety of assays that provide single-cell measurements of multiple genes and proteins. In the 1970s, single-cell analysis was revolutionized by the development of fluorescence-based flow cytometry (FCM). Since then, advances in instrumentation and reagents have enabled the study of numerous cellular processes via the simultaneous single-cell measurement of multiple surface and intracellular biomarkers (up to 17 biomarkers). More recent technological developments have drastically extended the capabilities of single-cell cytometry to measure dozens of simultaneous parameters (e.g., proteins, genes, cytokines) per cell (Bendall et al., 2011). Although cells sorted using well-established surface biomarkers may appear homogeneous, mRNA expression of other genes within these cells can be heterogeneous (Narsinh et al., 2011, Flatz et al., 2011) and could further characterize and subset these cells. A new technology based on microfluidic arrays combined with multiplexed polymerase chain reactions (PCR) can now be used to perform thousands of PCRs in a single device, enabling simultaneous, high-throughput gene expression measurements at the single-cell level across hundreds of cells and genes (Pieprzyk, 2009). While classic gene expression microarrays sum the expression from many individual cells, the intrinsically stochastic nature of biochemical processes results in relatively large cell-to-cell gene expression variability (van Oudenaarden, 2009). This heterogeneity may carry important information; thus, single-cell expression data should not be analyzed in the same fashion as cell-population-level data. In general, single-cell data warrant special treatment that preserves information about population heterogeneity. For this reason, single-cell assays are an important tool in immunology, providing a functional and phenotypic snapshot of the immune system at a given time. These assays typically measure multiple biomarkers simultaneously on individual cells in a heterogeneous mixture such as whole blood or peripheral blood mononuclear cells (PBMC), and are used for immune monitoring of disease, vaccine research, and diagnosis of haematological malignancies (Altman et al., 1996, Betts et al., 2006, Inokuma et al., 2007).

During analysis, cell-level biomarker fluorescence intensities are typically thresholded as positive or negative, so that subsets with different multivariate +/− combinations can be obtained as Boolean combinations. For some assays (e.g., flow cytometry), the positivity thresholds are set based on prior biological knowledge, while for others, thresholds are given by the assay technology. This is the case for the Fluidigm technology, where genes are recorded as absent (not expressed) or present (expressed) at the single-cell level. After this thresholding step, we obtain a Boolean matrix of dimension $N \times K$, where $N$ is the number of cells recorded and $K$ is the number of biomarkers. Using this matrix, one can form $2^K$ putative cell subsets obtained as Boolean combinations. When $K$ is large, there is a combinatorial explosion in the number of subsets, and many of these might be small or even empty. A common statistical problem is, for a given biomarker combination, to identify subjects for whom the proportion of cells expressing that combination is significantly different between two experimental conditions (e.g., before and after vaccination). Note that we use the term ‘subject’ throughout the paper, but the approaches described are general and can be applied to other experimental units (e.g., animal studies).
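The thresholding and subset-forming step can be sketched in a few lines of Python. This is an illustrative sketch, not MIMOSA code: the function names, the bit-ordering convention (biomarker $j$ contributes bit $j$ of the combination index, so index 0 is the all-negative subset) and the toy data are assumptions.

```python
import numpy as np


def combination_counts(positive: np.ndarray) -> np.ndarray:
    """Count cells in each of the 2^K Boolean +/- combinations.

    `positive` is an N x K Boolean matrix (cells x biomarkers). Biomarker j
    contributes bit j of the combination index (an assumed convention), so
    index 0 is the all-negative subset.
    """
    n_markers = positive.shape[1]
    weights = 1 << np.arange(n_markers)        # 1, 2, 4, ...
    codes = positive.astype(int) @ weights     # one integer code per cell
    return np.bincount(codes, minlength=2 ** n_markers)


def any_positive_count(counts: np.ndarray) -> int:
    """Cells positive for at least one biomarker, e.g. 'A+ and/or B+'."""
    return int(counts[1:].sum())


# Hypothetical data: 6 cells, K = 2 biomarkers (column 0 = A, column 1 = B).
cells = np.array([[0, 0], [1, 0], [1, 1], [0, 0], [0, 1], [1, 1]], dtype=bool)
counts = combination_counts(cells)
print(counts)
print(any_positive_count(counts))
```

Summing selected entries of `counts` is the marginalization used later to define univariate subsets such as A+ and/or B+.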

A motivating example from vaccine research is the flow cytometric intracellular cytokine staining (ICS) assay, which is used to identify and quantify subjects' immune responses to a vaccine. Upon vaccination, antigen in the vaccine is taken up and presented to CD4 or CD8 T-cells via antigen-presenting cells. While not all T-cells can recognize all antigens, those that recognize antigens in the vaccine become activated and produce a variety of cytokines, further promoting the immune response. After activation, this antigen-specific subpopulation proliferates and can persist in the immune system for some time, providing memory that can more rapidly recognize the same antigen again in the future (McKinstry et al., 2010). The antigen-specific T-cell subpopulations (i.e., the subsets that can respond to one specific antigen) constitute a very small fraction of the total number of CD4 and CD8 T-cells. The ICS assay measures the number of antigen-specific T-cells in PBMC or whole blood by measuring cytokine production in response to activation following stimulation by an antigen that closely matches what was present in the original vaccine. Individual cells are labelled using fluorescently conjugated antibodies against phenotypic biomarkers (CD3, CD4, and CD8), used to subset T-cells, and functional biomarkers (cytokines), used to define antigen-specific T-cells (Horton et al., 2007, De Rosa et al., 2004, Betts et al., 2006). A sufficiently large number of cells must be collected to ensure that the rare cell populations can be detected. Each individual cell is then classified as either positive or negative for each marker based on predetermined thresholds, and the number of cells matching each subpopulation phenotype is counted.

These counts are compared between antigen-stimulated and unstimulated samples from a subject to identify significant differences. Subjects who generate a response after stimulation are called responders, whereas subjects who do not show any differences are called non-responders. In many immunological studies, the size of the functionally distinct subpopulations (i.e., the number of positive cells) is very low relative to the total number of cells, and real biological differences might be difficult to detect.

Although there is no standard approach to analyzing ICS assays, current methods range from ad hoc rules based on log-fold changes (Trigona et al., 2003), to nonparametric methods (Sinclair et al., 2004), to permutation tests based on Hotelling's $T^2$ statistic (Nason, 2006), to tests of 2 × 2 contingency tables (e.g., Fisher's exact test and the $\chi^2$ test) (Horton et al., 2007, Proschan and Nason, 2009, Peiperl et al., 2010, Nason, 2006). All of these methods test subjects separately, and no information is shared across observations, even though one could expect some similarities across responders (or non-responders).

The framework developed in this paper, named MIMOSA (Mixture Models for Single-Cell Assays), addresses these issues explicitly. In our model, cell counts are modelled by a binomial (or multinomial, in the multivariate case) distribution, and information is shared across subjects by means of a prior distribution placed on the unknown proportion(s) of the binomial (or multinomial) likelihood. In order to discriminate between responders and non-responders, the prior is written as a mixture of two beta (or Dirichlet, in the multivariate case) distributions, where the hyperparameters for each mixture component are shared across subjects. This sharing of information helps regularize proportion estimates when the cell counts are small, which is typical with single-cell assays, and increases sensitivity and specificity when detecting responders. Because our framework is multivariate in nature, multiple cell subsets can be modelled simultaneously, which could help detect small biological changes that are spread out across multiple cell subsets (Nason, 2006). Our paper is organized as follows. Section 2 introduces the data and notation used in the paper. In Section 3, we present our model for testing differential biomarker expression in the univariate case. Section 4 compares our approach to alternative methods and tests the robustness of our model. In Section 5, we present a multivariate extension of our model that can be used to test multivariate differential biomarker expression, and we present some results using a single-cell gene expression data set. Finally, in Section 6, we discuss our findings and future work.


2 Notation and Data

In the remainder of the paper, we use the following notation to describe our model. We assume that we observe cell counts from $I$ subjects in two conditions: stimulated and unstimulated. Each cell can be either positive or negative for a biomarker. Given a set of $K$ biomarkers, the measured cells can be classified into $2^K$ positive/negative biomarker combinations. We denote by $n_{sik}$ and $n_{uik}$, $k = 1, \dots, 2^K$, $i = 1, \dots, I$, the observed counts for the $2^K$ combinations in the stimulated and unstimulated samples, respectively. We denote by $N_{si} = \sum_k n_{sik}$ and $N_{ui} = \sum_k n_{uik}$ the total number of cells measured for subject $i$ in each sample, respectively. For ease of notation, we denote by $y_i$ the vector of observed counts for subject $i$, i.e., $y_i = (n_{si}, n_{ui})$, where $n_{si} = \{n_{sik} : k = 1, \dots, 2^K\}$ and $n_{ui} = \{n_{uik} : k = 1, \dots, 2^K\}$. Finally, we define $y = (y_1, \dots, y_I)$.

We consider two types of immunological single-cell assays: flow cytometry and single-cell gene expression, as described below.

Flow cytometry: The primary data set used here is an ICS data set generated as part of a trial testing the GeoVax DNA and MVA (Modified Vaccinia Ankara) HIV vaccine in a prime-boost regimen (prime at zero and two months, boost at four and six months) (Goepfert et al., 2011). This data set was generated to assess the immune response to the vaccine across multiple antigen stimulations, time points, cytokines, and T-cell subsets. Here, we analyze a subset of the data consisting of 98 subjects from the vaccine group at two time points: day 0 and day 182. Three cytokines (IFN-γ, TNFα, and IL2) were measured at the single-cell level for each subject and time point, with and without stimulation with an antigen matching part of the vaccine (here, we focus on the HIV Envelope peptide pool). For ease of presentation, we restrict ourselves to the CD4+ T-cell subsets. Samples on day 0 were taken just before vaccination, so no response is expected there, and the corresponding samples can be used as negative controls. Conversely, day 182 (26 weeks) should be close to the immunogenicity peak, and many subjects are expected to respond, at least for some cytokines.

Fluidigm single-cell gene expression: This is a single-cell gene expression data set of sorted CD8+ T-cells from sixteen subjects. T-cells isolated by flow cytometry from the sixteen subjects were stimulated, in blocks of four subjects, with four different antigens (HIV Gag, HIV Nef, CMV pp65 tm10, and CMV pp65 nlv5), and gene expression post-stimulation was measured at the single-cell level using 96 × 96 arrays on the BioMark system (Fluidigm). The expression from the stimulated samples was compared to that of paired, unstimulated controls.

Although the immunological experiments described above will often look at multiple antigens or stimulations, in the models presented here we consider only one stimulation (i.e., antigen or condition) at a time vs. unstimulated. The issue of multiple antigens is handled through multiple testing correction.

3 Differential expression with one biomarker

Data sets like the ones presented here are usually analyzed in a univariate fashion to avoid being underpowered due to the large number of combinations and the potential for very small cell counts in many of the combinations. Here, by univariate, we mean that we have only one positive cell subset. This cell subset can be defined by considering the expression of one biomarker alone (marginalizing over all other measured biomarkers), such as A+ (vs. A−), or by considering a specific positive biomarker combination (and marginalizing over everything else), such as A+ and/or B+ (vs. A−/B−). Without loss of generality, we treat the univariate case as a one-biomarker case (i.e., $K = 1$). In this case, for a given subject, the data can be summarized in a contingency table of +/− cell counts across the unstimulated and stimulated samples, as depicted in Table 1.

Table 1: 2 × 2 contingency table of counts for biomarker-positive and biomarker-negative cells between stimulated (s) and unstimulated (u) conditions for a given subject $i$.

| | Biomarker positive | Biomarker negative |
|---|---|---|
| Stimulated | $n_{si}$ | $N_{si} - n_{si}$ |
| Unstimulated | $n_{ui}$ | $N_{ui} - n_{ui}$ |

For a given subject and stimulation, we consider a biomarker to be differentially expressed if the proportion of positive cells in the stimulated sample is different from the proportion of positive cells in the unstimulated sample. Subjects that show differential expression will be called responders for that biomarker. In this section, we are concerned with identifying differential expression one biomarker at a time, using a beta-binomial mixture model as described in what follows.
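For comparison, a per-subject frequentist analysis of Table 1 can be sketched with SciPy's `fisher_exact`, here with a one-sided alternative matching the expectation that stimulation only increases the proportion of positive cells. The log fold-change uses a pseudocount of 0.5 to avoid division by zero; that pseudocount, the function names, and the counts are assumptions for illustration.

```python
import math

from scipy.stats import fisher_exact


def one_sided_fisher(n_s, N_s, n_u, N_u):
    """One-sided Fisher's exact test on the 2 x 2 table of Table 1.

    Rows are (stimulated, unstimulated), columns are (positive, negative).
    alternative="greater" asks whether the odds of a cell being positive
    are higher in the stimulated sample than in the unstimulated one.
    """
    table = [[n_s, N_s - n_s], [n_u, N_u - n_u]]
    _, p_value = fisher_exact(table, alternative="greater")
    return p_value


def log_fold_change(n_s, N_s, n_u, N_u, pseudo=0.5):
    """log2 ratio of stimulated to unstimulated positive-cell proportions."""
    return math.log2(((n_s + pseudo) / N_s) / ((n_u + pseudo) / N_u))


# Hypothetical subject: 40 of 10,000 stimulated vs. 10 of 10,000 unstimulated cells.
print(one_sided_fisher(40, 10_000, 10, 10_000))
print(log_fold_change(40, 10_000, 10, 10_000))
```

Both functions look at one subject in isolation, which is exactly the limitation the mixture model below is designed to address.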

3.1 Beta-binomial model

For a given subject $i$, the positive cell counts for the stimulated and unstimulated samples are jointly modeled as follows:

$$(n_{si} \mid p_{si}) \sim \mathrm{Bin}(N_{si}, p_{si}) \quad \text{and} \quad (n_{ui} \mid p_{ui}) \sim \mathrm{Bin}(N_{ui}, p_{ui}),$$

where $p_{si}$ and $p_{ui}$ are the unknown proportions for the stimulated and unstimulated paired samples, respectively. In order to detect responding subjects, we consider two competing models:

$$M_0: p_{ui} = p_{si} \quad \text{and} \quad M_1: p_{ui} \neq p_{si}.$$

Under the null model, $M_0$, there is no difference between the stimulated and unstimulated samples, and the proportions are equal (yet the cell counts can differ). Under the alternative model, $M_1$, there is a difference in proportions between the two samples, and subject $i$ is a responder. In some studies, such as the ICS data used here, the proportion of positive cells is expected only to increase after stimulation, in which case the alternative model should be defined as $p_{si} > p_{ui}$. This alternative parametrization is described in Web Appendix B, and we refer to it as the one-sided model.

3.2 Priors

Our model shares information across all subjects using exchangeable (Bernardo, 1996) beta priors on the unknown proportions, as follows:

$$(p_{ui} \mid z_i = 0) \sim \mathrm{Beta}(\alpha_u, \beta_u), \quad p_{si} = p_{ui},$$
$$(p_{si} \mid z_i = 1) \sim \mathrm{Beta}(\alpha_s, \beta_s) \quad \text{and} \quad (p_{ui} \mid z_i = 1) \sim \mathrm{Beta}(\alpha_u, \beta_u),$$

where $z_i$ is an indicator variable equal to one if subject $i$ is a responder (i.e., $M_1$ is true) and zero otherwise, and $\alpha_u, \beta_u, \alpha_s, \beta_s$ are unknown hyperparameters shared across all subjects. Note that the parameters $\alpha_u$ and $\beta_u$ are explicitly shared across the two models, whereas $\alpha_s$ and $\beta_s$ are only present in the alternative model. Finally, we assume that the $z_i \sim \mathrm{Be}(w)$ are independent draws from a Bernoulli distribution with probability $w$, where $w$ represents the (unknown) proportion of responders. It follows that, marginally (i.e., after integrating out $z_i$), $p_{ui}$ and $p_{si}$ are jointly distributed as a mixture of a one-dimensional beta distribution and a product of two beta distributions (with a possible constraint), with mixing parameter $w$. Treating the $z_i$'s as missing data, the unknown parameter vector $\theta \equiv (\alpha_u, \beta_u, \alpha_s, \beta_s, w)$ can be estimated in an empirical-Bayes fashion using an Expectation-Maximization algorithm (Dempster et al., 1977), as described in Section 3.3. As an alternative, we also describe a fully Bayesian model, where the hyperparameters $\alpha_u$, $\beta_u$, $\alpha_s$, and $\beta_s$ are each given vague exponential priors with mean $10^3$, and $w$ is assumed to be drawn from a uniform distribution between 0 and 1. In this case, all parameters are estimated via a Markov chain Monte Carlo algorithm, as described in Section 3.3.
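The generative model defined by the binomial likelihood of Section 3.1 and these priors can be simulated directly, which is useful for checking an implementation against known responder labels. The sketch below covers the two-sided model only (the one-sided constraint of Web Appendix B is not imposed), and the function name and hyperparameter values are hypothetical.

```python
import numpy as np


def simulate_mimosa(I, N_s, N_u, alpha_u, beta_u, alpha_s, beta_s, w, seed=0):
    """Draw (n_s, n_u, z) for I subjects from the two-sided beta-binomial mixture.

    z_i ~ Bernoulli(w). Non-responders (z_i = 0) share a single proportion
    p_si = p_ui ~ Beta(alpha_u, beta_u); responders draw p_ui ~ Beta(alpha_u, beta_u)
    and p_si ~ Beta(alpha_s, beta_s) independently.
    """
    rng = np.random.default_rng(seed)
    z = rng.random(I) < w
    p_u = rng.beta(alpha_u, beta_u, size=I)
    p_s = np.where(z, rng.beta(alpha_s, beta_s, size=I), p_u)
    n_s = rng.binomial(N_s, p_s)
    n_u = rng.binomial(N_u, p_u)
    return n_s, n_u, z.astype(int)


# Hypothetical hyperparameters: about 0.1% positive cells at baseline, 0.5% in responders.
n_s, n_u, z = simulate_mimosa(I=200, N_s=5_000, N_u=5_000,
                              alpha_u=2.0, beta_u=2_000.0,
                              alpha_s=10.0, beta_s=2_000.0, w=0.6)
```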

3.3 Parameter estimation

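In our proposed EM and MCMC algorithms, we greatly simplify our calculations by directly utilizing the marginal likelihoods, $L_0$ and $L_1$, obtained after integrating $p_{si}$ and $p_{ui}$ out of the null and alternative likelihoods. Given the conjugacy of the priors, the marginal likelihoods $L_0$ and $L_1$ are available in closed form (Web Appendix A) and are given by

$$L_0(\alpha_u, \beta_u \mid y_i) = \binom{N_{ui}}{n_{ui}} \binom{N_{si}}{n_{si}} \cdot \frac{B(n_{si} + n_{ui} + \alpha_u,\; N_{si} - n_{si} + N_{ui} - n_{ui} + \beta_u)}{B(\alpha_u, \beta_u)}$$

and

$$L_1(\alpha_u, \beta_u, \alpha_s, \beta_s \mid y_i) = \binom{N_{ui}}{n_{ui}} \binom{N_{si}}{n_{si}} \cdot \frac{B(n_{ui} + \alpha_u,\; N_{ui} - n_{ui} + \beta_u)}{B(\alpha_u, \beta_u)} \cdot \frac{B(n_{si} + \alpha_s,\; N_{si} - n_{si} + \beta_s)}{B(\alpha_s, \beta_s)}. \qquad (1)$$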

Above, $B$ is the beta function. Assuming that the missing data $z_i$, $i = 1, \dots, I$, are known, we define the complete-data log-likelihood

$$l(\theta \mid y, z) = \sum_i \Big[ z_i\, l_1(\alpha_u, \beta_u, \alpha_s, \beta_s \mid y_i) + (1 - z_i)\, l_0(\alpha_u, \beta_u \mid y_i) + z_i \log(w) + (1 - z_i) \log(1 - w) \Big], \qquad (2)$$

where $l_0$ and $l_1$ are the log marginal likelihoods and $\theta \equiv (\alpha_u, \beta_u, \alpha_s, \beta_s, w)$ is the vector of parameters to be estimated. In the one-sided case, the alternative prior specification must satisfy the constraint $p_{si} > p_{ui}$, and the marginal likelihood derivation involves the calculation of a normalizing constant that is not available in closed form but can easily be estimated. All calculations for the one-sided case are described in Web Appendix B.
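The marginal likelihoods in (1) are easiest to evaluate on the log scale, since the beta functions and binomial coefficients overflow quickly at typical cell counts. A minimal Python sketch using SciPy's `betaln` and `gammaln` follows; the function names are hypothetical, and the functions accept NumPy arrays so that all subjects can be evaluated at once.

```python
import numpy as np
from scipy.special import betaln, gammaln


def log_binom(N, n):
    """log of the binomial coefficient C(N, n)."""
    return gammaln(N + 1) - gammaln(n + 1) - gammaln(N - n + 1)


def log_L0(a_u, b_u, n_s, N_s, n_u, N_u):
    """log L0: a common proportion p_s = p_u ~ Beta(a_u, b_u), integrated out."""
    return (log_binom(N_s, n_s) + log_binom(N_u, n_u)
            + betaln(n_s + n_u + a_u, N_s - n_s + N_u - n_u + b_u)
            - betaln(a_u, b_u))


def log_L1(a_u, b_u, a_s, b_s, n_s, N_s, n_u, N_u):
    """log L1: independent Beta(a_u, b_u) and Beta(a_s, b_s) priors, integrated out."""
    return (log_binom(N_s, n_s) + log_binom(N_u, n_u)
            + betaln(n_u + a_u, N_u - n_u + b_u) - betaln(a_u, b_u)
            + betaln(n_s + a_s, N_s - n_s + b_s) - betaln(a_s, b_s))


# Log Bayes factors (L1 vs. L0) for three hypothetical subjects, vectorized.
n_s_obs = np.array([12, 40, 9])
n_u_obs = np.array([10, 11, 8])
print(log_L1(2.0, 2000.0, 10.0, 2000.0, n_s_obs, 10_000, n_u_obs, 10_000)
      - log_L0(2.0, 2000.0, n_s_obs, 10_000, n_u_obs, 10_000))
```

A quick sanity check: with uniform priors ($\alpha = \beta = 1$) each beta-binomial factor reduces to $1/(N + 1)$, so $L_1 = 1/\big((N_{si} + 1)(N_{ui} + 1)\big)$.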

EM algorithm

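Given an estimate of the model parameter vector $\tilde\theta = (\tilde\alpha_u, \tilde\beta_u, \tilde\alpha_s, \tilde\beta_s, \tilde w)$ and the data $y$, the E step consists of calculating the posterior probabilities of differential expression, defined by

$$\tilde z_i \equiv \Pr(z_i = 1 \mid y, \tilde\theta) = \frac{\tilde w \cdot L_1(\tilde\alpha_u, \tilde\beta_u, \tilde\alpha_s, \tilde\beta_s \mid y_i)}{(1 - \tilde w) \cdot L_0(\tilde\alpha_u, \tilde\beta_u \mid y_i) + \tilde w \cdot L_1(\tilde\alpha_u, \tilde\beta_u, \tilde\alpha_s, \tilde\beta_s \mid y_i)}.$$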
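The E step above, combined with the M step described next, can be sketched in Python as follows. This is a minimal sketch of the two-sided model under stated assumptions, not the MIMOSA implementation: it starts from user-supplied values rather than the Fisher's-exact-test and method-of-moments initialization described in the text, it uses SciPy's Nelder-Mead optimizer in place of R's optim, and it optimizes the hyperparameters on the log scale to keep them positive. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, gammaln


def log_binom(N, n):
    return gammaln(N + 1) - gammaln(n + 1) - gammaln(N - n + 1)


def log_L0(a_u, b_u, n_s, N_s, n_u, N_u):
    return (log_binom(N_s, n_s) + log_binom(N_u, n_u)
            + betaln(n_s + n_u + a_u, N_s - n_s + N_u - n_u + b_u) - betaln(a_u, b_u))


def log_L1(a_u, b_u, a_s, b_s, n_s, N_s, n_u, N_u):
    return (log_binom(N_s, n_s) + log_binom(N_u, n_u)
            + betaln(n_u + a_u, N_u - n_u + b_u) - betaln(a_u, b_u)
            + betaln(n_s + a_s, N_s - n_s + b_s) - betaln(a_s, b_s))


def e_step(theta, w, data):
    """Posterior responder probabilities z_i and the observed-data log-likelihood."""
    l0 = log_L0(theta[0], theta[1], *data)
    l1 = log_L1(*theta, *data)
    log_num = np.log(w) + l1
    log_den = np.logaddexp(np.log1p(-w) + l0, log_num)  # log[(1 - w) L0 + w L1]
    return np.exp(log_num - log_den), log_den.sum()


def em_mimosa(data, theta0, w0=0.5, tol=1e-6, max_iter=200):
    """EM for theta = (a_u, b_u, a_s, b_s) and w; data = (n_s, N_s, n_u, N_u)."""
    theta, w = np.asarray(theta0, dtype=float), w0
    previous = -np.inf
    for _ in range(max_iter):
        z, loglik = e_step(theta, w, data)
        if loglik - previous < tol:      # stop once the log-likelihood stalls
            break
        previous = loglik
        w = float(np.clip(z.mean(), 1e-10, 1 - 1e-10))  # closed-form update for w

        def neg_expected(log_theta):     # minus the expected complete-data log-likelihood
            a_u, b_u, a_s, b_s = np.exp(log_theta)
            return -np.sum(z * log_L1(a_u, b_u, a_s, b_s, *data)
                           + (1 - z) * log_L0(a_u, b_u, *data))

        theta = np.exp(minimize(neg_expected, np.log(theta), method="Nelder-Mead").x)
    z, _ = e_step(theta, w, data)
    return theta, w, z
```

Working on the log scale through `logaddexp` avoids underflow of $L_0$ and $L_1$, which are astronomically small for realistic cell counts.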


The M step then consists of maximizing the complete-data log-likelihood over $\theta$ after replacing $z_i$ by $\tilde z_i$ in (2). Straightforward calculations lead to $\tilde w = \sum_i \tilde z_i / I$, but unfortunately no closed-form solutions exist for the remaining parameters. We use numerical optimization, as implemented in R's optim function, to estimate the remaining parameters (Ihaka and Gentleman, 1996). Starting from some initial values, the EM algorithm iterates between the E and M steps until convergence. In our case, we initialize the $z_i$'s using Fisher's exact test to assign each observation to either the null or the alternative model component. We then use the estimated $z_i$'s to estimate the $p_{ui}$'s and $p_{si}$'s, and use these to set the hyperparameters to their method-of-moments estimates.

MCMC algorithm

We generated realizations from the posterior distribution via MCMC algorithms (Gelfand, 1996). All updates were done via Metropolis-Hastings sampling, except for the $z_i$'s and $w$, which were updated via Gibbs sampling. Details about the algorithms are given in Web Appendix A. We used the method of Raftery and Lewis (1992) and Raftery (1996) to determine the number of iterations, based on a short pilot run of the sampler. For each data set presented here, we calculated that no more than about 1,000,000 iterations with 50,000 burn-in iterations were sufficient to estimate standard posterior quantities. To leave some margin, we used 2,000,000 iterations after 50,000 burn-in iterations for each data set explored here.
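The Gibbs and Metropolis-Hastings updates described above can be sketched as follows. The paper's samplers are detailed in Web Appendix A and are not reproduced here; this sketch is an illustration under assumptions: it integrates the proportions out (using the marginal likelihoods in (1)), draws $z_i$ and $w$ from their full conditionals, and updates each hyperparameter with a random-walk Metropolis-Hastings step on the log scale under exponential priors with mean $10^3$. The function names, step size and chain length are hypothetical, and the chain is far shorter than the runs reported above.

```python
import numpy as np
from scipy.special import betaln


def log_marginals(theta, n_s, N_s, n_u, N_u):
    """log L0 and log L1 per subject, omitting binomial coefficients (constant in theta)."""
    a_u, b_u, a_s, b_s = theta
    l0 = betaln(n_s + n_u + a_u, N_s - n_s + N_u - n_u + b_u) - betaln(a_u, b_u)
    l1 = (betaln(n_u + a_u, N_u - n_u + b_u) - betaln(a_u, b_u)
          + betaln(n_s + a_s, N_s - n_s + b_s) - betaln(a_s, b_s))
    return l0, l1


def log_target(theta, z, data, prior_mean=1e3):
    """log p(theta | z, y) up to a constant, with independent exponential priors."""
    l0, l1 = log_marginals(theta, *data)
    return np.sum(z * l1 + (1 - z) * l0) - np.sum(theta) / prior_mean


def mcmc_mimosa(data, theta0, n_iter=3000, burn_in=1000, step=0.2, seed=0):
    """Return posterior responder probabilities and the final (theta, w) draw."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n_subjects = len(data[0])
    w = 0.5
    z_total = np.zeros(n_subjects)
    for it in range(n_iter):
        # Gibbs step for z_i given w and theta (proportions integrated out).
        l0, l1 = log_marginals(theta, *data)
        log_num = np.log(w) + l1
        prob = np.exp(log_num - np.logaddexp(np.log1p(-w) + l0, log_num))
        z = (rng.random(n_subjects) < prob).astype(float)
        # Gibbs step for w: Beta full conditional under the Uniform(0, 1) prior.
        w = float(np.clip(rng.beta(1 + z.sum(), 1 + n_subjects - z.sum()), 1e-12, 1 - 1e-12))
        # Metropolis-Hastings: multiplicative random walk, one hyperparameter at a time.
        current = log_target(theta, z, data)
        for j in range(4):
            proposal = theta.copy()
            proposal[j] *= np.exp(step * rng.standard_normal())
            candidate = log_target(proposal, z, data)
            # log(proposal_j / theta_j) corrects for proposing on the log scale.
            if np.log(rng.random()) < candidate - current + np.log(proposal[j] / theta[j]):
                theta, current = proposal, candidate
        if it >= burn_in:
            z_total += z
    return z_total / (n_iter - burn_in), theta, w
```

The running average of the sampled $z_i$ after burn-in estimates $\Pr(z_i = 1 \mid y)$, the quantity the EM algorithm reports as $\tilde z_i$.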

4 Results

In this section, we apply our MIMOSA model to the data described in Section 2 and present the results of a simulation study based on the ICS data. We evaluated and compared the performance of MIMOSA against Fisher's exact test, the likelihood ratio test, and log fold-change by ROC (receiver operating characteristic) curve analysis, and by comparing the observed FDR (false discovery rate) against the nominal FDR (expected false discovery rate) for each data set (Storey, 2002). Here, a false discovery (for the ICS data) is a day 0 sample (non-responder) that is incorrectly identified as a responder by the model (or a competing method).
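One common way to turn posterior probabilities $\tilde z_i$ into a nominal FDR is to call the $k$ subjects with the largest posterior probabilities and estimate the FDR of that list as the average of $1 - \tilde z_i$ over it; the observed FDR is then the fraction of called subjects known to be non-responders (day 0 samples for the ICS data). The sketch below assumes this construction; the paper's exact procedure may differ, and the function names and numbers are hypothetical.

```python
import numpy as np


def nominal_fdr(post_prob):
    """Bayesian FDR estimate for the top-k list, k = 1..n, ranked by posterior probability."""
    post_prob = np.asarray(post_prob, dtype=float)
    order = np.argsort(-post_prob, kind="stable")
    k = np.arange(1, post_prob.size + 1)
    return np.cumsum(1 - post_prob[order]) / k, order


def observed_fdr(is_responder, order):
    """Fraction of true non-responders among the top-k calls, k = 1..n."""
    false_calls = ~np.asarray(is_responder, dtype=bool)[order]
    return np.cumsum(false_calls) / np.arange(1, len(order) + 1)


def n_calls_at(fdr_curve, level):
    """Largest list size whose nominal FDR is at most `level` (the curve is nondecreasing)."""
    return int(np.sum(fdr_curve <= level))


post = np.array([0.99, 0.9, 0.6, 0.2])      # hypothetical posterior probabilities
truth = np.array([True, True, False, False])
fdr, order = nominal_fdr(post)
print(fdr)
print(observed_fdr(truth, order))
print(n_calls_at(fdr, 0.10))
```

Plotting `observed_fdr` against `nominal_fdr` gives the kind of nominal vs. observed comparison used in the figures below.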

4.1 ICS

Using the ICS data, we performed an ROC analysis to assess the sensitivity and specificity of the one-sided MIMOSA model, compared to a one-sided Fisher's exact test, log fold-change, and a likelihood ratio test based on the MIMOSA model, for identifying vaccine responders and non-responders. We considered observations at the day 0 time point as true negatives and observations at the day 182 time point as true positives (potentially underestimating the sensitivity of all methods considered here, because real non-responders at day 182 are treated as true positives). The MIMOSA model has higher sensitivity and specificity than Fisher's exact test, the likelihood ratio test, or log fold-change for discriminating vaccine responders from non-responders, as shown by the ROC curves in Figure 1, panels A, C, and E. At an FDR between 10% and 20%, MIMOSA would lead to about 20% more true positives being detected. Our comparisons also show that ranking based on log-fold change alone is not reliable and should not be used. In addition, MIMOSA gave estimates of the observed false discovery rate that are better than or comparable to those of competing methods (Figure 1, panels B, D, and F). Here we present the results based on IL2 and IFN-γ alone and on the subset IL2 and/or IFN-γ, which were used in the original study (Goepfert et al., 2011). These results are consistent for other cytokines and cytokine combinations (see Web Figure A).


Figure 1: Performance of MIMOSA (EM and MCMC implementations, one-sided model) and competing methods on ICS data from the example flow cytometry data set. Sensitivity and specificity (ROC analysis), as well as observed and nominal false discovery rates, for positivity calls from CD4+ T-cells stimulated with ENV-1-PTEG and expressing A-B) IFN-γ, C-D) IL2, or E-F) IFN-γ and/or IL2. ROC and FDR plots of other cytokine combinations can be found in Web Figure A. This figure appears in color in the electronic version of this article.


4.2 Single-cell gene expression

We applied the MIMOSA model to a Fluidigm single-cell gene expression data set. We used the two-sided MIMOSA model because genes could be regulated upward or downward upon stimulation. In order to detect stimulation-specific changes in expression, we fit our model to each gene within each stimulation. The results presented in Figure 2 show that MIMOSA identifies stimulation-specific differences in the proportions of cells expressing each gene, while preserving inter-subject variability (Figure 2 A, B). These patterns are evident in the posterior probabilities (Figure 2 A) and preserved in the posterior estimates of the differences in proportions (Figure 2 B). A similar analysis using a two-sided Fisher's exact test and clustering the signed FDR-adjusted p-values (Figure 2 C) does not reveal any stimulation-specific patterns. At an FDR of 10%, Fisher's exact test identified 47 significant genes, while MIMOSA identified 50 significant genes; 39 genes were identified by both methods.
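The Fisher's-exact-test comparison relies on FDR-adjusted p-values. Assuming the Benjamini-Hochberg procedure (the text does not name the adjustment method), the adjustment and the signed values used for clustering can be sketched as follows; the function names and inputs are hypothetical.

```python
import numpy as np


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity by taking running minima from the largest p-value down.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def signed_adjusted(p_values, diff):
    """Adjusted p-values carrying the sign of the change (stimulated minus control).

    A zero difference has sign 0, so its signed value is 0 as well.
    """
    return np.sign(diff) * bh_adjust(p_values)


p = [0.01, 0.04, 0.03, 0.2]          # hypothetical two-sided Fisher p-values, one per gene
diff = [0.05, -0.02, 0.01, -0.001]   # hypothetical differences in proportions
q = bh_adjust(p)
selected = q <= 0.10                 # genes called at a 10% FDR
print(q, selected, signed_adjusted(p, diff))
```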

4.3 Simulation Studies

We examined the performance of the constrained ($p_s > p_u$) and unconstrained ($p_s \neq p_u$) beta-binomial mixture models via simulations. For the simulation, we used hyperparameters estimated from a one-sided MIMOSA model fit to ICS data (IL2 univariate) from the primary immunogenicity time point. We simulated data from this constrained model with 200 observations, a response rate of 60%, and N = 1,000, 5,000, and 10,000 events, with ten independent realizations of the data for each N. We fit the one-sided MIMOSA model to these data. We evaluated the model's ability to correctly identify observations from the “responder” and “non-responder” groups through analysis of ROC curves, and compared it against Fisher's exact test, the likelihood ratio test, and log fold-change. We repeated this procedure for the two-sided models fit to two-sided data (Figure 3 A, C). In addition, we examined the nominal vs. observed FDR to assess the ability of each method to properly estimate the FDR (Figure 3 B, D).
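The ROC comparison can be reproduced for any scoring rule (posterior probabilities for MIMOSA, or $1 - p$ for Fisher's exact test, for instance) given the true responder labels of the simulated data. The sketch below computes the empirical ROC curve and its area by the trapezoidal rule, and cross-checks the area with the equivalent Mann-Whitney rank formula; the function names and toy scores are assumptions. Ties are broken by input order in the curve but given half credit in the rank formula, so the two areas agree exactly only when scores are distinct.

```python
import numpy as np
from scipy.stats import rankdata


def roc_curve(scores, labels):
    """False- and true-positive rates as the threshold moves down through the scores."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(labels, dtype=bool)[order]
    tpr = np.concatenate(([0.0], np.cumsum(y) / y.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(~y) / (~y).sum()))
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the empirical ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


def auc_rank(scores, labels):
    """Mann-Whitney form of the AUC; average ranks give ties half credit."""
    labels = np.asarray(labels, dtype=bool)
    ranks = rankdata(scores)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return float((ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))


scores = np.array([0.9, 0.8, 0.7, 0.1])   # hypothetical scores
labels = np.array([1, 0, 1, 0])           # 1 = simulated responder
fpr, tpr = roc_curve(scores, labels)
print(auc_trapezoid(fpr, tpr), auc_rank(scores, labels))
```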

For both the constrained and unconstrained simulations, MIMOSA was superior to competing methods, including Fisher's exact test, with respect to sensitivity and specificity, even at small values of N (Figure 3 A and C, and Web Figure B, panel E). Additionally, the estimated FDR for MIMOSA more closely reflected the nominal FDR than that of Fisher's exact test and the other competing methods (Figure 3, panels B and D, and Web Figure B, panel F).

To assess the sensitivity of the model to deviations from its assumptions, we repeated the simulations with the cell proportions drawn from truncated normal distributions with support (0, 1), rather than from beta distributions. The means and variances of the truncated normal distributions were set to match those of the beta distributions defined by the maximum likelihood estimates of the hyperparameters α and β from the ICS data set (see Web Figure B, panels C and D). Even under these departures from the model assumptions, the unconstrained MIMOSA model outperformed Fisher's exact test.
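A misspecified data-generating distribution of this kind can be sketched with SciPy's `truncnorm`, which takes its truncation bounds in standardized units. In this sketch the parent normal's mean and standard deviation are set to those of Beta($\alpha$, $\beta$); because truncation to (0, 1) shifts the realized moments when the mean is close to 0, this moment matching is an approximation and an assumption of the example, as are the function names and values.

```python
import numpy as np
from scipy.stats import truncnorm


def beta_moments(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


def truncated_normal_proportions(a, b, size, rng):
    """Proportions from a normal truncated to (0, 1), with parent moments matched to Beta(a, b)."""
    mu, var = beta_moments(a, b)
    sd = np.sqrt(var)
    lower, upper = (0 - mu) / sd, (1 - mu) / sd   # bounds in standard-deviation units
    return truncnorm.rvs(lower, upper, loc=mu, scale=sd, size=size, random_state=rng)


rng = np.random.default_rng(0)
p_u = truncated_normal_proportions(2.0, 400.0, size=200, rng=rng)  # hypothetical hyperparameters
n_u = rng.binomial(5_000, p_u)
```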

5 Differential expression across biomarker combinations

Our beta-binomial model described in Section 3.1 can be generalized to a Dirichlet-multinomial model to assess differential expression across multiple biomarker combinations. As described in Section 2, we now have counts for each biomarker combination, denoted by $n_{si} = \{n_{sik} : k = 1, \dots, 2^K\}$ and $n_{ui} = \{n_{uik} : k = 1, \dots, 2^K\}$.


Figure 2: Signed posterior probability, difference and log-odds ratio of the proportion of single cells expressing each gene on a 96 × 96 Fluidigm array. A) The posterior probability of response times the sign of the change in expression (red indicates a decrease, green an increase, relative to the control). Columns and rows are clustered based on these signed posterior probabilities. B) The posterior differences in the proportion of cells expressing a gene in the stimulated vs. control samples. Rows and columns are ordered as in A) for comparison. The traces show the deviations of each cell from zero. Colors along the columns denote different stimulations (green: CMV pp65 nlv5, red: HIV Gag, orange: HIV Nef, yellow: CMV pp65 tm10). C) Clustering of the signed q-values from Fisher's exact test, for genes selected by Fisher's exact test at the 10% FDR level. This figure appears in color in the electronic version of this article.


Figure 3: Comparison of positivity detection methods on data simulated from the one-sided and two-sided models. Ten simulations were generated at an N of 5,000 total counts using hyperparameter estimates from real ICS data (IFN-γ-expressing CD4+ T-cells stimulated with ENV-1-PTEG from HVTN065), with a five-fold effect size between responder and non-responder components. A) Average ROC curve over the 10 simulated data sets (N = 5,000), one-sided model. B) Average observed and nominal false discovery rate over the 10 simulated data sets (N = 5,000), one-sided model. C) Average ROC curves, two-sided model. D) Average observed and nominal FDR, two-sided model. Curves are shown for MIMOSA, Fisher's exact test, the likelihood ratio test, and log fold-change. Results for MIMOSA fit to data violating the model assumptions, as well as for other values of N, are in Web Figure B. This figure appears in color in the electronic version of this article.


5.1 Model

In our multivariate model, the binomial distribution is replaced by a multinomial distribution, as follows:

$$(n_{ui} \mid p_{ui}) \sim \mathcal{M}(N_{ui}, p_{ui}) \quad \text{and} \quad (n_{si} \mid p_{si}) \sim \mathcal{M}(N_{si}, p_{si}),$$

where $N_{\{s,u\}i} = \sum_{k=1}^{2^K} n_{\{s,u\}ik}$ are the numbers of cells collected, and $p_{ui}$ and $p_{si}$ are the unknown vectors of proportions for the unstimulated and stimulated samples, respectively.

5.2 Prior

As in the one-biomarker case, we share information across subjects using an exchangeable prior on the unknown proportions. This time, the beta priors are replaced by Dirichlet priors, such that

$$(p_{ui} \mid z_i = 0) \sim \mathrm{Dir}(\alpha_u), \quad p_{si} = p_{ui},$$
$$(p_{ui} \mid z_i = 1) \sim \mathrm{Dir}(\alpha_u) \quad \text{and} \quad (p_{si} \mid z_i = 1) \sim \mathrm{Dir}(\alpha_s),$$

where the indicator variable $z_i$ is defined in Section 3.2, i.e., $z_i \sim \mathrm{Be}(w)$. As in the beta-binomial case, both EM and MCMC algorithms can be used for parameter estimation. When using a fully Bayesian approach via MCMC, we use the same priors for $\alpha_{\{u,s\}}$ and $w$ as for the beta-binomial model.

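5.3 Parameter estimation

Again, to simplify the estimation problem, we make use of the marginal likelihoods, which can be obtained in closed form (see Web Appendix C). For the null component, the marginal likelihood $L_0$ is given by

$$L_0(\alpha_u \mid n_{si}, n_{ui}) = \frac{B(\alpha_u + n_{ui} + n_{si})}{B(\alpha_u)} \cdot \frac{N_{si}!}{\prod_k n_{sik}!} \cdot \frac{N_{ui}!}{\prod_k n_{uik}!},$$

where $B$ is the $2^K$-dimensional beta function, defined as $B(\alpha) = \prod_k \Gamma(\alpha_k) / \Gamma\big(\sum_k \alpha_k\big)$. Similarly, the marginal likelihood for the alternative model is given by

$$L_1(\alpha_u, \alpha_s \mid n_{si}, n_{ui}) = \frac{B(\alpha_u + n_{ui})\, B(\alpha_s + n_{si})}{B(\alpha_s)\, B(\alpha_u)} \cdot \frac{N_{si}!}{\prod_k n_{sik}!} \cdot \frac{N_{ui}!}{\prod_k n_{uik}!}.$$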
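As in the univariate case, these expressions are best evaluated on the log scale. The sketch below assumes the $2^K$ counts of each sample are stored along the last axis of a NumPy array; the function names and the example counts and hyperparameters are hypothetical. With two categories the functions reduce to the beta-binomial marginal likelihoods of Section 3.3.

```python
import numpy as np
from scipy.special import gammaln


def log_multibeta(alpha):
    """log B(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k)."""
    alpha = np.asarray(alpha, dtype=float)
    return gammaln(alpha).sum(axis=-1) - gammaln(alpha.sum(axis=-1))


def log_multinomial_coef(n):
    """log of N! / prod_k n_k!, with N = sum_k n_k."""
    n = np.asarray(n, dtype=float)
    return gammaln(n.sum(axis=-1) + 1) - gammaln(n + 1).sum(axis=-1)


def log_L0_dm(alpha_u, n_s, n_u):
    """log L0: a shared proportion vector p_s = p_u ~ Dir(alpha_u), integrated out."""
    alpha_u, n_s, n_u = (np.asarray(x, dtype=float) for x in (alpha_u, n_s, n_u))
    return (log_multibeta(alpha_u + n_u + n_s) - log_multibeta(alpha_u)
            + log_multinomial_coef(n_s) + log_multinomial_coef(n_u))


def log_L1_dm(alpha_u, alpha_s, n_s, n_u):
    """log L1: independent Dir(alpha_u) and Dir(alpha_s) priors, integrated out."""
    alpha_u, alpha_s, n_s, n_u = (np.asarray(x, dtype=float)
                                  for x in (alpha_u, alpha_s, n_s, n_u))
    return (log_multibeta(alpha_u + n_u) - log_multibeta(alpha_u)
            + log_multibeta(alpha_s + n_s) - log_multibeta(alpha_s)
            + log_multinomial_coef(n_s) + log_multinomial_coef(n_u))


# Hypothetical K = 2 counts (four combinations) for one subject.
n_u_obs = [900, 80, 5, 15]
n_s_obs = [880, 60, 5, 55]
alpha_u = [50.0, 5.0, 0.5, 1.0]
alpha_s = [40.0, 3.0, 0.5, 3.0]
print(log_L1_dm(alpha_u, alpha_s, n_s_obs, n_u_obs) - log_L0_dm(alpha_u, n_s_obs, n_u_obs))
```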

The estimation procedures (both EM- and MCMC-based) for the Dirichlet-multinomial distribution are the same as for the beta-binomial model, except that the number of parameters to estimate is larger. We initialize the $z_i$ in the EM algorithm with the positivity calls from the multivariate Fisher's exact test. In our experience, the performance of the EM algorithm deteriorates greatly for $K > 3$: it becomes more dependent on the initial values and fails to converge in many instances. Although our MCMC algorithm is slightly more computationally intensive, it does not suffer from this problem and provides a robust alternative when $K$ is large. More details about our multivariate MCMC algorithm are given in Web Appendix C.
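The E-step and the mixing-weight update of the EM algorithm described above can be sketched in Python as follows. This minimal sketch assumes the per-subject log marginal likelihoods $\log L_0$ and $\log L_1$ have already been computed at the current hyperparameters; it omits the $\boldsymbol{\alpha}$ updates, which in general require numerical optimization, and all function names are hypothetical.

```python
from math import exp, log


def e_step(log_l0s, log_l1s, w):
    """Posterior probability that each subject is a responder (z_i = 1).

    Computes w*L1 / (w*L1 + (1 - w)*L0) on the log scale for stability.
    Requires 0 < w < 1.
    """
    responsibilities = []
    for l0, l1 in zip(log_l0s, log_l1s):
        a = log(w) + l1          # log of w * L1
        b = log(1.0 - w) + l0    # log of (1 - w) * L0
        m = max(a, b)
        responsibilities.append(exp(a - m) / (exp(a - m) + exp(b - m)))
    return responsibilities


def update_w(responsibilities):
    """M-step for the mixing weight: mean posterior responder probability."""
    return sum(responsibilities) / len(responsibilities)


def init_from_calls(calls, eps=1e-3):
    """Start z_i at 0/1 positivity calls (e.g., from Fisher's exact test)."""
    z = [1.0 if c else 0.0 for c in calls]
    w = min(max(update_w(z), eps), 1.0 - eps)  # keep w strictly inside (0, 1)
    return z, w


_, w = init_from_calls([True, False, True])
log_l0s = [-10.0, -5.0, -12.0]  # hypothetical values
log_l1s = [-6.0, -7.0, -8.0]    # hypothetical values
for _ in range(5):
    w = update_w(e_step(log_l0s, log_l1s, w))
print(w)
```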

5.4 Polyfunctionality in Fluidigm Single-Cell Gene Expression Data

As a proof of concept, we applied our multivariate MIMOSA model to two specific genes in the Fluidigm data, namely BIRC3 and CCL5. For this example, K = 2, and we have four possible combinations.

Figure 4: Counts of cells expressing different combinations of BIRC3 and CCL5 genes in the A) unstimulated and B) stimulated conditions. No difference is observed from the marginalized counts, while multivariate MIMOSA detects a difference between stimulated and unstimulated conditions in 13 of 16 samples. Sample names highlighted in red identify those where MIMOSA did not detect a difference. This figure appears in color in the electronic version of this article.

In Figure 4 we show heatmaps of the counts of cells expressing all combinations of

the BIRC3 and CCL5 genes in unstimulated and stimulated samples (Figure 4 A,B). Only CCL5-positive cells express BIRC3, and its expression increases upon stimulation. The typical approach

to analyzing poly-functional populations from intracellular cytokine staining data (summing the

counts over all possible polyfunctional cell populations as in IL2+ and/or IFN-γ+) would not be

appropriate in this case, since changes in the counts of these different cell populations occur in

both directions. That is, the number of BIRC3-/CCL5+ cells decreases upon stimulation, while

the number of BIRC3+/CCL5+ cells increases. When marginalizing over these cell populations,

no difference is apparent in any of the samples. In contrast, the multivariate MIMOSA model tests

all polyfunctional cell subpopulations simultaneously, and detects significant differences between

stimulated and unstimulated conditions in 13 of the 16 samples (Figure 4 D, black labels). Testing all combinations simultaneously is an advantage over performing multiple univariate tests on the individual combinations, which requires multiplicity adjustment and can lead to a loss of power.
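The masking effect described above can be reproduced in a few lines of Python. The counts below are hypothetical, chosen only to mimic the pattern in Figure 4 (BIRC3-/CCL5+ cells decreasing while BIRC3+/CCL5+ cells increase); they are not the Fluidigm data. The "and/or" count pools every combination except the all-negative one.

```python
def any_positive(counts):
    """Cells positive for at least one biomarker ("and/or" pooling).

    Index 0 is assumed to be the all-negative combination, so every other
    entry is summed.
    """
    return sum(counts[1:])


# Hypothetical counts, ordered (BIRC3, CCL5) = (-,-), (+,-), (-,+), (+,+).
unstim = [40, 0, 50, 10]
stim = [40, 0, 30, 30]

print(any_positive(unstim), any_positive(stim))  # 60 60: pooled count unchanged
print([s - u for s, u in zip(stim, unstim)])     # [0, 0, -20, 20]: opposite shifts
```

A test on the pooled count sees no change, whereas a test on the full vector of four combination counts sees two opposite shifts of 20 cells.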

Since the Fluidigm data set has a limited number of observations (100 cells and 16 samples), we

could not look at more than two biomarkers at once. Therefore, we performed simulations in eight

dimensions to assess the power of the multivariate MIMOSA model compared to Fisher's exact test and the likelihood ratio test on the resulting 2 × 8 tables (Figure 5 A-C). These results show that multivariate MIMOSA has substantially greater power to detect true differences in multivariate data, even with small counts and small effect sizes, and that the model fits the data better than the competing standard approaches tested (Figure 5 B).
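For reference, a likelihood ratio test on each 2 × 8 table can be based on the G statistic. The Python sketch below is a textbook formulation assumed for illustration, not code from the paper; a p-value would then come from the chi-square distribution with 7 degrees of freedom (for example via a statistics library).

```python
from math import log


def g_statistic(row_u, row_s):
    """Likelihood ratio (G) statistic for a 2 x C contingency table.

    G = 2 * sum O * log(O / E), where E are the expected counts when both
    rows share the same column proportions; cells with O = 0 contribute zero.
    Under that null, G is approximately chi-square with C - 1 degrees of freedom.
    """
    col_totals = [u + s for u, s in zip(row_u, row_s)]
    grand = sum(col_totals)
    g = 0.0
    for row in (row_u, row_s):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            if observed > 0:
                expected = row_total * col_total / grand
                g += observed * log(observed / expected)
    return 2.0 * g


# Hypothetical unstimulated and stimulated rows of a 2 x 8 table (K = 3).
row_u = [1400, 20, 30, 5, 25, 4, 6, 10]
row_s = [1390, 22, 28, 6, 30, 5, 7, 12]
print(g_statistic(row_u, row_s), "on", len(row_u) - 1, "degrees of freedom")
```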

6 Discussion

Experimentalists already have access to a myriad of single-cell assays such as flow cytometry, mass

cytometry and multiplexed quantitative-PCR, to name a few. As single-cell assays become even

more routine, particularly once sequencing at the single-cell level becomes practical (Ramsköld et al., 2012), the development of effective statistical methods to detect differences in gene or protein expression at the single-cell level is becoming increasingly important.

Figure 5: Multivariate simulations from a two-sided model. Ten eight-dimensional data sets were simulated from a two-sided model with effect sizes of $2.5 \times 10^{-3}$ and $-2.5 \times 10^{-3}$ in two of the eight dimensions (N = 1,500). Multivariate MIMOSA was compared against Fisher's exact test and the likelihood ratio test. A) Average ROC curves for the competing methods over 10 simulations. B) Average observed and nominal false discovery rate for each method over 10 simulations. This figure appears in color in the electronic version of this article.

Current approaches for single-cell assays are, for the most part, simple tests such as the t-test, $\chi^2$ test, and Fisher's exact test, and the resulting

inference can be quite sub-optimal, especially when the cell counts are small. Most importantly,

these methods do not share information across samples, resulting in less power to detect true

differences than empirical-Bayes and hierarchical modeling approaches, which are widely applied

in the microarray literature (Kendziorski et al., 2003, Newton et al., 2001, Smyth et al., 2005). In

addition, most of these methods are univariate in nature and inappropriate for high-dimensional, next-generation single-cell assays.

The MIMOSA model presented here uses a mixture model framework of beta-binomial or Dirichlet-multinomial distributions to model counts in experimental subjects across multiple conditions (e.g., stimulated and unstimulated samples), with mixture components corresponding to vaccine responders and non-responders. Information is shared across responders and

non-responders through exchangeable beta or Dirichlet priors, increasing the power to detect true

differences between treatment and control conditions compared to Fisher’s exact test, even when

the underlying model assumptions are violated (Figure 3 and Web Figure B). The univariate MIMOSA model based on the beta-binomial distribution allows us to constrain the alternative hypothesis to the case $p_s > p_u$, where the proportion of cells in the stimulated sample is strictly greater than the proportion of cells in the matched unstimulated sample. This constraint has proven useful for the ICS data, where stimulation-induced changes are expected to be one-sided.

Although we used two single-cell assay platforms as motivating examples, our MIMOSA model

can be applied to any type of single-cell assay where cells are dichotomized into positive and negative

sets, counted, and compared across different conditions. In the case of the Fluidigm data, most analysis methods have focused on identifying differences in the continuous part of the signal, either ignoring cells in which the gene is undetected (i.e., not expressed) or using that information for pre-filtering (Flatz et al., 2011). The ability of MIMOSA to identify stimulation-specific expression patterns in single-cell gene expression data demonstrates not only the broader utility of the method but also, importantly, that biologically relevant signal is present in the proportion of cells expressing each gene under different conditions (Figure 2 A-C).

Detecting differences in poly-functional cell populations (i.e., identifying changes in cell popula-

tions that co-express multiple proteins, cytokines, or genes) is important in immunology, since it al-

lows the identification of more precisely defined, more homogeneous cell populations (Milush et al.,

2009). In the context of HIV, poly-functional cell populations have been shown to be correlated


with long-term disease non-progression, while in the context of vaccination studies (e.g., in Leishmania), polyfunctional responses have been correlated with protection from disease (Betts et al.,

2006, Darrah et al., 2007, Precopio et al., 2007). In the ICS data used here, stimulation is expected to increase only the number of antigen-specific cells detected. Hence, if a specific cell subset expressing multiple biomarkers is differentially expressed, the difference should also be detectable in the marginal cell counts. As such, polyfunctional cytokine profiles can be identified from ICS data iteratively: first, univariate tests are performed on the marginal populations, and then the specific cell subsets expressing the biomarkers detected as positive are tested.

However, this iterative (univariate) approach might not be satisfactory due to the large number

of possible combinations that need to be tested, and a multivariate approach might be preferable.

In that case, as others have pointed out, in order to have the most power to detect a true differ-

ence, the statistical test should be selected taking into account only the cytokine combinations of

interest (Nason, 2006).

For two-sided changes, as with the Fluidigm data, changes in polyfunctional cell populations are not always detectable when looking at the marginal populations (Figure 4 A-C). In such cases, a multivariate model such as our Dirichlet-multinomial model becomes important for detecting differential biomarker expression. Here, we have shown that MIMOSA has higher sensitivity and specificity than the competing methods in identifying true differences between conditions in multivariate count data (Figure 4 A, and Figure 5 A,C), and that the model generally provides a better fit to the single-cell assay count data obtained from studies with these types of experimental designs (Figure 5 B). Unfortunately, the limited number of samples in the Fluidigm data prevented us from looking at co-expression involving more than two genes. With more than two biomarkers, the number of parameters to estimate for our Dirichlet-multinomial model is $2^{K+1} + 1$, which is large even for moderate values of $K$. For example, we would need both a large number of subjects and a large number of events (cells) to properly estimate the 33 parameters for $K = 4$.

A solution would be to explore alternative model parameterizations that could be used to reduce

the number of required parameters. For example, one could assume that the hyperparameters are constant across biomarker combinations, i.e., $\alpha_{\{s,u\}k} = \alpha_{\{s,u\}}$ for all $k$, which would reduce the number of parameters to 3 for any $K$. As attractive as this might sound, such a model would be unrealistic, given that certain stimulations are known to induce expression of some biomarkers more than others. More exploratory work will be needed in this area once high-dimensional single-cell data with large numbers of samples become available.
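The parameter counts quoted above follow from simple arithmetic: the full model has one Dirichlet hyperparameter per combination for each of $\boldsymbol{\alpha}_u$ and $\boldsymbol{\alpha}_s$, plus the mixing weight $w$, giving $2 \cdot 2^K + 1 = 2^{K+1} + 1$; the shared-hyperparameter model keeps only $\alpha_u$, $\alpha_s$, and $w$. The hypothetical helper below just tabulates both.

```python
def n_params(k, shared_alpha=False):
    """Count free parameters of the multivariate MIMOSA mixture.

    Full model: a 2^K-vector for each of alpha_u and alpha_s, plus w.
    Shared model: one scalar each for alpha_u and alpha_s, plus w.
    """
    n_alpha = 1 if shared_alpha else 2 ** k
    return 2 * n_alpha + 1


for k in range(1, 7):
    print(k, n_params(k), n_params(k, shared_alpha=True))
# K = 4 gives 33 parameters for the full model and 3 for the shared one.
```

For K = 1 the full count is 5, matching the univariate beta-binomial model ($\alpha_u$, $\beta_u$, $\alpha_s$, $\beta_s$, $w$).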

All of the results presented here were obtained with a software implementation of the EM and MCMC MIMOSA models in R and C++, which is freely available from GitHub

(http://www.github.org/gfinak/MIMOSA). An R package will soon be released as part of the

Bioconductor project (Gentleman et al., 2004).

Supplementary Materials

Web Appendices A, B, and C and Web Figures A and B, referenced in Sections 2, 3, 3.3, and 5, are available in the attached Web-based supplementary material.

Acknowledgments

This work was supported by the Intramural Research Program of the National Institute of Allergy

and Infectious Diseases (NIAID) and the National Institutes of Health (NIH), and by grants R01

EB008400 and U01 AI068635-01 to RG, grants #OPP38744, and #OPP1032317 from the Bill &

Melinda Gates Foundation to VISC (Vaccine Immunology Statistical Center), grant #OPP1032325
