ABSTRACT: We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices. We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing is developed to successively generate "artificial" posterior distributions that, at the limiting temperature in the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, including analyses of handwritten digit image and cancer gene expression data sets.
Article · Jan 2009 · Journal of Machine Learning Research
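As a minimal illustration of the covariance/precision structure the GFM abstract describes (a hypothetical toy construction, not the paper's learning algorithm): with a sparse loadings matrix whose columns have disjoint supports, variables that share no factor pick up exact zeros in the implied precision matrix, i.e. conditional independence.

```python
import numpy as np

# Hypothetical toy loadings: variables 0-1 load on factor 0, variables 2-3
# on factor 1 (disjoint supports). Values are invented for illustration.
B = np.array([[1.0, 0.0],
              [0.8, 0.0],
              [0.0, 1.0],
              [0.0, 0.7]])
psi = 0.5                                  # idiosyncratic noise variance
Sigma = B @ B.T + psi * np.eye(4)          # implied covariance
Omega = np.linalg.inv(Sigma)               # implied precision

# Variables 0 and 2 share no factor: the precision entry vanishes,
# while variables sharing a factor keep a nonzero partial correlation.
print(round(Omega[0, 2], 8), round(Omega[0, 1], 3))
```

This shows only the structural point (zeros in the precision, not just the covariance); the paper's contribution is the search over sparse configurations, which this sketch does not attempt.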
ABSTRACT: We present a suite of Bayesian hierarchical models that synthesize ensembles of climate model simulations, with the aim of reconciling different future projections of climate change while characterizing their uncertainty in a rigorous fashion. Posterior distributions of future temperature and/or precipitation changes at regional scales are obtained, accounting for many peculiar data characteristics, such as systematic biases, model-specific precisions, region-specific effects, and changes in trend with increasing rates of greenhouse gas emissions. We expand on many important issues characterizing model experiments and their collection into multi-model ensembles. We also address the needs of impact research by proposing posterior predictive distributions as a representation of probabilistic projections. In addition, the calculation of the posterior predictive distribution for a new set of model data allows a rigorous cross-validation approach to confirm the reasonableness of our Bayesian model assumptions.
ABSTRACT: The tumor microenvironment has a significant impact on tumor development. Two important determinants in this environment are hypoxia and lactic acidosis. Although lactic acidosis has long been recognized as an important factor in cancer, relatively little is known about how cells respond to lactic acidosis and how that response relates to cancer phenotypes. We develop genome-scale gene expression studies to dissect transcriptional responses of primary human mammary epithelial cells to lactic acidosis and hypoxia in vitro and to explore how they are linked to clinical tumor phenotypes in vivo. The resulting experimental signatures of responses to lactic acidosis and hypoxia are evaluated in a heterogeneous set of breast cancer datasets. A strong lactic acidosis response signature identifies a subgroup of low-risk breast cancer patients having distinct metabolic profiles suggestive of a preference for aerobic respiration. The association of lactic acidosis response with good survival outcomes may relate to the role of lactic acidosis in directing energy generation toward aerobic respiration and utilization of other energy sources via inhibition of glycolysis. This "inhibition of glycolysis" phenotype in tumors is likely caused by the repression of glycolysis gene expression and Akt inhibition. Our study presents a genomic evaluation of the prognostic information of a lactic acidosis response independent of the hypoxic response. Our results identify causal roles of lactic acidosis in metabolic reprogramming, and the direct functional consequence of lactic acidosis pathway activity on cellular responses and tumor development. The study also demonstrates the utility of genomic analysis that maps expression-based findings from in vitro experiments to human samples to assess links to in vivo clinical phenotypes.
ABSTRACT: We describe studies in molecular profiling and biological pathway analysis that use sparse latent factor and regression models for microarray gene expression data. We discuss breast cancer applications and key aspects of the modeling and computational methodology. Our case studies aim to investigate and characterize heterogeneity of structure related to specific oncogenic pathways, as well as links between aggregate patterns in gene expression profiles and clinical biomarkers. Based on the metaphor of statistically derived "factors" as representing biological "subpathway" structure, we explore the decomposition of fitted sparse factor models into pathway subcomponents and investigate how these components overlay multiple aspects of known biological activity. Our methodology is based on sparsity modeling of multivariate regression, ANOVA, and latent factor models, as well as a class of models that combines all components. Hierarchical sparsity priors address questions of dimension reduction and multiple comparisons, as well as scalability of the methodology. The models include practically relevant non-Gaussian/nonparametric components for latent structure, underlying often quite complex non-Gaussianity in multivariate expression patterns. Model search and fitting are addressed through stochastic simulation and evolutionary stochastic search methods that are exemplified in the oncogenic pathway studies. Supplementary supporting material provides more details of the applications, as well as examples of the use of freely available software tools for implementing the methodology.
Article · Dec 2008 · Journal of the American Statistical Association
ABSTRACT: Statistical mixture modeling provides an opportunity for automated identification and resolution of cell subtypes in flow cytometric data. The configuration of cells as represented by multiple markers simultaneously can be modeled arbitrarily well as a mixture of Gaussian distributions in the dimension of the number of markers. Cellular subtypes may be related to one or multiple components of such mixtures, and fitted mixture models can be evaluated in the full set of markers as an alternative, or adjunct, to traditional subjective gating methods that rely on choosing one or two dimensions. Four-color flow data from human blood cells labeled with FITC-conjugated anti-CD3, PE-conjugated anti-CD8, PE-Cy5-conjugated anti-CD4, and APC-conjugated anti-CD19 Abs were acquired on a FACSCalibur. Cells from four murine cell lines, JAWS II, RAW 264.7, CTLL-2, and A20, were also stained with FITC-conjugated anti-CD11c, PE-conjugated anti-CD11b, PE-Cy5-conjugated anti-CD8a, and PE-Cy7-conjugated-CD45R/B220 Abs, respectively, and single-color flow data were collected on an LSRII. The data were fitted with a mixture of multivariate Gaussians using standard Bayesian statistical approaches and Markov chain Monte Carlo computations. Statistical mixture models were able to identify and purify major cell subsets in human peripheral blood, using an automated process that can be generalized to an arbitrary number of markers. Validation against both traditional expert gating and synthetic mixtures of murine cell lines with known mixing proportions was also performed. This article describes the studies of statistical mixture modeling of flow cytometric data, and demonstrates their utility in examples with four-color flow data from human peripheral blood samples and synthetic mixtures of murine cell lines.
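A stand-in sketch of the mixture-modeling idea in this abstract (plain EM on one synthetic fluorescence channel with two invented cell subsets; the paper itself uses multivariate Gaussians fitted by Bayesian MCMC, which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one marker channel: two well-separated cell subsets.
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(6.0, 1.0, 500)])

# Plain EM for a two-component univariate Gaussian mixture.
mu, sd, w = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(100):
    # E-step: responsibility of each component for each cell
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update weights, means, standard deviations
    n = r.sum(axis=0)
    w = n / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n)

print(sorted(np.round(mu, 1)))
```

Each fitted component plays the role of a candidate cell subtype; "gating" then amounts to assigning each cell to its highest-responsibility component.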
ABSTRACT: Noise in gene expression (stochastic variation in the composition of the transcriptome in response to stimuli) may play an important role in maintaining robustness and flexibility, which ensure the stability of normal physiology and provide adaptability to environmental changes for the living system. Broad-based technologies have allowed us to study with unprecedented accuracy the molecular profiles of various states of health and cardiovascular disease. In doing so, we have observed a correlation between the degree of variation in gene expression and the state of health. Specifically, the stochastic variation in gene expression in response to environmental and physiological factors is found in healthy mice, and tends to disappear in mice with advanced disease states. Although further evidence is needed to draw a solid conclusion with respect to the significance of decreased transcriptional noise in the disease state as a whole, it is tantalizing to introduce the concept that stochasticity may be linked to the organism's adaptability to a changing environment, and the "quiet" states of gene expression may indicate the loss of diversity in the organism's response.
Article · Aug 2008 · Trends in Cardiovascular Medicine
ABSTRACT: An important goal of research involving gene expression data for outcome prediction is to establish the ability of genomic data to define clinically relevant risk factors. Recent studies have demonstrated that microarray data can successfully cluster patients into low- and high-risk categories. However, the need exists for models which examine how genomic predictors interact with existing clinical factors and provide personalized outcome predictions. We have developed clinico-genomic tree models for survival outcomes which use recursive partitioning to subdivide the current data set into homogeneous subgroups of patients, each with a specific Weibull survival distribution. These trees can provide personalized predictive distributions of the probability of survival for individuals of interest. Our strategy is to fit multiple models; within each model we adopt a prior on the Weibull scale parameter and update this prior via Empirical Bayes whenever the sample is split at a given node. The decision to split is based on a Bayes factor criterion. The resulting trees are weighted according to their relative likelihood values and predictions are made by averaging over models. In a pilot study of survival in advanced stage ovarian cancer we demonstrate that clinical and genomic data are complementary sources of information relevant to survival, and we use the exploratory nature of the trees to identify potential genomic biomarkers worthy of further study.
Article · Feb 2008 · Statistical Methodology
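A crude sketch of the tree's split decision (exponential likelihoods, i.e. Weibull with shape fixed at 1, and a Bayes-factor-style evidence threshold; the survival times are hypothetical and the paper's Empirical Bayes prior updates are not reproduced):

```python
import math

def exp_loglik(times):
    """Maximized exponential log-likelihood for uncensored survival times."""
    rate = len(times) / sum(times)          # MLE of the hazard rate
    return sum(math.log(rate) - rate * t for t in times)

# Hypothetical survival times (months) in the two child nodes of a split.
low_risk = [30.0, 42.0, 55.0, 61.0]
high_risk = [4.0, 6.0, 9.0, 12.0]
pooled = low_risk + high_risk

# Evidence for splitting: fitted two-node log-likelihood minus pooled.
gain = exp_loglik(low_risk) + exp_loglik(high_risk) - exp_loglik(pooled)
split = gain > math.log(10)                 # "strong evidence" style cutoff
print(split)
```

In the paper the split criterion is a genuine Bayes factor under a Weibull model with an Empirical-Bayes-updated scale prior; this stand-in only shows the shape of the decision (split when the two-node fit beats the pooled fit by a sufficient margin).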
ABSTRACT: The most common application of microarray technology in disease research is to identify genes differentially expressed in disease versus normal tissues. However, it is known that, in complex diseases, phenotypes are determined not only by genes, but also by the underlying structure of genetic networks. Often, it is the interaction of many genes that causes phenotypic variations.
In this work, using cancer as an example, we develop graph-based methods to integrate multiple microarray datasets to discover disease-related co-expression network modules. We propose an unsupervised method that takes into account both co-expression dynamics and network topological information to simultaneously infer network modules and the phenotype conditions in which they are activated or de-activated. Using our method, we have discovered network modules specific to cancer or subtypes of cancers. Many of these modules are consistent with or supported by their functional annotations or their previously known involvement in cancer. In particular, we identified a module that is predominantly activated in breast cancer and is involved in tumor suppression. While individual components of this module have been suggested to be associated with tumor suppression, their coordinated function has never been elucidated. Here, by adopting a network perspective, we have identified their interrelationships and, particularly, a hub gene PDGFRL that may play an important role in this tumor suppressor network.
Using a network-based approach, our method provides new insights into the complex cellular mechanisms that characterize cancer and cancer subtypes. By incorporating co-expression dynamics information, our approach can not only extract more functionally homogeneous modules than those based solely on network topology, but also reveal pathway coordination beyond co-expression.
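A toy version of the co-expression-module idea (correlation thresholding plus connected components on synthetic expression data; the paper's method additionally exploits co-expression dynamics across datasets and network topology, which this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical toy expression matrix: genes 0-2 driven by one latent signal,
# genes 3-4 by another, across 50 conditions.
base1, base2 = rng.normal(size=50), rng.normal(size=50)
expr = np.vstack([base1 + 0.1 * rng.normal(size=50) for _ in range(3)] +
                 [base2 + 0.1 * rng.normal(size=50) for _ in range(2)])

corr = np.corrcoef(expr)
adj = np.abs(corr) > 0.8                    # co-expression edges

def modules(adj):
    """Connected components of the thresholded graph = candidate modules."""
    n, seen, comps = len(adj), set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(u for u in range(n) if adj[v, u] and u not in comp)
        seen |= comp
        comps.append(sorted(comp))
    return comps

print(modules(adj))
```

The two recovered components correspond to the two co-regulated gene groups; a module-phenotype association step (as in the paper) would then test in which conditions each module is active.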
ABSTRACT: Imaging of single-cell dynamics is playing an increasingly central role in systems biology studies. Frame-by-frame cell segmentation and lineage analysis from image sequences is central to the goal of data extraction for systems studies, and poses challenges for automating analyses that will apply to multiple studies with differing cell types. From this motivating perspective, we have developed a novel and comprehensive method for automating cell segmentation/lineage reconstruction; this is encoded in CellTracer, a software package that provides access to these methods through an integrated graphical user interface written in Matlab. The software provides manual or semi-automated tools to correct problems arising from the automated algorithms. Studies in dynamic imaging and lineage reconstruction with bacteria, yeast and human cells, as well as multiple other examples of segmentation of different object types, demonstrate the analyses and the effective improvements over existing tools that CellTracer offers as a multi-purpose and easy-to-use analysis approach.
ABSTRACT: In this chapter we are concerned with the problem of controlling a robot manipulator (i.e. a multijointed robot arm) to follow a given trajectory; this is known as the inverse dynamics problem. We consider a robot manipulator with ℓ revolute joints, and denote the joint angles by q_{1:ℓ}. Similarly, the joint velocities and accelerations are denoted by q̇_{1:ℓ} and q̈_{1:ℓ}, respectively. For brevity we set x = (q_{1:ℓ}, q̇_{1:ℓ}, q̈_{1:ℓ})′ ∈ ℝ^{3ℓ}. Our aim is then to learn (or estimate) the inverse dynamics of the robot from data; that is, to learn the ℓ torque functions τ_{1:ℓ}(x), τ : ℝ^{3ℓ} → ℝ^ℓ. It might be thought that estimating τ(x) would be unnecessary given knowledge of the physics of the robot. Indeed, for a simple and highly structured robot manipulator, it is often possible to find an analytical form for the input/output mapping that is needed to compute the torques, for example using inverse models based on rigid body dynamics derived from the Newton-Euler algorithm (Featherstone, 1987). These models are parameterized in terms of kinematic and dynamic parameters. The latter, which include the mass, centre of mass and moments of inertia of each link, are usually unknown even to the manufacturers of the robots (An et al., 1988). The calibration of these dynamic parameters is neither trivial nor robust, for example Armstrong et al. (1986) estimated them for a PUMA 560 arm by disassembling it and measuring the properties of
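The estimation problem above can be sketched with ordinary linear regression on physically motivated features for a single hypothetical joint (the "true" model τ = I·q̈ + b·q̇ + c·sin(q) and its coefficients are invented for illustration; the chapter's subject, Gaussian-process regression for τ(x), is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical single-joint dynamics: inertia 2.0, viscous friction 0.5,
# gravity-like term 9.8*sin(q). Sample random states x = (q, qdot, qddot).
q, qd, qdd = rng.uniform(-np.pi, np.pi, (3, 200))
tau = 2.0 * qdd + 0.5 * qd + 9.8 * np.sin(q) + 0.01 * rng.normal(size=200)

# Least-squares fit of the torque function on a feature map of x.
X = np.column_stack([qdd, qd, np.sin(q)])
coef, *_ = np.linalg.lstsq(X, tau, rcond=None)
print(np.round(coef, 1))
```

With a correct feature map the dynamic parameters are recovered directly from data, which is the motivation the abstract gives for learning τ(x) rather than calibrating each link by disassembly.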
ABSTRACT: The incorporation of unlabeled data in regression and classification analysis is an increasing focus of the applied statistics and machine learning literatures, with a number of recent examples demonstrating the potential for unlabeled data to contribute to improved predictive accuracy. The statistical basis for this semisupervised analysis does not appear to have been well delineated; as a result, the underlying theory and rationale may be underappreciated, especially by nonstatisticians. There is also room for statisticians to become more fully engaged in the vigorous research in this important area of intersection of the statistical and computer sciences. Much of the theoretical work in the literature has focused, for example, on geometric and structural properties of the unlabeled data in the context of particular algorithms, rather than probabilistic and statistical questions. This paper overviews the fundamental statistical foundations for predictive modeling and the general questions associated with unlabeled data, highlighting the relevance of venerable concepts of sampling design and prior specification. This theory, illustrated with a series of central illustrative examples and two substantial real data analyses, shows precisely when, why and how unlabeled data matter.
Article · Oct 2007 · Statistical Science
ABSTRACT: In high-throughput genomics, large-scale designed experiments are becoming common, and analysis approaches based on highly multivariate regression and ANOVA concepts are key tools. Shrinkage models of one form or another can provide comprehensive approaches to the problems of simultaneous inference that involve implicit multiple comparisons over the many, many parameters representing effects of design factors and covariates. We use such approaches here in a study of cardiovascular genomics. The primary experimental context concerns a carefully designed, and rich, gene expression study focused on gene-environment interactions, with the goals of identifying genes implicated in connection with disease states and known risk factors, and in generating expression signatures as proxies for such risk factors. A coupled exploratory analysis investigates cross-species extrapolation of gene expression signatures--how these mouse-model signatures translate to humans. The latter involves exploration of sparse latent factor analysis of human observational data and of how it relates to projected risk signatures derived in the animal models. The study also highlights a range of applied statistical and genomic data analysis issues, including model specification, computational questions and model-based correction of experimental artifacts in DNA microarray data.
Article · Oct 2007 · The Annals of Applied Statistics
ABSTRACT: We introduce and exemplify an efficient method for direct sampling from hyper-inverse Wishart distributions. The method relies very naturally on the use of standard junction-tree representation of graphs, and couples these with matrix results for inverse Wishart distributions. We describe the theory and resulting computational algorithms for both decomposable and nondecomposable graphical models. An example drawn from financial time series demonstrates application in a context where inferences on a structured covariance model are required. We discuss and investigate questions of scalability of the simulation methods to higher-dimensional distributions. The paper concludes with general comments about the approach, including its use in connection with existing Markov chain Monte Carlo methods that deal with uncertainty about the graphical model structure. Copyright 2007, Oxford University Press.
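A building-block sketch related to this abstract (the standard Bartlett decomposition for an ordinary inverse Wishart with identity scale; the paper's junction-tree construction for hyper-inverse Wisharts on graphs is considerably more involved and is not attempted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def inv_wishart_identity(df, p):
    """One draw from IW(df, I_p) via the Bartlett decomposition."""
    A = np.zeros((p, p))
    for i in range(p):
        A[i, i] = np.sqrt(rng.chisquare(df - i))   # chi variates on the diagonal
        A[i, :i] = rng.normal(size=i)              # standard normals below
    W = A @ A.T                                    # Wishart(df, I_p) draw
    return np.linalg.inv(W)                        # inverse Wishart draw

S = inv_wishart_identity(df=10, p=3)
print(bool(np.all(np.linalg.eigvalsh(S) > 0)))     # positive definite draw
```

The hyper-inverse Wishart sampler in the paper stitches draws like this together clique-by-clique along a junction tree so that the resulting covariance respects the graph's conditional independence structure.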
ABSTRACT: The purpose of this study was to develop an integrated genomic-based approach to personalized treatment of patients with advanced-stage ovarian cancer. We have used gene expression profiles to identify patients likely to be resistant to primary platinum-based chemotherapy and also to identify alternate targeted therapeutic options for patients with de novo platinum-resistant disease.
A gene expression model that predicts response to platinum-based therapy was developed using a training set of 83 advanced-stage serous ovarian cancers and tested on a 36-sample external validation set. In parallel, expression signatures that define the status of oncogenic signaling pathways were evaluated in 119 primary ovarian cancers and 12 ovarian cancer cell lines. In an effort to increase chemotherapy sensitivity, pathways shown to be activated in platinum-resistant cancers were subject to targeted therapy in ovarian cancer cell lines.
Gene expression profiles identified patients with ovarian cancer likely to be resistant to primary platinum-based chemotherapy with greater than 80% accuracy. In patients with platinum-resistant disease, we identified expression signatures consistent with activation of Src and Rb/E2F pathways, components of which were successfully targeted to increase response in ovarian cancer cell lines.
We have defined a strategy for treatment of patients with advanced-stage ovarian cancer that uses therapeutic stratification based on predictions of response to chemotherapy, coupled with prediction of oncogenic pathway deregulation, as a method to direct the use of targeted agents.
Article · Mar 2007 · Journal of Clinical Oncology
ABSTRACT: We consider the problem of estimating an unknown function based on noisy data using nonparametric regression. One approach to this estimation problem is to represent the function in a series expansion using a linear combination of basis functions. Overcomplete dictionaries provide a larger, but redundant, collection of generating elements than a basis; however, coefficients in the expansion are no longer unique. Despite the non-uniqueness, this has the potential to lead to sparser representations by using fewer non-zero coefficients. Compound Poisson random fields and their generalization to Lévy random fields are ideally suited for construction of priors on functions using these overcomplete representations for the general nonparametric regression problem, and provide a natural limiting generalization of priors for the finite dimensional version of the regression problem. While expressions for posterior modes or posterior distributions of quantities of interest are not available in closed form, the prior construction using Lévy random fields permits tractable posterior simulation via a reversible jump Markov chain Monte Carlo algorithm. Efficient computation is possible because updates based on adding/deleting or updating single dictionary elements bypass the need to invert large matrices. Furthermore, because dictionary elements are only computed as needed, memory requirements scale linearly with the sample size. In comparison with other methods, the Lévy random field priors provide excellent performance in terms of both mean squared error and coverage for out-of-sample predictions.
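A hedged stand-in for the overcomplete-dictionary idea (greedy matching pursuit on a synthetic signal, rather than the paper's Lévy-prior reversible jump MCMC): a redundant dictionary of Gaussian bumps at multiple scales lets a few non-zero coefficients capture most of a smooth signal.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-bump signal plus noise (invented for illustration).
t = np.linspace(0, 1, 200)
signal = np.exp(-((t - 0.3) / 0.05) ** 2) - 0.7 * np.exp(-((t - 0.7) / 0.08) ** 2)
y = signal + 0.01 * rng.normal(size=t.size)

# Overcomplete dictionary: 50 centers x 4 widths = 200 redundant atoms.
centers = np.linspace(0, 1, 50)
widths = [0.03, 0.05, 0.08, 0.12]
D = np.column_stack([np.exp(-((t - c) / w) ** 2)
                     for c in centers for w in widths])
D /= np.linalg.norm(D, axis=0)              # unit-norm atoms

# Greedy matching pursuit: repeatedly add the best-matching single atom.
resid, coef = y.copy(), np.zeros(D.shape[1])
for _ in range(10):                         # keep at most 10 atoms
    k = int(np.argmax(np.abs(D.T @ resid)))
    a = D[:, k] @ resid
    coef[k] += a
    resid -= a * D[:, k]

print(round(float(np.linalg.norm(resid) / np.linalg.norm(y)), 2))
```

As in the abstract's computational point, each update touches a single dictionary element at a time, so no large matrix inversion is ever needed; the Bayesian machinery (which this sketch omits) additionally quantifies uncertainty over which atoms are present.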
ABSTRACT: This paper deals with the detection of multiple changepoints for independent but non-identically distributed observations, which are assumed to be modeled by a linear regression with normal errors. The problem has a natural formulation as a model selection problem, and the main difficulty in computing model posterior probabilities is that neither the reference priors nor any form of empirical Bayes factors based on real training samples can be employed. We propose an analysis based on intrinsic priors, which do not require real training samples and provide a feasible and sensible solution. For the case of changes in the regression coefficients, very simple formulas for the prospective and the retrospective detection of changepoints are found. On the other hand, as the sample size grows, the number of possible changepoints grows as well, and consequently so does the number of models involved. A stochastic search for finding only those models having large posterior probability is provided. Illustrative examples based on simulated and real data are given.
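A minimal sketch of the changepoint-as-model-selection idea (maximum-likelihood profiling of a single mean shift with known variance on synthetic data; the paper's intrinsic-prior Bayes factors and multiple-changepoint stochastic search are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic series with one mean shift at index 60 (invented for illustration).
y = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(2.0, 1.0, 40)])

def seg_loglik(seg):
    """Normal log-likelihood (unit variance, up to a constant) at the segment mean."""
    return -0.5 * np.sum((seg - seg.mean()) ** 2)

# Each candidate changepoint k defines a two-segment model; score them all.
scores = [seg_loglik(y[:k]) + seg_loglik(y[k:]) for k in range(5, len(y) - 5)]
k_hat = 5 + int(np.argmax(scores))
print(k_hat)
```

The paper replaces this likelihood profiling with posterior model probabilities under intrinsic priors, and replaces the exhaustive scan with a stochastic search once multiple changepoints make enumeration infeasible.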