ABSTRACT: Although typically cosseted in the laboratory with constant temperatures and plentiful nutrients, microbes are frequently exposed to much more stressful conditions in their natural environments, where survival and competitive fitness depend both on growth rate when conditions are favourable and on persistence in a viable and recoverable state when they are not. To determine the role of genetic heterogeneity in environmental fitness, we present a novel approach that combines the power of fluorescence-activated cell sorting (FACS) with barcode microarray analysis, and apply it in a high-throughput, genome-wide fitness screen to assess the importance of every gene in the Saccharomyces cerevisiae genome. We grew >6,000 heterozygous mutants together, exposed them to a starvation stress, and then used FACS to identify and isolate the individual cells that did not survive the stress. Barcode array analysis of the sorted and total populations reveals the importance of cellular recycling mechanisms (autophagy, pexophagy and ribosome breakdown) in maintaining cell viability during starvation, and provides compelling evidence for an important role for fatty acid degradation in maintaining viability. In addition, we have developed a semi-batch fermentor system that is a more realistic model of environmental fitness than either batch or chemostat culture. Barcode array analysis revealed that arginine biosynthesis was important for fitness in semi-batch culture, and modelling of this regime showed that rapid emergence from lag phase led to greatly increased fitness. One hundred and twenty-five strains with deletions in unclassified proteins were identified as being over-represented in the sorted fraction, while 27 unclassified proteins caused a haploinsufficient phenotype in semi-batch culture.
These methods thus provide a screen for identifying other genes and pathways that have a role in maintaining cell viability.
ABSTRACT: For many learning problems, estimates of the inverse population covariance are required and often obtained by inverting the sample covariance matrix. Increasingly for modern scientific data sets, the number of sample points is less than the number of features and so the sample covariance is not invertible. In such circumstances, the Moore-Penrose pseudo-inverse sample covariance matrix, constructed from the eigenvectors corresponding to nonzero sample covariance eigenvalues, is often used as an approximation to the inverse population covariance matrix. The reconstruction error of the pseudo-inverse sample covariance matrix in estimating the true inverse covariance can be quantified via the Frobenius norm of the difference between the two. The reconstruction error is dominated by the smallest nonzero sample covariance eigenvalues and diverges as the sample size becomes comparable to the number of features. For high-dimensional data, we use random matrix theory techniques and results to study the reconstruction error for a wide class of population covariance matrices. We also show how bagging and random subspace methods can result in a reduction in the reconstruction error and can be combined to improve the accuracy of classifiers that utilize the pseudo-inverse sample covariance matrix. We test our analysis on both simulated and benchmark data sets.
IEEE Transactions on Pattern Analysis and Machine Intelligence 08/2011; · 4.80 Impact Factor
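The setup described in the abstract above is easy to reproduce in a few lines: with fewer sample points than features the sample covariance is rank-deficient, its Moore-Penrose pseudo-inverse is built from the nonzero eigenvalues only, and the Frobenius-norm reconstruction error against the true inverse is directly computable. A minimal sketch (the diagonal population covariance and the specific sizes are arbitrary choices for illustration, not taken from the paper):

```python
import numpy as np

# p < N makes the sample covariance singular, so its Moore-Penrose
# pseudo-inverse stands in for a true inverse; the Frobenius norm of the
# difference from the true inverse population covariance is the
# reconstruction error discussed in the abstract.
rng = np.random.default_rng(0)
N, p = 50, 30                                 # features N exceed sample points p
C = np.diag(np.linspace(1.0, 2.0, N))         # arbitrary population covariance
X = rng.multivariate_normal(np.zeros(N), C, size=p)

S = X.T @ X / p                               # sample covariance, rank <= p < N
S_pinv = np.linalg.pinv(S)                    # built from nonzero eigenvalues only
err = np.linalg.norm(S_pinv - np.linalg.inv(C), ord="fro")
```

Because the rank of `S` is at most `p`, the pseudo-inverse simply ignores the null space, which is exactly why its error is dominated by the smallest nonzero sample eigenvalues.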
ABSTRACT: Statistical mechanics techniques have proved to be useful tools in quantifying the accuracy with which signal vectors are extracted from experimental data. However, analysis has previously been limited to specific model forms for the population covariance C, which may be inappropriate for real world data sets. In this paper we obtain new statistical mechanical results for a general population covariance matrix C. For data sets consisting of p sample points in ℝ^N, we use the replica method to study the accuracy of orthogonal signal vectors estimated from the sample data. In the asymptotic limit N → ∞ at fixed α = p/N, we derive analytical results for the signal direction learning curves. In the asymptotic limit the learning curves follow a single universal form, each displaying a retarded learning transition. An explicit formula for the location of the retarded learning transition is obtained and we find marked variation in its location dependent on the distribution of population covariance eigenvalues. The results of the replica analysis are confirmed against simulation.
Journal of Statistical Mechanics: Theory and Experiment 04/2010; 2010(04):P04009. · 1.87 Impact Factor
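The retarded learning transition described above can be seen in a small simulation: below the critical sample ratio the leading sample eigenvector carries essentially no information about a planted signal direction, and well above it the overlap approaches one. This is a hedged sketch for the simplest spiked (single-signal) population covariance, not the general-C replica calculation of the paper; the signal strength and dimensions are invented for illustration:

```python
import numpy as np

# One planted signal direction e_1 with population eigenvalue 5, all other
# eigenvalues 1. The overlap between the leading sample eigenvector and the
# signal grows with the sample ratio alpha = p/N.
rng = np.random.default_rng(1)
N = 200
C = np.eye(N)
C[0, 0] = 5.0                                  # planted signal strength

def overlap(p):
    """|cosine| between the leading sample eigenvector and the signal e_1."""
    X = rng.multivariate_normal(np.zeros(N), C, size=p)
    S = X.T @ X / p                            # sample covariance from p points
    leading = np.linalg.eigh(S)[1][:, -1]      # eigh returns ascending order
    return abs(leading[0])

low, high = overlap(20), overlap(2000)         # alpha = 0.1 vs alpha = 10
```

For small alpha the overlap is weak and fluctuation-dominated; for large alpha the estimated direction locks onto the signal, which is the qualitative content of the learning curves derived in the paper.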
ABSTRACT: Increasingly, genome-wide association studies are being used to identify positions within the human genome that have a link with a disease condition. The number of genomic locations studied means that computationally intensive and bioinformatically intensive solutions will have to be used in the analysis of these data sets. In this paper we present an integrated Workbench that provides user-friendly access to parallelized statistical genetics analysis codes for clinical researchers. In addition, we biologically annotate statistical analysis results through the reuse of existing bioinformatic Taverna workflows.
AMIA Summits on Translational Science Proceedings. 01/2010; 2010:18-22.
ABSTRACT: Nitrogen-containing bisphosphonates are the drugs of choice for the treatment of diseases in which excessive bone resorption occurs, for example, osteoporosis and cancer-induced bone diseases. The only known target of nitrogen-containing bisphosphonates is farnesyl pyrophosphate synthase, which ensures prenylation of prosurvival proteins, such as Ras. However, it is likely that the action of nitrogen-containing bisphosphonates involves additional unknown mechanisms. To identify novel targets of nitrogen-containing bisphosphonates, we used a genome-wide high-throughput screen in which 5,936 Saccharomyces cerevisiae heterozygote barcoded mutants were grown competitively in the presence of sub-lethal doses of three nitrogen-containing bisphosphonates (risedronate, alendronate and ibandronate). Strains carrying deletions in genes encoding potential drug targets show a variation in the intensity of their corresponding barcodes on the hybridization array over time.
With this approach, we identified novel targets of nitrogen-containing bisphosphonates, such as tubulin cofactor B and ASK/DBF4 (Activator of S-phase kinase). The up-regulation of tubulin cofactor B may explain some previously unknown effects of nitrogen-containing bisphosphonates on microtubule dynamics and organization. As nitrogen-containing bisphosphonates induce extensive DNA damage, we also document the role of DBF4 as a key player in nitrogen-containing bisphosphonate-induced cytotoxicity, thus explaining the effects on the cell cycle.
The dataset obtained from the yeast screen was validated in a mammalian system, allowing the discovery of new biological processes involved in the cellular response to nitrogen-containing bisphosphonates and opening up opportunities for development of new anticancer drugs.
ABSTRACT: The study of the genetics of diseases is entering a new era. Increasingly, genome-wide association studies are being used to identify positions within the human genome that have a link with a disease condition. The number of genomic locations studied means that High Performance Computing (HPC) solutions will increasingly have to be used in the statistical analysis of these data sets. Understanding the biomedical implications of the statistical analysis will also require heavy use of bioinformatics annotation tools. In this paper we report the outcome of developing HPC statistical genetics analysis codes for use by clinical researchers. Statistical results are automatically annotated with relevant biological information by calling multiple web-services orchestrated via pre-existing scientific workflows. Access to the HPC codes and bioinformatics annotation processes is via a client Workbench which hides the HPC infrastructure and bioinformatics annotation processes from the user as far as possible, whilst aiding the exchange of ideas and results between stakeholders.
Studies in health technology and informatics 02/2009; 147:232-41.
ABSTRACT: The study of the genetics of diseases has been revolutionised by the advent of genome-wide genotyping technologies. Increasingly, genome-wide association studies are being used to identify positions within the human genome that have a link with a disease condition. These new data sets require the use of distributed resources, both for the statistical analysis and for the interpretation of the analysis results. Aiding the latter will be crucial for the statistical analysis process to be successful. In this paper we report our experiences in developing a user-friendly High Performance Computing (HPC) statistical genetics analysis platform for use by clinical researchers. Specifically, we report work on supporting the interpretation process through the automatic annotation of the statistical analysis results with relevant biological information. Retrieval of the biological annotation is performed by high-volume invocation of multiple Web-services orchestrated via pre-existing scientific workflows. We also report work on developing tools to aid the capture and replay of the processes performed by a user when exploring analysis results.
Proceedings of the Twenty-Second IEEE International Symposium on Computer-Based Medical Systems, August 3-4, 2009, Albuquerque, New Mexico, USA; 01/2009
ABSTRACT: Detection of copy number variations (CNV) on microarrays can be hampered by the presence of Single Nucleotide Polymorphisms (SNP). In order to be confident of calling genuine CNV rather than SNP, multiple contiguous probes are required to have non-zero log2 signal ratios. Consequently, only large CNV (> ~5 kb) can be detected on typical long-oligo CNV arrays with probe densities of ~1 per 2 kb. However, the majority of CNV are probably <5 kb (Nat Genet 2006, 38:82-85).
SNP data from the Perlegen 8 million SNP set and log2 signal ratios from ~300,000 long oligos were integrated in order to characterise the effect of SNP on log2 signal ratio and the effect of the position of the SNP within the probe. The maximum length of perfect match between probe and target appeared to be the dominant factor that affected hybridisation. The reduction in effective length of the probe meant that single base changes could have a large effect on signal ratio and therefore be detectable on the long oligo arrays. Sequence differences were only expected to give high log2 signal ratios in our study design; therefore probes with low log2 signal ratios were potentially caused by CNV. Approximately 1000 probes with low log2 signal ratios were identified which were candidates for small CNV that would not have been identified by existing analysis approaches.
Most single probe aberrations appeared to be caused by genuine biological variants and were not due to experimental noise. Long-oligo CGH arrays can therefore provide more information than previously thought. The position specific effect of SNP will be useful for microarray design.
ABSTRACT: Using competition experiments in continuous cultures grown in different nutrient environments (glucose-limited, ammonium-limited, phosphate-limited and white grape juice), we identified genes that show haploinsufficiency phenotypes (reduced growth rate when hemizygous) or haploproficiency phenotypes (increased growth rate when hemizygous). Haploproficient genes (815, 1,194, 733 and 654 in glucose-limited, ammonium-limited, phosphate-limited and white grape juice environments, respectively) frequently show that phenotype in a specific environmental context. For instance, genes encoding components of the ubiquitination pathway or the proteasome show haploproficiency in nitrogen-limited conditions where protein conservation may be beneficial. Haploinsufficiency is more likely to be observed in all environments, as is the case with genes determining polar growth of the cell. Haploproficient genes seem randomly distributed in the genome, whereas haploinsufficient genes (685, 765, 1,277 and 217 in glucose-limited, ammonium-limited, phosphate-limited and white grape juice environments, respectively) are over-represented on chromosome III. This chromosome determines the yeast's mating type, and the concentration of haploinsufficient genes there may be a mechanism to prevent its loss.
ABSTRACT: Microarrays are an important and widely used tool. Applications include capturing genomic DNA for high-throughput sequencing in addition to the traditional monitoring of gene expression and identifying DNA copy number variations. Sequence mismatches between probe and target strands are known to affect the stability of the probe-target duplex, and hence the strength of the observed signals from microarrays.
We describe a large-scale investigation of microarray hybridisations to murine probes with known sequence mismatches, demonstrating that the effect of mismatches is strongly position-dependent and, for small numbers of sequence mismatches, is correlated with the maximum length of perfectly matched probe-target duplex. Length of perfect match explained 43% of the variance in log2 signal ratios between probes with one and two mismatches. The correlation with maximum length of perfect match does not conform to expectations based on considering the effect of mismatches purely in terms of reducing the binding energy. However, it can be explained qualitatively by considering the entropic contribution to duplex stability from configurations of differing perfect match length.
The results of this study have implications for array design and analysis. They highlight the significant effect that short sequence mismatches can have upon microarray hybridisation intensities, even for long oligonucleotide probes. All microarray data presented in this study are available from the GEO database under accession number [GEO: GSE9669].
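The quantity driving the correlation above — the maximum length of perfectly matched probe-target duplex — is straightforward to compute for an ungapped alignment. A small sketch (the sequences are invented for illustration):

```python
def max_perfect_match(probe, target):
    """Length of the longest run of consecutive matching bases in an
    ungapped probe-target alignment."""
    best = run = 0
    for a, b in zip(probe, target):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best

# A single central mismatch roughly halves the maximum perfect-match length,
# which is why one mismatch can noticeably destabilise even a long duplex.
print(max_perfect_match("ACGTACGTAC", "ACGTTCGTAC"))  # -> 5
```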
ABSTRACT: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been shown to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach.
Journal of Machine Learning Research (JMLR). 01/2008;
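A Laplace approximation of the kind used above replaces a peaked evidence integral with a Gaussian integral around the mode. A one-dimensional toy sketch (the log-integrand g is an arbitrary example chosen for illustration, not the PCA evidence from the paper):

```python
import numpy as np

# Laplace approximation to Z = integral of exp(M g(t)) dt: expand g to
# second order around its mode t*, giving
#   Z ~ exp(M g(t*)) * sqrt(2 pi / (M |g''(t*)|)).
M = 50.0
g = lambda t: -(t - 0.3) ** 2 / 2 - t ** 4 / 12   # toy peaked log-integrand

t = np.linspace(-3.0, 3.0, 200001)
y = np.exp(M * g(t))
exact = float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2)   # trapezoid rule

t_star = t[np.argmax(g(t))]                        # mode located on the grid
h = 1e-4
g2 = (g(t_star + h) - 2 * g(t_star) + g(t_star - h)) / h ** 2
laplace = float(np.exp(M * g(t_star)) * np.sqrt(2 * np.pi / (M * abs(g2))))
rel_err = abs(laplace - exact) / exact
```

The relative error shrinks as M grows; the point of the paper is that in high dimensions the parameter-space contribution to the evidence must be handled with comparable care.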
ABSTRACT: The study of the genetic causes of disease is entering a new era. Variations in DNA sequence between individuals at a single position (locus) within the human genome are termed single nucleotide polymorphisms (SNPs), and may lead to a frank disease state or a variation in normal physiology. By comparing and contrasting the genomes of people who have a disease with the genomes of people who don't, we can begin to identify those genetic loci which potentially play a role in the disease. Modern biotechnology allows for the genotyping of individuals at hundreds of thousands of genetic loci. Whilst metrics to quantify the statistical importance of a single locus are essentially of low complexity, for example calculation of a χ² statistic, within a genome-wide association study this process is repeated at every locus. In addition, the entire computational process is often repeated with a number of randomised data sets, necessary for estimation of the statistical significance. The large number of loci, number of randomised data sets, and rapid combinatorial increase when analysing multiple SNPs, naturally dictates that a high performance computing (HPC) solution be developed. On a single-core machine, analysis of significant numbers of SNP pairs would take many years. Once statistical analysis of the data has been performed, results must be annotated with relevant information to aid biological interpretation and hypothesis generation - a standard, but not insubstantial, bioinformatic task.
Fourth International Conference on e-Science, e-Science 2008, 7-12 December 2008, Indianapolis, IN, USA; 01/2008
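The per-locus statistic mentioned above is cheap in isolation; the cost comes from repeating it over hundreds of thousands of loci and again for each randomised data set. A minimal χ² computation for one locus (the 2×2 allele counts are invented for illustration):

```python
import numpy as np

# Pearson chi-squared statistic for a single locus: a 2x2 table of
# (invented) allele counts, rows = cases/controls, columns = alleles A/a.
obs = np.array([[120.0, 80.0],
                [90.0, 110.0]])
row = obs.sum(axis=1, keepdims=True)
col = obs.sum(axis=0, keepdims=True)
expected = row @ col / obs.sum()          # expected counts under independence
chi2 = float(((obs - expected) ** 2 / expected).sum())
# A genome-wide scan repeats this at every locus, and again for each
# permuted data set, which is what motivates the HPC approach above.
```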
ABSTRACT: Saccharomyces boulardii, a yeast that was isolated from fruit in Indochina, has been used as a remedy for diarrhea since 1950 and is now a commercially available treatment throughout Europe, Africa, and South America. Though initially classified as a separate species of Saccharomyces, recent publications have shown that the genome of S. boulardii is so similar to Saccharomyces cerevisiae that the two should be classified as conspecific. This raises the question of the distinguishing molecular and phenotypic characteristics present in S. boulardii that make it perform more effectively as a probiotic organism compared to other strains of S. cerevisiae. This investigation reports some of these distinguishing characteristics including enhanced ability for pseudohyphal switching upon nitrogen limitation and increased resistance to acidic pH. However, these differences did not correlate with increased adherence to epithelial cells or transit through mouse gut. Pertinent characteristics of the S. boulardii genome such as trisomy of chromosome IX, altered copy number of a number of individual genes, and sporulation deficiency have been revealed by comparative genome hybridization using oligonucleotide-based microarrays coupled with a rigorous statistical analysis. The contributions of the different genomic and phenotypic features of S. boulardii to its probiotic nature are discussed.
Applied and Environmental Microbiology 05/2007; 73(8):2458-67. · 3.95 Impact Factor
ABSTRACT: The serious biological consequences of metal toxicity are well documented, but the key modes of action of most metals are unknown. To help unravel molecular mechanisms underlying the action of chromium, a metal of major toxicological importance, we grew over 6,000 heterozygous yeast mutants in competition in the presence of chromium. Microarray-based screens of these heterozygotes are truly genome-wide as they include both essential and non-essential genes.
The screening data indicated that proteasomal (protein degradation) activity is crucial for cellular chromium (Cr) resistance. Further investigations showed that Cr causes the accumulation of insoluble and toxic protein aggregates, which predominantly arise from proteins synthesised during Cr exposure. A protein-synthesis defect provoked by Cr was identified as mRNA mistranslation, which was oxygen-dependent. Moreover, Cr exhibited synergistic toxicity with a ribosome-targeting drug (paromomycin) that is known to act via mistranslation, while manipulation of translational accuracy modulated Cr toxicity.
The datasets from the heterozygote screen represent an important public resource that may be exploited to discover the toxic mechanisms of chromium. That potential was validated here with the demonstration that mRNA mistranslation is a primary cause of cellular Cr toxicity.
ABSTRACT: Cell growth underlies many key cellular and developmental processes, yet a limited number of studies have been carried out on cell-growth regulation. Comprehensive studies at the transcriptional, proteomic and metabolic levels under defined controlled conditions are currently lacking.
Metabolic control analysis is being exploited in a systems biology study of the eukaryotic cell. Using chemostat culture, we have measured the impact of changes in flux (growth rate) on the transcriptome, proteome, endometabolome and exometabolome of the yeast Saccharomyces cerevisiae. Each functional genomic level shows clear growth-rate-associated trends and discriminates between carbon-sufficient and carbon-limited conditions. Genes consistently and significantly upregulated with increasing growth rate are frequently essential and encode evolutionarily conserved proteins of known function that participate in many protein-protein interactions. In contrast, more unknown, and fewer essential, genes are downregulated with increasing growth rate; their protein products rarely interact with one another. A large proportion of yeast genes under positive growth-rate control share orthologs with other eukaryotes, including humans. Significantly, transcription of genes encoding components of the TOR complex (a major controller of eukaryotic cell growth) is not subject to growth-rate regulation. Moreover, integrative studies reveal the extent and importance of post-transcriptional control, patterns of control of metabolic fluxes at the level of enzyme synthesis, and the relevance of specific enzymatic reactions in the control of metabolic fluxes during cell growth.
This work constitutes a first comprehensive systems biology study on growth-rate control in the eukaryotic cell. The results have direct implications for advanced studies on cell growth, in vivo regulation of metabolic fluxes for comprehensive metabolic engineering, and for the design of genome-scale systems biology models of the eukaryotic cell.
ABSTRACT: The learning of signal directions in high-dimensional data through orthogonal decomposition or principal component analysis (PCA) has many important applications in physics and engineering disciplines, e.g., wireless communication, information theory, and econophysics. The accuracy of the orthogonal decomposition can be studied using mean-field theory. Previous analysis of data produced from a model with a single signal direction has predicted a retarded learning phase transition below which learning is not possible, i.e., if the signal is too weak or the data set is too small then it is impossible to learn anything about the signal direction or magnitude. In this contribution we show that the result can be generalized to the case where there are multiple signal directions. Each nondegenerate signal is associated with a retarded learning transition. However, fluctuations around the mean-field solution lead to large finite size effects unless the signal strengths are very well separated. We evaluate the one-loop contribution to the mean-field theory, which shows that signal directions are indistinguishable from one another if their corresponding population eigenvalues are separated by O(N^(-τ)) with exponent τ > 1/3, where N is the data dimension. Numerical simulations are consistent with the analysis and show that finite size effects can persist even for very large data sets.
ABSTRACT: Machine learning is used in a large number of bioinformatics applications and studies. The application of machine learning techniques in other areas such as pattern recognition has resulted in accumulated experience as to correct and principled approaches for their use. The aim of this paper is to give an account of issues affecting the application of machine learning tools, focusing primarily on general aspects of feature and model parameter selection, rather than any single specific algorithm. These aspects are discussed in the context of published bioinformatics studies in leading journals over the last 5 years. We assess to what degree the experience gained by the pattern recognition research community pervades these bioinformatics studies. Finally, we discuss various critical issues relating to bioinformatic data sets and make a number of recommendations on the proper use of machine learning techniques for bioinformatics research, based upon previously published research on machine learning.
Computers in Biology and Medicine 11/2006; 36(10):1104-25. · 1.48 Impact Factor
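One recurring pitfall in the kind of studies surveyed above is performing feature selection on the full data set before cross-validation. A sketch of the resulting optimism on pure-noise data (the nearest-centroid classifier and the mean-difference selection rule are arbitrary choices for the illustration, not a method from the paper):

```python
import numpy as np

# Selection bias: choosing features with the labels of the FULL data set
# before cross-validation inflates estimated accuracy, while selecting
# inside each training fold stays near chance on pure-noise data.
rng = np.random.default_rng(0)
n, d, k = 40, 1000, 10
X = rng.normal(size=(n, d))
y = np.array([0, 1] * (n // 2))            # labels carry no real signal

def top_features(Xtr, ytr):
    diff = np.abs(Xtr[ytr == 0].mean(axis=0) - Xtr[ytr == 1].mean(axis=0))
    return np.argsort(diff)[-k:]           # k features most separated by class

def cv_accuracy(select_inside_fold):
    correct = 0
    feats_all = top_features(X, y)         # selection on ALL data: the mistake
    for test_idx in np.array_split(np.arange(n), 5):
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        f = (top_features(X[train_idx], y[train_idx])
             if select_inside_fold else feats_all)
        c0 = X[train_idx][y[train_idx] == 0][:, f].mean(axis=0)
        c1 = X[train_idx][y[train_idx] == 1][:, f].mean(axis=0)
        for i in test_idx:
            pred = int(np.linalg.norm(X[i, f] - c1) < np.linalg.norm(X[i, f] - c0))
            correct += int(pred == y[i])
    return correct / n

biased, honest = cv_accuracy(False), cv_accuracy(True)
```

The biased estimate is well above chance even though the data contain no signal; keeping every data-dependent step inside the cross-validation loop removes the optimism.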
ABSTRACT: Determining protein sequence similarity is an important task for protein classification and homology detection. Typically this may be done using sequence alignment algorithms, yet fast and accurate alignment-free kernel-based classifiers exist. Viewing sequences as a “bag of words”, we test a simple weighted string kernel, investigating the effects of k-mer length, sequence length and choice of weighting. We also extend the kernel to operate on the k-mer frequency representation of a sequence rather than the “bag of words” representation.
Keywords: Protein Classification - Homology - String Kernel
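A minimal unweighted version of the kernel described above — treating a sequence as a bag of k-mers and taking the inner product of the count vectors — can be sketched as follows (the per-k-mer weighting and frequency normalisation studied in the paper are omitted, and the sequences are invented):

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Bag-of-words representation: counts of all overlapping k-mers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k=3):
    """Unweighted inner product of the two k-mer count vectors."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[w] * ct[w] for w in cs)   # Counter returns 0 for absent k-mers

print(spectrum_kernel("ACGTACG", "ACGTT"))  # shared k-mers: ACG (x2), CGT -> 3
```

Replacing the raw counts with k-mer frequencies, or multiplying each term by a per-k-mer weight, gives the variants the abstract investigates.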