Alvis Brazma

EMBL-EBI, Cambridge, England, United Kingdom

Are you Alvis Brazma?

Claim your profile

Publications (164)1521.09 Total impact

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based. We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome. The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
    BMC Genomics 06/2015; 16 Suppl 8(Suppl 8):S2. DOI:10.1186/1471-2164-16-S8-S2 · 4.04 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Motivation: Public and private repositories of experimental data are growing to sizes that require dedicated methods for finding relevant data. To improve on the state of the art of keyword searches from annotations, methods for content-based retrieval have been proposed. In the context of gene expression experiments, most methods retrieve gene expression profiles, requiring each experiment to be expressed as a single profile, typically of case vs. control. A more general, recently suggested alternative is to retrieve experiments whose models are good for modelling the query dataset. However, for very noisy and high-dimensional query data, this retrieval criterion turns out to be very noisy as well. Results: We propose doing retrieval using a denoised model of the query dataset, instead of the original noisy dataset itself. To this end, we introduce a general probabilistic framework, where each experiment is modelled separately and the retrieval is done by finding related models. For retrieval of gene expression experiments, we use a probabilistic model called product partition model, which induces a clustering of genes that show similar expression patterns across a number of samples. We then show empirically that inference for the full probabilistic model can be approximated with good performance using the computationally fast k-means clustering algorithm. The suggested metric for retrieval using clusterings is the normalized information distance. The method is highly scalable and straightforward to apply to construct a general-purpose gene expression experiment retrieval method. Availability: The method can be implemented using only standard k-means and normalized information distance, available in many standard statistical software packages.
  • [Show abstract] [Hide abstract]
    ABSTRACT: Motivation: The Cellular Phenotype Database (CPD) is a repository for data derived from high-throughput systems microscopy studies. The aims of this resource are: (i) to provide easy access to cellular phenotype and molecular localization data for the broader research community; (ii) to facilitate integration of independent phenotypic studies by means of data aggregation techniques, including use of an ontology; and (iii) to facilitate development of analytical methods in this field. Results: In this article we present CPD, its data structure and user interface, propose a minimal set of information describing RNAi experiments, and suggest a generic schema for management and aggregation of outputs from phenotypic or molecular localization experiments. The database has a flexible structure for management of data from heterogeneous sources of systems microscopy experimental outputs generated by a variety of protocols and technologies and can be queried by gene, reagent, gene attribute, study keywords, phenotype or ontology terms. Availability: CPD is developed as part of the Systems Microscopy Network of Excellence and is accessible at http://www.ebi.ac.uk/fg/sym Contact: jes{at}ebi.ac.uk, ugis{at}ebi.ac.uk
    Bioinformatics 04/2015; DOI:10.1093/bioinformatics/btv199 · 4.62 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is an international functional genomics database at the European Bioinformatics Institute (EMBL-EBI) recommended by most journals as a repository for data supporting peer-reviewed publications. It contains data from over 7000 public sequencing and 42 000 array-based studies comprising over 1.5 million assays in total. The proportion of sequencing-based submissions has grown significantly over the last few years and has doubled in the last 18 months, whilst the rate of microarray submissions is growing slightly. All data in ArrayExpress are available in the MAGE-TAB format, which allows robust linking to data analysis and visualization tools and standardized analysis. The main development over the last two years has been the release of a new data submission tool Annotare, which has reduced the average submission time almost 3-fold. In the near future, Annotare will become the only submission route into ArrayExpress, alongside MAGE-TAB format-based pipelines. ArrayExpress is a stable and highly accessed resource. Our future tasks include automation of data flows and further integration with other EMBL-EBI resources for the representation of multi-omics data.
    Nucleic Acids Research 10/2014; 43(D1). DOI:10.1093/nar/gku1057 · 9.11 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: One purpose of the biomedical literature is to report results in sufficient detail that the methods of data collection and analysis can be independently replicated and verified. Here we present reporting guidelines for gene expression localization experiments: the minimum information specification for in situ hybridization and immunohistochemistry experiments (MISFISHIE). MISFISHIE is modeled after the Minimum Information About a Microarray Experiment (MIAME) specification for microarray experiments. Both guidelines define what information should be reported without dictating a format for encoding that information. MISFISHIE describes six types of information to be provided for each experiment: experimental design, biomaterials and treatments, reporters, staining, imaging data and image characterizations. This specification has benefited the consortium within which it was developed and is expected to benefit the wider research community. We welcome feedback from the scientific community to help improve our proposal.
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The incidence of renal cell carcinoma (RCC) is increasing worldwide, and its prevalence is particularly high in some parts of Central Europe. Here we undertake whole-genome and transcriptome sequencing of clear cell RCC (ccRCC), the most common form of the disease, in patients from four different European countries with contrasting disease incidence to explore the underlying genomic architecture of RCC. Our findings support previous reports on frequent aberrations in the epigenetic machinery and PI3K/mTOR signalling, and uncover novel pathways and genes affected by recurrent mutations and abnormal transcriptome patterns including focal adhesion, components of extracellular matrix (ECM) and genes encoding FAT cadherins. Furthermore, a large majority of patients from Romania have an unexpected high frequency of A:T>T:A transversions, consistent with exposure to aristolochic acid (AA). These results show that the processes underlying ccRCC tumorigenesis may vary in different populations and suggest that AA may be an important ccRCC carcinogen in Romania, a finding with major public health implications.
    Nature Communications 10/2014; 5:5135. DOI:10.1038/ncomms6135 · 10.74 Impact Factor
  • Source
    Nuno A Fonseca · John Marioni · Alvis Brazma
    [Show abstract] [Hide abstract]
    ABSTRACT: Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And, how close are the expression levels inferred to the “true” expression levels? We evaluate fifty gene profiling pipelines in experimental and simulated data sets with different characteristics (e.g, read length and sequencing depth). In the absence of knowledge of the ‘ground truth’ in real RNAseq data sets, we used simulated data to assess the differences between the “true” expression and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNAseq data, it still allows to estimate the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.
    PLoS ONE 09/2014; 9(9-9):e107026. DOI:10.1371/journal.pone.0107026 · 3.23 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Chimeric RNAs originating from two or more different genes are known to exist not only in cancer, but also in normal tissues, where they can play a role in human evolution. However, the exact mechanism of their formation is unknown. Here, we use RNA sequencing data from 462 healthy individuals representing 5 human populations to systematically identify and in depth characterize 81 RNA tandem chimeric transcripts, 13 of which are novel. We observe that 6 out of these 81 chimeras have been regarded as cancer-specific. Moreover, we show that a prevalence of long introns at the fusion breakpoint is associated with the chimeric transcripts formation. We also find that tandem RNA chimeras have lower abundances as compared to their partner genes. Finally, by combining our results with genomic data from the same individuals we uncover intronic genetic variants associated with the chimeric RNA formation. Taken together our findings provide an important insight into the chimeric transcripts formation and open new avenues of research into the role of intronic genetic variants in post-transcriptional processing events.
    PLoS ONE 08/2014; 9(8):e104567. DOI:10.1371/journal.pone.0104567 · 3.23 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The cooperation of transcriptional and post-transcriptional levels of control to shape gene regulation is only partially understood. Here we show that a combination of two simple and non-invasive genomic techniques, coupled with kinetic mathematical modeling, afford insight into the intricate dynamics of RNA regulation in response to oxidative stress in the fission yeast Schizosaccharomyces pombe. This study reveals a dominant role of transcriptional regulation in response to stress, but also points to the first minutes after stress induction as a critical time when the coordinated control of mRNA turnover can support the control of transcription for rapid gene regulation. In addition, we uncover specialized gene expression strategies associated with distinct functional gene groups, such as simultaneous transcriptional repression and mRNA destabilization for genes encoding ribosomal proteins, delayed mRNA destabilization with varying contribution of transcription for ribosome biogenesis genes, dominant roles of mRNA stabilization for genes functioning in protein degradation, and adjustment of both transcription and mRNA turnover during the adaptation to stress. We also show that genes regulated independently of the bZIP transcription factor Atf1p are predominantly controlled by mRNA turnover, and identify putative cis-regulatory sequences that are associated with different gene expression strategies during the stress response. This study highlights the intricate and multi-faceted interplay between transcription and RNA turnover during the dynamic regulatory response to stress.
    RNA Biology 07/2014; 11(6). DOI:10.4161/rna.29196 · 5.38 Impact Factor
  • Keystone Big Data in Biology Symposium (organizers Lincoln D. Stein, Doreen Ware, Michael Schatz), San Francisco, CA; 03/2014
  • Neuromuscular Disorders 03/2014; 24:S22-S23. DOI:10.1016/S0960-8966(14)70075-6 · 3.13 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Expression Atlas (http://www.ebi.ac.uk/gxa) is a value-added database providing information about gene, protein and splice variant expression in different cell types, organism parts, developmental stages, diseases and other biological and experimental conditions. The database consists of selected high-quality microarray and RNA-sequencing experiments from ArrayExpress that have been manually curated, annotated with Experimental Factor Ontology terms and processed using standardized microarray and RNA-sequencing analysis methods. The new version of Expression Atlas introduces the concept of ‘baseline’ expression, i.e. gene and splice variant abundance levels in healthy or untreated conditions, such as tissues or cell types. Differential gene expression data benefit from an in-depth curation of experimental intent, resulting in biologically meaningful ‘contrasts’, i.e. instances of differential pairwise comparisons between two sets of biological replicates. Other novel aspects of Expression Atlas are its strict quality control of raw experimental data, up-to-date RNA-sequencing analysis methods, expression data at the level of gene sets, as well as genes and a more powerful search interface designed to maximize the biological value provided to the user.
    Nucleic Acids Research 12/2013; 42(Database issue). DOI:10.1093/nar/gkt1270 · 9.11 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The BioSamples database at the EBI (http://www.ebi.ac.uk/biosamples) provides an integration point for BioSamples information between technology specific databases at the EBI, projects such as ENCODE and reference collections such as cell lines. The database delivers a unified query interface and API to query sample information across EBI's databases and provides links back to assay databases. Sample groups are used to manage related samples, e.g. those from an experimental submission, or a single reference collection. Infrastructural improvements include a new user interface with ontological and key word queries, a new query API, a new data submission API, complete RDF data download and a supporting SPARQL endpoint, accessioning at the point of submission to the European Nucleotide Archive and European Genotype Phenotype Archives and improved query response times.
    Nucleic Acids Research 11/2013; 42(Database issue). DOI:10.1093/nar/gkt1081 · 9.11 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: RNA sequencing is an increasingly popular technology for genome-wide analysis of transcript sequence and abundance. However, understanding of the sources of technical and interlaboratory variation is still limited. To address this, the GEUVADIS consortium sequenced mRNAs and small RNAs of lymphoblastoid cell lines of 465 individuals in seven sequencing centers, with a large number of replicates. The variation between laboratories appeared to be considerably smaller than the already limited biological variation. Laboratory effects were mainly seen in differences in insert size and GC content and could be adequately corrected for. In small-RNA sequencing, the microRNA (miRNA) content differed widely between samples owing to competitive sequencing of rRNA fragments. This did not affect relative quantification of miRNAs. We conclude that distributing RNA sequencing among different laboratories is feasible, given proper standardization and randomization procedures. We provide a set of quality measures and guidelines for assessing technical biases in RNA-seq data.
    Nature Biotechnology 09/2013; 31(11). DOI:10.1038/nbt.2702 · 39.08 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Genome sequencing projects are discovering millions of genetic variants in humans, and interpretation of their functional effects is essential for understanding the genetic basis of variation in human traits. Here we report sequencing and deep analysis of messenger RNA and microRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project-the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences. We discover extremely widespread genetic variation affecting the regulation of most genes, with transcript structure and expression level variation being equally common but genetically largely independent. Our characterization of causal regulatory variation sheds light on the cellular mechanisms of regulatory and loss-of-function variation, and allows us to infer putative causal variants for dozens of disease-associated loci. Altogether, this study provides a deep understanding of the cellular mechanisms of transcriptome variation and of the landscape of functional variants in the human genome.
    Nature 09/2013; 501(7468). DOI:10.1038/nature12531 · 42.35 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: To mechanistically characterize the microevolutionary processes active in altering transcription factor (TF) binding among closely related mammals, we compared the genome-wide binding of three tissue-specific TFs that control liver gene expression in six rodents. Despite an overall fast turnover of TF binding locations between species, we identified thousands of TF regions of highly constrained TF binding intensity. Although individual mutations in bound sequence motifs can influence TF binding, most binding differences occur in the absence of nearby sequence variations. Instead, combinatorial binding was found to be significant for genetic and evolutionary stability; cobound TFs tend to disappear in concert and were sensitive to genetic knockout of partner TFs. The large, qualitative differences in genomic regions bound between closely related mammals, when contrasted with the smaller, quantitative TF binding differences among Drosophila species, illustrate how genome structure and population genetics together shape regulatory evolution.
    Cell 08/2013; 154(3):530-40. DOI:10.1016/j.cell.2013.07.007 · 33.12 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Oncogenic fusion genes that involve kinases have proven to be effective targets for therapy in a wide range of cancers. Unfortunately, the diagnostic approaches required to identify these events are struggling to keep pace with the diverse array of genetic alterations that occur in cancer. Diagnostic screening in solid tumours is particularly challenging, as many fusion genes occur with a low frequency. To overcome these limitations, we developed a capture enrichment strategy to enable high throughput transcript sequencing of the human kinome. This approach provides a global overview of kinase fusion events, irrespective of the identity of the fusion partner. To demonstrate the utility of this system we profiled one hundred non-small cell lung cancers and identified numerous genetic alterations impacting Fibroblast Growth Factor Receptor 3 (FGFR3) in lung squamous cell carcinoma and a novel ALK fusion partner in lung adenocarcinoma.
    The Journal of Pathology 07/2013; 230(3). DOI:10.1002/path.4209 · 7.43 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: RNA sequencing has opened new avenues for the study of transcriptome composition. Significant evidence has accumulated showing that the human transcriptome contains in excess of a hundred thousand different transcripts. However, it is still not clear to what extent this diversity prevails when considering the relative abundances of different transcripts from the same gene. Here we show that, in a given condition, most protein coding genes have one major transcript expressed at significantly higher level than others, that in human tissues the major transcripts contribute almost 85 percent to the total mRNA from protein coding loci, and that often the same major transcript is expressed in many tissues. We detect a high degree of overlap between the set of major transcripts and a recently published set of alternatively spliced transcripts that are predicted to be translated utilizing proteomic data. Thus, we hypothesize that although some minor transcripts may play a functional role, the major ones are likely to be the main contributors to the proteome. However, we still detect a non-negligible fraction of protein coding genes for which the major transcript does not code a protein. Overall, our findings suggest that the transcriptome from protein coding loci is dominated by one transcript per gene and that not all the transcripts that contribute to transcriptome diversity are equally likely to contribute to protein diversity. This observation can help to prioritize candidate targets in proteomics research and to predict the functional impact of the detected changes in variation studies.
    Genome biology 07/2013; 14(7):R70. DOI:10.1186/gb-2013-14-7-r70 · 10.47 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Genes for the production of a broad range of fungal secondary metabolites are frequently colinear. The prevalence of such gene clusters was systematically examined across the genome of the cereal pathogen Fusarium graminearum. The topological structure of transcriptional networks was also examined to investigate control mechanisms for mycotoxin biosynthesis and other processes. The genes associated with transcriptional processes were identified, and the genomic location of transcription-associated proteins (TAPs) analyzed in conjunction with the locations of genes exhibiting similar expression patterns. Highly conserved TAPs reside in regions of chromosomes with very low or no recombination, contrasting with putative regulator genes. Co-expression group profiles were used to define positionally clustered genes and a number of members of these clusters encode proteins participating in secondary metabolism. Gene expression profiles suggest there is an abundance of condition-specific transcriptional regulation. Analysis of the promoter regions of co-expressed genes showed enrichment for conserved DNA-sequence motifs. Potential global transcription factors recognising these motifs contain distinct sets of DNA-binding domains (DBDs) from those present in local regulators. Proteins associated with basal transcriptional functions are encoded by genes enriched in regions of the genome with low recombination. Systematic searches revealed dispersed and compact clusters of co-expressed genes, often containing a transcription factor, and typically containing genes involved in biosynthetic pathways. Transcriptional networks exhibit a layered structure in which the position in the hierarchy of a regulator is closely linked to the DBD structural class.
    BMC Systems Biology 06/2013; 7(1):52. DOI:10.1186/1752-0509-7-52 · 2.85 Impact Factor
  • Source
    Leo Lahti · Aurora Torrente · Laura L Elo · Alvis Brazma · Johan Rung

Publication Stats

15k Citations
1,521.09 Total Impact Points

Institutions

  • 1998–2015
    • EMBL-EBI
      Cambridge, England, United Kingdom
  • 2010–2012
    • Cancer Research UK Cambridge Institute
      Cambridge, England, United Kingdom
  • 2007–2012
    • Wellcome Trust Sanger Institute
      Cambridge, England, United Kingdom
  • 2009–2011
    • University of Cambridge
      Cambridge, England, United Kingdom
    • Cambridge Institute for Medical Research
      Cambridge, England, United Kingdom
  • 2005
    • British Antarctic Survey
      Cambridge, England, United Kingdom
  • 2004–2005
    • University College Dublin
      • Conway Institute of Biomolecular & Biomedical Research
      Dublin, Leinster, Ireland
    • Stanford University
      • Department of Biochemistry
      Palo Alto, California, United States
  • 1996–2005
    • University of Latvia
      • Institute of Mathematics and Computer Science
      Rija, Riga, Latvia
  • 2002
    • Harvard University
      Cambridge, Massachusetts, United States
    • University of Helsinki
      • Department of Computer Science
      Helsinki, Province of Southern Finland, Finland
    • Instituto de Bioinformatica e Biotecnologia
      Natal, Rio Grande do Norte, Brazil