Alvis Brazma

European Molecular Biology Laboratory, Heidelberg, Baden-Württemberg, Germany

Publications (158) · 1498.88 Total impact

  • ABSTRACT: The ArrayExpress Archive of Functional Genomics Data is an international functional genomics database at the European Bioinformatics Institute (EMBL-EBI) recommended by most journals as a repository for data supporting peer-reviewed publications. It contains data from over 7000 public sequencing and 42 000 array-based studies comprising over 1.5 million assays in total. The proportion of sequencing-based submissions has grown significantly over the last few years and has doubled in the last 18 months, whilst the rate of microarray submissions is growing slightly. All data in ArrayExpress are available in the MAGE-TAB format, which allows robust linking to data analysis and visualization tools and standardized analysis. The main development over the last two years has been the release of a new data submission tool Annotare, which has reduced the average submission time almost 3-fold. In the near future, Annotare will become the only submission route into ArrayExpress, alongside MAGE-TAB format-based pipelines. ArrayExpress is a stable and highly accessed resource. Our future tasks include automation of data flows and further integration with other EMBL-EBI resources for the representation of multi-omics data.
    Nucleic Acids Research 10/2014; DOI:10.1093/nar/gku1057 · 8.81 Impact Factor
  • ABSTRACT: One purpose of the biomedical literature is to report results in sufficient detail that the methods of data collection and analysis can be independently replicated and verified. Here we present reporting guidelines for gene expression localization experiments: the minimum information specification for in situ hybridization and immunohistochemistry experiments (MISFISHIE). MISFISHIE is modeled after the Minimum Information About a Microarray Experiment (MIAME) specification for microarray experiments. Both guidelines define what information should be reported without dictating a format for encoding that information. MISFISHIE describes six types of information to be provided for each experiment: experimental design, biomaterials and treatments, reporters, staining, imaging data and image characterizations. This specification has benefited the consortium within which it was developed and is expected to benefit the wider research community. We welcome feedback from the scientific community to help improve our proposal.
  • Nuno A Fonseca, John Marioni, Alvis Brazma
    ABSTRACT: Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And, how close are the expression levels inferred to the “true” expression levels? We evaluate fifty gene profiling pipelines in experimental and simulated data sets with different characteristics (e.g., read length and sequencing depth). In the absence of knowledge of the ‘ground truth’ in real RNAseq data sets, we used simulated data to assess the differences between the “true” expression and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNAseq data, it still allows us to estimate the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.
    PLoS ONE 09/2014; 9(9-9):e107026. DOI:10.1371/journal.pone.0107026 · 3.53 Impact Factor
  • ABSTRACT: Chimeric RNAs originating from two or more different genes are known to exist not only in cancer, but also in normal tissues, where they can play a role in human evolution. However, the exact mechanism of their formation is unknown. Here, we use RNA sequencing data from 462 healthy individuals representing 5 human populations to systematically identify and in depth characterize 81 RNA tandem chimeric transcripts, 13 of which are novel. We observe that 6 out of these 81 chimeras have been regarded as cancer-specific. Moreover, we show that a prevalence of long introns at the fusion breakpoint is associated with the chimeric transcripts formation. We also find that tandem RNA chimeras have lower abundances as compared to their partner genes. Finally, by combining our results with genomic data from the same individuals we uncover intronic genetic variants associated with the chimeric RNA formation. Taken together our findings provide an important insight into the chimeric transcripts formation and open new avenues of research into the role of intronic genetic variants in post-transcriptional processing events.
    PLoS ONE 08/2014; 9(8):e104567. DOI:10.1371/journal.pone.0104567 · 3.53 Impact Factor
  • ABSTRACT: The cooperation of transcriptional and post-transcriptional levels of control to shape gene regulation is only partially understood. Here we show that a combination of two simple and non-invasive genomic techniques, coupled with kinetic mathematical modeling, affords insight into the intricate dynamics of RNA regulation in response to oxidative stress in the fission yeast Schizosaccharomyces pombe. This study reveals a dominant role of transcriptional regulation in response to stress, but also points to the first minutes after stress induction as a critical time when the coordinated control of mRNA turnover can support the control of transcription for rapid gene regulation. In addition, we uncover specialized gene expression strategies associated with distinct functional gene groups, such as simultaneous transcriptional repression and mRNA destabilization for genes encoding ribosomal proteins, delayed mRNA destabilization with varying contribution of transcription for ribosome biogenesis genes, dominant roles of mRNA stabilization for genes functioning in protein degradation, and adjustment of both transcription and mRNA turnover during the adaptation to stress. We also show that genes regulated independently of the bZIP transcription factor Atf1p are predominantly controlled by mRNA turnover, and identify putative cis-regulatory sequences that are associated with different gene expression strategies during the stress response. This study highlights the intricate and multi-faceted interplay between transcription and RNA turnover during the dynamic regulatory response to stress.
    RNA Biology 07/2014; 11(6). DOI:10.4161/rna.29196 · 5.38 Impact Factor
  • Neuromuscular Disorders 03/2014; 24:S22-S23. DOI:10.1016/S0960-8966(14)70075-6 · 3.13 Impact Factor
  • ABSTRACT: The incidence of renal cell carcinoma (RCC) is increasing worldwide, and its prevalence is particularly high in some parts of Central Europe. Here we undertake whole-genome and transcriptome sequencing of clear cell RCC (ccRCC), the most common form of the disease, in patients from four different European countries with contrasting disease incidence to explore the underlying genomic architecture of RCC. Our findings support previous reports on frequent aberrations in the epigenetic machinery and PI3K/mTOR signalling, and uncover novel pathways and genes affected by recurrent mutations and abnormal transcriptome patterns including focal adhesion, components of extracellular matrix (ECM) and genes encoding FAT cadherins. Furthermore, a large majority of patients from Romania have an unexpectedly high frequency of A:T>T:A transversions, consistent with exposure to aristolochic acid (AA). These results show that the processes underlying ccRCC tumorigenesis may vary in different populations and suggest that AA may be an important ccRCC carcinogen in Romania, a finding with major public health implications.
    Nature Communications 01/2014; 5:5135. DOI:10.1038/ncomms6135 · 10.74 Impact Factor
  • ABSTRACT: Expression Atlas is a value-added database providing information about gene, protein and splice variant expression in different cell types, organism parts, developmental stages, diseases and other biological and experimental conditions. The database consists of selected high-quality microarray and RNA-sequencing experiments from ArrayExpress that have been manually curated, annotated with Experimental Factor Ontology terms and processed using standardized microarray and RNA-sequencing analysis methods. The new version of Expression Atlas introduces the concept of 'baseline' expression, i.e. gene and splice variant abundance levels in healthy or untreated conditions, such as tissues or cell types. Differential gene expression data benefit from an in-depth curation of experimental intent, resulting in biologically meaningful 'contrasts', i.e. instances of differential pairwise comparisons between two sets of biological replicates. Other novel aspects of Expression Atlas are its strict quality control of raw experimental data, up-to-date RNA-sequencing analysis methods, expression data at the level of gene sets, as well as genes and a more powerful search interface designed to maximize the biological value provided to the user.
    Nucleic Acids Research 12/2013; 42(Database issue). DOI:10.1093/nar/gkt1270 · 8.81 Impact Factor
  • ABSTRACT: The BioSamples database at the EBI provides an integration point for BioSamples information between technology specific databases at the EBI, projects such as ENCODE and reference collections such as cell lines. The database delivers a unified query interface and API to query sample information across EBI's databases and provides links back to assay databases. Sample groups are used to manage related samples, e.g. those from an experimental submission, or a single reference collection. Infrastructural improvements include a new user interface with ontological and key word queries, a new query API, a new data submission API, complete RDF data download and a supporting SPARQL endpoint, accessioning at the point of submission to the European Nucleotide Archive and European Genotype Phenotype Archives and improved query response times.
    Nucleic Acids Research 11/2013; 42(Database issue). DOI:10.1093/nar/gkt1081 · 8.81 Impact Factor
  • ABSTRACT: Genome sequencing projects are discovering millions of genetic variants in humans, and interpretation of their functional effects is essential for understanding the genetic basis of variation in human traits. Here we report sequencing and deep analysis of messenger RNA and microRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project, the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences. We discover extremely widespread genetic variation affecting the regulation of most genes, with transcript structure and expression level variation being equally common but genetically largely independent. Our characterization of causal regulatory variation sheds light on the cellular mechanisms of regulatory and loss-of-function variation, and allows us to infer putative causal variants for dozens of disease-associated loci. Altogether, this study provides a deep understanding of the cellular mechanisms of transcriptome variation and of the landscape of functional variants in the human genome.
    Nature 09/2013; DOI:10.1038/nature12531 · 42.35 Impact Factor
  • ABSTRACT: RNA sequencing is an increasingly popular technology for genome-wide analysis of transcript sequence and abundance. However, understanding of the sources of technical and interlaboratory variation is still limited. To address this, the GEUVADIS consortium sequenced mRNAs and small RNAs of lymphoblastoid cell lines of 465 individuals in seven sequencing centers, with a large number of replicates. The variation between laboratories appeared to be considerably smaller than the already limited biological variation. Laboratory effects were mainly seen in differences in insert size and GC content and could be adequately corrected for. In small-RNA sequencing, the microRNA (miRNA) content differed widely between samples owing to competitive sequencing of rRNA fragments. This did not affect relative quantification of miRNAs. We conclude that distributing RNA sequencing among different laboratories is feasible, given proper standardization and randomization procedures. We provide a set of quality measures and guidelines for assessing technical biases in RNA-seq data.
    Nature Biotechnology 09/2013; DOI:10.1038/nbt.2702 · 39.08 Impact Factor
  • ABSTRACT: To mechanistically characterize the microevolutionary processes active in altering transcription factor (TF) binding among closely related mammals, we compared the genome-wide binding of three tissue-specific TFs that control liver gene expression in six rodents. Despite an overall fast turnover of TF binding locations between species, we identified thousands of TF regions of highly constrained TF binding intensity. Although individual mutations in bound sequence motifs can influence TF binding, most binding differences occur in the absence of nearby sequence variations. Instead, combinatorial binding was found to be significant for genetic and evolutionary stability; cobound TFs tend to disappear in concert and were sensitive to genetic knockout of partner TFs. The large, qualitative differences in genomic regions bound between closely related mammals, when contrasted with the smaller, quantitative TF binding differences among Drosophila species, illustrate how genome structure and population genetics together shape regulatory evolution.
    Cell 08/2013; 154(3):530-40. DOI:10.1016/j.cell.2013.07.007 · 31.96 Impact Factor
  • ABSTRACT: RNA sequencing has opened new avenues for the study of transcriptome composition. Significant evidence has accumulated showing that the human transcriptome contains in excess of a hundred thousand different transcripts. However, it is still not clear to what extent this diversity prevails when considering the relative abundances of different transcripts from the same gene. Here we show that, in a given condition, most protein coding genes have one major transcript expressed at significantly higher level than others, that in human tissues the major transcripts contribute almost 85 percent to the total mRNA from protein coding loci, and that often the same major transcript is expressed in many tissues. We detect a high degree of overlap between the set of major transcripts and a recently published set of alternatively spliced transcripts that are predicted to be translated utilizing proteomic data. Thus, we hypothesize that although some minor transcripts may play a functional role, the major ones are likely to be the main contributors to the proteome. However, we still detect a non-negligible fraction of protein coding genes for which the major transcript does not code a protein. Overall, our findings suggest that the transcriptome from protein coding loci is dominated by one transcript per gene and that not all the transcripts that contribute to transcriptome diversity are equally likely to contribute to protein diversity. This observation can help to prioritize candidate targets in proteomics research and to predict the functional impact of the detected changes in variation studies.
    Genome biology 07/2013; 14(7):R70. DOI:10.1186/gb-2013-14-7-r70 · 10.47 Impact Factor
  • ABSTRACT: Genes for the production of a broad range of fungal secondary metabolites are frequently colinear. The prevalence of such gene clusters was systematically examined across the genome of the cereal pathogen Fusarium graminearum. The topological structure of transcriptional networks was also examined to investigate control mechanisms for mycotoxin biosynthesis and other processes. The genes associated with transcriptional processes were identified, and the genomic location of transcription-associated proteins (TAPs) analyzed in conjunction with the locations of genes exhibiting similar expression patterns. Highly conserved TAPs reside in regions of chromosomes with very low or no recombination, contrasting with putative regulator genes. Co-expression group profiles were used to define positionally clustered genes and a number of members of these clusters encode proteins participating in secondary metabolism. Gene expression profiles suggest there is an abundance of condition-specific transcriptional regulation. Analysis of the promoter regions of co-expressed genes showed enrichment for conserved DNA-sequence motifs. Potential global transcription factors recognising these motifs contain distinct sets of DNA-binding domains (DBDs) from those present in local regulators. Proteins associated with basal transcriptional functions are encoded by genes enriched in regions of the genome with low recombination. Systematic searches revealed dispersed and compact clusters of co-expressed genes, often containing a transcription factor, and typically containing genes involved in biosynthetic pathways. Transcriptional networks exhibit a layered structure in which the position in the hierarchy of a regulator is closely linked to the DBD structural class.
    BMC Systems Biology 06/2013; 7(1):52. DOI:10.1186/1752-0509-7-52 · 2.85 Impact Factor
  • ABSTRACT: Oncogenic fusion genes that involve kinases have proven to be effective targets for therapy in a wide range of cancers. Unfortunately, the diagnostic approaches required to identify these events are struggling to keep pace with the diverse array of genetic alterations that occur in cancer. Diagnostic screening in solid tumours is particularly challenging, as many fusion genes occur with a low frequency. To overcome these limitations, we developed a capture enrichment strategy to enable high throughput transcript sequencing of the human kinome. This approach provides a global overview of kinase fusion events, irrespective of the identity of the fusion partner. To demonstrate the utility of this system we profiled one hundred non-small cell lung cancers and identified numerous genetic alterations impacting Fibroblast Growth Factor Receptor 3 (FGFR3) in lung squamous cell carcinoma and a novel ALK fusion partner in lung adenocarcinoma.
    The Journal of Pathology 05/2013; 230(3). DOI:10.1002/path.4209 · 7.33 Impact Factor
  • ABSTRACT: Rapid accumulation of large and standardized microarray data collections is opening up novel opportunities for holistic characterization of genome function. The limited scalability of current preprocessing techniques has, however, formed a bottleneck for full utilization of these data resources. Although short oligonucleotide arrays constitute a major source of genome-wide profiling data, scalable probe-level techniques have been available only for few platforms based on pre-calculated probe effects from restricted reference training sets. To overcome these key limitations, we introduce a fully scalable online-learning algorithm for probe-level analysis and pre-processing of large microarray atlases involving tens of thousands of arrays. In contrast to the alternatives, our algorithm scales up linearly with respect to sample size and is applicable to all short oligonucleotide platforms. The model can use the most comprehensive data collections available to date to pinpoint individual probes affected by noise and biases, providing tools to guide array design and quality control. This is the only available algorithm that can learn probe-level parameters based on sequential hyperparameter updates at small consecutive batches of data, thus circumventing the extensive memory requirements of the standard approaches and opening up novel opportunities to take full advantage of contemporary microarray collections.
    Nucleic Acids Research 04/2013; DOI:10.1093/nar/gkt229 · 8.81 Impact Factor
  • Johan Rung, Alvis Brazma
    ABSTRACT: Our understanding of gene expression has changed dramatically over the past decade, largely catalysed by technological developments. High-throughput experiments - microarrays and next-generation sequencing - have generated large amounts of genome-wide gene expression data that are collected in public archives. Added-value databases process, analyse and annotate these data further to make them accessible to every biologist. In this Review, we discuss the utility of the gene expression data that are in the public domain and how researchers are making use of these data. Reuse of public data can be very powerful, but there are many obstacles in data preparation and analysis and in the interpretation of the results. We will discuss these challenges and provide recommendations that we believe can improve the utility of such data.
    Nature Reviews Genetics 12/2012; 14(2). DOI:10.1038/nrg3394 · 39.79 Impact Factor
  • ABSTRACT: The paper proposes a hybrid system based approach for modelling of intracellular networks and introduces a restricted subclass of hybrid systems - HSM - with an objective of still being able to provide sufficient power for the modelling of biological systems, while imposing some restrictions that facilitate analysis of systems described by such models. The use of hybrid system based models has become increasingly popular, likely due to the facts that: 1) they provide a sufficiently powerful mathematical formalism to describe biological processes of interest and do so in a 'natural way' from the biological perspective; 2) there are well established mathematical techniques as well as supporting software tools for analyzing such models. However, often these models are very dependent on the quantitative parameters of the system (concentrations of proteins, their growth functions etc.) that are seldom exactly known, instead of more limited information of the system that can be observed in practice (directions of change in concentrations, but not the exact values etc.). As a result these models may work well for simulation of the system (prediction of its state starting from some initial conditions), but are too complicated for prediction of all possible qualitatively different behaviours a modelled system might have. With HSM we try to propose a hybrid system based formalism that is still sufficiently powerful for description of biological systems, while being as restricted as possible to facilitate the analysis of the systems described. We separate between the quantitative system parameters and their qualitative values that can be observed in practice. For HSM we provide an algorithm that analyses the system without the need to know the exact parameter values. We apply our model and analysis methods to a well-studied gene network of lambda phage. The phage has two well-known qualitatively different behaviours - lysis and lysogeny. We show that our model has an attractor structure that corresponds well to these two behaviours and that these are the only stable behaviours that can be exhibited by the system. The algorithm also generates (in principle biologically verifiable) hypotheses about the mutations of lambda phage that should change its observable behaviour.
    Gene 12/2012; DOI:10.1016/j.gene.2012.11.084 · 2.20 Impact Factor

Publication Stats

13k Citations
1,498.88 Total Impact Points


  • 2014
    • European Molecular Biology Laboratory
      Heidelberg, Baden-Württemberg, Germany
  • 1998–2014
    • EMBL-EBI
      Cambridge, England, United Kingdom
  • 2012
    • CUNY Graduate Center
      New York City, New York, United States
  • 2003–2012
    • Wellcome Trust Sanger Institute
      Cambridge, England, United Kingdom
  • 2011
    • Aalto University
      • Department of Information and Computer Science
      Helsinki, Province of Southern Finland, Finland
    • Dana-Farber Cancer Institute
      Boston, Massachusetts, United States
  • 2010
    • Cancer Research UK Cambridge Institute
      Cambridge, England, United Kingdom
  • 2009
    • Cambridge Institute for Medical Research
      Cambridge, England, United Kingdom
    • Helsinki Institute for Information Technology HIIT
      Helsinki, Southern Finland Province, Finland
    • Yale University
      New Haven, Connecticut, United States
  • 1996–2007
    • University of Latvia
      • Institute of Mathematics and Computer Science
      Riga, Riga, Latvia
  • 2004–2006
    • Stanford University
      • Department of Biochemistry
      Stanford, CA, United States
  • 2005
    • University of Cambridge
      Cambridge, England, United Kingdom
    • British Antarctic Survey
      Cambridge, England, United Kingdom
  • 2004–2005
    • University College Dublin
      • Conway Institute of Biomolecular & Biomedical Research
      Dublin, Leinster, Ireland
  • 2002
    • University of Helsinki
      • Department of Computer Science
      Helsinki, Province of Southern Finland, Finland
    • University of California, Berkeley
      • Department of Molecular and Cell Biology
      Berkeley, CA, United States
    • Instituto de Bioinformatica e Biotecnologia
      Natal, Rio Grande do Norte, Brazil