Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories

[1] Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands. [2] Netherlands Bioinformatics Centre, Leiden, The Netherlands.
Nature Biotechnology (Impact Factor: 41.51). 09/2013; 31(11). DOI: 10.1038/nbt.2702
Source: PubMed


RNA sequencing is an increasingly popular technology for genome-wide analysis of transcript sequence and abundance. However, understanding of the sources of technical and interlaboratory variation is still limited. To address this, the GEUVADIS consortium sequenced mRNAs and small RNAs of lymphoblastoid cell lines of 465 individuals in seven sequencing centers, with a large number of replicates. The variation between laboratories appeared to be considerably smaller than the already limited biological variation. Laboratory effects were mainly seen in differences in insert size and GC content and could be adequately corrected for. In small-RNA sequencing, the microRNA (miRNA) content differed widely between samples owing to competitive sequencing of rRNA fragments. This did not affect relative quantification of miRNAs. We conclude that distributing RNA sequencing among different laboratories is feasible, given proper standardization and randomization procedures. We provide a set of quality measures and guidelines for assessing technical biases in RNA-seq data.

Available from: Seyed Yahya Anvar, Aug 19, 2015
  • Source
    • "After this analysis, novel miRNAs have observed the following rules: (1) The set of reads from the novel miRNA locus should account for more than 95% of all the precursor mapped small RNA reads, and reliable novel miRNA reads should account for more than 75% of the corresponding set of reads; (2) the miRNA* reads should have two-nucleotide 3’ overhangs; or (3) base-pairing between the miRNA and the other arm of the hairpin, which includes the miRNA*, is extensive such that there are typically four or fewer mismatched miRNA bases, five mismatched bases being allowed if the miRNA* was detected, meanwhile, no asymmetric bulges larger than two nucleotides and no more than two asymmetric bulges should be present within the miRNA/miRNA* duplex [62-64]. In addition, novel miRNAs were further identified depending on both the abundance of each sequence, which was normalized as reads per million of total miRNA reads (RPM) [65], and detection of miRNA*s. The sRNA sequences with abundance at least 5 RPM in at least one of the five tissues examined and with miRNA* detected were considered as novel miRNAs, while those had abundance at least 5 RPM in at least one of the five tissues tested but without miRNA*s detected were considered as candidate miRNAs. "
    ABSTRACT: MicroRNAs (miRNAs) regulate various biological processes in plants. Considerable data are available on miRNAs involved in the development of rice, maize and barley. In contrast, little is known about miRNAs and their functions in the development of wheat. In this study, five small RNA (sRNA) libraries from wheat seedlings, flag leaves, and developing seeds were developed and sequenced to identify miRNAs and understand their functions in wheat development. Twenty-four known miRNAs belonging to 15 miRNA families were identified from 18 MIRNA loci in wheat in the present study, including 15 miRNAs (9 MIRNA loci) first identified in wheat, 13 miRNA families (16 MIRNA loci) being highly conserved and 2 (2 MIRNA loci) moderately conserved. In addition, fifty-five novel miRNAs were also identified. The potential target genes for 15 known miRNAs and 37 novel miRNAs were predicted using strict criteria, and these target genes are involved in a wide range of biological functions. Four of the 15 known miRNA families and 22 of the 55 novel miRNAs were preferentially expressed in the developing seeds with logarithm (log2) of the fold change of 1.0 ~ 7.6, and half of them were seed-specific, suggesting that they participate in regulating wheat seed development and metabolism. From 5 days post-anthesis to 20 days post-anthesis, miR164 and miR160 increased in abundance in the developing seeds, whereas miR169 decreased, suggesting their coordinating functions in the different developmental stages of wheat seed. Moreover, 8 known miRNA families and 28 novel miRNAs exhibited tissue-biased expression in wheat flag leaves, with the logarithm of the fold changes of 0.1 ~ 5.2. The putative targets of these tissue-preferential miRNAs were involved in various metabolism and biological processes, suggesting complexity of the regulatory networks in different tissues. 
    Our data also suggested that wheat flag leaves have more complicated regulatory networks of miRNAs than developing seeds. Our work identified and characterised wheat miRNAs, their targets and expression patterns. This study is the first to elucidate the regulatory networks of miRNAs involved in wheat flag leaves and developing seeds, and provided a foundation for future studies on specific functions of these miRNAs.
    BMC Genomics 04/2014; 15(1):289. DOI:10.1186/1471-2164-15-289 · 3.99 Impact Factor
  • Source
    • "Performance would also likely be enhanced by a more detailed assessment of the transcriptome, for example, by quantifying transcript expression levels with RNA sequencing (RNA-seq), which has been shown to provide better estimates of expression than microarrays [44]. A recent study has also found that results of transcriptome sequencing using RNA-seq were highly reproducible between different laboratories, if procedures are standardized (which is not generally the case for expression microarrays) [45]. This provides further evidence that incorporating RNA-seq would increase power and the widespread utility of these types of expression-based prediction assays. "
    ABSTRACT: We demonstrate a method for the prediction of chemotherapeutic response in patients using only before-treatment baseline tumor gene expression data. First, we fitted models for whole genome gene expression against drug sensitivity in a large panel of cell lines, using a method that allows every gene to influence the prediction. Following data homogenization and filtering, these models were applied to baseline expression levels from primary tumor biopsies, yielding an in vivo drug sensitivity prediction. We validated this approach in three independent clinical trial datasets, and obtained predictions equally good, or better than, gene signatures derived directly from clinical data.
    Genome Biology 03/2014; 15(3):R47. DOI:10.1186/gb-2014-15-3-r47 · 10.81 Impact Factor
  • Source
    • "In large sequencing studies, specific samples, for technical or biological reasons, can be recognized as outliers and should be removed from the study [18]. To identify outlier samples, whose global gene expression pattern is not explained by known covariates, we used Principal Component Analysis (PCA), investigating the first six principal components, which together explain ~60% of the variance in the brain data. "
    ABSTRACT: RNA-Sequencing (RNA-Seq) experiments have been optimized for library preparation, mapping, and gene expression estimation. These methods, however, have revealed weaknesses in the next stages of analysis of differential expression, with results sensitive to systematic sample stratification or, in more extreme cases, to outliers. Further, a method to assess normalization and adjustment measures imposed on the data is lacking. To address these issues, we utilize previously published eQTLs as a novel gold standard at the center of a framework that integrates DNA genotypes and RNA-Seq data to optimize analysis and aid in the understanding of genetic variation and gene expression. After detecting sample contamination and sequencing outliers in RNA-Seq data, a set of previously published brain eQTLs was used to determine if sample outlier removal was appropriate. Improved replication of known eQTLs supported removal of these samples in downstream analyses. eQTL replication was further employed to assess normalization methods, covariate inclusion, and gene annotation. This method was validated in an independent RNA-Seq blood data set from the GTEx project and a tissue-appropriate set of eQTLs. eQTL replication in both data sets highlights the necessity of accounting for unknown covariates in RNA-Seq data analysis. As each RNA-Seq experiment is unique with its own experiment-specific limitations, we offer an easily-implementable method that uses the replication of known eQTLs to guide each step in one's data analysis pipeline. In the two data sets presented herein, we highlight not only the necessity of careful outlier detection but also the need to account for unknown covariates in RNA-Seq experiments.
    BMC Genomics 12/2013; 14(1):892. DOI:10.1186/1471-2164-14-892 · 3.99 Impact Factor