ABSTRACT: As whole-genome sequencing for cancer genome analysis becomes a clinical tool, a full understanding of
the variables affecting sequencing analysis output is required. Here using tumour-normal sample pairs
from two different types of cancer, chronic lymphocytic leukaemia and medulloblastoma, we conduct a
benchmarking exercise within the context of the International Cancer Genome Consortium. We compare
sequencing methods, analysis pipelines and validation methods. We show that using PCR-free methods
and increasing sequencing depth to ~100× shows benefits, as long as the tumour:control coverage
ratio remains balanced. We observe widely varying mutation call rates and low concordance among
analysis pipelines, reflecting the artefact-prone nature of the raw data and lack of standards for dealing
with the artefacts. However, we show that, using the benchmark mutation set we have created, many
issues are in fact easy to remedy and have an immediate positive impact on mutation detection accuracy.
Full-text · Article · Dec 2015 · Nature Communications
ABSTRACT: We analyzed the DNA methylome of ten subpopulations spanning the entire B cell differentiation program by whole-genome bisulfite sequencing and high-density microarrays. We observed that non-CpG methylation disappeared upon B cell commitment, whereas CpG methylation changed extensively during B cell maturation, showing an accumulative pattern and affecting around 30% of all measured CpG sites. Early differentiation stages mainly displayed enhancer demethylation, which was associated with upregulation of key B cell transcription factors and affected multiple genes involved in B cell biology. Late differentiation stages, in contrast, showed extensive demethylation of heterochromatin and methylation gain at Polycomb-repressed areas, and genes with apparent functional impact in B cells were not affected. This signature, which has previously been linked to aging and cancer, was particularly widespread in mature cells with an extended lifespan. Comparing B cell neoplasms with their normal counterparts, we determined that they frequently acquire methylation changes in regions already undergoing dynamic methylation during normal B cell differentiation.
ABSTRACT: While analyzing the DNA methylome of multiple myeloma (MM), a plasma cell neoplasm, by whole-genome bisulfite sequencing and high-density arrays, we observed a highly heterogeneous pattern globally characterized by regional DNA hypermethylation embedded in extensive hypomethylation. In contrast to the widely reported DNA hypermethylation of promoter-associated CpG islands (CGIs) in cancer, hypermethylated sites in MM, as opposed to normal plasma cells, were located outside CpG islands and were unexpectedly associated with intronic enhancer regions defined in normal B cells and plasma cells. Both RNA-seq and in vitro reporter assays indicated that enhancer hypermethylation is globally associated with downregulation of its host genes. ChIP-seq and DNase-seq further revealed that DNA hypermethylation in these regions is related to enhancer decommissioning. Hypermethylated enhancer regions overlapped with binding sites of B cell-specific transcription factors (TFs) and the degree of enhancer methylation inversely correlated with expression levels of these TFs in MM. Furthermore, hypermethylated regions in MM were methylated in stem cells and gradually became demethylated during normal B-cell differentiation, suggesting that MM cells either reacquire epigenetic features of undifferentiated cells or maintain an epigenetic signature of a putative myeloma stem cell progenitor. Overall, we have identified DNA hypermethylation of developmentally-regulated enhancers as a new type of epigenetic modification associated with the pathogenesis of MM.
Published by Cold Spring Harbor Laboratory Press.
ABSTRACT: We apply a known algorithm for the exact computation of inequalities between Beta distributions to assess whether a given position in a genome is differentially methylated across samples. We discuss the advantages of this solution over two approximations (Fisher's test and the Z score). The formalism presented here can be applied in a similar way to variant calling.
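To illustrate what an exact inequality between Beta distributions looks like in practice, here is a minimal sketch. It is not taken from the paper: it assumes a uniform Beta(1, 1) prior on the methylation level of each sample, so that a site with `m` methylated and `u` unmethylated reads has posterior Beta(m + 1, u + 1), and it uses a well-known closed-form sum that is valid when the first parameter of the second distribution is a positive integer. The function names are hypothetical.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def prob_b_greater_than_a(alpha_a, beta_a, alpha_b, beta_b):
    """Exact P(p_B > p_A) for independent p_A ~ Beta(alpha_a, beta_a)
    and p_B ~ Beta(alpha_b, beta_b).

    Assumption: alpha_b is a positive integer (true for read counts
    combined with a uniform prior).
    """
    total = 0.0
    for i in range(alpha_b):
        total += exp(
            log_beta(alpha_a + i, beta_a + beta_b)
            - log(beta_b + i)
            - log_beta(1 + i, beta_b)
            - log_beta(alpha_a, beta_a)
        )
    return total


def prob_more_methylated(meth_a, unmeth_a, meth_b, unmeth_b):
    """P(sample B is more methylated than sample A) at one site,
    assuming a uniform prior on each methylation level."""
    return prob_b_greater_than_a(meth_a + 1, unmeth_a + 1,
                                 meth_b + 1, unmeth_b + 1)


if __name__ == "__main__":
    # 2/20 methylated reads in sample A versus 18/20 in sample B
    print(prob_more_methylated(2, 18, 18, 2))
```

A probability close to 0 or 1 suggests differential methylation at the site; a value near 0.5 means the read counts do not separate the two samples.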
ABSTRACT: Background
In contrast to international pig breeds, the Iberian breed has not been admixed with Asian germplasm. This makes it an important model to study both domestication and relevance of Asian genes in the pig. Besides, Iberian pigs exhibit high meat quality as well as appetite and propensity to obesity. Here we provide a genome wide analysis of nucleotide and structural diversity in a reduced representation library from a pool (n=9 sows) and shotgun genomic sequence from a single sow of the highly inbred Guadyerbas strain. In the pool, we applied newly developed tools to account for the peculiarities of these data.
Results
A total of 254,106 SNPs in the pool (79.6 Mb covered) and 643,783 in the Guadyerbas sow (1.47 Gb covered) were called. The nucleotide diversity (1.31 × 10⁻³ per bp in autosomes) is very similar to that reported in wild boar. A much lower than expected diversity in the X chromosome was confirmed (1.79 × 10⁻⁴ per bp in the individual and 5.83 × 10⁻⁴ per bp in the pool). A strong (0.70) correlation between recombination and variability was observed, but not with gene density or GC content. Multicopy regions affected about 4% of annotated pig genes in their entirety, and 2% of the genes partially. Genes within the lowest variability windows comprised interferon genes and, in chromosome X, genes involved in behavior like HTR2C or MECP2. A modified Hudson-Kreitman-Aguadé test for pools also indicated an accelerated evolution in genes involved in behavior, as well as in spermatogenesis and in lipid metabolism.
Conclusions
This work illustrates the strength of current sequencing technologies to picture a comprehensive landscape of variability in livestock species, and to pinpoint regions containing genes potentially under selection. Among those genes, we report genes involved in behavior, including feeding behavior, and lipid metabolism. The pig X chromosome is an outlier in terms of nucleotide diversity, which suggests selective constraints. Our data further confirm the importance of structural variation in the species, including Iberian pigs, and allowed us to identify new paralogs for known gene families.
ABSTRACT: Simulated power against depth. Power was computed as the number of SNPs called by the SNAPE software divided by the total number of real SNPs in the pool. Depth corresponds to the average depth in the pooled data. Bottom: Power against MAF (minor allele frequency in the pool).
ABSTRACT: Variability (Watterson's estimate, per bp) inside multicopy regions vs. variability of windows containing multicopy regions but outside the multicopy region units.
ABSTRACT: Genes within multicopy regions and extreme selection tests’ windows. MCR genes: genes within multicopy regions; Lowest theta shared autosomes: genes within extreme low θ in autosomes and X pseudoautosomal region (PAR) common in the individual and the pool; Lowest theta shared non-pseudoautosomal region (NPAR): genes within extreme low θ in X NPAR region common in the individual and the pool; Largest theta pool: genes within extreme high θ regions in the pool; Largest theta individual: genes within extreme high θ regions in the individual; Lowest combined test: genes with lowest values of the combined Tajima’s D, Fay and Wu’s H and θ test; HKA excess of differentiation autosomes+PAR: genes within HKA excess of differentiation in autosomes and X PAR region; HKA excess of polymorphism autosomes+PAR: genes within HKA excess of polymorphism in autosomes and X PAR region; HKA excess of polymorphism NPAR: genes within HKA excess of polymorphism in X NPAR region.
ABSTRACT: Performing high throughput sequencing on samples pooled from different individuals is a strategy to characterize genetic variability at a small fraction of the cost required for individual sequencing. In certain circumstances some variability estimators have even lower variance than those obtained with individual sequencing. SNP calling and estimating the frequency of the minor allele from pooled samples, though, is a subtle exercise for at least three reasons. First, sequencing errors may have a much larger relevance than in individual SNP calling: while their impact in individual sequencing can be reduced by setting a restriction on a minimum number of reads per allele, this would have a strong and undesired effect in pools because it is unlikely that alleles at low frequency in the pool will be read many times. Second, the prior allele frequency for heterozygous sites in individuals is usually 0.5 (assuming one is not analyzing sequences coming from, e.g. cancer tissues), but this is not true in pools: in fact, under the standard neutral model, singletons (i.e. alleles of minimum frequency) are the most common class of variants because P(f) ∝ 1/f and they occur more often as the sample size increases. Third, an allele appearing only once in the reads from a pool does not necessarily correspond to a singleton in the set of individuals making up the pool, and vice versa, there can be more than one read - or, more likely, none - from a true singleton.
To improve upon existing theory and software packages, we have developed a Bayesian approach for minor allele frequency (MAF) computation and SNP calling in pools (and implemented it in a program called snape): the approach takes into account sequencing errors and allows users to choose different priors. We also set up a pipeline which can simulate the coalescence process giving rise to the SNPs, the pooling procedure and the sequencing. We used it to compare the performance of snape to that of other packages.
We present software that helps call SNPs in pooled samples: it has good power while retaining a low false discovery rate (FDR). The method also provides the posterior probability that a SNP is segregating and the full posterior distribution of f for every SNP. In order to test the behaviour of our software, we generated (through simulated coalescence) artificial genomes and computed the effect of a pooled sequencing protocol, followed by SNP calling. In this setting, snape has better power and FDR than the comparable packages samtools, PoPoolation and VarScan: for N = 50 chromosomes, snape has power ≈ 35% and FDR ≈ 2.5%. snape is available at http://code.google.com/p/snape-pooled/ (source code and precompiled binaries).
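The Bayesian reasoning described in this abstract can be sketched in a few lines. The code below is not the snape implementation; it is a simplified illustration under stated assumptions: a biallelic site, a symmetric per-read error rate `err`, binomial sampling of reads from the pool, a prior proportional to 1/k on the number k of alternative-allele copies among the N pooled chromosomes, and a hypothetical prior probability `p_seg` that the site segregates at all.

```python
from math import comb


def snp_posterior(alt_reads, depth, n_chrom, err=0.01, p_seg=0.01):
    """Posterior over k = 0 .. n_chrom-1 copies of the alternative allele.

    Assumptions (illustrative only): symmetric sequencing error `err`,
    binomial read sampling, prior P(k) proportional to 1/k for k >= 1,
    and prior probability `p_seg` that the site is segregating.
    """
    harmonic = sum(1.0 / k for k in range(1, n_chrom))
    prior = [1.0 - p_seg] + [p_seg / (k * harmonic) for k in range(1, n_chrom)]

    joint = []
    for k, p_k in enumerate(prior):
        f = k / n_chrom
        # Probability that a single read shows the alternative allele
        q = f * (1.0 - err) + (1.0 - f) * err
        likelihood = comb(depth, alt_reads) * q**alt_reads * (1.0 - q)**(depth - alt_reads)
        joint.append(p_k * likelihood)

    total = sum(joint)
    return [j / total for j in joint]


if __name__ == "__main__":
    post = snp_posterior(alt_reads=10, depth=100, n_chrom=50)
    print("P(segregating) =", 1.0 - post[0])
    print("most probable allele count =", max(range(len(post)), key=post.__getitem__))
```

The returned list plays the role of the posterior distribution of f (with f = k/N), and `1 - post[0]` is the posterior probability that the site is a SNP under these assumptions.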
ABSTRACT: High-throughput sequencing of cDNA libraries constructed from cellular RNA complements (RNA-Seq) naturally provides a digital quantitative measurement for every expressed RNA molecule. Nature, impact and mutual interference of biases in different experimental setups are, however, still poorly understood, mostly due to the lack of data from intermediate protocol steps. We analysed multiple RNA-Seq experiments, involving different sample preparation protocols and sequencing platforms: we broke them down into their common (and currently indispensable) technical components (reverse transcription, fragmentation, adapter ligation, PCR amplification, gel segregation and sequencing), investigating how such different steps influence abundance and distribution of the sequenced reads. For each of those steps, we developed universally applicable models, which can be parameterised by empirical attributes of any experimental protocol. Our models are implemented in a computer simulation pipeline called the Flux Simulator, and we show that read distributions generated by different combinations of these models closely reproduce the evidence obtained from the corresponding experimental setups. We further demonstrate that our in silico RNA-Seq provides insights about hidden precursors that determine the final configuration of reads along gene bodies; enhancing or compensatory effects that explain apparently controversial observations can be observed. Moreover, our simulations identify hitherto unreported sources of systematic bias from RNA hydrolysis, a fragmentation technique currently employed by most RNA-Seq protocols.
Preview · Article · Sep 2012 · Nucleic Acids Research
ABSTRACT: Missing data are common in DNA sequences obtained through high-throughput sequencing. Furthermore, samples of low quality or problems in the experimental protocol often cause a loss of data even with traditional sequencing technologies. Here we propose modified estimators of variability and neutrality tests that can be naturally applied to sequences with missing data, without the need to remove bases or individuals from the analysis. Modified statistics include the Watterson estimator θW, Tajima's D, Fay and Wu's H, and HKA. We develop a general framework to take missing data into account in frequency spectrum-based neutrality tests and we derive the exact expression for the variance of these statistics under the neutral model. The neutrality tests proposed here can also be used as summary statistics to describe the information contained in other classes of data like DNA microarrays.
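For reference, the standard (complete-data) versions of two of the statistics named above can be computed as follows. This sketch covers only the classical Watterson estimator and Tajima's D for an alignment with no missing bases; it does not implement the modified estimators proposed in the abstract. The function name and the toy alignment are illustrative.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(seqs):
    """Classical Watterson's theta, pi and Tajima's D for aligned
    sequences of equal length without missing data (per locus, not per bp)."""
    n = len(seqs)
    length = len(seqs[0])

    # S: number of segregating sites
    seg_sites = sum(1 for i in range(length) if len({s[i] for s in seqs}) > 1)

    # pi: mean number of pairwise differences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    theta_w = seg_sites / a1

    # Variance constants from Tajima (1989)
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    var = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)

    tajima_d = (pi - theta_w) / sqrt(var) if var > 0 else float("nan")
    return {"S": seg_sites, "theta_w": theta_w, "pi": pi, "tajima_d": tajima_d}


if __name__ == "__main__":
    print(diversity_stats(["ACGT", "ACGA", "ACCA", "TCCA"]))
```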
ABSTRACT: We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).
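The concept of mappability can be made concrete with a naive sketch. The code below is not the GEM algorithm: it handles exact matches only (zero mismatches), considers a single strand, and uses a commonly used definition in which the mappability of a position is the inverse of the number of times the k-mer starting there occurs in the sequence. The function name is hypothetical, and the approach only suits small sequences.

```python
from collections import Counter


def kmer_mappability(genome, k):
    """Per-position mappability for exact k-mer matches on one strand.

    A value of 1.0 means the k-mer starting at that position is unique;
    0.5 means it occurs twice, and so on.
    """
    genome = genome.upper()
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


if __name__ == "__main__":
    print(kmer_mappability("ACGTACGTTT", 4))
```

Positions with low values are those where short reads cannot be placed unambiguously, which is why such a track is useful when filtering SNP calls or expression estimates.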
ABSTRACT: We present and validate BlastR, a method for efficiently and accurately searching non-coding RNAs. Our approach relies on
the comparison of di-nucleotides using BlosumR, a new log-odds substitution matrix. In order to use BlosumR for comparison,
we recoded RNA sequences into protein-like sequences. We then showed that BlosumR can be used along with the BlastP algorithm
in order to search non-coding RNA sequences. Using Rfam as a gold standard, we benchmarked this approach and show BlastR to
be more sensitive than BlastN. We also show that BlastR is both faster and more sensitive than BlastP used with a single-nucleotide
log-odds substitution matrix. BlastR, when used in combination with WU-BlastP, is about 5% more accurate than WU-BlastN and
about 50 times slower. The approach shown here is equally effective when combined with the NCBI-Blast package. The software
is open-source freeware available from www.tcoffee.org/blastr.html.
Full-text · Article · May 2011 · Nucleic Acids Research
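The recoding step mentioned in the BlastR abstract, turning an RNA sequence into a protein-like sequence of di-nucleotide symbols, might look like the sketch below. The letter assignment is arbitrary and hypothetical, as is the choice of overlapping di-nucleotides; the actual encoding and the BlosumR matrix are defined by the BlastR software and are not reproduced here.

```python
from itertools import product

# Hypothetical mapping: 16 di-nucleotides -> 16 arbitrary amino-acid letters
AMINO_LETTERS = "ACDEFGHIKLMNPQRS"
DINUC_TO_LETTER = {
    a + b: AMINO_LETTERS[i]
    for i, (a, b) in enumerate(product("ACGU", repeat=2))
}


def recode_rna(seq):
    """Recode an RNA (or DNA) sequence as overlapping di-nucleotides,
    each written as a single protein-like letter.

    Raises KeyError on characters other than A, C, G, U/T.
    """
    seq = seq.upper().replace("T", "U")
    return "".join(DINUC_TO_LETTER[seq[i:i + 2]] for i in range(len(seq) - 1))


if __name__ == "__main__":
    print(recode_rna("GAUC"))
```

A sequence recoded this way can be passed to a protein search tool together with a 16-letter substitution matrix, which is the general idea the abstract describes.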
ABSTRACT: A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (∼1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest.
ABSTRACT: MOTIVATION: Molecular chaperones prevent the aggregation of their substrate proteins and thereby ensure that they reach their functional native state. The bacterial GroEL/ES chaperonin system is understood in great detail on a structural, mechanistic and functional level; its interactors in Escherichia coli have been identified and characterized. However, a long-standing question in the field is: What makes a protein a chaperone substrate? RESULTS: Here we identify, using a bioinformatics-based approach, a simple set of quantities that characterize the GroEL-substrate proteome. We define three novel parameters differentiating GroEL interactors from other cellular proteins: lower rate of evolution, hydrophobicity and aggregation propensity. Combining them with other known features into a simple Bayesian predictor allows us to identify known homologous and heterologous GroEL substrate proteins. We discuss our findings in relation to established mechanisms of protein folding and evolutionary buffering by chaperones.
ABSTRACT: To understand basic principles of bacterial metabolism organization and regulation, but also the impact of genome size, we systematically studied one of the smallest bacteria, Mycoplasma pneumoniae. A manually curated metabolic network of 189 reactions catalyzed by 129 enzymes allowed the design of a defined, minimal medium with 19 essential nutrients. More than 1300 growth curves were recorded in the presence of various nutrient concentrations. Measurements of biomass indicators, metabolites, and 13C-glucose experiments provided information on directionality, fluxes, and energetics; integration with transcription profiling enabled the global analysis of metabolic regulation. Compared with more complex bacteria, the M. pneumoniae metabolic network has a more linear topology and contains a higher fraction of multifunctional enzymes; general features such as metabolite concentrations, cellular energetics, adaptability, and global gene expression responses are similar, however.
ABSTRACT: The computation of the statistical properties of motif occurrences has an obviously relevant application: patterns that are significantly over- or under-represented in genomes or proteins are interesting candidates for biological roles. However, the problem is computationally hard; as a result, virtually all the existing motif finders use fast but approximate scoring functions, in spite of the fact that they have been shown to produce systematically incorrect results. A few interesting exact approaches are known, but they are very slow and hence not practical in the case of realistic sequences.
We give an exact solution, solely based on deterministic finite-state automata (DFA), to the problem of finding the whole relevant part of the probability distribution function of a simple-word motif in a homogeneous (biological) sequence. Out of that, the z-value can always be computed, while the P-value can be obtained either when it is not too extreme with respect to the number of floating-point digits available in the implementation, or when the number of pattern occurrences is moderately low. In particular, the time complexity of the algorithms for Markov models of moderate order (0 ≤ m ≤ 2) is far better than that of Nuel, which was the fastest similar exact algorithm known to date; in many cases, even approximate methods are outperformed.
DFA are a standard tool of computer science for the study of patterns; previous works in biology propose algorithms involving automata, but there they are used, respectively, as a first step to write a generating function, or to build a finite Markov-chain imbedding (FMCI). In contrast, we directly rely on DFA to perform the calculations; thus we manage to obtain an algorithm which is both easily interpretable and efficient. This approach can be used for exact statistical studies of very long genomes and protein sequences, as we illustrate with some examples on the scale of the human genome.
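The general idea of computing an exact occurrence distribution directly on a DFA can be illustrated with a toy example. The sketch below is not the authors' algorithm and makes strong simplifying assumptions: a single word, an order-0 (independent letters) sequence model, overlapping occurrences counted, and a plain dynamic program over (automaton state, occurrence count) pairs with no attempt at the optimizations a genome-scale computation would need. Function names are hypothetical.

```python
from collections import defaultdict


def build_dfa(word, alphabet):
    """Transition table of the word-matching automaton.

    State s means: the longest suffix of the text read so far that is
    also a prefix of `word` has length s. Built by brute force for clarity.
    """
    m = len(word)
    delta = []
    for s in range(m + 1):
        row = {}
        for c in alphabet:
            text = word[:s] + c
            k = min(m, len(text))
            while k > 0 and text[-k:] != word[:k]:
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def count_distribution(word, probs, n):
    """Exact distribution of the number of (overlapping) occurrences of
    `word` in a random sequence of length n with independent letters
    drawn according to `probs` (letter -> probability)."""
    delta = build_dfa(word, probs)
    m = len(word)
    dist = {(0, 0): 1.0}  # (automaton state, occurrences so far) -> probability
    for _ in range(n):
        nxt = defaultdict(float)
        for (state, count), p in dist.items():
            for letter, p_letter in probs.items():
                new_state = delta[state][letter]
                hit = 1 if new_state == m else 0
                nxt[(new_state, count + hit)] += p * p_letter
        dist = nxt
    result = defaultdict(float)
    for (_, count), p in dist.items():
        result[count] += p
    return dict(result)


if __name__ == "__main__":
    uniform = {c: 0.25 for c in "ACGT"}
    print(count_distribution("AA", uniform, 3))
```

From such a distribution, the mean, variance, z-value and tail probabilities of the occurrence count follow by direct summation.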
ABSTRACT: Sequencing DNA from several organisms has revealed that duplication and drift of existing genes have primarily moulded the contents of a given genome. Though the effect of knocking out or overexpressing a particular gene has been studied in many organisms, no study has systematically explored the effect of adding new links in a biological network. To explore network evolvability, we constructed 598 recombinations of promoters (including regulatory regions) with different transcription or σ-factor genes in Escherichia coli, added over a wild-type genetic background. Here we show that 95% of new networks are tolerated by the bacteria, that very few alter growth, and that expression level correlates with factor position in the wild-type network hierarchy. Most importantly, we find that certain networks consistently survive over the wild type under various selection pressures. Therefore new links in the network are rarely a barrier for evolution and can even confer a fitness advantage.