ABSTRACT: We apply a known algorithm for the exact computation of inequalities between Beta distributions to assess whether a given position in a genome is differentially methylated across samples. We discuss the advantages of this solution over two approximations (Fisher's test and the Z score). The same formalism can be applied in a similar way to variant calling.
PLoS ONE 01/2014; 9(5):e97349. · 3.53 Impact Factor
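To make the Beta-distribution comparison in the abstract above concrete, here is a minimal sketch of an exact computation of P(X > Y) for two independent Beta variables. It assumes integer shape parameters and uses a standard finite-sum identity; it is not necessarily the algorithm applied in the paper. The function names and the uniform prior in the usage example are illustrative assumptions.

```python
from math import exp, lgamma, log


def log_beta(a: float, b: float) -> float:
    """Natural logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def prob_greater(a1: int, b1: int, a2: int, b2: int) -> float:
    """Return P(X > Y) for independent X ~ Beta(a1, b1), Y ~ Beta(a2, b2).

    The finite sum below is exact when a1 is a positive integer;
    terms are evaluated in log space to avoid overflow for large counts.
    """
    total = 0.0
    for i in range(a1):
        total += exp(
            log_beta(a2 + i, b1 + b2)
            - log(b1 + i)
            - log_beta(1 + i, b1)
            - log_beta(a2, b2)
        )
    return total


# Hypothetical usage: methylated / unmethylated read counts at one position.
# With a uniform Beta(1, 1) prior (an assumption), the posterior of the
# methylation level is Beta(methylated + 1, unmethylated + 1).
sample_1 = (18 + 1, 2 + 1)
sample_2 = (5 + 1, 15 + 1)
p_higher_in_1 = prob_greater(*sample_1, *sample_2)
```

A value of `p_higher_in_1` close to 0 or 1 would point to differential methylation at that position; values near 0.5 indicate no evidence of a difference.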
ABSTRACT: BACKGROUND: In contrast to international pig breeds, the Iberian breed has not been admixed with Asian germplasm. This makes it an important model to study both domestication and the relevance of Asian genes in the pig. Besides, Iberian pigs exhibit high meat quality as well as appetite and propensity to obesity. Here we provide a genome-wide analysis of nucleotide and structural diversity in a reduced representation library from a pool (n=9 sows) and shotgun genomic sequence from a single sow of the highly inbred Guadyerbas strain. In the pool, we applied newly developed tools to account for the peculiarities of these data. RESULTS: A total of 254,106 SNPs in the pool (79.6 Mb covered) and 643,783 in the Guadyerbas sow (1.47 Gb covered) were called. The nucleotide diversity (1.31 × 10⁻³ per bp in autosomes) is very similar to that reported in wild boar. A much lower than expected diversity in the X chromosome was confirmed (1.79 × 10⁻⁴ per bp in the individual and 5.83 × 10⁻⁴ per bp in the pool). A strong (0.70) correlation between recombination and variability was observed, but not with gene density or GC content. Multicopy regions affected about 4% of annotated pig genes in their entirety, and 2% of the genes partially. Genes within the lowest variability windows comprised interferon genes and, in chromosome X, genes involved in behavior like HTR2C or MECP2. A modified Hudson-Kreitman-Aguade test for pools also indicated an accelerated evolution in genes involved in behavior, as well as in spermatogenesis and in lipid metabolism. CONCLUSIONS: This work illustrates the strength of current sequencing technologies to picture a comprehensive landscape of variability in livestock species, and to pinpoint regions containing genes potentially under selection. Among those genes, we report genes involved in behavior, including feeding behavior, and lipid metabolism.
The pig X chromosome is an outlier in terms of nucleotide diversity, which suggests selective constraints. Our data further confirm the importance of structural variation in the species, including Iberian pigs, and allowed us to identify new paralogs for known gene families.
ABSTRACT: Performing high throughput sequencing on samples pooled from different individuals is a strategy to characterize genetic variability at a small fraction of the cost required for individual sequencing. In certain circumstances some variability estimators have even lower variance than those obtained with individual sequencing. SNP calling and estimating the frequency of the minor allele from pooled samples, though, is a subtle exercise for at least three reasons. First, sequencing errors may have a much larger relevance than in individual SNP calling: while their impact in individual sequencing can be reduced by setting a restriction on a minimum number of reads per allele, this would have a strong and undesired effect in pools because it is unlikely that alleles at low frequency in the pool will be read many times. Second, the prior allele frequency for heterozygous sites in individuals is usually 0.5 (assuming one is not analyzing sequences coming from, e.g., cancer tissues), but this is not true in pools: in fact, under the standard neutral model, singletons (i.e. alleles of minimum frequency) are the most common class of variants because P(f) ∝ 1/f and they occur more often as the sample size increases. Third, an allele appearing only once in the reads from a pool does not necessarily correspond to a singleton in the set of individuals making up the pool, and vice versa, there can be more than one read (or, more likely, none) from a true singleton.
To improve upon existing theory and software packages, we have developed a Bayesian approach for minor allele frequency (MAF) computation and SNP calling in pools (and implemented it in a program called snape): the approach takes into account sequencing errors and allows users to choose different priors. We also set up a pipeline which can simulate the coalescence process giving rise to the SNPs, the pooling procedure and the sequencing. We used it to compare the performance of snape to that of other packages.
We present software for calling SNPs in pooled samples: it has good power while retaining a low false discovery rate (FDR). The method also provides the posterior probability that a SNP is segregating and the full posterior distribution of f for every SNP. In order to test the behaviour of our software, we generated (through simulated coalescence) artificial genomes and computed the effect of a pooled sequencing protocol, followed by SNP calling. In this setting, snape has better power and FDR than the comparable packages samtools, PoPoolation and Varscan: for N = 50 chromosomes, snape has power ≈ 35% and FDR ≈ 2.5%. snape is available at http://code.google.com/p/snape-pooled/ (source code and precompiled binaries).
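The Bayesian reasoning described above (a prior P(f) ∝ 1/f, sequencing errors, and a posterior over the allele frequency in the pool) can be illustrated with a small sketch. This is a simplified two-allele model written for illustration only, not the model implemented in snape; the function name, the symmetric error rate `error`, and the prior probability `prior_seg` that a site is segregating are all assumptions.

```python
from math import comb


def pool_posterior(alt_reads, total_reads, n_chrom, error=0.01, prior_seg=0.01):
    """Posterior over the number k of chromosomes (0..n_chrom-1) carrying
    the alternative allele in a pool, given the observed read counts.

    Prior (assumed): P(k = 0) = 1 - prior_seg, and P(k) ∝ 1/k for k >= 1,
    mimicking the neutral expectation P(f) ∝ 1/f.
    """
    weights = [1.0 / k for k in range(1, n_chrom)]
    norm = sum(weights)
    prior = [1.0 - prior_seg] + [prior_seg * w / norm for w in weights]

    unnormalised = []
    for k, p_k in enumerate(prior):
        f = k / n_chrom
        # Probability that a read shows the alternative allele,
        # allowing for symmetric sequencing errors.
        q = f * (1.0 - error) + (1.0 - f) * error
        likelihood = (
            comb(total_reads, alt_reads)
            * q ** alt_reads
            * (1.0 - q) ** (total_reads - alt_reads)
        )
        unnormalised.append(p_k * likelihood)

    z = sum(unnormalised)
    return [p / z for p in unnormalised]


# Hypothetical usage: 3 alternative reads out of 40 in a pool of 50 chromosomes.
posterior = pool_posterior(alt_reads=3, total_reads=40, n_chrom=50)
prob_segregating = 1.0 - posterior[0]
```

Here `prob_segregating` plays the role of the posterior probability that the site is a SNP, and `posterior` is the full distribution over pool allele counts.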
ABSTRACT: High-throughput sequencing of cDNA libraries constructed from cellular RNA complements (RNA-Seq) naturally provides a digital quantitative measurement for every expressed RNA molecule. Nature, impact and mutual interference of biases in different experimental setups are, however, still poorly understood, mostly due to the lack of data from intermediate protocol steps. We analysed multiple RNA-Seq experiments, involving different sample preparation protocols and sequencing platforms: we broke them down into their common (and currently indispensable) technical components (reverse transcription, fragmentation, adapter ligation, PCR amplification, gel segregation and sequencing), investigating how such different steps influence abundance and distribution of the sequenced reads. For each of those steps, we developed universally applicable models, which can be parameterised by empirical attributes of any experimental protocol. Our models are implemented in a computer simulation pipeline called the Flux Simulator, and we show that read distributions generated by different combinations of these models reproduce well the evidence obtained from the corresponding experimental setups. We further demonstrate that our in silico RNA-Seq provides insights about hidden precursors that determine the final configuration of reads along gene bodies; enhancing or compensatory effects that explain apparently controversial observations can be observed. Moreover, our simulations identify hitherto unreported sources of systematic bias from RNA hydrolysis, a fragmentation technique currently employed by most RNA-Seq protocols.
Nucleic Acids Research 09/2012; · 8.81 Impact Factor
ABSTRACT: Missing data are common in DNA sequences obtained through high-throughput sequencing. Furthermore, samples of low quality or problems in the experimental protocol often cause a loss of data even with traditional sequencing technologies. Here we propose modified estimators of variability and neutrality tests that can be naturally applied to sequences with missing data, without the need to remove bases or individuals from the analysis. Modified statistics include the Watterson estimator θW, Tajima's D, Fay and Wu's H, and HKA. We develop a general framework to take missing data into account in frequency spectrum-based neutrality tests and we derive the exact expression for the variance of these statistics under the neutral model. The neutrality tests proposed here can also be used as summary statistics to describe the information contained in other classes of data like DNA microarrays.
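As an illustration of how an estimator such as Watterson's θW can accommodate missing data without discarding sites or individuals, the sketch below lets the harmonic normalisation depend on the number of bases actually observed at each site. This is one natural generalisation, stated here as an assumption; the exact estimators derived in the paper may differ. The function names and the `N` missing-data symbol are illustrative.

```python
def harmonic(n):
    """a_n = sum_{i=1}^{n-1} 1/i, the usual Watterson normalisation."""
    return sum(1.0 / i for i in range(1, n))


def watterson_missing(seqs, missing="N"):
    """Per-site Watterson estimate from aligned sequences with missing bases.

    Each column contributes a_{n_l} to the denominator, where n_l is the
    number of non-missing bases at that column; columns with fewer than
    two observed bases carry no information and are skipped.
    """
    segregating = 0
    denominator = 0.0
    for column in zip(*seqs):
        observed = [base for base in column if base != missing]
        if len(observed) < 2:
            continue
        denominator += harmonic(len(observed))
        if len(set(observed)) > 1:
            segregating += 1
    return segregating / denominator if denominator else float("nan")


# Hypothetical usage: three aligned sequences, one missing base.
theta = watterson_missing(["ACGT", "ACGA", "ACNA"])
```

With no missing bases every column has the same sample size n, and the expression reduces to the standard S / (L · a_n) per-site estimator.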
ABSTRACT: We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).
PLoS ONE 01/2012; 7(1):e30377. · 3.53 Impact Factor
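The notion of mappability in the abstract above can be illustrated with a toy computation. The sketch assumes the simplest possible setting: exact matches only (zero mismatches), forward strand only, and mappability defined as the inverse of the number of times a k-mer occurs in the sequence. It is a brute-force illustration, not the mapping-based algorithm of the GEM program, and the function name is hypothetical.

```python
from collections import Counter


def kmer_mappability(genome, k):
    """Return, for every start position, 1 / (occurrences of the k-mer
    starting there). A value of 1.0 means the k-mer is unique."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


# Hypothetical usage on a toy sequence containing a repeated 3-mer.
track = kmer_mappability("ACGACG", 3)
# track == [0.5, 1.0, 1.0, 0.5]
```

Positions with low values are those where a read of length k could not be placed unambiguously, which is why such a track is useful when interpreting SNP calls or expression estimates.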
ABSTRACT: We present and validate BlastR, a method for efficiently and accurately searching non-coding RNAs. Our approach relies on the comparison of di-nucleotides using BlosumR, a new log-odd substitution matrix. In order to use BlosumR for comparison, we recoded RNA sequences into protein-like sequences. We then showed that BlosumR can be used along with the BlastP algorithm in order to search non-coding RNA sequences. Using Rfam as a gold standard, we benchmarked this approach and show BlastR to be more sensitive than BlastN. We also show that BlastR is both faster and more sensitive than BlastP used with a single nucleotide log-odd substitution matrix. BlastR, when used in combination with WU-BlastP, is about 5% more accurate than WU-BlastN and about 50 times slower. The approach shown here is equally effective when combined with the NCBI-Blast package. The software is an open source freeware available from www.tcoffee.org/blastr.html.
Nucleic Acids Research 05/2011; 39(16):6886-95. · 8.81 Impact Factor
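The recoding step mentioned in the abstract above (RNA sequences rewritten as protein-like sequences, one symbol per di-nucleotide) can be sketched as follows. The use of overlapping di-nucleotides and the particular assignment of the 16 di-nucleotides to amino-acid letters are assumptions made for illustration; the actual BlastR encoding may differ.

```python
from itertools import product

# Hypothetical mapping: 16 di-nucleotides -> 16 arbitrary amino-acid letters.
AMINO_LETTERS = "ACDEFGHIKLMNPQRS"
DINUC_TO_AA = {
    "".join(pair): AMINO_LETTERS[i]
    for i, pair in enumerate(product("ACGU", repeat=2))
}


def recode(rna):
    """Rewrite an RNA sequence as a protein-like string, one letter per
    overlapping di-nucleotide (DNA input is accepted; T is read as U)."""
    rna = rna.upper().replace("T", "U")
    return "".join(DINUC_TO_AA[rna[i:i + 2]] for i in range(len(rna) - 1))


# Hypothetical usage: a sequence of length n yields n - 1 letters.
protein_like = recode("ACGUACGU")
```

Once sequences are in this form, a protein search tool can compare them with a 16-letter substitution matrix, which is the role BlosumR plays in the approach described above.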
ABSTRACT: MOTIVATION: Molecular chaperones prevent the aggregation of their substrate proteins and thereby ensure that they reach their functional native state. The bacterial GroEL/ES chaperonin system is understood in great detail on a structural, mechanistic and functional level; its interactors in Escherichia coli have been identified and characterized. However, a long-standing question in the field is: What makes a protein a chaperone substrate? RESULTS: Here we identify, using a bioinformatics-based approach, a simple set of quantities which characterize the GroEL-substrate proteome. We define three novel parameters differentiating GroEL interactors from other cellular proteins: lower rate of evolution, hydrophobicity and aggregation propensity. Combining them with other known features into a simple Bayesian predictor allows us to identify known homologous and heterologous GroEL substrate proteins. We discuss our findings in relation to established mechanisms of protein folding and evolutionary buffering by chaperones.
ABSTRACT: A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (∼1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest.
ABSTRACT: To understand basic principles of bacterial metabolism organization and regulation, but also the impact of genome size, we systematically studied one of the smallest bacteria, Mycoplasma pneumoniae. A manually curated metabolic network of 189 reactions catalyzed by 129 enzymes allowed the design of a defined, minimal medium with 19 essential nutrients. More than 1300 growth curves were recorded in the presence of various nutrient concentrations. Measurements of biomass indicators, metabolites, and 13C-glucose experiments provided information on directionality, fluxes, and energetics; integration with transcription profiling enabled the global analysis of metabolic regulation. Compared with more complex bacteria, the M. pneumoniae metabolic network has a more linear topology and contains a higher fraction of multifunctional enzymes; general features such as metabolite concentrations, cellular energetics, adaptability, and global gene expression responses are similar, however.
ABSTRACT: The computation of the statistical properties of motif occurrences has an obviously relevant application: patterns that are significantly over- or under-represented in genomes or proteins are interesting candidates for biological roles. However, the problem is computationally hard; as a result, virtually all the existing motif finders use fast but approximate scoring functions, in spite of the fact that they have been shown to produce systematically incorrect results. A few interesting exact approaches are known, but they are very slow and hence not practical in the case of realistic sequences.
We give an exact solution, solely based on deterministic finite-state automata (DFA), to the problem of finding the whole relevant part of the probability distribution function of a simple-word motif in a homogeneous (biological) sequence. Out of that, the z-value can always be computed, while the P-value can be obtained either when it is not too extreme with respect to the number of floating-point digits available in the implementation, or when the number of pattern occurrences is moderately low. In particular, the time complexity of the algorithms for Markov models of moderate order (0 ≤ m ≤ 2) is far better than that of Nuel, which was the fastest similar exact algorithm known to date; in many cases, even approximate methods are outperformed.
DFA are a standard tool of computer science for the study of patterns; previous works in biology propose algorithms involving automata, but there they are used, respectively, as a first step to write a generating function, or to build a finite Markov-chain imbedding (FMCI). In contrast, we directly rely on DFA to perform the calculations; thus we manage to obtain an algorithm which is both easily interpretable and efficient. This approach can be used for exact statistical studies of very long genomes and protein sequences, as we illustrate with some examples on the scale of the human genome.
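The idea of relying directly on a DFA to obtain the exact distribution of word occurrences can be sketched in a few lines. The example below assumes the simplest case, a Markov model of order 0 (independent letters), counts overlapping occurrences, and uses a plain dynamic programme over (automaton state, occurrence count); it is an illustration of the principle, not the optimised algorithm of the paper. Function names are hypothetical.

```python
from collections import defaultdict


def build_dfa(word, alphabet):
    """Pattern-matching automaton: state = length of the longest prefix of
    `word` that is a suffix of the text read so far."""
    m = len(word)
    dfa = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for letter in alphabet:
            text = word[:state] + letter
            k = min(m, len(text))
            # Shrink k until the last k letters of text match word[:k].
            while k > 0 and text[len(text) - k:] != word[:k]:
                k -= 1
            dfa[state][letter] = k
    return dfa


def count_distribution(word, length, probs):
    """Exact distribution of the number of (overlapping) occurrences of
    `word` in a random sequence of `length` independent letters drawn
    with probabilities `probs` (a dict letter -> probability)."""
    dfa = build_dfa(word, list(probs))
    accepting = len(word)
    dist = {(0, 0): 1.0}  # (automaton state, occurrences so far) -> probability
    for _ in range(length):
        nxt = defaultdict(float)
        for (state, count), p in dist.items():
            for letter, p_letter in probs.items():
                new_state = dfa[state][letter]
                new_count = count + (1 if new_state == accepting else 0)
                nxt[(new_state, new_count)] += p * p_letter
        dist = nxt
    result = defaultdict(float)
    for (_, count), p in dist.items():
        result[count] += p
    return dict(result)


# Hypothetical usage: occurrences of "AA" in a random sequence of length 3.
distribution = count_distribution("AA", 3, {"A": 0.5, "C": 0.5})
```

From such a distribution the mean, variance and hence the z-value follow directly, and the P-value of an observed count is the sum of the tail probabilities.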
ABSTRACT: Sequencing DNA from several organisms has revealed that duplication and drift of existing genes have primarily moulded the contents of a given genome. Though the effect of knocking out or overexpressing a particular gene has been studied in many organisms, no study has systematically explored the effect of adding new links in a biological network. To explore network evolvability, we constructed 598 recombinations of promoters (including regulatory regions) with different transcription or σ-factor genes in Escherichia coli, added over a wild-type genetic background. Here we show that 95% of new networks are tolerated by the bacteria, that very few alter growth, and that expression level correlates with factor position in the wild-type network hierarchy. Most importantly, we find that certain networks consistently survive over the wild type under various selection pressures. Therefore new links in the network are rarely a barrier for evolution and can even confer a fitness advantage.
ABSTRACT: Molecular chaperones ensure that their substrate proteins reach the functional native state, and prevent their aggregation. Recently, an additional function was proposed for molecular chaperones: they serve as buffers (capacitors) for evolution by permitting their substrate proteins to mutate and at the same time still allowing them to fold productively.
Using pairwise alignments of E. coli genes with genes from other gamma-proteobacteria, we showed that the described buffering effect cannot be observed among substrate proteins of GroEL, an essential chaperone in E. coli. Instead, we find that GroEL substrate proteins evolve less than other soluble E. coli proteins. We analyzed several specific structural and biophysical properties of proteins to assess their influence on protein evolution and to find out why specifically GroEL substrates do not show the expected higher divergence from their orthologs.
Our results culminate in four main findings: (1) We find little evidence that GroEL in E. coli acts as a capacitor for evolution in vivo. (2) GroEL substrates evolved less than other E. coli proteins. (3) Predominantly structural features appear to be a strong determinant of evolutionary rate. (4) Besides size, hydrophobicity is a criterion for exclusion for a protein as a chaperonin substrate.
ABSTRACT: Molecular chaperones ensure that their substrate proteins reach the functional native state and prevent their aggregation. The bacterial GroEL/ES chaperonin system is understood in great detail on a structural, mechanistic and functional level. Its substrate proteins in E. coli have been identified and characterized. However, a long-standing and as yet unresolved question in the field is: what makes a protein a chaperone substrate?
Here we demonstrate with a bioinformatics-based approach that a simple set of criteria is sufficient to describe the GroEL substrate proteome to unprecedented accuracy. We define two novel parameters differentiating GroEL substrates from other cellular proteins: evolutionary rate and hydrophobicity. We demonstrate their joint applicability and explain why they are suitable descriptors. Combining them with other specific features of proteins, such as structure and size, we manage to identify the subset of GroEL substrate proteins with high confidence. We verify the applicability of our findings by correctly predicting a number of known heterologous GroEL substrate proteins.
Furthermore, our results show that in vivo, the proposed buffering capacity of chaperones does not appear to be a dominant effect. Instead, the observed lower evolutionary rates among substrate proteins could be explained by their energetically unfavorable folding pathways not allowing for additional destabilizing mutations to occur.
We show that a combination of simple parameters is sufficient to accurately describe the GroEL substrate proteome and to successfully predict known heterologous substrates. Our approach can potentially be used to predict chaperonin usage for any given polypeptide chain. Observed low evolutionary rates of GroEL substrates suggest that constraints in the folding pathways of the respective proteins do not allow for the accumulation of mutations.