ABSTRACT: Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.
Article · Feb 2015 · Nature Biotechnology
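The percentage gains quoted in the StringTie abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `percent_increase` is hypothetical, and the transcript counts are the ones stated in the abstract.

```python
def percent_increase(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Counts of correctly assembled transcripts, as quoted in the abstract.
human_blood = percent_increase(10990, 7187)   # StringTie vs. Cufflinks, real data
simulated = percent_increase(7559, 6310)      # StringTie vs. Cufflinks, simulated data

print(round(human_blood))  # rounds to 53
print(round(simulated))    # rounds to 20
```

Both rounded values agree with the 53% and 20% figures reported in the abstract.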
ABSTRACT: Despite antiretroviral therapy (ART), human immunodeficiency virus (HIV)-1 persists in a stable latent reservoir, primarily in resting memory CD4+ T cells. This reservoir presents a major barrier to the cure of HIV-1 infection. To purge the reservoir, pharmacological reactivation of latent HIV-1 has been proposed and tested both in vitro and in vivo. A key remaining question is whether virus-specific immune mechanisms, including cytotoxic T lymphocytes (CTLs), can clear infected cells in ART-treated patients after latency is reversed. Here we show that there is a striking all-or-none pattern for CTL escape mutations in HIV-1 Gag epitopes. Unless ART is started early, the vast majority (>98%) of latent viruses carry CTL escape mutations that render infected cells insensitive to CTLs directed at common epitopes. To solve this problem, we identified CTLs that could recognize epitopes from latent HIV-1 that were unmutated in every chronically infected patient tested. Upon stimulation, these CTLs eliminated target cells infected with autologous virus derived from the latent reservoir, both in vitro and in patient-derived humanized mice. The predominance of CTL-resistant viruses in the latent reservoir poses a major challenge to viral eradication. Our results demonstrate that chronically infected patients retain a broad-spectrum viral-specific CTL response and that appropriate boosting of this response may be required for the elimination of the latent reservoir.
ABSTRACT: DNA sequencing has become a powerful method to discover the genetic basis of disease. Standard, widely-used protocols for analysis usually begin by comparing each individual to the human reference genome. When applied to a set of related individuals, this approach reveals millions of differences, most of which are shared among the individuals and unrelated to the disease being investigated. We have developed a novel algorithm for variant detection, one that compares DNA sequences directly to one another, without aligning them to the reference genome. When used to find de novo mutations in exome sequences from family trios, or to compare normal and diseased samples from the same individual, the new method, Diamund, produces a dramatically smaller list of candidate mutations than previous methods, without losing sensitivity to detect the true cause of a genetic disease. We demonstrate our results on several example cases, including two family trios in which it correctly found the disease-causing variant while excluding thousands of harmless variants that standard methods had identified.
ABSTRACT: Despite recent technological advances, the study of the human transcriptome is still in its early stages. Here we provide an overview of the complex human transcriptomic landscape, present the bioinformatics challenges posed by the vast quantities of transcriptomic data, and discuss some of the studies that have tried to determine how much of the human genome is transcribed. Recent evidence has suggested that more than 90% of the human genome is transcribed into RNA. However, this view has been strongly contested by groups of scientists who argued that many of the observed transcripts are simply the result of transcriptional noise. In this review, we conclude that the full extent of transcription remains an open question that will not be fully addressed until we decipher the complete range and biological diversity of the transcribed genomic sequences.
ABSTRACT: Comparison of the human genome with other primates offers the opportunity to detect evolutionary events that created the diverse phenotypes among the primate species. Because the primate genomes are highly similar to one another, methods developed for analysis of more divergent species do not always detect signs of evolutionary selection.
We have developed a new method, called DivE, specifically designed to find regions that have evolved either more or less rapidly than expected, for any clade within a set of very closely related species. Unlike some previous methods, DivE does not rely on rates of synonymous and nonsynonymous substitution, which enables it to detect evolutionary events in noncoding regions. We demonstrate using simulated data that DivE compares favorably to alternative methods, and we then apply DivE to the ENCODE regions in 14 primate species. We identify thousands of regions in these primates, ranging from 50 to >10000 bp in length, that appear to have experienced either constrained or accelerated rates of evolution. In particular, we detected 4942 regions that have potentially undergone positive selection in one or more primate species. Most of these regions occur outside of protein-coding genes, although we identified 20 proteins that have experienced positive selection.
DivE provides an easy-to-use method for predicting both positive and negative selection in noncoding DNA, and it is particularly well suited to detecting lineage-specific selection in large genomes.
ABSTRACT: We developed a computational screen that tests an individual's genome for mutations in the BRCA genes, despite the fact that both are currently protected by patents.
ABSTRACT: Many people expected the question 'How many genes in the human genome?' to be resolved with the publication of the genome sequence in 2001, but estimates continue to fluctuate.
ABSTRACT: Schistosoma mansoni is responsible for the neglected tropical disease schistosomiasis that affects 210 million people in 76 countries. Here we present analysis of the 363 megabase nuclear genome of the blood fluke. It encodes at least 11,809 genes, with an unusual intron size distribution, and new families of micro-exon genes that undergo frequent alternative splicing. As the first sequenced flatworm, and a representative of the Lophotrochozoa, it offers insights into early events in the evolution of the animals, including the development of a body pattern with bilateral symmetry, and the development of tissues into organs. Our analysis has been informed by the need to find new drug targets. The deficits in lipid metabolism that make schistosomes dependent on the host are revealed, and the identification of membrane receptors, ion channels and more than 300 proteases provides new insights into the biology of the life cycle and new targets. Bioinformatics approaches have identified metabolic chokepoints, and a chemogenomic screen has pinpointed schistosome proteins for which existing drugs may be active. The information generated provides an invaluable resource for the research community to develop much needed new control tools for the treatment and eradication of this important and neglected disease.
ABSTRACT: Advances in sequencing technologies have accelerated the sequencing of new genomes, far outpacing the generation of gene and
protein resources needed to annotate them. Direct comparison and alignment of existing cDNA sequences from a related species
is an effective and readily available means to determine genes in the new genomes. Current spliced alignment programs are
inadequate for comparing sequences between different species, owing to their low sensitivity and splice junction accuracy.
A new spliced alignment tool, sim4cc, overcomes problems in the earlier tools by incorporating three new features: universal
spaced seeds, to increase sensitivity and allow comparisons between species at various evolutionary distances, and powerful
splice signal models and evolutionarily-aware alignment techniques, to improve the accuracy of gene models. When tested on
vertebrate comparisons at diverse evolutionary distances, sim4cc had significantly higher sensitivity compared to existing
alignment programs, more than 10% higher than the closest competitor for some comparisons, while being comparable in speed
to its predecessor, sim4. Sim4cc can be used in one-to-one or one-to-many comparisons of genomic and cDNA sequences, and can
also be effectively incorporated into a high-throughput annotation engine, as demonstrated by the mapping of 64 000 Fagus grandifolia 454 ESTs and unigenes to the poplar genome.
Article · Jun 2009 · Nucleic Acids Research
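The sim4cc abstract credits spaced seeds with increasing sensitivity between diverged sequences. The general idea can be sketched as follows. This is a toy illustration only, not sim4cc's implementation: the function name `spaced_seed_hits` and the mask `"1101"` are hypothetical, and real tools use much longer, carefully designed seed patterns.

```python
def spaced_seed_hits(query, target, mask="1101"):
    """Find (query_pos, target_pos) pairs that agree at every '1' of the mask.

    Positions marked '0' in the mask are "don't care" positions, so a hit
    tolerates a mismatch there. A mask of all '1's is an ordinary exact
    k-mer seed.
    """
    span = len(mask)
    care = [k for k, c in enumerate(mask) if c == "1"]

    # Index every window of the target by its characters at the care positions.
    index = {}
    for j in range(len(target) - span + 1):
        key = "".join(target[j + k] for k in care)
        index.setdefault(key, []).append(j)

    # Look up every window of the query in that index.
    hits = []
    for i in range(len(query) - span + 1):
        key = "".join(query[i + k] for k in care)
        for j in index.get(key, []):
            hits.append((i, j))
    return hits


# The two sequences differ at their third base.
print(spaced_seed_hits("ACGT", "ACTT", mask="1101"))  # spaced seed skips position 2
print(spaced_seed_hits("ACGT", "ACTT", mask="1111"))  # contiguous seed requires an exact match
```

With the spaced mask the single substitution falls on a "don't care" position and the pair is still reported as a hit, while the contiguous seed of the same span finds nothing. This is the intuition behind the sensitivity gain for cross-species comparisons.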
ABSTRACT: The fast pace of bacterial genome sequencing and the resulting dependence on highly automated annotation methods has driven
the development of many genome-wide analysis tools. OperonDB, first released in 2001, is a database containing the results
of a computational algorithm for locating operon structures in microbial genomes. OperonDB has grown from 34 genomes in its
initial release to more than 500 genomes today. In addition to increasing the size of the database, we have re-designed our
operon finding algorithm and improved its accuracy. The new database is updated regularly as additional genomes become available
in public archives. OperonDB can be accessed at: http://operondb.cbcb.umd.edu
Article · Nov 2008 · Nucleic Acids Research
ABSTRACT: EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.
ABSTRACT: Parasitic nematodes that cause elephantiasis and river blindness threaten hundreds of millions of people in the developing
world. We have sequenced the ∼90 megabase (Mb) genome of the human filarial parasite Brugia malayi and predict ∼11,500 protein coding genes in 71 Mb of robustly assembled sequence. Comparative analysis with the free-living,
model nematode Caenorhabditis elegans revealed that, despite these genes having maintained little conservation of local synteny during ∼350 million years of evolution,
they largely remain in linkage on chromosomal units. More than 100 conserved operons were identified. Analysis of the predicted
proteome provides evidence for adaptations of B. malayi to niches in its human and vector hosts and insights into the molecular basis of a mutualistic relationship with its Wolbachia endosymbiont. These findings offer a foundation for rational drug design.
ABSTRACT: Background: Protein domains are the common functional elements used by nature to generate tremendous diversity among proteins, and they
are used repeatedly in different combinations across all major domains of life. In this paper we address the problem of using
similarity to known protein domains in helping with the identification of genes in a DNA sequence. We have adapted the generalized
hidden Markov model (GHMM) architecture of the ab initio gene finder GlimmerHMM such that a higher probability is assigned to exons that contain homologues to protein domains. To
our knowledge, this domain homology based approach has not been used previously in the context of ab initio gene prediction. Results: GlimmerHMM was augmented with a protein domain module that recognizes gene structures that are similar to Pfam models. The
augmented system, GlimmerHMM+, shows 2% improvement in sensitivity and a 1% increase in specificity in predicting exact gene
structures compared to GlimmerHMM without this option. These results were obtained on two very different model organisms:
Arabidopsis thaliana (mustard weed) and Danio rerio (zebrafish), and together these preliminary results demonstrate the value of using protein domain homology in gene prediction.
The results obtained are encouraging, and we believe that a more comprehensive approach including a model that reflects the
statistical characteristics of specific sets of protein domain families would result in a greater increase of the accuracy
of gene prediction. GlimmerHMM and GlimmerHMM+ are freely available as open source software at http://cbcb.umd.edu/software.
ABSTRACT: We present a draft sequence of the genome of Aedes aegypti, the primary vector for yellow fever and dengue fever, which at ∼1376 million base pairs is about 5 times the size of the
genome of the malaria vector Anopheles gambiae. Nearly 50% of the Ae. aegypti genome consists of transposable elements. These contribute to a factor of ∼4 to 6 increase in average gene length and in
sizes of intergenic regions relative to An. gambiae and Drosophila melanogaster. Nonetheless, chromosomal synteny is generally maintained among all three insects, although conservation of orthologous gene
order is higher (by a factor of ∼2) between the mosquito species than between either of them and the fruit fly. An increase
in genes encoding odorant binding, cytochrome P450, and cuticle domains relative to An. gambiae suggests that members of these protein families underpin some of the biological differences between the two mosquito species.
ABSTRACT: Algorithmic approaches to splice site prediction have relied mainly on the consensus patterns found at the boundaries between protein coding and non-coding regions. However, exonic splicing enhancers have been shown to enhance the utilization of nearby splice sites.
We have developed a new computational technique to identify significantly conserved motifs involved in splice site regulation. First, 84 putative exonic splicing enhancer hexamers were identified in Arabidopsis thaliana, and a Gibbs sampling program called ELPH was used to locate conserved motifs represented by these hexamers in exonic regions near splice sites in confirmed genes. Oligomers containing 35 of these motifs have been shown experimentally to induce significant inclusion of A. thaliana exons. Second, integration of our regulatory motifs into two different splice site recognition programs significantly improved the ability of the software to correctly predict splice sites in a large database of confirmed genes. We have released GeneSplicerESE, the improved splice site recognition code, as open source software.
Our results show that the use of the ESE motifs consistently improves splice site prediction accuracy.
Article · Feb 2007 · BMC Bioinformatics
ABSTRACT: We describe the genome sequence of the protist Trichomonas vaginalis, a sexually transmitted human pathogen. Repeats and transposable elements comprise about two-thirds of the ∼160-megabase
genome, reflecting a recent massive expansion of genetic material. This expansion, in conjunction with the shaping of metabolic
pathways that likely transpired through lateral gene transfer from bacteria, and amplification of specific gene families implicated
in pathogenesis and phagocytosis of host proteins may exemplify adaptations of the parasite during its transition to a urogenital
environment. The genome sequence predicts previously unknown functions for the hydrogenosome, which support a common evolutionary
origin of this unusual organelle with mitochondria.