ABSTRACT: Colorectal cancer with metastases limited to the liver (liver-limited mCRC) is a distinct clinical subset characterized by possible cure with surgery. We performed high-depth sequencing of over 750 cancer-associated genes and copy number profiling in matched primary, metastasis and normal tissues to characterize genomic progression in 18 patients with liver-limited mCRC.
High-depth Illumina sequencing and use of three different variant callers enable comprehensive and accurate identification of somatic variants down to 2.5% variant allele frequency. We identify a median of 11 somatic single nucleotide variants (SNVs) per tumor. Across patients, a median of 79.3% of somatic SNVs present in the primary are present in the metastasis and 81.7% of all alterations present in the metastasis are present in the primary. Private alterations are found at lower allele frequencies; different mutational signatures characterize shared and private variants, suggesting distinct mutational processes. Using B-allele frequencies of heterozygous germline SNPs and copy number profiling, we find that broad regions of allelic imbalance and focal copy number changes, respectively, are generally shared between the primary tumor and metastasis.
Our analyses point to high genomic concordance of primary tumor and metastasis, with a thick common trunk and smaller genomic branches in general support of the linear progression model in most patients with liver-limited mCRC. More extensive studies are warranted to further characterize genomic progression in this important clinical population.
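The two concordance figures quoted above (share of primary variants seen in the metastasis, and share of metastasis variants seen in the primary) come down to set arithmetic per patient followed by a median across patients. The following is a minimal sketch of that calculation, not the authors' pipeline; the variant keys, the function names `concordance` and `median_concordance`, and the example data are all hypothetical.

```python
from statistics import median


def concordance(primary, metastasis):
    """Return (% of primary variants also in metastasis,
               % of metastasis variants also in primary)."""
    shared = primary & metastasis
    pct_primary = 100.0 * len(shared) / len(primary) if primary else float("nan")
    pct_met = 100.0 * len(shared) / len(metastasis) if metastasis else float("nan")
    return pct_primary, pct_met


def median_concordance(patients):
    """patients: iterable of (primary_set, metastasis_set) pairs.
    Returns the per-direction medians across patients."""
    pairs = [concordance(p, m) for p, m in patients]
    return median(x for x, _ in pairs), median(y for _, y in pairs)


# Hypothetical variants keyed as (chromosome, position, ref, alt).
primary = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"),
           ("chr3", 300, "C", "T"), ("chr4", 400, "T", "G")}
metastasis = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"),
              ("chr3", 300, "C", "T"), ("chr5", 500, "A", "G"),
              ("chr6", 600, "G", "A")}

pct_primary, pct_met = concordance(primary, metastasis)
```

Keying each variant on chromosome, position and alleles is an assumption of this sketch; it means a variant is counted as shared only when both samples carry the same substitution at the same site.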
ABSTRACT: Human respiratory syncytial virus (RSV) is the major cause of lower respiratory tract infections in children under two years of age. Little is known about RSV intra-host genetic diversity over the course of infection, or about the immune pressures that drive RSV molecular evolution. We performed whole-genome deep sequencing on 53 RSV-positive samples (37 RSV subgroup A and 16 subgroup B) collected from the upper airways of hospitalized children in southern Vietnam over two consecutive seasons. RSV A NA1 and RSV B BA9 were the predominant genotypes found in our samples, consistent with other reports on global RSV circulation during the same period. For both RSV A and B, the M gene was the most conserved, confirming its potential as a target for novel therapeutics. The G gene was the most variable and was the only gene under detectable positive selection. Further, positively-selected sites in G were found in close proximity to and in some cases overlapped with predicted glycosylation motifs, suggesting that selection on amino acid glycosylation may drive viral genetic diversity. We further identified hotspots and coldspots of intra-host genetic diversity in the RSV genome, some of which may highlight previously unknown regions of functional importance.
Journal of General Virology 09/2015; DOI:10.1099/jgv.0.000298 · 3.18 Impact Factor
ABSTRACT: Dengue viruses (DENV) cause debilitating and potentially life-threatening acute disease throughout the tropical world. While drug development efforts are underway, there are concerns that resistant strains will emerge rapidly. Indeed, antiviral drugs that target even conserved regions in other RNA viruses lose efficacy over time as the virus mutates. Here, we sought to determine if there are regions in the DENV genome that are not only evolutionarily conserved but genetically constrained in their ability to mutate and could hence serve as better antiviral targets. High-throughput sequencing of DENV-1 genome directly from twelve, paired dengue patients' sera and then passaging these sera into the two primary mosquito vectors showed consistent and distinct sequence changes during infection. In particular, two residues in the NS5 protein coding sequence appear to be specifically acquired during infection in Ae. aegypti but not Ae. albopictus. Importantly, we identified a region within the NS3 protein coding sequence that is refractory to mutation during human and mosquito infection. Collectively, these findings provide fresh insights into antiviral targets and could serve as an approach to defining evolutionarily constrained regions for therapeutic targeting in other RNA viruses.
ABSTRACT: Dengue virus (DENV) infection of an individual human or mosquito host produces a dynamic population of closely-related sequences. This intra-host genetic diversity is thought to offer an advantage for arboviruses to adapt as they cycle between two very different host species, but it remains poorly characterized. To track changes in viral intra-host genetic diversity during horizontal transmission, we infected Aedes aegypti mosquitoes by allowing them to feed on DENV2-infected patients. We then performed whole-genome deep-sequencing of human- and matched mosquito-derived DENV samples on the Illumina platform and used a sensitive variant-caller to detect single nucleotide variants (SNVs) within each sample. >90% of SNVs were lost upon transition from human to mosquito, as well as from mosquito abdomen to salivary glands. Levels of viral diversity were maintained, however, by the regeneration of new SNVs at each stage of transmission. We further show that SNVs maintained across transmission stages were transmitted as a unit of two at maximum, suggesting the presence of numerous variant genomes carrying only one or two SNVs each. We also present evidence for differences in selection pressures between human and mosquito hosts, particularly on the structural and NS1 genes. This analysis provides insights into how population drops during transmission shape RNA virus genetic diversity, has direct implications for virus evolution, and illustrates the value of high-coverage, whole-genome next-generation sequencing for understanding viral intra-host genetic diversity.
ABSTRACT: Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and mapping technologies (e.g. optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kbp–2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging due to the lack of efficient and freely available software for robustly aligning maps to sequences. Here we introduce two new map-to-sequence alignment algorithms that efficiently and accurately align high-throughput mapping datasets to large, eukaryotic genomes while accounting for high error rates. In order to do so, these methods (OPTIMA for glocal and OPTIMA-Overlap for overlap alignment) exploit the ability to create efficient data structures that index continuous-valued mapping data while accounting for errors. We also introduce an approach for evaluating the significance of alignments that avoids expensive permutation-based tests while being agnostic to technology-dependent error rates. Our benchmarking results suggest that OPTIMA and OPTIMA-Overlap outperform state-of-the-art approaches in sensitivity (1.6–2× improvement) while simultaneously being more efficient (170–200%) and precise in their alignments (99% precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust and provide improved sensitivity while guaranteeing high precision.
ABSTRACT: High-throughput assays, such as RNA-seq, to detect differential abundance are widely used. Variable performance across statistical tests, normalizations, and conditions leads to resource wastage and reduced sensitivity. EDDA represents a first, general design tool for RNA-seq, Nanostring, and metagenomic analysis, that rationally selects tests, predicts performance, and plans experiments to minimize resource wastage. Case studies highlight EDDA’s ability to model single-cell RNA-seq, suggesting ways to reduce sequencing costs up to five-fold and improving metagenomic biomarker detection through improved test selection. EDDA’s novel mode-based normalization for detecting differential abundance improves robustness by 10% to 20% and precision by up to 140%.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0527-7) contains supplementary material, which is available to authorized users.
ABSTRACT: We present a method for obtaining long haplotypes, of over 3 kb in length, using a short-read sequencer, Barcode-directed Assembly for Extra-long Sequences (BAsE-Seq). BAsE-Seq relies on transposing a template-specific barcode onto random segments of the template molecule and assembling the barcoded short reads into complete haplotypes. We applied BAsE-Seq on mixed clones of hepatitis B virus and accurately identified haplotypes occurring at frequencies greater than or equal to 0.4%, with >99.9% specificity. Applying BAsE-Seq to a clinical sample, we obtained over 9,000 viral haplotypes, which provided an unprecedented view of hepatitis B virus population structure during chronic infection. BAsE-Seq is readily applicable for monitoring quasispecies evolution in viral diseases.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0517-9) contains supplementary material, which is available to authorized users.
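The barcode-to-haplotype idea in the BAsE-Seq abstract can be illustrated with a toy example: reads that share a barcode are assumed to come from the same template molecule, so their bases can be merged into one haplotype per barcode and the haplotypes then counted. This is only a sketch under strong simplifying assumptions, not the published method: reads are given as already-aligned `(barcode, {position: base})` pairs, and the function name `call_haplotypes` and the `min_freq` parameter are hypothetical. The default of 0.004 mirrors the 0.4% frequency quoted in the abstract.

```python
from collections import Counter, defaultdict


def call_haplotypes(reads, positions, min_freq=0.004):
    """Group reads by barcode, take a majority base at each position of
    interest, and return haplotype frequencies at or above min_freq."""
    # barcode -> position -> Counter of observed bases
    per_barcode = defaultdict(lambda: defaultdict(Counter))
    for barcode, bases in reads:
        for pos, base in bases.items():
            per_barcode[barcode][pos][base] += 1

    haplotypes = Counter()
    for cols in per_barcode.values():
        # Skip templates that are not covered at every position of interest.
        if any(pos not in cols for pos in positions):
            continue
        hap = "".join(cols[pos].most_common(1)[0][0] for pos in positions)
        haplotypes[hap] += 1

    total = sum(haplotypes.values())
    return {h: n / total for h, n in haplotypes.items() if n / total >= min_freq}


# Hypothetical reads: three barcodes, two variable positions (10 and 20).
reads = [
    ("bc1", {10: "A"}), ("bc1", {20: "C"}), ("bc1", {10: "A"}),
    ("bc2", {10: "G", 20: "C"}),
    ("bc3", {10: "A"}),  # never covers position 20, so it is dropped
]
frequencies = call_haplotypes(reads, positions=[10, 20])
```

Dropping incompletely covered templates is a design choice of this sketch; it trades yield for haplotypes that are fully resolved at every position considered.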
ABSTRACT: Dehalococcoides mccartyi strain SG1, isolated from digester sludge, dechlorinates polychlorinated biphenyls (PCBs) to lower congeners. Here we report the draft genome sequence of SG1, which carries a 22.65 kbp circular putative plasmid.
ABSTRACT: Chromosomal structural variations play an important role in determining the transcriptional landscape of human breast cancers. To assess the nature of these structural variations, we analyzed eight breast tumor samples with a focus on regions of gene amplification using mate-pair sequencing of long-insert genomic DNA with matched transcriptome profiling. We found that tandem duplications appear to be early events in tumor evolution, especially in the genesis of amplicons. In a detailed reconstruction of events on chromosome 17, we found large unpaired inversions and deletions connect a tandemly duplicated ERBB2 with neighboring 17q21.3 amplicons while simultaneously deleting the intervening BRCA1 tumor suppressor locus. This series of events appeared to be unusually common when examined in larger genomic data sets of breast cancers albeit using approaches with lesser resolution. Using siRNAs in breast cancer cell lines, we showed that the 17q21.3 amplicon harbored a significant number of weak oncogenes that appeared consistently coamplified in primary tumors. Down-regulation of BRCA1 expression augmented the cell proliferation in ERBB2-transfected human normal mammary epithelial cells. Coamplification of other functionally tested oncogenic elements in other breast tumors examined, such as RIPK2 and MYC on chromosome 8, also parallel these findings. Our analyses suggest that structural variations efficiently orchestrate the gain and loss of cancer gene cassettes that engage many oncogenic pathways simultaneously and that such oncogenic cassettes are favored during the evolution of a cancer.
Genome Research 09/2014; 24(10). DOI:10.1101/gr.164871.113 · 14.63 Impact Factor
ABSTRACT: Fastidious anaerobic bacteria play critical roles in environmental bioremediation of halogenated compounds. However, their characterization and application have been largely impeded by difficulties in growing them in pure culture. Thus far, no pure culture has been reported to respire on the notorious polychlorinated biphenyls (PCBs), and functional genes responsible for PCB detoxification remain unknown due to the extremely slow growth of PCB-respiring bacteria. Here we report the successful isolation and characterization of three Dehalococcoides mccartyi strains that respire on commercial PCBs. Using high-throughput metagenomic analysis, combined with traditional culture techniques, tetrachloroethene (PCE) was identified as a feasible alternative to PCBs to isolate PCB-respiring Dehalococcoides from PCB-enriched cultures. With PCE as an alternative electron acceptor, the PCB-respiring Dehalococcoides were boosted to a higher cell density (1.2 × 10^8 to 1.3 × 10^8 cells per mL on PCE vs. 5.9 × 10^6 to 10.4 × 10^6 cells per mL on PCBs) with a shorter culturing time (30 d on PCE vs. 150 d on PCBs). The transcriptomic profiles illustrated that the distinct PCB dechlorination profile of each strain was predominantly mediated by a single, novel reductive dehalogenase (RDase) catalyzing chlorine removal from both PCBs and PCE. The transcription levels of PCB-RDase genes are 5-60 times higher than the genome-wide average. The cultivation of PCB-respiring Dehalococcoides in pure culture and the identification of PCB-RDase genes deepen our understanding of organohalide respiration of PCBs and shed light on in situ PCB bioremediation.
Proceedings of the National Academy of Sciences 07/2014; 111(33). DOI:10.1073/pnas.1404845111 · 9.67 Impact Factor
ABSTRACT: Opisthorchiasis is a neglected, tropical disease caused by the carcinogenic Asian liver fluke, Opisthorchis viverrini. This hepatobiliary disease is linked to malignant cancer (cholangiocarcinoma, CCA) and affects millions of people in Asia. No vaccine is available, and only one drug (praziquantel) is used against the parasite. Little is known about O. viverrini biology and the diseases that it causes. Here we characterize the draft genome (634.5 Mb) and transcriptomes of O. viverrini, elucidate how this fluke survives in the hostile environment within the bile duct and show that metabolic pathways in the parasite are highly adapted to a lipid-rich diet from bile and/or cholangiocytes. We also provide additional evidence that O. viverrini and other flukes secrete proteins that directly modulate host cell proliferation. Our molecular resources now underpin profound explorations of opisthorchiasis/CCA and the design of new interventions.
ABSTRACT: RNA viruses are notorious for their ability to quickly adapt to selective pressure from the host immune system and/or antivirals. This adaptability is likely due to the error-prone characteristics of their RNA-dependent RNA polymerase [1, 2]. Dengue virus, a member of the Flaviviridae family of positive-strand RNA viruses, is also known to share these error-prone characteristics. Utilizing high-throughput, massively parallel sequencing methodologies, or next-generation sequencing (NGS), we can now accurately quantify these populations of viruses and track the changes to these populations over the course of a single infection. The aim of this chapter is twofold: to describe the methodologies required for sample preparation prior to sequencing and to describe the bioinformatics analyses required for the resulting data.
ABSTRACT: In this chapter, we report a detailed analysis of repetitive elements in the papaya genome, including transposable elements (TEs), tandemly arrayed sequences, and high copy number genes. These repetitive sequences account for ~56 % of the papaya genome, with TEs being the most abundant at 52 %, tandem repeats at 1.3 %, and high copy number genes at 3 %. Most common types of TEs are represented in the papaya genome with retrotransposons being the dominant class, accounting for 40 % of the genome. The most prevalent retrotransposons are Ty3–gypsy (27.8 %) and Ty1–copia (5.5 %). Among the tandem repeats, microsatellites are the most abundant in number but represent only 0.19 % of the genome. Minisatellites and satellites are less abundant but represent 0.68 and 0.43 % of the genome, respectively, due to greater repeat length. Despite an overall smaller gene repertoire in papaya than many other angiosperms, a significant fraction of genes (>2 %) are present in large gene families with copy number greater than 20. Papaya sex chromosomes are significantly enriched in repetitive sequences, and the male-specific region expanded by massive accumulation of repeated DNA, which represents 83 % of the region (mostly TEs), while the corresponding X region includes 70 % of such repeats. In an effort to integrate all the information, we provide here the pipeline to gather and process data related to repetitive elements in papaya.
Genetics and Genomics of Papaya, 01/2014: pages 225-240; ISBN: 978-1-4614-8086-0
ABSTRACT: Background / Purpose:
High-throughput sequencing datasets, in principle, enable the detection of extremely low frequency variants seen in a given cell population, to study their evolution and impact on phenotypes of interest. The use of ad hoc filters and statistics can however limit the sensitivity and specificity of detection, particularly when multiple samples are compared, as is the case for somatic variant calling and time course studies.
We demonstrate the utility of a systematic framework for variant calling (LoFreq) that simultaneously incorporates sequence quality, mapping quality, alignment quality and source quality information, allowing for single nucleotide variants and indels to be jointly called. Our benchmarking and validation results on real and in silico datasets demonstrate that this approach provides a significant boost in sensitivity over existing variant callers (accurately calling variants at less than 1% frequency), while retaining very high specificity.
Joint 21st Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) and 12th European Conference on Computational Biology (ECCB) 2013; 08/2013
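One way to see how base quality can enter a variant call, as the LoFreq abstract describes, is to treat every read base at a site as an independent trial whose error probability comes from its Phred score, and then ask how likely the observed number of non-reference bases would be from sequencing error alone. The sketch below computes that tail probability with a small dynamic program. It is an illustration under those assumptions only: it uses base quality alone, ignores mapping, alignment and source quality, and is not presented as LoFreq's implementation. The function names and example numbers are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def prob_at_least_k_errors(error_probs, k):
    """P(at least k errors) when each base errs independently with its own
    probability. dist[j] holds P(exactly j errors so far) for j < k, and
    dist[k] accumulates P(k or more errors)."""
    dist = [1.0] + [0.0] * k
    for p in error_probs:
        new = [0.0] * (k + 1)
        for j, mass in enumerate(dist):
            if mass == 0.0:
                continue
            if j == k:
                new[k] += mass              # already at or beyond k errors
            else:
                new[j] += mass * (1 - p)    # this base is correct
                new[j + 1] += mass * p      # this base is an error
        dist = new
    return dist[k]


# Hypothetical site: 1000 reads at Q30, 10 of which show a non-reference base.
qualities = [30] * 1000
p_value = prob_at_least_k_errors(
    [phred_to_error_prob(q) for q in qualities], k=10
)
```

A small `p_value` means the non-reference bases are hard to explain by sequencing error alone. Any significance threshold, and any correction for testing many sites, would need to be chosen separately.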
ABSTRACT: Rickettsia prowazekii is a notable intracellular pathogen, the agent of epidemic typhus, and a potential biothreat agent. We present here whole-genome sequence data for four strains of R. prowazekii, including one from a flying squirrel.
ABSTRACT: The high throughput and cost-effectiveness afforded by short-read sequencing technologies, in principle, enable researchers to perform 16S rRNA profiling of complex microbial communities at unprecedented depth and resolution. Existing Illumina sequencing protocols are, however, limited by the fraction of the 16S rRNA gene that is interrogated and therefore limit the resolution and quality of the profiling. To address this, we present the design of a novel protocol for shotgun Illumina sequencing of the bacterial 16S rRNA gene, optimized to amplify more than 90% of sequences in the Greengenes database and with the ability to distinguish nearly twice as many species-level OTUs compared to existing protocols. Using several in silico and experimental datasets, we demonstrate that despite the presence of multiple variable and conserved regions, the resulting shotgun sequences can be used to accurately quantify the constituents of complex microbial communities. The reconstruction of a significant fraction of the 16S rRNA gene also enabled high precision (>90%) in species-level identification thereby opening up potential application of this approach for clinical microbial characterization.
PLoS ONE 04/2013; 8(4):e60811. DOI:10.1371/journal.pone.0060811 · 3.23 Impact Factor
ABSTRACT: Advances in sequencing technologies and increased access to sequencing services have led to renewed interest in sequence and genome assembly. Concurrently, new applications for sequencing have emerged, including gene expression analysis, discovery of genomic variants and metagenomics, and each of these has different needs and challenges in terms of assembly. We survey the theoretical foundations that underlie modern assembly and highlight the options and practical trade-offs that need to be considered, focusing on how individual features address the needs of specific applications. We also review key software and the interplay between experimental design and efficacy of assembly.
ABSTRACT: Background:
Gastric cancer is the second highest cause of global cancer mortality. To explore the complete repertoire of somatic alterations in gastric cancer, we combined massively parallel short read and DNA paired-end tag sequencing to present the first whole-genome analysis of two gastric adenocarcinomas, one with chromosomal instability and the other with microsatellite instability.
Integrative analysis and de novo assemblies revealed the architecture of a wild-type KRAS amplification, a common driver event in gastric cancer. We discovered three distinct mutational signatures in gastric cancer: against a genome-wide backdrop of oxidative and microsatellite instability-related mutational signatures, we identified the first exome-specific mutational signature. Further characterization of the impact of these signatures by combining sequencing data from 40 complete gastric cancer exomes and targeted screening of an additional 94 independent gastric tumors uncovered ACVR2A, RPL22 and LMAN1 as recurrently mutated genes in microsatellite instability-positive gastric cancer and PAPPA as a recurrently mutated gene in TP53 wild-type gastric cancer.
These results highlight how whole-genome cancer sequencing can uncover information relevant to tissue-specific carcinogenesis that would otherwise be missed from exome-sequencing data.