ABSTRACT: Colorectal cancer with metastases limited to the liver (liver-limited mCRC) is a distinct clinical subset characterized by possible cure with surgery. We performed high-depth sequencing of over 750 cancer-associated genes and copy number profiling in matched primary, metastasis and normal tissues to characterize genomic progression in 18 patients with liver-limited mCRC.
High-depth Illumina sequencing and use of three different variant callers enable comprehensive and accurate identification of somatic variants down to 2.5% variant allele frequency. We identify a median of 11 somatic single nucleotide variants (SNVs) per tumor. Across patients, a median of 79.3% of somatic SNVs present in the primary are present in the metastasis and 81.7% of all alterations present in the metastasis are present in the primary. Private alterations are found at lower allele frequencies; shared and private variants are characterized by different mutational signatures, suggesting distinct mutational processes. Using B-allele frequencies of heterozygous germline SNPs and copy number profiling, we find that broad regions of allelic imbalance and focal copy number changes, respectively, are generally shared between the primary tumor and metastasis.
Our analyses point to high genomic concordance of primary tumor and metastasis, with a thick common trunk and smaller genomic branches in general support of the linear progression model in most patients with liver-limited mCRC. More extensive studies are warranted to further characterize genomic progression in this important clinical population.
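The two concordance figures quoted in this abstract (share of primary SNVs also seen in the metastasis, and the reverse) can be illustrated with a small sketch. The patient labels and variant identifiers below are hypothetical toy data, not calls from the study, and the set-overlap definition of "shared" is an assumption about how such percentages are typically derived.

```python
from statistics import median


def concordance(primary, metastasis):
    """Return (fraction of primary SNVs found in the metastasis,
    fraction of metastasis SNVs found in the primary)."""
    if not primary or not metastasis:
        raise ValueError("both call sets must be non-empty")
    shared = primary & metastasis
    return len(shared) / len(primary), len(shared) / len(metastasis)


# Hypothetical per-patient call sets: (primary SNVs, metastasis SNVs).
patients = {
    "P1": ({"KRAS:G12D", "TP53:R175H", "APC:R1450*", "SMAD4:R361C"},
           {"KRAS:G12D", "TP53:R175H", "APC:R1450*", "PIK3CA:E545K"}),
    "P2": ({"APC:Q1378*", "TP53:R273H"},
           {"APC:Q1378*", "TP53:R273H"}),
    "P3": ({"BRAF:V600E", "APC:R876*", "FBXW7:R465C", "NRAS:Q61K"},
           {"BRAF:V600E", "APC:R876*"}),
}

primary_to_met = []
met_to_primary = []
for name, (prim, met) in patients.items():
    p2m, m2p = concordance(prim, met)
    primary_to_met.append(p2m)
    met_to_primary.append(m2p)

# Medians across patients, analogous to the per-cohort medians in the abstract.
median_primary_to_met = median(primary_to_met)
median_met_to_primary = median(met_to_primary)
print(median_primary_to_met, median_met_to_primary)
```

Reporting both directions matters because the two fractions use different denominators: a metastasis with few private variants can score high in one direction even when the primary carries many variants it lacks.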
ABSTRACT: Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and mapping technologies (e.g. optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kbp–2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging due to the lack of efficient and freely available software for robustly aligning maps to sequences. Here we introduce two new map-to-sequence alignment algorithms that efficiently and accurately align high-throughput mapping datasets to large, eukaryotic genomes while accounting for high error rates. In order to do so, these methods (OPTIMA for glocal and OPTIMA-Overlap for overlap alignment) exploit the ability to create efficient data structures that index continuous-valued mapping data while accounting for errors. We also introduce an approach for evaluating the significance of alignments that avoids expensive permutation-based tests while being agnostic to technology-dependent error rates. Our benchmarking results suggest that OPTIMA and OPTIMA-Overlap outperform state-of-the-art approaches in sensitivity (1.6–2× improvement) while simultaneously being more efficient (170–200%) and precise in their alignments (99% precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust and provide improved sensitivity while guaranteeing high precision.
ABSTRACT: High-throughput assays, such as RNA-seq, to detect differential abundance are widely used. Variable performance across statistical tests, normalizations, and conditions leads to resource wastage and reduced sensitivity. EDDA represents a first, general design tool for RNA-seq, Nanostring, and metagenomic analysis, that rationally selects tests, predicts performance, and plans experiments to minimize resource wastage. Case studies highlight EDDA’s ability to model single-cell RNA-seq, suggesting ways to reduce sequencing costs up to five-fold and improving metagenomic biomarker detection through improved test selection. EDDA’s novel mode-based normalization for detecting differential abundance improves robustness by 10% to 20% and precision by up to 140%.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0527-7) contains supplementary material, which is available to authorized users.
ABSTRACT: We present a method for obtaining long haplotypes, of over 3 kb in length, using a short-read sequencer, Barcode-directed Assembly for Extra-long Sequences (BAsE-Seq). BAsE-Seq relies on transposing a template-specific barcode onto random segments of the template molecule and assembling the barcoded short reads into complete haplotypes. We applied BAsE-Seq on mixed clones of hepatitis B virus and accurately identified haplotypes occurring at frequencies greater than or equal to 0.4%, with >99.9% specificity. Applying BAsE-Seq to a clinical sample, we obtained over 9,000 viral haplotypes, which provided an unprecedented view of hepatitis B virus population structure during chronic infection. BAsE-Seq is readily applicable for monitoring quasispecies evolution in viral diseases.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0517-9) contains supplementary material, which is available to authorized users.
ABSTRACT: Dehalococcoides mccartyi strain SG1, isolated from digester sludge, dechlorinates polychlorinated biphenyls (PCBs) to lower congeners. Here we report the draft genome sequence of SG1, which carries a 22.65 kbp circular putative plasmid.
ABSTRACT: Chromosomal structural variations play an important role in determining the transcriptional landscape of human breast cancers. To assess the nature of these structural variations, we analyzed eight breast tumor samples with a focus on regions of gene amplification using mate-pair sequencing of long-insert genomic DNA with matched transcriptome profiling. We found that tandem duplications appear to be early events in tumor evolution, especially in the genesis of amplicons. In a detailed reconstruction of events on chromosome 17, we found large unpaired inversions and deletions connect a tandemly duplicated ERBB2 with neighboring 17q21.3 amplicons while simultaneously deleting the intervening BRCA1 tumor suppressor locus. This series of events appeared to be unusually common when examined in larger genomic data sets of breast cancers albeit using approaches with lesser resolution. Using siRNAs in breast cancer cell lines, we showed that the 17q21.3 amplicon harbored a significant number of weak oncogenes that appeared consistently coamplified in primary tumors. Down-regulation of BRCA1 expression augmented the cell proliferation in ERBB2-transfected human normal mammary epithelial cells. Coamplification of other functionally tested oncogenic elements in other breast tumors examined, such as RIPK2 and MYC on chromosome 8, also parallels these findings. Our analyses suggest that structural variations efficiently orchestrate the gain and loss of cancer gene cassettes that engage many oncogenic pathways simultaneously and that such oncogenic cassettes are favored during the evolution of a cancer.
Genome Research 09/2014; 24(10). DOI:10.1101/gr.164871.113 · 13.85 Impact Factor
ABSTRACT: Fastidious anaerobic bacteria play critical roles in environmental bioremediation of halogenated compounds. However, their characterization and application have been largely impeded by difficulties in growing them in pure culture. Thus far, no pure culture has been reported to respire on the notorious polychlorinated biphenyls (PCBs), and functional genes responsible for PCB detoxification remain unknown due to the extremely slow growth of PCB-respiring bacteria. Here we report the successful isolation and characterization of three Dehalococcoides mccartyi strains that respire on commercial PCBs. Using high-throughput metagenomic analysis, combined with traditional culture techniques, tetrachloroethene (PCE) was identified as a feasible alternative to PCBs to isolate PCB-respiring Dehalococcoides from PCB-enriched cultures. With PCE as an alternative electron acceptor, the PCB-respiring Dehalococcoides were boosted to a higher cell density (1.2 × 10^8 to 1.3 × 10^8 cells per mL on PCE vs. 5.9 × 10^6 to 10.4 × 10^6 cells per mL on PCBs) with a shorter culturing time (30 d on PCE vs. 150 d on PCBs). The transcriptomic profiles illustrated that the distinct PCB dechlorination profile of each strain was predominantly mediated by a single, novel reductive dehalogenase (RDase) catalyzing chlorine removal from both PCBs and PCE. The transcription levels of PCB-RDase genes are 5–60 times higher than the genome-wide average. The cultivation of PCB-respiring Dehalococcoides in pure culture and the identification of PCB-RDase genes deepen our understanding of organohalide respiration of PCBs and shed light on in situ PCB bioremediation.
Proceedings of the National Academy of Sciences 07/2014; 111(33). DOI:10.1073/pnas.1404845111 · 9.81 Impact Factor
ABSTRACT: Opisthorchiasis is a neglected, tropical disease caused by the carcinogenic Asian liver fluke, Opisthorchis viverrini. This hepatobiliary disease is linked to malignant cancer (cholangiocarcinoma, CCA) and affects millions of people in Asia. No vaccine is available, and only one drug (praziquantel) is used against the parasite. Little is known about O. viverrini biology and the diseases that it causes. Here we characterize the draft genome (634.5 Mb) and transcriptomes of O. viverrini, elucidate how this fluke survives in the hostile environment within the bile duct and show that metabolic pathways in the parasite are highly adapted to a lipid-rich diet from bile and/or cholangiocytes. We also provide additional evidence that O. viverrini and other flukes secrete proteins that directly modulate host cell proliferation. Our molecular resources now underpin profound explorations of opisthorchiasis/CCA and the design of new interventions.
ABSTRACT: RNA viruses are notorious for their ability to quickly adapt to selective pressure from the host immune system and/or antivirals. This adaptability is likely due to the error-prone characteristics of their RNA-dependent RNA polymerase [1, 2]. Dengue virus, a member of the Flaviviridae family of positive-strand RNA viruses, is also known to share these error-prone characteristics. Utilizing high-throughput, massively parallel sequencing methodologies, or next-generation sequencing (NGS), we can now accurately quantify these populations of viruses and track the changes to these populations over the course of a single infection. The aim of this chapter is twofold: to describe the methodologies required for sample preparation prior to sequencing and to describe the bioinformatics analyses required for the resulting data.
ABSTRACT: In this chapter, we report a detailed analysis of repetitive elements in the papaya genome, including transposable elements (TEs), tandemly arrayed sequences, and high copy number genes. These repetitive sequences account for ~56 % of the papaya genome, with TEs being the most abundant at 52 %, tandem repeats at 1.3 %, and high copy number genes at 3 %. Most common types of TEs are represented in the papaya genome with retrotransposons being the dominant class, accounting for 40 % of the genome. The most prevalent retrotransposons are Ty3–gypsy (27.8 %) and Ty1–copia (5.5 %). Among the tandem repeats, microsatellites are the most abundant in number but represent only 0.19 % of the genome. Minisatellites and satellites are less abundant but represent 0.68 and 0.43 % of the genome, respectively, due to greater repeat length. Despite an overall smaller gene repertoire in papaya than many other angiosperms, a significant fraction of genes (>2 %) are present in large gene families with copy number greater than 20. Papaya sex chromosomes are significantly enriched for a repertoire of repetitive sequences, and the male-specific region expanded by massive accumulation of repeated DNA, representing 83 % (mostly TE), while the corresponding X region included 70 % of such repeats. In an effort to integrate all the information, we provide here the pipeline to gather and process data related to repetitive elements in papaya.
Genetics and Genomics of Papaya, 01/2014: pages 225-240; ISBN: 978-1-4614-8086-0
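The genome fractions quoted in the papaya abstract can be cross-checked with simple arithmetic. The sketch below uses only the percentages stated in the abstract; the variable names are illustrative, and the "other retrotransposons" remainder is a derived figure, not one reported in the source.

```python
import math

# Percent of the papaya genome, as quoted in the abstract.
repeat_classes = {
    "transposable elements": 52.0,
    "tandem repeats": 1.3,
    "high copy number genes": 3.0,
}
tandem_subtypes = {
    "microsatellites": 0.19,
    "minisatellites": 0.68,
    "satellites": 0.43,
}
retrotransposons_total = 40.0
named_retrotransposons = {"Ty3-gypsy": 27.8, "Ty1-copia": 5.5}

# Sum of the three repeat classes; the abstract rounds this to "~56 %".
total_repeats = sum(repeat_classes.values())

# The three tandem-repeat subtypes should add up to the 1.3 % tandem total.
total_tandem = sum(tandem_subtypes.values())

# Derived remainder: retrotransposons that are neither Ty3-gypsy nor Ty1-copia.
other_retrotransposons = retrotransposons_total - sum(named_retrotransposons.values())

print(round(total_repeats, 2), round(total_tandem, 2), round(other_retrotransposons, 2))
```

The class totals come to 56.3 %, consistent with the rounded "~56 %", and the tandem-repeat subtypes reproduce the stated 1.3 %.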
ABSTRACT: Background / Purpose:
High-throughput sequencing datasets, in principle, enable the detection of extremely low frequency variants seen in a given cell population, to study their evolution and impact on phenotypes of interest. The use of ad hoc filters and statistics can however limit the sensitivity and specificity of detection, particularly when multiple samples are compared, as is the case for somatic variant calling and time course studies.
We demonstrate the utility of a systematic framework for variant calling (LoFreq) that simultaneously incorporates sequence quality, mapping quality, alignment quality and source quality information, allowing for single nucleotide variants and indels to be jointly called. Our benchmarking and validation results on real and in silico datasets demonstrate that this approach provides a significant boost in sensitivity over existing variant callers (accurately calling variants at less than 1% frequency), while retaining very high specificity.
Joint 21st Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) and 12th European Conference on Computational Biology (ECCB) 2013; 08/2013
ABSTRACT: Rickettsia prowazekii is a notable intracellular pathogen, the agent of epidemic typhus, and a potential biothreat agent. We present here whole-genome sequence data for four strains of R. prowazekii, including one from a flying squirrel.
ABSTRACT: The high throughput and cost-effectiveness afforded by short-read sequencing technologies, in principle, enable researchers to perform 16S rRNA profiling of complex microbial communities at unprecedented depth and resolution. Existing Illumina sequencing protocols are, however, limited by the fraction of the 16S rRNA gene that is interrogated and therefore limit the resolution and quality of the profiling. To address this, we present the design of a novel protocol for shotgun Illumina sequencing of the bacterial 16S rRNA gene, optimized to amplify more than 90% of sequences in the Greengenes database and with the ability to distinguish nearly twice as many species-level OTUs compared to existing protocols. Using several in silico and experimental datasets, we demonstrate that despite the presence of multiple variable and conserved regions, the resulting shotgun sequences can be used to accurately quantify the constituents of complex microbial communities. The reconstruction of a significant fraction of the 16S rRNA gene also enabled high precision (>90%) in species-level identification thereby opening up potential application of this approach for clinical microbial characterization.
PLoS ONE 04/2013; 8(4):e60811. DOI:10.1371/journal.pone.0060811 · 3.23 Impact Factor
ABSTRACT: Advances in sequencing technologies and increased access to sequencing services have led to renewed interest in sequence and genome assembly. Concurrently, new applications for sequencing have emerged, including gene expression analysis, discovery of genomic variants and metagenomics, and each of these has different needs and challenges in terms of assembly. We survey the theoretical foundations that underlie modern assembly and highlight the options and practical trade-offs that need to be considered, focusing on how individual features address the needs of specific applications. We also review key software and the interplay between experimental design and efficacy of assembly.
ABSTRACT: BACKGROUND: Gastric cancer is the second highest cause of global cancer mortality. To explore the complete repertoire of somatic alterations in gastric cancer, we combined massively-parallel short read and DNA-PET sequencing to present the first whole-genome analysis of two gastric adenocarcinomas, one with chromosomal instability and the other with microsatellite instability. RESULTS: Integrative analysis and de novo assemblies revealed the architecture of a wild-type KRAS amplification, a common driver event in gastric cancer. We discovered three distinct mutational signatures in gastric cancer - against a genome-wide backdrop of oxidative and microsatellite instability-related mutational signatures, we identified the first exome-specific mutational signature. Further characterization of the impact of these signatures by combining sequencing data from 40 complete gastric cancer exomes and targeted screening of an additional 94 independent gastric tumours uncovered ACVR2A, RPL22 and LMAN1 as recurrently mutated genes in microsatellite instability-positive gastric cancer and PAPPA as a recurrently mutated gene in TP53 wild-type gastric cancer. CONCLUSIONS: These results highlight how whole-genome cancer sequencing can uncover information relevant to tissue-specific carcinogenesis that would otherwise be missed from exome-sequencing data.
ABSTRACT: Oranges are an important nutritional source for human health and have immense economic value. Here we present a comprehensive analysis of the draft genome of sweet orange (Citrus sinensis). The assembled sequence covers 87.3% of the estimated orange genome, which is relatively compact, as 20% is composed of repetitive elements. We predicted 29,445 protein-coding genes, half of which are in the heterozygous state. With additional sequencing of two more citrus species and comparative analyses of seven citrus genomes, we present evidence to suggest that sweet orange originated from a backcross hybrid between pummelo and mandarin. Focused analysis on genes involved in vitamin C metabolism showed that GalUR, encoding the rate-limiting enzyme of the galacturonate pathway, is significantly upregulated in orange fruit, and the recent expansion of this gene family may provide a genomic basis. This draft genome represents a valuable resource for understanding and improving many important citrus traits in the future.
ABSTRACT: The study of cell-population heterogeneity in a range of biological systems, from viruses to bacterial isolates to tumor samples, has been transformed by recent advances in sequencing throughput. While the high-coverage afforded can be used, in principle, to identify very rare variants in a population, existing ad hoc approaches frequently fail to distinguish true variants from sequencing errors. We report a method (LoFreq) that models sequencing run-specific error rates to accurately call variants occurring in <0.05% of a population. Using simulated and real datasets (viral, bacterial and human), we show that LoFreq has near-perfect specificity, with significantly improved sensitivity compared with existing methods and can efficiently analyze deep Illumina sequencing datasets without resorting to approximations or heuristics. We also present experimental validation for LoFreq on two different platforms (Fluidigm and Sequenom) and its application to call rare somatic variants from exome sequencing datasets for gastric cancer. Source code and executables for LoFreq are freely available at http://sourceforge.net/projects/lofreq/.
Nucleic Acids Research 10/2012; 40(22). DOI:10.1093/nar/gks918 · 9.11 Impact Factor
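One way to picture how per-base error rates can separate rare variants from sequencing errors is a tail probability over independent, non-identical error events (a Poisson-binomial distribution). The sketch below is an illustration under that assumption, with error probabilities taken from Phred quality scores; it is not LoFreq's code, and the function names, the toy pileup, and the absence of any multiple-testing correction are all simplifications introduced here.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def poisson_binomial_sf(error_probs, k):
    """P(X >= k), where X counts errors among independent bases
    with per-base error probabilities error_probs."""
    if k <= 0:
        return 1.0
    # dp[j] for j < k: probability of exactly j errors so far.
    # dp[k]: probability of k or more errors so far (absorbing state).
    dp = [0.0] * (k + 1)
    dp[0] = 1.0
    for p in error_probs:
        dp[k] += dp[k - 1] * p
        # Descending order so dp[j - 1] still holds the previous step's value.
        for j in range(k - 1, 0, -1):
            dp[j] = dp[j] * (1 - p) + dp[j - 1] * p
        dp[0] *= 1 - p
    return dp[k]


# Hypothetical pileup column: 1000 reads, all at Phred 30 (error prob 0.001).
qualities = [30] * 1000
probs = [phred_to_error(q) for q in qualities]

# Chance of seeing at least 8 non-reference bases from errors alone.
p_eight_or_more = poisson_binomial_sf(probs, 8)
# Chance of seeing at least 1 non-reference base from errors alone.
p_one_or_more = poisson_binomial_sf(probs, 1)
print(p_eight_or_more, p_one_or_more)
```

With about one error expected in this column, a single non-reference base is unremarkable, while eight is very unlikely under the error model alone. Capping the table at `k` keeps the cost proportional to depth times `k` rather than depth squared.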
ABSTRACT: Structural variations (SVs) contribute significantly to the variability of the human genome and extensive genomic rearrangements are a hallmark of cancer. While genomic DNA paired-end-tag (DNA-PET) sequencing is an attractive approach to identify genomic SVs, the current application of PET sequencing with short insert size DNA can be insufficient for the comprehensive mapping of SVs in low complexity and repeat-rich genomic regions. We employed a recently developed procedure to generate PET sequencing data using large DNA inserts of 10-20 kb and compared their characteristics with short insert (1 kb) libraries for their ability to identify SVs. Our results suggest that although short insert libraries bear an advantage in identifying small deletions, they do not provide significantly better breakpoint resolution. In contrast, large inserts are superior to short inserts in providing higher physical genome coverage for the same sequencing cost and achieve greater sensitivity, in practice, for the identification of several classes of SVs, such as copy number neutral and complex events. Furthermore, our results confirm that large insert libraries allow for the identification of SVs within repetitive sequences, which cannot be spanned by short inserts. This provides a key advantage in studying rearrangements in cancer, and we show how it can be used in a fusion-point-guided-concatenation algorithm to study focally amplified regions in cancer.
PLoS ONE 09/2012; 7(9):e46152. DOI:10.1371/journal.pone.0046152 · 3.23 Impact Factor