ABSTRACT: Colorectal cancer with metastases limited to the liver (liver-limited mCRC) is a distinct clinical subset characterized by possible cure with surgery. We performed high-depth sequencing of over 750 cancer-associated genes and copy number profiling in matched primary, metastasis and normal tissues to characterize genomic progression in 18 patients with liver-limited mCRC.
High-depth Illumina sequencing and the use of three different variant callers enable comprehensive and accurate identification of somatic variants down to 2.5% variant allele frequency. We identify a median of 11 somatic single nucleotide variants (SNVs) per tumor. Across patients, a median of 79.3% of somatic SNVs present in the primary are present in the metastasis and 81.7% of all alterations present in the metastasis are present in the primary. Private alterations are found at lower allele frequencies; different mutational signatures characterize shared and private variants, suggesting distinct mutational processes. Using B-allele frequencies of heterozygous germline SNPs and copy number profiling, we find that broad regions of allelic imbalance and focal copy number changes, respectively, are generally shared between the primary tumor and metastasis.
Our analyses point to high genomic concordance of primary tumor and metastasis, with a thick common trunk and smaller genomic branches in general support of the linear progression model in most patients with liver-limited mCRC. More extensive studies are warranted to further characterize genomic progression in this important clinical population.
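The two concordance percentages quoted in this abstract are directional overlaps between the variant sets of a primary tumor and its metastasis. A minimal sketch of that arithmetic follows; the function name, the variant key format and the example calls are hypothetical illustrations, not the authors' pipeline or data.

```python
from typing import Set, Tuple

# A variant is keyed here as (chromosome, position, ref, alt); this key
# format is an assumption made for the illustration.
Variant = Tuple[str, int, str, str]


def concordance(primary: Set[Variant], metastasis: Set[Variant]) -> Tuple[float, float]:
    """Return (% of primary variants also in metastasis,
               % of metastasis variants also in primary)."""
    shared = primary & metastasis
    pct_primary = 100.0 * len(shared) / len(primary) if primary else 0.0
    pct_metastasis = 100.0 * len(shared) / len(metastasis) if metastasis else 0.0
    return pct_primary, pct_metastasis


# Hypothetical calls for one patient: three shared variants plus one
# private variant in each sample.
primary = {
    ("chr5", 1000, "C", "T"),
    ("chr12", 2000, "C", "A"),
    ("chr17", 3000, "G", "A"),
    ("chr3", 4000, "G", "A"),   # private to the primary
}
metastasis = {
    ("chr5", 1000, "C", "T"),
    ("chr12", 2000, "C", "A"),
    ("chr17", 3000, "G", "A"),
    ("chr18", 5000, "T", "C"),  # private to the metastasis
}

pct_primary, pct_metastasis = concordance(primary, metastasis)
print(f"{pct_primary:.1f}% of primary SNVs are present in the metastasis")
print(f"{pct_metastasis:.1f}% of metastasis SNVs are present in the primary")
```

Per-patient values computed this way would then be summarized across the cohort, for example with `statistics.median`, to obtain medians of the kind reported above.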
ABSTRACT: Background: Next-generation sequencing (NGS) technologies have changed our understanding of the variability of the human genome. However, the identification of genome structural variations based on NGS approaches with read lengths of 35-300 bases remains a challenge. Single-molecule optical mapping technologies allow the analysis of DNA molecules of up to 2 Mb and as such are suitable for the identification of large-scale genome structural variations, and for de novo genome assemblies when combined with short-read NGS data. Here we present optical mapping data for two human genomes: the HapMap cell line GM12878 and the colorectal cancer cell line HCT116. Findings: High molecular weight DNA was obtained by embedding GM12878 and HCT116 cells, respectively, in agarose plugs, followed by DNA extraction under mild conditions. Genomic DNA was digested with KpnI and 310,000 and 296,000 DNA molecules (>= 150 kb and 10 restriction fragments), respectively, were analyzed per cell line using the Argus optical mapping system. Maps were aligned to the human reference by OPTIMA, a new glocal alignment method. Genome coverage of 6.8x and 5.7x was obtained, respectively; 2.9x and 1.7x more than the coverage obtained with previously available software. Conclusions: Optical mapping allows the resolution of large-scale structural variations of the genome, and the scaffold extension of NGS-based de novo assemblies. OPTIMA is an efficient new alignment method; our optical mapping data provide a resource for genome structure analyses of the human HapMap reference cell line GM12878, and the colorectal cancer cell line HCT116.
ABSTRACT: Malassezia is a unique lipophilic genus in class Malasseziomycetes in Ustilaginomycotina (Basidiomycota, fungi) that otherwise consists almost exclusively of plant pathogens. Malassezia are typically isolated from warm-blooded animals, are dominant members of the human skin mycobiome and are associated with common skin disorders. To characterize the genetic basis of the unique phenotypes of Malassezia spp., we sequenced the genomes of all 14 accepted species and used comparative genomics against a broad panel of fungal genomes to comprehensively identify distinct features that define the Malassezia gene repertoire: gene gain and loss; selection signatures; and lineage-specific gene family expansions. Our analysis revealed key gene gain events (64) with a single gene conserved across all Malassezia but absent in all other sequenced Basidiomycota. These likely horizontally transferred genes provide intriguing gain-of-function events and prime candidates to explain the emergence of Malassezia. A larger set of genes (741) were lost, with enrichment for glycosyl hydrolases and carbohydrate metabolism, concordant with adaptation to skin's carbohydrate-deficient environment. Gene family analysis revealed extensive turnover and underlined the importance of secretory lipases, phospholipases, aspartyl proteases, and other peptidases. Combining genomic analysis with a re-evaluation of culture characteristics, we establish the likely lipid-dependence of all Malassezia. Our phylogenetic analysis sheds new light on the relationship between Malassezia and other members of Ustilaginomycotina, as well as phylogenetic lineages within the genus. Overall, our study provides a unique genomic resource for understanding Malassezia niche-specificity and potential virulence, as well as their abundance and distribution in the environment and on human skin.
ABSTRACT: Human respiratory syncytial virus (RSV) is the major cause of lower respiratory tract infections in children under two years of age. Little is known about RSV intra-host genetic diversity over the course of infection, or about the immune pressures that drive RSV molecular evolution. We performed whole-genome deep sequencing on 53 RSV-positive samples (37 RSV subgroup A and 16 subgroup B) collected from the upper airways of hospitalized children in southern Vietnam over two consecutive seasons. RSV A NA1 and RSV B BA9 were the predominant genotypes found in our samples, consistent with other reports on global RSV circulation during the same period. For both RSV A and B, the M gene was the most conserved, confirming its potential as a target for novel therapeutics. The G gene was the most variable and was the only gene under detectable positive selection. Further, positively-selected sites in G were found in close proximity to and in some cases overlapped with predicted glycosylation motifs, suggesting that selection on amino acid glycosylation may drive viral genetic diversity. We further identified hotspots and coldspots of intra-host genetic diversity in the RSV genome, some of which may highlight previously unknown regions of functional importance.
Full-text · Article · Sep 2015 · Journal of General Virology
ABSTRACT: Dengue viruses (DENV) cause debilitating and potentially life-threatening acute disease throughout the tropical world. While drug development efforts are underway, there are concerns that resistant strains will emerge rapidly. Indeed, antiviral drugs that target even conserved regions in other RNA viruses lose efficacy over time as the virus mutates. Here, we sought to determine if there are regions in the DENV genome that are not only evolutionarily conserved but genetically constrained in their ability to mutate and could hence serve as better antiviral targets. High-throughput sequencing of the DENV-1 genome directly from twelve paired dengue patients' sera, followed by passaging of these sera into the two primary mosquito vectors, showed consistent and distinct sequence changes during infection. In particular, two residues in the NS5 protein coding sequence appear to be specifically acquired during infection in Ae. aegypti but not Ae. albopictus. Importantly, we identified a region within the NS3 protein coding sequence that is refractory to mutation during human and mosquito infection. Collectively, these findings provide fresh insights into antiviral targets and could serve as an approach to defining evolutionarily constrained regions for therapeutic targeting in other RNA viruses.
ABSTRACT: Dengue virus (DENV) infection of an individual human or mosquito host produces a dynamic population of closely-related sequences. This intra-host genetic diversity is thought to offer an advantage for arboviruses to adapt as they cycle between two very different host species, but it remains poorly characterized. To track changes in viral intra-host genetic diversity during horizontal transmission, we infected Aedes aegypti mosquitoes by allowing them to feed on DENV2-infected patients. We then performed whole-genome deep-sequencing of human- and matched mosquito-derived DENV samples on the Illumina platform and used a sensitive variant-caller to detect single nucleotide variants (SNVs) within each sample. More than 90% of SNVs were lost upon transition from human to mosquito, as well as from mosquito abdomen to salivary glands. Levels of viral diversity were maintained, however, by the regeneration of new SNVs at each stage of transmission. We further show that SNVs maintained across transmission stages were transmitted as a unit of two at maximum, suggesting the presence of numerous variant genomes carrying only one or two SNVs each. We also present evidence for differences in selection pressures between human and mosquito hosts, particularly on the structural and NS1 genes. This analysis provides insights into how population drops during transmission shape RNA virus genetic diversity, has direct implications for virus evolution, and illustrates the value of high-coverage, whole-genome next-generation sequencing for understanding viral intra-host genetic diversity.
ABSTRACT: Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and mapping technologies (e.g. optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kbp–2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging due to the lack of efficient and freely available software for robustly aligning maps to sequences. Here we introduce two new map-to-sequence alignment algorithms that efficiently and accurately align high-throughput mapping datasets to large, eukaryotic genomes while accounting for high error rates. In order to do so, these methods (OPTIMA for glocal and OPTIMA-Overlap for overlap alignment) exploit the ability to create efficient data structures that index continuous-valued mapping data while accounting for errors. We also introduce an approach for evaluating the significance of alignments that avoids expensive permutation-based tests while being agnostic to technology-dependent error rates. Our benchmarking results suggest that OPTIMA and OPTIMA-Overlap outperform state-of-the-art approaches in sensitivity (1.6–2× improvement) while simultaneously being more efficient (170–200%) and precise in their alignments (99% precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust and provide improved sensitivity while guaranteeing high precision.
ABSTRACT: High-throughput assays, such as RNA-seq, to detect differential abundance are widely used. Variable performance across statistical tests, normalizations, and conditions leads to resource wastage and reduced sensitivity. EDDA represents a first, general design tool for RNA-seq, Nanostring, and metagenomic analysis, that rationally selects tests, predicts performance, and plans experiments to minimize resource wastage. Case studies highlight EDDA’s ability to model single-cell RNA-seq, suggesting ways to reduce sequencing costs up to five-fold and improving metagenomic biomarker detection through improved test selection. EDDA’s novel mode-based normalization for detecting differential abundance improves robustness by 10% to 20% and precision by up to 140%.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0527-7) contains supplementary material, which is available to authorized users.
ABSTRACT: We present a method for obtaining long haplotypes, of over 3 kb in length, using a short-read sequencer, Barcode-directed Assembly for Extra-long Sequences (BAsE-Seq). BAsE-Seq relies on transposing a template-specific barcode onto random segments of the template molecule and assembling the barcoded short reads into complete haplotypes. We applied BAsE-Seq on mixed clones of hepatitis B virus and accurately identified haplotypes occurring at frequencies greater than or equal to 0.4%, with >99.9% specificity. Applying BAsE-Seq to a clinical sample, we obtained over 9,000 viral haplotypes, which provided an unprecedented view of hepatitis B virus population structure during chronic infection. BAsE-Seq is readily applicable for monitoring quasispecies evolution in viral diseases.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0517-9) contains supplementary material, which is available to authorized users.
ABSTRACT: Dehalococcoides mccartyi strain SG1, isolated from digester sludge, dechlorinates polychlorinated biphenyls (PCBs) to lower congeners. Here we report the draft genome sequence of SG1, which carries a 22.65 kbp circular putative plasmid.
ABSTRACT: Chromosomal structural variations play an important role in determining the transcriptional landscape of human breast cancers. To assess the nature of these structural variations, we analyzed eight breast tumor samples with a focus on regions of gene amplification using mate-pair sequencing of long-insert genomic DNA with matched transcriptome profiling. We found that tandem duplications appear to be early events in tumor evolution, especially in the genesis of amplicons. In a detailed reconstruction of events on chromosome 17, we found that large unpaired inversions and deletions connect a tandemly duplicated ERBB2 with neighboring 17q21.3 amplicons while simultaneously deleting the intervening BRCA1 tumor suppressor locus. This series of events appeared to be unusually common when examined in larger genomic data sets of breast cancers, albeit using approaches with lower resolution. Using siRNAs in breast cancer cell lines, we showed that the 17q21.3 amplicon harbored a significant number of weak oncogenes that appeared consistently coamplified in primary tumors. Down-regulation of BRCA1 expression augmented cell proliferation in ERBB2-transfected human normal mammary epithelial cells. Coamplification of other functionally tested oncogenic elements in other breast tumors examined, such as RIPK2 and MYC on chromosome 8, also parallels these findings. Our analyses suggest that structural variations efficiently orchestrate the gain and loss of cancer gene cassettes that engage many oncogenic pathways simultaneously and that such oncogenic cassettes are favored during the evolution of a cancer.
ABSTRACT: Fastidious anaerobic bacteria play critical roles in environmental bioremediation of halogenated compounds. However, their characterization and application have been largely impeded by difficulties in growing them in pure culture. Thus far, no pure culture has been reported to respire on the notorious polychlorinated biphenyls (PCBs), and functional genes responsible for PCB detoxification remain unknown due to the extremely slow growth of PCB-respiring bacteria. Here we report the successful isolation and characterization of three Dehalococcoides mccartyi strains that respire on commercial PCBs. Using high-throughput metagenomic analysis, combined with traditional culture techniques, tetrachloroethene (PCE) was identified as a feasible alternative to PCBs to isolate PCB-respiring Dehalococcoides from PCB-enriched cultures. With PCE as an alternative electron acceptor, the PCB-respiring Dehalococcoides were boosted to a higher cell density (1.2 × 10⁸ to 1.3 × 10⁸ cells per mL on PCE vs. 5.9 × 10⁶ to 10.4 × 10⁶ cells per mL on PCBs) with a shorter culturing time (30 d on PCE vs. 150 d on PCBs). The transcriptomic profiles illustrated that the distinct PCB dechlorination profile of each strain was predominantly mediated by a single, novel reductive dehalogenase (RDase) catalyzing chlorine removal from both PCBs and PCE. The transcription levels of PCB-RDase genes are 5-60 times higher than the genome-wide average. The cultivation of PCB-respiring Dehalococcoides in pure culture and the identification of PCB-RDase genes deepen our understanding of organohalide respiration of PCBs and shed light on in situ PCB bioremediation.
Full-text · Article · Jul 2014 · Proceedings of the National Academy of Sciences
ABSTRACT: Opisthorchiasis is a neglected, tropical disease caused by the carcinogenic Asian liver fluke, Opisthorchis viverrini. This hepatobiliary disease is linked to malignant cancer (cholangiocarcinoma, CCA) and affects millions of people in Asia. No vaccine is available, and only one drug (praziquantel) is used against the parasite. Little is known about O. viverrini biology and the diseases that it causes. Here we characterize the draft genome (634.5 Mb) and transcriptomes of O. viverrini, elucidate how this fluke survives in the hostile environment within the bile duct and show that metabolic pathways in the parasite are highly adapted to a lipid-rich diet from bile and/or cholangiocytes. We also provide additional evidence that O. viverrini and other flukes secrete proteins that directly modulate host cell proliferation. Our molecular resources now underpin profound explorations of opisthorchiasis/CCA and the design of new interventions.
ABSTRACT: RNA viruses are notorious for their ability to quickly adapt to selective pressure from the host immune system and/or antivirals. This adaptability is likely due to the error-prone characteristics of their RNA-dependent RNA polymerase [1, 2]. Dengue virus, a member of the Flaviviridae family of positive-strand RNA viruses, is also known to share these error-prone characteristics. Utilizing high-throughput, massively parallel sequencing methodologies, or next-generation sequencing (NGS), we can now accurately quantify these populations of viruses and track the changes to these populations over the course of a single infection. The aim of this chapter is twofold: to describe the methodologies required for sample preparation prior to sequencing and to describe the bioinformatics analyses required for the resulting data.
ABSTRACT: In this chapter, we report a detailed analysis of repetitive elements in the papaya genome, including transposable elements (TEs), tandemly arrayed sequences, and high copy number genes. These repetitive sequences account for ~56 % of the papaya genome, with TEs being the most abundant at 52 %, tandem repeats at 1.3 %, and high copy number genes at 3 %. Most common types of TEs are represented in the papaya genome with retrotransposons being the dominant class, accounting for 40 % of the genome. The most prevalent retrotransposons are Ty3–gypsy (27.8 %) and Ty1–copia (5.5 %). Among the tandem repeats, microsatellites are the most abundant in number but represent only 0.19 % of the genome. Minisatellites and satellites are less abundant but represent 0.68 and 0.43 % of the genome, respectively, due to greater repeat length. Despite an overall smaller gene repertoire in papaya than in many other angiosperms, a significant fraction of genes (>2 %) are present in large gene families with copy number greater than 20. Papaya sex chromosomes are significantly enriched in repetitive sequences, and the male-specific region has expanded through massive accumulation of repeated DNA, which represents 83 % of the region (mostly TEs), while the corresponding X region includes 70 % of such repeats. In an effort to integrate all the information, we provide here the pipeline to gather and process data related to repetitive elements in papaya.
ABSTRACT: Background / Purpose:
High-throughput sequencing datasets, in principle, enable the detection of extremely low-frequency variants in a given cell population, making it possible to study their evolution and impact on phenotypes of interest. The use of ad hoc filters and statistics can, however, limit the sensitivity and specificity of detection, particularly when multiple samples are compared, as is the case for somatic variant calling and time course studies.
We demonstrate the utility of a systematic framework for variant calling (LoFreq) that simultaneously incorporates sequence quality, mapping quality, alignment quality and source quality information, allowing for single nucleotide variants and indels to be jointly called. Our benchmarking and validation results on real and in silico datasets demonstrate that this approach provides a significant boost in sensitivity over existing variant callers (accurately calling variants at less than 1% frequency), while retaining very high specificity.