Article

ContEst: Estimating cross-contamination of human samples in next-generation sequencing data

Abstract

Here, we present ContEst, a tool for estimating the level of cross-individual contamination in next-generation sequencing data. We demonstrate the accuracy of ContEst across a range of contamination levels, sources and read depths using sequencing data mixed in silico at known concentrations. We applied our tool to published cancer sequencing datasets and report their estimated contamination levels. ContEst is a GATK module and is distributed under a BSD-style license at http://www.broadinstitute.org/cancer/cga/contest. Contact: kcibul@broadinstitute.org; gadgetz@broadinstitute.org. Supplementary data are available at Bioinformatics online.

... Given the shared components of many plasmids, i.e., the origin of replication or lentivirus components, we reasoned that previously developed SNV-based contamination estimation methods would be ill-suited for the task (refs 22,35). Instead, we used a simple Bayesian model to leverage the proportion of unaligned bases ("clipped" bases) as well as unmapped reads during sequence alignment, taking into account the shared nature of plasmid components (Figure 3A). ...
... Both aligned and unaligned reads are then assessed using a custom Python script. Our Bayesian formulation assumes a flat prior across equally spaced, discrete contamination level hypotheses, much like our approach in ref 22: ...
Article
Full-text available
Recombinant DNA is a fundamental tool in biotechnology and medicine. These DNA sequences are often built, replicated, and delivered in the form of plasmids. Validation of these plasmid sequences is a critical and time-consuming step, which has been dominated for the last 35 years by Sanger sequencing. As plasmid sequences grow more complex with new DNA synthesis and cloning techniques, we need new approaches that address the corresponding validation challenges at scale. Here we prototype a high-throughput plasmid sequencing approach using DNA transposition and Oxford Nanopore sequencing. Our method, Circuit-seq, creates robust, full-length, and accurate plasmid assemblies without prior knowledge of the underlying sequence. We demonstrate the power of Circuit-seq across a wide range of plasmid sizes and complexities, generating full-length, contiguous plasmid maps. We then leverage our long-read data to characterize epigenetic marks and estimate plasmid contamination levels. Circuit-seq scales to large numbers of samples at a lower per-sample cost than commercial Sanger sequencing, accelerating a key step in synthetic biology, while low equipment costs make it practical for individual laboratories.
... wondered if we could leverage this read depth to estimate potential contamination levels. Given the shared components of many plasmids, i.e. the origin of replication or lentivirus components, we reasoned that previously developed SNV-based contamination estimation methods would be ill-suited for the task (Cibulskis et al. 2011; Jun et al. 2012). Instead, we used a simple Bayesian model to leverage the proportion of unaligned bases ('clipped' bases) as well as unmapped reads during sequence alignment, taking into account the shared nature of plasmid components (fig. ...
... Both aligned and unaligned reads are then assessed using a custom Python script. Our Bayesian formulation assumes a flat prior across equally spaced, discrete contamination level hypotheses, much like our approach in Cibulskis and McKenna et al. (Cibulskis et al. 2011), where the prior is flat and the denominator is the same over all contamination levels. We then need to evaluate the likelihood function. ...
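The flat-prior formulation described in this excerpt can be written out as a short worked equation. This is only a sketch: the symbols $c_i$ (the $i$-th contamination level hypothesis), $K$ (the number of equally spaced hypotheses) and $D$ (the observed read data) are notation assumed here for illustration and are not taken from the excerpt.

```latex
% Posterior over K discrete contamination hypotheses c_1, ..., c_K
% (assumed notation; D denotes the observed read data).
\[
  P(c_i \mid D)
    = \frac{P(D \mid c_i)\, P(c_i)}{\sum_{j=1}^{K} P(D \mid c_j)\, P(c_j)},
  \qquad
  P(c_i) = \frac{1}{K}.
\]
% With a flat prior the factor 1/K cancels, and the denominator does not
% depend on i, so the posterior is proportional to the likelihood:
\[
  P(c_i \mid D) \propto P(D \mid c_i).
\]
```

Under these assumptions, ranking the hypotheses only requires evaluating the likelihood at each level, which is why the excerpt turns to the likelihood function next.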
Preprint
Full-text available
Recombinant DNA is a fundamental tool in biotechnology and medicine. Validation of the resulting plasmid sequence is a critical and time-consuming step, which has been dominated for the last 35 years by Sanger sequencing. As plasmid sequences grow more complex with new DNA synthesis and cloning techniques, we need new approaches that address the corresponding validation challenges at scale. Here we prototype a high-throughput plasmid sequencing approach using DNA transposition and Oxford Nanopore sequencing. Our method, Circuit-seq, creates robust, full-length, and accurate plasmid assemblies without prior knowledge of the underlying sequence for approximately $1.50 per plasmid. We demonstrate the power of Circuit-seq across a wide range of plasmid sizes and complexities, generating accurate and contiguous plasmid maps. We then leverage our long-read data to characterize epigenetic marks and estimate plasmid contamination levels. Circuit-seq scales to large numbers of samples at a lower cost than commercial Sanger sequencing, accelerating a key step in synthetic biology, while low startup costs make it practical for individual laboratories.
... One of the main problems in sequencing is cross-sample contamination [7], which may be classified into three main classes: cross-individual, within-individual and cross-species. Cross-individual means that genetic material from one individual is contaminated by material from another individual. ...
... The last value may be disturbed by some genome duplication, sequencing errors, etc., but it should still be around 0.5. Other methods analyse the sample genotyped data at known, unique homozygous regions [7]. In our case, this particular analysis was on the side of our partner. ...
Article
Full-text available
Implementing a large genomic project is a demanding task, also from the computer science point of view. Besides collecting many genome samples and sequencing them, there is processing of a huge amount of data at every stage of their production and analysis. Efficient transfer and storage of the data is also an important issue. During the execution of such a project, there is a need to maintain work standards and control quality of the results, which can be difficult if a part of the work is carried out externally. Here, we describe our experience with such data quality analysis on a number of levels - from an obvious check of the quality of the results obtained, to examining consistency of the data at various stages of their processing, to verifying, as far as possible, their compatibility with the data describing the sample.
... In contrast, detecting same-species or within-species contamination is more challenging, and there are few valid, robust approaches. The most commonly implemented approach and the earliest developed is ContEst [12], a module in the Genome Analysis ToolKit (GATK) software [13]. ContEst uses a Bayesian method to calculate the posterior probability of a specific contamination level and find the maximum a posteriori probability (MAP) estimate of the contamination level at homozygous loci. ...
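The MAP estimate described in this excerpt can be illustrated with a small grid-based sketch. This is a deliberately simplified model and not ContEst's actual implementation: it assumes one fixed sequencing error rate rather than per-base qualities, and it treats every site as homozygous reference in the sample. The function name `contamination_posterior` and its parameters are hypothetical.

```python
import math


def contamination_posterior(sites, error_rate=0.001, n_levels=101):
    """Posterior over equally spaced contamination levels with a flat prior.

    sites: list of (alt_reads, depth, pop_alt_freq) tuples for loci where the
    sample is assumed homozygous reference. error_rate must lie strictly
    between 0 and 0.5 so that every logarithm below is defined.
    """
    levels = [i / (n_levels - 1) for i in range(n_levels)]
    log_liks = []
    for c in levels:
        ll = 0.0
        for alt, depth, freq in sites:
            # Chance that a read from the contaminating DNA shows the alt allele.
            p_contam_alt = freq * (1 - error_rate) + (1 - freq) * error_rate
            # Mixture of true sample reads (alt only by error) and contaminant reads.
            p_alt = (1 - c) * error_rate + c * p_contam_alt
            ll += alt * math.log(p_alt) + (depth - alt) * math.log(1 - p_alt)
        log_liks.append(ll)
    # Flat prior: the posterior is the normalized likelihood.
    peak = max(log_liks)
    weights = [math.exp(ll - peak) for ll in log_liks]
    total = sum(weights)
    posterior = [w / total for w in weights]
    map_level = levels[posterior.index(max(posterior))]
    return levels, posterior, map_level


# Illustrative input: five sites at depth 1000 with 25 unexpected alt reads each.
example_sites = [(25, 1000, 0.5)] * 5
_, _, estimate = contamination_posterior(example_sites)
print("MAP contamination level:", estimate)
```

Subtracting the peak log-likelihood before exponentiating keeps the normalization numerically stable when many sites are combined.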
... Grid search is used to explore the two-dimensional parameter space. The grid points are chosen on an exponential scale of (2⁻⁴, 2¹² ...
Preprint
Full-text available
Background: Same-species contamination detection is an important quality control step in genetic data analysis. Due to a scarcity of methods to detect and correct for this quality control issue, same-species contamination is more difficult to detect than cross-species contamination. We introduce a novel machine learning algorithm to detect same-species contamination in next-generation sequencing (NGS) data using a support vector machine (SVM) model. Our approach uniquely detects contamination using variant calling information stored in variant call format (VCF) files for DNA or RNA. Importantly, it can differentiate between same-species contamination and mixtures of tumor and normal cells. In the first stage, a change-point detection method is used to identify copy number variations (CNVs) and copy number aberrations (CNAs) for filtering. Next, single nucleotide polymorphism (SNP) data is used to test for same-species contamination using an SVM model. Based on the assumption that alternative allele frequencies in NGS follow the beta-binomial distribution, the deviation parameter ρ is estimated by the maximum likelihood method. All features of a radial basis function (RBF) kernel SVM are generated using publicly available or private training data. Results: We demonstrate our approach in simulation experiments. The datasets combine, in silico, exome sequencing data of DNA from two lymphoblastoid cell lines (NA12878 and NA10855). We generate VCF files using variants identified in these data and then evaluate the power and false-positive rate of our approach. Our approach can detect contamination levels as low as 5% with a reasonable false-positive rate. Results in real data have sensitivity above 99.99% and specificity of 90.24%, even in the presence of degraded samples with similar features as contaminated samples. We provide an R software implementation of our approach. 
Conclusions: Our approach addresses the gap in methods to test for same-species contamination in NGS. Due to its high sensitivity for degraded samples and tumor-normal samples, it represents an important tool that can be applied within the quality control process. Additionally, the user-friendly software has the unique ability to conduct quality control using the VCF format.
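The abstract above states that alternative allele counts are modelled as beta-binomial and that the deviation parameter ρ is estimated by maximum likelihood. A minimal sketch of that idea follows. It assumes the common mean/overdispersion parametrization (α = μ(1−ρ)/ρ, β = (1−μ)(1−ρ)/ρ), which the abstract does not confirm, and it swaps in a simple log-spaced grid search for whatever optimizer the authors use. All function names are hypothetical.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function, computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, mu, rho):
    """Log-probability of k alt reads out of n under a beta-binomial model.

    mu is the expected alt allele fraction and rho (0 < rho < 1) is the
    overdispersion; this parametrization is an assumption of the sketch.
    """
    alpha = mu * (1 - rho) / rho
    beta = (1 - mu) * (1 - rho) / rho
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def estimate_rho(counts, mu=0.5, grid_size=200, rho_min=1e-4, rho_max=0.5):
    """Grid-search maximum likelihood estimate of rho.

    counts: list of (alt_reads, depth) pairs, e.g. at heterozygous SNPs.
    """
    best_rho, best_ll = rho_min, -math.inf
    for i in range(grid_size):
        # Log-spaced grid between rho_min and rho_max.
        rho = rho_min * (rho_max / rho_min) ** (i / (grid_size - 1))
        ll = sum(betabinom_logpmf(k, n, mu, rho) for k, n in counts)
        if ll > best_ll:
            best_rho, best_ll = rho, ll
    return best_rho


# Illustrative input: allele counts scattered widely around 50%.
noisy = [(20, 100), (80, 100), (50, 100), (10, 100), (90, 100)] * 4
print("estimated rho:", estimate_rho(noisy))
```

A larger estimated ρ means the allele fractions scatter more than binomial sampling alone would explain. How such a value becomes an SVM feature is not specified in the abstract and is not modelled here.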
... Tumors from TCGA had undergone WES or WGS, as described 25 . Analysis proceeded from paired aligned bam files, which were inputted into a standard WES somatic variant-calling pipeline, including MuTect for calling somatic SNVs 64 , Strelka for calling small insertions and deletions 65 , deTiN for estimating tumor-in-normal contamination 66 , ContEst for estimating cross-participant contamination 67 , AllelicCapSeg for calling allelic copy number variants 68 and ABSOLUTE for estimating tumor purity, ploidy, CCFs and absolute allelic copy number 68 . Artifactual variants were filtered out using a token panel of normals filter, a blat filter and an oxoG filter. ...
Article
Full-text available
Analysis of premalignant tissue has identified the typical order of somatic events leading to invasive tumors in several cancer types. For other cancers, premalignant tissue is unobtainable, leaving genetic progression unknown. Here, we demonstrate how to infer progression from exome sequencing of primary tumors. Our computational method, PhylogicNDT, recapitulated the previous experimentally determined genetic progression of human papillomavirus-negative (HPV–) head and neck squamous cell carcinoma (HNSCC). We then evaluated HPV⁺ HNSCC, which lacks premalignant tissue, and uncovered its previously unknown progression, identifying early drivers. We converted relative timing estimates of driver mutations and HPV integration to years before diagnosis based on a clock-like mutational signature. We associated the timing of transitions to aneuploidy with increased intratumor genetic heterogeneity and shorter overall survival. Our approach can establish previously unknown early genetic progression of cancers with unobtainable premalignant tissue, supporting development of experimental models and methods for early detection, interception and prognostication.
... GATK3 was used to process the resulting BAM files for mapping correction and base quality score recalibration (Van der Auwera et al., 2013). We used ContEst (Broad Institute, contamination rate <0.02) to estimate cross-sample contamination (Cibulskis et al., 2011). We used Mutect (Cibulskis et al., 2013) and Scalpel (Fang et al., 2016) to call somatic single-nucleotide variants and insertions/deletions. ...
Article
Full-text available
Homologous recombination deficiency (HRD) is a critical feature guiding drug and treatment selection, mainly for ovarian and breast cancers. As it cannot be directly observed, HRD status is estimated on a small set of genomic instability features from sequencing data. The existing methods often perform poorly when handling targeted panel sequencing data; however, the targeted panel is the most popular sequencing strategy in clinical practice. Thus, we proposed HRD-MILN to overcome the computational challenges from targeted panel sequencing. HRD-MILN incorporates a multi-instance learning framework to discover and cluster as many loss-of-heterozygosity (LOH) events associated with HRD status as possible. Then the HRD score is obtained based on the association between the LOHs and the cluster in the sample to be estimated, and finally, the HRD status is estimated based on the score. In comparison experiments on targeted panel sequencing data, the precision of HRD-MILN could achieve 87%, significantly improved from 63% reported by the existing methods, where the highest margin of improvement reached 14%. It also presented advantages on whole exome sequencing data. To the best of our knowledge, HRD-MILN is the first practical tool for estimating HRD status from targeted panel sequencing data and could benefit clinical applications.
... 80 ContEst was applied to measure the amount of cross-sample contamination in samples; samples with contamination >0.04 were excluded. 81 The Picard task CrossCheckFingerprints was applied to determine sample mixups; samples with Fingerprints LOD value <0 were excluded. 86 Two FFPE samples that failed sequence processing and were noted to have extensive segment fragmentation and allelic imbalance were also excluded due to suspicion of poor sequencing. ...
Article
Full-text available
Molecular profiling studies have enabled discoveries for metastatic prostate cancer (MPC) but have predominantly occurred in academic medical institutions and involved non-representative patient populations. We established the Metastatic Prostate Cancer Project (MPCproject, mpcproject.org), a patient-partnered initiative to involve patients with MPC living anywhere in the US and Canada in molecular research. Here, we present results from our partnership with the first 706 MPCproject participants. While 41% of patient partners live in rural, physician-shortage, or medically underserved areas, the MPCproject has not yet achieved racial diversity, a disparity that demands new initiatives detailed herein. Among molecular data from 333 patient partners (572 samples), exome sequencing of 63 tumor and 19 cell-free DNA (cfDNA) samples recapitulated known findings in MPC, while inexpensive ultra-low-coverage sequencing of 318 cfDNA samples revealed clinically relevant AR amplifications. This study illustrates the power of a growing, longitudinal partnership with patients to generate a more representative understanding of MPC.
... Before variant calling, the impact of oxidative damage (oxoG) to DNA during sequencing was quantified using DeToxoG 68 . The cross-sample contamination was measured with ContEst based on the allele fraction of homozygous single-nucleotide polymorphisms 69 , and this measurement was used in the downstream mutation calling pipeline. From the aligned BAM files, somatic alterations were identified using a set of tools developed at the Broad Institute (www.broadinstitute.org/cancer/cga). ...
Article
Full-text available
Recent advances in cancer characterization have consistently revealed marked heterogeneity, impeding the completion of integrated molecular and clinical maps for each malignancy. Here, we focus on chronic lymphocytic leukemia (CLL), a B cell neoplasm with variable natural history that is conventionally categorized into two subtypes distinguished by extent of somatic mutations in the heavy-chain variable region of immunoglobulin genes (IGHV). To build the ‘CLL map,’ we integrated genomic, transcriptomic and epigenomic data from 1,148 patients. We identified 202 candidate genetic drivers of CLL (109 new) and refined the characterization of IGHV subtypes, which revealed distinct genomic landscapes and leukemogenic trajectories. Discovery of new gene expression subtypes further subcategorized this neoplasm and proved to be independent prognostic factors. Clinical outcomes were associated with a combination of genetic, epigenetic and gene expression features, further advancing our prognostic paradigm. Overall, this work reveals fresh insights into CLL oncogenesis and prognostication.
... TCGA DNA BAM files were aligned to the NCBI Human Reference Genome Build GRCh37 (hg19). Sample contamination by DNA originating from a different individual was assessed using ContEst 53 . Somatic single nucleotide variations (sSNVs) were then detected using MuTect 19 . ...
Article
Full-text available
Detection of somatic mutations using patient sequencing data has many clinical applications, including the identification of cancer driver genes, detection of mutational signatures, and estimation of tumor mutational burden (TMB). We have previously developed a tool for detection of somatic mutations using tumor RNA and matched-normal DNA. Here, we further extend it to detect somatic mutations from RNA sequencing data without a matched-normal sample. This is accomplished via a machine-learning approach that classifies mutations as either somatic or germline based on various features. When applied to RNA sequencing of >450 melanoma samples, high precision and recall are achieved, and both mutational signatures and driver genes are correctly identified. Finally, we show that RNA-based TMB is significantly associated with patient survival, showing a similar or higher significance level compared with DNA-based TMB. Our pipeline can be utilized in many future applications, analyzing novel and existing datasets where only RNA is available.
... Sehn et al. found that nine of 296 (3%) clinical NGS cases showed cross-contamination of approximately 5% of DNA extracted from FFPE blocks, which was derived from other patients using a read haplotype-based approach [7]. Similar results were obtained from 230 cases using an NGS-based multiplex gene panel test, in which cross-contamination was detected in 3.9% of FFPE blocks using the ContEst program [8,9]. In a report by the Japanese Ministry of Health, Labour and Welfare, 6 of 104 cases (5.8%) were detected with more than 1% contamination in the Trial of Onco-Panel for Gene-profiling to Estimate both Adverse Events and Response (TOP-GEAR) project. ...
Article
Full-text available
Formalin-fixed paraffin-embedded (FFPE) blocks are used as biomaterials for next-generation sequencing of cancer panels. Cross-contamination is detected in approximately 5% of the DNA extracted from FFPE samples, which reduces the detection rate of genetic abnormalities. There are no effective methods available for processing FFPE blocks that prevent cells from mixing with other specimens. The present study evaluated 897 sheets that could potentially prevent cell transmission but allow for the movement of various solvents used in FFPE blocks. According to the International Organization for Standardization and Japanese Industrial Standards, six requirements were established for the screening of packing sheets: 1) filter opening ≤5 μm, 2) thickness ≤100 μm, 3) chemical resistance, 4) permeability ≥1.0 × 10⁻³ cm/s, 5) water retention rate <200%, and 6) cell transit test (≤2 cells/10 high-power fields). Polyamide, polyethylene terephthalate, and polypropylene/polyethylene composite sheets met all criteria. A pocket, which was designed to wrap the tissue uniformly, was made of these sheets and was found to effectively block the entry of all cell types during FFPE block processing. Using a sheet pocket, no single cell from the cell pellet could pass through the outer layer. The presence or absence of the sheet pocket did not affect hematoxylin and eosin staining. When processing FFPE blocks as a biomaterial for next-generation sequencing, the sheet pocket was effective in preventing cross-contamination. This technology will in part support the precise translation of histopathological data into genome sequencing data in general pathology laboratories.
... mtDNA mutations have been widely implicated in various physiological and pathological conditions, including mitochondrial disorders, aging and cancer, which raise growing demands for accurate sequencing and analysis of mtDNA. However, it is well recognized that mtDNA sequencing is extremely susceptible to sample cross-contamination.12,28,29 Due to its abundant genetic polymorphisms, mtDNA from two unrelated individuals may differ by several genetic variants, and thus even small levels of cross-contamination can lead to the appearance of multiple pseudo-mtDNA variants, manifesting as either high-level or low-level heteroplasmic sites.15 Most standard approaches detect cross-contamination based on mtDNA haplogroup-level phylogeny.13,15,23 ...
Article
Next-generation sequencing (NGS) of mitochondrial DNA (mtDNA) has widespread applications in aging and cancer studies. However, cross-contamination of mtDNA constitutes a major concern. Previous methods for the detection of mtDNA contamination mainly focus on haplogroup-level phylogeny, but neglect haplotype-level differences, leading to limited sensitivity and accuracy. In this study, we present mitoDataclean, a random-forest-based machine learning package for accurate identification of cross-contamination, evaluation of contamination levels and detection of contamination-derived variants in mtDNA NGS data. Comprehensive optimization of mitoDataclean revealed that training simulation with mixtures of small haplogroup distance and low polymorphic difference was critical for optimal modeling. Compared with existing methods, mitoDataclean exhibited significantly improved sensitivity and accuracy for the detection of sample contamination in simulated data. In addition, mitoDataclean achieved area under the curve values of 0.91 and 0.97 for discerning genuine and contamination-derived mtDNA variants in a simulated Western dataset and private sequencing contamination data, respectively, suggesting that this tool may be applicable for different populations and samples with different sources of contamination. Finally, mitoDataclean was further evaluated in several private and public datasets and showed a robust ability for contamination detection. Altogether, our study demonstrates that mitoDataclean may be used for accurate detection of contaminated samples and contamination-derived variants in mtDNA NGS data.
... Although MEF cells were removed by a differential attachment method to gelatin-coated culture dishes before genome extraction of GS and mGS cells, and the level of contamination was estimated to be less than 1%, such a low level of cellular heterogeneity is known to still give rise to a substantial number of false-positive mutation calls, especially at low allele frequencies 51 . ...
Article
Full-text available
Germline mutations underlie genetic diversity and species evolution. Previous studies have assessed the theoretical mutation rates and spectra in germ cells mostly by analyzing genetic markers and reporter genes in populations and pedigrees. This study reported the direct measurement of germline mutations by whole-genome sequencing of cultured spermatogonial stem cells in mice, namely germline stem (GS) cells, together with multipotent GS (mGS) cells that spontaneously dedifferentiated from GS cells. GS cells produce functional sperm that can generate offspring by transplantation into seminiferous tubules, whereas mGS cells contribute to germline chimeras by microinjection into blastocysts in a manner similar to embryonic stem cells. The estimated mutation rate of GS and mGS cells was approximately 0.22 × 10⁻⁹ and 1.0 × 10⁻⁹ per base per cell population doubling, respectively, indicating that GS cells have a lower mutation rate compared to mGS cells. GS and mGS cells also showed distinct mutation patterns, with C-to-T transition as the most frequent in GS cells and C-to-A transversion as the most predominant in mGS cells. By karyotype analysis, GS cells showed recurrent trisomy of chromosomes 15 and 16, whereas mGS cells frequently exhibited chromosomes 1, 6, 8, and 11 amplifications, suggesting that distinct chromosomal abnormalities confer a selective growth advantage for each cell type in vitro. These data provide the basis for studying germline mutations and a foundation for the future utilization of GS cells for reproductive technology and clinical applications.
... To evaluate contamination levels, current tools use two main sources of evidence: Sequence Alignment/Map (SAM/BAM) and Variant Call Format (VCF). Most tools developed for contamination detection are dedicated to the somatic context and require the use of paired samples to determine contamination levels; these tools include Conpair [1], GATK CalculateContamination [2] and HYSYS [3]. To our knowledge, there are few tools dedicated to germline single sample contamination analysis. ...
Preprint
Full-text available
Background: Interest in genomic medicine for human health studies and clinical applications is rapidly increasing. Clinical applications require contamination-free samples to avoid misleading results and provide a sound basis for diagnosis. Results: Here we present ContaTester, a tool which requires only allele balance information gathered from a VCF file to detect cross-contamination in germline human DNA samples. Based on a regression model of allele balance distribution, ContaTester allows fast checking of contamination levels for single samples or large cohorts (less than two minutes per sample). We demonstrate the efficiency of ContaTester using experimental validations: ContaTester shows similar results to methods requiring alignment data but with a significantly reduced storage footprint and less computation time. Additionally, for contamination levels above 5%, ContaTester can identify contaminants across a cohort, providing important clues for troubleshooting and quality assessment. Conclusions: ContaTester estimates contamination levels from VCF files generated from whole-genome sequencing of normal samples and provides reliable contaminant identification for cohorts or experimental batches.
... 1. Quality criteria: pass GATK (Cibulskis et al., 2011) standard filter and read depth ≥10. 2. Allelic frequency, based on the maximum minor allele frequency found in 1000G (http://browser.1000genomes.org), Genome Aggregation database (gnomAD; http://gnomad.broad ...
Article
Full-text available
Background: Premature ovarian insufficiency (POI) is a heterogeneous clinical syndrome defined by a premature loss of ovarian function associated with menstrual disturbances and hypergonadotropic hypogonadism. POI is a major cause of female infertility affecting 1% of women before the age of 40 and up to 0.01% before the age of 20. The etiology of POI may be iatrogenic, auto-immune or genetic but remains undetermined in a large majority of cases. An underlying genetic etiology has to be searched for in idiopathic cases, particularly in the context of a family history of POI. Methods: Whole exome sequencing (WES) was performed in trio in a Belgian patient presenting POI and in her two parents. The patient presented delayed puberty and primary amenorrhea with hypergonadotropic hypogonadism. Results: WES identified two novel compound heterozygous truncating mutations in the Newborn oogenesis homeobox (NOBOX) gene, c.826C>T (p.(Arg276Ter)) and c.1421del (p.(Gly474AlafsTer76)). Both mutations were confirmed by Sanger sequencing in the proband's sister who presented the same phenotype. Both variants were pathogenic and very likely responsible for the severe POI in this family. Conclusion: We report here for the first time compound heterozygous truncating mutations of NOBOX in outbred patients, generalizing biallelic NOBOX null mutations as a cause of severe POI with primary amenorrhea. In addition, our findings also suggest that NOBOX haploinsufficiency is tolerated.
... We note that when large-scale data, such as genome-wide single nucleotide polymorphism microarray data or next-generation whole genome sequencing data, are available, several more sensitive methods to identify cross-contamination are available. [34][35][36] For the past decade, Coriell has employed a supplemental panel of MSATs, the "Identifiler Plus" (the AmpFLSTR™ Identifiler™ Plus PCR Amplification Kit by Thermo Fisher Scientific), for added discrimination in the subset of cases where all standard six MSAT loci match between apparently distinct individuals. The Identifiler Plus marker set contains 15 STRs, of which 13 overlap core CODIS loci. ...
Article
Full-text available
Microsatellites, or MSATs, offer a fast and cost-effective way for biobanks to establish a biospecimen genetic profile. Importantly, this genetic profile can be used to authenticate multiple submissions derived from the same individual as well as biospecimens derived from the same original sample submission over time. While the Certificate of Confidentiality provided by the National Institutes of Health offers some meaningful protection to prevent the disclosure of potentially identifiable information to entities within the United States, we consider, in this study, the potential to offer additional protection to participants who choose to donate to biobanks by minimizing the use of forensic Combined DNA Index System (CODIS) MSAT markers in biobanking. To this end, we report the design and validation of a new multiplexed MSAT assay that does not include CODIS markers for use in biobanking operations and quality control management.
... TCGA DNA BAM files were aligned to the NCBI Human Reference Genome Build GRCh37 (hg19). Sample contamination by DNA originating from a different individual was assessed using ContEst [51]. Somatic single nucleotide variations (sSNVs) were then detected using MuTect [52]. ...
Preprint
Full-text available
Detection of somatic point mutations using patient sequencing data has many clinical applications, including the identification of cancer driver genes, detection of mutational signatures, and estimation of tumor mutational burden (TMB). In recent work, we developed a tool for detection of somatic mutations using tumor RNA and matched-normal DNA. Here, we further extend it to detect somatic mutations from RNA sequencing data without a matched-normal sample. This is accomplished via a machine learning approach that classifies mutations as either somatic or germline based on various features. When applied to RNA sequencing of >450 melanoma samples, high precision and recall are achieved, and both mutational signatures and driver genes are correctly identified. Finally, we show that RNA-based TMB is significantly associated with patient survival, with similar or superior performance to DNA-based TMB. Our pipeline can be utilized in many future applications, analyzing novel and existing datasets where only RNA is available.
... Methods for detecting and estimating contamination, as well as methods for correcting genotype calls accounting for contamination, have been developed previously [13][14][15][16][17][18]. For example, verifyBamID 13 is one of the most widely used software tools to detect and estimate DNA sample contamination and has been adopted as a part of the standard analysis pipeline in most large-scale sequencing centers in the US, and cleanCall 14 can correct for DNA contamination in genotype calling. ...
Thesis
The rapidly increasing throughput of sequencing technologies allows us to sequence genomes, transcriptomes, and epigenomes at an unprecedented scale. Robust, efficient, and accurate computational methods to analyze sequence reads are crucial for successful large-scale studies. In this dissertation, I address specific computational and statistical challenges in quality assessment of sequence reads, ancestry-agnostic estimation of DNA sample contamination, and deconvolution of genetically multiplexed scRNA-seq sequence data by leveraging genetic variants. In Chapter 2, I describe rapid and accurate algorithms to produce comprehensive quality metrics directly from raw sequence reads without the requirement of full sequence alignment. To produce a comprehensive set of quality metrics such as GC bias metrics, insert size distribution, contamination rates, and genetic ancestry, existing quality assessment methods usually require full sequence alignment which is the most time-consuming step. My methods offer orders of magnitude faster turnaround time by eliminating this requirement when compared to the widely used 1000 Genomes QC pipeline. The results show that the quality metrics estimated from my methods are highly concordant to full-alignment based methods. In Chapter 3, I present a robust statistical method that accurately estimates DNA contamination agnostic to genetic ancestry of the intended or contaminating samples. Through experiments with in-silico contaminated and real sequence datasets, I demonstrate that existing methods may fail to screen highly contaminated samples at a stringent contamination threshold due to the bias when the genetic ancestry is misspecified. Meanwhile, in the presence of contamination, genetic ancestry estimates can be substantially biased if contamination is ignored. 
My method integrates genetic ancestry and DNA contamination into a mixture model by leveraging individual-specific allele frequencies projected from reference genotypes onto principal component coordinates. I show that my method robustly corrects for the bias in both estimates of contamination rate and genetic ancestry under various scenarios of contamination. In Chapter 4, I enable genetic multiplexing of single-cell RNA-seq (scRNA-seq) experiments without requiring external genotyping by developing a genotyping-free scRNA-seq deconvolution method, freemuxlet. Genetic multiplexing of scRNA-seq (mux-seq) allows us to cost-effectively sequence single-cell transcriptomes across multiple samples in a single library preparation by harnessing natural genetic variation while dramatically reducing the batch effect. However, the existing statistical method, demuxlet, which enables mux-seq, requires external genotypes to be collected a priori, limiting its applications when it is difficult to obtain high-quality genotypes, such as in model organisms or cancer cells. Furthermore, the additional steps to obtain, process, and impute the external genotypes become a substantial bottleneck to analyzing the data within a rapid turnaround time. Freemuxlet defines the distances between a pair of cell barcodes as Bayes Factors (BF) to determine statistical confidence between possible hypotheses of the genetic identity of each cell barcode. The iterative procedure of multi-class clustering guided by BF distances simultaneously estimates the consensus genotypes of each individual while detecting multiplets and deconvoluting the sample provenances of singlets. I apply freemuxlet to real datasets and demonstrate high concordance of estimated droplet identities with other methods (cell hashing, demuxlet). I further demonstrate that freemuxlet can enable mux-seq on cancer cell line mixtures, where demuxlet could not because of the difficulty of accurate genotyping. 
My results suggest that freemuxlet can deconvolute mux-seq experiments as accurately as methods that utilize external information, facilitating a broader range of applications of population-scale single-cell sequencing.
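The mixture-model idea behind likelihood-based contamination estimation can be illustrated with a minimal sketch. This is not the method of the thesis above (it ignores genetic ancestry entirely) nor the implementation of any tool named on this page. It assumes, for illustration only, that every site is homozygous-reference in the intended sample, that the population alternate-allele frequency at each site is known, and that the sequencing error rate is fixed. It also borrows the flat prior over equally spaced contamination levels mentioned in an excerpt earlier on this page. All function and parameter names are hypothetical.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite at the boundaries
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def alt_read_prob(alpha, freq, err):
    """Probability that a read shows the alternate allele at a hom-ref site.

    alpha: contamination fraction (assumed share of reads from the contaminant)
    freq:  population alternate-allele frequency (assumed known)
    err:   per-base sequencing error rate (assumed fixed)
    """
    contaminant_alt = freq * (1 - err) + (1 - freq) * err
    return (1 - alpha) * err + alpha * contaminant_alt


def estimate_contamination(sites, err=0.001, grid_size=101):
    """Grid posterior over alpha under a flat prior.

    sites: iterable of (alt_reads, depth, population_alt_frequency) tuples.
    Returns (most probable alpha on the grid, posterior over the grid).
    """
    sites = list(sites)
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    log_liks = [
        sum(log_binom_pmf(alt, depth, alt_read_prob(a, freq, err))
            for alt, depth, freq in sites)
        for a in grid
    ]
    # Subtract the maximum before exponentiating to avoid underflow.
    max_ll = max(log_liks)
    weights = [math.exp(ll - max_ll) for ll in log_liks]
    total = sum(weights)
    posterior = [w / total for w in weights]
    best_index = max(range(len(grid)), key=posterior.__getitem__)
    return grid[best_index], posterior


if __name__ == "__main__":
    # Toy data: 26 alternate reads out of 1000 at sites with frequency 0.5.
    toy_sites = [(26, 1000, 0.5)] * 5
    alpha_hat, _ = estimate_contamination(toy_sites)
    print(f"estimated contamination: {alpha_hat:.2f}")
```

Because the prior is flat, the most probable grid value is also the maximum-likelihood value on the grid. Methods that model ancestry would replace the single population frequency per site with individual-specific frequencies.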
Article
Innovation in sequencing instrumentation is increasing the per-batch data volumes and decreasing the per-base costs. Multiplexed chemistry protocols after the addition of index tags have further contributed to efficient and cost-effective sequencer utilization. With these pooled processing strategies, however, comes an increased risk of sample contamination. Sample contamination poses a risk of missing critical variants in a patient sample or wrongly reporting variants derived from the contaminant, which are particularly relevant issues in oncology specimen testing where low variant allele frequencies have clinical relevance. Small custom targeted Next Generation Sequencing (NGS) panels yield limited variants and pose challenges in delineating true somatic variants versus contamination calls. A number of popular contamination identification tools have the ability to perform well in whole genome/exome sequencing data, but in smaller gene panels there are fewer variant candidates for the tools to perform accurately. To prevent clinical reporting of potentially contaminated samples in small NGS panels, we have developed MICon (Microhaplotype Contamination detection) - a novel contamination detection model that utilizes microhaplotype site variant allele frequencies. In a heterogeneous hold-out test cohort of 210 samples, the model demonstrated state-of-the-art performance with an AUROC of 0.995.
Article
Currently, ~30–55% of non-small cell lung cancer (NSCLC) patients develop recurrence due to minimal residual disease (MRD) after surgical resection of the tumor. This study aims to develop an ultra-sensitive and affordable fragmentomic assay for MRD detection in NSCLC patients. A total of 87 NSCLC patients who received curative surgical resections (23 patients relapsed during follow-up) were enrolled in this study. A total of 163 plasma samples, collected at 7 days and 6 months postsurgical, were used for both whole-genome sequencing (WGS) and targeted sequencing. The WGS-based cell-free DNA (cfDNA) fragment profile was used to fit regularized Cox regression models, and leave-one-out cross-validation was further used to evaluate the models' performance. The models showed excellent performance in detecting patients with a high risk of recurrence. At 7 days postsurgical, the high-risk patients detected by our model showed an increased risk of 4.6 times, while the risk increased to 8.3 times at 6 months postsurgical. These fragmentomics determined higher risk compared to the targeted sequencing based circulating mutations both at 7 days and 6 months postsurgical. The overall sensitivity for detecting patients with recurrence reached 78.3% while using both fragmentomics and mutation results from 7 days and 6 months postsurgical, which increased from the 43.5% sensitivity by using only the circulating mutations. The fragmentomics showed great sensitivity in predicting patient recurrence compared to the traditional circulating mutation, especially after the surgery for early-stage NSCLC, therefore exhibiting great potential to guide adjuvant therapeutics.
Preprint
Anaplastic thyroid carcinoma is arguably the most lethal human malignancy. It often co-occurs with differentiated thyroid cancers, yet the molecular origins of its aggressivity are unknown. We sequenced tumor DNA from 329 regions of thyroid cancer, including 213 from patients with primary anaplastic thyroid carcinomas and multi-region whole-genome sequencing. Anaplastic thyroid carcinomas have a higher burden of mutations than other thyroid cancers, with distinct mutational signatures and molecular subtypes. Specific cancer driver genes are mutated in anaplastic and differentiated thyroid carcinomas, even those arising in a single patient. We unambiguously demonstrate that anaplastic thyroid carcinomas share a genomic origin with co-occurring differentiated carcinomas, and emerge from a common malignant field through acquisition of characteristic clonal driver mutations.
Article
Anti-PD-1/PD-L1 agents have transformed the treatment landscape of advanced non-small cell lung cancer (NSCLC). To expand our understanding of the molecular features underlying response to checkpoint inhibitors in NSCLC, we describe here the first joint analysis of the Stand Up To Cancer-Mark Foundation cohort, a resource of whole exome and/or RNA sequencing from 393 patients with NSCLC treated with anti-PD-(L)1 therapy, along with matched clinical response annotation. We identify a number of associations between molecular features and outcome, including (1) favorable (for example, ATM altered) and unfavorable (for example, TERT amplified) genomic subgroups, (2) a prominent association between expression of inducible components of the immunoproteasome and response and (3) a dedifferentiated tumor-intrinsic subtype with enhanced response to checkpoint blockade. Taken together, results from this cohort demonstrate the complexity of biological determinants underlying immunotherapy outcomes and reinforce the discovery potential of integrative analysis within large, well-curated, cancer-specific cohorts.
Preprint
IMPORTANCE RCC encompasses a set of histologically distinct cancers with a high estimated genetic heritability, of which only a portion is currently explained. Previous rare germline variant studies in RCC have usually pooled clear and non-clear cell RCCs and have not adequately accounted for population stratification that may significantly impact the interpretation and discovery of certain candidate risk genes. OBJECTIVE To evaluate the enrichment of germline PVs in established cancer-predisposing genes (CPGs) in clear cell and non-clear cell RCC patients compared to cancer-free controls using approaches that account for population stratification and to identify unconventional types of germline RCC risk variants that confer an increased risk of developing RCC. DESIGN, SETTING, AND PARTICIPANTS In 1,436 unselected RCC patients with sufficient data quality, we systematically identified rare germline PVs, cryptic splice variants, and copy number variants (CNVs). From this unselected cohort, 1,356 patients were ancestry-matched with 16,512 cancer-free controls, and gene-level enrichment of rare germline PVs was assessed in 143 CPGs, followed by an investigation of somatic events in matching tumor samples. MAIN OUTCOMES AND MEASURES Gene-level burden of rare germline PVs, identification of secondary somatic events accompanying the germline PVs, and characterization of less-explored types of rare germline PVs in RCC patients. RESULTS In clear cell RCC (n = 976 patients), patients exhibited significantly higher prevalence of PVs in VHL compared to controls (OR: 39.1, 95% CI: 7.01-218.07, p-value: 4.95e-05, q-value: 0.00584). In non-clear cell RCC (n = 380 patients), patients carried an enriched burden of PVs in FH (OR: 77.9, 95% CI: 18.68-324.97, p-value: 1.55e-08, q-value: 1.83e-06) and MET (OR: 1.98e11, 95% CI: 0-inf, p-value: 2.07e-05, q-value: 3.50e-07). 
In a CHEK2-focused analysis with European cases and controls, clear cell RCC patients (n=906 European patients) harbored nominal enrichment of the previously reported low-penetrance CHEK2 variants, p.Ile157Thr (OR:1.84, 95% CI: 1.00-3.36, p-value:0.049) and p.Ser428Phe (OR:5.20, 95% CI: 1.00-26.40, p-value:0.045) while non-clear cell RCC patients (n=295 European patients) exhibited nominal enrichment of CHEK2 LOF germline PVs (OR: 3.51, 95% CI: 1.10-11.10, p-value: 0.033). RCC patients with germline PVs in FH, MET, and VHL exhibited significantly earlier age of cancer onset compared to patients without any germline PVs in CPGs (Mean: 46.0 vs 60.2 years old, Tukey adjusted p-value < 0.0001), and more than half had secondary somatic events affecting the same gene (n=10/15, 66.7%, 95% CI: 38.7-87.0%). Conversely, patients with rare germline PVs in CHEK2 exhibited a similar age of disease onset to patients without any identified germline PVs in CPGs (Mean: 60.1 vs 60.2 years old, Tukey adjusted p-value: 0.99), and only 30.4% of the patients carried secondary somatic events in CHEK2 (n=7/23, 95% CI: 14.1-53.0%). Finally, rare pathogenic germline cryptic splice variants underexplored in RCC were identified in SDHA and TSC1, and rare pathogenic germline CNVs were found in 18 patients, including CNVs in FH, SDHA, and VHL. CONCLUSIONS AND RELEVANCE This systematic analysis supports the existing link between several RCC risk genes and elevated RCC risk manifesting in earlier age of RCC onset. Our analysis calls for caution when assessing the role of germline PVs in CHEK2 due to the burden of founder variants with varying population frequency in different ancestry groups. It also broadens the definition of the RCC germline landscape of pathogenicity to incorporate previously understudied types of germline variants, such as cryptic splice variants and CNVs.
Article
The rapid development of next-generation sequencing (NGS) technology has promoted its wide clinical application in precision medicine for oncology. However, laborious and time-consuming manual operations, highly skilled personnel requirements, and cross-contamination are major challenges for the clinical implementation of NGS technology-based tests. The Automated NGS Diagnostic Solutions (ANDiS) 500 system is a fully enclosed cassette-dependent automated NGS library preparation system. This platform can produce a qualified targeted amplicon library in three steps with only 15 minutes of hands-on time. A rigorous cross-contamination test using simulated contaminant plasmids confirmed that the disposable cassette design guarantees zero sample cross-contamination. The BRCA1 and BRCA2 mutation detection panel and the gastrointestinal cancer-related gene analysis panel for the ANDiS 500 platform showed 100% accuracy and precision in detecting germ-line mutations and somatic mutations, respectively. Furthermore, those panels showed 100% concordance with verified methods in a prospective cohort study enrolling 363 patients and a cohort of 45 pan-cancer samples. In conclusion, the ANDiS 500 automated platform could overcome major challenges for implementing NGS assays clinically and is eligible for routine clinical tests.
Article
Patients with smoldering multiple myeloma (SMM) are observed until progression, but early treatment may improve outcomes. We conducted a phase II trial of elotuzumab, lenalidomide, and dexamethasone (EloLenDex) in patients with high-risk SMM and performed single-cell RNA and T cell receptor (TCR) sequencing on 149 bone marrow (BM) and peripheral blood (PB) samples from patients and healthy donors (HDs). We find that early treatment with EloLenDex is safe and effective and provide a comprehensive characterization of alterations in immune cell composition and TCR repertoire diversity in patients. We show that the similarity of a patient’s immune cell composition to that of HDs may have prognostic relevance at diagnosis and after treatment and that the abundance of granzyme K (GZMK)⁺ CD8⁺ effector memory T (TEM) cells may be associated with treatment response. Last, we uncover similarities between immune alterations observed in the BM and PB, suggesting that PB-based immune profiling may have diagnostic and prognostic utility.
Article
Purpose: Sensitivity to endocrine therapy (ET) is critical for the clinical benefit from the combination of palbociclib plus ET in hormone receptor-positive/HER2-negative (HR+/HER2-) advanced breast cancer. Bazedoxifene is a third-generation selective ER modulator and selective ER degrader with activity in preclinical models of endocrine-resistant breast cancer, including models harboring ESR1 mutations. Clinical trials in healthy women showed that bazedoxifene is well tolerated. Patients and methods: We conducted a phase Ib/II study of bazedoxifene plus palbociclib in patients with HR+/HER2- advanced breast cancer who progressed on prior ET (N=36) (NCT02448771). Results: The study met its primary endpoint, with a clinical benefit rate of 33.3%, and the safety profile was consistent with what has previously been seen with palbociclib monotherapy. The median progression-free survival (PFS) was 3.6 months (95% CI 2.0-7.2). An activating PIK3CA mutation at baseline was associated with a shorter PFS (HR = 4.4, 95% CI 1.5-13, P = 0.0026) but activating ESR1 mutations did not impact the PFS. Longitudinal plasma circulating tumor DNA whole exome sequencing (WES) (N=68 plasma samples) provided an overview of the tumor heterogeneity and the sub-clonal genetic evolution, and identified actionable mutations acquired during treatment. Conclusions: The combination of palbociclib and bazedoxifene has clinical efficacy and an acceptable safety profile in a heavily pre-treated patient population with advanced HR+/HER2- breast cancer. These results merit continued investigation of bazedoxifene in breast cancer.
Preprint
High-risk localized prostate cancer (HRLPC) is associated with a substantial risk of recurrence and prostate cancer-specific mortality¹. Recent clinical trials have shown that intensifying anti-androgen therapies administered prior to prostatectomy can induce pathologic complete responses (pCR) or minimal residual disease (MRD) (<5 mm), together termed exceptional response, although the molecular determinants of these clinical outcomes are largely unknown. Here, we performed whole exome (WES) and whole transcriptome sequencing (RNA-seq) on pre-treatment multi-regional tumor biopsies from exceptional responders (ER: pCR and MRD patients) and non-responders (NR: pathologic T3 or lymph node positive disease) treated with intensive anti-androgen therapies prior to prostatectomy. SPOP mutation and SPOPL copy number loss were exclusively observed in ER, while TP53 mutation and PTEN copy number loss were exclusively observed in NR. These alterations were clonal in all tumor phylogenies per patient. Additionally, transcriptional programs involving androgen signaling and TGFβ signaling were enriched in ER and NR, respectively. The presence of these alterations in routine biopsies from patients with HRLPC may inform the prospective identification of responders to neoadjuvant anti-androgen therapies to improve clinical outcomes and stratify other patients to alternative biologically informed treatment strategies.
Article
Neoantigens arising from mutations in tumor DNA provide targets for immune-based therapy. Here, we report the clinical and immune data from a Phase Ib clinical trial of a personalized neoantigen vaccine, NEO-PV-01, in combination with pemetrexed, carboplatin, and pembrolizumab as first-line therapy for advanced non-squamous non-small cell lung cancer (NSCLC). This analysis of 38 patients treated with the regimen demonstrated no treatment-related serious adverse events. Multiple parameters including baseline tumor immune infiltration and on-treatment circulating tumor DNA levels were highly correlated with clinical response. De novo neoantigen-specific CD4⁺ and CD8⁺ T cell responses were observed post-vaccination. Epitope spread to non-vaccinating neoantigens, including responses to KRAS G12C and G12V mutations, was detected post-vaccination. Neoantigen-specific CD4⁺ T cells generated post-vaccination revealed effector and cytotoxic phenotypes with increased CD4⁺ T cell infiltration in the post-vaccine tumor biopsy. Collectively, these data support the safety and immunogenicity of this regimen in advanced non-squamous NSCLC.
Article
Background: Congenital heart disease (CHD) is the most common birth defect and has high heritability. Although some susceptibility genes have been identified, the genetic basis underlying the majority of CHD cases is still undefined. Methods: A total of 1320 unrelated CHD patients were enrolled in our study. Exome-wide association analysis between 37 tetralogy of Fallot (TOF) patients and 208 Han Chinese controls from the 1000 Genomes Project was performed to identify the novel candidate gene WD repeat-containing protein 62 (WDR62). WDR62 variants were searched in another expanded set of 200 TOF patients by Sanger sequencing. Rescue experiments in zebrafish were conducted to observe the effects of WDR62 variants. The roles of WDR62 in heart development were examined in mouse models with Wdr62 deficiency. WDR62 variants were investigated in an additional 1083 CHD patients with similar heart phenotypes to knockout mice by multiplex PCR-targeting sequencing. The cellular phenotypes of WDR62 deficiency and variants were tested in cardiomyocytes, and the molecular mechanisms were preliminarily explored by RNA-seq and co-immunoprecipitation. Results: Seven WDR62 coding variants were identified in the 237 TOF patients and all were indicated to be loss of function variants. A total of 25 coding and 22 non-coding WDR62 variants were identified in 80 (6%) of the 1320 CHD cases sequenced, with a higher proportion of WDR62 variation (8%) found in the ventricular septal defect (VSD) cohort. WDR62 deficiency resulted in a series of heart defects affecting the outflow tract and right ventricle in mouse models, including VSD as the major abnormality. Cell cycle arrest and an increased number of cells with multipolar spindles that inhibited proliferation were observed in cardiomyocytes with variants or knockdown of WDR62. 
WDR62 deficiency weakened the association between WDR62 and the cell cycle-regulated kinase AURKA on spindle poles, reduced the phosphorylation of AURKA, and decreased expression of target genes related to cell cycle and spindle assembly shared by WDR62 and AURKA. Conclusions: WDR62 was identified as a novel susceptibility gene for CHD with high variant frequency. WDR62 was shown to participate in the cardiac development by affecting spindle assembly and cell cycle pathway in cardiomyocytes.
Article
Neoadjuvant immune checkpoint blockade has shown promising clinical activity. Here, we characterized early kinetics in tumor-infiltrating and circulating immune cells in oral cancer patients treated with neoadjuvant anti-PD-1 or anti-PD-1/CTLA-4 in a clinical trial (NCT02919683). Tumor-infiltrating CD8 T cells that clonally expanded during immunotherapy expressed elevated tissue-resident memory and cytotoxicity programs, which were already active prior to therapy, supporting the capacity for rapid response. Systematic target discovery revealed that treatment-expanded tumor T cell clones in responding patients recognized several self-antigens, including the cancer-specific antigen MAGEA1. Treatment also induced a systemic immune response characterized by expansion of activated T cells enriched for tumor-infiltrating T cell clonotypes, including both pre-existing and emergent clonotypes undetectable prior to therapy. The frequency of activated blood CD8 T cells, notably pre-treatment PD-1-positive KLRG1-negative T cells, was strongly associated with intra-tumoral pathological response. These results demonstrate how neoadjuvant checkpoint blockade induces local and systemic tumor immunity.
Article
Smoldering multiple myeloma (SMM) is a precursor condition of multiple myeloma (MM) with significant heterogeneity in disease progression. Existing clinical models of progression risk do not fully capture this heterogeneity. Here we integrate 42 genetic alterations from 214 SMM patients using unsupervised binary matrix factorization (BMF) clustering and identify six distinct genetic subtypes. These subtypes are differentially associated with established MM-related RNA signatures, oncogenic and immune transcriptional profiles, and evolving clinical biomarkers. Three genetic subtypes are associated with increased risk of progression to active MM in both the primary and validation cohorts, indicating they can be used to better predict high and low-risk patients within the currently used clinical risk stratification models. Existing clinical models cannot fully capture smoldering multiple myeloma (SMM) heterogeneity. Here, integration of 42 genetic alterations from 214 SMM patients using an unsupervised binary matrix factorization clustering approach results in the identification of 6 distinct molecular and clinical subtypes.
Preprint
Although papillary thyroid carcinoma (PTC) is the most frequent endocrine tumor with a generally excellent prognosis, a patient developed a clinically aggressive PTC eleven years after receiving platinum chemotherapy for ovarian endometrioid adenocarcinoma. Germline and somatic analyses of multi-temporal and multi-regional molecular profiles indicated that ovarian and thyroid tumors did not share common genetic alterations. PTC tumors had driver events associated with aggressive PTC behavior: an RBPMS-NTRK3 fusion and a TERT promoter mutation. Spatial and temporal genomic heterogeneity analysis indicated a close link between anatomical locations and molecular patterns of PTC. Mutational signature analyses demonstrated a molecular footprint of platinum exposure, and that aggressive molecular drivers of PTC were linked to prior platinum-associated mutagenesis. This case provides a direct association between platinum chemotherapy exposure and secondary solid tumor evolution, specifically aggressive thyroid carcinoma, and suggests that uniform clinical assessments for secondary PTC after platinum chemotherapy may warrant further evaluation.
Article
Purpose: Adoptive cell therapy (ACT) of tumor-infiltrating lymphocytes (TIL) historically yields a 40-50% response rate in metastatic melanoma. However, the determinants of outcome are largely unknown. Experimental design: We investigated tumor-based genomic correlates of overall survival (OS), progression-free survival (PFS), and response to therapy by interrogating tumor samples initially collected to generate TIL infusion products. Results: Whole exome sequencing (WES) data from 64 samples indicated a positive correlation between neoantigen load and OS, but not PFS or response to therapy. RNA sequencing analysis of 34 samples showed that expression of PDE1C, RTKN2, and NGFR was enriched in responders who had improved PFS and OS. In contrast, the expression of ELFN1 was enriched in patients with unfavorable response, poor PFS and OS, whereas enhanced methylation of ELFN1 was observed in patients with favorable outcomes. Expression of ELFN1, NGFR and PDE1C was mainly found in cancer-associated fibroblasts and endothelial cells in tumor tissues across different cancer types in publicly available single cell RNA sequencing datasets, suggesting a role for elements of the tumor microenvironment in defining the outcome of TIL therapy. Conclusions: Our findings suggest that transcriptional features of melanomas correlate with outcomes after TIL therapy and may provide candidates to guide patient selection.
Article
Immune checkpoint blockade (CPB) improves melanoma outcomes, but many patients still do not respond. Tumor mutational burden (TMB) and tumor-infiltrating T cells are associated with response, and integrative models improve survival prediction. However, integrating immune/tumor-intrinsic features using data from a single assay (DNA/RNA) remains underexplored. Here, we analyze whole-exome and bulk RNA sequencing of tumors from new and published cohorts of 189 and 178 patients with melanoma receiving CPB, respectively. Using DNA, we calculate T cell and B cell burdens (TCB/BCB) from rearranged TCR/Ig sequences and find that patients with TMBhigh and TCBhigh or BCBhigh have improved outcomes compared to other patients. By combining pairs of immune- and tumor-expressed genes, we identify three gene pairs associated with response and survival, which validate in independent cohorts. The top model includes lymphocyte-expressed MAP4K1 and tumor-expressed TBX3. Overall, RNA or DNA-based models combining immune and tumor measures improve predictions of melanoma CPB outcomes.
Article
Background Muscle-invasive bladder cancer (MIBC) is a rare but serious event following definitive radiation for prostate cancer. Radiation-associated MIBC (RA-MIBC) can be difficult to manage given the challenges of delivering definitive therapy to a previously irradiated pelvis. The genomic landscape of RA-MIBC and whether it is distinct from non–RA-MIBC are unknown. Objective To define mutational features of RA-MIBC and compare the genomic landscape of RA-MIBC with that of non–RA-MIBC. Design, setting, and participants We identified patients from our institution who received radiotherapy for prostate cancer and subsequently developed MIBC. Outcome measurements and statistical analysis We performed whole exome sequencing of bladder tumors from RA-MIBC patients. Tumor genetic alterations including mutations, copy number alterations, and mutational signatures were identified and were compared with genetic features of non–RA-MIBC. We used the Kaplan-Meier method to estimate recurrence-free (RFS) and overall (OS) survival. Results and limitations We identified 19 RA-MIBC patients with available tumor tissue (n = 22 tumors) and clinical data. The median age was 76 yr, and the median time from prostate cancer radiation to RA-MIBC was 12 yr. The median RFS was 14.5 mo and the median OS was 22.0 mo. Compared with a cohort of non–RA-MIBC analyzed in parallel, there was no difference in tumor mutational burden, but RA-MIBCs had a significantly increased number of short insertions and deletions (indels) consistent with previous radiation exposure. We identified mutation signatures characteristic of APOBEC-mediated mutagenesis, aging, and homologous recombination deficiency. The frequency of mutations in many known bladder cancer genes, including TP53, KDM6A, and RB1, as well as copy number alterations such as CDKN2A loss was similar in RA-MIBC and non-RA-MIBC. 
Conclusions We identified unique mutational properties that likely contribute to the distinct biological and clinical features of RA-MIBC. Patient summary Bladder cancer is a rare but serious diagnosis following radiation for prostate cancer. We characterized genetic features of bladder tumors arising after prostate radiotherapy, and identify similarities with and differences from bladder tumors from patients without previous radiation.
Article
Distilling biologically meaningful information from cancer genome sequencing data requires comprehensive identification of somatic alterations using rigorous computational methods. As the amount and complexity of sequencing data have increased, so has the number of tools for analysing them. Here, we describe the main steps involved in the bioinformatic analysis of cancer genomes, review key algorithmic developments and highlight popular tools and emerging technologies. These tools include those that identify point mutations, copy number alterations, structural variations and mutational signatures in cancer genomes. We also discuss issues in experimental design, the strengths and limitations of sequencing modalities and methodological challenges for the future.
Article
Objectives Diversity of laboratory-developed tests (LDTs) using next-generation sequencing (NGS) raises concerns about their accuracy for selection of targeted therapies. A working group developed a pilot study of traceable reference samples to measure NGS LDT performance among a cohort of clinical laboratories. Methods Human cell lines were engineered via CRISPR/Cas9 and prepared as formalin-fixed, paraffin-embedded cell pellets (“wet” samples) to assess the entire NGS test cycle. In silico mutagenized NGS sequence files (“dry” samples) were used to assess the bioinformatics component of the NGS test cycle. Single and multinucleotide variants (n = 36) of KRAS and NRAS were tested at 5% or 15% variant allele fraction to determine eligibility for therapy with the EGFR inhibitor panitumumab in the setting of metastatic colorectal cancer. Results Twenty-one (21/21) laboratories tested wet samples; 19 of 21 analyzed dry samples. Of the laboratories that tested both the wet and dry samples, 7 (37%) of 19 laboratories correctly reported all variants, 3 (16%) of 19 had fewer than five errors, and 9 (47%) of 19 had five or more errors. Most errors were false negatives. Conclusions Genetically engineered cell lines and mutagenized sequence files are complementary reference samples for evaluating NGS test performance among clinical laboratories using LDTs. Variable accuracy in detection of genetic variants among some LDTs may identify different patient populations for targeted therapy.
Article
Purpose: Prostatic ductal adenocarcinoma (PDA) is recognized as an advanced stage cancer and is often observed in conjunction with acinar adenocarcinoma, with abundant cytoplasm arranged in a papillary pattern. When compared with acinar adenocarcinoma, it is characterized by an increased biochemical recurrence rate and unusual metastasis sites. The purpose of the present study was to further elucidate the genomic alterations associated with PDA. Methods: Whole-exome sequencing (WES) and linkage analyses were performed on genomic DNA isolated from formalin-fixed, paraffin-embedded (FFPE) samples obtained from eleven PDA tumors and paired benign tissues. The profiles of somatic mutations, indels as well as copy-number alterations were confirmed in PDA patients. The clonal evolution patterns of the eleven PDA cases were compared with the data obtained from the Cancer Genome Atlas (TCGA) for eight primary prostatic acinar adenocarcinoma patients. Results: The same somatic changes were observed in PDA as in advanced and/or metastatic acinar adenocarcinomas. For example, the mutations of a known prostate cancer driver gene CDKN1A, were the most significant events among 17% of tumors. In addition to the known amplification of chromosomes 1q, 4p, 8q, and 14q, the copy number of several large regions also increased significantly. The origin of PDA was heterogeneous: some patients (e.g. P5) were consistent with the monoclonal model, while others (e.g. P7) were polyclonal. Conclusions: PDA and acinar adenocarcinomas of prostate with high Gleason score have similar mutational profiles. The somatic mutations in PDA may be the reason for its invasive biological behavior.
Article
Full-text available
This humble effort highlights the intricate details of metagenomics in a simple, poetic, and rhythmic way. The paper enforces the significance of the research area, provides details about major analytical methods, examines the taxonomy and assembly of genomes, emphasizes some tools, and concludes by celebrating the richness of the ecosystem populated by the "metagenome."
Article
Full-text available
Neuroendocrine carcinomas (NEC) are tumors expressing markers of neuronal differentiation that can arise at different anatomic sites but have strong histological and clinical similarities. Here we report the chromatin landscapes of a range of human NECs and show convergence to the activation of a common epigenetic program. With a particular focus on treatment emergent neuroendocrine prostate cancer (NEPC), we analyze cell lines, patient-derived xenograft (PDX) models and human clinical samples to show the existence of two distinct NEPC subtypes based on the expression of the neuronal transcription factors ASCL1 and NEUROD1. While in cell lines and PDX models these subtypes are mutually exclusive, single-cell analysis of human clinical samples exhibits a more complex tumor structure with subtypes coexisting as separate sub-populations within the same tumor. These tumor sub-populations differ genetically and epigenetically contributing to intra- and inter-tumoral heterogeneity in human metastases. Overall, our results provide a deeper understanding of the shared clinicopathological characteristics shown by NECs. Furthermore, the intratumoral heterogeneity of human NEPCs suggests the requirement of simultaneous targeting of coexisting tumor populations as a therapeutic strategy.
Article
Full-text available
Immune checkpoint inhibitors (ICIs) have minimal therapeutic effect in hormone receptor-positive (HR+) breast cancer. We present final overall survival (OS) results (n = 88) from a randomized phase 2 trial of eribulin ± pembrolizumab for patients with metastatic HR+ breast cancer, computationally dissect genomic and/or transcriptomic data from pre-treatment tumors (n = 52) for molecular associations with efficacy, and identify cytokine changes differentiating response and ICI-related toxicity (n = 58). Despite no improvement in OS with combination therapy (hazard ratio 0.95, 95% CI 0.59–1.55, p = 0.84), immune infiltration and antigen presentation distinguished responding tumors, while tumor heterogeneity and estrogen signaling independently associated with resistance. Moreover, patients with ICI-related toxicity had lower levels of immunoregulatory cytokines. Broadly, we establish a framework for ICI response in HR+ breast cancer that warrants diagnostic and therapeutic validation. ClinicalTrials.gov Registration: NCT03051659. A randomized phase 2 clinical trial has recently shown no benefit of the combination eribulin and pembrolizumab over eribulin alone in HR+ metastatic breast cancer patients (NCT03051659). Here, the authors are reporting the final OS data and biomarker analyses on a subset of samples to analyze molecular correlates
Article
Full-text available
High-risk localized prostate cancer (HRLPC) is associated with a substantial risk of recurrence and disease mortality. Recent clinical trials have shown that intensifying anti-androgen therapies administered before prostatectomy can induce pathologic complete responses or minimal residual disease, called exceptional response, although the molecular determinants of these clinical outcomes are largely unknown. Here, we perform whole-exome and transcriptome sequencing on pre-treatment multi-regional tumor biopsies from exceptional responders (ERs) and non-responders (NRs, pathologic T3 or lymph node-positive disease) to intensive neoadjuvant anti-androgen therapies. Clonal SPOP mutation and SPOPL copy-number loss are exclusively observed in ERs, while clonal TP53 mutation and PTEN copy-number loss are exclusively observed in NRs. Transcriptional programs involving androgen signaling and TGF-β signaling are enriched in ERs and NRs, respectively. These findings may guide prospective validation studies of these molecular features in large HRLPC clinical cohorts treated with neoadjuvant anti-androgens to improve patient stratification.
Article
Full-text available
This single-arm phase II study investigated the efficacy and safety of cabozantinib combined with nivolumab in metastatic triple-negative breast cancer (mTNBC). The primary endpoint was objective response rate (ORR) by RECIST 1.1. Biopsies at baseline and after cycle 1 were analyzed for tumor-infiltrating lymphocytes (TILs), PD-L1, and whole-exome and transcriptome sequencing. Only 1/18 patients achieved a partial response (ORR 6%), and the trial was stopped early. Toxicity led to cabozantinib dose reduction in 50% of patients. One patient had a PD-L1-positive tumor, and three patients had TILs > 10%. The responding patient had a PD-L1-negative tumor with low tumor mutational burden but high TILs and enriched immune gene expression. High pretreatment levels of plasma immunosuppressive cytokines, chemokines, and immune checkpoint molecules were associated with rapid progression. Although this study did not meet its primary endpoint, immunostaining, genomic, and proteomic studies indicated a high degree of tumor immunosuppression in this mTNBC cohort.
Article
Full-text available
Sacituzumab govitecan (SG), the first antibody–drug conjugate (ADC) approved for triple-negative breast cancer, incorporates the anti-TROP2 antibody hRS7 conjugated to a topoisomerase-1 (TOP1) inhibitor payload. We sought to identify mechanisms of SG resistance through RNA and whole-exome sequencing of pretreatment and postprogression specimens. One patient exhibiting de novo progression lacked TROP2 expression, in contrast to robust TROP2 expression and focal genomic amplification of TACSTD2/TROP2 observed in a patient with a deep, prolonged response to SG. Analysis of acquired genomic resistance in this case revealed one phylogenetic branch harboring a canonical TOP1 E418K resistance mutation and subsequent frameshift TOP1 mutation, whereas a distinct branch exhibited a novel TACSTD2/TROP2 T256R missense mutation. Reconstitution experiments demonstrated that TROP2 T256R confers SG resistance via defective plasma membrane localization and reduced cell-surface binding by hRS7. These findings highlight parallel genomic alterations in both antibody and payload targets associated with resistance to SG. Significance: These findings underscore TROP2 as a response determinant and reveal acquired SG resistance mechanisms involving the direct antibody and drug payload targets in distinct metastatic subclones of an individual patient. This study highlights the specificity of SG and illustrates how such mechanisms will inform therapeutic strategies to overcome ADC resistance. This article is highlighted in the In This Issue feature, p. 2355
Article
Full-text available
The functional consequences of genetic variants within 5’ untranslated regions (UTRs) on a genome-wide scale are poorly understood in disease. Here we develop a high-throughput multi-layer functional genomics method called PLUMAGE (Pooled full-length UTR Multiplex Assay on Gene Expression) to quantify the molecular consequences of somatic 5’ UTR mutations in human prostate cancer. We show that 5’ UTR mutations can control transcript levels and mRNA translation rates through the creation of DNA binding elements or RNA-based cis-regulatory motifs. We discover that point mutations can simultaneously impact transcript and translation levels of the same gene. We provide evidence that functional 5’ UTR mutations in the MAP kinase signaling pathway can upregulate pathway-specific gene expression and are associated with clinical outcomes. Our study reveals the diverse mechanisms by which the mutational landscape of 5’ UTRs can co-opt gene expression and demonstrates that single nucleotide alterations within 5’ UTRs are functional in cancer. Mutations in 5’ untranslated regions (UTRs) have a functional role in gene expression in cancer. Here, the authors develop a sequencing-based high throughput functional assay named PLUMAGE and show the effects of these mutations on gene expression and their association with clinical outcomes in prostate cancer.
Article
Full-text available
Several risk factors have been established for colorectal cancer, yet their direct mutagenic effects in patients' tumors remain to be elucidated. Here, we leveraged whole-exome sequencing data from 900 colorectal cancer cases that had occurred in three U.S.-wide prospective studies with extensive dietary and lifestyle information. We found an alkylating signature that was previously undescribed in colorectal cancer and then showed the existence of a similar mutational process in normal colonic crypts. This alkylating signature is associated with high intakes of processed and unprocessed red meat prior to diagnosis. In addition, this signature was more abundant in the distal colorectum, predicted to target cancer driver mutations KRAS p.G12D, KRAS p.G13D, and PIK3CA p.E545K, and associated with poor survival. Together, these results link for the first time a colorectal mutational signature to a component of diet and further implicate the role of red meat in colorectal cancer initiation and progression. Significance Colorectal cancer has several lifestyle risk factors, but the underlying mutations for most have not been observed directly in tumors. Analysis of 900 colorectal cancers with whole-exome sequencing and epidemiologic annotations revealed an alkylating mutational signature that was associated with red meat consumption and distal tumor location, as well as predicted to target KRAS p.G12D/p.G13D. This article is highlighted in the In This Issue feature, p. 2355
Article
Full-text available
Tumor multi-region sequencing reveals intratumor heterogeneity (ITH) and clonal evolution, which play a key role in tumor progression and metastasis. Large-scale, high-depth multiregional sequencing of colorectal cancer (CRC), comparative analysis among right-sided colon cancer (RCC), left-sided colon cancer (LCC) and rectal cancer (RC) patients, and the study of lymph node metastasis (LN) with extranodal tumor deposits (ENTD) from an evolutionary perspective remain weakly explored. Here, we recruited 68 patients with RCC (18), LCC (20) and RC (30). We performed high-depth whole exome sequencing (WES) of 206 tumor regions including 176 primary tumors, 19 LN and 11 ENTD samples. Our results showed ITH with a Darwinian pattern of evolution, and the evolution pattern of LCC and RC was more complex and divergent than that of RCC. Genetic and evolutionary evidence indicated that both LN and ENTD originated from different clones. Moreover, ENTD was a distinct entity from LN and evolved later.
Article
Context.— The presence of allogeneic contamination impacts clinical reporting in cancer next-generation sequencing specimens. Although consensus guidelines recommend the identification of contaminating DNA as a part of quality control, implementation of contamination assessment methods in clinical molecular diagnostic laboratories has not been reported in the literature. Objective.— To develop and implement a method to assess allogeneic contamination in clinical cancer next-generation sequencing specimens. Design.— We describe a method to detect contamination based on the evaluation of single-nucleotide polymorphic sites from tumor-only specimens. We validate this method and apply it to a large cohort of cancer sequencing specimens. Results.— Identification of specimen contamination is validated via in silico and in vitro mixtures, and reference range and reproducibility are established in a panel of normal specimens. The algorithm accurately detects an episode of systemic contamination due to reagent impurity. We prospectively apply this algorithm across 7571 clinical cancer specimens from a targeted next-generation sequencing panel, in which 262 specimens (3.5%) are predicted to be affected by greater than 5% contamination. Conclusions.— Allogeneic contamination can be inferred from intrinsic cancer next-generation sequencing data without paired normal sequencing. The adoption of this approach can be useful as a quality control measure for laboratories performing clinical next-generation sequencing.
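The idea of inferring contamination from single-nucleotide polymorphic sites can be illustrated with a small sketch. This is not the method of the study above or of ContEst; it is a simplified grid-search maximum-likelihood estimate under stated assumptions: every site is homozygous reference in the true sample, the contaminant's alleles follow a known population alternate-allele frequency, and sequencing errors occur at a fixed rate. The function name `estimate_contamination`, its parameters, and the input tuple layout are hypothetical.

```python
import math


def estimate_contamination(sites, error_rate=0.001, grid_size=1001):
    """Grid-search estimate of the contamination fraction c in [0, 1].

    sites: iterable of (alt_reads, total_reads, pop_alt_freq) tuples taken at
    positions assumed to be homozygous reference in the true sample.
    error_rate: assumed per-base error probability (must be > 0).
    """
    sites = list(sites)
    best_c, best_ll = 0.0, -math.inf
    for i in range(grid_size):
        c = i / (grid_size - 1)  # equally spaced contamination hypotheses
        ll = 0.0
        for alt, depth, freq in sites:
            # A read shows the alternate allele if it comes from the
            # contaminant carrying that allele (and is read correctly),
            # or if a reference-allele read is misread.
            p = c * freq * (1 - error_rate) + (1 - c * freq) * error_rate
            # Binomial log-likelihood; the constant coefficient is omitted.
            ll += alt * math.log(p) + (depth - alt) * math.log(1 - p)
        if ll > best_ll:
            best_c, best_ll = c, ll
    return best_c
```

With 20 sites that each show 5 alternate reads out of 100 at a population frequency of 0.5, the estimate lands near 10% contamination; with no alternate reads at all it is 0.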
Article
Full-text available
Despite initial responses [1–3], most melanoma patients develop resistance [4] to immune checkpoint blockade (ICB). To understand the evolution of resistance, we studied 37 tumor samples over 9 years from a patient with metastatic melanoma with complete clinical response to ICB followed by delayed recurrence and death. Phylogenetic analysis revealed co-evolution of seven lineages with multiple convergent, but independent resistance-associated alterations. All recurrent tumors emerged from a lineage characterized by loss of chromosome 15q, with post-treatment clones acquiring additional genomic driver events. Deconvolution of bulk RNA sequencing and highly multiplexed immunofluorescence (t-CyCIF) revealed differences in immune composition among different lineages. Imaging revealed a vasculogenic mimicry phenotype in NGFR-high tumor cells with high PD-L1 expression in close proximity to immune cells. Rapid autopsy demonstrated two distinct NGFR spatial patterns with high polarity and proximity to immune cells in subcutaneous tumors versus a diffuse spatial pattern in lung tumors, suggesting different roles of this neural-crest-like program in different tumor microenvironments. Broadly, this study establishes a high-resolution map of the evolutionary dynamics of resistance to ICB, characterizes a de-differentiated neural-crest tumor population in melanoma immunotherapy resistance and describes site-specific differences in tumor–immune interactions via longitudinal analysis of a patient with melanoma with an unusual clinical course.
Article
Full-text available
Genomics of radiation-induced damage. The potential adverse effects of exposures to radioactivity from nuclear accidents can include acute consequences such as radiation sickness, as well as long-term sequelae such as increased risk of cancer. There have been a few studies examining transgenerational risks of radiation exposure, but the results have been inconclusive. Morton et al. analyzed papillary thyroid tumors, normal thyroid tissue, and blood from hundreds of survivors of the Chernobyl nuclear accident and compared them against those of unexposed patients. The findings offer insight into the process of radiation-induced carcinogenesis and characteristic patterns of DNA damage associated with environmental radiation exposure. In a separate study, Yeager et al. analyzed the genomes of 130 children and parents from families in which one or both parents had experienced gonadal radiation exposure related to the Chernobyl accident and the children were conceived between 1987 and 2002. Reassuringly, the authors did not find an increase in new germline mutations in this population. Science, this issue p. eabg2538, p. 725
Article
Full-text available
Multiple myeloma is an incurable malignancy of plasma cells, and its pathogenesis is poorly understood. Here we report the massively parallel sequencing of 38 tumour genomes and their comparison to matched normal DNAs. Several new and unexpected oncogenic mechanisms were suggested by the pattern of somatic mutation across the data set. These include the mutation of genes involved in protein translation (seen in nearly half of the patients), genes involved in histone methylation, and genes involved in blood coagulation. In addition, a broader than anticipated role of NF-κB signalling was indicated by mutations in 11 members of the NF-κB pathway. Of potential immediate clinical relevance, activating mutations of the kinase BRAF were observed in 4% of patients, suggesting the evaluation of BRAF inhibitors in multiple myeloma clinical trials. These results indicate that cancer genome sequencing of large collections of samples will yield new insights into cancer not anticipated by existing knowledge.
Article
Full-text available
A catalogue of molecular aberrations that cause ovarian cancer is critical for developing and deploying therapies that will improve patients' lives. The Cancer Genome Atlas project has analysed messenger RNA expression, microRNA expression, promoter methylation and DNA copy number in 489 high-grade serous ovarian adenocarcinomas and the DNA sequences of exons from coding genes in 316 of these tumours. Here we report that high-grade serous ovarian cancer is characterized by TP53 mutations in almost all tumours (96%); low prevalence but statistically recurrent somatic mutations in nine further genes including NF1, BRCA1, BRCA2, RB1 and CDK12; 113 significant focal DNA copy number aberrations; and promoter methylation events involving 168 genes. Analyses delineated four ovarian cancer transcriptional subtypes, three microRNA subtypes, four promoter methylation subtypes and a transcriptional signature associated with survival duration, and shed new light on the impact that tumours with BRCA1/2 (BRCA1 or BRCA2) and CCNE1 aberrations have on survival. Pathway analyses suggested that homologous recombination is defective in about half of the tumours analysed, and that NOTCH and FOXM1 signalling are involved in serous ovarian cancer pathophysiology.
Article
Full-text available
Prostate cancer is the second most common cause of male cancer deaths in the United States. However, the full range of prostate cancer genomic alterations is incompletely characterized. Here we present the complete sequence of seven primary human prostate cancers and their paired normal counterparts. Several tumours contained complex chains of balanced (that is, 'copy-neutral') rearrangements that occurred within or adjacent to known cancer genes. Rearrangement breakpoints were enriched near open chromatin, androgen receptor and ERG DNA binding sites in the setting of the ETS gene fusion TMPRSS2-ERG, but inversely correlated with these regions in tumours lacking ETS fusions. This observation suggests a link between chromatin or transcriptional regulation and the genesis of genomic aberrations. Three tumours contained rearrangements that disrupted CADM2, and four harboured events disrupting either PTEN (unbalanced events), a prostate tumour suppressor, or MAGI2 (balanced events), a PTEN interacting protein not previously implicated in prostate tumorigenesis. Thus, genomic rearrangements may arise from transcriptional or chromatin aberrancies and engage prostate tumorigenic mechanisms.
Article
Full-text available
The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: rd@sanger.ac.uk
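As a rough illustration of the format described above, the sketch below splits one SAM alignment line into its 11 mandatory tab-separated fields and checks the unmapped bit (0x4) of the FLAG field. It is a minimal standard-library parser written for this example, not part of SAMtools; the names `SamRecord`, `parse_sam_line` and `is_unmapped` are hypothetical, and optional tag fields after column 11 are ignored.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory fields of a SAM alignment line."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


def is_unmapped(record: SamRecord) -> bool:
    """True when the 0x4 bit of FLAG is set (segment unmapped)."""
    return bool(record.flag & 0x4)
```

For real data a maintained library is the safer choice; this sketch only shows how the columns are laid out.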
Article
Full-text available
Mapping the vast quantities of short sequence fragments produced by next-generation sequencing platforms is a challenge. What programs are available and how do they work?
Article
Full-text available
Targeting genomic loci by massively parallel sequencing requires new methods to enrich templates to be sequenced. We developed a capture method that uses biotinylated RNA 'baits' to fish targets out of a 'pond' of DNA fragments. The RNA is transcribed from PCR-amplified oligodeoxynucleotides originally synthesized on a microarray, generating sufficient bait for multiple captures at concentrations high enough to drive the hybridization. We tested this method with 170-mer baits that target >15,000 coding exons (2.5 Mb) and four regions (1.7 Mb total) using Illumina sequencing as read-out. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper. The uniformity was such that approximately 60% of target bases in the exonic 'catch', and approximately 80% in the regional catch, had at least half the mean coverage. One lane of Illumina sequence was sufficient to call high-confidence genotypes for 89% of the targeted exon space.
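The uniformity figure quoted above (the share of target bases with at least half the mean coverage) is simple to compute from per-base depths. The sketch below assumes the depths are already available as a plain list of integers, one per target base; the function name `fraction_at_least_half_mean` is hypothetical.

```python
def fraction_at_least_half_mean(depths):
    """Fraction of bases whose depth is at least half the mean depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / len(depths)
    threshold = mean_depth / 2
    covered = sum(1 for depth in depths if depth >= threshold)
    return covered / len(depths)
```

Perfectly even coverage gives 1.0, while coverage concentrated on a few bases drives the value toward 0.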
Article
Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
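The MapReduce pattern mentioned above, applied to a coverage calculator, can be sketched in a few lines. This is not the GATK API (GATK is a Java framework); it is a plain-Python illustration in which each read is assumed to be a `(start, length)` pair on a single contig, the map step emits the positions a read covers, and the reduce step sums those counts. All function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_read(read):
    """Map step: emit a count of 1 for every position the read covers."""
    start, length = read
    return Counter(range(start, start + length))


def reduce_counts(accumulated, partial):
    """Reduce step: add one read's per-position counts to the running total."""
    accumulated.update(partial)  # Counter.update adds counts
    return accumulated


def coverage(reads):
    """Per-position depth for an iterable of (start, length) reads."""
    return reduce(reduce_counts, map(map_read, reads), Counter())
```

Because the map step treats each read independently and the reduce step is a simple sum, the work could be split across chunks of reads and merged afterwards, which is the property the abstract attributes to the framework's parallelization.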
Article
Li, H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079.
McKenna, A. et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303.
TCGA Research Network (2011) Integrated genomic analyses of ovarian carcinoma.