Samuel Levy

Johns Hopkins University, Baltimore, MD, USA

Publications (22) · 310.76 Total impact

  • Article: Distinct patterns of somatic alterations in a lymphoblastoid and a tumor genome derived from the same individual.
    ABSTRACT: Although patterns of somatic alterations have been reported for tumor genomes, little is known on how they compare with alterations present in non-tumor genomes. A comparison of the two would be crucial to better characterize the genetic alterations driving tumorigenesis. We sequenced the genomes of a lymphoblastoid (HCC1954BL) and a breast tumor (HCC1954) cell line derived from the same patient and compared the somatic alterations present in both. The lymphoblastoid genome presents a comparable number and similar spectrum of nucleotide substitutions to that found in the tumor genome. However, a significant difference in the ratio of non-synonymous to synonymous substitutions was observed between both genomes (P = 0.031). Protein-protein interaction analysis revealed that mutations in the tumor genome preferentially affect hub-genes (P = 0.0017) and are co-selected to present synergistic functions (P < 0.0001). KEGG analysis showed that in the tumor genome most mutated genes were organized into signaling pathways related to tumorigenesis. No such organization or synergy was observed in the lymphoblastoid genome. Our results indicate that endogenous mutagens and replication errors can generate the overall number of mutations required to drive tumorigenesis and that it is the combination rather than the frequency of mutations that is crucial to complete tumorigenic transformation.
    Nucleic Acids Research 04/2011; 39(14):6056-68. · 8.03 Impact Factor
  • Article: A mechanism for TCR sharing between T cell subsets and individuals revealed by pyrosequencing.
    ABSTRACT: The human naive T cell repertoire is the repository of a vast array of TCRs. However, the factors that shape their hierarchical distribution and relationship with the memory repertoire remain poorly understood. In this study, we used polychromatic flow cytometry to isolate highly pure memory and naive CD8(+) T cells, stringently defined with multiple phenotypic markers, and used deep sequencing to characterize corresponding portions of their respective TCR repertoires from four individuals. The extent of interindividual TCR sharing and the overlap between the memory and naive compartments within individuals were determined by TCR clonotype frequencies, such that higher-frequency clonotypes were more commonly shared between compartments and individuals. TCR clonotype frequencies were, in turn, predicted by the efficiency of their production during V(D)J recombination. Thus, convergent recombination shapes the TCR repertoire of the memory and naive T cell pools, as well as their interrelationship within and between individuals.
    The Journal of Immunology 03/2011; 186(7):4285-94. · 5.79 Impact Factor
  • Article: Systematic detection of putative tumor suppressor genes through the combined use of exome and transcriptome sequencing.
    ABSTRACT: To identify potential tumor suppressor genes, genome-wide data from exome and transcriptome sequencing were combined to search for genes with loss of heterozygosity and allele-specific expression. The analysis was conducted on the breast cancer cell line HCC1954, and a lymphoblast cell line from the same individual, HCC1954BL. By comparing exome sequences from the two cell lines, we identified loss of heterozygosity events at 403 genes in HCC1954 and at one gene in HCC1954BL. The combination of exome and transcriptome sequence data also revealed 86 and 50 genes with allele specific expression events in HCC1954 and HCC1954BL, which comprise 5.4% and 2.6% of genes surveyed, respectively. Many of these genes identified by loss of heterozygosity and allele-specific expression are known or putative tumor suppressor genes, such as BRCA1, MSH3 and SETX, which participate in DNA repair pathways. Our results demonstrate that the combined application of high throughput sequencing to exome and allele-specific transcriptome analysis can reveal genes with known tumor suppressor characteristics, and a shortlist of novel candidates for the study of tumor suppressor activities.
    Genome biology 11/2010; 11(11):R114. · 6.63 Impact Factor
  • Article: Towards a comprehensive structural variation map of an individual human genome.
    ABSTRACT: Several genomes have now been sequenced, with millions of genetic variants annotated. While significant progress has been made in mapping single nucleotide polymorphisms (SNPs) and small (<10 bp) insertion/deletions (indels), the annotation of larger structural variants has been less comprehensive. It is still unclear to what extent a typical genome differs from the reference assembly, and the analysis of the genomes sequenced to date have shown varying results for copy number variation (CNV) and inversions. We have combined computational re-analysis of existing whole genome sequence data with novel microarray-based analysis, and detect 12,178 structural variants covering 40.6 Mb that were not reported in the initial sequencing of the first published personal genome. We estimate a total non-SNP variation content of 48.8 Mb in a single genome. Our results indicate that this genome differs from the consensus reference sequence by approximately 1.2% when considering indels/CNVs, 0.1% by SNPs and approximately 0.3% by inversions. The structural variants impact 4,867 genes, and >24% of structural variants would not be imputed by SNP-association. Our results indicate that a large number of structural variants have been unreported in the individual genomes published to date. This significant extent and complexity of structural variants, as well as the growing recognition of their medical relevance, necessitate they be actively studied in health-related analyses of personal genomes. The new catalogue of structural variants generated for this genome provides a crucial resource for future comparison studies.
    Genome biology 01/2010; 11(5):R52. · 6.63 Impact Factor
  • Article: Expression profiling of the ovarian surface kinome reveals candidate genes for early neoplastic changes.
    ABSTRACT: We tested the hypothesis that coordinated up-regulation or down-regulation of several ovarian cell surface kinases may provide clues for better understanding of the disease and help in rational design of therapeutic targets. We compared the expression signature of 69 surface kinases in normal ovarian surface epithelial cells (OSE), with OSE from patients at high risk and with ovarian cancer. Seven surface kinases, ALK, EPHA5, EPHB1, ERBB4, INSRR, PTK, and TGFbetaR1 displayed a distinctive linear trend in expression from normal, high-risk, and malignant epithelium. We confirmed these results using semiquantitative reverse transcription-polymerase chain reaction and tissue array of 202 ovarian cancer samples. A strong correlate was shown between disease-free survival and the expression of ERBB4. DNA sequencing revealed two novel mutations in ERBB4 in two cancer samples. A distinct subset of the ovarian surface kinome is altered in the transition from high risk to invasive cancer and genetic mutation is not a dominant mechanism for these modifications. These results have significant implications for early detection and targeted therapeutic approaches for women at high risk of developing ovarian cancer.
    Translational oncology 12/2009; 2(4):341-9. · 3.40 Impact Factor
  • Article: An agenda for personalized medicine.
    Nature 10/2009; 461(7265):724-6. · 36.28 Impact Factor
  • Article: Mobile elements create structural variation: analysis of a complete human genome.
    ABSTRACT: Structural variants (SVs) are common in the human genome. Because approximately half of the human genome consists of repetitive, transposable DNA sequences, it is plausible that these elements play an important role in generating SVs in humans. Sequencing of the diploid genome of one individual human (HuRef) affords us the opportunity to assess, for the first time, the impact of mobile elements on SVs in an individual in a thorough and unbiased fashion. In this study, we systematically evaluated more than 8000 SVs to identify mobile element-associated SVs as small as 100 bp and specific to the HuRef genome. Combining computational and experimental analyses, we identified and validated 706 mobile element insertion events (including Alu, L1, SVA elements, and nonclassical insertions), which added more than 305 kb of new DNA sequence to the HuRef genome compared with the Human Genome Project (HGP) reference sequence (hg18). We also identified 140 mobile element-associated deletions, which removed approximately 126 kb of sequence from the HuRef genome. Overall, approximately 10% of the HuRef-specific indels larger than 100 bp are caused by mobile element-associated events. More than one-third of the insertion/deletion events occurred in genic regions, and new Alu insertions occurred in exons of three human genes. Based on the number of insertions and the estimated time to the most recent common ancestor of HuRef and the HGP reference genome, we estimated the Alu, L1, and SVA retrotransposition rates to be one in 21 births, 212 births, and 916 births, respectively. This study presents the first comprehensive analysis of mobile element-related structural variants in the complete DNA sequence of an individual and demonstrates that mobile elements play an important role in generating inter-individual structural variation.
    Genome Research 06/2009; 19(9):1516-26. · 13.61 Impact Factor
  • Article: NA-Seq: a discovery tool for the analysis of chromatin structure and dynamics during differentiation.
    ABSTRACT: It is well established that epigenetic modulation of genome accessibility in chromatin occurs during biological processes. Here we describe a method based on restriction enzymes and next-generation sequencing for identifying accessible DNA elements using a small amount of starting material, and use it to examine myeloid differentiation of primary human CD34+ cells. The accessibility of several classes of cis-regulatory elements was a predictive marker of in vivo DNA binding by transcription factors, and was associated with distinct patterns of histone posttranslational modifications. We also mapped large chromosomal domains with differential accessibility in progenitors and maturing cells. Accessibility became restricted during differentiation, correlating with a decreased number of expressed genes and loss of regulatory potential. Our data suggest that a permissive chromatin structure in multipotent cells is progressively and selectively closed during differentiation, and illustrate the use of our method for the identification of functional cis-regulatory elements.
    Developmental cell 04/2009; 16(3):466-81. · 13.36 Impact Factor
  • Article: Evaluation of next generation sequencing platforms for population targeted sequencing studies.
    ABSTRACT: Next generation sequencing (NGS) platforms are currently being utilized for targeted sequencing of candidate genes or genomic intervals to perform sequence-based association studies. To evaluate these platforms for this application, we analyzed human sequence generated by the Roche 454, Illumina GA, and the ABI SOLiD technologies for the same 260 kb in four individuals. Local sequence characteristics contribute to systematic variability in sequence coverage (>100-fold difference in per-base coverage), resulting in patterns for each NGS technology that are highly correlated between samples. A comparison of the base calls to 88 kb of overlapping ABI 3730xL Sanger sequence generated for the same samples showed that the NGS platforms all have high sensitivity, identifying >95% of variant sites. At high coverage depth, base calling errors are systematic, resulting from local sequence contexts; as the coverage is lowered, additional 'random sampling' errors in base calling occur. Our study provides important insights into systematic biases and data variability that need to be considered when utilizing NGS platforms for population targeted sequencing studies.
    Genome biology 04/2009; 10(3):R32. · 6.63 Impact Factor
  • Article: Transcriptome-guided characterization of genomic rearrangements in a breast cancer cell line.
    ABSTRACT: We have identified new genomic alterations in the breast cancer cell line HCC1954, using high-throughput transcriptome sequencing. With 120 Mb of cDNA sequences, we were able to identify genomic rearrangement events leading to fusions or truncations of genes including MRE11 and NSD1, genes already implicated in oncogenesis, and 7 rearrangements involving other additional genes. This approach demonstrates that high-throughput transcriptome sequencing is an effective strategy for the characterization of genomic rearrangements in cancers.
    Proceedings of the National Academy of Sciences 02/2009; 106(6):1886-91. · 9.68 Impact Factor
  • Article: Human genetics: Individual genomes diversify.
    Samuel Levy, Robert L Strausberg
    Nature 12/2008; 456(7218):49-51. · 36.28 Impact Factor
  • Article: The HuRef Browser: a web resource for individual human genomics.
    ABSTRACT: The HuRef Genome Browser is a web application for the navigation and analysis of the previously published genome of a human individual, termed HuRef. The browser provides a comparative view between the NCBI human reference sequence and the HuRef assembly, and it enables the navigation of the HuRef genome in the context of HuRef, NCBI and Ensembl annotations. Single nucleotide polymorphisms, indels, inversions, structural and copy-number variations are shown in the context of existing functional annotations on either genome in the comparative view. Demonstrated here are some potential uses of the browser to enable a better understanding of individual human genetic variation. The browser provides full access to the underlying reads with sequence and quality information, the genome assembly and the evidence supporting the identification of DNA polymorphisms. The HuRef Browser is a unique and versatile tool for browsing genome assemblies and studying individual human sequence variation in a diploid context. The browser is available online at http://huref.jcvi.org.
    Nucleic Acids Research 12/2008; 37(Database issue):D1018-24. · 8.03 Impact Factor
  • Article: Genetic variation in an individual human exome.
    ABSTRACT: There is much interest in characterizing the variation in a human individual, because this may elucidate what contributes significantly to a person's phenotype, thereby enabling personalized genomics. We focus here on the variants in a person's 'exome,' which is the set of exons in a genome, because the exome is believed to harbor much of the functional variation. We provide an analysis of the approximately 12,500 variants that affect the protein coding portion of an individual's genome. We identified approximately 10,400 nonsynonymous single nucleotide polymorphisms (nsSNPs) in this individual, of which approximately 15-20% are rare in the human population. We predict approximately 1,500 nsSNPs affect protein function and these tend to be heterozygous, rare, or novel. Of the approximately 700 coding indels, approximately half tend to have lengths that are a multiple of three, which causes insertions/deletions of amino acids in the corresponding protein, rather than introducing frameshifts. Coding indels also occur frequently at the termini of genes, so even if an indel causes a frameshift, an alternative start or stop site in the gene can still be used to make a functional protein. In summary, we reduced the set of approximately 12,500 nonsilent coding variants by approximately 8-fold to a set of variants that are most likely to have major effects on their proteins' functions. This is our first glimpse of an individual's exome and a snapshot of the current state of personalized genomics. The majority of coding variants in this individual are common and appear to be functionally neutral. Our results also indicate that some variants can be used to improve the current NCBI human reference genome. As more genomes are sequenced, many rare variants and non-SNP variants will be discovered. We present an approach to analyze the coding variation in humans by proposing multiple bioinformatic methods to hone in on possible functional variation.
    PLoS Genetics 09/2008; 4(8):e1000160. · 8.69 Impact Factor
  • Article: Emerging DNA sequencing technologies for human genomic medicine.
    ABSTRACT: The completion of draft sequences of the human genome represented a remarkable achievement for automated DNA sequencing based on Sanger technology. However, the future requires substantial leaps in sequencing technology such that whole genome sequencing will become a standard component of biomedical research and patient care. In this review we describe current advances that are in early stages of development, but that point toward technology that will enable the onset of genomic medicine, which encompasses strategies for preventative medicine and intervention based on complete knowledge of an individual's genome.
    Drug Discovery Today 08/2008; 13(13-14):569-77. · 6.83 Impact Factor
  • Article: Consensus generation and variant detection by Celera Assembler.
    ABSTRACT: We present an algorithm to identify allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences, and to produce a set of haploid consensus sequences rather than a single consensus sequence. Existing WGS assemblers take a column-by-column approach to consensus generation, and produce a single consensus sequence which can be inconsistent with the underlying haploid alleles, and inconsistent with any of the aligned sequence reads. Our new algorithm uses a dynamic windowing approach. It detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation, assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele. This algorithm was used to produce the first diploid genome sequence of an individual human. It can also be applied to assemblies of multiple diploid individuals and hybrid assemblies of multiple haploid organisms. Being applied to the individual human genome assembly, the new algorithm detects exactly two confirmed alleles and reports two consensus sequences in 98.98% of the total number 2,033,311 detected regions of sequence variation. In 33,269 out of 460,373 detected regions of size >1 bp, it fixes the constructed errors of a mosaic haploid representation of a diploid locus as produced by the original Celera Assembler consensus algorithm. Using an optimized procedure calibrated against 1 506 344 known SNPs, it detects 438 814 new heterozygous SNPs with false positive rate 12%. The open source code is available at: http://wgs-assembler.cvs.sourceforge.net/wgs-assembler/
    Bioinformatics 04/2008; 24(8):1035-40. · 5.47 Impact Factor
  • Article: Novel computational methods for increasing PCR primer design effectiveness in directed sequencing.
    ABSTRACT: Polymerase chain reaction (PCR) is used in directed sequencing for the discovery of novel polymorphisms. As the first step in PCR directed sequencing, effective PCR primer design is crucial for obtaining high-quality sequence data for target regions. Since current computational primer design tools are not fully tuned with stable underlying laboratory protocols, researchers may still be forced to iteratively optimize protocols for failed amplifications after the primers have been ordered. Furthermore, potentially identifiable factors which contribute to PCR failures have yet to be elucidated. This inefficient approach to primer design is further intensified in a high-throughput laboratory, where hundreds of genes may be targeted in one experiment. We have developed a fully integrated computational PCR primer design pipeline that plays a key role in our high-throughput directed sequencing pipeline. Investigators may specify target regions defined through a rich set of descriptors, such as Ensembl accessions and arbitrary genomic coordinates. Primer pairs are then selected computationally to produce a minimal amplicon set capable of tiling across the specified target regions. As part of the tiling process, primer pairs are computationally screened to meet the criteria for success with one of two PCR amplification protocols. In the process of improving our sequencing success rate, which currently exceeds 95% for exons, we have discovered novel and accurate computational methods capable of identifying primers that may lead to PCR failures. We reveal the laboratory protocols and their associated, empirically determined computational parameters, as well as describe the novel computational methods which may benefit others in future primer design research. The high-throughput PCR primer design pipeline has been very successful in providing the basis for high-quality directed sequencing results and for minimizing costs associated with labor and reprocessing. The modular architecture of the primer design software has made it possible to readily integrate additional primer critique tests based on iterative feedback from the laboratory. As a result, the primer design software, coupled with the laboratory protocols, serves as a powerful tool for low and high-throughput primer design to enable successful directed sequencing.
    BMC Bioinformatics 02/2008; 9:191. · 2.75 Impact Factor
  • Article: The diploid genome sequence of an individual human.
    ABSTRACT: Presented here is a genome sequence of an individual human. It was produced from approximately 32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2-206 bp), 292,102 heterozygous insertion/deletion events (indels)(1-571 bp), 559,473 homozygous indels (1-82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.
    PLoS Biology 09/2007; 5(10):e254. · 11.45 Impact Factor
  • Article: Promoting transcriptome diversity.
    Robert L Strausberg, Samuel Levy
    ABSTRACT: Although the number of protein-encoding human genes is more limited than many had estimated, the human transcript repertoire is much more diverse than anticipated. In part, transcript diversity is generated through the use of alternative promoters and alternate splicing. In addition, based on discoveries using technologies such as full-length cDNA libraries and whole genome tiling microarrays, it is now likely that non-protein-encoding transcripts comprise a substantial fraction of the human RNA population. Much attention is currently focused on understanding the role of alternative promoters in generating transcript diversity, both for non-protein-encoding (ncRNAs) and protein-encoding RNAs.
    Genome Research 08/2007; 17(7):965-8. · 13.61 Impact Factor
  • Article: Sequence survey of receptor tyrosine kinases reveals mutations in glioblastomas.
    ABSTRACT: It is now clear that tyrosine kinases represent attractive targets for therapeutic intervention in cancer. Recent advances in DNA sequencing technology now provide the opportunity to survey mutational changes in cancer in a high-throughput and comprehensive manner. Here we report on the sequence analysis of members of the receptor tyrosine kinase (RTK) gene family in the genomes of glioblastoma brain tumors. Previous studies have identified a number of molecular alterations in glioblastoma, including amplification of the RTK epidermal growth factor receptor. We have identified mutations in two other RTKs: (i) fibroblast growth receptor 1, including the first mutations in the kinase domain in this gene observed in any cancer, and (ii) a frameshift mutation in the platelet-derived growth factor receptor-alpha gene. Fibroblast growth receptor 1, platelet-derived growth factor receptor-alpha, and epidermal growth factor receptor are all potential entry points to the phosphatidylinositol 3-kinase and mitogen-activated protein kinase intracellular signaling pathways already known to be important for neoplasia. Our results demonstrate the utility of applying DNA sequencing technology to systematically assess the coding sequence of genes within cancer genomes.
    Proceedings of the National Academy of Sciences 11/2005; 102(40):14344-9. · 9.68 Impact Factor
  • Article: Environmental genome shotgun sequencing of the Sargasso Sea.
    ABSTRACT: We have applied "whole-genome shotgun sequencing" to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new rhodopsin-like photoreceptors. Variation in species present and stoichiometry suggests substantial oceanic microbial diversity.
    Science 05/2004; 304(5667):66-74. · 31.20 Impact Factor