Mark DePristo's research while affiliated with Google Inc. and other places

Publications (152)

Article
Full-text available
Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here...
Article
Physicians are increasingly using clinical sequencing tests to establish diagnoses of patients who might have genetic disorders, which means that accuracy of sequencing and interpretation are important elements in ensuring the benefits of genetic testing. In the past, clinical sequencing tests were designed to detect specific prespecified or unknow...
Article
Full-text available
Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transform...
Article
Full-text available
The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average le...
Preprint
Discriminative neural networks offer little or no performance guarantees when deployed on data not generated by the same process as the training distribution. On such out-of-distribution (OOD) inputs, the prediction may not only be erroneous, but confidently so, limiting the safe deployment of classifiers in real-world applications. One such challe...
Preprint
Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate 1/3 of microbial protein sequences, hampering our ability to exploit sequences collected from diverse...
Preprint
Full-text available
The major DNA sequencing technologies in use today produce either highly-accurate short reads or noisy long reads. We developed a protocol based on single-molecule, circular consensus sequencing (CCS) to generate highly-accurate (99.8%) long reads averaging 13.5 kb and applied it to sequence the well-characterized human HG002/NA24385. We optimized...
Article
Full-text available
Here we present deep-learning techniques for healthcare, centering our discussion on deep learning in computer vision, natural language processing, reinforcement learning, and generalized methods. We describe how these computational techniques can impact a few key areas of medicine and explore how to build end-to-end systems. Our discussion of comp...
Article
Full-text available
Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relations...
Preprint
Motivation: Inferring properties of biological sequences--such as determining the species-of-origin of a DNA sequence or the function of an amino-acid sequence--is a core task in many bioinformatics applications. These tasks are often solved using string-matching to map query sequences to labeled database sequences or via Hidden Markov Model-like pa...
Article
The human genome is now investigated through high-throughput functional assays and through the generation of population genomic data. These advances support the identification of functional genetic variants and the prediction of traits (e.g., deleterious variants and disease). This review summarizes lessons learned from the large-scale analyses of g...
Preprint
Full-text available
Next-generation sequencing (NGS) is a rapidly evolving set of technologies that can be used to determine the sequence of an individual’s genome¹ by calling genetic variants present in an individual using billions of short, errorful sequence reads². Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and...
Article
Full-text available
A major challenge in evaluating the contribution of rare variants to complex disease is identifying enough copies of the rare alleles to permit informative statistical analysis. To investigate the contribution of rare variants to the risk of type 2 diabetes (T2D) and related traits, we performed deep whole-genome analysis of 1,034 members of 20 lar...
Article
Full-text available
To investigate the genetic basis of type 2 diabetes (T2D) to high resolution, the GoT2D and T2D-GENES consortia catalogued variation from whole-genome sequencing of 2,657 European individuals and exome sequencing of 12,940 individuals of multiple ancestries. Over 27M SNPs, indels, and structural variants were identified, including 99% of low-freque...
Article
Full-text available
Comprehensive disease gene discovery in both common and rare diseases will require the efficient and accurate detection of all classes of genetic variation across tens to hundreds of thousands of human samples. We describe here a novel assembly-based approach to variant calling, the GATK HaplotypeCaller (HC) and Reference Confidence Model (RCM), th...
Article
Full-text available
To identify novel coding association signals and facilitate characterization of mechanisms influencing glycemic traits and type 2 diabetes risk, we analyzed 109,215 variants derived from exome array genotyping together with an additional 390,225 variants from exome sequence in up to 39,339 normoglycemic individuals from five ancestry groups. We ide...
Article
Full-text available
Academy of Finland (129293, 128315, 129330, 131593, 139635, 139635, 121584, 126925, 124282, 129378, 258753); Action on Hearing Loss (G51); Ahokas Foundation; American Diabetes Association (#7-12-MN-02); Atlantic Canada Opportunities Agency; Augustinus foundation; Becket foundation; Benzon Foundation; Biomedical Research Council; British Heart Found...
Preprint
Copy number variants (CNVs) are an important type of genetic variation and play a causal role in many diseases. However, they are also notoriously difficult to identify accurately from next-generation sequencing (NGS) data. For larger CNVs, genotyping arrays provide reasonable benchmark data, but NGS allows us to assay a far larger number of small...
Article
Full-text available
Germline mutation detection from human DNA sequence data is challenging due to the rarity of such events relative to the intrinsic error rates of sequencing technologies and the uneven coverage across the genome. We developed PhaseByTransmission (PBT) to identify de novo single nucleotide variants and short insertions and deletions (indels) from se...
Article
Full-text available
Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortiu...
Article
Full-text available
The genetic architecture of common traits, including the number, frequency, and effect sizes of inherited variants that contribute to individual risk, has been long debated. Genome-wide association studies have identified scores of common variants associated with type 2 diabetes, but in aggregate, these explain only a fraction of the heritability o...
Article
The genetic architecture of common traits, including the number, frequency, and effect sizes of inherited variants that contribute to individual risk, has been long debated. Genome-wide association studies have identified scores of common variants associated with type 2 diabetes, but in aggregate, these explain only a fraction of the heritability o...
Preprint
Full-text available
Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) sequence data for 60,706 individuals of diverse ethnicities generated as part of the Exome Aggregation Consortium (...
Article
We report the sequences of 1,244 human Y chromosomes randomly ascertained from 26 worldwide populations by the 1000 Genomes Project. We discovered more than 65,000 variants, including single-nucleotide variants, multiple-nucleotide variants, insertions and deletions, short tandem repeats, and copy number variants. Of these, copy number variants con...
Article
Full-text available
Myocardial infarction (MI), a leading cause of death around the world, displays a complex pattern of inheritance. When MI occurs early in life, genetic inheritance is a major component to risk. Previously, rare mutations in low-density lipoprotein (LDL) genes have been shown to contribute to MI risk in individual families, whereas common variants a...
Article
Full-text available
Spontaneously arising (de novo) mutations have an important role in medical genetics. For diseases with extensive locus heterogeneity, such as autism spectrum disorders (ASDs), the signal from de novo mutations is distributed across many genes, making it difficult to distinguish disease-relevant mutations from background variation. Here we provide...
Article
Background: Plasma triglyceride levels are heritable and are correlated with the risk of coronary heart disease. Sequencing of the protein-coding regions of the human genome (the exome) has the potential to identify rare mutations that have a large effect on phenotype. Methods: We sequenced the protein-coding regions of 18,666 genes in each of 3...
Article
Full-text available
Loss-of-function mutations protective against human disease provide in vivo validation of therapeutic targets, but none have yet been described for type 2 diabetes (T2D). Through sequencing or genotyping of ~150,000 individuals across 5 ancestry groups, we identified 12 rare protein-truncating variants in SLC30A8, which encodes an islet zinc transp...
Article
Full-text available
Schizophrenia is a common disease with a complex aetiology, probably involving multiple and heterogeneous genetic factors. Here, by analysing the exome sequences of 2,536 schizophrenia cases and 2,543 controls, we demonstrate a polygenic burden primarily arising from rare (less than 1 in 10,000), disruptive mutations distributed across many genes....
Article
This unit describes how to use BWA and the Genome Analysis Toolkit (GATK) to map genome sequencing data to a reference and produce high-quality variant calls that can be used in downstream analyses. The complete workflow includes the core NGS data processing steps that are necessary to make the raw data suitable for analysis by the GATK, as well as...
Article
Exome sequencing has proven to be a powerful and cost-effective approach for the identification of causal mutations in many patients suffering from rare, severe Mendelian diseases. However, exome analysis unambiguously identifies a causal mutation in only 30–50% of sequenced families, indicating much work remains to be done to increase the yield of...
Data
PCA of Baylor and Broad samples together. First eigenvector versus second eigenvector for Broad and Baylor samples. (TIF)
Data
The p-values of 114 ASD genes. (XLSX)
Data
Histogram of p-values for SKAT and Burden Test. (A) and (B) are SKAT p-values for Broad and Baylor samples, respectively. (C) and (D) are Burden test p-values for Broad and Baylor samples, respectively. Green vertical lines are the 25%, 50% and 75% quantiles of p-values. (TIF)
Data
Full-text available
Comparison of seven individuals called by both Baylor and Broad under different filters. (PDF)
Data
Genes with p-value from the SKAT or Burden Test. (XLSX)
Data
P-values versus Missingness. We used 5500 genes to make this plot. For each gene, we calculate the -log10 p-value and the odds ratio of missingness in cases and controls. The red line is the line fitted to these 5500 observations. (TIF)
Data
Full-text available
Classification tree results for heterozygote calls. (PDF)
Data
Full-text available
Genomic control and based on different types of PC adjustment. (PDF)
Data
PCA from common variants, low-frequency variants, and both types of variants for Baylor samples. Eigenvectors are obtained by applying PCA to all common variants that have no missingness (14,702 variants) (A), all low-frequency variants that have no missingness (8783 variants) (B), and both types of variants (C). The colors are obtained by clustering...
Data
Depth Comparison: Baylor versus Broad. We compare the average sample depth for all non-synonymous variants in the two data sets. (TIF)
Data
Distribution of the genomic control factor λ. By permuting case/control status 100 times, the distribution of λ is obtained based on the 1000 largest genes. The red line shows the mean of the permutation distribution and the green line shows λ obtained from the data using (A) Broad SKAT p-values obtained without eigenvectors; (B) Broad SKAT p-values, wit...
Data
MAF Comparison: Baylor versus Broad. We compare the MAF for 72,758 shared non-synonymous variants in the two data sets. (TIF)
Data
The p-values of genes which have two or more de novo nonsense or missense mutations as reported in [5]. (XLSX)
Data
Full-text available
The required sample sizes by applying meta- and mega-analysis. (PDF)
Data
Full-text available
Additional Information Regarding Methods. Part A gives additional information about sequencing, including data generation and quality control. Part B gives the mathematical exposition of mega- and meta-analysis. Part C provides details for association analysis. (PDF)
Article
Full-text available
We report on results from whole-exome sequencing (WES) of 1,039 subjects diagnosed with autism spectrum disorders (ASD) and 870 controls selected from the NIMH repository to be of similar ancestry to cases. The WES data came from two centers using different methods to produce sequence and to call variants from it. Therefore, an initial goal was to...
Article
To characterize the role of rare complete human knockouts in autism spectrum disorders (ASDs), we identify genes with homozygous or compound heterozygous loss-of-function (LoF) variants (defined as nonsense and essential splice sites) from exome sequencing of 933 cases and 869 controls. We identify a 2-fold increase in complete knockouts of autosom...
Article
Full-text available
By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By...
Article
Full-text available
Pacific Biosciences technology provides a fundamentally new data type that provides the potential to overcome some limitations of current next generation sequencing platforms by providing significantly longer reads, single molecule sequencing, low composition bias and an error profile that is orthogonal to other platforms. With these potential adva...
Data
Properties of resolved SNPs with greater than 25% discordant genotypes. The same data as in Figure S24 but for SNPs classified as disputed when array calls and sequence calls disagree for more than 25% of sample genotypes. (PDF)
Data
Full-text available
Sensitivity and specificity at sites not on the array: actual values. Shown is data analogous to Figure 4a but with actual values rather than normalized values. Calls based on sequence data (Seq) are plotted in red; joint (Joint) calls are plotted in blue. The red bars differ in size because the sites analyzed depend on the array. (a) 381 Eur...
Data
Full-text available
Sensitivity and specificity at heterozygous and homozygous variants. Shown are data analogous to Figure 2ab but with sensitivity and specificity computed separately for variants at which the test sample has a heterozygous and homozygous genotype. (a) Heterozygous genotypes. (b) Homozygous non-reference genotypes. (PDF)
Data
Full-text available
Sensitivity and specificity in coding regions. Shown are data analogous to Figure 2ab but broken into metrics for coding and noncoding variants. (a) Coding variants. (b) Noncoding variants. (PDF)
Data
Full-text available
Sensitivity and specificity of data collection strategies: 41 sample European reference panel. Shown is data analogous to Figure 2 but for 42 European samples rather than 382 samples. As described in Text S1, this closely models the use of a 41 European sample reference panel for imputation (just as our main experiments closely model the use of a...
Data
Full-text available