Ryan Poplin's research while affiliated with Google Inc. and other places
What is this page?
This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
Publications (92)
Physicians are increasingly using clinical sequencing tests to establish diagnoses of patients who might have genetic disorders, which means that accuracy of sequencing and interpretation are important elements in ensuring the benefits of genetic testing. In the past, clinical sequencing tests were designed to detect specific prespecified or unknow...
We introduce a new dataset called SyntheticFur built specifically for machine learning training. The dataset consists of ray traced synthetic fur renders with corresponding rasterized input buffers and simulation data files. We procedurally generated approximately 140,000 images and 15 simulations with Houdini. The images consist of fur groomed wit...
Background
The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data.
Methods
Here...
Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transform...
Discriminative neural networks offer little or no performance guarantees when deployed on data not generated by the same process as the training distribution. On such out-of-distribution (OOD) inputs, the prediction may not only be erroneous, but confidently so, limiting the safe deployment of classifiers in real-world applications. One such challe...
Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relations...
Motivation
Inferring properties of biological sequences--such as determining the species-of-origin of a DNA sequence or the function of an amino-acid sequence--is a core task in many bioinformatics applications. These tasks are often solved using string-matching to map query sequences to labeled database sequences or via Hidden Markov Model-like pa...
Microscopy is a central method in life sciences. Many popular methods, such as antibody labeling, are used to add physical fluorescent labels to specific cellular constituents. However, these approaches have significant drawbacks, including inconsistency; limitations in the number of simultaneous labels because of spectral overlap; and necessary pe...
Next-generation sequencing (NGS) is a rapidly evolving set of technologies that can be used to determine the sequence of an individualās genome ¹ by calling genetic variants present in an individual using billions of short, errorful sequence reads ² . Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and...
Traditionally, medical discoveries are made by observing associations and then designing experiments to test these hypotheses. However, observing and quantifying associations in images can be difficult because of the wide variety of features, patterns, colors, values, shapes in real data. In this paper, we use deep learning, a machine learning tech...
This corrects the article DOI: 10.1038/sdata.2017.179.
Purpose:
We evaluate how deep learning can be applied to extract novel information such as refractive error from retinal fundus imaging.
Methods:
Retinal fundus images used in this study were 45- and 30-degree field of view images from the UK Biobank and Age-Related Eye Disease Study (AREDS) clinical trials, respectively. Refractive error was me...
To investigate the genetic basis of type 2 diabetes (T2D) to high resolution, the GoT2D and T2D-GENES consortia catalogued variation from whole-genome sequencing of 2,657 European individuals and exome sequencing of 12,940 individuals of multiple ancestries. Over 27M SNPs, indels, and structural variants were identified, including 99% of low-freque...
To investigate the genetic basis of type 2 diabetes (T2D) to high resolution, the GoT2D and T2D-GENES consortia catalogued variation from whole-genome sequencing of 2,657 European individuals and exome sequencing of 12,940 individuals of multiple ancestries. Over 27M SNPs, indels, and structural variants were identified, including 99% of low-freque...
Comprehensive disease gene discovery in both common and rare diseases will require the efficient and accurate detection of all classes of genetic variation across tens to hundreds of thousands of human samples. We describe here a novel assembly-based approach to variant calling, the GATK HaplotypeCaller (HC) and Reference Confidence Model (RCM), th...
Mobile element insertions (MEIs) represent ~25% of all structural variants in human genomes. Moreover, when they disrupt genes, MEIs can influence human traits and diseases. Therefore, MEIs should be fully discovered along with other forms of genetic variation in whole genome sequencing (WGS) projects involving population genetics, human diseases,...
Mosquito-borne illnesses such as dengue, chikungunya, and Zika are major global health problems, which are not yet addressable with vaccines and must be countered by reducing mosquito populations. The Sterile Insect Technique (SIT) is a promising alternative to pesticides; however, effective SIT relies on minimal releases of female insects. This pa...
The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An updated reference assembly, GRCh38, was released in...
To identify novel coding association signals and facilitate characterization of mechanisms influencing glycemic traits and type 2 diabetes risk, we analyzed 109,215 variants derived from exome array genotyping together with an additional 390,225 variants from exome sequence in up to 39,339 normoglycemic individuals from five ancestry groups. We ide...
Academy of Finland (129293, 128315, 129330, 131593, 139635, 139635, 121584, 126925, 124282, 129378, 258753); Action on Hearing Loss (G51); Ahokas Foundation; American Diabetes Association (#7-12-MN-02); Atlantic Canada Opportunities Agency; Augustinus foundation; Becket foundation; Benzon Foundation; Biomedical Research Council; British Heart Found...
Copy number variants (CNVs) are an important type of genetic variation and play a causal role in many diseases. However, they are also notoriously difficult to identify accurately from next-generation sequencing (NGS) data. For larger CNVs, genotyping arrays provide reasonable benchmark data, but NGS allows us to assay a far larger number of small...
Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortiu...
The genetic architecture of common traits, including the number, frequency, and effect sizes of inherited variants that contribute to individual risk, has been long debated. Genome-wide association studies have identified scores of common variants associated with type 2 diabetes, but in aggregate, these explain only a fraction of the heritability o...
The genetic architecture of common traits, including the number, frequency, and effect sizes of inherited variants that contribute to individual risk, has been long debated. Genome-wide association studies have identified scores of common variants associated with type 2 diabetes, but in aggregate, these explain only a fraction of the heritability o...
Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) sequence data for 60,706 individuals of diverse ethnicities generated as part of the Exome Aggregation Consortium (...
We report the sequence sof ļæ½244 human Y chromosomes randomly ascertained from 26 worldwide populations by the ļæ½1000 Genomes Project. We discovered more than 65,000 variants, including single-nucleotide variants, multiple-nucleotide variants, insertions and deletions, short tandem repeats, and copy number variants. Of these, copy number variants con...
The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-covera...
The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-covera...
Accurate prediction of the functional effect of genetic variation is critical for clinical genome interpretation. We systematically
characterized the transcriptome effects of protein-truncating variants, a class of variants expected to have profound effects
on gene function, using data from the Genotype-Tissue Expression (GTEx) and Geuvadis project...
The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-covera...
Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each ind...
A major use of the 1000 Genomes Project (1000GP) data is genotype imputation in genome-wide association studies (GWAS). Here we develop a method to estimate haplotypes from low-coverage sequencing data that can take advantage of single-nucleotide polymorphism (SNP) microarray genotypes on the same samples. First the SNP array data are phased to bui...
Loss-of-function mutations protective against human disease provide in vivo validation of therapeutic targets, but none have yet been described for type 2 diabetes (T2D). Through sequencing or genotyping of ~150,000 individuals across 5 ancestry groups, we identified 12 rare protein-truncating variants in SLC30A8, which encodes an islet zinc transp...
This unit describes how to use BWA and the Genome Analysis Toolkit (GATK) to map genome sequencing data to a reference and produce high-quality variant calls that can be used in downstream analyses. The complete workflow includes the core NGS data processing steps that are necessary to make the raw data suitable for analysis by the GATK, as well as...
Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three divers...
Histogram of p-values for SKAT and Burden Test. (A) and (B) are SKAT p-values for Broad and Baylor samples, respectively. (C) and (D) are Burden test p-values for Broad and Baylor samples, respectively. Green vertical lines are the 25%, 50% and 75% quantiles of p-values.
(TIF)
Depth Comparison: Baylor versus Broad. We compare the average sample depth for all non-synonymous variants in the two data sets.
(TIF)
PCA from common variants, low frequency variants and both type of variants for Baylor samples. Eigen-vectors are obtained by applying PCA to all common variants that have no missingness (14,702 variants) (A), all low frequency variants that have no missingness (8783 variants) (B), and both type of variants (C). The colors are obtained by clustering...
P-values versus Missingness. We used 5500 genes to make this plot. For each gene, we calculate the -log 10 p-values and the odds ratio of missingness in case and control. The red line is the fitted line of these 5500 observations.
(TIF)
Distribution of the genomic control factor . By permuting case/control status 100 times the distribution of is obtained based on the 1000 largest genes. The red line shows the mean of the permutation distribution and the green line shows obtained from the data using (A) Broad SKAT p-values obtained without eigenvectors; (B) Broad SKAT p-values, wit...
Genomic control and based on different types of PC adjustment.
(PDF)
MAF Comparison: Baylor versus Broad. We compare the MAF for 72,758 shared non-synonymous variants in the two data sets.
(TIF)
PCA of Baylor and Broad samples together. first eigen-vector versus second eigen-vector for Broad and Baylor samples.
(TIF)
The p-values of genes which have two or more de novo nonsense or missense mutations as reported in [5].
(XLSX)
Comparison of seven individuals called by both Baylor and Broad under different filters.
(PDF)
The required sample sizes by applying meta- and mega-analysis.
(PDF)
The p-values of 114 ASD genes.
(XLSX)
Additional Information Regarding Methods. Part A gives additional information about sequencing, including data generation and quality control. Part B gives the mathematical exposition of mega- and meta-analysis. Part C provides details for association analysis.
(PDF)
Genes with p-value from the SKAT or Burden Test.
(XLSX)
Classification tree results for heterozygote calls.
(PDF)