
Konrad J Karczewski - Stanford University
258
Publications
64,993
Reads
40,658
Citations
Publications (258)
Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasizes efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is ineffic...
One of the strongest signatures of aging is an accumulation of mutant mitochondrial DNA (mtDNA) heteroplasmy. Here we investigate the mechanism underlying this phenomenon by calling mtDNA sequence, abundance, and heteroplasmic variation in human blood using whole genome sequences from ~750,000 individuals. Our analyses reveal a simple, two-step mec...
Genetic sequencing technologies are powerful tools for identifying rare variants and genes associated with Mendelian and complex traits; indeed, whole-exome and whole-genome sequencing are increasingly popular methods for population-scale genetic studies. However, careful quality control steps should be taken to ensure study accuracy and reproducib...
Motivation: The Variant Call Format (VCF) is widely used in genome sequencing but scales poorly. For instance, we estimate a 150,000 genome VCF would occupy 900 TiB, making it costly and complicated to produce, analyze, and store. The issue stems from VCF’s requirement to densely represent both reference-genotypes and allele-indexed arrays. These re...
The complete blood count (CBC) is an important screening tool for healthy adults and a common test at periodic exams. However, results are usually interpreted relative to one-size-fits-all reference intervals¹,², undermining the precision medicine goal to tailor care for patients on the basis of their unique characteristics³,⁴. Here we study thousa...
The Genome Aggregation Database (gnomAD) has been an invaluable resource for the genomics community, quantifying allele frequencies across multiple genetic ancestry groups. However, traditional gnomAD allele frequencies report aggregate estimates, overlooking fine-scale genetic differences. This can skew allele frequency estimates, particularly in...
Genome-wide association studies (GWAS) and rare-variant association studies (RVAS) have identified thousands of genes and variants that affect multiple phenotypes. Here, using rare variant association results from the UK Biobank (UKB) data, we identify pervasive gene-level pleiotropy across diverse phenotypic domains and highlight genes with appare...
Background: The complete blood count (CBC) is an important screening tool for healthy adults, commonly ordered at periodic physical exams. However, test results are typically interpreted using one-size-fits-all reference intervals (RIs), which don’t account for the low index-of-individuality of most CBC markers. This undermines the goal of precision...
Variant scoring methods (VSMs) aid in the interpretation of coding mutations and their potential impact on health, but their evaluation in the context of human genetics applications remains inconsistent. Here, we describe GeneticsGym, a systematic approach to evaluating the real-world impact of VSMs on human genetic analysis across selection regime...
Data within biobanks capture broad yet detailed indices of human variation, but biobank-wide insights can be difficult to extract due to complexity and scale. Here, using large-scale factor analysis, we distill hundreds of variables (diagnoses, assessments and survey items) into 35 latent constructs, using data from unrelated individuals with predo...
The high prevalence of autoimmune hypothyroidism (AIHT) - more than 5% in human populations - provides a unique opportunity to unlock the most complete picture to date of genetic loci that underlie systemic and organ-specific autoimmunity. Using a meta-analysis of 81,718 AIHT cases in FinnGen and the UK Biobank, we dissect associations along axes o...
Underrepresented populations are often excluded from genomic studies due in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open...
Missense variants can have a range of functional impacts depending on factors such as the specific amino acid substitution and location within the gene. To interpret their deleteriousness, studies have sought to identify regions within genes that are specifically intolerant of missense variation. Here, we leverage the patterns of rare missense vari...
Large biobanks, such as the UK Biobank (UKB), enable massive phenome by genome-wide association studies that elucidate genetic etiology of complex traits. However, individuals from diverse genetic ancestry groups are often excluded from association analyses due to concerns about population structure introducing false positive associations. Here, we...
Germline pathogenic variants associated with increased childhood mortality must be subject to natural selection. Here, we analyze publicly available germline genetic metadata from 4,574 children with cancer [11 studies; 1,083 whole exome sequences (WES), 1,950 whole genome sequences (WGS), and 1,541 gene panel] and 141,456 adults [125,748 WES and 1...
The Variant Call Format (VCF) is widely used in genome sequencing but scales poorly. For instance, we estimate a 150,000 genome VCF would occupy 900 TiB, making it both costly and complicated to produce and analyze. The issue stems from VCF’s requirement to densely represent both reference-genotypes and allele-indexed arrays. These requirements lea...
The complete blood count is an important screening tool for healthy adults and is the most commonly ordered test at periodic physical exams. However, results are usually interpreted relative to one-size-fits-all reference intervals, undermining the goal of precision medicine to tailor medical care to the needs of individual patients based on their...
Copy number variants (CNVs) are major contributors to genetic diversity and disease. While standardized methods, such as the genome analysis toolkit (GATK), exist for detecting short variants, technical challenges have confounded uniform large-scale CNV analyses from whole-exome sequencing (WES) data. Given the profound impact of rare and de novo c...
Mitochondrial DNA (mtDNA) is a maternally inherited, high-copy-number genome required for oxidative phosphorylation¹. Heteroplasmy refers to the presence of a mixture of mtDNA alleles in an individual and has been associated with disease and ageing. Mechanisms underlying common variation in human heteroplasmy, and the influence of the nuclear genom...
DNA sample contamination is a major issue in clinical and research applications of whole genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRA...
Large-scale high-throughput sequencing datasets have been transformative for informing clinical variant interpretation and as reference panels for statistical and population genetic efforts. While such resources are often treated as ground truth, we find that in widely used reference datasets such as the Genome Aggregation Database (gnomAD), some v...
Severe recessive diseases arise when both the maternal and the paternal copies of a gene carry, or are impacted by, a damaging genetic variant in the affected individual. When a patient carries two different potentially causal variants, accurate diagnosis requires determining that these two variants occur on different copies of the chromosome (i.e....
Predicted loss of function (pLoF) variants are highly deleterious and play an important role in disease biology, but many of these variants may not actually result in loss-of-function. Here we present a framework that advances interpretation of pLoF variants in research and clinical settings by considering three categories of LoF evasion: (1) predi...
Germline pathogenic variants associated with increased childhood mortality must be subject to natural selection. Here, we analyzed publicly available germline genetic metadata from 141,456 adults [gnomAD; 125,748 whole exome sequences (WES) and 15,708 whole genome sequences (WGS)] and 4,810 children with cancer [11 studies; 1,319 WES, 1,950 WGS,...
Both common and rare genetic variants influence complex traits and common diseases. Genome-wide association studies have identified thousands of common-variant associations, and more recently, large-scale exome sequencing studies have identified rare-variant associations in hundreds of genes¹⁻³. However, rare-variant genetic architecture is not wel...
Background: Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. The association of LoF variants with complex diseases and traits may lead to the discovery and validation of novel therapeutic targets. Current approaches predict high-conf...
Underrepresented populations are often excluded from genomic studies due in part to a lack of resources supporting their analysis. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open...
Human mitochondria contain a high copy number, maternally transmitted genome (mtDNA) that encodes 13 proteins required for oxidative phosphorylation. Heteroplasmy arises when multiple mtDNA variants co-exist in an individual and can exhibit complex dynamics in disease and in aging. As all proteins involved in mtDNA replication and maintenance are n...
Otosclerosis is one of the most common causes of conductive hearing loss, affecting 0.3% of the population. It typically presents in adulthood and half of the patients have a positive family history. The pathophysiology of otosclerosis is poorly understood. A previous genome-wide association study (GWAS) identified a single association locus in an...
Identifying causal factors for Mendelian and common diseases is an ongoing challenge in medical genetics¹. Population bottleneck events, such as those that occurred in the history of the Finnish population, enrich some homozygous variants to higher frequencies, which facilitates the identification of variants that cause diseases with recessive inhe...
Host genetics is a key determinant of COVID-19 outcomes. Previously, the COVID-19 Host Genetics Initiative genome-wide association study used common variants to identify multiple loci associated with COVID-19 outcomes. However, variants with the largest impact on COVID-19 outcomes are expected to be rare in the population. Hence, studying rare vari...
The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders, but attempts to assess constraint for non-protein-coding regions have proven more difficult. Here we aggregate, process, and release a dataset of 76,156 human genomes from the...
Several biobanks, including UK Biobank (UKBB), are generating large-scale sequencing data. An existing method, SAIGE-GENE, performs well when testing variants with minor allele frequency (MAF) ≤ 1%, but inflation is observed in variance component set-based tests when restricting to variants with MAF ≤ 0.1% or 0.01%. Here, we propose SAIGE-GENE+ wit...
Broad yet detailed data collected in biobanks captures variation reflective of human health and behavior, but insights are hard to extract given their complexity and scale. In the largest factor analysis to date, we distill hundreds of medical record codes, physical assays, and survey items from UK Biobank into 35 understandable latent constructs....
Copy number variants (CNVs) are major contributors to genetic diversity and disease. To date, exome sequencing (ES) has been generated for millions of individuals in international biobanks, human disease studies, and clinical diagnostic screening. While standardized methods exist for detecting short variants (single nucleotide and insertion/deletio...
Rare copy-number variants (rCNVs) include deletions and duplications that occur infrequently in the global human population and can confer substantial risk for disease. In this study, we aimed to quantify the properties of haploinsufficiency (i.e., deletion intolerance) and triplosensitivity (i.e., duplication intolerance) throughout the human geno...
Genome-wide association studies have successfully discovered thousands of common variants associated with human diseases and traits, but the landscape of rare variations in human disease has not been explored at scale. Exome-sequencing studies of population biobanks provide an opportunity to systematically evaluate the impact of rare coding variati...
Both common and rare genetic variants influence complex traits and common diseases. Genome-wide association studies have discovered thousands of common-variant associations, and more recently, large-scale exome sequencing studies have identified rare-variant associations in hundreds of genes. However, rare-variant genetic architecture is not well c...
Rare coding variation has historically provided the most direct connections between gene function and disease pathogenesis. By meta-analysing the whole exomes of 24,248 schizophrenia cases and 97,322 controls, we implicate ultra-rare coding variants (URVs) in 10 genes as conferring substantial risk for schizophrenia (odds ratios of 3–50, P < 2.14 ×...
Host genetics is a key determinant of COVID-19 outcomes. Previously, the COVID-19 Host Genetics Initiative genome-wide association study used common variants to identify multiple loci associated with COVID-19 outcomes. However, variants with the largest impact on COVID-19 outcomes are expected to be rare in the population. Hence, studying rare vari...
Despite their widespread use in research, there has not yet been a systematic genomic analysis of human embryonic stem cell (hESC) lines at a single-nucleotide resolution. We therefore performed whole-genome sequencing (WGS) of 143 hESC lines and annotated their single-nucleotide and structural genetic variants. We found that while a substantial fr...
Identifying Mendelian diseases with recessive inheritance is challenging as the majority of cases are caused by compound heterozygous genotypes which require sequencing data in families to definitively identify. Bottleneck events, such as in the Finnish population, enrich specific homozygous variants to higher frequencies and thus facilitate identi...
Despite the great success of genome-wide association studies (GWAS) in identifying genetic loci significantly associated with diseases, the vast majority of causal variants underlying disease-associated loci have not been identified. To create an atlas of causal variants, we performed and integrated fine-mapping across 148 complex traits in three l...
Most age-related human diseases are accompanied by a decline in cellular organelle integrity, including impaired lysosomal proteostasis and defective mitochondrial oxidative phosphorylation. An open question, however, is the degree to which inherited variation in or near genes encoding each organelle contributes to age-related disease pathogenesis....
Motivation: Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. Current approaches predict high-confidence LoF variants without identifying the specific genes or the number of copies they affect. Moreover, there is a lack of methods for...
UK Biobank has released the whole-exome sequencing (WES) data for 200,000 participants, but the best practices remain unclear for rare variant tests, and an existing approach, SAIGE-GENE, can have inflated type I error rates with high computation cost. Here, we propose SAIGE-GENE+ with greatly improved type I error control and computational efficie...
Genome-wide association studies have successfully discovered thousands of common variants associated with human diseases and traits, but the landscape of rare variation in human disease has not been explored at scale. Exome sequencing studies of population biobanks provide an opportunity to systematically evaluate the impact of rare coding variatio...
Clinical genetic testing of protein-coding regions identifies a likely causative variant in only around half of developmental disorder (DD) cases. The contribution of regulatory variation in non-coding regions to rare disease, including DD, remains very poorly understood. We screened 9,858 probands from the Deciphering Developmental Disorders (DDD)...
Structural variants (SVs) rearrange large segments of DNA¹ and can have profound consequences in evolution and human disease²,³. As national biobanks, disease-association studies, and clinical genetic testing have grown increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD)⁴ have become...
In this Article, author Marquis P. Vawter was missing from the Genome Aggregation Database Consortium list. They are associated with the affiliation: ‘Department of Psychiatry & Human Behavior, University of California Irvine, Irvine, CA, USA’, and contributed to the generation of the primary data incorporated into the gnomAD resource. In addition,...
A Correction to this paper has been published: https://doi.org/10.1038/s41586-020-03177-5
A Correction to this paper has been published: https://doi.org/10.1038/s41586-020-03175-7
In this Article, author Marquis P. Vawter was missing from the Genome Aggregation Database Consortium list. They are associated with the affiliation: ‘Department of Psychiatry & Human Behavior, University of California Irvine, Irvine, CA, USA’, and contributed to the generation of the primary data incorporated into the gnomAD resource. The original...
A Correction to this paper has been published: https://doi.org/10.1038/s41467-021-21077-8.
A Correction to this paper has been published: https://doi.org/10.1038/s41467-021-21052-3
Admixed populations are routinely excluded from genomic studies due to concerns over population structure. Here, we present a statistical framework and software package, Tractor, to facilitate the inclusion of admixed individuals in association studies by leveraging local ancestry. We test Tractor with simulated and empirical two-way admixed Africa...
A Correction to this paper has been published: https://doi.org/10.1038/s41591-020-01185-6.
Rare deletions and duplications of genomic segments, collectively known as rare copy number variants (rCNVs), contribute to a broad spectrum of human diseases. To date, most disease-association studies of rCNVs have focused on recognized genomic disorders or on the impact of haploinsufficiency caused by deletions. By comparison, our understanding o...
Aging is associated with defects in many organelles, but an open question is whether the inherited risk for age-related disease is enriched within loci relevant to each organelle. Here, we begin with a focus on mitochondria, as mitochondrial dysfunction is a hallmark of age-related disease. We report a striking lack of enrichment of mitochondria-re...
Upstream open reading frames (uORFs) are tissue-specific cis-regulators of protein translation. Isolated reports have shown that variants that create or disrupt uORFs can cause disease. Here, in a systematic genome-wide study using 15,708 whole genome sequences, we show that variants that create new upstream start codons, and variants disrupting s...
Clinical genetic testing of protein-coding regions identifies a likely causative variant in only ∼35% of severe developmental disorder (DD) cases. We screened 9,858 patients from the Deciphering Developmental Disorders (DDD) study for de novo mutations in the 5’ untranslated regions (5’ UTRs) of dominant haploinsufficient DD genes. We identify four s...
Otosclerosis is one of the most common causes of conductive hearing loss, affecting 0.3% of the population. It typically presents in adulthood and half of the patients have a positive family history. The pathophysiology of otosclerosis is poorly understood and treatment options are limited. A previous genome-wide association study (GWAS) identified...
There has not yet been a systematic analysis of hESC whole genomes at a single nucleotide resolution. We therefore performed whole genome sequencing (WGS) of 143 hESC lines and annotated their single nucleotide and structural genetic variants. We found that while a substantial fraction of hESC lines contained large deleterious structural variants,...
By meta-analyzing the whole exomes of 24,248 cases and 97,322 controls, we implicate ultra-rare coding variants (URVs) in ten genes as conferring substantial risk for schizophrenia (odds ratios 3–50, P < 2.14 × 10⁻⁶), and 32 genes at an FDR < 5%. These genes have the greatest expression in central nervous system neurons and have diverse molecular...
Understanding the influence of genetics on human disease is among the primary goals for biology and medicine. To this end, the direct study of natural human genetic variation has provided valuable insights into human physiology and disease as well as into the origins and migrations of humans. In this review, we discuss the foundations of population...
Human genetic variants predicted to cause loss-of-function of protein-coding genes (pLoF variants) provide natural in vivo models of human gene inactivation and can be valuable indicators of gene function and the potential toxicity of therapeutic inhibitors targeting these genes¹,². Gain-of-kinase-function variants in LRRK2 are known to significant...
Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-fun...
Structural variants (SVs) rearrange large segments of DNA¹ and can have profound consequences in evolution and human disease²,³. As national biobanks, disease-association studies, and clinical genetic testing have grown increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD)⁴ have become int...
Naturally occurring human genetic variants that are predicted to inactivate protein-coding genes provide an in vivo model of human gene inactivation that complements knockout studies in cells and model organisms. Here we report three key findings regarding the assessment of candidate drug targets using human loss-of-function variants. First, even e...
The acceleration of DNA sequencing in samples from patients and population studies has resulted in extensive catalogues of human genetic variation, but the interpretation of rare genetic variants remains problematic. A notable example of this challenge is the existence of disruptive variants in dosage-sensitive disease genes, even in apparently hea...
Multi-nucleotide variants (MNVs), defined as two or more nearby variants existing on the same haplotype in an individual, are a clinically and biologically important class of genetic variation. However, existing tools typically do not accurately classify MNVs, and understanding of their mutational origins remains limited. Here, we systematically su...
Upstream open reading frames (uORFs) are tissue-specific cis-regulators of protein translation. Isolated reports have shown that variants that create or disrupt uORFs can cause disease. Here, in a systematic genome-wide study using 15,708 whole genome sequences, we show that variants that create new upstream start codons, and variants disrupting st...
Admixed populations are routinely excluded from medical genomic studies due to concerns over population structure. Here, we present a statistical framework and software package, Tractor, to facilitate the inclusion of admixed individuals in association studies by leveraging local ancestry. We test Tractor with simulations and empirical data focused...
Following an earlier report suggesting increased mortality due to homozygosity at the CCR5-Δ32 allele, Wei and Nielsen recently suggested a deviation from Hardy-Weinberg Equilibrium (HWE) observed in public variant databases as additional supporting evidence for this hypothesis. Here, we present a re-analysis of the primary data underlying this var...
[This corrects the article DOI: 10.1371/journal.pgen.1007329.].
Structural variants (SVs) rearrange large segments of the genome and can have profound consequences for evolution and human diseases. As national biobanks, disease association studies, and clinical genetic testing grow increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD) have become integ...
Multi-nucleotide variants (MNVs), defined as two or more nearby variants existing on the same haplotype in an individual, are a clinically and biologically important class of genetic variation. However, existing tools for variant interpretation typically do not accurately classify MNVs, and understanding of their mutational origins remains limited....
Human genetic variants causing loss of function (LoF) of protein-coding genes provide natural in vivo models of gene inactivation, which are powerful indicators of gene function and the potential toxicity of therapeutic inhibitors targeting these genes. Gain of kinase function variants in LRRK2 are known to significantly increase the risk of Parkin...
The acceleration of DNA sequencing in patients and population samples has resulted in unprecedented catalogues of human genetic variation, but the interpretation of rare genetic variants discovered using such technologies remains extremely challenging. A striking example of this challenge is the existence of disruptive variants in dosage-sensitive...
Upstream open reading frames (uORFs) are important tissue-specific cis-regulators of protein translation. Although isolated case reports have shown that variants that create or disrupt uORFs can cause disease, genetic sequencing approaches typically focus on protein-coding regions and ignore these variants. Here, we describe a systematic genome-wid...
Human genetics has informed the clinical development of new drugs, and is beginning to influence the selection of new drug targets. Large-scale DNA sequencing studies have created a catalogue of naturally occurring genetic variants predicted to cause loss of function in human genes, which in principle should provide powerful in vivo models of human...
Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism's function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) v...
Recent work by Shah and colleagues demonstrated that many variants in the ClinVar database are misclassified, and that disease-specific allele frequency (AF) thresholds can identify wrongly classified alleles by flagging variants that are too prevalent in the population to be causative of rare penetrant disease. While we agree with the main conclus...
Variation in RNA splicing (i.e., alternative splicing) plays an important role in many diseases. Variants near 5' and 3' splice sites often affect splicing, but the effects of these variants on splicing and disease have not been fully characterized beyond the "essential" splice nucleotides flanking each exon. Here we provide quantitative measuremen...
As part of a broader collaborative network of exome sequencing studies, we developed a jointly called data set of 5,685 Ashkenazi Jewish exomes. We make publicly available a resource of site and allele frequencies, which should serve as a reference for medical genetics in the Ashkenazim (hosted in part at https://ibd.broadinstitute.org, also availa...
Analysis workflow diagram.
(PNG)
Cross-validation errors for number of clusters in ADMIXTURE.
(PNG)