Yaniv Erlich

Whitehead Institute for Biomedical Research

About

96 Publications
22,486 Reads
8,089 Citations

Publications (96)
Article
Full-text available
Nucleic acid microarrays are the only tools that can supply very large oligonucleotide libraries, cornerstones of the nascent fields of de novo gene assembly and DNA data storage. Although the chemical synthesis of oligonucleotides is highly developed and robust, it is not error free, requiring the design of methods that can correct or compensate f...
Preprint
Full-text available
One of the key questions regarding COVID-19 vaccines is whether they can reduce viral shedding. To date, Israel has vaccinated substantial parts of the adult population, which enables extracting real world signals. The vaccination rollout started on Dec 20th 2020, utilized mainly the BNT162b2 vaccine, and focused on individuals who are 60 years or older...
Preprint
Full-text available
Finding familial relatives using DNA has multiple applications, in genetic genealogy, population genetics, and forensics. So far, most relative matching algorithms rely on detecting identity-by-descent (IBD) segments with high quality genotype data. Recently, low coverage sequencing (LCS) has received growing attention as a promising cost-effective...
Article
Full-text available
DNA storage offers substantial information density¹⁻⁷ and exceptional half-life³. We devised a ‘DNA-of-things’ (DoT) storage architecture to produce materials with immutable memory. In a DoT framework, DNA molecules record the data, and these molecules are then encapsulated in nanometer silica beads⁸, which are fused into various material...
Technical Report
SI for DNA-of-things storage architecture to create materials with embedded memory
Data
SI video of A DNA-of-things storage architecture to create materials with embedded memory
Article
Full-text available
The rapid digitization of genealogical and medical records enables the assembly of extremely large pedigree records spanning millions of individuals and trillions of pairs of relatives. Such pedigrees provide the opportunity to investigate the sociological and epidemiological history of human populations in scales much larger than previously possib...
Preprint
Full-text available
The rapid digitization of genealogical and medical records enables the assembly of extremely large pedigree records spanning millions of individuals and trillions of pairs of relatives. Such pedigrees provide the opportunity to investigate the sociological and epidemiological history of human populations in scales much larger than previously possib...
Article
Motivation: Hidden Markov models (HMMs) are powerful tools for modeling processes along the genome. In a standard genomic HMM, observations are drawn, at each genomic position, from a distribution whose parameters depend on a hidden state; the hidden states evolve along the genome as a Markov chain. Often, the hidden state is the Cartesian product...
Article
Consumer genomics databases have reached the scale of millions of individuals. Recently, law enforcement authorities have exploited some of these databases to identify suspects via distant familial relatives. Using genomic data of 1.28 million individuals tested with consumer genomics, we investigated the power of this technique. We project that ab...
Preprint
Full-text available
Motivation: Hidden Markov models (HMMs) are powerful tools for modeling processes along the genome. In a standard genomic HMM, observations are drawn, at each genomic position, from a distribution whose parameters depend on a hidden state; the hidden states evolve along the genome as a Markov chain. Often, the hidden state is the Cartesian product...
Preprint
Full-text available
Consumer genomics databases reached the scale of millions of individuals. Recently, law enforcement investigators have started to exploit some of these databases to find distant familial relatives, which can lead to a complete re-identification. Here, we leveraged genomic data of 600,000 individuals tested with consumer genomics to investigate the...
Article
Family trees have vast applications in multiple fields from genetics to anthropology and economics. However, the collection of extended family trees is tedious and usually relies on resources with limited geographical scope and complex data usage restrictions. Here, we collected 86 million profiles from publicly-available online data shared by gene...
Article
Full-text available
DNA re-identification is used for a broad suite of applications, ranging from cell line authentication to forensics. However, current re-identification schemes suffer from high latency and limited access. Here, we describe a rapid, inexpensive, and portable strategy to robustly re-identify human DNA called ‘MinION sketching’. MinION sketching requi...
Data
Supplementary Tables. Run statistics for the MinION sketch experiments.
Article
Identifying regions of the genome that are depleted of mutations can distinguish potentially deleterious variants. Short tandem repeats (STRs), also known as microsatellites, are among the largest contributors of de novo mutations in humans. However, per-locus studies of STR mutations have been limited to highly ascertained panels of several dozen...
Preprint
Precision medicine necessitates large scale collections of genomes and phenomes. Despite decreases in the costs of genomic technologies, collecting these types of information at scale is still a daunting task that poses logistical challenges and requires consortium-scale resources. Here, we describe DNA.Land, a digital biobank to collect genome and...
Preprint
Full-text available
Identifying regions of the genome that are depleted of mutations can reveal potentially deleterious variants. Short tandem repeats (STRs), also known as microsatellites, are among the largest contributors of de novo mutations in humans and are implicated in a variety of human disorders. However, because of the challenges STRs pose to bioinformatics...
Preprint
DNA re-identification is used for a broad range of applications, ranging from cell line authentication to crime scene sample identification. However, current re-identification schemes suffer from high latency. Here, we describe a rapid, inexpensive, and portable strategy to re-identify human DNA called MinION sketching. Using data from Oxford Nanop...
Article
Short tandem repeats (STRs) are highly variable elements that play a pivotal role in multiple genetic diseases, population genetics applications, and forensic casework. However, it has proven problematic to genotype STRs from high-throughput sequencing data. Here, we describe HipSTR, a novel haplotype-based method for robustly genotyping and phasin...
Article
Full-text available
Motivation: Millions of individuals have access to raw genomic data using direct-to-consumer companies. The advent of large-scale sequencing projects, such as the Precision Medicine Initiative, will further increase the number of individuals with access to their own genomic information. However, querying genomic data requires a computer terminal an...
Article
DNA is an attractive medium to store digital information. Here we report a storage strategy, called DNA Fountain, that is highly robust and approaches the information capacity per nucleotide. Using our approach, we stored a full computer operating system, movie, and other files with a total of 2.14 × 10⁶ bytes in DNA oligonucleotides and perfectly...
Preprint
Full-text available
Family trees have vast applications in multiple fields from genetics to anthropology and economics. However, the collection of extended family trees is tedious and usually relies on resources with limited geographical scope and complex data usage restrictions. Here, we collected 86 million profiles from publicly-available online data from genealogy...
Article
Collecting cases for case–control genetic association studies can be time-consuming and expensive. In some situations (such as studies of late-onset or rapidly lethal diseases), it may be more practical to identify family members of cases. In randomly ascertained cohorts, replacing cases with their first-degree relatives enables studies of diseases...
Preprint
Full-text available
DNA is an attractive medium to store digital information. Here, we report a storage strategy, called DNA Fountain, that is highly robust and approaches the information capacity per nucleotide. Using our approach, we stored a full computer operating system, movie, and other files with a total of 2.14 × 10⁶ bytes in DNA oligos and perfectly retrieve...
Preprint
Full-text available
Short tandem repeats (STRs) are highly variable elements that play a pivotal role in multiple genetic diseases, population genetics applications, and forensic casework. However, STRs have proven problematic to genotype from high-throughput sequencing data. Here, we describe HipSTR, a novel haplotype-based method for robustly genotyping, haplotyping...
Preprint
Full-text available
Motivation: Millions of individuals have access to raw genomic data using direct-to-consumer companies. The advent of large-scale sequencing projects, such as the Precision Medicine Initiative, will further increase the number of individuals with access to their own genomic information. However, querying genomic data requires a computer terminal – a...
Article
Here we report the Simons Genome Diversity Project data set: high quality genomes from 300 individuals from 142 diverse populations. These genomes include at least 5.8 million base pairs that are not present in the human reference genome. Our analysis reveals key features of the landscape of human genome variation, including that the rate of accumu...
Preprint
We report a rapid, inexpensive, and portable strategy to re-identify human DNA using the MinION, a miniature sequencing sensor by Oxford Nanopore Technologies. Our strategy requires only 10-30 minutes of MinION sequencing, works with low input DNA, and enables familial searches. We also show that it can re-identify individuals from Direct-to-Consum...
Preprint
Full-text available
The case-control association study is a powerful method for identifying genetic variants that influence disease risk. However, the collection of cases can be time-consuming and expensive; if a disease occurs late in life or is rapidly lethal, it may be more practical to identify family members of cases. Here, we show that replacing cases with their...
Article
Full-text available
We report the sequences of 1,244 human Y chromosomes randomly ascertained from 26 worldwide populations by the 1000 Genomes Project. We discovered more than 65,000 variants, including single-nucleotide variants, multiple-nucleotide variants, insertions and deletions, short tandem repeats, and copy number variants. Of these, copy number variants con...
Article
We report the sequences of 1,244 human Y chromosomes randomly ascertained from 26 worldwide populations by the 1000 Genomes Project. We discovered more than 65,000 variants, including single-nucleotide variants, multiple-nucleotide variants, insertions and deletions, short tandem repeats, and copy number variants. Of these, copy number variants con...
Article
Short tandem repeats (STRs) are mutation-prone loci that span nearly 1% of the human genome. Previous studies have estimated the mutation rates of highly polymorphic STRs by using capillary electrophoresis and pedigree-based designs. Although this work has provided insights into the mutational dynamics of highly mutable STRs, the mutation rates of...
Preprint
Short Tandem Repeats (STRs) are mutation-prone loci that span nearly 1% of the human genome. Previous studies have estimated the mutation rates of highly polymorphic STRs using capillary electrophoresis and pedigree-based designs. While this work has provided insights into the mutational dynamics of highly mutable STRs, the mutation rates of most o...
Article
Full-text available
Despite representing an important source of genetic variation, tandem repeats (TRs) remain poorly studied due to technical difficulties. We hypothesized that TRs can operate as expression (eQTLs) and methylation (mQTLs) quantitative trait loci. To test this we analyzed the effect of variation at 4849 promoter-associated TRs, genotyped in 120 indivi...
Data
Class website information. DOI: http://dx.doi.org/10.7554/eLife.14258.005
Data
Student handout for Hackathon 2: CSI Columbia. DOI: http://dx.doi.org/10.7554/eLife.14258.008
Data
Student handout for the final project. DOI: http://dx.doi.org/10.7554/eLife.14258.009
Data
Protocol for setting up a hackathon using MinION. DOI: http://dx.doi.org/10.7554/eLife.14258.006
Data
Teaching slides for Hackathon 1: Snack to sequence. DOI: http://dx.doi.org/10.7554/eLife.14258.010
Data
Teaching slides for Hackathon 2: CSI Columbia. DOI: http://dx.doi.org/10.7554/eLife.14258.011
Data
Student handout for Hackathon 1: Snack to sequence. DOI: http://dx.doi.org/10.7554/eLife.14258.007
Article
Full-text available
Screening large populations for carriers of known or de novo rare SNPs is required both in Targeting induced local lesions IN genomes (TILLING) experiments in plants and analogously in screening human populations. We previously suggested an approach that combines the celebrated mathematical field of compressed sensing with next-generation sequencin...
Article
The contribution of repetitive elements to quantitative human traits is largely unknown. Here we report a genome-wide survey of the contribution of short tandem repeats (STRs), which constitute one of the most polymorphic and abundant repeat classes, to gene expression in humans. Our survey identified 2,060 significant expression STRs (eSTRs). Thes...
Article
Full-text available
Genomics has recently celebrated reaching the $1000 genome milestone, making affordable DNA sequencing a reality. With this goal successfully completed, the next goal of the sequencing revolution can be sequencing sensors—miniaturized sequencing devices that are manufactured for real-time applications and deployed in large quantities at low costs....
Article
Full-text available
The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-covera...
Preprint
Full-text available
Expression quantitative trait loci (eQTLs) are a key tool to dissect cellular processes mediating complex diseases. However, little is known about the role of repetitive elements as eQTLs. We report a genome-wide survey of the contribution of Short Tandem Repeats (STRs), one of the most polymorphic and abundant repeat classes, to gene expression in...
Article
Full-text available
Fulfilling the promise of the genetic revolution requires the analysis of large datasets containing information from thousands to millions of participants. However, sharing human genomic data requires protecting subjects from potential harm. Current models rely on de-identification techniques in which privacy versus data utility becomes a zero-sum...
Article
Full-text available
Short tandem repeats are among the most polymorphic loci in the human genome. These loci play a role in the etiology of a range of genetic diseases and have been frequently utilized in forensics, population genetics, and genetic genealogy. Despite this plethora of applications, little is known about the variation of most STRs in the human populatio...
Preprint
Full-text available
Fulfilling the promise of the genetic revolution requires the analysis of large datasets containing information from thousands to millions of participants. However, sharing human genomic data requires protecting subjects from potential harm. Current models rely on de-identification techniques that treat privacy versus data utility as a zero-sum gam...
Article
Full-text available
Hemifacial microsomia (HFM) is the second most common facial anomaly after cleft lip and palate. The phenotype is highly variable and most cases are sporadic. We investigated the disorder in a large pedigree with five affected individuals spanning eight meioses. Whole-exome sequencing results indicated the absence of a pathogenic coding point mutat...
Article
We are entering an era of ubiquitous genetic information for research, clinical care and personal curiosity. Sharing these data sets is vital for progress in biomedical research. However, a growing concern is the ability to protect the genetic privacy of the data originators. Here, we present an overview of genetic privacy breaching strategies. We...
Conference Paper
Sharing sequencing datasets is essential for human genetics. I will present common techniques that were thought to de-identify these datasets, such as removing explicit identifiers, pooling, and data masking. Then, I will survey a range of techniques to bypass these methods and re-identify the “anonymous” samples. Specifically, I will show how publ...
Preprint
Full-text available
Hemifacial microsomia (HFM) is the second most common facial anomaly after cleft lip and palate. The phenotype is highly variable and most cases are sporadic. Here, we investigated the disorder in a large pedigree with five affected individuals spanning eight meioses. We performed whole-exome sequencing and a genome-wide survey of segmental variati...
Preprint
Full-text available
We are entering the era of ubiquitous genetic information for research, clinical care, and personal curiosity. Sharing these datasets is vital for rapid progress in understanding the genetic basis of human diseases. However, one growing concern is the ability to protect the genetic privacy of the data originators. Here, we technically map threats t...
Article
Short tandem repeats (STRs), also known as microsatellites, have a wide range of applications, including medical genetics, forensics, and population genetics. High-throughput sequencing has the potential to profile large numbers of STRs, but cumbersome gapped alignment and STR-specific noise patterns hamper this task. We recently developed an algor...
Patent
Full-text available
Methods and systems of DNA sequencing that compensate for sources of noise in next-generation DNA sequencers are described.
Article
Sharing sequencing data sets without identifiers has become a common practice in genomics. Here, we report that surnames can be recovered from personal genomes by profiling short tandem repeats on the Y chromosome (Y-STRs) and querying recreational genetic genealogy databases. We show that a combination of a surname with other types of metadata, su...
Article
Full-text available
A report on the 62nd Annual Meeting of the American Society of Human Genetics, San Francisco, California, USA, 6-10 November 2012.
Article
Full-text available
Motivation: Despite the rapid decline in sequencing costs, sequencing large cohorts of individuals is still prohibitively expensive. Recently, several sophisticated pooling designs were suggested that can identify carriers of rare alleles in large cohorts with a significantly smaller number of pools, thus dramatically reducing the cost of such larg...