
Jae-Yoon Jung, PhD
Harvard Medical School | HMS · Center for Biomedical Informatics
75 Publications
8,624 Reads
979 Citations
Publications (75)
Large, whole-genome sequencing (WGS) data sets containing families provide an important opportunity to identify crossovers and shared genetic material in siblings. However, the high variant calling error rates of WGS in some areas of the genome can result in spurious crossover calls, and the special inheritance status of the X Chromosome presents c...
Although it is ubiquitous in genomics, the current human reference genome (GRCh38) is incomplete: It is missing large sections of heterochromatic sequence, and as a singular, linear reference genome, it does not represent the full spectrum of human genetic diversity. To characterize gaps in GRCh38 and human genetic diversity, we developed an algori...
Autism spectrum disorder (ASD) has a complex genetic architecture involving contributions from both de novo and inherited variation. Few studies have been designed to address the role of rare inherited variation or its interaction with common polygenic risk in ASD. Here, we performed whole-genome sequencing of the largest cohort of multiplex famili...
While hundreds of thousands of human whole genome sequences (WGS) have been collected in the effort to better understand genetic determinants of disease, these whole genome sequences have less frequently been used to study another major determinant of human health: the human virome. Using the unmapped reads from WGS of over 1000 families, we presen...
Although it is heavily relied on to study genetic contributors to health and disease, the current human reference genome (GRCh38) is incomplete in two major ways: firstly, it is missing large sections of heterochromatic sequence, and secondly, as a singular, linear reference genome it does not represent the full spectrum of genetic diversity that e...
While hundreds of thousands of human whole genome sequences (WGS) have been collected in the effort to better understand genetic determinants of disease, these whole genome sequences have less frequently been used to study another major determinant of human health: the human virome. Using the unmapped reads from WGS of over 1,000 families, we prese...
The unmapped readspace of whole genome sequencing data tends to be large but is often ignored. We posit that it contains valuable signals of both human infection and contamination. Using unmapped and poorly aligned reads from whole genome sequences (WGS) of over 1000 families and nearly 5000 individuals, we present insights into common viral, bacte...
Autism Spectrum Disorder (ASD) has a complex genetic architecture involving contributions from de novo and inherited variation. Few studies have been designed to address the role of rare inherited variation, or its interaction with polygenic risk in ASD. Here, we performed whole genome sequencing of the largest cohort of multiplex families to date,...
Some of the most severe bottlenecks preventing widespread development of machine learning models for human behavior include a dearth of labeled training data and difficulty of acquiring high quality labels. Active learning is a paradigm for using algorithms to computationally select a useful subset of data points to label using metrics for model un...
While hundreds of thousands of human whole genome sequences (WGS) have been collected in the effort to better understand genetic determinants of disease, these whole genome sequences have rarely been used to study another major determinant of human health: the human virome. Using the unmapped reads from WGS of 1,000 families, we present insights in...
Background
The unmapped readspace of whole genome sequencing data tends to be large but is often ignored. We posit that it contains valuable signals of both human infection and contamination. Using unmapped and poorly aligned reads from whole genome sequences (WGS) of over 1,000 families and 5,000 individuals, we present insights into common viral,...
As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics' improvements in human health are distributed globally and equitably. An important step to ensuring health equity is to improve the human reference genome to capture global diversity by including a wi...
Background:
Sequencing partial 16S rRNA genes is a cost effective method for quantifying the microbial composition of an environment, such as the human gut. However, downstream analysis relies on binning reads into microbial groups by either considering each unique sequence as a different microbe, querying a database to get taxonomic labels from s...
Background/introduction:
Emotion detection classifiers traditionally predict discrete emotions. However, emotion expressions are often subjective, thus requiring a method to handle compound and ambiguous labels. We explore the feasibility of using crowdsourcing to acquire reliable soft-target labels and evaluate an emotion detection classifier tra...
Background
Autism spectrum disorder (ASD) is a widespread neurodevelopmental condition with a range of potential causes and symptoms. Standard diagnostic mechanisms for ASD, which involve lengthy parent questionnaires and clinical observation, often result in long waiting times for results. Recent advances in computer vision and mobile technology h...
Objective
Autism spectrum disorder (ASD) is a widespread neurodevelopmental condition with a range of potential causes and symptoms. Children with ASD exhibit behavioral and social impairments, giving rise to the possibility of utilizing computational techniques to evaluate a child’s social phenotype from home videos.
Methods
Here, we use a mobile...
Background:
Selection bias and unmeasured confounding are fundamental problems in epidemiology that threaten study internal and external validity. These phenomena are particularly dangerous in internet-based public health surveillance where traditional mitigation and adjustment methods are inapplicable, unavailable, or out of date. Recent theoreti...
Background
Selection bias and unmeasured confounding are fundamental problems in epidemiology that threaten study internal and external validity. These phenomena are particularly dangerous in internet-based public health surveillance, where traditional mitigation and adjustment methods are inapplicable, unavailable, or out of date. Recent theoretic...
Background
Machine learning approaches for predicting disease risk from high-dimensional whole genome sequence (WGS) data often result in unstable models that can be difficult to interpret, limiting the identification of putative sets of biomarkers. Here, we design and validate a graph-based methodology based on maximum flow, which leverages the pr...
Background
As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination o...
The evolutionary dynamics of SARS-CoV-2 have been carefully monitored since the COVID-19 pandemic began in December 2019. However, analysis has focused primarily on single nucleotide polymorphisms and largely ignored the role of insertions and deletions (indels) as well as recombination in SARS-CoV-2 evolution. Using sequences from the GISAID datab...
Crowd-powered telemedicine has the potential to revolutionize healthcare, especially during times that require remote access to care. However, sharing private health data with strangers from around the world is not compatible with data privacy standards, requiring a stringent filtration process to recruit reliable and trustworthy workers who can go...
Background
Automated emotion classification could aid those who struggle to recognize emotion, including children with developmental behavioral conditions such as autism. However, most computer vision emotion recognition models are trained on adult affect and therefore underperform when used on child faces.
Objective
We designed a strategy to gami...
Background:
Automated emotion classification could aid those who struggle to recognize emotions, including children with developmental behavioral conditions such as autism. However, most computer vision emotion recognition models are trained on adult emotion and therefore underperform when applied to child faces.
Objective:
We designed a strateg...
Automated emotion classification could aid those who struggle to recognize emotion, including children with developmental behavioral conditions such as autism. However, most computer vision emotion models are trained on adult affect and therefore underperform on child faces. In this study, we designed a strategy to gamify the collection and the lab...
The evolutionary dynamics of SARS-CoV-2 have been carefully monitored since the COVID-19 pandemic began in December 2019; however, analysis has focused primarily on single nucleotide polymorphisms and largely ignored the role of structural variants (SVs) as well as recombination in SARS-CoV-2 evolution. Using sequences from the GISAID database, we...
Background:
Complex human health conditions with etiological heterogeneity like Autism Spectrum Disorder (ASD) often pose a challenge for traditional genome-wide association study approaches in defining a clear genotype to phenotype model. Coalitional game theory (CGT) is an exciting method that can consider the combinatorial effect of groups of v...
We performed a comprehensive assessment of rare inherited variation in autism spectrum disorder (ASD) by analyzing whole-genome sequences of 2,308 individuals from families with multiple affected children. We implicate 69 genes in ASD risk, including 24 passing genome-wide Bonferroni correction and 16 new ASD risk genes, most supported by rare inhe...
Studies on autism spectrum disorder (ASD) have amassed substantial evidence for the role of genetics in the disease’s phenotypic manifestation. A large number of coding and non-coding variants with low penetrance likely act in a combinatorial manner to explain the variable forms of ASD. However, many of these combined interactions, both additive an...
Autism spectrum disorder (ASD) is a heritable neurodevelopmental disorder affecting 1 in 59 children. While noncoding genetic variation has been shown to play a major role in many complex disorders, the contribution of these regions to ASD susceptibility remains unclear. Genetic analyses of ASD typically use unaffected family members as controls; h...
The burden of comorbidity in Autism Spectrum Disorder (ASD) is substantial. The symptoms of autism overlap with many other human conditions, reflecting common molecular pathologies suggesting that cross-disorder analysis will help prioritize autism gene candidates. Genes in the intersection between autism and related conditions may represent nonspe...
Fitting curve for the multiscale bootstrap performed in Fig 1 for each cluster.
(PDF)
pN value, pS value, and dN/dS ratio used in this study.
(XLSX)
Complete list of genes involved in each disorder analyzed.
(XLSX)
Complete list of pathways described in Fig 3.
(XLSX)
Correspondence between KEGG Orthologs, hsa (KEGG) and Symbol ID.
(XLSX)
Standard error of each p-value calculated in Fig 1 multi-scale bootstrap.
(PDF)
Correspondence table for ICD-9 codes, ICD-9 disorder, Phenopedia terms and MeSH terms.
(XLSX)
*Souilmi and Lancaster are equal contributors
Background: While next-generation sequencing (NGS) costs have plummeted in recent years, cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data c...
Highlighted abstract from the Tenth International Society for Computational Biology, 11 July, 2014, Boston, MA
Summary: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addi...
Modern sequencing platforms are capable of sequencing approximately 5,000 megabases a day and programs such as the 1000 Genomes project are routinely generating data on the petabyte scale. The current challenge lies in the analysis and interpretation of this data, which has become the new rate-limiting step. Providing a solution to this problem tha...
To extract disorder-associated genes from the scientific literature in PubMed with greater sensitivity for literature-based support than existing methods.
We developed a PubMed query to retrieve disorder-related, original research articles. Then we applied a rule-based text-mining algorithm with keyword matching to extract target disorders, genes w...
Background
The genetic etiology of autism is heterogeneous. Multiple disorders share genotypic and phenotypic traits with autism. Network based cross-disorder analysis can aid in the understanding and characterization of the molecular pathology of autism, but there are few tools that enable us to conduct cross-disorder analysis and to visualize the...
Cloud computing services have emerged as a cost-effective alternative for cluster systems as the number of genomes and required computation power to analyze them increased in recent years. Here we introduce the Microsoft Azure platform with detailed execution steps and a cost comparison with Amazon Web Services.
The Autism Diagnostic Interview-Revised (ADI-R) is one of the most commonly used instruments for assisting in the behavioral diagnosis of autism. The exam consists of 93 questions that must be answered by a care provider within a focused session that often spans 2.5 hours. We used machine learning techniques to study the complete sets of answers to...
List of all the excluded questions from the ADI-R. We removed questions from consideration if they contained a majority of exception codes indicating that the question could not be answered in the format requested. We also removed all ‘special isolated skills’ questions and optional questions with hand-written answers.
(PDF)
Summary: Roundup is an online database of gene orthologs for over 1800 genomes, including 226 Eukaryota, 1447 Bacteria, 113 Archaea and 21 Viruses. Orthologs are inferred using the Reciprocal Smallest Distance algorithm. Users may query Roundup for single-linkage clusters of orthologous genes based on any group of genomes. Annotated query results m...
Roundup is an online database of gene orthologs for over 1800 genomes, including 226 Eukaryota, 1447 Bacteria, 113 Archaea and 21 Viruses. Orthologs are inferred using the Reciprocal Smallest Distance algorithm. Users may query Roundup for single-linkage clusters of orthologous genes based on any group of genomes. Annotated query results may be vie...
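The single-linkage clustering mentioned in the Roundup abstract can be illustrated with a minimal sketch. It assumes orthologs arrive as pairs of gene identifiers and groups them into connected components with a union-find structure. The function name `single_linkage_clusters` and the gene identifiers are hypothetical and are not taken from Roundup itself.

```python
def single_linkage_clusters(pairs):
    """Group genes into clusters where any two genes linked by a chain
    of ortholog pairs end up in the same cluster (single linkage)."""
    parent = {}

    def find(x):
        # Register unseen genes as their own root, then walk to the root
        # with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for gene in parent:
        clusters.setdefault(find(gene), set()).add(gene)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical ortholog pairs (identifiers are illustrative only).
example_pairs = [
    ("hsa:g1", "mmu:g1"),
    ("mmu:g1", "dre:g1"),
    ("hsa:g2", "mmu:g2"),
]
example_clusters = single_linkage_clusters(example_pairs)
print(example_clusters)
```

Here `hsa:g1` and `dre:g1` land in the same cluster even though they are never paired directly, which is the defining property of single linkage.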
The phylogenetic profile matrix used in our study.
It is a PDF file and includes the following content: 1. Supplemental table 1. The taxonomic distribution of the 182 bacteria genomes. 2. Supplemental table 2. Top ten genes involved in most triplets and their functions. 3. Supplemental figure 1. Percentage of different logic relationships among the eight types across the whole spectrum of ΔU. 4. Su...
A "phylogenetic profile" refers to the presence or absence of a gene across a set of organisms, and it has been proven valuable for understanding gene functional relationships and network organization. Despite this success, few studies have attempted to search beyond just pairwise relationships among genes. Here we search for logic relationships in...
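As a toy illustration of the definition above, a phylogenetic profile can be written as a vector of 0/1 values, one per organism. The sketch below checks which simple logic functions of two profiles reproduce a third exactly. This is a simplified assumption for illustration, not the scoring used in the study. The gene names, profile values, and the function `matching_logic` are all hypothetical.

```python
# Hypothetical toy profiles: 1 = gene present in that organism, 0 = absent.
profiles = {
    "geneA": (1, 1, 0, 0, 1, 0),
    "geneB": (1, 0, 1, 0, 1, 1),
    "geneC": (1, 0, 0, 0, 1, 0),
}

# A small, illustrative subset of possible two-input logic relationships.
LOGIC = {
    "A AND B": lambda a, b: a & b,
    "A OR B": lambda a, b: a | b,
    "A XOR B": lambda a, b: a ^ b,
}


def matching_logic(a, b, c):
    """Return the names of logic functions f with f(a_i, b_i) == c_i
    for every organism i."""
    return [
        name
        for name, f in LOGIC.items()
        if all(f(x, y) == z for x, y, z in zip(a, b, c))
    ]


result = matching_logic(profiles["geneA"], profiles["geneB"], profiles["geneC"])
print(result)
```

In this toy data, geneC is present only where both geneA and geneB are present, so only the AND relationship matches. Real profiles are noisy, so an exact-match test like this would normally be replaced by a statistical score.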
We developed a package TripletSearch to compute relationships within triplets of genes based on Roundup, an orthologous gene database containing >1500 genomes. These relationships, derived from the coevolution of genes, provide valuable information in the detection of biological network organization from the local to the system level, in the infere...
Complete Genotator results for autism. Contains the complete set of genes identified by Genotator as linked to autism together with the full set of attributes gathered by the Genotator resource (Figure 1).
Complete Genotator results for Parkinson's Disease. Contains the complete set of genes identified by Genotator as linked to Parkinson Disease together with the full set of attributes gathered by the Genotator resource (Figure 1).
Complete Genotator results for Alzheimer Disease. Contains the complete set of genes identified by Genotator as linked to Alzheimer Disease together with the full set of attributes gathered by the Genotator resource (Figure 1).
Disease-specific genetic information has been increasing at rapid rates as a consequence of recent improvements and massive cost reductions in sequencing technologies. Numerous systems designed to capture and organize this mounting sea of genetic data have emerged, but these resources differ dramatically in their disease coverage and genetic depth....
Selecting a representative set of single nucleotide polymorphism (SNP) markers for facilitating association studies is an important step to uncover the genetic basis of human disease. Tag SNP selection and functional SNP selection are the two main approaches for addressing the SNP selection problem. However, little was done so far to effectively co...
In this paper, we investigate the use of nested evolution in which each step of one evolutionary process involves running a second evolutionary process. We apply this approach to build an evolutionary system for reinforcement learning (RL) problems. Genetic programming based on a descriptive encoding is used to evolve the neural architecture, while...
In this paper, we investigate the use of nested evolution in which each step of one evolutionary process involves running a second evolutionary process. We apply this approach to build a neuroevolution system for reinforcement learning (RL) problems. Genetic programming based on a descriptive encoding is used to evolve the neural architecture, whil...
Neuroevolution refers to the design of artificial neural networks using evolutionary algorithms. It has been one of the promising application areas for evolutionary computation, as neural network design is still being done by human experts and automating the design process by evolutionary approaches will benefit developing intelligent systems and un...
Evolutionary algorithms are a promising approach to the automated design of artificial neural networks, but they require a compact and efficient genetic encoding scheme to represent repetitive and recurrent modules in networks. We present a problem-independent approach based on a human-readable and writable descriptive encoding using a high-level l...
Due to the tremendous number of single nucleotide polymorphisms (SNPs), there is a clear need to expedite genotyping by considering only a subset of all SNPs called haplotype tagging SNPs (htSNPs). Recently, the approach that selects htSNPs by maximizing their prediction accuracy has demonstrated very promising results. Here we propose a new predic...
Evolutionary algorithms are a promising approach for the automated design of artificial neural networks, but they require a compact and efficient genetic encoding scheme to represent repetitive and recurrent modules in networks. Here we introduce a problem-independent approach based on a human-readable descriptive encoding using a high-level l...
Modularity is a major feature of biological central nervous systems. For example, the human/primate cerebral cortex is composed of dozens of structurally and functionally identifiable regions that are interconnected in a hierarchical network [1]. Motivated by this, our research group is studying the evolution and self-organization of modular neura...