Publications: 153
Reads: 23,806
Citations: 16,057
Publications (153)
Machine learning models are often used as scoring functions to predict the binding affinity of a protein-ligand complex. These models are trained with limited amounts of data with experimentally measured binding affinity values. A large number of compounds are labeled inactive through single-concentration screens without measuring binding affinitie...
Accurate metagenomic classification relies on comprehensive, up-to-date, and validated reference databases. While the NCBI BLAST Nucleotide (nt) database, encompassing a vast collection of sequences from all domains of life, represents an invaluable resource, its massive size—currently exceeding 10¹² nucleotides—and exponential growth pose signific...
Introduction
Recent advances in 3D structure-based deep learning approaches demonstrate improved accuracy in predicting protein-ligand binding affinity in drug discovery. These methods complement physics-based computational modeling such as molecular docking for virtual high-throughput screening. Despite recent advances and improved predictive perf...
Traditional methods for identifying “hit” molecules from a large collection of potential drug-like candidates rely on biophysical theory to compute approximations to the Gibbs free energy of the binding interaction between the drug and its protein target. These approaches have a significant limitation in that they require exceptional computing capa...
The design of two overlapping genes in a microbial genome is an emerging technique for adding more reliable control mechanisms in engineered organisms for increased stability. The design of functional overlapping gene pairs is a challenging procedure and computational design tools are used to improve the efficiency to deploy successful designs in g...
Background
Accurate metagenomic classification relies on comprehensive, up-to-date, and validated reference databases. While the NCBI BLAST Nucleotide (nt) database, encompassing a vast collection of sequences from all domains of life, represents an invaluable resource, its massive size, currently exceeding 10¹² nucleotides, and exponential growth...
Decades of drug development research have explored a vast chemical space for highly active compounds. The exponential growth of virtual libraries enables easy access to billions of synthesizable molecules. Computational modeling, particularly molecular docking, utilizes physics-based calculations to prioritize molecules for synthesis and testing. N...
Viral populations in natural infections can have a high degree of sequence diversity, which can directly impact immune escape. However, antibody potency is often tested in vitro with relatively clonal viral populations, such as laboratory virus or pseudotyped virus stocks, which may not accurately represent the genetic diversity of circulating vi...
The design of two overlapping genes in a microbial genome is an emerging technique for adding more reliable control mechanisms in engineered organisms for increased safety. The design of functional gene pairs is a challenging procedure and computational design tools are used to improve the efficiency to deploy successful designs in genetically engi...
Protein–ligand interactions are essential to drug discovery and drug development efforts. Desirable on-target or multitarget interactions are the first step in finding an effective therapeutic, while undesirable off-target interactions are the first step in assessing safety. In this work, we introduce a novel ligand-based featurization and mapping...
Minimizing the human and economic costs of the COVID-19 pandemic and future pandemics requires the ability to develop and deploy effective treatments for novel pathogens as soon as possible after they emerge. To this end, we introduce a new computational pipeline for the rapid identification and characterization of binding sites in viral proteins a...
Neural Network (NN) models provide potential to speed up the drug discovery process and reduce its failure rates. The success of NN models requires uncertainty quantification (UQ) as drug discovery explores chemical space beyond the training data distribution. Standard NN models do not provide uncertainty information. Some methods require changing...
Protein-ligand interactions are essential to drug discovery and drug development efforts. Desirable on-target or multi-target interactions are a first step in finding an effective therapeutic; undesirable off-target interactions are a first step in assessing safety. In this work, we introduce a novel ligand-based featurization and mapping of human...
Molecular biology methods and technologies have advanced substantially over the past decade. These new molecular methods should be incorporated among the standard tools of planetary protection (PP) and could be validated for incorporation by 2026. To address the feasibility of applying modern molecular techniques to such an application, NASA conduc...
Publicly available collections of drug-like molecules have grown to comprise tens of billions of possibilities in recent history due to advances in chemical synthesis. Traditional methods for identifying “hit” molecules from a large collection of potential drug-like candidates have relied on biophysical theory to compute approximations to the Gibb...
Generative molecular design (GMD) is an increasingly popular strategy for drug discovery, using machine learning models to propose, evaluate and optimize chemical structures against a set of target design criteria. We present the ATOM-GMD platform, a scalable multiprocessing framework to optimize many parameters simultaneously over large population...
Genetic analysis of intra-host viral populations provides unique insight into pre-emergent mutations that may contribute to the genotype of future variants. Clinical samples positive for SARS-CoV-2 collected in California during the first months of the pandemic were sequenced to define the dynamics of mutation emergence as the virus became establis...
Neural Network (NN) models provide potential to speed up the drug discovery process and reduce its failure rates. The success of NN models requires uncertainty quantification (UQ) as drug discovery explores chemical space beyond the training data distribution. Standard NN models do not provide uncertainty information. Methods that combine Bayesian m...
We present a structure-based method for finding and evaluating structural similarities in protein regions relevant to ligand binding. PDBspheres comprises an exhaustive library of protein structure regions ('spheres') adjacent to complexed ligands derived from the Protein Data Bank (PDB), along with methods to find and evaluate structural matches b...
Predicting molecular activity against protein targets is difficult because of the paucity of experimental data. Approaches like multitask modeling and collaborative filtering seek to improve model accuracy by leveraging results from multiple targets, but are limited because different compounds are measured with different assays, leading to sparse d...
The growing capabilities of synthetic biology and organic chemistry demand tools to guide syntheses toward useful molecules. Here, we present Molecular AutoenCoding Auto-Workaround (MACAW), a tool that uses a novel approach to generate molecules predicted to meet a desired property specification (e.g., a binding affinity of 50 nM or an octane numbe...
Atomistic Molecular Dynamics (MD) simulations provide researchers the ability to model biomolecular structures such as proteins and their interactions with drug-like small molecules with greater spatiotemporal resolution than is otherwise possible using experimental methods. MD simulations are notoriously expensive computational endeavors that have...
The identification of promising lead compounds showing pharmacological activities toward a biological target is essential in early stage drug discovery. With the recent increase in available small-molecule databases, virtual high-throughput screening using physics-based molecular docking has emerged as an essential tool in assisting fast and cost-e...
The identification of promising lead compounds showing pharmacological activities toward a biological target is essential in early-stage drug discovery. With the recent increase in available small-molecule databases, virtual high-throughput screening using physics-based molecular docking has emerged as an essential tool in assisting fast and cost-e...
Minimizing the human and economic costs of the COVID-19 pandemic and of future pandemics requires the ability to develop and deploy effective treatments for novel pathogens as soon as possible after they emerge. To this end, we introduce a unique, computational pipeline for the rapid identification and characterization of binding sites in the prote...
The growing capabilities of synthetic biology and organic chemistry demand tools to guide syntheses towards useful molecules. Here, we present MACAW (Molecular AutoenCoding Auto-Workaround), a tool that uses a novel approach to generate molecules predicted to meet a desired property specification (e.g. a binding affinity of 50 nM or an octane numbe...
We present a structure-based method for finding and evaluating structural similarities in protein regions relevant to ligand binding. PDBspheres comprises an exhaustive library of protein structure regions (spheres) adjacent to complexed ligands derived from the Protein Data Bank (PDB), along with methods to find and evaluate structural matches bet...
To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross-validation within a single study to assess model accuracy. While an essential first step, cross-validation within a biological data s...
A rapid response is necessary to contain emergent biological outbreaks before they can become pandemics. The novel coronavirus (SARS-CoV-2) that causes COVID-19 was first reported in December of 2019 in Wuhan, China and reached most corners of the globe in less than two months. In just over a year since the initial infections, COVID-19 infected alm...
We improved the quality and reduced the time to produce machine learned models for use in small molecule antiviral design. Our globally asynchronous multi-level parallel training approach strong scales to all of Sierra with up to 97.7% efficiency. We trained a novel, character-based Wasserstein autoencoder that produces a higher quality model train...
To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross validation within a single study to assess model accuracy. While an essential first step, cross validation within a biological data s...
Structure-based Deep Fusion models were recently shown to outperform several physics- and machine learning-based protein-ligand binding affinity prediction methods. As part of a multi-institutional COVID-19 pandemic response, over 500 million small molecules were computationally screened against four protein structures from the novel coronavirus (S...
Predicting accurate protein–ligand binding affinities is an important task in drug discovery but remains a challenge even with computationally expensive biophysics-based energy scoring methods and state-of-the-art deep learning approaches. Despite the recent advances in the application of deep convolutional and graph neural network-based approaches...
Cholestatic liver injury is frequently associated with drug inhibition of bile salt transporters, such as the bile salt export pump (BSEP). Reliable in silico models to predict BSEP inhibition directly from chemical structures would significantly reduce costs during drug discovery and could help avoid injury to patients. We report our development o...
Although Zika virus infection of pregnant women can result in congenital Zika syndrome, the factors that cause the syndrome in some but not all infected mothers are still unclear. We identified a mutation that was present in some ZIKV genomes in experimentally inoculated pregnant rhesus macaques and their fetuses. Although we did not find an associ...
Accurately predicting small molecule partitioning and hydrophobicity is critical in the drug discovery process. There are many heterogeneous chemical environments within a cell and the entire human body. For example, drugs must be able to cross the hydrophobic cellular membrane to reach their intracellular targets, and hydrophobicity is an important dri...
Although fetal death is now understood to be a severe outcome of congenital Zika syndrome, the role of viral genetics is still unclear. We sequenced Zika virus (ZIKV) from a rhesus macaque fetus that died after inoculation and identified a single intra-host mutation, M1404I, in the ZIKV polyprotein, located in NS2B. Targeted sequencing flanking pos...
In this paper, we investigate potential biases in datasets used to make drug binding predictions using machine learning. We investigate a recently published metric called the Asymmetric Validation Embedding (AVE) bias which is used to quantify this bias and detect overfitting. We compare it to a slightly revised version and introduce a new weighted...
Predicting accurate protein-ligand binding affinity is important in drug discovery but remains a challenge even with computationally expensive biophysics-based energy scoring methods and state-of-the-art deep learning approaches. Despite the recent advances in the deep convolutional and graph neural network based approaches, the model performance d...
We present a new approach to estimate the binding affinity from given three-dimensional poses of protein-ligand complexes. In this scheme, every protein-ligand atom pair makes an additive free-energy contribution. The sum of these pairwise contributions then gives the total binding free energy or the logarithm of the dissociation constant. The pair...
One of the key requirements for incorporating machine learning (ML) into the drug discovery process is complete traceability and reproducibility of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing ML models that predict key pharma-relevant...
Drug-induced liver injury (DILI) is the most common cause of acute liver failure and a frequent reason for withdrawal of candidate drugs during preclinical and clinical testing. An important type of DILI is cholestatic liver injury, caused by buildup of bile salts within hepatocytes; it is frequently associated with inhibition of bile salt transpor...
Computational prediction of ligand binding is a difficult problem, with more accurate methods being extremely computationally expensive. The use of machine learning for drug binding predictions could possibly leverage biomedical big data in place of time-intensive simulations. This paper reviews current trends in the use of machine...
The question of how Zika virus (ZIKV) changed from a seemingly mild virus to a human pathogen capable of causing microcephaly and of sexual transmission remains unanswered. The unexpected emergence of ZIKV’s pathogenicity and capacity for sexual transmission may be due to genetic changes, and future changes in phenotype may continue to occur as the virus expa...
One of the key requirements for incorporating machine learning into the drug discovery process is complete reproducibility and traceability of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing machine learning models that predict key pharma...
FDA proactively invests in tools to support innovation of emerging technologies, such as infectious disease next generation sequencing (ID-NGS). Here, we introduce FDA-ARGOS quality-controlled reference genomes as a public database for diagnostic purposes and demonstrate its utility on the example of two use cases. We provide quality control metric...
Gene expression profiles have been widely used to characterize patterns of cellular responses to diseases. As data becomes available, scalable learning toolkits become essential to processing large datasets using deep learning models to model complex biological processes. We present an autoencoder to capture nonlinear relationships recovered from g...
Kawasaki disease (KD), first identified in 1967, is a pediatric vasculitis of unknown etiology that has an increasing incidence in Japan and many other countries. KD can cause coronary artery aneurysms. Its epidemiological characteristics, such as seasonality and clinical picture of acute systemic inflammation with prodromal intestinal/respiratory...
Open Reading Frames (ORFs) predicted for TTV7s.
ORFs were predicted in the reference sequence of TTV7 (a), and in the TTV7s identified in two KD patients (b, c). The minimum length of an ORF was assumed to be 50 bases. To date, experimental studies have shown that three ORFs (ORF1, 2 and 3) are involved in protein synthesis.
(EPS)
Torque teno viruses (TTVs) were fragmented and mapped to TTV7.
The genomes of TTVs, other than TTV7, were fragmented into 80 bases and mapped to the genome of TTV7. These fragments were mapped only to non-coding regions but not to open reading frames (ORFs). This indicates that the ORFs are specific to the strain of TTVs, while the non-coding regio...
Variants in nucleotides and amino acids of TTV7 identified in individual patients with Kawasaki disease (KD) (spreadsheet format for Fig 3).
(XLS)
Abundance of reads mapped to viruses in the pooled samples (numerical data for Fig 1).
(XLS)
Strains of TTVs in the pooled whole blood (WB) DNA samples identified by metagenomic sequencing.
Reads from pooled WB DNA samples were mapped to anelloviruses. WB DNA pooled from Kawasaki disease (KD) patients was mapped to TTV5 (a) and TTV15 (b), that from diarrhea controls was mapped to TTV15 (c) and TTV29 (d), and that from respiratory infection...
The design of the primer sets used for PCR of TTV7.
Primer set 1 was designed to span from open reading frame (ORF) 2 to ORF1, while primer set 2 covered the rest of the circular genome of TTV7.
(EPS)
Background
The National Cancer Institute drug pair screening effort against 60 well-characterized human tumor cell lines (NCI-60) presents an unprecedented resource for modeling combinational drug activity.
Results
We present a computational model for predicting cell line response to a subset of drug pairs in the NCI-ALMANAC database. Based on res...
Infectious disease next generation sequencing (ID-NGS) diagnostics are on the cusp of revolutionizing the clinical market. To facilitate this transition, FDA proactively invested in tools to support innovation of emerging technologies. FDA and collaborators established a publicly available database, FDA dAtabase for Regulatory-Grade micrObial Seque...
Antimicrobial resistance (AMR) is a global health issue. In an effort to minimize this threat to astronauts, who may be immunocompromised and thus at a greater risk of infection from antimicrobial-resistant pathogens, a comprehensive study of the ISS "resistome" was conducted. Using whole genome sequencing (WGS) and disc diffusion antibiotic resist...
African swine fever virus (ASFV) is a macrophage-tropic virus responsible for ASF, a transboundary disease that threatens swine production world-wide. Since there are no vaccines available to control ASF after an outbreak, obtaining an understanding of the virus-host interaction is important for developing new intervention strategies. In this study...
The draft genome sequences of six Bacillus strains, isolated from the International Space Station and belonging to the Bacillus anthracis - B. cereus - B. thuringiensis group, are presented here. These strains were isolated from the Japanese Experiment Module (one strain), U.S. Harmony Node 2 (three strains), and Russian Segment Zvezda Module (two...
Background:
The built environment of the International Space Station (ISS) is a highly specialized space in terms of both physical characteristics and habitation requirements. It is unique with respect to conditions of microgravity, exposure to space radiation, and increased carbon dioxide concentrations. Additionally, astronauts inhabit a large p...
The microbiome of environmental surfaces from the International Space Station was characterized in order to examine the relationship to crew and hardware maintenance. The Microbial Observatory (ISS-MO) experiment generated a microbial census of ISS environments using advanced molecular microbial community analyses along with traditional culture-ba...
Concatenated alignment of core genes in G. vaginalis. DOI: http://dx.doi.org/10.7554/eLife.20983.017
G. vaginalis core genome alignment trimmed with Gblocks. DOI: http://dx.doi.org/10.7554/eLife.20983.018
Recombinant fragments detected with BratNextGen in S. saprophyticus alignment. DOI: http://dx.doi.org/10.7554/eLife.20983.030
Maximum likelihood phylogenetic analysis of trimmed G. vaginalis alignment with RAxML. DOI: http://dx.doi.org/10.7554/eLife.20983.019
Maximum likelihood phylogenetic analysis of trimmed S. saprophyticus alignment with RAxML. DOI: http://dx.doi.org/10.7554/eLife.20983.028
Recombinant fragments detected with BratNextGen in trimmed G. vaginalis alignment. DOI: http://dx.doi.org/10.7554/eLife.20983.020
S. saprophyticus plasmid alignment trimmed with trimal. DOI: http://dx.doi.org/10.7554/eLife.20983.029
(A) Troy sample details (B) SEM-EDS results from nodule. For each replicate, upper value is weight %, lower value is atomic %. (C) Common chemical constituents of renal and bladder calculi (kidney and bladder stones) and Troy nodules. + - presence, ND- not detected, Unk- unknown, RF- Relative Frequency in modern populations (C.Y.C Pak (ed.) Pak [19...
eLife digest
Why and how have some bacteria evolved to cause illness in humans? One way to study bacterial evolution is to search for ancient samples of bacteria and use DNA sequencing technology to investigate how modern bacteria have changed from their ancestors. Understanding the evolution process may help researchers to understand how some bact...
The draft genome sequences of 20 biosafety level 2 (BSL-2) opportunistic pathogens isolated from the environmental surfaces of the International Space Station (ISS) were presented. These genomic sequences will help in understanding the influence of microgravity on the pathogenicity and virulence of these strains when compared with Earth strains.
[This corrects the article DOI: 10.1371/journal.pone.0146251.].
Confounding factors challenge the investigation of disease in antiquity, primarily the scenario where multiple evidentiary sources (e.g., skeletal, archaeological, or historical) do not provide a consensus or indicate the association of a specific disease with a given burial assemblage. Without such a priori knowledge, it is critical to integrate a...
In vivo serial passage of non-pathogenic viruses has been shown to lead to increased viral virulence, and although the precise mechanism(s) are not clear, it is known that both host and viral factors are associated with increased pathogenicity. Under- or overnutrition leads to a decreased or dysregulated immune response and can increase viral mutan...
Venezuelan equine encephalitis virus (VEEV) is a mosquito-borne alphavirus that has caused large outbreaks of severe illness in both horses and humans. New approaches are needed to rapidly infer the origin of a newly discovered VEEV strain, estimate its equine amplification and resultant epidemic potential, and predict human virulence phenotype. We...
Tanglegram connecting the corresponding taxa which illustrates the high similarity between the MSA tree (left) and the SNP tree (right).
(DOCX)
Tanglegram illustrating where the SNP tree based on all the SNPs (left) and that based only on the SNPs in the capsid gene (right) differ.
(DOCX)
Annotations, 13-mer contexts and reference genome alignments for SNPs identified by whole genome analysis.
(XLSX)