Jonathan E Allen

  • Ph.D.
  • Principal Investigator at Lawrence Livermore National Laboratory

About

153
Publications
23,806
Reads
16,057
Citations
Current institution
Lawrence Livermore National Laboratory
Current position
  • Principal Investigator

Publications

Preprint
Full-text available
Machine learning models are often used as scoring functions to predict the binding affinity of a protein-ligand complex. These models are trained with limited amounts of data with experimentally measured binding affinity values. A large number of compounds are labeled inactive through single-concentration screens without measuring binding affinitie...
Article
Full-text available
Accurate metagenomic classification relies on comprehensive, up-to-date, and validated reference databases. While the NCBI BLAST Nucleotide (nt) database, encompassing a vast collection of sequences from all domains of life, represents an invaluable resource, its massive size—currently exceeding 10¹² nucleotides—and exponential growth pose signific...
Article
Full-text available
Introduction: Recent advances in 3D structure-based deep learning approaches demonstrate improved accuracy in predicting protein-ligand binding affinity in drug discovery. These methods complement physics-based computational modeling such as molecular docking for virtual high-throughput screening. Despite recent advances and improved predictive perf...
Article
Full-text available
Traditional methods for identifying “hit” molecules from a large collection of potential drug-like candidates rely on biophysical theory to compute approximations to the Gibbs free energy of the binding interaction between the drug and its protein target. These approaches have a significant limitation in that they require exceptional computing capa...
Article
The design of two overlapping genes in a microbial genome is an emerging technique for adding more reliable control mechanisms in engineered organisms for increased stability. The design of functional overlapping gene pairs is a challenging procedure and computational design tools are used to improve the efficiency to deploy successful designs in g...
Preprint
Full-text available
Background: Accurate metagenomic classification relies on comprehensive, up-to-date, and validated reference databases. While the NCBI BLAST Nucleotide (nt) database, encompassing a vast collection of sequences from all domains of life, represents an invaluable resource, its massive size (currently exceeding 10¹² nucleotides) and exponential growth...
Article
Full-text available
Decades of drug development research have explored a vast chemical space for highly active compounds. The exponential growth of virtual libraries enables easy access to billions of synthesizable molecules. Computational modeling, particularly molecular docking, utilizes physics-based calculations to prioritize molecules for synthesis and testing. N...
Article
Full-text available
Viral populations in natural infections can have a high degree of sequence diversity, which can directly impact immune escape. However, antibody potency is often tested in vitro with relatively clonal viral populations, such as laboratory virus or pseudotyped virus stocks, which may not accurately represent the genetic diversity of circulating vi...
Preprint
The design of two overlapping genes in a microbial genome is an emerging technique for adding more reliable control mechanisms in engineered organisms for increased safety. The design of functional gene pairs is a challenging procedure and computational design tools are used to improve the efficiency to deploy successful designs in genetically engi...
Article
Full-text available
Protein–ligand interactions are essential to drug discovery and drug development efforts. Desirable on-target or multitarget interactions are the first step in finding an effective therapeutic, while undesirable off-target interactions are the first step in assessing safety. In this work, we introduce a novel ligand-based featurization and mapping...
Preprint
Full-text available
Viral populations in natural infections can have a high degree of sequence diversity, which can directly impact immune escape. However, antibody potency is often tested in vitro with relatively clonal viral populations, such as laboratory virus or pseudotyped virus stocks, which may not accurately represent the genetic diversity of circulating vi...
Article
Full-text available
Minimizing the human and economic costs of the COVID-19 pandemic and future pandemics requires the ability to develop and deploy effective treatments for novel pathogens as soon as possible after they emerge. To this end, we introduce a new computational pipeline for the rapid identification and characterization of binding sites in viral proteins a...
Article
Full-text available
Neural Network (NN) models provide potential to speed up the drug discovery process and reduce its failure rates. The success of NN models requires uncertainty quantification (UQ) as drug discovery explores chemical space beyond the training data distribution. Standard NN models do not provide uncertainty information. Some methods require changing...
Preprint
Full-text available
Protein-ligand interactions are essential to drug discovery and drug development efforts. Desirable on-target or multi-target interactions are a first step in finding an effective therapeutic; undesirable off-target interactions are a first step in assessing safety. In this work, we introduce a novel ligand-based featurization and mapping of human...
Article
Molecular biology methods and technologies have advanced substantially over the past decade. These new molecular methods should be incorporated among the standard tools of planetary protection (PP) and could be validated for incorporation by 2026. To address the feasibility of applying modern molecular techniques to such an application, NASA conduc...
Preprint
Publicly available collections of drug-like molecules have grown to comprise tens of billions of possibilities in recent history due to advances in chemical synthesis. Traditional methods for identifying “hit” molecules from a large collection of potential drug-like candidates have relied on biophysical theory to compute approximations to the Gibb...
Preprint
Full-text available
Generative molecular design (GMD) is an increasingly popular strategy for drug discovery, using machine learning models to propose, evaluate and optimize chemical structures against a set of target design criteria. We present the ATOM-GMD platform, a scalable multiprocessing framework to optimize many parameters simultaneously over large population...
Article
Full-text available
Genetic analysis of intra-host viral populations provides unique insight into pre-emergent mutations that may contribute to the genotype of future variants. Clinical samples positive for SARS-CoV-2 collected in California during the first months of the pandemic were sequenced to define the dynamics of mutation emergence as the virus became establis...
Preprint
Full-text available
Neural Network (NN) models provide potential to speed up the drug discovery process and reduce its failure rates. The success of NN models requires uncertainty quantification (UQ) as drug discovery explores chemical space beyond the training data distribution. Standard NN models do not provide uncertainty information. Methods that combine Bayesian m...
Article
Full-text available
We present a structure-based method for finding and evaluating structural similarities in protein regions relevant to ligand binding. PDBspheres comprises an exhaustive library of protein structure regions ('spheres') adjacent to complexed ligands derived from the Protein Data Bank (PDB), along with methods to find and evaluate structural matches b...
Preprint
Predicting molecular activity against protein targets is difficult because of the paucity of experimental data. Approaches like multitask modeling and collaborative filtering seek to improve model accuracy by leveraging results from multiple targets, but are limited because different compounds are measured with different assays, leading to sparse d...
Article
Full-text available
The growing capabilities of synthetic biology and organic chemistry demand tools to guide syntheses toward useful molecules. Here, we present Molecular AutoenCoding Auto-Workaround (MACAW), a tool that uses a novel approach to generate molecules predicted to meet a desired property specification (e.g., a binding affinity of 50 nM or an octane numbe...
Article
Full-text available
Atomistic Molecular Dynamics (MD) simulations provide researchers the ability to model biomolecular structures such as proteins and their interactions with drug-like small molecules with greater spatiotemporal resolution than is otherwise possible using experimental methods. MD simulations are notoriously expensive computational endeavors that have...
Article
Full-text available
The identification of promising lead compounds showing pharmacological activities toward a biological target is essential in early stage drug discovery. With the recent increase in available small-molecule databases, virtual high-throughput screening using physics-based molecular docking has emerged as an essential tool in assisting fast and cost-e...
Preprint
Full-text available
The identification of promising lead compounds showing pharmacological activities toward a biological target is essential in early-stage drug discovery. With the recent increase in available small–molecule databases, virtual high-throughput screening using physics-based molecular docking has emerged as an essential tool in assisting fast and cost-e...
Preprint
Full-text available
Minimizing the human and economic costs of the COVID-19 pandemic and of future pandemics requires the ability to develop and deploy effective treatments for novel pathogens as soon as possible after they emerge. To this end, we introduce a unique, computational pipeline for the rapid identification and characterization of binding sites in the prote...
Preprint
Full-text available
The growing capabilities of synthetic biology and organic chemistry demand tools to guide syntheses towards useful molecules. Here, we present MACAW (Molecular AutoenCoding Auto-Workaround), a tool that uses a novel approach to generate molecules predicted to meet a desired property specification (e.g. a binding affinity of 50 nM or an octane numbe...
Preprint
Full-text available
We present a structure-based method for finding and evaluating structural similarities in protein regions relevant to ligand binding. PDBspheres comprises an exhaustive library of protein structure regions (spheres) adjacent to complexed ligands derived from the Protein Data Bank (PDB), along with methods to find and evaluate structural matches bet...
Article
Full-text available
To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross-validation within a single study to assess model accuracy. While an essential first step, cross-validation within a biological data s...
Article
Full-text available
A rapid response is necessary to contain emergent biological outbreaks before they can become pandemics. The novel coronavirus (SARS-CoV-2) that causes COVID-19 was first reported in December of 2019 in Wuhan, China and reached most corners of the globe in less than two months. In just over a year since the initial infections, COVID-19 infected alm...
Article
Full-text available
We improved the quality and reduced the time to produce machine learned models for use in small molecule antiviral design. Our globally asynchronous multi-level parallel training approach strong scales to all of Sierra with up to 97.7% efficiency. We trained a novel, character-based Wasserstein autoencoder that produces a higher quality model train...
Preprint
Full-text available
To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross validation within a single study to assess model accuracy. While an essential first step, cross validation within a biological data s...
Preprint
Full-text available
Structure-based Deep Fusion models were recently shown to outperform several physics- and machine learning-based protein-ligand binding affinity prediction methods. As part of a multi-institutional COVID-19 pandemic response, over 500 million small molecules were computationally screened against four protein structures from the novel coronavirus (S...
Article
Full-text available
Predicting accurate protein–ligand binding affinities is an important task in drug discovery but remains a challenge even with computationally expensive biophysics-based energy scoring methods and state-of-the-art deep learning approaches. Despite the recent advances in the application of deep convolutional and graph neural network-based approaches...
Article
Cholestatic liver injury is frequently associated with drug inhibition of bile salt transporters, such as the bile salt export pump (BSEP). Reliable in silico models to predict BSEP inhibition directly from chemical structures would significantly reduce costs during drug discovery and could help avoid injury to patients. We report our development o...
Article
Full-text available
Although Zika virus infection of pregnant women can result in congenital Zika syndrome, the factors that cause the syndrome in some but not all infected mothers are still unclear. We identified a mutation that was present in some ZIKV genomes in experimentally inoculated pregnant rhesus macaques and their fetuses. Although we did not find an associ...
Article
Accurately predicting small molecule partitioning and hydrophobicity is critical in the drug discovery process. There are many heterogeneous chemical environments within a cell and the entire human body. For example, drugs must be able to cross the hydrophobic cellular membrane to reach their intracellular targets and hydrophobicity is an important dri...
Preprint
Full-text available
Although fetal death is now understood to be a severe outcome of congenital Zika syndrome, the role of viral genetics is still unclear. We sequenced Zika virus (ZIKV) from a rhesus macaque fetus that died after inoculation and identified a single intra-host mutation, M1404I, in the ZIKV polyprotein, located in NS2B. Targeted sequencing flanking pos...
Chapter
In this paper, we investigate potential biases in datasets used to make drug binding predictions using machine learning. We investigate a recently published metric called the Asymmetric Validation Embedding (AVE) bias which is used to quantify this bias and detect overfitting. We compare it to a slightly revised version and introduce a new weighted...
Preprint
Predicting accurate protein-ligand binding affinity is important in drug discovery but remains a challenge even with computationally expensive biophysics-based energy scoring methods and state-of-the-art deep learning approaches. Despite the recent advances in the deep convolutional and graph neural network based approaches, the model performance d...
Article
We present a new approach to estimate the binding affinity from given three-dimensional poses of protein-ligand complexes. In this scheme, every protein-ligand atom pair makes an additive free-energy contribution. The sum of these pairwise contributions then gives the total binding free energy or the logarithm of the dissociation constant. The pair...
Article
Full-text available
One of the key requirements for incorporating machine learning (ML) into the drug discovery process is complete traceability and reproducibility of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing ML models that predict key pharma-relevant...
Preprint
Drug-induced liver injury (DILI) is the most common cause of acute liver failure and a frequent reason for withdrawal of candidate drugs during preclinical and clinical testing. An important type of DILI is cholestatic liver injury, caused by buildup of bile salts within hepatocytes; it is frequently associated with inhibition of bile salt transpor...
Article
Computational prediction of ligand binding is a difficult problem, with more accurate methods being extremely computationally expensive. The use of machine learning for drug binding predictions could possibly leverage the use of biomedical big data in exchange for time-intensive simulations. This paper reviews current trends in the use of machine...
Preprint
In this paper, we investigate potential biases in datasets used to make drug binding predictions using machine learning. We investigate a recently published metric called the Asymmetric Validation Embedding (AVE) bias which is used to quantify this bias and detect overfitting. We compare it to a slightly revised version and introduce a new weighted...
Article
Full-text available
The question of how Zika virus (ZIKV) changed from a seemingly mild virus to a human pathogen capable of causing microcephaly and sexual transmission remains unanswered. The unexpected emergence of ZIKV’s pathogenicity and capacity for sexual transmission may be due to genetic changes, and future changes in phenotype may continue to occur as the virus expa...
Preprint
One of the key requirements for incorporating machine learning into the drug discovery process is complete reproducibility and traceability of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing machine learning models that predict key pharma...
Article
Full-text available
FDA proactively invests in tools to support innovation of emerging technologies, such as infectious disease next generation sequencing (ID-NGS). Here, we introduce FDA-ARGOS quality-controlled reference genomes as a public database for diagnostic purposes and demonstrate its utility on the example of two use cases. We provide quality control metric...
Preprint
Full-text available
Gene expression profiles have been widely used to characterize patterns of cellular responses to diseases. As data becomes available, scalable learning toolkits become essential to processing large datasets using deep learning models to model complex biological processes. We present an autoencoder to capture nonlinear relationships recovered from g...
Article
Full-text available
Kawasaki disease (KD), first identified in 1967, is a pediatric vasculitis of unknown etiology that has an increasing incidence in Japan and many other countries. KD can cause coronary artery aneurysms. Its epidemiological characteristics, such as seasonality and clinical picture of acute systemic inflammation with prodromal intestinal/respiratory...
Data
Open Reading Frames (ORFs) predicted for TTV7s. ORFs were predicted in the reference sequence of TTV7 (a), and in the TTV7s identified in two KD patients (b, c). Minimal length of an ORF was assumed to be 50 bases. To date, experimental studies showed that three ORFs (ORF1, 2 and 3) were involved in the protein synthesis. (EPS)
Data
Torque teno viruses (TTVs) were fragmented and mapped to TTV7. The genomes of TTVs, other than TTV7, were fragmented into 80 bases and mapped to the genome of TTV7. These fragments were mapped only to non-coding regions but not to open reading frames (ORFs). This indicates that the ORFs are specific to the strain of TTVs, while the non-coding regio...
Data
Variants in nucleotides and amino acids of TTV7 identified in individual patients with Kawasaki disease (KD) (spreadsheet format for Fig 3). (XLS)
Data
Abundance of reads mapped to viruses in the pooled samples (numerical data for Fig 1). (XLS)
Data
Strains of TTVs in the pooled whole blood (WB) DNA samples identified by metagenomic sequencing. Reads from pooled WB DNA samples were mapped to anelloviruses. WB DNA pooled from Kawasaki disease (KD) patients was mapped to TTV5 (a) and TTV15 (b), that from diarrhea controls was mapped to TTV15 (c) and TTV29 (d), and that from respiratory infection...
Data
The design of primer sets which were used for PCR of TTV7. The primer set 1 was designed to span from the open reading frame (ORF) 2 to ORF1, while the primer set 2 covered the rest of the circular genome of TTV7. (EPS)
Article
Full-text available
Background: The National Cancer Institute drug pair screening effort against 60 well-characterized human tumor cell lines (NCI-60) presents an unprecedented resource for modeling combinational drug activity. Results: We present a computational model for predicting cell line response to a subset of drug pairs in the NCI-ALMANAC database. Based on res...
Preprint
Full-text available
Infectious disease next generation sequencing (ID-NGS) diagnostics are on the cusp of revolutionizing the clinical market. To facilitate this transition, FDA proactively invested in tools to support innovation of emerging technologies. FDA and collaborators established a publicly available database, FDA dAtabase for Regulatory-Grade micrObial Seque...
Article
Full-text available
Antimicrobial resistance (AMR) is a global health issue. In an effort to minimize this threat to astronauts, who may be immunocompromised and thus at a greater risk of infection from antimicrobial resistant pathogens, a comprehensive study of the ISS "resistome" was conducted. Using whole genome sequencing (WGS) and disc diffusion antibiotic resist...
Article
Full-text available
African swine fever virus (ASFV) is a macrophage-tropic virus responsible for ASF, a transboundary disease that threatens swine production world-wide. Since there are no vaccines available to control ASF after an outbreak, obtaining an understanding of the virus-host interaction is important for developing new intervention strategies. In this study...
Article
Full-text available
The draft genome sequences of six Bacillus strains, isolated from the International Space Station and belonging to the Bacillus anthracis - B. cereus - B. thuringiensis group, are presented here. These strains were isolated from the Japanese Experiment Module (one strain), U.S. Harmony Node 2 (three strains), and Russian Segment Zvezda Module (two...
Article
Full-text available
Background: The built environment of the International Space Station (ISS) is a highly specialized space in terms of both physical characteristics and habitation requirements. It is unique with respect to conditions of microgravity, exposure to space radiation, and increased carbon dioxide concentrations. Additionally, astronauts inhabit a large p...
Conference Paper
Full-text available
The microbiome of environmental surfaces from the International Space Station was characterized in order to examine the relationship to crew and hardware maintenance. The Microbial Observatory (ISS-MO) experiment generated a microbial census of ISS environments using advanced molecular microbial community analyses along with traditional culture-ba...
Data
Concatenated alignment of core genes in G. vaginalis. DOI: http://dx.doi.org/10.7554/eLife.20983.017
Data
G. vaginalis core genome alignment trimmed with Gblocks. DOI: http://dx.doi.org/10.7554/eLife.20983.018
Data
Recombinant fragments detected with BratNextGen in S. saprophyticus alignment. DOI: http://dx.doi.org/10.7554/eLife.20983.030
Data
Maximum likelihood phylogenetic analysis of trimmed G. vaginalis alignment with RAxML. DOI: http://dx.doi.org/10.7554/eLife.20983.019
Data
Maximum likelihood phylogenetic analysis of trimmed S. saprophyticus alignment with RAxML. DOI: http://dx.doi.org/10.7554/eLife.20983.028
Data
Recombinant fragments detected with BratNextGen in trimmed G. vaginalis alignment. DOI: http://dx.doi.org/10.7554/eLife.20983.020
Data
S. saprophyticus plasmid alignment trimmed with trimal. DOI: http://dx.doi.org/10.7554/eLife.20983.029
Data
(A) Troy sample details. (B) SEM-EDS results from nodule. For each replicate, upper value is weight %, lower value is atomic %. (C) Common chemical constituents of renal and bladder calculi (kidney and bladder stones) and Troy nodules. +: presence, ND: not detected, Unk: unknown, RF: relative frequency in modern populations (C.Y.C Pak (ed.) Pak [19...
Article
Full-text available
eLife digest: Why and how have some bacteria evolved to cause illness in humans? One way to study bacterial evolution is to search for ancient samples of bacteria and use DNA sequencing technology to investigate how modern bacteria have changed from their ancestors. Understanding the evolution process may help researchers to understand how some bact...
Article
Full-text available
The draft genome sequences of 20 biosafety level 2 (BSL-2) opportunistic pathogens isolated from the environmental surfaces of the International Space Station (ISS) were presented. These genomic sequences will help in understanding the influence of microgravity on the pathogenicity and virulence of these strains when compared with Earth strains.
Conference Paper
Full-text available
Confounding factors challenge the investigation of disease in antiquity, primarily the scenario where multiple evidentiary sources (e.g., skeletal, archaeological, or historical) do not provide a consensus or indicate the association of a specific disease with a given burial assemblage. Without such a priori knowledge, it is critical to integrate a...
Article
Full-text available
In vivo serial passage of non-pathogenic viruses has been shown to lead to increased viral virulence, and although the precise mechanism(s) are not clear, it is known that both host and viral factors are associated with increased pathogenicity. Under- or overnutrition leads to a decreased or dysregulated immune response and can increase viral mutan...
Article
Full-text available
Venezuelan equine encephalitis virus (VEEV) is a mosquito-borne alphavirus that has caused large outbreaks of severe illness in both horses and humans. New approaches are needed to rapidly infer the origin of a newly discovered VEEV strain, estimate its equine amplification and resultant epidemic potential, and predict human virulence phenotype. We...
Data
Tanglegram connecting the corresponding taxa which illustrates the high similarity between the MSA tree (left) and the SNP tree (right). (DOCX)
Data
Tanglegram illustrating where the SNP tree based on all the SNPs (left) and that based only on the SNPs in the capsid gene (right) differ. (DOCX)
Data
Annotations, 13-mer contexts and reference genome alignments for SNPs identified by whole genome analysis. (XLSX)
