Regina Barzilay’s research while affiliated with Massachusetts Institute of Technology and other places


Publications (338)


Protein codes promote selective subcellular compartmentalization
  • Article

February 2025 · 12 Reads · Science

Henry R. Kilgore · Itamar Chinn · Peter G. Mikhael · [...] · Richard A. Young

Cells have evolved mechanisms to distribute ~10 billion protein molecules to subcellular compartments where diverse proteins involved in shared functions must assemble. Here, we demonstrate that proteins with shared functions share amino acid sequence codes that guide them to compartment destinations. A protein language model, ProtGPS, was developed that predicts with high performance the compartment localization of human proteins excluded from the training set. ProtGPS successfully guided generation of novel protein sequences that selectively assemble in the nucleolus. ProtGPS identified pathological mutations that change this code and lead to altered subcellular localization of proteins. Our results indicate that protein sequences contain not only a folding code, but also a previously unrecognized code governing their distribution to diverse subcellular compartments.


Characteristics of the deep learning model and the training and evaluation datasets for prediction of HLA-I epitopes
a, Datasets used for training and evaluation were curated by combining data from several previous studies as well as a recent download of the IEDB. Eluted ligand data were used as positives and randomly sampled decoys from Swiss-Prot²⁶ served as negatives. For evaluation, data from an immunopeptidomic study involving 24 monoallelic cell lines were used²⁷. To evaluate immunogenicity, five studies²⁸⁻³⁰ that measure the immunogenicity of influenza epitopes identified via mass spectrometry were used. b–e, Peptide length distribution of the HLA-I binders (b,c) and pie chart of the proportion of epitopes per HLA-I allele (d,e) in the presentation training (b,d) and evaluation (c,e) datasets. All alleles present in the dataset with a frequency <1% are denoted as ‘other’. f, The binding module takes as input the amino acid sequences of the major histocompatibility complex and peptide in the form: [cls] mhc [sep] pep [eos], where [cls], [sep] and [eos] are special tokens that separate the two sequences. This new sequence is fed to the Evolutionary Scale Modeling-2 (ESM-2) Transformer protein language model, and the vector representation for the [cls] token is used to represent the complex. The ligand elution module combines the binding vector with a long short-term memory (LSTM) recurrent neural network encoding of the peptide that includes its left and right flanks in the parent protein of origin. The model can be used when trained with or without flanking residues. These combined features are then concatenated and used to compute a ligand presentation score. The model is first trained on the ligand presentation task. Then, the model is trained with five different random seeds and their scores are averaged to create an ensemble score. pHLA: peptide-human leukocyte antigen complex; TCR: T cell receptor. Panel f created with BioRender.com.
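The caption above fully specifies the input layout and the two-module design. As a purely illustrative sketch of that layout (a toy Transformer stands in for ESM-2, and all sequences, dimensions and names are hypothetical rather than the authors' implementation), the combination could look like this:

```python
# Illustrative sketch only: a stand-in encoder replaces ESM-2 and all names,
# sequences and dimensions are hypothetical, not the MUNIS implementation.
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {tok: i for i, tok in enumerate(["[cls]", "[sep]", "[eos]"] + list(AA))}

def complex_tokens(mhc: str, pep: str) -> torch.Tensor:
    """Build the [cls] mhc [sep] pep [eos] sequence described in the caption."""
    toks = ["[cls]", *mhc, "[sep]", *pep, "[eos]"]
    return torch.tensor([[VOCAB[t] for t in toks]])

def peptide_tokens(left: str, pep: str, right: str) -> torch.Tensor:
    """Peptide with its left/right flanks from the parent protein."""
    return torch.tensor([[VOCAB[a] for a in left + pep + right]])

class PresentationScorer(nn.Module):
    def __init__(self, d: int = 64):
        super().__init__()
        self.emb = nn.Embedding(len(VOCAB), d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.binding = nn.TransformerEncoder(layer, num_layers=2)  # ESM-2 stand-in
        self.flank_lstm = nn.LSTM(d, d, batch_first=True)          # flank-aware module
        self.head = nn.Linear(2 * d, 1)                            # presentation score

    def forward(self, complex_ids, flanked_pep_ids):
        cls_vec = self.binding(self.emb(complex_ids))[:, 0]        # [cls] representation
        _, (h, _) = self.flank_lstm(self.emb(flanked_pep_ids))
        return torch.sigmoid(self.head(torch.cat([cls_vec, h[-1]], dim=-1)))

model = PresentationScorer()
score = model(complex_tokens("YFAMYGEKVAHTHVDTLYVRY", "NLVPMVATV"),
              peptide_tokens("AAR", "NLVPMVATV", "QGQ"))
print(float(score))  # an ensemble would average this over several random seeds
```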
MUNIS outperforms existing predictors in classifying HLA-I binders across 8–11mers
a,b, Average precision (a) and ROC-AUC (b) of MUNIS and current state-of-the-art tools MixMHCpred 2.2, NetMHCpan 4.1, MHCflurry 2.0, TransPHLA and BigMHC on predicting eluted ligands (binders) from mass spectrometry experiments from Pyke et al.²⁷ against decoy peptides (non-binders), n = 24 HLA-I alleles. Percentages of overlap with the training datasets of each tool across all epitopes in the presentation benchmark are shown below the plots. c, Per-allele pairwise comparisons of MUNIS and other predictors in classifying HLA-I binders. Each point is the model performance on one allele. d,e, Average precision (d) and ROC-AUC (e) of all predictors on classifying binders versus non-binders binned by epitope length, n = 24 HLA-I alleles. P values for pairwise comparisons between MUNIS and each predictor were calculated using the two-sided Wilcoxon rank sums test (not shown if P > 0.1; ****P < 1 × 10⁻⁴). Box plots are presented with medians as centre lines, 25th and 75th percentiles as lower and upper quartiles, and 1.5 times the interquartile range from the quartiles as whiskers (outliers not shown).
Motif analysis of misclassified binders reveals inconsistent reliance of existing models on canonical HLA-I-binding motifs
a, Box plots of model score for eluted ligands (binders) from mass spectrometry experiments from Pyke et al.²⁷ and decoy peptides (non-binders) for each predictor (41,724 binders and 208,609 non-binders). Box plots are presented with medians as centre lines, 25th and 75th percentiles as lower and upper quartiles, and 1.5 times the interquartile range from the quartiles as whiskers (outliers not shown). b, Binding motifs for 9-mers for all correctly classified binders (true positives) and misclassified non-binders (false positives) by each tool for representative allele HLA-B*40:01. HLA anchor residues are highlighted in yellow. Binding motifs are not shown for MUNIS false positives as there were fewer than 25 incorrectly labelled binders per allele. Model scores >0.90 were used as cut-offs for true positives and false positives. c, Shannon entropy at HLA anchor residues (positions two and nine in a 9-mer) for true-positive (TP) and false-positive (FP) HLA-I binders predicted by each tool. Each point represents the Shannon entropy at a particular anchor residue for peptides that are false and true predicted binders for one HLA allele. P values for pairwise comparisons were calculated using the two-sided Wilcoxon rank sums test (****P < 1 × 10⁻⁴).
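For reference, the anchor-position entropy in panel c is an ordinary Shannon entropy over the amino acids observed at positions 2 and 9 of the predicted binders; a small illustrative computation (made-up 9-mers, not the paper's data) is:

```python
# Sketch of the anchor-residue entropy from panel c (illustrative peptides only):
# lower entropy at positions 2 and 9 means a predictor's hits concentrate on the
# canonical HLA anchor preferences.
from collections import Counter
from math import log2

def shannon_entropy(residues):
    counts = Counter(residues)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def anchor_entropies(ninemers, anchors=(2, 9)):
    return {pos: shannon_entropy([p[pos - 1] for p in ninemers]) for pos in anchors}

true_positives = ["AEIDFLTKV", "SEVDNQLKL", "AEMDRVQML"]   # shared E at P2
false_positives = ["KLRDNWTQS", "MPAGHYTRC", "QVNEKIDSA"]  # no shared anchor motif

print(anchor_entropies(true_positives))   # low entropy at anchor positions
print(anchor_entropies(false_positives))  # higher entropy: anchors not enforced
```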
MUNIS outperforms existing tools in predicting epitope immunodominance hierarchies
a, Per-dataset performance of MUNIS against existing tools MixMHCpred 2.2, NetMHCpan 4.1, MHCflurry 2.0, TransPHLA, BigMHC and Prime 2.0 on predicting eluted ligands (binders) from five influenza immunopeptidome experiments against ‘decoy’ peptides (non-binders). Positives are all mass spectrometry-eluted ligands and negatives are all other peptides (‘decoys’) in the viral proteome. Only proteins with at least one eluted ligand are considered. b, Per-dataset performance when positives are conditioned on immunogenic peptides and negatives contain both the ‘decoys’ and the eluted ligands that were not immunogenic. In a and b, each point represents performance on one dataset (that is one HLA-I allele). Bar plots show median performance across datasets and error bars show the standard error across the five datasets. Percentages of epitope overlap with the training datasets of each tool across all positive epitopes in the five influenza benchmarks are shown below the plots. No pairwise comparisons between MUNIS and other predictors had a P value <0.05. c,d, Spearman correlation of each model’s score and frequency of response to an epitope across all epitope–allele pairs in acute (c) and chronic (d) HIV infection. Percentages of epitope overlap with the training datasets of each tool across all epitopes in the HIV benchmark are shown below the plots. e,f, Median model score ± standard error of the median for epitopes with binned frequencies of responses across all epitope–allele pairs in acute (e) and chronic (f) HIV infection.
Experimental HLA-I–peptide stability assay confirms the ability of MUNIS to discriminate between binding and non-binding peptides within EBV
a, Schematic showing the epitope prioritization pipeline for experimental validation. The top-337 ranked peptides from the BRLF1, BZLF1, EBNA1, LMP2 and EBNA3a proteins from EBV predicted to bind 1 of 17 different HLA-I alleles were chosen for downstream analysis. b, Schematic showing experimental validation of MUNIS performance on EBV epitope prediction. Stability assays on HLA-I–peptide pairs were performed using TAP-deficient monoallelic HLA-I cell lines to identify peptides that bind and are presented by HLA-I molecules. IFNγ ELISpot assays were performed on each peptide predicted to bind an HLA molecule presented by 30 HLA-haplotyped individuals to identify immunogenic peptides (data shown in Fig. 6). c, Representative data of the relative stabilization of HLA-B*35:01 by two EBV peptides predicted to bind the allele. The MFI for the DMSO negative control is shown in light grey, the B*3501-specific HIV immunodominant peptide in light blue, the two predicted binders from the EBV proteome in blue and a predicted non-binder from the EBV proteome in dark grey. The higher the MFI, the greater the stabilization of the allele by a given peptide. d, Summary data for all predicted binders and non-binders for HLA-B*35:01. All MFIs were normalized to the HIV immunodominant peptide for the given HLA-I allele as denoted by the dashed line. Blue circles are predicted binders and grey circles are predicted non-binders. e, Summary data for all 17 HLA-I alleles evaluated for the 337 predicted peptides. Box plots are presented with medians as centre lines, 25th and 75th percentiles as lower and upper quartiles, and 1.5 times the interquartile range from the quartiles as whiskers (outliers not shown). f, Normalized anti-HLA MFI for binders versus non-binders conditioned on predicted binders with a MUNIS score greater than or equal to the given threshold score. Each point represents the median normalized anti-HLA MFI across all peptides predicted to bind or not bind a particular HLA-I allele (n = 17 HLA-I alleles). P values for pairwise comparisons between predicted binders and non-binders were calculated using the two-sided Wilcoxon rank sums test. Panels a and b created with BioRender.com.


Deep learning enhances the prediction of HLA class I-presented CD8 T cell epitopes in foreign pathogens
  • Article
  • Full-text available

January 2025 · 23 Reads · Nature Machine Intelligence

Accurate in silico determination of CD8⁺ T cell epitopes would greatly enhance T cell-based vaccine development, but current prediction models are not reliably successful. Here, motivated by recent successes applying machine learning to complex biology, we curated a dataset of 651,237 unique human leukocyte antigen class I (HLA-I) ligands and developed MUNIS, a deep learning model that identifies peptides presented by HLA-I alleles. MUNIS shows improved performance compared with existing models in predicting peptide presentation and CD8⁺ T cell epitope immunodominance hierarchies. Moreover, application of MUNIS to proteins from Epstein–Barr virus led to successful identification of both established and novel HLA-I epitopes which were experimentally validated by in vitro HLA-I-peptide stability and T cell immunogenicity assays. MUNIS performs comparably to an experimental stability assay in terms of immunogenicity prediction, suggesting that deep learning can reduce experimental burden and accelerate identification of CD8⁺ T cell epitopes for rapid T cell vaccine development.


Boltz-1: Democratizing Biomolecular Interaction Modeling

November 2024 · 23 Reads · 5 Citations

Understanding biomolecular interactions is fundamental to advancing fields like drug discovery and protein design. In this paper, we introduce Boltz-1, an open-source deep learning model incorporating innovations in model architecture, speed optimization, and data processing, achieving AlphaFold3-level accuracy in predicting the 3D structures of biomolecular complexes. Boltz-1 demonstrates performance on par with state-of-the-art commercial models on a range of diverse benchmarks, setting a new bar for commercially accessible tools in structural biology. By releasing the training and inference code, model weights, datasets, and benchmarks under the MIT open license, we aim to foster global collaboration, accelerate discoveries, and provide a robust platform for advancing biomolecular modeling.


Predicting sub-population specific viral evolution

October 2024 · 10 Reads

Forecasting the change in the distribution of viral variants is crucial for therapeutic design and disease surveillance. This task poses significant modeling challenges due to the sharp differences in virus distributions across sub-populations (e.g., countries) and their dynamic interactions. Existing machine learning approaches that model the variant distribution as a whole are incapable of making location-specific predictions and ignore transmissions that shape the viral landscape. In this paper, we propose a sub-population specific protein evolution model, which predicts the time-resolved distributions of viral proteins in different locations. The algorithm explicitly models the transmission rates between sub-populations and learns their interdependence from data. The change in protein distributions across all sub-populations is defined through a linear ordinary differential equation (ODE) parametrized by transmission rates. Solving this ODE yields the likelihood of a given protein occurring in particular sub-populations. Multi-year evaluation on both SARS-CoV-2 and influenza A/H3N2 demonstrates that our model outperforms baselines in accurately predicting distributions of viral proteins across continents and countries. We also find that the transmission rates learned from data are consistent with the transmission pathways discovered by retrospective phylogenetic analysis.
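A toy numerical sketch of the ODE structure described above (illustrative transmission rates and variant frequencies, not the learned parameters) might look like the following, where the matrix exponential gives the closed-form solution of the linear system:

```python
# Toy sketch of the linear-ODE view: frequencies in each sub-population evolve as
# dp/dt = A p, where off-diagonal entries of A are transmission rates between
# locations. All numbers here are made up for illustration.
import numpy as np
from scipy.linalg import expm

locations = ["Europe", "N. America", "Asia"]
# p0[i, k] = frequency of variant k in location i at time 0 (rows sum to 1)
p0 = np.array([[0.7, 0.3],
               [0.5, 0.5],
               [0.2, 0.8]])

# A[i, j] = rate at which location j's variant mix flows into location i;
# diagonals are chosen so each column sums to zero (mass is conserved).
A = np.array([[-0.30,  0.20,  0.10],
              [ 0.15, -0.35,  0.20],
              [ 0.15,  0.15, -0.30]])

t = 4.0                                   # e.g. four time steps ahead
p_t = expm(A * t) @ p0                    # closed-form solution of the linear ODE
p_t /= p_t.sum(axis=1, keepdims=True)     # renormalize per location

for loc, row in zip(locations, p_t):
    print(loc, np.round(row, 3))
```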


Predicting perturbation targets with causal differential networks

October 2024 · 38 Reads

Rationally identifying variables responsible for changes to a biological system can enable myriad applications in disease understanding and cell engineering. From a causality perspective, we are given two datasets generated by the same causal model, one observational (control) and one interventional (perturbed). The goal is to isolate the subset of measured variables (e.g. genes) that were the targets of the intervention, i.e. those whose conditional independencies have changed. Knowing the causal graph would limit the search space, allowing us to efficiently pinpoint these variables. However, current algorithms that infer causal graphs in the presence of unknown intervention targets scale poorly to the hundreds or thousands of variables in biological data, as they must jointly search the combinatorial spaces of graphs and consistent intervention targets. In this work, we propose a causality-inspired approach for predicting perturbation targets that decouples the two search steps. First, we use an amortized causal discovery model to separately infer causal graphs from the observational and interventional datasets. Then, we learn to map these paired graphs to the sets of variables that were intervened upon, in a supervised learning framework. This approach consistently outperforms baselines for perturbation modeling on seven single-cell transcriptomics datasets, each with thousands of measured variables. We also demonstrate significant improvements over six causal discovery algorithms in predicting intervention targets across a variety of tractable, synthetic datasets.
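A schematic sketch of the decoupled two-step recipe follows, on synthetic data and with the amortized causal-discovery step abstracted away as given adjacency matrices; it is an illustration of the idea, not the authors' model:

```python
# Step 1 (not shown) would be an amortized causal-discovery model producing one
# adjacency matrix per dataset; here we jump to step 2 and learn a supervised map
# from paired-graph differences to intervention targets. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_vars, n_examples = 20, 200

def simulate_pair():
    """Return (obs_graph, int_graph, target_mask) for one synthetic system."""
    obs = (rng.random((n_vars, n_vars)) < 0.1).astype(float)
    targets = rng.choice(n_vars, size=2, replace=False)
    intv = obs.copy()
    intv[:, targets] = (rng.random((n_vars, 2)) < 0.1)  # rewire edges into targets
    mask = np.zeros(n_vars)
    mask[targets] = 1.0
    return obs, intv, mask

def per_variable_features(obs, intv):
    diff = np.abs(obs - intv)
    # how much each variable's incoming / outgoing edges changed between graphs
    return np.stack([diff.sum(axis=0), diff.sum(axis=1)], axis=1)

X, y = [], []
for _ in range(n_examples):
    obs, intv, mask = simulate_pair()
    X.append(per_variable_features(obs, intv))
    y.append(mask)
X, y = np.concatenate(X), np.concatenate(y)

clf = LogisticRegression().fit(X, y)       # supervised "paired graphs -> targets" map
obs, intv, mask = simulate_pair()
scores = clf.predict_proba(per_variable_features(obs, intv))[:, 1]
print("true targets:", np.flatnonzero(mask), "top predicted:", np.argsort(-scores)[:2])
```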


VaxSeer: Selecting influenza vaccine strains through evolutionary and antigenicity models

September 2024 · 28 Reads

Current vaccines provide limited protection against rapidly evolving viruses. For example, the influenza vaccine's effectiveness has averaged below 40% for the past ten years. Today, clinical outcomes of vaccine effectiveness can only be assessed retrospectively. Prospective estimation of their effectiveness is crucial but remains under-explored. In this paper, we propose an in-silico method named VaxSeer that predicts expected vaccine effectiveness by considering both the future dominance of circulating viruses and antigenic profiles of vaccine candidates. Based on ten years of historical WHO data, our approach consistently selects strains superior to the annual recommendations. Finally, the prospective score we propose exhibits a strong correlation with retrospective vaccine effectiveness and reduced disease burden, highlighting the promise of this framework in driving the vaccine selection process. Predictions from our model for the 2024 and future winter seasons are available at https://wxsh1213.github.io/vaxseer.github.io/.
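Conceptually, the prospective score combines two predicted quantities per candidate: the future dominance of each circulating strain and the antigenic match between candidate and strain. A minimal illustrative calculation with made-up numbers (not VaxSeer's trained models) is:

```python
# Illustrative scoring only: dominance and antigenic-match values are invented.
predicted_dominance = {          # output of an evolution model, per strain
    "strain_A": 0.55,
    "strain_B": 0.30,
    "strain_C": 0.15,
}
antigenic_match = {              # output of an antigenicity model, per (vaccine, strain)
    ("cand_1", "strain_A"): 0.9, ("cand_1", "strain_B"): 0.4, ("cand_1", "strain_C"): 0.2,
    ("cand_2", "strain_A"): 0.5, ("cand_2", "strain_B"): 0.8, ("cand_2", "strain_C"): 0.7,
}

def prospective_score(candidate: str) -> float:
    """Dominance-weighted antigenic match over the predicted future strain mix."""
    return sum(w * antigenic_match[(candidate, s)] for s, w in predicted_dominance.items())

for cand in ("cand_1", "cand_2"):
    print(cand, round(prospective_score(cand), 3))
# cand_1: 0.55*0.9 + 0.30*0.4 + 0.15*0.2 = 0.645; cand_2: 0.620 -> cand_1 preferred
```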





AI-driven discovery of synergistic drug combinations against pancreatic cancer

April 2024 · 49 Reads

Treatment regimens, especially in cancer, often include more than one medicine in order to achieve durable outcomes. Identifying the optimal combination of treatments has historically been done through clinical trial and error, and for many conditions, such as pancreatic cancer, an optimal treatment protocol has remained elusive; the best available treatment combinations provide only modest benefit. Recent developments have led to the application of both experimental screening approaches and in silico modeling methods to identify synergistic drug combinations and expand the therapeutic options for multiple diseases. Here we conduct a study to compare different predictive approaches for identifying new treatment combinations for pancreatic cancer using cell line growth as an initial proxy for clinical utility. NCATS performed screening involving 496 pairwise combinations of 32 antineoplastic drugs, tested against the PANC-1 human pancreatic carcinoma cell line in duplicates using a 10 × 10 matrix format. This dataset served as the basis for generating and training advanced AI/ML models focused on pancreatic cancer. Next, three independent groups (NCATS, UNC and MIT), working in a collaborative manner, utilized three different workflows with AI/ML approaches to discover prospective new drug combinations against pancreatic cancer among over 1.5 million drug combinations. As a result of this collaboration, 88 proposed combinations were tested in a cell-based assay; 53 of them were synergistic (hit rate ~60%). While all machine learning approaches demonstrate advances in the direction of predicting synergistic drug combinations, graph convolutional networks resulted in the best performance with a hit rate of ~83%, and Random Forest delivered the highest precision of 65%. Interestingly, all AI/ML methods utilized by the three groups proposed different drug combinations, with a small overlap of only two combos from 90. This study demonstrates the potential of a collaborative modeling approach for prioritizing drug combinations in large-scale screening campaigns, particularly when focusing on maximizing the efficacy of drugs known to exhibit synergy.
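As a rough illustration of one of the modeling routes mentioned above, the sketch below trains a random forest on concatenated per-drug descriptors for every pairwise combination of 32 drugs (496 pairs); all features and synergy labels are random placeholders rather than the NCATS screening data:

```python
# Hedged sketch: random placeholder descriptors and labels stand in for real
# chemical fingerprints and synergy scores from the 10x10 dose matrices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_drugs, n_feat = 32, 64
drug_features = rng.random((n_drugs, n_feat))          # placeholder per-drug descriptors

pairs = [(i, j) for i in range(n_drugs) for j in range(i + 1, n_drugs)]  # 496 combos
X = np.array([np.concatenate([drug_features[i], drug_features[j]]) for i, j in pairs])
y = rng.integers(0, 2, size=len(pairs))                 # placeholder synergy labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy (random labels, so ~0.5 expected):", model.score(X_te, y_te))
# Ranking untested pairs by model.predict_proba(...)[:, 1] would then give the
# shortlist sent for experimental confirmation.
```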


Citations (62)


... OpenChemIE also extracts reaction data from text or figures.¹³⁶ Nevertheless, many of these tools encounter problems with variable end groups mostly noted as 'R-group'.¹³⁷ Another important modality for data extraction is plots and images. ...

Reference:

From Text to Insight: Large Language Models for Materials Science Data Extraction
OpenChemIE: An Information Extraction Toolkit for Chemistry Literature
  • Citing Article
  • July 2024

Journal of Chemical Information and Modeling

... Many works in this area have been focusing on sequence-to-sequence tasks and multi-step generation (Reid et al., 2022; Zheng et al., 2023; Ye et al., 2023; Sahoo et al., 2024) by extending the D3PM framework (Austin et al., 2021a), which showed early success for character-level generation. Yet, there have also been other effective discretizations of diffusion processes, with successful applications even for image and biological data (Hoogeboom et al., 2021; Campbell et al., 2024). In particular, we would like to highlight the recent frameworks of SEDD (Lou et al., 2024) and discrete flow matching, which have made significant strides in approaching small-scale autoregressive LMs. ...

Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design
  • Citing Article
  • June 2024

... It has applications in modeling text data, such as word appearance in documents [37], hyperspectral unmixing [38], customer segmentation based on spending patterns [39] and, more recently, in image classification using text-vision models like CLIP [40] and image restoration [41]. Additionally, it has been employed for generating DNA sequences [42]. ...

Dirichlet Flow Matching with Applications to DNA Sequence Design
  • Citing Article
  • May 2024

... To remedy this issue, diffusion models offer an innovative approach, as they have gained considerable attention for their versatile conditioning capabilities [34][35][36][37][38]. These models have been actively applied in the field of materials generation, with drug-like molecules³⁹,⁴⁰, proteins⁴¹,⁴² and small crystals⁴³ being main targets. When it comes to generating MOFs using diffusion models, Park et al.⁴⁴ focused on the generation of MOF linkers rather than entire structures, while Fu et al.⁴⁵ reduced the structural complexity by applying coarse-grained representation. ...

Diffusion models in protein structure and docking
  • Citing Article
  • April 2024

Wiley Interdisciplinary Reviews: Computational Molecular Science

... GenAI expedites this discovery process by generating candidate materials with desirable properties and accurately predicting their performance. [93,94] This capability allows for a more targeted and efficient search for promising SSE materials, guiding experimental efforts and reducing reliance on traditional trial-and-error methods. In summary, the application of GenAI in molecular simulation and design offers a powerful set of tools for advancing lithium battery technology. ...

Closing the Execution Gap in Generative AI for Chemicals and Materials: Freeways or Safeguards
  • Citing Article
  • March 2024

... Machine learning (ML) has been widely applied to protein-ligand binding affinity prediction and has become central to computer-aided drug design more broadly, with applications including protein structure prediction,¹¹,¹² molecular docking,¹³ small molecule property prediction¹⁴,¹⁵ and others.¹⁶ Traditional ML approaches for protein-ligand binding affinity prediction, such as random forests [17][18][19][20] and shallow neural networks,²¹ are increasingly being replaced by deep learning methods that are better suited for learning geometric representations of molecular structures. ...

Deep Confident Steps to New Pockets: Strategies for Docking Generalization
  • Citing Article
  • February 2024

... This led us to the development of AEV-PLIG, a novel attention-based graph-ML scoring function illustrated in Fig. 1. Graph-based methods have emerged as a popular architecture for protein-ligand binding affinity prediction due to their ability to naturally represent molecular 3D structures and topologies³⁶. Extending this representation to molecular complexes, Moesser et al. recently introduced protein-ligand interaction graphs (PLIGs) that encode intermolecular contacts between proteins and ligands as graph node features³⁷. ...

Graph neural networks
  • Citing Article
  • March 2024

Nature Reviews Methods Primers

... Our approach builds upon recent advancements in SE(3)-diffusion and flow matching to develop a generative model for antibody structures. Specifically, we adapt flow matching techniques, as introduced by Lipman et al. [7], and the protein backbone frame parameterization used by Yim et al. [19], [20], for antibody design. This enables efficient, simulation-free training of continuous normalizing flows (CNFs) for antibody structure generation. ...

Improved motif-scaffolding with SE(3) flow matching
  • Citing Article
  • January 2024

... Toward this end, as the first step, we developed a "prompt-to-code" framework and used it to evaluate the performance of different open-source and proprietary LLMs in tool-making and tool-use (Figure 2A). The core objective was to assess the reliability and accuracy of code produced by LLMs across four distinct tasks in the context of this study: (1) ML model training using a dataset on C–H oxidation, (2) development of code for tuning synthesis conditions and optimizing reaction yields, (3) interpretation of documentation and application of existing Python package for yield optimization, and (4) direct interaction with laboratory hardware [32] to prepare solutions based on generated synthesis parameters (Figure 2B). These tasks were designed to span a range of practical applications, from data handling to physical lab automation, reflecting the diverse ways LLMs can implement code for ML to support chemical research (Figure S22). ...

Autonomous, multiproperty-driven molecular discovery: From predictions to measurements and back
  • Citing Article
  • December 2023

Science

... Wang et al.³⁰ developed a prediction model for reaction conditions based on a transformer architecture and incorporated a pretraining strategy that leverages reaction domain knowledge. Qian et al.³¹ utilized text retrieval methods to pinpoint relevant textual information for given reactions, thereby improving the accuracy of predicting conditions. Nonetheless, many challenges remain to be solved in predicting general reaction conditions. ...

Predictive Chemistry Augmented with Text Retrieval
  • Citing Conference Paper
  • January 2023