Jerret Ross’s research while affiliated with IBM Research and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (13)


Distributional Preference Alignment of LLMs via Optimal Transport
  • Preprint

June 2024 · 6 Reads

Igor Melnyk · [...] · Jerret Ross
Current LLM alignment techniques use pairwise human preferences at a sample level, and as such, they do not imply an alignment on the distributional level. In this paper we propose Alignment via Optimal Transport (AOT), a novel method for distributional preference alignment of LLMs. AOT aligns LLMs on unpaired preference data by making the reward distribution of the positive samples first-order stochastically dominant over the distribution of negative samples. We introduce a convex relaxation of this first-order stochastic dominance and cast it as an optimal transport problem with a smooth and convex cost. Thanks to the one-dimensional nature of the resulting optimal transport problem and the convexity of the cost, it has a closed-form solution via sorting on empirical measures. We fine-tune LLMs with this AOT objective, which enables alignment by penalizing violations of the stochastic dominance of the reward distribution of the positive samples over the reward distribution of the negative samples. We analyze the sample complexity of AOT by considering the dual of the OT problem and show that it converges at the parametric rate. Empirically, we show on a diverse set of alignment datasets and LLMs that AOT leads to state-of-the-art models in the 7B family of models when evaluated with Open LLM Benchmarks and AlpacaEval.
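
The closed-form one-dimensional optimal transport step the abstract refers to reduces, on a mini-batch, to sorting the two empirical reward samples and penalizing quantile-wise violations of dominance. Below is a minimal sketch, assuming equal-sized batches of scalar rewards for positive and negative samples and a softplus surrogate for the smooth convex cost; the paper's exact cost function and scaling may differ.

```python
import torch
import torch.nn.functional as F

def aot_loss_sketch(pos_rewards: torch.Tensor,
                    neg_rewards: torch.Tensor,
                    beta: float = 0.01) -> torch.Tensor:
    """Illustrative AOT-style objective on a batch of unpaired rewards.

    Sorting both empirical reward samples gives the optimal coupling for a
    1-D transport problem with a convex cost; the loss then penalizes
    quantile-wise violations of first-order stochastic dominance of the
    positive rewards over the negative rewards. The softplus cost used
    here is an assumption for illustration.
    """
    pos_sorted, _ = torch.sort(pos_rewards)
    neg_sorted, _ = torch.sort(neg_rewards)
    # A violation occurs wherever a negative quantile exceeds the matched
    # positive quantile; softplus gives a smooth, convex penalty.
    violation = neg_sorted - pos_sorted
    return F.softplus(violation / beta).mean()
```

Because the 1-D coupling is obtained by sorting, the objective stays cheap enough to evaluate inside a standard fine-tuning loop.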


GP-MoLFormer: A Foundation Model For Molecular Generation
  • Preprint
  • File available

April 2024 · 404 Reads

Transformer-based models trained on large and general purpose datasets consisting of molecular strings have recently emerged as a powerful tool for successfully modeling various structure-property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1B chemical SMILES. GP-MoLFormer uses a 46.8M parameter transformer decoder model with linear attention and rotary positional encodings as the base architecture. We explore the utility of GP-MoLFormer in generating novel, valid, and unique SMILES. Impressively, we find GP-MoLFormer is able to generate a significant fraction of novel, valid, and unique SMILES even when the number of generated molecules is in the 10 billion range and the reference set is over a billion. We also find strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are impacted by the quality of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We evaluate GP-MoLFormer's utility and compare it with that of existing baselines on three different tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer performs better or comparable with baselines across all three tasks, demonstrating its general utility for a variety of molecular generation tasks.
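
Since GP-MoLFormer is a decoder-only autoregressive model over SMILES tokens, generation amounts to standard left-to-right sampling. The sketch below illustrates that loop; `model` and `tokenizer` are placeholders rather than the released GP-MoLFormer API, and EOS handling and chemical validity filtering are omitted.

```python
import torch

@torch.no_grad()
def sample_smiles(model, tokenizer, n_samples=4, max_len=128, temperature=1.0):
    """Plain autoregressive sampling from a decoder-only chemical LM.

    `model` is assumed to be a causal transformer returning next-token
    logits of shape (batch, seq_len, vocab); `tokenizer` maps SMILES to
    and from token ids with a dedicated BOS id. Both are placeholder
    names for illustration.
    """
    device = next(model.parameters()).device
    tokens = torch.full((n_samples, 1), tokenizer.bos_id,
                        dtype=torch.long, device=device)
    for _ in range(max_len):
        logits = model(tokens)[:, -1, :] / temperature   # next-token logits
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return [tokenizer.decode(t.tolist()) for t in tokens]
```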


Auditing and Generating Synthetic Data with Controllable Trust Trade-offs

January 2024 · 3 Reads · 2 Citations

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Real-world data often exhibits bias, imbalance, and privacy risks. Synthetic datasets have emerged to address these issues by enabling a paradigm that relies on generative AI models to generate unbiased, privacy-preserving data while maintaining fidelity to the original data. However, assessing the trustworthiness of synthetic datasets and models is a critical challenge. We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models. It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation. We demonstrate our framework’s effectiveness by auditing various generative models across diverse use cases like education, healthcare, banking, and human resources, spanning different data modalities such as tabular, time-series, vision, and natural language. This holistic assessment is essential for compliance with regulatory safeguards. We introduce a trustworthiness index to rank synthetic datasets based on their safeguards trade-offs. Furthermore, we present a trustworthiness-driven model selection and cross-validation process during training, exemplified with “TrustFormers” across various data types. This approach allows for controllable trustworthiness trade-offs in synthetic data creation. Our auditing framework fosters collaboration among stakeholders, including data scientists, governance experts, internal reviewers, external certifiers, and regulators. This transparent reporting should become a standard practice to prevent bias, discrimination, and privacy violations, ensuring compliance with policies and providing accountability, safety, and performance guarantees.
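
One way to picture the trustworthiness index is as a weighted aggregate of per-safeguard audit scores that yields a single number for ranking candidate synthetic datasets. The sketch below is illustrative only: the metric names, weights, and aggregation rule are assumptions, not the paper's definition.

```python
def trust_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Toy trustworthiness index: a weighted average of per-safeguard audit
    scores (e.g., fidelity, utility, privacy, fairness, robustness), each
    assumed to be normalized to [0, 1]. The single score makes safeguard
    trade-offs explicit when ranking synthetic datasets."""
    total_w = sum(weights.get(k, 0.0) for k in scores)
    return sum(weights.get(k, 0.0) * v for k, v in scores.items()) / total_w

# Hypothetical audit results for two candidate synthetic datasets.
audits = {
    "synth_A": {"fidelity": 0.90, "privacy": 0.60, "fairness": 0.80},
    "synth_B": {"fidelity": 0.70, "privacy": 0.90, "fairness": 0.85},
}
weights = {"fidelity": 1.0, "privacy": 2.0, "fairness": 1.0}  # privacy-heavy deployment
ranking = sorted(audits, key=lambda name: trust_index(audits[name], weights), reverse=True)
```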


Auditing and Generating Synthetic Data with Controllable Trust Trade-offs

April 2023 · 15 Reads

Data collected from the real world tends to be biased, unbalanced, and at risk of exposing sensitive and private information. This reality has given rise to the idea of creating synthetic datasets to alleviate risk, bias, harm, and privacy concerns inherent in the real data. This concept relies on Generative AI models to produce unbiased, privacy-preserving synthetic data while being true to the real data. In this new paradigm, how can we tell if this approach delivers on its promises? We present an auditing framework that offers a holistic assessment of synthetic datasets and AI models trained on them, centered around bias and discrimination prevention, fidelity to the real data, utility, robustness, and privacy preservation. We showcase our framework by auditing multiple generative models on diverse use cases, including education, healthcare, banking, human resources, and across different modalities, from tabular, to time-series, to natural language. Our use cases demonstrate the importance of a holistic assessment in order to ensure compliance with socio-technical safeguards that regulators and policymakers are increasingly enforcing. For this purpose, we introduce the trust index that ranks multiple synthetic datasets based on their prescribed safeguards and their desired trade-offs. Moreover, we devise a trust-index-driven model selection and cross-validation procedure via auditing in the training loop that we showcase on a class of transformer models that we dub TrustFormers, across different modalities. This trust-driven model selection allows for controllable trust trade-offs in the resulting synthetic data. We instrument our auditing framework with workflows that connect different stakeholders from model development to audit and certification via a synthetic data auditing report.
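
The trust-index-driven model selection described above can be pictured as replacing "pick the checkpoint with the lowest validation loss" by "pick the checkpoint with the best audited trust index." A hedged sketch follows, where `audit_fn` and the weighting are placeholders for the paper's concrete audit metrics.

```python
def select_checkpoint(checkpoints, audit_fn, weights):
    """Sketch of trust-index-driven model selection during training.

    `checkpoints` is any iterable of candidate generator checkpoints and
    `audit_fn(ckpt)` is a placeholder callable returning per-safeguard
    scores in [0, 1]; the checkpoint with the highest weighted trust
    index is kept."""
    def index(ckpt):
        scores = audit_fn(ckpt)
        total_w = sum(weights[k] for k in scores)
        return sum(weights[k] * v for k, v in scores.items()) / total_w
    return max(checkpoints, key=index)
```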


Figure captions from the publication below:

Overview of MoLFormer pipeline
The transformer neural network based model is trained on the SMILES sequences corresponding to a large collection of chemical molecules from PubChem and ZINC, two public chemical databases, in a self-supervised fashion. MoLFormer was designed with an efficient linear attention mechanism and relative positional embeddings, with the goal of learning a meaningful and compressed representation of chemical molecules. This foundation model was then adapted to different downstream molecular property prediction tasks via fine-tuning on task-specific data. The representative power was further tested by recovering molecular similarity using the MoLFormer encodings, as well as by analysing the correspondence between the interatomic spatial distance and attention value for a given molecule.

Comparison of training and validation losses for absolute and rotary embeddings
a,b, Training (a) and validation (b) losses of our linear attention MoLFormer with rotary (relative) and absolute position embeddings on PubChem. We see that both rotary and absolute MoLFormer have graceful training curves. Our rotary linear attention MoLFormer leads to lower training and validation losses than MoLFormer with absolute position embeddings.

Attention map visualization
a,b, Visualization of the learned attention map (using either full or linear attention) under rotary embedding and corresponding molecular structure (bond connectivity and 3D distance in angstroms) for two random molecules: ‘CC1(C)C(C)(O)C1(C)O’ (a) and ‘CC(C)C(C)(C)O’ (b). The attention map (ranging from 0 to 1; only tokens that map to constituent atoms are shown for clarity), comprised of the average-pooled heads of an intermediate attention layer, exhibits awareness of both covalent bond connectivity and interatomic long-range spatial relationships. The linear attention variant captures (encircled in red) the medium-range 3D distance better than its counterpart does.
Large-scale chemical language representations capture molecular structure and properties

December 2022 · 507 Reads · 189 Citations

Nature Machine Intelligence

Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. It performs competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties. Large language models have recently emerged with extraordinary capabilities, and these methods can be applied to model other kinds of sequences, such as string representations of molecules. Ross and colleagues have created a transformer-based model, trained on a large dataset of molecules, which provides good results on property prediction tasks.
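
The rotary positional embeddings mentioned here rotate each query/key feature pair by a position-dependent angle, so relative position is encoded directly in the attention dot product. Below is a minimal sketch of the standard RoPE formulation (Su et al.) in its common half-split variant; MoLFormer's exact head layout and its coupling with linear attention are not reproduced.

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to a query or key tensor of shape
    (batch, seq_len, dim), with dim even."""
    b, t, d = x.shape
    half = d // 2
    # Per-dimension rotation frequencies, as in the original RoPE paper.
    freqs = base ** (-torch.arange(0, half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(t, device=x.device, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```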


Figure 2. Training (left) and validation (right) losses of our linear attention MoLFormer with rotary (relative) and absolute position embeddings on PubChem. We see that both rotary and absolute MoLFormer have graceful training curves. Our rotary linear attention MoLFormer leads to lower training and validation losses than MoLFormer with absolute position embeddings.
Molformer: Large Scale Chemical Language Representations Capture Molecular Structure and Properties

April 2022 · 103 Reads · 6 Citations

Predicting the properties of a chemical molecule is of great importance in many applications, including drug discovery and material design. Machine learning-based models promise to enable more accurate and faster molecular property predictions than the current state-of-the-art techniques, such as Density Functional Theory calculations or wet-lab experiments. Various supervised machine learning models, including graph neural nets, have demonstrated promising performance in molecular property prediction tasks. However, the vast chemical space and the limited availability of property labels make supervised learning challenging, calling for learning a general-purpose molecular representation. Recently, unsupervised transformer-based language models pre-trained on a large unlabeled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural net baselines and language models, on several classification and regression tasks from ten benchmark datasets, while performing competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.
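
The linear attention mechanism referenced in the abstract replaces the softmax attention matrix with a kernel feature map, so keys and values can be summarized once and the cost grows linearly in sequence length. The sketch below follows the elu+1 feature map of Katharopoulos et al. and omits heads, masking, and scaling; it illustrates the general mechanism rather than MoLFormer's exact implementation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized (linear) attention over q, k, v of shape (batch, seq_len, dim).

    Keys and values are summarized once, avoiding the quadratic
    (seq_len x seq_len) attention matrix of softmax attention."""
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", phi_k, v)                     # key-value summary
    z = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)) + eps   # normalizer
    return torch.einsum("bnd,bde->bne", phi_q, kv) / z.unsqueeze(-1)
```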


Do Large Scale Molecular Language Representations Capture Important Structural Information?

June 2021 · 153 Reads · 2 Citations

Predicting chemical properties from the structure of a molecule is of great importance in many applications including drug discovery and material design. Machine learning based molecular property prediction holds the promise of enabling accurate predictions at much less complexity, when compared to, for example, Density Functional Theory (DFT) calculations. Features extracted from molecular graphs, using graph neural nets in a supervised manner, have emerged as strong baselines for such tasks. However, the vast chemical space together with the limited availability of labels makes supervised learning challenging, calling for learning a general-purpose molecular representation. Recently, transformer-based language models pre-trained (PTLMs) on a large unlabeled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, here we present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer. This model was employed with a linear attention mechanism and highly parallelized training on 1D SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation performs competitively, when compared to existing graph-based and fingerprint-based supervised learning baselines, on the challenging tasks of predicting properties of QM8 and QM9 molecules. Further task-specific fine-tuning of the MoLFormer representation improves performance on several of those property prediction benchmarks. These results provide encouraging evidence that large-scale molecular language models can capture sufficient structural information to be able to accurately predict quantum chemical properties and beyond.
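
Task-specific fine-tuning of a pretrained encoder of this kind typically just places a small regression head on pooled token embeddings and trains end to end on the labeled property data. Below is a minimal sketch; the embedding size, pooling choice, and head architecture are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

class PropertyHead(nn.Module):
    """Minimal regression head over pooled encoder embeddings, the usual
    way a pretrained chemical language model is adapted to a molecular
    property prediction task."""
    def __init__(self, embed_dim: int = 768, n_targets: int = 1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Linear(embed_dim, n_targets),
        )

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        pooled = token_embeddings.mean(dim=1)   # mean-pool over SMILES tokens
        return self.mlp(pooled)
```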



Tabular Transformers for Modeling Multivariate Time Series

November 2020 · 126 Reads

Tabular datasets are ubiquitous in data science applications. Given their importance, it seems natural to apply state-of-the-art deep learning algorithms in order to fully unlock their potential. Here we propose neural network models for tabular time series that can optionally leverage their hierarchical structure. This results in two architectures: one for learning representations that is analogous to BERT and can be pre-trained end-to-end and used in downstream tasks, and one that is akin to GPT and can be used for generation of realistic synthetic tabular sequences. We demonstrate our models on two datasets: a synthetic credit card transaction dataset, where the learned representations are used for fraud detection and synthetic data generation, and a real pollution dataset, where the learned encodings are used to predict atmospheric pollutant concentrations. Code and data are available at https://github.com/IBM/TabFormer.
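
The hierarchy the abstract alludes to is that each row of a tabular time series is itself a set of fields: field values are embedded individually, mixed by a small field-level encoder, and the pooled result becomes one "row token" for the sequence-level BERT- or GPT-style model. A hedged sketch of that idea follows; dimensions and vocabularies are placeholders, and the actual implementation is at https://github.com/IBM/TabFormer.

```python
import torch
import torch.nn as nn

class RowEncoder(nn.Module):
    """Embed each quantized/categorical field of a row, mix the field
    embeddings with a small transformer, and pool them into a single
    row embedding for a downstream sequence model."""
    def __init__(self, field_vocab_sizes, dim: int = 64):
        super().__init__()
        self.field_embeds = nn.ModuleList(
            [nn.Embedding(v, dim) for v in field_vocab_sizes])
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.field_mixer = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, rows: torch.Tensor) -> torch.Tensor:
        # rows: (batch, n_fields) integer-coded field values
        fields = torch.stack(
            [emb(rows[:, i]) for i, emb in enumerate(self.field_embeds)], dim=1)
        return self.field_mixer(fields).mean(dim=1)   # (batch, dim) row embedding
```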


Fast Mixing of Multi-Scale Langevin Dynamics under the Manifold Hypothesis

June 2020 · 31 Reads

Recently, the task of image generation has attracted much attention. In particular, the recent empirical successes of the Markov Chain Monte Carlo (MCMC) technique of Langevin Dynamics have prompted a number of theoretical advances; despite this, several outstanding problems remain. First, the Langevin Dynamics is run in very high dimension on a nonconvex landscape; in the worst case, due to the NP-hardness of nonconvex optimization, it is thought that Langevin Dynamics mixes only in time exponential in the dimension. In this work, we demonstrate how the manifold hypothesis allows for the considerable reduction of mixing time, from exponential in the ambient dimension to depending only on the (much smaller) intrinsic dimension of the data. Second, the high dimension of the sampling space significantly hurts the performance of Langevin Dynamics; we leverage a multi-scale approach to help ameliorate this issue and observe that this multi-resolution algorithm allows for a trade-off between image quality and computational expense in generation.
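
For reference, the single-resolution building block the paper analyzes is unadjusted Langevin dynamics, which follows the score (the gradient of the log-density) plus Gaussian noise. The sketch below shows only that update; the multi-scale schedule over image resolutions studied in the paper is omitted, and `score_fn` is a placeholder for a trained score estimator.

```python
import torch

def langevin_sample(score_fn, x0: torch.Tensor,
                    step: float = 1e-3, n_steps: int = 100) -> torch.Tensor:
    """Unadjusted Langevin dynamics:
    x_{t+1} = x_t + step * score(x_t) + sqrt(2 * step) * N(0, I)."""
    x = x0.clone()
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        x = x + step * score_fn(x) + (2.0 * step) ** 0.5 * noise
    return x
```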


Citations (5)


... These biased GAI predictions may influence the final decisions made by humans. Belgodere et al. (2023) conducted a study on an admission bar exam dataset sourced from the Law School Admission Council (LSAC). This dataset records each student's personal information: gender, race, Law School Admission Test (LSAT) score, and undergraduate GPA. ...

Reference:

Potential Societal Biases of ChatGPT in Higher Education: A Scoping Review
Auditing and Generating Synthetic Data with Controllable Trust Trade-offs
  • Citing Article
  • January 2024

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

... challenge. While increasingly complex deep learning architectures are demonstrating state-of-the-art (SOA) results on a wide range of molecular property prediction tasks [9][10][11] , it is becoming increasingly clear that energy efficiency will become a greater priority over time as these models are, repeatedly, trained and deployed on increasingly vast human protein-drug interactome 12 . ...

Large-scale chemical language representations capture molecular structure and properties

Nature Machine Intelligence

... Many of the latest pre-trained chemical models employ self-supervised pre-training tasks on huge unlabeled datasets of 2D chemical structures. [44][45][46][47] Conversely, there are numerous instances of quasi-transfer learning, involving pre-training on datasets of ab initio calculated properties of a size comparable to the available experimental datasets. 12,37 We propose the atomic feature extraction from the models pre-trained for different chemical tasks on larger datasets, and we evaluate it by predicting experimental 13C chemical shifts. ...

Do Large Scale Molecular Language Representations Capture Important Structural Information?

... Unlike tabular data where rows are independent, sequential tabular data exhibits dependent rows with dependent columns [14,28]. This data exhibits strong dynamic and static patterns capturing local transient patterns and global identity [16]. For example, the customer's clickstream activity and the amount spent exhibit a dynamic pattern, while the type of product owned exhibits a static pattern replicated across the sequence for an elongated period. ...

Tabular Transformers for Modeling Multivariate Time Series

... You et al. [31,32] detect visual concepts (regions, objects, attributes, etc.) and combine the visual features with the concepts to generate captions. Dai et al. [33][34][35] approach image captioning as conditional GAN training. Zhang et al. [36][37][38] integrate part-of-speech information to ensure captions better adhere to language habits and grammar rules. ...

Adversarial Semantic Alignment for Improved Image Captions
  • Citing Conference Paper
  • June 2019