Jimmy Ba’s research while affiliated with University of Toronto and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (56)


Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
  • Preprint

September 2024 · 6 Reads

Blair Yang · Fuyang Cui · Keiran Paster · [...] · Michael R. Zhang

The rapid development and dynamic nature of large language models (LLMs) make it difficult for conventional quantitative benchmarks to accurately assess their capabilities. We propose report cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics. We develop a framework to evaluate report cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans). We also propose an iterative algorithm for generating report cards without human supervision and explore its efficacy by ablating various design choices. Through experimentation with popular LLMs, we demonstrate that report cards provide insights beyond traditional benchmarks and can help address the need for a more interpretable and holistic evaluation of LLMs.
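
As a rough illustration of the unsupervised generation loop described above, the sketch below drafts a card from an initial batch of model answers and then refines it against further batches. The function names and prompts are hypothetical stand-ins, not the paper's implementation.

```python
# Hypothetical sketch of an iterative report-card loop: an evaluator LLM
# drafts a summary of a model's behavior, then refines it against fresh
# batches of the model's actual answers. `evaluator_llm` is any callable
# mapping a prompt string to a completion string.

def generate_report_card(model_answers, evaluator_llm, num_rounds=3, batch_size=8):
    """Iteratively draft and refine a natural-language report card."""
    batch = model_answers[:batch_size]
    card = evaluator_llm(f"Summarize this model's behavior on these answers:\n{batch}")
    for i in range(1, num_rounds):
        batch = model_answers[i * batch_size:(i + 1) * batch_size]
        if not batch:
            break
        # Refine the card against new evidence, pushing for specificity
        # and faithfulness (two of the paper's three criteria).
        card = evaluator_llm(
            f"Revise this report card so it stays faithful to the new answers "
            f"while remaining specific and readable.\n\nCard:\n{card}\n\n"
            f"New answers:\n{batch}"
        )
    return card
```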


[Tables: precision, recall, and F-score by question classification type (overall accuracy 81%); similarity between instructor and model answers by question type.]
Decomposed Prompting to Answer Questions on a Course Discussion Board
  • Preprint
  • File available

July 2024 · 20 Reads

We propose and evaluate a question-answering system that uses decomposed prompting to classify and answer student questions on a course discussion board. Our system uses a large language model (LLM) to classify questions into one of four types: conceptual, homework, logistics, and not answerable. This enables us to employ a different strategy for answering questions that fall under different types. Using a variant of GPT-3, we achieve 81% classification accuracy. We discuss our system's performance on answering conceptual questions from a machine learning course and various failure modes.
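
The classify-then-route structure lends itself to a short sketch. The version below assumes a generic `llm(prompt) -> str` completion function; the prompts and per-type fallback behaviors are illustrative, not the authors' exact ones.

```python
# A minimal sketch of decomposed prompting for discussion-board QA:
# first classify the question, then route it to a type-specific strategy.

QUESTION_TYPES = ("conceptual", "homework", "logistics", "not answerable")

def classify_question(question: str, llm) -> str:
    prompt = (
        "Classify the student question as one of: "
        f"{', '.join(QUESTION_TYPES)}.\n\nQuestion: {question}\nType:"
    )
    label = llm(prompt).strip().lower()
    return label if label in QUESTION_TYPES else "not answerable"

def answer_question(question: str, llm) -> str:
    qtype = classify_question(question, llm)
    if qtype == "conceptual":
        return llm(f"Answer this course concept question:\n{question}")
    if qtype == "homework":
        return "Please see the assignment handout; we avoid giving solutions."
    if qtype == "logistics":
        return llm(f"Answer using the course syllabus:\n{question}")
    return "This question needs an instructor's attention."
```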


PIFiA: self-supervised approach for protein functional annotation from single-cell imaging data

March 2024 · 14 Reads · 6 Citations · Molecular Systems Biology

Fluorescence microscopy data describe protein localization patterns at single-cell resolution and have the potential to reveal whole-proteome functional information with remarkable precision. Yet, extracting biologically meaningful representations from cell micrographs remains a major challenge. Existing approaches often fail to learn robust and noise-invariant features or rely on supervised labels for accurate annotations. We developed PIFiA (Protein Image-based Functional Annotation), a self-supervised approach for protein functional annotation from single-cell imaging data. We imaged the global yeast ORF-GFP collection and applied PIFiA to generate protein feature profiles from single-cell images of fluorescently tagged proteins. We show that PIFiA outperforms existing approaches for molecular representation learning and describe a range of downstream analysis tasks to explore the information content of the feature profiles. Specifically, we cluster extracted features into a hierarchy of functional organization, study cell population heterogeneity, and develop techniques to distinguish multi-localizing proteins and identify functional modules. Finally, we confirm new PIFiA predictions using a colocalization assay, suggesting previously unappreciated biological roles for several proteins. Paired with a fully interactive website ( https://thecellvision.org/pifia/ ), PIFiA is a resource for the quantitative analysis of protein organization within the cell.
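
As a rough sketch of the general recipe (learn per-cell features without manual labels, then aggregate them into per-protein profiles), the snippet below uses prediction of the tagged protein's identity, which comes for free from the ORF-GFP collection, as the pretext task. This is one plausible self-supervised objective, not necessarily PIFiA's exact architecture or loss.

```python
# Hedged sketch: train a small CNN on single-cell micrographs with a
# label-free pretext task (which tagged protein does this cell express?),
# then use the penultimate features as per-cell profiles.

import torch
import torch.nn as nn

class CellEncoder(nn.Module):
    def __init__(self, num_proteins: int, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.head = nn.Linear(feat_dim, num_proteins)  # pretext classifier

    def forward(self, x):
        feats = self.backbone(x)           # per-cell feature profile
        return feats, self.head(feats)     # features + pretext logits
```

Per-protein feature profiles would then be obtained by averaging `feats` over all cells imaged for the same protein.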


Decomposed Prompting to Answer Questions on a Course Discussion Board

June 2023 · 12 Reads · 2 Citations · Communications in Computer and Information Science

We propose and evaluate a question-answering system that uses decomposed prompting to classify and answer student questions on a course discussion board. Our system uses a large language model (LLM) to classify questions into one of four types: conceptual, homework, logistics, and not answerable. This enables us to employ a different strategy for answering questions that fall under different types. Using a variant of GPT-3, we achieve 81% classification accuracy. We discuss our system's performance on answering conceptual questions from a machine learning course and various failure modes. Keywords: Course Discussion Board · GPT-3 · Large Language Models · Mixture of Experts · Prompting



STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

June 2023 · 42 Reads

Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL-E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools, are made available for further research.
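
The two-step design maps onto a simple inference path: embed the instruction, translate it into a MineCLIP visual latent with the learned prior (the unCLIP idea), and condition the policy on that latent. The sketch below uses hypothetical callables for all three components; none of these names come from the released code.

```python
# Hedged sketch of STEVE-1-style inference: text -> prior -> latent goal
# -> goal-conditioned action. `text_encoder`, `prior`, and `policy` are
# illustrative stand-ins for the pretrained components.

def act_from_text(instruction, text_encoder, prior, policy, obs):
    text_emb = text_encoder(instruction)   # embed the text instruction
    goal_latent = prior(text_emb)          # predict a MineCLIP latent from text
    return policy(obs, goal=goal_latent)   # low-level mouse/keyboard action
```

The abstract also credits classifier-free guidance for downstream performance; in this sketch that would correspond to mixing goal-conditioned and unconditioned policy outputs, omitted here for brevity.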


Training on Thin Air: Improve Image Classification with Generated Data

May 2023 · 21 Reads

Acquiring high-quality data for training discriminative models is a crucial yet challenging aspect of building effective predictive systems. In this paper, we present Diffusion Inversion, a simple yet effective method that leverages the pre-trained generative model, Stable Diffusion, to generate diverse, high-quality training data for image classification. Our approach captures the original data distribution and ensures data coverage by inverting images to the latent space of Stable Diffusion, and generates diverse novel training images by conditioning the generative model on noisy versions of these vectors. We identify three key components that allow our generated images to successfully supplant the original dataset, leading to a 2-3x enhancement in sample complexity and a 6.5x decrease in sampling time. Moreover, our approach consistently outperforms generic prompt-based steering methods and KNN retrieval baseline across a wide range of datasets. Additionally, we demonstrate the compatibility of our approach with widely-used data augmentation techniques, as well as the reliability of the generated data in supporting various neural architectures and enhancing few-shot learning.
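
A hedged sketch of the sampling loop this describes: each training image is inverted to a conditioning vector in Stable Diffusion's latent space, and new training images are sampled from noisy copies of that vector. `invert_image` and `stable_diffusion_sample` are hypothetical helpers standing in for the paper's inversion and sampling procedures.

```python
# Sketch of dataset synthesis via latent inversion, under the stated
# assumptions; not the paper's released implementation.

import torch

def synthesize_dataset(images, labels, invert_image, stable_diffusion_sample,
                       copies_per_image=3, noise_scale=0.1):
    synthetic = []
    for img, y in zip(images, labels):
        z = invert_image(img)  # latent conditioning vector for this image
        for _ in range(copies_per_image):
            # Perturb the vector to get diverse novel samples of the same class.
            z_noisy = z + noise_scale * torch.randn_like(z)
            synthetic.append((stable_diffusion_sample(z_noisy), y))
    return synthetic
```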


Residual Prompt Tuning: Improving Prompt Tuning with Residual Reparameterization

May 2023 · 189 Reads

Prompt tuning is one of the successful approaches for parameter-efficient tuning of pre-trained language models. Despite being arguably the most parameter-efficient (tuned soft prompts constitute <0.1% of total parameters), it typically performs worse than other efficient tuning methods and is quite sensitive to hyper-parameters. In this work, we introduce Residual Prompt Tuning - a simple and efficient method that significantly improves the performance and stability of prompt tuning. We propose to reparameterize soft prompt embeddings using a shallow network with a residual connection. Our experiments show that Residual Prompt Tuning significantly outperforms prompt tuning on the SuperGLUE benchmark. Notably, our method reaches a +7 point improvement over prompt tuning with T5-Base and allows reducing the prompt length by 10x without hurting performance. In addition, we show that our approach is robust to the choice of learning rate and prompt initialization, and is effective in few-shot settings.
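
The residual reparameterization is compact enough to sketch directly. The layer sizes below are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of residual prompt reparameterization: soft prompt
# embeddings pass through a shallow MLP and are added back to themselves
# via a skip connection.

import torch
import torch.nn as nn

class ResidualPrompt(nn.Module):
    def __init__(self, prompt_len: int = 10, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim))
        self.mlp = nn.Sequential(          # shallow reparameterization network
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim),
        )

    def forward(self) -> torch.Tensor:
        # Residual connection: reparameterized prompt = prompt + MLP(prompt).
        return self.prompt + self.mlp(self.prompt)
```

After training, the reparameterized prompts can be materialized once and the MLP discarded, so inference cost matches plain prompt tuning.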


Boosted Prompt Ensembles for Large Language Models

April 2023 · 24 Reads · 2 Citations

Methods such as chain-of-thought prompting and self-consistency have pushed the frontier of language model reasoning performance with no additional training. To further improve performance, we propose a prompt ensembling method for large language models, which uses a small dataset to construct a set of few-shot prompts that together comprise a "boosted prompt ensemble". The few-shot examples for each prompt are chosen in a stepwise fashion to be "hard" examples on which the previous step's ensemble is uncertain. We show that this outperforms single-prompt output-space ensembles and bagged prompt-space ensembles on the GSM8k and AQuA datasets, among others. We propose both train-time and test-time versions of boosted prompting that use different levels of available annotation and conduct a detailed empirical study of our algorithm.
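
A minimal train-time sketch of this boosting loop: at each round, score training questions by ensemble disagreement (here, lack of majority agreement across sampled answers, in the spirit of self-consistency) and build the next few-shot prompt from the hardest ones. `llm_answer` and the example fields are placeholders, not the authors' interface.

```python
# Hedged sketch of boosted prompt ensembles under the stated assumptions.

from collections import Counter

def ensemble_uncertainty(question, prompts, llm_answer, samples=5):
    """Fraction of sampled answers that disagree with the majority vote."""
    votes = Counter(llm_answer(p, question) for p in prompts for _ in range(samples))
    total = sum(votes.values())
    return 1.0 - votes.most_common(1)[0][1] / total

def boost_prompts(train_set, base_prompt, llm_answer, rounds=3, shots=4):
    prompts = [base_prompt]
    for _ in range(rounds):
        # Pick the "hard" examples: highest uncertainty under the current ensemble.
        hard = sorted(train_set,
                      key=lambda ex: ensemble_uncertainty(ex.question, prompts, llm_answer),
                      reverse=True)[:shots]
        prompts.append(base_prompt + "".join(f"\nQ: {ex.question}\nA: {ex.answer}"
                                             for ex in hard))
    return prompts
```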


PIFiA: Self-supervised Approach for Protein Functional Annotation from Single-Cell Imaging Data

February 2023 · 34 Reads · 2 Citations

Fluorescence microscopy data describe protein localization patterns at single-cell resolution and have the potential to reveal whole-proteome functional information with remarkable precision. Yet, extracting biologically meaningful representations from cell micrographs remains a major challenge. Existing approaches often fail to learn robust and noise-invariant features or rely on supervised labels for accurate annotations. We developed PIFiA (Protein Image-based Functional Annotation), a self-supervised approach for protein functional annotation from single-cell imaging data. We imaged the global yeast ORF-GFP collection and applied PIFiA to generate protein feature profiles from single-cell images of fluorescently tagged proteins. We show that PIFiA outperforms existing approaches for molecular representation learning and describe a range of downstream analysis tasks to explore the information content of the feature profiles. Specifically, we cluster extracted features into a hierarchy of functional organization, study cell population heterogeneity, and develop techniques to distinguish multi-localizing proteins and identify functional modules. Finally, we confirm new PIFiA predictions using a colocalization assay, suggesting previously unappreciated biological roles for several proteins. Paired with a fully interactive website (https://thecellvision.org/pifia/), PIFiA is a resource for the quantitative analysis of protein organization within the cell.


Citations (26)


... Improving CytoSummaryNet's generalization capabilities, possibly by increasing training data variation [35], could eliminate the need for retraining. Recent studies exploring image-derived representations using self-supervised learning [35,36] could inspire future research on using learned embeddings instead of classical features to enhance model-aggregated profiles. ...

Reference:

Capturing cell heterogeneity in representations of cell populations for image-based profiling using contrastive learning
PIFiA: self-supervised approach for protein functional annotation from single-cell imaging data
  • Citing Article
  • March 2024

Molecular Systems Biology

... soft tokens inside each attention layer). Successive methods improve how to embed soft tokens (Liu et al., 2022; Razdaibiedina et al., 2023). Though prompt-tuning for ViT adds learnable tokens at the beginning of the input (Jia et al., 2022), it is unclear where the token should be added for Mamba because token order matters in SSMs. ...

Residual Prompt Tuning: improving prompt tuning with residual reparameterization
  • Citing Conference Paper
  • January 2023

... Subramonyam et al. [11] focus on integrating user experience and needs into the AI development process, identifying problems such as low-level design and sharing information across expertise boundaries. Zhang et al. [12] use an LLM to answer student questions classified into four types, and find that the system effectively ignores questions it cannot address. ...

Classifying Course Discussion Board Questions using LLMs

... Secondly, the objectives of instruction generation are not clear. Current research (Lester et al., 2021; Pitis et al., 2023) regards performance (i.e., metrics) as the sole criterion for instruction quality. However, model performance alone cannot precisely explain instruction quality. ...

Boosted Prompt Ensembles for Large Language Models
  • Citing Preprint
  • April 2023

... While some AI tools can themselves help to reformulate and improve prompts (Zhou et al., 2022) via dialogue with the user, this takes time and still does not guarantee the desired result. Moreover, some tools (e.g., text-to-image models) may struggle to improve the prompt when users place too many constraints on the desired solution. ...

Large Language Models Are Human-Level Prompt Engineers

... Much of this recent work has focused on training neural networks on very large datasets (sometimes containing millions of problems) [7,8]. Though this is a challenging task that has spurred the development of some interesting approaches [9-12], it does not address the issue of whether analogical reasoning can emerge zero-shot (that is, without direct training), the capacity most central to human thought. ...

The Scattering Compositional Learner: Discovering Objects, Attributes, Relationships in Analogical Reasoning
  • Citing Preprint
  • July 2020

... Synthetic data in theorem proving. Synthetic data has long been recognized and used as an important ingredient in theorem proving [63-66]. State-of-the-art machine learning methods make use of expert iteration to generate a curriculum of synthetic proofs [2,3,15]. ...

INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving
  • Citing Preprint
  • July 2020

... However, Ranger proved to be the most successful. It combines Lookahead [53] with k = 6 and α = 0.5, and an inner RAdam optimizer [54] with β₁ = 0.95, β₂ = 0.999, and ε = 10⁻⁵. RAdam can help to stabilize the learning rate and adapt it based on the variance of the gradient, making it a robust option when the learning rate is ill-adjusted. ...

Lookahead Optimizer: k steps forward, 1 step back
  • Citing Preprint
  • July 2019
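
The Lookahead rule referenced in the snippet above is small enough to sketch: run the inner optimizer for k fast steps, then move the slow weights a fraction α toward the fast weights and reset. Plain PyTorch, for illustration only.

```python
# Minimal sketch of Lookahead: "k steps forward, 1 step back".
# `inner_opt` is any torch optimizer over `model.parameters()` (e.g.,
# RAdam, as in Ranger); `data_iter` yields (input, target) batches.

import torch

def lookahead_train(model, inner_opt, loss_fn, data_iter, k=6, alpha=0.5, steps=100):
    # Slow weights start as a copy of the model's (fast) weights.
    slow = [p.detach().clone() for p in model.parameters()]
    for step in range(1, steps + 1):
        x, y = next(data_iter)
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()                      # fast weight update
        if step % k == 0:                     # every k fast steps...
            with torch.no_grad():
                for p, s in zip(model.parameters(), slow):
                    s.add_(alpha * (p - s))   # ...slow weights step toward fast
                    p.copy_(s)                # reset fast weights to slow
    return model
```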

... Apart from reduced memory requirements, a major benefit of approximate representations is their ability to generalize over the input space, and thereby make predictions for state-actions that have not been observed yet. However, the individual predictions of approximate methods may contain errors, and there are indications that the combination of tabular and approximate representations may provide the best of both worlds (Silver et al., 2017; Wang et al., 2019; Moerland et al., 2020b). Alternatively, we may also store the solution in a non-parametric way, where we simply store exact sampled traces (e.g., a search tree that does not aggregate over different traces), or in a semi-parametric way (Graves et al., 2016), where we may optimize a neural network to write to and read from a table (Pritzel et al., 2017), sometimes referred to as episodic memory (Gershman and Daw, 2017). ...

Benchmarking Model-Based Reinforcement Learning