Jiaxin Ge’s scientific contributions

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (8)


Figure 2 (caption): Illustration of SLIDESBENCH. Each SLIDESBENCH example comes with three instruction types: Detailed Instructions with Images, Detailed Instructions Only, and High-Level Instructions. The model must generate a slide from the instruction, and the generated slide is scored with a metric suite containing both reference-free and reference-based metrics.
AutoPresent: Designing Structured Visuals from Scratch
  • Preprint
  • File available

June 2025 · 45 Reads

Jiaxin Ge · Zora Zhiruo Wang · Xuhui Zhou · [...] · Trevor Darrell

Designing structured visuals such as presentation slides is essential for communicative needs, requiring both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce SlidesBench, the first benchmark for slide generation, with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i) reference-based, measuring similarity to a target slide, and (ii) reference-free, measuring the design quality of a generated slide on its own. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Building on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k instruction-code pairs for slide generation, which achieves results comparable to the closed-source model GPT-4o. We further explore iterative design refinement, where the model self-refines its own output, and find that this process improves slide quality. We hope that our work will provide a basis for future work on generating structured visuals.
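The programmatic route the abstract describes can be pictured with a short, hand-written stand-in for model output. The sketch below assumes the python-pptx library and an invented instruction about quarterly results; it illustrates the program-generation paradigm, not AutoPresent's actual output.

```python
# Hand-written stand-in for model-generated slide code (the program-generation
# paradigm). Assumes python-pptx; the instruction and content are invented.
from pptx import Presentation

prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[1])  # "Title and Content" layout

# Title taken from a hypothetical NL instruction.
slide.shapes.title.text = "Quarterly Results"

# Body placeholder filled with bullet points.
body = slide.placeholders[1].text_frame
body.text = "Revenue grew 12% quarter over quarter"
bullet = body.add_paragraph()
bullet.text = "Churn fell below 3%"
bullet.level = 1  # indent as a sub-bullet

prs.save("generated_slide.pptx")  # an editable artifact, not a rendered image
```

Because the output is a .pptx file rather than pixels, the user can keep editing the result, which matches the abstract's observation that programmatic methods win in user-interactable formats.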



Training Task Experts through Retrieval Based Distillation

July 2024 · 5 Reads

One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. For specialized tasks, however, such datasets often do not exist. Existing methods address this by generating data with large language models (LLMs) and then distilling that knowledge into smaller models, but they are limited by the quality of the LLM's output and tend to produce repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms it into domain-specific data, greatly enhancing data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on four benchmarks; the results show that it significantly improves performance, by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
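A rough sketch of the retrieve-then-transform idea may help. Everything concrete below is assumed for illustration: the tiny in-memory pool standing in for "rich online sources", the TF-IDF retriever, and the transform_to_task stub, which in a real ReBase-style system would be an LLM call that rewrites the example and generates a Chain-of-Thought rationale.

```python
# Illustrative sketch of a ReBase-style pipeline: retrieve relevant examples
# from an existing data pool, then transform them into task-specific training
# data. TF-IDF retrieval and the stub transform are assumptions, not details
# from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for "rich online sources" (e.g., rows from existing datasets).
pool = [
    "Premise: The cat sat on the mat. Hypothesis: An animal is resting.",
    "Question: Who wrote Hamlet? Context: Hamlet is a play by Shakespeare.",
    "Review: The movie was slow but beautifully shot.",
]

task_description = "natural language inference over everyday scenes"

# Step 1: retrieve the pool entries most relevant to the target task.
vec = TfidfVectorizer().fit(pool + [task_description])
scores = cosine_similarity(vec.transform([task_description]), vec.transform(pool))[0]
retrieved = [pool[i] for i in scores.argsort()[::-1][:2]]

# Step 2: transform retrieved data into task format (an LLM call in ReBase).
def transform_to_task(example: str) -> dict:
    # Placeholder: a real system would prompt an LLM to rewrite the example
    # and produce a chain-of-thought rationale to distill.
    return {"input": example, "rationale": "<model-generated reasoning>"}

train_data = [transform_to_task(e) for e in retrieved]
print(train_data)
```

The design point is that diversity comes from the retrieved source data rather than from free-form LLM sampling, which is what the abstract credits for avoiding repetitive generations.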



Figure 1 (caption): Visualization of the SimPLE method. The figure shows the embedding space of natural sentences; colors denote predicted labels. Each data sample is labeled multiple times under random dropout, the SETRED algorithm detects uncertain pseudo-labels, and the final label is decided by a vote over the confident inferences.
Figure 3 (caption): Pseudo-labeling accuracy of entailment models with standard (ST), dropout, SETRED, and SimPLE strategies. SETRED achieves higher accuracy because uncertain data samples are dropped.
Table (caption): Experimental results on binary classification tasks over 10 independent experiments. "Cat" stands for concatenation-based pretraining and "Sup" for supposition classification.
Entailment as Robust Self-Learner

May 2023 · 77 Reads

Entailment has been recognized as an important metric for evaluating natural language understanding (NLU) models, and recent studies have found that entailment pretraining benefits weakly supervised fine-tuning. In this work, we first design a prompting strategy that formulates a number of different NLU tasks as contextual entailment, improving the zero-shot adaptation of pretrained entailment models. Second, we observe that self-training entailment-based models on unlabeled data can significantly improve adaptation performance on downstream tasks. To achieve more stable improvement, we propose the Simple Pseudo-Label Editing (SimPLE) algorithm for better pseudo-labeling quality in self-training. We also find that both the pretrained and the self-trained entailment models are robust against adversarial evaluation data. Experiments on binary and multi-class classification tasks show that SimPLE leads to more robust self-training results, indicating that self-trained entailment models are more efficient and trustworthy than large language models on language understanding tasks.
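The multi-dropout pseudo-labeling idea from the figure caption above can be sketched in a few lines of PyTorch: label each unlabeled example repeatedly with dropout left on, then keep only examples whose noisy labelings agree. The toy model, the 10 passes, and the 0.8 agreement threshold are illustrative choices, and this simple agreement filter stands in for SimPLE's SETRED-based uncertainty check.

```python
# Toy multi-dropout pseudo-labeling in the spirit of SimPLE: label each
# unlabeled example several times with dropout active, keep only examples
# whose labelings agree, and take the majority vote as the pseudo-label.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(0.3), nn.Linear(32, 3))
unlabeled = torch.randn(8, 16)  # dummy batch of unlabeled examples

model.train()  # keep dropout active so repeated passes are stochastic
with torch.no_grad():
    votes = torch.stack([model(unlabeled).argmax(dim=-1) for _ in range(10)])

majority, _ = votes.mode(dim=0)                      # most frequent label per example
agreement = (votes == majority).float().mean(dim=0)  # fraction of passes agreeing

confident = agreement >= 0.8  # drop uncertain pseudo-labels (SETRED's role in SimPLE)
pseudo_x, pseudo_y = unlabeled[confident], majority[confident]
print(f"kept {int(confident.sum())} of {len(unlabeled)} examples")
```

The surviving (pseudo_x, pseudo_y) pairs would then feed the next round of self-training, which is where the abstract reports the stability gains.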


Chain of Thought Prompt Tuning in Vision Language Models

April 2023 · 54 Reads · 1 Citation

Language-Image Pre-training has demonstrated promising results on zero-shot and few-shot downstream tasks by prompting visual models with natural language prompts. However, most recent studies use only a single prompt for tuning, neglecting the inherent step-by-step cognitive reasoning that humans apply in complex task settings, for example, when processing images from unfamiliar domains. Chain of Thought is a simple and effective approximation of the human reasoning process and has proven useful for natural language processing (NLP) tasks. Based on this cognitive intuition, we believe that effective reasoning is also an important problem in visual tasks, and that a chain of thought could be a solution. In this work, we propose a novel chain-of-thought prompt tuning method for vision-language modeling. Extensive experiments show that our method not only generalizes better on image classification tasks, transfers better beyond a single dataset, and shows stronger domain generalization, but also performs much better on image-text retrieval and visual question answering, which require more reasoning capability. We are the first to successfully adapt chain-of-thought prompting in a way that combines visual and textual embeddings. We will release our code.
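To make the idea concrete, here is a conceptual PyTorch sketch of a chain of learnable prompts in which each step's prompt is produced from the previous one plus the image feature, echoing the visual-feature conditioning described in this page's citing excerpt below. All dimensions, the residual update rule, and the module names are invented for illustration; this is not the paper's architecture.

```python
# Conceptual sketch of chain-of-thought prompt tuning: instead of one learned
# prompt, a chain of prompts is produced step by step, each conditioned on
# the image features. Shapes and the update rule are assumptions.
import torch
import torch.nn as nn

class CoTPromptChain(nn.Module):
    def __init__(self, dim: int = 512, steps: int = 3, tokens: int = 4):
        super().__init__()
        self.prompt0 = nn.Parameter(torch.randn(tokens, dim) * 0.02)
        # One small conditioning layer per link in the chain.
        self.steps = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(steps))

    def forward(self, image_feat: torch.Tensor) -> list[torch.Tensor]:
        # image_feat: (dim,) global feature from a frozen image encoder.
        prompts, p = [], self.prompt0
        for step in self.steps:
            cond = image_feat.expand_as(p)              # broadcast image feature
            p = p + step(torch.cat([p, cond], dim=-1))  # residual, image-conditioned update
            prompts.append(p)                           # one prompt per reasoning step
        return prompts  # each would be prepended to class tokens in the text encoder

chain = CoTPromptChain()
prompts = chain(torch.randn(512))
print(len(prompts), prompts[0].shape)  # 3 steps, each (4, 512)
```

Conditioning every prompt in the chain on the image is what distinguishes this setup from single-prompt methods like CoOp, per the citing excerpt at the bottom of this page.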



Citations (5)


... To this end, we capitalize on a recent large language model (LLM)-based code generation paradigm (Surís et al., 2023; Gupta & Kembhavi, 2023; Subramanian et al., 2023), which produces modular executable programs to answer natural language queries. While this approach has shown promise for zero-shot VideoQA (Surís et al., 2023; Ge et al., 2024), we are not interested in its task performance per se. Instead, we use its rich, structured intermediate representations (programs, as shown in Figure 1, bottom) to capture the elusive complexity of the original questions. ...

Reference:

Understanding Complexity in VideoQA via Visual Program Generation
Recursive Visual Programming
  • Citing Chapter
  • September 2024

... Tasks covered (with the citing survey's reference numbers):
  • Generate text (or image and other modalities) given a certain input [10,17,69]
  • Retrieval: retrieve pieces of information given a certain input [35,45,62]
  • Query rewriting: reformulate/expand queries to improve retrieval performance [21,93]
  • Intent recognition: categorize user input into predefined intents or discover novel intents [5,95]
  • Asking questions: ask clarifying questions proactively to engage the user and help the retrieval process [2,86]
  • Entity linking: identify entity mentions in text and link them to a knowledge base [40,71,79]
  • Formal translation: convert natural language text into formal representations (e.g., SQL, logical forms) [22,23,63,90]
  • Recommendation: suggest relevant items (e.g., products, documents, movies) [46,91]
  • Action execution: call APIs based on pre-generated requests [65,85]
  • Verification: validate model outputs for consistency, logic, and constraint compliance [23,53,90]
  • Aggregation: when multiple answers are available, select/aggregate them into one [13,53,96,97]
  • GIO creation: create a refined, synthesized output that enhances the final response for better user interaction, i.e., a Generated Information Object (GIO) [16]
  • Complex RAG: combine retrieval and generation to produce factually grounded responses [39]
IRCoT ...

Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning
  • Citing Conference Paper
  • January 2024

... The visual commonsense reasoning (VCR) task aims to predict the answer to a multiple-choice question and provide a convincing rationale [11,24,44,45,49] about the image. In recent years, it has gained considerable attention from the computer vision (CV) and natural language processing (NLP) communities due to the advancement of large multimodal models (LMMs) [3,14,24,41,46,51]. Specifically, inferring a reliable answer in VCR requires LMMs not only to recognize objects and scenes but also to deeply understand the underlying visual commonsense (e.g., likely intents, goals, and social dynamics of people) in the image. ...

From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation
  • Citing Conference Paper
  • January 2023

... Model confidence/uncertainty without self-feedback. A parallel line of work on improving LLM generation outputs assesses the model's confidence and the uncertainty of its predictions without iterative language model calls (Jiang et al., 2021; Lang et al., 2022; Kuhn et al., 2023; Ge et al., 2023; Jiang et al., 2023; Vernikos et al., 2023; He et al., 2024b). Kuhn et al. (2023) define semantic entropy, a metric that incorporates the linguistic invariance of individual output candidates sharing identical meanings. ...

Entailment as Robust Self-Learner
  • Citing Conference Paper
  • January 2023

... There are two primary paradigms in CoT prompting: one employs a straightforward prompt to elicit step-by-step reasoning in a zero-shot manner, and the other presents several manual demonstrations sequentially in a few-shot approach. Ge JX et al. (2023) applied CoT to prompt tuning; unlike prompt learning methods such as CoOp and CoCoOp, they built a chain of prompts before the text encoder, with all the prompts in the CoT chain conditioned on visual features, and this generalized better on new classes than CoCoOp. Because language information is the key to CoT, multi-modal prompting will be a more promising way of using the CoT concept. ...

Chain of Thought Prompt Tuning in Vision Language Models
  • Citing Preprint
  • April 2023