Quoc V. Le’s research while affiliated with Mountain View College and other places


Publications (222)


HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
  • Chapter

October 2024 · 4 Reads · 2 Citations
Zhecan Wang · Garrett Bingham · Adams Wei Yu · [...] · Golnaz Ghiasi

EVOLvE: Evaluating and Optimizing LLMs For Exploration
  • Preprint
  • File available

October 2024 · 3 Reads

Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial because many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a stateless reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference, and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conduct an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of LLM exploration efficiency using the concept of regret, linking the ability to explore to model size and the underlying algorithm.
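To make the algorithm-guided setup concrete, the sketch below runs UCB1 on a toy Bernoulli bandit and records a pull-by-pull trace of the kind that could be serialized into a synthetic demonstration for algorithm distillation, together with the cumulative regret. The arm means, horizon, and trace format are illustrative assumptions, not the paper's actual data pipeline.

```python
import math
import random

def ucb1_trajectory(arm_means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit; return (textual trace, cumulative regret)."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts, sums = [0] * k, [0.0] * k
    trace, regret, best = [], 0.0, max(arm_means)
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialize the estimates
        else:
            # UCB1 index: empirical mean plus an exploration bonus
            arm = max(range(k),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]
        trace.append(f"t={t}: pulled arm {arm}, observed reward {reward:.0f}")
    return trace, regret

# A trace like this could be dropped into an in-context demonstration or a
# fine-tuning example; the arm means here are made up for illustration.
demo, cumulative_regret = ucb1_trajectory([0.2, 0.5, 0.8], horizon=200)
```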



Figure 3: Scaling inference time compute via repeated sampling leads to consistent coverage gains across a variety of model sizes (70M-70B), families (Llama, Gemma and Pythia) and levels of post-training (Base and Instruct models).
Figure 8: Fraction of samples (out of 10,000) that are correct for each problem in the GSM8K and MATH subsets we evaluate on, with one bar per problem; bars are green if self-consistency picked the correct answer and red otherwise. Many problems have correct solutions that are sampled only infrequently.
Figure 9: SWE-bench Lite results, without and with problems that have flaky tests. For the graph on the left, all problems in Table 3 are excluded. For the graph on the right, all problems are included. We note that the trend is the same with or without the flaky tests.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

July 2024 · 151 Reads · 2 Citations

Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage - the fraction of problems solved by any attempt - scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. Finally, we find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers. When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.
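Coverage at a given sample budget can be estimated directly from per-problem success counts. The sketch below uses the standard unbiased pass@k estimator to compute coverage at several budgets and then does a rough check of the log-linear trend the abstract describes; the success counts are synthetic stand-ins, not numbers from the paper.

```python
from math import comb

import numpy as np

def coverage_at_k(n: int, num_correct, k: int) -> float:
    """Unbiased pass@k estimate: fraction of problems solved by at least one
    of k samples, given num_correct successes out of n samples per problem."""
    per_problem = [1.0 - comb(n - c, k) / comb(n, k) for c in num_correct]
    return float(np.mean(per_problem))

# Synthetic per-problem success counts out of 10,000 samples (illustrative only).
rng = np.random.default_rng(0)
counts = rng.integers(0, 300, size=128)

ks = [1, 10, 100, 1000, 10000]
coverage = [coverage_at_k(10_000, counts, k) for k in ks]

# Rough check of the claimed log-linear relationship: regress coverage on log(k).
slope, intercept = np.polyfit(np.log(ks), coverage, 1)
```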


HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

July 2024 · 25 Reads

Hallucination has been a major problem for large language models and remains a critical challenge in the multimodal setting, where vision-language models (VLMs) must handle not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images alongside real ones to enable dataset creation at scale. With over 7.7K examples spanning a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results show that benchmark performance on generated images is highly correlated (r=0.97) with performance on real images. Finally, we propose a novel Auto-Eval mechanism for evaluating VLMs that is highly correlated with human raters (r=0.99). In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.
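As a small illustration of the synthetic-versus-real analysis mentioned above, the sketch below computes the Pearson correlation between per-model accuracies on a generated-image split and a real-image split; the model names and accuracies are hypothetical placeholders, not HaloQuest results.

```python
import numpy as np

# Hypothetical per-model accuracies (%) on the two splits.
models = ["vlm-a", "vlm-b", "vlm-c", "vlm-d"]
acc_generated = np.array([22.0, 28.5, 31.0, 35.5])
acc_real = np.array([24.0, 27.0, 32.5, 34.0])

# Pearson correlation between accuracy on generated and real images;
# a value near 1 means the synthetic split ranks models like the real one.
r = np.corrcoef(acc_generated, acc_real)[0, 1]
```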


Figure 1: Illustrative example of the Trip Planning task.
Figure 2: Illustrative example of the Meeting Planning task.
Figure 3: Illustrative example of the Calendar Scheduling task.
Table: Accuracy of five models (GPT-3.5, GPT-4, GPT-4o, Gemini 1.5 Flash, Gemini 1.5 Pro) on NATURAL PLAN, broken down by task.
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

June 2024 · 39 Reads · 1 Citation

We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as context to the models. This eliminates the need for a tool-use environment when evaluating LLMs on planning. We observe that NATURAL PLAN is a challenging benchmark for state-of-the-art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro achieve solve rates of only 31.1% and 34.8%, respectively. We find that model performance drops drastically as problem complexity increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in natural-language planning for SoTA LLMs. We also conduct extensive ablation studies on NATURAL PLAN to further shed light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long contexts on improving LLM planning.
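Because the benchmark provides full information (for example, the available flights) in the prompt, a model's plan can be verified purely symbolically. The sketch below shows the kind of constraint check a Trip Planning instance involves; the itinerary, stay lengths, flight graph, and day-counting convention are simplified, hypothetical stand-ins rather than the benchmark's exact rules.

```python
from typing import Dict, List, Set, Tuple

def check_trip_plan(itinerary: List[Tuple[str, int]],
                    required_days: Dict[str, int],
                    direct_flights: Set[Tuple[str, str]],
                    total_days: int) -> bool:
    """Check that a proposed itinerary visits every required city for the
    required number of days, fits the trip length, and uses only direct flights."""
    if {city for city, _ in itinerary} != set(required_days):
        return False
    if any(required_days[city] != days for city, days in itinerary):
        return False
    if sum(days for _, days in itinerary) != total_days:
        return False
    return all((a, b) in direct_flights
               for (a, _), (b, _) in zip(itinerary, itinerary[1:]))

# Hypothetical instance: 9 days across three cities with two direct-flight legs.
plan = [("Paris", 3), ("Rome", 2), ("Athens", 4)]
valid = check_trip_plan(plan,
                        {"Paris": 3, "Rome": 2, "Athens": 4},
                        {("Paris", "Rome"), ("Rome", "Athens")},
                        total_days=9)
```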




Solving olympiad geometry without human demonstrations

January 2024 · 1,544 Reads · 158 Citations · Nature

Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning [1–4], owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges [1,5], resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.
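A minimal sketch of the neuro-symbolic loop described above: a forward-chaining symbolic engine exhausts its deductions, and when the goal is still unreached a proposer (the neural language model in AlphaGeometry; here a trivial stand-in) injects an auxiliary fact and deduction resumes. The rule set, facts, and propose() callback are toy assumptions, not the system's actual geometry rules.

```python
from typing import Callable, FrozenSet, List, Set, Tuple

Rule = Tuple[FrozenSet[str], str]  # (premises, conclusion)

# Toy rule set standing in for the symbolic deduction rules.
RULES: List[Rule] = [
    (frozenset({"midpoint(M,A,B)"}), "eq(AM,MB)"),
    (frozenset({"eq(AM,MB)", "eq(MB,MC)"}), "eq(AM,MC)"),
]

def forward_chain(facts: Set[str]) -> Set[str]:
    """Apply rules until no new fact can be derived (the symbolic engine)."""
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

def prove(facts: Set[str], goal: str,
          propose: Callable[[Set[str]], str],
          max_constructions: int = 3) -> bool:
    """Alternate symbolic deduction with auxiliary facts from a proposer."""
    for _ in range(max_constructions + 1):
        facts = forward_chain(facts)
        if goal in facts:
            return True
        facts.add(propose(facts))  # the language model's suggestion slots in here
    return goal in facts

# Toy run: the goal becomes provable only after the proposer supplies eq(MB,MC).
print(prove({"midpoint(M,A,B)"}, "eq(AM,MC)", propose=lambda f: "eq(MB,MC)"))
```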


Large Language Models as Optimizers

September 2023 · 690 Reads · 6 Citations

Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradients poses challenges for many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leveraging large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from a prompt that contains previously generated solutions with their values; the new solutions are then evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to prompt optimization, where the goal is to find instructions that maximize task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K and by up to 50% on Big-Bench Hard tasks.
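A minimal sketch of the optimization loop described in the abstract, assuming a caller-supplied generate() wrapper around an LLM and an evaluate() scorer; the meta-prompt wording is illustrative rather than the paper's exact template.

```python
from typing import Callable, List, Tuple

def opro_step(history: List[Tuple[str, float]],
              task_description: str,
              generate: Callable[[str], str]) -> str:
    """Build a meta-prompt from past (solution, score) pairs and ask the LLM
    for a new candidate solution."""
    scored = "\n".join(f"solution: {s}\nscore: {v:.2f}"
                       for s, v in sorted(history, key=lambda x: x[1]))
    meta_prompt = (
        f"{task_description}\n\nBelow are previous solutions and their scores, "
        f"lowest first:\n{scored}\n\nWrite a new solution with a higher score."
    )
    return generate(meta_prompt)

def opro(task_description: str,
         seeds: List[str],
         evaluate: Callable[[str], float],
         generate: Callable[[str], str],
         num_steps: int = 20) -> Tuple[str, float]:
    """Run the propose-evaluate-append loop and return the best solution found."""
    history = [(s, evaluate(s)) for s in seeds]
    for _ in range(num_steps):
        candidate = opro_step(history, task_description, generate)
        history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda x: x[1])
```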


Citations (60)


... Building on the success of reinforcement learning for preference optimization [42], HA-DPO [63] and HIO [5] adopt this paradigm to fine-tune VLLMs, yielding outputs with fewer hallucinations. On the other hand, several benchmarks [4,7,8,21,22,29,30,48–52,64] have recently been constructed to quantify visual hallucinations, focusing primarily on object-based hallucinations in images. ...

Reference:

VidHal: Benchmarking Temporal Hallucinations in Vision LLMs
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
  • Citing Chapter
  • October 2024

... Most of Markov's criticisms are of this form: it does not look to him like our method should work, and therefore it must not work, and any evidence suggesting otherwise is fraud. Nature investigated Markov's concerns, found them to be entirely without merit, and published an Addendum upholding our work at the conclusion of this process [20]. ...

Addendum: A graph placement methodology for fast chip design

Nature

... These advanced AI systems are capable of interacting with users, holding conversations, and expressing their own thoughts [86]. Recent research and industry applications have leveraged the conversational and reasoning capabilities of LLMs to create problem-solving tools [33,55,67,87,92], educational assistants [48], and novel search interfaces [50]. In the industry, many companies deploy AI agents on social media to assist with branding and customer service [39,43,90], while others embed agents in their applications for assisting writing [26], brainstorming [65], or gaming [73]. ...

Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses
  • Citing Conference Paper
  • May 2024

... The Mathematics Olympiad is a competitive platform for students [1], where the level of problem difficulty is extremely high [2], requiring exploration skills [3], problem-solving abilities [4], advanced mathematical reasoning such as algebraic concepts [5], and a deep understanding of the subjects tested in the Olympiad [6]. It is not only school students who struggle with solving complex mathematical problems; even university students may encounter difficulties in mathematical problem-solving [7]. ...

Author Correction: Solving olympiad geometry without human demonstrations

Nature

...
  • Asking the LLM to "take a deep breath" (apparently because this combination of tokens appears when people describe successful solutions to a problem after a long and frustrating chain of attempts; Yang et al., 2023),
  • Asking ChatGPT to imagine that it is now May (this relates to the fact that ChatGPT receives a latent timestamp of the prompt, and the training dataset also carries timestamps; close to holidays, especially in December, the length of human responses in the training data decreased, so ChatGPT gives more concise responses leading up to and after widespread holidays),
  • Stating that a user "unfortunately has no fingers," so "they cannot type" (apparently especially successful when requesting programming code; it makes ChatGPT provide a final solution to the problem, incorporating all small changes into the final code at once; Ivanovs, 2023).
  • Additionally, several other tricks, such as making the LLM repeat the question before answering or stressing human-relevant motivation factors (Bsharat et al., 2023), appear to have a positive impact on LLM performance. ...

Large Language Models as Optimizers

... Vision-language models like CLIP [54] have shown promising zero-shot transfer capabilities [10]. However, when fine-tuned to a specific scenario [1,51], they often lose the ability to generalize to new domains [30,71]. To address this issue, recent works leveraged textual information to bridge the gap between domains [14,18,19,23,45,66]. ...

Combined scaling for zero-shot transfer learning
  • Citing Article
  • August 2023

Neurocomputing

... Score-based diffusion models [42,43,39,14] learn the data distribution by reversing a forward diffusion process that progressively transforms the data into Gaussian noise. These models have quickly surpassed the fidelity and diversity of previous generative modeling methods [27,9], achieving state-of-the-art results in various domains, including unconditional image generation [9,18], text-to-image generation [32,37,2,33,30,47], video generation [5,4,10], image-to-image translation [36,22], motion synthesis [45,46], and audio generation [7,20,17]. ...

Noise2Music: Text-conditioned Music Generation with Diffusion Models

... Instruction-Tuning (Longpre et al., 2023) is a framework for multi-task learning that enables the use of human-readable instructions to guide the predictions of LLMs. This novel training paradigm can improve performance on downstream tasks and also shows great generalisation ability on unseen tasks (Chung et al., 2022; Sanh et al., 2022). ...

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning