October 2024 · 4 Reads · 2 Citations
October 2024 · 3 Reads
Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference; and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conduct an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of the LLM's exploration efficiency using the concept of regret, linking its ability to explore to the model size and underlying algorithm.
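As a concrete reference point for the kind of exploration algorithm the abstract alludes to, below is a minimal UCB1 sketch for a context-free bandit, of the sort whose decisions could be surfaced as inference-time guidance or rendered into synthetic demonstrations for distillation. The class and rollout helper are illustrative assumptions, not the paper's actual implementation or prompt format.

    import math
    import random

    class UCB1:
        """Classic UCB1 explorer for a context-free K-armed bandit (illustrative)."""

        def __init__(self, n_arms):
            self.counts = [0] * n_arms    # number of pulls per arm
            self.values = [0.0] * n_arms  # running mean reward per arm

        def select_arm(self):
            # Pull every arm once before applying the confidence bound.
            for arm, count in enumerate(self.counts):
                if count == 0:
                    return arm
            total = sum(self.counts)
            scores = [v + math.sqrt(2.0 * math.log(total) / c)
                      for v, c in zip(self.values, self.counts)]
            return max(range(len(scores)), key=scores.__getitem__)

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    def rollout(true_means, horizon=200, seed=0):
        """Run one bandit episode and return the interaction trace plus realized regret.
        Such traces could be rendered as text demonstrations for an LLM."""
        rng = random.Random(seed)
        agent = UCB1(len(true_means))
        trace = []
        for t in range(horizon):
            arm = agent.select_arm()
            reward = 1.0 if rng.random() < true_means[arm] else 0.0  # Bernoulli reward
            agent.update(arm, reward)
            trace.append((t, arm, reward))
        regret = max(true_means) * horizon - sum(r for _, _, r in trace)
        return trace, regret

    trace, regret = rollout([0.2, 0.5, 0.8])
    print(f"cumulative regret over {len(trace)} steps: {regret:.1f}")

The printed realized regret gives a rough sense of how quickly the explorer locks onto the best arm, which is the same quantity the abstract uses to analyze LLM exploration efficiency.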
September 2024 · 63 Reads · 1 Citation
Nature
July 2024 · 151 Reads · 2 Citations
Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage - the fraction of problems solved by any attempt - scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. Finally, we find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers. When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.
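Coverage here is the pass@k quantity, which can be estimated without bias from n samples per problem when c of them pass an automatic verifier. The short sketch below uses the standard combinatorial estimator; the per-problem correct counts are hypothetical, purely to show the computation.

    from math import comb

    def pass_at_k(n, c, k):
        """Unbiased estimate of the chance that at least one of k samples is correct,
        given n total samples of which c passed the verifier."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Hypothetical verifier results: number of correct samples per problem out of 250.
    correct_counts = [0, 1, 3, 17, 250]
    n_samples = 250
    for k in (1, 10, 100, 250):
        coverage = sum(pass_at_k(n_samples, c, k) for c in correct_counts) / len(correct_counts)
        print(f"coverage (pass@{k}): {coverage:.3f}")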
July 2024 · 25 Reads
Hallucination has been a major problem for large language models and remains a critical challenge when it comes to multimodality, in which vision-language models (VLMs) have to deal with not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, in addition to real ones, to enable dataset creation at scale. With over 7.7K examples spanning a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results show that benchmark performance on generated images is highly correlated (r=0.97) with performance on real images. Last but not least, we propose a novel Auto-Eval mechanism for evaluating VLMs that is highly correlated with human raters (r=0.99). In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.
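An LLM-as-judge style Auto-Eval can be sketched generically as follows; the prompt wording and the judge callable are placeholders and not HaloQuest's actual evaluator, which the abstract only describes at a high level.

    JUDGE_PROMPT = (
        "You are grading a visual question answering response.\n"
        "Question: {question}\n"
        "Reference answer: {reference}\n"
        "Model answer: {candidate}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

    def auto_eval(examples, judge):
        """Return the fraction of examples the judge model marks CORRECT.

        `judge` is any callable that takes a prompt string and returns the model's
        text reply; both it and the prompt wording are illustrative placeholders."""
        hits = 0
        for ex in examples:
            prompt = JUDGE_PROMPT.format(question=ex["question"],
                                         reference=ex["reference"],
                                         candidate=ex["candidate"])
            verdict = judge(prompt).strip().upper()
            hits += verdict.startswith("CORRECT")
        return hits / len(examples)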
June 2024 · 39 Reads · 1 Citation
We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as contexts to the models. This eliminates the need for a tool-use environment for evaluating LLMs on planning. We observe that NATURAL PLAN is a challenging benchmark for state-of-the-art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro achieve solve rates of only 31.1% and 34.8%, respectively. We find that model performance drops drastically as the complexity of the problem increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in planning in natural language for SoTA LLMs. We also conduct extensive ablation studies on NATURAL PLAN to further shed light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long contexts on improving LLM planning.
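A rough sketch of how such a tool-free evaluation can be wired up: pre-computed tool outputs are serialized into the prompt so the model plans from full information, and the solve rate is scored against reference plans. Both helpers below are illustrative assumptions rather than the benchmark's actual harness.

    def build_prompt(task, tool_outputs):
        """Serialize pre-computed tool results (e.g., flight, map, or calendar data)
        into the context, so the model plans from full information without tool calls."""
        context = "\n\n".join(f"[{name}]\n{text}" for name, text in tool_outputs.items())
        return f"{context}\n\nTask: {task}\nPlan:"

    def solve_rate(predictions, references):
        """Fraction of problems whose predicted plan matches the reference plan
        (a simple exact-match proxy; the benchmark's real checker may differ)."""
        hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
        return hits / len(references)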
May 2024 · 3 Reads · 9 Citations
February 2024 · 37 Reads · 2 Citations
Nature
January 2024 · 1,544 Reads · 158 Citations
Nature
Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning1–4, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges1,5, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.
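The alternation between symbolic deduction and language-model guidance can be sketched as a simple loop. The deduce and propose_construction callables below are placeholders standing in for AlphaGeometry's symbolic engine and neural model, and proof-trace bookkeeping is omitted.

    def neuro_symbolic_prove(premises, goal, deduce, propose_construction, max_steps=16):
        """Alternate exhaustive symbolic deduction with model-proposed constructions.

        `deduce` closes a set of facts under fixed geometric rules and
        `propose_construction` stands in for a language model suggesting an
        auxiliary point or line; both are illustrative placeholders."""
        facts = set(premises)
        for _ in range(max_steps):
            facts = deduce(facts)                  # run the symbolic engine to a fixed point
            if goal in facts:
                return facts                       # goal reached: a proof exists in the closure
            construction = propose_construction(facts, goal)  # neural guidance at the branch
            if construction is None:
                return None                        # no further suggestions: give up
            facts.add(construction)
        return None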
September 2023 · 690 Reads · 6 Citations
Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradients imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values; the new solutions are then evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to prompt optimization where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks.
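A single OPRO iteration can be sketched as below; generate and evaluate are placeholders for the LLM call and the task-specific scorer, and the meta-prompt wording is illustrative rather than the paper's exact template.

    def opro_step(history, generate, evaluate, num_candidates=8):
        """One optimization-by-prompting iteration.

        `history` is a list of (solution, score) pairs; `generate` calls the LLM with
        the meta-prompt and `evaluate` scores a candidate on the task. Both callables
        and the meta-prompt wording are illustrative assumptions."""
        scored = "\n".join(f"solution: {s}\nscore: {v:.2f}"
                           for s, v in sorted(history, key=lambda pair: pair[1]))
        meta_prompt = ("Here are previous solutions with their scores, worst first:\n"
                       f"{scored}\n"
                       "Write a new solution that achieves a higher score.")
        for _ in range(num_candidates):
            candidate = generate(meta_prompt)
            history.append((candidate, evaluate(candidate)))
        return history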
... Building on the success of reinforcement learning for preference optimization [42], HA-DPO [63] and HIO [5] adopt this paradigm to fine-tune VLLMs, yielding outputs with fewer hallucinations. On the other hand, several benchmarks [4,7,8,21,22,29,30,48-52,64] have recently been constructed to quantify visual hallucinations, focusing primarily on object-based hallucinations in images. ...
October 2024
... Most of Markov's criticisms are of this form: it does not look to him like our method should work, and therefore it must not work, and any evidence suggesting otherwise is fraud. Nature investigated Markov's concerns, found them to be entirely without merit, and published an Addendum upholding our work at the conclusion of this process [20]. ...
September 2024
Nature
... These advanced AI systems are capable of interacting with users, holding conversations, and expressing their own thoughts [86]. Recent research and industry applications have leveraged the conversational and reasoning capabilities of LLMs to create problem-solving tools [33,55,67,87,92], educational assistants [48], and novel search interfaces [50]. In the industry, many companies deploy AI agents on social media to assist with branding and customer service [39,43,90], while others embed agents in their applications for assisting writing [26], brainstorming [65], or gaming [73]. ...
May 2024
... The Mathematics Olympiad is a competitive platform for students [1], where the level of problem difficulty is extremely high [2], requiring exploration skills [3], problem-solving abilities [4], advanced mathematical reasoning, such as algebraic concepts [5], and a deep understanding of the subjects tested in the Olympiad [6]. It is not only students who struggle with solving complex mathematical problems; even university students may encounter difficulties in mathematical problem-solving [7]. ...
February 2024
Nature
... Another example of this realization is the development of AlphaGeometry, which proves mathematical theorems at the Olympiad level [19]. As a neuro-symbolic system, it uses a language model in conjunction with a symbolic deduction engine. ...
January 2024
Nature
... • Asking the LLM to "take a deep breath" (because, apparently, this combination of tokens is used when people describe successful solutions of a problem after a long and frustrating chain of attempts; Yang et al., 2023),
• Asking ChatGPT to imagine that it is now May (this is related to the fact that ChatGPT receives a latent timestamp of the prompt and the training dataset also carries timestamps; apparently, close to holidays (especially in December) the length of human responses in the training dataset decreased, resulting in ChatGPT giving more concise responses leading up to and after widespread holidays),
• Stating that a user "unfortunately has no fingers," so "they cannot type" (apparently, this is especially successful for requests to write programming code; it makes ChatGPT provide a final solution to the problem, incorporating all small changes into the final code at the same time; Ivanovs, 2023).
• Additionally, several other tricks, such as making the LLM repeat the question before answering or stressing human-relevant motivation factors (Bsharat et al., 2023), appear to have a positive impact on LLM performance. ...
September 2023
... Vision-language models like CLIP [54] have shown promising zero-shot transfer capabilities [10]. However, when fine-tuned to a specific scenario [1,51], they often lose the ability to generalize to new domains [30,71]. To address this issue, recent works leveraged textual information to bridge the gap between domains [14,18,19,23,45,66]. ...
August 2023
Neurocomputing
... At the same time, the pretraining corpus design can promote the model underfitting and overfitting on particular languages. We believe it can be accounted for by aggregating the language-specific cross-entropy loss and producing language weights similar to Xie et al. (2023). ...
Reference: mGPT: Few-Shot Learners Go Multilingual
May 2023
... Score-based diffusion models [42,43,39,14] learn the data distribution by reversing a forward diffusion process that progressively transforms the data into Gaussian noise. These models have quickly surpassed the fidelity and diversity of previous generative modeling methods [27,9], achieving state-of-the-art results in various domains, including unconditional image generation [9,18], text-to-image generation [32,37,2,33,30,47], video generation [5,4,10], image-to-image translation [36,22], motion synthesis [45,46], and audio generation [7,20,17]. ...
February 2023
... Instruction-Tuning (Longpre et al., 2023) is a framework for multi-task learning that enables the use of human-readable instructions to guide the predictions of LLMs. This novel training paradigm can improve performance on downstream tasks and also shows great generalisation ability on unseen tasks (Chung et al., 2022; Sanh et al., 2022). ...
January 2023