Yaru Hao’s scientific contributions


Publications (19)


Figure 10: Empirical evidence of Property 4.3: redundant training examples are discarded in optimal learning. We randomly sample 2048 training examples satisfying CT_{n,t} > 0 (contributive and unlearned examples) throughout the near-optimal learning process and show the dynamics of the example weight γ_{n,t} (represented by the color in (a) and (b)). Since Perceptron converges quickly, we only plot its γ_{n,t} dynamics for t ≤ 50. The near-optimal policies assign γ_{n,t} = 0 to redundant examples, in addition to the perfectly learned and non-contributive data points.
Figure 12: The architecture of the equivalent neural network for finding the optimal learning policy. Each layer consists of a gradient update and a residual connection.
Towards Optimal Learning of Language Models
  • Preprint
  • File available

November 2024 · 7 Reads · Yaru Hao · [...] · Furu Wei

This work studies the general principles of improving the learning of language models (LMs), with the aim of reducing the training steps necessary to achieve superior performance. Specifically, we present a theory for the optimal learning of LMs. We first propose an objective that optimizes LM learning by maximizing the data compression ratio in an "LM-training-as-lossless-compression" view. Then, we derive a theorem, named the Learning Law, revealing the properties of the dynamics of the optimal learning process under our objective. The theorem is validated by experiments on a linear classification task and a real-world language modeling task. Finally, we empirically verify that the optimal learning of LMs essentially stems from improving the coefficients in the scaling law of LMs, indicating great promise for the design of practical learning acceleration methods. Our code can be found at https://aka.ms/LearningLaw.
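
Under the lossless-compression view, an LM's code length for a corpus is its summed negative log-probability (the arithmetic-coding identity), and the compression ratio compares this against a naive fixed-rate code. A minimal sketch of that bookkeeping, with toy inputs (the function names and data are illustrative, not the paper's implementation):

```python
import math

def code_length_bits(token_logprobs):
    """Code length (in bits) of a token stream under an LM, via the
    arithmetic-coding identity: L = -sum(log2 p(token | context)).
    Inputs are natural-log probabilities, converted nats -> bits."""
    return -sum(lp / math.log(2) for lp in token_logprobs)

def compression_ratio(token_logprobs, vocab_size):
    """Ratio of a naive fixed-rate encoding (log2 |V| bits per token)
    to the LM's code length; higher means better compression."""
    raw_bits = len(token_logprobs) * math.log2(vocab_size)
    return raw_bits / code_length_bits(token_logprobs)

# Toy usage: 4 tokens with natural-log probabilities from a model.
print(compression_ratio([-1.2, -0.7, -2.3, -0.4], vocab_size=50257))
```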


Figure 7: LM performance on the OLMo evaluation tasks (Average Accuracy) and data scorer performance (Spearman Correlation) when different proxy model and proxy data sizes are adopted.
Test loss extrapolation using the Scaling Law [40]. We predict the test loss when the LM size N and the number of trained tokens D match those of GPT-3 175B, Llama 6.7B, Llama 2 70B, and Llama 3.1 405B. The improvements from PDS remain consistent for these LMs.
Data Selection via Optimal Control for Language Models

October 2024 · 18 Reads

This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs' capabilities for downstream usage. We formulate data selection as a generalized Optimal Control problem, which can be solved theoretically by Pontryagin's Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics. Based on these theoretical results, we introduce PMP-based Data Selection (PDS), a framework that approximates optimal data selection by solving the PMP conditions. In our experiments, we adopt PDS to select data from CommonCrawl and show that the PDS-selected corpus accelerates the learning of LMs and consistently boosts their performance on a wide range of downstream tasks across various model sizes. Moreover, the benefits of PDS extend to ~400B models trained on ~10T tokens, as evidenced by the extrapolation of the test loss curves according to the Scaling Laws. PDS also improves data utilization when pre-training data is limited, reducing the data demand by 1.8 times, which mitigates the quick exhaustion of available web-crawled corpora. Our code, data, and model checkpoints can be found at https://github.com/microsoft/LMOps/tree/main/data_selection.
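
The PMP conditions couple per-example data weights to the training dynamics. A common simplified stand-in for that selection signal (an illustrative proxy, not the paper's exact PDS solver) scores each candidate by how well its gradient aligns with the gradient of a target loss:

```python
import torch

def gradient_alignment_scores(model, loss_fn, candidates, target_batch):
    """Score candidate examples by the inner product of their gradient
    with the gradient of a target (downstream) loss. A simplified proxy
    for PMP-style data selection; `loss_fn(model, batch)` is assumed to
    return a scalar loss."""
    params = [p for p in model.parameters() if p.requires_grad]

    target_loss = loss_fn(model, target_batch)
    g_target = torch.autograd.grad(target_loss, params)

    scores = []
    for example in candidates:
        loss = loss_fn(model, example)
        g = torch.autograd.grad(loss, params)
        # Alignment with the target gradient: higher = more useful.
        scores.append(sum((gi * ti).sum() for gi, ti in zip(g, g_target)).item())
    return scores  # keep the top-scoring fraction as the selected corpus
```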


Kosmos-2: Grounding Multimodal Large Language Models to the World

June 2023 · 51 Reads · 3 Citations

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM) with new capabilities for perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent referring expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct a large-scale dataset of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays out the foundation for the development of Embodied AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Data, demo, and pretrained models are available at https://aka.ms/kosmos-2.
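
To make the Markdown-link format concrete, the sketch below discretizes a pixel box into grid-cell location tokens. The 32-bin grid and the `<loc_i>` token names are assumptions for illustration, not necessarily Kosmos-2's exact vocabulary:

```python
def box_to_location_tokens(box, image_w, image_h, num_bins=32):
    """Discretize an (x0, y0, x1, y1) pixel box into location tokens
    for the top-left and bottom-right grid cells."""
    x0, y0, x1, y1 = box
    def bin_of(v, size):
        return min(int(v / size * num_bins), num_bins - 1)
    tl = bin_of(y0, image_h) * num_bins + bin_of(x0, image_w)  # top-left cell
    br = bin_of(y1, image_h) * num_bins + bin_of(x1, image_w)  # bottom-right cell
    return f"<loc_{tl}><loc_{br}>"

def grounded_span(text, box, image_w, image_h):
    """Render a referring expression in the Markdown-link format
    '[text span](location tokens)' described in the abstract."""
    return f"[{text}]({box_to_location_tokens(box, image_w, image_h)})"

print(grounded_span("a snowman", (10, 20, 120, 200), image_w=224, image_h=224))
```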


Language Is Not All You Need: Aligning Perception with Language Models

February 2023 · 320 Reads · 9 Citations

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, and visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal tasks and from multimodal tasks to language. In addition, we introduce a dataset of the Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.
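
One concrete ingredient of training on interleaved corpora is flattening each text-and-image document into a single sequence while recording where image embeddings must be spliced in. The sketch below is an assumption about how such flattening might look; the special-token name is hypothetical:

```python
from typing import List, Tuple, Union

IMAGE_TOKEN = "<image>"  # illustrative placeholder; Kosmos-1's actual
                         # special-token names are not specified here

def flatten_interleaved(doc: List[Union[str, bytes]]) -> Tuple[str, List[int]]:
    """Flatten an arbitrarily interleaved text/image document into one
    training sequence, recording the positions of image placeholders.
    Images are represented as raw bytes for simplicity."""
    pieces, image_slots = [], []
    for item in doc:
        if isinstance(item, bytes):           # an image
            image_slots.append(len(pieces))   # index of the placeholder
            pieces.append(IMAGE_TOKEN)
        else:                                 # a text chunk
            pieces.append(item)
    return " ".join(pieces), image_slots

seq, slots = flatten_interleaved(["A photo of", b"...", "and its caption."])
print(seq, slots)
```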



Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta Optimizers

December 2022 · 158 Reads · 8 Citations

Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite its great empirical success, the working mechanism of ICL remains an open problem. To better understand how ICL works, this paper explains language models as meta-optimizers and understands ICL as a kind of implicit finetuning. Theoretically, we show that Transformer attention has a dual form of gradient-descent-based optimization. Building on this, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. Experimentally, we comprehensively compare the behavior of ICL and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. The results show that ICL behaves similarly to explicit finetuning at the prediction level, the representation level, and the attention-behavior level. Further, inspired by our understanding of meta-optimization, we design a momentum-based attention by analogy with the momentum-based gradient descent algorithm. Its consistently better performance over vanilla attention supports our understanding from another angle and, more importantly, shows the potential of utilizing our understanding for future model design.
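
The dual form is easiest to see under a linear (unnormalized) attention approximation like the one the paper uses: attending to demonstration key-value pairs adds exactly the output of a weight update built from their outer products. A minimal numpy sketch of that identity (dimensions and random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W0 = rng.normal(size=(d, d))   # "zero-shot" weights
K = rng.normal(size=(5, d))    # demonstration keys
V = rng.normal(size=(5, d))    # demonstration values
q = rng.normal(size=d)         # query from the test input

# Linear attention over demonstrations, added to the zero-shot path:
attn_out = W0 @ q + V.T @ (K @ q)

# Dual form: the same demonstrations act as a weight update
# dW = sum_i v_i k_i^T, i.e. "meta-gradients" applied to W0.
dW = V.T @ K
dual_out = (W0 + dW) @ q

assert np.allclose(attn_out, dual_out)  # the two views coincide
```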


Optimizing Prompts for Text-to-Image Generation

December 2022 · 245 Reads · 8 Citations

Well-designed prompts can guide text-to-image models to generate amazing images. However, performant prompts are often model-specific and misaligned with user input. Instead of laborious human engineering, we propose prompt adaptation, a general framework that automatically adapts original user input to model-preferred prompts. Specifically, we first perform supervised fine-tuning with a pretrained language model on a small collection of manually engineered prompts. Then we use reinforcement learning to explore better prompts. We define a reward function that encourages the policy to generate more aesthetically pleasing images while preserving the original user intentions. Experimental results on Stable Diffusion show that our method outperforms manual prompt engineering in terms of both automatic metrics and human preference ratings. Moreover, reinforcement learning further boosts performance, especially on out-of-domain prompts. The pretrained checkpoints are available at https://aka.ms/promptist. The demo can be found at https://aka.ms/promptist-demo.
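
The reward described above balances intent preservation against aesthetics. A hedged sketch of that shape, where `generate`, `clip_sim`, and `aesthetic` are hypothetical callables standing in for Stable Diffusion, a CLIP text-image similarity, and an aesthetic predictor (the paper's exact reward weighting is omitted):

```python
def prompt_reward(user_input, adapted_prompt, generate, clip_sim, aesthetic):
    """Sketch of the RL reward: keep the user's intent (relevance of the
    generated image to the ORIGINAL input) while preferring prettier
    images. All three callables are illustrative assumptions."""
    image = generate(adapted_prompt)
    relevance = clip_sim(user_input, image)  # intent preservation
    beauty = aesthetic(image)                # aesthetic quality
    return relevance + beauty                # weights omitted for brevity
```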


Results on text classification. (1×) is the maximum shot of conventional in-context learning.
Results on multi-choice tasks. (1×) is the maximum shot of conventional in-context learning.
Structured Prompting: Scaling In-Context Learning to 1,000 Examples

December 2022 · 92 Reads · 1 Citation

Large language models have exhibited intriguing in-context learning capability, achieving promising zero- and few-shot performance without updating the parameters. However, conventional in-context learning is usually restricted by length constraints, making it unable to absorb supervision from a large number of examples. To go beyond a few shots, we introduce structured prompting, which breaks the length limit and scales in-context learning to thousands of examples. Specifically, demonstration examples are separately encoded with well-designed position embeddings and then jointly attended to by the test example using a rescaled attention mechanism. This lets the number of exemplars scale with linear rather than quadratic complexity with respect to length. Experimental results on a diverse set of tasks show that our approach improves end-task performance and reduces evaluation variance over conventional in-context learning as the number of demonstration examples increases. Code has been released at https://aka.ms/structured-prompting.
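
The sketch below shows one way the test example can attend jointly to G independently encoded demonstration groups: offsetting each group's logits by -log(G) makes the G groups together count like one context window. This is an illustrative normalization in the spirit of the paper's rescaled attention, not its exact formula:

```python
import numpy as np

def rescaled_attention(q, group_ks, group_vs, self_k, self_v):
    """Attend to G independently encoded demonstration groups plus the
    test example's own tokens. q: (d,); each K/V: (n, d)."""
    G = len(group_ks)
    logits, values = [], []
    for K, V in zip(group_ks, group_vs):
        logits.append(K @ q - np.log(G))  # demonstrations, downweighted
        values.append(V)
    logits.append(self_k @ q)             # test tokens, unscaled
    values.append(self_v)
    z = np.concatenate(logits)
    w = np.exp(z - z.max())               # numerically stable softmax
    w /= w.sum()
    return np.concatenate(values, axis=0).T @ w
```

Because each group is encoded separately, the encoding cost grows linearly in the number of demonstrations rather than quadratically, matching the complexity claim in the abstract.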


Language Models are General-Purpose Interfaces

June 2022 · 105 Reads · 2 Citations

Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Though there is a big convergence in terms of architecture, most pretrained models are still typically developed for specific tasks or modalities. In this work, we propose to use language models as a general-purpose interface to various foundation models. A collection of pretrained encoders perceive diverse modalities (such as vision and language), and they dock with a language model that plays the role of a universal task layer. We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders. We subsume the advantages and capabilities of both causal and non-causal modeling, thereby combining the best of both worlds. Specifically, the proposed method not only inherits the capabilities of in-context learning and open-ended generation from causal language modeling, but is also conducive to finetuning because of the bidirectional encoders. More importantly, our approach seamlessly unlocks combinations of the above capabilities, e.g., enabling in-context learning or instruction following with finetuned encoders. Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on finetuning, zero-shot generalization, and few-shot learning.
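
A simplified way to picture semi-causal modeling is through its attention mask: positions inside encoder spans attend bidirectionally within their span, while everything else stays causal. The span layout and mask construction below are an illustrative assumption, not the paper's exact implementation:

```python
import numpy as np

def semi_causal_mask(seq_len, spans):
    """Build a semi-causal attention mask: positions inside each
    (start, end) encoder span (end exclusive) attend bidirectionally
    within the span; all other positions are standard causal.
    True = attention allowed."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal base
    for start, end in spans:
        mask[start:end, start:end] = True  # bidirectional inside span
    return mask

print(semi_causal_mask(6, spans=[(1, 4)]).astype(int))
```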


Prototypical Calibration for Few-shot Learning of Language Models

May 2022 · 30 Reads · 2 Citations

In-context learning in GPT-like models has been recognized as fragile across different hand-crafted templates and demonstration permutations. In this work, we propose prototypical calibration to adaptively learn a more robust decision boundary for zero- and few-shot classification, instead of greedy decoding. Concretely, our method first fits a Gaussian mixture distribution to estimate the prototypical clusters for all categories. Then we assign each cluster to the corresponding label by solving a weighted bipartite matching problem. Given an example, its prediction is calibrated by the likelihood of the prototypical clusters. Experimental results show that prototypical calibration yields a 15% absolute improvement on a diverse set of tasks. Extensive analysis across different scales also indicates that our method calibrates the decision boundary as expected, greatly improving the robustness of GPT to templates, permutations, and class imbalance.
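
A minimal sketch of that pipeline using scikit-learn's GaussianMixture and SciPy's Hungarian solver; the matching cost and the single GMM fit are simplifying assumptions relative to the paper (which, e.g., may use multiple restarts):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.optimize import linear_sum_assignment

def prototypical_calibration(probs_unlabeled, probs_test, num_labels):
    """Fit per-label clusters in the LM's output-probability space,
    match clusters to labels via weighted bipartite matching, then
    predict test examples by cluster likelihood.
    probs_*: arrays of shape (N, num_labels) of label probabilities."""
    gmm = GaussianMixture(n_components=num_labels, random_state=0)
    gmm.fit(probs_unlabeled)

    # Cost of assigning cluster c to label l: negated mean probability
    # mass the cluster places on label l (so matching maximizes mass).
    cost = -gmm.means_
    clusters, labels = linear_sum_assignment(cost)
    cluster_to_label = dict(zip(clusters, labels))

    resp = gmm.predict_proba(probs_test)   # cluster likelihoods
    clusters_pred = resp.argmax(axis=1)
    return np.array([cluster_to_label[c] for c in clusters_pred])
```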


Citations (12)


... Several studies have investigated the training dynamics of LLMs, specifically how they evolve during training [16,26,38]. [37] and [39] focused on the dynamics of memorization in language model pretraining. ...

Reference:

How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Investigating Learning Dynamics of BERT Fine-Tuning
  • Citing Conference Paper
  • January 2020

... Due to the black-box nature of LLMs, their interpretability has increasingly attracted attention. In general in-context learning (Brown et al., 2020), many researchers (Olsson et al., 2022; Dai et al., 2022; Todd et al., 2023; Wang et al., 2023b) have delved into the internals of the model to try to explain certain behaviors. Todd et al. (2023) identify task vectors that control the behaviors of LLMs through analysis of attention heads. ...

Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
  • Citing Conference Paper
  • January 2023

... When multimodal large language models handle visual perception tasks, they typically output coordinates in the form of text tokens (Wang et al., 2023b,a) or discrete coordinate bins (Peng et al., 2023; Wang et al., 2022). To accomplish 3D grounding tasks, a straightforward approach would be to output the object's 3D spatial position in text format, including coordinates such as x, y, z, length, width, height, and rotation. ...

Kosmos-2: Grounding Multimodal Large Language Models to the World
  • Citing Preprint
  • June 2023

... In light of the emergence of expansive language models, scholarly investigations have been fervently delving into the application of LLMs for addressing multimodal challenges [20], [32], thereby culminating in the conception of Multimodal Large Language Model (MLLM) [17], [33], [34], [19], [35], [36]. A variety of methodologies have entailed the infusion of visual data into LLMs and the meticulous refinement of these models through instructional directives. ...

Language Is Not All You Need: Aligning Perception with Language Models
  • Citing Preprint
  • February 2023

... Considering these limitations, we investigate hard prompt optimization techniques such as Chain-of-Thought prompting. Acknowledging in-context learning (ICL) as an indirect method of fine-tuning, we also explored in-context learning strategies (Dai et al., 2023). Among them, we were particularly inspired by MedPrompt, a promising composite prompting method applied to medical datasets, which achieved a 27% reduction in error rates on MedQA (Nori et al., 2023). ...

Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta Optimizers
  • Citing Preprint
  • December 2022

... Today, a user's text prompt is often augmented by generative image tools to make the output more aesthetically pleasing, invisibly appending phrases to the input the model itself receives.[23] Clinicians could help generative AI companies translate user-input prompts into safer backend prompts that have been tested and proven to produce less biased and stigmatising output results. ...

Optimizing Prompts for Text-to-Image Generation
  • Citing Preprint
  • December 2022

... Large language models (LLMs) encode vast amounts of knowledge during pre-training, enabling them to perform effectively across a wide range of natural language processing (NLP) tasks (Hao et al., 2021;Cao et al., 2021a;Jiang et al., 2023;Hernandez et al., 2023;Haviv et al., 2023;OpenAI, 2023). However, LLMs often incorporate outdated, incorrect, or biased information learned from training data, which can directly affect the reliability of their outputs (Hase et al., 2021;Pagnoni et al., 2021;Ji et al., 2023;Mousavi et al., 2024). ...

Self-Attention Attribution: Interpreting Information Interactions Inside Transformer
  • Citing Article
  • May 2021

Proceedings of the AAAI Conference on Artificial Intelligence

... Currently, transformer-based models give the best results for the NER task [25]. Recently, Hao et al. [16] showed that language models were general interfaces that can merge multiple modalities. Modern transformer libraries make it easy to add and use custom tokens. ...

Language Models are General-Purpose Interfaces
  • Citing Preprint
  • June 2022

... Nowadays, discovering and explaining the important neuron in one trained model has drawn more and more attention. The related works either explain the patterns obtained at different network neurons via visualization (Foote et al., 2023;Ghiasi et al., 2022;Karpathy et al., 2015;Vig, 2019;Zeiler & Fergus, 2014), or study the effects of individual neurons (You et al., 2025;Dai et al., 2022;Dalvi et al., 2019;Durrani et al., 2020;Huang et al., 2023). Nevertheless, most of these methods overlook the correlation between the neurons in different layers and have not considered the complex joint influence of various neurons. ...

Knowledge Neurons in Pretrained Transformers
  • Citing Conference Paper
  • January 2022

... Hao et al. [42] proposed the HP (Hardness Prediction) method to improve the sampling efficiency of the generator. Following this idea, in this paper, a DP module (Difficulty Prediction) is added to the generator to improve the performance of the model by improving the sampling efficiency of the generator. ...

Learning to Sample Replacements for ELECTRA Pre-Training
  • Citing Conference Paper
  • January 2021