Jiacheng Liu’s scientific contributions

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (16)


Few-Shot Viral Variant Detection via Bayesian Active Learning and Biophysics
  • Preprint
  • File available

March 2025 · 15 Reads

Marian Huot · [...] · Jiacheng Liu · [...]
The early detection of high-fitness viral variants is critical for pandemic response, yet limited experimental resources at the onset of variant emergence hinder effective identification. To address this, we introduce an active learning framework that integrates the protein language model ESM3, a Gaussian process with uncertainty estimation, and a biophysical model to predict the fitness of novel variants in a few-shot learning setting. By benchmarking on past SARS-CoV-2 data, we demonstrate that our method accelerates the identification of high-fitness variants by up to fivefold compared to random sampling while requiring experimental characterization of fewer than 1% of possible variants. We also demonstrate that our framework, benchmarked on deep mutational scans, effectively identifies sites that are frequently mutated during natural viral evolution, particularly those enabling antibody escape while preserving ACE2 binding, with a predictive advantage of up to two years over baseline strategies. Through systematic analysis of different acquisition strategies, we show that incorporating uncertainty in variant selection enables broader exploration of the sequence landscape, leading to the discovery of evolutionarily distant but potentially dangerous variants. Our results suggest that this framework could serve as an effective early warning system for identifying concerning SARS-CoV-2 variants and potentially emerging viruses with pandemic potential before they achieve widespread circulation.
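The loop described above (embed candidate variants, fit a probabilistic surrogate on the few characterized ones, and pick the next experiments with an uncertainty-aware acquisition) can be sketched roughly as follows. This is not the authors' pipeline: the featurization below is a trivial stand-in for ESM3 embeddings and the biophysical model, the fitness values are synthetic, and the UCB rule is just one example of the acquisition strategies the abstract alludes to.

```python
# Rough sketch of a Bayesian active-learning loop for variant selection.
# Assumptions: `featurize` stands in for ESM3/biophysics features; fitness values are synthetic.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def featurize(seq: str) -> np.ndarray:
    """Toy featurization: amino-acid counts (a real pipeline would use ESM3 embeddings)."""
    return np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)

candidates = ["NYRLFRKSNL", "NYRLFRKSKL", "NYQLFRKSNL", "NYRLFGKSNL", "KYRLFRKSNL"]
X = np.array([featurize(s) for s in candidates])

measured = {0: 0.21, 3: 0.55}                       # index -> synthetic measured fitness
train_idx = np.array(list(measured))
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X[train_idx], np.array(list(measured.values())))

mu, sigma = gp.predict(X, return_std=True)
ucb = mu + 1.0 * sigma                              # uncertainty-aware acquisition (UCB)
ucb[train_idx] = -np.inf                            # do not re-select measured variants
print("next variant to characterize:", candidates[int(np.argmax(ucb))])
```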


2 OLMo 2 Furious

December 2024 · 73 Reads · 1 Citation

Team OLMo · Pete Walsh · Luca Soldaini · [...] · Hannaneh Hajishirzi

We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes dense autoregressive models with improved architecture and training recipe, pretraining data mixtures, and instruction tuning recipes. Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e., specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from Tülu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance versus compute, often matching or outperforming open-weight-only models like Llama 3.1 and Qwen 2.5 while using fewer FLOPs and with fully transparent training data, code, and recipes. Our fully open OLMo 2-Instruct models are competitive with or surpass open-weight-only models of comparable size, including Qwen 2.5, Llama 3.1, and Gemma 2. We release all OLMo 2 artifacts openly -- models at 7B and 13B scales, both pretrained and post-trained, including their full training data, training code and recipes, training logs, and thousands of intermediate checkpoints. The final instruction model is available on the Ai2 Playground as a free research demo.


Establishing Task Scaling Laws via Compute-Efficient Model Ladders

December 2024 · 4 Reads

We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we leverage a two-step prediction approach: first use model and data size to predict a task-specific loss, and then use this task loss to predict task performance. We train a set of small-scale "ladder" models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder models only costs 1% of the compute used for the target models. On four multiple-choice tasks written in ranked classification format, we can predict the accuracy of both target models within 2 points of absolute error. We have higher prediction error on four other tasks (average absolute error 6.9) and find that these are often tasks with higher variance in task metrics. We also find that using less compute to train fewer ladder models tends to deteriorate predictions. Finally, we empirically show that our design choices and the two-step approach lead to superior performance in establishing scaling laws.
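As a rough illustration of the two-step approach described above, the sketch below fits a power-law form for task loss as a function of parameters N and tokens D on a handful of synthetic "ladder" points, then fits a sigmoid from task loss to accuracy and chains the two to predict a larger target model. The functional forms, parameter names, and numbers are assumptions for illustration, not the paper's exact parameterization or data.

```python
# Illustrative two-step fit: (N, D) -> task loss -> task accuracy.
# Functional forms and synthetic "ladder" points are assumptions, not the paper's.
import numpy as np
from scipy.optimize import curve_fit

def task_loss(ND, A, alpha, B, beta, E):
    N, D = ND
    return A / N**alpha + B / D**beta + E          # step 1: loss from model/data size

def task_accuracy(L, a, b, c):
    return a / (1.0 + np.exp(b * (L - c)))         # step 2: accuracy from task loss

# Synthetic measurements from small "ladder" models (illustrative only).
N = np.array([190e6, 370e6, 600e6, 760e6, 1.0e9, 1.3e9])
D = np.array([4e9, 8e9, 12e9, 16e9, 20e9, 26e9])
L_obs = np.array([1.90, 1.72, 1.63, 1.57, 1.52, 1.46])
acc_obs = np.array([0.34, 0.41, 0.45, 0.49, 0.52, 0.56])

p1, _ = curve_fit(task_loss, (N, D), L_obs, p0=[1e3, 0.3, 1e3, 0.3, 1.0], maxfev=50000)
p2, _ = curve_fit(task_accuracy, L_obs, acc_obs, p0=[1.0, 5.0, 1.6], maxfev=50000)

# Chained prediction for a hypothetical 7B-parameter / 4T-token target model.
L_target = task_loss((7e9, 4e12), *p1)
print("predicted loss:", L_target, "predicted accuracy:", task_accuracy(L_target, *p2))
```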


AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text

October 2024 · 11 Reads

Creativity has long been considered one of the most difficult aspects of human intelligence for AI to mimic. However, the rise of Large Language Models (LLMs), like ChatGPT, has raised questions about whether AI can match or even surpass human creativity. We present CREATIVITY INDEX as a first step toward quantifying the linguistic creativity of a text by reconstructing it from existing text snippets on the web. CREATIVITY INDEX is motivated by the hypothesis that the seemingly remarkable creativity of LLMs may be attributable in large part to the creativity of human-written texts on the web. To compute CREATIVITY INDEX efficiently, we introduce DJ SEARCH, a novel dynamic programming algorithm that can search for verbatim and near-verbatim matches of text snippets from a given document against the web. Experiments reveal that the CREATIVITY INDEX of professional human authors is on average 66.2% higher than that of LLMs, and that alignment reduces the CREATIVITY INDEX of LLMs by an average of 30.1%. In addition, we find that distinguished authors like Hemingway exhibit a measurably higher CREATIVITY INDEX than other human writers. Finally, we demonstrate that CREATIVITY INDEX can be used as a surprisingly effective criterion for zero-shot machine text detection, surpassing the strongest existing zero-shot system, DetectGPT, by a significant margin of 30.2%, and even outperforming the strongest supervised system, GhostBuster, in five out of six domains.
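The following toy sketch shows the flavor of the idea: score a document by the fraction of its tokens that cannot be covered by verbatim n-gram matches against a reference corpus. The actual DJ SEARCH algorithm operates against web-scale indexes and also handles near-verbatim matches, so this is only a simplified approximation of the CREATIVITY INDEX computation.

```python
# Toy approximation of a creativity-index-style score: the fraction of a document's
# tokens not covered by any verbatim n-gram (n >= 5 here) found in a reference corpus.
# The real DJ SEARCH searches web-scale indexes and also allows near-verbatim matches.

def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def creativity_score(document: str, corpus_docs, n: int = 5) -> float:
    doc = document.split()
    corpus = set()
    for text in corpus_docs:
        corpus |= ngram_set(text.split(), n)

    covered = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if tuple(doc[i:i + n]) in corpus:           # verbatim match of length n
            for j in range(i, i + n):
                covered[j] = True
    return 1.0 - sum(covered) / max(len(doc), 1)

corpus = ["the quick brown fox jumps over the lazy dog"]
print(creativity_score("the quick brown fox jumps over a sleeping cat", corpus))
```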


Figure 2: The core aspects of learning from preference feedback. For DPO (solid line), preference data is directly used to train a policy model. For PPO (dashed line), preference data is used to train a reward model, which is then used to score model-generated responses during PPO training.
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

June 2024 · 50 Reads

Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning from preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training. Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains. High-quality preference data leads to improvements of up to 8% in instruction following and truthfulness. Despite significant gains of up to 5% in mathematical evaluation when scaling up reward models, we surprisingly observe marginal improvements in other categories. We publicly release the code used for training (https://github.com/hamishivi/EasyLM) and evaluating (https://github.com/allenai/open-instruct) our models, along with the models and datasets themselves (https://huggingface.co/collections/allenai/tulu-v25-suite-66676520fd578080e126f618).
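For reference, the DPO objective that the figure and abstract contrast with PPO can be written compactly as a logistic loss on reward differences between the policy and a frozen reference model. The sketch below is a generic PyTorch rendering of that standard objective (Rafailov et al., 2023), not the authors' training code; the log-probability tensors are placeholders for summed per-token log-likelihoods of the chosen and rejected responses.

```python
# Generic sketch of the DPO objective: -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]).
# Inputs are placeholder tensors of summed response log-probabilities, one per preference pair.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_reward = policy_chosen_logps - ref_chosen_logps        # implicit reward of chosen
    rejected_reward = policy_rejected_logps - ref_rejected_logps  # implicit reward of rejected
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
print(dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)).item())
```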



Figure 5: Scaling trends of commonsense statement verifiers.
Figure 8: Comparing verification and QA, the two different formats for problem-solving tasks. Average accuracy on the development sets of the seen multiple-choice benchmarks is reported. We use text-davinci-002 as GPT-3.5 here, and gpt-3.5-turbo-0301 as ChatGPT. VERA in QA format refers to a T5 model finetuned on the same seen multiple-choice data as VERA.
Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements

May 2023 · 71 Reads · 2 Citations

Despite the much discussed capabilities of today's language models, they are still prone to silly and unexpected commonsense failures. We consider a retrospective verification approach that reflects on the correctness of LM outputs, and introduce Vera, a general-purpose model that estimates the plausibility of declarative statements based on commonsense knowledge. Trained on ~7M commonsense statements created from 19 QA datasets and two large-scale knowledge bases, and with a combination of three training objectives, Vera is a versatile model that effectively separates correct from incorrect statements across diverse commonsense domains. When applied to solving commonsense problems in the verification format, Vera substantially outperforms existing models that can be repurposed for commonsense verification, and it further exhibits generalization capabilities to unseen tasks and provides well-calibrated outputs. We find that Vera excels at filtering LM-generated commonsense knowledge and is useful in detecting erroneous commonsense statements generated by models like ChatGPT in real-world settings.
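The verification format described above can be sketched as follows: convert each candidate answer into a declarative statement, score each with a plausibility estimator, and keep the highest-scoring one. The scorer below is a stub standing in for Vera (a trained T5-based model), and the statements are illustrative.

```python
# Sketch of solving a multiple-choice question in the verification format:
# score each declarative statement with a plausibility function and pick the best.
# The scorer here is a stub; Vera itself is a trained T5-based plausibility model.
from typing import Callable, List

def pick_most_plausible(statements: List[str], plausibility: Callable[[str], float]) -> int:
    scores = [plausibility(s) for s in statements]
    return max(range(len(statements)), key=scores.__getitem__)

# Hypothetical statements built from one question and its candidate answers.
statements = [
    "If you pour water on a campfire, the fire grows larger.",
    "If you pour water on a campfire, the fire goes out.",
]

toy_scores = {statements[0]: 0.08, statements[1]: 0.94}   # stand-in for Vera's scores
print(statements[pick_most_plausible(statements, toy_scores.get)])
```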


Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements

January 2023 · 19 Reads · 12 Citations



Figure 2: A proof sketch in Isabelle. The problem: "Show that for any real number a, 10a ≤ 28a² + 1".
Figure 3: Number of problems solved on miniF2F against the number of autoformalization attempts per problem. Left: The figure displays the experiments carried out with the DSP method and three ablations on it. The curves represent the DSP method (blue), formal proof sketches without the in-line comments (orange), without informal proofs altogether (green), and without the automated provers (red). Right: The figure compares the experimental results with informal proof drafts written by humans (blue), the 540B Minerva model (orange), the 62B Minerva model (green), the 8B Minerva model (red), and the Codex model (purple).
Figure 4: IMO proof guided by a Minerva informal proof An informal proof of the International Math Olympiad problem imo 1959 p1 generated by Minerva that leads to a successful formal proof. The steps enclosed by the ATP delimiters are generated by an automated prover and all other steps are by the autoformalizer.
Figure 6: IMO proof guided by a Minerva informal proof An informal proof of the International Math Olympiad problem imo 1959 p1 generated by Minerva that led to a successful formal proof. The steps enclosed by ATP delimiters are generated by an automated theorem prover and the rest are by the DSP autoformalizer.
Figure 8: Algebra example with Minerva informal proof. An informal proof generated by Minerva that led to a successful formal proof. The autoformalizer generated a proof sketch containing all lines of the formal proof except for those delimited by the ATP tags. The sketch is structured according to the informal proof, containing five intermediate conjectures based on the informal proof. The autoformalizer generated in-line comments in the proof sketch (shown in red), which correctly identified an alignment between the formal and informal proofs.
Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

October 2022 · 395 Reads · 8 Citations

The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and accessible only to a few experts. While previous studies on automating formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from 20.9% to 39.3% on a collection of mathematical competition problems.
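Figure 2 above quotes the statement "for any real number a, 10a ≤ 28a² + 1". As a small analogue of the sketch-then-close workflow DSP describes (the paper itself targets Isabelle with hammer-style automation), here is a hedged Lean 4 / Mathlib version in which the informal observation about the discriminant is turned into a hint for a nonlinear-arithmetic tactic:

```lean
-- Lean 4 / Mathlib analogue of the Figure 2 statement; the DSP paper itself works in
-- Isabelle, so this only illustrates the "close the sub-goal with automation" step.
-- Informal idea: 28a² − 10a + 1 has negative discriminant, hence is nonnegative; the
-- square hint (28a − 5)² ≥ 0 lets `nlinarith` finish.
import Mathlib.Tactic

example (a : ℝ) : 10 * a ≤ 28 * a ^ 2 + 1 := by
  nlinarith [sq_nonneg (28 * a - 5)]
```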


Citations (9)


... We use an LLM-aided model-in-the-loop annotation approach to identify the uncivil comments that are also toxic [11]. Recent research shows that such a model-in-the-loop annotation methodology works well for this type of data, including hate and violent speech detection tasks [12], [13], [14], [15], [16], [17], [18]. For each uncivil comment (as annotated by Ehsani et al.), we asked GPT-4o whether each comment is toxic or not. ...

Reference:

Understanding and Predicting Derailment in Toxic Conversations on GitHub
Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification
  • Citing Conference Paper
  • January 2024

... Beyond directly judging plausibility, some methods (Jung et al., 2022; Tafjord et al., 2022) evaluate the plausibility of hypotheses by scoring the validity of entailment paths generated by the LLMs, i.e., the reasoning chains justifying 'reasonable' or 'unreasonable' conclusions, and selecting the final prediction based on the highest-scoring path. VERA (Liu et al., 2023) adopts a discriminative approach, training a classification head to make predictions based on model representations, which fine-tunes LLMs on ~7 million commonsense statements. In contrast, our approach also leverages internal knowledge from a discriminative perspective but does not require additional training. ...

Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements

... Our investigations focus on composition, where a model needs to "chain" different pieces of facts, as stated in Section 3. Although explicit verbalizations of reasoning steps (e.g., chain-of-thought rationales) can enhance task performance (Lake & Baroni, 2018a; Wei et al., 2022; Wang et al., 2022; Zelikman et al., 2022; Liu et al., 2023), they are not available during large-scale (pre-)training, which is the stage in which the model's core capabilities are developed (Li et al., 2020; Zhou et al., 2023a). Prior work has extensively studied whether transformer-based language models can perform implicit composition, with consistently negative results reported (Press et al., 2023; Yang et al., 2024). ...

Crystal: Introspective Reasoners Reinforced with Self-Feedback
  • Citing Conference Paper
  • January 2023

... Pre-trained language models (PLMs) have served the NLP community as fundamental infrastructure by demonstrating remarkable abilities with the "pre-train, prompt, and predict" paradigm (Liu et al., 2023b; Zhao et al., 2023). PLMs alone, however, lack the capacity to handle knowledge-intensive tasks with advanced functionalities like commonsense reasoning (Lin et al., 2019; Qiao et al., 2022; Liu et al., 2023a) and open-domain question answering (Yang et al., 2015). This necessitates a boosting trend for research focusing on ...

Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements

... Machine learning offers one potential avenue for addressing this challenge. Indeed, recent years have seen several significant efforts dedicated to using machine learning to automatically search for proofs in proof assistants (e.g., [21], [22], [5], [6], [12], [11], [8], [9], [20]). While these efforts have produced promising results, many proofs are still beyond the reach of machine learning-based automation. ...

Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

... Despite LLMs' exceptional logical reasoning capabilities, they lack the nuanced, domain-specific details [66,67] and can hallucinate, resulting in factual inaccuracy, which warrants caution for users [68,69]. While this can be solved by continual pre-training of LLMs on extensive corpora, such an approach requires vast computational resources and time investment [70]. However, recent advances in computer science show that LLMs and KGs can be used together to minimize hallucinations and use the powerful reasoning capabilities of LLMs for link predictions. ...

Rainier: Reinforced Knowledge Introspector for Commonsense Question Answering

... 5. Generated Knowledge Prompting involves prompting the model to generate relevant background knowledge before moving to the main task. The model can produce more robust and contextually appropriate responses by first generating relevant information [26]. ...

Generated Knowledge Prompting for Commonsense Reasoning
  • Citing Conference Paper
  • January 2022
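The snippet above summarizes Generated Knowledge Prompting: ask the model for relevant background knowledge first, then answer conditioned on it. Below is a hedged sketch of that two-stage flow, with a placeholder `generate` callable standing in for any language-model API and purely illustrative prompts.

```python
# Sketch of the two-stage Generated Knowledge Prompting flow: (1) prompt the model for
# background knowledge, (2) answer the question conditioned on that knowledge.
# `generate` is a placeholder for any text-generation call; prompts are illustrative.
from typing import Callable

def generated_knowledge_prompting(question: str, generate: Callable[[str], str]) -> str:
    knowledge = generate(
        "Generate a short piece of background knowledge relevant to the question.\n"
        f"Question: {question}\nKnowledge:"
    )
    return generate(
        f"Knowledge: {knowledge}\nQuestion: {question}\nAnswer:"
    )

# Toy usage with a canned generator standing in for an LM call.
def canned(prompt: str) -> str:
    if prompt.rstrip().endswith("Knowledge:"):
        return "Fire needs oxygen to keep burning."
    return "The flame goes out once the oxygen is used up."

print(generated_knowledge_prompting("What happens if you cover a flame with a glass?", canned))
```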

... In the same way that LLMs can generate texts that are very much human-like (e.g., in the context of a conversation for ChatGPT), it is perfectly plausible that they could also generate mathematical proofs that look as if they were produced by human mathematical agents. Indeed, some initial works have recently been done in this direction and some interesting results have already been obtained (see, e.g., Lample & Charton, 2020; Welleck et al., 2021, 2022; Wei et al., 2022). For instance, Welleck et al. (2022) have developed a system called NaturalProver which they describe as "a language model that generates proofs by conditioning on background references (e.g., theorems and definitions that are either retrieved or human-provided), and optionally enforces their presence with a constrained decoding algorithm that leverages the multi-step structure of proofs" (Welleck et al., 2022, p. 2). ...

NaturalProver: Grounded Mathematical Proof Generation with Language Models