Wenpeng Yin’s research while affiliated with William Penn University and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (94)


Adaptive and Explainable Margin Trading via Large Language Models on Portfolio Management
  • Conference Paper

November 2024 · 3 Reads · [...] · Wenpeng Yin

AAAR-1.0: Assessing AI's Potential to Assist Research

October 2024 · 4 Reads · Renze Lou · Hanzi Xu · Sijia Wang · [...] · Wenpeng Yin

Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in four fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; (iii) PaperWeakness, identifying weaknesses in paper submissions; and (iv) ReviewCritique, identifying whether each segment in human reviews is deficient or not. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as limitations in conducting sophisticated research tasks. We will keep iterating AAAR-1.0 to new versions.
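
The EquationInference task above can be pictured as a context-plus-candidates query. A minimal sketch of such a query is below; the prompt wording, the option format, and the query_llm stub are illustrative assumptions, not code or templates released with AAAR-1.0.

```python
# Illustrative EquationInference-style query builder. The template and the
# query_llm placeholder are assumptions for this sketch, not AAAR-1.0 code.

def build_equation_inference_prompt(context_before: str, context_after: str,
                                    candidate_equations: list[str]) -> str:
    options = "\n".join(f"({chr(65 + i)}) {eq}"
                        for i, eq in enumerate(candidate_equations))
    return (
        "You are reviewing a paper submission. Based on the surrounding "
        "context, pick the candidate equation that is correct.\n\n"
        f"Context before the equation:\n{context_before}\n\n"
        f"Context after the equation:\n{context_after}\n\n"
        f"Candidate equations:\n{options}\n\n"
        "Answer with a single option letter."
    )

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an open-source or proprietary LLM."""
    raise NotImplementedError
```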


Figure 2: FaultyMath curation process. Three stages: i) GPT-4 converts valid math problems into faulty ones; ii) GPT-4 self-verifies; iii) human verification.
Figure 3: Distributions of the dataset by difficulty level and math category (left), and by the origin of falsehood (right).
Figure 4: Accuracy heatmaps for different difficulty levels and categories. Top-left: average accuracy of the top 3 LLMs. Top-right: Gemini-1.5-Pro (Rank 1). Bottom-left: GPT-4 (Rank 2). Bottom-right: Qwen1.5-72B (Rank 3).
Figure 5: Accuracy of the top three LLMs across different origins of falsehood.
From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems
  • Preprint
  • File available

October 2024 · 10 Reads

Consider the math problem: "Lily received 3 cookies from her best friend yesterday and ate 5 for breakfast. Today, her friend gave her 3 more cookies. How many cookies does Lily have now?" Many large language models (LLMs) in previous research approach this problem by calculating the answer "1" using the equation "3 - 5 + 3." However, from a human perspective, we recognize the inherent flaw in this problem: Lily cannot eat 5 cookies if she initially only had 3. This discrepancy prompts a key question: Are current LLMs merely Blind Solvers that apply mathematical operations without deeper reasoning, or can they function as Logical Thinkers capable of identifying logical inconsistencies? To explore this question, we propose a benchmark dataset, FaultyMath, which includes faulty math problems of rich diversity: i) multiple mathematical categories, e.g., algebra, geometry, number theory, etc.; ii) varying levels of difficulty; and iii) different origins of faultiness -- ranging from violations of common sense and ambiguous statements to mathematical contradictions and more. We evaluate a broad spectrum of LLMs, including open-source, closed-source, and math-specialized models, using FaultyMath across three dimensions: (i) How accurately can the models detect faulty math problems without being explicitly prompted to do so? (ii) When provided with hints -- either correct or misleading -- about the validity of the problems, to what extent do LLMs adapt to become reliable Logical Thinkers? (iii) How trustworthy are the explanations generated by LLMs when they recognize a math problem as flawed? Through extensive experimentation and detailed analysis, our results demonstrate that existing LLMs largely function as Blind Solvers and fall short of the reasoning capabilities required to perform as Logical Thinkers.
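
As a concrete (hypothetical) illustration of the first evaluation dimension above, one can ask a model to solve a faulty problem without hinting that anything is wrong and then check whether it objects; the prompt wording, the keyword heuristic, and the query_llm stub below are assumptions, not the FaultyMath evaluation code.

```python
# Sketch: probe whether a model flags a faulty problem when only asked to
# solve it. Wording and the objection heuristic are illustrative assumptions.

FAULTY_PROBLEM = (
    "Lily received 3 cookies from her best friend yesterday and ate 5 for "
    "breakfast. Today, her friend gave her 3 more cookies. "
    "How many cookies does Lily have now?"
)

def solve_prompt(problem: str) -> str:
    # No hint about validity is given, mirroring dimension (i) above.
    return f"Solve the following math problem. Show your reasoning.\n\n{problem}"

def looks_like_blind_solving(answer: str) -> bool:
    # A Blind Solver typically just computes 3 - 5 + 3 = 1; a Logical Thinker
    # should object that Lily cannot eat 5 cookies when she only has 3.
    objection_cues = ("impossible", "cannot", "inconsistent", "flaw", "invalid")
    return not any(cue in answer.lower() for cue in objection_cues)

def query_llm(prompt: str) -> str:
    """Placeholder for an actual model call."""
    raise NotImplementedError
```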


Figure 1: Distribution of problems across different math categories and competitions in the CreativeMath dataset.
Figure 4: The prompt template for generating a novel solution.
Figure 5: The prompt templates for evaluating the correctness (top) and novelty (bottom) of the generated solution. The criteria for evaluating novelty are rephrased from the same criteria applied during the novel-solution generation process to ensure alignment.
Table: Average Correctness (C) and Novelty-to-Correctness Ratio (N/C) for all LLMs when solving math problems of varying difficulty levels, with k = 1 across all competitions.
Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems

October 2024 · 9 Reads

The mathematical capabilities of AI systems are complex and multifaceted. Most existing research has predominantly focused on the correctness of AI-generated solutions to mathematical problems. In this work, we argue that beyond producing correct answers, AI systems should also be capable of, or assist humans in, developing novel solutions to mathematical challenges. This study explores the creative potential of Large Language Models (LLMs) in mathematical reasoning, an aspect that has received limited attention in prior research. We introduce a novel framework and benchmark, CreativeMath, which encompasses problems ranging from middle school curricula to Olympic-level competitions, designed to assess LLMs' ability to propose innovative solutions after some known solutions have been provided. Our experiments demonstrate that, while LLMs perform well on standard mathematical tasks, their capacity for creative problem-solving varies considerably. Notably, the Gemini-1.5-Pro model outperformed other LLMs in generating novel solutions. This research opens a new frontier in evaluating AI creativity, shedding light on both the strengths and limitations of LLMs in fostering mathematical innovation, and setting the stage for future developments in AI-assisted mathematical discovery.
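
For illustration, the "novel solution after known solutions" setting can be posed roughly as in the sketch below; the template is an assumption and may differ from the prompts shown in the figures above.

```python
# Illustrative prompt for requesting a novel solution once reference
# solutions are known; not the CreativeMath release's exact template.

def novel_solution_prompt(problem: str, known_solutions: list[str]) -> str:
    shown = "\n\n".join(f"Known solution {i + 1}:\n{s}"
                        for i, s in enumerate(known_solutions))
    return (
        f"Problem:\n{problem}\n\n{shown}\n\n"
        "Propose a correct solution whose key idea differs substantively from "
        "every known solution above (e.g., a different theorem, construction, "
        "or representation), and briefly explain why it is novel."
    )
```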


Does your LLM truly unlearn? An embarrassingly simple approach to recover unlearned knowledge

October 2024 · 8 Reads

Large language models (LLMs) have shown remarkable proficiency in generating text, benefiting from extensive training on vast textual corpora. However, LLMs may also acquire unwanted behaviors from the diverse and sensitive nature of their training data, which can include copyrighted and private content. Machine unlearning has been introduced as a viable solution to remove the influence of such problematic content without the need for costly and time-consuming retraining. This process aims to erase specific knowledge from LLMs while preserving as much model utility as possible. Despite the effectiveness of current unlearning methods, little attention has been given to whether existing unlearning methods for LLMs truly achieve forgetting or merely hide the knowledge, which current unlearning benchmarks fail to detect. This paper reveals that applying quantization to models that have undergone unlearning can restore the "forgotten" information. To thoroughly evaluate this phenomenon, we conduct comprehensive experiments using various quantization techniques across multiple precision levels. We find that for unlearning methods with utility constraints, the unlearned model retains an average of 21% of the intended forgotten knowledge in full precision, which significantly increases to 83% after 4-bit quantization. Based on our empirical findings, we provide a theoretical explanation for the observed phenomenon and propose a quantization-robust unlearning strategy to mitigate this intricate issue...
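
The quantize-and-re-evaluate check described above can be sketched as follows, assuming the transformers and bitsandbytes libraries; the checkpoint name and the evaluate_forget_set helper are placeholders, not artifacts released with the paper.

```python
# Sketch: compare how much supposedly forgotten knowledge an unlearned model
# reproduces in full precision versus after 4-bit quantization.
# Requires: torch, transformers, bitsandbytes. The checkpoint name and the
# evaluation helper below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

CHECKPOINT = "your-org/unlearned-model"  # placeholder name

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

full_precision = AutoModelForCausalLM.from_pretrained(
    CHECKPOINT, torch_dtype=torch.float16, device_map="auto"
)

quantized = AutoModelForCausalLM.from_pretrained(
    CHECKPOINT,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_quant_type="nf4"),
    device_map="auto",
)

def evaluate_forget_set(model, tokenizer) -> float:
    """Placeholder: fraction of forget-set knowledge the model still
    reproduces (e.g., accuracy on questions about the unlearned content)."""
    raise NotImplementedError

# Per the abstract's numbers, the quantized score can be far higher than the
# full-precision one for utility-constrained unlearning methods.
```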


Exploring Language Model Generalization in Low-Resource Extractive QA

September 2024 · 9 Reads

In this paper, we investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift, i.e., can LLMs generalize well to closed domains that require specific knowledge such as medicine and law in a zero-shot fashion without additional in-domain training? To this end, we devise a series of experiments to empirically explain the performance gap. Our findings suggest that: a) LLMs struggle with the dataset demands of closed domains, such as retrieving long answer spans; b) certain LLMs, despite showing strong overall performance, display weaknesses in meeting basic requirements such as discriminating between domain-specific senses of words, which we link to pre-processing decisions; c) scaling model parameters is not always effective for cross-domain generalization; and d) closed-domain datasets are quantitatively very different from open-domain EQA datasets, and current LLMs struggle to deal with them. Our findings point out important directions for improving existing LLMs.


Direct-Inverse Prompting: Analyzing LLMs' Discriminative Capacity in Self-Improving Generation

June 2024 · 1 Read

Mainstream LLM research has primarily focused on enhancing their generative capabilities. However, even the most advanced LLMs experience uncertainty in their outputs, often producing varied results on different runs or when faced with minor changes in input, despite no substantial change in content. Given multiple responses from the same LLM to the same input, we advocate leveraging the LLMs' discriminative capability to reduce this generative uncertainty, aiding in identifying the correct answers. Specifically, we propose and analyze three discriminative prompts: direct, inverse, and hybrid, to explore the potential of both closed-source and open-source LLMs in self-improving their generative performance on two benchmark datasets. Our insights reveal which discriminative prompt is most promising and when to use it. To our knowledge, this is the first work to systematically analyze LLMs' discriminative capacity to address generative uncertainty.
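
A minimal sketch of the direct and inverse discriminative prompts, applied to several responses sampled from the same model, is given below; the wording is an assumption rather than the paper's exact templates.

```python
# Sketch of direct vs. inverse discriminative prompts over candidate answers;
# the templates are illustrative assumptions.

def direct_prompt(question: str, candidates: list[str]) -> str:
    listed = "\n".join(f"({i + 1}) {c}" for i, c in enumerate(candidates))
    return (f"Question: {question}\nCandidate answers:\n{listed}\n"
            "Which candidate is correct? Reply with its number.")

def inverse_prompt(question: str, candidates: list[str]) -> str:
    listed = "\n".join(f"({i + 1}) {c}" for i, c in enumerate(candidates))
    return (f"Question: {question}\nCandidate answers:\n{listed}\n"
            "Which candidates are incorrect? Reply with their numbers.")

# A hybrid strategy could keep a candidate only if the direct prompt selects
# it and the inverse prompt does not reject it.
```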


Figure 1: Specificity of reviews: LLM vs. Human.
LLMs assist NLP Researchers: Critique Paper (Meta-)Reviewing

June 2024 · 78 Reads · 1 Citation

This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as they have to spend more time reading, writing, and reviewing papers. This raises the question: how can LLMs potentially assist researchers in alleviating their heavy workload? This study focuses on the topic of LLMs assisting NLP researchers, particularly examining the effectiveness of LLMs in assisting paper (meta-)reviewing and the recognizability of LLM-generated reviews. To address this, we constructed the ReviewCritique dataset, which includes two types of information: (i) NLP papers (initial submissions rather than camera-ready versions) with both human-written and LLM-generated reviews, and (ii) each review comes with "deficiency" labels and corresponding explanations for individual segments, annotated by experts. Using ReviewCritique, this study explores two threads of research questions: (i) "LLMs as Reviewers", how do reviews generated by LLMs compare with those written by humans in terms of quality and distinguishability? (ii) "LLMs as Metareviewers", how effectively can LLMs identify potential issues, such as Deficient or unprofessional review segments, within individual paper reviews? To our knowledge, this is the first work to provide such a comprehensive analysis.
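
For concreteness, a ReviewCritique-style record might look like the sketch below; the field names are assumed for illustration and are not the dataset's released schema.

```python
# Hypothetical record layout for a segment-annotated review (assumed schema).
example_record = {
    "paper_id": "submission-0001",      # initial submission, not camera-ready
    "review_source": "human",           # or "llm"
    "segments": [
        {
            "text": "The novelty over prior prompt-tuning work is unclear.",
            "deficient": False,
            "explanation": None,
        },
        {
            "text": "Reject: the topic is simply not interesting.",
            "deficient": True,
            "explanation": "Unprofessional; gives no actionable justification.",
        },
    ],
}
```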


LLMs' Classification Performance is Overclaimed

June 2024 · 22 Reads

In many classification tasks designed for AI or humans to solve, gold labels are typically included within the label space by default, often posed as "which of the following is correct?" This standard setup has traditionally highlighted the strong performance of advanced AI, particularly top-performing Large Language Models (LLMs), in routine classification tasks. However, when the gold label is intentionally excluded from the label space, it becomes evident that LLMs still attempt to select from the available label candidates, even when none are correct. This raises a pivotal question: Do LLMs truly demonstrate their intelligence in understanding the essence of classification tasks? In this study, we evaluate both closed-source and open-source LLMs across representative classification tasks, arguing that the perceived performance of LLMs is overstated due to their inability to exhibit the expected comprehension of the task. This paper makes a threefold contribution: i) To our knowledge, this is the first work to identify the limitations of LLMs in classification tasks when gold labels are absent. We define this task as Classify-w/o-Gold and propose it as a new testbed for LLMs. ii) We introduce a benchmark, Know-No, comprising two existing classification tasks and one new task, to evaluate Classify-w/o-Gold. iii) This work defines and advocates for a new evaluation metric, OmniAccuracy, which assesses LLMs' performance in classification tasks both when gold labels are present and absent.
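
The Classify-w/o-Gold setup can be pictured as below: the model is shown candidate labels that deliberately exclude the gold label, and the evaluation checks whether it declines to choose. The prompt wording, the refusal heuristic, and the simple average used for the OmniAccuracy-style score are illustrative assumptions; the paper defines the exact protocol and metric.

```python
# Sketch of a Classify-w/o-Gold probe plus an assumed OmniAccuracy-style
# aggregation; all details here are illustrative, not the benchmark's code.

def classify_prompt(text: str, candidates: list[str]) -> str:
    options = "\n".join(f"- {c}" for c in candidates)  # gold label excluded
    return (f"Text: {text}\n\nChoose the label that applies:\n{options}\n"
            "Answer with one label.")

def model_declined(answer: str) -> bool:
    cues = ("none", "neither", "no correct", "not listed", "does not apply")
    return any(cue in answer.lower() for cue in cues)

def omni_accuracy(acc_with_gold: float, decline_rate_without_gold: float) -> float:
    # Assumed aggregation of the two regimes into one score.
    return 0.5 * (acc_with_gold + decline_rate_without_gold)
```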


Large Language Model Instruction Following: A Survey of Progresses and Challenges

June 2024 · 4 Reads · 7 Citations · Computational Linguistics

Task semantics can be expressed by a set of input-output examples or a piece of textual instruction. Conventional machine learning approaches for natural language processing (NLP) mainly rely on the availability of large-scale sets of task-specific examples. Two issues arise: first, collecting task-specific labeled examples does not apply to scenarios where tasks may be too complicated or costly to annotate, or where the system is required to handle a new task immediately; second, this is not user-friendly, since end-users are probably more willing to provide a task description rather than a set of examples before using the system. Therefore, the community is paying increasing interest in a new supervision-seeking paradigm for NLP: learning to follow task instructions, i.e., instruction following. Despite its impressive progress, there are some unsolved research questions that the community struggles with. This survey paper tries to summarize and provide insights into the current research on instruction following, particularly by answering the following questions: (i) What is task instruction, and what instruction types exist? (ii) How to model instructions? (iii) What are popular instruction following datasets and evaluation metrics? (iv) What factors influence and explain the instructions’ performance? (v) What challenges remain in instruction following? To our knowledge, this is the first comprehensive survey about instruction following.


Citations (56)


... Large language models (LLMs) have demonstrated remarkable proficiency in zero-shot decision making (Gonen et al., 2023;Schick and Schütze, 2021;Brown et al., 2020) and instruction following (Jiang et al., 2023;Köpf et al., 2023;Touvron et al., 2023b;Taori et al., 2023;Chiang et al., 2023;Ouyang et al., 2022). However, there can be significant variance in the performance of seemingly similar prompts (Zhao et al., 2021;Lu et al., 2022c;Webson and Pavlick, 2022;Gonen et al., 2023;Yan et al., 2024). Despite efforts of studies on prompting LMs (Shin et al., 2020;Li and Liang, 2021;Gao et al., 2021;Ding et al., 2022;Sanh et al., 2021;Kojima et al., 2022), it is still challenging to develop high-quality prompts that can induce better performance for varying tasks on evolving models in an effort-saving manner. ...

Reference:

Monotonic Paraphrasing Improves Generalization of Language Model Prompting
Contrastive Instruction Tuning
  • Citing Conference Paper
  • January 2024

... Moreover, interactions with users in dynamic environments and human biases in data annotation or model alignment complicate the uncertainty landscape. Unlike general deep learning models that primarily predict numerical outputs or classes, LLMs generate knowledge-based outputs which may include inconsistent or outdated information (Lin et al., 2024b). These features cannot be adequately addressed by simply categorizing uncertainty into three traditional types. ...

Navigating the Dual Facets: A Comprehensive Evaluation of Sequential Memory Editing in Large Language Models
  • Citing Conference Paper
  • January 2024

... Large Language Models (LLMs) have enabled the development of sophisticated applications, including complex task automation and domain-specific solutions (Xia et al., 2024) where strict adherence to formatting standards is crucial. Structured outputs improve AI integration into development tools by offering consistent output structures, simplifying error handling, and making LLM-generated responses more reliable for real-world use. ...

FOFO: A Benchmark to Evaluate LLMs’ Format-Following Capability
  • Citing Conference Paper
  • January 2024

... Recent advancements in Instruction Tuning [28,40] have significantly enhanced the ability of language models to better understand and accurately follow complex human instructions across a variety of contexts. Building upon this paradigm, Visual Instruction Tuning [13,32] further extends the capabilities by integrating both visual and textual data, enabling MLLMs to execute instructions that involve multiple data modalities. Models such as InstructBLIP [5], LLaVA [23], and MiniGPT-4 [45] exemplify this approach by leveraging large-scale pre-training and sophisticated alignment techniques to unify vision and language understanding. ...

Multimodal Instruction Tuning with Conditional Mixture of LoRA
  • Citing Conference Paper
  • January 2024

... This issue can perhaps be solved via LLM prompting. LLMs have tremendous capacity to generate texts while following a set of instructions Lou et al. (2024), potentially removing the need for sensitive training data. One could thus prompt an LLM to generate a patient portal message containing a predefined set of details. ...

Large Language Model Instruction Following: A Survey of Progresses and Challenges
  • Citing Article
  • June 2024

Computational Linguistics

... The biases recognized in AI-generated images raise significant concerns about inclusivity and the potential reinforcement of harmful stereotypes, underscoring the urgency of rectifying biases in generative AI technologies for a more equitable and unbiased future (Shin et al., 2024). These biases can have broad implications, particularly for women entrepreneurs who may feel cautious about adopting generative AI tools due to worries about gender discrimination and the perpetuation of negative stereotypes(Bhandari, 2023; Sandoval-Martin & Martínez-Sanzo, 2024). ...

Can Prompt Modifiers Control Bias? A Comparative Analysis of Text-to-Image Generative Models
  • Citing Preprint
  • June 2024

... PEFT methods aim to reduce the memory overhead of fine-tuning pre-trained models, enabling fine-tuning in resource-constrained environments. According to Han et al. [2024], PEFT methods can be categorized into: 1) Additive PEFT methods [Chronopoulou et al., 2023, Edalati et al., 2022, Lester et al., 2021], 2) Selective PEFT methods [Guo et al., 2020, Das et al., 2023, Sung et al., 2021, Ansell et al., 2021, Zaken et al., 2021, Vucetic et al., 2022], 3) Reparameterized PEFT methods [Hu et al., 2021b, Valipour et al., 2022, Karimi Mahabadi et al., 2021, Kopiczko et al., 2023], and 4) Hybrid PEFT methods [Mao et al., 2021, He et al., 2021, Zhang et al., 2022, Zhou et al., 2024]. Among these, Low-Rank Adaptation (LoRA)-based methods, which are representative of reparameterized PEFT approaches, have gained significant attention due to their minimal architectural changes, no additional inference costs, and high efficiency. ...
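
The LoRA family mentioned in this excerpt reparameterizes the weight update as a low-rank product that can be merged back into the frozen weight, which is why it adds no inference cost. A generic PyTorch sketch (not code from any of the cited works) is:

```python
# Generic LoRA-style linear layer: freeze W, train low-rank A and B, and
# optionally merge the update back into W for inference. Illustrative only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # frozen pretrained W
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scaling * B (A x); only A and B receive gradients.
        return self.base(x) + self.scaling * (x @ self.lora_a.T) @ self.lora_b.T

    def merge(self) -> nn.Linear:
        # Fold the low-rank update into W, so inference uses a plain Linear.
        merged = nn.Linear(self.base.in_features, self.base.out_features,
                           bias=self.base.bias is not None)
        merged.weight.data = (self.base.weight.data
                              + self.scaling * (self.lora_b @ self.lora_a))
        if self.base.bias is not None:
            merged.bias.data = self.base.bias.data.clone()
        return merged
```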

Unified Low-Resource Sequence Labeling by Sample-Aware Dynamic Sparse Finetuning
  • Citing Conference Paper
  • January 2023

... As previously mentioned, diverse tasks and inconsistent annotation standards often limit the comprehensive understanding of videos. We adapt the instruction tuning [95] to unify these annotations under a cohesive framework. In our dataset, videos are segmented into 16-second clips, 3-5× longer than common video understanding dataset, ensuring each contains a rich number of actions while maintaining a manageable length, as shown in Table 4. ...

LLM-driven Instruction Following: Progresses and Concerns
  • Citing Conference Paper
  • January 2023

... For the results reported in this paper, we use the entity disambiguation function of BLINK [45], using the cause or effect phrase as span and the phrase used for extraction of the pair as the context for disambiguation. In the future, we will experiment with other alternatives, such as an adaptation of BLINK called EVELINK [47] that is tuned to perform significantly better than BLINK for linking event mentions. ...

Event Linking: Grounding Event Mentions to Wikipedia
  • Citing Conference Paper
  • January 2023

... This setting, sometimes called cross-target stance classification [Zhao et al., 2022], is by far the most common zero-shot strategy among the selected studies, the only exceptions being the use of distant supervision and prompt-based learning [Zhang et al., 2023] as discussed below. [Table of the surveyed studies' datasets, zero-shot settings, and methods omitted.] ...

OpenStance: Real-world Zero-shot Stance Detection
  • Citing Conference Paper
  • January 2022