Ziyi Wu’s scientific contributions


Publications (3)


Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  • Preprint

June 2022 · 981 Reads · 61 Citations
Abhinav Rastogi · Abhishek Rao · [...] · Ziyi Wu

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
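The abstract above describes scoring language models of many sizes on benchmark tasks and comparing them with human raters. A common way such benchmarks score a multiple-choice item is to rank the answer options by the log-likelihood the model assigns to each one; the sketch below illustrates that step with the Hugging Face transformers library. It is not the official BIG-bench harness: the model name (gpt2) is a small stand-in for the dense transformers evaluated in the paper, and the task dictionary is an illustrative example in a BIG-bench-like JSON shape.

```python
# Minimal sketch of multiple-choice evaluation by log-likelihood ranking.
# Not the official BIG-bench code; model name and task item are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in for the dense transformers evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of token log-probabilities of `choice` conditioned on `prompt`.

    Assumes concatenating prompt and choice does not merge tokens across the
    boundary (true for the space-prefixed choices below).
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                     # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position i predicts token i+1
    choice_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, full_ids[0, pos + 1]].item() for pos in choice_positions)

task = {  # illustrative item in a BIG-bench-like multiple-choice format
    "input": "Q: Which planet is known as the Red Planet?\nA:",
    "choices": [" Venus", " Mars", " Jupiter"],
    "target": " Mars",
}
scores = {c: choice_logprob(task["input"], c) for c in task["choices"]}
prediction = max(scores, key=scores.get)
print(prediction, prediction == task["target"])
```

Sweeping a loop like this over checkpoints of increasing size is what produces the scale-versus-accuracy comparisons the abstract refers to; calibration can be estimated from the same normalized log-probabilities.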


Figure 1: Two riddle-style multiple-choice questions from the RiddleSense dataset. The correct answers are B (candle) and C (glove). The top example is a descriptive riddle that draws on multiple pieces of commonsense knowledge about candles and requires understanding figurative language such as metaphor. The bottom example additionally requires counterfactual reasoning to resolve the "but-no" cues.
Figure 6: Dev accuracy as a function of the percentage of RiddleSense training data used, for RoBERTa-Large and ALBERT-XXL.
RiddleSense: Answering Riddle Questions as Commonsense Reasoning
  • Preprint
  • File available

January 2021 · 2,701 Reads

A riddle is a mystifying, puzzling question about everyday concepts. For example, the riddle "I have five fingers but I am not alive. What am I?" asks about the concept of a glove. Solving riddles is a challenging cognitive process for humans, in that it requires complex commonsense reasoning abilities and an understanding of figurative language. However, there are currently no commonsense reasoning datasets that test these abilities. We propose RiddleSense, a novel multiple-choice question answering challenge for benchmarking higher-order commonsense reasoning models, which is the first large dataset for riddle-style commonsense question answering, where the distractors are crowdsourced from human annotators. We systematically evaluate a wide range of reasoning models over it and point out that there is a large gap between the best-supervised model and human performance -- pointing to interesting future research for higher-order commonsense reasoning and computational creativity.
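RiddleSense items are five-way multiple-choice questions, and the fine-tuned baselines mentioned in the Figure 6 caption above (RoBERTa-Large, ALBERT-XXL) attach a multiple-choice classification head to a pretrained encoder. The sketch below shows how one such item could be scored with that kind of head via Hugging Face transformers. It is not the authors' code: the item dictionary simply mirrors the glove riddle from the abstract, and the freshly initialized head only gives meaningful predictions after fine-tuning on the RiddleSense training split.

```python
# Minimal sketch of scoring one RiddleSense-style item with an encoder model
# and a multiple-choice head. Not the authors' code; the item is illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

MODEL_NAME = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# The classification head is newly initialized here; fine-tune before real use.
model = AutoModelForMultipleChoice.from_pretrained(MODEL_NAME)
model.eval()

item = {  # illustrative item mirroring the glove riddle from the abstract
    "question": "I have five fingers but I am not alive. What am I?",
    "choices": ["fork", "glove", "hand", "tree", "crab"],
    "answer": "glove",
}

# Pair the question with each choice and encode: shapes become [1, num_choices, seq_len].
enc = tokenizer(
    [item["question"]] * len(item["choices"]),
    item["choices"],
    padding=True,
    return_tensors="pt",
)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape [1, num_choices]
prediction = item["choices"][logits.argmax(dim=-1).item()]
print(prediction)  # meaningful only after fine-tuning on the RiddleSense training split
```

Each choice is encoded together with the question, the head scores the five (question, choice) pairs, and the argmax is the predicted answer; the human-vs-model gap reported in the abstract is measured with models evaluated this way on held-out riddles.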


Citations (2)


... Our research focuses on the Sports Understanding ability of LLMs and VLMs, an under-explored area that is nevertheless crucial for their potential applications in automated refereeing and related domains. Previous benchmarks have fallen short by either focusing on datasets with limited sports understanding [11], relying on a single modality [12], or lacking detailed error analysis [12,13]. Moreover, no prior work has addressed the capabilities of the latest LLMs, especially in light of the recent rapid advancements in LLMs and VLMs. ...

Reference:

Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models Through Question Answering from Text to Video
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  • Citing Preprint
  • June 2022

... (2) Suitable evaluation methodologies. First, following the popular standard LLM benchmark paradigm [8], [9], [10], [11], [12], [13], [14], we also establish a series of standard LLM evaluations on Oogiri-GO, such as ranking and selection [1], [31], [32], [33]. We find that even advanced LLMs and reasoning frameworks [3], [17], [34], [35], including GPT-4 and CoT, despite their exceptional reasoning capabilities and extensive prior knowledge of various forms of humor [17], still struggle to demonstrate adequate LoT ability for creative humor generation. ...

RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge
  • Citing Conference Paper
  • January 2021