Article

Embers of autoregression show how large language models are shaped by the problem they are trained to solve


Abstract

The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that to develop a holistic understanding of these systems, we must consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts, we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. Using this approach—which we call the teleological approach—we identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. To test our predictions, we evaluate five LLMs (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) on 11 tasks, and we find robust evidence that LLMs are influenced by probability in the hypothesized ways. Many of the experiments reveal surprising failure modes. For instance, GPT-4’s accuracy at decoding a simple cipher is 51% when the output is a high-probability sentence but only 13% when it is low-probability, even though this task is a deterministic one for which probability should not matter. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system—one that has been shaped by its own particular set of pressures.
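To make the cipher result concrete, the sketch below reconstructs the deterministic part of the task (rot13 decoding, the shift cipher the paper evaluates); the example sentence is illustrative rather than drawn from the paper's test set, and the point is that the decoding rule itself never depends on how probable the decoded sentence is.

```python
# Minimal sketch of the shift-cipher (rot13) decoding task described in the
# abstract. The rule is fully deterministic, so output probability should be
# irrelevant to a system that actually applies it.
import codecs

def rot13_decode(ciphertext: str) -> str:
    # rot13 shifts each letter by 13 positions; applying it twice is the identity.
    return codecs.decode(ciphertext, "rot13")

high_prob_sentence = "I like to eat apples and bananas."   # illustrative example
ciphertext = codecs.encode(high_prob_sentence, "rot13")
print(ciphertext)                 # "V yvxr gb rng nccyrf naq onananf."
print(rot13_decode(ciphertext))   # recovers the original sentence exactly
```

The reported asymmetry (51% accuracy for high-probability outputs versus 13% for low-probability ones) arises even though, as above, the mapping from ciphertext to plaintext is the same in both cases.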


... Large language models (LLMs) are capable of producing coherent text in a variety of settings (Radford et al., 2019; Dou et al., 2022; Bubeck et al., 2023; Chang & Bergen, 2024), yet they often fail at simple tasks, producing hallucinations and having difficulty performing logical reasoning (Lin et al., 2022; McCoy et al., 2023; Wu et al., 2024a; Mirzadeh et al., 2024; Razeghi et al., 2022; Stechly et al., 2024). McCoy et al. (2023) argue that these failures are partly a consequence of LLMs having difficulty producing low-probability output. For example, when solving puzzles such as deciphering a message by shifting each letter in the message by one position in the alphabet, LLMs will perform better when the correct answer is a high-probability string than when it is a low-probability string, even though the underlying logic of these tasks is the same (Figure 1). Although LLAMA 3 reasons through the correct characters, it outputs a more likely token, 'in', instead of the correct 'inf' (which is the first token of the correct answer 'infidel'), followed by 'field'. ...
... reproduced by LLAMA 3). One way to understand these errors is to assume that LLMs perform Bayesian inference, letting the prior distribution over word sequences that they have learned through pre-training on large amounts of text influence their output (McCoy et al., 2023). However, it is unclear what the mechanisms behind this influence might be, and whether their effects can be mitigated. ...
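One way to make the Bayesian reading in this excerpt concrete is the following proportionality (notation ours, not necessarily that of the cited work):

```latex
P(\text{answer} \mid \text{prompt}) \;\propto\;
\underbrace{P(\text{prompt} \mid \text{answer})}_{\text{task likelihood}}
\;\times\;
\underbrace{P(\text{answer})}_{\text{prior over word sequences from pre-training}}
```

Under this reading, a low-probability target answer is penalized by the prior term even when the likelihood term identifies it unambiguously, as in a deterministic decoding task.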
Preprint
Full-text available
Large language models (LLMs) sometimes fail to respond appropriately to deterministic tasks -- such as counting or forming acronyms -- because the implicit prior distribution they have learned over sequences of tokens influences their responses. In this work, we show that, in at least some cases, LLMs actually compute the information needed to perform these tasks correctly, and we identify some interventions that can allow them to access this information to improve their performance. First, we show that simply prompting the language model to not rely on its prior knowledge leads to dramatic improvements in prior-dominated tasks. We then use mechanistic interpretability techniques to localize the prior within the LLM and manipulate the extent to which that prior influences its responses. Specifically, we show that it is possible to identify layers of the underlying neural network that correlate with the prior probability of a response and that lightweight finetuning of these layers with basic prompts on prior-dominated tasks achieves high performance on held-out answers. These results suggest that the information required to produce a correct response is contained within the representations of the problems formed by the models. Furthermore, we show that this finetuning is significantly more effective for prior-dominated tasks, and that the error after finetuning is no longer correlated with the prior. Our results suggest that it may be possible to define effective methods for manipulating the extent to which LLMs rely upon their priors in solving problems, potentially increasing their performance in settings where LLMs hallucinate for reasons related to the prior probability of token sequences.
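As a rough illustration of the "lightweight finetuning of these layers" idea in this abstract, the sketch below freezes every parameter except a small set of transformer blocks assumed to carry the prior; the stand-in model (GPT-2) and the block indices are placeholders, not values from the cited work.

```python
# Hypothetical sketch: restrict finetuning to a few transformer blocks (the ones
# assumed to correlate with the prior), freezing everything else.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model for illustration
prior_layers = {8, 9}  # placeholder block indices; the paper localizes them empirically

for name, param in model.named_parameters():
    # GPT-2 transformer blocks are named "transformer.h.<idx>....".
    param.requires_grad = any(f"transformer.h.{i}." in name for i in prior_layers)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```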
... We construct a logistic mixed-effects regression predicting whether a given language model correctly assigns the possible sentence a higher probability than the impossible one. As predictors, we include the semantic relatedness of the possible and impossible critical words, the typicality of the possible and impossible critical words, and the frequency of the possible and impossible critical words (a possible confound; see, e.g., McCoy et al., 2024). We also include random intercepts for each language model and sentence context, as well as random uncorrelated slopes of each predictor for each of these. ...
... The presence of such a heuristic would be in line with other work showing that the strong performance of language models and other artificial intelligence systems can often be explained by them learning simpler 'shortcuts' or other heuristics that correlate (often but not always) with the task at hand (see, e.g., Gururangan et al., 2018; McCoy et al., 2019, 2024; Abdou et al., 2020; Geirhos et al., 2020; Schramowski et al., 2020; Shah et al., 2020; Zhang et al., 2020; Du et al., 2021, 2022; Elazar et al., 2021; Kavumba et al., 2021; Ye and Kovashka, 2021; Stefanik, 2022). ...
Preprint
Can language models reliably predict that possible events are more likely than merely improbable ones? By teasing apart possibility, typicality, and contextual relatedness, we show that despite the results of previous work, language models' ability to do this is far from robust. In fact, under certain conditions, all models tested - including Llama 3, Gemma 2, and Mistral NeMo - perform at worse-than-chance level, assigning higher probabilities to impossible sentences such as 'the car was given a parking ticket by the brake' than to merely unlikely sentences such as 'the car was given a parking ticket by the explorer'.
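A minimal way to run the comparison described in this abstract is to score both sentences with a causal language model and check which one receives the higher total log-probability. The snippet below uses GPT-2 purely as a stand-in scorer; the cited work evaluates Llama 3, Gemma 2, and Mistral NeMo, and its sentence pairs are controlled for more factors than this sketch shows.

```python
# Minimal sketch: compare LM log-probabilities of a "possible" and an
# "impossible" sentence (examples taken from the abstract above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sentence_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # `loss` is the mean next-token cross-entropy over seq_len - 1 positions,
        # so multiplying back gives the summed log-probability of the sentence.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

possible = "The car was given a parking ticket by the explorer."
impossible = "The car was given a parking ticket by the brake."
print(sentence_logprob(possible) > sentence_logprob(impossible))
```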
... Encoder Models: Encoder-family models are custom transformer encoders trained on NCTE classroom transcripts. They use fixed-parameter pretrained sentence embeddings, differing in these and in training hyperparameters, thereby exploiting LLM sensitivities to pretraining regimes (D'Amour et al., 2020; McCoy et al., 2023). A quick summary of differences is in Table 4 and more training details can be found in Appendix D. In contrast to the GPT models, the only text pre-processing used with the encoders simply replaced all transcription notes with [inaudible] to mimic the uncertainty in live audio transcription, and no edits to indicate speakership were included. ...
... As foundation models are increasingly deployed in complex contexts where evaluation of the quality of their performance may not be feasible, identifying performance gaps in cases of unreliable annotations will be increasingly important, especially when downstream tasks diverge more from model training (McCoy et al., 2023). This paper demonstrated some techniques to show that even when human reliabilities are low, meaningful insights can be obtained to understand and improve model construction and use. ...
... Many LLMs, while achieving high performance on other benchmarks, have shown a lower success rate on this task [170]. When faced with unseen tasks, which are common for human scientists in research, LLMs exhibit a significant drop in accuracy even on simple questions, such as performing base-9 number addition or writing Python code with indexing starting at 1 instead of 0. This suggests that LLMs may rely more on pattern matching than on reasoning, contrary to what many assume [171-173]. Consequently, caution is advised when applying LLMs to novel reasoning tasks, and incorporating human oversight into the process is recommended [173,174]. Another crucial aspect of reasoning is planning capability. ...
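To make the base-9 example from the excerpt above concrete, here is the arithmetic a model is being asked to perform; this is a generic illustration, not an item from the cited benchmark.

```python
# Toy illustration: addition in base 9, the kind of "counterfactual" task on
# which LLM accuracy drops relative to ordinary base-10 arithmetic.
def add_base9(a: str, b: str) -> str:
    # Interpret the digit strings in base 9, add, then convert back to base 9.
    total = int(a, 9) + int(b, 9)
    digits = ""
    while total:
        digits = str(total % 9) + digits
        total //= 9
    return digits or "0"

print(add_base9("78", "6"))  # 78 in base 9 is 71; 71 + 6 = 77, which is "85" in base 9
```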
Preprint
With recent Nobel Prizes recognising AI contributions to science, Large Language Models (LLMs) are transforming scientific research by enhancing productivity and reshaping the scientific method. LLMs are now involved in experimental design, data analysis, and workflows, particularly in chemistry and biology. However, challenges such as hallucinations and reliability persist. In this contribution, we review how Large Language Models (LLMs) are redefining the scientific method and explore their potential applications across different stages of the scientific cycle, from hypothesis testing to discovery. We conclude that, for LLMs to serve as relevant and effective creative engines and productivity enhancers, their deep integration into all steps of the scientific process should be pursued in collaboration and alignment with human scientific goals, with clear evaluation metrics. The transition to AI-driven science raises ethical questions about creativity, oversight, and responsibility. With careful guidance, LLMs could evolve into creative engines, driving transformative breakthroughs across scientific disciplines responsibly and effectively. However, the scientific community must also decide how much it leaves to LLMs to drive science, even when associations with 'reasoning', mostly currently undeserved, are made in exchange for the potential to explore hypothesis and solution regions that might otherwise remain unexplored by human exploration alone.
... However, this strength can become a weakness in complex analysis tasks. LLMs may develop heuristics based on surface-level features [32]; for example, the presence of a "sanity check" construct often correlates statistically with safe code in training data. Consequently, the model might classify code containing such a check as safe without performing the deeper reasoning required to determine whether the check is actually effective under all relevant execution paths. ...
... Recent studies have shown that LLMs are still far from performing reliable code reasoning, and their predictions are thus fragile and susceptible to superficial changes in input [19,37,40]. This fragility is often attributed to the learned models taking "shortcuts" based on superficial patterns in training data rather than robust, generalizable reasoning strategies [4,14,32,46,48]. BugLens mitigates this problem with a similar spirit to existing works on boosting the LLMs' reasoning by constraining their reasoning space with structural and symbolic procedures [8,9,25,27,47]. ...
Preprint
Full-text available
Static analysis is a cornerstone for software vulnerability detection, yet it often struggles with the classic precision-scalability trade-off. In practice, such tools often produce high false positive rates, particularly in large codebases like the Linux kernel. This imprecision can arise from simplified vulnerability modeling and over-approximation of path and data constraints. While large language models (LLMs) show promise in code understanding, their naive application to program analysis yields unreliable results due to inherent reasoning limitations. We introduce BugLens, a post-refinement framework that significantly improves static analysis precision. BugLens guides an LLM to follow traditional analysis steps by assessing buggy code patterns for security impact and validating the constraints associated with static warnings. Evaluated on real-world Linux kernel bugs, BugLens raises precision from 0.10 (raw) and 0.50 (semi-automated refinement) to 0.72, substantially reducing false positives and revealing four previously unreported vulnerabilities. Our results suggest that a structured LLM-based workflow can meaningfully enhance the effectiveness of static analysis tools.
... A broad search through published work in the social sciences suggests that the number of published articles using generative AI tools increased by a factor of 500 percent from 2023 to 2024, with no signs of slowing. 1 In political science, the 2023 annual meeting of the American Political Science Association included 10 research papers using generative AI and one "breaking news" panel on large language models (LLMs) in political science; the same conference in 2024 included a full-day pre-conference on LLMs and over 100 papers making some use of generative AI and LLMs. 2 Like others, we are enthusiastic about the potential contributions of LLMs across the social sciences [25-27]; LLMs have enabled our team to explore a range of interesting and complex questions in ways that would not have been otherwise possible [12,18,28,29]. LLMs are unlike other scientific tools: they are neither a statistical model with carefully bounded properties, nor a machine learning algorithm with well-defined inputs, outputs and optimization objectives [30]. They are "programmed" with prompts that are remarkably fragile [31-33], often trained on an opaque mix of data and aligned to secret standards [34,35], and their human-like outputs are presented so naturally and confidently that it's all too easy to forget they need to be validated [36]. ...
... The Task is Beyond the Capacity of Any Language Model: LLMs are highly versatile, and the full scope of their capacities is still being mapped across a wide variety of academic and industrial use-cases. However, we do not expect that they are capable of completing every task, and it is possible that an LLM fails because the researcher is asking it to do something that is fundamentally beyond what it can provide [30]. For example, current LLMs struggle with tasks that require deep subject matter expertise [53], genuine human-like emotion [54], or complex reasoning and logic [55,56]. ...
Preprint
Full-text available
Generative large language models (LLMs) are incredibly useful, versatile, and promising tools. However, they will be of most use to political and social science researchers when they are used in a way that advances understanding about real human behaviors and concerns. To promote the scientific use of LLMs, we suggest that researchers in the political and social sciences need to remain focused on the scientific goal of inference. To this end, we discuss the challenges and opportunities related to scientific inference with LLMs, using validation of model output as an illustrative case for discussion. We propose a set of guidelines related to establishing the failure and success of LLMs when completing particular tasks, and discuss how we can make inferences from these observations. We conclude with a discussion of how this refocus will improve the accumulation of shared scientific knowledge about these tools and their uses in the social sciences.
... We highlight that, in our evaluation, during the inference phase, we employ an autoregressive setting [42] instead of a teacher-forced setting [44,46,60,79] to generate the next token, thereby constructing the target incrementally. This prevents exaggerating model performance when recalling knowledge and yields a fair and accurate performance evaluation that is more in line with practical applications. ...
... While some researchers [60] argue that changes in token prediction under teacher forcing indicate a successful influence on the LLM, this approach is considered unrealistic for real-world predictions and, to some extent, can be seen as a form of cheating. Therefore, for a more equitable and reasonable assessment, we uniformly adopt an autoregressive generation [42] paradigm for prediction. ...
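A minimal sketch of the distinction drawn in these excerpts, using GPT-2 as a stand-in model: teacher-forced scoring always conditions on the gold prefix, whereas autoregressive evaluation makes the model build the answer from its own previously generated tokens.

```python
# Sketch: teacher-forced per-token accuracy vs. autoregressive generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
target = " Paris"

# Teacher-forced: the gold prefix is always provided; we measure top-1
# next-token accuracy over the concatenated sequence.
ids = tok(prompt + target, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits
pred = logits[:, :-1].argmax(-1)   # model's guess for each next token
gold = ids[:, 1:]
teacher_forced_acc = (pred == gold).float().mean().item()

# Autoregressive: the model must produce the continuation from its own tokens.
prompt_ids = tok(prompt, return_tensors="pt").input_ids
gen = model.generate(prompt_ids, max_new_tokens=3, do_sample=False,
                     pad_token_id=tok.eos_token_id)
autoregressive_answer = tok.decode(gen[0, prompt_ids.shape[1]:])
print(teacher_forced_acc, autoregressive_answer)
```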
Preprint
Full-text available
As real-world knowledge evolves, the information embedded within large language models (LLMs) can become outdated, inadequate, or erroneous. Model editing has emerged as a prominent approach for updating LLMs' knowledge with minimal computational costs and parameter changes. This approach typically identifies and adjusts specific model parameters associated with newly acquired knowledge. However, existing methods often underestimate the adverse effects that parameter modifications can have on broadly distributed knowledge. More critically, post-edit LLMs frequently struggle with multi-hop reasoning and continuous knowledge updates. Although various studies have discussed these shortcomings, there is a lack of comprehensive evaluation. In this paper, we provide an evaluation of ten model editing methods along four dimensions: reliability, generalization, locality, and portability. Results confirm that all ten popular model editing methods show significant shortcomings across multiple dimensions, suggesting model editing is less promising. We then propose a straightforward method called Selective Contextual Reasoning (SCR), for knowledge updating. SCR does not modify model parameters but harnesses LLM's inherent contextual reasoning capabilities utilizing the updated knowledge pieces. Under SCR, an LLM first assesses whether an incoming query falls within the scope of an external knowledge base. If it does, the relevant external knowledge texts are contextualized to enhance reasoning; otherwise, the query is answered directly. We evaluate SCR against the ten model editing methods on two counterfactual datasets with three backbone LLMs. Empirical results confirm the effectiveness and efficiency of contextual reasoning for knowledge updating.
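The control flow of SCR as described in this abstract reduces to a few lines; `in_scope`, `retrieve`, and `llm` below are hypothetical stand-ins for the paper's actual components, not its implementation.

```python
# Rough control-flow sketch of Selective Contextual Reasoning (SCR) as described
# in the abstract. `in_scope`, `retrieve`, and `llm` are hypothetical helpers.
def scr_answer(query, knowledge_base, llm, in_scope, retrieve):
    if in_scope(query, knowledge_base):
        # Contextualize the relevant updated knowledge and reason over it.
        facts = retrieve(query, knowledge_base)
        prompt = "Use the following updated facts:\n" + "\n".join(facts) + "\n\nQ: " + query
        return llm(prompt)
    # Out-of-scope queries are answered directly; no parameters are modified.
    return llm(query)
```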
... Indeed, developmental psychologists have widely debated the age at which ToM "emerges" in children, and tasks that reduce auxiliary demands have revealed evidence for some ToM abilities in young children who would otherwise fail similar tests (Lewis and Osborne, 1990;Carlson et al., 1998;Surian and Leslie, 1999;Setoh et al., 2016;Fu et al., 2023). To design ToM evaluations that more directly measure ToM while minimizing auxiliary demands, we need to develop a deeper understanding of the kinds of resource constraints that LLMs face, as well as best practices for performing "species-fair" evaluations (McCoy et al., 2024;Lampinen, 2023;Firestone, 2020). ...
... and the statistical regularities of pretraining data (McCoy et al., 2024). A reasonable and popular strategy for testing the robustness of a model's ToM abilities is to construct adversarial test cases, which might violate a model's expectations or introduce settings that are beyond the distribution seen in training. ...
Preprint
Full-text available
The question of whether large language models (LLMs) possess Theory of Mind (ToM) -- often defined as the ability to reason about others' mental states -- has sparked significant scientific and public interest. However, the evidence as to whether LLMs possess ToM is mixed, and the recent growth in evaluations has not resulted in a convergence. Here, we take inspiration from cognitive science to re-evaluate the state of ToM evaluation in LLMs. We argue that a major reason for the disagreement on whether LLMs have ToM is a lack of clarity on whether models should be expected to match human behaviors, or the computations underlying those behaviors. We also highlight ways in which current evaluations may be deviating from "pure" measurements of ToM abilities, which also contributes to the confusion. We conclude by discussing several directions for future research, including the relationship between ToM and pragmatic communication, which could advance our understanding of artificial systems as well as human cognition.
... However, these studies represent a best-case scenario of relatively easy tasks, as they cover English-language data about standard societal and political issues that are likely much-discussed in LLM training data and do not require much expertise for coding. Research on logical reasoning tasks suggests that LLMs tend to struggle with tasks that are comparably complex, but less commonly appearing in their training and alignment processes (McCoy et al., 2023). In addition, there is ample evidence that LLMs are biased against non-English language contexts in a variety of other tasks (e.g., Durmus et al., 2024;Johnson et al., 2022;Li et al., 2024;Wang et al., 2024). ...
Preprint
Full-text available
The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.
... Additional detailed analysis is available in Appendix D, including experiments with more models and across a wider range of datasets that further support this finding. Data familiarity and FEEDBACK FRICTION: Prior work [14,26] suggests that language models perform better with familiar entities and topics encountered frequently during training. Are these models more resistant to feedback about familiar entities? ...
Preprint
Full-text available
Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs' ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.
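The controlled pipeline described above reduces to a simple loop; `solver`, `feedback_generator`, and `is_correct` are hypothetical stand-ins for the components in the paper rather than its actual code.

```python
# Hypothetical sketch of the solver/feedback loop described in the abstract.
def feedback_loop(question, gold, solver, feedback_generator, is_correct, max_rounds=5):
    answer = solver(question, feedback=None)
    for _ in range(max_rounds):
        if is_correct(answer, gold):
            return answer, True
        # The feedback generator sees near-complete ground truth and targets the error.
        feedback = feedback_generator(question, answer, gold)
        answer = solver(question, feedback=feedback)
    # "Feedback friction": the feedback was never fully incorporated.
    return answer, False
```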
... These views differ fundamentally from Justaism and hence are not the target of our critique. For instance, some empirical research highlights specific LLM cognitive deficits (e.g., Berglund et al., 2024;McCoy et al., 2024;Turpin et al., 2024). Rather than denying LLM cognition outright, such work is better understood as qualifying the extent of cognitive abilities in LLMs. ...
Preprint
Full-text available
Large language models (LLMs) are arguably the most predictive models of human cognition available. Despite their impressive human-alignment, LLMs are often labeled as "*just* next-token predictors" that purportedly fall short of genuine cognition. We argue that these deflationary claims need further justification. Drawing on prominent cognitive and artificial intelligence research, we critically evaluate two forms of "Justaism" that dismiss LLM cognition by labeling LLMs as "just" simplistic entities without specifying or substantiating the critical capacities these models supposedly lack. Our analysis highlights the need for a more measured discussion of LLM cognition, to better inform future research and the development of artificial intelligence.
... While these LLMs demonstrate promising language understanding with strong compression capabilities, their intelligence and reasoning abilities remain a critical topic of scientific debate [7,8]. Earlier iterations of LLMs [9,10,11] exhibited poor performance on reasoning benchmarks [12,13,14,6]. To address these shortcomings, several approaches have been explored with the common theme among them being "scaling" both the training data and test-time computation. ...
Preprint
Full-text available
Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths and limitations and raising questions about their reasoning capabilities.
... Since larger models have seen more pronominal IO sentences compared to non-pronominal ones, they are expected to perform more like error-driven learning for pronominal IO sentences than non-pronominal IO sentences in ICL, under the assumption that data scale (like model scale) increases the strength of the IFE. Such relationships between data scale and the strength of ICL-related effects have also been observed in McCoy et al. (2024), who showed that LLMs displayed better ICL performance on high-probability sentences than low-probability ones. This observation could also potentially explain the interesting finding from Sinclair et al. (2022) that structural priming in LLMs (to a greater degree than in humans) is modulated by semantic plausibility: semantically plausible inputs are better represented in the training data, leading to more ICL/structural priming. ...
... Therefore, the question we want to address in this paper is the following: "how can we quantify the sensitivity of an LLM to variations of the prompt?". Existing works have answered this question by considering accuracy as the sole metric of interest (McCoy et al., 2023), but this has a limited impact on the everyday life of developers and requires enough ground truth labels for the estimate to be reliable. As a matter of fact, with the recent progress in LLM agents (Gioacchini et al., 2024) and chain of thoughts (Wei et al., 2022) techniques, the existence of multiple intermediate steps and/or user inputs, each handled somehow by an LLM, implies an exponential number of potential failure paths. ...
... These results align with observations made in McCoy et al. (2024) that, on low-probability tasks and/or low-probability inputs (both of which hold in our case), LLMs are biased towards the output with the highest unconditional probability. In our case, the model selects text that may have a high frequency (relative to other texts conditioned on a prompt about Nahuatl) in its training data, regardless of whether this text truly satisfies the prompt's request. ...
... Generative language models frequently generate fluent but incorrect answers that can cause downstream harm (Band et al., 2024;Huang et al., 2024). These errors arise from objectives that reward fluency over factuality (McCoy et al., 2023;McKenna et al., 2023;Wan et al., 2023), gaps or biases in pre-training data (Baan et al., 2023;Ji et al., 2024), or decoding that relies more on marginal probability than crucial context (Sadat et al., 2023;van der Poel et al., 2022). When no external verifier is available, a model's self-reported confidence is the only proxy for correctness, making reliable uncertainty estimates essential. ...
Preprint
Full-text available
We study the source of uncertainty in DeepSeek R1-32B by analyzing its self-reported verbal confidence on question answering (QA) tasks. In the default answer-then-confidence setting, the model is regularly over-confident, whereas semantic entropy - obtained by sampling many responses - remains reliable. We hypothesize that this is because of semantic entropy's larger test-time compute, which lets us explore the model's predictive distribution. We show that granting DeepSeek the budget to explore its distribution by forcing a long chain-of-thought before the final answer greatly improves its verbal score effectiveness, even on simple fact-retrieval questions that normally require no reasoning. Furthermore, a separate reader model that sees only the chain can reconstruct very similar confidences, indicating the verbal score might be merely a statistic of the alternatives surfaced during reasoning. Our analysis concludes that reliable uncertainty estimation requires explicit exploration of the generative space, and self-reported confidence is trustworthy only after such exploration.
... Yet numerous studies have pointed towards serious flaws: they encode societal biases (Gallegos et al., 2024;Hofmann et al., 2024) and are sensitive to spurious correlations (Du et al., 2023). Many of these weaknesses have been traced back to the training data (Feng et al., 2023;McCoy et al., 2024). Moreover, as the size of these training datasets continues to increase, improvements in task performance show diminishing returns (Tirumala et al., 2023). ...
Preprint
Full-text available
Although diversity in NLP datasets has received growing attention, the question of how to measure it remains largely underexplored. This opinion paper examines the conceptual and methodological challenges of measuring data diversity and argues that interdisciplinary perspectives are essential for developing more fine-grained and valid measures.
... There have been a large number of both theoretical and empirical studies into the emergent reasoning capabilities of LLMs [26,37,52,74,78,79] and how these can be improved [4,22,29,34,35,40,60,83]. At the time of writing there is an almost monthly release of new LLMs from different companies. ...
Preprint
Full-text available
Empirical methods to examine the capability of Large Language Models (LLMs) to use Automated Theorem Prover (ATP) reasoning strategies are studied. We evaluate the performance of State of the Art models from December 2023 and August 2024 on PRONTOQA steamroller reasoning problems. For that, we develop methods for assessing LLM response accuracy and correct answer correlation. Our results show that progress in improving LLM reasoning abilities has stalled over the nine month period. By tracking completion tokens, we show that almost all improvement in reasoning ability since GPT-4 was released can be attributed to either hidden system prompts or the training of models to automatically use generic Chain of Thought prompting strategies. Among the ATP reasoning strategies tried, we found that current frontier LLMs are best able to follow the bottom-up (also known as forward-chaining) strategy. A low positive correlation was found between an LLM response containing correct reasoning and arriving at the correct conclusion.
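For reference, the bottom-up (forward-chaining) strategy mentioned in this abstract can be stated in a few lines; the toy rules below are ours, not items from PRONTOQA.

```python
# Minimal forward-chaining sketch over Horn-style rules: repeatedly add the
# conclusion of any rule whose premises are already known, until a fixed point.
def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in facts and all(p in facts for p in premises):
                facts.add(conclusion)
                changed = True
    return facts

rules = [(("wumpus",), "mammal"), (("mammal",), "warm_blooded")]
print(forward_chain({"wumpus"}, rules))  # derives 'mammal' and 'warm_blooded'
```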
... Such setups may lead to overly optimistic results and fail to accurately reflect the method's effectiveness in real-world generative scenarios, as also highlighted by [73]. To ensure a more fair and realistic evaluation, we adopt a unified autoregressive generation paradigm [74] for prediction. We directly evaluate whether the generated answer is correct with respect to the target answer. ...
Preprint
Full-text available
Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs, under a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also evaluate multi-edit scenarios to better reflect practical demands. We employ four evaluation dimensions, including portability, and compare all recent methods against a simple and straightforward baseline named Selective Contextual Reasoning (SCR). Empirical results reveal that parameter-based editing methods perform poorly under realistic conditions. In contrast, SCR consistently outperforms them across all settings. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.
... Bayesian Optimal Experiment Design: An adjacent line of work considers the sequential design of experiments that maximize information gain about an unknown parameter of interest [39,17,12,21]; one may interpret these methods as studying a non-LLM-focused, Bayesian analogue of the reverse-engineering problem we formulate in the subsequent section, where a learner begins with a prior distribution over the black box in question and must maximally reduce epistemic uncertainty [19] with a given budget of experiments. To the extent that LLMs may implicitly engage with an underlying approximate posterior inference scheme [78,28,82,20,49], the reverse-engineering capabilities studied in this work can be tied to this Bayesian optimal experiment design problem. ...
Preprint
Using AI to create autonomous researchers has the potential to accelerate scientific discovery. A prerequisite for this vision is understanding how well an AI model can identify the underlying structure of a black-box system from its behavior. In this paper, we explore how well a large language model (LLM) learns to identify a black-box function from passively observed versus actively collected data. We investigate the reverse-engineering capabilities of LLMs across three distinct types of black-box systems, each chosen to represent different problem domains where future autonomous AI researchers may have considerable impact: Program, Formal Language, and Math Equation. Through extensive experiments, we show that LLMs fail to extract information from observations, reaching a performance plateau that falls short of the ideal of Bayesian inference. However, we demonstrate that prompting LLMs to not only observe but also intervene -- actively querying the black-box with specific inputs to observe the resulting output -- improves performance by allowing LLMs to test edge cases and refine their beliefs. By providing the intervention data from one LLM to another, we show that this improvement is partly a result of engaging in the process of generating effective interventions, paralleling results in the literature on human learning. Further analysis reveals that engaging in intervention can help LLMs escape from two common failure modes: overcomplication, where the LLM falsely assumes prior knowledge about the black-box, and overlooking, where the LLM fails to incorporate observations. These insights provide practical guidance for helping LLMs more effectively reverse-engineer black-box systems, supporting their use in making new discoveries.
... Notice that inappropriate substitution of "were" or "are" for "am" also hints at entanglement between the grammatical ability of the network and its ability to follow instructions that are agnostic to grammar. Modern LLMs have a strong preference for answers that are statistically more likely in the training data (McCoy et al., 2023), hurting performance when this assumption might be violated. ...
Preprint
Full-text available
Much of the excitement in modern AI is driven by the observation that scaling up existing systems leads to better performance. But does better performance necessarily imply better internal representations? While the representational optimist assumes it must, this position paper challenges that view. We compare neural networks evolved through an open-ended search process to networks trained via conventional stochastic gradient descent (SGD) on the simple task of generating a single image. This minimal setup offers a unique advantage: each hidden neuron's full functional behavior can be easily visualized as an image, thus revealing how the network's output behavior is internally constructed neuron by neuron. The result is striking: while both networks produce the same output behavior, their internal representations differ dramatically. The SGD-trained networks exhibit a form of disorganization that we term fractured entangled representation (FER). Interestingly, the evolved networks largely lack FER, even approaching a unified factored representation (UFR). In large models, FER may be degrading core model capacities like generalization, creativity, and (continual) learning. Therefore, understanding and mitigating FER could be critical to the future of representation learning.
... Finding a simple but tough-to-beat baseline for a challenging task can encourage rethinking of the status quo, motivating better model architectures and baselines [81]. For example, identifying zeroshot learning strategies beyond parroting can spur the development of next-generation foundation models and contribute to the debate on whether (or to what extent) large language models are stochastic parrots [82][83][84][85]. Context parroting formalizes an explicit baseline to compare against in the time-series domain and can help discover beyond-parroting strategies. ...
Preprint
Full-text available
Recently-developed time series foundation models for scientific machine learning exhibit emergent abilities to predict physical systems. These abilities include zero-shot forecasting, in which a model forecasts future states of a system given only a short trajectory as context. Here, we show that foundation models applied to physical systems can give accurate predictions, but that they fail to develop meaningful representations of the underlying physics. Instead, foundation models often forecast by context parroting, a simple zero-shot forecasting strategy that copies directly from the context. As a result, a naive direct context parroting model scores higher than state-of-the-art time-series foundation models on predicting a diverse range of dynamical systems, at a tiny fraction of the computational cost. We draw a parallel between context parroting and induction heads, which explains why large language models trained on text can be repurposed for time series forecasting. Our dynamical systems perspective also ties the scaling between forecast accuracy and context length to the fractal dimension of the attractor, providing insight into the previously observed in-context neural scaling laws. Context parroting thus serves as a simple but tough-to-beat baseline for future time-series foundation models and can help identify in-context learning strategies beyond parroting.
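One simple reading of "context parroting" is a forecaster that fills the forecast horizon by copying values directly from the context. The variant below (cyclically repeating the context) is our own toy formulation; the paper's exact baseline may differ.

```python
# Toy "context parroting" forecaster: produce the forecast by copying values
# from the context rather than modeling the underlying dynamics.
import numpy as np

def parrot_forecast(context, horizon):
    context = np.asarray(context, dtype=float)
    reps = int(np.ceil(horizon / len(context)))   # repeat the context as needed
    return np.tile(context, reps)[:horizon]

print(parrot_forecast([1.0, 2.0, 3.0], horizon=5))  # [1. 2. 3. 1. 2.]
```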
... This suggests that LLMs may not truly grasp mathematical concepts but rather rely on pattern matching to generate responses. Additionally, [93] highlights that LLMs perform worse on rare tasks than on more frequent ones, even when the tasks share the same level of complexity. Moreover, LLMs are sensitive to the probability distribution of inputs and outputs in their training data (Internet text), even for deterministic tasks. ...
Preprint
Full-text available
Problem-solving has been a fundamental driver of human progress in numerous domains. With advancements in artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools capable of tackling complex problems across diverse domains. Unlike traditional computational systems, LLMs combine raw computational power with an approximation of human reasoning, allowing them to generate solutions, make inferences, and even leverage external computational tools. However, applying LLMs to real-world problem-solving presents significant challenges, including multi-step reasoning, domain knowledge integration, and result verification. This survey explores the capabilities and limitations of LLMs in complex problem-solving, examining techniques including Chain-of-Thought (CoT) reasoning, knowledge augmentation, and various LLM-based and tool-based verification techniques. Additionally, we highlight domain-specific challenges in various domains, such as software engineering, mathematical reasoning and proving, data analysis and modeling, and scientific research. The paper further discusses the fundamental limitations of the current LLM solutions and the future directions of LLM-based complex problem solving from the perspective of multi-step reasoning, domain knowledge integration and result verification.
... Its self-attention mechanism enables thorough consideration of broader context and the detection of dependencies between distant tokens, generating high-quality text beyond mere syntax. However, it also introduces uncertainties, which can cause LLMs to stray from their training data and instructions, especially in low-probability or ambiguous scenarios (McCoy et al., 2023;Peng et al., 2024). ...
Article
Full-text available
We investigate the ‘bewitchment’ of understanding interactions between humans and systems based on large language models (LLMs) inspired by Wittgenstein’s later view on language. This framework is particularly apt for analyzing human-LLM interaction as it treats understanding as a public phenomenon manifested in observable communicative practices, rather than as a mental or computational state—an approach especially valuable given LLMs’ inherent opacity. Drawing on this perspective, we show that successful communication requires not merely regularity in language use, but constancy in maintaining reference points through agreement in both definitions and judgments. Crucially, LLMs lack the constancy needed to track negations and contradictions throughout a dialogue, thereby disrupting the reference points necessary for genuine communication. The apparent understanding in human-LLM interactions arises from what we characterize as a ‘bewitchment’: the interaction between LLMs’ statistical adherence to linguistic patterns and humans’ tendency to blindly follow familiar language games. Moreover, when interaction with LLMs is based on stereotyped contexts in which the system seems capable of identifying reference points, we humans automatically apply the practical principle that there is understanding until proven otherwise. The bewitchment becomes thus more profound as LLMs improve in modeling stereotypical aspects of human interaction. This improvement, far from addressing the highlighted limitations, can only deepen the illusion of understanding, raising significant concerns for meaningful control over such systems.
... Meanwhile the LLM-based PSRL agent, while successful at maintaining visitation counts, is slow to achieve the same convergence and, across many posterior samples, leaves non-negligible probability mass on nonexistent transitions with fictitious rewards. One plausible explanation would be that such concentration errors stem from a lack of familiarity by the LLMs, given that Dirichlet distributions with fractional parameters are encountered with less frequency (McCoy et al., 2024); however, our preliminary experiments with a Dirichlet(1,1,1,1) prior showed no significant improvement. ...
Preprint
Full-text available
A burgeoning area within reinforcement learning (RL) is the design of sequential decision-making agents centered around large language models (LLMs). While autonomous decision-making agents powered by modern LLMs could facilitate numerous real-world applications, such successes demand agents that are capable of data-efficient RL. One key obstacle to achieving data efficiency in RL is exploration, a challenge that we demonstrate many recent proposals for LLM agent designs struggle to contend with. Meanwhile, classic algorithms from the RL literature known to gracefully address exploration require technical machinery that can be challenging to operationalize in purely natural language settings. In this work, rather than relying on finetuning or in-context learning to coax LLMs into implicitly imitating a RL algorithm, we illustrate how LLMs can be used to explicitly implement an existing RL algorithm (Posterior Sampling for Reinforcement Learning) whose capacity for statistically-efficient exploration is already well-studied. We offer empirical results demonstrating how our LLM-based implementation of a known, data-efficient RL algorithm can be considerably more effective in natural language tasks that demand prudent exploration.
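The excerpt above refers to Dirichlet posteriors with fractional parameters; the numeric sketch below shows the posterior-sampling step that PSRL-style agents perform for a single (state, action) pair. The prior values and counts are illustrative, not taken from the cited experiments.

```python
# Posterior sampling for one (state, action) pair: a Dirichlet prior over
# next-state probabilities is updated with observed transition counts, then
# one transition distribution is sampled from the posterior.
import numpy as np

rng = np.random.default_rng(0)
prior = np.full(4, 0.25)          # Dirichlet(0.25, 0.25, 0.25, 0.25) prior
counts = np.array([3, 0, 1, 0])   # observed transitions to each of 4 next states
sample = rng.dirichlet(prior + counts)
print(sample, sample.sum())       # one plausible transition distribution; sums to 1
```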
... However, the findings still emphasized the need for caution when interpreting LLM-generated decisions. Recent work by McCoy et al. (2024) further highlights this concern, showing that LLMs, pre-trained for next-word prediction, are often influenced by superficial features of inputs and output probabilities rather than deep understanding. Prior works (Achiam et al., 2023; Strachan et al., 2024) also demonstrated that the post-training processes can significantly affect the calibration of a model's accuracy and confidence (log probability). ...
Preprint
Full-text available
Large Language Models (LLMs) are increasingly used in tasks such as psychological text analysis and decision-making in automated workflows. However, their reliability remains a concern due to potential biases inherited from their training process. In this study, we examine how different response format: binary versus continuous, may systematically influence LLMs' judgments. In a value statement judgments task and a text sentiment analysis task, we prompted LLMs to simulate human responses and tested both formats across several models, including both open-source and commercial models. Our findings revealed a consistent negative bias: LLMs were more likely to deliver "negative" judgments in binary formats compared to continuous ones. Control experiments further revealed that this pattern holds across both tasks. Our results highlight the importance of considering response format when applying LLMs to decision tasks, as small changes in task design can introduce systematic biases.
... A typical sub-case is when the association magnitudes are close to zero, indicating that no subsequence is strongly associated with either the hallucinated or the faithful answer. This situation typically occurs when the input is out-of-distribution (McCoy et al., 2023) or inconsistent with the training distribution (exposure bias) (Bengio et al., 2015). Other contributing factors include entities in the long-tailed region (Sun et al., 2023), the model being architecturally limited (Banerjee et al., 2024), or insufficient training (Zhang et al., 2023c) with many factoids appearing only once (Kalai & Vempala, 2024). ...
Preprint
Large language models (LLMs) frequently generate hallucinations-content that deviates from factual accuracy or provided context-posing challenges for diagnosis due to the complex interplay of underlying causes. This paper introduces a subsequence association framework to systematically trace and understand hallucinations. Our key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. Through theoretical and empirical analyses, we demonstrate that decoder-only transformers effectively function as subsequence embedding models, with linear layers encoding input-output associations. We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts. Experiments show our method outperforms standard attribution techniques in identifying hallucination causes and aligns with evidence from the model's training corpus. This work provides a unified perspective on hallucinations and a robust framework for their tracing and analysis.
... Many of the behaviors that appear to indicate general reasoning ability in LLMs in fact represent modes of memorization. For instance, GPT-4's ability to decode ROT13 ciphers but not less common variants like ROT2 raises questions about how LLMs acquire and generalize skills (McCoy et al. 2023). Similar patterns emerge in other capabilities. ...
Article
Full-text available
An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic concept space with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.
... At the other end of the spectrum, the emergence of large language models (LLMs) has marked a significant milestone in AI. While LLMs excel at generating coherent and contextually relevant text (Brown et al. 2020), their reliance on statistical inference leads to challenges in maintaining logical consistency and accuracy in reasoning and planning tasks (McCoy et al. 2023;Valmeekam et al. 2023). This limitation is particularly apparent when explanations need to be both linguistically coherent and logically sound. ...
Article
We present TRACE-cs, a novel hybrid system that combines symbolic reasoning with large language models (LLMs) to address contrastive queries in scheduling problems. TRACE-cs leverages SAT solving techniques to encode scheduling constraints and generate explanations for user queries, while utilizing an LLM to process the user queries into logical clauses as well as refine the explanations generated by the symbolic solver to natural language sentences. By integrating these components, our approach demonstrates the potential of combining symbolic methods with LLMs to create explainable AI agents with correctness guarantees.
... The majority of modern LLMs are pre-trained with an autoregressive objective. Recent studies suggest that autoregressive objectives used during pre-training may have unexpected impacts on LLM behavior (McCoy et al. 2023). Since the pre-training process of autoregressive models is more similar to generation than discrimination, we hypothesize that SELF-[IN]CORRECT is also partially caused by the use of the autoregressive pre-training objective. ...
Article
Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously-generated alternatives than generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance only through their own judgment.
... Even as language models (LMs) are becoming increasingly proficient reasoners, progress has been "jagged" (Karpathy, 2024;Roose, 2025): today's frontier models surpass experts at science and math reasoning (e.g., Hendrycks et al., 2021;Rein et al., 2023;Wang et al., 2024b) yet still routinely struggle with counting, arithmetic, tic-tac-toe, metered poetry, and other intuitively simple tasks (Ball et al., 2024;McCoy et al., 2023;Xu & Ma, 2024). For example, even very capable LMs have difficulty writing a coherent sentence under the constraints in Fig. 1, which are manageable for most proficient English speakers. ...
Preprint
While test-time reasoning enables language models to tackle complex tasks, searching or planning in natural language can be slow, costly, and error-prone. But even when LMs struggle to emulate the precise reasoning steps needed to solve a problem, they often excel at describing its abstract structure: both how to verify solutions and how to search for them. This paper introduces DisCIPL, a method for "self-steering" LMs where a Planner model generates a task-specific inference program that is executed by a population of Follower models. Our approach equips LMs with the ability to write recursive search procedures that guide LM inference, enabling new forms of verifiable and efficient reasoning. When instantiated with a small Follower (e.g., Llama-3.2-1B), DisCIPL matches (and sometimes outperforms) much larger models, including GPT-4o and o1, on challenging constrained generation tasks. In decoupling planning from execution, our work opens up a design space of highly parallelized Monte Carlo inference strategies that outperform standard best-of-N sampling, require no finetuning, and can be implemented automatically by existing LMs.
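For comparison, the standard best-of-N baseline mentioned above can be sketched in a few lines (hypothetical generate and verifier_score callables; DisCIPL replaces this with planner-written inference programs):

def best_of_n(prompt, generate, verifier_score, n=16):
    # Draw n candidate completions and return the one the verifier scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)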
... Even though the tax authorities could add a disclaimer to the answers provided by chatbots, doubts remain about whether these systems should be built at all if correctness cannot be guaranteed. Another marked limitation is their lower accuracy when dealing with topics that are underrepresented in the training data, particularly in low-resource languages, where the sample size hinders chatbot performance (McCoy et al., 2024). As a result, outputs may lack depth or precision in these areas, potentially leading to misinformation. ...
Article
Full-text available
Artificial intelligence (AI) is revolutionizing tax compliance and fraud detection, offering tax authorities unprecedented opportunities to enhance their efficiency and accuracy. This article explores the AI-based tax technology tools designed and implemented by the Estonian Tax and Customs Board in its daily operations. It analyses three main applications of AI: language models to assist taxpayers in their searches, network models to uncover connections between entities and individuals, and X-ray imaging systems to identify the nature of goods in packages for customs purposes. These technologies aim to assist taxpayers in understanding their rights, while simultaneously preventing and detecting tax fraud. Beyond the technological aspects, this discussion addresses the contributions of these tools to the legal sphere and their compatibility with the proactive law approach.
... Just as cognitive science had to move beyond purely computational models to explain human cognition and linguistic ability, there is an ongoing debate about whether AI systems need similar grounding to achieve genuine language understanding and commonsense reasoning [4,36]. While some recent work suggests that Large Language Models (LLMs) can grasp physical concepts through text alone [28], there are reasons to be skeptical about whether this statistical learning can capture the full depth of human conceptual understanding [23,25]. For instance, [29] highlights that LLMs employing in-context learning face significant challenges with tasks that require extensive specification, particularly those where even human annotators must carefully review a complex set of annotation guidelines to perform the task correctly. ...
Preprint
Full-text available
Despite advances in embodied AI, agent reasoning systems still struggle to capture the fundamental conceptual structures that humans naturally use to understand and interact with their environment. To address this, we propose a novel framework that bridges embodied cognition theory and agent systems by leveraging a formal characterization of image schemas, which are defined as recurring patterns of sensorimotor experience that structure human cognition. By customizing LLMs to translate natural language descriptions into formal representations based on these sensorimotor patterns, we will be able to create a neurosymbolic system that grounds the agent's understanding in fundamental conceptual structures. We argue that such an approach enhances both efficiency and interpretability while enabling more intuitive human-agent interactions through shared embodied understanding.
... The inability to critically examine each step of an argument goes hand in hand with the inability to backtrack when a step is found to be incorrect and to revise the argument accordingly. Because of these marked differences, which are intrinsic to the generation mechanisms of generative AI, McCoy et al. (2023) stress the importance of viewing LLMs not as mathematical problem solvers, but rather as statistical word-prediction systems that are being used to solve mathematical problems. Failures can then be understood directly in terms of a conflict between the word-prediction task and the mathematical problem-solving task. ...
... This prompts reconsideration of complex tasks once thought uniquely human. While LLMs excel at information-intensive tasks, they lack general reasoning capabilities [43][44][45][46][47] and are grounded in human-derived data. This raises questions about their efficacy in scenarios requiring original thinking or high-order cognition and about potential bias propagation [48]. ...
Article
Full-text available
Background/Objectives: Artificial intelligence (AI), particularly large language models (LLMs), has demonstrated versatility in various applications but faces challenges in specialized domains like neurology. This study evaluates a specialized LLM’s capability and trustworthiness in complex neurological diagnosis, comparing its performance to neurologists in simulated clinical settings. Methods: We deployed GPT-4 Turbo (OpenAI, San Francisco, CA, US) through Neura (Sciense, New York, NY, US), an AI infrastructure with a dual-database architecture integrating “long-term memory” and “short-term memory” components on a curated neurological corpus. Five representative clinical scenarios were presented to 13 neurologists and the AI system. Participants formulated differential diagnoses based on initial presentations, followed by definitive diagnoses after receiving conclusive clinical information. Two senior academic neurologists blindly evaluated all responses, while an independent investigator assessed the verifiability of AI-generated information. Results: AI achieved a significantly higher normalized score (86.17%) compared to neurologists (55.11%, p < 0.001). For differential diagnosis questions, AI scored 85% versus 46.15% for neurologists, and for final diagnosis, 88.24% versus 70.93%. AI obtained 15 maximum scores in its 20 evaluations and responded in under 30 s compared to neurologists’ average of 9 min. All AI-provided references were classified as relevant with no hallucinatory content detected. Conclusions: A specialized LLM demonstrated superior diagnostic performance compared to practicing neurologists across complex clinical challenges. This indicates that appropriately harnessed LLMs with curated knowledge bases can achieve domain-specific relevance in complex clinical disciplines, suggesting potential for AI as a time-efficient asset in clinical practice.
... However, there is some indication that these models sometimes rely on surface-level heuristics and fail in situations that are straightforward for humans (McCoy et al., 2019; Ettinger, 2020). More generally, language models have been shown to struggle in out-of-domain situations (McCoy et al., 2024) and to have difficulty applying linguistic paradigms to nonce words (Weissweiler et al., 2023) and rare syntactic constructions (Scivetti et al., 2025). ...
Preprint
Full-text available
Construction Grammar hypothesizes that knowledge of a language consists chiefly of knowledge of form-meaning pairs ("constructions") that include vocabulary, general grammar rules, and even idiosyncratic patterns. Recent work has shown that transformer language models represent at least some constructional patterns, including ones where the construction is rare overall. In this work, we probe BERT's representation of the form and meaning of a minor construction of English, the NPN (noun-preposition-noun) construction (exhibited in such expressions as face to face and day to day), which is known to be polysemous. We construct a benchmark dataset of semantically annotated corpus instances (including distractors that superficially resemble the construction). With this dataset, we train and evaluate probing classifiers. They achieve decent discrimination of the construction from distractors, as well as sense disambiguation among true instances of the construction, revealing that BERT embeddings carry indications of the construction's semantics. Moreover, artificially permuting the word order of true construction instances causes them to be rejected, indicating sensitivity to matters of form. We conclude that BERT does latently encode at least some knowledge of the NPN construction going beyond a surface syntactic pattern and lexical cues.
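A probing classifier of the general kind described above can be sketched as follows, using mean-pooled BERT embeddings and a logistic-regression probe (toy example sentences and labels; the authors' benchmark and probe setup may differ):

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    # Mean-pool final-layer token embeddings into a single sentence vector.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0).numpy()

texts = ["they stood face to face", "he left to face the judge"]  # toy instance vs. distractor
labels = [1, 0]                                                   # 1 = NPN construction
probe = LogisticRegression().fit([embed(t) for t in texts], labels)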
... Analysis of this task leads to the prediction that LLMs will perform better when they need to produce a high-probability piece of text than when they need to produce a low-probability piece of text, even in deterministic settings where probability should not matter. Intuitively, this prediction follows from the way in which next-token prediction fundamentally depends on the probabilities of token sequences; this intuition is derived more formally in (McCoy et al., 2024) via a Bayesian analysis of autoregression. ...
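The intuition can be stated compactly: for a deterministic task with input x and correct output y, an autoregressive model effectively scores candidate outputs as

p(y \mid x) \propto p(x \mid y)\, p(y),

so the prior p(y) learned from pre-training text pulls the model toward high-probability outputs even when the task itself is deterministic (a simplified restatement of the Bayesian analysis cited above, not a reproduction of it).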
Preprint
Full-text available
Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on Marr's three levels of analysis. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.
... However, although these approaches achieve results that are considered impressive, they are unreliable and fail in many tasks that appear simple from a human perspective (Berglund et al., 2023; Dziri et al., 2023; Nezhurina, Cipolina-Kun, Cherti, & Jitsev, 2024). They also fail more frequently the less similar the tasks are to those on which they were trained (McCoy, Yao, Friedman, Hardy, & Griffiths, 2023; Wu et al., 2023). Such weaknesses do not occur only in specific approaches, but constitute a general problem in the field of AI (Dohare et al., 2024; Shanahan & Mitchell, 2022). ...
Preprint
Full-text available
The article analyses foundational principles relevant to the creation of artificial general intelligence (AGI). Intelligence is understood as the ability to create novel skills that allow to achieve goals under previously unknown conditions. To this end, intelligence utilises reasoning methods such as deduction, induction and abduction as well as other methods such as abstraction and classification to develop a world model. The methods are applied to indirect and incomplete representations of the world, which are obtained through perception, for example, and which do not depict the world but only correspond to it. Due to these limitations and the uncertain and contingent nature of reasoning, the world model is constructivist. Its value is functionally determined by its viability, i.e., its potential to achieve the desired goals. In consequence, meaning is assigned to representations by attributing them a function that makes it possible to achieve a goal. This representational and functional conception of intelligence enables a naturalistic interpretation that does not presuppose mental features, such as intentionality and consciousness, which are regarded as independent of intelligence. Based on a phenomenological analysis, it is shown that AGI can gain a more fundamental access to the world than humans, although it is limited by the No Free Lunch theorems, which require assumptions to be made.
... Brittleness: Generative models perform worse outside of their training distribution [49,50]. For language models, this includes changes in both language (idiolects, dialects, diachronic changes), content (e.g. ...
Preprint
Full-text available
The rapid adoption of AI across diverse domains has led to the development of organisational guidelines that vary significantly, even within the same sector. This paper examines AI policies in two domains, news organisations and universities, to understand how bottom-up governance approaches shape AI usage and oversight. By analysing these policies, we identify key areas of convergence and divergence in how organisations address risks such as bias, privacy, misinformation, and accountability. We then explore the implications of these findings for international AI legislation, particularly the EU AI Act, highlighting gaps where practical policy insights could inform regulatory refinements. Our analysis reveals that organisational policies often address issues such as AI literacy, disclosure practices, and environmental impact, areas that are underdeveloped in existing international frameworks. We argue that lessons from domain-specific AI policies can contribute to more adaptive and effective AI governance at the global level. This study provides actionable recommendations for policymakers seeking to bridge the gap between local AI practices and international regulations.
... By contrast, even the largest state-of-the-art large language models (LLMs) are still highly challenged by arithmetic and formal logic tasks [5], [6], [7]. This substantial discrepancy between language use and logical reasoning in LMs continues to be heavily studied, yet remains poorly understood. ...
Preprint
Full-text available
Specific empirical phenomena spanning human natural language, and mathematical and logical abilities, are rigorously situated in the well-studied grammar-automata (G-A) hierarchy. We identify three tiers, and the two corresponding transitions, within the hierarchy and show their correspondence to the emergence of particular abilities in humans and in transformer-based language models (LMs). These emergent abilities have often been described in terms of "scaling"; we show that it is the transition between tiers, rather than size itself, that determines a system's capabilities. Specifically, humans effortlessly process language yet require specific training to perform arithmetic or logical reasoning tasks; and LMs possess language abilities absent from predecessor systems yet still struggle with logical processing. The resulting principled analyses provide underlying explanatory accounts of both the abilities and shortfalls of these systems, and suggest actionable insights into the expansion of logic abilities in AI systems.
Article
Few studies have compared Large Language Models (LLMs) to traditional Machine Learning (ML)-based automated scoring methods in terms of accuracy, ethics, and economics. Using a corpus of 1000 expert-scored and interview-validated scientific explanations derived from the ACORNS instrument, this study employed three LLMs and the ML-based scoring engine, EvoGrader. We measured scoring reliability (percentage agreement, kappa, precision, recall, F1), processing time, and explored contextual factors like ethics and cost. Results showed that with very basic prompt engineering, ChatGPT-4o achieved the highest performance across LLMs. Proprietary LLMs outperformed open-weight LLMs for most concepts. GPT-4o achieved robust but less accurate scoring than EvoGrader (~500 additional scoring errors). Ethical concerns over data ownership, as well as reliability and replicability over time, were limitations of the LLMs. EvoGrader offered superior accuracy, reliability, and replicability, but its development required a large, high-quality, human-scored corpus, domain expertise, and restricted assessment items. These findings highlight the range of considerations that should inform choices between LLM- and ML-based scoring in science education. Despite impressive LLM advances, ML approaches may remain valuable in some contexts, particularly those prioritizing precision, reliability, replicability, privacy, and controlled implementation.
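The reliability metrics reported above can be computed with standard tooling; a brief sketch with placeholder labels:

from sklearn.metrics import accuracy_score, cohen_kappa_score, precision_recall_fscore_support

human_scores = [1, 0, 1, 1, 0]     # placeholder human-assigned scores
machine_scores = [1, 0, 1, 0, 0]   # placeholder machine-assigned scores

agreement = accuracy_score(human_scores, machine_scores)      # proportion agreement
kappa = cohen_kappa_score(human_scores, machine_scores)       # chance-corrected agreement
precision, recall, f1, _ = precision_recall_fscore_support(
    human_scores, machine_scores, average="binary")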
Preprint
Full-text available
Despite the widespread use of "artificial intelligence" (AI) framing in Natural Language Processing (NLP) research, it is not clear what researchers mean by "intelligence". To that end, we present the results of a survey on the notion of "intelligence" among researchers and its role in the research agenda. The survey elicited complete responses from 303 researchers from a variety of fields including NLP, Machine Learning (ML), Cognitive Science, Linguistics, and Neuroscience. We identify 3 criteria of intelligence that the community agrees on the most: generalization, adaptability, & reasoning. Our results suggest that the perception of the current NLP systems as "intelligent" is a minority position (29%). Furthermore, only 16.2% of the respondents see developing intelligent systems as a research goal, and these respondents are more likely to consider the current systems intelligent.
Article
Full-text available
Large language models (LLMs) have gained widespread interest owing to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here we introduce ChemBench, an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question–answer pairs, evaluated leading open- and closed-source LLMs and found that the best models, on average, outperformed the best human chemists in our study. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs’ impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.
Article
Full-text available
Artificial Intelligence is a field that lives many lives, and the term has come to encompass a motley collection of scientific and commercial endeavours. In this paper, I articulate the contours of a rather neglected but central scientific role that AI has to play, which I dub "AI-as-exploration". The basic thrust of AI-as-exploration is that of creating and studying systems that can reveal candidate building blocks of intelligence that may differ from the forms of human and animal intelligence we are familiar with. In other words, I suggest that AI is one of the best tools we have for exploring intelligence space, namely the space of possible intelligent systems. I illustrate the value of AI-as-exploration by focusing on a specific case study, i.e., recent work on the capacity to combine novel and invented concepts in humans and Large Language Models. I show that the latter, despite showing human-level accuracy on such a task, most probably solve it in ways radically different from those hypothesised for humans, but no less relevant to intelligence research.
Article
“Synthetic samples” generated by large language models (LLMs) have been argued to complement or replace traditional surveys, assuming their training data is grounded in human-generated data that potentially reflects attitudes and behaviors prevalent in the population. Initial US-based studies that have prompted LLMs to mimic survey respondents found that the responses match survey data. However, the relationship between the respective target population and LLM training data might affect the generalizability of such findings. In this paper, we critically evaluate the use of LLMs for public opinion research in a different context, by investigating whether LLMs can estimate vote choice in Germany. We generate a synthetic sample matching the 2017 German Longitudinal Election Study respondents and ask the LLM GPT-3.5 to predict each respondent’s vote choice. Comparing these predictions to the survey-based estimates on the aggregate and subgroup levels, we find that GPT-3.5 exhibits a bias towards the Green and Left parties. While the LLM predictions capture the tendencies of “typical” voters, they miss more complex factors of vote choice. By examining the LLM-based prediction of voting behavior in a non-English speaking context, our study contributes to research on the extent to which LLMs can be leveraged for studying public opinion. The findings point to disparities in opinion representation in LLMs and underscore the limitations in applying them for public opinion estimation.
Preprint
Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.
Preprint
Full-text available
Some things are impossible, but some things may be even more impossible than impossible. Levitating a feather using one's mind is impossible in our world, but fits into our intuitive theories of possible worlds, whereas levitating a feather using the number five cannot be conceived in any possible world ("inconceivable"). While prior work has examined the distinction between improbable and impossible events, there has been little empirical research on inconceivability. Here, we investigate whether people maintain a distinction between impossibility and inconceivability, and how such distinctions might be made. We find that people can readily distinguish the impossible from the inconceivable, using categorization studies similar to those used to investigate the differences between impossible and improbable (Experiment 1). However, this distinction is not explained by people's subjective ratings of event likelihood, which are near zero and indistinguishable between impossible and inconceivable event descriptions (Experiment 2). Finally, we ask whether the probabilities assigned to event descriptions by statistical language models (LMs) can be used to separate modal categories, and whether these probabilities align with people's ratings (Experiment 3). We find high-level similarities between people and LMs: both distinguish among impossible and inconceivable event descriptions, and LM-derived string probabilities predict people's ratings of event likelihood across modal categories. Our findings suggest that fine-grained knowledge about exceedingly rare events (i.e., the impossible and inconceivable) may be learned via statistical learning over linguistic forms, yet leave open the question of whether people represent the distinction between impossible and inconceivable as a difference not of degree, but of kind.
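The LM-derived string probabilities used in Experiment 3 are typically obtained by summing token log-probabilities under a causal language model; a minimal sketch with GPT-2 (not necessarily the authors' exact setup):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def string_logprob(text):
    # Sum of log P(token_t | tokens_<t) over the whole string.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return logprobs.gather(2, targets.unsqueeze(-1)).sum().item()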
Article
Full-text available
How should we compare the capabilities of language models (LMs) and humans? In this article, I draw inspiration from comparative psychology to highlight challenges in these comparisons. I focus on a case study: processing of recursively nested grammatical structures. Prior work suggests that LMs cannot process these structures as reliably as humans can. However, the humans were provided with instructions and substantial training, while the LMs were evaluated zero-shot. I therefore match the evaluation more closely. Providing large LMs with a simple prompt—with substantially less content than the human training—allows the LMs to consistently outperform the human results, even in more deeply nested conditions than were tested with humans. Furthermore, the effects of prompting are robust to the particular structures and vocabulary used in the prompt. Finally, reanalyzing the existing human data suggests that the humans may not perform above chance at the difficult structures initially. Thus, large LMs may indeed process recursively nested grammatical structures as reliably as humans, when evaluated comparably. This case study highlights how discrepancies in the evaluation methods can confound comparisons of language models and humans. I conclude by reflecting on the broader challenge of comparing human and model capabilities, and highlight an important difference between evaluating cognitive models and foundation models.
Article
Full-text available
The recent advent of large language models has reinvigorated debate over whether human cognitive capacities might emerge in such generic models given sufficient training data. Of particular interest is the ability of these models to reason about novel problems zero-shot, without any direct training. In human cognition, this capacity is closely tied to an ability to reason by analogy. Here we performed a direct comparison between human reasoners and a large language model (the text-davinci-003 variant of Generative Pre-trained Transformer (GPT)-3) on a range of analogical tasks, including a non-visual matrix reasoning task based on the rule structure of Raven’s Standard Progressive Matrices. We found that GPT-3 displayed a surprisingly strong capacity for abstract pattern induction, matching or even surpassing human capabilities in most settings; preliminary tests of GPT-4 indicated even better performance. Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems.
Article
Full-text available
Language models (LMs) like GPT‐3, PaLM, and ChatGPT are the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of LMs. LMs can serve many purposes and their behavior should satisfy many desiderata. To navigate the vast space of potential scenarios and metrics, we taxonomize the space and select representative subsets. We evaluate models on 16 core scenarios and 7 metrics, exposing important trade‐offs. We supplement our core evaluation with seven targeted evaluations to deeply analyze specific aspects (including world knowledge, reasoning, regurgitation of copyrighted content, and generation of disinformation). We benchmark 30 LMs, from OpenAI, Microsoft, Google, Meta, Cohere, AI21 Labs, and others. Prior to HELM, models were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: all 30 models are now benchmarked under the same standardized conditions. Our evaluation surfaces 25 top‐level findings. For full transparency, we release all raw model prompts and completions publicly. HELM is a living benchmark for the community, continuously updated with new scenarios, metrics, and models https://crfm.stanford.edu/helm/latest/.
Article
Full-text available
We study GPT-3, a recent large language model, using tools from cognitive psychology. More specifically, we assess GPT-3's decision-making, information search, deliberation, and causal reasoning abilities on a battery of canonical experiments from the literature. We find that much of GPT-3's behavior is impressive: It solves vignette-based tasks similarly or better than human subjects, is able to make decent decisions from descriptions, outperforms humans in a multiarmed bandit task, and shows signatures of model-based reinforcement learning. Yet, we also find that small perturbations to vignette-based tasks can lead GPT-3 vastly astray, that it shows no signatures of directed exploration, and that it fails miserably in a causal reasoning task. Taken together, these results enrich our understanding of current large language models and pave the way for future investigations using tools from cognitive psychology to study increasingly capable and opaque artificial agents.
Article
Full-text available
The rise of machine-learning systems that process sensory input has brought with it a rise in comparisons between human and machine perception. But such comparisons face a challenge: Whereas machine perception of some stimulus can often be probed through direct and explicit measures, much of human perceptual knowledge is latent, incomplete, or unavailable for explicit report. Here, we explore how this asymmetry can cause such comparisons to misestimate the overlap in human and machine perception. As a case study, we consider human perception of adversarial speech - synthetic audio commands that are recognized as valid messages by automated speech-recognition systems but that human listeners reportedly hear as meaningless noise. In five experiments, we adapt task designs from the human psychophysics literature to show that even when subjects cannot freely transcribe such speech commands (the previous benchmark for human understanding), they can sometimes demonstrate other forms of understanding, including discriminating adversarial speech from closely matched nonspeech (Experiments 1 and 2), finishing common phrases begun in adversarial speech (Experiments 3 and 4), and solving simple math problems posed in adversarial speech (Experiment 5) - even for stimuli previously described as unintelligible to human listeners. We recommend the adoption of such "sensitive tests" when comparing human and machine perception, and we discuss the broader consequences of such approaches for assessing the overlap between systems.
Article
Full-text available
Significance Language is a quintessentially human ability. Research has long probed the functional architecture of language in the mind and brain using diverse neuroimaging, behavioral, and computational modeling approaches. However, adequate neurally-mechanistic accounts of how meaning might be extracted from language are sorely lacking. Here, we report a first step toward addressing this gap by connecting recent artificial neural networks from machine learning to human recordings during language processing. We find that the most powerful models predict neural and behavioral responses across different datasets up to noise levels. Models that perform better at predicting the next word in a sequence also better predict brain measurements—providing computationally explicit evidence that predictive processing fundamentally shapes the language comprehension mechanisms in the brain.
Article
Full-text available
Does the human mind resemble the machines that can behave like it? Biologically inspired machine-learning systems approach “human-level” accuracy in an astounding variety of domains, and even predict human brain activity—raising the exciting possibility that such systems represent the world like we do. However, even seemingly intelligent machines fail in strange and “unhumanlike” ways, threatening their status as models of our minds. How can we know when human–machine behavioral differences reflect deep disparities in their underlying capacities, vs. when such failures are only superficial or peripheral? This article draws on a foundational insight from cognitive science—the distinction between performance and competence —to encourage “species-fair” comparisons between humans and machines. The performance/competence distinction urges us to consider whether the failure of a system to behave as ideally hypothesized, or the failure of one creature to behave like another, arises not because the system lacks the relevant knowledge or internal capacities (“competence”), but instead because of superficial constraints on demonstrating that knowledge (“performance”). I argue that this distinction has been neglected by research comparing human and machine behavior, and that it should be essential to any such comparison. Focusing on the domain of image classification, I identify three factors contributing to the species-fairness of human–machine comparisons, extracted from recent work that equates such constraints. Species-fair comparisons level the playing field between natural and artificial intelligence, so that we can separate more superficial differences from those that may be deep and enduring.
Article
Full-text available
It is well known that real-time human language processing is highly incremental and context-driven, and that the strength of a comprehender's expectation for each word encountered is a key determinant of the difficulty of integrating that word into the preceding context. In reading, this differential difficulty is largely manifested in the amount of time taken to read each word. While numerous studies over the past thirty years have shown expectation-based effects on reading times driven by lexical, syntactic, semantic, pragmatic, and other information sources, there has been little progress in establishing the quantitative relationship between expectation (or prediction) and reading times. Here, by combining a state-of-the-art computational language model, two large behavioral data-sets, and non-parametric statistical techniques, we establish for the first time the quantitative form of this relationship, finding that it is logarithmic over six orders of magnitude in estimated predictability. This result is problematic for a number of established models of eye movement control in reading, but lends partial support to an optimal perceptual discrimination account of word recognition. We also present a novel model in which language processing is highly incremental well below the level of the individual word, and show that it predicts both the shape and time-course of this effect. At a more general level, this result provides challenges for both anticipatory processing and semantic integration accounts of lexical predictability effects. And finally, this result provides evidence that comprehenders are highly sensitive to relative differences in predictability - even for differences between highly unpredictable words - and thus helps bring theoretical unity to our understanding of the role of prediction at multiple levels of linguistic structure in real-time language comprehension.
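In other words, the finding is that reading time is approximately linear in surprisal, RT(w) ≈ a + b · (-log P(w | context)), so equal ratios of predictability correspond to equal increments in reading time (a schematic restatement with free parameters a and b, not the authors' fitted model).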
Article
Full-text available
Sentence processing theories typically assume that the input to our language processing mechanisms is an error-free sequence of words. However, this assumption is an oversimplification because noise is present in typical language use (for instance, due to a noisy environment, producer errors, or perceiver errors). A complete theory of human sentence comprehension therefore needs to explain how humans understand language given imperfect input. Indeed, like many cognitive systems, language processing mechanisms may even be "well designed"-in this case for the task of recovering intended meaning from noisy utterances. In particular, comprehension mechanisms may be sensitive to the types of information that an idealized statistical comprehender would be sensitive to. Here, we evaluate four predictions about such a rational (Bayesian) noisy-channel language comprehender in a sentence comprehension task: (i) semantic cues should pull sentence interpretation towards plausible meanings, especially if the wording of the more plausible meaning is close to the observed utterance in terms of the number of edits; (ii) this process should asymmetrically treat insertions and deletions due to the Bayesian "size principle"; such nonliteral interpretation of sentences should (iii) increase with the perceived noise rate of the communicative situation and (iv) decrease if semantically anomalous meanings are more likely to be communicated. These predictions are borne out, strongly suggesting that human language relies on rational statistical inference over a noisy channel.
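Schematically, the rational noisy-channel comprehender described here weighs a prior over intended sentences against a noise model, P(intended | perceived) ∝ P(perceived | intended) · P(intended), where the likelihood term favors interpretations whose wording is close to the observed utterance in terms of the number of edits (a compact restatement of the paper's predictions, not its full model).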
Article
Full-text available
Humans routinely generalize universal relationships to unfamiliar instances. If we are told "if glork then frum," and "glork," we can infer "frum"; any name that serves as the subject of a sentence can appear as the object of a sentence. These universals are pervasive in language and reasoning. One account of how they are generalized holds that humans possess mechanisms that manipulate symbols and variables; an alternative account holds that symbol-manipulation can be eliminated from scientific theories in favor of descriptions couched in terms of networks of interconnected nodes. Can these "eliminative" connectionist models offer a genuine alternative? This article shows that eliminative connectionist models cannot account for how we extend universals to arbitrary items. The argument runs as follows. First, if these models, as currently conceived, were to extend universals to arbitrary instances, they would have to generalize outside the space of training examples. Next, it is shown that the class of eliminative connectionist models that is currently popular cannot learn to extend universals outside the training space. This limitation might be avoided through the use of an architecture that implements symbol manipulation.
Article
Full-text available
In this paper three problems for a connectionist account of language are considered: 1. What is the nature of linguistic representations? 2. How can complex structural relationships such as constituent structure be represented? 3. How can the apparently open-ended nature of language be accommodated by a fixed-resource system? Using a prediction task, a simple recurrent network (SRN) is trained on multiclausal sentences which contain multiply-embedded relative clauses. Principal component analysis of the hidden unit activation patterns reveals that the network solves the task by developing complex distributed representations which encode the relevant grammatical relations and hierarchical constituent structure. Differences between the SRN state representations and the more traditional pushdown store are discussed in the final section.
Article
Abstract reasoning is a key ability for an intelligent system. Large language models (LMs) achieve above-chance performance on abstract reasoning tasks but exhibit many imperfections. However, human abstract reasoning is also imperfect. Human reasoning is affected by our real-world knowledge and beliefs, and shows notable “content effects”; humans reason more reliably when the semantic content of a problem supports the correct logical inferences. These content-entangled reasoning patterns are central to debates about the fundamental nature of human intelligence. Here, we investigate whether language models—whose prior expectations capture some aspects of human knowledge—similarly mix content into their answers to logic problems. We explored this question across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task. We evaluate state-of-the-art LMs, as well as humans, and find that the LMs reflect many of the same qualitative human patterns on these tasks—like humans, models answer more accurately when the semantic content of a task supports the logical inferences. These parallels are reflected in accuracy patterns, and in some lower-level features like the relationship between LM confidence over possible answers and human response times. However, in some cases the humans and models behave differently—particularly on the Wason task, where humans perform much worse than large models, and exhibit a distinct error pattern. Our findings have implications for understanding possible contributors to these human cognitive effects, as well as the factors that influence language model performance.
Article
Interacting with a contemporary LLM-based conversational agent can create an illusion of being in the presence of a thinking creature. Yet, in their very nature, such systems are fundamentally not like us.
Article
Planning underpins the impressive flexibility of goal-directed behavior. However, even when planning, people can display surprising rigidity in how they think about problems (e.g., “functional fixedness”) that lead them astray. How can our capacity for behavioral flexibility be reconciled with our susceptibility to conceptual inflexibility? We propose that these tendencies reflect avoidance of two cognitive costs: the cost of representing task details and the cost of switching between representations. To test this hypothesis, we developed a novel paradigm that affords participants opportunities to choose different families of simplified representations to plan. In two preregistered, online studies ( Ns = 377 and 294 adults), we found that participants’ optimal behavior, suboptimal behavior, and reaction time were explained by a computational model that formalized people’s avoidance of representational complexity and switching. These results demonstrate how the selection of simplified, rigid representations leads to the otherwise puzzling combination of flexibility and inflexibility observed in problem solving.
Article
In 1967, Marvin Minsky, a founder of the field of artificial intelligence (AI), made a bold prediction: "Within a generation…the problem of creating 'artificial intelligence' will be substantially solved." Assuming that a generation is about 30 years, Minsky was clearly overoptimistic. But now, nearly two generations later, how close are we to the original goal of human-level (or greater) intelligence in machines?
Article
We survey a current, heated debate in the artificial intelligence (AI) research community on whether large pretrained language models can be said to understand language, and the physical and social situations language encodes, in any humanlike sense. We describe arguments that have been made for and against such understanding and key questions for the broader sciences of intelligence that have arisen in light of these arguments. We contend that an extended science of intelligence can be developed that will provide insight into distinct modes of understanding, their strengths and limitations, and the challenge of integrating diverse forms of cognition.
Article
Recent progress in artificial intelligence provides the opportunity to ask the question of what is unique about human intelligence, but with a new comparison class. I argue that we can understand human intelligence, and the ways in which it may differ from artificial intelligence, by considering the characteristics of the kind of computational problems that human minds have to solve. I claim that these problems acquire their structure from three fundamental limitations that apply to human beings: limited time, limited computation, and limited communication. From these limitations we can derive many of the properties we associate with human intelligence, such as rapid learning, the ability to break down problems into parts, and the capacity for cumulative cultural evolution.
Article
This paper explores the knowledge of linguistic structure learned by large artificial neural networks, trained via self-supervision, whereby the model simply tries to predict a masked word in a given context. Human language communication is via sequences of words, but language understanding requires constructing rich hierarchical structures that are never observed explicitly. The mechanisms for this have been a prime mystery of human language acquisition, while engineering work has mainly proceeded by supervised learning on treebanks of sentences hand labeled for this latent structure. However, we demonstrate that modern deep contextual language models learn major aspects of this structure, without any explicit supervision. We develop methods for identifying linguistic hierarchical structure emergent in artificial neural networks and demonstrate that components in these models focus on syntactic grammatical relationships and anaphoric coreference. Indeed, we show that a linear transformation of learned embeddings in these models captures parse tree distances to a surprising degree, allowing approximate reconstruction of the sentence tree structures normally assumed by linguists. These results help explain why these models have brought such large improvements across many language-understanding tasks.
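The "linear transformation of learned embeddings" mentioned above is commonly realized as a distance probe: a matrix B is trained so that squared distances between transformed word vectors approximate parse-tree distances. A minimal sketch with toy tensors (not the authors' exact training code):

import torch

dim, rank = 768, 64
B = torch.randn(rank, dim, requires_grad=True)   # probe parameters
opt = torch.optim.Adam([B], lr=1e-3)

def probe_loss(embeddings, tree_dists):
    # embeddings: (n_words, dim); tree_dists: (n_words, n_words) gold tree distances.
    diffs = embeddings.unsqueeze(0) - embeddings.unsqueeze(1)   # (n, n, dim)
    pred = (diffs @ B.T).pow(2).sum(-1)                         # squared probe distances
    return (pred - tree_dists).abs().mean()

# Toy usage: random stand-ins for contextual embeddings and gold tree distances.
emb = torch.randn(5, dim)
gold = torch.randint(0, 5, (5, 5)).float()
loss = probe_loss(emb, gold)
loss.backward()
opt.step()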
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
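At the core of this architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension; multi-head attention applies this operation in several learned subspaces in parallel.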
Article
An adaptationist programme has dominated evolutionary thought in England and the United States during the past 40 years. It is based on faith in the power of natural selection as an optimizing agent. It proceeds by breaking an oragnism into unitary 'traits' and proposing an adaptive story for each considered separately. Trade-offs among competing selective demands exert the only brake upon perfection; non-optimality is thereby rendered as a result of adaptation as well. We criticize this approach and attempt to reassert a competing notion (long popular in continental Europe) that organisms must be analysed as integrated wholes, with Baupläne so constrained by phyletic heritage, pathways of development and general architecture that the constraints themselves become more interesting and more important in delimiting pathways of change than the selective force that may mediate change when it occurs. We fault the adaptationist programme for its failure to distinguish current utility from reasons for origin (male tyrannosaurs may have used their diminutive front legs to titillate female partners, but this will not explain why they got so small); for its unwillingness to consider alternatives to adaptive stories; for its reliance upon plausibility alone as a criterion for accepting speculative tales; and for its failure to consider adequately such competing themes as random fixation of alleles, production of non-adaptive structures by developmental correlation with selected features (allometry, pleiotropy, material compensation, mechanically forced correlation), the separability of adaptation and selection, multiple adaptive peaks, and current utility as an epiphenomenon of non-adaptive structures. We support Darwin's own pluralistic approach to identifying the agents of evolutionary change.
Article
Debate about adaptationism in biology continues, in part because within "the" problem of assessing adaptationism, three distinct problems are mixed together. The three problems concern the assessment of three distinct adaptationist positions, each of which asserts the central importance of adaptation and natural selection to the study of evolution, but conceives this importance in a different way. As there are three kinds of adaptationism, there are three distinct "anti-adaptationist" positions as well. Or putting it more formally, there are three different dimensions here, and strongly adaptationist views, strongly anti-adaptationist views, and moderate views are possible for each dimension. Understanding the distinctions between the three adaptationist positions will not remove all controversy, but some progress can be made through clarifying the distinctions. In particular, progress can be made by recognizing that evidence against one kind of adaptationism need not also be evidence against other kinds. So the main aims of this paper are classification and clarification. I will describe the three kinds of adaptationism, and then discuss the evidence relevant to each. In particular, I will try to say which problems might be solved directly through empirical research, and which are more philosophical in character.
Article
Several problems, all solvable by one somewhat complex procedure, are presented in succession. If afterwards a similar task is given which can be solved by a more direct and simple method, will the individual be blinded to this direct possibility ( Einstellung)? If a blinding effect does result, will it be of characteristically different strength in groups that differ in educational level, age, etc.? Moreover, if we introduce means to save the subjects or to rescue them from such blindness, will these means readily work? Will they operate differently in various groups? And what may be the real cause for the blinding effect? How are we to understand this phenomenon? (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Subjects read passages of text which differed in the extent to which the context constrained or predicted the occurrence of particular target words. In Experiment 1, misspellings were introduced into target words and we examined the extent to which fixation duration and probability of fixating the target word was influenced by contextual constraint and the misspelling. Subjects had a lower probability of fixating the target word in the high-constraint passages than in the low-constraint passages. Furthermore, when subjects did fixate the target, the fixation duration was shorter in the high-constraint passages. In Experiment 2, subjects read passages which included either a predictable target word or a visually similar word which was unpredictable. Fixation durations on the target word were shorter when the predictable word was in the target location than when the unpredictable word was present. The implications of the results for the role of contextual constraint in reading are discussed.
Article
This paper explores differences between Connectionist proposals for cognitive architecture and the sorts of models that have traditionally been assumed in cognitive science. We claim that the major distinction is that, while both Connectionist and Classical architectures postulate representational mental states, the latter but not the former are committed to a symbol-level of representation, or to a 'language of thought': i.e., to representational states that have combinatorial syntactic and semantic structure. Several arguments for combinatorial structure in mental representations are then reviewed. These include arguments based on the 'systematicity' of mental representation: i.e., on the fact that cognitive capacities always exhibit certain symmetries, so that the ability to entertain a given thought implies the ability to entertain thoughts with semantically related contents. We claim that such arguments make a powerful case that mind/brain architecture is not Connectionist at the cognitive level. We then consider the possibility that Connectionism may provide an account of the neural (or 'abstract neurological') structures in which Classical cognitive architecture is implemented. We survey a number of the standard arguments that have been offered in favor of Connectionism, and conclude that they are coherent only on this interpretation.
Article
A psychological space is established for any set of stimuli by determining metric distances between the stimuli such that the probability that a response learned to any stimulus will generalize to any other is an invariant monotonic function of the distance between them. To a good approximation, this probability of generalization (i) decays exponentially with this distance, and (ii) does so in accordance with one of two metrics, depending on the relation between the dimensions along which the stimuli vary. These empirical regularities are mathematically derivable from universal principles of natural kinds and probabilistic geometry that may, through evolutionary internalization, tend to govern the behaviors of all sentient organisms.
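Stated as a formula, the regularity is that the probability of generalization decays approximately exponentially with psychological distance d, g(d) ≈ exp(-k·d) for some scaling constant k, with the distance computed under one of the two metrics mentioned above.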
Article
Teleological explanations (TEs) account for the existence or properties of an entity in terms of a function: we have hearts because they pump blood, and telephones for communication. While many teleological explanations seem appropriate, others are clearly not warranted--for example, that rain exists for plants to grow. Five experiments explore the theoretical commitments that underlie teleological explanations. With the analysis of [Wright, L. (1976). Teleological Explanations. Berkeley, CA: University of California Press] from philosophy as a point of departure, we examine in Experiment 1 whether teleological explanations are interpreted causally, and confirm that TEs are only accepted when the function invoked in the explanation played a causal role in bringing about what is being explained. However, we also find that playing a causal role is not sufficient for all participants to accept TEs. Experiment 2 shows that this is not because participants fail to appreciate the causal structure of the scenarios used as stimuli. In Experiments 3-5 we show that the additional requirement for TE acceptance is that the process by which the function played a causal role must be general in the sense of conforming to a predictable pattern. These findings motivate a proposal, Explanation for Export, which suggests that a psychological function of explanation is to highlight information likely to subserve future prediction and intervention. We relate our proposal to normative accounts of explanation from philosophy of science, as well as to claims from psychology and artificial intelligence.
Article
This paper investigates the role of resource allocation as a source of processing difficulty in human sentence comprehension. The paper proposes a simple information-theoretic characterization of processing difficulty as the work incurred by resource reallocation during parallel, incremental, probabilistic disambiguation in sentence comprehension, and demonstrates its equivalence to the theory of Hale [Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. In Proceedings of NAACL (Vol. 2, pp. 159-166)], in which the difficulty of a word is proportional to its surprisal (its negative log-probability) in the context within which it appears. This proposal subsumes and clarifies findings that high-constraint contexts can facilitate lexical processing, and connects these findings to well-known models of parallel constraint-based comprehension. In addition, the theory leads to a number of specific predictions about the role of expectation in syntactic comprehension, including the reversal of locality-based difficulty patterns in syntactically constrained contexts, and conditions under which increased ambiguity facilitates processing. The paper examines a range of established results bearing on these predictions, and shows that they are largely consistent with the surprisal theory.
Article
In human sentence processing, cognitive load can be defined many ways. This report considers a definition of cognitive load in terms of the total probability of structural options that have been disconfirmed at some point in a sentence: the surprisal of word w_i given its prefix w_0...w_{i-1} on a phrase-structural language model. These loads can be efficiently calculated using a probabilistic Earley parser (Stolcke, 1995), which is interpreted as generating predictions about reading time on a word-by-word basis. Under grammatical assumptions supported by corpus-frequency data, the operation of Stolcke's probabilistic Earley parser correctly predicts processing phenomena associated with garden path structural ambiguity and with the subject/object relative asymmetry.
Llama 2: Open foundation and fine-tuned chat models
  • H Touvron
Language models are unsupervised multitask learners
  • A Radford
Attention is all you need
  • A Vaswani
Mechanistic versus functional understanding. In Varieties of Understanding: New Perspectives from Philosophy, Psychology, and Theology
  • T Lombrozo
  • D Wilkenfeld