Michael Hahn’s research while affiliated with Saarland University and other places


Publications (44)


Figure 6: Loss (MSE) on ICL on linear functions when trained with attention links ablated during training, for transformers with l = 4, 6, 8, 12 layers. The PARALLEL circuit performs much better than a random baseline (loss 20.0), but underperforms the full model by a large margin in shallow transformers. Adding links within x's or y's improves performance, particularly the latter. Adding both performs on par with the full model, in line with our findings on real-world tasks in Gemma-2.
Figure 7: Accuracy of circuits by prompt length (Gemma-2 2B). Compare Table 2.
Figure 14: Error patterns: 10 shots (top) and 3 shots (bottom), y_i → t_{N+1} (Capitalization). Perturbing the functional behavior but leaving the output space intact leads to incorrect responses staying in the correct output space (capitalized tokens; dark blue). Disrupting the output space leads to production of other tokens. See Appendix F for further information.
Figure 15: Error patterns for ablations in the (aggregation-only) circuit for Copying. Breaking the functional relationship between x and y often leads to reproduction of other tokens from the corrupted prompt (light green), or the production of other tokens (yellow). See Appendix F for further information.
Figure 20: Error patterns for ablations in the x_i → x_{i+1} edges for Country-Capital. The y-axis denotes the number of datapoints in each class. Corrupting the input-space information in these edges leads to a substantial fraction of copying responses, i.e., the functional input-output behavior breaks down, in particular at 3 shots. See Appendix F for further information.


Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B
  • Preprint
  • File available

March 2025

·

2 Reads

Aleksandra Bakalova

·

Yana Veitsman

·

Xinting Huang

·

Michael Hahn

In-Context Learning (ICL) is an intriguing ability of large language models (LLMs). Despite a substantial amount of work on its behavioral aspects and how it emerges in miniature setups, it remains unclear which mechanism assembles task information from the individual examples in a few-shot prompt. We use causal interventions to identify information flow in Gemma-2 2B for five naturalistic ICL tasks. We find that the model infers task information using a two-step strategy we call contextualize-then-aggregate: In the lower layers, the model builds up representations of individual few-shot examples, which are contextualized by preceding examples through connections between few-shot input and output tokens across the sequence. In the higher layers, these representations are aggregated to identify the task and prepare prediction of the next output. The importance of the contextualization step differs between tasks, and it may become more important in the presence of ambiguous examples. Overall, by providing rigorous causal analysis, our results shed light on the mechanisms through which ICL happens in language models.
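
To make the kind of causal intervention concrete, here is a minimal sketch of ablating a single attention edge in a toy single-head attention layer. The toy model and variable names are illustrative only; this is not the paper's Gemma-2 analysis code.

```python
# Sketch: ablating individual attention edges (query_pos -> key_pos links)
# in a toy causal self-attention layer, as in causal information-flow analyses.
import numpy as np

def self_attention(X, Wq, Wk, Wv, blocked_edges=()):
    """X: (T, d) token states; blocked_edges: (query_pos, key_pos) pairs
    whose attention link is ablated (masked out before the softmax)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    T = X.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[causal] = -np.inf                      # causal mask
    for q_pos, k_pos in blocked_edges:            # ablate chosen edges
        scores[q_pos, k_pos] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
T, d = 6, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
clean = self_attention(X, Wq, Wk, Wv)
# block the link from the final query position to an earlier output token
ablated = self_attention(X, Wq, Wk, Wv, blocked_edges={(5, 1)})
print(np.abs(clean[5] - ablated[5]).max())        # effect of the ablated edge
```

Comparing the clean and ablated outputs at the prediction position is the basic measurement behind edge-level circuit analyses of this kind.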


Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers

February 2025

·

12 Reads

Chain-of-thought reasoning and scratchpads have emerged as critical tools for enhancing the computational capabilities of transformers. While theoretical results show that polynomial-length scratchpads can extend transformers' expressivity from TC^0 to PTIME, their required length remains poorly understood. Empirical evidence suggests that transformers need scratchpads even for many problems in TC^0, such as Parity or Multiplication, challenging optimistic bounds derived from circuit complexity. In this work, we initiate the study of systematic lower bounds for the number of CoT steps across different algorithmic problems, in the hard-attention regime. We study a variety of algorithmic problems, and provide bounds that are tight up to logarithmic factors. Overall, these results contribute to the emerging understanding of the power and limitations of chain-of-thought reasoning.


A Formal Framework for Understanding Length Generalization in Transformers

October 2024

·

17 Reads

Xinting Huang

·

Satwik Bhattamishra

·

[...]

·

Michael Hahn

A major challenge for transformers is generalizing to sequences longer than those observed during training. While previous works have empirically shown that transformers can either succeed or fail at length generalization depending on the task, theoretical understanding of this phenomenon remains limited. In this work, we introduce a rigorous theoretical framework to analyze length generalization in causal transformers with learnable absolute positional encodings. In particular, we characterize those functions that are identifiable in the limit from sufficiently long inputs with absolute positional encodings under an idealized inference scheme using a norm-based regularizer. This enables us to prove the possibility of length generalization for a rich family of problems. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks. Our theory not only explains a broad set of empirical observations but also opens the way to provably predicting length generalization capabilities in transformers.
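
The idealized inference scheme can be pictured schematically as follows; the notation is illustrative and not the paper's exact formalization.

```latex
% Schematic only: norm-regularized inference over transformers consistent
% with all training inputs up to length n (notation illustrative).
\[
  \hat{f}_n \;\in\; \operatorname*{arg\,min}
    \Big\{\, \lVert \theta_f \rVert \;:\; f \in \mathcal{T},\
      f(x) = f^{*}(x)\ \text{for all } |x| \le n \,\Big\}
\]
% A target f^* is identifiable in the limit if, for all sufficiently large n,
% \hat{f}_n agrees with f^* on inputs of every length, in which case
% length generalization is possible under this scheme.
```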



Separations in the Representational Capabilities of Transformers and Recurrent Architectures

June 2024

·

3 Reads

Transformer architectures have been widely adopted in foundation models. Due to their high inference costs, there is renewed interest in exploring the potential of efficient recurrent architectures (RNNs). In this paper, we analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance, including index lookup, nearest neighbor, recognizing bounded Dyck languages, and string equality. For the tasks considered, our results show separations based on the size of the model required for different architectures. For example, we show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size. Conversely, while constant-size RNNs can recognize bounded Dyck languages, we show that one-layer Transformers require a linear size for this task. Furthermore, we show that two-layer Transformers of logarithmic size can perform decision tasks such as string equality or disjointness, whereas both one-layer Transformers and recurrent models require linear size for these tasks. We also show that a log-size two-layer Transformer can implement the nearest neighbor algorithm in its forward pass; on the other hand, recurrent models require linear size. Our constructions are based on the existence of N nearly orthogonal vectors in O(log N)-dimensional space, and our lower bounds are based on reductions from communication complexity problems. We supplement our theoretical results with experiments that highlight the differences in the performance of these architectures on practical-size sequences.
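
The log-size constructions rest on a standard fact mentioned in the abstract: N nearly orthogonal vectors fit in O(log N) dimensions. Below is a small numerical check using random sign vectors, an illustrative stand-in rather than the paper's specific construction.

```python
# Random unit vectors in d ~ C*log(N) dimensions are nearly orthogonal:
# pairwise inner products concentrate around 0 (random-projection argument).
import numpy as np

rng = np.random.default_rng(0)
N = 1000
d = int(40 * np.log(N))                 # dimension grows only logarithmically in N
V = rng.choice([-1.0, 1.0], size=(N, d)) / np.sqrt(d)   # N random unit vectors
G = V @ V.T                             # Gram matrix of pairwise inner products
off_diag = G[~np.eye(N, dtype=bool)]
print(d, np.abs(off_diag).max())        # max |<v_i, v_j>| stays well below 1
```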


The Expressive Capacity of State Space Models: A Formal Language Perspective

May 2024

·

18 Reads

Recently, recurrent models based on linear state space models (SSMs) have shown promising performance in language modeling (LM), competitive with transformers. However, there is little understanding of the in-principle abilities of such models, which could provide useful guidance to the search for better LM architectures. We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs. We find that SSMs and transformers have overlapping but distinct strengths. In star-free state tracking, SSMs implement straightforward and exact solutions to problems that transformers struggle to represent exactly. They can also model bounded hierarchical structure with optimal memory even without simulating a stack. On the other hand, we identify a design choice in current SSMs that limits their expressive power. We discuss implications for SSM and LM research, and verify results empirically on a recent SSM, Mamba.
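
As a reference point for the model class under discussion, here is a minimal, time-invariant toy version of the linear SSM recurrence; modern SSMs such as Mamba use input-dependent and more elaborate parameterizations, and all names below are illustrative.

```python
# Toy diagonal linear SSM: h_t = A h_{t-1} + B x_t,  y_t = C h_t
import numpy as np

def linear_ssm(x, A, B, C):
    """x: (T, d_in); A: (n,) diagonal state transition; B: (n, d_in); C: (d_out, n)."""
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A * h + B @ x[t]     # linear state update, no nonlinearity in the recurrence
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
T, d_in, n, d_out = 12, 4, 16, 4
y = linear_ssm(rng.normal(size=(T, d_in)),
               A=rng.uniform(0.5, 0.99, size=n),   # stable diagonal dynamics
               B=rng.normal(size=(n, d_in)),
               C=rng.normal(size=(d_out, n)))
print(y.shape)   # (12, 4)
```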


InversionView: A General-Purpose Method for Reading Information from Neural Activations

May 2024

·

10 Reads

The inner workings of neural networks can be better understood if we can fully decipher the information encoded in neural activations. In this paper, we argue that this information is embodied by the subset of inputs that give rise to similar activations. Computing such subsets is nontrivial as the input space is exponentially large. We propose InversionView, which allows us to practically inspect this subset by sampling from a trained decoder model conditioned on activations. This helps uncover the information content of activation vectors, and facilitates understanding of the algorithms implemented by transformer models. We present three case studies where we investigate models ranging from small transformers to GPT-2. In these studies, we demonstrate the characteristics of our method, show the distinctive advantages it offers, and provide causally verified circuits.
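
The core idea, that an activation's content is the set of inputs producing similar activations, can be illustrated with a toy model whose input space is small enough to enumerate; InversionView instead samples this set with a trained decoder conditioned on the activation. The model and threshold below are illustrative stand-ins, not the paper's setup.

```python
# Toy illustration: brute-force the epsilon-preimage of an activation,
# i.e. the set of inputs whose activations are close to a query activation.
import itertools
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH, D = 5, 4, 8
EMB = rng.normal(size=(VOCAB, D))
W = rng.normal(size=(D, D)) / np.sqrt(D)

def activation(tokens):
    """A toy 'probed' activation: mean of nonlinearly transformed token embeddings."""
    return np.tanh(EMB[list(tokens)] @ W).mean(axis=0)

query = (1, 3, 3, 0)
a_query = activation(query)

eps = 0.5   # similarity threshold (illustrative)
similar = [toks for toks in itertools.product(range(VOCAB), repeat=LENGTH)
           if np.linalg.norm(activation(toks) - a_query) < eps]
print(len(similar), similar[:5])   # what these inputs share is what the activation encodes
```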


Linguistic Structure from a Bottleneck on Sequential Information Processing

May 2024

·

46 Reads

Human language is a unique form of communication in the natural world, distinguished by its structured nature. Most fundamentally, it is systematic, meaning that signals can be broken down into component parts that are individually meaningful -- roughly, words -- which are combined in a regular way to form sentences. Furthermore, the way in which these parts are combined maintains a kind of locality: words are usually concatenated together, and they form contiguous phrases, keeping related parts of sentences close to each other. We address the challenge of understanding how these basic properties of language arise from broader principles of efficient communication under information processing constraints. Here we show that natural-language-like systematicity arises from minimization of excess entropy, a measure of statistical complexity that represents the minimum amount of information necessary for predicting the future of a sequence based on its past. In simulations, we show that codes that minimize excess entropy factorize their source distributions into approximately independent components, and then express those components systematically and locally. Next, in a series of massively cross-linguistic corpus studies, we show that human languages are structured to have low excess entropy at the level of phonology, morphology, syntax, and semantics. Our result suggests that human language performs a sequential generalization of Independent Components Analysis on the statistical distribution over meanings that need to be expressed. It establishes a link between the statistical and algebraic structure of human language, and reinforces the idea that the structure of human language may have evolved to minimize cognitive load while maximizing communicative expressiveness.
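
For reference, the excess entropy invoked here is the standard information-theoretic quantity: the mutual information between a sequence's past and its future (notation illustrative).

```latex
% Excess entropy of a stationary process (standard definition).
\[
  E \;=\; \lim_{T \to \infty} I\!\left( X_{-T+1:0} \,;\, X_{1:T} \right)
\]
% E measures how much information the past of the sequence carries about
% its future; minimizing it favors codes whose parts can be predicted
% from little stored context.
```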


A unifying theory explains seemingly contradictory biases in perceptual estimation

February 2024

·

200 Reads

·

16 Citations

Nature Neuroscience

Perceptual biases are widely regarded as offering a window into the neural computations underlying perception. To understand these biases, previous work has proposed a number of conceptually different, and even seemingly contradictory, explanations, including attraction to a Bayesian prior, repulsion from the prior due to efficient coding and central tendency effects on a bounded range. We present a unifying Bayesian theory of biases in perceptual estimation derived from first principles. We demonstrate theoretically an additive decomposition of perceptual biases into attraction to a prior, repulsion away from regions with high encoding precision and regression away from the boundary. The results reveal a simple and universal rule for predicting the direction of perceptual biases. Our theory accounts for, and yields new insights regarding, biases in the perception of a variety of stimulus attributes, including orientation, color and magnitude. These results provide important constraints on the neural implementations of Bayesian computations.
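
Schematically, the decomposition described in the abstract takes the following additive form; the term names paraphrase the abstract and are not the paper's exact expressions.

```latex
% Schematic additive decomposition of the estimation bias b(theta).
\[
  b(\theta) \;\approx\;
    \underbrace{b_{\mathrm{prior}}(\theta)}_{\text{attraction to the prior}}
    \;+\;
    \underbrace{b_{\mathrm{enc}}(\theta)}_{\text{repulsion from high-precision regions}}
    \;+\;
    \underbrace{b_{\mathrm{bound}}(\theta)}_{\text{regression away from the boundary}}
\]
```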



Citations (21)


... Edelman et al. [11] argue that attention in transformers tends to represent dependencies among only a small number of tokens, causing failures for global problems. Hahn and Rofin [16] show that although Parity is representable, it is hard to learn a length-generalizing solution because the training loss landscape is highly sensitive to all the inputs. Similarly, Thomm et al. [38] show that LLMs are data inefficient over compositional problems and Liu et al. [19] argue that transformers learn shortcuts to simulating finite-state automata. ...

Reference:

Lost in Transmission: When and Why LLMs Fail to Reason Globally
Why are Sensitive Functions Hard for Transformers?
  • Citing Conference Paper
  • January 2024

... Although it is widely accepted that neural processing is shaped by environmental statistics, quantitatively defining this relationship remains challenging [46]. Various theories have been proposed, such as sensory neurons maximizing mutual information between stimuli and their representations under limited coding capacity [47]-[50], or aiming to eliminate redundancy in representations. Other perspectives suggest that the brain seeks to infer high-level hidden features of the environment. ...

A unifying theory explains seemingly contradictory biases in perceptual estimation

Nature Neuroscience

... (Similar power-law behaviors, with different exponents, arise from different loss functions [10,12,59,60].) Concretely, these power-law expressions were derived using the Fisher information, whose inverse is a lower bound to the variance of an unbiased estimator, and which can be related to the mutual information in some limits. Here, for all values of R̄, the error is well described by a power law, ε²(x) = A_e / p(x)^{g_e}, where the exponent changes as a function of R̄, and depends on the choice of the decoder (Fig 8A, S7(D) and S8(D) Figs). ...

A unifying theory explains seemingly contradicting biases in perceptual estimation
  • Citing Article
  • August 2023

Journal of Vision

... As UD treebanks have become valuable resources in both NLP (e.g., Jumelet et al., 2025; Opitz et al., 2025) and language acquisition research (e.g., Clark et al., 2023; Hahn et al., 2020), there have been increasing efforts to parse And a green one. ...

A Cross-Linguistic Pressure for Uniform Information Density in Word Order

Transactions of the Association for Computational Linguistics

... Consistent with this, there has been a blossoming of research efforts that can be construed as using a resource-rational approach. By now, resource-rational analysis (broadly construed) has been applied to virtually all standard topics of cognitive psychology, including perception (e.g., Wei and Stocker, 2017; Cheyette and Piantadosi, 2020), visual attention (e.g., Hoppe and Rothkopf, 2019; Callaway et al., 2021), working memory (e.g., O'Reilly and Frank, 2006b; Suchow and Griffiths, 2016; Van den Berg and Ma, 2018), long-term memory (e.g., Lu et al., 2022; Zhang et al., 2022; Callaway et al., 2023a), language (e.g., Hahn et al., 2022; Dingemanse, 2020), reasoning (e.g., Dasgupta et al., 2017, 2020; Zhao et al., 2024), problem-solving (e.g., Prystawski et al., 2022; Callaway et al., 2022b; Binz and Schulz, 2023), judgment, decision-making (e.g., Lieder et al., 2018a; Bhui et al., 2021), active learning (Bramley et al., 2017; Binz and Schulz, 2022), categorization (Dasgupta and Griffiths, 2022), mental imagery (Hamrick et al., 2015), and moral cognition. Moreover, resource-rational analysis has also found its way into several other fields, ranging from economics (Gabaix, 2023) to psychiatry (Bari and Gershman, 2023). ...

A resource-rational model of human processing of recursive linguistic structure

Proceedings of the National Academy of Sciences

... More recently, some studies propose a generalization from dependency locality to information locality, where any pair of linguistic units with high co-occurrence statistics, no matter whether they are in the same syntactic dependency or not, should stay close in linear order (Futrell, 2019; Hahn, Degen, & Futrell, 2021; Hahn & Xu, 2022). Compared to previous work, these studies highlight the role of predictive processing, pointing out an interaction between the memory-based and the expectation-based mechanisms. ...

Crosslinguistic word order variation reflects evolutionary pressures of dependency and information locality

Proceedings of the National Academy of Sciences

... Ganguly and Gupta [20] formulated explainer selection as a rate-distortion problem, optimizing the trade-off between explanation complexity and fidelity through their InfoExplain benchmark. The connection between simplicity and lower information content has been explored in various contexts, from Dessalles' [14] algorithmic simplicity theory, which defines unexpectedness as U = C_exp − C_obs, where lower observed complexity correlates with simpler explanations, to Futrell and Hahn's work [19] linking optimal coding theory to cognitive efficiency. Earlier critiques, such as Salmon's [37] analysis of transmitted information as an explanatory metric, highlighted the context-dependent aspects of simplicity. ...

Information Theory as a Bridge Between Language Function and Language Form

Frontiers in Communication

... Efficiency-based accounts have been shown to explain word meaning variation across languages (e.g., Kemp & Regier, 2012; Regier et al., 2015; Y. Xu, Liu, & Regier, 2020; Zaslavsky, Kemp, Regier, & Tishby, 2018), the structures of word forms (e.g., Zipf, 1949; Mahowald, Dautriche, Gibson, & Piantadosi, 2018; Bentz & Ferrer Cancho, 2016; Hahn, Mathew, & Degen, 2022), and grammatical form-meaning mappings (Mollica et al., 2021). We extend this growing body of research with the aim to understand the general principles that shape the diverse strategies of lexicalization. ...

Morpheme Ordering Across Languages Reflects Optimization for Processing Efficiency

Open Mind

... At one end, there is the locality effect where words that are more mutually predictable tend to stay closer to each other in linear order (e.g., Futrell, 2019; Futrell, Qian, Gibson, Fedorenko, & Blank, 2019). At the other end, the same pressure of compression governs the fusion of morphemes (e.g., Hahn et al., 2021; Rathi, Hahn, & Futrell, 2021), and makes mutually predictable units more likely to go through processes such as affixation and phonological reduction (e.g., Bybee, 2006; Bybee et al., 1994; Gahl & Baayen, 2024). ...

An Information-Theoretic Characterization of Morphological Fusion
  • Citing Conference Paper
  • January 2021