Roger Wattenhofer’s research while affiliated with ETH Zurich and other places


Publications (697)


Figure A.2: RPM example from I-RAVEN.
Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty?
  • Preprint

March 2025 · Roger Wattenhofer · [...]
This work presents a first evaluation of two state-of-the-art Large Reasoning Models (LRMs), OpenAI's o3-mini and DeepSeek R1, on analogical reasoning, focusing on well-established nonverbal human IQ tests based on Raven's progressive matrices. We benchmark with the I-RAVEN dataset and its more difficult extension, I-RAVEN-X, which tests the ability to generalize to longer reasoning rules and larger ranges of attribute values. To assess the influence of visual uncertainty on these nonverbal analogical reasoning tests, we extend the I-RAVEN-X dataset, which otherwise assumes an oracle perception. We adopt a two-fold strategy to simulate this imperfect visual perception: 1) we introduce confounding attributes which, being sampled at random, do not contribute to the prediction of the correct answer of the puzzles, and 2) we smooth the distributions of the input attributes' values. We observe a sharp decline in OpenAI's o3-mini task accuracy, dropping from 86.6% on the original I-RAVEN to just 17.0% -- approaching random chance -- on the more challenging I-RAVEN-X, which increases input length and range and emulates perceptual uncertainty. This drop occurred despite o3-mini spending 3.4x more reasoning tokens. A similar trend is observed for DeepSeek R1: from 80.6% to 23.2%. On the other hand, ARLC, a neuro-symbolic probabilistic abductive model that achieves state-of-the-art performance on I-RAVEN, reasons robustly under all these out-of-distribution tests, maintaining strong accuracy with only a modest reduction from 98.6% to 88.0%. Our code is available at https://github.com/IBM/raven-large-language-models.
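The two perturbations described in the abstract (random confounders and smoothed attribute distributions) can be illustrated with a short sketch; the function names, the epsilon parameter, and the attribute encoding are illustrative assumptions, not the authors' implementation:

```python
import random

def smooth_onehot(value_index, num_values, epsilon=0.2):
    """Replace an oracle (one-hot) attribute reading with a smoothed
    distribution: the true value keeps mass 1 - epsilon and the remainder
    is spread uniformly over the other values."""
    probs = [epsilon / (num_values - 1)] * num_values
    probs[value_index] = 1.0 - epsilon
    return probs

def add_confounder(panel_attributes, num_values=10, rng=random):
    """Append a randomly sampled attribute that carries no information
    about the puzzle's answer (a confounder)."""
    return panel_attributes + [rng.randrange(num_values)]

# e.g. smooth_onehot(3, 5) -> [0.05, 0.05, 0.05, 0.8, 0.05]
```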


Figure 3: Overview of the pre-training procedure. The boxes in the models indicate the type of attention masks used. The attention masks are explained in Figure 4.
Figure 4: Attention masks during pre-training for an input with the sentence index vector [0,0,1,1,1]: The left matrix is the "block triangular mask" as in Section 3.1. After going through the encoder, every token represents the compressed prefix of its sequence up to itself, and is only allowed to attend to itself and compressions of previous sequences (right).
Figure 5: Illustration of the Fast Generation Algorithm. Having finished s_1 and s_2 in the context, any subsequent token mathematically cannot influence e_1, e_2. The Fast Generation Algorithm caches them and feeds them directly to the slt_body, together with e_3.
Text Compression for Efficient Language Generation

We challenge the prevailing assumption that LLMs must rely fully on sub-word tokens for high-quality text generation. To this end, we propose the "Generative Pretrained Thoughtformer" (GPTHF), a hierarchical transformer language model capable of text generation by compressing text into sentence embeddings and employing a sentence attention mechanism. GPTHF retains GPT's architecture, modifying only token interactions via dynamic sparse attention masks. Our experiments show that GPTHF achieves up to an order-of-magnitude improvement in FLOPs efficiency and a threefold increase in runtime speed compared to equally-sized GPT models in the low-size regime. This is achieved through a unique generation method that caches and reuses sentence embeddings, allowing significant portions of the input to bypass large parts of the network.
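The "block triangular mask" mentioned in Figure 4 can be sketched for the sentence index vector [0,0,1,1,1]; this is one plausible reading of the mask (causal attention within a sentence, full attention to earlier sentences), not the paper's exact definition:

```python
def block_triangular_mask(sentence_ids):
    """Build a 0/1 attention mask: token i may attend to token j if j lies
    in an earlier sentence, or in the same sentence at position j <= i."""
    n = len(sentence_ids)
    return [[1 if sentence_ids[j] < sentence_ids[i]
             or (sentence_ids[j] == sentence_ids[i] and j <= i) else 0
             for j in range(n)]
            for i in range(n)]

mask = block_triangular_mask([0, 0, 1, 1, 1])
for row in mask:
    print(row)  # lower-triangular blocks per sentence
```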


Byzantine Game Theory: Sun Tzu's Boxes

February 2025


We introduce the Byzantine Selection Problem, living at the intersection of game theory and fault-tolerant distributed computing. Here, an event organizer is presented with a group of n agents, and wants to select ℓ < n of them to form a team. For these purposes, each agent i self-reports a positive skill value v_i, and a team's value is the sum of its members' skill values. Ideally, the value of the team should be as large as possible, which can be easily achieved by selecting the agents with the ℓ highest skill values. However, an unknown subset of at most t < n agents are byzantine and hence not to be trusted, rendering their true skill values 0. In the spirit of the distributed computing literature, the identity of the byzantine agents is not random but instead chosen by an adversary aiming to minimize the value of the chosen team. Can we still select a team with good guarantees in this adversarial setting? As it turns out, deterministically, it remains optimal to select the agents with the ℓ highest values. Yet, if t ≥ ℓ, the adversary can choose to make all selected agents byzantine, leading to a team of value zero. To provide meaningful guarantees, one hence needs to allow for randomization, in which case the expected value of the selected team needs to be maximized, assuming again that the adversary plays to minimize it. For this case, we provide linear-time randomized algorithms that maximize the expected value of the selected team.
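The deterministic side of the problem can be sketched directly from the abstract (helper names are hypothetical; the paper's randomized linear-time algorithms are not reproduced here): picking the top-ℓ reported values is optimal deterministically, yet yields value zero when t ≥ ℓ, since the adversary can corrupt exactly the selected agents:

```python
def select_top(values, ell):
    """Deterministic rule from the abstract: pick the ell agents with the
    highest self-reported skill values."""
    return sorted(range(len(values)), key=lambda i: -values[i])[:ell]

def adversarial_value(values, selected, t):
    """Worst-case team value: the adversary makes byzantine (true value 0)
    the t selected agents with the highest reported values."""
    kept = sorted((values[i] for i in selected), reverse=True)[t:]
    return sum(kept)

team = select_top([9, 7, 5, 3], ell=2)   # agents 0 and 1
# with t >= ell the adversary can zero out the whole team:
# adversarial_value([9, 7, 5, 3], team, t=2) == 0
```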


Condorcet Winners and Anscombe's Paradox Under Weighted Binary Voting

February 2025


We consider voting on multiple independent binary issues. In addition, a weighting vector for each voter defines how important they consider each issue. The most natural way to aggregate the votes into a single unified proposal is issue-wise majority (IWM): taking a majority opinion for each issue. However, in a scenario known as Ostrogorski's Paradox, an IWM proposal may not be a Condorcet winner, or it may even fail to garner majority support in a special case known as Anscombe's Paradox. We show that it is co-NP-hard to determine whether there exists a Condorcet-winning proposal even without weights. In contrast, we prove that the single-switch condition provides an Ostrogorski-free voting domain under identical weighting vectors. We show that verifying the condition can be achieved in linear time and that no-instances admit short, efficiently computable proofs in the form of forbidden substructures. On the way, we give the simplest linear-time test for the voter/candidate-extremal-interval condition in approval voting and the simplest and most efficient algorithm for recognizing single-crossing preferences in ordinal voting. We then tackle Anscombe's Paradox. Under identical weight vectors, we can guarantee a majority-supported proposal agreeing with IWM on strictly more than half of the overall weight, while with two distinct weight vectors, such proposals can get arbitrarily far from IWM. The severity of such examples is controlled by the maximum average topic weight w̃_max: a simple bound derived from a partition-based approach is tight on a large portion of the range w̃_max ∈ (0, 1). Finally, we extend Wagner's rule to the weighted setting: an average majority across topics of at least 3/4 precludes Anscombe's Paradox from occurring.
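Issue-wise majority and the unweighted Anscombe condition are easy to state in code; the instance below is the classic five-voter, three-issue example of the paradox, and the helper names are illustrative:

```python
def issue_wise_majority(votes):
    """IWM: per-issue strict majority over a list of 0/1 ballots."""
    n, m = len(votes), len(votes[0])
    return [1 if 2 * sum(v[j] for v in votes) > n else 0 for j in range(m)]

def anscombe_paradox(votes):
    """True if a strict majority of voters disagree with the IWM proposal
    on a strict majority of issues (unweighted case)."""
    proposal = issue_wise_majority(votes)
    m = len(proposal)
    losers = sum(1 for v in votes
                 if 2 * sum(a != b for a, b in zip(v, proposal)) > m)
    return 2 * losers > len(votes)

# classic five-voter, three-issue instance: IWM is (0, 0, 0), yet three of
# the five voters disagree with it on two of the three issues
votes = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0], [0, 0, 0]]
```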


Fig. 1. Proposed DisCoder architecture. The mel spectrogram is encoded into a low-dimensional latent space before being decoded to a 44.1 kHz waveform. During the first stage of training, the latent space is aligned with the DAC prior. During the second stage of training, this constraint is removed, and a skip connection is introduced to preserve information encoded in the initial mel spectrogram.
High-Fidelity Music Vocoder using Neural Audio Codecs

February 2025


While neural vocoders have made significant progress in high-fidelity speech synthesis, their application on polyphonic music has remained underexplored. In this work, we propose DisCoder, a neural vocoder that leverages a generative adversarial encoder-decoder architecture informed by a neural audio codec to reconstruct high-fidelity 44.1 kHz audio from mel spectrograms. Our approach first transforms the mel spectrogram into a lower-dimensional representation aligned with the Descript Audio Codec (DAC) latent space before reconstructing it to an audio signal using a fine-tuned DAC decoder. DisCoder achieves state-of-the-art performance in music synthesis on several objective metrics and in a MUSHRA listening study. Our approach also shows competitive performance in speech synthesis, highlighting its potential as a universal vocoder.


Logarithmic Approximation for Road Pricing on Grids

February 2025

Consider a graph G = (V, E) and some commuters, each specified by a tuple (u, v, b) consisting of two nodes u, v ∈ V and a non-negative real number b specifying their budget. The goal is to find a pricing function p on the edges of G that maximizes the revenue generated by the commuters. Here, each commuter (u, v, b) either pays the lowest cost of a u-v path under the pricing p, or 0 if this exceeds their budget b. We study this problem for the case where G is a bounded-width grid graph and give a polynomial-time approximation algorithm with approximation ratio O(log |E|). Our approach combines existing ideas with new insights. Most notably, we employ a rather seldom-encountered technique that we coin 'assume-implement dynamic programming.' This technique involves dynamic programming where some information about the future decisions of the dynamic program is guessed in advance and 'assumed' to hold, and subsequent decisions are then forced to 'implement' the guess. This enables computing the cost of the current transition by using information that would normally only be available in the future.
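The revenue objective itself (not the paper's approximation algorithm) can be sketched: given a pricing, each commuter pays their cheapest path price, or nothing if it exceeds their budget. The graph encoding and function names below are assumptions for illustration:

```python
import heapq

def shortest_path_cost(n, priced_edges, s, t):
    """Dijkstra over an undirected graph with non-negative edge prices.
    priced_edges maps (u, v) pairs to prices."""
    adj = [[] for _ in range(n)]
    for (u, v), p in priced_edges.items():
        adj[u].append((v, p))
        adj[v].append((u, p))
    dist = [float('inf')] * n
    dist[s] = 0
    pq = [(0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, p in adj[u]:
            if d + p < dist[v]:
                dist[v] = d + p
                heapq.heappush(pq, (d + p, v))
    return dist[t]

def revenue(n, priced_edges, commuters):
    """A commuter (u, v, b) pays the cheapest u-v path price under the
    pricing, or 0 if that price exceeds their budget b (they opt out)."""
    total = 0
    for u, v, b in commuters:
        c = shortest_path_cost(n, priced_edges, u, v)
        if c <= b:
            total += c
    return total

# toy 2x2 grid (nodes 0-1 on top, 2-3 below), unit price on every edge
prices = {(0, 1): 1, (2, 3): 1, (0, 2): 1, (1, 3): 1}
# commuter with budget 2 pays the 2-edge path; budget 1 opts out
print(revenue(4, prices, [(0, 3, 2), (0, 3, 1)]))
```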


Transaction Fee Market Design for Parallel Execution

February 2025


Given the low throughput of blockchains like Bitcoin and Ethereum, scalability -- the ability to process an increasing number of transactions -- has become a central focus of blockchain research. One promising approach is the parallelization of transaction execution across multiple threads. However, achieving efficient parallelization requires a redesign of the incentive structure within the fee market. Currently, the fee market does not differentiate between transactions that access multiple high-demand resources versus a single low-demand one, as long as they require the same computational effort. Addressing this discrepancy is crucial for enabling more effective parallel execution. In this work, we aim to bridge the gap between the current fee market and the need for parallel execution by exploring alternative fee market designs. To this end, we propose a framework consisting of two key components: a Gas Computation Mechanism (GCM), which quantifies the load a transaction places on the network in terms of parallelization and computation, measured in units of gas, and a Transaction Fee Mechanism (TFM), which assigns a price to each unit of gas. We also introduce a set of desirable properties for a GCM, present multiple candidate mechanisms, and evaluate them against the properties. One promising candidate emerges: the weighted area GCM. Notably, this mechanism can be seamlessly composed with existing TFMs, such as EIP-1559. While our exploration primarily focuses on the execution component of the fee, which directly relates to parallel execution, we also outline how it could be integrated with fees associated with other factors, such as storage and data bandwidth, by drawing a parallel to a multi-dimensional fee market.
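On the TFM side, the abstract notes that the proposed weighted area GCM composes with existing mechanisms such as EIP-1559. A simplified sketch of EIP-1559's well-known base fee update (omitting the protocol's integer rounding) illustrates that pricing component; it is not the paper's GCM:

```python
def update_base_fee(base_fee, gas_used, gas_target, max_change=0.125):
    """Simplified EIP-1559 base fee update: the fee moves toward balancing
    demand against the gas target, by at most 12.5% per block."""
    delta = max_change * (gas_used - gas_target) / gas_target
    return base_fee * (1 + delta)

# a full block (2x target) raises the base fee by 12.5%;
# an empty block lowers it by 12.5%; a block at target leaves it unchanged
print(update_base_fee(100.0, 30_000_000, 15_000_000))
print(update_base_fee(100.0, 0, 15_000_000))
```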


Byzantine Stable Matching

February 2025


In stable matching, one must find a matching between two sets of agents, commonly men and women, or job applicants and job positions. Each agent has a preference ordering over who they want to be matched with. Moreover, a matching is said to be stable if no two agents prefer each other over their current partners. We consider solving stable matching in a distributed synchronous setting, where each agent is its own process. Moreover, we assume up to t_L agents on one side and t_R on the other side can be byzantine. After properly defining the stable matching problem in this setting, we study its solvability. When there are equally many agents on each side, each with a fully ordered preference list, we give necessary and sufficient conditions for stable matching to be solvable in the synchronous setting. These conditions depend on the communication model used, i.e., whether parties on the same side are allowed to communicate directly, and on the presence of a cryptographic setup, i.e., digital signatures.
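The stability condition from the abstract (no pair who prefer each other over their current partners) translates directly into a checker; the dictionary encoding and function name are illustrative, and byzantine behavior is not modeled here:

```python
def is_stable(matching, men_pref, women_pref):
    """Check stability: no man-woman pair prefer each other to their
    assigned partners. matching maps man -> woman; preference lists are
    ordered most-preferred first."""
    partner_of_w = {w: m for m, w in matching.items()}
    for m, w in matching.items():
        # every woman m ranks above his current partner is a candidate
        for better_w in men_pref[m][:men_pref[m].index(w)]:
            rival = partner_of_w[better_w]
            if women_pref[better_w].index(m) < women_pref[better_w].index(rival):
                return False  # (m, better_w) is a blocking pair
    return True

men_pref = {0: [0, 1], 1: [0, 1]}
women_pref = {0: [0, 1], 1: [0, 1]}
print(is_stable({0: 0, 1: 1}, men_pref, women_pref))   # stable
print(is_stable({0: 1, 1: 0}, men_pref, women_pref))   # blocking pair (0, 0)
```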


Figure 4 Two synchronous updates of a mountain-valley landscape represented using a ternary alternating timer t ∈ {1, 2, 3}. Notice that after 3 time steps the landscape shifts one cell to the left.
Universality Frontier for Asynchronous Cellular Automata

February 2025


In this work, we investigate the computational aspects of asynchronous cellular automata (ACAs), a modification of cellular automata in which cells update independently, following an asynchronous schedule. We introduce flip automata networks (FAN), a simple modification of automata networks that remain robust under any asynchronous update schedule. We show that asynchronous automata can efficiently simulate their synchronous counterparts with a linear memory overhead, which improves upon the previously established quadratic bound. Additionally, we address the universality gap for (a)synchronous cellular automata -- the boundary separating universal and non-universal automata, which is still not fully understood. We tighten this boundary by proving that all one-way asynchronous automata lack universal computational power. Conversely, we establish the existence of a universal 6-state first-neighbor automaton in one dimension and a 3-state von Neumann automaton in two dimensions, which represent the smallest known universal constructions to date.
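The synchronous/asynchronous distinction can be illustrated with an elementary (one-dimensional, first-neighbor) CA in which cells update one at a time in a given order, each seeing the latest state. The encoding below is an illustrative sketch, not the paper's flip automata network construction:

```python
def eca_step_sync(state, rule):
    """One synchronous step of an elementary CA (Wolfram numbering,
    0-255) on a ring: all cells read the old state."""
    n = len(state)
    return [(rule >> ((state[(i - 1) % n] << 2) | (state[i] << 1)
                      | state[(i + 1) % n])) & 1 for i in range(n)]

def eca_step_async(state, rule, schedule):
    """Asynchronous semantics: update the cells listed in `schedule`
    one at a time; each update sees the results of the previous ones."""
    s = list(state)
    n = len(s)
    for i in schedule:
        neigh = (s[(i - 1) % n] << 2) | (s[i] << 1) | s[(i + 1) % n]
        s[i] = (rule >> neigh) & 1
    return s

# rule 110 on a 4-cell ring: a left-to-right asynchronous sweep can
# diverge from the synchronous update of the same configuration
print(eca_step_sync([0, 1, 1, 0], 110))
print(eca_step_async([0, 1, 1, 0], 110, range(4)))
```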


Beyond Interpolation: Extrapolative Reasoning with Reinforcement Learning and Graph Neural Networks

February 2025


Despite incredible progress, many neural architectures fail to properly generalize beyond their training distribution. As such, learning to reason in a correct and generalizable way is one of the current fundamental challenges in machine learning. In this respect, logic puzzles provide a great testbed, as we can fully understand and control the learning environment. Thus, they allow us to evaluate performance on previously unseen, larger, and more difficult puzzles that follow the same underlying rules. Since traditional approaches often struggle to represent such scalable logical structures, we propose to model these puzzles using a graph-based approach. Then, we investigate the key factors enabling the proposed models to learn generalizable solutions in a reinforcement learning setting. Our study focuses on the impact of the architecture's inductive bias, different reward systems, and the role of recurrent modeling in enabling sequential reasoning. Through extensive experiments, we demonstrate how these elements contribute to successful extrapolation on increasingly complex puzzles. These insights and frameworks offer a systematic way to design learning-based systems capable of generalizable reasoning beyond interpolation.


Citations (32)


... More recently, also math competitions such as AIME2024 (of America, 2024) have been used to evaluate the newest models. Estermann et al. (2024) have introduced PUZZLES, a benchmark focusing on algorithmic and logical reasoning for reinforcement learning. While PUZZLES does not focus on LLMs, except for a short ablation in the appendix, we argue that the scalability provided by the underlying puzzles is an ideal testbed for testing extrapolative reasoning capabilities in LLMs. ...

Reference:

Reasoning Effort and Problem Complexity: A Scaling Analysis in LLMs
PUZZLES: A Benchmark for Neural Algorithmic Reasoning

... To this end, we present Audio Atlas, an open-source interactive web application that helps users navigate audio and music datasets. Inspired by existing work in the image domain [6], Audio Atlas can visualize any audio dataset, providing a responsive user interface even when displaying tens of millions of samples. To obtain semantically meaningful embeddings, we use CLAP [7], a contrastive neural network trained on audio-text pairs. ...

AEye: A Visualization Tool for Image Datasets
  • Citing Conference Paper
  • October 2024

... To address this limitation, we introduced a fast commit rule to AlterBFT [4,32,40,67]. Specifically, an honest replica commits a block proposed in an epoch if it receives votes for the block from all replicas in the system, provided no evidence of misbehavior (e.g., an equivocation or silence certificate) is detected. As a result, when there are no failures in the system, AlterBFT commits a block in just two communication steps, achieving optimal latency [58], without waiting for the synchrony bound ∆ S . ...

Banyan: Fast Rotating Leader BFT
  • Citing Conference Paper
  • December 2024

... It is worth noting that MEV is not solely about financial gains and losses; it has also posed a threat to decentralization. The fundamental principle of blockchain technology is that no small group of entities should be able to manipulate the blockchain's records or impose censorship [4]. In the current Ethereum Proof of Stake (PoS) paradigm, the traditional miners of Proof of Work (PoW) are replaced by validators, and these block producers can add MEV rewards to the block reward as additional stake, bolstering their power over the protocol. ...

Ethereum Proof-of-Stake Consensus Layer: Participation and Decentralization
  • Citing Chapter
  • November 2024

... AutoConcierge [164]: conducts real conversations with users (Interaction). FLOW [11]: introduces a feedback loop to enable collaboration between the recommendation agent and the user agent (Representation). AgentCF [173]: facilitates collaborative learning between user and item agents (Representation). Rec4Agentverse [170]. Yoon et al. [160]: uses LLMs to simulate users for conversational recommendation tasks (Simulation). SUBER [21]: develops an RL environment using an LLM to simulate user feedback (Simulation). CSHI [189]: proposes a framework for LLM-based user simulators in conversational RSs (Simulation). iEvaLM [137]: suggests new evaluation methods using LLMs (Simulation). ...

SUBER: An RL Environment with Simulated Human Behavior for Recommender Systems

... Such a classifier can also serve as a tool for gathering data about the most commonly used CAPTCHA types across the web. The development of powerful machine learning algorithms has made it possible to automatically solve different types of CAPTCHAs, such as text-based [2]-[4] or image-based, also called "reCAPTCHAs" [5], [6]. The increasing number of automated CAPTCHA solvers (such as Optical Character Recognition (OCR) algorithms and deep learning models) has resulted in the development of new CAPTCHA types and schemes, varying from simple puzzles and math, through audios and videos, to interactive challenges and mini-games [7], [8]. ...

Breaking reCAPTCHAv2
  • Citing Conference Paper
  • July 2024

... Optimizations regarding communication complexity have also been considered. The work of [17] has achieved optimal-resilience asynchronous AA with O(n 2 ) messages per iteration as opposed to the prior solutions of O(n 3 ) [1], and the work of [25] has shown that O(ℓn) bits are sufficient to achieve a stronger version of AA (with exact agreement) on ℓ-bit integer inputs, given that ℓ is large enough. ...

Brief Announcement: Communication-Optimal Convex Agreement
  • Citing Conference Paper
  • June 2024

... In the specific context of DAO governance, a few papers provide empirical descriptions of the distribution of votes. Fritsch et al. (2024) analyse voting power in three DAOs, including Uniswap, and describe the structure of delegation networks by distinguishing "single holder"-those who receive at least 50% of their voting power from a single token holder-from the remaining "community" delegates. They find that large and powerful delegates typically decide in the same way as the larger community. ...

Analyzing Voting Power in Decentralized Governance: Who controls DAOs?
  • Citing Article
  • June 2024

Blockchain Research and Applications

... Researchers have recently focused on alleviating the cold-start issues of functions on GPUs [14, 26-28]. Figure 3 presents the lifecycle of a cold-start LLM invocation on a GPU, used to better situate these works and define the target scope of TIDAL. In the figure, we omit the CPU-side initialization processes, such as container creation, as these have been extensively explored in CPU-only studies [17, 29-31]. ...

Optimus: Warming Serverless ML Inference via Inter-Function Model Transformation

... Moreover, there is much research on NFT prices, as shown in Table 7. Understanding the price determinants of NFTs involves different factors, including textual and visual data, market dynamics, social media influence, and broader financial trends. Textual descriptions and image data within NFT collections can help explain price variations among individual NFTs; however, features extracted from these data do not generalize well to new, unseen collections [237]. This indicates that, while certain stylistic or descriptive elements might boost an NFT's value within a specific collection, these factors do not necessarily apply across different collections. ...

What Determines the Price of NFTs?
  • Citing Conference Paper
  • December 2023