Yonatan Belinkov’s research while affiliated with Technion – Israel Institute of Technology and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (172)


Figure 1: An illustration of PFF and FUR. To perform a parametric intervention, we first prompt the model M to produce an answer y and a reasoning chain (CoT). We then segment the reasoning chain and unlearn a single reasoning step from the model. The unlearned model M* is then prompted to produce an answer y*. We measure faithfulness as the adverse effect of unlearning on the model's initial prediction.
Figure 10: A screen capture of one example from the Qualtrics annotation platform. The answer predicted by the model is highlighted, as well as the CoT step whose supportiveness users are asked to judge.
Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps
  • Preprint
  • File available

February 2025 · 1 Read

Martin Tutek · Fateme Hashemi Chaleshtori · Ana Marasović · Yonatan Belinkov

When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. However, despite much work on CoT prompting, it is unclear if CoT reasoning is faithful to the models' parametric beliefs. We introduce a framework for measuring parametric faithfulness of generated reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an instance of this framework. FUR erases information contained in reasoning steps from model parameters. We perform experiments unlearning CoTs of four LMs prompted on four multiple-choice question answering (MCQA) datasets. Our experiments show that FUR is frequently able to change the underlying models' prediction by unlearning key steps, indicating when a CoT is parametrically faithful. Further analysis shows that CoTs generated by models post-unlearning support different answers, hinting at a deeper effect of unlearning. Importantly, CoT steps identified as important by FUR do not align well with human notions of plausibility, emphasizing the need for specialized alignment.
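The following sketch illustrates the unlearning-based faithfulness test at a high level. It is not the authors' implementation: the model name, the gradient-ascent unlearning recipe, and all hyperparameters are illustrative assumptions; the paper's actual unlearning method and prompting setup may differ.

```python
# Hedged sketch of the FUR loop: unlearn one CoT step via gradient ascent
# on its tokens, then check whether the model's answer flips.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper uses larger instruction-tuned LMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def answer(m, prompt: str) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    out = m.generate(ids, max_new_tokens=5, do_sample=False,
                     pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()

def unlearn_step(m, step_text: str, lr: float = 1e-4, n_iters: int = 10):
    """Gradient *ascent* on the step's LM loss erases it from the parameters."""
    m = copy.deepcopy(m)
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    ids = tok(step_text, return_tensors="pt").input_ids
    for _ in range(n_iters):
        loss = m(ids, labels=ids).loss
        opt.zero_grad()
        (-loss).backward()  # maximize the loss on the unlearned step
        opt.step()
    return m

prompt = "Q: ... A: Let's think step by step."   # placeholder MCQA prompt
y = answer(model, prompt)
m_star = unlearn_step(model, "Step 2: <one segmented reasoning step>")
y_star = answer(m_star, prompt)
print("step was parametrically faithful" if y != y_star
      else "prediction unchanged")
```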


Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs

February 2025 · 2 Reads

Adi Simhi · Itay Itzhak · Fazl Barez · [...] · Yonatan Belinkov

Large Language Models (LLMs) often generate outputs that lack grounding in real-world facts, a phenomenon known as hallucinations. Prior research has associated hallucinations with model uncertainty, leveraging this relationship for hallucination detection and mitigation. In this paper, we challenge the underlying assumption that all hallucinations are associated with uncertainty. Using knowledge detection and uncertainty measurement methods, we demonstrate that models can hallucinate with high certainty even when they have the correct knowledge. We further show that high-certainty hallucinations are consistent across models and datasets, distinctive enough to be singled out, and challenge existing mitigation methods. Our findings reveal an overlooked aspect of hallucinations, emphasizing the need to understand their origins and improve mitigation strategies to enhance LLM safety. The code is available at https://github.com/technion-cs-nlp/Trust_me_Im_wrong.
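As a rough illustration of the distinction the paper draws, the sketch below scores a greedy answer by the mean probability of its tokens, a crude certainty proxy, and flags a wrong answer produced with high certainty. The model, threshold, and certainty measure are all placeholder assumptions, not the paper's method.

```python
# Hedged sketch: flag a hallucination that the model produces with high
# certainty. Certainty here = mean probability of the greedily chosen tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def answer_with_certainty(prompt: str, max_new_tokens: int = 8):
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False,
                         output_scores=True, return_dict_in_generate=True,
                         pad_token_id=tok.eos_token_id)
    new_ids = out.sequences[0, ids.shape[1]:]
    # Probability assigned to each greedily chosen token at its step.
    probs = [torch.softmax(s, -1)[0, t].item()
             for s, t in zip(out.scores, new_ids)]
    text = tok.decode(new_ids, skip_special_tokens=True).strip()
    return text, sum(probs) / len(probs)

pred, certainty = answer_with_certainty("The capital of Australia is")
gold = "Canberra"
if gold.lower() not in pred.lower() and certainty > 0.9:  # threshold assumed
    print(f"high-certainty hallucination: {pred!r} ({certainty:.2f})")
```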


Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models

February 2025 · 10 Reads

We investigate how large language models perform latent multi-hop reasoning in prompts like "Wolfgang Amadeus Mozart's mother's spouse is". To analyze this process, we introduce logit flow, an interpretability method that traces how logits propagate across layers and positions toward the final prediction. Using logit flow, we identify four distinct stages in single-hop knowledge prediction: (A) entity subject enrichment, (B) entity attribute extraction, (C) relation subject enrichment, and (D) relation attribute extraction. Extending this analysis to multi-hop reasoning, we find that failures often stem from the relation attribute extraction stage, where conflicting logits reduce prediction accuracy. To address this, we propose back attention, a novel mechanism that enables lower layers to leverage higher-layer hidden states from different positions during attention computation. With back attention, a 1-layer transformer achieves the performance of a 2-layer transformer. Applied to four LLMs, back attention improves accuracy on five reasoning datasets, demonstrating its effectiveness in enhancing latent multi-hop reasoning ability.
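The toy module below shows one plausible reading of the back-attention idea: queries from a lower layer attend over hidden states recorded at a higher layer, across all positions, with a residual merge back into the lower stream. Dimensions and the exact wiring are assumptions for illustration, not the authors' architecture.

```python
# Hedged sketch of back attention: a lower layer re-attends over hidden
# states taken from a *higher* layer at all positions, letting early
# computation reuse late-layer features.
import torch
import torch.nn as nn

class BackAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, lower_h: torch.Tensor, higher_h: torch.Tensor):
        # Queries come from the lower layer; keys/values from the higher one.
        out, _ = self.attn(query=lower_h, key=higher_h, value=higher_h)
        return lower_h + out  # residual merge into the lower-layer stream

d = 64
back = BackAttention(d)
lower = torch.randn(2, 10, d)    # hidden states at an early layer
higher = torch.randn(2, 10, d)   # hidden states from a later layer
print(back(lower, higher).shape) # torch.Size([2, 10, 64])
```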


Unsupervised Translation of Emergent Communication

February 2025 · 2 Reads

Emergent Communication (EC) provides a unique window into the language systems that emerge autonomously when agents are trained to jointly achieve shared goals. However, it is difficult to interpret EC and evaluate its relationship with natural languages (NL). This study employs unsupervised neural machine translation (UNMT) techniques to decipher ECs formed during referential games with varying task complexities, influenced by the semantic diversity of the environment. Our findings demonstrate UNMT's potential to translate EC, illustrating that task complexity characterized by semantic diversity enhances EC translatability, while higher task complexity with constrained semantic variability exhibits pragmatic EC, which, although challenging to interpret, remains suitable for translation. This research marks the first attempt, to our knowledge, to translate EC without the aid of parallel data.


Figure 4: Example schema for each task. We show examples from the LLM+Mask method. See §A for examples of human-designed schemas.
Figure 6: Hard faithfulness curves for GPT-2-small on Greater-Than (left) and IOI (mid-left), and for Llama-3-8b on IOI (mid-right) and Winobias (right).
Figure 7: The first example is taken from the Greater-Than task and is generated using GPT2-small. Both the second and third examples are from the IOI dataset. The second mask is generated with GPT2-small, while the third is generated with LLaMA-3-8b. The highlighted positions are intended to capture the most influential positions that affect the model's predictions.
Figure 11: Winobias task results showing soft and hard faithfulness curves. Each column shows results for a single trial. The soft faithfulness curves initially drop significantly, suggesting the circuit assigns higher logits to the correct answer than to the incorrect, biased answer. The dotted lines in the hard faithfulness curves quantify this by showing the average percentage of cases where the circuit generates the correct answer, despite focusing on examples where the model predicts the biased answer. As the circuit size increases, the soft faithfulness curves rise, correlating with an increased percentage of biased predictions. This effect is more pronounced when token positions are differentiated.
An example for each dataset. Each entry demonstrates a pronoun resolution scenario, with variations designed to reflect anti-female, pro-female, anti-male, and pro-male biases.
Position-aware Automatic Circuit Discovery

February 2025 · 1 Read

A widely used strategy to discover and understand language model mechanisms is circuit analysis. A circuit is a minimal subgraph of a model's computation graph that executes a specific task. We identify a gap in existing circuit discovery methods: they assume circuits are position-invariant, treating model components as equally relevant across input positions. This limits their ability to capture cross-positional interactions or mechanisms that vary across positions. To address this gap, we propose two improvements to incorporate positionality into circuits, even on tasks containing variable-length examples. First, we extend edge attribution patching, a gradient-based method for circuit discovery, to differentiate between token positions. Second, we introduce the concept of a dataset schema, which defines token spans with similar semantics across examples, enabling position-aware circuit discovery in datasets with variable length examples. We additionally develop an automated pipeline for schema generation and application using large language models. Our approach enables fully automated discovery of position-sensitive circuits, yielding better trade-offs between circuit size and faithfulness compared to prior work.
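The snippet below contrasts standard edge attribution patching, which sums the attribution over positions, with a position-aware variant that keeps one score per token position. The tensors are random stand-ins; in practice the activations and gradients come from clean and corrupted forward and backward passes over the model.

```python
# Hedged sketch of position-aware edge attribution patching (EAP). Standard
# EAP scores an edge as (corrupt_act - clean_act) · grad summed over all
# positions; keeping the position axis yields one score per (edge, position).
import torch

seq_len, d_model = 8, 16
clean_act = torch.randn(seq_len, d_model)    # upstream activation, clean run
corrupt_act = torch.randn(seq_len, d_model)  # same node, corrupted run
grad = torch.randn(seq_len, d_model)         # gradient of the task metric
                                             # w.r.t. the downstream input

# Position-invariant EAP collapses positions into a single edge score:
edge_score = ((corrupt_act - clean_act) * grad).sum().item()

# Position-aware variant: one attribution per token position, so a dataset
# schema can group positions with similar semantics across examples.
per_position = ((corrupt_act - clean_act) * grad).sum(dim=-1)  # (seq_len,)
print(edge_score, per_position.shape)
```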


Figure 3: Images generated from different segments of the input prompt using ITE. Description of each column, from left to right: (1) An image generated using the full prompt (both prompt tokens and padding tokens encoded together), (2) An image generated using only the prompt tokens and clean padding tokens, (3) An image generated using only the prompt-contextual pads encoded with the prompt, while the prompt tokens were replaced with clean pad tokens.
Figure 5: Images generated from Lavi-bridge with LoRa loaded with scaling factor α (y-axis). We analyze pad token segments: the first column shows the full image, and the next columns show three consecutive 20% of the pads. As α decreases, fewer pad tokens are used.
Figure 6: Attention histogram for Stable Diffusion XL and FLUX* for each token reveals that while both models exclude semantic information from padding tokens, FLUX utilizes these tokens, whereas Stable Diffusion does not. *In FLUX, we have removed the long middle part with low attention in order to improve visualization.
Figure 7: Attention maps for FLUX diffusion show strong alignment between prompt tokens and semantically relevant image tokens. These maps also reveal high attention for padding tokens with the main objects in the image.
Figure 9: Images generated with FLUX from different prompt segments show distinct alignments: prompt tokens produce semantically accurate images, while visual nuances like 'cozy' emerge only from the prompt-contextual pad tokens.
Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models

January 2025 · 50 Reads

Text-to-image (T2I) diffusion models rely on encoded prompts to guide the image generation process. Typically, these prompts are extended to a fixed length by adding padding tokens before text encoding. Despite being a default practice, the influence of padding tokens on the image generation process has not been investigated. In this work, we conduct the first in-depth analysis of the role padding tokens play in T2I models. We develop two causal techniques to analyze how information is encoded in the representation of tokens across different components of the T2I pipeline. Using these techniques, we investigate when and how padding tokens impact the image generation process. Our findings reveal three distinct scenarios: padding tokens may affect the model's output during text encoding, during the diffusion process, or be effectively ignored. Moreover, we identify key relationships between these scenarios and the model's architecture (cross or self-attention) and its training process (frozen or trained text encoder). These insights contribute to a deeper understanding of the mechanisms of padding tokens, potentially informing future model design and training practices in T2I systems.
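One of the causal interventions described above can be approximated as swapping the prompt-contextual pad embeddings for "clean" pads encoded without the prompt. The sketch below shows that pattern with a diffusers Stable Diffusion pipeline; the model ID and the exact swap are illustrative assumptions, not the authors' procedure, and a GPU is assumed.

```python
# Hedged sketch: generate from edited prompt embeddings in which the
# prompt-contextual pad tokens are replaced by pads encoded from an empty
# prompt, isolating what the contextual pads contribute.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

prompt = "a cozy cabin in the woods"
tok = pipe.tokenizer(prompt, padding="max_length",
                     max_length=pipe.tokenizer.model_max_length,
                     return_tensors="pt")
n_real = int(tok.attention_mask.sum())  # prompt tokens incl. BOS/EOS

with torch.no_grad():
    full = pipe.text_encoder(tok.input_ids.to("cuda"))[0]
    empty_ids = pipe.tokenizer("", padding="max_length",
                               max_length=pipe.tokenizer.model_max_length,
                               return_tensors="pt").input_ids.to("cuda")
    clean = pipe.text_encoder(empty_ids)[0]  # pads with no prompt context

# Keep prompt-token embeddings; replace contextual pads with clean pads.
edited = full.clone()
edited[:, n_real:] = clean[:, n_real:]
image = pipe(prompt_embeds=edited).images[0]
image.save("no_contextual_pads.png")
```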


BetaAlign: a deep learning approach for multiple sequence alignment

January 2025 · 16 Reads · Bioinformatics

Motivation: Multiple sequence alignments (MSAs) are extensively used in biology, from phylogenetic reconstruction to structure and function prediction. Here, we suggest an out-of-the-box approach for the inference of MSAs, which relies on algorithms developed for processing natural languages. We show that our artificial intelligence (AI)-based methodology can be trained to align sequences by processing alignments that are generated via simulations, and thus different aligners can be easily generated for datasets with specific evolutionary dynamics attributes. We expect that natural language processing (NLP) solutions will replace or augment classic solutions for computing alignments, and more generally, challenging inference tasks in phylogenomics.

Results: The MSA problem is a fundamental pillar in bioinformatics, comparative genomics, and phylogenetics. Here, we characterize and improve BetaAlign, the first deep learning aligner, which substantially deviates from conventional algorithms of alignment computation. BetaAlign draws on NLP techniques and trains transformers to map a set of unaligned biological sequences to an MSA. We show that our approach is highly accurate, comparable to and sometimes better than state-of-the-art alignment tools. We characterize the performance of BetaAlign and the effect of various aspects on accuracy, for example, the size of the training data, the choice of transformer architecture, and the effect of learning on a subspace of indel-model parameters (subspace learning). We also introduce a new technique that leads to improved performance compared to our previous approach. Our findings further uncover the potential of NLP-based methods for sequence alignment, highlighting that AI-based algorithms can substantially challenge classic approaches in phylogenomics and bioinformatics.

Availability and implementation: Datasets used in this work are available on HuggingFace (Wolf et al., Transformers: State-of-the-Art Natural Language Processing, EMNLP 2020 System Demonstrations, pp. 38–45) at https://huggingface.co/dotan1111. Source code is available at https://github.com/idotan286/SimulateAlignments.
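A minimal illustration of the seq2seq framing, under the assumption of a simple '|'-separated serialization (the exact encoding BetaAlign uses may differ): the source string holds the unaligned sequences, the target holds the gapped alignment, and the target can be validated mechanically.

```python
# Hedged illustration of the seq2seq alignment framing. The serialization
# below is an assumption, not BetaAlign's actual format.
src = "ACGT|AGT|ACT"    # unaligned sequences, '|'-separated
tgt = "ACGT|A-GT|AC-T"  # aligned rows of equal length, '-' marks a gap

rows = tgt.split("|")
assert len({len(r) for r in rows}) == 1                      # equal lengths
assert [r.replace("-", "") for r in rows] == src.split("|")  # lossless
```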


Figure 2: The Venn diagram displays the count and overlap of proteins with high cosine similarity scores (above 0.6). For example, five proteins achieved high cosine similarity scores for the three BetaDescribe predictions and the BlastP-based description. Additionally, 20, 9, 2, and 9 proteins had high cosine similarity scores exclusively in predictions 1, 2, 3, or using BlastP, respectively. After normalization, these counts correspond to 37.4%, 40.7%, 34.5%, and 18.3%, respectively.
Figure 3: BlastP-based predictions are more accurate when congruent with BetaDescribe's predictions. BlastP scores on Category 3 proteins, as a function of their BlastP E-value. In each bin, we divided the BlastP predictions into two groups: those that are congruent with BetaDescribe's prediction 1 (cosine similarity score above the median) and those that are not.
Figure 4: Identifying functionally important regions for the preproinsulin protein. The four main regions of the insulin are marked: Signal peptide, Insulin B chain, C peptide, and Insulin A chain.
Performance of BetaDescribe on Category 1 proteins, i.e., test proteins without BlastP hits when searched against the training data. The number of descriptions for each column is stated in parentheses. (a) The performance considering the 189 proteins in Category 1. (b) The performance without the 38 proteins with identical sequences in the training data. Predictions 1, 2, and 3 are the first, second, and third descriptions, respectively, provided by BetaDescribe.
Protein2Text: Providing Rich Descriptions for Protein Sequences

December 2024 · 28 Reads

Understanding the functionality of proteins has been a focal point of biological research due to their critical roles in various biological processes. Unraveling protein functions is essential for advancements in medicine, agriculture, and biotechnology, enabling the development of targeted therapies, engineered crops, and novel biomaterials. However, this endeavor is challenging due to the complex nature of proteins, requiring sophisticated experimental designs and extended timelines to uncover their specific functions. Public large language models (LLMs), though proficient in natural language processing, struggle with biological sequences due to the unique and intricate nature of biochemical data. These models often fail to accurately interpret and predict the functional and structural properties of proteins, limiting their utility in bioinformatics. To address this gap, we introduce BetaDescribe, a collection of models designed to generate detailed and rich textual descriptions of proteins, encompassing properties such as function, catalytic activity, involvement in specific metabolic pathways, subcellular localizations, and the presence of particular domains. The trained BetaDescribe model receives protein sequences as input and outputs a textual description of these properties. BetaDescribe's starting point was the LLAMA2 model, which was trained on trillions of tokens. Next, we trained our model on datasets containing both biological and English text, allowing biological knowledge to be incorporated. We demonstrate the utility of BetaDescribe by providing descriptions for proteins that share little to no sequence similarity to proteins with functional descriptions in public datasets. We also show that BetaDescribe can be harnessed to conduct in-silico mutagenesis procedures to identify regions important for protein functionality without needing homologous sequences for inference. Altogether, BetaDescribe offers a powerful tool to explore protein functionality, augmenting existing approaches such as annotation transfer based on sequence or structure similarity.
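The in-silico mutagenesis idea can be sketched as follows: mutate one region at a time, regenerate the description, and rank regions by how much the description changes. `describe` and `similarity` below are hypothetical stand-ins for the BetaDescribe model and an embedding-based cosine similarity; the windowed alanine substitution is likewise an assumption.

```python
# Hedged sketch of description-guided in-silico mutagenesis. Regions whose
# mutation most changes the generated description are flagged as likely
# functionally important.
def describe(seq: str) -> str:  # hypothetical stand-in for the trained model
    # toy rule: pretend the model reports a signal peptide when it sees one
    return ("signal peptide present" if seq.startswith("MALW")
            else "mature chain only")

def similarity(a: str, b: str) -> float:  # stand-in for embedding cosine
    return float(a == b)

def region_importance(seq: str, window: int = 10):
    base = describe(seq)
    scores = []
    for start in range(0, len(seq) - window + 1, window):
        mutated = seq[:start] + "A" * window + seq[start + window:]
        scores.append((start, 1.0 - similarity(base, describe(mutated))))
    return scores  # high score = description changed = likely functional

print(region_importance("MALWMRLLPLLALLALWGPDPAAA" * 2))
```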


Figure 2: Notation for the emergent communication (EC) setup.
Figure 4: Average message purity, comparing trained models to random baselines.
Figure 7: Reconstruction (top) and discrimination (bottom) examples on Shapes
Figure 9: Message purity per attribute and game type.
Semantics and Spatiality of Emergent Communication

November 2024 · 4 Reads

When artificial agents are jointly trained to perform collaborative tasks using a communication channel, they develop opaque goal-oriented communication protocols. Good task performance is often considered sufficient evidence that meaningful communication is taking place, but existing empirical results show that communication strategies induced by common objectives can be counterintuitive whilst solving the task nearly perfectly. In this work, we identify a goal-agnostic prerequisite to meaningful communication, which we term semantic consistency, based on the idea that messages should have similar meanings across instances. We provide a formal definition for this idea, and use it to compare the two most common objectives in the field of emergent communication: discrimination and reconstruction. We prove, under mild assumptions, that semantically inconsistent communication protocols can be optimal solutions to the discrimination task, but not to reconstruction. We further show that the reconstruction objective encourages a stricter property, spatial meaningfulness, which also accounts for the distance between messages. Experiments with emergent communication games validate our theoretical results. These findings demonstrate an inherent advantage of distance-based communication goals, and contextualize previous empirical discoveries.
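One natural formalization of the semantic-consistency idea described above (the paper's exact definition may differ) is that any two inputs mapped to the same message should be close under a semantic metric:

```latex
% Hedged formalization: let E : X -> M be the emergent encoder, s : X -> S a
% semantic map into a space with metric d, and eps >= 0 a tolerance.
% Semantic consistency requires inputs sharing a message to be close:
\[
  \forall\, x, x' \in \mathcal{X}:\quad
  E(x) = E(x') \;\Longrightarrow\; d\bigl(s(x),\, s(x')\bigr) \le \varepsilon .
\]
```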


Growing a Tail: Increasing Output Diversity in Large Language Models

November 2024 · 5 Reads

How diverse are the outputs of large language models when diversity is desired? We examine the diversity of responses of various models to questions with multiple possible answers, comparing them with human responses. Our findings suggest that models' outputs are highly concentrated, reflecting a narrow, mainstream 'worldview', in comparison to humans, whose responses exhibit a much longer tail. We examine three ways to increase models' output diversity: 1) increasing generation randomness via temperature sampling; 2) prompting models to answer from diverse perspectives; 3) aggregating outputs from several models. A combination of these measures significantly increases models' output diversity, reaching that of humans. We discuss implications of these findings for AI policy aimed at preserving cultural diversity, an essential building block of a democratic social fabric.
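The sketch below shows one way to quantify the effect of two of these interventions: sample answers at different temperatures, optionally pool samples across models, and compare the entropy of the answer distributions. `ask` is a hypothetical stand-in for a real LM sampling call, so the numbers are illustrative only.

```python
# Hedged sketch: measure output diversity as the entropy (in bits) of the
# distribution over distinct sampled answers.
import math
import random
from collections import Counter

def ask(model: str, question: str, temperature: float) -> str:
    # placeholder: a real call would sample from the named LM at this
    # temperature; here, higher temperature just widens the answer pool
    pool = ["pizza", "sushi", "tacos", "pho", "injera", "pierogi"]
    k = max(1, min(len(pool), int(1 + 3 * temperature)))
    return random.choice(pool[:k])

def entropy(answers):
    counts = Counter(answers)
    n = len(answers)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

q = "What is your favorite food?"
for t in (0.2, 1.0):
    single = [ask("model-a", q, t) for _ in range(200)]
    pooled = [ask(m, q, t) for m in ("model-a", "model-b") for _ in range(100)]
    print(f"T={t}: one model {entropy(single):.2f} bits, "
          f"two models {entropy(pooled):.2f} bits")
```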


Citations (42)


... This class of methods enables probing complex systems without constraining their design or needing to extract an interpretable model. Inspired by modern interpretability methods [28,31,58,141], and new XRL approaches [64,67,110], we decided to anticipate the adoption of explainability in the expanding field of MADRL and encourage the AAMAS community to use and engage more systematically with modern direct interpretability methods. ...

Reference:

Perspectives for Direct Interpretability in Multi-Agent Deep Reinforcement Learning
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
  • Citing Conference Paper
  • January 2024

... For instance, FreeU (Si et al., 2024) analyzes the roles played by the U-Net's backbone and skip connections from a frequency-domain perspective. Additionally, several studies have examined properties of text encoders and cross-attention mechanisms, shedding light on the interactions between textual prompts and the diffusion process (Toker et al., 2024; Yang et al., 2024b; Yi et al., 2024). In this study, we aim to further investigate the underlying mechanisms from the perspective of positional information construction of U-Net. ...

Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines
  • Citing Conference Paper
  • January 2024

... For instance, Xu, Niethammer, and Raffel (2022) evaluated learned EC on unseen test data to assess generalization, providing insights into NL aspects. Carmeli, Belinkov, and Meir (2024) proposed mapping EC symbols to NL concepts, assessing EC's compositionality. However, their approach forms a global mapping of atomic symbols, rather than full translation of individual messages. ...

Concept-Best-Matching: Evaluating Compositionality In Emergent Communication
  • Citing Conference Paper
  • January 2024

... In this stage, TRCE starts by eliminating the influence of malicious semantics from input prompts. We apply a closed-form cross-attention refinement [11], which is widely used in editing knowledge in attention-based networks [1,4,11,12,21,26]. In these studies, the 'Key' and 'Value' projection matrices W_K and W_V of the cross-attention layers are adjusted to map the concept embeddings ...

ReFACT: Updating Text-to-Image Models by Editing the Text Encoder
  • Citing Conference Paper
  • January 2024

... Related Work: Although few studies directly compare representational similarity measures based on their discriminative power, most efforts in this area focus on identifying metrics that distinguish between models by their construction. These efforts typically involve assessing measures based on their ability to match corresponding layers across models with varying seeds (Kornblith et al., 2019) or identical architectures with different initializations (Han et al., 2023; Rahamim & Belinkov, 2024). The closest to our work are the studies by Ding et al. (2021) and Cloos et al. (2024). ...

ContraSim – Analyzing Neural Representations Based on Contrastive Learning
  • Citing Conference Paper
  • January 2024

... Using the conceptual framework of fields of visibility, [54] analyse content asymmetries on Wikipedia as a system composed of diverse agents that affects content in three ways: representation, characterization, and structural placement. If model providers do not wish to contribute to this systematic under- and mis-representation, they must be aware of such phenomena and operationalize them in order to mitigate bias (e.g., by actively collecting under-represented data [55] or training with debiasing techniques [56,57,58]). Furthermore, transparent communication of learned bias to downstream model deployers is an important responsibility of model providers. ...

Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information
  • Citing Conference Paper
  • January 2024

... Specifically, they read information about the context or reasoning results from the residual stream, then enhance the information that needs to be expressed as output, and write it back into the stream. Amplification Head [Lieberum et al., 2023] and Correct Head [Wiegreffe et al., 2024] amplify the signal of the correct choice letter in MCQA problems near the [END] position. This amplification ensures that after passing through the Unembedding layer and softmax calculation, the correct choice letter has the highest probability. ...

Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions

... Besides character-based and k-mer-based tokenization, subword tokenization methods, traditionally developed for natural language texts, have also been used for tokenizing protein sequences [Bepler and Berger, 2021, Tan et al., 2023, Dotan et al., 2024, Ieremie et al., 2024]. Subword tokenization methods have proven successful in NLP when handling rare words and improving model efficiency [Sennrich et al., 2016]. ...

Effect of tokenization on transformers for biological sequences
  • Citing Article
  • April 2024
  • Bioinformatics

... Recent work has also addressed ethical concerns through targeted concept removal techniques by editing selective weights [7,9,20] and redirecting concept representations [15,23]. [Figure 2 caption: Customization adapters (custom diffusion [16] and dreambooth [26]) and concept control adapters (concept sliders [8]) trained on the SDXL-base model can be transferred to all the distilled models without any additional finetuning, demonstrating that concept representations are preserved through the diffusion distillation process.] Since distillation modifies the UNet model of diffusion, in this work, we mainly focus on custom concept and control representations that are captured in UNet modules. ...

Unified Concept Editing in Diffusion Models
  • Citing Conference Paper
  • January 2024

... Lall and Tallur 2023 proposed reinforcement neural networks for pairwise sequence alignments of longer sequences. Conceptually very interesting are the natural language transformer-based models proposed by Dotan et al. 2022, Dotan et al. 2024a, and Dotan et al. 2024b. They aligned up to 10 sequences and up to 1024 symbols. ...

BetaAlign: a deep learning approach for multiple sequence alignment