Anton Bakhtin’s scientific contributions

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (21)


Figure 4: With Linguistic Prompting, the LLM does not appear to be more representative of the corresponding non-Western countries.
Figure 6: Distribution of topics in the data. The majority of the questions are classified into "Politics and policy" and "Regions and countries".
Figure 7: An example where cross-national prompting changes the model's responses, but the model responses do not become more representative of the responses of the participants from Turkey. Corresponding model generations are in Table 7.
Figure 9: An example where the model's response changes when provided with a cross-national prompt, assigning 99.1% probability to the response "Generally bad".
Towards Measuring the Representation of Subjective Global Opinions in Language Models
  • Preprint

June 2023 · 190 Reads · 5 Citations

Esin Durmus · Karina Nguyen · Thomas I. Liao · [...] · Deep Ganguli

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com.
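To make the framework concrete, here is a minimal sketch of the kind of country-conditioned similarity computation the abstract describes. The use of 1 − Jensen-Shannon distance as the similarity score, and all variable names, are illustrative assumptions rather than the paper's exact metric.

```python
# Hedged sketch: score how similar a model's answer distribution is to each
# country's human response distribution for one survey question.
# The choice of 1 - Jensen-Shannon distance is an assumption for illustration;
# the paper defines its own similarity metric.
import numpy as np
from scipy.spatial.distance import jensenshannon

def country_similarities(model_probs, human_probs_by_country):
    """model_probs: array over answer options (sums to 1).
    human_probs_by_country: dict country -> array over the same options."""
    scores = {}
    for country, human_probs in human_probs_by_country.items():
        # jensenshannon returns the JS *distance* (sqrt of JS divergence), in [0, 1] with base 2.
        dist = jensenshannon(model_probs, human_probs, base=2)
        scores[country] = 1.0 - dist
    return scores

# Toy example: 4 answer options, 2 countries.
model = np.array([0.70, 0.20, 0.05, 0.05])
humans = {
    "US": np.array([0.65, 0.25, 0.05, 0.05]),
    "TR": np.array([0.20, 0.30, 0.30, 0.20]),
}
print(country_similarities(model, humans))  # higher score = more similar
```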


Human-level play in the game of Diplomacy by combining language models with strategic reasoning

November 2022 · 455 Reads · 185 Citations
Science

Despite much progress in training AI systems to imitate human language, building agents that use language to communicate intentionally with humans in interactive environments remains a major challenge. We introduce Cicero, the first AI agent to achieve human-level performance in Diplomacy, a strategy game involving both cooperation and competition that emphasizes natural language negotiation and tactical coordination between seven players. Cicero integrates a language model with planning and reinforcement learning algorithms by inferring players' beliefs and intentions from its conversations and generating dialogue in pursuit of its plans. Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
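The abstract's description of how the agent couples language with strategic reasoning can be read as a three-step loop per turn: infer other players' likely intents from the dialogue, plan against those intents, then generate messages in support of the chosen plan. The sketch below is purely schematic; the class names and their toy implementations are hypothetical placeholders, not Cicero's actual components.

```python
# Schematic sketch of the per-turn loop described in the abstract.
# All components below are hypothetical stand-ins, not the real system's modules.

class IntentModel:
    def predict(self, state, history):
        # Placeholder: in the real agent this is a learned model of players' likely orders.
        return {player: "hold" for player in state["players"]}

class Planner:
    def choose_actions(self, state, predicted_intents):
        # Placeholder: in the real agent this is planning/RL over candidate order sets.
        return ["A PAR - BUR"]

class DialogueModel:
    def generate(self, state, history, plan):
        # Placeholder: in the real agent this is a language model conditioned on the plan.
        return [f"I plan to play {plan[0]} this turn."]

def play_turn(state, history, intent_model, planner, dialogue_model):
    intents = intent_model.predict(state, history)            # 1. belief/intent inference
    plan = planner.choose_actions(state, intents)              # 2. strategic planning
    messages = dialogue_model.generate(state, history, plan)   # 3. plan-conditioned dialogue
    return plan, messages

state = {"players": ["FRANCE", "ENGLAND", "GERMANY"]}
print(play_turn(state, [], IntentModel(), Planner(), DialogueModel()))
```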


Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning

October 2022 · 16 Reads

No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. We address this shortcoming by first introducing a planning algorithm we call DiL-piKL that regularizes a reward-maximizing policy toward a human imitation-learned policy. We prove that this is a no-regret learning algorithm under a modified utility function. We then show that DiL-piKL can be extended into a self-play reinforcement learning algorithm we call RL-DiL-piKL that provides a model of human play while simultaneously training an agent that responds well to this human model. We used RL-DiL-piKL to train an agent we name Diplodocus. In a 200-game no-press Diplomacy tournament involving 62 human participants spanning skill levels from beginner to expert, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo ratings model.


Figure 3: 20 independent runs of MAPPO (solid lines) and 24 of QMIX (dashed lines) on the tiger-trampoline toy problem. Two MAPPO runs find the optimal strategy, while no QMIX run does.
Self-Explaining Deviations for Coordination

July 2022 · 28 Reads

Fully cooperative, partially observable multi-agent problems are ubiquitous in the real world. In this paper, we focus on a specific subclass of coordination problems in which humans are able to discover self-explaining deviations (SEDs). SEDs are actions that deviate from the common understanding of what reasonable behavior would be in normal circumstances. They are taken with the intention of causing another agent or other agents to realize, using theory of mind, that the circumstance must be abnormal. We first motivate SED with a real-world example and formalize its definition. Next, we introduce a novel algorithm, improvement maximizing self-explaining deviations (IMPROVISED), to perform SEDs. Lastly, we evaluate IMPROVISED both in an illustrative toy setting and in the popular benchmark setting Hanabi, where it is the first method to produce so-called finesse plays, which are regarded as one of the more iconic examples of human theory of mind.


Modeling Strong and Human-Like Gameplay with KL-Regularized Search

December 2021 · 145 Reads

We consider the task of building strong but human-like policies in multi-agent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans, while self-play learning and search techniques (e.g. AlphaZero) lead to strong performance but may produce policies that are difficult for humans to understand and coordinate with. We show in chess and Go that applying Monte Carlo tree search with search policies regularized by the KL divergence from an imitation-learned policy produces policies that have higher human prediction accuracy and are stronger than the imitation policy. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that applying this algorithm to no-press Diplomacy yields a policy that maintains the same human prediction accuracy as imitation learning while being substantially stronger.
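As a worked illustration of the regularization idea, the sketch below computes the closed-form solution of maximizing expected action value minus λ times the KL divergence to an anchor (imitation-learned) policy, namely π(a) ∝ τ(a)·exp(Q(a)/λ). This is a minimal sketch of that standard result only, not the paper's full search or regret-minimization procedure, and the variable names are assumptions.

```python
# Minimal sketch of a KL-regularized policy of the kind the abstract describes:
# choose pi maximizing  E_{a~pi}[Q(a)] - lam * KL(pi || tau),
# whose closed-form solution is  pi(a) proportional to tau(a) * exp(Q(a) / lam).
import numpy as np

def kl_regularized_policy(q_values, anchor_policy, lam):
    """q_values: estimated value of each action; anchor_policy: imitation-learned
    probabilities over the same actions; lam: regularization strength."""
    logits = np.log(anchor_policy) + q_values / lam
    logits -= logits.max()                 # subtract max for numerical stability
    policy = np.exp(logits)
    return policy / policy.sum()

q = np.array([1.0, 0.8, 0.2])              # action values from search
tau = np.array([0.2, 0.7, 0.1])            # human imitation policy
print(kl_regularized_policy(q, tau, lam=10.0))  # large lam: stays close to tau (human-like)
print(kl_regularized_policy(q, tau, lam=0.1))   # small lam: concentrates on argmax Q (strong)
```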


Figure 3: The most probable actions for both players as predicted by the agent. The left figure shows the actions and the right figure shows the state after the actions are executed. France's army escapes to MUN.
Figure 4: The most probable action for France (blue) as predicted by the agent vs the best action for Austria (red) that the policy proposal network failed to find on its own. The left figure shows the actions and the right figure shows the state after the actions are executed. France's army is crushed and disbanded.
Table: SoS scores of various agents playing against 6 copies of another agent (the ± shows one standard …).
No-Press Diplomacy from Scratch

October 2021 · 120 Reads

Prior AI successes in complex games have largely focused on settings with at most hundreds of actions at each decision point. In contrast, Diplomacy is a game with more than 10^20 possible actions per turn. Previous attempts to address games with large branching factors, such as Diplomacy, StarCraft, and Dota, used human data to bootstrap the policy or used handcrafted reward shaping. In this paper, we describe an algorithm for action exploration and equilibrium approximation in games with combinatorial action spaces. This algorithm simultaneously performs value iteration while learning a policy proposal network. A double oracle step is used to explore additional actions to add to the policy proposals. At each state, the target state value and policy for the model training are computed via an equilibrium search procedure. Using this algorithm, we train an agent, DORA, completely from scratch for a popular two-player variant of Diplomacy and show that it achieves superhuman performance. Additionally, we extend our methods to full-scale no-press Diplomacy and for the first time train an agent from scratch with no human data. We present evidence that this agent plays a strategy that is incompatible with human-data bootstrapped agents. This presents the first strong evidence of multiple equilibria in Diplomacy and suggests that self-play alone may be insufficient for achieving superhuman performance in Diplomacy.
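The "double oracle step" mentioned above can be illustrated on a toy zero-sum matrix game: repeatedly solve the game restricted to the current candidate actions, then add each player's best response over the full action set. The sketch below is only that toy illustration, with regret matching standing in for the paper's learned value and policy-proposal networks and its combinatorial action space; all names and sizes are assumptions.

```python
# Hedged toy sketch of a double-oracle style action-exploration loop for a small
# two-player zero-sum matrix game (row player maximizes the payoff).
import numpy as np

def regret_matching(payoff, iters=2000):
    """Approximate equilibrium of a zero-sum matrix game via regret matching."""
    n_rows, n_cols = payoff.shape
    r_regret, c_regret = np.zeros(n_rows), np.zeros(n_cols)
    r_sum, c_sum = np.zeros(n_rows), np.zeros(n_cols)
    for _ in range(iters):
        r_pol = np.maximum(r_regret, 0)
        r_pol = r_pol / r_pol.sum() if r_pol.sum() > 0 else np.full(n_rows, 1 / n_rows)
        c_pol = np.maximum(c_regret, 0)
        c_pol = c_pol / c_pol.sum() if c_pol.sum() > 0 else np.full(n_cols, 1 / n_cols)
        r_util = payoff @ c_pol                 # row player's utility per action
        c_util = -(r_pol @ payoff)              # column player's utility per action
        r_regret += r_util - r_pol @ r_util     # accumulate instantaneous regrets
        c_regret += c_util - c_pol @ c_util
        r_sum += r_pol
        c_sum += c_pol
    return r_sum / iters, c_sum / iters         # average strategies approximate a Nash eq.

def double_oracle(full_payoff, init_rows, init_cols, steps=10):
    rows, cols = list(init_rows), list(init_cols)
    for _ in range(steps):
        sub = full_payoff[np.ix_(rows, cols)]
        r_pol, c_pol = regret_matching(sub)     # solve the restricted game
        # Best responses over the FULL action sets against the restricted equilibrium.
        br_row = int(np.argmax(full_payoff[:, cols] @ c_pol))
        br_col = int(np.argmin(r_pol @ full_payoff[rows, :]))
        if br_row in rows and br_col in cols:
            break                               # no new action improves: stop exploring
        if br_row not in rows:
            rows.append(br_row)
        if br_col not in cols:
            cols.append(br_col)
    return rows, cols

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 8))                     # random zero-sum game
print(double_oracle(G, init_rows=[0], init_cols=[0]))
```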


Physical Reasoning Using Dynamics-Aware Models

February 2021 · 44 Reads

A common approach to solving physical-reasoning tasks is to train a value learner on example tasks. A limitation of such an approach is that it requires learning about object dynamics solely from reward values assigned to the final state of a rollout of the environment. This study aims to address this limitation by augmenting the reward value with additional supervisory signals about object dynamics. Specifically, we define a distance measure between the trajectories of two target objects, and use this distance measure to characterize the similarity of two environment rollouts. We train the model to correctly rank rollouts according to this measure in addition to predicting the correct reward. Empirically, we find that this approach leads to substantial performance improvements on the PHYRE benchmark for physical reasoning: our approach obtains a new state-of-the-art on that benchmark.
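A minimal sketch of the kind of training signal described: a reward-prediction loss combined with a pairwise ranking loss that asks the model to order two rollouts consistently with the trajectory-based distance measure. The margin ranking loss, the tiny MLP scorer, and all names below are illustrative assumptions, not the paper's exact architecture or loss.

```python
# Hedged sketch: reward prediction plus trajectory-distance-consistent ranking.
import torch
import torch.nn as nn

class RolloutScorer(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.reward_head = nn.Linear(64, 1)   # predicts rollout reward / success
        self.rank_head = nn.Linear(64, 1)     # scores rollouts for ranking

    def forward(self, x):
        h = self.backbone(x)
        return self.reward_head(h).squeeze(-1), self.rank_head(h).squeeze(-1)

def training_loss(model, feats_a, feats_b, rewards_a, a_ranks_higher, margin=0.1):
    """a_ranks_higher: 1.0 where rollout A should rank above rollout B under the
    trajectory-distance measure, else -1.0 (both are float tensors)."""
    reward_pred_a, rank_a = model(feats_a)
    _, rank_b = model(feats_b)
    reward_loss = nn.functional.binary_cross_entropy_with_logits(reward_pred_a, rewards_a)
    rank_loss = nn.functional.margin_ranking_loss(rank_a, rank_b, a_ranks_higher, margin=margin)
    return reward_loss + rank_loss

model = RolloutScorer(feat_dim=16)
a, b = torch.randn(4, 16), torch.randn(4, 16)   # stand-in rollout features
loss = training_loss(model, a, b,
                     rewards_a=torch.tensor([1., 0., 1., 0.]),
                     a_ranks_higher=torch.tensor([1., -1., 1., -1.]))
loss.backward()
```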


Human-Level Performance in No-Press Diplomacy via Equilibrium Search

October 2020 · 95 Reads

Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via external regret minimization. External regret minimization techniques have been behind previous AI successes in adversarial games, most notably poker, but have not previously been shown to be successful in large-scale games involving cooperation. We show that our agent greatly exceeds the performance of past no-press Diplomacy bots, is unexploitable by expert humans, and achieves a rank of 23 out of 1,128 human players when playing anonymous games on a popular Diplomacy website.


Combining Deep Reinforcement Learning and Search for Imperfect-Information Games

July 2020 · 162 Reads

The combination of deep reinforcement learning and search at both training and test time is a powerful paradigm that has led to a number of successes in single-agent settings and perfect-information games, best exemplified by the success of AlphaZero. However, algorithms of this form have been unable to cope with imperfect-information games. This paper presents ReBeL, a general framework for self-play reinforcement learning and search for imperfect-information games. In the simpler setting of perfect-information games, ReBeL reduces to an algorithm similar to AlphaZero. Results show ReBeL leads to low exploitability in benchmark imperfect-information games and achieves superhuman performance in heads-up no-limit Texas hold'em poker, while using far less domain knowledge than any prior poker AI. We also prove that ReBeL converges to a Nash equilibrium in two-player zero-sum games in tabular settings.


Residual Energy-Based Models for Text Generation

April 2020 · 46 Reads · 1 Citation

Text generation is ubiquitous in many NLP tasks, from summarization to dialogue and machine translation. The dominant parametric approach is based on locally normalized models which predict one word at a time. While these work remarkably well, they are plagued by exposure bias due to the greedy nature of the generation process. In this work, we investigate un-normalized energy-based models (EBMs) which operate not at the token but at the sequence level. In order to make training tractable, we first work in the residual of a pretrained locally normalized language model and second we train using noise contrastive estimation. Furthermore, since the EBM works at the sequence level, we can leverage pretrained bi-directional contextual representations, such as BERT and RoBERTa. Our experiments on two large language modeling datasets show that residual EBMs yield lower perplexity compared to locally normalized baselines. Moreover, generation via importance sampling is very efficient and of higher quality than the baseline models according to human evaluation.
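A minimal sketch of the training setup the abstract describes: the energy model is defined in the residual of a pretrained locally normalized LM, P_theta(x) ∝ P_LM(x)·exp(−E_theta(x)), and trained with noise contrastive estimation as a binary classifier between human text and samples from the base LM. The toy bag-of-embeddings energy network and the exact use of −E as the classifier logit are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: noise contrastive estimation for a residual energy-based model.
# Real text should receive low energy (high logit); LM samples high energy (low logit).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEnergy(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, token_ids):                    # (batch, seq_len) -> (batch,)
        return self.out(self.emb(token_ids).mean(dim=1)).squeeze(-1)

def nce_loss(energy_net, real_tokens, lm_sample_tokens):
    logit_real = -energy_net(real_tokens)            # classify real text as positive
    logit_fake = -energy_net(lm_sample_tokens)       # classify LM samples as negative
    return (F.binary_cross_entropy_with_logits(logit_real, torch.ones_like(logit_real)) +
            F.binary_cross_entropy_with_logits(logit_fake, torch.zeros_like(logit_fake)))

energy = ToyEnergy(vocab_size=100)
real = torch.randint(0, 100, (8, 20))                # stand-in for human-written text
fake = torch.randint(0, 100, (8, 20))                # stand-in for samples from the base LM
loss = nce_loss(energy, real, fake)
loss.backward()
```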


Citations (13)


... The first study. In 2023, scientists developed the "Chinese Room of Increased Complexity" technology to create algorithmic copies of citizens of any country [11]. This was followed by the Wuhan experiment to predict the US presidential election in 2024 based on the analysis of the AI model of preferences of simulacra rather than people. ...

Reference:

Nataliya Onishchenko, Oleksii Kostenko, Dmytro Zhuravlov (2025). AI Technologies to the Question of the "Policy" of Legal Regulation at the Present Stage: Essential and Instrumental Factors. International Journal of Innovative Technologies in Social Science, 1(45).
Towards Measuring the Representation of Subjective Global Opinions in Language Models

... 1), suggests evolution toward increasingly sophisticated autonomous agents. While predictive AI has already reached human-level performance in socially advanced games like poker and Diplomacy (Bakhtin et al., 2022; Brown & Sandholm, 2019), GenAI's advances in general-purpose capabilities involving emotion and theory of mind (Kosinski, 2024; Strachan et al., 2024) suggest a shift in AI capabilities, enabling more ubiquitous deployment of agentic AI for human collaboration and task delegation in real-life problem solving. These developments create research opportunities regarding how organizations can effectively leverage GenAI models for simulation and learning, what determines appropriate boundaries between human and artificial agency (Murray et al., 2021), and how technical architectures can enable human-AI complementarity while maintaining meaningful human involvement. ...

Human-level play in the game of Diplomacy by combining language models with strategic reasoning
  • Citing Article
  • November 2022

Science

... PLMs (Pre-trained Language Models) have learned from a large number of corpus materials to model the distribution of natural language to a large extent; hence they are able to generate texts of unprecedented quality [26]. Nevertheless, PLMs are based on neural networks, which essentially are still black boxes, lacking a good level of interpretability. ...

Residual Energy-Based Models for Text Generation
  • Citing Preprint
  • April 2020

... The core idea behind PEFT is to achieve comparable performance to full-parameter finetuning by updating only a portion of the existing model's or newly added parameters. Inspired by the manually defined prompt (Petroni et al. 2019), the learnable prompt adjusts the model by adding a few parameterized input blocks into the input layer of the trained Transformer model (Jia et al. 2022;Dong et al. 2023;Nie et al. 2023). Some subsequent works have explored adjusting other elements of the Transformer architecture, such as attention block (Li and Liang 2021). ...

Language Models as Knowledge Bases?
  • Citing Conference Paper
  • January 2019

... Prompts in this context can be divided based on their form: cloze prompts, which are designed to fill in the blanks of a textual string, and prefix prompts, used to continue a string prefix (Cui et al., 2021;Petroni et al., 2019;Li & Liang, 2021). These different structures offer varying advantages depending on the task at hand. ...

Language Models as Knowledge Bases?
  • Citing Preprint
  • September 2019

... hypothetical scenario conceptualization is necessary, as well as "physical intelligence", which one also encounters in the solutions of Bakhtin et al. [82], though it is not yet a benchmark that tests causal sequences of events as CLEVRER [60] does (see section "CLEVRER"). Nevertheless, all scenes until the last one that manifests the correct question's answer have to be physically plausible. ...

PHYRE: A New Benchmark for Physical Reasoning
  • Citing Preprint
  • August 2019

... With the rising fear of AI-generated texts' proliferation, a growing body of research in NLP has explored different approaches to detecting AI-generated texts (e.g., Bakhtin et al., 2019), including watermarking AI-generated texts (e.g., Kirchenbauer et al., 2023), detecting traces of language modeling in generated texts (e.g., Tay et al., 2020), and training AI models to discriminate between AI-generated and human-written texts (e.g., Fangi et al., 2021). One of the promising detection approaches that can help ESL educators control AI-assisted plagiarism is training an AI-based classifier (i.e., a program that classifies texts based on specific criteria) to distinguish between human-written and AI-generated texts (Jawahar et al., 2020). ...

Real or Fake? Learning to Discriminate Machine from Human Generated Text
  • Citing Preprint
  • June 2019

... The particular task we apply to is closely related to G2P and polyphone disambiguation, with rich prior work, particularly on Chinese, Japanese, and Arabic [16,17,18,19], possibly enhanced by explicit rules [20] or implicit external knowledge [21]; lexicons have also been used during training [22,23]. The flexibility of using external knowledge has been discussed in MT [24] and G2P [25], though knowledge from raw lexicon texts was not leveraged. Representations from pretrained language models introduce additional knowledge and enhance TTS and G2P performance [26,27,28,29,30], but in most cases the training data contain limited phonemic information, making these models less useful for capturing pronunciations. ...

Dictionary Augmented Sequence-to-Sequence Neural Network for Grapheme to Phoneme Prediction
  • Citing Conference Paper
  • September 2018

... Then we introduce the features individually designed for each of the models. For N-gram, we follow Bakhtin et al. (2018) and extract each order's discount probabilities as the features. And for Transformer-XL, we design the confidence score based on the attention scores. ...

Lightweight Adaptive Mixture of Neural and N-gram Language Models
  • Citing Article
  • April 2018

... Search results are transformed into keyword activation scores by computing confidence scores of speech segments. Efforts have been made to evaluate different strategies for computing confidence scores for CTC-based KWS [21], [22]. In this work, we follow the confidence metric proposed in [22]. ...

Streaming small-footprint keyword spotting using sequence-to-sequence models
  • Citing Conference Paper
  • December 2017