Thomas Hubert’s scientific contributions


Publications (16)


The relationship between C++ and assembly programs
a, A C++ implementation of a variable sort 2 function that sorts any input sequence of up to two elements. b, The C++ implementation in a is compiled to this equivalent low-level assembly representation.
The AssemblyGame and algorithm correctness computation
a, The AssemblyGame is played by AlphaDev, which receives as input the current assembly algorithm generated thus far St and plays the game by selecting an action to execute. In this example, the action is a mov<Register0,Memory1> assembly instruction, which is appended to the current algorithm. The agent receives a reward that is a function of the algorithm’s correctness, discussed in b, as well as the algorithm’s latency. The game is won by the player discovering a low latency, correct algorithm. b, The program correctness and latency computations are used to compute the reward rt. In this example, test sequences are input to the algorithm; for example, in the case of sorting three elements, test inputs comprise all sequences of unsorted elements of length 3. For each sequence, the algorithm output is compared to the expected output (in the case of sorting, the expected output is the sorted elements). In this example, the output D′ does not match the expected output B′ and the algorithm is therefore incorrect.
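As a concrete illustration of the correctness computation described in b, the sketch below (a minimal Python example, not AlphaDev's actual evaluation code; the function names are placeholders) enumerates every ordering of three elements as test inputs and scores a candidate sort routine by the fraction of cases whose output matches the sorted reference.

```python
from itertools import permutations

def correctness_score(candidate_sort, n=3):
    """Fraction of length-n test sequences that the candidate sorts correctly."""
    tests = list(permutations(range(n)))      # all unsorted orderings of n distinct elements
    passed = sum(candidate_sort(list(seq)) == sorted(seq) for seq in tests)
    return passed / len(tests)

print(correctness_score(sorted))              # 1.0: a correct routine passes every test
print(correctness_score(lambda s: s))         # < 1.0: only the already-sorted input passes
```

In AlphaDev's setting this correctness score, together with measured latency, feeds the reward rt; an algorithm is only accepted when every test sequence produces the expected output.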
Sorting networks and algorithmic improvements discovered by AlphaDev
a, An optimal classic sorting network for three inputs. The circled comparators have been improved by AlphaDev. See the AlphaDev swap move for more details. b,c, The assembly pseudocode before applying the AlphaDev swap move (b) and after applying the AlphaDev swap move (c), resulting in the removal of a single instruction. d, An optimal classic sorting network comparator configuration that has been improved by AlphaDev. See the AlphaDev copy move for more details. e,f, The assembly pseudocode before applying the AlphaDev copy move (e) and after applying the AlphaDev copy move (f), resulting in the removal of a single instruction.
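The comparators in these networks are plain compare-and-swap (min/max) operations, which is what the assembly pseudocode in b–f implements. The following sketch (an illustrative Python reconstruction, not the optimized assembly from the figure) shows a three-comparator network for three inputs:

```python
from itertools import permutations

def compare_swap(v, i, j):
    """One comparator: order positions i and j of v in place (a min/max pair in assembly)."""
    if v[i] > v[j]:
        v[i], v[j] = v[j], v[i]

def sort3_network(v):
    """A three-input sorting network: a fixed sequence of three comparators."""
    for i, j in ((0, 1), (1, 2), (0, 1)):
        compare_swap(v, i, j)
    return v

assert all(sort3_network(list(p)) == sorted(p) for p in permutations([3, 1, 2]))
```

The AlphaDev swap and copy moves operate at the level of the individual instructions that make up such comparators, each removing one instruction while preserving correctness on all inputs.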
Fundamentally different algorithms discovered by AlphaDev
a, A flow diagram of the variable sort 4 (VarSort4) human benchmark algorithm. In this algorithm, a sequence of unsorted numbers is input into the algorithm. If the sequence length is four, three or two numbers, then the corresponding sort 4, sort 3 or sort 2 sorting network is called, which sorts the sequence. The result is then returned and output by the function. b, The VarSort4 algorithm discovered by AlphaDev. This algorithm also receives sequences of length four, three or two numbers as input. In this case, if the length is two, then it calls the sort 2 sorting network and returns. If the length is three, then it calls sort 3 to sort the first three numbers and returns. If, however, the length is greater than three, then it calls sort 3, followed by a simplified sort 4 routine that sorts the remaining unsorted number. It is this part of the routine that results in significant latency savings.
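The control flow described in the caption can be sketched as follows (an illustrative Python reconstruction based on the caption text, not AlphaDev's generated assembly; sort2, sort3 and sort4 are small sorting networks defined for the example):

```python
def _network(v, comparators):
    for i, j in comparators:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

def sort2(v): return _network(v, ((0, 1),))
def sort3(v): return _network(v, ((0, 1), (1, 2), (0, 1)))
def sort4(v): return _network(v, ((0, 1), (2, 3), (0, 2), (1, 3), (1, 2)))

def varsort4_human(v):
    """a, Human benchmark: dispatch to a fixed sorting network by length."""
    return {2: sort2, 3: sort3, 4: sort4}[len(v)](v)

def varsort4_alphadev_style(v):
    """b, Flow from the caption: length 2 uses sort 2; otherwise sort the first three
    elements, and for length 4 insert the remaining element into the sorted prefix
    (the 'simplified sort 4' step credited with the latency savings)."""
    if len(v) == 2:
        return sort2(v)
    sort3(v)                      # sorts the first three elements in place
    if len(v) == 3:
        return v
    x = v.pop()                   # the fourth, still-unsorted element
    i = 0
    while i < 3 and v[i] < x:
        i += 1
    v.insert(i, x)
    return v

for seq in ([2, 1], [3, 1, 2], [4, 2, 3, 1]):
    assert varsort4_human(list(seq)) == sorted(seq)
    assert varsort4_alphadev_style(list(seq)) == sorted(seq)
```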
Faster sorting algorithms discovered using deep reinforcement learning
  • Article
  • Full-text available

June 2023 · 1,952 Reads · 120 Citations · Nature
Andrea Michi · Anton Zhernov · [...] · David Silver

Fundamental algorithms such as sorting or hashing are used trillions of times on any given day¹. As demand for computation grows, it has become critical for these algorithms to be as performant as possible. Whereas remarkable progress has been achieved in the past², making further improvements on the efficiency of these routines has proved challenging for both human scientists and computational approaches. Here we show how artificial intelligence can go beyond the current state of the art by discovering hitherto unknown routines. To realize this, we formulated the task of finding a better sorting routine as a single-player game. We then trained a new deep reinforcement learning agent, AlphaDev, to play this game. AlphaDev discovered small sorting algorithms from scratch that outperformed previously known human benchmarks. These algorithms have been integrated into the LLVM standard C++ sort library³. This change to this part of the sort library represents the replacement of a component with an algorithm that has been automatically discovered using reinforcement learning. We also present results in extra domains, showcasing the generality of the approach.


Optimizing Memory Mapping Using Deep Reinforcement Learning

May 2023 · 105 Reads

Resource scheduling and allocation is a critical component of many high-impact systems ranging from congestion control to cloud computing. Finding more optimal solutions to these problems often has significant impact on resource and time savings, reducing device wear-and-tear, and even potentially reducing carbon emissions. In this paper, we focus on a specific instance of a scheduling problem, namely the memory mapping problem that occurs during compilation of machine learning programs: that is, mapping tensors to different memory layers to optimize execution time. We introduce an approach for solving the memory mapping problem using Reinforcement Learning. RL is a solution paradigm well-suited for sequential decision-making problems that are amenable to planning, and combinatorial search spaces with high-dimensional data inputs. We formulate the problem as a single-player game, which we call the mallocGame, such that high-reward trajectories of the game correspond to efficient memory mappings on the target hardware. We also introduce a Reinforcement Learning agent, mallocMuZero, and show that it is capable of playing this game to discover new and improved memory mapping solutions that lead to faster execution times on real ML workloads on ML accelerators. We compare the performance of mallocMuZero to the default solver used by the Accelerated Linear Algebra (XLA) compiler on a benchmark of realistic ML workloads. In addition, we show that mallocMuZero is capable of improving the execution time of the recently published AlphaTensor matrix multiplication model.
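To make the sequential-decision framing concrete, here is a toy sketch of the kind of placement decision involved (the two-layer memory model, capacities, latencies and function names below are illustrative assumptions, not the paper's mallocGame definition): each move places a tensor buffer in a memory layer, and the episode return is the negative of a simple execution-time model that an agent such as mallocMuZero would optimize with planning.

```python
from dataclasses import dataclass

@dataclass
class Buffer:
    size: int      # bytes occupied by the tensor
    accesses: int  # how often the tensor is touched during execution

FAST_CAPACITY = 1 << 20                    # assumed 1 MiB fast scratchpad
LATENCY = {"fast": 1.0, "slow": 10.0}      # assumed relative cost per access

def episode_cost(buffers, placements):
    """Toy execution-time model: total accesses weighted by layer latency."""
    return sum(b.accesses * LATENCY[layer] for b, layer in zip(buffers, placements))

def greedy_baseline(buffers):
    """A naive heuristic policy: fill fast memory with the most-accessed buffers.
    An RL agent plays the same sequential game but optimizes the whole episode."""
    placements, used = ["slow"] * len(buffers), 0
    for i in sorted(range(len(buffers)), key=lambda i: -buffers[i].accesses):
        if used + buffers[i].size <= FAST_CAPACITY:
            placements[i], used = "fast", used + buffers[i].size
    return placements

bufs = [Buffer(600_000, 50), Buffer(500_000, 40), Buffer(200_000, 5)]
plan = greedy_baseline(bufs)
print(plan, episode_cost(bufs, plan))      # in general a planned policy can beat this heuristic
```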


Planning in Stochastic Environments with a Learned Model (published as a conference paper at ICLR 2022)

February 2023 · 401 Reads

Model-based reinforcement learning has proven highly successful. However, learning a model in isolation from its use during planning is problematic in complex environments. To date, the most effective techniques have instead combined value-equivalent model learning with powerful tree-search methods. This approach is exemplified by MuZero, which has achieved state-of-the-art performance in a wide range of domains, from board games to visually rich environments, with discrete and continuous action spaces, in online and offline settings. However, previous instantiations of this approach were limited to the use of deterministic models. This limits their performance in environments that are inherently stochastic, partially observed, or so large and complex that they appear stochastic to a finite agent. In this paper we extend this approach to learn and plan with stochastic models. Specifically, we introduce a new algorithm, Stochastic MuZero, that learns a stochastic model incorporating afterstates, and uses this model to perform a stochastic tree search. Stochastic MuZero matched or exceeded the state of the art in a set of canonical single and multi-agent environments, including 2048 and backgammon, while maintaining the superhuman performance of standard MuZero in the game of Go.
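The afterstate construction can be sketched abstractly as a two-stage transition (a schematic with made-up names and a fixed toy chance distribution, not the paper's learned model): a deterministic afterstate follows from the state and action, a chance outcome is then sampled from a learned categorical distribution, and the next state completes the transition.

```python
import random

def afterstate(state, action):
    """Deterministic part of the transition, e.g. the backgammon position after the
    player's move but before the dice are rolled."""
    return (state, action)

def chance_distribution(after):
    """Stands in for the learned distribution over chance outcomes at an afterstate."""
    return {"outcome_a": 0.5, "outcome_b": 0.3, "outcome_c": 0.2}

def next_state(after, outcome):
    """Stochastic completion of the transition."""
    state, action = after
    return f"{state}|{action}|{outcome}"

def sample_transition(state, action):
    after = afterstate(state, action)
    dist = chance_distribution(after)
    outcome = random.choices(list(dist), weights=list(dist.values()))[0]
    return next_state(after, outcome)

print(sample_transition("s0", "a1"))
```

During search, the tree alternates decision nodes (over actions) with chance nodes (over these outcomes), weighting branches by the learned probabilities.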


Competition-level code generation with AlphaCode

December 2022 · 241 Reads · 1,009 Citations · Science

Programming is a powerful and ubiquitous problem-solving tool. Systems that can assist programmers or even generate programs themselves could make programming more productive and accessible. Recent transformer-based neural network models show impressive code generation abilities yet still perform poorly on more complex tasks requiring problem-solving skills, such as competitive programming problems. Here, we introduce AlphaCode, a system for code generation that achieved an average ranking in the top 54.3% in simulated evaluations on recent programming competitions on the Codeforces platform. AlphaCode solves problems by generating millions of diverse programs using specially trained transformer-based networks and then filtering and clustering those programs to a maximum of just 10 submissions. This result marks the first time an artificial intelligence system has performed competitively in programming competitions.
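The filtering-and-clustering stage can be sketched as follows (a simplified Python outline with placeholder names, not AlphaCode's pipeline): keep only samples that pass the problem's example tests, group the survivors by their behaviour on extra inputs, and submit one representative from each of the largest clusters, up to the 10-submission budget.

```python
from collections import defaultdict

def select_submissions(programs, example_tests, clustering_inputs, run, budget=10):
    """programs: candidate solutions; example_tests: [(input, expected_output), ...];
    clustering_inputs: extra inputs used only to group behaviourally identical programs;
    run(program, inp) -> output (may raise on crash)."""
    survivors = []
    for p in programs:                                   # 1. filter on example tests
        try:
            if all(run(p, x) == y for x, y in example_tests):
                survivors.append(p)
        except Exception:
            continue
    clusters = defaultdict(list)
    for p in survivors:                                  # 2. cluster by behaviour
        try:
            signature = tuple(run(p, x) for x in clustering_inputs)
        except Exception:
            signature = ("<crash>",)
        clusters[signature].append(p)
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:budget]]   # 3. one per largest cluster
```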


Matrix multiplication tensor and algorithms
a, Tensor 𝒯₂ representing the multiplication of two 2 × 2 matrices. Tensor entries equal to 1 are depicted in purple, and 0 entries are semi-transparent. The tensor specifies which entries from the input matrices to read, and where to write the result. For example, as c1 = a1b1 + a2b3, tensor entries located at (a1, b1, c1) and (a2, b3, c1) are set to 1. b, Strassen's algorithm² for multiplying 2 × 2 matrices using 7 multiplications. c, Strassen's algorithm in tensor factor representation. The stacked factors U, V and W (green, purple and yellow, respectively) provide a rank-7 decomposition of 𝒯₂ (equation (1)). The correspondence between arithmetic operations (b) and factors (c) is shown by using the aforementioned colours.
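For reference, panel b corresponds to Strassen's seven-multiplication scheme for 2 × 2 matrices; the standard textbook form is transcribed below (a reference implementation, not a transcription of the exact factors shown in panel c).

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications instead of 8 (Strassen, 1969)."""
    (a1, a2), (a3, a4) = A
    (b1, b2), (b3, b4) = B
    m1 = (a1 + a4) * (b1 + b4)
    m2 = (a3 + a4) * b1
    m3 = a1 * (b2 - b4)
    m4 = a4 * (b3 - b1)
    m5 = (a1 + a2) * b4
    m6 = (a3 - a1) * (b1 + b2)
    m7 = (a2 - a4) * (b3 + b4)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4,           m1 - m2 + m3 + m6]]

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
assert strassen_2x2(A, B) == [[19, 22], [43, 50]]
```

In the tensor view of panel c, each mi corresponds to one column of the factors U and V, and the way the mi are combined into the ci corresponds to the columns of W.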
Overview of AlphaTensor
The neural network (bottom box) takes as input a tensor 𝒮ₜ, and outputs samples (u, v, w) from a distribution over potential next actions to play, and an estimate of the future returns (for example, of −Rank(𝒮ₜ)). The network is trained on two data sources: previously played games and synthetic demonstrations. The updated network is sent to the actors (top box), where it is used by the MCTS planner to generate new games.
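The single-player game behind this loop can be sketched with NumPy (an illustration of the update rule only, not the agent; the indexing convention T[i, j, k] = 1 iff output entry ck uses the product ai·bj is an assumption for the sketch): an action (u, v, w) subtracts the rank-one tensor u ⊗ v ⊗ w from the residual, the game ends when the residual reaches zero, and the number of moves gives the rank of the found decomposition.

```python
import numpy as np

def matmul_tensor(n):
    """Tensor for multiplying two n x n matrices (row-major flattening):
    T[i, j, k] = 1 iff output entry k uses the product a_i * b_j."""
    T = np.zeros((n * n, n * n, n * n), dtype=int)
    for r in range(n):
        for c in range(n):
            for m in range(n):
                T[r * n + m, m * n + c, r * n + c] = 1
    return T

def play_action(residual, u, v, w):
    """One move: subtract the rank-1 tensor u (x) v (x) w from the residual tensor."""
    return residual - np.einsum("i,j,k->ijk", u, v, w)

T2 = matmul_tensor(2)
residual, moves = T2.copy(), 0
# The trivial decomposition (one rank-1 term per '1' entry) finishes in 8 moves;
# Strassen's factors, and those found by AlphaTensor, finish in 7.
for i, j, k in np.argwhere(T2 == 1):
    u, v, w = np.eye(4, dtype=int)[[i, j, k]]
    residual = play_action(residual, u, v, w)
    moves += 1
assert not residual.any() and moves == 8
```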
Comparison between the complexity of previously known matrix multiplication algorithms and the ones discovered by AlphaTensor
Left: column (n, m, p) refers to the problem of multiplying n × m with m × p matrices. The complexity is measured by the number of scalar multiplications (or equivalently, the number of terms in the decomposition of the tensor). ‘Best rank known’ refers to the best known upper bound on the tensor rank (before this paper), whereas ‘AlphaTensor rank’ reports the rank upper bounds obtained with our method, in modular arithmetic (ℤ₂) and standard arithmetic. In all cases, AlphaTensor discovers algorithms that match or improve over known state of the art (improvements are shown in red). See Extended Data Figs. 1 and 2 for examples of algorithms found with AlphaTensor. Right: results (for arithmetic in ℝ) of applying AlphaTensor-discovered algorithms on larger tensors. Each red dot represents a tensor size, with a subset of them labelled. See Extended Data Table 1 for the results in table form. State-of-the-art results are obtained from the list in ref. ⁶⁴.
Algorithm discovery beyond standard matrix multiplication
a, Decompositions found by AlphaTensor for the tensors of size n(n−1)/2 × n × n (with n = 3, 4, 5, 6) representing the skew-symmetric matrix-vector multiplication. The red pixels denote 1, the blue pixels denote −1 and the white pixels denote 0. Extrapolation to n = 10 is shown in the rightmost figure. b, Skew-symmetric matrix-by-vector multiplication algorithm, obtained from the examples solved by AlphaTensor. The wij and qi terms in steps 3 and 5 correspond to the mr terms in Algorithm 1. It is noted that steps 6–9 do not involve any multiplications.
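As a baseline for what these decompositions improve on, a skew-symmetric n × n matrix is fully determined by its n(n−1)/2 strictly upper-triangular entries, and the naive matrix-vector product costs on the order of n² scalar multiplications. The sketch below shows only this naive baseline (it is not the discovered algorithm, whose steps 6–9 avoid multiplications entirely):

```python
import numpy as np

def skew_from_upper(upper, n):
    """Build a skew-symmetric n x n matrix from its n(n-1)/2 upper-triangular entries."""
    A = np.zeros((n, n))
    A[np.triu_indices(n, k=1)] = upper
    return A - A.T

def naive_skew_matvec(upper, x):
    """Baseline product A @ x, using roughly n^2 scalar multiplications."""
    return skew_from_upper(upper, len(x)) @ x

n = 4
upper = np.arange(1, n * (n - 1) // 2 + 1, dtype=float)   # the 6 independent entries
print(naive_skew_matvec(upper, np.ones(n)))
```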
Speed-ups of the AlphaTensor-discovered algorithm
a,b, Speed-ups (%) of the AlphaTensor-discovered algorithms tailored for a GPU (a) and a TPU (b), optimized for a matrix multiplication of size 8,192 × 8,192. Speed-ups are measured relative to standard (for example, cuBLAS for the GPU) matrix multiplication on the same hardware. Speed-ups are reported for various matrix sizes (despite optimizing the algorithm only on one matrix size). We also report the speed-up of the Strassen-square algorithm. The median speed-up is reported over 200 runs. The standard deviation over runs is <0.4 percentage points (see Supplementary Information for more details). c, Speed-up of both algorithms (tailored to a GPU and a TPU) benchmarked on both devices.
Discovering faster matrix multiplication algorithms with reinforcement learning

October 2022 · 5,476 Reads · 466 Citations · Nature

Improving the efficiency of algorithms for fundamental computations can have a widespread impact, as it can affect the overall speed of a large amount of computations. Matrix multiplication is one such primitive task, occurring in many systems, from neural networks to scientific computing routines. The automatic discovery of algorithms using machine learning offers the prospect of reaching beyond human intuition and outperforming the current best human-designed algorithms. However, automating the algorithm discovery procedure is intricate, as the space of possible algorithms is enormous. Here we report a deep reinforcement learning approach based on AlphaZero¹ for discovering efficient and provably correct algorithms for the multiplication of arbitrary matrices. Our agent, AlphaTensor, is trained to play a single-player game where the objective is finding tensor decompositions within a finite factor space. AlphaTensor discovered algorithms that outperform the state-of-the-art complexity for many matrix sizes. Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago². We further showcase the flexibility of AlphaTensor through different use-cases: algorithms with state-of-the-art complexity for structured matrix multiplication and improved practical efficiency by optimizing matrix multiplication for runtime on specific hardware. Our results highlight AlphaTensor’s ability to accelerate the process of algorithmic discovery on a range of problems, and to optimize for different criteria. A reinforcement learning approach based on AlphaZero is used to discover efficient and provably correct algorithms for matrix multiplication, finding faster algorithms for a variety of matrix sizes.


Approximate Exploitability: Learning a Best Response

July 2022 · 27 Reads · 25 Citations

Researchers have shown that neural networks are vulnerable to adversarial examples and subtle environment changes. The resulting errors can look like blunders to humans, eroding trust in these agents. In prior games research, agent evaluation often focused on the in-practice game outcomes. Such evaluation typically fails to evaluate robustness to worst-case outcomes. Computer poker research has examined how to assess such worst-case performance. Unfortunately, exact computation is infeasible with larger domains, and existing approximations are poker-specific. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, approximating worst-case performance. We demonstrate the technique in several games against a variety of agents, including several AlphaZero-based agents. Supplementary material is available at https://arxiv.org/abs/2004.09677.
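The quantity being approximated can be made exact in a tiny zero-sum example: a fixed policy's exploitability is how much a best responder can gain against it. ISMCTS-BR learns such a responder with search and deep RL in large games; the closed-form computation below is only feasible in small matrix games such as rock-paper-scissors and is shown purely to illustrate the target quantity.

```python
import numpy as np

# Row player's payoffs in rock-paper-scissors (zero-sum, Nash value 0).
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

def exploitability(column_policy):
    """Best-response value against a fixed mixed strategy of the column player."""
    return (PAYOFF @ column_policy).max()

print(exploitability(np.array([1/3, 1/3, 1/3])))   # 0.0: the uniform policy is unexploitable
print(exploitability(np.array([0.5, 0.3, 0.2])))   # 0.3: a biased policy loses to its best response
```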


Figure 4: MuZero-RC Representation Network
BD-rate difference of the Lagrangian-relaxation method compared to libvpx. Columns in this table are the same as those in Table 1.
Bitrate constraint satisfaction behavior of the Lagrangian method over the evaluation set.
MuZero with Self-competition for Rate Control in VP9 Video Compression

February 2022 · 183 Reads · 1 Citation

Video streaming usage has seen a significant rise as entertainment, education, and business increasingly rely on online video. Optimizing video compression has the potential to increase access and quality of content to users, and reduce energy use and costs overall. In this paper, we present an application of the MuZero algorithm to the challenge of video compression. Specifically, we target the problem of learning a rate control policy to select the quantization parameters (QP) in the encoding process of libvpx, an open source VP9 video compression library widely used by popular video-on-demand (VOD) services. We treat this as a sequential decision making problem to maximize the video quality with an episodic constraint imposed by the target bitrate. Notably, we introduce a novel self-competition based reward mechanism to solve constrained RL with variable constraint satisfaction difficulty, which is challenging for existing constrained RL methods. We demonstrate that the MuZero-based rate control achieves an average 6.28% reduction in size of the compressed videos for the same delivered video quality level (measured as PSNR BD-rate) compared to libvpx's two-pass VBR rate control policy, while having better constraint satisfaction behavior.
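The self-competition reward can be sketched schematically (this is a simplified reading with assumed variable names, not the exact mechanism in the paper): at the end of an episode the agent is compared against its own historical performance, an episode that meets the bitrate constraint always beats one that does not, and ties are broken by delivered quality or by the size of the overshoot.

```python
def _sign(x):
    return (x > 0) - (x < 0)

def self_competition_reward(quality, overshoot, hist_quality, hist_overshoot):
    """Episodic reward from competing against the agent's past self.

    quality:   delivered quality of the encode (e.g. PSNR; higher is better)
    overshoot: amount by which the target bitrate was exceeded (<= 0 if satisfied)
    hist_*:    the same quantities for the historical policy
    """
    ok, hist_ok = overshoot <= 0, hist_overshoot <= 0
    if ok != hist_ok:                  # satisfying the constraint always wins
        return 1.0 if ok else -1.0
    if ok:                             # both satisfy: compare delivered quality
        return float(_sign(quality - hist_quality))
    return float(_sign(hist_overshoot - overshoot))   # both violate: smaller overshoot wins

print(self_competition_reward(quality=41.2, overshoot=-0.01,
                              hist_quality=40.8, hist_overshoot=-0.02))   # 1.0
```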


Competition-Level Code Generation with AlphaCode

February 2022 · 1,578 Reads · 16 Citations

Programming is a powerful and ubiquitous problem-solving tool. Developing systems that can assist programmers or even generate programs independently could make programming more productive and accessible, yet so far incorporating innovations in AI has proven challenging. Recent large-scale language models have demonstrated an impressive ability to generate code, and are now able to complete simple programming tasks. However, these models still perform poorly when evaluated on more complex, unseen problems that require problem-solving skills beyond simply translating instructions into code. For example, competitive programming problems which require an understanding of algorithms and complex natural language remain extremely challenging. To address this gap, we introduce AlphaCode, a system for code generation that can create novel solutions to these problems that require deeper reasoning. In simulated evaluations on recent programming competitions on the Codeforces platform, AlphaCode achieved on average a ranking of top 54.3% in competitions with more than 5,000 participants. We found that three key components were critical to achieve good and reliable performance: (1) an extensive and clean competitive programming dataset for training and evaluation, (2) large and efficient-to-sample transformer-based architectures, and (3) large-scale model sampling to explore the search space, followed by filtering based on program behavior to a small set of submissions.


Learning and Planning in Complex Action Spaces

April 2021 · 127 Reads

Many important real-world problems have action spaces that are high-dimensional, continuous or both, making full enumeration of all possible actions infeasible. Instead, only small subsets of actions can be sampled for the purpose of policy evaluation and improvement. In this paper, we propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets. This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with arbitrarily complex action spaces by planning over sampled actions. We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite.
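The sample-based policy iteration idea can be reduced to a small schematic (not the MuZero implementation; the callables are placeholders): rather than enumerating the full action space, draw a small set of candidate actions from the current policy, evaluate only those (for example with tree search), and improve the policy towards the best of them.

```python
import random

def sampled_improvement(state, policy_sample, evaluate, k=20):
    """Policy improvement restricted to k actions sampled from the current policy.

    policy_sample(state) -> one action drawn from the current policy
    evaluate(state, action) -> value estimate (e.g. a search value)
    """
    candidates = [policy_sample(state) for _ in range(k)]
    return max(candidates, key=lambda a: evaluate(state, a))

# Toy continuous action space with the optimum at a = 0.3.
best = sampled_improvement(
    state=None,
    policy_sample=lambda s: random.uniform(-1.0, 1.0),
    evaluate=lambda s, a: -(a - 0.3) ** 2,
    k=200,
)
print(round(best, 2))   # close to 0.3
```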


Online and Offline Reinforcement Learning by Planning with a Learned Model

April 2021 · 93 Reads · 1 Citation

Learning efficiently from small amounts of data has long been the focus of model-based reinforcement learning, both for the online case when interacting with the environment and the offline case when learning from a fixed dataset. However, to date no single unified algorithm could demonstrate state-of-the-art results in both settings. In this work, we describe the Reanalyse algorithm which uses model-based policy and value improvement operators to compute new improved training targets on existing data points, allowing efficient learning for data budgets varying by several orders of magnitude. We further show that Reanalyse can also be used to learn entirely from demonstrations without any environment interactions, as in the case of offline Reinforcement Learning (offline RL). Combining Reanalyse with the MuZero algorithm, we introduce MuZero Unplugged, a single unified algorithm for any data budget, including offline RL. In contrast to previous work, our algorithm does not require any special adaptations for the off-policy or offline RL settings. MuZero Unplugged sets new state-of-the-art results in the RL Unplugged offline RL benchmark as well as in the online RL benchmark of Atari in the standard 200 million frame setting.
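The Reanalyse loop can be sketched as a target-refresh pass over stored data (a schematic with assumed function signatures, not the MuZero Unplugged code): each stored state is re-searched with the latest network, producing fresher policy targets and bootstrapped value targets without any new environment interaction.

```python
def reanalyse(replay_buffer, network, search, n_steps=10, gamma=0.997):
    """Yield refreshed training targets computed from existing trajectories.

    replay_buffer: iterable of trajectories, each a list of (observation, action, reward)
    search(network, observation) -> (improved_policy, root_value) from tree search
    """
    for trajectory in replay_buffer:
        for t, (obs, _action, _reward) in enumerate(trajectory):
            policy_target, _ = search(network, obs)          # fresh policy target
            horizon = min(t + n_steps, len(trajectory))
            value_target = sum(gamma ** (k - t) * trajectory[k][2] for k in range(t, horizon))
            if horizon < len(trajectory):                    # bootstrap from a re-searched root
                _, bootstrap_value = search(network, trajectory[horizon][0])
                value_target += gamma ** (horizon - t) * bootstrap_value
            yield obs, policy_target, value_target
```

Because the targets are recomputed from stored data rather than from new interactions, the same loop serves both the online setting (fresh games keep arriving) and the fully offline setting (a fixed dataset).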


Citations (11)


... AlphaZero's impact has been profound, both in terms of performance and its influence on subsequent research directions. In particular, the implications of this algorithm go well beyond games: its application has notably made it possible to identify a more efficient sorting algorithm [17]. as well as a more efficient matrix multiplication algorithm [11]. ...

Reference:

On some improvements to Unbounded Minimax
Faster sorting algorithms discovered using deep reinforcement learning

Nature

... With deep learning and Large Language Models (LLMs) capable of generating code from large code bases, new challenges arise. Tools such as DeepMind's AlphaCode [19], Google's PaLM⁵, Microsoft's CodeBERT [11], and OpenAI's ChatGPT and Codex [12] can generate reasonable code from natural language descriptions. ...

Competition-level code generation with AlphaCode
  • Citing Article
  • December 2022

Science

... We focus on robustness against adversarial policies: strategies designed to make an opponent perform poorly. Adversarial policies give an empirical lower bound for an agent's exploitability: its worst-case loss relative to Nash equilibria [46]. Gleave et al. [15] previously explored such policies in a zero-sum game between simulated humanoids trained with self-play. ...

Approximate Exploitability: Learning a Best Response
  • Citing Conference Paper
  • July 2022

... Repository-Level Coding LLM-based code generation can be broadly categorized into single-file coding and multi-file (repository-level) coding. Single-file coding focuses on generating relatively short code snippets to solve isolated tasks, such as programming competition problems or simple coding queries, and was the primary focus in the early stages of LLM-based coding research [21,6,1,14]. As LLMs have advanced in code comprehension, long-context reasoning, and handling complex workflows, research has increasingly shifted toward repository-level coding, where multiple files are generated while jointly considering architectural and functional requirements [23,18,25]. ...

Competition-Level Code Generation with AlphaCode

... (4) The fourth criterion is whether the agent adapts and reacts to changes in the environment in real time. On-line learning constantly interacts with the environment and updates the policy according to the rewards received, whereas off-line learning updates the policy according to a dataset of experience that was collected from the environment beforehand [26]. (5) The fifth criterion is the number of agents interacting with the environment. ...

Online and Offline Reinforcement Learning by Planning with a Learned Model
  • Citing Preprint
  • April 2021

... Although humans and AI algorithms are often matched in decision-making capability and, moreover, AI agents have surpassed human-level performance in a range of tasks (Schrittwieser et al., 2020;Silver et al., 2017), people remain less trusting of AI decision-makers than of actual humans. For instance, people are more likely to believe that they understand decisions made by humans than those made by the advanced algorithms behind AI, even if there is no difference in their objective understanding of the decisions made between the two (Cadario et al., 2021), thus reflecting lower levels of confidence and higher levels of skepticism toward AI technologies (e.g., self-driving cars; Shariff et al., 2021). ...

Mastering Atari, Go, chess and shogi by planning with a learned model

Nature

... The second example involves recent research, which draws inspiration from human experiments to test the adaptivity of deep model-based algorithms [36,143]. Researchers wanted to test whether popular model-based deep MBRL agents, i.e. muZero [144], PlaNet [145], DreamerV2 [146], show the behavioural flexibility expected of model-based agents, i.e. RL agents that can flexibly change their action policies using an internal model of the environment that is stored independently of rewards. ...

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
  • Citing Preprint
  • November 2019

... The neural-network-based AI AlphaZero achieved superhuman performance in, for example, chess and Go in 24 hours of learning from self-play, starting from scratch [162]. ...

A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play
  • Citing Article
  • December 2018

Science