Aja Huang’s research while affiliated with University of Toronto and other places


Publications (7)


Figure 1 | Histogram of player MMR from replays used for training.
Figure 2 | Training procedure.
Figure 3 | Illustration of the architecture that we used for our reference agents. Different types of data are denoted by different types of arrows (vectors, units or feature planes).
Figure 10 | Win rate matrix of the reference agents broken down by race (Protoss, Terran, Zerg), normalized between 0 and 100. Note that because of draws, the win rates do not always sum to 100 across the diagonal. AS-SUP corresponds to the original AlphaStar Supervised agent trained to play all races.
Performance of behavior cloning when using different MMR filtering schemes. Higher-quality data also means fewer episodes, and therefore worse performance; high-quality data used for fine-tuning gives the best results.


AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning
  • Preprint
  • File available

August 2023 · 256 Reads

Michaël Mathieu · Sherjil Ozair · Srivatsan Srinivasan · [...] · Oriol Vinyals

StarCraft II is one of the most challenging simulated reinforcement learning environments: it is partially observable, stochastic and multi-agent, and mastering it requires strategic planning over long time horizons with real-time low-level execution. It also has an active professional competitive scene. StarCraft II is uniquely suited for advancing offline RL algorithms, both because of its challenging nature and because Blizzard has released a massive dataset of millions of StarCraft II games played by human players. This paper leverages that dataset to establish a benchmark, called AlphaStar Unplugged, which introduces unprecedented challenges for offline reinforcement learning. We define a dataset (a subset of Blizzard's release), tools standardizing an API for machine learning methods, and an evaluation protocol. We also present baseline agents, including behavior cloning and offline variants of actor-critic and MuZero. We improve the state of the art for agents trained using only offline data, achieving a 90% win rate against the previously published AlphaStar behavior cloning agent.
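As an illustration of the behavior cloning baseline mentioned above, the sketch below trains a policy to imitate actions from an offline replay dataset by minimizing cross-entropy. This is a minimal stand-in, not the paper's implementation: the small MLP, flat observation vectors and discrete action ids are assumptions replacing AlphaStar's structured observations and autoregressive action heads.

```python
# Minimal behavior-cloning sketch (not the paper's code): train a policy to
# imitate actions from an offline replay dataset by cross-entropy. The tiny
# MLP, flat observation vectors and discrete action ids below are stand-ins
# for AlphaStar's structured observations and autoregressive action heads.
import torch
import torch.nn as nn

OBS_DIM, NUM_ACTIONS = 128, 32          # hypothetical sizes for illustration

policy = nn.Sequential(
    nn.Linear(OBS_DIM, 256), nn.ReLU(),
    nn.Linear(256, NUM_ACTIONS),        # logits over the action set
)
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def bc_step(obs: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One supervised step: maximize log-likelihood of the expert action."""
    logits = policy(obs)
    loss = loss_fn(logits, expert_actions)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# Dummy batch standing in for (observation, human action) pairs from replays.
obs = torch.randn(64, OBS_DIM)
acts = torch.randint(0, NUM_ACTIONS, (64,))
print(bc_step(obs, acts))
```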


Matrix multiplication tensor and algorithms
a, Tensor $\mathscr{T}_2$ representing the multiplication of two 2 × 2 matrices. Tensor entries equal to 1 are depicted in purple, and 0 entries are semi-transparent. The tensor specifies which entries from the input matrices to read, and where to write the result. For example, as c1 = a1b1 + a2b3, tensor entries located at (a1, b1, c1) and (a2, b3, c1) are set to 1. b, Strassen's algorithm² for multiplying 2 × 2 matrices using 7 multiplications. c, Strassen's algorithm in tensor factor representation. The stacked factors U, V and W (green, purple and yellow, respectively) provide a rank-7 decomposition of $\mathscr{T}_2$ (equation (1)). The correspondence between arithmetic operations (b) and factors (c) is shown by using the aforementioned colours.
Overview of AlphaTensor
The neural network (bottom box) takes as input a tensor $\mathscr{S}_t$, and outputs samples (u, v, w) from a distribution over potential next actions to play, and an estimate of the future returns (for example, of $-\mathrm{Rank}(\mathscr{S}_t)$). The network is trained on two data sources: previously played games and synthetic demonstrations. The updated network is sent to the actors (top box), where it is used by the MCTS planner to generate new games.
Comparison between the complexity of previously known matrix multiplication algorithms and the ones discovered by AlphaTensor
Left: column (n, m, p) refers to the problem of multiplying n × m with m × p matrices. The complexity is measured by the number of scalar multiplications (or equivalently, the number of terms in the decomposition of the tensor). ‘Best rank known’ refers to the best known upper bound on the tensor rank (before this paper), whereas ‘AlphaTensor rank’ reports the rank upper bounds obtained with our method, in modular arithmetic ($\mathbb{Z}_2$) and standard arithmetic. In all cases, AlphaTensor discovers algorithms that match or improve over known state of the art (improvements are shown in red). See Extended Data Figs. 1 and 2 for examples of algorithms found with AlphaTensor. Right: results (for arithmetic in $\mathbb{R}$) of applying AlphaTensor-discovered algorithms on larger tensors. Each red dot represents a tensor size, with a subset of them labelled. See Extended Data Table 1 for the results in table form. State-of-the-art results are obtained from the list in ref. ⁶⁴.
Algorithm discovery beyond standard matrix multiplication
a, Decompositions found by AlphaTensor for the tensors of size $\frac{n(n-1)}{2}\times n\times n$ (with n = 3, 4, 5, 6) representing the skew-symmetric matrix-vector multiplication. The red pixels denote 1, the blue pixels denote −1 and the white pixels denote 0. Extrapolation to n = 10 is shown in the rightmost figure. b, Skew-symmetric matrix-by-vector multiplication algorithm, obtained from the examples solved by AlphaTensor. The wij and qi terms in steps 3 and 5 correspond to the mr terms in Algorithm 1. It is noted that steps 6–9 do not involve any multiplications.
Speed-ups of the AlphaTensor-discovered algorithm
a,b, Speed-ups (%) of the AlphaTensor-discovered algorithms tailored for a GPU (a) and a TPU (b), optimized for a matrix multiplication of size 8,192 × 8,192. Speed-ups are measured relative to standard (for example, cuBLAS for the GPU) matrix multiplication on the same hardware. Speed-ups are reported for various matrix sizes (despite optimizing the algorithm only on one matrix size). We also report the speed-up of the Strassen-square algorithm. The median speed-up is reported over 200 runs. The standard deviation over runs is <0.4 percentage points (see Supplementary Information for more details). c, Speed-up of both algorithms (tailored to a GPU and a TPU) benchmarked on both devices.
Discovering faster matrix multiplication algorithms with reinforcement learning

October 2022 · 5,444 Reads · 455 Citations

Nature

Improving the efficiency of algorithms for fundamental computations can have a widespread impact, as it can affect the overall speed of a large amount of computations. Matrix multiplication is one such primitive task, occurring in many systems—from neural networks to scientific computing routines. The automatic discovery of algorithms using machine learning offers the prospect of reaching beyond human intuition and outperforming the current best human-designed algorithms. However, automating the algorithm discovery procedure is intricate, as the space of possible algorithms is enormous. Here we report a deep reinforcement learning approach based on AlphaZero¹ for discovering efficient and provably correct algorithms for the multiplication of arbitrary matrices. Our agent, AlphaTensor, is trained to play a single-player game where the objective is finding tensor decompositions within a finite factor space. AlphaTensor discovered algorithms that outperform the state-of-the-art complexity for many matrix sizes. Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago². We further showcase the flexibility of AlphaTensor through different use-cases: algorithms with state-of-the-art complexity for structured matrix multiplication and improved practical efficiency by optimizing matrix multiplication for runtime on specific hardware. Our results highlight AlphaTensor’s ability to accelerate the process of algorithmic discovery on a range of problems, and to optimize for different criteria. A reinforcement learning approach based on AlphaZero is used to discover efficient and provably correct algorithms for matrix multiplication, finding faster algorithms for a variety of matrix sizes.
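For context on what a low-rank decomposition of the matrix multiplication tensor buys, the sketch below is Strassen's classic 2 × 2 algorithm, the rank-7 decomposition shown in the figure panels above. It is not one of the AlphaTensor-discovered algorithms, just a plain-Python illustration of trading 8 scalar multiplications for 7.

```python
# Strassen's classic 2x2 algorithm: 7 scalar multiplications instead of 8.
# This is the rank-7 decomposition referenced in the figure above, not one of
# the new algorithms discovered by AlphaTensor.
def strassen_2x2(A, B):
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4,           m1 - m2 + m3 + m6]]

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
assert strassen_2x2(A, B) == [[19, 22], [43, 50]]   # matches the naive product
```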


Grandmaster level in StarCraft II using multi-agent reinforcement learning

October 2019 · 15,247 Reads · 3,763 Citations

Nature

Many real-world applications require artificial agents to compete and coordinate with other agents in complex environments. As a stepping stone to this goal, the domain of StarCraft has emerged as an important challenge for artificial intelligence research, owing to its iconic and enduring status among the most difficult professional esports and its relevance to the real world in terms of its raw complexity and multi-agent challenges. Over the course of a decade and numerous competitions¹–³, the strongest agents have simplified important aspects of the game, utilized superhuman capabilities, or employed hand-crafted sub-systems⁴. Despite these advantages, no previous agent has come close to matching the overall skill of top StarCraft players. We chose to address the challenge of StarCraft using general-purpose learning methods that are in principle applicable to other complex domains: a multi-agent reinforcement learning algorithm that uses data from both human and agent games within a diverse league of continually adapting strategies and counter-strategies, each represented by deep neural networks⁵,⁶. We evaluated our agent, AlphaStar, in the full game of StarCraft II, through a series of online games against human players. AlphaStar was rated at Grandmaster level for all three StarCraft races and above 99.8% of officially ranked human players.
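As a rough illustration of the league idea described in the abstract, the sketch below samples opponents from a population with probability weighted towards those the current agent loses to most, in the spirit of the paper's prioritized fictitious self-play. The league contents, win rates and the exact weighting exponent are illustrative assumptions, not AlphaStar's actual matchmaking code.

```python
# Toy sketch of prioritized opponent sampling for league training: opponents
# the current agent beats least often are sampled most. The weighting
# (1 - win_rate) ** p mirrors the spirit of AlphaStar's prioritized fictitious
# self-play; the league contents and win rates here are made-up placeholders.
import random

def sample_opponent(win_rates: dict[str, float], p: float = 2.0) -> str:
    names = list(win_rates)
    weights = [(1.0 - win_rates[n]) ** p for n in names]
    if sum(weights) == 0:                      # we beat everyone: sample uniformly
        return random.choice(names)
    return random.choices(names, weights=weights, k=1)[0]

league = {"main_agent_v1": 0.9, "cheese_exploiter": 0.4, "league_exploiter": 0.6}
print(sample_opponent(league))                 # usually picks "cheese_exploiter"
```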


Figure 3: Typical values of the observed and maximum expected win-rates as a function of the optimization steps.
Bayesian Optimization in AlphaGo

December 2018 · 351 Reads · 1 Citation

During the development of AlphaGo, its many hyper-parameters were tuned with Bayesian optimization multiple times. This automatic tuning process resulted in substantial improvements in playing strength. For example, prior to the match with Lee Sedol, we tuned the latest AlphaGo agent and this improved its win-rate from 50% to 66.5% in self-play games. This tuned version was deployed in the final match. Of course, since we tuned AlphaGo many times during its development cycle, the compounded contribution was even higher than this percentage. It is our hope that this brief case study will be of interest to Go fans, and also provide Bayesian optimization practitioners with some insights and inspiration.
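The sketch below shows the general shape of such a tuning loop: a Gaussian-process surrogate fit to noisy win-rate observations, with the next hyperparameter chosen by expected improvement. It is a minimal illustration using scikit-learn, not the paper's setup; the objective function, search interval and kernel choice are assumptions.

```python
# Minimal Bayesian-optimisation sketch in the spirit of the paper: fit a GP to
# noisy win-rate observations and pick the next hyperparameter by expected
# improvement. `noisy_win_rate` and the search interval are stand-ins, not
# AlphaGo's real tuning setup.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def noisy_win_rate(x):                       # placeholder black-box objective
    return 0.5 + 0.15 * np.exp(-8 * (x - 0.6) ** 2) + 0.02 * np.random.randn()

def expected_improvement(mu, sigma, best):
    z = (mu - best) / np.maximum(sigma, 1e-9)
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

X = np.random.uniform(0, 1, (5, 1))          # initial random hyperparameter trials
y = np.array([noisy_win_rate(x[0]) for x in X])
grid = np.linspace(0, 1, 200).reshape(-1, 1)

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-3).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X = np.vstack([X, [x_next]])
    y = np.append(y, noisy_win_rate(x_next[0]))

print("best hyperparameter:", X[np.argmax(y)][0], "win rate:", y.max())
```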


Mastering the game of Go without human knowledge

October 2017 · 30,810 Reads · 9,888 Citations

Nature

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.
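The training target described in this abstract can be written compactly: the network outputs move probabilities p and a value v, trained towards the MCTS search probabilities π and the game outcome z by minimizing (z − v)² − πᵀ log p plus an L2 penalty. The sketch below computes that loss on dummy data; the toy convolutional network and 19 × 19 × 17 input are illustrative stand-ins, not the paper's residual architecture.

```python
# Sketch of the AlphaGo Zero training target described above: the network
# outputs move probabilities p and a value v, trained towards the MCTS search
# probabilities pi and game outcome z with loss (z - v)^2 - pi . log p
# (the paper's L2 term is handled here by the optimizer's weight_decay).
# The dummy 19x19x17 input and toy conv net are illustrative stand-ins.
import torch
import torch.nn as nn

class ToyZeroNet(nn.Module):
    def __init__(self, board=19, planes=17):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(planes, 64, 3, padding=1), nn.ReLU())
        self.policy = nn.Linear(64 * board * board, board * board + 1)  # +1 = pass
        self.value = nn.Sequential(nn.Linear(64 * board * board, 1), nn.Tanh())

    def forward(self, x):
        h = self.trunk(x).flatten(1)
        return self.policy(h), self.value(h).squeeze(-1)

net = ToyZeroNet()
opt = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)

x = torch.randn(8, 17, 19, 19)                      # dummy board features
pi = torch.softmax(torch.randn(8, 362), dim=-1)     # MCTS visit distribution
z = torch.randint(0, 2, (8,)).float() * 2 - 1       # game outcome in {-1, +1}

logits, v = net(x)
loss = ((z - v) ** 2).mean() - (pi * torch.log_softmax(logits, -1)).sum(-1).mean()
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```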


Figure 2: Strength and accuracy of policy and value networks. a, Plot showing the playing strength of policy networks as a function of their training accuracy. Policy networks with 128, 192, 256 and 384 convolutional filters per layer were evaluated periodically during training; the plot shows the winning rate of AlphaGo using that policy network against the match version of AlphaGo. b, Comparison of evaluation accuracy between the value network and rollouts with different policies. Positions and outcomes were sampled from human expert games. Each position was evaluated by a single forward pass of the value network $v_\theta$, or by the mean outcome of 100 rollouts.
Figure 3: Monte-Carlo tree search in AlphaGo. a, Each simulation traverses the tree by selecting the edge with maximum action value Q, plus a bonus u(P) that depends on a stored prior probability P for that edge. b, The leaf node may be expanded; the new node is processed once by the policy network $p_\sigma$ and the output probabilities are stored as prior probabilities P for each action.
Figure 5: How AlphaGo (black, to play) selected its move in an informal game against Fan Hui. For each of the following statistics, the location of the maximum value is indicated by an orange circle. a, Evaluation of all successors s′ of the root position s, using the value network $v_\theta(s')$; estimated winning percentages are shown for the top evaluations. b, Action values Q(s, a).
Details of the match between AlphaGo and Fan Hui. The match consisted of five formal games with longer time controls, and five informal games with shorter time controls. Time controls and playing conditions were chosen by Fan Hui in advance of the match.
Mastering the game of Go with deep neural networks and tree search

January 2016 · 105,267 Reads · 17,838 Citations

Nature

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
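The search described here selects moves inside the tree by balancing the action value against a prior-weighted exploration bonus (see the Figure 3 caption above). The sketch below implements the commonly cited PUCT-style selection rule; the constant c_puct and the edge bookkeeping are illustrative placeholders rather than AlphaGo's exact implementation.

```python
# Sketch of the edge-selection rule from Figure 3 above: choose the action
# maximizing Q(s, a) + u(s, a), where the exploration bonus u is proportional
# to the prior P(s, a) from the policy network and shrinks as the edge is
# visited. This follows the commonly cited PUCT form; c_puct and the edge
# bookkeeping are illustrative placeholders, not AlphaGo's exact constants.
import math

def select_action(edges: dict, c_puct: float = 1.0):
    """edges maps action -> dict with keys 'Q' (mean value), 'P' (prior), 'N' (visits)."""
    total_visits = sum(e["N"] for e in edges.values())
    def score(e):
        u = c_puct * e["P"] * math.sqrt(total_visits) / (1 + e["N"])
        return e["Q"] + u
    return max(edges, key=lambda a: score(edges[a]))

edges = {
    "D4": {"Q": 0.52, "P": 0.30, "N": 120},
    "Q16": {"Q": 0.49, "P": 0.25, "N": 40},
    "K10": {"Q": 0.10, "P": 0.02, "N": 3},
}
print(select_action(edges))   # balances value Q against the prior-weighted bonus
```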


Table 1: Features used as inputs to the CNN.
Figure 2: A game played between the 12-layer CNN (without any search) and Fuego (using 100k rollouts/move). The CNN plays white.
Move Evaluation in Go Using Deep Convolutional Neural Networks

December 2014 · 1,031 Reads · 103 Citations

The game of Go is more challenging than other board games, due to the difficulty of constructing a position or move evaluation function. In this paper we investigate whether deep convolutional networks can be used to directly represent and learn this knowledge. We train a large 12-layer convolutional neural network by supervised learning from a database of human professional games. The network correctly predicts the expert move in 55% of positions, equalling the accuracy of a 6 dan human player. When the trained convolutional network was used directly to play games of Go, without any search, it beat the traditional search program GnuGo in 97% of games, and matched the performance of a state-of-the-art Monte-Carlo tree search that simulates a million positions per move.
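The sketch below shows the general shape of such a move-prediction network: convolutional layers over 19 × 19 input feature planes, ending in a softmax over the 361 intersections. The depth, channel widths and number of input planes are placeholders, not the paper's 12-layer architecture or its feature set from Table 1.

```python
# Sketch of a move-prediction CNN of the kind described above: convolutional
# layers over 19x19 input feature planes, ending in a softmax over the 361
# intersections. The depth, widths and number of input planes are placeholders,
# not the paper's 12-layer architecture or its exact feature set (Table 1).
import torch
import torch.nn as nn

class MovePredictor(nn.Module):
    def __init__(self, in_planes=8, channels=64, depth=4):
        super().__init__()
        layers = [nn.Conv2d(in_planes, channels, 5, padding=2), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]   # one logit per point
        self.net = nn.Sequential(*layers)

    def forward(self, planes):                 # planes: (batch, in_planes, 19, 19)
        return self.net(planes).flatten(1)     # (batch, 361) move logits

model = MovePredictor()
logits = model(torch.randn(2, 8, 19, 19))
probs = torch.softmax(logits, dim=-1)          # distribution over board points
print(probs.shape)                             # torch.Size([2, 361])
```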

Citations (6)


... Reinforcement learning (RL) algorithms have surpassed humans' ability in many games (Mnih et al. 2015; Schrittwieser et al. 2020), and have now also found success in real-world problems such as controlling plasma in a nuclear fusion reactor (Degrave et al. 2022), video compression (Mandhane et al. 2022), large language models (Ouyang et al. 2022) and algorithm design (Fawzi et al. 2022; Mankowitz et al. 2023). However, even for relatively simple tasks, algorithms still require many simulations or real interactions to learn a strong policy, making them inefficient. ...

Reference:

Epistemic Bellman Operators
Discovering faster matrix multiplication algorithms with reinforcement learning

Nature

... Prior work on ZSC has mainly focused on the challenge of adapting to novel partners, including novel human partners. Typical RL approaches use methods like population-based training (PBT) and variations on self-play (SP), approaches which helped algorithms such as AlphaStar achieve superhuman performance in zero-sum games (Vinyals et al., 2019). These approaches work by leveraging "partner diversity" during training time. ...

Grandmaster level in StarCraft II using multi-agent reinforcement learning

Nature

... Lastly, our observations of f are subject to a (potentially heteroscedastic⁸⁰,⁸¹) noise process. BO⁸² is an adaptive strategy that has recently emerged as a powerful solution method for black-box optimisation problems, with proven success in applications including machine learning hyperparameter optimisation,⁸³,⁸⁴ chemical reaction optimisation,³² protein design,⁸⁵ and as a sub-component in AlphaGo⁸⁶ and Amazon Alexa.⁸⁷ The ESI 1 provides pseudo-code for BO and more details on the algorithm. ...

Bayesian Optimization in AlphaGo

... Machine Learning (ML), introduced by A. L. Samuel [51] in 1959, has found extensive applications in several fields such as computer vision, materials, general game playing, data mining, and bioinformatics [52][53][54][55][56][57]. As Artificial Intelligence and ML mature, significant progress is being made both by mainstream Artificial Intelligence researchers and by professionals from other domains who are using these technologies to achieve their goals [57]. ...

Mastering the game of Go without human knowledge

Nature

... Since the predicted fashion items are filled with noise in the early stages of the inference process, it is difficult for both humans and the discriminator model to judge the quality of the images. Therefore, following Reinforcement Learning (RL) [36,37] methods, we assume that if the final inference result is better than , then at any timestep during the inference process, the state and action are better than and . In Section 4.1, we saved the latent image variables from timesteps of each generated fashion item as the state . ...

Mastering the game of Go with deep neural networks and tree search

Nature

... Clark and Storkey [2] trained an 8-layer CNN using two Go datasets made by expert players and achieved an accuracy of 0.4437 in move prediction. Subsequently, Maddison et al. [8] created a 12-layer CNN to predict expert moves, reaching an accuracy of 0.5450, which is comparable to a 6-dan human player. Duc et al. [9] proposed a 5-layer CNN trained on approximately 600,000 board states to forecast the next move, and their work also suggests next moves for three player ranks, which is beneficial for novice players. ...

Move Evaluation in Go Using Deep Convolutional Neural Networks