Articles
https://doi.org/10.1038/s41928-019-0221-6
Reinforcement learning with analogue memristor arrays

Zhongrui Wang1,8, Can Li1,8, Wenhao Song1, Mingyi Rao1, Daniel Belkin1, Yunning Li1, Peng Yan1, Hao Jiang1, Peng Lin1, Miao Hu2, John Paul Strachan3, Ning Ge3, Mark Barnell4, Qing Wu4, Andrew G. Barto5, Qinru Qiu6, R. Stanley Williams7, Qiangfei Xia1* and J. Joshua Yang1*

Reinforcement learning algorithms that use deep neural networks are a promising approach for the development of machines that can acquire knowledge and solve problems without human input or supervision. At present, however, these algorithms are implemented in software running on relatively standard complementary metal–oxide–semiconductor digital platforms, where performance will be constrained by the limits of Moore's law and von Neumann architecture. Here, we report an experimental demonstration of reinforcement learning on a three-layer 1-transistor 1-memristor (1T1R) network using a modified learning algorithm tailored for our hybrid analogue–digital platform. To illustrate the capabilities of our approach in robust in situ training without the need for a model, we performed two classic control problems: the cart–pole and mountain car simulations. We also show that, compared with conventional digital systems in real-world reinforcement learning tasks, our hybrid analogue–digital computing system has the potential to achieve a significant boost in speed and energy efficiency.

1Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA, USA. 2Binghamton University, Binghamton, NY, USA. 3Hewlett Packard Labs, Hewlett Packard Enterprise, Palo Alto, CA, USA. 4Air Force Research Laboratory, Information Directorate, Rome, NY, USA. 5College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA. 6Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, USA. 7Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA. 8These authors contributed equally: Zhongrui Wang, Can Li. *e-mail: qxia@umass.edu; jjyang@umass.edu
A primary goal of machine learning is to equip machines with behaviours that optimize their control over different environments. Unlike supervised or unsupervised learning, reinforcement learning, which is inspired by cognitive neuroscience, provides a way to formulate a decision-making process that is learned without a supervisor providing labelled training examples. It instead uses less informative evaluations in the form of 'rewards', and learning is directed towards maximizing the total reward received over time1.
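As a minimal sketch of this objective, the discounted return that an agent seeks to maximize can be written in a few lines of Python; the reward sequence and discount factor below are hypothetical examples and are not values from this work.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a reward sequence, the quantity learning maximizes."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0, 0.0, 1.0]   # hypothetical rewards from one episode
print(discounted_return(rewards))      # larger is better for the agent
```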
Developments in deep neural networks have advanced reinforcement learning2,3, as exemplified by the recent achievements of AlphaGo4,5. However, the first generation of AlphaGo (AlphaGo Fan) ran on 1,920 central processing units (CPUs) and 280 graphics processing units (GPUs), consuming a peak power of half a megawatt. Application-specific integrated circuits (ASICs) such as DaDianNao6, the tensor processing unit (TPU)7 and Eyeriss8 offer potential enhancements in speed and reductions in power consumption. However, because the majority of neural network parameters (for example, weights) are still stored in dynamic random-access memory (DRAM), moving data back and forth between the DRAM and the caches (for example, static random-access memory; SRAM) of processing units increases both latency and power consumption. This growing communication bottleneck, together with the saturation of Moore's law, limits the speed and energy efficiency of complementary metal–oxide–semiconductor (CMOS)-based reinforcement learning in the era of big data.
A processing-in-memory architecture could provide a highly parallel and energy-efficient approach to address these challenges, relying on dense, power-efficient, fast and scalable building blocks such as ionic transistors9, phase-change memory10–13 and redox memristors14–26. A key advantage of networks based on these emerging devices is 'compute by physics', where vector–matrix multiplications are performed intrinsically via Ohm's law (for multiplication) and Kirchhoff's current law (for summation)27,28. Such a network computes exactly where the data are stored and thus avoids the communication bottleneck. Furthermore, it is able to compute in parallel and in the analogue domain. Applications including signal processing29,30, scientific computing31,32, hardware security33 and neuromorphic computing22,28,34–41 have recently been demonstrated (see Supplementary Table 1).
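The sketch below emulates this 'compute by physics' principle in software, assuming a hypothetical array size and conductance window: Ohm's law gives the per-device current and Kirchhoff's current law sums the currents along each column, so a single read operation performs a vector–matrix multiplication.

```python
import numpy as np

# Software emulation of an analogue crossbar performing a vector-matrix
# multiplication "by physics": Ohm's law gives each device current I = G * V,
# and Kirchhoff's current law sums the currents flowing into each column.
# The array size and conductance window below are hypothetical.

rng = np.random.default_rng(0)
rows, cols = 64, 32
G = rng.uniform(1e-6, 1e-4, size=(rows, cols))   # device conductances (S)
V = rng.uniform(0.0, 0.2, size=rows)             # read voltages on the rows (V)

I = V @ G          # column currents: sum over rows of G[i, j] * V[i]
print(I.shape)     # one analogue output current per column
```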
Memristor arrays have shown potential speed and energy enhancements for in situ supervised learning (for spatial/temporal pattern classification)22,34–37,39,41,42 and unsupervised learning (for data clustering)40,43,44. Although a memristor crossbar implementation of reinforcement learning could significantly accelerate the reward predictions made by the forward passes of a deep-Q network, in which historical observations are repeatedly replayed from 'experience' to optimize decision-making in unknown environments3, such an implementation has yet to be demonstrated owing to the lack of suitable hardware and a corresponding algorithm.
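For context, a minimal software sketch of Q-learning with experience replay is given below; the linear network, replay contents and hyperparameters are placeholders chosen for illustration, not the hardware implementation reported in this Article. It shows how replayed observations form the targets towards which the predicted Q-values are trained.

```python
import random
from collections import deque

import numpy as np

# Simplified Q-learning with experience replay (illustrative only); the linear
# q_net stands in for a deep-Q network, and all sizes are hypothetical.

gamma = 0.99
n_state, n_action = 4, 2
weights = np.zeros((n_action, n_state))        # stand-in network parameters
replay = deque(maxlen=10_000)                  # stored 'experience'

def q_net(state):
    """Forward pass: predicted reward-to-go (Q-value) for each action."""
    return weights @ state

# Collect hypothetical transitions (state, action, reward, next_state, done).
for _ in range(100):
    s, s_next = np.random.rand(n_state), np.random.rand(n_state)
    replay.append((s, np.random.randint(n_action), np.random.rand(), s_next, False))

# Replay a mini-batch and form the Bellman targets that training regresses the
# predicted Q-values towards (the weight-update step itself is omitted here).
batch = random.sample(list(replay), 32)
targets = [r + (0.0 if done else gamma * np.max(q_net(s_next)))
           for s, a, r, s_next, done in batch]
print(len(targets))
```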
In this Article, we report an experimental demonstration of reinforcement learning in analogue memristor arrays. Parallel and energy-efficient in situ reinforcement learning with a three-layer, fully connected memristive deep-Q network is implemented on a 128 × 64 1-transistor 1-memristor (1T1R) array. We show that the learning can be applied generally to classic reinforcement learning environments, including the cart–pole45 and mountain car46 problems. Our results indicate that in-memristor reinforcement learning can achieve a 4 to 5 bit representation capability per weight using a two-pulse write-without-verification scheme to program the 1T1R array, with potential improvements in computing speed and energy efficiency (see Supplementary Note 1).
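As an illustration of what a 4 to 5 bit representation per weight implies, the sketch below quantizes a signed weight onto a differential pair of conductance levels; the differential mapping and the conductance window are assumptions made for illustration and are not necessarily the exact programming scheme used in this work.

```python
import numpy as np

# Sketch of a 4-bit weight representation on a differential pair of devices.
# The conductance window and the differential (G_plus, G_minus) mapping are
# assumptions for illustration, not necessarily the scheme used in the Article.

g_min, g_max = 1e-6, 1e-4                      # assumed programmable window (S)
bits = 4                                        # ~16 distinguishable levels/device
levels = np.linspace(g_min, g_max, 2 ** bits)

def weight_to_conductances(w, w_max=1.0):
    """Map a signed weight onto a quantized (G_plus, G_minus) pair."""
    g = np.clip(abs(w) / w_max, 0.0, 1.0) * (g_max - g_min) + g_min
    g_q = levels[np.argmin(np.abs(levels - g))]   # nearest programmable level
    return (g_q, g_min) if w >= 0 else (g_min, g_q)

print(weight_to_conductances(0.37))    # positive weight -> larger G_plus
print(weight_to_conductances(-0.80))   # negative weight -> larger G_minus
```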