Alex Graves’s research while affiliated with Google Inc. and other places


Publications (27)


A Practical Sparse Approximation for Real Time Recurrent Learning
  • Preprint

June 2020 · 53 Reads

Jacob Menick · Erich Elsen · [...] · Alex Graves

Current methods for training recurrent neural networks are based on backpropagation through time, which requires storing a complete history of network states, and prohibits updating the weights 'online' (after every timestep). Real Time Recurrent Learning (RTRL) eliminates the need for history storage and allows for online weight updates, but does so at the expense of computational costs that are quartic in the state size. This renders RTRL training intractable for all but the smallest networks, even ones that are made highly sparse. We introduce the Sparse n-step Approximation (SnAp) to the RTRL influence matrix, which only keeps entries that are nonzero within n steps of the recurrent core. SnAp with n=1 is no more expensive than backpropagation, and we find that it substantially outperforms other RTRL approximations with comparable costs such as Unbiased Online Recurrent Optimization. For highly sparse networks, SnAp with n=2 remains tractable and can outperform backpropagation through time in terms of learning speed when updates are done online. SnAp becomes equivalent to RTRL when n is large.
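
To make the idea concrete, the sketch below applies a SnAp-style mask to the standard RTRL influence-matrix update J_t = D_t J_{t-1} + I_t for a small vanilla tanh RNN. It is only an illustration under simplifying assumptions (only the recurrent weights are tracked, and helper names such as snap_mask are invented for the example), not the paper's implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    H, X = 8, 4                          # hidden and input sizes
    Wh = rng.normal(0.0, 0.5, (H, H))    # recurrent weights whose influence we track
    Wx = rng.normal(0.0, 0.5, (H, X))
    n_params = Wh.size

    def immediate_jacobian(h_prev, pre):
        # dh_t/dWh for h_t = tanh(Wh h_{t-1} + Wx x_t), flattened to (H, n_params)
        I = np.zeros((H, H, H))
        d = 1.0 - np.tanh(pre) ** 2
        for i in range(H):
            I[i, i, :] = d[i] * h_prev
        return I.reshape(H, n_params)

    def snap_mask(n):
        # keep J[i, (k, l)] iff parameter Wh[k, l] can influence unit i within n steps
        reach = np.eye(H, dtype=bool)
        conn = Wh != 0
        for _ in range(n - 1):
            reach = reach | ((conn.astype(int) @ reach.astype(int)) > 0)
        return np.repeat(reach[:, :, None], H, axis=2).reshape(H, n_params)

    mask = snap_mask(n=1)        # SnAp-1: same sparsity pattern as the immediate Jacobian
    J = np.zeros((H, n_params))  # approximate influence matrix dh_t/dWh
    h = np.zeros(H)
    for t in range(20):
        x = rng.normal(size=X)
        pre = Wh @ h + Wx @ x
        D = (1.0 - np.tanh(pre) ** 2)[:, None] * Wh       # dh_t/dh_{t-1}
        J = (D @ J + immediate_jacobian(h, pre)) * mask   # masked (sparse) RTRL update
        h = np.tanh(pre)
    # for a loss with gradient g = dL_t/dh_t, the online weight gradient is (g @ J).reshape(H, H)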


Associative Compression Networks

April 2018 · 122 Reads · 5 Citations

This paper introduces Associative Compression Networks (ACNs), a new framework for variational autoencoding with neural networks. The system differs from existing variational autoencoders in that the prior distribution used to model each code is conditioned on a similar code from the dataset. In compression terms this equates to sequentially transmitting the data using an ordering determined by proximity in latent space. As the prior need only account for local, rather than global variations in the latent space, the coding cost is greatly reduced, leading to rich, informative codes, even when autoregressive decoders are used. Experimental results on MNIST, CIFAR-10, ImageNet and CelebA show that ACNs can yield improved dataset compression relative to order-agnostic generative models, with an upper bound of 73.9 nats per image on binarized MNIST. They also demonstrate that ACNs learn high-level features such as object class, writing style, pose and facial expression, which can be used to cluster and classify the data, as well as to generate diverse and convincing samples.
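
As a rough sketch of the conditioning step (not the paper's code), the snippet below computes the KL term of an ACN-style objective, replacing the unit-Gaussian prior of a standard VAE with a prior network conditioned on a code drawn from the k nearest neighbours in a bank of dataset codes; prior_net and the code bank are illustrative assumptions.

    import torch

    def acn_kl(mu_q, logvar_q, code_bank, prior_net, k=5):
        # query the code bank with the posterior mean and pick one of its k nearest
        # neighbours at random (dropping the closest match, which may be the query itself)
        d = torch.cdist(mu_q, code_bank)                         # (batch, bank_size)
        knn = d.topk(k + 1, largest=False).indices[:, 1:]        # (batch, k)
        pick = torch.randint(0, k, (mu_q.size(0), 1), device=mu_q.device)
        neighbour = code_bank[knn.gather(1, pick).squeeze(1)]    # (batch, code_dim)
        mu_p, logvar_p = prior_net(neighbour)                    # conditional prior parameters
        # KL between the diagonal Gaussians q(z|x) and p(z | neighbour)
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1.0)
        return kl.sum(dim=1)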


Figures 6–10 (thumbnails): DNC vs. Kanerva Machine training curves and test variational lower bounds across episode sizes and sample classes; linear interpolations of Omniglot and CIFAR images in the space of memory-addressing weights; iteratively sampled VAE priors for Omniglot and CIFAR; the VAE and Kanerva Machine architectures used in the experiments; and a comparison showing that a full K × K memory-row covariance yields much faster test-loss improvement than a diagonal one.


The Kanerva Machine: A Generative Distributed Memory
  • Article
  • Full-text available

April 2018 · 991 Reads · 22 Citations

We present an end-to-end trained memory system that quickly adapts to new data and generates samples like them. Inspired by Kanerva's sparse distributed memory, it has a robust distributed reading and writing mechanism. The memory is analytically tractable, which enables optimal on-line compression via a Bayesian update-rule. We formulate it as a hierarchical conditional generative model, where memory provides a rich data-dependent prior distribution. Consequently, the top-down memory and bottom-up perception are combined to produce the code representing an observation. Empirically, we demonstrate that the adaptive memory significantly improves generative models trained on both the Omniglot and CIFAR datasets. Compared with the Differentiable Neural Computer (DNC) and its variants, our memory model has greater capacity and is significantly easier to train.
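
The writing mechanism can be illustrated by a single-observation Bayesian update of a linear-Gaussian memory with row mean R and row covariance U; this is a simplified, Kalman-style sketch of the exact-inference write described in the paper, not its batched formulation.

    import numpy as np

    def kanerva_write(R, U, w, z, obs_var=1.0):
        # R: (K, C) mean of the memory rows; U: (K, K) covariance between rows
        # w: (K,) addressing weights; z: (C,) code to store
        Uw = U @ w                      # (K,)
        sigma = w @ Uw + obs_var        # predictive variance of the read (scalar)
        gain = Uw / sigma               # Kalman-style gain
        err = z - R.T @ w               # prediction error, (C,)
        R_new = R + np.outer(gain, err)
        U_new = U - np.outer(gain, Uw)
        return R_new, U_new

    def kanerva_read(R, w):
        return R.T @ w                  # posterior-mean read, (C,)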


Parallel WaveNet: Fast High-Fidelity Speech Synthesis

November 2017 · 1,170 Reads · 496 Citations

The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples more than 20 times faster than real time, and is deployed online by Google Assistant, including serving multiple English and Japanese voices.
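
A minimal sketch of the distillation objective, assuming a hypothetical student that returns a sample together with its own log-probability (as an inverse-autoregressive flow can) and a frozen teacher exposing log_prob; the paper's full training objective also adds auxiliary terms (e.g. a power loss), which are omitted here.

    import torch

    def density_distillation_loss(student, teacher, noise):
        # Monte-Carlo estimate of KL(P_student || P_teacher) = H(P_S, P_T) - H(P_S),
        # using samples drawn from the student; gradients flow through the sample,
        # while the teacher's parameters are kept frozen.
        audio, logp_student = student(noise)
        logp_teacher = teacher.log_prob(audio)
        return (logp_student - logp_teacher).mean()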


Noisy Networks for Exploration

June 2017 · 654 Reads · 416 Citations

We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent's policy can be used to aid efficient exploration. The parameters of the noise are learned with gradient descent along with the remaining network weights. NoisyNet is straightforward to implement and adds little computational overhead. We find that replacing the conventional exploration heuristics for A3C, DQN and dueling agents (entropy reward and ε-greedy respectively) with NoisyNet yields substantially higher scores for a wide range of Atari games, in some cases advancing the agent from sub- to super-human performance.
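
A simplified sketch of a noisy linear layer with factorised Gaussian noise, in the spirit of NoisyNet (initialisation details and the exact noise scheme vary by agent in the paper):

    import math
    import torch
    import torch.nn as nn

    class NoisyLinear(nn.Module):
        # y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b), with learned mu/sigma
        def __init__(self, in_features, out_features, sigma0=0.5):
            super().__init__()
            self.in_features, self.out_features = in_features, out_features
            bound = 1.0 / math.sqrt(in_features)
            self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
            self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma0 * bound))
            self.mu_b = nn.Parameter(torch.zeros(out_features))
            self.sigma_b = nn.Parameter(torch.full((out_features,), sigma0 * bound))

        @staticmethod
        def _f(x):                       # factorised-noise scaling: sign(x) * sqrt(|x|)
            return x.sign() * x.abs().sqrt()

        def forward(self, x):
            eps_in = self._f(torch.randn(self.in_features, device=x.device))
            eps_out = self._f(torch.randn(self.out_features, device=x.device))
            w = self.mu_w + self.sigma_w * torch.outer(eps_out, eps_in)
            b = self.mu_b + self.sigma_b * eps_out
            return nn.functional.linear(x, w, b)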


Automated Curriculum Learning for Neural Networks

April 2017 · 298 Reads · 287 Citations

We introduce a method for automatically selecting the path, or syllabus, that a neural network follows through a curriculum so as to maximise learning efficiency. A measure of the amount that the network learns from each data sample is provided as a reward signal to a nonstationary multi-armed bandit algorithm, which then determines a stochastic syllabus. We consider a range of signals derived from two distinct indicators of learning progress: rate of increase in prediction accuracy, and rate of increase in network complexity. Experimental results for LSTM networks on three curricula demonstrate that our approach can significantly accelerate learning, in some cases halving the time required to attain a satisfactory performance level.
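
The syllabus-selection loop can be sketched as an Exp3-style nonstationary bandit whose reward is a rescaled learning-progress signal (for example, the drop in training loss on a batch drawn from the chosen task). This is an illustration of the idea, not the paper's exact algorithm.

    import numpy as np

    class CurriculumBandit:
        def __init__(self, n_tasks, gamma=0.2, alpha=0.01):
            self.n, self.gamma, self.alpha = n_tasks, gamma, alpha
            self.log_w = np.zeros(n_tasks)

        def probs(self):
            w = np.exp(self.log_w - self.log_w.max())
            return (1 - self.gamma) * w / w.sum() + self.gamma / self.n

        def choose(self, rng):
            return rng.choice(self.n, p=self.probs())     # sample the next task

        def update(self, task, reward):
            # reward is a learning-progress signal rescaled to roughly [0, 1]
            p = self.probs()[task]
            self.log_w[task] += self.gamma * (reward / p) / self.n
            # mix a little uniform mass back in so the syllabus can stay nonstationary
            w = np.exp(self.log_w - self.log_w.max())
            w = (1 - self.alpha) * w + self.alpha * w.mean()
            self.log_w = np.log(w) + self.log_w.max()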


Neural Machine Translation in Linear Time

October 2016 · 466 Reads · 268 Citations

We present a neural architecture for sequence processing. The ByteNet is a stack of two dilated convolutional neural networks, one to encode the source sequence and one to decode the target sequence, where the target network unfolds dynamically to generate variable length outputs. The ByteNet has two core properties: it runs in time that is linear in the length of the sequences and it preserves the sequences' temporal resolution. The ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms the previous best results obtained with recurrent neural networks. The ByteNet also achieves a performance on raw character-level machine translation that approaches that of the best neural translation models that run in quadratic time. The implicit structure learnt by the ByteNet mirrors the expected alignments between the sequences.
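
The linear-time property comes from stacks of dilated one-dimensional convolutions. The sketch below builds a causal (masked) convolution stack with doubling dilation rates, using illustrative sizes rather than the paper's configuration; the receptive field grows exponentially with depth while the cost per output position stays constant.

    import torch
    import torch.nn as nn

    class CausalConv1d(nn.Conv1d):
        # left-padded 1-D convolution, so the output at time t only sees inputs <= t
        def __init__(self, channels, kernel_size, dilation):
            super().__init__(channels, channels, kernel_size, dilation=dilation)
            self._left_pad = dilation * (kernel_size - 1)

        def forward(self, x):            # x: (batch, channels, time)
            return super().forward(nn.functional.pad(x, (self._left_pad, 0)))

    channels, kernel = 128, 3
    decoder_stack = nn.Sequential(*[
        nn.Sequential(CausalConv1d(channels, kernel, dilation=2 ** i), nn.ReLU())
        for i in range(5)                # dilations 1, 2, 4, 8, 16
    ])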


Figure 5 (thumbnail): schematic of memory-efficient backpropagation through time in SAM. Each core holds a reference to a single dense memory; during the forward pass only the sparse modifications to memory are cached, and during the backward pass they are reverted to restore the memory state needed for correct gradient calculations.

Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes

October 2016 · 368 Reads · 138 Citations

Neural networks augmented with external memory have the ability to learn algorithmic solutions to complex tasks. These models appear promising for applications such as language modeling and machine translation. However, they scale poorly in both space and time as the amount of memory grows, limiting their applicability to real-world domains. Here, we present an end-to-end differentiable memory access scheme, which we call Sparse Access Memory (SAM), that retains the representational power of the original approaches whilst training efficiently with very large memories. We show that SAM achieves asymptotic lower bounds in space and time complexity, and find that an implementation runs 1,000× faster and with 3,000× less physical memory than non-sparse models. SAM learns with comparable data efficiency to existing models on a range of synthetic tasks and one-shot Omniglot character recognition, and can scale to tasks requiring 100,000s of time steps and memories. As well, we show how our approach can be adapted for models that maintain temporal associations between memories, as with the recently introduced Differentiable Neural Computer.
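
The core read can be sketched as a softmax restricted to the k most similar memory rows; a real SAM implementation finds those candidates with an approximate nearest-neighbour index so that each access touches only a constant number of rows, rather than scoring every row as done here.

    import numpy as np

    def sparse_read(memory, query, k=4):
        # cosine similarity between the query and every memory row (for illustration only)
        sims = memory @ query / (np.linalg.norm(memory, axis=1) * np.linalg.norm(query) + 1e-8)
        top = np.argpartition(-sims, k)[:k]         # indices of the k most similar rows
        w = np.exp(sims[top] - sims[top].max())     # softmax over the k candidates only
        w /= w.sum()
        return w @ memory[top], top, w              # the read touches only k rows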


Hybrid computing using a neural network with dynamic external memory

October 2016 · 1,814 Reads · 1,596 Citations · Nature

Artificial neural networks are remarkably adept at sensory processing, sequence learning and reinforcement learning, but are limited in their ability to represent variables and data structures and to store data over long timescales, owing to the lack of an external memory. Here we introduce a machine learning model called a differentiable neural computer (DNC), which consists of a neural network that can read from and write to an external memory matrix, analogous to the random-access memory in a conventional computer. Like a conventional computer, it can use its memory to represent and manipulate complex data structures, but, like a neural network, it can learn to do so from data. When trained with supervised learning, we demonstrate that a DNC can successfully answer synthetic questions designed to emulate reasoning and inference problems in natural language. We show that it can learn tasks such as finding the shortest path between specified points and inferring the missing links in randomly generated graphs, and then generalize these tasks to specific graphs such as transport networks and family trees. When trained with reinforcement learning, a DNC can complete a moving blocks puzzle in which changing goals are specified by sequences of symbols. Taken together, our results demonstrate that DNCs have the capacity to solve complex, structured tasks that are inaccessible to neural networks without external read-write memory.
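
Two of the DNC's building blocks, content-based addressing and the erase/add write, can be sketched as follows (the temporal link matrix and usage-based allocation are omitted):

    import numpy as np

    def content_weights(memory, key, beta):
        # soft content-based addressing over all memory rows, sharpened by beta
        sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
        e = np.exp(beta * (sims - sims.max()))
        return e / e.sum()

    def write(memory, w, erase, add):
        # each row is partially erased and then added to, weighted by its write weighting w
        return memory * (1.0 - np.outer(w, erase)) + np.outer(w, add)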


Video Pixel Networks

October 2016 · 248 Reads · 194 Citations

We propose a probabilistic video model, the Video Pixel Network (VPN), that estimates the discrete joint distribution of the raw pixel values in a video. The model and the neural architecture reflect the time, space and color structure of video tensors and encode it as a four-dimensional dependency chain. The VPN approaches the best possible performance on the Moving MNIST benchmark, a leap over the previous state of the art, and the generated videos show only minor deviations from the ground truth. The VPN also produces detailed samples on the action-conditional Robotic Pushing benchmark and generalizes to the motion of novel objects.
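
The four-dimensional dependency chain corresponds to factorising the joint distribution over time, height, width and colour channel. The toy loop below evaluates that factorisation with a hypothetical conditional_logits model; a real VPN computes all conditionals in parallel with masked convolutions during training.

    import numpy as np

    def video_log_likelihood(video, conditional_logits):
        # video: integer array (T, H, W, C) of discretised pixel values;
        # conditional_logits(video, t, i, j, c) returns logits over the 256 possible values
        # of pixel (t, i, j, c) given all earlier positions in the chain (hypothetical API)
        T, H, W, C = video.shape
        logp = 0.0
        for t in range(T):
            for i in range(H):
                for j in range(W):
                    for c in range(C):
                        logits = conditional_logits(video, t, i, j, c)
                        logz = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
                        logp += logits[video[t, i, j, c]] - logz
        return logp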


Citations (25)


... A number of relevant autoencoder and GAN variants have also been proposed. Associative compression networks (ACNs) [12] learn to compress at the dataset level by conditioning data on other previously transmitted data which are similar in code space, resulting in models that can "daydream" semantically similar samples, similar to BigBiGAN reconstructions. VQ-VAEs [36] pair a discrete (vector quantized) encoder with an autoregressive decoder to produce faithful reconstructions with a high compression factor and demonstrate representation learning results in reinforcement learning settings. ...

Reference:

Large Scale Adversarial Representation Learning
Associative Compression Networks
  • Citing Article
  • April 2018

... For example, the Hopfield Network [428,435] pioneered the idea of storing patterns in low-energy states in a dynamic system, and Kanerva's sparse distributed memory (SDM) model [436], which affords fast reads and writes and dissociates capacity from the dimensionality of input by introducing addressing into a distributed memory store whose size is independent of the dimension of the data. The influence of SDM on modern machine learning has led to the Kanerva Machine [437,438] that replaces NN-memory slot updates with Bayesian inference, and is naturally compressive and generative. By implementing memory as a generative model, the Kanerva Machine can remove assumptions of uniform data distributions, and can retrieve unseen patterns from the memory through sampling, both of which behoove use in open-ended environments. ...

The Kanerva Machine: A Generative Distributed Memory

... Recent models, such as VALL-E (Wang et al., 2023), AudioLM (Borsos et al., 2023a), MusicGen (Copet et al., 2023), and VQ-VAE-based approaches for sound synthesis (Liu et al., 2021), improve efficiency by instead modeling quantized latent sequences. Non-autoregressive models (Oord et al., 2018) and adversarial audio synthesis (Donahue et al., 2018) were developed to overcome the inefficiencies of autoregressive models. Recent non-autoregressive models such as VampNet (Garcia et al., 2023), SoundStorm (Borsos et al., 2023b), or StemGen (Parker et al., 2024) are based on masked token modeling (Chang et al., 2022). ...

Parallel WaveNet: Fast High-Fidelity Speech Synthesis

... They show that the latent representation is able to learn useful information about the overall distribution of text. Variations of this architecture are used by Gregor et al. (2015) for generating images, and Roberts et al. (2018) for generating music. In some cases models are evaluated on the KL divergence of their latent layer and regeneration of test sequences; however, this is primarily used for model comparison. ...

DRAW: A Recurrent Neural Network For Image Generation
  • Citing Conference Paper
  • February 2015

... Besides, compared with the random sampling used by DQN, Prioritized Experience Replay is proposed in [58] to sample more data with larger TD error, so that the model can learn better and faster. NoisyNet is introduced in [59] to increase the exploration ability of DQN. Compared with ε-greedy, which only explores more actions, NoisyNet has a stronger exploration ability by adding noise to the parameters for state-independent, consistent exploration. ...

Noisy Networks for Exploration
  • Citing Article
  • June 2017

... The concept of curriculum learning (Bengio et al. 2009) was initially introduced by Bengio et al., who pointed out that by gradually introducing more complex training data, deep neural networks can be helped to converge faster and achieve higher training performance. Recent research has shown significant success with various curriculum learning approaches, such as teacher-student curriculum learning (Matiisen et al. 2019), dynamic curriculum learning, and reinforcement learning (Graves et al. 2017; Zhao et al. 2020). Additionally, there is theoretical evidence that curriculum learning can perform well under various conditions, such as inherent task difficulty, size of the network, and specific regularization and optimization techniques (Weinshall et al. 2018). ...

Automated Curriculum Learning for Neural Networks
  • Citing Article
  • April 2017

... On the other hand, a CNN was first introduced to NMT in [3], but it was unsuccessful due to its limited receptive field. In [10], a solution was proposed to increase the receptive field through dilation. Another solution is to reduce the computations involved in CNNs [11]. ...

Neural Machine Translation in Linear Time
  • Citing Article
  • October 2016

... There has also been much research interest in exploring new architectures to explicitly model components of a memory system or to address key challenges of reasoning over longer contexts. For instance, past work has looked at incorporating neural models of memory within neural networks by implementing different reading and writing operations, either directly replacing their layers (Weston et al., 2014; Sukhbaatar et al., 2015), or introducing new auxiliary components (Rae et al., 2016; Lample et al., 2019). In relation to transformers, more recent works have been proposed rethinking the ingredients of the self-attention operation, mostly in the context of LMs. ...

Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes

... Differentiable Neural Computers (DNCs) (Graves et al., 2016) improve upon NTMs by using RNNs to manage scalable memory, inspired by the human hippocampus. DNCs incorporate attention mechanisms to query input similarity, temporal memory relationships, and update recency for memory management. ...

Hybrid computing using a neural network with dynamic external memory
  • Citing Article
  • October 2016

Nature

... Over the past decade, generative models have achieved remarkable success in creating instances across a wide variety of data modalities, including images [7,23,47], audio [6,44], video [27], and text [8,61,65]. Diffusion probabilistic models (DPMs) have emerged as a promising generative approach that is observed to outperform generative adversarial nets on image and audio synthesis [16,33], and underpins the major accomplishment in text-to-image creators such as DALL·E 2 [46] and Stable Diffusion [49]. ...

Video Pixel Networks
  • Citing Article
  • October 2016