# Geoffrey E. Hinton's research while affiliated with University of Toronto and other places

**What is this page?**

This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.

It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.

If you're a ResearchGate member, you can follow this page to keep up with this author's work.

If you are this author, and you don't want us to display this page anymore, please let us know.

## Publications (368)

We present Bit Diffusion: a simple and generic approach for generating discrete data with continuous diffusion models. The main idea behind our approach is to first represent the discrete data as binary bits, and then train a continuous diffusion model to model these bits as real numbers which we call analog bits. To generate samples, the model fir...
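
The encoding step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `{-scale, +scale}` value range, and the decode-by-thresholding rule are all assumptions, and no diffusion model is trained here.

```python
from typing import List


def int_to_analog_bits(value: int, num_bits: int = 8, scale: float = 1.0) -> List[float]:
    """Encode a non-negative integer as real-valued bits in {-scale, +scale}."""
    if not 0 <= value < 2 ** num_bits:
        raise ValueError("value does not fit in num_bits")
    # Binary representation, most significant bit first.
    bits = [(value >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]
    # Map {0, 1} to {-scale, +scale} so the bits can be modelled as real numbers.
    return [(2.0 * b - 1.0) * scale for b in bits]


def analog_bits_to_int(analog_bits: List[float]) -> int:
    """Decode by thresholding each real value at zero (assumed decoding rule)."""
    value = 0
    for x in analog_bits:
        value = (value << 1) | (1 if x > 0.0 else 0)
    return value


encoded = int_to_analog_bits(5, num_bits=4)
# Hypothetical stand-in for an imperfect continuous model output.
noisy = [x + 0.3 for x in encoded]
decoded = analog_bits_to_int(noisy)
```

In this sketch, small real-valued perturbations of the analog bits still decode to the original integer because only the sign of each value is used.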

While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision t...

Recent progress in Medical Artificial Intelligence (AI) has delivered systems that can reach clinical expert level performance. However, such systems tend to demonstrate sub-optimal "out-of-distribution" performance when evaluated in clinical settings different from the training environment. A common mitigation strategy is to develop separate syste...

This paper presents Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we simply cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequ...

How can neural networks learn the rich internal representations required for difficult tasks such as recognizing objects or understanding language?

This paper does not describe a working system. Instead, it presents a single idea about representation which allows advances made by several different groups to be combined into an imaginary system called GLOM. The advances include transformers, neural fields, contrastive representation learning, distillation and capsules. GLOM answers the question...

We propose an unsupervised capsule architecture for 3D point clouds. We compute capsule decompositions of objects through permutation-equivariant attention, and self-supervise the process by training with pairs of randomly rotated objects. Our key idea is to aggregate the attention masks into semantic keypoints, and use these to supervise a decompo...

Capsule networks are designed to parse an image into a hierarchy of objects, parts and relations. While promising, they remain limited by an inability to learn effective low level part descriptions. To address this issue we propose a novel self-supervised method for learning part descriptors of an image. During training, we exploit motion as a powe...

Efficient representation of articulated objects such as human bodies is an important problem in computer vision and graphics. To efficiently simulate deformation, existing approaches represent 3D objects using polygonal meshes and deform them using skinning techniques. This paper introduces neural articulated shape approximation (NASA), an alternat...

Effective training of deep neural networks can be challenging, and there remain many open questions on how to best learn these models. Recently developed methods to improve neural network training examine teaching: providing learned information during the training process to improve downstream model performance. In this paper, we take steps towards...

One paradigm for learning from few labeled examples while making best use of a large amount of unlabeled data is unsupervised pretraining followed by supervised fine-tuning. Although this paradigm uses unlabeled data in a task-agnostic way, in contrast to most previous approaches to semi-supervised learning for computer vision, we show that it is s...

Deep neural networks (DNNs) are powerful black-box predictors that have achieved impressive performance on a wide variety of tasks. However, their accuracy comes at the cost of intelligibility: it is usually unclear how they make their decisions. This hinders their applicability to high stakes decision-making domains such as healthcare. We propose...

During learning, the brain modifies synapses to improve behaviour. In the cortex, synapses are embedded within multilayered networks, making it difficult to determine the effect of an individual synaptic modification on the behaviour of the system. The backpropagation algorithm solves this problem in deep artificial neural networks, but historicall...

This paper presents the Imputer, a neural sequence model that generates output sequences iteratively via imputations. The Imputer is an iterative generative model, requiring only a constant number of generation steps independent of the number of input or output tokens. The Imputer can be trained to approximately marginalize over all possible alignm...

There has been an ongoing cycle where stronger defenses against adversarial attacks are subsequently broken by a more advanced defense-aware attack. We present a new approach towards ending this cycle where we "deflect" adversarial attacks by causing the attacker to produce an input that semantically resembles the attack's target class. To this en...

This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we system...

After a large "teacher" neural network has been trained on labeled data, the probabilities that the teacher assigns to incorrect classes reveal a lot of information about the way in which the teacher generalizes. By training a small "student" model to match these probabilities, it is possible to transfer most of the generalization ability of the te...
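
Matching a student to the teacher's class probabilities can be sketched as a cross-entropy between the two output distributions. This is a minimal sketch under stated assumptions: the `temperature` parameter and the function names are illustrative choices that do not appear in the visible abstract, and the logits are made-up numbers.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


teacher_logits = [5.0, 2.0, -1.0]  # hypothetical teacher outputs
matched = distillation_loss([5.0, 2.0, -1.0], teacher_logits)
mismatched = distillation_loss([-1.0, 2.0, 5.0], teacher_logits)
```

The loss is smallest when the student reproduces the teacher's probabilities, including the small probabilities assigned to incorrect classes.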

Efficient representation of articulated objects such as human bodies is an important problem in computer vision and graphics. To efficiently simulate deformation, existing approaches represent objects as meshes and deform them using skinning techniques. This paper introduces neural articulated shape approximation (NASA), a framework that enables ef...

Any solid object can be decomposed into a collection of convex polytopes (in short, convexes). When a small number of convexes are used, such a decomposition can be thought of as a piece-wise approximation of the geometry. This decomposition is fundamental to real-time physics simulation in computer graphics, where it creates a unifying representat...

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this p...

Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. Most of the proposed methods for mitigating adversarial examples have subsequently been defeated by stronger attacks. Motivated by these issues, we take a different approach and propose to instead detect adversarial examples...

An object can be seen as a geometrically organized set of interrelated parts. A system that makes explicit use of these geometric relationships to recognize objects should be naturally robust to changes in viewpoint, because the intrinsic geometric relationships are viewpoint-invariant. We describe an unsupervised version of capsule networks, in wh...

The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-o...
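
The soft targets described above, a weighted average of the hard targets and the uniform distribution, can be written in a few lines. The name `smooth_labels` and the weight `alpha = 0.1` are illustrative assumptions.

```python
from typing import List


def smooth_labels(hard_target: List[float], alpha: float = 0.1) -> List[float]:
    """Mix a one-hot target with the uniform distribution over its K labels."""
    num_labels = len(hard_target)
    return [(1.0 - alpha) * y + alpha / num_labels for y in hard_target]


soft = smooth_labels([0.0, 1.0, 0.0, 0.0], alpha=0.1)
```

With four labels and `alpha = 0.1`, the correct class receives a target of about 0.925 and each other class about 0.025, so the targets still sum to one.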

Neural networks are easier to optimise when they have many more weights than are required for modelling the mapping from inputs to outputs. This suggests a two-stage learning procedure that first learns a large net and then prunes away connections or hidden units. But standard training does not necessarily encourage nets to be amenable to pruning....

To generalize to novel visual scenes with new viewpoints and new object poses, a visual system needs representations of the shapes of the parts of an object that are invariant to changes in viewpoint or pose. 3D graphics representations disentangle visual factors such as viewpoints and lighting from object structure in a natural way. It is possible...

Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network representations based on canonical correlation analysis (CCA). We show that CCA belongs to a family of statistics for measuring multivariate similarity...

A capsule is a group of neurons whose outputs represent different properties of the same entity. Each layer in a capsule network contains many capsules. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 matrix which could learn to represent the relationship between that enti...

We explore and expand the *Soft Nearest Neighbor Loss* to measure the *entanglement* of class manifolds in representation space: i.e., how close pairs of points from the same class are relative to pairs of points from different classes. We demonstrate several use cases of the loss. As an analytical tool, it provides insights into...

We present a simple technique that allows capsule models to detect adversarial images. In addition to being trained to classify images, the capsule model is trained to reconstruct the images from the pose parameters and identity of the correct top-level capsule. Adversarial images do not look like a typical member of the predicted class and they ha...

Widespread application of artificial intelligence in health care has been anticipated for half a century. For most of that time, the dominant approach to artificial intelligence was inspired by logic: researchers assumed that the essence of intelligence was manipulating symbolic expressions, using rules of inference. This approach produced expert s...

The backpropagation of error algorithm (BP) is often said to be impossible to implement in a real brain. The recent success of deep networks in machine learning and AI, however, has inspired proposals for understanding how the brain might learn across multiple layers, and hence how it might implement or approximate BP. As of yet, none of these prop...

Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of...

Deep neural networks have proved to be a very effective way to perform classification tasks. They excel when the input data is high dimensional, the relationship between the input and the output is complicated, and the number of labeled training examples is large. But it is hard to explain why a learned network makes a particular classification dec...

A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make pred...
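
For the length of an activity vector to act as a probability, it has to lie between 0 and 1 while the orientation stays unchanged. One nonlinearity with this property is sketched below. The specific formula and the name `squash` are assumptions here, since the visible abstract does not state them.

```python
import math
from typing import List


def squash(s: List[float]) -> List[float]:
    """Shrink a vector so its length lies in [0, 1) without changing its direction."""
    squared_norm = sum(x * x for x in s)
    if squared_norm == 0.0:
        return [0.0 for _ in s]
    norm = math.sqrt(squared_norm)
    # Length after squashing is squared_norm / (1 + squared_norm).
    scale = squared_norm / (1.0 + squared_norm) / norm
    return [x * scale for x in s]


v = squash([3.0, 4.0])
length = math.sqrt(sum(x * x for x in v))
```

Long input vectors map to lengths close to 1 and short ones to lengths close to 0, which is what lets the length be read as an existence probability.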

Data are often labeled by many different experts with each expert only labeling a small fraction of the data and each data point being labeled by several experts. This reduces the workload on individual experts and also gives a better estimate of the unobserved ground truth. When experts disagree, the standard approaches are to treat the majority o...

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significa...

We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalt...
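
Penalizing low-entropy output distributions can be sketched by subtracting a scaled entropy term from the usual negative log-likelihood. The weight `beta`, the function names, and the example probabilities are illustrative assumptions, not values from the paper.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_penalized_loss(probs: List[float], label: int, beta: float = 0.1) -> float:
    """Negative log-likelihood minus beta times the output entropy."""
    nll = -math.log(probs[label])
    return nll - beta * entropy(probs)


confident = [0.98, 0.01, 0.01]  # hypothetical over-confident prediction
hedged = [0.80, 0.10, 0.10]     # hypothetical less confident prediction
bonus_confident = -math.log(confident[0]) - confidence_penalized_loss(confident, 0)
bonus_hedged = -math.log(hedged[0]) - confidence_penalized_loss(hedged, 0)
```

The entropy term lowers the loss more for the hedged prediction than for the over-confident one, which is how the penalty discourages peaked output distributions.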

Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many differen...

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are...
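
The batch statistics mentioned above can be sketched for a single neuron. This assumes the mean and variance are used to standardize that neuron's summed input across the mini-batch, as in standard batch normalization. The name `batch_normalize` and the `eps` constant are illustrative, and learned gain and bias terms are omitted.

```python
import math
from typing import List


def batch_normalize(summed_inputs: List[float], eps: float = 1e-5) -> List[float]:
    """Standardize one neuron's summed inputs over a mini-batch of training cases."""
    n = len(summed_inputs)
    mean = sum(summed_inputs) / n
    variance = sum((a - mean) ** 2 for a in summed_inputs) / n
    # eps guards against division by zero when the batch has no variance.
    return [(a - mean) / math.sqrt(variance + eps) for a in summed_inputs]


normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((a - norm_mean) ** 2 for a in normalized) / len(normalized)
```

After normalization the summed inputs have approximately zero mean and unit variance over the mini-batch.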

We present a framework for efficient inference in structured image models that explicitly reason about objects. We achieve this by performing probabilistic inference using a recurrent neural network that attends to scene elements and processes them one at a time. Crucially, the model itself learns to choose the appropriate number of inference steps...

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. D...

Deep Learning methods aim at learning feature hierarchies. Applications of deep learning to vision tasks date back to convolutional networks in the early 1990s. These methods have been the subject of a recent surge of interest for two main reasons: when labeled data is scarce, unsupervised learning algorithms can learn useful feature hierarchies. W...

Learning long term dependencies in recurrent networks is difficult due to vanishing and exploding gradients. To overcome this difficulty, researchers have developed sophisticated optimization techniques and network architectures. In this paper, we propose a simpler solution that uses recurrent neural networks composed of rectified linear units. Key...

A very simple way to improve the performance of almost any machine learning
algorithm is to train many different models on the same data and then to
average their predictions. Unfortunately, making predictions using a whole
ensemble of models is cumbersome and may be too computationally expensive to
allow deployment to a large number of users, espe...

Syntactic parsing is a fundamental problem in computational linguistics and
Natural Language Processing. Traditional approaches to parsing are highly
complex and problem specific. Recently, Sutskever et al. (2014) presented a
domain-independent method for learning to map input sequences to output
sequences that achieved strong results on a large sc...

Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for address...

Applications of Deep Belief Nets (DBN) to various problems have been the subject of a number of recent studies ranging from image classification and speech recognition to audio classification. In this study we apply DBNs to a natural language understanding problem. The recent surge of activity in this area was largely spurred by the development of...

We describe a simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems. In this approach a Deep Neural Network (DNN) is trained to predict the forced-alignment state of multiple frames using a separate softmax for each of the frames. This is in contrast t...

In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications,” as organized by the authors. We also describe the historical context in which acoustic models based on deep neural networks...

Recently, pre-trained deep neural networks (DNNs) have outperformed traditional acoustic models based on Gaussian mixture models (GMMs) on a variety of large vocabulary speech recognition benchmarks. Deep neural nets have also achieved excellent results on various computer vision tasks using a random “dropout” procedure that drastically improves ge...

We introduce a Deep Boltzmann Machine model suitable for modeling and
extracting latent semantic representations from a large unstructured collection
of documents. We overcome the apparent difficulty of training a DBM with
judicious parameter tying. This parameter tying enables an efficient
pretraining algorithm and a state initialization scheme th...

This paper describes a Markov Random Field for real-valued image modeling that has two sets of latent variables. One set is used to gate the interactions between all pairs of pixels, while the second set determines the mean intensities of each pixel. This is a powerful model with a conditional distribution over the input that is Gaussian, with both...

It is possible to learn multiple layers of non-linear features by backpropagating error derivatives through a feedforward neural network. This is a very effective learning procedure when there is a huge amount of labeled training data, but for many learning tasks very few labeled examples are available. In an effort to overcome the need for labeled...

Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networ...

Recurrent neural networks (RNNs) are a powerful model for sequential
data. End-to-end training methods such as Connectionist Temporal
Classification make it possible to train RNNs for sequence labelling
problems where the input-output alignment is unknown. The combination of
these methods with the Long Short-term Memory RNN architecture has
proved...

Factor Analysis is a statistical method that seeks to explain linear variations in data by using unobserved latent variables. Due to its additive nature, it is not suitable for modeling data that is generated by multiple groups of latent factors which interact multiplicatively. In this paper, we introduce Tensor Analyzers which are a multilinear ge...

Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing sc...

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward...

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-f...

Product models of low dimensional experts are a powerful way to avoid the curse of dimensionality. We present the "under-complete product of experts" (UPoE), where each expert models a one dimensional projection of the data. The UPoE is fully tractable and may be interpreted as a parametric probabilistic model for projection pursuit. Its ML learni...

When a large feedforward neural network is trained on a small training set,
it typically performs poorly on held-out test data. This "overfitting" is
greatly reduced by randomly omitting half of the feature detectors on each
training case. This prevents complex co-adaptations in which a feature detector
is only helpful in the context of several oth...
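
Randomly omitting half of the feature detectors on each training case can be sketched as a random binary mask. The function name and the `rng` argument are illustrative assumptions, and so is the test-time rule of scaling activities by the keep probability.

```python
import random
from typing import List, Optional


def dropout(activations: List[float],
            drop_prob: float = 0.5,
            training: bool = True,
            rng: Optional[random.Random] = None) -> List[float]:
    """Zero each activation with probability drop_prob during training."""
    if not training:
        # Assumed test-time rule: scale by the keep probability so the
        # expected activity matches what was seen during training.
        return [a * (1.0 - drop_prob) for a in activations]
    rng = rng or random.Random()
    return [0.0 if rng.random() < drop_prob else a for a in activations]


train_out = dropout([1.0] * 1000, rng=random.Random(0))
test_out = dropout([2.0, 4.0], training=False)
```

Because a fresh mask is drawn for every training case, no feature detector can rely on specific other detectors being present.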

Visual perception is a challenging problem in part due to illumination
variations. A possible solution is to first estimate an illumination invariant
representation before using it for recognition. The object albedo and surface
normals are examples of such representations. In this paper, we introduce a
multilayer generative model where the latent v...

While Boltzmann Machines have been successful at unsupervised learning and density modeling of images and speech data, they can be very sensitive to noise in the data. In this paper, we introduce a novel model, the Robust Boltzmann Machine (RoBM), which allows Boltzmann Machines to be robust to corruptions. In the domain of visual recognition, the...

An efficient way to learn deep density models that have many layers of latent
variables is to learn one layer at a time using a model that has only one layer
of latent variables. After learning each layer, samples from the posterior
distributions for that layer are used as training data for learning the next
layer. This approach is commonly used wi...

Deep Belief Networks (DBNs) are a very competitive alternative to Gaussian mixture models for relating states of a hidden Markov model to frames of coefficients derived from the acoustic input. They are competitive for three reasons: DBNs can be fine-tuned as neural networks; DBNs have many non-linear hidden layers; and DBNs are generatively pre-tr...

Products of Hidden Markov Models (PoHMMs) are an interesting class of generative models which have received little attention since their introduction. This may be in part due to their more computationally expensive gradient-based learning algorithm, and the intractability of computing the log likelihood of sequences under the model. In this paper, we...

We present a new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent statistics are estimated using a variational approximation that tends to focus on a single mode, and data-independent statistics are estimated using persistent Markov chains. The use of two quite different techniques for estimatin...

Recent advances in motion capture technology have fueled interest in the synthesis and analysis of complex animate motion for animation and tracking. In this work we focus on model-driven analysis and synthesis but avoid the complexities involved in imposing physics-based constraints [1], and the storage requirements involved in conca...

Conditional Restricted Boltzmann Machines (CRBMs) are rich probabilistic
models that have recently been applied to a wide range of problems, including
collaborative filtering, classification, and modeling motion capture data.
While much progress has been made in training non-conditional RBMs, these
algorithms are not applicable to conditional model...