Geoffrey E. Hinton’s research while affiliated with Google and other places


Publications (392)


A Generalist Framework for Panoptic Segmentation of Images and Videos
  • Conference Paper

October 2023 · 31 Reads · 98 Citations

Ting Chen · Lala Li · Saurabh Saxena · [...] · David J. Fleet

Overview of the REMEDIS approach for developing robust and efficient ML for medical imaging
REMEDIS starts with representations initialized using large-scale natural-image pretraining following the BiT method⁵². We then adapt the model to the medical domain using intermediate contrastive self-supervised learning without using any labelled medical data. Finally, we fine-tune the model to specific downstream medical-imaging ML tasks. We evaluate the ML model in both ID and OOD settings to establish the data-efficient generalization performance of the model.
Overview of clinical settings for evaluating REMEDIS
We evaluated REMEDIS as well as baseline ML models on five different domains containing six tasks and involving a wide and complex variety of distribution shifts in clinical settings.
Data-efficient generalization
Overview of the results showing overall performance and data-efficient generalization of REMEDIS and of the strong supervised baseline pretrained on JFT-300M for the dermatology-condition classification (T1), DME classification (T2), chest-X-ray-condition classification (T3), pathology metastases detection (T4), pathology colorectal survival prediction (T5) and mammography classification (T6) tasks. We observed considerably improved OOD generalization and a substantial reduction in the need for labelled medical data when using REMEDIS. We calculated 95% confidence intervals by running each label fraction and experiment up to ten times; intervals are shown by the shaded areas and error bars. A two-sided t-test was also performed for each label fraction and when computing the ID results. Where no asterisk is shown, the P value is less than 0.001; otherwise, the P value is as indicated. The red lines indicate the amount of data that REMEDIS needs to match the highest supervised ML baseline performance when simulated in a new OOD clinical deployment setting. The amount of annotated data (%) and the clinician hours potentially saved by using REMEDIS for each medical task are indicated above and below the two-sided arrows, respectively.
Data-efficient generalization of REMEDIS with various self-supervised learning techniques
Overview of the results showing performance and data-efficient generalization of REMEDIS with SimCLR, RELIC, MoCo and Barlow Twins as the self-supervised strategy for the dermatology-condition classification (T1), DME classification (T2) and chest-X-ray-condition classification (T3) tasks with ResNet-152 (2×) as the encoder. The grey shaded area indicates the performance margin of the strong supervised baseline pretrained on JFT. We observed that REMEDIS is compatible with MoCo, RELIC and Barlow Twins as alternative self-supervised learning strategies and that all the REMEDIS variants lead to data-efficient generalization improvements over the strong supervised baseline. The 95% confidence intervals were calculated by running each label fraction and experiment up to ten times; intervals are shown using the shaded areas and error bars. The strong supervised baseline pretrained on JFT-300M is shown by the dashed grey lines. A two-sided t-test was performed between the strong supervised baseline and the REMEDIS variants, and P > 0.05 is indicated with †.
REMEDIS versus weakly supervised DeepMIL
Overview of the results showing the relative improvement between REMEDIS (R = SSL+L+W), DeepMIL⁸¹ pretrained using large-scale JFT data (L+W) and DeepMIL pretrained using ImageNet data (W). SSL, self-supervised learning; L, large-scale supervised pretraining using JFT data; W, weakly supervised learning.


Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging
  • Article

June 2023 · 571 Reads · 149 Citations

Nature Biomedical Engineering

Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates such ‘out of distribution’ performance problems and improves model robustness and training efficiency. The strategy, which we named REMEDIS (for ‘Robust and Efficient Medical Imaging with Self-supervision’), combines large-scale supervised transfer learning on natural images and intermediate contrastive self-supervised learning on medical images and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies by up to 11.5% with respect to strong supervised baseline models, and in out-of-distribution settings required only 1–33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging.
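As a rough illustration of the intermediate contrastive step (REMEDIS uses SimCLR-style self-supervision by default), here is a minimal PyTorch sketch of the NT-Xent objective over two augmented views of the same unlabelled images. This is a hedged sketch, not the authors' code; the function name and batch values are illustrative.

```python
# Minimal sketch of a SimCLR-style contrastive objective, as used in
# REMEDIS's intermediate self-supervised stage. Illustrative only.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent loss over a batch of paired augmented views.

    z1, z2: (N, D) projections of two augmentations of the same images.
    """
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D)
    sim = z @ z.t() / temperature                       # cosine similarities
    sim.fill_diagonal_(float('-inf'))                   # exclude self-pairs
    # The positive for row i is its counterpart in the other view.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Usage: embed two augmented views of unlabelled medical images, minimise
# the loss, then fine-tune the resulting encoder on the labelled task.
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
loss = nt_xent_loss(z1, z2)
```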


Showing the bottom-up, top-down, and same-level interactions among three adjacent levels of the proposed GLOM architecture for a single column. The blue and red arrows representing bottom-up and top-down interactions are implemented by two different neural networks that have several hidden layers. These networks can differ between pairs of levels, but they are shared across columns and across time steps. The top-down net should probably use sinusoidal units (Sitzmann, Martel, Bergman, Lindell, & Wetzstein, 2020). For a static image, the green arrows could simply be scaled residual connections that implement temporal smoothing of the embedding at each level. For video, the green connections could be neural networks that learn temporal dynamics based on several previous states of the capsule. Interactions between the embedding vectors at the same level in different columns are implemented by a nonadaptive, attention-weighted, local smoother, which is not shown.
A picture of the embeddings at a particular time in six nearby columns. All of the locations shown belong to the same object, and the scene level has not yet settled on a shared vector. The complete embedding vector for each location is shown by dividing the vector into a separate section for each level in the part-whole hierarchy and then showing the high-dimensional embedding vector for a level as a 2D vector. This makes it easy to illustrate alignment of the embedding vectors of different locations. The islands of identical vectors at the various levels shown in the figure represent a parse tree. But islands of identity are considerably more powerful than phrase structure grammars. They have no difficulty representing disconnected objects as in, “Will this slow phrase structure grammarians down?”
A very simple example of a neural field using individual pixels as the locations. The intensities of four pixels can all be represented by the same code (a,b) even though their intensities vary according to the function f(x)=ax+b. The decoder has an extra input that specifies the location.
This is a different way of visualizing the architecture shown in Figure 1, which makes the relationship of that architecture to transformers more obvious. The horizontal dimension, which represents time in Figure 1, becomes the vertical dimension, which represents layers in this figure. At each location, every layer now has embeddings for all of the levels in the part-whole hierarchy. This corresponds to vertically compressing the depiction of the levels within a single time slice in Figure 1. A single forward pass through this architecture is all that is required to interpret a static image. All of the level-specific bottom-up and top-down neural nets are shown here as a single neural net. Figure 5 shows the individual bottom-up and top-down neural nets for this alternative way of viewing the GLOM architecture.
A picture of two adjacent layers of GLOM for a single location (i.e., part of a single column). During the forward pass, the embedding vector at level L receives input from the level L-1 embedding vector in the previous layer via a multilayer, bottom-up neural net. It also receives input from the level L+1 embedding in the previous layer via a multilayer, top-down neural net. The dependence on level L+1 in the previous layer implements top-down effects during the forward pass. The level L embedding in layer t+1 also depends on the level L embedding in layer t and an attention-weighted sum of the level L embeddings at other nearby locations in layer t. These within-level interactions are not shown.
How to Represent Part-Whole Hierarchies in a Neural Network

February 2023 · 82 Reads · 202 Citations

This article does not describe a working system. Instead, it presents a single idea about representation that allows advances made by several different groups to be combined into an imaginary system called GLOM.¹ The advances include transformers, neural fields, contrastive representation learning, distillation, and capsules. GLOM answers the question: How can a neural network with a fixed architecture parse an image into a part-whole hierarchy that has a different structure for each image? The idea is simply to use islands of identical vectors to represent the nodes in the parse tree. If GLOM can be made to work, it should significantly improve the interpretability of the representations produced by transformer-like systems when applied to vision or language.
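The "islands of identical vectors" arise from the same-level, attention-weighted smoother described in the figure captions above: each location's embedding is pulled toward an average of similar embeddings at nearby locations. A minimal sketch of one such smoothing step follows; it is purely illustrative, since the paper describes no working system.

```python
# Hedged sketch of GLOM's same-level interaction: each location's
# embedding moves toward an attention-weighted average of embeddings at
# other locations, so agreeing locations coalesce into "islands".
import torch

def island_smoothing_step(h, temperature=1.0):
    """h: (num_locations, dim) embeddings at one level of the hierarchy.

    Attention weights are a softmax over dot-product similarity, so a
    location attends most to locations whose vectors already agree with it.
    """
    attn = torch.softmax(h @ h.t() / temperature, dim=-1)  # (L, L)
    return attn @ h  # each row moves toward its look-alikes

h = torch.randn(6, 16)           # six columns, as in the figure above
for _ in range(10):              # iterate toward islands of near-identical vectors
    h = island_smoothing_step(h)
```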


The Forward-Forward Algorithm: Some Preliminary Investigations

December 2022 · 510 Reads · 7 Citations

The aim of this paper is to introduce a new learning procedure for neural networks and to demonstrate that it works well enough on a few small problems to be worth further investigation. The Forward-Forward algorithm replaces the forward and backward passes of backpropagation by two forward passes, one with positive (i.e. real) data and the other with negative data which could be generated by the network itself. Each layer has its own objective function which is simply to have high goodness for positive data and low goodness for negative data. The sum of the squared activities in a layer can be used as the goodness but there are many other possibilities, including minus the sum of the squared activities. If the positive and negative passes could be separated in time, the negative passes could be done offline, which would make the learning much simpler in the positive pass and allow video to be pipelined through the network without ever storing activities or stopping to propagate derivatives.
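A minimal PyTorch sketch of a single layer trained with the Forward-Forward objective, using the sum of squared activities as goodness, follows. The class name, threshold, and learning rate are illustrative choices, not the paper's reference code.

```python
# One Forward-Forward layer: push goodness (sum of squared activities)
# above a threshold for positive data and below it for negative data.
# Hypothetical simplification of the paper's procedure.
import torch
import torch.nn.functional as F

class FFLayer(torch.nn.Module):
    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = torch.nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.SGD(self.parameters(), lr=lr)

    def forward(self, x):
        # Normalise the input so this layer cannot simply read off the
        # previous layer's goodness from the vector length.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return F.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)  # goodness, positive pass
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)  # goodness, negative pass
        # Logistic loss: high goodness for positive data, low for negative.
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```

In a stack of such layers, each layer optimises only its own objective, so no derivatives need to propagate between layers.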


Meta-Learning Fast Weight Language Models

December 2022 · 37 Reads

Dynamic evaluation of language models (LMs) adapts model parameters at test time using gradient information from previous tokens and substantially improves LM performance. However, it requires over 3x more compute than standard inference. We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently by expressing gradient updates as linear attention. A key improvement over dynamic evaluation is that FWLs can also be applied at training time so the model learns to make good use of gradient updates. FWLs can easily be added on top of existing transformer models, require relatively little extra compute or memory to run, and significantly improve language modeling perplexity.
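FWLs build on the known equivalence between linear attention and a "fast weight" memory updated by outer products. A toy sketch of that equivalence follows; it illustrates the fast-weight view, not the FWL architecture itself.

```python
# Hedged sketch: linear attention computed as a fast weight matrix that
# each token updates with a rank-1 outer product and later tokens read
# with their queries. Illustrative only.
import torch

def linear_attention_as_fast_weights(q, k, v):
    """q, k, v: (seq_len, dim). Returns per-token outputs."""
    d = q.shape[1]
    W = torch.zeros(d, d)                 # fast weight memory, starts empty
    outputs = []
    for t in range(q.shape[0]):
        W = W + torch.outer(v[t], k[t])   # gradient-like rank-1 write
        outputs.append(W @ q[t])          # read-out for the current token
    return torch.stack(outputs)

out = linear_attention_as_fast_weights(*torch.randn(3, 8, 16))
```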


Testing GLOM's ability to infer wholes from ambiguous parts

November 2022 · 180 Reads

The GLOM architecture proposed by Hinton [2021] is a recurrent neural network for parsing an image into a hierarchy of wholes and parts. When a part is ambiguous, GLOM assumes that the ambiguity can be resolved by allowing the part to make multi-modal predictions for the pose and identity of the whole to which it belongs and then using attention to similar predictions coming from other possibly ambiguous parts to settle on a common mode that is predicted by several different parts. In this study, we describe a highly simplified version of GLOM that allows us to assess the effectiveness of this way of dealing with ambiguity. Our results show that, with supervised training, GLOM is able to successfully form islands of very similar embedding vectors for all of the locations occupied by the same object and it is also robust to strong noise injections in the input and to out-of-distribution input transformations.
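One illustrative way to quantify island formation (not necessarily the paper's metric) is the mean pairwise cosine similarity among embeddings of locations belonging to the same object, which should approach 1 when islands form.

```python
# Illustrative island-agreement check: embeddings at locations of the
# same object should be near-identical. Assumed helper, not from the paper.
import torch
import torch.nn.functional as F

def island_agreement(embeddings, object_ids):
    """embeddings: (L, D); object_ids: (L,) integer label per location."""
    z = F.normalize(embeddings, dim=1)
    sims = []
    for obj in object_ids.unique():
        zo = z[object_ids == obj]
        n = zo.shape[0]
        if n > 1:
            s = zo @ zo.t()                            # pairwise cosine sims
            sims.append((s.sum() - n) / (n * (n - 1))) # exclude the diagonal
    return torch.stack(sims).mean()
```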


Gaussian-Bernoulli RBMs Without Tears

October 2022 · 195 Reads

We revisit the challenging problem of training Gaussian-Bernoulli restricted Boltzmann machines (GRBMs), introducing two innovations. We propose a novel Gibbs-Langevin sampling algorithm that outperforms existing methods like Gibbs sampling. We propose a modified contrastive divergence (CD) algorithm so that one can generate images with GRBMs starting from noise. This enables direct comparison of GRBMs with deep generative models, improving evaluation protocols in the RBM literature. Moreover, we show that modified CD and gradient clipping are enough to robustly train GRBMs with large learning rates, thus removing the necessity of various tricks in the literature. Experiments on Gaussian Mixtures, MNIST, FashionMNIST, and CelebA show GRBMs can generate good samples, despite their single-hidden-layer architecture. Our code is released at https://github.com/lrjconan/GRBM.
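For orientation, here is a hedged sketch of one block-Gibbs step in a GRBM with Gaussian visibles and Bernoulli hiddens. The paper's Gibbs-Langevin sampler interleaves Langevin updates, which are omitted here; see the released code for the actual implementation.

```python
# One block-Gibbs step for a Gaussian-Bernoulli RBM: sample Bernoulli
# hiddens given Gaussian visibles, then resample the visibles.
# Simplified sketch under a shared scalar sigma; illustrative only.
import torch

def gibbs_step(v, W, b_v, b_h, sigma=1.0):
    """v: (N, D) visibles; W: (D, H); b_v: (D,); b_h: (H,)."""
    p_h = torch.sigmoid((v / sigma**2) @ W + b_h)    # p(h = 1 | v)
    h = torch.bernoulli(p_h)                         # sample hidden units
    mu_v = h @ W.t() + b_v                           # mean of p(v | h)
    v_new = mu_v + sigma * torch.randn_like(mu_v)    # sample Gaussian visibles
    return v_new, h
```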


A Generalist Framework for Panoptic Segmentation of Images and Videos

October 2022 · 28 Reads

Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. As permutations of instance IDs are also valid solutions, the task requires learning a high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on inductive biases of the task. A diffusion model based on analog bits is used to model panoptic masks, with a simple, generic architecture and loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our generalist approach can perform competitively with state-of-the-art specialist methods in similar settings.
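To make the conditioning idea concrete, here is an illustrative sketch (not the authors' code) of running a per-frame mask-diffusion model over a video while feeding each frame's prediction into the next; `denoise_fn` is an assumed interface for any mask denoiser that accepts the previous mask as extra conditioning.

```python
# Hedged sketch: streaming video segmentation by conditioning each
# frame's mask diffusion on the previous frame's predicted mask.
import torch

def segment_video(frames, denoise_fn, num_steps=10):
    """frames: list of (C, H, W) tensors.

    denoise_fn(x_t, frame, prev_mask, t) -> x_{t-1} is an assumed
    interface; any conditional mask-diffusion denoiser would do.
    """
    prev_mask = torch.zeros_like(frames[0][:1])      # no history for frame 0
    masks = []
    for frame in frames:
        x = torch.randn_like(prev_mask)              # start from noise
        for t in reversed(range(num_steps)):
            x = denoise_fn(x, frame, prev_mask, t)   # iterative denoising
        prev_mask = x                                # condition the next frame
        masks.append(x)
    return masks
```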


Figure 4: Importance of StopGradient in the InfoNCE loss, using M/8 on CIFAR-10 with 256 channels and 1 group.
Figure 8: Numerical verification of the theoretical variance properties.
Table: Architectural details for the different sizes of models investigated.
Table: Supervised learning for image classification, comparing different normalization schemes (NG = no normalization gradient); CIFAR-10 test/train error (%).
Scaling Forward Gradient With Local Losses

October 2022 · 199 Reads

Forward gradient learning computes a noisy directional gradient and is a biologically plausible alternative to backprop for learning deep neural networks. However, the standard forward gradient algorithm, when applied naively, suffers from high variance when the number of parameters to be learned is large. In this paper, we propose a series of architectural and algorithmic modifications that together make forward gradient learning practical for standard deep learning benchmark tasks. We show that it is possible to substantially reduce the variance of the forward gradient estimator by applying perturbations to activations rather than weights. We further improve the scalability of forward gradient by introducing a large number of local greedy loss functions, each of which involves only a small number of learnable parameters, and a new MLPMixer-inspired architecture, LocalMixer, that is more suitable for local learning. Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
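The basic estimator can be written with a single forward-mode Jacobian-vector product. The sketch below perturbs weights for simplicity, whereas the paper's key variance reduction perturbs activations instead; it assumes PyTorch's `torch.func` and is illustrative, not the paper's code.

```python
# Hedged sketch of a forward gradient: sample a random tangent, push it
# through the loss with a JVP (one extra forward pass, no backprop), and
# use (directional derivative) * tangent as an unbiased gradient estimate.
import torch
from torch.func import jvp

def forward_gradient(loss_fn, params):
    """params: tuple of tensors; loss_fn(*params) -> scalar loss."""
    tangents = tuple(torch.randn_like(p) for p in params)  # random direction
    loss, dir_deriv = jvp(loss_fn, params, tangents)       # forward-mode JVP
    # Unbiased but noisy: the estimate equals the true gradient in expectation.
    grads = tuple(dir_deriv * t for t in tangents)
    return loss, grads

# Example: forward gradient of a tiny quadratic loss.
w = torch.randn(5)
loss, (g,) = forward_gradient(lambda w: (w ** 2).sum(), (w,))
```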


Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning

August 2022 · 85 Reads · 11 Citations

We present Bit Diffusion: a simple and generic approach for generating discrete data with continuous diffusion models. The main idea behind our approach is to first represent the discrete data as binary bits, and then train a continuous diffusion model to model these bits as real numbers, which we call analog bits. To generate samples, the model first generates the analog bits, which are then thresholded to obtain the bits that represent the discrete variables. We further propose two simple techniques, namely Self-Conditioning and Asymmetric Time Intervals, which lead to a significant improvement in sample quality. Despite its simplicity, the proposed approach can achieve strong performance in both discrete image generation and image captioning tasks. For discrete image generation, we significantly improve on the previous state of the art on both CIFAR-10 (which has 3K discrete 8-bit tokens) and ImageNet-64x64 (which has 12K discrete 8-bit tokens), outperforming the best autoregressive model in both sample quality (measured by FID) and efficiency. For image captioning on the MS-COCO dataset, our approach achieves results competitive with autoregressive models.
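The bit encoding and the thresholding step can be sketched with two small helpers; the names are illustrative, assuming 8-bit tokens as in the paper's image experiments.

```python
# Hedged sketch of the analog-bits idea: map integers to binary bits,
# shift/scale them to real values for a continuous diffusion model, and
# threshold generated samples back to integers.
import torch

def int2analog(x, n_bits=8):
    """x: integer tensor -> (..., n_bits) analog bits scaled to {-1, +1}."""
    shifts = torch.arange(n_bits - 1, -1, -1, device=x.device)
    bits = (x.unsqueeze(-1) >> shifts) & 1
    return bits.float() * 2.0 - 1.0       # {0, 1} -> {-1, +1}

def analog2int(a):
    """Threshold real-valued model outputs back to integers."""
    bits = (a > 0).long()                 # quantise each analog bit
    n_bits = bits.shape[-1]
    shifts = torch.arange(n_bits - 1, -1, -1, device=a.device)
    return (bits << shifts).sum(-1)

x = torch.randint(0, 256, (4,))
assert torch.equal(analog2int(int2analog(x)), x)   # round-trips exactly
```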


Citations (78)


... This structure contrasts with Recurrent Neural Networks (RNNs), where nodes in the hidden layer are not only interconnected but also receive inputs from both the current state and the preceding state. This design enables RNNs to maintain a memory of past information, which is then integrated with new inputs to update the network's state (Graves et al. 2013). However, RNNs often suffer from vanishing or exploding gradients when backpropagating through long-term dependencies, which can impede effective learning (Bengio et al. 1994). ...

Reference:

Forecasting step-like landslide displacement through diverse monitoring frequencies
Speech Recognition with Deep Recurrent Neural Networks
  • Citing Preprint
  • March 2013

... Beyond image generation, DMs inherently perform implicit discriminative reasoning while generating data, which proves highly effective in visual tasks that require complex relationship modeling and spatiotemporal reasoning. Therefore, a surge of work has successfully adapted generative diffusion models for tasks including image segmentation [1,4,5,13,30,41,95], object detection [11,75,106], object tracking [64,65,105], and monocular depth estimation [24,66,77,121]. Recent research has also utilized DMs for more complex tasks that require high-level visual understanding abilities, such as visual-linguistic understanding [45], scene generation [36], and human-object interaction detection [38,51]. ...

A Generalist Framework for Panoptic Segmentation of Images and Videos
  • Citing Conference Paper
  • October 2023

... Schlag et al. (2021) point out that self-attention without softmax and other linear Transformer variants (Tsai et al., 2019;Katharopoulos et al., 2020;Choromanski et al., 2021;Peng et al., 2021) can be viewed as FWPs. Clark et al. (2022) propose fast weight layers which are added on top of the Transformer model after the last attention layer for language modeling. Different from previous work mainly focusing on specific tasks, our goal is to enhance frozen pretrained LMs with fast associative memory for general language processing. ...

Meta-Learning Fast Weight Language Models
  • Citing Conference Paper
  • January 2022

... Azizi et al. created REMEDIS [5], an approach that combines multiple levels of supervision for transfer learning on medical tasks. REMEDIS combines large-scale supervised learning on natural images with self-supervised learning on medical images, and achieves significant performance gains on 15 medical tasks compared to supervised baselines. ...

Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging

Nature Biomedical Engineering

... Recent approaches have also explored using two forward passes to facilitate communication between upstream and downstream neurons [12][13][14][15]. The "Forward-Forward" (FF) learning algorithm [12] is an approach in which data and label hypotheses are combined as inputs, with optimisation seeking to upregulate neural responses to correctly labelled inputs and subdue responses to spuriously labelled inputs. ...

The Forward-Forward Algorithm: Some Preliminary Investigations
  • Citing Preprint
  • December 2022

... Indeed, while often effective under independent and identically distributed (i.i.d.) conditions, CNNs' reliance on local spatial correlations leads to a tendency to prioritize superficial features, such as texture, over more intrinsic object characteristics, as shown by Geirhos et al. [15]. As further elaborated in Hinton [17], this limitation stems from architectural choices like pooling layers, which, while providing translation invariance, inadvertently sacrifice the precise spatial relationships crucial for encoding exact pose. Hinton further points out that this architecture prioritizes activity invariance over activity equivariance and weight invariance, leading to a reliance on data augmentation for viewpoint generalization. ...

How to Represent Part-Whole Hierarchies in a Neural Network

... For the microsatellite instability status prediction, we used SimCLR with a ResNet-18 as a backbone. SimCLR is a contrastive learning method that maximizes the agreement between two different augmented versions of the same image, thereby learning a relevant feature representation of the image⁵⁴. It was trained on 50,000 synthetic tiles (10,000 per cancer type) for 50 epochs. ...

Robust and Efficient Medical Imaging with Self-Supervision

... However their use of a mesh template does not allow for photorealistic renderings. Implicit functions [51,59] have also been utilized to reconstruct detailed 3D clothed humans [4,10,13,25,29,64]. However, they are also unable to generate photorealistic renderings and are often not reposable. ...

NASA Neural Articulated Shape Approximation