
Learning Invariance From Transformation Sequences


Abstract

The visual system can reliably identify objects even when the retinal image is transformed considerably by commonly occurring changes in the environment. A local learning rule is proposed, which allows a network to learn to generalize across such transformations. During the learning phase, the network is exposed to temporal sequences of patterns undergoing the transformation. An application of the algorithm is presented in which the network learns invariance to shift in retinal position. Such a principle may be involved in the development of the characteristic shift invariance property of complex cells in the primary visual cortex, and also in the development of more complicated invariance properties of neurons in higher visual areas.
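The rule described in the abstract is often referred to as a trace rule: a Hebbian update gated by a decaying average ("trace") of each unit's recent activity, so that a unit responding to one view of a pattern is nudged to respond to the next, transformed view in the sequence. A minimal sketch in NumPy, assuming the common leaky-trace form; the constants and the activity model are illustrative, not taken from the paper's text:

```python
import numpy as np

def trace_rule_step(w, x, y, y_trace, lr=0.02, delta=0.2):
    """One step of a trace-style Hebbian update (sketch).

    w       : (n_out, n_in) synaptic weights
    x       : (n_in,) current input pattern
    y       : (n_out,) current post-synaptic activity
    y_trace : (n_out,) leaky average ("trace") of past activity
    """
    # Update the activity trace: a leaky average over the recent sequence.
    y_trace = (1.0 - delta) * y_trace + delta * y
    # Hebbian step gated by the trace; the (x - w) form keeps weights bounded.
    w = w + lr * y_trace[:, None] * (x[None, :] - w)
    return w, y_trace
```

Because the trace outlives any single frame, weight changes correlate a unit's output with inputs seen slightly earlier, which is what lets the unit generalize across a transformation sequence.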
... Many essential tasks such as catching prey and escaping predators rely on predictions about the future state of the world. Naturally occurring sensory input typically has a complex relationship with future environmental states, and it has been hypothesized that organisms transform these inputs into internal representations that facilitate temporal prediction [1][2][3][4][5] . Consider vision. ...
... We conclude that, as with perceptual effects 6 , the nonlinear pooling mechanism that underlies the phase-invariance of complex cells is largely responsible for the model's ability to explain patterns of straightening at the population level. This is broadly consistent with the view that complex cell pooling reduces temporal image fluctuations due to translation or other deformations 1,20,22,28,[31][32][33][34] . ...
Article
Many sensory-driven behaviors rely on predictions about future states of the environment. Visual input typically evolves along complex temporal trajectories that are difficult to extrapolate. We test the hypothesis that spatial processing mechanisms in the early visual system facilitate prediction by constructing neural representations that follow straighter temporal trajectories. We recorded V1 population activity in anesthetized macaques while presenting static frames taken from brief video clips, and developed a procedure to measure the curvature of the associated neural population trajectory. We found that V1 populations straighten naturally occurring image sequences, but entangle artificial sequences that contain unnatural temporal transformations. We show that these effects arise in part from computational mechanisms that underlie the stimulus selectivity of V1 cells. Together, our findings reveal that the early visual system uses a set of specialized computations to build representations that can support prediction in the natural environment. Many behaviours depend on predictions about the environment. Here the authors find neural populations in primary visual cortex to straighten the temporal trajectories of natural video clips, facilitating the extrapolation of past observations.
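One simple way to quantify the curvature of a neural population trajectory, in the spirit of the procedure described above, is the mean angle between successive displacement vectors. This is an illustrative sketch, not the authors' exact estimator:

```python
import numpy as np

def trajectory_curvature(X):
    """Mean discrete curvature of a trajectory X of shape (T, N):
    the angle (in degrees) between successive displacement vectors,
    averaged over time. A straight trajectory scores 0."""
    v = np.diff(X, axis=0)                            # displacement vectors
    v = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit directions
    cos = np.sum(v[:-1] * v[1:], axis=1).clip(-1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```

Under this measure, "straightening" means the representation of a video clip has lower mean angle than the pixel-domain sequence it came from.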
... To produce labelled samples from the multivariate time series S, we propose to sample pairs of time windows (X_t, X_t′), where each window X_t, X_t′ is in R^{C×T}, T is the duration of each window, and the index t indicates the time sample at which the window starts in S. The first window X_t is referred to as the anchor window. Our assumption is that an appropriate representation of the data should evolve slowly over time (akin to the driving hypothesis behind slow feature analysis (SFA) (Földiák, 1991; Becker, 1993; Wiskott and Sejnowski, 2002)), suggesting that time windows close in time should share the same label. In the context of sleep staging, for instance, sleep stages usually last between 1 and 40 minutes (Altevogt and Colten, 2006); therefore, nearby windows likely come from the same sleep stage, whereas faraway windows likely come from different sleep stages. ...
... relative positioning"), temporal shuffling, and contrastive predictive coding (van den Oord et al., 2018). These three approaches rest intuitively on the idea that a good representation of the EEG should change slowly (Földiák, 1991; Becker, 1993; Wiskott and Sejnowski, 2002), since EEG windows close in time normally share the same label. We study the properties of these self-supervised methods through experiments on two large public datasets containing thousands of recordings (López et al., 2017; Ghassemi et al., 2018; Goldberger et al., 2000), allowing us to compare them with purely supervised approaches and with approaches based on handcrafted feature extraction (Gemein et al., 2020). ...
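The window-pair sampling scheme invoked in the two excerpts above can be illustrated as follows; the window length, the closeness threshold, and all names are illustrative rather than the authors' exact setup:

```python
import numpy as np

def sample_window_pair(S, T, tau_pos, rng):
    """Sample one labelled pair of windows from a multivariate time
    series S of shape (C, L). Windows whose start times differ by at
    most tau_pos samples are labelled 1 ("temporally close", likely
    the same underlying state), otherwise 0."""
    C, L = S.shape
    t1 = int(rng.integers(0, L - T))   # anchor window start
    t2 = int(rng.integers(0, L - T))   # second window start
    label = 1 if abs(t1 - t2) <= tau_pos else 0
    return S[:, t1:t1 + T], S[:, t2:t2 + T], label
```

A network is then trained to predict the label from the two windows, which forces temporally close windows toward similar representations.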
Thesis
Our understanding of the brain has improved considerably in the last decades, thanks to groundbreaking advances in the field of neuroimaging. Now, with the invention and wider availability of personal wearable neuroimaging devices, such as low-cost mobile EEG, we have entered an era in which neuroimaging is no longer constrained to traditional research labs or clinics. "Real-world" EEG comes with its own set of challenges, though, ranging from a scarcity of labelled data to unpredictable signal quality and limited spatial resolution. In this thesis, we draw on the field of deep learning to help transform this century-old brain imaging modality from a purely clinical- and research-focused tool, to a practical technology that can benefit individuals in their day-to-day life. First, we study how unlabelled EEG data can be utilized to gain insights and improve performance on common clinical learning tasks using self-supervised learning. We present three such self-supervised approaches that rely on the temporal structure of the data itself, rather than onerously collected labels, to learn clinically-relevant representations. Through experiments on large-scale datasets of sleep and neurological screening recordings, we demonstrate the significance of the learned representations, and show how unlabelled data can help boost performance in a semi-supervised scenario. Next, we explore ways to ensure neural networks are robust to the strong sources of noise often found in out-of-the-lab EEG recordings. Specifically, we present Dynamic Spatial Filtering, an attention mechanism module that allows a network to dynamically focus its processing on the most informative EEG channels while de-emphasizing any corrupted ones.
Experiments on large-scale datasets and real-world data demonstrate that, on sparse EEG, the proposed attention block handles strong corruption better than an automated noise handling approach, and that the predicted attention maps can be interpreted to inspect the functioning of the neural network. Finally, we investigate how weak labels can be used to develop a biomarker of neurophysiological health from real-world EEG. We translate the brain age framework, originally developed using lab and clinic-based magnetic resonance imaging, to real-world EEG data. Using recordings from more than a thousand individuals performing a focused attention exercise or sleeping overnight, we show not only that age can be predicted from wearable EEG, but also that age predictions encode information contained in well-known brain health biomarkers, but not in chronological age. Overall, this thesis brings us a step closer to harnessing EEG for neurophysiological monitoring outside of traditional research and clinical contexts, and opens the door to new and more flexible applications of this technology.
... However, for the more structured organisation of equivariant capsule representations, the usual approach is to hard-code this structure into the network, or to encourage it through regularization terms [4,15]. To achieve this through unsupervised learning, we propose to incorporate another key inductive bias: "temporal coherence" [18,24,48,52]. The principle of temporal coherence, or "slowness", asserts that when processing correlated sequences, we wish for our representations to change smoothly and slowly over space and time. ...
... Prior work on learning equivariant and invariant representations is similarly vast and also has a deep relationship with these generative models. Specifically, Independent Subspace Analysis [26], models involving temporal coherence [18,24,48,52], and Adaptive Subspace Self-Organizing Maps [34] have all demonstrated the ability to learn invariant feature subspaces and even 'disentangle' space and time [19]. Our work assumes a similar generative model to these works while additionally allowing for efficient estimation of the model through variational inference [32,46]. ...
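The slowness principle cited in both excerpts can be written as a simple penalty on the temporal change of the learned representation. A minimal sketch; note that minimising this term alone collapses to a constant code, so the cited models pair it with a variance, decorrelation, or reconstruction constraint (not shown here):

```python
import numpy as np

def slowness_penalty(Z):
    """Temporal-coherence ("slowness") penalty for a representation
    Z of shape (T, D): mean squared change of the code between
    consecutive time steps. Slow, smoothly varying codes score low."""
    return float(np.mean(np.sum(np.diff(Z, axis=0) ** 2, axis=1)))
```

In practice this penalty is added to a model's training loss so that representations of temporally adjacent inputs are pulled together.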
Preprint
In this work we seek to bridge the concepts of topographic organization and equivariance in neural networks. To accomplish this, we introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables. We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST. Furthermore, through topographic organization over time (i.e. temporal coherence), we demonstrate how predefined latent space transformation operators can be encouraged for observed transformed input sequences -- a primitive form of unsupervised learned equivariance. We demonstrate that this model successfully learns sets of approximately equivariant features (i.e. "capsules") directly from sequences and achieves higher likelihood on correspondingly transforming test sequences. Equivariance is verified quantitatively by measuring the approximate commutativity of the inference network and the sequence transformations. Finally, we demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
... Although our focus here is weak supervision, we should mention that there are various approaches to use explicit labels for enhancing disentangling performance, such as semi-supervised learning [Siddharth et al., 2017] or adversarial learning to promote disentanglement [Lample et al., 2017, Mathieu et al., 2016]. Also, as group-based learning was originally inspired by the temporal coherence principle [Földiák, 1991] (i.e., the object identity is often stable over time), some weakly supervised disentangling approaches have explicitly used it [Yang et al., 2015]. ...
Preprint
Image data of general objects have two common structures: (1) each object of a given shape can be rendered in multiple different views, and (2) shapes of objects can be categorized in such a way that the diversity of shapes is much larger across categories than within a category. Existing deep generative models can typically capture either structure, but not both. In this work, we introduce a novel deep generative model, called CIGMO, that can learn to represent category, shape, and view factors from image data. The model is comprised of multiple modules of shape representations that are each specialized to a particular category and disentangled from view representation, and can be learned using a group-based weakly supervised learning method. By empirical investigation, we show that our model can effectively discover categories of object shapes despite large view variation and quantitatively surpass various previous methods, including the state-of-the-art invariant clustering algorithm. Further, we show that our approach using category-specialization can enhance the learned shape representation to better perform downstream tasks such as one-shot object identification as well as shape-view disentanglement.
... Most of the works apply pairwise matching to minimize the alignment difference through optical flow or correspondence matching to achieve temporal smoothness. Other works [1,13,36,40,48] apply consistency constraints to predictions from the same input with different transformations in order to obtain perturbation-invariant representations. Our work can be seen as a combination of both types of consistency to fully consider the spatial and temporal continuity in motion forecasting. ...
Preprint
We present DCMS, a novel framework for motion forecasting with Dual Consistency Constraints and Multi-Pseudo-Target supervision. The motion forecasting task predicts future trajectories of vehicles by incorporating spatial and temporal information from the past. A key design of DCMS is the proposed Dual Consistency Constraints that regularize the predicted trajectories under spatial and temporal perturbation during the training stage. In addition, we design a novel self-ensembling scheme to obtain accurate pseudo targets to model the multi-modality in motion forecasting through supervision with multiple targets explicitly, namely Multi-Pseudo-Target supervision. Our experimental results on the Argoverse motion forecasting benchmark show that DCMS significantly outperforms the state-of-the-art methods, achieving 1st place on the leaderboard. We also demonstrate that our proposed strategies can be incorporated into other motion forecasting approaches as general training schemes.
... A branch of modeling works has linked the ability to accurately recognize objects during movements-which could support perceptual stability-to invariances for translations, rotations, and expansions learned directly from the statistics of the visual inputs. This class of models, e.g., unsupervised temporal learning models [7,8] and slow feature analysis models [9][10][11][12][13][14], has found supporting evidence in psychophysical [15,16] and physiological studies [7,17,18] and has inspired deep learning approaches for unsupervised rules to learn coherent visual representations in the presence of moving stimuli, e.g., contrastive embedding [19,20]. However, these models are agnostic with respect to whether retinal activations are due to objects moving in the environment or to movements of the organism, with the latter characteristically defining the phenomenon of perceptual stability. ...
Article
Our ability to perceive a stable visual world in the presence of continuous movements of the body, head, and eyes has puzzled researchers in the neuroscience field for a long time. We reformulated this problem in the context of hierarchical convolutional neural networks (CNNs), whose architectures have been inspired by the hierarchical signal processing of the mammalian visual system, and examined perceptual stability as an optimization process that identifies image-defining features for accurate image classification in the presence of movements. Movement signals, multiplexed with visual inputs along overlapping convolutional layers, aided classification invariance of shifted images by making the classification faster to learn and more robust relative to input noise. Classification invariance was reflected in activity manifolds associated with image categories emerging in late CNN layers and with network units acquiring movement-associated activity modulations as observed experimentally during saccadic eye movements. Our findings provide a computational framework that unifies a multitude of biological observations on perceptual stability under optimality principles for image classification in artificial neural networks.
... For that we used the Brain-Score [12,17], a set of metrics that evaluate deep networks' correspondence to neural recordings of cortical areas V1 [41], V2 [41], V4 [42], and inferior temporal (IT) cortex [42] in primates, as well as behavioral data [43]. The advantage of Brain-Score is that it provides a standardized benchmark that looks at the whole ventral stream, and not only its isolated properties like translation invariance (which many early models focused on [44,45,46,47]). We do not directly check for translation invariance in our models (only through V1/V2 data). However, as our approach achieves convolutional solutions (see above), we trivially have translation equivariance after training: translating the input will translate the layer's response (in our case, each k-th translation will produce the same translated response for a kernel of size k). ...
Article
Convolutional networks are ubiquitous in deep learning. They are particularly useful for images, as they reduce the number of parameters, reduce training time, and increase accuracy. However, as a model of the brain they are seriously problematic, since they require weight sharing - something real neurons simply cannot do. Consequently, while neurons in the brain can be locally connected (one of the features of convolutional networks), they cannot be convolutional. Locally connected but non-convolutional networks, however, significantly underperform convolutional ones. This is troublesome for studies that use convolutional networks to explain activity in the visual system. Here we study plausible alternatives to weight sharing that aim at the same regularization principle, which is to make each neuron within a pool react similarly to identical inputs. The most natural way to do that is by showing the network multiple translations of the same image, akin to saccades in animal vision. However, this approach requires many translations, and doesn't remove the performance gap. We propose instead to add lateral connectivity to a locally connected network, and allow learning via Hebbian plasticity. This requires the network to pause occasionally for a sleep-like phase of "weight sharing". This method enables locally connected networks to achieve nearly convolutional performance on ImageNet and improves their fit to the ventral stream data, thus supporting convolutional networks as a model of the visual stream.
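The "sleep-like phase of weight sharing" can be caricatured as averaging the position-specific kernels of a locally connected layer. The paper achieves this through lateral connectivity and Hebbian plasticity, so the direct averaging below is a deliberate simplification of the end result, not the mechanism itself:

```python
import numpy as np

def share_weights(local_filters):
    """Collapse a locally connected layer toward a convolutional one
    by replacing every position-specific kernel with the average
    kernel. local_filters: (n_positions, k, k). A convolutional
    layer, where all kernels are already identical, is the fixed
    point of this operation."""
    mean_kernel = local_filters.mean(axis=0, keepdims=True)
    return np.broadcast_to(mean_kernel, local_filters.shape).copy()
```

Interleaving such a sharing phase with ordinary training lets per-position filters drift apart during learning while being periodically pulled back toward a common, convolution-like solution.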
... In two temporally close frames, both the manipulated objects and the arm may have changed their position, the objects themselves may be different, or the lighting conditions may have changed due to failures. scenes with moving objects [Föl91]. For simplicity of exposition, we assume that the ...
Preprint
The world is structured in countless ways. It may be prudent to enforce corresponding structural properties to a learning algorithm's solution, such as incorporating prior beliefs, natural constraints, or causal structures. Doing so may translate to faster, more accurate, and more flexible models, which may directly relate to real-world impact. In this dissertation, we consider two different research areas that concern structuring a learning algorithm's solution: when the structure is known and when it has to be discovered.
... Some of the alternative theories are based on the assumption of time continuity, and they attempt to reconstruct general transformations from pairs of perceptions separated by some infinitesimal time interval [37,38]. These models focus on the visual system and do not attempt to explain the emergence of other invariances that cannot usually be observed as a time-continuous process, such as a change of key in music. ...
Preprint
The invariance of natural objects under perceptual changes is possibly encoded in the brain by symmetries in the graph of synaptic connections. The graph can be established via unsupervised learning in a biologically plausible process across different perceptual modalities. This hypothetical encoding scheme is supported by the correlation structure of naturalistic audio and image data and it predicts a neural connectivity architecture which is consistent with many empirical observations about primary sensory cortex.
Article
How does the brain form a useful representation of its environment? It is shown here that a layer of simple Hebbian units connected by modifiable anti-Hebbian feed-back connections can learn to code a set of patterns in such a way that statistical dependency between the elements of the representation is reduced, while information is preserved. The resulting code is sparse, which is favourable if it is to be used as input to a subsequent supervised associative layer. The operation of the network is demonstrated on two simple problems.
Article
I describe a local synaptic learning rule that can be used to remove the effects of certain types of systematic temporal variation in the inputs to a unit. According to this rule, changes in synaptic weight result from a conjunction of short-term temporal changes in the inputs and the output. Formally, this is like the differential rule proposed by Klopf (1986) and Kosko (1986), except for a change of sign, which gives it an anti-Hebbian character. By itself this rule is insufficient. A weight conservation condition is needed to prevent the weights from collapsing to zero, and some further constraint, implemented here by a biasing term, to select particular sets of weights from the subspace of those which give minimal variation. As an example, I show that this rule will generate center-surround receptive fields that remove temporally varying linear gradients from the inputs.
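A sketch of such a differential anti-Hebbian update for a single unit, with a naive additive conservation step standing in for the weight-conservation condition described above (the biasing term is omitted, and all names and constants are illustrative):

```python
import numpy as np

def diff_antihebb_step(w, x_prev, x, y_prev, y, lr=0.01):
    """Differential anti-Hebbian update (sketch) for one unit.

    w : (n_in,) weights; x_prev, x : previous and current input
    vectors; y_prev, y : previous and current scalar output.
    The weight change pairs short-term temporal changes in input
    and output with a negative sign; a conservation step then keeps
    the summed weight constant so weights cannot collapse to zero.
    """
    w_new = w - lr * (x - x_prev) * (y - y_prev)   # anti-Hebbian differential term
    w_new += (w.sum() - w_new.sum()) / w_new.size  # conserve total weight
    return w_new
```

Inputs whose temporal fluctuations co-vary with the output's fluctuations lose weight, which is what suppresses systematic temporal variation at the unit's output.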
Article
The development of shift tolerance and deformation tolerance in neural representation is discussed with reference to a prototypical paradigm, which summarizes the essential problem of representation of a distribution of input patterns containing features that are distributed uniformly throughout an image space and that are subject to variation in form. A form of sparse, local representation is proposed in which the position of a feature is localized with precision proportional to the extent of the representation's tolerance to deformation of the feature, which in turn reflects the extent to which the form of that feature is subject to variation over the probability distribution of input patterns. A local self-organizing mechanism is described which inevitably generates representations of this form, regardless of the initial configuration of the synaptic strength parameters. The form of the representation established by this mechanism is unaffected by the inclusion of superfluous representation units: the position tolerance and deformation tolerance of representation units are independent of the number of units participating in the self-organization process, provided that this number is adequate to form a complete representation. It is demonstrated that this self-organizing mechanism is able to discriminate between distinct features and represents these using separate representation units, even though the various forms of a single feature are represented by a single variation-tolerant unit. The attributes of local position tolerance and deformation tolerance arise purely in response to the invariance properties of the probability distribution of input patterns: the mechanism relies neither on the imposition of prior architectural constraints nor on associations in time between successive patterns in order to generate these attributes.
Article
This paper reports the results of our studies with an unsupervised learning paradigm which we have called "Competitive Learning." We have examined competitive learning using both computer simulation and formal analysis and have found that when it is applied to parallel networks of neuron-like elements, many potentially useful learning tasks can be accomplished. We were attracted to competitive learning because it seems to provide a way to discover the salient, general features which can be used to classify a set of patterns. We show how a very simple competitive mechanism can discover a set of feature detectors which capture important aspects of the set of stimulus input patterns. We also show how these feature detectors can form the basis of a multilayer system that can serve to learn categorizations of stimulus sets which are not linearly separable. We show how the use of correlated stimuli can serve as a kind of "teaching" input to the system to allow the development of feature detectors which would not develop otherwise. Although we find the competitive learning mechanism a very interesting and powerful learning principle, we do not, of course, imagine that it is the only learning principle. Competitive learning is an essentially nonassociative statistical learning scheme. We certainly imagine that other kinds of learning mechanisms will be involved in the building of associations among patterns of activation in a more complete neural network. We offer this analysis of these competitive learning mechanisms to further our understanding of how simple adaptive networks can discover features important in the description of the stimulus environment in which the system finds itself.
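The core competitive mechanism can be sketched in a few lines: units compete for each input, the best-matching unit wins, and only the winner's weights move toward the input. The function name and learning rate below are illustrative:

```python
import numpy as np

def competitive_step(W, x, lr=0.05):
    """One winner-take-all competitive learning step.

    W : (n_units, n_in) weight matrix, one row per competing unit
    x : (n_in,) input pattern
    The unit whose weight vector best matches x wins the competition
    and moves toward x; the losing units are left unchanged.
    """
    winner = int(np.argmax(W @ x))        # competition: best-matching unit
    W[winner] += lr * (x - W[winner])     # only the winner learns
    return winner, W
```

Repeated over a stream of inputs, each unit's weight vector drifts toward the centroid of the cluster of patterns it wins, yielding the feature detectors the abstract describes.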
Conference Paper
One major goal of research on massively parallel networks of neuron-like processing elements is to discover efficient methods for recognizing patterns. Another goal is to discover general learning procedures that allow networks to construct the internal representations that are required for complex tasks. This paper describes a recently developed procedure that can learn to perform a recognition task. The network is trained on examples in which the input vector represents an instance of a pattern in a particular position and the required output vector represents its name. After prolonged training, the network develops canonical internal representations of the patterns and it uses these canonical representations to identify familiar patterns in novel positions.