Article · PDF available

Self-organizing neural network that discovers surfaces in random-dot stereograms

Authors: Suzanna Becker, Geoffrey E. Hinton

Abstract

The standard form of back-propagation learning is implausible as a model of perceptual learning because it requires an external teacher to specify the desired output of the network. We show how the external teacher can be replaced by internally derived teaching signals. These signals are generated by using the assumption that different parts of the perceptual input have common causes in the external world. Small modules that look at separate but related parts of the perceptual input discover these common causes by striving to produce outputs that agree with each other. The modules may look at different modalities (such as vision and touch), or the same modality at different times (for example, the consecutive two-dimensional views of a rotating three-dimensional object), or even spatially adjacent parts of the same image. Our simulations show that when our learning procedure is applied to adjacent patches of two-dimensional images, it allows a neural network that has no prior knowledge of the third dimension to discover depth in random-dot stereograms of curved surfaces.
© 1992 Nature Publishing Group
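The agreement objective described in the abstract can be made concrete. For scalar module outputs a and b, the paper approximates their mutual information (assuming Gaussian signals) as I = 0.5 log(V(a+b)/V(a−b)). Below is a minimal PyTorch sketch of training two modules to maximize this quantity; the architectures, patch sizes, and optimizer settings are illustrative assumptions, not the original setup.

```python
import torch
import torch.nn as nn

# Two small modules, each seeing one patch of the input.
# Sizes and architecture are illustrative stand-ins, not the original setup.
module_a = nn.Sequential(nn.Linear(16, 8), nn.Tanh(), nn.Linear(8, 1))
module_b = nn.Sequential(nn.Linear(16, 8), nn.Tanh(), nn.Linear(8, 1))

def agreement_loss(a, b, eps=1e-6):
    # Negative of I = 0.5 * log(Var(a+b) / Var(a-b)), the paper's
    # Gaussian approximation to the mutual information between the
    # outputs of the two modules; minimizing it maximizes agreement.
    return -0.5 * torch.log((a + b).var() / ((a - b).var() + eps))

# One gradient step on a batch of adjacent-patch pairs (random stand-ins here).
patch_a, patch_b = torch.randn(128, 16), torch.randn(128, 16)
opt = torch.optim.SGD(
    list(module_a.parameters()) + list(module_b.parameters()), lr=0.01)
loss = agreement_loss(module_a(patch_a).squeeze(-1), module_b(patch_b).squeeze(-1))
opt.zero_grad()
loss.backward()
opt.step()
```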
... As one of the earliest approaches, Linsker proposed maximizing information transfer from the input data to its latent representation and showed that, under a Gaussian distribution assumption, this is equivalent to maximizing the determinant of the output covariance [34]. Around the same time, Becker & Hinton [35] put forward a representation learning approach based on maximizing (an approximation of) the SMI between alternative latent vectors obtained from the same image. The best-known application is the Independent Component Analysis (ICA) Infomax algorithm [36] for separating independent sources from their linear combinations. ...
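For context, the Gaussian equivalence invoked above follows from the differential entropy of a Gaussian vector; the restatement below is a standard identity, not taken from the cited paper's notation.

```latex
% Differential entropy of a Gaussian output vector y with covariance \Sigma:
h(\mathbf{y}) = \tfrac{1}{2}\log\!\big((2\pi e)^{d}\det\Sigma\big)
% With input-independent noise of fixed covariance, maximizing the transmitted
% information I(x;y) = h(y) - h(y|x) therefore reduces to maximizing
% \det\Sigma (equivalently, \log\det\Sigma).
```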
... Instead, we maximize the mutual information content of the alternative latent representations of the same input. In this respect, our approach is closest in spirit to [35]. However, we use a different (correlative) information measure, which is computationally more efficient and induces a special form of linear dependence among the alternative latent representations of the same input, which may be more desirable given the goal of generating features for a linear classifier. ...
Preprint
Self-supervised learning allows AI systems to learn effective representations from large amounts of data using tasks that do not require costly labeling. Mode collapse, i.e., the model producing identical representations for all inputs, is a central problem for many self-supervised learning approaches, making self-supervised tasks, such as matching distorted variants of the inputs, ineffective. In this article, we argue that a straightforward application of information maximization among alternative latent representations of the same input naturally solves the collapse problem and achieves competitive empirical results. We propose a self-supervised learning method, CorInfoMax, that uses a second-order statistics-based mutual information measure reflecting the level of correlation among its arguments. Maximizing this correlative information measure between alternative representations of the same input serves two purposes: (1) it avoids the collapse problem by generating feature vectors with non-degenerate covariances; (2) it establishes relevance among alternative representations by increasing the linear dependence among them. An approximation of the proposed information maximization objective simplifies to a Euclidean distance-based objective function regularized by the log-determinant of the feature covariance matrix. The regularization term acts as a natural barrier against feature space degeneracy. Consequently, beyond avoiding complete output collapse to a single point, the proposed approach also prevents dimensional collapse by encouraging the spread of information across the whole feature space. Numerical experiments demonstrate that CorInfoMax achieves results better than or competitive with state-of-the-art SSL approaches.
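The abstract's closing simplification suggests a loss of the form "distance between paired embeddings minus a log-determinant barrier". The following is a schematic sketch based only on that description; the weighting lam, the regularizer eps, and the normalization are our assumptions, not the authors' reference implementation.

```python
import torch

def corinfomax_loss(z1, z2, eps=1e-3, lam=1.0):
    # Schematic objective based on the abstract's description (not the
    # authors' reference code): a Euclidean attraction term between the
    # two embeddings of the same input, regularized by the log-determinant
    # of each branch's feature covariance to prevent collapse.
    n, d = z1.shape
    invariance = ((z1 - z2) ** 2).sum(dim=1).mean()

    def logdet_cov(z):
        zc = z - z.mean(dim=0, keepdim=True)
        cov = (zc.T @ zc) / (n - 1) + eps * torch.eye(d)
        return torch.logdet(cov)

    # Maximizing each branch's log-det keeps the covariance non-degenerate.
    return invariance - lam * (logdet_cov(z1) + logdet_cov(z2))

# Example: 256 samples with 64-dimensional embeddings from two augmented views.
loss = corinfomax_loss(torch.randn(256, 64), torch.randn(256, 64))
```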
... The fundamental assumption in self-supervised learning is that the input data contain more task-specific information than the sparse categorical ground-truth data of supervised learning [41]. Consequently, a careful augmentation design should provide better results on downstream tasks than a supervised learning scenario [41,42]. While improvements in accuracy in supervised learning usually come from architecture modifications, regularization, or the loss function, CLR is about domain-specific augmentation strategies above all else [16,41]. ...
... Formally, CLR is a technique for creating an embedding space from arbitrary input modalities, enabling comparative analysis. CLR dates back to work from the nineties [42] but has only recently seen a renaissance, yielding state-of-the-art results in visual [16,17,36,43], audio [44][45][46], video [47][48][49], and text [50,51] representation learning. ...
Preprint
Full-text available
Single-shot diffraction imaging of isolated nanosized particles has seen remarkable success in recent years, yielding in-situ measurements with ultra-high spatial and temporal resolution. The progress of high-repetition-rate sources for intense X-ray pulses has further enabled the recording of datasets containing millions of diffraction images, which are needed for structure determination of specimens with greater structural variety and for dynamic experiments. The size of these datasets, however, represents a monumental problem for their analysis. Here, we present an automated approach for finding semantic similarities in coherent diffraction images without relying on human expert labeling. By introducing the concept of projection learning, we extend self-supervised contrastive learning to the context of coherent diffraction imaging. As a result, we achieve a semantic dimensionality reduction producing meaningful embeddings that align with the physical intuition of an experienced human researcher. The method yields a substantial improvement over previous approaches, paving the way toward real-time and large-scale analysis of coherent diffraction experiments at X-ray free-electron lasers.
... For example, patch-based methods [28,39,40] can understand images by learning the relative positions of randomly sampled image patches. Recently, contrastive learning studies [41,42] have shown promising results, e.g., [43,44], which compute similarity and dissimilarity (or focus only on similarity) between two or more views of an image. For example, SimCLR [44] combines contrastive learning with several novel ideas, providing an effective framework that advances the state of self-supervised learning in computer vision. ...
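For readers unfamiliar with the SimCLR objective mentioned here, the sketch below shows NT-Xent, its normalized temperature-scaled cross-entropy loss; this is a compact illustration rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    # NT-Xent: each embedding must identify its augmented counterpart
    # among all 2N embeddings in the batch via a softmax over similarities.
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # 2N unit-norm embeddings
    sim = z @ z.T / tau                           # scaled cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    # The positive for sample i is its augmented counterpart at i +/- n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Example: two augmented views of 128 inputs, 64-dimensional embeddings.
loss = nt_xent(torch.randn(128, 64), torch.randn(128, 64))
```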
Article
Full-text available
This paper provides insights that go beyond simply combining self-supervised learning (SSL) with remote sensing (RS). Inspired by the improved representation ability brought by SSL in natural image understanding, we aim to explore and analyze the compatibility of SSL with remote sensing. In particular, we propose, for the first time, a self-supervised pre-training framework that applies the masked image modeling (MIM) method to RS image research in order to enhance its efficacy. The completion proxy task used by MIM encourages the model to reconstruct the masked patches, thus semantically correlating the unseen parts with the seen parts. Second, to figure out how pretext tasks affect downstream performance, we identify the attribution consensus of the pre-trained model and downstream tasks toward the proxy and classification targets, which is quite different from that in natural image understanding. Moreover, this transferable consensus persists in cross-dataset full or partial fine-tuning, which means that SSL can boost general, model-free representation beyond domain bias and task bias (e.g., classification, segmentation, and detection). Finally, on three publicly accessible RS scene classification datasets, our method outperforms the majority of fully supervised state-of-the-art (SOTA) methods with higher accuracy scores on unlabeled datasets.
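The MIM proxy task described above can be sketched as follows: randomly hide a fraction of patch tokens and train the model to reconstruct them. The masking ratio and tensor shapes below are generic illustrative choices, not those of the paper.

```python
import torch

def mask_patches(patches, mask_ratio=0.75):
    # Randomly hide a fraction of patch tokens, as in masked image
    # modeling; returns the visible tokens plus the indices of the masked
    # ones so a decoder can be trained to reconstruct them.
    n, num_patches, dim = patches.shape
    num_masked = int(mask_ratio * num_patches)
    perm = torch.rand(n, num_patches).argsort(dim=1)   # random patch order
    masked_idx, visible_idx = perm[:, :num_masked], perm[:, num_masked:]
    batch = torch.arange(n).unsqueeze(1)
    return patches[batch, visible_idx], masked_idx

# The encoder sees only the visible tokens; a decoder predicts the masked
# ones (an L2 pixel loss is a common choice for the reconstruction target).
patches = torch.randn(8, 196, 768)   # e.g. a 14x14 grid of ViT-style tokens
visible, masked_idx = mask_patches(patches)
```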
... In this paper, we investigate a novel pretraining method called self-supervised learning, which relaxes the training requirements by removing the need for labeled data to train the network. The general idea of this method is to make the model learn a universal representation of the data using joint embedding objective functions [10,11] that compare embedding representations of the data. For speech in particular, several frameworks have been proposed [12], such as generative modeling [13], discriminative/contrastive modeling [14,15,16], and multitask learning [17]. ...
... Among clustering-based systems, Becker and Hinton [3] first introduced the concept of mutual information in clustering. Learning mutual information from paired samples was introduced later and termed co-clustering [8,11]. ...
Article
Full-text available
Image classification is a subdomain of computer vision that categorizes images. The advent of handheld devices and image sensors has led to the availability of huge amounts of data without labels. Hence, supervised learning algorithms, which require labels, are not suitable for categorizing these images. Unsupervised clustering, on the other hand, is also of limited use, since its accuracy is unreliable when the data are not labeled in advance. Self-supervised learning techniques can be used to overcome this problem. In this work, we present a novel Swin Transformer-based Contrastive Self-Supervised Learning method (Swin-TCSSL), in which a paired sample is formed by transforming the given input image and passed to the Swin-T transformer, which produces feature vectors. Maximizing the mutual information of these feature vectors is used to form robust clusters, and the cluster labels are propagated back to the Swin Transformer block until appropriate clusters are obtained. This is followed by contrastive learning, which finally produces the classified output. The experimental results show that the proposed system is invariant to occlusion, viewpoint variation, and illumination effects. The proposed Swin-TCSSL achieves state-of-the-art results on five benchmark datasets: CIFAR-10, Snapshot Serengeti, Stanford Dogs, Animals with Attributes, and ImageNet. As evident from the rigorous experiments, Swin-TCSSL sets a new state of the art with an average accuracy of 97.63%, higher than existing systems.
... The core idea of CSL appeared in the literature long ago [3,12,13,15,32]. The term "view" in CSL generally refers to augmented samples, and the goal of CSL is to close the distance between different views of the same sample, a.k.a. the anchor, while keeping views from different samples away from the anchor. ...
Preprint
Contrastive Self-supervised Learning (CSL) is a practical solution that learns meaningful visual representations from massive data in an unsupervised manner. Ordinary CSL embeds the features extracted by neural networks onto specific topological structures. During training, the contrastive loss draws the different views of the same input together while pushing the embeddings from different inputs apart. One drawback of CSL is that the loss term ideally requires a large number of negative samples to provide a tighter mutual information bound. However, increasing the number of negative samples via a larger batch size also amplifies the effect of false negatives: semantically similar samples are pushed apart from the anchor, degrading downstream performance. In this paper, we tackle this problem by introducing a simple but effective contrastive learning framework. The key insight is to employ a siamese-style metric loss to match intra-prototype features while increasing the distance between inter-prototype features. We conduct extensive experiments on various benchmarks, and the results demonstrate the effectiveness of our method in improving the quality of visual representations. Specifically, our unsupervised pre-trained ResNet-50 with a linear probe outperforms the fully supervised trained version on the ImageNet-1K dataset.
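The key insight quoted above (match intra-prototype features, separate inter-prototype ones) admits a simple sketch; the hinge formulation, margin, and normalization below are our assumptions, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def prototype_metric_loss(z, prototypes, assignments, margin=1.0):
    # Pull each embedding toward its assigned prototype while pushing it
    # at least `margin` away from every other prototype.
    z = F.normalize(z, dim=1)
    protos = F.normalize(prototypes, dim=1)
    d = torch.cdist(z, protos)                          # (N, K) distances
    pos = d.gather(1, assignments.unsqueeze(1)).squeeze(1)
    other = d.scatter(1, assignments.unsqueeze(1), float('inf'))
    push = F.relu(margin - other).sum(dim=1)            # hinge on other prototypes
    return (pos + push).mean()

# Example: 128 embeddings, 10 prototypes, random cluster assignments.
loss = prototype_metric_loss(
    torch.randn(128, 64), torch.randn(10, 64), torch.randint(0, 10, (128,)))
```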
Article
We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.
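As a concrete illustration of the procedure this abstract describes, here is a minimal one-hidden-layer network trained by back-propagation; the task, sizes, and learning rate are toy choices, not anything from the paper.

```python
import numpy as np

# A minimal one-hidden-layer network trained by back-propagation on a toy
# regression task: repeatedly adjust the weights to reduce the mismatch
# between the actual and desired output vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                  # input vectors
T = rng.normal(size=(32, 1))                  # desired output vectors
W1 = rng.normal(size=(4, 8)) * 0.1
W2 = rng.normal(size=(8, 1)) * 0.1

for step in range(1000):
    H = np.tanh(X @ W1)                       # hidden units develop features
    Y = H @ W2                                # actual output vector
    E = Y - T                                 # error against the desired output
    # Propagate the error backwards and adjust the weights down the gradient.
    dW2 = H.T @ E / len(X)
    dH = (E @ W2.T) * (1 - H ** 2)            # tanh derivative
    dW1 = X.T @ dH / len(X)
    W2 -= 0.01 * dW2
    W1 -= 0.01 * dW1
```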