Article

MMatch: Semi-Supervised Discriminative Representation Learning for Multi-View Classification

Abstract

Semi-supervised multi-view learning has been an important research topic due to its capability to exploit complementary information from unlabeled multi-view data. This work proposes MMatch, a new semi-supervised discriminative representation learning method for multi-view classification. Unlike existing multi-view representation learning methods, which seldom consider the negative impact caused by particular views with unclear classification structures (weak discriminative views), MMatch jointly learns view-specific representations and class probabilities of the training data. The representations are concatenated to integrate the information of multiple views into a global representation. Moreover, MMatch imposes a smoothness constraint on the class probabilities of the global representation to improve the pseudo labels, while the pseudo labels in turn regularize the structure of the view-specific representations. A discriminative global representation is thus mined during training, and the negative impact of weak discriminative views is overcome. Besides, MMatch learns consistent classifications while preserving the diverse information of multiple views. Experiments on several multi-view datasets demonstrate the effectiveness of MMatch.
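
The abstract above describes the training loop only at a high level. Below is a minimal sketch of the core idea, assuming PyTorch-style modules; the encoder architecture, the neighborhood size used for smoothing, and the confidence threshold are illustrative assumptions, not the paper's actual settings.

```python
# Minimal sketch of the MMatch idea (not the authors' implementation):
# view-specific encoders -> concatenated global representation -> class
# probabilities, with pseudo-labels smoothed over nearby samples and fed
# back to regularize each view. Dimensions and thresholds are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewNet(nn.Module):
    def __init__(self, view_dims, rep_dim=64, n_classes=10):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, rep_dim), nn.ReLU()) for d in view_dims])
        self.view_heads = nn.ModuleList(
            [nn.Linear(rep_dim, n_classes) for _ in view_dims])
        self.global_head = nn.Linear(rep_dim * len(view_dims), n_classes)

    def forward(self, views):
        reps = [enc(x) for enc, x in zip(self.encoders, views)]
        global_rep = torch.cat(reps, dim=1)            # fused (global) representation
        view_logits = [h(r) for h, r in zip(self.view_heads, reps)]
        return reps, self.global_head(global_rep), view_logits

def smoothed_pseudo_labels(global_rep, probs, k=5):
    """Average each sample's class probabilities with those of its k nearest
    neighbours in the global representation space (smoothness constraint)."""
    z = F.normalize(global_rep, dim=1)
    sim = z @ z.t()
    _, idx = sim.topk(k + 1, dim=1)                    # includes the sample itself
    return probs[idx].mean(dim=1)

def unlabeled_loss(model, views, tau=0.95):
    reps, global_logits, view_logits = model(views)
    with torch.no_grad():
        pseudo = smoothed_pseudo_labels(torch.cat(reps, dim=1),
                                        F.softmax(global_logits, dim=1))
        conf, target = pseudo.max(dim=1)
        mask = (conf > tau).float()
    # pseudo-labels from the global representation regularize every view-specific head
    return sum((F.cross_entropy(v, target, reduction='none') * mask).mean()
               for v in view_logits)
```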

... (such as CCTV images from different angles) or heterogeneous (such as images and audio in videos) multi-source data into a shared representation space, also called a view-common representation. MvRL has achieved remarkable success in various downstream tasks over the past few decades [4], such as classification [5]- [9] and video action recognition [10]- [13]. These accomplishments are attributed to the community's accumulation of a large amount of annotated data and expensive computational resources. ...
... Although the core idea behind these approaches is to enhance the robustness of multiview representations by maximizing consistency across views, this objective can be task-dependent, resulting in reduced generalization performance of learned representations. While consistency is important for clustering [19], [28], [29] or classification tasks [8], [9], the complementary information contained within multi-views is critical for specific tasks like cross-view information retrieval [30]- [32] and multi-view synthesis [33], [34]. Thus, improving the generalization of multi-view representations is key to expanding the range of applications to which they can be applied. ...
... For instance, CCA-based methods [1], [21] as well as subspace-based methods [19], [23], etc. have been proposed. Zheng et al. [48] proposed a graph-guided representation learning technique that captures the higher-order structure of view-common representations by exploring relationships between view-specific graphs. Wang et al. [9] introduced a semi-supervised method to enhance the discriminative capability of the model for weak discriminative views by enforcing constraints of consistency between local view structures and global view structures, while leveraging global structural information for appropriate pseudo-label inference. Zhu et al. [49] proposed a graph-based incomplete view method by introducing the neighborhood constraint and view-existence constraint to create a heterogeneous graph. ...
Article
Full-text available
Multi-view representation learning aims to extract comprehensive information from multiple sources. It has achieved significant success in applications such as video understanding and 3D rendering. However, how to improve the robustness and generalization of multi-view representations from unsupervised and incomplete scenarios remains an open question in this field. In this study, we discovered a positive correlation between the semantic distance of multi-view representations and the tolerance for data corruption. Moreover, we found that the information ratio of consistency and complementarity significantly impacts the performance of discriminative and generative tasks related to multi-view representations. Based on these observations, we propose an end-to-end CLustering-guided cOntrastiVE fusioN (CLOVEN) method, which enhances the robustness and generalization of multi-view representations simultaneously. To balance consistency and complementarity, we design an asymmetric contrastive fusion module. The module first combines all view-specific representations into a comprehensive representation through a scaling fusion layer. Then, the information of the comprehensive representation and view-specific representations is aligned via contrastive learning loss function, resulting in a view-common representation that includes both consistent and complementary information. We prevent the module from learning suboptimal solutions by not allowing information alignment between view-specific representations. We design a clustering-guided module that encourages the aggregation of semantically similar views. This action reduces the semantic distance of the view-common representation. We quantitatively and qualitatively evaluate CLOVEN on five datasets, demonstrating its superiority over 13 other competitive multi-view learning methods in terms of clustering and classification performance. In the data-corrupted scenario, our proposed method resists noise interference better than competitors. Additionally, the visualization demonstrates that CLOVEN succeeds in preserving the intrinsic structure of view-specific representations and improves the compactness of view-common representations. Our code can be found at https://github.com/guanzhou-ke/cloven.
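
The asymmetric contrastive fusion described above can be sketched compactly: view-specific representations are fused into a comprehensive representation, and contrastive alignment is applied only between that fusion and each view, never between pairs of views. The sketch below is illustrative and assumes a simple linear fusion layer and standard InfoNCE; it is not the CLOVEN implementation.

```python
# Sketch of an asymmetric contrastive fusion step in the spirit of CLOVEN
# (illustrative only).
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.5):
    """Standard InfoNCE between two batches of matching rows."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

def asymmetric_fusion_loss(view_reps, fusion_layer, temperature=0.5):
    # fusion: a learnable layer maps the concatenation of all views
    # to the comprehensive representation
    fused = fusion_layer(torch.cat(view_reps, dim=1))
    # align the comprehensive representation with each view-specific one;
    # pairs of view-specific representations are deliberately *not* aligned
    return sum(info_nce(fused, v, temperature) for v in view_reps)

# usage (hypothetical sizes): two views, batch of 8, 32-dim representations
v1, v2 = torch.randn(8, 32), torch.randn(8, 32)
fusion = torch.nn.Linear(64, 32)
loss = asymmetric_fusion_loss([v1, v2], fusion)
```
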
... With the development of multimedia technology, most real-life data exists in the form of multi-view/multi-modality. For example, during autonomous driving, different sensors perceive the surrounding environment, such as ultrasonic radar, cameras, millimeter-wave radar, etc., and the data collected by each sensor is regarded as a view [1][2][3][4][5]. A video consists of audio, images, and text, with each medium acting as a view [6]. ...
... Pseudo-labeling methods provide labeled data directly, which is particularly advantageous for deep learning models [25,27]. For example, Wang et al. [1] proposed generating pseudo-labels on the fused representation of multiple views as supervised information to guide the learning of the single-view representation. With more supervised information, the representations learned for individual views are improved, which in turn facilitates the generation of better fused representations. ...
Article
Full-text available
Semi-supervised multi-view classification plays a crucial role in understanding and utilizing existing multi-view data, especially in domains like medical diagnosis and autonomous driving. However, conventional semi-supervised multi-view classification methods often merely fuse features from multiple views without significantly improving classification performance. To address this issue, we propose a dynamic fusion approach for Semi-supervised MultI-view cLassification (SMILE). This approach leverages a high-level semantic mapping module to extract discriminative features from each view, reducing redundant features. Furthermore, it introduces a dynamic fusion module to dynamically assess the quality of different views of different samples, diminishing the negative impact of low-quality views. We compare our method with six competitive methods on four datasets, exhibiting distinct advantages on the classification task with significant performance improvements across various evaluation metrics. Visualization experiments demonstrate that our approach is able to learn classification-friendly representations.
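
The per-sample view weighting that the dynamic fusion module performs can be illustrated with a small gating network that scores each view for every individual sample and takes a weighted sum. Module names and sizes below are hypothetical, not the SMILE code.

```python
# Illustrative sketch of per-sample dynamic view weighting.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, rep_dim, n_views):
        super().__init__()
        # scores the quality of each view for every individual sample
        self.gate = nn.Linear(rep_dim * n_views, n_views)

    def forward(self, view_reps):                  # list of (B, rep_dim) tensors
        stacked = torch.stack(view_reps, dim=1)    # (B, n_views, rep_dim)
        weights = torch.softmax(self.gate(torch.cat(view_reps, dim=1)), dim=1)
        # low-quality views are down-weighted sample by sample
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)

fused = DynamicFusion(rep_dim=32, n_views=2)([torch.randn(4, 32), torch.randn(4, 32)])
```
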
... Comparing Against SemiSL SOTAs: We compared STiL with SOTA SemiSL approaches, including 3 image methods, CoMatch [32], SimMatch [79], and FreeMatch [60]; and 3 multimodal methods, Co-training [10], MMatch [56] and Self-KD [58]. These image methods employ strong-to-weak consistency regularization, where predictions from weakly augmented samples act as pseudo-labels for their strongly augmented versions. ...
... MMatch [56]: In MMatch, predictions from a multimodal classifier are used as pseudo-labels for training unimodal classifiers. In addition, similar to CoMatch, MMatch refines the pseudo-labels by aggregating label information from nearby samples in the embedding space. ...
Preprint
Full-text available
Multimodal image-tabular learning is gaining attention, yet it faces challenges due to limited labeled data. While earlier work has applied self-supervised learning (SSL) to unlabeled data, its task-agnostic nature often results in learning suboptimal features for downstream tasks. Semi-supervised learning (SemiSL), which combines labeled and unlabeled data, offers a promising solution. However, existing multimodal SemiSL methods typically focus on unimodal or modality-shared features, ignoring valuable task-relevant modality-specific information, leading to a Modality Information Gap. In this paper, we propose STiL, a novel SemiSL tabular-image framework that addresses this gap by comprehensively exploring task-relevant information. STiL features a new disentangled contrastive consistency module to learn cross-modal invariant representations of shared information while retaining modality-specific information via disentanglement. We also propose a novel consensus-guided pseudo-labeling strategy to generate reliable pseudo-labels based on classifier consensus, along with a new prototype-guided label smoothing technique to refine pseudo-label quality with prototype embeddings, thereby enhancing task-relevant information learning in unlabeled data. Experiments on natural and medical image datasets show that STiL outperforms the state-of-the-art supervised/SSL/SemiSL image/multimodal approaches. Our code is publicly available.
... In the generic palmprint recognition scenarios, there exist different categories of palmprint images, such as contactless, contact-based, hyperspectral, and high-resolution palmprint images (see Figure 2). Recently, multiview learning (MVL) has attracted significant consideration from researchers, since the utilization of the heterogeneous features from multiple views has enormous potential for better recognition performance [24][25][26]. Compared with single-view representation, multiview learning can typically leverage more characteristics and structural information hidden in the data to improve the learning performance [27,28]. ...
... Following that, Tao et al. [30] formulated both the cohesion and diversity information of different features for multiview classifier learning. In [24], MMatch jointly learns view-specific representations and class probabilities of training data for discriminative representation learning. Up to now, a variety of MFL-based palmprint methods have achieved significant performance. ...
Article
Full-text available
Palmprint recognition has been widely applied to security authentication due to its rich characteristics, i.e., local direction, wrinkles, and texture. However, different types of palmprint images captured from different application scenarios usually contain a variety of dominant features. Specifically, the palmprint recognition performance will be degraded by interference factors, i.e., noise, rotations, and shadows, when palmprint images are acquired in open-set environments. Seeking to handle the long-standing interference information in the images, multiview palmprint feature learning has been proposed to enhance the feature expression by exploiting multiple characteristics from diverse views. In this paper, we first introduce six types of palmprint representation methods published from 2004 to 2022, which describe the characteristics of palmprints from a single view. Afterward, a number of multiview-learning-based palmprint recognition methods (2004–2022) are listed, which discuss how to achieve better recognition performance by adopting different complementary types of features from multiple views. To date, no work has summarized multiview fusion for different types of palmprint features. In this paper, the aims, frameworks, and related methods of multiview palmprint representation are summarized in detail.
... Zheng et al. [41] proposed a graph-guided representation learning technique that captures higher-order structure of view-common representations by exploring relationships between view-specific graphs. Wang et al. [42] introduced a semi-supervised method to enhance model discriminative capabilities for weakly discriminative views by enforcing consistency constraints between local and global view structures. Recent advances in multi-view clustering have leveraged deep and graph-based models. ...
Article
Full-text available
Multi-view clustering is effective at uncovering the latent structures within different views or modalities. However, existing approaches often oversimplify the problem by treating the contribution and granularity of information from all views as uniform, neglecting the semantic richness and diversity inherent in different views. To address this limitation, we propose a dynamic hierarchical fusion method that not only integrates multi-granularity representations from multiple views but also dynamically computes and adjusts the contribution of each view’s representation for the clustering task through weighted fusion. Specifically, we design a multi-view hierarchical feature fusion module that adaptively maps the intermediate representations from all view-specific encoders into a unified representation space, enabling effective multi-scale fusion. This process yields a set of intermediate representations that transition from coarse to fine granularity across multiple views and scales. Additionally, we introduce a multi-view gated fusion module, which utilizes a set of learnable normalized parameters to dynamically compute the contribution of each view’s representation and its multi-scale features. This weighted fusion produces a unified clustering representation that captures the most relevant information for the clustering task. Experimental results on benchmark datasets, such as Scene-15 Dataset, show that our method outperforms existing state-of-the-art methods by up to 1.88% in clustering accuracy (e.g., 41.08% compared to 39.20%). Ablation studies demonstrate the importance of the multi-view gated fusion module in learning the relative contributions of different views, which significantly enhances clustering performance.
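
The "learnable normalized parameters" idea in the gated fusion module can be sketched as one learnable score per view, normalized with a softmax and used to weight each view's representation. This is a minimal illustration under that assumption, not the paper's implementation.

```python
# Minimal sketch of gated fusion with learnable, normalized per-view weights.
import torch
import torch.nn as nn

class GatedViewFusion(nn.Module):
    def __init__(self, n_views):
        super().__init__()
        # one learnable score per view; softmax keeps the weights normalized
        self.scores = nn.Parameter(torch.zeros(n_views))

    def forward(self, view_reps):                    # list of (B, D) tensors
        w = torch.softmax(self.scores, dim=0)
        return sum(w[i] * r for i, r in enumerate(view_reps))

fusion = GatedViewFusion(n_views=3)
out = fusion([torch.randn(4, 16) for _ in range(3)])
```
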
... Numerous previous studies [11][12][13] have demonstrated the enhanced effectiveness of multi-view semi-supervised classification schemes, with some approaches involving the use of deep neural networks, including graph convolutional networks. This paper proposes a novel method for multi-view semi-supervised classification named deep random walk inspired multi-view graph convolutional networks (DRWM-GCN), which propagates label information across samples in a topological space. ...
Article
Full-text available
Recent studies highlight the growing appeal of multi-view learning due to its enhanced generalization. Semi-supervised classification, using few labeled samples to classify the unlabeled majority, is gaining popularity for its time and cost efficiency, particularly with high-dimensional and large-scale multi-view data. Existing graph-based methods for multi-view semi-supervised classification still have potential for improvement in further enhancing classification accuracy. Since the deep random walk has demonstrated promising performance across diverse fields and shows potential for semi-supervised classification, this paper proposes a deep random walk inspired multi-view graph convolutional network model for semi-supervised classification tasks that builds signal propagation between connected vertices of the graph based on transfer probabilities. The learned representation matrices from different views are fused by an aggregator to learn appropriate weights, which are then normalized for label prediction. The proposed method partially reduces overfitting, and comprehensive experiments show it delivers impressive performance compared to other state-of-the-art algorithms, with classification accuracy improving by more than 5% on certain test datasets.
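
The propagation based on transfer probabilities mentioned above is, at its simplest, a random walk over the graph: row-normalize the adjacency matrix into transition probabilities and apply it repeatedly to the node features. The NumPy sketch below is an illustration of that generic mechanism, not the DRWM-GCN code.

```python
# Sketch of random-walk style propagation over a graph via transfer probabilities.
import numpy as np

def random_walk_propagate(adj, features, steps=3):
    """Row-normalize the adjacency into transfer probabilities P = D^{-1} A
    and apply them repeatedly to the node features."""
    deg = adj.sum(axis=1, keepdims=True)
    transfer = adj / np.maximum(deg, 1e-12)
    out = features
    for _ in range(steps):
        out = transfer @ out
    return out

adj = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
x = np.eye(3)
print(random_walk_propagate(adj, x, steps=2))
```
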
... By utilizing graph regularization and the learned common representation, labels from fixed labeled data are propagated to unlabeled data. Xu et al. [73] studied a multi-view weakly labeled learning approach that effectively utilizes weakly labeled multi-view data. Their method generates pseudo-label vectors and incorporates different strategies and iterations to leverage the weak labels. ...
Article
Full-text available
Multi-view learning is an emerging field that aims to enhance learning performance by leveraging multiple views or sources of data across various domains. By integrating information from diverse perspectives, multi-view learning methods effectively enhance accuracy, robustness, and generalization capabilities. The existing research on multi-view learning can be broadly categorized into four groups in the survey based on the tasks it encompasses, namely multi-view classification approaches, multi-view semi-supervised classification approaches, multi-view clustering approaches, and multi-view semi-supervised clustering approaches. Despite its potential advantages, multi-view learning poses several challenges, including view inconsistency, view complementarity, optimal view fusion, the curse of dimensionality, scalability, limited labels, and generalization across domains. Nevertheless, these challenges have not discouraged researchers from exploring the potential of multiview learning. It continues to be an active and promising research area, capable of effectively addressing complex real-world problems.
... 2) utilizing unlabeled data, such as semi-supervised segmentation. Semi-supervised segmentation enhances segmentation performance by mining information in unlabeled data to address limited annotation challenges [11]- [13]. ...
Preprint
Full-text available
Deep learning-based medical image segmentation helps assist diagnosis and accelerate the treatment process, while the model training usually requires large-scale dense annotation datasets. Weakly semi-supervised medical image segmentation is an essential application because it only requires a small amount of scribbles and a large number of unlabeled data to train the model, which greatly reduces the clinician's effort to fully annotate images. To handle the inadequate supervisory information challenge in weakly semi-supervised segmentation (WSSS), a SuperPixel-Propagated Pseudo-label (SP³) learning method is proposed, using the structural information contained in superpixels as supplemental information. Specifically, the annotation of scribbles is propagated to superpixels and thus obtains a dense annotation for supervised training. Since the quality of pseudo-labels is limited by the low-quality annotation, the beneficial superpixels selected by dynamic thresholding are used to refine pseudo-labels. Furthermore, aiming to alleviate the negative impact of noise in pseudo-labels, superpixel-level uncertainty is incorporated to guide the pseudo-label supervision for stable learning. Our method achieves state-of-the-art performance on both tumor and organ segmentation datasets under the WSSS setting, using only 3% of the annotation workload compared to fully supervised methods and attaining approximately 80% Dice score. Additionally, our method outperforms eight weakly and semi-supervised methods under both weakly supervised and semi-supervised settings. Results of extensive experiments validate the effectiveness and annotation efficiency of our weakly semi-supervised segmentation, which can assist clinicians in achieving automated segmentation for organs or tumors quickly and ultimately benefit patients.
... In autonomous driving [49], "views" can also refer to video frames of the same object captured by cameras at different positions. While MRL has proven effective in practical applications, such as clustering [7,14], classification [36,46], and face synthesis [40,42,43], understanding the underlying relationships between different view representations remains an open question. In general, multi-view comprehensive representations consist of consistent and specific representations [19], which combine in a certain pattern to form the multi-view comprehensive representation. ...
Preprint
Multi-view (or -modality) representation learning aims to understand the relationships between different view representations. Existing methods disentangle multi-view representations into consistent and view-specific representations by introducing strong inductive biases, which can limit their generalization ability. In this paper, we propose a novel multi-view representation disentangling method that aims to go beyond inductive biases, ensuring both interpretability and generalizability of the resulting representations. Our method is based on the observation that discovering multi-view consistency in advance can determine the disentangling information boundary, leading to a decoupled learning objective. We also found that the consistency can be easily extracted by maximizing the transformation invariance and clustering consistency between views. These observations drive us to propose a two-stage framework. In the first stage, we obtain multi-view consistency by training a consistent encoder to produce semantically-consistent representations across views as well as their corresponding pseudo-labels. In the second stage, we disentangle specificity from comprehensive representations by minimizing the upper bound of mutual information between consistent and comprehensive representations. Finally, we reconstruct the original data by concatenating pseudo-labels and view-specific representations. Our experiments on four multi-view datasets demonstrate that our proposed method outperforms 12 comparison methods in terms of clustering and classification performance. The visualization results also show that the extracted consistency and specificity are compact and interpretable. Our code can be found at https://github.com/Guanzhou-Ke/DMRIB.
... The regularization hyperparameters or some other types of hyperparameters are often involved in previous MVC algorithms to adjust the influences of different terms (or components) [2], where dataset-specific fine-tuning is frequently required to seek the proper values of these hyperparameters in a probably extensive trial-and-error manner. However, unlike supervised or semi-supervised learning [14], [15], in unsupervised situations it may be arguable whether the ground-truth labels can be used for guiding the fine-tuning process. Without the fine-tuning guided by partial or even all ground-truth labels, the practicality of these MVC algorithms may be significantly weakened. ...
Article
Full-text available
Despite significant progress, there remain three limitations to the previous multi-view clustering algorithms. First, they often suffer from high computational complexity, restricting their feasibility for large-scale datasets. Second, they typically fuse multi-view information via one-stage fusion, neglecting the possibilities in multi-stage fusions. Third, dataset-specific hyperparameter-tuning is frequently required, further undermining their practicability. In light of this, we propose a fast multi-view clustering via ensembles (FastMICE) approach. Particularly, the concept of random view groups is presented to capture the versatile view-wise relationships, through which the hybrid early-late fusion strategy is designed to enable efficient multi-stage fusions. With multiple views extended to many view groups, three levels of diversity (w.r.t. features, anchors, and neighbors, respectively) are jointly leveraged for constructing the view-sharing bipartite graphs in the early-stage fusion. Then, a set of diversified base clusterings for different view groups are obtained via fast graph partitioning, which are further formulated into a unified bipartite graph for final clustering in the late-stage fusion. Notably, FastMICE has almost linear time and space complexity, and is free of dataset-specific tuning. Experiments on 22 multi-view datasets demonstrate its advantages in scalability (for extremely large datasets), superiority (in clustering performance), and simplicity (to be applied) over the state-of-the-art. Code available: https://github.com/huangdonghere/FastMICE.
Article
Multiview semi-supervised learning is a popular research area in which people utilize cross-view knowledge to overcome the limitation of labeled data in semi-supervised learning. Existing methods mainly utilize deep neural networks, which are relatively time-consuming due to the complex network structure and backpropagation iterations. In this article, co-training broad Siamese-like network (Co-BSLN) is proposed for coupled-view semi-supervised classification. Co-BSLN learns knowledge from two-view data and can be used for multiview data with the help of feature concatenation. Different from existing deep learning methods, Co-BSLN utilizes a simple shallow network based on broad learning system (BLS) to simplify the network structure and reduce training time. It replaces backpropagation iterations with a direct pseudo-inverse calculation to further reduce time consumption. In Co-BSLN, different views of the same instance are considered as positive pairs due to cross-view consistency. Predictions of views in positive pairs are used to guide the training of each other through a direct logit vector mapping. Such a design is fast and effectively utilizes cross-view consistency to improve the accuracy of semi-supervised learning. Evaluation results demonstrate that Co-BSLN is able to improve accuracy and reduce training time on popular datasets.
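
The "direct pseudo-inverse calculation" that replaces backpropagation in broad-learning-style networks amounts to solving a ridge-regularized least-squares problem for the output weights in closed form. The NumPy sketch below illustrates that generic step under the assumption that hidden-layer features H and one-hot labels Y are given; it is not the Co-BSLN code.

```python
# Sketch of the closed-form (ridge-regularized) pseudo-inverse solution used
# in place of backpropagation by broad-learning-style networks.
import numpy as np

def pseudo_inverse_weights(H, Y, lam=1e-3):
    """Output weights: W = (H^T H + lam*I)^{-1} H^T Y."""
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ Y)

H = np.random.randn(100, 20)                 # hidden-layer features (assumed given)
Y = np.eye(3)[np.random.randint(0, 3, 100)]  # one-hot labels
W = pseudo_inverse_weights(H, Y)
pred = (H @ W).argmax(axis=1)
```
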
Article
Since hand-print recognition, i.e., palmprint, finger-knuckle-print (FKP), and hand-vein recognition, has significant superiority in user convenience and hygiene, it has attracted great enthusiasm from researchers. Seeking to handle the long-standing interference factors, i.e., noise, rotation, and shadow, in hand-print images, multi-view hand-print representation has been proposed to enhance the feature expression by exploiting multiple characteristics from diverse views. However, the existing methods usually ignore the high-order correlations between different views or fuse very limited types of features. To tackle these issues, in this paper, we present a novel tensorized multi-view low-rank approximation based robust hand-print recognition method (TMLA_RHR), which can dexterously manipulate the multi-view hand-print features to produce a high-compact feature representation. To achieve this goal, we formulate TMLA_RHR with two key components, i.e., an aligned structure regression loss and a tensorized low-rank approximation, in a joint learning model. Specifically, we treat the low-rank representation matrices of different views as a tensor, which is regularized with a low-rank constraint. It models the correlation information across different views and reduces the redundancy of the learned sub-space representations. Experimental results on eight real-world hand-print databases prove the superiority of the proposed method in comparison with other state-of-the-art related works.
Article
Semi-supervised multi-view learning is a remarkable but challenging task. Existing semi-supervised multi-view classification (SMVC) approaches mainly focus on performance improvement while ignoring decision reliability, which limits their deployment in safety-critical applications. Although several trusted multi-view classification methods are proposed recently, they rely on manual annotations. Therefore, this work emphasizes trusted multi-view classification learning under semi-supervised conditions. Different from existing SMVC methods, this work jointly models class probabilities and uncertainties based on evidential deep learning to formulate view-specific opinions. Moreover, unlike previous works that explore cross-view consistency in a single schema, this work proposes a multi-level consistency constraint. Specifically, we explore instance-level consistency on the view-specific representation space and category-level consistency on opinions from multiple views. Our proposed trusted graph-based contrastive loss nicely establishes the relationship between joint opinions and view-specific representations, which enables view-specific representations to enjoy a good manifold to improve classification performance. Overall, the proposed approach provides reliable and superior semi-supervised multiview classification decisions. Extensive experiments demonstrate the effectiveness, reliability and robustness of the proposed model.
Chapter
Multi-view discriminant analysis (MvDA) is an effective supervised multi-view learning method. However, in practical multi-view learning application scenarios, the sample label information from some views may be unreliable. In the MvDA method, the information of all views is considered to have equal reliability. The data information in views containing unreliable label information may make it difficult for MvDA to learn accurate common features among data from different views. Therefore, in this paper, a weighted consistent multi-view discriminant analysis (WMvDA-VC) is proposed for the multi-view learning task in an unreliable labeling environment. By considering the data characteristics under different views, the information from different views is assigned different reliability weights. Using the weighted scatter matrices, multiple projection matrices are learned. Then the data from different views are projected into the same common projection space. In the common subspace, dissimilar samples in different views are dispersed as much as possible and similar samples are aggregated as much as possible. The impact of data from views containing unreliable labels on the multi-view learning performance is reduced by adjusting the weights of the views with unreliable labels. The method is evaluated on four widely used datasets. The experimental results show that the proposed WMvDA-VC exhibits excellent performance in the classification and recognition task in an unreliably labeled environment.
Article
Deep clustering incorporates embedding into clustering in order to find a lower-dimensional space suitable for clustering tasks. Conventional deep clustering methods aim to obtain a single global embedding subspace (aka latent space) for all the data clusters. In contrast, in this article, we propose a deep multirepresentation learning (DML) framework for data clustering whereby each difficult-to-cluster data group is associated with its own distinct optimized latent space and all the easy-to-cluster data groups are associated with a general common latent space. Autoencoders (AEs) are employed for generating cluster-specific and general latent spaces. To specialize each AE in its associated data cluster(s), we propose a novel and effective loss function which consists of weighted reconstruction and clustering losses of the data points, where higher weights are assigned to the samples more probable to belong to the corresponding cluster(s). Experimental results on benchmark datasets demonstrate that the proposed DML framework and loss function outperform state-of-the-art clustering approaches. In addition, the results show that the DML method significantly outperforms the SOTA on imbalanced datasets as a result of assigning an individual latent space to the difficult clusters.
Article
Enhancing the accuracy of dense classification with limited labeled data and abundant unlabeled data, known as semi-supervised semantic segmentation, is an essential task in vision comprehension. Due to the lack of annotation in unlabeled data, additional pseudo-supervised signals, typically pseudo-labeling, are required to improve the performance. Although effective, these methods fail to consider the internal representation of neural networks and the inherent class-imbalance in dense samples. In this work, we propose an information transfer theory, which establishes a theoretical relationship between shallow and deep representations. We further apply this theory at both the semantic and pixel levels, referred to as IIT-SP, to align different types of information. The proposed IIT-SP optimizes shallow representations to match the target representation required for segmentation. This limits the upper bound of deep representations to enhance segmentation performance. We also propose a momentum-based Cluster-State bar that updates class status online, along with a HardClassMix augmentation and a loss weighting technique to address class imbalance issues based on it. The effectiveness of the proposed method is demonstrated through comparative experiments on PASCAL VOC and Cityscapes benchmarks, where the proposed IIT-SP achieves state-of-the-art performance, reaching mIoU of 68.34% with only 2% labeled data on PASCAL VOC and mIoU of 64.20% with only 12.5% labeled data on Cityscapes.
Article
Multi-view data describes an image sample with different modalities of features, thus providing a more comprehensive description of the data. Its three basic characteristics, i.e., consensus, complementarity, and redundancy, determine its performance in computer vision tasks. In this paper, we effectively exploit the above three characteristics to propose a deep learning scheme with joint shared-and-specific information (JSSI) for multi-view clustering. Aiming at facilitating consensus, JSSI extracts shared information of multi-view data via an adversarial similarity constraint, which is realized by classification and discrimination interactions. Aiming at reducing redundancy, JSSI separates out view-specific features and prevents them from interfering with the shared features via a difference constraint. Aiming at ensuring complementarity, JSSI aligns the shared features and then concatenates them with the specific features. We examine the effectiveness of JSSI with multi-view clustering on real-world datasets, such as faces and indoor scenes. Extensive experiments and comparisons show that JSSI outperforms other state-of-the-art methods on most of these datasets.
Article
The development of information gathering and extraction technology has led to the popularity of multi-view data, which enables samples to be seen from numerous perspectives. Multi-view clustering (MVC), which groups data samples by leveraging complementary and consensual information from several views, is gaining popularity. Despite the rapid evolution of MVC approaches, there has yet to be a study that provides a full MVC roadmap for both stimulating technical improvements and orienting research newbies to MVC. In this article, we review recent MVC techniques with the purpose of exhibiting the concepts of popular methodologies and their advancements. This survey not only serves as a unique MVC comprehensive knowledge for researchers but also has the potential to spark new ideas in MVC research. We summarise a large variety of current MVC approaches based on two technical mechanisms: heuristic-based multi-view clustering (HMVC) and neural network-based multi-view clustering (NNMVC). We end with four technological approaches within the category of HMVC: nonnegative matrix factorisation, graph learning, latent representation learning, and tensor learning. Deep representation learning and deep graph learning are two technical methods that we demonstrate in NNMVC. We also show 15 publicly available multi-view datasets and examine how representative MVC approaches perform on them. In addition, this study identifies the potential research directions that may require further investigation in order to enhance the further development of MVC.
Article
Full-text available
Fully capturing valid complementary information in multi-view data enhances the connection between similar data points and weakens the correlation between different data point categories. In this paper, we propose a new multi-view clustering via dual-norm and Hilbert-Schmidt independence criterion (HSIC) induction (MCDHSIC) approach, which can enhance the complementarity, reduce the redundancy between multi-view representations, and improve the accuracy of the clustering results. This model uses the HSIC as the diversity regularization term to capture the nonlinear relationship between different views. In addition, l1-norm and Frobenius norm constraints are imposed to obtain a subspace representation matrix with inter-class sparsity and intra-class consistency. Moreover, we also design an effective approach to optimize the proposed model and theoretically analyze the convergence of the MCDHSIC method. The results of extensive experiments conducted on five challenging data sets show that the proposed method achieves highly competitive performance compared with several other state-of-the-art multi-view clustering methods.
Article
Recent semi-supervised learning (SSL) algorithms such as FixMatch achieve state-of-the-art performance by exploiting consistency regularization and entropy minimization techniques. However, many consistency-based SSL algorithms extract pseudo-labels from unlabeled data through a fixed threshold and ignore the different learning progress of each category, which makes the easy-to-learn categories have more examples contributing to the loss, resulting in a class-imbalance problem and affecting training efficiency. In order to improve the training reliability, we propose adaptive weighted losses (AWL). Through the evaluation of the class-wise learning progress, the loss contribution of the pseudo-labeled data of each category is continuously and dynamically adjusted during the learning process, and the pseudo-label discrimination ability of the model can be steadily improved. Moreover, to improve the training efficiency, we propose a bidirectional distribution approximation (DA) method, which introduces the consistency information of the predictions under the threshold into the loss calculation, and significantly improves the model convergence speed. Through the combination of AWL and DA, our method surpasses the performance of other algorithms on multiple benchmarks with a faster convergence efficiency, especially in the case of labeled data extremely limited. For example, AWL&DA achieves 95.29% test accuracy on the CIFAR-10-40-labels experiment and 92.56% accuracy on a faster experiment setting with only 2^18 iterations.
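
One way to make the class-wise weighting concrete is to estimate per-class learning progress from the counts of confident pseudo-labels and then down-weight the classes that already dominate. The sketch below follows that idea; the specific weighting formula and names are assumptions for illustration, not the AWL formulation.

```python
# Illustrative sketch of class-wise adaptive weights for pseudo-labeled losses.
import torch
import torch.nn.functional as F

def adaptive_class_weights(pseudo_targets, mask, n_classes):
    # per-class counts of confident pseudo-labels as a proxy for learning progress
    counts = torch.bincount(pseudo_targets[mask.bool()], minlength=n_classes).float()
    progress = counts / counts.max().clamp(min=1.0)
    return 1.0 - 0.5 * progress          # easy (well-learned) classes get smaller weights

def weighted_unlabeled_loss(logits, pseudo_targets, mask, n_classes):
    # mask is a float tensor marking confident pseudo-labels
    w = adaptive_class_weights(pseudo_targets, mask, n_classes)
    per_sample = F.cross_entropy(logits, pseudo_targets, reduction='none')
    return (w[pseudo_targets] * mask * per_sample).mean()
```
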
Article
Data augmentation via randomly combining training instances and interpolating the corresponding labels has shown impressive gains in image classification. However, model attention regions are not necessarily meaningful in class semantics, especially for the case of limited supervision. In this paper, we present a semi-supervised classification model based on Class-Ambiguous Data with Attention Regularization, which is referred to as CADAR. Specifically, we adopt a Random Regional Interpolation (RRI) module to construct complex and effective class-ambiguous data, such that the model behavior can be regularized around decision boundaries. By aggregating the parameters of a classification network over training epochs to produce more reliable predictions on unlabeled data, RRI can also be applied to them as well as labeled data. Further, the classifier is enforced to apply consistent attention on the original and constructed data. This is important for inducing the model to learn discriminative features from the class-related regions. The experiment results demonstrate that CADAR significantly benefits from the constructed data and attention regularization, and thus achieves superior performance across multiple standard benchmarks and different amounts of labeled data.
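
The regional interpolation used to construct class-ambiguous data is close in spirit to CutMix-style augmentation: a random region of one image batch is pasted into another and the labels are mixed in proportion to the pasted area. The sketch below illustrates that generic mechanism; the paper's exact region-sampling scheme may differ.

```python
# CutMix-style regional interpolation of a batch (illustrative, not the RRI module).
import numpy as np
import torch

def regional_interpolate(x, y, alpha=1.0):
    """Paste a random region of a shuffled batch into x and mix the one-hot
    labels in proportion to the pasted area."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    h, w = x.shape[-2:]
    rh, rw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    top, left = np.random.randint(0, h - rh + 1), np.random.randint(0, w - rw + 1)
    mixed = x.clone()
    mixed[..., top:top + rh, left:left + rw] = x[perm][..., top:top + rh, left:left + rw]
    area = (rh * rw) / (h * w)
    mixed_labels = (1 - area) * y + area * y[perm]
    return mixed, mixed_labels

imgs = torch.randn(8, 3, 32, 32)
labels = torch.eye(10)[torch.randint(0, 10, (8,))]
out_imgs, out_labels = regional_interpolate(imgs, labels)
```
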
Conference Paper
Full-text available
This work focuses on the unsupervised scene adaptation problem of learning from both labeled source data and unlabeled target data. Existing approaches focus on minimizing the inter-domain gap between the source and target domains. However, the intra-domain knowledge and inherent uncertainty learned by the network are under-explored. In this paper, we propose an orthogonal method, called memory regularization in vivo, to exploit the intra-domain knowledge and regularize the model training. Specifically, we refer to the segmentation model itself as the memory module, and minimize the discrepancy between the two classifiers, i.e., the primary classifier and the auxiliary classifier, to reduce the prediction inconsistency. Without extra parameters, the proposed method is complementary to most existing domain adaptation methods and could generally improve the performance of existing methods. Albeit simple, we verify the effectiveness of memory regularization on two synthetic-to-real benchmarks: GTA5 → Cityscapes and SYNTHIA → Cityscapes, yielding +11.1% and +11.3% mIoU improvement over the baseline model, respectively. Besides, a similar +12.0% mIoU improvement is observed on the cross-city benchmark: Cityscapes → Oxford RobotCar.
Article
Full-text available
As data can be acquired in an ever-increasing number of ways, multi-view data is becoming more and more available. Considering the high price of labeling data in many machine learning applications, we focus on the multi-view semi-supervised classification problem. To address this problem, in this paper, we propose a method called joint consensus and diversity for multi-view semi-supervised classification, which learns a common label matrix for all training samples and view-specific classifiers simultaneously. A novel classification loss named probabilistic square hinge loss is proposed, which avoids the incorrect penalization problem and characterizes the contribution of training samples according to their uncertainty. Power mean is introduced to incorporate the losses of different views, which contains the auto-weighted strategy as a special case and distinguishes the importance of various views. To solve the non-convex minimization problem, we prove that its solution can be obtained from another problem with introduced variables. An efficient algorithm with proven convergence is developed for optimization. Extensive experimental results on nine datasets demonstrate the effectiveness of the proposed algorithm.
Article
Full-text available
UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP as described has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.
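
For readers who want to try the method described above, the reference implementation is distributed as the umap-learn package; a typical invocation for 2-D visualization looks like the following (parameter values shown are just common defaults, not recommendations from the paper).

```python
# Typical usage of UMAP for 2-D visualization via the umap-learn package.
import umap
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(X)      # (n_samples, 2) low-dimensional coordinates
print(embedding.shape)
```
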
Article
Full-text available
With the advent of multi-view data, multi-view learning has become an important research direction in machine learning and image processing. Considering the difficulty of obtaining labeled data in many machine learning applications, we focus on the multi-view semi-supervised classification problem. In this paper, we propose an algorithm named Multi-View Semi-Supervised Classification via Adaptive Regression (MVAR) to address this problem. Specifically, regression based loss functions with ℓ2,1 matrix norm are adopted for each view and the final objective function is formulated as the linear weighted combination of all the loss functions. An efficient algorithm with proved convergence is developed to solve the non-smooth ℓ2,1-norm minimization problem. Regressing to class labels directly makes the proposed algorithm efficient in calculation and can be applied to large-scale datasets. The adaptively optimized weight coefficients balance the contributions of different views automatically, which makes the performance robust against the existence of low-quality views. With the learned projection matrices and bias vectors, predictions for out-of-sample data can be easily made. To validate the effectiveness of MVAR, comparisons are made with some benchmark methods on real-world datasets and in the scene classification scenario as well. The experimental results demonstrate the effectiveness of our proposed algorithm.
Article
Full-text available
Model distillation is an effective and widely used technique to transfer knowledge from a teacher to a student network. The typical application is to transfer from a powerful large network or ensemble to a small network, that is better suited to low-memory or fast execution requirements. In this paper, we present a deep mutual learning (DML) strategy where, rather than one way transfer between a static pre-defined teacher and a student, an ensemble of students learn collaboratively and teach each other throughout the training process. Our experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on CIFAR-100 recognition and Market-1501 person re-identification benchmarks. Surprisingly, it is revealed that no prior powerful teacher network is necessary -- mutual learning of a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.
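
The mutual learning objective can be written down compactly for a pair of students: each minimizes its supervised cross-entropy plus a KL term that mimics the other student's predictions. The sketch below is a minimal two-student version of that idea, not the authors' code.

```python
# Sketch of the deep mutual learning objective for two student networks.
import torch.nn.functional as F

def mutual_learning_losses(logits1, logits2, targets, kl_weight=1.0):
    p1, p2 = F.softmax(logits1, dim=1), F.softmax(logits2, dim=1)
    # each student matches the other's (detached) prediction distribution
    loss1 = F.cross_entropy(logits1, targets) + kl_weight * F.kl_div(
        F.log_softmax(logits1, dim=1), p2.detach(), reduction='batchmean')
    loss2 = F.cross_entropy(logits2, targets) + kl_weight * F.kl_div(
        F.log_softmax(logits2, dim=1), p1.detach(), reduction='batchmean')
    return loss1, loss2
```
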
Conference Paper
Full-text available
This paper studies the problem of RGB-D object recognition. Inspired by the great success of deep convolutional neural networks (DCNN) in AI, researchers have tried to apply it to improve the performance of RGB-D object recognition. However, DCNN always requires a large-scale annotated dataset to supervise its training. Manually labeling such a large RGB-D dataset is expensive and time consuming, which prevents DCNN from quickly promoting this research area. To address this problem, we propose a semi-supervised multimodal deep learning framework to train DCNN effectively based on very limited labeled data and massive unlabeled data. The core of our framework is a novel diversity preserving co-training algorithm, which can successfully guide DCNN to learn from the unlabeled RGB-D data by making full use of the complementary cues of the RGB and depth data in object representation. Experiments on the benchmark RGB-D dataset demonstrate that, with only 5% labeled training data, our approach achieves competitive performance for object recognition compared with those state-of-the-art results reported by fully-supervised methods.
Article
Full-text available
Multimedia data are usually represented by multiple features. In this paper, we propose a new algorithm, namely Multi-feature Learning via Hierarchical Regression for multimedia semantics understanding, where two issues are considered. First, labeling large amount of training data is labor-intensive. It is meaningful to effectively leverage unlabeled data to facilitate multimedia semantics understanding. Second, given that multimedia data can be represented by multiple features, it is advantageous to develop an algorithm which combines evidence obtained from different features to infer reliable multimedia semantic concept classifiers. We design a hierarchical regression model to exploit the information derived from each type of feature, which is then collaboratively fused to obtain a multimedia semantic concept classifier. Both label information and data distribution of different features representing multimedia data are considered. The algorithm can be applied to a wide range of multimedia applications and experiments are conducted on video data for video concept annotation and action recognition. Using Trecvid and CareMedia video datasets, the experimental results show that it is beneficial to combine multiple features. The performance of the proposed algorithm is remarkable when only a small amount of labeled training data are available.
Conference Paper
Full-text available
We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which makes it possible to incorporate unlabeled data into standard supervised learning. This regularizer can be applied to any model of posterior probabilities. Our approach provides a new motivation for some existing semi-supervised learning algorithms which are particular or limiting instances of minimum entropy regularization. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can be far superior to manifold learning in high dimension spaces, and also when the manifolds are generated by moving examples along the discriminating directions.
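
The regularizer itself is simple to state: the usual supervised loss plus a penalty on the Shannon entropy of the model's predictions on unlabeled data, which pushes decision boundaries away from dense regions. A minimal sketch, with the weighting coefficient as a free hyperparameter:

```python
# Minimal sketch of minimum entropy regularization for semi-supervised training.
import torch
import torch.nn.functional as F

def entropy_regularized_loss(labeled_logits, labels, unlabeled_logits, lam=0.1):
    sup = F.cross_entropy(labeled_logits, labels)
    p = F.softmax(unlabeled_logits, dim=1)
    # Shannon entropy of predictions on unlabeled data
    entropy = -(p * torch.log(p.clamp(min=1e-12))).sum(dim=1).mean()
    return sup + lam * entropy
```
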
Conference Paper
Full-text available
This paper introduces a web image dataset created by NUS's Lab for Media Search. The dataset includes: (1) 269,648 images and the associated tags from Flickr, with a total of 5,018 unique tags; (2) six types of low-level features extracted from these images, including 64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments extracted over 5x5 fixed grid partitions, and 500-D bag of words based on SIFT descriptions; and (3) ground-truth for 81 concepts that can be used for evaluation. Based on this dataset, we highlight characteristics of Web image collections and identify four research issues on web image annotation and retrieval. We also provide the baseline results for web image annotation by learning from the tags using the traditional k-NN algorithm. The benchmark results indicate that it is possible to learn effective models from sufficiently large image dataset to facilitate general image retrieval.
Article
Full-text available
Co-training is a multiview semi-supervised learning algorithm to learn from both labeled and unlabeled data, which iteratively adopts a classifier trained on one view to teach the other view using some confident predictions given on unlabeled examples. However, as it does not examine the reliability of the labels provided by classifiers on either view, co-training might be problematic. Even very few inaccurately labeled examples can deteriorate the performance of learned classifiers to a large extent. In this paper, a new method named robust co-training is proposed, which integrates canonical correlation analysis (CCA) to inspect the predictions of co-training on those unlabeled training examples. CCA is applied to obtain a low-dimensional and closely correlated representation of the original multiview data. Based on this representation the similarities between an unlabeled example and the original labeled examples are determined. Only those examples whose predicted labels are consistent with the outcome of CCA examination are eligible to augment the original labeled data. The performance of robust co-training is evaluated on several different classification problems where encouraging experimental results are observed.
Article
Webpage classification has attracted a lot of research interest. Webpage data is often multi-view and high-dimensional, and the webpage classification application is usually semi-supervised. Due to these characteristics, using semi-supervised multi-view feature learning (SMFL) technique to deal with the webpage classification problem has recently received much attention. However, there still exists room for improvement for this kind of feature learning technique. How to effectively utilize the correlation information among multi-view of webpage data is an important research topic. Correlation analysis on multi-view data can facilitate extraction of the complementary information. In this paper, we propose a novel SMFL approach, named semi-supervised multi-view correlation feature learning (SMCFL), for webpage classification. SMCFL seeks for a discriminant common space by learning a multi-view shared transformation in a semi-supervised manner. In the discriminant space, the correlation between intra-class samples is maximized, and the correlation between inter-class samples and the global correlation among both labeled and unlabeled samples are minimized simultaneously. We transform the matrix-variable based nonconvex objective function of SMCFL into a convex quadratic programming problem with one real variable, and can achieve a global optimal solution. Experiments on widely used datasets demonstrate the effectiveness and efficiency of the proposed approach.
Article
In this paper, we address the problem of large-scale multi-view spectral clustering. In many real-world applications, data can be represented in various heterogeneous features or views. Different views often provide different aspects of information that are complementary to each other. Several previous methods of clustering have demonstrated that better accuracy can be achieved using integrated information of all the views than just using each view individually. One important class of such methods is multi-view spectral clustering, which is based on graph Laplacian. However, existing methods are not applicable to large-scale problem for their high computational complexity. To this end, we propose a novel large-scale multi-view spectral clustering approach based on the bipartite graph. Our method uses local manifold fusion to integrate heterogeneous features. To improve efficiency, we approximate the similarity graphs using bipartite graphs. Furthermore, we show that our method can be easily extended to handle the out-of-sample problem. Extensive experimental results on five benchmark datasets demonstrate the effectiveness and efficiency of the proposed method, where our method runs up to nearly 3000 times faster than the state-of-the-art methods.
Article
Contextual information has been shown to be powerful for semantic segmentation. This work proposes a novel Context-based Tandem Network (CTNet) by interactively exploring the spatial contextual information and the channel contextual information, which can discover the semantic context for semantic segmentation. Specifically, the Spatial Contextual Module (SCM) is leveraged to uncover the spatial contextual dependency between pixels by exploring the correlation between pixels and categories. Meanwhile, the Channel Contextual Module (CCM) is introduced to learn the semantic features including the semantic feature maps and class-specific features by modeling the long-term semantic dependence between channels. The learned semantic features are utilized as the prior knowledge to guide the learning of SCM, which can make SCM obtain more accurate long-range spatial dependency. Finally, to further improve the performance of the learned representations for semantic segmentation, the results of the two context modules are adaptively integrated to achieve better results. Extensive experiments are conducted on four widely-used datasets, i.e., PASCAL-Context, Cityscapes, ADE20K and PASCAL VOC2012. The results demonstrate the superior performance of the proposed CTNet by comparison with several state-of-the-art methods. The source code and models are available at https://github.com/syp2ysy/CTNet .
Article
In this paper, we delve into the challenging problem in multi-view learning, namely unsupervised multi-view representation learning, the goal of which is to effectively integrate information from multiple views and learn the unified feature representation with comprehensive information in an unsupervised manner. Despite the progress attained in recent years, it is still a challenging issue since the correlations across multiple views are complex and difficult to model during the learning process, especially in the absence of label information. To address this problem, we introduce a novel method, termed Collaborative Unsupervised Multi-view Representation Learning (CUMRL), which benefits from the high-order view correlations of multi-view data by introducing a collaborative learning strategy. Specifically, the low-rank tensor constraint is employed and plays the role of a bridge, which links the view-specific compact learning and unified representation learning in CUMRL. Experiments demonstrate the effectiveness and competitiveness of the multi-view representation achieved by the proposed method for different learning tasks, compared to several state-of-the-art methods.
Article
In real-world applications, complete or incomplete multi-view data are common, which leads to the problem of generalized multi-view clustering. Recently, researchers have attempted to learn the latent representation in a common subspace from heterogeneous data, which usually suffers from feature degeneration. Moreover, there are limited efforts on simultaneously revealing the underlying subspace structure and exploring the complementary information from incomplete multiple views. In this paper, we introduce a novel Generalized Multi-view Collaborative Subspace Clustering (GMCSC) framework to address these issues, in which the consensus subspace structure of all views and the embedding subspaces for each view are jointly learned to benefit each other. Specifically, we develop a novel collaborative subspace learning strategy based on self-representation learning, which provides a brand-new way of pursuing the complete subspace structure directly from multi-view data. Furthermore, we explore complementary information by enforcing consistency across different views while preserving the view-specific information of each view, which alleviates feature degeneration and makes the use of a consensus representation for multiple views more reasonable. Experimental results on six benchmark datasets demonstrate that the proposed method significantly outperforms state-of-the-art algorithms.
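A minimal single-view sketch of the self-representation principle that such collaborative subspace learning builds on, using the ridge-regularized closed form; GMCSC's multi-view collaboration and incomplete-view handling are not reproduced, and the regularization weight `lam` is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def self_representation(X, lam=0.1):
    """Express each sample as a combination of the others:
    minimize ||X - X C||_F^2 + lam ||C||_F^2, with columns of X as samples."""
    n = X.shape[1]
    C = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ X)  # closed-form solution
    np.fill_diagonal(C, 0.0)                                  # forbid trivial self-loops
    return C

def subspace_clustering(X, n_clusters, lam=0.1):
    C = self_representation(X, lam)
    W = 0.5 * (np.abs(C) + np.abs(C).T)                       # symmetric affinity
    return SpectralClustering(n_clusters, affinity="precomputed").fit_predict(W)
```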
Article
Conventional feature selection methods select the same feature subset for all classes, which means that the selected features might work better for some classes than others. Towards this end, this paper proposes a new semi-supervised local feature selection method (S2LFS) that allows different feature subsets to be selected for different classes. According to this method, class-specific feature subsets are selected by learning the importance of features for each class separately. In particular, the class labels of all available data are jointly learned under a consistency constraint over the labeled data, which enables the proposed method to select the most discriminative features. Experiments on six data sets demonstrate the effectiveness of the proposed method compared to several popular feature selection methods.
Article
Multi-view semi-supervised classification (MSSC) focuses on exploring information from multiple views of labeled and unlabeled data to boost classification performance. However, most existing methods build models for each view individually; therefore, the potential relationships between different views cannot be fully explored. Additionally, they either focus on correlations for consistency or maximize independence for complementarity, although consistency and complementarity are equally important for multi-view learning. Therefore, this work proposes a novel Graph-based Remodeling Network for MSSC (GRNet), which can explore the potential relationships among multiple views and adaptively balance consistency and complementarity. Specifically, the model integrates multiple views and then generates reformed pseudo views by an attention-based ensemble learning strategy. Moreover, graph regularization is introduced to exploit the information in unlabeled data. Extensive experiments on several datasets demonstrate the effectiveness and efficiency of the proposed method.
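The graph-regularization idea used here to exploit unlabeled data can be sketched with closed-form label propagation over a kNN graph; this illustrates only the generic mechanism, not GRNet itself, and the neighbourhood size, bandwidth and smoothing weight are assumptions.

```python
import numpy as np

def knn_affinity(X, k=10, sigma=1.0):
    """Gaussian affinity restricted to each sample's k nearest neighbours."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    far = np.argsort(d2, axis=1)[:, k + 1:]       # keep self + k nearest, zero the rest
    np.put_along_axis(W, far, 0.0, axis=1)
    return np.maximum(W, W.T)                     # symmetrize

def propagate_labels(X, Y, alpha=1.0, k=10):
    """Y: one-hot rows for labeled samples, zero rows for unlabeled ones.
    Minimizes ||F - Y||_F^2 + alpha * tr(F^T L F), L the graph Laplacian."""
    W = knn_affinity(X, k)
    L = np.diag(W.sum(axis=1)) - W
    F = np.linalg.solve(np.eye(len(X)) + alpha * L, Y)
    return F.argmax(axis=1)
```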
Article
Multi-view clustering has attracted increasing attention recently by utilizing information from multiple views. However, existing multi-view clustering methods either have high computation and space complexities or lack representation capability. To address these issues, we propose deep embedded multi-view clustering with collaborative training (DEMVC) in this paper. Firstly, the embedded representations of multiple views are learned individually by deep autoencoders. Then, both the consensus and complementary information of multiple views are taken into account and a novel collaborative training scheme is proposed. Concretely, the feature representations and cluster assignments of all views are learned collaboratively. A new consistency strategy for cluster center initialization is further developed to improve multi-view clustering performance with collaborative training. Experimental results on several popular multi-view datasets show that DEMVC achieves significant improvements over state-of-the-art methods.
Article
With the popularity of cameras and sensors, massive data are captured from various view angles or modalities, which provide abundant complementary information and also bring great challenges for traditional clustering methods. In this article, we propose a novel Adaptive K-Multiple-Means for multi-view clustering method (AKM3C). Unlike traditional multi-view K-means methods that group samples into C clusters, each with a cluster center in every view, the proposed AKM3C employs M (M > C) sub-cluster centers in each view to reveal the sub-cluster structure in the multi-view data and thus enhance clustering performance. Additionally, to distinguish the importance of different views, instead of using empirical weights, AKM3C exploits a multi-view combination weights strategy to assign a weight to each view automatically, and thus properly fuses the complementary information of different views to obtain an optimal shared bipartite graph, on which the Laplacian rank constraint is imposed and the final clusters are obtained by direct partitioning. An efficient optimization algorithm, with complexity and convergence analysis, is proposed to solve AKM3C. Extensive experimental results on eight public datasets show that the proposed AKM3C performs better than state-of-the-art multi-view clustering methods. The code can be downloaded at https://drive.google.com/file/d/1CQ0royrYxKFJdNLnbBQSbDrohtfH71di/view?usp=sharing .
Article
Multi-view semi-supervised learning, which leverages information from labeled and unlabeled multi-view data to improve generalization performance, has achieved great success in recent years. Two classical two-view semi-supervised learning methods are multi-view Laplacian support vector machines (MvLapSVM) and multi-view Laplacian twin support vector machines (MvLapTSVM), but they can only handle two-view classification problems rather than general multi-view classification problems. Both solve quadratic programming problems (QPPs), so their time complexity is quite high. In this paper, we formulate general multi-view Laplacian least squares support vector machines (GMvLapSVM) and general multi-view Laplacian least squares twin support vector machines (GMvLapTSVM), which solve linear equations instead of the QPPs in MvLapSVM and MvLapTSVM. They can handle general multi-view classification problems by combining multiple views in a non-pairwise way. The disagreement among different views is used as a regularization term in the objective function to exploit the consensus information. Multi-manifold regularization is adopted for multi-view semi-supervised learning. Combination weights for all views in the norm regularization terms are adopted to exploit complementary information among distinct views. Finally, an efficient alternating algorithm is proposed for optimization. Experiments performed on various real-world datasets give state-of-the-art generalization performance.
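The appeal of solving linear equations instead of QPPs can be shown with a single-view, linear sketch of a Laplacian-regularized least squares classifier; the multi-view disagreement and combination-weight terms of GMvLapSVM/GMvLapTSVM are omitted, and `gamma` and `beta` are illustrative hyperparameters.

```python
import numpy as np

def lap_ls_classifier(X_lab, y_lab, X_all, L, gamma=1.0, beta=0.1):
    """y_lab in {-1, +1}; L is the graph Laplacian over all samples X_all.
    Minimizes ||X_lab w - y_lab||^2 + gamma ||w||^2 + beta w^T X_all^T L X_all w,
    whose optimum is the solution of a single linear system."""
    d = X_lab.shape[1]
    A = X_lab.T @ X_lab + gamma * np.eye(d) + beta * X_all.T @ L @ X_all
    return np.linalg.solve(A, X_lab.T @ y_lab)

# prediction on new samples: np.sign(X_test @ w)
```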
Article
Microsoft’s Kinect sensors are receiving an increasing amount of interest from security researchers since they are cost-effective and can provide both visual and depth modality data at the same time. Unfortunately, in some realistic scenarios the depth or RGB modality is unavailable during training or testing. Therefore, we explore a new problem focusing on arbitrary absence of modality, which is completely different from conventional action recognition. The new problem aims to deal with cross-modality data (e.g., RGB training and depth testing data), “missing” modality data (e.g., RGB training and RGB-D test data), and single modality data (e.g., RGB/depth in both phases). Accordingly, our method borrows information (e.g., the correlation between the two modalities) from a well-established RGB-D dataset and applies it to the existing dataset to recover latent information and improve recognition performance. For instance, a cross-modality regularizer is used to preserve the correlation of the RGB and depth modalities. The “missing” knowledge is treated as latent information, which is recovered by low-rank learning in our model. In the real world, the target data are usually sparsely labeled or completely unlabeled; however, we can exploit the pseudo labels of the target data as prior knowledge for “supervised” learning in the target domain. Accordingly, we propose a semi-supervised model for transfer learning. Experiments on three widely used RGB-D action datasets show that our method performs better than state-of-the-art transfer learning methods in most cases in terms of accuracy and time efficiency.
Article
Learning an expressive representation from multi-view data is a key step in various real-world applications. In this paper, we propose a Semi-supervised Multi-view Deep Discriminant Representation Learning (SMDDRL) approach. Unlike existing joint or alignment multi-view representation learning methods that cannot simultaneously utilize the consensus and complementary properties of multi-view data to learn inter-view shared and intra-view specific representations, SMDDRL comprehensively exploits the consensus and complementary properties as well as learns both shared and specific representations by employing the shared and specific representation learning network. Unlike existing shared and specific multi-view representation learning methods that ignore the redundancy problem in representation learning, SMDDRL incorporates the orthogonality and adversarial similarity constraints to reduce the redundancy of learned representations. Moreover, to exploit the information contained in unlabeled data, we design a semi-supervised learning framework by combining deep metric learning and density clustering. Experimental results on three typical multi-view learning tasks, i.e., webpage classification, image classification, and document classification demonstrate the effectiveness of the proposed approach.
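One common way to express the orthogonality constraint used to reduce redundancy between shared and view-specific representations is a Frobenius-norm penalty on their cross-product; this is a hedged sketch of that idea only (the adversarial similarity term of SMDDRL is not shown).

```python
import numpy as np

def orthogonality_penalty(H_shared, H_specific):
    """Rows are samples. Encourages the shared and view-specific representations
    of the same samples to carry non-overlapping information: ||H_s^T H_p||_F^2."""
    return np.linalg.norm(H_shared.T @ H_specific, ord="fro") ** 2
```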
Article
Non-intrusive load monitoring (NILM) is a technique that infers appliance-level energy consumption patterns and operation state changes based on feeder power signals. With the availability of fine-grained electric load profiles, there has been increasing interest in using this approach for demand-side energy management in smart grids. NILM is a multi-label classification problem due to the simultaneous operation of multiple appliances. Recently, deep learning based techniques have been shown to be a promising approach to solving this problem, but annotating the huge volume of load profile data with multiple active appliances for learning is very challenging and impractical. In this paper, a new semi-supervised multi-label deep learning based framework is proposed to address this problem with the goal of mitigating the reliance on large labeled datasets. Specifically, a temporal convolutional neural network (CNN) is used to automatically extract high-level load signatures for individual appliances. These signatures can be efficiently used to improve the feature representation capability of the framework. Case studies conducted on two open-access NILM datasets demonstrate the effectiveness and superiority of the proposed approach.
Article
Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words from the visual features with the attention mechanism. Despite the success of existing studies, current methods only model the co-attention that characterizes the inter-modal interactions while neglecting the self-attention that characterizes the intra-modal interactions. Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. Due to the in-depth modular composition of such attention blocks, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve the image captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach using the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms the previous state-of-the-art methods. With an ensemble of seven models, our solution ranked first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of writing.
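The unified attention block can be illustrated with generic scaled dot-product attention, which covers both self-attention (queries, keys and values from one modality) and co-attention (queries from one modality, keys/values from another); this is a simplified single-head sketch, not the full MT model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V

# self-attention over region features R:  scaled_dot_product_attention(R, R, R)
# co-attention from words W to regions R: scaled_dot_product_attention(W, R, R)
```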
Article
In this paper, we address the multiview nonlinear subspace representation problem. Traditional multiview subspace learning methods assume that the heterogeneous features of the data lie within the union of multiple linear subspaces. However, in many real-world applications the data features actually reside in multiple nonlinear subspaces, resulting in unsatisfactory clustering performance. To overcome this, we propose a hyper-Laplacian regularized multilinear multiview self-representation model, referred to as HLR-M2VS, to jointly learn multiple views' correlations and a local geometrical structure in a unified tensor space and in view-specific self-representation feature spaces, respectively. In the unified tensor space, a well-founded tensor low-rank regularization is imposed on the self-representation coefficient tensor to ensure global consensus among different views. In the view-specific feature space, hypergraph-induced hyper-Laplacian regularization is utilized to preserve the local geometrical structure embedded in a high-dimensional ambient space. An efficient algorithm is then derived to solve the optimization problem of the established model with a theoretical convergence guarantee. Furthermore, the proposed model can be extended to semisupervised classification without introducing any additional parameters. Extensive experiments on many challenging datasets show a clear advance over state-of-the-art multiview clustering and multiview semisupervised classification approaches.
Article
Recently, multi-view representation learning has become a rapidly growing direction in machine learning and data mining areas. This paper introduces two categories for multi-view representation learning: multi-view representation alignment and multi-view representation fusion. Consequently, we first review the representative methods and theories of multi-view representation learning based on the perspective of alignment, such as correlation-based alignment. Representative examples are canonical correlation analysis (CCA) and its several extensions. Then from the perspective of representation fusion we investigate the advancement of multi-view representation learning that ranges from generative methods including multi-modal topic learning, multi-view sparse coding, and multi-view latent space Markov networks, to neural network-based methods including multi-modal autoencoders, multi-view convolutional neural networks and multi-modal recurrent neural networks. Further, we also investigate several important applications of multi-view representation learning. Overall, this survey aims to provide an insightful overview of theoretical foundation and state-of-the-art developments in the field of multi-view representation learning and to help researchers find the most appropriate tools for particular applications.
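CCA, the representative correlation-based alignment method discussed in this survey, can be written compactly via whitening and an SVD of the cross-covariance; a minimal sketch, assuming centered data and a small ridge `eps` for numerical stability.

```python
import numpy as np

def cca(X, Y, dim, eps=1e-8):
    """X: (n, dx), Y: (n, dy); returns the `dim` leading canonical projections."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + eps * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    def inv_sqrt(C):                      # symmetric inverse square root
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)   # canonical directions in whitened space
    A = Wx @ U[:, :dim]                       # projection for view X
    B = Wy @ Vt.T[:, :dim]                    # projection for view Y
    return Xc @ A, Yc @ B, s[:dim]            # aligned representations + correlations
```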
Article
In this paper, we address the problem of large-scale graph-based semi-supervised learning for multi-class classification. Most existing scalable graph-based semi-supervised learning methods are based on the hard linear constraint or cannot cope with the unseen samples, which limits their applications and learning performance. To this end, we build upon our previous work flexible manifold embedding (FME) [1] and propose two novel linear-complexity algorithms called fast flexible manifold embedding (f-FME) and reduced flexible manifold embedding (r-FME). Both of the proposed methods accelerate FME and inherit its advantages. Specifically, our methods address the hard linear constraint problem by combining a regression residue term and a manifold smoothness term jointly, which naturally provides the prediction model for handling unseen samples. To reduce computational costs, we exploit the underlying relationship between a small number of anchor points and all data points to construct the graph adjacency matrix, which leads to simplified closed-form solutions. The resultant f-FME and r-FME algorithms not only scale linearly in both time and space with respect to the number of training samples but also can effectively utilize information from both labeled and unlabeled data. Experimental results show the effectiveness and scalability of the proposed methods.
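One standard way to build the anchor-based adjacency the authors describe is to factor it through a sparse sample-to-anchor affinity, so the full n x n graph never has to be formed explicitly; this is a generic sketch of that construction, not necessarily the exact one used by f-FME/r-FME, and the neighbourhood size and bandwidth are assumptions.

```python
import numpy as np

def anchor_adjacency_factors(X, anchors, k=5, sigma=1.0):
    """Returns Z (n x m) and lam (m,) such that W = Z @ diag(1/lam) @ Z.T."""
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    Z = np.exp(-d2 / (2 * sigma ** 2))
    far = np.argsort(d2, axis=1)[:, k:]          # zero all but the k nearest anchors
    np.put_along_axis(Z, far, 0.0, axis=1)
    Z /= Z.sum(axis=1, keepdims=True)
    lam = Z.sum(axis=0)                          # anchor degrees
    return Z, lam

# multiplying the implicit adjacency W by a vector v without forming W:
#   Wv = Z @ ((Z.T @ v) / lam)
```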
Article
In this work, we investigate the problem of learning knowledge from the massive community-contributed images with rich weakly-supervised context information, which can benefit multiple image understanding tasks simultaneously, such as social image tag refinement and assignment, content-based image retrieval, tag-based image retrieval and tag expansion. Towards this end, we propose a Deep Collaborative Embedding (DCE) model to uncover a unified latent space for images and tags. The proposed method incorporates the end-to-end learning and collaborative factor analysis in one unified framework for the optimal compatibility of representation learning and latent space discovery. A nonnegative and discrete refined tagging matrix is learned to guide the end-to-end learning. To collaboratively explore the rich context information of social images, the proposed method integrates the weakly-supervised image-tag correlation, image correlation and tag correlation simultaneously and seamlessly. The proposed model is also extended to embed new tags in the uncovered space. To verify the effectiveness of the proposed method, extensive experiments are conducted on two widely-used social image benchmarks for multiple social image understanding tasks. The encouraging performance of the proposed method over the state-of-the-art approaches demonstrates its superiority.
Article
Due to their efficiency in learning relationships and complex structures hidden in data, graph-oriented methods have been widely investigated and achieve promising performance. Generally, in the field of multi-view learning, these algorithms construct an informative graph for each view, on which the subsequent clustering or classification procedures are based. However, in many real-world datasets, the original data often contain noise and outlying entries that result in unreliable and inaccurate graphs, which previous methods cannot ameliorate. In this paper, we propose a novel multi-view learning model which performs clustering/semi-supervised classification and local structure learning simultaneously. The obtained optimal graph can be partitioned into specific clusters directly. Moreover, our model can allocate an ideal weight to each view automatically without additional weight and penalty parameters. An efficient algorithm is proposed to optimize this model. Extensive experimental results on different real-world datasets show that the proposed model outperforms other state-of-the-art multi-view algorithms.
Article
While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks, in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros that seem remarkably suitable for naturally sparse data. Although they can take advantage of semi-supervised setups with extra unlabelled data, deep rectifier networks can reach their best performance on purely supervised tasks with large labelled data sets without requiring any unsupervised pre-training. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty of training deep but purely supervised neural networks, and at closing the performance gap between neural networks learnt with and without unsupervised pre-training.
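The rectifier itself is one line, and its tendency to produce exact zeros (hence sparse representations) is easy to see numerically; a minimal sketch with random weights.

```python
import numpy as np

def relu(x):
    """Rectifier nonlinearity: max(0, x), which outputs exact zeros."""
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
h = relu(rng.standard_normal((4, 8)) @ rng.standard_normal((8, 16)))
print(f"fraction of exactly-zero activations: {np.mean(h == 0.0):.2f}")
```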
Article
Hashing methods have been widely used for efficient similarity retrieval on large-scale image datasets. Traditional hashing methods learn hash functions that generate binary codes from hand-crafted features, which achieves limited accuracy since hand-crafted features cannot optimally represent the image content and preserve semantic similarity. Recently, several deep hashing methods have shown better performance because deep architectures generate more discriminative feature representations. However, these deep hashing methods are mainly designed for supervised scenarios: they only exploit the semantic similarity information but ignore the underlying data structures. In this paper, we propose the semi-supervised deep hashing (SSDH) method to perform more effective hash learning by simultaneously preserving the semantic similarity and the underlying data structures. Our proposed approach can be divided into two phases. First, a deep network is designed to extensively exploit both the labeled and unlabeled data, in which we construct the similarity graph online in a mini-batch with the deep feature representations. To the best of our knowledge, our proposed deep network is the first deep hashing method that can perform hash code learning and feature learning simultaneously in a semi-supervised fashion. Second, we propose a loss function suitable for the semi-supervised scenario by jointly minimizing the empirical error on the labeled data as well as the embedding error on both the labeled and unlabeled data, which can preserve the semantic similarity as well as capture the meaningful neighbors on the underlying data structures for effective hashing. Experimental results on four widely used datasets show that the proposed approach outperforms state-of-the-art hashing methods.
Article
In many image processing and pattern recognition problems, visual contents of images are currently described by high-dimensional features, which are often redundant and noisy. Towards this end, we propose a novel unsupervised feature selection scheme, namely Nonnegative Spectral analysis with Constrained Redundancy (NSCR), by jointly leveraging nonnegative spectral clustering and redundancy analysis. The proposed method can directly identify a discriminative subset of the most useful and redundancy-constrained features. Nonnegative spectral analysis is developed to learn more accurate cluster labels of the input images, during which feature selection is performed simultaneously. The joint learning of the cluster labels and the feature selection matrix enables the selection of the most discriminative features. Row-wise sparse models with a general ℓ2,p-norm (0 < p ≤ 1) are leveraged to make the proposed model suitable for feature selection and robust to noise. Besides, the redundancy between features is explicitly exploited to control the redundancy of the selected subset. The problem is formulated as an optimization task with a well-defined objective function and solved by a simple yet efficient iterative algorithm. Finally, we conduct extensive experiments on 9 diverse image benchmarks, including face data, handwritten digit data and object image data. The proposed method achieves encouraging experimental results in comparison with several representative algorithms, which demonstrates its effectiveness for unsupervised feature selection.
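Once a row-sparse projection matrix has been learned, feature selection reduces to ranking features by the ℓ2 norms of their rows, the quantity the ℓ2,p regularizer drives toward zero for non-discriminative features; a minimal sketch of that final step only (the NSCR optimization itself is not reproduced).

```python
import numpy as np

def select_features(W, n_selected):
    """W: (n_features, n_clusters) row-sparse projection learned by the model.
    Returns the indices of the features with the largest row norms."""
    row_norms = np.linalg.norm(W, axis=1)          # l2 norm of each feature's row
    return np.argsort(row_norms)[::-1][:n_selected]
```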
Chapter
Chapter 1 strongly advocates the stochastic back-propagation method to train neural networks. This is in fact an instance of a more general technique called stochastic gradient descent (SGD). This chapter provides background material, explains why SGD is a good learning algorithm when the training set is large, and provides useful recommendations.
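A minimal sketch of the mini-batch SGD update for least-squares regression: the gradient is estimated on a small random batch, so the per-step cost is independent of the training-set size; the batch size, learning rate and epoch count here are illustrative.

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.01, batch=32, epochs=10, seed=0):
    """Minimizes ||X w - y||^2 / n with mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for idx in np.array_split(order, max(1, len(X) // batch)):
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient on the batch
            w -= eta * grad                               # SGD update
    return w
```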
Article
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all or what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
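The basic recipe the tutorial derives can be sketched in a few lines (unnormalized variant): build a similarity graph, take the first k eigenvectors of its Laplacian, and run k-means on that embedding; the Gaussian similarity and bandwidth are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))             # fully connected similarity graph
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W                 # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)                    # eigenvalues in ascending order
    return KMeans(k, n_init=10).fit_predict(vecs[:, :k])
```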
Conference Paper
Due to its occurrence in engineering domains and implications for natural learning, the problem of utilizing unlabeled data is attracting increasing attention in machine learning. A large body of recent literature has focussed on the transductive setting, where labels of unlabeled examples are estimated by learning a function defined only over the point cloud data. In a truly semi-supervised setting, however, a learning machine has access to labeled and unlabeled examples and must make predictions on data points never encountered before. In this paper, we show how to turn transductive and standard supervised learning algorithms into semi-supervised learners. We construct a family of data-dependent norms on Reproducing Kernel Hilbert Spaces (RKHS). These norms allow us to warp the structure of the RKHS to reflect the underlying geometry of the data. We derive explicit formulas for the corresponding new kernels. Our approach demonstrates state of the art performance on a variety of classification tasks.
Conference Paper
Detecting humans in films and videos is a challenging problem owing to the motion of the subjects, the camera and the background and to variations in pose, appearance, clothing, illumination and background clutter. We develop a detector for standing and moving people in videos with possibly moving cameras and backgrounds, testing several different motion coding schemes and showing empirically that orientated histograms of differential optical flow give the best overall performance. These motion-based descriptors are combined with our Histogram of Oriented Gradient appearance descriptors. The resulting detector is tested on several databases including a challenging test set taken from feature films and containing wide ranges of pose, motion and background variations, including moving cameras and backgrounds. We validate our results on two challenging test sets containing more than 4400 human examples. The combined detector reduces the false alarm rate by a factor of 10 relative to the best appearance-based detector, for example giving false alarm rates of 1 per 20,000 windows tested at 8% miss rate on our Test Set 1.
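The appearance half of the descriptor can be sketched as an orientation histogram for one cell of a grayscale patch, with gradient magnitudes as votes; the motion descriptors are analogous, with differential optical flow in place of intensity gradients. This is a simplified sketch without the block-level normalization and interpolation of the full method.

```python
import numpy as np

def cell_hog(patch, n_bins=9):
    """Histogram of oriented gradients for one cell of a grayscale patch."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180        # unsigned gradients
    hist = np.zeros(n_bins)
    bin_width = 180.0 / n_bins
    for m, o in zip(magnitude.ravel(), orientation.ravel()):
        hist[int(o // bin_width) % n_bins] += m                # vote by magnitude
    return hist / (np.linalg.norm(hist) + 1e-6)                # normalize the cell
```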
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
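The matching step described here can be sketched as nearest-neighbour search with a distance-ratio test that rejects ambiguous matches; the ratio threshold is an illustrative value, and the Hough clustering and pose-verification stages are not shown.

```python
import numpy as np

def match_descriptors(desc_query, desc_db, ratio=0.8):
    """Match each query descriptor to its nearest database descriptor, keeping
    only matches whose nearest distance is clearly below the second-nearest."""
    matches = []
    for i, d in enumerate(desc_query):
        dists = np.linalg.norm(desc_db - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        if dists[j1] < ratio * dists[j2]:          # distance-ratio test
            matches.append((i, j1))
    return matches
```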
Unsupervised data augmentation for consistency training
  • Xie
MixMatch: A holistic approach to semi-supervised learning
  • Berthelot