Article

Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

Authors: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani

Abstract

We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including Support Vector Machines and Regularized Least Squares can be obtained as special cases. We utilize properties of Reproducing Kernel Hilbert spaces to prove new Representer theorems that provide a theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally, we present a brief discussion of unsupervised and fully supervised learning within our general framework.
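The Laplacian Regularized Least Squares (LapRLS) special case mentioned in the abstract admits a closed-form solution. The following is a minimal sketch (my own illustration, not the authors' code), assuming a Gaussian kernel, a symmetrized k-nearest-neighbour adjacency graph, and squared loss; parameter names such as `gamma_A` and `gamma_I` mirror the γ_A/γ_I trade-off weights that appear in the excerpts below.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    return np.exp(-cdist(A, B, "sqeuclidean") / (2 * sigma**2))

def laprls_fit(X_lab, y_lab, X_unlab, gamma_A=1e-2, gamma_I=1e-2, sigma=1.0, n_neighbors=5):
    """Minimal LapRLS sketch: squared loss on labeled points plus an ambient
    RKHS penalty and an intrinsic (graph Laplacian) smoothness penalty."""
    X = np.vstack([X_lab, X_unlab])
    l, n = len(X_lab), len(X)

    # Adjacency graph over labeled + unlabeled points (symmetrized k-NN, Gaussian weights).
    W = rbf_kernel(X, X, sigma)
    np.fill_diagonal(W, 0.0)
    idx = np.argsort(-W, axis=1)[:, :n_neighbors]
    mask = np.zeros_like(W, dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    W = W * np.maximum(mask, mask.T)
    L = np.diag(W.sum(axis=1)) - W                 # unnormalized graph Laplacian

    K = rbf_kernel(X, X, sigma)                    # Gram matrix over all points
    J = np.zeros((n, n)); J[:l, :l] = np.eye(l)    # selects the labeled block
    y = np.concatenate([y_lab, np.zeros(n - l)])   # zeros at unlabeled positions

    # Closed-form expansion coefficients (one standard variant of the LapRLS solution):
    # alpha = (J K + gamma_A * l * I + gamma_I * l / n^2 * L K)^{-1} y
    M = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n**2) * (L @ K)
    alpha = np.linalg.solve(M, y)
    return X, alpha, sigma

def laprls_predict(model, X_new):
    # Out-of-sample prediction via the kernel expansion over all training points.
    X, alpha, sigma = model
    return rbf_kernel(X_new, X, sigma) @ alpha
```

Because the expansion is over all labeled and unlabeled points, prediction at a new point only requires kernel evaluations, which is exactly the out-of-sample extension the abstract contrasts with purely graph-based (transductive) methods.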


... To understand the behaviour of models 2(ii) and 2(iii), we calculate the generalised travel time of some simple path sets over the square grid in two space dimensions with constant unit weights and slowness function; see Figures 2 and 3, respectively. In each case, we calculate the travel times for the three path sets P^(1)_{x_0,i}, P^(2)_{x_0,i} and P^(3)_{x_0,i}, where x_0 = (0, 0) and i = (2, 2). Let U and R be the paths travelling 'up' and 'right' from a node to a neighbour on the square grid. ...

... Let U and R be the paths travelling 'up' and 'right' from a node to a neighbour on the square grid. We set P^(1)_{x_0,i} = {(U, R, U, R)}, P^(2)_{x_0,i} = P^(1)_{x_0,i} ∪ {(R, U, R, U)} and P^(3)_{x_0,i} = P^(2)_{x_0,i} ∪ {(U, U, R, R), (R, R, U, U)}, so these path sets have 1, 2 and 4 elements, respectively. We show the generalised travel time for path sets P^(1)_{x_0,i}, P^(2)_{x_0,i} and P^(3)_{x_0,i} for models 2(ii) and 2(iii) in Figures 2 and 3, respectively. ...

... We set P^(1)_{x_0,i} = {(U, R, U, R)}, P^(2)_{x_0,i} = P^(1)_{x_0,i} ∪ {(R, U, R, U)} and P^(3)_{x_0,i} = P^(2)_{x_0,i} ∪ {(U, U, R, R), (R, R, U, U)}, so these path sets have 1, 2 and 4 elements, respectively. We show the generalised travel time for path sets P^(1)_{x_0,i}, P^(2)_{x_0,i} and P^(3)_{x_0,i} for models 2(ii) and 2(iii) in Figures 2 and 3, respectively. Here, the numbers at nodes along the different paths denote the generalised travel time from the origin x_0 to the respective nodes. ...
Article
Full-text available
We propose and unify classes of different models for information propagation over graphs. In a first class, propagation is modelled as a wave, which emanates from a set of known nodes at an initial time, to all other unknown nodes at later times with an ordering determined by the arrival time of the information wave front. A second class of models is based on the notion of a travel time along paths between nodes. The time of information propagation from an initial known set of nodes to a node is defined as the minimum of a generalised travel time over subsets of all admissible paths. A final class is given by imposing a local equation of an eikonal form at each unknown node, with boundary conditions at the known nodes. The solution value of the local equation at a node is coupled to those of neighbouring nodes with lower values. We provide precise formulations of the model classes and prove equivalences between them. Finally, we apply the front propagation models on graphs to semi-supervised learning via label propagation and information propagation on trust networks.
... A statistical manifold is a Riemannian manifold equipped with a pair of dual affine connections, enabling the geometric interpretation of probabilistic and information-theoretic concepts. Statistical manifolds represent a significant topic with extensive applications across various fields, including machine learning, image analysis, neural networks, physics, general relativity, and control systems, among others [1,2]. These manifolds arise from statistical models, where the points of the Riemannian manifold correspond to probability distributions. ...
... A pair (D := D_0 + K, g) becomes a statistical structure if and only if the difference tensor K ∈ Γ(TM^(1,2)) satisfies the following conditions: ...

... On comparing normal components, we obtain B(V_1 ...
Article
Full-text available
This paper aims to develop a general theory of quaternion Kahlerian statistical manifolds and to study quaternion CR-statistical submanifolds in such ambient manifolds. It extends the existing theories of quaternion submanifolds and totally real submanifolds. Additionally, the work examines quaternion Kahlerian statistical submersions, including illustrative examples. The exploration also includes an analysis of the total space and fibers under certain conditions with example(s) in support. Moreover, Chen–Ricci inequality on the vertical distribution is derived for quaternion Kahlerian statistical submersions from quaternion Kahlerian statistical manifolds.
... The most common class of methods for SSL is based on the manifold assumption [3], that is, two examples with similar features tend to share the same class label. Manifold regularization tries to explore the geometry of the intrinsic data probability distribution by penalizing the regression function along the potential manifold. ...
... Manifold regularization tries to explore the geometry of the intrinsic data probability distribution by penalizing the regression function along the potential manifold. Laplacian regularization (LR) [2], [3] is one of the representative works in which the geometry of the underlying manifold is determined by the graph Laplacian. LR-based SSL has received intensive attention and many algorithms have been developed, such as Laplacian regularized least squares (LapLS) and Laplacian support vector machines (LapSVM) [3]. ...
... Laplacian regularization (LR) [2], [3] is one of the representative works in which the geometry of the underlying manifold is determined by the graph Laplacian. LR-based SSL has received intensive attention and many algorithms have been developed, such as Laplacian regularized least squares (LapLS) and Laplacian support vector machines (LapSVM) [3]. ...
Preprint
The rapid development of computer hardware and Internet technology makes large scale data dependent models computationally tractable, and opens a bright avenue for annotating images through innovative machine learning algorithms. Semi-supervised learning (SSL) has consequently received intensive attention in recent years and has been successfully deployed in image annotation. One representative work in SSL is Laplacian regularization (LR), which smoothes the conditional distribution for classification along the manifold encoded in the graph Laplacian, however, it has been observed that LR biases the classification function towards a constant function which possibly results in poor generalization. In addition, LR is developed to handle uniformly distributed data (or single view data), although instances or objects, such as images and videos, are usually represented by multiview features, such as color, shape and texture. In this paper, we present multiview Hessian regularization (mHR) to address the above two problems in LR-based image annotation. In particular, mHR optimally combines multiple Hessian regularizations, each of which is obtained from a particular view of instances, and steers the classification function which varies linearly along the data manifold. We apply mHR to kernel least squares and support vector machines as two examples for image annotation. Extensive experiments on the PASCAL VOC'07 dataset validate the effectiveness of mHR by comparing it with baseline algorithms, including LR and HR.
... The vector-valued function [7] has recently been introduced to resolve multi-label classification [8] and has been demonstrated to be effective in semantic scene annotation. This method naturally incorporates the label-dependencies into the classification model by first computing the graph Laplacian [9] of the output similarity graph, and then using this graph to construct a vector-valued kernel. This model is superior to most of the existing multi-label learning methods [10], [11], [12] because it naturally considers the label correlations and efficiently outputs all the predicted labels at one time. ...
... GENERALIZATION This section briefly introduces the manifold regularization framework [9] and its vector-valued generalization [8]. Given ...
... Here, γ_A and γ_I are trade-off parameters to control the complexities of f in the ambient space and the compact support of the marginal distribution. The Representer theorem [9] ensures the solution of problem (1) takes the form ...
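For context, the expansion guaranteed by the Representer theorem of [9] (the manifold regularization paper this page describes) is over both labeled and unlabeled examples; in the usual notation it reads:

```latex
% l labeled and u unlabeled points, K the chosen reproducing kernel
f^{\star}(x) \;=\; \sum_{i=1}^{l+u} \alpha_i \, K(x_i, x)
```

so the minimizer is determined by a finite vector of coefficients α, which is what makes the closed-form and SVM-style solvers referenced in these excerpts possible.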
Preprint
In computer vision, image datasets used for classification are naturally associated with multiple labels and comprised of multiple views, because each image may contain several objects (e.g. pedestrian, bicycle and tree) and is properly characterized by multiple visual features (e.g. color, texture and shape). Currently available tools ignore either the label relationship or the view complementarity. Motivated by the success of the vector-valued function that constructs matrix-valued kernels to explore the multi-label structure in the output space, we introduce multi-view vector-valued manifold regularization (MV³MR) to integrate multiple features. MV³MR exploits the complementary property of different features and discovers the intrinsic local geometry of the compact support shared by different features under the theme of manifold regularization. We conducted extensive experiments on two challenging, but popular datasets, PASCAL VOC'07 (VOC) and MIR Flickr (MIR), and validated the effectiveness of the proposed MV³MR for image classification.
... To identify the domain expert's interests, one could simply have the user label instances as high or low utility through active learning frameworks like the algorithms in [6], [7], and subsequently use popular supervised or semi-supervised classification methods [8], [9], [10], [11] to discriminate between the high-utility and low-utility instances. The drawback with this approach in contrast to our proposed approach is that these methods do not exploit the following key idea: only statistically rare points can be of high-utility, or equivalently, all nominal points are low-utility. ...
... In order to incorporate unlabeled points, we use the semisupervised framework of [9], which requires the decision boundary to be smooth with respect to the marginal distribution of all the data, P X . This is because we assume that unlabeled points have the same label as their labeled neighbors and prefer decision boundaries in low density regions. ...
... , α_l ≥ 0, and β ≥ 0. Since the smoothness constraint is formulated using the semi-supervised framework of [9], the above objective function is very similar to their proposed Laplacian SVM (LapSVM). This is more obviously seen by extending (1) to nonlinear discriminant functions through a kernel function k(·, ·) and treating β as a fixed parameter to be chosen separately. ...
Preprint
Data-driven anomaly detection methods suffer from the drawback of detecting all instances that are statistically rare, irrespective of whether the detected instances have real-world significance or not. In this paper, we are interested in the problem of specifically detecting anomalous instances that are known to have high real-world utility, while ignoring the low-utility statistically anomalous instances. To this end, we propose a novel method called Latent Laplacian Maximum Entropy Discrimination (LatLapMED) as a potential solution. This method uses the EM algorithm to simultaneously incorporate the Geometric Entropy Minimization principle for identifying statistical anomalies, and the Maximum Entropy Discrimination principle to incorporate utility labels, in order to detect high-utility anomalies. We apply our method in both simulated and real datasets to demonstrate that it has superior performance over existing alternatives that independently pre-process with unsupervised anomaly detection algorithms before classifying.
... Eq. (2) cannot be solved directly. In this paper, it is proposed to leverage the idea of manifold regularization [12] to learn the intrinsic structure of the target domain so that the target task can be treated as an unsupervised clustering task. ...

... The use of multi-task regularization can remove additional conditions, such as the orthogonality constraint required in manifold regularization-based unsupervised learning [12] to avoid degenerate solutions. Note that the term ‖f_t‖²_I is different from the manifold regularization term in [3]. ...
... The objective function (Eq. (8)) can be solved efficiently in closed form. Following [12], each trade-off coefficient is treated as a whole, e.g. γ̃_I = γ_I n_s / n_t², γ̃_A = γ_A n_s, γ̃_M = γ_M n_s, and γ̃_D = γ_D n_s, when tuning the parameters. ...
Preprint
This paper presents a novel multi-task learning-based method for unsupervised domain adaptation. Specifically, the source and target domain classifiers are jointly learned by considering the geometry of target domain and the divergence between the source and target domains based on the concept of multi-task learning. Two novel algorithms are proposed upon the method using Regularized Least Squares and Support Vector Machines respectively. Experiments on both synthetic and real world cross domain recognition tasks have shown that the proposed methods outperform several state-of-the-art domain adaptation methods.
... Hence, this general form of SWSL can be formulated under the framework of semi-supervised learning with additional label constraints on the unlabeled data. A variety of methods for semi-supervised learning, including multiple instance semi-supervised learning, have been proposed [24], [25], [26], [27], [28], [29], [30]. In this work, we adopt one of the most popular methods for semi-supervised learning, manifold regularization on graphs [26], for SWSL. ...

... A variety of methods for semi-supervised learning, including multiple instance semi-supervised learning, have been proposed [24], [25], [26], [27], [28], [29], [30]. In this work, we adopt one of the most popular methods for semi-supervised learning, manifold regularization on graphs [26], for SWSL. We name this variant of SWSL graphSWSL. ...

... Since the labels for instances x_{n+1} to x_{n+m} are essentially unknown and yet constrained by Eq. 1, we can formulate this learning process as a constrained form of semi-supervised learning (SSL). A particularly well-known method for semi-supervised learning is manifold regularization on graphs [26]. This method is inductive, which is one of the reasons we adopt it for SWSL. ...
Preprint
In this paper we propose a novel learning framework called Supervised and Weakly Supervised Learning where the goal is to learn simultaneously from weakly and strongly labeled data. Strongly labeled data can be simply understood as fully supervised data where all labeled instances are available. In weakly supervised learning, data are only weakly labeled, which prevents one from directly applying supervised learning methods. Our proposed framework is motivated by the fact that a small amount of strongly labeled data can give considerable improvement over only weakly supervised learning. The primary problem domain of focus in this paper is acoustic event and scene detection in audio recordings. We first propose a naive formulation for leveraging labeled data in both forms. We then propose a more general framework for Supervised and Weakly Supervised Learning (SWSL). Based on this general framework, we propose a graph based approach for SWSL. Our main method is based on manifold regularization on graphs, in which we show that the unified learning can be formulated as a constrained optimization problem which can be solved by the iterative concave-convex procedure (CCCP). Our experiments show that our proposed framework can address several concerns of audio content analysis using weakly labeled data.
... Intuitively, Laplacian eigenmaps tries to find an optimal nonlinear mapping such that the k-nearest neighbouring speech segments in the reference set Y_ref are mapped to similar regions in the target space R^D. To embed an arbitrary segment Y which is not an element of Y_ref, a kernel-based out-of-sample extension is used [31]. This performs a type of interpolation using the exemplars in Y_ref that are similar to the target segment Y. ...

... In [31], it was shown that the optimal projection to the j-th dimension in the target space is given by ...

... We have given only a brief outline of the embedding method here; complete details can be found in [29]-[31]. ...
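The kernel-based out-of-sample extension mentioned above is, at its core, a Nyström-style interpolation from the reference exemplars. The sketch below is schematic (generic Gaussian kernel, illustrative names such as `V_ref` and `eigvals`) and does not reproduce the exact formulation of [31].

```python
import numpy as np

def gaussian_kernel(x, X_ref, sigma=1.0):
    # Similarities between a new point x and each reference exemplar (rows of X_ref).
    return np.exp(-np.sum((X_ref - x) ** 2, axis=1) / (2 * sigma**2))

def out_of_sample_embed(x_new, X_ref, V_ref, eigvals, sigma=1.0):
    """Nystrom-style extension: the j-th embedding coordinate of a new point is a
    kernel-weighted combination of the reference points' j-th eigenvector entries,
    rescaled by the corresponding (assumed nonzero) eigenvalue.

    X_ref   : (n, d) reference points used to compute the embedding
    V_ref   : (n, D) embedding coordinates (eigenvector entries) of X_ref
    eigvals : (D,)   eigenvalues associated with each embedding dimension
    """
    k = gaussian_kernel(x_new, X_ref, sigma)
    return (k @ V_ref) / eigvals
```

The new point never has to be part of the original eigendecomposition, which is what makes the embedding usable for arbitrary unseen segments.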
Preprint
In settings where only unlabelled speech data is available, speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modelling text. A similar problem is faced when modelling infant language acquisition. In these cases, categorical linguistic structure needs to be discovered directly from speech audio. We present a novel unsupervised Bayesian model that segments unlabelled speech and clusters the segments into hypothesized word groupings. The result is a complete unsupervised tokenization of the input speech in terms of discovered word types. In our approach, a potential word segment (of arbitrary length) is embedded in a fixed-dimensional acoustic vector space. The model, implemented as a Gibbs sampler, then builds a whole-word acoustic model in this space while jointly performing segmentation. We report word error rates in a small-vocabulary connected digit recognition task by mapping the unsupervised decoded output to ground truth transcriptions. The model achieves around 20% error rate, outperforming a previous HMM-based system by about 10% absolute. Moreover, in contrast to the baseline, our model does not require a pre-specified vocabulary size.
... Graph-based semi-supervised learning (SSL) methods tackle this task building on the premise that the true labels are distributed "smoothly" with respect to the underlying network, which then motivates leveraging the network structure to increase the classification accuracy [11]. Graph-based SSL has been pursued by various intertwined methods, including iterative label propagation [6], [43], [29], [25], kernels on graphs [31], manifold regularization [5], graph partitioning [40], [19], transductive learning [39], competitive infection models [36], and bootstrapped label propagation [10]. SSL based on graph filters was discussed in [37], and further developed in [12] for bridge monitoring. ...
... is the indicator vector of the nodes belonging to class c. Using our diffusion model in (5), the N-dimensional optimization problem (9) reduces to solving for the K-dimensional vector ...

... For the multilabel graphs, we found λ = 5.0 and even shorter walks of K = 10 to perform well. For the dictionary mode of AdaDIF, we preselected D = 10, with the first five columns of C being HK coefficients with parameters t ∈ [5,8,12,15,20], and the other five polynomial coefficients c_i = k^β with β ∈ [2,4,6,8,10]. ...
Preprint
Diffusion-based classifiers such as those relying on the Personalized PageRank and the Heat kernel, enjoy remarkable classification accuracy at modest computational requirements. Their performance however is affected by the extent to which the chosen diffusion captures a typically unknown label propagation mechanism, that can be specific to the underlying graph, and potentially different for each class. The present work introduces a disciplined, data-efficient approach to learning class-specific diffusion functions adapted to the underlying network topology. The novel learning approach leverages the notion of "landing probabilities" of class-specific random walks, which can be computed efficiently, thereby ensuring scalability to large graphs. This is supported by rigorous analysis of the properties of the model as well as the proposed algorithms. Furthermore, a robust version of the classifier facilitates learning even in noisy environments. Classification tests on real networks demonstrate that adapting the diffusion function to the given graph and observed labels, significantly improves the performance over fixed diffusions; reaching -- and many times surpassing -- the classification accuracy of computationally heavier state-of-the-art competing methods, that rely on node embeddings and deep neural networks.
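As a toy illustration of the landing-probability idea in the abstract above, the sketch below fixes the per-step coefficients to truncated Personalized-PageRank weights instead of learning them per class, which is a simplification of the cited approach; names and defaults are illustrative.

```python
import numpy as np

def diffusion_classify(A, seeds_per_class, K=10, alpha=0.9):
    """Score nodes by combining K-step random-walk landing probabilities.

    A               : (N, N) symmetric adjacency matrix (nonzero degrees assumed)
    seeds_per_class : dict {class_label: list of labeled seed node indices}
    K               : number of random-walk steps (landing probabilities)
    alpha           : damping; (1 - alpha) * alpha**k are PageRank-style weights
    """
    N = A.shape[0]
    P = A / A.sum(axis=1)[:, None]            # row-stochastic transition matrix
    scores = {}
    for c, seeds in seeds_per_class.items():
        p = np.zeros(N)
        p[seeds] = 1.0 / len(seeds)           # initial distribution on the seeds
        score = np.zeros(N)
        for k in range(1, K + 1):
            p = P.T @ p                       # k-step landing probabilities
            score += (1 - alpha) * alpha**k * p
        scores[c] = score
    classes = list(scores)
    S = np.vstack([scores[c] for c in classes])
    return np.array(classes)[S.argmax(axis=0)]   # predicted class per node
```

Learning the per-step coefficients (rather than fixing them as here) is precisely what the abstract describes as adapting the diffusion to the graph and to each class.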
... but for now, we can say that self-attention is a mechanism that allows a model to weigh different parts of the input sequence differently, capturing dependencies between words. The multi-headed self-attention mechanism lets the model capture different dependencies and relationships between words, enhancing language ...
... • Manifold Assumption: The manifold assumption suggests that high-dimensional data lie on a low-dimensional manifold. This assumption implies that the data points are situated on a manifold of much lower dimensionality embedded within the higher-dimensional space, and learning can be simplified if this manifold structure is discovered and exploited [11]. The manifold assumption often complements the cluster and continuity assumptions, providing a geometric interpretation of the data's distribution. ...
Preprint
Full-text available
The rapid advancement of artificial intelligence, particularly with the development of Large Language Models (LLMs) built on the transformer architecture, has redefined the capabilities of natural language processing. These models now exhibit remarkable performance across various language-related tasks, such as text generation, question answering, translation, and summarization, often rivaling human-like comprehension. More intriguingly, LLMs have demonstrated emergent abilities extending beyond their core functions, showing proficiency in tasks like commonsense reasoning, code generation, and arithmetic. This survey paper explores the foundational components, scaling mechanisms, and architectural strategies that drive these capabilities. Emphasizing models like GPT and LLaMA, we analyze the impact of exponential data and computational growth on LLM performance, while also addressing the trade-offs associated with scaling. We also examine LLM applications across sectors, such as healthcare, finance, education, and law, highlighting their adaptability and potential to solve domain-specific challenges. Central to this work are the questions of how LLMs generalize across diverse tasks, exhibit planning, and reasoning abilities, and whether these emergent abilities can be systematically elicited or enhanced. In particular, we provide some insights into the CoT (Chain of Thought) and PoT (Plan of Thought) abilities within LLMs, focusing on how pre-training data influences their emergence. Additionally, we investigate LLM-modulo frameworks that integrate external systems, allowing LLMs to handle complex, dynamic tasks. By analyzing these factors, this paper aims to foster the ongoing discussion on the capabilities and limits of LLMs, promoting their responsible development and application in novel and increasingly complex environments.
... (15) and (16) are used to construct a Laplacian regularization. Based on the adaptation idea [40], similar data with the minimum distance in the feature space is most likely to belong to the same class. In this way, the label of test data can be predicted using the label of training data. ...
... (30). Using the Representer theorem [40], the prediction function f(·) can be expressed as follows: ...
Article
Full-text available
Getting machine learning (ML) to perform accurate prediction needs a sufficient number of labeled samples. However, due to the lack or small number of labeled samples in most domains, it is often beneficial to use domain adaptation (DA) and transfer learning (TL) to leverage a related auxiliary source domain to optimize the performance on the target domain. In fact, the purpose of TL and DA is to use the labeled sample information (i.e., samples and the corresponding labels) for training the classifier to categorize the unlabeled samples. In this paper, we aim to propose a novel semi-supervised transfer learning method entitled “Latent Sparse subspace learning and visual domain classification via Balanced distribution alignment and Hilbert–Schmidt metric (LSBH)”. LSBH uses the latent sparse domain transfer learning for visual adaptation (LSDT) to adapt the samples with different distributions or feature spaces across domains and prevent the creation of local common subspace for source and target domains via the simultaneous learning of latent space and sparse reconstruction. LSBH proposes a novel robust classifier which maintains performance and accuracy even when faced with variations across the source and target domains. To this end, it utilizes the following two criteria in the optimization problem: maximum mean discrepancy and Hilbert–Schmidt independence criterion to reduce the marginal and conditional distribution disparities of domains and increase the dependency between samples and labels at the classification step. LSBH obtains the optimal coefficients for the classifier, which results in the minimum error in the loss function by solving the optimization problem. Thus, minimizing the error of the loss function is part of the optimization problem. Also, to maintain the geometric structure of data in the classification step, the neighborhood graph of samples is used. The efficiency of the proposed method has been evaluated on different visual datasets and has been compared with new and prominent methods of domain adaptation and transfer learning. The results indicate the superior performance of LSBH compared to the other state-of-the-art methods in label prediction.
... Compare it to feature extraction that does not use LapSVM to guide the DNN, but uses cross-entropy loss to guide feature extraction. As shown in Fig. 3, we can see that there is a large number of similar feature blocks. In the case of a linear kernel, we can directly write f⋆(x) = wᵀx, and use the primal optimization method [35] to solve for the weight vector w. Then the above equation can be written as: ...
... • LapSVM: Laplacian SVM. It uses hinge loss and manifold assumption to construct a hyperplane classifier [35]. • Lap-TSVM: Laplacian twin SVM. ...
Article
Full-text available
Deep learning is a rapidly growing field that can effectively extract latent features from data and use them to make predictions based on the learned features, but most models just sum the loss of each sample without considering the relationship between samples. On the other hand, the traditional Laplacian Support Vector Machine (LapSVM) can effectively utilize samples and the relationship between samples by constructing a Laplacian graph, and performs well on semi-supervised data. In this paper, we combine LapSVM and deep learning and propose Deep Laplacian Support Vector Machine. Our approach is to first use a Deep Neural Network to extract the latent features from the image, then based on the extracted feature information and a small amount of original label information, we use LapSVM for classification, build a loss function, and finally iteratively update the two parts together. We evaluate our method on several benchmark datasets and demonstrate that it outperforms other semi-supervised learning methods.
... In recent years, optimization theory has found several applications in machine learning problems: often, these are formulated as optimization problems in which, given a finite set of empirical data (e.g., a finite set of feature vectors, where each feature may represent the output of a measurement process), it is required to find a parameter vector associated with a model of such data, which is optimal according to a suitable performance index. In the context of the so-called kernel methods (Shawe-Taylor and Cristianini, 2004), examples of such problems are supervised (Cucker and Smale, 2002), unsupervised (von Luxburg, 2007), and semi-supervised (Belkin and Niyogi, 2006;Melacci and Belkin, 2011) learning, and identification of models of dynamical systems (Pillonetto et al., 2014). The performance index is typically composed of the sum of an empirical term (which measures the fitness of the model associated with a specific parameter vector in explaining the empirical data), and a regularization term, whose goal is to endow the optimal model with the capability of generalizing its predictions to new data, not used during the training of the model. ...
... Among other possible future developments, we mention: potential extensions to other kernel methods (e.g., Laplacian SVMs (Belkin and Niyogi, 2006), or combinations of soft and hard constraints (Gnecco et al., 2015b)) of the theoretical results about symmetry/antisymmetry of the optimal solutions; the investigation of possible symmetry/antisymmetry preserving properties of other variations of the SMO algorithm used, e.g., for SVM regression problems. ...
Preprint
A particularly interesting instance of supervised learning with kernels is when each training example is associated with two objects, as in pairwise classification (Brunner et al., 2012), and in supervised learning of preference relations (Herbrich et al., 1998). In these cases, one may want to embed additional prior knowledge into the optimization problem associated with the training of the learning machine, modeled, respectively, by the symmetry of its optimal solution with respect to an exchange of order between the two objects, and by its antisymmetry. Extending the approach proposed in (Brunner et al., 2012) (where the only symmetric case was considered), we show, focusing on support vector binary classification, how such embedding is possible through the choice of a suitable pairwise kernel, which takes as inputs the individual feature vectors and also the group feature vectors associated with the two objects. We also prove that the symmetry/antisymmetry constraints still hold when considering the sequence of suboptimal solutions generated by one version of the Sequential Minimal Optimization (SMO) algorithm, and we present numerical results supporting the theoretical findings. We conclude discussing extensions of the main results to support vector regression, to transductive support vector machines, and to several kinds of graph kernels, including diffusion kernels.
... The machine learning community has already looked at SPoG-related issues in the context of semi-supervised learning under the term of transductive regression and classification [6]- [8]. Existing approaches rely on smoothness assumptions for inference of processes over graphs using nonparametric methods [2], [3], [6], [9]. Whereas some works consider estimation of real-valued signals [7]- [10], most in this body of literature have focused on estimating binary-valued functions; see e.g. ...
... . Laplacian regularization [3], [4], [9], [30], [31] is effected by setting r(λ) = 1 + σ²λ with σ² sufficiently large. Observe that obtaining K generally requires an eigendecomposition of L, which is computationally challenging for large graphs (large N). ...
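To make the r(λ) = 1 + σ²λ choice concrete, one generic way of building the corresponding graph kernel and using it for signal reconstruction is sketched below; this is an illustration under my own naming, not the cited paper's code, and the eigendecomposition is exactly the expensive step the excerpt warns about.

```python
import numpy as np

def laplacian_regularization_kernel(L, sigma2=10.0):
    """Graph kernel obtained by inverting r(lambda) = 1 + sigma^2 * lambda on the
    spectrum of the graph Laplacian L (the O(N^3) eigendecomposition dominates)."""
    lam, U = np.linalg.eigh(L)              # L = U diag(lam) U^T
    r = 1.0 + sigma2 * lam                  # regularization function r(lambda)
    return U @ np.diag(1.0 / r) @ U.T       # K = U r(Lam)^{-1} U^T

def kernel_ridge_on_graph(K, sampled_idx, y_sampled, mu=1e-3):
    """Estimate a graph signal on all nodes from noisy samples on a node subset
    via kernel ridge regression with the graph kernel K."""
    m = len(sampled_idx)
    Ks = K[np.ix_(sampled_idx, sampled_idx)]
    alpha = np.linalg.solve(Ks + mu * m * np.eye(m), y_sampled)
    return K[:, sampled_idx] @ alpha        # reconstructed signal on every vertex
```

Larger σ² penalizes non-smooth spectral components more heavily, which is the sense in which this choice enforces smoothness with respect to the graph.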
Preprint
A number of applications in engineering, social sciences, physics, and biology involve inference over networks. In this context, graph signals are widely encountered as descriptors of vertex attributes or features in graph-structured data. Estimating such signals in all vertices given noisy observations of their values on a subset of vertices has been extensively analyzed in the literature of signal processing on graphs (SPoG). This paper advocates kernel regression as a framework generalizing popular SPoG modeling and reconstruction and expanding their capabilities. Formulating signal reconstruction as a regression task on reproducing kernel Hilbert spaces of graph signals permeates benefits from statistical learning, offers fresh insights, and allows for estimators to leverage richer forms of prior information than existing alternatives. A number of SPoG notions such as bandlimitedness, graph filters, and the graph Fourier transform are naturally accommodated in the kernel framework. Additionally, this paper capitalizes on the so-called representer theorem to devise simpler versions of existing Tikhonov regularized estimators, and offers a novel probabilistic interpretation of kernel methods on graphs based on graphical models. Motivated by the challenges of selecting the bandwidth parameter in SPoG estimators or the kernel map in kernel-based methods, the present paper further proposes two multi-kernel approaches with complementary strengths. Whereas the first enables estimation of the unknown bandwidth of bandlimited signals, the second allows for efficient graph filter selection. Numerical tests with synthetic as well as real data demonstrate the merits of the proposed methods relative to state-of-the-art alternatives.
... One way of taking advantage of connectivity is provided by graph-based semi-supervised learning approaches, whereby the labels, or values, known on a subset of nodes, are propagated to the rest of the graph. Laplacian-based approaches such as Harmonic Functions (Zhu, Ghahramani, and Lafferty 2003) and Laplacian Regularization (Belkin, Niyogi, and Sindhwani 2006) epitomize this class of methods. Although a variety of improvements and extensions have been proposed (Zhu 2008; Belkin, Niyogi, and Sindhwani 2006; Zhou and Belkin 2011; Wu et al. 2012; Solomon et al. 2014), the interpretability of these learning algorithms has not received much attention and remains limited to the analysis of the obtained prediction weights. ...

... Laplacian-based approaches such as Harmonic Functions (Zhu, Ghahramani, and Lafferty 2003) and Laplacian Regularization (Belkin, Niyogi, and Sindhwani 2006) epitomize this class of methods. Although a variety of improvements and extensions have been proposed (Zhu 2008; Belkin, Niyogi, and Sindhwani 2006; Zhou and Belkin 2011; Wu et al. 2012; Solomon et al. 2014), the interpretability of these learning algorithms has not received much attention and remains limited to the analysis of the obtained prediction weights. In order to promote accountability and trust, it is desirable to have a more transparent representation of the prediction process that can be visualized, interactively examined, and thoroughly understood. ...
Preprint
In this paper, we consider the interpretability of the foundational Laplacian-based semi-supervised learning approaches on graphs. We introduce a novel flow-based learning framework that subsumes the foundational approaches and additionally provides a detailed, transparent, and easily understood expression of the learning process in terms of graph flows. As a result, one can visualize and interactively explore the precise subgraph along which the information from labeled nodes flows to an unlabeled node of interest. Surprisingly, the proposed framework avoids trading accuracy for interpretability, but in fact leads to improved prediction accuracy, which is supported both by theoretical considerations and empirical results. The flow-based framework guarantees the maximum principle by construction and can handle directed graphs in an out-of-the-box manner.
... has numerous applications in partial differential equations [21,48,51], topology [44] and differential geometry [40], shape optimization [50], computer graphics [14], and even machine learning [6,54]. The main application area in mind for this work is that of electromagnetics, where it is often useful to partition tangential vector fields (e.g. ...
... where ψ = ψ(u, v) is a scalar function along Γ, F = F(u, v) is a tangential vector field defined with respect to the tangent vectors x_u and x_v: ... and the coefficients g^{ij} are the components of the inverse of g: ...
Preprint
The Laplace-Beltrami problem Δ_Γ ψ = f has several applications in mathematical physics, differential geometry, machine learning, and topology. In this work, we present novel second-kind integral equations for its solution which obviate the need for constructing a suitable parametrix to approximate the in-surface Green's function. The resulting integral equations are well-conditioned and compatible with standard fast multipole methods and iterative linear algebraic solvers, as well as more modern fast direct solvers. Using layer-potential identities known as Calderón projectors, the Laplace-Beltrami operator can be pre-conditioned from the left and/or right to obtain second-kind integral equations. We demonstrate the accuracy and stability of the scheme in several numerical examples along surfaces described by curvilinear triangles.
... ) is the weight between two neighboring nodes i and j in the data adjacency graph [33]. Both γ and γ_I are non-negative trade-off hyperparameters, and ω is the bandwidth hyper-parameter. ...

... The matrix L_M is the graph Laplacian as defined in [33]. We propose to solve the problem (13) efficiently by utilizing the projected gradient method (PGM) presented in [37]. ...
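Problem (13) itself is not reproduced in the excerpt, so the sketch below only illustrates the projected gradient method in a generic form: a gradient step followed by projection onto the feasible set, shown here with a simple nonnegativity constraint as a stand-in for whatever constraint set the cited work actually uses.

```python
import numpy as np

def projected_gradient(grad, x0, project, step=1e-2, iters=500, tol=1e-8):
    """Generic projected gradient method: take a gradient step, then project the
    iterate back onto the feasible set supplied as a callable."""
    x = project(x0.copy())
    for _ in range(iters):
        x_new = project(x - step * grad(x))
        if np.linalg.norm(x_new - x) < tol:   # stop once iterates stabilize
            break
        x = x_new
    return x

# Illustrative use: minimize ||Ax - b||^2 subject to x >= 0 (nonnegativity projection).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
grad = lambda x: 2 * A.T @ (A @ x - b)
x_star = projected_gradient(grad, np.zeros(5), project=lambda v: np.maximum(v, 0.0))
```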
Preprint
The goal of transfer learning is to improve the performance of target learning task by leveraging information (or transferring knowledge) from other related tasks. In this paper, we examine the problem of transfer distance metric learning (DML), which usually aims to mitigate the label information deficiency issue in the target DML. Most of the current Transfer DML (TDML) methods are not applicable to the scenario where data are drawn from heterogeneous domains. Some existing heterogeneous transfer learning (HTL) approaches can learn target distance metric by usually transforming the samples of source and target domain into a common subspace. However, these approaches lack flexibility in real-world applications, and the learned transformations are often restricted to be linear. This motivates us to develop a general flexible heterogeneous TDML (HTDML) framework. In particular, any (linear/nonlinear) DML algorithms can be employed to learn the source metric beforehand. Then the pre-learned source metric is represented as a set of knowledge fragments to help target metric learning. We show how generalization error in the target domain could be reduced using the proposed transfer strategy, and develop novel algorithm to learn either linear or nonlinear target metric. Extensive experiments on various applications demonstrate the effectiveness of the proposed method.
... Multi-parameter regularization is a widely used technique for addressing ill-posed problems [4,6,8,9,10,18,19,28]. Motivated by the challenges posed by big data in practical applications, sparse multi-parameter regularization using the ℓ_1 norm has become a prominent area of research [1,11,21,22,23,26]. ...
Preprint
This paper introduces a multi-parameter regularization approach using the ℓ_1 norm, designed to better adapt to complex data structures and problem characteristics while offering enhanced flexibility in promoting sparsity in regularized solutions. As data volumes grow, sparse representations of learned functions become critical for reducing computational costs during function operations. We investigate how the selection of multiple regularization parameters influences the sparsity of regularized solutions. Specifically, we characterize the relationship between these parameters and the sparsity of solutions under transform matrices, enabling the development of an iterative scheme for selecting parameters that achieve prescribed sparsity levels. Special attention is given to scenarios where the fidelity term is non-differentiable, and the transform matrix lacks full row rank. In such cases, the regularized solution, along with two auxiliary vectors arising in the sparsity characterization, are essential components of the multi-parameter selection strategy. To address this, we propose a fixed-point proximity algorithm that simultaneously determines these three vectors. This algorithm, combined with our sparsity characterization, forms the basis of a practical multi-parameter selection strategy. Numerical experiments demonstrate the effectiveness of the proposed approach, yielding regularized solutions with both predetermined sparsity levels and satisfactory approximation accuracy.
... Its core principle posits that similar nodes are likely to share the same labels, and the algorithm iteratively updates the labels of each node until the label distribution across the entire network attains a stable state. Current approaches include traditional semi-supervised learning algorithms using Gaussian fields and harmonic functions [25], graph-based label propagation methods [26], as well as LapSVM and LapRLS methods rooted in Laplacian regularization [27]. ...
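A compact sketch of the iterative label propagation scheme described above, assuming a nonnegative similarity matrix and clamping the labeled nodes at every iteration; this is a generic formulation rather than any one of the cited algorithms [25]-[27].

```python
import numpy as np

def label_propagation(W, labels, labeled_idx, n_classes, iters=200, tol=1e-6):
    """Iteratively propagate labels over a similarity graph until the label
    distribution stabilizes.

    W           : (N, N) nonnegative similarity matrix (nonzero row sums assumed)
    labels      : (N,)   integer array; only entries at labeled_idx are used
    labeled_idx : array of indices of labeled nodes (clamped every iteration)
    """
    N = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)        # row-normalized transition matrix
    F = np.zeros((N, n_classes))
    F[labeled_idx, labels[labeled_idx]] = 1.0   # one-hot initialization of labeled nodes
    for _ in range(iters):
        F_new = P @ F                           # diffuse label mass to neighbors
        F_new[labeled_idx] = 0.0
        F_new[labeled_idx, labels[labeled_idx]] = 1.0   # clamp the labeled nodes
        if np.abs(F_new - F).max() < tol:
            break
        F = F_new
    return F.argmax(axis=1)                     # predicted label for every node
```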
Article
Full-text available
Symmetric non-negative matrix factorization (SNMF) decomposes a similarity matrix into the product of an indicator matrix and its transpose, allowing clustering results to be directly extracted from the indicator matrix without additional clustering methods. Furthermore, SNMF has been shown to be effective in clustering nonlinearly separable data. SNMF-based clustering methods significantly depend on the quality of the pairwise similarity matrix, yet their effectiveness is often hindered by the reliance on predefined matrices in most semi-supervised SNMF approaches. Thus, we propose a novel algorithm, named semi-supervised symmetric non-negative matrix factorization with graph quality improvement and constraints (S³NMFGC), addressing this limitation by employing an integrated clustering strategy that dynamically generates and adaptively updates the similarity matrices. This is accomplished by integrating a weighted graph construction based on multiple clustering results, a label propagation algorithm, and pairwise constraint terms into a unified optimization framework that enhances the semi-supervised SNMF model. Subsequently, we adopt an alternating iterative update method to solve the optimization problem and prove its convergence. Rigorous experiments highlight the superiority of our model, which outperforms seven state-of-the-art NMF methods across six datasets.
... This learning paradigm shares similarities with transductive learning. Semi-supervised learning is grounded in two central principles: the cluster assumption [9] and the manifold assumption [14]. The cluster assumption posits that points within the same cluster tend to have identical labels, with transductive SVM [9] serving as a significant illustration. ...
Article
Full-text available
Machine learning has become indispensable across various domains, yet understanding its theoretical underpinnings remains challenging for many practitioners and researchers. Despite the availability of numerous resources, there is a need for a cohesive tutorial that integrates foundational principles with state-of-the-art theories. This paper addresses the fundamental concepts and theories of machine learning, with an emphasis on neural networks, serving as both a foundational exploration and a tutorial. It begins by introducing essential concepts in machine learning, including various learning and inference methods, followed by criterion functions, robust learning, discussions on learning and generalization, model selection, bias–variance trade-off, and the role of neural networks as universal approximators. Subsequently, the paper delves into computational learning theory, with probably approximately correct (PAC) learning theory forming its cornerstone. Key concepts such as the VC-dimension, Rademacher complexity, and empirical risk minimization principle are introduced as tools for establishing generalization error bounds in trained models. The fundamental theorem of learning theory establishes the relationship between PAC learnability, Vapnik–Chervonenkis (VC)-dimension, and the empirical risk minimization principle. Additionally, the paper discusses the no-free-lunch theorem, another pivotal result in computational learning theory. By laying a rigorous theoretical foundation, this paper provides a comprehensive tutorial for understanding the principles underpinning machine learning.
... Manifold regularization is widely used in various algorithms such as ridge regression, SVM, etc. The LapRLS/L model [38] minimizes the regression errors while preserving manifold smoothness, i.e. ...
Article
Full-text available
Least Squares Regression (LSR) is a powerful machine learning method for image classification and feature selection. In this study, a framework approach is introduced for the multi-classification problem based on the L_{2,p}-norm, utilizing more general loss functions and regularization terms, which is a robust sparse kernel-free quadratic surface least squares regression (RSQSLSR). The nonlinear relationship between features is addressed using a quadratic kernel-free technique combined with ϵ-dragging technology and manifold regularization to learn soft labels, which can achieve the goal of feature selection and classification simultaneously. This model utilizes K quadratic surfaces mapping samples from the input space to the label space, preserving the local structure of the samples. To enhance practical applications, such as image classification, a simplified version of the method is proposed. An iterative algorithm for RSQSLSR is designed and its convergence is proved theoretically. The salient features and theoretical analysis of our proposed method are comprehensively discussed in this paper. Extensive experiments on synthetic and real datasets validate the effectiveness of our method, surpassing other state-of-the-art methods in terms of classification accuracy and feature selection performance.
... First, we identify the k nearest neighbors for each point in the dataset, for some choice of integer k, using Euclidean distance. These constraints are important in the theoretical studies of manifold learning, but they are not so widely used in practically useful algorithms, as directly enforcing such constraints can be computationally expensive (Belkin et al., 2006; Fefferman et al., 2016; Berenfeld et al., 2022). We can loosely distinguish two kinds of manifold learning algorithm, local and global methods (Cayton, 2005): for local methods, the cost function considers the placement of each point with respect to its neighbors, whereas global methods tend to consider the relative placement of all points. ...
Article
Full-text available
Manifold learning and effective model building are generally viewed as fundamentally different types of procedure. After all, in one we build a simplified model of the data; in the other, we construct a simplified model of another model. Nonetheless, I argue that certain kinds of high-dimensional effective model building, and effective field theory construction in quantum field theory, can be viewed as special cases of manifold learning. I argue that this helps to shed light on all of these techniques. First, it suggests that the effective model building procedure depends upon a certain kind of algorithmic compressibility requirement. All three approaches assume that real-world systems exhibit certain redundancies, due to regularities. The use of these regularities to build simplified models is essential for scientific progress in many different domains.
... Manifold learning [2] is a non-linear dimensionality reduction technique in which we assume that high-dimensional data generally lie on a lower-dimensional manifold. The semi-supervised manifold regularisation framework can be defined as ...
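The truncated definition presumably refers to the standard manifold regularization objective of the paper under review; in the usual notation (empirical loss plus a squared RKHS norm and a graph Laplacian smoothness penalty over the l labeled and u unlabeled points) it reads:

```latex
f^{\star} \;=\; \operatorname*{arg\,min}_{f \in \mathcal{H}_K}\;
\frac{1}{l}\sum_{i=1}^{l} V\!\left(x_i, y_i, f\right)
\;+\; \gamma_A \,\lVert f \rVert_K^{2}
\;+\; \frac{\gamma_I}{(l+u)^{2}}\, \mathbf{f}^{\top} L \,\mathbf{f},
\qquad
\mathbf{f} = \bigl(f(x_1), \ldots, f(x_{l+u})\bigr)^{\top}
```

Hinge loss for V recovers LapSVM and squared loss recovers LapRLS, which is how the hypergraph variants discussed in the thesis below extend the same template.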
Thesis
Full-text available
This dissertation explores and extends existing work by developing two novel semi-supervised frameworks based on applications of hypergraphs in machine learning. Improved Hypergraph Laplacian Support Vector Machine (IHLSVM) and Hypergraph Regularized Semi-Supervised Least Squares Twin Support Vector Machine for Multi-label Learning (HMLLSTSVM). The IHLSVM framework combines Laplacian and hypergraph representations to better capture pairwise and higher-order interactions in the data, offering a robust approach to pattern classification. Meanwhile, HMLLSTSVM leverages hypergraph Laplacians and least-squares loss for efficient and accurate multilabel learning, particularly in scenarios with missing or sparse label information. Experimental evaluations on benchmark datasets validate the proposed methods' superior classification and multilabel learning capabilities, highlighting their efficacy in real-world applications like medical diagnosis, text classification, and image annotation.
... The HSNP-WMC systems apply only the simple addition at the update layer to combine the node feature vectors with the aggregated neighborhood vectors. Reported classification accuracies on the three citation network datasets:
[55]               0.595  0.601  0.707
DeepWalk [56]      0.672  0.432  0.653
Planetoid [52]     0.757  0.647  0.772
DGI [57]           0.823  0.718  0.768
Chebyshev [42]     0.812  0.698  0.744
GCN [34]           0.815  0.703  0.790
SGC [43]           0.810  0.719  0.789
GAT [36]           0.830  0.725  0.790
AGNN [58]          0.826  0.717  0.799
MVAN-CP [60]       0.844  0.733  0.795
AIR-GCN [59]       0.847  0.729  0.800
AIR-GAT [59]       0.845  0.735  0.800
HSNP-WMC systems   0.854  0.777  0.832 ...
Article
Full-text available
Spiking neural P (SN P) systems are membrane computing models inspired by the structure and the information transmission and processing mechanisms of nerve neurons. The hierarchical spiking neural P systems with weights on multiple channels (HSNP-WMC systems) is proposed as a novel variant of SN P systems with a hierarchical structure. In the HSNP-WMC systems, the rules contained in the neurons in a layer have the same form. Neurons on a layer process the inputted information in parallel and transmit the processed results to the next layer for further processing. The synapses connecting neurons on two different layers have multiple channels and each channel is assigned a weight. Based on the HSNP-WMC system, a graph-based node classification algorithm is developed. Experiments on node classification using three citation network datasets are performed to evaluate the performance of the HSNP-WMC systems. Experimental results show that the HSNP-WMC systems have much better performance than the 13 baseline methods in classification accuracy, providing evidence for the effectiveness of the HSNP-WMC systems for graph-based node classification.
... Moreover, in order to further maintain the structural information of the source and target domains during the projection process, manifold regularization (MR) is introduced to extract the local neighborhood features of the data through MR, and maintain this structure in the manifold space after the projection [25,26]. Its objective function is: ...
Article
Full-text available
Background: Traditional transfer methods are prone to losing data information in overall domain-level transfer, and it is difficult to achieve a perfect match between source and target domains, which reduces the accuracy of the soft sensor model. Methods: This paper proposes a soft sensor modeling method based on the transfer modeling framework of substructure domain. Firstly, the Gaussian mixture model clustering algorithm is used to extract local information, cluster the source and target domains into multiple substructure domains, and adaptively weight the substructure domains according to the distances between the sub-source domains and sub-target domains. Secondly, the optimal subspace domain adaptation method integrating multiple metrics is used to obtain the optimal projection matrices W_s and W_t that are coupled with each other, and the data of source and target domains are projected to the corresponding subspace to perform spatial alignment, so as to reduce the discrepancy between the sample data of different working conditions. Finally, based on the source and target domain data after substructure domain adaptation, the least squares support vector machine algorithm is used to establish the prediction model. Results: Taking Pichia pastoris fermentation to produce inulinase as an example, the simulation results verify that the root mean square error of the proposed soft sensor model in predicting Pichia pastoris concentration and inulinase concentration is reduced by 48.7% and 54.9%, respectively. Conclusion: The proposed soft sensor modeling method can accurately predict Pichia pastoris concentration and inulinase concentration online under different working conditions, and has higher prediction accuracy than the traditional soft sensor modeling method.
... To obtain drug-disease associations, we employed an improved dual Laplacian regularized Least Squares [52] with two drug and disease feature spaces' combined kernel matrices. The loss function is determined as: ...
Article
Full-text available
Drug-disease association prediction is increasingly recognized as crucial for a comprehensive understanding of the functions and mechanisms of drugs. However, the process of obtaining approval for a new drug to deal with a disease is often laborious, time-consuming and expensive. As a consequence, there is a growing interest among researchers from diverse fields in developing computational methods to identify drug-disease interactions. Thus, in this work, a new CFMKGATDDA method was proposed to unveil drug-disease associations. It firstly uses a collaborative filtering algorithm for mitigating the impact of sparse associations. It secondly provides a new way to fuse multiple similarities of drugs and diseases to obtain integrated similarities for drugs and diseases. Finally, it learns drugs and diseases’ embeddings by combining multiple kernels and graph attention networks to predict high-quality drug-disease associations. It attains a noticeable performance of drug-disease interaction prediction with remarkable averaged AUC and AUPR values of 0.9931 and 0.9334, respectively, on the Cdataset. When comparing on the same Cdataset, it outperforms other approaches in both metrics of AUC and AUPR. Thus, it can be regarded as a useful tool for revealing drug-disease associations.
... , S_L. Common selections of shift-invariant kernels include the diffusion kernels exp(σ²L_sym/2) with σ > 0, the p-step random walk kernels (aI − L_sym)^{−p} with a > 2 and p ≥ 1, the Laplacian regularization kernels I + σ²L_sym with σ > 0, and the spline kernels [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35]. For a SIGRKHS H with a shift-invariant kernel K, we can express any element in H as a linear combination of the columns of K, i.e., ...
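All of the kernel families listed above can be generated by applying a scalar function to the spectrum of the normalized Laplacian; the sketch below is a generic illustration only, and sign and inversion conventions vary between references, so the exact definitions in the cited works may differ.

```python
import numpy as np

def normalized_laplacian(A):
    # L_sym = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A with nonzero degrees.
    d = A.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.eye(A.shape[0]) - Dinv_sqrt @ A @ Dinv_sqrt

def spectral_kernel(L_sym, filt):
    # Apply a scalar function to the spectrum: K = U diag(filt(lam)) U^T.
    lam, U = np.linalg.eigh(L_sym)
    return U @ np.diag(filt(lam)) @ U.T

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L_sym = normalized_laplacian(A)

sigma, a, p = 1.0, 2.5, 2
K_diffusion = spectral_kernel(L_sym, lambda lam: np.exp(-sigma**2 * lam / 2))    # diffusion kernel
K_pstep     = spectral_kernel(L_sym, lambda lam: (a - lam) ** (-p))              # p-step random walk
K_lapreg    = spectral_kernel(L_sym, lambda lam: 1.0 / (1.0 + sigma**2 * lam))   # Laplacian regularization
```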
Preprint
In this paper, we introduce the concept of graph shift-invariant space (GSIS) on an undirected finite graph, which is the linear space of graph signals being invariant under graph shifts, and we study its bandlimiting, kernel reproducing and sampling properties. Graph bandlimited spaces have been widely applied where large datasets on networks need to be handled efficiently. In this paper, we show that every GSIS is a bandlimited space, and every bandlimited space is a principal GSIS. Functions in a reproducing kernel Hilbert space with shift-invariant kernel could be learnt with significantly low computational cost. In this paper, we demonstrate that every GSIS is a reproducing kernel Hilbert space with a shift-invariant kernel. Based on the nested Krylov structure of GSISs in the spatial domain, we propose a novel sampling and reconstruction algorithm with finite steps, with its performance tested for well-localized signals on circulant graphs and flight delay dataset of the 50 busiest airports in the USA.
... SSL has evolved rapidly in recent years as an exciting new area of research in statistics and machine learning. Researchers have shifted their focus from classification problems (Nigam et al. 2004; Zhang 2005, 2007; Belkin et al. 2006; Wang and Shen 2007; Wang et al. 2008; Vandewalle et al. 2013; Gallaugher and McNicholas 2018) to regression problems (Chakrabortty 2016; Zhang and Bradic 2022; Song et al. 2024b). Chakrabortty and Cai (2018) imputes the missing label via nonparametric regression and proposes a class of efficient and adaptive SS estimators of the coefficients in the working model. ...
Article
Full-text available
Estimating comparison functions is crucial in numerous domains, such as econometrics, clinical medicine, and public health, where evaluating the effectiveness of interventions or treatment effects is a central concern. Since the response variables are often much more expensive to collect than the covariates, to tackle the challenge of limited labeled data we present a unified semi-supervised learning (SSL) framework for estimating comparison functions, such as the difference in means between two independent samples, probabilities of events, or the survival competition probability, by leveraging unlabelled data with only covariate observations to improve estimation accuracy. Specifically, a class of efficient and adaptive estimators for comparison functions is proposed to effectively utilize both the labeled and unlabelled data under the semi-supervised (SS) framework. We establish the consistency and asymptotic normality of the proposed estimators and provide the optimal weight yielding the most efficient estimator. Furthermore, the resulting estimator is shown to be semiparametric efficient if the working model is correctly specified. Extensive numerical simulations are conducted to confirm the consistency and efficiency of our proposed estimators. An application to real data extracted from the 2001 Medical Expenditures Panel Survey (MEPS) is also included.
... This paper addresses the problem of classifying data manifolds that contain invariances with a number of continuous degrees of freedom. These invariances may be modeled using prior knowledge, manifold learning algorithms [9,10,11,12,13,14] or as generative neural networks via adversarial training [15]. Based upon knowledge of these structures, other work has considered building group-theoretic invariant representations [16] or constructing invariant metrics [17]. ...
Preprint
We consider the problem of classifying data manifolds where each manifold represents invariances that are parameterized by continuous degrees of freedom. Conventional data augmentation methods rely upon sampling large numbers of training examples from these manifolds; instead, we propose an iterative algorithm called M_{CP} based upon a cutting-plane approach that efficiently solves a quadratic semi-infinite programming problem to find the maximum margin solution. We provide a proof of convergence as well as a polynomial bound on the number of iterations required for a desired tolerance in the objective function. The efficiency and performance of M_{CP} are demonstrated in high-dimensional simulations and on image manifolds generated from the ImageNet dataset. Our results indicate that M_{CP} is able to rapidly learn good classifiers and shows superior generalization performance compared with conventional maximum margin methods using data augmentation methods.
... In this work, we discuss a multi-task learning approach that considers a notion of relatedness based on the concept of manifold regularization. In the scalar-valued function setting, Belkin et al. [14] introduced the concept of manifold regularization, which focuses on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Minh and Sindhwani [50] generalized the concept of manifold learning to vector-valued functions, exploiting output inter-dependencies while enforcing smoothness with respect to the input data geometry. ...
Preprint
In this paper, we study Nyström type subsampling for large scale kernel methods to reduce the computational complexity of big data. We discuss a multi-penalty regularization scheme based on Nyström type subsampling, which is motivated by well-studied manifold regularization schemes. We develop a theoretical analysis of the multi-penalty least-squares regularization scheme under a general source condition in the vector-valued function setting, so the results can also be applied to multi-task learning problems. We achieve the optimal minimax convergence rates of multi-penalty regularization using the concept of effective dimension for an appropriate subsampling size. We discuss an aggregation approach based on the linear function strategy to combine various Nyström approximants. Finally, we demonstrate the performance of multi-penalty regularization based on Nyström type subsampling on the Caltech-101 data set for multi-class image classification and the NSL-KDD benchmark data set for the intrusion detection problem.
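As a point of reference for the subsampling step discussed above, here is a minimal Python sketch of the plain Nyström approximation of a kernel matrix from m landmark points; the RBF kernel and all names are illustrative assumptions, and the paper's multi-penalty regularization and aggregation schemes are not reproduced here.

import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Pairwise RBF kernel between the rows of X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def nystrom_kernel_approximation(X, m, gamma=1.0, seed=0):
    """Approximate the n x n Gram matrix as C W^+ C^T using m random landmarks."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)
    C = rbf_kernel(X, X[idx], gamma)        # n x m cross-kernel block
    W = rbf_kernel(X[idx], X[idx], gamma)   # m x m landmark block
    return C @ np.linalg.pinv(W) @ C.T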
... In [39], classifiers of different views are enforced to agree on unlabeled data through a regularization term, in a graph-based semi-supervised framework. This is called co-regularization, and a similar idea was utilized in [43] under the theme of manifold regularization [44]. ...
Preprint
There is growing interest in multi-label image classification due to its critical role in web-based image analytics applications, such as large-scale image retrieval and browsing. Matrix completion (MC) has recently been introduced as a method for transductive (semi-supervised) multi-label classification, and has several distinct advantages, including robustness to missing data and background noise in both the feature and label spaces. However, it is limited by considering only data represented by a single-view feature, which cannot precisely characterize images containing several semantic concepts. To utilize multiple features taken from different views, we would have to concatenate the different features into a long vector, but this concatenation is prone to over-fitting and often leads to very high time complexity in MC-based image classification. Therefore, we propose to combine the MC outputs of different views with learned weights, and present the multi-view matrix completion (MVMC) framework for transductive multi-label image classification. To learn the view combination weights effectively, we apply a cross-validation strategy on the labeled set. In the learning process, we adopt the average precision (AP) loss, which is particularly suitable for multi-label image classification. A least squares loss formulation is also presented for the sake of efficiency, and the robustness of the algorithm based on the AP loss compared with the other losses is investigated. Experimental evaluation on two real-world datasets (PASCAL VOC'07 and MIR Flickr) demonstrates the effectiveness of MVMC for transductive (semi-supervised) multi-label image classification, and shows that MVMC can exploit complementary properties of different features and output consistent labels for improved multi-label image classification.
... Furthermore, existing KT techniques mostly ignore the geometry of the teacher's feature space, e.g., the manifolds that are formed, similarities between neighboring samples, etc., since they merely regress the output of the teacher network. However, it has been shown that exploiting this kind of information can significantly increase the quality of the learned model regardless of the application domain [2]. ...
Preprint
Knowledge Transfer (KT) techniques tackle the problem of transferring the knowledge from a large and complex neural network into a smaller and faster one. However, existing KT methods are tailored towards classification tasks and cannot be used efficiently for other representation learning tasks. In this paper a novel knowledge transfer technique is proposed that is capable of training a student model that maintains the same amount of mutual information between the learned representation and a set of (possibly unknown) labels as the teacher model. Apart from outperforming existing KT techniques, the proposed method allows for overcoming several limitations of existing methods, providing new insight into KT as well as novel KT applications, ranging from knowledge transfer from handcrafted feature extractors to cross-modal KT from the textual modality into the representation extracted from the visual modality of the data.
... given by (19). Then, the SCA algorithm for problem (16) proceeds as described in Algorithm 2. The updating scheme reads: at every iteration k, given the current estimate p[k], the first step of Algorithm 2 solves a surrogate optimization problem involving the objective function 1^T p, augmented with a proximal regularization term (with τ > 0), and the surrogate set T. Set k = 1. ...
Preprint
The goal of this paper is to propose novel strategies for adaptive learning of signals defined over graphs, which are observed over a (randomly time-varying) subset of vertices. We recast two classical adaptive algorithms in the graph signal processing framework, namely, the least mean squares (LMS) and the recursive least squares (RLS) adaptive estimation strategies. For both methods, a detailed mean-square analysis illustrates the effect of random sampling on the adaptive reconstruction capability and the steady-state performance. Then, several probabilistic sampling strategies are proposed to design the sampling probability at each node in the graph, with the aim of optimizing the tradeoff between steady-state performance, graph sampling rate, and convergence rate of the adaptive algorithms. Finally, a distributed RLS strategy is derived and is shown to be convergent to its centralized counterpart. Numerical simulations carried out over both synthetic and real data illustrate the good performance of the proposed sampling and reconstruction strategies for (possibly distributed) adaptive learning of signals defined over graphs.
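A rough sketch (our own simplification, in Python) of an LMS-style update for reconstructing a bandlimited graph signal from noisy observations on a sampled vertex set, in the spirit of the adaptive strategies summarized above; the fixed sampling set, the step size mu, and the assumption that U_F spans the true signal band are all simplifications relative to the paper's randomized sampling and RLS variants.

import numpy as np

def graph_lms(U_F, sample_idx, observations, mu=0.5):
    """LMS-style adaptive reconstruction of a bandlimited graph signal.

    U_F          : n x |F| matrix of Laplacian eigenvectors spanning the assumed band
    sample_idx   : indices of the vertices that are observed
    observations : iterable of length-n noisy observation vectors (only the entries
                   in sample_idx are actually used)
    """
    n = U_F.shape[0]
    B = U_F @ U_F.T                     # projector onto the bandlimited subspace
    mask = np.zeros(n)
    mask[np.asarray(sample_idx)] = 1.0  # vertex-sampling indicator
    x = np.zeros(n)
    for y in observations:
        x = x + mu * B @ (mask * (y - x))   # update driven only by sampled vertices
    return x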
... On the other side, there are techniques with completely different approaches, such as Laplacian SVM (Belkin et al, 2006), a manifold learning model for semi-supervised learning based on an ordinary Support Vector Machine (SVM) classifier supplemented with an additional manifold regularization term. This method was originally designed for Euclidean data, hence its scope is different from that of the previous models. ...
Preprint
A graph-based classification method is proposed for semi-supervised learning in the case of Euclidean data and for classification in the case of graph data. Our manifold learning technique is based on a convex optimization problem involving a convex quadratic regularization term and a concave quadratic loss function with a trade-off parameter carefully chosen so that the objective function remains convex. As shown empirically, the advantage of considering a concave loss function is that the learning problem becomes more robust in the presence of noisy labels. Furthermore, the loss function considered here is then more similar to a classification loss while several other methods treat graph-based classification problems as regression problems.
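To make the manifold regularization term mentioned in the excerpt concrete, below is a minimal Laplacian Regularized Least Squares (LapRLS) sketch in Python; it is the least-squares counterpart of the Laplacian SVM, and the RBF kernel, kNN graph construction, and hyperparameter values are assumptions made purely for illustration.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

def laplacian_rls(X_lab, y_lab, X_unlab, gamma_A=1e-2, gamma_I=1e-2,
                  rbf_gamma=1.0, n_neighbors=5):
    """Fit f(x) = sum_i alpha_i K(x, x_i) over labeled + unlabeled points."""
    X = np.vstack([X_lab, X_unlab])
    l, n = len(X_lab), len(X)
    K = rbf_kernel(X, X, gamma=rbf_gamma)                  # Gram matrix over all points
    W = kneighbors_graph(X, n_neighbors, mode='connectivity', include_self=False)
    W = 0.5 * (W + W.T).toarray()                          # symmetrized kNN adjacency
    L = np.diag(W.sum(axis=1)) - W                         # unnormalized graph Laplacian
    J = np.zeros((n, n)); J[:l, :l] = np.eye(l)            # selects the labeled block
    y = np.zeros(n); y[:l] = y_lab                         # labels padded with zeros
    # One common closed form for the expansion coefficients:
    A = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n**2) * (L @ K)
    alpha = np.linalg.solve(A, y)
    return alpha, X

# Prediction at a new point x: f(x) = rbf_kernel(x[None, :], X) @ alpha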
... More specifically, Spectral regression discriminant analysis (SRDA) [15] solves LDA efficiently and is a special case of spectral regression (SR) [17] consisting of two phases: spectral analysis (SA) of an embedded graph and regression. Using manifold regularization [18], LDA was extended to the semi-supervised setting leading to semi-supervised discriminant analysis (SDA) [19]. ...
Preprint
High-dimensional data requires scalable algorithms. We propose and analyze three scalable and related algorithms for semi-supervised discriminant analysis (SDA). These methods are based on Krylov subspace methods, which exploit the data sparsity and the shift-invariance of Krylov subspaces. In addition, the problem definition is improved by adding centralization to the semi-supervised setting. The proposed methods are evaluated on an industry-scale data set from a pharmaceutical company to predict compound activity on target proteins. The results show that SDA achieves good predictive performance and our methods only require a few seconds, significantly improving computation time over the previous state of the art.
... We refer the reader to (Chapelle et al., 2006; Zhu, 2008) and the references therein for a comprehensive survey on the topic of semi-supervised and transductive learning. Theoretical analysis of the generalisation error and the excess risk in this context can be found in (Rigollet, 2007; Wang and Shen, 2007; Lafferty and Wasserman, 2007), whereas the closely related area of manifold learning is studied in (Belkin et al., 2006; Nadler et al., 2009; Niyogi, 2013). The purpose of the present work differs from these papers in that we put the emphasis on the high-dimensional setting and the sparsity assumption. ...
Preprint
In this paper we revisit the risk bounds of the lasso estimator in the context of transductive and semi-supervised learning. In other terms, the setting under consideration is that of regression with random design under partial labeling. The main goal is to obtain user-friendly bounds on the off-sample prediction risk. To this end, the simple setting of bounded response variable and bounded (high-dimensional) covariates is considered. We propose some new adaptations of the lasso to these settings and establish oracle inequalities both in expectation and in deviation. These results provide non-asymptotic upper bounds on the risk that highlight the interplay between the bias due to the mis-specification of the linear model, the bias due to the approximate sparsity and the variance. They also demonstrate that the presence of a large number of unlabeled features may have significant positive impact in the situations where the restricted eigenvalue of the design matrix vanishes or is very small.
... In particular, the labeled data is the key factor in the excess risk without additional assumptions on the marginal distribution. This observation is consistent with previous analyses of semi-supervised learning [1,6]. ...
Preprint
Learning with the Fredholm kernel has attracted increasing attention recently since it can effectively utilize the data information to improve prediction performance. Despite rapid progress on theoretical and experimental evaluations, its generalization analysis has not been explored in the learning theory literature. In this paper, we establish the generalization bound of least square regularized regression with the Fredholm kernel, which implies that the fast learning rate O(l^{-1}) can be reached under mild capacity conditions. Simulated examples show that this Fredholm regression algorithm can achieve satisfactory prediction performance.
... In this case, each data instance is represented by a vertex and is linked to other vertices according to a predefined affinity rule. The labels are propagated to the whole graph using a particular optimization heuristic [11]. ...
Preprint
The emergence of collective dynamics in neural networks is a mechanism of the animal and human brain for information processing. In this paper, we develop a computational technique using distributed processing elements in a complex network, called particles, to solve semi-supervised learning problems. Three actions govern the particles' dynamics: generation, walking, and absorption. Labeled vertices generate new particles that compete against rival particles for edge domination. Active particles randomly walk in the network until they are absorbed by either a rival vertex or an edge currently dominated by rival particles. The result of the model evolution consists of sets of edges arranged by label dominance. Each set tends to form a connected subnetwork to represent a data class. Although the intrinsic dynamics of the model is stochastic, we prove that there exists a deterministic version with largely reduced computational complexity; specifically, with linear growth. Furthermore, the edge domination process corresponds to an unfolding map in such a way that edges "stretch" and "shrink" according to the vertex-edge dynamics. Consequently, the unfolding effect summarizes the relevant relationships between vertices and the uncovered data classes. The proposed model captures important details of connectivity patterns over the vertex-edge dynamics evolution, in contrast to previous approaches which focused on only vertex or only edge dynamics. Computer simulations reveal that the new model can identify nonlinear features in both real and artificial data, including boundaries between distinct classes and overlapping structures of data.
Chapter
As industrial equipment becomes more modern, with increasingly complex structures, the health assessment and diagnosis of equipment has become a hot research direction. However, health assessment and diagnosis technology faces great challenges due to the complex coupling and high degree of nonlinearity in industrial equipment, and the information used to identify the health state often appears highly nonlinear, non-stationary, and non-Gaussian. The recent development of differential geometry theory and its advantages in nonlinear analysis [1] provide an important tool for dealing with nonlinear problems in the field of fault diagnosis [2]. The application of differential geometry techniques to the health assessment and diagnosis of industrial equipment is therefore well aligned with current research needs.
Preprint
Full-text available
Spatial transcriptomics has significantly advanced our ability to map gene expression within native tissue contexts. However, current low-resolution technologies are constrained by limited spatial resolution and tissue coverage. We present PanoSpace, a novel computational framework that integrates low-resolution spatial transcriptomics data with high-resolution histological images and matched single-cell RNA sequencing references. PanoSpace achieves comprehensive single-cell level and whole-tissue analysis by accurately inferring spatial localization, cell type, and gene expression for all cells across entire tissue slides. It also facilitates exploration of intra-cell-type heterogeneity and cell-cell interactions within spatial contexts. Application of PanoSpace to breast, prostate, and cervical cancer tissues reveals detailed cell-type distributions and gene expression patterns with unprecedented resolution and coverage. Furthermore, through analysis of interactions with cancer-associated fibroblasts, PanoSpace uncovers intra-cell-type heterogeneity and provides novel insights into tumor microenvironment dynamics. These findings highlight PanoSpace as a powerful tool for offering insights beyond the reach of existing technologies and computational methods.
Article
In graph-based semi-supervised learning, the Green-function method is a classical method that works by computing the Green's function in the graph space. However, when applied to large graphs, especially sparse ones, this method performs unstably and unsatisfactorily. We make a detailed analysis of it and propose a novel method from the perspective of optimization. On fully connected graphs, the method is equivalent to the Green-function method and can be seen as another interpretation with physical meaning, while on non-fully connected graphs, it helps to explain why the Green-function method performs poorly on large sparse graphs. To resolve this dilemma, we propose a workable approach to improve our proposed method. Unlike the original method, our improved method can also apply two accelerating techniques, Gaussian Elimination and Anchored Graphs, to become more efficient on large graphs. Finally, extensive experiments confirm our conclusions and the efficiency, accuracy, and stability of our improved Green's function method.
Preprint
Manifold learning is a hot research topic in the field of computer science and has many applications in the real world. A main drawback of manifold learning methods, however, is that there is no explicit mapping from the input data manifold to the output embedding. This prohibits the application of manifold learning methods to many practical problems such as classification and target detection. Previously, in order to provide explicit mappings for manifold learning methods, many methods have been proposed to obtain an approximate explicit representation mapping under the assumption that there exists a linear projection between the high-dimensional data samples and their low-dimensional embedding. However, this linearity assumption may be too restrictive. In this paper, an explicit nonlinear mapping is proposed for manifold learning, based on the assumption that there exists a polynomial mapping between the high-dimensional data samples and their low-dimensional representations. As far as we know, this is the first time that an explicit nonlinear mapping for manifold learning has been given. In particular, we apply this to the method of Locally Linear Embedding (LLE) and derive an explicit nonlinear manifold learning algorithm, named Neighborhood Preserving Polynomial Embedding (NPPE). Experimental results on both synthetic and real-world data show that the proposed mapping is much more effective in preserving the local neighborhood information and the nonlinear geometry of the high-dimensional data samples than previous work.
Preprint
Accurate determination of the regularization parameter in inverse problems still represents an analytical challenge, owing mainly to the considerable difficulty of separating the unknown noise from the signal. We present a new approach for determining the parameter for the general-form Tikhonov regularization of linear ill-posed problems. In our approach the parameter is found by approximate minimization of the distance between the unknown noiseless data and the data reconstructed from the regularized solution. We approximate this distance by employing the Picard parameter to separate the noise from the data in the coordinate system of the generalized SVD. A simple and reliable algorithm for the estimation of the Picard parameter enables accurate implementation of the above procedure. We demonstrate the effectiveness of our method on several numerical examples. A MATLAB-based implementation of the proposed algorithms can be found at https://www.weizmann.ac.il/condmat/superc/software/
Preprint
The search for optimal configurations of pointsets, the most notable examples being the problems of Kepler and Thompson, has an extremely rich history with diverse applications in physics, chemistry, communication theory, and scientific computing. In this paper, we introduce and study a new optimality criterion for pointset configurations. Namely, we consider a certain weighted graph associated with a pointset configuration and seek configurations which minimize certain spectral properties of the adjacency matrix or graph Laplacian defined on this graph, subject to geometric constraints on the pointset configuration. This problem can be motivated by solar cell design and swarming models, and we consider several spectral functions with interesting interpretations, such as spectral radius, algebraic connectivity, effective resistance, and condition number. We prove that the regular simplex extremizes several spectral invariants on the sphere. We also consider pointset configurations on flat tori via (i) the analogous problem on lattices and (ii) a variety of computational experiments. For many of the objectives considered (but not all), the triangular lattice is extremal.
Article
Full-text available
In this paper, we study a family of semi-supervised learning algorithms for "aligning" different data sets that are characterized by the same underlying manifold. The optimizations of these algorithms are based on graphs that provide a discretized approximation to the manifold. Partial alignments of the data sets—obtained from prior knowledge of their manifold structure or from pairwise correspondences of subsets of labeled examples—are completed by integrating supervised signals with unsupervised frameworks for manifold learning. As an illustration of this semi-supervised setting, we show how to learn mappings between different data sets of images that are parameterized by the same underlying modes of variability (e.g., pose and viewing angle). The curse of dimensionality in these problems is overcome by exploiting the low-dimensional structure of image manifolds. ... due to the curse of dimensionality and associated large computational demands. However, in many cases, the statistical analysis of these data sets may be tractable due to an underlying low-dimensional manifold structure in the data. Recently, a series of learning algorithms that approximate data manifolds have been developed, such as Isomap (15), locally linear embedding (13), Laplacian eigenmaps (3), Hessian eigenmaps (7), and charting (5). While these algorithms approach the problem of learning manifolds from an unsupervised perspective, in this paper we address the problem of establishing a regression between two or more data sets by aligning their underlying manifolds. We show how to align the low-dimensional representations of the data sets given some additional information about the mapping between the data sets. Our algorithm relies upon optimization over a graphical representation of the data, where edges in the graphs are computed to preserve local structure in the data. This optimization yields a common low-dimensional embedding which can then be used to map samples between the disparate data sets. Two main approaches for the alignment of manifolds are presented. In the first approach, additional knowledge about the intrinsic embedding coordinates of some of the samples is used to constrain the alignment. This information about coordinates may be available given knowledge about the data generating process, or when some coordinates are manually assigned to correspond to certain labeled samples. Our algorithm yields a graph embedding where these known coordinates are preserved. Given multiple data sets with such coordinate labels, we show how the underlying data manifolds can be aligned to each other through a common
Conference Paper
Full-text available
Text categorization - the assignment of natural language texts to one or more predefined categories based on their content - is an important component in many information organization and management tasks. We compare the effectiveness of five different automatic learning algorithms for text categorization in terms of learning speed, real-time classification speed, and classification accuracy. We also examine training set size, and alternative document representations. Very accurate text classifiers can be learned automatically from training examples. Linear Support Vector Machines (SVMs) are particularly promising because they are very accurate, quick to train, and quick to evaluate.
Conference Paper
Full-text available
We introduce a family of kernels on graphs based on the notion of regularization operators. This generalizes in a natural way the notion of regularization and Green's functions, as commonly used for real-valued functions, to graphs. It turns out that diffusion kernels can be found as a special case of our reasoning. We show that the class of positive, monotonically decreasing functions on the unit interval leads to kernels and corresponding regularization operators.
Article
Full-text available
Regularization Networks and Support Vector Machines are techniques for solving certain problems of learning from examples – in particular, the regression problem of approximating a multivariate function from sparse data. Radial Basis Functions, for example, are a special case of both regularization and Support Vector Machines. We review both formulations in the context of Vapnik's theory of statistical learning which provides a general foundation for the learning problem, combining functional analysis and statistics. The emphasis is on regression: classification is treated as a special case.
Article
Full-text available
One of the central problems in machine learning and pattern recognition is to develop appropriate representations for complex data. We consider the problem of constructing a representation for data lying on a low-dimensional manifold embedded in a high-dimensional space. Drawing on the correspondence between the graph Laplacian, the Laplace Beltrami operator on the manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for representing the high-dimensional data. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Some potential applications and illustrative examples are discussed.
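A compact Python sketch of the embedding step described above: build a kNN graph, form its normalized Laplacian, and embed each point with the eigenvectors of the smallest non-trivial eigenvalues. The neighbourhood size, the binary (rather than heat-kernel) weights, and the dense eigensolver are simplifications on our part.

import numpy as np
from scipy.sparse import csgraph
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, dim=2, n_neighbors=10):
    """Embed the rows of X into `dim` dimensions with Laplacian eigenmaps."""
    W = kneighbors_graph(X, n_neighbors, mode='connectivity', include_self=False)
    W = 0.5 * (W + W.T)                                  # make the graph undirected
    L = csgraph.laplacian(W, normed=True).toarray()      # symmetric normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    return eigvecs[:, 1:dim + 1]                         # drop the trivial constant eigenvector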
Article
Full-text available
We provide a framework for structural multiscale geometric organization of graphs and subsets of R^n. We use diffusion semigroups to generate multiscale geometries in order to organize and represent complex structures. We show that appropriately selected eigenfunctions or scaling functions of Markov matrices, which describe local transitions, lead to macroscopic descriptions at different scales. The process of iterating or diffusing the Markov matrix is seen as a generalization of some aspects of the Newtonian paradigm, in which local infinitesimal transitions of a system lead to global macroscopic descriptions by integration. We provide a unified view of ideas from data analysis, machine learning, and numerical analysis.
Article
Full-text available
We describe a method for recovering the underlying parametrization of scattered data (m_i) lying on a manifold M embedded in high-dimensional Euclidean space. The method, Hessian-based locally linear embedding, derives from a conceptual framework of local isometry in which the manifold M, viewed as a Riemannian submanifold of the ambient Euclidean space R^n, is locally isometric to an open, connected subset Θ of Euclidean space R^d. Because Θ does not have to be convex, this framework is able to handle a significantly wider class of situations than the original ISOMAP algorithm. The theoretical framework revolves around a quadratic form H(f) = ∫_M ||H_f(m)||_F² dm defined on functions f: M → R. Here H_f denotes the Hessian of f, and H(f) averages the Frobenius norm of the Hessian over M. To define the Hessian, we use orthogonal coordinates on the tangent planes of M. The key observation is that, if M truly is locally isometric to an open, connected subset of R^d, then H(f) has a (d + 1)-dimensional null space consisting of the constant functions and a d-dimensional space of functions spanned by the original isometric coordinates. Hence, the isometric coordinates can be recovered up to a linear isometry. Our method may be viewed as a modification of locally linear embedding and our theoretical framework as a modification of the Laplacian eigenmaps framework, where we substitute a quadratic form based on the Hessian in place of one based on the Laplacian.
Article
Full-text available
There has been an increase of interest for semi-supervised learning recently, because of the many datasets with large amounts of unlabeled examples and only a few labeled ones. This paper follows up on proposed non-parametric algorithms which provide an estimated continuous label for the given unlabeled examples. It extends them to function induction algorithms that correspond to the minimization of a regularization criterion applied to an out-of-sample example, and happens to have the form of a Parzen windows regressor. The advantage of the extension is that it allows predicting the label for a new example without having to solve again a linear system of dimension 'n' (the number of unlabeled and labeled training examples), which can cost O(n^3). Experiments show that the extension works well, in the sense of predicting a label close to the one that would have been obtained if the test example had been included in the unlabeled set. This relatively efficient function induction procedure can also be used when 'n' is large to approximate the solution by writing it only in terms of a kernel expansion with 'm' << 'n' terms, and reducing the linear system to 'm' equations in 'm' unknowns.
Article
Full-text available
Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel.
Article
Full-text available
We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data.
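The smooth solution referred to above is commonly computed with a simple iteration of the form F ← αSF + (1 − α)Y, where S is the symmetrically normalized affinity matrix. The Python sketch below assumes a precomputed affinity matrix W with zero diagonal and one-hot initial labels Y, and is an illustration rather than the paper's exact procedure.

import numpy as np

def local_global_consistency(W, Y, alpha=0.99, n_iter=100):
    """Iterative label spreading on a graph.

    W : n x n symmetric affinity matrix (zero diagonal)
    Y : n x c initial label matrix, one-hot rows for labeled points, zeros otherwise
    """
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    S = (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]   # D^{-1/2} W D^{-1/2}
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y             # propagate, then re-inject labels
    return F.argmax(axis=1)                               # predicted class for every vertex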
Article
Full-text available
We address in this paper the question of how the knowledge of the marginal distribution P(x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations.
Article
Full-text available
The application of kernel-based learning algorithms has, so far, largely been confined to real-valued data and a few special data types, such as strings. In this paper we propose a general method of constructing natural families of kernels over discrete structures, based on the matrix exponentiation idea. In particular, we focus on generating kernels on graphs, for which we propose a special class of exponential kernels called diffusion kernels, which are based on the heat equation and can be regarded as the discretization of the familiar Gaussian kernel of Euclidean space.
Article
We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including Support Vector Machines and Regularized Least Squares can be obtained as special cases. We utilize properties of Reproducing Kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and are thus able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. In the absence of labeled examples, our framework gives rise to a regularized form of spectral clustering with an out-of-sample extension.
Article
This thesis shows that several old, somewhat discredited machine learning techniques are still valuable in the solution of modern, large-scale machine learning problems. We begin by considering Tikhonov regularization, a broad framework of schemes for binary classification. Tikhonov regularization attempts to find a function which simultaneously has small empirical loss on a training set and small norm in a Reproducing Kernel Hilbert Space. The choice of loss function determines the learning scheme. Using the hinge loss gives rise to the now well-known Support Vector Machine algorithm. We present SvmFu, a state-of-the-art SVM solver developed as part of this thesis. We discuss the design and implementation issues involved in SvmFu, present empirical results on its performance, and offer general guidance on the use of SVMs to solve machine learning problems. We also consider, and advocate in many cases, the use of the more classical square loss, giving rise to the Regularized Least Squares Classification algorithm. RLSC is "trained" by solving a single system of linear equations. While it is widely believed that the SVM will perform substantially better than RLSC, we note that the same generalization bounds that apply to SVMs apply to RLSC, and we demonstrate empirically on both toy and real-world examples that RLSC's performance is essentially equivalent to SVMs across a wide range of problems, implying that the choice between SVM and RLSC should be based on computational tractability considerations. We demonstrate the empirical advantages and properties of RLSC, discussing the tradeoffs between RLSC and SVMs.
Article
We provide a framework for structural multiscale geometric organization of graphs and subsets of R^n. We use diffusion semigroups to generate multiscale geometries in order to organize and represent complex structures. We show that appropriately selected eigenfunctions or scaling functions of Markov matrices, which describe local transitions, lead to macroscopic descriptions at different scales. The process of iterating or diffusing the Markov matrix is seen as a generalization of some aspects of the Newtonian paradigm, in which local infinitesimal transitions of a system lead to global macroscopic descriptions by integration. In Part I below, we provide a unified view of ideas from data analysis, machine learning and numerical analysis. In Part II [1], we augment this approach by introducing fast order-N algorithms for homogenization of heterogeneous structures as well as for data representation.
Book
This monograph is based on a series of 10 lectures at Ohio State University at Columbus, March 23–27, 1987, sponsored by the Conference Board of the Mathematical Sciences and the National Science Foundation. The selection of topics is quite personal and, together with the talks of the other speakers, the lectures represent a story, as I saw it in March 1987, of many of the interesting things that statisticians can do with splines. I told the audience that the priority order for topic selection was, first, obscure work of my own and collaborators, second, other work by myself and students, with important work by other speakers deliberately omitted in the hope that they would mention it themselves. This monograph will more or less follow that outline, so that it is very much slanted toward work I had some hand in, although I will try to mention at least by reference important work by the other speakers and some of the attendees. The other speakers were (in alphabetical order), Dennis Cox, Randy Eubank, Ker-Chau Li, Douglas Nychka, David Scott, Bernard Silverman, Paul Speckman, and James Wendelberger. The work of Finbarr O'Sullivan, who was unable to attend, in extending the developing theory to the non-Gaussian and nonlinear case will also play a central role, as will the work of Florencio Utreras.
Article
A main theme of this report is the relationship of approximation to learning and the primary role of sampling (inductive inference). We try to emphasize relations of the theory of learning to the main stream of mathematics.
Article
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
Conference Paper
We consider the problem of labeling a partially labeled graph. This setting may arise in a number of situations, from survey sampling to information retrieval to pattern recognition in manifold settings. It is also of particular practical importance when data is abundant but labeling is expensive or requires human assistance. Our approach develops a framework for regularization on such graphs parallel to Tikhonov regularization on continuous spaces. The algorithms are very simple and involve solving a single, usually sparse, system of linear equations. Using the notion of algorithmic stability, we derive bounds on the generalization error and relate it to structural invariants of the graph.
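As a rough illustration of the "single sparse linear system" remark above, one standard graph-Tikhonov formulation minimizes the sum over labeled vertices of (f_i − y_i)² plus γ f^T L f, whose minimizer solves (S + γL) f = S y with S the diagonal indicator of labeled vertices. The Python sketch below uses our own notation and is not necessarily the paper's exact algorithm; it performs that single sparse solve.

import numpy as np
from scipy.sparse import csgraph, csr_matrix, diags
from scipy.sparse.linalg import spsolve

def graph_tikhonov(W, labeled_idx, y_labeled, gamma=0.1):
    """Solve (S + gamma * L) f = S y on a weighted graph with weight matrix W.

    Assumes every connected component of the graph contains at least one labeled vertex.
    """
    W = csr_matrix(W)
    n = W.shape[0]
    L = csgraph.laplacian(W)                      # combinatorial Laplacian (sparse)
    s = np.zeros(n)
    s[np.asarray(labeled_idx)] = 1.0              # indicator of labeled vertices
    y = np.zeros(n)
    y[np.asarray(labeled_idx)] = y_labeled
    A = (diags(s) + gamma * L).tocsc()
    return spsolve(A, s * y)                      # one sparse linear solve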
Conference Paper
We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results.
Conference Paper
We formulate the problem of graph inference where part of the graph is known as a supervised learning problem, and propose an algorithm to solve it. The method involves the learning of a mapping of the vertices to a Euclidean space where the graph is easy to infer, and can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of metabolic network reconstruction from genomic data.
Conference Paper
In the machine learning community it is generally believed that graph Laplacians corresponding to a finite sample of data points converge to a continuous Laplace operator if the sample size increases. Even though this assertion serves as a justification for many Laplacian-based algorithms, so far only some aspects of this claim have been rigorously proved. In this paper we close this gap by establishing the strong pointwise consistency of a family of graph Laplacians with data-dependent weights to some weighted Laplace operator. Our investigation also includes the important case where the data lies on a submanifold of R^d.
Conference Paper
Due to its occurrence in engineering domains and implications for natural learning, the problem of utilizing unlabeled data is attracting increasing attention in machine learning. A large body of recent literature has focused on the transductive setting, where labels of unlabeled examples are estimated by learning a function defined only over the point cloud data. In a truly semi-supervised setting, however, a learning machine has access to labeled and unlabeled examples and must make predictions on data points never encountered before. In this paper, we show how to turn transductive and standard supervised learning algorithms into semi-supervised learners. We construct a family of data-dependent norms on Reproducing Kernel Hilbert Spaces (RKHS). These norms allow us to warp the structure of the RKHS to reflect the underlying geometry of the data. We derive explicit formulas for the corresponding new kernels. Our approach demonstrates state-of-the-art performance on a variety of classification tasks.
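The explicit kernel formulas mentioned at the end of this abstract are usually written (in our notation, with conventions that may differ slightly from the paper) as

\tilde{K}(x, z) \;=\; K(x, z) \;-\; \mathbf{k}_x^{\top}\,(I + M G)^{-1} M\,\mathbf{k}_z,

where G is the Gram matrix of the original kernel K over the labeled and unlabeled points, \mathbf{k}_x = (K(x, x_1), \dots, K(x, x_n))^{\top}, and M is a symmetric positive semi-definite matrix encoding the point-cloud geometry, for instance a scaled graph Laplacian; the deformed kernel \tilde{K} can then be plugged into any standard kernel method.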
Conference Paper
Many application domains suffer from not having enough labeled training data for learning. However, large amounts of unlabeled examples can often be gathered cheaply. As a result, there has been a great deal of work in recent years on how unlabeled data can be used to aid classification. We consider an algorithm based on finding minimum cuts in graphs, that uses pairwise relationships among the examples in order to learn from both labeled and unlabeled data. Our algorithm
Conference Paper
An approach to semi-supervised learning is proposed that is based on a Gaussian random field model. Labeled and unlabeled data are represented as vertices in a weighted graph, with edge weights encoding the similarity between instances. The learning problem is then formulated in terms of a Gaussian random field on this graph, where the mean of the field is characterized in terms of harmonic functions, and is efficiently obtained using matrix methods or belief propagation. The resulting learning algorithms have intimate connections with random walks, electric networks, and spectral graph theory. We discuss methods to incorporate class priors and the predictions of classifiers obtained by supervised learning. We also propose a method of parameter learning by entropy minimization, and show the algorithm's ability to perform feature selection. Promising experimental results are presented for synthetic data, digit classification, and text classification tasks.
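A minimal Python sketch of the harmonic-function solution underlying this model: labeled vertices are clamped to their values and the unlabeled values satisfy f_u = (D_uu − W_uu)^{-1} W_ul f_l; the dense linear solve and real-valued labels are simplifying assumptions on our part.

import numpy as np

def harmonic_function(W, labeled_idx, f_labeled):
    """Clamp labeled vertices and solve for the harmonic values on the rest."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L = np.diag(W.sum(axis=1)) - W                        # combinatorial Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f = np.zeros(n)
    f[labeled_idx] = f_labeled
    f[unlabeled_idx] = np.linalg.solve(L_uu, W_ul @ np.asarray(f_labeled, dtype=float))
    return f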
Article
We believe that the cluster assumption is key to successful semi-supervised learning. Based on this, we propose three semi-supervised algorithms: 1. deriving graph-based distances that emphasize low density regions between clusters, followed by training a standard SVM; 2. optimizing the Transductive SVM objective function, which places the decision boundary in low density regions, by gradient descent; 3. combining the first two to make maximum use of the cluster assumption. We compare with state of the art algorithms and demonstrate superior accuracy for the latter two methods.
Article
Thesis (Ph. D.)--University of Chicago, Dept. of Mathematics, August 2003. Includes bibliographical references.
Article
Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs—30,000 auditory nerve fibers or 10^6 optic nerve fibers—a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.
Article
We propose a framework to incorporate unlabeled data in kernel classifiers, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach.
Article
We equate nonlinear dimensionality reduction (NLDR) to graph embedding with side information about the vertices, and derive a solution to either problem in the form of a kernel-based mixture of affine maps from the ambient space to the target space. Unlike most spectral NLDR methods, the central eigenproblem can be made relatively small, and the result is a continuous mapping defined over the entire space, not just the datapoints. A demonstration is made to visualizing the distribution of word usages (as a proxy to word meanings) in a sample of the machine learning literature.
Article
We formulate a principle for classification with the knowledge of the marginal distribution over the data points (unlabeled data). The principle is cast in terms of Tikhonov style regularization where the regularization penalty articulates the way in which the marginal density should constrain otherwise unrestricted conditional distributions.
Article
We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples.