Machine Learning (2020) 109:373–440
https://doi.org/10.1007/s10994-019-05855-6
A survey on semi-supervised learning
Jesper E. van Engelen1 · Holger H. Hoos1,2
Received: 3 December 2018 / Revised: 20 September 2019 / Accepted: 29 September 2019 /
Published online: 15 November 2019
© The Author(s) 2019
Abstract
Semi-supervised learning is the branch of machine learning concerned with using labelled
as well as unlabelled data to perform certain learning tasks. Conceptually situated between
supervised and unsupervised learning, it permits harnessing the large amounts of unlabelled
data available in many use cases in combination with typically smaller sets of labelled data.
In recent years, research in this area has followed the general trends observed in machine
learning, with much attention directed at neural network-based models and generative learn-
ing. The literature on the topic has also expanded in volume and scope, now encompassing a
broad spectrum of theory, algorithms and applications. However, no recent surveys exist to
collect and organize this knowledge, impeding the ability of researchers and engineers alike
to utilize it. Filling this void, we present an up-to-date overview of semi-supervised learn-
ing methods, covering earlier work as well as more recent advances. We focus primarily on
semi-supervised classification, where the large majority of semi-supervised learning research
takes place. Our survey aims to provide researchers and practitioners new to the field as well
as more advanced readers with a solid understanding of the main approaches and algorithms
developed over the past two decades, with an emphasis on the most prominent and currently
relevant work. Furthermore, we propose a new taxonomy of semi-supervised classification
algorithms, which sheds light on the different conceptual and methodological approaches for
incorporating unlabelled data into the training process. Lastly, we show how the fundamental
assumptions underlying most semi-supervised learning algorithms are closely connected to
each other, and how they relate to the well-known semi-supervised clustering assumption.
Keywords: Semi-supervised learning · Machine learning · Classification
Editor: Tom Fawcett.

Corresponding author: Jesper E. van Engelen, jesper.van.engelen@gmail.com
Holger H. Hoos, hh@liacs.nl

1 Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
2 Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
1 Introduction
In machine learning, a distinction has traditionally been made between two major tasks:
supervised and unsupervised learning (Bishop 2006). In supervised learning, one is presented
with a set of data points consisting of some input x and a corresponding output value y. The
goal is, then, to construct a classifier or regressor that can estimate the output value for
previously unseen inputs. In unsupervised learning, on the other hand, no specific output
value is provided. Instead, one tries to infer some underlying structure from the inputs. For
instance, in unsupervised clustering, the goal is to infer a mapping from the given inputs (e.g.
vectors of real numbers) to groups such that similar inputs are mapped to the same group.
Semi-supervised learning is a branch of machine learning that aims to combine these
two tasks (Chapelle et al. 2006b; Zhu 2008). Typically, semi-supervised learning algorithms
attempt to improve performance in one of these two tasks by utilizing information generally
associated with the other. For instance, when tackling a classification problem, additional
data points for which the label is unknown might be used to aid in the classification process.
For clustering methods, on the other hand, the learning procedure might benefit from the
knowledge that certain data points belong to the same class.
As is the case for machine learning in general, a large majority of the research on semi-
supervised learning is focused on classification. Semi-supervised classification methods are
particularly relevant to scenarios where labelled data is scarce. In those cases, it may be
difficult to construct a reliable supervised classifier. This situation occurs in application
domains where labelled data is expensive or difficult to obtain, like computer-aided diagnosis,
drug discovery and part-of-speech tagging. If sufficient unlabelled data is available and under
certain assumptions about the distribution of the data, the unlabelled data can help in the
construction of a better classifier. In practice, semi-supervised learning methods have also
been applied to scenarios where no significant lack of labelled data exists: if the unlabelled
data points provide additional information that is relevant for prediction, they can potentially
be used to achieve improved classification performance.
A plethora of learning methods exists, each with their own characteristics, advantages and
disadvantages. The most recent comprehensive survey of the area was published by Zhu in
2005 and last updated in 2008 [see Zhu (2008)]. The book by Chapelle et al. (2006b) and
the introductory book by Zhu and Goldberg (2009) also provide good bases for studying
earlier work on semi-supervised learning. More recently, Subramanya and Talukdar (2014)
provided an overview of several graph-based techniques, and Triguero et al. (2015) reviewed
and analyzed pseudo-labelling techniques, a class of semi-supervised learning methods.
Since the survey by Zhu (2008) was published, some important developments have taken
place in the field of semi-supervised learning. Across the field, new learning approaches
have been proposed, and existing approaches have been extended, improved, and analyzed
in more depth. Additionally, the rise in popularity of (deep) neural networks (Goodfellow
2017) for supervised learning has prompted new approaches to semi-supervised learning,
driven by the simplicity of incorporating unsupervised loss terms into the cost functions of
neural networks. Lastly, there has been increased attention to the development of robust
semi-supervised learning methods that do not degrade performance, and for the evaluation
of semi-supervised learning methods for practical purposes.
In this survey, we aim to provide the reader with a comprehensive overview of the cur-
rent state of the research area of semi-supervised learning, covering early work and recent
advances, and providing explanations of key algorithms and approaches. We present a new
taxonomy for semi-supervised classification methods that captures the assumptions under-
lying each group of methods as well as the way in which they relate to existing supervised
methods. In this, we provide a perspective on semi-supervised learning that allows for a
more thorough understanding of different approaches and the connections between them.
Furthermore, we shed new light on the fundamental assumptions underlying semi-supervised
learning, and show how they connect to the so-called cluster assumption.
Although we aim to provide a comprehensive survey on semi-supervised learning, we
cannot possibly cover every method in existence. Due to the sheer size of the literature
on the topic, this would not only be beyond the scope of this article, but also distract
from the key insights which we wish to provide to the reader. Instead, we focus on the
most influential work and the most important developments in the area over the past twenty
years.
The rest of this article is structured as follows. The basic concepts and assumptions of semi-
supervised learning are covered in Sect. 2, where we also make a connection to clustering.
In Sect. 3, we present our taxonomy of semi-supervised learning methods, which forms
the conceptual basis for the remainder of our survey. Inductive methods are covered in
Sects. 4 through 6. We first consider wrapper methods (Sect. 4), followed by unsupervised
preprocessing (Sect. 5), and finally, we cover intrinsically semi-supervised methods (Sect. 6).
Sect. 7 covers transductive methods, which form the second major branch of our taxonomy.
Semi-supervised regression and clustering are discussed in Sect. 8. Finally, in Sect. 9, we
provide some prospects for the future of semi-supervised learning.
2 Background
In traditional supervised learning problems, we are presented with an ordered collection
of $l$ labelled data points $D_L = ((x_i, y_i))_{i=1}^{l}$. Each data point $(x_i, y_i)$ consists of an object
$x_i \in \mathcal{X}$ from a given input space $\mathcal{X}$, and has an associated label $y_i$, where $y_i$ is real-valued in
regression problems and categorical in classification problems. Based on a collection of these
data points, usually called the training data, supervised learning methods attempt to infer a
function that can successfully determine the label $y^*$ of some previously unseen input $x^*$.
In many real-world classification problems, however, we also have access to a collection
of $u$ data points, $D_U = (x_i)_{i=l+1}^{l+u}$, whose labels are unknown. For instance, the data points for
which we want to make predictions, usually called the test data, are unlabelled by definition.
Semi-supervised classification methods attempt to utilize unlabelled data points to construct
a learner whose performance exceeds the performance of learners obtained when using only
the labelled data. In the remainder of this survey, we denote with $X_L$ and $X_U$ the collections
of input objects for the labelled and unlabelled samples, respectively.1
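To make this notation concrete, consider the following minimal sketch (ours, not part of the original survey). It uses scikit-learn's digits data purely as an example and follows scikit-learn's convention of marking unlabelled targets with -1; the value of l is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

l = 50                      # number of labelled data points (illustrative)
X_L, y_L = X[:l], y[:l]     # labelled collection D_L = ((x_i, y_i))_{i=1}^l
X_U = X[l:]                 # unlabelled collection D_U = (x_i)_{i=l+1}^{l+u}

# scikit-learn's semi-supervised estimators expect a single target vector
# in which unlabelled points are marked with -1.
y_semi = np.concatenate([y_L, np.full(len(X_U), -1)])
```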
There are many cases where unlabelled data can help in constructing a classifier. Consider,
for example, the problem of document classification, where we wish to assign topics to a
collection of text documents (such as news articles). Assuming our documents are represented
by the set of words that appear in it, one could train a simple supervised classifier that, for
example, learns to recognize that documents containing the word “neutron” are usually about
physics. This classifier might work well on documents containing terms that it has seen in
the training data, but will inherently fail when a document does not contain predictive words
that also occurred in the training set. For example, if we encounter a physics document
1 We note that the collections of data points referred to here are technically lists. However, following common
usage, in this survey, we refer to them as ‘sets’ and, in a slight abuse of notation, apply standard set-theoretic
concepts to them.
Fig. 1 A basic example of binary classification in the presence of unlabelled data. The unlabelled data points
are coloured according to their true label. The coloured, unfilled circles depict the contour curves of the input
data distribution corresponding to standard deviations of 1, 2 and 3 (Color figure online)
about particle accelerators that does not contain the word “neutron”, the classifier is unable
to recognize it as a document concerning physics. This is where semi-supervised learning
comes in. If we consider the unlabelled data, there might be documents that connect the word
“neutron” to the phrase “particle accelerator”. For instance, the word “neutron” would often
occur in a document that also contains the word “quark”. Furthermore, the word “quark”
would regularly co-occur with the phrase “particle accelerator”, which guides the classifier
towards classifying these documents as revolving around physics as well, despite never
having seen the phrase “particle accelerator” in the labelled data.
Figure 1 provides some further intuition towards the use of unlabelled data for classifi-
cation. We consider an artificial classification problem with two classes. For both classes,
100 samples are drawn from a 2-dimensional Gaussian distribution with identical covariance
matrices. The labelled data set is then constructed by taking one sample from each class.
Any supervised learning algorithm will most likely obtain as the decision boundary the solid
line, which is perpendicular to the line segment connecting the two labelled data points and
intersects it in the middle. However, this is quite far from the optimal decision boundary.
As is clear from this figure, the clusters we can infer from the unlabelled data can help us
considerably in placing the decision boundary: assuming that the data stems from two Gaus-
sian distributions, a simple semi-supervised learning algorithm can infer a close-to-optimal
decision boundary.
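This scenario is straightforward to reproduce in code. The sketch below (our illustration; the means, covariance and sample sizes are chosen to mimic Fig. 1) fits a two-component Gaussian mixture to all data points and uses the two labelled points only to map mixture components to classes, so the decision boundary is shaped by the unlabelled data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# 100 samples per class from 2-D Gaussians with identical covariance matrices.
X0 = rng.multivariate_normal(mean=[-3.0, 0.0], cov=np.eye(2), size=100)
X1 = rng.multivariate_normal(mean=[+3.0, 0.0], cov=np.eye(2), size=100)
X = np.vstack([X0, X1])

# The labelled set contains exactly one sample from each class.
X_L = np.array([X0[0], X1[0]])
y_L = np.array([0, 1])

# Fit a two-component mixture to ALL points, labelled and unlabelled,
# exploiting the assumption that each class stems from one Gaussian.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Use the labelled points to map mixture components to class labels
# (this assumes the two labelled points fall into different components).
comp_to_class = dict(zip(gmm.predict(X_L), y_L))
y_pred = np.vectorize(comp_to_class.get)(gmm.predict(X))
```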
2.1 Assumptions of semi-supervised learning
A necessary condition of semi-supervised learning is that the underlying marginal data
distribution p(x) over the input space contains information about the posterior distribution p(y|x).
If this is the case, one might be able to use unlabelled data to gain information about p(x), and
thereby about p(y|x). If, on the other hand, this condition is not met, and p(x) contains no
information about p(y|x), it is inherently impossible to improve the accuracy of predictions
based on the additional unlabelled data (Zhu 2008).
Fig. 2 Illustrations of the semi-supervised learning assumptions: (a) smoothness and low-density assumptions;
(b) manifold assumption. In each picture, a reasonable supervised decision boundary is depicted, as well as
the optimal decision boundary, which could be closely approximated by a semi-supervised learning algorithm
relying on the respective assumption
Fortunately, the previously mentioned condition appears to be satisfied in most learning
problems encountered in the real world, as is suggested by the successful application of semi-
supervised learning methods in practice. However, the way in which p(x)and p(y|x)interact
is not always the same. This has given rise to the semi-supervised learning assumptions,
which formalize the types of expected interaction (Chapelle et al. 2006b). The most widely
recognized assumptions are the smoothness assumption (if two samples $x$ and $x'$ are close
in the input space, their labels $y$ and $y'$ should be the same), the low-density assumption
(the decision boundary should not pass through high-density areas in the input space), and
the manifold assumption (data points on the same low-dimensional manifold should have the
same label). These assumptions are the foundation of most, if not all, semi-supervised learning
algorithms, which generally depend on one or more of them being satisfied, either explicitly or
implicitly. Throughout this survey, we will elaborate on the underlying assumptions utilized
by each specific learning algorithm. The assumptions are explained in more detail below; a
visual representation is provided in Fig. 2.
2.1.1 Smoothness assumption
The smoothness assumption states that, for two input points $x, x' \in \mathcal{X}$ that are close by in
the input space, the corresponding labels $y, y'$ should be the same. This assumption is also
commonly used in supervised learning, but has an extended benefit in the semi-supervised
context: the smoothness assumption can be applied transitively to unlabelled data. For example,
assume that a labelled data point $x_1 \in X_L$ and two unlabelled data points $x_2, x_3 \in X_U$
exist, such that $x_1$ is close to $x_2$ and $x_2$ is close to $x_3$, but $x_1$ is not close to $x_3$. Then, because
of the smoothness assumption, we can still expect $x_3$ to have the same label as $x_1$, since
proximity—and thereby the label—is transitively propagated through $x_2$.
2.1.2 Low-density assumption
The low-density assumption implies that the decision boundary of a classifier should prefer-
ably pass through low-density regions in the input space. In other words, the decision
boundary should not pass through high-density regions. The assumption is defined over
p(x), the true distribution of the input data. When considering a limited set of samples from
this distribution, it essentially means that the decision boundary should lie in an area where
few data points are observed. In that light, the low-density assumption is closely related to
the smoothness assumption; in fact, it can be considered the counterpart of the smoothness
assumption for the underlying data distribution.
Suppose that a low-density area exists, i.e. an area $R \subset \mathcal{X}$ where $p(x)$ is low. Then very
few observations are expected to be contained in $R$, and it is thus unlikely that any pair of
similar data points in $R$ is observed. If we place the decision boundary in this low-density
area, the smoothness assumption is not violated, since it only concerns pairs of similar data
points. For high-density areas, on the other hand, many data points can be expected. Thus,
placing the decision boundary in a high-density region violates the smoothness assumption,
since the predicted labels would then be dissimilar for similar data points.
The converse is also true: if the smoothness assumption holds, then any two data points
that lie close together have the same label. Therefore, in any densely populated area of the
input space, all data points are expected to have the same label. Consequently, a decision
boundary can be constructed that passes only through low-density areas in the input space,
thus satisfying the low-density assumption as well. Due to their close practical relation, we
depict the low-density assumption and the smoothness assumption in a single illustration in
Fig. 2.
2.1.3 Manifold assumption
In machine learning problems where the data can be represented in Euclidean space, the
observed data points in the high-dimensional input space $\mathbb{R}^d$ are usually concentrated along
lower-dimensional substructures. These substructures are known as manifolds: topological
spaces that are locally Euclidean. For instance, when we consider a 3-dimensional input space
where all points lie on the surface of a sphere, the data can be said to lie on a 2-dimensional
manifold. The manifold assumption in semi-supervised learning states that (a) the input space
is composed of multiple lower-dimensional manifolds on which all data points lie and (b)
data points lying on the same manifold have the same label. Consequently, if we are able
to determine which manifolds exist and which data points lie on which manifold, the class
assignments of unlabelled data points can be inferred from the labelled data points on the
same manifold.
2.2 Connection to clustering
In semi-supervised learning research, an additional assumption that is often included is the
cluster assumption, which states that data points belonging to the same cluster belong to
the same class (Chapelle et al. 2006b). We argue, however, that the previously mentioned
assumptions and the cluster assumption are not independent of each other but, rather, that
the cluster assumption is a generalization of the other assumptions.
Consider an input space $\mathcal{X}$ with some objects $X \subset \mathcal{X}$, drawn from the distribution $p(x)$.
A cluster, then, is a set of data points $C \subseteq X$ that are more similar to each other than to other
data points in $X$, according to some concept of similarity (Anderberg 1973). Determining
clusters corresponds to finding some function $f: \mathcal{X} \to \mathcal{Y}$ that maps each input $x \in X$ to
a cluster with label $y = f(x)$, where each cluster label $y \in \mathcal{Y}$ uniquely identifies one cluster.
Since we do not have direct access to $p(x)$ to determine a suitable clustering, we need to rely
on some concept of similarity between data points in $X$, according to which we can assign
clusters to similar data points.
The concept of similarity we choose, often implicitly, dictates what constitutes a cluster.
Although the efficacy of any particular clustering method for finding these clusters depends
on many other factors, the concept of similarity uniquely defines the interaction between p(x)
and p(y|x). Therefore, whether two points belong to the same cluster can be derived from
their similarity to each other and to other points. From our perspective, the smoothness, low-
density, and manifold assumptions boil down to different definitions of the similarity between
points: the smoothness assumption states that points that are close to each other in input space
are similar; the low-density assumption states that points in the same high-density area are
similar; and the manifold assumption states that points that lie on the same low-dimensional
manifold are similar. Consequently, the semi-supervised learning assumptions can be seen
as more specific instances of the cluster assumption: that similar points tend to belong to the
same group.
One could even argue that the cluster assumption corresponds to the necessary condition
for semi-supervised learning: that $p(x)$ carries information on $p(y|x)$. In fact, assuming the
output space $\mathcal{Y}$ contains the labels of all possible clusters, the necessary condition for semi-
supervised learning to succeed can be seen to be the necessary condition for clustering to
succeed. In other words: if the data points (both unlabelled and labelled) cannot be mean-
ingfully clustered, it is impossible for a semi-supervised learning method to improve on a
supervised learning method.
2.3 When does semi-supervised learning work?
The primary goal of semi-supervised learning is to harness unlabelled data for the construction
of better learning procedures. As it turns out, this is not always easy or even possible. As
mentioned earlier, unlabelled data is only useful if it carries information useful for label
prediction that is not contained in the labelled data alone or cannot be easily extracted from
it. To apply any semi-supervised learning method in practice, the algorithm then needs to be
able to extract this information. For practitioners and researchers alike, this raises the question:
when is this the case?
Unfortunately, it has proven difficult to find a practical answer to this question. Not only
is it difficult to precisely define the conditions under which any particular semi-supervised
learning algorithm may work, it is also rarely straightforward to evaluate to what extent these
conditions are satisfied. However, one can reason about the applicability of different learning
methods on various types of problems. Graph-based methods, for example, typically rely on
a local similarity measure to construct a graph over all data points. To apply such methods
successfully, it is important that a meaningful local similarity measure can be devised. In
high-dimensional data, such as images, where Euclidean feature distance is rarely a good
indicator of the similarity between data points, this is often difficult. As can be seen in the
literature, most semi-supervised learning approaches for images rely on a weak variant of
the smoothness assumption that requires predictions to be invariant to minor perturbations
in the input (Rasmus et al. 2015; Laine and Aila 2017; Tarvainen and Valpola 2017). Semi-
supervised extensions of supervised learning algorithms, on the other hand, generally rely
on the same assumption as their supervised counterparts. For instance, both supervised and
semi-supervised support vector machines rely on the low-density assumption, which states
that the decision boundary should lie in a low-density region of the decision space. If a
supervised classifier performs well in such cases, it is only natural to use the semi-supervised
extension to the algorithm.
As is the case for supervised learning algorithms, no method has yet been discovered to
determine a priori what learning method is best-suited for any particular problem. What is
more, it is impossible to guarantee that the introduction of unlabelled data will not degrade
performance. Such performance degradation has been observed in practice, and its preva-
lence is likely under-reported due to publication bias (Zhu 2008). The problem of potential
performance degradation has been identified in multiple studies (Zhu 2008; Chapelle et al.
2006b; Singh et al. 2009; Li and Zhou 2015; Oliver et al. 2018), but remains difficult to
address. It is particularly relevant in scenarios where good performance can be achieved with
purely supervised classifiers. In those cases, the potential performance degradation is much
larger than the potential performance gain.
The main takeaway from these observations is that semi-supervised learning should not be
seen as a guaranteed way of achieving improved prediction performance by the mere intro-
duction of unlabelled data. Rather, it should be treated as another direction in the process of
finding and configuring a learning algorithm for the task at hand. Semi-supervised learning
procedures should be part of the suite of algorithms considered for use in a particular appli-
cation scenario, and a combination of theoretical analysis (where possible) and empirical
evaluation should be used to choose an approach that is well suited to the given situation.
2.4 Empirical evaluation of semi-supervised learning methods
When evaluating and comparing machine learning algorithms, a multitude of decisions influ-
ence the relative performance of different algorithms. In supervised learning, these include
the selection of data sets, the partitioning of those data sets into training, validation and test
sets, and the extent to which hyperparameters are tuned. In semi-supervised learning, addi-
tional factors come into play. First, in many benchmarking scenarios, a decision has to be
made which data points should be labelled and which should remain unlabelled. Second, one
can choose to evaluate the performance of the learner on the unlabelled data used for training
(which is by definition the case in transductive learning), or on a completely disjoint test
set. Additionally, it is important to establish high-quality supervised baselines to allow for
proper assessment of the added value of the unlabelled data. In practice, excessively limiting
the scope of the evaluation can lead to unrealistic perspectives on the performance of the
learning algorithms. Recently, Oliver et al. (2018) established a set of guidelines for the real-
istic evaluation of semi-supervised learning algorithms; several of their recommendations
are included here.
In practical use cases, the partitioning of labelled and unlabelled data is typically fixed.
In research, data sets used for evaluating semi-supervised learning algorithms are usually
obtained by simply removing the labels of a large amount of data points from an existing
supervised learning data set. In earlier research, the data sets from the UCI Machine Learning
Repository were often used (Dua and Graff 2019). In more recent research on semi-supervised
image classification, the CIFAR-10/100 (Krizhevsky 2009) and SVHN (Netzer et al. 2011)
data sets have been popular choices. Additionally, two-dimensional toy data sets are sometimes
used to demonstrate the viability of a new approach. Typically, these toy data sets
consist of an input distribution where data points from each class are concentrated along
a one-dimensional manifold. For instance, the popular half-moon data set consists of data
points drawn from two interleaved half circles, each associated with a different class.
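As a concrete illustration (ours, not from the survey), the half-moon data set and the usual label-removal protocol can be set up as follows; keeping five labelled points per class is an arbitrary choice:

```python
import numpy as np
from sklearn.datasets import make_moons

# Two interleaved half circles, one per class.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Simulate the semi-supervised setting by hiding most of the labels:
# keep 5 labelled points per class; mark the rest as unlabelled (-1).
rng = np.random.default_rng(0)
y_semi = np.full_like(y, -1)
for c in (0, 1):
    keep = rng.choice(np.flatnonzero(y == c), size=5, replace=False)
    y_semi[keep] = c
```

The resulting y_semi can be fed directly to scikit-learn's semi-supervised estimators, such as LabelSpreading or SelfTrainingClassifier, which treat -1 as "unlabelled".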
As has been observed in practice, the choice of data sets and their partitioning can have
significant impact on the relative performance of different learning algorithms (see, e.g.
Chapelle et al. 2006b; Triguero et al. 2015). Some algorithms may work well when the
amount of labelled data is limited and perform poorly when more labelled data is available;
others may excel on particular types of data sets but not on others. To provide a realistic
evaluation of semi-supervised learning algorithms, researchers should thus evaluate their
algorithms on a diverse suite of data sets with different quantities of labelled and unlabelled
data.
In addition to the choice of data sets and their partitioning, it is important that a strong
baseline is chosen when evaluating the performance of a semi-supervised learning method.
After all, it is not particularly relevant to practitioners whether the introduction of unlabelled
data improves the performance of any particular learning algorithm. Rather, the central question
is whether the introduction of unlabelled data yields a learner that is better than any other
learner, be it supervised or semi-supervised. As pointed out by Oliver et al. (2018), this calls
for the inclusion of state-of-the-art, properly tuned supervised baselines when evaluating the
performance of semi-supervised learning algorithms.
Several studies have independently evaluated the performance of different semi-supervised
learning methods on various data sets. Chapelle et al. (2006b) empirically compared eleven
diverse semi-supervised learning algorithms, using supervised support vector machines and k-
nearest neighbours as their baseline. They included semi-supervised support vector machines,
label propagation and manifold regularization techniques, applying hyperparameter opti-
mization for each algorithm. Comparing the performance of the algorithms on eight different
data sets, the authors found that no algorithm uniformly outperformed the others. Substan-
tial performance improvements over the baselines were observed on some data sets, while
performance was found to be degraded on others. Relative performance also varied with the
amount of unlabelled data.
Oliver et al. (2018) compared several semi-supervised neural networks, including the mean
teacher model, virtual adversarial training and a wrapper method called pseudo-label, on
two image classification problems. They reported substantial performance improvements
for most of the algorithms, and observed that the error rates typically declined as more
unlabelled data points were added (without removing any labelled data points). Performance
degradations were observed only when there was a mismatch between the classes present in
the labelled data and the classes present in the unlabelled data. These results are promising
indeed: they indicate that, in image classification tasks, unlabelled data can be employed
by neural networks to consistently improve performance. It is an interesting avenue for
future research to investigate whether these consistent performance improvements can also be
obtained for other types of data. Furthermore, it is an open question whether the assumptions
underlying these semi-supervised neural networks could be exploited to consistently improve
the performance of other learning methods.
3 Taxonomy of semi-supervised learning methods
Over the past two decades, a broad variety of semi-supervised classification algorithms has
been proposed. These methods differ in the semi-supervised learning assumptions they are
based on, in how they make use of unlabelled data, and in the way they relate to supervised
algorithms. Existing categorizations of semi-supervised learning methods generally use a
subset of these properties and are typically relatively flat, thereby failing to capture similarities
Fig. 3 Visualization of the semi-supervised classification taxonomy. Each leaf in the taxonomy corresponds
to a specific type of approach to incorporating unlabelled data into classification methods. In the leaf corre-
sponding to transductive, graph-based methods, the dashed boxes represent distinct phases of the graph-based
classification process, each of which has a multitude of variations
between different groups of methods. Furthermore, the categorizations are often fine-tuned
towards existing work, making them less suited for the inclusion of new approaches.
In this survey, we propose a new way to represent the spectrum of semi-supervised classifi-
cation algorithms. We attempt to group them in a clear, future-proof way, allowing researchers
and practitioners alike to gain insight into the way semi-supervised learning methods relate
to each other, to existing supervised learning methods, and to the semi-supervised learning
assumptions. The taxonomy is visualized in Fig. 3. At the highest level, it distinguishes
between inductive and transductive methods, which give rise to distinct optimization proce-
dures: the former attempt to find a classification model, whereas the latter are solely concerned
with obtaining label predictions for the given unlabelled data points. At the second level, it
considers the way the semi-supervised learning methods incorporate unlabelled data. This
distinction gives rise to three distinct classes of inductive methods, each of which is related
to supervised classifiers in a different way.
The first distinction we make in our taxonomy, between inductive and transductive meth-
ods, is common in the literature on semi-supervised learning (see, e.g. Chapelle et al. 2006b;
Zhu 2008; Zhu and Goldberg 2009). The former, like supervised learning methods, yield a
classification model that can be used to predict the label of previously unseen data points.
The latter do not yield such a model, but instead directly provide predictions. In other words,
given a data set consisting of labelled and unlabelled data, $X_L, X_U \subseteq \mathcal{X}$, with labels $y_L \in \mathcal{Y}^l$
for the $l$ labelled data points, inductive methods yield a model $f: \mathcal{X} \to \mathcal{Y}$, whereas transductive
methods produce predicted labels $\hat{y}_U$ for the unlabelled data points in $X_U$. Accordingly,
inductive methods involve optimization over prediction models, whereas transductive
methods optimize directly over the predictions $\hat{y}_U$.
Inductive methods, which generally extend supervised algorithms to include unlabelled
data, are further differentiated in our taxonomy based on the way they incorporate unlabelled
data: either in a preprocessing step, directly inside the objective function, or via a pseudo-
labelling step. The transductive methods are in all cases graph-based; we group these based on
the choices made in different stages of the learning process. In the remainder of this section,
we will elaborate on the grouping of semi-supervised learning methods represented in the
taxonomy, which forms the basis for our discussion of semi-supervised learning methods in
the remainder of this survey.
3.1 Inductive methods
Inductive methods aim to construct a classifier that can generate predictions for any object in
the input space. Unlabelled data may be used when training this classifier, but the predictions
for multiple new, previously unseen examples are independent of each other once training has
been completed. This corresponds to the objective in supervised learning methods: a model
is built in the training phase and can then be used for predicting the labels of new data points.
3.1.1 Wrapper methods
A simple approach to extending existing, supervised algorithms to the semi-supervised setting
is to first train classifiers on labelled data, and to then use the predictions of the resulting
classifiers to generate additional labelled data. The classifiers can then be re-trained on this
pseudo-labelled data in addition to the existing labelled data. Such methods are known as
wrapper methods: the unlabelled data is pseudo-labelled by a wrapper procedure, and a
purely supervised learning algorithm, unaware of the distinction between originally labelled
and pseudo-labelled data, constructs the final inductive classifier. This reveals a key property
of wrapper methods: most of them can be applied to any given supervised base learner,
allowing unlabelled data to be introduced in a straightforward manner. Wrapper methods
form the first part of the inductive side of the taxonomy, and are covered in Sect. 4.
3.1.2 Unsupervised preprocessing
Secondly, we consider unsupervised preprocessing methods, which either extract useful fea-
tures from the unlabelled data, pre-cluster the data, or determine the initial parameters of a
supervised learning procedure in an unsupervised manner. Like wrapper methods, they can
be used with any supervised classifier. However, unlike wrapper methods, the supervised
classifier is only provided with originally labelled data points. These methods are covered in
Sect. 5.
3.1.3 Intrinsically semi-supervised methods
The last class of inductive methods we consider directly incorporates unlabelled data into the
objective function or optimization procedure of the learning method. Many of these methods
are direct extensions of supervised learning methods to the semi-supervised setting: they
extend the objective function of the supervised classifier to include unlabelled data. Semi-
supervised support vector machines (S3VMs), for example, extend supervised SVMs by
maximizing the margin not only on labelled, but also on unlabelled data. There are intrinsically
semi-supervised extensions of many prominent supervised learning approaches, including
SVMs, Gaussian processes and neural networks, and we describe these in Sect. 6. We further
group the methods inside this category based on the semi-supervised learning assumptions
on which they rely.
3.2 Transductive methods
Unlike inductive methods, transductive methods do not construct a classifier for the entire
input space. Instead, their predictive power is limited to exactly those objects that they
encounter during the training phase. Therefore, transductive methods have no distinct training and
testing phases. Since supervised learning methods are by definition not supplied with unla-
belled data until the testing phase, no clear analogies of transductive algorithms exist in
supervised learning.
Since no model of the input space exists in transductive learners, information has to
be propagated via direct connections between data points. This observation naturally gives
rise to a graph-based approach to transductive methods: if a graph can be defined in which
similar data points are connected, information can then be propagated along the edges of
this graph. In practice, all transductive methods we discuss are either explicitly graph-based
or can implicitly be understood as such. We note that inductive graph-based methods also
exist; we cover them in Sect. 6.3. Inductive as well as transductive graph-based methods are
typically premised on the manifold assumption: the graphs, constructed based on the local
similarity between data points, provide a lower-dimensional representation of the potentially
high-dimensional input data.
Transductive graph-based methods generally consist of three steps: graph construction,
graph weighting and inference. In the first step, the set of objects, X, is used to construct a
graph where each node represents a data point and pairwise similar data points are connected
by an edge. In the second step, these edges are weighted to represent the extent of the pairwise
similarity between the respective data points. In the third step, the graph is used to assign
labels to the unlabelled data points. Different methods for carrying out these three steps are
discussed in detail in Sect. 7.
4 Wrapper methods
Wrapper methods are among the oldest and most widely known algorithms for semi-
supervised learning (Zhu 2008). They utilize one or more supervised base learners and
iteratively train these with the original labelled data as well as previously unlabelled data
that is augmented with predictions from earlier iterations of the learners. The latter is com-
monly referred to as pseudo-labelled data. The procedure usually consists of two alternating
steps of training and pseudo-labelling. In the training step, one or more supervised classifiers
are trained on the labelled data and, possibly, pseudo-labelled data from previous iterations.
In the pseudo-labelling step, the resulting classifiers are used to infer labels for the previ-
ously unlabelled objects; the data points for which the learners were most confident of their
predictions are pseudo-labelled for use in the next iteration.
A significant advantage of wrapper methods is that they can be used with virtually any
supervised base learner. The supervised base learner can be entirely unaware of the wrapper
method, which simply passes pseudo-labelled samples to the base learner as if they were
regular labelled samples. Although some wrapper methods require the base learner to provide
probabilistic predictions, many wrapper methods relying on multiple base learners do not.
For any particular wrapper method, the semi-supervised learning assumptions underlying it
are dependent on the base learners that are used. In that sense, a wrapper method cannot be
considered a learning method on its own: it only becomes a complete learning method when
it is combined with a particular set of base learners.
A comprehensive survey of wrapper methods was published recently by Triguero et al.
(2015). In addition to providing an overview of such methods, they also proposed a catego-
rization and taxonomy of wrapper methods, which is based on (1) how many classifiers are
used, (2) whether different types of classifiers are used, and (3) whether they use single-view
or multi-view data (i.e. whether the data is split into multiple feature subsets). This taxonomy
provides valuable insight into the space of wrapper methods.
We present a less complex taxonomy, focused on the three relatively independent types of
wrapper methods that have been studied in the literature. Firstly, we consider self-training,
which uses one supervised classifier that is iteratively re-trained on its own most confident
predictions. Secondly, we consider co-training, an extension of self-training to multiple
classifiers that are iteratively re-trained on each other’s most confident predictions. The
classifiers are supposed to be sufficiently diverse, which is usually achieved by operating
on different subsets of the given objects or features. Lastly, we consider pseudo-labelled
boosting methods. Like traditional boosting methods, they build a classifier ensemble by
constructing individual classifiers sequentially, where each individual classifier is trained on
both labelled data and the most confident predictions of the previous classifiers on unlabelled
data.
4.1 Self-training
Self-training methods (sometimes also called “self-learning” methods) are the most basic
of pseudo-labelling approaches (Triguero et al. 2015). They consist of a single supervised
classifier that is iteratively trained on both labelled data and data that has been pseudo-labelled
in previous iterations of the algorithm.
At the beginning of the self-training procedure, a supervised classifier is trained on only
the labelled data. The resulting classifier is used to obtain predictions for the unlabelled
data points. Then, the most confident of these predictions are added to the labelled data set,
and the supervised classifier is re-trained on both the original labelled data and the newly
obtained pseudo-labelled data. This procedure is typically iterated until no more unlabelled
data remain.
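The following sketch captures this loop (our illustration; the fixed batch size k and the use of predict_proba as a confidence measure are assumptions, and any probabilistic base learner could be plugged in):

```python
import numpy as np

def self_training(clf, X_l, y_l, X_u, k=10):
    """Iteratively pseudo-label the k most confident unlabelled points."""
    while len(X_u) > 0:
        clf.fit(X_l, y_l)
        probs = clf.predict_proba(X_u)
        # Select the k unlabelled points with the most confident predictions.
        best = np.argsort(probs.max(axis=1))[-k:]
        X_l = np.vstack([X_l, X_u[best]])
        y_l = np.concatenate([y_l, clf.classes_[probs[best].argmax(axis=1)]])
        X_u = np.delete(X_u, best, axis=0)
    return clf.fit(X_l, y_l)
```

scikit-learn's SelfTrainingClassifier implements a closely related scheme, selecting points either by a confidence threshold or by a k-best criterion.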
Self-training was first proposed by Yarowsky (1995) as an approach to word sense dis-
ambiguation in text documents, predicting the meaning of words based on their context.
Since then, several applications and variations of self-training have been put forward. For
instance, Rosenberg et al. (2005) applied self-training to object detection problems, and
showed improved performance over a state-of-the-art (at that time) object detection model.
Dópido et al. (2013) developed a self-training approach for hyperspectral image classifi-
cation. They used domain knowledge to select a set of candidate unlabelled samples, and
pseudo-labelled the most informative of these samples with the predictions made by the
trained classifier.
The self-training paradigm admits a multitude of design decisions, including the selection
of data to pseudo-label, the re-use of pseudo-labelled data in later iterations of the algorithm,
and stopping criteria (see, e.g. Rosenberg et al. 2005; Triguero et al. 2015). The selection
procedure for data to be pseudo-labelled is of particular importance, since it determines
which data end up in the training set for the classifier. In typical self-training settings, where
this selection is made based on prediction confidence, the quality of the confidence esti-
mates significantly influences algorithm performance. In particular, the ranking of prediction
probabilities for the unlabelled samples should reflect the true confidence ranking.
If well-calibrated probabilistic predictions are available, the respective probabilities can
be used directly. The self-training approach is then iterative rather than incremental, as
label probabilities for unlabelled data points are re-estimated in each step; in that case, the
approach becomes similar to expectation-maximization (EM; Dempster et al. 1977). It has
been particularly well studied in the context of naïve Bayes classifiers, which are inherently
probabilistic (Nigam and Ghani 2000; Nigam et al. 2000, 2006). Wu et al. (2012b) recently
applied semi-supervised EM with a naïve Bayes classifier to the problem of detecting fake
product reviews on e-commerce websites.
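The sketch below illustrates such an EM-style variant with a multinomial naïve Bayes classifier (our illustration; the soft-weighting construction via duplicated, weighted copies of the unlabelled data and the fixed number of iterations are assumptions rather than a prescribed algorithm):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_l, y_l, X_u, n_iter=10):
    """Semi-supervised EM: re-estimate label probabilities of the
    unlabelled data in each step and refit on soft-weighted copies."""
    clf = MultinomialNB().fit(X_l, y_l)
    classes = clf.classes_
    for _ in range(n_iter):
        # E-step: posterior label probabilities for the unlabelled data.
        probs = clf.predict_proba(X_u)
        # M-step: refit on the labelled data plus one copy of the
        # unlabelled data per class, weighted by p(y | x).
        X_all = np.vstack([X_l] + [X_u] * len(classes))
        y_all = np.concatenate([y_l] + [np.full(len(X_u), c) for c in classes])
        w_all = np.concatenate([np.ones(len(X_l))] +
                               [probs[:, j] for j in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```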
Algorithms that do not natively support robust probabilistic predictions may require adap-
tations to benefit from self-training. Decision trees are a prime example of this: without any
modifications or pruning, prediction probability estimates, which are calculated
from the fraction of samples in a leaf with a certain label, are generally of low quality. This
can be mainly attributed to the fact that most decision tree learning algorithms explicitly
attempt to minimize the impurity in tree nodes, thereby encouraging small leaves and highly
biased probability estimates (Provost and Domingos 2003). Tanha et al. (2017) attempted to
overcome this problem in two distinct ways. Firstly, they applied several existing methods,
such as grafting and Laplace correction, to directly improve prediction probability estimates.
Secondly, they used a local distance-based measure to determine the confidence ranking
between instances: the prediction confidence of an unlabelled data point is based on the
absolute difference in the Mahalanobis distances between that point and the labelled data
from each class. They showed improvements in performance of both decision trees and
random forests (ensembles of decision trees) using this method (Tanha et al. 2017).
Leistner et al. (2009) also utilized self-training to improve random forests. Instead of
labelling the unlabelled data $x \in X_U$ with the label predicted to be most likely, they pseudo-
label each unlabelled data point independently for each tree according to the estimated
posterior distribution $p(y|x)$. Furthermore, they proposed a stopping criterion based on the
out-of-bag error: when the out-of-bag error (which is an unbiased estimate of the
generalization error) increases, training is stopped.
The base learners in self-training are by definition agnostic to the presence of the wrapper
method. Consequently, they have to be completely re-trained in each self-training iteration.
However, when a classifier can be trained incrementally (i.e. optimizing the objective function
over individual data points or subsets of the given data), an iterative pseudo-labelling approach
similar to self-training can be applied. Instead of re-training the entire algorithm in each
iteration, data points can be pseudo-labelled throughout the training process. This approach
was applied to neural networks by Lee (2013), who proposed the pseudo-label approach.
Since the pseudo-labels predicted in the earlier training stages are generally less reliable,
the weight of the pseudo-labelled data is increased over time. The pseudo-label approach
exhibits clear similarities to self-training, but differs in the sense that the classifier is not
re-trained after each pseudo-labelling step: instead, it is fine-tuned with new pseudo-labelled
data, and therefore technically deviates from the wrapper method paradigm.
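A compact sketch of this idea in PyTorch follows (ours; the model and optimizer are assumed to be supplied by the caller, and the schedule constants are tunable values along the lines of the piecewise-linear ramp-up described by Lee (2013)):

```python
import torch
import torch.nn.functional as F

def alpha_schedule(step, t1=100, t2=600, alpha_max=3.0):
    """Piecewise-linear ramp-up of the pseudo-label weight over time."""
    if step < t1:
        return 0.0
    if step < t2:
        return alpha_max * (step - t1) / (t2 - t1)
    return alpha_max

def pseudo_label_step(model, optimizer, x_l, y_l, x_u, step):
    """One fine-tuning step mixing labelled and pseudo-labelled losses."""
    model.train()
    loss_sup = F.cross_entropy(model(x_l), y_l)
    with torch.no_grad():
        pseudo = model(x_u).argmax(dim=1)  # current predictions as hard labels
    loss_pseudo = F.cross_entropy(model(x_u), pseudo)
    loss = loss_sup + alpha_schedule(step) * loss_pseudo
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```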
Limited studies regarding the theoretical properties of self-training algorithms exist. Haf-
fari and Sarkar (2007) performed a theoretical analysis of several variants of self-training and
showed a connection with graph-based methods. Culp and Michailidis (2008) analyzed the
convergence properties of a variant of self-training with several base learners, and considered
the connection to graph-based methods as well.
4.2 Co-training
Co-training is an extension of self-training to multiple supervised classifiers. In co-training,
two or more supervised classifiers are iteratively trained on the labelled data, adding their
most confident predictions to the labelled data set of the other supervised classifiers in each
iteration. For co-training to succeed, it is important that the base learners are not too strongly
correlated in their predictions. If they are, their potential to provide each other with useful
information is limited. In the literature, this condition is usually referred to as the diversity
criterion (Wang and Zhou 2010). Zhou and Li (2010) provided a survey of semi-supervised
learning methods relying on multiple base learners. They jointly refer to these methods as
disagreement-based methods, referring to the observation that co-training approaches exploit
disagreements between multiple learners: they exchange information through unlabelled data,
for which different learners predict different labels.
To promote classifier diversity, earlier co-training approaches mainly relied on the exis-
tence of multiple different views of the data, which generally correspond to distinct subsets of
the feature set. For instance, when handling video data, the data can be naturally decomposed
into visual and audio data. Such co-training methods belong to the broader class of multi-
view learning approaches, which includes a broad range of supervised learning algorithms as
well. A comprehensive survey of multi-view learning was produced by Xu et al. (2013). We
cover multi-view co-training methods in Sect. 4.2.1. In many real-world problem scenarios,
no distinct views of the data are known a priori. Single-view co-training methods address
this problem either by automatically splitting the data into different views, or by promoting
diversity in the learning algorithms themselves; we cover these methods in Sect. 4.2.2. We
also briefly discuss co-regularization methods, in which multiple classifiers are combined
into a single objective function, in Sect. 4.2.3.
4.2.1 Multi-view co-training
The basic form of co-training was proposed by Blum and Mitchell (1998). In their seminal
paper, they proposed to construct two classifiers that are trained on two distinct views, i.e.
subsets of features, of the given data. After each training step, the most confident predictions
for each view are added to the set of labelled data for the other view. Blum and Mitchell
applied the co-training algorithm to the classification of university web pages, using the
web page text and the anchor text in links to the web page from external sources as two
distinct views. This algorithm and variants thereof have been successfully applied in several
fields, most notably natural language processing (Kiritchenko and Matwin 2001; Mihalcea
2004; Wan 2009).
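A simplified sketch of this procedure is given below (our illustration; the per-round budget k and the tie-breaking rule are assumptions, and the original algorithm additionally maintains a small candidate pool and balances the classes of newly labelled points):

```python
import numpy as np

def co_training(clf1, clf2, X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, k=5):
    """Two-view co-training: each classifier pseudo-labels the k unlabelled
    points it is most confident about; both classifiers receive them."""
    for _ in range(rounds):
        if len(X1_u) == 0:
            break
        clf1.fit(X1_l, y_l)
        clf2.fit(X2_l, y_l)
        chosen = {}  # unlabelled index -> pseudo-label
        for clf, X_u in ((clf1, X1_u), (clf2, X2_u)):
            probs = clf.predict_proba(X_u)
            for i in np.argsort(probs.max(axis=1))[-k:]:
                # First classifier to claim a point provides its label.
                chosen.setdefault(i, clf.classes_[probs[i].argmax()])
        idx = np.array(sorted(chosen))
        labels = np.array([chosen[i] for i in idx])
        X1_l = np.vstack([X1_l, X1_u[idx]])
        X2_l = np.vstack([X2_l, X2_u[idx]])
        y_l = np.concatenate([y_l, labels])
        X1_u = np.delete(X1_u, idx, axis=0)
        X2_u = np.delete(X2_u, idx, axis=0)
    return clf1, clf2
```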
The original co-training algorithm by Blum and Mitchell (1998) relies on two main
assumptions to succeed: (1) each individual subset of features should be sufficient to obtain
good predictions on the given data set, and (2) the subsets of features should be conditionally
independent given the class label. The first assumption can be understood trivially: if one
of the two feature subsets is insufficient to form good predictions, a classifier using that set
can never contribute positively to the overall performance of the combined approach. The
second assumption is related to the diversity criterion: if the feature subsets are conditionally
independent given the class label, the predictions of the individual classifiers are unlikely
to be strongly correlated. Formally, for any data point $x_i = x_i^{(1)} \times x_i^{(2)}$, decomposed into
$x_i^{(1)}$ and $x_i^{(2)}$ for the first and second feature subset, respectively, the conditional
independence assumption amounts to $p(x_i^{(1)} \mid x_i^{(2)}, y_i) = p(x_i^{(1)} \mid y_i)$. Dasgupta et al. (2002) showed
that, under the previously mentioned assumptions, generalization error can be decreased by
promoting agreement among the individual learners.
In practice, the second assumption is generally not satisfied: even if a natural split of
features exists, such as in the experimental setup used by Blum and Mitchell (1998), it is
unlikely that information contained in one view provides no information about the other view
when conditioned on the class label (Du et al. 2011). Considering the university web page
classification example, the anchor text of a link to a web page can indeed be expected to
contain clues towards the content of the web page, even if it is known that the web page
is classified as a faculty member’s home page. For example, if the link’s anchor text is
“Dean of the Engineering Faculty”, one is more likely to find information about the dean of
the engineering faculty than about any other person in the text of that page. Thus, several
alternatives to this assumption have been considered.
Abney (2002) showed that a weak independence assumption is sufficient for successful
co-training. Balcan et al. (2005) further relaxed the conditional independence assumption,
showing that a much weaker assumption, which they dub the expansion assumption, is
sufficient and to some extent necessary. The expansion assumption states that the two views
are not highly correlated, and that individual classifiers never confidently make incorrect
predictions.
Du et al. (2011) studied empirical methods to determine to what degree the sufficiency and
independence assumptions hold. They proposed several methods for automatically splitting
the feature set into two views, and showed that the resulting empirical independence and suf-
ficiency is positively correlated with the performance of the co-trained algorithm, indicating
that feature splits optimizing sufficiency and independence lead to good classifiers.
4.2.2 Single-view co-training
As shown by Du et al. (2011), co-training can be successful even when no natural split in a
given feature set is known a priori. This observation is echoed throughout the literature on co-
training, and many different approaches to applying co-training in this so-called single-view
setting exist.
Chen et al. (2011) attempted to alleviate the need for pre-defined disjoint feature sets by
automatically splitting the feature set in each co-training iteration. They formulated a single
optimization problem closely related to co-training, incorporating both the requirement that
the feature sets should be disjoint and the expansion property of Balcan et al. (2005). They
showed promising results for this approach on a partially synthetic data set, where multiple
views of each data point are automatically generated. Wang and Zhou (2010) reasoned about
sufficient and necessary conditions for co-training to succeed, approaching co-training from
a graph-based perspective, where label propagation is alternately applied to each learner. A
downside of this approach is that, although inspired by co-training, it cannot be applied to
an arbitrary supervised learning algorithm without modification: the operations resembling
co-training are embedded in the objective function, which is optimized directly.
Several techniques have been proposed for splitting single-view data sets into multiple
views. For instance, Wang et al. (2008b) suggested to generate $k$ random projections of the data, and use these as the views for $k$ different classifiers. Zhang and Zheng (2009)
proposed to project the data onto a lower-dimensional subspace using principal component
analysis and to construct the pseudo-views by greedily selecting the transformed features
with maximal variance. Yaslan and Cataltepe (2010) do not transform the data to a different
basis, but select the features for each view iteratively, with preference given to features with
high mutual information with respect to the given labels.
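To make the view-construction step concrete, the following minimal sketch (in Python, assuming a hypothetical single-view data matrix `X` of shape `(n, d)`; the number of views and of projected components are arbitrary placeholder choices) generates pseudo-views via random projections in the spirit of Wang et al. (2008b):

```python
from sklearn.random_projection import GaussianRandomProjection

# Hypothetical single-view data matrix X of shape (n, d).
k = 3  # number of pseudo-views (and hence co-trained classifiers)

# Each pseudo-view is a random linear projection of the full feature set;
# a separate classifier would then be fitted on each view.
views = [
    GaussianRandomProjection(n_components=16, random_state=seed).fit_transform(X)
    for seed in range(k)
]
```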
Further approaches to apply algorithms resembling co-training to data sets where no
explicit views are available focus on other ways of introducing diversity among the clas-
sifiers. For example, one can use different hyperparameters for the supervised algorithms
(Wang and Zhou 2007; Zhou and Li 2005a), or use different algorithms altogether (Goldman
and Zhou 2000; Xu et al. 2012; Zhou and Goldman 2004). Wang and Zhou (2007) provided
both theoretical and empirical analyses on why co-training can work in single-view settings.
They showed that the diversity between the learners is positively correlated with their joint
performance. Zhou and Li (2005b) proposed tri-training, where three classifiers are alter-
nately trained. When two of the three classifiers agree on their prediction for a given data
point, that data point is passed to the other classifier along with the respective label. Crucially,
tri-training does not rely on probabilistic predictions of individual classifiers, and can thus
be applied to a much broader range of supervised learning algorithms.
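A minimal sketch of this scheme is given below (Python with scikit-learn; it omits the error-rate-based safeguards of the original algorithm and simply adds all agreed-upon pseudo-labels, so it should be read as an illustration rather than a faithful implementation):

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def tri_training(X_lab, y_lab, X_unl, base=DecisionTreeClassifier(), rounds=5):
    # Initialize three classifiers on bootstrap samples of the labelled data.
    clfs = [clone(base).fit(*resample(X_lab, y_lab, random_state=s))
            for s in range(3)]
    for _ in range(rounds):
        new_clfs = []
        for j in range(3):
            a, b = (clfs[i] for i in range(3) if i != j)
            pa, pb = a.predict(X_unl), b.predict(X_unl)
            agree = pa == pb  # the other two classifiers agree on these points
            X_aug = np.vstack([X_lab, X_unl[agree]])
            y_aug = np.concatenate([y_lab, pa[agree]])
            new_clfs.append(clone(base).fit(X_aug, y_aug))
        clfs = new_clfs
    return clfs  # predict by majority vote over the three classifiers
```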
The authors of the tri-training approach proposed to extend it to more than three learners—
notably, to random forests (Li and Zhou 2007). The approach, known as co-forest, starts by
training the decision trees independently on all labelled data. Then, in each iteration, each
classifier receives pseudo-labelled data based on the joint prediction of all other classifiers
on the unlabelled data: if the fraction of classifiers predicting a class $\hat{y}_i$ for an unlabelled data point $x_i$ exceeds a certain threshold, the pseudo-labelled data point $(x_i, \hat{y}_i)$ is passed to
the classifier. The decision trees are then all re-trained on their labelled and pseudo-labelled
data. In the next iteration, all previously pseudo-labelled data is treated as unlabelled again.
We note that, as the number of trees approaches infinity, this approach becomes a form of
self-training.
Co-forest includes a mechanism for reducing the influence of possibly mislabelled data
points in the pseudo-labelling step by weighting the newly labelled data based on predic-
tion confidence. Deng and Zu Guo (2011) attempted to further prevent the influence of
possibly mislabelled data points by removing “suspicious” pseudo-labellings. After each
pseudo-labelling step, the prediction for each pseudo-labelled data point $x_i$ is compared to the (pseudo-)labels of its $k$ nearest neighbours (both labelled and pseudo-labelled); in case of a mismatch, the pseudo-label is removed from $x_i$.
We note that in existing literature concerning co-forest, the size of the forest has always
been limited to six trees. It has been empirically shown that, in supervised random forests,
performance can substantially improve as the number of trees is increased (Oshiro et al. 2012).
Therefore, it is likely that increasing the number of trees in co-forest will substantially affect
relative performance compared to random forests.
4.2.3 Co-regularization
Co-training methods reduce disagreement between classifiers by passing information
between them, in the form of pseudo-labelled data. Furthermore, the implicit objective of co-
training is to minimize the error rate of the ensemble of classifiers. Sindhwani et al. proposed
to make these properties explicit in a single objective function (Sindhwani et al. 2005; Sindhwani and Rosenberg 2008). They propose co-regularization, a regularization framework in
which both the ensemble quality and the disagreement between base learners are simultane-
ously optimized. The key idea is to use an objective function comprised of two terms: one
that penalizes incorrect predictions made by the ensemble, and another that directly penalizes
different predictions of the base classifiers. To handle per-view noise within this framework,
Yu et al. (2011) introduced Bayesian co-training, which uses a graphical model for combin-
ing data from multiple views and a kernel-based method for co-regularization. This model
was extended to handle different noise levels per data point by Christoudias et al. (2009).
Co-training can be seen as a greedy optimization strategy for the co-regularization objec-
tive. The two components of the objective function are minimized in an alternating fashion:
the prediction error of the ensemble is minimized by training the base learners independently,
and the disagreement between classifiers is minimized by propagating predictions from one
classifier to the others as if they were ground truth. We note, however, that the general co-
regularization objective does not have to be optimized using a wrapper method, and many
co-regularization algorithms use different approaches (see, e.g. Sindhwani and Rosenberg
2008; Yu et al. 2011).
4.3 Boosting
Ensemble classifiers consist of multiple base classifiers, which are trained and then used to
form combined predictions (Zhou 2012). The simplest form of ensemble learning trains $k$ base classifiers independently and aggregates their predictions. Beyond this simplistic approach,
two main branches of supervised ensemble learning exist: bagging and boosting (Zhou 2012).
In bagging methods, each base learner is provided with a set of $l$ data points, which are
sampled, uniformly at random with replacement, from the original data set (bootstrapping).
The base classifiers are trained independently. When training is completed, their outputs are
aggregated to form the prediction of the ensemble. In boosting methods, on the other hand,
each base learner is dependent on the previous base learners: it is provided with the full data
set, but with weights applied to the data points. The weight of a data point $x_i$ is based on the performance of the previous base learners on $x_i$, such that larger weights get assigned
to data points that were incorrectly classified. The final prediction is obtained as a linear
combination of the predictions of the base classifiers.
Technically, boosting methods construct a weighted ensemble of classifiers $h_t$ in a greedy fashion. Let $F_{T-1}(x) = \sum_{t=1}^{T-1} \alpha_t \cdot h_t(x)$ denote the ensemble of classifiers $h_t$ with weights $\alpha_t$ at time $T-1$. Furthermore, let $\ell(\hat{y}, y)$ denote the loss function for predicting label $\hat{y}$ for a data point with true label $y$. In each iteration of the algorithm, an additional classifier $h_T$ is added to the ensemble with a certain weight $\alpha_T$, such that the cost function

$$L(F_T) = \sum_{i=1}^{l} \ell(F_T(x_i), y_i) = \sum_{i=1}^{l} \ell(F_{T-1}(x_i) + \alpha_T \cdot h_T(x_i), y_i)$$

is minimized. Note that, at time $T$, the ensemble $F_{T-1}$ is fixed. With particular choices of loss functions, such as $\ell(\hat{y}, y) = \exp(-\hat{y} \cdot y)$, the optimization problem yields a weighted classification problem for determining $h_T$, and allows us to express the optimal $\alpha_T$ in terms of the loss of $h_T$ on the training data.
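For the exponential loss, this greedy stage reduces to the familiar AdaBoost update; the following minimal sketch (Python with scikit-learn, labels in $\{-1,+1\}$; `F_prev` is a hypothetical function returning the current real-valued ensemble prediction) shows one such iteration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_step(F_prev, X, y, base=None):
    """One greedy boosting iteration under l(y_hat, y) = exp(-y_hat * y)."""
    base = base or DecisionTreeClassifier(max_depth=1)
    # The exponential loss of the fixed ensemble induces per-point weights.
    w = np.exp(-y * F_prev(X))
    w /= w.sum()
    # Fitting h_T on the weighted data solves the induced weighted
    # classification problem.
    h = base.fit(X, y, sample_weight=w)
    # The weighted error of h_T yields its optimal weight alpha_T.
    err = w[h.predict(X) != y].sum()
    alpha = 0.5 * np.log((1 - err) / err)
    return h, alpha  # the new ensemble is F_prev + alpha * h
```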
By definition, base learners in bagging methods are trained independently. Therefore,
the only truly semi-supervised bagging method would apply self-training to individual base
learners. Co-training, however, can be seen to be closely related to bagging methods: the only
way classifiers interact is by the exchange of pseudo-labelled data; other than that, the classi-
fiers can be trained independently and simultaneously. However, most co-training methods do
not use bootstrapping, a defining characteristic of bagging methods. In boosting, on the other
hand, there is an inherent dependency between base learners. Consequently, boosting meth-
ods can be readily extended to the semi-supervised setting, by introducing pseudo-labelled
data after each learning step; this idea gives rise to the class of semi-supervised boosting
methods.
Semi-supervised boosting methods have been studied extensively over the past two
decades. The success achieved by supervised boosting methods, such as AdaBoost (Freund
and Schapire 1997), gradient boosting, and XGBoost (Chen and Guestrin 2016), provides
ample motivation for bringing boosting to the semi-supervised setting. Furthermore, the
pseudo-labelling approach of self-training and co-training can be easily extended to boosting
methods.
4.3.1 SSMBoost
The first effort towards semi-supervised boosting methods was made by Grandvalet et al.,
who extended AdaBoost to the semi-supervised setting. They proposed a semi-supervised
boosting algorithm (Grandvalet et al. 2001), which they later extended and motivated from
the perspective of gradient boosting (d’Alché Buc et al. 2002). A loss function is defined for
unlabelled data, based on the predictions of the current ensemble and on the predictions of the
base learner under construction. Experiments were conducted with multiple loss functions;
the authors reported the strongest results using the expected loss of the new, combined
classifier. The weighted error $\epsilon_t$ for base classifier $h_t$ is thus adapted to include the unlabelled data points, causing the weight term $\alpha_t$ to depend on the unlabelled data as well.
Crucially, SSMBoost does not assign pseudo-labels to the unlabelled data points. As a
result, it requires semi-supervised base learners to make use of the unlabelled data and is
therefore intrinsically semi-supervised, in contrast to most other semi-supervised boosting
algorithms, which are wrapper methods. Nevertheless, SSMBoost is included here, because
it forms the foundation for all other forms of semi-supervised boosting algorithms, which do
not require semi-supervised base learners.
4.3.2 ASSEMBLE
The ASSEMBLE algorithm, short for Adaptive Supervised Ensemble, pseudo-labels the unla-
belled data points after each iteration, and uses these pseudo-labelled data points in the
construction of the next classifier, thus alleviating the need for semi-supervised base learn-
ers (Bennett et al. 2002). As shown by its authors, ASSEMBLE effectively maximizes the
classification margin in function space.
Since pseudo-labels are used in ASSEMBLE, it is not trivial to decide which unla-
belled data points to pass to the next base learner. Bennett et al. (2002) proposed to use bootstrapping, i.e. sampling, uniformly at random, with replacement, $l$ data points from the $l+u$ labelled and unlabelled data points.
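A single ASSEMBLE-style iteration might then look as follows (a simplified Python sketch; `ensemble_predict` is a hypothetical function returning the current real-valued ensemble output, and labels are in $\{-1,+1\}$):

```python
import numpy as np

def assemble_iteration(ensemble_predict, X_lab, y_lab, X_unl, base, rng):
    # Pseudo-label the unlabelled points with the current ensemble.
    y_pseudo = np.sign(ensemble_predict(X_unl))
    # Bootstrap: draw l points, uniformly at random with replacement,
    # from the combined pool of l + u (pseudo-)labelled points.
    X_pool = np.vstack([X_lab, X_unl])
    y_pool = np.concatenate([y_lab, y_pseudo])
    idx = rng.choice(len(X_pool), size=len(X_lab), replace=True)
    # The next base learner is trained on the bootstrap sample.
    return base.fit(X_pool[idx], y_pool[idx])
```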
4.3.3 SemiBoost
The semi-supervised boosting algorithm SemiBoost addresses the problem of selecting data
points to be used by the base learners by relying on the manifold assumption, utilizing
principles from graph-based methods (Mallapragada et al. 2009). Each unlabelled data point
is assigned a pseudo-label, and the corresponding prediction confidence is calculated based
on a predefined neighbourhood graph that encodes similarity between data points. Then, a
subset of these pseudo-labelled data points is added to the set of labelled data points for
training the next base learner. The probability of a sample being selected for this subset
is proportional to its prediction confidence. SemiBoost was successfully applied to object
tracking in videos by Grabner et al. (2008).
SemiBoost uses the standard boosting classification model, expressing the final label
prediction as a linear combination of the predictions of the individual learners. Its cost func-
tion, however, is highly dissimilar from the previously described semi-supervised boosting
methods. Mallapragada et al. (2009) argue that a successful labelling of the test data should
conform to the following three requirements. Firstly, the predicted labels of the unlabelled
data should be consistent for unlabelled data points that are close to each other. Secondly, the
predicted labels of the unlabelled data should be consistent with the labels of nearby labelled
data points. And, thirdly, the predicted labels for the labelled data points should correspond to
their true labels. These requirements are expressed in the form of a constrained optimization
problem, where the first two are captured by the objective function, and the last is imposed as
a constraint. In other words, the SemiBoost algorithm uses boosting to solve the optimization
problem
minimize
FT
LL(ˆ
y,A,FT)+λ·LU(ˆ
y,A,FT)
subject to ˆyi=yi,i=1,...,l,
(1)
where LUand LLare the cost functions expressing the inconsistency across the unlabelled
and the combined labelled and unlabelled data, respectively, and λ∈Ris a constant governing
the relative weight of the cost terms; Ais an n×nsymmetric matrix denoting the pairwise
similarities between data points. Lastly, FTdenotes the joint prediction function of the
ensemble of classifiers at time T. We note that the optimization objective in Eq. 1is very
similar to the cost functions encountered in graph-based methods (see Sects. 6.3 and 7)in
that it favours classifiers that consistently label data points on the same manifold. In graph-
based methods, however, no distinction is generally made between labelled-unlabelled and
unlabelled-unlabelled pairs.
4.3.4 Other semi-supervised boosting methods
The three previously discussed methods form the core of semi-supervised boosting research.
Further work in the area includes RegBoost, which, like SemiBoost, includes local label
consistency in its objective function (Chen and Wang 2011). In RegBoost, this term is also
dependent on the estimated local density of the marginal distribution p(x). Several attempts
have been made to extend the label consistency regularization to the multiclass setting (Tanha
et al. 2012; Valizadegan et al. 2008).
5 Unsupervised preprocessing
We now turn to a second category of inductive methods, known as unsupervised preprocessing, which, unlike wrapper methods and intrinsically semi-supervised methods, use the
unlabelled data and labelled data in two separate stages. Typically, the unsupervised stage
comprises either the automated extraction or transformation of sample features from the unla-
belled data (feature extraction), the unsupervised clustering of the data (cluster-then-label),
or the initialization of the parameters of the learning procedure (pre-training).
5.1 Feature extraction
Since the early days of machine learning, feature extraction has played an important role in
the construction of classifiers. Feature extraction methods attempt to find a transformation of
the input data such that the performance of the classifier improves or such that its construction
becomes computationally more efficient. Feature extraction is an expansive research topic that
has been covered by several books and surveys. We focus on a small number of particularly
prominent techniques and refer the reader to the existing literature on feature extraction
methods for further information (see, e.g. Guyon and Elisseeff 2006; Sheikhpour et al. 2017).
Many feature extraction methods operate without supervision, i.e. without taking into
account labels. Principal component analysis, for example, transforms the input data to a
different basis, such that the transformed features are linearly uncorrelated, and orders the principal components
based on their variance (Wold et al. 1987). Other traditional feature extraction algorithms
operate on the labelled data and try to extract features with high predictive power (see, e.g.
Guyon and Elisseeff 2006).
Recent semi-supervised feature extraction methods have mainly been focused on finding
latent representations of the input data using deep neural networks (in Sect. 6.2.1, we further
discuss neural networks). The most prominent example of this is the autoencoder: a neural
network with one or more hidden layers that has the objective of reconstructing its input. By
including a hidden layer with relatively few nodes, usually called the representation layer,
the network is forced to find a way to compactly represent its input data. Once the network
is trained, features are provided by the representation layer. A schematic representation of a
standard autoencoder is provided in Fig. 4.
The network can be considered to consist of two parts: the encoder $h$, which maps an input vector $x$ to its latent representation $h(x)$, and the decoder $g$, which attempts to map the latent representation back to the original $x$. The network is trained by optimizing a loss function penalizing the reconstruction error: a measure of inconsistency between the input $x$ and the corresponding reconstruction $g(h(x))$. Once the network is trained, the latent representation of any $x$ can be found by simply propagating it through the encoder part of the network to obtain $h(x)$. A popular type of autoencoder is the denoising autoencoder,
which is trained on noisy versions of the input data, penalizing the reconstruction error of
the reconstructions against the noiseless originals (Vincent et al. 2008). Another variant, the
contractive autoencoder, directly penalizes the sensitivity of the autoencoder to perturbations
in the input (Rifai et al. 2011b).
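As an illustration, a minimal denoising autoencoder in PyTorch might look as follows (layer dimensions and noise level are arbitrary placeholder choices):

```python
import torch
import torch.nn as nn

# Encoder h and decoder g; the 32-dimensional bottleneck is the
# representation layer from which features are taken after training.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

def train_step(x):
    # Corrupt the input, but penalize reconstruction of the clean original.
    x_noisy = x + 0.1 * torch.randn_like(x)
    loss = nn.functional.mse_loss(decoder(encoder(x_noisy)), x)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# After training, encoder(x) yields the extracted features h(x).
```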
Autoencoders attempt to find a lower-dimensional representation of the input space
without sacrificing substantial amounts of information. Thus, they inherently act on the
assumption that the input space contains lower-dimensional substructures on which the data
lie. Furthermore, when applied as a preprocessing step to classification, they assume that
two samples on the same lower-dimensional substructure have the same label. These obser-
Fig. 4 Simplified representation of an autoencoder: the input $x$ passes through the encoder $h(\cdot)$ to the representation $h(x)$, and through the decoder $g(\cdot)$ to the reconstruction $g(h(x))$. The rectangles correspond to layers within the network; the trapeziums represent the encoder and decoder portions of the network, which can consist of multiple layers
vations indicate that the assumptions underlying autoencoders are closely related to the
semi-supervised manifold assumption.
In some domains, data is not inherently represented as a meaningful feature vector. Since
many common classification methods require such a representation, feature extraction is a
necessity in those cases. The feature extraction step, then, consists of finding an embedding
of the given object into a vector space by taking into account the relations between different
input objects. Examples of such approaches can be found in natural language processing
(Collobert et al. 2011; Mikolov et al. 2013) and network science (Grover and Leskovec
2016; Perozzi et al. 2014; Wang et al. 2016).
5.2 Cluster-then-label
Clustering and classification have traditionally been regarded as relatively disjoint research
areas. However, many semi-supervised learning algorithms use principles from clustering to
guide the classification process. Cluster-then-label approaches form a group of methods that
explicitly join the clustering and classification processes: they first apply an unsupervised or
semi-supervised clustering algorithm to all available data, and use the resulting clusters to
guide the classification process.
Goldberg et al. (2009) first cluster the labelled data and a subset of the unlabelled data.
A classifier is then trained independently for each cluster on the labelled data contained in
it. Finally, the unlabelled data points are classified using the classifiers for their respective
clusters. In the clustering step, a graph is constructed over the data points using the Hellinger
distance; size-constrained spectral clustering is then applied to the resulting graph. Since the
clustering is only used to segment the data, after which individual learners are applied to
each cluster, the approach supports any supervised base learner.
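A much-simplified sketch of this cluster-then-label idea is shown below (Python with scikit-learn; k-means and logistic regression stand in for the spectral clustering and base learners used by Goldberg et al. 2009):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def cluster_then_label(X_lab, y_lab, X_unl, n_clusters=5):
    X_all = np.vstack([X_lab, X_unl])
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_all)
    c_lab, c_unl = clusters[:len(X_lab)], clusters[len(X_lab):]
    # Fall back to a global classifier for clusters without labelled points.
    y_pred = LogisticRegression(max_iter=1000).fit(X_lab, y_lab).predict(X_unl)
    for c in range(n_clusters):
        m_lab, m_unl = c_lab == c, c_unl == c
        if not m_unl.any() or not m_lab.any():
            continue
        if len(np.unique(y_lab[m_lab])) == 1:
            y_pred[m_unl] = y_lab[m_lab][0]  # single-class cluster
        else:  # train a per-cluster classifier on its labelled points
            clf = LogisticRegression(max_iter=1000).fit(X_lab[m_lab], y_lab[m_lab])
            y_pred[m_unl] = clf.predict(X_unl[m_unl])
    return y_pred
```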
Demiriz et al. (1999) first cluster the data in a semi-supervised manner, favouring clusters
with limited label impurity (i.e. a high degree of consistency in the labels of the data points
within a given cluster), and use the resulting clusters in classification. Dara et al. (2002)
proposed a more elaborate preprocessing step, applying self-organizing maps (Kohonen
1998) to the labelled data in an iterative fashion. The unlabelled data points are then mapped,
yielding a cluster assignment for each of them. If the cluster to which an unlabelled data point $x_i$ is mapped contains only data points with the same label, that label is also assigned to $x_i$. This process can be iterated, after which the resulting label assignments can be used
to train an inductive classifier (in the work of Dara et al., a multilayer perceptron). We note
that this approach can be regarded as a wrapper method (see Sect. 4).
5.3 Pre-training
In pre-training methods, unlabelled data is used to guide the decision boundary towards
potentially interesting regions before applying supervised training.
This approach naturally applies to deep learning methods, where each layer of the hierar-
chical model can be considered a latent representation of the input data. The most commonly
known algorithms corresponding to this paradigm are deep belief networks and stacked
autoencoders. Both methods are based on artificial neural networks and aim to guide the
parameters (weights) of a network towards interesting regions in model space using the
unlabelled data, before fine-tuning the parameters with the labelled data.
Pre-training approaches have deep roots in the field of deep learning. Since the early 2000s,
neural networks with multiple hidden layers (deep neural networks) have been gaining an
increasing amount of attention. However, due to their high number of tunable parameters,
training these networks has often been challenging: convergence tended to be slow, and trained
networks were prone to poor generalization (Erhan et al. 2010). Early on, these problems
were commonly addressed by employing unsupervised pre-training methods. Since then, this
has been mostly superseded by the application of weight sharing, regularization methods and
different activation functions. Consequently, the work we cover in this section mainly stems
from the first decade of the 2000s. However, the underlying principles still apply, and are
still used in other methods (such as ladder networks, see Sect. 6.2.2).
Deep belief networks consist of multiple stacked restricted Boltzmann machines (RBMs),
which are trained layer-by-layer with unlabelled data in a greedy fashion (Hinton et al. 2006).
The resulting weights are then used as the initialization for a deep neural network with the
same architecture augmented by an additional output layer, enabling the model to be trained
in a supervised manner on the labelled data.
Stacked autoencoders are very similar to deep belief networks, but they use autoencoders
as their base models instead of RBMs. The autoencoders are trained layer-by-layer, where the encoding $h(x)$ produced by each autoencoder is passed as the input to the next autoencoder,
which is then trained to reconstruct it with minimal error. Finally, these trained autoencoders
are combined, an output layer is added (as in deep belief networks), and the resulting network
is trained on the labelled data in a supervised manner. The paradigm works with multiple
types of autoencoders, including denoising and contractive autoencoders (Vincent et al. 2008;
Rifai et al. 2011b).
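A sketch of greedy layer-wise pre-training with simple autoencoders (PyTorch; `unlabelled_loader` is a hypothetical iterator over input batches, and all dimensions and epoch counts are placeholders):

```python
import torch
import torch.nn as nn

def pretrain_stacked(dims, unlabelled_loader, epochs=5):
    """Greedily pre-train encoder layers; dims = [input_dim, h1, h2, ...].
    The returned encoder stack would be topped with an output layer and
    fine-tuned on the labelled data."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        enc = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())
        dec = nn.Linear(d_out, d_in)
        opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
        for _ in range(epochs):
            for x in unlabelled_loader:
                with torch.no_grad():  # encode with the layers trained so far
                    h = x
                    for layer in layers:
                        h = layer(h)
                loss = nn.functional.mse_loss(dec(enc(h)), h)
                opt.zero_grad()
                loss.backward()
                opt.step()
        layers.append(enc)
    return nn.Sequential(*layers)
```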
Based on an empirical analysis of deep belief networks and stacked autoencoders, Erhan
et al. (2010) suggested that unsupervised pre-training guides the neural network model
towards regions in model space that provide better generalization. Deep neural networks
are often motivated from the perspective that they learn a higher-level representation of the
data at each layer. Thus, each layer of the network can be considered to contain a differ-
ent representation of the input data. Both deep belief networks and stacked autoencoders
attempt to guide the model in the extraction of these hierarchical representations, pushing
the model towards the extraction of representations that are deemed informative. From that
perspective, pre-training methods are closely related to the unsupervised feature extraction
methods described earlier: they both use unlabelled data in an attempt to extract meaningful
information from the input data. Crucially, however, the parameters used for unsupervised
preprocessing can be changed in the supervised fine-tuning phase of pre-training methods,
whereas they remain fixed after the unsupervised phase of feature extraction approaches.
6 Intrinsically semi-supervised methods
We now turn our attention to inductive learning algorithms that directly optimize an objective
function with components for labelled and unlabelled samples. These methods, which we
call intrinsically semi-supervised, do not rely on any intermediate steps or supervised base
learners. Usually, they are extensions of existing supervised methods to include unlabelled
samples in the objective function.
Generally, these methods rely either explicitly or implicitly on one of the semi-supervised
learning assumptions (see Sect. 2.1). For instance, maximum-margin methods rely on the
low-density assumption, and most semi-supervised neural networks rely on the smooth-
ness assumption. We begin with an overview of the earliest intrinsically semi-supervised
classification methods, namely maximum-margin methods. Next, we discuss perturbation-
based methods, which directly incorporate the smoothness assumption. These encompass
most semi-supervised neural networks. Then, we consider manifold-based techniques, which
either explicitly or implicitly approximate the manifolds on which the data lie. Finally, we
consider generative models.
6.1 Maximum-margin methods
Maximum-margin classifiers attempt to maximize the distance between the given data points
and the decision boundary. This approach corresponds to the semi-supervised low-density
assumption: when the margin between all data points and the decision boundary is large
(except for some outliers), the decision boundary must be in a low-density area (Ben-David
et al. 2009). Conceptually, maximum-margin methods thus lend themselves well to extension
to the semi-supervised setting: one can incorporate knowledge from the unlabelled data to
determine where the density is low and thus, where a large margin can be achieved.
6.1.1 Support vector machines
The most prominent example of a supervised maximum-margin classifier is the support
vector machine (SVM): a classification method that attempts to maximize the distance from
the decision boundary to the points closest to it, while encouraging data points to be classified
correctly. It was one of the first maximum-margin approaches to be proposed in the semi-
supervised setting, and it has been studied extensively since. We briefly introduce supervised
SVMs, but refer the reader to the machine learning book by Bishop (2006) for a more extensive introduction.
The objective of an SVM is to find a decision boundary that maximizes the margin, which
is defined as the distance between the decision boundary and the data points closest to it.
The term is also commonly used to describe the area extruding from the decision boundary
in which no data points lie. The soft-margin SVM is a popular variant of SVMs that allows
data points to violate the margin (i.e. lie between the corresponding margin boundary and
the decision boundary, or even be misclassified) at a certain cost. SVMs support implicit
mapping of objects to higher-dimensional feature spaces using the so-called kernel trick.
Formally, when training an SVM, we endeavour to find a weight vector $w \in \mathbb{R}^d$ with minimal magnitude and a bias variable $b \in \mathbb{R}$, such that $y_i \cdot (w \cdot x_i + b) \geq 1 - \xi_i$ for all data points $x_i \in X_L$. Here, $\xi_i \geq 0$ is called the "slack variable" for $x_i$, which allows $x_i$ to violate the margin at some cost, which is incorporated into the objective function. The corresponding optimization problem can be formulated as follows:

$$\begin{aligned} \underset{w, b, \xi}{\text{minimize}} \quad & \frac{1}{2} \cdot \|w\|^2 + C \cdot \sum_{i=1}^{l} \xi_i \\ \text{subject to} \quad & y_i \cdot (w \cdot x_i + b) \geq 1 - \xi_i, \quad i = 1, \ldots, l, \\ & \xi_i \geq 0, \quad i = 1, \ldots, l, \end{aligned}$$

where $C \in \mathbb{R}$ is a constant scaling factor for the penalization of data points violating the margin. If $C$ is large, the optimal margin will generally be narrow, and if $C$ is small, the optimal margin will generally be wide. Thus, $C$ acts as a regularization parameter, governing the trade-off between the complexity of the decision boundary and prediction accuracy on the training set.
The concept of semi-supervised SVMs, or S3VMs, is similar: we want to maximize the
margin, and we want to correctly classify the labelled data. However, in the semi-supervised
setting, an additional objective becomes relevant: we also want to minimize the number of
unlabelled data points that violate the margin. Since the labels of the unlabelled data points are
unknown, those that violate (i.e. lie within) the margin are penalized based on their distance
to the closest margin boundary.
The intuitive extension of the optimization problem for S3VMs thus becomes

$$\begin{aligned} \underset{w, b, \xi}{\text{minimize}} \quad & \frac{1}{2} \cdot \|w\|^2 + C \cdot \sum_{i=1}^{l} \xi_i + C' \cdot \sum_{i=l+1}^{n} \xi_i \\ \text{subject to} \quad & y_i \cdot (w \cdot x_i + b) \geq 1 - \xi_i, \quad i = 1, \ldots, l, \\ & |w \cdot x_i + b| \geq 1 - \xi_i, \quad i = l+1, \ldots, n, \\ & \xi_i \geq 0, \quad i = 1, \ldots, n, \end{aligned} \tag{2}$$

where $C' \in \mathbb{R}$ is the margin violation cost associated with unlabelled data points.
S3VMs were proposed by Vapnik (1998), who motivated the problem from a more transductive viewpoint: instead of optimizing only over the weight vector, bias and slack variables, he proposed to also optimize over the label predictions $\hat{\mathbf{y}}_U$. The constraint for the unlabelled data was formulated similarly to that for labelled data, but with the predicted labels $\hat{\mathbf{y}}_U$. Though different at first sight, this formulation is equivalent to optimization problem 2 above, since any labelling $\hat{\mathbf{y}}_U$ can only be optimal if, for each $\hat{y}_i \in \hat{\mathbf{y}}_U$, $x_i$ is on the correct side of the decision boundary (i.e. $\hat{y}_i \cdot (w \cdot x_i + b) \geq 0$). Otherwise, a better solution could be obtained by simply inverting the labelling of $x_i$.
The extension of SVMs to the semi-supervised setting carries one significant disadvantage:
the optimization problem encountered when training S3VMs becomes non-convex and NP-
hard. Consequently, most efforts in the study of S3VMs have been focused on training them
efficiently in practice.
Initial efforts showed promising results in applying S3VMs, but only to small data sets.
For instance, Bennett and Demiriz (1999) proposed to use the L1 norm instead of the L2 norm
in the objective function and posed the problem as a mixed integer programming problem.
The earliest widely used optimization approach was introduced by Joachims (1999), whose approach for solving the optimization problem starts with a random assignment of $\hat{\mathbf{y}}_U$ and a low value for $C'$. Each iteration of the algorithm then consists of three steps. First, the supervised SVM optimization problem corresponding to the current label assignment $\hat{\mathbf{y}}_U$ is solved. Next, the algorithm inverts the labels of each pair of data points for which this inversion improves the objective function, until no more such pairs exist. Finally, $C'$ is increased. The algorithm terminates when $C'$ reaches a predefined value specified by the user.
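In outline, the procedure can be sketched as follows (Python with scikit-learn; a heavily condensed illustration in which the pairwise label-swapping step is approximated by simply re-predicting the unlabelled points, and the annealing schedule is arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(X_lab, y_lab, X_unl, C_unl_final=1.0):
    svm = SVC(kernel="linear").fit(X_lab, y_lab)
    y_unl = svm.predict(X_unl)        # initial label assignment for X_U
    C_unl = 1e-3                      # start with a low unlabelled cost C'
    while C_unl < C_unl_final:
        X = np.vstack([X_lab, X_unl])
        y = np.concatenate([y_lab, y_unl])
        # Unlabelled points enter the supervised problem with weight C'.
        w = np.concatenate([np.ones(len(y_lab)), np.full(len(y_unl), C_unl)])
        svm = SVC(kernel="linear").fit(X, y, sample_weight=w)
        y_unl = svm.predict(X_unl)    # stand-in for the pairwise label swaps
        C_unl *= 2                    # increase C' towards its final value
    return svm, y_unl
```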
Other approaches to solving S3VMs have been put forward. For instance, several stud-
ies have proposed convex relaxations of the objective function, which can be solved using
semidefinite programming methods. The earliest such approach was introduced by de Bie
and Cristianini (2004, 2006) and later extended to the multiclass setting by Xu and Schuurmans (2005). However, due to their time complexity, these approaches do not scale to large
amounts of data.
Chapelle et al. (2008) provided an overview of optimization procedures for S3VMs up until 2008 and broadly categorize S3VM optimization methods into two categories: combinatorial methods, finding the label assignment $\hat{\mathbf{y}}_U$ that minimizes the objective function, and continuous methods, directly solving the optimization problem using label assignments $\hat{y}_i = \text{sign}(w \cdot x_i + b)$. All the approaches we have thus far described fall into the combinatorial category. However, the formulation in optimization problem 2 corresponds to the continuous approach; it underlies, for example, the concave-convex procedure, which decomposes the non-convex objective function into a convex and a concave component, and iteratively solves the optimization problem by replacing the concave component with a linear approximation at the current solution (Chapelle et al. 2008; Collobert et al. 2006).
Other continuous methods make use of the fact that problem 2 can be reformulated as an optimization problem without constraints. This stems from the fact that, if a labelled point $x_i \in X_L$ does not violate the margin, then $\xi_i = 0$ in the optimal solution. If it does violate the margin, then $\xi_i = 1 - y_i \cdot (w \cdot x_i + b)$. For an unlabelled data point $x_i \in X_U$, $\xi_i = 0$ if it does not violate the margin, and otherwise, $\xi_i = 1 - |w \cdot x_i + b|$. Thus, the optimization problem can be reformulated as

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2} \cdot \|w\|^2 + C \cdot \sum_{i=1}^{l} \max(0, 1 - y_i \cdot f(x_i)) + C' \cdot \sum_{i=l+1}^{n} \max(0, 1 - |f(x_i)|), \tag{3}$$

where $f(x_i) = w \cdot x_i + b$.
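A direct transcription of this unconstrained objective for a linear model (Python/NumPy; an evaluation of Eq. 3, not an optimizer, with arbitrary default values for $C$ and $C'$):

```python
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_unl=0.5):
    f_lab = X_lab @ w + b             # f(x) = w . x + b on labelled points
    f_unl = X_unl @ w + b
    reg = 0.5 * np.dot(w, w)          # margin term ||w||^2 / 2
    hinge_lab = np.maximum(0.0, 1.0 - y_lab * f_lab).sum()
    hinge_unl = np.maximum(0.0, 1.0 - np.abs(f_unl)).sum()
    return reg + C * hinge_lab + C_unl * hinge_unl
```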
This approach underlies ∇TSVM by Chapelle and Zien (2005), which is based on a smooth approximation of the objective function in Eq. 3, obtained by squaring the loss for the labelled data points, and by approximating the loss for the unlabelled data points with an exponential function. This optimization problem is then solved by gradient descent, where $C'$ is gradually increased from some value close to zero to its intended value. Chapelle et al. (2006a) take a similar approach, where they keep $C'$ fixed and use a continuous approach to transform the objective function from using only the labelled data to the final objective function.
As is the case for most semi-supervised learning methods, S3VMs are not guaranteed
to perform better than their supervised counterparts (Singh et al. 2009). Specifically, if one
of the underlying assumptions of the semi-supervised learning method is violated, there
is a large risk of degrading performance when introducing the unsupervised objective. In
the case of S3VMs, many highly diverse decision boundaries may exist that pass through
a low-density area and achieve reasonable classification performance on the labelled data.
Consequently, one can expect the generalization performance of such classifiers to exhibit
significant variance.
Li and Zhou (2015) proposed to mitigate this problem by considering a diverse set of
low-density separators and choosing the separator that performs best under the worst pos-
sible ground truth. Like all S3VM variants, their method is premised on the assumption
that the optimal decision boundary lies in a low-density area. Their algorithm, called S4VM
(safe S3VM), consists of two stages. Firstly, a diverse set of low-density decision boundaries
is constructed. To this end, the authors propose to minimize a cost function that penalizes
the pairwise similarity between the label predictions associated with the decision bound-
aries, using deterministic annealing and a heuristic sampling method. Secondly, the decision
boundary with maximal worst-case performance gain over the supervised decision boundary
is chosen as the result of S4VM training. This problem formulation limits the probability that the solution found by an S4VM exhibits performance worse than the corresponding supervised
SVM.
The performance gain is formulated as the resulting increase in the number of correctly
labelled data points minus the increase in the number of incorrectly labelled data. The latter
term is multiplied by a factor $\lambda \in \mathbb{R}$, governing the amount of risk of performance degradation one wishes to take. Formally, this is captured by a scoring function $J(\hat{\mathbf{y}}, \mathbf{y}, \mathbf{y}^{\text{svm}})$ for a set of predicted labels $\hat{\mathbf{y}}$, ground truth $\mathbf{y}$, and supervised SVM predictions $\mathbf{y}^{\text{svm}}$, defined as

$$J(\hat{\mathbf{y}}, \mathbf{y}, \mathbf{y}^{\text{svm}}) = gain(\hat{\mathbf{y}}, \mathbf{y}, \mathbf{y}^{\text{svm}}) - \lambda \cdot lose(\hat{\mathbf{y}}, \mathbf{y}, \mathbf{y}^{\text{svm}}),$$

where $gain$ and $lose$ denote the increases in correctly and incorrectly labelled data points, respectively. The optimal label assignment $\bar{\mathbf{y}}$ in the worst-case true labelling can then be found as

$$\bar{\mathbf{y}} \in \underset{\mathbf{y} \in \{\pm 1\}^u}{\arg\max} \; \underset{\hat{\mathbf{y}} \in \mathcal{M}}{\min} \; J(\mathbf{y}, \hat{\mathbf{y}}, \mathbf{y}^{\text{svm}}),$$

where $\mathcal{M}$ is the set of all candidate label assignments such that the corresponding decision boundary cuts through a low-density area. Due to the optimization over all possible label
assignments, this optimization problem is NP-hard. Li and Zhou (2015) proposed a con-
vex relaxation of the problem to effectively find a good candidate solution. Based on the
assumption that the true label assignment is indeed in this set, they proved that, if $\lambda \geq 1$,
the performance of S4VM is never lower than that of the corresponding SVM. They vali-
dated this finding empirically, and showed that their implementation achieves performance
improvements over standard SVMs similar to other S3VM approaches, but that, contrary to
those, performance never significantly degrades relative to supervised SVMs.
The formulation of the second stage of the optimization procedure is not limited to sup-
port vector machines; indeed, it could theoretically be applied to any other semi-supervised
learning algorithm. Li and Zhou (2015) additionally propose to perform both stages simul-
taneously in a deterministic annealing approach.
6.1.2 Gaussian processes
The notion of margin maximization is directly incorporated into support vector machines,
and it should thus not come as a surprise that they are easily extended to the semi-supervised
setting. Less obviously, similar efforts have been made with other supervised methods as
well. Notably, Lawrence and Jordan (2005) have extended Gaussian processes to handle
unlabelled data.
Gaussian processes are a family of non-parametric models that estimate the posterior probability over the function $f$ mapping points in the input space to a continuous output space. When used for binary classification purposes, which are the focus of the paper by Lawrence and Jordan (2005), this output is in turn mapped to the label space $Y = \{-1, 1\}$. In the learning phase, $f$ is established such that the likelihood of observing the data points $((x_i, y_i))_{i=1}^{l}$ is maximized. The resulting model can be considered an $l$-dimensional Gaussian distribution over the label vector $\mathbf{y}$ of the input data points, where $l$ is the number of labelled data points. Predictions for a previously unseen data point $x^*$ can then be made by the model by evaluating the posterior probability of the respective class label, conditioned on the observed data points $X$, their associated labels $\mathbf{y}$, and the observed data point $x^*$. The associated covariance matrix is the Gram matrix obtained from all $l+1$ data points using some kernel function $k$.
Lawrence and Jordan (2005) extended Gaussian processes for binary classification to
the semi-supervised case by incorporating the unlabelled data points into the likelihood
function. Specifically, the likelihood for an unlabelled data point $x$ is low when it is close to the decision boundary (i.e. when $f(x)$ is close to 0), and high when it is far away from the decision boundary. The space of possible labels is expanded to include a null category; the
posterior probability of this null category is high around the decision boundary. By imposing
the constraint that unlabelled data points can never be mapped to the null category, the model
is explicitly discouraged from choosing a decision boundary that passes through a high-
density area of unlabelled data points. In other words, unlabelled data points should be far
away from the decision boundary.
This extension of Gaussian processes to the semi-supervised setting has an interesting side
effect: contrary to supervised Gaussian processes, introducing additional (unlabelled) data
can increase the posterior variance. In other words, additional data can increase uncertainty.
This effect stems from the observation that the likelihood function for a single unlabelled data point $x^*$ can be bimodal if $f(x^*)$ is close to 0.
6.1.3 Density regularization
Another way of encouraging the decision boundary to pass through a low-density area is to
explicitly incorporate the amount of overlap between the estimated posterior class probabil-
ities into the cost function. When there is a large amount of overlap, the decision boundary
passes through a high-density area, and when there is a small amount of overlap, it passes
through a low-density area. Several approaches have been proposed to use this assumption
to regularize the objective function used in the context of classification.
Grandvalet and Bengio (2005) proposed to formalize this in the maximum a posteri-
ori (MAP) framework by imposing a prior on the model parameters, favouring parameters
inducing small class overlap in the predictive model (additionally, see Chapelle et al. 2006b).
In particular, they used Shannon’s conditional entropy as a measure of class overlap. The
prior is weighted by a constant $\lambda \in \mathbb{R}$. The resulting objective is generally non-convex. The
authors proposed solving the optimization problem by means of deterministic annealing.
This entropy regularization method can be applied to any supervised learning method based
on maximum-likelihood; the authors conducted experiments using logistic regression.
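The resulting loss is easy to state in code; the following sketch (PyTorch, with an arbitrary weight `lam`) combines cross-entropy on a labelled batch with the conditional-entropy penalty on an unlabelled batch, in the spirit of entropy regularization:

```python
import torch
import torch.nn.functional as F

def entropy_regularized_loss(logits_lab, y_lab, logits_unl, lam=0.1):
    sup = F.cross_entropy(logits_lab, y_lab)      # supervised term
    p = torch.softmax(logits_unl, dim=1)
    # Shannon conditional entropy of the predictions on unlabelled data;
    # minimizing it penalizes class overlap near the decision boundary.
    ent = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    return sup + lam * ent
```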
Corduneanu and Jaakkola (2003) proposed to directly incorporate an estimate of p(x),
the distribution over the input data, into the objective function. They add a cost term to the
objective function that reflects the belief that, in high-density areas, the posterior probability of $y$ conditioned on $x$ should not vary too much. To this end, they cover the entire input space $X$ with multiple, possibly overlapping, small regions; the cost term is then calculated as the sum of the mutual information between labels and inputs in each of these regions,
weighted by the estimated density in the region. Their work is an extension of earlier work
by Szummer and Jaakkola (2003).
Liu et al. (2013, 2015) proposed to incorporate the prior density into the node splitting
criterion of decision trees. When selecting the hyperplane for splitting the data at a node in
a decision tree, their approach penalizes high-density areas, using Gaussian kernel density
estimators to approximate p(x). They conducted experiments with random forests consisting
of 100 of the resulting semi-supervised decision trees and observed significant performance
improvements over supervised random forests for several data sets. Levatić et al. (2017)
introduced a more generic framework for using unlabelled data in the splitting criterion by
constructing an impurity measure for unlabelled data. In their experiments, they promoted
feature consistency within the data subsets on each side of the splitting boundary, penalizing
empirical variance for numerical data and the Gini impurity for nominal data. We note that
the specific categorization of these methods within our taxonomy depends on the choice of
splitting criterion.
6.1.4 Pseudo-labelling as a form of margin maximization
Depending on the base learner used, the self-training approach described in Sect. 4 can also be regarded as a margin-maximization method. For instance, when using self-training with supervised SVMs, the decision boundary is iteratively pushed away from the unlabelled
samples. Even though the unlabelled data are not explicitly incorporated into the loss function,
this amounts to exploiting the low-density assumption, as done in the case of S3VMs.
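The following sketch illustrates this effect (Python with scikit-learn; the confidence threshold and stopping criteria are arbitrary): each round pseudo-labels the unlabelled points the SVM is most confident about, so successive decision boundaries settle in progressively lower-density regions.

```python
import numpy as np
from sklearn.svm import SVC

def self_training_svm(X_lab, y_lab, X_unl, rounds=5, frac=0.1):
    svm = SVC(kernel="linear").fit(X_lab, y_lab)
    for _ in range(rounds):
        if len(X_unl) == 0:
            break
        # Distance to the boundary serves as the confidence measure.
        conf = np.abs(svm.decision_function(X_unl))
        take = np.argsort(-conf)[: max(1, int(frac * len(X_unl)))]
        X_lab = np.vstack([X_lab, X_unl[take]])
        y_lab = np.concatenate([y_lab, svm.predict(X_unl[take])])
        X_unl = np.delete(X_unl, take, axis=0)
        svm = SVC(kernel="linear").fit(X_lab, y_lab)  # refit on augmented set
    return svm
```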
6.2 Perturbation-based methods
The smoothness assumption entails that a predictive model should be robust to local pertur-
bations in its input. This means that, when we perturb a data point with a small amount of
noise, the predictions for the noisy and the clean inputs should be similar. Since this expected
similarity is not dependent on the true label of the data points, we can make use of unlabelled
data.
Many different methods exist for incorporating the smoothness assumption into a given
learning algorithm. For instance, one could apply noise to the input data points, and incor-
porate the difference between the clean and the noisy predictions into the loss function.
Alternatively, one could implicitly apply noise to the data points by perturbing the classifier
itself. These two approaches give rise to the category of perturbation-based methods.
Perturbation-based methods are often implemented with neural networks. Due to their
straightforward incorporation of additional (unsupervised) loss terms into their objective
function, they are extendable to the semi-supervised setting with relative ease. In recent
years, neural networks have received renewed interest, due to their successful application in various application areas (see, e.g. Collobert et al. 2011; Krizhevsky et al. 2012; LeCun et al.
2015). As a result, interest in semi-supervised neural networks has risen as well. In particular,
neural networks with many layers, so-called deep neural networks, have given rise to inter-
esting extensions to the semi-supervised setting. These intrinsically semi-supervised neural
networks differ from the neural networks used for feature extraction, which we discussed in
Sect. 5.1: the unlabelled data is incorporated directly into the optimization objective, rather
than being used in a separate preprocessing step. Before continuing our discussion of such
methods, we provide a short, general introduction to neural networks targeted at readers who
are not too familiar with them. For a more extensive introduction to (deep) neural networks,
we refer the interested reader to the recent book by Goodfellow et al. (2016).
6.2.1 Neural networks
A neural network is a formal system that computes an output vector by propagating an input
vector through a network of simple processing elements with weighted connections between
them. These simple processing elements are called nodes, and each of them contains an
activation function that ultimately determines its output. In the feedforward networks we are
considering here, nodes are usually grouped together into layers, where nodes from each
layer are only connected to nodes from adjacent layers. The output vector is calculated by
propagating the input vector through the weighted connections of the network. The output
of each node, commonly referred to as its activation, is calculated by applying its activation
function to the weighted sum of its inputs.
In supervised neural networks, the network weights are generally optimized to calculate the desired output vector for a given input vector. Considering the classification task, let $f: \mathbb{R}^d \to \mathbb{R}^{|Y|}$ denote the vector-valued function modelled by a neural network, mapping an input vector $x \in \mathbb{R}^d$ to a $|Y|$-dimensional output vector, where $Y$ denotes the set of possible classes. The function $f$ is modelled by a neural network consisting of one or multiple layers; nodes in consecutive layers are connected by weighted edges. The weights are stored in a weight matrix $W$, where the element at position $(i, j)$ denotes the weight of the edge between nodes $i$ and $j$. We use $f(x; W)$ to denote the output obtained by propagating the input $x$ through the network and evaluating the activations of the final layer.
A loss function $\ell$ is then defined, calculating the cost associated with output layer activations $f(x; W)$ for a data point $x$ with true label $y$. The complete cost function is then defined as

$$L(W) = \sum_{i=1}^{l} \ell(f(x_i; W), y_i). \tag{4}$$
The explicit notion of the parametrization of $f$ by $W$ is often omitted for conciseness. The weights in $W$ are iteratively optimized by passing input samples through the network and propagating the share of one or more samples in the cost $L$ backwards through the network. In this process, known as backpropagation, the weights are updated, using gradient
descent or a similar method to iteratively minimize the cost (Goodfellow et al. 2016). To
achieve good performance (in terms of loss), the network generally needs to pass multiple
times over the entire training set, and each such pass is known as an epoch.
In the literature on neural networks, various notation styles are used. In particular, some
of the articles we discuss use $\theta$ to denote the network weights, and denote the output of the corresponding network by $f_\theta(x)$. In discussing these articles, we use this notation style
when we deem it essential for maintaining relatability between the respective article and this
survey.
6.2.2 Semi-supervised neural networks
The simplicity and efficiency of the backpropagation algorithm for a great variety of loss func-
tions make it attractive to simply add an unsupervised component to $L$. This approach, which
can be considered a form of regularization over the unlabelled data, is employed by virtually
all semi-supervised deep learning methods. Furthermore, the hierarchical nature of representations in deep neural networks makes them a viable candidate for other semi-supervised
approaches. If deeper layers in the network express increasingly abstract representations of
the input sample, one can argue that unlabelled data could be used to guide the network
towards more informative abstract representations. Approaches based on this argument can
be readily implemented in deep neural networks through the smoothness assumption, giving
rise to so-called perturbation-based semi-supervised neural networks.
6.2.3 Ladder networks
The first such approach is the ladder network, proposed by Rasmus et al. (2015). It extends
a feedforward network to incorporate unlabelled data by using the feedforward part of the
network as the encoder of a denoising autoencoder, adding a decoder, and including a term
in the cost function to penalize the reconstruction cost. The underlying idea is that latent
representations that are useful for input reconstruction can also facilitate class prediction.
Consider a feedforward network with $K$ hidden layers and weights $W$. We denote the inputs of a layer $k$ (after normalization) as $z^k$, and the layer's activations (i.e. after applying the activation function) as $h^k$. Note that for conciseness, when referring to layer inputs and activations, we do not explicitly mention the input data $x_i$, nor the parametrization $W$ (e.g. we write $h^k$ for the activation vector of the $k$-th layer in a neural network with weights $W$ for data point $x_i$). In a regular feedforward network, the loss for a given data point $x_i$ is calculated by comparing the activations of the final layer $f(x_i) = h^K$ to the corresponding label $y_i$ with $\ell(f(x_i), y_i)$. As is shown in Eq. 4, the final cost function for the network is then $L(W) = \sum_{i=1}^{l} \ell(f(x_i), y_i)$.
Ladder networks add an additional term to $L$, in order to penalize the sensitivity of the network to small perturbations of the input. This is achieved by treating the entire network as the encoder part of a denoising autoencoder: isotropic Gaussian noise with mean zero and fixed variance is added to the input samples, and the existing feedforward network is treated as the encoder part. A decoder is then added alongside it, which is supposed to take the final-layer representation $h^K$ of a noisy data point $\tilde{x}$, and transform it to reconstruct $x$.
To achieve this goal, a reconstruction cost is added to the cost function of the network. This
inherently unsupervised cost term penalizes the difference between the input data points and
their reconstructions generated by the network; it applies to both labelled and unlabelled
data.
Although the autoencoder component of ladder networks is highly similar to regular
denoising autoencoders, it differs from those in two ways. Firstly, a ladder network injects
noise not only at the first layer, but at every layer. We denote the noisy inputs of a layer $k$ as $\tilde{z}^k$, and the resulting activations as $\tilde{h}^k$. The supervised loss component for each sample becomes $\ell(\tilde{h}^K, y)$: the loss function is evaluated against the output for the noisy sample. Note that, in the testing phase, no noise is induced at any point in the network.
Secondly, ladder networks utilize a different reconstruction cost calculation. Where regular denoising autoencoders only penalize the difference between the clean input $x$ and the reconstructed version $\hat{x}$ of the noisy input $\tilde{x}$, the ladder network also penalizes local reconstructions of the hidden representations of the data. To do so, they enforce the decoder to have $K$ layers, the same number of layers as the original network (the encoder). Each of these layers is also required to have the same number of nodes as the corresponding layer in the encoder. As a data point passes through the encoder, noise is added to the layer inputs at each layer. Then, at each layer in the decoder, the reconstructed representation $\hat{z}^k$ is compared to
the hidden representation $z^k$ of the clean input $x$ at layer $k$ in the encoder. This, of course, requires each data point to pass through the network twice: once without noise (to obtain $z$), and once with noise (to obtain $\tilde{z}$ and the reconstructed $\hat{z}$).
The final semi-supervised cost function of ladder networks then becomes

$$L(W) = \sum_{i=1}^{l} \ell(f(x_i), y_i) + \sum_{i=1}^{n} \sum_{k=1}^{K} \text{ReconsCost}(z_i^k, \hat{z}_i^k),$$

where $\text{ReconsCost}(\cdot, \cdot)$ is defined as the squared L2 norm of the difference between the two normalized latent vectors, summed over the labelled and unlabelled data. For a detailed diagram of the information flow in ladder networks, which uses the same notation we do, we refer the reader to Figure 1 in the ladder network study by Pezeshki et al. (2016, p. 4).
Through their penalization of reconstruction errors, ladder networks effectively attempt to
push the network towards extracting interesting latent representations of the data. The method
is premised on the assumption that a latent representation $h^K$ that is useful for reconstructing $x$ can also facilitate the prediction of the corresponding class label. Rasmus et al. (2015)
showed that ladder networks achieve state-of-the-art results on image data sets with partially
labelled data, including MNIST. Interestingly, they also reported improvements when using
only labelled data. Prémont-Schwarz et al. (2017) extended the ladder network architecture to
the recurrent setting by adding connections between the encoders and decoders of successive
instances of the network.
Rasmus et al. also proposed a simpler, computationally more efficient variant of ladder networks. This method, generally referred to as the Γ-model, only includes the reconstruction cost for the last layer. Therefore, no full decoder needs to be constructed. The Γ-model was empirically shown to provide substantial performance improvements over the corresponding fully-supervised model.
Pezeshki et al. (2016) conducted an extensive empirical study of the different components
of ladder networks. Their study revealed that the reconstruction cost at the first layer of the
neural network, combined with the introduction of noise in that layer, has critical impact on
overall performance. We note that this architecture differs from the Γ-model, which only considers the last, rather than the first, layer of the network when assessing reconstruction error.
6.2.4 Pseudo-ensembles
Instead of explicitly perturbing the input data, one can also perturb the neural network
model itself. Robustness in the model can then be promoted by imposing a penalty on
the difference between the activations of the perturbed network and those of the orig-
inal network for the same input. Bachman et al. (2014) proposed a general framework
for this approach, where an unperturbed parent model with parameters $\theta$ is perturbed to obtain one or more child models. In this framework, which they call pseudo-ensembles, the perturbation is obtained from a noise distribution $\Xi$. The perturbed network $\tilde{f}_\theta(x; \xi)$ is then generated based on the unperturbed parent network $f_\theta(x)$ and a sample $\xi$ from the noise distribution. The semi-supervised cost function then consists of a supervised part
and an unsupervised part. The former captures the loss of a perturbed network for labelled
input data, and the latter the consistency across perturbed networks for the unlabelled data
points.
Based on this framework, Bachman et al. (2014) proposed a semi-supervised cost function.
Consider a neural network with $K$ layers, and let $f_\theta^k(x)$ and $\tilde{f}_\theta^k(x; \xi)$ denote the $k$-th layer
activations of the unperturbed and the perturbed network, respectively. The cost function of
the pseudo-ensemble for neural networks then becomes
$$\mathbb{E}_{\xi \sim \Xi}\Bigg[\frac{1}{l} \cdot \sum_{i=1}^{l} L\big(\tilde{f}_\theta(x_i; \xi), y_i\big)\Bigg] + \mathbb{E}_{\xi \sim \Xi}\Bigg[\frac{1}{n} \cdot \sum_{i=1}^{n} \sum_{k=2}^{K} \lambda_k \cdot V_k\big(f_\theta^k(x_i), \tilde{f}_\theta^k(x_i; \xi)\big)\Bigg],$$
where the consistency loss $V_k$ penalizes differences between the activations of the unperturbed and perturbed networks at the $k$-th layer for the same input; $\lambda_k$ is the relative weight of that particular cost term. In their formalism, Bachman et al. (2014) consider distributions over the input data and consequently use expectations; for consistency within this survey, we have replaced these expectations by averages over the given data. Bachman et al. propose to gradually increase each $\lambda_k$ over time, in effect placing more weight on the supervised objective in early iterations. One particularly prominent method of inducing noise is dropout, which randomly sets activations to zero (i.e. temporarily removes nodes from the neural network) in each training iteration (Srivastava et al. 2014). In its originally proposed form, it was only applied to the supervised loss component.
However, Wager et al. (2013) and Bachman et al. (2014) showed that dropout can be readily
applied to unlabelled data as well.
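As an illustration of the per-layer consistency term, the following sketch assumes a model given as a plain list of PyTorch layers and dropout as the perturbation process; for simplicity it sums over all layers rather than layers 2 through $K$, and all names and weights are illustrative.

```python
import torch
import torch.nn.functional as F

def pseudo_ensemble_consistency(layers, x_unlab, drop_p=0.5, lambdas=None):
    """Consistency term: compare each layer's perturbed activations (dropout)
    with the unperturbed parent activations on the same unlabelled inputs."""
    lambdas = lambdas or [1.0] * len(layers)

    clean_acts, noisy_acts = [], []
    h_clean, h_noisy = x_unlab, x_unlab
    for layer in layers:
        h_clean = layer(h_clean)
        clean_acts.append(h_clean)
        # Child model: same weights, but activations randomly dropped.
        h_noisy = F.dropout(layer(h_noisy), p=drop_p, training=True)
        noisy_acts.append(h_noisy)

    # Penalize per-layer differences; clean activations act as fixed targets.
    return sum(lam * F.mse_loss(noisy, clean.detach())
               for lam, noisy, clean in zip(lambdas, noisy_acts, clean_acts))
```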
The framework proposed by Bachman et al. is not limited to semi-supervised settings:
the supervised term in the loss function can be applied to any supervised learning problem.
Furthermore, a similar approach could be applied to other learning algorithms than neural
networks, although the per-layer activation comparison would have to be replaced by a
suitable alternative. Of course, since neural networks are entirely parametrized by connection
weights, they offer a relatively straightforward implementation of model perturbation.
6.2.5 Π-model
Instead of comparing the activations of the unperturbed parent model with those of the
perturbed models in the cost function, one can also compare the perturbed models directly.
A simple variant of this approach, where two perturbed neural network models are trained,
was suggested by Laine and Aila (2017). They use dropout (Srivastava et al. 2014) as the perturbation process, and penalize the differences in the final-layer activations of the two networks using squared loss. The weight of the unsupervised term in the cost function starts at zero, and is gradually increased. This approach, which they name the $\Pi$-model, can be
seen as a simple variant of pseudo-ensembles.
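A minimal sketch of this objective is shown below, assuming a PyTorch classifier whose only stochasticity is dropout; the Gaussian ramp-up indicated in the comment mirrors the schedule reported by Laine and Aila, but the constants are illustrative.

```python
import torch
import torch.nn.functional as F

def pi_model_loss(model, x_lab, y_lab, x_unlab, w_t):
    """Pi-model: two stochastic forward passes (dropout active) must agree."""
    model.train()  # keep dropout enabled for both passes
    x_all = torch.cat([x_lab, x_unlab], dim=0)

    logits_1 = model(x_all)  # first perturbed pass
    logits_2 = model(x_all)  # second perturbed pass (different dropout mask)

    # Consistency: squared difference of class probabilities over all points.
    consistency = F.mse_loss(F.softmax(logits_1, dim=1),
                             F.softmax(logits_2, dim=1))

    # Supervised cross-entropy on the labelled subset only.
    supervised = F.cross_entropy(logits_1[: x_lab.size(0)], y_lab)
    return supervised + w_t * consistency

# The unsupervised weight w_t is ramped up from zero over the first epochs,
# e.g. w_t = w_max * exp(-5 * (1 - min(t / ramp_epochs, 1)) ** 2).
```

Note that both forward passes use the same weights; the only difference between the two "networks" is the randomly sampled dropout mask, which is what makes this a particularly cheap instance of the pseudo-ensemble idea.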
6.2.6 Temporal ensembling
Since the noise process used in the methods described thus far is stochastic, the entire neural
network model can be considered a stochastic model. With the $\Pi$-model, the network is
regularized by penalizing the difference in output of two perturbed network models, drawn
from the same distribution, on the same input. This idea can be extended to more than two
perturbed models. Such an approach was taken by Sajjadi et al. (2016), who additionally
perturbed the input data with random transformations. Of course, such pairwise compar-
isons will increase the running time of each training iteration quadratically in the number
of perturbations. Pseudo-ensembles solve this problem by comparing the perturbed network
activations to the activations of the unperturbed network model.
In the same paper in which they propose the $\Pi$-model, Laine and Aila (2017) propose a
different approach to combining multiple perturbations of a network model: they compare
the activations of the neural network at each epoch to the activations of the network at
previous epochs. In particular, after each epoch, they compare the output of the network
to the exponential moving average of the outputs of the network in previous epochs. Since
the connection weights are changed in each iteration, this cannot be considered a form of
pseudo-ensembling, but it is conceptually related, in that the network output is smoothed
over multiple model perturbations.
This approach—dubbed temporal ensembling, because it penalizes the difference in the network outputs at different points in time during the training process—can be considered an extension of the $\Pi$-model. However, instead of comparing $f_\theta(x; \xi)$ to $f_\theta(x; \xi')$ for $\xi, \xi' \sim \Xi$, it uses comparisons to the exponential moving average of final-layer activations in previous epochs.
epochs. Since the loss function for unlabelled data points depends on the network output in
previous iterations, temporal ensembling is closely related to pseudo-labelling methods, such
as the pseudo-label approach (Lee 2013) and self-training. The crucial difference, however,
is that the entire set of final-layer activations is compared to the activations of the previous
network model, whereas self-training approaches and pseudo-label convert these outputs to
a single, hard prediction (the pseudo-label).
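The bookkeeping for temporal ensembling is modest, as the following NumPy sketch illustrates; the bias correction in the second line follows the start-up correction described by Laine and Aila, while the array names and the once-per-epoch update cadence are assumptions.

```python
import numpy as np

# Z: running ensemble of final-layer outputs, one row per training point.
# After each epoch, fold the new predictions into the moving average and
# derive bias-corrected targets for the unsupervised consistency loss.
def update_ensemble_targets(Z, preds, epoch, alpha=0.6):
    Z = alpha * Z + (1 - alpha) * preds        # exponential moving average
    targets = Z / (1 - alpha ** (epoch + 1))   # correct start-up bias
    return Z, targets

# Usage sketch, for n training points and c classes:
#   Z = np.zeros((n, c))
#   each epoch: Z, targets = update_ensemble_targets(Z, preds, epoch)
#   the consistency loss then penalizes ||preds_i - targets_i||^2 per point.
```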
6.2.7 Mean teacher
When training a neural network using temporal ensembling, unlabelled data points are incor-
porated into the learning process at large intervals. Since the activations for each input are
only generated once per epoch, it takes a long time for the activations of unlabelled data
points to influence the inference process. Tarvainen and Valpola (2017) attempted to over-
come this problem by considering moving averages over connection weights, instead of
moving averages over network activations.
Specifically, they suggested calculating the exponential moving average of weights at each training iteration, and comparing the resulting final-layer activations to the final-layer activations when using the latest set of weights. Furthermore, they imposed noise on the
input data to increase robustness. Formally, consider a neural network with weights $W_t$ at iteration $t$, and a set of averaged weights $\hat{W}_t$. The loss function for an unlabelled input, then, is calculated as $\ell(x) = ||f(\tilde{x}; \hat{W}_t) - f(\tilde{x}'; W_t)||^2$, where $\tilde{x}$ and $\tilde{x}'$ are two noise-augmented versions of $x$. After calculating $W_{t+1}$ using backpropagation, $\hat{W}_{t+1}$ is calculated as $\hat{W}_{t+1} = \alpha \cdot \hat{W}_t + (1 - \alpha) \cdot W_{t+1}$, where $\alpha$ is the decay rate. They name the model with averaged weights $\hat{W}$ the teacher model, and the latest model with weights $W_t$ the student model. This terminology has since been adopted in the literature when constructing semi-supervised neural networks.
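In code, the teacher is simply a second copy of the network whose parameters track an exponential moving average of the student's parameters; the sketch below assumes two PyTorch modules with identical architectures, and the noise level and decay rate are illustrative values.

```python
import torch

@torch.no_grad()
def update_teacher(student, teacher, alpha=0.99):
    """EMA update: teacher <- alpha * teacher + (1 - alpha) * student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1 - alpha)

def consistency_loss(student, teacher, x, noise_std=0.1):
    """Penalize disagreement between student and teacher on noisy inputs."""
    x_student = x + noise_std * torch.randn_like(x)
    x_teacher = x + noise_std * torch.randn_like(x)
    with torch.no_grad():                    # teacher provides fixed targets
        target = teacher(x_teacher)
    return ((student(x_student) - target) ** 2).mean()
```

Because the teacher is updated every iteration rather than once per epoch, unlabelled data points influence the targets far more quickly than under temporal ensembling.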
6.2.8 Virtual adversarial training
Most of the perturbation-based approaches we have discussed thus far aim to promote robust-
ness to small perturbations in the input. In doing so, they do not take into account the
directionality of the perturbation: the injected noise is generally isotropic. However, it has
been suggested in several studies that the sensitivity of neural networks to perturbations in
the input is often highly dependent on the direction of these perturbations (Szegedy et al.
2013; Goodfellow et al. 2014b).
Miyato et al. (2018) proposed a regularization procedure that takes the perturbation
direction into account. For each data point, labelled or unlabelled, they approximate the
perturbation to the corresponding input data that would yield the largest change in network
output (the so-called adversarial noise). They then incorporate a term into the loss function
that penalizes the difference in the network outputs for the perturbed and unperturbed input
data. For the unperturbed data point, the weights from the previous optimization iteration are
used. Formally, the adversarial loss function for a sample $x$ can be defined as
$$\ell(x) = D\big(f(x; \hat{W}),\, f(x + \gamma^{adv}; W)\big),$$
where $D$ is some divergence measure, $\gamma^{adv}$ is the adversarial noise, and $\hat{W}$ are the previous network weights. Their approach is called virtual adversarial training, after the supervised
adversarial training method proposed by Goodfellow et al. (2014b). In the latter approach,
the outputs for the perturbed input are compared to the respective true outputs, rather than to
the outputs of the network for the unperturbed input. As such, regular adversarial training can
only be applied in a supervised setting. Adversarial training and virtual adversarial training
both bear close similarity to contractive autoencoders: there, the sensitivity of the network to
perturbations in the inputs is penalized by directly assessing the derivatives of the network
outputs with respect to the inputs (Rifai et al. 2011b).
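The adversarial direction is commonly approximated with a single power-iteration step from a random direction. The following simplified sketch follows that scheme, assuming flat feature vectors of shape (batch, features) and illustrative values for the step size xi and perturbation radius eps.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, eps=2.0):
    """Virtual-adversarial-style loss for a batch of (unlabelled) inputs x;
    one power-iteration step approximates the perturbation direction that
    changes the network output the most."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)           # unperturbed predictions

    d = torch.randn_like(x)                       # random initial direction
    d = d / d.norm(dim=1, keepdim=True)
    d.requires_grad_(True)

    # Divergence after a tiny step xi along d; its gradient w.r.t. d points
    # towards the (virtual) adversarial direction.
    logp_hat = F.log_softmax(model(x + xi * d), dim=1)
    div = F.kl_div(logp_hat, p, reduction='batchmean')
    grad = torch.autograd.grad(div, d)[0]
    r_adv = eps * grad / grad.norm(dim=1, keepdim=True)

    # Penalize divergence between perturbed and unperturbed outputs.
    logp_adv = F.log_softmax(model(x + r_adv.detach()), dim=1)
    return F.kl_div(logp_adv, p, reduction='batchmean')
```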
Park et al. (2018) combined concepts from virtual adversarial training with the $\Pi$-model (see Sect. 6.2.5). Instead of perturbing the unlabelled data points with adversarial noise, they apply an adversarial dropout mask to the network weights. First, they sample a random dropout mask $\epsilon^s$. Then, within some maximum distance from $\epsilon^s$, they find the dropout mask $\epsilon^{adv}$ that maximizes the difference between the unperturbed network output and the network output when the dropout mask is applied. Their loss function, then, is defined as
$$\ell(x) = D\big(f(x; W, \epsilon^s),\, f(x; W, \epsilon^{adv})\big),$$
where the network is parameterized by the weights $W$ as well as the dropout mask. Park et al. (2018) reported small performance improvements over virtual adversarial training and the $\Pi$-model.
6.2.9 Semi-supervised mixup
The perturbation-based neural networks we have discussed thus far rely on a particularly
strong instantiation of the smoothness assumption: they encourage the predictions of the
network to be identical for minor perturbations in the input, regardless of the direction of
the perturbation. Recently, several researchers have considered the possibility of applying
larger perturbations to the input. In this scenario, the direction of the perturbation generally
does matter: when the perturbation points towards the decision boundary, the neural network
outputs (but not necessarily the resulting class assignment) should typically change more
than when it points away from the decision boundary.
This approach was formalized in the supervised mixup method, which was proposed by
Zhang et al. (2018). They postulate that, in a robust classifier, the predictions for a linear
combination of feature vectors should be a linear combination of their labels. They incorporate
this by training on augmented data points in addition to the original labelled samples. To
this end, they randomly select pairs of data points $(x, y)$ and $(x', y')$ during training, and sample an interpolation factor $\lambda$ from a symmetric beta distribution $\mathrm{Beta}(\alpha, \alpha)$, where $\alpha$ is a predetermined hyperparameter. The network is then trained in a supervised manner on the linearly interpolated data point $(\hat{x}, \hat{y})$, where
$$\hat{x} = \lambda \cdot x + (1 - \lambda) \cdot x', \qquad \hat{y} = \lambda \cdot y + (1 - \lambda) \cdot y'.$$
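The interpolation itself is only a few lines, as the following NumPy sketch shows; it assumes flat feature vectors and one-hot label vectors, and pairs each data point with a random partner from the same batch (a common implementation choice, not mandated by the method).

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Return a mixup-augmented batch: convex combinations of input pairs
    and of their (one-hot) labels, with weights drawn from Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha, size=(x.shape[0], 1))
    perm = rng.permutation(x.shape[0])       # random partner for each point
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```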
In their experiments, Zhang et al. (2018) report substantial performance improvements in
several training scenarios. Their best results are achieved when the beta distribution hyper-
parameter, $\alpha$, is relatively low, causing the distribution to be strongly biased towards the extremes (i.e., $\lambda = 0$ and $\lambda = 1$). Consequently, a large majority of interpolated samples will lie very close to either of the two selected data points.
The interpolation used in mixup can be applied to unlabelled samples as well, by interpo-
lating the predicted labels rather than the true labels. Verma et al. (2019) combined mixup with
the mean teacher approach (see Sect. 6.2.7), determining the target label for the augmented
data point as the linear interpolation of the predictions of the teacher model. Interestingly,
the interpolation was only applied to pairs of unlabelled data points, and not to mixed pairs
of labelled and unlabelled data points. Berthelot et al. (2019) proposed a semi-supervised
neural network composed of several supervised and semi-supervised components, includ-
ing a semi-supervised extension of mixup. In selecting data points for interpolation, they
do not distinguish between labelled and unlabelled data points. For labelled data points,
the true label is then used in interpolation; for unlabelled data points, the predicted label is
used.
Mixup exhibits similarities to graph-based methods (see Sects. 6.3 and 7): rather than employing pointwise perturbations, it applies perturbations based on combinations of different data points. Unlike in graph-based methods, however, the pairwise similarity between
data points is not taken into account. The precise implications of this remain an interesting
avenue for future research.
6.3 Manifolds
Perturbation-based methods make direct use of the smoothness assumption, penalizing dif-
ferences in the behaviour of a classifier under slight changes in the input or in the classifier
itself. However, one can imagine that not all minor changes to the input should yield similar
outputs. In particular, if the data lie on lower-dimensional manifolds, one can expect the
classifier to be insensitive only to minor changes along the manifold. This observation corre-
sponds to the manifold assumption, which forms the basis of a significant body of intrinsically
semi-supervised learning algorithms.
An $m$-dimensional manifold is a subspace of the original input space that locally resembles Euclidean space $\mathbb{R}^m$. Reiterating the definition from Sect. 2, the manifold assumption states that (a) the input space is composed of multiple lower-dimensional manifolds on which all data points lie and (b) data points lying on the same lower-dimensional manifold have the same label. Formally, the first part of the manifold assumption states that each conditional probability distribution $p(x|y)$ has a structure corresponding to the union of one or more Riemannian manifolds $\mathcal{M}$. The second part, then, states that points on the same Riemannian manifold $\mathcal{M}$ should have the same label. If these assumptions hold,
information about the manifolds present in the input space can prove useful to classifica-
tion.
In this section, we consider two general types of methods that are based on the manifold
assumption. Firstly, we consider manifold regularization techniques, which define a graph
over the data points and implicitly penalize differences in predictions for data points with
small geodesic distance. Secondly, we consider manifold approximation techniques, which
explicitly estimate the manifolds $\mathcal{M}$ on which the data lie and optimize an objective function
accordingly.
6.3.1 Manifold regularization
Consider a labelled data point $x_i$ and an unlabelled data point $x_j$, and assume that $x_i$ lies on some manifold $\mathcal{M}$. If $x_j$ also lies on $\mathcal{M}$, the manifold assumption implies that it is likely to have the same label as $x_i$. Furthermore, assuming that the data is concentrated on lower-dimensional manifolds, we can expect there to be more data points $x^*$ located on $\mathcal{M}$.
If we have sufficiently many data points, we can thus expect there to be some "path", a so-called geodesic, from $x_j$ to $x_i$, passing through other labelled or unlabelled samples,
such that each path segment is relatively short. We can formalize this notion of a path by
defining a graph over all data points, connecting pairs of data points that are close together
in the original input space with an edge. Edge weights may be used to express the degree of
similarity. This is the key principle underlying graph-based methods, which also form the
basis of transductive semi-supervised learning (see Sect. 7).
Following this motivation, Belkin et al. (2005, 2006) formulated a general framework for regularizing inductive learners based on manifolds. They considered a kernel $K : X \times X \to \mathbb{R}$ with a corresponding hypothesis space $\mathcal{H}_K$ and an associated norm $||\cdot||_K$. For supervised problems, then, they formulated the following general optimization problem:
$$\underset{f \in \mathcal{H}_K}{\text{minimize}} \quad \sum_{i=1}^{l} \ell(f(x_i), y_i) + \gamma \cdot ||f||_K^2,$$
for some loss function $\ell$ on labelled data. Here, $\gamma$ denotes the relative influence of the
smoothing term. This objective function simultaneously penalizes misclassifications and
promotes smoothness of the predictive function. For the semi-supervised setting, they
added an unsupervised regularization term that penalizes differences in label assignments
for pairs of data points that have a direct edge between them in the graph. Implic-
itly, they thereby encourage data points on the same manifold to receive the same label
prediction.
This unsupervised regularization term gives rise to the class of manifold regularization
methods. Consider a similarity graph with symmetric weighted adjacency matrix $W$, where $W_{ij}$ denotes the similarity between data points $x_i$ and $x_j$ ($W_{ij} = 0$ if the points are not connected). Let $D$ denote the degree matrix, which is a diagonal matrix with $D_{ii} = \sum_{j=1}^{n} W_{ij}$. The manifold regularization term $||f||_I^2$ is then defined as
$$||f||_I^2 = \frac{1}{2} \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} \cdot \big(f(x_i) - f(x_j)\big)^2. \tag{5}$$
The regularization term can be expressed as $\mathbf{f}^\intercal \cdot L \cdot \mathbf{f}$, where $L = D - W$ is the graph Laplacian, and $\mathbf{f} \in \mathbb{R}^n$ is the vector of evaluations of $f$ for each $x_i$. The final optimization problem, including the manifold regularization term from Eq. 5, becomes
$$\underset{f \in \mathcal{H}_K}{\text{minimize}} \quad \frac{1}{l} \cdot \sum_{i=1}^{l} \ell(f(x_i), y_i) + \gamma \cdot ||f||_K^2 + \gamma_U \cdot ||f||_I^2, \tag{6}$$
where $\gamma_U$ determines the relative influence of the manifold regularization term.
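To illustrate how the regularization term of Eq. 5 is computed in practice, the following NumPy sketch builds a Gaussian similarity graph and evaluates $\mathbf{f}^\intercal L \mathbf{f}$; the bandwidth and sparsification threshold are illustrative assumptions, and real implementations often use k-nearest-neighbour graphs instead.

```python
import numpy as np
from scipy.spatial.distance import cdist

def manifold_penalty(X, f_vals, sigma=1.0, threshold=1e-3):
    """Compute ||f||_I^2 = f^T L f for a Gaussian similarity graph over X.
    X: (n, d) data matrix; f_vals: (n,) vector of predictions f(x_i)."""
    W = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)          # no self-loops
    W[W < threshold] = 0.0            # sparsify: drop weak edges
    D = np.diag(W.sum(axis=1))        # degree matrix
    L = D - W                         # (unnormalized) graph Laplacian
    return f_vals @ L @ f_vals
```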
This general framework leads to semi-supervised extensions of popular supervised learning algorithms, such as Laplacian support vector machines (LapSVMs), where the loss function is defined as the hinge loss, i.e. $\ell(\hat{y}, y) = \max\{1 - y\hat{y}, 0\}$. The supervised objective of LapSVMs maximizes the margin, and the unsupervised objective maximizes consistency of predictions along the estimated manifolds. In the paper proposing this framework, Belkin et al. (2006) suggested solving the resulting loss minimization problem in its dual form, similar to popular solving techniques for supervised SVMs, in time $O(n^3)$. Melacci and Belkin (2011) suggested solving the optimization problem in its primal form. Combining an early stopping criterion with a preconditioned conjugate gradient, they reduced the time complexity to $O(c \cdot n^2)$ for some $c$ that is empirically shown to be significantly smaller than $n$.
Qi et al. (2012) suggested extending twin SVMs, which optimize two SVM-like objective functions to yield two non-parallel decision boundaries, one for each class (Jayadeva et al. 2007), to include the LapSVM regularization term. Sindhwani et al. (2005) and Sindhwani and Rosenberg (2008) extended manifold regularization to the co-regularization framework (see Sect. 4.2). They proposed to construct two classifiers using an objective function similar to that of LapSVMs for two different views. Niyogi (2008) provided some theoretical analysis of the manifold regularization framework and analyzed its usefulness in semi-supervised learning.
Zhu and Lafferty (2005) proposed to incorporate a manifold regularization term in a
generative model. They expressed the data-generating distribution as a mixture model, where
the manifold is locally approximated by a mixture model component. Their loss function
consists of a regularizer over the graph and a generative component. Weston et al. (2008)
incorporated a manifold regularization term into deep neural networks. They proposed several
methods to incorporate the manifold structure using an auxiliary embedding task, which encourages the latent representations in the neural network to be similar for similar inputs.
Furthermore, they suggested to include a regularization term that explicitly pushes the latent
representations of non-similar data points (defined as not being neighbours in the underlying
graph) further apart. This approach was applied to hyperspectral image classification by Ratle
et al. (2010). More recently, Luo et al. (2018) employed a loss function that encourages data
points with the same label, either predicted (for unlabelled data points) or true (for labelled
data points), to have similar latent representations in the penultimate layer. Additionally,
it encourages the latent representations of data points with different predicted labels to be
dissimilar.
The graph construction process is non-trivial and involves many hyperparameters. For
instance, one can use a variety of connectivity criteria and edge weighting schemes. This
makes the performance of manifold regularization methods highly dependent on hyperpa-
rameter settings. Geng et al. (2012) attempted to overcome this problem by first selecting a
set of candidate Laplacians using different hyperparameter settings. They then posed the opti-
mization problem as finding the linear combination of Laplacians that minimizes the manifold
regularization objective. Formally, let there be $m$ candidate Laplacians $L_1, \ldots, L_m$. Assume that the optimal manifold $L^*$ lies in the convex hull of $L_1, \ldots, L_m$, i.e. $L^* = \sum_{j=1}^{m} \mu_j \cdot L_j$ with $\sum_{j=1}^{m} \mu_j = 1$ and $\mu_j \geq 0$ for $j = 1, \ldots, m$. Since each $L_j$ is a valid graph Laplacian, their linear combination is a valid graph Laplacian as well. Using exponential weights in the Laplacian, the manifold regularization term $||f||_I^2$ then becomes
$$||f||_I^2 = \mathbf{f}^\intercal \cdot L^* \cdot \mathbf{f} = \mathbf{f}^\intercal \cdot \Bigg(\sum_{j=1}^{m} \mu_j \cdot L_j\Bigg) \cdot \mathbf{f} = \sum_{j=1}^{m} \mu_j \cdot ||f||_{I(j)}^2,$$
where $||f||_{I(j)}^2$ is the manifold regularization term for candidate Laplacian $L_j$. This final regularization term is then used in the original optimization problem from Eq. 6, with the addition of a regularization term $||\mu||^2$ to prevent the optimizer from overfitting to one manifold, and the constraint that $\sum_{j=1}^{m} \mu_j = 1$. The objective function is then optimized with respect to $\mu$ and $f$, which Geng et al. proposed to do in an EM-like fashion (i.e. alternately fixing one and optimizing the other). Their approach, which they call ensemble manifold regularization, was demonstrated to be superior to LapSVMs when applied to the SVM objective function on both synthetic and real-world data sets (Geng et al. 2012).
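As a sketch of the alternating optimization, the $\mu$-update for fixed $f$ admits a simple closed form if one minimizes $\sum_j \mu_j r_j + \lambda ||\mu||^2$ subject to $\sum_j \mu_j = 1$, with $r_j = \mathbf{f}^\intercal L_j \mathbf{f}$, while ignoring the non-negativity constraints; the clipping below is a crude repair, not the exact projection used by Geng et al.

```python
import numpy as np

def update_mu(f_vals, laplacians, lam=1.0):
    """Closed-form mu-update for fixed f (non-negativity handled by clipping):
    minimizing sum_j mu_j * r_j + lam * ||mu||^2 s.t. sum(mu) = 1 gives
    mu_j = 1/m + (mean(r) - r_j) / (2 * lam), where r_j = f^T L_j f."""
    r = np.array([f_vals @ L @ f_vals for L in laplacians])
    mu = 1.0 / len(r) + (r.mean() - r) / (2.0 * lam)
    mu = np.clip(mu, 0.0, None)
    return mu / mu.sum()   # renormalize onto the simplex
```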
Aside from the methods proposed by Geng et al. (2012) and Luo et al. (2018), graph
construction methods have mainly been studied in the context of transductive semi-supervised
learning. We cover these methods extensively in Sect. 7.
6.3.2 Manifold approximation
Manifold regularization techniques introduce a regularization term that directly captures the
fact that manifolds locally represent lower-dimensional Euclidean space. However, one can
also consider a two-stage approach, where the manifold is first explicitly approximated and
then used in a classification task. This is the approach taken by manifold approximation
techniques, which construct an explicit representation of the manifold. We note that such
approaches have a close relation to, and can in some cases even be considered as, semi-
supervised preprocessing (see Sect. 5).
Rifai et al. (2011a) developed such an approach, where the manifolds are first estimated using contractive autoencoders (CAEs, see Rifai et al. 2011a), and then used by a supervised training algorithm. CAEs are a variant of autoencoders that, in addition to the normal
reconstruction cost term in autoencoders, penalize the derivatives of the output activations
with respect to the input values. By doing so, they penalize the sensitivity of the learned
features to small perturbations in the input without relying on sampling these perturbations
(like denoising autoencoders do). Rifai et al. (2011b) claim that CAEs do not merely penalize
sensitivity to small perturbations in the input, but that they penalize small perturbations of
the input data along the manifold. They argue that this effect occurs due to the balance of
promoting reconstruction and penalizing sensitivity to inputs. In other words, they claim to
act directly on the manifold assumption.
The loss function $L$ utilized by contractive autoencoders with reconstruction cost $\ell(\cdot, \cdot)$ is
$$L = \sum_{i=1}^{n} \ell\big(g(h(x_i)), x_i\big) + \lambda \cdot ||J||_F^2,$$
where $||J||_F$ is the Frobenius norm of the Jacobian matrix of the outputs with respect to the inputs, i.e. the sum of the squared partial derivatives of each output activation with respect to each input value. Rifai et al. additionally proposed to penalize the Hessian of the output values. Due to the computational complexity of exactly calculating the Hessian, they
propose to approximate it as the difference between the Jacobians corresponding to small
perturbations of the input.
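For a single-layer sigmoid encoder $h(x) = s(Wx + b)$, the Jacobian penalty admits the closed form $||J||_F^2 = \sum_i (h_i(1 - h_i))^2 \sum_j W_{ij}^2$, since $J = \mathrm{diag}(h \odot (1 - h)) \cdot W$. The following NumPy sketch computes it under that assumed encoder form, which is an illustrative special case rather than the general setting.

```python
import numpy as np

def cae_penalty(W, b, x):
    """Frobenius-norm Jacobian penalty of a sigmoid encoder h(x) = s(Wx + b).
    Since J = diag(h * (1 - h)) @ W, the squared norm factorizes per unit."""
    h = 1.0 / (1.0 + np.exp(-(W @ x + b)))      # encoder activations
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))
```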
Using singular value decomposition, they estimate the tangent plane at each input point
to approximate the actual manifolds. As a result, the distance between two data points along
the manifold can be estimated and subsequently used in classification, e.g. via a k-nearest
neighbour algorithm. Additionally, they suggested to use a deep neural network pre-trained
with multiple, stacked contractive autoencoders, where an additional term is added to the loss
function to explicitly penalize sensitivity of the outputs to perturbations along the tangent
plane.
A manifold can be described as a collection of overlapping charts, each having a simple
geometry, that jointly cover the entire manifold. Such a collection of charts is known as an
atlas. Pitelis et al. (2013, 2014) suggested approximating these charts explicitly, associating
each with an affine subspace. They alternate between assigning data points to charts, and
choosing the affine subspace best matching the data for each chart. The charts are initialized
using principal component analysis on a set of random subspaces. From this, a set of charts
and a soft assignment of points to charts is obtained (since points can be associated with
more than one chart). Finally, from these charts and soft assignments, kernels are generated
that are then used in SVM-based supervised learning.
6.4 Generative models
The aforementioned methods are all discriminative: their only goal is to infer a function that
can classify data points. In some cases, they produce probabilistic predictions; in others,
they only yield the most likely class to assign. In all cases, they approach the classification
problem without explicitly modelling any of the data-generating distributions. In contrast, the
primary goal of methods based on generative models is to model the process that generated
the data. When such a generative model is conditioned on a given label y, it can also be used
for classification.
6.4.1 Mixture models
If prior knowledge about $p(x, y)$ is available, generative models can be very powerful. For instance, consider the case where we know that our data $p(x, y)$ is composed of a mixture of $k$ Gaussian distributions, each of which corresponds to a certain class. Most discriminative methods would not be able to properly incorporate this prior information. Instead, one would be best served by simply fixing the model as a mixture of $k$ Gaussian components. Each component $j = 1, \ldots, k$ has three parameters: a weight $\pi_j$ (where $\sum_{j=1}^{k} \pi_j = 1$), a mean vector $\mu_j$, and a covariance matrix $\Sigma_j$. The most likely parameters can then be inferred, for example via expectation-maximization (Dempster et al. 1977). This model is generative: it models the distribution $p(x, y)$, from which samples $(x, y)$ can be drawn. The model can then also be used for classification: since the inference procedure yields an estimate $\hat{p}(x|y)$ of the conditional distribution $p(x|y)$, one can simply assign to an unlabelled data point $x_i \in X_U$ the class $c$ that maximizes $\hat{p}(x_i \mid y_i = c) \cdot p(y_i = c)$. In the case of the Gaussian mixture models described earlier, $p(y_i = c) = \pi_c$.
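As a minimal illustration of this decision rule, the following sketch scores an unlabelled point under each class-conditional Gaussian using SciPy; the parameters pi, mu, and Sigma are assumed to have been estimated beforehand, e.g. via EM over the labelled and unlabelled data together.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify(x, pi, mu, Sigma):
    """Assign x to the class c maximizing p_hat(x | y = c) * pi_c,
    given per-class mixture parameters estimated via EM."""
    scores = [pi[c] * multivariate_normal.pdf(x, mean=mu[c], cov=Sigma[c])
              for c in range(len(pi))]
    return int(np.argmax(scores))
```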
The application of mixture models to generative modelling comes with several caveats
(Cozman et al. 2003; Zhu 2008). Firstly, the mixture model should be identifiable: each
distinct parameter choice for the mixture model should determine a distinct joint distribution,
up to a permutation of the mixture components. Secondly, mixture models hinge on the
critical assumption that the assumed model is correct. If the model is not correct, i.e. the true
distribution $p(x, y)$ does not conform with the assumed model, unlabelled data may hurt
performance rather than improve it.
In real-world applications, the model correctness assumption rarely holds. Therefore,
using mixture models for generative modelling can prove difficult. Some approaches exist to
mitigate these problems; for example, Nigam et al. (2000) vary the influence of unlabelled
data in EM. However