CLUSTIMPUTE: AN R PACKAGE FOR K-MEANS CLUSTERING WITH
BUILT-IN MISSING DATA IMPUTATION
OLIVER PFAFFEL
Abstract. This article introduces a novel k-means clustering implementation that handles
missing values in multiple variables. It is computationally efficient since it uses the current
cluster assignment to define a plausible distribution for the missing values. Experiments show
good scalability, comparable to a simple random imputation, and a clustering performance on
par with a pre-processing by more complex imputation methods.
1. Introduction: clustering in the presence of missing values
When missing values are present in the data, partitional clustering is not possible out of the
box. A typical strategy is missing value imputation before clustering. For an in-depth account on
the imputation of missing values we refer to [6]. At a high level, there are broadly two types of
imputation methods, applied to each column/variable of the data set that has at least one missing
entry:
• Unconditional approaches such as mean, median or random imputation (a minimal example
follows after this list). These approaches have the benefit of being computationally cheap.
However, mean and median imputation reduce the variance in the data and are deterministic.
This may lead to a false confidence in the resulting clusters. Random imputation is exact if
values are missing completely at random, but it ignores correlations between variables when
values are only missing at random.
• Conditional approaches iteratively train a regression model on all rows without missing
values, given all other variables as covariates (incl. currently imputed values), and predict
a value for all rows with a missing value. These approaches are exact if missings are at
random. However, such approaches are often quite costly, and typically significantly more
costly than the clustering itself.
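As a minimal illustration of the two unconditional approaches from the first bullet (the toy vector
x and its values are made up for this example):

x <- c(1.2, NA, 0.7, 2.5, NA, 1.9)
# median imputation: deterministic and shrinks the variance
x_median <- replace(x, is.na(x), median(x, na.rm = TRUE))
# random imputation: draw replacements from the observed (marginal) values
x_random <- x
x_random[is.na(x)] <- sample(x[!is.na(x)], sum(is.na(x)), replace = TRUE)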
This article introduces a novel k-means clustering implementation that includes an efficient missing
data imputation. A version of this algorithm is implemented in the ClustImpute R package. The
imputation method is essentially a column-wise nearest-neighbor imputation that is nested in
the clustering algorithm. The neighborhoods are given by the current cluster assignments.
Therefore, there is no need to fit a k-nearest-neighbors algorithm for each imputation step, which
reduces computational time considerably. Typically, k-means initializes with random partitions
that will not yield good imputation neighborhoods if the missings are not completely at random.
Thus, the imputed values drawn from each neighborhood are weighted down during a "burn-in" phase
(whose length is set by the user).
The benefit will be visualized by a simple example: Often a clustering based on median or
random imputation will not provide good results even if we know the number of clusters. Both
approaches badly distort the data set in Figure 1 and lead to unreasonable clusters. The result
with ClustImpute for the very same data set can be observed in Figure 2.
ClustImpute draws the missing values iteratively based on the current cluster assignment so
that correlations are considered on this level. Basically, we assume that a more granular dependence
structure between the missings is not relevant since we are "only" interested in k partitions.
Subsequently, penalizing weights are imposed on imputed values and successively decreased (to zero)
as the partitioning, and therefore also the missing data imputation, becomes more accurate.
Key words and phrases. clustering, imputation, missing data, k-means, scalability, missing at random, R.
Figure 1. Imputation with the median vs. random imputation on a simulated
data set further described in Chapter 4. Median imputation on the left produces
an artificial point mass near the x and y axes. Random imputation on the right
produces many points far away from the actual clusters.
The heuristic is that at some point the observed point is near a cluster that provides a suitable
neighborhood to draw the missing variable from. The algorithm is computationally efficient since the
imputation is only as accurate as the clustering, and is shown to be much faster than any approach
that derives the full conditional missing distribution, e.g., as implemented in the powerful MICE
package, independently of the clustering.
The article is structured as follows: Chapter 2 describes the implemented algorithm in great detail.
Subsequently, Chapter 3 highlights important parts of the implementation in R and provides
an overview of similar R packages. Scalability and overall clustering performance of the imple-
mentation are tested in Chapter 4. Finally, we conclude with a short summary and an outlook on
further interesting features to be implemented in future versions of ClustImpute.
2. Description of the algorithm
The intuition for the algorithm is that an observation should be clustered with other observations
mainly based on their observed values (hence the weights on imputed values), while the resulting
clusters provide donors for the missing value imputation, so that subsequently all variables can be
used for clustering. It is usually recommended to standardize the data, if not measured on the same
scale, prior to clustering so that each column has a mean of zero and a standard deviation of one.
If all variables were measured on the same scale, it is still important that they are centered before
ClustImpute is applied with weights, i.e., with n_end > 1; otherwise the weighting procedure will
introduce a bias.
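For illustration, such a standardization can be done in base R before the data is passed to
ClustImpute; dat is a placeholder for the user's numeric data:

dat_scaled <- as.data.frame(scale(dat))  # center each column to mean 0 and scale to sd 1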
On a high level, the algorithm follows these steps (a minimal sketch in base R is given after the list):
1. Random imputation: replace all NAs by random imputation, i.e., for each variable with miss-
ings, draw from the marginal distribution of this variable excluding the missings. This does
not take into account any correlations with other variables.
2. Weights < 1 are multiplied with the imputed values to adjust their scale. The weights are calculated
   by a (linear) weight function that starts near zero and converges to 1 at n_end.
3. Regular k-means clustering with the Euclidean norm is performed for c_steps steps,
   starting with a random initialization.
4. The imputed values from step 2 are replaced by new draws conditionally on the cluster assign-
ment from step 3.
5. Steps 2 to 4 are repeated nr_iter times in total. Any subsequent k-means clustering in step 3
   uses the previous cluster centroids for initialization. Typically nr_iter is larger than n_end.
6. After the last draw of missing values a final k-means clustering is performed.
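The following minimal sketch in base R makes these steps concrete. It is not the ClustImpute
implementation (which uses ClusterR and offers more options); the function name
clust_impute_sketch and all variable names are made up, and edge cases such as empty clusters
or fully missing columns are ignored:

clust_impute_sketch <- function(X, K, nr_iter = 10, n_end = 8, c_steps = 10, seed = 1) {
  set.seed(seed)
  X <- as.matrix(X)
  miss <- is.na(X)
  # Step 1: unconditional random imputation from each column's observed values
  for (j in seq_len(ncol(X))) {
    obs <- X[!miss[, j], j]
    X[miss[, j], j] <- obs[sample.int(length(obs), sum(miss[, j]), replace = TRUE)]
  }
  centers <- NULL
  for (l in seq_len(nr_iter)) {
    w <- min(l / n_end, 1)            # Step 2: linear weight function w(l)
    X_w <- X
    X_w[miss] <- X[miss] * w          # down-weight the imputed entries
    # Step 3: k-means on the weighted data, warm-started with the previous centroids
    if (is.null(centers)) {
      km <- kmeans(X_w, centers = K, iter.max = c_steps)
    } else {
      km <- kmeans(X_w, centers = centers, iter.max = c_steps)
    }
    centers <- km$centers
    # Step 4: redraw each missing value from observed values of the same column and cluster
    for (j in seq_len(ncol(X))) {
      for (k in seq_len(K)) {
        idx <- which(miss[, j] & km$cluster == k)
        donors <- X[!miss[, j] & km$cluster == k, j]
        if (length(donors) == 0) donors <- X[!miss[, j], j]  # fallback: all observed values
        if (length(idx) > 0)
          X[idx, j] <- donors[sample.int(length(donors), length(idx), replace = TRUE)]
      }
    }
  }
  kmeans(X, centers = centers, iter.max = c_steps)  # Step 6: final clustering
}

Called as, e.g., clust_impute_sketch(dat_with_miss, K = 3), the sketch returns a standard kmeans
object, whereas ClustImpute returns a richer output list, e.g. the cluster assignment accessed as
res$clusters in the experiments below.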
A very good overview of missing data imputation can be found in [6], for example. For k-means
clustering we refer to the chapter on unsupervised learning in [1].
In the following we describe the clustering procedure formally and in more detail. We begin by
describing classical k-means clustering and then highlight how this implementation differs.
First we describe the computation of the (hidden) k-th cluster centroid from all observations
$x$ assigned to cluster $k$. Assume there are $N$ observations $x$ in a $p$-dimensional space of
real numbers and $K$ clusters. Then each partition is characterized by a function
$s\colon \{1, \dots, N\} \mapsto \{1, \dots, K\}$ that maps each observation to exactly one cluster.
The computation of the $k$-th cluster centroid in the $l$-th iteration can be written as

\[
c_k^l = \frac{1}{N_k^{l-1}} \sum_{\{i \,:\, s^{l-1}(i) = k\}} x_i \tag{1}
\]

where $N_k^{l-1}$ is the number of observations in cluster $k$, i.e.,
$N_k^{l-1} = |\{i : s^{l-1}(i) = k\}|$, and $s^{l-1}$ is the partition function from the previous
iteration. The initialization $s^0$ is typically random. The following step is to determine the
closest centroid for each observation, i.e., to update the partition function:
\[
s^l(i) = \operatorname*{arg\,min}_k \left\| x_i - c_k^l \right\|^2 \tag{2}
\]
Here $\|\cdot\|$ denotes the Euclidean norm. In our setting $X = (x_{ij})$ has missing values.
Therefore we are estimating

\[
\tilde{X}_{ij}^l = \mathbf{1}_{\{x_{ij} \neq NA\}}\, x_{ij} + \mathbf{1}_{\{x_{ij} = NA\}}\, U_{ij}^l\, w(l), \tag{3}
\]

where the weight function is given by $w(l) = \min\left(\frac{l}{n_{end}}, 1\right)$, and $U_{ij}^l$
is uniformly distributed on all non-missing values of the same column $j$ that lie in the same
cluster $s^{l-1}(i)$, or on all non-missing values of column $j$ if this set is empty.
Mathematically, $U_{ij}^l$ is uniformly distributed on

\[
S_{ij}^l = \begin{cases} \{x_{rj} \neq NA : s^{l-1}(r) = s^{l-1}(i)\}, & \text{if non-empty,} \\ \{x_{rj} \neq NA\}, & \text{otherwise.} \end{cases} \tag{4}
\]

Figure 2. Clustering based on the simulated data with missings from Figure 1. On the left,
k-means is applied to the data set after a random imputation. On the right, the proposed package
ClustImpute was used without any imputation or other pre-processing step. Note that the
underlying data has four additional "noise" variables not shown in the figures, thus the clusters
overlap somewhat when plotted w.r.t. x and y only.
Thus the calculation of the new centroids is not only conditional on $s$ but also on the realization
of the random variable $U^l = (U_{ij}^l)$:

\[
\tilde{c}_k^l = \frac{1}{N_k^{l-1}} \sum_{\{i \,:\, s^{l-1}(i) = k\}} \tilde{x}_i^l, \tag{5}
\]

where $U_{ij}$ is simply set to zero if $x_{ij}$ is not missing. In early iterations, the weight
function $w(l)$ is near zero; thus, for each component $j$, this is basically the mean over all
non-missing values $x_{ij}$. Since the denominator $N_k$ does not change with the share of missing
values, there is a linear regularization towards zero (the mean, due to standardization)
proportional to the share of missing values. Finally, the update of the partition function,

\[
s^l(i) = \operatorname*{arg\,min}_k \left\| \tilde{x}_i^l - \tilde{c}_k^l \right\|^2, \tag{6}
\]

triggers, by definition, an update of $S_{ij}$, $U$ and $\tilde{X}$.
3. Description of the implementation
ClustImpute is an Rimplementation of the algorithm stated in Chapter 2. The main script
is ClustImpute.R and explained here. Since random imputation is simply a conditional sam-
pling this is implemented directly in the algorithm. The clustering is performed by the function
ClusterR::KMeans_arma for csteps where the centroids are initialized randomly in the first step
and with the current clusters centroids thereafter, i.e., in the 2nd step we use
ClusterR::KMeans_arma(...,CENTROIDS=cl_old,...) where clold denotes the current cluster
centroids. The ClusterR package was chosen for its fast implementation using RcppArmadillo
and the ability to start from customly chosen centroids. The missing values are sampled condi-
tionally on the current cluster assignment provided by ClusterR::predict_KMeans. Therefore,
only rather ‘standard’ packages like dplyr and rlang are imported by ClustImpute.
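As an illustration of this nested clustering step, a minimal stand-alone snippet could look as
follows; X_weighted, k and c_steps are placeholders for the weighted, imputed data and the
parameters, while KMeans_arma and predict_KMeans are the ClusterR functions referred to above:

library(ClusterR)
X_weighted <- matrix(rnorm(200), ncol = 2)   # stand-in for the weighted, imputed data
k <- 3
c_steps <- 50
# first iteration: random initialization of the centroids
cl_old <- KMeans_arma(X_weighted, clusters = k, n_iter = c_steps)
# subsequent iterations: warm start from the current centroids
cl_new <- KMeans_arma(X_weighted, clusters = k, n_iter = c_steps,
                      seed_mode = "keep_existing", CENTROIDS = cl_old)
# cluster assignment that defines the donor sets for the next imputation draw
assignment <- predict_KMeans(X_weighted, cl_new)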
The package uses testthat for unit testing, has (as of version 0.1.3) full test coverage and uses
Travis CI for continuous integration. Further dependencies arise from helper functions for the
vignette and will be explained in the next chapter. The development page of ClustImpute can
be found at https://github.com/o1iv3r/ClustImpute
We are not aware of any implementations in R or Python that implement the algorithm stated
in Chapter 2. However, there are several packages for missing data imputation that can be used
before an application of any k-means implementation in R, for example:
• MICE provides multiple imputation using Fully Conditional Specification implemented
by the MICE algorithm as described in [7]. Each variable has its own imputation model
and built-in imputation models are provided for continuous data, binary data, unordered
categorical data and ordered categorical data. Thus its applicability is much broader than
ClustImpute’s.
• Amelia implements bootstrap multiple imputation using EM to estimate the parameters;
for quantitative data it imputes assuming a multivariate Gaussian distribution, cf. [2].
• missRanger provides a fast implementation of the 'MissForest' algorithm used to impute
mixed-type data sets by chaining random forests, introduced by [5]. Under the hood, it
uses the fast random jungle package ranger.
These packages will be used in combination with ClusterR for a fair comparison with ClustIm-
pute. Our goal is to compare with popular packages that represent a diverse set of methodologies.
However, we are aware that there exists a large number of further imputation packages. For a
good overview we refer to the CRAN task view: https://cran.r-project.org/web/views/MissingData.html
4. Computational experiments
In this section we empirically test scalability and clustering performance of ClustImpute. To
this end we compare ClustImpute with alternative approaches performing a pre-imputation with
one of the packages stated above. Nevertheless, we do not aim at drawing general conclusions on
which approach should be taken in a specific setting. The main purpose of this section is to shed
some light on the performance and properties of the proposed package with regard to other
established approaches.
4.1. Scalability on simulated data. This experiment will evaluate the scalability of ClustIm-
pute for a data set with a growing number of rows. To reproduce this experiment, the supple-
mentary file Scalability.R can be used.
The data is generated by a simple function that produces an n × (nr_other + 2) matrix with
three clusters, where n denotes the number of rows and nr_other + 2 the number of columns. The
variables in the nr_other columns are just variables to distract the algorithms from the true clusters
that lie in a 2-dimensional hyperplane. A visualization of a sample can be seen in Figures 1 and
2 with nr_other = 4. Missings are created with the function ClustImpute::Missing_simulation.
In this function copula::normalCopula() is used to generate a Gaussian Copula based on a
random covariance matrix, which is diagonal in the ‘Missing completely at random (MCAR)’
setting, otherwise non-diagonal (‘Missing at random (MAR)’ case). A sample from this Gaussian
Copula via copula::rMvdc() provides the indicators used to replace existing values by NA’s. For
a mathematical definition of MCAR and MAR we refer to [3].
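A simplified stand-in for this mechanism is sketched below; it is not
ClustImpute::Missing_simulation itself, and the function name simulate_missings, the
exchangeable correlation rho and the missing rate p_miss are arbitrary choices for the illustration:

library(copula)
simulate_missings <- function(dat, p_miss = 0.2, rho = 0.5, seed = 1) {
  set.seed(seed)
  d <- ncol(dat)
  # Gaussian copula with exchangeable correlation rho between all columns:
  # rho = 0 corresponds to the MCAR setting, rho > 0 to correlated (MAR) missingness
  cop <- normalCopula(param = rho, dim = d, dispstr = "ex")
  u <- rCopula(nrow(dat), cop)   # correlated uniform variables in [0, 1]
  dat[u < p_miss] <- NA          # thresholding yields correlated missing indicators
  dat
}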
We compare ClustImpute with R packages for missing data imputation that are applied to
the data in a pre-processing step before the k-means clustering is performed. For the latter we
use the same clustering algorithm as in ClustImpute, namely the ClusterR package, to ensure
that only the computation time of the imputation is compared. The running time is measured as
an average of five clustering runs for each procedure and data set.
The parametrization is as follows:
nr_iter <- 14 # iterations of procedure
n_end <- 10 # step until convergence of weight function to 1
nr_cluster <- 3 # number of clusters
c_steps <- 50 # number of cluster steps per iteration
nr_iter_other <- (nr_iter-n_end) * c_steps
The parametrization nr_iter_other of the clustering for the external imputation strategies is set so
that we have a comparable number of steps after imputation (steps of ClustImpute with a weight
< 1 are considered a "burn-in" phase and only afterwards is the distribution considered credible);
otherwise approaches with a separate imputation beforehand would face a disadvantage. Due to its
very slow performance we decided to drop missForest and only consider missRanger; however,
the code for missForest is still part of the attached code file so that the reader can add it
to this comparison. The results below are for a missing rate of 20%, though this can be changed
in a single line in the attached code file. For each data set, imputation and clustering are repeated
five times with a different seed each time. Essentially, ClustImpute is used as follows:
res <- ClustImpute(dat_with_miss, nr_cluster = nr_cluster, nr_iter = nr_iter,
                   c_steps = c_steps, n_end = n_end, seed_nr = random_seed)
ClusterR::external_validation(random_data$true_clusters, res$clusters)
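For the external imputation strategies, the corresponding pipeline looks roughly as follows; the
snippet is a sketch using MICE with a single completed data set (a simplification), not the exact
code from Scalability.R:

library(mice)
library(ClusterR)
imp <- mice(dat_with_miss, m = 1, printFlag = FALSE)   # pre-imputation step
dat_imp <- complete(imp, 1)                            # one completed data set
cent <- KMeans_arma(dat_imp, clusters = nr_cluster, n_iter = nr_iter_other)
pred <- predict_KMeans(dat_imp, cent)
external_validation(random_data$true_clusters, as.integer(pred), method = "rand_index")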
In Figure 3 we observe that Amelia and ClustImpute scale much better than MICE and
missRanger. ClustImpute scales like a simple random imputation and similarly to Amelia.
We only show the result from a MAR simulation but the results for MCAR are comparable.
Figure 3. This figure shows the median running time in seconds for an appli-
cation of ClustImpute vs. a comparable k-means clustering performed on a data
set imputed by Amelia, MICE, missRanger or simple random imputation, on
a regular (left) and on a log-scale (right).
Nr. of obs.   ClustImpute   RandomImp+ClusterR   missRanger+ClusterR   MICE+ClusterR   Amelia+ClusterR
400           0.6785        0.3163               0.3031                0.2920          0.3529
800           0.6519        0.3149               0.3108                0.3187          0.2840
1600          0.6896        0.4122               0.2648                0.2978          0.2051
3200          0.6732        0.3459               0.3365                0.4452          0.4263
6400          0.6673        0.2768               0.3134                0.2250          0.3040

Table 1. Rand index on simulated data
In a nutshell, we see that ClustImpute scales like a random imputation and hence is much
faster than a pre-processing with MICE or missRanger. This is not surprising since ClustImpute
basically runs a fixed number of random imputations conditional on the current cluster assignment.

In this experiment it was assumed that the number of clusters is known in advance. The vignette
of ClustImpute shows how to use the "variance reduction" function ClustImpute::var_reduction()
to tune this hyper-parameter. Moreover, ClustImpute tracks the mean and variance of imputed
variables. These values are part of the output list and can help to assess "convergence" and to
tune the nr_iter parameter appropriately, cf. the somewhat comparable trace plots from the MICE
package.
Of course, running time is only of relevance if the resulting clusters are adequate. In Table
1 we provide the Rand indices, cf. [4] for a definition, comparing the resulting clusters with
the true clusters. ClustImpute provides the highest numbers, although the other imputation
methods (except random imputation) could provide better results if they were tuned. The stable
Rand index, independently of the number of observations, justifies keeping the hyper-parameters
of ClustImpute constant. Therefore the running time of ClustImpute is equivalent to a fixed
number of random imputations plus a fixed number of clustering steps; hence the algorithm scales
exactly like a random imputation combined with a fixed number of clustering steps afterwards.
In the next section we examine the clustering performance more closely on the Iris data set.
4.2. Benchmarking Clustering Performance on Iris. For benchmarking ClustImpute and
clustering based on alternative imputation methods we use the popular IRIS data set. To reproduce
this experiment, the supplementary file Benchmarking_Iris.R can be used. In contrast to the
previous example, the size of the data set stays fixed but the share of missing values for each
Figure 4. This figure shows the Rand Index on a censored IRIS data set for
an application of ClustImpute vs. a comparable k-means clustering performed
on a data set imputed by Amelia, MICE, missRanger or simple random
imputation. A different share of missing values was simulated using MCAR.
of the four variables changes. Missings are generated similarly to the previous sub-section on
scalability.
The parametrization is as follows:
nr_iter <- 15 # iterations of procedure
n_end <- 8 # step until convergence of weight function to 1
nr_cluster <- 3 # number of clusters
c_steps <- 1 # number of cluster steps per iteration
nr_iter_other <- nr_iter * c_steps
In contrast to the previous section we use the same number of clustering steps for all approaches
(see the last line of the code above) so that the benchmark approaches are not disadvantaged.
Clustering performance is measured as the average Rand index over 30 clustering runs
for each procedure and share of missings.
For an MCAR missing scheme, we observe in Figure 4 that all approaches yield good clusters
when the share of missing values is low. ClustImpute is among the top performers for a share of
missings of 30% or less. For an even larger share of missings a pre-imputation with MICE provides
better results. Nevertheless, ClustImpute's performance is still higher than that of missRanger or
a simple random imputation. Interestingly, we see that ClustImpute without a weight function
(meaning a weight of 1 already as of the first iteration) performs worse than with a weight function.
Note that Amelia provides very good results for a low share of missings but fails to converge for a
larger share. Therefore we decided to drop points where the package did not complete the full 30
runs in order to avoid biased results.
In the second experiment, everything was kept the same except for the missing simulation, which
now is MAR. To generate MAR, the missings are correlated using the previously mentioned
function ClustImpute::Missing_simulation. Figure 5, made with the corrplot package, shows
that missings are indeed strongly correlated. The results in the MAR setting are somewhat
similar. However, ClustImpute is closer to a pre-imputation with MICE, and the two curves
for ClustImpute are further apart, which indicates that the importance of the weighting function
increases. Amelia and, in this setting, also MICE fail to converge for a large share of missings,
and as in the first experiment we do not show results where the packages did not complete the full
30 runs, in order to avoid stating biased results.
The benchmarking experiments on the IRIS data have been conducted for a fixed parametriza-
tion of ClustImpute. The following experiment shows that, at least on this data set, there is a
rather low sensitivity with respect to changes in the parametrization: We considered all combina-
tions of the following parameters:
Figure 5. This figure shows the correlations between missing values in the
censored IRIS data set with 40% missings at random (MAR).
Figure 6. This figure shows the Rand Index on a censored IRIS data set for an
application of ClustImpute vs. a comparable k-means clustering performed on a
data set imputed by Amelia, MICE, missRanger or simple random imputation.
A different share of missing values was simulated using MAR.
nr_iter <- c(5,10,20,30,50) # iterations of procedure
c_steps <- c(1,10,25,50) # number of cluster steps per iteration
# n_end is defined as a fraction of nr_iter
Figure 7. Points refer to the averaged Rand Index over 30 runs for each pa-
rameter combination. In the first plot from the left, the end point of convergence
of the weight function was set either to 1, for the others to a fraction of 30%,
60% or 90% of the number of iterations. The size of the points corresponds to
the total number of clustering steps.
The parameter n_end was set either to 1 (i.e., no weight function is used), or to 30%, 60% or 90%
of nr_iter, rounded to the closest integer. The missing simulation was performed as above using
MAR and a missing share of 30%. For each of the resulting 80 parameter combinations, 30
ClustImpute runs were performed and the resulting Rand index averaged. Results can be seen
in Figure 7. One observes that the Rand index is, on average, higher if a weight function is used.
Moreover, higher scores can be obtained if the weight function converges well before the final
iteration, by defining n_end as a fraction of 30% or 60% of nr_iter. The size of the points
corresponds to the total number of clustering steps, i.e., the product of nr_iter and c_steps. Best
scores are obtained with a total number of iterations that is not too high. For example, a good
result can be obtained by setting nr_iter and c_steps to 10 and n_end to 6 (this is the 5th best
result).
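Using the vectors from the snippet above, the grid of 80 combinations and the derived n_end values
can be constructed as follows (an illustrative sketch; NA encodes the 'no weight function' case,
i.e., n_end = 1):

grid <- expand.grid(nr_iter = nr_iter, c_steps = c_steps, n_end_frac = c(NA, 0.3, 0.6, 0.9))
grid$n_end <- ifelse(is.na(grid$n_end_frac), 1, round(grid$n_end_frac * grid$nr_iter))
nrow(grid)  # 80 parameter combinations, each evaluated with 30 ClustImpute runs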
While the experiments in this section provide interesting insights into the clustering performance
of ClustImpute, results should be taken with care. First of all performance may vary considerably
with different data sets and missing schemes. Secondly, other imputation methods exist and even
the considered ones have a large number of tuning options with a potentially high impact on
clustering performance. Moreover, in real-world applications partition labels do not exist so the
quality of resulting partitions has to be assessed with different, often domain-specific, methods.
5. Summary and outlook
In this article we have described a novel implementation of a k-means clustering algorithm with
built-in multivariate missing data imputation. Scalability and overall clustering performance of
the implementation in R were tested on simulated data and the IRIS data set with missings
simulated at random (i.e., considering a potential correlation of missings within the data set). A
potential issue of ClustImpute is the standardization being based on complete observations only.
An independent imputation, e.g. using MICE, can potentially detect a shift in expectation and
variances so that a more accurate standardization is possible for the clustering. This should be
seen as a downside of any combined approach that yields a faster run-time. In the following we
want to highlight directions for further research.
Categorical variables can be considered in ClustImpute if a dummy coding is applied to the
underlying data. The imputation based on actual observations guarantees that no values other
than 0 or 1 are created. Hence the structure of the data does not change and the resulting
clusters can be interpreted in terms of categories. However, non-binary categories might end up
with multiple assignments, to some extent describing the uncertainty associated with the missing
information. Alternatively, one could replace the current clustering algorithm in step 3 by one that
supports other metrics, e.g. Gower's distance, which defines a weighted average of the Euclidean
distance with the L0-norm. The imputation/sampling parts of the algorithm do not have to be changed.
A further improvement would be a parallelization of the imputation, which is possible as columns
are sampled independently of each other, in contrast to an imputation using chained equations,
where there is an update after each column so that parallelization is not trivial. This should speed
up the computation for high-dimensional data sets.
Another methodological improvement would be adaptive weights. At the moment, the weights
from step 2 only depend on the iteration. However, adaptive weights could depend on other
features such as the relative distance to the closest centroid compared to the other centroids. This
way one could perform a sampling on the whole data set with cluster weights proportional to
the centroid distance. In comparison, the current implementation can be seen as a sampling on
the entire data set with a weight of 1 for the currently assigned cluster and 0 otherwise. However,
such an approach would increase computation time.
References
[1] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1.
Springer Series in Statistics, New York, 2001.
[2] J. Honaker, G. King, and M. Blackwell. Amelia II: A program for missing data. Journal of
Statistical Software, 45(7):1–47, 2011.
[3] R. J. Little and D. B. Rubin. Statistical Analysis with Missing Data, volume 793. John Wiley
& Sons, 2019.
[4] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the
American Statistical Association, 66(336):846–850, 1971.
[5] D. J. Stekhoven and P. Bühlmann. MissForest—non-parametric missing value imputation for
mixed-type data. Bioinformatics, 28(1):112–118, 2012.
[6] S. van Buuren. Flexible Imputation of Missing Data. Chapman and Hall/CRC, 2018.
[7] S. van Buuren and K. Groothuis-Oudshoorn. mice: Multivariate imputation by chained
equations in R. Journal of Statistical Software, 45(3):1–67, 2011.