
CLUSTIMPUTE: AN R PACKAGE FOR K-MEANS CLUSTERING WITH BUILT-IN MISSING DATA IMPUTATION

OLIVER PFAFFEL

Abstract. This article introduces a novel k-means clustering implementation that handles missing values efficiently. The implementation can deal with missing values in multiple variables and is computationally efficient since it uses the current cluster assignment to define a plausible distribution for the missing values. Experiments show scalability comparable to a simple random imputation and clustering performance on par with pre-processing by more complex imputation methods.

1. Introduction: clustering in the presence of missing values

When missing values are present in the data, partitional clustering is not possible out of the box. A typical strategy is missing value imputation before clustering. For an in-depth account of the imputation of missing values we refer to [6]. At a high level, there are broadly two different types of imputation methods, applied to each column/variable of the data set that has at least one missing entry:

• Unconditional approaches such as mean, median or random imputation. These approaches have the benefit of being computationally cheap. However, mean and median imputation reduce the variance in the data and are deterministic, which may lead to false confidence in the resulting clusters. Random imputation is exact if missings are completely at random, but ignores the dependence between variables when missings are (merely) at random.

• Conditional approaches iteratively train a regression model on all rows without missing values, using all other variables as covariates (incl. currently imputed values), and predict a value for all rows with a missing value. These approaches are exact if missings are at random. However, they are often quite costly, typically significantly more costly than the clustering itself.
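The unconditional strategies above can be sketched in a few lines of base R (an illustration only; the function name is made up for this sketch):

```r
# Column-wise unconditional imputation: replace NAs by the column mean,
# the column median, or a random draw from the observed values of that column.
impute_unconditional <- function(X, method = c("mean", "median", "random")) {
  method <- match.arg(method)
  for (j in seq_len(ncol(X))) {
    obs <- X[!is.na(X[, j]), j]          # observed values of column j
    idx <- which(is.na(X[, j]))
    if (length(idx) == 0) next
    X[idx, j] <- switch(method,
      mean   = mean(obs),
      median = median(obs),
      random = obs[sample.int(length(obs), length(idx), replace = TRUE)])
  }
  X
}
```

Mean and median imputation collapse all imputed entries of a column onto a single value, while the random variant preserves the marginal distribution but ignores cross-variable dependence, exactly the trade-off described above.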

This article introduces a novel k-means clustering implementation that includes an efficient missing data imputation. A version of this algorithm is implemented in the ClustImpute R package. The imputation method is basically a column-wise nearest-neighborhood imputation nested within the clustering algorithm, where the neighborhoods are given by the current cluster assignments. Therefore, there is no need to fit a k-nearest neighbors algorithm for each imputation step, which reduces computational time considerably. Typically k-means initializes with random partitions that will not yield good imputation neighborhoods if the missings are not completely at random. Thus, the imputed values drawn from each neighborhood are weighted down during a "burn-in" phase (to be set by the user).

The benefit will be visualized by a simple example: often a clustering based on median or random imputation will not provide good results even if we know the number of clusters. Both approaches badly distort the data set in Figure 1 and lead to unreasonable clusters. The result with ClustImpute for the very same data set can be observed in Figure 2.

ClustImpute draws the missing values iteratively based on the current cluster assignment, so that correlations are considered on this level. Basically, we assume that a more granular dependence structure between the missings is not relevant since we are "only" interested in k partitions. Subsequently, penalizing weights are imposed on imputed values and successively decreased (to zero) as the partitioning, and therefore also the missing data imputation, becomes more accurate. The heuristic is that at some point the observed point is near a cluster that provides a suitable neighborhood to draw the missing variable from. The algorithm is computationally efficient since the imputation is only as accurate as the clustering, and is shown to be much faster than any approach that derives the full conditional missing distribution independently of the clustering, e.g., as implemented in the powerful MICE package.

Key words and phrases: clustering, imputation, missing data, k-means, scalability, missing at random, R.

Figure 1. Imputation with the median vs. random imputation on a simulated data set further described in Chapter 4. Median imputation on the left produces an artificial point mass near the x and y axes. Random imputation on the right produces a lot of points far away from the actual clusters.

The article is structured as follows: Chapter 2 describes the implemented algorithm in detail. Subsequently, Chapter 3 highlights important parts of the implementation in R and provides an overview of similar R packages. Scalability and overall clustering performance of the implementation are tested in Chapter 4. Finally, we conclude with a short summary and an outlook on further interesting features to be implemented in future versions of ClustImpute.

2. Description of the algorithm

The intuition for the algorithm is that an observation should be clustered with other observations mainly based on their observed values (hence the weights on imputed values), while the resulting clusters provide donors for the missing value imputation, so that subsequently all variables can be used for clustering. It is usually recommended to standardize the data prior to clustering, if it is not measured on the same scale, so that each column has a mean of zero and a standard deviation of one. Even if all variables were measured on the same scale, it is still important that they are centered before ClustImpute is applied with weights, i.e., with n_end > 1; otherwise the weighting procedure will introduce a bias.

At a high level, the algorithm follows these steps:

1. Random imputation: replace all NAs by random imputation, i.e., for each variable with missings, draw from the marginal distribution of this variable excluding the missings. This does not take into account any correlations with other variables.
2. Weights < 1 are multiplied with imputed values to adjust their scale. The weights are calculated by a (linear) weight function that starts near zero and converges to 1 at n_end.
3. Regular k-means clustering with the Euclidean norm is performed with a number of c_steps steps, starting with a random initialization.
4. The imputed values from step 2 are replaced by new draws conditional on the cluster assignment from step 3.
5. Steps 2 to 4 are repeated nr_iter times in total. Any subsequent k-means clustering in step 3 uses the previous cluster centroids for initialization. Typically nr_iter is larger than n_end.
6. After the last draw of missing values, a final k-means clustering is performed.
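The six steps above can be sketched in base R as follows. This is a simplified illustration, not the package's internal code: it uses stats::kmeans in place of ClusterR::KMeans_arma so that the sketch is self-contained, and the function name is made up.

```r
# Sketch of the ClustImpute procedure (steps 1-6) using stats::kmeans.
clust_impute_sketch <- function(X, nr_cluster = 3, nr_iter = 10,
                                n_end = 5, c_steps = 10) {
  miss <- is.na(X)
  # Step 1: random imputation from each column's marginal distribution
  for (j in seq_len(ncol(X))) {
    obs <- X[!miss[, j], j]
    X[miss[, j], j] <- obs[sample.int(length(obs), sum(miss[, j]), replace = TRUE)]
  }
  centers <- nr_cluster  # random initialization in the first iteration
  for (l in seq_len(nr_iter)) {
    # Step 2: scale imputed values by the weight w(l) = min(l / n_end, 1)
    Xw <- X
    Xw[miss] <- X[miss] * min(l / n_end, 1)
    # Step 3: k-means, warm-started with the previous centroids (step 5)
    km <- stats::kmeans(Xw, centers = centers, iter.max = c_steps)
    centers <- km$centers
    # Step 4: redraw each missing entry from donors in the same cluster
    for (j in seq_len(ncol(X))) {
      for (k in seq_len(nr_cluster)) {
        idx <- which(miss[, j] & km$cluster == k)
        donors <- X[!miss[, j] & km$cluster == k, j]
        if (length(donors) == 0) donors <- X[!miss[, j], j]  # fallback: whole column
        if (length(idx) > 0)
          X[idx, j] <- donors[sample.int(length(donors), length(idx), replace = TRUE)]
      }
    }
  }
  # Step 6: final clustering on the completed data
  km <- stats::kmeans(X, centers = centers, iter.max = c_steps)
  list(clusters = km$cluster, complete_data = X)
}
```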

A very good overview of missing data imputation can be found in [6], for example. For k-means clustering we refer to the chapter on unsupervised learning in [1].

In the following we describe the clustering procedure formally and in more detail. We begin by describing classical k-means clustering and then highlight the differences of this implementation. First we describe the computation of the (hidden) k-th cluster centroid from all observations x assigned to cluster k. Assume there are N observations x in a p-dimensional space of real numbers and K clusters. Then each partition is characterized by a function $s: \{1, \ldots, N\} \to \{1, \ldots, K\}$ that maps each observation to exactly one cluster. The computation of the k-th cluster centroid in the l-th iteration can be written as

(1) $c^l_k = \frac{1}{N^{l-1}_k} \sum_{\{i \,:\, s^{l-1}(i) = k\}} x_i$

where $N^{l-1}_k$ is the number of observations in cluster $k$, i.e., $N^{l-1}_k = |\{i : s^{l-1}(i) = k\}|$, and $s^{l-1}$ is the partition function from the previous iteration. The initialization $s^0$ is typically random. The following step is to determine the closest centroid for each observation, i.e., to update the partition function:

(2) $s^l(i) = \arg\min_k \left\| x_i - c^l_k \right\|^2$

Here $\|\cdot\|$ denotes the Euclidean norm. In our setting $X = (x_{ij})$ has missing values. Therefore we are estimating

(3) $\tilde{X}^l_{ij} = \mathbf{1}_{\{x_{ij} \neq NA\}}\, x_{ij} + \mathbf{1}_{\{x_{ij} = NA\}}\, U^l_{ij}\, w(l),$

Figure 2. Clustering based on the simulated data with missings from Figure 1. On the left, k-means is applied to the data set after a random imputation. On the right, the proposed package ClustImpute was used without any imputation or other pre-processing step. Note that the underlying data has four additional "noise" variables not shown in the figures, thus the clusters are somewhat overlapping when plotted w.r.t. x and y only.

where the weight function is given by $w(l) = \min(l/n_{end},\, 1)$, and $U^l_{ij}$ is uniformly distributed on all non-missing values of the same column $j$ that lie in the same cluster $s^{l-1}(i)$, or on all non-missing values of column $j$ if this set is empty. Mathematically, $U^l_{ij}$ is uniformly distributed on

(4) $S^l_{ij} = \begin{cases} \{x_{rj} \neq NA : s^{l-1}(r) = s^{l-1}(i)\}, & \text{if non-empty} \\ \{x_{rj} \neq NA\}, & \text{otherwise.} \end{cases}$

Thus the calculation of the new centroids is not only conditional on $s$ but also on the realization of the random variable $U^l = (U^l_{ij})$:

(5) $\tilde{c}^l_k = \frac{1}{N^{l-1}_k} \sum_{\{i \,:\, s^{l-1}(i) = k\}} \tilde{x}^l_i,$

where $U_{ij}$ is simply set to zero if $x_{ij}$ is not missing. In early iterations, the weight function $w(l)$ is near zero; thus, for each component $j$, this is basically the mean over all non-missing values $x_{ij}$. Since the denominator $N_k$ does not change with the share of missing values, there is some linear regularization towards zero (the mean due to standardization), proportional to the share of missing values. Finally, the update of the partition function,

(6) $s^l(i) = \arg\min_k \left\| \tilde{x}^l_i - \tilde{c}^l_k \right\|^2$

triggers, by definition, an update of $S_{ij}$, $U$ and $\tilde{X}$.
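The weight function from (3) and the donor-set rule from (4) can be written out in base R as follows (a simplified illustration of the math, not the package's internal code; the function names are made up):

```r
# Weight function from equation (3): linear ramp that reaches 1 at n_end
w <- function(l, n_end) min(l / n_end, 1)

# Draw a replacement for a missing entry x[i, j] according to equation (4):
# sample uniformly from the non-missing values of column j within i's cluster,
# falling back to the whole column if that donor set is empty.
draw_uij <- function(X, i, j, assign) {
  donors <- X[!is.na(X[, j]) & assign == assign[i], j]
  if (length(donors) == 0) donors <- X[!is.na(X[, j]), j]
  donors[sample.int(length(donors), 1)]
}
```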

3. Description of the implementation

ClustImpute is an R implementation of the algorithm stated in Chapter 2. The main script is ClustImpute.R and is explained here. Since the random imputation is simply a conditional sampling, it is implemented directly in the algorithm. The clustering is performed by the function ClusterR::KMeans_arma for c_steps steps, where the centroids are initialized randomly in the first step and with the current cluster centroids thereafter, i.e., from the 2nd step on we use ClusterR::KMeans_arma(...,CENTROIDS=cl_old,...), where cl_old denotes the current cluster centroids. The ClusterR package was chosen for its fast implementation using RcppArmadillo and its ability to start from custom centroids. The missing values are sampled conditionally on the current cluster assignment provided by ClusterR::predict_KMeans. Therefore, only rather 'standard' packages like dplyr and rlang are imported by ClustImpute.

The package uses testthat for unit testing, has (as of version 0.1.3) full test coverage, and uses Travis CI for continuous integration. Further dependencies arise from helper functions for the vignette and will be explained in the next chapter. The development page of ClustImpute can be found at https://github.com/o1iv3r/ClustImpute.

We are not aware of any implementations in R or Python that implement the algorithm stated in Chapter 2. However, there are several packages for missing data imputation that can be used before an application of any k-means implementation in R, for example:

• MICE provides multiple imputation using Fully Conditional Specification, implemented by the MICE algorithm as described in [7]. Each variable has its own imputation model, and built-in imputation models are provided for continuous, binary, unordered categorical and ordered categorical data. Thus its applicability is much broader than ClustImpute's.

• Amelia implements bootstrap-based multiple imputation, using the EM algorithm to estimate the parameters; for quantitative data it imputes assuming a multivariate Gaussian distribution, cf. [2].

• missRanger provides a fast implementation of the 'MissForest' algorithm, introduced by [5], which imputes mixed-type data sets by chaining random forests. Under the hood, it uses the fast random forest package ranger.

These packages will be used in combination with ClusterR for a fair comparison with ClustImpute. Our goal is to compare with popular packages that represent a diverse set of methodologies. However, we are aware that a large number of further imputation packages exists. For a good overview we refer to the CRAN task views: https://cran.r-project.org/web/views/MissingData.html

4. Computational experiments

In this section we empirically test the scalability and clustering performance of ClustImpute. To this end we compare ClustImpute with alternative approaches that perform a pre-imputation with one of the packages stated above. Nevertheless, we do not aim at making general conclusions on which approach should be taken in a specific setting. The main purpose of this section is to shed some light on the performance and properties of the proposed package with regard to established alternatives.

4.1. Scalability on simulated data. This experiment evaluates the scalability of ClustImpute for a data set with a growing number of rows. To reproduce this experiment, the supplementary file Scalability.R can be used.

The data is generated by a simple function that produces an n × (nr_other + 2) matrix with three clusters, where n denotes the number of rows and nr_other + 2 the number of columns. The variables in the nr_other columns are just noise variables to distract the algorithms from the true clusters, which lie in a 2-dimensional hyperplane. A visualization of a sample with nr_other = 4 can be seen in Figures 1 and 2. Missings are created with the function ClustImpute::Missing_simulation. In this function, copula::normalCopula() is used to generate a Gaussian copula based on a random covariance matrix, which is diagonal in the 'missing completely at random (MCAR)' setting and non-diagonal otherwise ('missing at random (MAR)' case). A sample from this Gaussian copula via copula::rMvdc() provides the indicators used to replace existing values by NAs. For a mathematical definition of MCAR and MAR we refer to [3].
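The idea of such a Gaussian-copula-style missingness mechanism can be sketched in base R without the copula package (an illustration of the idea only, not the package's Missing_simulation code; the function name is made up):

```r
# Inject missing values via a latent correlated Gaussian: an entry becomes NA
# whenever its latent normal exceeds the (1 - rate) quantile. A diagonal Sigma
# yields MCAR; correlations in Sigma make the missingness patterns of the
# columns dependent (the MAR-style case described in the text).
inject_missings <- function(X, rate = 0.2, Sigma = diag(ncol(X))) {
  n <- nrow(X); p <- ncol(X)
  Z <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)  # latent correlated normals
  Z <- scale(Z)  # standardize columns so marginal NA rates match `rate`
  X[Z > qnorm(1 - rate)] <- NA
  X
}
```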

We compare ClustImpute with R packages for missing data imputation that are applied to the data in a pre-processing step before the k-means clustering is performed. For the latter we use the same clustering algorithm as in ClustImpute, namely from the ClusterR package, to ensure that only the computation time of the imputation is compared. The running time is measured as an average of five clustering runs for each procedure and data set.

The parametrization is as follows:

nr_iter <- 14 # iterations of procedure

n_end <- 10 # step until convergence of weight function to 1

nr_cluster <- 3 # number of clusters

c_steps <- 50 # number of cluster steps per iteration

nr_iter_other <- (nr_iter-n_end) * c_steps

The parametrization nr_iter_other of the clustering for the external imputation strategies is set so that we have a comparable number of steps after imputation (steps of ClustImpute with a weight < 1 are considered a "burn-in" phase, and only afterwards is the distribution considered credible); otherwise approaches with a separate imputation beforehand would face a disadvantage. Due to its very slow performance we decided to drop missForest and only consider missRanger; however, the code for missForest is still part of the attached code file so that the reader can add it to this comparison. The results below are for a missing rate of 20%, though this can be changed in a single line in the attached code file. For each data set, imputation and clustering are repeated five times with a different seed each time. Essentially, ClustImpute is used as follows:

res <- ClustImpute(dat_with_miss, nr_cluster=nr_cluster, nr_iter=nr_iter,
                   c_steps=c_steps, n_end=n_end, seed_nr=random_seed)
ClusterR::external_validation(random_data$true_clusters, res$clusters)

In Figure 3 we observe that Amelia and ClustImpute scale much better than MICE and missRanger. ClustImpute scales like a simple random imputation and similarly to Amelia. We only show the result from a MAR simulation, but the results for MCAR are comparable.

Figure 3. This figure shows the median running time in seconds for an application of ClustImpute vs. a comparable k-means clustering performed on a data set imputed by Amelia, MICE, missRanger or simple random imputation, on a regular (left) and on a log scale (right).

Nr. of obs. | ClustImpute | RandomImp+ClusterR | missRanger+ClusterR | MICE+ClusterR | Amelia+ClusterR
400         | 0.6785      | 0.3163             | 0.3031              | 0.2920        | 0.3529
800         | 0.6519      | 0.3149             | 0.3108              | 0.3187        | 0.2840
1600        | 0.6896      | 0.4122             | 0.2648              | 0.2978        | 0.2051
3200        | 0.6732      | 0.3459             | 0.3365              | 0.4452        | 0.4263
6400        | 0.6673      | 0.2768             | 0.3134              | 0.2250        | 0.3040

Table 1. Rand index on simulated data

In a nutshell, we see that ClustImpute scales like a random imputation and hence is much faster than a pre-processing with MICE or missRanger. This is not surprising since ClustImpute basically runs a fixed number of random imputations conditional on the current cluster assignment.

In this experiment it was assumed that the number of clusters is known in advance. The vignette of ClustImpute shows how to use the "variance reduction" function ClustImpute::var_reduction() to tune this hyper-parameter. Moreover, ClustImpute tracks the mean and variance of the imputed variables. These values are part of the output list and can help to assess "convergence" and to tune the nr_iter parameter appropriately, cf. the somewhat comparable trace plots from the MICE package.

Of course, running time is only of relevance if the resulting clusters are adequate. In Table 1 we provide the Rand indices, cf. [4] for a definition, comparing the resulting clusters with the true clusters. ClustImpute provides the highest numbers, although the other imputation methods (except random imputation) could provide better results if they were tuned. The stable Rand index, independently of the number of observations, justifies keeping the hyper-parameters of ClustImpute constant. Therefore the running time of ClustImpute is equivalent to a fixed number of random imputations plus a fixed number of clustering steps; hence the algorithm scales exactly like a random imputation combined with a fixed number of clustering steps afterwards. In the next section we examine the clustering performance more closely on the Iris data set.

4.2. Benchmarking Clustering Performance on Iris. For benchmarking ClustImpute and clustering based on alternative imputation methods we use the popular IRIS data set. To reproduce this experiment, the supplementary file Benchmarking_Iris.R can be used. In contrast to the previous example, the size of the data set stays fixed, but the share of missing values in each of the four variables changes. Missings are generated similarly as in the previous sub-section on scalability.

Figure 4. This figure shows the Rand index on a censored IRIS data set for an application of ClustImpute vs. a comparable k-means clustering performed on a data set imputed by Amelia, MICE, missRanger or simple random imputation. A different share of missing values was simulated using MCAR.

The parametrization is as follows:

nr_iter <- 15 # iterations of procedure

n_end <- 8 # step until convergence of weight function to 1

nr_cluster <- 3 # number of clusters

c_steps <- 1 # number of cluster steps per iteration

nr_iter_other <- nr_iter * c_steps

In contrast to the previous section, we use the same number of clustering steps for all approaches (see the last line of the code above) so that the benchmark approaches are not disadvantaged. Clustering performance is measured as the average Rand index over 30 clustering runs for each procedure and share of missings.

For an MCAR missing scheme, we observe in Figure 4 that all approaches yield good clusters when the share of missing values is low. ClustImpute is among the top performers for a share of missings of 30% or less. For an even larger share of missings, a pre-imputation with MICE provides better results. Nevertheless, ClustImpute's performance is higher than for missRanger or a simple random imputation. Interestingly, we see that ClustImpute without a weight function (meaning a weight of 1 already as of the first iteration) performs worse than with a weight function. Note that Amelia provides very good results for a low share of missings but fails to converge for a larger share. Therefore we decided to drop points where the package did not complete all 30 runs, in order to avoid biased results.

In the second experiment, everything was kept the same except for the missing simulation, which now is MAR. To generate MAR, the missings are correlated using the previously mentioned function ClustImpute::Missing_simulation. Figure 5, made with the corrplot package, shows that the missings are indeed strongly correlated. The results in the MAR setting are somewhat similar. However, ClustImpute is closer to a pre-imputation with MICE, and the two curves for ClustImpute are further apart, which indicates that the importance of the weighting function increases. Amelia and, in this setting, also MICE fail to converge for a large share of missings, and as in the first experiment we do not show results where the packages did not complete all 30 runs, in order to avoid stating biased results.

Figure 5. This figure shows the correlations between missing values in the censored IRIS data set with 40% missings at random (MAR).

Figure 6. This figure shows the Rand index on a censored IRIS data set for an application of ClustImpute vs. a comparable k-means clustering performed on a data set imputed by Amelia, MICE, missRanger or simple random imputation. A different share of missing values was simulated using MAR.

The benchmarking experiments on the IRIS data have been conducted for a fixed parametrization of ClustImpute. The following experiment shows that, at least on this data set, there is rather low sensitivity with respect to changes in the parametrization. We considered all combinations of the following parameters:

nr_iter <- c(5,10,20,30,50) # iterations of procedure
c_steps <- c(1,10,25,50) # number of cluster steps per iteration
# n_end is defined as a fraction of nr_iter

Figure 7. Points refer to the averaged Rand index over 30 runs for each parameter combination. In the first plot from the left, the end point of convergence of the weight function was set to 1; for the others, to a fraction of 30%, 60% or 90% of the number of iterations. The size of the points corresponds to the total number of clustering steps.

The parameter n_end was set either to 1 (i.e., no weight function is used), or to 30%, 60% or 90% of nr_iter, rounded to the closest integer. The missing simulation was performed as above using MAR and a missing share of 30%. For each of the resulting 80 parameter combinations, 30 ClustImpute runs were performed and the resulting Rand index averaged. Results can be seen in Figure 7. One observes that the Rand index is, on average, higher if a weight function is used. Moreover, higher scores can be obtained if the weight function converges well before the final iteration, by defining n_end as 30% or 60% of nr_iter. The size of the points corresponds to the total number of clustering steps, i.e., the product of nr_iter and c_steps. Best scores are obtained with a total number of iterations that is not too high. For example, a good result can be obtained by setting nr_iter and c_steps to 10 and n_end to 6 (this is the 5th best result).
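The 80 parameter combinations described above can be enumerated in base R, for example (a sketch of the grid only, not the supplementary file's code):

```r
# Build the grid: 5 values of nr_iter x 4 values of c_steps x 4 choices
# of n_end (either no weight function, encoded as n_end = 1, or a
# fraction of 30%, 60% or 90% of nr_iter, rounded to the closest integer).
grid <- expand.grid(nr_iter = c(5, 10, 20, 30, 50),
                    c_steps = c(1, 10, 25, 50),
                    n_end_frac = c(NA, 0.3, 0.6, 0.9))
grid$n_end <- ifelse(is.na(grid$n_end_frac), 1,
                     round(grid$n_end_frac * grid$nr_iter))
nrow(grid)  # 80 combinations
```

With this encoding, nr_iter = 10 and n_end_frac = 0.6 gives exactly the n_end = 6 setting mentioned above.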

While the experiments in this section provide interesting insights into the clustering performance of ClustImpute, the results should be taken with care. First of all, performance may vary considerably with different data sets and missing schemes. Secondly, other imputation methods exist, and even the considered ones have a large number of tuning options with a potentially high impact on clustering performance. Moreover, in real-world applications partition labels do not exist, so the quality of the resulting partitions has to be assessed with different, often domain-specific, methods.

5. Summary and outlook

In this article we have described a novel implementation of a k-means clustering algorithm with built-in multivariate missing data imputation. Scalability and overall clustering performance of the R implementation were tested on simulated data and the IRIS data set, with missings simulated at random (i.e., considering a potential correlation of missings within the data set). A potential issue of ClustImpute is that the standardization is based on complete observations only. An independent imputation, e.g., using MICE, can potentially detect a shift in expectation and variances, so that a more accurate standardization is possible for the clustering. This should be seen as a downside of any combined approach that yields a faster run-time. In the following we highlight directions for further research.

Categorical variables can be considered in ClustImpute if dummy coding is applied to the underlying data. The imputation based on actual observations guarantees that no values other than 0 or 1 are created. Hence the structure of the data does not change, and the resulting clusters can be interpreted in terms of categories. However, non-binary categories might end up with multiple assignments, to some extent describing the associated uncertainty due to the missing information. Alternatively, one could replace the clustering algorithm in step 3 by one that supports other metrics, e.g., Gower's distance, which defines a weighted average of the Euclidean distance with the L0-norm. The imputation / sampling parts of the algorithm would not have to be changed.
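The dummy coding described above can be prepared in base R before calling ClustImpute; note that the NA indicators must be preserved so that the missing categories can be imputed (a minimal illustration with made-up data):

```r
# Expand a factor into one 0/1 column per level. An NA category value
# propagates to NA in every indicator column, which ClustImpute can
# then impute like any other missing numeric value.
df <- data.frame(x = c(1.2, 0.7, 1.9, 0.4),
                 color = factor(c("red", "green", NA, "red")))
dummies <- sapply(levels(df$color), function(lv) as.numeric(df$color == lv))
dat <- cbind(df["x"], dummies)  # fully numeric data set, ready for clustering
```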

A further improvement would be a parallelization of the imputation, which is possible since columns are sampled independently of each other, in contrast to an imputation using chained equations, where there is an update after each column, so parallelization is not trivial. This should speed up the computation for high-dimensional data sets.

Another methodological improvement would be adaptive weights. At the moment, the weights from step 2 only depend on the iteration. However, adaptive weights could depend on other features such as the relative distance to the closest centroid compared to the other centroids. This way one could perform a sampling on the whole data set with cluster weights proportional to the centroid distance. In comparison, the current implementation can be seen as sampling on the entire data set with a weight of 1 for the currently assigned cluster and 0 otherwise. However, such an approach would increase computation time.

References

[1] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
[2] J. Honaker, G. King, and M. Blackwell. Amelia II: A program for missing data. Journal of Statistical Software, 45(7):1-47, 2011.
[3] R. J. Little and D. B. Rubin. Statistical Analysis with Missing Data, volume 793. John Wiley & Sons, 2019.
[4] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846-850, 1971.
[5] D. J. Stekhoven and P. Bühlmann. MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112-118, 2012.
[6] S. van Buuren. Flexible Imputation of Missing Data. Chapman and Hall/CRC, 2018.
[7] S. van Buuren and K. Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1-67, 2011.