
Understanding Collections of Related Datasets Using Dependent MMD Coresets


Abstract

Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily compared across datasets. In this paper, we introduce dependent MMD coresets, a data summarization method for collections of datasets that facilitates comparison of distributions. We show that dependent MMD coresets are useful for understanding multiple related datasets and understanding model generalization between such datasets.
Sinead A. Williamson 1,* and Jette Henderson 2
Citation: Williamson, S.A.; Henderson, J. Understanding Collections of Related Datasets Using Dependent MMD Coresets. Information 2021, 12, 392. https://
Academic Editors: Melanie F. Pradier
and Isabel Valera
Received: 2 August 2021
Accepted: 3 September 2021
Published: 23 September 2021
Publisher’s Note: MDPI stays neutral
with regard to jurisdictional claims in
published maps and institutional affil-
Copyright: © 2021 by the authors.
Licensee MDPI, Basel, Switzerland.
This article is an open access article
distributed under the terms and
conditions of the Creative Commons
Attribution (CC BY) license (https://
1 Department of Statistics and Data Science, University of Texas at Austin, Austin, TX 78712, USA
2 CognitiveScale, Austin, TX 78759, USA
Keywords: coresets; data summarization; maximum mean discrepancy; interpretability
1. Introduction
When working with large datasets, it is important to understand your data. If a dataset
is not representative of your population of interest, and no appropriate correction is made,
then models trained on this data may perform poorly in the wild. Sub-populations that
are under-represented in the training data are likely to be poorly served by the resulting
algorithm, leading to unanticipated or unfair outcomes—something that has been observed
in numerous scenarios including medical diagnoses [1,2] and image classification [3,4].
In low-dimensional settings, it is common to summarize data using summary statistics
such as marginal moments or label frequencies, or to visualize univariate or bivariate
marginal distributions using histograms or scatter plots. As the dimensionality of our data
increases, such summaries and visualizations become unwieldy, and ignore higher-order
correlation structure. In structured data such as images, such summary statistics can be
hard to interpret, and can exclude important information about the distribution [ ]: the
per-pixel mean and standard deviation of a collection of images tell us little about the
overall distribution. Further, if our data are not labeled, or are only partially labeled, we
cannot make use of label frequencies to assess class balance.
In such settings, we can instead choose to present a set of exemplars that capture the
diversity of the data. This is particularly helpful for structured, high-dimensional data
such as images or text, which can easily be qualitatively assessed by a person. A number
of algorithms have been proposed to find such a set of exemplars [ ]. Many of these
algorithms can be seen as constructing a coreset for the dataset: a (potentially weighted)
set of exemplars that behaves similarly to the full dataset under a certain class of functions.
In particular, coresets that minimize the maximum mean discrepancy [ ] (MMD) between
coreset and data have recently been used for understanding data distributions [ ]. Further,
evaluating models on such MMD-coresets has been shown to aid in understanding
model performance [11].
In addition to summarizing a single dataset, we may also wish to compare and contrast
multiple related datasets. For example, a company may be interested in characterizing
differences and similarities between different markets. A machine learning practitioner
may wish to know whether their dataset is similar to that used to train a given model.
A researcher may be interested in understanding trends in posts or images on social
media. Here, summary statistics offer interpretable comparisons: we can plot the mean
and standard deviation of a given marginal quantity over time, and easily see how it
changes [ ]. By contrast, coresets are harder to compare, since the exemplars selected
for two datasets $X_1$ and $X_2$ will not in general overlap.
In this paper, we introduce dependent MMD coresets, a new tool for characterizing
related datasets and understanding model behavior across such datasets. These dependent
MMD coresets provide a low-dimensional summary of a collection of datasets that allows
easy comparison across datasets. A dependent MMD coreset for a collection of datasets
constructs a collection of exemplars that is shared across all datasets. Each dataset assigns
a different weight vector to these exemplars, so that the weighted exemplars approximate
the dataset. These weights allow us to easily see which exemplars are relevant to which
datasets, and comparing two sets of weights provides a simple way of showing how the
corresponding datasets differ.
The use of shared exemplars makes it easy to compare two or more datasets, by provid-
ing a common language. Consider comparing two datasets of faces. If we independently
constructed representations of each dataset—for example, using two independent MMD
coresets—we would obtain two disjoint sets of weighted exemplars. Visually assessing
the similarity between two sets would involve considering both the similarities of the
images and the similarities in the weights. Conversely, with a dependent MMD coreset,
the exemplars would be shared between the two datasets. Similarity can be assessed by
considering the relative weights assigned in the two marginal coresets. This in turn leads to
easy summarization of the difference between the two datasets, by identifying exemplars
that are highly representative of one dataset, but less representative of the other.
In addition to understanding the difference between multiple datasets, dependent
MMD coresets allow us to qualitatively explore the behavior of algorithms on these datasets.
The shared set of exemplars provides representative points at which to evaluate the algo-
rithm. Looking at the relative weights of these exemplars in the different datasets paints
a picture of the relative performances we would expect between those datasets. This is
particularly useful when a model has been trained on one dataset, but we wish to apply
it to a second dataset: looking at exemplars that are highly representative of the second
dataset, but not the first, allows us to identify potential failure modes.
We begin by considering existing coreset methods for data and model understanding
in Section 2, before discussing their limitations and proposing our dependent MMD
coreset in Section 3. A greedy algorithm to select dependent MMD coresets is provided
in Section 3.4. In Section 4, we evaluate the efficacy of this algorithm, and show how the
resulting dependent coresets can be used for understanding collections of image datasets
and probing the generalization behavior of machine learning algorithms. We summarize
notation used in this paper in Table 1.
Table 1. Notation used in this paper.
Symbol | Meaning
$\mathcal{T}$ | set that indexes datasets and associated measures
$X_t = (x_{t,1}, \dots, x_{t,n_t}) \in \mathcal{X}^{n_t}$ | a dataset indexed by $t \in \mathcal{T}$
$P_t$ | true distribution at $t \in \mathcal{T}$, $X_t \sim P_t$
$U = (u_1, \dots, u_{n_U})$ | set of candidate locations
$\delta_u$ | Dirac measure (i.e., point mass) at $u$
$Q_t$ | a probability measure used to approximate $P_t$, that takes the form $\sum_{i \in S} w_{t,i} \delta_{u_i}$, where $S \subseteq [n_U]$
2. Background and Related Work
2.1. Coresets and Measure-Coresets
A coreset is a “small” summary of a dataset $X$, which can act as a proxy for the dataset
under a certain class of functions $\mathcal{F}$. Concretely, a weighted set of points is
an $\epsilon$ strong coreset for a size-$n$ dataset $X$ with respect to $\mathcal{F}$ if
its weighted evaluation of $f$ approximates that of the full dataset to within a tolerance
governed by $\epsilon$, for all $f \in \mathcal{F}$ [21].
A measure coreset [22] generalizes this idea by assuming that $X$ consists of independently and
identically distributed samples from some distribution $P$. A measure $Q$ is an $\epsilon$
coreset for $P$ with respect to some class $\mathcal{F}$ of functions if
$$\sup_{f \in \mathcal{F}} \left| \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \right| \le \epsilon. \tag{1}$$
The left hand side of Equation (1) describes an integral probability metric [ ], a class
of distances between probability measures parametrized by some class $\mathcal{F}$ of functions.
Different choices of $\mathcal{F}$ yield different distances (Table 2).
Table 2. Some examples of integral probability metrics.
Distance | $\mathcal{F}$
1-Wasserstein distance | $\{f : \|\nabla f\|_1 \le 1\}$
Maximum mean discrepancy | $\{f : \|f\|_{\mathcal{H}} \le 1\}$ for some RKHS $\mathcal{H}$
Total variation | $\{f : \|f\|_\infty \le 1\}$
2.2. MMD-Coresets
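In this paper, we consider the case where $\mathcal{F}$ is the class of all functions that can
be represented in the unit ball of some reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, a
very rich class of continuous functions on $\mathcal{X}$. This corresponds to a metric known as the
maximum mean discrepancy [18] (MMD),
$$\mathrm{MMD}(P, Q) = \sup_{f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le 1} \left| \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \right|. \tag{2}$$
An RKHS can be defined in terms of a mapping $\Phi : \mathcal{X} \to \mathcal{H}$, which in turn specifies a
kernel function $k(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}}$. A distribution $P$ can be represented in this space
in terms of its mean embedding, $\mu_P = \mathbb{E}_{X \sim P}[\Phi(X)]$. The MMD between two distributions $P$ and $Q$
equivalently can be expressed in terms of their mean embeddings, $\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$.
An $\epsilon$-MMD coreset for a distribution $P$ is a finite, atomic distribution $Q = \sum_{i \in S} w_i \delta_{u_i}$
such that $\mathrm{MMD}(P, Q)^2 \le \epsilon^2$. We will refer to the set $\{u_i\}_{i \in S}$ as the support of $Q$, and refer
to individual locations in the support of $Q$ as exemplars.
In practice, we are unlikely to have access to $P$ directly, but instead have samples
$X := (x_1, \dots, x_n) \sim P$. If $Q = \sum_{i \in S} w_i \delta_{u_i}$, we can estimate $\mathrm{MMD}(P, Q)^2$ as
$$\mathrm{MMD}^2(X, Q) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k(x_i, x_j) + \sum_{i \in S} \sum_{j \in S} w_i w_j k(u_i, u_j) - \frac{2}{n} \sum_{i=1}^{n} \sum_{j \in S} w_j k(x_i, u_j). \tag{3}$$
We, therefore, define an $\epsilon$-MMD coreset for a dataset $X$ as a finite, atomic distribution
$Q$ such that $\mathrm{MMD}^2(X, Q) \le \epsilon^2$, or equivalently, whose mean embedding $\mu_Q$ is close
to the empirical mean embedding $\hat{\mu}_X$ so that $\|\mu_Q - \hat{\mu}_X\|_{\mathcal{H}} \le \epsilon$.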
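The plug-in estimate of the squared MMD between a dataset and a weighted set of exemplars can be computed directly from three kernel sums. The following is a minimal sketch, assuming a squared exponential kernel; the function names `rbf_kernel` and `mmd2` and the `lengthscale` parameter are illustrative choices, not names taken from the paper.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared exponential kernel matrix between rows of a and rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

def mmd2(x, u, w, lengthscale=1.0):
    """Estimate MMD^2 between data x and the measure sum_i w[i] * delta_{u[i]}."""
    data_term = rbf_kernel(x, x, lengthscale).mean()
    coreset_term = w @ rbf_kernel(u, u, lengthscale) @ w
    cross_term = rbf_kernel(x, u, lengthscale).mean(axis=0) @ w
    return data_term + coreset_term - 2.0 * cross_term

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
# Using every data point with uniform weights reproduces the empirical measure.
full = mmd2(x, x, np.full(200, 1.0 / 200))
# A small, uniformly weighted subset leaves a positive discrepancy.
subset = mmd2(x, x[:10], np.full(10, 0.1))
```

With the full dataset as support the estimate vanishes up to floating point error, which is a convenient sanity check when experimenting with other kernels.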
A number of algorithms have been proposed that correspond to finding $\epsilon$-MMD
coresets, under certain restrictions on $Q$. (While most of these algorithms do not explicitly
use coreset terminology, the resulting set of samples, exemplars or prototypes meets the
definition of an MMD coreset for some value of $\epsilon$.) Many of these algorithms greedily
construct an MMD coreset, adding exemplars one-by-one based on some criterion. For
example, kernel herding [14,24,25] can be seen as finding an MMD coreset $Q$ for a known
$P$, with no restriction on the support of $Q$. The greedy prototype-finding
algorithm used by [ ] can be seen as a version of kernel herding, where $P$ is only observed
via a set of samples $X$, and where the support of $Q$ is restricted to be some subset of a
collection of candidates $U$ (often chosen to be the data set $X$). Versions of this algorithm
that assign weights to the atoms in $Q$ are proposed in [13].
Other methods start from the full dataset, and repeatedly discard points to construct
a coreset [ ]. Loosely, these methods repeatedly partition the dataset based on a
discrepancy criterion, and then discard one half of the partition. Compared with the greedy
methods, these approaches typically obtain smaller coresets for a given $\epsilon$ [ ]. As shown
by [27], random sampling also provides a way to construct an MMD-coreset.
As in [ ], in this paper we require the support of our coreset to be a subset of some
finite set of candidates $U$, indexed by $1, \dots, n_U$. In other words, our measure coresets will
take the form $Q = \sum_{i \in S} w_i \delta_{u_i}$, where $S \subseteq [n_U]$.
2.3. Coresets for Understanding Datasets and Models
The primary application of coresets is to create a compact representation of a large
dataset, to allow for fast inference on downstream tasks (see [ ] for a recent survey).
However, such compact representations have also proved beneficial in interpretation of
both models and datasets.
While humans are good at interpreting visual data [ ], visualizing large quantities of
data can become overwhelming due to the sheer quantity of information. Coresets can be
used to filter such large datasets, while still retaining much of the relevant information.
The MMD-critic algorithm [ ] uses a fixed-dimension, uniformly weighted MMD
coreset, whose elements its authors refer to as “prototypes”, to summarize collections of images.
Gurumoorthy et al. [ ] extend this to use a weighted MMD coreset, showing that weighted
prototypes allow us to better model the data distribution, leading to more interpretable
summaries. Zheng et al. [ ] show how unweighted MMD coresets can be used to represent
spatial point processes such as the spatial locations of crimes.
Techniques such as coresets that produce representative points for a dataset can also
be used to provide interpretations and explanations of the behavior of models on that
dataset. Case-based reasoning approaches use representative points to describe “typical”
behavior of a model [ ]. Considering the model’s output on such representative
points can allow us to understand the model’s behavior.
Viewing the model’s behavior on a collection of “typical” points in our dataset also
allows us to get an idea of the overall model performance on our data. Evaluating a model
on a coreset can give an idea of how we expect it to perform on the entire dataset, and can
help identify failure modes or subsets of the data where the model performs poorly.
2.4. Criticising MMD Coresets
While MMD coresets are good at summarizing a distribution, since the coreset is much
smaller than the original dataset, there are likely to be outliers in the data distribution
that are not well explained by the coreset. The MMD-critic algorithm supplements the
“prototypes” associated with the MMD coreset with a set of “criticisms”—points that are
poorly modeled by the coreset [11].
Recall from Equation (2) that the MMD between two distributions $P$ and $Q$ corresponds
to the maximum difference in the expected value, under the two distributions, of a function
that can be represented in the unit ball of a Hilbert space $\mathcal{H}$. The function that achieves
this maximum is known as the witness function, and is given, up to normalization, by
$$f(x) = \mathbb{E}_{X \sim P}[k(x, X)] - \mathbb{E}_{Y \sim Q}[k(x, Y)].$$
When we only have access to $P$ via a size-$n$ sample $X$, and where $Q = \sum_{i \in S} w_i \delta_{u_i}$, we
can approximate this as
$$f(x) = \frac{1}{n} \sum_{j=1}^{n} k(x, x_j) - \sum_{i \in S} w_i k(x, u_i).$$
Criticisms of an MMD coreset for a data set $X$ are selected as the points in $X$ with
the largest values of the witness function. Kim et al. [ ] show that the combination of
prototypes and criticisms allows us to visually understand large collections of images: the
prototypes summarize the main structure of the dataset, while the criticisms allow us to
represent the extrema of a distribution. Criticisms can also augment an MMD coreset in a
case-based reasoning approach to model understanding, by allowing us to consider model
behavior on both “typical” and “atypical” exemplars.
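To make the selection of criticisms concrete, the sketch below evaluates the empirical witness function at every data point and returns the points with the largest values. It assumes a squared exponential kernel; the helper names `witness` and `select_criticisms` are hypothetical, and the toy data (a tight main cluster plus a small distant cluster, summarized by a single exemplar) is invented purely for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared exponential kernel matrix between rows of a and rows of b."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * lengthscale**2))

def witness(points, x, u, w, lengthscale=1.0):
    """Empirical witness function evaluated at each row of points."""
    return (rbf_kernel(points, x, lengthscale).mean(axis=1)
            - rbf_kernel(points, u, lengthscale) @ w)

def select_criticisms(x, u, w, n_criticisms, lengthscale=1.0):
    """Indices of the data points with the largest witness values."""
    values = witness(x, x, u, w, lengthscale)
    return np.argsort(values)[::-1][:n_criticisms]

rng = np.random.default_rng(0)
main = rng.normal(0.0, 0.3, size=(95, 2))   # bulk of the data
far = rng.normal(3.0, 0.1, size=(5, 2))     # small cluster the coreset ignores
x = np.vstack([main, far])
u = np.zeros((1, 2))                        # a single exemplar at the origin
w = np.array([1.0])
crit = select_criticisms(x, u, w, n_criticisms=5)
```

In this toy setting the five criticisms are exactly the points of the distant cluster, since the lone exemplar places no mass near them.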
2.5. Dependent and Correlated Random Measures
Dependent random measures [ ] are distributions over collections of countable
measures, indexed by some set $\mathcal{T}$, such that the marginal distribution at each
$t \in \mathcal{T}$ is described by a specific distribution. In most cases, this marginal distribution
is a Dirichlet process, meaning that the individual measures are probability distributions. Most dependent
random measures keep either the weights or the atom locations constant across $\mathcal{T}$
to assist identifiability and interpretability.
In a Bayesian framework, dependent Dirichlet processes are often used as a prior
for time-dependent mixture models. In settings where the atom locations (i.e., mixture
components) are fixed but the weights vary, the posterior mixture components can be
used to visualize and understand data drift [ ]. The dependent coresets presented in
this paper can be seen as deterministic, finite-dimensional analogues of these posterior
dependent random measures.
3. Understanding Multiple Datasets Using Coresets
As we have seen, coresets provide a way of summarizing a single distribution. In this
section, we discuss interpretational limitations that arise when we attempt to use coresets
to summarize multiple related datasets (Section 3.1), before proposing dependent MMD
coresets in Section 3.2 and discussing their uses in Section 3.3.
3.1. Understanding Multiple Datasets Using MMD-Coresets
If we have a collection $\{X_t\}_{t \in \mathcal{T}}$ of datasets, we might wish to find $\epsilon$-MMD coresets
for each of the $X_t$, in the hope of not just summarizing the individual datasets, but also
of easily comparing them. However, if we want to understand the relationships between
the datasets, in addition to their marginal distributions, comparing such coresets in an
interpretable manner is challenging.
An MMD coreset selects a set of points from some set of candidates $U$. Even if
two datasets are sampled from the same underlying distribution (i.e., their true distributions coincide)
and the set $U$ of available candidates is shared, the optimal MMD-coreset for the two
datasets will differ in general. Sampling error between the two datasets means that
their estimated MMDs to any candidate coreset $Q$ will generally differ, and so the optimal
coreset will typically differ between the two datasets.
Figure 1 shows that, even if two datasets are sampled from the same
underlying distribution, and their coreset locations are selected from the same collection $U$,
the two coresets will not be identical. Here, we see two datasets (Figure 1b,c) generated
from the same mixture of three equally weighted Gaussians (Figure 1a). Below (Figure 1d,e),
we have selected a coreset for each dataset (using the algorithm that will be introduced in
Section 3.4), with locations selected from a shared set $U$. While the associated coresets are
visually similar, they are not the same.
Figure 1. (a) Three equally weighted Gaussians (lines show 1, 2, 3 standard deviations of each
component). (b,c) Independently sampled datasets of 250 observations each from the mixture of three Gaussians. (d,e) MMD
coresets for the three-Gaussian datasets in (b) and (c), respectively.
This is magnified if we look at a high-dimensional dataset. Here, the relative sparsity
of data points (and candidate points) in the space means that individual locations in
one dataset might not have close neighbors in the other, even if the two datasets are sampled from the same
distribution. Further, in high dimensional spaces, it is harder to visually assess the distance
between two exemplars. These observations make it hard to compare two coresets, and
gain insights about similarities and differences between the associated datasets.
To demonstrate this, we constructed two datasets, each containing 250 randomly
selected, female-identified US high school yearbook photos from the 1990s. Figure 2 shows
MMD-coresets obtained for the two datasets (see Section 4.2.1 for full details of dataset and
coreset generation). While both datasets were selected from the same distribution, there
is no overlap in the support of the two coresets. Visually, it is hard to tell that these two
coresets are representing samples from the same distribution.
Figure 2. Independently learned MMD coresets for two randomly selected datasets of 250
female-identified photographs from US yearbooks in the 1990s: (a) the first set of photos; (b) the second set.
Area of each bubble is proportional to weight of the corresponding exemplar.
3.2. Dependent MMD Coresets
The coresets in Figure 2 are hard to compare due to their disjoint supports (i.e., the
fact that there are no shared exemplars). Comparing the two coresets involves comparing
multiple individual photos and assessing their similarities, in addition to incorporating
the information encoded in the associated weights. To avoid the lack of interpretability
resulting from dissimilar supports, we introduce the notion of a dependent MMD coreset.
Given a collection of datasets $\{X_t\}_{t \in \mathcal{T}}$, the collection of finite, atomic measures
$\{Q_t\}_{t \in \mathcal{T}}$ is an $\epsilon$-dependent MMD coreset if $\mathrm{MMD}^2(X_t, Q_t) \le \epsilon^2$
for all $t \in \mathcal{T}$, and if the $Q_t$ have common support, i.e.,
$$Q_t = \sum_{i \in S} w_{t,i} \delta_{u_i}, \tag{4}$$
where $\{u_i\}_{i \in S}$ is a subset of some candidate set $U$.
In Equation (4), the exemplars $u_i$ are shared between all $t \in \mathcal{T}$, but the weights $w_{t,i}$
associated with these exemplars can vary with $t$. Taking the view from Hilbert space, we
are restricting the mean embeddings $\mu_{Q_t}$ of the marginal coresets to all lie within a convex
hull defined by the exemplars $\{u_i\}_{i \in S}$.
By restricting the support of our coresets in this manner, we obtain data summaries
that are easily comparable. Within a single dataset, we can look at the weighted exemplars
that make up the coreset and use these to understand the spread of the data, as is the case
with an independent MMD coreset. Indeed, since $Q_t$ still meets the definition of an MMD
coreset for $X_t$ (see Section 2.2), we can use it analogously to an independently generated
coreset. However, since the exemplars are shared across datasets, we can directly compare
the exemplars for two datasets. We no longer need to intuit similarities between disjoint
sets of exemplars and their corresponding weights; instead we can directly compare the
weights for each exemplar to determine their relative relevance to each dataset. We will
show in Section 4.2.1 that this facilitates qualitative comparison between the marginal
coresets, when compared to independently generated coresets.
We note that the dependent MMD coresets introduced in this paper are directly
extensible to other integral probability metrics; we could, for example, construct a
dependent version of the Wasserstein coresets introduced by [22].
3.3. Model Understanding and Extrapolation
As we discussed in Section 2.3, MMD coresets can be used as tools to understand
the performance of an algorithm on “typical” data points. Considering how an algorithm
performs on such exemplars allows the practitioner to understand failure modes of the
algorithm, when applied to the data. In classification tasks where labeling is expensive,
or on qualitative tasks such as image modification, looking at an appropriate coreset can
provide an estimate of how the algorithm will perform across the dataset.
In a similar manner, dependent coresets can be used to understand generalization
behavior of an algorithm. Assume a machine learning algorithm has been trained on a
given dataset $X_a$, but we wish to apply it (without modification) to a dataset $X_b$. This is
frequently done in practice, since many machine learning algorithms require large training
sets and high computational cost; however, if the training distribution differs from the
deployment distribution, the algorithm may not perform as intended. In general, we would
expect the algorithm to perform well on data points in $X_b$ that have many close neighbors
in $X_a$, but perform poorly on data points in $X_b$ that are not well represented in $X_a$.
Creating a dependent MMD coreset $(Q_a, Q_b)$ for the pair $(X_a, X_b)$
allows us to identify exemplars that are highly representative of $X_a$ or $X_b$ (i.e.,
have high weight in the corresponding weighted coreset). Further, by comparing the
weights in the two coreset measures (e.g., by calculating how much more weight each
exemplar receives in $Q_b$ than in $Q_a$), we can identify
exemplars that are much more representative of one dataset than another. Rather than look
at all points in the coreset, if we are satisfied with the performance of our model on $X_a$,
we can choose to only look at points with high values of this quantity: points that are representative
of the new dataset $X_b$, but not the original dataset $X_a$. Further, if we wish to consider
generalization to multiple new datasets, a shared set of exemplars reduces the amount of
labeling or evaluation required.
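One simple way to carry out such a weight comparison is sketched below. The use of a weight ratio, the function name `rank_by_relative_weight`, and the example weights are all assumptions made for illustration; other comparison scores (for instance, a weight difference) would serve the same purpose.

```python
import numpy as np

def rank_by_relative_weight(w_a, w_b, eps=1e-12):
    """Order exemplar indices by how much more weight they carry in Q_b than in Q_a."""
    ratio = w_b / (w_a + eps)          # eps guards against division by zero
    order = np.argsort(ratio)[::-1]    # largest ratio first
    return order, ratio

# Hypothetical weights of four shared exemplars in the two marginal coresets.
w_a = np.array([0.40, 0.30, 0.25, 0.05])
w_b = np.array([0.10, 0.30, 0.20, 0.40])
order, ratio = rank_by_relative_weight(w_a, w_b)
```

Here the last exemplar ranks first: it carries little weight in the original dataset but a large weight in the new one, so it is a natural place to start inspecting model behavior.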
An MMD coreset, dependent or otherwise, will only contain exemplars that are
representative of the dataset(s). There are likely to be outliers that are less well represented
by the coreset. Such outliers are likely to be underserved by a given algorithm—for
example, yielding low accuracy or poor reconstructions.
As we saw in Section 2.4, MMD coresets can be augmented by criticisms: points in
$X$ that are poorly approximated by $Q$. We can equivalently construct criticisms for each
dataset represented by a dependent MMD coreset. In the example above, we would select
criticisms for the dataset $X_b$ by selecting points in $X_b$ that maximize
$$f_t(x) = \frac{1}{n_t} \sum_{j=1}^{n_t} k(x, x_{t,j}) - \sum_{i \in S} w_{t,i} k(x, u_i)$$
with $t = b$.
In addition to evaluating our algorithm on the marginal dependent coreset for dataset
$X_b$, or the subset of the coreset whose exemplars are far more heavily weighted in $X_b$ than
in $X_a$, we can evaluate on the criticisms.
In conjunction, the dependent MMD coreset and its criticisms allow us to better understand
how the algorithm is likely to perform on both typical, and atypical, exemplars of $X_b$.
3.4. A Greedy Algorithm for Finding Dependent Coresets
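Given a collection $\{X_t\}_{t \in \mathcal{T}}$ of datasets, where we assume $X_t = \{x_{t,1}, \dots, x_{t,n_t}\} \sim P_t$,
and a set of candidates $U$, our goal is to find a collection $\{Q_t\}_{t \in \mathcal{T}}$ with shared support $\{u_i\}_{i \in S}$
such that $\mathrm{MMD}^2(X_t, Q_t) \le \epsilon^2$ for all $t \in \mathcal{T}$. We begin by constructing
an algorithm for a related task: to minimize $\sum_{t \in \mathcal{T}} \mathrm{MMD}^2(Q_t, P_t)$, which we estimate using
$$\mathrm{MMD}^2(X_t, Q_t) = \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} k(x_{t,i}, x_{t,j}) + \sum_{i \in S} \sum_{j \in S} w_{t,i} w_{t,j} k(u_i, u_j) - \frac{2}{n_t} \sum_{i \in S} \sum_{j=1}^{n_t} w_{t,i} k(u_i, x_{t,j}). \tag{5}$$
If we ignore terms in Equation (5) that do not depend on the $Q_t$, we obtain the following loss:
$$L(\{Q_t\}_{t \in \mathcal{T}}) = \sum_{t \in \mathcal{T}} \ell_t(Q_t),$$
where $\ell_t(Q_t)$ collects the terms of Equation (5) that depend on $Q_t$.
We can use a greedy algorithm to minimize this loss. Let $S^{(m)}$ index the first $m$
exemplars to be added. We wish to select the next exemplar, and a set of weights
for each dataset $X_t$, that minimize the
loss. However, searching over all possible combinations of exemplars and weights is
prohibitively expensive, as it involves a non-linear optimization to learn the weights
associated with each candidate. Instead, we assume that, for each $t \in \mathcal{T}$, the weights
of all exemplars in $S^{(m)}$ are rescaled by a common factor when a new exemplar is added. In other words, we
assume that the relative weights in each $Q_t$ of the previously added exemplars do not
change as we add more exemplars.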
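Fortunately, the value of the weight $\alpha_{t,i}$ on a candidate $u_i$ that minimizes $\ell_t$ can be found
analytically for each candidate by differentiating the loss, yielding the closed-form optimum
$\alpha^*_{t,i}$ of Equation (7).
We can, therefore, select the candidate $i^*$ whose optimal weights $\{\alpha^*_{t,i^*}\}_{t \in \mathcal{T}}$
minimize the summed loss, let $S^{(m+1)} = S^{(m)} \cup \{i^*\}$, and update the weights
$w^{(m+1)}_{t,i}$ for all $t \in \mathcal{T}$ and $i \in S^{(m+1)}$ using $\alpha^*_{t,i^*}$.
As written, the procedure will greedily minimize the sum of the per-dataset losses.
However, the definition of an MMD coreset involves satisfying, not minimizing: we want
$\mathrm{MMD}^2(X_t, Q_t) \le \epsilon^2$ for all $t \in \mathcal{T}$. To achieve this, we modify the sum over datasets so
that it only includes terms for which $\mathrm{MMD}^2(X_t, Q_t) > \epsilon^2$. The resulting procedure is
summarized in Algorithm 1.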
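As a worked sketch of the kind of closed form this step produces, suppose that adding a candidate $u$ to the coreset for dataset $t$ gives $Q_t' = (1-\alpha) Q_t + \alpha \delta_u$. This convex-combination parametrization is an assumption made here for illustration (it keeps the relative weights of existing exemplars fixed), and the paper's own Equation (7) may be parametrized differently. Writing $\bar{k}_t(v) = \frac{1}{n_t}\sum_{j=1}^{n_t} k(v, x_{t,j})$, dropping the data-only term of the squared MMD estimate, and setting the derivative with respect to $\alpha$ to zero gives:

```latex
\begin{align*}
A_t &= \sum_{i,j \in S} w_{t,i} w_{t,j} k(u_i,u_j), &
B_t(u) &= \sum_{i \in S} w_{t,i} k(u_i,u), &
C_t &= \sum_{i \in S} w_{t,i} \bar{k}_t(u_i), \\
g_t(\alpha) &= (1-\alpha)^2 A_t + 2\alpha(1-\alpha) B_t(u) + \alpha^2 k(u,u)
  - 2(1-\alpha) C_t - 2\alpha \bar{k}_t(u), \\
0 = \tfrac{1}{2} g_t'(\alpha) &= -(1-\alpha) A_t + (1-2\alpha) B_t(u) + \alpha k(u,u) + C_t - \bar{k}_t(u), \\
\alpha^* &= \frac{A_t - B_t(u) - C_t + \bar{k}_t(u)}{A_t - 2 B_t(u) + k(u,u)}.
\end{align*}
```

The denominator equals $\|\mu_{Q_t} - \Phi(u)\|^2_{\mathcal{H}} \ge 0$, so $g_t$ is convex in $\alpha$ and the stationary point is a minimum whenever the denominator is nonzero.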
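Algorithm 1 DMMD: Selecting dependent MMD coresets
Require: Datasets $\{X_t\}_{t \in \mathcal{T}}$; candidate set $U$; kernel $k(\cdot,\cdot)$; threshold $\epsilon^2 > 0$
Initialize: $S^{(0)} \leftarrow \emptyset$; empty weights for all $t \in \mathcal{T}$; $D \leftarrow \mathcal{T}$; $m \leftarrow 0$
while $D \neq \emptyset$ do
    for all $i \in [n_U] \setminus S^{(m)}$ do
        for all $t \in \mathcal{T}$ do
            Calculate $\alpha^*_{t,i}$ using Equation (7)
        end for
        $L_i \leftarrow 0$
        for all $t \in D$ do
            Add to $L_i$ the loss $\ell_t$ obtained by adding $u_i$ with weight $\alpha^*_{t,i}$
        end for
    end for
    $i^* = \arg\min_{i \in [n_U] \setminus S^{(m)}} L_i$
    $S^{(m+1)} \leftarrow S^{(m)} \cup \{i^*\}$
    for all $t \in \mathcal{T}$ do
        for all $i \in S^{(m)}$ do
            Rescale $w_{t,i}$ according to $\alpha^*_{t,i^*}$
        end for
        Set $w_{t,i^*}$ from $\alpha^*_{t,i^*}$; remove $t$ from $D$ if $\mathrm{MMD}^2(X_t, Q_t) \le \epsilon^2$
    end for
    $m \leftarrow m + 1$
end while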
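The greedy procedure can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: it uses a squared exponential kernel, assumes the convex-combination update $Q_t \leftarrow (1-\alpha) Q_t + \alpha \delta_u$ with $\alpha$ clipped to $[0, 1]$, and the function name `greedy_dependent_coreset` and its arguments are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared exponential kernel matrix between rows of a and rows of b."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * lengthscale**2))

def greedy_dependent_coreset(datasets, candidates, eps2, lengthscale=1.0):
    """Greedy shared-support selection; returns (support, weights, mmd2 per dataset)."""
    n_t, n_u = len(datasets), len(candidates)
    k_uu = rbf_kernel(candidates, candidates, lengthscale)
    k_diag = np.diag(k_uu)
    # k_bar[t, i]: mean kernel value between candidate i and dataset t.
    k_bar = np.stack([rbf_kernel(candidates, x, lengthscale).mean(axis=1)
                      for x in datasets])
    # const[t]: data-only term of the squared MMD estimate for dataset t.
    const = np.array([rbf_kernel(x, x, lengthscale).mean() for x in datasets])

    weights = np.zeros((n_t, n_u))   # zero outside the current support
    support = []
    mmd2 = const.copy()              # squared MMD of the empty measure
    while np.any(mmd2 > eps2) and len(support) < n_u:
        active = mmd2 > eps2         # datasets not yet within the threshold
        a = np.einsum("ti,ij,tj->t", weights, k_uu, weights)[:, None]
        b = weights @ k_uu
        c = np.sum(weights * k_bar, axis=1)[:, None]
        # Closed-form step size for every (dataset, candidate) pair.
        denom = np.maximum(a - 2.0 * b + k_diag[None, :], 1e-12)
        alpha = np.clip((a - b - c + k_bar) / denom, 0.0, 1.0)
        # Loss after the update, excluding the data-only constant.
        loss = ((1 - alpha) ** 2 * a + 2 * alpha * (1 - alpha) * b
                + alpha**2 * k_diag[None, :]
                - 2 * (1 - alpha) * c - 2 * alpha * k_bar)
        total = loss[active].sum(axis=0)
        if support:
            total[support] = np.inf  # never reselect an exemplar
        best = int(np.argmin(total))
        weights *= (1.0 - alpha[:, best])[:, None]
        weights[:, best] += alpha[:, best]
        support.append(best)
        mmd2 = (const + np.einsum("ti,ij,tj->t", weights, k_uu, weights)
                - 2.0 * np.sum(weights * k_bar, axis=1))
    return support, weights[:, support], mmd2

rng = np.random.default_rng(0)
x_a = rng.normal(0.0, 1.0, size=(100, 1))
x_b = rng.normal(1.0, 1.0, size=(100, 1))
candidates = np.vstack([x_a, x_b])
eps2 = 1e-3
support, w, mmd2 = greedy_dependent_coreset([x_a, x_b], candidates, eps2)
```

Each row of `w` holds one dataset's weights over the shared exemplars indexed by `support`. Because the clipped step always includes $\alpha = 0$ as an option, no dataset's squared MMD estimate increases when an exemplar is added.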
3.5. Limitations
As discussed in Section 2, if we can bound the MMD between two distributions by $\epsilon$,
then for any function $f$ in the unit ball of the Hilbert space associated with our kernel,
the expectations of $f$ with respect to the two distributions will differ by at most $\epsilon$. However,
we have no guarantee for functions that cannot be represented in that Hilbert space. If we
use an MMD coreset (dependent or otherwise) to understand the output of a model, and
that output cannot be well approximated by the expectation with respect to a function in
that space, we cannot use performance on the coreset to bound performance on the full dataset. For
this reason, we focus on the use of coresets as a qualitative, diagnostic tool for exploring
model performance.
Beyond the question of whether functions of interest lie in a Hilbert space, we must
also question which Hilbert space. Our choice of kernel will impact the nature of the
resulting coresets. If we assume the popular squared exponential kernel, then different
lengthscales will cause the algorithm to prioritize capturing variation at different scales. In
this work, we have used median heuristics to set the lengthscale [ ]; however, if we were
interested in capturing differences on a specific task, a better approach might be to learn
interested in capturing differences on a specific task, a better approach might be to learn
the kernel. An alternative approach would be to use a different integral probability metric
in place of the MMD, such as the Wasserstein distance, which has been used to construct
(non-dependent) measure coresets [22].
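For reference, one common variant of the median heuristic sets the lengthscale to the median pairwise Euclidean distance between data points. The sketch below shows that variant only; the function name `median_heuristic` is illustrative, and the exact variant used in the experiments is not specified here.

```python
import numpy as np

def median_heuristic(x):
    """Median pairwise Euclidean distance between distinct rows of x."""
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt(np.sum(diffs**2, axis=-1))
    # Keep each unordered pair once and exclude the zero self-distances.
    upper = dists[np.triu_indices(len(x), k=1)]
    return float(np.median(upper))

x = np.array([[0.0], [1.0], [3.0]])
lengthscale = median_heuristic(x)  # pairwise distances are 1, 3 and 2
```

For large datasets the pairwise distance matrix is expensive to form, so in practice the median is often computed on a random subsample.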
Conversely, a limitation of MMD is that calculating the MMD scales cubically with
the size of the data. Similarly, calculating the Wasserstein distance is typically computationally
expensive, as in general it requires solving a linear programming problem. This limits
the scalability of our algorithm; however, since subsampling a dataset yields a valid MMD
coreset with high probability [ ], our algorithm could be used on samples from larger
datasets. Similarly, we could replace our initial datasets with (non-dependent) MMD coresets
obtained using an existing algorithm [ ], although this would be more expensive
than random sampling. In either setting, we would need to incorporate the approximation
error of the random sample or coreset into our overall approximation error $\epsilon$.
When working with complex datasets such as images, we often work with lower-dimensional
representations or embeddings [ ]; for example, in Section 4, we will
use ResNet [ ] to generate embeddings for yearbook photos. However, this can make
notions of “similarity” opaque, since the representations can capture properties of the
image that are not immediately obvious to the viewer, or do not register as important [ ].
Concerningly, recent research has suggested that image representations can encode harmful
human-like biases [44].
Our algorithm greedily constructs dependent coresets. Recent work on MMD coresets
has found that discrepancy-based algorithms, where the full dataset is successively divided
based on some discrepancy measure, can obtain smaller $\epsilon$-coresets than greedy methods or
random sampling [ ]. Unfortunately, it is not clear how to extend such a partitioning
algorithm to the dependent setting; however, these results suggest that it is worth exploring
alternative constructions for dependent coresets.
4. Experimental Evaluation
In Section 3.2, we introduced dependent MMD coresets, a summarization technique
designed to allow easy comparison between related datasets, and proposed a greedy
algorithm to construct dependent MMD coresets in Section 3.4. We also described, in
Section 3.3, how dependent MMD coresets can be used to understand performance of
models and algorithms, particularly in the context of generalization to new datasets.
In this section, we will empirically evaluate the performance of our algorithm in
Section 4.1. Previous greedy algorithms for weighted MMD coresets (without dependence)
proceed by first selecting a new exemplar, and then updating weights once the exemplar
has been added to the coreset. While such an approach could be adapted to the dependent
setting, we show that our algorithm (Algorithm 1), which pre-selects weights based on a
single calculation, achieves comparable coresets with lower computational cost.
After evaluating the algorithm used to select the coresets, we will go on to explore
the coresets themselves, in Section 4.2. We begin by showing how, when comparing
two datasets, the shared support offered by dependent MMD coresets allows for easier
comparison than two standard MMD coresets. We then go on to show, in an example
comparing 12 related datasets, that dependent MMD coresets can allow us to capture
trends and similarities in an interpretable manner.
In Section 4.3, we turn our attention to coresets for model understanding. Here, we
simulate a scenario where we wish to deploy algorithms trained on one dataset to a slightly
different dataset. By looking at performance on exemplars that are highly weighted in the
second dataset, but not the first, we can obtain qualitative insights on the generalization
properties of the algorithms. Adding evaluation on criticisms of the dependent MMD
coreset leads to a deeper understanding of the model behavior.
4.1. Evaluation of Dependent MMD Coreset Algorithm
In Section 3.4, we proposed a greedy algorithm for selecting dependent MMD coresets
(Algorithm 1, which we will denote DMMD). This algorithm selected weights (one for
each dataset) for each candidate data point, and then greedily selected a data point and
its associated weights. Since dependent coresets are introduced in this work, there is no
direct comparison algorithm; however, a natural alternative would have been to adapt
PROTODASH, an existing greedy algorithm for weighted MMD coresets, to the dependent
setting. Such an approach differs from Algorithm 1 in that weights are optimized after a
candidate has been selected.
Below, we review the PROTODASH algorithm, and introduce two alternative greedy
algorithms for dependent MMD coresets: a dependent version of PROTODASH, that selects
unweighted candidates then optimizes weights; and a hybrid algorithm that pre-selects
weights for candidate points, but further optimizes them after an exemplar has been added
to the coreset. We quantitatively compare these variants with Algorithm 1, showing that
pre-selecting weights provides comparable coresets to methods that optimize weights, at a
much lower computational cost.
The PROTODASH algorithm [13] for weighted MMD coresets greedily selects exemplars
that minimize the gradient of the loss (for a single dataset). Having selected
an exemplar to add to the coreset, PROTODASH then uses an optimization procedure to find
the weights that minimize $\widehat{\mathrm{MMD}}^2(X_t, Q_t)$. We modify this algorithm for the dependent
MMD setting by summing the gradients across all datasets for which the $\epsilon^2$ threshold is
not yet satisfied, leading to the dependent PROTODASH algorithm shown in Algorithm 2.
Unlike the dependent version of PROTODASH in Algorithm 2, our algorithm assigns
weights before selection, which should encourage adding points that would help some
of the marginal coresets, but not others. However, there is no post-exemplar-addition
optimization of the weights. Inspired by the post-addition optimization in PROTODASH,
we also compare our algorithm with a variant of Algorithm 1 that optimizes the weights
after each step, allowing the relative weights of the exemplars to change between each
iteration. We will refer to this variant of DMMD with post-exemplar-addition optimization
as DMMD-OPT.
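Algorithm 2 A dependent PROTODASH algorithm
Require: Datasets $\{X_t\}_{t \in T}$, candidate set $U$, kernel $k(\cdot,\cdot)$, threshold $\epsilon^2 > 0$
  $w_t \leftarrow [\,]$ for all $t \in T$, $D \leftarrow T$, $m \leftarrow 0$
  for all $i \in [n_U]$ do
    $g_i = \sum_{t \in T} \ldots$
  end for
  while $D \neq \emptyset$ do
    $i^* = \arg\min_{i \in [n_U] \setminus S^{(m)}} g_i$
    $S^{(m+1)} \leftarrow S^{(m)} \cup \{i^*\}$
    for all $t \in T$ do
      $\{w_{t,i}\}_{i \in S^{(m+1)}} \leftarrow \arg\max_{\{w_{t,i}\}_{i \in S^{(m+1)}}} \ldots$
    end for
    for all $i \in [n_U] \setminus S^{(m+1)}$ do
      $\ldots$
    end for
  end while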
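The select-then-optimize pattern can be illustrated for a single dataset with a short sketch. This is not the authors' implementation: the loss $L(w) = \frac{1}{2} w^\top K w - w^\top \mu$ (equal to half the squared MMD up to a constant), the function name `protodash_sketch`, and the use of projected gradient descent in place of a BFGS optimizer are all assumptions made for illustration.

```python
import numpy as np

def protodash_sketch(K_UU, mu, m, n_steps=500):
    """Greedy select-then-optimize sketch for ONE dataset (hypothetical helper).

    K_UU : (n_U, n_U) kernel matrix between candidate points.
    mu   : (n_U,) vector, mu[i] = mean over the dataset of k(u_i, x).
    Returns the selected indices S and non-negative weights w.
    """
    S, w = [], np.zeros(0)
    for _ in range(m):
        # Gradient of L(w) = 0.5 w'Kw - w'mu with respect to every candidate,
        # evaluated at the current weights (unselected candidates have weight 0).
        grad = -mu.copy()
        if S:
            grad += K_UU[:, S] @ w
            grad[S] = np.inf  # never re-select an exemplar
        S.append(int(np.argmin(grad)))
        # Re-optimize all weights on the enlarged support, keeping them >= 0.
        K_SS = K_UU[np.ix_(S, S)]
        w = np.append(w, 0.0)
        step = 1.0 / np.linalg.eigvalsh(K_SS).max()
        for _ in range(n_steps):
            w = np.maximum(w - step * (K_SS @ w - mu[S]), 0.0)
    return S, w

def loss(K_UU, mu, S, w):
    """The quadratic loss that the sketch minimizes."""
    return 0.5 * w @ K_UU[np.ix_(S, S)] @ w - w @ mu[S]

# Toy data: the candidate set is the dataset itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
mu = K.mean(axis=1)
S2, w2 = protodash_sketch(K, mu, 2)
S8, w8 = protodash_sketch(K, mu, 8)
```

The inner loop is where the cost of this family of methods accumulates: every added exemplar triggers a fresh optimization over all weights selected so far.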
We evaluate all three methods using a dataset of photographs of 15,367 female-
identified students, taken from yearbooks between 1905 and 2013 [
]. We show a random
subset of these images in Figure 3. We generated 512-dimensional embeddings of the pho-
tos using the torchvision pre-trained implementation of ResNet [
]. We then partitioned
the collection into 12 datasets, each containing photos from a single decade.
Figure 3. A random subset of 100 images taken from the yearbook dataset.
In order to capture lengthscales appropriate for the variation in each decade, we use
an additive kernel, setting
$$K = \frac{K_{\text{all}} + \sum_{t \in T} K_t}{|T| + 1},$$
where $K_{\text{all}}$ is a squared exponential kernel with bandwidth given by the overall median
pairwise distance; $T$ is the set of decades that index the datasets; and $K_t$ is a squared
exponential kernel with bandwidth given by the median pairwise distance between images in
dataset $t$.
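A minimal NumPy sketch of such an additive kernel is given below. The helper names are hypothetical, and the parametrization $k(x, y) = \exp(-\|x - y\|^2 / (2h^2))$ of the squared exponential kernel is an assumption; the paper's own code may scale the bandwidth differently.

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Squared Euclidean distances between rows of A and rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.maximum(sq, 0.0)  # clip small negative values from round-off

def median_bandwidth(X):
    """Median pairwise distance between distinct rows of X."""
    d = np.sqrt(pairwise_sq_dists(X, X))
    return np.median(d[np.triu_indices(len(X), k=1)])

def additive_kernel(A, B, datasets):
    """Average of one pooled kernel and one kernel per dataset."""
    pooled = np.vstack(datasets)
    bandwidths = [median_bandwidth(pooled)] + [median_bandwidth(X) for X in datasets]
    sq = pairwise_sq_dists(A, B)
    # Assumed parametrization: k(x, y) = exp(-||x - y||^2 / (2 h^2)).
    return sum(np.exp(-sq / (2.0 * h ** 2)) for h in bandwidths) / len(bandwidths)

# Toy example with three synthetic "decades".
rng = np.random.default_rng(0)
datasets = [rng.normal(loc=t, size=(30, 4)) for t in range(3)]
K = additive_kernel(datasets[0], datasets[0], datasets)
```

Dividing by the number of component kernels, $|T| + 1$, keeps the diagonal of the combined kernel equal to one.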
We begin by considering how good a dependent MMD coreset each algorithm is able to
construct, for a given number of exemplars $m$. To do so, we ran all algorithms without
specifying a threshold $\epsilon^2$, recording $\widehat{\mathrm{MMD}}^2(X_t, Q_t)$ for each value of $m$. All algorithms
were run for one hour on a 2019 MacBook Pro (2.6 GHz 6-Core Intel Core i7, 32 GB
2667 MHz DDR4), excluding time taken to generate and store the kernel entries, which
only occurs once. As much code as possible was re-used between the three algorithms.
Where required, optimization of weights was carried out using a BFGS optimizer. Code is
available at (accessed on 22 September 2021).
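For concreteness, the squared MMD between the empirical distribution of a dataset and a weighted coreset can be estimated as in the sketch below. It uses the standard biased (V-statistic) form; the function names and the fixed-bandwidth kernel are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def rbf(A, B, h=1.0):
    """Squared exponential kernel with (assumed) bandwidth h."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * h ** 2))

def mmd2(X, U_S, w, kernel=rbf):
    """Biased estimate of MMD^2 between the empirical distribution of X
    and the weighted coreset Q = sum_i w[i] * delta(U_S[i])."""
    k_xx = kernel(X, X).mean()                 # E k(x, x')
    k_xq = kernel(X, U_S).mean(axis=0) @ w     # E_x sum_i w_i k(x, u_i)
    k_qq = w @ kernel(U_S, U_S) @ w            # sum_ij w_i w_j k(u_i, u_j)
    return k_xx - 2.0 * k_xq + k_qq

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
exact = mmd2(X, X, np.full(50, 1 / 50))        # the dataset summarizes itself
shifted = mmd2(X, X[:5] + 3.0, np.full(5, 1 / 5))
```

Using the full dataset with uniform weights gives an estimate of zero (up to round-off), while a small, displaced coreset gives a strictly positive value.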
Figure 4a shows the per-dataset estimates $\widehat{\mathrm{MMD}}^2(X_t, Q_t)$, and Figure 4b shows the
average performance across all 12 datasets. We see that the three algorithms perform
comparably in terms of coreset quality. DMMD-OPT seems to perform slightly better than
DMMD, as might be expected due to the additional optimization step. PROTODASH, by
comparison, seems to perform slightly worse, which we hypothesise is because it has no
mechanism by which weights can be incorporated at selection time. However, in both
cases, the difference is slight.
(a) $\widehat{\mathrm{MMD}}^2(X_t, Q_t)$ for increasing numbers of exemplars $m$. Each plot corresponds to a
dataset containing yearbook photos from a single decade.
(b) Mean ± one standard error of $\widehat{\mathrm{MMD}}^2(X_t, Q_t)$ for increasing numbers of exemplars $m$.
Figure 4. Evaluating how coreset quality varies with number of exemplars, for dependent MMD
coresets generated using three algorithms, on 12 yearbook datasets.
DMMD is however much faster at generating coresets, since it does not optimize the full
set of weights at each iteration. This can be seen in Figure 5, which shows the time taken
to generate coresets of a given size. The cost of the optimization-based algorithms grows
rapidly with coreset size (m); the rate of growth of the DMMD coresets is much smaller.
Figure 5.
Time (in seconds) taken to construct MMD dependent coresets of a given size, for three
algorithms, on the 12 yearbook datasets. Algorithms ran for a maximum of one hour.
In practice, rather than endlessly minimizing $\sum_{t \in T} \widehat{\mathrm{MMD}}^2(X_t, Q_t)$, we will aim to find
coresets $Q_t$ such that $\widehat{\mathrm{MMD}}^2(X_t, Q_t) \le \epsilon^2$ for all $t \in T$. In Figure 6, we show the coreset sizes
required to obtain an $\epsilon$-MMD dependent coreset on the twelve decade-specific yearbook
datasets, for each algorithm. Again, a maximum runtime of one hour was specified. When
all three algorithms were able to finish, the coreset sizes are comparable (with DMMD-OPT
finding slightly smaller coresets than DMMD, and PROTODASH finding slightly larger
coresets). However, the optimization-based methods are hampered by their slow runtime.
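The stopping rule can be read off a recorded trace of per-dataset MMD estimates. The sketch below assumes a hypothetical array `mmd2_trace` whose row $m - 1$ holds the estimates for all datasets after $m$ exemplars; the numbers are made up for illustration and are not results from the paper.

```python
import numpy as np

def coreset_size_for_threshold(mmd2_trace, eps2):
    """Smallest m such that every dataset's MMD^2 estimate is <= eps2.

    mmd2_trace[m - 1, t] is the estimate for dataset t with m exemplars.
    Returns None if the threshold is never reached within the trace
    (for example, because a time budget ran out first).
    """
    ok = (mmd2_trace <= eps2).all(axis=1)
    return int(np.argmax(ok)) + 1 if ok.any() else None

# Hypothetical trace for two datasets and coreset sizes m = 1..4.
mmd2_trace = np.array([
    [0.050, 0.080],
    [0.020, 0.030],
    [0.008, 0.012],
    [0.004, 0.006],
])
size_loose = coreset_size_for_threshold(mmd2_trace, eps2=0.03)
size_tight = coreset_size_for_threshold(mmd2_trace, eps2=0.01)
size_unreached = coreset_size_for_threshold(mmd2_trace, eps2=0.001)
```

Because the threshold must hold for every dataset, the required size is driven by the dataset that is hardest to summarize.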
Figure 6. Coreset size required to obtain an $\epsilon$-MMD dependent coreset on the 12 yearbook datasets,
for three algorithms. Algorithms ran for a maximum of one hour; PROTODASH failed to complete
coresets for $\epsilon^2 = 0.01$ and $\epsilon^2 = 0.005$. DMMD-OPT failed to complete a coreset for $\epsilon^2 = 0.005$.
Based on these analyses, it appears there is some advantage to additional optimization
of the weights. However, in most cases, we do not feel the improved performance
merits the additional computational cost.
4.2. Interpretable Data Summarizations
Summarizations of datasets can allow us to quickly understand properties of their
distributions, and allow us to convey such properties to others, for example in a document
explaining the data and its provenance [
]. In high-dimensional, highly structured
datasets such as collections of images, traditional summary statistics such as the mean
of a dataset are particularly uninterpretable, as they convey little of the shape of the
underlying distribution. A better approach is to show the viewer a collection of images that
are representative of the dataset. MMD coresets allow us to obtain such a representative
set, making them a better choice than displaying a random subset.
As we discussed in Section 3.1, if we wish to summarize a collection of related datasets,
independently generated MMD coresets can help us understand each dataset individually,
but it may prove challenging to compare datasets. This challenge becomes greater in
high dimensional settings such as image data, where we cannot easily intuit a distance
between exemplars. To showcase this phenomenon, and demonstrate how dependent
MMD coresets can help, we return to the yearbook photos introduced in Section 4.1. For all
experiments in this section, we use the additive kernel described in Section 4.1.
4.2.1. A Shared Support Allows for Easier Comparison of Datasets
In Section 3.2, we argued that the shared support provided by dependent MMD
coresets facilitates comparison of datasets, since we only need to consider differences in
weights. To demonstrate this, we constructed four datasets, each a subset of the entire
yearbook dataset containing 250 photos. The first two datasets contained only faces from
the 1990s; the second two, only faces from the 2000s. The datasets were generated by
sampling without replacement from the associated decades, to ensure no photo appeared
more than once across the four datasets. Our goal is to provide a visual way to compare
these four datasets.
We begin by independently generating (non-dependent) MMD coresets for the four
datasets, using Algorithm 1 independently on each dataset, with a threshold of $\epsilon^2 = 0.01$.
The set of candidate images, $U$, was the entire dataset of 15,367 images. The resulting
coresets are shown in Figure 7; the areas of the bubbles correspond to the weights associated
with each exemplar (the top row of Figure 2 duplicates Figure 7).
(a) 0.1-MMD coreset for a set of 250 randomly selected yearbook photos from the 1990s.
(b) 0.1-MMD coreset for a second set of 250 randomly selected yearbook photos from the 1990s.
(c) 0.1-MMD coreset for a set of 250 randomly selected yearbook photos from the 2000s.
(d) 0.1-MMD coreset for a second set of 250 randomly selected yearbook photos from the 2000s.
Figure 7. Independently learned, weighted 0.1-MMD coresets based on 250 random samples from a
given decade. Area of each bubble is proportional to the weight of the corresponding exemplar in
the coreset.
We can see that, considered individually, each coreset appears to be doing a good job
of capturing the variation in students for each dataset. However, if we compare the four
coresets, it is not easy to tell that Figure 7a,b represent the same underlying distribution,
and Figure 7c,d represent a second underlying distribution—or to interpret the difference
between the two distributions. We see that the highest weighted exemplar for the two 2000s
datasets is the same (top left of Figure 7c,d), but only one other image is shared between the
two coresets. Meanwhile, one of the 1990s coresets shares the same highest-weighted
image with the two 2000s coresets, but this image does not appear in the other 1990s
coreset, and the two 1990s coresets have no overlap. Overall, it is hard to compare between
the marginal coresets.
By contrast, the shared support offered by dependent coresets means we can directly
compare the distributions using their coresets. In Figure 8, we show a dependent MMD
coreset ($\epsilon^2 = 0.01$) for the same collection of datasets. The shared support allows us to see
that, while the two decades are fairly similar, there is clearly a stronger similarity between
the pairs of datasets from the same decade (i.e., similarly sized photos), than between pairs
from different decades. We can also identify images that exemplify the difference between
the two decades, by looking at the difference in weights. We see that many of the faces
towards the top of the bubble plot have high weights in the 2000s, but low weights in the
1990s. Examining these exemplars suggests that straight hair became more prevalent in
the 2000s. Conversely, many of the faces towards the bottom of the bubble plot have high
weights in the 1990s, but low weights in the 2000s. These photos tend to have wavy/fluffy
hair and bangs. In conjunction, these plots suggest a tendency in the 2000s away from
bangs and towards straight hair, something the authors remember from their formative
years. However, there is still a significant overlap between the two decades: many of the
exemplars have similar weights in the 1990s and the 2000s.
(a) Marginal dependent 0.1-MMD coreset for a set of 250 randomly selected yearbook photos from the 1990s.
(b) Marginal dependent 0.1-MMD coreset for a second set of 250 randomly selected yearbook photos from the 1990s.
(c) Marginal dependent 0.1-MMD coreset for a set of 250 randomly selected yearbook photos from the 2000s.
(d) Marginal dependent 0.1-MMD coreset for a second set of 250 randomly selected yearbook photos from the 2000s.
Figure 8. Dependent 0.1-MMD coreset for a collection of 4 datasets, each including 250 random
samples from a given decade. Area of each bubble is proportional to the weight of the corresponding
exemplar in the marginal coreset. Positioning is constant across all four examples.
We can also see this in Figure 9, a bar chart showing the average weights associated
with each exemplar in each decade (i.e., the blue bar above a given image is the average
weight for that exemplar across the two 1990s datasets, and the red bar is the average
weight across the two 2000s datasets). We see that most of the exemplars have similar
weights in both scenarios, but that we have a number of straight-haired exemplars
disproportionately representing the 2000s, and a number of exemplars with bangs and/or wavy
hair disproportionately representing the 1990s. These insights would have been hard to
intuit from the standard MMD coresets, where it is hard to identify what variation is due
to true underlying differences in the dataset, and what is due to sampling error.
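Because all marginal coresets share the same exemplars, this comparison reduces to simple arithmetic on a weight matrix. The sketch below uses a small, made-up matrix `W` (rows are datasets, columns are exemplars); the values are hypothetical and only illustrate the bookkeeping.

```python
import numpy as np

# Hypothetical weights: rows are the four datasets, columns are three shared exemplars.
W = np.array([
    [0.5, 0.3, 0.2],   # 1990s, first sample
    [0.4, 0.4, 0.2],   # 1990s, second sample
    [0.2, 0.3, 0.5],   # 2000s, first sample
    [0.2, 0.2, 0.6],   # 2000s, second sample
])
decade = np.array(["1990s", "1990s", "2000s", "2000s"])

# Average weight of each exemplar within each decade (the bars in a chart like Figure 9).
avg_1990s = W[decade == "1990s"].mean(axis=0)
avg_2000s = W[decade == "2000s"].mean(axis=0)

# Exemplars sorted from "most 1990s" to "most 2000s".
diff = avg_2000s - avg_1990s
order = np.argsort(diff)
```

The exemplars at either end of `order` are the ones whose weights differ most between the two decades.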
Figure 9. Summary of a dependent 0.1-MMD coreset for four datasets of yearbook faces from the 1990s and 2000s. Exemplars
are shown along the x-axis. The average weight for each exemplar in the coresets associated with the 1990s is shown in blue;
the average weight for the 2000s is shown in red.
4.2.2. Dependent Coresets Allow Us to Visualize Data Drift in Collections of Images
Next, we show how dependent MMD coresets can be used to understand and visualize
variation between collections of multiple datasets. As in Section 4.1, we partition the
15,367 yearbook images into twelve datasets based on their decade, with the goal of
understanding how the distribution over yearbook photos changes over time. Table 3
shows the number of photos in each resulting dataset.
Table 3. Number of yearbook photos for each decade.
1900s 1910s 1920s 1930s 1940s 1950s 1960s 1970s 1980s 1990s 2000s 2010s
35 98 308 1682 2650 2093 2319 2806 2826 2621 2208 602
Figure 10 shows the exemplars in the resulting dependent MMD coreset, with a
threshold of $\epsilon^2 = 0.01$. The corresponding plots show how the weights vary with time.
The exemplars are ordered based on their average weight across the 12 datasets. In each
case, a red, vertical line indicates the year of the yearbook from which the exemplar was
taken. We are able to see how styles change over time, moving away from the formal styles
of the early 20th century, through waved hairstyles popular in the midcentury, towards
longer, straighter hairstyles in later decades. In general, the relevance of an exemplar peaks
around the time it was taken (although this information is not used to select exemplars).
However, some styles remain relevant over longer time periods (see many exemplars in
the first column). Most of the early exemplars are highly peaked on the 1900s or 1910s; this
is not surprising since these pre-WW1 photos tend to have very distinctive photography
characteristics and hair styles. Note that we do not include a comparison to standard,
independent MMD coresets, as it would not be possible to produce an analogous set of
plots—the exemplars in each decade’s coreset would, in general, not overlap.
Figure 10. Visualization of a 0.1-MMD dependent coreset for 12 datasets, each containing yearbook photos from a given
decade. Photos show the exemplars, ordered by their average weight across the 12 marginal coresets. To the
left of each photo is a plot of the corresponding weight over time; a red vertical line marks the year the photo was taken.
Figure 10 appears to show that the marginal coresets have high weights on exemplars
from the corresponding decade. To look at this in more detail, we consider the distributions
over the dates of the exemplars associated with each decade. Figure 11 shows the weighted
mean and standard deviation of the years associated with the exemplars, with weights
given by the coreset weights. We see that the mean weighted year of the exemplars
increases with the decade. However, we notice that it is pulled towards the 1940s and 1950s
in each case: this is because we must represent all datasets using a weighted combination
of points taken from the convex hull of all datapoints.
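The weighted summary used here is a standard weighted mean and standard deviation. A minimal sketch follows; the function name and the example years and weights are hypothetical.

```python
import numpy as np

def weighted_mean_std(years, w):
    """Weighted mean and standard deviation of exemplar years.

    The weights are normalized to sum to one before use.
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mean = w @ years
    std = np.sqrt(w @ (years - mean) ** 2)
    return mean, std

# Hypothetical exemplar years and the weights of one decade's marginal coreset.
years = np.array([1925.0, 1948.0, 1963.0, 1987.0, 2004.0])
w_decade = np.array([0.05, 0.10, 0.15, 0.40, 0.30])
mean_year, std_year = weighted_mean_std(years, w_decade)
```

A marginal coreset that places most of its weight on exemplars from its own decade gives a weighted mean close to that decade and a small weighted standard deviation.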
Figure 11. Distribution over the year associated with the marginal coresets for each decade. Plot
shows weighted mean ± weighted standard deviation.
4.3. Dependent Coresets Allow Us to Understand Model Generalization
To see how dependent coresets can be used to understand how a model trained on
one dataset will generalize to others, we simulate a scenario where we wish to deploy a
machine learning model on a given dataset, but where the model was trained on a different
dataset. In this scenario, we are interested in learning whether the model generalizes well
to the new dataset.
We generate two datasets, one to represent the training data and one to represent the
data used in deployment, by partitioning a collection of image digits. We started with the
USPS handwritten digits dataset [
], which comprises a train set of 7291 handwritten digits,
and a test set of 2007 handwritten digits. We split the train set into two datasets, $X_a$ and $X_b$,
where $X_a$ is skewed towards the earlier digits and $X_b$ towards the later digits. Figure 12
shows the resulting label counts for each dataset: we see there is a clear distributional
imbalance. Note that, in general, we will not have such a concise summary of the difference
between two datasets; however, using image digits as our example allows us to get an idea
of the “ground truth” difference between the two datasets.
(a) Label frequencies for dataset $X_a$. (b) Label frequencies for dataset $X_b$.
Figure 12. Frequency with which each digit occurs in two datasets of handwritten digits.
We selected three classification algorithms to assess generalization performance. We
chose classification algorithms because it is easy for us to obtain “ground truth”
generalization performance by applying these algorithms to our second dataset $X_b$, allowing us to
compare our insights with the true generalization performance. In general, we may not be
able to easily estimate generalization performance in this manner: we may have unlabeled
data, or our task may not be easily quantitatively evaluated (e.g., evaluating quality of
auto-generated captions); we expect our approach to have greatest utility in such scenarios.
We trained three classifiers (a decision tree with maximum depth of 8, a random
forest with 100 trees, and a multilayer perceptron (MLP) with a single hidden layer with
100 units) on $X_a$ and the corresponding labels. In each case, we used the implementation
in scikit-learn [
], with parameters chosen to have similar train set accuracy on $X_a$. These
three models were chosen to have varying generalization accuracy. Table 4 shows the
associated classification accuracies on the datasets $X_a$ and $X_b$, which we will use as a
quantitative representation of the algorithms’ generalization performance on $X_b$. We also
show confusion matrices in Figures 13 and 14. We see that all three algorithms perform
comparably on $X_a$, the dataset on which they were trained. However, when applied to $X_b$,
we see in Figure 14 that the decision tree struggles in classifying 8s and 9s, and that the
random forest struggles with 9s.
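A setup of this kind can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not a reproduction of the experiment: it uses scikit-learn's bundled 8×8 `load_digits` data as a stand-in for USPS, an arbitrary skewing scheme (`p_a`), and default hyperparameters apart from those named in the text, so the accuracies will not match Table 4.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's small digits set, not USPS.
X, y = load_digits(return_X_y=True)

# Skewed split (assumption): digit d goes to X_a with probability p_a[d],
# so X_a over-represents early digits and X_b over-represents late digits.
rng = np.random.default_rng(0)
p_a = np.linspace(0.9, 0.1, 10)
in_a = rng.random(len(y)) < p_a[y]
X_a, y_a = X[in_a], y[in_a]
X_b, y_b = X[~in_a], y[~in_a]

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=8, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0),
}

# Train on X_a only, then score on both X_a and X_b.
accuracy = {}
for name, model in models.items():
    model.fit(X_a, y_a)
    accuracy[name] = (model.score(X_a, y_a), model.score(X_b, y_b))
```

Comparing the two numbers stored for each model gives the kind of train-versus-deployment gap summarized in Table 4.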
Table 4. Accuracies of three classification algorithms on datasets $X_a$ and $X_b$. All algorithms were
trained on $X_a$.
Model            Accuracy on $X_a$    Accuracy on $X_b$
MLP              0.9998               0.8531
Random Forest    1.0                  0.7129
Decision Tree    0.9585               0.5880
(a) Decision tree (b) Random forest (c) Multilayer perceptron
Figure 13. Confusion matrices on Xa, for three classification algorithms trained on Xa.
(a) Decision tree (b) Random forest (c) Multilayer perceptron
Figure 14. Confusion matrices on Xb, for three classification algorithms trained on Xa.
We begin our analysis by generating a dependent MMD coreset for the two datasets,
with threshold $\epsilon^2 = 0.005$. To ensure the exemplars in our coreset have not been seen in training,
we let our set of candidate points $U$ be the union of $X_b$ and the USPS test set. As with
the yearbook data, we use an additive squared exponential kernel, with bandwidths of
the composite kernels being the median within-class pairwise distances, and the overall
median pairwise distance. Distances were calculated using the raw pixel values. Figure 15
shows the resulting dependent MMD coreset, with the bars showing the weights $w_{a,i}$ and $w_{b,i}$
associated with the two datasets, and the images below the x-axis showing the corresponding
images $u_i$. In Figure 16, the weights have been grouped by number, so that if $y(u)$ is
the label of image $u$, the $j$th bar for $Q_a$ has weight $\sum_{i \in S : y(u_i) = j} w_{a,i}$.
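Grouping exemplar weights by label amounts to a weighted count per class. The sketch below uses `numpy.bincount` with its `weights` argument; the labels and weights are hypothetical.

```python
import numpy as np

# Hypothetical exemplar labels y(u_i) and marginal weights w_{a,i}.
labels = np.array([0, 1, 1, 9])
w_a = np.array([0.4, 0.3, 0.2, 0.1])

# Entry j is the total weight of exemplars whose label is j.
weight_by_digit = np.bincount(labels, weights=w_a, minlength=10)
```

Here `weight_by_digit[1]` sums the two exemplars labelled 1, and digits with no exemplar receive weight zero.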
We can see that the coreset has selected points that cover the spread of the overall
dataset. However, looking at Figure 16, we see that the weights assigned to these exemplars
mirror the relative frequencies of each digit in the corresponding datasets $X_a$
and $X_b$ (Figure 12).
Figure 15. Dependent MMD coreset for two datasets of handwritten digits. Bars show the weight in each marginal coreset,
with $X_a$ shown in blue and $X_b$ shown in red; images along the axis show corresponding exemplars.
Figure 16. Summary of a dependent MMD coreset for two datasets of handwritten digits. The
weights and exemplars from Figure 15 have been combined based on their label. Weights for $X_a$ are
shown in blue and weights for $X_b$ are shown in red.
We then considered all points $u_i$ in our dependent coreset where $w_{b,i}/w_{a,i} > 2$, i.e.,
points that are much more representative of $X_b$ than $X_a$. We
then looked at the class probabilities of the three algorithms, on each of these points, as
shown in Figure 17. We see that the decision tree misclassifies nine of the 21 exemplars,
and is frequently highly confident in its misclassification. The random forest misclassifies
three examples, and the MLP misclassifies two. We see this agrees with the ordering
provided by empirically evaluating generalization in Table 4: the MLP generalizes best,
and the decision tree worst. As suggested by our confusion matrices in Figure 14, we see
that all algorithms generalize worst to the numbers 8 and 9; this is to be expected, since
these digits are most under-represented in $X_a$. The decision tree in particular appears to
fail on these digits, mirroring the quantitative results in the confusion matrix.
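Selecting exemplars by weight ratio and counting errors on them can be sketched as below. The threshold form (a ratio of the two marginal weights), the helper name, and all numbers are assumptions for illustration; comparing `w_b` against a multiple of `w_a` avoids dividing by zero when an exemplar has no weight in one coreset.

```python
import numpy as np

def over_represented(w_num, w_den, ratio=2.0, tiny=1e-12):
    """Indices i where w_num[i] / w_den[i] > ratio, written without division."""
    return np.flatnonzero(w_num > ratio * np.maximum(w_den, tiny))

# Hypothetical marginal weights for four shared exemplars.
w_a = np.array([0.12, 0.02, 0.00, 0.05])
w_b = np.array([0.05, 0.06, 0.03, 0.05])

more_b = over_represented(w_b, w_a)   # much more representative of X_b
more_a = over_represented(w_a, w_b)   # much more representative of X_a

# Hypothetical true labels and predictions of one classifier on the exemplars.
y_true = np.array([3, 8, 9, 1])
y_pred = np.array([3, 8, 4, 1])
errors_on_b = int((y_pred[more_b] != y_true[more_b]).sum())
```

Repeating the last step for each classifier gives error counts on the exemplars that matter most for the deployment dataset.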
Figure 17. Exemplars over-represented in $X_b$, with class probabilities under three algorithms trained
on $X_a$. The true class is shown in blue; where the highest probability class differs from the true
class, the highest probability class is shown in red.
For comparison, in Figure 18 we show the points where $w_{b,i}/w_{a,i} < 0.5$, i.e., points that are
much more representative of $X_a$ than $X_b$. Note that, since our candidate set did not include
any members of $X_a$, none of these points were in our training set. Despite this, the accuracy
is high, and fairly consistent between the three classifiers (the decision tree misclassifies
two exemplars; the other two algorithms make no errors).
The dependent coreset only provides information about performance on “representative”
members of $X_b$. Since classifiers will tend to underperform on outliers, looking
only at the dependent MMD coreset does not give us a full picture of the expected
performance. We can augment our dependent MMD coreset with criticisms: points that are
poorly described by the dependent coreset. Figure 19 shows the performance of the three
algorithms on a size-20 set of criticisms for $X_b$. Note that, overall, accuracy is lower than
for the coreset, which is unsurprising, since these are outliers. However, as before, we see that the
decision tree performs worst on these criticisms (nine misclassifications), with the other
two algorithms performing slightly better (six misclassifications for the random forest,
and seven for the MLP).
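One simple way to find such points is to rank data points by the magnitude of the MMD witness function between the dataset and its coreset. The sketch below is a simplified version under stated assumptions: it omits the diversity regularizer used in MMD-critic [11], uses a fixed-bandwidth kernel, and all names and data are hypothetical.

```python
import numpy as np

def rbf(A, B, h=1.0):
    """Squared exponential kernel with (assumed) bandwidth h."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * h ** 2))

def witness(points, X, U_S, w, kernel=rbf):
    """Witness function: mean embedding of X minus that of the weighted coreset."""
    return kernel(points, X).mean(axis=1) - kernel(points, U_S) @ w

def top_criticisms(X, U_S, w, n_crit, kernel=rbf):
    """Indices of the n_crit points of X with the largest |witness| value."""
    scores = np.abs(witness(X, X, U_S, w, kernel))
    return np.argsort(-scores)[:n_crit]

# Toy data: a main cluster at the origin plus five outliers near (5, 5).
rng = np.random.default_rng(0)
main = rng.normal(scale=0.1, size=(40, 2))
outliers = 5.0 + rng.normal(scale=0.1, size=(5, 2))
X = np.vstack([main, outliers])

# A coreset that only describes the main cluster.
U_S = np.zeros((1, 2))
w = np.array([40 / 45])
crit = top_criticisms(X, U_S, w, n_crit=5)
```

In this toy example the coreset accounts for the main cluster, so the points with the largest witness values are the five outliers.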
Figure 18. Exemplars over-represented in $X_a$, with class probabilities under three algorithms trained
on $X_a$. The true class is shown in blue; where the highest probability class differs from the true
class, the highest probability class is shown in red.
Figure 19. Criticisms of $Q_b$ from the dataset $X_b$, with class probabilities under three algorithms
trained on $X_a$. The true class is shown in blue; where the highest probability class differs from
the true class, the highest probability class is shown in red.
Note that, since accuracy does not correspond to a function in an RKHS, we cannot
expect to use the coreset to bound the expected accuracy of an algorithm on the full
dataset. Indeed, while the coresets and criticisms correctly suggest that the decision tree
generalizes poorly, they do not give conclusive evidence on the relative generalization
abilities of the other two algorithms. However, they do highlight what sort of data points are
likely to be poorly modeled under each algorithm. By providing a qualitative assessment
of performance modalities and failure modes on either typical points for a dataset, or
points that are disproportionately representative of a dataset (vs. the original training set),
dependent MMD coresets allow users to identify potential generalization concerns for
further exploration.
4.4. Discussion
MMD coresets have already proven to be a useful tool for summarizing datasets and
understanding models. However, as we have shown in Section 4.2, their interpretability
wanes when used to compare related datasets. Dependent MMD coresets provide a tool
to jointly model multiple datasets using a shared set of exemplars. This shared set of
exemplars makes it easy to compare two datasets, providing an interpretable summary not
just of each dataset in isolation, but also of the difference between datasets. As such, we
believe they will prove useful in understanding related datasets, and summarizing such
collections of datasets.
In addition to facilitating understanding of data, we have also shown that dependent
MMD coresets can be used to better understand model performance. By considering the
weights associated with two different datasets, we can identify areas of domain mis-match.
By exploring performance of algorithms on such points, we can glean insights about the
ability of a model to generalize to new datasets.
In principle, dependent MMD coresets can be applied to any number of datasets.
However, as we discuss in Section 3.5, the computational cost of our algorithm will scale
cubically in the number of datapoints in the union of the datasets. This cost can be reduced
by representing each dataset with an independent coreset, either obtained by subsampling
the data or by applying a coreset selection algorithm such as [
]; however, the approxi-
mation error of this coreset would need to be incorporated into the overall approximation
error $\epsilon$.
An alternative approach might be to develop streaming algorithms for constructing
dependent MMD coresets. In the non-dependent setting, streaming algorithms such as [
allow us to construct a coreset in an online manner, at a lower computational cost than
batch algorithms. Such an approach would be particularly appealing in the case of time-
stamped data, since it would allow us to update our dependent MMD coreset to include a
new dataset.
Dependent MMD coresets are just one example of a dependent coreset that could
be constructed using this framework. Future directions include exploring dependent
analogues of other measure coresets [22].
Author Contributions: Conceptualization, S.A.W. and J.H.; methodology, software, and experiments,
S.A.W.; data curation: S.A.W. and J.H.; writing and visualization: S.A.W. and J.H. Both authors have
read and agreed to the published version of the manuscript.
Funding: This research received no external funding. Part of the work was completed while S.A.W.
was employed by CognitiveScale.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Datasets and code available at
(accessed on 22 September 2021).
Conflicts of Interest: The authors declare no conflict of interest.
Information 2021,12, 392 25 of 26
Larrazabal, A.J.; Nieto, N.; Peterson, V.; Milone, D.H.; Ferrante, E. Gender imbalance in medical imaging datasets produces
biased classifiers for computer-aided diagnosis. Proc. Natl. Acad. Sci. USA 2020,117, 12592–12594. [CrossRef] [PubMed]
Chen, I.Y.; Johansson, F.D.; Sontag, D. Why is my classifier discriminatory? In Proceedings of the 32nd International Conference
on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 3543–3554.
Buolamwini, J.; Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings
of the 1st Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; pp. 77–91.
Shankar, S.; Halpern, Y.; Breck, E.; Atwood, J.; Wilson, J.; Sculley, D. No classification without representation: Assessing
geodiversity issues in open data sets for the developing world. arXiv 2017, arXiv:1711.08536.
Alexander, R.G.; Schmidt, J.; Zelinsky, G.J. Are summary statistics enough? Evidence for the importance of shape in guiding
visual search. Vis. Cogn. 2014,22, 595–609. [CrossRef]
Lauer, T.; Cornelissen, T.H.; Draschkow, D.; Willenbockel, V.; Võ, M.L.H. The role of scene summary statistics in object recognition.
Sci. Rep. 2018,8, 14666. [CrossRef]
Kaufmann, L.; Rousseeuw, P. Clustering by means of medoids. In Statistical Data Analysis Based on the L1-Norm and Related
Methods; Springer: Berlin/Heidelberg, Germany, 1987; pp. 405–416.
8. Bien, J.; Tibshirani, R. Prototype selection for interpretable classification. Ann. Appl. Stat. 2011,5, 2403–2424. [CrossRef]
Mak, S.; Joseph, V.R. Projected support points: A new method for high-dimensional data reduction. arXiv
, arXiv:1708.06897.
10. Mak, S.; Joseph, V.R. Support points. Ann. Stat. 2018,46, 2562–2592. [CrossRef]
Kim, B.; Khanna, R.; Koyejo, O.O. Examples are not enough, learn to criticize! Criticism for interpretability. In Proceedings of the
30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2280–2288.
Wilson, D.R.; Martinez, T.R. Reduction techniques for instance-based learning algorithms. Mach. Learn.
,38, 257–286.
13. Gurumoorthy, K.S.; Dhurandhar, A.; Cecchi, G.; Aggarwal, C. Efficient data representation by selecting prototypes with importance weights. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 260–269.
14. Chen, Y.; Welling, M.; Smola, A. Super-samples from kernel herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, 8–11 July 2010; pp. 109–116.
15. Phillips, J.M.; Tai, W.M. Near-optimal coresets of kernel density estimates. Discret. Comput. Geom. 2020, 63, 867–887. [CrossRef]
16. Karnin, Z.; Liberty, E. Discrepancy, coresets, and sketches in machine learning. In Proceedings of the 32nd Conference on Learning Theory, Phoenix, AZ, USA, 25–28 June 2019; pp. 1975–1993.
17. Tai, W.M. Optimal Coreset for Gaussian Kernel Density Estimation. arXiv 2021, arXiv:2007.08031.
18. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773.
19. Pratt, K.B.; Tschapek, G. Visualizing concept drift. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 August 2003; pp. 735–740.
20. Hohman, F.; Wongsuphasawat, K.; Kery, M.B.; Patel, K. Understanding and visualizing data iteration in machine learning. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–13.
21. Agarwal, P.K.; Har-Peled, S.; Varadarajan, K.R. Approximating extent measures of points. J. ACM 2004, 51, 606–635. [CrossRef]
22. Claici, S.; Solomon, J. Wasserstein coresets for Lipschitz costs. Stat 2018, 1050, 18.
23. Müller, A. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab. 1997, 29, 429–443. [CrossRef]
24. Bach, F.; Lacoste-Julien, S.; Obozinski, G. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK, 26 June–1 July 2012.
25. Lacoste-Julien, S.; Lindsten, F.; Bach, F. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; pp. 544–552.
26. Phillips, J.M. ε-samples for kernels. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 6–8 January 2013; pp. 1622–1632.
27. Lopez-Paz, D.; Muandet, K.; Schölkopf, B.; Tolstikhin, I. Towards a learning theory of cause-effect inference. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1452–1461.
28. Feldman, D. Introduction to core-sets: An updated survey. arXiv 2020, arXiv:2011.09384.
29. Potter, M.C.; Wyble, B.; Hagmann, C.E.; McCourt, E.S. Detecting meaning in RSVP at 13 ms per picture. Atten. Percept. Psychophys. 2014, 76, 270–279. [CrossRef] [PubMed]
30. Zheng, Y.; Ou, Y.; Lex, A.; Phillips, J.M. Visualization of big spatial data using coresets for kernel density estimates. In Proceedings of the IEEE Visualization in Data Science (VDS), Phoenix, AZ, USA, 1 October 2017; pp. 23–30.
31. Kim, B.; Rudin, C.; Shah, J.A. The Bayesian case model: A generative approach for case-based reasoning and prototype classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1952–1960.
32. Aamodt, A.; Plaza, E. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Commun. 1994, 7, 39–59. [CrossRef]
33. Murdock, J.W.; Aha, D.W.; Breslow, L.A. Assessing elaborated hypotheses: An interpretive case-based reasoning approach. In Case-Based Reasoning Research and Development, Proceedings of the 5th International Conference on Case-Based Reasoning, Trondheim, Norway, 23–26 June 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 332–346.
34. MacEachern, S.N. Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science; American Statistical Association: Alexandria, VA, USA, 1999; Volume 1, pp. 50–55.
35. Quintana, F.A.; Mueller, P.; Jara, A.; MacEachern, S.N. The dependent Dirichlet process and related models. arXiv
36. De Iorio, M.; Müller, P.; Rosner, G.L.; MacEachern, S.N. An ANOVA model for dependent random measures. J. Am. Stat. Assoc. 2004, 99, 205–215. [CrossRef]
37. Dubey, A.; Hefny, A.; Williamson, S.; Xing, E.P. A nonparametric mixture model for topic modeling over time. In Proceedings of the 13th SIAM International Conference on Data Mining, Austin, TX, USA, 2–4 May 2013; pp. 530–538.
38. Garreau, D.; Jitkrittum, W.; Kanagawa, M. Large sample analysis of the median heuristic. arXiv 2017, arXiv:1707.07269.
39. Kiela, D.; Bottou, L. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 36–45.
40. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In Proceedings of the International Conference on Machine Learning, Online, 13–18 July 2020; pp. 1691–1703.
41. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020; pp. 1597–1607.
42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
43. Athalye, A.; Engstrom, L.; Ilyas, A.; Kwok, K. Synthesizing robust adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 284–293.
44. Steed, R.; Caliskan, A. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of the 4th Conference on Fairness, Accountability, and Transparency, Online, 3–10 March 2021; pp. 701–713.
45. Ginosar, S.; Rakelly, K.; Sachs, S.; Yin, B.; Efros, A.A. A century of portraits: A visual historical record of American high school yearbooks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 1–7.
46. Marcel, S.; Rodriguez, Y. Torchvision the machine-vision package of torch. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010; pp. 1485–1488.
47. Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J.W.; Wallach, H.; Daumé, H., III; Crawford, K. Datasheets for datasets. In Proceedings of the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning, Stockholm, Sweden, 13–15 July 2018.
48. Chmielinski, K.S.; Newman, S.; Taylor, M.; Joseph, J.; Thomas, K.; Yurkofsky, J.; Qiu, Y.C. The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence. In Proceedings of the NeurIPS 2020 Workshop on Dataset Curation and Security, Online, 11 December 2020.
49. Hull, J.J. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 550–554. [CrossRef]
50. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.