Conference Paper · PDF Available

# Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing

Authors: Justin Matejka and George Fitzmaurice

## Abstract

Datasets which are identical over a number of statistical properties, yet produce dissimilar graphs, are frequently used to illustrate the importance of graphical representations when exploring data. This paper presents a novel method for generating such datasets, along with several examples. Our technique varies from previous approaches in that new datasets are iteratively generated from a seed dataset through random perturbations of individual data points, and can be directed towards a desired outcome through a simulated annealing optimization strategy. Our method has the benefit of being agnostic to the particular statistical properties that are to remain constant between the datasets, and allows for control over the graphical appearance of resulting output.
Justin Matejka and George Fitzmaurice
{first.last}@autodesk.com
Figure 1. A collection of data sets produced by our technique. While different in appearance, each has the same summary statistics (mean, std. deviation, and Pearson's corr.) to 2 decimal places. (x̄ = 54.02, ȳ = 48.09, sdx = 14.52, sdy = 24.79, Pearson's r = +0.32)
INTRODUCTION
Anscombe's Quartet [1] is a set of four distinct datasets, each consisting of 11 (x, y) pairs, where each dataset produces the same summary statistics (mean, standard deviation, and correlation) while producing vastly different plots (Figure 2A). This dataset is frequently used to illustrate the importance of graphical representations when exploring data. The effectiveness of Anscombe's Quartet is not due to simply having four different data sets which generate the same statistical properties; it is that four clearly different and identifiably distinct datasets are producing the same statistical properties. Dataset I appears to follow a somewhat noisy linear model, while Dataset II is following a parabolic distribution. Dataset III appears to be strongly linear, except for a single outlier, while Dataset IV forms a vertical line with the regression thrown off by a single outlier. In contrast, Figure 2B shows a series of datasets which also share the same summary statistics as Anscombe's Quartet; however, without any obvious underlying structure to the individual datasets, this quartet is not nearly as effective at demonstrating the importance of graphical representations.
While very popular and effective for illustrating the importance of visualizations, it is not known how Anscombe came up with his datasets [5]. Our work presents a novel method for creating datasets which are identical over a range of statistical properties, yet produce dissimilar graphics. Our method differs from previous approaches by being agnostic to the particular statistical properties that are to remain constant between the datasets, while allowing for control over the graphical appearance of the resulting output.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
CHI 2017, May 06–11, 2017, Denver, CO, USA
ACM 978-1-4503-4655-9/17/05…$15.00
DOI: http://dx.doi.org/10.1145/3025453.3025912
Figure 2. (A) Anscombe's Quartet (datasets I–IV), with each dataset having the same mean, standard deviation, and correlation. (B) Four unstructured datasets, each also having the same statistical properties as those in Anscombe's Quartet.
RELATED WORK
As alluded to above, producing multiple datasets with similar statistics and dissimilar graphics was introduced by Anscombe in 1973 [1]. "Graphs in Statistical Analysis" starts by listing three notions prevalent about graphs at the time:
(1) Numerical calculations are exact, but graphs are rough;
(2) For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;
(3) Performing intricate calculations is virtuous, whereas actually looking at the data is cheating.
While one cannot argue that there is currently as much resistance towards graphical methods as when Anscombe's paper was originally published, the datasets described in the work (Figure 2A) are still effective and frequently used for introducing or reinforcing the importance of visual methods. Unfortunately, Anscombe does not report how the datasets were created, nor suggest any method to create new ones.
The first attempt at producing a generalized method for creating such datasets was published in 2007 by Chatterjee and Firat [5]. They proposed a genetic-algorithm-based approach where 1,000 random datasets were created with identical summary statistics, then combined and mutated with an objective function to maximize the "graphical dissimilarity" between the initial and final scatter plots. While the datasets produced were graphically dissimilar to the input datasets, they did not have any discernible structure in their composition. Our technique differs by providing a mechanism to direct the solutions towards a specific shape, as well as allowing for variety in the statistical measures which are to remain constant between the solutions.
Govindaraju and Haslett developed a method for regressing datasets towards their sample means while maintaining the same linear regression formula [7]. In 2009, the same authors extended their procedure to creating "cloned" datasets [8]. In addition to maintaining the same linear regression as the seed dataset, their cloned datasets also maintained the same means (but not the same standard deviations). While Chatterjee and Firat [5] wanted to create datasets as graphically dissimilar as possible, Govindaraju and Haslett's cloned datasets were designed to be visually similar, with a proposed application of confidentializing sensitive data for publication purposes. While our technique is primarily aimed at creating visually distinct datasets, by choosing appropriate statistical tests to remain constant through the iterations (such as a Kolmogorov–Smirnov test) our technique can produce datasets with similar graphical characteristics as well.
In the area of generating synthetic datasets, GraphCuisine [2] allows users to direct an evolutionary algorithm to create network graphs matching user-specified parameters. While this work looks at a similar problem, it differs in that it is focused on network graphs, is an interactive system, and allows for directly specifying characteristics of the output, while our technique looks at 1D or 2D distributions of data, is non-interactive, and perturbs the data such that the initial statistical properties are maintained throughout the process.
Finally, on the topic of using scatter plots to encode graphics, Residual (Sur)Realism [11] produces datasets with hidden images which are only revealed when appropriate statistical measures are performed. Conversely, our technique encodes graphical appearance into the data directly.
METHOD
The key insight behind our approach is that while generating a dataset from scratch to have particular statistical properties is relatively difficult, it is relatively easy to take an existing dataset, modify it slightly, and maintain (nearly) the same statistical properties. With repetition, this process creates a dataset with a different visual appearance from the original, while maintaining the same statistical properties. Further, if the modifications to the dataset are biased to move the points towards a particular goal, the resulting graph can be directed towards a particular visual appearance.
The pseudocode for the high-level algorithm is listed below. INITIAL_DS is the seed dataset from which the statistical values we wish to maintain are calculated. The PERTURB function is called at each iteration of the algorithm to modify one or more points by a small amount, in a random direction. The "small amount" is chosen from a normal distribution and is calibrated such that >95% of movements result in the statistical properties of the overall dataset remaining unchanged (to two decimal places).
Once the individual points have been moved, the FIT function is used to check if perturbing the points has increased the overall fitness of the dataset. The fitness can be calculated in a variety of ways, but for conditions where we want to coerce the dataset into a shape, fitness is calculated as the average distance of all points to the nearest point on the target shape.
The naïve approach of accepting only datasets with an improved fitness value can result in getting stuck in locally optimal solutions where other, more globally optimal solutions are possible. To mitigate this possibility, we employ a simulated annealing technique [9]:

1: current_ds ← initial_ds
2: for x iterations, do:
3:     test_ds ← PERTURB(current_ds, temp)
4:     if ISERROROK(test_ds, initial_ds):
5:         current_ds ← test_ds
6:
7: function PERTURB(ds, temp):
8:     loop:
9:         test ← MOVERANDOMPOINTS(ds)
10:        if FIT(test) > FIT(ds) or temp > RANDOM():
11:            return test

With the possible solutions generated in each iteration, simulated annealing works by always accepting solutions which improve the fitness; if the fitness is not improved, the solution may still be accepted based on the "temperature" of the simulated annealing algorithm. If a random number between 0 and 1 is less than the current temperature, the solution is accepted even though the fitness is worsened (the temp > RANDOM() condition on line 10). We found that using a quadratically-smoothed monotonic cooling schedule starting with a temperature of 0.4 and finishing with a temperature of 0.01 worked well for the sample datasets.
Once the perturbed dataset has been accepted, either through an improved fitness value or through the simulated annealing process, the perturbed dataset is compared to the initial dataset for statistical equivalence. For the examples in this paper we consider properties to be "the same" if they are equal to two decimal places. The ISERROROK function compares the statistics between the datasets, and if they are equal (to the specified number of decimal places), the result from the current iteration becomes the new current state.
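As a concrete illustration, the loop above can be sketched in Python. This is a simplified sketch, not the authors' released implementation: for brevity it maintains only the x/y means and standard deviations (omitting Pearson's correlation), uses a single-point target rather than line segments, and the step size and helper names are our own assumptions.

```python
import math
import random
import statistics

def summary_stats(points):
    """x/y mean and x/y standard deviation of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (statistics.mean(xs), statistics.mean(ys),
            statistics.stdev(xs), statistics.stdev(ys))

def is_error_ok(test_ds, initial_ds, decimals=2):
    """Statistics are considered 'the same' if equal to two decimal places."""
    return all(round(a, decimals) == round(b, decimals)
               for a, b in zip(summary_stats(test_ds), summary_stats(initial_ds)))

def fit(points, target):
    """Negated mean distance to the nearest target point (higher is better)."""
    return -sum(min(math.hypot(px - tx, py - ty) for tx, ty in target)
                for px, py in points) / len(points)

def perturb(points, target, temp, step=0.3):
    """Move one random point slightly; keep the move if fitness improves,
    or probabilistically based on the current temperature."""
    while True:
        i = random.randrange(len(points))
        x, y = points[i]
        test = list(points)
        test[i] = (x + random.gauss(0, step), y + random.gauss(0, step))
        if fit(test, target) > fit(points, target) or temp > random.random():
            return test

def anneal(initial_ds, target, iterations=1000, t_start=0.4, t_end=0.01):
    current_ds = list(initial_ds)
    for it in range(iterations):
        # quadratically-smoothed monotonic cooling from t_start down to t_end
        temp = t_end + (t_start - t_end) * (1 - it / iterations) ** 2
        test_ds = perturb(current_ds, target, temp)
        if is_error_ok(test_ds, initial_ds):  # ISERROROK gate
            current_ds = test_ds
    return current_ds
```

By construction, every accepted state passes the `is_error_ok` gate, so the returned dataset matches the seed's statistics to two decimal places no matter how far the points have drifted visually.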
Example Generated Datasets
Example 1: Coercion Towards Target Shapes
In this first example (Figure 1), each dataset contains 182 points and all datasets are equal (to two decimal places) on the "standard" summary statistics (x/y mean, x/y standard deviation, and Pearson's correlation). Each dataset was seeded with the plot in the top left. The target shapes are specified as a series of line segments, and the shapes used in this example are shown in Figure 3.
Figure 3. e initial data set (top-left), and line segment
collections used for directing the output towards speciﬁc
shapes. e results are seen in Figure 1.
With this example dataset, the algorithm ran for 200,000 iterations to achieve the final results. On a laptop computer this process took ~10 minutes. Figure 4 shows the progression of one of the datasets towards the target shape.
Figure 4. Progression of the algorithm towards a target
shape over the course of the cooling schedule.
Example 2: Alternate Statistical Measures
One benefit of our approach over previous methods is that the iterative process is agnostic to the particular statistical properties which remain constant between the datasets. In this example (Figure 5) the datasets are derived from the same initial dataset as in Example 1, but rather than being equal on the parametric properties, the datasets are equal on the non-parametric measures of x/y median, x/y interquartile range (IQR), and Spearman's rank correlation coefficient.
Figure 5. Example datasets are equal in the non-parametric statistics of x/y median (53.73, 46.21), x/y IQR (19.17, 37.92), and Spearman's rank correlation coefficient (+0.31).
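The only change needed is the tuple of statistics that ISERROROK compares. A sketch of the non-parametric variant (our own helper names; the rank function below ignores ties for simplicity, whereas a full Spearman implementation would average tied ranks):

```python
import statistics

def iqr(values):
    """Interquartile range using the statistics module's exclusive quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def rank(values):
    """1-based ranks; ties get arbitrary consecutive ranks (simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, i in enumerate(order):
        ranks[i] = position + 1.0
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the ranks."""
    rx, ry = rank(xs), rank(ys)
    mx, my = statistics.mean(rx), statistics.mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def nonparametric_stats(points, decimals=2):
    """The tuple held constant in Example 2, rounded to two decimal places."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return tuple(round(v, decimals) for v in (
        statistics.median(xs), statistics.median(ys),
        iqr(xs), iqr(ys), spearman(xs, ys)))
```

Two perturbed datasets are then considered equivalent whenever their `nonparametric_stats` tuples match.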
Example 3: Specific Initial Dataset
The previous two examples used a rather "generic" dataset of a slightly positively correlated point cloud as the starting point of the optimization. Alternatively, it is possible to begin with a very specific dataset to seed the optimization.
Figure 6. Creating a collection of datasets based on the "Datasaurus" dataset. Each dataset has the same summary statistics to two decimal places: (x̄ = 54.26, ȳ = 47.83, sdx = 16.76, sdy = 26.93, Pearson's r = −0.06).
Alberto Cairo produced a dataset called the "Datasaurus" [4]. Like Anscombe's Quartet, it serves as a reminder of the importance of visualizing your data, since, although the dataset produces "normal" summary statistics, the resulting plot is a picture of a dinosaur. In this example we use the Datasaurus as the initial dataset, and create other datasets with the same summary statistics (Figure 6).
Example 4: Simpson's Paradox
Another instrument for demonstrating the importance of visualizing your data is Simpson's Paradox [3, 10]. This paradox occurs with data sets where a trend appears when looking at individual groups in the data, but disappears or reverses when the groups are combined.
To create a dataset exhibiting Simpson's Paradox, we start with a strongly positively correlated dataset (Figure 7A), and then perturb and direct that dataset towards a series of negatively sloping lines (Figure 7B). The resulting dataset (Figure 7C) has the same positive correlation as the initial dataset when looked at as a whole, while the individual groups each have a strong negative correlation.
Figure 7. Demonstration of Simpson's Paradox. Both datasets (A and C) have the same overall Pearson's correlation of +0.81; however, after coercing the data towards the pattern of sloping lines (B), each subset of data in (C) has an individually negative correlation.
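The group-versus-whole reversal is easy to verify numerically. A small synthetic sketch (the data below is ours, constructed for illustration, not the dataset shown in Figure 7):

```python
import statistics

def pearson(xs, ys):
    """Pearson's correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# three groups lying on negatively sloping lines, with each successive
# group shifted up and to the right so the combined trend is positive
groups = [[(g * 10 + i, g * 20 - i) for i in range(5)] for g in range(3)]

per_group = [pearson([x for x, _ in g], [y for _, y in g]) for g in groups]
combined = [p for g in groups for p in g]
overall = pearson([x for x, _ in combined], [y for _, y in combined])
# every group is perfectly negatively correlated, yet overall is strongly positive
```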
Example 5: Cloned Dataset with Similar Appearance
As discussed by Govindaraju and Haslett [8], another use for datasets with the same statistical properties is the creation of "cloned" datasets to anonymize sensitive data [6]. In this case, it is important that individual data points are changed while the overall structure of the data remains similar. This can be accomplished by performing a Kolmogorov–Smirnov test within the ISERROROK function for both x and y. By only accepting solutions where both the x and y K-S statistics are <0.05 we ensure that the result will have a similar shape to the original (Figure 8). This approach has the benefit of maintaining the x/y means and correlation as accomplished in previous work [8], and additionally the x/y standard deviations as well. This could also be useful for "graphical inference" [12] to create a collection of variant plots following the same null hypothesis.
Figure 8. Example of creating a “mirror” dataset as in [8].
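The two-sample K-S statistic used in that acceptance test can be computed directly as the largest gap between the empirical CDFs of the two marginals. A sketch (function names are ours; the 0.05 threshold on the raw statistic follows the text above):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov–Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    values = sorted(set(a) | set(b))
    def ecdf(sample, v):
        return sum(1 for x in sample if x <= v) / len(sample)
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in values)

def similar_shape(test_ds, initial_ds, threshold=0.05):
    """Accept only if both marginals stay close to the seed's distribution."""
    ks_x = ks_statistic([p[0] for p in test_ds], [p[0] for p in initial_ds])
    ks_y = ks_statistic([p[1] for p in test_ds], [p[1] for p in initial_ds])
    return ks_x < threshold and ks_y < threshold
```

Folding `similar_shape` into the ISERROROK check constrains the perturbations so the clone's marginal distributions, and hence its overall appearance, track the original.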
Example 6: 1D Boxplots
To demonstrate the applicability of our approach to non-2D-scatterplot data, this example uses a 1D distribution of data as represented by a boxplot. The most common variety of boxplot, the "Tukey boxplot", presents the 1st quartile, median, and 3rd quartile values on the "box", with the "whiskers" showing the location of the furthest data points within 1.5 interquartile ranges (IQR) of the 1st and 3rd quartiles. Starting with the data in a normal distribution (Figure 9A) and perturbing the data to the left (B), right (C), edges (D, E), and arbitrary points along the range (F) while ensuring that the boxplot statistics remain constant produces the results shown in Figure 9.
Figure 9. Six data distributions, each with the same 1st
quartile, median, and 3rd quartile values, as well as equal
locations for points 1.5 IQR from the 1st and 3rd quartiles.
Each dataset produces an identical boxplot.
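For the 1D case, the tuple held constant is just the five numbers a Tukey boxplot draws. A sketch of that statistic set (we assume the statistics module's default exclusive quartile method; the paper does not specify a quartile convention):

```python
import statistics

def tukey_boxplot_stats(values):
    """The five values a Tukey boxplot displays: quartiles plus whisker ends,
    i.e. the furthest data points within 1.5 IQR of the 1st and 3rd quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_whisker = min(v for v in values if v >= q1 - 1.5 * iqr)
    upper_whisker = max(v for v in values if v <= q3 + 1.5 * iqr)
    return q1, median, q3, lower_whisker, upper_whisker
```

Any two datasets with equal `tukey_boxplot_stats` tuples render as identical boxplots, regardless of how differently their points are distributed between those landmarks.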
LIMITATIONS AND FUTURE WORK
When the source dataset and the target shape are vastly different, the produced output might not be desirable. An example is shown in Figure 10, where the data set from Figure 7A is coerced into a star. This problem can be mitigated by coercing the data towards "simpler" patterns with more coverage of the coordinate space, such as lines spanning the grid, or by pre-scaling and positioning the target shape to better align with the initial dataset.
Figure 10. Undesirable outcome (C) when coercing a strongly positively correlated dataset (A) into a star (B).
The currently implemented fitness function looks only at the position of individual points in relation to the target shape, which can result in "clumping" of data points and sparse areas on the target shape. A future improvement could consider an additional goal to "separate" the points to encourage better coverage of the target shape in the output.
The parameters chosen for the algorithm (95% success rate, quadratic cooling scheme, start/end temperatures, etc.) were found to work well, but should not be considered "optimal". Such optimization is left as future work.
The code and datasets presented in this work are available at www.autodeskresearch.com/publications/samestats.
CONCLUSION
We presented a technique for creating visually dissimilar datasets which are equal over a range of statistical properties. The outputs from our method can be used to demonstrate the importance of visualizing your data, and may serve as a starting point for new data anonymization techniques.
REFERENCES
1. Anscombe, F.J. (1973). Graphs in Statistical Analysis. The American Statistician 27, 1, 17–21.
2. Bach, B., Spritzer, A., Lutton, E., and Fekete, J.-D. (2012). Interactive Random Graph Generation with Evolutionary Algorithms. In Graph Drawing (GD 2012).
3. Blyth, C.R. (1972). On Simpson's Paradox and the Sure-Thing Principle. Journal of the American Statistical Association 67, 338, 364–366.
4. Cairo, A. Download the Datasaurus: Never trust summary statistics alone; always visualize your data. datasaurus-never-trust-summary.html.
5. Chatterjee, S. and Firat, A. (2007). Generating Data with Identical Statistics but Dissimilar Graphics. The American Statistician 61, 3, 248–254.
6. Fung, B.C.M., Wang, K., Chen, R., and Yu, P.S. (2010). Privacy-preserving Data Publishing: A Survey of Recent Developments. ACM Comput. Surv. 42, 4, 14:1–14:53.
7. Govindaraju, K. and Haslett, S.J. (2008). Illustration of regression towards the means. International Journal of Mathematical Education in Science and Technology 39, 4, 544–550.
8. Haslett, S.J. and Govindaraju, K. (2009). Cloning Data: Generating Datasets with Exactly the Same Multiple Linear Regression Fit. Australian & New Zealand Journal of Statistics 51, 4, 499–503.
9. Hwang, C.-R. Simulated annealing: Theory and applications. Acta Applicandae Mathematica 12, 1, 108–111.
10. Simpson, E.H. (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society. Series B (Methodological) 13, 2, 238–241.
11. Stefanski, L.A. (2007). Residual (Sur)Realism. The American Statistician.
12. Wickham, H., Cook, D., Hofmann, H., and Buja, A. (2010). Graphical inference for infovis. IEEE Transactions on Visualization and Computer Graphics 16, 6, 973–979.
... When we see data and the model which has been fitted in the same image we appreciate far better the quality of the model fit. A second motivation for visualisation stems from the datasaurus example of Matejka and Fitzmaurice (2017), which proves that data sets with identical summary statistics can have very different shapes. Again, by creating a scatter plot those shape differences are shown clearly, allowing the analyst to derive far greater insight than is provided by the summary statistics alone. ...
... As we saw in the previous section correlation will also have an effect, but the process of constructing appropriate data sets means that the results are not directly comparable. As Matejka and Fitzmaurice (2017) show in two dimensions, a given average correlation can produce some very different point clouds. Here we impose the normal distribution for each axis to create a central mass of points on each dimension. ...
Preprint
Full-text available
Finance is heavily influenced by data-driven decision-making. Meanwhile, our ability to comprehend the full informational content of data sets remains impeded by the tools we apply in analysis, especially where the data is high-dimensional. Presenting the Topological Data Analysis Ball Mapper algorithm this paper illuminates a new means of seeing the detail in data from data shape. With comparisons to existing approaches and illustrative examples, the value of the new tool is shown. Directions for employing Ball Mapper in practice are given and the benefits are reviewed.
... xxx). We agree in the sense that this claim applies to all statistical modeling-badly misspecified models may lead to unjustified conclusions (Anscombe, 1973;Matejka & Fitzmaurice, 2017). What the authors mean, however, is that Bayes factors are appropriate in the M-closed setting (where one of the candidate models is true), and not in the M-open setting. ...
Preprint
Full-text available
In van Doorn et al. (2021) we outlined a series of open questions concerning Bayes factors for mixed effects model comparison, with an emphasis on the impact of aggregation, the effect of measurement error, the choice of prior distributions, and the detection of interactions. Seven expert commentaries (partially) addressed these initial questions. Surprisingly perhaps, the experts disagreed (often strongly) on what is best practice—a testament to the intricacy of conducting a mixed effect model comparison. Here we provide our perspective on these comments and highlight topics that warrant further discussion. In general, we agree with many of the commentaries that in order to take full advantage of Bayesian mixed model comparison, it is important to be aware of the specific assumptions that underlie the to-be-compared models.
... Furthermore, frequently used statistical coefficients may be ill-suited to describe sample-level trends in heterogeneous samples: Some statisticians suggest that in bimodal or other mixture distributions, or distributions with outliers, the average is no useful indicator of the central tendency (Derrible & Ahmad, 2015;Wirtz & Nachtigall, 1998). A multimodal distribution can imply that correlation coefficients do not represent the relationship between two variables in the way we think they do on any level of analysis (Matejka & Fitzmaurice, 2017;Moeller, 2021). The modality of uni-, bi-, and multivariate distributions must be checked before averages or other one-size-fits-all sample coefficients can be expected to represent overall group trends in these distributions. ...
Preprint
Full-text available
Intensive longitudinal studies typically examine phenomena that vary across time, individuals, contexts, and other boundary conditions. This poses challenges to the conceptualization and identification of replicability and generalizability, which refer to the invariance of research findings across samples and contexts as crucial criteria for trustworthiness. Some of these challenges are specific to intensive longitudinal studies, others are similarly relevant for the work with other complex datasets that contain multilayered sources of variation (individuals nested in different types of activities or organizations, regions, countries, etc.)This article opens with discussing the reasons why research findings may fail to replicate. We then analyze reasons why research findings may falsely appear to be non-replicable when in fact they were as such replicable, but lacked generalizability due to heterogeneity between samples, subgroups, individuals, time points, and contexts. Following that, we propose conceptual and methodological approaches to better disentangle non-replicability from non-generalizability and to better understand the exact causes of either problem. In particular, we apply Lakatos’s proposition to examine not only whether but under what boundary conditions a theory is a useful description of the world, to the question whether and under which conditions a research finding is replicable and generalizable. Not only will that contribute to a more systematic understanding of and research on replicability and generalizability in longitudinal studies and beyond, but it will also be a contribution to what has been called the heterogeneity revolution (Bryan et al., 2021; Moeller, 2021).
... The expected values of the random variables-usually estimated as an average of several repeated observations-can be used for this purpose. However, many statisticians have claimed that summarizing data with simple statistics such as the average or the standard deviation is misleading, as very different data can still have the same statistics Matejka and Fitzmaurice (2017); Chatterjee and Firat (2007). ...
Preprint
Full-text available
Non-deterministic measurements are common in real-world scenarios: the performance of a stochastic optimization algorithm or the total reward of a reinforcement learning agent in a chaotic environment are just two examples in which unpredictable outcomes are common. These measures can be modeled as random variables and compared among each other via their expected values or more sophisticated tools such as null hypothesis statistical tests. In this paper, we propose an alternative framework to visually compare two samples according to their estimated cumulative distribution functions. First, we introduce a dominance measure for two random variables that quantifies the proportion in which the cumulative distribution function of one of the random variables scholastically dominates the other one. Then, we present a graphical method that decomposes in quantiles i) the proposed dominance measure and ii) the probability that one of the random variables takes lower values than the other. With illustrative purposes, we re-evaluate the experimentation of an already published work with the proposed methodology and we show that additional conclusions (missed by the rest of the methods) can be inferred. Additionally, the software package RVCompare was created as a convenient way of applying and experimenting with the proposed framework.
Chapter
This chapter is mainly based on the paper: Exploiting Virtual Elasticity of Production Systems for Respecting OTD—Part 1: Post-Optimality Conditions for Ergodic Order Arrivals in Fixed Capacity Regimes, AJOR, 2020, 10, 321–342.
Book
Full-text available
O livro Análises Ecológicas no R é uma contribuição para o contínuo avanço do ensino de métodos computacionais, com um foco específico em análise de dados ecológicos através da linguagem R. O livro descreve como os códigos devem ser consequências das perguntas que a pesquisa pretende responder. Essa visão tem como consequência um livro que do começo ao fim conecta teoria ecológica, métodos científicos, análises quantitativas e programação. Isso é feito de modo explícito através de exemplos claros e didáticos que apresentam contexto e dados reais, um ou mais exemplos de perguntas que poderiam ser feitas, predições relacionadas às perguntas e a teoria em questão, além das variáveis que poderiam ser utilizadas nas análises. O texto que descreve essas partes é intercalado com pedaços organizados e claros de código e gráficos, o que torna a leitura dos capítulos bastante fluida e dinâmica, principalmente para quem gosta de executar os códigos no seu computador conforme lê os capítulos. É como uma aula prática guiada.
Article
A statistical graph can offer an alternative compelling approach to statistical thinking that focuses on important concepts rather than procedural formulas. Nowadays, visualizing multidimensional/multivariate data is essential but can also be challenging. In sport analytics, the exploration and descriptive analysis of data using visualization techniques has increased in recent years to, for example, describe possible patterns and uncertainty of player performance. These visualization techniques have been used so far with different purposes by various professionals in the sport industry, such as managers, coaches, scouters, technical staff, journalists, and researchers. The abuse of graphs, such as the radar plot, and their frequent misinterpretation in the world of sports and possible implications for coaching decisions has led us to create more informative and accurate visualizations. Here, we propose new, more educational visualizations we have termed violinboxplots and enhanced radar plot for their use in the sports analytics and other fields. These allow us to visualize, besides distribution and statistical summaries, the extreme data values that can be fundamental in performance studies and allow us to benchmark.
Chapter
Full-text available
Explainable machine learning and uncertainty quantification have emerged as promising approaches to check the suitability and understand the decision process of a data-driven model, to learn new insights from data, but also to get more information about the quality of a specific observation. In particular, heatmapping techniques that indicate the sensitivity of image regions are routinely used in image analysis and interpretation. In this paper, we consider a landmark-based approach to generate heatmaps that help derive sensitivity and uncertainty information for an application in marine science to support the monitoring of whales. Single whale identification is important to monitor the migration of whales, to avoid double counting of individuals and to reach more accurate population estimates. Here, we specifically explore the use of fluke landmarks learned as attention maps for local feature extraction and without other supervision than the whale IDs. These individual fluke landmarks are then used jointly to predict the whale ID. With this model, we use several techniques to estimate the sensitivity and uncertainty as a function of the consensus level and stability of localisation among the landmarks. For our experiments, we use images of humpback whale flukes provided by the Kaggle Challenge “Humpback Whale Identification” and compare our results to those of a whale expert.
Chapter
An increasing number of model-agnostic interpretation techniques for machine learning (ML) models such as partial dependence plots (PDP), permutation feature importance (PFI) and Shapley values provide insightful model interpretations, but can lead to wrong conclusions if applied incorrectly. We highlight many general pitfalls of ML model interpretation, such as using interpretation techniques in the wrong context, interpreting models that do not generalize well, ignoring feature dependencies, interactions, uncertainty estimates and issues in high-dimensional settings, or making unjustified causal interpretations, and illustrate them with examples. We focus on pitfalls for global methods that describe the average model behavior, but many pitfalls also apply to local methods that explain individual predictions. Our paper addresses ML practitioners by raising awareness of pitfalls and identifying solutions for correct model interpretation, but also addresses ML researchers by discussing open issues for further research.
Preprint
To interpret molecular dynamics simulations of biomolecular systems, systematic dimensionality reduction methods are commonly employed. Among others, this includes principal component analysis (PCA) and time-lagged independent component analysis (TICA), which aim to maximize the variance and the timescale of the first components, respectively. A crucial first step of such an analysis is the identification of suitable and relevant input coordinates (the so-called features), such as backbone dihedral angles and interresidue distances. As typically only a small subset of those coordinates is involved in a specific biomolecular process, it is important to discard the remaining uncorrelated motions or weakly correlated noise coordinates. This is because they may exhibit large amplitudes or long timescales and therefore will erroneously be considered important by PCA and TICA, respectively. To discriminate collective motions underlying functional dynamics from uncorrelated motions, the correlation matrix of the input coordinates is block-diagonalized by a clustering method. This strategy avoids possible bias due to presumed functional observables and conformational states or variation principles that maximize variance or timescales. Considering several linear and nonlinear correlation measures and various clustering algorithms, it is shown that the combination of linear correlation and the Leiden community detection algorithm yields excellent results for all considered model systems. These include the functional motion of T4 lysozyme to demonstrate the successful identification of collective motion, as well as the folding of villin headpiece to highlight the physical interpretation of the correlated motions in terms of a functional mechanism.
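As a rough illustration of the block-diagonalization idea, correlated features can be grouped by connected components of a thresholded |correlation| graph. This is a deliberately simplified stand-in for a community-detection step such as Leiden; the data and threshold below are illustrative:

```python
import math

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def correlated_blocks(features, threshold=0.5):
    """Group features into blocks: connected components of the graph
    whose edges are pairs with |corr| > threshold."""
    k = len(features)
    adj = {i: set() for i in range(k)}
    for i in range(k):
        for j in range(i + 1, k):
            if abs(corr(features[i], features[j])) > threshold:
                adj[i].add(j)
                adj[j].add(i)
    seen, blocks = set(), []
    for i in range(k):
        if i in seen:
            continue
        stack, comp = [i], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v])
        seen |= comp
        blocks.append(sorted(comp))
    return blocks

# Features 0 and 1 move together; feature 2 is weakly correlated "noise".
f0 = [1.0, 2.0, 3.0, 4.0, 5.0]
f1 = [2.1, 3.9, 6.2, 8.0, 9.9]    # roughly 2 * f0
f2 = [5.0, -3.0, 4.0, -2.0, 1.0]  # uncorrelated wiggle
print(correlated_blocks([f0, f1, f2]))  # → [[0, 1], [2]]
```

Block [0, 1] is the candidate collective motion; singleton blocks like [2] are the noise coordinates the abstract says should be discarded before PCA or TICA.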
Conference Paper
This paper introduces an interactive system called GraphCuisine that lets users steer an Evolutionary Algorithm (EA) to create random graphs that match user-specified measures. Generating random graphs with particular characteristics is crucial for evaluating graph algorithms, layouts and visualization techniques. Current random graph generators provide limited control of the final characteristics of the graphs they generate. The situation is even harder when one wants to generate random graphs similar to a given one, all-in-all leading to a long iterative process that involves several steps of random graph generation, parameter changes, and visual inspection. Our system follows an approach based on interactive evolutionary computation. Fitting generator parameters to create graphs with pre-defined measures is an optimization problem, while assessing the quality of the resulting graphs often involves human subjective judgment. In this paper we describe the graph generation process from a user’s perspective, provide details about our evolutionary algorithm, and demonstrate how GraphCuisine is employed to generate graphs that mimic a given real-world network. An interactive demo of GraphCuisine can be found on our website http://www.aviz.fr/Research/Graphcuisine .
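A stripped-down, non-interactive version of the underlying idea — evolving random graphs toward user-specified measures — might look like the following sketch. The measures, parameters, and truncation-selection scheme here are illustrative choices, not GraphCuisine's actual algorithm:

```python
import random

def measures(edges, n):
    """Graph measures used as the fitness target: density and max degree."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    density = 2 * len(edges) / (n * (n - 1))
    return density, max(deg)

def fitness(edges, n, target):
    """Distance of a graph's measures from the target (lower is better)."""
    d, m = measures(edges, n)
    return abs(d - target[0]) + abs(m - target[1])

def mutate(edges, n, rng):
    """Flip one random potential edge on or off."""
    u, v = rng.sample(range(n), 2)
    e = (min(u, v), max(u, v))
    child = set(edges)
    child.symmetric_difference_update({e})
    return frozenset(child)

def evolve(n=12, target=(0.25, 5), pop_size=20, generations=200, seed=1):
    rng = random.Random(seed)
    pop = [frozenset() for _ in range(pop_size)]  # start from empty graphs
    for _ in range(generations):
        pop.sort(key=lambda e: fitness(e, n, target))
        survivors = pop[: pop_size // 2]          # truncation selection
        pop = survivors + [mutate(rng.choice(survivors), n, rng)
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=lambda e: fitness(e, n, target))

best = evolve()
print(measures(best, 12))
```

GraphCuisine replaces the fixed fitness target with interactive steering, where the user's visual judgment of the rendered graphs guides the evolutionary search.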
Article
The definition of second order interaction in a (2 × 2 × 2) table given by Bartlett is accepted, but it is shown by an example that the vanishing of this second order interaction does not necessarily justify the mechanical procedure of forming the three component 2 × 2 tables and testing each of these for significance by standard methods.
Book
Gaining access to high-quality data is a vital necessity in knowledge-based decision making. But data in its raw form often contains sensitive information about individuals. Providing solutions to this problem, the methods and tools of privacy-preserving data publishing enable the publication of useful information while protecting data privacy. Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques presents state-of-the-art information sharing and data integration methods that take into account privacy and data mining requirements. The first part of the book discusses the fundamentals of the field. In the second part, the authors present anonymization methods for preserving information utility for specific data mining tasks. The third part examines the privacy issues, privacy models, and anonymization methods for realistic and challenging data publishing scenarios. While the first three parts focus on anonymizing relational data, the last part studies the privacy threats, privacy models, and anonymization methods for complex data, including transaction, trajectory, social network, and textual data. This book not only explores privacy and information utility issues but also efficiency and scalability challenges. In many chapters, the authors highlight efficient and scalable methods and provide an analytical discussion to compare the strengths and weaknesses of different solutions.
Article
This paradox is the possibility of P(A ∣ B) < P(A ∣ B′) even though P(A ∣ B) ≥ P(A ∣ B′) both under the additional condition C and under the complement C′ of that condition. Details are given on why this can happen and how extreme the inequalities can be. An example shows that Savage's sure-thing principle ("If you would definitely prefer g to f, either knowing that the event C obtained, or knowing that C did not obtain, then you definitely prefer g to f.") is not applicable to alternatives f and g that involve sequential operations.
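The reversal described here (Simpson's paradox) can be reproduced with small counts; the success/total numbers below are illustrative:

```python
# Success counts / totals for treatment B vs. B' under conditions C and C'
successes = {("B", "C"):  (90, 100),   ("B'", "C"):  (800, 1000),
             ("B", "C'"): (300, 1000), ("B'", "C'"): (20, 100)}

def rate(tr, cond=None):
    """Success rate of a treatment, within one condition or pooled."""
    cells = [(s, n) for (t, c), (s, n) in successes.items()
             if t == tr and (cond is None or c == cond)]
    s = sum(x for x, _ in cells)
    n = sum(y for _, y in cells)
    return s / n

# B beats B' inside each condition...
print(rate("B", "C"), rate("B'", "C"))    # 0.9 vs 0.8
print(rate("B", "C'"), rate("B'", "C'"))  # 0.3 vs 0.2
# ...yet loses when the conditions are pooled:
print(rate("B"), rate("B'"))              # ~0.355 vs ~0.745
```

The reversal happens because B is applied mostly under the unfavorable condition C′ while B′ is applied mostly under the favorable condition C, so pooling mixes unequal baselines.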
This article presents a procedure for generating a sequence of data sets which will all yield exactly the same fitted simple linear regression equation y = a + bx. Unless rescaled, the generated data sets will have progressively smaller variability for the two variables, and the associated response and covariate will ‘regress’ towards their unconditional sample means.
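One way to realize such a sequence — a sketch, not necessarily the article's exact construction — is to contract x toward its mean and the residuals toward zero by a common factor λ; because the residuals stay orthogonal to the shrunken x, the fitted line a + bx is unchanged while the variability of both variables shrinks:

```python
def ols(x, y):
    """Ordinary least-squares fit y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def shrink(x, y, lam):
    """Contract x toward its mean and the residuals toward zero by the
    same factor lam; the OLS line is preserved exactly."""
    a, b = ols(x, y)
    mx = sum(x) / len(x)
    x2 = [mx + lam * (xi - mx) for xi in x]
    y2 = [a + b * x2i + lam * (yi - (a + b * xi))
          for x2i, xi, yi in zip(x2, x, y)]
    return x2, y2

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.5, 5.0, 4.0, 6.5]
a0, b0 = ols(x, y)
x2, y2 = shrink(x, y, 0.5)
a1, b1 = ols(x2, y2)
print(round(a0, 6) == round(a1, 6) and round(b0, 6) == round(b1, 6))  # True
```

Iterating shrink with λ < 1 produces exactly the described sequence: identical fitted equation, steadily less spread, with both variables collapsing toward their sample means.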
Article
The Anscombe dataset is popular for teaching the importance of graphics in data analysis. It consists of four datasets that have identical summary statistics (e.g., mean, standard deviation, and correlation) but dissimilar data graphics (scatterplots). In this article, we provide a general procedure to generate datasets with identical summary statistics but dissimilar graphics by using a genetic algorithm based approach.
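For mean, standard deviation, and Pearson correlation specifically, a candidate point cloud can be forced to match target statistics exactly by an orthogonalize-and-rescale step, so a search procedure (genetic or otherwise) only needs to vary the shape. A minimal sketch — the function name and toy data are my own; the target values are those from Figure 1 above:

```python
import math

def stats(v):
    """Sample mean and (population) standard deviation."""
    n = len(v)
    m = sum(v) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in v) / n)
    return m, sd

def force_stats(x, y, mx, sdx, my, sdy, r):
    """Affinely transform (x, y) so mean, std, and Pearson correlation
    equal (mx, sdx), (my, sdy), and r exactly, keeping the cloud's shape."""
    m, s = stats(x)
    xs = [(xi - m) / s for xi in x]            # standardize x
    m, _ = stats(y)
    yc = [yi - m for yi in y]                  # center y
    beta = sum(a * b for a, b in zip(xs, yc)) / len(x)  # var(xs) = 1
    e = [yi - beta * xi for xi, yi in zip(xs, yc)]      # residualize y
    _, se = stats(e)
    es = [ei / se for ei in e]                 # standardize residuals
    # Recombine with the prescribed correlation, then rescale to targets
    ys = [r * xi + math.sqrt(1 - r * r) * ei for xi, ei in zip(xs, es)]
    return ([mx + sdx * xi for xi in xs],
            [my + sdy * yi for yi in ys])

x = [1, 4, 2, 8, 5, 7, 3, 6]
y = [2, 3, 9, 4, 8, 1, 6, 5]
nx, ny = force_stats(x, y, 54.02, 14.52, 48.09, 24.79, 0.32)
```

Because the transform is exact rather than approximate, any two clouds pushed through it become "same stats, different graphs" by construction; a genetic or annealing search then only has to steer the visual appearance.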