Article
Matthew J. Vowels*
Typical Yet Unlikely and Normally Abnormal:
The Intuition Behind High-Dimensional
Statistics
https://doi.org/10.1515/spp-2023-0028
Received August 10, 2023; accepted November 27, 2023; published online December 26, 2023
Stat Polit Pol 2024; 15(1): 87–113
*Corresponding author: Matthew J. Vowels, Institute of Psychology, University of Lausanne, Lausanne, Switzerland, E-mail: matthew.vowels@unil.ch
Abstract: Normality, in the colloquial sense, has historically been considered an
aspirational trait, synonymous with ideality. The arithmetic average and, by
extension, statistics including linear regression coefficients, have often been used to
characterize normality, and are often used as a way to summarize samples and
identify outliers. We provide intuition behind the behavior of such statistics in high
dimensions, and demonstrate that even for datasets with a relatively low number of
dimensions, data start to exhibit a number of peculiarities which become severe as
the number of dimensions increases. Whilst our main goal is to familiarize
researchers with these peculiarities, we also show that normality can be better
characterized with 'typicality', an information theoretic concept relating to entropy. An
application of typicality to both synthetic and real-world data concerning political
values reveals that in multi-dimensional space, to be 'normal' is actually to be
atypical. We briefly explore the ramifications for outlier detection, demonstrating
how typicality, in contrast with the popular Mahalanobis distance, represents a
viable method for outlier detection.
Keywords: statistics; information theory; outlier; normality; typicality
In a well known United States Air Force (USAF) experiment seeking to identify 'the
average man', Gilbert Daniels found that out of 4063 men, not a single one fell within
30 % of the arithmetic sample averages for each of ten physical dimensions simultaneously
(which included attributes such as stature, sleeve length, thigh circumference,
and so on) (Daniels 1952). This was no unlikely fluke of the sample: averages,
far from representing the most 'normal' attributes in a sample,
are actually highly abnormal in the context of multi-dimensional data. Indeed, even
though averages may provide seemingly useful baselines for comparison, it is
important and perhaps surprising to note that the chance of finding an individual
with multiple traits falling close to the average is vanishingly small, particularly as
the number of traits increases.
The arithmetic average has been used to represent normality (vis-à-vis abnormality,
in the informal/colloquial/non-statistical sense), and is often used both productively
and unproductively as a blunt way to characterize samples and outliers.
Prior commentary has highlighted the pitfalls associated with the use of the mean as
a summary statistic (Speelman and McGann 2013); the limitations, in terms of applicability
and usefulness, of parametric representations (such as the Gaussian)
when dealing with real-world phenomena (Micceri 1989; Modis 2007); and the societal
context (Comte 1976; Misztal 2002) surrounding the potentially harmful
perception of normality as a "figure of perfection to which we may progress"
(Hacking 1990, 168). Whilst these commentaries are valuable and important in
developing an awareness of what it means to use averages to characterize
humankind, they do not provide us with an alternative. They also do not discuss some of
the more technical aspects of normality in the context of multiple dimensions and
outlier detection, or explain why normality, when characterized by the arithmetic
average, is so difficult to attain in principle.¹ Furthermore, the problems associated
with the arithmetic average, which is an approximation of an expected value, extend
to other expected value based methods, such as regressions, which describe the
average value of an outcome for a particular subset of predictors. As such, it is
important that researchers familiarize themselves with the limitations of such
popular methodologies.
In this paper, our principal aim is to provide intuition for and to familiarize
researchers with the peculiar behavior of data in high dimensions. In particular, we
discuss averages in the context of multi-dimensional data, and explain why it is that
being normal is so abnormal. We touch on some of the peculiarities of multi-
dimensional spaces, such as how a high-dimensional sphere has close to zero volume,
and how high-dimensional random variables cluster in a thin annulus a finite distance
away from the mean. Whilst not the primary focus of this work, we also
consider the relevance of these phenomena to outlier detection, and suggest that
outliers should not only be considered to include datapoints which lie far from the
mean, but also those points which lie close to the mean. The intuition associated with this
phenomenon helps to explain why heterogeneity between individuals represents
such a great challenge for researchers in the social sciences – humans, as high-dimensional
phenomena, tend only to be normal insofar as they are all abnormal in
slightly different ways.
¹ One exception includes work by Kroc and Astivia (2021) for the determination of scale cutoffs.
Using information theoretic concepts, we propose an alternative way of characterizing
normality and detecting outliers, namely through the concept of 'typicality'.
We demonstrate the peculiarities as well as the proposed concepts both on
idealistic simulated data and on data from the 'Politics and Values' LISS panel
survey (Scherpenzeel and Das 2010).² Finally, we compare the outlier detection
performance of typicality with the most common alternative (based on the Mahalanobis
distance) and demonstrate it to be a viable alternative. More broadly, we
argue that if the average value in a multivariate setting is unlikely, then outlier
detection techniques should be able to identify it as such. This means updating our
working conceptualization of outliers to include not only points which lie far from
the mean (as most outlier detection methods do) but also those points which lie too
close to the mean, particularly as the dimensionality of the dataset increases.
² An anonymized github repository with the Python code used for all analyses, simulations, and plots
can be found at https://anonymous.4open.science/r/Typicality-005B.
1 Background
The notion of the mean of a Gaussian, or indeed its finite-sample estimate in the form
of the arithmetic average, as representing a 'normal person' still holds strong relevance
in society and research today. Quetelet did much to popularise the idea
(Caponi 2013; Quetelet 1835), having devised the much used but also much criticized
Body Mass Index for characterizing a person’s weight in relation to their height. The
average is used as a way to parameterize, aggregate, and compare distributions, as
well as to establish bounds for purposes of defining outliers and pathology vis-à-vis
‘normality’in individuals. Quetelet’s perspective was also shared by Comte, who
considered normality to be synonymous with harmony and perfection (Misztal 2002).
Even though it is important to recognize the societal and ethical implications of such
views, this paper is concerned with the characteristics of multivariate distributions;
in particular, those characteristics which help us understand why averages might
provide a poor representation of ‘normality’, and what we might consider as an
alternative.
In the past, researchers and commentators (including well-known figures such
as Foucault) have levied a number of critiques at the use of averages in psychology
and social science (Foucault 1984; Myers 2013; Speelman and McGann 2013; Wetherall
1996). Part of the problem is the over-imposition of the Gaussian distribution on
empirical data. The Gaussian has only two parameters, and even if the full probability
density function is given, only two pieces of information are required to specify
it – the mean (which we treat as equivalent to the arithmetic average) and the
variance. Even in univariate cases, the mean can be reductionist, draining the data of
nuance and complexity. Many of the developments in statistical methodology have
sought to increase the expressivity of statistical models and analyses in order to
account for the inherent complexity and heterogeneity associated with psychological
phenomena. For example, the family of longitudinal daily diary methods (Bolger and
Laurenceau 2013), as well as hierarchical models (Raudenbush and Bryk 2002) can be
used to capture different levels of variability associated with the data generating
process. Alternatively, other methods have sought to leverage techniques from the
engineering sciences, such as spectral analysis, in order to model dynamic fluctuations
and shared synchrony between partners over time (Gottman 1979; Vowels et al.
2018). Machine learning methods provide powerful, data-adaptive function
approximation for 'letting the data speak' (van der Laan and Rose 2011) as
well as for testing the predictive validity of psychological theories (Vowels 2021;
Yarkoni and Westfall 2017), and in the world of big data, comprehensive meta-
analyses allow us to paint complete pictures of the gardens of forking paths (Gelman
and Loken 2013; Orben and Przybylski 2019).
Multi-dimensional data exhibit a number of peculiar attributes which concern
the use of averages. Assuming one conceives of a 'normal person' as having qualities
similar to those of a 'typical person', we find that the arithmetic average diverges
from this conception rather quickly as the number of dimensions increases. The
peculiar attributes start to become apparent in surprisingly low-dimensional contexts
(as few as four variables), and become increasingly extreme as dimensionality
increases. Understanding these attributes is particularly important because the
dimensionality of datasets and analyses is increasing along with the popularity of
machine learning. For instance, a machine learning approach identifying important
predictors of relationship satisfaction incorporated upwards of 189 variables (Joel
et al. 2020), and similar research looking at sexual desire used around 100 (Joel et al.
2017; Vowels, Vowels, and Mark 2021). Assuming that high-dimensional datasets will
continue to be of interest to psychologists, researchers ought to be aware of some of
the less intuitive but notable characteristics of such data.
As we will discuss, one domain for which the mean can be especially problematic
in multi-dimensional datasets is outlier detection. In general, outlier detection
methods concern themselves with the distance that points lie from the mean. Even
methods designed to explore distances from the median are motivated by considerations
or difficulties with estimation, and are otherwise based on the assumption that
the expected value (or the estimate thereof) provides an object against which to
compare datapoints (Leys et al. 2019). Unfortunately, and as Daniels discovered for
the USAF, values close to the mean become increasingly unlikely as the number of
dimensions increases, making the mean an inappropriate reference for classifying
outliers. Indeed, in Daniels' experiment it would have been a mistake to accept
anyone close to the average as anything other than an outlier. As we describe later,
one can successfully summarise a set of datapoints in multiple dimensions in terms
of their typicality. We later evaluate the performance of a well-known multivariate
outlier method (based on the Mahalanobis distance) in terms of its capacity to
identify values far from the empirical average as outliers, and compare it against our
proposed measure of typicality.
2 Divergence from the Mean
This section is concerned with demonstrating some of the unintuitive aspects of data
in higher dimensions. We begin by showing that, as dimensionality increases, the
'distance' that a datapoint lies from the mean/average increases at a rate of √D, where
D is the number of dimensions. We then provide a discussion of the ramifications.
Finally, we briefly present an alternative geometric view that leads us to the same
conclusions.
Notation: We denote a datapoint for an individual i as x_i,
where i ∈ {1, 2, …, N}. The total number of individual datapoints is N, and the bold
font indicates that the datapoint is a vector (i.e. it is multivariate). A single dimension
d of individual i's datapoint is given as x_{i,d}, where d ∈ {1, 2, …, D} and D is the total
number of dimensions.
2.1 Gaussian Vectors in High Dimensions
Let us begin in familiar territory – for a multivariate distribution with independently
and identically distributed (i.i.d.) Gaussian variables, the probability density function
for each dimension may be expressed as:

$$p(x_{i,d}) = \frac{1}{\sqrt{2\pi\sigma_d^2}}\, e^{-\frac{(x_{i,d}-\mu_d)^2}{2\sigma_d^2}} \quad (1)$$

where μ_d and σ_d represent the mean and standard deviation of dimension d,
respectively. Each multivariate datapoint x_i may be considered as a vector in this
D-dimensional space. An example of two datapoints drawn from a two-dimensional/bivariate
version of this distribution (i.e. D = 2) is shown in Figure 1. In this figure, the
values of these two random samples are x_1 = (0.4, 0.8) and x_2 = (0.55, 0.35). Assuming
that these datapoints are drawn from a distribution with a mean of 0 and a variance
of 1 for all dimensions (i.e. N(μ_d = 0, σ²_d = 1) ∀ d), we can compute the
distance these datapoints fall from the mean μ = 0 using the squared Euclidean
distance (see Eq. (2)):

$$\|\mathbf{x}_i\|_2^2 = \sum_{d=1}^{D} x_{i,d}^2 \quad (2)$$
Here, we use the subscript d to index the dimension of the multidimensional datapoint
x_i. Importantly, note that the squared Euclidean norm closely resembles the
expression for the sample variance of a certain dimension d (Eq. (3)):

$$\widehat{\mathrm{Var}}(x_d) = \frac{1}{N}\sum_{i=1}^{N} x_{i,d}^2 \quad (3)$$

For the two example vectors in Figure 1, taking the square root of the values derived
using Eq. (2), the distances from the mean are ‖x_1‖_2 = √0.8 ≈ 0.89 and ‖x_2‖_2 = √0.425 ≈ 0.65.
In other words, the variance of a sample is closely related to the distance that
each sample is expected to fall from the mean. Note that, when computing the
variance for a particular variable or dimension, we sum across datapoints
i, rather than dimensions d. Secondly, and more importantly, the variance contains
a normalization term 1/N, whereas the expression for the norm does not. As a
consequence of this absent normalization term, the expected squared distance
of each datapoint from the mean will grow with increasing dimensionality. In
this example, we know that the variance of our distribution is σ²_d = 1 for both
dimensions, and as such, it is trivial to show that each individual dimension d will
have an average/expected squared value equal to one. Without a normalization term
(i.e. 1/D), this means that the expected squared length of the vectors grows in
proportion to the number of dimensions.³
Alternatively, taking a square root, we
can say that the expected distance that samples fall from the mean increases in
proportion to the square root of the dimensionality of the distribution. More
concretely: E[‖x‖_2] ∝ √D (Vershynin 2019).
Figure 1: Two samples in two-dimensional space, with their corresponding coordinates.
³ This is simply the result of summing more values together, without accounting for the number of
dimensions. This is intentional – to compute the overall distance a sample lies from the mean, it is
necessary to sum across all possible dimensions/directions without adjusting for the number of
dimensions.
This can of course also be verified in simulation, and Figure 2 shows both the
analytical and the sample estimates for the average length of the vectors as the
number of dimensions increases. The intervals are defined by the 1st and 99th
percentiles. Each approximation to the expectation is taken over a sample size of 200
datapoints. The dashed red curve depicts the √D relationship, and the black simulated
curve is a direct (albeit noisier, owing to the fact that this curve is simulated)
overlay. This should start to remind us of Daniels' experience when working for the
USAF – he found that out of 4063 people, not a single one of them fell within 30 % of
the mean over ten variables. Indeed, if any had done, we should consider labelling
them as outliers, in spite of the fact that most existing outlier detection methods are
only sensitive to points which lie far from the mean.
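This √D growth is easy to reproduce. The short Python sketch below (assuming NumPy; the sample size and dimensionalities are illustrative, and this is not the paper's supplementary code) compares the average norm of 200 standard Gaussian datapoints with √D for a few values of D.

```python
import numpy as np

rng = np.random.default_rng(0)

# Average Euclidean norm of standard Gaussian vectors for increasing D.
# A minimal sketch of the simulation behind Figure 2.
for D in [1, 2, 5, 10, 50, 100]:
    x = rng.standard_normal((200, D))      # 200 datapoints in D dimensions
    norms = np.linalg.norm(x, axis=1)      # distance of each point from the mean (origin)
    print(f"D={D:4d}  mean norm={norms.mean():6.2f}  sqrt(D)={np.sqrt(D):6.2f}")
```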
The implications of this are important to understand. Whilst we know that each
variable x_d has an expected value of zero and a variance of one, the expected length
of a datapoint across all D dimensions/variables grows in proportion to the square
root of the number of variables. Dieleman (2020) summarised this informally when
they observed that “if we sample lots of vectors from a 100-dimensional standard
Gaussian, and measure their radii, we will find that just over 84 % of them are
Figure 2: The red dashed curve is simply √D, whilst the black curve is a simulated estimate of the
expected lengths, calculated over 200 datapoints, for increasing dimensionality D. The blue interval
represents the 1–99 % percentiles.
between 9 and 11, and more than 99 % are between 8 and 12. Only about 0.2 % have
a radius smaller than 8!" In other words, the expected location of a datapoint in
D-dimensional space moves further and further away from the mean as the
dimensionality increases.
It can also be shown that such high-dimensional Gaussian random variables are
distributed uniformly on the (high-dimensional) sphere with a radius of √D, and
grouped in a thin annulus (Stein 2020; Vershynin 2019).⁴ The uniformity tells us that the
direction of these vectors (i.e. their location on the surface of this high-dimensional
sphere) is arbitrary, and that the squared distances, or squared radii, are Chi-squared distributed
(it is well known that the Chi-squared distribution is the distribution of the sum of
squares of D independent and identically distributed standard Gaussian variables). The
distribution of distances (vis-à-vis the squared distances) is therefore Chi-distributed.
Figure 3 compares samples from a Chi-squared distribution against the distribution
of 10,000 squared vector lengths. Altogether, this means that in high dimensions,
(a) it is unlikely to find datapoints anywhere close to the average (even though the
region close to the mean represents the one with the highest likelihood, the probability
of falling there is nonetheless negligible), (b) randomly sampled vectors are unlikely to be
correlated (of course, in expectation the correlation will be zero because the dimensions
of the Gaussian from which they were sampled are independent), and
(c) randomly sampled vectors have lengths that are close to the expected length,
which increases at a rate of √D. As such, the datapoints tend to cluster in a subspace
which lies at a fixed radius from the mean (we will later refer to this subspace as the
typical set). This is summarized graphically in Figure 4.
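The Chi-squared claim can be checked directly. The following sketch (assuming NumPy and SciPy; the annulus width of ±2 is an illustrative choice rather than a quantity from the paper) compares the empirical squared norms for D = 40 with a χ² distribution and measures how many radii fall near √D.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = 40

# Squared norms of standard Gaussian vectors should follow a chi-squared
# distribution with D degrees of freedom (cf. Figure 3).
x = rng.standard_normal((10_000, D))
sq_norms = (x ** 2).sum(axis=1)

print("mean of squared norms:", sq_norms.mean())         # close to D
print("theoretical chi2 mean:", stats.chi2(df=D).mean())  # exactly D

# Fraction of radii inside the thin 'annulus' sqrt(D) +/- 2 (illustrative width)
radii = np.sqrt(sq_norms)
inside = np.mean((radii > np.sqrt(D) - 2) & (radii < np.sqrt(D) + 2))
print("fraction of points with radius within sqrt(D) +/- 2:", inside)
```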
Figure 3: For D = 40, these histograms show the distributions of 10,000 datapoints sampled from
a χ² distribution (red) and the sums of squared distances ‖x‖²₂.
⁴ See also the Gaussian Annulus Theorem (Blum et al. 2020).
It is important that researchers understand that while the mean of such a high-dimensional
Gaussian represents the value which minimizes the sums of squared
distances (and is therefore the estimate which maximises the likelihood), the majority
of the probability mass is actually not located around this point. As such, even
though a set of values close to the mean represents the most likely in terms of its
probability of occurrence, the magnitude of this probability is negligible, and most
points fall in a space around √D away from the mean. Figure 5 depicts the lengths of
2000 vectors sampled from a 40-dimensional Gaussian – they are nowhere close to
the origin. Another way to visualize this is to plot the locations of the expected lengths
for different dimensionalities on top of the curve for N(0,1), and this is shown in
Figure 6. In terms of the implications for psychological data: datasets which involve
high numbers of variables are likely to comprise individuals who are similar only
insofar as they appear to be equally 'abnormal', at least in the sense that a univariate
characterization of normality (e.g. the mean across the dimensions) is a poor one
when used across multiple dimensions. Indeed, if an individual does possess characteristics
close to the mean or the mode across multiple dimensions, they could
reasonably be considered to be outliers. We will consider outlier detection more
closely in a later section.
2.2 An Alternative Perspective
Finally, the peculiarities of high-dimensional space are well visualized geometrically.
Figure 7 shows the generalization of a circle in 2-D inscribed within a square, to a
sphere in 3-D inscribed within a cube. Taking this further still, the cube and the
sphere can be generalized to hyper-cubes and hyper-spheres, the volumes of which
can be calculated as C_D = l^D and S_D = π^{D/2} / Γ(D/2 + 1), respectively,
where l is the side length of the hyper-cube and the hyper-sphere has unit radius.
The latter is a generalization
of the well-known expression for the volume of a sphere. The link with the previous
discussion lies in the fact that Gaussian data are spherically (or at least elliptically)
distributed. As such, an exploration of the characteristics of spheres and ellipses in
high dimensions tells us something about high-dimensional data.
Figure 4: The plot illustrates how, in high dimensions, the probability mass is located in a
thin annulus at a distance σ√D from the average (in the text, we assume σ = 1), despite the mean
representing the location which maximizes the probability density. Adapted from MacKay (1992).
Figure 7 illustrates how the ratio between the volume of a sphere and the
volume of a cube changes dramatically even though the number of dimensions has
only increased by 1. In the first case, considering the square and the circle, the ratio
between the volumes is 4/π ≈ 1.27. In the second case, considering now the cube and
the sphere, the ratio between the volumes is 8/4.19 ≈ 1.91. In other words, the
volume of the sphere represents a significantly smaller fraction of the total volume
of the cube even though only one extra dimension has been added. Figure 8 illustrates
how this pattern continues exponentially, such that the ratio between a
cube and sphere for D = 20 is over 40 million. In other words, a cube with sides of
length two has a volume which is 40 million times greater than the sphere inscribed
within it. This effect is at least partly explained by the extraordinary way the
Figure 6: This plot shows the location of the expected lengths of vectors of different dimensionality in
relation to the standard normal in one dimension. It can be seen that even at D = 5, the expected length
is over two standard deviations from the mean.
Figure 5: A scatter plot showing the lengths of 2000 vectors sampled from a 40-dimensional Gaussian.
The red line shows the average vector length, and the green intervals depict the size of the typical set for
different values of ϵ. Note that the mean (0,0) is nowhere near the distribution of norms or the typical set.
volume of a sphere changes as dimensionality increases. Figure 9 shows how the
volume of a sphere with a radius of one in D-dimensions quickly tends to zero after
passing a maximum at around D= 5. In other words, a high-dimensional sphere has
negligible volume.
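These volumes are straightforward to evaluate. The sketch below (standard library only; illustrative rather than the paper's supplementary code) computes the unit-ball volume and the cube-to-sphere ratio for a few dimensionalities, including the roughly 40-million-fold ratio at D = 20.

```python
import math

# Volume of the unit-radius D-ball and its ratio to the enclosing hypercube
# (side length 2). Illustrative sketch of the quantities behind Figures 8 and 9.
def unit_ball_volume(D: int) -> float:
    return math.pi ** (D / 2) / math.gamma(D / 2 + 1)

for D in [2, 3, 5, 10, 20]:
    sphere = unit_ball_volume(D)
    cube = 2.0 ** D
    print(f"D={D:2d}  sphere volume={sphere:10.4f}  cube/sphere ratio={cube / sphere:14.1f}")
```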
Whilst the implications of this are the same as those in the previous section, the
demonstration hopefully gives some further intuition about just how quickly the
strange effects start to occur as D is increased, at least in the case where our dimensions
are independent. In order to gain an intuition for whether these effects
translate to more realistic data (including correlated dimensions/variables), see the
analysis below. Many problems in social science involve more than just a few dimensions,
and problems which utilise 'big data' are even more susceptible to issues
relating to what is known as the curse of dimensionality.
3 Typicality: An Information Theoretic Way to
Characterize ‘Normality’
In the previous section, we described how randomly sampled vectors in high-
dimensional space tend to be located at a radius of length √D away from the mean,
and tend to be uncorrelated. This makes points close to the mean across multiple
dimensions poor examples of 'normality'. In this section we introduce the concept of
typicality from information theory, as a means to categorize whether a particular
sample or a particular set of samples is/are 'normal' or 'abnormal' (and therefore also
whether the points should be considered to be outliers).
Figure 7: Depicts a circle inscribed within a square (left) and a sphere inscribed within a cube (right).
Even though the diameters of the circle and the sphere are the same as the lengths of the sides of the
square and the cube, the ratio between the volume of the inscribed shape and the volume of the enclosing shape greatly
decreases when moving from two to three dimensions.
Figure 8: Depicts the ratio between the volume of a hypercube and the volume of a hypersphere as the
dimensionality D increases. Note the scale of the y-axis (×10⁷).
Figure 9: Shows how the volume S_D of a hypersphere changes with dimensionality D.
3.1 Entropy
Entropy describes the degree of surprise, uncertainty, or information associated
with a distribution and is computed as −∑_{i=1}^{N} p(x_i) log p(x_i), where N is the number of
datapoints in the sample distribution, x_i is a single datapoint in this distribution, and
p(x_i) is that datapoint's corresponding probability.⁵ If the entropy is low, it means the
distribution is more certain and therefore also easier to predict.
⁵ We temporarily consider the discrete random variable case for this example, but note that the
intuition holds for continuous distributions as well.
Taking a fair coin as an example, p(x = heads) = p(x = tails) = 0.5. The entropy of
this distribution is H = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1. Recall from above that entropy
describes the amount of information content – the units of entropy here are in bits.
The fact that our fair coin has 1 bit of information should therefore seem quite
reasonable – there are two equally possible outcomes and therefore one bit's worth
of information. Furthermore, because the coin is unbiased, we are unable to predict
the outcome any better than by randomly guessing. On the other hand, let's say
we have a highly biased coin whereby p(x = heads) = 0.99 and p(x = tails) = 0.01. In
this case H = −(0.99 log₂ 0.99 + 0.01 log₂ 0.01) = 0.08. The second example has a much
lower entropy because we are likely to observe heads, and this makes samples from
the distribution more predictable. As such, there is less new or surprising information
associated with samples from this distribution than there was for the case
where there was an equal chance of a head or a tail.
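The coin calculations above can be reproduced in a couple of lines; a minimal sketch (standard library only) follows.

```python
import math

# Shannon entropy (in bits) of a discrete distribution; a minimal
# illustration of the coin examples above.
def entropy_bits(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy_bits([0.5, 0.5]))    # fair coin   -> 1.0 bit
print(entropy_bits([0.99, 0.01]))  # biased coin -> ~0.08 bits
```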
According to the Asymptotic Equipartition Property, entropy can be approximated
as a sum of log probabilities of a sequence of random samples, in a manner
equivalent to how an expected value can be estimated as an average of random samples
according to the Law of Large Numbers (Cover and Thomas 2006):

$$-\frac{1}{N}\log p(x_1, x_2, \ldots, x_N) \approx -\frac{1}{N}\sum_{i=1}^{N} \log p(x_i) \approx H(x) \quad \text{for sufficiently large } N \quad (4)$$

In words, the normalized negative log of the joint probability tends towards the entropy of the
distribution. Entropy therefore gives us an alternative way to characterize
normality; but now instead of doing so using the arithmetic mean, we do so in terms
of entropy. Rather than comparing the value of a new sample against the mean or
expected value of a distribution, we can now consider the probability of observing
that sample and its relation to the entropy of the distribution.
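As a quick numerical illustration of Eq. (4) (assuming NumPy and SciPy; a sketch rather than the paper's supplementary code), the average negative log-density of a large sample from a standard Gaussian closely matches the analytic entropy derived in Appendix A.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Estimate the (differential) entropy of N(0, 1) as the average negative
# log-density of a large sample, per Eq. (4), and compare with the analytic
# value 0.5 * log2(2 * pi * e * sigma^2).
dist = stats.norm(loc=0.0, scale=1.0)
x = dist.rvs(size=100_000, random_state=rng)

estimate = -np.mean(np.log2(dist.pdf(x)))
analytic = 0.5 * np.log2(2 * np.pi * np.e * 1.0 ** 2)
print(f"AEP estimate: {estimate:.4f} bits, analytic: {analytic:.4f} bits")
```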
3.2 Defining the Typical Set
We are now ready to define the typical set. Rather than comparing datapoints against
the mean, we can compare them against the entropy of the distribution H. For a
chosen threshold ϵ, datapoints may be considered typical according to (Cover and
Thomas 2006; Dieleman 2020; MacKay 2018):
$$T = \{x : 2^{-(H+\epsilon)} \le p(x) \le 2^{-(H-\epsilon)}\} \quad (5)$$
In words, the typical set T comprises datapoints x which fall within the bounds
defined on either side of the entropy of the distribution. Datapoints whose
probability is close (where 'close' is defined according to the magnitude of ϵ) to 2^{−H}
are thereby defined as typical. Recall the thin annulus
containing most of the probability mass, illustrated in Figure 4; this annulus comprises
the typical set. Note that, because this annulus contains most of our probability
mass, the set quickly incorporates all datapoints as ϵ is increased (Cover and Thomas
2006). Note also that this typical set (at least for 'modest' values of ϵ) does not contain the
mean because, as an annulus, it cannot contain it by design (the mean falls at the
centre of the circle whose radius defines the radius of the annulus). The quantity given
in Eq. (5) can be computed for continuous Gaussian (rather than discrete) data using
the analytical forms for the entropy H of the univariate and multivariate Gaussian
provided in Appendix A, and the probability density function for a univariate or
multivariate Gaussian for p(x). This is undertaken for the outlier detection simulation
below.
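To make Eq. (5) concrete, the following minimal sketch (assuming NumPy and SciPy, a known isotropic Gaussian, and an illustrative threshold ϵ = 10; it is not the paper's supplementary code) classifies a random 40-dimensional draw as typical whilst classifying the mean vector itself as atypical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = 40

# Typicality check for an isotropic D-dimensional Gaussian, following Eq. (5):
# x is typical if 2^-(H + eps) <= p(x) <= 2^-(H - eps), with H in bits.
dist = stats.multivariate_normal(mean=np.zeros(D), cov=np.eye(D))
H = dist.entropy() / np.log(2)   # SciPy returns nats; convert to bits
eps = 10.0                        # illustrative threshold

def is_typical(x):
    log2_p = dist.logpdf(x) / np.log(2)
    return -(H + eps) <= log2_p <= -(H - eps)

print(is_typical(rng.standard_normal(D)))  # a random draw: True for most draws
print(is_typical(np.zeros(D)))             # the mean itself: False (atypical)
```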
3.3 Establishing Typicality in Practice
Even though it is arguable whether the Gaussian should be used less ubiquitously
for modeling data distributions than it currently is (Micceri 1989), one of the
strong advantages of the Gaussian is its mathematical tractability. This tractability
enables us to calculate (as opposed to estimate) quantities exactly, simply by
substituting parameter values into the equations (assuming these parameters have
themselves not been estimated). Thus, moving from a comparison of dataset values
against the average or expected value to a consideration for typicality does not
necessitate the abandonment of convenient analytic solutions. A derivation of the
(differential) entropy for a Gaussian distribution has been provided in Appendix A,
and is given in Eq. (6).
$$H(f) = \frac{1}{2}\log_2(2\pi e \sigma^2) \quad (6)$$

Note that the mean does not feature in Eq. (6) – this makes it clear that the uncertainty
or information content of a distribution is independent of its location (i.e. the mean)
in vector space.⁶ As well as being useful for categorising datapoints as typical or
atypical (or, alternatively, inliers and outliers) in practice, Eq. (6) can also be used to
understand the relationship between ϵ and the fraction of the total probability mass
that falls inside the typical set. Returning to Figure 5, which shows the lengths of 2000
vectors sampled from a 40-dimensional Gaussian, we can see that as ϵ increases, we
gradually expand the interval to cover a greater and greater proportion of the
empirical distribution. Note also that the mean, which in this plot has a location (0,0),
is a long way from any of the points and is not part of (and, by definition, cannot be
part of) the typical set.
⁶ Note that entropy is closely related to the score function (the derivative of the log likelihood) as well
as Fisher information, which is the variance of the score.
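For instance, a short sketch along the following lines (assuming NumPy and SciPy; the ϵ values are illustrative) shows the fraction of 2000 samples from a 40-dimensional Gaussian falling inside the typical set growing towards one as ϵ increases.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = 40

# Fraction of samples from an isotropic D-dimensional Gaussian that fall in
# the typical set of Eq. (5), for a few illustrative values of eps.
dist = stats.multivariate_normal(mean=np.zeros(D), cov=np.eye(D))
x = dist.rvs(size=2000, random_state=rng)
log2_p = dist.logpdf(x) / np.log(2)
H = dist.entropy() / np.log(2)

for eps in [2, 5, 10, 20]:
    inside = np.mean(np.abs(log2_p + H) <= eps)
    print(f"eps={eps:3d}: {inside:.1%} of samples are typical")
```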
4 An Example with Real-World Data
To demonstrate that these effects do not only apply to idealistic simulations, we use
the LISS longitudinal panel data, which is open access (Scherpenzeel and Das 2010).
Specifically, we use Likert-style response data from wave 1 of the Politics and Values
survey, collected between 2007 and 2008, which includes questions relating to levels
of satisfaction and confidence in science, healthcare, the economy, democracy, etc.
Given that no inference was required for these data, a simple approach was taken to
clean them: all non-Likert style data were removed, leaving 58 variables, and text-based
responses which represented the extremes of the scale were replaced with integers
(e.g. 'no confidence at all' was replaced with a 0). For the sake of demonstration, all
missing values were mean-imputed⁷ (this may not be a wise choice in practice), and
the data were standardized so that all variables had mean zero and a standard
deviation of one. In total there were 6811 respondents.
Figure 10 depicts the bivariate correlations for each pair of variables in the data.
It can be seen that there exist many non-zero correlations, which makes these data
useful in understanding the generality of our expositions above (which were
undertaken with independent and therefore uncorrelated variables). Qualitatively,
some variables were also highly non-Gaussian, which again helps us understand the
generality of the effects in multi-dimensional data. Figure 11 shows how the expected
lengths of the vectors in the LISS panel data change as an increasing number of
dimensions are used. To generate this plot, we randomly selected D variables 1000
times, where D ranged from three up to the total number of variables (58). For each of
the 1000 repetitions, we computed the Euclidean distances of each vector in the
dataset across these D variables, and then computed their average. Once the 1000
repetitions were complete, we computed the average across these repetitions to obtain
an approximation to the expectation of vector lengths in D dimensions. Finally, we
overlaid a plot of √D to ascertain how close the empirically estimated vector lengths
are to the expected lengths for a multivariate Gaussian. We also plot
the 1–99 % intervals, which are found to be quite wide, owing to the mix of weakly and
strongly correlated variables in conjunction with possible non-Gaussianity.
⁷ Across all included variables the amount of mean-imputation, on average, was 7.9 %. Note that
such imputation makes the demonstration more conservative, because it forces values to be equal to
the mean for the respective dimension.
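A minimal sketch of this resampling procedure is given below (assuming NumPy; the `data` array is a simulated stand-in for the standardized LISS matrix, so the shapes and variable names are illustrative rather than taken from the supplementary code).

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the resampling procedure described above, applied to a
# standardized data matrix of shape (n_respondents, n_variables).
# Here the array is simulated as a stand-in for the (standardized) LISS data.
data = rng.standard_normal((6811, 58))

def expected_length(data: np.ndarray, D: int, repetitions: int = 1000) -> float:
    """Average Euclidean norm over random subsets of D variables."""
    n_vars = data.shape[1]
    means = []
    for _ in range(repetitions):
        cols = rng.choice(n_vars, size=D, replace=False)  # random subset of variables
        means.append(np.linalg.norm(data[:, cols], axis=1).mean())
    return float(np.mean(means))

for D in [3, 10, 30, 58]:
    print(f"D={D:2d}  empirical E[length]={expected_length(data, D):5.2f}  sqrt(D)={np.sqrt(D):5.2f}")
```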
These results demonstrate that even for correlated, potentially non-Gaussian,
real-world data, the peculiar behavior of multi-dimensional data discussed in this
paper nonetheless occurs. For the LISS data, the expected lengths were slightly lower
than for samples from a 'clean' multivariate Gaussian, and this is likely to be due to
the correlations present in the data.⁸
Figure 10: Depicts the bivariate correlations for the LISS panel data (Scherpenzeel and Das 2010).
Figure 11: The lengths of vectors from the LISS panel data (red), for increasing D, as well as the
expected lengths for a multivariate Gaussian (blue). The LISS panel data curve includes 1–99 %
percentile intervals (Scherpenzeel and Das 2010).
⁸ For further discussion relating to this point, see Kroc and Astivia (2021).
Indeed, when estimating the entropy of these
data (and as can be seen in the supplementary code), a robust covariance matrix
estimation approach was used to account for the non-isotropic nature of the joint
distribution. More generally, non-parametric methods can be used for the estimation
of typicality, but such methods are likely to have lower sample efficiency (i.e. more
data are required for accurate estimation of entropy and typicality).
5 Moving Forward with Multivariate Outlier
Detection
Grubbs defined outliers as samples which “deviate markedly from other members of
the sample in which it occurs" (Grubbs 1969). This definition is useful to us here,
because it is not expressed in terms of distance from the mean, but in broad/general
terms. Indeed, as we have already discussed, in as few as four dimensions, points
near the mean become increasingly unlikely. This suggests that outlier methods
should not only identify points which are too far from the mean, but also those which
are too close.
Two related definitions of outliers which were noted by Leys et al. (2019) are:
“Data values that are unusually large or small compared to the other values of the
same construct”, and “Data points with large residual values.”The first is quite
similar to Grubbs’definition, identifying values as unusually large or small
(i.e. deviating markedly) with respect to other values of the same construct (i.e. with
respect to the other members of the sample in which they occur). The second defines
them with respect to the residuals of a statistical model. In other words, they are
values which lead to large discrepancies between true and predicted values. Note
that both of these definitions carry the same consequences for our work – whether we are
comparing datapoints against the rest of the sample, or comparing them against the
predictions from a statistical model designed to estimate an expected value (which is
by far the most common case in psychology and social science), the relevance of these
definitions to our discussion remains the same.
It is also, perhaps, of interest to note that our definition of outliers makes no
value judgement about whether outliers are good or bad. Indeed, depending on the
application and our research questions, outliers may represent 'golden' samples.
Consider a manufacturer interested in fabricating the perfect mechanical prototype.
Each sample may have its own unique blemishes, and our target may represent the
perfect average across all (high-dimensional) opportunities for such blemishes. In
such a case, the average represents the golden target for our manufacturer, and
identifying it necessitates outlier detection methods which recognize that values
close to the mean across many dimensions should be considered to be (in this case,
desirable) outliers, in much the same way as samples which deviate because they are
too far from the mean may also be outliers for opposite reasons.
Leys et al. (2019) provide a useful summary of options for both univariate and
multivariate outlier detection, as well as a discussion about the consequences of
outlier management decisions. Whilst their work provides an excellent introduction
to multivariate outlier detection and good practice, they do not discuss the strange
behavior of the mean in multiple dimensions, nor the impact of this behavior on
multivariate outlier detection methods which are unable to detect outliers which lie
close to the mean.
We note, as other researchers have (Leys et al. 2019), that the most common
method used for multidimensional/multivariate outlier detection in the domain of
social science is the Mahalanobis distance (Mahalanobis 1930). For a description of
the Mahalanobis distance and its application, readers are directed to work by Li et al.
(2019) and Leys et al. (2018). Briefly, the method assesses the distance of a point from
the centroid (i.e. the mean) of a cloud of points in (possibly correlated) multidi-
mensional space. The researchers note that in order to compute the distance from
putative outliers to the mean, it is first necessary to estimate the mean and covariance
whilst including those points in the estimation (Leys et al. 2013; Leys et al. 2018).
This process is somewhat problematic because if outliers are included in the
calculation being used to compute the mean and covariance, the estimation of these
quantities will themselves be biased towards these outliers, thereby reducing the
chances of correctly identifying the outliers. A proposed solution, called
the 'robust' Mahalanobis distance (Leys et al. 2018; Li et al. 2019), leverages
what is known as the Minimum Covariance Determinant (MCD), and estimates the
centroid/mean by selecting an estimate of the mean from a set of estimates derived
from different subsets of the dataset.
Unfortunately, despite the Mahalanobis distance and its robust variant being the
most commonly used multidimensional/multivariate outlier detection techniques in
social science, they suffer from the same problems as any multidimensional method
based on distances from the centroid/mean. As a consequence, they would certainly not
flag someone average in all dimensions as an outlier, even though statistically they
would represent an extremely unusual individual (it would not help Daniels with his
project, for example). It is therefore important that researchers qualify their defi-
nition of the outlying set to explicitly admit points which may fall too close to the
mean.
When using the Mahalanobis distance, one can make decisions about the set of
outliers O using the following expression:

$$O = \{x : M(x) > c\}, \quad (7)$$

where M(x) = √((x − μ̂)ᵀ S⁻¹ (x − μ̂)) is the estimated Mahalanobis distance (in
units of standard deviation) for the multivariate datapoint under consideration x,
and c is the threshold for classifying a point as an outlier. In the expression for M, μ̂ is
the estimate of the mean of the distribution and S is the estimated covariance matrix.
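A minimal sketch of computing robust Mahalanobis distances with the MCD estimator is shown below (assuming NumPy and scikit-learn, with simulated bivariate data; the threshold of three standard deviations matches the value used later in the text, but the rest is illustrative rather than the paper's supplementary code).

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)

# Robust Mahalanobis distances via the Minimum Covariance Determinant (MCD),
# applied as in Eq. (7) to simulated bivariate Gaussian data.
X = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.5], [0.5, 1.0]], size=125)

mcd = MinCovDet(random_state=0).fit(X)
M = np.sqrt(mcd.mahalanobis(X))   # mahalanobis() returns squared distances
outliers = M > 3.0                # O = {x : M(x) > c} with c = 3
print(f"{outliers.sum()} of {len(X)} points flagged as outliers")
```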
One of the benefits of the Mahalanobis based methods is that one can use them to
threshold the data based on units of standard deviations. Thinking in terms of
standard deviations is not unusual, and therefore the process of selecting outliers in
these terms leads to intuitive selection thresholds. In contrast, we see in Eq. (5)
that the threshold for determining whether a datapoint falls within the typical set T
depends on ϵ, which is not related to the standard deviation, but rather to a distance
away from the entropy.
We have already seen how typicality has the added advantage of classifying
datapoints which lie too close to the mean. In Figure 12 we show that, in low-
dimensional settings, typicality can be used to make approximately the same classification
of outliers as the Mahalanobis distance, to the extent that some datapoints
which lie far from the mean should still be classified as outliers. Of course, in practice
a choice must be made for the value of ϵ in Eq. (5), in the same way that c in
Eq. (7) must be decided.
Specifically, for Figure 12, we generated 125 points from a bivariate Gaussian with
a covariance of 0.5, and then added a set of equally spaced outlier points ranging from
negative four to positive four on the y-axis (indicated with horizontal dashes). As such,
not all these points are expected to be identified as outliers, because some of their
values lie well within the body of the distribution. They do, however, enable us to
compare at which point they are identified as outliers by the two detection methods
under comparison. Note that the subsequent estimation is done after the creation of
the complete dataset (including the outliers) using all the empirical values. Using the
robust MCD estimator mentioned above, we computed the Mahalanobis distance
(in units of standard deviation) and colored each point according to this distance. For
typicality, we used the entropy of the multivariate Gaussian, which
also takes in an estimate of the covariance (see Appendix A for the relationship
between the covariance matrix and the entropy of a Gaussian), for which we again
used the MCD method. The use of MCD for typicality arguably makes our typicality
estimator ‘robust’for the same reason that it is considered to make the Mahalanobis
distance estimation robust. The threshold for the Mahalanobis distance was set to
three standard deviations, whilst the value for the typicality threshold was set to five.
In practice, researchers may, of course, need to suitably select and justify these values.
The scatter-plot marker shapes are set according to whether the outliers were
classified as such by both methods (circles), just the Mahalanobis method (squares),
or just the typicality method (triangles). If neither method classifies a point as an
outlier, the points are set to vertical dashes (i.e. ‘inliers’). Note that there are no points
which are classified as outliers by the Mahalanobis method which are not also
classified as outliers by the typicality method. The converse is not quite true, with one
additional point (indicated with the triangle marker) being classified as an outlier by
the typicality method.⁹ Figure 12 therefore indicates that, in low dimensions,
the Mahalanobis distance performs similarly to typicality as an outlier detection method.
Figure 12: Comparison of Mahalanobis distance and typicality for outlier detection. The outliers are
generated as a vertical set of equally spaced points (indicated with horizontal dashes) ranging from
negative four to positive four on the y-axis, superimposed on a set of 125 points (indicated with vertical
dashes) drawn from a bivariate Gaussian with a covariance of 0.5. The points identified to be outliers by
both methods are indicated in circles, whilst those indicated to be outliers by the Mahalanobis or
typicality methods separately are indicated by squares or triangles, respectively. The color of the points
represents the Mahalanobis distance in units of standard deviation. The estimation of the covariance
matrices for both methods used the robust Minimum Covariance Determinant (MCD) method. Note that
there are no squares because no points were uniquely detected as outliers according to Mahalanobis
distance.
⁹ Although this classification is correct, this point lies on the limit of the cloud of true inliers, and so
in practice it would not be clear whether this would represent a useful outlier classification or not.
In Figure 13, we undertake the same task, but this time in a 20-dimensional space.
The figure shows the lengths of each of 1400 points (the lengths are used for visualisation
purposes) drawn from a 20-dimensional, isotropic Gaussian. Fifteen of
these points are manually set to fall very close to the mean/expected value of zero,
and these are the simulated outliers we wish to identify; they fall towards the
bottom of the plot. Now, in contrast to the example above, we see a large difference
between the outliers identified using the two methods. Typicality successfully
identifies all 15 true outliers as outliers, whereas MCD fails to identify any of them.
Conversely, some points which lie far from the mean (and which therefore have a low
probability of occurrence relative to the entropy of the distribution) are identified by both
MCD and typicality, although it is possible that by tweaking the thresholds one could
achieve greater overlap between the classification of these points by the two
methods.
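The essence of this comparison can be sketched as follows (assuming NumPy, SciPy, and scikit-learn; the thresholds follow the values quoted in the text, whilst other details, such as how the near-mean points are generated, are illustrative assumptions rather than the paper's exact procedure).

```python
import numpy as np
from scipy import stats
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
D = 20

# 1400 Gaussian points plus 15 'outliers' placed very close to the mean.
inliers = rng.standard_normal((1400, D))
near_mean = 0.01 * rng.standard_normal((15, D))   # illustrative scaling
X = np.vstack([inliers, near_mean])

# Robust Mahalanobis distance: flags only points far from the centroid.
mcd = MinCovDet(random_state=0).fit(X)
maha_flag = np.sqrt(mcd.mahalanobis(X)) > 3.0

# Typicality: flags points whose log-probability (in bits) is far from -H.
dist = stats.multivariate_normal(mean=mcd.location_, cov=mcd.covariance_)
log2_p = dist.logpdf(X) / np.log(2)
H = dist.entropy() / np.log(2)
typ_flag = np.abs(log2_p + H) > 5.0

print("near-mean points flagged by Mahalanobis:", maha_flag[-15:].sum())  # expected: 0
print("near-mean points flagged by typicality: ", typ_flag[-15:].sum())   # expected: 15
```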
In summary, typicality not only has a role in detecting outliers in high-dimensional
scenarios (where the outliers may include values close to the expected
value), but also performs similarly to current approaches (such as MCD) in
low-dimensional scenarios, whereas those approaches otherwise fail in high dimensions. We thus
recommend practitioners consider typicality as a viable outlier detection approach
under both low- and high-dimensional conditions, and especially in high
dimensions. To this end, researchers are encouraged to consult various commentaries
on the usage of outlier detection methods, such as the one by Leys et al.
(2019), which provides general recommendations for practice (including pre-registration).
It is notable that prior commentary does not include a discussion
of the limitations of Mahalanobis-based methods for outlier detection once the
number of dimensions increases, which serves as a reminder of how important it is
that researchers explore typicality. We recommend updating the working conceptualisation
of outliers to include those points which, in high dimensions (but as
few as 4–10 dimensions), fall too close to the mean.
Finally, we note a quite different approach to identifying outliers known as
cellwise-outlier detection (Raymaekers and Rousseeuw 2021) which can be used to
identify outliers at a more granular level (identifying not only which cases are
outliers, but which variables are responsible for this classification). Note, however,
that this line of work does not discuss the additional complications
that arise as dimensionality increases (specifically, how the
relevance of identifying unusually high or low values for certain variables might
change if one considers that values close to the average are atypical). Further work is
required to more broadly evaluate the implications of the atypicality of the mean
across other statistical approaches.
6 Conclusions
The principal goal of this work was to provide researchers with an intuition about
the behavior of data as the number of dimensions increases. Through our explora-
tion of multi-dimensional space, we have shown that the mean, far from representing
normality, actually represents abnormality, insofar as encountering a
datapoint close to the mean in datasets comprising more than a handful of dimensions
becomes incredibly unlikely, even with a large number of datapoints. In
contrast with the arithmetic average, the information theoretic quantity known as
‘typicality’provides a way to establish normality (or rather, whether a datapoint is
typical or atypical), which is particularly useful in high-dimensional regimes. Given
that researchers in psychology and social science frequently deal with multivariate
Figure 13: Comparison of Mahalanobis distance and typicality for outlier detection in 20 dimensional
space. The outliers are generated as a set of 15 points close to the expected value of 0, superimposed on
a set of 1400 points drawn from a 20 dimensional isotropic Gaussian. For visualisation purposes, this plot
shows the lengths of each point (the x-axis is simply the index of the point in the dataset). The points
identified to be outliers by both methods are indicated in circles, whilst those indicated to be outliers by
the Mahalanobis or typicality methods separately are indicated by squares or triangles, respectively. The
color of the points represents the Mahalanobis distance in units of standard deviation. The estimation of
the covariance matrices for both methods used the robust Minimum Covariance Determinant (MCD)
method. Note that there are no squares because no points were uniquely detected as outliers according
to Mahalanobis distance.
datasets, and that the peculiarities associated with multi-dimensional spaces start
occurring in relatively low dimensions (as few as four), it is important that researchers
have some awareness of the concepts presented in this paper. Indeed, the
implications of this work are particularly important to consider for researchers
concerned with policy. Policies, especially in areas like social welfare, health, and
education, are often built around statistics related to 'average' individuals. If the
representational relevance of the mean is limited, particularly in multi-dimensional
contexts, then policy design, implementation, and evaluation decisions based on this
perspective may be misdirected.
Clearly, the motivations behind the characterizations of points as either normal
or abnormal overlap strongly with those behind outlier detection. The discussion
also provides us with a good justification for updating our working definition of
‘outlier’to include points which lie unusually close to the mean. Unlike popular
multivariate outlier detection techniques such as the Mahalanobis distance, which
characterize outliers as points which lie far from the expected value of the distri-
bution, typicality additionally offers a means to detect those which are close. Whilst
such additional benefits of typicality based methods become more evident as the
dimensionality of the dataset increases (where traditional methods like Mahalanobis
distance fail) we showed that typicality also performs as one would hope/expect in
low dimensions. To show this, we finished with an evaluation of typicality for
bivariate outlier detection using a ‘robust’version of entropy using the Minimum
Covariance Determinant estimation technique, and verified via simulation that in
low-dimensions it works well as an alternative to the popular Mahalanobis distance.
In addition, it is worth noting that we used a closed-form, parametric expression
for entropy and for the probability of individual datapoints p(x) (parametric in the
sense that we assumed an underlying multivariate Gaussian with a covariance and
mean). Such a parametric approach carries the advantage of high sample efficiency,
that is, relatively few datapoints are required to adequately estimate the relevant
quantities. However, in practice one may (a) have more dimensions than datapoints,
(b) have non-Gaussian data, or (c) both. When the number of datapoints is low
compared with the number of dimensions, parametric estimators may exhibit substantial
instability (especially the estimator for the covariance). In the case where the
data cannot be justifiably parameterised by a Gaussian (or, indeed, any other
parametric distribution), semi- or non-parametric approaches can be used to estimate
the entropy and p(x). Unfortunately, there are concomitant disadvantages
regarding sample efficiency. In other words, the number of samples required for
reliable estimation goes up considerably, and this requirement is disproportionately
problematic for data with high-dimensionality. The extent to which this is a problem
depends on the combination of the specific choice of estimators, the sample size, and
the dimensionality of the data. A detailed exploration of the interplay between these
factors is beyond the scope of this work, and it should be mentioned that the point is
important regardless of whether one is intending to detect outliers, or undertaking
statistical modeling of high-dimensional data in general.
Appendix A: Differential Entropy of a Gaussian
Following Cover and Thomas (2006), the differential entropy (in bits) is defined as:

$$H(f) = -\mathbb{E}[\log_2(f(x))] = -\int_{-\infty}^{+\infty} f(x)\,\log_2 f(x)\, dx \quad (8)$$

The probability density function of the normal distribution is:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \quad (9)$$

Substituting the expression for f(x) into H(f):

$$H(f) = -\int_{-\infty}^{+\infty} f(x)\,\log_2\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right) dx \quad (10)$$

$$H(f) = -\int_{-\infty}^{+\infty} f(x)\,\log_2 e\left[\log_e\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \log_e\!\left(e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right)\right] dx \quad (11)$$

$$H(f) = -\int_{-\infty}^{+\infty} f(x)\,\log_2 e\left[-\log_e\!\left(\sqrt{2\pi\sigma^2}\right) - \frac{(x-\mu)^2}{2\sigma^2}\right] dx \quad (12)$$

$$H(f) = \log_2 e\,\log_e\!\left(\sqrt{2\pi\sigma^2}\right)\int_{-\infty}^{+\infty} f(x)\, dx + \log_2 e \int_{-\infty}^{+\infty} \frac{(x-\mu)^2}{2\sigma^2}\, f(x)\, dx \quad (13)$$

Note that:

$$\int_{-\infty}^{+\infty} f(x)\, dx = 1 \quad (14)$$

and recall that:

$$\int_{-\infty}^{+\infty} (x-\mu)^2 f(x)\, dx = \mathbb{E}[(x-\mu)^2] = \mathrm{Var}(x) = \sigma^2 \quad (15)$$

Therefore:

$$H(f) = \log_2\!\left(\sqrt{2\pi\sigma^2}\right) + \frac{\log_2 e}{2\sigma^2}\,\sigma^2 \quad (16)$$

And finally:

$$H(f) = \frac{1}{2}\log_2(2\pi e \sigma^2) \quad (17)$$
For a D-dimensional Gaussian, the derivation of the entropy is as follows:

$$H(f) = -\mathbb{E}[\log(f(\mathbf{x}))] = -\int_{-\infty}^{+\infty} f(\mathbf{x})\,\log_e f(\mathbf{x})\, d\mathbf{x}, \quad (18)$$

where the bold font indicates multidimensionality.

$$H(f) = -\mathbb{E}\left[\log\!\left((2\pi)^{-D/2}\,|S|^{-0.5}\exp\!\left(-0.5\,(\mathbf{x}-\boldsymbol{\mu})^T S^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\right)\right], \quad (19)$$

where S is the covariance matrix, T indicates the transpose, and |·| indicates the
determinant.

$$H(f) = 0.5\,D\log(2\pi) + 0.5\log|S| + 0.5\,\mathbb{E}\left[(\mathbf{x}-\boldsymbol{\mu})^T S^{-1}(\mathbf{x}-\boldsymbol{\mu})\right] \quad (20)$$

$$= 0.5\,D(1 + \log(2\pi)) + 0.5\log|S| \quad (21)$$

This last expression, which is in nats, can then be expressed in bits by multiplying by log₂ e. Note
that this derivation follows the approach provided by Gundersen (2020) and uses a
number of 'tricks' relating to the trace operator.
References
Blum, A., J. Hopcroft, and R. Kannan. 2020. Foundations of Data Science. Cambridge: Cambridge University
Press.
Bolger, N., and J. P. Laurenceau. 2013. Intensive Longitudinal Methods. New York: The Guilford Press.
Caponi, S. 2013. "Quetelet, the Average Man and Medical Knowledge." Hist Cienc Saude Manguinhos 20 (3): 830–47.
Comte, A. 1976. The Foundation of Sociology, edited by K. Thompson. London: Nelson.
Cover, T. M., and J. A. Thomas. 2006. Elements of Information Theory. New York: John Wiley and Sons Inc.
Daniels, G. 1952. "The "Average Man"?" In Technical Note 53-7, Wright Air Development Center. USAF.
Dieleman, S. 2020. Musings on Typicality. Also available at: https://benanne.github.io/2020/09/01/
typicality.html.
Foucault, M. 1984. Madness and Civilization, edited by P. Rabinow. London: Penguin Books.
Gelman, A., and E. Loken. 2013. The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem
Even when There Is No ‘fishing Expedition’or ‘p-Hacking’and the Research Hypothesis Was Posited
Ahead of Time. Also available at: http://www.stat.columbia.edu/gelman/research/unpublished/p_hacking.pdf.
Gottman, J. M. 1979. “Detecting Cyclicity in Social Interaction.”Psychological Bulletin 86 (2): 338–48.
Grubbs, F. 1969. “Procedures for Detecting Outlying Observations in Samples.”Technometrics 11 (1): 1–21.
Gundersen, G. 2020. Entropy of the Gaussian. https://gregorygundersen.com/blog/2020/09/01/gaussian-
entropy/ (accessed May 03, 2022).
Hacking, I. 1990. The Taming of Chance. Cambridge: Cambridge University Press.
Joel, S., P. Eastwick, and E. Finkel. 2017. “Is Romantic Desire Predictable? Machine Learning Applied to
Initial Romantic Attraction.”Psychological Science 28 (10): 1478–89.
Joel, S., P. Eastwick, C. Allison, and X. E. A. Arriaga. 2020. “Machine Learning Uncovers the Most Robust
Self-Report Predictors of Relationships Quality across 43 Longitudinal Couples Studies.”PNAS 117
(32): 19061–71.
Kroc, E., and O. Astivia. 2021. "The Importance of Thinking Multivariately when Selecting Subscale Cutoff
Scores.”Educational and Psychological Measurement 517–38.
Leys, C., C. Ley, O. Klein, P. Bernard, and L. Licata 2013. “Detecting Outliers: Do Not Use Standard Deviation
Around the Mean, Use Absolute Deviation Around the Median." Journal of Experimental Social
Psychology 49 (4): 764–6.
Leys, C., O. Klein, Y. Dominicy, and C. Ley. 2018. “Detecting Multivariate Outliers: Use a Robust Variant of
the Mahalanobis Distance.”Journal of Experimental Social Psychology 74: 150–6.
Leys, C., M. Delacre, D. Lakens, and C. Ley. 2019. “How to Classify, Detect, and Manage Univariate and
Multivariate Outliers, with Emphasis on Pre-registration.”International Review of Social Psychology 32
(1): 1–10.
Li, X., S. Deng, L. Lifang, and Y. Jiang. 2019. “Outlier Detection Based on Robust Mahalanobis Distance and
its Application.”Open Journal of Statistics 9 (1): 15–26.
MacKay, D. J. C. 1992. “A Practical Bayesian Framework for Backpropagation Networks.”Neural
Computation 4 (3): 448–72.
MacKay, D. J. C. 2018. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge
University Press.
Mahalanobis, P. 1930. “On Tests and Measures of Group Divergence.”Journal and Proceedings of Asiatic
Society of Bengal 26: 541–88.
Micceri, T. 1989. “The Unicorn, the Normal Curve, and Other Improbable Creatures.”Psychological Bulletin
105: 156–66.
Misztal, B. 2002. “Rethinking the Concept of Normality: The Criticism of Comte’s Theory of Normal
Existence.”Polish Sociological Review 138: 189–202.
Modis, T. 2007. “The Normal, the Natural, and the Harmonic.”Technological Forecasting and Social Change
74: 391–494.
Myers, S. 2013. “Normality in Analytic Psychology.”Behavioural Sciences (Basel, Switzerland) 3 (4): 647–61.
Orben, A., and A. Przybylski. 2019. “The Association between Adolescent Well-Being and Digital
Technology Use.”Nature Human Behaviour 3: 173–82.
Quetelet, A. 1835. Sur l’homme et le developpement de ses facultes. Paris: Fayard.
Raudenbush, S., and A. Bryk. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods.
Thousand Oaks, CA: SAGE Publications.
Raymaekers, J., and P. Rousseeuw. 2021. “Handling Cellwise Outliers by Sparse Regression and Robust
Covariance.”Journal of Data Science, Statistics, and Visualisation 1 (3): 1–30.
Scherpenzeel, A., and M. Das. 2010. Social and Behavioral Research and the Internet: Advances in Applied
Methods and Research Strategies. edited by P. E. M. Das, and L. Kaczmirek, 77–104. Boca Raton: Taylor
& Francis.
Speelman, C., and M. McGann. 2013. “How Mean Is the Mean?”Frontiers in Psychology 4 (451): 1–12.
Stein, S. 2020. Concentration properties of High-Dimensional Normal Distributions. https://stefan-stein.
github.io/posts/2020-03-07-concentration-properties-of-high-dimensional-normal-distributions/
(accessed May 03, 2022).
van der Laan, M. J., and S. Rose. 2011. Targeted Learning –Causal Inference for Observational and
Experimental Data. New York: Springer International.
Vershynin, R. 2019. High-Dimensional Probability: An Introduction with Applications in Data Science.
Cambridge: Cambridge Series in Statistical and Probabilistic Mathematics.
Vowels, M. J. 2021. “Misspecification and Unreliable Interpretations in Psychology and Social Science.”
Psychological Methods 28 (3): 507–26.
Vowels, M. J., K. Mark, L. M. Vowels, and N. Wood. 2018. “Using Spectral and Cross-Spectral Analysis to
Identify Patterns and Synchrony in Couples’Sexual Desire.”PLoS One 13 (10): e0205330.
Vowels, L. M., M. J. Vowels, and K. Mark. 2021. “Uncovering the Most Important Factors for Predicting
Sexual Desire Using Interpretable Machine Learning.”The Journal of Sexual Medicine.
Wetherall, M. 1996. Identities, Groups and Social Issues. London: SAGE Publications.
Yarkoni, T., and J. Westfall. 2017. “Choosing Prediction over Explanation in Psychology: Lessons from
Machine Learning.”Perspectives on Psychological Science 1100–22, https://doi.org/10.1177/
1745691617693393.