Chapter 1
Measuring Sex Differences and Similarities
Marco Del Giudice
Abstract This chapter offers a concise, systematic introduction to quantification in
sex differences research. The chapter reviews the main methods used to measure sex
differences and similarities, including standardized distances (Cohen’sdand
Mahalanobis’D), indices of overlap, variance ratios, and tail ratios. Some less
common approaches (e.g., relative distribution methods, taxometrics) are also
reviewed and discussed. The chapter examines the strengths and limitations of
each method, considers various statistical and methodological factors that may either
inflate or deflate the size of sex differences, and discusses the available options to
minimize their influence. Other topics addressed include the effective visualization
of sex differences/similarities, and the rationale for treating sex as a binary variable
despite the complexities of sex-related identity and behavior.
Keywords Cohen’sd· Gender differences · Gender similarities · Effect size ·
Mahalanobis’D· Measurement · Meta-analysis · Sex differences · Sex similarities
Few topics in psychology can rival sex differences in their power to stir controversy
and captivate both scientists and the public. Debates in this area revolve around two
types of questions: explanatory questions about the role of social learning and
biological factors in determining patterns of sex-related behavior, and descriptive
questions about the size and variability of those effects. These questions are logically
distinct and can be addressed independently; however, throughout the history of the
discipline the answers have tended to cluster together (see Eagly & Wood, 2013;
Lippa, 2005). More often than not, researchers who emphasize socio-cognitive
factors view sex differences as small, outweighed by similarities, and
highly context-dependent. They also tend to worry that exaggerated beliefs about
the extent of sex differences and their stability may have pernicious influences on
individuals and society (e.g., Hyde, 2005; Hyde et al., 2019; Rippon et al., 2014;
Unger, 1979). Conversely, most biologically oriented scholars argue that—at least in
regard to certain traits—differences between the sexes can be large, pervasive, and
potentially universal (e.g., Buss, 1995; Davies & Shackelford, 2008; Ellis, 2011;
Geary, 2010; Schmitt, 2015). While not all scholars can be neatly placed in one of
these two "camps," the long-standing divide helps explain why measurement
and quantification are so often at the center of disputes in the field (Eagly &
Wood, 2013).
Regardless of one’s theoretical background, it is clear that future progress will
depend on our ability to quantify differences and similarities as accurately and
meaningfully as possible. Doing so requires not only the proper statistical tools,
but also awareness of the many factors that may distort empirical findings and make
them less interpretable, or even potentially misleading. Despite the importance of
these issues, the relevant literature is fragmented; as far as I know, there have been
no attempts to organize it in an accessible form. This chapter aims to fill this gap with
a concise but systematic introduction to quantification in sex differences research. I
begin with a meta-methodological note about the meaning of "sex" and "gender,"
and the rationale for treating sex as a binary variable despite the complexities of
sex-related identity and behavior (a point that necessitates a brief detour into
evolutionary biology). In the following section, I review the main approaches to
quantification, examine their strengths and limitations, and offer suggestions for
visualization. Finally, I discuss various statistical and methodological factors that
may inflate or deflate the apparent size of sex differences, and consider the available
options to minimize their influence.
1.1 Sex or Gender?
While many authors in psychology and other disciplines treat "sex" and "gender" as
synonyms (Haig, 2004), these terms have different histories and implications. The
contemporary usage of "gender" as the social and/or psychological counterpart of
biological sex was introduced in psychology by Money (1955), though Bentley
(1945) had drawn the same distinction 10 years before. Popularized by Stoller
(1968), the term was rapidly adopted by feminist scholars in the 1970s (Haig,
2004; Janssen, 2018). The motivation was to distinguish the biological characteris-
tics of males and females from the social roles, behaviors, and aspects of identity
associated with male/female labels; usually with the assumption that sociocultural
factors are more powerful and consequential than biological ones, and that psycho-
logical differences are largely or exclusively determined by socialization (e.g.,
Oakley, 1972; Unger, 1979). As many have noted over the years, the sex-gender
distinction is problematic and ultimately unworkable, which is probably why few
authors actually follow it in their writing. Not only does it suggest a clear-cut
separation between social and biological explanations; it also presupposes that one
already knows whether a certain aspect of behavior is biological or socially
constructed in order to pick the appropriate term (Deaux, 1985; Ellis et al., 2008;
Haig, 2004).
Having grown uneasy with the sex-gender distinction, some feminist scholars
have started to promote the use of the hybrid term "sex/gender" (or "gender/sex") as
a way to recognize that biological and social factors are inseparable, encourage
critical examination of the processes that lead to observable male-female differences,
and underscore the potential for plasticity (Fausto-Sterling, 2012; Hyde et al., 2019;
Jordan-Young & Rumiati, 2012; Rippon et al., 2014). Of course this is a legitimate
stance; but the new terminology has its own problems, and I suspect that the cure
would be worse than the disease. Sex/gender is often described by its proponents as a
continuum, or even a multidimensional collection of semi-independent features;
from this perspective, a person’s sex/gender may be regarded as hybrid, fluid, or
otherwise nonbinary (see, e.g., Hyde et al., 2019). Yet, the same term is also used in
the context of the distinction between males and females as groups (Jordan-Young &
Rumiati, 2012). Some authors have carried this tension to its logical conclusion and
suggested that researchers should stop using sex as a binary variable (Joel & Fausto-
Sterling, 2016). On this view, "male" and "female" should be replaced with multiple
overlapping categories, or even (multi)dimensional scores of gendered self-concepts
and attitudes (Hyde et al., 2019; Joel & Fausto-Sterling, 2016). This radical
methodological change is justified by the need to overcome the "gender binary."
However, the binary nature of sex is not an illusion to dispel but a biological reality,
as I now briefly discuss.
1.1.1 The Sex Binary
In the social sciences, sex is usually defined as a collection of traits—X/Y chromo-
somes, gonads, hormones, and genitals—that cluster together in most people but
may also occur in atypical combinations (e.g., Blakemore et al., 2009; Fausto-
Sterling, 2012; Helgeson, 2016; Joel, 2012). This definition is the basis for the
widely repeated claim that up to 2% of live births are intersex (Blackless et al.,
2000). Few researchers and commenters seem aware that the 2% figure is a gross
overestimate. To begin, correcting for inaccuracies and counting errors in the
original report brings the total frequency down to less than 0.5% (Hull, 2003).
More importantly, Blackless et al. (2000) defined intersex very broadly as individ-
uals who deviate from the "Platonic ideal" of sex dimorphism; accordingly, they
included several conditions (e.g., Klinefelter syndrome, vaginal agenesis, congenital
adrenal hyperplasia) that affect the development of sexual characters but can be
classified as "intersex" only in a loose sense (Sax, 2002). If one restricts the term to
conditions that involve a discordance between chromosomal and phenotypic sex, or
a phenotype that cannot be classified unambiguously as either male or female, the
frequency of intersex is much lower—almost certainly less than 0.02% (Sax, 2002;
see also Hull, 2003).
A deeper issue with the "patchwork" definition of sex used in the social sciences
is the lack of a functional rationale, in stark contrast with how the sexes are defined in
biology. From a biological standpoint, what distinguishes the males and females of a
species is the size of their gametes: Males produce small gametes (e.g., sperm),
females produce large gametes (e.g., eggs; Kodric-Brown & Brown, 1987).¹
Dimorphism in gamete size, or anisogamy, is the dominant pattern in multicellular organ-
isms, including animals. The evolution of two gamete types with different sizes and
roles in fertilization can be predicted from first principles, as a result of selection to
maximize the efficiency of fertilization (Lehtonen & Kokko, 2011; Lehtonen &
Parker, 2014). In turn, anisogamy generates a cascade of selective pressures for
sexually differentiated traits in morphology, development, and behavior (see Janicke
et al., 2016; Lehtonen et al., 2016; Schärer et al., 2012). The biological definition of
sex is not just one option among many, or a matter of arbitrary preference: The very
existence of differentiated males and females in a species depends on the existence of
two gamete types. Chromosomes and hormones participate in the mechanics of sex
determination and sexual differentiation, but do not play the same foundational role.
Crucially, anisogamy gives rise to a true sex binary at the species level: Even if a
given individual may fail to produce viable gametes, there are only two gamete types
with no meaningful intermediate forms (Lehtonen & Parker, 2014). This dichotomy
is functional rather than statistical, and is not challenged by the existence of intersex
conditions (regardless of their frequency), nonbinary gender identities, and other
apparent exceptions. And yet, anisogamy is rarely discussed—or even mentioned—
in the social science literature on sex and gender, with the obvious exceptions of
evolutionary psychology and anthropology.
What are the implications for research? If the sex binary is a basic biological fact,
arguments that call for rejecting it on scientific grounds (e.g., Hyde et al., 2019) lose
much of their appeal. One can speak of sex differences in descriptive terms—as I do
in this chapter—without assuming that such differences are "hardwired" or immune
from social influences. From a practical standpoint, sex as a categorical variable is
also robust to the presence of a small proportion of individuals who, for various
reasons, are not easily classified or do not align with the biological definition. This
does not mean that exceptions are unimportant, or that sex should only be viewed
through a categorical lens. For example, there are methods for ranking individuals of
both sexes along a continuum of masculinity-femininity or male-female typicality
(e.g., Lippa, 2001, 2010; Phillips et al., 2018; more on this in Sect. 1.2.1). Variations
in gender identity and sexual orientation can and should be studied in all their
complexity regardless of whether sex is a biological binary. More generally, the
existence of a well-defined sex binary is perfectly compatible with large amounts of
within-sex variation in anatomy, physiology, and behavior. Indeed, sexual selection
often amplifies individual variability in sex-related traits, and can favor the evolution
of multiple alternative phenotypes in males and females (Geary, 2010, 2015;
Taborsky & Brockmann, 2010; see also Del Giudice et al., 2018). In the remainder
of the chapter I discuss how patterns of quantitative variation between the sexes can
be measured and analyzed in detail.

¹ Species with simultaneous hermaphroditism (mostly plants and invertebrates) do not have distinct
sexes, given that any individual can produce both types of gametes at the same time.
1.2 Quantification of Sex Differences/Similarities
There are many possible ways to quantify sex differences and similarities. In this
section I review the methods that are most often employed in the literature. I then
discuss some methods that are less common but warrant a closer look, either because
of their untapped potential or because of their peculiar limitations. I also address the
question of how to visualize quantitative findings effectively and intuitively. Note
that the various methods and indices discussed in this section are not mutually
exclusive alternatives. Different indices can reveal different aspects of the
data, and may be used in combination to gain a broader perspective; other times,
one of the indices may be better suited to answer the particular question at hand. The
basic formulas are reported and explained in Table 1.1. Additional methods to deal
with more complex scenarios can be found in the cited references.
1.2.1 Common Indices of Difference/Similarity
1.2.1.1 Univariate Standardized Difference (Cohen’sd)
The standardized mean difference is by far the most common and versatile effect size
(ES) in sex differences research. Cohen’sdmeasures the distance between the male
and female means in standard deviation units (using the pooled standard deviation;
Table 1.1). Confidence intervals on dcan be calculated with exact formulas or
bootstrapped (Kelley, 2007; Kirby & Gerlanc, 2013). Here I follow the convention
of using positive dvalues to indicate higher scores in males. For example, d¼"0.50
indicates that the female mean is half a standard deviation higher than the male
mean. In two major syntheses of psychological sex differences, Hyde (2005) and
Zell et al. (2015) summarized hundreds of effect sizes from meta-analyses (see Sect.
1.3.4). They found that about 80% of the effects in the psychological literature are
smaller than d¼0.35; about 95% are smaller than d¼0.65; and only about 1–2%
are larger than d¼1.00 (absolute values, uncorrected for measurement error; the
average across domains was d¼0.21 in Zell et al., 2015). For comparison, the size
of sex differences in adult height is d¼1.63 (average across countries; Lippa, 2009).
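To make the formulas concrete, here is a minimal R sketch (simulated data; the function name is a placeholder) that computes d with the pooled standard deviation from Table 1.1 and a simple percentile bootstrap interval; exact interval methods are available in the references cited above.

```r
# Cohen's d with the pooled standard deviation (formula in Table 1.1)
cohens_d <- function(x_m, x_f) {
  n_m <- length(x_m); n_f <- length(x_f)
  s_pooled <- sqrt(((n_m - 1) * var(x_m) + (n_f - 1) * var(x_f)) / (n_m + n_f - 2))
  (mean(x_m) - mean(x_f)) / s_pooled   # positive values = higher male mean
}

set.seed(1)
males   <- rnorm(500, mean = 0.2)   # simulated scores
females <- rnorm(500, mean = 0.0)
d_hat <- cohens_d(males, females)

# Simple percentile bootstrap for the confidence interval
boot_d <- replicate(2000, cohens_d(sample(males, replace = TRUE),
                                   sample(females, replace = TRUE)))
round(c(d = d_hat, quantile(boot_d, c(0.025, 0.975))), 2)
```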
The substantive interpretation of d values is a persistent source of confusion. The
problem can be traced to Cohen (1988), who in a popular book on power analysis
offered some conventional rules of thumb for d: 0.20 for "small" effects, 0.50 for
"medium" effects, and 0.80 for "large" effects. These guidelines have been used
countless times to interpret empirical findings and evaluate their importance;
Table 1.1 Common indices for the quantification of sex differences/similarities

Standardized difference (Cohen's d; Mahalanobis' D)
  Univariate: d = (m_M − m_F) / S, with S = sqrt{[(N_M − 1)S_M² + (N_F − 1)S_F²] / (N_M + N_F − 2)}
    Cohen's d. Standardized univariate difference (distance between the M and F means).
    Convention: positive values for m_M > m_F, negative values for m_F > m_M.^a
    m_M, m_F: male/female means; S: pooled standard deviation; S_M, S_F: male/female
    standard deviations; N_M, N_F: male/female sample sizes
  Multivariate: D = sqrt[(m_M − m_F)' S⁻¹ (m_M − m_F)] = sqrt(d' R⁻¹ d)
    Mahalanobis' D. Standardized multivariate difference (unsigned distance between the M and F
    centroids along the M-F axis).^a
    m_M, m_F: vectors of male/female means; d: vector of d values; S: pooled covariance matrix;
    R: pooled correlation matrix

Bias-corrected small-sample variants
  Univariate: d_u = g = d [1 − 3 / (4(N_M + N_F − 2) − 1)]
    Small-sample variant of d corrected for bias (approximate formula); also known as Hedges' g
  Multivariate: D_u = sqrt( max{0, [(N_M + N_F − k − 3) / (N_M + N_F − 2)] D² − k(N_M + N_F) / (N_M N_F)} )
    Small-sample variant of D corrected for bias. k: number of variables

Overlapping coefficient (OVL)
  Univariate: OVL = 2Φ(−|d|/2); Multivariate: OVL = 2Φ(−D/2)
    Proportion of overlap relative to a single distribution.^a,b
    Φ(·): normal cumulative distribution function (CDF)

Overlap relative to the joint distribution (OVL2)
  OVL2 = OVL / (2 − OVL) = 1 − U1 (same conversion in the univariate and multivariate case)
    Proportion of overlap relative to the joint distribution^a,b

Coefficient of nonoverlap (U1)
  U1 = 1 − OVL / (2 − OVL) = 1 − OVL2
    Proportion of nonoverlap relative to the joint distribution^a,b

Cohen's U3
  Univariate: U3 = Φ(|d|); Multivariate: U3 = Φ(D)
    Univariate: proportion of individuals in the group with the higher mean who exceed the median
    individual of the other group.^a,b Multivariate: proportion of males who are more male-typical
    than the median female (= proportion of females who are more female-typical than the median
    male)^a,b

Common language effect size (CL)
  Univariate: CL = Φ(|d| / √2); Multivariate: CL = Φ(D / √2)
    Univariate: probability that a randomly picked individual from the group with the higher mean
    will exceed a randomly picked individual from the other group.^a,b Multivariate: probability that
    a randomly picked male will be more male-typical than a randomly picked female (= probability
    that a randomly picked female will be more female-typical than a randomly picked male)^a,b

Probability of correct classification (PCC; predictive accuracy)
  Univariate: PCC = Φ(|d|/2); Multivariate: PCC = Φ(D/2)
    Univariate: probability of correctly classifying a randomly picked individual as male or female
    with d/2 as the decision threshold.^a–c Multivariate: probability of correctly classifying a
    randomly picked individual as male or female with linear discriminant analysis^a–c

Eta squared (η²)
  Univariate: η² = d² / (d² + 4); Multivariate: η² = D² / (D² + 4)
    Proportion of variance (univariate) or generalized variance (multivariate) explained by sex^a–c

Variance ratio (VR)
  Univariate: VR = S_M² / S_F²; Multivariate: VR = |S_M| / |S_F|
    Male:female variance ratio (univariate) and generalized variance ratio (multivariate).
    S_M, S_F: male/female covariance matrices

Tail ratio (TR)
  Univariate: TR_zSD = Φ(d − z) / Φ(−z); Multivariate: TR_zSD = Φ(D − z) / Φ(−z)
    Univariate: relative proportion of males:females in the region located z standard deviations
    above the female mean (use −d for the relative proportion of females:males in the region
    located z standard deviations above the male mean).^a–c Multivariate: relative proportion of
    males:females in the region located z standard deviations from the female centroid in the
    male-typical direction (= relative proportion of females:males in the region located z standard
    deviations from the male centroid in the female-typical direction)^a–c

^a The formula assumes equality of variances (univariate case) or covariance matrices (multivariate case) in the population
^b The formula assumes (multivariate) normality in the population
^c The formula assumes equal group sizes (i.e., equal proportions of males and females)
unfortunately, this includes the influential papers by Hyde (2005, 2014) and Zell
et al. (2015). The irony is that Cohen did not intend these numbers as benchmarks to
evaluate effect sizes in empirical data, but only as reasonable guesses to use when
behavioral scientists want to perform a priori power analysis but have no information
about the likely size of the effect.² In fact, what counts as "small" or "large" depends
entirely on the area of research, the variables under consideration, and the goals of a
particular study (Hill et al., 2008; Vacha-Haase & Thompson, 2004). To give just a
few examples: A "small" effect can be quite consequential if the phenomenon of
interest happens in the tails of the distribution, where average differences are
amplified (Sect. 1.2.1.8). Further, the apparent size of an effect can be diminished
by measurement error: when measures are contaminated by high levels of noise,
differences may appear much smaller than they actually are (Sect. 1.3.3). Even a
difference that is genuinely small from a practical standpoint can have significant
theoretical implications if rival hypotheses predict no difference at all. In this
context, the practice of labeling differences as trivial if they fall below an arbitrary
threshold such as d = 0.10 (Hyde, 2005, 2014) is especially troubling.³
effects that are “large”by Cohen’s standards can be nearly useless if one needs to
make highly accurate predictions or classifications; to illustrate, d¼0.80 implies a
predictive accuracy of about 66%, which is better than chance but may be too low in
some applied contexts (see Sect. 1.2.1.5). Also, a conventionally “large”effect may
be comparatively small if the other effects in the same domain are consistently
larger. This is not just the case for Cohen’sd: The same principle applies to all the
effect sizes discussed in this chapter. The idea that the practical importance of an
effect can be determined mechanically using fixed conventional guidelines is tempt-
ing, but deeply misguided.
1.2.1.2 Multivariate Standardized Difference (Mahalanobis’D)
Univariate differences are important, but there are situations in which they may
easily miss the forest for the trees. Many psychological constructs are intrinsically
multidimensional, from personality and cognitive ability to occupational prefer-
ences. When investigators are interested in global sex differences within a certain
domain, univariate differences calculated for individual variables can be relatively
uninformative (or even positively misleading if they are simply averaged together;
see Del Giudice, 2009). The reason is that relatively small differences across
multiple dimensions can add up to a substantial overall difference. Moreover, the
exact way in which multiple variables combine into a global effect size depends on
the sign and size of their mutual correlations, and cannot be judged by simply
looking at univariate effects. Sex differences in facial morphology nicely illustrate
this point (Fig. 1.1a). On average, men and women differ in individual anatomical
features such as mouth width, forehead height, and eye size; but univariate
differences in each of those features (mostly below d = 1.00) are too small to
accurately distinguish between the sexes. However, the combination of multiple
features yields two clearly distinct clusters of male vs. female faces, to the point
where observers can correctly determine sex from pictures with more than 95%
accuracy (Bruce et al., 1993; see Del Giudice, 2013).

Fig. 1.1 Sex differences in facial morphology. (a) Composite male and female faces (averages of
24 pictures each). (b) The continuum of male-female typicality in facial features. The figure shows a
sequence of morphed faces, from 100% female to 100% male. Adapted with permission from
Rhodes et al. (2004). Copyright 2004 by Elsevier Ltd.
The natural metric for measuring global sex differences across multiple variables
is Mahalanobis' D, the multivariate generalization of Cohen's d (Huberty, 2005;
Olejnik & Algina, 2000; Table 1.1). The value of D is the distance between the
centroids (multivariate means) of the male and female distributions, relative to the
standard deviation along the axis that connects the centroids. Figure 1.2 illustrates
the geometric meaning of D in the case of two variables (for more details see Del
Giudice, 2009). The interpretation of D is essentially the same as that of d, with the
difference that D is unsigned and cannot take negative values (reflecting the multi-
variate nature of the comparison). Confidence intervals for D can be obtained with
bootstrapping (Kelley, 2005; Hess et al., 2007) or with exact methods, which
unfortunately are not always applicable (see Reiser, 2001; Zou, 2007). Procedures
for obtaining a pooled correlation matrix are discussed in Furlow and Beretvas
(2005). Simple R functions to calculate D with confidence intervals, corrections
for bias and measurement error (Sect. 1.3), heterogeneity statistics (see below), and
other diagnostics and effect sizes are available at
https://doi.org/10.6084/m9.figshare.7934942.
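As a bare-bones illustration of the Table 1.1 formula (not the published R functions linked above), D can be computed from the pooled covariance matrix of simulated data as follows; all names and values are placeholders.

```r
# Mahalanobis' D between two groups, from the pooled covariance matrix (Table 1.1)
library(MASS)   # for mvrnorm
set.seed(1)

Sigma   <- matrix(c(1, 0.4, 0.4, 1), 2, 2)           # common covariance matrix
males   <- mvrnorm(400, mu = c(0.3, 0.5), Sigma)     # two correlated variables
females <- mvrnorm(400, mu = c(0.0, 0.0), Sigma)

n_m <- nrow(males); n_f <- nrow(females)
S_pooled  <- ((n_m - 1) * cov(males) + (n_f - 1) * cov(females)) / (n_m + n_f - 2)
mean_diff <- colMeans(males) - colMeans(females)

D <- drop(sqrt(t(mean_diff) %*% solve(S_pooled) %*% mean_diff))
D
```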
Fig. 1.2 Illustration of Mahalanobis’distance (D) in the bivariate case. Dis the standardized
distance between the male and female centroids in the bivariate space, taking the correlation
between variables into account. (If the variables are uncorrelated, Dreduces to the Euclidean
distance.) Note that the distributions in the figure are bivariate normal with equal covariance
matrices. The axis that connects the male and female centroids can be interpreted as a dimension
of male-female typicality or “masculinity-femininity”(M-F) with respect to the relevant variables.
Univariate differences are represented as d
1
and d
2
1 Measuring Sex Differences and Similarities 9
M-F axis
_____________
The axis connecting the centroids summarizes the differences between males and
females across the entire set of variables, and can be conveniently interpreted as an
overall dimension of male-female typicality or masculinity-femininity (M-F) in the
domain described by those variables.⁴ To illustrate: In the case of facial morphology,
the M-F axis would represent a continuum of male-female typicality like the one
shown in Fig. 1.1b.⁵ This continuum summarizes the combination of anatomical
features that make a particular face male- or female-typical. Depending on the size of
D, the male and female distributions may overlap substantially along the continuum
or form largely separate clumps (as in Fig. 1.2). Individual scores on the M-F axis are
closely related to the gender diagnosticity index proposed by Lippa and Connelly
(1990). Gender diagnosticity is the probability that a given individual is male (or,
symmetrically, female), estimated with linear discriminant analysis from a set of
sexually differentiated variables (e.g., preferences for various occupations or activities).
This probability can be used as an index of masculinity-femininity, and is a
function of an individual's position along the M-F axis.

⁴ Except in special cases, the M-F axis does not coincide with the discriminant axis. However, the
position of an individual point along the M-F axis (i.e., its projection onto the M-F axis in the
direction of the classification boundary) is equivalent to its position along the discriminant axis.
Thus, scores on the M-F axis provide the same information as discriminant scores.

⁵ In this case, "male-female typicality" is arguably preferable to "masculinity-femininity": studies
have shown that when observers make judgements of facial masculinity, they rely on facial cues of
body size in addition to sexually dimorphic features (Holzleitner et al., 2014; Mitteroecker et al.,
2015).
In sum, D is a convenient index for multivariate differences that generalizes
Cohen's d and has the same substantive interpretation. Oddly, D has been
overlooked for decades as a possible measure of group differences (e.g., Huberty,
2002; Vacha-Haase & Thompson, 2004). While D has been occasionally discussed
as an effect size (Hess et al., 2007; Olejnik & Algina, 2000; Sapp et al., 2007), it has
not been used in sex differences research until very recently. An instrumental role in
the "rediscovery" of D was played by a large-scale analysis of sex differences in
personality I performed with my colleagues (Del Giudice et al., 2012), as part of a
series of papers on multivariate effect sizes (Del Giudice, 2009, 2013, 2017, 2018).
Initial applications of D have shown much larger sex differences than previously
expected, in domains ranging from personality (D = 2.71 in Del Giudice et al., 2012;
D = 2.10 in Kaiser et al., 2020; uncorrected average D = 1.12 in Mac Giolla &
Kajonius, 2019; uncorrected average D = 1.24 in Lee & Ashton, 2020) and
vocational interests (D = 1.61 in Morris, 2016) to mate preferences (average
D = 2.41 in Conroy-Beam et al., 2015). For comparison, the size of multivariate
sex differences in facial morphology is about D = 3.20 (Hennessy et al., 2005).
An alternative approach followed by some investigators is to combine multiple
sex-differentiated variables (e.g., personality items) into a summary score, usually
by adding or averaging them together. This method approximates the M-F dimension
with a single composite variable; accordingly, effect sizes in these studies are
larger than typical univariate differences but smaller than the differences found with
D in the same domains (e.g., d = 1.41 for vocational interests in Lippa, 2010;
d = 1.09 for personality in Verweij et al., 2016). In a recent paper, Phillips et al.
(2018) employed a hybrid method to obtain individual "sex differentiation" scores
from brain structure data.⁶ First, they computed a differentiation index for each brain
feature, based on the ratio of the probability densities in males and females
(an approach that is conceptually similar to gender diagnosticity). They then selected
a subset of features showing sizable sex differences and averaged them into a
summary score. The effect size for this differentiation score was about d = 1.80.⁷
Depending on how they are constructed, summary scores can be less prone to
overfitting the sample data than D (see Sect. 1.3.2); at the same time, they discard
information about the correlation structure of the variables and tend to underestimate
the overall effect. Note that systematic variation in effect sizes across studies may
depend on several factors, from differences in the reference populations (e.g., cross-
cultural or age-related effects) to the methods employed to correct for measurement
error and other artifacts (more on this in Sect. 1.3.3).

⁶ Of note, Phillips et al. (2018) framed their study as a demonstration that "the sex of the human
brain can be conceptualized along a continuum rather than as binary" (emphasis added). But this is
not what they did: the correlations between sex differentiation scores and other variables were
calculated within each sex, meaning that sex was treated as a binary variable and implicitly
"controlled for" by analyzing males and females separately.

⁷ The paper did not report descriptive statistics for the differentiation score; unfortunately, the raw
data were not available for reanalysis (Owen R. Phillips, personal communication, November
2, 2018). I extracted frequencies and central bin values from the histogram in Figure 1 of Phillips
et al. (2018) with ImageJ 1.50 (Schneider et al., 2012), and used them to recover approximate
sample statistics (females: M = −0.25, SD = 0.29; males: M = 0.26, SD = 0.27).
It is worth stressing that multivariate effect sizes like D are not meant to replace
univariate indices like Cohen's d. Univariate and multivariate approaches are
complementary, and whether one of them provides a more meaningful description of the
data is going to depend on the specific question being asked. Criticism of D as an
effect size has focused on the supposed lack of interpretability of the M-F axis, and
on the fact that D can be inflated by adding large numbers of irrelevant variables
(Hyde, 2014; Stewart-Williams & Thomas, 2013). While these points can be readily
addressed (see above and Sect. 1.3.2; for a lengthier discussion see Del Giudice,
2013), they do raise the crucial point that D is only meaningful to the extent that it
summarizes a coherent, theoretically justified set of variables. A related issue is that
many multidimensional constructs in psychology are also hierarchical; for example,
the broad-band structure of personality can be usefully described with five broad
traits (the Big Five: extraversion, openness, agreeableness, conscientiousness, and
neuroticism/emotional instability), but each of those traits can be split into multiple
narrower traits or "facets" (e.g., the possible facets of extraversion include friendliness,
gregariousness, activity, assertiveness, excitement-seeking, and cheerfulness).
If sex differences in the lower-order facets of a trait run in opposite directions, they
may cancel out at the level of broad traits, leading to underestimates of the actual
effect size (see Del Giudice, 2015; Del Giudice et al., 2012). Thus, the choice of the
appropriate level of analysis is an important consideration when applying multivariate
methods to hierarchical constructs.
Another complication in the interpretation of multivariate indices like D concerns
the relative contribution of individual variables to the overall effect. From D values
alone, it is impossible to tell whether the multivariate effect reflects the joint
contribution of many variables, or the overwhelming contribution of one or a few
variables. I have proposed two indices that can be used to aid the interpretation of
D (Del Giudice, 2017, 2018). The heterogeneity coefficient H2 ranges from 0 (maximum
homogeneity; all variables contribute equally) to 1 (maximum heterogeneity;
the totality of the effect is explained by just one variable). The "equivalent proportion
of variables" coefficient EPV2 (also on a 0–1 scale) estimates the proportion of
equally contributing variables that would produce the same amount of heterogeneity,
if the other variables in the set made no contribution. Accordingly, smaller values of
EPV2 indicate higher heterogeneity (e.g., EPV2 = 0.30 means that the same amount
of heterogeneity would obtain if 30% of the variables contributed equally and the
remaining 70% made no contribution to the effect). For example, in the personality
dataset analyzed by Del Giudice et al. (2012) the heterogeneity coefficients are
H2 = 0.90 and EPV2 = 0.16, suggesting that the overall difference is largely driven
by a small subset of variables. Note that there are several possible ways to assign
credit to individual variables (e.g., Garthwaite & Koch, 2016); the method used to
calculate H2 and EPV2 is somewhat ad hoc and will likely be superseded by better
alternatives (see Del Giudice, 2018). Still, these indices can be used heuristically to
contextualize plain D values and flag patterns that may warrant further attention.
1.2.1.3 Indices of Overlap (OVL, OVL2)
In contrast with difference metrics, indices of overlap focus on similarity, as they
quantify the proportion of the distribution area (or volume/hypervolume) that is
shared between males and females. When overlap is high, many males have female-
typical scores and many females have male-typical scores. The overlapping coefficient
(OVL) is the proportion of each distribution that is shared with the other
(Bradley, 2006). This is a highly intuitive index of overlap; however, many
researchers use a somewhat different index (OVL2), in which overlap is calculated
as the shared area relative to the joint distribution.⁸ The corresponding value can be
calculated as 1 − U1, where U1 is Cohen's coefficient of nonoverlap (Cohen, 1988).
Typically, the quantity of interest is overlap rather than nonoverlap; for convenience
I use the label OVL2 to indicate 1 − U1, the proportion of overlap relative to the joint
distribution. While OVL2 is a common index in psychology, its practical
interpretation is somewhat obscure, and some authors have argued (quite convincingly)
that OVL is preferable in most contexts (e.g., Grice & Barrett, 2014).

⁸ The difference between OVL and OVL2 can be visualized by looking at Figure 1.5. OVL = (purple
area)/(purple area + blue area) = (purple area)/(purple area + pink area). OVL2 = (purple area)/
(purple area + blue area + pink area).
It is easy to convert d or D values into indices of overlap under the assumption of
population normality and equality of variances (in the univariate case), or multivariate
normality and equality of covariance matrices (in the multivariate case;
Table 1.1). (For brevity, in the remainder of the chapter I will refer to these
assumptions as "normality" and "equality of variances/covariances.") The conversion
is the same for univariate and multivariate indices, as shown in Fig. 1.3. For
example, both d = 0.50 and D = 0.50 correspond to OVL = 0.80 and OVL2 = 0.67,
indicating that 80% of each distribution and 67% of the joint distribution are shared
between the sexes. Overlap coefficients can also be estimated with nonparametric
methods (e.g., Anderson et al., 2012; Schmid & Schmidt, 2006), which may be
useful when the standard assumptions are severely violated (see Sect. 1.3.1).
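The conversions are easy to script; a minimal R sketch under the same assumptions (normality and equal variances/covariances):

```r
# Overlap indices implied by a standardized difference (Table 1.1);
# the same conversion applies to Mahalanobis' D under multivariate normality
overlap_indices <- function(d) {
  ovl  <- 2 * pnorm(-abs(d) / 2)    # overlap relative to a single distribution
  ovl2 <- ovl / (2 - ovl)           # overlap relative to the joint distribution (1 - U1)
  c(OVL = ovl, OVL2 = ovl2, U1 = 1 - ovl2)
}
round(overlap_indices(0.50), 2)     # OVL = 0.80, OVL2 = 0.67
```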
1.2.1.4 Indices of Superiority (U3, CL)
Another way of looking at differences and similarities is to ask what proportion of
people in the group with the higher mean would score above the median member of
the other group. The answer is provided by Cohen's U3 coefficient, which can be
obtained from d or D under the same assumptions as the overlap indices (Fig. 1.3;
Table 1.1).

Fig. 1.3 Relations between the standardized mean difference (Cohen's d or Mahalanobis' D) and
various indices of difference/similarity. All conversion formulas assume (multivariate) normality
and equality of variances/covariance matrices. See Table 1.1 for details. OVL = proportion of
overlap on a single distribution. OVL2 = proportion of overlap on the joint distribution (equals
1 − U1 in Cohen's terminology). U3 = proportion of a group above the median of the other group.
CL = common language effect size ("probability of superiority"). PCC = probability of correct
classification (assuming equal group sizes). η² = proportion of variance explained (assuming equal
group sizes)
For example, both d = 0.50 and D = 0.50 correspond to U3 = 0.69.
Following the usual conventions, U3 = 0.69 with a positive d means that 69% of
males score above the median female (or, equivalently, that 69% of females score
below the median male; Cohen, 1988). The interpretation of U3 changes slightly
when one is dealing with a multivariate distribution. Specifically, U3 becomes the
proportion of males that are more "masculine" or "male-typical" than the median
female—or, symmetrically, the proportion of females that are more "feminine" or
"female-typical" than the median male.
The common language effect size (CL; also known as "probability of superiority")
is another popular index that translates group differences into probabilities. Specifically,
CL is the probability that a randomly picked individual from the group with
the higher mean will outscore a randomly picked individual from the other group
(McGraw & Wong, 1992). By assuming normality and equality of variances/covariances,
CL can be easily obtained from d or D (Fig. 1.3; Table 1.1). As with U3, the
interpretation of CL changes somewhat in a multivariate context, and becomes the
probability that a randomly picked male will be more "masculine" or "male-typical"
than a randomly picked female (or, symmetrically, the probability that a randomly
picked female will be more "feminine" or "female-typical" than a randomly picked
male). The original CL index can be generalized to discrete distributions (Vargha &
Delaney, 2000), and there are procedures to calculate confidence intervals when
standard assumptions do not apply (Vargha & Delaney, 2000; Zhou, 2008).
1.2.1.5 Probability of Correct Classification (PCC)
The probability of correct classification (hereafter PCC), predictive accuracy, or hit
rate is the probability that a randomly picked individual will be correctly classified
as male or female based on the variable(s) under consideration.⁹ The ability to
reliably infer the sex of an individual can have considerable practical value and
offers an intuitive measure of the degree of statistical separation between two groups.
This approach to quantification differs from those reviewed until now in that the
probability of success depends (implicitly or explicitly) on the statistical model used
to perform the classification task. The problem is greatly simplified when the
assumptions of normality and equality of variances/covariances are satisfied. If
this is the case, linear discriminant analysis (LDA) approximates the optimal classifier
(James et al., 2013), and the PCC can be estimated as a simple function of the
standardized difference d or D, assuming equal group sizes (Fig. 1.3; Table 1.1;
Dunn & Varady, 1966; Hess et al., 2007). For example, both d = 0.50 and D = 0.50
correspond to PCC = 0.60, that is, a 60% probability of correctly classifying a
random individual as male or female. Returning to the example of male/female faces
discussed earlier, the predictive accuracy of human observers is 0.95 or more; under
standard assumptions, this would imply a multivariate difference D ≈ 3.30, a figure
very close to the one estimated from face morphology data (about D = 3.20 in
Hennessy et al., 2005).

⁹ This is different from gender diagnosticity (Sect. 1.2.1.2), which is the estimated probability that a
particular individual is male (or female), regardless of his/her actual sex.
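Both routes to the PCC can be sketched in a few lines of R: the closed-form conversion under standard assumptions, and an empirical leave-one-out estimate from linear discriminant analysis on simulated data (values and names are illustrative).

```r
# Closed-form PCC implied by d or D (Table 1.1), assuming equal group sizes
pcc <- function(d) pnorm(abs(d) / 2)
round(pcc(c(0.50, 3.30)), 2)   # ~0.60 and ~0.95, as in the text

# Empirical counterpart: leave-one-out accuracy of linear discriminant analysis
library(MASS)
set.seed(3)
dat <- data.frame(sex = factor(rep(c("M", "F"), each = 400)),
                  x   = rnorm(800, mean = rep(c(0.5, 0), each = 400)))
fit <- lda(sex ~ x, data = dat, CV = TRUE)   # CV = TRUE returns cross-validated classes
mean(fit$class == dat$sex)                   # should be close to pcc(0.50)
```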
If variances/covariances differ between the sexes but normality still applies, the
approximately optimal classifier is not LDA but QDA (quadratic discriminant
analysis; see James et al., 2013). When distributions are strongly non-normal and
patterns of sex differences are characterized by nonlinearity and higher-order inter-
actions, the PCC is going to depend on the particular classification model chosen for
the analysis. The menu of available methods has been expanding rapidly thanks to
advances in machine learning; common options include logistic regression, classi-
fication trees, support vector machines (SVMs), and deep neural networks (see Berk,
2016; Efron & Hastie, 2016; James et al., 2013; Skiena, 2017). Sophisticated
classification methods can be especially effective in complex datasets with large
numbers of variables; it is not a coincidence that many recent applications to sex
differences come from neuroscience. To give just a few examples: van Putten et al.
(2018) trained a neural network on electroencephalogram signals (EEG) and were
able to identify the sex of participants more than 80% of the time. Using regularized
logistic regression, Chekroud et al. (2016) achieved 93% accuracy in identifying the
sex of adult participants from brain structure. The same accuracy (93%) was reported
by Anderson et al. (2018) with SVM and regularized logistic regression, and by Xin
et al. (2019) with a neural network. By applying SVM to brain scan data, Joel et al.
(2018) obtained 72–80% accuracy in adults, while Sepehrband et al. (2018) achieved
77–83% accuracy in children and adolescents. In all these studies, classification was
performed on multivariate data from the whole brain, not on individual brain
regions. Interestingly, the sex differentiation score computed by Phillips et al.
(2018) from brain structure data (see Sect. 1.2.1.2) yields an expected PCC = 0.82
(estimated from d = 1.80), which is close to the performance of more complex
algorithms.¹⁰

¹⁰ Note that multivariate patterns of sex differences in brain structure are strongly influenced by sex
differences in total brain volume. Because different regions show different scaling functions with
respect to overall volume, simple linear adjustments do not fully remove the effect of males having
larger brains on average. In a recent study that used more sophisticated correction methods,
classification accuracy dropped from more than 80% to about 60% (Sanchis-Segura et al., 2020).
1.2.1.6 Variance Explained (η²)
The proportion of variance in the variable of interest that is explained by a categorical
predictor (e.g., sex) is usually labeled eta squared (η²; see Lakens, 2013; Olejnik
& Algina, 2000). This is a classic effect size but not a very intuitive one; for this
reason, it is seldom employed in sex differences research (but see Deaux, 1985). The
value of η² can be obtained from d or D assuming normality and equality of
variances/covariances; for simplicity, the formulas presented in Table 1.1 also
assume equal group sizes. As can be seen in Fig. 1.3, d = 0.50 and D = 0.50
correspond to η² = 0.06, or 6% of variance explained by sex. Explaining 50% of the
variance requires a male-female difference of two standard deviations. The main
problem with indices of variance explained is that values perceived as "small" are
easy to underestimate and dismiss as trivial, even when they reflect meaningful or
practically important effects (for extended discussion of this point see Abelson,
1985; Breaugh, 2003; Prentice & Miller, 1992; Rosenthal & Rubin, 1979).
1.2.1.7 Variance Ratio (VR)
Males and females may differ not only in their mean value on a trait, but also in their
variability around the mean. When computing most of the indices reviewed in this
chapter, unequal variances are treated as a deviation from standard assumptions
(Table 1.1); however, systematic differences in variability may be interesting in their
own respect, for example because they can have large effects on the relative pro-
portions of males and females at the distribution tails (Sect. 1.2.1.8).
Empirically, males have been found to show larger variance than females in a
majority of traits, including most dimensions of personality (except neuroticism; see
Del Giudice, 2015), general intelligence (e.g., Arden & Plomin, 2006; Dykiert et al.,
2009; Johnson et al., 2008), specific cognitive skills (e.g., Bessudnov & Makarov,
2015; Hyde et al., 2008; Lakin, 2013; Wai et al., 2018), brain size (e.g., Ritchie et al.,
2018; Wierenga et al., 2017), and many other bodily and physiological features (see
Del Giudice et al., 2018; Lehre et al., 2009). In the human literature, this is known as
the “greater male variability hypothesis”(for a historical perspective see Feingold,
1992), but the same general pattern is apparent in most sexually reproducing species
(Wyman & Rowe, 2014; Del Giudice et al., 2018). Some of these differences seem to
reflect scaling effects: If the variability of a trait increases with its mean level, the sex
with the higher mean will also show the larger variance. This is the case for physical
traits such as height, body mass, and brain volume. While the variance of these traits
is higher in males, the coefficient of variation (i.e., the standard deviation divided by
the mean) is very similar in men and women (Del Giudice et al., 2018). However,
greater male variance is also found in domains in which average differences are very
small or favor females (such as general intelligence and most personality traits).
The standard index for sex differences in variability is the variance ratio (VR),
which by convention is the ratio of the male variance to the female variance. In sex
differences research, variance ratios are usually calculated on univariate distributions
(confidence intervals on VR are discussed in Shaffer, 1992). However, the general-
ized variance of a multivariate distribution is the determinant of the covariance
matrix (Sen Gupta, 2004); a generalized variance ratio can be easily obtained as
the ratio of the male and female generalized variances (Table 1.1). Equality of
variances corresponds to VR = 1.00. In the domains of personality and cognition,
values of VR estimated from large samples are often smaller than 1.20 and rarely
larger than 1.50. For neuroticism and related traits, which tend to be more variable in
females, VR usually ranges between 0.90 and 1.00 (Del Giudice, 2015; Hyde, 2014;
Lakin, 2013; Lippa, 2009). For comparison, the variance ratio for height is estimated
at about VR = 1.11 (average across countries; Lippa, 2009).
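Both versions of the index are straightforward to compute; a minimal R sketch on simulated data with modestly greater male variance:

```r
# Variance ratio (male:female) and its multivariate generalization (Table 1.1)
library(MASS)
set.seed(4)
males   <- mvrnorm(400, mu = c(0, 0), Sigma = diag(c(1.2, 1.1)))
females <- mvrnorm(400, mu = c(0, 0), Sigma = diag(c(1.0, 1.0)))

VR_univariate  <- var(males[, 1]) / var(females[, 1])
VR_generalized <- det(cov(males)) / det(cov(females))   # ratio of generalized variances
round(c(VR = VR_univariate, generalized_VR = VR_generalized), 2)
```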
1.2.1.8 Tail Ratio (TR)
The relative proportions of males and females in the region around the mean are
often less interesting than their representation at the tails of the distribution. This is
typically the case when the outcome of interest depends on competition (e.g.,
selection of the top-ranking applicants for a job), the crossing of a threshold (e.g.,
selection requiring a minimum passing score), or other nonlinear effects (e.g., the
probability of committing violent crimes may increase more steeply at the upper end
of the distribution of aggression). Crucially, small differences between means can
have a substantial impact as one moves toward the tails of the distribution; and even
if males and females have exactly the same mean on a trait, sex differences in
variability can produce marked differences at the extremes (Halpern et al., 2007).
When the tails of the distribution are the focus of interest, summary indices such
as mean differences and overlap coefficients are uninformative; researchers may
wish to calculate a tail ratio (TR), that is, the relative proportion of the two sexes in
the region above (or below) a certain cutoff. Here I adopt a slight variation of the
reference group method proposed by Voracek et al. (2013); the alternative approach
by Hedges and Friedman (1993) uses the total distribution of the two groups
combined. In the standard version of Voracek et al.’s method, the group with the
lower mean serves as the reference group, and the cutoff to identify the tail is placed
at z standard deviations from the lower mean (where z can be any value). The choice
of cutoff is noted as TR_zSD: for example, TR_2SD is the tail ratio for a cutoff located
z = 2 standard deviations above the lower mean; TR_2.5SD is the tail ratio for a cutoff
located z = 2.5 standard deviations above the lower mean; and so on. In the context
of sex differences, it is arguably more useful to pick one of the two sexes as the
reference group regardless of the ranking of means; in the following I use females as
the reference group, following the standard convention for variance ratios. While
Voracek et al. (2013) proposed benchmarks for the interpretation of TR modeled on
those for Cohen's d, fixed conventions are even less meaningful in this context and
should probably be avoided.
Tail ratios can be estimated from means and variances assuming normality, or
from d and D with the additional assumption of equal variances/covariances
(Table 1.1). However, the resulting estimates can be very sensitive to violations of
these assumptions (see Sect. 1.3.1), and researchers working with large samples
often calculate tail ratios directly from frequency data rather than from summary
statistics (e.g., Lakin, 2013; Wai et al., 2018). Figure 1.4 shows how d determines
the tail ratios above three common cutoffs. With equal variances (VR = 1), an effect
size d = 0.50 corresponds to TR_1SD = 1.94, TR_2SD = 2.94, and TR_3SD = 4.60. In
other words, there are almost twice as many males as females in the region one
standard deviation above the female mean (TR_1SD); almost three times as many in the
region two standard deviations above the female mean (TR_2SD); and 4.6 times as
many in the region three standard deviations above the female mean (TR_3SD). As the
standardized difference increases, TR becomes disproportionately larger (note that
the vertical axis of Fig. 1.4 is logarithmic). Figure 1.4 also illustrates the major
impact of unequal variances, which—depending on how they combine with distribution
means—can dramatically amplify sex imbalances in the tails, but also
attenuate or even reverse them. While standardized differences and overlap coefficients
are robust to minor sex differences in variability, tail ratios can be remarkably
sensitive to unequal variances. Specifically, the impact of VR is maximized when
d or D values are smaller and/or the chosen cutoff is more extreme (Fig. 1.4).

Fig. 1.4 Tail ratios and the effect of unequal variances. The thick lines show the relative proportion
of males to females above the cutoffs located at one, two, and three standard deviations from the
female mean (TR_1SD, TR_2SD, and TR_3SD) for positive values of d. Calculations assume normality,
equal group sizes, and equal variances in the two sexes (variance ratio VR = 1.00). The shaded areas
represent changes in tail ratios when variances are unequal, ranging from VR = 0.50 (twice as high
in females) to VR = 2.00 (twice as high in males). Note that the impact of unequal variances on TR
is stronger when the difference between means is smaller and/or the cutoff is more extreme
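A small R sketch of the tail-ratio calculation illustrated in Fig. 1.4, with the female distribution as the reference group; the parameterization of unequal variances (female SD fixed at 1, male SD = sqrt(VR)) is one simple choice for illustration:

```r
# Tail ratio above a cutoff placed z SD from the female mean, assuming normality
tail_ratio <- function(d, z, vr = 1) {
  p_male   <- pnorm(z, mean = d, sd = sqrt(vr), lower.tail = FALSE)   # males above cutoff
  p_female <- pnorm(z, mean = 0, sd = 1, lower.tail = FALSE)          # females above cutoff
  p_male / p_female
}
round(tail_ratio(0.50, z = 1:3), 2)          # 1.94, 2.94, 4.60 as in the text
round(tail_ratio(0.50, z = 3, vr = 1.5), 2)  # unequal variances amplify the imbalance
```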
1.2.2 Other Methods
1.2.2.1 Relative Distribution Methods
A powerful but surprisingly underused approach to group differences employs the
statistical concept of a relative distribution to compare the distribution of a compar-
ison group to that of a reference group (Handcock & Janssen, 2002; Handcock &
Morris, 1998,1999). A key tool of relative distribution methods is the relative
density plot, which shows how the ratio of the comparison distribution (e.g.,
males) to the reference distribution (e.g., females) changes at different levels
(quantiles) of the reference distribution. An example of a relative density plot is
shown in Fig. 1.5a (see the figure legend for more details). The relative density
function contains all the information about the differences between the male and
female distributions, including differences in central tendency (mean or median),
differences in variability, and differences in the shape of the two distributions. For
this reason, the relative density plot is a remarkably informative display that can be
used for high-resolution exploration of the data (for a conceptually similar approach
that employs quantile differences instead of relative densities, see Rousselet et al.,
2017; Wilcox, 2006).
Fig. 1.5 Four visualizations of sex differences/similarities. All plots are based on the same dataset
with d = 1.0. (a) Relative density plot. This plot shows the relative male:female density at different
quantiles of the female distribution (bottom axis); the corresponding values of the variable (X) are
shown for reference on the top axis. Dotted lines represent 95% pointwise confidence intervals.
Assuming equal group sizes, a relative density of 1.0 (horizontal dashed line) indicates equal
proportions of males and females. Under the same assumption, there are about five times as many
males as females with values at the lower extreme of the female distribution (0.0 on the bottom axis;
relative density ≈ 5.0). At the median of the female distribution (0.5 on the bottom axis) there are
about three times as many females as males (relative density ≈ 0.3), approximately the same
proportions found at the upper extreme (1.0 on the bottom axis). (b) Overlay density plot of the male
and female distributions. This plot shows the shape of the distributions, their overlap, and the
location of means (vertical dotted lines). (c) Normalized plot of the male and female distributions.
This plot shows the standardized mean difference and the corresponding overlap assuming normality
and equality of variances (in this case, OVL = 0.62 and OVL2 = 0.45). Horizontal bars
represent 95% confidence intervals on d; the colors on the bottom bar can be reversed when the
interval includes opposite-sign values. (d) Venn diagram of the overlap between the male and
female distributions. This type of diagram can be used to intuitively communicate the overall size of
effects in complex multivariate contexts
1 Measuring Sex Differences and Similarities 19
Besides visual exploration, relative distribution methods also support various
types of quantitative inference. Most intriguingly, the relative distribution can be
easily decomposed into independent components that separate the effects of location
(i.e., differences in means or medians) from those of shape (including, but not
limited to, differences in variance). These components of the distribution can be
plotted separately to visually examine their characteristics, or quantified and com-
pared using information-theoretic measures (for details and examples see Handcock
& Morris, 1998). Despite their many attractive features, relative distribution methods
have been largely ignored in sex differences research; the few applications I am
aware of—limited to relative density plots—are in Bessudnov and Makarov (2015),
Del Giudice (2011), and Del Giudice et al. (2010, 2014).
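To illustrate the location/shape logic in the simplest possible way, the sketch below median-shifts a simulated comparison group onto the reference group before computing relative densities, so that the remaining pattern reflects differences in shape (spread, skew, and so on). This is a simplified illustration of the idea, not the actual decomposition and estimators described by Handcock and Morris (1998).

# Remove the location difference (median shift), then inspect the residual
# "shape" component of the relative density. Illustrative sketch only.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
females = rng.normal(0.0, 1.0, 5000)   # reference group
males = rng.normal(0.8, 1.3, 5000)     # differs in both location and spread

males_shifted = males - (np.median(males) - np.median(females))

grid = np.quantile(females, np.linspace(0.05, 0.95, 19))
shape_component = gaussian_kde(males_shifted)(grid) / gaussian_kde(females)(grid)

# With greater male variability, the shape component is U-shaped:
# above 1 in both tails, below 1 around the center.
print(np.round(shape_component, 2))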
1.2.2.2 Taxometric Methods
The goal of taxometrics is to use observable indicators to infer the latent structure of
a given domain (Beauchaine, 2007; Meehl, 1995; Ruscio et al., 2011, 2013).
Specifically, taxometric procedures examine patterns of variation and covariation
among indicators to distinguish cases in which differences between individuals are
purely dimensional (e.g., a continuum of increasingly severe antisocial behavior)
from those in which the data reflect the existence of categories with non-arbitrary
boundaries (e.g., psychopaths vs. non-psychopathic individuals)—or, stated other-
wise, categories that differ from one another in kind and not just degree (taxa).
Taxonic and dimensional variation are not mutually exclusive, and often coexist
within the same domain (e.g., psychopaths may vary in the severity of their antiso-
cial symptoms; see Ruscio et al., 2013).
Carothers and Reis (2013; Reis & Carothers, 2014) performed a taxometric
analysis on various putative indicators of gender, which they distinguished from
biological sex: measures of sexuality, mating preferences, empathy, intimacy, and
personality (including the Big Five). They found overwhelming support for a
dimensional model and concluded that the latent structure of gender—in contrast
with that of sex—is not a binary but a continuum. They also argued that average sex
differences are “not consistent or big enough to accurately diagnose group member-
ship” (p. 401). However, a simpler interpretation of these findings is that the
indicators used in the study were too weak to detect the underlying taxa. As also
noted by the authors, taxometric procedures quickly lose sensitivity as group
differences on the indicators become smaller than d = 1.20 (Beauchaine, 2007;
Ruscio et al., 2011); but almost all the effect sizes in the study were below this
threshold, and often substantially so. Because the indicators were inadequate to
detect taxonic differences, the analysis predictably indicated a dimensional structure.
The only set of psychological indicators with adequate effect sizes was a list of
preferences for sex-typed activities (e.g., boxing, hair styling, playing golf). Predict-
ably, sex-typed activities showed clear evidence of taxonicity, but this result was not
treated as part of the main analysis. Also, the authors’ claim that sex differences are
too small and inconsistent to infer a person’s sex from psychological measures is
unfounded: Personality traits alone can correctly classify males and females with high probability, provided they are measured at the level of narrow traits and aggregated with multivariate methods. For example, D = 2.71 (Del Giudice et al., 2012) yields PCC = 0.91 using the standard formula.[11] In contrast, the Big Five lack the resolution to accurately differentiate the sexes, and the corresponding effect sizes (d = 0.19–0.56 in the study) are too small to regard these traits as valid taxometric indicators. In light of these limitations, the findings by Carothers and Reis (2013) are hard to interpret with any confidence.

[11] Of course, this effect size is based on latent variables, and the corresponding PCC assumes error-free measurement (Sect. 1.3.3). The point remains valid: in principle, a combination of narrow personality traits can accurately discriminate between males and females. Note that Carothers and Reis’ claim concerned the actual amount of overlap between the sexes, not the attenuating effects of measurement error.
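For reference, the 0.91 figure follows from the usual normal-theory result for two groups with equal covariance matrices and equal sizes, PCC = Φ(D/2), where Φ is the standard normal cumulative distribution function. A minimal check in code:

# Proportion of correct classifications implied by a Mahalanobis distance D,
# assuming multivariate normality, equal covariance matrices, and equal group sizes.
from scipy.stats import norm

def pcc_from_D(D):
    return norm.cdf(D / 2.0)

print(round(pcc_from_D(2.71), 2))   # 0.91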
Beyond this particular study, it is unclear whether taxometric methods can make a
substantive contribution to sex differences research. The purpose of taxometrics is to
probe for the existence of taxa that cannot be directly observed, as is often the case
with mental disorders (Meehl, 1995). In meaningful applications, one does not know
a priori whether the hypothetical taxa exist or not, and there is a genuine possibility
that the underlying structure of the data is fully dimensional. But in the case of sex
differences, the taxa (males and females) are already known to the investigators, and
indicator variables are chosen precisely because they can distinguish between males
and females. Given these premises, studies that use sufficiently strong indicators
(e.g., sex-typed activities) can be expected to confirm the existence of two sexes,
whereas studies that use weak indicators will be uninterpretable because of their lack
of sensitivity, as in Carothers and Reis (2013). Either way, the results are going to be
uninformative, unless the goal is to look for additional taxonic distinctions within
each sex (e.g., discrete categories related to sexual orientation; Gangestad et al.,
2000; Norris et al., 2015).
1.2.2.3 Internal Consistency Analysis
Internal consistency analysis was introduced by Joel et al. (2015) in a famous study
of sex differences in brain structure. The first step of this procedure is to select a
subset of variables showing the largest sex differences (e.g., volumes of particular
brain regions) and split each of them into three equal-sized categories—the most male-typical third, the most female-typical third, and the middle third. Each participant is then classified based on his/her combination of variables: Participants who fall in the male-typical, female-typical, or intermediate category on all the variables are deemed “internally consistent”; those with at least one male-typical and one female-typical variable are said to show “substantial variability” and are regarded as “mosaics.” Joel et al. (2015) found very low proportions of internally consistent
individuals (ranging from 0.1% to 10.4%), not only in brain structure but also in
personality, attitudes, and preferences for sex-typed activities. Based on these
findings, they claimed that most people are characterized by a mosaic of male and
female brain features, a pattern that undermines any attempt to distinguish between
“male” and “female” brain types. In later work, Joel and others have argued that
extensive brain mosaicism calls into question the use of sex as an independent
variable in neuroscience (Joel & Fausto-Sterling, 2016; Hyde et al., 2019).
Unfortunately, the method devised by Joel et al. (2015) is seriously flawed. The
threshold for consistency is both arbitrary and exceedingly high: It is easy to show
that, in realistic conditions, the method always returns a small proportion of “inter-
nally consistent” individuals, regardless of the pattern of differences and correlations
among variables (Del Giudice et al., 2015, 2016). This remains true even when the
variables show unrealistically high levels of consistency (i.e., all correlations among
variables equal to 0.90). In light of this, it is not surprising that Joel et al. (2015)
found only 1.2% of internally consistent individuals in the domain of sex-typed
activities, with the same data that showed clear evidence of taxonicity in Carothers
and Reis’ (2013) analysis.[12] While “substantially variable” profiles are more sensi-
tive to variations in the data (Del Giudice et al., 2015; Joel et al., 2016), the
percentages returned by this method can be quite misleading if taken at face value.
The authors have continued to present their findings as evidence that most brains are
“gender/sex mosaics” (Joel & Fausto-Sterling, 2016; Hyde et al., 2019). The ques-
tion they address is without doubt an important one; patterns of consistency/incon-
sistency among sex-related traits can be both theoretically interesting and practically
important. However, their method is designed to show invariably low levels of
internal consistency, and I cannot recommend it as a useful analytic tool.

[12] To see why, consider a fictional man who hates talk shows and cosmetics and is passionate about boxing and video games (male-typical values), but does not particularly like golf (intermediate). He would be classified as showing an “intermediate” profile of gendered interests. If he happened to dislike golf (female-typical value), he would be classified as a sex/gender mosaic with a “substantially variable” interest profile (see Del Giudice et al., 2015).
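For concreteness, the classification rule is easy to implement and explore by simulation. The sketch below is my own simplified rendering (equicorrelated multivariate normal data, arbitrary parameter values); readers can vary the effect size, the correlations, and the number of variables to see how the proportions of “internally consistent” and “mosaic” cases behave.

# Simplified version of the Joel et al. (2015) rule: split each variable into
# thirds of the pooled distribution; a case is "internally consistent" if it
# falls in the same third on every variable, and a "mosaic" if it has at least
# one male-typical and one female-typical value.
import numpy as np

def consistency_rates(d=1.0, r=0.5, p=7, n=50_000, seed=0):
    rng = np.random.default_rng(seed)
    cov = np.full((p, p), r) + (1 - r) * np.eye(p)
    data = np.vstack([rng.multivariate_normal(np.zeros(p), cov, n),     # females
                      rng.multivariate_normal(np.full(p, d), cov, n)])  # males
    lo, hi = np.quantile(data, [1/3, 2/3], axis=0)
    cat = (data > lo).astype(int) + (data > hi).astype(int)  # 0, 1, 2 = thirds
    consistent = np.all(cat == cat[:, [0]], axis=1).mean()
    mosaic = (np.any(cat == 0, axis=1) & np.any(cat == 2, axis=1)).mean()
    return consistent, mosaic

for r in (0.3, 0.6, 0.9):
    c, m = consistency_rates(r=r)
    print(f"r = {r:.1f}: internally consistent = {c:.1%}, mosaic = {m:.1%}")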
1.2.3 Visualization
There are many possible ways to visualize sex differences/similarities in plots and
diagrams; the most appropriate type of display is going to depend on the researchers’
aims and their intended audience. Figure 1.5a shows a relative density plot with
females as the reference group (Sect. 1.2.2.1). This plot does not depict the original
distributions but only their relative differences, and highlights the behavior of the
variable in the tail regions. While relative density plots can be very informative, they
are not immediately intuitive and require some technical background to interpret. A
similar type of plot based on quantile differences instead of relative densities is
discussed in Rousselet et al. (2017) and Wilcox (2006). In Fig. 1.5b, the male and
female probability densities are overlaid on the same plot (e.g., Ritchie et al., 2018).
This straightforward display conveys a lot of information, including the shape of the
two distributions, the difference between means, and the amount of male-female
overlap—though it is less effective than the relative density plot in showing differ-
ences in the tail regions. Overlay density plots are similar to split violin plots, in
which densities are displayed side by side instead of overlaid (e.g., Wai et al., 2018);
however, split violin plots make it hard to visualize the overlap between distribu-
tions. Both density and relative density plots can be used to visually detect obvious
deviations from standard assumptions.
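As an illustration, an overlay density plot of this kind can be produced in a few lines; the sketch below uses simulated data and matplotlib, and all styling choices are arbitrary.

# Overlay density plot: male and female kernel density estimates on one axis,
# with vertical dotted lines at the group means. Simulated normal data, d = 1.0.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
samples = {"Females": rng.normal(0.0, 1.0, 2000),
           "Males": rng.normal(1.0, 1.0, 2000)}

grid = np.linspace(-4, 5, 400)
fig, ax = plt.subplots()
for label, sample in samples.items():
    ax.plot(grid, gaussian_kde(sample)(grid), label=label)
    ax.axvline(sample.mean(), linestyle=":")
ax.set_xlabel("X")
ax.set_ylabel("Density")
ax.legend()
plt.show()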
When effect sizes are mapped on normal distributions (with equal or unequal
variances), normalized density plots (Fig. 1.5c) offer an intuitive display of stan-
dardized differences and overlaps (e.g., Maney, 2016[13]). Plots of actual or normal-
ized distributions can be easily augmented with confidence intervals on d, as shown
in Fig. 1.5c. Still, this kind of plot is inherently univariate, and can be misleading
when one wants to present the results of multivariate analyses. In complex multi-
variate contexts, the overlap between distributions is usually the most intuitive
metric; overlap coefficients can be visualized with Venn diagrams (Fig. 1.5d) in
which areas represent proportions of overlap and nonoverlap (e.g., Del Giudice et al., 2012).

[13] Note that some of the normalized plots in Maney (2016) show atypically large differences in variance between males and females, up to about VR = 23. However, those plots are based on very small samples, and the extreme differences in variability they display are most likely due to sampling error.
1.3 Statistical and Methodological Issues
1.3.1 Assumption Violations
Many of the standard formulas presented in this chapter make the assumptions of
normality and equality of variances/covariances in the population. These formulas
are useful because they allow investigators to calculate a wide range of indices from
commonly reported statistics such as means, standard deviations, correlations, and
values of d or D. Moreover, some non-standard indices (e.g., multivariate overlap
between non-normal distributions) may be complicated to obtain even if raw data are
available. Still, deviations from normality are quite common: Empirical data are
frequently skewed, have heavier tails than expected under a normal distribution, and
so on (e.g., Limpert & Stahel, 2011). The size of indices like d and D is sensitive to
both non-normality and the presence of outliers (Wilcox, 2006); moreover, exact
formulas for confidence intervals are only accurate when normality can be assumed.
Remedies to these distorting effects include bootstrap confidence intervals and
robust variants of Cohen’s d that eliminate the influence of extreme values (e.g.,
Algina et al., 2005; see Kirby & Gerlanc, 2013). Deviations from normality may also
change the amount of overlap between distributions. When this is the case, robust
nonparametric methods can be used to estimate the OVL coefficient in place of the
usual formulas (Anderson et al., 2012; Schmid & Schmidt, 2006). As noted in Sect.
1.2.1, when variances/covariances are markedly unequal it is possible to use QDA
instead of LDA to estimate the PCC; however, both models are quite sensitive to
non-normality (Eisenbeis, 1977), which limits the utility of standard formulas when
normality assumptions are not met.
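As an illustration, a percentile bootstrap interval for d takes only a few lines. The sketch below uses my own helper functions and the ordinary pooled-SD d; robust variants along the lines of Algina et al. (2005) would substitute trimmed means and Winsorized variances.

# Cohen's d (pooled-SD version) with a percentile bootstrap confidence interval.
import numpy as np

def cohens_d(x, y):
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def bootstrap_ci(x, y, n_boot=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    boots = [cohens_d(rng.choice(x, len(x), replace=True),
                      rng.choice(y, len(y), replace=True))
             for _ in range(n_boot)]
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(5)
males, females = rng.normal(0.5, 1.0, 200), rng.normal(0.0, 1.0, 200)
print(round(cohens_d(males, females), 2), np.round(bootstrap_ci(males, females), 2))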
The most widely used test of univariate normality is the Shapiro-Wilk test
(Garson, 2012; Yap & Sim, 2011). Multivariate normality is harder to assess, and
no single method performs well in all conditions (Mecklin & Mundfrom, 2004,
2005). Thus, the recommended approach is to combine multiple tests (which do not
always agree with one another) and supplement them with graphical displays
(Holgersson, 2006; Korkmaz et al., 2014; see Mecklin & Mundfrom, 2004).
Levene’s test is the standard procedure for comparing variances, and there are robust
versions of the test that are less sensitive to non-normality (Gastwirth et al., 2009).
The equality of covariance matrices is usually evaluated with Box’s M test. Unfor-
tunately, the M test suffers from a high rate of false positives (i.e., it rejects
homogeneity too often) and is very sensitive to departures from multivariate nor-
mality; the latter problem can be lessened by using robust variants of the test
(Anderson, 2006; O'Brien, 1992). More generally, using significance tests to eval-
uate assumptions is not without problems. With small samples, many tests have low
power to detect violations; but when sample size is large, very small deviations from
perfect normality/homogeneity may cause a test to reject the assumption, even if the
practical consequences may be negligible.
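In practice, the univariate checks mentioned above are readily available; a brief scipy sketch follows (simulated data; multivariate normality tests and graphical diagnostics require additional tools and are not shown here).

# Shapiro-Wilk test of normality for each sex, plus a Levene-type test of equal
# variances; center="median" gives the robust Brown-Forsythe variant.
import numpy as np
from scipy.stats import shapiro, levene

rng = np.random.default_rng(6)
males = rng.normal(0.5, 1.2, 300)
females = rng.normal(0.0, 1.0, 300)

for label, sample in (("males", males), ("females", females)):
    w, p = shapiro(sample)
    print(f"Shapiro-Wilk ({label}): W = {w:.3f}, p = {p:.3f}")

stat, p = levene(males, females, center="median")
print(f"Brown-Forsythe: W = {stat:.3f}, p = {p:.3f}")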
In sex differences research, the phenomenon of greater male variability
(complemented by some instances of greater female variability) implies that the
assumption of equal variances is literally false in a majority of cases. If so, it makes
little sense to perform significance tests of strict equality: If equality is not expected,
a non-significant result may just mean that the test was underpowered. At the same
time, sex differences in variance are relatively mild—as noted in Sect. 1.2.1,
variance ratios are often lower than 1.20 and rarely higher than 1.50. Large discrep-
ancies between male and female variances typically occur as a consequence of
non-normality (e.g., skewed distributions with long tails), the presence of outliers,
ceiling/floor effects, and other artifacts. With variance ratios in the usual range and
approximately normal distributions, the results of the formulas in Table 1.1 are very
close to the actual values even when variances differ between the sexes (with the
exception of tail ratios; see below). Because equality of variances cannot be gener-
ally assumed, one can test the equality of correlation matrices (which are standard-
ized and do not contain information on variance) instead of that of covariance
matrices. This can be done with various significance tests (e.g., Jennrich, 1970;
Steiger, 1980; see Revelle, 2018). However, these tests suffer from the usual
problems of low sensitivity in small samples and excessive sensitivity in large
samples (see above). An alternative that does not rely on significance is to compare
sample correlation matrices with Tucker’s congruence coefficient (φ or CC; Abdi, 2007). The CC coefficient is an index of matrix congruence that ranges from −1.00 to 1.00. Lorenzo-Seva and ten Berge (2006) proposed benchmarks for CC
based on expert judgments; following their recommendations, values of 0.85 or
more indicate fair similarity, while values above 0.95 indicate high similarity. A high
value of CC implies that there are no major discrepancies between the correlation
matrices of males and females. In many applications, this justifies the use of
multivariate indices, with the caveat that the resulting values are best regarded as
reasonable approximations. Inspection of the correlation matrices (and their differ-
ence) may point to specific variables that seem to behave differently in the two sexes.
Yet another strategy is to employ structural equation modeling (SEM) to fit a
multigroup factor model of the variables (see below), and use model fit indices to
evaluate the equivalence of correlations in the two sexes (e.g., Del Giudice et al.,
2012).
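A rough sketch of the computation is shown below; here the coefficient is taken over the vectorized off-diagonal elements of the two correlation matrices, which is one common convention (dedicated implementations may differ in such details).

# Congruence coefficient between male and female correlation matrices,
# computed over the lower-triangular (off-diagonal) elements.
import numpy as np

def congruence(R1, R2):
    idx = np.tril_indices_from(R1, k=-1)
    a, b = R1[idx], R2[idx]
    return np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))

R_males = np.array([[1.00, 0.40, 0.25],
                    [0.40, 1.00, 0.30],
                    [0.25, 0.30, 1.00]])
R_females = np.array([[1.00, 0.45, 0.20],
                      [0.45, 1.00, 0.35],
                      [0.20, 0.35, 1.00]])

print(round(congruence(R_males, R_females), 3))   # close to 1.00: highly similar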
While most of the standard formulas are robust to minor violations of their
assumptions, this is emphatically not the case for tail ratios. The formulas used to
estimate TR from effect sizes or summary statistics are very sensitive to small
deviations from the hypothesized distributions, particularly when differences
between groups are small and/or cutoffs are extreme (Fig. 1.4). Thus, estimates of
TR based on standard formulas should be treated with special caution unless the
underlying assumptions can be reasonably justified.
1.3.2 Biases in Effect Sizes
When they are calculated from sample data, d and D are not unbiased estimators of
the corresponding population parameters but exhibit a certain amount of bias away
from zero (i.e., their expected value overestimates the absolute size of the effect).
Bias is typically negligible in large samples, but can be substantial in small studies; it
transmits to other indices when conversion formulas are used (Table 1.1), and may
lead investigators to overestimate the size of sex differences in their data. The bias in
d arises from the fact that the pooled sample variance slightly underestimates the
population variance, and is only an issue when sample size is very small: It amounts
to less than 5% of the absolute value when the total N is ≥ 18, and less than 1% when N ≥ 78. The bias-corrected variant of Cohen’s d is known as d_u or Hedges’ g; a
simple correction formula is reported in Table 1.1 (see Hedges, 1981; Kelley, 2005).
The bias in D is a bigger concern, because random deviations from zero in the
univariate effects (caused by sampling error) add up and collectively inflate the value
of D. In a previous paper (Del Giudice, 2013), I suggested a simple rule of thumb
based on simulations: The bias in D can be kept to acceptable levels (i.e., less than
0.05 in absolute value) by having at least 100 cases for each variable in the analysis
(e.g., N ≥ 500 when calculating D from 5 variables). The rule works as advertised when D ≥ 0.45, but bias can still be substantial for smaller values of D. A better alternative when N is small relative to the number of variables is to use the correction formula reported in Table 1.1, which yields the small-sample variant D_u (Lachenbruch & Mickey, 1968; Hess et al., 2007).
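In code, the corrections take a simple form. The sketch below uses the familiar Hedges (1981) approximation for d and an unbiased estimator of the squared Mahalanobis distance in the spirit of Lachenbruch and Mickey (1968); the exact expressions in Table 1.1 may be written somewhat differently.

# Small-sample corrections for standardized effect sizes.
# d_u: multiply d by approximately 1 - 3 / (4N - 9), with N = n1 + n2.
# D_u: correct the squared Mahalanobis distance and truncate negative values at zero.
import numpy as np

def d_unbiased(d, n1, n2):
    N = n1 + n2
    return d * (1 - 3 / (4 * N - 9))

def D_unbiased(D, n1, n2, p):
    N = n1 + n2
    D2_u = (N - p - 3) / (N - 2) * D**2 - p * N / (n1 * n2)
    return np.sqrt(max(D2_u, 0.0))

print(round(d_unbiased(0.50, 10, 10), 2))       # noticeable shrinkage at N = 20
print(round(D_unbiased(0.80, 50, 50, p=5), 2))  # D from 5 variables, N = 100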
Capitalization on chance is also an issue with η², which tends to systematically overestimate the amount of variance explained. The index ω² (omega squared) provides a less biased variant of η² that can be useful when working with small
samples (Lakens, 2013; Olejnik & Algina, 2000). More generally, multivariate
methods tend to overfit the sample data, leading them to overestimate both the proportion
of variance they can explain and the accuracy of their predictions. This is obviously
the case when standard formulas are used to estimate PCC from inflated values of
D (see Glick, 1978). However, all kinds of predictive models—from logistic regres-
sion to classification trees and SVMs—tend to overfit the sample on which they are
trained; to the extent that they do, their performance can be expected to drop when
they are applied to a new, different set of data. Reducing overfit to improve out-of-
sample predictions and obtain correct estimates of a model’s performance is a major
concern in the field of machine learning. Common tools employed to this end
include cross-validation, regularization, and model selection based on information
criteria (see Berk, 2016; Efron & Hastie, 2016; Hooten & Hobbs, 2015; James et al.,
2013).
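For example, out-of-sample classification accuracy can be estimated by cross-validation instead of being derived from the in-sample D. A minimal scikit-learn sketch (simulated data; all settings are arbitrary):

# Cross-validated classification accuracy (PCC) for sex from multivariate data,
# using linear discriminant analysis with 10-fold cross-validation.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
p, n = 10, 300                                  # variables, cases per sex
females = rng.normal(0.0, 1.0, (n, p))
males = rng.normal(0.2, 1.0, (n, p))            # d = 0.2 on each variable
X = np.vstack([females, males])
y = np.repeat([0, 1], n)

accuracy = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=10)
print(f"cross-validated PCC: {accuracy.mean():.2f}")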
1.3.3 Measurement Error and Other Artifacts
While upward bias increases the apparent size of sex differences, measurement error
has the opposite effect. When variables are measured with error, the raw difference
between group means remains approximately the same but the standard deviation is
inflated by noise; as a consequence, standardized indices like d and D become
proportionally smaller. When measurement is unreliable, this reduction (attenua-
tion) can be substantial. In classical test theory, the reliability of a measure is the
proportion of variance attributable to the construct being measured (“true score
variance,” as contrasted with “error variance”). Assuming that sex is measured
without error, the true value of d is attenuated by the square root of the reliability: d = 1.00 becomes 0.95 if the measure has 90% reliability, 0.84 with 70% reliability,
and 0.71 with 50% reliability (Schmidt & Hunter, 2014; see also Schmidt & Hunter,
1996). In the case of D, measurement error reduces both the univariate differences
and the correlations among variables; these effects may either reinforce or oppose
one another depending on the correlation structure and the direction of the univariate
effects. In the field of sex differences, the large majority of individual studies and
meta-analyses fail to correct for attenuation due to measurement error, and as a result
yield downward-biased estimates of effect sizes. This is also the case for the literature
syntheses compiled by Hyde (2005) and Zell et al. (2015).
There are two main approaches to correcting for measurement error. The first and
simpler method is to estimate the reliability of measures from sample data, then
disattenuate d by dividing it by the square root of the reliability coefficient. For example, consider a standardized difference d = 0.50 on a variable with reliability 0.77. The square root of 0.77 is 0.88, and the disattenuated d is 0.50/0.88 = 0.57. To
calculate D, both univariate effect sizes and correlations need to be disattenuated. To
disattenuate a correlation, one divides it by the square root of the product of the two reliabilities. For example, consider a correlation r = 0.30 between two variables with reliabilities 0.77 and 0.82. The product of these reliabilities is 0.63, its square root is 0.79, and the disattenuated r is 0.30/0.79 = 0.38. While this method is an improve-
ment over no correction at all, reliability is typically estimated with Cronbach’s
alpha (α), an index with substantial methodological limitations. In realistic condi-
tions, α tends to yield deflated values when applied to unidimensional scales (Dunn et al., 2014; McNeish, 2018; Revelle & Condon, 2018). More worryingly, values of α do not reflect the unidimensionality of a test: If the items measure more than one construct, or tap additional specific factors on top of the general factor they are supposed to measure, α can be substantially inflated (Cortina, 1993; Crutzen &
Peters, 2017; Schmitt, 1996). For other ways to estimate reliability and a review of
alternative indices, see McNeish (2018), Revelle and Condon (2018), and Zinbarg
et al. (2005). Also note that disattenuated effect sizes have larger sampling errors
than their attenuated counterparts; this should be taken into account when calculating
confidence intervals (see Schmidt & Hunter, 2014).
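In code, the two corrections above amount to one line each; the sketch below simply reproduces the worked examples (the reliability values are those used in the text).

# Classical disattenuation for measurement error: d is divided by the square
# root of the reliability; r is divided by the square root of the product of
# the two reliabilities.
import numpy as np

def disattenuate_d(d, reliability):
    return d / np.sqrt(reliability)

def disattenuate_r(r, rel_x, rel_y):
    return r / np.sqrt(rel_x * rel_y)

print(round(disattenuate_d(0.50, 0.77), 2))        # 0.57
print(round(disattenuate_r(0.30, 0.77, 0.82), 2))  # 0.38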
The second and more sophisticated approach is to use latent variable methods
(most commonly SEM) to explicitly model the factor structure