All content in this area was uploaded by Wolfgang Lenhard on Jan 10, 2023
Reducing the bias of norm scores in non-representative samples: Weighting as an adjunct
to continuous norming methods

Sebastian Gary1, Alexandra Lenhard1, Wolfgang Lenhard2*, David S. Herzberg3

1 Psychometrica, Dettelbach, Bavaria, Germany
2 Institute of Psychology, University of Würzburg, Bavaria, Germany
3 WPS, Torrance, CA, USA

* Corresponding author:
E-Mail: wolfgang.lenhard@uni-wuerzburg.de

Declaration of interest and acknowledgements
We do not have commercial interests influencing the research. We extend our gratitude to WPS
(Torrance, CA) for funding the simulation studies. We acknowledge that the fourth author (DSH)
is employed by WPS. The corresponding R package cNORM was developed by the first three authors
and is freely available under the AGPL2 license.

Preregistration and data availability
We report how we determined our sample size, all data exclusions, all manipulations, and all
measures in the study. The study’s design, hypotheses and analysis plan were preregistered on
18 February 2022 under 10.17605/OSF.IO/HNQDY. The code behind this simulation and analysis
has been made publicly available at the Open Science Framework and can be accessed at
https://osf.io/bwcre/ .
Manuscript accepted for publication:
Gary, S., Lenhard, A., Lenhard, W. & Herzberg, D. (accepted). Reducing the bias of norm scores in
non-representative samples: Weighting as an adjunct to continuous norming methods. Assessment.
Abstract
We investigated whether the accuracy of normed test scores derived from
non-demographically representative samples can be improved by combining continuous norming
methods with compensatory weighting at the raw score level. To this end, we introduce raking,
a method from the social sciences, to psychometrics. In a simulated reference population, we
modeled a latent cognitive ability with a typical developmental gradient, along with three
demographic variables that were correlated to varying degrees with the latent ability. We
simulated five additional populations representing patterns of non-representativeness that might
be encountered in real-world settings. We subsequently drew smaller normative samples from each
population and used an IRT model to generate simulated test results for each individual. Using
these simulated data, we applied norming techniques, both with and without compensatory
weighting. Weighting reduced the bias of the norm scores when the degree of
non-representativeness was moderate, with only a small risk of generating new biases.

Keywords: regression-based norming, continuous norming, weighted ranking, test
development, representativeness, raking
Reducing the bias of norm scores in non-representative samples: Weighting as an adjunct
to continuous norming methods
In the development of norm-referenced psychometric tests, demographically
representative samples provide the foundation for valid norm scores. An initial task for the test
developer is to identify those demographic variables that correlate most strongly with the
construct to be measured by the test. These variables typically include age, gender, race/ethnicity,
education level, and/or socioeconomic status. When measuring a developing cognitive ability,
age is the most important variable. Age has a stronger effect on test scores than the other
variables, especially when testing children and adolescents. Because of this, Wechsler (1939,
Chapter 3) recommended that same-age reference populations be used when norming tests of
intelligence and achievement.
Consequently, normative samples must be demographically representative not just over
the entire age range of the test, but also within smaller age groups containing individuals who are
at similar stages of development. Moreover, the normative samples for each individual age group
must be large enough to allow the computation of reliable norms. These requirements increase
the size of the entire normative sample (along with its cost and the time needed to collect it). To
mitigate the need for larger samples, advanced mathematical methods have been developed to
model the continuous relationship between raw and normed scores across age (e.g., Gorsuch,
1983, quoted from Zachary & Gorsuch, 1985; Cole, 1988; Cole & Green, 1992; Oosterhuis, van
der Ark & Sijtsma, 2016; Oosterhuis, 2017; A. Lenhard et al., 2018; W. Lenhard et al., 2018;
Stasinopoulos et al., 2018; A. Lenhard et al., 2019; Voncken et al., 2019; W. Lenhard & Lenhard,
2021).
Some norm-referenced measures additionally require consideration of variables other than
age. For example, measures of body mass index (BMI) need separate norms for males and
females, because optimal BMI for females is lower than that for males (Sang-Wook et al., 2015).
In other instances, it may be counter-productive to provide separate norms for the different levels
of a demographic variable. For example, some studies show that girls have higher reading skills
than boys (e.g., W. Lenhard et al., 2017; Price-Mohr & Price, 2017). However, if a reading test is
intended to identify those children who need additional support - for example, children at the
lowest decile of reading performance - then the use of gender-specific norms might result in
biased outcomes. With gender-specific norms, some girls might be identified as needing
additional support, even though they perform better on the test than boys who are not identified
as needing educational support.
Addressing demographic imbalances: Stratification and post-stratification
Besides creating separate norms for specific demographic subgroups, several options exist
for dealing with demographic variables that are correlated with the latent ability being measured.
An obvious course is to increase the size of the normative sample. As the size of a randomly
drawn sample increases, the distribution of the demographic variables in this sample increasingly
approximates the distribution in the reference population. However, cost and time constraints
usually limit the size of the sample available for norming. A second approach is stratification, in
which random sampling is conducted independently within homogeneous categories, or strata,
defined by the demographic variables (e.g., males, females). The goal is to have the category
proportions in the normative sample match, as closely as possible, the proportions in the
reference population. For example, if census data indicate that the reference population is
composed of 50% males and 50% females¹, then the researcher would sample males and females
independently to match those proportions in the normative sample.

¹ For simplicity’s sake, we assume at this point that the number of individuals who
identify as non-binary is negligible.

However, it is not always possible to replicate population distributions through stratified
random sampling. One can randomly delete cases from over-represented strata, but researchers
are understandably reluctant to discard data. An alternative is to apply weighting multipliers, or
weights, to the data of individuals in the mis-represented strata. For example, if a sample consists
of 100 males and 50 females, a weighting multiplier of 2 could be applied to the data obtained
from females. Each test score from a female would then be treated as if two females had obtained
such a result. A weight wk, assigned to an observation xi in subsample k, thus indicates the
number of individuals that this single observation represents. The weights must therefore be
calculated so that the proportion
$$\frac{w_k \, n_k}{\sum_{j} w_j \, n_j}$$
corresponds to the proportion of stratum k in the reference population (with nk = size of
subsample k in the normative sample). This weighting procedure is referred to as
post-stratification (Little, 1993; Park et al., 2004; Lumley, 2011, chapter 7).
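
The weighting example above (100 males, 50 females, a 50/50 reference population) can be sketched in a few lines of Python. This is a minimal illustration, not code from cNORM or the survey package; the function names and the final rescaling to a smallest weight of 1 are assumptions made for readability.

```python
def post_stratification_weights(sample_counts, population_shares):
    """One weight per stratum so that weighted shares match the population shares."""
    n_total = sum(sample_counts.values())
    # weight = target share / observed share, written without an intermediate ratio
    return {k: population_shares[k] * n_total / n_k for k, n_k in sample_counts.items()}


def weighted_shares(sample_counts, weights):
    """Share of each stratum after weighting: w_k * n_k / sum_j(w_j * n_j)."""
    total = sum(weights[k] * n_k for k, n_k in sample_counts.items())
    return {k: weights[k] * n_k / total for k, n_k in sample_counts.items()}


counts = {"male": 100, "female": 50}     # example from the text
targets = {"male": 0.5, "female": 0.5}   # reference population shares

w = post_stratification_weights(counts, targets)
# Assumption for illustration: rescale so that the smallest weight equals 1
w_scaled = {k: v / min(w.values()) for k, v in w.items()}
shares = weighted_shares(counts, w_scaled)

print(w_scaled)  # females receive twice the weight of males
print(shares)    # both weighted shares equal the 50% target
```

Rescaling all weights by a common factor leaves the weighted proportions unchanged, which is why the multiplier of 2 for females and a weight of 1 for males reproduce the 50/50 target.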
Recently, Kennedy and Gelman (2021) recommended the use of multilevel regression
combined with post-stratification to correct for non-representative samples in studies of
psychological intervention. The authors suggest that weights can be used to adjust the means of
non-representative samples, to facilitate statistical comparisons among samples. However, in
constructing test norms from non-representative samples, the application of weights is more
complicated, because the norming process involves modeling both population means and
percentile ranks.
It is straightforward to take weights into account when calculating percentiles. As
described above, each test result is treated as if obtained by wk individuals. But this simple
calculation runs the risk of introducing its own bias into the raw-to-norm-score relationship,
especially at the tails of the raw-score distributions. This risk occurs because the weights do not
change the variance of the distribution of raw scores within demographic subgroups, as would be
the case if more individuals were added to the subgroups. The potential distortion of the variance
of the raw score distributions increases with the magnitude of the weights themselves. The risk
for bias also increases as the number of individuals in a subgroup decreases (as is expected at the
tails of the raw-score distributions). Consequently, the usefulness of weighting as a corrective
procedure tends to diminish as the distributions of demographic variables in the normative
sample become increasingly divergent from those in the reference population.
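
The weighted percentile calculation described above can be sketched as follows. The mid-rank convention used here (weight strictly below a score plus half the weight at that score) is an assumption for illustration; the ranking procedure actually used in cNORM may differ, and the function name is hypothetical.

```python
def weighted_percentile_ranks(scores, weights):
    """Percentile rank (0-100) per distinct raw score, treating each
    observation as if it had been obtained by `weight` individuals."""
    total = sum(weights)
    weight_at = {}
    for score, weight in zip(scores, weights):
        weight_at[score] = weight_at.get(score, 0.0) + weight
    ranks = {}
    below = 0.0
    for score in sorted(weight_at):
        # Assumed mid-rank convention: weight below + half of the weight at the score
        ranks[score] = 100.0 * (below + 0.5 * weight_at[score]) / total
        below += weight_at[score]
    return ranks


raw_scores = [10, 12, 12, 15]
unweighted = weighted_percentile_ranks(raw_scores, [1, 1, 1, 1])
weighted = weighted_percentile_ranks(raw_scores, [1, 1, 2, 2])
print(unweighted)
print(weighted)
```

Only four observations underlie both results, so the weighted ranks shift without any new information about the spread of scores being added, which is the source of the risk discussed above.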
Because the potential distortions associated with weighting are most prominent at the tails
of the raw-score distributions, they can disproportionately affect the raw-to-norm-score
relationships for individuals of very high and/or very low ability. Unfortunately, these extreme
ability ranges are the ones where precise norm scores are most needed, because the primary
clinical applications of psychometric tests are to help diagnose disabilities, or, alternatively, to
identify gifted individuals.
As noted above, post-stratification is a method for dealing with normative samples that
are not representative, in terms of the distributions of demographic variables, of the reference
populations from which they are drawn. An additional complicating factor is that the common
demographic variables of gender, socio-economic status (SES), race/ethnicity, and geographic
region are often inter-related, in terms of the effects they may have on test performance. For
example, areas with lower household income often have higher proportions of non-white
inhabitants. Because of such interactions, the most accurate approach to stratification is to
consider not only the marginal distributions of the demographic variables, but also their
cross-classifications, or joint distributions. In a complete crossing of the four variables mentioned
above, for instance, an individual could be classified as “female, low SES, white, west region”.
There are several practical difficulties with stratification based on the joint distributions of
demographic variables. For one, census data are often available only for single demographic
variables considered independently from each other, not for the cross-classified categories of
multiple variables. In addition, in one possible cross-classification of gender, SES, race/ethnicity,
and region, 192 joint cells (2 x 4 x 6 x 4) are created, some of which require only a few
individuals to meet census proportions. Collecting a sample that meets these exacting
specifications becomes a costly, lengthy process. In fact, with typical sample sizes of 100 cases
per age year in tests of cognitive ability (e.g., Kaufman & Kaufman, 2004; Wechsler, 2008;
Wechsler, 2014), it is not possible to replicate the census proportions in every cross-classified
cell, because some of the joint percentages specify fewer than a single individual in a cell.
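
The arithmetic behind this argument is short enough to check directly. The sketch below uses only the category counts and the sample size named in the text; the variable names are illustrative.

```python
import math

# Number of categories per demographic variable, as given in the text
categories = {"gender": 2, "SES": 4, "race/ethnicity": 6, "region": 4}

n_cells = math.prod(categories.values())
cases_per_age_year = 100
mean_cases_per_cell = cases_per_age_year / n_cells

print(n_cells)                        # 192
print(round(mean_cases_per_cell, 2))  # 0.52
```

Even if all joint cells were equally frequent, an age group of 100 cases would supply on average about half an individual per cell, so exact replication of the joint distribution is arithmetically impossible.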
Raking
The raking procedure (Ireland & Kullback, 1968; Kalton & Flores-Cervantes, 2003) is an
approach to post-stratification that attempts to mitigate the practical challenges of sampling based
on a complete crossing of demographic variables. Raking does not draw on the explicit joint
distributions associated with all possible cross-classifications. Instead, the post-stratification
weights are determined in an iterative process based on the marginal distributions of each
demographic variable. That is, the weights assigned to the demographic categories are adjusted
successively and, if necessary, repeatedly, until they no longer change. The procedure is termed
“raking” because it is analogous to smoothing out the soil in a garden bed by repeatedly raking in
different directions. Studies have shown that the raking procedure is convergent and delivers
optimal asymptotically normal estimates for the joint probabilities associated with a complete
crossing of demographic variables (e.g., Ireland & Kullback, 1968).
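
The iterative adjustment described above can be sketched in plain Python. This is a simplified illustration of raking (iterative proportional fitting) on two hypothetical variables, not the implementation in the R package survey; the function names, the example counts, the tolerance and the iteration limit are all assumptions. The sketch also assumes that every category with a target share occurs at least once in the sample.

```python
def rake(records, margins, max_iter=100, tol=1e-10):
    """Adjust one weight per record until the weighted marginal shares of
    every variable in `margins` match the target shares."""
    weights = [1.0] * len(records)
    for _ in range(max_iter):
        largest_change = 0.0
        for variable, target in margins.items():
            # Current weighted total per category of this variable
            totals = {}
            for record, w in zip(records, weights):
                totals[record[variable]] = totals.get(record[variable], 0.0) + w
            grand_total = sum(totals.values())
            # Factor that moves each category to its target share
            factors = {c: target[c] * grand_total / totals[c] for c in totals}
            for i, record in enumerate(records):
                new_weight = weights[i] * factors[record[variable]]
                largest_change = max(largest_change, abs(new_weight - weights[i]))
                weights[i] = new_weight
        if largest_change < tol:  # weights no longer change
            break
    return weights


def weighted_share(records, weights, variable, category):
    hit = sum(w for r, w in zip(records, weights) if r[variable] == category)
    return hit / sum(weights)


# Hypothetical sample: only marginal targets are known, not the joint cells
cell_counts = {("low", "south"): 30, ("low", "north"): 30,
               ("high", "south"): 10, ("high", "north"): 30}
records = [{"education": e, "region": r}
           for (e, r), n in cell_counts.items() for _ in range(n)]
margins = {"education": {"low": 0.5, "high": 0.5},
           "region": {"south": 0.5, "north": 0.5}}

weights = rake(records, margins)
print(weighted_share(records, weights, "education", "low"))
print(weighted_share(records, weights, "region", "south"))
```

Each pass corrects one margin and slightly disturbs the other, which is why the adjustment has to be repeated until the weights stabilize.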
Although widely employed to correct for lack of representativeness in political polls
(Kalton & Flores-Cervantes, 2003), raking apparently has not been used in the norming of
psychometric tests, perhaps because it could introduce error into the raw-to-norm-score
relationships. As discussed above, demographic variables may interact with one another in their
effects on test scores, creating the need to consider the joint distributions of such variables in
developing norms. Because raking operates only on the marginal distributions (i.e., it considers
only the “main effects” of demographic variables on test scores), it may magnify sources of error
that stem from the interactions of these variables. However, these potential risks remain at the
level of speculation, because, to our knowledge, the effect of raking on the accuracy of norm
scores has never been studied.
Effects of continuous norming on non-representative samples
Continuous norming methods offer the advantage of using the properties of the entire
normative sample to correct local sampling errors in smaller subsamples (e.g., age strata).
Consequently, continuous norming methods may offer at least a partial remedy to distortions
caused by lack of demographic representativeness in single age groups.
There have also been attempts to minimize systematic deviations from representativeness in
normative data by combining parametric continuous norming methods with other statistical
procedures. For example, Voncken et al. (2020) used Bayesian Gaussian distributional regression
to align the distributions of newly collected normative data with prior information, more
specifically, with previous normative data of the same test. In their simulation study, this method
proved to be successful if the prior information was not itself biased. This method of course
requires that a previous norm sample is available. Even if one is available, it will be difficult to
determine whether it is (still) representative of the reference population. Moreover, new
normative data are usually collected precisely because the test has been revised or because one
suspects that the distribution of the measured variable may have changed in the population. In
practice, the use of prior information will therefore be restricted to very specific test development
scenarios.
Another continuous norming approach is the semi-parametric continuous norming
approach (SCN), first suggested by A. Lenhard and colleagues (A. Lenhard et al., 2018; A.
Lenhard et al., 2019; W. Lenhard & Lenhard, 2021). This method has been shown to yield
accurate norm scores with several non-optimal types of normative samples. One advantage of
SCN is that it does not make specific assumptions about distribution parameters, and therefore
can be applied to raw score distributions that are skewed, or that show floor and/or ceiling effects.
Unlike parametric continuous norming approaches (e.g., Stasinopoulos et al., 2019;
Voncken et al., 2019), SCN (as implemented in the cNORM package in R, A. Lenhard et al.,
2018) does not rely on splines to model the trajectories of percentile ranks across age groups, but
on simple polynomial regression. Therefore, the SCN approach is quite stiff, i.e., the course of
the curve is less influenced by individual data points compared with spline-based methods. As a
result, these trajectories are relatively robust against single erroneous data points compared to
spline-based regression. As noted above, this feature of SCN modeling tends to reduce the
influence of error variance in local age groups, including that caused by age-specific lack of
demographic representativeness. Therefore, SCN not only produces less norm-score bias than
methods that determine raw-to-norm-score mapping separately for each age group (W. Lenhard
& Lenhard, 2021), but also performs better than parametric continuous norming approaches
when applied to normative samples with the typical sample size of 100 per age cohort,
independent of the skewness of the raw score distributions (A. Lenhard et al., 2019).
Goals of the current simulation study
As described above, it has already been demonstrated that the SCN approach is very
successful in smoothing out sampling errors that occur only in specific age groups. But it
cannot compensate for missing representativeness that systematically affects the entire
normative sample. Raking, in turn, is a weighting procedure that seems to be well suited to
compensate for systematic lack of representativeness affecting the entire sample. But it cannot
compensate for unsystematic error in single age groups caused by random sampling with limited
sample size. We therefore assumed that the combination of both methods would lead to
an even greater improvement in the quality of norm scores because both types of errors described
above are taken into account. But to date, no research has investigated whether the combination
of both methods (in the following referred to as weighted continuous norming, or WCN;
implemented in the R package cNORM v.3.0.2, A. Lenhard et al., 2018) does in fact improve the
accuracy of norm scores, compared to SCN alone. From a theoretical perspective, this should
be the case. However, there is also a risk that the mathematical transformations wrought
by SCN and raking might interact in a way that increases the bias of norm scores, at least within
certain ability ranges.
The goal of the current study, therefore, was to evaluate the benefits and risks of applying
WCN to non-representative normative samples. To this end, we simulated normative data with
different types and degrees of deviations from representativeness, applied both SCN and WCN to
these data and subsequently compared the results.
For the current study, our hypotheses were as follows:
1. We expected a main effect of norming method, such that WCN would lead to less-biased
estimates of the norm scores than SCN, where “bias” is quantified in terms of root mean square
error (RMSE) and mean signed difference (MSD).
2. We expected an interaction between norming method and the degree of non-representativeness
of the input data. Specifically, we expected that as the non-representativeness of the normative
sample increased, norm-score bias would increase for both methods, but that the increase in bias
would be smaller for WCN than for SCN.
3. We expected that the simple effect of WCN in reducing norm-score bias would vary depending
on person location on the cognitive variable. Specifically, we expected that WCN would be less
effective at reducing bias at the tails of the cognitive ability distribution than in the central part of
that distribution.
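
The two bias measures named in the first hypothesis can be sketched as follows. This is a generic illustration rather than the study's analysis code; the function names, the example scores and the sign convention (estimated minus reference norm score) are assumptions.

```python
import math


def rmse(estimated, reference):
    """Root mean square error between estimated and reference norm scores."""
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))


def msd(estimated, reference):
    """Mean signed difference; assumed sign convention: estimated minus reference."""
    differences = [e - r for e, r in zip(estimated, reference)]
    return sum(differences) / len(differences)


# Hypothetical IQ-type scores for three individuals
estimated_scores = [102.0, 98.0, 110.0]
reference_scores = [100.0, 100.0, 100.0]

print(rmse(estimated_scores, reference_scores))  # overall size of the deviations
print(msd(estimated_scores, reference_scores))   # direction of the deviations
```

RMSE captures the overall magnitude of error regardless of direction, whereas MSD shows whether norm scores are systematically over- or underestimated, so positive and negative deviations can cancel in MSD but not in RMSE.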
Methods
Overview
In order to answer the research questions, we conducted a norming procedure on a
measure of a simulated cognitive ability that increases with age (cf. next section). Furthermore,
we modeled the effects of three simulated demographic variables on the cognitive measure. For
convenience, we labeled the simulated demographic variables as “education”, “ethnicity”, and
“geographic region”. We modeled education so that it would have a stronger effect on the
cognitive measure than ethnicity or region.
To provide input for the norming procedure, we generated six simulated population-level
data sets: a reference population that embodied the benchmark distributions of the three
demographic variables; and five non-representative populations, in which the distribution of these
demographic variables differed from the reference population. Each of these populations had six
equal-sized age cohorts. Table 1 summarizes the differences among the six simulated
populations. Because raking incorporates only marginal distributions, we expected it to have little
effect in populations 5 (biased joint probabilities) and 6 (clustered sampling). In these two
populations, non-representativeness occurs only at the level of joint distributions
(cross-classifications), not at the level of marginal distributions.
In more detail, our simulation (data and R syntax available via
https://osf.io/bwcre/?view_only=5a3f4709ea9a4919935d3accc2ff3297) proceeded through the
following steps:
1. Modeling a latent cognitive ability with a typical age-related growth curve.
2. Generating data sets for the reference population and five additional simulated populations.
3. Drawing normative samples from each simulated population.
4. Generating simulated raw scores for a test of the cognitive ability.
5. Applying WCN and SCN to the raw scores from the normative samples.
6. Generating norm scores based on the reference population, as a standard of comparison.
7. Comparing the norm scores obtained from the non-representative populations under SCN versus
WCN with the norm scores based on the reference population.
Modeling cognitive ability
To provide a basis for a modeled cognitive ability that develops with age, we envisioned a
reference population divided into six age cohorts, spanning one year each. We conceptualized a
cognitive ability that increases in each successive age group, as is typical with cognitive
development during childhood. We further specified that this cognitive ability is influenced by
the three demographic variables, each of which has three categories, namely education (low,
medium, high), ethnicity (native, mixed, non-native) and region (south, east, northwest). In broad
terms, therefore, our model states that cognitive ability is a function of age and the three
demographic variables.
We operationalized the effect of the demographic variables on cognitive ability by
assigning three levels of mean cognitive ability (below average, average, above average) to the
three categories of each demographic variable, according to the matrix shown in Table 2.
Importantly, this mapping of ability level to demographic category remains constant in all study
conditions. Thus, by changing the distributions of demographic categories across simulated
populations, we simultaneously manipulate the distributions of cognitive ability. We then
specified benchmark distributions for the demographic variables, which would be enacted in the
simulated reference population data set. The benchmark demographic distributions must be
understood in terms of a complete cross-classification of the three demographic variables, which
yields a 27-cell matrix with a 3 (low, medium, high education) x 3 (native, mixed, non-native
ethnicity) x 3 (south, east, northwest region) structure. Table 3 follows this structure in specifying
the benchmark demographic distributions.
The structure of Table 3 provides a basis for understanding the demographic
manipulations that were applied to create five additional simulated populations. As described
previously, each cell of the table corresponds to a certain mean cognitive ability level, which is
determined by the demographic cross-classification of that cell. Thus, referring to Table 2, the
cell in Table 3 with the highest mean cognitive ability is high education, non-native ethnicity,
northwest region, which appears in the lower-right corner of the table, constituting 2.4% of the
reference population. Conversely, the cell with the lowest mean cognitive ability is low
education, native ethnicity, south region, which appears in the upper-left corner of Table 3,
constituting 7.2% of the reference population.
Within the reference population, each row of data includes age, level of cognitive ability,
and classifications on education, ethnicity, and region. The values for the demographic
classifications are assigned according to the percentages in Table 3. The row-wise values of
cognitive ability are based on a distinct mean value for each cell² of Table 3. This cell-wise mean
is calculated by the following polynomial equation:
$$\bar{\theta}(\mathrm{edu}, \mathrm{eth}, \mathrm{reg}, \mathrm{age}) = 1.5 - 0.25\,\mathrm{edu} - 0.1\,\mathrm{eth} - 0.05\,\mathrm{reg} + 1.2\,\mathrm{age} - 0.06\,\mathrm{age}^2 + 0.0001\,\mathrm{age}^3 \qquad (1)$$
As a result of this equation, each demographic variable exerts a different effect on
cognitive ability: education correlates at r = -.78 with cognitive ability (large effect), ethnicity
with r = -.54 (medium effect) and region with r = -.31 (small effect).
Figure 1 provides a graphic depiction of the modeled cognitive ability in the reference
population. The figure shows mean cognitive ability increasing across the six age cohorts. The
solid black line represents the reference population mean. The dashed black lines represent the
marginal mean cognitive abilities for the low, medium and high categories of education, the
demographic variable with the largest effect on cognitive ability. The grey lines represent the
mean cognitive abilities associated with the 27 demographic cross-classifications. The highest
grey line is high education, non-native ethnicity, northwest region, the cell of Table 3 with the
highest mean cognitive ability. The lowest grey line is low education, native ethnicity, south
region, the cell of Table 3 with the lowest mean cognitive ability.
Generation of simulated population data sets and normative samples
To generate the reference population data set, we drew 24 million pairs of random
numbers (4 million per age cohort), each pair representing one individual. The reference
population size is roughly based on the number of persons in the U.S. The first random number
was uniformly distributed between 0 and 6 and represented age in years. The second number was
normally distributed with M = 0 and SD = 1 within each of the six age groups and represented the
cognitive ability of the individual with respect to other individuals of the same age. This random
number was converted into the specific cognitive ability value for an individual by adding the
mean cognitive ability for that individual’s demographic cross-classification status (see Table 3
and Formula 1). Additionally, we z-standardized each cognitive ability value ($\theta$), using the
reference population mean ($\mu$) and standard deviation ($\sigma$) in Formula 2,
$$\theta_{pop} = \frac{\theta - \mu}{\sigma} \qquad (2)$$
where θpop represents an individual’s location on the cognitive ability variable with respect
to the entire reference population.

² SD is constrained to 1 across all cells of Table 2.
As noted previously, each individual in the reference population was assigned values on
the demographic variables, such that marginal and joint distributions of these variables would
match the distributions shown in Table 3. We then generated five additional simulated population
data sets, using the same method described at the outset of this section (Table 1). These additional
simulated populations represented various violations of demographic representativeness that
might be encountered in collecting normative data for the development of a psychometric test.
The distributions of the demographic variables in these five additional data sets differed from the
reference population as follows:
• Simulated population 2: Mild under-representation of high education. The high education
category was underrepresented (28% instead of 40%) and the low education category was
overrepresented (52% instead of 40%). This manipulation affected both the mean and the
variance of the cognitive ability variable.
• Simulated population 3: Moderate under-representation of high education. The pattern of
misrepresentation was the same as population 2, but the degree of misrepresentation was
greater (high education was 20% instead of 40%; low education was 60% instead of 40%).
This manipulation affected both the mean and the variance of the cognitive ability variable.
• Simulated population 4: Under-representation of both low and high education. Medium
education was overrepresented (40% instead of 20%), and high and low education were
underrepresented (30% instead of 40%). This manipulation attenuated the variance of the
cognitive ability variable, but its mean was not affected.
• Simulated population 5: Biased joint distributions. The joint distributions of the demographic
variables were varied from the reference percentages shown in Table 3, such that some of the
joint cells were overrepresented, while some were underrepresented, but the marginal
distributions were identical to those in the reference population. This manipulation increased
the overall variance of the cognitive ability variable, but its mean was only slightly affected.
• Simulated population 6: Clustered distributions. Within each age cohort, two-thirds of the 27
demographic cross-classification cells contained no data. In the remaining one-third of cells,
the number of individuals was tripled. This manipulation was applied to different subsets of
cells across age cohorts, such that when cell proportions were summed across all age cohorts,
the marginal and joint distributions of the demographic variables were identical to the
reference population. This distribution condition was added to investigate the influence of
clustered sampling as frequently applied in real test norming projects due to economic
constraints. This manipulation either increased or attenuated the variance of the cognitive
ability in each separate age group, but it did not affect the overall mean or variance.
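
The stated effects of the education manipulations in populations 2 to 4 on mean and variance can be illustrated with a small calculation. The numeric codes for the education categories are hypothetical (the direction and spacing of the coding are arbitrary assumptions, so only the presence of a shift is meaningful, not its sign); the shares are those given in the list above, with the medium share for populations 2 and 3 obtained by subtraction.

```python
def mean_and_variance(shares, codes):
    """Mean and variance of a coded categorical variable given category shares."""
    mean = sum(shares[c] * codes[c] for c in shares)
    variance = sum(shares[c] * (codes[c] - mean) ** 2 for c in shares)
    return mean, variance


# Hypothetical numeric codes, chosen only for illustration
codes = {"low": -1.0, "medium": 0.0, "high": 1.0}

education_shares = {
    "population 1 (reference)": {"low": 0.40, "medium": 0.20, "high": 0.40},
    "population 2": {"low": 0.52, "medium": 0.20, "high": 0.28},
    "population 3": {"low": 0.60, "medium": 0.20, "high": 0.20},
    "population 4": {"low": 0.30, "medium": 0.40, "high": 0.30},
}

summary = {name: mean_and_variance(shares, codes)
           for name, shares in education_shares.items()}
for name, (mean, variance) in summary.items():
    print(f"{name}: mean = {mean:.2f}, variance = {variance:.4f}")
```

Under these assumed codes, populations 2 and 3 shift the mean and reduce the variance of the education variable, whereas population 4 leaves the mean unchanged and only reduces the variance, in line with the descriptions above.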
In the five additional simulated population data sets, the cognitive variable was z-361
standardized using Formula 2. Importantly, the values of and were those from the reference 362
data set (population 1), not from the data set whose values were being standardized. By using 363
and of the reference population, the standardized variable reflects bias resulting from biased 364
norm sampling. For example, using formula 2 for simulated population 2 with and of the 365
reference population resulted in standardized values for the cognitive ability with mean value 366
higher than 0, since subjects with high cognitive abilities were overrepresented with respect to the 367
reference population. From each of the six simulated populations, we drew 100 random samples 368
of 600 individuals (100 cases per age cohort). These samples served as input to the norming 369
procedures. 370
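The effect of standardizing with the reference parameters can be illustrated with a minimal Python sketch. It is not the simulation code of this study: the population size, the selection rule that over-represents high-ability cases, and all function and variable names are assumptions made for illustration only.

```python
import random
import statistics

def z_standardize(values, mean, sd):
    """Standardize values with an externally supplied mean and standard deviation."""
    return [(v - mean) / sd for v in values]

rng = random.Random(1)

# Hypothetical reference population: latent ability is roughly N(0, 1).
reference = [rng.gauss(0.0, 1.0) for _ in range(20000)]
ref_mean = statistics.fmean(reference)
ref_sd = statistics.pstdev(reference)

# Hypothetical biased population: keep every high scorer, but only about half of the rest.
biased = [x for x in reference if x > 0.5 or rng.random() < 0.5]

# Standardizing with the reference parameters preserves the sampling bias ...
z_ref_params = z_standardize(biased, ref_mean, ref_sd)
# ... whereas standardizing with the biased data's own parameters hides it.
z_own_params = z_standardize(
    biased, statistics.fmean(biased), statistics.pstdev(biased)
)

mean_ref_params = statistics.fmean(z_ref_params)  # clearly above 0
mean_own_params = statistics.fmean(z_own_params)  # 0 up to rounding error
```

With the reference parameters, the over-representation of high-ability cases remains visible as a positive mean; with the data set's own parameters, the mean is forced to zero and the bias would no longer be detectable.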
Simulation of test results 371
Using the one-parameter logistic (1-PL) model, we simulated a 31-item test to generate 372
test results for each individual in the normative samples. The 31 item difficulties (δ) were drawn 373
randomly from a uniform distribution ranging from -3 to +3. The set of item difficulties covered
a range of about 3.7 standard deviations (M = -0.04, SD = 1.64), therefore spanning a wide range 375
of latent ability. The probability pk,i that an individual k with the z-standardized latent ability θk succeeded on item i, with difficulty δi, was given by the following 1-PL equation:
\[ p_{k,i}(x_{k,i} = 1 \mid \theta_k, \delta_i) = \frac{\exp(\theta_k - \delta_i)}{1 + \exp(\theta_k - \delta_i)} \qquad (3) \]
For every individual k and item i a uniformly distributed random number between 0 and 1 was 379
drawn and compared to pk,i. If pk,i exceeded the random number, the item was scored 1, otherwise 380
it was scored 0. Finally, each individual’s scores on all 31 items were summed to yield a raw total 381
score on the simulated test. 382
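The scoring procedure just described can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the study's own code: the random seed, the normally distributed abilities, and the function names are hypothetical.

```python
import math
import random

def p_correct(theta, delta):
    """1-PL probability that a person with ability theta solves an item of difficulty delta."""
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

def simulate_raw_score(theta, difficulties, rng):
    """Score an item 1 if its success probability exceeds a uniform draw, then sum all items."""
    return sum(1 for d in difficulties if p_correct(theta, d) > rng.random())

rng = random.Random(42)

# 31 item difficulties drawn from a uniform distribution between -3 and +3.
difficulties = [rng.uniform(-3.0, 3.0) for _ in range(31)]

# Hypothetical z-standardized latent abilities for one normative sample of 600 cases.
abilities = [rng.gauss(0.0, 1.0) for _ in range(600)]

raw_scores = [simulate_raw_score(theta, difficulties, rng) for theta in abilities]
```

Each raw score is an integer between 0 and 31, and the expected score rises monotonically with the latent ability.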
383
Application of weighted and unweighted norming procedures 384
For each raw score in the normative samples, we applied WCN and SCN to generate IQ-385
type standard scores (M = 100, SD = 15) for each norming method. These scores were labeled 386
IQWCN and IQSCN. For both WCN and SCN, these IQ scores were calculated with cNORM, an R 387
package that employs continuous norming (A. Lenhard et al., 2018). Weights were not used for 388
SCN. 389
To calculate the weights for WCN, we used the rake function from survey (Lumley, 390
2011), an R package that implements the raking procedure described earlier in this manuscript. 391
Additionally, we standardized the weights to make them easier to interpret. We divided each 392
weight by the smallest weight in the respective norm sample, thereby setting the weight of the 393
most overrepresented group in the sample to 1. 394
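The raking and weight standardization steps can be sketched as follows. This is a plain Python illustration of iterative proportional fitting on marginal distributions, not the rake function of the survey package; the function names, the ten-person sample, and the target proportions are hypothetical.

```python
def rake(sample, targets, max_iter=100, tol=1e-10):
    """Adjust case weights iteratively until the weighted margins match the targets.

    sample  -- list of dicts, one per person, mapping variable name to category
    targets -- dict mapping variable name to {category: population proportion}
    """
    weights = [1.0] * len(sample)
    for _ in range(max_iter):
        max_change = 0.0
        for var, target in targets.items():
            total = sum(weights)
            # Current weighted share of each category of this variable.
            share = {cat: 0.0 for cat in target}
            for person, w in zip(sample, weights):
                share[person[var]] += w / total
            # Rescale every case so that this variable's margin matches its target.
            for i, person in enumerate(sample):
                factor = target[person[var]] / share[person[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return weights

def standardize_weights(weights):
    """Divide by the smallest weight so the most over-represented group has weight 1."""
    smallest = min(weights)
    return [w / smallest for w in weights]

# Hypothetical sample of 10 people in which high education is under-represented.
sample = (
    [{"education": "high", "region": "north"}] * 1
    + [{"education": "high", "region": "south"}] * 1
    + [{"education": "low", "region": "north"}] * 4
    + [{"education": "low", "region": "south"}] * 4
)
targets = {
    "education": {"high": 0.4, "low": 0.6},
    "region": {"north": 0.5, "south": 0.5},
}
weights = standardize_weights(rake(sample, targets))
```

In this toy sample, the two high-education cases receive a weight of about 2.67 and the eight low-education cases a weight of 1, so that the weighted share of high education equals the target of 40%.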
Using weights required modifications to the standard cNORM functions. In WCN, 395
weights are applied initially in the ranking procedure, where each raw score is assigned a 396
percentile rank. Because of the high number of ties, the average rank was used for further 397
processing, following the usual cNORM procedures (see A. Lenhard, Lenhard, & Gary, 2019). In 398
WCN, weights are also entered in cNORM’s regression-based modeling procedure. To perform 399
the regression, cNORM draws on the regsubsets function of leaps, an R package (Lumley, 2017). 400
The regsubsets function can process weights in the regression analysis.
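How weights can enter a ranking step is illustrated by the sketch below. It does not reproduce cNORM's internal algorithm, which is not detailed here; it merely shows one plausible weighted mid-rank scheme in Python, and the function name and example data are assumptions. The weighted regression step is not shown.

```python
from collections import defaultdict

def weighted_percentiles(raw_scores, weights):
    """Weighted percentile rank per case; tied raw scores share one mid-rank value."""
    weight_at = defaultdict(float)
    for score, w in zip(raw_scores, weights):
        weight_at[score] += w
    total = sum(weight_at.values())

    percentile_of = {}
    below = 0.0
    for score in sorted(weight_at):
        # Mid-rank: all weight below the score plus half of the weight tied at the score.
        percentile_of[score] = 100.0 * (below + 0.5 * weight_at[score]) / total
        below += weight_at[score]
    return [percentile_of[s] for s in raw_scores]

raw = [10, 12, 12, 15]
unweighted = weighted_percentiles(raw, [1, 1, 1, 1])  # [12.5, 50.0, 50.0, 87.5]
weighted = weighted_percentiles(raw, [1, 1, 1, 3])    # the highest scorer counts three times
```

With equal weights, the result matches the usual average-rank percentiles. When the highest scorer is weighted more heavily, the percentile ranks of all lower raw scores decrease, because a larger share of the (weighted) sample now lies above them.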
Generating norm scores from the reference population 402
To test the study hypotheses, we created a measure (IQbest), in the same metric as IQWCN 403
and IQSCN, which represented the “actual” person location on the cognitive ability variable. IQbest 404
was derived from the distribution of raw scores in the entire reference population (in contrast to 405
IQWCN and IQSCN, which were derived from the smaller normative samples). To compute IQbest, 406
we generated raw scores on the 31-item simulated test for the 24 million individuals in the 407
reference population, using the previously described method. We then partitioned the reference 408
population by age, creating 365 equal-sized groups within each of the six age cohorts. Each of the 409
resulting 2190 age-groups consisted of about 11,000 individuals with the same “birthday”. The 410
raw scores were ranked and converted into IQ scores using rank-based inverse normal 411
transformation within each age group. As a result, each row in the reference population data set 412
included values for age, raw score and IQbest. 413
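The rank-based inverse normal transformation applied within each age group can be sketched as follows. The sketch uses Python's standard library; the percentile convention (rank - 0.5) / n and the function name are assumptions, since the exact convention used for IQbest is not stated here.

```python
from statistics import NormalDist

def rank_based_iq(raw_scores, mean=100.0, sd=15.0):
    """Convert the raw scores of one age group to IQ scores via their normal quantiles."""
    n = len(raw_scores)
    ordered = sorted(raw_scores)

    # Average 1-based rank for each distinct raw score, so that ties share one rank.
    avg_rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and ordered[j] == ordered[i]:
            j += 1
        avg_rank[ordered[i]] = (i + 1 + j) / 2.0
        i = j

    std_normal = NormalDist()
    # (rank - 0.5) / n keeps the percentiles strictly between 0 and 1 (assumed convention).
    return [mean + sd * std_normal.inv_cdf((avg_rank[s] - 0.5) / n) for s in raw_scores]

iq = rank_based_iq([12, 15, 15, 20, 9])
```

Tied raw scores receive identical IQ scores, and a raw score at the median of its age group is mapped to an IQ of 100.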
Hypothesis testing with RMSE and MSD 414
As noted above, we drew 100 normative samples from each of the six simulated 415
population data sets (N = 600) and subjected them to both SCN and WCN. For each of the 416
samples, we computed RMSE and MSD to compare IQbest to IQWCN and IQSCN, respectively. 417
RMSE is a summary measure of norming model error that includes both fixed and variable 418
error components (Lenhard & Lenhard, 2021). It was computed using the following formula: 419
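\[ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left( \mathrm{IQ}_{\cdot,k} - \mathrm{IQ}_{\mathrm{best},k} \right)^{2}}, \qquad (4) \]
where N is the number of cases and IQ. stands for either IQWCN or IQSCN. MSD is a measure of the
tendency for a norming model to overestimate (MSD > 0) or underestimate (MSD < 0) the actual
person location. The formula used to calculate the MSD was:
\[ \mathrm{MSD} = \frac{1}{N} \sum_{k=1}^{N} \left( \mathrm{IQ}_{\cdot,k} - \mathrm{IQ}_{\mathrm{best},k} \right). \qquad (5) \]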
To be able to test Hypothesis 3, we divided the distributions of IQbest, IQWCN and IQSCN 425
into 11 intervals of 7.5 IQ points each. RMSE and MSD were calculated separately for each of 426
these intervals. Both RMSE and MSD are quantified in terms of IQ points. 427
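Both error measures are simple to compute. The following Python sketch follows the definitions given above, with the sign convention that a positive MSD indicates overestimation of the actual person location; the function names and the four example cases are hypothetical.

```python
import math

def rmse(iq_model, iq_best):
    """Root mean square error between model-based and 'actual' IQ scores."""
    n = len(iq_best)
    return math.sqrt(sum((m - b) ** 2 for m, b in zip(iq_model, iq_best)) / n)

def msd(iq_model, iq_best):
    """Mean signed difference; positive values mean the model overestimates."""
    n = len(iq_best)
    return sum(m - b for m, b in zip(iq_model, iq_best)) / n

# Hypothetical example with four cases (differences: +2, +1, -1, +2 IQ points).
iq_best = [85.0, 100.0, 115.0, 130.0]
iq_model = [87.0, 101.0, 114.0, 132.0]

error_rmse = rmse(iq_model, iq_best)  # sqrt(10 / 4), about 1.58 IQ points
error_msd = msd(iq_model, iq_best)    # 4 / 4 = 1.0 IQ point
```

The example shows why both measures are reported: positive and negative deviations partly cancel in MSD but not in RMSE, so RMSE is never smaller than the absolute value of MSD.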
In general, the analytic approach was to conduct 6 (simulated population) x 11 (IQ range) 428
x 2 (norming method) mixed ANOVAs on both RMSE and MSD. Population was a between-429
groups factor, and IQ range and norming method were within-groups factors. Because of the high 430
number of simulation cycles (i.e., n = 100 within each cell of the ANOVAs), statistical power
was high, and therefore the level of significance was set to α = .01. The assumption of sphericity
was tested and, where indicated, degrees of freedom were corrected. Additionally, partial η2’s 433
were computed as measures of effect size. We further specified that, in the norm score 434
comparisons of interest, differences of less than 1 IQ point were too small to have any practical 435
relevance. We opted for this threshold, since IQ scores are usually rounded to integers in test 436
manuals, and, therefore, differences of 1 IQ point or more represent the smallest detectable 437
difference in the norm tables. 438
Results 439
As indicated by Mauchly’s test, sphericity assumptions were generally violated both for 440
RMSE and MSD. Therefore, degrees of freedom in all ANOVAs were corrected according to the 441
Greenhouse-Geisser method. The two separate 6 × 11 × 2 mixed ANOVAs on RMSE and MSD 442
yielded significant results for all main effects and interactions (p < .001). We focus here on the 443
effects that are most salient for testing our hypotheses. 444
Hypothesis 1: Main Effect of Norming Method 445
The first hypothesis proposed that WCN would yield lower levels of norm score bias than 446
SCN. This hypothesis was supported by tests of the main effects of norming method, RMSE: F(1, 447
594) = 94.93, p < .001, η2 = .14, MSD: F(1, 594) = 3397.28, p < .001, η2 = .85. RMSE was 448
smaller for WCN (M = 2.18, SE = .02) than for SCN (M = 2.36, SE = .02). The same was true for 449
MSD (SCN: M = 0.74, SE = .03; WCN: M = -0.24, SE = .03). 450
However, the analysis also detected significant interactions between norming method and 451
simulated population, indicating that the effects of weighting on norm-score bias varied among 452
the simulated populations, RMSE: F(5, 594) = 98.98, p < .001, η2 = .45, MSD: F(5, 594) = 453
764.77, p < .001, η2 = .87. As can be seen in Figure 2 (RMSE) and Figure 3 (MSD), in two of the 454
six simulated populations, weighting reduced bias in the normed scores. In populations 2 (mild 455
under-representation of high education) and 3 (moderate under-representation of high education), 456
RMSE was lower with WCN than with SCN, by an average of 0.48 IQ points and 0.97 IQ points, 457
respectively. In populations 1 (reference), 5 (biased joint probabilities) and 6 (clustered 458
sampling), the difference in RMSE between WCN and SCN approximated zero. In population 4 459
(under-representation of both low and high education), WCN returned higher average RMSE than 460
SCN, but the difference of 0.32 IQ points was below the threshold of practical relevance. 461
The analysis of the MSD yielded similar results. In populations 2 (mild under-462
representation of high education) and 3 (moderate under-representation of high education), MSD 463
was closer to the ideal value of zero for WCN (populations 2 and 3: -0.32 IQ points) than for 464
SCN (population 2: 1.51 IQ points; population 3: 2.59 IQ points). In populations 1 (reference), 5 465
(biased joint probabilities) and 6 (clustered sampling), MSD approximated zero, regardless of the 466
norming method. In population 4 (under-representation of both low and high education), MSD 467
deviated more from zero for WCN (-0.49 IQ points) than for SCN (0.13 IQ points). As with
RMSE, these latter differences did not meet the criterion for practical significance.
Hypothesis 2: Interaction between Norming Method and Degree of Non-Representativeness 470
Hypothesis 2 specified that as the non-representativeness of the normative samples 471
increased, norm-score bias would increase for both methods, but that the increase in bias would 472
be smaller for WCN than for SCN. To address this hypothesis, we compared populations 2 and 3. 473
Both populations were characterized by under-representation of the high education group, but the 474
magnitude of under-representation was greater in population 3 than in population 2. Therefore, 475
we performed two additional ANOVAs that limited the levels of the between-groups factor to 476
populations 2 and 3. Both analyses yielded a significant interaction between norming method and 477
simulated population, RMSE: F(1, 198) = 26.01, p < .001, η2 = .12, MSD: F(1, 198) = 242.36, 478
p < .001, η2 = .55. 479
The results of these analyses are visualized in Figures 2 and 3. The plots show the 480
interaction: The RMSE is greater in magnitude in population 3 (moderate under-representation) 481
than in population 2 (mild under-representation), for both norming methods, but the magnitude of 482
increase is greater for SCN (0.74 IQ points) than for WCN (0.25 IQ points). With regard to MSD,
the benefits of WCN were even more pronounced. Considering WCN in isolation, average MSD 484
was approximately equal for both populations (-0.32 IQ points). By contrast, with SCN, MSD in 485
population 3 was 1.09 IQ points higher than in population 2. 486
Hypothesis 3: Effectiveness of WCN depends on person location 487
Hypothesis 3 proposed that WCN would be less effective at reducing bias at the tails of 488
the cognitive ability distribution than in the central part of that distribution. We tested this 489
hypothesis with two analytic approaches. First, we conducted 11 × 2 ANOVAs with person 490
location and norming method (WCN vs. SCN) as within factors, and RMSE and MSD as separate 491
dependent variables. Thus, a total of 12 different ANOVAs were calculated in this first analytic 492
approach (two for each of the six sampling conditions). 493
Second, for populations 2, 3, 4, 5 and 6 (which yield demographically non-representative normative
samples, as described earlier), we additionally compared the performance of WCN in each of these
populations to that of SCN in population 1 (which yields demographically representative normative
samples). SCN in
population 1 therefore represents a benchmark condition, against which the performance of WCN 497
in the other non-representative populations can be measured. For this second analytic approach, 498
we also used 11 × 2 ANOVAs, but this time with norming condition (WCN with biased 499
normative sample vs. SCN with unbiased normative sample) as a between-groups factor. The 500
results of these analyses are illustrated in Figures 4 (RMSE) and 5 (MSD). Because of the large 501
number of comparisons, we report effects only if at least one of the differences within an analysis 502
exceeded 0.5 IQ points. 503
Population 1: Reference 504
In normative samples drawn from the reference population, the ANOVAs for RMSE and 505
MSD both yielded a significant main effect of person location, RMSE: F(2.68, 264.87) = 70.68, 506
p < .001, η2 = .42, MSD: F(2.34, 231.65) = 35.54, p < .001, η2 = .26. In general, RMSE increased 507
as person location moved towards either tail of the distribution, away from the average IQ of 100. 508
This effect, also seen in the other simulated populations, is visualized as a parabolic shape in 509
Figure 4. By contrast, in the analysis with MSD, the main effect of person location is visualized 510
as a sinusoidal pattern (see Figure 5 and discussion section below). This effect of person location 511
on norming bias is a previously reported feature of continuous norming procedures (cf. A. 512
Lenhard et al., 2019). As such, this effect is not directly relevant to the question of whether 513
weighting, per se, reduces norm-score bias due to non-representative sampling. What is important 514
to note (and is readily seen in Figures 4 and 5) is that WCN and SCN perform equally well, in 515
terms of error measures, when processing normative samples drawn from a demographically 516
representative, reference population. This makes intuitive sense, because with representative 517
samples, there are no cell-wise departures from expected demographic proportions, to which 518
weights could be applied to correct for bias in the norming process. 519
Population 2: Mild under-representation of high education 520
In samples drawn from population 2, we found a main effect of norming method, RMSE: 521
F(1, 99) = 74.94, p < .001, η2 = .43, MSD: F(1, 99) = 1924.64, p < .001, η2 = .95. WCN was 522
superior to SCN in reducing norm-score bias resulting from non-representative samples. With 523
RMSE, we also observed an interaction between person location and norming method, F(2.72, 524
268.76) = 41.72, p < .001, η2 = .30. As shown in Figure 4, WCN reduced the error measure to a 525
greater degree in the upper range of person location than in the lower range. In population 2, 526
individuals of higher education (and consequently, higher cognitive ability) are under-527
represented. Thus, the interaction shows that WCN is correcting for norm-score bias at those 528
person locations that are under-represented in the normative samples. 529
In the comparison of WCN in population 2 to the benchmark of SCN in population 1, 530
there was no main effect of the norming condition on RMSE. That is, even under the conditions 531
of non-representativeness in population 2, WCN did not differ from the benchmark on the error 532
measure. This suggests that weighting successfully compensated for any norm-score bias due to 533
demographic non-representativeness in population 2, when that bias was measured by RMSE. The 534
results differed for MSD, where we observed a main effect of norming condition, F(1, 198) = 535
15.37, p < .001, η2 = .07, and an interaction between norming condition and person location,
F(2.31, 456.93) = 2.91, p = .048, η2 = .01. These findings indicated that, with respect to MSD, 537
WCN did not fully correct norm-score bias in samples from population 2. However, this 538
difference in MSD between WCN and the benchmark condition was generally rather small. Only
at a very low person location of IQ 62.5 did the difference (0.99 IQ points) approach the
practically relevant threshold of 1 IQ point.
Population 3: Moderate under-representation of high education 542
For both error measures, the ANOVAs with normative samples drawn from population 3 543
yielded significant main effects of norming method, RMSE: F(1, 99) = 155.76, p < .001, η2 = .61, 544
MSD: F(1, 99) = 2707.76, p < .001, η2 = .97, and significant interactions between norming 545
method and person location, RMSE: F(2.79, 276.13) = 71.89, p < .001, η2 = .42, MSD: F(2.63, 546
260.05) = 6.58, p = .001, η2 = .06. The analyses for population 3 produced larger effect sizes than 547
those for population 2, mirroring the difference in representativeness between the two 548
populations. This suggests that WCN exerts a larger corrective effect on norm-score bias with 549
normative samples that display greater deviations from demographic representativeness. 550
In comparing WCN in population 3 to SCN in population 1, we observed significant main 551
effects of norming condition, RMSE: F(1, 198) = 9.52, p = .002, η2 = .05, MSD: F(1, 198) = 552
12.52, p = .001, η2 = .06, and significant interactions between norming condition and person 553
location, RMSE: F(2.81, 557.01) = 3.06, p = .031, η2 = .02, MSD: F(2.26, 447.76) = 3.28, 554
p = .033, η2 = .02. In the normative samples drawn from population 3, WCN yielded greater 555
norming error than the benchmark within the low and high levels of person location. However, 556
the differences were too small to be of practical relevance. 557
Population 4: Under-representation of both low and high education 558
As with populations 2 and 3, the analyses of normative samples drawn from population 4 559
produced significant main effects of norming method, RMSE: F(1, 99) = 39.02, p < .001, η2 = .28, 560
MSD: F(1, 99) = 164.63, p < .001, η2 = .62, and significant interactions between norming method 561
and person location, RMSE: F(3.02, 299.01) = 42.43, p < .001, η2 = .30, MSD: F(2.53, 250.47) = 562
42.44, p = .001, η2 = .30. However, with population 4, where both tails of the education 563
distribution were under-represented, WCN did not provide greater reduction of norm-score bias 564
than SCN. The interactions revealed that at low levels of person location RMSE was even greater
for WCN than for SCN (IQ 70.0: |∆RMSE| = 1.37 IQ points; IQ 62.5: |∆RMSE| = 1.43 IQ points).
For MSD, the interaction between norming method and person location was more complex, with 567
SCN providing greater reduction of norm-score bias than WCN at low levels of person location 568
(IQ 70.0: |∆MSD| = 1.25 IQ points; IQ 62.5: |∆MSD| = 2.00 IQ points). In the upper levels of 569
person location, WCN narrowly outperformed SCN, but the differences were below the threshold 570
of practical relevance.
In comparing WCN in population 4 to SCN in population 1, we found significant main 572
effects of norming condition similar to those we just described when comparing WCN to SCN in 573
population 4, RMSE: F(1, 198) = 38.46, p < .001, η2 = .16, MSD: F(1, 198) = 31.10, p < .001, η2
= .14. We also found similar interactions between norming condition and person location, RMSE: 575
F(2.83, 560.75) = 22.19, p < .001, η2 = .10, MSD: F(2.55, 505.02) = 19.44, p < .001, η2 = .09. For 576
both RMSE and MSD, WCN delivered significantly worse results than the benchmark condition 577
at low person locations. For person locations of IQ 70 or lower, these differences exceeded the 578
practically relevant threshold of 1 IQ point (IQ 70.0: |∆RMSE| = 1.47 IQ points; |∆MSD| = 1.52
IQ points; IQ 62.5: |∆RMSE| = 1.48 IQ points; |∆MSD| = 2.21 IQ points). For high person
locations, the results were again more complex, but the differences were generally too small to be 581
of practical relevance in this ability range. 582
Population 5 (Biased joint distributions) 583
In samples drawn from population 5, we found no consistent effects demonstrating either 584
superiority or inferiority of WCN compared to SCN. For RMSE, we observed a significant 585
interaction between person location and norming method, F(3.11, 307.61) = 4.27, p = .005, η2 = 586
.04. For MSD, we found a significant main effect of norming method, F(1, 99) = 51.18, p < .001, 587
η2 = .34, and a significant interaction between norming method and person location, F(1.54, 588
152.20) = 11.21, p < .001, η2 = .10. But both interactions proved to be disordinal, that is, WCN 589
performed better than SCN at some person locations and worse at others. Furthermore, all 590
differences were much smaller than 1 IQ point and therefore of no practical relevance. 591
In the comparison of WCN in population 5 to SCN in population 1, we observed a 592
significant interaction between norming condition and person location for MSD only, F(2.29, 593
453.98) = 6.25, p = .001, η2 = .03. But again, the interaction was disordinal and the differences 594
were far too small to be of any practical relevance. Hence, the differences between WCN, SCN 595
and the benchmark (SCN in population 1) were generally very small in samples drawn from 596
population 5. We will further elaborate on this result in the discussion section. 597
Population 6 (Clustered distributions) 598
For population 6, the comparison between WCN and SCN yielded no significant effects at 599
all regarding RMSE. For MSD, the ANOVA returned a significant main effect of norming 600
method, F(1, 99) = 18.95, p < .001, η2 = .16, and a significant interaction between norming 601
method and person location, F(1.65, 163.79) = 10.04, p < .001, η2 = .09. But as in population 5, 602
the interaction was disordinal, that is, the effects of weighting were inconclusive, with norm
scores improving at some person locations but deteriorating at others. Furthermore,
the differences between WCN and SCN were even smaller than in population 5. 605
The comparison of WCN in population 6 to the benchmark of SCN in population 1 606
yielded no significant effects at all. 607
Discussion 608
The present study examined whether compensatory weighting at the raw score level, when 609
combined with semi-parametric continuous norming, would reduce bias in norm scores derived 610
from demographically non-representative norm samples. To pursue this aim, we simulated six 611
populations in which the distributions of demographic variables departed to various degrees from 612
expected proportions. We modeled a latent cognitive ability, which we used as the input for a 613
one-parameter logistic IRT model to create raw test scores. We drew normative samples from the 614
six populations, and generated IQ-type norm scores by applying weighted continuous norming 615
(WCN) and semi-parametric continuous norming without weighting (SCN). We used root mean 616
square error (RMSE) and mean signed difference (MSD) as measures of norm-score bias. 617
Our first hypothesis proposed that when processing non-representative normative 618
samples, WCN would produce less-biased norm scores than SCN. The predicted general 619
advantage of WCN was most apparent in samples drawn from populations 2 and 3, in which 620
individuals with high levels of education were under-represented. In samples drawn from 621
populations 5 (biased joint probabilities) and 6 (clustered sampling), WCN showed no benefit 622
over SCN, but neither did it degrade the quality of norm scores, relative to continuous norming 623
without compensatory weighting. In population 4 (under-representation of both low and high 624
education), we found that WCN even led to a small increase in norm-score bias, but only at 625
certain points in the range of cognitive ability. 626
In normative samples drawn from population 1, which served as the standard of 627
representativeness for the demographic variables, WCN demonstrated no advantage over SCN. 628
This result is not surprising: WCN creates weights to compensate for departures from 629
representativeness. Because population 1 was the benchmark, in terms of demographic 630
composition, random samples drawn from it were expected to be demographically representative. 631
Population 2 introduced deviations from the benchmark distribution of education, the 632
demographic variable with the strongest effect on cognitive ability. Specifically, level 1 (high 633
education/high ability) was mildly under-represented, and level 3 (low education/low ability) was 634
proportionately over-represented. In samples drawn from population 2, WCN yielded greater 635
reduction in norm-score bias than SCN, for both error measures, across the entire range of 636
cognitive ability. 637
Population 3 presented a pattern of non-representativeness on education that was similar 638
to that in population 2, but greater in magnitude. The comparison of normative samples drawn 639
from populations 2 and 3 was relevant to testing our second hypothesis, which specified that as 640
the non-representativeness of the normative sample increased, norm-score bias would increase for 641
both methods, but that the increase in bias would be smaller for WCN than for SCN. Our findings 642
provided support for this hypothesis: with samples drawn from population 3, the magnitude of 643
norm-score bias for WCN was larger than it was in the population 2 analyses, although WCN 644
retained its superiority to SCN in terms of reducing this bias. As compared to the benchmark 645
condition (i.e., SCN in the unbiased population 1), the increase in norming error associated with 646
WCN in population 3 depended on person location – it occurred at either extreme of the range of 647
the cognitive ability variable, but not at an average level. In no instances, however, did these 648
increases in the error measures exceed 1 IQ point. Hence, even with this strong deviation from 649
representativeness, WCN was highly effective at reducing the corresponding norm-score bias.
Population 4 embodied a further scenario of demographic non-representativeness, in 651
which both tails of the education distribution were under-represented, and the central part of the 652
distribution was proportionately over-represented. In terms of the average degree of 653
misrepresentation across the three levels of education, population 4 did not differ from population 654
3. Where the effect of the demographic manipulation differs is on the raw score distributions. In 655
population 4, the manipulation attenuates the variance of the raw score distributions, because 656
under-sampling both tails of the education distribution results in an under-sampling of the very 657
high and low raw scores that reside in those ability levels. In addition, whereas in population 3 658
the pattern of misrepresentation affects the mean of the raw score distribution, in population 4 the 659
mean is not affected, because there is equal under-representation of both tails of the raw score 660
distribution. 661
In normative samples drawn from population 4, we observed that at certain levels of 662
person location, WCN was less effective than SCN in reducing norm-score bias, which is 663
consistent with our third hypothesis. Specifically, we found that the disparity between WCN and 664
SCN increased at both tails of the cognitive ability distribution, with WCN showing the greatest 665
magnitude of norming error at the lowest levels of person location. 666
To put this finding into context, consider how the raw score distribution of a demographic 667
subgroup is affected differentially by adding additional individuals, as opposed to weighting the 668
existing raw scores without increasing sample size. Adding more individuals increases the 669
variance of the raw score distribution, whereas weighting existing raw scores does not affect the 670
variance. In population 4, furthermore, the variance of the low and high ability groups was 671
reduced by the pattern of under-representation, which results in fewer individuals in each of these 672
groups. Therefore, weighting the raw scores of the under-represented groups increases the 673
influence of any sampling error that exists in the raw score distributions. This phenomenon may 674
explain our finding that WCN resulted in greater norming error at the under-represented, low 675
levels of person location. By contrast, WCN did not yield increased norming error at the average 676
levels of person location, where there are more observations present and a consequent reduction 677
in sampling error. Our findings suggest that researchers should employ WCN with caution when 678
processing normative samples where both low-performing and high-performing subgroups are 679
substantially under-represented. 680
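One common way to quantify how strongly unequal weights amplify sampling error is Kish's approximate effective sample size, (Σw)² / Σw². It was not part of the analyses reported here and is shown only as a hedged illustration in Python; the weights in the example are hypothetical.

```python
def effective_sample_size(weights):
    """Kish's approximation: squared sum of weights divided by the sum of squared weights."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

# 100 cases with equal weights: no loss of precision.
equal = [1.0] * 100

# Hypothetical: 10 under-represented cases are each weighted five times as heavily.
unequal = [5.0] * 10 + [1.0] * 90

n_equal = effective_sample_size(equal)      # 100.0
n_unequal = effective_sample_size(unequal)  # 140**2 / 340, roughly 57.6
```

Although both samples contain 100 observed cases, the weighted sample carries about as much information as roughly 58 equally weighted cases, because a few heavily weighted observations dominate the estimates.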
In population 5, the joint distributions of the demographic variables (resulting from a 681
complete cross-classification of the three variables) were manipulated in a pattern of alternating 682
over- and under-representation. This pattern was accomplished so that the marginal distributions 683
of the variables closely approximated those of the reference population. Thus, population 5 684
simulates a sampling scenario wherein demographic misrepresentation occurs at a level that is 685
“beyond the reach” of cNORM’s raking method, which operates only on marginal distributions. 686
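Why such misrepresentation lies beyond the reach of raking can be seen in a toy example with two binary demographic variables. The cell counts and category labels below are hypothetical; the sketch only computes the adjustment factors that a single raking step would apply.

```python
from collections import Counter

# Hypothetical 2 x 2 sample: cell counts for (education, region).
cells = {
    ("high", "north"): 40, ("high", "south"): 10,
    ("low", "north"): 10, ("low", "south"): 40,
}
# In the assumed reference population, every joint cell holds 25% of the cases.
population_joint = {cell: 0.25 for cell in cells}
population_margin = {"high": 0.5, "low": 0.5, "north": 0.5, "south": 0.5}

n = sum(cells.values())
sample_margin = Counter()
for (education, region), count in cells.items():
    sample_margin[education] += count / n
    sample_margin[region] += count / n

# A raking step multiplies each case weight by target margin / current margin.
# All factors equal 1 here, so raking would leave every case weight unchanged.
factors = {cat: population_margin[cat] / sample_margin[cat] for cat in population_margin}

# The joint cells nevertheless deviate clearly from the reference (0.40 or 0.10 vs. 0.25).
sample_joint = {cell: count / n for cell, count in cells.items()}
```

Because the sample margins already match the targets, the raking weights remain at 1 for all cases, and the distorted joint distribution passes through uncorrected.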
Under these conditions, WCN did not provide any improvement in the reduction of norm-687
score bias over SCN. However, our manipulation of the joint probabilities, as it turned out, did 688
not strongly affect the means and variances of the raw score distributions. On the one hand, this 689
result leaves unanswered the question of how WCN might perform when misrepresentation at the 690
level of joint probabilities does bias the parameters of the raw score distributions more severely. 691
On the other hand, we would also like to emphasize that our aim was not necessarily to
simulate deviations from representativeness as large as possible. Instead, we wanted to simulate 693
different realistic scenarios and demonstrate the size of the respective effects, no matter whether 694
these effects are large or small. With population 5, we wanted to simulate a condition where the 695
marginal distributions of the demographic variables satisfy representativeness, but the joint 696
distributions do not. We sought to manipulate the distributions of the demographic variables in 697
such a way that the resulting raw score distribution would deviate as much as possible from the 698
unbiased distribution in population 1. But with perfectly representative marginal probabilities, 699
this is in fact virtually impossible (at least we did not succeed). What can we possibly learn from 700
this specific condition? We can learn that one need not worry too much about the joint 701
distributions of the demographic variables as long as the marginal distributions are in line with 702
representativeness (which, by the way, is the principle of raking). 703
It is important to keep in mind, though, that in our simulation study, the three 704
demographic variables were modeled so that education had the strongest relationship with 705
cognitive ability, and thus had more impact on norm score accuracy than ethnicity or region. 706
Thus, our findings with population 5 do not reflect the range of possible relationships between 707
demographic factors and cognitive ability (e.g., other variables that are highly correlated with 708
ability, or variables that interact with each other). In these alternate scenarios, misrepresentation 709
in the joint distributions might possibly affect raw score means and variances more severely than 710
was the case in population 5 in this study. Later in this section, we provide guidance on how to 711
address these scenarios in practice. 712
In population 6 (clustered distributions), the distributions of the demographic variables 713
were manipulated within each of the six age cohorts. This manipulation is best understood in 714
comparison to population 1, in which the marginal and joint probabilities of the entire population 715
are replicated within each age cohort. In population 6, by contrast, two-thirds of the joint 716
distribution cells contained no data, meaning that the overall demographic distributions were not 717
replicated within the age cohorts. However, the pattern of data deletion was such that the 718
marginal and joint probabilities of the demographic variables, averaged over the entirety of the 719
population (across all age cohorts), matched those of population 1. 720
In normative samples drawn from population 6, the age-specific patterns of demographic
non-representativeness affected the parameters of the raw score distributions within each age
cohort. Raking per se cannot compensate for discrepancies of this nature, because raking operates
on marginal probabilities of the entire normative sample, not those within each age cohort. At
first glance, it may seem counter-intuitive to find, as we did, that neither WCN nor SCN yielded
increases in norm-score bias when compared to the benchmark condition. We attribute this
finding to the influence of the semi-parametric continuous norming method that underlies both
WCN and SCN. As noted previously, this method models the raw-score-norm-score relationship
as a function of age and person location using polynomial regression over the complete age range
of the norm sample, instead of computing the norm scores for each age cohort separately.
Therefore, the previously mentioned stiffness of the method seems capable of reducing the effects
of the varying age-specific departures from demographic representativeness.
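To make concrete why raking only controls the marginal distributions of the whole sample, the following minimal Python sketch implements raking as iterative proportional fitting. It is an illustration under stated assumptions, not the cNORM implementation: the function names `rake` and `weighted_share`, the toy sample, and the target proportions are all hypothetical.

```python
from collections import defaultdict

def rake(sample, targets, max_iter=500, tol=1e-10):
    """Return one weight per case so that weighted marginals match `targets`.

    sample  : list of dicts, e.g. {"education": "low", "region": "south"}
    targets : dict mapping variable -> {category: target proportion}
    """
    n = len(sample)
    weights = [1.0] * n
    for _ in range(max_iter):
        max_change = 0.0
        for var, target in targets.items():
            # Current weighted total per category of this variable.
            totals = defaultdict(float)
            for case, w in zip(sample, weights):
                totals[case[var]] += w
            grand_total = sum(totals.values())
            # Rescale so that this variable's weighted marginals hit the target.
            for i, case in enumerate(sample):
                current_share = totals[case[var]] / grand_total
                factor = target[case[var]] / current_share
                max_change = max(max_change, abs(factor - 1.0))
                weights[i] *= factor
        if max_change < tol:
            break
    mean_weight = sum(weights) / n
    return [w / mean_weight for w in weights]  # normalise to a mean weight of 1

def weighted_share(sample, weights, var, category):
    """Weighted proportion of cases falling into `category` of `var`."""
    hits = sum(w for case, w in zip(sample, weights) if case[var] == category)
    return hits / sum(weights)

# Hypothetical toy sample (100 cases) and hypothetical population marginals.
sample = (
    [{"education": "low", "region": "south"}] * 30
    + [{"education": "low", "region": "east"}] * 20
    + [{"education": "high", "region": "south"}] * 10
    + [{"education": "high", "region": "east"}] * 40
)
targets = {
    "education": {"low": 0.6, "high": 0.4},
    "region": {"south": 0.7, "east": 0.3},
}
weights = rake(sample, targets)
```

Only the marginal shares of `education` and `region` are passed to the procedure. Nothing in the input describes the joint cells or any age cohort, which is why discrepancies at those levels are not corrected by the weights.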
Implications for the use of WCN in test norming
Our study showed that WCN reduces norm-score bias under certain patterns of
non-representativeness of a demographic variable, where that variable is strongly correlated with
the test score being normed. The pattern of results across the six simulated populations, however,
suggested that even when a demographic variable has a strong effect on raw scores, it produces
relatively small distortions in resulting norm scores. Even under conditions representing large
departures from demographic representativeness, the differences in RMSE between norm scores
derived using WCN and those from representative samples did not exceed 2 IQ points. With
norming methods that generate norms independently for each age group, we would expect
departures from demographic representativeness to cause greater levels of norm-score bias
(W. Lenhard & Lenhard, 2021). These conventional norming methods lack the previously noted
advantage of continuous norming, which can smooth out local effects of non-representativeness.
Consistent with this view, we have demonstrated previously that with conventional norming per
age group, RMSE is more than three times as high, on average, as with semi-parametric
continuous norming, even with representative random samples (W. Lenhard & Lenhard, 2021).
The selection of an appropriate norming method is therefore a critical prerequisite for accurate
test norms, regardless of whether this procedure is used with or without weighting. By contrast,
the size of the normative sample is less critical if continuous norming is used. For example, we
found that increasing sample size from 100 to 250 per age group did not yield a significant
reduction in RMSE when continuous norming methods were used (A. Lenhard et al., 2019).
Clearly, the best practice is to prevent problems associated with non-representativeness in
the first place, by collecting an adequately sized, demographically representative sample for
norming. Post-hoc weighting procedures are no substitute for a well-planned data collection
effort that draws randomly from the general population. Care must also be taken to avoid
over-sampling from clinical settings, as this will bias the sample towards individuals of lower
ability. The current study demonstrates the utility of weighting procedures in reducing norm-score
error under conditions of mild-to-moderate non-representativeness of a demographic variable.
Nevertheless, we also found that the ability of WCN to reduce norm-score bias was degraded
when we reduced the marginal probability of the high level of education to 20% from the
reference value of 40% (that is, when the size of that subgroup was half that needed for a
representative sample). Our work further shows that the effectiveness of weighting depends on
the location of under-represented demographic groups on the spectrum of person ability. With a
typical cognitive ability that is normally distributed in the general population, random sampling
will yield relatively small subgroups at either tail of the ability distribution. If these extreme
subgroups are under-sampled to begin with, any sampling error embodied in the raw score
distributions will only be multiplied by the application of compensatory weights. This can lead to
increased norm-score bias, as illustrated in our results. The remedy, of course, is to ensure that
these low- and high-ability groups are represented in adequate numbers.
As described previously, the raking procedure used in this study operates only on the
marginal distributions of the demographic variables. Census information on the joint distributions
of the demographic variables (e.g., the expected probability for the joint category of low
education/non-white ethnicity) is not always available. However, when that information is
available, it can be incorporated in the raking procedures through a recoding process. For
example, the crossing of two demographic variables, each of which has three categories, results
in nine cross-classification cells. These classifications can be recoded into nine levels of a single
dummy variable. The expected joint probabilities of the cross-classified cells thereby become the
expected marginal probabilities of the dummy variable. The risk in this approach comes from
increasing the number of categories, which also increases the likelihood that one or more
categories would have a very low expected probability. Under these circumstances, of course,
even adequately sampled categories may hold only a few individuals, thus increasing the
influence of sampling error when weights are applied. To counter this tendency, we often
recommend reducing the number of demographic categories by combining groups that are not
expected to differ significantly in mean location on the ability variable. This practice can be
applied to either marginal categories or joint cross-classifications, when the latter are subject to
the recoding procedure described above.
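The recoding step and the subsequent merging of sparse categories can be sketched in a few lines of Python. This is a minimal illustration, not part of cNORM: the joint probabilities are hypothetical values chosen only so that they sum to 1, and the names `combined_level`, `merge_levels` and `edu_eth` are our own.

```python
# Hypothetical joint target probabilities for education x ethnicity
# (in practice these would come from census cross-tabulations).
joint_targets = {
    ("low", "native"): 0.10, ("low", "mixed"): 0.15, ("low", "non-native"): 0.15,
    ("medium", "native"): 0.08, ("medium", "mixed"): 0.08, ("medium", "non-native"): 0.04,
    ("high", "native"): 0.12, ("high", "mixed"): 0.17, ("high", "non-native"): 0.11,
}

def combined_level(education, ethnicity):
    """Recode two categorical values into one level of a single variable."""
    return f"{education}/{ethnicity}"

# The nine joint probabilities become the marginals of one combined variable.
combined_targets = {
    combined_level(edu, eth): p for (edu, eth), p in joint_targets.items()
}

# Each case in the sample receives the matching combined level.
sample = [
    {"education": "low", "ethnicity": "mixed"},
    {"education": "high", "ethnicity": "native"},
]
for case in sample:
    case["edu_eth"] = combined_level(case["education"], case["ethnicity"])

def merge_levels(targets, merge_map):
    """Combine sparse levels by summing their target probabilities."""
    merged = {}
    for level, p in targets.items():
        new_level = merge_map.get(level, level)
        merged[new_level] = merged.get(new_level, 0.0) + p
    return merged

# Example: fold the rarest cell into a neighbouring one. Whether two groups can
# be merged is a substantive judgement about their expected ability levels.
merge_map = {
    "medium/mixed": "medium/mixed+non-native",
    "medium/non-native": "medium/mixed+non-native",
}
merged_targets = merge_levels(combined_targets, merge_map)
```

The combined variable `edu_eth` can then be supplied to a raking routine like any other marginal. If levels are merged, the same `merge_map` has to be applied to the recoded sample so that sample categories and target categories stay aligned.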
Limitations of the study
This study evaluated only one method of post-stratification: raking with marginal
probabilities as the input. We did not examine fully cross-classified post-stratification (i.e., a
method that takes joint probabilities into account). Instead, we analyzed norm samples drawn
from population 5 (biased joint distributions), to determine the performance of raking under
conditions where the marginal probabilities are representative, but the joint probabilities are not.
In population 5, we did not find that WCN, which includes raking, yielded increased norm-score
bias compared to the benchmark condition. This result may have been due to the magnitude of
non-representativeness in the cross-classification cells. The demographic deficiencies in these
cells may not have been great enough to expose the inability of raking to compensate for such
deficiencies.
In our study, we simulated three demographic variables (education, ethnicity, region),
with varying levels of correlation with the latent cognitive ability (strong, moderate, weak,
respectively). We did not model any interactions among these three variables. Demographic
variables that interact in their effects on cognitive ability might yield larger disturbances in the
raw score distributions of the cross-classification cells. Under these conditions, as we have
demonstrated, weighting carries the risk of increasing norm-score bias. However, the main effect
of education on test scores in our study probably represents the upper limit of analogous effects
that could occur in real-world normative samples. With demographic variables that have smaller
effect sizes, of course, we can expect resulting norm-score biases to also diminish in magnitude.
A second limitation was that our study modeled only one latent psychological variable: a
cognitive ability that increases monotonically with increasing age. Other variables measured by
psychometric tests (e.g., the “Big Five” personality traits; Donnellan & Lucas, 2008) may not
manifest the same dependency on age, and they may be affected by demographic variables with
different characteristics than the ones simulated in our study. When norming tests of personality
traits, therefore, it may be appropriate to apply a weighting method that is not combined with
continuous norming procedures.
A third caution relates to the mathematical underpinnings of the cNORM norming
process. cNORM uses a semi-parametric continuous norming method that requires the expansion
of a Taylor polynomial (for more details, see A. Lenhard, Lenhard, Suggate & Segerer, 2018).
The modeling process calls for specification of a parameter (k) that sets an upper bound on the
exponents of person location and age. In the current study, we used a default value of k = 4 for
both location and age. It is possible that more precise models of the latent cognitive ability could
have been obtained with different values of k. Simulation studies published elsewhere
(W. Lenhard & Lenhard, 2021; A. Lenhard, Lenhard, & Gary, 2019; Gary & Lenhard, 2021) have
compared norm-score bias across a range of values of k. These findings suggest that k = 5 for
location and k = 3 for age provide an optimal balance between norm score accuracy and
processing load. As a result, we have selected these values as the defaults for the current version
of cNORM.
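To illustrate what the parameter k controls, the following Python sketch enumerates the candidate polynomial terms. It assumes, following the description in A. Lenhard et al. (2018), that the candidate predictors are all products of powers of person location L and age A with exponents up to the respective bound, excluding the constant term. The function names `power_terms` and `design_row` are hypothetical and do not belong to the cNORM API.

```python
def power_terms(k_location, k_age):
    """Exponent pairs (s, t) for the terms L**s * A**t, without the constant."""
    return [
        (s, t)
        for s in range(k_location + 1)
        for t in range(k_age + 1)
        if (s, t) != (0, 0)
    ]

def design_row(location, age, terms):
    """Evaluate every candidate term for one person (one regression row)."""
    return [location ** s * age ** t for s, t in terms]

# Setting used in the current study: k = 4 for both location and age.
terms_study = power_terms(4, 4)

# Setting described as the newer default: k = 5 for location, k = 3 for age.
terms_default = power_terms(5, 3)

print(len(terms_study), len(terms_default))
```

Under this assumption, both settings produce a similar number of candidate terms (24 versus 23), but they distribute the flexibility differently between location and age. The fitted model would typically retain only a subset of these terms, selected by the regression procedure.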
Finally, our study examined weighting only as applied to the semi-parametric continuous
norming method implemented in the cNORM package. We did not combine weighting with other
continuous norming approaches, such as parametric continuous norming (e.g., Stasinopoulos et
al., 2018), nor did we combine it with traditional raw-to-norm-score mapping performed
separately per age group. The latter does not seem very promising anyway, since traditional
approaches usually lead to norm-score bias that is more than three times as high on average as
continuous norming with the same sample size, when applied to perfectly
representative samples (W. Lenhard & Lenhard, 2021). Therefore, traditional norming would
have to benefit much more from weighting than continuous norming to overcome this general
shortfall. There is simply no reason why such a disproportionate benefit should actually occur.
Raking per age group, which would be necessary when combined with traditional norming
approaches, might even entail its own pitfalls: The smaller the sample, the larger the
expected deviations from representativeness. But as we have shown in this study, large deviations
from representativeness can in some cases lead to suboptimal results of weighting techniques. In
a nutshell: Continuous norming is a major advance over traditional norming methods, and
WCN goes one step further.
Concerning parametric continuous norming approaches, we pointed out earlier in this
paper that the semi-parametric continuous method, because it does not rely on splines to model
age-related changes in ability, may be better suited for certain conditions of
non-representativeness in normative samples (e.g., the clustered distributions modeled in
population 6). Moreover, we have demonstrated elsewhere (see A. Lenhard et al., 2019) that the
cNORM approach yields less norm-score bias than parametric continuous norming with skewed
raw score distributions, and with sample sizes of 150 or less per age group. Yet, the efficiency of
post-stratification techniques combined with parametric continuous norming remains to be
investigated, since other regression-based norming methods (e.g., Oosterhuis, 2017) and
parametric continuous norming approaches (e.g., Voncken et al., 2019) might also benefit from
the use of weighting methods. Note that the sample weights are computed before the actual
norming process begins, so the weighting step is not tied to a particular norming method.
The application of weighting techniques to the norming of psychometric tests is a
relatively new area of study. Unsurprisingly, therefore, several additional research questions
emerged from the current simulation protocol. For example, we implemented raking weights
twice within cNORM: once during ranking of raw scores, and then again during regression
modeling. But we did not evaluate the relative value, in terms of reducing norm-score error, of
the second step. It is therefore possible that applying weights to the regression analysis was of
little benefit, or that it may have even increased norm-score bias. The latter might occur because,
as noted previously, weighting can multiply the effects of sampling error in under-represented
groups.
Concluding Remarks and Outlook
Weighting techniques are no substitute for the painstaking process of assembling a
demographically representative normative sample. Our study has shown, however, that if such
samples still exhibit reasonably small departures from representativeness, the weighting methods
implemented in cNORM offer a useful way of mitigating any resulting norm score bias.
References
Cole, T. J. (1988). Fitting smoothed centile curves to reference data. Journal of the Royal
Statistical Society Series A (Statistics in Society), 151(3), 385.
Cole, T. J., & Green, P. J. (1992). Smoothing reference centile curves: The LMS method and
penalized likelihood. Statistics in Medicine, 11(10), 1305–1319.
Donnellan, M. B., & Lucas, R. E. (2008). Age differences in the Big Five across the life span:
Evidence from two national samples. Psychology and Aging, 23(3), 558–566.
https://doi.org/10.1037/a0012897
Gary, S., & Lenhard, W. (2021). In norming we trust—Verfahren zur statistischen Modellierung
kontinuierlicher Testnormen auf dem Prüfstand. Diagnostica, 67(2), 75–86.
https://doi.org/10.1026/0012-1924/a000263
Ireland, C. T., & Kullback, S. (1968). Contingency tables with given marginals. Biometrika,
55(1), 179–188. https://doi.org/10.1093/biomet/55.1.179
Kalton, G., & Flores-Cervantes, I. (2003). Weighting methods. Journal of Official Statistics,
19(2), 81–97.
Kaufman, A. S., & Kaufman, N. L. (2004). Kaufman Assessment Battery for Children Second
Edition. San Antonio: Pearson Clinical Assessment.
Kennedy, L., & Gelman, A. (2021). Know your population and know your model: Using
model-based regression and poststratification to generalize findings beyond the observed sample.
Psychological Methods, 26(5), 547–558. https://doi.org/10.1037/met0000362
Lenhard, A., Lenhard, W., & Gary, S. (2018). cNORM (Continuous Norming). The
Comprehensive R Archive Network. https://cran.r-project.org/web/packages/cNORM/index.html.
https://doi.org/10.1177/1073191116656437
Lenhard, A., Lenhard, W., & Gary, S. (2019). Continuous norming of psychometric tests: A
simulation study of parametric and semi-parametric approaches. PLOS ONE, 14(9),
e0222279. https://doi.org/10.1371/journal.pone.0222279
Lenhard, A., Lenhard, W., Suggate, S., & Segerer, R. (2018). A continuous solution to the
norming problem. Assessment, 25(1), 112–125. https://doi.org/10.1177/1073191116656437
Lenhard, A., Lenhard, W., Segerer, R., & Suggate, S. (2015). Peabody Picture Vocabulary Test -
Revision 4 (PPVT-4), German adaptation. Pearson Assessment.
Lenhard, W., & Lenhard, A. (2021). Improvement of norm score quality via regression-based
continuous norming. Educational and Psychological Measurement, 81(2), 229–261.
https://doi.org/10.1177/0013164420928457
Lenhard, W., Lenhard, A., & Schneider, W. (2017). ELFE II - Ein Leseverständnistest für Erst-
bis Siebtklässler. Hogrefe.
Little, R. J. (1993). Post-stratification: A modeler’s perspective. Journal of the American
Statistical Association, 88(423), 1001–1012. https://doi.org/10.2307/2290792
Lumley, T. (2011). Complex Surveys – A Guide to Analyses Using R. Wiley.
Lumley, T. (2017). leaps: Regression Subset Selection. The Comprehensive R Archive Network.
https://cran.r-project.org/web/packages/leaps/index.html
Oosterhuis, H. E., van der Ark, L. A., & Sijtsma, K. (2016). Sample size requirements for
traditional and regression-based norms. Assessment, 23(2), 191–202.
https://doi.org/10.1177/1073191115580638
Oosterhuis, H. E. M. (2017). Regression-Based Norming for Psychological Tests and
Questionnaires. Unpublished PhD thesis, Tilburg University.
Park, D. K., Gelman, A., & Bafumi, J. (2004). Bayesian multilevel estimation with
post-stratification: State-level estimates from national polls. Political Analysis, 12(4), 375–385.
https://doi.org/10.1093/pan/mph024
Price-Mohr, R., & Price, C. (2017). Gender differences in early reading strategies: A comparison
of synthetic phonics only with a mixed approach to teaching reading to 4–5 year-old
children. Early Childhood Education Journal, 45, 613–620.
https://doi.org/10.1007/s10643-016-0813-y
Sang-Wook Y., Heechoul O., Soon-Ae S., & Jee-Jeon Y. (2015). Sex-age-specific association of
body mass index with all-cause mortality among 12.8 million Korean adults: A prospective
cohort study. International Journal of Epidemiology, 44(5), 1696–1705.
https://doi.org/10.1093/ije/dyv138
Stasinopoulos, M. D., Rigby, R. A., Voudouris, V., Akantziliotou, C., Enea, M., & Kiose, D.
(2018). gamlss: Generalised Additive Models for Location Scale and Shape. The
Comprehensive R Archive Network. https://cran.r-project.org/web/packages/gamlss/index.html
Voncken, L., Albers, C. J., & Timmerman, M. E. (2019). Model selection in continuous test
norming with GAMLSS. Assessment, 26(7), 1329–1346.
https://doi.org/10.1177/1073191117715113
Voncken, L., Kneib, T., Albers, C. J., Umlauf, N., & Timmerman, M. E. (2020). Bayesian
Gaussian distributional regression models for more efficient norm estimation. British
Journal of Mathematical and Statistical Psychology, 74, 99–117.
https://doi.org/10.1111/bmsp.12206
Wechsler, D. (1939). The measurement of adult intelligence. Williams & Wilkins Co.
https://doi.org/10.1037/10020-000
Wechsler, D. (2008). WAIS-IV Technical and interpretive manual. Pearson.
Wechsler, D. (2014). WISC-V Technical and interpretive manual. Pearson.
Zachary, R. A., & Gorsuch, R. L. (1985). Continuous norming: Implications for the WAIS-R.
Journal of Clinical Psychology, 41(1), 86–94.
Figure Captions
Figure 1. Modeled cognitive ability in reference population.
Figure 2. RMSE across simulated populations, with (WCN) or without (SCN) weighting.
The grey rectangles represent 95% confidence intervals.
Figure 3. MSD across simulated populations, with (WCN) or without (SCN) weighting.
The grey rectangles represent 95% confidence intervals.
Figure 4. RMSE across simulated populations, with (WCN) or without (SCN) weighting,
as a function of person location. The dotted grey line represents SCN with norm samples drawn
from population 1 (benchmark).
Figure 5. MSD across simulated populations, with (WCN) or without (SCN) weighting,
as a function of person location. The dotted grey line represents SCN with norm samples drawn
from population 1 (benchmark).
Table 1
Simulated populations for norming input
| No. | Label | Description | Hypothesized effects on distribution of cognitive ability variable |
|---|---|---|---|
| 1 | Reference | Benchmark distributions of demographic variables; the standard of comparison for describing the “representativeness” of the other simulated populations. | Not applicable (benchmark population). |
| 2 | Mild under-representation of high education | Lower proportion of high-education individuals, higher proportion of low-education individuals, than population 1. | Both mean and variance affected. |
| 3 | Moderate under-representation of high education | The pattern of divergence of education proportions is similar to population 2, but the degree of non-representativeness is greater. | Both mean and variance affected. |
| 4 | Under-representation of both low and high education | Both tails of the education distribution have lower proportions than population 1. | Only variance affected. |
| 5 | Biased joint distributions | Marginal distributions of demographic variables match population 1; joint distributions (cross classifications) do not match population 1. The pattern of non-representation alternates from over- to under-represented across the 27 (3 x 3 x 3) joint distributions. | Only variance affected. |
| 6 | Clustered distributions | Marginal and joint distributions of demographic variables match population 1, but only when averaged across all six age cohorts. Within each age cohort, two-thirds of the joint distribution cells contain no data. | Only variance affected. |
Table 2
Assignment of cognitive ability levels to demographic categories
| Demographic variable | Below-average ability | Average ability | Above-average ability |
|---|---|---|---|
| Education | low | medium | high |
| Ethnicity | native | mixed | non-native |
| Region | south | east | northwest |
Table 3
Distributions of demographic variables, by category, in the reference population
| Ethnicity | Region | Low education (40%) | Medium education (20%) | High education (40%) |
|---|---|---|---|---|
| native (30%) | south (60%) | 7.2% | 3.6% | 7.2% |
| native (30%) | east (20%) | 2.4% | 1.2% | 2.4% |
| native (30%) | north-west (20%) | 2.4% | 1.2% | 2.4% |
| mixed (40%) | south (60%) | 9.6% | 4.8% | 9.6% |
| mixed (40%) | east (20%) | 3.2% | 1.6% | 3.2% |
| mixed (40%) | north-west (20%) | 3.2% | 1.6% | 3.2% |
| non-native (30%) | south (60%) | 7.2% | 3.6% | 7.2% |
| non-native (30%) | east (20%) | 2.4% | 1.2% | 2.4% |
| non-native (30%) | north-west (20%) | 2.4% | 1.2% | 2.4% |
Notes. The benchmark marginal distributions of the demographic variables are shown in
parentheses: in the column headers for education, and next to the category labels in the first two
columns for ethnicity and region. The joint distributions for the complete cross-classification of
the three variables are shown in the table cells.
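The cell values in Table 3 appear to equal the product of the corresponding marginal probabilities, which is what one would expect if the three demographic variables are independent in the reference population. The following short Python check reproduces the 27 joint probabilities from the marginals given in the table; the variable names are our own.

```python
from itertools import product

# Marginal probabilities as given in Table 3.
education = {"low": 0.40, "medium": 0.20, "high": 0.40}
ethnicity = {"native": 0.30, "mixed": 0.40, "non-native": 0.30}
region = {"south": 0.60, "east": 0.20, "north-west": 0.20}

# Joint probability of each cell, assuming independence of the three variables.
joint = {
    (eth, reg, edu): ethnicity[eth] * region[reg] * education[edu]
    for eth, reg, edu in product(ethnicity, region, education)
}

# Print the cells as percentages for comparison with the table.
for (eth, reg, edu), p in joint.items():
    print(f"{eth:<11}{reg:<11}{edu:<7}{100 * p:.1f}%")
```

For example, the cell for native ethnicity, south region and low education is 0.30 x 0.60 x 0.40 = 0.072, matching the 7.2% shown in the table.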