European Journal of Psychological Assessment
A Common Measurement Scale for Self-Report Instruments in Mental
Health Care: T Scores With a Normal Distribution
Edwin de Beurs, Suzan Oudejans, and Berend Terluin
Online First Publication, December 16, 2022. https://dx.doi.org/10.1027/1015-5759/a000740
CITATION
de Beurs, E., Oudejans, S., & Terluin, B. (2022, December 16). A Common Measurement Scale for Self-Report Instruments
in Mental Health Care: T Scores With a Normal Distribution. European Journal of Psychological Assessment. Advance online
publication. https://dx.doi.org/10.1027/1015-5759/a000740
Original Article
A Common Measurement Scale for Self-Report Instruments in Mental Health Care
T Scores With a Normal Distribution
Edwin de Beurs (1, 2), Suzan Oudejans (3), and Berend Terluin (4)
1 Department of Clinical Psychology, Faculty of Social Sciences, Leiden University, The Netherlands
2 Arkin Mental Health Institute, Amsterdam, The Netherlands
3 Mark Bench, Amsterdam, The Netherlands
4 EMGO Institute, VU Medical Center, Amsterdam, The Netherlands
Abstract: The diversity of measures in clinical psychology hampers a straightforward interpretation of test results, complicates communication with the patient, and constitutes a challenge to the implementation of measurement-based care. In educational research and assessment, it is common practice to convert test scores to a common metric, such as T scores. We recommend applying this also in clinical psychology and propose and test a procedure to arrive at T scores approximating a normal distribution that can be applied to individual test scores. We established formulas to estimate normalized T scores from raw scale scores by regressing IRT-based θ scores on raw scores. With data from a large population and clinical samples, we established crosswalk formulas. Their validity was investigated by comparing calculated T scores with IRT-based T scores. IRT and formulas yielded very similar T scores, supporting the validity of the latter approach. Theoretical and practical advantages and disadvantages of both approaches to convert scores to common metrics and alternative approaches are discussed. Provided that scale characteristics allow for their computation, T scores will help to better understand measurement results, which makes it easier for patients and practitioners to use test results in joint decision-making about the course of treatment.
Keywords: IRT, T scores, percentile ranks, common metric, norms, clinical
The importance of measurement for clinical management and quality improvement of Mental Health Care (MHC) is widely acknowledged (Kilbourne et al., 2018; Lambert & Harmon, 2018). Inspired by the recovery movement (Slade et al., 2008) and developments such as shared decision-making (Patel et al., 2008), feedback-informed treatment (Miller et al., 2015), and measurement-based care (Harding et al., 2011; Kilbourne et al., 2018), patients' needs and preferences are granted a more prominent role in MHC. Increased patient involvement requires being well-informed about the severity of one's condition at the onset of treatment and about the progress made in the journey toward recovery. Therefore, a good understanding of measurement results is needed when findings of Routine Outcome Monitoring (ROM; de Beurs et al., 2011) are shared. However, the huge diversity of measurement instruments in clinical practice, each with its own measurement scale, complicates straightforward communication among professionals and between professionals and patients about their measurement results. This might hamper further implementation of measurement-based care. Using common, measure-independent metrics for the results of clinical tests may solve this problem (de Beurs, Fried, et al., 2022).
Two metrics have been proposed and researched in recent years: T scores and percentile ranks. T scores, first proposed by McCall (1922), are standardized scores (Z scores) multiplied by 10 with 50 added, resulting in a metric with M = 50 (SD = 10). The T score denotes the commonness of a test result by its distance from the mean of a reference group in standard units. T scores require the assumption that test results are measured on an interval scale. Z scores (and T scores) based on data from a reference population may have a highly skewed frequency distribution. It is advisable first to transform the raw scores to have a normal frequency distribution before converting them to T scores, which results in normalized T scores. When the normalized T score metric is calibrated on the
general population, most psychiatric patients will score around 65–70 on measures of psychopathology before treatment, and when treated successfully, their scores will decrease to 55–60 over time. Percentile rank scores have been proposed as an alternative way to express how common or exceptional a test result is (Crawford & Garthwaite, 2009; Ley, 1972). They quantify the rarity of a tested person's score as a percentage. However, percentile scores are not on an interval scale but ordinal and are, especially at the extremes, easily misunderstood (Bowman, 2002).
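As an illustration of how the two metrics relate, the following minimal sketch converts a raw score to a standard T score and maps T scores to percentile ranks. It assumes that T scores are normally distributed in the reference population; the function names are ours and the example raw score is invented, while the reference mean and standard deviation are the BSI-GSI population values reported in Table 1.

```python
from statistics import NormalDist


def standard_t(raw, ref_mean, ref_sd):
    """Standard T score: T = 10 * Z + 50, with Z relative to a reference group."""
    z = (raw - ref_mean) / ref_sd
    return 10.0 * z + 50.0


def t_to_percentile(t):
    """Percentile rank of a T score, assuming T ~ Normal(50, 10)."""
    return 100.0 * NormalDist(mu=50.0, sigma=10.0).cdf(t)


# A raw BSI-GSI score one SD above the population mean (M = 0.38, SD = 0.34)
print(round(standard_t(0.72, 0.38, 0.34), 1))

# Percentile ranks for a few T scores under the normality assumption
for t in (50, 60, 70):
    print(t, round(t_to_percentile(t), 1))
```

Under this assumption, equal steps in T correspond to unequal steps in percentile rank, which is one way to see why percentile ranks are ordinal rather than interval-scaled.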
There is a substantial literature on converting measurement instruments to a common metric (Dorans et al., 2007; Holland et al., 2006), and the subject is closely linked to norming instruments (Mellenbergh, 2011). Most frequently used are distribution-based approaches, such as percentile conversion (Kolen & Brennan, 2014), and methods based on Item Response Theory (IRT; Embretson & Reise, 2013). Following the latter approach, several researchers have published reports on linking measures for the same constructs and have proposed population-based T scores as a common metric for the severity of depression (Choi et al., 2014; Fischer et al., 2011; Schalet et al., 2015; Wahl et al., 2014), anxiety (Schalet et al., 2014), pain (Cook et al., 2015), physical functioning (Schalet et al., 2015), fatigue (Friedrich et al., 2019), psychological distress (Batterham et al., 2018), and quality of life measures (ten Klooster et al., 2013). Usually, general population samples are used to ease interpretability of the resulting metric (Wahl et al., 2014). If appropriately sampled, such samples reflect the general population. In contrast, clinical samples vary in composition and severity or complexity of the disorder, and as such, they are less useful as a reference group for general psychopathology measures. Various clinical measurement instruments are administered to the same group of respondents, and an IRT estimate, usually obtained with the expected a posteriori (EAP) method (Lord & Wingersky, 1984), is used to estimate θ in order to express scores in the common metric. This endeavor is also known as the PROsetta Stone project (Schalet et al., 2015).
A potential drawback of the IRT approach is that it requires a comprehensive dataset and dedicated software to calculate θ scores. In clinical practice, this is not always feasible: a clinician may only have a raw scale score from an individual patient and no access to the algorithm needed to obtain an IRT-based θ score. To help out, several authors provide crosswalk tables to translate the raw test score into a T score (Batterham et al., 2018; Choi et al., 2014; Schalet et al., 2014). However, reading from crosswalk tables is cumbersome and prone to error. The relation between raw scores and T scores can also be modeled and expressed in a function. Such a function to calculate T scores can easily be used or implemented in ROM software and provides an alternative method to obtain scores on the common metric for individual patients. In line with international developments, T scores may be based on IRT models and stem from data from general population samples (de Beurs, Fried, et al., 2022). However, for everyday clinical practice, we propose an approach in which T scores are calculated with a conversion function. This will be feasible, even if only a test score from a single individual is available.
Figure 1 presents an overview of various approaches to obtaining T scores. The first approach (on the left) entails standardizing the sum score of items of a scale into Z scores and converting these to T scores with T = 10 × Z + 50 (standard T scores). The second approach is based on percentile rank scores, which are converted to normalized Z scores and subsequently to normalized T scores, and thus takes the frequency distribution of scores into account (percentile-based T scores). The third approach, advocated in the present paper, uses crosswalk formulas derived from the regression of T scores, based on the factor scores resulting from an IRT model, on sum scores (calculated T scores). The fourth approach is the IRT-based approach itself, practiced by, among others, the PROMIS group (θ-based T scores).
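The second approach (percentile-based T scores) can be sketched as follows. This is a minimal illustration, not the procedure of any particular norming study: it assumes a mid-rank definition of the percentile rank (half of the tied scores are counted as "below"), and the reference sample is invented.

```python
from statistics import NormalDist


def percentile_based_t(scores):
    """Normalized T scores via mid-rank percentile ranks in a reference sample.

    Returns a crosswalk dict mapping each observed raw score to a T score.
    """
    n = len(scores)
    std_normal = NormalDist()
    crosswalk = {}
    for value in sorted(set(scores)):
        below = sum(1 for s in scores if s < value)
        equal = sum(1 for s in scores if s == value)
        # The mid-rank proportion stays strictly between 0 and 1,
        # so the inverse normal CDF is always defined.
        p = (below + 0.5 * equal) / n
        z = std_normal.inv_cdf(p)  # normalized Z score
        crosswalk[value] = 50.0 + 10.0 * z
    return crosswalk


# Invented, right-skewed raw scores standing in for a reference sample
sample = [0, 0, 0, 0, 1, 1, 2, 3, 5, 9]
crosswalk = percentile_based_t(sample)
for raw, t in crosswalk.items():
    print(raw, round(t, 1))
```

Because the conversion uses only the rank order of scores, it removes skewness by construction, but it can only be evaluated for raw scores that occur in the reference sample.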
In this article, we present crosswalk formulas for three frequently used clinical measurement instruments: the Brief Symptom Inventory (BSI; Derogatis, 1975), the Four-Dimensional Symptom Questionnaire (4DSQ; Terluin et al., 2006), and the Outcome Questionnaire (OQ-45; Lambert et al., 2004). A necessary step is to investigate the validity of calculated T scores by comparing them with θ-based T scores. If there is high agreement, the transformation of raw scale scores with the conversion function is a valid approach for obtaining a proxy for θ-based T scores. For each measure, data were used from two samples: the general population and patients. IRT-based T scores were based on both samples, with the general population as the reference population. The patient samples allowed us to investigate whether the T scores were normally distributed in clinical samples. Measures for ROM, where patients are repeatedly assessed to determine change over time, preferably have an interval scale of measurement with a normal distribution. Finally, percentile ranks are provided based on the population and clinical samples.
Figure 1. Overview of approaches to establish T scores.
Methods
Datasets
BSI and 4DSQ data from the general population were collected on separate occasions in a large sample of the Dutch population, called the LISS panel (Long-term Internet Studies for the Social Sciences). The LISS panel is maintained by CentERdata, a research institute located on the Tilburg University campus, and includes about 5,300 household respondents who were approached with the help of the municipal population register (van der Laan, 2009). The sample is representative of the Dutch population (Scherpenzeel, 2018; Scherpenzeel & Bethlehem, 2011). In 2007/2008, one-third of all households were approached, of which one person each completed the BSI (n = 1,662). This sample was stratified by age (four strata: 18–29, 30–49, 50–64, 65+ years), gender, and ethnic origin. In 2013, normative data on the 4DSQ were collected from the entire LISS panel (n = 5,273). The sample and the procedure for collecting 4DSQ data were described by Terluin and colleagues (2016).
For the OQ-45, data were used from n = 1,810 respondents, who have been described by Timman and colleagues (2017): 1,000 respondents came from a panel of the TNS-NIPO research agency and were stratified by gender, age, socioeconomic status, and education level; the remaining 810 came from an earlier validation study (de Jong et al., 2007), of whom 448 came from a sample drawn from the telephone directory and 362 were invited via internal mail from 14 companies or non-commercial organizations (Timman et al., 2017).
Patient data for the BSI came from a dataset of patients referred for treatment of depression, anxiety disorders, or somatoform disorders at RijnVeste, the outpatient clinic of GGZ Rivierduinen in the city of Leiden. Data from n = 4,853 patients collected between 2002 and 2013 were used. The procedure of data collection has been described by de Beurs and colleagues (2011). For the 4DSQ, patient data were used from 199 patients of primary care physicians. All patients were on sick leave, reported elevated levels of stress, and participated in a trial evaluating an intervention for stress-related mental disorders (Bakker et al., 2007). The patient data for the OQ-45 came from 12,436 patients described by Timman and colleagues (2017). This comprised a mixed sample: patients in daycare (n = 481) or inpatient care (n = 484) from various MHC institutes; patients in outpatient care (basic and specialized MHC, n = 1,581 and n = 9,433, respectively); and patients treated by private practitioners (n = 457). According to Dutch law, anonymized questionnaire data collected to support treatment may be used for scientific research, and such use is exempt from an informed consent procedure.
Measures
The Brief Symptom Inventory (BSI; Derogatis, 1975; Dutch version: de Beurs & Zitman, 2006) consists of 53 items describing symptoms. Respondents indicate to what extent they were bothered by each symptom in the past week on a Likert-type scale from 0 (= not at all) to 4 (= very much). In this study, we limited ourselves to investigating the three most important scales of the BSI: depression (BSI-DEP), anxiety (BSI-ANX), and somatic complaints (BSI-SOM). Scale scores are calculated as the mean score of the comprising items and range from 0 to 4. In addition, the global score for the severity of psychopathology was analyzed: the mean score on all 53 items (Global Severity Index or BSI-GSI, range: 0–4).
The Four-Dimensional Symptom Questionnaire (4DSQ; Terluin et al., 2006) consists of 50 items, each describing one symptom. Respondents indicate how often they experienced the symptom in the past week on a scale from 0 (= no) to 4 (= very often or constantly). The 4DSQ comprises four scales: general distress (4DSQ-DIS, 16 items, range: 0–32), depression (4DSQ-DEP, 6 items, range: 0–12), anxiety (4DSQ-ANX, 12 items, range: 0–24), and somatic complaints (4DSQ-SOM, 16 items, range: 0–32). All scores are sum scores, calculated after recoding the two highest-frequency response options to 2 (i.e., 3 becomes 2 and 4 becomes 2), as recommended in the scoring instruction of this instrument (Terluin et al., 2006).
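The 4DSQ scoring rule can be sketched as follows. The function name and the example responses are hypothetical, and the assignment of items to scales (which is not listed here) is left to the caller.

```python
def score_4dsq_scale(item_responses):
    """Sum score for one 4DSQ scale after recoding responses 3 and 4 to 2."""
    for response in item_responses:
        if response not in (0, 1, 2, 3, 4):
            raise ValueError(f"invalid response: {response!r}")
    # min(r, 2) leaves 0, 1, 2 unchanged and maps 3 and 4 to 2.
    return sum(min(response, 2) for response in item_responses)


# Hypothetical answers to the six depression items
print(score_4dsq_scale([0, 1, 2, 3, 4, 4]))  # 0 + 1 + 2 + 2 + 2 + 2 = 9
```

With this recoding, each item contributes at most 2 points, which is why a 16-item scale ranges from 0 to 32 and the 6-item depression scale from 0 to 12.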
The Outcome Questionnaire (OQ-45; Lambert et al., 2004; Dutch version: de Beurs et al., 2005; de Jong et al., 2007) comprises 45 items describing symptoms or problems. The respondent is asked to indicate how often these emerged during the past week on a scale from 0 (= never) to 4 (= almost always). A total score can be calculated (OQ-TOTAL, 45 items, range: 0–180) and four subscale scores: Symptom Distress (OQ-SD, 25 items, range: 0–100), Interpersonal Relations (OQ-IR, 11 items, range: 0–44), Social Role (OQ-SR, 9 items, range: 0–36), and Anxiety and Social Distress (OQ-ASD, 13 items, range: 0–52). All scores are sum scores.
Statistical Analysis
Calculations According to Item Response Theory
The Graded Response Model for polytomous items was used, with the Expected A Posteriori (EAP) score as the estimator for θ, as implemented in the multidimensional IRT (mirt) package (Chalmers, 2012) in R. There are other
estimates available for the latent variable scores in IRT, such as Maximum Likelihood (ML) and Weighted Likelihood Estimates (WLE). We performed a sensitivity analysis, comparing EAP with these alternatives, and found sufficiently similar results: the mean θs had mostly overlapping 95% confidence intervals (CI95), and the estimates were highly correlated (results are provided in Table C in the supplementary materials; de Beurs, Oudejans, et al., 2022). Item parameters were established in the combined general population and clinical samples. Thus, θ scores were estimated with the multigroup mirt option (Smits, 2016). We fixed the item parameters to be equal across groups. The latent trait (θ) was standardized to a scale with a mean of 0 and a standard deviation of 1 for the general population. The unidimensionality of scales, a requirement for IRT, was investigated with confirmatory factor analysis using the R package lavaan (version 0.6-5; Rosseel, 2012).
We used the DWLS estimator based on the polychoric correlation matrix for ordinal items, inspected (scaled) fit statistics, and set as requirements for unidimensionality: CFI > .95, TLI ≥ .95, RMSEA < .06, and SRMR < .08. If insufficient fit of a unidimensional model is found, IRT-based scores are potentially flawed, and alternatives (e.g., percentile conversion or regression-based norming) can be utilized.
Table 1. Raw scores, θ-based T scores, and calculated T scores of the population sample and patients on the BSI, 4DSQ, and OQ-45
Scale | Raw scores: M Mdn SD Skew. Kurt. | θ-based T scores: M Mdn SD Skew. Kurt. | Calculated T scores: M Mdn SD Skew. Kurt.
BSI
Population
GSI 0.38 0.28 0.34 2.00 5.97 | 49.81 49.95 9.17 0.16 0.07 | 50.37 50.38 9.26 0.24 0.06
DEP 0.39 0.17 0.48 2.20 6.75 | 49.94 49.35 8.48 0.65 0.05 | 51.09 48.05 8.74 0.76 0.32
ANX 0.34 0.17 0.43 2.02 5.23 | 49.97 48.50 8.17 0.66 0.34 | 52.04 50.02 7.94 0.58 0.41
SOM 0.33 0.14 0.42 2.30 7.50 | 49.98 48.03 7.96 0.75 0.19 | 51.67 49.17 8.22 0.77 0.33
Patients
GSI 1.21 1.09 0.73 0.67 0.02 | 66.71 66.62 11.36 0.02 0.25 | 66.57 66.36 11.49 0.02 0.26
DEP 1.57 1.50 1.03 0.37 0.80 | 67.94 67.95 12.10 0.01 0.41 | 67.54 68.33 12.42 0.11 0.41
ANX 1.35 1.17 0.96 0.63 0.41 | 66.14 66.18 9.94 0.11 0.41 | 65.46 65.77 10.55 0.11 0.24
SOM 0.95 0.71 0.83 1.06 0.67 | 62.17 61.83 11.06 0.32 0.23 | 61.60 60.98 11.41 0.28 0.24
4DSQ
Population
DIST 5.52 3 6.40 1.72 2.87 | 50.00 49.63 9.20 0.50 0.11 | 50.03 49.07 9.20 0.53 0.01
DEP 0.70 0 1.92 3.69 14.71 | 50.00 46.57 7.05 1.88 2.44 | 50.17 46.77 6.97 1.87 2.38
ANX 1.05 0 2.53 4.14 21.64 | 50.00 45.48 7.46 1.56 1.77 | 50.13 45.58 7.42 1.52 1.69
SOM 4.92 4 4.91 1.58 3.07 | 50.00 49.68 8.87 0.42 0.10 | 50.11 50.79 8.87 0.42 0.05
Patients
DIST 19.02 20 8.32 0.24 0.93 | 66.90 66.77 8.32 0.12 0.38 | 66.19 66.43 8.55 0.16 0.32
DEP 3.45 2 3.66 1.06 0.05 | 62.77 61.97 6.45 0.65 0.24 | 60.39 61.66 9.52 0.16 0.92
ANX 5.21 4 5.14 1.06 0.43 | 63.38 63.16 7.28 0.37 0.58 | 61.28 63.12 9.59 0.16 0.67
SOM 12.77 11 7.00 0.46 0.51 | 63.82 62.90 8.20 0.35 0.04 | 62.36 60.96 8.95 0.14 0.12
OQ-45
Population
OQ-TOT 40.83 39 17.80 0.82 1.15 | 49.62 49.49 9.03 0.26 0.75 | 50.43 50.25 9.41 0.32 0.42
SDIS 23.76 22 11.43 0.83 1.09 | 49.76 49.58 8.98 0.30 0.78 | 50.84 50.13 9.50 0.39 0.37
IR 8.96 8 4.97 0.81 0.93 | 49.83 49.69 8.68 0.31 0.02 | 51.78 50.69 8.69 0.38 0.03
SR 8.12 8 3.88 0.64 0.71 | 49.81 50.44 7.94 0.10 0.22 | 50.46 50.39 8.39 0.49 0.42
ASD 14.37 14 6.82 0.59 0.45 | 49.84 49.73 8.60 0.11 0.38 | 51.67 51.81 9.52 0.26 0.21
Patients
OQ-TOT 78.23 79 24.00 0.00 0.24 | 68.52 68.89 10.96 0.10 0.17 | 68.34 68.95 10.85 0.17 0.07
SDIS 48.24 49 15.51 0.06 0.28 | 69.63 69.88 11.35 0.07 0.14 | 69.45 70.09 11.33 0.13 0.14
IR 16.54 16.5 6.79 0.11 0.35 | 64.31 64.93 10.46 0.12 0.15 | 63.95 64.62 10.35 0.22 0.19
SR 13.35 13 5.42 0.20 0.27 | 61.58 62.13 11.77 0.05 0.27 | 61.42 61.05 11.12 0.04 0.40
ASD 25.62 26 8.86 0.02 0.29 | 66.81 66.91 11.37 0.01 0.17 | 66.43 67.00 11.40 0.03 0.09
Note. Brief Symptom Inventory (BSI): GSI = Global Severity Index; DEP = Depression; ANX = Anxiety; SOM = Somatic complaints; Four-Dimensional Symptom Questionnaire (4DSQ): DIST = Distress; DEP = Depression; ANX = Anxiety; SOM = Somatization; Outcome Questionnaire (OQ): TOT = OQ-45 Total score; SDIS = Symptomatic Distress; IR = Interpersonal Relationships; SR = Social Role; ASD = Anxiety and Social Distress; M = Mean; Mdn = Median; SD = Standard deviation; Skew. = Skewness; Kurt. = Kurtosis.
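The EAP scoring step described in this section can be illustrated with a minimal sketch of the graded response model. This is not the mirt implementation: it assumes a standard normal prior evaluated on an equally spaced grid, and the item parameters (discrimination and ordered thresholds) are invented for illustration.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model."""
    # Cumulative probabilities P(X >= k), padded with 1 (k = 0) and 0 (beyond the top).
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


def eap_theta(responses, items, n_points=121, lo=-6.0, hi=6.0):
    """EAP estimate of theta: posterior mean under a standard normal prior."""
    step = (hi - lo) / (n_points - 1)
    numerator = denominator = 0.0
    for i in range(n_points):
        theta = lo + i * step
        weight = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) density
        for response, (a, thresholds) in zip(responses, items):
            weight *= grm_category_probs(theta, a, thresholds)[response]
        numerator += theta * weight
        denominator += weight
    return numerator / denominator


# Hypothetical parameters for three 5-category items (not from the article)
items = [
    (1.8, [0.0, 1.0, 2.0, 3.0]),
    (2.2, [0.5, 1.2, 2.1, 2.9]),
    (1.5, [-0.2, 0.9, 1.8, 2.7]),
]
theta_low = eap_theta([0, 0, 0], items)
theta_high = eap_theta([3, 4, 3], items)

# With theta standardized to M = 0, SD = 1 in the reference population,
# a theta-based T score is 50 + 10 * theta.
print(round(50 + 10 * theta_low, 1), round(50 + 10 * theta_high, 1))
```

The sketch also shows why a clinician with only a raw scale score cannot obtain a θ-based T score directly: the estimate depends on the full response pattern and on the item parameters.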
We used the non-linear least squares function of R (nls) to establish the best-fitting function (the one with the lowest AIC value and/or the most parsimonious number of coefficients) for the relation between raw scores and θ-based T scores. Linear, polynomial, exponential, logarithmic, power, division, rational, sigmoid, and hyperbolic equations were evaluated. We cross-validated each equation by randomly splitting each dataset in two, using the first half to establish the best-fitting function and the second half as a validation sample (Camstra & Boomsma, 1992). Applying the conversion formula to the raw scale score results in a calculated T score. The distributions of the resulting scores (mean, median, standard deviation, skewness, and kurtosis) were investigated for the population and the patient samples and visually inspected for normality with histograms/density plots and QQ-plots.
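The model-selection step can be illustrated with a simplified sketch. Note that it does not reproduce nls: it fits only polynomial candidates by ordinary linear least squares and compares them with a Gaussian AIC that omits additive constants (which cancel when models are compared on the same data). The data are synthetic, generated from the quadratic OQ-SR formula in Table 2 plus a small deterministic perturbation.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients (constant term first) via normal equations."""
    k = degree + 1
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)


def aic(xs, ys, coefs):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2 * k."""
    predictions = [sum(c * x ** i for i, c in enumerate(coefs)) for x in xs]
    rss = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return len(xs) * math.log(rss / len(xs)) + 2 * len(coefs)


# Synthetic stand-in for (raw score, theta-based T score) pairs
xs = list(range(0, 37))
ys = [32.1 + 2.39 * x - 0.0121 * x * x + 0.3 * math.sin(1.7 * x) for x in xs]

linear = fit_polynomial(xs, ys, 1)
quadratic = fit_polynomial(xs, ys, 2)
print("AIC linear:", round(aic(xs, ys, linear), 1))
print("AIC quadratic:", round(aic(xs, ys, quadratic), 1))
```

The candidate with the lower AIC is preferred. For the rational and hyperbolic forms evaluated in the study, an iterative non-linear optimizer such as nls is needed instead of the closed-form solution used here.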
ICC estimates for absolute agreement and consistency of θ-based and calculated T scores and their CI95 were determined using R and based on a two-way mixed-effects model. We established "bias" (the mean difference between both T scores; van Stralen et al., 2012), as well as the "percentage error" (the width of the CI95 interval proportional to the population mean; Van Hoeck et al., 2000). If the CI95 interval was within 5 T score points, we would conclude that both approaches yielded sufficiently similar results, since 5 T score points are the proposed limit for statistically reliable change in score over time (de Beurs et al., 2019). We also inspected the agreement between θ-based T scores and calculated T scores with Bland-Altman plots for the full range of severity (Bland & Altman, 1986). We did this for the entire sample and for the population and clinical samples separately. Finally, to establish the effect of normalization, we also compared standard T scores (based on the standard conversion formula, standard T = 10 × Z + 50) with θ-based T scores and calculated T scores.
Figure 2. Frequency distributions (histograms with density and normal curves) and QQ-plots for raw scores, θ-based T scores, and calculated T scores of the BSI Global Severity Index (BSI-GSI; A), 4DSQ-Distress (4DSQ-DIS; B), and OQ-Symptomatic Distress (OQ-SD; C) of the general population (above) and patient samples (below).
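The agreement indicators described above (bias, CI95 limits, percentage error, and the share of differences within 5 T score points) can be sketched as follows. This is a minimal illustration in the Bland-Altman style, assuming that the CI95 limits are computed as bias ± 1.96 SD of the differences and that the percentage error is scaled by the mean θ-based T score unless another mean is supplied; the paired scores are invented.

```python
from statistics import mean, stdev


def agreement_summary(theta_t, calc_t, mean_t=None, tolerance=5.0):
    """Bias, 95% limits of agreement, percentage error, and % within tolerance."""
    diffs = [a - b for a, b in zip(theta_t, calc_t)]  # positive: theta-based is higher
    bias = mean(diffs)
    sd = stdev(diffs)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    if mean_t is None:
        mean_t = mean(theta_t)  # assumption: scale by the mean theta-based T score
    within = sum(abs(d) <= tolerance for d in diffs)
    return {
        "bias": bias,
        "loa_lower": lower,
        "loa_upper": upper,
        "percentage_error": 100.0 * (upper - lower) / mean_t,
        "pct_within": 100.0 * within / len(diffs),
    }


# Invented paired T scores for five respondents
theta_t = [45.0, 52.0, 61.0, 68.0, 74.0]
calc_t = [46.0, 51.0, 60.0, 70.0, 73.0]
summary = agreement_summary(theta_t, calc_t)
for key, value in summary.items():
    print(key, round(value, 2))
```

In this framing, the decision rule in the text amounts to checking that both limits of agreement fall within 5 T score points of zero.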
Results
Unidimensionality of Scales
Most scales met criteria for unidimensionality, using CFI ≥ 0.95, TLI ≥ 0.95, RMSEA < 0.06 to 0.08, and SRMR ≤ .08 (Schreiber et al., 2006). Table A in the supplementary materials provides CFA results for all scales (de Beurs, Oudejans, et al., 2022). The CFI, TLI, and SRMR requirements were met by all scales of the BSI and the 4DSQ, except for the BSI-GSI and the DIS-SOM. The OQ scales did not meet CFI and TLI requirements. RMSEA was larger than 0.06 for most scales but not substantially larger, except for the BSI-ANX and again the OQ-SR (RMSEA > 0.12). As a consequence, the precision of the θs for the OQ scales, in particular, may be compromised, as these scales appear to lack sufficient unidimensionality (Crișan et al., 2017).
Distribution of Raw Scores and T Scores
Table 1 presents an overview of raw scores, θ-based T scores, and calculated T scores and characteristics of the frequency distribution for scales of the BSI, 4DSQ, and OQ-45. Figure 2 shows, for the main score on each instrument, the frequency distribution of the raw score, θ-based T score, and calculated T score in the general population sample (upper half) and the clinical sample (lower half), together with QQ-plots. The similarity of the frequency distribution, the density curve, and the normal curve, and the accordance of the dots in the QQ-plot with the straight diagonal line, indicate how closely the distributions approximate normality (Figures A1–A3 in the supplementary materials present plots for all scales; de Beurs, Oudejans, et al., 2022). The average BSI-GSI score in the population sample was M = 0.38 (SD = 0.34). For the BSI-GSI score, skewness was 2.00 and kurtosis was 5.97, showing substantial deviation from normality, with a tail to the right. Many respondents had a low score, but this is to be expected when a symptom list is completed by a population sample. The raw scores on the 4DSQ scales in the population were also skewed. For the depression and anxiety scales, values for skewness and kurtosis were extreme, as 78.7% and 67.6% of the respondents, respectively, had the lowest possible score. In contrast, raw scores of the general population on the OQ-45 had an almost normal distribution; only the total score and the OQ-SD score showed marginal kurtosis in the population sample (some surplus of scores below the average score). Most T scores based on θs and calculated T scores had a normal distribution.
Conversion functions for the BSI, 4DSQ, and OQ-45 are shown in Table 2. For BSI scores, rational functions provided the best fit; indices of skewness and kurtosis decreased considerably compared to the raw scores. Raw scale scores in the patient sample had a normal distribution, and this was preserved in the calculated T scores.
For 4DSQ scales, rational functions also best fit the relation between raw scores and θ-based T scores. Again, raw scores were skewed and peaked, whereas θ-based and calculated T scores approximated the normal distribution better, although the depression score and the anxiety score of the population sample were still skewed and showed kurtosis due to a large proportion of respondents with the lowest possible score on these scales. Thus, transformation to a normal distribution was successful for only two of the four subscales of the 4DSQ. Patient scores had a normal distribution. Finally, for the OQ-45, two rational functions, a cubic, a quadratic, and a hyperbolic function provided the best fit.
In the supplementary materials, Figures A1–A3 (de Beurs, Oudejans, et al., 2022) present histograms with density lines and QQ-plots for all scales. These graphs reveal a sufficiently normal distribution for most scales, except for the 4DSQ depression and anxiety scales, but also reveal some surplus of extremely high and extremely low scores for all scales.
Comparison of θ-Based and Calculated T Scores
We cross-validated the equations shown in Table 2 by applying a random split of each dataset into a calibration sample and a cross-validation sample. Table 2 provides the Root Mean Squared Error (RMSE), the coefficient of determination (R^2), and the Mean Absolute Error (MAE) for the correspondence of predicted scores (calculated T scores with formulas based on the calibration sample) with observed T scores (θ-based T scores in the cross-validation sample). Overall, correspondence was high; the lowest R^2 was .896, for the OQ-SR scale (Table 2).
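The cross-validation indices can be sketched as follows. The article does not state which definition of R^2 was used; this sketch assumes R^2 = 1 − SS_res / SS_tot, and the helper names, the seed, and the example scores are ours.

```python
import math
import random


def split_half(records, seed=0):
    """Randomly split records into a calibration half and a validation half."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def correspondence(observed, predicted):
    """RMSE, R^2 (as 1 - SS_res / SS_tot), and MAE of predictions."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mean_observed = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return {
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
    }


# Invented validation data: theta-based (observed) vs. calculated (predicted) T scores
observed = [48.0, 55.0, 62.0, 70.0]
predicted = [49.0, 54.0, 63.0, 69.0]
indices = correspondence(observed, predicted)
print({name: round(value, 3) for name, value in indices.items()})
```

In a full analysis, the conversion formula would be fitted on the first half returned by `split_half` and these indices computed on the second half.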
Furthermore, we calculated ICCs for absolute agreement between θ-based and calculated T scores and for consistency (similar ranking of subjects according to both scores; van Stralen et al., 2012). Table 3 presents the ICCs and
Table 2. Formulas to calculate T scores and indicators of correspondence from cross-validation
Scale | Formula y = ... | RMSE | R^2 | MAE
BSI-GSI | 31.1 + (138.641x + 22.4779x^2)/(1 + 4.089x - 0.3392x^2) | 1.456 | 0.988 | 1.116
BSI-DEP | 41.6 + (47.593x + 0.7650x^2)/(1 + 1.444x - 0.1909x^2) | 1.627 | 0.986 | 1.304
BSI-ANX | 43.4 + (50.294x - 1.9425x^2)/(1 + 1.564x - 0.2368x^2) | 2.658 | 0.950 | 2.189
BSI-SOM | 42.9 + (56.622x + 9.3246x^2)/(1 + 2.233x - 0.1947x^2) | 1.725 | 0.978 | 1.396
4DSQ-DIST | 38.0 + (6.460x - 0.0284x^2)/(1 + 0.258x - 0.0050x^2) | 1.087 | 0.988 | 0.788
4DSQ-DEP | 46.8 + (27.876x - 0.8495x^2)/(1 + 1.411x - 0.0766x^2) | 1.497 | 0.961 | 0.854
4DSQ-ANX | 45.6 + (15.735x + 0.0349x^2)/(1 + 0.718x - 0.0155x^2) | 1.875 | 0.946 | 1.222
4DSQ-SOM | 37.4 + (5.746x + 0.0336x^2)/(1 + 0.203x - 0.0031x^2) | 1.208 | 0.984 | 0.845
OQ-TOT | 20.9 + (1.157x + 0.0017x^2)/(1 + 0.018x - 0.0001x^2) | 2.434 | 0.963 | 1.895
OQ-SD | 25.8 + (1.487x - 0.0051x^2)/(1 + 0.014x - 0.0001x^2) | 2.287 | 0.970 | 1.796
OQ-IR | 33.5 + 2.46x - 0.0447x^2 + 0.00058x^3 | 3.370 | 0.914 | 2.649
OQ-SR | 32.1 + 2.39x - 0.0121x^2 | 3.879 | 0.896 | 3.016
OQ-ASD | 68.2 + 1.094(x - 27.0) + sinh((x - 27.0)/8.533) | 2.350 | 0.965 | 1.840
Note. Brief Symptom Inventory (BSI): GSI = Global Severity Index; DEP = Depression; ANX = Anxiety; SOM = Somatic Complaints; Four-Dimensional Symptom Questionnaire (4DSQ): DIST = Distress; DEP = Depression; ANX = Anxiety; SOM = Somatization; Outcome Questionnaire (OQ): TOT = Total score; SD = Symptomatic Distress; IR = Interpersonal Relationships; SR = Social Role; ASD = Anxiety and Social Distress; y = calculated T score; x = raw scale score; RMSE = Root Mean Squared Error; R^2 = coefficient of determination; MAE = Mean Absolute Error.
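To illustrate how such formulas might be implemented in ROM software, the sketch below codes three of the Table 2 conversion functions (one rational, one quadratic, one hyperbolic). The function names are ours; the coefficients are those of Table 2, entered with the rounding shown there.

```python
import math


def t_bsi_gsi(x):
    """Calculated T score for the BSI-GSI mean item score x (range 0-4)."""
    return 31.1 + (138.641 * x + 22.4779 * x**2) / (1 + 4.089 * x - 0.3392 * x**2)


def t_oq_sr(x):
    """Calculated T score for the OQ-SR sum score x (range 0-36)."""
    return 32.1 + 2.39 * x - 0.0121 * x**2


def t_oq_asd(x):
    """Calculated T score for the OQ-ASD sum score x (range 0-52)."""
    return 68.2 + 1.094 * (x - 27.0) + math.sinh((x - 27.0) / 8.533)


# Raw medians of the patient samples in Table 1
print(round(t_bsi_gsi(1.09), 2))
print(round(t_oq_sr(13), 2))
print(round(t_oq_asd(26), 2))
```

As a plausibility check, applying the functions to the raw patient medians in Table 1 (1.09, 13, and 26) gives values close to the tabulated medians of the calculated T scores (66.36, 61.05, and 67.00), as expected for monotone conversions; small discrepancies reflect the rounding of the published coefficients.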
Table 3. Indicators of correspondence between θ-based and calculated T scores
Scale | ICC-A | CI95 | ICC-C | CI95 | Bias | Perc. Err. | CI95+ | CI95- | Difference > 5 (%) | Difference -5 and 5 (%) | Difference < -5 (%)
BSI-GSI | .99 | .99–.99 | .99 | .99–.99 | 0.02 | 12.3 | 3.87 | -3.83 | 2.37 | 97.60 | 0.03
BSI-DEP | .99 | .99–.99 | .99 | .99–.99 | -0.01 | 11.4 | 3.61 | -3.62 | 0.82 | 99.04 | 0.14
BSI-ANX | .97 | .97–.97 | .97 | .97–.97 | 0.02 | 17.8 | 5.56 | -5.51 | 3.45 | 93.54 | 3.01
BSI-SOM | .99 | .98–.99 | .99 | .98–.99 | 0.01 | 13.1 | 3.88 | -3.86 | 1.36 | 98.63 | 0.02
4DSQ-DIST | .99 | .99–.99 | .99 | .99–.99 | -0.02 | 11.3 | 2.88 | -2.91 | 0.00 | 99.98 | 0.02
4DSQ-DEP | .98 | .98–.98 | .98 | .98–.98 | 0.03 | 11.8 | 3.03 | -2.98 | 0.00 | 98.36 | 1.64
4DSQ-ANX | .99 | .99–.99 | .99 | .99–.99 | 0.00 | 9.9 | 2.52 | -2.52 | 0.41 | 98.56 | 1.03
4DSQ-SOM | .99 | .99–.99 | .99 | .99–.99 | 0.02 | 9.9 | 2.54 | -2.50 | 0.00 | 99.82 | 0.18
OQ-TOT | .98 | .98–.98 | .98 | .98–.98 | -0.02 | 14.7 | 4.79 | -4.83 | 1.77 | 95.44 | 2.79
OQ-SD | .98 | .98–.99 | .98 | .98–.99 | 0.02 | 13.5 | 4.52 | -4.49 | 1.32 | 96.89 | 1.79
OQ-IR | .96 | .95–.96 | .96 | .95–.96 | 0.01 | 21.3 | 6.61 | -6.60 | 6.89 | 86.44 | 6.67
OQ-SR | .95 | .95–.95 | .95 | .95–.95 | -0.03 | 25.1 | 7.47 | -7.52 | 10.14 | 81.33 | 8.53
OQ-ASD | .98 | .98–.98 | .98 | .98–.98 | -0.02 | 14.7 | 4.68 | -4.72 | 1.92 | 96.03 | 2.05
Note. Brief Symptom Inventory (BSI): GSI = Global Severity Index; DEP = Depression; ANX = Anxiety; SOM = Somatic complaints; Four-Dimensional Symptom Questionnaire (4DSQ): DIST = Distress; DEP = Depression; ANX = Anxiety; SOM = Somatization; Outcome Questionnaire (OQ): TOT = OQ-45 Total score; SD = Symptomatic Distress; IR = Interpersonal Relationships; SR = Social Role; ASD = Anxiety and Social Distress; ICC-A = Absolute agreement; ICC-C = Consistency; Bias = difference between both T scores (positive bias values indicate that θ-based T scores are higher than calculated T scores); Perc. Err. = Percentage error, the width of the limits of agreement interval divided by the mean T score of the population [(CI95+ - CI95-)/M]; Difference -5 and 5 (%) = the percentage of subjects for whom both T scores differ less than 5 points; Difference < -5 and > 5 (%) = the percentage of subjects for whom the difference between both methods is greater than 5 points.
additional information on the association between both
scores. All ICCs were high, suggesting excellent agreement
and consistency. Bias and percentage error were also low,
with the exception of a higher percentage error for BSI-ANX,
OQ-IR, and OQ-SR, and for these scales, the CI95
extended beyond ±5 T-score points. Figure 3 presents
Bland-Altman plots for selected scales. Figures B1–B3 in
the supplementary materials present plots for all scales
(de Beurs, Oudejans, et al., 2022).
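The agreement indicators reported in Table 3 (bias, limits of agreement, percentage error, and the share of cases differing by more than 5 points) can be sketched as follows. This is an illustrative reading of the table note, not the authors' code: the function name is hypothetical, the limits use the conventional bias ± 1.96 SD, and the mean T score M is taken here as the mean over both methods, which is an assumption.

```python
import numpy as np


def bland_altman(theta_t, calc_t, threshold=5.0):
    """Bland-Altman style agreement indicators for two sets of T scores."""
    theta_t = np.asarray(theta_t, dtype=float)
    calc_t = np.asarray(calc_t, dtype=float)

    # Positive differences mean the theta-based T score is the higher one.
    diff = theta_t - calc_t
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))
    upper = bias + 1.96 * sd  # CI95+
    lower = bias - 1.96 * sd  # CI95-

    # Assumption: M is the mean T score over both methods.
    mean_t = float(np.concatenate([theta_t, calc_t]).mean())

    return {
        "bias": bias,
        "upper": upper,
        "lower": lower,
        "perc_err": 100.0 * (upper - lower) / mean_t,
        "pct_gt": 100.0 * float(np.mean(diff > threshold)),
        "pct_within": 100.0 * float(np.mean(np.abs(diff) <= threshold)),
        "pct_lt": 100.0 * float(np.mean(diff < -threshold)),
    }


if __name__ == "__main__":
    # Made-up scores for four hypothetical respondents.
    print(bland_altman([50, 60, 70, 40], [49, 61, 70, 46]))
```

The three percentage entries always sum to 100, which gives a quick consistency check on any row computed this way.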
For the BSI, θ-based T scores corresponded well to calculated
T scores (mean difference M = 0.01–0.02, the solid gray
line in the Bland-Altman plot); only for the BSI-ANX, less than 95%
Figure 3. A Bland-Altman plot displays the difference between two methods for the full range of scores; the x-axis displays the range based on
the average of both methods; the y-axis displays for each subject the difference in score between both methods. BSI-GSI = Global Severity Index;
4DSQ-DIST = 4DSQ Distress; OQ-SD = Symptomatic Distress.
of the cases fell within the −5 to 5 interval for acceptable
difference. For the 4DSQ, θ-based T scores and calculated
T scores corresponded well (mean difference M = 0.02), and
99% of the cases fell within the −5 to 5 interval. For the
OQ-SD also, excellent correspondence was found (mean
difference M = 0.02); however, on average across the OQ scales, 91.2% of the
Table 4. Indicators of correspondence of standardized T scores with θ-based and calculated T scores for the main scale of each self-report
measure

| Scale | Pair | ICC-A | CI95 | ICC-C | CI95 | Bias | Perc. Err. | CI95+ | CI95− | Difference > 5 (%) | Difference −5 and 5 (%) | Difference < −5 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BSI-GSI | ST-TT | .93 | .93–.93 | .93 | .93–.93 | 0.29 | 35.6 | 11.42 | −10.83 | 18.89 | 67.92 | 13.19 |
| | ST-CT | .93 | .93–.94 | .93 | .93–.94 | 0.26 | 34.3 | 11.01 | −10.49 | 18.90 | 75.59 | 5.51 |
| 4DSQ-DIST | ST-TT | .93 | .93–.94 | .94 | .93–.94 | 0.78 | 29.7 | 8.40 | −6.85 | 10.22 | 89.16 | 0.62 |
| | ST-CT | .94 | .93–.94 | .94 | .94–.94 | 0.79 | 28.5 | 8.10 | −6.52 | 9.72 | 90.28 | 0.00 |
| OQ-SD | ST-TT | .97 | .96–.98 | .98 | .97–.98 | 1.00 | 18.3 | 7.12 | −5.12 | 11.27 | 87.13 | 1.60 |
| | ST-CT | .99 | .98–.99 | .99 | .99–.99 | 0.99 | 12.5 | 5.16 | −3.19 | 1.00 | 99.00 | 0.00 |

Note. BSI-GSI = Brief Symptom Inventory Global Severity Index; 4DSQ-DIST = Four-Dimensional Symptom Questionnaire Distress; OQ-SD = Outcome
Questionnaire Symptomatic Distress; ICC-A = Absolute agreement; ICC-C = Consistency; Bias = Difference between both T-scores (positive bias
values indicate that θ-based T-scores are higher than calculated T-scores); Perc. Err. = Percentage error, the width of the limits of agreement interval
divided by the mean T-score of the population [(CI95+ − CI95−)/M]; Difference −5 and 5 % = The percentage of subjects for whom both T-scores differ less
than 5 points; Difference > 5 and < −5 % = The percentage of subjects for whom the difference between both methods is greater than 5 points; ST =
Standardized T-score; CT = Calculated T-score; TT = θ-based T-score.
Table 5. Crosswalk table for a selection of raw scores to calculated T-scores and percentile scores for the BSI global severity index, depression,
anxiety, and somatic complaints scale

| GSI RS | GSI T-score | GSI PRn | GSI PRcl | DEP RS | DEP T-score | DEP PRn | DEP PRcl | ANX RS | ANX T-score | ANX PRn | ANX PRcl | SOM RS | SOM T-score | SOM PRn | SOM PRcl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.00 | 31.1 | 13.2 | 2.1 | 0.00 | 41.6 | 16.1 | 2.6 | 0.00 | 43.4 | 17.1 | 2.9 | 0.00 | 42.9 | 45.9 | 27.4 |
| 0.17 | 45.5 | 52.9 | 8.3 | 0.17 | 48.2 | 41.5 | 7.5 | 0.17 | 50.1 | 44.0 | 8.9 | 0.17 | 50.1 | 91.8 | 54.8 |
| 0.33 | 52.0 | 79.3 | 12.5 | 0.33 | 52.5 | 58.4 | 12.2 | 0.33 | 54.4 | 62.0 | 15.3 | 0.33 | 54.4 | 91.8 | 54.8 |
| 0.50 | 56.4 | 79.3 | 12.5 | 0.50 | 55.9 | 71.5 | 17.7 | 0.50 | 57.7 | 75.5 | 22.0 | 0.50 | 57.7 | 91.8 | 54.8 |
| 0.67 | 59.8 | 79.3 | 12.5 | 0.67 | 58.7 | 80.6 | 23.7 | 0.67 | 60.3 | 83.4 | 28.8 | 0.67 | 60.4 | 91.8 | 54.8 |
| 0.83 | 62.5 | 83.3 | 21.7 | 0.83 | 61.0 | 86.7 | 29.4 | 0.83 | 62.3 | 88.2 | 36.1 | 0.83 | 62.5 | 91.8 | 54.9 |
| 1.00 | 65.0 | 90.2 | 40.2 | 1.00 | 63.1 | 90.6 | 35.1 | 1.00 | 64.2 | 91.7 | 43.4 | 1.00 | 64.6 | 95.6 | 69.3 |
| 1.17 | 67.4 | 96.0 | 59.8 | 1.17 | 65.0 | 92.9 | 40.9 | 1.17 | 65.8 | 94.1 | 50.1 | 1.17 | 66.5 | 99.3 | 83.6 |
| 1.33 | 69.5 | 98.9 | 70.2 | 1.33 | 66.6 | 94.5 | 46.4 | 1.33 | 67.2 | 95.8 | 56.2 | 1.33 | 68.2 | 99.3 | 83.7 |
| 1.50 | 71.7 | 98.9 | 70.2 | 1.50 | 68.3 | 96.1 | 51.6 | 1.50 | 68.6 | 97.2 | 61.6 | 1.50 | 69.9 | 99.3 | 83.9 |
| 1.67 | 73.9 | 98.9 | 70.2 | 1.67 | 70.0 | 97.2 | 56.6 | 1.67 | 70.0 | 98.1 | 66.6 | 1.67 | 71.7 | 99.3 | 84.0 |
| 1.83 | 75.9 | 99.4 | 75.5 | 1.83 | 71.5 | 97.9 | 61.7 | 1.83 | 71.2 | 98.6 | 71.2 | 1.83 | 73.3 | 99.3 | 84.0 |
| 2.00 | 78.1 | 100.0 | 84.9 | 2.00 | 73.1 | 98.5 | 66.6 | 2.00 | 72.6 | 99.1 | 75.2 | 2.00 | 75.0 | 99.7 | 89.7 |
| 2.17 | 80.3 | 100.0 | 92.1 | 2.17 | 74.7 | 98.9 | 71.0 | 2.17 | 73.9 | 99.4 | 78.9 | 2.17 | 76.7 | 100.0 | 95.4 |
| 2.33 | 82.4 | 100.0 | 95.1 | 2.33 | 76.2 | 99.1 | 75.1 | 2.33 | 75.1 | 99.5 | 82.4 | 2.33 | 78.3 | 100.0 | 95.4 |
| 2.50 | 84.6 | 100.0 | 95.1 | 2.50 | 77.8 | 99.4 | 78.9 | 2.50 | 76.5 | 99.7 | 85.5 | 2.50 | 80.1 | 100.0 | 95.4 |
| 2.67 | 86.9 | 100.0 | 95.1 | 2.67 | 79.5 | 99.5 | 82.5 | 2.67 | 77.9 | 99.8 | 88.3 | 2.67 | 81.9 | 100.0 | 95.4 |
| 2.83 | 89.2 | 100.0 | 95.5 | 2.83 | 81.2 | 99.6 | 86.0 | 2.83 | 79.3 | 99.9 | 90.7 | 2.83 | 83.6 | 100.0 | 95.5 |
| 3.00 | 91.6 | 100.0 | 97.0 | 3.00 | 83.0 | 99.8 | 89.1 | 3.00 | 80.8 | 100.0 | 92.9 | 3.00 | 85.5 | 100.0 | 97.4 |
| 3.17 | 94.2 | 100.0 | 98.9 | 3.17 | 84.9 | 99.9 | 91.6 | 3.17 | 82.5 | 100.0 | 94.7 | 3.17 | 87.5 | 100.0 | 99.3 |
| 3.33 | 96.6 | 100.0 | 99.6 | 3.33 | 86.8 | 99.9 | 94.1 | 3.33 | 84.1 | 100.0 | 96.2 | 3.33 | 89.4 | 100.0 | 99.3 |
| 3.50 | 99.3 | 100.0 | 99.6 | 3.50 | 89.0 | 99.9 | 96.1 | 3.50 | 86.0 | 100.0 | 97.6 | 3.50 | 91.4 | 100.0 | 99.3 |
| 3.67 | 102.1 | 100.0 | 99.6 | 3.67 | 91.2 | 100.0 | 97.6 | 3.67 | 88.0 | 100.0 | 98.6 | 3.67 | 93.6 | 100.0 | 99.3 |
| 3.83 | 104.8 | 100.0 | 99.6 | 3.83 | 93.5 | 100.0 | 98.6 | 3.83 | 90.1 | 100.0 | 99.3 | 3.83 | 95.7 | 100.0 | 99.3 |
| 4.00 | 107.8 | 100.0 | 99.8 | 4.00 | 96.1 | 100.0 | 99.5 | 4.00 | 92.4 | 100.0 | 99.8 | 4.00 | 98.0 | 100.0 | 99.6 |

Note. Brief Symptom Inventory (BSI): GSI = Global Severity Index; DEP = Depression; ANX = Anxiety; SOM = Somatic complaints; RS = Raw Score; T-score =
Calculated T-score; PRn = Percentile rank in the normal population; PRcl = Percentile rank in the clinical population.
cases fell within the −5 to 5 interval with less correspondence
for the OQ-IR and OQ-SR scales. Generally, in
the low scoring range (< 40) and in the high score range
(> 70), calculated T scores were somewhat higher; in the
mid-range, θ-based T scores were higher. We also established
ICCs denoting correspondence between θ-based
and calculated T scores for the population and the clinical
sample separately. Correspondence was somewhat lower
in the clinical sample, especially for the 4DSQ-DEP,
4DSQ-ANX, OQ-IR, and OQ-SR, but was still very high
(see Table B in the supplementary materials; de Beurs,
Oudejans, et al., 2022).
We also compared the θ-based and the calculated
T scores for the BSI-GSI, 4DSQ-DIST, and OQ-SD with
standard T scores (based on the simpler linear equation
T = 10Z + 50). ICCs ranged from ICC = .81 for the
BSI-GSI to ICC = .99 for the OQ-SD (see Table 4). Correspondence
between standard T scores and calculated
T scores was still substantial, especially for the OQ-SD.
However, for the BSI-GSI, there was a large subgroup with
higher standard T scores than θ-based or calculated
T scores (a difference of more than 5 points in 18.9% of
the cases for both comparisons), which is understandable, as
positive skewness (a low mean score relative to the maximum
scale score) results in more respondents with extremely
high standard T scores. After all, the non-linear conversion
formula yielding calculated T scores corrects for precisely
this undesirable effect.
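The effect of skewness on standard T scores can be illustrated with a small sketch. It contrasts the linear conversion T = 10Z + 50 with a simple rank-based normalization (percentile rank mapped through the inverse normal distribution). The rank-based step is a swapped-in stand-in chosen for brevity and is not the IRT-based procedure used in this study; the data and function names are made up for illustration.

```python
from statistics import NormalDist, mean, pstdev


def standard_t(scores):
    """Linear T scores: T = 10 * Z + 50, with Z from the sample mean and SD."""
    m, s = mean(scores), pstdev(scores)
    return [50 + 10 * (x - m) / s for x in scores]


def rank_normalized_t(scores):
    """Normalized T scores from mid-rank percentiles (illustrative stand-in)."""
    n = len(scores)
    normal = NormalDist()
    result = []
    for x in scores:
        below = sum(y < x for y in scores)
        ties = sum(y == x for y in scores)
        p = (below + 0.5 * ties) / n  # strictly between 0 and 1
        result.append(50 + 10 * normal.inv_cdf(p))
    return result


if __name__ == "__main__":
    # Hypothetical right-skewed raw scores: most respondents score at the floor.
    raw = [0] * 6 + [1, 1, 2, 10]
    print([round(t, 1) for t in standard_t(raw)])
    print([round(t, 1) for t in rank_normalized_t(raw)])
```

With such floor-heavy data, the highest raw score receives a markedly more extreme standard T score than normalized T score, which is the pattern described above for the BSI-GSI.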
Crosswalk From Raw Scores to T Scores and Percentiles
Tables 5, 6, and 7 present crosswalk tables for the conversion
of a selection of raw scores to calculated T scores
and percentile ranks for the general population and the
clinical population. Per measurement instrument, the first
column gives raw scores (RS) and the second gives T scores
calculated according to the functions in Table 2 for all
respondents. The conversion functions stretch the scales
at the extremes: a change in raw score of one scale point
in the low or high score area corresponds to a larger change
in T score than a change of one scale point in the mid-range
score area, and the normalized T score reflects this
appropriately. Figure 4 depicts the relationship between raw
scores on the instruments and T scores.
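The stretching at the extremes can be made concrete with the 4DSQ Distress values from the crosswalk in Table 6. The sketch below computes the average change in T score per raw-score point between adjacent table rows; the helper name is illustrative, and the numbers are copied from the table rather than recomputed.

```python
# (raw score, calculated T score) pairs copied from the 4DSQ-DIST columns of Table 6.
DIST_CROSSWALK = [
    (0, 38.0), (2, 46.6), (4, 51.0), (6, 54.0), (8, 56.2), (10, 58.1),
    (12, 59.8), (14, 61.4), (16, 63.0), (18, 64.7), (20, 66.4), (22, 68.3),
    (24, 70.3), (26, 72.6), (28, 75.1), (30, 78.0), (32, 81.3),
]


def t_change_per_point(crosswalk):
    """Average T-score change per raw-score point between adjacent rows."""
    return [
        (x0, x1, (t1 - t0) / (x1 - x0))
        for (x0, t0), (x1, t1) in zip(crosswalk, crosswalk[1:])
    ]


slopes = t_change_per_point(DIST_CROSSWALK)

if __name__ == "__main__":
    for x0, x1, slope in slopes:
        print(f"raw {x0:>2}-{x1:<2}: {slope:.2f} T points per raw point")
```

In this table, one raw point is worth several T points at the floor of the scale, under one T point in the middle, and more again toward the ceiling.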
Discussion
We analyzed community data of frequently used generic
outcome measures in the Netherlands to link raw scores
to two common metrics: T scores and percentile ranks. In
line with previous research (Cook et al., 2015; Fischer &
Rose, 2016; Friedrich et al., 2019; Schalet et al., 2015; Wahl
et al., 2014), we applied methods based on IRT, which
Table 6. Crosswalk table for a selection of raw scores to calculated T-scores and percentile scores for the 4DSQ scales for distress, depression,
anxiety, and somatization

| DIST RS | DIST T | DIST PRn | DIST PRcl | DEP RS | DEP T | DEP PRn | DEP PRcl | ANX RS | ANX T | ANX PRn | ANX PRcl | SOM RS | SOM T | SOM PRn | SOM PRcl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38.0 | 10.4 | 0.2 | 0 | 46.8 | 39.5 | 13.3 | 0 | 45.6 | 42.6 | 16.2 | 0 | 37.4 | 14.6 | 0.3 |
| 2 | 46.6 | 37.7 | 1.1 | 1 | 58.3 | 82.8 | 33.8 | 2 | 58.9 | 89.4 | 41.6 | 2 | 45.8 | 39.5 | 2.5 |
| 4 | 51.0 | 54.9 | 4.0 | 2 | 61.7 | 88.7 | 46.7 | 4 | 63.1 | 94.9 | 55.6 | 4 | 50.8 | 57.8 | 9.2 |
| 6 | 54.0 | 67.2 | 8.1 | 3 | 63.5 | 91.9 | 57.5 | 6 | 65.7 | 96.9 | 66.2 | 6 | 54.4 | 71.5 | 18.7 |
| 8 | 56.2 | 75.5 | 11.8 | 4 | 64.8 | 93.9 | 66.5 | 8 | 67.9 | 98.0 | 75.4 | 8 | 57.3 | 81.2 | 29.9 |
| 10 | 58.1 | 81.7 | 17.7 | 5 | 66.0 | 95.4 | 72.5 | 10 | 69.9 | 98.7 | 82.1 | 10 | 59.8 | 87.9 | 42.7 |
| 12 | 59.8 | 86.0 | 22.8 | 6 | 67.1 | 96.6 | 77.8 | 12 | 71.9 | 99.1 | 88.0 | 12 | 62.1 | 91.9 | 51.7 |
| 14 | 61.4 | 89.4 | 29.7 | 7 | 68.3 | 97.4 | 83.1 | 14 | 74.0 | 99.4 | 92.2 | 14 | 64.3 | 94.5 | 58.9 |
| 16 | 63.0 | 92.0 | 36.5 | 8 | 69.6 | 97.8 | 86.1 | 16 | 76.2 | 99.5 | 94.7 | 16 | 66.5 | 96.6 | 68.2 |
| 18 | 64.7 | 93.8 | 42.4 | 9 | 71.1 | 98.3 | 87.7 | 18 | 78.7 | 99.7 | 96.6 | 18 | 68.7 | 97.8 | 76.5 |
| 20 | 66.4 | 95.2 | 50.0 | 10 | 72.8 | 98.7 | 89.8 | 20 | 81.6 | 99.8 | 98.3 | 20 | 70.9 | 98.6 | 83.0 |
| 22 | 68.3 | 96.3 | 59.6 | 11 | 74.9 | 99.2 | 92.4 | 22 | 84.8 | 99.9 | 99.4 | 22 | 73.3 | 99.2 | 88.5 |
| 24 | 70.3 | 97.3 | 68.3 | 12 | 77.5 | 99.7 | 96.7 | | | | | 24 | 75.8 | 99.5 | 93.6 |
| 26 | 72.6 | 98.2 | 76.3 | | | | | | | | | 26 | 78.4 | 99.7 | 96.1 |
| 28 | 75.1 | 98.8 | 83.6 | | | | | | | | | 28 | 81.2 | 99.8 | 97.8 |
| 30 | 78.0 | 99.5 | 89.0 | | | | | | | | | 30 | 84.3 | 99.9 | 99.2 |
| 32 | 81.3 | 99.9 | 96.8 | | | | | | | | | 32 | 87.6 | 100.0 | 99.7 |

Note. Four-Dimensional Symptom Questionnaire (4DSQ): DIST = Distress; DEP = Depression; ANX = Anxiety; SOM = Somatization; RS = Raw Score; T =
Calculated T-score; PRn = Percentile rank in the normal population; PRcl = Percentile rank in the clinical population.
resulted in θ-based T scores with a normal frequency distribution.
We also determined functions to convert sum
scores into T scores and showed that for most scales, calculated
T scores approximated θ-based T scores very well, as
the scores were strongly related (all ICCs ≥ .95, see Table 3).
Scores were similar across the width of the entire scale and
yielded similar mean values for the groups. Correspondence
between calculated and θ-based T scores supports
the validity of our approach to calculate T scores with a
function based on curve fitting. However, at the extreme
ends of the scales, the two approaches diverged somewhat.
Thus, caution with extreme scores is in order, especially
with T < 40 and T > 80. Furthermore, the findings for
two scales of the 4DSQ revealed that if raw scale scores
are too skewed or too leptokurtic to begin with (due to
an excess of respondents with the lowest possible score),
conversion to θ-based or calculated T scores will not yield
scores with a normal frequency distribution.
Compared to standard T scores (T = 10Z + 50), the
correspondence of the more complex conversion formulas
(correcting for a non-normal frequency distribution of raw
scores) with θ-based T scores was better. For instance,
the ICC between standard T scores and θ-based T scores
for the BSI-GSI was ICC = .93 (see Table 4), whereas calculated
T scores showed almost perfect correspondence with
θ-based T scores (ICC = .99; see Table 3). Better correspondence
was especially obtained in the higher score range.
For each scale, we cross-validated the formulas on subsamples
(splitting the samples randomly in half; we also
compared the population and clinical samples). The results
revealed a similarity and high correlation between predicted
scores from the learning sample and obtained
scores in the test sample. However, a substantial
number of statistical tests were thus performed per scale in each
dataset, and the applicability of these formulas still needs
further validation with data from other samples.
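The split-half cross-validation described above can be sketched as follows: fit a curve on a random half (the learning sample), predict the other half (the test sample), and summarize the residuals with RMSE, R², and MAE as in Table 2. This is a simplified illustration under stated assumptions: a quadratic polynomial stands in for the rational functions, the data are synthetic rather than the study data, and all names are hypothetical.

```python
import numpy as np


def split_half_validate(x, y, degree=2, seed=0):
    """Fit a polynomial on a random half and evaluate it on the other half."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Random split into a learning half and a test half.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    half = len(x) // 2
    learn, test = order[:half], order[half:]

    coefs = np.polyfit(x[learn], y[learn], degree)
    resid = y[test] - np.polyval(coefs, x[test])

    rmse = float(np.sqrt(np.mean(resid**2)))
    mae = float(np.mean(np.abs(resid)))
    ss_tot = float(np.sum((y[test] - y[test].mean()) ** 2))
    r2 = 1.0 - float(np.sum(resid**2)) / ss_tot
    return {"coefs": coefs, "rmse": rmse, "r2": r2, "mae": mae}


# Synthetic stand-in data (NOT the study data): raw scores on a 0-36 scale and
# noisy "theta-based" T scores generated around a quadratic trend.
rng = np.random.default_rng(42)
raw = rng.integers(0, 37, size=400)
theta_t = 32.1 + 2.39 * raw - 0.0121 * raw**2 + rng.normal(0, 2.0, size=400)

result = split_half_validate(raw, theta_t)

if __name__ == "__main__":
    print({k: v for k, v in result.items() if k != "coefs"})
```

Because the metrics are computed only on the half that was not used for fitting, they describe how well a formula generalizes rather than how well it fits the data it was derived from.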
Table 7. Crosswalk table for a selection of raw scores to calculated T-scores and percentile scores for the OQ total score, symptomatic distress,
interpersonal relations, social role functioning, and anxiety and social distress

| TOT RS | TOT T | TOT PRn | TOT PRcl | SD RS | SD T | SD PRn | SD PRcl | IR RS | IR T | IR PRn | IR PRcl | SR RS | SR T | SR PRn | SR PRcl | ASD RS | ASD T | ASD PRn | ASD PRcl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20.9 | 0.0 | 0.0 | 0 | 25.8 | 0.4 | 0.0 | 0 | 33.5 | 1.0 | 0.2 | 0 | 32.1 | 1.0 | 0.2 | 0 | 26.9 | 0.3 | 0.0 |
| 10 | 30.9 | 2.5 | 0.0 | 5 | 32.6 | 3.2 | 0.2 | 2 | 38.3 | 5.7 | 1.1 | 2 | 36.8 | 5.4 | 1.0 | 2 | 31.5 | 1.4 | 0.2 |
| 20 | 38.7 | 11.0 | 0.8 | 10 | 38.5 | 11.7 | 0.8 | 4 | 42.7 | 15.8 | 3.1 | 4 | 41.4 | 15.3 | 3.5 | 4 | 35.7 | 4.3 | 0.6 |
| 30 | 45.2 | 30.7 | 2.5 | 15 | 43.7 | 26.4 | 1.9 | 6 | 46.8 | 29.7 | 6.6 | 6 | 46.0 | 31.3 | 9.7 | 6 | 39.4 | 10.7 | 1.3 |
| 40 | 50.8 | 54.6 | 6.0 | 20 | 48.4 | 43.8 | 3.8 | 8 | 50.7 | 46.0 | 11.6 | 8 | 50.4 | 52.4 | 18.7 | 8 | 42.9 | 19.7 | 2.4 |
| 50 | 55.8 | 73.9 | 12.7 | 25 | 52.6 | 60.0 | 7.0 | 10 | 54.3 | 61.7 | 18.6 | 10 | 54.7 | 72.3 | 29.7 | 10 | 46.0 | 29.8 | 4.1 |
| 60 | 60.6 | 86.2 | 22.5 | 30 | 56.6 | 72.7 | 12.0 | 12 | 57.7 | 75.2 | 27.4 | 12 | 59.0 | 85.6 | 42.6 | 12 | 49.0 | 41.0 | 6.6 |
| 70 | 65.0 | 93.9 | 35.5 | 35 | 60.3 | 83.1 | 19.3 | 14 | 60.9 | 85.7 | 37.2 | 14 | 63.1 | 92.4 | 56.3 | 14 | 51.8 | 51.7 | 10.0 |
| 80 | 69.4 | 98.2 | 50.2 | 40 | 63.9 | 90.2 | 29.4 | 16 | 63.9 | 91.8 | 48.0 | 16 | 67.2 | 95.9 | 69.4 | 16 | 54.5 | 62.3 | 14.5 |
| 90 | 73.7 | 98.5 | 67.1 | 45 | 67.4 | 93.7 | 40.8 | 18 | 66.8 | 95.4 | 59.2 | 18 | 71.1 | 98.1 | 80.5 | 18 | 57.1 | 71.6 | 20.4 |
| 100 | 77.9 | 99.1 | 82.8 | 50 | 70.8 | 96.3 | 53.8 | 20 | 69.6 | 97.6 | 69.8 | 20 | 75.0 | 99.2 | 88.9 | 20 | 59.6 | 79.4 | 27.3 |
| 110 | 82.3 | 99.7 | 92.1 | 55 | 74.2 | 98.2 | 66.7 | 22 | 72.3 | 98.7 | 79.5 | 22 | 78.7 | 99.8 | 94.4 | 22 | 62.1 | 86.2 | 35.0 |
| 120 | 86.7 | 100.0 | 96.7 | 60 | 77.6 | 98.8 | 77.5 | 24 | 74.9 | 99.3 | 86.9 | 24 | 82.4 | 100.0 | 97.4 | 24 | 64.6 | 90.6 | 43.3 |
| 130 | 91.2 | 100.0 | 99.0 | 65 | 81.2 | 99.1 | 85.5 | 26 | 77.5 | 99.7 | 91.5 | 26 | 85.9 | 100.0 | 98.9 | 26 | 67.0 | 93.8 | 51.9 |
| 140 | 95.9 | 100.0 | 99.9 | 70 | 84.9 | 99.6 | 90.7 | 28 | 80.1 | 99.9 | 95.0 | 28 | 89.4 | 100.0 | 99.7 | 28 | 69.4 | 96.5 | 60.6 |
| 150 | 100.8 | 100.0 | 100.0 | 75 | 88.8 | 99.9 | 95.1 | 30 | 82.8 | 100.0 | 97.4 | 30 | 92.8 | 100.0 | 99.9 | 30 | 71.9 | 98.0 | 68.8 |
| 160 | 106.0 | 100.0 | 100.0 | 80 | 93.0 | 100.0 | 98.2 | 32 | 85.5 | 100.0 | 98.8 | 32 | 96.0 | 100.0 | 100.0 | 32 | 74.3 | 98.7 | 76.4 |
| 170 | 111.4 | 100.0 | 100.0 | 85 | 97.6 | 100.0 | 99.4 | 34 | 88.3 | 100.0 | 99.5 | 34 | 99.2 | 100.0 | 100.0 | 34 | 76.8 | 99.1 | 83.0 |
| 180 | 117.2 | 100.0 | 100.0 | 90 | 102.6 | 100.0 | 99.9 | 36 | 91.2 | 100.0 | 99.9 | 36 | 102.3 | 100.0 | 100.0 | 36 | 79.3 | 99.4 | 88.1 |
| | | | | 95 | 108.3 | 100.0 | 100.0 | 38 | 94.3 | 100.0 | 100.0 | | | | | 38 | 81.9 | 99.6 | 92.0 |
| | | | | 100 | 114.7 | 100.0 | 100.0 | 40 | 97.5 | 100.0 | 100.0 | | | | | 40 | 84.6 | 99.7 | 94.7 |
| | | | | | | | | 42 | 100.9 | 100.0 | 100.0 | | | | | 42 | 87.4 | 99.8 | 96.7 |
| | | | | | | | | 44 | 104.5 | 100.0 | 100.0 | | | | | 44 | 90.4 | 99.8 | 98.1 |
| | | | | | | | | | | | | | | | | 46 | 93.6 | 99.9 | 99.0 |
| | | | | | | | | | | | | | | | | 48 | 97.0 | 100.0 | 99.5 |
| | | | | | | | | | | | | | | | | 50 | 100.8 | 100.0 | 99.8 |
| | | | | | | | | | | | | | | | | 52 | 104.9 | 100.0 | 100.0 |

Note. Outcome Questionnaire (OQ): TOT = OQ45 Total Score; SD = Symptomatic Distress; IR = Interpersonal Relationships; SR = Social Role; ASD = Anxiety
and Social Distress; RS = Raw Score; T = Calculated T-score; PRn = Percentile rank in the normal population; PRcl = Percentile rank in the clinical
population.
The results show that some scales evoked the lowest possible
score from many respondents (especially the 4DSQ;
see Figure 2). This zero inflation in the data is quite common
when measures of psychopathology are administered
in the general population. With zero-inflated data, the
Graded Response Model (GRM) may yield biased results of IRT
analyses, and alternative models to deal with zero inflation
have been proposed (Wall et al., 2015), such as the Zero
Inflated Mixture GRM (ZIM-GRM). In a simulation study
with zero-inflated scores, the GRM showed substantial bias
(Smits et al., 2020). In future studies, the effect of zero
inflation on factor score estimates should be investigated
to ascertain the added value of these models.
It should be noted that there are viable alternatives to
arrive at normative values and subsequently calculate
T scores and percentile rank scores instead of the IRT-based
methods or the approach advocated in the present
paper. When scales are not unidimensional and/or the
IRT model does not fit, alternatives can be used, for
example, frequency-based percentile rank scores. In particular,
regression-based norming (Mellenbergh, 2011) is an
interesting option, for example, with the GAMLSS model
Figure 4. Linking raw scores to the T-score metric for subscales of the BSI, 4DSQ, and OQ (GSI = Global Severity Index; DEP = Depression; ANX =
Anxiety; SOM = Somatic complaints; 4DIST = Distress; 4DEP = Depression; 4ANX = Anxiety; 4SOM = Somatization; TOT = OQ-45 Total score; SD =
Symptomatic Distress; IR = Interpersonal Relationships; SR = Social Role; ASD = Anxiety and Social Distress).
(Stasinopoulos et al., 2018), in which the shape of the frequency
distribution of raw scores, as well as relevant
norm predictors, such as gender, age, and educational
level, can be taken into account. A useful introduction to
the continuous norming approach, which corrects for all
levels of these demographic variables, is offered by Timmerman
and colleagues (2021). However, for the present
purposes, we built upon the previous work of the PROsetta
Stone initiative and the PROMIS group.
The strengths of the study are the use of large datasets
with representative samples from the general population
and patients, warranting trust in the findings. The data from
the Dutch general population were collected by research
institutes with a good reputation, and the representativeness
of these population samples has been documented
by Scherpenzeel (2018) and Timman and colleagues
(2017). A relatively simple and straightforward approach
is outlined, which leads to calculated T scores that approximate
θ-based T scores well. A potential limitation of the
proposed method is that it relies heavily on data from
the general population, as this is the basis of the item
parameters and θs used to obtain T scores. The clinical
measurement instruments were not developed for the
population at large but rather for patients suffering from
psychological complaints or symptoms of psychiatric disorders.
Cronbach (1984) noted that, ideally, instruments should be
validated with data from the population for which the
instrument was intended. Indeed, the frequency distribution of
responses of non-clinical respondents deviated much more
from normality than was the case with data obtained in
the clinical samples, as the present findings show.
The present results were based on normative data from
the Netherlands. Application of the conversion formulas
listed in Table 2 or the crosswalk Tables 5–7 is limited to
this context, as general population respondents from other
countries may score differently on these self-report measures.
A further limitation may be the composition of the
clinical sample used to evaluate the frequency distribution
produced by the conversion formulas in clinical data. All these patients
suffered from mild to moderate common mental disorders,
and patients with severe mental illness were not included.
Future research should investigate other clinical samples.
Conclusion
The high correlations found in this study between θ-based
and calculated T scores for assessment with the 4DSQ, the
BSI, and the OQ-45 suggest that the proposed approach of
using conversion functions provides a good approximation
of a common normalized metric for MHC. The use
of such a common metric will make the interpretation of
test results easier for therapists and patients (de Beurs,
Fried, et al., 2022), will allow for better involvement
of the patient in shared decision-making regarding the
treatment (Patel et al., 2008), and will stimulate the future
uptake of measurement-based mental health care
(Kilbourne et al., 2018).
References
Bakker, I. M., Terluin, B., Van Marwijk, H. W., van der Windt, D. A. M., Rijmen, F., van Mechelen, W., & Stalman, W. A. (2007). A cluster-randomised trial evaluating an intervention for patients with stress-related mental disorders and sick leave in primary care. PLoS Clinical Trials, 2(6), Article e26. https://doi.org/10.1371/journal.pctr.0020026
Batterham, P. J., Sunderland, M., Slade, T., Calear, A. L., & Carragher, N. (2018). Assessing distress in the community: Psychometric properties and crosswalk comparison of eight measures of psychological distress. Psychological Medicine, 48(8), 1316–1324. https://doi.org/10.1017/S0033291717002835
Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 327(8476), 307–310. https://doi.org/10.1016/S0140-6736(86)90837-8
Bowman, M. L. (2002). The perfidy of percentiles. Archives of Clinical Neuropsychology, 17(3), 295–303. https://doi.org/10.1016/S0887-6177(01)00116-0
Camstra, A., & Boomsma, A. (1992). Cross-validation in regression and covariance structure analysis: An overview. Sociological Methods & Research, 21(1), 89–115. https://doi.org/10.1177/0049124192021001004
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06
Choi, S. W., Schalet, B., Cook, K. F., & Cella, D. (2014). Establishing a common metric for depressive symptoms: Linking the BDI-II, CES-D, and PHQ-9 to PROMIS Depression. Psychological Assessment, 26(2), 513–527. https://doi.org/10.1037/a0035768
Cook, K. F., Schalet, B. D., Kallen, M. A., Rutsohn, J. P., & Cella, D. (2015). Establishing a common metric for self-reported pain: Linking BPI Pain Interference and SF-36 Bodily Pain Subscale scores to the PROMIS Pain Interference metric. Quality of Life Research, 24(10), 2305–2318. https://doi.org/10.1007/s11136-014-0790-9
Crawford, J. R., & Garthwaite, P. H. (2009). Percentiles please: The case for expressing neuropsychological test scores and accompanying confidence limits as percentile ranks. The Clinical Neuropsychologist, 23(2), 193–204. https://doi.org/10.1080/13854040801968450
Crișan, D. R., Tendeiro, J. N., & Meijer, R. R. (2017). Investigating the practical consequences of model misfit in unidimensional IRT models. Applied Psychological Measurement, 41(6), 439–455. https://doi.org/10.1177/0146621617695522
Cronbach, L. J. (1984). Essentials of psychological testing (4th ed.). Harper & Row.
de Beurs, E., Carlier, I. V., & van Hemert, A. M. (2019). Approaches to denote treatment outcome: Clinical significance and clinical global impression compared. International Journal of Methods in Psychiatric Research, 28(4), Article e1797. https://doi.org/10.1002/mpr.1797
de Beurs, E., den Hollander-Gijsman, M., Buwalda, V., Trijsburg, W., & Zitman, F. G. (2005). De Outcome Questionnaire (OQ-45): Een meetinstrument voor meer dan alleen psychische klachten [The Outcome Questionnaire (OQ-45): A measure for psychiatric symptoms and more]. De Psycholoog, 40(1), 53–63.
de Beurs, E., den Hollander-Gijsman, M. E., van Rood, Y. R., van
der Wee, N. J., Giltay, E. J., van Noorden, M. S., van der Lem, R.,
van Fenema, E., & Zitman, F. G. (2011). Routine outcome
monitoring in the Netherlands: Practical experiences with a
web-based strategy for the assessment of treatment outcome
in clinical practice. Clinical Psychology & Psychotherapy, 18(1),
112. https://doi.org/10.1002/cpp.696
de Beurs, E., Fried, E. I., & Boehnke, J. (2022). Common measures
or common metrics? A plea to harmonize measurement results.
Clinical Psychology and Psychotherapy, 29(5), 17551767.
https://doi.org/10.1002/cpp.2742
de Beurs, E., Oudejans, S., & Terluin, B. (2022). Acommon
measurement scale for self-report instruments in mental health
care: T scores with a normal distribution (supplementary
materials). https://www.psycharchives.org/en/item/86e598e9
4828-412786ae-5f0d18e9586a
de Beurs, E., & Zitman, F. G. (2006). De Brief Symptom Inventory
(BSI): De betrouwbaarheid en validiteit van een handzaam
alternatief voor de SCL-90 [The Brief Symptom Inventory:
Reliability and validity of a handy alternative for the SCL-90].
Maandblad Geestelijke Volksgezondheid, 61, 120141.
de Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A.,
Spinhoven, P., & Heiser, W. J. (2007). The Outcome Question-
naire (OQ-45) in a Dutch population: A cross-cultural validation.
Clinical Psychology & Psychotherapy, 14(4), 288301. https://
doi.org/10.1002/cpp.529
Derogatis, L. R. (1975). The Brief Symptom Inventory. Clinical
Psychometric Research.
Dorans, N. J., Pommerich, M., & Holland, P. W. (Eds.). (2007).
Linking and aligning scores and scales. Springer.
Embretson, S. E., & Reise, S. P. (2013). Item response theory for
psychologists. Erlbaum.
Fischer, H. F., & Rose, M. (2016). www.common-metrics.org: A
web application to estimate scores from different patient-
reported outcome measures on a common scale. BMC Medical
Research Methodology, 16(1), Article 142. https://doi.org/
10.1186/s12874-016-0241-0
Fischer, H. F., Tritt, K., Klapp, B. F., & Fliege, H. (2011). How to
compare scores from different depression scales: Equating the
Patient Health Questionnaire (PHQ) and the ICD-10-Symptom
Rating (ISR) using item response theory. International Journal
of Methods in Psychiatric Research, 20(4), 203214. https://doi.
org/10.1002/mpr.350
Friedrich, M., Hinz, A., Kuhnt, S., Schulte, T., Rose, M., & Fischer,
F. (2019). Measuring fatigue in cancer patients: A common
metric for six fatigue instruments. Quality of Life Research,
28(6), 16151626. https://doi.org/10.1007/s11136-019-02147-3
Harding,K. J.,Rush,A.J.,Arbuckle,M.,Trivedi,M. H.,&Pincus,H. A.
(2011). Measurement-based care in psychiatric practice: A policy
framework for implementation. Journal of Clinical Psychiatry,
72(8), 11361143. https://doi.org/10.4088/JCP.10r06282whi
Holland, P. W., Dorans, N. J., & Petersen, N. S. (2006). Equating
test scores. In C. R. Rao & S. Sinharay (Eds.), Handbook of
statistics (Vol. 26, pp. 169203). https://doi.org/10.1016/
S0169-7161(06)26006-1
Kilbourne,A.M.,Beck,K.,Spaeth-Rublee,B.,Ramanuj,P.,OBrien,
R.W.,Tomoyasu,N.,&Pincus,H.A.(2018).Measuringand
improving the quality of mental health care: A global perspective.
World Psychiatry, 17(1), 3038. https://doi.org/10.1002/wps.20482
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and
linking: Methods and practices (3rd ed.). Springer Science &
Business Media.
Lambert, M. J., Gregersen, A. T., & Burlingame, G. M. (2004). The
Outcome Questionnaire 45. In M. E. Maruish (Ed.), The use of
psychological testing for treatment planning and outcomes
assessment: Volume 3: Instruments for adults (3rd ed., pp.
191234). Erlbaum. http://search.ebscohost.com/login.aspx?
direct=true&db=psyh&AN=2004-14941-006&site=ehost-live
Lambert, M. J., & Harmon, K. L. (2018). The merits of implement-
ing routine outcome monitoring in clinical practice. Clinical
Psychology: Science and Practice, 25(4), Article e12268. https://
doi.org/10.1111/cpsp.12268
Ley, P. (1972). Quantitative aspects of psychological assessment
(Vol. 1). Duckworth.
Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-
score and equipercentile observed-score equatings.Applied
Psychological Measurement, 8(4), 453461. https://doi.org/
10.1177/014662168400800409
McCall, W. A. (1922). How to measure in education. MacMillan.
Mellenbergh, G. J. (2011). A conceptual introduction to psycho-
metrics: Development, analysis and application of psychological
and educational tests. Eleven International. https://books.
google.es/books?id=jRJJYAAACAAJ
Miller, S. D., Hubble, M. A., Chow, D., & Seidel, J. (2015). Beyond
measures and monitoring: Realizing the potential of feedback-
informed treatment. Psychotherapy, 52(4), 449457. https://
doi.org/10.1037/pst0000031
Patel, S. R., Bakken, S., & Ruland, C. (2008). Recent advances in
shared decision making for mental health. Current Opinion in
Psychiatry, 21(6), 6066012. https://doi.org/10.1097/
YCO.0b013e32830eb6b4
Rosseel, Y. (2012). Lavaan: An R package for structural equation
modeling and more. Version 0.512 (BETA). Journal of Statistical
Software, 48(2), 136. https://doi.org/10.18637/jss.v048.i02
Schalet, B. D., Cook, K. F., Choi, S. W., & Cella, D. (2014).
Establishing a common metric for self-reported anxiety: Link-
ing the MASQ, PANAS, and GAD-7 to PROMIS Anxiety. Journal
of Anxiety Disorders, 28(1), 8896. https://doi.org/10.1016/
j.janxdis.2013.11.006
Schalet, B. D., Revicki, D. A., Cook, K. F., Krishnan, E., Fries, J. F., &
Cella, D. (2015). Establishing a common metric for physical
function: Linking the HAQ-DI and SF-36 PF subscale to
PROMIS
Ò
Physical Function. Journal of General Internal Medi-
cine, 30(10), 15171523. https://doi.org/10.1007/s11606-015-
3360-0
Scherpenzeel, A. C. (2018). “True” longitudinal and probability-based Internet panels: Evidence from the Netherlands. In M. Das, P. Ester, & L. Kaczmirek (Eds.), Social and behavioral research and the Internet (pp. 77–104). Routledge. https://doi.org/10.4324/9780203844922-4
Scherpenzeel, A. C., & Bethlehem, J. G. (2011). How representative are online panels? Problems of coverage and selection and possible solutions. In M. Das, P. Ester, & L. Kaczmirek (Eds.), Social and behavioral research and the Internet: Advances in applied methods and research strategies (pp. 105–132). Taylor & Francis.
Schreiber, J. B., Nora, A., Stage, F. K., Barlow, E. A., & King, J. (2006). Reporting structural equation modeling and confirmatory factor analysis results: A review. The Journal of Educational Research, 99(6), 323–338. https://doi.org/10.3200/JOER.99.6.323-338
Slade, M., Amering, M., & Oades, L. (2008). Recovery: An international perspective. Epidemiology and Psychiatric Sciences, 17(2), 128–137. https://doi.org/10.1017/S1121189X00002827
Smits, N. (2016). On the effect of adding clinical samples to validation studies of patient-reported outcome item banks: A simulation study. Quality of Life Research, 25(7), 1635–1644. https://doi.org/10.1007/s11136-015-1199-9
Smits, N., Öğreden, O., Garnier-Villarreal, M., Terwee, C. B., & Chalmers, R. P. (2020). A study of alternative approaches to non-normal latent trait distributions in item response theory models used for health outcome measurement. Statistical Methods in Medical Research, 29(4), 1030–1048. https://doi.org/10.1177/0962280220907625
Stasinopoulos, M. D., Rigby, R. A., & Bastiani, F. D. (2018). GAMLSS: A distributional regression approach. Statistical Modelling, 18(3–4), 248–273. https://doi.org/10.1177/1471082X18759144
ten Klooster, P. M., Oude Voshaar, M. A. H., Gandek, B., Rose, M., Bjorner, J. B., Taal, E., Glas, C. A. W., van Riel, P. L. C. M., & van de Laar, M. A. F. J. (2013). Development and evaluation of a crosswalk between the SF-36 Physical Functioning Scale and Health Assessment Questionnaire Disability Index in rheumatoid arthritis. Health and Quality of Life Outcomes, 11, Article 199. https://doi.org/10.1186/1477-7525-11-199
Terluin, B., Smits, N., Brouwers, E. P. M., & de Vet, H. C. W. (2016).
The Four-Dimensional Symptom Questionnaire (4DSQ) in the
general population: Scale structure, reliability, measurement
invariance and normative data: A cross-sectional survey.
Health and Quality of Life Outcomes, 14(1), Article 130.
https://doi.org/10.1186/s12955-016-0533-4
Terluin, B., van Marwijk, H. W., Adèr, H. J., de Vet, H. C., Penninx, B. W., Hermens, M. L., van Boeijen, C. A., van Balkom, A. J., van der Klink, J. J., & Stalman, W. A. (2006). The Four-Dimensional Symptom Questionnaire (4DSQ): A validation study of a multidimensional self-report questionnaire to assess distress, depression, anxiety and somatization. BMC Psychiatry, 6(1), Article 34. https://doi.org/10.1186/1471-244X-6-34
Timman, R., de Jong, K., & de Neve-Enthoven, N. (2017). Cut-off scores and clinical change indices for the Dutch Outcome Questionnaire (OQ-45) in a large sample of normal and several psychotherapeutic populations. Clinical Psychology & Psychotherapy, 24(1), 72–81. https://doi.org/10.1002/cpp.1979
Timmerman, M. E., Voncken, L., & Albers, C. J. (2021). A tutorial on regression-based norming of psychological tests with GAMLSS. Psychological Methods, 26(3), 357–373. https://doi.org/10.1037/met0000348
van der Laan, J. (2009). Representativity of the LISS panel.
Statistics Netherlands.
Van Hoeck, K. J. M., Lilien, M. R., Brinkman, D. C., & Schroeder, C. H. (2000). Comparing a urea kinetic monitor with Daugirdas formula and dietary records in children. Pediatric Nephrology, 14(4), 280–283. https://doi.org/10.1007/s004670050759
van Stralen, K. J., Dekker, F. W., Zoccali, C., & Jager, K. J. (2012). Measuring agreement, more complicated than it seems. Nephron Clinical Practice, 120(3), c162–c167. https://doi.org/10.1159/000337798
Wahl, I., Löwe, B., Bjorner, J. B., Fischer, F., Langs, G., Voderholzer, U., Aita, S. A., Bergemann, N., Brähler, E., & Rose, M. (2014). Standardization of depression measurement: A common metric was developed for 11 self-report depression measures. Journal of Clinical Epidemiology, 67(1), 73–86. https://doi.org/10.1016/j.jclinepi.2013.04.019
Wall, M. M., Park, J. Y., & Moustaki, I. (2015). IRT modeling in the presence of zero-inflation with application to psychiatric disorder severity. Applied Psychological Measurement, 39(8), 583–597. https://doi.org/10.1177/0146621615588184
History
Received April 19, 2021
Revision received July 6, 2022
Accepted August 3, 2022
Published online December 16, 2022
EJPA Section / Category Clinical Psychology
Acknowledgments
In this paper, we gratefully made use of BSI- and 4DSQ data of the
LISS (Longitudinal Internet Studies for the Social sciences) panel
administered by CentERdata (Tilburg University, The Netherlands)
and OQ-45 data from TNS-NIPO, The Netherlands. We would also
like to thank MHC providers in the Netherlands for providing
patient data on the BSI and the OQ-45 and the VU Medical Center
for providing patient data on the 4DSQ.
Conflict of Interest
The authors report no conflict of interest.
Publication Ethics
The data and narrative interpretations of the data/research
appearing in the manuscript have not been presented at a
conference or meeting, posted on a listserv, shared on a website,
including academic social networks like ResearchGate, and so
forth.
Open Science
We report on a reanalysis of data that have been published before. We report and refer to previous publications on how we determined our sample size, all data exclusions, all data inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, all measures in the study, and all analyses including all tested models. If we use inferential tests, we report exact p values, effect sizes, and 95% confidence or credible intervals.
Open Data: The information needed to reproduce all of the reported results is not openly accessible, but can be requested from the first author.
Open Materials: The information needed to reproduce all of the reported methodology is made available. We provided sufficient information for an independent researcher to reproduce the reported results, including a codebook (https://www.psycharchives.org/en/item/86e598e9-4828-4127-86ae-5f0d18e9586a; de Beurs, Oudejans, et al., 2022). Moreover, we have uploaded an annotated version of our R code with two practice datasets, which will allow other researchers to apply it to these data (and their own data).
Preregistration of Studies and Analysis Plans: This study was not
preregistered.
Edwin de Beurs
Department of Clinical Psychology
Faculty of Social Sciences
Leiden University
Wassenaarseweg 52
2333 AK Leiden
The Netherlands
e.de.beurs@arkin.nl
... Moreover, norms need to be established. Based on the Item Response Theory (IRT) analysis, factor scores could be used to obtain normalized standard scores (T-scores) [36], and percentile scores can be established. Both will offer a conversion of raw BSQ scores into common metrics, which will ease interpretation of scores on the BSQ, facilitate communication with patients, and increase the applicability of the BSQ [37]. ...
... Invariance of the unidimensional structure of the BSQ34 and the BSQ8C across sample 1 and sample 2 was investigated with a multi-group CFA measurement [53]. Finally, as described elsewhere in detail, an IRT-based transformation of scores was performed [36,37] on the data of the community sample to arrive at community-based normalized T-scores. First, an IRT model was fitted to the data of samples 1 and 2, and factor scores (thetas with M = 0, SD = 1 for sample 2) were calculated. ...
Article
This study examined the psychometric properties and provided normative data of the Dutch Body Shape Questionnaire (BSQ34) and its shortened BSQ8C among patients with binge-eating disorder. The two versions of the BSQ were administered to patients with binge-eating disorder (N = 155) enrolled for treatment, and to a community sample (N = 333). The translation and back-translation of the BSQ were performed by translators with and without eating-disorder expertise. Internal consistency, concurrent validity, test–retest reliability, incremental validity, and sensitivity to change were determined. A receiver-operating-characteristic curve analysis was used to establish criterion-related validity, for which the Eating Disorder Examination Shape Concern subscale was used. Unidimensionality of the instrument was investigated with confirmatory factor analysis. Norms (population-based T-scores and clinical percentile scores) were determined. The psychometric properties of the BSQs were satisfactory. The BSQ34 discriminated well in body-shape dissatisfaction between patients with binge-eating disorder and the community sample (area-under-the-curve value = 0.91–0.98) and had a unidimensional factor structure. Comparing structural invariance between both samples revealed that scalar invariance was not supported, indicating that items may be interpreted differently by patients with binge-eating disorder and subjects from the community. Analyses were repeated for the BSQ8C, which yielded similar results. The results indicated that both versions of the BSQ appeared suitable to screen for body-shape dissatisfaction among patients with binge-eating disorder. The BSQ34 supplies valuable information on the various types of concerns respondents have, which are critical to consider in clinical settings; the BSQ8C is recommended as a short screening tool. Level of evidence: Level III: Evidence obtained from well-designed cohort or case–control analytic studies.
... Spearman correlation analyses were also performed to analyze MI scores with related variables, expecting positive associations between MI scores and ODSIS, OASIS and neuroticism scores, and negative associations with extraversion and quality of life scores. Finally, Percentiles and T-scores (M = 50; SD = 10) were also calculated to provide a clinically useful scale as recommended by the literature [28], based on the scores for both studies. ...
Article
Background: The various systems of diagnosis and classification of mental disorders underline the need to evaluate the interference caused by the different disorders in a person’s daily life. The Maladjustment Inventory (MI) evaluates the impairment in the individual’s functioning in a brief and self-applied way, through six items. The objective of this research was to explore the psychometric properties of the MI scores through two studies, one with a Spanish clinical sample (Study 1) and another with a Spanish university students’ sample (Study 2). Methods: The total sample was made up of 928 participants (81.1% women, n = 495 clinical sample). Descriptive analyses, exploration of internal structure and reliability, exploratory and confirmatory factor analyses, relationship with other variables (quality of life, anxiety, depression, neuroticism and extraversion), and percentiles and T-scores were performed. Results: The results showed good psychometric properties of the MI, with good model fit for a one-factor solution in both samples, Cronbach’s alpha coefficients of 0.84–0.88, and evidence of validity based on the relationship with other variables. Conclusion: The good psychometric properties of the MI, together with its brevity, make it a recommended instrument for the evaluation of interference in both clinical and research contexts.
... One of the aims of the present research is to provide T-scores for the TMDP following a similar approach, but this time in the form of normalized T-scores (de Beurs, Oudejans, & Terluin, 2022). Previously, we have proposed an alternative method to establish T-scores, by first transforming raw scale scores to a normal distribution with either (1) an approach based on the Item Response Theory (Embretson & Reise, 2013) or with (2) an approach based on Rankit percentile-based normalization (Solomon & Sawilowsky, 2009). ...
Article
Converting raw scores to T-scores simplifies the interpretation of test results by providing a standardized metric. A simple linear conversion of raw scores to T-scores can produce misleading T-scores if the raw score distribution deviates significantly from normality. The aim of the present research was to provide normalized T-scores and percentile rank scores for the Test MultiDimensionnel de Personnalité (TMDP). T-scores were established by first transforming raw scale scores to a normal distribution with either an approach based on the Item Response Theory or with an approach based on Rankit percentile-based normalization. The three resulting T-scores (Linear-, IRT-, and Rankit-based T-scores) were compared. Results revealed that both normalization approaches yielded similar normalized T-scores, which deviated substantially from linear T-scores for most scales, especially in the higher range of the scores. Benefits of normalized T-scores and the need for gender- and age-based norms are discussed.
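The contrast between linear and Rankit-based T-scores described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `linear_t` and `rankit_t` and the raw scores are hypothetical, and the Rankit proportion is assumed to be (rank − 0.5) / n with average ranks for ties.

```python
from statistics import NormalDist, mean, stdev

def linear_t(scores):
    """Linear T-scores: 50 + 10 * z, with z based on the sample mean and SD."""
    m, s = mean(scores), stdev(scores)
    return [50 + 10 * (x - m) / s for x in scores]

def rankit_t(scores):
    """Normalized T-scores from the Rankit proportion (rank - 0.5) / n."""
    n = len(scores)
    ordered = sorted(scores)
    # Average rank per distinct value, so tied raw scores share one T-score.
    avg_rank = {}
    for value in set(ordered):
        first = ordered.index(value) + 1      # 1-based rank of first occurrence
        count = ordered.count(value)
        avg_rank[value] = first + (count - 1) / 2
    std_normal = NormalDist()
    return [50 + 10 * std_normal.inv_cdf((avg_rank[x] - 0.5) / n) for x in scores]

# Hypothetical, right-skewed raw scale scores (illustration only).
raw = [0, 0, 1, 1, 1, 2, 2, 3, 4, 6, 9, 15]
for x, lin, rk in zip(raw, linear_t(raw), rankit_t(raw)):
    print(f"raw={x:2d}  linear T={lin:5.1f}  Rankit T={rk:5.1f}")
```

With skewed data such as these, the linear T-score for the highest raw score exceeds its Rankit-based counterpart, which is in line with the abstract's remark that the two deviate mostly in the higher score range.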
... Baseline averages for some BSI subscales are markedly higher compared to established psychiatric outpatient norm scores (i.e., somatization and phobic anxiety) and some average baseline BSI subscale scores are lower compared to psychiatric outpatients norm scores (i.e., interpersonal sensitivity, depression, anxiety and hostility) (76). GSI and BSI subscales baseline scores are also more severe compared to established patient norms as reported by De Beurs (101,102), except for the BSI subscales interpersonal sensitivity and psychoticism which are comparable to the patients norms as reported by De Beurs (101). ...
Article
Objective: Although multimodal interventions are recommended in patients with severe depressive and/or anxiety disorders, available evidence is scarce. Therefore, the current study evaluates the effectiveness of an outpatient secondary care interdisciplinary multimodal integrative healthcare program, delivered within a transdiagnostic framework, for patients with (comorbid) depressive and/or anxiety disorders. Methods: Participants were 3,900 patients diagnosed with a depressive and/or anxiety disorder. The primary outcome was Health-Related Quality of Life (HRQoL) measured with the Research and Development-36 (RAND-36). Secondary outcomes included: (1) current psychological and physical symptoms measured with the Brief Symptom Inventory (BSI) and (2) symptoms of depression, anxiety, and stress measured with the Depression Anxiety Stress Scale (DASS). The healthcare program consisted of two active treatment phases: main 20-week program and a subsequent continuation-phase intervention (i.e., 12-month relapse prevention program). Mixed linear models were used to examine the effects of the healthcare program on primary/secondary outcomes over four time points: before start 20-week program (T0), halfway 20-week program (T1), end of 20-week program (T2) and end of 12-month relapse prevention program (T3). Results: Results showed significant improvements from T0 to T2 for the primary variable (i.e., RAND-36) and secondary variables (i.e., BSI/DASS). During the 12-month relapse prevention program, further significant improvements were mainly observed for secondary variables (i.e., BSI/DASS) and to a lesser extent for the primary variable (i.e., RAND-36). At the end of the relapse prevention program (i.e., T3), 63% of patients achieved remission of depressive symptoms (i.e., DASS depression score ≤ 9) and 67% of patients achieved remission of anxiety symptoms (i.e., DASS anxiety score ≤ 7). Conclusion: An interdisciplinary multimodal integrative healthcare program, delivered within a transdiagnostic framework, seems effective for patients suffering from depressive and/or anxiety disorders with regard to HRQoL and symptoms of psychopathology. As reimbursement and funding for interdisciplinary multimodal interventions in this patient group has been under pressure in recent years, this study could add important evidence by reporting on routinely collected outcome data from a large patient group. Future studies should further investigate the long-term stability of treatment outcomes after interdisciplinary multimodal interventions for patients suffering from depressive and/or anxiety disorders.
... We compared the fit of these various equations by their Bayesian Information Criterion (BIC) value for each scale. The procedure is described in more detail and cross-validated by de Beurs et al. (36). ...
Article
Background: There is a considerable gap between care provision and the demand for care for common mental disorders in low- and middle-income countries. Screening for these disorders, e.g., in primary care, will help to close this gap. However, appropriate norms and threshold values for screeners of common mental disorders are lacking. Methods: In a survey study, we gathered data on frequently used screeners for alcohol use disorders (AUDIT), depression (CES-D), and anxiety disorders (GAD-7, ACQ, and BSQ) in a representative sample from Suriname, a non-Latin American Caribbean country. A stratified sampling method was used by random selection of 2,863 respondents from 5 rural and 12 urban resorts. We established descriptive statistics of all scale scores and investigated unidimensionality. Furthermore, we compared scores by gender, age-group, and education level with t-tests and Mann–Whitney U tests, using a significance level of p < 0.05. Results: Norms and crosswalk tables were established for the conversion of raw scores into a common metric: T-scores. Furthermore, recommended cut-off values on the T-score metric for severity levels were compared with international cut-off values for raw scores on these screeners. Discussion: The appropriateness of these cut-offs and the value of converting raw scores into T-scores are discussed. Cut-off values help with screening and early detection of those who are likely to have a common mental health disorder and may require treatment. Conversion of raw scores to a common metric in this study facilitates the interpretation of questionnaire results for clinicians and can improve health care provision through measurement-based care.
Article
Use of standardised scores, such as T-scores and percentile rank order scores, enhances measurement‐based care. They facilitate communication between therapists and clients about test results, particularly for multidimensional measures such as the Symptoms Questionnaire (SQ‐48). By transforming raw scores into a common metric, clinicians can more easily interpret and discuss patient profiles of scores on the various scales of the measure. This study explored the advantages and disadvantages of standardised scores and percentile ranks, with a specific focus on T-scores, utilising cross‐sectional data from a general population sample (N = 516) and a clinical sample (N = 242). We outline various approaches for establishing T-scores and provide illustrative examples. The analysis of the SQ‐48 revealed the necessity of first normalising raw scores to obtain accurate T-scores. Normalisation based on an IRT model is deemed superior, but formulas converting summed scale scores provide a good approximation. Regarding percentile rank order scores, we demonstrated that clinical percentiles offer more meaningful interpretations than population‐based percentiles, due to restriction of range for the latter among clinical subjects. Gender and age group differences were identified, with significantly higher scores for women and individuals aged 55 and older. Benefits of normalised T-scores and the need for gender‐ and age‐specific norms for the SQ‐48 are discussed.
Article
Perianal skin cancer, while rare, can have severe consequences on patient quality of life. This study sought to address this question by examining patient lived experience, social impact, psychological symptoms, and perceived informational needs following a diagnosis of perianal skin cancer. Mixed methods enabled contextualization, exploration, and illumination of pertinent patient issues. Results indicated that stigma, communication, psychological and physical effects, and concern for survival may impact patient care. Psychometric data support these findings. Patient informational needs suggest a reduction in the perceived frequency of all needs and a strong preference for digital over face-to-face information provision. As a truly patient-centered study, this paper provides an important starting point for the consideration of psychosocial support for perianal skin cancer patients. The outcomes are disseminated across clinical policy, offering highlighted points of consideration. Perianal skin cancers are rare. They are rare enough that, in general, the evidence base lacks the larger studies and data that could enable more consistent practice worldwide. This is the first patient-centered investigation of the impacts of perianal skin cancer on quality of life and patient well-being, and it is the first patient study to cascade into policy amendment within the United Kingdom, notably within the clinical unit. This study provides concrete, specific outcomes related to patient experience and clinical need, and suggests both the generic nature of the support available to assist these patients and the lack of productive practice that could be molded and strengthened from it.
Article
Objective: The Attention-Deficit/Hyperactivity Disorder Rating Scale-IV (ADHD-RS-IV) assesses ADHD symptoms in children and adolescents. The original United States norms comprise percentiles. Yet, no Nordic percentile norms exist, and only T-scores, which (often falsely) assume normally distributed data, are currently available. Here, we for the first time provide Danish percentile norms for children aged 6–9 based on parent/caregiver-reports, and illustrate the potential consequences of T-scores when derived based on the expected skewed distribution of an ADHD scale in the population. Materials and methods: The sample comprised 1895 Danish schoolchildren (879 girls and 1016 boys) in 1st, 2nd, or 3rd grade from the general population. Their parents/caregivers completed the ADHD-RS-IV. Sex and age differences were investigated, percentiles were derived based on the observed score distributions, and for comparison, T-scores > 70 were estimated, which are expected to identify the top 2.3% under the assumption of normality. Results: Boys were rated to have higher ADHD-RS-IV scores than girls except on the impulsivity score. No age effects were found on the majority of scores. Sex-stratified and unisex percentiles (80, 90, 93, 98) were reported. The distribution of ADHD-RS-IV scores were highly skewed. T-score cutoffs identified a significantly higher proportion of and about twice as many children as having elevated ADHD symptoms than expected (4.3–5.2% vs. 2.3%). Conclusions: ADHD-RS-IV (parent/caregiver-report) percentile norms for young Danish schoolchildren are now available for future reference. The use of percentiles is considered appropriate given the skewed score distribution and since T-scores appear to over-identify children as having clinically elevated ADHD symptoms.
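The 2.3% figure quoted in this abstract follows from the normal distribution: T > 70 corresponds to z > 2. The sketch below checks that value and, purely as an assumed illustration of a skewed scale (not the ADHD-RS-IV data), shows what the same linear cutoff selects when scores follow an exponential distribution with rate 1.

```python
import math
from statistics import NormalDist

# Under a normal distribution, T > 70 corresponds to z > 2.
p_normal = 1 - NormalDist().cdf(2.0)

# Assumed skewed example: an exponential distribution with rate 1 has
# mean 1 and SD 1, so a linear T-score above 70 means x > 1 + 2 * 1 = 3.
p_skewed = math.exp(-3.0)  # P(X > 3) for Exp(1)

print(f"normal:      {p_normal:.1%}")
print(f"exponential: {p_skewed:.1%}")
```

Under these assumptions the cutoff flags about 2.3% of a normal population but about 5.0% of the skewed one, the same order of over-identification the abstract reports (4.3–5.2% vs. 2.3%).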
Article
Introduction: Saudi Arabia experiences elevated levels of body-shape dissatisfaction which might be related to the increased thin ideal. Studies on body-shape dissatisfaction are scarce, mainly because adapted assessment tools are unavailable. This study describes the Saudi-Arabic adaptation of the Body Shape Questionnaire (BSQ34), preliminarily examines the psychometric properties, and provides normative data. Methods: The BSQ34 was administered in a convenience community sample (N = 867) between April 2017 and May 2018. Receiver-operating-characteristic curve analysis was used to establish discriminant validity in a subsample (N = 602) in which the Eating Disorder Examination Shape Concern subscale was administered; the factor structure was investigated with confirmatory factor analyses, and T-scores and percentile scores were determined. Results: The BSQ34 discriminated well between low and high levels of body-shape dissatisfaction (area-under-the-curve value = 0.93), had high internal consistency and a unidimensional factor structure, and 23.9% appeared at risk for body-shape dissatisfaction. Analyses were repeated for the shortened BSQ8C, which yielded similar results. Discussion: The results indicated that the BSQ34 and BSQ8C appeared suitable measurement tools to screen for body-shape dissatisfaction in a Saudi convenience community sample, mainly comprising young, unmarried, and highly educated women. The BSQ34 supplies more information on the type of concerns respondents have, which is worthwhile when the measure is used in a clinical setting; the BSQ8C is recommended as a short screener. As body-shape dissatisfaction is viewed as a risk factor for the development of eating disorder symptoms, screening for body-shape dissatisfaction with reliable tools is important to detect individuals at risk for eating disorder symptoms and may suggest subsequent preventive steps.
Article
Objective: There is a great variety of measurement instruments to assess similar constructs in clinical research and practice. This complicates the interpretation of test results and hampers the implementation of measurement-based care. Method: For reporting and discussing test results with patients, we suggest converting test results into universally applicable common metrics. Two well-established metrics are reviewed: T scores and percentile ranks. Their calculation is explained, their merits and drawbacks are discussed, and recommendations for the most convenient reference group are provided. Results: We propose to express test results as T scores with the general population as reference group. To elucidate test results to patients, T scores may be supplemented with percentile ranks, based on data from a clinical sample. The practical benefits are demonstrated using the published data of four frequently used instruments for measuring depression: the CES-D, PHQ-9, BDI-II, and the PROMIS depression measure. Discussion: Recent initiatives have proposed to mandate a limited set of outcome measures to harmonize clinical measurement. However, the selected instruments are not without flaws and, potentially, this directive may hamper future instrument development. We recommend using common metrics as an alternative approach to harmonize test results in clinical practice, as this will facilitate the integration of measures in day-to-day practice.
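The pairing proposed in this abstract, a T score referenced to the general population plus a percentile rank referenced to a clinical sample, might be sketched as follows. All numbers, the helper names `t_score` and `percentile_rank`, and the mid-rank handling of ties are assumptions for illustration; they are not taken from the reviewed instruments.

```python
from bisect import bisect_left, bisect_right

def t_score(score, ref_mean, ref_sd):
    """Linear T score relative to a reference group: 50 + 10 * z."""
    return 50 + 10 * (score - ref_mean) / ref_sd

def percentile_rank(score, reference_scores):
    """Percentage of the reference sample scoring below `score`,
    counting half of the tied scores (mid-rank definition)."""
    ordered = sorted(reference_scores)
    below = bisect_left(ordered, score)
    tied = bisect_right(ordered, score) - below
    return 100 * (below + 0.5 * tied) / len(ordered)

# Hypothetical reference data (illustration only).
population_mean, population_sd = 8.0, 6.0
clinical_sample = [10, 12, 14, 15, 15, 18, 20, 22, 25, 30]

raw = 15
print(f"T score vs. general population: {t_score(raw, population_mean, population_sd):.1f}")
print(f"Percentile rank vs. clinical sample: {percentile_rank(raw, clinical_sample):.0f}")
```

In this made-up case a raw score of 15 is clearly elevated relative to the population (T of about 61.7) yet sits at the 40th percentile among patients, which shows why the two reference groups convey different information.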
Article
A norm-referenced score expresses the position of an individual test taker in the reference population, thereby enabling a proper interpretation of the test score. Such normed scores are derived from test scores obtained from a sample of the reference population. Typically, multiple reference populations exist for a test, namely when the norm-referenced scores depend on individual characteristic(s), such as age (and sex). To derive normed scores, regression-based norming has gained large popularity. The advantages of this method over traditional norming are its flexible nature, yielding potentially more realistic norms, and its efficiency, requiring potentially smaller sample sizes to achieve the same precision. In this tutorial, we introduce the reader to regression-based norming, using the generalized additive models for location, scale, and shape (GAMLSS). This approach has been useful in norm estimation of various psychological tests. We discuss the rationale of regression-based norming, theoretical properties of GAMLSS and their relationships to other regression-based norming models. Based on 6 steps, we describe how to: (a) design a normative study to gather proper normative sample data; (b) select a proper GAMLSS model for an empirical scale; (c) derive the desired normed scores for the scale from the fitted model, including those for a composite scale; and (d) visualize the results to achieve insight into the properties of the scale. Following these steps yields regression-based norms with GAMLSS for a psychological test, as we illustrate with normative data of the intelligence test IDS-2. The complete R code and data set are provided as online supplemental material. (PsycInfo Database Record (c) 2020 APA, all rights reserved).
Article
It is often unrealistic to assume normally distributed latent traits in the measurement of health outcomes. If normality is violated, the item response theory (IRT) models that are used to calibrate questionnaires may yield parameter estimates that are biased. Recently, IRT models were developed for dealing with specific deviations from normality, such as zero-inflation (“excess zeros”) and skewness. However, these models have not yet been evaluated under conditions representative of item bank development for health outcomes, characterized by a large number of polytomous items. A simulation study was performed to compare the bias in parameter estimates of the graded response model (GRM), polytomous extensions of the zero-inflated mixture IRT (ZIM-GRM), and Davidian Curve IRT (DC-GRM). In the case of zero-inflation, the GRM showed high bias overestimating discrimination parameters and yielding estimates of threshold parameters that were too high and too close to one another, while ZIM-GRM showed no bias. In the case of skewness, the GRM and DC-GRM showed little bias with the GRM showing slightly better results. Consequences for the development of health outcome measures are discussed.
Article
Objectives: The authors of a previous study proposed a statistically based approach to denote treatment outcome, translating pretest and posttest scores into clinically relevant categories, such as recovery and reliable improvement. We assessed the convergent validity of the Jacobson-Truax (JT) approach, using T-score based cutoff values, with ratings by an independent evaluator. Methods: Pretest and retest scores on the Brief Symptom Inventory (BSI) and clinical global impression improvement (CGI-I) ratings were collected repeatedly through routine outcome monitoring from 5,900 outpatients with common mental disorders. Data were collected in everyday practice in a large mental health care provider. Results: Continuous pretest-to-retest BSI change scores had a stronger association with CGI-I than the categorical variable based on JT. However, JT categorization and improvement according to CGI converged substantially with association indices (Somers' D) ranging from D = .50 to .56. Discordance was predominantly due to a more positive outcome according to JT than on CGI-I ratings. Conclusion: Converting continuous outcome variables into clinically meaningful categories comes at the price of somewhat diminished concurrent validity with CGI-I. Nevertheless, support was found for the proposed threshold values for reliable change and recovery, and the outcome denoted in these terms corresponded with CGI improvement for most patients.
Article
Purpose: Fatigue is one of the most disabling symptoms in cancer patients. Many instruments exist to measure fatigue. This variety impedes the comparison of data across studies or to the general population. We aimed to estimate a common metric based on six different fatigue instruments (EORTC QLQ-C30 subscale fatigue, EORTC QLQ-FA12, MFI subscale General Fatigue, BFI, Fatigue Scale, and Fatigue Diagnostic Interview Guide) to convert the patients’ scores from one of the instruments to another. Additionally, we linked the common metric to the general population. Methods: For n = 1225 cancer patients, the common metric was estimated using the Item Response Theory framework. The linking between the common metric of the patients and the general population was estimated using linear regression. Results: The common metric was based on a model with acceptable fit (CFI = 0.94, SRMR = 0.06). Based on the standard error of measurement the reliability coefficients of the questionnaires ranged from 0.80 to 0.95. The common metric of the six questionnaires, also linked to the general population, is reported graphically and in supplementary crosswalk tables. Conclusions: Our study enables researchers and clinicians to directly compare results across studies using different fatigue questionnaires and to assess the degree of fatigue with respect to the general population.
Article
The advantages of formally monitoring client treatment response are described in the context of implementation issues related to evidence-based practice. Despite clear evidence that routine outcome monitoring (ROM) can be effective at reducing treatment failure and enhancing the positive effects of psychotherapy, these practices are not attractive to clinicians and are not widely taught in graduate school or internship sites. Some reasons for this are suggested, and solutions are recommended. © 2018 American Psychological Association. Published by Wiley Periodicals, Inc., on behalf of the American Psychological Association.
Book
This book develops an intuitive understanding of IRT principles through the use of graphical displays and analogies to familiar psychological principles. It surveys contemporary IRT models, estimation methods, and computer programs. Polytomous IRT models are given central coverage since many psychological tests use rating scales. Ideal for clinical, industrial, counseling, educational, and behavioral medicine professionals and students familiar with classical testing principles, exposure to material covered in first-year graduate statistics courses is helpful. All symbols and equations are thoroughly explained verbally and graphically. © 2000 by Lawrence Erlbaum Associates, Inc. All rights reserved.
Article
A tutorial of the generalized additive models for location, scale and shape (GAMLSS) is given here using two examples. GAMLSS is a general framework for performing regression analysis where not only the location (e.g., the mean) of the distribution but also the scale and shape of the distribution can be modelled by explanatory variables.