RESEARCH ARTICLE
Asymmetrical reliability of the Alda score favours a dichotomous representation of lithium responsiveness
Abraham Nunes 1,2, Thomas Trappenberg 2, Martin Alda 1¤*, The international Consortium on Lithium Genetics (ConLiGen)¶
1 Department of Psychiatry, Dalhousie University, Halifax, Nova Scotia, Canada
2 Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
¤ Current address: QEII Health Sciences Centre, Halifax, Nova Scotia, Canada
¶ Membership list can be found in the Acknowledgments section.
* malda@dal.ca
Abstract
The Alda score is commonly used to quantify lithium responsiveness in bipolar disorder. Most often, this score is dichotomized into “responder” and “non-responder” categories. This practice is often criticized as inappropriate, since continuous variables are thought to invariably be “more informative” than their dichotomizations. We therefore investigated the degree of informativeness across raw and dichotomized versions of the Alda score, using data from a published study of the scale’s inter-rater reliability (n = 59 raters of 12 standardized vignettes each). After learning a generative model for the relationship between observed and ground truth scores (the latter defined by a consensus rating of the 12 vignettes), we show that the dichotomized scale is more robust to inter-rater disagreement than the raw 0-10 scale. Further theoretical analysis shows that when a measure’s reliability is stronger at one extreme of the continuum (a scenario which has received little-to-no statistical attention, but which likely occurs for the Alda score ≥ 7), dichotomization of a continuous variable may be more informative concerning its ground truth value, particularly in the presence of noise. Our study suggests that research employing the Alda score of lithium responsiveness should continue using the dichotomous definition, particularly when data are sampled across multiple raters.
Introduction
The Alda score is a validated index of lithium responsiveness commonly used in bipolar disorder (BD) research [1]. This scale has two components. The first is the “A” subscale that provides an ordinal score (from 0 to 10, inclusive) of the overall “response” in a therapeutic trial of lithium. The second component is the “B” subscale that attempts to qualify the degree to which any improvement was causally related to lithium. The total Alda score is computed based on these two subscale scores, and takes integer values between 0 and 10. Many studies that employ the Alda score as a target variable dichotomize it, such that individuals with scores ≥ 7 are classified as “responders,” and those with scores < 7 are “non-responders.”
PLOS ONE | https://doi.org/10.1371/journal.pone.0225353 January 27, 2020 1 / 15
OPEN ACCESS
Citation: Nunes A, Trappenberg T, Alda M, The international Consortium on Lithium Genetics (ConLiGen) (2020) Asymmetrical reliability of the Alda score favours a dichotomous representation of lithium responsiveness. PLoS ONE 15(1): e0225353. https://doi.org/10.1371/journal.pone.0225353
Editor: Vincenzo De Luca, University of Toronto, CANADA
Received: October 31, 2019
Accepted: December 27, 2019
Published: January 27, 2020
Peer Review History: PLOS recognizes the benefits of transparency in the peer review process; therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. The editorial history of this article is available here: https://doi.org/10.1371/journal.pone.0225353
Copyright: © 2020 Nunes et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: All relevant data are within the manuscript and its Supporting Information files.
A common criticism of this practice is that continuous variables should not be discretized because of “information loss.” Indeed, discretizing continuous variables is widely viewed as an inappropriate practice [2–12]. However, the practice remains common across many areas of research, including our group’s work on lithium responsiveness in BD [13]. The primary justification for using the dichotomized Alda score as the lithium responsiveness definition has been based on the inter-rater reliability study by Manchia et al. [1], who showed that a cut-off of 7 had strong inter-rater agreement (weighted kappa 0.66). Furthermore, using mixture modeling, they also found that the empirical distribution of Alda scores supports the discretized definition. Therefore, there exist competing arguments regarding the appropriateness of dichotomizing lithium response. Resolving this dispute is critical, since the operational definition of lithium responsiveness is a concept upon which a large body of research will depend.
Although the Manchia et al. [1] analysis provides some justification for using a dichotomous lithium response definition, it does not dispel the argument of discretization-induced information loss entirely. However, there is some intuitive reason to believe that discretization is, at least pragmatically, the best approach to defining lithium response using the Alda score. First, the Alda score remains inherently subjective to some degree and is not based on precise biological measurements; an individual whose “true” Alda score is 6, for example, could have observed scores that vary widely across raters. Second, it is possible that responders may be more reliably identified than non-responders. For example, unambiguously “excellent” lithium response is a phenomenon that undoubtedly exists in naturalistic settings [14,15], and for which the space of possible Alda scores is substantially smaller than for non-responders; that is, an Alda score of 8 can be obtained in far fewer ways than an Alda score of 5. As such, we hypothesize that agreement on the Alda score is higher at the upper end of the score range, and that this asymmetric agreement is a scenario in which dichotomization of the score is more informative than the raw measure. To evaluate this, we present both empirical re-analysis of the ConLiGen study by Manchia et al. [1] and analyses of simulated data with varying levels of asymmetrical inter-rater reliability.
Materials and methods
Data
A detailed description of the data and collection procedures is found in Manchia et al. [1]. Samples included in our analysis are detailed in Table 1, including the number of raters included across sites, and the average ratings obtained at each of those sites across the 12 assessment vignettes. As a gold standard, we used ratings that were assigned to each case vignette using a consensus process at the Halifax site (scores are noted in the first row of Table 1). The lithium responsiveness inter-rater reliability data are available in S1 File (total Alda score), and S2 File (Alda A-score).
Empirical analysis of the Alda score
In this analysis, we seek to evaluate whether discretization of the Alda score under the existing inter-rater reliability values preserves more mutual information (MI) between the observed and ground truth labels than does the raw scale representation. To accomplish this, we first develop a probabilistic formulation of raters’ score assignments based on a multinomial-Dirichlet model, which we describe below. Since the Dirichlet distribution is the conjugate prior for the multinomial distribution, the posterior distribution over ratings (and ultimately the MI with respect to the “ground truth” Alda score) can be expressed as a closed-form function of the prior uncertainty, which increases the precision and efficiency of our experiments.
Funding: Genome Canada (MA, AN; https://www.genomecanada.ca), Dalhousie Department of Psychiatry Research Fund (MA, AN; https://medicine.dal.ca/departments/department-sites/psychiatry.html), Canadian Institutes of Health Research #64410 (MA; http://www.cihr-irsc.gc.ca), Natural Science and Engineering Research Council of Canada (TT; https://www.nserc-crsng.gc.ca), Nova Scotia Health Research Foundation Scotia Scholars Graduate Scholarship (AN; https://nshrf.ca), Killam Postgraduate Scholarship (AN; http://www.killamlaureates.ca) and Dalhousie Medical Research Foundation and the Lindsay family (AN and MA). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
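Let $n_i^{(k)} \in \mathbb{N}^{+}$ denote the number of raters who assigned an Alda score $i \in A$, with $A = \{0, 1, \ldots, 10\}$, to an individual whose gold standard Alda score is $k \in A$. The vector of rating counts for the gold standard score $k$ is $\mathbf{n}^{(k)} = (n_i^{(k)})_{i \in A}$. The probability of $\mathbf{n}^{(k)}$ is multinomial with parameter vector $\boldsymbol{\theta}^{(k)} = (\theta_i^{(k)})_{i \in A}$, which is itself Dirichlet distributed, $\boldsymbol{\theta}^{(k)} \sim \mathrm{Dir}(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$, where $\boldsymbol{\alpha}$ is a pseudocount denoting the prior expectation of the number of ratings received for each score $i \in A$. In the present analysis, we assume that $\boldsymbol{\alpha}$ is equal across all scores in $A$, and thus we denote it simply as a scalar, $\boldsymbol{\alpha} = \alpha$; increasing $\alpha$ has the effect of increasing the uncertainty of $\boldsymbol{\theta}^{(k)}$ (i.e. the ratings become more “noisy”).

The posterior of $\boldsymbol{\theta}^{(k)}$ given $\mathbf{n}^{(k)}$ and $\alpha$ is Dirichlet with parameters $\boldsymbol{\alpha}' = \{\alpha + n_i^{(k)} - 1\}_{i=0}^{10}$, and its maximum a posteriori (MAP) estimate is

\[
\hat{\boldsymbol{\theta}}_{\alpha}\left(\mathbf{n}^{(k)}\right) = \left\{ \frac{\alpha + n_i^{(k)} - 1}{\sum_{j=0}^{10} \left(\alpha + n_j^{(k)} - 1\right)} \right\}_{i=0}^{10}, \tag{1}
\]

which can be viewed as the conditional distribution over scores $A$ for any given rater when the gold standard is $k$. In cases where no assessment vignette had a gold standard rating of $k$, we assumed that

\[
\mathbf{n}^{(k)} =
\begin{cases}
\frac{1}{2}\left(\mathbf{n}^{(k-1)} + \mathbf{n}^{(k+1)}\right) & 0 < k < 10 \\
\mathbf{n}^{(k+1)} & k = 0 \\
\mathbf{n}^{(k-1)} & k = 10
\end{cases}
\tag{2}
\]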
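Eqs 1 and 2 can be illustrated with a minimal Python sketch. It is not the analysis code from the supporting files. The function names and the toy counts are hypothetical, and the sketch assumes that each missing gold standard score has observed neighbours and that α is large enough for every term α + n_i − 1 to be non-negative.

```python
from typing import Dict, List

SCORES = range(11)  # Alda score domain A = {0, 1, ..., 10}


def fill_missing_counts(counts: Dict[int, List[float]]) -> Dict[int, List[float]]:
    """Fill n^(k) for gold standard scores k that no vignette covered (Eq 2).

    Assumes the neighbouring gold standard scores were observed; a KeyError
    is raised otherwise.
    """
    filled = {k: list(v) for k, v in counts.items()}
    for k in SCORES:
        if k in counts:
            continue
        if k == 0:
            filled[k] = list(counts[1])
        elif k == 10:
            filled[k] = list(counts[9])
        else:
            lower, upper = counts[k - 1], counts[k + 1]
            filled[k] = [0.5 * (a + b) for a, b in zip(lower, upper)]
    return filled


def map_estimate(n_k: List[float], alpha: float) -> List[float]:
    """MAP estimate of theta^(k) from Eq 1 for a scalar pseudocount alpha."""
    numer = [alpha + n - 1.0 for n in n_k]
    total = sum(numer)
    if min(numer) < 0 or total <= 0:
        raise ValueError("Eq 1 needs alpha + n_i - 1 >= 0 with a positive sum")
    return [v / total for v in numer]


# Hypothetical toy counts: all 59 raters agree with the gold standard.
observed_gold = [1, 3, 5, 6, 7, 8, 9]  # gold standard values present in Table 1
toy_counts = {k: [59.0 if i == k else 0.0 for i in SCORES] for k in observed_gold}
all_counts = fill_missing_counts(toy_counts)
theta_hat_8 = map_estimate(all_counts[8], alpha=2.0)
```

With these toy counts, the interpolated vector for a gold standard of 2 splits its mass evenly between scores 1 and 3, which is the behaviour Eq 2 describes.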
Table 1. Number of raters and mean scores across sites. The total number of raters (n_r) was 59.

                  Case vignette
Site           n_r    1    2    3    4    5    6    7    8    9   10   11   12
Gold standard    -    8    9    6    7    9    3    5    9    3    9    5    1
Halifax          9  8.4  8.6  6.6  6.9  9.2    3  3.9  8.8  3.1  9.1  4.7  1.2
NIMH             4  7.8  8.2  6.2    7  8.8  3.2    4  8.5  2.2  8.5  3.2  1.8
Poznan           2    9  8.5  6.5  5.5    9    4  7.5    9    5    8  4.5  4.5
Dresden          2  8.5  7.5    6    5  8.5  1.5    6    9  3.5  8.5    4  1.5
Japan            4    8  8.2  4.8  6.5  8.5    2    3  8.5    1  8.2  4.5  1.5
Wuerzburg        2  7.5  7.5    4  6.5    8  1.5    3    9    0    7    3  0.5
Cagliari         3  7.7    9  4.3    7  5.7    4  1.3    9  0.7  7.3    4    2
San Diego        2  7.5  8.5  7.5    7    9    5  7.5  8.5  3.5  8.5    6  3.5
Boston           2  8.5  8.5    6    7    9    3  3.5  8.5  1.5    9    4    1
Gottingen        2  9.5    9    4    6    9    1    1    9  1.5    9    4    3
Berlin           1    7    9    4    6    9    2    3    8    0    7    0    2
Taipeh           1    8    8    5    8    9    5    6    9    4    9    8    1
Prague           1    7    9    4    8    9    3    6    9    3    9    6    1
Johns Hopkins    7    8  8.7  5.3  5.9  8.3  2.7  2.4  9.1    2  8.3  4.4  1.1
Mayo             6    8  8.2    6    8    9  4.2    3    9  4.2  8.8  3.7  0.3
Brasil           3    8  8.3  5.3  6.3  8.7    2    4    9  4.3    8  4.7  0.7
Medellin         4  7.5    9  5.5  6.5    5  2.5    4  7.2  4.8  8.8  1.2    2
Geneve           3  7.7  8.7  6.7  5.3  9.7    5    6  8.7  1.3    9  3.7  0.3
https://doi.org/10.1371/journal.pone.0225353.t001
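The dichotomized Alda scores are defined as $T = \{\delta[i \geq \tau] : \forall i \in A\}$, where $\tau$ is the dichotomization threshold (set at $\tau = 7$ for the Alda score), and where $\delta[\cdot]$ is an indicator function that evaluates to 1 if the argument is true, and 0 otherwise. Given threshold $\tau$ (Responders $\geq \tau$ and Non-responders $< \tau$), the dichotomous counts are represented as follows:

\[
\begin{aligned}
c_0^{(0)} &= \sum_{k=0}^{\tau-1} \sum_{i=0}^{\tau-1} n_i^{(k)} && \text{Observed} < \tau,\ \text{Gold Standard} < \tau \\
c_0^{(1)} &= \sum_{k=\tau}^{10} \sum_{i=0}^{\tau-1} n_i^{(k)} && \text{Observed} < \tau,\ \text{Gold Standard} \geq \tau \\
c_1^{(0)} &= \sum_{k=0}^{\tau-1} \sum_{i=\tau}^{10} n_i^{(k)} && \text{Observed} \geq \tau,\ \text{Gold Standard} < \tau \\
c_1^{(1)} &= \sum_{k=\tau}^{10} \sum_{i=\tau}^{10} n_i^{(k)} && \text{Observed} \geq \tau,\ \text{Gold Standard} \geq \tau
\end{aligned}
\tag{3}
\]

with $\mathbf{c}^{(k)} \sim \mathrm{Multinomial}(\boldsymbol{\phi}_k)$, and $\boldsymbol{\phi}_k \sim \mathrm{Dir}(\boldsymbol{\phi} \mid \xi)$, where $\xi$ is a pseudocount for the number of dichotomized ratings assigned to each of non-responders and responders. We can thus estimate the conditional distribution over observed dichotomized response ratings as

\[
\hat{\xi}\left(\mathbf{c}^{(k)}\right) = \left\{ \frac{\xi + c_0^{(k)} - 1}{2\xi - 2 + c_0^{(k)} + c_1^{(k)}},\ \frac{\xi + c_1^{(k)} - 1}{2\xi - 2 + c_0^{(k)} + c_1^{(k)}} \right\} \tag{4}
\]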
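The collapse of the 11 × 11 rating counts into the 2 × 2 counts of Eq 3, and the estimate in Eq 4, can be sketched as follows. This is an illustration under stated assumptions, not the published code: the function names and the toy counts are hypothetical, and the returned table is indexed as c[gold class][observed class], so c[1][0] corresponds to $c_0^{(1)}$.

```python
from typing import Dict, List


def dichotomize_counts(counts: Dict[int, List[float]], tau: int = 7) -> List[List[float]]:
    """Collapse raw rating counts into 2x2 counts (Eq 3).

    counts[k][i] is the number of raters giving score i when the gold
    standard is k. Class 1 means score >= tau (responder), class 0 means
    score < tau (non-responder).
    """
    c = [[0.0, 0.0], [0.0, 0.0]]
    for k, n_k in counts.items():
        for i, n in enumerate(n_k):
            c[int(k >= tau)][int(i >= tau)] += n
    return c


def xi_hat(c_k: List[float], xi: float) -> List[float]:
    """Conditional distribution over observed classes for one gold class (Eq 4)."""
    denom = 2.0 * xi - 2.0 + c_k[0] + c_k[1]
    if denom <= 0:
        raise ValueError("Eq 4 needs a positive denominator")
    return [(xi + c_k[0] - 1.0) / denom, (xi + c_k[1] - 1.0) / denom]


# Hypothetical toy counts for two gold standard scores (12 raters each).
toy_counts = {
    3: [0, 0, 0, 10, 0, 0, 0, 0, 2, 0, 0],  # gold standard 3: two raters scored 8
    9: [0, 0, 0, 0, 0, 1, 0, 0, 0, 11, 0],  # gold standard 9: one rater scored 5
}
c = dichotomize_counts(toy_counts, tau=7)
p_obs_given_responder = xi_hat(c[1], xi=1.0)
```

In this toy case, 11 of the 12 ratings for the gold standard responder fall at or above the threshold, so the estimated probability of an observed responder label is 11/12 when ξ = 1.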
Mutual information of raw and dichotomized Alda score representations. Mutual information is a general measure of dependence that expresses the degree to which uncertainty about one variable is reduced by observation of another. Whereas the correlation coefficient depends on the existence of a linear association, MI can detect nonlinear relationships between variables by comparing their joint probability against the product of their marginal distributions.

Let

\[
x_o \sim p(x_o \mid x) = \mathrm{Categorical}\left(\hat{\boldsymbol{\theta}}_{\alpha}\left(\mathbf{n}^{(x)}\right)\right) \tag{5}
\]

denote a given observed raw Alda score assigned to a case with ground truth score of $x \in A$. Given uniform priors on the true classes, the joint distribution is

\[
p(x_o, x) = p(x_o \mid x)\,p(x) = \left\{ \frac{1}{11}\, \hat{\boldsymbol{\theta}}_{\alpha}\left(\mathbf{n}^{(x=k)}\right) \right\}_{k=0,1,\ldots,10}. \tag{6}
\]

For the binarized classes, we have a prior of $p(y = 1) = \frac{4}{11}$, and the joint distribution is thus

\[
p(y_o, y) = p(y_o \mid y)\,p(y) = \left\{ p(y=k)\, \hat{\xi}\left(\mathbf{c}^{(y=k)}\right) \right\}_{k \in \{0,1\}}. \tag{7}
\]

The MI for these distributions can be computed as functions of the prior pseudocounts $\alpha$ and $\xi$:

\[
I_{\alpha}[x_o \,\|\, x] = \sum_{x_o} \sum_{x} p(x_o, x) \log \frac{p(x_o, x)}{p(x_o)\,p(x)} \tag{8}
\]

\[
I_{\xi}[y_o \,\|\, y] = \sum_{y_o} \sum_{y} p(y_o, y) \log \frac{p(y_o, y)}{p(y_o)\,p(y)} \tag{9}
\]

for the raw and dichotomized Alda scores, respectively. We can express the MI of the raw and dichotomized Alda score distributions both in terms of $\alpha$, such that both distributions have an equivalent total “concentration”: the totals across the 11 raw scores ($11\alpha$) and across the 2 dichotomized classes ($2\xi$) are equal when $\xi = 11\alpha/2$. This is equivalent to saying that our prior assumption about the uncertainty of the raw and dichotomized distributions assumes the same number of a priori ratings.

Our primary hypothesis, that the dichotomized Alda score is more informative with greater observation uncertainty, is evaluated by determining whether $I_{\xi}[y_o \,\|\, y]$ exceeds $I_{\alpha}[x_o \,\|\, x]$ as we increase the a priori observation noise ($\alpha$ and $\xi$).
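The MI computation in Eqs 8 and 9 reduces to a double sum over a joint probability table, which a short Python sketch can make concrete. The function names and the example conditional probabilities are hypothetical, and the natural logarithm is an assumption of this sketch (the logarithm base only rescales the result).

```python
import math
from typing import List


def joint_from_conditionals(prior: List[float], conditionals: List[List[float]]) -> List[List[float]]:
    """Row k of the joint is p(truth = k) * p(observed | truth = k), as in Eqs 6-7."""
    return [[p_k * p for p in row] for p_k, row in zip(prior, conditionals)]


def mutual_information(joint: List[List[float]]) -> float:
    """Mutual information in nats of a joint pmf given as a list of rows (Eqs 8-9)."""
    row_marg = [sum(row) for row in joint]
    col_marg = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # cells with zero probability contribute nothing
                mi += p * math.log(p / (row_marg[i] * col_marg[j]))
    return mi


def matched_xi(alpha: float) -> float:
    """Dichotomized pseudocount with the same total concentration: 2*xi = 11*alpha."""
    return 11.0 * alpha / 2.0


# Dichotomized prior from the text: p(y = 0) = 7/11, p(y = 1) = 4/11.
prior = [7.0 / 11.0, 4.0 / 11.0]
# Hypothetical conditional distributions p(observed | truth), one row per true class.
noisy = joint_from_conditionals(prior, [[0.9, 0.1], [0.2, 0.8]])
mi_noisy = mutual_information(noisy)
```

The same `mutual_information` function applies to the 11 × 11 joint of Eq 6 with a uniform prior of 1/11 per row, so the raw and dichotomized representations can be compared at matched values of α and ξ.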
Theoretical modeling of dichotomization under asymmetrical reliability
The previous experiment regarding dichotomization of the raw Alda score did not fully capture the effect of dichotomization of a continuous variable, since the raw Alda score is still discrete (albeit with a larger domain of support). Thus, we sought to investigate whether dichotomization of a truly continuous, though asymmetrically reliable, variable would show a similar pattern of preserving MI and statistical power under higher levels of observation noise and agreement asymmetry.
Synthetic datasets. The simplest synthetic dataset generated was merely a sample of regularly spaced points across the [0, 10] interval in both the x and y directions. This dataset was used only to conduct a “sanity check” that our methods for computing MI correctly identified a value of 0. This was necessary since data with uniform random noise over the same interval will only yield MI of 0 in the limit of large sample sizes.
The main synthetic dataset accepted “ground truth” values $x \in [0, 10]$ and yielded “observed” values $y \in [0, 10]$ based on the following formula for the $i$th sample:

\[
y_i = \omega f(x_i) + (1 - \omega)\,\mathrm{Uniform}(0, 10), \tag{10}
\]

where $0 \leq \omega \leq 1$ is a parameter governing the degree to which observed values are coupled to the ground truth based on $f(x_i)$ (data are entirely uniform random noise when $\omega = 0$, and come entirely from $f(x_i)$ when $\omega = 1$). The function $f(x_i)$ governing the agreement between ground truth and observed values is essentially a 1:1 correspondence between $x$ and $y$ to which we add noise along the diagonal based on a uniform random variate $\tilde{U}(-\sigma, \sigma)$ with width $\sigma$.

We simulated two forms of diagonal spread. The first is constant across all values $x \in [0, 10]$, which we call the symmetrical case, and which is represented by a parameter $\beta = 0$. The other is an asymmetrical case (represented as $\beta = 1$), in which the agreement between $x$ and $y$ is not constant across the [0, 10] range. Overall, the function $f(x_i)$ is defined as

\[
f(x_i) = \beta\, R_{(0,10)}\!\left( x_i + \frac{\tilde{U}(-\sigma, \sigma)}{1 + e^{-0.75 x_i + 5}} \right) + (1 - \beta)\, R_{(0,10)}\!\left( x_i + \tilde{U}(-\sigma, \sigma) \right), \tag{11}
\]

where $R_{(l,u)}(\cdot)$ is a function to ensure that all points remain within the $[l, u]$ interval in both axes. In the asymmetrical case, $R_{(l,u)}(\cdot)$ reflects points at the [0, 10] bounds. In the symmetrical case, the data are all simply rescaled to lie in the [0, 10] interval.

Demonstrations of the simulated synthetic data are shown in Fig 1. Every synthetically generated dataset included 750 samples, and for notational simplicity, we denote the $k$th synthetic dataset (given parameters $\beta$, $\omega$, $\sigma$) as $D^{(k)}_{\beta,\omega,\sigma} = \left\{ \left( x_j^{(k)}, y_j^{(k)} \right) \right\}_{j=1,2,\ldots,750}$.
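A minimal Python sketch of the generator in Eqs 10 and 11 follows. Several details are assumptions of the sketch rather than statements from the text: ground truth values are drawn uniformly on [0, 10], the symmetrical-case rescaling is a min-max rescaling, the logistic exponent is taken as −0.75x + 5 (so that spread grows with x), and β = 1 denotes the asymmetrical case. All function names are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple


def reflect(v: float, lo: float = 0.0, hi: float = 10.0) -> float:
    """Reflect v back into [lo, hi] at the bounds (repeatedly if needed)."""
    span = hi - lo
    v = (v - lo) % (2.0 * span)
    return lo + (v if v <= span else 2.0 * span - v)


def rescale(values: List[float], lo: float = 0.0, hi: float = 10.0) -> List[float]:
    """Min-max rescale all values into [lo, hi] (an assumed form of rescaling)."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]


def synthetic_dataset(beta: int, omega: float, sigma: float, n: int = 750,
                      seed: Optional[int] = None) -> Tuple[List[float], List[float]]:
    """Draw (x, y) pairs following Eqs 10-11 for beta in {0, 1}."""
    if beta not in (0, 1):
        raise ValueError("this sketch only handles beta = 0 or beta = 1")
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]  # assumed uniform ground truth
    if beta == 1:
        # Asymmetrical case: spread is damped by a logistic weight at low x.
        fx = [reflect(x + rng.uniform(-sigma, sigma) / (1.0 + math.exp(-0.75 * x + 5.0)))
              for x in xs]
    else:
        # Symmetrical case: constant spread, then rescaling into [0, 10].
        fx = rescale([x + rng.uniform(-sigma, sigma) for x in xs])
    # Eq 10: mix the coupled signal with uniform noise using weight omega.
    ys = [omega * f + (1.0 - omega) * rng.uniform(0.0, 10.0) for f in fx]
    return xs, ys


xs_asym, ys_asym = synthetic_dataset(beta=1, omega=0.7, sigma=5.0, seed=0)
xs_sym, ys_sym = synthetic_dataset(beta=0, omega=0.7, sigma=5.0, seed=0)
```

Setting ω = 1 and σ = 0 in the asymmetrical case returns y equal to x, which is a convenient check that the coupling term behaves as described.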
Fig 1. Demonstration of the synthetic agreement data across differences in the parameter ranges and presence of asymmetry. The x-axes all represent the ground truth value of the variable, and the y-axes represent the “observed” values. Data are depicted based on different values of a uniform noise parameter (0 ≤ ω ≤ 1) that governs what proportion of the data is merely uniform noise over the interval [0, 10], and a disagreement parameter (σ ≥ 0), which governs the variance around the diagonal line. Panel A (upper three rows, shown in blue) depicts the synthetic data in which there were asymmetrical levels of agreement across the score domain. Panel B (lower three rows, shown in red) depicts synthetic data in which there was symmetrical agreement over the score domain.
https://doi.org/10.1371/journal.pone.0225353.g001

Computation of mutual information for continuous and discrete distributions. Mutual information was computed for both continuous and dichotomized probability distributions on the data. Mutual information for the continuous distribution was computed by first performing Gaussian kernel density estimation (using Scott’s method for bandwidth selection) on the simulated dataset, and then approximating the following integral using Markov chain Monte-Carlo sampling:

\[
I_{\mathrm{KDE}}[y \,\|\, x] = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}\, dx\, dy \tag{12}
\]

Conversely, discrete MI was computed by first creating a 2-dimensional histogram by binning data based on a dichotomization threshold $\tau$. Data that lie below the dichotomization threshold are denoted 1, and those that lie above the threshold are represented as 0. Based on this joint distribution, the dichotomized MI is

\[
I_{\tau}[y \,\|\, x] = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}. \tag{13}
\]

Note that continuous MI will remain constant across $\tau$.
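The dichotomized estimate in Eq 13 can be sketched directly from paired samples. The function name is hypothetical, the natural logarithm is an assumption, and values exactly equal to τ are treated as lying above the threshold, which the text does not specify.

```python
import math
from typing import List, Sequence


def dichotomized_mi(xs: Sequence[float], ys: Sequence[float], tau: float) -> float:
    """Mutual information (nats) of the 2x2 histogram obtained at threshold tau.

    Values below tau are coded 1 and all other values 0, following the text;
    coding values equal to tau as 0 is an assumption of this sketch.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    counts: List[List[int]] = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        counts[int(x < tau)][int(y < tau)] += 1
    n = float(len(xs))
    joint = [[c / n for c in row] for row in counts]
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if joint[i][j] > 0.0:  # empty cells contribute nothing
                mi += joint[i][j] * math.log(joint[i][j] / (p_x[i] * p_y[j]))
    return mi


# Hypothetical toy samples: perfect agreement, then no association at all.
mi_coupled = dichotomized_mi([1.0, 2.0, 8.0, 9.0], [1.0, 2.0, 8.0, 9.0], tau=5.0)
mi_unrelated = dichotomized_mi([1.0, 1.0, 9.0, 9.0], [1.0, 9.0, 1.0, 9.0], tau=5.0)
```

Evaluating this function over a grid of τ values gives the kind of threshold profile that is compared against the continuous estimate, which does not depend on τ.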
Statistical power of classical tests of association. Association between the observed ($y$) and ground truth ($x$) data can be measured using Pearson’s correlation coefficient ($\rho$) when data are left as continuous, or using Fisher’s exact test when data are dichotomized. The statistical power of the test of the hypothesis that $\rho \neq 0$ given dataset $D^{(k)}_{\beta,\omega,\sigma}$ with $N^{(k)}$ observations and two-tailed statistical significance threshold $\alpha$ (which here is not the same $\alpha$ used as a Dirichlet concentration in Empirical Evaluation of the Alda Score of Lithium Response) can be easily shown to equal

\[
\mathrm{power}_{\rho}\left(D^{(k)}_{\beta,\omega,\sigma};\ \alpha = 0.05\right) = \Phi\left( |z(\rho)| - \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \right), \tag{14}
\]

where $\Phi(\cdot)$ and $\Phi^{-1}(\cdot)$ are the cumulative distribution function and quantile function for a standard normal distribution, and $z(\cdot)$ is Fisher’s Z-transformation

\[
z(\rho) = \frac{1}{2} \log \frac{1 + \rho}{1 - \rho}. \tag{15}
\]
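A short Python sketch of a Fisher-z power calculation is given below, using only the standard library. The function names are hypothetical. The sketch scales $|z(\rho)|$ by $\sqrt{N - 3}$, which is the textbook large-sample standard error of Fisher's z; that scaling is an assumption of this sketch and is not taken from the text above.

```python
import math
from statistics import NormalDist


def fisher_z(rho: float) -> float:
    """Fisher's Z-transformation of a correlation coefficient (Eq 15)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def pearson_power(rho: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed test of rho != 0.

    Uses the standard normal approximation with standard error
    1 / sqrt(n - 3); this scaling is an assumption of the sketch.
    """
    if n <= 3:
        raise ValueError("n must exceed 3")
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return std_normal.cdf(abs(fisher_z(rho)) * math.sqrt(n - 3) - critical)


# Hypothetical effect sizes at the synthetic sample size of 750.
power_weak = pearson_power(0.05, n=750)
power_strong = pearson_power(0.30, n=750)
```

Under this approximation the power at ρ = 0 equals α/2, since only the tail matching the sign of the estimate is counted.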
Under a dichotomization of $D^{(k)}_{\beta,\omega,\sigma}$ with threshold $\tau$, association between the ground truth and observed data can be evaluated using a (two-tailed) Fisher’s exact test, whose alternative hypothesis is that the odds ratio ($\eta$) of the dichotomized data does not equal 1. The null hypothesis has a Fisher’s noncentral hypergeometric distribution,

\[
L_o = \mathrm{FisherHypergeometricDistribution}\left( N^{(k)}_{\delta[y<\tau]},\ N^{(k)}_{\delta[x<\tau]},\ N^{(k)},\ \eta = 1 \right) \tag{16}
\]

where $N^{(k)}$ is the total number of observations in sample $k$, and $N^{(k)}_{\delta[x<\tau]}$ and $N^{(k)}_{\delta[y<\tau]}$ are the number of ground truth and observed data, respectively, that fall below the dichotomization threshold $\tau$. Under the alternative hypothesis, this distribution has an odds ratio parameter estimated from the data:

\[
L_a = \mathrm{FisherHypergeometricDistribution}\left( N^{(k)}_{\delta[y<\tau]},\ N^{(k)}_{\delta[x<\tau]},\ N^{(k)},\ \hat{\eta} \right). \tag{17}
\]

The statistical power of Fisher’s exact test under this setup and a two-tailed significance threshold of $\alpha$ is

\[
\mathrm{fp}\left( D^{(k)}_{\beta,\omega,\sigma};\ \tau, \alpha \right) = \delta[\hat{\eta} \geq 1]\, S_{L_a}\!\left( S^{-1}_{L_o}\!\left( 1 - \frac{\alpha}{2} \right) \right) + \delta[\hat{\eta} < 1] \left( 1 - S_{L_a}\!\left( S^{-1}_{L_o}\!\left( \frac{\alpha}{2} \right) \right) \right), \tag{18}
\]

where $S_{L_a}(\cdot)$ and $S^{-1}_{L_o}(\cdot)$ are the survival function of the alternative hypothesis and the inverse survival function of the null hypothesis, respectively.
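The idea behind this power calculation can be sketched in plain Python by building the probability mass function of Fisher's noncentral hypergeometric distribution and summing the alternative's mass over the null's rejection region. This is a textbook-style illustration rather than a transcription of Eq 18: the function names are hypothetical, the odds ratio is passed in directly (how it is estimated from the data is not specified here), and the rejection region is taken as the tail on the side of the odds ratio with null probability at most α/2.

```python
import math
from typing import List


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fnchg_pmf(m1: int, m2: int, n: int, eta: float) -> List[float]:
    """pmf of Fisher's noncentral hypergeometric distribution.

    m1: ground truth values below tau; m2: ground truth values at or above tau;
    n: observed values below tau; eta: odds ratio. Entry j of the result is
    P(X = lo + j), where lo = max(0, n - m2) is the smallest possible count.
    """
    lo, hi = max(0, n - m2), min(n, m1)
    logw = [log_comb(m1, x) + log_comb(m2, n - x) + x * math.log(eta)
            for x in range(lo, hi + 1)]
    peak = max(logw)  # subtract the peak before exponentiating for stability
    weights = [math.exp(v - peak) for v in logw]
    total = sum(weights)
    return [w / total for w in weights]


def fisher_power(m1: int, m2: int, n: int, eta: float, alpha: float = 0.05) -> float:
    """Power of a two-tailed test at level alpha, using alpha/2 in the relevant tail."""
    p_null = fnchg_pmf(m1, m2, n, 1.0)
    p_alt = fnchg_pmf(m1, m2, n, eta)
    size = len(p_null)
    # Walk inward from the upper tail if eta >= 1, otherwise from the lower tail.
    order = range(size - 1, -1, -1) if eta >= 1.0 else range(size)
    tail_null, power = 0.0, 0.0
    for j in order:
        if tail_null + p_null[j] > alpha / 2.0:
            break
        tail_null += p_null[j]
        power += p_alt[j]
    return power


# Hypothetical margins: 50 of 100 ground truth and 50 of 100 observed values below tau.
power_null = fisher_power(50, 50, 50, eta=1.0)
power_high = fisher_power(50, 50, 50, eta=20.0)
power_low = fisher_power(50, 50, 50, eta=0.05)
```

Working with log-weights keeps the calculation stable for margins as large as the 750-sample synthetic datasets, where the binomial coefficients themselves would be very large.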
Evaluation of mutual information. The central aspect of this analysis is comparison of the dichotomized and continuous MI across values of the dichotomization threshold τ, global noise ω, asymmetry parameter β, and diagonal spread σ. Under all cases, we expect that increases in global noise (that is, decreases in ω, given Eq 10) will reduce the MI. We also expect that with symmetrical reliability (i.e. β = 0), the dichotomized MI will be lower than the continuous MI across all thresholds. However, as the degree of asymmetry in the reliability increases, we expect the dichotomized MI to exceed the continuous MI (i.e. as σ increases when β = 1). Finally, as a sanity check, we expect that both continuous and dichotomized MI will be approximately 0 when applied to a grid of points regularly spaced over the [0, 10] interval.
Evaluation of effects on statistical power of classical tests of association. Statistical power of the Pearson correlation coefficient and Fisher’s exact test were computed across symmetrical (β = 0) and asymmetrical (β = 1) conditions of the synthetic dataset described above. Owing to the greater computational efficiency of these calculations (compared to the MI), the diagonal spread parameter was varied more densely (σ ∈ {1, 2, ..., 20}). The power of Fisher’s exact test was evaluated at two dichotomization thresholds: a median split at τ = 5 and a “tail split” at τ = 3. We evaluated three global noise settings (ω ∈ {0.3, 0.5, 0.7}). At each experimental setting, we computed the aforementioned power levels for 100 independent synthetic datasets. Results are presented using the mean and 95% confidence intervals of the power estimates over the 100 runs under each condition. We expect that the Fisher’s exact test under a “tail split” dichotomization (not a median split) will yield greater statistical power in the presence of asymmetrical reliability, greater diagonal spread, and higher global noise. However, under the symmetrically reliable condition, we expect that the statistical power will be greater for the continuous test of association.
Materials
Mutual information experiments were conducted in Mathematica v. 12.0.0 (Wolfram Research, Inc.; Champaign, IL). Experiments evaluating the statistical power under classical tests of continuous and dichotomous association were conducted in the Python programming language. Code for the analyses is also provided in S3 File (Alda score analyses), S4 File (theoretical analysis of MI under asymmetrical reliability), and S5 File (theoretical evaluation of classical associative tests). The Mathematica notebooks are also available in PDF form in S6 and S7 Files.
Results
Empirical evaluation of the Alda score of lithium response
Histograms of the observed Alda scores for each of the gold standard vignette values are depicted in Fig 2. Resulting joint distributions of the gold standard vs. observed Alda scores (in both the raw and dichotomized representations) are shown in Fig 3 (Panels A-C) across varying levels of observation noise. Fig 3D plots the MI for the raw and dichotomized Alda scores across increasing levels of the observation noise parameter α (recalling that ξ = 11α/2). Beyond an observation noise of approximately α = 3.52, one can see that the dichotomized lithium response definition retains greater MI between the true and observed labels, compared to the raw representation.
Discrete vs. continuous mutual information in asymmetrically reliable data
Fig 4 shows the results of the experiment on synthetic data. Under agreement levels that are constant across the (x, y) domains, one can observe that MI of dichotomized representations of the variables are generally lower than their continuous counterparts. However, under asymmetrical reliability (i.e. where agreement between x and y decreases as x increases), we see that MI is higher for the dichotomized, rather than the continuous, representations. In particular, as the level of agreement asymmetry increased (i.e. for higher values of σ), the best dichotomization thresholds decreased.
Fig 2. Histograms of ratings for each value of the ground truth Alda score available in the first wave dataset from Manchia et al. [1]. Each histogram represents the distribution of ratings (n_r = 59) for a single one of twelve assessment vignettes. The gold standard (“ground truth”) Alda score, obtained by the Halifax consensus sample, is depicted as the title for each histogram. Plots in blue are those for vignettes with gold standard Alda scores less than 7, which would be classified as “non-responders” under the dichotomized setting. Vignettes with gold standard Alda scores ≥ 7 are shown in red, and represent the dichotomized group of lithium responders.
https://doi.org/10.1371/journal.pone.0225353.g002
Fig 3. Mutual information between gold standard and observed Alda scores in relation to the observation noise (α) and whether the scale is in its raw or dichotomized form (lithium responder [Li(+)] is Alda score ≥ 7; non-responder [Li(-)] is Alda score < 7). Panels A-C show the inferred joint distributions of the observed (x_o for raw, y_o for discrete) and gold standard (x for raw, y for discrete) values at different levels of observation noise (α ∈ {0, 10, 100}). Panel D plots the mutual information for the raw (red) and discrete (blue) settings of the Alda score across increasing values of α. Recall that here we set ξ = 11α/2.
https://doi.org/10.1371/journal.pone.0225353.g003
Statistical power of classical associative tests
Fig 5 plots the statistical power of null-hypothesis tests using continuous and dichotomized representations of the synthetic dataset. As expected, under conditions of symmetrical reliability, the continuous test of association (Pearson correlation) retains greater statistical power as the degree of diagonal spread increases, although this difference lessens at very high levels of diagonal spread or overall (uniform) noise. However, under conditions of asymmetrical reliability, dichotomizing data according to a “tail split” (here a threshold of τ = 3) preserves greater statistical power than either a median split (τ = 5) or continuous representation; this relationship was present even at high levels of diagonal spread and overall uniform noise.
Discussion
The present study makes two important contributions. First, using ratings from a sample of 59 raters obtained using standardized vignettes compared to a consensus-defined gold standard [1], we showed that the dichotomized Alda score has a higher MI between the observed and gold standard ratings than does the raw scale (which ranges from 0-10). Those data suggested that the Alda score’s reliability is asymmetrical, with greater inter-rater agreement at the upper extreme. Secondly, therefore, using synthetic experiments we showed that asymmetrical inter-rater reliability in a score’s range is the likely cause of this relationship. Our results do not argue that lithium response is itself a categorical natural phenomenon. Rather, using the dichotomous definition as a target variable in supervised learning problems likely confers greater robustness to noise in the observed ratings.
Some have argued that the existence of categorical structure in one’s data [9], or evidence of improved reliability under a dichotomized structure [16], are potentially justifiable rationales for dichotomization of continuous variables. These claims are generally stated only briefly, and with less quantitative support than the more numerous mathematical treatments of the problems with dichotomization [9,10,16,17]. However, these more rigorous quantitative analyses typically involve assumptions of symmetrical or Gaussian distributions of the underlying variables in the context of generalized linear modeling (although Irwin & McClelland [10] demonstrated that median splits of asymmetric and bimodal beta distributions are also deleterious). These analyses have led to vigorous generalized denunciation of variable dichotomization across several disciplines, but our current work offers important counterexamples to this narrative [10,11].

Fig 4. Mutual information (MI) for dichotomized (solid lines) and continuous (dashed lines) distributions on synthetic data with asymmetrical (upper row, Panel A) and symmetrical (lower row, Panel B) properties with respect to agreement. X-axes represent the dichotomization thresholds at which we recalculate the dichotomized MI. Mutual information is depicted on the y-axes. Plot titles indicate the different diagonal spread (σ) parameters used to generate the synthetic datasets. Solid lines (for dichotomized MI) are surrounded by ribbons depicting the 95% confidence intervals over 10 runs at each combination of parameters (τ, ω, β, σ).
https://doi.org/10.1371/journal.pone.0225353.g004
The Alda score is more broadly used as a target variable in both predictive and associative
analyses, and not as a predictor variable, which is an important departure from most analyses
against dichotomization. Since there is no valid and reliable biomarker of lithium response,
these cases must rely on the Alda score-based definition of lithium response as a “ground
truth” target variable. In the case of predicting lithium response, where these ground truth
labels are collected from multiple raters across different international sites, variation in lithium
response scoring patterns across centres might further accentuate the extant between-site
heterogeneity.
To this end, inter-individual differences in subjective rating scales may be more informative
about the raters than the subjects, and one may wish to use dichotomization to discard this
nuisance variance [8,9,16]. Doing so means that one turns regression supervised by a dubious
target into classification with a more reliable (although coarser) target. Appropriately balanc-
ing these considerations may require more thought than adopting a blanket prohibition on
dichotomization or some other form of preprocessing.
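As a minimal illustration of this trade-off, the following Python sketch dichotomizes total Alda
scores at the conventional threshold of 7 and compares exact agreement between two raters on the
raw and dichotomized scales. The function names and the example ratings are hypothetical and are
not taken from the study data.

```python
def dichotomize(scores, threshold=7):
    """Map total Alda scores (0-10) to 1 (responder) or 0 (non-responder)."""
    return [1 if s >= threshold else 0 for s in scores]


def exact_agreement(a, b):
    """Fraction of items on which two raters give identical values."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical ratings of six vignettes by two raters (illustration only).
rater_1 = [2, 4, 5, 8, 9, 10]
rater_2 = [4, 1, 6, 8, 9, 9]

raw = exact_agreement(rater_1, rater_2)
binary = exact_agreement(dichotomize(rater_1), dichotomize(rater_2))
print(f"raw agreement: {raw:.2f}, dichotomized agreement: {binary:.2f}")
```

In this contrived example the raters disagree on most raw scores yet agree on every dichotomized
label, which is the sense in which the coarser target can be the more reliable one.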
An important criticism of continuous variable dichotomization is that it may impede
comparability of results across studies, both in terms of diminishing power and inflating heterogeneity
[17]. However, this is more likely a problem when dichotomization thresholds are established
on a study-by-study basis, without considering generalizability from the outset. These
arguments do not necessarily apply to the Alda score, since the threshold of 7 has been established
across a large consortium with support from both reliability and discrete mixture analysis [1],
and is the effective standard split point for this scale [18].
Fig 5. Statistical power achieved with the Pearson coefficient (a continuous measure of association; blue lines) and Fisher’s
exact test (a measure of association between dichotomized variables; red lines) for synthetic data with symmetrical (upper row)
and asymmetrical (lower row) properties with respect to agreement. Columns correspond to the level of uniform “overall” noise
(ω) added to the data, representing prior uncertainty. X-axes represent the diagonal spread (σ), and the y-axes represent the test’s
statistical power for the given sample size and estimated effect sizes. Data subjected to Fisher’s exact test were dichotomized at either
a threshold of 5 (the “Median Split,” denoted by ‘+’ markers in red) or 3 (the “Tail Split,” denoted by the dot markers in red). For all
series, dark lines denote means and the ribbons are 95% confidence intervals over 100 runs.
https://doi.org/10.1371/journal.pone.0225353.g005
Our study thus provides a unique point of support for the dichotomized Alda score insofar
as we show that the retention of MI and frequentist statistical power is likely due to asymmetri-
cal reliability across the range of scores. Our analyses show that there is a range of Alda scores
(those identifying good lithium responders; scores ≥ 7) for which scores correspond more
tightly to a consensus-defined gold standard in a large scale international consortium. Con-
versely, this asymmetry implies that Alda scores at the lower end of the range will carry greater
uncertainty (Fig 2). This may be due to the intrinsic structure of the Alda scale, wherein a
score of 3 may result from 2159 item combinations, while only 79 combinations can yield a
score of 7. In particular, we showed that tail split dichotomization of the Alda score will be
more robust to increases in the prior uncertainty (i.e. the overall level of background “noise”
in the relationship between true/observed scores). This feature is important since the sample
of raters included in the Alda score’s calibration study [1] was relatively small and consisted of
individuals involved in ConLiGen centres. It is reasonable to suspect that assessment of Alda
score reliability in broader research and clinical settings would add further disagreement-
based noise to the inter-rater reliability data. At present, use of the dichotomized scale could
confer some robustness to that uncertainty.
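The kind of power comparison summarized in Fig 5 can be sketched with a small Monte Carlo
simulation. The Python code below is an illustrative assumption rather than the procedure in S5
File: the noise model (Gaussian rating error with a smaller spread for true scores of 7 or more),
the parameter values, and the function names are all hypothetical. It estimates how often the
Pearson test on raw scores and Fisher’s exact test on dichotomized scores detect the association
between true and observed ratings.

```python
import numpy as np
from scipy.stats import fisher_exact, pearsonr


def simulate_ratings(rng, n, reliable_from=7, sigma_low=3.0, sigma_high=0.5):
    """Draw true scores on 0-10 and noisy observed scores (assumed noise model)."""
    true = rng.integers(0, 11, size=n)
    # Smaller rating error where the scale is assumed to be more reliable.
    sigma = np.where(true >= reliable_from, sigma_high, sigma_low)
    observed = np.clip(np.rint(true + rng.normal(0.0, sigma)), 0, 10).astype(int)
    return true, observed


def estimate_power(n=20, threshold=7, runs=200, alpha=0.05, seed=0):
    """Fraction of simulated samples in which each test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    hits_pearson = 0
    hits_fisher = 0
    for _ in range(runs):
        true, observed = simulate_ratings(rng, n)
        # Pearson test on the raw scores (skipped if either vector is constant).
        if true.std() > 0 and observed.std() > 0:
            _, p_raw = pearsonr(true, observed)
            if p_raw < alpha:
                hits_pearson += 1
        # Fisher's exact test on the 2x2 table of dichotomized scores.
        t_bin, o_bin = true >= threshold, observed >= threshold
        table = [[int(np.sum(~t_bin & ~o_bin)), int(np.sum(~t_bin & o_bin))],
                 [int(np.sum(t_bin & ~o_bin)), int(np.sum(t_bin & o_bin))]]
        _, p_bin = fisher_exact(table)
        if p_bin < alpha:
            hits_fisher += 1
    return hits_pearson / runs, hits_fisher / runs


power_pearson, power_fisher = estimate_power()
print(f"Pearson power: {power_pearson:.2f}, Fisher power: {power_fisher:.2f}")
```

Varying `sigma_low`, `sigma_high`, and `threshold` in this sketch gives a rough sense of how the
two tests respond to symmetrical versus asymmetrical rating error; it is not intended to
reproduce the published figures.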
More generally, our study showed that if the reliability of a measure is particularly high at one
tail of its range, then a “tail split” dichotomization can outperform even the continuous
representation of the variable. This presents an important counterexample to previous authors,
such as Cohen [5], Irwin & McClelland [10], and MacCallum et al. [9], who argued that “tail
splits” are still worse than median splits. While our study reaffirms these claims in the case of
measures whose reliability is constant over the domain (see Fig 4B and the upper row of Fig 5),
our analysis of the asymmetrically reliable scenario yields opposite conclusions, favouring a
“tail split” dichotomization over both median splits and continuous representations. Tail split
dichotomization was particularly robust when data were affected by both asymmetrical reli-
ability and high degrees of uniform noise over the variable’s range. Together, these results
suggest that dichotomization/categorization of a continuous measurement may be justifiable
when its relationship to the underlying ground truth variable is noisy everywhere except at an
extreme.
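To make the dependence on the split point concrete, the following Python sketch computes the
mutual information between dichotomized true and observed scores at several thresholds. It is
an illustration under stated assumptions, not the estimators used in S4 File: the joint
distribution is built from a hypothetical noise model (uniform prior over true scores, a
discretized Gaussian spread σ that is narrower for true scores of 7 or more, and a uniform
background noise weight ω), and all function names and default values are invented for this
example.

```python
import numpy as np


def mutual_information(joint):
    """Mutual information (in bits) of a joint probability table P(true, observed)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_true = joint.sum(axis=1, keepdims=True)  # marginal over observed
    p_obs = joint.sum(axis=0, keepdims=True)   # marginal over true
    nonzero = joint > 0
    ratio = joint[nonzero] / (p_true @ p_obs)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))


def asymmetric_joint(n_levels=11, reliable_from=7,
                     sigma_low=3.0, sigma_high=0.5, omega=0.1):
    """Joint table for scores 0..n_levels-1 under an assumed asymmetric noise model."""
    scores = np.arange(n_levels)
    joint = np.zeros((n_levels, n_levels))
    for t in scores:
        sigma = sigma_high if t >= reliable_from else sigma_low
        kernel = np.exp(-0.5 * ((scores - t) / sigma) ** 2)
        kernel /= kernel.sum()
        # Mix the diagonal kernel with uniform background noise of weight omega.
        joint[t] = (1 - omega) * kernel + omega / n_levels
    return joint / n_levels  # uniform prior over true scores


def dichotomize_joint(joint, threshold):
    """Collapse the joint table to 2x2 by splitting both axes at `threshold`."""
    joint = np.asarray(joint, dtype=float)
    low, high = slice(0, threshold), slice(threshold, None)
    return np.array([[joint[low, low].sum(), joint[low, high].sum()],
                     [joint[high, low].sum(), joint[high, high].sum()]])


joint = asymmetric_joint()
for threshold in (3, 5, 7):
    mi = mutual_information(dichotomize_joint(joint, threshold))
    print(f"threshold {threshold}: dichotomized MI = {mi:.3f} bits")
```

Changing `omega` or making `sigma_low` equal to `sigma_high` lets one compare how the preferred
threshold shifts between asymmetrically and symmetrically reliable scenarios in this toy model.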
Our study has several limitations. First, our sample for the re-analysis of the Alda score
reliability was relatively small and was drawn from highly specialized raters involved in
lithium-specific research. However, one may consider this sample as representative of the “best case
scenario” for the Alda score’s reliability. Further expansion of the subject population would
likely introduce more noise into the relationship between ground truth and observed
Alda scores. Most of this additional disagreement would probably be observed for lower
Alda scores, since (A) there are simply more potential item combinations that can yield an
Alda score of 5 than an Alda score of 9, for example, and (B) unambiguously excellent lithium
response is a phenomenon so distinct that some question whether lithium-responsive BD may
constitute a unique diagnostic entity [19,20]. Thus, we believe that our sample size for the
reliability analysis is likely sufficient to yield the present study’s conclusions.
Our study is also limited by the fact that theoretical analysis was largely simulation-based,
and thus cannot offer the degree of generalizability obtained through rigorous mathematical
proof. Nonetheless, our study offers sufficient evidence—in the form of a counterexample—to
show that there exist scenarios in which dichotomization is statistically superior to preserving
a variable’s continuous representation. Furthermore, we used well controlled experiments to
isolate asymmetrical reliability as the cause of dichotomization’s superiority across simulated
conditions.
Conclusion
In conclusion, we have shown that a dichotomous representation of the Alda score for lithium
responsiveness is more robust to noise arising from inter-rater disagreement. The dichoto-
mous Alda score is therefore likely a better representation of lithium responsiveness for multi-
site studies in which lithium response is a target or dependent variable. Through both re-analy-
sis of the Alda score’s real-world inter-rater reliability data and careful theoretical simulations,
we were able to show that asymmetrical reliability across the score’s domain was the likely
cause for superiority of the dichotomous definition. Our study is important not only for future
research on lithium response, but also for other studies using subjective and potentially unreliable
measures as dependent variables. Practically speaking, our results suggest that it might be better
to classify something we can all agree upon than to regress something upon which we cannot.
Supporting information
S1 Fig. A-score reliability histograms. Histograms of ratings for each value of the ground
truth Alda A-score. This figure was generated identically to Fig 2, but using the A-score data
only.
(PDF)
S2 Fig. A-score mutual information results. Mutual information between gold standard and
observed Alda A-scores in relation to observation noise and the scale’s “raw” or dichotomized
form. This figure was generated identically to Fig 3, but using the A-score data only.
(PDF)
S1 File. Total Alda score ratings. Inter-rater reliability data for the total Alda score.
(CSV)
S2 File. Alda A-score ratings. Inter-rater reliability data for the Alda A-score.
(CSV)
S3 File. Alda score analysis code. Mathematica notebook containing the empirical evaluation
of the Alda Score of Lithium response. This notebook also contains additional analysis of the
A-score alone.
(NB)
S4 File. Theoretical analysis code. Mathematica notebook containing the theoretical analyses
of discrete vs. continuous mutual information in asymmetrically reliable data.
(NB)
S5 File. Code for statistical power tests. Jupyter notebook containing the theoretical analyses
of the statistical power of classical associative tests under asymmetrically reliable data.
(IPYNB)
S6 File. Alda score analysis code (PDF version). PDF version of S3 File for those without
a Mathematica license.
(PDF)
S7 File. Theoretical analysis code (PDF version). PDF version of S4 File for those without
a Mathematica license.
(PDF)
Acknowledgments
The authors wish to acknowledge those members of the Consortium on Lithium Genetics
(ConLiGen) who contributed ratings for the vignettes herein: Mirko Manchia, Raffaella
Ardau, Jean-Michel Aubry, Lena Backlund, Claudio E.M. Banzato, Bernhard T. Baune, Frank
Bellivier, Susanne Bengesser, Clara Brichant-Petitjean, Elise Bui, Cynthia V. Calkin, Andrew
Tai Ann Cheng, Caterina Chillotti, Scott Clark, Piotr M. Czerski, Clarissa Dantas, Maria Del
Zompo, J. Raymond DePaulo, Bruno Etain, Peter Falkai, Louise Frisén, Mark A. Frye, Jan
Fullerton, Sébastien Gard, Julie Garnham, Fernando S. Goes, Paul Grof, Oliver Gruber, Ryota
Hashimoto, Joanna Hauser, Rebecca Hoban, Stéphane Jamain, Jean-Pierre Kahn, Layla Kassem,
Tadafumi Kato, John R. Kelsoe, Sarah Kittel-Schneider, Sebastian Kliwicki, Po-Hsiu Kuo,
Ichiro Kusumi, Gonzalo Laje, Catharina Lavebratt, Marion Leboyer, Susan G. Leckband,
Carlos A. López Jaramillo, Mario Maj, Alain Malafosse, Lina Martinsson, Takuya Masui, Philip B.
Mitchell, Frank Mondimore, Palmiero Monteleone, Audrey Nallet, Maria Neuner, Tomás
Novák, Claire O’Donovan, Urban Ösby, Norio Ozaki, Roy H. Perlis, Andrea Pfennig, James B.
Potash, Daniela Reich-Erkelenz, Andreas Reif, Eva Reininghaus, Sara Richardson, Janusz K.
Rybakowski, Martin Schalling, Peter R. Schofield, Oliver K. Schubert, Barbara Schweizer,
Florian Seemüller, Maria Grigoroiu-Serbanescu, Giovanni Severino, Lisa R. Seymour, Claire
Slaney, Jordan W. Smoller, Alessio Squassina, Thomas Stamm, Pavla Stopkova, Sarah K.
Tighe, Alfonso Tortorella, Adam Wright, David Zilles, Michael Bauer, Marcella Rietschel, and
Thomas G. Schulze.
Author Contributions
Conceptualization: Abraham Nunes.
Data curation: Martin Alda.
Formal analysis: Abraham Nunes.
Investigation: Abraham Nunes.
Methodology: Abraham Nunes, Martin Alda.
Resources: Martin Alda.
Software: Abraham Nunes.
Supervision: Thomas Trappenberg, Martin Alda.
Validation: Abraham Nunes.
Visualization: Abraham Nunes.
Writing – original draft: Abraham Nunes.
Writing – review & editing: Abraham Nunes, Thomas Trappenberg, Martin Alda.
References
1. Manchia M, Adli M, Akula N, Ardau R, Aubry JM, Backlund L, et al. Assessment of Response to Lithium
Maintenance Treatment in Bipolar Disorder: A Consortium on Lithium Genetics (ConLiGen) Report.
PLoS ONE. 2013; 8. https://doi.org/10.1371/journal.pone.0065636
2. Humphreys LG, Fleishman A. Pseudo-orthogonal and other analysis of variance designs involving
individual-differences variables. Journal of Educational Psychology. 1974; 66: 464–472.
https://doi.org/10.1037/h0036539
3. Humphreys LG. Doing research the hard way: Substituting analysis of variance for a problem in
correlational analysis. Journal of Educational Psychology. 1978; 70: 873–876.
https://doi.org/10.1037/0022-0663.70.6.873
4. Humphreys LG. Research on individual differences requires correlational analysis, not ANOVA.
Intelligence. 1978; 2: 1–5. https://doi.org/10.1016/0160-2896(78)90010-7
5. Cohen J. The Cost of Dichotomization. Applied Psychological Measurement. 1983; 7: 249–253.
https://doi.org/10.1177/014662168300700301
6. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad
idea. Statistics in Medicine. 2006; 25: 127–141. https://doi.org/10.1002/sim.2331 PMID: 16217841
7. Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ. 2006; 332: 1080.
https://doi.org/10.1136/bmj.332.7549.1080
8. Rucker DD, McShane BB, Preacher KJ. A researcher’s guide to regression, discretization, and median
splits of continuous variables. Journal of Consumer Psychology. 2015; 25: 666–678.
https://doi.org/10.1016/j.jcps.2015.04.004
9. MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of dichotomization of quantitative
variables. Psychological Methods. 2002; 7: 19–40. https://doi.org/10.1037/1082-989x.7.1.19 PMID:
11928888
10. Irwin JR, McClelland GH. Negative Consequences of Dichotomizing Continuous Predictor Variables.
Journal of Marketing Research. 2003; 40: 366–371. https://doi.org/10.1509/jmkr.40.3.366.19237
11. Fitzsimons GJ. Death to Dichotomizing. J Consum Res. 2008; 35: 5–8. https://doi.org/10.1086/589561
12. Streiner DL. Breaking up is Hard to Do: The Heartbreak of Dichotomizing Continuous Data. Can J Psy-
chiatry. 2002; 47: 262–266. https://doi.org/10.1177/070674370204700307 PMID: 11987478
13. Nunes A, Ardau R, Berghöfer A, Bocchetta A, Chillotti C, Deiana V, et al. Prediction of Lithium Response
using Clinical Data. Acta Psychiatr. Scandinav. in press.
14. Grof P. Responders to long-term lithium treatment. In: Bauer M, Grof P, Muller-Oerlinghausen B, edi-
tors. Lithium in Neuropsychiatry: The Comprehensive Guide. UK: Informa Healthcare; 2006. pp. 157–
178.
15. Gershon S, Chengappa KNR, Malhi GS. Lithium specificity in bipolar illness: A classic agent for the clas-
sic disorder. Bipolar Disorders. 2009; 11: 34–44. https://doi.org/10.1111/j.1399-5618.2009.00709.x
PMID: 19538684
16. DeCoster J, Iselin A-MR, Gallucci M. A conceptual and empirical examination of justifications for dichot-
omization. Psychological Methods. 2009; 14: 349–366. https://doi.org/10.1037/a0016956 PMID:
19968397
17. Hunter JE, Schmidt FL. Dichotomization of continuous variables: The implications for meta-analysis.
Journal of Applied Psychology. 1990; 75: 334–349. https://doi.org/10.1037/0021-9010.75.3.334
18. Hou L, Heilbronner U, Degenhardt F, Adli M, Akiyama K, Akula N, et al. Genetic variants associated
with response to lithium treatment in bipolar disorder: A genome-wide association study. The Lancet.
2016; 387: 1085–1093. https://doi.org/10.1016/S0140-6736(16)00143-4
19. Alda M. Lithium in the treatment of bipolar disorder: pharmacology and pharmacogenetics. Molecular
Psychiatry. 2015; 20: 661. https://doi.org/10.1038/mp.2015.4 PMID: 25687772
20. Alda M. Who are excellent lithium responders and why do they matter? World Psychiatry. 2017; 16:
319–320. https://doi.org/10.1002/wps.20462 PMID: 28941103
Asymmetrical reliability of the Alda score favours a dichotomous representation of lithium responsiveness
PLOS ONE | https://doi.org/10.1371/journal.pone.0225353 January 27, 2020 15 / 15
... This observation is interesting in light of recent work that quantified the asymmetrical reliability of the Alda scale, finding higher interrater reliability in the upper tail of the response distribution. 33 Therefore, a dichotomous representation of lithium response was generally argued for, even after considering the resultant loss in statistical power. However, rather than deciding a priori to discretise this distribution, an alternative approach would be to tune and select models in a leave-site-out cross-validation framework, as was done in the current work. ...
... However, rather than deciding a priori to discretise this distribution, an alternative approach would be to tune and select models in a leave-site-out cross-validation framework, as was done in the current work. This is because we would expect to see the highest amount of interrater disagreement between data-collection sites, as purported by Nunes et al. 33 If it was high enough to warrant a priori discretisation, these across site models would not generalise because of their disagreement in lithium response. However, in the current work nearly all models generalised across sites to the out-ofsample-test sets that were excluded from model construction, demonstrating that the use of leave-site-out cross-validation ensured that each model was tuned to learn parameters and relationships that generalised regardless of any disagreement between raters across sites. ...
... This observation is interesting in light of recent work that quantified the asymmetrical reliability of the Alda scale, finding higher interrater reliability in the upper tail of the response distribution. 33 Therefore, a dichotomous representation of lithium response was generally argued for, even after considering the resultant loss in statistical power. However, rather than deciding a priori to discretise this distribution, an alternative approach would be to tune and select models in a leave-site-out cross-validation framework, as was done in the current work. ...
... However, rather than deciding a priori to discretise this distribution, an alternative approach would be to tune and select models in a leave-site-out cross-validation framework, as was done in the current work. This is because we would expect to see the highest amount of interrater disagreement between data-collection sites, as purported by Nunes et al. 33 If it was high enough to warrant a priori discretisation, these across site models would not generalise because of their disagreement in lithium response. However, in the current work nearly all models generalised across sites to the out-of-sample-test sets that were excluded from model construction, demonstrating that the use of leave-site-out cross-validation ensured that each model was tuned to learn parameters and relationships that generalised regardless of any disagreement between raters across sites. ...
Article
Background Response to lithium in patients with bipolar disorder is associated with clinical and transdiagnostic genetic factors. The predictive combination of these variables might help clinicians better predict which patients will respond to lithium treatment. Aims To use a combination of transdiagnostic genetic and clinical factors to predict lithium response in patients with bipolar disorder. Method This study utilised genetic and clinical data ( n = 1034) collected as part of the International Consortium on Lithium Genetics (ConLi ⁺ Gen) project. Polygenic risk scores (PRS) were computed for schizophrenia and major depressive disorder, and then combined with clinical variables using a cross-validated machine-learning regression approach. Unimodal, multimodal and genetically stratified models were trained and validated using ridge, elastic net and random forest regression on 692 patients with bipolar disorder from ten study sites using leave-site-out cross-validation. All models were then tested on an independent test set of 342 patients. The best performing models were then tested in a classification framework. Results The best performing linear model explained 5.1% ( P = 0.0001) of variance in lithium response and was composed of clinical variables, PRS variables and interaction terms between them. The best performing non-linear model used only clinical variables and explained 8.1% ( P = 0.0001) of variance in lithium response. A priori genomic stratification improved non-linear model performance to 13.7% ( P = 0.0001) and improved the binary classification of lithium response. This model stratified patients based on their meta-polygenic loadings for major depressive disorder and schizophrenia and was then trained using clinical data. Conclusions Using PRS to first stratify patients genetically and then train machine-learning models with clinical predictors led to large improvements in lithium response prediction. 
When used with other PRS and biological markers in the future this approach may help inform which patients are most likely to respond to lithium treatment.
... The total score is obtained by subtracting the sum of the B criteria from that of Criterion A, and it ranks between 0 and 10 points. A total score ≥7 is classified as a good responder, while a total score <7 includes partial and nonresponders [39][40][41]. Importantly, the Alda scale was validated for the assessment of the response to other mood stabilizers, such as valproic acid, lamotrigine, carbamazepine, and atypical antipsychotics, as well as combination therapies [42,43]. The assessment of the clinical response to mood stabilizers was performed by trained psychiatrists, under the supervision of one senior rater (M.M.) who has worked in the validation procedure of the scale [40]. ...
Article
Full-text available
Bipolar disorder is associated with an inflammation-triggered elevated catabolism of tryptophan to the kynurenine pathway, which impacts psychiatric symptoms and outcomes. The data indicate that lithium exerts anti-inflammatory effects by inhibiting indoleamine-2,3-dioxygenase (IDO)-1 activity. This exploratory study aimed to investigate the tryptophan catabolism in individuals with bipolar disorder (n = 48) compared to healthy controls (n = 48), and the associations with the response to mood stabilizers such as lithium, valproate, or lamotrigine rated with the Retrospective Assessment of the Lithium Response Phenotype Scale (or the Alda scale). The results demonstrate an association of a poorer response to lithium with higher levels of kynurenine, kynurenine/tryptophan ratio as a proxy for IDO-1 activity, as well as quinolinic acid, which, overall, indicates a pro-inflammatory state with a higher degradation of tryptophan towards the neurotoxic branch. The treatment response to valproate and lamotrigine was not associated with the levels of the tryptophan metabolites. These findings support the anti-inflammatory properties of lithium. Furthermore, since quinolinic acid has neurotoxic features via the glutamatergic pathway, they also strengthen the assumption that the clinical drug response might be associated with biochemical processes. The relationship between the lithium response and the measurements of the tryptophan to the kynurenine pathway is of clinical relevance and may potentially bring advantages towards a personalized medicine approach to bipolar disorder that allows for the selection of the most effective mood-stabilizing drug.
... The Alda scale comprises two subscales: The A scale (which measures overall response) and the B scale (which assesses five potential confounders of response). In the original guidelines, Li response was reported either by the Total Score as a continuous measure (TS = A score minus B score) or, more often, as a categorical outcome (with cases classified as good or non-responders, i.e., GR or NR) [13,14]. However, when Manchia et al. (2013) undertook an inter-rater reliability study with researchers from the Consortium on Li Genetics (ConLiGen), reliability was low for Alda scale ratings of BD cases with high B scale scores (typically cases with complex clinical presentations). ...
Article
Full-text available
Optimal classification of the response to lithium (Li) is crucial in genetic and biomarker research. This proof of concept study aims at exploring whether different approaches to phenotyping the response to Li may influence the likelihood of detecting associations between the response and genetic markers. We operationalized Li response phenotypes using the Retrospective Assessment of Response to Lithium Scale (i.e., the Alda scale) in a sample of 164 cases with bipolar disorder (BD). Three phenotypes were defined using the established approaches, whilst two phenotypes were generated by machine learning algorithms. We examined whether these five different Li response phenotypes showed different levels of statistically significant associations with polymorphisms of three candidate circadian genes (RORA, TIMELESS and PPARGC1A), which were selected for this study because they were plausibly linked with the response to Li. The three original and two revised Alda ratings showed low levels of discordance (misclassification rates: 8–12%). However, the significance of associations with circadian genes differed when examining previously recommended categorical and continuous phenotypes versus machine-learning derived phenotypes. Findings using machine learning approaches identified more putative signals of the Li response. Established approaches to Li response phenotyping are easy to use but may lead to a significant loss of data (excluding partial responders) due to recent attempts to improve the reliability of the original rating system. While machine learning approaches require additional modeling to generate Li response phenotypes, they may offer a more nuanced approach, which, in turn, would enhance the probability of identifying significant signals in genetic studies.
... A particular challenge to pharmacogenomics in BD has been the measurement of treatment response which can be limited by the length of follow-up, adherence to medication, and confounding due to the multi-drug treatment strategy common to the illness. Consequently, a systematic rating system with a high inter-rater reliability, the Alda score, was developed to quantify the clinical improvement of BD during treatment while also accounting for potential confounders of treatment response (Nunes, Trappenberg, & Alda, 2020). However, obtaining large samples with reliable measures has limited the statistical power to discover clinically-informative genetic variants associated with treatment response. ...
Article
Full-text available
Bipolar disorder (BD) is a highly heritable mental disorder and is estimated to affect about 50 million people worldwide. Our understanding of the genetic etiology of BD has greatly increased in recent years with advances in technology and methodology as well as the adoption of international consortiums and large population-based biobanks. It is clear that BD is also highly heterogeneous and polygenic and shows substantial genetic overlap with other psychiatric disorders. Genetic studies of BD suggest that the number of associated loci is expected to substantially increase in larger future studies and with it, improved genetic prediction of the disorder. Still, a number of challenges remain to fully characterize the genetic architecture of BD. First among these is the need to incorporate ancestrally-diverse samples to move research away from a Eurocentric bias that has the potential to exacerbate health disparities already seen in BD. Furthermore, incorporation of population biobanks, registry data, and electronic health records will be required to increase the sample size necessary for continued genetic discovery, while increased deep phenotyping is necessary to elucidate subtypes within BD. Lastly, the role of rare variation in BD remains to be determined. Meeting these challenges will enable improved identification of causal variants for the disorder and also allow for equitable future clinical applications of both genetic risk prediction and therapeutic interventions.
Article
Objectives Early-onset Bipolar Disorder (EOBD), has a more malignant course with high recurrence risk and there is a need for population-specific pharmaco-genomic study. Methods This study is a prospective and retrospective observational study. Both newly diagnosed patients and those on follow-up with a diagnosis of bipolar I disorder with onset before 18 years of age and on lithium prophylaxis as part of treatment-as-usual were recruited for the study. Response to treatment was assessed at the end of two years follow up using ALDA scale. Ten single nucleotide polymorphisms associated with treatment response based on previous studies were chosen for analysis. Results Of 162 who had EOBD, sixty-four fulfilled inclusion criteria and fifty-seven completed the study. TT and TG genotypes of rs75222709 on AL157359.3 gene were found to be significantly different between non-responders and healthy controls (N=220). The frequency of the GA genotype of the single nucleotide polymorphism rs17204573 of the RORA (Retinoic Acid related orphan receptor alpha) gene was significantly lower among subjects (27.3%) as compared to controls (42.9%) (odds ratio:0.5, CI: 0.26-0.96, p value 0.035). However, the significance of both disappeared after Bonferroni correction. Among clinical factors female gender was significantly associated with lithium non-response. Conclusion Although conducting pharmaco-genomic studies with large sample size is a challenge for low and middle-income countries, future studies can help improve the long-term outcome of youth with EOBD.
Article
Full-text available
After decades of research, the mechanism of action of lithium in preventing recurrences of bipolar disorder remains only partially understood. Lithium research is complicated by the absence of suitable animal models of bipolar disorder and by having to rely on in vitro studies of peripheral tissues. A number of distinct hypotheses emerged over the years, but none has been conclusively supported or rejected. The common theme emerging from pharmacological and genetic studies is that lithium affects multiple steps in cellular signaling, usually enhancing basal and inhibiting stimulated activities. Some of the key nodes of these regulatory networks include GSK3 (glycogen synthase kinase 3), CREB (cAMP response element-binding protein) and Na(+)-K(+) ATPase. Genetic and pharmacogenetic studies are starting to generate promising findings, but remain limited by small sample sizes. As full responders to lithium seem to represent a unique clinical population, there is inherent value and need for studies of lithium responders. Such studies will be an opportunity to uncover specific effects of lithium in those individuals who clearly benefit from the treatment.Molecular Psychiatry advance online publication, 17 February 2015; doi:10.1038/mp.2015.4.
Article
Full-text available
The assessment of response to lithium maintenance treatment in bipolar disorder (BD) is complicated by variable length of treatment, unpredictable clinical course, and often inconsistent compliance. Prospective and retrospective methods of assessment of lithium response have been proposed in the literature. In this study we report the key phenotypic measures of the "Retrospective Criteria of Long-Term Treatment Response in Research Subjects with Bipolar Disorder" scale currently used in the Consortium on Lithium Genetics (ConLiGen) study. Twenty-nine ConLiGen sites took part in a two-stage case-vignette rating procedure to examine inter-rater agreement [Kappa (κ)] and reliability [intra-class correlation coefficient (ICC)] of lithium response. Annotated first-round vignettes and rating guidelines were circulated to expert research clinicians for training purposes between the two stages. Further, we analyzed the distributional properties of the treatment response scores available for 1,308 patients using mixture modeling. Substantial and moderate agreement was shown across sites in the first and second sets of vignettes (κ = 0.66 and κ = 0.54, respectively), without significant improvement from training. However, definition of response using the A score as a quantitative trait and selecting cases with B criteria of 4 or less showed an improvement between the two stages (ICC1 = 0.71 and ICC2 = 0.75, respectively). Mixture modeling of score distribution indicated three subpopulations (full responders, partial responders, non responders). We identified two definitions of lithium response, one dichotomous and the other continuous, with moderate to substantial inter-rater agreement and reliability. Accurate phenotypic measurement of lithium response is crucial for the ongoing ConLiGen pharmacogenomic study.
In many studies included in meta-analyses, the independent variable measure, the dependent variable measure, or both, have been artificially dichotomized, attenuating the correlation from its true value and resulting in (a) a downward distortion in the mean correlation and (b) an upward distortion in the apparent real variation of correlations across studies. We present (a) exact corrections for this distortion for the case in which only one of the variables has been dichotomized and (b) methods for making approximate corrections when both variables have been artificially dichotomized. These approximate corrections are shown to be quite accurate for most research data. Methods for weighting the resulting corrected correlations in meta-analysis are presented. These corrections make it possible for meta-analysis to yield approximately unbiased estimates of mean population correlations and their standard deviations despite the initial distortion in the correlations from individual studies. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
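As a rough illustration of the attenuation described in this abstract, the sketch below computes the standard biserial-type factor by which a correlation shrinks when one of two bivariate-normal variables is split at a cut point. It is not taken from the cited paper: the bivariate-normality assumption, the function names `dichotomization_attenuation` and `correct_for_dichotomization`, and the example values are all introduced here for illustration only.

```python
import math
from statistics import NormalDist


def dichotomization_attenuation(p: float) -> float:
    """Shrinkage factor for a correlation when one variable is dichotomized.

    Assumes (hypothetically) that both variables are bivariate normal and
    that a proportion p of cases falls below the cut point.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    std_normal = NormalDist()
    z = std_normal.inv_cdf(p)        # cut point on the standard normal scale
    ordinate = std_normal.pdf(z)     # normal density at the cut point
    return ordinate / math.sqrt(p * (1.0 - p))


def correct_for_dichotomization(r_observed: float, p: float) -> float:
    """Divide an observed correlation by the attenuation factor."""
    return r_observed / dichotomization_attenuation(p)


# Illustrative values only: a median split and an observed correlation of 0.40.
median_factor = dichotomization_attenuation(0.5)
corrected = correct_for_dichotomization(0.40, 0.5)
print(f"attenuation at median split: {median_factor:.3f}")
print(f"corrected correlation:       {corrected:.3f}")
```

Under these assumptions a median split retains roughly 80% of the underlying correlation, and the loss grows as the cut point moves away from the median.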
Objective: Promptly establishing maintenance therapy could reduce morbidity and mortality in patients with bipolar disorder. Using a machine learning approach, we sought to evaluate whether lithium responsiveness (LR) is predictable using clinical markers. Methods: Our data are the largest existing sample of direct interview-based clinical data from lithium-treated patients (n=1266, 34.7% responders), collected across 7 sites, internationally. We trained a random forest model to classify LR, as defined by the previously validated Alda scale, against 180 clinical predictors. Results: Under appropriate cross-validation procedures, LR was predictable in the pooled sample with an area under the receiver operating characteristic curve of 0.80 (95% CI 0.78-0.82) and a Cohen's kappa of 0.46 (0.4-0.51). The model demonstrated a particularly low false positive rate (specificity 0.91 [0.88-0.92]). Features related to clinical course and the absence of rapid cycling appeared consistently informative. Conclusion: Clinical data can inform out-of-sample LR prediction to a potentially clinically relevant degree. Despite the relevance of clinical course and the absence of rapid cycling, there was substantial between-site heterogeneity with respect to feature importance. Future work must focus on improving classification of true positives, better characterizing between- and within-site heterogeneity, and further testing such models on new external datasets.
Background: Lithium remains a first-line treatment in bipolar disorder, but individual response is variable. Previous studies have suggested that lithium response is a heritable trait. However, no genetic markers have been reproducibly identified. Methods: Here we report the results of a genome-wide association study of lithium response in 2,563 patients collected by 22 participating sites from the International Consortium on Lithium Genetics (ConLiGen), the largest attempted so far. Data from over 6 million common single nucleotide polymorphisms (SNPs) were tested for association with categorical and continuous ratings of lithium response of known reliability. Findings: A single locus of four linked SNPs on chromosome 21 met genome-wide significance criteria for association with lithium response (rs79663003: p=1·37×10⁻⁸; rs78015114: p=1·31×10⁻⁸; rs74795342: p=3·31×10⁻⁹; rs75222709: p=3·50×10⁻⁹). In an independent, prospective study of 73 patients treated with lithium monotherapy for a period of up to two years, carriers of the response-associated alleles had a significantly lower rate of relapse than carriers of the alternate alleles (p=0·03, hazard ratio = 3·8). Interpretation: The response-associated region contains two genes coding for long non-coding RNAs (lncRNAs), AL157359.3 and AL157359.4. LncRNAs are increasingly appreciated as important regulators of gene expression, particularly in the CNS. Further studies are needed to establish the biological context of these findings and their potential clinical utility. Confirmed biomarkers of lithium response would constitute an important step forward in the clinical management of bipolar disorder.
We comment on [36] by evaluating the practice of discretizing continuous variables. We show that dichotomizing a continuous variable, via the median split procedure or otherwise, and analyzing the resulting data via ANOVA involves a large number of costs that can be avoided by preserving the continuous nature of the variable and analyzing the data via linear regression. As a consequence, we recommend that regression remain the normative procedure both when the statistical assumptions explored by Iacobucci et al. hold and more generally in research involving continuous variables. We also discuss the advantages of preserving the continuous nature of the variable for graphical presentation and provide practical suggestions for such presentations.
In a recent publication, J. R. Kirby and J. P. Das (see PA, 62:5150) dichotomized 2 measures of individual differences at medians and thereafter treated these measures as if they were independent variables in an ANOVA. If instead they had analyzed their various measures by means of traditional correlational analysis, they would have had much more powerful tests of their hypotheses. They would also, in all probability, have been less inclined to interpret their results as if the dichotomized variables represented independent, causal antecedents of their various measures of intelligence. (4 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
In a critique of a 1969 paper by L. G. Humphreys and H. P. Dachler, T. J. Fischbach and H. J. Walberg (see record 1971-23066-001) advised research workers who use individual-differences variables in orthogonal analysis of variance designs to obtain equal Ns in each cell and not to worry about population Ns. This is thoroughly misleading advice and is based upon an inadequate model of components of variance in individual-differences measures. On the basis of present computer simulation analyses of the problem, it is concluded that the analysis of variance is an awkward, inefficient statistical model in these cases and that correlational analysis has many advantages for such problems. Some of the literature involving the pseudo-orthogonal design advocated by Fischbach and Walberg can be salvaged, when properly interpreted, but other research involving this method should be discarded and a fresh start should be made with adequate design and methods of analysis. (PsycINFO Database Record (c) 2012 APA, all rights reserved)