Web-based training and interrater reliability testing for scoring the
Hamilton Depression Rating Scale
Jules Rosen a,d,⁎, Benoit H. Mulsant a,b, Patricia Marino c, Christopher Groening d,
Robert C. Young c, Debra Fox e
a Department of Psychiatry, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States
b Geriatric Mental Health Program, Centre for Addiction and Mental Health, and University of Toronto, Toronto, Ontario, Canada
c Department of Psychiatry, Weill Medical College of Cornell University, White Plains, NY, United States
d Katz Graduate School of Business, University of Pittsburgh, Pittsburgh, PA, United States
e Fox Learning Systems, Inc., Pittsburgh, PA, United States
Received 16 October 2006; received in revised form 17 October 2007; accepted 2 March 2008
⁎ Corresponding author. Professor of Psychiatry, University of Pittsburgh School of Medicine, 3811 O'Hara St., Pittsburgh, PA 15213, United States. Tel.: +1 412 246 5900; fax: +1 412 586 9300. E-mail address: email@example.com (J. Rosen).
Despite the importance of establishing shared scoring conventions and assessing interrater reliability in clinical trials in
psychiatry, these elements are often overlooked. Obstacles to rater training and reliability testing include the logistic
difficulties of providing live training sessions or of mailing videotapes of patients to multiple sites and collecting the data for analysis. To address
some of these obstacles, a web-based interactive video system was developed. It uses actors of diverse ages, genders, and races to
train raters to score the Hamilton Depression Rating Scale and to assess interrater reliability. This system was tested with a
group of experienced and novice raters within a single site. It was subsequently used to train raters of a federally funded multi-
center clinical trial on scoring conventions and to test their interrater reliability. The advantages and limitations of using interactive
video technology to improve the quality of clinical trials are discussed.
© 2008 Elsevier Ireland Ltd. All rights reserved.
Keywords: Depression; Research methods; Rater training; Clinical trials; Hamilton Depression; Rating scales; Interrater reliability
In clinical trials, the reliability of the data collected
ultimately determines the validity of the studies' conclu-
sions (Kobak et al., 1996). In psychiatry, the primary
outcome measures often depend on interviewers' skills
in eliciting information, as well as their interpretations
of the subjects' responses (Kobak et al., 2005a). When
multiple raters are used in a clinical trial, differences
between raters in terms of interviewing technique and
scoring criteria introduce variability that can distort the
outcome measures (Muller and Szegedi, 2002; Bourin
et al., 2004). Despite the importance of statistically
establishing raters' reliability, a review of the literature
suggests that this issue is often ignored in clinical trials,
including those of depression treatment (Mulsant et al.,
2002). This is especially problematic in multi-center
trials that involve geographically dispersed groups of
raters, that may change over time, and that may recruit
patients over several years.
We have previously reported that, when scored by
experienced raters, videotapes of professional actors
performing scripted interviews of the Hamilton
Depression Rating Scale (HDRS) could not be distinguished
from videotapes of the actual patients (Rosen et al., 2004).
Building on the findings of that study, we developed a
web-based system using professional actors both to train
raters on scoring the HDRS using shared scoring con-
ventions and to assess interrater reliability. This report
describes: 1) the development of the system, 2) a study
of HDRS scoring-tutorial and reliability testing with
both naive and experienced raters, and 3) results of a
field test of this system in a multi-site NIMH-funded clinical trial.
1.2. Development and description of web-based system
The web-based system consists of three components:
1) a scoring-tutorial program, 2) a reliability testing
program, and 3) an administrative program. To use the
system, raters need a high-speed Internet connection
and a browser with the “Flash” plug-in installed.
To develop the scoring-tutorial and reliability testing
programs, informed consent was obtained to video-
record 21 HDRS interviews of seven patients participat-
ing in an NIMH-funded study of depression at initiation
of treatment, in mid-treatment, and in partial or full
remission. The HDRS interview used in this project is
based on the published interview guide of Williams
(1988), and has been previously used in depression
trials in the U.S. (Mulsant et al., 1999; Tew et al., 1999;
Sackeim et al., 2000; Sackeim et al., 2001; Gildengers
et al., 2005; Feske et al., 2004; Reynolds et al., 2006;
Dombrovski et al., 2006). As each patient was followed
through the course of his or her treatment, the scores of
these 21 interviews covered the full range of severity:
below 10 (absence of depression), 11–20 (mild to
moderate depression), 21–29 (severe depression), and
greater than 30 (very severe depression, including psychosis).
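As a concrete illustration (ours, not part of the published system), the minimal Python sketch below maps a total HDRS score to the severity bands just described. The published strata leave totals of exactly 10 and 30 unassigned, so their placement here is an assumption, flagged in the comments.

def hdrs_severity(total):
    """Map a total HDRS score to the severity bands described above."""
    if total < 10:
        return "absence of depression"
    if total <= 20:  # published band is 11-20; assigning exactly 10 here is our assumption
        return "mild to moderate depression"
    if total <= 29:
        return "severe depression"
    # published band is "greater than 30"; assigning exactly 30 here is our assumption
    return "very severe depression (may include psychosis)"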
The videotaped interviews of the patients were transcribed,
yielding 21 scripts, which were modified to remove all
information that might identify the actual patients. In
order to create realistic portrayals of different stages of
depression in diverse populations, three male and three
female actors were recruited to portray young, mid-life,
and elderly adults. One of the male and one of the female
actors were African-American. Each actor recorded 9 or
10 scripts that were slightly modified to be age- and
gender-appropriate for the actor (e.g., a reference to a
child may be changed to a reference to a grandchild).
Ten of the scripts were used to create the tutorial
program designed to train raters on scoring conventions.
The scoring-tutorial program provides video vignettes
for every possible score of each of the 28 HDRS items. For item-
scores not represented by actual interviews, the scripts
were modified by changing either the intensity or the
frequency of a symptom to shift the score to a more or
less severe rating. In the tutorial mode, trainees have the
option of watching every vignette for each question in
the order of increasing severity. Alternatively, they can
watch them in random order. While the rater is ob-
serving the interview in the tutorial mode, the scoring
guidelines are presented in text format in a box below
the video stream. At the end of each vignette, trainees
enter their scores, and the system informs them when their scores
differ from the scores assigned by two expert psychia-
trist/raters. These raters (JR and BHM) have more than
20 years of cumulative experience administering and
scoring the HDRS.
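The paper does not describe the system's internal data layout; purely as an illustration of the tutorial behavior described above, the following Python sketch (all names and the clip-naming scheme are hypothetical) indexes vignettes by item and score and supports the two playback orders, increasing severity or random.

import random

# Hypothetical index: (item, score) -> vignette clip ID. Item 1 is shown
# with the 0-4 score range; several HDRS items use a 0-2 range.
VIGNETTES = {(1, score): "item01_score%d" % score for score in range(5)}

def tutorial_playback(item, randomize=False):
    """Return the clip IDs for one item, in increasing-severity order by
    default, or shuffled when the trainee chooses random order."""
    clips = [clip for (i, s), clip in sorted(VIGNETTES.items()) if i == item]
    if randomize:
        random.shuffle(clips)
    return clips

print(tutorial_playback(1))                  # increasing severity
print(tutorial_playback(1, randomize=True))  # random order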
Following completion of the tutorial, the raters are
directed by the system to the reliability testing program.
The testing program was created with the 11 scripts that
were not used in the tutorial program. To test interrater
reliability, raters are presented with six of the HDRS
interviews representing a full range of severity of de-
pression. As in the tutorial mode, while raters watch the
interview, the scoring guideline corresponding to the
item that is being probed by the interviewer is presented
in text format below the video stream. After raters select
a score for a particular item, the system progresses to
the next question. Raters have the opportunity to go
back and review any question and their score until they
have scored all the items and “lock in” their scores at
the end of the testing session. Once raters complete a
particular interview and lock in their scores, these
scores are stored in a database and are available to
calculate interrater reliability. All raters who are
associated with a given study complete the reliability
testing mode with the same six interviews. Repeat
testing to assess rater drift over time can be accom-
plished with an alternate set of interviews.
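Although the database schema is not specified in the paper, the workflow just described (every rater locking in scores for the same six interviews) yields an interviews-by-raters matrix of exactly the form required for the Case 3 ICC computed below. A minimal Python sketch, with hypothetical record fields and IDs:

# Hypothetical locked-in records: one total score per rater per interview.
locked = [
    {"rater": "R01", "interview": 1, "total": 22},
    {"rater": "R01", "interview": 2, "total": 9},
    {"rater": "R02", "interview": 1, "total": 21},
    {"rater": "R02", "interview": 2, "total": 8},
]

def ratings_matrix(records, raters, interviews):
    """Arrange locked totals as an interviews-by-raters (targets x judges)
    matrix; every rater must have scored every interview (Case 3)."""
    lookup = {(rec["interview"], rec["rater"]): rec["total"] for rec in records}
    return [[lookup[(i, j)] for j in raters] for i in interviews]

matrix = ratings_matrix(locked, ["R01", "R02"], [1, 2])  # [[22, 21], [9, 8]]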
The system is designed to provide scoring-tutorials
and reliability testing using the 17-, 24- or 28-item
versions of the HDRS. The scoring conventions used for
the first 17 items are based on the published conventions
of the 17-item “Grid-Hamilton,” which provides a single
score for each item based on both the intensity and
frequency of depressive symptoms (Kalali et al., 2002).
The scoring conventions used for items 18–28 were
127J. Rosen et al. / Psychiatry Research 161 (2008) 126–130
Author's personal copy
adapted by two of the authors (JR and BHM) to be
congruent with the Grid-Hamilton scoring conventions.
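To convey the Grid idea of deriving a single item score jointly from intensity and frequency, here is a deliberately toy Python sketch. The rule below is invented for illustration only; the actual GRID-HAMD anchors are item-specific and are published in Kalali et al. (2002).

def grid_item_score(intensity, frequency):
    """Toy single score from intensity (0-4) and frequency (0-3).
    This is NOT the published GRID-HAMD rule: here, a symptom present
    less than half the time is simply rated one step below its peak
    intensity."""
    if intensity == 0 or frequency == 0:
        return 0
    return intensity if frequency >= 2 else max(1, intensity - 1)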
The administrative program is designed to perform
several functions. The overall administrator of a clinical
trial can identify the sites participating in the study and
designate for each site a site coordinator. The overall
administrator also specifies the version of the HDRS to
be used for training and reliability testing (i.e., 17-, 24-,
or 28-item version). In multi-site studies, the site co-
ordinators enter the names and ID numbers of raters at
each research site. Sites and raters can be added or
removed during the course of a clinical trial. A database
stores the test scores of each rater. Intraclass correlation
(ICC) coefficients are calculated for the raters participating
in a particular study, or by site, according to the
formula of Shrout and Fleiss (1979),
who described calculations based on one of three main
cases, depending on the assignment of judges. Our study
follows Case 3, in which “each target is rated by each of
the same k judges, who are the only judges of interest.”
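For reference, the single-rater Case 3 coefficient is ICC(3,1) = (BMS − EMS) / (BMS + (k − 1) EMS), where BMS is the between-targets mean square and EMS the residual mean square of a two-way ANOVA. The Python sketch below (ours; the published system's implementation is not described) computes it from an interviews-by-raters matrix such as the one assembled above.

import numpy as np

def icc_case3(ratings):
    """ICC(3,1) of Shrout and Fleiss (1979): n targets each rated by the
    same k judges, who are the only judges of interest."""
    r = np.asarray(ratings, dtype=float)
    n, k = r.shape
    grand = r.mean()
    ss_targets = k * ((r.mean(axis=1) - grand) ** 2).sum()
    ss_judges = n * ((r.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((r - grand) ** 2).sum() - ss_targets - ss_judges
    bms = ss_targets / (n - 1)            # between-targets mean square
    ems = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)

# Example: four interviews, each scored by the same three raters.
print(icc_case3([[22, 21, 23], [9, 8, 9], [31, 30, 33], [15, 16, 14]]))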
2.1. Study 1: initial evaluation
Research raters were recruited from the research
programs of the Department of Psychiatry at the Uni-
versity of Pittsburgh School of Medicine to conduct an
initial evaluation of the web-based system prior to fi-
nalization of the system and actual field testing. All
participants were research raters in one of the psychiatry
research programs. All of them had received previous
training on at least one rating instrument, using classroom
instruction and videotapes to establish reliability.
However, some had not been trained to administer and
score the HDRS. Regardless of their prior experience
with the HDRS, they were required to complete the
tutorial program prior to reliability testing. ICCs were
calculated for the entire group and for three subgroups:
1) naive raters with no prior experience with the HDRS, 2) experienced
raters who had administered the HDRS fewer than 150
times, and 3) highly experienced raters who had admi-
nistered the HDRS 150 times or more.
Raters were also asked to keep track of and to report
the amount of time and the number of sessions needed to
complete the tutorial program and the reliability testing.
2.2. Study 2: field trial
To further evaluate the system, a field trial was completed:
an NIMH-funded multi-site study of late-life mood disorders used this
system to train raters on shared scoring conventions and
to assess interrater reliability.
For both Study 1 and Study 2, the scoring-tutorial and
the reliability testing were to be completed within a 2-
week window. Within that time frame, raters were per-
mitted flexibility in terms of how much time they spent
with the system and the number of sessions needed to
complete the tutorial and the testing. For Study 1, the 28-
item version of the HDRS was used; for Study 2, the 17-
item version of the HDRS was used.
3.1. Study 1: single site study
Of the 17 raters who participated in this study, seven
were naive, three experienced, and seven experts. The
mean age was 42.3 years (range: 22–60). One rater was
male; one rater was an African-American woman; the
remaining raters were Caucasian women.
Based on self-reports, the tutorial was completed in a
The mean number of hours to complete the reliability
testing was 3.3 (range: 2.5–5), in a mean of 2.6 separate sessions.
The ICCs for the naive, experienced and expert sub-
groups were 0.94, 0.93, and 0.96, respectively. The ICC
calculated for the entire group was 0.95.
3.2. Study 2: multi-site study
Of the 13 participating raters, 10 were female. One
woman was Asian and one Hispanic, and the remaining
raters were Caucasian. The mean age was 34.3 years
(range: 23–58). All participants completed the tutorial
before going on to the test mode. The ICC was 0.98 for
this group of raters. No problems were reported in accessing
the web-site, completing the interactive tutorial and testing, or recovering the ICC data.
3.3. Individual interviews/items
Each of the 11 testing interviews used in Study 1 and
Study 2 was individually assessed; Interviews 1–6 were
used in Study 1, and Interviews 4 and 7–11 were used in
Study 2. Table 1 describes the characteristics of these interviews.
The interrater reliabilities were excellent for both
Study 1 and Study 2. Establishing rater reliability in
studies of depression treatment is critically important,
but most studies do not report on rater training or
reliability measures (Mulsant et al., 2002). In typical
industry-supported clinical trials, meetings of investi-
gators are convened to provide instruction to raters and
investigators on the proper use of the various instru-
ments. However, rigorous assessments of rater relia-
bility rarely occur at these meetings or at any later
time. The practical importance of interrater reliability
has been established as essential to reducing variability
in multi-site trials (Small et al., 1996). In that report,
inadequate rater training and the absence of a measure of
interrater reliability were shown to skew the results.
The relatively high ICCs for all groups of raters
participating in this study are consistent with interrater
reliability described in several studies with the HDRS
that used traditional videotapes. In a large multi-center
study, the ICC for conventionally trained raters on
scoring the HDRS was 0.97 (Sackeim et al., 2001), and
in a single center study with multiple raters, the ICC for
conventionally trained raters was 0.95 (Feske et al., 2004).
Although not supported as a training tool in all set-
tings (Sanchez et al., 1995), videotapes of patients have
been used to train raters and establish reliability in some
studies (Muller and Wetzel, 1998; Muller and Dragicevic, 2003).
Limitations to this technique include the logistical sup-
port needed to mail videotapes to all raters at multiple
sites, the inability to interact with video-based training,
and the need to return completed scores to a data management center and enter the data.
The computer-based system provides interactive train-
ing that is continuously available through any computer
that has high speed connectivity to the Internet. The in-
tegrated database provides ICC calculations and report
generation without the additional work of mailing or faxing
score sheets. New raters and sites can be added over time, and the ICCs can be
recalculated for the group. Finally, rater drift can be assessed.
It is important to note that the web-based system described
in this report trains raters only with regard to scoring conventions. The equally important
component of training raters on clinical interview skills
required for the administration of the HDRS was not
addressed in this study. The importance and effectiveness
of providing rater interview training for the HDRS with a
web-based instrument has been previously demonstrated
(Kobak et al., 2006; Jeglic et al., 2007). Additional limi-
tations to this study include the relatively small sample
size and the fact that all of the “naive” raters were
experienced clinicians or raters who used other assessment
instruments. Finally, the use of videotapes may
artificially inflate estimates of reliability by reducing the
information variance that would result if each rater in-
terviewed the patient independently (Spitzer and Williams, 1980).
In conclusion, the current study evaluated a web-
based system of interactive scoring training and relia-
bility testing in a group of raters using the HDRS in a
single-site and a multi-site study. The ICCs calculated
support the effectiveness of this system without the
additional logistical burden involved with the use of videotapes.
Acknowledgements
This work was sponsored in part by the National
Institutes of Health (grants MH067028, MH068847, HS011976, U01 MH074511).
References
Andreasen, N.C., McDonald-Scott, P., Grove, W.M., Keller, M.B.,
Shapiro, R.W., Hirschfeld, R.M., 1982. Assessment of reliability
in multicenter collaborative research with a videotape approach.
American Journal of Psychiatry 139 (7), 876–882.
Bourin, M., Deplanque, D., Zins-Ritter, M., 2004. Mean Deviation of
Inter-rater Scoring (MDIS): a simple tool for introducing conformity
into groups of clinical investigators. International Clinical Psycho-
pharmacology 19 (4), 209–213.
Dombrovski, A.Y., Blakesley-Ball, R.E., Mulsant, B.H., Mazumdar,
S., Houck, P.R., Szanto, K., Reynolds III, C.F., 2006. Speed of
improvement in sleep disturbance and anxiety compared with core
mood symptoms during acute treatment of depression in old age.
American Journal of Geriatric Psychiatry 14 (6), 550–554.
Feske, U., Mulsant, B.H., Pilkonis, P.A., Soloff, P., Dolata, D.,
Sackeim, H.A., Haskett, R.F., 2004. Clinical outcome of ECT in
patients with major depression and comorbid borderline personality
disorder. American Journal of Psychiatry 161 (11), 2073–2080.
Gildengers, A.G., Houck, P.R., Mulsant, B.H., Dew, M.A., Aizenstein, H.J., et al.,
2005. Trajectories of treatment response in late-life depression:
psychosocial and clinical correlates. Journal of Clinical Psycho-
pharmacology 25 (4 Suppl 1), S8–S13.
Jeglic, E., Kobak, K.A., Engelhardt, N., Williams, J.B., Lipsitz, J.D.,
Salvucci, D., Bryson, H., Bellew, K., 2007. A novel approach to
rater training and certification in multinational trials. International
Clinical Psychopharmacology 22 (4), 187–191.
Kalali, A., Williams, J.B., Kobak, K.A., Lipsitz, J., Engelhardt, N., 2002.
GRID HAM-D: pilot testing and international field trials. Interna-
tional Clinical Psychopharmacology 5, S147–S148.
Kobak, K.A., Engelhardt, N., Lipsitz, J.D., 2006. Enriched rater training
using Internet based technologies: a comparison to traditional rater
training. Journal of Psychiatric Research 40 (3), 192–199.
Kobak, K.A., Feiger, A.D., Lipsitz, J.D., 2005a. Interview quality and
signal detection in clinical trials. American Journal of Psychiatry
162 (3), 628.
Kobak, K.A., Lipsitz, J.D., Williams, J.B., Engelhardt, N., et al.,
2005b. A new approach to rater training and certification in a multi-
center clinical trial. Journal of Clinical Psychopharmacology 25 (5), 407–412.
Kobak, K.A., Greist, J.J., Jefferson, J.W., Katzelnick, D.J., 1996.
Computer-administered clinical rating scales. A review. Psycho-
pharmacology 127 (4), 291–301.
Kobak, K.A., Lipsitz, J.D., Feiger, A., 2003. Development of a standar-
dized training program for the Hamilton Depression Scale using
internet-based technologies: results from a pilot study. Journal of
Psychiatric Research 37 (6), 509–515.
Muller, M.J., Dragicevic, A., 2003. Standardized rater training for the
Hamilton Depression Rating Scale (HAMD-17) in psychiatric
novices. Journal of Affective Disorders 77 (1), 65–69.
Muller, M.J., Szegedi, A., 2002. Effects of interrater reliability of psy-
chopathologic assessment on power and sample size calculations
in clinical trials. Journal of Clinical Psychopharmacology 22 (3),
Muller, M.J., Wetzel, H., 1998. Improvement of inter-rater reliability
of PANSS items and subscales by a standardized rater training.
Acta Psychiatrica Scandinavica 98 (2), 135–139.
Muller, M.J., Rossbach, W., Dannigkeit, P., Muller-Siecheneder, F.,
Szegedi, A., Wetzel, H., 1998. Evaluation of standardized rater
training for the Positive and Negative Syndrome Scale (PANSS).
Schizophrenia Research 32 (3), 151–160.
Mulsant, B.H., Kastango, K.B., Rosen, J., Stone, R.A., Mazumdar, S.,
Pollock, B.G., 2002. Interrater reliability in clinical trials of depressive
disorders. American Journal of Psychiatry 159 (9), 1598–1600.
Mulsant, B.H., Pollock, B.G., Nebes, R.D., Miller, M.D., Little, J.T., et al.,
1999. A double-blind randomized comparison of nortriptyline and
paroxetine in the treatment of late-life depression: 6-week outcome.
Journal of Clinical Psychiatry 60 (Suppl 20), 16–20.
Reynolds III, C.F., Dew, M.A., Pollock, B.G., Mulsant, B.H., Frank,
E., Miller, M.D., Houck, P.R., Mazumdar, S., Butters, M.A., Stack,
J.A., Schlernitzauer, M.A., Whyte, E.M., Gildengers, A., Karp, J.,
Lenze, E., Szanto, K., Bensasi, S., Kupfer, D.J., 2006. Main-
tenance treatment of major depression in old age. New England
Journal of Medicine 354 (11), 1130–1138.
Rosen, J., Mulsant, B.H., Bruce, M.L., Mittal, V., Fox, D., 2004. Actors'
portrayals of depression to test interrater reliability in clinical trials.
American Journal of Psychiatry 161 (10), 1909–1911.
Sackeim, H.A., Haskett, R.F., Mulsant, B.H., Thase, M.E., Mann, J.J.,
et al., 2001. Continuation pharmacotherapy in the prevention of relapse
following electroconvulsive therapy: a randomized controlled trial.
Journal of the American Medical Association 285 (10), 1299–1307.
Sackeim, H.A., Prudic, J., Devanand, D.P., Nobler, M.S., Lisanby, S.H.,
Peyser, S., Fitzsimons, L., Moody, B.J., Clark, J., 2000. A
prospective, randomized, double-blind comparison of bilateral and
right unilateral electroconvulsive therapy at different stimulus
intensities. Archives of General Psychiatry 57 (5), 425–434.
Sanchez, L.E., Adams, P.B., Uysal, S., Hallin, A., Campbell, M., Small,
A.M., 1995. A comparison of live and videotape ratings: clomipramine
and haloperidol in autistic children.
Shrout, P.E., Fleiss, J.L., 1979. Intraclass correlations: uses in assessing
rater reliability. Psychological Bulletin 86 (2), 420–428.
Small, G.W., Schneider, L.S., Hamilton, S.H., Bystritsky, A., Meyers,
B.S., Nemeroff, C.B., 1996. Site variability in a multisite geriatric
depression trial. International Journal of Geriatric Psychiatry 11,
Spitzer, R.L., Williams, J.B.W., 1980. Classification in Psychiatry.
Williams & Wilkins, Baltimore.
Targum, S.D., 2006. Evaluating rater competency for CNS clinical
trials. Journal of Clinical Psychopharmacology 26 (3), 308–310.
Tew Jr., J.D., Mulsant, B.H., Haskett, R.F., Prudic, J., Thase, M.E.,
Crowe, R.R., Dolata, D., Begley, A.E., Reynolds III, C.F., Sackeim,
H.A., 1999. Acute efficacy of ECT in the treatment of major de-
pression in the old–old. American Journal of Psychiatry 156 (12), 1865–1870.
Williams, J.B.W., 1988. A structured interview guide for the Hamilton
Depression Rating Scale. Archives of General Psychiatry 45 (8), 742–747.