Original article
The GRID-HAMD: standardization of the Hamilton
Depression Rating Scale
Janet B.W. Williams [a,c], Kenneth A. Kobak [c], Per Bech [g], Nina Engelhardt [c],
Ken Evans [f], Joshua Lipsitz [a,c], Jason Olin [b], Jay Pearson [d] and Amir Kalali [e]
This report describes the GRID-Hamilton Depression
Rating Scale (GRID-HAMD), an improved version of the
Hamilton Depression Rating Scale that was developed
through a broad-based international consensus process.
The GRID-HAMD separates the frequency of the symptom
from its intensity for most items, refines several
problematic anchors, and integrates both a structured
interview guide and consensus-derived conventions for all
items. Usability was established in a small three-site
sample of convenience, evaluating 29 outpatients, with
most evaluators finding the scale easy to use. Test–retest
(4-week) and interrater reliability were established in 34
adult outpatients with major depressive disorder, as part of
an ongoing clinical trial. In a separate study, interrater
reliability was found to be superior to the Guy version of
the HAMD, and as good as the Structured Interview Guide
for the Hamilton Depression Rating Scale (SIGH-D), across
30 interview pairs. Finally, using the SIGH-D as the criterion
standard, the GRID-HAMD demonstrated high concurrent
validity. Overall, these data suggest that the GRID-HAMD is
an improvement over the original Guy version as well as
the SIGH-D in its incorporation of innovative features
and preservation of high reliability and validity. Int Clin
Psychopharmacol 23:120–129 © 2008 Wolters Kluwer
Health | Lippincott Williams & Wilkins.
International Clinical Psychopharmacology 2008, 23:120–129
Keywords: clinical trials, depressive disorder, Psychiatric Status Rating
Scales, treatment outcome
[a] Columbia University, New York, New York,
[b] Novartis Pharmaceuticals, East Hanover, New Jersey, USA,
[c] MedAvante Inc., Hamilton, Ontario, Canada,
[d] Merck Research Laboratories, Whitehouse Station, New Jersey,
[e] Quintiles and the University of California, San Diego, California, USA,
[f] Ontario Cancer Biomarker Network, Toronto, Ontario, Canada and
[g] Psychiatric Research Institute and Frederiksborg General Hospital, Copenhagen, Denmark
Correspondence to Janet B.W. Williams, DSW, MedAvante Inc., 100 American
Metro Blvd., Suite 106, Hamilton, NJ 08619, USA
Tel: + 609 528 9472; fax: + 609 528 9405;
e-mail: jbw5@columbia.edu or jwilliams@medavante.net
Received 29 April 2007 Accepted 21 January 2008
The Hamilton Depression Rating Scale (HAMD) (Hamilton,
1960) was introduced in 1960 to measure severity
of depression primarily in a hospitalized depressed
population, though it is now used most frequently in
outpatient settings. It is the most widely used clinician-
administered depression rating scale, and is used in most
clinical trials of antidepressant medications. However,
although the scale has been used successfully to assess
the effectiveness of antidepressants, it has been criticized
for poor item reliabilities, and, for some items, poor
discriminative properties across the range of depressive
severity (Faries et al., 2000; Santor and Coyne, 2001;
Evans et al., 2004).
The number of versions of the HAMD is legendary; in
fact, there are so many that researchers and clinicians
have lost track of the characteristics of each version.
Williams (2001) reviewed and summarized 11 published
versions, which differ widely in the number (ranging from
17 to 29), sequence, and wording of items. Few reports
provide a reference to the version of the HAMD used in a
trial (Zitman et al., 1990), and no single version of the
HAMD or single set of conventions has been universally
accepted. Over time, different aids for administering the
HAMD and modifications to the scale have been
proposed: self-report (Reynolds and Kobak, 1995) and
computerized versions (Kobak et al., 1990, 2000),
structured interview guides (Williams, 1988; Potts et al.,
1990; Whisman et al., 1999), and reduced (Maier and
Philipp, 1985; Bech et al., 1986; Gibbons et al., 1993) and
expanded (Thase, 1984; Gelenberg et al., 1990) item sets.
Although some of these modifications and aids claim to
have improved the reliability and sensitivity of the scale,
others have not, and interrater and intrarater item
reliability has remained problematic.
Despite its limitations, the scale remains a useful and
popular instrument, and there are good arguments for
continuing its use, at least until there is a superior
alternative with better psychometric characteristics.
Zimmerman has highlighted several advantages to the
HAMD, such as its provision of long-term continuity in
assessment methodology that allows comparison across
decades of studies, and its ability to discriminate drug
from placebo in clinical trials and to measure change
(Zimmerman et al., 2005). However, half a century after
0268-1315 © 2008 Wolters Kluwer Health | Lippincott Williams & Wilkins
Copyright © Lippincott Williams & Wilkins. Unauthorized reproduction of this article is prohibited.
its development, most would agree that a revision and
standardization of the scale is timely (Bagby et al., 2004).
The Depression Rating Scale Standardization Team
A proposal was made at the 1999 meeting of the National
Institute of Mental Health-sponsored New Clinical Drug
Evaluation Unit to establish a common set of standards
for scoring and administering the HAMD. This proposal
led to the formation of the Depression Rating Scale
Standardization Team (DRSST), a collaboration of
individuals from academia, clinical practice, the pharma-
ceutical industry, and government. A core working group
was selected (with representatives from all four groups).
The mission of the core group was to establish a process
whereby individuals from different and sometimes com-
peting disciplines could work cooperatively to develop a
standard approach to administering and scoring the
HAMD that would be used by academic and pharmaceu-
tical industry researchers in academic and clinical
settings.
A 3-day meeting was held in October 2000 to draft a
standardized version of the scale. At the outset, it was
agreed that the goal was to standardize the administration
and scoring of the HAMD without significantly altering
the original intent of Hamilton’s items or the scoring
profile. Problems with this approach were identified so
that the work of the group could focus on correcting these
deficiencies.
Several issues were addressed. First, most items of the
HAMD require an assessment of both the intensity and
frequency of a given symptom (or set of symptoms). This
creates a challenge for scoring because both frequency
and intensity must be taken into account before an
overall severity score can be assigned. No generally
accepted guidelines to help the rater determine the
contribution of both to symptom severity exist. For this
reason, the group developed a grid format, described
below.
Second, the possible responses for severity within many
of the items, that is, options 0–4, are ambiguously worded
and lack useful clinical examples. Work focused on
clarifying wording and adding examples. In the Depressed
Mood item of the GRID-HAMD, for example, a score of 1
(mild) was changed from ‘These feeling states indicated
only on questioning.’ to ‘Feelings of sadness, discourage-
ment, low self-esteem, pessimism.’
Third, a consistent observation across many items was
that the description for a score of 4 (‘very severe’)
pertained to the hospitalized patient and not to the
depressed outpatient with whom the scale is most
frequently used. In the GRID-HAMD, item descriptions
were modified to enhance the reliability of the scale and
its relevance to depressed outpatients. For example, a
score of 4 for Work and Activities was changed from
‘Stopped working because of present illness. In hospital,
rate 4 if patient engages in no activities except ward
chores or if patient fails to perform ward chores
unassisted.’ to, in the GRID-HAMD, ‘Unable to work;
needs help performing self-care activities; unable to
function without assistance.’
Fourth, the group focused on several HAMD items that
were regarded as especially problematic. For example,
insight is nearly always scored ‘0’ in outpatient studies
(Evans et al., 2004), and hypochondriasis has shown poor
item-total correlation and low interrater reliability. As a
result, these items add limited value to the assessment of
change in condition. These items were revised, not
discarded.
Fifth, for several severity levels, item descriptions have
been difficult for raters to interpret. For example, in the
Feelings of Guilt item, the third severity anchor states,
‘present illness is a punishment. Delusions of guilt.’ It is
not clear whether delusional thinking alone is sufficient
for this severity level, or whether it is also necessary.
An additional problem is that some commonly used item
descriptions have become outdated and confusing in light
of post-Diagnostic and Statistical Manual of Mental
Disorders-IV diagnostic criteria; for example, requiring
‘heaviness in limbs, back, or head’ for a positive rating of
Somatic Symptoms General. These confusing anchor and
item descriptions were clarified and updated.
The group concluded that many of these factors have
contributed to poor item reliabilities. Any revision to the
scale would need to: (1) be more straightforward to
use; (2) more clearly operationalize the anchor points;
(3) simplify scoring by allowing the rater to consider the
dimensions of intensity and frequency independently to
arrive at an overall severity rating; and (4) adopt a
standardized scale and conventions to obviate the need
for raters to learn different guidelines and conventions for
different simultaneous studies. Revision and standardiza-
tion are particularly needed when one considers the great
variety in the backgrounds and experience of raters
administering the HAMD in clinical trials today, ranging
from psychiatrists to study coordinators with little, if any,
clinical experience.
The group agreed that its goal was to develop a standard
method for administering and scoring the HAMD rather
than to develop a new instrument. Therefore, it was
crucial to improve the instrument while avoiding
significant changes to Hamilton’s original intent or
scoring profile. Thus, a patient scoring 18 on the original
HAMD-17 version would ideally score approximately 18
on the GRID-HAMD. Hamilton’s original guidelines for
administration and scoring were frequently consulted to
ensure fidelity to his original intent (Hamilton, 1960,
1967).
Follow-up conference calls and e-mail correspondence
resulted in a complete draft of the GRID-HAMD that
was distributed in April 2001 for feedback to approxi-
mately 200 depression researchers and clinicians world-
wide. The working group reviewed responses, blind to
authorship, and items were revised based on feedback
and then redistributed. Once consensus was achieved,
the probes and conventions were developed by the
working group and reviewed in a manner similar to the
process outlined above.
The GRID-HAMD
The complete GRID-HAMD has three components: the
GRID scoring system, the manual of scoring conventions,
and a semistructured interview guide. This paper
describes the final GRID system, and presents data from
two pilot studies as well as a study of the reliability and
validity of the system.
The GRID scoring system
The DRSST proposed a ‘grid’ scoring system in which
the dimensions of intensity and frequency of a symptom
are rated independently for each relevant item (Fig. 1),
with greater frequency generally resulting in a higher
score for a given intensity. Such scoring methods are used
in other fields (Chouinard and Miller, 1999) and can
simplify the assignment of a given rating. Symptom
intensity is considered on the vertical axis and symptom
frequency on the horizontal axis. The intersection of
these points reveals the patient’s score for the item.
Although the grid scoring system is different in appear-
ance from earlier HAMD scoring forms, the group’s
intention was to preserve the underlying scoring profile of
the original scale. It is hoped that the added clarity will
make each item easier to rate in a consistent manner.
To increase the reliability of individual items, ratings of
intensity and frequency were formalized using the grid
structure, item content was clarified, anchor descriptions
were enhanced with clinical examples at each severity
level, and the most severe response option within
5-option items was rephrased to increase the utility of
the HAMD in an outpatient population (e.g. eliminating
references to inpatient-specific functioning). Symptom
intensity, which generally includes degree of subjective
distress and functional impairment in the GRID-HAMD,
is rated as ‘absent, mild, moderate, severe, and very
severe’ or ‘absent, mild, and marked.’ Symptom fre-
quency is rated as ‘absent, occasional, much of the time,
and almost all of the time,’ with operational criteria
provided on each page of the GRID-HAMD. For
example, occasional frequency is defined as ‘Infrequent;
less than 3 days; and up to 30% of the week.’ Examples of
degrees of intensity and frequency are provided in the
GRID-HAMD itself.
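The lookup described above can be sketched in a few lines of Python. This is a minimal illustration and not part of the published instrument: the day thresholds follow the frequency definitions printed on the GRID-HAMD form (less than 3 days, 3–5 days, 6–7 days), but the cell values in `ILLUSTRATIVE_GRID` are hypothetical placeholders chosen only so that the score never decreases as frequency increases. The function names are likewise assumptions made for this sketch.

```python
from typing import Dict, Tuple

FREQUENCIES = ("absent", "occasional", "much of the time", "almost all of the time")


def frequency_category(days_in_past_week: int) -> str:
    """Map the number of symptomatic days in the past week to a frequency label."""
    if not 0 <= days_in_past_week <= 7:
        raise ValueError("days_in_past_week must be between 0 and 7")
    if days_in_past_week == 0:
        return "absent"
    if days_in_past_week < 3:       # less than 3 days
        return "occasional"
    if days_in_past_week <= 5:      # 3-5 days
        return "much of the time"
    return "almost all of the time"  # 6-7 days


# HYPOTHETICAL cell values for a 5-option item, one row per intensity level and
# one column per non-absent frequency. Consult the instrument for the real grid.
ILLUSTRATIVE_GRID: Dict[str, Tuple[int, int, int]] = {
    "mild": (1, 1, 2),
    "moderate": (1, 2, 3),
    "severe": (2, 3, 4),
    "very severe": (3, 4, 4),
}


def grid_score(intensity: str, frequency: str) -> int:
    """Return the item score at the intersection of intensity and frequency."""
    if frequency not in FREQUENCIES:
        raise ValueError(f"unknown frequency: {frequency!r}")
    if intensity == "absent" or frequency == "absent":
        return 0
    column = FREQUENCIES.index(frequency) - 1  # skip the 'absent' column
    return ILLUSTRATIVE_GRID[intensity][column]


if __name__ == "__main__":
    frequency = frequency_category(4)
    print(frequency, grid_score("moderate", frequency))
```

Keeping the two ratings as separate inputs is what later allows agreement on intensity and on frequency to be examined independently of the final item score.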
Structured interview guide
It is widely held that the item reliability of the HAMD is
compromised by administering the instrument in an
unstructured manner (Moberg et al., 2001). To assess
‘Depressed Mood,’ for example, the rater must ask a
series of questions to determine the degree of depressed
mood. However, the specific questions that must be
asked, and the conventions used to interpret and score
responses, vary from study to study as well as from rater
to rater. In addition, there is marked intersite incon-
sistency with respect to rater training and clinical
experience.
A number of studies have found that standardized
instructions for completing and scoring the HAMD
improve interrater reliability (Williams, 1988; Moberg
et al., 2001). Semistructured interview methods such as
the Structured Interview Guide for the Hamilton
Depression Rating Scale (SIGH-D) have been developed
(Williams, 1988), though both structured and unstruc-
tured interview techniques are in common use. A
semistructured interview guide provides a series of basic
questions to the rater, who is directed to add his or her
own questions when necessary to obtain additional
information from the patient or to clarify an ambiguous
response. Moberg et al. (2001) compared the test–retest
reliability of the SIGH-D with the test–retest reliability
of an unstructured HAMD on the same series of patients.
They found that the SIGH-D produced uniformly higher
item and total-scale score reliabilities than the unstructured
HAMD. Such semistructured interview guides have
been shown to facilitate training on the scale, and to
require no more interviewing time than unstructured
administration. No compelling reason to avoid the use of
a semistructured interview guide exists, although it is
unknown whether the use of such a guide translates into
better detection of treatment effects.
The GRID-HAMD includes a semistructured interview
guide, based on the SIGH-D (Williams, 1988). It begins
with a general contextual question, and then presents
questions for specific items, including items 18–21 for
optional use. The rater is instructed to ask the specified
questions exactly as they are written. Additional optional
questions are in parentheses and raters may add their own
questions as well to obtain more information. Questions
that assess frequency are standardized throughout the
interview guide. In addition, for those ratings in which
several symptoms are listed, the GRID-HAMD clarifies
whether one or all of the symptoms are required to merit
122 International Clinical Psychopharmacology 2008, Vol 23 No 3
a particular score. Like many other psychiatric semi-
structured interview guides, the GRID-HAMD should be
used ideally by individuals who have received adequate
training in the assessment of mood in a depressed
population and are familiar with the use of the GRID-
HAMD specifically.
Fig. 1 Example of GRID with structured interview guide and conventions for item 1: depressed mood.

1. Depressed mood
This item assesses feelings of sadness, hopelessness, helplessness, and worthlessness.
Note: This is not a global rating of depressive illness.

Interview guide
  What's your mood been like this past week (compared to when you feel OK)?
  Have you been feeling down or depressed? Sad or hopeless? Helpless? Worthless?
  (Can you describe what this feeling has been like for you? How bad is the feeling?)
  Does the feeling lift at all if something good happens?
  (Does it go away completely, or is it just less intense?)
  How long have you been feeling this way?
  How are you feeling about the future?
  Have you been crying at all? IF YES: How often?
  Frequency
  During the past week, how often did you feel this way?
  How much of the time did you feel this way?
  How many days in the past week?
  (Was it every day? How much of each day?)
  Notes:

Symptom intensity (vertical axis of the grid)
  Absent
  Mild: Feelings of sadness, discouragement, low self-esteem, pessimism
  Moderate: Clear nonverbal signs of sadness (such as tearfulness), feelings of
    hopelessness, helplessness, or worthlessness about some aspects of life
  Severe: Intense sadness, weeping, hopelessness about most aspects of life,
    feelings of complete helplessness or worthlessness
  Very severe: Extreme sadness, intractable hopelessness or helplessness

Frequency (horizontal axis of the grid)
  Absent: Not occurring or clinically insignificant
  Occasional: Infrequent; less than 3 days; up to 30% of the week
  Much of the time: Often; 3–5 days; 31%–75% of the week
  Almost all of the time: Persistent; 6–7 days; more than 75% of the week

[Grid of item scores, 0 to 4, at each intersection of intensity and frequency]

GRID-HAMD 1 ITEM SCORE:

Conventions
  This item should NOT be considered a global measure of depressive severity.
  Item 1 assesses one of several core symptoms of depression.
  Normal mood fluctuations without clinical significance should be rated "0."
  Rate depressed mood even if patient attributes mood to real life problems
  (e.g., depressed due to bad job, marital conflict).
  Some patients describe feelings of low mood without acknowledging "sadness" or
  "depression" (e.g., "down," "blah," "numb"). Rate as symptomatic.
  Nonverbal signs (e.g., slumped posture, infrequent eye contact, frowning, sad
  facial expression) are also considered in assessing severity.
  Do not rate angry, irritable, or anxious mood on this item.
Scoring conventions
Some investigators and pharmaceutical industry sponsors
have developed scoring conventions for the HAMD,
whereas others leave this to the rater’s clinical judgment.
As none of these approaches has been universally
accepted by researchers, investigational study staff are
routinely trained to different (and often contradictory)
sets of scoring conventions. This can lead to confusion
and disregard for a particular set of conventions or
methods. To ensure that users of the GRID-HAMD are
consistent, a brief set of rating conventions was devel-
oped. Specific conventions are listed in the instrument
alongside each item, making these easier to follow than if
they were presented in a separate document. General
guidelines for administering the instrument are pre-
sented in an introductory section.
Studies
Two pilot studies and a reliability/validity study have
been conducted using the GRID-HAMD. The first pilot
study focused on ease of use, and the second study
evaluated interrater and 4-week test–retest reliability.
Pilot study 1: usability
Raters at one Canadian and two US sites administered
the GRID-HAMD to a series of psychiatric outpatients in
their clinics. The intent was to assess the ease of use of
the new grid scoring system, the clarity of anchor
descriptions, and the accuracy of ratings compared with
a version of the HAMD that was customarily used at each
site, as well as the quality and ease of use of the new
structured interview guide and conventions. Data were
collected via a paper-and-pencil questionnaire.
Twelve raters were given a very brief introduction to the
GRID-HAMD in a telephone conference, and then
evaluated a total of 29 patients using the GRID-HAMD
with its structured interview guide and conventions.
Raters’ years of experience with the HAMD ranged from
1 to 26 years (mean = 7 years). Ten raters had previously
used the SIGH-D. None of the raters were involved in
the development of the GRID-HAMD.
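The rater figures quoted above can be checked against the per-rater values listed in Table 1. The sketch below is only an arithmetic check; the assignment of individual values to sites is an assumption based on the column order of Table 1 and the number of raters at each site.

```python
from statistics import mean

# Years of HAMD experience per rater, read from Table 1.
# The split across sites is an assumption inferred from the table layout.
experience_years = {
    "Site A": [2, 3, 4, 13, 26],
    "Site B": [1, 2, 3],
    "Site C": [3, 4, 4, 20],
}

# Raters at each site who had used the SIGH-D before (Table 1: 5/5, 1/3, 4/4).
used_sigh_d = {"Site A": 5, "Site B": 1, "Site C": 4}

all_years = [years for site in experience_years.values() for years in site]
n_raters = len(all_years)
mean_years = mean(all_years)
n_sigh_d = sum(used_sigh_d.values())

if __name__ == "__main__":
    print(f"raters: {n_raters}")
    print(f"experience range: {min(all_years)} to {max(all_years)} years")
    print(f"mean experience: {mean_years:.1f} years")
    print(f"previously used SIGH-D: {n_sigh_d}")
```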
Table 1 summarizes the site and rater characteristics.
Most (75%) of the raters found the GRID-HAMD ‘very
easy’ or ‘easy’ to use; no one rated it ‘very difficult.’ The
anchor descriptions were found to be ‘much more clear’
or ‘a little more clear’ than other versions of the HAMD
used at the sites (11/12 raters). The GRID-HAMD was
judged to result in ‘much more’ or ‘a little more’ accurate
ratings by 10 of the 12 raters; the other two thought there
was no difference from their usual method of using the
scale. No one rated the GRID-HAMD as less accurate
than their usual way of rating. All (12/12) raters indicated
that they thought the GRID conventions were ‘much
better’ or ‘a little better’ than the other sets of
conventions used at their site. Eight of the 12 raters
thought the structured interview guide was easy to use;
three were neutral on the question, and one rater
disagreed.
Pilot study 2: reliability
Five clinical investigative sites taking part in a large
open-label multicenter study of a novel antidepressant
agreed to assess the reliability of the GRID-HAMD
(investigators participating in the GRID-HAMD reliability
component of the main study, sponsored by Organon Inc.:
Alan Feiger, MD, Jon Heiser, MD, James Ferguson, MD,
Charles Merideth, MD, and John Carman, MD). Trial
participants were males and females between 18 and 70
years of age (N= 34) with moderate-to-severe major
depressive disorder, who were deemed appropriate for
long-term antidepressant therapy. Raters were briefly
oriented to the GRID-HAMD by teleconference (60 to
90 min) with members of the DRSST core team, as in the
usability study. Participants were administered the
GRID-HAMD twice at baseline by two different raters
(interrater reliability), and twice at week 4 (sensitivity to
change), by the same two raters whenever possible.
Data from participants assessed at baseline and week 4
are presented in Tables 2–5. Twenty raters across the
sites provided ratings; data were combined across sites
and rater pairs. The baseline HAMD-17 total scores had a
mean of 23.2 and an SD of 5.0. Table 2 presents the
intraclass correlations for the individual items and the
total score for baseline, week 4, and both visits combined.
A random effects intraclass correlation (ICC) model was
used. The ICCs were high for most items, and for the
total sample were improved from the original SIGH-D in
13 of 17 items, including item 1 (depressed mood) and
the total score.
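The comparison with the 1988 SIGH-D values can be tallied directly from Table 2. The sketch below assumes that "improved" means the combined baseline and week 4 GRID-HAMD ICC is strictly greater than the SIGH-D ICC in the same row; that rule is an assumption, as the text does not state it. Insight is left out because Table 2 reports no combined or SIGH-D value for it.

```python
# (GRID-HAMD baseline and week 4 ICC, SIGH-D 1988 ICC), copied from Table 2.
TABLE_2 = {
    "Depressed mood": (0.85, 0.80),
    "Guilt": (0.72, 0.63),
    "Suicide": (0.81, 0.64),
    "Insomnia early": (0.82, 0.80),
    "Insomnia middle": (0.71, 0.62),
    "Insomnia late": (0.73, 0.30),
    "Work and activities": (0.89, 0.54),
    "Retardation": (0.33, 0.32),
    "Agitation": (0.38, 0.11),
    "Anxiety psychic": (0.65, 0.78),
    "Anxiety somatic": (0.60, 0.66),
    "Loss of appetite": (0.75, 0.59),
    "Somatic symptoms": (0.60, 0.61),
    "Sexual interest": (0.94, 0.70),
    "Hypochondriasis": (0.71, 0.55),
    "Loss of weight": (0.43, 0.58),
}
TOTAL_SCORE = (0.89, 0.81)

# Assumed rule: an item counts as improved when the GRID-HAMD ICC is strictly higher.
improved_items = [item for item, (grid, sigh_d) in TABLE_2.items() if grid > sigh_d]
not_improved_items = [item for item in TABLE_2 if item not in improved_items]
total_improved = TOTAL_SCORE[0] > TOTAL_SCORE[1]

if __name__ == "__main__":
    print(f"items with higher GRID-HAMD ICC: {len(improved_items)}")
    print(f"items without: {', '.join(not_improved_items)}")
    print(f"total score higher: {total_improved}")
```

Under this rule the tally is 12 individual items plus the total score, that is, 13 rows of the table.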
Table 1 Pilot study no. 1: rater and site characteristics
                                        Site A                 Site B     Site C
Number of raters                        5                      3          4
Number of patients                      15                     12         12
Site type                               Nonacademic research   Academic   Nonacademic research
Experience with the HAMD (years),       2, 3, 4, 13, 26        1, 2, 3    3, 4, 4, 20
  one value per rater
Used SIGH-D previously?                 5/5                    1/3        4/4
Average administration time (minutes)   14                     22         21
HAMD, Hamilton Depression Rating Scale; SIGH-D, Structured Interview Guide
for the Hamilton Depression Rating Scale.

Interrater reliability of the GRID-HAMD
The structure of the GRID-HAMD, by providing
separate ratings of intensity and frequency for each item,
facilitates detailed analysis of agreement on these two
elements. Results are summarized here; additional details
are provided elsewhere (Engelhardt et al., 2003).
The percentage of ratings in which raters entered the
same item score was calculated for all pairs of ratings of
participants at baseline (Table 4, Item Score column). For
items with an exact match between raters on item
scores, the percentage of ratings for which the GRID
coordinates matched exactly was calculated (Table 4,
GRID Coordinate column). For example, if two raters
both scored a 3 on item 1, then they were counted as
using the GRID consistently if they had also recorded
exactly the same intensity and frequency designation.
Chance agreement on GRID coordinates varies from 25 to
50%, depending on the number of possible responses for
each item.
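The two agreement figures can be computed as follows. This is a minimal sketch of the calculation as described above, not the authors' analysis code; the data layout (an item score plus an intensity and frequency coordinate for each rater) and the example ratings are hypothetical.

```python
from typing import List, Optional, Tuple

Coordinate = Tuple[str, str]      # (intensity, frequency)
Rating = Tuple[int, Coordinate]   # (item score, GRID coordinate)


def percent_agreement(pairs: List[Tuple[Rating, Rating]]) -> Tuple[float, Optional[float]]:
    """Return (item-score agreement %, GRID-coordinate agreement %).

    Coordinate agreement is computed only over the pairs whose item scores
    match exactly; it is None when no pair matches on the item score.
    """
    if not pairs:
        raise ValueError("at least one rating pair is required")
    same_score = [(a, b) for a, b in pairs if a[0] == b[0]]
    item_pct = 100.0 * len(same_score) / len(pairs)
    if not same_score:
        return item_pct, None
    same_cell = sum(1 for a, b in same_score if a[1] == b[1])
    return item_pct, 100.0 * same_cell / len(same_score)


# Hypothetical ratings of one item by two raters for four participants.
example_pairs = [
    ((3, ("severe", "much of the time")), (3, ("severe", "much of the time"))),
    ((3, ("moderate", "almost all of the time")), (3, ("severe", "much of the time"))),
    ((2, ("moderate", "much of the time")), (1, ("mild", "much of the time"))),
    ((1, ("mild", "occasional")), (1, ("mild", "occasional"))),
]
item_pct, coord_pct = percent_agreement(example_pairs)

if __name__ == "__main__":
    print(f"item score agreement: {item_pct:.1f}%")
    print(f"GRID coordinate agreement: {coord_pct:.1f}%")
```

The second example pair shows why the coordinate check is informative: both raters arrive at a score of 3, but through different combinations of intensity and frequency.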
Sensitivity of the GRID to change
For participants who were rated twice at baseline and
twice at week 4 by the same rater pairs, interrater
reliability (ICC) for change scores was examined
(Table 5). Change scores were calculated as the baseline
score minus the week 4 score. Agreement on change
scores varied across items from 0.97 (sexual interest) to
–0.03 (insight) and 0.18 (loss of weight). For all but
these last two items, the interrater correlation of change
scores was 0.56 or above. For the total score, the
correlation was 0.91.
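A sketch of the change-score analysis follows. The report states only that a random effects ICC model was used, so the one-way random-effects, single-rating form, ICC = (MSB - MSW) / (MSB + (k - 1) MSW), is an assumption here, and the scores are hypothetical.

```python
from typing import List, Sequence


def change_scores(baseline: Sequence[float], week4: Sequence[float]) -> List[float]:
    """Change score for each participant: baseline score minus week 4 score."""
    if len(baseline) != len(week4):
        raise ValueError("baseline and week 4 must cover the same participants")
    return [b - w for b, w in zip(baseline, week4)]


def icc_oneway(ratings: Sequence[Sequence[float]]) -> float:
    """One-way random-effects, single-rating ICC.

    `ratings` holds one row per participant and one column per rater.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 participants and 2 ratings per participant")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_between - ms_within) / denominator


# Hypothetical total scores for five participants rated by the same rater pair.
rater1_change = change_scores([24, 20, 27, 22, 19], [12, 15, 20, 10, 16])
rater2_change = change_scores([23, 21, 26, 22, 20], [13, 15, 18, 11, 16])
icc_change = icc_oneway(list(zip(rater1_change, rater2_change)))

if __name__ == "__main__":
    print(f"ICC of change scores: {icc_change:.2f}")
```

Items on which few participants change, or on which nearly everyone scores the same value, leave little between-participant variance, which is one way an ICC near zero can arise even when raters rarely disagree.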
Table 2 Pilot study no. 2: reliability of the GRID-HAMD items and total score
                      GRID-HAMD baseline       GRID-HAMD week 4         GRID-HAMD baseline   SIGH-D intraclass
                      intraclass correlation,  intraclass correlation,  and week 4,          correlations 1988,
                      N=34                     N=31                     N=65                 N=23
Depressed mood        0.78                     0.76                     0.85                 0.80
Guilt                 0.55                     0.66                     0.72                 0.63
Suicide               0.80                     0.76                     0.81                 0.64
Insomnia early        0.83                     0.79                     0.82                 0.80
Insomnia middle       0.71                     0.68                     0.71                 0.62
Insomnia late         0.75                     0.66                     0.73                 0.30
Work and activities   0.64                     0.83                     0.89                 0.54
Retardation           0.28                     0.22                     0.33                 0.32
Agitation             0.49                     0.27                     0.38                 0.11
Anxiety psychic       0.52                     0.61                     0.65                 0.78
Anxiety somatic       0.55                     0.55                     0.60                 0.66
Loss of appetite      0.77                     0.69                     0.75                 0.59
Somatic symptoms      0.37                     0.59                     0.60                 0.61
Sexual interest       0.92                     0.94                     0.94                 0.70
Hypochondriasis       0.68                     0.70                     0.71                 0.55
Loss of weight        0.63                     0.23                     0.43                 0.58
Insight               –0.03                    —                        —                    —
Total HAMD-17         0.75                     0.81                     0.89                 0.81
Highest value in each row is in bold font.
HAMD, Hamilton Depression Rating Scale.

Table 3 Pilot study no. 2: mean score differences between raters
              Baseline (N=34)   Week 4 (N=31)
Rater 1       23.29             13.10
Rater 2       23.15             11.87
Difference    0.15              1.23
T             0.239             1.779
P             0.813             0.085

Table 4 Pilot study no. 2: percent agreement on individual item
scores and on GRID scoring coordinates (baseline and week 4
combined) N=65
                      Percent agreement
GRID-HAMD item        Item score [a]   GRID coordinate [b]
Depressed mood        70.8             67.4
Guilt                 67.7             61.4
Suicide               80.0             75.0
Insomnia early        87.7             70.2
Insomnia middle       73.8             72.9
Insomnia late         75.4             63.3
Work and activities   66.2             62.8
Retardation           63.1             — [c]
Agitation             63.1             — [c]
Anxiety psychic       61.5             75.0
Anxiety somatic       60.0             78.1
Loss of appetite      80.0             73.17
Somatic symptoms      64.6             66.6
Sexual interest       89.2             — [c]
Hypochondriasis       69.2             66.4
Loss of weight        81.5             — [c]
Insight               95.2             — [c]
[a] Percent agreement is the percent exact match on the item score.
[b] Percent agreement is the percent exact match on the GRID coordinate for those
who had an exact match on the item score.
[c] Item not presented as a GRID.
HAMD, Hamilton Depression Rating Scale.

Reliability and validity study
Twenty-nine raters from 10 US investigative sites agreed
to participate in a study of the validity of the GRID-HAMD.
A total of 150 patients (15 per site) with major
depressive disorder were administered a version of the
HAMD twice, by two independent interviewers blind to
each other’s scores. Scales were administered in
counterbalanced order on the same day. Four cells were present:
patients received either a GRID-HAMD and a SIGH-D
(n = 60), two GRID-HAMDs (n = 30), two SIGH-Ds
(n = 30), or two HAMDs using the original (Guy) version
of the scale, without a structured interview guide
(n = 30). In this study, no training was provided for any
of the scale versions and each rater was assigned to
administer only one of the three scales, although
undoubtedly, most of the raters had used the Guy version
and the SIGH-D in previous studies.
Interrater reliability of the GRID-HAMD was compared
with that of the SIGH-D and of the unstructured Guy
HAMD. In addition, concurrent validity of the GRID-HAMD
was estimated by comparing total and item scores
of the GRID-HAMD with the ‘gold standard’ Structured
Interview Guide for the HAMD (SIGH-D; Williams,
1988). The raters had an average of 20 years’ clinical
experience, and 13 years’ experience in administering a
version of the HAMD. Fifty-two percent were MDs, with
the remainder having BS/BA or RN degrees.
Results of reliability and validity study
The interrater reliability for both the GRID-HAMD and
the SIGH-D was high (ICC = 0.95 and 0.94, respectively),
and the two did not differ significantly from each other,
Z = 0.73, P = 0.47. The ICC for the Guy HAMD
(ICC = 0.78) was significantly lower than the ICC for
the GRID-HAMD, Z = –3.4461, P = 0.001, and the
SIGH-D, Z = –4.0889, P < 0.0001. Internal consistency
reliability (coefficient α) for the GRID-HAMD was 0.78;
for the SIGH-D it was 0.71, and for the Guy HAMD it
was 0.64.
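The reported P values can be related to the reported Z statistics with a short calculation. The sketch below assumes a two-sided test against the standard normal distribution; the report does not describe how the Z statistics themselves were derived, so only the conversion from Z to P is illustrated.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal test statistic."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Z statistics as reported in the text.
reported_z = {
    "GRID-HAMD vs. SIGH-D": 0.73,
    "Guy HAMD vs. GRID-HAMD": -3.4461,
    "Guy HAMD vs. SIGH-D": -4.0889,
}

if __name__ == "__main__":
    for comparison, z in reported_z.items():
        print(f"{comparison}: Z = {z}, two-sided P = {two_sided_p(z):.5f}")
```

Under that assumption the first comparison gives P of about 0.47, the second a value below 0.001 that rounds to 0.001 at three decimals, and the third a value below 0.0001, in line with the figures quoted.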
The item correlations for the GRID, SIGH-D, and Guy
HAMD are presented in Table 6. Interrater reliability on
the item level is good for both the GRID and SIGH-D,
and greater than the Guy version on all items (except for
item 17, Insight, which was poor for all three versions).
Moderately high correlations between the GRID and
SIGH-D for all items were present (except insight).
GRID-HAMD versus SIGH-D
The mean score on the GRID-HAMD (22.23; SD = 6.82)
was not significantly different from the mean score
obtained on the same patients with the SIGH-D (22.03;
SD = 5.96), t(59) = 1.692, P= 0.693. The two scales
were highly correlated (ICC = 0.81, P< 0.001).
GRID frequency versus intensity dimensions
One of the advantages of rating frequency and intensity
separately on the GRID-HAMD is that it enables an
examination of the utility and incremental value of each
individual dimension. Comparisons of ICC by item and
total score using only the frequency dimension, only the
intensity dimension, and both dimensions are presented
in Table 7. The item ICCs were greatest on all but three
items when both frequency and intensity were used to
determine the score; however, the confidence intervals
for frequency and intensity overlapped (frequency
ICC = 0.913, CI: 0.829, 0.957; intensity ICC = 0.896,
CI: 0.788, 0.951), suggesting that these may not be
statistically different. In addition, the mean score
differences between the GRID and SIGH-D were
smallest when the combination of frequency and
intensity was used (Table 8).
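The overlap of the two intervals can be checked mechanically. The sketch below reads the pair of values 0.788 and 0.951 quoted for intensity as the bounds of its confidence interval, which is an assumption based on the parallel wording for frequency. Overlap is only an informal indication: overlapping intervals do not by themselves establish that two estimates are equal.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]


def overlap(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the shared range of two closed intervals, or None if they are disjoint."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return (low, high) if low <= high else None


frequency_ci: Interval = (0.829, 0.957)  # frequency-only ICC = 0.913
intensity_ci: Interval = (0.788, 0.951)  # intensity-only ICC = 0.896 (bounds assumed)

shared = overlap(frequency_ci, intensity_ci)

if __name__ == "__main__":
    print(f"shared range: {shared}")
```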
Discussion
The GRID-HAMD offers a number of potential advan-
tages over other versions of the HAMD. First, the scale
itself, the conventions, and the interview guide reflect a
broad consensus of researchers and clinicians in academia
and the pharmaceutical industry. Second, scoring was
facilitated to allow raters to make separate determina-
tions of intensity and frequency to arrive at an overall
Table 6 Pilot study no. 3: item correlations (ICC) for the GRID, SIGH-D, and Guy versions of the HAMD and ICC between GRID and SIGH-D

Item                  GRID vs. GRID   SIGH-D vs. SIGH-D   Guy vs. Guy   GRID vs. SIGH-D
                      (n = 31)        (n = 27)            (n = 27)      (n = 60)
Depressed mood        0.92            0.87                0.43          0.69
Guilt                 0.81            0.85                0.78          0.58
Suicide               0.77            0.90                0.56          0.77
Insomnia early        0.92            0.73                0.63          0.75
Insomnia middle       0.84            0.87                0.69          0.55
Insomnia late         0.79            0.82                0.67          0.67
Work and activities   0.89            0.84                0.51          0.60
Retardation           0.77            0.69                0.21          0.53
Agitation             0.48            0.67                0.06          0.47
Anxiety psychic       0.73            0.79                0.13          0.51
Anxiety somatic       0.62            0.90                0.41          0.66
Loss of appetite      0.95            0.63                0.80          0.68
Somatic symptoms      0.78            0.74                0.26          0.56
Sexual interest       0.78            0.74                0.58          0.79
Hypochondriasis       0.88            0.90                0.79          0.54
Loss of weight        0.77            0.89                0.79          0.61
Insight               –0.00           0.49                0.06          –0.09
Total score           0.94            0.95                0.78          0.81

HAMD, Hamilton Depression Rating Scale; ICC, intraclass correlation; SIGH-D, Structured Interview Guide for the Hamilton Depression Rating Scale.
Table 5 Pilot study no. 2: reliability of GRID-HAMD change scores

Reliability of GRID-HAMD change scores: interrater ICC

GRID-HAMD item        Change from baseline (N = 20)
Depressed mood        0.91
Guilt                 0.59
Suicide               0.69
Insomnia early        0.77
Insomnia middle       0.70
Insomnia late         0.63
Work and activities   0.76
Retardation           0.76
Agitation             0.56
Anxiety psychic       0.62
Anxiety somatic       0.68
Loss of appetite      0.62
Somatic symptoms      0.82
Sexual interest       0.97
Hypochondriasis       0.80
Loss of weight        0.18
Insight               –0.03
Total score           0.91

HAMD, Hamilton Depression Rating Scale; ICC, intraclass correlation.
126 International Clinical Psychopharmacology 2008, Vol 23 No 3
Copyright © Lippincott Williams & Wilkins. Unauthorized reproduction of this article is prohibited.
severity score. This scoring system should improve
reliability, allow detailed analyses of specific areas
of disagreement between raters, and provide useful
information regarding the validity of severity scores based
on a particular composite of frequency and intensity.
Third, the increased specificity of the instructions and
item definitions, and the integration of the conventions
into the instrument itself facilitate training on the
instrument.
The reliability study indicates that, even with a large
number of raters new to the instrument and with minimal
training, overall reliability is excellent, and agreement on
total HAMD score and most of the individual items is
improved. Reliability was established across different
indices for both baseline and assessments made after
4 weeks of treatment. Considering the number of raters
involved in the study, the level of reliability achieved is
impressive, given that larger numbers of raters make
reliability harder to obtain. In addition, no reliability
training was done between raters at the different sites.
Despite this, reliabilities were comparable with those
found by Moberg et al. (2001), whose raters underwent
extensive reliability training. Thus, the GRID-HAMD
should help improve reliability even in a cohort of raters
not calibrated to each other. Raters used the GRID-HAMD
consistently, indicating its clinical utility and
value as an aid to reliability of ratings. That said, it
should be noted that the GRID-HAMD does not obviate
the need for reliability training, as the degree of interrater
reliability is directly related to the quantity and quality of
training provided. Raters on the whole found the GRID-HAMD
easy to use and the conventions
improved over earlier versions.
The concurrent validity of the GRID was demonstrated
by high item and total score correlations with the SIGH-
D, and equivalent mean scores were obtained when both
scales were administered to the same patients. Both the
GRID-HAMD and the SIGH-D, which are semistruc-
tured interviews, demonstrated excellent interrater
reliability compared with the unstructured Guy version,
which had significantly lower interrater agreement than
either of the other two scales. The unique features of the
GRID-HAMD, that is, a standardized scoring system, and
conventions and an interview guide that are integrated
into the instrument, may provide specific benefits for
raters who have less clinical and assessment experience
than the highly experienced raters in this study.
Several studies have demonstrated a tendency (conscious
or unconscious) on the part of raters to assign a threshold
score to patients to justify their inclusion in a particular
study (DeBrota et al., 1999; Feltner et al., 2001; Kobak
et al., 2005). For example, if a study requires a HAMD score
of 18, raters may be more likely to give that score, even
if not quite justified. The increased specificity of
the GRID structure may force raters to be more accurate
in assigning item scores, and the GRID allows a
more specific ‘audit trail’ of why a particular score was
assigned. In addition, as investigational compounds
become more targeted to individual symptoms and
specific symptom complexes, item reliability increases
in importance.
Accumulating evidence indicates that interrater reliability is
directly related to the amount and type of training
performed (Kobak et al., 2003). Thus, if one rater
receives 3 weeks of intense training, including observation
of applied clinical skills, and uses the SIGH-D, and
another rater receives almost no training and uses the
GRID-HAMD, any resulting reliability superiority of the
SIGH-D may be because of training rather than the
instrument itself. Moberg et al. (2001) achieved very high
interrater agreement levels using the SIGH-D, but it
should be noted that their raters trained until they
achieved agreement ‘in excess of’ 0.91 before they began
the study. The raters in this study attained good
reliabilities with only minimal training. Presumably,
with better rating and therefore better reliability, there
would be better signal detection in a clinical trial. Muller
has modeled statistically the significant impact that
reliability has on signal detection (Muller and Szegedi,
2002).
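The link between reliability and signal detection can be illustrated with a standard classical test theory approximation. This is not the model of Muller and Szegedi (2002). It is a minimal sketch that assumes the observed standardized effect equals the true effect multiplied by the square root of the reliability, and that uses the usual normal-approximation sample size formula for comparing two group means. The function name `n_per_group`, the true effect size of 0.4, and the reliability values are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(true_effect, reliability, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided, two-group comparison.

    Assumes measurement error attenuates the standardized effect by
    sqrt(reliability) (classical test theory), then applies the
    normal-approximation formula n = 2 * (z_alpha + z_beta)^2 / d^2.
    """
    observed_effect = true_effect * sqrt(reliability)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / observed_effect ** 2)


# Hypothetical true drug-placebo effect size of 0.4 at several reliabilities.
for reliability in (1.0, 0.9, 0.7, 0.5):
    print(reliability, n_per_group(0.4, reliability))
```

Under these assumptions the required sample size scales with the inverse of the reliability, so halving reliability roughly doubles the number of patients needed to detect the same true effect.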
Table 7 Item and total ICC by frequency and intensity dimensions

Item                  GRID frequency   GRID frequency   GRID intensity
                      and intensity    only             only
Depressed mood        0.92             0.94             0.77
Guilt                 0.81             0.81             0.60
Suicide               0.77             0.62             0.73
Insomnia early        0.92             0.90             0.82
Insomnia middle       0.84             0.85             0.84
Insomnia late         0.79             0.76             0.79
Work and activities   0.89             0.75             0.76
Anxiety psychic       0.73             0.69             0.50
Anxiety somatic       0.62             0.56             0.42
Loss of appetite      0.95             0.88             0.94
Somatic symptoms      0.78             0.80             0.61
Hypochondriasis       0.88             0.87             0.77
Insight               0.00             0.00             0.00
Total score           0.93             0.91             0.89

GRID scoring not used for items 8, 9, 14, 16, and 17. Highest ICC bolded.
ICC, intraclass correlation.
Table 8 Mean score differences between the GRID and SIGH-D using frequency alone, severity alone, and frequency and severity combined

              GRID frequency   GRID severity   GRID frequency
              only             only            and severity
SIGH-D        21.94            21.95           21.95
GRID          24.55            20.74           22.03
Difference    –2.60            1.25            0.20
t             –4.60            2.24            0.40
P             0.00             0.029           0.693

SIGH-D, Structured Interview Guide for the Hamilton Depression Rating Scale.
Recently, interview quality, that is, the rater’s interview-
ing skills, has been shown to be related to signal
detection (Kobak et al., 2005). The GRID should help
improve interview quality by providing a structure for the
interview approach, and incorporating many of the
necessary follow-up probes associated with good clinical
technique.
Throughout the field, much energy is being directed
toward addressing these issues, the GRID-HAMD being
one approach. The main value of the new GRID-HAMD
is that it represents a consensus of the stakeholders.
Across a sample of academic experts and industry
representatives, all were able to agree on the structure
of the new instrument, its structured interview guide,
and its conventions. What remains is a test of whether use
of the GRID-HAMD increases the likelihood of finding
significant drug–placebo differences compared with use
of other versions of the HAMD.
Summary
The GRID-HAMD was designed to both clarify and
standardize administration and scoring of the HAMD in
clinical practice and research. Item descriptions have
been modified to enhance the reliability of the scale and
its relevance to depressed outpatients. In preliminary
testing, the GRID-HAMD showed good to very good
interrater item and overall score reliability. Validity of the
GRID-HAMD was also demonstrated by high correlations
and equivalent mean scores to the SIGH-D, the current
gold standard in the field.
Acknowledgements
The authors thank Michael Gibertini, PhD,
who facilitated the incorporation of the GRID-HAMD
in a study being conducted by Organon Inc., and
Margaret Rothman, PhD, who participated in the initial
work of the DRSST. They also thank the hundreds of
clinicians who reviewed drafts of the GRID and sent
them helpful comments. A nonprofit organization, the
International Society for CNS Drug Development
(ISCDD), was formed to bring together representatives
of pharmaceutical companies, as well as academicians
and government scientists. This new group
absorbed the DRSST, providing support and funding
for its continuing work. In an effort to facilitate
widespread use of the instrument, a web site has
been developed (http://www.iscdd.org) to offer the
GRID-HAMD free of charge to be printed or stored
electronically.
The DRSST and the DID Project are funded by the
International Society for CNS Drug Development. The
GRID-HAMD may be downloaded, free of charge, at
www.iscdd.org.
References
Bagby RM, Ryder AG, Schuller DR, Marshall MB (2004). The Hamilton
Depression Rating Scale: has the gold standard become a lead weight?
Am J Psychiatry 161:2163–2177.
Bech P, Kastrup M, Rafaelsen OJ (1986). Mini-compendium of rating scales for
states of anxiety, depression, mania, schizophrenia with corresponding DSM-III
syndromes. Acta Psychiatr Scand 73:5–37.
Chouinard G, Miller R (1999). A rating scale for psychotic symptoms (RSPS) part
I: theoretical principles and subscale 1: perception symptoms (illusions and
hallucinations). Schizophr Res 38:101–122.
DeBrota D, Demitrack M, Landin R, Kobak KA, Greist JH, Potter W (1999). A
Comparison Between Interactive Voice Response System-Administered
HAM-D and Clinician-Administered HAM-D in Patients with Major
Depressive Episode. Paper presented at the National Institute of Mental
Health, New Clinical Drug Evaluation Unit, 39th Annual Meeting, Boca Raton,
Florida.
Engelhardt N, Kalali A, Gibertini M, Kobak K, Williams J, Evans K, et al. (2003).
The GRID-HAMD: a reliability study in patients with major depression. Paper
presented at the National Institute of Mental Health, New Clinical Drug
Evaluation Unit, 43rd Annual Meeting, Boca Raton, Florida.
Evans KR, Sills T, DeBrota DJ, Gelwicks S, Engelhardt N, Santor D (2004). An
Item Response analysis of the Hamilton Depression Rating Scale using
shared data from two pharmaceutical companies. J Psychiatr Res 38:
275–284.
Faries D, Herrera J, Rayamajhi J, DeBrota D, Demitrack M, Potter WZ (2000). The
responsiveness of the Hamilton Depression Rating Scale. J Psychiatr Res
34:3–10.
Feltner DE, Kobak KA, Crockatt J, Haber H, Kavoussi R, Pande A, et al. (2001).
Interactive Voice Response (IVR) for Patient Screening of Anxiety in a
Clinical Drug Trial. Paper presented at the National Institute of Mental Health,
New Clinical Drug Evaluation Unit, 41st Annual Meeting, Phoenix, Arizona.
Gelenberg AJ, Wojcik JD, Falk WE, Baldessarini RJ, Zeisel SH, Schoenfeld D,
et al. (1990). Tyrosine for depression: a double-blind trial. J Affect Disord
19:125–132.
Gibbons RD, Clark DC, Kupfer DJ (1993). Exactly what does the Hamilton
Depression Rating Scale measure? J Psychiatr Res 27:259–273.
Hamilton M (1960). A rating scale for depression. J Neurol Neurosurg Psychiatry
23:56–62.
Hamilton M (1967). Development of a rating scale for primary depressive illness.
Br J Soc Clin Psychiatry 6:278–296.
Kobak KA, Reynolds WM, Rosenfeld R, Greist JH (1990). Development and
validation of a computer-administered version of the Hamilton Depression
Rating Scale. Psychol Assess 2:56–63.
Kobak KA, Mundt JC, Greist JH, Katzelnick DJ, Jefferson JW (2000). Computer
assessment of depression: automating the Hamilton Depression Rating
Scale. Drug Inf J 34:145–156.
Kobak KA, Lipsitz JD, Feiger A (2003). Development of a standardized training
program for the Hamilton Depression Scale using internet-based technolo-
gies: results from a pilot study. J Psychiatr Res 37:509–515.
Kobak KA, Feiger AD, Lipsitz JD (2005). Interview quality and signal detection in
clinical trials. Am J Psychiatry 162:628.
Kobak KA, Taylor LV, Warner G, Futterer R (2005). St. John’s wort vs. placebo
in social phobia: results from a placebo-controlled pilot study. J Clin
Psychopharmacol 25:51–58.
Maier W, Philipp M (1985). Improving the assessment of severity of depressive
states: a reduction of the Hamilton Depression Scale. Pharmacopsychiatry
18:114–115.
Moberg PJ, Lazarus LW, Mesholam RI, Bilker W, Chuy IL, Neyman I, et al. (2001).
Comparison of the standard and structured interview guide for the Hamilton
Depression Rating Scale in depressed geriatric inpatients. Am J Geriatr
Psychiatry 9:35–40.
Muller MJ, Szegedi A (2002). Effects of interrater reliability of psychopathologic
assessment on power and sample size calculations in clinical trials. J Clin
Psychopharmacol 22:318–325.
Potts MK, Daniels M, Burnam MA, Wells KB (1990). A structured interview
version of the Hamilton Depression Rating Scale: evidence of reliability and
versatility of administration. J Psychiatr Res 24:335–350.
Reynolds WM, Kobak KA (1995). Reliability and validity of the Hamilton
Depression Inventory: a paper-and-pencil version of the Hamilton Depression
Rating Scale clinical interview. Psychol Assess 7:472–483.
Santor DA, Coyne JC (2001). Examining symptom expression as a function of
symptom severity: item performance on the Hamilton Rating Scale for
Depression. Psychol Assess 13:127–139.
Thase ME (1984). A Hamilton subscale for endogenomorphic depression.
Hillside J Clin Psychiatry 6:57–68.
Whisman MA, Strosahl K, Fruzzetti AE, Schmaling KB, Jacobson NS, Miller DM
(1999). A structured interview version of the Hamilton Rating Scale for
Depression: reliability and validity. Psychol Assess 1:238–241.
Williams JBW (1988). A structured interview guide for the Hamilton Depression
Rating Scale. Arch Gen Psychiatry 45:742–747.
Williams JBW (2001). Standardizing the Hamilton Depression Rating Scale: past,
present, and future. Eur Arch Psychiatry Clin Neurosci 251:II/6–II/12.
Zimmerman M, Posternak MA, Chelminski I (2005). Is it time to replace
the Hamilton Depression Rating Scale as the primary outcome measure
in treatment studies of depression? J Clin Psychopharmacol 25:
105–110.
Zitman FG, Mennen MF, Griez E, Hooijer C (1990). The different versions
of the Hamilton Depression Rating Scale. Psychopharmacol Ser 9:
28–34.