Journal of Nursing Measurement, Volume 28, Number 1, 2020
A Practical Guide for Item Generation in Measure Development: Insights From the Development of a Patient-Reported Experience Measure of Compassion

Shane Sinclair, PhD
Faculty of Nursing, University of Calgary, Calgary, AB, Canada
Compassion Research Lab, Faculty of Nursing, University of Calgary, Calgary, AB, Canada
Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada

Priya Jaggi, MSc
Faculty of Nursing, University of Calgary, Calgary, AB, Canada
Compassion Research Lab, Faculty of Nursing, University of Calgary, Calgary, AB, Canada

Thomas F. Hack, PhD
College of Nursing, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
Research Institute in Oncology and Hematology, CancerCare Manitoba, Winnipeg, MB, Canada
Psychosocial Oncology and Cancer Nursing Research, IH Asper Clinical Research Institute, Winnipeg, MB, Canada

Susan E. McClement, PhD
College of Nursing, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
Research Institute in Oncology and Hematology, CancerCare Manitoba, Winnipeg, MB, Canada

Lena Cuthbertson, MEd
Office of Patient-Centred Measurement, British Columbia Ministry of Health, Vancouver, BC, Canada
Background and Purpose: Although various measure development guidelines exist, practical guidance on how to systematically generate items is nascent. This article provides practical guidance on item generation in measure development and the use of a Table of Specifications (TOS) in this process. Methods: In addition to a review of the literature, the item generation process within an ongoing study to develop a valid and reliable patient-reported measure of compassion is provided. Results: Consensus on an initial pool of 109 items and their response scale was achieved with the aid of a TOS. Conclusions: Dynamic, experiential, and relational care constructs such as compassion lie at the heart of nursing. Practical guidance on item generation is needed to allow nurses to identify, measure, and improve compassion in research and practice.

Keywords: measure development; item generation; table of specifications; psychometrics; instrument construction; content validity

© 2020 Springer Publishing Company. http://dx.doi.org/10.1891/JNM-D-19-00020
Item generation is an imperative step in the developmental stage of instrument construction (DeVellis, 2003; Grant & Davis, 1997; Lynn, 1986; Morgado, Meireles, Neves, Amaral, & Ferreira, 2017; Rattray & Jones, 2007). When performed well, item generation ensures that an instrument's items (i.e., questions; Deshpande, Rajan, Sudeepthi, & Nazir, 2011; US Food and Drug Administration, Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research, Center for Devices and Radiological Health, 2009) accurately and comprehensively cover the construct (i.e., topic of interest; Avila et al., 2015; Deshpande et al., 2011; FDA Center for Drug Evaluation and Research et al., 2009; Appendix A). Despite its reputed importance, there is little by way of practical guidance and recommended best practices to enhance rigor in this critical step of developing patient-reported measures in healthcare (Nichols & Sugrue, 1999). As a result, the validity of patient-reported measures utilized by nurse researchers is compromised, as is their clinical utility and reliability in nursing practice.
Developing a robust measure involves a combination of both inductive and deductive processes (Cheng & Clark, 2017; Morgado et al., 2017). The iterative nature of item generation begins with domain identification, an inductive process that involves establishing a theoretical foundation of the construct of interest, based on the literature, subject matter expert (SME) opinion, or qualitative research with the population of interest (Brod & Tesler, 2009; Cheng & Clark, 2017; DeVellis, 2003; Netemeyer, Bearden, & Sharma, 2003). In patient-reported measure development, a domain is considered to be a sub-concept within the larger, overarching construct that is being measured (Deshpande et al., 2011; FDA Center for Drug Evaluation and Research et al., 2009; Appendix A). Once a theoretical foundation has been established inductively, it becomes the basis for item generation through a deductive process, which involves utilizing the derived theoretical foundation as a guiding source to generate an initial pool of items (Hinkin, 1998). Although it is recognized and suggested that a combination of inductive and deductive processes is required for item generation (Brod & Tesler, 2009; Morgado et al., 2017), studies under-report which processes were utilized and whether a theoretical foundation was established to inform the development of a measure (Hinkin, 1995). As a result of this methodological gap, the International Society for Quality of Life Research (ISOQOL) group proposed that the "documentation of sources from which items were derived, modified, and prioritized during measure development" be considered a best practice guideline (Reeve et al., 2013, p. 9). Thus, although guidelines have been established on numerous accounts to facilitate measure development in general (DeVellis, 2003; FDA Center for Drug Evaluation and Research et al., 2009; Hinkin, 1998; Mokkink et al., 2010; Rattray & Jones, 2007; Reeve et al., 2013; Saris, 2014; Terwee et al., 2007; Valderas et al., 2008), and while it has been noted that item generation is a fundamental step of measure development (Hinkin, 1995; Morgado et al., 2017), there is limited practical guidance on how to generate items. Without guidance to facilitate the generation and revision of the initial pool of items, and the linkage of items to their defined construct(s) and associated domain(s), the robustness of the final measure may be compromised or criticized (Bowling, 1997; Brod & Tesler, 2009; Buchbinder et al., 2011; Hinkin, 1995; Nichols & Sugrue, 1999; Rattray & Jones, 2007).
A Table of Specifications (TOS) is a tool long utilized within the field of nursing education to ensure accurate and comprehensive item coverage in test construction (Billings & Halstead, 2016, p. 425), and it can provide a rigorous and pragmatic framework to aid test/measure developers in the item generation process. A TOS is defined as a table that "aligns objectives, instruction, and assessment" (Fives & DiDonato-Barnes, 2013, p. 1; Appendix A) to facilitate the construction of tests through rigorous means. Since it is not feasible for educators to measure every facet of a course or subject of interest, a TOS allows educators to identify key content and item coverage while ensuring that the content is comprehensible to the target population, enhancing the validity of the test as a result (Fives & DiDonato-Barnes, 2013). Nursing educators rely heavily on a TOS as their guiding framework to inform the essential content (i.e., categories or domains of content to be measured) during test construction (Billings & Halstead, 2016, p. 423). Adopting this framework to develop patient-reported measures of health experience or outcome may also provide researchers with a systematic means of ensuring accurate coverage and rigor in linking the overarching construct of interest to its associated domains, and domains to individual items, thereby ensuring the fidelity of the final measure.
The purpose of this article is to provide practical guidance on the process of item generation and the utilization of a TOS in the healthcare context, as illustrated in the development of a patient-reported compassion measure for patients living with an incurable, life-limiting illness.
METHODS
Measure Development: Process Overview
Item generation is a highly iterative process of defining, redefining, revisiting, refining, and modifying a measure throughout the course of its development. In the absence of a guiding framework, there appears to exist a quantum leap from defining the construct of interest to the completed pool of items, with little information describing the process of generating items from the empirical literature or a theoretical model. To address this gap, we provide an example of how we utilized a TOS not only to ensure adequate content coverage, but also to ensure the fidelity of proposed items to the underlying construct of interest: compassion within healthcare. Having defined the construct of compassion and its associated theoretical domains within the Patient Compassion Model (PCM) derived from qualitative interviews with patients (Sinclair et al., 2016), the goal herein is to describe our experience of and insights into this process. While we highlight the individual phases (Figure 1) we followed to achieve our pool of items, this process can be adapted to any measure of interest. Figure 1 illustrates the process of item generation undertaken by our team, spanning the initial process of domain identification, generation of initial items, and refinement up until the construction of the draft measure and subsequent assessment by SMEs and exploratory and confirmatory factor analyses.

Figure 1. Measure development overview: Five phases of item generation.
RESULTS
Phase 1. Establishing the Scope and Purpose of the Measure Using a Conceptual Model: The "What" and "Why" of Measure Development
An important preliminary task in establishing and confirming the scope and purpose of a measure is to determine precisely what (i.e., content) is being measured and why. The purpose of the measure needs to be solidified at the outset, as this inevitably will affect the content and types of items that are developed, their respective response scales (e.g., satisfaction or frequency scale), and the target population (Hinkin, 1998). To determine the scope and purpose of our proposed compassion measure, we conducted a comprehensive and critical review of the compassion literature in healthcare (Sinclair, Russell, Hack, Kondejewski, & Sawatzki, 2016) and conducted a large qualitative study with patients with advanced cancer (Sinclair et al., 2016). This qualitative study informed the development of a theoretical PCM delineating the construct of interest, its associated domains, and their relationship with one another. The transferability of the PCM was then further verified through qualitative interviews with noncancer patients living with an incurable and life-limiting illness, to ensure that each facet of the model was adequately represented and generalizable to patients with varying life-limiting illnesses. In developing the PCM and assessing its transferability in other patient populations (Sinclair et al., 2018), we recognized a gap between what patients consider an integral component of quality care and healthcare providers' (HCPs) ability to assess and deliver it (Sinclair et al., 2016), including the absence of a valid and reliable measure of compassion to assist researchers in studying this topic (Sinclair et al., 2016). Thus, our primary aim was to develop a valid and reliable measure to aid researchers in studying compassion in healthcare, and our secondary aim was to develop a clinically informed and relevant measure to assess patients' experiences of compassion from their HCPs. After a number of team meetings and lengthy discussions surrounding the domains of the PCM and the associated coding schema (containing over 600 individual codes representing individualized patient views on compassion derived from the qualitative study described earlier; Sinclair et al., 2016), a two-fold purpose for the measure was agreed upon: (a) to measure the patient's experience of compassion based on the emanated behaviors (what), skills (how), and qualities (who) of the HCPs that are exemplified in care; and (b) to determine the extent to which patients feel that the care they received was compassionate.
While determining the scope and purpose of the measure may seem like an intuitive or obvious step, the time and effort invested in this phase should not be underestimated, nor should the scope and purpose be determined retrospectively. This critical "a priori" phase is also essential in the development of a TOS, which can: (a) function as a framework for deductively generating items; and (b) control against investigator biases, by providing an important reference point for revisiting the content that is to be measured throughout the subsequent item evaluation phases. Additional considerations that we found important and helpful to discuss at this phase of developing a measure included: item tense (past vs. present); recall period of interest (i.e., present day, past week vs. month vs. year); and singular vs. plural (HCP(s) to be evaluated).
Phase 2. Developing a TOS: "How" to Best Measure the Construct of Interest
After agreeing on the scope and purpose of the measure, we developed a TOS by identifying the key measurement domains of the construct of compassion, to ensure that the items within the measure adequately covered each domain, much like ensuring that exam questions adequately cover course learning objectives. Chase (1999) identifies three steps in the creation of a TOS: (a) choosing the [measurement] domains to be covered, (b) distilling the [selected] domains into key content or independent parts such as concepts, terms, procedures, and applications, and (c) constructing the table itself. Having identified the measurement domains in the PCM, the second step of distilling the domains into key content or independent parts consisted of revisiting the identified domains of the PCM (Figure 2; Sinclair et al., 2016) and confirming their alignment to the scope and purpose of the measure. An important consideration at this phase is determining "how" to best measure the construct of interest by deciding on the granularity of measurement. In relation to the PCM (Figure 2), the seven domains comprising the model are the highest level of measurement, with the 27 associated themes being a secondary level (Figure 3; Appendix A), and the individualized codes within these themes being the most granular (Sinclair et al., 2016).
Figure 2. Patient compassion model. Source: From Sinclair, S., McClement, S., Raffin-Bouchal, S., Hack, T., Hagen, A., McConnell, S., & Chochinov, H. (2016). Compassion in health care: An empirical model. Journal of Pain and Symptom Management, 51, 193–203. Reprinted with permission.

As a team, we also needed to decide whether compassion should be measured at the domain, thematic, or individual code level. Additional factors to consider in determining the level of measurement included: (a) ensuring that each level of measurement has appropriate content coverage, as some domains or themes may have more or fewer codes within them to inform the prospective items; (b) domain or theme overlap (i.e., are there similar themes across domains or codes across the themes and/or domains); and (c) estimating the number of items required for exploratory factor analysis. In revisiting and reviewing the purpose(s) of the compassion measure and our data from previous studies that informed this process, our research team decided that the level of measurement was thematic, as we felt confident that the themes within the PCM (Figure 3; Sinclair et al., 2016) accurately and comprehensively depicted the key components associated with patients' experiences of compassion. Individual codes within these respective themes, however, served a secondary purpose by providing the team with more granularity surrounding the content of the themes, serving as a potential source for specific measurement indicators (Appendix A) of the themes, and informing item wording. As a result, we reanalyzed, clustered, and collapsed each of the qualitative codes within their respective themes, paying particular attention to areas of overlap and/or redundancy, thereby refining our understanding of the construct and further delineating each of the domains and their relationship to one another (Table 1).
TABLE 1. Table of Specifications: Determining the Core Content of a Patient-Reported Compassion Measure

Item Content | Specific Examples From Data Sources | Item Difficulty (Low, Medium, or High) | Type of Item (Perception vs. Behavior) | Sample Patient Quotes

ESSENTIAL DOMAIN #1: VIRTUOUS RESPONSE
Theme #1: Knowing the Person
TBD in subsequent phases of TOS development | Understand what patient is going through; Sensitive to patient's situation | Medium | Perception | "To get to know them and understand a little bit about what they're going through"
TBD | Genuine concern (1); Concerned | Low | Perception | "Well, people that are concerned about you, and they look after you, or help you when they can"
And so on...

ESSENTIAL DOMAIN #2: ATTENDING TO NEEDS
Theme #1: Action
TBD | Helping/doing things for others | Low | Perception | "Goodwill, willing to help others and a willingness to complete the needs of others"
TBD | Little acts of caring | Medium | Behavior | ". . . it wasn't an easy thing to do, but she arranged it and she knew that that would really brighten my day and it did"
And so on...

Figure 3. Elements of the patient compassion model. Permission to reproduce this figure was obtained from the Journal of Pain and Symptom Management, Elsevier Inc.

Having solidified the scope and purpose of the measure (Phase 1), we then considered additional aspects of how to best measure the construct of interest. This involved revisiting the purpose(s) and determining whether the measure was intended to assess the frequency of particular facets of compassion (e.g., How often did a patient experience a
compassionate behavior(s) from their HCPs?) and/or the adequacy of the HCPs' behavior(s) in keeping with the patient's subjective experience of compassion (e.g., Were they satisfied with the compassionate care provided, or did they agree that the care they received was compassionate?). Second, we considered (a) the type(s) of items (i.e., subjective perception or observed HCP behavior); (b) item difficulty (i.e., how easy/difficult it is for HCPs to score well on the items); (c) the cognitive level of the items for the target population; and (d) potential question types (e.g., open vs. closed ended; Dillman, Smyth, & Christian, 2014). For example, with respect to item difficulty, it is important to generate items with a range of difficulty to ensure greater response variance, thereby mitigating the risk of floor or ceiling effects. In the context of measuring a construct like compassion, an "easy" item is one on which the majority of HCPs are anticipated to receive high ratings, while only the most compassionate HCPs would receive high ratings on a "difficult" item. While this phase of creating a TOS is highly iterative and complex, it allows measure development teams to err on the side of content comprehensiveness while setting clear boundaries to delineate what is considered to be "compassion" from the patient experience and what is not. Comprehensiveness is important, as construct underrepresentation is a challenge in measure development (Buchbinder et al., 2011) that can be mitigated by using a TOS, thereby preventing arbitrary or premature elimination of items.
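To make the floor and ceiling considerations above concrete, the short sketch below (written in Python, our own illustrative choice and not part of the original study) flags items whose pilot responses cluster at the lowest or highest response option. The 15% threshold and the sample data are assumptions for demonstration only.

from collections import Counter

def flag_floor_ceiling(responses, scale_max, threshold=0.15):
    """Flag an item if too many responses sit at the scale extremes.

    responses: list of integer ratings for one item (1 .. scale_max)
    threshold: proportion of responses at an extreme that triggers a flag
               (15% is used here purely as an assumed rule of thumb)
    """
    counts = Counter(responses)
    n = len(responses)
    floor_prop = counts.get(1, 0) / n
    ceiling_prop = counts.get(scale_max, 0) / n
    return {
        "floor_effect": floor_prop >= threshold,
        "ceiling_effect": ceiling_prop >= threshold,
        "floor_prop": round(floor_prop, 2),
        "ceiling_prop": round(ceiling_prop, 2),
    }

# Hypothetical pilot ratings on a 5-point agree/disagree scale for one draft item.
example_item = [5, 5, 4, 5, 5, 3, 5, 4, 5, 5]
print(flag_floor_ceiling(example_item, scale_max=5))
# An item flagged for a ceiling effect may be too "easy" and offer little response variance.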
At this phase, a subset of the measure development team developed the TOS (Table 1), detailing the specific examples from data sources (i.e., codes), level of item difficulty, item type, and quotes from our PCM study and examples from the literature. The TOS was then circulated to the larger team in advance of an item generation meeting.
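As a rough illustration of how a TOS row can be represented for team review, the sketch below (again in Python, chosen only for illustration) models the columns of Table 1 as a simple data structure. The example row content is paraphrased from Table 1; the field names are our own assumptions rather than terminology from the study.

from dataclasses import dataclass
from typing import List

@dataclass
class TOSRow:
    """One row of the Table of Specifications (columns mirror Table 1)."""
    domain: str                 # essential measurement domain from the PCM
    theme: str                  # theme within the domain (chosen level of measurement)
    examples: List[str]         # specific examples from data sources (i.e., codes)
    difficulty: str             # "Low", "Medium", or "High"
    item_type: str              # "Perception" or "Behavior"
    patient_quotes: List[str]   # sample patient quotes supporting the content
    item_content: str = "TBD"   # drafted in subsequent phases of TOS development

# Example row paraphrased from Table 1 (Essential Domain #1: Virtuous Response).
row = TOSRow(
    domain="Virtuous Response",
    theme="Knowing the Person",
    examples=["Understand what patient is going through",
              "Sensitive to patient's situation"],
    difficulty="Medium",
    item_type="Perception",
    patient_quotes=["To get to know them and understand a little bit "
                    "about what they're going through"],
)
print(row.theme, "->", row.examples)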
Phase 3. Developing Indicators for Each Measurement Domain and Content Category
A 3-day item generation meeting commenced with the team organizing specific examples from data sources (Table 1) into indicators, paralleling the validity-driven approach highlighted by Buchbinder et al. (2011) that involves "organizing ideas into groups that would form the basis for the hypothesized scales to be included in the measurement tool" (Buchbinder et al., 2011, p. 2). While we recognize that different measures and measure development teams may require different approaches, given the scope and complexity of our construct of interest (i.e., compassion in healthcare), we found that meeting face-to-face, consecutively over the course of a few days, allowed us to generate our initial item pool in a focused manner, maximizing retention of information and decision recall in the process, while also providing dedicated time for important discussions surrounding the preliminary item generation process.
Once important decisions such as the scope and purpose of the measure and the intended target population were agreed upon, the team developed measurement indicators, which are observable elements used to assess the overarching construct of interest (Avila et al., 2015). There are two types of indicators: (a) reflective or psychometric indicators, which "are a manifestation of the same underlying construct" and are in that sense interchangeable; and (b) formative or clinimetric indicators, "which collectively form a construct," but are therefore not interchangeable (Bagozzi, 2011; Fayers & Hand, 2002; Fayers & Machin, 2016; Mokkink et al., 2010; Appendix A). A key characteristic of reflective indicators is that the items co-vary in a consistent fashion in relation to a common overarching construct. In our example of developing a measure of compassion, based on our theoretical PCM (Figure 2), we determined that reflective indicators were most appropriate, as we view compassion as a construct that is reflected in people's experiences, but not necessarily defined or caused by those experiences (i.e., there are other causal factors that are external to the measurement of compassion). As a result, we generated indicators that reflected compassion, with the idea that each item would be indicative of the extent to which patients experience compassion (Fayers & Machin, 2016). Thus, by way of a practical example, the more that patients experience HCPs as "treating their patients as fellow human beings," the more this would be reflective of higher compassion.
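Because reflective indicators are expected to co-vary, a quick empirical check during later validation is to inspect inter-item correlations. The sketch below is a minimal illustration in Python using simulated ratings; it is not part of the study's analysis plan, and the item names and data are invented for demonstration.

from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    varx = sum((a - mx) ** 2 for a in x) ** 0.5
    vary = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (varx * vary)

# Simulated 5-point ratings from six hypothetical respondents on three draft items
# intended to reflect the same theme (purely illustrative data).
items = {
    "went_extra_mile": [5, 4, 4, 2, 5, 3],
    "did_more_than_expected": [5, 4, 5, 2, 4, 3],
    "above_and_beyond": [4, 4, 5, 1, 5, 3],
}

for (name_a, a), (name_b, b) in combinations(items.items(), 2):
    r = pearson(a, b)
    # Reflective indicators of one theme should correlate positively and substantially.
    print(f"{name_a} vs {name_b}: r = {r:.2f}")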
Since we aimed to measure compassion at the thematic level, we developed reflective indicators for these themes of the PCM (Figure 3) by grouping the specific examples from data sources (Table 1) together based on their shared attributes. While the goal of this process was to condense the codes within the exhaustive TOS into reflective indicators, we felt it was important to err on the side of inclusion, knowing that subsequent measure development stages (the judgment stage of content validity with SMEs, cognitive interviews with patients, and further assessments of construct validity) would eliminate additional, poorly performing items.
After achieving consensus on the indicators, the measure development team consolidated them into content categories. Depending on the nature of the measure, these content categories could represent a set of behaviors, knowledge, skills, abilities, attitudes, or other characteristics to be measured by a test (American Psychological Association [APA], 2014), which can then be utilized to generate an initial pool of items. As an example, in developing a compassion measure, our previously identified measurement indicators such as "helping/doing things for others; communicating information; helping navigate the system; anticipating needs; and responding to needs" were consolidated into a content category of "Help Me," whereas the indicators of "providing comfort; pain and symptom management; and gentle physical care" were consolidated into a content category of "Relief/Comfort" (Figure 4). Further, this process of developing content categories on the whiteboard led our team to collapse the domain of Virtues into the domain of Virtuous Response (Figure 2), as we realized the high degree of overlap between the indicators within each domain. This decision was further informed by reconfirming that the aim of the measure was to focus on patients' experiences of compassion, as opposed to the assessment of the innate qualities of their HCPs. After determining the content categories, our measure development team then verified that the categories were congruent with the TOS and the preidentified measurement domains (Phase 2) to ensure adequate coverage. These content categories became the definitive source of item generation and associated response scaling in Phase 4.
Figure 4. Conceptual model of compassionate care: measurement domains and their content
categories.
*Level 1 = Measurement Domains (identified in Phase 2).
**Level 2 = Content Categories (derived from reflective indicators in Phase 3).
Phase 4. Instrument Construction: Item Generation and Response Scaling Determination
The content categories derived in Phase 3 served as the template for the generation of a comprehensive and detailed item pool in the form of statements or questions alongside their variants (i.e., alternatively worded items) and their potential response scales. Prior to generating the specific items, our team reviewed basic item development guidelines, such as those provided by Bradburn, Wansink, and Sudman (2004); Dillman et al. (2014); Hinkin (1998); and Streiner, Norman, and Cairney (2015; Appendix B). Specifically, these guidelines helped to ensure that the items were concise and comprehensible to our target population, and not double-barrelled (i.e., measuring two aspects of compassion in a single item; Appendix B). An example of an item generated in the form of a statement from the content category "Going the Extra Mile" (Phase 4, Figure 3) was "My Healthcare Providers went the extra mile." From this item we then identified the following variant items: "My HCPs went above and beyond their job"; "My HCPs went above and beyond their call of duty"; and "My HCPs did more than I expected."
After carefully generating the items in the form of statements along with their possible variants, the team turned their attention to determining response scale options. Different scaling options can be utilized depending on the type of measure that is being constructed (Dillman et al., 2014). Examples of types of scales include Agreement/Disagreement, Satisfaction, and Unipolar or Bipolar scales, each with varying purposes. For the compassion measure, we identified the following as possible response scales:

• Agreement/Disagreement Scale
• Satisfaction Scale
• Unipolar Scale (anchored at "0," with each point scoring higher than the previous)
• Unique Bipolar Scale (e.g., "When I needed help, I got it": Very slowly—Slowly—Neutral—Quickly—Very quickly; or "I felt that my HCPs were": Very cold—Cold—Neutral—Warm—Very warm)
Upon determining potential response scale options, we found that some items had more than one scaling contender, such as "My HCPs were gentle with me," which could work with an Agree/Disagree Scale or a Unique Bipolar Scale ("My HCP(s) were . . .": Very rough with me—Rough with me—Neutral—Gentle with me—Very gentle with me). We then proceeded to determine, item by item, the potential scales that would be most appropriate to the respective item. As a result of this process, we discovered that the Agree/Disagree scale and a frequency scale (i.e., Never—Sometimes—Usually—Always) were optimal options for every item in our pool. These generated items and response scale options, which formed the draft measure, were then reviewed by the team to finalize response scales for each individual item.
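As a simple illustration of how candidate response scales might be coded for scoring once finalized, the sketch below maps each scale's labels to integer values. Python is used only for illustration; the label sets are drawn from the text where available, while the agree/disagree anchors and all numeric codings are assumptions rather than the study's scoring rules.

# Candidate response scales from Phase 4, mapped to illustrative integer codes.
# Numeric values (and the agree/disagree anchors) are assumed for demonstration.
RESPONSE_SCALES = {
    "agree_disagree": {
        "Strongly disagree": 1, "Disagree": 2, "Neutral": 3,
        "Agree": 4, "Strongly agree": 5,
    },
    "frequency": {
        "Never": 1, "Sometimes": 2, "Usually": 3, "Always": 4,
    },
    "bipolar_gentle": {
        "Very rough with me": 1, "Rough with me": 2, "Neutral": 3,
        "Gentle with me": 4, "Very gentle with me": 5,
    },
}

def score_response(scale_name: str, label: str) -> int:
    """Return the numeric code for a respondent's chosen label on a given scale."""
    return RESPONSE_SCALES[scale_name][label]

# Example: scoring one respondent's answer to "My HCPs were gentle with me."
print(score_response("bipolar_gentle", "Gentle with me"))  # -> 4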
Phase 5. Consensus and Modification of the Draft Instrument Among the Measure Development Team
After each member of the team completed their independent review of the draft measure and made their recommendation(s) regarding response scale options for each item, feedback was collated to identify and flag any items that appeared to be problematic (i.e., lack of clarity, suggested rewording, or points of disagreement). We decided that if any item was flagged, it would require further discussion, debate, and consensus among the full team via a face-to-face or videoconference meeting before proceeding to content review by SMEs. We felt that individual opinions from our diverse, interdisciplinary team were valuable and needed to be fully considered before consensus was reached, rather than adopting a "majority rules" approach. In order to reach consensus in an efficient manner, we only discussed items where differences of opinion existed (Table 2). After achieving consensus on the item pool and scaling, the item pool was ready for a detailed review by SMEs in the judgment stage of content validity, along with cognitive interviews with patients, in order to further identify issues related to "clarity," "assumptions," and "response categories" (Figure 1).
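To illustrate the kind of collation described above (and reflected in Table 2), the sketch below tallies hypothetical reviewer votes on item inclusion and flags any item lacking unanimous support for full-team discussion. It is a minimal sketch with made-up data and a Python implementation of our own; the study itself collated this feedback as part of team review rather than through any particular software.

# Hypothetical reviewer votes ("Y"/"N") on whether to include each draft item,
# mirroring Column D of Table 2. Data are invented for illustration.
votes = {
    "My HCPs went the extra mile":      ["Y", "Y", "Y", "Y", "Y", "Y"],
    "My HCPs were gentle with me":      ["Y", "Y", "N", "Y", "Y", "Y"],
    "My HCPs did more than I expected": ["Y", "N", "Y", "Y", "N", "Y"],
}

for item, ballot in votes.items():
    yes = ballot.count("Y")
    # Any item lacking unanimous agreement is flagged for full-team discussion,
    # rather than being settled by a "majority rules" approach.
    status = "consensus" if yes == len(ballot) else "flag for discussion"
    print(f"{item}: {yes}/{len(ballot)} yes -> {status}")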
DISCUSSION
The complex process of item generation within measure development is nascent within the healthcare literature, leaving researchers with a lack of guidance on this important and laborious process. While guidelines have been established to facilitate measure development in general, there are no best practice guidelines on item generation specifically. The purpose of this article was to begin to address this gap through a discussion of the use of a TOS, a pragmatic framework to aid measure development teams in this process, and by detailing the item generation phases related to the development of a patient-reported compassion measure. While we caution against an overly prescriptive approach, we feel that a guiding framework is imperative in addressing this methodological gap, enhancing the rigor and robustness of measure development in the process.
We illustrated the process of item generation based on our ongoing study to develop and validate a patient-reported compassion measure. In doing so, we provide readers with a "real life snapshot" of the iterative nature of this process, which is particularly important for experiential and dynamic constructs like compassion that often lie at the heart of both the patient experience and high-quality nursing practice (Sinclair et al., 2016). While measure developers often indicate that items for their specific measures were generated from a theoretical foundation, we were unable to locate any studies detailing this process within the healthcare literature. We also found little detail and guidance related to the iterative process of refining, redefining, and re-refining the construct of interest, domains, items, and response scales. Further, although measure development experts argue for the necessity of ensuring the fidelity between the content of the measure and its underlying construct (Buchbinder et al., 2011; Morgado et al., 2017; Nichols & Sugrue, 1999), historically, measure developers have been left to their best efforts in substantiating the link between the characteristics of the items, the overarching construct, and its associated domains. Thus, while the item generation process we describe herein is potentially simplistic and not overly novel, it has evaded the purview of scholarship, impeding measure development, particularly among researchers and clinicians who have limited practical experience in measure development. In our experience in developing a patient-reported experience measure of compassion, we found a TOS to be highly valuable as a systematic framework for identifying what is to be measured and how it is to be measured, while accounting for the relationship between the key measurement "domains" that compose the overarching construct of interest.
TABLE 2. Sample Draft of Initial Measure and Instructions for Measure Development Team Review

Column A: Items
Column B: Default Response Scale
Column C: Other Response Scale Contenders
Column D: Indicate Your Agreement With Inclusion of Item (Y = "Yes," N = "No")
Column E: Indicate Your Agreement With Item Wording (If "agree," leave the column empty; if "disagree," suggest alternative wording here.)
Column F: Indicate Your Choice of Response Scale (If your choice of scale is not on the list, please suggest it here.)
Column G: Collated Feedback: Inclusion Consensus as per Column D (i.e., 5/6 reviewers said "Yes" to item inclusion)

Based on our experience, creating a TOS prior to item generation facilitates imperative discussions, debate, and decisions among key stakeholders, which, when absent, can result in unnecessary disagreements on the purpose and scope of the measure, and ultimately an
inaccurate, inadequate, and perhaps biased measure. We found that using a TOS forced our measure development team to revisit the conceptualization of the construct on numerous occasions, honing the purpose and scope as a result, while also eliciting additional perspectives that would not otherwise have been recognized. This also emphasizes the benefit of having a heterogeneous (substantive expertise, methodological approaches, disciplines, gender, ethnicity, etc.) group of experts within the measure development team to enhance the breadth and depth of the construct of interest and the associated items, augmenting the perspectives of patients from our initial qualitative research (Sinclair et al., 2016).
While we found that there were numerous benefits in creating and implementing a TOS in the item generation stage of our compassion measure, we recognize that this is a single case illustration focused on the development of a measure of a specific construct. Further research is therefore required to determine the applicability of this item generation process to other constructs, populations, and measure development teams. Further, while we suggest that best practice guidelines need to be developed, it is our hope that the stepwise process highlighted herein (Figure 1), including the use of a TOS to facilitate item generation, will at the very least provide a starting point for these important discussions and, possibly, a potential best practice.
Establishing an a priori scope and purpose of measurement (Phase 1) as a necessary, nonnegotiable preliminary phase of item generation helps measure development teams remain grounded in the task at hand and can mitigate tangential discussions, which, while interesting, are unnecessary and circuitous to the focus of measure development. At the same time, the ability to revisit the scope and purpose of the measure and previous decision points is an ongoing and beneficial exercise throughout the subsequent phases of measure development, which should not be left to the good intentions or residual memories of the research team. In our process of developing a patient-reported compassion measure, other important considerations at the item generation phase included: item tense, recall period, evaluation of singular vs. plural HCP(s), response scale, and mode of administration. While other research teams may find themselves periodically revisiting these factors, having these discussions at the outset is valuable not only in enhancing the efficiency of the team, but also in anticipating potential risks before they become unmodifiable issues during the validation stages of measure development.
As a point of consideration, identifying these issues in advance does not guarantee against, or necessarily mitigate, issues arising in subsequent stages. In fact, we discovered that many of our identified potential issues could not be addressed prospectively and instead had to be "parked" and revisited as the measure development process unfolded. For example, as determining the mode of administration was not a priority or necessity at the early stages of measure development, discussion of this issue was deferred until later in the item generation process.
RELEVANCE TO NURSING PRACTICE, EDUCATION, OR RESEARCH
This article is intended to provide pragmatic guidance on the process and complexity of the item generation step of measure development by utilizing a TOS, illustrated herein in the development of a patient-reported compassion measure. Although compassion has been identified as a pillar of quality nursing care, valid and reliable measures to identify, evaluate, and improve compassion in both nursing practice and education are lacking. Likewise, nursing researchers are currently impeded in conducting research in this area due to the aforementioned shortcoming, leaving much of nursing scholarship to anecdotal and theoretical discourses on this vital topic. While we are not suggesting that the creation and implementation of a TOS will rectify this matter or that it will become a standard of measure development, we feel that by drawing on the rich history of TOS usage in educational testing, a TOS can be considered a valuable tool in the item generation process for patient-reported measures, thereby providing an evidence-based approach to an otherwise nebulous step of measure development, improving the quality, robustness, and rigor of the final measure in the process.
REFERENCES
American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing (U.S.). (2014). Standards for educational and psychological testing. Washington, DC: AERA.
Avila, M., Stinson, J., Kiss, A., Brandao, R., Uleryk, E., & Feldman, M. (2015). A critical review of scoring options for clinical measurement tools. BMC Research Notes, 8, 612. https://doi.org/10.1186/s13104-015-1561-6
Bagozzi, R. (2011). Measurement and meaning in information systems and organizational research: Methodological and philosophical foundations. MIS Quarterly, 35(2), 261–292. https://doi.org/10.2307/23044044
Billings, D., & Halstead, J. (2016). Teaching in nursing: A guide for faculty. Retrieved from https://books.google.ca/books?isbn=032329054X
Bowling, A. (1997). Research methods in health. Buckingham, UK: Open University Press.
Bradburn, N., Wansink, B., & Sudman, S. (2004). Asking questions: The definitive guide to questionnaire design—For market research, political polls, and social and health questionnaires (Rev. ed.). San Francisco, CA: Jossey-Bass.
Brod, M., & Tesler, L. (2009). Qualitative research and content validity: Developing best practices based on science and experience. Quality of Life Research, 18, 1263–1278. https://doi.org/10.1007/s11136-009-9540-9
Buchbinder, R., Batterham, R., Elsworth, G., Clermont, D., Irvin, E., & Osborne, R. (2011). A validity-driven approach to the understanding of the personal and societal burden of low back pain: Development of a conceptual and measurement model. Arthritis Research & Therapy, 13(5), R152. https://doi.org/10.1186/ar3468
Chase, C. I. (1999). Contemporary assessment for educators. New York, NY: Longman.
Cheng, K., & Clark, A. (2017). Qualitative methods and patient-reported outcomes: Measures development and adaptation. International Journal of Qualitative Methods, 16, 1–3. https://doi.org/10.1177/1609406917702983
Deshpande, P., Rajan, S., Sudeepthi, L., & Nazir, A. (2011). Patient-reported outcomes: A new era in clinical research. Perspectives in Clinical Research, 2(4), 137–144. https://doi.org/10.4103/2229-3485.86879
DeVellis, R. (2003). Scale development: Theory and applications (2nd ed.). Newbury Park, CA: Sage.
Dillman, D., Smyth, J., & Christian, L. (2014). Internet, phone, mail, and mixed-mode surveys: The tailored design method. Somerset, UK: John Wiley & Sons.
Fayers, P., & Hand, D. J. (2002). Causal variables, indicator variables and measurement scales: An example from quality of life. Journal of the Royal Statistical Society: Series A (Statistics in Society), 165(2), 233–253. https://doi.org/10.1111/1467-985X.02020
Fayers, P., & Machin, D. (2016). Quality of life: The assessment, analysis, and reporting of patient-reported outcomes (3rd ed.). Chichester, West Sussex, UK: John Wiley & Sons.
Fives, H., & DiDonato-Barnes, N. (2013). Classroom test construction: The power of a table of specifications. Practical Assessment, Research and Evaluation, 18(4), 1–7.
Grant, J., & Davis, L. (1997). Selection and use of content experts for instrument development. Research in Nursing & Health, 20, 269–274.
Hinkin, T. (1995). A review of scale development practices in the study of organizations. Journal of Management, 21(5), 967–988. https://doi.org/10.1177/014920639502100509
Hinkin, T. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1(1), 104–121. https://doi.org/10.1177/109442819800100106
LaVela, S., & Gallan, A. (2014). Evaluation and measurement of patient experience. Patient Experience Journal, 1(1), Article 5.
Lynn, M. (1986). Determination and quantification of content validity. Nursing Research, 35(6), 382–385. https://doi.org/10.1097/00006199-198611000-00017
Mokkink, L., Terwee, C., Patrick, D., Alonso, J., Stratford, P., Knol, D., ... de Vet, H. (2010). The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: An international Delphi study. Quality of Life Research, 19(4), 539–549. https://doi.org/10.1007/s11136-010-9606-8
Morgado, F., Meireles, J., Neves, C., Amaral, A., & Ferreira, M. (2017). Scale development: Ten main limitations and recommendations to improve future research practices. Psicologia: Reflexão e Crítica, 30(3). https://doi.org/10.1186/s41155-016-0057-1
Netemeyer, R., Bearden, W., & Sharma, S. (2003). Scaling procedures: Issues and applications. London, UK: Sage.
Nichols, P., & Sugrue, B. (1999). The lack of fidelity between cognitively complex constructs and conventional test development practice. Educational Measurement: Issues and Practice, 18, 18–29. https://doi.org/10.1111/j.1745-3992.1999.tb00011.x
Rattray, J., & Jones, M. (2007). Essential elements of questionnaire design and development. Journal of Clinical Nursing, 16, 234–243. https://doi.org/10.1111/j.1365-2702.2006.01573.x
Reeve, B., Wyrwich, K., Wu, A., Velikova, G., Terwee, C., Snyder, C., ... Butt, Z. (2013). ISOQOL recommends minimum standards for patient-reported outcome measures used in patient-centred outcomes and comparative effectiveness research. Quality of Life Research, 22, 1889. https://doi.org/10.1007/s11136-012-0344-y
Saris, W. (2014). Design, evaluation, and analysis of questionnaires for survey research. Hoboken, NJ: Wiley.
Sinclair, S., McClement, S., Raffin-Bouchal, S., Hack, T., Hagen, A., McConnell, S., & Chochinov, H. (2016). Compassion in health care: An empirical model. Journal of Pain and Symptom Management, 51, 193–203. https://doi.org/10.1016/j.jpainsymman.2015.10.009
Sinclair, S., Russell, L. B., Hack, T. F., Kondejewski, J., & Sawatzki, R. (2016). Measuring compassion in healthcare: A comprehensive and critical review. Patient, 10(4), 389–405. https://doi.org/10.1007/s40271-016-0209-5
Sinclair, S., Jaggi, P., Hack, T. F., McClement, S., Raffin-Bouchal, S., & Singh, P. (2018). Assessing the credibility and transferability of the patient compassion model in non-cancer palliative populations. BMC Palliative Care, 17, 108. https://doi.org/10.1186/s12904-018-0358-5
Streiner, D., Norman, G., & Cairney, J. (2015). Health measurement scales: A practical guide to their development and use (5th ed.). Oxford, UK: Oxford University Press.
Terwee, C., Bot, S., de Boer, M., van der Windt, D., Knol, D., Dekker, J., ... de Vet, H. (2007). Quality criteria were proposed for measurement properties of health status questionnaires. Journal of Clinical Epidemiology, 60(1), 34–42. https://doi.org/10.1016/j.jclinepi.2006.03.012
US Food and Drug Administration, Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research, Center for Devices and Radiological Health. (2009). Guidance for industry patient-reported outcome measures: Use in medical product development to support labeling claims. Retrieved from http://purl.access.gpo.gov/GPO/LPS113413
Valderas, J., Ferrer, M., Mendivil, J., Garin, O., Rajmil, L., Herdman, M., & Alonso, J. (2008). Development of EMPRO: A tool for the standardized assessment of patient-reported outcome measures. Value in Health, 11(4), 700–708. https://doi.org/10.1111/j.1524-4733.2007.00309.x
Disclosure. The authors have no relevant financial interest or affiliations with any commercial interests related to the subjects discussed within this article.

Acknowledgments. This study was approved by the University of Calgary Conjoint Health Research Ethics Board (REB #16-1460). The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Funding. This study was funded through a Canadian Institutes of Health Research Project Scheme Grant (#364041). This research was undertaken in part, thanks to funding from the Canada Research Chairs Program.

Correspondence regarding this article should be directed to Shane Sinclair, PhD, University of Calgary, Calgary, Alberta, Canada. E-mail: sinclair@ucalgary.ca
Appendix A: Glossary of Terms

Item: An individual question, statement, or task (and its standardized response options) that is evaluated by the patient to address a particular concept (Deshpande et al., 2011; FDA Center for Drug Evaluation and Research et al., 2009).
Construct: An abstract phenomenon of interest (Avila et al., 2015).
Domain: A sub-concept that measures a larger concept comprised of multiple domains (Deshpande et al., 2011; FDA Center for Drug Evaluation and Research et al., 2009).
Theme: A sub-concept that measures a larger domain (Sinclair et al., 2016).
Table of specifications: A table that "aligns objectives, instruction, and assessment" to facilitate the construction of tests through rigorous means (Fives & DiDonato-Barnes, 2013).
Measurement indicators: The observable elements (i.e., variables) used to assess the construct of interest (Avila et al., 2015; Bagozzi, 2011).
Reflective indicators: Variables that measure a key domain of the construct, are a manifestation of the construct, and are interchangeable (Bagozzi, 2011; Fayers & Hand, 2002; Fayers & Machin, 2016; Mokkink et al., 2010).
Formative indicators: Variables that collectively form a construct but are not interchangeable (Bagozzi, 2011; Fayers & Hand, 2002; Fayers & Machin, 2016; Mokkink et al., 2010).
Content category: A set of behaviors, knowledge, skills, abilities, attitudes, or other characteristics to be measured by a test, which can be utilized to generate an initial pool of items (APA, 2014).

Appendix B: Item Development Guidelines

1. Item statements should be simple and as short as possible (to prevent boredom and fatigue from the respondent perspective).
2. Language should be familiar to target respondents.
3. Items should be written at a 12-year-old reading level.
4. Avoid item ambiguity (e.g., words like "recently" can have different meanings for different individuals and hence should be described).
5. Avoid jargon (e.g., words that are not used on an everyday basis by patients, such as "benevolence," "beneficence," or "proactive").
6. Items should measure separate aspects of the topic or construct of interest (e.g., behavior vs. affect), and these should not be intermixed.
7. Avoid double-barrelled items to avoid representing more than one construct.
8. Avoid any leading questions to prevent response bias.
9. Avoid items that would result in the same response from all individuals (i.e., floor and ceiling effects and therefore no variance).
10. If using negatively worded items, ensure that they are very carefully worded for appropriate responses (it can be helpful to have negatively worded questions to pick up on patients who might answer questions a certain way due to fatigue or not understanding the question, resulting in response acquiescence).
   a. Note: negatively worded items have lower validity coefficients than positively worded ones.
   b. Scales with both positive and negative wording are less reliable than those with wording in the same direction.
   c. To create a balanced scale, all of the items should be positively worded, but one half should tap one direction of the trait and the other half should tap the opposite direction of it.
11. Avoid item redundancy.
12. Include items with heavily endorsed codes (e.g., representing individualized patient views on compassion derived from qualitative interviews).

Source: Bradburn, N., Wansink, B., & Sudman, S. (2004). Asking questions: The definitive guide to questionnaire design—For market research, political polls, and social and health questionnaires (Rev. ed.). San Francisco, CA: Jossey-Bass; Dillman, D., Smyth, J., & Christian, L. (2014). Internet, phone, mail, and mixed-mode surveys: The tailored design method. Somerset, UK: John Wiley & Sons; Hinkin, T. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1(1), 104–121. https://doi.org/10.1177/109442819800100106; Streiner, D., Norman, G., & Cairney, J. (2015). Health measurement scales: A practical guide to their development and use (5th ed.). Oxford, UK: Oxford University Press.