JRTE | Vol. 46, No. 2, pp. 129–148 | ©2013 ISTE | iste.org/jrte
A Turn toward Specifying Validity Criteria in the Measurement
of Technological Pedagogical Content Knowledge (TPACK)
Robert F. Cavanagh
Curtin University
Matthew J. Koehler
Michigan State University
Abstract
The impetus for this paper stems from a concern about directions and progress in the measurement of the Technological Pedagogical Content Knowledge (TPACK) framework for effective technology integration. In this paper, we develop the rationale for using a seven-criterion lens, based upon contemporary validity theory, for critiquing empirical investigations and measurements using the TPACK framework. This proposed seven-criterion lens may help researchers map out measurement principles and techniques that ensure reliable and valid measurement in TPACK research. Our critique of existing TPACK research using these criteria as a frame suggests several areas of theorizing and practice that are likely impeding the press for measurement. First are contradictions and confusion about the epistemology of TPACK. Second is the lack of clarity about the purpose of TPACK measurement. Third is the choice and use of measurement models and techniques. This article illustrates these limitations with examples from current TPACK and measurement-based research and discusses directions and guidelines for further research.
(Keywords: Technological Pedagogical Content Knowledge framework,
TPACK, reliability, validity, measurement, assessment)
Since initial publication in 2006 by Mishra and Koehler, the Tech-
nological Pedagogical Content Knowledge (TPACK) framework
for effective technology integration (see Figure 1, p. 130) has had a
significant impact on research and practice around educational technology
(Koehler, Shin, & Mishra, 2011). Application of the framework by research-
ers and practitioners to inform design of interventions such as professional
development has led to the development of measures to quantify eects and
potential gains (Graham, Cox, & Velasquez, 2009; Guzey & Roehrig, 2009).
Although this empirical imperative is a powerful rationale for developing mea-
sures, measurement is also often viewed as the optimal means of establishing the
validity of theoretical frameworks and models. The validation of the frame-
work as a model of technology integration is a second driver of the prolifera-
tion of TPACK measures.
The growth in both the number and variety of the TPACK measures
being explored warrants a critical look at the quality and validity of the mea-
sures being used (Koehler, Shin, & Mishra, 2011). In the sections that follow,
we examine these issues through the lens of contemporary validity theory
and then propose a multistep approach for examining validity in empirical
investigations of TPACK.
This work is grounded in the construct of validity advanced by Messick
(1995). According to Messick (1995, p. 741), validity “is an overall judg-
ment of the degree to which evidence and theoretical rationales support the
adequacy and appropriateness of interpretations and actions on the basis of
test scores and other modes of assessment.” Messick (1998) was emphatic
about this approach being unified in contrast to the multiple-type concep-
tion that previously prevailed. He also reframed these types of validity as
forms of evidence and stated:
What is singular in the unified theory is the kind of validity: All validity
is of one kind, namely, construct validity. Other so-called separate types
of validity—whether labeled content validity, criterion-related validity,
consequential validity, or whatever—cannot stand alone in validity ar-
guments. Rather, these so-called validity types refer to complementary
forms of evidence to be integrated into an overall judgment of construct
validity. (p. 37)
Figure 1. The TPACK framework (reproduced with permission from http://tpack.org)
The current version of the Standards for Educational and Psychologi-
cal Testing published by the American Educational Research Association
(AERA), the American Psychological Association (APA), and the National
Council on Measurement in Education (NCME) embodies this unified
conception: “Validity is a unitary concept. It is the degree to which all the
accumulated evidence supports the intended interpretation of test scores for
the proposed purpose” (AERA, APA, & NCME, 1999, p. 11). The evidence
requires documentation of all aspects of instrument development and
administration, from initial theorizing to assessing the consequences of
interpreting the results.
Messick (1995) provided a six-criterion framework for the organization
of evidence. The criteria were the content, substantive, structural, generaliz-
ability, external, and consequential aspects. Wolfe and Smith (2007a) added
an additional aspect from the Medical Outcomes Trust Scientific Advisory
Committee (1995), evidence of the interpretability aspect. Application of the
seven-criterion framework has not been restricted to psychometric test de-
velopment. Significantly, it has been used in the assaying of phenomenologi-
cal research that used rating scales, surveys, and observational instruments
(Cavanagh, 2011a; Young & Cavanagh, 2011).
The seven criteria are introduced in Table 1, along with examples of
how each can be applied. In this paper, we employ these criteria to audit
TPACK-based empirical research and the measures employed in this
research. The following sections explain the seven aspects of validity
Table 1. Validity Evidence Criteria

Types of Evidence | Description | Examples of Application
1. Content evidence | The relationship between the instrument’s content and what the instrument seeks to measure | Specification of research questions, development of a construct model, writing of items, selection of a scaling model
2. Substantive evidence | Explanation of observed consistencies in the data by reference to a priori theory or hypotheses | Comparing TPACK scores of teachers who have completed TPACK training with those who have not
3. Structural evidence | Confirmation of subconstructs or components in the construct model | Conducting Confirmatory Factor Analysis
4. Generalizability evidence | Individual items are not biased toward particular groups or situations | Testing that each item in a test of TPACK elicits similar responses from males and females with the same overall TPACK level
5. External evidence | Similar results are obtained when different tests are applied to measure the same construct | Comparing findings from observational schedules and document analysis
6. Consequential evidence | Consideration of how results could impact on persons and organizations | Discussing findings with stakeholders
7. Interpretability evidence | Communication of the qualitative meaning of scores | Providing a construct map that explains key points on the scale
evidence in more detail and how these could be manifest in TPACK
measurement.
The corpora of reports on TPACK measurement used in the study were
identified by a literature search in conjunction with the second author’s
extensive familiarity with TPACK literature. The theoretical model ap-
plied in the study was the Wolfe and Smith (2007a; 2007b) seven-criterion
framework. This framework was adopted a priori as a vehicle for identifying
validity evidence rather than simply classifying typical features of TPACK
research or counting the occurrence of these features in the TPACK litera-
ture. We selected studies because they exemplified one or more aspects of
validity evidence of relevance to TPACK measurement. However, locating
examples of all of the types of evidence was difficult and, for some types of
evidence, not successful. For example, specification of scaling models and
testing the technical quality of items are very rare and only found in one
study (i.e., Jamieson-Proctor, Finger, Albion, Cavanagh, Fitzgerald, Bond, &
Grimbeek, 2012), which applied many, but not all, of the AERA, APA, and
NCME standards for instrument construction. This potential over-reliance
on one study is a limitation of this paper and will hopefully be overcome as
advances are made in attention to validity in future TPACK research.
We commence by examining the content aspect of validity that begins
with the reason for measurement. Then we examine a sequence of activities
that lead from clarification of the construct of interest to the design of the
instrument.
Evidence of the Content Aspect
Purpose
The evidence of the content aspect of validity includes clear statements of the
purpose of a study or instrument development process that are made before
other activities are attempted. Asking research questions is one widely used
method of expressing the intent of an investigation. For example, in a study
of TPACK measures, Koehler, Shin, and Mishra (2011, p. 18) made their
purpose clear by posing two research questions: “What kinds of measures
are used in the TPACK literature?” and “Are those measures reliable and
valid?”
Also related to articulating a clear purpose for a study or measure is
specifying the domain of inference, the types of inferences, and potential
constraints and limitations.
Domain of inference. Specifying the domain(s) of inference situates the an-
ticipated outcomes of an investigation within an established body of theory
or knowledge and provides additional evidence of the content. The domains
could be curricular (relating to instruction), cognitive (relating to cogni-
tive theory), or criterion-based (knowledge, skills, and behaviors required
for success in a particular setting). For example, the domain of inference of
TPACK is curricular due to the pedagogical component (Mishra & Koehler,
2006; Koehler & Mishra, 2008) and also criterion-based due to its contex-
tual specificity and situational dependence (Doering, Scharber, Miller, &
Veletsianos, 2009).
Types of inferences. The types of inferences delimit the intended conclusions
or judgments to be made from a study or instrument. Presumably, TPACK
studies or measures could be designed to make inferences about mastery,
individual teachers, systems, or groups of teachers. To date, TPACK mea-
surements have primarily sought to measure individual teachers’ TPACK
(Roblyer & Doering, 2010; Schmidt, Baran, Thompson, Mishra, Koehler, &
Shin, 2009), although there have been notable attempts to study groups of
teachers as well (e.g., Finger, et al., 2012).
There is also an element of mastery underpinning TPACK through the
implication that high technology integration results from high levels of, and
interaction between, technological, pedagogical, and content knowledge.
Schmidt et al. (2009, p. 125) explained, “At the intersection of these three
knowledge types is an intuitive understanding of teaching content with ap-
propriate pedagogical methods and technologies.”
Potential constraints and limitations. Potential constraints and limita-
tions can also be identified that comment on the logistics, resource
issues, or methodological exigencies. For example, Harris, Grandgenett,
and Hofer (2010) identified a methodological limitation when they criti-
cized self-report methods in TPACK research. The authors explained
that “the challenges inherent in accurately estimating teachers’ knowl-
edge via self-reports—in particular, that of inexperienced teachers—
are well-documented” (Harris, et al., 2010, p. 1).
Instrument Specification
Following the definition of the purpose, a set of instrument specifications is
developed. This task involves describing constructs, a construct model, and
then a construct map.
Constructs. Wilson (2010) described a construct as “the theoretical ob-
ject of our interest” (p. 6) and saw it resulting from knowledge about the
purpose of designing an instrument and the context in which it is to be
used. He also considered a construct to be part of a theoretical model that
explains phenomena. Importantly, the construct should sit within a well-
established body of knowledge, and one of the purposes of a study is to
contribute to extant theory in this domain of inference. The construct model
and this theory are a priori considerations that require specification prior to
other measure construction activities.
The TPACK framework could be viewed as a representation of one
construct, a trait or ability of teachers that is not directly observable but is
latent and indicated by observable behaviors. For example, Koehler et al.
(2011, p. 6) explained that the “TPACK framework connects technology to
curriculum content and specific pedagogical approaches and describes how
teachers’ understandings of these three knowledge bases can interact with
one another to produce effective discipline-based teaching with educational
technologies.”
Alternatively, TPACK could be viewed as a composite of the seven con-
structs of Figure 1 (p. 130), each of which is sufficiently different from the
others to warrant separate specification (Schmidt et al., 2009). The seven
constructs comprise three types of knowledge—technological knowledge
(TK), pedagogical knowledge (PK), and content knowledge (CK); and three
types of knowledge about the interactions between technology, pedagogy,
and content—pedagogical content knowledge (PCK), technological peda-
gogical knowledge (TPK), technological content knowledge (TCK); and
then the interaction between PCK, TPK, and TCK—technological pedagogi-
cal content knowledge (TPACK). Additional complexities are contextual
dependency on situational variables (e.g., subject discipline), which needs to
be accommodated in both the unified and the multi-component representa-
tions, and the possibility of perhaps as few as three components (Archam-
bault & Barnett, 2010) or more than seven components.
Empirical studies that use TPACK to guide research have tended to focus
on one specic aspect of TPACK. Angeli and Valanides (2009) researched
a strand within an alternative TPACK framework they termed ICT-TPCK;
Harris et al. (2010) studied the quality of technology integration; and
Jamieson-Proctor et al. (2012) evaluated TPACK confidence and usefulness.
In these cases, models supplementing the more general Venn-diagram TPACK
model shifted the focus onto the particular phenomenon of interest.
Construct models. There are many sources of information that can assist
in depicting a construct model. Wolfe and Smith (2007a) listed real-world
observations, literature reviews of theory, literature reviews of empirical
research, reviews of existing instruments, expert and lay viewpoints, and
content and task analyses. Constructs can have internal and external mod-
els. An internal model typically comprises components, facets, elements,
or factors, and the hypothesized relations between these components. The
TPACK models above are examples of internal models. Another example of
an internal model is represented in Table 2 (Jamieson-Proctor et al., 2012,
p. 5). The construct model for the Teaching Teachers for the Future (TTF)
TPACK Survey has seven components: TPACK, TPK, TCK, confidence to
support student learning, confidence to support teaching, usefulness to
support student learning, and usefulness to support teaching.
External models represent relations between the target construct and
other constructs. Constructs associated with context (e.g., racial identity,
learning environment, professional development) and how these relate
to TPACK could constitute external models. An early version of the TTF
instrument (Jamieson-Proctor, Finger, Albion, Cavanagh, Fitzgerald, Bond,
& Grimbeek, 2012) contained a set of items on teacher efficacy. These items
were intended to measure what was at the time considered a construct
related to TPACK.
Construct maps. The construct map requires qualification of the construct
model by providing a coherent and substantive definition of the content of
the construct and a proposal of some form of ordering of persons or of the
tasks administered to persons (Wilson, 2010). From a content perspective,
the extension of Shulman’s (1986; 1987) conception of pedagogical content
knowledge (PCK) by the addition of technological knowledge (TK) has
produced the integrative TPACK model (Graham, 2011). However, the PCK
model and associated definitions have been criticized for imprecision and
thus being “a barrier to the measurement of PCK” (Graham, 2011, p. 1955).
This in turn has led to problems when defining the TPACK construct and
the need for ongoing work in this area to resolve these issues (Koehler, Shin,
& Mishra, 2011).
The issue of definitional precision is not peculiar to TPACK measure-
ment. Wilson (2010, p. 28) referred to it as the “more complex reality of
usage” and suggested some constructs should be conceptualized as multi-
dimensional and represented by several discrete construct maps. He also
recommended initial focus on one dimension at a time and development
of a simple model on the assumption that complications can be dealt with
later. This approach is compatible with the transformative view of TPACK
that focuses on change and growth of teachers’ knowledge over time
rather than on discriminating between different types of TPACK knowl-
edge (Graham, 2011). It is also consistent with the general objectives of
measurement—interpersonal comparison of capabilities or dispositions,
comparison of an individual’s capabilities or dispositions at different times,
or comparison of the difficulty the tasks comprising a measure present to
persons.
The notion of ordering of persons or of instrument tasks has been suc-
cessfully applied in construct mapping of TPACK. Harris, Grandgenett, and
Hofer (2012) developed a rubric to rate experienced teachers on four forms
of technology use when planning instruction. Twelve scorers assessed cur-
riculum goals and technologies, instructional strategies and technologies,
technology selection, and fit using a scoring rubric that described four levels
of each form of technology use. They rated curriculum goals and technolo-
gies as “strongly aligned” (scored 4), “aligned” (scored 3), “partially aligned”
(scored 2), and “not aligned” (scored 1). The goal of this exercise was evalu-
ating teachers’ TPACK by ordering of persons.
Table 2. The Conceptual Structure of the TTF TPACK Survey

TPACK Framework Dimension | Scale: Confidence to Use ICT to: | Scale: Usefulness of ICT to:
TPACK | Support student learning | Support student learning
TPK, TCK | Support teaching | Support teaching
The ordering of tasks assumes that different tasks present varying
degrees of difficulty to the persons attempting the tasks. An example of a
task-ordered rubric is the six facets of learning for understanding devel-
oped by Wiggins and McTighe (1998; 2005). The facet of explanation was
postulated to vary in degree from naïve to sophisticated. Five levels were
defined: naïve, intuitive, developed, in-depth, and sophisticated. A naïve
understanding was described as “a superficial account; more descriptive
than analytic or creative; a fragmentary or sketchy account of facts/ideas or
glib generalizations” (Wiggins & McTighe, 1998, p. 76). In contrast, sophis-
ticated understanding could be demonstrated by “an unusually thorough,
elegant, and inventive account (model, theory, or explanation)” (Wiggins
& McTighe, 1998, p. 76). The facets of a learning rubric describe student
behaviors at each level to differentiate between levels as well as to order the
levels. Such a system of ordering is important when the construct of interest
is hypothesized to be cognitively developmental with the attainment of low-
er-level tasks prerequisite to mastering those at higher levels. In the Wiggins
and McTighe (1998; 2005) construct map, naïve explanations are easier to
provide than intuitive explanations, which in turn are easier to provide than
developed explanations (Cavanagh, 2011). This ordering informs theorizing
about students learning for understanding. A developmental view of TPACK
learning in which teacher cognition progresses through developmental
stages would also require the identification of similar sequences of levels for
the construct map and then the development of instrument items.
Item development. Item development concerns making choices about
item formats such as multiple choice, rating scales, and performance as-
sessments. This can be informed by following the recommendations of
item writing guidelines about content/semantics, formatting, style, stem
statements, response scales, and response choices. Regular reviews such as
expert reviews, content reviews, and sensitivity (targeting) reviews can be
conducted throughout all stages of instrument development. For example,
seven TPACK experts reviewed the validity and face value of the rubric
developed by Harris et al. (2012) to assess observed evidence of TPACK
during classroom instruction.
Scoring model. A detailed construct map with an internal structure that
orders persons and tasks informs selection of a scoring model. Signifi-
cantly, it is the ordering that provides a foundation for the instrument
being a measure. A scoring model shows how observations or responses
to items are numerically coded. Right or wrong answers provide dichoto-
mous data that could be scored 0, 1. Rating scales produce polytomous
data that can be scored using the successive integers 0, 1, 2, and 3. Rating
scales can show the degree of agreement of respondents to a stem state-
ment, and while this is related to the overall strength of the trait of inter-
est in persons, it is the ordering within the construct map that constitutes
the measure.
Cavanagh.indd 136 11/2/2013 5:02:18 PM
Volume 46 Number 2
|
Journal of Research on Technology in Education | 137
Specifying Validity Criteria in the Measurement of TPACK
The number and labeling of response categories are crucial to the per-
formance of a rating scale instrument (Hawthorne, Mouthaan, Forbes, &
Novaco, 2006; Preston & Colman, 2000). Another related issue is use of
a “neither disagree nor agree” category and the reasons for the selection of
this category (Kulas & Stachowski, 2001). The scoring model for the TTF
TPACK Survey instrument (Jamieson-Proctor et al., 2012) comprised seven
response categories scored 0 (not confident/useful); 1, 2, 3 (moderately
confident/useful); 4, 5, 6 (extremely confident/useful); plus an additional
“unable to judge” category scored 8 and coded as missing data. We collected
data using Qualtrics online survey software.
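To illustrate how such a scoring model might be applied to raw responses, the following minimal sketch (in Python, assuming the numpy and pandas libraries; the item names and values are hypothetical, not the TTF data) recodes an “unable to judge” category as missing before any scaling is attempted.

    import numpy as np
    import pandas as pd

    # Hypothetical raw responses on a 0-6 rating scale, where 8 denotes an
    # "unable to judge" response that should be treated as missing data.
    raw = pd.DataFrame({
        "item_1": [0, 3, 6, 8, 5],
        "item_2": [2, 8, 4, 1, 6],
    })

    # Recode the "unable to judge" category (8) as missing before scaling.
    scored = raw.replace(8, np.nan)
    print(scored)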
Scaling model. The data obtained directly from instrument administra-
tion are termed raw data because they require processing by scaling into a
meaningful form. Without scaling, the use of raw scores is limited to the
presentation of frequencies, and even mathematical operations as basic
as estimating a mean score should be undertaken with caution (Doig &
Groves, 2006). A scaling model such as the Rasch Model (Rasch, 1980) can
be applied to raw scores to calibrate these on a linear scale. The intervals on
a linear scale are equal in the same way as the markings on a yardstick. This
enables comparison of person scores according to their magnitude and not
just their order.
We analyzed the TTF TPACK Survey student scores using the Rasch Rat-
ing Scale Model (Andrich, 1978a; Andrich, 1978b; Andrich, 1978c; Bond &
Fox, 2007; Jamieson-Proctor, Finger, Albion, Cavanagh, Fitzgerald, Bond,
& Grimbeek, 2012). Data from four groups of like-named items (i.e., TPK/
TCK Confidence, TPK/TCK Usefulness, TPACK Confidence, TPACK Use-
fulness) were subject to separate scaling, and then we equated scaled scores
on an interval scale (Jamieson-Proctor, Finger, Albion, Cavanagh, Fitzgerald,
Bond, & Grimbeek, 2012). e generation of interval data enabled accurate
comparison of student responses on four scales between the two occurrenc-
es of instrument administration at the national level and also within the 39
universities/higher education providers that participated in the project.
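A minimal sketch of the underlying idea is given below, for the simpler dichotomous Rasch model rather than the Rating Scale Model actually used in the TTF analysis (Python, assuming numpy; the item difficulties and responses are hypothetical). It converts a person's raw responses into a logit-scale ability estimate, given known item difficulties.

    import numpy as np

    def rasch_ability(responses, difficulties, tol=1e-6, max_iter=100):
        """Maximum-likelihood ability estimate (in logits) under the
        dichotomous Rasch model, given 0/1 responses and known item
        difficulties, via Newton-Raphson iteration."""
        responses = np.asarray(responses, dtype=float)
        difficulties = np.asarray(difficulties, dtype=float)
        raw_score = responses.sum()
        if raw_score == 0 or raw_score == len(responses):
            raise ValueError("Extreme raw score: the estimate is unbounded.")
        theta = 0.0
        for _ in range(max_iter):
            expected = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
            gradient = raw_score - expected.sum()           # first derivative
            hessian = -(expected * (1.0 - expected)).sum()  # second derivative
            step = gradient / hessian
            theta -= step
            if abs(step) < tol:
                break
        return theta

    # Hypothetical item difficulties (logits) and one person's scored responses.
    difficulties = [-1.5, -0.5, 0.0, 0.8, 1.6]
    responses = [1, 1, 1, 0, 0]
    print(round(rasch_ability(responses, difficulties), 2))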
Item technical quality. Evidence of item technical quality can be garnered
by testing how well data from individual items meet the requirements of an
item-response measurement model. For example, in its simplest form, the
Rasch Model requires the probability of a person completing a task to be a
function of that person’s ability and the diculty of the task. Persons with
high ability are more likely to complete dicult tasks than those with lower
ability. Conjointly, easy tasks are likely to be completed by both low- and
high-ability persons. Rasch Model computer programs such as RUMM2030
(RUMMLab, 2007) or Winsteps (Linacre, 2009) test how well the responses
to an item display this property by estimating fit statistics. Common reasons
for items having poor fit to the model include the item not discriminating
between persons of different ability and the responses being confounded by
an attribute of the persons different to the trait being measured.
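The fit statistics themselves can be sketched simply. The following illustration (Python, assuming numpy; the data are hypothetical) computes unweighted (outfit) and information-weighted (infit) mean-square statistics for dichotomous items, a simplified version of the checks that programs such as Winsteps or RUMM report.

    import numpy as np

    def item_fit_statistics(X, theta, b):
        """Outfit and infit mean-square statistics for dichotomous items
        under the Rasch model. X is a persons-by-items matrix of 0/1 scores,
        theta the person abilities, and b the item difficulties (logits)."""
        theta = np.asarray(theta, dtype=float)[:, None]  # persons as rows
        b = np.asarray(b, dtype=float)[None, :]          # items as columns
        expected = 1.0 / (1.0 + np.exp(-(theta - b)))
        variance = expected * (1.0 - expected)
        residual = X - expected
        z_squared = residual**2 / variance
        outfit = z_squared.mean(axis=0)                            # unweighted
        infit = (residual**2).sum(axis=0) / variance.sum(axis=0)   # weighted
        return outfit, infit

    # Hypothetical data: four persons, three items.
    X = np.array([[1, 0, 0],
                  [1, 1, 0],
                  [1, 1, 1],
                  [0, 1, 1]])
    outfit, infit = item_fit_statistics(X, theta=[-1.0, 0.0, 0.5, 1.5],
                                        b=[-1.0, 0.0, 1.0])
    print(np.round(outfit, 2), np.round(infit, 2))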
Rasch Model analysis of the TTF TPACK Survey data using the
WINSTEPS computer program (Linacre, 2009) identified six items with
misfitting data. These were stepwise removed from subsequent analyses
until all the remaining items showed adequate fit to the model’s require-
ments for measurement. The items removed and their respective scales
were:
•Scale TPK/TCK Confidence Combined: Teach strategies to support students
from Aboriginal and Torres Strait Islander backgrounds; access, record,
manage, and analyze student assessment data
•Scale TPK/TCK Usefulness Combined: Teach strategies to support stu-
dents from Aboriginal and Torres Strait Islander backgrounds; man-
age challenging student behavior by encouraging the responsible use
of ICT
•Scale TPACK Confidence Combined: Communicate with others locally and
globally
•Scale TPACK Usefulness Combined: Communicate with others locally and
globally (Jamieson-Proctor, Finger, Albion, Cavanagh, Fitzgerald, Bond,
& Grimbeek, 2012, p. 8)
Another consideration in rating scale instruments is the functioning
of the rating scale categories. There is a diversity of views on the optimum
number of response categories (Hawthorne, Mouthaan, Forbes, & Novaco,
2006; Preston & Colman, 2000). There are also many reasons, which are
often unclear, for selecting a “neither disagree nor agree,” “undecided,” or “not
sure” category as a middle category (Kulas & Stachowski, 2001). Optimiz-
ing the response scale is possible by analysis of pilot and trial data using
the Rasch Rating Scale Model (Andrich, 1978a; Andrich, 1978b; Andrich,
1978c). For an item, a Category Probability Curve is produced from plotting
the responses to each category in the response scale against the ability of the
persons. An ideal pattern of responses would show the more capable respon-
dents choosing the most difficult-to-affirm categories and the less capable
respondents choosing the easier-to-affirm categories. For the seven-category
response scales used in the TTF study, some of the provided response op-
tions were not used as intended. Consequently, “adjacent response categories
were combined as required to achieve satisfactory Category performance”
(Jamieson-Proctor, et al., 2012, p. 8).
The preceding section on the content aspect of validity described the key
activities in the construction of a measure and methods for ensuring these
are implemented as intended. The content activities are sequential and itera-
tive but require implementation in conjunction with the other six aspects
of validity evidence. With this in mind, the following six sections examine
substantive, structural, generalizability, external, consequential, and inter-
pretability evidence of validity.
Evidence of the Substantive Aspect
The substantive aspect of validity can be evidenced by the extent to which
the theoretical framework, an a priori theory, or the hypothesis inform-
ing an investigation can explain any observed consistencies among item
responses. This section examines each approach.
For example, the literature on student engagement suggests that it is
characterized by enjoyable experiences in the classroom and a favorable
disposition toward the material being learned and toward the classroom
environment (Shernoff, 2010). Students describing their favorite class
would be expected to have higher engagement scores than those describing
a nonfavorite class. We used RUMM2030 to calculate engagement scores
for data from the Survey of Student Engagement in Classroom Learning
(Cavanagh, 2012). Figure 2 presents the frequency of scores (person loca-
tions measured in logits) for students reporting their favorite subjects and
those reporting a nonfavorite subject. The scores for the favorite subject were
statistically significantly higher than those for the nonfavorite subjects
(i.e., mean score favorite .93 logits and mean score nonfavorite .01 logits,
F = 147.7, p < .001).
A similar approach for gathering substantive evidence could be used
with TPACK construct models and data. There are likely particular groups
of teachers with attributes anticipated to be associated with high TPACK
scores. These could be teachers who have completed postgraduate courses
in technology integration, teachers who have received substantial profes-
sional development in technology integration, teachers who have been
recognized for outstanding technology use in their classroom, teachers
who have received awards for innovative technology use in the classroom,
Figure 2. Frequency distributions of student engagement scores for favorite and nonfavorite subjects (N = 1,743).
and/or teachers selected to mentor or train colleagues in technology
integration.
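To make the logic of this kind of substantive check concrete, a minimal sketch follows (Python, assuming scipy; the scores and group labels are hypothetical). It compares the TPACK measures of a group expected to score highly with those of a comparison group.

    from scipy import stats

    # Hypothetical TPACK measures (logits) for teachers who completed substantial
    # professional development in technology integration and for teachers who did not.
    trained = [1.2, 0.8, 1.5, 0.9, 1.1, 1.4]
    untrained = [0.2, -0.1, 0.5, 0.3, 0.0, 0.4]

    # A one-way ANOVA (equivalent to a t-test for two groups) tests whether the
    # theoretically expected difference in scores is actually observed.
    f_stat, p_value = stats.f_oneway(trained, untrained)
    print(f"F = {f_stat:.1f}, p = {p_value:.4f}")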
Evidence of the Structural Aspect
The structural aspect of validity concerns the construct model and map, for
example, by ascertaining if the requirements of a unidimensional measure-
ment model are met when a unidimensional trait is measured. There are
both traditional and contemporary methods for collecting evidence about
construct structure. The traditional approach is to conduct a Principal
Components Factor Analysis of raw scores (dichotomous or polytomous)
to examine correlations and covariance between items by identifying
factorial structure in the data. Provided there is sufficient data in relation
to the numbers of items in the scale under scrutiny, this method is well
accepted in TPACK research. Notwithstanding, smaller data sets and large
instruments (many items) have required a multiscale approach. Schmidt
et al. (2009) developed a 75-item instrument measuring preservice teach-
ers’ self-assessments of the seven TPACK dimensions: 8 TK items, 17
CK items, 10 PK items, 8 PCK items, 8 TCK items, 15 TPK items, and 9
TPACK items. However, the sample included only 124 preservice teachers,
which precluded a full exploratory factor analysis of data from all 75 items
but did allow separate analyses of the seven dimensions. In this study
(Schmidt et al., 2009), factor loadings were estimated, “problematic” items
were “eliminated,” and Cronbach’s alpha reliability coefficient was com-
puted for data from the retained items in each scale. This process provided
evidence of the internal structure of the seven dimensions but did not
confirm a seven-dimension construct model of TPACK. Similarly, the TTF
TPACK Survey data were subject to two exploratory factor analyses: one for
the 24 TPK and TCK items and one for the 24 TPACK items. We found two-
factor solutions in both cases, with the confidence data loaded on one factor
and the usefulness data loaded on the second factor (Jamieson-Proctor,
Finger, Albion, Cavanagh, Fitzgerald, Bond, & Grimbeek, 2012). The results
provide confirmatory evidence of the construct model in Table 2 (p. 135).
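As a small illustration of the reliability computation mentioned above for the retained items, the following sketch (Python, assuming numpy; the scores are hypothetical) computes Cronbach’s alpha for a set of items.

    import numpy as np

    def cronbach_alpha(item_scores):
        """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances /
        variance of the total score), for a persons-by-items score matrix."""
        X = np.asarray(item_scores, dtype=float)
        k = X.shape[1]
        item_variances = X.var(axis=0, ddof=1)
        total_variance = X.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

    # Hypothetical 5-point self-assessment responses from six teachers to four items.
    scores = np.array([[4, 5, 4, 4],
                       [3, 3, 2, 3],
                       [5, 5, 5, 4],
                       [2, 3, 2, 2],
                       [4, 4, 3, 4],
                       [3, 2, 3, 3]])
    print(round(cronbach_alpha(scores), 2))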
Another approach to garnering evidence of dimensionality uses the
Rasch Model. The linear Rasch measure is extracted from the data set after
the initial Rasch scaling, and then a Principal Components Factor Analy-
sis of the residuals is conducted. The assumption underlying this process
is that variance within the data should be mainly attributable to the Rasch
measure and that there will be minimal structure and noise in the residual
data. Application of this approach to phenomena that are clearly multivari-
ate requires separate Rasch Model analyses for each variable. This was the
case with the TTF TPACK Survey data. We used four Rasch Model analyses
and took the sound data-to-model fit in the four scales as evidence of the
structure within the four-component construct model presented in Table 2
(Jamieson-Proctor, et al., 2012).
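A minimal sketch of the residual analysis follows (Python, assuming numpy; abilities, difficulties, and responses are hypothetical, and only the dichotomous case is shown): after removing the Rasch measure, eigenvalues of the residual correlations that stay close to one suggest little remaining structure.

    import numpy as np

    def residual_principal_components(X, theta, b):
        """Eigenvalues of the correlation matrix of standardized Rasch
        residuals for dichotomous data (persons-by-items matrix X, person
        abilities theta, item difficulties b, both in logits)."""
        theta = np.asarray(theta, dtype=float)[:, None]
        b = np.asarray(b, dtype=float)[None, :]
        expected = 1.0 / (1.0 + np.exp(-(theta - b)))
        z = (X - expected) / np.sqrt(expected * (1.0 - expected))
        corr = np.corrcoef(z, rowvar=False)      # item-by-item correlations
        return np.sort(np.linalg.eigvalsh(corr))[::-1]

    # Hypothetical responses from five persons to four items.
    X = np.array([[1, 0, 0, 1],
                  [1, 1, 0, 0],
                  [1, 1, 1, 0],
                  [0, 1, 1, 1],
                  [1, 0, 1, 1]])
    eigenvalues = residual_principal_components(
        X, theta=[-1.0, -0.5, 0.0, 0.5, 1.0], b=[-1.0, -0.3, 0.3, 1.0])
    print(np.round(eigenvalues, 2))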
Evidence of the Generalizability Aspect
Wolfe and Smith (2007b) explained “the generalizability aspect of valid-
ity addresses the degree to which measures maintain their meaning across
measurement contexts” (p. 215). For example, consider an item for which
the success rate does not differ between males and females. A lack of this
property of an item is referred to as differential item functioning (DIF). Test-
ing for DIF typically proceeds by generating an Item Characteristic Curve
and plotting observed scores for class intervals of groups of persons of inter-
est. Figure 3 displays this information for Item 35 (“My test scores are high”)
from the Survey of Student Engagement in Classroom Learning (Cavanagh,
2012). When the observed responses of boys and girls with the same engage-
ment level are compared, the more highly engaged boys responded more af-
firmatively than the more highly engaged girls (F = 15.05, p < .001). The item
has functioned differently for males and females.
A similar approach for gathering generalizability evidence could be used
with TPACK models and data. Ideally, there should be no difference in
scores for a TPACK item between groups of teachers with the same overall
score, such as between groups of male and female teachers, city and rural
teachers, or experienced and inexperienced teachers. This does not negate
the overall instrument discriminating between different groups; it merely
avoids bias at the item level.
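A rough sketch of such a check follows (Python, assuming numpy; the data are hypothetical and far smaller than any real DIF analysis would require): persons are grouped into strata of similar overall score, and the item’s mean score is compared between groups within each stratum.

    import numpy as np

    def dif_by_strata(item_scores, total_scores, group, n_strata=4):
        """Within strata of similar total score, compare an item's mean score
        between two groups. Large, consistent differences suggest DIF."""
        item_scores = np.asarray(item_scores, dtype=float)
        total_scores = np.asarray(total_scores, dtype=float)
        group = np.asarray(group)
        cuts = np.quantile(total_scores, np.linspace(0, 1, n_strata + 1)[1:-1])
        strata = np.digitize(total_scores, cuts)
        differences = {}
        for s in np.unique(strata):
            in_stratum = strata == s
            mean_a = item_scores[in_stratum & (group == "M")].mean()
            mean_b = item_scores[in_stratum & (group == "F")].mean()
            differences[int(s)] = mean_a - mean_b
        return differences

    # Hypothetical scores on one item, overall TPACK totals, and gender.
    item = [1, 0, 1, 1, 0, 1, 0, 1]
    total = [30, 22, 35, 40, 20, 38, 25, 33]
    gender = ["M", "F", "F", "M", "M", "F", "F", "M"]
    print(dif_by_strata(item, total, gender, n_strata=2))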
Evidence of the External Aspect
The relation between a measure and an external measure of a similar con-
struct can show the external aspect of validity. For example, the developers
of the TTF TPACK Survey acknowledged the importance of using exter-
nal measures: “As with all self-report instruments, data collected with this
instrument should be complemented with other data collection methodolo-
gies to overcome the limitations associated with self-report instruments”
Figure 3. Item characteristic curve for Item 35 (N = 1,745).
(Jamieson-Proctor, Finger, Albion, Cavanagh, Fitzgerald, Bond, & Grim-
beek, 2012, p. 9). For similar reasons, Harris et al. (2010; 2012) assessed the
quality of TPACK through examination of detailed written lesson plans and
also semi-structured interviews of teachers. However, the extent to which a
second measure can be independent of the first is difficult to establish, par-
ticularly when both measures share a common construct model or measure
a similar construct.
Evidence of the Consequential Aspect
The consequential aspect of validity centers on judgments about how the
score interpretations might be of consequence. When measures are used
in high-stakes testing, the consequences for students, teachers, and schools
can be significant and sometimes the source of serious concern. Measuring
TPACK is unlikely to have such consequences, but applications that compare
teachers against one another or against benchmarks for performance man-
agement purposes could be seen as less benign. TPACK researchers should
consider potential consequences, and such consideration is further evidence
for establishing consequential validity.
Evidence of the Interpretability Aspect
The interpretability aspect of validity concerns the qualitative interpretation
of a measure in terms of how well its meaning was communicated. Figures
and graphical displays can assist the reader in understanding the meaning
of an instrument and the properties of its data. The TTF TPACK Survey was
developed to test for change in TPACK in Australian preservice teachers
who were provided with six months of specialized instruction in technol-
ogy integration. The results of this testing were presented as graphics such as
Figure 4 (Finger et al., 2012, p. 12). This is an item-by-item display of scores
from the first survey administration and of scores from the second survey
administration for the confidence items. Rasch Model equating procedures
have enabled all the scores to be plotted on the same scale. The improvement
in scores for all the items is obvious.
Another useful display is an item map that plots the difficulty of items
and the ability of persons on the same scale. Figure 5 is the item map for a
scale measuring student engagement and classroom learning environment
(Cavanagh, 2012, p. 9). The scale is marked in logits from 3.0 to -3.0. The
student scores are located on the scale, and × indicates 10 students. The stu-
dents with the most affirmative views are located toward the top of the dis-
tribution. The location of an item shows the difficulty students experienced
in affirming the item. The items located toward the top of the distribution
were more difficult to affirm than those below. The items are numbered ac-
cording to their labeling in the instrument. Item 41 (“I start work as soon as
I enter the room”) and Item 48 (“Students do not stop others from work-
ing”) were the most difficult to affirm, whereas Item 7 (“I make an effort”)
was easy to affirm. The relation between student scores and item difficulty
enables predictions to be made about student responses. Students with
locations below 1.0 logits are unlikely to affirm Items 41 and 48. Conversely,
those with locations above 1.0 logits are likely to affirm these items.
For TPACK measurement, the calibration of items as illustrated in the
item map would enable profiling of TPACK for many teachers at different
times and in different situations. It would also accurately show changes in
TPACK over time for individual teachers. The scaling of person scores and
item difficulty scores is essential for constructing an item map; raw scores
are not suitable for this purpose.
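A minimal sketch of how such an item map could be produced from scaled scores follows (Python, assuming numpy; the person measures, item difficulties, and labels are hypothetical, and each x here marks a single person rather than 10).

    import numpy as np

    def text_item_map(person_measures, item_difficulties, item_labels,
                      low=-3.0, high=3.0, step=0.5):
        """Print a rough text item map (Wright map): person measures and
        item difficulties on the same logit scale, persons on the left and
        items on the right, from the highest interval down to the lowest."""
        person_measures = np.asarray(person_measures, dtype=float)
        for upper in np.arange(high, low, -step):
            lower = upper - step
            in_band = (person_measures >= lower) & (person_measures < upper)
            items_here = [label for label, d in zip(item_labels, item_difficulties)
                          if lower <= d < upper]
            print(f"{lower:+5.1f} | {'x' * int(in_band.sum()):<12} | {', '.join(items_here)}")

    # Hypothetical person measures (logits), item difficulties, and item labels.
    persons = np.random.default_rng(1).normal(0.5, 1.0, size=60)
    difficulties = [-1.8, -0.6, 0.2, 1.1, 2.3]
    labels = ["Item 7", "Item 12", "Item 20", "Item 41", "Item 48"]
    text_item_map(persons, difficulties, labels)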
Figure 5. Item map for engagement and learning environment items.
Figure 4. Confidence to facilitate student use.
A Checklist for Researchers
The preceding sections have examined seven aspects of validity evidence,
and, where possible, examples of TPACK measurement were used to illus-
trate these aspects and situate them within the epistemology and methodol-
ogy of TPACK research. Table 3 lists the definitions of the seven aspects to
provide a tabular recount of the key considerations in mounting an argu-
ment for validity. The table could be used as a checklist for TPACK research-
ers to assess the validity of their research, either a priori when designing
TPACK measures or post hoc to evaluate existing TPACK measures.
The use of the checklist requires comment. First, it is more than a simple
seven-item list; the content exemplifies contemporary understandings of
validity and validity theory. The underlying approach and its major features
have been explained in this paper, but this explication has been limited.
Users of the table would likely benefit by consulting some of the original
sources referenced in the text. Second, statistics such as correlation coef-
ficients or the results of an exploratory factor analysis are often put forward
as proof of validity. Statistical evidence is just one aspect of an argument
for validity, and an argument relying on only this form of evidence would
be weak. Third, the application of the checklist should focus on the availability
of evidence rather than simply whether attention has been given to each
particular aspect, although this would be a useful starting point. The notion
of validity being an argument requires the provision of evidence to convince
others, and the checklist is simply a vehicle for stimulating and organiz-
ing evidence collection. Fourth, the availability of extensive evidence of all
seven aspects is an optimal situation and, in reality, not attainable in many
educational studies. This limitation is methodological and mainly centered
on the instrument specification process within the content aspect. The use
of measurement models that examine the properties of data at the level of
individual items and persons can ensure instrument specification complies
with the content evidence requirements. Detailed and persuasive evidence is
available when Item Response Theory and Rasch Model methods are used.
While the iterative nature of instrument construction might suggest
that the sequencing of the seven aspects could be varied, there are some
strong reasons for commencing with the content aspect. The rationale for
this view derives from a scientific approach to educational research, includ-
ing TPACK research, that is very consistent with Messick’s (1995) view of
validity. In both, primacy is given to substantive theory informing decisions
about instrumentation. The research is driven by theory rather than theory
being generated from existing data; in terms of validity, specification of the
construct model, particularly the construct map, precedes selection of data
collection methods and analyses. When the checklist is used post hoc, this
matter is more important for principled rather than pragmatic reasons.
However, when using the checklist a priori at the commencement of a study,
substantive theory and the findings of previous research require clarification
before progressing to methodological decisions. In this situation, the order
of the seven aspects is important.
The final consideration in the use of the checklist is that it is neither
exhaustive nor the only way to conceptualize an argument for validity. For
example, in the hard sciences, where causal relations exist between variables,
the dominant form of validity is predictive validity. Notwithstanding, we
believe that an argument is required, and this needs to reflect all aspects of
an instrument development process or of an empirical investigation.
Conclusion
One purpose of this paper was to stimulate discussion about the validity of
TPACK measures and measurement. A second purpose was to use contem-
porary validity theory as a framework to examine the principles and prac-
tices applied when dealing with validity issues in TPACK measurement. The
analysis suggests several types of validity evidence that are not characteristic
of current TPACK measurement activities, and that identification of these
factors could provide the impetus for improvement of TPACK measurement.
Table 3. A Checklist of Validity Evidence

Aspect of evidence | Definition
1. Content | The relevance and representativeness of the content upon which the items are based and the technical quality of those items
   Purpose | Domain of inference; types of inferences; potential constraints and limitations
   Instrument specification | Construct selection; construct model; construct map; item development; scoring model; scaling model; item technical quality
2. Substantive | The degree to which theoretical rationales relating to both item content and processing models adequately explain the observed consistencies among item responses
3. Structural | The fidelity of the scoring structure to the structure of the construct domain
4. Generalizability | The degree to which score properties and interpretations generalize to and across population groups, settings, and tasks, as well as the generalization of criterion relationships
5. External | What has traditionally been termed convergent and discriminant validity, and also concerns criterion relevance and the applied utility of the measures
6. Consequential | The value implications of score interpretation as a basis for action
7. Interpretability | The degree to which qualitative meaning can be assigned to quantitative measures
(Wolfe & Smith, 2007a, p. 99)
In particular, the content and substantive aspects of validity evidence
are especially challenging.
TPACK theory is still in its infancy, as is the measurement of TPACK. It is
timely to consider concerns such as validity from the perspective of main-
stream epistemologies and methodologies. Maturation of TPACK research
and measurement requires nurture and sustenance from well-established
fields of research and methodologies.
Acknowledgment
The authors gratefully acknowledge the assistance of Joshua Rosenberg with the preparation of
this manuscript.
Author Note
Robert F. Cavanagh is a professor in the School of Education at Curtin University, Perth, Austral-
ia. His research interests focus on the measurement of student, teacher, and classroom attributes
conducive to improved learning and instruction. Please address correspondence regarding this ar-
ticle to Rob Cavanagh, School of Education, Curtin University, Kent St., Bentley 6102, Australia.
Email: r.cavanagh@curtin.edu.au.
Matthew J. Koehler is a professor in the College of Education at Michigan State University, East
Lansing. His research interests focus on the design and assessment of innovative learning environ-
ments and the knowledge that teachers need to teach with technology.
References
American Educational Research Association, American Psychological Association, National
Council on Measurement in Education. (1999). Standards for educational and psychological
testing. Washington, DC: American Educational Research Association.
Andrich, D. (1978a). Application of a psychometric rating model to ordered categories which
are scored with successive integers. Applied Psychological Measurement, 2(4), 581–594.
doi:10.1177/014662167800200413
Andrich, D. (1978b). Rating formulation for ordered response categories. Psychometrika,
43(4), 561–573. doi:10.1007/BF02293814
Andrich, D. (1978c). Scaling attitude items constructed and scored in the Likert
tradition. Educational and Psychological Measurement, 38(3), 665–680.
doi:10.1177/001316447803800308
Angeli, C., & Valanides, N. (2009). Epistemological and methodological issues for the
conceptualization, development, and assessment of ICT-TPCK: Advances in technological
pedagogical content knowledge (TPCK). Computers and Education, 52(1), 154–168.
doi:10.1016/j.compedu.2008.07.006
Archambault, L. M., & Barnett, J. H. (2010). Revisiting technological pedagogical content
knowledge: Exploring the TPACK framework. Computers and Education, 55(4), 1656–1662.
doi:10.1016/j.compedu.2010.07.009
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the
human sciences (2nd ed.). Mahwah, NJ: Erlbaum.
Cavanagh, R. F. (2011a). Establishing the validity of rating scale instrumentation in learning
environment investigations. In R. F. Cavanagh, & R. F. Waugh (Eds.), Applications of Rasch
measurement in learning environments research (pp. 77–100). Rotterdam: Sense Publishers.
Cavanagh, R. F. (2011b). Confirming the conceptual content and structure of a
curriculum framework: A Rasch Rating Scale Model approach. Curriculum Perspectives,
31(1), 42–51.
Cavanagh.indd 146 11/2/2013 5:02:19 PM
Volume 46 Number 2
|
Journal of Research on Technology in Education | 147
Specifying Validity Criteria in the Measurement of TPACK
Cavanagh, R. F. (2012). Associations between the classroom learning environment and student
engagement in learning: A Rasch model approach. Paper presented at the meeting of the
Australian Association for Research in Education: Sydney, Australia.
Doering, A., Scharber, C., Miller, C., & Veletsianos, G. (2009). GeoThentic: Designing and
assessing with technology, pedagogy, and content knowledge. Contemporary Issues in
Technology and Teacher Education, 9(3), 316–336. Retrieved from http://www.citejournal.
org/vol9/iss3/socialstudies/article1.cfm
Doig, B., & Groves, S. (2006). Easier analysis and better reporting: Modeling ordinal data in
mathematics education research. Mathematics Education Review Journal, 18(2), 56–76.
doi:10.1007/BF03217436
Finger, G., Jamieson-Proctor, R., Cavanagh, R., Albion, P., Grimbeek, P., Bond, T., Fitzgerald,
R., Romeo, G., & Lloyd, M. (2012). Teaching teachers for the future (TTF) project TPACK
survey: Summary of the key findings. Paper presented at ACEC2012: ITs Time Conference,
Perth, Australia. Available at: http://bit.ly/ACEC2012_Proceedings
Graham, C. R. (2011). Theoretical considerations for understanding technological pedagogical
content knowledge (TPACK). Computers & Education, 57(3), 1953–1960. Retrieved from
http://www.sciencedirect.com/science/article/pii/S0360131511000911
Graham, C., Cox, S., & Velasquez, A. (2009). Teaching and measuring TPACK development
in two preservice teacher preparation programs. In I. Gibson et al. (Eds.), Proceedings of
Society for Information Technology & Teacher Education International Conference 2009 (pp.
4081–4086). Chesapeake, VA: AACE. Retrieved August 19, 2013, from http://www.editlib.
org/p/31297
Guzey, S. S., & Roehrig, G. H. (2009). Teaching science with technology: Case studies of
science teachers’ development of Technological Pedagogical Content Knowledge (TPCK).
Contemporary Issues in Technology and Teacher Education, 9(1), 25–45. AACE. Retrieved
August 18, 2013 from http://www.editlib.org/p/29293
Harris, J., Grandgenett, N., & Hofer, M. (2010). Testing a TPACK-Based Technology Integration
Assessment Rubric. In D. Gibson & B. Dodge (Eds.), Proceedings of Society for Information
Technology & Teacher Education International Conference 2010 (pp. 3833–3840). Chesapeake,
VA: AACE. Retrieved August 18, 2013, from http://www.editlib.org/p/33978
Harris, J., Grandgenett, N., & Hofer, M. (2012). Using structured interviews to assess
experienced teachers’ TPACK. In P. Resta (Ed.), Proceedings of Society for Information
Technology & Teacher Education International Conference 2012 (pp. 4696–4703).
Chesapeake, VA: AACE. Retrieved from http://www.editlib.org/p/40351
Hawthorne, G., Mouthaan, J., Forbes, D., & Novaco, R. W. (2006). Response categories and anger
measurement: Do fewer categories result in poorer measurement? Development of the DAR5.
Social Psychiatry Psychiatric Epidemiology, 41(2), 164–172. doi:10.1007/s00127-005-0986-y
Jamieson-Proctor, R., Finger, G., Albion, P., Cavanagh, R., Fitzgerald, R., Bond, T., &
Grimbeek, P. (2012). Teaching Teachers for the Future (TTF) project: Development of the TTF
TPACK survey instrument. Paper presented at ACEC2012: ITs Time Conference, Perth,
Australia. Available at: http://bit.ly/ACEC2012_Proceedings
Koehler, M. J., & Mishra, P. (2008). Introducing TPCK. In AACTE Committee on Technology
and Innovation (Ed.), Handbook of technological pedagogical content knowledge (TPCK) for
educators (pp. 3–29). London: Routledge.
Koehler, M. J., Shin, T. S., & Mishra, P. (2011). How do we measure TPACK? Let me count
the ways. In R. N. Ronau, C. R. Rakes, & M. L. Niess (Eds.), Educational technology, teacher
knowledge, and classroom impact: A research handbook on frameworks and approaches (pp.
16–31). Hershey, PA: Information Science Reference.
Kulas, J. T., & Stachowski, A. A. (2001). Respondent rationale for neither agreeing nor
disagreeing: Person and item contributors to middle category endorsement intent on
Likert personality indicators. Journal of Research in Personality, 47, 254–262. doi: 10.1016/j.
jrp.2013.01.014
Linacre, J. M. (2009). Winsteps (Version 3.68) [Computer Software]. Beaverton, OR: Winsteps.
com.
Medical Outcomes Trust Scientific Advisory Committee. (1995). Instrument review criteria.
Medical Outcomes Trust Bulletin, 3, 1–4.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’
responses and performances as scientific inquiry into score meaning. American Psychologist,
50(9), 741–749. doi:10.1037/0003-066X.50.9.741
Messick, S. (1998). Test validity: A matter of consequences. Social Indicators Research, 45(4),
35–44. doi:10.1023/A:1006964925094
Mishra, P., & Koehler, M. J. (2006). Technological pedagogical content knowledge: A
framework for teacher knowledge. Teachers College Record, 108(6), 1017–1054. doi:10.1111/
j.1467-9620.2006.00684.x
Preston, C. C., & Colman, A. M. (2000). Optimal number of response categories in rating
scales: Reliability, validity, discriminating power, and respondent preferences. Acta
Psychologica, 104(1), 1–15. doi:10.1016/S0001-6918(99)00050-5
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago:
MESA Press.
Roblyer, M. D., & Doering, A. H. (2010). Integrating educational technology into teaching (5th
ed.). Boston, MA: Allyn & Bacon.
RUMMLab. (2007). RUMM2020 Rasch Unidimensional Measurement Models. RUMM
Laboratory Pty Ltd.
Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational
Researcher, 15(2), 4–14. doi:10.3102/0013189X015002004
Shulman, L. S. (1987). Knowledge and teaching: Foundations of the new reform. Harvard
Educational Review, 57(1), 1–22. Retrieved from http://hepg.org/her/abstract/461
Schmidt, D. A., Baran, E., Thompson, A. D., Mishra, P., Koehler, M. J., & Shin, T. S. (2009).
Technological pedagogical content knowledge (TPACK): The development and validation
of an assessment instrument for preservice teachers. Journal of Research on Technology in
Education, 42(2), 123–149.
Shernoff, D. J. (2010). The experience of student engagement in high school classrooms.
Saarbrucken, Germany: Lambert Academic Publishing.
Wiggins, G., & McTighe, J. (1998). Understanding by design. Alexandria, VA: Association for
Supervision and Curriculum Development.
Wiggins, G., & McTighe, J. (2005). Understanding by design (2nd ed.). Alexandria, VA:
Association for Supervision and Curriculum Development.
Wilson, M. (2010). Constructing measures: An item response approach. New York: Routledge.
Wolfe, E.W., & Smith, E.V. (2007a). Instrument development tools and activities for
measure validation using Rasch models: Part I–instrument development tools. Journal
of Applied Measurement, 8(1), 97–123. Retrieved from http://www.ncbi.nlm.nih.gov/
pubmed/17215568
Wolfe, E. W., & Smith, E. V. (2007b). Instrument development tools and activities for
measure validation using Rasch models: Part II–validation activities. Journal of
Applied Measurement, 8(2), 204–234. Retrieved from http://www.ncbi.nlm.nih.gov/
pubmed/17440262
Young, A., & Cavanagh, R. F. (2011). An investigation of differential need for psychological
services across learning environments. In R. F. Cavanagh & R. F. Waugh. (Eds.),
Applications of Rasch measurement in learning environments research (pp. 227–244).
Rotterdam: Sense Publishers. ISBN 978-94-6091-491-1.
Manuscript received July 12, 2013 | Initial decision July 30, 2013 | Revised manuscript accepted August 29, 2013