JOURNAL OF RESEARCH IN SCIENCE TEACHING VOL. 38, NO. 2, PP. 260–278 (2001)
Comparison of the Reliability and Validity of Scores from
Two Concept-Mapping Techniques
Maria Araceli Ruiz-Primo,
Susan E. Schultz, Min Li, and Richard J. Shavelson
Stanford University/CRESST, School of Education, 485 Lasuen Mall, Stanford, CA
Received 1 September 1999; accepted 31 August 2000
Abstract: This paper reports the results of a study that compared two concept-mapping techniques,
one high-directed, "fill-in-the-map," and one low-directed, "construct-a-map-from-scratch." We examined
whether: (1) skeleton map scores were sensitive to the sample of nodes or linking lines to be filled in; (2) the
two types of skeleton maps were equivalent; and (3) the two mapping techniques provided similar infor-
mation about students' connected understanding. Results indicated that fill-in-the-map scores were not
sensitive to the sample of concepts or linking lines to be filled in. Nevertheless, the fill-in-the-nodes and fill-
in-the-lines techniques were not equivalent forms of fill-in-the-map. Finally, high-directed and low-directed
maps led to different interpretations about students' knowledge structure. Whereas scores obtained under
the high-directed technique indicated that students' performance was close to the maximum possible, the
scores obtained with the low-directed technique revealed that students' knowledge was incomplete com-
pared to a criterion map. We concluded that the construct-a-map technique better reflected differences
among students' knowledge structure. © 2001 John Wiley & Sons, Inc. J Res Sci Teach 38: 260–278, 2001
Concept maps have been used to assess students' knowledge structure, especially in science
education (Novak, 1990). The justification for assessing students' knowledge structures is based
on the idea that relating concepts that belong to the same domain is an important characteristic
of scientific literacy (e.g., Bybee, 1996; Moore, 1995). Theory and research have shown that
understanding a subject domain such as science is associated with a rich set of relations among
important concepts in the domain (Novak, 1998; Novak & Gowin, 1984; Novak, Gowin, &
Johansen, 1983; Novak & Ridley, 1988). We know, for example, that successful learners develop
elaborate and highly integrated frameworks of related concepts (Mintzes, Wandersee, & Novak,
1997), just as experts do (Chi, Glaser, & Farr, 1988; Glaser, 1991). Research has shown that
highly organized structures facilitate problem solving and other cognitive activities (e.g.,
generating explanations or rapidly recognizing meaningful patterns; Baxter, Elder, & Glaser,
1996; Mintzes et al., 1997), and that differences in the performance of experts and novices are
due, largely, to how knowledge is structured in their memories (Chi et al., 1988; Glaser, 1991).

*Correspondence to: Maria Araceli Ruiz-Primo. E-mail: firstname.lastname@example.org
The original version of this paper was presented at the 1998 AERA Annual Meeting, San Diego, CA.
Concept maps provide a "picture" of how key concepts in a domain are mentally organized
and structured by students. With this assessment technique, students are asked to link pairs of
concepts in a science domain and label the links with a brief explanation of how the two concepts
go together. Although concept maps have been used in large-scale as well as classroom
assessment, a wide variety of techniques are called concept maps, and little is known about the
reliability and validity of scores produced by these different mapping techniques (e.g., Ruiz-
Primo & Shavelson, 1996). We suspect that the observed characteristics of the representation of
a student's knowledge structure depend to a large extent on how the representation is elicited.
Simply put, the method used to ask students to represent their knowledge can affect the
representation they provide as well as the scores they obtain (Ruiz-Primo, Schultz, & Shavelson,
1996; Ruiz-Primo & Shavelson, 1996; Ruiz-Primo, Shavelson, & Schultz, 1997).
Through a series of studies we sought to increase our understanding of how different
mapping techniques affect the representation and interpretation of a student's knowledge
structure. In this paper, we provide reliability and validity evidence on the effects of two mapping
techniques, "fill-in-the-map" and "construct-a-map."
Concept Map Assessment
We define a concept map as a graph in which the nodes represent concepts, the lines between
nodes represent relations, and the labels on the lines represent the nature of the relations. The
combination of two nodes and a labeled line is called a proposition, the fundamental unit of the
map. Our characterization of a concept map assessment as based on its three components (task,
response format, and scoring system) has revealed the enormous variation in mapping
techniques used in research and practice (see Ruiz-Primo & Shavelson, 1996).
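The definition above (nodes, labeled lines, propositions) maps naturally onto a simple data structure. The sketch below is purely illustrative: the class names and the example propositions are ours, not taken from the paper.

```python
# A concept map as a set of propositions, each a (concept, concept, label)
# triple. Names and example content are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposition:
    source: str   # first concept (node)
    target: str   # second concept (node)
    label: str    # linking words naming the relation

concept_map = {
    Proposition("acids", "compounds", "are examples of"),
    Proposition("ions", "compounds", "combine to form"),
}

# Nodes are implied by the propositions; the labeled line is what
# distinguishes a concept map from a plain graph.
nodes = {c for p in concept_map for c in (p.source, p.target)}
print(len(concept_map), sorted(nodes))
```

Making the dataclass frozen makes propositions hashable, so a map can be held as a set and compared against a criterion map by set operations.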
The characteristics of the task, the response format, and the scoring system hold the key to
tapping what concept-map-based assessments are intended to evaluate: knowledge structure (or
"connected understanding," for some authors). The assessment task, for example, can vary in the
constraints (directedness) it imposes on a student in eliciting her representation of structural
knowledge. One dimension along which directedness varies is what is provided for use in the
concept map (Figure 1; see Surber, 1984).
If the characteristics of the assessment task fall on the left extreme, the student's
representation is probably determined more by the mapping technique (or the assessor, if you
will) than by the student's own knowledge or connected understanding. If the assessment task
falls on the right extreme, the student is free to decide which and how many concepts to include
in her map, which concepts are related, and which words to use to explain each relation. Asking
the student to generate the concepts for her map provides valuable information about the
student's knowledge in a particular domain (e.g., are the concepts selected by the student
relevant/essential to the topic?). However, this openness may also be undesirable in practice. In
one of our studies (Ruiz-Primo et al., 1996) we compared two mapping techniques that differed
in whether the concept sample was student-generated or assessor-generated. We found that
under the student-generated condition, some students provided related but not relevant/essential
concepts to the topic. An irrelevant but related concept (e.g., "chemistry" within the topic of
ions, molecules, and compounds) led students to provide many accurate but irrelevant rela-
tionships between concepts within the topic in which students were assessed (e.g., compounds
"is a" concept in chemistry). This situation led to artificially high scores. Furthermore, the
student-generated sample technique proved challenging when developing a scoring system,
since each concept map might have a unique set of concepts and relations.

The characteristics of the assessment task have an impact on the response format and the scoring system. For
example, a task that provides the structure of the map will probably reproduce that structure in the student's
response format. If the task provides the concepts to be used, the scoring system will not focus on the
"appropriateness of the concepts" used in a map. The combination of the task, the response format, and the
scoring system is what determines a mapping technique.
We suspect that the cognitive demands imposed on students by high-directed techniques
differ from those imposed by low-directed techniques. Furthermore, although high-directed
techniques may be responded to and scored quickly, they are also more likely to misrepresent a
student's knowledge structure by imposing a structure on the student's responses. In this study
we examined the reliability and validity of two mapping techniques, one high-directed and the
other low-directed.
Defining the Two Mapping Techniques
Some researchers (e.g., Schau & Mattern, 1997) have argued that asking students to draw a
map from scratch imposes too high a cognitive demand to produce a meaningful representation
of their knowledge. They proposed an alternative technique, "fill-in-the-map." In what follows
we describe both techniques, fill-in-the-map and construct-a-map.
Fill-in-the-Map. The fill-in-the-map technique provides students with a concept map in which
some of the concepts and/or the linking words have been left out. Students fill in the blank nodes
or blank linking lines (e.g., Anderson & Huang, 1989; McClure & Bell, 1990; Schau, Mattern,
Weber, Minnick, & Witt, 1997; Surber, 1984). The response format is straightforward: students
fill in the blanks, and their responses are scored correct–incorrect. Arguments can be made for
the technique (e.g., ease of administration and scoring, and retrieval of propositions from
long-term memory) and against it (e.g., it imposes a structure on a student's knowledge). We
posit that as students' subject matter knowledge increases, the structure of their maps should
increasingly reflect the structure of the domain as held by experts (see Glaser, 1996; Shavelson,
1972, 1974). By imposing a structure on the relations between concepts, the technique makes it
difficult to know whether or not students' knowledge structures are becoming increasingly
similar to experts'. Structure of representation, however, is not the only issue to consider.
Usually, with "fill-in," students are provided with linking words in the skeleton map and select
the concepts from a list of concepts. Therefore, less evidence is gathered on students' connected
understanding. In our research using the construct-a-map technique, we found that the linking
words students used to relate two concepts provided insight into a student's understanding in a
particular content domain (e.g., Ruiz-Primo et al., 1996).
Figure 1. Degree of directedness in the concept map assessment task.
Construct-a-Map From Scratch. The "construct-a-map" technique varies as to how much
information is provided by the assessor (Figure 1). The assessor may provide the concepts and/or
linking words, or may ask students to construct a hierarchical or non-hierarchical map. The
response format is simply a piece of paper on which students construct a map. Scoring systems
vary from counting the number of nodes and linking lines (not recommended) to evaluating the
accuracy of propositions (see Ruiz-Primo & Shavelson, 1996).
This mapping technique, however, has been considered problematic for large-scale
assessment because students need to be trained to use maps, and scoring is difficult and time-
consuming (e.g., Schau et al., 1997). Our research has tried to overcome these two problems (see
Ruiz-Primo et al., 1996, 1997). We designed a 50-minute program to teach students how to
construct concept maps. The program proved effective in achieving this goal with more
than 100 high school students. Moreover, to find an efficient scoring system we have explored
different types of scores: some based only on the propositions, others using a criterion map. Map
propositions can be scored for accuracy and comprehensiveness or simply on whether the
propositions are correct or incorrect. Based on this differentiation we have studied three types of
scores: the proposition accuracy score, the sum of individual proposition scores obtained on a
student's map; the convergence score, the proportion of accurate propositions in the student's
map out of all possible propositions in the criterion map; and the salience score, the proportion
of correct propositions out of all propositions in the student's map. All three score types have
yielded high interrater reliability coefficients, above .90, even when the quality and accuracy of
the propositions is judged.
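The three score types can be sketched in a few lines. This is our own minimal sketch, not the authors' scoring procedure: the toy maps are invented, and treating "accurate" as any proposition scored above 0 is our assumption.

```python
# Sketch of the three map scores. A map is a dict mapping a concept pair
# to its 0-4 quality score from the proposition inventory.

def map_scores(student, criterion):
    """Return (proposition_accuracy, convergence, salience)."""
    # Proposition accuracy: sum of the 0-4 scores on the student's propositions.
    accuracy = sum(student.values())
    # Accurate propositions: assumed here to be those scored above 0.
    accurate = {pair for pair, quality in student.items() if quality > 0}
    # Convergence: accurate propositions as a proportion of all
    # propositions in the criterion map.
    convergence = len(accurate & set(criterion)) / len(criterion)
    # Salience: correct propositions as a proportion of all propositions
    # in the student's own map.
    salience = len(accurate) / len(student)
    return accuracy, convergence, salience

# Invented example data (not from the study).
criterion = {("acids", "compounds"): 4, ("ions", "compounds"): 4, ("acids", "ions"): 3}
student = {("acids", "compounds"): 3, ("acids", "ions"): 0}
scores = map_scores(student, criterion)
print(scores)  # accuracy 3, convergence 1/3, salience 1/2
```

Note how the two proportion scores use different denominators: convergence is normalized by the criterion map, salience by the student's own map, which is why they can diverge for students who draw many links.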
This study explored the technical characteristics of the "fill-in-the-map" and "construct-a-
map" techniques. More specifically, we examined whether: (a) the two mapping techniques can
be considered equivalent; (b) fill-in-the-map scores are sensitive to the nodes (concepts) selected
to be filled in (construct-a-map scores have proven not to be sensitive to the sample of concepts
used; Ruiz-Primo et al., 1996); and (c) fill-in-the-map scores are sensitive to the linking lines
(linking words) selected to be filled in.
One hundred and fifty-two high school chemistry students and two chemistry teachers from
the Bay Area participated in the study. Seventy-three of the students were male and 79 were
female. The majority of the students were Caucasian (58.5%), followed by Asian (13.2%),
Hispanic (4.6%), African American (3.3%), and other ethnicities (1.5%; e.g., Indian). The rest of
the students, about 19%, did not provide information about their ethnicity. The ethnic
distribution was consistent across the seven classes.
Standardized Testing and Reporting (STAR) test scores in science, language, and
expression were collected for about 51% of the students. The remaining students either did
not permit review of their academic files or did not take the STAR test. The test was taken the
same year in which the data for this study were collected.
Students were in one of seven chemistry classes. Classes 1, 2, 5, 6, and 7 were taught by
Teacher 1 (6 years of teaching experience), and Classes 3 and 4 by Teacher 2 (1 year of teaching
experience). Classes 1, 2, 3, and 4 (96 students) were considered advanced; the remainder (56
students) were regular chemistry classes.
Students and teachers were trained to construct concept maps, including the fill-in-the-map
technique, with the same 50-minute training program used in previous studies (Appendix A; see
Ruiz-Primo et al., 1996, 1997). To evaluate the training, 25% of the maps constructed by
students at the end of the training session were randomly sampled and analyzed. The analysis
focused on whether students used the concepts provided on the list, labeled the lines,
and provided accurate propositions. Results indicated that 92% of the students used all the
concepts provided in the list, all used labeled lines, and all provided four or more accurate
propositions. We concluded that the program succeeded in teaching students to construct
concept maps.
Selection of Concepts and Development of the Criterion/Skeleton Maps
To identify the structure of the skeleton map for the fill-in mapping technique, we assumed
that: (1) there is some "agreed-upon organization" that adequately reflects the structure of a
content domain; (2) "experts" in that domain (in this context, the teachers who participated in
the study) can agree on the structure; and (3) experts' concept maps provide a reasonable
representation of the subject domain (e.g., Glaser, 1996). Therefore, the skeleton maps used were
based on the criterion map.

We chose the topic "Chemical Names and Formulas" as the domain for sampling the
concepts used in the study. The two teachers of the seven classes and the researchers (the second
author was a high school chemistry teacher for 10 years) were involved in the process of
selecting the concepts and creating the criterion map. Teachers were asked to identify the
concepts they considered to be the most important in the unit. Researchers also selected the most
important concepts by carefully reviewing the text used to teach the topic. Appendix B provides
a brief description of the procedure followed to select the key concepts and to define the criterion
map (for details see Ruiz-Primo et al., 1996).
The links agreed upon across teachers' and researchers' maps were represented in the
criterion map and considered the "substantial" links that students were expected to know after
instruction on the topic (Figure 2). The criterion map was used as the master map for the purpose
of constructing the four skeleton maps. Concepts selected for the blank nodes on the skeleton
maps were randomly sampled from the key-concept list. Linking lines selected to be filled in on
the skeleton maps were sampled from the linking lines on the criterion map. Propositions
provided in the skeleton maps were taken from the criterion map. The concepts for the construct-
a-map technique were all those on the key-concept list.
To evaluate whether the fill-in-the-map scores were sensitive to the sample of nodes or
linking lines to be filled in, we used a 2 × 2, node (concept) sample by linking-line sample
design. Four 20-node skeleton maps were constructed. In two of the maps, 12 nodes (60% of the
nodes) were left blank. In the other two skeleton maps, 12 linking lines (31.5% of the linking
lines in the criterion map) were left blank (i.e., no linking words). Concepts and linking lines to
be left blank were randomly selected from the list of key concepts and the list of propositions in a
criterion map. The four skeleton maps were as follows: A, skeleton map with Sample 1 of
nodes left blank; B, skeleton map with Sample 2 of nodes left blank; C, skeleton map with
Sample 1 of linking lines left blank; and D, skeleton map with Sample 2 of linking lines left
blank (Figure 3).

Although we used this topic in previous studies, the selection of concepts for mapping was carried out again
since different teachers participated on this occasion.
Within each of the seven classes, students were randomly assigned to one of four sequences
of skeleton maps: Sequence 1, skeleton map A followed by skeleton map C; Sequence 2,
skeleton map A followed by skeleton map D; Sequence 3, skeleton map B followed by skeleton
map C; and Sequence 4, skeleton map B followed by skeleton map D. Students were tested on
four occasions (Table 1): On Occasion 1, all students constructed a concept map from scratch
using the 20 concepts provided by the assessors. On Occasion 2, half the students filled in
skeleton map A and half filled in skeleton map B. On Occasion 3, half the students filled in
skeleton map C and half filled in skeleton map D. On Occasion 4, all students received a 30-item
multiple-choice test designed by the teachers and researchers.
The two mapping techniques varied in their task demands and constraints on students. Table
2 provides a profile of the directedness of the assessment tasks across techniques. The construct-
a-map technique asked students to construct a map using the 20 concepts provided by the
assessor. Students were encouraged to provide detailed propositions (linking words) to explain
the relationship between the two concepts they were linking. No restriction was imposed on the
type of structure students could use in the map (e.g., students were not instructed to create a
hierarchical map).

Figure 2. Criterion map.
Figure 3. Fill-in-the-nodes, map A (top map), and fill-in-the-lines, map C (bottom map), skeleton maps.
The fill-in-the-map technique asked students to fill in two skeleton maps, one with blank
nodes and the other with blank linking lines. After random selection of nodes, seven nodes were
the same on skeleton map A and skeleton map B. For the blank-linking-lines maps, only two
propositions were the same across skeleton map C and skeleton map D. Students' responses on
each skeleton map were scored as correct or incorrect. A maximum of 12 points could be
awarded to each student on each skeleton map.
As in previous studies, to score students' constructed maps we developed a proposition
inventory to account for variation in the quality of students' propositions. This inventory
contained the 190 possible relations between all pairs of concepts in the key-concept list. Based
on this inventory, each proposition was scored on a 5-point scale, from 0 for inaccurate/incorrect
to 4 for excellent/outstanding. Table 3 provides the definitions of the categories and one example
from the proposition inventory. For example, the accurate excellent proposition between acids and
Table 1
Design of the study: sequences and occasions

Sequence  Occasion 1       Occasion 2                           Occasion 3                           Occasion 4
1         Construct-a-map  Fill-in-the-nodes: Sample 1 (Map A)  Fill-in-the-lines: Sample 1 (Map C)  Multiple-choice test
2         Construct-a-map  Fill-in-the-nodes: Sample 1 (Map A)  Fill-in-the-lines: Sample 2 (Map D)  Multiple-choice test
3         Construct-a-map  Fill-in-the-nodes: Sample 2 (Map B)  Fill-in-the-lines: Sample 1 (Map C)  Multiple-choice test
4         Construct-a-map  Fill-in-the-nodes: Sample 2 (Map B)  Fill-in-the-lines: Sample 2 (Map D)  Multiple-choice test
Table 2
Directedness profile of the mapping techniques

Technique          Concepts                       Linking Lines    Linking Words                  Structure of the Map
Construct-a-map    Provided in a list: students   Not provided     Not provided                   Not provided
                   use the concepts in the
                   list for constructing the map
Fill-in-the-nodes  Provided in a list: student    Provided in the  Provided in the                Provided in the
                   selects, from the list, the    skeleton map     skeleton map                   skeleton map
                   concept to fill in a node
Fill-in-the-lines  Provided in the                Provided in the  Provided in a list: student    Provided in the
                   skeleton map                   skeleton map     selects, from the list, the    skeleton map
                                                                   linking words to fill in a line
compounds should be read, according to the direction of the arrow (<), as follows: compounds
that give off H+ when dissolved in water are acids. The maximum score for a map constructed by
students was based on the criterion map: the number of links in the criterion map (38) was
multiplied by 4 (assuming all propositions were scored as excellent), for a maximum of 152
points.
The multiple-choice test was designed by both teachers and researchers. The test items were
both conceptual and mechanical (see Appendix C for example items). A maximum of 30 points
could be awarded to each student on this test. The internal consistency of the multiple-choice
test was .74.
We examined whether the: (1) skeleton map scores were sensitive to the sample of nodes or
linking lines left blank; (2) two forms of skeleton maps were equivalent; and (3) two mapping
techniques provided similar information about students' connected understanding. Results are
organized in three sections: fill-in-the-map technique, construct-a-map technique, and com-
parison across techniques.
Before focusing on a detailed discussion of the results, variation between classrooms needs
to be addressed. We compared the seven classes using two measures, the STAR-science and the
multiple-choice test scores. The one-way ANOVA results indicated a significant difference
between groups on both measures (for STAR-science, F = 3.98, p = .002; the multiple-choice
test difference was also significant). Tukey's HSD (p = .05) indicated that differences in the
STAR-science mean scores were due to Classes 5 and 6, which both differed significantly only
from Class 2. Differences in the multiple-choice test were due only to the difference between
Class 6 and Class 2. Moreover, a split-plot ANOVA, type of skeleton map (T) by sequence (S) by
class (C), indicated no significant interaction of class with any other within- or between-subjects
factor (F = 1.95, p = .08; F = .90, p = .58). Since no other differences were found across the
advanced and regular classes, for simplicity and brevity we decided to collapse the seven classes
and present overall results. However, all statistical analyses were also run by class, and those
results, available from the authors, do not change the conclusions reported herein.
Finally, no significant mean differences by gender or ethnicity were found in any of the
comparisons made. Therefore, analyses considering these two variables are not presented.
Table 3
Quality-of-proposition categories

Quality of Proposition   Description and Example
Excellent (4)            Outstanding proposition. Complete and correct. It shows a deep
                         understanding of the relation between the two concepts.
                         acids–compounds: < that give off H+ when dissolved in water are
Good (3)                 Complete and correct proposition. It shows a good understanding of
                         the relation between the two concepts.
                         acids–compounds: > are examples of
Poor (2)                 Correct but incomplete proposition. It shows partial understanding
                         of the relation between the two concepts.
Don't Care (1)           Although accurate, the proposition does not show understanding of
                         the relationship between the two concepts.
                         acids–compounds: > is a different concept
Inaccurate/Invalid (0)   Incorrect proposition.
                         acids–compounds: > made of
In this section we focus on the fill-in-the-map (skeleton map) technique by assessing, first,
whether the fill-in-the-map scores are sensitive to the sample of nodes or linking lines selected
for the skeleton map and, second, whether the two types of skeleton maps, fill-in-the-nodes and
fill-in-the-lines, can be considered equivalent mapping techniques.

For each of the four skeleton maps developed, two fill-in-the-nodes and two fill-in-the-lines,
we calculated the internal consistency. On average, the alpha for the fill-in-the-nodes maps was
.71 and for the fill-in-the-lines maps was .85.
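The internal consistencies reported here are coefficient alpha computed over the 12 dichotomously scored blanks. A minimal sketch of that computation, with fabricated 0/1 response data (not the study data):

```python
# Coefficient alpha for correct/incorrect (0/1) fill-in responses.
# With dichotomous items this is equivalent to KR-20.

def cronbach_alpha(rows):
    """rows: one list of 0/1 item scores per student, all the same length."""
    k = len(rows[0])                       # number of blanks (items)

    def var(xs):                           # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([r[i] for r in rows]) for i in range(k)]
    total_var = var([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Fabricated data: 4 students x 4 blanks.
scores = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
alpha = cronbach_alpha(scores)
print(round(alpha, 3))
```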
Comparing Fill-in-the-Map Scores Across Samples of Concepts and Linking Lines. To
determine whether the fill-in-the-map scores were sensitive to the sample of nodes (concepts) or
linking lines (propositions) left blank, we compared the means and variances of scores between
skeleton maps A and B (with blank nodes) and between skeleton maps C and D (with blank
linking lines).

The mean scores and standard deviations for the fill-in-the-nodes skeleton maps A and B and
the fill-in-the-lines skeleton maps C and D are presented in Table 4. Overall, students' performance
across the two types of skeleton maps and samples was high. However, it was higher for fill-in-
the-nodes maps than for fill-in-the-lines maps. An independent-samples t-test indicated no
significant difference between the two concept-sample means (t = 1.57, p = .12) or the two
linking-line-sample means (t = 1.64, p = .10). The Levene test indicated that variances were not
homogeneous across samples (F = 6.77 and F = 2.16, p < .20). However, since the
interquartile range across samples was the same or very similar (nodes: Sample 1, IQR = 2.00,
and Sample 2, IQR = 2.00; linking lines: Sample 1, IQR = 4.00, and Sample 2, IQR = 6.00), we
concluded that both samples of nodes and linking lines were equivalent and that students' scores
were not affected by the particular sample used in the skeleton maps. Similar results using
different samples of concepts for constructing a map were found in one of our previous studies
(Ruiz-Primo et al., 1996).
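The equivalence check above compares both location (means) and spread (interquartile ranges). A small sketch of how such a check can be computed; the score vectors below are fabricated for illustration, not the study data:

```python
# Compare two samples of fill-in scores on mean and IQR.
# Data are fabricated; scores range 0-12 as in the skeleton maps.
from statistics import mean, quantiles

def iqr(xs):
    # quantiles(n=4) returns the three quartile cut points
    # (exclusive method by default in Python's statistics module).
    q1, _, q3 = quantiles(xs, n=4)
    return q3 - q1

sample_a = [12, 11, 12, 10, 11, 9, 12, 10]
sample_b = [11, 12, 10, 11, 9, 12, 11, 10]

print(mean(sample_a) - mean(sample_b))  # difference in means
print(iqr(sample_a), iqr(sample_b))     # spread of each sample
```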
Comparing Fill-in-the-Nodes and Fill-in-the-Lines Skeleton Maps. For the fill-in-the-
nodes and fill-in-the-lines techniques to be considered equivalent, they at least need to produce
similar means and variances. We carried out a 2 × 4 (skeleton map type by sequence) split-plot
ANOVA to evaluate whether the type of skeleton map (i.e., fill-in-the-nodes or fill-in-the-lines)
Table 4
Means and standard deviations by type of skeleton map and sample

Type of skeleton map   n    Mean (max. = 12)   SD
Fill-in-the-nodes
  Sample 1, Map A      80   11.21              1.43
  Sample 2, Map B      72   10.81              1.74
Fill-in-the-lines
  Sample 1, Map C      78    9.77              2.74
  Sample 2, Map D      73    8.99              3.09
and the sequence in which students took the different forms of skeleton maps (e.g., skeleton map
A followed by skeleton map C, or skeleton map A followed by skeleton map D) affected their
scores.

Table 5 provides the mean scores and standard deviations for each type of skeleton map and
sequence. As mentioned before, mean scores were higher for fill-in-the-nodes than fill-in-the-
lines maps, independent of the sequence in which students took the assessments.
ANOVA results indicated a significant interaction between type of skeleton map (T) and
sequence (S) (F = 2.73, p = .046, η² = .05) and a significant difference for type of map
(F = 65.95, p = .000, η² = .31), but no significant difference for sequence.

A closer examination of the interaction showed that it was ordinal. The mean difference in
scores between nodes and linking-lines skeleton maps was not statistically significant for
students under Sequence 3 (F = 4.06, p = .052), whereas it was significant for students under the
other three sequences (e.g., F = 16.50, p = .000, and F = 26.57, p = .000).
Filling in the nodes using Map B somehow facilitated filling in the lines when Map C was
used. A closer look into the skeleton maps revealed that the propositions students needed to read
for filling in the nodes (Map B) and filling in the lines (Map C) overlapped more in Sequence 3
than in any other sequence.
For the purposes of the study, however, a more important result was the one related to the
differences between the two types of skeleton maps, fill-in-the-nodes and fill-in-the-lines. The
split-plot ANOVA indicated that the means differed significantly. The magnitude of η² indicated
a large effect due to type of skeleton map (about 31% of the variance was accounted for by this
factor). Furthermore, an F test indicated that the variances of the two types of maps also
differed (F = 3.35, p < .05). We concluded that fill-in-the-nodes and fill-in-the-lines were
not equivalent forms of skeleton maps. Fill-in-the-nodes maps were easier for students than fill-
in-the-lines maps.
Since the two samples of nodes and linking lines were considered equivalent (Table 4), we
ignored the sample of nodes or linking lines used in the skeleton maps and calculated a pooled-
within-sequence correlation between the fill-in-the-nodes and fill-in-the-lines maps. The
magnitude of the pooled correlation was .56, suggesting that students were ranked somewhat
differently across the two types of maps. However, the magnitude of the correlation may have
been lowered by the restriction of range observed in the fill-in-the-nodes maps. The correlation
corrected for attenuation was .72.
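The disattenuation step follows the classical correction r_corrected = r_observed / sqrt(rel_x · rel_y). Plugging in the observed correlation (.56) and the average alphas reported earlier (.71 for fill-in-the-nodes, .85 for fill-in-the-lines) reproduces the reported value; the function name below is ours:

```python
# Classical correction for attenuation: divide the observed correlation
# by the square root of the product of the two reliabilities.
from math import sqrt

def correct_for_attenuation(r_xy, rel_x, rel_y):
    return r_xy / sqrt(rel_x * rel_y)

r = correct_for_attenuation(0.56, 0.71, 0.85)
print(round(r, 2))  # 0.72, matching the value reported in the text
```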
Table 5
Means and standard deviations by type of skeleton map and sequence

                             Fill-in-the-nodes   Fill-in-the-lines
Sequence                n    Mean     SD         Mean     SD
1 (Nodes 1 – Lines 1)   43   11.09    1.52        9.72    2.84
2 (Nodes 1 – Lines 2)   36   11.33    1.33        9.31    3.06
3 (Nodes 2 – Lines 1)   35   10.63    1.82        9.83    2.65
4 (Nodes 2 – Lines 2)   37   10.97    1.67        8.68    3.14
Total                   151  11.01    1.60        9.39    2.93

Note. One student did not fill out the second skeleton map.
In this section we examine the consistency of scores across raters for the construct-a-map
technique, characterize students' constructed maps, and compare types of scores.

Interrater Reliability. All constructed maps were scored for accuracy and compre-
hensiveness. For each student we calculated a proposition accuracy score, the sum of the
scores obtained on all propositions; a convergence score, the proportion of accurate propositions
in a student's map out of all possible propositions in the criterion map; and a salience score, the
proportion of valid propositions out of all the propositions in the student's map.
A sample of 55 students' maps (more than a third of the sample) was scored by three raters.
To examine the generalizability of scores across raters, three person (p) by rater (r) G studies
were carried out, one for each type of score (Table 6).

Raters introduced negligible error. Both relative (ρ²) and absolute (Φ) coefficients were
very high across types of scores. Based on these results, the remaining 97 concept maps were
randomly distributed among the three raters, and only one rater scored each map. The
randomization was done within each of the seven classes; thus, all three raters scored a sample of
students' maps across the seven classes.
Students' Maps. Table 7 provides information about the characteristics of students' con-
structed maps. Two-thirds of the students used all 20 concepts provided in the list to construct
their maps. Another fifth used 18–19 concepts, and only one student used just 14 concepts.

A surprising finding was that 6.6% of the students provided more than 38 links in their maps,
which is the number of links in the criterion map. Furthermore, a few of the students provided
better propositions than those in the criterion map! This led us to re-score the criterion map using
the same criteria applied to students. As a result, some propositions in the criterion map became
"Good" instead of "Excellent," and one proposition became "Poor." The original maximum
score of 152 was corrected to 135.
Table 6
Estimated variance components and generalizability coefficients for person by rater G study across types of scores

                    Proposition accuracy       Convergence                Salience
Source of           Estimated   Percent of     Estimated   Percent of     Estimated   Percent of
variation           variance    total          variance    total          variance    total
                    component   variability    component   variability    component   variability
Persons (p)         290.54      96.26          0.03114     97.65          0.02863     95.15
Raters (r)          0.36        0.12           0.00011     0.34           0.00020     0.66
pr,e                10.92       3.62           0.00064     2.00           0.00126     4.19
ρ² (relative)       .99                        .99                        .98
Φ (absolute)        .99                        .99                        .98
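The coefficients reported in Table 6 follow the standard G-theory formulas for a persons × raters design with scores averaged over the three raters; the sketch below reproduces the proposition accuracy coefficients from the tabled variance components:

```python
def g_coefficients(var_p, var_r, var_pr_e, n_raters):
    """Relative and absolute generalizability coefficients for a
    persons x raters design with scores averaged over n_raters raters."""
    relative = var_p / (var_p + var_pr_e / n_raters)
    absolute = var_p / (var_p + (var_r + var_pr_e) / n_raters)
    return relative, absolute

# Variance components for proposition accuracy scores (Table 6), three raters
relative, absolute = g_coefficients(290.54, 0.36, 10.92, n_raters=3)
```

Both values round to .99, matching the tabled coefficients.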
COMPARING MAPPING TECHNIQUES 271
Types of Scores. Table 8 provides the means, standard deviations, and correlations across the three types of scores used for the construct-a-map technique. The information provided about students' connected understanding varies across the types of scores. Whereas the mean salience score indicated that students' performance was close to the maximum, the proposition accuracy and convergence scores indicated that students' knowledge was rather partial.
The high correlation between proposition accuracy and convergence scores (.95) was very similar to correlations we have found in other studies (e.g., Ruiz-Primo et al., 1996, 1997). However, the correlations of proposition accuracy and convergence scores with salience scores (.73 and .75, respectively) were lower than the ones we have observed before (.85).
When G theory was used to evaluate the dependability of these measures (see Ruiz-Primo et al., 1996, 1997), we found that the percent of variability among persons was higher for proposition accuracy and convergence scores than for salience scores, indicating that these two measures better reflected the differences in students' knowledge structures than did salience scores.
The general conclusion about construct-a-map scores is consistent with our previous research: proposition accuracy and convergence scores reflect the differences in students' knowledge structure better than salience scores. Based on practical (e.g., scoring time) and technical (e.g., stability of scores) arguments, we concluded that the convergence score was the most efficient of the three.
Comparing Students' Scores Across Assessment Techniques
In this section we focus first on evaluating the extent to which the scores on the two mapping techniques, fill-in-the-map and construct-a-map, converged. Then we evaluate the extent to which the two mapping technique scores converged with multiple-choice scores. A correlational approach was used to compare techniques because of differences between score scales.

Table 7
Means and standard deviations of students' concept map components

Map component             Mean     SD      Minimum observed   Maximum observed
Nodes                     19.34    1.23    14                 20
Linking lines             25.41    6.60    14                 43
Accurate propositions     18.88    7.44    0                  42

Table 8
Means, standard deviations, and correlations across the three types of construct-a-map scores

Type of score                 n     Maximum   Mean    SD      PA     CON    SAL
Proposition accuracy (PA)     152   135       53.91   22.17   –
Convergence (CON)             152   1         .50     .19     .95    –
Salience (SAL)                152   1         .73     .17     .73    .75    –
Table 9 provides the descriptive statistics for the three types of assessments administered to the students: construct-a-map, fill-in-the-map, and multiple-choice test.
Mean scores across the forms of assessment do not provide the same picture of students' knowledge of the topic. Whereas the fill-in-the-map and multiple-choice scores indicate that students' performance was close to the maximum possible, the convergence score indicated that students' knowledge was rather partial compared to the criterion map.
Table 10 provides a multiscore-multitechnique matrix. In the matrix, reliability coefficients are enclosed in parentheses on the main diagonal. Along with the observed correlations, we present correlations corrected for unreliability where appropriate. However, because different reliability estimates were used in the matrix, and hence measurement error was defined differently, some of these corrections may not be accurate and must be interpreted cautiously. Therefore, we focus on the observed correlations.

Table 9
Means and standard deviations across the three types of assessments

Assessment               n     Maximum   Mean    SD
Convergence              152   1         .50     .19
Fill-in-the-nodes        152   12        11.02   1.59
Fill-in-the-lines        151   12        9.39    2.93
Multiple-choice test     150   30        24.05   3.74

Table 10
Correlations between mapping technique scores and types of assessments

Type of assessment                          CON     NOD     LIN     MC
Convergence score (CON)                     (.99)
Fill-in-the-nodes (NOD)       Observed      .47     (.71)
Fill-in-the-lines (LIN)       Observed      .44     .53     (.85)
Multiple-choice test (MC)     Observed      .44     .37     .65     (.74)
                              Corrected     .51     .51     .82

Note. Fill-in reliabilities are internal consistencies averaged over the two sample skeleton maps. No correction was calculated for pairs in which both assessments were highly reliable.
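The corrected values in Table 10 are consistent with the classical correction for attenuation, which divides an observed correlation by the square root of the product of the two reliabilities:

```python
def correct_for_attenuation(r_observed, reliability_x, reliability_y):
    """Classical disattenuation: r_xy / sqrt(rel_x * rel_y)."""
    return r_observed / (reliability_x * reliability_y) ** 0.5

# Multiple-choice (rel = .74) with fill-in-the-lines (rel = .85), observed r = .65
r_mc_lin = correct_for_attenuation(0.65, 0.74, 0.85)
# Multiple-choice (rel = .74) with convergence (rel = .99), observed r = .44
r_mc_con = correct_for_attenuation(0.44, 0.74, 0.99)
```

These round to .82 and .51, matching the corrected row of Table 10.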
Mapping Technique Scores. If the construct-a-map and fill-in techniques measure the same construct, we should expect a high correlation between their scores. Yet the correlations were lower than expected (r = .48, averaged across types of scores), indicating that students were ranked differently according to the technique used. It seems that different aspects of students' connected understanding were being tapped by the different techniques. Restriction of range observed in both types of fill-in-the-map scores may have contributed to the magnitude of the correlations; the low coefficients should therefore be interpreted with caution.
Comparing Mapping and Multiple-Choice Scores. The magnitudes of the correlations between construct-a-map scores and multiple-choice scores and between fill-in-the-map scores and multiple-choice scores were very close to each other. The correlations between fill-in-the-map scores and multiple-choice scores were quite surprising: the correlation between fill-in-the-nodes and multiple-choice scores reported by Schau et al. (1997) was higher (.75 on average) than the one we found in this study (.37).
Two issues may explain these differences: restriction of range observed in the fill-in-the-nodes skeleton map scores (i.e., the skeleton map was very easy for students in our study) and differences between the characteristics of the fill-in-the-nodes maps used in the two studies. For example, Schau et al. (1997) used 37 nodes, of which 50% were left blank; we used 20, of which 60% were left blank. Also, the propositions in the skeleton map used by Schau et al. were less complex than the ones used in ours.
Whether the characteristics of the maps can affect students' scores deserves to be studied more carefully. For example, how many nodes in a skeleton map is optimal? How many nodes need to be left blank? What is the best way to select the nodes left blank?
Notice, however, that the correlation with fill-in-the-lines (.65) was the highest among all the correlations between mapping scores and multiple-choice scores. Differences between these two forms of fill-in maps deserve more attention.
An important finding for our purposes was that the pattern of correlations is not the same across mapping techniques. The mapping techniques, then, did not provide similar information about students' knowledge structure or connected understanding.
We think that the construct-a-map technique better reflects students' knowledge structures. We based this conclusion on the fact that this technique is the only one whose scores accurately reflected the differences we saw among students' responses. The fill-in-the-map score distributions were negatively skewed (skewness values ranged from -.755 for fill-in-the-lines to -1.538 for fill-in-the-nodes), indicating that most students obtained high scores, whereas the convergence scores were normally distributed (a Kolmogorov-Smirnov normality test confirmed that only convergence scores were normally distributed; p = .200). It seems, then, that convergence scores better reflect the differences in students' knowledge than the other scores.
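The moment-based skewness underlying this claim can be computed directly; the scores below are invented solely to illustrate a negatively skewed, near-ceiling distribution and are not the study's data:

```python
def skewness(scores):
    """Moment-based sample skewness: m3 / m2**1.5 (negative = left-skewed)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5

# Hypothetical fill-in scores (maximum 12): most students near the ceiling
ceiling_scores = [12, 12, 12, 11, 11, 10, 7, 5]
```

A ceiling effect like this one produces a negative skewness, while a symmetric distribution yields a value near zero.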
What, then, is the fill-in-the-map technique tapping? What aspect of students' knowledge is being measured with this form of assessment? A closer look at the cognitive activities displayed in this technique is needed. Talk-aloud protocols may help to better define the cognitive activities reflected by both techniques.
Conclusions
In this study we asked the following questions: Are fill-in-the-map (skeleton map) scores sensitive to the nodes and linking lines selected to be filled in? Are fill-in-the-nodes skeleton maps equivalent to fill-in-the-lines skeleton maps? Does the fill-in-the-map technique provide the same picture of a student's connected understanding as the construct-a-map technique?
Our results led to the following tentative conclusions. (1) Skeleton map scores were not sensitive to the sample of concepts or linking lines to be filled in. Probably the selected concepts and propositions reflected the key content of the unit and were cohesive enough that any combination of them could provide similar information about students' knowledge. (2) Fill-in-the-nodes and fill-in-the-lines techniques are not equivalent forms of fill-in-the-map. Further research is needed to determine which of these two forms provides the more accurate information about students' knowledge or connected understanding. (3) The relationship between the two mapping techniques suggests that both tap somewhat similar but not identical aspects of students' connected understanding. Students' talk-aloud protocols may provide insight into the cognitive activities involved in constructing and filling in a map. (4) Construct-a-map scores most accurately reflected the differences across students' knowledge structures. (5) The different patterns of correlations between scores from the multiple-choice test and the two mapping techniques confirmed that the mapping techniques were not equivalent. (6) Convergence scores (the proportion of accurate propositions in the student's map out of all possible propositions in the criterion map) are the most efficient indicator when scoring construct-a-map concept maps.
Our overall conclusion is that we need to invest time and resources in finding out more about what aspects of students' knowledge are tapped by different forms of concept map assessment. Which technique should be considered the most appropriate for large-scale assessment? Practical issues, though, cannot be the only criterion for selection: the constraints and affordances imposed by different forms of assessment affect the way students perform. To resolve the issue of what is being measured with these different techniques, we need information about the cognitive activity displayed in each of them.
The work reported herein was supported, in part, by the Educational Research and Development Centers Program (No. R305B60002), as administered by the Office of Educational Research and Improvement, U.S. Department of Education. The findings and opinions expressed in this report do not reflect the positions or policies of the National Institute on Student Achievement, Curriculum, and Assessment, the Office of Educational Research and Improvement, or the U.S. Department of Education.
Training For Constructing Concept Maps
The training lasted about 50 minutes and had four major parts. The first part focused on introducing concept maps: what they are, what they are used for, what their components are (i.e., nodes, links, linking words, propositions), and examples (outside the domain to be mapped) of hierarchical and non-hierarchical maps. The second part emphasized the construction of concept maps. Four aspects of mapping were highlighted: identifying a relationship between a pair of concepts; creating a proposition; recognizing good maps; and redrawing a map. Students were then given two lists of common concepts with which to collectively construct a map: the first list focused on the ``water cycle'' (a non-hierarchical map); the second focused on ``living things'' (a hierarchical map). The third part of the program provided each individual with nine concepts on the ``food web'' to construct a map individually. The fourth part was a discussion of students' questions after they had constructed their individual maps.
The program has proved effective with more than 100 high school students. To evaluate the effectiveness of the training, we randomly sampled individually constructed maps at the end of the training within each group. These analyses focused on three aspects of the maps: use of the concepts provided on the list; use of labeled links; and the accuracy of the propositions. Results across studies (see Ruiz-Primo et al., 1996, 1997) have indicated that (a) more than 94% of the students used all the concepts provided on the list, (b) 100% used labeled lines, and (c) more than 96% provided one or more valid propositions. We concluded that the training program succeeded in teaching students to construct concept maps.
Procedure Used To Construct A Criterion Map
1. Select a panel. Usually it is composed of experts in the content domain to be tested, teachers, and the researchers or assessors.
2. Ask each panel participant to provide a list of the ``X'' most important concepts in the subject domain.
3. Have panel participants compare and discuss their lists of selected concepts until a consensus is reached about which are the most important concepts. This is the ``Key-Concept List.''
4. Ask each participant to construct a concept map with the key concepts.
5. Construct a concept map with the relations that appear in at least 80% of the participants' maps.
6. Discuss and modify the resulting concept map with the participants until a consensus is reached about which relations should be present in the map.
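Step 5 above, retaining only relations drawn by at least 80% of the panel, can be sketched as follows (the relation labels are hypothetical):

```python
from collections import Counter

def consensus_relations(participant_maps, threshold=0.80):
    """Keep relations that appear in at least `threshold` of the panel's maps."""
    counts = Counter()
    for relations in participant_maps:
        counts.update(set(relations))  # count each relation once per participant
    cutoff = threshold * len(participant_maps)
    return {relation for relation, c in counts.items() if c >= cutoff}
```

With a five-member panel, a relation must appear in at least four individual maps to survive the 80% cutoff.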
Examples Of the Multiple-Choice Test
Examples of the Conceptual Items:
1. Negative ions are formed from neutral atoms by:
(a) gaining in atomic number
(b) losing in atomic number
(c) losing of electrons
(d) gaining of electrons
2. Which one of the following pairs of elements would you expect to form molecular compounds:
(a) sodium and bromine
(b) calcium and chlorine
(c) nitrogen and oxygen
(d) aluminum and sulfur
3. Binary ionic compounds are composed of:
(a) two monoatomic cations
(b) two monoatomic anions
(c) one or two polyatomic ions
(d) a monoatomic cation and a monoatomic anion
4. A cation is any atom or group of atoms that:
(a) gains electrons and has a positive charge
(b) gains electrons and has a negative charge
(c) loses electrons and has a positive charge
(d) loses electrons and has a negative charge
Examples of the Mechanical Items:
1. An -ite or -ate ending on the name of a compound indicates that the compound:
(a) is a binary ionic compound
(b) is a binary molecular compound
(c) contains a polyatomic anion
(d) contains a polyatomic cation
2. The compound formula formed when aluminum reacts with sulfur is:
3. The name of the chemical compound, Fe
(a) iron sulfate
(b) iron II sulfate
(c) iron III sulfate
(d) iron VI sulfate
4. Select the correct formula:
References

Anderson, T.H., & Huang, S-C.C. (1989). On using concept maps to assess the comprehension effects of reading expository text (Technical Report No. 483). Urbana-Champaign: Center for the Study of Reading, University of Illinois at Urbana-Champaign (ERIC Document Reproduction Service No. ED 310 368).
Baxter, G.P., Elder, A.D., & Glaser, R. (1996). Knowledge-based cognition and performance assessment in the science classroom. Educational Psychologist, 31, 133–140.
Bybee, R.W. (1996). The contemporary reform of science education. In J. Rothon & P. Bowers (Eds.), Issues in science education (pp. 1–14). Arlington, VA: National Science Teachers Association, National Science Education Leadership Association.
Chi, M.T.H., Glaser, R., & Farr, M.J. (1988). The nature of expertise. Hillsdale, NJ: Lawrence Erlbaum.
Glaser, R. (1991). Expertise and assessment. In M.C. Wittrock & E.L. Baker (Eds.), Testing and cognition (pp. 17–39). Englewood Cliffs, NJ: Prentice Hall.
Glaser, R. (1996). Changing the agency for learning: Acquiring expert performance. In K.A. Ericsson (Ed.), The road to excellence: The acquisition of expert performance in the arts and sciences, sports, and games (pp. 303–311). Mahwah, NJ: Lawrence Erlbaum.
McClure, J.R., & Bell, P.E. (1990). Effects of an environmental education-related STS approach instruction on cognitive structures of preservice teachers. University Park, PA: Pennsylvania State University (ERIC Document Reproduction Service No. ED 341 582).
Mintzes, J.J., Wandersee, J.H., & Novak, J.D. (1997). Teaching science for understanding. San Diego: Academic Press.
Moore, J.A. (1995). Cultural and scientific literacy. Molecular Biology of the Cell, 6, 1–6.
Novak, J.D. (1990). Concept mapping: A useful tool for science education. Journal of Research in Science Teaching, 27(10), 937–949.
Novak, J.D. (1998). Learning, creating, and using knowledge: Concept maps as facilitative tools in schools and corporations. Mahwah, NJ: Lawrence Erlbaum.
Novak, J.D., & Gowin, D.R. (1984). Learning how to learn. New York: Cambridge University Press.
Novak, J.D., Gowin, D.R., & Johansen, G.T. (1983). The use of concept mapping and knowledge vee mapping with junior high school science students. Science Education, 67(5), 625–645.
Novak, J.D., & Ridley, D.R. (1988). Assessing student learning in light of how students learn. Paper prepared for the AAHE Assessment Forum, American Association for Higher Education (ERIC Document Reproduction Service No. ED 299 923).
Ruiz-Primo, M.A., Shavelson, R.J., & Schultz, S.E. (1997, March). On the validity of concept map-based assessment interpretations: An experiment testing the assumption of hierarchical concept maps in science. Paper presented at the AERA Annual Meeting, Chicago, IL.
Ruiz-Primo, M.A., & Shavelson, R.J. (1996). Problems and issues in the use of concept maps in science assessment. Journal of Research in Science Teaching, 33, 569–600.
Ruiz-Primo, M.A., Schultz, S.E., & Shavelson, R.J. (1996, April). Concept-map based assessment in science: An exploratory study. Paper presented at the AERA Annual Meeting, New York, NY.
Schau, C., & Mattern, N. (1997). Use of map techniques in teaching applied statistics courses. The American Statistician, 51, 171–175.
Schau, C., Mattern, N., Weber, R., Minnick, K., & Witt, C. (1997, March). Use of fill-in concept maps to assess middle school students' connected understanding of science. Paper presented at the AERA Annual Meeting, Chicago, IL.
Shavelson, R.J. (1972). Some aspects of the correspondence between content structure and cognitive structure in physics instruction. Journal of Educational Psychology, 63, 225–234.
Shavelson, R.J. (1974). Methods for examining representations of a subject-matter structure in a student's memory. Journal of Research in Science Teaching, 11, 231–249.
Surber, J.R. (1984). Mapping as a testing and diagnostic device. In C.D. Holley & D.F. Dansereau (Eds.), Spatial learning strategies: Techniques, applications, and related issues (pp. 213–233). Orlando: Academic Press.