ISSN 2301-251X (Online)
European Journal of Science and Mathematics Education
https://www.scimath.net
Vol. 9, No. 3, 2021, 57-79
© 2021 by the authors; licensee EJSME. This article is an open access article distributed under the terms and conditions of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/4.0/).
OPEN ACCESS
Development of a Test Instrument to Investigate Secondary School
Students’ Declarative Knowledge of Quantum Optics
Philipp Bitzenbauer 1*
1 Physics Education Research, Department of Physics, Friedrich-Alexander-Universität Erlangen-Nürnberg, GERMANY
* Corresponding author: philipp.bitzenbauer@fau.de
Received: 3 Mar. 2021 Accepted: 23 Apr. 2021
Citation: Bitzenbauer, P. (2021). Development of a Test Instrument to Investigate Secondary School Students’ Declarative
Knowledge of Quantum Optics. European Journal of Science and Mathematics Education, 9(3), 57-79.
https://doi.org/10.30935/scimath/10946
Abstract:
This article reports the development and validation of a test instrument to assess secondary school students’ declarative
quantum optics knowledge. With that, we respond to modern developments from physics education research: Numerous
researchers propose quantum optics-based introductory courses in quantum physics, focusing on experiments with
heralded photons. Our test instrument’s development is based on test development standards from the literature, and we
follow a contemporary conception of validity. We present results from three studies to test various assumptions that,
taken together, justify a valid test score interpretation, and we provide a psychometric characterization of the instrument.
The instrument is shown to enable a reliable (α = 0.78) and valid survey of declarative knowledge of quantum optics
focusing on experiments with heralded photons with three empirically separable subscales.
Keywords: quantum physics, test development, validation
INTRODUCTION
The improvement of physics teaching is a central goal of physics education research. One of its most
important areas is curriculum development research (Henderson, 2018). Today, there is an ongoing
tradition of curriculum research not only on topics of classical physics, such as mechanics (Spatz et al.,
2020) or electricity (Burde & Wilhelm, 2020), but also on advanced topics of modern physics, such as
quantum physics (Kohnle et al., 2014; Müller & Wiesner, 2002).
Meanwhile, teaching proposals for quantum physics have been developed for more than twenty years
to foster a detailed conceptual understanding of quantum physics among learners in schools and
universities. Given emerging technical advances in the preparation and detection of single-photon
states, diverse experiment-based approaches for teaching quantum physics have been developed
(Bronner et al., 2009; Galvez et al., 2005; Pearson & Jackson, 2010; Thorn et al., 2004). Most of these
experiment-based teaching sequences focus on quantum optics experiments with heralded photons:
The quantum behaviour of single photons at the beam splitter is demonstrated in such experiments,
making non-classical effects tangible (Bitzenbauer & Meyn, 2020; Holbrow et al., 2002; Pearson &
Jackson, 2010). Such non-classical effects, e.g., antibunching (Kimble et al., 1977), are revealed by
measuring intensity correlations of light at outputs of a beamsplitter (Grangier et al., 1986; Hanbury
Brown & Twiss, 1956) and allow for the demonstration of light’s quantum behaviour.
Quantum optics-based teaching approaches are promising in many ways: for example, Marshman and
Singh (2017) argue that quantum optical experiments can “elegantly illustrate the fundamental concepts
of quantum mechanics such as the wave-particle duality of a single photon, single-photon interference,
and the probabilistic nature of quantum measurement" (p. 1). Basing the teaching of quantum physics on such single-photon experiments leads to a conception of photons consistent with quantum electrodynamics, namely that of quantum objects as "quanta of various continuous space-filling fields" (Hobson, 2005, p. 61). According to Jones (1991), this "physical picture of the radiation field produced
by quantum electrodynamics (QED) is satisfactory” (p. 97). Thus, single-photon experiments “provide
the simplest method to date for demonstrating the essential mystery of quantum physics” (Pearson &
Jackson, 2011, p. 1) without using historical approaches or mechanistic analogies in quantum physics
lessons. These are known to lead to fundamental misunderstandings about quantum physics among
students (Henriksen et al., 2018; Olsen, 2002).
Development research requires empirical research: Many studies have investigated typical learners’
difficulties in quantum physics (Fischler & Lichtfeldt, 1992; Mashhadi & Woolnough, 1999; Singh & Marshman, 2015; Styer, 1996). Most of these studies referred to university students, particularly
concerning quantum measurements and time evolution (Zhu & Singh, 2012a), the quantum mechanical
formalism in general (Singh, 2007), or wave functions in one spatial dimension (Zhu & Singh, 2012b).
However, there have been only a few studies, with relatively small samples, on learning and teaching quantum mechanics in the context of experiments with single photons (Marshman & Singh, 2017), especially at the secondary school level (Bitzenbauer, 2021; Bitzenbauer & Meyn, 2020). In particular, there is a lack
of psychometrically characterized test instruments to investigate students’ learning gains in quantum
optics-based teaching sequences on quantum physics using experiments with heralded photons.
In this article, we present the development of a test instrument that can be used to economically elicit
learners’ declarative knowledge in the context of quantum optical experiments with heralded photons
at the secondary school level. First, we give an overview of the test instruments that have emerged from
research in the field of quantum physics education. Then, we describe the development as well as the
psychometric characterization of the new test instrument. In future studies, this instrument can be used
to investigate the learning efficacy of quantum optics-based instructional approaches to quantum
physics in schools.
LITERATURE REVIEW
Test instruments on quantum physics. Research on students’ conceptions of quantum physics has a long
tradition: In their paper, Fischler and Lichtfeldt (1992) called for a departure from analogies to classical
physics in the teaching of quantum physics. They justified this by evaluating a teaching course on
quantum physics and the students’ conceptions found. Subsequent works (Ireson, 1999; Mannila et al.,
2002) took up some of Fischler’s and Lichtfeldt’s ideas, but no standard test instrument was developed.
Today, there are many test instruments for quantum physics in different formats, with different thematic
foci and mainly with the students at universities as a target group. The overview in Table 1 shows a
clear need: The developed test instruments are predominantly unsuitable for evaluating quantum
physics teaching concepts at schools. In particular, no instrument exists with an explicit focus on
quantum optics and experiments with heralded photons.
Students’ conceptions of the nature of light. Studies on the nature of light generally refer to the dualism of
waves and particles. An interview study with N = 25 students (Ayene et al., 2011) led to three clusters
of conceptions, which the authors titled and described as follows:
- Classical description: objects are described either as waves or as particles in the classical sense.
Particles are described as localized, compact objects, visualized “as a billiard ball which carries
energy and momentum” (Ayene et al., 2011, p. 6).
- Mixed description: Photons are considered objects with the properties of classical particles and
waves.
- Quasiquantum description: Learners’ representations are predominantly dualistic but sometimes
resort to viewing quantum objects as either waves or particles.
Table 1. Overview of published test instruments for quantum physics
Test / author | Question format | Exemplary content | Target group
Students' conceptions of quantum phenomena (Ireson, 1999, 2000) | Items with rating scale | Quantum phenomena and models | students (partly also suitable for secondary school students)
Quantum measurement test (Singh, 2001) | Open-ended questions | Measurement process and time evolution | advanced undergraduate students
QMVI (Cataloglu & Robinett, 2002) | Two-tier multiple-choice | Quantum physical formalism | undergraduate students
Questionnaire on students' conceptions (Müller & Wiesner, 2002) | Items with rating scale | Models of the atomic shell, quantum randomness, quantum interference | secondary school students
QPCS (Wuttiprom et al., 2009) | Multiple-choice | Wave-particle duality, wave function | undergraduate students
QMAT (Goldhaber et al., 2009) | Open-ended questions | Quantum physical formalism | students
QMCS (McKagan et al., 2010) | Multiple-choice | Wave-particle duality, atomic structure, potential wells, wave functions | undergraduate students
QMS (Zhu & Singh, 2012) | Multiple-choice | Nonrelativistic quantum mechanics of a single particle in one spatial dimension | undergraduate and graduate students
QMCA (Sadaghiani & Pollock, 2015) | Multiple-choice | Measurement, Schrödinger equation, wave function, time evolution, probability density | (advanced) undergraduate students
QME (Uccio et al., 2019) | Two-tier multiple-choice | Matter waves, measurement process, atoms and electrons | students (partly also suitable for secondary school students)
QMFPS (Marshman & Singh, 2019) | Multiple-choice | Formalism and postulates of quantum mechanics | (advanced) undergraduate and graduate students
This study’s results are consistent with previous findings (Ireson, 1999, 2000). Ireson (1999) conducted
a multivariate analysis on student understandings of quantum physics based on a survey of N = 225
learners and grouped subjects into three clusters according to their perceptions: mechanistic thinking,
intermediate thinking, and quantum thinking. These three levels in learners’ conceptions between
classical thinking and quantum thinking have also been reported throughout other studies (Ke et al.,
2005).
Development of the Test Instrument
To adapt instruction to learners’ needs, learners’ level of knowledge must be elicited (cf. Tyson et al.,
1997; Özdemir & Clark, 2007), for example, using tests. At the beginning of test development, very
central and pragmatic questions must be clarified, which influence the development process of the
instrument itself and the results of the survey later on (Mummendey & Grau, 2014). To develop our
instrument, we used standards from the literature as a guide (e.g., Adams & Wieman, 2011; Haladyna & Downing, 1989):
Determination of the target group. The primary target group is secondary school students.
Determination of the test objective. Typically, tests are differentiated with respect to their test objective, i.e.,
according to whether their objective is firstly to measure ability, secondly to classify individuals, or
thirdly to record knowledge. The test instrument presented in this article falls into this third category:
it is intended to survey declarative knowledge about quantum optics, focusing on experiments with
heralded photons. Here, we use the term declarative knowledge to describe knowledge about objects,
content, or facts (Anderson, 1996). Hence, most of the items primarily relate to concepts, some to facts.
Description of the knowledge domain. A specification of the knowledge domain quantum optics focusing on
experiments with heralded photons from literature is not possible because comparable test instruments
have not been published yet and quantum optics is not yet firmly anchored in international school
curricula either (Stadermann et al., 2019). Accordingly, the theoretically based operationalization of the
construct declarative knowledge on quantum optics is not possible. Therefore, in a field like quantum optics
- or other modern physics topics, which have not yet been empirically explored - it is even more
challenging to determine a substructure of the learners’ knowledge than it is in classical subject areas.
There is no standard procedure in physics education research for how to empirically approach such an
area. In developing the test instrument presented here, the following approach has proven fruitful: First,
the newly developed test instrument was based on a model containing the three evident sub-aspects
theoretical aspects, experimental aspects and photons. These three sub-aspects should be represented in the
test instrument. Thus, basic knowledge (sub-aspect theoretical aspects), general knowledge about
quantum objects using the example of the photon (sub-aspect photons) as well as technical-experimental
considerations (sub-aspect experimental aspects) are queried (cf. Table 2).
Table 2. Example items for each content area. The correct answer option is highlighted with a checkmark (✓). For each item, students also indicated how confident they were on an additional 5-point rating scale.
Theoretical aspects: 7. Interference is generally defined as...
a) ...the superposition of at least two electromagnetic waves.
b) ...the superposition of exactly two waves.
✓ c) ...the superposition of at least two waves.
Experimental aspects: 6. In a single-photon detector...
a) ...the number of registered photons within some time interval is counted.
✓ b) ...detected energy portions trigger electron avalanches.
c) ...charges are initially on top and are moved downwards through an incident energy quantity.
Photons: 13. Photons are...
a) ...spherical particles, which sometimes show wave-like behaviour.
b) ...components of light surrounded by a wave, which is responsible for interference.
✓ c) ...energy portions.
One possibility of developing items with fit to this structural model of the knowledge domain is
preparing a blueprint. A blueprint is a matrix containing the content to be tested in the test on the one
hand and the students’ performance levels to be achieved in these individual content areas on the other
hand (Krebs, 2008). Such a blueprint was created at the initial stage of test development according to
the steps outlined by Flateby (2013).
Decision of task format. Usually, test instruments for surveying (declarative or conceptual) knowledge consist of single- or multiple-choice items for economic reasons. In this test, we used two-tier single-choice items. In the first tier, students choose exactly one out of three response options. In the second tier, they are additionally asked to indicate how certain they were about their tier-one answer on a five-point Likert scale (1 = I guessed, ..., 4 = sure, 5 = very sure). This additional indication of response confidence is primarily intended to minimize the influence of guessing (Brell et al., 2005). For this purpose, a point is assigned only if the correct answer was chosen in tier one and the respondent was at least sure (tier two). This may underestimate the participants' test scores but prevents an overestimation of learning gains when evaluating teaching concepts. Taken together, this format allows for objective data evaluation but requires the development of attractive distractors (Theyßen, 2014).
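To make the two-tier scoring rule concrete, the following minimal Python sketch implements it; the answer key and the column names (e.g., item07, item07_conf) are hypothetical and only illustrate the logic described above, not the actual data layout of the study.

```python
import pandas as pd

# Hypothetical layout: one column per item with the chosen option ("a"/"b"/"c")
# and one column per item with the tier-two confidence rating (1 = "I guessed" ... 5 = "very sure").
ANSWER_KEY = {"item07": "c", "item13": "c"}   # shortened, hypothetical key
CONFIDENCE_THRESHOLD = 4                       # at least "sure"

def score_two_tier(responses: pd.DataFrame) -> pd.Series:
    """Award one point per item only if the correct option was chosen in tier one
    AND the stated confidence in tier two is at least 'sure'."""
    total = pd.Series(0, index=responses.index)
    for item, correct_option in ANSWER_KEY.items():
        correct = responses[item] == correct_option
        confident = responses[item + "_conf"] >= CONFIDENCE_THRESHOLD
        total += (correct & confident).astype(int)
    return total
```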
Formulation of appropriate distractors. For the quality of items, the distractors’ quality, i.e., the wrong
answer options, is crucial. Moosbrugger and Kelava (2012) define distractors as “answer alternatives
that seem plausible but do not apply” (p. 418), expressing the difficulty in finding such distractors: they
have to be recognized as wrong by knowledgeable people and should seem right for others (Glug, 2009).
Therefore, distractors are usually based on widespread student conceptions. Because quantum optics-
based teaching approaches to quantum physics are primarily intended to promote elaborate
conceptions of the (quantum) nature of light among students, we started developing our new test
instrument with a literature review of student conceptions in this domain (cf. literature review).
Nevertheless, no broad studies on students’ conceptions of quantum physics exist that relate more
narrowly to the context of quantum optics and experiments with heralded photons. In such cases, a
standard method for obtaining appropriate distractors is the use of relevant questions in an open-ended
format first, for instance, in a preliminary study. Frequent errors or answers close to the correct solution
can then be used as distractors for the test instrument (Krebs, 2008). Therefore, in developing the test
instrument presented here, 21 items - distributed across the three sub-aspects theoretical aspects, photons
and experimental aspects according to the previously developed blueprint - were first formulated as open-
ended questions. These were given to N = 23 pre-service physics teachers. An initial set of test items was
obtained from the pre-service physics teachers’ answers because partially correct or conspicuously
frequent incorrect answers served as distractors for our instrument (cf. Table 3).
Table 3. Information on the test instrument items' contents with the assignment to the three content areas theoretical aspects, experimental aspects, and photons. In the last column, exemplary (partially) incorrect answers of the N = 23 surveyed students to the preliminary study's open-ended questions are given, which we used as a basis for developing appropriate distractors. We consider this a fruitful procedure for developing plausible distractors for topics with little or no empirical coverage. All items of the final test version can be found in Appendix A of this article.
Sub-aspect | Content | Item No. | Exemplary (partly) incorrect student response that resulted in a distractor / Reference
Theoretical aspects | Optical beam splitter | 1 | "An optical beam splitter splits light beam."
Theoretical aspects | Atomic energy levels | 4 | adapted from (McKagan et al., 2010)
Theoretical aspects | Conservation of energy | 5 | "Two photons are generated at half of the wavelength of the incident laser beam."
Theoretical aspects | Interference I | 7 | "Interference is the superposition of exactly two waves."
Theoretical aspects | Interference II | 8 | "Conducting the double-slit experiment with single electrons leads to two well-defined detection locations on a screen behind the double slit."
Experimental aspects | Non-linear crystal | 2 | "Laser light incident on a non-linear crystal is split into two partial beams."
Experimental aspects | Single-photon detector | 6 | "A single-photon detector counts the number of registered photons within some time interval."
Experimental aspects | Interferometer in single-photon experiments | 9 | "Interference of single photons actually shows that they only exist in some experiments."
Experimental aspects | Anticorrelation factor | 11 | -
Experimental aspects | Coincidence technique | 12 | "For experiments with heralded photons, one always needs exactly two detectors."
Photons | Eye as photon detector | 3 | "We cannot see photons because they are too small."
Photons | Localizability of photons | 10 | "The photon splits at the beam splitter cube because classical light is also reflected and transmitted at the beam splitter."
Photons | Photons as energy quanta | 13 | "For me, photons are small spherical particles which sometimes show wave-like behaviour."
Test Score Interpretation
Validity. In empirical research, the validation of a test instrument is often referred to in the context of
test development. The debate about the test quality criterion validity led to a shift in the conception of
validity. Thus, today validity is not seen as a property of a test: “Validity is not a property of the test.
Rather, it is a property of the proposed interpretations and uses of the test scores. Interpretations and
uses that make sense and are supported by appropriate evidence are considered to have high validity
[...]” (Kane, 2013, p. 3). Valid test score interpretation is at the centre of assuring test quality: “Validity
refers to the degree to which evidence and theory support the interpretations of test scores for proposed
uses of the tests” (AERA, 2014, p. 11). The fact that a test procedure allows valid test score interpretation
must be derived by arguments: Accordingly, more contemporary theorists of educational measurement have articulated validity as an evidence-based argument (Haertel, 2004; Kane, 2001, 2013). A prerequisite for
this process is that it is first determined which test score interpretation is intended. This is then referred
to as the intended test score interpretation. Besides, it must be determined on which assumptions this
intended test score interpretation is based. Methods and procedures must then be used to check the
validity of these assumptions (cf. Kane, 2001). The necessary strands of argumentation cannot be
standardized (Meinhardt et al., 2018).
In developing the test instrument presented in this article, we used an iterative process of development, pilot studies, and refinement of the items. The extent to which a valid test score interpretation is possible is derived argumentatively: Results from a Think-aloud study and an expert survey are combined with results from a quantitative pilot study to yield a valid interpretation of the test score.
Intended test score interpretation. The test score is meant to indicate the extent to which the concepts of
quantum optics, with a focus on experiments with heralded photons, are known by secondary school
students. It can, therefore, be taken as a measure of declarative knowledge in this area.
This intended test score interpretation is based on the following assumptions adapted from Meinhardt
(2018), which must be checked for plausibility as part of the validity argument:
1. The items adequately represent the construct according to the structural model (three empirically
separable subscales theoretical aspects, photons and experimental aspects).
2. The items evoke intended cognitive processes in the students. In particular, correct answers are not
(exclusively) given due to guessing.
3. The items are understood as intended by the respondents.
4. The items and distractors are authentic for the students.
5. The respective scales adequately represent the intended sub constructs (theoretical aspects,
experimental aspects and photons).
6. The construct declarative knowledge on quantum optics is distinguishable from different or similar
constructs.
As stated by Meinhardt (2018, p. 198), this list can, in principle, be continued arbitrarily. However, if assumptions 1-6 can be empirically substantiated, they serve as arguments justifying "evaluative and generalizing conclusions" based on the test scores (Meinhardt, 2018, p. 198).
Methods and Samples
This study has two central goals: first, to empirically test the assumptions on which the intended test
score interpretation is based. This is to ensure a valid interpretation of the test scores. Second, the
psychometric characterization of the test instrument in the sense of classical test theory. For this
purpose, three studies were conducted (cf. Figure 1).
Study I. We interviewed N = 8 secondary school students (grade 12) in one-on-one settings using a Think-aloud protocol to investigate students' cognitive processes when answering the test items (Ericsson & Simon, 1998). Beforehand, the students had participated in a classroom implementation of the introductory quantum physics course according to Bitzenbauer and Meyn (2020). To ensure a standardised implementation of the method, a guideline was developed. The conversations were recorded as audio files and transcribed afterwards. The evaluation of the Think-aloud interviews was based on categories from the literature (Meinhardt, 2018) related to the comprehensibility of the items, the cognitive processes occurring during test processing, and the suitability of the response format, using scaling content analysis (Mayring, 2010, p. 102). The coding was carried out by a second independent coder for 12.5% of the data with a high level of agreement, κ = 0.72 (95% CI [0.58; 0.85]). During Think-aloud interviews, participants are asked to verbalize their thoughts and reflections, so that the cognitive processes of respondents in a test situation become accessible (van Someren et al., 1994). Evaluating a newly developed test instrument with the Think-aloud method is particularly useful for checking whether test persons understand the items as intended, especially for closed questions (here, single-choice): the distractors sometimes do not cover all conceivable responses, influence each other, or do not correspond to a participant's natural response (Schnell, 2016); this can negatively influence the validity of the survey results (Rost, 2004). During the Think-aloud interviews, some previously unnoticed difficulties and misconceptions were revealed.
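As a side note, the reported intercoder agreement can be computed with standard tooling; a minimal sketch (with hypothetical category labels, since the actual coding scheme is not reproduced here) could look as follows:

```python
from sklearn.metrics import cohen_kappa_score

# Categories assigned by the two independent coders to the same transcript
# segments (hypothetical labels and data; 12.5% of the material in the study).
coder_1 = ["comprehensible", "intended_process", "guessing", "comprehensible"]
coder_2 = ["comprehensible", "intended_process", "comprehensible", "comprehensible"]

kappa = cohen_kappa_score(coder_1, coder_2)
print(f"Cohen's kappa: {kappa:.2f}")
```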
Study II. The test items were revised based on study I. The beta version of the test was then completed by N = 86 undergraduate students of engineering. Before test processing, the students had been introduced to quantum physics using the quantum optics-based introductory course suggested by Bitzenbauer and Meyn (2020). Following classical test theory (Engelhardt, 2009), item difficulty and item discrimination indices were calculated for all items. Items whose statistics were not within empirically accepted tolerance ranges were excluded from the item set. We refer to the accepted tolerance range of 0.2 to 0.8 for item difficulty (Kline, 2015). Concerning discrimination indices, values above 0.2 are considered good (Jorion et al., 2015), while others suggest a threshold of 0.3 (Fisseni, 1997). In addition, the reliability of the test instrument was calculated using Cronbach's alpha as an estimator for internal consistency (Taber, 2018). To check criterion validity, the participants' physics grades were collected as an external criterion.
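A minimal sketch of this classical item analysis (assuming a 0/1 scored response matrix; the function and variable names are illustrative, not taken from the study) could look like this:

```python
import numpy as np

def classical_item_analysis(scores):
    """scores: (n_students, n_items) array of 0/1 scored item responses.
    Returns item difficulties, corrected item-total discriminations, and Cronbach's alpha."""
    scores = np.asarray(scores, dtype=float)
    n_students, n_items = scores.shape
    difficulty = scores.mean(axis=0)                 # proportion of correct responses per item
    total = scores.sum(axis=1)
    discrimination = np.array([
        np.corrcoef(scores[:, i], total - scores[:, i])[0, 1]   # item-rest correlation
        for i in range(n_items)
    ])
    alpha = n_items / (n_items - 1) * (1 - scores.var(axis=0, ddof=1).sum() / total.var(ddof=1))
    return difficulty, discrimination, alpha
```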
Figure 1. Overview of the development of our test instrument and pilot studies. The grey arrows indicate that such a process is never complete but should be understood as iterative.
Multidimensionality of the test is to be assumed because of the model of the knowledge domain on which the test development is based (cf. description of the knowledge domain). We therefore check the fit of the structural model to the data using confirmatory factor analysis: "Confirmatory factor analysis is designed to assess how well a hypothesized factor structure 'fits' the observed data" (Russell, 2002, p. 1638). Confirmatory factor analyses require samples of at least 100, preferably larger (cf. Kline, 2005; Loehlin, 2004; Schumacker & Lomax, 2004). Jackson (2003) recommends a sample size to parameter ratio of at least 10:1, as "lower ratios are increasingly less trustworthy" (Kyriazos, 2018, p. 2217). However, MacCallum and Widaman (1999) suggest 5:1 as a cut-off ratio. In our study, the ratio is 6.6:1, so our sample can be considered sufficiently large for a confirmatory factor analysis in our pilot study. We refrain from Rasch scaling of the test because different test items refer to the same experimental setups, so the requirement of local stochastic independence is violated (Debelak & Koller, 2020).
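The article does not state which software was used for the confirmatory factor analysis. As one hedged sketch, the three-factor model with the item-factor assignment from Table 3 could be specified in Python with the semopy package roughly as follows; the data file name and the item column names are assumptions.

```python
import pandas as pd
import semopy

# Three-factor structural model; item-to-factor assignment follows Table 3 / Table 9.
MODEL_DESC = """
theoretical  =~ item1 + item4 + item5 + item7 + item8
experimental =~ item2 + item6 + item9 + item11 + item12
photons      =~ item3 + item10 + item13
"""

data = pd.read_csv("pilot_study_scores.csv")   # hypothetical file with 0/1 scored items
model = semopy.Model(MODEL_DESC)
model.fit(data)                                # maximum likelihood estimation
print(semopy.calc_stats(model).T)              # fit statistics such as chi2, CFI, RMSEA, AIC
print(model.inspect())                         # factor loadings and error variances
```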
Study III. Finally, the test content’s relevance and representativeness were verified with an expert survey
(N = 8 scientists from physics education research). For each of the 13 items in the test instrument, experts
were asked to rate the following three statements on a 5-point Likert scale (1 = strongly disagree, 2 = disagree,
3 = undecided, 4 = agree, 5 = agree completely) using a pen-and-paper format:
a) The item’s distractors are authentic; thus, they could be assumed to be true by someone who is not
certain of the answer.
b) This item assesses a crucial aspect of the knowledge domain.
c) This item is of good quality.
In addition to an evaluation of the individual items, the test as a whole was also evaluated in order to
check for content validity. For this purpose, a scale consisting of four items - again with a 5-point Likert
scale - was developed:
1. The items represent relevant contents of the knowledge domain.
2. The contents are in an appropriate relation to each other, i.e. the weighting of the content areas is
reasonable.
3. The test instrument has a high fit to the knowledge domain.
4. The test instrument covers important content aspects of the single photon experiments.
The internal consistency of the scale was found to be α = 0.86.
To evaluate the expert survey, Diverging Stacked Bar Charts (Robbins & Heiberger, 2011) were created
using the software Tableau, version 2019.3. These charts align a bar corresponding to 100% of the expert
ratings relative to the scale’s centre (0%). Experts’ agreement corresponds to a swing of the bar to the
right, and disagreement corresponds to the bar’s swing to the left. The mean value m and standard
deviation SD of the expert ratings are also given to quantify the expert ratings.
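The article's charts were produced with Tableau; purely as an illustration of the underlying idea, a diverging stacked bar chart for 5-point Likert ratings can also be sketched with matplotlib. The counts below are made up for demonstration and do not reproduce the study's data.

```python
import numpy as np
import matplotlib.pyplot as plt

# rows: items, columns: counts of ratings 1 ("strongly disagree") ... 5 ("agree completely")
counts = np.array([[0, 1, 2, 3, 2],
                   [1, 0, 1, 4, 2]])            # hypothetical expert ratings
labels = ["Item 1", "Item 2"]
perc = 100 * counts / counts.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
# start each bar so that disagreement plus half of the neutral category lies left of zero
left = -(perc[:, 0] + perc[:, 1] + perc[:, 2] / 2)
colors = ["#ca0020", "#f4a582", "#cccccc", "#92c5de", "#0571b0"]
for category in range(5):
    ax.barh(labels, perc[:, category], left=left, color=colors[category], label=str(category + 1))
    left = left + perc[:, category]
ax.axvline(0, color="black", linewidth=0.8)
ax.set_xlabel("Percentage of expert ratings")
ax.legend(title="Rating", ncol=5, loc="lower right")
plt.show()
```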
Table 4. Overview of the samples of the three studies piloting the test instrument
Study | Sample
Think-aloud study (study I) | N = 8 secondary school students
Quantitative study (study II) | N = 86 first-semester undergraduate engineering students
Expert survey (study III) | N = 8 scientists from physics education research
RESULTS
The three sub-studies I-III conducted as part of the new test instrument’s piloting help to verify the
assumptions that the intended test score interpretation is based on. The summary of all these individual
procedures enables a detailed validity argument. This supports valid test score interpretation.
According to this test score interpretation, the participants’ test scores represent a measure of the
declarative knowledge on quantum optics, focusing on experiments with heralded photons. Besides,
study II provides a psychometric characterization of the instrument.
In summary, Table 5 shows which assumption regarding the intended test score interpretation is
addressed in which sub-study and with which measure.
Table 5. The assumptions which the intended test score interpretation is based on are checked and argumentatively supported by different measures. Only the sum of the arguments finally speaks for a valid test score interpretation.
Assumption | Study
1. The items adequately represent the construct according to the structural model (three empirically separable subscales theoretical aspects, photons and experimental aspects). | Quantitative study (study II): confirmatory factor analysis; Expert survey (study III)
2. The items evoke intended cognitive processes in the students. In particular, correct answers are not (exclusively) given due to guessing. | Think-aloud study (study I)
3. The items are understood as intended by the respondents. | Think-aloud study (study I)
4. The items and distractors are authentic for the students. | Think-aloud study (study I)
5. The respective scales adequately represent the intended sub constructs (theoretical aspects, experimental aspects and photons). | Expert survey (study III)
6. The construct declarative knowledge on quantum optics is distinguishable from different or similar constructs. | Quantitative study (study II): correlation analysis with external criterion
Results of study I. The participants formulated their thoughts aloud while processing the test and were
observed continuously. The results of the Think-aloud study led to a revision of the test items. Here,
two items will be used as examples to show the extent to which the Think-aloud study contributed to
an optimization of the test items at an early stage of the test’s development.
A particular focus of the Think-aloud study was whether the participants perceived the developed
distractors as authentic. In this regard, the original item 1 proved to be problematic. This was: “A beam
splitter...
a) ...is employed in the Michelson interferometer, because it can be used to split an incident ray of light into two
partial beams.
b) …is a prism.
c) ...separates incident rays of light or superimposes two rays of light.”
The test participants were often critical of distractor b) because equating a beam splitter with a prism created a logical contradiction; this distractor was therefore not perceived as authentic. In contrast, distractor a) is highly authentic because, according to some respondents, the term Michelson interferometer is very attractive. From these observations, the item was revised to: "A beam splitter...
a) ...is employed in the Michelson interferometer, because it can be used to split an incident ray of light into two partial beams.
b) ...is made out of two merged prisms, where one of them is responsible for the transmitted beam and one for the reflected beam.
c) ...separates incident rays of light or superimposes two rays of light."
Through the utterances in the Think-aloud study, indications of the cognitive processes occurring while
answering the items could be recorded. It is noticeable that the decision-making process was perceived
as complex for the majority of the items. One item (item 13 in the final test version) that proved
problematic in this regard was the following: “Photons are…
a) ...spherical particles, which sometimes show wave-like behaviour.
b) ...spherical particles surrounded by a wave responsible for interference.
c) ...the elementary energy portions of light.”
Some of the test persons stated that the repetition of the expression "spherical particles" in distractors a) and b) made the decision process less complex and pointed directly to answer option c). Because of this observation, the item was modified to: "Photons are...
a) ...spherical particles, which sometimes show wave-like behaviour.
b) ...components of light surrounded by a wave, which is responsible for interference.
c) ...energy portions."
In summary, the Think-aloud study led to the confirmation of assumptions 2-4 (cf. Table 5), on which the intended test score interpretation is based, although all of the items were revised based on the Think-aloud protocols. Two exemplary item revisions were discussed in this section.
Results of study II – Descriptive analysis. The psychometric parameters reported below refer to the 13 items
found in the final version of the test instrument (cf. Appendix A). Items that were removed from the
item set due to the item analysis are not presented and discussed here. Thus, in this chapter, we provide
a psychometric characterization of our instrument based on 86 undergraduate engineering students’
data.
In the 13 items, students could score a maximum of 13 points (1 point each). The students reached a
mean score of m = 5.94, SD = 3.15, ranging from 0 points (four students) to 12 points (three students), cf.
Figure 2. To check criterion validity, the test scores’ correlation with the subjects’ physics scores was
determined, which was found to be r = 0.44 (p < 0.01).
Figure 2. Histogram of the students’ test scores
For most of the items, each distractor was selected by at least 5% of the participants, so that no item had to be excluded due to a too rarely selected distractor (cf. Table 6).
Table 6. Distribution of students' responses to all items. The correct responses are in boldface.
Item No. (Content) | a) | b) | c)
1 (Optical beam splitter) | 31% | 46% | 23%
2 (Non-linear crystal) | 27% | 28% | 44%
3 (Eye as photon detector) | 62% | 22% | 16%
4 (Atomic energy levels) | 9% | 19% | 72%
5 (Conservation of energy) | 16% | 9% | 75%
6 (Single-photon detector) | 41% | 53% | 6%
7 (Interference I) | 14% | 3% | 82%
8 (Interference II) | 75% | 15% | 9%
9 (Interferometer in single-photon experiments) | 3% | 25% | 71%
10 (Localizability of photons) | 9% | 22% | 67%
11 (Anticorrelation factor) | 16% | 29% | 54%
12 (Coincidence technique) | 5% | 65% | 30%
13 (Photons as energy quanta) | 11% | 17% | 72%
The analysis of the item difficulties (cf. Table 7) showed that almost all items lie within the tolerance range of 0.20 to 0.80, with values ranging from 0.15 (Item 1) to 0.74 (Item 7). Only items 1 and 9 deviate slightly into the problematic range below 0.2. Nevertheless, items 1 and 9 were retained because they address essential aspects of the experiments with heralded photons. For the discriminatory indices of the test items, values between 0.31 (Items 9, 12) and 0.54 (Item 3) are obtained (cf. Table 7). These values are within the tolerance range above 0.3. The value of Cronbach's alpha as an estimator for the internal consistency of the test instrument is 0.78 for the final test version with 13 items. The values in Table 7 also show that the internal consistency could not be raised by excluding additional items. We furthermore calculated the split-half reliability using the Guttman formula, as it does not assume homogeneity of the two test halves and does not lead to an overestimation of reliability (Kerlinger & Lee, 2000). For our test instrument, we obtain a value of 0.75.
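For completeness, a minimal sketch of the Guttman (Rulon) split-half coefficient is given below; the odd/even split used here is an assumption, since the article does not specify how the test halves were formed.

```python
import numpy as np

def guttman_split_half(scores):
    """Guttman/Rulon split-half reliability: 2 * (1 - (var(h1) + var(h2)) / var(total)),
    computed here for an odd/even item split of a 0/1 scored response matrix."""
    scores = np.asarray(scores, dtype=float)
    half_1 = scores[:, 0::2].sum(axis=1)
    half_2 = scores[:, 1::2].sum(axis=1)
    total = half_1 + half_2
    return 2 * (1 - (half_1.var(ddof=1) + half_2.var(ddof=1)) / total.var(ddof=1))
```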
Table 7. Item characteristics in the overview.
Item No. (Content) | Item difficulty | Discriminatory power | Cronbach's alpha excluding the respective item
1 (Optical beam splitter) | 0.15 | 0.40 | 0.76
2 (Non-linear crystal) | 0.23 | 0.39 | 0.76
3 (Eye as photon detector) | 0.51 | 0.54 | 0.75
4 (Atomic energy levels) | 0.59 | 0.45 | 0.76
5 (Conservation of energy) | 0.73 | 0.40 | 0.76
6 (Single-photon detector) | 0.37 | 0.32 | 0.77
7 (Interference I) | 0.74 | 0.37 | 0.77
8 (Interference II) | 0.55 | 0.40 | 0.76
9 (Interferometer in single-photon experiments) | 0.17 | 0.31 | 0.77
10 (Localizability of photons) | 0.47 | 0.50 | 0.75
11 (Anticorrelation factor) | 0.35 | 0.36 | 0.77
12 (Coincidence technique) | 0.44 | 0.31 | 0.77
13 (Photons as energy quanta) | 0.63 | 0.48 | 0.76
Results of study II – Confirmatory factor analysis. A confirmatory factor analysis was used to check whether
there is sufficient agreement between the empirical data and the theoretical model of our knowledge
domain (Moosbrugger & Kelava, 2012, p. 334), thus, to generate evidence for construct validity. The
empirical data are the test scores collected in study II. The theoretical model is the structural model for
the knowledge domain, which we have outlined in Table 2, consisting of the three factors theoretical
aspects, experimental aspects and photons. The model parameters were estimated using the maximum
likelihood method, and the model fit was checked using different goodness of fit measures on model
level. We refer to the fit parameters χ²/df, RMSEA, the root-mean-square error of approximation (Steiger
& Lind, 1980), CFI, the comparative fit index (Bentler, 1990) and SRMR, the standardized root mean
square residual (Hu & Bentler, 1999) and their cut-off values for good and acceptable model fits
according to Schermelleh-Engel et al. (2003). In order to apply maximum likelihood estimation, (approximately) multivariate normally distributed data are necessary. To test the data for normality, we used the skewness and kurtosis of the indicator variables; their absolute values are all smaller than 2 for the reported sample, so that normal distribution can be assumed (Hammer & Landau, 1981, p. 578). For the model used here, the model fits reported in Table 8 were found, indicating an acceptable to good fit to the empirical data.
Table 8. Fit measures of the confirmatory factor analysis based on the empirical data. Furthermore, we obtain AIC = 107.62 for the single-factorial model and AIC = 108.79 for the three-factorial model. For reasons of content, we continue with the three-factorial model.
Model | χ²/df | RMSEA | CFI | SRMR
One-factorial model | 0.86 | 0.00 | 1.00 | 0.06
Three-factorial model (cf. Table 2) | 0.82 | 0.00 | 1.00 | 0.06
The confirmatory factor analysis' results on indicator and construct level are summarized in Table 9 and Figure 3. The correlations among the factors are statistically significant at the 1% level and range from 0.39 to 0.46. This indicates that all three subscales contribute to the students' declarative knowledge on quantum optics, focusing on experiments with heralded photons.
Table 9. Confirmatory factor analysis results on indicator and construct level. The indicator reliabilities mostly lie above the threshold for good reliability of 0.40 (Bagozzi & Baumgartner, 1994, p. 402). The Fornell-Larcker criterion is also met because the squared correlation of two latent variables is smaller than the mean extracted variance AEV per factor in each case (Fornell & Larcker, 1981, p. 46).
Factor | Indicator | Factor loading | Error variance | Indicator reliability | Factor reliability (Cronbach's α) | AEV
Theoretical aspects | Item 4 | 0.65 | 0.58 | 0.42 | 0.68 | 0.36
Theoretical aspects | Item 5 | 0.70 | 0.51 | 0.49 | |
Theoretical aspects | Item 8 | 0.72 | 0.48 | 0.52 | |
Theoretical aspects | Item 1 | 0.46 | 0.79 | 0.22 | |
Theoretical aspects | Item 7 | 0.37 | 0.86 | 0.14 | |
Experimental aspects | Item 6 | 0.45 | 0.80 | 0.20 | 0.52 | 0.37
Experimental aspects | Item 2 | 0.55 | 0.70 | 0.30 | |
Experimental aspects | Item 9 | 0.67 | 0.55 | 0.45 | |
Experimental aspects | Item 11 | 0.76 | 0.42 | 0.58 | |
Experimental aspects | Item 12 | 0.55 | 0.70 | 0.30 | |
Photons | Item 10 | 0.45 | 0.80 | 0.20 | 0.55 | 0.31
Photons | Item 3 | 0.68 | 0.54 | 0.46 | |
Photons | Item 13 | 0.52 | 0.73 | 0.27 | |
Figure 3. Overview of the factor loadings of the individual items and correlations between the factors. Only factor loadings above 0.3 are displayed; factor loadings below 0.3 are suppressed. This shows that only four items have secondary loadings above 0.3.
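The quantities in Table 9 follow directly from the standardized loadings. A small sketch, reusing the loadings reported above, illustrates the calculation of the AEV and the Fornell-Larcker check; the exact factor correlations are not listed in the text, so the values used for the check are only assumed to lie in the reported 0.39-0.46 range.

```python
import numpy as np

def aev(loadings):
    """Average extracted variance: mean squared standardized factor loading."""
    return float(np.mean(np.asarray(loadings, dtype=float) ** 2))

# Standardized loadings per factor as reported in Table 9
aev_theoretical = aev([0.65, 0.70, 0.72, 0.46, 0.37])    # approx. 0.36
aev_experimental = aev([0.45, 0.55, 0.67, 0.76, 0.55])   # approx. 0.37
aev_photons = aev([0.45, 0.68, 0.52])                     # approx. 0.31

# Fornell-Larcker criterion: each AEV must exceed the squared correlation of the
# corresponding factor pair (correlations below are assumed for illustration only).
factor_corr = [0.46, 0.43, 0.39]
max_r2 = max(r ** 2 for r in factor_corr)
print(all(a > max_r2 for a in (aev_theoretical, aev_experimental, aev_photons)))  # True
```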
For the internal consistencies of the three subscales theoretical aspects (5 items, α = 0.68), experimental aspects (5 items, α = 0.52) and photons (3 items, α = 0.55), values are obtained that are well below the overall reliability of the test instrument of α = 0.78. This aspect will be addressed in more detail in the discussion. Thus, in summary, study II contributes to the confirmation of assumptions 1 and 6 (cf. Table 5), on which the intended test score interpretation is based.
Results of study III. In study III, an expert survey was conducted to ensure, among other things, that the developed items' distractors represent meaningful response options in terms of content. According to Landis and Koch (1977), the experts show substantial agreement on this (Fleiss κ = 0.62), as can be seen from the descriptive statistics on the experts' ratings (cf. Table 10) and the corresponding Diverging Stacked Bar Chart (cf. Figure 4).
Furthermore, according to Landis and Koch (1977), the experts show moderate consensus (Fleiss κ = 0.59) that all of the test instrument's items ask for relevant content about quantum optics with a focus on experiments with heralded photons. This finding is of particular importance for judging the content validity of the instrument (cf. Table 11, Figure 5).
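Fleiss' kappa for this kind of multi-rater agreement can be obtained with statsmodels; a minimal sketch with made-up ratings (8 experts, 5 Likert categories) is shown below.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings[i, j] = Likert category (1-5) that expert j assigned to item i (hypothetical data)
ratings = np.array([[4, 4, 5, 3, 4, 4, 5, 4],
                    [3, 4, 4, 4, 3, 4, 5, 4]])

table, _ = aggregate_raters(ratings)          # items x categories count table
print(f"Fleiss' kappa: {fleiss_kappa(table, method='fleiss'):.2f}")
```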
Table 10. Expert ratings' descriptive statistics on the quality of all 13 items' distractors (1 = strongly disagree, ..., 5 = agree completely). Statement rated: "The item's distractors are authentic; thus, they could be assumed to be true by someone who is not certain of the answer."
Item No. (Content) | m | SD
1 (Optical beam splitter) | 3.50 | 1.31
2 (Non-linear crystal) | 3.88 | 0.99
3 (Eye as photon detector) | 3.50 | 1.07
4 (Atomic energy levels) | 4.00 | 1.31
5 (Conservation of energy) | 3.75 | 1.28
6 (Single-photon detector) | 3.63 | 0.92
7 (Interference I) | 4.25 | 0.71
8 (Interference II) | 4.25 | 0.89
9 (Interferometer in single-photon experiments) | 4.00 | 0.93
10 (Localizability of photons) | 3.88 | 1.24
11 (Anticorrelation factor) | 3.63 | 1.06
12 (Coincidence technique) | 3.63 | 1.06
13 (Photons as energy quanta) | 4.00 | 1.20
Figure 4. Diverging Stacked Bar Chart for the experts' ratings on the statement "The item's distractors are authentic; thus, they could be assumed to be true by someone who is not certain of the answer" (Fleiss κ = 0.62).
Moreover, the experts were asked to rate the quality of each item. The results (cf. Table 12, Figure 6)
show that two of the items are viewed critically by the experts (items 2 and 3). However, these two items
were retained for didactic reasons in order to maintain the breadth of content.
Table 11. Expert ratings' descriptive statistics on the importance of all 13 items' contents (1 = strongly disagree, ..., 5 = agree completely). Statement rated: "This item assesses a crucial aspect of the knowledge domain."
Item No. (Content) | m | SD
1 (Optical beam splitter) | 3.50 | 1.31
2 (Non-linear crystal) | 3.75 | 1.04
3 (Eye as photon detector) | 3.50 | 1.07
4 (Atomic energy levels) | 4.00 | 1.31
5 (Conservation of energy) | 3.75 | 1.28
6 (Single-photon detector) | 3.63 | 0.92
7 (Interference I) | 4.25 | 0.71
8 (Interference II) | 4.25 | 1.04
9 (Interferometer in single-photon experiments) | 4.13 | 0.84
10 (Localizability of photons) | 3.88 | 1.25
11 (Anticorrelation factor) | 3.63 | 1.06
12 (Coincidence technique) | 3.63 | 1.06
13 (Photons as energy quanta) | 4.00 | 1.20
Figure 5. Diverging Stacked Bar Chart for the experts' ratings on the statement "This item assesses a crucial aspect of the knowledge domain" (Fleiss κ = 0.59).
Finally, looking at the scale for the content validity of the test instrument as a whole (4 items, α = 0.86), a consistently positive picture emerges (cf. Table 13). Thus, in summary, study III contributes to the confirmation of assumptions 1 and 5 (cf. Table 5), on which the intended test score interpretation is based.
Table 13. Expert ratings' descriptive statistics on the scale on the test instrument's content validity (1 = strongly disagree, ..., 5 = agree completely).
Item | m | SD
1. The items represent relevant contents of the knowledge domain. | 3.80 | 0.63
2. The contents are in an appropriate relation to each other, i.e., the weighting of the content areas is reasonable. | 4.30 | 0.67
3. The test instrument has a high fit to the knowledge domain. | 4.00 | 0.82
4. The test instrument covers important content aspects of the single photon experiments. | 4.00 | 0.82
Table 12. Expert ratings' descriptive statistics on the overall quality of all 13 items of the test instrument (1 = strongly disagree, ..., 5 = agree completely). Statement rated: "This item is of good quality."
Item No. (Content) | m | SD
1 (Optical beam splitter) | 3.38 | 0.92
2 (Non-linear crystal) | 2.75 | 0.89
3 (Eye as photon detector) | 2.63 | 0.52
4 (Atomic energy levels) | 4.50 | 0.54
5 (Conservation of energy) | 4.00 | 0.76
6 (Single-photon detector) | 3.75 | 0.71
7 (Interference I) | 4.00 | 0.76
8 (Interference II) | 4.25 | 1.17
9 (Interferometer in single-photon experiments) | 3.38 | 1.06
10 (Localizability of photons) | 3.88 | 1.36
11 (Anticorrelation factor) | 3.50 | 0.76
12 (Coincidence technique) | 3.25 | 1.17
13 (Photons as energy quanta) | 3.88 | 1.13
Figure 6. Diverging Stacked Bar Chart for the experts' ratings on the statement "This item is of good quality" (Fleiss κ = 0.61).
DISCUSSION
The development of tools to assess a given construct validly and reliably has a long tradition in science
education research (Britton & Schneider, 2007; Doran et al., 1994; Tamir, 1998). Our newly developed
instrument aims at assessing students’ declarative knowledge of quantum optics in the context of
experiments with heralded photons. Because this field has received little empirical exploration so far, test development cannot draw on a body of published students' conceptions, as is the case in other fields, e.g., for the development of concept inventories such as the FCI in mechanics (Hestenes et al., 1992; Hestenes & Halloun, 1995), the DIRECT in electricity (Engelhardt & Beichner, 2004) or the KTSO-A in optics (Hettmannsperger et al., 2021). Therefore, in this article, we presented the development process in detail to show one way of opening up a field that has received little empirical research to date.
It is a consensus across disciplines in empirical research that study results must meet the quality criteria
of objectivity, reliability, and validity. The content validity is typically ensured using expert surveys,
and the construct validity is usually based on factor analysis. Mostly, reliability is approached using
KR-20 or Cronbach’s Alpha, as summarised by Liu (2012). In this article, we have presented three
studies that, taken together, were intended to assure the quality of our newly developed test instrument.
A Think-aloud study was conducted at an early stage of development, leading to item revision.
Furthermore, the newly developed instrument was piloted with a sample of N = 86 undergraduate
engineering students. The results show that a reliable survey of declarative knowledge on quantum
optics is possible with this test instrument. Besides, correlation analysis, a confirmatory factor analysis,
and an expert survey support the assumptions on which an intended test score interpretation was based.
While validity cannot be assessed conclusively, the results presented provide solid arguments that the
developed test on quantum optics allows for a valid test score interpretation. This means that the test
scores can be interpreted as a measure of declarative knowledge in quantum optics.
Jorion et al. (2015) provide a categorical judgement scheme and assignment rules to evaluate concept
inventories. The authors use their framework to analyze three different concept tests, namely the
Concept Assessment Tool for Statics CATS (Steif & Dantzler, 2005), the Statistics Concept Inventory SCI
(Stone et al., 2003), and the Dynamics Concept Inventory DCI (Gray et al., 2005). We refer to this scheme
to judge the psychometric characteristics of our new test instrument (cf. Table 14).
While the psychometric parameters of our test instrument correspond to medium to excellent values (cf. Table 14), the reliabilities of the empirically separable subscales theoretical aspects (5 items, α = 0.68), experimental aspects (5 items, α = 0.52) and photons (3 items, α = 0.55) lie outside the recommended ranges. These subscales have been confirmed using confirmatory factor analysis (cf. Table 9). Given the background of
published test instruments on other domains, however, this is not a surprise because the investigation
of factor structure has often led to difficulties - not least for the Force Concept Inventory (Huffman &
Heller, 1995; Scott et al., 2012). For many instruments, no factor structure could be extracted at all: one
example is the FCI, another one is the CSEM (Maloney et al., 2001), the Conceptual Survey of Electricity
and Magnetism. For many of the instruments for which the extraction of a factor structure was
successful, no reliabilities of the subscales have been reported yet (cf. Engelhardt & Beichner, 2004;
Ramlo, 2008; Urban-Woldron & Hopf, 2012). In the article by Jorion et al. (2015), the authors examined
the subscale reliabilities of CATS (0.33 ≤ α ≤ 0.72), SCI (0.27 ≤ α ≤ 0.47) and DCI (0.06 ≤ α ≤ 0.62), and our quantum optics test instrument's subscale reliabilities lie in a similar range. In the paper of
Hettmannsperger et al. (2021), the development and validation of the concept test KTSO-A for ray optics
are presented. For the KTSO-A, subscale reliabilities in a similar or slightly higher range are reported.
LIMITATIONS AND CONCLUSION
This article reports the development and validation of a test instrument to assess secondary school
students’ declarative quantum optics knowledge. With that, we respond to modern developments on
learning quantum physics from physics education research: Numerous researchers propose quantum
optics-based introductory courses in quantum physics, focusing on experiments with heralded photons
(cf. Introduction).
Our test instrument’s development is based on test development standards from the literature (cf.
chapter Development of the test instrument), and we follow a contemporary conception of validity (cf.
chapter Test score interpretation). Therefore, we present an evidence-based argument for our instrument’s
validation: We report the results from three studies to test various assumptions that justify a valid test
score interpretation. Future pilot studies are necessary to refine the test instrument and to tackle
limitations: Although the test instrument was piloted in a Think-aloud study with secondary school
students, the quantitative study II was conducted with undergraduate engineering students. While we
argue that these do not differ substantially from our primary target group of secondary school students in
11th/12th grade concerning their prior knowledge in quantum physics, the use of the instrument in larger
samples with secondary school students is necessary. In this way, the psychometric characteristics
reported in this article, all of which are in the acceptable to excellent range (cf. chapter Discussion),
can be verified.
With the test instrument presented in this article, we want to provide the possibility to economically
assess students’ declarative knowledge of quantum optics focusing on experiments with heralded
photons. We consider developing and validating such a test instrument as the first step in empirical
research of secondary school students’ learning processes on quantum physics in experiment-based
settings. With the help of the presented test instrument, it becomes possible to evaluate the learning effectiveness of developed teaching concepts on modern quantum physics focusing on experiments
with heralded photons. In our next studies, we will use the test instrument presented here as part of an evaluation study to investigate the effectiveness of our teaching concept on quantum optics (Bitzenbauer & Meyn, 2020) in the 11th and 12th grades at secondary schools. Based on the results, both the teaching concept itself and the test instrument presented in this article will be refined in the sense of an iterative process (cf. Figure 1). In this context, the structure of the test instrument will have to be reviewed with a larger sample of the primary target group of secondary school students, as possible deviations from the confirmatory factor analysis results reported here due to the different group of people (school students vs. engineering students in this study) cannot be excluded with certainty.
Table 14. Categorical Judgment Scheme and Assignment Rules for Evaluating a Concept Inventory, adopted from Jorion et al. (2015, p. 482). Values in parentheses specify the number of items that can lie outside the respective suggestion (Jorion et al., 2015, p. 482).
Analysis | Excellent | Good | Average | Poor | Our test on quantum optics
Classical test theory / item statistics: difficulty | 0.2-0.8 | 0.2-0.8 (3) | 0.1-0.9 | 0.1-0.9 (3) | Good (cf. Table 7)
Classical test theory / item statistics: discrimination | > 0.2 | > 0.1 | > 0.0 | > -0.2 | Excellent (cf. Table 7)
Classical test theory / total score reliability: α of total score | > 0.9 | > 0.8 | > 0.65 | > 0.5 | Average
Classical test theory / total score reliability: α-with-item-deleted | All items less than overall α | (3) | (6) | (9) | Excellent
Item response theory / individual item measures: all items fit the model | (2) | (4) | (6) | (8) | Was not used here; for justification cf. description of study II
Structural analyses / exploratory factor analysis | Conforms to predicted constructs | (5) | (10) | (15) | Was only used in a preliminary analysis to preclude uni-dimensionality of the instrument
Structural analyses / confirmatory factor analysis: item loading | > 0.3 | > 0.3 (3) | > 0.1 | > 0.1 (3) | Excellent (cf. Figure 3)
Structural analyses / confirmatory factor analysis: CFI | > 0.9 | > 0.8 | > 0.7 | > 0.6 | Excellent (cf. Table 8)
Structural analyses / confirmatory factor analysis: RMSEA | < 0.03 | < 0.05 | < 0.1 | < 0.2 | Excellent (cf. Table 8)
Based on such evaluation studies on a large scale, it can empirically be investigated which conceptions
of quantum physics learners develop in such settings. Insights of this kind are necessary to uncover
typical learning difficulties in quantum optics-based introductory courses in the future. To this end, we
believe qualitative research methods, such as interview studies, are necessary. In the long run, this may
lead to a concept inventory that makes the different teaching concepts on modern quantum physics
based on experiments with heralded photons comparable - not only concerning a mere learning gain
but especially with respect to the question to what extent learners acquire a conceptual understanding
of quantum physics.
Funding: This study was funded by the Emerging Talents Initiative (University of Erlangen, Germany).
Declaration of interest: Author declares no competing interest.
Data availability: Data generated or analysed during this study are available from the author on request.
REFERENCES
Adams, W. K., & Wieman, C. E. (2011). Development and validation of instruments to measure learning of expert-like thinking.
International Journal of Science Education, 33, 1289-1312. https://doi.org/10.1080/09500693.2010.512369
AERA (2014). Standards for educational and psychological testing. American Educational Research Association.
Anderson, J. R. (1996). ACT: A simple theory of complex cognition. American Psychologist, 51(4), 355-365.
https://doi.org/10.1037/0003-066X.51.4.355
Ayene, M., Kriek, J., & Damtie, B. (2011). Wave-particle duality and uncertainty principle: Phenomenographic categories of
description of tertiary physics students’ depictions. Physical Review Special Topics - Physics Education Research, 7, 020113.
https://doi.org/10.1103/PhysRevSTPER.7.020113
Bagozzi, R. P., & Baumgartner, H. (1994). The evaluation of structural equation models and hypotheses testing. In R. P. Bagozzi
(Ed.), Principles of marketing research (pp. 386-422). Blackwell.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107(2), 238-246.
https://doi.org/10.1037/0033-2909.107.2.238
Bitzenbauer, P., & Meyn, J.-P. (2021). Fostering students' conceptions about the quantum world - results of an interview study.
Progress in Science Education, 4(2), 40-51. https://doi.org/10.25321/prise.2021.1079
Bitzenbauer, P., & Meyn, J.-P. (2020). A new teaching concept on quantum physics in secondary schools. Physics Education, 55(5),
055031. https://doi.org/10.1088/1361-6552/aba208
Brell, C., Schecker, H., Theyßen, H., & Schumacher, D. (2005). Computer trifft Realexperiment - besser lernen mit Neuen Medien?
[Computer meets real experiment - learn better with new media?]. PhyDid B - Didaktik der Physik - Beiträge zur DPG-
Frühjahrstagung.
Britton, E. D., & Schneider, S. A. (2007). Large-scale assessments in science education. In S. K. Abell & N. G. Lederman (Eds.),
Handbook of research on science education (pp. 1007-1040). Lawrence Erlbaum.
Bronner, P., Strunz, A., Silberhorn, C., & Meyn, J.-P. (2009). Demonstrating quantum random with single photons. European Journal of
Physics, 30, 1189. https://doi.org/10.1088/0143-0807/30/5/026
Burde, J.-P., & Wilhelm, T. (2020). Teaching electric circuits with a focus on potential differences. Physical Review Physics Education
Research, 16, 020153. https://doi.org/10.1103/PhysRevPhysEducRes.16.020153
Cataloglu, E., & Robinett, R. W. (2002). Testing the development of student conceptual and visualization understanding in
quantum mechanics through the undergraduate career. American Journal of Physics, 70, 238-251.
https://doi.org/10.1119/1.1405509
Debelak, R., & Koller, I. (2020). Testing the Local Independence Assumption of the Rasch Model With Q3-Based Nonparametric
Model Tests. Applied Psychological Measurement, 44(2), 103-117. https://doi.org/10.1177/0146621619835501
di Uccio, S., Colantonio, A., Galano, S., Marzoli, I., Trani, F., & Testa, I. (2019). Design and validation of a two-tier questionnaire
on basic aspects in quantum mechanics. Physical Review Physics Education Research, 15, 010137.
https://doi.org/10.1103/PhysRevPhysEducRes.15.010137
Doran, R. L., Lawrenz, F., & Helgeson, S. (1994). Research on assessment in science. In D. L. Gabel (Ed.), Handbook of research on
science teaching and learning (pp. 388-442). Macmillan Publishing Company.
Engelhardt, P. (2009). An Introduction to Classical Test Theory as Applied to Conceptual Multiple-choice Tests. Getting Started
in PER. https://www.compadre.org/Repository/document/ServeFile.cfm?ID=8807&DocID=1148
Engelhardt, P. V., & Beichner, R. J. (2004). Students’ understanding of direct current resistive electrical circuits. American Journal
of Physics, 72(1), 98-115. https://doi.org/10.1119/1.1614813
Ericsson, K., & Simon, H. (1998). How to study thinking in everyday life: Contrasting think-aloud protocols with descriptions and
explanations of thinking. Mind, Culture, and Activity, 5, 178-186. https://doi.org/10.1207/s15327884mca0503_3
Fischler, H., & Lichtfeldt, M. (1992). Modern physics and students’ conceptions. International Journal of Science Education, 14(2),
181-190. https://doi.org/10.1080/0950069920140206
Fisseni, H. (1997). Lehrbuch der psychologischen Diagnostik [Textbook of psychological diagnostics]. Hogrefe.
Flateby, T. L. (2013). A Guide for Writing and Improving Achievement Tests.
https://evaeducation.weebly.com/uploads/1/9/6/9/19692577/guide.pdf
Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables and measurement error.
Journal of Marketing Research, 18, 39-50. https://doi.org/10.1177/002224378101800104
Galvez, E. J., Holbrow, C. H., Pysher, M. J., Martin, J. W., Courtemanche, N., Heilig, L., & Spencer, J. (2005). Interference with
correlated photons: Five quantum mechanics experiments for undergraduates. American Journal of Physics, 73, 127.
https://doi.org/10.1119/1.1796811
Glug, I. (2009). Entwicklung und Validierung eines Multiple-Choice-Tests zur Erfassung prozessbezogener naturwissenschaftlicher
Grundbildung [Development and validation of a multiple choice test to record process-related basic scientific education].
IPN.
Goldhaber, S., Pollock, S. J., Dubson, M., Beale, P., & Perkins, K. K. (2009). Transforming Upper-Division Quantum Mechanics:
Learning Goals and Assessment. Physics Education Research Conference 2009, 145-148. https://doi.org/10.1063/1.3266699
Grangier, P., Roger, G., & Aspect, A. (1986). Experimental Evidence for a Photon Anticorrelation Effect on a Beam Splitter: A New
Light on Single-Photon Interferences. Europhysics Letters, 1, 173-179. https://doi.org/10.1209/0295-5075/1/4/004
Gray, G.L., Costanzo, F., Evans, D., Cornwell, P., Self, B., & Lane, J. L. (2005). The Dynamics Concept Inventory Assessment Test:
A progress report and some results. In Proceedings of the 2005 ASEE Annual Conference and Exposition.
Haertel, E. (2004). Interpretive Argument and Validity Argument for Certification Testing: Can We Escape the Need for
Psychological Theory? Measurement: Interdisciplinary Research and Perspectives, 2(3), 175-178.
Haladyna, T. M., & Downing, S. M. (1989). The validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement
in Education, 2(1), 51-78. https://doi.org/10.1207/s15324818ame0201_4
Hammer, T. H., & Landau, J. (1981). Methodological issues in the use of absence data. Journal of Applied Psychology, 66, 574-581.
https://doi.org/10.1037/0021-9010.66.5.574
Hanbury Brown, R., & Twiss, R. Q. (1956). Correlation between Photons in two Coherent Beams of Light. Nature, 177, 27-29.
https://doi.org/10.1038/177027a0
Henderson, C. (2018). Editorial: Call for Papers Focused Collection of Physical Review Physics Education Research Curriculum
Development: Theory into Design. Physical Review Physics Education Research, 14, 010003.
https://doi.org/10.1103/PhysRevPhysEducRes.14.010003
Henriksen, E. K., Angell C., Vistnes, A. I., & Bungum, B. (2018). What Is Light? Science & Education, 27, 81-111.
https://doi.org/10.1007/s11191-018-9963-1
Hestenes, D., & Halloun, I. (1995). Interpreting the Force Concept Inventory. A response to Huffman and Heller. The Physics
Teacher, 33, 502-506. https://doi.org/10.1119/1.2344278
Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force Concept Inventory. The Physics Teacher, 30, 141-158.
https://doi.org/10.1119/1.2343497
Hettmannsperger, R., Müller, A., Scheid, J., Kuhn, J., & Vogt, P. (2021). KTSO-A: KONZEPTTEST-STRAHLENOPTIK –
ABBILDUNGEN. Entwicklung eines Konzepttests zur Erfassung von Konzepten der Lichtausbreitung, Streuung und der
Entstehung reeller Bilder im Bereich der Strahlenoptik [KTSO-A: CONCEPT TEST RAY OPTICS - ILLUSTRATIONS.
Development of a concept test to capture concepts of light propagation, scattering and the creation of real images in the
field of ray optics]. Progress in Science Education, 4(1), 11-35.
Hobson, A. (2005). Electrons as field quanta: A better way to teach quantum physics in introductory general physics courses.
American Journal of Physics, 73, 630. https://doi.org/10.1119/1.1900097
Holbrow, C. H., Galvez, E. J., & Parks, M. (2002). Photon quantum mechanics and beam splitters. American Journal of Physics, 70,
260. https://doi.org/10.1119/1.1432972
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new
alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1-55. https://doi.org/10.1080/10705519909540118
Huffman, D., & Heller, P. (1995). What Does the Force Concept Inventory Actually Measure? The Physics Teacher, 33, 138-143.
https://doi.org/10.1119/1.2344171
Ireson, G. (1999). A multivariate analysis of undergraduate physics students’ conceptions of quantum phenomena. European
Journal of Physics, 20(3), 193. https://doi.org/10.1088/0143-0807/20/3/309
Ireson, G. (2000). The quantum understanding of pre-university physics students. Physics Education, 35, 15.
https://doi.org/10.1088/0031-9120/35/1/302
Jackson, D. L. (2003). Revisiting Sample Size and Number of Parameter Estimates: Some Support for the N:q Hypothesis. Structural
Equation Modeling, 10, 128-141. https://doi.org/10.1207/S15328007SEM1001_6
Jones, D. G. C. (1991). Teaching modern physics-misconceptions of the photon that can damage understanding. Physics Education,
26, 93. https://doi.org/10.1088/0031-9120/26/2/002
Jorion, N., Gane, B. D., James, K., Schroeder, L., DiBello, L. V., & Pellegrino, J. W. (2015). An analytic framework for evaluating
the validity of concept inventory claims. Journal of Engineering Education, 104(4), 454-496. https://doi.org/10.1002/jee.20104
Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319-342.
https://doi.org/10.1111/j.1745-3984.2001.tb01130.x
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1-73.
https://doi.org/10.1111/jedm.12000
Ke, J. L., Monk, M., & Duschl, R. (2005). Learning introductory quantum physics: sensori-motor experiences and mental models.
International Journal of Science Education, 27(13), 1571-1594. https://doi.org/10.1080/09500690500186485
Kerlinger, F. N., & Lee, H. B. (2000). Foundations of behavioral research (4th ed.). Wadsworth.
Kimble, H. J., Dagenais, M., & Mandel, L. (1977). Photon Antibunching in Resonance Fluorescence. Physical Review Letters, 39, 691-
695. https://doi.org/10.1103/PhysRevLett.39.691
Kline, R. B. (2005). Principles and Practice of Structural Equation Modeling. Guilford Press.
Kline, T. J. B. (2005). Psychological Testing. A Practical Approach to Design and Evaluation. Sage. https://doi.org/10.4135/9781483385693
Kohnle, A., Bozhinova, I., Browne, D., Everitt, M., Fomins, A., Kok, P., Kulaitis, G., Prokopas, M., Raine, D., & Swinbank, E. (2014).
A new introductory quantum mechanics curriculum. European Journal of Physics, 35, 015001. https://doi.org/10.1088/0143-0807/35/1/015001
Krebs, R. (2008). Multiple Choice Fragen? - Ja, aber richtig [Multiple-choice questions? - Yes, but done properly]. Medizinische Fakultät;
Institut für Medizinische Lehre IML; Abteilung für Assessment- und Evaluation AAE.
Kyriazos, T. A. (2018). Applied Psychometrics: Sample Size and Sample Power Considerations in Factor Analysis (EFA, CFA) and
SEM in General. Psychology, 9, 2207-2230. https://doi.org/10.4236/psych.2018.98126
Landis, J., & Koch, G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), 159-174.
https://doi.org/10.2307/2529310
Liu, X. (2012). Developing Measurement Instruments for Science Education Research. In B. J. Fraser, K. Tobin, & C. J. McRobbie
(Eds.), Second International Handbook of Science Education. (Springer International Handbooks of Education) (pp. 651-665).
Springer. https://doi.org/10.1007/978-1-4020-9041-7_43
Loehlin, J. C. (2004). Latent variable models (4th ed.). Lawrence Erlbaum. https://doi.org/10.4324/9781410609823
MacCallum, R. C., & Widaman, K. F. (1999). Sample Size in Factor Analysis. Psychological Methods, 4(1), 84-99.
https://doi.org/10.1037/1082-989X.4.1.84
Maloney, D. P., O’Kuma, T. L., Hieggelke, C. J., & Heuvelen, A. v. (2001). Surveying students’ conceptual knowledge of electricity
and magnetism. American Journal of Physics, 69(7), 12-23. https://doi.org/10.1119/1.1371296
Mannila, K., Koponen, I. T., & Niskanen, J. A. (2002). Building a picture of students’ conceptions of wave-particle-like properties
of quantum entities. European Journal of Physics, 23, 45-54. https://doi.org/10.1088/0143-0807/23/1/307
Marshman, E., & Singh, C. (2017). Investigating and improving student understanding of quantum mechanics in the context of
single photon interference. Physical Review Physics Education Research, 13, 010117.
https://doi.org/10.1103/PhysRevPhysEducRes.13.010117
Marshman, E., & Singh, C. (2019). Validation and administration of a conceptual survey on the formalism and postulates of
quantum mechanics. Physical Review Physics Education Research, 15, 020128.
https://doi.org/10.1103/PhysRevPhysEducRes.15.020128
Mashhadi, A., & Woolnough, B. (1999). Insights into students’ understanding of quantum physics: visualizing quantum entities.
European Journal of Physics, 20(6), 511-516. https://doi.org/10.1088/0143-0807/20/6/317
Mayring, P. (2010). Qualitative Inhaltsanalyse: Grundlagen und Techniken [Qualitative content analysis: Basics and techniques]. Beltz
Verlagsgruppe. https://doi.org/10.1007/978-3-531-92052-8_42
McKagan, S. B., Perkins, K. K., & Wieman, C. E. (2010). Design and validation of the Quantum Mechanics Conceptual Survey.
Physical Review Special Topics - Physics Education Research, 6(2), 020121. https://doi.org/10.1103/PhysRevSTPER.6.020121
Meinhardt, C. (2018). Entwicklung und Validierung eines Testinstruments zu Selbstwirksamkeitserwartungen von (angehenden)
Physiklehrkräften in physikdidaktischen Handlungsfeldern [Development and validation of a test instrument for self-efficacy
expectations of (prospective) physics teachers in physics-didactic fields of activity]. Logos. https://doi.org/10.30819/4712
Meinhardt, C., Rabe, T., & Krey, O. (2018). Formulierung eines evidenzbasierten Validitätsarguments am Beispiel der Erfassung
physikdidaktischer Selbstwirksamkeitserwartungen mit einem neu entwickelten Instrument [Formulation of an
evidence-based validity argument using the example of recording physics-didactic self-efficacy expectations with a
newly developed instrument]. Zeitschrift für Didaktik der Naturwissenschaften, 24, 131-150. https://doi.org/10.1007/s40573-018-0079-6
Moosbrugger, H., & Kelava, A. (2012). Testtheorie und Fragebogenkonstruktion [Test theory and questionnaire construction].
Springer Verlag. https://doi.org/10.1007/978-3-642-20072-4
Müller, R., & Wiesner, H. (2002). Teaching quantum mechanics on an introductory level. American Journal of Physics, 70, 200.
https://doi.org/10.1119/1.1435346
Mummendey, H. D., & Grau, I. (2014). Die Fragebogen-Methode: Grundlagen und Anwendungen in Persönlichkeits-, Einstellungs- und
Selbstkonzeptforschung [The questionnaire method: Basics and applications in personality, attitude and self-concept
research]. Hogrefe.
Olsen, R. V. (2002). Introducing quantum mechanics in the upper secondary school: A study in Norway. International Journal of
Science Education, 24(6), 565-574. https://doi.org/10.1080/09500690110073982
Özdemir, G., & Clark, D. B. (2007). An Overview of Conceptual Change Theories. Eurasia Journal of Mathematics, Science and
Technology Education, 3(4), 351-361. https://doi.org/10.12973/ejmste/75414
Pearson, B. J., & Jackson, D. P. (2010). A hands-on introduction to single photons and quantum mechanics for undergraduates.
American Journal of Physics, 78, 471-484. https://doi.org/10.1119/1.3494251
Ramlo, S. (2008). Validity and reliability of the force and motion conceptual evaluation. American Journal of Physics, 76(9), 882-886.
https://doi.org/10.1119/1.2952440
Robbins, N., & Heiberger, R. (2011). Plotting Likert and other rating scales. Proceedings of the 2011 Joint Statistical Meeting, 1058-
1066.
Rost, J. (2004). Lehrbuch Testtheorie – Testkonstruktion [Textbook test theory - test construction]. Verlag Hans Huber.
Russell, D. W. (2002). In Search of Underlying Dimensions: The Use (and Abuse) of Factor Analysis in Personality and Social
Psychology Bulletin. Personality and Social Psychology Bulletin, 28, 1629-1646. https://doi.org/10.1177/014616702237645
Sadaghiani, H., & Pollock, S. J. (2015). Quantum mechanics concept assessment: Development and validation study. Physical
Review Special Topics - Physics Education Research, 11, 010110. https://doi.org/10.1103/PhysRevSTPER.11.010110
Schermelleh-Engel, K., Moosbrugger, H., & Müller, H. (2003). Evaluating the fit of structural equation models: Tests of
significance and descriptive goodness-of-fit measures. Methods of Psychological Research Online, 8(2), 23-74.
Schnell, C. (2016). Lautes Denken als qualitative Methode zur Untersuchung der Validität von Testitems [Think-aloud as a qualitative
method for investigating the validity of test items]. Zeitschrift für ökonomische Bildung, 5, 26-49.
Schumacker, R. E., & Lomax, R. G. (2004). A Beginner’s Guide to Structural Equation Modeling (2nd ed.). Lawrence Erlbaum.
https://doi.org/10.4324/9781410610904
Scott, T. F., Schumayer, D., & Gray, A. R. (2012). Exploratory factor analysis of a Force Concept Inventory data set. Physical Review
Special Topics - Physics Education Research, 8(2), 020105. https://doi.org/10.1103/PhysRevSTPER.8.020105
Singh, C. (2001). Student understanding of quantum mechanics. American Journal of Physics, 69, 885-895.
https://doi.org/10.1119/1.1365404
Singh, C. (2007). Student Difficulties with Quantum Mechanics Formalism. AIP Conference Proceedings, 883, 185-188.
https://doi.org/10.1063/1.2508723
Singh, C., & Marshman, E. (2015). Review of student difficulties in upper-level quantum mechanics. Physical Review Special Topics
- Physics Education Research, 11, 020117. https://doi.org/10.1103/PhysRevSTPER.11.020117
Spatz, V., Hopf, M., Wilhelm, T., Waltner, C., & Wiesner, H. (2020). Introduction to Newtonian mechanics via two-dimensional
dynamics - The effects of a newly developed content structure on German middle school students. European Journal of
Science and Mathematics Education, 8(2), 76-91. https://doi.org/10.30935/scimath/9548
Stadermann, H. K. E., van den Berg, E., & Goedhart, M. J. (2019). Analysis of secondary school quantum physics curricula of 15
different countries: Different perspectives on a challenging topic. Physical Review Physics Education Research, 15, 010130.
https://doi.org/10.1103/PhysRevPhysEducRes.15.010130
Steif, P. S., & Dantzler, J. A. (2005). A statics concept inventory: Development and psychometric analysis. Journal of Engineering
Education, 94, 363-371. https://doi.org/10.1002/j.2168-9830.2005.tb00864.x
Steiger, J. H., & Lind, J. C. (1980). Statistically based tests for the number of common factors [Paper presentation]. Annual Spring
Meeting of the Psychometric Society, Iowa City, IA.
Stone, A., Allen, K., Rhoads, T. R., Murphy, T. J., Shehab, R. L., & Saha, C. (2003). The Statistics Concept Inventory: A pilot study.
In Proceedings of the 33rd ASEE/IEEE Frontiers in Education Conference (Vol. 1, pp. T3D-1–T3D-6).
https://doi.org/10.1109/FIE.2003.1263336
Styer, D. F. (1996). Common misconceptions regarding quantum mechanics. American Journal of Physics, 64, 31-34.
https://doi.org/10.1119/1.18288
Taber, K. S. (2018). The Use of Cronbach’s Alpha When Developing and Reporting Research Instruments in Science Education.
Research in Science Education, 48, 1273-1296. https://doi.org/10.1007/s11165-016-9602-2