Students’ misconceptions of statistical inference: A review of the empirical evidence from research on statistics education

A solid understanding of inferential statistics is of major importance for designing and interpreting empirical results in any scientific discipline. However, students are prone to many misconceptions regarding this topic. This article structurally summarizes and describes these misconceptions by presenting a systematic review of publications that provide empirical evidence of them. This group of publications was found to be dispersed over a wide range of specialized journals and proceedings, and the methodology used in the empirical studies was very diverse. Three research needs rise from this review: (1) further empirical studies that identify the sources and possible solutions for misconceptions in order to complement the abundant theoretical and statistical discussion about them; (2) new insights into effective research designs and methodologies to perform this type of research; and (3) structured and systematic summaries of findings like the one presented here, concerning misconceptions in other areas of statistics, that might be of interest both for educational researchers and teachers of statistics.
Educational Research Review 2 (2007) 98–113
Ana Elisa Castro Sotos , Stijn Vanhoof, Wim Van den Noortgate, Patrick Onghena
Centre for Methodology of Educational Research, Katholieke Universiteit Leuven, Vesaliusstraat 2, 3000 Leuven, Belgium
Received 19 December 2006; received in revised form 13 April 2007; accepted 27 April 2007
© 2007 Elsevier Ltd. All rights reserved.
Keywords: Statistical inference; Misconceptions; Students; Research review
1. Introduction
Since the early 1970s, there has been an increasing interest in research about people’s understanding and performance
in probability and statistics. Researchers especially focused on misconceptions and heuristics regarding probability,
chance, and randomness (e.g., Kahneman, Slovic, & Tversky, 1982; Konold, 1989, 1991; Nisbett & Ross, 1980;
Shaughnessy, 2003). This article reports on a systematic review of the available empirical evidence of students’
misconceptions in statistical inference. It constitutes a starting point for educational researchers interested in the
relation between misconceptions and the conceptual change theory (e.g., Finch & Cumming, 1998; Guzzetti, Snyder,
Glass, & Gamas, 1993; Smith, diSessa, & Roschelle, 1993), or a helpful tool for teachers of statistical inference to
become aware of the most common misconceptions that their students may hold (e.g., Batanero, Godino, Vallecillos,
& Holmes, 1994; Brewer, 1985; Haller & Krauss, 2002).
In educational research, the term misconception is used to refer to several concepts. On the one hand, authors often
consider a broad definition of the word, using it to label different concepts such as preconception, misunderstanding,
misuse, or misinterpretation interchangeably (Smith et al., 1993). On the other hand, misconceptions are sometimes
defined in a more restrictive way, as misunderstandings generated during instruction, emphasizing a distinction with
alternative conceptions resulting from ordinary life and experience (Guzzetti et al., 1993). In this manuscript, a refinement
of the first definition is applied, and the term refers to any sort of fallacies, misunderstandings, misuses, or
misinterpretations of concepts, provided that they result in a documented systematic pattern of error (Cohen, Smith,
Chechile, Burns, & Tsai, 1996).
The interest in statistical inference arises from three realities. First, this is a topic of major relevance for the
development of research in all empirical sciences in general and psychology and education in particular (Belia, Fidler,
Williams, & Cumming, 2005; Krauss & Wassner, 2002). Second, inference receives special attention in statistics
courses in almost all scientific areas, where hypotheses tests and confidence intervals are taught to students as the
methods for evaluating scientific hypotheses (Aberson, Berger, Healy, & Romero, 2003; APA, 2001). Finally, inferential
ideas seem to be especially prone to being misunderstood, and students often fall into deep misconceptions
(Daniel, 1998; Kirk, 2001), because these ideas require students to understand and connect many abstract concepts such as
sampling distribution and significance level.
After presenting our methodology of search for this review (Section 2), we provide an overview of the misconceptions
mentioned and exemplified in the literature, and describe to what extent and under which conditions they occur,
discussing the methodology of the presented group of studies (Section 3). Finally, we conclude with some suggestions
for further research (Section 4).
2. Method
We performed a thorough literature exploration in order to bring together publications that report on studies providing
empirical evidence of university students’ misconceptions that have been published during the last 15 years (from 1990
to the beginning of 2006). Therefore, studies based on personal experience and anecdotes only or publications oriented
to other groups (such as professionals or younger students) were excluded.
Other publications that did not match our inclusion criteria, for instance purely theoretical discussions of misconceptions,
will be used here to illustrate original ideas or describe these misconceptions.
We followed four different lines of search: first, as our main source, we surveyed the Web of Science (ISI, 2004),
PsycINFO (APA, 2006), and ERIC (IES, 2006) electronic databases. We based our selection of key words [footnote 1] on the
criteria described above and the main concepts of statistical inference. These concepts were chosen according to the
logical structure of statistical inference, which is based on the foundational sampling theory and consists of two main
inferential techniques: hypotheses tests and confidence intervals. This structure is also reflected in the content
and order of topics in most handbooks of introductory courses on statistical inference (e.g., Healey, 2005; Moore &
McCabe, 2006). These handbooks approach statistical inference by first introducing sampling processes and related
concepts and properties, focusing on the concepts of population, sample, point estimation by means of sample statistics
(e.g., the sample mean), sample and sampling distributions, and practical approximations (e.g., central limit theorem).
Next, students learn how to construct and understand confidence intervals for the estimation of different parameters as
well as to perform hypotheses tests. We will use this structure to present the results of our review in this manuscript
(see Section 3).
Second, after searching in electronic databases, we complemented our list of references by scrutinizing in depth the
main forums for research on statistics education:
Journal of Statistics Education, all available articles: from Volume 1 (July 1993) to Volume 13 (November 2005).
Statistics Education Research Journal, all available articles: from Volume 1 (May 2002) to Volume 4 (November
Proceedings from the 5th (1998) and 6th (2002) International Conferences on Teaching Statistics (ICOTS).
Next, we started a third line of search, tracking down all references cited in the results of the previous two searches.
Footnote 1: The following combination of keywords was used: (Misconception OR error OR misuse OR misinterpretation OR misunderstanding OR fallacy)
AND (inference OR sampling OR normal OR confidence OR interval OR test OR level OR p-value) AND student AND (course OR college OR
Table 1
Summary of publications presented in this review

Topic concerning the misconception                        Publications
Sampling distributions
  The law of small numbers and sampling variability       Chance et al. (2004), delMas and Liu (2005), Finch (1998), Sedlmeier (1998) and Well et al. (1990)
  The different distributions                             Chance et al. (2004) and Lipson (2002)
  The central limit theorem                               Batanero et al. (2001, 2004)
Hypotheses tests
  Approaches to hypotheses testing                        Vallecillos and Batanero (1997)
  Definition of hypotheses                                Vallecillos and Batanero (1997)
  The conditional nature of significance levels           Haller and Krauss (2002), Vallecillos (2002), Vallecillos and Batanero (1997) and Williams (1998)
  Interpretation of the numerical value of the p-value    –
  Nature of hypotheses tests                              Falk and Greenbaum (1995), Haller and Krauss (2002) and Vallecillos (1995, 1996, 2000)
  Evaluation of statistical significance
Confidence intervals
  See Table 3                                             Fidler (2006)
Finally, as the fourth search, we looked for those journal articles, book chapters, and conference proceedings that
cite classical literature about stochastic reasoning (e.g., Kahneman et al., 1982; Shaughnessy, 1992) or the results from
the previous three searches, making use of two electronic search engines (Web of Science and Google Scholar). This
last search complemented the previous three by finding manuscripts that those searches might have missed.
We classified the resulting group of references following the structure of inferential statistics courses mentioned
above, depending on the topic of each documented misconception (see Section 3). In addition, we recorded specific
information for each of the studies: characteristics of participants (number, gender, statistical background, and studies),
country where the study took place, and method of data collection (interview, multiple-choice test, open-answer
questionnaire, etc.). Most of this information, when not available in the text, can be found in Appendix A.
3. Results
As our findings show, the literature on statistics education, and particularly publications providing empirical evidence
of misconceptions in statistics, is sparse (see also Batanero, 2005; Ware & Chastain, 1991). The four searches defined
above yielded more than 500 references that contained only 21 publications (cf. Table 1) reporting on 17 different
studies (cf. Table A1) that provide evidence of misconceptions about topics related to statistical inference and that
satisfied our selection criteria. Whereas some of the misconceptions were described several times, for others we found
none or only one or two empirical studies.
Although most of the studies share the same type of design (a one-group posttest-only evaluation), the
main characteristic of the research presented by this set of publications is the variation in the methodology used.
First, regarding the setting, some studies were carried out in the classroom environment (e.g., Batanero, Tauber, &
Sánchez, 2001; Well, Pollatsek, & Boyce, 1990, Study 1), and a few under very controlled conditions (e.g., Vallecillos &
Batanero, 1997; Well et al., 1990, Study 3), whereas other publications do not clarify under which conditions the study took
place (e.g., Fidler, 2006). Second, a course is sometimes (e.g., Lipson, 2002), but not always (e.g., Haller & Krauss,
2002), included as a treatment stage of the study. Third, with regard to the data gathering methodology, interviews (e.g.,
delMas & Liu, 2005) and mixtures of multiple-choice and open-answer items can be found (e.g., Vallecillos, 1995,
1996, 2000, 2002). The items are mostly presented in paper-and-pencil format (e.g., Vallecillos, 1996) and seldom by
means of a computer (e.g., Sedlmeier, 1998). Finally, there are differences in the existence of a pre-test (e.g.,
the SRA [footnote 2] in Vallecillos & Batanero, 1997, versus none in Falk & Greenbaum, 1995).
Footnote 2: Statistical Reasoning Assessment; see Garfield (2003).
Another factor making it difficult to compare the results of these studies is the difference in sample sizes, which range
from very small (e.g., delMas & Liu, 2005; Vallecillos & Batanero, 1997) to large numbers of participants (e.g.,
Vallecillos, 1995, 1996, 2000).
In the following, we describe the misconceptions found in this literature research. Although a strict classification
would be somewhat artificial because many misconceptions overlap with each other, for the sake of clarity, they are
expounded one by one, according to the logical structure for inferential statistics courses mentioned above. First, a
detailed description of each misconception is illustrated with the help of all references regarding that misconception
that were found, not only those satisfying the criteria in Section 2. Next, the selected publications detected by our
searches are enumerated and discussed in more detail, highlighting the most striking empirical results that they
3.1. Sampling distributions
Sampling distributions are central to statistical inference. They are used to test hypotheses and construct confidence
intervals. The key idea in inferential reasoning is that a sample provides some, but not complete, information about
the population from which it is drawn. Understanding this fundamental property of sampling processes implies understanding
a balance between two concepts. First, sample representativeness, meaning that when the process of selecting
the sample has been performed properly, the sample will often have characteristics similar to those of the population;
and, second, variability, implying that not all the samples can resemble the population in the same way and to the same
extent every time (Batanero et al., 1994).
Many statistics educators have stressed that the sampling distribution is a core idea in the understanding of statistical
inference. Yet, despite its critical role, experience and research have shown that this concept is generally poorly
understood (Chance, delMas, & Garfield, 2004; Lipson, 2002; Shaughnessy, 1992; Tversky & Kahneman, 1971).
Although many students are able to apply the necessary notions to deal with sampling processes in isolation, they often
lack the ability to integrate the different ideas and properly use concepts in inferential reasoning (Batanero, 2005).
Many authors claim that the explanation for this lack is the presence of misconceptions pertaining to such processes,
which are not sufficiently known by teachers (Batanero et al., 1994). The following compilation of empirical evidence
might be, therefore, of much use for teachers of statistical inference.
3.1.1. Misconceptions concerning the law of small numbers and sampling variability
The sample mean is a key descriptive statistic for inferential analyses. However, many misconceptions regarding
sampling processes concern the sample mean, more specifically, the properties of its sampling distribution. The most
important of such properties is the so-called law of large numbers. This law states that, for any population with finite
mean, the population mean is the limit of the sample mean as the sample size increases with new observations drawn
at random from the population, therefore implying that the variability of the sample mean tends to zero as the sample
size increases. This final implication for the variability of the sample mean has been proven to be misunderstood by
many students, who neglect the effect of sample size on the variance of the sample mean (Chance et al., 2004).
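The effect of sample size on the variability of the sample mean can be checked by simulation. The following is a minimal sketch and is not taken from any of the reviewed studies: it assumes a hypothetical Uniform(0, 1) population, and the function name `sd_of_sample_means` and all parameter values are illustrative choices.

```python
import random
import statistics


def sd_of_sample_means(sample_size, n_samples=2000, seed=0):
    """Empirical standard deviation of the sample mean for one sample size.

    Assumption for illustration only: the population is Uniform(0, 1).
    """
    rng = random.Random(seed)
    # Draw many independent samples and keep the mean of each one.
    means = [
        statistics.fmean(rng.uniform(0.0, 1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]
    # The spread of these means estimates the variability of the sample mean.
    return statistics.stdev(means)


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        print(f"n = {n:3d}  sd of sample mean = {sd_of_sample_means(n):.4f}")
```

Under these assumptions the printed standard deviations should shrink roughly in proportion to 1/sqrt(n), that is, roughly halve each time the sample size is quadrupled.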
Our search criteria identified five publications documenting empirical evidence about misconceptions regarding the
idea behind the law of large numbers, more specifically: Well et al. (1990), Finch (1998), Sedlmeier (1998), Chance
et al. (2004; see next section), and delMas and Liu (2005).
Concerning the variability of the sample mean, the university psychology students surveyed by Well et al. (1990)
seemed to understand that the means of larger samples are more likely to resemble the population mean. Moreover, some
of the participants (also students of psychology) in the study by Finch (1998) considered that variance can decrease in
some sense or that reliability gains from large samples. However, they did not seem to understand its implication for
the variability of the sample mean, even when they seemed to have a clear understanding of the distinction between
sample and sampling distributions and had observed evidence of the phenomenon (Well et al., 1990).
These difficulties might have their origin in a misunderstanding of the law of large numbers. According to the
representativeness heuristic (Tversky & Kahneman, 1971), people confuse the sample and the population distributions,
believing that any sample must be very similar to the population, regardless of its size, and therefore extrapolating
the law of large numbers to small samples. This misconception is known as the belief in the law of small numbers, a
term coined by Tversky and Kahneman (1971). The believer in the law of small numbers underestimates the size of
confidence intervals, overestimates the significance in tests of hypothesis, and is over-confident in obtaining the same
results in future replications of the experiment. A student holding such belief will not see the sample size or the sample
bias as factors that affect the validity of inferential conclusions or generalizations to the population (Innabi, 1999).
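The over-confidence in replications can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions, none of which come from the reviewed studies: a Normal population with true mean 0.5 and known standard deviation 1, a two-sided z-test of the null hypothesis "mean = 0" at the 5% level, and a hypothetical helper named `replication_rate`. It estimates how often an independent replication of a given size reaches significance.

```python
import random
from statistics import NormalDist, fmean


def replication_rate(n, true_mean=0.5, sd=1.0, alpha=0.05, n_reps=4000, seed=1):
    """Share of simulated replications of size n that reach significance.

    Assumptions (illustrative only): Normal data with known sd, and a
    two-sided z-test of H0: mean = 0.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    significant = 0
    for _ in range(n_reps):
        sample_mean = fmean(rng.gauss(true_mean, sd) for _ in range(n))
        z = sample_mean / (sd / n ** 0.5)  # test statistic under H0: mean = 0
        if abs(z) > z_crit:
            significant += 1
    return significant / n_reps


if __name__ == "__main__":
    for n in (10, 100):
        print(f"n = {n:3d}  share of significant replications = {replication_rate(n):.2f}")
```

With these assumed values, only about a third of the replications with n = 10 should be significant even though the effect is real, whereas almost all replications with n = 100 should be. A believer in the law of small numbers would expect the small replication to succeed far more often.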
Gigerenzer and Sedlmeier (Gigerenzer, 1996; Sedlmeier, 1998; Sedlmeier & Gigerenzer, 2000) argue that the
occurrence of this misconception depends on the type of task. These authors make a distinction between frequency
distribution tasks (e.g., problems in Bar-Hillel, 1979), which deal with the frequency of an event, and sampling
distribution tasks (e.g., in Kahneman & Tversky, 1972), which concern the frequency with which particular events’
sample means or proportions fall into specified categories. They suggest that the intuition that larger samples usually
lead to more exact estimates of the population mean or proportion (Finch, 1998; Well et al., 1990) helps in the solution of
the frequency rather than the sampling distribution tasks because the law of large numbers can be applied to frequency
distribution tasks directly, whereas sampling distribution tasks require repeated application of it, as well as a judgment
about the variances of the distributions. In addition, they provide empirical evidence of this phenomenon in the three
studies described in Sedlmeier (1998).
The misconceptions described above are therefore related to, and might arise from, a profound lack of insight into
the idea of variability in random events, which has a direct impact on the understanding of sampling processes and hence
on the relevant properties of the sample mean distribution such as the law of large numbers. Before being able to
understand the concept and features of sampling distributions, students should be able to develop a good understanding
of the idea of variability (delMas & Liu, 2005). For that reason, they are expected to arrive at university introductory
courses with a comfortable command of this concept. However, several studies such as Innabi (1999) or Watson and Moritz
(2000) demonstrate that this is not yet the case and that there is still a lot of work to do in making students understand
variability in pre-university education. It seems that even though natural cognitive development might be favorable to
improving the understanding of sampling variability (delMas & Liu, 2005; Watson & Moritz, 2000), statistics courses
still fail to help and support students in constructing an understanding of the basic and core concepts of probability
and statistics such as the idea of variability.
3.1.2. Misconceptions concerning the different distributions
As a consequence of the representativeness misconception described above, which provokes the confusion of a
specific sample distribution and the corresponding population distribution, two severe misconceptions can arise. First,
if students believe that the sampling distribution of a statistic should have the same shape and properties as the population
distribution, they will confuse the population and the sampling distributions (Chance et al., 2004). Second, as a result of
the combination of the representativeness misconception and the confusion of population and sampling distributions,
a student might not be able to detect the difference between the distribution of a sample and the sampling distribution
of a statistic. We found two studies related to these misconceptions: Lipson (2002) and Chance et al. (2004).
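The three distributions that students tend to confuse can be placed side by side in a short simulation. This is a hedged sketch rather than a reconstruction of any instrument used in the reviewed studies: the skewed (exponential) population, the sample size of 25, and the names `three_distributions` and `describe` are all assumptions made for illustration.

```python
import random
import statistics


def three_distributions(sample_size=25, n_samples=2000, pop_size=20000, seed=42):
    """Return a population, one sample, and a sampling distribution of the mean.

    Assumption for illustration only: the population is exponential (skewed).
    """
    rng = random.Random(seed)
    # (1) Population distribution: values of the variable for every unit.
    population = [rng.expovariate(1.0) for _ in range(pop_size)]
    # (2) Distribution of a sample: the values observed in ONE sample.
    sample = rng.sample(population, sample_size)
    # (3) Sampling distribution: the sample MEAN computed over many samples.
    sampling = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]
    return {"population": population, "sample": sample, "sampling": sampling}


def describe(values):
    """Mean and standard deviation of a list of values."""
    return statistics.fmean(values), statistics.stdev(values)


if __name__ == "__main__":
    for name, values in three_distributions().items():
        mean, sd = describe(values)
        print(f"{name:10s}  mean = {mean:.3f}  sd = {sd:.3f}")
```

Under these assumptions, the sampling distribution of the mean is centered near the population mean, but its standard deviation is much smaller than that of the population (roughly the population standard deviation divided by the square root of the sample size), so it does not "look like the population".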
Lipson (2002) attempted to improve the development of the concept of sampling distribution and facilitate the formation
of links between the sampling distribution and statistical inference. She exposed a group of mathematically weak
students to an instructional treatment with two computer sampling packages and found that, although 10 out of the
23 participants correctly identified the sampling distribution as the distribution of a sample statistic, 5 showed evidence
of one of the other two misconceptions described above, incorrectly designating the distribution of the sample as the
sampling distribution. Moreover, in the same study, only 7 and 3 out of the 23 students explicitly linked the sampling
distribution to the determination of the concepts of p-value and confidence interval, respectively. Lipson concluded that
it seems possible that the extensive use of the sampling software was helpful in elucidating some important concepts
of sampling distribution in each of the specific contexts in which it was applied for some students. However, these
software packages have no specific role in illustrating the concepts and links, which together form a schema for the
generalized sampling distribution.
In turn, Chance et al. (2004) performed a series of five studies to document students’ learning of sampling distributions.
More specifically, their third study was centered on a conceptual analysis of students’ prior knowledge
and misconceptions. The results of their post-treatment assessment tasks showed that many students still exhibited
misconceptions concerning sampling distributions. In particular, they believed that a sampling distribution should look
like the population (and even more as the sample size increases) showing that they did not understand that a sampling
distribution is a distribution of sample statistics and confusing it with the population distribution. One example of
a statement wrongly selected by the students that showed this misconception is “As the sample size increases, the
sampling distribution of means looks more like the population, has the same mean as the population, and has a standard
deviation that is similar to the population” (Chance et al., 2004).
3.1.3. Misconceptions concerning the central limit theorem
Not only does the law of large numbers explain the sampling behavior of the sample mean and its variability, but also
the central limit theorem provides a useful and often used approximation for the sampling distribution of the sample
mean. This theorem states that, for sufficiently large sample sizes, the sampling distribution of the sample mean can
be approximated by a Normal distribution, which facilitates inferential calculations. Studies satisfying our searching
criteria that provide empirical evidence of misconceptions with regard to the Normal distribution, and therefore in
relation to this theorem, are Batanero et al. (2001) and Batanero, Tauber, and Sánchez (2004).
One of the misconceptions related to the understanding of the central limit theorem is that students wrongly
extrapolate it and believe that, the larger the sample size, the closer the distribution of any statistic will approximate
a Normal distribution (Bower, 2003). A related misconception is that, because they believe it always can be applied,
students seem to be confused when they try to find the reason to use such an approximation and give a justification for
the use of the Normal distribution, although they might be comfortable doing the formal manipulations needed to use
this theorem (Wilensky, 1997).
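That the Normal approximation does not extend to every statistic can be illustrated numerically. The sketch below is an illustration only, with assumptions that do not come from the cited works: samples of size 50 from a Uniform(0, 1) population, the sample maximum as the counterexample statistic, and hypothetical helper names `skewness` and `simulate_statistic`. It compares the skewness of the simulated sampling distributions of the mean and of the maximum; a Normal distribution has skewness 0.

```python
import random
import statistics


def skewness(values):
    """Moment-based skewness: close to 0 for a symmetric, Normal-like shape."""
    m = statistics.fmean(values)
    m2 = statistics.fmean((v - m) ** 2 for v in values)
    m3 = statistics.fmean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5


def simulate_statistic(statistic, sample_size, n_samples=5000, seed=7):
    """Simulated sampling distribution of `statistic` for Uniform(0, 1) samples."""
    rng = random.Random(seed)
    return [
        statistic([rng.random() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    n = 50
    means = simulate_statistic(statistics.fmean, n)
    maxima = simulate_statistic(max, n)
    print(f"skewness of sample means:  {skewness(means):+.2f}")
    print(f"skewness of sample maxima: {skewness(maxima):+.2f}")
```

Under these assumptions the sample means should show skewness near zero, while the sample maxima stay strongly left-skewed however large the sample, because the central limit theorem concerns the sample mean (and statistics that behave like sums), not any statistic whatsoever.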
In addition, students have shown the misconception of not properly distinguishing between the real sampling distribution
and the theoretical model of a Normally distributed population, which, based on the central limit theorem, is used only as an
approximation of this sampling distribution to test the null hypothesis in a significance
test. This misconception might be related to a more general misconception about the differences between theoretical
(mathematical) Normal distributions and empirical (for the actual data) almost-Normal distributions as documented
in Batanero et al. (2001, 2004). They found that, for example, their first year university students did not identify the
Normal model with an equation (the analytical expression of the Normal density function) and approximately 81%
of them (n = 55) selected the statement “The Normal curve is a model that is defined by empirical data” and were not
aware that the Normal curve is a theoretical distribution.
In summary, although students might be able to perform all necessary manipulations and formal calculations for
testing a hypothesis, it has been shown that many of them hold deep misconceptions related to sampling distributions.
The main ones concern the law of large numbers, the confusion of these distributions with the sample
or the population distributions, and the central limit theorem. These misconceptions have a direct impact on learning
inferential statistics because of the interconnection of the concepts and methods and the relevance of their understanding
for an appropriate interpretation of inferential results and conclusions. For example, a student who confounds the
sampling distribution of a statistic and the population distribution of the variable under study will believe that “something
is wrong in the process” when, for instance, a graphical representation of these distributions shows a great difference
between them. Another example would be a student who uses the central limit theorem under the wrong conditions
(because of the belief in its universal validity) and reaches wrong conclusions for the population of study. In a similar
way, a student confounding the sample and population distributions is prone to overlook the sample bias as a factor
that affects the validity of inferential conclusions.
3.2. Hypotheses tests
The main tool in inferential statistics is the hypotheses test (also called significance test). This technique aims to state
the evidence in a sample against a previously defined (null) hypothesis, minimizing certain risks. Getting students to
make sense of hypotheses tests is a very difficult goal for statistics instructors because of the persistence and depth
of the misconceptions held by learners (Brewer, 1985; Daniel, 1998; Kirk, 2001), even after years of training (Falk,
1986; Vallecillos, 2002). The main reason for this phenomenon is that performing these tests requires students to
understand and be able to relate many abstract concepts such as the concept of a sampling distribution, the significance
level, null and alternative hypotheses, the p-value, and so on.
For more than 20 years, misconceptions and misuses regarding hypotheses tests have been discussed. A historical
summary of bad practices can be found in Daniel (1998) and an overview of the controversy about the use of these tests
in research in Falk (1986), Harlow, Mulaik, and Steiger (1997), Kirk (2001), McLean and Ernest (1998), and Reichardt
and Gollob (1997). According to Krantz (1999), misconceptions concerning hypotheses tests are not the fault of the
method, but the responsibility of those who use it. Several authors (e.g., Brewer, 1985; Gliner, Leech, & Morgan, 2002)
have pointed at textbooks as the main culprits of misconceptions, while others (e.g., Haller & Krauss, 2002) claimed that
there is also a large number of statistics instructors who share the misconceptions of their students, and have an even
larger influence on fostering the misconceptions than the textbooks have. In fact, Gordon (2001), Lecoutre, Poitevineau,
and Lecoutre (2003), and Mittag and Thompson (2000) convincingly showed that even statisticians are not immune to
misconceptions of hypotheses tests.
The following compilation of publications provides an overview of the main misconceptions held by university
students (and apparently also by professional researchers) with regard to hypotheses tests. In the next sections it will
be highlighted when, for a misconception, no empirically based publications were found by our searches and its
description is based on purely theoretical studies or on results from types of participants other than university
students (e.g., Misconception 5 from exploration of textbooks in Gliner et al., 2002).
3.2.1. Misconception concerning the different approaches to hypotheses testing
As shown in Cohen (1990) and Chow (1996), students’ misconceptions about hypotheses tests can be classified
and analyzed from several points of view, but, in general, most misconceptions about learning and using these tests
have been attributed in the literature to two aspects: the philosophy behind the test, and the interpretation of concepts
and results. Regarding the philosophy of the test, hybridism between Fisher’s and Neyman–Pearson’s approaches is
mostly present in statistical practice (Batanero, 2000; Chow, 1996; Falk & Greenbaum, 1995; Vallecillos, 2000). Fisher’s
systematic comparison of the p-value with the level of significance, which has become routine in
interpreting the results of hypothesis tests, is applied together with the Neyman–Pearson focus on decision. Neyman–Pearson’s
a priori choice of the significance level is also widely used, as is their type I and type II error
terminology. As Borovcnik and Peard (1996) indicated, the different axiomatic theories do not cover the ideas of the
other positions very well and the mixture is the culprit of much confusion around the use of hypotheses tests (see also
Gigerenzer, 1993).
Vallecillos and Batanero (1997) is the only publication found by our searching criteria aiming to empirically address
this (and the following) misconception. They found students’ conception far from the logic normally used in teaching
that considers hypotheses testing as a decision process in order to accept or reject a hypothesis, according to the
Neyman–Pearson’s theory. Some of their students considered the hypotheses as alternatives in a decision problem
under uncertainty, whereas the majority of them did not acknowledge the parallelism between the hypotheses testing
and the decision process.
3.2.2. Misconceptions concerning the definition of the hypotheses
With regard to misconceptions about concepts and results, there are elements in every stage of hypothesis testing
that students misunderstand. First, in Vallecillos and Batanero (1996) it is remarked that the confusion between the
null and the alternative hypotheses turns out to be a serious misconception that obstructs the understanding of the
testing process and especially the correct interpretation of its results. In fact, the choice of inadequate hypotheses
determines the results of the complete process and in Vallecillos (1999) it is asserted that the first step of defining the
null and alternative hypotheses presents great comprehension problems for students, who are unable to identify the
most appropriate statement for each case.
In Vallecillos and Batanero (1997) evidence is provided of students’ incorrect choice of hypothesis. Although
participants in their study agree on the theoretical idea of stating a null hypothesis with the intention of finding
evidence against it, they are not consistent when they are asked to define the null and alternative hypothesis for a
specific contextualized problem, exchanging those two concepts. Moreover, another two specific misconceptions were
detected in Vallecillos and Batanero (1997):
Believing that the null hypothesis and the acceptance region “are the same thing”.
Believing that a hypothesis can refer both to a population and a sample.
3.2.3. Misconceptions concerning the conditional nature of significance levels
Once the hypotheses have been specified and the analytical calculations for the test (including the estimation of
the sampling distribution of the test statistic) have been performed, it is time for the student to interpret the obtained
results. The most complicated concepts associated with hypotheses testing results are the significance level (α) and the
p-value (p) (Haller & Krauss, 2002).
The most common misconception of the significance level and the p-value is that of switching the two terms in the
conditional probabilities (Falk, 1986). As a consequence, α is considered as the probability that the null hypothesis is
true once the decision to reject has been taken, and p as the probability that the null hypothesis is true given the observed
data. Our searches have detected empirical evidence of this switch in Vallecillos and Batanero (1997), Williams (1998),
Vallecillos (2002), and Haller and Krauss (2002).
The occurrence of this misconception is impressive. For example, 30 out of the 44 students surveyed by Haller and
Krauss agreed with a statement saying “You know, if you decide to reject the null hypothesis, the probability that you
are making the wrong decision” referring to the meaning of the p-value of the test (Haller & Krauss, 2002). In the
study by Vallecillos, 53% (n = 436) of the students selected a similar item stating “A level of significance of 5% means
that, on average, 5 out of every 100 times that we reject the null hypothesis, we shall be wrong” (Vallecillos, 2002).
In these studies, university students from different disciplines following statistics courses were selected (only from
psychology in the case of Haller and Krauss).
Using the same items later mentioned in Vallecillos (2002), Vallecillos and Batanero (1997) carried out a series
of interviews with seven excellent students, most of whom interpreted the significance level as the probability of the null
hypothesis being true given that it has been rejected; in other words, they fell into the misconception of switching the
terms in the conditional probability defined by α. Finally, Williams (1998) also interviewed university students about
the concept of significance level and likewise found several quotes reflecting the switch of the conditional probabilities.
The misconception of confusing a conditional and its inverse might reveal a deeper misconception inherited from
probability, documented in Shaughnessy and Dick (1991), that was recently ascribed to the language most textbooks
and trainers use to describe conditional probabilities, in Ancker (2006; see also Falk, 1986; Falk & Greenbaum, 1995).
This author identifies the switch of the conditional (and the ignoring, see next section) as a main misconception when,
for example, students think the conditional probability of having a disease given a positive diagnostic test to be the
same as the conditional probability of a positive diagnostic test given the disease (or just the probability of having the
disease, see next section). In Ancker’s opinion, this misconception might be triggered by the textbook and teacher’s use
of the expression “A given B” for the probability P(A|B) which may not be sufficient to draw attention to the existence
of a new category of events given by the change in the sample space introduced by the conditional (Ancker, 2006).
This author’s advice is to describe conditional events as “A within B” or “A in B” which can encourage students to
visualize one event as a subset of another, thus emphasizing the existence of a new category defined by the conditioning
event.
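The “A within B” reading can be made concrete with counts. The following is a minimal sketch in Python; all numbers (10,000 people, 1% prevalence, 90% of diseased people testing positive, 5% of healthy people testing positive) are hypothetical and are not taken from Ancker (2006) or from any study reviewed here. The helper name `within` is likewise only illustrative.

```python
from fractions import Fraction


def within(count_a_and_b: int, count_b: int) -> Fraction:
    """P(A within B): the share of group B that also belongs to A."""
    return Fraction(count_a_and_b, count_b)


# Hypothetical screening population (illustrative numbers only).
diseased = 100
healthy = 9_900
diseased_and_positive = 90   # 90% of the diseased test positive
healthy_and_positive = 495   # 5% of the healthy test positive
positive = diseased_and_positive + healthy_and_positive

# "Positive within diseased": the group we look inside is the diseased.
p_positive_within_diseased = within(diseased_and_positive, diseased)
# "Diseased within positive": the group we look inside is the positives.
p_diseased_within_positive = within(diseased_and_positive, positive)
# Ignoring the conditional altogether: diseased within everybody.
p_diseased = within(diseased, diseased + healthy)

print("P(positive within diseased) =", float(p_positive_within_diseased))
print("P(diseased within positive) =", float(p_diseased_within_positive))
print("P(diseased)                 =", float(p_diseased))
```

Under these assumed counts the three quantities are 9/10, 2/13, and 1/100, which shows how far apart a conditional probability, its inverse, and the unconditional probability can be.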
Ignoring the conditional and considering the significance level and the p-value as single event probabilities generates
another important group of misconceptions concerning the conditional nature of significance levels:
α is regarded as the probability that one of the hypotheses is true.
α is regarded as the probability of making a mistake.
p is regarded as the probability that the observed event happened by chance (an interpretation that is not necessarily
incorrect but is vague since it does not specify the nature of the conditional).
In explaining why these misconceptions prevail among students, statistical terminology was already blamed in
Falk (1986), where it was suggested that they might be well explained by the Neyman–Pearson’s way to denote α
as the probability of a type I error. This expression is not conditionally phrased and could reasonably be interpreted
as the conjunction of the two uncertain events (i.e., that the null hypothesis is true and that it is rejected). Falk
and Greenbaum (1995) discuss in more depth the role of verbal ambiguity in statistical misconceptions and
introduce the illusion of probabilistic proof by contradiction (see Misconception 6) as a possible serious consequence
of it.
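The difference between the conditional reading of α, the conjunction reading, and the inverted conditional can be illustrated with a small simulation. The sketch below assumes two-sided z-tests with known unit variance; the share of true null hypotheses (80%), the effect size under the alternative (0.3), and the sample size (25) are hypothetical choices, not values from the reviewed studies, and the function name `simulate_tests` is only illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_tests(n_tests=20_000, share_null_true=0.8, effect=0.3,
                   n=25, alpha=0.05, seed=1):
    """Simulate many z-tests and return three different probabilities."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    null_true = rejected = rejected_and_null = 0
    for _ in range(n_tests):
        h0_true = rng.random() < share_null_true
        mu = 0.0 if h0_true else effect
        # The z statistic of a sample mean from N(mu, 1) is N(mu * sqrt(n), 1).
        z = rng.gauss(mu * sqrt(n), 1.0)
        reject = abs(z) > z_crit
        null_true += h0_true
        rejected += reject
        rejected_and_null += h0_true and reject
    return {
        "P(reject | H0 true)": rejected_and_null / null_true,    # this is alpha
        "P(H0 true and reject)": rejected_and_null / n_tests,    # conjunction
        "P(H0 true | reject)": rejected_and_null / rejected,     # inverse
    }


results = simulate_tests()
for label, value in results.items():
    print(f"{label}: {value:.3f}")
```

Only the first quantity is fixed by the chosen significance level. The other two depend on how often the null hypothesis is true and on the power of the test, so under these assumptions the share of rejections that are wrong is far larger than 5 out of 100.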
Our searches found only one publication that empirically investigates this type of misconception. This is the study
mentioned above (Haller & Krauss, 2002), where 14 out of the 44 psychology students believed that a hypotheses
test provides the single probability of the alternative hypothesis being true and 26 agreed that α is the single probability
of the null hypothesis.
3.2.4. Misconception concerning the interpretation of the numerical value of the p-value
The last misconception of the p-value is that of students considering its numeric value as an indicator of the strength
of the treatment effect under test, or the distance between the groups that are being compared. This misconception is
addressed by Gliner and colleagues in their study of statistical textbooks (Gliner et al., 2002). They state that outcomes
with lower p-values are sometimes interpreted by students as having stronger treatment effects than those with higher
p-values. On the other hand, the strong disagreement of AERA (American Educational Research Association) members
with the statement that p-values directly measure study effect size is reported in Mittag and Thompson (2000).
Despite the description of the misconception provided in Gliner et al. (2002) based on their findings regarding
textbooks, no empirical studies trying to demonstrate that university students do, in fact, fall into this misconception have
been found by our searches.
Table 2
Modus tollens and the illusion of probabilistic proof by contradiction
| Statements | Modus tollens | Example | Illusion |
|---|---|---|---|
| Premise 1 | p ⇒ q | If it is raining there are clouds | If H0 is true there is a high probability that the p-value is large |
| Premise 2 | q^c | There are no clouds | The p-value is small |
| Conclusion | p^c | It is not raining | H0 is improbable |
3.2.5. Misconceptions concerning the nature of hypotheses tests
The misconceptions around the key concepts of significance level and p-value might have a direct consequence
on the conclusions that students extract from hypotheses tests. Mainly, there are two misconceptions pertaining
to the information that a statistical test provides, related to the interpretation of the p-value: first, considering the
test as a mathematical (logical) proof (Vallecillos, 1995), and second, as a probabilistic proof (Falk & Greenbaum,
1995) of one of the hypotheses. When students consider the test as a mathematical proof, they assume that, just
as any mathematical procedure, the test results (p-value) are deterministic. Therefore, they believe that the null
hypothesis has been proved to be true or false. On the other hand, if students consider a statistical test as a
probabilistic proof it means that they are falling into the so-called illusion of probabilistic proof by contradiction (or
the illusion of attaining improbability). This illusion is a consequence of the similarity in the formal structures of
hypotheses tests’ reasoning and the mathematical proof by contradiction, which is based on the logical modus tollens
method. As can be seen in Table 2, the misconception arises when an analogous method is applied to hypothesis tests.
The analogy to the mathematical proof by contradiction does not work for hypotheses tests because a contradiction
disproves the premise from which it is drawn, but a low probability event does not make the premise from which
it is drawn improbable. This property shows that there is an important difference between statistical inference and
propositional logic, since they are equivalent only when probabilities are 0 or 1 (Nilsson, 1986).
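A numerical sketch may help to see why the analogy breaks down. The function below applies Bayes’ rule to the “illusion” column of Table 2; the prior probability of H0 and the probabilities of obtaining a small p-value under each hypothesis are hypothetical inputs chosen for illustration, and the function name is not taken from any of the cited publications.

```python
def prob_h0_given_small_p(prior_h0: float,
                          p_small_given_h0: float,
                          p_small_given_h1: float) -> float:
    """Posterior probability of H0 after observing a small p-value."""
    joint_h0 = prior_h0 * p_small_given_h0
    joint_h1 = (1 - prior_h0) * p_small_given_h1
    return joint_h0 / (joint_h0 + joint_h1)


# Deterministic premise: under H0 a small p-value never occurs.
# Here modus tollens is valid and H0 is ruled out.
logical_case = prob_h0_given_small_p(0.5, 0.0, 0.8)

# Probabilistic premise with a powerful test and an even prior:
# H0 does become improbable after a small p-value.
favourable_case = prob_h0_given_small_p(0.5, 0.05, 0.8)

# Same probabilistic premise, but H0 is plausible a priori and
# the test has low power: H0 remains more likely than not.
unfavourable_case = prob_h0_given_small_p(0.9, 0.05, 0.1)

print(logical_case, favourable_case, unfavourable_case)
```

In the last case the premise “if H0 is true, a small p-value is unlikely” still holds, yet the conclusion “H0 is improbable” does not follow. The conclusion is forced only in the first case, where the probability involved is exactly 0.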
Our searches detected five publications addressing these two misconceptions (mathematical and probabilistic proof):
Vallecillos (1995, 1996, 2000), Haller and Krauss (2002), and Falk and Greenbaum (1995). First, Vallecillos provided
evidence of the mathematical conception of hypotheses tests finding that more than 42% (n = 436) of the students
participating in the survey selected the statement “A hypotheses test, correctly performed, establishes the truth of
one of the two hypotheses, null or alternative”. This percentage was especially striking for the pedagogy (55.8%) and
psychology (74.3%) students (Vallecillos, 1995, 1996, 2000). In turn, in the study by Haller and Krauss (2002) a large
number of students agreed with the statements that the test absolutely disproves the null hypothesis (15 out of 44) or
absolutely proves the alternative (9 out of 44).
On the other hand, about half (26 out of 53) of the psychology students responding to the survey by Falk and Greenbaum
(1995) believed that when a test of significance is conducted and the result turns out significant for a predetermined
level, the null hypothesis has been proved to be improbable, showing evidence of the probabilistic proof by contradiction.
3.2.6. Misconception concerning the evaluation of statistical significance
One of the major misconceptions mentioned in the literature that students might encounter when evaluating hypotheses
tests is that of confusing statistical and practical significance. The misconceptions that
the calibration and evaluation of the difference between statistical significance and practical significance entail are
stressed in Gliner et al. (2002). Statistical significance does not imply practical significance, nor the other way around. Profound
knowledge about the contextualization of the test and the design of the experiment (sample sizes, etc.), as well as the
encountered effect sizes, is needed in order to know when a statistically significant result is also practically significant.
In turn, a practically significant result might turn out not to be statistically significant. This is the reason why it
is proposed in Batanero et al. (2004) to drop the word significant from data analysis vocabulary and use it only in its
everyday sense to describe something actually noteworthy or important.
Students might not be aware of this important reality (Kirk, 2001) and even statisticians seem to have very little
idea of how the interpretation of p-values should depend on sample size and how important effect sizes are, extracting
wrong conclusions from their analyses (Lecoutre et al., 2003). Unfortunately, despite the stress put on the relevance of
this misconception by some authors like Gliner et al. (2002) or Lecoutre et al. (2003), there is no empirical evidence
of it regarding university students in any of the publications detected by our searches.
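The dependence of the p-value on sample size, and hence the gap between statistical and practical significance, can be sketched with a simple two-group z-test. The code assumes a known common standard deviation of 1 and equal group sizes; the mean differences and sample sizes are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a z-test comparing two independent means."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A trivially small effect (0.05 standard deviations) in a huge sample.
p_tiny_effect_large_n = two_sided_p(mean_diff=0.05, sd=1.0, n_per_group=20_000)

# A large effect (0.8 standard deviations) in a very small sample.
p_large_effect_small_n = two_sided_p(mean_diff=0.8, sd=1.0, n_per_group=8)

print(f"tiny effect, n = 20000 per group: p = {p_tiny_effect_large_n:.2g}")
print(f"large effect, n = 8 per group:    p = {p_large_effect_small_n:.2g}")
```

Under these assumptions the negligible effect yields the far smaller p-value, so the numeric value of p cannot be read as the strength of the effect, and a significant result need not be practically important.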
In summary, although students might be able to perform all necessary manipulations and formal statistical calculations
concerning hypotheses tests, they have been shown to hold deep misconceptions about the meaning, use, and
main concepts of this technique.
3.3. Confidence intervals
Due to the damaging over-reliance on the extensively misunderstood hypotheses tests, a wider use of confidence
intervals has been recommended by many authors in order to improve research communication (Cumming,
Williams, & Fidler, 2004; Harlow et al., 1997). Confidence intervals are strongly advocated as the best reporting strategy
(APA, 2001), as an alternative, or complement to hypotheses tests in statistical inference (Reichardt & Gollob,
However, confidence intervals are also not always properly interpreted (Fidler, 2006; Fidler, Thomason, Cumming,
Finch, & Leeman, 2004) and are prone to misconceptions. According to Fidler (2006), statistical reform has been
advocated to a large degree based on the compelling argument of the tendency of hypotheses tests to be misinterpreted,
but without providing an alternative that was shown to be less misunderstood, or to be easier to explain and teach to
students.
Few researchers have studied misconceptions concerning confidence intervals, and the ones who have done so,
such as Cumming and colleagues (Belia et al., 2005; Cumming, 2006; Cumming & Maillardet, 2006; Cumming et
al., 2004; Fidler et al., 2004), have mostly focused on researchers’ understanding instead of that of university students. For
example, in Cumming et al. (2004) a deep misconception was found concerning the question “What is the probability
that the next replication mean will fall within the original 95% confidence interval?” An internet-based investigation
of researchers’ answers to this question (given a graphical bar representation of the original 95% confidence interval)
showed a large proportion of participants believing that about 95% of replication means will fall within the original
confidence interval. This misconception is consistent with the law of small numbers intuition of underestimating
sampling variability, and therefore believing that replications will be unjustifiably close to a first result (Cumming et
al., 2004). However, the probability is close to .95 only when the original mean happens to fall very close to μ, and
the capture rate drops as it falls further from μ. In fact, the average probability is only .834 (Cumming & Maillardet, 2006).
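The capture rate for replication means can be checked with a short calculation. The sketch below assumes normally distributed data, a known standard deviation, and an original study and replication of equal size; these are simplifying assumptions made here for illustration and not necessarily the exact setting of Cumming and Maillardet (2006). Under them, the replication mean lies inside the original interval whenever the two means differ by less than the half-width of the interval.

```python
import random
from math import sqrt
from statistics import NormalDist


def capture_probability(level: float = 0.95) -> float:
    """Analytic P(replication mean falls inside the original CI)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # The difference of two independent sample means has a standard
    # deviation sqrt(2) times the standard error of a single mean.
    return 2 * NormalDist().cdf(z / sqrt(2)) - 1


def simulate_capture(n=20, reps=20_000, level=0.95, seed=7) -> float:
    """Monte Carlo check with mu = 0 and sigma = 1."""
    rng = random.Random(seed)
    standard_error = 1 / sqrt(n)
    half_width = NormalDist().inv_cdf(0.5 + level / 2) * standard_error
    hits = 0
    for _ in range(reps):
        original_mean = rng.gauss(0.0, standard_error)
        replication_mean = rng.gauss(0.0, standard_error)
        hits += abs(replication_mean - original_mean) <= half_width
    return hits / reps


analytic = capture_probability()
simulated = simulate_capture()
print(f"analytic: {analytic:.3f}, simulated: {simulated:.3f}")
```

Under these assumptions the analytic value is about .834, well below the .95 that many participants expected.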
Cumming et al. (2004) conclude that the graphical representation of confidence intervals as bars in figures possibly
prompts the misconception in a large majority of researchers. In the similar study (internet research) by Belia et al.
(2005), researchers’ understanding of the graphical comparison of confidence intervals for two separate means was
studied. Participants widely believed that non-overlap of two 95% confidence intervals on independent means implies
a significant difference at the .05 level between the means and that overlap of the two intervals implies that there is
no significant difference. However, this rule is incorrect. Although non-overlap of the two confidence intervals does
imply a significant difference, an overlap does not necessarily imply that there is no statistically significant difference
at the .05 level.
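A counterexample to the overlap rule is easy to construct. The sketch below assumes two independent means with known standard errors and normal-theory intervals; the means (0 and 3) and the standard errors (1 each) are hypothetical values chosen for illustration, and the function name `compare_means` is not taken from the cited studies.

```python
from math import sqrt
from statistics import NormalDist


def compare_means(mean1, mean2, se1, se2, level=0.95):
    """Return (do the two CIs overlap?, two-sided p-value of the difference)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    ci1 = (mean1 - z * se1, mean1 + z * se1)
    ci2 = (mean2 - z * se2, mean2 + z * se2)
    overlap = ci1[0] <= ci2[1] and ci2[0] <= ci1[1]
    # The standard error of the difference combines both standard errors.
    z_diff = (mean2 - mean1) / sqrt(se1 ** 2 + se2 ** 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z_diff)))
    return overlap, p_value


# Intervals overlap, yet the difference is significant at the .05 level.
overlap_a, p_a = compare_means(0.0, 3.0, 1.0, 1.0)
# Intervals do not overlap: the difference is necessarily significant.
overlap_b, p_b = compare_means(0.0, 4.5, 1.0, 1.0)

print(f"case A: overlap={overlap_a}, p={p_a:.3f}")
print(f"case B: overlap={overlap_b}, p={p_b:.4f}")
```

In case A the two 95% intervals share the range from about 1.04 to 1.96, but the test of the difference gives p of roughly .03. The reason is that the standard error of a difference is smaller than the sum of the two individual standard errors.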
The only study that our searches have found concerning students’ understanding of confidence intervals, Fidler
(2006), suggests that confidence intervals help to reduce the misconception that a statistically non-significant result is
equivalent to a practically non-significant one (or ‘no effect’). When asked to interpret results in confidence interval format,
participants in this research misinterpreted statistically non-significant results (from a low powered study with a non-
trivial effect size) much less than participants who had to interpret the results in hypotheses test format (10 versus 24,
out of 55).
Nevertheless, a second series of experiments with psychology and ecology students revealed to Fidler that confidence
intervals themselves are prone to a new set of misconceptions. Table 3 summarizes the percentage of students (n = 180)
choosing each description of a confidence interval from a prepared list that included several misconceptions.
Table 3
Misconceptions of confidence intervals detected by Fidler (2006)
| Description of a confidence interval | Percentage of students (n = 180) |
|---|---|
| Plausible values for the sample mean | 38 |
| Range of individual scores | 8 |
| Range of individual scores within one standard deviation | 11 |
| The width of a confidence interval increases with sample size | 20 |
| The width of a confidence interval is not affected by sample size | 29 |
| A 90% confidence interval is wider than a 95% confidence interval (for the same data) | 73 |
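The last three descriptions in Table 3 concern the width of the interval, which can be checked directly. The sketch below assumes a normal-theory interval for a mean with known standard deviation; the standard deviation (10) and the sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(sd: float, n: int, level: float) -> float:
    """Width of a normal-theory confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return 2 * z * sd / sqrt(n)


width_95_n25 = ci_width(sd=10, n=25, level=0.95)
width_95_n100 = ci_width(sd=10, n=100, level=0.95)
width_90_n25 = ci_width(sd=10, n=25, level=0.90)

print(f"95% CI, n = 25:  width = {width_95_n25:.2f}")
print(f"95% CI, n = 100: width = {width_95_n100:.2f}")
print(f"90% CI, n = 25:  width = {width_90_n25:.2f}")
```

Under these assumptions, quadrupling the sample size halves the width, and the 90% interval is narrower than the 95% interval for the same data.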
Table 4
Overview of misconceptions suggested by the reviewed studies
| Area | Topic | Misconception |
|---|---|---|
| Sampling distributions | The law of small numbers and sampling variability | Neglect the effect of sample size on the variance of the sample mean |
| | | Belief in the law of small numbers |
| | The different distributions | Confuse the population and the sampling distributions |
| | | Confuse the sample and the sampling distributions |
| | The central limit theorem | Belief that the larger the sample size, the closer any distribution is to the Normal |
| | | Inability to justify the use of the theorem and the Normal |
| | | Confusion between the theoretical and the approximated Normal |
| Hypotheses tests | Approaches to hypotheses testing | Neglect the parallelism between hypotheses test and decision process |
| | Definition of hypotheses | Confusion in the definition of null and alternative hypotheses |
| | | Confusion of the null hypothesis and the acceptance region |
| | | Believing that a hypothesis can refer both to a population and a sample |
| | The conditional nature of significance levels | Inverse the conditional of the p-value |
| | | Inverse the conditional of the significance level |
| | | Interpreting the significance level as the probability of one hypothesis |
| | | Interpreting the significance level as the probability of making a mistake |
| | | Interpreting the p-value as the probability that the event happened by chance |
| | Interpretation of the numerical value of the p-value | Interpreting the numeric value of the p-value as strength of treatment |
| | Nature of hypotheses tests | Considering the test as a mathematical proof |
| | | Illusion of probabilistic proof by contradiction |
| | Evaluation of statistical significance | Confuse practical and statistical significance |
| Confidence intervals | | See Table 3 |
4. Conclusion
The group of publications analyzed in this review provides empirical evidence of deep and widespread misconceptions
among students in statistical inference. This evidence demonstrates that, although they may be able to manipulate and carry out
calculations with statistical data, students have severe misconceptions concerning the interpretation of results from inferential
techniques. According to this literature, the following summary and classification of students’ misconceptions
in statistical inference can be suggested (Table 4).
As the results of our searches showed, the available number of publications providing empirical data is still small;
therefore, research should still fill some gaps and present empirical evidence about misconceptions that have not been
much documented so far. These are the following: mixing approaches to hypotheses tests, confusing the hypotheses,
ignoring the conditional event in the p-value and significance level descriptions, interpreting the p-value as the strength
of the effect, evaluating the meaning of statistical significance versus practical significance, and misconceptions concerning
confidence intervals. Besides, this review made clear the need for more empirically based studies that shed
light on the sources of misconceptions. Students’ misconceptions might have their origin in textbooks, as suggested by
Brewer (1985) and Gliner et al. (2002); in teachers bringing the misconceptions into the classroom (Haller & Krauss,
2002); or in inherited intuitions as documented in many publications, some of which are cited in this review to exemplify
and clarify the evolution of certain misconceptions of statistical inference (e.g., Innabi, 1999; Watson & Moritz, 2000,
in the first group of misconceptions of Section 3.1, or Shaughnessy & Dick, 1991, in the third group of misconceptions
of Section 3.2). In addition, empirical studies should be set up to find possible means to specifically address and help
students overcome the misconceptions within the statistics course, via observational and intervention studies with
large groups of participants from different backgrounds.
A second conclusion concerns the methodological aspect of this type of research: more evidence
should be sought in order to confirm good methodologies for studies that detect,
describe, and try to find helpful activities to defeat students’ misconceptions. Apparently, the classical means to do so
do not provide enough information about the misconceptions or are not enlightening in the search for solutions, since
misconceptions keep on being found. As mentioned above, in addition to continuing the existing type of research (with
questionnaires, interviews, etc.), special emphasis should be put on carrying out new kinds of studies, such as design
experiments, to analyze possible means to tackle the detected misconceptions. One of the approaches to explore is
that of confronting students with their misconceptions, considered by some authors as the most effective strategy to
achieve the accurate conceptions (Finch & Cumming, 1998;Guzzetti et al., 1993). Our recommendation for statistics
educators is to include both computational and non-computational items in their in-classroom assessments, so that their
students’ misconceptions can be identified. They could make use, for example, of the tools created by the ARTIST3
team, who made available on-line topic scales and the Comprehensive Assessment of Outcomes in a First Statistics
course (CAOS), designed for exactly that purpose.
Third, the dispersed nature of this kind of publication makes it necessary to promote and consider structured and
systematic summaries of evidence like the one presented here, concerning students’ misconceptions in the different
areas of statistics (correlation analysis, measures of variability, etc.), as the best starting point for further research about
misconceptions and their relation with the conceptual change theory, and for instructional design. This type of review
shows the remaining gaps where research can be done and helps teachers of statistics by providing practical summaries
of the available findings about students’ misconceptions within the specific areas of statistics.
Appendix A
The following outline summarizes the main characteristics of participants and methodology for the 20 selected
studies (Table A1).
3Assessment resource tools for improving statistical thinking (available on line:
Table A1
Main characteristics of the selected studies
| Studies | Participants | Students’ area of study | Statistical background | Design | Instruments |
|---|---|---|---|---|---|
| Batanero et al. (2001) | 55 | Variety | Pre-university | 1 group: pre-test + course + evaluation | Pre-test: SRA; Course: 10 sessions of 2 h (half with Statgraphics and half in traditional classroom); Evaluation: open answer 20-item questionnaire + report of three tasks solved with Statgraphics |
| Batanero et al. (2004) | 117 | Variety | Pre-university | 1 group: pre-test + course + evaluation | Pre-test: SRA; Course: 10 sessions (6 of 1.5 h and 4 of 1 h) half in PC lab and half in traditional classroom; Evaluation: 3 open-ended tasks to solve with Statgraphics + 21-item questionnaire |
| Chance et al. (2004) | More than 100 | Variety | None assumed | 3 similar settings: activity with PC + assessment | Software: simulation; Assessment: posttest and final exam |
| delMas and Liu (2005) | 12 | | Pre-university | 1 group: interview | Interview while interaction with computer program |
| Falk and Greenbaum (1995) | 53 | Psychology | Three courses | 1 group: questionnaire | Questionnaire: one multiple-choice item about the result of a test |
| Fidler (2006) | 55 (1st study); 180 (2nd) | Ecology (1st); ecology and psychology (2nd) | At least introductory university statistics | 1st: 1 group: comparison of two versions of one question; 2nd: 1 group: questionnaire | Question: interpret a test or confidence interval result (comparison); Questionnaire: selection of statements |
| Finch (1998) | 20 | 3rd year psychology | | 1 group posttest only: 1 question + interview | Question: rate statements about description of research study on a four-point scale (true to false); Interview: explain the rating |
| Haller and Krauss (2002) | 44 | Psychology | – | 3 groups (instructors, scientists not teaching, students) comparison: questionnaire | Questionnaire: six statements to rate as true or false about a hypotheses test’s results |
| Lipson (2002) | 23 | (Part-time) | None or little | 1 group: course + exercises | Exercises: concept mapping exercises solved in-class |
| Sedlmeier (1998) | 46 (1st); 22 + 40 (2nd); 31 (3rd) | Variety | – | 1st and 2nd: 2 groups comparison: 2 (1st) or 3 (2nd) tasks; 3rd: 2 groups comparison: interview | 1st: two versions of the tasks in a PC (click the right of three possible answers); 2nd: idem with three tasks; 3rd: interview about real experimentation with two versions of two tasks (the control group did not have to do one of the questions) |
| Vallecillos (1995, 1996, 2000, 2002) | 436 | Variety | One introductory university course | 1 group: questionnaire + interview | Questionnaire: 20-item (true/false, multiple-choice, and open answer) |
| Vallecillos and Batanero (1997) | 7 | Medicine | Very good | 1 group: questionnaire + interview | Questionnaire: three true/false items and two problems |
| Well et al. (1990) | 114 (1st); 151 (2nd a); 138 (2nd b); 120 | Psychology | No previous university | 1st: 1 group: questionnaires in groups of 5–15 (half the groups received the items in inverse order); 2nd: 2 groups comparison: questionnaire (two versions); 3rd: 2 groups comparison: controlled conditions: four problems in groups 5–15 (two versions) | 1st and 2nd: two open-answer problems; 3rd: four open-answer problems |
| Williams (1998) | 18 | – | | 1 group: interviews | Interviews: clinical with one concept mapping task and two textbook hypothesis test’s task |
References
Aberson, C. L., Berger, D. E., Healy, M. R., & Romero, V. L. (2003). Evaluation of an interactive tutorial for teaching hypothesis testing concepts.
Teaching of Psychology, 30(1), 75–78.
Ancker, J. S. (2006). The language of conditional probability. Journal of Statistics Education, 14, retrieved July 28, 2006, from
APA. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: American Psychological Association.
APA. (2006). PsycINFO. APA., retrieved November 27, 2006 from [on-line]
Bar-Hillel, M. (1979). The role of sample size in sample evaluation. Organizational Behavior and Human Performance, 24, 245–257.
Batanero, C. (2000). Controversies around the role of statistical tests in experimental research. Mathematical Thinking and Learning (An International
Journal), 2(1/2), 75–97.
Batanero, C. (2005). Statistics education as a field for research and practice. In Proceedings of the 10th international commission for mathematical
instruction. Copenhagen, Denmark: International Commission for Mathematical Instruction.
Batanero, C., Godino, J. D., Vallecillos, A., Green, D. R., & Holmes, P. (1994). Errors and difficulties in understanding elementary statistical
concepts. International Journal of Mathematics Education in Science and Technology, 25(4), 527–547.
Batanero, C., Tauber, L. M., & Sánchez, V. (2001). Significado y comprensión de la distribución normal en un curso introductorio de análisis de
datos [Meaning and understanding of the normal distribution in an introductory data analysis course]. Quadrante, 10(1), 59–92.
Batanero, C., Tauber, L. M., & Sánchez, V. (2004). Students’ reasoning about the normal distribution. In D. Ben-Zvi & J. Garfield (Eds.), The
challenge of developing statistical literacy, reasoning and thinking (pp. 257–276). Dordrecht, The Netherlands: Kluwer Academic Publishers.
Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological
Methods, 10, 389–396.
Borovcnik, M., & Peard, R. (1996). Probability. In A. J. Bishop (Ed.), International handbook of mathematics education (pp. 239–287). Dordrecht,
The Netherlands: Kluwer Academic Publishers.
Bower, K. M. (2003). Some misconceptions about the Normal distribution. Paper presented at the Six Sigma Forum. Milwaukee, WI: American
Society for Quality.
Brewer, J. K. (1985). Behavioral statistics textbooks: Source of myths and misconceptions? Journal of Educational Statistics, 10(3), 252–268.
Chance, B., delMas, R. C., & Garfield, J. (2004). Reasoning about sampling distributions. In D. Ben-Zvi & J. Garfield (Eds.), The challenge of
developing statistical literacy, reasoning and thinking (pp. 295–323). Dordrecht, The Netherlands: Kluwer Academic Publishers.
Chow, S. L. (1996). Statistical significance. London: SAGE Publications Ltd.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.
Cohen, S., Smith, G., Chechile, R. A., Burns, G., & Tsai, F. (1996). Identifying impediments to learning probability and statistics from an assessment
of instructional software. Journal of Educational and Behavioral Statistics, 21(1), 35–54.
Cumming, G. (2006). Understanding replication: Confidence intervals, p-values, and what’s likely to happen next time. In Proceedings of the seventh
international conference on teaching statistics. International Association for Statistical Education.
Cumming, G., & Maillardet, R. (2006). Confidence intervals and replication: Where will the next mean fall? Psychological Methods, 11, 217–
Cumming, G., Williams, J., & Fidler, F. (2004). Replication, and researchers’ understanding of confidence intervals and standard error bars.
Understanding Statistics, 3, 199–311.
Daniel, L. G. (1998). Statistical significance testing: A historical overview of misuse and misinterpretation with implications for the editorial policies
of educational journals. Research in the Schools, 5(2), 23–32.
delMas, R. C., & Liu, Y. (2005). Exploring students’ conceptions of the standard deviation. Statistics Education Research Journal, 4(1), 55–82.
Falk, R. (1986). Misconceptions of statistical significance. Journal of Structural Learning, 9, 83–96.
Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard. Theory and Psychology, 5(1), 75–98.
Fidler, F. (2006). Should psychology abandon p-values and teach CIs instead? Evidence-based reforms in statistics education. In Proceedings of the
seventh international conference on teaching statistics. International Association for Statistical Education.
Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can’t make them
think: Statistical reform lessons from medicine. Psychological Science, 15, 119–123.
Finch, S. (1998). Explaining the law of large numbers. In L. Pereira-Mendoza, L. Seu Kea, T. Wee Kee, & W. K. Wong (Eds.), Proceedings of the
fifth international conference on teaching statistics (pp. 731–736). Voorburg, The Netherlands: International Statistical Institute.
Finch, S., & Cumming, G. (1998). Assessing conceptual change in learning statistics. In L. Pereira-Mendoza, L. Seu Kea, T. Wee Kee, & W. K.
Wong (Eds.), Proceedings of the fifth international conference on teaching statistics (pp. 897–904). Voorburg, The Netherlands: International
Statistical Institute.
Garfield, J. (2003). Assessing statistical reasoning. Statistics Education Research Journal, 2, 22–38.
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in
the behavioral sciences. Methodological issues (pp. 311–339). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
Gigerenzer, G. (1996). On narrow norms and vague heuristics: A reply to Kahneman and Tversky. Psychological Review, 103, 592–596.
Gliner, J. A., Leech, N. L., & Morgan, G. A. (2002). Problems with null hypothesis significance testing (NHST): What do the textbooks say? The
Journal of Experimental Education, 71(1), 83–92.
Gordon, H. R. D. (2001). AVERA members’ perceptions of statistical significance tests and other statistical controversies. Journal of Vocational
Education Research, 26(2). Retrieved January 12, 2006, from
Guzzetti, B. J., Snyder, T. E., Glass, G. V., & Gamas, W. S. (1993). Promoting conceptual change in science: A comparative meta-analysis of
instructional interventions from reading education and science education. Reading Research Quarterly, 28(2), 116–159.
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research,
7(1), 1–20.
Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no significance tests? London: Lawrence Erlbaum Associates,
Healey, J. F. (2005). Statistics. A tool for social research (7th ed.). Belmont, CA: Thomson Wadsworth.
IES. (2006). ERIC. Retrieved November 28, 2006, from nfpb=true& pageLabel=Home page
Innabi, H. (1999). Students’ judgment of the validity of societal statistical generalization. In A. Rogerson (Ed.), Proceedings of the international
conference on mathematics education into the 21st Century: Societal challenges, issues and approaches
ISI. (2004). Web of science. Retrieved November 29, 2006, from [on-line].
Kahneman, D., Slovic, P., & Tversky, A. (1982). Judgment under uncertainty: Heuristics and biases. Cambridge University Press.
Kahneman, D., & Tversky, A. (1972). Subjective probability: A judgment of representativeness. Cognitive Psychology, 3, 430–454.
Kirk, R. E. (2001). Promoting good statistical practices: Some suggestions. Educational and Psychological Measurement, 61(2), 213–
Konold, C. (1989). Informal conceptions of probability. Cognition and Instruction, 6(1), 59–98.
Konold, C. (1991). Understanding students’ beliefs about probability. In E. von Glasersfeld (Ed.), Radical constructivism in mathematics education
(pp. 139–156). Dordrecht, The Netherlands: Kluwer Academic Publishers.
Krantz, D. H. (1999). The null hypothesis testing controversy in psychology. Journal of the American Statistical Association, 94, 1372–1381.
Krauss, S., & Wassner, C. (2002). How significance tests should be presented to avoid the typical misinterpretations. In Proceedings of the sixth
international conference on teaching statistics. Voorburg, The Netherlands: International Statistical Institute.
Lecoutre, M.-P., Poitevineau, J., & Lecoutre, B. (2003). Even statisticians are not immune to misinterpretations of null hypothesis significance tests.
International Journal of Psychology, 38(1), 37–45.
Lipson, K. (2002). The role of computer based technology in developing understanding of the concept of sampling distribution. In Proceedings of
the sixth international conference on teaching statistics. Voorburg, The Netherlands: International Statistical Institute.
McLean, A., & Ernest, J. M. (1998). The role of statistical significance testing in educational research. Research in the Schools, 5(2), 15–
Mittag, K. C., & Thompson, B. (2000). A national survey of AERA members’ perceptions of statistical significance tests and other statistical issues.
Educational Researcher, 29(4), 14–20.
Moore, D. S., & McCabe, G. P. (2006). Introduction to the practice of statistics (5th ed.). New York: W.H. Freeman and Company.
Nilsson, N. J. (1986). Probabilistic logic. Artificial Intelligence, 28(1), 71–87.
Nisbett, R. E., & Ross, L. (1980). Human inference: Strategies and shortcomings of social judgment. Englewood Cliffs, NJ: Prentice Hall.
Reichardt, C. S., & Gollob, H. F. (1997). When confidence intervals should be used instead of statistical tests, and vice versa. In L. L. Harlow, S. A.
Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 259–284). London: Lawrence Erlbaum Associates, Publishers.
Sedlmeier, P. (1998). The distribution matters: Two types of sample-size tasks. Journal of Behavioral Decision Making, 11, 281–301.
Sedlmeier, P., & Gigerenzer, G. (2000). Was Bernoulli wrong? On intuitions about sample size. Journal of Behavioral Decision Making, 13, 133–
Shaughnessy, J. M. (1992). Research in probability and statistics: Reflections and directions. In D. A. Grouws (Ed.), Handbook of research on
mathematics teaching and learning (pp. 465–494). New York: Macmillan.
Shaughnessy, J. M. (2003). Research on students’ understandings of probability. In J. Kilpatrick, W. G. Martin, & D. Schifter (Eds.), A research
companion to principles and standards for school mathematics (pp. 216–226). Reston, VA: National Council of Teachers of Mathematics.
Shaughnessy, J. M., & Dick, T. (1991). Monty’s dilemma: Should you stick or switch? Mathematics Teacher, 84, 252–256.
Smith, J. P., III, diSessa, A. A., & Roschelle, J. (1993). Misconceptions reconceived: A constructivist analysis of knowledge in transition. The
Journal of the Learning Sciences, 3(2), 115–163.
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105–110.
Vallecillos, A. (1995). Comprensión de la lógica del contraste de hipótesis en estudiantes universitarios [Understanding of the logic of hypothesis
testing amongst university students]. Recherches en Didactique des Mathématiques, 15, 53–81.
Vallecillos, A. (1996). Students’ conceptions of the logic of hypothesis testing. Hiroshima Journal of Mathematics Education, 4, 43–61.
Vallecillos, A. (1999). Some empirical evidence on learning difficulties about testing hypotheses. In Proceedings of the 52nd session of the International
Statistical Institute (pp. 201–204). Helsinki: International Statistical Institute. Tome 58, Book 2
Vallecillos, A. (2000). Understanding of the logic of hypothesis testing amongst university students. Journal for Didactics of Mathematics, 21,
Vallecillos, A. (2002). Empirical evidence about understanding of the level of significance concept in hypotheses testing by university students.
Themes in Education, 3(2), 183–198.
Vallecillos, A., & Batanero, C. (1996). Conditional probability and the level of significance in tests of hypotheses. In L. Puig & A. Gutiérrez (Eds.),
Proceedings of the 20th conference of the International Group for the Psychology of Mathematics Education (pp. 271–378). Valencia, Spain:
University of Valencia.
Vallecillos, A., & Batanero, C. (1997). Conceptos activados en el contraste de hipótesis estadísticas y su comprensión por estudiantes universitarios
[Activated concepts in the statistical hypotheses contrast and their understanding by university students]. Recherches en Didactique des
Mathématiques, 17, 29–48.
Ware, M. E., & Chastain, J. D. (1991). Developing selection skills in introductory statistics. Teaching of Psychology, 18(4), 219–222.
Watson, J. M., & Moritz, J. B. (2000). Development of understanding of sampling for statistical literacy. Journal of Mathematical Behavior, 19,
Well, A. D., Pollatsek, A., & Boyce, S. J. (1990). Understanding the effects of sample size on the variability of the mean. Organizational Behavior
and Human Decision Processes, 47, 289–312.
Wilensky, U. (1997). What is normal anyway? Therapy for epistemological anxiety. Educational Studies in Mathematics, 33(2), 171–202.
Williams, A. M. (1998). Students’ understanding of the significance level concept. In L. Pereira-Mendoza, L. Seu Kea, T. Wee Kee, & W. K. Wong (Eds.),
Proceedings of the fifth international conference on teaching statistics (pp. 743–749). Voorburg, The Netherlands: International Statistical
... Именно здесь мы видим путь к улучшению ситуации, поскольку, по нашему мнению: 1) все недостатки господствующей практики вытекают из одногонеправильной интерпретации NHST подхода, которая захватывает не только слой массовой науки, но проникает в учебники, написанные специалистами самого высокого уровня в нашем сообществе, см. обсуждения [Gliner et al., 2002;Sotos et al., 2007]; ...
... Неверно ожидать, и что следующие выборочные средние попадут в этот интервал в 95% случаев. Другие распространенные ошибки (не закрепленные в учебных текстах) таковы: уверенность, что размер доверительного интервала не зависит от размера выборки, уверенность, что перекрывающиеся интервалы для средних двух независимых выборок означают отсутствие значимых различий между ними [Sotos et al., 2007]. ...
... Обоснование преимущества доверительных интервалов и других альтернативных форм представлений результатов может, следовательно, проводиться в эмпирических исследованиях. Примером такового может служить работа [Sotos et al., 2007], где показано, что студенты, получающие материал по экологической проблематике с иллюстрациями в виде error bars, лучше осознают, что не-отвержение нулевой гипотезы (которая говорит о безопасности некоторого подхода к решению проблем) не означает ее истинности [7]. При этом даже ошибочная интерпретация, сопровождающаяся описанием доверительных интервалов, может быть полезнее ошибки, обычно сопровождающей NHST, так как интервал, показывающий вероятное положение оцениваемого параметра, демонстрирует широкий диапазон согласующихся с выборкой возможностей. ...
Статья посвящена критическому обсуждению проблемы статистического вывода и методологии проверки нулевой гипотезы. В ней рассматриваются основные недостатки этого подхода к статистическому оцениванию данных. Выделяются несколько уровней критики проверки нулевой гипотезы: собственно статистический, связанный с процедурами и допущениями, стоящими за этой методологией, уровень социальных последствий, связанных с доминированием данного подхода в статистике, приводящий к ошибкам в интерпретации получаемых результатов, и, наконец, уровень соотнесения статистического и содержательного (психологического) анализа. Далее рассматриваются основные альтернативы, предлагающиеся в настоящее время для преодоления проблем, вызванных использованием методологии проверки нулевой гипотезы, дается их критическая оценка. Формулируется предварительный вывод о недостаточности изменения способов оценивания изолированного исследования.
... Misconceptions in statistical inference and hypothesis testing have also been reported in the literature through multiple studies. Their main subject concerned the investigation of the most common difficulties among university students and concentrated on the misconceptions that have not yet received much attention (Sotos et al., 2007(Sotos et al., , 2009Krishnan, Idris, 2014;Reeves, Brewer, 1980). Studies conducted by Motulsky (2015) and Akobeng (2016) investigated the types of errors as well as the common misconceptions about data analysis and statistical interpretations. ...
... Multicolinearity is a problem that happens when there is a dependence between two or more regressor variables. 34 18.89 ...
... The misconceptions recorded when answering questions related to model assumptions and hypothesis testing strongly stem from a consistent misunderstanding of statistical inference. A solid understanding of this latter is of crucial importance for fitting an econometric model and interpreting its results (Sotos et al., 2007). Also, when compared with earlier studies (Vallecillos, 2000;Rossman et al., 2004), our results confirm that there is still a considerable number of economics students holding misconceptions about the concepts of bias, error, estimation and multicolinearity. ...
Full-text available
Solving problems related to econometrics requires a good knowledge of regression analysis concepts. The objective of this study is to evaluate students' difficulties resulting from the lack of knowledge of regression analysis concepts among economics students enrolled in the Master's cycle at the institute of economics at the university center of Tipaza (Algeria). In order to analyze students' answers, a typical correction was prepared based on professors' answers to this questionnaire. The procedure consists of comparing students' key answers with their corresponding typical answers to see how near or far it is from the right answer. In order to see whether the difficulties are originated from the same students, we analyzed the association between answers based on Multiple Correspondence Analysis (MCA) method. The principal results showed that difficulties resulting from the lack of knowledge of regression analysis concepts were prevalent among students. Their main causes were strongly related to misunderstanding, misconceptions and confusions. MCA analysis indicated that students can be categorized according to their answers into four groups: a very weak group, a weak group, an average group and a good group. We concluded that the difficulty of solving problems in the context of linear regression among students is the result of a lack of knowledge of regression concepts coupled with the inability to explain them.
... De hecho, a escala mundial, el currículum escolar ha dado importancia a esta área y en diversos lugares se está tratando de introducir la inferencia estadística de manera temprana, incluso en la educación primaria (Harradine et al., 2011;Manor y Ben-Zvi, 2017;Watson, 2008). Diversos autores reportan que la enseñanza y aprendizaje de la inferencia estadística es compleja y aparecen múltiples preconcepciones erradas y dificultades en la interpretación de conceptos centrales, como por ejemplo distribuciones muestrales y tipos de hipótesis (Castro-Sotos et al., 2007;Harradine et al., 2011). ...
... Por otra parte, la enseñanza y el aprendizaje de la inferencia estadística generan múltiples desafíos. Por ejemplo, muchos estudiantes logran desarrollar los cálculos asociados a los procesos de inferencia, pero no razonar e interpretar adecuadamente los resultados de estos procesos (Castro-Sotos et al., 2007;Harradine et al., 2011). Otras dificultades en torno a la comprensión de la inferencia estadística identificadas son concepciones erradas y falta de comprensión del teorema del límite central; errores de interpretación de intervalos de confianza; errores de interpretación del p-valor y la significancia; confundir significancia práctica y estadística, entre otros (Castro-Sotos et al., 2007;Harradine et al., 2011). ...
... Por ejemplo, muchos estudiantes logran desarrollar los cálculos asociados a los procesos de inferencia, pero no razonar e interpretar adecuadamente los resultados de estos procesos (Castro-Sotos et al., 2007;Harradine et al., 2011). Otras dificultades en torno a la comprensión de la inferencia estadística identificadas son concepciones erradas y falta de comprensión del teorema del límite central; errores de interpretación de intervalos de confianza; errores de interpretación del p-valor y la significancia; confundir significancia práctica y estadística, entre otros (Castro-Sotos et al., 2007;Harradine et al., 2011). ...
Full-text available
Este estudio examina las tensiones que vivencia una formadora de profesores al enseñar inferencia estadística, tópico en el que se han reconocido diversas dificultades de enseñanza y aprendizaje en la formación inicial del profesorado de Matemática de secundaria. Para ello, se utilizó un enfoque cualitativo de naturaleza interpretativa, donde se analizaron las reflexiones de la formadora sobre la enseñanza de una unidad de inferencia estadística registradas por medio de bitácoras y reuniones de amistad crítica. El análisis permitió identificar tres grandes tensiones: una de ellas relacionada con la estadística como disciplina; la segunda, vinculada a su enseñanza; y la tercera, asociada con el aprendizaje de los estudiantes. Finalmente, se discuten posibles mejoras de los procesos formativos en la formación inicial del profesorado.
... The current study is focused on the domain statistics education to show how valid and reliable measurements of students' skills can be obtained with cognitive diagnostic assessment. Introductory statistics courses are part of many educational programs at universities, yet students often struggle to understand the abstract concepts in this domain [22]. Diagnostic assessment can support students in developing such conceptual understanding [23]. ...
... Learning objectives were defined based on literature and course materials. The literature study was based on four sources: a review of students' misconceptions in statistical inference [22]; a book about the development of students' statistical reasoning [27]; a paper about the development of instruments measuring conceptual understanding of statistics [28]; and the Guidelines for Assessment and Instruction in Statistics Education College Report [29], which lists goals that summarize what students should know and understand after a first course in statistics based on collective beliefs reflected in statistics education literature. After the literature review, we inspected the course materials of (nonmathematical) introductory statistics courses to complement the specified learning objectives. ...
Full-text available
E-learning is increasingly used to support student learning in higher education, facilitating administration of online formative assessments. Although providing diagnostic, actionable feedback is generally more effective, in current practice, feedback is often given in the form of a simple proportion of correctly solved items. This study shows the validation process of constructing detailed diagnostic information on a set of skills, abilities, and cognitive processes (so-called attributes) from students’ item response data with diagnostic classification models. Attribute measurement in the domain of statistics education is validated based on both expert judgment and empirical student data from a think-aloud study and large-scale assessment administration. The constructed assessments provide a valid and reliable measurement of the attributes. Inferences that can be drawn from the results of these formative assessments are discussed and it is demonstrated how this information can be communicated to students via learning dashboards to allow them to make more effective learning choices.
... Confidence intervals (CI) are recommended as an alternative that avoids, for example, the complex logic of HT (Castro Sotos et al., 2007;Makar & Rubin, 2018). Interpreting the results of CI calculations is a common problem among undergraduate students (Reaburn, 2014), who may believe that 95% CI for the mean indicates "the percentage of all sample means that fall between the confidence limits" (delMas et al., 2007, p. 49). ...
... However, their interpretation of the result was deterministic: "The alternative hypothesis is true" (D'' 1 ). Castro Sotos et al. (2007) state that this interpretation results from considering HT a "mathematical, logical proof" (p. 106), leading students to "believe that the null hypothesis has been proven to be true or false" (p. ...
Full-text available
Since statistical inference is a probabilistic generalization about a population analyzed on the basis of a sample, inferential reasoning demands producing reasons (statistical and contextual) to substantiate and validate generalizations. To convey an understanding of students’ inferential reasoning, we present a proposal—based on Toulmin’s argumentation model—in which the production of statistical and contextual reasons serve as fundamental components of students’ inferential reasoning by providing supporting arguments that can be expressed as a sequence of statements. We analyze the inferential reasoning of university students asked to work in teams on an inferential activity in the context of environmental pollution. Results show that they integrated informal and formal methods to produce statistical reasons, complemented by contextual reasons to support their conclusions. Their reasoning model allowed us to identify (a) a potential transition from informal to formal inferential reasoning; and (b) ambiguity, or an absence of expressions of uncertainty, about generalizations regarding the population, possibly related to their confidence in using formal methods (e.g., hypothesis testing). We conclude that our proposal helps encourage and analyze students’ inferential reasoning. Future research will require clearer definitions of the characteristics of the argumentation model in the field of statistical inferential reasoning because arguments depend on their disciplinary context.
... Una razón de la atención que actualmente recibe este procedimiento es que las recomendaciones metodológicas de asociaciones profesionales indican que en las publicaciones de estas áreas los contrastes de hipótesis han con los intervalos de confianza (Wilkinson y TFSI, 1999;Yaremko, Harari, Harrison y Lynn, 2013). Sin embargo, este tema no es sencillo, como muestra la investigación previa (Castro Sotos, Vanhoof, Van den Nororgate y Onghena, 2007;Harradine, Batanero y Rossman, 2011). ...
ResumenEl objetivo de este trabajo fue evaluar el conocimiento común y especializado del contenido de los futuros profesores españoles de bachillerato sobre el intervalo de confianza. Para lograr tal fin se propuso a un grupo de 73 futuros profesores un problema abierto sobre intervalo de confianza, similar a los propuestos en cursos anteriores a los estudiantes de Bachillerato en las pruebas de acceso a la universidad. Además, se les preguntó cómo explicarían a un estudiante el significado del intervalo de confianza y sus propiedades. Utilizando una metodología cualitativa de análisis de contenido, se estudia el planteamiento, cálculo, interpretación y definición del intervalo y las propiedades que recuerdan. Aunque la mayor parte de los participantes en el estudio plantea y calcula correctamente el intervalo, sólo el 28,8% de los futuros profesores lo interpreta correctamente. Se encuentran interpretaciones deterministas o propias a la metodología bayesiana, similares a las descritas en investigaciones con estudiantes. Otros o no proporcionan una interpretación. Pocos participantes fueron capaces de definir correctamente el intervalo de confianza cuando se les pide explicar a un estudiante su significado y recuerdan pocas propiedades del mismo. Se concluye la necesidad de reforzar el conocimiento del contenido sobre intervalo de confianza en estos futuros profesores. Palabras-clave: Intervalo de Confianza. Conocimiento del Contenido. Futuros Profesores. Bachillerato. Evaluación. AbstractThe aim of this research was to assess prospective high school teachers’ common and specialized content knowledge of confidence intervals. To achieve this aim, we proposed to 73 prospective teachers an open problem similar to those proposed to high school students in the previous years at the entrance to university tests. Besides, participants were asked to describe how they would explain the concept and its properties to a student. 
Using the qualitative methodology of content analysis we study the setting, computation, interpretation and definition of confidence interval and its properties. Although most participants correctly set and computed the confidence interval, only 28.8% of them made a correct interpretation of the same. Deterministic and Bayesian interpretations, similar to those described in previous research with students arose or there was no interpretation. Few participants defined correctly the interval when asked how they would explain the topic to a student; additionally, only a few properties of the interval were remembered. We conclude the need to reinforce the content knowledge of confidence intervals in prospective teachers. Keywords: Confidence Interval. Content Knowledge. Prospective Teachers. High School. Assessment.
... En las prácticas de los profesores en ejercicio encontramos que no se plantearon las hipótesis nula y alternativa, aspecto que es fundamental en las pruebas de hipótesis, pero como han evidenciado diversos estudios (e.g., Batanero, 2013;López-Martín, Batanero y Gea, 2019;Sotos et al., 2007;Vallecillos, 1997) las hipótesis nula y alternativa son de las principales dificultades que presentan tanto profesores como estudiantes en inferencia. Por ello, en los niveles de razonamiento inferencial sobre el estadístico t-Student se plantea trabajar en tres momentos con las hipótesis: en un primer momento, se identifica la hipótesis que se encuentra implícita en el problema y se hace explicita por medio de un lenguaje natural; en un segundo momento, se plantean las hipótesis nula y alternativa en lenguaje natural; y en un tercer momento, se trabaja con las hipótesis nula y alternativa en lenguaje simbólico, que es como solemos encontrarlas en los estudios estadísticos. ...
Full-text available
One of the topics that have been deeply studied in Statistical Education is that of how to promote formal inferential reasoning (FIR) based on the results of informal inferential reasoning (IIR). However, it is still necessary to have proposals to explore and progressively develop the inferential reasoning of students and teachers from IIR to FIR. In this context, this article seeks to characterize the inferential reasoning displayed by high school mathematics teachers in the practices they developed for solving problems about Student’s t statistic.To do this, we use theoretical and methodological notions introduced by the Onto-Semiotic Approach (OSA) to mathematical knowledge and instruction, among which are the notion of mathematical practice, mathematical object and the theoretical proposal of progressive levels of inferential reasoning about the Student’s t statistic. The subjects who participated in this qualitative study were 59 teachers in training from Costa Rica and 22 practicing teachers from Chile. The practices developed by the pre-service teachers and the practicing teachers were found to have similar elements (representations, concepts/definitions, properties, procedures, arguments). The principal conclusion of this investigation was that the proposal of inferential reasoning levels about Student’s t statistic proved to be a useful predictor of the practices developed by the teachers, making it possible to distinguish characteristic elements of each level of inferential reasoning.
Full-text available
This study contributes to the call for influencing practice by increasing attention to how learning environments can be designed to support learning in statistical inference. We report on a design experiment in secondary school (students 14–16 years old), that resulted in a set of lessons with the learning goal of teaching students how to apply concepts and principles of hypothesis testing for making an inference as to whether or not students in secondary school can taste the difference between two brands of cola soda. The design experiment resulted in four design principles for a simulation-based approach for learning hypothesis testing in secondary school. The design principles highlight the combination of practical and digital simulations of samplings. They stress the need for using random generators that allow for high reliability in collecting sample data and introduce a simulation-based method for determining p-values, i.e. to quantify how likely or surprising a sample result, or a result more extreme, is under a null hypothesis.
Constructing interactive web apps has become more accessible for instructors, for example, by using the R package Shiny. Here we explored learners' preferences and the efficiency of interactive simulations versus static pictures in acquiring statistics knowledge of Cohen's d and standard normal distribution. Results revealed that students' spontaneous interaction with pictures was infrequent (pilot study, N = 26). While prompts (Exp. 1, N = 152) effectively ensured the manipulation of simulations, student exposure to interactive simulations led to longer learning times though similar test performance compared with student exposure to static pictures. Multiple interactive representations led to lower test performance than single interactive and static representations (Exp. 2, N = 117). Though no advantage was gained regarding learning outcomes, participants preferred the interactive variant (Exp. 3, N = 119). Taken together, this study demonstrates that the superiority of interactive pictures cannot be assumed to hold in general. Further work should evaluate how mental model construction can be effectively scaffolded by interactive simulations.
Conference Paper
Although inferential concepts are typically introduced in courses at high school, the approaches taught are usually the methodologies in introductory classes at university level. There is much research to support that learners have difficulty with classical frequentist inference and that a better understanding of inferential concepts can be obtained via an introduction using simulation-based methods. A new course available to high schools in British Columbia, Canada, incorporates several novel aspects, a key feature being the reliance on “intuitive,” simulation-based inference. We describe the pedagogical approaches adopted in this course and how students appeared to have learned from their experiences.
Full-text available
This study investigated introductory statistics students' conceptual understanding of the standard deviation. A computer environment was designed to promote students' ability to coordinate characteristics of variation of values about the mean with the size of the standard deviation as a measure of that variation. Twelve students participated in an interview divided into two primary phases, an exploration phase where students rearranged histogram bars to produce the largest and smallest standard deviation, and a testing phase where students compared the sizes of the standard deviation of two distributions. Analysis of data revealed conceptions and strategies that students used to construct their arrangements and make comparisons. In general, students moved from simple, one-dimensional understandings of the standard deviation that did not consider variation about the mean to more mean-centered conceptualizations that coordinated the effects of frequency (density) and deviation from the mean. Discussions of the results and implications for instruction and further research are presented.
Statistical terms are accurate and powerful but can sometimes lead to misleading impressions among beginning students. Discrepancies between the popular and statistical meanings of "conditional" are discussed, and suggestions are made for the use of different vocabulary when teaching beginners in applied introductory courses.
A controversy concerning the usefulness of "null" hypothesis tests in scientific inference has continued in articles within psychology since 1960 and has recently come to a head, with serious proposals offered for a test ban or something close to it. This article sketches some of the views of statistical theory and practice among different groups of psychologists, reviews a recent book offering multiple perspectives on null hypothesis tests, and argues that the debate within psychology is a symptom of serious incompleteness in the foundations of statistics.
Where do new ideas come from? What is social intelligence? Why do social scientists perform mindless statistical rituals? This vital book is about rethinking rationality as adaptive thinking: to understand how minds cope with their environments, both ecological and social. The author proposes and illustrates a bold new research program that investigates the psychology of rationality, introducing the concepts of ecological, bounded, and social rationality. His path-breaking collection takes research on thinking, social intelligence, creativity, and decision-making out of an ethereal world where the laws of logic and probability reign, and places it into our real world of human behavior and interaction. This book is accessibly written for general readers with an interest in psychology, cognitive science, economics, sociology, philosophy, artificial intelligence, and animal behavior. It also teaches a practical audience, such as physicians, AIDS counselors, and experts in criminal law, how to understand and communicate uncertainties and risks.
In the first three experiments, we attempted to learn more about subjects' understanding of the importance of sample size by systematically changing aspects of the problems we gave to subjects. In a fourth study, understanding of the effects of sample size was tested as subjects went through a computer-assisted training procedure that dealt with random sampling and the sampling distribution of the mean. Subjects used sample size information more appropriately for problems that were stated in terms of the accuracy of the sample average or the center of the sampling distribution than for problems stated in terms of the tails of the sampling distribution. Apparently, people understand that the means of larger samples are more likely to resemble the population mean but not the implications of this fact for the variability of the mean. The fourth experiment showed that although instruction about the sampling distribution of the mean led to better understanding of the effects of sample size, subjects were still unable to make correct inferences about the variability of the mean. The appreciation that people have for some aspects of the law of large numbers does not seem to result from an in-depth understanding of the relation between sample size and variability.
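The relation between sample size and the variability of the mean that this abstract refers to can be checked with a short simulation. This is a minimal sketch, not the training procedure used in the study: the standard normal population, the sample sizes 10 and 100, and the helper `sd_of_sample_means` are all assumptions made for illustration.

```python
import random
import statistics

def sd_of_sample_means(sample_size, num_samples=2000, seed=0):
    """Draw repeated samples from a standard normal population and
    return the standard deviation of the resulting sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]
    return statistics.stdev(means)

small = sd_of_sample_means(10)
large = sd_of_sample_means(100)

# Theory: the SD of the sample mean is sigma / sqrt(n), with sigma = 1 here.
print(f"n = 10:  simulated {small:.3f}, theoretical {1 / 10 ** 0.5:.3f}")
print(f"n = 100: simulated {large:.3f}, theoretical {1 / 100 ** 0.5:.3f}")
```

Under these assumptions, the means of larger samples cluster more tightly around the population mean, so extreme sample means (the tails of the sampling distribution) become rarer as the sample size grows.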
Many decisions are based on beliefs concerning the likelihood of uncertain events such as the outcome of an election, the guilt of a defendant, or the future value of the dollar. Occasionally, beliefs concerning uncertain events are expressed in numerical form as odds or subjective probabilities. The subjective assessment of probability resembles the subjective assessment of physical quantities such as distance or size. These judgments are all based on data of limited validity, which are processed according to heuristic rules. In general, these heuristics are quite useful, but sometimes they lead to severe and systematic errors, as reliance on a heuristic rule does in the estimation of distance. This chapter describes three heuristics that are employed in making judgments under uncertainty. The first is representativeness, which is usually employed when people are asked to judge the probability that an object or event belongs to a class or event. The second is the availability of instances or scenarios, which is often employed when people are asked to assess the frequency of a class or the plausibility of a particular development. The third is adjustment from an anchor, which is usually employed in numerical prediction when a relevant value is available.