Article

Theory-Testing in Psychology and Physics: A Methodological Paradox

Authors:
Paul E. Meehl

Abstract

Because physical theories typically predict numerical values, an improvement in experimental precision reduces the tolerance range and hence increases corroborability. In most psychological research, improved power of a statistical design leads to a prior probability approaching ½ of finding a significant difference in the theoretically predicted direction. Hence the corroboration yielded by “success” is very weak, and becomes weaker with increased precision. “Statistical significance” plays a logical role in psychology precisely the reverse of its role in physics. This problem is worsened by certain unhealthy tendencies prevalent among psychologists, such as a premium placed on experimental “cuteness” and a free reliance upon ad hoc explanations to avoid refutation.
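
The core argument can be made concrete with a small simulation. The sketch below is our illustration, not Meehl's: it assumes small, nonzero "crud" effects whose sign is unrelated to the theory's directional prediction, and shows that as sample size (and hence power) grows, the probability of a significant difference in the predicted direction climbs toward 1/2 rather than staying near the nominal alpha. The effect-size distribution and sample sizes are arbitrary illustrative choices.

```python
# Minimal simulation sketch (not from the paper) of Meehl's paradox: with
# directional predictions and nonzero true effects of random sign, higher
# power pushes the "confirmation" rate toward 1/2, not toward alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def confirmation_rate(n_per_group, n_studies=5000):
    """Fraction of studies yielding p < .05 in the theoretically predicted
    (here: positive) direction, when true effects are small, nonzero, and
    as likely to run against the prediction as with it."""
    hits = 0
    for _ in range(n_studies):
        true_d = rng.choice([-1, 1]) * abs(rng.normal(0, 0.2))  # nonzero "crud"
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_d, 1.0, n_per_group)
        t, p = stats.ttest_ind(b, a)
        if p < 0.05 and t > 0:   # significant AND in the predicted direction
            hits += 1
    return hits / n_studies

for n in (20, 200, 2000):
    rate = confirmation_rate(n)
    print(f"n per group = {n:5d}: share of 'confirming' results = {rate:.2f}")
# As n grows, the rate climbs toward 0.5 rather than toward a small alpha,
# so a directional "success" provides ever weaker corroboration.
```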

... A second, and even more fundamental, problem with NHST is that it is generally unable to say anything useful about hypotheses of substantive interest. This is true even when the P-values derived from the test are, technically speaking, correctly interpreted: the nil hypothesis is set up as a straw man (i.e., the claim that not even the tiniest effect or relation exists between the variables of interest), and its rejection (or lack thereof) tells us little about the types of substantive hypotheses researchers are typically interested in (e.g., De Groot, 1956; Meehl, 1967, 1990). Indeed, when pressed, communication researchers would probably not claim to be interested in infinitesimally small effects. ...
... Rather, we will try to familiarize communication scientists with one line of thought within the reform movement that remains close to the familiar P-value concept but aligns it with the falsificationist ideal of stringent hypothesis testing. Following Meehl (1967, 1990), we label this approach 'strong-form' frequentist testing, although its core principles reside under different names as well (e.g., 'severe testing': Mayo, 2018; Mayo & Spanos, 2006; 'minimal effect testing': Murphy & Myors, 1999; '(non-)equivalency testing': Weber & Popova, 2012; Wellek, 2010). The basic idea of strong-form testing is straightforward: instead of testing a nil null hypothesis, we simply set up a test for a statistical hypothesis of (minimal) theoretical interest. ...
... Notice how this description of P-values alludes to the intimate relationship between frequentist hypothesis testing and a (neo-)Popperian, falsificationist epistemology (Mayo, 2018; Meehl, 1967, 1990; Popper, 1963). The key point of falsificationism is that science logically progresses through critical tests of theories; that is, what sets science apart from pseudo-science in the falsificationist sense is that the latter typically tries to gather observations that are able to confirm theories, whereas the former actively tries to challenge theories and find counterevidence for them. ...
Full-text available
Article
This paper discusses ‘strong-form’ frequentist testing as a useful complement to null hypothesis testing in communication science. In a ‘strong-form’ set-up a researcher defines a hypothetical effect size of (minimal) theoretical interest and assesses to what extent her findings falsify or corroborate that particular hypothesis. We argue that the idea of ‘strong-form’ testing aligns closely with the ideals of the movements for scientific reform, discuss its technical application within the context of the General Linear Model, and show how the relevant P-value-like quantities can be calculated and interpreted. We also provide examples and a simulation to illustrate how a strong-form set-up requires more nuanced reflections about research findings. In addition, we discuss some pitfalls that might still hold back strong-form tests from widespread adoption.
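
One simple way to implement the strong-form idea, sketched below under assumptions of our own rather than the authors' GLM machinery, is to shift the null away from zero to an effect of minimal theoretical interest (here a hypothetical raw difference delta_min = 0.5) and compute one-sided P-values on either side of it.

```python
# Hedged sketch of a "strong-form" / minimal-effect test: test against a raw
# effect of minimal theoretical interest, delta_min, rather than a nil null.
# delta_min and the data below are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = rng.normal(0.0, 1.0, 120)
treatment = rng.normal(0.35, 1.0, 120)    # hypothetical data
delta_min = 0.5                           # smallest difference of theoretical interest

# Shift the treatment scores by delta_min so the t-test's null becomes
# "true difference equals delta_min" rather than "true difference equals zero".
shifted = treatment - delta_min

# Small p here: the effect credibly exceeds delta_min (strong corroboration).
p_exceeds = stats.ttest_ind(shifted, control, alternative="greater").pvalue
# Small p here: the effect credibly falls short of delta_min (falsification of
# the hypothesis of minimal theoretical interest).
p_falls_short = stats.ttest_ind(shifted, control, alternative="less").pvalue

print(f"one-sided p, effect exceeds delta_min    : {p_exceeds:.3f}")
print(f"one-sided p, effect falls short of delta_min: {p_falls_short:.3f}")
# Neither, one, or both quantities may be small; the point is that the
# hypothesis put at risk is the theoretically interesting one, not the nil.
```
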
... With the shift in focus vis-à-vis the replication crisis, we hope there is a corresponding shift toward theory development. Good theory development relies on the accumulation of evidence as well as precise predictions (Meehl, 1967). Unfortunately, many theories lack both (Edwards and Berry, 2010). ...
... For one, Null Hypothesis Significance Testing (NHST) cannot incorporate estimates obtained from the literature (Velicer, 2008). Further, with NHST it becomes more challenging to falsify hypotheses as one's methodology becomes more rigorous (Meehl, 1967). To compound the issue, common alternatives (e.g., Bayesian and meta-analysis) have limitations of their own. ...
... Mayo's account deals primarily with controlling (Type I and II) error rates in inference and is therefore frequentist in nature (cf. Neyman & Pearson, 1933, 1967; Neyman, 1977; Mayo, 1996). One of the key objectives of this paper is to provide a critical analysis of Mayo's account and to assess the prospects of error statistics for application in social science. ...
... However, as argued by Vanpaemel (2010), this property of Bayes factors is a virtue rather than a vice: the prior distribution expresses, after all, our theoretical expectations and predictions. A scientist using Bayesian inference needs to think about the prior distribution in advance, strengthening the link between scientific theorizing and statistical analysis, the absence of which has often been named as a cause of the lack of reliability and replicability of psychological research (compare Meehl, 1967; Ioannidis, 2005; Dienes, 2021). ...
Full-text available
Article
A tradition that goes back to Sir Karl R. Popper assesses the value of a statistical test primarily by its severity: was there an honest and stringent attempt to prove the tested hypothesis wrong? For "error statisticians" such as Mayo (1996, 2018), and frequentists more generally, severity is a key virtue in hypothesis tests. Conversely, failure to incorporate severity into statistical inference, as allegedly happens in Bayesian inference, counts as a major methodological shortcoming. Our paper pursues a double goal: First, we argue that the error-statistical explication of severity has substantive drawbacks; specifically, the neglect of research context and the specificity of the predictions of the hypothesis. Second, we argue that severity matters for Bayesian inference via the value of specific, risky predictions: severity boosts the expected evidential value of a Bayesian hypothesis test. We illustrate severity-based reasoning in Bayesian statistics by means of a practical example and discuss its advantages and potential drawbacks.
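
The severity-boosting role of specific, risky predictions can be illustrated numerically. The sketch below is our own toy example, not the authors' analysis: it compares Bayes factors against a point null for a vague prior versus a sharp prior centered on a hypothetical predicted effect, using a normal approximation to the likelihood; all numbers are invented.

```python
# Minimal sketch (our illustration, not from the paper) of how a specific,
# risky prediction boosts Bayesian evidence when it succeeds: Bayes factors
# for H1 vs. a point null under a vague versus a sharp normal prior.
import numpy as np
from scipy.stats import norm

d_hat, se = 0.42, 0.10          # observed effect estimate and its standard error (assumed)

def bf10(prior_mean, prior_sd):
    """Bayes factor for H1: delta ~ N(prior_mean, prior_sd^2) vs H0: delta = 0,
    using the normal-likelihood approximation d_hat ~ N(delta, se^2)."""
    marginal_h1 = norm.pdf(d_hat, loc=prior_mean, scale=np.hypot(se, prior_sd))
    marginal_h0 = norm.pdf(d_hat, loc=0.0, scale=se)
    return marginal_h1 / marginal_h0

print("vague prior, N(0, 1.0)  : BF10 =", round(bf10(0.0, 1.0), 1))
print("risky prior, N(0.4, 0.1): BF10 =", round(bf10(0.4, 0.1), 1))
# The sharp prior makes a riskier prediction; because the data happen to fall
# where it said they would, it earns a much larger Bayes factor. Had the data
# landed elsewhere, the same prior would have been punished more severely.
```
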
... The larger modern cultural context accepts the objectivity of data as a record of events assumed to suffice as a criterion for reproducibility. It seems that neither repeated conceptual critiques of these assumptions [59,152,153,186,228,233,302,303,349], nor readily available operational alternatives [7,12,95,207,246,288,347,352], have had much impact on most researchers' methodological choices. Is it possible to imagine any kind of compelling motivation that would inspire wide adoption of more meaningful, rigorous, and useful measurement methods? ...
Full-text available
Chapter
An historic shift in focus on the quality and person-centeredness of health care has occurred in the last two decades. Accounts of results produced from reinvigorated attention to the measurement, management, and improvement of the outcomes of health care show that much has been learned, and much remains to be done. This article proposes that causes of the failure to replicate in health care the benefits of “lean” methods lie in persistent inattention to measurement fundamentals. These fundamentals must extend beyond mathematical and technical issues to the social, economic, and political processes involved in constituting trustworthy performance measurement systems. Successful “lean” implementations will follow only when duly diligent investments in these fundamentals are undertaken. Absent those investments, average people will not be able to leverage brilliant processes to produce exceptional outcomes, and we will remain stuck with broken processes in which even brilliant people can produce only flawed results. The methodological shift in policy and practice prescribed by the authors of the chapters in this book moves away from prioritizing the objectivity of data in centrally planned and executed statistical modeling, and toward scientific models that prioritize the objectivity of substantive and invariant unit quantities. The chapters in this book describe scientific modeling’s bottom-up, emergent and evolving standards for mass customized comparability. Though the technical aspects of the scientific modeling perspective are well established in health care outcomes measurement, operationalization of the social, economic, and political aspects required for creating new degrees of trust in health care institutions remains at a nascent stage of development. Potentials for extending everyday thinking in new directions offer hope for achieving previously unattained levels of efficacy in health care improvement efforts.
... The magnitude of the opportunity Jack perceived is inversely related to the embedded depth of the data-centric focus on statistical analysis as necessary to measurement, which dominates the mainstream research culture in many fields. In a bizarre kind of cultural schizophrenia (Bateson, 1978; Fisher, 2021a; Wright, 1988), statistical methods are universally referred to as quantitative, even though those methods are hardly ever used to formulate and test the hypothesis that a unit quantity exists (Meehl, 1967; Michell, 1986). The deep entrenchment of the automatic association of numbers with quantities is counter-intuitive in that everyone knows counts of rocks are in no way indicative of the quantitative amount of rock possessed; your two rocks may have far more mass than my ten rocks. ...
Full-text available
Chapter
The process of ascribing meaning to scores produced by a measurement procedure is generally recognized as the most important task in developing an educational or psychological measure, be it an achievement test, interest inventory, or personality scale. This process, which is commonly referred to as construct validation (Cronbach, 1971; Cronbach & Meehl, 1955; ETS, 1979; Messick, 1975, 1980), involves a family of methods and procedures for assessing the degree to which a test measures a trait or theoretical construct.
Full-text available
Chapter
Several concepts from Georg Rasch's last papers are discussed. The key one is comparison because Rasch considered the method of comparison fundamental to science. From the role of comparison stems scientific inference made operational by a properly developed frame of reference producing specific objectivity. The exact specifications Rasch outlined for making comparisons are explicated from quotes, and the role of causality derived from making comparisons is also examined. Understanding causality has implications for what can and cannot be produced via Rasch measurement. His simple examples were instructive, but the implications are far reaching upon first establishing the key role of comparison.
Full-text available
Chapter
The purpose of this paper is to review some assumptions underlying the use of norm-referenced tests in educational evaluations and to provide a prospectus for research on these assumptions as well as other questions related to norm-referenced tests. Specifically, the assumptions which will be examined are (1) expressing treatment effects in a standard score metric permits aggregation of effects across grades, (2) commonly used standardized tests are sufficiently comparable to permit aggregation of results across tests, and (3) the summer loss observed in Title I projects is due to an actual loss in achievement skills and knowledge. We wish to emphasize at the outset that our intent in this paper is to raise questions and not to present a coherent set of answers.
Full-text available
Chapter
Rasch’s unidimensional models for measurement show how to connect object measures (e.g., reader abilities), measurement mechanisms (e.g., machine-generated cloze reading items), and observational outcomes (e.g., counts correct on reading instruments). Substantive theory shows what interventions or manipulations to the measurement mechanism can be traded off against a change to the object measure to hold the observed outcome constant. A Rasch model integrated with a substantive theory dictates the form and substance of permissible interventions. Rasch analysis, absent construct theory and an associated specification equation, is a black box in which understanding may be more illusory than not. Finally, the quantitative hypothesis can be tested by comparing theory-based trade-off relations with observed trade-off relations. Only quantitative variables (as measured) support such trade-offs. Note that to test the quantitative hypothesis requires more than manipulation of the algebraic equivalencies in the Rasch model or descriptively fitting data to the model. A causal Rasch model involves experimental intervention/manipulation on either reader ability or text complexity or a conjoint intervention on both simultaneously to yield a successful prediction of the resultant observed outcome (count correct). We conjecture that when this type of manipulation is introduced for individual reader text encounters and model predictions are consistent with observations, the quantitative hypothesis is sustained.
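
The trade-off property described above can be written down directly from the dichotomous Rasch model. The snippet below is a minimal illustration with made-up ability and difficulty values, not calibrations from any real instrument.

```python
# Sketch of the causal-Rasch trade-off: under P(correct) = exp(b - d) / (1 + exp(b - d)),
# offsetting interventions on ability and difficulty leave the expected count unchanged.
import numpy as np

def p_correct(ability, difficulties):
    return 1.0 / (1.0 + np.exp(-(ability - difficulties)))

item_difficulties = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # logits (assumed)
reader = 0.3                                                  # logits (assumed)

baseline = p_correct(reader, item_difficulties).sum()

# Intervene: make every text 0.8 logits harder AND the reader 0.8 logits abler.
shifted = p_correct(reader + 0.8, item_difficulties + 0.8).sum()

print(f"expected count correct, baseline: {baseline:.3f}")
print(f"expected count correct, shifted : {shifted:.3f}")    # identical
# A causal Rasch claim is that such offsetting interventions leave the observed
# outcome unchanged; comparing theory-based and observed trade-offs is one way
# to test the quantitative hypothesis.
```
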
Full-text available
Chapter
Does the reader comprehend the text because the reader is able or because the text is easy? Localizing the cause of comprehension in either the reader or the text is fraught with contradictions. A proposed solution uses a Rasch equation to model comprehension as the difference between a reader measure and a text measure. Computing such a difference requires that reader and text be measured on a common scale. Thus, the puzzle is solved by positing a single continuum along which texts and readers can be conjointly ordered. A reader’s comprehension of a text is a function of the difference between reader ability and text readability. This solution forces recognition that generalizations about reader performance can be text independent (reader ability) or text dependent (comprehension). The article explores how reader ability and text readability can be measured on a single continuum, and the implications that this formulation holds for reading theory, the teaching of reading, and the testing of reading.
Full-text available
Chapter
This paper describes Mapping Variables, the principal technique for planning and constructing a test or rating instrument. A variable map is also useful for interpreting results. Modest reference is made to the history of mapping leading to its importance in psychometrics. Several maps are given to show the importance and value of mapping a variable by person and item data. The need for critical appraisal of maps is also stressed.
Full-text available
Chapter
A construct theory is the story we tell about what it means to move up and down the scale for a variable of interest (e.g., temperature, reading ability, short term memory). Why is it, for example, that items are ordered as they are on the item map? The story evolves as knowledge regarding the construct increases. We call both the process and the product of this evolutionary unfolding "construct definition" (Stenner et al., Journal of Educational Measurement 20:305–316, 1983). Advanced stages of construct definition are characterized by calibration equations (or specification equations) that operationalize and formalize a construct theory. These equations make point predictions about item behavior or item ensemble distributions. The more closely theoretical calibrations coincide with empirical item difficulties, the more useful the construct theory and the more interesting the story. Twenty-five years of experience in developing the Lexile Framework for Reading enable us to distinguish five stages of thinking. Each subsequent stage can be characterized by an increasingly sophisticated use of substantive theory. Evidence that a construct theory and its associated technologies have reached a given stage or level can be found in the artifacts, instruments, and social networks that are realized at each level.
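
A specification equation of the kind described here can be sketched as a regression of empirical item difficulties on theory-relevant item features. The example below is purely illustrative: the features, coefficients, and "empirical" difficulties are simulated, standing in for a real Rasch calibration.

```python
# Bare-bones sketch of a specification equation: predict item difficulties from
# theory-relevant item features and compare theoretical with empirical calibrations.
# Feature names, coefficients, and data are invented for illustration only.
import numpy as np

rng = np.random.default_rng(3)
n_items = 40

# Hypothetical item features (e.g., word frequency and sentence length for text).
log_word_freq = rng.normal(3.0, 0.5, n_items)
sentence_len = rng.normal(12.0, 3.0, n_items)

# "Empirical" difficulties, here simulated as a linear function of the features
# plus noise; in practice these would come from a Rasch calibration of real data.
empirical_b = -1.2 * log_word_freq + 0.15 * sentence_len + rng.normal(0, 0.3, n_items)

# Fit the specification equation by least squares.
X = np.column_stack([np.ones(n_items), log_word_freq, sentence_len])
coef, *_ = np.linalg.lstsq(X, empirical_b, rcond=None)
theoretical_b = X @ coef

r = np.corrcoef(theoretical_b, empirical_b)[0, 1]
print("specification-equation coefficients:", np.round(coef, 2))
print(f"correlation of theoretical with empirical difficulties: r = {r:.2f}")
# The closer theory-based calibrations track empirical difficulties, the more
# useful the construct theory (and the less "black box" the measure).
```
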
Full-text available
Chapter
Psychometric models typically represent encounters between persons and dichotomous items as a random variable with two possible outcomes, one of which can be labeled success. For a given item, the stipulation that each person has a probability of success defines a construct on persons. This model specification defines the construct, but measurement is not yet achieved. The path to measurement must involve replication; unlike coin-tossing, this cannot be attained by repeating the encounter between the same person and the same item. Such replication can only be achieved with more items whose features are included in the model specifications. That is, the model must incorporate multiple items. This chapter examines multi-item model specifications that support the goal of measurement. The objective is to select the model that best facilitates the development of reliable measuring instruments. From this perspective, the Rasch model has important features compared to other models.
Full-text available
Chapter
Teachers make use of these two premises to match readers to text. Knowing a lot about text is helpful because “text matters” (Hiebert, 1998). But ordering or leveling text is only half the equation. We must also assess the level of the readers. These two activities are necessary so that the right books can be matched to the right reader at the right time. When teachers achieve this match intuitively, they are rewarded with students choosing to read more.
Full-text available
Chapter
The International Vocabulary of Measurement (VIM) and the Guide to Uncertainty in Measurement (GUM) shift the terms and concepts of measurement information quality away from an Error Approach toward a model-based Uncertainty Approach. An analogous shift has taken place in psychometrics with the decreasing use of True Score Theory and increasing attention to probabilistic models for unidimensional measurement. These corresponding shifts emerge from shared roots in cognitive processes common across the sciences and they point toward new opportunities for an art and science of living complex adaptive systems. The psychology of model-based reasoning sets the stage for not just a new consensus on measurement and uncertainty, and not just for a new valuation of the scientific status of psychology and the social sciences, but for an appreciation of how to harness the energy of self-organizing processes in ways that harmonize human relationships.
Full-text available
Chapter
The last 50 years of human and social science measurement theory and practice have witnessed a steady retreat from physical science as the canonical model. Humphry (2011) unapologetically draws on metrology and physical science analogies to reformulate the relationship between discrimination and the unit. This brief note focuses on why this reformulation is important and on how these ideas can improve measurement theory and practice.
... The first challenge researchers face when linking performance across tasks is measuring key hypothetical constructs. Although this is a well-known challenge for social science researchers in general (e.g., Borsboom et al., 2004; Brady et al., 2021; Kellen et al., 2020; Meehl, 1967; Regenwetter & Robinson, 2017; Rotello et al., 2015), it is also a challenge for researchers who study cognitive control in particular, because the dominant measurement approach, called the subtraction method, is known to suffer from major limitations. These limitations have been discussed in detail by other researchers and will be reviewed later in this article. ...
Article
Cognitive control refers to the ability to maintain goal-relevant information in the face of distraction, making it a core construct for understanding human thought and behavior. There is great theoretical and practical value in building theories that can be used to explain or to predict variations in cognitive control as a function of experimental manipulations or individual differences. A critical step toward building such theories is determining which latent constructs are shared between laboratory tasks that are designed to measure cognitive control. In the current work, we examine this question in a novel way by formally linking computational models of two canonical cognitive control tasks, the Eriksen flanker and task-switching tasks. Specifically, we examine whether model parameters that capture cognitive control processes in one task can be swapped across models to make predictions about individual differences in performance on another task. We apply our modeling and analysis to a large-scale data set from an online cognitive training platform, which optimizes our ability to detect individual differences in the data. Our results suggest that the flanker and task-switching tasks probe common control processes. This finding supports the view that higher-level cognitive control processes, as opposed to solely strategies in speed-accuracy tradeoffs or perceptual processing and motor response speed, are shared across the two tasks. We discuss how our computational modeling substitution approach addresses limitations of prior efforts to relate performance across different cognitive control tasks, and how our findings inform current theories of cognitive control.
... "The dialectic of research programmes... is by no means reducible to an alternation of speculative conjectures and empirical refutations. The types of relations between the development of a programme and the processes of empirical testing can be highly varied; which of them is realized is a concrete historical question" [13], holding that the development of statistical technique and the improvement of experimental conditions provide an apparatus only for sham corroborations, a lowering of the bar of validity, and the semblance of scientific progress behind which, in that case, there is "nothing but pseudo-intellectual garbage." Yet Freud originally tried to express his ideas in the language of rigorous science, taking as a model, for example, the founder of psychophysics, G. Fechner, who in turn regarded the psyche as a homeostatic system that could be described in the spirit of physical models. ...
Full-text available
Article
Aim: to consider some basic aspects of the development of psychoanalysis within the methodology proposed by Imre Lakatos. We outline the constituent elements of the "hard core" that make up psychoanalytic metapsychology; the work of the "protective belt" is shown through examples. The development of psychoanalysis and the preservation of its heuristic power have been facilitated by the transformations and discoveries that led to the modern form of this scientific research program. Since its emergence, psychoanalysis has also become a way of studying the human being outside of medicine. The dissemination of the ideas of psychoanalysis in philosophy, sociology, politics, anthropology, etc., has involved more and more researchers in the development of its theoretical constructions, critically revising its fundamental positions and giving the program a contemporary resonance. Throughout the development of psychoanalysis as a scientific research program, reliance on therapeutic practice has been critical.
... The more scrutiny a reported effect survives, including replication and reanalysis attempts, the better. Then, and only then, can evidence for such an effect be temporarily considered credible/trustworthy, proportional to the amount and nature of the scrutiny that the reported effect has survived (LeBel et al., 2018; Meehl, 1967, 1978). ...
Full-text available
Article
Although the preceding exchange in this special subsection of the Journal (Augustine, 2022a, 2022b; Braude et al., 2022) has highlighted the differences between skeptics and proponents of discarnate personal survival, there is much more in common between us that often goes unsaid, such as a common respect for sound reasoning and for investigating matters empirically whenever possible. We also agree that this topic warrants further empirical investigation, and of a quality superior to that found in the extant survival literature. While we could further delineate our similarities and differences, a much more fruitful avenue for research is to collaborate on a design for an ‘ideal’ prospective test of potential survival that, if successful and replicable, would complement and corroborate previous attempts at rigorous experimental survival research. Working with Braude et al.’s (2022) team of survival proponents would have been optimal, but given time and logistical constraints, we have alternatively joined forces with the last author who has published several methodological papers in this domain from an agnostic perspective (e.g., Jamieson & Rock, 2014; Rock & Storm, 2015). By developing some of the proponents’ own published proposals, we have agreed on an experimental design that would provide substantiating evidence consistent with an anomalous effect by shielding any attainable replicable positive results, as much as feasible, from normal or conventional explanations. Such explanations run the gamut from simple cueing to researcher degrees of freedom or p-hacking, i.e., researchers inadvertently or deliberately collecting or selecting data or analyses until nonsignificant results are rendered statistically significant (Head et al., 2015).
... Additionally, proponents may invoke elaborate post hoc explanations to explain away negative research results. While these explanations may be plausible, the fact that they were not disclosed a priori as potential limitations raises questions about their veracity (Meehl, 1967). ...
Article
This article addresses the use of hype in the promotion of clinical assessment practices and instrumentation. Particular focus is given to the role of school psychologists in evaluating the evidence associated with clinical assessment claims, the types of evidence necessary to support such claims, and the need to maintain a degree of “healthy self-doubt” about one’s own beliefs and preferred practices. Included is a discussion of topics that may facilitate developing and refining scientific thinking skills related to clinical assessment across common coursework, and how this framework fits with both the scientist-practitioner and clinical science perspectives for training.
Full-text available
Chapter
Huge resources are invested in metrology and standards in the natural sciences, engineering, and across a wide range of commercial technologies. Significant positive returns of human, social, environmental, and economic value on these investments have been sustained for decades. Proven methods for calibrating test and survey instruments in linear units are readily available, as are data- and theory-based methods for equating those instruments to a shared unit. Using these methods, metrological traceability is obtained in a variety of commercially available elementary and secondary English and Spanish language reading education programs in the U.S., Canada, Mexico, and Australia. Given established historical patterns, widespread routine reproduction of predicted text-based and instructional effects expressed in a common language and shared frame of reference may lead to significant developments in theory and practice. Opportunities for systematic implementations of teacher-driven lean thinking and continuous quality improvement methods may be of particular interest and value.
Full-text available
Chapter
In his classic paper entitled “The Unreasonable Effectiveness of Mathematics in the Natural Sciences,” Eugene Wigner addresses the question of why the language of mathematics should prove so remarkably effective in the physical [natural] sciences. He marvels that “the enormous usefulness of mathematics in the natural sciences is something bordering on the mysterious and that there is no rational explanation for it.” We have been similarly struck by the outsized benefits that theory-based instrument calibrations confer on the natural sciences, in contrast with the almost universal practice in the social sciences of using data to calibrate instrumentation.
Full-text available
Chapter
This paper presents and illustrates a novel methodology, construct-specification equations, for examining the construct validity of a psychological instrument. Whereas traditional approaches have focused on the study of between-person variation on the construct, the suggested methodology emphasizes study of the relationships between item characteristics and item scores. The major thesis of the construct-specification-equation approach is that until developers of a psychological instrument understand what item characteristics are determining the item difficulties, the understanding of what is being measured is unsatisfyingly primitive. This method is illustrated with data from the Knox Cube Test which purports to be a measure of visual attention and short-term memory.
... Pointing out the poor use of NHST has long been a cottage industry for curmudgeons, both in neuroscience and in other fields where NHST is ubiquitously but poorly used (e.g., Meehl, 1967; Cohen, 1994; Gigerenzer, 2004). This criticism has sometimes provoked rousing defenses of all that NHST could be if only it were used properly (e.g. ...
Full-text available
Article
Null-hypothesis significance testing (NHST) has become the main tool of inference in neuroscience, and yet evidence suggests we do not use this tool well: tests are often planned poorly, conducted unfairly, and interpreted invalidly. This editorial makes the case that in addition to reforms to increase rigor we should test less, reserving NHST for clearly confirmatory contexts in which the researcher has derived a quantitative prediction, can provide the inputs needed to plan a quality test, and can specify the criteria not only for confirming their hypothesis but also for rejecting it. A reduction in testing would be accompanied by an expansion of the use of estimation [effect sizes and confidence intervals (CIs)]. Estimation is more suitable for exploratory research, provides the inputs needed to plan strong tests, and provides important contexts for properly interpreting tests.
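
The estimation-first alternative the editorial recommends can be as simple as reporting an effect size with its confidence interval. The sketch below uses simulated data and standard formulas; it illustrates the general idea rather than reproducing any code from the editorial.

```python
# Brief estimation sketch: report an effect size with a confidence interval
# instead of (or before) any significance test. Data are simulated stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
a = rng.normal(0.0, 1.0, 60)
b = rng.normal(0.4, 1.0, 60)

diff = b.mean() - a.mean()
n1, n2 = len(a), len(b)
sp = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2))
se = sp * np.sqrt(1 / n1 + 1 / n2)
tcrit = stats.t.ppf(0.975, df=n1 + n2 - 2)

d = diff / sp                                   # Cohen's d (point estimate)
ci = (diff - tcrit * se, diff + tcrit * se)     # 95% CI for the raw difference

print(f"mean difference = {diff:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"Cohen's d = {d:.2f}")
# The interval conveys both magnitude and precision, which is exactly the input
# a planned quantitative prediction (and a strong test of it) would need.
```
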
... Although considered to be the strongest and most consistent predictor of adult offending, in terms of prognostication it is weak and fraught with error. This dovetails with much medical decision-making and prognostication that highlights the limits of using odds ratios or effect sizes as a basis for prediction (Meehl, 1967; Ware, 2006). For instance, Poldrack et al. (2020) discussed an example of genome-wide associations between schizophrenia and certain genetic variants in the population which, they note, while statistically significant, are in no way "clinically actionable" for diagnosing schizophrenia patients. ...
Full-text available
Article
Introduction/Aim: Extant tests of developmental theories have largely refrained from moving past testing models of association to building models of prediction, as have other fields with an intervention focus. With this in mind, we test the prognostic capacity of predictors derived from various developmental theories to predict offending outcomes in early adulthood. Methods: Using 734 subjects from the Rochester Youth Development Study (RYDS), we use out-of-sample predictions based on 5-fold cross-validation and compare the sensitivity, specificity, and positive predictive value of three different prognostic models to predict arrest and serious, persistent offending in early adulthood. The first uses only predictors measured in early adolescence, the second uses dynamic trajectories of delinquency from ages 14–22, and the third uses a combination of the two. We further consider how early in adolescence the trajectory models calibrate prediction. Results: Both the early-adolescent risk-factor-only model and the dynamic trajectory model were poor at prognosticating both arrest and persistent offending in early adulthood, which is manifest in the large rate of false positive cases. Conclusion: Furthermore, existing developmental theories would be well served to move beyond cataloging risk factors and draw more heavily on refinements, including a greater focus on human agency in life-course patterns of offending.
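
The evaluation logic described in the Methods can be sketched generically as follows. This is not the authors' model or data: the predictors and outcome are simulated (with the study's N = 734 and a rare positive class), and a plain logistic regression stands in for the prognostic models.

```python
# Schematic of 5-fold out-of-sample evaluation scored by sensitivity,
# specificity, and positive predictive value. Predictors and outcome are simulated.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=734, n_features=10, weights=[0.8, 0.2],
                           random_state=0)   # imbalanced outcome, as with offending

pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}, PPV = {ppv:.2f}")
# With a rare outcome, even a respectable-looking classifier can generate many
# false positives, which is the prognostic weakness the study documents.
```
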
Full-text available
Chapter
Implicit in the idea of measurement is the concept of objectivity. When we measure the temperature using a thermometer, we assume that the measurement we obtain is not dependent on the conditions of measurement, such as which thermometer we use. Any functioning thermometer should give us the same reading of, for example, 75 °F. If one thermometer measured 40°, another 250°, and a third 150°, then the lack of objectivity would invalidate the very idea of accurately measuring temperature.
... Consequently, the crucial theoretical notion prescribes that individuals' general tendency to engage in utility maximization at others' expense is essentially determined by D and not by whatever aspects a trait may encompass beyond D. Stated simply, to the extent that any aversive trait accounts for how much individuals weigh their own utility over others', it does so because of D. Arguably, this is a bold theoretical stance and thereby a clear advantage, as it sets a high empirical hurdle for the theory to overcome (Meehl, 1967; Platt, 1964). Moreover, this claim is unique to the theory of D in that it is not shared by any other constructs previously suggested as representations of the commonalities of aversive traits, such as low levels of Big Five Agreeableness (e.g., Vize et al., 2021) or HEXACO Honesty-Humility (Hodson et al., 2018). ...
Full-text available
Article
Individuals differ in how they weigh their own utility versus others'. This tendency codefines the dark factor of personality (D), which is conceptualized as the underlying disposition from which all socially and ethically aversive (dark) traits arise as specific, flavored manifestations. We scrutinize this unique theoretical notion by testing, for a broad set of 58 different traits and related constructs, whether any predict how individuals weigh their own versus others' utility in proactive allocation decisions (i.e., social value orientations) beyond D. These traits and constructs range from broad dimensions (e.g., agreeableness), to aversive traits (e.g., sadism) and beliefs (e.g., normlessness), to prosocial tendencies (e.g., compassion). In a large-scale longitudinal study involving the assessment of consequential choices (median N = 2,270; a heterogeneous adult community sample from Germany), results from several hundred latent model comparisons revealed that no meaningful incremental variance was explained beyond D. Thus, D alone is sufficient to represent the social preferences inherent in socially and ethically aversive personality traits.
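
The incremental-variance question can be illustrated with a simplified observed-score analogue of the latent model comparisons reported in the paper. Everything below is simulated: a dark-factor score D, a "flavored" aversive trait, and an allocation-decision outcome.

```python
# Simplified observed-score analogue (not the paper's latent-variable models) of
# the incremental-variance question: does a specific aversive trait explain
# allocation decisions beyond D? All data are simulated.
import numpy as np

rng = np.random.default_rng(5)
n = 2000
D = rng.normal(size=n)                       # dark factor score (simulated)
trait = 0.7 * D + 0.7 * rng.normal(size=n)   # a flavored manifestation of D
svo = -0.5 * D + rng.normal(size=n)          # social value orientation outcome

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_d_only = r_squared(D, svo)
r2_d_plus_trait = r_squared(np.column_stack([D, trait]), svo)
print(f"R^2 with D only    : {r2_d_only:.3f}")
print(f"R^2 with D + trait : {r2_d_plus_trait:.3f}")
print(f"incremental R^2    : {r2_d_plus_trait - r2_d_only:.3f}")
# Under the theory of D, the increment should be negligible, which is the
# high empirical hurdle the excerpt describes.
```
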
Full-text available
Chapter
We argue that a goal of measurement is general objectivity: point estimates of a person’s measure (height, temperature, and reader ability) should be independent of the instrument and independent of the sample in which the person happens to find herself. In contrast, Rasch’s concept of specific objectivity requires only differences (i.e., comparisons) between person measures to be independent of the instrument. We present a canonical case in which there is no overlap between instruments and persons: each person is measured by a unique instrument. We then show what is required to estimate measures in this degenerate case. The canonical case encourages a simplification and reconceptualization of validity and reliability. Not surprisingly, this reconceptualization looks a lot like the way physicists and chemometricians think about validity and measurement error. We animate this presentation with a technology that blurs the distinction between instruction, assessment, and generally objective measurement of reader ability. We encourage adaptation of this model to health outcomes measurement.
Full-text available
Chapter
In an argument whereby, “… individual-centered statistical techniques require models in which each individual is characterized separately and from which, given adequate data, the individual parameters can be estimated”.
Full-text available
Chapter
The field of career education measurement is in disarray. Evidence mounts that today’s career education instruments are verbal ability measures in disguise. A plethora of trait names such as career maturity, career development, career planning, career awareness, and career decision making have, in the last decade, appeared as labels to scales comprised of multiple choice items. Many of these scales appear to be measuring similar underlying traits and certainly the labels have a similar sound or “jingle” to them. Other scale names are attached to clusters of items that appear to measure different traits and at first glance appear deserving of their unique trait names, e.g., occupational information, resources for exploration, work conditions, personal economics. The items of these scales look different and the labels correspondingly are dissimilar or have a different “jangle” to them.
Full-text available
Chapter
Growth in reading ability varies across individuals in terms of starting points, velocities, and decelerations. Reading assessments vary in the texts they include, the questions asked about those texts, and in the way responses are scored. Complex conceptual and operational challenges must be addressed if we are to coherently assess reading ability, so that learning outcomes are comparable within students over time, across classrooms, and across formative, interim, and accountability assessments. A philosophical and historical context in which to situate the problems emerges via analogies from scientific, aesthetic, and democratic values. In a work now over 100 years old, Cook's study of the geometry of proportions in art, architecture, and nature focuses more on individual variation than on average general patterns. Cook anticipates the point made by Kuhn and Rasch that the goal of research is the discovery of anomalies—not the discovery of scientific laws. Bluecher extends Cook’s points by drawing an analogy between the beauty of individual variations in the Parthenon’s pillars and the democratic resilience of unique citizen soldiers in Pericles’ Athenian army. Lessons for how to approach reading measurement follow from the beauty and strength of stochastically integrated variations and uniformities in architectural, natural, and democratic principles.
Full-text available
Chapter
One must provide information about the conditions under which [the measurement outcome] would change or be different. It follows that the generalizations that figure in explanations [of measurement outcomes] must be change-relating… Both explainers [e.g., person parameters and item parameters] and what is explained [measurement outcomes] must be capable of change, and such changes must be connected in the right way (Woodward, 2003). Rasch’s unidimensional models for measurement tell us how to connect object measures, instrument calibrations, and measurement outcomes. Substantive theory tells us what interventions or changes to the instrument must offset a change to the measure for an object of measurement to hold the measurement outcome constant. Integrating a Rasch model with a substantive theory dictates the form and substance of permissible conjoint interventions. Rasch analysis absent construct theory and an associated specification equation is a black box in which understanding may be more illusory than not. The mere availability of numbers to analyze and statistics to report is often accepted as methodologically satisfactory in the social sciences, but falls far short of what is needed for a science.
Full-text available
Chapter
A metrological infrastructure for the social, behavioral, and economic sciences has foundational and transformative potentials relating to education, health care, human and natural resource management, organizational performance assessment, and the economy at large. The traceability of universally uniform metrics to reference standard metrics is a taken-for-granted essential component of the infrastructure of the natural sciences and engineering. Advanced measurement methods and models capable of supporting similar metrics, standards, and traceability for intangible forms of capital have been available for decades but have yet to be implemented in ways that take full advantage of their capacities. The economy, education, health care reform, and the environment are all now top national priorities. There is nothing more essential to succeeding in these efforts than the quality of the measures we develop and deploy. Even so, few, if any, of these efforts are taking systematic advantage of longstanding, proven measurement technologies that may be crucial to the scientific and economic successes we seek. Bringing these technologies to the attention of the academic and business communities for use, further testing, and development in new directions is an area of critical national need.
Full-text available
Chapter
There is nothing wrong with the NAEP reading exercises, the sampling design, or the NAEP Reading Proficiency Scale, these authors maintain. But adding a rich criterion-based frame of reference to the scale should yield an even more useful tool for shaping U.S. educational policy.
Full-text available
Chapter
Measurement plays a vital role in the creation of markets, one that hinges on efficiencies gained via universal availability of precise and accurate information on product quantity and quality. Fulfilling the potential of these ideals requires close attention to measurement and the role of technology in science and the economy. The practical value of a strong theory of instrument calibration and metrological traceability stems from the capacity to mediate relationships in ways that align, coordinate, and integrate different firms’ expectations, investments, and capital budgeting decisions over the long term. Improvements in the measurement of reading ability exhibit patterns analogous to Moore’s Law, which has guided expectations in the micro-processor industry for almost 50 years. The state of the art in reading measurement serves as a model for generalizing the mediating role of instruments in making markets for other forms of intangible assets. These remarks provide only a preliminary sketch of the kinds of information that are both available and needed for making more efficient markets for human, social, and natural capital. Nevertheless, these initial steps project new horizons in the arts and sciences of measuring and managing intangible assets.
Article
The Visual World Paradigm (VWP) is a powerful experimental paradigm for language research. Listeners respond to speech in a "visual world" containing potential referents of the speech. Fixations to these referents provide insight into the preliminary states of language processing as decisions unfold. The VWP has become the dominant paradigm in psycholinguistics and has been extended to every level of language, development, and disorders. Part of its impact stems from the impressive data visualizations that reveal the millisecond-by-millisecond time course of processing, and advances have been made in developing new analyses that precisely characterize this time course. All theoretical and statistical approaches make the tacit assumption that the time course of fixations is closely related to the underlying activation in the system. However, given the serial nature of fixations and their long refractory period, it is unclear how closely the observed dynamics of the fixation curves are actually coupled to the underlying dynamics of activation. I investigated this assumption with a series of simulations. Each simulation starts with a set of true underlying activation functions and generates simulated fixations using a simple stochastic sampling procedure that respects the sequential nature of fixations. I then analyzed the results to determine the conditions under which the observed fixation curves match the underlying functions, the reliability of the observed data, and the implications for Type I error and power. These simulations demonstrate that even under the simplest fixation-based models, observed fixation curves are systematically biased relative to the underlying activation functions, and they are substantially noisier, with important implications for reliability and power. I then present a potential generative model that may ultimately overcome many of these issues.
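A minimal simulation sketch in the spirit of the study summarized above, not the author's actual code: it assumes a single logistic target-activation function, a fixed 250 ms refractory period, and the simple rule that each saccade lands on the target with probability equal to the current activation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" target activation: a logistic rise over a 0-2000 ms trial.
t = np.arange(0, 2000, 10)                                # time in ms, 10-ms bins
activation = 1.0 / (1.0 + np.exp(-(t - 800) / 150.0))

def simulate_trial(refractory_ms=250):
    """One trial of simulated fixations: at each saccade the eye lands on the
    target with probability equal to the current activation, then cannot move
    again until the refractory period has elapsed."""
    fixating = np.zeros(t.size)
    now = 0
    while now < t[-1]:
        on_target = rng.random() < activation[now // 10]
        nxt = min(now + refractory_ms, int(t[-1]) + 10)
        fixating[now // 10:nxt // 10] = float(on_target)
        now = nxt
    return fixating

# "Observed" fixation proportions, averaged over many simulated trials.
observed = np.mean([simulate_trial() for _ in range(500)], axis=0)

# The averaged fixation curve is a lagged, step-like version of the smooth
# generating activation, illustrating the decoupling the article investigates.
print(float(activation[80]), float(observed[80]))         # values at t = 800 ms
```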
Chapter
A complex phenomenon can be explained only by a model of higher complexity than itself and only by specifying abstract patterns of behavior rather than point predictions. Increased precision does not improve point prediction; it only limits generalizability. Negative rules of order (prohibitions of specified classes of action) constrain complex phenomena. This is the only way finite and limited control structures can deal with indefinitely extended domains of behavior. Science and all such phenomena are never algorithmic or determinately computable; they operate under standing obligation rules (of what not to do) that must be followed indefinitely.
Chapter
Measurement theory (scaling theory) severely constrains what statistical procedures can meaningfully tell us. Nonparametric statistical procedures employ nominal, ordinal, and (some) interval scaling to provide genuinely meaningful information in the human sciences, whereas parametric procedures based upon ratio scaling yield misleading results because they add information that the data cannot actually support. It is better to use a less powerful procedure that does work than a more powerful one, routinely applied to physical phenomena, that cannot work with the available functional realm data.
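A minimal sketch, with invented 5-point ordinal ratings, of the trade-off described above: the rank-based test uses only the ordinal information the data can support, while the parametric test treats the category labels as if they were interval or ratio quantities.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical 5-point ratings from two groups; the numerals are ordered
# category labels, not amounts, so equal numeric spacing is not guaranteed.
group_a = rng.choice([1, 2, 3, 4, 5], size=40, p=[0.05, 0.15, 0.30, 0.30, 0.20])
group_b = rng.choice([1, 2, 3, 4, 5], size=40, p=[0.20, 0.30, 0.30, 0.15, 0.05])

# Rank-based (nonparametric) comparison: uses only the ordering of responses.
u_stat, p_rank = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Parametric comparison: imports interval/ratio assumptions the measurement
# procedure may not license.
t_stat, p_t = stats.ttest_ind(group_a, group_b)

print(f"Mann-Whitney p = {p_rank:.4f}; t-test p = {p_t:.4f}")
```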
Article
We argue that critical areas of memory research rely on problematic measurement practices and provide concrete suggestions to improve the situation. In particular, we highlight the prevalence of memory studies that use tasks (like the "old/new" task: "have you seen this item before? yes/no") where quantifying performance is deeply dependent on counterfactual reasoning that depends on the (unknowable) distribution of underlying memory signals. As a result of this difficulty, different literatures in memory research (e.g., visual working memory, eyewitness identification, picture memory, etc.) have settled on a variety of fundamentally different metrics to get performance measures from such tasks (e.g., A', corrected hit rate, percent correct, d', diagnosticity ratios, K values, etc.), even though these metrics make different, contradictory assumptions about the distribution of latent memory signals, and even though all of their assumptions are frequently incorrect. We suggest that in order for the psychology and neuroscience of memory to become a more cumulative, theory-driven science, more attention must be given to measurement issues. We make a concrete suggestion: The default memory task for those simply interested in performance should change from old/new ("did you see this item?") to two-alternative forced-choice ("which of these two items did you see?"). In situations where old/new variants are preferred (e.g., eyewitness identification; theoretical investigations of the nature of memory signals), receiver operating characteristic (ROC) analysis should be performed rather than a binary old/new task.
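A minimal sketch, using invented hit and false-alarm counts, of how several of the metrics listed above are computed from the very same old/new data; the d' line assumes equal-variance Gaussian memory signals, which is exactly the kind of distributional commitment the article argues often goes unexamined.

```python
from scipy.stats import norm

# Hypothetical old/new recognition results: 100 "old" and 100 "new" trials.
hits, misses = 80, 20                        # responses to old items
false_alarms, correct_rejections = 30, 70    # responses to new items

hit_rate = hits / (hits + misses)
fa_rate = false_alarms / (false_alarms + correct_rejections)

# Three of the metrics the article lists, applied to the same counts.
percent_correct = (hits + correct_rejections) / 200
corrected_hit_rate = hit_rate - fa_rate
d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)   # equal-variance Gaussian assumption

print(percent_correct, corrected_hit_rate, round(d_prime, 3))
```

Because percent correct and the corrected hit rate vary with response bias while d' (under its Gaussian assumption) does not, the metrics can order two conditions differently when bias or signal-distribution shape changes, which is one way the contradictory assumptions matter in practice.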
Article
The doctoral dissertation often shapes the career that follows it, influencing both opportunities encountered and research conducted. This article describes the ways this has been true for me and then argues that, given the dissertation's importance, graduate programs do not focus sufficiently on strategies for conceiving research. As a result, many students flounder at the dissertation proposal stage. Drawing on the role of doubt in my career and in science more generally, I propose changes in doctoral programs to reduce the problem.
Article
Fitts' law and throughput based on effective measures are two mathematical models frequently used to analyze human motor performance in a standardized pointing task, e.g., to compare the performance of input and output devices. Even though pointing has been deeply studied in 2D, it is not well understood how different task execution strategies affect throughput in pointing in 3D virtual environments. In this work, we examine the effective throughput measure, claimed to be invariant to task execution strategies, in Virtual Reality (VR) systems with three such strategies, “as fast, as precise, and as fast and as precise as possible” for ray casting and virtual hand interaction, by re-analyzing data from a 3D pointing ISO 9241-411 study. Results show that effective throughput is not invariant for different task execution strategies in VR, which also matches a more recent 2D result. Normalized speed vs. accuracy curves also did not fit the data. We thus suggest that practitioners, developers, and researchers who use MacKenzie's effective throughput formulation should consider our findings when analyzing 3D user pointing performance in VR systems.
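A minimal sketch of the effective-throughput computation in MacKenzie's formulation referenced above, applied to invented endpoint and movement-time data rather than the study's dataset.

```python
import numpy as np

def effective_throughput(distance_mm, endpoint_errors_mm, movement_times_s):
    """Effective throughput: effective width W_e = 4.133 * SD of endpoint
    errors, effective index of difficulty ID_e = log2(D_e / W_e + 1), and
    throughput = ID_e divided by the mean movement time for the condition."""
    d_e = float(np.mean(distance_mm))
    w_e = 4.133 * np.std(endpoint_errors_mm, ddof=1)
    id_e = np.log2(d_e / w_e + 1)               # bits
    return id_e / np.mean(movement_times_s)     # bits per second

# Invented trials for one target distance under two execution strategies:
# "fast" endpoints scatter widely; "precise" endpoints cluster tightly.
fast = effective_throughput([300] * 5, [12, -8, 15, -10, 6], [0.45, 0.50, 0.48, 0.52, 0.47])
precise = effective_throughput([300] * 5, [3, -2, 4, -3, 1], [0.80, 0.85, 0.82, 0.90, 0.78])

print(round(fast, 2), round(precise, 2))        # the two strategies disagree
```

With these made-up numbers the two strategies already yield different throughput values; that is only an illustration, not a test, of the non-invariance the article reports.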
Chapter
Sampling error ensures that a sample is never a perfect representation of a population, so generalizing from a sample to a population is risky. Null hypothesis tests help us decide when to generalize. A null hypothesis is infinitely numerically precise, which allows us to formulate an equally precise prediction should the null hypothesis be correct. That prediction is then compared mathematically to the outcome of an investigation. The product of that comparison indicates how much the results of the investigation support the null hypothesis. Traditionally, if there is little support, we reject the null. But the null hypothesis’s precision means that many null hypotheses cannot be correct, and yet those null hypotheses should be tested anyway: to see if we can trust our data to tell us the direction of a difference we know must be there. This is how Ronald Fisher used null hypothesis tests; so-called one-tailed tests are an unrelated topic; and testing for direction renders many of the criticisms of null hypothesis testing irrelevant.
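A minimal sketch, with invented scores, of the directional use of a null hypothesis test described above: the point-null of exactly equal means is tested, and a small p-value is read as licence to trust the sign of the observed difference rather than as proof that a zero difference has been disconfirmed.

```python
from scipy import stats

# Invented scores from two conditions whose means we assume cannot be exactly
# equal; the test asks whether the data can be trusted to show the direction.
condition_a = [24, 27, 21, 30, 26, 28, 25, 29]
condition_b = [22, 20, 25, 19, 23, 21, 24, 18]

t_stat, p_value = stats.ttest_ind(condition_a, condition_b)

alpha = 0.05
if p_value < alpha:
    direction = "A > B" if t_stat > 0 else "B > A"
    print(f"p = {p_value:.4f}: trust the data about direction ({direction})")
else:
    print(f"p = {p_value:.4f}: withhold judgment about direction")
```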
Chapter
Though they sound daunting, probability density distributions are easy to understand. Distributions help us see why setting α to 0.05 means we will commit type I errors 5% of the time we test correct null hypotheses. More importantly, distributions help us see that if we set α to 0.05, we will get the direction of a difference wrong (in other words, make a type III, or type S, error) less than 2.5% of the time, and only when statistical power is very low. We might see that as good news, but perhaps we draw no conclusion too often. John Tukey recommended setting α to 0.10 because doing so ensures we will make type III errors less than 5% of the time, 5% being a conventional value of α. Finally, distributions remind us that we conduct one-tailed tests to increase power, not to determine direction.
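A minimal simulation sketch of the type III (type S) error rate discussed above, assuming a two-sample t-test, a tiny true effect, and small samples so that power is very low; the counted events are tests that reach significance at α = 0.05 with the sign pointing away from the true direction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, true_effect, n, reps = 0.05, 0.1, 10, 20_000   # tiny effect, small n: very low power

wrong_direction = 0
for _ in range(reps):
    a = rng.normal(true_effect, 1.0, n)   # the group with the truly higher mean
    b = rng.normal(0.0, 1.0, n)
    t_stat, p = stats.ttest_ind(a, b)
    if p < alpha and t_stat < 0:          # "significant", but the sign is wrong
        wrong_direction += 1

print(f"estimated type III error rate: {wrong_direction / reps:.4f} (alpha/2 = {alpha / 2})")
```

Under these settings the estimate comes out below α/2 = 0.025, consistent with the chapter's point that directional mistakes are rare even when power is poor.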
Chapter
The following misconceptions and others are explained and disposed of: statistical “significance” means a difference is large enough to be important; if we reject the null hypothesis, we should accept the alternative hypothesis that something other than the null hypothesis is correct; p is the probability of a type I error; p is the probability that the null hypothesis is correct; confidence intervals should be used instead of null hypothesis tests; power can be used to justify accepting the null hypothesis; if p is greater than α, we should fail to reject the null hypothesis. That last statement is correct in some cases, but in others it is absurd.
Chapter
Null hypotheses can be placed into one of four categories depending on their plausibility and whether our interest is in refuting them or in determining direction. Though it is generally inappropriate to accept a null hypothesis, there are reasonable criteria for doing so under certain circumstances. Inferential statistics often involves estimating effect size but doing so has nothing to do with testing null hypotheses. It is important that we not test our hypotheses with the results that inspired those hypotheses. We should provide all of our results, not just those deemed statistically “significant,” to avoid contributing to publication bias. Although it is helpful to see null hypothesis testing in the context of the hypothetico-deductive method, if we test for direction, we are not using that method but rather some other underappreciated and unnamed method of inference.
Chapter
The significance test controversy is revisited in the light of Jeffreys’ views on the role of statistical inference in experimental investigations. These views were clearly expressed in the third edition of his Theory of Probability. We will quote and comment on the relevant passage. We will consider only the elementary inference about the difference between two means, but our conclusions will be applicable to most of the usual situations encountered in experimental data analysis. Keywords: Bayesian interpretation of p-values; experimental investigations; Jeffreys’ views of statistical inference; Killeen’s p-rep; pure estimation; role of significance tests.
Article
Many domains of inquiry in psychology are concerned with rich and complex phenomena. At the same time, the field of psychology is grappling with how to improve research practices to address concerns with the scientific enterprise. In this Perspective, we argue that both of these challenges can be addressed by adopting a principle of methodological variety. According to this principle, developing a variety of methodological tools should be regarded as a scientific goal in itself, one that is critical for advancing scientific theory. To illustrate, we show how the study of language and communication requires varied methodologies, and that theory development proceeds, in part, by integrating disparate tools and designs. We argue that the importance of methodological variation and innovation runs deep, travelling alongside theory development to the core of the scientific enterprise. Finally, we highlight ongoing research agendas that might help to specify, quantify and model methodological variety and its implications. Philosophers of science have identified epistemological criteria for evaluating the promise of a scientific theory. In this Perspective, Dale et al. propose that a principle of methodological variety should be one of these criteria, and argue that psychologists should actively cultivate methodological variety to advance theory.
Book
Modern Applied Regressions creates an intricate and colorful mural with mosaics of categorical and limited response variable (CLRV) models using both Bayesian and frequentist approaches. Written for graduate students, junior researchers, and quantitative analysts in behavioral, health, and social sciences, this text provides details for doing Bayesian and frequentist data analysis of CLRV models. Each chapter can be read and studied separately, with R coding snippets and template interpretation for easy replication. Along with the doing part, the text provides basic and accessible statistical theories behind these models and uses a narrative style to recount their origins and evolution. This book first scaffolds both Bayesian and frequentist paradigms for regression analysis, and then moves on to different types of categorical and limited response variable models, including binary, ordered, multinomial, count, and survival regression. Each of the middle four chapters discusses a major type of CLRV regression that subsumes an array of important variants and extensions. The discussion of each major type usually begins with the history and evolution of the prototypical model, followed by the formulation of basic statistical properties and an elaboration on the doing part of the model and its extensions. The doing part typically includes R code, results, and their interpretation. The last chapter discusses advanced modeling and predictive techniques (multilevel modeling, causal inference and propensity score analysis, and machine learning) that are largely built with the toolkits designed for the CLRV models previously covered.
Article
The use of hypothesis-testing models, particularly the null-hypothesis model, is criticized. Criticism is also directed at "fixed-increment" hypothesis testing, the small-N fallacy, the sampling fallacy, and the crucial experiment. The author recommends that hypotheses be tested by a process of estimation, illustrated with the use of analysis of variance. Confidence intervals are used to provide an indication of the level of confidence to be placed in an estimate. For comparisons of mean differences, epsilon is recommended as providing an unbiased estimate of the correlation ratio.
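A minimal sketch of the estimation approach the abstract recommends, applied to invented one-way data: a confidence interval for one pairwise mean difference, plus epsilon-squared as an approximately unbiased estimate of the correlation ratio, computed with the standard formula (SS_between - df_between * MS_within) / SS_total.

```python
import numpy as np
from scipy import stats

# Invented scores from three groups in a one-way design.
groups = [np.array([12, 15, 14, 10, 13]),
          np.array([18, 17, 20, 16, 19]),
          np.array([14, 16, 13, 15, 17])]

grand = np.concatenate(groups)
ss_total = np.sum((grand - grand.mean()) ** 2)
ss_between = sum(len(g) * (g.mean() - grand.mean()) ** 2 for g in groups)
df_between = len(groups) - 1
df_within = len(grand) - len(groups)
ms_within = (ss_total - ss_between) / df_within

# Epsilon-squared: the correlation ratio with the bias of eta-squared largely removed.
epsilon_sq = (ss_between - df_between * ms_within) / ss_total

# Interval estimation for one mean difference (group 2 minus group 1).
diff = groups[1].mean() - groups[0].mean()
se = np.sqrt(ms_within * (1 / len(groups[0]) + 1 / len(groups[1])))
ci_low, ci_high = stats.t.interval(0.95, df_within, loc=diff, scale=se)

print(round(float(epsilon_sq), 3), float(diff),
      (round(float(ci_low), 2), round(float(ci_high), 2)))
```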
Article
The test of significance does not provide the information concerning psychological phenomena characteristically attributed to it; and a great deal of mischief has been associated with its use. The basic logic associated with the test of significance is reviewed. The null hypothesis is characteristically false under any circumstances. Publication practices foster the reporting of small effects in populations. Psychologists have "adjusted" by misinterpretation, taking the p value as a "measure," assuming that the test of significance provides automaticity of inference, and confusing the aggregate with the general. The difficulties are illuminated by bringing to bear the contributions from the decision-theory school on the Fisher approach. The Bayesian approach is suggested.
Article
Though several serious objections to the null-hypothesis significance test method are raised, its most basic error lies in mistaking the aim of a scientific investigation to be a decision, rather than a cognitive evaluation. It is further argued that the proper ...
Article
Concerning the traditional nondirectional 2-sided test of significance, the author argues that "we cannot logically make a directional statistical decision or statement when the null hypothesis is rejected on the basis of the direction of the difference in the observed means." Thus, this test "should almost never be used." He proposes that "almost without exception the directional two-sided test should replace" it (18 ref.)
Statistical significance in psychiatric research. Reports from the Research Laboratories of the Department of Psychiatry
  • David T Lykken
Lykken, David T., "Statistical significance in psychiatric research," Reports from the Research Laboratories of the Department of Psychiatry, University of Minnesota. Report No. PR-66-9, Minneapolis: December 30, 1966.
Statistics for psychologists
  • William L Hays
Hays, William L., Statistics for psychologists, New York: Holt, Rinehart, and Winston, 1963.