Article

THE ONCE AND FUTURE ISSUES OF VALIDITY: ASSESSING THE MEANING AND CONSEQUENCES OF MEASUREMENT


Abstract

A unified view of test validity is propounded that stresses both the existing evidence for and the potential consequences of test interpretation and use. The thrust of this unified view is that the appropriateness, meaningfulness, and usefulness of score-based inferences are inseparable, and that the unifying force is empirically grounded construct interpretation. Evidence and theoretical rationales are examined indicating that construct interpretation undergirds all score-based inferences, not just those related to interpretive meaningfulness but also the content- and criterion-related inferences specific to applied decisions and actions based on test scores.

... Gonen and Goldberg 2019): how bias is measured can change, and in many cases invert, judgments about the efficacy of bias mitigation methods. Consequential validity emphasizes the eventual usage and impact of the measure (Messick 1988). While most of these consequences will be determined in the future, our bias measures have already been adopted as the default bias metrics in the HELM benchmark by Liang et al. (2022) to evaluate 30+ prominent language models. ...
... Together, this makes the case for our social bias measures being trustworthy. However, as Messick (1987, 1988) reminds us, the task of validating a measure is an ongoing process. ...
Article
How do we design measures of social bias that we trust? While prior work has introduced several measures, no measure has gained widespread trust: instead, mounting evidence argues we should distrust these measures. In this work, we design bias measures that warrant trust based on the cross-disciplinary theory of measurement modeling. To combat the frequently fuzzy treatment of social bias in natural language processing, we explicitly define social bias, grounded in principles drawn from social science research. We operationalize our definition by proposing a general bias measurement framework DivDist, which we use to instantiate 5 concrete bias measures. To validate our measures, we propose a rigorous testing protocol with 8 testing criteria (e.g. predictive validity: do measures predict biases in US employment?). Through our testing, we demonstrate considerable evidence to trust our measures, showing they overcome conceptual, technical, and empirical deficiencies present in prior measures.
... We envisaged obtaining a relatively short, unidimensional, reliable, and valid measure of erroneous assumptions of American physicians about adults with ID. Concerning reliability and validity, it needs to be noted that they are context-specific (Messick 1988); thus the conclusions presented in this study are with respect to this specific population of participants and their characteristics. ...
... Future steps in this research include translating and adapting the BAID to other languages and cultures to be used in different countries (International Test Commission 2017) and, thus, improving the diffusion and use of this instrument as well as making comparisons internationally. Reliability and validity are context-specific, dependent on the sample's characteristics and the study's conditions (Messick 1988). Thus, if the instrument is used in a context that is different from that in which it was developed, the psychometric properties need to be investigated. ...
Article
Background Physicians' erroneous assumptions about individuals with intellectual disability (ID) negatively impact the quality of care provided to this population. This study aimed to investigate the psychometric properties of the Beliefs About Adults with ID (BAID), an instrument we developed for measuring physicians' erroneous assumptions about adults with ID. Methods Two hundred ninety‐two American physicians participated. Classical test theory and Rasch measurement theory were used to refine the scale (through item analysis, exploratory factor analysis, infit and outfit mean‐squares statistics, and differential item functioning) and investigate its psychometric properties (functioning of the response scale, reliability, and validity). Results The BAID provided a unidimensional, reliable, valid, and precise measure in assessing high levels of erroneous assumptions. It showed convergent and divergent validity with the different factors of a scale measuring attitudes towards ID. The BAID items were discriminant, non‐redundant, unambiguous, and invariant across gender and previous ID training. The BAID response scale was found to be appropriate for measuring physicians' erroneous assumptions about adults with ID. Conclusions BAID is a brief instrument with good psychometric properties to assess erroneous assumptions about adults with ID in physicians of different genders and who have/have not previously received ID training. Therefore, it might be helpful for research and medical education purposes.
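The Rasch analysis mentioned above rests on a simple logistic item response model. As a minimal sketch (with illustrative ability and difficulty values, not the BAID's estimated parameters):

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability of a positive response to an item,
    given person location theta and item difficulty b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A person located 1 logit above an item's difficulty
# gives the keyed response about 73% of the time.
p = rasch_p(theta=1.0, b=0.0)
```

Infit and outfit mean-square statistics, as used in the study, then compare observed responses against these model-implied probabilities.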
... There is an oft-cited gap between validity theory and the practice of validation, which many trace to the theory of construct validity and the difficulty of implementing such a theory (Messick, 1988, 1989; Shepard, 1993, 1997; Kane, 1992, 2001, 2004). In good part, argument-based approaches emerge from the notion that we validate inferences and uses rather than tests. ...
... Some contemporary researchers in language assessment and general education focused on the third. Messick (1980, 1988, 1989, 1995, 1996, 1998) articulated a unified view of validity in a series of publications. He was clear that validity is about the inferences, interpretations, actions, or decisions based on a test score, not the test itself. ...
Book
Full-text available
ABSTRACT: This monograph describes a framework for test validation that synthesizes construct theories and argument-based approaches, bringing washback (also described as consequences or impact) into the foreground of validation practices. This framework is well suited to tests grounded mainly in a construct theory, whose validation practices have allied with test validity theory as codified in the Standards for Educational and Psychological Testing, where washback has been in the background. A Bayesian method is described and used in a washback study of test preparation. The monograph closes with a critical review of washback studies for nationally mandated and regulated (language) tests for immigration purposes.
... Loevinger made the conception explicit by stating that "since predictive, concurrent, and content validities are all essentially ad hoc, construct validity is the whole of validity from a scientific point of view" [12]. Later, Messick proposed a unified framework of validity which has been widely endorsed [1,13,14]. This seminal framework is a four-fold classification defined by two facets: the source of justification for the testing and the function or outcome of the testing. ...
... The framework is conceived as a progressive matrix, with construct validity an indispensable element in each cell, highlighting its pervasive and overarching nature. The unified approach advances the concept of validity by precluding excessive reliance on selected forms of evidence, highlighting the crucial though auxiliary role of content- and criterion-related evidence, and formally considering value implications and social consequences [14]. Although Messick's unified view of validity has been far-reaching in its influence, it has been found opaque and vague, providing little feasible guidance for test validation [15]. ...
Conference Paper
Validity and validation have been key issues in language testing. Over the past seven decades, researchers have aired different views on validity, whose development can be divided into four stages: the criterion-based approach, the tripartite approach, the unified approach, and the argument-based approach. To provide an informed understanding of validity theory, this article briefly traces the history of research on validity concepts and the corresponding validation frameworks. By examining the development of validity and validation, some possible topics for future research are uncovered.
... Construct validity evidence is focused on the extent to which a tool, such as an interview, measures the 'right' psychological construct(s) (Messick 1989). As argued by Messick (1989), construct validity pertains to the extent to which a test is measuring the 'right' psychological construct (conscientiousness, for example). The case that a test, tool or measure has high construct validity is often made using evidence that these instruments use questions that relate to and represent some domain and that outcomes from them are systematically related to another measure (i.e. ...
Article
Full-text available
Across the world, teacher quality has come to be recognised as one of the most important variables affecting student outcomes; consequently, the regulation of entry into the profession is the subject of iterative review. The traditional ‘one-off’ interview, involving an interviewee and two or more interviewers, is a common, but not unproblematic, selection mechanism in the field. In particular, the modest positive correlation between performance at interviews and in clinical settings raises questions about using interviews as a selection mechanism for Initial Teacher Education (ITE) programmes. In this paper, we draw on validity theory and some key commentaries and studies in the research literature to offer a perspective on the extent to which the traditional interview provides data that can be used to make good decisions about applicants for ITE. The paper proposes a validity-based framework for use by practitioners to enhance the conceptualisation, design and evaluation of interviews in the process of teacher selection.
... Another critical point to note regarding the resilience scales is that their development involved a definition of validity that relies on the ability of a measure to capture its target construct as established through statistical analyses of item content and score performance (Pangallo et al., 2015; Ungar & Liebenberg, 2011). Echoing measurement scholars such as Bowen (2008), Cronbach (1988), and Messick (1988), we contend this view of validation is inadequate, particularly for a polysemous construct or concept like resilience that researchers are still struggling to define or operationalize. A more adequate approach would include respondent-related validation: assessing whether respondents interpreted items and response options as intended and whether response options fit with respondents' perceptions and experiences (Bowen, 2008). ...
... A more adequate approach would include respondent-related validation: assessing whether respondents interpreted items and response options as intended and whether response options fit with respondents' perceptions and experiences (Bowen, 2008). Practice-related validation (Cronbach, 1988; Messick, 1988) is also important. Practice-related validation refers to whether scores from a measure are used appropriately for the setting or context in which the measure is used. ...
Article
Resilience is critical among survivors of trafficking as they are mostly vulnerable populations who face multiple adversities before, during, and after trafficking. However, resilience in survivors of trafficking is understudied. This scoping review aims to clarify the current state of knowledge, focusing on definitions of resilience, how resilience has been studied, and factors associated with resilience among survivors. Five databases were searched using key words related to trafficking and resilience. Studies were included if they were published in English between 2000 and 2019 and focused on resilience with the study design including at least one of these four features: (a) use of standardized measures of resilience, (b) qualitative descriptions of resilience, (c) participants were survivors or professionals serving survivors, and (d) data sources such as case files or program manuals directly pertained to survivors. Eighteen studies were identified. Findings indicated that resilience was primarily described as emergent from interactions between the survivor and the environment. Resilience in trafficking appeared largely similar to resilience in other kinds of victimization. Nonetheless, trafficking survivors also may display resilience in alternative ways such as refusing treatment. Positive interpersonal relationships were the most commonly mentioned resilience factor. In addition, current research lacks studies featuring longitudinal designs, interventions, participatory methods, types of trafficking other than sexual trafficking, and demographic characteristics such as age, gender, and national origin. Future research needs to establish definitions and measures of resilience that are culturally and contextually relevant to survivors and build knowledge necessary for designing and evaluating resilience-enhancing interventions.
... Many studies cite Irvin et al.'s (2004) systematic review of using ODRs within a school-wide system as a key study of related validity evidence. Irvin et al. wrote: In this review, we used the unified approach to construct validity template developed by Messick (1988) to document exemplars of the empirical and ethical foundations for the validity of interpretations and uses of school-wide ODR measures. We focused on ODR validity for assessing (a) school-wide behavioral climate, (b) the effectiveness of school-wide behavioral intervention programs, and (c) differing needs across schools in developing positive behavioral environments. ...
Article
Full-text available
There is a large body of research literature using the number of office discipline referrals (ODRs) as a measure of students' problem behavior (see Study Selection Flowchart). Many studies cite Irvin et al.'s (2004) systematic review of using ODRs within a school-wide system as a key study of related validity evidence. Irvin et al. wrote: "In this review, we used the unified approach to construct validity template developed by Messick (1988) to document exemplars of the empirical and ethical foundations for the validity of interpretations and uses of school-wide ODR measures. We focused on ODR validity for assessing (a) school-wide behavioral climate, (b) the effectiveness of school-wide behavioral intervention programs, and (c) differing needs across schools in developing positive behavioral environments. We found a substantial basis for interpreting and using ODR measures in these ways. Several important issues require ongoing attention, however, if school-wide interpretation and use of ODR measures is to improve." (p. 143) They go on to describe three important issues: (a) the validity of using ODRs for individual students versus school-wide interpretation, (b) understanding ODRs as interactions (i.e., a student's response to a given situation, a teacher/staff member's response to the student's behavior, and an administrator's response to the student–teacher interaction), and (c) ongoing concerns regarding the cultural sensitivity of the school community's value system (Irvin et al., 2004). To investigate ODR use further, the current study conducted a mixed methods research synthesis of ODR studies published after the Irvin et al. (2004) review, guided by the Standards for Educational and Psychological Testing. This mixed methods research synthesis reexamined the evidence of validity for identifying individual students using their office discipline referrals (ODRs).
ODRs range from severe (e.g., weapons) to subjectively determined (e.g., disruption) problem behavior, suggesting two content domains. There are no studies showing teachers can reliably identify student behavior by ODR content. Proposed ODR cut-score intervals (i.e., 0-1, 2-5, and > 6) were not linked to teacher behavior rating scale scores across the proposed percentile ranges (i.e., 0-80%, 81-94%, and > 95%). Convergent and discriminant correlations between ODRs and behavioral observations or rating scales showed small effect sizes (on average 7% and 4% explained variance, respectively). ODR data tend to require specialized statistical analyses because of their distributional properties; yet these analyses were not used in the studies reviewed. Odds and risk ratios show that students of color receive relatively more ODRs than do students who are White, for different ODR content and contexts, and relatively more out-of-school suspensions and expulsions. Because there is limited evidence of validity, and because practices are unfair for students of color, we question identifying individual students using subjectively determined ODR categories. In conclusion, the American Psychological Association Council of Representatives' apology to persons of color and call for the dismantling of racist practices suggest we take immediate action to end unfair assessment practices.
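The odds and risk ratios reported above come from 2×2 tables of counts. A minimal sketch with hypothetical counts (not the synthesis's actual data):

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a 2x2 table:
    a = group 1 with outcome, b = group 1 without,
    c = group 2 with outcome, d = group 2 without."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Odds ratio from the same 2x2 table."""
    return (a * d) / (b * c)

# Hypothetical counts: 30 of 100 students referred in group 1
# versus 10 of 100 in group 2.
rr = risk_ratio(30, 70, 10, 90)   # group 1's referral risk is 3x group 2's
orr = odds_ratio(30, 70, 10, 90)
```

A risk ratio (or odds ratio) above 1 indicates disproportionate referral of the first group; the two statistics diverge as the outcome becomes more common.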
... L. J. Cronbach and Paul Meehl subsequently published a paper explaining "construct validity" more thoroughly (Cronbach & Meehl, 1955). Before focusing on the history of construct validity, we must acknowledge an unfortunate fact: references to different kinds of validity are extremely widespread and seem to permeate introductory textbooks across many areas of psychology and education. Table 1 lists the different validity terms used across the seven editions of the AERA, APA, and NCME Standards. As Table 1 shows, the terminology evolved from "categories" of validity to "types" of validity, then to "aspects" of validity, back to "categories," and finally to "sources of validity evidence," the phrasing that has held for the past 20-plus years. Interestingly, "content" is the only term that has persisted across the entire seventy-year history. The three "traditional forms" of validity recognized by many are construct validity, content validity, and criterion-related validity. Content validity refers to the degree to which test content represents the ability the test targets and aligns with the test's purpose; it is clearly a prerequisite for educational and credentialing examinations, which must demonstrate alignment with a target curriculum or job domain (Crocker, 2003; Martone & Sireci, 2009; Sireci, 1998; Sireci & Faulkner-Bond, 2014; AERA et al., 1999, 2014). Building on earlier unitary views (e.g., Guion, 1977), Messick proposed that this unitary concept is in essence construct validity (Messick, 1975, 1980, 1988, 1989), most fully in his landmark chapter in the third edition of Educational Measurement (Messick, 1989, p. 64). That account has been criticized as unhelpful for applied testing (Ebel, 1977; Sireci, 1998; Yalow & Popham, 1983) or as too obscure to effectively promote validation practice (Shepard, 1993), prompting Kane's argument-based alternative (Kane, 1992, 2006, 2013; see also Embretson, 1983; Messick, 1989; Mislevy, 2009; Zenisky et al., 2018, p. 10). ...
Article
Full-text available
Construct validity theory presents the most comprehensive description of "validity" as it pertains to educational and psychological testing. The term "construct validity" was introduced in 1954 in the Technical Recommendations for Psychological Tests and Diagnostic Techniques (American Psychological Association [APA], 1954), and subsequently elucidated by two members of the 1954 committee, Cronbach and Meehl (1955). Construct validity theory has had enormous impact on theoretical descriptions of validity, but it was not explicitly supported by the last two versions of the Standards for Educational and Psychological Testing (American Educational Research Association [AERA] et al., 1999, 2014). In this article I trace the history of the debate regarding the importance of construct validity theory for test validation, identify the essential elements of construct validity theory that are critical for validating the use of a test for a particular purpose, and propose a validation framework that focuses on test use rather than test constructs. This "de-constructed" approach involves four steps: (a) clearly articulating testing purposes, (b) identifying potential negative consequences of test use, (c) crossing test purposes and potential misuses with the five sources of validity evidence in the AERA et al. (2014) Standards for Educational and Psychological Testing, and (d) prioritizing the sources of validity evidence needed to build a sound validity argument focused on test use and consequences. The goal of de-constructed validation is to embrace the major tenets of construct validity theory by using them to develop a coherent, comprehensive validity argument that is comprehensible to psychometricians, court justices, policy makers, and the general public, and that remains consistent with the AERA et al. (2014) Standards.
... Thus, by the mid-1980s, the consensus was set that validity was a unitary concept. Around this same time, a particularly compelling and influential validity theorist, Samuel Messick, drew from Cronbach and Meehl (1955), Loevinger (1957), and others (e.g., Guion, 1977) to argue that, in essence, this unitary conceptualization was construct validity (Messick, 1975, 1980, 1988, 1989). In his landmark chapter in the third edition of Educational Measurement (Messick, 1989), he used philosophy, logical argument, and a comprehensive review of the literature and practice in educational and psychological testing, to claim that all interpretations of test scores, and the evaluation of the use of a test, must be viewed in relation to the construct the test intends to measure. ...
Article
Full-text available
Construct validity theory presents the most comprehensive description of "validity" as it pertains to educational and psychological testing. The term "construct validity" was introduced in 1954 in the Technical Recommendations for Psychological Tests and Diagnostic Techniques (American Psychological Association [APA], 1954), and subsequently elucidated by two members of the 1954 committee — Cronbach and Meehl (1955). Construct validity theory has had enormous impact on the theoretical descriptions of validity, but it was not explicitly supported by the last two versions of the Standards for Educational and Psychological Testing (American Educational Research Association [AERA] et al., 1999, 2014). In this article I trace some of the history of the debate regarding the importance of construct validity theory for test validation, identify the essential elements of construct validity theory that are critical for validating the use of a test for a particular purpose, and propose a framework for test validation that focuses on test use, rather than test construct. This "de-constructed" approach involves four steps: (a) clearly articulating testing purposes, (b) identifying potential negative consequences of test use, (c) crossing test purposes and potential misuses with the five sources of validity evidence listed in the AERA et al. (2014) Standards for Educational and Psychological Testing, and (d) prioritizing the sources of validity evidence needed to build a sound validity argument that focuses on test use and consequences. The goals of deconstructed validation are to embrace the major tenets involved in construct validity theory by using them to develop a coherent and comprehensive validity argument that is comprehensible to psychometricians, court justices, policy makers, and the general public; and is consistent with the AERA et al. (2014) Standards.
... In view of the fact that the practical implementation of DA calls for making better informed decisions at each and every stage of various mediational moves, as well as in placement decision making (see Anton, 2003), the consequential validity (Messick, 1988) of DA is not overlooked. Laing and Kamhi (2003) rightly point out that learners' actual performance is a more realistic portrayal of their language learning difficulty than psychometric criteria given by norm-referenced tests. ...
... Since the beginning of modern psychological assessment, various authors have emphasized the need to clearly distinguish psychometric concepts, for the purpose of a sharp, structured, and methodical assessment science (Cronbach & Meehl, 1955;Loevinger, 1957;Messick, 1988;Slaney & Maraun, 2008). We focus in this paper on one element of this larger need: the use of the word dimension to describe psychopathology constructs in the context of transitions away from diagnosis based on polythetic disorder categories. ...
... From our perspective, it is similarly undesirable that fairness should be viewed as resting on different "views" of fairness. To mirror Messick (1998), the multiple fairness views discussed by the Standards are better viewed as complementary forms of evidence to be integrated into an overall judgment of assessment fairness relative to a specific purpose or use. Yet, models that provide guidance for integrating fairness evidence, like the one proposed by Messick (1995) for construct validity, have been lacking until now. ...
... The question of a measurement instrument's validity — that is, whether an instrument actually measures the ability or trait it claims to measure — is arguably the most important and most critical question concerning the psychometric quality of a test instrument (Messick, 1988). There are various ways of obtaining evidence of an instrument's validity. ...
... Modern validity theory is considered unitary and evidence based and can be traced back to Cronbach and Meehl (1955) and Messick (1995). Our findings in respect of SWM and the PS include the classic concept of validity and are set out below. ...
Article
Full-text available
Oversubscribed social work (SW) courses and a workforce review in Northern Ireland prompted a review of admissions, to ensure recruitment of applicants with strong core values. Concerns regarding authorship, plagiarism and reliability of personal statements, and calls for values-based recruitment underpinned this research. This study evaluates psychometric properties of an SW specific personal statement (PS) and a values-based psychological screening tool, Social Work Match (SWM). Social Work students (n = 112), who commenced the 3-year undergraduate route (UGR) or the 2-year relevant graduate route (RGR) were invited to participate. Their PS scores and SWM scores permitted investigation of scoring outcomes and psychometric properties. Statistical analysis was conducted using Minitab 17. Forty-nine participants (5 male, 44 female) completed SWM on two occasions (October 2020 and January 2021). Findings provide practical, theoretical, statistical, and qualitative reasons for concluding that the PS has substantial limitations as a measure of suitability. It does not compare well with international test standards for psychometric tests. In contrast, SWM is a valid and reliable measure with good discriminatory power, standardized administration and consistent marking. SWM is a viable alternative to the PS for assessing suitability/shortlisting applicants for social work interviews.
... However, a look at press headlines, political statements, or the statements of many others (Murphy 1995, 2001; Pinto and Portelli, 1st ed. of this volume) indicates that assessment results can be used well beyond the purposes for which they were intended (see Ennis, this volume, for a discussion of purposes of assessment). The consequential validity (Messick 1988) of these assessments is thus put in question. Unfortunately, however, some users treat assessment results as definitive rather than as the tentative and fragile (Murphy et al. 1998) pieces of documentation that they are. ...
Book
This second edition of CRITICAL THINKING EDUCATION AND ASSESSMENT: Can Higher Order Thinking be Tested? contains a series of important papers from the first edition and a new Introduction by Jan Sobocan. The essays are an important read for anyone interested in the issues raised by the teaching of critical thinking and consequent attempts to test its success. They discuss attempts to use testing to ensure educational accountability, the politics of testing regimes, and the shortcomings and the strengths of standard tests used to teach and assess students, courses, programs, and the tests themselves. The ebook can serve as a useful introduction to the questions that this raises, at the same time that it provides answers to these questions from the perspective of many different trends within contemporary argumentation theory.
... Construct-related evidence of validity is defined herein as the extent to which an instrument measures the theoretical construct or trait that it is intended to measure (Allen & Yen, 2001; Campbell & Fiske, 1959; Crocker & Algina, 1986; Cronbach & Meehl, 1955; Kimberlin & Winterstein, 2008; Messick, 1995). In the validation of survey scores, construct-related evidence of validity is highly important in that construct validity involves integration of evidence entailed in the interpretation or meaning of instrument scores (AERA et al., 2014; Anastasi, 1986; Embretson, 1983; Guion, 1977; Kane, 2013; Messick, 1975, 1980, 1988, 1989, 1995). ...
... Measures of creativity reviewed include the Kaufman Domains of Creativity Scale (K-DOCS; Kaufman, 2012; McKay, Karwowski & Kaufman, 2017), the Creative Achievement Questionnaire (CAQ; Silvia, Wigert, Reiter-Palmon & Kaufman, 2012) for Big-C and Pro-C creativity, and task-based measures (Dietrich & Kanso, 2010) such as divergent thinking (DT) tasks (Silvia et al., 2008; Benedek, Mühlmann, Jauk & Neubauer, 2013), artistic and real-life creativity tasks (Kaufman & Baer, 2012; Kaufman, Gentile & Baer, 2005), and insight tasks (Sternberg & Davidson, 1995; Beaty, Nusbaum & Silvia, 2014; Fürst & Grin, 2018; Hocevar, 1976, 1979, 1981). Validity and differential item functioning (DIF) were examined following Cronbach (1988), Messick (1986), Sireci (2009), and Zumbo (1999), including purification procedures (Wiberg, 2007) and logistic regression and IRT-based DIF methods (Swaminathan & Rogers, 1990; Teresi et al., 2009; Karami, 2012; Chen & Revicki, 2014). ...
... Performance represents an aggregation of behaviours or outcomes over time, task, individuals or groups (Messick, 1988). Corvellec (1994, 1995) opined that performance refers simultaneously to the action, the result of the action, and to the success of the result compared to some benchmark. ...
Conference Paper
Full-text available
Projecting and monitoring NO2 pollutant concentrations is perhaps an efficient and effective technique to lower people's exposure, reducing the negative impact caused by this harmful atmospheric substance. However, many studies have proposed machine learning (ML) models to predict NO2 using diverse sets of data, making the efficiency of such models dependent on the data/features used. This research installed and used data from 14 Internet of Things (IoT) emission sensors, combined with weather data from the UK meteorology department and traffic data from the Department for Transport for the corresponding times and locations where the pollution sensors exist. This paper selected relevant features from the united data/feature set using the Boruta algorithm. Six of the many features were identified as valuable in the NO2 ML model development: ambient humidity, ambient pressure, ambient temperature, day of the week, two-wheeled vehicle counts, and car/taxi counts. These six features were used to develop different ML models, which were compared with the same ML models developed using all united data/features. For most ML models implemented, a performance improvement resulted from using the features selected with the Boruta algorithm.
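Boruta-style selection works by comparing each real feature's importance against shuffled "shadow" copies of the features. A simplified sketch on synthetic data (using |Pearson r| as a stand-in for the random-forest importances Boruta actually uses, and made-up features rather than the paper's sensor data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the pollution data: columns 0-1 carry signal, 2-5 are noise.
n = 500
X = rng.normal(size=(n, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def importance(X, y):
    """Stand-in importance score: |Pearson r| with the target.
    (Boruta proper uses random-forest importances.)"""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

def boruta_like_select(X, y, n_rounds=20, seed=1):
    """Confirm a feature only if it beats the best shuffled 'shadow'
    feature in every round (Boruta itself uses a binomial test)."""
    shadow_rng = np.random.default_rng(seed)
    k = X.shape[1]
    hits = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        # Shuffling each column destroys its real relationship with y.
        shadows = shadow_rng.permuted(X, axis=0)
        imp = importance(np.hstack([X, shadows]), y)
        hits += imp[:k] > imp[k:].max()
    return np.where(hits == n_rounds)[0]

selected = boruta_like_select(X, y)  # should recover the informative columns 0 and 1
```

Because the shadows are distributionally identical to the real features but carry no signal, any feature that consistently out-scores them is unlikely to be noise — the same logic the paper relies on when reducing the united feature set to six.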
... Performance represents an aggregation of behaviours or outcomes over time, task, individuals or groups (Messick, 1988). Corvellec (1994, 1995) opined that performance refers simultaneously to the action, the result of the action, and to the success of the result compared to some benchmark. ...
Conference Paper
Full-text available
There are already countless articles on strategies to limit human exposure to particulate matter (PM10) pollution because of its disastrous impact on the environment and people's well-being in the United Kingdom (UK) and around the globe. Strategies include imposing sanctions on places with higher levels of exposure, dissuading non-environmentally friendly vehicles, promoting bicycles for transportation, and encouraging the use of eco-friendly fuels in industries. All these methods are viable options but will take longer to implement. For this reason, an efficient PM10 predictive machine learning model is needed, with the most impactful features/data identified. The predictive model will offer more strategic avoidance techniques for this lethal air pollutant, in addition to all other current efforts. However, the diversity of the existing data is a challenge. This paper solves this by (1) bringing together numerous data sources into an Amazon Web Services big data platform and (2) investigating which features contribute best to building a high-performance PM10 machine learning predictive model. Examples of such data sources in this research include traffic information, pollution concentration information, geographical/built environment information, and meteorological information. Furthermore, this paper applied random forest to select the most impactful features, owing to its better performance over the decision-tree and XGBoost feature-selection methods. Among the discoveries from this research work is that the height of buildings in a geographical area plays a role in the dispersion of PM10.
... Performance represents an aggregation of behaviours or outcomes over time, task, individuals or groups (Messick, 1988). Corvellec (1994, 1995) opined that performance refers simultaneously to the action, the result of the action, and to the success of the result compared to some benchmark. ...
Conference Paper
Full-text available
Time is a critical factor and primary success metric in measuring the progress of construction projects, since they are normally time-bound. The construction industry, however, seldom completes projects on time due to its varied architecture: varying project styles, scopes, locations, and sizes, as well as the participation of several stakeholders from different disciplines. Building Information Modeling (BIM) is expected to be a valuable tool in the construction industry, as it has the ability to mitigate construction project risks and complete projects successfully. As such, a systematic review of the effects of BIM on construction project delays becomes vital. Admittedly, systematic reviews provide a valuable opportunity for academics and practitioners to apply established expertise to further action, policy, or study. Using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline, this study therefore conducts a systematic review of the effects of BIM on construction project delay. This approach yielded evidence of a positive effect of BIM on delay across multiple regions of the world and different construction project types. This systematic review, as an evidence-based methodology, will be crucial for the Architecture, Engineering and Construction (AEC) industry in enforcing the adoption of BIM for current and future projects in the sector globally. It is recommended that comprehensive systematic reviews be conducted on other pertinent issues common to the construction industry.
... Performance represents an aggregation of behaviours or outcomes over time, task, individuals or groups (Messick, 1988). Corvellec (1994, 1995) opined that performance refers simultaneously to the action, the result of the action, and to the success of the result compared to some benchmark. ...
Conference Paper
Full-text available
The perpetual occurrence of delay in the construction industry, a global phenomenon, remains a huge concern to its policy makers despite considerable mitigation efforts. Interestingly, this industry, which produces massive amounts of data daily from IoT sensors, building information modelling, and the like on most of its projects, has been slow to take advantage of contemporary analysis methods such as machine learning (ML), whose predictive capabilities, widely adopted across other sectors, can best explain the factors affecting a phenomenon like delay. In this study, therefore, a framework for comparing the performance of machine learning algorithms in predicting construction project delay was proposed. To begin, a review of the existing body of knowledge on the factors that influence construction project delays was used to survey experts in order to obtain quantitative data. The generated dataset was used to train twenty-seven machine learning algorithms to develop predictive models. Results from the algorithm evaluation metrics (accuracy, balanced accuracy, area under the Receiver Operating Characteristic curve (ROC AUC), and F1-score) identified the Perceptron as the top-performing model, having achieved an accuracy, balanced accuracy, ROC AUC, and F1-score of 85%, 85%, 0.85, and 0.85 respectively, higher than the rest of the models and unmatched in any previous study predicting construction project delay. Ultimately, this model can be integrated into a construction information system to promote evidence-based decision-making, thereby enabling constructive project risk management initiatives in the industry.
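The four evaluation metrics named in the abstract can be computed with standard library calls. The toy "delayed vs. on-time" labels and scores below are invented for illustration; they are not the study's data.

```python
# Sketch: accuracy, balanced accuracy, F1, and ROC AUC for a toy binary
# delay classifier. Labels and scores are illustrative only.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, roc_auc_score)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]                    # 1 = delayed, 0 = on time
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]                    # hard class predictions
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]   # predicted P(delay)

print("accuracy:         ", accuracy_score(y_true, y_pred))          # 0.75
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred)) # 0.75
print("f1:               ", f1_score(y_true, y_pred))                # 0.75
print("ROC AUC:          ", roc_auc_score(y_true, y_score))          # 0.9375
```

Note that ROC AUC is computed from the continuous scores rather than the thresholded predictions, which is why it can differ from the other three.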
... If different evaluators interpret a particular teaching performance in different ways, and suggest distinct professional development activities that could be equally successful in advancing teacher learning, then, according to the definition of formative assessment adopted in section 8, they would be equally valid in terms of impact on teacher learning. That is, adopting the distinction used by Samuel Messick (1988), if the summative functions of assessment are validated by their meanings, then the formative functions of assessment are validated by their consequences (Wiliam & Black, 1996). ...
Article
Full-text available
The argument of this article rests on four propositions. First, greater educational achievement is needed, for individuals and for society alike. Second, achieving greater educational achievement requires higher teacher quality. Third, higher teacher quality demands investing in the teachers already working in our schools. Fourth, that investment needs to take a radically different form from the professional development teachers have commonly received.
... itself and about the relationship between testing results and relevant outcomes (Messick, 1986; Reeves and Marbach-Ad, 2016). The following section describes ways in which the exam development process aligned with standards, ways in which it differed, and plans to collect a wider range of validity evidence in the future. ...
Article
Full-text available
With support from the American Society for Biochemistry and Molecular Biology (ASBMB), a community of biochemistry and molecular biology (BMB) scientist-educators has developed and administered an assessment instrument designed to evaluate student competence across four core concept and skill areas fundamental to BMB. The four areas encompass energy and metabolism; information storage and transfer; macromolecular structure, function, and assembly; and skills including analytical and quantitative reasoning. First offered in 2014, the exam has now been administered to nearly 4000 students in ASBMB-accredited programs at more than 70 colleges and universities. Here, we describe the development and continued maturation of the exam program, including the organic role of faculty volunteers as drivers and stewards of all facets: content and format selection, question development, and scoring.
... Cattell's (1966) scree test suggested a one-component solution, following which a forced one-component solution was computed and used to select 18 items for the final tool. Confirming the one-component solution in a second sample of humanitarian aid workers is vital to ensuring that test score interpretation of the PostAID/Q does not lead to unintended consequences from incorrectly applied treatments (Messick 1988). A further validation study conducted by McCormack et al. (2016) determined that the PostAID/Q added an additional 10% of the variation over the General Health Questionnaire-12 (GHQ-12: Goldberg and Williams 1988) in predicting the subjective response to traumatic events as measured by the Impact of Events Scale - Revised (IESR: Weiss and Marmar 1997). ...
Article
Full-text available
Objective Humanitarian-specific psychological distress following deployment can elude detection using contemporary measures of trauma-related stress. This study assesses the unidimensional structure and convergent validity of the Post-deployment Altruistic Identity Disruption Questionnaire (PostAID/Q), an 18-item questionnaire underpinned by the construct Altruistic Identity/Disruption (AI/AID). Method Humanitarian aid personnel (N = 108) completed an online web survey, inclusive of the Moral Injury Questionnaire (MIQ), Posttraumatic Stress Disorder Checklist (PCL-5), Psychological Well-Being Posttraumatic Changes Questionnaire (PWB-PTCQ) and Social Provisions Scale (SPS). Results A confirmatory factor analysis suggested a single-factor structure, providing further support for the conception of AI/AID as a unidimensional construct. Convergent validity was demonstrated through (1) utility for predicting a posttraumatic stress disorder (PTSD) diagnosis assessed by the PCL-5, and (2) moral injury assessed by the MIQ. The PostAID/Q was further moderately and negatively associated with the availability of social support (assessed by the SPS) and lower self-reports of psychological well-being post trauma (assessed by the PWB-PTCQ). Finally, the PostAID/Q demonstrated evidence of incremental validity in predicting humanitarian-specific psychological distress over and above the PCL-5. Specifically, the PostAID/Q predicted increased moral injury on the MIQ, and decreased psychological well-being post trauma. Conclusions The PostAID/Q can assist in identifying humanitarian-specific psychological responses post deployment, guiding support for personnel over and above more traditional measures of posttraumatic stress.
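The scree-test-then-forced-one-component logic mentioned in the citing passage can be sketched as follows: inspect the component eigenvalues, then fit a single component and check how much variance it explains. The ten synthetic "items" driven by one latent factor are an assumption for illustration, not the PostAID/Q sample.

```python
# Sketch: scree-style eigenvalue check and a forced one-component solution
# on synthetic questionnaire data driven by a single latent factor.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))                 # one underlying factor
loadings = rng.uniform(0.5, 0.9, size=(1, 10))     # ten questionnaire items
items = latent @ loadings + rng.normal(scale=0.4, size=(300, 10))

pca = PCA().fit(items)
print("eigenvalues:", np.round(pca.explained_variance_, 2))  # sharp drop after 1st

one = PCA(n_components=1).fit(items)
share = one.explained_variance_ratio_[0]
print(f"first component explains {share:.0%} of the variance")
```

A single dominant eigenvalue followed by a flat tail is exactly the pattern a scree test reads as a one-component solution.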
... the whole story, for many this approach was initiated by the work of one of the founders of the orthodoxy, Messick (1980, 1981, 1988, 1989). ...
... The importance attached to a proficiency test or to translating sample texts can be another indicator of the weight given to translation competence over merely holding a university degree in translation, which in practice seems justifiable. The content of the texts to be translated indicates that the publishers have a fair, although informal, idea about the content, construct and consequential validity of the test (Messick, 1987a). The great majority of the publishers (84%) agreed with the idea that translators with work experience are eligible to be recruited even if they do not have an employment history. ...
Article
Full-text available
This study aimed to investigate the criteria considered by Iranian publishers for admitting translators. In order to achieve this purpose, the qualifications developed by Samuelson-Brown were employed to design a 19-item Likert scale questionnaire on a continuum from 'strongly agree' to 'strongly disagree'. Some of the qualifications mentioned in the questionnaire were academic degree, translation competence, work experience, proficiency test, translators' sense of responsibility for their task, and the direct relationship between the publishers and the translators. Next, it was handed out to 140 Iranian publishers, randomly selected as participants from different parts of Iran, for data collection. The collected data were analysed both qualitatively and quantitatively using the chi-squared test to see whether there were any significant relationships between each item and the participants' attitudes. The qualitative results showed that almost all participants agreed with considering most qualifications as their own criteria for recruiting translators, of course with some variation in their opinions on the significance of the variables of interest. However, the chi-squared test showed a significant relationship for only four variables: inclusion of a proficiency test, computer literacy, the translators' relationship with the publisher, and gender. The results can have practical implications for translation course developers, students and teachers.
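A chi-squared test of the kind this abstract reports can be run in a few lines. The goodness-of-fit form below, with invented response counts for a 5-point Likert item (not the study's data), asks whether publishers' responses depart from an even spread across options.

```python
# Sketch: goodness-of-fit chi-squared test on invented Likert response counts.
from scipy.stats import chisquare

# Observed counts across a 5-point item (strongly agree ... strongly disagree).
observed = [62, 41, 17, 12, 8]           # hypothetical n = 140 publishers
result = chisquare(observed)             # null: all five options equally likely
print(f"chi2 = {result.statistic:.2f}, p = {result.pvalue:.2e}")
```

With these counts the statistic is large and the null of equal preference is rejected; testing a relationship between an item and a grouping variable (e.g. gender) would instead use `scipy.stats.chi2_contingency` on a cross-tabulation.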
... Given the students' responses to the relevant questions, we then attempt to validate these aforementioned constructs: in other words, to check whether they exist as 'measures' (or scales), and if not, whether there are other relevant and useful dimensions. Our validation process thus refers to the accumulation of evidence to support validity arguments regarding the students' reported measures (Messick 1988). We employed a psychometric analysis for this purpose, conducted within the Rasch measurement framework, following relevant guidelines (Smith Jr. 2007a, 2007b), and using the Rasch rating scale model in particular, which is the most appropriate model for Likert-type items, as in this project. ...
Article
Full-text available
In this paper, we approach modelling mathematics dispositions from a different methodological perspective in order to shed more light into the complex interplay between teaching practices and students’ learning outcomes. We draw on survey data from around 5000 students from Year 7–11 (age 11–16) from 40 Secondary schools in England. Our methodological approach includes Rasch modelling to produce measures of attitudinal outcomes as well as students’ perceptions of pedagogic practices. We then employ fuzzy-set Qualitative Comparative Analysis (fs/QCA) to explore the relationships between students’ characteristics and the perceived type of teaching they receive in mathematics. We use two measures of ‘transmissionist teaching’ which aim to quantify the degree to which teaching practices are perceived as ‘teacher-centred’. One measure gives the students’ perceptions and the other gives the teacher’s perspective. We find that different configurations of student and teacher perceptions of transmissionist teaching are associated with high and low mathematical dispositions for different year groups and for boys and girls. We discuss the methodological merits of this approach along with the substantive educational implications of these findings.
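The Rasch framework named above rests on a simple logistic relationship between a respondent's ability and an item's difficulty. The study uses the polytomous rating scale extension for Likert items; the minimal dichotomous form below shows only the core relationship and is a sketch, not the authors' model.

```python
# Minimal sketch of the dichotomous Rasch model: probability that a
# respondent with ability theta endorses an item of difficulty b.
import math

def rasch_p(theta: float, b: float) -> float:
    """P(endorse) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, endorsement probability is exactly 0.5.
print(rasch_p(0.0, 0.0))   # 0.5
```

Fitting such a model to response data (estimating every theta and b jointly) is what dedicated Rasch software does; the measures it produces are the interval-scaled scores the paper analyses.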
... 18-19). This also underscores the major importance of predictive validity in designing tests for selection purposes (Messick, 1986). In particular, as stated by Brown and Abeywickrama (2010), predictive validity plays a central role in the use of large-scale standardized tests, such as IELTS, as gatekeepers. ...
Article
Full-text available
The International English Language Testing System (IELTS) has become one of the most widely used measurements of English proficiency in the world for academic, professional and migration purposes. For universities in particular, it is expected that applicants’ IELTS scores closely reflect their actual ability in communicating and doing their assignments in English. This study examines the authenticity and predictive validity of the writing section in the IELTS Academic Module by reviewing relevant research on IELTS within the last two decades. In general, those studies have provided evidence that the IELTS writing test suffers from low authenticity and predictive validity, and is thus an inaccurate predictor of a candidate’s performance in writing real-life academic tasks.
Chapter
Applied linguistic designs are too seldom acknowledged for being inspired by care and concern for the language needs of the vulnerable. Yet in them love and compassion, rather than self-interest and malice, are easily identifiable as motivations. Normatively, applied linguistic interventions aiming to alleviate pernicious language difficulties are carried out with diligence and care, with integrity and a genuine concern for the language improvement of others: as a professional duty to be undertaken with respect for humanity, and vigilance when the language rights of especially the voiceless are being threatened. Factual designs should embody the ethical convictions of applied linguists.
Chapter
The notion of validity encapsulates the echo of the physical within the technical. The technical force of a language course, test or plan needs to be evaluated for its effects. On the norm side, this yields a design principle that asks whether the design is adequate, and can be validated. That kind of technical validation is perhaps most prominent in language assessment. This quality enhancing process can productively be extended to the other interventions. On the factual side, the operation of subjective technical processes which produce technical objects come into focus, as well as factual technical changes caused by interventions.
Chapter
The interpretability of technical processes depends conceptually on lingual analogies. In analyzing the meaningfulness of applied linguistic designs, this chapter examines how technical design anticipates the expressive dimension of experience. Designs are articulated in the form of a blueprint for each language solution. The specifications in the blueprint are lingual anticipations functioning on the norm side of the technical aspect. On the factual side, we encounter the technical recording of the design processes of applied linguistic artefacts. Technical subjects interpret designs; the objective designs are meaningless without being interpretable. The lingual anticipations yield a first set of regulative technical ideas.
Article
In the mid-1990s nations in the Organization for Economic Cooperation and Development (OECD) conducted the first International Adult Literacy Survey (IALS). The IALS used two different methods for assessing adult literacy. One method used performance scales to measure prose, document, and quantitative literacy. The second method measured perceived abilities by having adults rate the extent to which their literacy and numeracy skills met their work and daily life requirements for these skills. This paper reviews evidence that challenges the validity of the IALS standardized performance scales, including the construct validity of the measurement scales (the question of just what it is that the IALS scales measure), the standards validity (the question of how good is good enough to be considered competent at whatever the scales measure), and the use validity (the extent to which the findings are useful for various purposes and do not produce social harm). The author concludes that in future assessments more attention should be given to the use of self-perceptions of skills so those who believe they are in need of additional literacy development can be identified and provided with information about educational opportunities.
Article
In the United States, many actors are pushing for the use of grade point average (GPA) as the main placement tool for gatekeeper math and English courses for community college students (Quarles, 2022; Scott-Clayton, 2018; Turk, 2017). One community college system (pseudonymously, SXCC) in a New England state has begun placing students in initial math and English classes based on self-reported GPA. There have been studies on the effects of placement changes of this type (Belfield & Crosta, 2012; Hodara & Cox, 2016; Ngo & Kwon, 2014; Scott-Clayton, 2012). However, studies have not included the effects of these changes on multilingual learners (MLLs). Using a census of every MLL placed in SXCC in the summer and fall of 2020 and the spring of 2021 (N = 12,603), a MANOVA found that MLL students in the SXCC system who were placed using previous placement methods had a higher overall GPA than students placed using self-reported GPA (M = 3.32, SD = 0.740; M = 2.01, SD = 1.27, respectively), had higher satisfactory academic progress (SAP) (M = 102.98, SD = 51.52; M = 57.66, SD = 55.53, respectively), and took longer to enroll in English 101 (M = 5.11, SD = 3.55; M = 2.36, SD = 1.76, respectively).
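The study's comparison is a MANOVA across several outcomes; as a deliberately simpler stand-in, the sketch below runs a univariate one-way ANOVA (`scipy.stats.f_oneway`) on a single outcome, GPA, between two synthetic placement groups whose means and spreads loosely echo the reported descriptives. The data are simulated for illustration only, not the study's census.

```python
# Sketch: univariate one-way ANOVA as a simplified stand-in for the MANOVA,
# comparing GPA between two synthetic placement groups.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
gpa_prior_placement = rng.normal(loc=3.32, scale=0.74, size=200)
gpa_self_reported = rng.normal(loc=2.01, scale=1.27, size=200)

stat, p = f_oneway(gpa_prior_placement, gpa_self_reported)
print(f"F = {stat:.1f}, p = {p:.2e}")
```

A full MANOVA would test all three outcomes (GPA, SAP, time to English 101) jointly, guarding the family-wise error rate that separate univariate tests do not.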
Chapter
Health Measurement Scales is the ultimate online guide to developing and validating measurement scales that are to be used in the health sciences. It covers how the individual items are developed; various biases that can affect responses (e.g. social desirability, yea-saying, framing); various response options; how to select the best items in the set; how to combine them into a scale; and finally how to determine the reliability and validity of the scale. It concludes with a discussion of ethical issues that may be encountered, and guidelines for reporting the results of the scale development process. Appendices include a comprehensive guide to finding existing scales, and a brief introduction to exploratory and confirmatory factor analysis.
Chapter
Full-text available
Applied linguistics is a discipline of design. As applied linguists, we first conceive, design, and then implement applied linguistic interventions that can provide solutions to specific language problems. In this chapter, the focus is on one such artifact designed to intervene in a language problem, namely a language test, and how it can be used productively in assessing language proficiency so that other or further interventions (such as a language course for academic purposes or a language plan for a specific environment) can then be conceived, designed, and implemented. Robert Lado, a pioneer in language testing, contributed to the idea that language testing is no longer considered as just part of the normal work of any language teacher, but may today be regarded as a research area in its own right. Overall, it is not separate from matters such as a language course or language policy. The interventions discussed here – language courses, language plans, and language tests – are interventions in the subfields of language development, language policy, and language proficiency measurement, respectively, all sub-disciplines in applied linguistics. Language testing provides information about language proficiency, as well as diagnostic information that potentially facilitates language development. Language tests can also be used to determine the effectiveness of a language policy or development program. South Africa has much to learn in this regard, especially at the school level.
Article
Full-text available
ABSTRACT: In line with the journal volume's theme, this essay considers lessons from the past and visions for the future of test validity. In the first part of the essay, a description of historical trends in test validity since the early 1900s leads to the natural question of whether the discipline has progressed in its definition and description of test validity. There is no single agreed-upon definition of test validity; however, there is a marked coalescing of explanation-centered views at the meta-level. The second part of the essay focuses on the author's development of an explanation-focused view of validity theory with aligned validation methods. The confluence of ideas that motivated and influenced this work yields a coherent view of test validity as the explanation for test score variation, and of validation as the process of developing and testing that explanation, guided by abductive methods and inference to the best explanation. This description also includes a new re-interpretation of true scores in classical test theory afforded by the author's measure-theoretic mental test theory development: for a particular test-taker, the variation in observed scores includes measurement error and variation attributable to the different ecological testing settings, which aligns with the explanation-focused view wherein item and test performance are the objects of explanatory analyses. The final main section of the essay describes several methodological innovations in explanation-focused validity that respond to the tensions and changes in assessment over the last 25 years.
Book
Full-text available
This innovative, timely text introduces the theory and research of critical approaches to language assessment, foregrounding ethical and socially contextualized concerns in language testing and language test validation in today's globalized world. The editors bring together diverse perspectives, qualitative and quantitative methodologies, and empirical work on this subject that speak to concerns about social justice and equity in language education, from languages and contexts around the world – offering an overview of key concepts and theoretical issues and field-advancing suggestions for research projects. This book offers a fresh perspective on language testing that will be an invaluable resource for advanced students and researchers of applied linguistics, sociolinguistics, language policy, education, and related fields – as well as language program administrators. DOI: https://doi.org/10.4324/9781003384922 Available: https://www.taylorfrancis.com/books/edit/10.4324/9781003384922/ethics-context-second-language-testing-rafael-salaberry-wei-li-hsu-albert-weideman
Article
Background: General emergency physicians provide most pediatric emergency care in the United States yet report more challenges managing emergencies in children than adults. Recommendations for standardized pediatric emergency medicine (PEM) curricula to address educational gaps due to variations in pediatric exposure during emergency medicine (EM) training lack learner input. This study surveyed senior EM residents and recent graduates about their perceived preparedness to manage pediatric emergencies to better inform PEM curricula design. Methods: In 2021, senior EM residents and graduates from the classes of 2020 and 2019 across eight EM programs with PEM rotations at the same children's hospital were recruited and surveyed electronically to assess perceived preparedness for 42 pediatric emergencies and procedures by age: infants under 1 year, toddlers, and children over 4 years. Preparedness was reported on a 5-point Likert scale with 1 or 2 defined as "unprepared." A chi-square test of independence compared the proportion of respondents unprepared to manage each condition across age groups, and a p-value < 0.05 demonstrated significance. Results: The response rate was 53% (129/242), with a higher response rate from senior residents (65%). Respondents reported feeling unprepared to manage more emergency conditions in infants compared to other age groups. Respondents felt least prepared to manage inborn errors of metabolism and congenital heart disease, with 45%-68% unprepared for these conditions across ages. A heat map compared senior residents to recent graduates. More graduates reported feeling unprepared for major trauma, impending respiratory failure, and pediatric advanced life support algorithms. Conclusions: This study, describing the perspective of EM senior residents and recent graduates, offers unique insights into PEM curricular needs during EM training. 
Future PEM curricula should target infant complaints and conditions with lower preparedness scores across ages. Other centers training EM residents could use our findings and methods to bolster PEM curricula.
Chapter
Full-text available
This chapter conveys the following learning objectives: Know what is meant by scientific data collection and how it differs from non-scientific data gathering. Be able to characterize and apply qualitative and quantitative observation methods. Be able to explain and use qualitative and quantitative interview techniques. Be able to differentiate qualitative and quantitative self-administered questionnaire methods and to develop corresponding questionnaires. Be able to distinguish different kinds of projective and psychometric psychological tests, and know what to consider in test use and test development. Be able to describe the foundations of important physiological measurement procedures for different organ systems (e.g., brain activity, cardiovascular activity, electrodermal activity) and to assess their informative value. Be able to distinguish different forms of qualitative and quantitative document analysis, and in particular to describe the procedure for a qualitative and a quantitative content analysis of existing documents. Be able to weigh the particularities, advantages, and disadvantages of the various data collection methods and to select the data collection method(s) appropriate for a concrete research problem.
Article
Full-text available
Questionnaires have been widely used in second language (L2) research. To examine the accuracy and trustworthiness of research that uses questionnaires, it is necessary to examine the validity of questionnaires before drawing conclusions or conducting further analysis based on the data collected. To determine the validity of questionnaires that have been investigated in previous L2 research, we adopted the argument-based validation framework to conduct a systematic review. Due to the extensive nature of the extant questionnaire-based research, only the most recent literature, that is, research in 2020, was included in this review. A total of 118 questionnaire-based L2 studies published in 2020 were identified, coded, and analyzed. The findings showed that the validity of the questionnaires in the studies was not satisfactory. In terms of the validity inferences for the questionnaires, we found that (1) the evaluation inference was not supported by psychometric evidence in 41.52% of the studies; (2) the generalization inference was not supported by statistical evidence in 44.07% of the studies; and (3) the explanation inference was not supported by any evidence in 65.25% of the studies, indicating the need for more rigorous validation procedures for questionnaire development and use in future research. We provide suggestions for the validation of questionnaires.
Article
In the past decades, implicature has been recognized as an indispensable topic in L2 pragmatics research. While various instruments have been used to test implicature knowledge, scant research has verified the construct validity of implicature items by examining test takers' cognitive processes while responding to them. In response to this research gap, the implicature test of Roever (2005, Testing ESL pragmatics: Development and validation of a web-based assessment battery. Frankfurt am Main, Germany: Peter Lang) was scrutinized. Twelve L2 test takers with high and low proficiency levels were asked to verbalize their thoughts retrospectively. Content analyses were employed to examine whether their cognitive processes were (ir)relevant to what the test intended to elicit and how well their processes aligned with the test results. The findings revealed that relevant processes were associated with idiosyncratic implicature questions, while irrelevant processes were related to formulaic implicature ones. The high-proficiency test takers generally reported engaging in relevant processes, whereas the low-proficiency test takers reported using, for the most part, irrelevant processes. Moreover, test takers' cognitive processes aligned with the test results on the whole. In other words, when the respondents' cognitive processes corresponded to those the test was designed to elicit, the test results were normally correct. On the other hand, when the cognitive processes deviated from the expected ones, the test results were generally incorrect. However, fourteen misalignments were identified and classified as irrelevant conceptual blending, contextual judgment, insufficient effort responding, intuition, and wild guessing. These misalignments constitute construct-irrelevant variance and pose potential threats to the construct validity of the test.
Article
The “Test of Chinese as a Heritage Language” was developed to assess the Chinese proficiency of overseas Chinese adolescents. Overseas Chinese adolescents are a special group, distinct from ordinary foreign learners of Chinese, and until now there has been no standardized Chinese proficiency test for them. Guided by the argument-based approach to validation, the overall design of the test and the test data are analyzed. The validity of the test is systematically demonstrated with qualitative and quantitative methods from many angles, and the results of the validity argument in turn feed back into the improvement of the test.
Chapter
This chapter is concerned with the importance and meaning of data validity, and the implications for educational practice. Validity requires attention to both interpretation (meaning) and use (relevance) of data as well as their consequences (including social effects). Starting with an exploration of general theories of validity, this chapter considers what is involved in establishing the validity of different forms of data—assessment data, observation data, survey data, and file data. Validity issues in relation to assessment data are given primary attention. The emergence of a unified theory of validity is discussed, followed by analysis of threats to validity for all four forms of data. Implications for assessment design are elaborated in terms of user responsibilities in data use, dealing with negative consequences of testing, determining what is to be assessed, and matching assessments to desired learning outcomes. This leads to discussion of the importance, from a validity perspective, of taking a holistic approach to assessment design.
Article
Full-text available
Questions of the adequacy of a test as a measure of the characteristic it is interpreted to assess are answerable on scientific grounds by appraising psychometric evidence, especially construct validity. Questions of the appropriateness of test use in proposed applications are answerable on ethical grounds by appraising potential social consequences of the testing. The 1st set of answers provides an evidential basis for test interpretation, and the 2nd set provides a consequential basis for test use. The present article stresses (a) the importance of construct validity for test use because it provides a rational foundation for predictiveness and relevance and (b) the importance of taking into account the value implications of test interpretations per se. By thus considering both the evidential and consequential bases of both test interpretation and test use, the roles of evidence and social values in the overall validation process are illuminated, and test validity comes to be based on ethical and evidential grounds. (2½ p ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
Notes that there is evidence that tests influence teacher and student performance and that multiple-choice tests tend not to measure the more complex cognitive abilities. The more economical multiple-choice tests have nearly driven out other testing procedures that might be used in school evaluation. It is suggested that the greater cost of tests in other formats might be justified by their value for instruction (i.e., to encourage the teaching of higher level cognitive skills and to provide practice with feedback). (56 ref)
Article
Full-text available
Expected utility losses in moving from T. A. Cleary's (1968), R. L. Thorndike's (1971), and R. B. Darlington's (1971) Definition 3 selection fairness models to the quota model were assessed on an interval scale for various combinations of validity, minority base rate, and selection ratio. Expected changes in minority selection ratios across conditions were also determined. Utility losses were shown to be large enough to be of practical significance in many commonly occurring selection situations, although when considered as a percentage of maximum value, utilities remained quite high for the Thorndike and Darlington models. Increases in minority selection ratios across models were more striking than utility losses. Because no accepted method for converting minority selection ratios to utility units exists, and in light of the fact that the legal status of all 4 models is as yet unclear, it is concluded that each personnel researcher or organization must consider the trade-off between utility and the minority selection ratio subjectively and choose the model of selection fairness most consistent with his or its values. (18 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
The acceptance of validity coefficients at face value as measures of practical significance is evaluated by examining each functional relationship between 3 indexes of predictive efficiency (r, r², and E) and 3 measures of practical significance (the increase of the criterion mean, the expected proportion "satisfactory," and the expected proportion in 10 criterion categories). The validity coefficient r is a linear function of the increase of the criterion mean and very nearly a linear function of the other 2 measures of practical significance; r² and E are related to these 3 measures in a more curvilinear manner. A table is presented that gives the proportion expected in each of 18 criterion categories as a function of r and the selection ratio. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
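The relationship this abstract describes between the validity coefficient, the selection ratio, and the expected proportion "satisfactory" can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the article: it assumes predictor and criterion scores are bivariate standard normal with correlation r (the usual model behind Taylor–Russell-style tables), and the function name `success_ratio` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def success_ratio(r, selection_ratio, base_rate, n=1_000_000):
    """Estimate the proportion of selected applicants whose criterion
    score is 'satisfactory', assuming predictor x and criterion y are
    bivariate standard normal with correlation r."""
    x = rng.standard_normal(n)                               # predictor scores
    y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)   # criterion, corr(x, y) = r
    x_cut = np.quantile(x, 1 - selection_ratio)              # select the top fraction on x
    y_cut = np.quantile(y, 1 - base_rate)                    # 'satisfactory' cutoff on y
    selected = x > x_cut
    return float(np.mean(y[selected] > y_cut))

# With r = .50 and selection of the top 30%, when half of all applicants
# would be satisfactory, the success ratio rises well above the .50 base rate.
print(round(success_ratio(0.5, 0.3, 0.5), 2))
```

Even a modest validity coefficient yields a practically meaningful gain over the base rate when the selection ratio is low, which is the point of reading r against measures of practical significance rather than against r² alone.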
Article
Full-text available
Examines implications for educational and psychological measurement of 3 ontological perspectives on the nature of mediating variables underlying consistencies in test and nontest behaviors: (a) intervening variables operationally tied to real causal entities, such as personality traits or environmental contingencies; (b) hypothetical constructs that organize and summarize behavioral consistencies but have no reality outside the theoretical system; and (c) manifestations of real entities that are understood only in terms of constructs that summarize their empirical properties in relation to a theoretical network. All 3 apply to personality traits, situational forces, and their interactions; the summary of power of constructs that led to the predominance of construct validity principles in trait measurement implies that these principles should hold with equal cogency for situational and interactive measurement. (48 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
The 3 conventionally listed aspects of validity—criterion-related, content, and construct—are examined from a dual perspective: aiding in the understanding of a construct and establishing a basis for comparison between evaluations of the validity of measurement and evaluations of the validity of a hypothesis. The unifying nature of the validity of measurement is found in the degree to which results of measurement (the numbers or scores) represent magnitudes of the intended attribute. Validity is an evaluative judgment based on a variety of considerations, including the structure of the measurement operations, the pattern of correlations with other variables, and the results of confirmatory and disconfirmatory investigations. Validity in this sense is close to the concept of construct validity but without the theoretical implications of that term; like construct validity, the evaluation cannot be expressed with a single research result. It is suggested that evaluations of the validity of hypotheses should be based on multiple considerations rather than on single coefficients. (23 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
Discusses the issue of validity in testing, with reference to new Federal equal employment regulations. Definitions of testing terms are presented and illustrated with examples from a study of packers. Definitions include criterion-related validity, construct validity, and job-relatedness. Validity is not equated with job-relatedness, and criterion-related validity is only conditional evidence of job-relatedness. The importance of considering values in hypothesis determination is stressed. Organizational climate and information content are also considered. (19 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
Discusses the concepts of construct and content validity, the requirements of convergent and discriminant evidence, norm and criterion-referenced interpretations, values in measurement and the uses of counterhypotheses, and the identification of bias. The importance of construct-referencing all measurement is noted. The need for a dialectical evaluation where a particular thesis is confronted with its antithetical elements is stressed. This approach should help uncover assumptions and ideologies implicit in many measurement and evaluation activities. (61 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Establishing the validity of educational and psychological tests entails judgments of both evidence and consequence for each of the two distinct but interrelated functions of tests – interpretation and use. The evidential basis of test interpretation is construct validity. The evidential basis of test use is also construct validity, but elaborated to determine the relevance of the construct to the applied purpose and the utility of the measure in the applied setting. The consequential basis of test interpretation entails an appraisal of the value implications of alternative interpretations, while the consequential basis of test use requires appraisal of the potential social consequences of the proposed use and of the actual consequences when used. The delineation of a consequential basis for test validity as well as an evidential basis means that standards for evaluating tests must take explicit account of social context and social purpose, and test validity comes to be based on ethical as well as evidential grounds. Implications of this unified view of test validity as an overall judgment of the adequacy and appropriateness of inferences and actions based on test scores are examined for the measurement of four constructs prevalent in current educational practice – namely, intelligence, competence, mastery, and scholastic ability. Implications are also adumbrated for the responsibilities of test users and for the ethics of test use. (41 pp.)
Article
The concept of content validity takes on special importance where invoked to justify use of a test. The term 1) refers to psychological measurement, 2) using samples of behavior, sampling both stimulus and response components, and 3) implies representativeness in sampling. Examples are given to show that content sampling may be considered a form of operationalism in defining constructs. Five conditions are proposed as necessary if one is to accept the use of a measuring instrument as a valid operational definition on the basis of content sampling alone.
Article
Argues for the importance of construct validity in test use by stressing its role in providing a rational foundation for predictive validity. Questions of the adequacy of a test as a measure of the characteristics it is interpreted to assess are answerable on scientific grounds by appraising psychometric evidence, especially construct validity. Questions of the appropriateness of test use in proposed applications are answerable on ethical grounds by appraising potential social consequences of the testing. The 1st set of answers provides an evidential basis for test interpretation, and the 2nd set provides a consequential basis for test use. The following are stressed: (a) the importance of construct validity for test use because it provides a rational foundation for predictiveness and relevance and (b) the importance of taking into account the value implications of test interpretations per se. By thus considering both evidential and consequential bases of test interpretation and test use, the roles of evidence and social values in the overall validation process are illuminated, and test validity comes to be based on ethical as well as evidential grounds. (5 p ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Reviews the revised second edition of A Handbook of Child Psychology, edited by Carl Murchison (1933). Revision is a modest term--the book is really a new work. While there are only 24 papers as compared with 22 in the first edition, there is about 45 per cent more type-page area. The present volume is arranged in a much more satisfactory manner than the earlier one with five logically succeeding parts: methods, pre-natal development, post-natal development, factors modifying behavior, and studies of special groups. Eleven of the papers are on topics not considered in the first edition: pre-natal behavior in the vertebrate phyla; neonates; the maturation and patterning of behavior; locomotor and visual-manual functions; emotion; mental growth; sex differences; speech pathology; development and training of the "physiological appetites"; children with difficulties of adjustment; and adolescents. Papers on methodology, language, social behavior, environmental forces, learning, morals, philosophies, order of birth, eidetic imagery, and children with special gifts and deficiencies are by the same authors as in the first edition. The papers in this edition are of more even quality, and they are better quantitatively and qualitatively than those in the first edition. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Chapter
A goal in the study of spatial development is to answer, with regard to this domain, some of the classic questions in the history of psychology, biology, and philosophy: the extent to which adaptation rests on innately available representations or represents emergent knowledge based on interaction with the world, the uniqueness of our species and the course of comparative evolution, the interplay of symbolic and more basic spatial processes, and the nature of individual differences. In addition, spatial development is an important research area because understanding how it occurs is likely to have considerable practical relevance for devising educational curricula that support optimal acquisition of skills that underlie many essential real-world activities. This chapter covers current views of the nature of mature spatial competence, what we know about the spatial capabilities that exist in infancy, whether early spatial skills are modular or not, development of mental transformation of spatial information and the capacity for symbolic representation of spatial information, development and causes of individual differences, including gender differences, and cases in which children have abnormal bases for learning based on genetic differences (Williams Syndrome) or abnormal environmental contexts based on sensory limitations (impaired vision). Keywords: map use; mental rotation; navigation; spatial cognition; spatial development; spatial language
Article
Revision of the 1966 Standards for educational and psychological tests and manuals.
Article
The psychologist should lead the way in finding good criterion measures rather than construct imperfect tests which measure those attributes that can be appraised more accurately and validly by judgment of the non-scientific expert. Emphasis should be placed on intrinsic content validity which can be measured by (1) use of factor analysis variables, (2) use of validity coefficients, (3) comprehensive factors study of criterion and predictor variables, (4) use of pretraining and post-training tests. 31 references.
The social consequences of educational testing
  • R L Ebel
Ebel, R. L. (1964). The social consequences of educational testing. Proceedings of the 1963 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service.
Toward reform of program evaluation
  • L J Cronbach
  • Associates
Cronbach, L. J., & Associates. (1980). Toward reform of program evaluation. San Francisco: Jossey-Bass.
Handbook of methods for detecting test bias
  • L A Shepard
Shepard, L. A. (1982). Definitions of bias. In R. A. Berk (Ed.), Handbook of methods for detecting test bias. Baltimore, MD: Johns Hopkins University Press.
Recruiting, selection, and job placement
  • R M Guion
Guion, R. M. (1976). Recruiting, selection, and job placement. In M. D.
Validation of educational measures. Proceedings of the 1969 Invitational Conference on Testing Problems: Toward a theory of achievement measurement
  • L J Cronbach
Cronbach, L. J. (1969). Validation of educational measures. Proceedings of the 1969 Invitational Conference on Testing Problems: Toward a theory of achievement measurement. Princeton, NJ: Educational Testing Service.
Validity on parole: How can we go straight? New directions for testing and measurement: Measuring achievement over a decade. Proceedings of the 1979 ETS Invitational Conference
  • L J Cronbach
Cronbach, L. J. (1980). Validity on parole: How can we go straight? New directions for testing and measurement: Measuring achievement over a decade. Proceedings of the 1979 ETS Invitational Conference. San Francisco: Jossey-Bass.
Educational measurement
  • E. E. Cureton
Validity
  • S Messick
Messick, S. Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.).