Article

A Framework for Conceptualizing and Evaluating the Validity of Instructionally Relevant Assessments

Authors: James W. Pellegrino, Louis V. DiBello, Susan R. Goldman

Abstract

Assessments that function close to classroom teaching and learning can play a powerful role in fostering academic achievement. Unfortunately, however, relatively little attention has been given to discussion of the design and validation of such assessments. The present article presents a framework for conceptualizing and organizing the multiple components of validity applicable to assessments intended for use in the classroom to support ongoing processes of teaching and learning. The conceptual framework builds on existing validity concepts and focuses attention on three components: cognitive validity, instructional validity, and inferential validity. The goal in presenting the framework is to clarify the concept of validity, including key components of the interpretive argument, while considering the types and forms of evidence needed to construct a validity argument for classroom assessments. The framework's utility is illustrated by presenting an application to the analysis of the validity of assessments embedded within an elementary mathematics curriculum.


... For this purpose, we implemented two transformer-based and five feature-based machine learning models, as well as several baselines. The responses used to train and evaluate these models were taken from formative assessments within synchronous online units designed under the paradigms of project-based pedagogy (Krajcik & Shin, 2014) and evidence-centered design (Mislevy et al., 2003;Mislevy & Haertel, 2007;Pellegrino et al., 2016). The dataset we collected includes the responses of 305 German secondary school students from Schleswig-Holstein (school forms Gemeinschaftsschule and Gymnasium) to a set of 38 different constructed response items. ...
... We used a procedure grounded in evidence-centered design (ECD) (Mislevy et al., 2003;Mislevy & Haertel, 2007;Pellegrino et al., 2016) to develop rubrics for coding the core ideas within responses. It is based on existing research on students' learning about energy to formulate a competency model (e.g., Herrmann-Abell & DeBoer, 2018; Neumann et al., 2013). ...
Article
Full-text available
Lay Description

What is already known about this topic? Formative assessments are needed to test and monitor the development of learners' knowledge throughout a unit to provide them with appropriate automated feedback. Constructed response items, which require learners to formulate their own free-text responses, are well suited for testing their active knowledge. Assessing constructed responses in an automated fashion is a widely researched topic, but the problem is far from solved and most of the work focuses on predicting holistic scores or grades. To allow for a more fine-grained and analytic assessment of learners' knowledge, systems which go beyond predicting simple grades are required. To guarantee that models are stable and make their predictions for the correct reasons, methods for explaining the models are required.

What this paper adds? A core topic in physics education is the concept of energy. We implement and evaluate multiple systems based on natural language processing technology for assessing learners' conceptual knowledge about energy physics using transformer language models as well as feature-based approaches. The systems assess students' knowledge about various forms of energy, indicators for the same, and the transformation of energy from one form into another. As our systems are based on machine learning methodology, we introduce a novel German short answer dataset for training them to detect the respective knowledge elements within students' free-text responses. We evaluate the performance of these systems using this dataset as well as the well-established SciEntsBank 3-Way dataset and manage to achieve, to the best of our knowledge, new state-of-the-art results for the latter. Moreover, we apply methodology for explaining model predictions to assess whether predictions are carried out for the correct reasons.

Implications for practice and/or policy: It is indeed possible to assess constructed responses for the demonstrated knowledge about energy physics in an analytic fashion using natural language processing. Transformer language models can outperform more specialized feature-based approaches for this task in terms of predictive and descriptive accuracy. Co-occurrences of different concepts within the same responses can lead models to learn undesired shortcuts which make them unstable.
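To make the analytic scoring approach described above more concrete, here is a minimal sketch of how a transformer language model might be fine-tuned to detect a single rubric-coded knowledge element in constructed responses. It is not the authors' implementation: the model checkpoint, label names, and toy responses are assumptions chosen for illustration, and a real system would train per-concept classifiers on the full German short answer dataset described in the abstract.

```python
# Minimal sketch (not the authors' implementation) of fine-tuning a German
# transformer to detect one rubric-coded knowledge element ("names an energy
# form") in free-text responses. Checkpoint, labels, and data are placeholders.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-german-cased"  # assumed checkpoint; any German encoder works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

class ResponseDataset(torch.utils.data.Dataset):
    """Pairs student responses with a binary code for one knowledge element."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

train_ds = ResponseDataset(
    ["Die Lampe wandelt elektrische Energie in Lichtenergie um.",  # names energy forms
     "Der Stromkreis ist geschlossen, deshalb leuchtet die Lampe."],  # no energy form named
    [1, 0],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_ds,
)
trainer.train()  # a real run would also pass an eval set and report per-concept F1
```

In the study above, feature-based baselines and explanation methods would sit alongside such a model; this sketch covers only the supervised fine-tuning step for one knowledge element.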
... Construct validity concerns how well what is actually assessed corresponds with what is intended to be assessed. Consequential validity refers to the use of assessment and its consequences for teaching and learning (Pellegrino et al., 2016). In the literature, there are also other aspects (types) of validity, such as triangulation, face validity, catalytic validity and rhizomatic validity (Le Grange & Beets, 2005). ...
... It is of utmost importance that prioritization is based on relevant criteria (Divjak et al., 2021). This is related to several aspects of validity (Pellegrino et al., 2016). First, it supports content validity, as linking assessment with LOs is crucial in making sure that it "represents critical aspects of the domain being assessed". ...
Article
To ensure the validity of an assessment programme, it is essential to align it with the intended learning outcomes (LO). We present a model for ensuring assessment validity which supports this constructive alignment and uses learning analytics (LA). The model is based on LA that include a comparison between ideal LO weights (expressing the prioritization of LOs), actual assessment weights (maximum assessment points per LO), and student assessment results (actually obtained assessment points per LO), as well as clustering and trace data analysis. These analytics are part of a continuous improvement cycle, including strategic planning and learning design (LD) supported by LO prioritization, and monitoring and evaluation supported by LA. To illustrate and test the model, we conducted a study on the example of a graduate-level higher education course in applied mathematics, by analysing student assessment results and activity in a learning management system. The study showed that the analyses provided valuable insights with practical implications for the development of sound LD, tailored educational interventions, databases of assessment tasks, recommendation systems, and self-regulated learning. Future research should investigate the possibilities for automation of such LA, to enable full exploitation of their potential and use in everyday teaching and learning.

Practitioner notes

What is already known about this topic: To develop sound, student-centred learning design (LD), it is essential to ensure that assessment is constructively aligned with the intended learning outcomes (LO). This constructive alignment is crucial for ensuring the validity of an assessment program. Learning analytics (LA) can provide insights that help develop valid assessment programs.

What this paper adds: As not all LOs are equally important, assessment programs should reflect the prioritization of LOs, which can be determined by using various multi-criteria decision-making (MCDM) methods. This article presents and illustrates, based on an empirical case, a model of continuous improvement of LD, which uses LA to compare how LOs are reflected in (actual) students' results, in an (actual) assessment program, and in the (ideal) prioritization of LOs based on MCDM. The study presents how clustering of students based on their assessment results can be used in LA to provide insights for educational interventions better targeted to students' needs.

Implications for practice and/or policy: The proposed LA can provide important insights for the development (or improvement) of LD in line with the intended course LOs, but also study program LOs (if course and study program LOs are properly aligned). The LA can also contribute to the development of databases of assessment tasks aligned with course LOs, with ensured validity, supporting sharing and reusing, as well as to the development of tailored educational interventions (eg, based on clustering). The proposed LA can also contribute to the development of recommendation systems, with recommendations for the improvement of LD for teachers or learning suggestions for students, as well as students' meta-cognition and self-regulated learning.
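As a rough illustration of the analytics described in this abstract (not the authors' implementation), the sketch below compares ideal LO weights, actual assessment weights, and the weights implied by student results, and then clusters students on their per-LO score profiles; the learning outcomes, point values, and cluster count are invented placeholders.

```python
# Illustrative sketch of the LO-weight comparison and student clustering described
# above; data, learning-outcome labels, and cluster count are invented placeholders.
import numpy as np
from sklearn.cluster import KMeans

learning_outcomes = ["LO1", "LO2", "LO3", "LO4"]

ideal_weights = np.array([0.40, 0.30, 0.20, 0.10])  # prioritization of LOs (e.g., from MCDM)
max_points = np.array([30, 30, 25, 15])             # maximum assessment points per LO
actual_weights = max_points / max_points.sum()       # actual assessment weights

# Rows: students; columns: points actually obtained per LO (invented).
student_points = np.array([
    [28, 20, 10, 12],
    [15, 25, 22,  5],
    [30, 28, 24, 14],
])
result_weights = student_points.sum(axis=0) / student_points.sum()

for lo, ideal, actual, result in zip(learning_outcomes, ideal_weights,
                                     actual_weights, result_weights):
    print(f"{lo}: ideal={ideal:.2f} assessment={actual:.2f} results={result:.2f}")

# Cluster students on their per-LO mastery profiles to target interventions.
profiles = student_points / max_points               # share of available points per LO
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
print("cluster assignments:", clusters)
```

Clusters of students with similar mastery profiles could then be matched to targeted interventions, in line with the practitioner notes above.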
... Mulder, 2017;Winterton, 2009) and the assumption that competencies are domain-specific (Blömeke, Gustafsson, & Shavelson, 2015;Hartig, 2008), it is important to identify and analyze situations (tasks and challenges) sustainable entrepreneurs are confronted with in order to relate them to necessary individual dispositions and skills for mastering such situations (Mulder, 2017). By this criterion-based formulation of curricular goals, promising authentic performance-based instructional learning environments and corresponding assessments can be designed for successful workplace learning (Achtenhagen, 2012;Pellegrino, DiBello & Goldman, 2016). ...
... These more detailed insights provide fruitful hints for designing appropriate curricula prompting performance-based workplace learning environments and deeper learning processes in order to raise awareness, knowledge, and skills for a competent sustainable entrepreneurial behavior, but also for creating corresponding authentic performance assessments by which learners can show their acquired sustainable behavior. In our presentation, we will give an example according to Shavelson & Kurpius (2012), Davey et al. (2015), Pellegrino et al. (2016), ...
... Our approach to assessment validity is aligned with the approach for evaluating 'instructionally relevant assessments' described by Pellegrino et al. (2016). They emphasise collecting multiple forms of evidence to evaluate how well the assessment is measuring its desired goal. ...
... In our assessments, the skills are defined in terms of the expert decisions and evidence of whether students have mastered making the decision is gathered by comparing their performance to the expert consensus. We use the information collected during the pilot testing with experts and students described above to evaluate the cognitive and inferential validity of the assessment (Pellegrino et al., 2016). We observe how well the assessment captures the expert problem-solving process and how well the students' responses reflect their process. ...
Article
Full-text available
The ability to solve authentic real-world problems in science, engineering, and medicine is an important goal in post-secondary education. Despite extensive research on problem solving and expertise, the teaching and assessing of advanced problem-solving skills in post-secondary students remains a challenge. We present a template for creating assessments of advanced problem-solving skills that is applicable across science, engineering, and medical disciplines. It is based on a cognitive model of the problem-solving process that is empirically grounded in the study of skilled practitioners (‘experts’) solving authentic problems in their disciplines. These assessments have three key features that overcome shortcomings of current assessment approaches: 1) a more authentic amount and sequence of information provided, 2) opportunities for students to make decisions, and 3) scoring based on comparison with skilled practitioners. This provides a more complete and accurate assessment of authentic problem-solving skills than currently exists for science, engineering, and medicine. Such assessments will be valuable to instructors and curriculum designers for evaluating and improving the teaching of problem solving in these disciplines. We provide examples of problem-solving assessments that illustrate the use of this template in several disciplines.
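The third design feature above, scoring by comparison with skilled practitioners, can be illustrated with a toy example (not taken from the article): each decision a student makes is checked against the expert-consensus choice for that step, and the share of matching decisions forms a simple score. The decision names and the scoring rule below are assumptions for illustration only.

```python
# Toy sketch of "scoring by comparison with skilled practitioners": a student's
# decisions at each step are compared against the expert-consensus choice.
# Step names, choices, and the scoring rule are invented for illustration.
expert_consensus = {
    "order_test": "chest_xray",
    "interpret_result": "consolidation_present",
    "next_action": "start_antibiotics",
}

student_decisions = {
    "order_test": "chest_xray",
    "interpret_result": "normal",
    "next_action": "start_antibiotics",
}

matches = [student_decisions.get(step) == choice
           for step, choice in expert_consensus.items()]
score = sum(matches) / len(expert_consensus)
print(f"agreement with expert consensus: {score:.2f}")
```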
... The literature also suggests that contextual factors influence how students respond to feedback, particularly the culture of assessment and learning (e.g., Shepard, 2000;Hattie and Timperley, 2007;Andrade, 2010;Havnes et al., 2012;Robinson et al., 2013;Lipnevich et al., 2016;Tulis et al., 2016;Winstone et al., 2017), the task (Butler and Winne, 1995;Hattie and Timperley, 2007;Andrade, 2013;Lipnevich et al., 2016;Pellegrino et al., 2016;Leighton, 2019), and the purpose and source of feedback (Panadero and Lipnevich, 2021). Lui and Andrade (2022) offer an in-depth review of the literature that informed the model's design. ...
... As shown in Figure 1, the external learning and assessment context, which includes the assessment and feedback culture, the task and the expectations regarding its quality, and feedback source and purpose, influences how assessment information and feedback are processed and used (e.g., Butler and Winne, 1995;Kluger and DeNisi, 1996;Shepard, 2000;Hattie and Timperley, 2007;Havnes et al., 2012;Andrade, 2013;Robinson et al., 2013;Harris et al., 2014;Lipnevich et al., 2016;Pellegrino et al., 2016;Tulis et al., 2016;Winstone et al., 2017). The internal mechanisms of feedback processing in Figure 1 include inputs [external feedback (A)] and outputs [behavioral response (F) and academic achievement (G)]. ...
Article
Full-text available
Attempts to explain inconsistencies in findings on the effects of formative assessment and feedback have led us to study the next black box: how students interpret and subsequently use formative feedback from an external source. In this empirical study, we explore how students understand and process formative feedback and how they plan to use this information to inform next steps. We present findings from a study that examined students’ affective and cognitive responses to feedback, operationalized as emotions, interpretations (i.e., judgments, meaning making, attributions), and decision-making. Relationships among these processes and students’ initial motivational states were also explored. Survey data were collected from 93 students of a 7th grade English/Language Arts teacher who employed formative assessment practices. The results indicate that students tended to have positive emotions and judgments in response to their teacher’s feedback and make controllable attributions. They generally made informative meaning of the feedback and constructive decisions about next steps. Correlational findings showed that (1) emotions, judgments, meaning making, and attributions are related; (2) judgments of and the meaning that students made about the feedback were most strongly related to decision-making about next steps; and (3) task value was the only motivation variable related to responses to feedback. We conclude with implications for research and practice based on the expected and unexpected findings from this study.
... When students are in a classroom that encourages a culture of critique, constructive feedback is welcomed and valued, and mistakes are treated as opportunities to learn (Hattie and Timperley, 2007;Andrade 2013). Tasks assigned to students are also influential (Butler and Winne, 1995;Hattie and Timperley, 2007;Andrade, 2013;Lipnevich et al., 2016;Pellegrino et al., 2016;Leighton, 2019). The Feedback Intervention Theory (FIT) highlights the importance of the task: what it is, how one completes it, and the processes that students engage in during the task (Kluger and DeNisi, 1996). ...
... The Feedback Intervention Theory (FIT) highlights the importance of the task: what it is, how one completes it, and the processes that students engage in during the task (Kluger and DeNisi, 1996). The clarity of the task and the expectations are vital in activating the knowledge and skills needed for students to complete the task and understand the feedback received about performance (Leighton, 2019;Pellegrino et al., 2016;Wang et al., 2019). ...
Article
Full-text available
In this theoretical paper, we shift the attention from feedback as something given to feedback as something received . After Black and Wiliam shined a light into the black box of the classroom and identified formative assessment as a way to raise standards of achievement, a large body of research revealed the influence of feedback on learning. Not all such influences were positive, however, which created a need for closer examinations of the nature of feedback. In addition, recent scholarship on assessment as the co-regulation of learning reveals the importance of understanding how students process and use feedback. We present a model of the internal mechanisms of feedback processing that represents hypothesized ways in which initial motivational states drive how students respond to feedback, as well as the cognitive and affective mechanisms of assessment information processing. We first synthesize a review of existing models and then describe our model in detail, emphasizing the internal mechanisms of feedback processing: initial motivational states, emotions elicited by and interpretations of feedback, and decision-making. The paper concludes with implications for the model’s use as a framework for empirical studies that could contribute to the nascent field of research on classroom assessment as the co-regulation of learning.
... Researchers and practitioners have acknowledged the need for assessments that go beyond traditional, large-scale summative assessments and inform instruction (e.g., Council of Chief State School Officers, 2008; Pellegrino et al., 2016;Wilson, 2018). Comprehensive and balanced assessment systems-in which a variety of assessments administered throughout the year provide stakeholders with multiple sources of evidence for decisionmaking-emerged at the state and local levels to provide teachers with data throughout the year. ...
... Without a strong connection between cognition and assessment contents, the validity of inferences made from results can be compromised (National Research Council, 2001). Although large-scale accountability measures have not historically been based on strong models of cognition (Pellegrino et al., 2016), the use of learning progressions to inform formative assessment practices is widely acknowledged (e.g., Wilson, 2009;Alonzo, 2018). A strong research-based model for how skills develop supports instructional practice (Shepard, 2018), which is critical for subsequent claims in the theory of action. ...
Article
Full-text available
Policy shifts in the United States are beginning to reduce the emphasis on using statewide assessment results primarily for accountability and teacher evaluation. Increasingly, there are calls for and interest in innovative and flexible assessments that shift the purposes of assessment and use of results toward instructional planning and student learning. Under the Innovative Assessment Demonstration Authority, some states are exploring options for replacing traditional large-scale summative assessments with innovative measures. However, many of these programs are still in early phases of planning and research and have not yet fully articulated how the innovative system achieves desired outcomes. This conceptual paper presents an argument in the form of a theory of action for a flexible and innovative assessment system already in operational use. The system replaces traditional summative assessments with large-scale through-year Instructionally Embedded assessments. We describe the components of the theory of action, detailing the theoretical model and supporting literature that illustrate how system design, delivery, and scoring contribute to the intended outcomes of teachers using assessment results to inform instruction and having higher expectations for student achievement, in addition to accountability uses. We share considerations for others developing innovative assessment systems to meet stakeholders’ needs.
... Through the implementation of the CEFR-aligned CBA, teachers will be required to assess students' language development in four main language skills (listening, speaking, reading, and writing) (Hopfenbeck, 2018). Pellegrino, DiBello and Goldman (2016) suggested that teachers need to be able to design assessment instruments that engage students in demonstrating their language knowledge, aligned with CEFR principles. Of note, the classroom-based assessment recently introduced into Malaysian education requires teachers to be literate in its assessment system, because they have been given full autonomy to assess students' language progression. ...
Article
Full-text available
The past five years have seen drastic reform of the Malaysian education system. The reform prompts significant changes in teaching and learning and in the assessment system. The emphasis on high-stakes examinations was revised and a new assessment system was introduced. In 2021, many high-stakes examinations like Ujian Pencapaian Sekolah Rendah (UPSR) and Pentaksiran Tingkatan 3 (PT3) were abolished for being ineffective tools to measure students' learning capability. Since the abolishment, the government has paved the way by introducing a more progressive and continuous assessment system known as Classroom-Based Assessment (CBA), which gives teachers full autonomy to assess students through formative assessments. In the context of English Language Education, the CBA is aligned with the CEFR. The CEFR offers new ways for teachers to assess students' language progression against a standard international descriptor. This change has resulted in a huge transformation of the teachers' role, as CEFR-aligned CBA demands that teachers plan, design, implement, and report the new assessment system in their teaching practices. Hence, a mixed-method study was conducted to explore teachers' assessment literacy in implementing CEFR-aligned CBA at the micro level. The findings of this study show that teachers have a low level of assessment literacy, which has influenced their practices in enacting the new assessment system. Several challenges, such as time constraints, lack of training, teachers' unfamiliarity, and the tedious process of CBA, raise concerns that the government must address in enacting this change in education.
... All studies on digitalization highlight competencies in dealing with digital tools and technologies (Seibold & Stieler, 2016; Bach et al., 2020). In terms of content and didactics, the workshop was designed in a competency-oriented way following the evidence-centered design (ECD) approach (Pellegrino, DiBello & Goldman, 2016), with explicit consideration of the curriculum-instruction-assessment triad (Weber & Starke, 2010; Achtenhagen, 2012) (Weber, Off, Hackenberg, Schumann & Achtenhagen, 2021). According to this approach, starting from an intensive domain analysis, authentic subject-didactic problem situations are to be extracted and competencies formulated that are to be conveyed to the intended target group against this background (curriculum). ...
Chapter
Full-text available
This contribution focuses on the use of augmented reality (AR) in vocational education and training as well as further education contexts. In particular, the focus is on questions regarding what occupational fields and training contents use AR, what kind of AR are used, whether correlations between AR applications and performance-, perception- and motivation-related variables exist, as well as how AR is embedded in didactic and pedagogic contexts. Based on an integrative review of existing literature, a total of 16 studies and two existing literature reviews were included in the analysis. The results indicate, among other things, that the studies use heterogeneous research methodologies as well as didactic and pedagogic reference theories. A few studies point to positive effects of AR use on performance-, perception- and motivation-related variables. At the same time, the review emphasises a need to further integrate technological and pedagogical-instruction design parameters.
... Applying Kane's approach to the NTE, the interpretive argument for the validity of the NTE would focus largely on the uses and interpretations of the results, with the idea being that using the results as intended helps align the test with the curriculum and thus aid pupils' learning in a formative way (Pellegrino et al., 2016). In the case of the NTE, there is a clearly stated objective for the tests, as well as guidelines for schools, parents, and (especially) teachers as to how the results should be used and interpreted. ...
Article
Full-text available
This article reports on an investigation into the role the National Tests of English (NTE) and their results play in Norwegian eighth-grade classrooms. Previous and current opposition to the tests from some teachers and pupils gave rise to the question of whether the tests are being used as recommended, based on documents produced by the Directorate of Education and Training. The study proceeded from the premise that consequential validity (Messick, 1994) could be under threat in cases of clear discrepancies between intended and actual consequences and uses of the tests and their results. A mixed methods study was conducted among eighth-grade teachers of English, consisting of a quantitative digital survey and qualitative, semi-structured interviews. In total, 43 English teachers participated in the study. Results indicated a lack of uniformity in the uses of the NTE, with both the nature and levels of engagement in individual schools being determined by factors such as principals’ concerns, time constraints, parents’ interest levels and teachers’ own views on the usefulness of the tests and the results. Validity is threatened when unintended consequences take the place of intended consequences (Chalhoub-Deville, 2015). The study reveals that, in around half of the schools involved, this seems to be happening, partly ascribable to a lack of time and/or interest in the tests. As respondents reported that schools’ allocations of time and resources to the tests are largely determined by school principals, a follow-up study with principals is recommended.
... For instance, the question arises as to whether different response processes are to be expected among students with a certain academic subject, certain educational indicators, or in different familial and social contexts. In particular, the possible effects of different domains and possible curricular and/or instructional specifics in the study programs that may impact students' COR should also be considered in follow-up research to test for instructional validity (Pellegrino et al., 2016). ...
... Especially in higher education, where, due to digitalization, studying has become more location-and time-independent, individually tailored, multimodal, and self-directed, digital learning opportunities play a central role in teaching and learning (Minea-Pic, 2020). In the context of the current pandemic, an immense increase in online learning can be seen worldwide (Mulenga & Marbán, 2020;UNESCO, 2020) which leads to a growing demand for effective, high-quality, digital learning opportunities (Aljawarneh, 2020;Biggs & Tang, 2011;Manca, 2020;Pellegrino, 2012;Pellegrino et al., 2016;Walstad & Wagner, 2016). The increasing heterogeneity of students and their study preconditions (Bröckling et al., 2021) as a further challenge of higher education can also be met with digital learning tools as they enable an individually tailored teaching offer, thus contributing to more equity in education (Ribeiro et al., 2011). ...
Conference Paper
Digital learning tools have become essential elements of teaching, especially during the COVID-19 pandemic. Innovative teaching-learning tools are particularly needed in higher education to support the teaching and learning of highly heterogeneous student groups. Thereby, the promotion of action-oriented competencies is a central goal, especially in teacher education, where classroom practice is not included to a satisfactory extent in higher education curricula. A current research project develops and evaluates digital multimedia learning packages for teacher education to effectively foster adaptive teaching competencies, a central component of action-oriented competencies. Using a pre-post design, we examined use and effectiveness of the newly developed and implemented multimedia learning packages (focusing on self-regulated learning) in teacher education. Based on our findings, implications for further research and teacher training are drawn.
... High levels of expertise in students' mathematical cognition and/or measurement can lead to relatively detailed or precise models that may be ideal for research purposes but could go beyond what is desired by classroom teachers. Instructional usefulness should be the priority of this work (Pellegrino et al., 2016). For example, while our proportional reasoning domain articulation is relatively simple compared to the extensive research that has been conducted in this area, its focus is on generating information and materials useful to classroom teachers. ...
Article
Full-text available
Construct maps and related item type frameworks, which respectively describe and identify common patterns in student reasoning, provide teachers with tools to support instruction built upon students’ intuitive thinking. We present a methodological approach–embedded within common assessment development processes–for developing such construct maps and item type frameworks which entails (1) articulating a construct map, (2) developing diagnostic item types, (3) comparing item type difficulties, and (4) modifying or providing evidence in support of the construct map. We present an example focused on students’ proportional reasoning based on its centrality to mathematical success. The results further inform our construct map for students’ proportional reasoning and provide a diagnostic item-type framework.
... The team has developed a validation argument framework (Confrey, Toutkoushian, & Shah, 2019) derived from Pellegrino, DiBello, & Goldman (2016) and a method to conduct ongoing validation studies of specific RLCs (Confrey & Toutkoushian, in press). A key characteristic of this work is the collaboration between team experts in LS and psychometrics (PM) working in a continuous cycle and at scale with partner districts. ...
... [Figure: Simplified representation of evidence-centered design, adapted from Pellegrino et al. (2015).] Current extensions of evidence-centered design remain relatively assessment-centered and provide little guidance on how to negotiate between the needs of assessment and crafting engaging science and mathematics instruction. ...
Article
Full-text available
National educational standards stress the importance of science and mathematics learning for today's students. However, across disciplines, students frequently struggle to meet learning goals about core concepts like energy. Digital learning environments enhanced with artificial intelligence hold the promise to address this issue by providing individualized instruction and support for students at scale. Scaffolding and feedback, for example, are both most effective when tailored to students' needs. Providing individualized instruction requires continuous assessment of students' individual knowledge, abilities, and skills in a way that is meaningful for providing tailored support and planning further instruction. While continuously assessing individual students' science and mathematics learning is challenging, intelligent tutoring systems show that it is feasible in principle. However, the learning environments in intelligent tutoring systems are typically not compatible with the vision of what effective K-12 science and mathematics learning looks like. This leads to the challenge of designing digital learning environments that allow for both – meaningful science and mathematics learning and the reliable and valid assessment of individual students' learning. Today, digital devices such as tablets, laptops, or digital measurement systems increasingly enter science and mathematics classrooms. In consequence, students' learning increasingly produces rich product and process data. Learning Analytics techniques can help to automatically analyze this data in order to obtain insights about individual students' learning, drawing on general theories of learning and relative to established domain specific models of learning, i.e., learning progressions. We call this approach Learning Progression Analytics (LPA). In this manuscript, building on evidence-centered design (ECD), we develop a framework to guide the development of learning environments that provide meaningful learning activities and data for the automated analysis of individual students' learning – the basis for LPA and scaling individualized instruction with artificial intelligence.
... Despite promising results, it becomes complex when referring to interpretation and use of scores. This is true because assessment practice is an evidentiary reasoning process and unknown factors may cause validity issues (Pellegrino et al., 2016). This study thus utilized a validity inferential network to analyze the validity of science teachers' interpretation and uses of automatically generated scores of explanations. ...
... Especially in higher education, where, due to digitalization, studying has become more location-and time-independent, individually tailored, multimodal, and self-directed, digital learning opportunities play a central role in teaching and learning (Minea-Pic, 2020). In the context of the current pandemic, an immense increase in online learning can be seen worldwide (Mulenga & Marbán, 2020;UNESCO, 2020) which leads to a growing demand for effective, high-quality, digital learning opportunities (Aljawarneh, 2020;Biggs & Tang, 2011;Manca, 2020;Pellegrino, 2012;Pellegrino et al., 2016;Walstad & Wagner, 2016). The increasing heterogeneity of students and their study preconditions (Bröckling et al., 2021) as a further challenge of higher education can also be met with digital learning tools as they enable an individually tailored teaching offer, thus contributing to more equity in education (Ribeiro et al., 2011). ...
Presentation
Digital learning tools have become essential elements of teaching, especially during the COVID-19 pandemic. Innovative teaching-learning tools are particularly needed in higher education to support the teaching and learning of highly heterogeneous student groups. Thereby, the promotion of action-oriented competencies is a central goal, especially in teacher education, where classroom practice is not included to a satisfactory extent in higher education curricula. A current research project develops and evaluates digital multimedia learning packages for teacher education to effectively foster adaptive teaching competencies, a central component of action-oriented competencies. Using a pre-post design, we examined use and effectiveness of the newly developed and implemented multimedia learning packages (focusing on self-regulated learning) in teacher education. Based on our findings, implications for further research and teacher training are drawn.
... 214) Originally written for a science context, this definition focused on how student thinking develops over time. However, the definition is not universally accepted; Pellegrino et al. (2016) noted that existing LP definitions have "substantial differences in focus and intent" (p. 64). ...
Article
This systematic review examined evidence of the utility of learning progression (LP)–based assessments to inform teaching and student learning in classroom contexts. Fifty-nine studies met inclusion criteria and were analyzed against four research questions. Evidence highlighted their potential for supporting judgments about learning, informing instructional and learning decisions, and improving teacher learning and development. Although 23 studies measured student achievement, reporting positive overall effects, only 6 adopted the experimental designs necessary for causal claims. Using LP-based assessment for formative purposes was well supported. Limited evidence was found regarding summative and accountability uses. Findings show that LP-based assessment design and use requires trade-offs relating to standardization and scale. Teachers need opportunities for negotiation when making judgments and integrating LP-based assessments into existing curriculum and policy contexts. Future research should examine student use of LP assessments and find a balance between standardization and customization to meet the needs of diverse learners and local contexts.
... The importance of CT assessment is regularly stressed (Grover & Pea, 2013;Ilic et al., 2018;Shute et al., 2017;Tang et al., 2020;Weintrop et al., 2021). However, it has to be highlighted that assessment is not an end in itself, but it should contribute to promoting student learning (Pellegrino et al., 2016). When assessing complex skills such as CT, the structure as well as the levels of the construct have to be considered (Seufert et al., 2021). ...
Article
Full-text available
Computational thinking (CT) is an important 21st-century skill. This paper aims at more useful CT assessment. Available evaluation instruments are reviewed; two generally accepted CT evaluation tools are selected for a comprehensive CT assessment: the CTt, a performance test, and the CTS, a self-assessment instrument. The sample comprises 202 high school students from German-speaking Switzerland. Concerning the CTt, Rasch-scalability is demonstrated. Utilizing the approach of the PISA studies, proficiency levels are formed that comprise tasks with specific characteristics that students are systematically able to master. This could help teachers to offer individual support to their students. In terms of the CTS, the original version is refined using confirmatory factor and measurement-invariance analysis. A latent profile analysis yielded four profiles, two of which are of particular interest. One profile comprises students with, on the one hand, moderate to high creative thinking ability, cooperativity, and critical thinking skills and, on the other hand, low algorithmic thinking ability. The second remarkable profile consists of students with particularly low cooperativity. Based on these strength and weakness profiles, teachers could offer support tailored to student needs.
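The latent profile analysis reported for the CTS can be approximated, for readers who want a concrete starting point, with a Gaussian mixture model; the sketch below is not the authors' analysis, uses invented subscale scores, and selects the number of profiles by BIC.

```python
# Rough sketch of a latent-profile-style analysis on self-assessment subscales,
# using a Gaussian mixture model as a stand-in for LPA; all data are invented.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Columns: creativity, cooperativity, critical thinking, algorithmic thinking (invented).
cts_scores = rng.normal(loc=3.5, scale=0.8, size=(202, 4))

best_model, best_bic = None, np.inf
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(cts_scores)
    bic = gmm.bic(cts_scores)
    if bic < best_bic:
        best_model, best_bic = gmm, bic

profiles = best_model.predict(cts_scores)
print("chosen number of profiles:", best_model.n_components)
print("profile means:\n", best_model.means_)
```

With real CTS data, the profile means would be inspected for strength-and-weakness patterns such as the two profiles highlighted in the abstract.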
... The extent to which teachers are prepared to face these challenges has been treated with skepticism (Pellegrino et al., 2016). Even though theoretical arguments have repeatedly accentuated the need to prepare teachers and stakeholders for their assessment responsibilities, only a minority of teachers seem to be prepared 'to face the challenges of classroom assessment because they have not been ...
Experiment Findings
Full-text available
The purpose of the present study was to identify currently used assessment practices in English as a Foreign Language (EFL) classrooms across Europe from the perspective of two important groups of stakeholders in the foreign language assessment process, namely EFL teachers and learners. Secondary school learners in different European educational contexts were for the first time targeted with regard to assessment practices in order to establish what assessment practices help them develop their language proficiency in the foreign language. Moreover, EFL teachers’ confidence levels with a range of assessment formats were identified as well as their perceived training needs in this field
... The processes described in this study support the design and validation of a large-scale, standardized educational assessment intended for commingled informative and summative purposes. Under such a purpose, careful, iterative development processes are necessary to ensure interpretations regarding how student learning grows more sophisticated are accurately depicted (Pellegrino et al., 2016). ...
Article
This paper presents results of a score interpretation study for a computer adaptive mathematics assessment. The study purpose was to test the efficacy of item developers’ alignment of items to Range Achievement‐Level Descriptors (RALDs; Egan et al.) against the empirical achievement‐level alignment of items to investigate the use of RALDs as the epicenter of the test score interpretation validity argument. Item developers aligned 82%–87% of items in a computer adaptive item bank for Grades 3–8 mathematics to the assessment's RALDs that had been reconciled to the test scale after standard setting. The degree of consistency between the hypothesized alignment and actual alignment was examined using agreement statistics. Item developers correctly identified the empirical achievement level of 56%–60% of the items, which was above the chance level of agreement. An emerging technique known as embedded standard setting (ESS; Lewis and Cook) was then used to evaluate whether score interpretations based on item developer classifications of items to RALDs were comparable to the score interpretations derived from the cut scores set in 2018. Score interpretations for the first two achievement levels were consistent across administrations; however, for the most advanced students score interpretations were not maintained.
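The agreement between item developers' hypothesized RALD alignments and the empirical achievement levels can be summarized with simple agreement statistics. The sketch below is illustrative only: the level names and classifications are invented, and the study's exact statistics may differ.

```python
# Illustrative agreement check between item developers' hypothesized achievement
# levels and empirically derived levels; level names and data are invented.
import numpy as np
from sklearn.metrics import cohen_kappa_score

levels = ["Below", "Approaching", "At Target", "Advanced"]  # placeholder level names
print("level codes:", dict(enumerate(levels)))

hypothesized = np.array([0, 1, 1, 2, 3, 2, 1, 0, 3, 2])  # item developers' alignment
empirical    = np.array([0, 1, 2, 2, 3, 1, 1, 0, 3, 3])  # alignment from item statistics

exact_agreement = np.mean(hypothesized == empirical)
kappa = cohen_kappa_score(hypothesized, empirical, weights="quadratic")

print(f"exact agreement: {exact_agreement:.2f}")
print(f"quadratically weighted kappa: {kappa:.2f}")
```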
... However, one might need to consider the prompt that launches the learner and how they build. This means viewing assessment as part of the activity design and not something that happens solely at the end of the project (Pellegrino et al., 2016;Sanders et al., 2019). ...
Article
During the pandemic, teachers whose practice depends on maker-based learning have had the added challenge of translating their hands-on lessons for remote teaching. Yet with students making remotely, how can a teacher monitor the students' progress, offer timely feedback, or infer what the students understood? In short, how are teachers assessing this work? Working with a learning community of teachers who center hands-on making in their instruction regardless of academic discipline, this study was conducted to examine how teachers are supporting and assessing maker-based learning. Our study draws on observational field notes taken during the community's meetings, interviews with four focal teachers, and artifacts from the teachers' maker projects. Taking a values-based assessment approach, our findings reveal interesting shifts in teaching practice. Specifically, teachers incorporated social-emotional goals into the activities they design and monitor, students documented their artifacts and process, and teachers adapted to using low-tech materials to ensure accessibility while engaging remote students in their learning goals. These findings imply not only that remote maker-based experiences can influence the role of students as assessors and the tools and materials they use for making, but also that the practices revealed in remote settings could inform in-person settings.
... It has further evolved to require an interpretive argument that provides the necessary warrants for the propositions and claims of those arguments (Mislevy et al., 2003;Haertel and Lorie, 2004;Kane, 2006). Pellegrino et al. (2016) identified three components for constructing a validation argument for classroom assessments: cognitive, instructional, and inferential. We adapted and elaborated on their framework to create a validation framework tailored to MM6-8. ...
Article
Full-text available
This study reports how a validation argument for a learning trajectory (LT) is constituted from test design, empirical recovery, and data use through a collaborative process, described as a “trading zone” among learning scientists, psychometricians, and practitioners. The validation argument is tied to a learning theory about learning trajectories and a framework (LT-based data-driven decision-making, or LT-DDDM) to guide instructional modifications. A validation study was conducted on a middle school LT on “Relations and Functions” using a Rasch model and stepwise regression. Of five potentially non-conforming items, three were adjusted, one retained to collect more data, and one was flagged as a discussion item. One LT level description was revised. A linear logistic test model (LLTM) revealed that LT level and item type explained substantial variance in item difficulty. Using the LT-DDDM framework, a hypothesized teacher analysis of a class report led to three conjectures for interventions, demonstrating the LT assessment’s potential to inform instructional decision-making.
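A full LLTM is usually fit with dedicated IRT software, but the idea of explaining item difficulty from LT level and item type can be approximated by regressing previously estimated Rasch difficulties on those item features. The sketch below is a rough analogue with invented difficulties and features, not the authors' analysis.

```python
# Rough analogue (not the authors' analysis) of examining how LT level and item
# type explain Rasch item difficulty: ordinary least squares on previously
# estimated difficulties. A true LLTM would be fit with dedicated IRT software.
import pandas as pd
import statsmodels.formula.api as smf

items = pd.DataFrame({
    "difficulty": [-1.2, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3, 1.6],  # invented Rasch estimates
    "lt_level":   [1, 1, 2, 2, 3, 3, 4, 4],                      # learning trajectory level
    "item_type":  ["recognize", "apply", "recognize", "apply",
                   "recognize", "apply", "recognize", "apply"],
})

model = smf.ols("difficulty ~ lt_level + C(item_type)", data=items).fit()
print(model.summary())
print("variance in difficulty explained:", round(model.rsquared, 2))
```

In the study above, the corresponding finding was that LT level and item type together explained substantial variance in item difficulty.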
... Its utilisation for diagnostic assessment purposes is an as yet unexplored area, and perhaps other validation approaches might be more suited for this end. To gain a broader perspective on the matter, the reader might, for example, also be interested in utilising design and validation approaches that have a stronger emphasis on formative educational practices (e.g., Black & Wiliam, 2018; Pellegrino et al., 2016). Second, as indicated by Study 1, it remains to be seen whether the current study was able to gain insight into the full range of work products pupils might generate. ...
Article
Full-text available
This study aimed to develop and validate, based on the Evidence Centered Design approach, a generic tool to diagnose primary education pupils' prior knowledge of technological systems in primary school classrooms. Two technological devices, namely the Buzz Wire device and the Stairs Marble Track, were selected to investigate whether theoretical underpinnings could be backed by empirical evidence. Study 1 indicated that the tool enabled pupils to demonstrate different aspects of their prior knowledge about a technological system by a wide variety of work products. Study 2 indicated that these work products could be reliably ranked from low to high functionality by technology education experts. Their rank order matched the Fischer-scale-based scoring rules, designed in cooperation with experts in skill development. The solution patterns fit the extended non-parametric Rasch model, confirming that the task can reveal differences in pupils' prior knowledge on a one-dimensional scale. Test–retest reliability was satisfactory. Study 3 indicated that the diagnostic tool was able to capture the range of prior knowledge levels that could be expected of 10- to 12-year-old pupils. It also indicated that pupils' scores on standardised reading comprehension and mathematics tests had low predictive value for the outcomes of the diagnostic tool. Overall, the findings substantiate the claim that pupils' prior knowledge of technological systems can be diagnosed properly with the developed tool, which may support teachers in making decisions about content, instruction and support for their technology lessons.
... Then specific measurable tasks were developed to measure learning success (Rostami and Khadjooi, 2010). This orientation made it difficult to study such phenomena as understanding, reasoning, and thinking (Pellegrino et al., 2016)-phenomena that are of paramount importance for education. Nonetheless, this orientation is the basis of evidence-based instructional practices ranging from approaches to prompting (Institute of Education Sciences 2018) to procedures like task analysis and constant time delay (e.g., Browder et al., 2014) to teach clearly articulated academic skills to students with significant cognitive disabilities. ...
Article
Full-text available
In education, taxonomies that define cognitive processes describe what a learner does with the content. Cognitive process dimensions (CPDs) are used for a number of purposes, such as in the development of standards, assessments, and subsequent alignment studies. Educators consider CPDs when developing instructional activities and materials. CPDs may provide one way to track students’ progress toward acquiring increasingly complex knowledge. There are a number of terms used to characterize CPDs, such as depth-of-knowledge, cognitive demand, cognitive complexity, complexity framework, and cognitive taxonomy or hierarchy. The Dynamic Learning Maps (DLM™) Alternate Assessment System is built on a map-based model, grounded in the literature, where academic domains are organized by cognitive complexity as appropriate for the diversity of students with significant cognitive disabilities (SCD). Of these students, approximately 9% either demonstrate no intentional communication system or have not yet attained symbolic communication abilities. This group of students without symbolic communication engages with and responds to stimuli in diverse ways based on context and familiarity. Most commonly used cognitive taxonomies begin with initial levels, such as recall , that assume students are using symbolic communication when they process academic content. Taxonomies that have tried to extend downward to address the abilities of students without symbolic communication often include only a single dimension (i.e., attend ). The DLM alternate assessments are based on learning map models that depict cognitive processes exhibited at the foundational levels of pre-academic learning, non-symbolic communication, and growth toward higher levels of complexity. DLM examined existing cognitive taxonomies and expanded the range to include additional cognitive processes that demonstrate changes from the least complex cognitive processes through early symbolic processes. This paper describes the theoretical foundations and processes used to develop the DLM Cognitive Processing Dimension (CPD) Taxonomy to characterize cognitive processes appropriate for map-based alternate assessments. We further explain how the expanded DLM CPD Taxonomy is used in the development of the maps, extended standards (i.e., Essential Elements), alternate assessments, alignment studies, and professional development materials. Opportunities and challenges associated with the use of the DLM CPD Taxonomy in these applications are highlighted.
... The implications of autism and SLD and the obstacles those may pose in relation to assessment, also form part of the theoretical discussion that follows. Pellegrino et al. (2016) remarked that formative assessment is on-going and summative assessment is periodic and gives the teacher information about the grade-related progress of their students. Taras (2005) argues that 'The process of assessment leads to summative assessment, that is, a judgement which encapsulates all the evidence up to a given point' (p. ...
Article
Full-text available
The Engagement Model was launched in January 2020, endeavouring to address the weaknesses of the P‐scales assessment for students not yet involved in a subject‐specific curriculum. This paper will discuss how and if the tensions between previously adopted assessment systems as discussed in teacher interviews can be reconciled through the Engagement Model in relation to students with autism and severe learning difficulties. The interview findings suggested that some of the problems with assessment, when applied in this context, are related to consistency and transferability, lack of formal recognition of non‐academic progress, familiarity with the students, observation skills and training, workload and time, and subjectivity of judgement amongst professionals. When compared with the aims of the Engagement Model, the findings of the research suggest that even though it addresses some of the issues raised, it cannot act as a substitute to the P‐scale system as it serves a different purpose.
...  These log file data have the potential of providing further insights into: (a) the processes and strategies of APS; and (b) the functioning of the specific APS tasks. Both aspects contribute to establishing a validity argument for the proposed assessment (Pellegrino, DiBello and Goldman, 2016). In particular, with respect to: (a) log file data comprise information on the sequence and duration of actions, and the actions themselves. ...
Chapter
Please download here: https://www.oecd.org/skills/piaac/publications/
... The design and validation of systems intended for both formative and summative purposes, for classroom as well as high-stakes uses, requires careful, evidence-based development processes when systems are intended to support interpretations regarding how student learning grows more sophisticated over time [34]. This is an element of learning science, and because of the kinship of LPs and RALDs, we followed the approach and process recommended by Jin et al. [25] to create a validity framework for the RALDs as shown in Table 4. ...
Preprint
Abstract. Coherence between adaptive instructional and summative assessment systems should provide teachers stronger support in challenging each student. Coherent systems should lead to accelerated student learning, and students being ready for the next grade. Range achievement level descriptors (RALDs) describe a state's theory of what increasing knowledge, skills, and abilities look like in their standards as students become more sophisticated thinkers on their journey to proficiency and beyond. Systems can be connected to better align interpretations of student performance using task features that align to evidence statements in RALDs. Our proposition is that through coding tasks in both systems using common schema, including RALD-to-task match, assessment and instructional system inferences about student progress can be bridged. We combined two approaches for linking inferences across systems measuring mathematics that do not rely on common students or common tasks as a proof of concept. Using RALDs and other task features, we predicted 45% of the variance in task difficulties for a secondary data source. Holding all else constant, RALDs were the strongest feature for modeling increases to task difficulty. This suggests that RALDs could be leveraged in instructional systems to support interpretations of student growth, increasing their value for teachers.
... In the instantiation of the assessment triangle discussed in Lai et al., the LP was mapped to the cognition vertex, the data collection (think-aloud protocols and online assessment activities, in this case) was mapped to the observation vertex, and the analyses (qualitative analysis and latent class analysis, in this case) were mapped to the inference vertex. Confrey, Toutkoushian, and Shah (2019, Fig. 2, p. 27) developed a two-dimensional validation framework, in which the vertical dimension is assessment purpose and the horizontal dimension consists of the components of classroom assessment as defined by Pellegrino, DiBello, and Goldman (2016). The assessment purpose dimension includes "Support teachers to interpret data to target instruction," "Elicit diverse levels of student thinking," "Improve student learning," "Increase students' self-awareness," "Strengthen teachers' knowledge," and "Connect to larger assessment system(s)." ...
Article
A learning progression, or learning trajectory, describes the evolution of student thinking from early conceptions to the target understanding within a particular domain. As a complex theory of development, it requires conceptual and empirical support. In earlier work, we proposed a cycle for the validation of a learning progression with four steps: 1) Theory Development, 2) Examination of Empirical Recovery, 3) Comparison to Competing Models, and 4) Evaluation of Instructional Efficacy. A group of experts met to discuss the application of learning sciences to the design, use, and validation of classroom assessment. Learning progressions, learning trajectories, and how they can support classroom assessment were the main focuses. Revisions to the cycle were suggested. We describe the adapted cycle and illustrate how the first third of it has been applied towards the validation of a learning progression for the concept of function.
... Validations of the ILP and assessments. We used multiple components of validity analysis (cognitive, instructional, and inferential validity) to validate the instructionally relevant LP and assessment tasks, including expert review, student cognitive protocol studies, and small-scale samples of student performance (Pellegrino et al., 2016). The current proposal presented the cognitive validation of the ILP and assessments by reporting findings from the feedback of experts (three internal experts, three external experts, and one senior expert) and four teachers' reviews during the development, revision, and validation process. ...
Conference Paper
Full-text available
This study aims to develop a three-dimensional integrated learning progression (3D ILP) and assessments for delineating the progress of middle school students’ proficiency in applying the elements of energy, modeling, and cause and effect to make sense of real-world phenomena. We employed construct-centered design to guide the development of the 3D ILP and assessments and applied multiple components of validity analysis to them. In the proposal, we presented feedback from internal and external experts as well as teachers. The key takeaways from the experts and the adjustments suggested by the teachers were used to validate and further improve the current 3D ILP and assessments. Our development approach should be of interest to NARST members, and the final 3D ILP would be useful for guiding teachers’ instruction toward their students’ science proficiency.
Article
This is a research report of teaching patterns of critical thinking using the competency-based 3CA (an acronym for the educational practices of Concept maps, Critical thinking, Collaboration, and Assessment) model of classroom instruction to change the grammar of schooling. Critical thinking is defined as the “WH questions”: “what, when, where, how, who, and why” taken from Aristotle’s Nicomachean Ethics. These questions are threaded through the practices of concept maps, collaboration, and assessment. This conceptualization of patterns of thinking is influenced by Ludwig Wittgenstein’s conceptualization of the relations between the language games of practice and language games in the mind. This study compares individual and collaborative approaches to teaching the critical thinking “WH questions” in a child development class. Students in the individual groups used more “what questions,” whereas students in the collaborative group used more “why and how questions.”
Article
National calls to improve science education include focusing on scientific practices coupled with learning disciplinary core ideas. Among the practices is constructing explanations. In the field of cellular and molecular biology, explanations typically include a mechanism and can be used to make predictions about phenomena. In this work, we developed an assessment item about transcription, a key process in the biology core concept of genetic information flow. We used a mechanistic framework to develop a rubric that identifies undergraduate explanations that leverage molecular or sub-molecular mechanisms, descriptions, or use unlinked ideas. We applied this rubric to categorize 346 undergraduate written explanations and compare five versions of the item. We found that one version elicited sub-molecular mechanistic explanations from 20% of students, compared to between 2% and 13% from other versions. This version included the element of time, by indicating that a new RNA was formed as part of transcription. We also developed and applied a conceptual rubric to capture the context students used in their explanations and found a median of two context ideas in student explanations of transcription. Our work demonstrates that with careful item wording, undergraduates can explain molecular processes like transcription by leveraging sub-molecular mechanisms.
Article
Computational thinking (CT) is a set of cognitive skills that every child should acquire. K–12 classrooms are expected to provide students opportunities (tasks) to think computationally. We introduce a CT competency assessment for middle school students. The assessment design process started by establishing a cognitive model of CT domain mastery, in which three broad skill types were identified to represent CT competency. After multiple-choice item prototypes were written, pilot tested, and revised, 15 of them were finally selected to be administered to 564 students in two middle schools in the Midwestern United States. Using a cognitive diagnostic scoring model, we determined mastery classifications for each student; these can be used diagnostically by teachers as a pretest and, perhaps in the future, to compare the outcomes of CT instructional programs. The results inform an initial understanding of typical learning progressions in CT at the middle school level.
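To make the idea of cognitive diagnostic scoring concrete, here is a minimal sketch of a DINA-style mastery classification for a single response pattern. The Q-matrix, slip and guess values, and responses are assumptions made purely for illustration; they are not the instrument or scoring model reported in the study.

```python
# Minimal DINA-style mastery classification for one student (illustrative only).
import itertools
import numpy as np

# 5 items x 3 hypothetical CT skills; 1 means the item requires the skill.
Q = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
])
slip, guess = 0.1, 0.2                      # assumed item parameters
responses = np.array([1, 0, 1, 1, 0])       # one student's scored item responses

# Enumerate all 2^3 mastery profiles and compute the posterior under a flat prior.
profiles = np.array(list(itertools.product([0, 1], repeat=Q.shape[1])))
likelihoods = []
for alpha in profiles:
    eta = np.all(alpha >= Q, axis=1)               # True if all required skills are mastered
    p_correct = np.where(eta, 1 - slip, guess)     # DINA item response probability
    lik = np.prod(np.where(responses == 1, p_correct, 1 - p_correct))
    likelihoods.append(lik)
posterior = np.array(likelihoods) / np.sum(likelihoods)

best = profiles[np.argmax(posterior)]
print("Most probable mastery profile:", best, "posterior =", round(posterior.max(), 2))
```

A teacher-facing report would translate the inferred profile into statements such as "has mastered skills 1 and 3 but not skill 2."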
Article
Full-text available
Background Previous research indicates that students lack sufficient online credibility evaluation skills. However, the results are fragmented and difficult to compare as they are based on different types of measures and indicators. Consequently, there is no clear understanding of the structure of credibility evaluation. Objectives The present study sought to establish the structure of credibility evaluation of online texts among 265 sixth graders. Methods Students' credibility evaluation skills were measured with a task in which they read four online texts, two more credible (a popular science text and a newspaper article) and two less credible (a layperson's blog text and a commercial text). Students read one text at a time and evaluated the author's expertise, the author's benevolence and the quality of the evidence before ranking the texts according to credibility. Four competing measurement models of students' credibility evaluations were assessed. Results The model termed the Genre‐based Confirming‐Questioning Model reflected the structure of credibility evaluation best. The results suggest that credibility evaluation reflects the source texts and requires two latent skills: confirming the more credible texts and questioning the less credible texts. These latent skills of credibility evaluation were positively associated with students' abilities to rank the texts according to credibility. Implications The study revealed that the structure of credibility evaluation might be more complex than previously conceptualized. Consequently, students would benefit from activities that ask them to carefully analyse different credibility aspects of more and less credible texts, as well as the connections between these aspects.
Article
The interdisciplinary field of the learning sciences encompasses educational psychology, cognitive science, computer science, and anthropology, among other disciplines. The Cambridge Handbook of the Learning Sciences, first published in 2006, is the definitive introduction to this innovative approach to teaching, learning, and educational technology. In this significantly revised third edition, leading scholars incorporate the latest research to provide seminal overviews of the field. This research is essential in developing effective innovations that enhance student learning - including how to write textbooks, design educational software, prepare effective teachers, and organize classrooms. The chapters illustrate the importance of creating productive learning environments both inside and outside school, including after school clubs, libraries, and museums. The Handbook has proven to be an essential resource for graduate students, researchers, consultants, software designers, and policy makers on a global scale.
Article
This chapter reviews assessment research with the goal of helping all readers understand how to design and use effective assessments. The chapter begins by introducing the purposes and contexts of educational assessment. It then presents four related frameworks to guide work on assessment: (1) assessment as a process of reasoning from evidence, (2) assessment driven by models of learning expressed as learning progressions, (3) the use of an evidence-centered design process to develop and interpret assessments, and (4) the centrality of the concept of validity in the design, use, and interpretation of any assessment. The chapter then explores the implications of these frameworks for real-world assessments and for learning sciences research. Most learning sciences research studies deeper learning that goes beyond traditional student assessment, and the field can contribute its insights to help shape the future of educational assessment.
Book
Learning with freely accessible online resources as an important source of information for acquiring new knowledge poses several challenges, since content can be distributed freely on the internet and vast amounts of unstructured, unreliable, or biased information are easily accessible. Students must be able to search for, sift, select, verify, and critically evaluate online information and sources against relevant criteria. In Germany, a new theoretical-conceptual approach has been developed for operationalizing these skills through the construct Critical Online Reasoning (COR) and for measuring them with a new assessment (CORA). Building on theoretical considerations about the COR construct and on argument-based validation following the Standards for Educational and Psychological Testing (AERA et al.), the aim of this work is an analysis of central validation aspects.
Article
This study explores the role of unconventional forms of classroom assessments in expanding minoritized students' opportunities to learn (OTL) in high school physics classrooms. In this research + practice partnership project, high school physics teachers and researchers co‐designed a unit about momentum to expand minoritized students' meaningful OTL. Specifically, the unit was designed to (a) expand what it means to learn and be good at science using unconventional forms of assessment, (b) facilitate students to leverage everyday experiences, concerns, and home languages to do science, and (c) support teachers to facilitate meaningful dialogical interactions. The analysis focused on examining minoritized students' OTLs mediated by intentionally designed, curriculum‐embedded, unconventional forms of assessments. The participants were a total of 76 students in 11th or 12th grade. Data were gathered in the form of student assessment tasks, a science identity survey, and interviews. Data analysis entailed: (a) statistical analysis of student performance measured by conventional and unconventional assessments and (b) qualitative analysis of two Latinx students' experiences with the co‐designed curriculum and assessments. The findings suggest that the use of unconventional forms of curriculum‐embedded assessment can increase minoritized students' OTL if the assessment facilitates minoritized students to personally and deeply relate themselves to academic tasks.
Article
Fixed and growth mindsets represent implicit theories about the nature of one's abilities or traits. The existing body of research on academic achievement and the effectiveness of mindset interventions for student learning largely relies on the premise that fixed and growth mindsets are mutually exclusive. This premise has led to the common practice in which measures of one mindset are reversed and then assumed to represent the other mindset. Focusing on K-12 and university students (N = 27328), we tested the validity of this practice via a comprehensive item-level meta-analysis of the Implicit Theories of Intelligence Scale (ITIS). By means of meta-analytic structural equation modeling and network analysis, we examined (a) the ITIS item-item correlations and their heterogeneity across 32 primary studies; (b) the factor structure of the ITIS, including the distinction between fixed and growth mindset; and (c) moderator effects of sample, study, and measurement characteristics. We found positive item-item correlations within the sets of fixed and growth mindset items, with substantial between-study heterogeneity. The ITIS factor structure comprised two moderately correlated mindset factors (ρ = 0.63–0.65), even after reversing one mindset scale. This structure was moderated by the educational level and origin of the student sample, the assessment mode, and scale modifications. Overall, we argue that fixed and growth mindsets are not mutually exclusive but correlated constructs. We discuss the implications for the assessment of implicit theories of intelligence in education.
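One building block of such an item-level meta-analysis is pooling an item-item correlation across studies. The sketch below shows a standard Fisher-z, sample-size-weighted pooling; the correlations and sample sizes are invented and are not the ITIS data.

```python
# Illustrative pooling of one item-item correlation across studies (invented values).
import numpy as np

r = np.array([0.55, 0.68, 0.60, 0.72])   # correlation between two fixed-mindset items, per study
n = np.array([420, 310, 980, 150])        # per-study sample sizes

z = np.arctanh(r)                         # Fisher z transformation
w = n - 3                                 # inverse-variance weights for z
pooled_z = np.sum(w * z) / np.sum(w)
pooled_r = np.tanh(pooled_z)
print(f"pooled r = {pooled_r:.2f}")
```

A full meta-analytic SEM would pool the entire item-item correlation matrix in this spirit (typically with random-effects weights) before fitting and comparing the factor models.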
Article
In this article, we argue that automated scoring engines should be transparent and construct relevant—that is, as much as is currently feasible. Many current automated scoring engines cannot achieve high degrees of scoring accuracy without allowing in some features that may not be easily explained and understood and may not be obviously and directly relevant to the target assessment construct. We address the current limitations on evidence and validity arguments for scores from automated scoring engines from the points of view of the Standards for Educational and Psychological Testing (i.e., construct relevance, construct representation, and fairness) and emerging principles in Artificial Intelligence (e.g., explainable AI, an examinee's right to explanations, and principled AI). We illustrate these concepts and arguments for automated essay scores.
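A toy example of the transparency the authors argue for is a linear scoring model over a handful of interpretable, construct-relevant essay features, where every weight can be inspected and explained. The features, essays, and human scores below are invented for illustration.

```python
# Transparent, feature-based essay scoring sketch (all values invented).
import numpy as np
from sklearn.linear_model import LinearRegression

feature_names = ["word_count", "type_token_ratio", "argument_connectives"]
X = np.array([
    [250, 0.55, 4],
    [480, 0.62, 9],
    [120, 0.40, 1],
    [390, 0.58, 6],
    [600, 0.65, 11],
    [300, 0.50, 3],
])
human_scores = np.array([2, 4, 1, 3, 5, 2])   # human-assigned scores on a 1-5 rubric

model = LinearRegression().fit(X, human_scores)
for name, weight in zip(feature_names, model.coef_):
    print(f"{name}: weight = {weight:+.3f}")   # each contribution is directly inspectable
```

Black-box engines may score more accurately, but a model like this makes it straightforward to explain to an examinee why a particular score was assigned.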
Article
Full-text available
Computational thinking (CT) skills are critical for the science, technology, engineering, and mathematics (STEM) fields, thus drawing increasing attention in STEM education. More curricula and assessments, however, are needed to cultivate and measure CT for different learning goals. Maker activities have the potential to improve student CT, but more validated assessments are needed for maker activities. We developed a set of activities for students to improve and assess essential CT skills by creating real-life applications using Arduino, a microcontroller often used in maker activities. We examined the psychometric features of the rubric-scored CT performance assessments and the effectiveness of the maker activities in improving CT. Two high school physics teachers implemented these Arduino activities and assessments with fifteen high school students over three days in a summer program. The participating students took an internal content-involved test and an external CT test before and after participating in the program. The students also took the performance-based CT assessment at the end of the program. The data provide reliability and validity evidence for the Arduino assessment as a tool to measure CT. The pre- and post-test comparison indicates that students significantly improved their scores on the content-involved assessment aligned with the Arduino activities, but not on the content-free CT assessment. This shows that Arduino, or similar equipment, can be used to improve students' CT skills and that the Arduino maker activities can be used as performance assessments to measure students' engineering-related CT skills.
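The pre/post comparison described above is, in essence, a paired analysis. A minimal sketch with made-up scores for fifteen students (not the study's data) could look like this:

```python
# Paired pre/post comparison sketch with invented scores for 15 students.
import numpy as np
from scipy import stats

pre  = np.array([4, 6, 5, 7, 3, 5, 6, 4, 5, 6, 7, 5, 4, 6, 5])
post = np.array([6, 7, 6, 8, 5, 6, 7, 5, 7, 7, 8, 6, 5, 7, 6])

t_stat, p_value = stats.ttest_rel(post, pre)     # paired t-test
gain = post - pre
effect_size = gain.mean() / gain.std(ddof=1)      # within-subject effect size
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {effect_size:.2f}")
```

With a sample of fifteen, a non-parametric alternative such as the Wilcoxon signed-rank test would be a reasonable robustness check.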
Article
Full-text available
Objective: To evaluate the preferred method of assessment from medical students' perspective. Study Design: Cross-sectional study. Place and Duration of Study: This study was conducted at the Al-Tibri Medical College and Hospital, Isra University-Karachi Campus, from March 2020 to October 2020. Materials and Methods: A validated questionnaire was adopted to evaluate the preferred method of assessment. Data were collected after institutional ethical approval, and verbal consent was obtained from the respondents. A total of 150 undergraduate students from the 2nd, 3rd, and 4th year of MBBS were included through snowball sampling. Data were analyzed with SPSS version 21.0 and reported as frequencies and percentages. The chi-square test was applied to the qualitative data, with the level of significance set at P ≤ 0.05. Results: Of the 150 medical students, the mean age was 20.36 ± 4.00 years for 2nd-year students, 21.36 ± 0.95 years for 3rd-year students, and 22.87 ± 2.97 years for 4th-year students. There were 75 (50%) male and 75 (50%) female respondents. More students preferred MCQs over essay questions as an assessment format. Most respondents felt that MCQs provide concise understanding and wide coverage of the course, whereas essays enhance written expression. Conclusion: The results of the present study indicate that MCQs are the preferred assessment tool among medical students compared with essay questions. MCQs cover more content areas in a limited time and can assess higher levels of cognition; the level of competency assessed depends on the quality of the MCQ's structure.
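For readers unfamiliar with the analysis mentioned, a chi-square test of assessment preference by year of study can be computed as below; the counts are invented solely to show the procedure, not taken from the study.

```python
# Chi-square test of independence on an invented preference-by-year table.
from scipy.stats import chi2_contingency

# Rows: 2nd, 3rd, 4th year; columns: prefers MCQs, prefers essay questions.
observed = [
    [35, 15],
    [38, 12],
    [33, 17],
]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")
```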
Article
The growing interest of educational researchers in computational thinking (CT) has led to an expanding literature on assessments of CT skills and attitudes. However, few studies have examined whether CT attitudes influence CT skills. The present study examines the relationship between CT attitudes and CT skills for preservice teachers (PSTs). The Callysto CT test (CCTt) for Teachers was administered to n = 105 PSTs to measure their CT attitudes and skills. Structural equation modeling was used to examine the relationship of participants’ CT and problem-solving skills with their attitudes toward CT, technology, coding, and data. Findings revealed that CT attitudes predicted CT skills and provided the first step in exploring the validity and reliability of the CCTt instrument.
Chapter
Serious games are garnering popularity in learning environments and as assessment tools. We propose a summative assessment of a serious game as an assessment tool by merging assessment standards with serious game mechanics. To this end, we apply instructional design components with a focus on the evidence-centered game design approach (ECgD). Simultaneously, we introduce a different approach to game design and the traditional chain of effects toward competence assessment. Our leading questions are: how can the competences be operationalized and translated into game mechanics? Through which serious game mechanics can we prompt players to act in typical domain-specific situations and show their sustainable creative competence? How, and through which statistical models, can we match the observed competence of players with the intended competence model formulated a priori? To answer these questions, we developed the domain-specific serious game MyBUY to assess the sustainable creative competence (SC competence) of young adults in Vocational Education and Training in the field of retail and sales. By matching the intended competence (theoretical model) with the SC competence demonstrated while playing the serious game (empirical model), we found that the models were highly compatible. Further confirmation is given by the results of questionnaires on usability, cognitive load, and motivation. Our results affirm the need for future studies to apply our algorithm to design domain-specific serious games as competence assessment tools and to extend data collection and data analytics procedures in longitudinal studies.
Chapter
Coherence between adaptive instructional and summative assessment systems should provide teachers stronger support in challenging each student. Coherent systems should lead to accelerated student learning, and students being ready for the next grade. Range achievement level descriptors (RALDs) describe a state’s theory of what increasing knowledge, skills, and abilities look like in their standards as students become more sophisticated thinkers on their journey to proficiency and beyond. Systems can be connected to better align interpretations of student performance using task features that align to evidence statements in RALDs. Our proposition is that through coding tasks in both systems using common schema, including RALD-to-task match, assessment and instructional system inferences about student progress can be bridged. We combined two approaches for linking inferences across systems measuring mathematics that do not rely on common students or common tasks as a proof of concept. Using RALDs and other task features, we predicted 45% of the variance in task difficulties for a secondary data source. Holding all else constant, RALDs were the strongest feature for modeling increases to task difficulty. This suggests that RALDs could be leveraged in instructional systems to support interpretations of student growth, increasing their value for teachers.
Article
The importance of engaging students in disciplinary practices of science is widely acknowledged and well researched. What is less understood is how to assess students’ development of these practices. In particular, there is a need for understanding how formative assessment and feedback practices can be integrated into classroom instruction in ways that are linked to science practices and aligned with theories of learning. This paper examines the question with regards to scientific theory-building practices. It presents an approach to integrating formative assessment and feedback into science instruction and illustrates it with a narrative account of its implementation in one classroom. It describes changes in assessment results over the course of one unit and relates those changes to the feedback activities.
Book
Full-text available
Learning Trajectories in Mathematics: A Foundation for Standards, Curriculum, Assessment, and Instruction aims to provide: a useful introduction to current work and thinking about learning trajectories for mathematics education; an explanation of why we should care about these questions; and a strategy for how to think about what is being attempted in the field, casting some light on the varying, and perhaps confusing, ways in which the terms trajectory, progression, learning, teaching, and so on, are being used by the education community. Specifically, the report builds on arguments published elsewhere to offer a working definition of the concept of learning trajectories in mathematics and to reflect on the intellectual status of the concept and its usefulness for policy and practice. It considers the potential of trajectories and progressions for informing the development of more useful assessments and supporting more effective formative assessment practices, for informing the on-going redesign of mathematics content and performance standards, and for supporting teachers’ understanding of students’ learning in ways that can strengthen their capability for providing adaptive instruction. The authors conclude with a set of recommended next steps for research and development, and for policy.
Article
Full-text available
On September 2, 1957, Lee Cronbach delivered his visionary presidential address to the American Psychological Association (APA), calling for the unification of differential and experimental psychology, the two disciplines of scientific psychology. He described the essential features of each approach to asking questions about human nature, and he strongly hinted at the benefits to be gained by unification. Cronbach was calling for linking theories and research on learning and instruction, especially the instructional treatments that logically and psychologically followed from such research, with the tradition of assessing individual differences in cognitive abilities. In his opinion, such work would probably yield information of profound educational relevance. In describing some illustrative examples, he stated, quite boldly, "Such findings ... when replicated and explained, will carry us into an educational psychology which measures readiness for different types of teaching and which invents teaching methods to fit different types of readiness" (p. 681). He subsequently went even further and argued that this work had broader theoretical impact and meaning: "Constructs originating in differential psychology are now being tied to experimental variables. As a result, the whole theoretical picture in such an area as human abilities is changing" (p. 682).
Article
Full-text available
In educational assessment, we observe what students say, do, or make in a few particular circumstances and attempt to infer what they know, can do, or have accomplished more generally. A web of inference connects the two. Some connections depend on theories and experience concerning the targeted knowledge in the domain, how it is acquired, and the circumstances under which people bring their knowledge to bear. Other connections may depend on statistical models and probability-based reasoning. Still others concern the elements and processes involved in test construction, administration, scoring, and reporting. This article describes a framework for assessment that makes explicit the interrelations among substantive arguments, assessment designs, and operational processes. The work was motivated by the need to develop assessments that incorporate purposes, technologies, and psychological perspectives that are not well served by familiar forms of assessments. However, the framework is equally applicable to analyzing existing assessments or designing new assessments within familiar forms.
Article
Full-text available
This monograph discusses criteria for judging the alignment between expectations of student achievement and assessment. Alignment is central to current efforts of systemic and standards-based education reforms in mathematics and science. More than four-fifths of the states have content frameworks in place in mathematics and science, and a large number of these have some form of statewide assessment to measure student attainment of the expectations given in the frameworks. Various approaches to alignment have been attempted, but they have generally lacked specific criteria for judging the alignment. Twelve criteria for judging alignment, grouped into five categories, are described, along with examples and levels of agreement. The five general categories are: (1) content focus; (2) articulation across grades and ages; (3) equity and fairness; (4) pedagogical implications; and (5) system applicability. These criteria were developed by an expert panel, formed as a cooperative effort of the Council of Chief State School Officers and the National Institute for Science Education, to provide guidance to educators trying to develop a coherent system of expectations and assessments. An appendix lists the task force participants.
Article
Full-text available
In this article, we describe the principles that guided the creation and implementation of a system of embedded assessments—the so-called BEAR (Berkeley Evaluation and Assessment Research) Assessment System. The assessment system was developed in the context of a specific curriculum in issues-oriented science for the middle grades but is designed generically to address the implementation of those principles. The assessment system builds on methodological advances in alternative assessment techniques and attempts to address salient issues in the integration of alternative assessment into the classroom teaching and learning context. The 4 principles are described, and we discuss how the application of these principles generates the component parts of the system and determines how the component parts work together. The use of teacher moderation to integrate the parts of the system in school and classroom is also discussed. In recent years, "alternative assessment" has been a major topic of interest, debate, and experimentation in the nationwide efforts at educational reform. Initial hopes that alternative, authentic, or performance assessments of student achievement would drive (or at least facilitate) changes in what and how students are taught have been tempered by the realities of implementation. Efforts to introduce alternative assessments into large-scale, high-stakes state and district testing programs have met with mixed results due to high costs, logistical barriers, and political ramifications (e.g., Gipps, 1995; Rothman, 1995). For example, the demise of the California Learning Assessment System was due principally to the complications, technical, political, and financial, of using performance assessments for large-scale assessment.
Article
Full-text available
We propose a multilevel-multifaceted approach to evaluating the impact of education reform on student achievement that would be sensitive to context and small treatment effects. The approach uses different assessments based on their proximity to the enacted curriculum. Immediate assessments are artifacts (students' products) from the enactment of the curriculum; close assessments parallel the content and activities of the unit/curriculum; proximal assessments tap knowledge and skills relevant to the curriculum, but topics can be different; and distal assessments reflect state/national standards in a particular knowledge domain. To provide evidence about the sensitivity of the multilevel approach in ascertaining outcomes of hands-on science programs we administered close, proximal, and distal performance assessments to evaluate the impact of instruction based on two Full Option Science System units—Variables, and Mixtures and Solutions—in a Bay Area school district. Results indicated that close assessments were more sensitive to the changes in students' pre- to post-test performance than proximal assessments.
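The notion that assessments differ in their sensitivity to instruction can be illustrated by comparing pre-to-post effect sizes at each distance from the enacted curriculum. The simulation below builds in a larger gain for the close assessment than for the proximal and distal ones; all numbers are illustrative, not the study's results.

```python
# Comparing instructional sensitivity of close/proximal/distal assessments (simulated).
import numpy as np

rng = np.random.default_rng(1)
n_students = 120

def cohens_d(pre, post):
    pooled_sd = np.sqrt((pre.var(ddof=1) + post.var(ddof=1)) / 2)
    return (post.mean() - pre.mean()) / pooled_sd

# Assumed true standardized gains at each proximity level.
for label, true_gain in [("close", 0.8), ("proximal", 0.4), ("distal", 0.1)]:
    pre = rng.normal(0.0, 1.0, n_students)
    post = rng.normal(true_gain, 1.0, n_students)
    print(f"{label:8s} assessment: d = {cohens_d(pre, post):.2f}")
```

The closer the assessment sits to the enacted curriculum, the larger the observed effect, which is the pattern the multilevel approach is designed to detect.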
Book
In response to the No Child Left Behind Act of 2001 (NCLB), Systems for State Science Assessment explores the ideas and tools that are needed to assess science learning at the state level. This book provides a detailed examination of K-12 science assessment: looking specifically at what should be measured and how to measure it. Along with reading and mathematics, the testing of science is a key component of NCLB-it is part of the national effort to establish challenging academic content standards and develop the tools to measure student progress toward higher achievement. The book will be a critical resource for states that are designing and implementing science assessments to meet the 2007-2008 requirements of NCLB. In addition to offering important information for states, Systems for State Science Assessment provides policy makers, local schools, teachers, scientists, and parents with a broad view of the role of testing and assessment in science education.
Article
This article exemplifies how assessment design might be grounded in theory, thereby helping to strengthen validity claims. Spanning work across multiple related projects, the article first briefly summarizes an assessment system model for the elementary and secondary levels. Next the article describes how cognitive-domain theory and principles are used in the design of a scenario-based summative assessment for argumentation in the English language arts. Finally, results from several psychometric approaches are used to evaluate propositions suggested by the domain theory, including ones related to the use of topical scenarios and learning progressions in assessment design. Although results generally supported these propositions, the work described represents only a small step in a long-term, iterative process of theory development, assessment design, and empirical tryout, which should, in principle, lead to more valid assessments that better inform teaching and learning.
Book
Packed with examples from various subjects and grades, this guide walks readers through every step of the formative assessment process, from articulating learning goals to providing quality feedback.
Article
Background Concept inventories (CIs) are commonly used in engineering disciplines to assess students' conceptual understanding and to evaluate instruction, but educators often use CIs without sufficient evidence that a structured approach has been applied to validate inferences about student thinking. Purpose We propose an analytic framework for evaluating the validity arguments of CIs. We focus on three types of claims: that CI scores enable one to infer (1) students' overall understanding of all concepts identified in the CI, (2) students' understanding of specific concepts, and (3) students' propensity for misconceptions or common errors. Method We applied our analytic framework to three CIs: the Concept Assessment Tool for Statics (CATS), the Statistics Concept Inventory (SCI), and the Dynamics Concept Inventory (DCI). Results Using our analytic framework, we found varying degrees of support for each type of claim. CATS and DCI analyses indicated that the CIs could reliably measure students' overall understanding of all concepts identified in the CI, whereas SCI analyses provided limited evidence for this claim. Analyses revealed that the CATS could accurately measure students' understanding of specific concepts; analyses for the other two CIs did not support this claim. None of the CI analyses provided evidence that the instruments could reliably measure students' misconceptions and common errors. Conclusions Our analytic framework provides a structure for evaluating CI validity. Engineering educators can apply this framework to evaluate aspects of CI validity and make more warranted uses and interpretations of CI outcome scores.
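One piece of evidence behind the claim that a CI reliably measures overall understanding is internal consistency. The sketch below computes Cronbach's alpha over simulated dichotomous responses; the data are generated for illustration and are not CATS, SCI, or DCI responses.

```python
# Cronbach's alpha over simulated dichotomous concept-inventory data.
import numpy as np

rng = np.random.default_rng(2)
ability = rng.normal(size=(300, 1))        # 300 simulated examinees
difficulty = rng.normal(size=(1, 20))      # 20 simulated items
noise = rng.normal(size=(300, 20))
scores = (ability - difficulty + noise > 0).astype(int)   # 0/1 item scores

def cronbach_alpha(item_scores):
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1).sum()
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

print(f"alpha = {cronbach_alpha(scores):.2f}")
```

Claims about measuring specific concepts or misconceptions would require finer-grained evidence, such as subscale reliabilities or distractor analyses, rather than a single overall coefficient.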
Article
Under the argument-based approach to validity, test-score interpretations and uses that are clearly stated and are supported by appropriate evidence are considered to be valid. Conversely, interpretations or uses that are not well defined or that involve doubtful inferences or assumptions are not considered valid. The proposed interpretation and use of test scores are specified in terms of a network of inferences and assumptions (or an argument) leading from a test taker's observed performances to score-based conclusions and decisions for the test taker. The validity of the proposed interpretation and use can then be evaluated in terms of the completeness and coherence of the network and the plausibility of its inferences and assumptions.
Book
Learning progressions – descriptions of increasingly sophisticated ways of thinking about or understanding a topic (National Research Council, 2007) – represent a promising framework for developing organized curricula and meaningful assessments in science. In addition, well-grounded learning progressions may allow for coherence between cognitive models of how understanding develops in a given domain, classroom instruction, professional development, and classroom and large-scale assessments. Because of the promise that learning progressions hold for bringing organization and structure to often disconnected views of how to teach and assess science, they are rapidly gaining popularity in the science education community. However, there are significant challenges faced by all engaged in this work. In June 2009, science education researchers and practitioners, as well as scientists, psychometricians, and assessment specialists convened to discuss these challenges as part of the Learning Progressions in Science (LeaPS) conference. The LeaPS conference provided a structured forum for considering design decisions entailed in four aspects of work on learning progressions: defining learning progressions; developing assessments to elicit student responses relative to learning progressions; modeling and interpreting student performance with respect to a learning progression; and using learning progressions to influence standards, curricula, and teacher education. This book presents specific examples of learning progression work and syntheses of ideas from these examples and discussions at the LeaPS conference.
Book
Constructing Measures introduces a way to understand the advantages and disadvantages of measurement instruments, how to use such instruments, and how to apply these methods to develop new instruments or adapt old ones. The book is organized around the steps taken while constructing an instrument. It opens with a summary of the constructive steps involved. Each step is then expanded on in the next four chapters. These chapters develop the "building blocks" that make up an instrument--the construct map, the design plan for the items, the outcome space, and the statistical measurement model. The next three chapters focus on quality control. They rely heavily on the calibrated construct map and review how to check if scores are operating consistently and how to evaluate the reliability and validity evidence. The book introduces a variety of item formats, including multiple-choice, open-ended, and performance items; projects; portfolios; Likert and Guttman items; behavioral observations; and interview protocols. Each chapter includes an overview of the key concepts, related resources for further investigation and exercises and activities. Some chapters feature appendices that describe parts of the instrument development process in more detail, numerical manipulations used in the text, and/or data results. A variety of examples from the behavioral and social sciences and education including achievement and performance testing; attitude measures; health measures, and general sociological scales, demonstrate the application of the material. An accompanying CD features control files, output, and a data set to allow readers to compute the text's exercises and create new analyses and case archives based on the book's examples so the reader can work through the entire development of an instrument. Constructing Measures is an ideal text or supplement in courses on item, test, or instrument development, measurement, item response theory, or rasch analysis taught in a variety of departments including education and psychology. The book also appeals to those who develop instruments, including industrial/organizational, educational, and school psychologists, health outcomes researchers, program evaluators, and sociological measurers. Knowledge of basic descriptive statistics and elementary regression is recommended. © 2005 by Lawrence Erlbaum Associates, Inc. All rights reserved.
Article
To validate an interpretation or use of test scores is to evaluate the plausibility of the claims based on the scores. An argument-based approach to validation suggests that the claims based on the test scores be outlined as an argument that specifies the inferences and supporting assumptions needed to get from test responses to score-based interpretations and uses. Validation then can be thought of as an evaluation of the coherence and completeness of this interpretation/use argument and of the plausibility of its inferences and assumptions. In outlining the argument-based approach to validation, this paper makes eight general points. First, it is the proposed score interpretations and uses that are validated and not the test or the test scores. Second, the validity of a proposed interpretation or use depends on how well the evidence supports the claims being made. Third, more-ambitious claims require more support than less-ambitious claims. Fourth, more-ambitious claims (e.g., construct interpretations) tend to be more useful than less-ambitious claims, but they are also harder to validate. Fifth, interpretations and uses can change over time in response to new needs and new understandings leading to changes in the evidence needed for validation. Sixth, the evaluation of score uses requires an evaluation of the consequences of the proposed uses; negative consequences can render a score use unacceptable. Seventh, the rejection of a score use does not necessarily invalidate a prior, underlying score interpretation. Eighth, the validation of the score interpretation on which a score use is based does not validate the score use.
Article
Standards-based score reports interpret test performance with reference to cut scores defining categories like "below basic" or "proficient" or "master." This paper first develops a conceptual framework for validity arguments supporting such interpretations, then presents three applications. Two of these serve to introduce new standard-setting methods. The conceptual framework lays out the logic of validity arguments in support of standards-based score interpretations, focusing on requirements that the performance standard (i.e., the characterization of examinees who surpass the cut score) be defensible both as a description and as a normative judgment, and that the cut score accurately operationalize that performance standard. The three applications illustrate performance standards that differ in the breadth of the claims they set forth. The first, a "criterion-referenced testing" application, features a narrow performance standard that corresponds closely to performance on the test itself. The second, "minimum competency testing," introduces a new standard-setting method that might be used when there is a weaker linkage between the test and the performance standard. The third, a contemporary standards-based testing application, proposes a new procedure whereby the performance standard would be derived directly from the specification for the test itself.
Article
This paper is divided into two main sections. The first half of the paper focuses on the intent and practice of diagnostic assessment, providing a general organizing scheme for a diagnostic assessment implementation process, from design to scoring. The discussion includes specific concrete examples throughout, as well as summaries of data studies as appropriate. The second half of the paper focuses on one critical component of the implementation process – the specification of an appropriate psychometric model. It includes the presentation of a general form for the models as an interaction of knowledge structure with item structure, a review of each of a variety of selected models, separate detailed summaries of knowledge structure modeling and item structure modeling, and lastly some summarizing and concluding remarks. To make the scope manageable, this part is restricted to models for dichotomously scored items. Throughout the paper, practical advice is given about how to apply and implement the ideas and principles discussed.
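As one concrete instance of this "knowledge structure by item structure" interaction for dichotomously scored items, the DINA model (one member of the family such reviews cover, offered here only as an illustration) can be written as

$$P(X_{ij}=1 \mid \boldsymbol{\alpha}_i) = (1-s_j)^{\eta_{ij}}\, g_j^{\,1-\eta_{ij}}, \qquad \eta_{ij} = \prod_{k} \alpha_{ik}^{\,q_{jk}},$$

where $\alpha_{ik}$ indicates examinee $i$'s mastery of knowledge attribute $k$ (the knowledge structure), $q_{jk}$ indicates whether item $j$ requires attribute $k$ (the item structure), and $s_j$ and $g_j$ are the slip and guess parameters of item $j$.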
Article
A central theme throughout the impressive series of philosophical books and articles Stephen Toulmin has published since 1948 is the way in which assertions and opinions concerning all sorts of topics, brought up in everyday life or in academic research, can be rationally justified. Is there one universal system of norms, by which all sorts of arguments in all sorts of fields must be judged, or must each sort of argument be judged according to its own norms? In The Uses of Argument (1958) Toulmin sets out his views on these questions for the first time. In spite of initial criticisms from logicians and fellow philosophers, The Uses of Argument has been an enduring source of inspiration and discussion to students of argumentation from all kinds of disciplinary background for more than forty years.
Article
Authentic and direct assessments of performances and products are examined in the light of contrasting functions and purposes having implications for validation, especially with respect to the need for specialized validity criteria tailored for performance assessment. These include contrasts between performances and products, between assessment of performance per se and performance assessment of competence or other constructs, between structured and unstructured problems and response modes, and between breadth and depth of domain coverage. These distinctions are elaborated in the context of an overarching contrast between task-driven and construct-driven performance assessment. Rhetoric touting performance assessments because they eschew decomposed skills and decontextualized tasks is viewed as misguided, in that component skills and abstract problems have a legitimate place in pedagogy. Hence, the essence of authentic assessment must be sought elsewhere, that is, in the quest for complete construct representation. With this background, the concepts of “authenticity” and “directness” of performance assessment are treated as tantamount to promissory validity claims that they offset, respectively, the two major threats to construct validity, namely, construct underrepresentation and construct-irrelevant variance. With respect to validation, the salient role of both positive and negative consequences is underscored as well as the need, as in all assessment, for evidence of construct validity.
Article
This article is a review of the literature on classroom formative assessment. Several studies show firm evidence that innovations designed to strengthen the frequent feedback that students receive about their learning yield substantial learning gains. The perceptions of students and their role in self‐assessment are considered alongside analysis of the strategies used by teachers and the formative strategies incorporated in such systemic approaches as mastery learning. There follows a more detailed and theoretical analysis of the nature of feedback, which provides a basis for a discussion of the development of theoretical models for formative assessment and of the prospects for the improvement of practice.
Article
Outlines a general argument-based approach to validation, develops an interpretive argument for a placement test as an example, and examines some key properties in interpretive arguments. Validity is associated with the interpretation assigned to test scores rather than with the test scores or the test. The interpretation involves an argument leading from the scores to score-based statements or decisions, and the validity of the interpretation depends on the plausibility of this interpretive argument. The interpretive arguments associated with most test-score interpretations involve multiple inferences and assumptions. An explicit recognition of the inferences and assumptions in the interpretive argument makes it possible to identify the kinds of evidence needed to evaluate the argument. Evidence for the inferences and assumptions in the argument supports the interpretation, and evidence against any part of the argument casts doubt on the interpretation. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
This article presents findings from two projects designed to improve evaluations of technical quality of alternate assessments for students with the most significant cognitive disabilities. We argue that assessment technical documents should allow for the evaluation of the construct validity of the alternate assessments following the traditions of Cronbach (1971), Messick (1989, 1995), Linn, Baker, and Dunbar (1991), and Shepard (1993). The projects used the work of Knowing What Students Know (Pellegrino, Chudowsky, & Glaser, 2001) to structure and focus the collection and evaluation of assessment information. The heuristic of the assessment triangle (Pellegrino et al., 2001) was particularly useful in emphasizing that the validity evaluation needs to consider the logical connections among the characteristics of the students tested and how they develop domain proficiency (the cognition vertex), the nature of the assessment (the observation vertex), and the ways in which the assessment results are interpreted (the interpretation vertex). This project has shown that in addition to designing more valid assessments, the growing body of knowledge about the psychology of achievement testing can be useful for structuring evaluations of technical quality.
Article
Cronbach made the point that for validity arguments to be convincing to diverse stakeholders, they need to be based on assumptions that are credible to these stakeholders. The interpretations and uses of high-stakes test scores rely on a number of policy assumptions about what should be taught in schools, and more specifically, about the content standards and performance standards that should be applied to students and schools. For example, a high-school graduation test can be developed as a measure of readiness for the world of work, for college, or for citizenship and the activities of daily life. The assumptions built into the assessment need to be subjected to scrutiny and criticism if a strong case is to be made for the validity of the proposed interpretation and use.
Article
Scholarship on learning progressions (LPs) in science has emerged over the past 5 years, with the first comprehensive descriptions of LPs, on the nature of matter and evolution, published as commissioned reports (Catley, Lehrer, & Reiser, 2005; Smith, Wiser, Anderson, & Krajcik, 2006). Several recent policy reports have advocated for the use of LPs as a means of aligning standards, curriculum, and assessment (National Research Council [NRC], 2005, 2007). In some ways, LPs are not a new idea; developmental psychologists have long been examining the development of children's ideas over time in several scientific domains. However, the emerging research offers renewed interest, a new perspective, and potentially new applications for this construct. For these reasons, this special issue of the Journal for Research in Science Teaching is timely. Our goal in this introduction is to explain the motivation for developing LPs, propose a consensual definition of LPs, describe the ways in which these constructs are being developed and validated, and finally, discuss some of the unresolved questions regarding this emerging scholarship.
Article
Educational test theory consists of statistical and methodological tools to support inference about examinees’ knowledge, skills, and accomplishments. Its evolution has been shaped by the nature of users’ inferences, which have been framed almost exclusively in terms of trait and behavioral psychology, and focused on students’ tendency to act in prespecified ways in prespecified domains of tasks. Progress in the methodology of test theory enabled users to extend the range of inference and ground interpretations more solidly within these psychological paradigms. Developments in cognitive and developmental psychology have broadened the range of inferences we wish to make about students’ learning to encompass conjectures about the nature and acquisition of their knowledge. The same underlying principles of inference that led to standard test theory can support inference in this broader universe of discourse. Familiar models and methods-sometimes extended, sometimes reinterpreted, sometimes applied to problems wholly different from those for which they were first devised-can play a useful role to this end.
Article
Evidence-centered assessment design (ECD) provides language, concepts, and knowledge representations for designing and delivering educational assessments, all organized around the evidentiary argument an assessment is meant to embody. This article describes ECD in terms of layers for analyzing domains, laying out arguments, creating schemas for operational elements such as tasks and measurement models, implementing the assessment, and carrying out the operational processes. We argue that this framework helps designers take advantage of developments from measurement, technology, cognitive psychology, and learning in the domains. Examples of ECD tools and applications are drawn from the Principled Assessment Design for Inquiry (PADI) project. Attention is given to implications for large-scale tests such as state accountability measures, with a special eye for computer-based simulation tasks.
Article
We are at the end of the first century of work on models of educational and psychological measurement and into a new millennium. This certainly seems like an appropriate time for looking backward and looking forward in assessment. Furthermore, a new edition of the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) has been published, and the previous editions of the Standards have served as benchmarks in the development of measurement theory. This backward glance will be just that, a glance. After a brief historical review focusing mainly on construct validity, the current state of validity theory will be summarized, with an emphasis on the role of arguments in validation. Then how an argument-based approach might be applied will be examined in regards to two issues in validity theory: the distinction between performance-based and theory-based interpretations, and the role of consequences in validation.