Article

The Relationship between Teacher Assessments and Pupil Attainments in Standard Test Tasks at Key Stage 2, 1996–98

Abstract

This article explores relationships between pupil attainments on standard National Curriculum tests at the end of Key Stage 2, teacher assessments, and pupil characteristics of gender, age, English as an additional language (EAL), and special educational needs (SEN) using representative samples drawn from schools in England in 1996, 1997 and 1998. Levels of agreement between teacher assessments and test results were remarkably consistent across all years and all subject areas. In all subject areas, teacher assessments were more likely to be lower than corresponding test results for pupils with SEN compared to their peers. Other pupil characteristics demonstrated only weak associations with extent of agreement. There was evidence that schools have become more similar over time with regard to patterns of differences between teacher assessments and test results. This is suggestive of increased consistency amongst teachers in the way that they interpret and apply the Key Stage assessment levels.

... shows that results of TA and standard tasks agree, and are to an extent consistent with the recognition that they assess similar but not identical achievements (Reeves et al., 2001). • The clearer teachers are about the goals of students' work, the more consistently they apply assessment criteria (Hargreaves et al., 1996). ...
... The evidence in Koretz et al. (1994), Shapley and Bush (1999), and Rowe and Hill (1996) is derived from rescoring the work assessed by teachers. Reeves et al. (2001), Thomas et al. (1998), and Rowe and Hill provide evidence relating to the consistency with which assessment criteria are applied by teachers, while Abbott et al. (1994) report on the consistency of administration of performance assessment. Two high-weight studies concerned assessment systems which involve teachers assessing portfolios of students' work. ...
... For each attainment target, teachers were to judge achievement against level descriptions, using a 'best fit' approach.] Reeves et al. (2001) reported data collected by the School Sampling Project, a longitudinal project started in 1995 which 'tracks pupil performance and schools' implementation of the curriculum over a period of time based on a sample framework of 1000 schools' (Reeves et al., 2001, p 143). In the NCA, students are assigned levels of 1 to 8, 4 being the target level for 11 year-olds. ...
Technical Report
The ALRSG was created as one of the first wave of the Evidence for Policy and Practice Information and Co-ordinating Centre (EPPI-Centre) Review Groups in 2000 and undertook its first review from February 2001 to January 2002. This was entitled 'A systematic review of the impact of summative assessment and testing on students' motivation for learning' and was published in the Research Evidence in Education Library (REEL) in 2002 (Harlen and Deakin Crick, 2002). The second review, conducted from February 2002 to January 2003, was concerned with the impact on students and teachers of the use of Information and Communication Technologies (ICT) for assessment of creative and critical thinking skills, and was published on REEL in 2003 (Harlen and Deakin Crick, 2003a).
... Teachers, working as they do within the human dynamics of the classroom, are potentially influenced both in their teaching interaction and in their assessment activity by student characteristics other than those that are in principle being assessed (Morgan & Watson 2002; Harlen 2004, 2005; Martinez, Stecher & Borko 2009). These 'construct irrelevant' characteristics include the student's gender (Lafontaine & Monseur 2009), ethnicity (Burgess & Greaves 2009), socioeconomic status (Hauser-Cram, Sirin & Stipek 2003; Wyatt-Smith & Castleton 2005), EAL and SEN status (Thomas, Madaus, Raczek & Smees 1998; Reeves, Boyle & Christie 2001), and personal qualities, such as behaviour and effort (Bennett, Gottesman, Rock & Cerullo 1993; Morgan & Watson 2002; Wyatt-Smith & Castleton 2005). The same phenomenon has been noted with respect to workplace assessors in vocational education and training (Wolf 1995). ...
... The reliability of national curriculum tests has been questioned (Hutchison & Schagen 1994; Wiliam 2001, 2003) and explored (see, for example, Newton 2009). And attempts have been made to evaluate the reliability of teachers' judgements, which have never been subject to moderation at key stages 2 and 3, by reviewing rates of agreement with test results (see, for example, Reeves, Boyle & Christie 2001). ...
... But unfortunately no record is kept of the date on which individual schools submit their judgements, so that an analysis based on independent assessments is not possible to undertake. Reeves et al. (2001) noted this, as did Rose (1999), in the report of the independent scrutiny panel on the 1999 key stage 2 tests. ...
... A large number of studies report variations in assessment and grading between schools. The school is mentioned as an important factor accounting for differences between teacher assessments and external test scores in several countries including the United States, the UK, Germany, Sweden, and The Netherlands (De Lange & Dronkers, 2007; Harlen, 2004, 2005; Himmler & Schwager, 2007; Reeves, Boyle, & Christie, 2001; Thomas, Madaus, Raczek, & Smees, 1998; Wikström & Wikström, 2005; Willingham et al., 2002). The findings relate to both primary and secondary education. ...
... This implies that schools where students on average score high marks in the BCSE exam tend to score relatively low CA averages and vice versa. Similar arguments have been expressed in several countries on the basis of empirical research (De Lange & Dronkers, 2007; Himmler & Schwager, 2007; Reeves et al., 2001; Thomas et al., 1998; Wikström & Wikström, 2005; Willingham et al., 2002). The results of the present study may be summarized as indicating a moderate degree of conformity between the CA marks and the BCSE marks within schools and nonconformity between schools. ...
... The inter-rater differences may further be inferred to imply differences between schools in either the interpretation or the application of the standards for assessing listening and speaking skills. Researchers like Himmler and Schwager (2007), Reeves et al. (2001), and Wikström and Wikström (2005) have reported differences between schools in teacher assessments and scores on standardized tests as well. ...
Article
This study explores the validity of school-based assessments when they serve to supplement scores on central tests in high-stakes examinations. The school-based continuous assessment (CA) marks are compared to the marks scored on the central written Bhutan Certificate of Secondary Education (BCSE) examination, to detailed teacher ratings of student competencies, and to student self-ratings. A survey was undertaken in 10 higher secondary schools, involving 26 English teachers and 365 graduates. Though results indicate moderate conformity among measurements within schools, results between schools indicate that schools with high average scores on the BCSE exam tend to score relatively low CA averages and vice versa. Compared to the CA marks for student performance in English listening and speaking skills, the detailed teacher ratings of students on the same skills correlate more strongly with the BCSE exam marks and the student self-ratings.
... The National Curriculum Standard Assessment Tasks (SATs) employed to assess children's academic achievement in the U.K. combine tests with teacher assessments. Despite scepticism, reliability estimates for the SATs tests have been high (Hurry, 1999; Reeves, Boyle, & Christie, 2001). One strength of SATs assessments is that they sample a broad range of skills compared to other standardised tests. ...
... and .87 have been reported (Cronbach, 1960). Reeves et al. (2001) argue that the teacher assessment (SATs TA) is an essential part of the National Curriculum assessment arrangements. ...
... An important question concerns the relationship between TR and TA. Teacher expectations and test results are not completely independent of one another and teachers are able to consult TR to assist them with TA. Reeves et al. (2001) explored the relationship between SATs TR and SATs TA for Key Stage 2 results for over 6000 children. Comparison of SATs TR and TA for individual pupils revealed a remarkably high level of consistency across years in all three subjects (Math, English, and Science) ranging from 73% to 77%. ...
Article
The association between bullying behaviour and academic achievement was investigated in 1016 children from primary schools (6–7-year-olds/year 2: 480; 8–9-year-olds/year 4: 536). Children were individually interviewed about their bullying experiences using a standard interview. Key Stage 1 National Curriculum results (assessed at the end of year 2) were collected from class teachers, and parents completed a behaviour and health questionnaire. Results revealed no relationship between direct bullying behaviour and decrements in academic achievement. Conversely, higher academic achievement at year 2 predicted bullying others relationally (e.g. social exclusion) at year 4. Relational victimisation, Special Educational Needs (SEN), being a pupil from a rural school or small classes and low socioeconomic status (SES) predicted low academic achievement for year 2 children. Findings discount the theory that underachievement and frustration at school lead to direct, physical bullying behaviour.
... For example, crosstabulation analyses on a nationwide sample of 600,000 children have shown that agreement between NC teacher assessments of reading and scores on group-administered NC reading tests at Key Stage 1 is good (Cohen's κ = .80). Similarly high levels of agreement have been reported for Key Stage 2 teacher and test assessments (Reeves, Boyle, & Christie, 2001). ...
... A second difficulty concerns test validity. Although levels of agreement between NC teacher and test assessments of achievement are high (Reeves et al., 2001), there is evidence that teacher assessments may be influenced by nonreading characteristics of the children. For example, an analysis of Key Stage 2 NC scores showed that teacher assessments were more likely to be lower than test results when pupils had special educational needs (Reeves et al., 2001). ...
... Although levels of agreement between NC teacher and test assessments of achievement are high (Reeves et al., 2001), there is evidence that teacher assessments may be influenced by nonreading characteristics of the children. For example, an analysis of Key Stage 2 NC scores showed that teacher assessments were more likely to be lower than test results when pupils had special educational needs (Reeves et al., 2001). These limitations do not apply to the PIAT rc , but psychometric measures, especially when administered in nonstandard ways such as the Internet, may have their own problems. ...
Article
Little is known about the underlying causes and developmental patterns of stability and change in early reading abilities. In a longitudinal study of twins (n=4,291 pairs), individual differences in reading achievement assessed by teachers using U.K. National Curriculum (NC) criteria showed substantial heritabilities at ages 7, 9, and 10 years (.57-.67) and modest shared environmental influences (.10-.17). Stability in NC scores was primarily mediated genetically. There was also evidence for age-specific genetic influences at each age. Genetic influences on reading are substantial and stable during the elementary school years despite the shift from "learning to read" to "reading to learn."
... Although statistical strategies are developed to moderate SBA scores to enhance comparability at the school level, it is still rather difficult to ensure a very high level of comparability when the design and the context of assessment tasks and individual teachers' interpretations of the assessment criteria may vary. Therefore, SBA is usually considered as less objective and trustworthy than standardised tests (Reeves, Boyle, & Christie, 2001) and so can be a challenge to public assessment, in particular to highly competitive systems like the NCEE in Mainland China. This may explain the reasons for not finding SBA in the SPCS. ...
... Some challenges for implementing SBA have been discussed in the literature, including (i) the weakness in the objectivity of its result, which is caused by the variations among individual teachers' interpretations of the assessment criteria and their judgments of students' performance (Hill et al., 1997; Reeves et al., 2001); (ii) the high requirements that SBA places on teachers' expertise in assessment and teaching (Yip & Cheung, 2005); and (iii) the increased workload for teachers and students (Board of Studies, 1998; Cheung, 2001). These challenges are common at the practical or technical level. ...
... Many empirical studies in this field have focused on comparing teacher assessments of pupil achievement and test scores, an approach we follow in this paper (Murphy 1981; Delap 1995; Plewis 1997; Thomas et al. 1998; Reeves et al. 2001; Dhillon 2005; Gibbons and Chevalier 2008; Martínez et al. 2009). Some of this literature has suggested that indeed teachers do over- and under-state pupil ability relative to test scores in particular subjects and that such differences are systematically linked to pupils' personal characteristics, including their socio-economic background and ethnicity (Plewis 1997, for example). ...
... There is not universal agreement in the literature, however. Reeves et al. (2001) found that teacher assessment and test assessment were very consistent, using English data on 11-year-olds (key stage 2 tests). Gibbons and Chevalier (2008) found, again for England, that teachers' assessments of pupils at age 15 did vary systematically from test-based measures of pupils' achievement at that age (key stage 3 tests). ...
Article
Background: Education systems rely on both teacher and test-based assessments. Where these assessments are used for summative purposes particularly, it is important to understand why, and for which groups of students, teachers’ assessments may produce different results from test-based assessments. Purpose: This paper assesses whether the difference between teacher and test-based summative assessment at age 11 and 15 – as a measure of bias or uncertainty in assessment – is linked systematically to observable pupil characteristics. We also consider the important question of whether these measures of uncertainty and bias in assessment have any relationship with pupils’ subsequent educational choices at age 16. Sample: To the best of our knowledge, this is the first paper focused on the Spanish educational system that has compared teacher and test-based assessments for school-age children. The sample consists of 3000 primary and secondary students. Design and methods: This research is conducted using regression analysis on quantitative data, comparing teacher and test-based assessments conducted during the academic year 2009/10. Results: The gap between test scores and teacher assessments is particularly large for female students in mathematics, who obtain significantly higher teacher assessments compared with their test scores. Students from semi-private schools (concertadas), who are more likely to be socially advantaged, gain higher test scores in relation to their teachers’ assessments. This is particularly noticeable in reading. Conclusions: This finding is consistent with the notion that teachers overestimate the achievement of girls in mathematics, relative to their test scores. Teachers tend to assess pupils in a relative way and hence teachers in semi-private schools (with higher achieving pupils) underestimate the achievement of their pupils.
... A raft of research studies in assessment practices indicate a teacher bias in favour of girls. Bennett, Gottesman, Rock, and Cerullo (1993) found that teachers adversely judged boys' achievement because of their poor classroom behaviour; Shorrocks, Daniels, Staintone, and Ring (1993) found an assessment bias towards girls in English; while Reeves, Boyle, and Christie (2001) highlighted a tendency for teachers to under-rate boys' performance and potential. Looking specifically at writing, Peterson and Kennedy (2006) explored the impact of the teachers' knowledge of the gender of the writer on their marking of the text. ...
... If the differences suggest that grade for grade, boys' writing is more like better than weak writing, then this raises the possibility that their writing may be being under-graded. This suggestion would confirm research suggesting a bias in favour of girls in teacher assessment practices (Bennett et al, 1993;Reeves et al, 2001;Shorrocks et al, 1993), and has particular resonance with Petersen and Kennedy's study (2006) that found teachers judged writing they believed to be written by a boy more harshly, and that "teachers' assessments of the quality of the writing were often influenced by their perceptions of the writer's gender" (Petersen & Kennedy, 2006, p. 42). A substantial body of research explores the biases that lead to boys' domination of oral activities (Davies, 1998;Younger, Warrington & Williams, 1999) and how teachers position girls less powerfully in oral interactions (Cheshire & Jenkins, 1990), but relatively little research investigates whether teachers' view of girls as better writers positions boys unfavourably. ...
Article
Set in the context of international concerns about boys' achievements in writing, this article presents research that explores gender differences or similarities in linguistic competence in writing. Drawing on the results of a large-scale analysis of the linguistic characteristics of secondary-aged writers, we outline gender difference in the sample. The article explains the limited differences revealed through this analysis but highlights the repeated pattern of differences in boys' writing, mirroring parallel patterns in able writers. The findings are discussed in light of the prevalent discourse of difference that permeates academic, professional, and political consideration of gender and writing.
... They find evidence of significant discrimination with exams assigned to lower caste children being given grades between 0.03 and 0.09 standard deviations below those assigned to higher caste children. All these results are consistent with earlier research with smaller sample sizes (Reeves, Boyle, & Christie, 2001;Thomas, Madaus, Raczek, & Smees, 1998). Qualitative work adds to this picture: in his study of a UK multi-ethnic school (Gillborn, 1990) argues that "teacher-student interaction was fraught with conflict and suspicion" for Black Caribbean pupils. ...
... The lower part of Table 1 confirms that most of the distribution of TA-KS covers the values (-1, 0, +1). About three quarters of pupils have TA-KS equal to zero, consistent with previous research (Reeves et al, 2001;Thomas et al, 1998;Gibbons et al. 2008), and less than 5% of pupils have an absolute difference greater than one. There are differences between subjects: TA-KS<0 or "under-assessment" is much more common than "over-assessment" in English and science, but in maths "overassessment" is slightly more common. ...
Article
We assess whether ethnic minority pupils are subject to low teacher expectations. We exploit the English testing system of “quasi-blind” externally marked tests and “non-blind” internal assessment to compare differences in these assessment methods between White and ethnic minority pupils. We find evidence that some ethnic groups are systematically “under-assessed” relative to their White peers, while some are “over-assessed”. We propose a stereotype model in which a teacher’s local experience of an ethnic group affects assessment of current pupils; this is supported by the data.
... Although the validity of teacher assessments has been questioned (for example, Davies & Brember, 1994;Demaray & Elliot, 1998;Glascoe, 2001;Reeves, Boyle, & Christie, 2001), a review of the literature has concluded that on the whole they are valid (Hoge & Coladarci, 1989). Furthermore, in the TEDS sample, Key Stage 1 teacher-assessed reading correlates .68 with a brief test of early word recognition (Test of Early Word Reading Efficiency; Torgesen, Wagner, & Rashotte, 1999) that we administered via telephone to 5,808 seven-year-olds, thus providing additional support for the validity of teacher assessments (Dale, Harlaar, & Plomin, submitted). ...
... As discussed in the Methods section, the validity of teacher assessments has been questioned. Evidence does exist that teacher assessments contain some level of bias, if discrepancies with test scores are counted as a sign of bias (Davies & Brember, 1994;Reeves et al., 2001). However, a meta-analysis of the literature suggests that teacher assessments are largely valid (Hoge & Coladarci, 1989). ...
Article
Although it is well established that school characteristics (SCH) and socio‐economic status (SES) are associated with academic achievement (ACH), these correlations are not necessarily causal. Because academic achievement shows substantial genetic influence, it is useful to embed such investigations in genetically sensitive designs in order to examine environmental influences more precisely by controlling genetic influence on ACH. In the first study of this kind for academic achievement, data were collected for 1,063 same‐sex pairs of seven‐year‐old MZ and DZ twins for teacher‐assessed ACH, UK statistics on SCH, and parent‐reported SES. Exclusive of genetic influence on school achievement, shared environment (environmental influences that make siblings similar) accounts for 12% of the variance in academic achievement. SCH accounts for 17% and SES accounts for 83% of this shared environmental variance. Exclusive of genetic and shared environmental influence including SCH and SES, nonshared environment (environmental influences that do not make siblings similar) accounts for 19% of the variance in academic achievement. The importance of nonshared environmental influences on academic achievement leads to the question of what these child‐specific experiences might be that are not shared by children in the same family, school, and classroom.
... Although the validity of teacher assessments has been questioned (for example, Davies & Brember, 1994;Demaray & Elliot, 1998;Glascoe, 2001;Reeves, Boyle, & Christie, 2001), a review of the literature has concluded that on the whole they are valid (Hoge & Coladarci, 1989). Furthermore, in the TEDS sample, Key Stage 1 teacher-assessed reading correlates .68 with a brief test of early word recognition (Test of Early Word Reading Efficiency; Torgesen, Wagner, & Rashotte, 1999) that we administered via telephone to 5,808 seven-year-olds, thus providing additional support for the validity of teacher assessments (Dale, Harlaar, & Plomin, submitted). ...
... As discussed in the Methods section, the validity of teacher assessments has been questioned. Evidence does exist that teacher assessments contain some level of bias, if discrepancies with test scores are counted as a sign of bias (Davies & Brember, 1994;Reeves et al., 2001). However, a meta-analysis of the literature suggests that teacher assessments are largely valid (Hoge & Coladarci, 1989). ...
Article
Although prior research has examined children’s perceptions of the classroom environment as related to academic achievement, genetically sensitive designs have not been employed. In the first study of its kind for the primary school classroom environment, data were collected for 3,020 pairs of nine‐year‐old identical and fraternal twin pairs in same and different classrooms on their perceptions in six domains: social integration, opportunity, adventure, general satisfaction, negative affect, and teachers. Data were also collected for teacher‐assessed academic achievement (ACH). Modest genetic influence was found for children’s perceptions of the classroom environment: an average of .33, .06, .25, .27, .19, and .20 of the variance, respectively. Non‐shared environment played a more influential role, accounting for an average of .58, .78, .64, .60, .69, and .65 of the variance, respectively. Negative affect, adventure, social integration, and opportunity were significantly, albeit modestly, associated with ACH. Results suggest that perceptions of the classroom environment are driven primarily by child‐specific experiences, and that such perceptions, although experientially important, are less important for ACH.
... Recent discussions on assessment have been mostly driven by a focus on determining how well students are doing. It has been shown by a number of studies that evaluating the capabilities and performance of a teacher is as necessary and valuable as teaching itself (See for example: Rimfeld et al., 2019;Chung, 2008;Reeves et al., 2001). To further this goal of improving teacher performance, assessments and evaluations of teacher performance skills are also carried out. ...
Article
The purpose of this research is to analyze and evaluate the Teacher Competency Test (TCT) implementation in Makassar, South Sulawesi, Indonesia. In addition, this research analyzed the perspectives of instructors, officer representatives, and academics on the TCT Objective and Implementation, as well as the Program's Effectiveness and Efficiency. This research used a qualitative approach and in-depth interviews. The results imply that the current TCT does not accurately reflect teachers' real-world proficiency owing to teachers' lack of computer literacy, since they were evaluated using a computer-based testing. However, the outcomes of this evaluation are also valuable for government decision-making on the mapping of teachers' requirements and future development. Second, the development of TCT program management and the quantity of financing for teacher training must be enhanced to assure the program's future viability. This research concludes by recommending a strategic shift to enhance student learning: mapping teachers' competency and education and the Sustainable Teachers Training and Development Program. Proper and ongoing professional development would not only assist instructors in supporting successful learning, but it would also enable experienced teachers to work with novice teachers and act as mentors for them, therefore enhancing classroom learning circumstances.
... The harsh judgement of teachers for boys made them perform badly. (Reeves, Boyle & Christie, 2001). It was found that teachers underrated boys' performances. ...
Article
The present study examines the difference in ESL academic writing of boys' and girls' in their written assignments. It aims at exploring differences in ESL writing based on the variable of gender. The data site for this study was a Diploma class at the Department of English FC, NUML Islamabad, where it was collected from 24 participants, i.e., 12 boys and 12 girls, who were asked to write an essay. The conceptual framework of Swan (1992) underpins the present study. The data were analyzed through a qualitative and quantitative method. The study found that the subtopics highlighted in their writings were different and approached variedly. The study also showed that the girls' writings are more reflective and subjective, and they made use of personal pronouns more often, whereas boys prefer being objective and used a third-person pronoun. Also, their writings were more fact and figure based, which was absent in the essays written by girls.
... However, during the past decade, the trustworthiness of teachers' intuitive judgement has been questioned. An array of studies showed a lack of validity and reliability when the accuracy of teachers' intuitive judgement was compared with objective measures such as standardized tests (Brookhart, 1994, 2003, 2013; Feinberg & Shapiro, 2009; Harlen & Deakin, 2002; Hoge & Coladarci, 1989; Reeves, Boyle, & Christie, 2001). Mostly, these studies showed that intuitive teacher judgement disadvantaged specific groups such as low achievers, pupils with special educational needs or pupils from lower social classes (Briscoe, 1991; Brookhart, 2013; Kelly, 1914; Rugg, 1918; Starch & Elliott, 1912; Stiggins, 2005). ...
Book
Although many scholars agree that both data and intuition influence teachers' decisions, there is little insight into how these rational and intuitive processes mutually influence teacher judgement. We developed and tested a theoretical framework that proved to be a valuable lens to understand the interplay of both processes in the different steps of decision making. This dissertation provides insight into the conceptualisation of both data-based and recognition-primed teacher judgement and describes conditions that can help prevent decision bias. In this manner, we offer a valuable starting point for theory and practice to understand how teachers make decisions and we provide recommendations to enhance both rational and intuitive processes of professional teacher judgement.
... There was evidence of bias and error in teachers' summative assessment in the findings of some studies. This was generally due to teachers taking into account information about non-relevant aspects of students' behaviour (2), or being apparently influenced by gender, special educational needs, or the general or verbal ability of a student in judging performance in a particular task (5,6,13,14,31,39,44,46,49). Several researchers claim that bias in teachers' assessment is susceptible to correction through focused workshop training (19,31) although this review did not specifically include studies of the impact of such training. ...
Technical Report
The Assessment Systems for the Future (ASF) project is a project of the Assessment Reform Group (ARG), led by Wynne Harlen. Its focus is summative assessment and the role that assessment by teachers can take in it. This paper summarises the research evidence revealed by the latest in a series of reviews of research that have explored what can be learned from research studies of the uses of assessment in education. We begin by setting summative assessment by teachers in the context of the purposes and uses of assessment. We then summarise the research evidence revealed by two reviews of research – one focusing on the reliability and validity of summative assessment by teachers and the other on the impact on students and teachers of teachers’ summative assessment. The final section proposes implications of the findings for educational practice, teacher professional development, assessment policy, and research.
... Third, evidence indicates that stream placement might influence the perceptions and expectations that class teachers hold of their pupils. The wider literature demonstrates that teachers can (consciously or unconsciously) label and stereotype children based on a variety of characteristics (Burgess and Greaves, 2009; Campbell, 2015; Hansen and Jones, 2011; Reeves et al., 2001; Thomas et al., 1998). In particular, there is evidence that teachers formulate and act upon expectations of pupils according to the level of their academic group placement (Ansalone, 2003; Boaler, 1997; Boaler et al., 2000; Ireson and Hallam, 1999; Rubie-Davies, 2010). ...
Article
Full-text available
This paper tests the hypothesis that stream placement influences teacher judgements of pupils, thus investigating a route through which streaming by 'ability' may contribute to inequalities. Regression modelling of data for 800+ 7-year-olds taking part in the Millennium Cohort Study examines whether teachers' reported perceptions of 'ability and attainment' correspond to the stream in which a pupil is situated. Children with similar characteristics, who perform equivalently on recent, independent, salient cognitive tests, and who have equal prior attainment, are compared. As predicted, stream level is associated with teachers' perceptions. The hypothesis that there is a relationship from stream placement to teacher judgement is supported.
... Most studies that have examined validity issues of teacher judgements have used an approach that has focused either on the extent to which judgements correlate with standardized test measures (Beswick et al., 2005; Brookhart, 2012; Coladarci, 1986; Hoge & Coladarci, 1989; Meisels et al., 2001; Taylor, Anselmo, Foreman, Schatschneider, & Angelopoulos, 2000) and/or the extent to which judgements accurately predict future performance (Gijsel, Bosman, & Verhoeven, 2006; Hecht & Greenfield, 2002; Taylor et al., 2000). The principal focus of these studies has been general teacher judgements of pupil achievement (Hoge & Coladarci, 1989; Perry & Meisels, 1996), emerging reading and literacy skills (Bates & Nettleback, 2001; Beswick et al., 2005; Meisels et al., 2001), and reading and learning disabilities (Reeves, Boyle & Christie, 2001; Taylor et al., 2000). ...
Thesis
Full-text available
The purpose of this thesis is to examine validity issues in different forms of assessment: teacher judgements, external tests, and pupil self-assessment in Swedish primary schools. The data used were selected from a large-scale study, PIRLS 2001, in which more than 11,000 pupils and some 700 teachers from grades 3 and 4 participated. The primary method used in the secondary analyses to investigate validity issues of the assessment forms is multilevel Structural Equation Modeling (SEM) with latent variables. An argument-based approach to validity was adopted, where possible weaknesses in assessment forms were addressed. A fairly high degree of correspondence between teacher judgements and test results was found within classrooms with a correlation of .65 being obtained for 3rd graders, a finding well in line with documented results in previous research. Grade 3 teachers’ judgements correlated higher than those of grade 4 teachers. The longer period of time spent with the pupils, as well as their different education, were suggested as plausible explanations. Gender and socioeconomic status (SES) of the pupils showed a significant effect on the teacher judgements, in that girls and pupils with higher SES received higher judgements from teachers than test results accounted for. Teachers with higher levels of formal competence were shown to have pupils with higher achievement levels. Pupil achievement was measured with both teacher judgements and PIRLS test results. Furthermore, higher correspondence between judgements and test results was demonstrated for teachers with higher levels of competence. Comparisons of classroom achievement were shown to be problematic with the use of teachers’ judgements. The judgements reflected different achievement levels, despite the fact that test results indicated similar performance levels across classrooms.
Pupil self-assessments correlated slightly lower to both teacher judgement and to test results, than did teacher judgements and test results. However, in spite of their young age, pupils assessed their knowledge and skills in the reading domain relatively well. No differences in self-assessments were found for pupils of different gender or SES. In summary, a conclusion of the studies on the three forms of assessment was that all have certain limitations. Strengths and weaknesses of the different assessment forms were discussed.
... Second, our conclusions are limited to teacher assessments of academic achievement. Whereas we have no doubt that teacher assessments are highly valid measures of school achievement (for a meta-analysis see Hoge & Coladarci, 1989) and that teacher assessments and achievement test scores based on the National Curriculum are in close correspondence (Reeves, Boyle, & Christie, 2001), the relevance of motivation for school achievement might depend to some extent on the definition of school achievement itself. It has been argued that motivational measures should contribute to the prediction of achievement above intelligence especially when school achievement is assessed by teachers compared to more objective measures (Hansford & Hattie, 1982; Helmke, 1992). ...
Article
The present study examined the extent to which motivation contributes to the prediction of school achievement among elementary school children beyond general mental ability (g). The sample consisted of N = 1678 nine-year-old UK elementary school children who took part in the Twins Early Development Study (TEDS). Teachers provided achievement assessments according to the UK National Curriculum criteria for Mathematics, English, and Science, and pupils reported their ability self-perceptions and intrinsic values for these subjects. For all three domains, g proved to be the strongest, and, in the case of Science, the only predictor of school achievement. However, in Mathematics and English, children's ability self-perceptions as well as intrinsic values each contributed incrementally to the prediction of achievement beyond g, with ability self-perceptions being a better predictor than intrinsic values. Finally, commonality analyses revealed a substantial portion of common variance in school achievement explained both by g and motivation. In the light of these results it is argued that the study of motivation offers valuable clues for the understanding and improvement of school achievement.
... Research has shown that judgments made by teachers about the children's reading levels are generally confirmed by the subsequent reading scores of these children (see e.g., Fox & Routh, 1984). For this reason, probably, teachers' assessment is seen, in many cases, as alternative assessment to formal testing in current educational and psychological research (Reeves, Boyle, & Thomas, 2001;Teasdale & Leung, 2000). Children were excluded from the sample if they were (a) students whose problems were primarily emotional in nature; (b) students with sensory handicaps (impaired vision or hearing); (c) students with developmental disabilities (i.e., mental retardation); or (d) students who spoke Greek as a second language. ...
... That attainment indicators depend so heavily on teacher assessment invites the question of whether these apparent achievement gaps may to some extent be an artefact of the measurement method used. There is an enduring body of evidence which indicates that teacher assessments are subject consistently to a large and significant level of error (Brookhart, 2013; Eckert et al., 2006; Harlen, 2005), and, more importantly, research also indicates that some of this error may be systematic (Harlen, 2005; Robinson and Lubienski, 2011), and that there may be regular patterns of inequality in teacher judgements of English primary school pupils (Burgess and Greaves, 2009; Reeves et al., 2001; Thomas et al., 1998). ...
Article
Full-text available
There is evidence that teacher judgements and assessments of primary school pupils can be systematically biased. This paper tests the proposal that stereotyping plays a part in creating these judgement inequalities and is instrumental in achievement variation according to income-level, gender, special educational needs status, ethnicity and spoken language. Using 2008 data for almost 5,000 pupils from the Millennium Cohort Study, it demonstrates biases in teachers’ average ratings of sample pupils’ reading and maths ‘ability and attainment’ which correspond to every one of these key characteristics. Findings go on to suggest that stereotypes according to each of income-level, gender, special educational needs status and ethnicity all play some part in forming these biases. The paper strengthens the evidence that stereotyping of pupils may contribute to assessment and thereby attainment inequalities, and concludes that an increased focus on tackling this process may lead to greater parity and a narrowing of gaps.
... 84) She surmised that teachers have difficulty separating achievement from personal characteristics and this appears to be more complicated in the assessment of children with disabilities. Teacher bias often more negatively skews assessments of children with disabilities (Reeves, Boyle, & Christie, 2001) and teacher assessments of children with disabilities have been criticized for adhering to ideas of normalization (Loyd, 2008). This is not to say that Tribunals should rely on standardized assessments in their decisions. ...
Article
Full-text available
This paper examines Special Education Tribunals in Ontario, Canada, through a Luhmannian theoretical lens. A total of 58 Special Education Tribunal summary hearings were analyzed using the constant comparative method through NVivo software. The results revealed that these Tribunals appear to favour the assessment testimony of teachers and other school personnel over that of other professionals such as educational psychologists, medical doctors, and university professors. This finding is discussed in relation to the available interpretations of Luhmann’s social systems theory along with the limitations of using educational tribunals to remedy social justice issues. © 2015, International Journal of Special Education. All rights reserved.
... They suggest that teachers may have noticed the errors in writing attributed to boys more often because they were expecting to see them, and made more comments because they believed boys were more likely to need the additional support for their writing. Similarly, both Sharrocks et al (1993) and Reeves (2001) found a tendency to under-rate boys' potential as writers. Classroom practice, therefore, may be reinforcing gender norms whereby students experience gender as a range of constraints about what they can legitimately say, do, write and behave as a boy or as a girl, as they attempt to realise the writing skills, linguistic know-how and compositional practices that make up a writer's subject knowledge. ...
Article
Full-text available
This article maps the diverse theoretical disciplines that inform writing research and in particular, how these disciplines have researched the relationship between writing and gender. This is presented against the background of a changing theoretical landscape in research in gender. In particular, it will consider the paradigm shift from discourses of difference and disadvantage to discourses of diversity. Research on writing has not always acknowledged this changing lens, and gender research rarely focuses on writing. The aim therefore is to map out these different approaches, explore how they have impacted writing classrooms and to add to the call for a reconfiguring of gender in writing research as a complex and diverse category rather than as a fixed and essential characteristic we each possess.
... 57 Teachers demonstrate deep bias. An often cited example of this is gender, where, as Reeves et al. (2001) found, ...
Technical Report
Full-text available
This report presents the results of a detailed investigation of poor state performance indicators (PSPIs) which was developed within the second phase of the Crisis States Programme. The report analyses current practices and proposes some potential solutions as well as an agenda of research for the future.
... The CE tests were in line with the tests that teachers routinely used in class, whereas the UEE tests were more demanding and were used for selection purposes. When the public is faced with a discrepancy between test results and teacher assessment, teacher assessment may be considered to be less reliable and more subjective (Reeves et al. 2001). Teachers sometimes feel that they should adjust their teacher assessment for fear that they may appear too lenient or too strict. ...
Article
Full-text available
This article describes briefly the education system of Cyprus and elaborates on the recent changes in its large-scale assessment (LSA) programme. Until 2005, two independent LSA programmes existed: one for school-graduation purposes and one for gaining entrance to higher education. The introduction of a dual-purpose LSA program in 2006, due to unexpected external political and legal events, had unintended consequences, thus making Cyprus an instructive modern case-study. The article follows three threads based on the same story. The first thread stresses the problems that emerged because the same LSA programme is used for graduation (mostly criterion-referenced) and selection (purely norm-referenced) purposes. The second thread discusses the problems arising from the lack of comparability between examination subjects. The third thread stresses the vulnerability of the system, because Cyprus partly depends on other countries to offer access to higher education to its citizens.
... Second, our conclusions are limited to teacher assessments of academic achievement. Whereas we have no doubt that teacher assessments are highly valid measures of school achievement (for a meta-analysis see Hoge & Coladarci, 1989) and that teacher assessments and achievement test scores based on the National Curriculum are in close correspondence (Reeves, Boyle, & Christie, 2001), the relevance of motivation for school achievement might depend to some extent on the definition of school achievement itself. It has been argued that motivational measures should contribute to the prediction of achievement above intelligence especially when school achievement is assessed by teachers compared to more objective measures (Hansford & Hattie, 1982; Helmke, 1992). ...
Article
Examined the respective contribution of motivation and general mental ability (g) to the prediction of school achievement in elementary-school students. Cognitive abilities were assessed in a sample of 1,678 nine-year-old elementary students from the British Twins Early Development Study (TEDS) using items derived from the Wechsler Intelligence Scale for Children (WISC-III-PI) and the Cognitive Abilities Test 3 (CAT3). Children also indicated their self-perceived ability and intrinsic values for three school subjects: mathematics, English, and science. Academic achievement in the three subjects was assessed by the children's teachers using national curricular criteria. Regression analyses revealed g to be the strongest predictor of academic achievement. Children's ability self-perceptions and intrinsic values additionally contributed to the prediction of achievement in mathematics and English, self-perceptions more so than values. Moreover, g and motivation explained a substantial portion of common variance in school achievement. It is concluded that motivation plays an important role in school achievement, and may offer valuable clues for the improvement of academic performance.
... This paper reports only the children's test scores. This decision was based on studies such as Reeves, Boyle and Christie (2001) where high levels of statistical agreement between teacher assessment and test scores in Key Stage 2 assessments were found (although the results of assessments of pupils with registered special educational needs are a notable exception). In addition, the fact that test scores are reported with higher frequency and are accredited more significance in national evaluations of school performance, through, for example, the government league tables, renders them most relevant to research examining outcomes in the context of the current education system. ...
Article
As part of their longitudinal investigations of a large cohort with a history of specific language impairment (SLI), Professor Gina Conti-Ramsden and her research team based at the School of Education, University of Manchester, Dr Emma Knox, Dr Nicola Botting and Dr Zoë Simkin report on the changing educational placements and National Curriculum assessment outcomes of 200 children at 11 years. Teacher questionnaires reporting on the Year 6 primary education placements of the sample reveal details about the long-term educational needs of children with SLI. Furthermore, in exploring the experiences of the sample in the National Curriculum Key Stage 2 tests the present study found that children with SLI perform poorly relative to national expectations of levels of achievement across all tests. At present there are no guidelines for supporting children with SLI in relation to National Curriculum tests and the present data suggests there is an obvious need for these to be developed.
... Research has shown that judgments made by teachers about the children's reading levels are generally confirmed by the subsequent reading scores of these children (see e.g., Fox & Routh, 1984). For this reason, probably, teachers' assessment is seen, in many cases, as alternative assessment to formal testing in current educational and psychological research (Reeves, Boyle, & Thomas, 2001;Teasdale & Leung, 2000). Children were excluded from the sample if they were (a) students whose problems were primarily emotional in nature; (b) students with sensory handicaps (impaired vision or hearing); (c) students with developmental disabilities (i.e., mental retardation); or (d) students who spoke Greek as a second language. ...
Article
This study reports two different experiments, as a part of a longitudinal study, that evaluated a cognitive intervention (PREP: PASS Reading Enhancement Program) to enhance early phonological processing skills, such as odd-word-out, segmenting, and blending, in kindergarten children at risk for reading difficulties, in order to support the development of subsequent word reading skills. As part of the first experiment, thirty children aged 5.1, matched on the basis of age, gender, parental education levels, Non-verbal and Verbal IQ, were assigned to an experimental and a control group (15 in each group) and compared before and after the four-week intervention on a set of phonological and cognitive (successive and simultaneous processing) measures. The two groups of participants were screened to be significantly different at pre-test on the outcome measures. The results of the first experiment indicated that the experimental group performed equally well with the control group on all the measures of phonological and cognitive processing skills. Subsequent analysis focusing on aptitude-treatment interaction indicated that the PREP program appeared to be optimally successful in improving phonological skills in cases where the cognitive profile of the 5-year-olds matched the emphasis on successive information integration. The follow-up experiment examined the long-term effects of PREP remediation. Results showed that both the experimental and control groups performed equally well on word reading tasks and, more importantly, on the bridging PREP tasks, requiring knowledge of the alphabet and of letter-sound correspondences, despite the fact that neither of the groups had been previously trained on the latter.
Discussion concludes that intervention including inductive training on the distal cognitive processes, namely successive and simultaneous processing, appears to be effective for enhancing early word-reading skills in kindergarten children at risk for reading difficulties, even in the absence of direct training of these skills in kindergarten.
... 80; Dale, Harlaar, & Plomin, 2005). Similarly high levels of agreement have been reported for Key Stage 2 teacher and test assessments (Reeves, Boyle, & Christie, 2001). SPA-Children's SPA were assessed at ages 9 and 12 using the Perceived Ability in School Scale (Spinath, Spinath, Harlaar & Plomin, 2006). ...
Article
This paper examines the longitudinal causal relationship between self-perceived abilities (SPA) and academic achievement (Ach) while controlling for cognitive ability (CA). In all, 5957 UK school children were assessed on SPA, Ach and CA at ages 9 and 12. Results indicated that SPA and Ach at age 9 independently affected both SPA and Ach at age 12, even when CA was considered. Moreover the effects of previous Ach on subsequent SPA were of similar magnitude to the effects of prior SPA on subsequent Ach, suggesting that the link between SPA and Ach independent of CA is reflective of both “insight” (children's accounts of their previous performance) and self-efficacy (the self-fulfilling or motivational effects of self-beliefs). Practical and theoretical implications for the study of SPA are discussed.
... Another potential limitation is the use of teacher reports of scientific performance. Although the validity of teacher assessments has been questioned (e.g., Davies & Brember, 1994; Demaray & Elliot, 1998; Glascoe, 2001; Reeves, Boyle, & Christie, 2001), a review of the literature has concluded that on the whole they are valid (Hoge & Coladarci, 1989). Using TEDS data we have shown that teacher assessments of reading performance correlate highly (.68) with a telephone-administered test of word and nonword reading administered at the same age (Dale, Harlaar, & Plomin, 2005). ...
Article
We investigated whether the sexes differ in science performance before they make important course and career selections. We collected teacher-report data from a sample of children from the Twins Early Development Study (TEDS) assessed at ages 9, 10 and 12 years (N>2500 pairs). In addition we developed a test of scientific enquiry and administered it to a sub-sample of TEDS (n=1135; age=14 years). We found no evidence for mean sex differences in science performance assessed by teachers, or by a test of scientific enquiry, although boys were somewhat more variable. At a time when adolescents are making important course choices, girls are performing just as well as boys.
... One notable feature is that boys, compared to girls, do relatively well on teacher assessments in English, but relatively poorly in mathematics and science and these gender differences are always significant at the 1% level. The last two findings echo those in Reeves et al (2001) for age-11 assessments in 1998. Lastly, older children seem to be rated relatively better in teacher assessments than in tests, particularly in science. ...
Article
This paper summarises our research into the relationship between pupil assessment at age 14 (Key Stage 3) and participation in age 16+ education. We question whether a systematic gap between teacher-based assessment and externally marked tests indicates assessment bias or uncertainty, either in testing procedures or through teachers' perceptions of pupils' skills. We explore whether these errors have consequences for pupils' subsequent educational attainment and participation. We find that teacher and test assessments diverge slightly along lines of pupil characteristics, especially prior achievement, clearly observable to the teacher but less so to external assessors, but this does not conform to notions of teacher stereotyping. Moreover, the divergence between the assessments at age 14 has almost no bearing on pupil qualifications or participation in education after age 16, and is unlikely to influence participation rates in higher education (HE).
... However, a concern that boys are less successful than girls, particularly in writing, is widespread although the extent to which this is actually a new problem is questioned as this gap has been stable for 40 years (Smith, 2003). The significance may lie in the perception of teachers and their assessment bias towards girls and their under rating of boys' performance (Reeves, Boyle & Christie, 2001). The analysis of this corpus of data showed that girls performed significantly better across all year groups (average effect size .43, ...
Article
Full-text available
The data base of writing examined serves a dual purpose. Here it is used as a research tool and the writing performance from the large, nationally representative sample (N = 20,947) of students (years 4 to 12) interrogated to examine patterns of performance in writing. However, the data base was designed to underpin a software tool for diagnostic assessment of writing. Viewing writing as accomplishing social communicative goals, performance was considered in terms of seven main purposes the writer may seek to achieve. Tasks related to each purpose were encapsulated in 60 writing prompts that included stimulus material. Participants produced one writing sample; the design ensured appropriate representation across writing purposes. Samples were scored using criteria differentiated according to purpose and curriculum level of schooling and acceptable reliability obtained. Analyses indicate that growth was most marked between years 8 and 10, arguably, as opportunity to write increases and writing is linked to learning in content areas. Variability in performance is relatively low at primary school and high at secondary school. Students at any level did not write equally well for different purposes. Mean scores across purposes at primary school were relatively similar with to instruct and to explain highest. By years 11-12 there is a considerable gap between the highest scores (for narrate and report) and the lowest, recount, reflecting likely opportunities to practice writing for different purposes. Although girls performed better than boys, the difference in mean scores narrows by years 11-12.
... Although we did not obtain data about the reliability or validity of teacher assessments, the high correlations for the twins, even when rated by different teachers, provide strong evidence for both reliability and validity. Evidence exists that teacher assessments reveal some bias, if one accepts discrepancies with test scores as a sign of bias (Davies & Brember, 1994; Reeves, Boyle, & Christie, 2001). It would be interesting to compare teacher assessments and test scores, and this is what we are doing in the ongoing 9-year phase of TEDS. ...
Article
Twin research has consistently shown substantial genetic influence on individual differences in cognitive ability; however, much less is known about the genetic and environmental aetiologies of school achievement. Our goal is to test the hypotheses that teacher-assessed achievement in the early school years shows substantial genetic influence but only modest shared environmental influence when children are assessed by the same teachers and by different teachers. The sample comprised 1,189 monozygotic (MZ) and dizygotic (DZ) twin pairs born in 1994 in England and Wales. Teachers evaluated academic achievement for 7-year-olds in Mathematics and English. Results were based on the twin method, which compares the similarity between identical and fraternal twins. The results suggested substantial genetic influence in that identical twins were almost twice as similar as fraternal twins when compared on teacher assessments for Mathematics, English and a total score. The results confirm prior research suggesting that teacher assessments of academic achievement are substantially influenced by genetics. This finding holds even when twins are assessed independently by different teachers.
... Dale, Harlaar, & Plomin, 2005). High levels of agreement have also been reported for teacher and test assessments at Key Stage 2 (Reeves, Boyle, & Christie, 2001). Second, there is evidence for the concurrent validity of the NC assessments relative to individual direct testing. ...
Article
Full-text available
Language acquisition is predictive of successful reading development, but the nature of this link is poorly understood. A sample of 7,179 twin pairs was assessed on parent-report measures of syntax and vocabulary at ages 2, 3, and 4 years and on teacher assessments of reading achievement (RA) at ages 7, 9, and 10 years. These measures were used to construct latent factors of early language ability (LA) and RA in structural equation model-fitting analyses. The phenotypic correlation between LA and RA (r = .40) was primarily due to shared environmental influences that contribute to familial resemblance. These environmental influences on LA and RA overlapped substantially (rC = .62). Genetic influences made a significant but smaller contribution to the phenotypic correlation between LA and RA, and showed moderate overlap (rA = .36). There was also evidence for a direct causal influence of LA on RA. The association between early language and later reading is underpinned by common environmental and genetic influences. The effects of some risk factors on RA may be mediated by language. The results provide a foundation for more fine-grained studies that examine links between specific measures of language, reading, genes, and environments.
Technical Report
138 pages. This review of teacher assessment has looked at teacher assessment in practice in a number of countries to see what works best and to consider the implications for Assessing Pupils’ Progress (APP). APP is an innovative approach to integrate teaching and assessment to improve and keep track of student learning. It involves professional capacity building to make teachers sensitive to the developmental progression of their students. In addition to published research evidence from other countries the review had access to evaluation reports carried out during the piloting of APP. The emphasis of the review was to capture research evidence of the conditions under which teacher assessment works effectively and reliably. The review has shown that in assessment systems similar to the APP it is possible to gain high levels of reliability. However, high levels of reliability cannot be taken for granted. Some systems have disappointingly low levels of reliability despite the implementation of training schemes for assessors. The APP uses a well-structured system with assessment focuses clearly described. The evaluation reports indicated that for most teachers the reliability of judgments based on the APP system is satisfactory for purpose. An examination of the overall distribution of levels awarded under APP compared with those resulting from external moderation and from optional tests showed a reassuring similarity. This indicates the likelihood of acceptable validity when fully implemented. The review looks at issues that may be worth considering as the system is implemented and makes suggestions for a future evaluation strategy.
Article
For a number of reasons, increasing reliance is being placed on teacher assessment in high-stakes contexts in many countries around the world. Simultaneously, countries that have for some time relied to greater or lesser degrees on teacher assessment for high-stakes purposes are in the process of questioning the validity of that reliance. In principle, teacher assessment has an important role to play in increasing assessment validity by complementing testing to cover subject domains more comprehensively than otherwise would be possible. But what is the evidence regarding the reliability of teacher assessment in high-stakes contexts? The answer is that the evidence is limited and often ambiguous. Research has revealed that teachers can be influenced by a number of construct-irrelevant factors as they work towards their judgements, factors such as gender, socio-economic background, effort and behaviour, that risk biasing their assessments. And when considering construct-relevant achievement evidence teachers are often expected to use verbal or semi-verbal sets of criteria, such as level descriptions, which typically require a degree of subjective interpretation in application and so are themselves a source of unwanted variation in judging standards. Arguably the most effective strategy for addressing these issues is participation in consensus moderation. Yet there have been few attempts to provide evidence of the effectiveness of moderation in practice. The potential value of, and the growing reliance upon, teacher assessment in high-stakes applications demand that evaluation of consensus moderation become a built-in part of the process.
Article
There is an established body of evidence indicating that a pupil's relative age within their school year cohort is associated with academic attainment throughout compulsory education. In England, autumn-born pupils consistently attain at higher levels than summer-born pupils. Analysis here investigates a possible channel of this relative age effect: ability grouping in early primary school. Relatively younger children tend more often to be placed in the lowest in-class ability groups, and relatively older children in the highest group. In addition, teacher perceptions of pupils' ability and attainment are associated with the child's birth month: older children are more likely to be judged above average by their teachers. Using 2008 data for 5481 English seven-year-old pupils and their teachers from the Millennium Cohort Study, this research uses linear regression modelling to explore whether birth month gradation in teacher perceptions of pupils is more pronounced when pupils are in-class ability grouped than when they are not. It finds an amplification of the already disproportionate tendency of teachers to judge autumn-born children as more able when grouping takes place. The autumn–summer difference in teacher judgements is significantly more pronounced among in-class ability grouped pupils than among non-grouped pupils. Given evidence that teacher perceptions and expectations can influence children's trajectories, this supports the hypothesis that in-class ability grouping in early primary school may be instrumental in creating the relative age effect.
Article
This article reports on the outcomes from the e-scape Primary Scientific and Technological Understanding Assessment Project (2009–2010), which aimed to support primary teachers in developing valid portfolio-based tasks to assess pupils’ scientific and technological enquiry skills at age 11. This was part of the wider ‘e-scape’ project (2003-present), which has developed an innovative controlled alternative to design & technology and science public assessment at age 16. Teachers from eight primary schools were trained in the use of an online task-authoring tool to develop and trial assessment activities based on current classroom work. To compile their e-portfolios of assessment evidence, pupils used netbook devices, which afford multi-modal responses (text, drawing, photo, audio, video, spreadsheet) whilst leaving space on pupils’ tables for practical investigations. Once the pupil e-portfolios had been uploaded to the secure e-scape website, teachers assessed them using a ‘comparative judgement’ approach to produce a rank order with a high reliability coefficient. Participant teachers recognised the strength of the e-scape approach in terms of facilitating and managing pupils’ responses to assessment tasks in the classroom, which they successfully adapted to suit primary pedagogy. In particular, the benefits of scaffolding complex assessment tasks through the step-wise e-scape process in the authoring tool represented for some of the teachers a pedagogically significant development in terms of their planning.
Article
This paper summarizes the findings of a systematic review of research on the reliability and validity of teachers’ assessment used for summative purposes. In addition to the main question, the review also addressed the question ‘What conditions affect the reliability and validity of teachers’ summative assessment?’ The initial search for studies meeting the explicit inclusion criteria of relevance found 431 potentially relevant studies. This number was gradually reduced, through the systematic review procedures, to 30 studies, which specifically addressed the review questions. These studies were subject to in‐depth data extraction conducted independently by two researchers, followed by reconciliation of any differences of interpretation. This procedure was also used to judge the weight of evidence provided for the review by each study so that greater emphasis could be given to findings from the most relevant and methodologically sound research. The findings of the review by no means constitute a ringing endorsement of teachers’ assessment; there was evidence of low reliability and bias in teachers’ judgements made in certain circumstances. However, this has to be considered against the low validity and lower than generally assumed reliability of external tests. The findings also point to ways of overcoming the deficiencies of teachers’ assessment and lead to implications for assessment policy, practice and research, which are proposed in the final section of the paper.
Article
The article examines the origins and purposes of assessment for learning (AfL) within the National Curriculum Assessment context in England. As a part of the Primary Strategy, AfL became part of the government's drive to improve standards through measuring school outcomes. The authors describe their investigation into teachers' understandings of AfL, how AfL has influenced teaching, learning and assessment in the intervening six years and whether it has established a presence as part of teaching pedagogy.
Article
This paper discusses fairness and equity in assessment of mathematics. The increased importance of teachers' interpretative judgments of students' performance in high-stakes assessments and in the classroom has prompted this exploration. Following a substantial theoretical overview of the field, the issues are illustrated by two studies that took place in the context of a reformed mathematics curriculum in England. One study is of teachers' informal classroom assessment practices; the other is of their interpretation and evaluation of students' formal written mathematical texts (i.e., responses to mathematics problems). Results from both studies found that broadly similar work could be interpreted differently by different teachers. The formation of teachers' views of students and evaluation of their mathematical attainments appeared to be influenced by surface features of students' work and behavior and by individual teachers' prior expectations. We discuss and critique some approaches to improving the quality and equity of teachers' assessment.
Article
The study investigated how well report card grades communicate to students and parents that state educational standards are being met, standards that are objectively measured by infrequently administered mandated assessments. Data sources were report card grades and external assessment scores for 2006–09 for Ontario, Canada. The information that parents and students received about student performance from report cards and external assessments was similar (rs = .47), consistent with the r = .40–.60 range previously reported. Teachers assigned higher grades than external assessments warranted, even after a major source of construct irrelevant variance in report card grades (teacher ratings on multiple scales measuring student effort and school commitment) was controlled. The relationship of grades to assessment scores was robust across genders, school district types (Public versus Catholic) and language (English and French). Agreement of assessments was higher for grade 6 than for grade 3 and for Writing than for Reading or Mathematics. Report cards provided information about students’ future achievement that was accurate and delivered up to 2 years prior to the administration of external assessments. Seventy to 80% of students who reached the provincial achievement standard on one or both prior report cards were successful on the subsequent external assessment, compared to 30–50% of students who failed to meet the report card standard at least once.
Article
Accepting that school based assessment may have the potential to bring additional reliability to the assessment outcomes of an educational system, this research uses Generalizability Theory to address the question “why school based assessment is not a universal feature of high stakes assessment systems”? Three major issues are identified: (a) there is a conflict between the psychometric model and classroom assessment practice; (b) different schools are not equally effective; and, (c) teachers’ judgments are frequently accused of being biased. The role of public examination boards is discussed in this context.
Chapter
This chapter discusses problem solving and information processing by people with mental retardation. It discusses the sources of intelligence-related variation in the problem solving of persons with mental retardation. The chapter also explores the evidence about the strategy production and transfer of persons with mental retardation and illustrates that the problem solving of persons with mental retardation is greatly influenced by constraints on functional working memory capacity. People with mental retardation can be productive when the interaction of the task environment and their processing capacities are taken into account, when the task is not too difficult, when they understand the task requirements, and when they have the content knowledge required for successful task performance. Each of these factors is affected by constraints on working memory capacity, and in turn, affects the functional capacity available to these persons. The slowness of information processing in persons with mental retardation limits the amount of information that can be kept active.
Article
When researchers investigate how school policies, practices, or climates affect student outcomes, they use multilevel, hierarchical data. Though methodologists have consistently warned of the formidable inferential problems such data pose for traditional statistical methods, no comprehensive alternative analytic strategy has been available. This paper presents a general statistical methodology for such hierarchically structured data and illustrates its use by reexamining the High School and Beyond data and the controversy over the effectiveness of public and Catholic schools. The model enables the researcher to utilize mean achievement and certain structural parameters that characterize the equity in the social distribution of achievement as multivariate outcomes for each school. Variation in these school-level outcomes is then explained as a function of school characteristics.
Article
An introductory account is given of developments in multilevel modelling of educational and other social data. The technique is introduced with some simple examples and its importance is explained. Examples of applications in a number of areas are given, including repeated measures designs, school effectiveness studies, area‐based studies and political opinion sample surveys. Almost all data collected in the social sciences have some form of inherent hierarchical structure, and this structure should be reflected in the statistical models that are used to analyse them. It is suggested that multilevel techniques and associated software packages have reached the stage when they can and should be applied routinely in the analysis of social data, and that failure to do so can result in potentially serious misinterpretations.
Article
This April 2011 article is a reprint of the original May 1989 (V70N9) article and includes a new one-page introduction (on page 63 of this issue) by the author. The problem of assessment in education persists, the author maintains, because we have not yet properly framed the problem. We need to determine what are the actual performances we want students to be good at, he urges, define authentic standards and tasks to judge intellectual ability, and then design a test that measures the performance. The article focuses on the authentic test, which is a contextualized, complex intellectual challenge, rather than a collection of fragmented and static bits or tasks.
Article
This study examines the 1992 National Curriculum assessment data from one large LEA in England in order to address the issue of equity. For comparison purposes we also present additional data obtained from the same sample of pupils on an NFER standardised word recognition test. The report focuses on the relative performance of gender, low income, linguistic, and special needs groups on a standardised reading test and the teacher (TA) and standard task (ST) performance assessments administered in 1992 to 7‐year‐olds as part of the national curriculum (NC) in England and Wales. The impact of schools and teacher effectiveness on student attainments scores is also examined and discussed. Briefly, the findings show that irrespective of the method of assessment, differences in attainment were found between most pupil groups investigated. However, importantly, only very modest evidence was found that particular methods of assessment appeared either to reduce or increase the differences in attainment and overall no clear patterns emerged. The findings are discussed in the context of various factors that may have an impact on the assessment of student attainment.
Article
The size and stability of gender, ethnic and socio‐economic differences in students’ educational achievement are examined over a 9 year period. Both absolute differences in cognitive attainment and relative differences in progress are considered. The study, which is part of a follow up of an age cohort originally included in the ‘School Matters’ research, utilises multilevel modelling techniques. Attainment in reading and mathematics is reported at primary school (Year 3 and 5), secondary transfer (Year 6) and in the General Certificate of Secondary Education (GCSE) (Year 11). Whilst differences in achievement related to gender and socio‐economic factors remained consistent and generally increased over time, greater change was found in patterns of ethnic differences. Possible explanations for the findings are discussed, particularly in relation to the debate concerning performance assessment and equity. The importance of adequate control for socio‐economic background in the analysis of ethnic differences is noted.
Article
How far is assessment fair? In this evaluation of research from a wide range of countries the authors examine the evidence for differences in performance among gender and ethnic groups on various forms of assessment. They explore the reasons put forward for these observed differences and clarify the issues involved. The authors' concern is that assessment practice and interpretation of results are just for all groups. This is a complex field in which access to schooling, the curriculum offered, pupil motivation and esteem, teacher stereotype and expectation all interact with the mode of assessment. This analytical and comprehensive overview is essential reading in a field crucial to educators.
DEPARTMENT FOR EDUCATION (1995) Value Added in Education (London, Department for Education).
NEWMANN, F. M. (1991) Linking restructuring to authentic student achievement, Phi Delta Kappan, 72, pp. 458-463.
THOMAS, S., SAMMONS, P. & MORTIMORE, P. (1995) Determining what 'adds value' to student achievement, Educational Leadership International, 51(6), pp. 19-22.
TYMMS, P. B. (1996) The Value Added National Project: Second Primary Technical Report: an analysis of the 1991 Key Stage 1 assessment data linked to the 1995 KS2 data provided by Avon LEA (London, School Curriculum and Assessment Authority).
DEARING, R. (1993) The National Curriculum and Its Assessment (London, School Curriculum and Assessment Authority).
QUALIFICATIONS AND CURRICULUM AUTHORITY (1998a) Standards at Key Stage 2 English, Mathematics and Science (London, QCA).
ROSE, J. (1999) Weighing the Baby. The Report of the Independent Scrutiny Panel on the 1999 Key Stage 2 National Curriculum Tests in English and Mathematics (London, DfEE).
QUALIFICATIONS AND CURRICULUM AUTHORITY (1999) Key Stage 2 Assessment and Reporting Arrangements (London, QCA).
QUALIFICATIONS AND CURRICULUM AUTHORITY (1998b) External Marking of the 1998 Key Stage 2 Tests. Test Results Tables (London, QCA).
FITZ-GIBBON, C. T. (1997) The Value Added National Project Final Report-Feasibility studies for a national system of value-added indicators (London, School Curriculum and Assessment Authority).