
Investigating the Effect of Different Selected-response Item Formats for Reading Comprehension

This is an Accepted Manuscript of an article published by Taylor & Francis Group in
Educational Assessment on 9/20/2018, available online at:
https://www.tandfonline.com/doi/full/10.1080/10627197.2018.1517023
ABSTRACT
While previous research has identified numerous factors that contribute to item difficulty, studies
involving large-scale reading tests have provided mixed results. This study examined five
selected-response item types used to measure reading comprehension in the Pearson Test of
English Academic: a) multiple-choice (choose one answer), b) multiple-choice (choose multiple
answers), c) re-order paragraphs, d) reading (fill-in-the-blanks), and e) reading and writing (fill-
in-the-blanks). Utilizing a multiple regression approach, the criterion measure consisted of item
difficulty scores for 172 items. Eighteen passage, passage-question, and response-format
variables served as predictors. Overall, four significant predictors were identified for the entire
group (i.e., sentence length, falsifiable distractors, number of correct options, and abstractness of
information requested) and five variables were found to be significant for high-performing
readers (including the four listed above and passage coherence); only the number of falsifiable
distractors was a significant predictor for low-performing readers. Implications for assessing
reading comprehension are discussed.
Keywords: format, item difficulty, L2 reading, multiple-choice, regression, standardized test
Introduction
The complex nature of second-language (L2) reading has been acknowledged in both
theory- and pedagogy-oriented research (Alderson, 2000; Grabe, 2009). Grabe (2009) described
two types of cognitive processes that affect L2 readers' ability to comprehend texts: lower-level
processes (including word recognition, syntactic parsing, and propositional encoding), and
higher-level processes (including understanding main ideas, interpreting a text, activating
background knowledge, and making inferences). Because these processes operate simultaneously
at various phases during reading, operationalizing the construct of reading ability is challenging
for assessment purposes, as one has to not only confirm that test items measure the same latent
trait, but also consider a number of factors that impact the difficulty level of test items.
A clear definition of reading ability is especially important in large standardized
proficiency tests, such as Test of English as a Foreign Language (TOEFL), International English
Language Testing System (IELTS), and Pearson Test of English (PTE) Academic, all of which
include reading ability as a sub-construct within a more global communicative ability. Because
the test scores are used for admission and placement purposes at English-medium institutions,
high-stakes decisions are made based on test-takers' performance. To date, a large body of
empirical research has been carried out to establish the construct of reading ability and to
examine several factors affecting the difficulty of test items in TOEFL and IELTS (e.g., Freedle
& Kostin, 1993; Carr, 2006; Weir, Hawkey, Green, Unaldi, & Devi, 2009).
PTE Academic, a relatively new test with a rapidly developing trajectory of use
internationally, has been drawing more attention from researchers and assessment specialists.
Specifically, recent studies have focused on establishing the construct validity of PTE Academic,
defining sub-skills targeted in the test, examining response formats and their effect on test-takers,
investigating differential item functioning, and examining rater behavior (e.g., Pae, 2011; Zheng
& De Jong, 2011). Yet, to our knowledge, no study to date has empirically investigated the
features that could potentially affect reading item difficulty on PTE Academic. The present
study, therefore, attempted to fill this void by determining the extent to which various features of
passages, question types, and response formats accounted for the difficulty of PTE Academic
reading test items, all of which are presented using different selected-response item formats.
It is worth noting here that a variety of terms are used throughout this paper to describe
aspects of test items, including those items used in the PTE Academic reading test. For the
present study, we took the perspective that a (reading) test item typically includes three
distinctive elements: (1) input, in the form of a written text or passage; (2) a stimulus, in the form
of a question or a stem that requires completion; and (3) a response, in the form of a constructed-
or selected-response format. Specifically, for the purposes of our study, we define a test item as
any part of the PTE Academic reading test in which an examinee must read a passage and choose
among response options provided in order to respond to a question
or to complete a stem. Furthermore, when discussing item formats or types throughout this
paper, we are generally referring to the multiple response formats that are possible when
designing selected-response test items. For example, in the case of the PTE Academic reading
test, the selected-response item formats or types that are utilized include multiple choice (MC),
re-ordering paragraphs, and filling-in-the-blanks. These specific item types are discussed in more
detail in the methods section of this paper.
In order to contextualize this study within existing research on L2 reading and to justify
the specific questions explored in the study, the following sections include a discussion of a
theoretical perspective on L2 reading as well as an overview of the main findings obtained from
empirical research that motivated the focus of the present study.
Reading Comprehension: Theoretical Underpinnings
In this study we adopted the current perspective for understanding reading as a dynamic
multi-dimensional construct that can be operationalized through a range of component processes
occurring at both lower and higher levels and, at certain points, engaging in interactions with
each other (Grabe, 2009; Khalifa & Weir, 2009). This approach to reading allows one to
conceptualize and examine the multiple cognitive processes that occur during reading and to
explore the complex nature of each of these component processes. While individual reading
comprehension frameworks and models organized under this larger multi-dimensional
componential perspective may have unique elements, they all view reading as a purposeful and
interactive activity during which a reader extracts and co-constructs the meaning encoded in a
text by employing a number of skills, processes, and strategies to achieve comprehension within
a specific context.
Current understanding of the construct of reading comprehension in both L1 and L2 has
been greatly influenced by schema theory, which emphasizes the importance of readers'
knowledge, experiences, values, and strategies that they bring to the reading process to achieve
comprehension (Rumelhart, 1980). Schemas, or schemata, are the cognitive constructs that
represent complex networks of information which can be utilized by people to organize new
experiences, stimuli, events, and situations. Researchers who have explored the role of schema in
the reading process seem to agree that readers who activate appropriate schemata, such as
possessing sufficient background knowledge about a topic, being aware of the discourse
structure of a text, and being able to decode input in a language, are in a better position to
achieve comprehension. In contrast, readers (both L1 and L2 learners) who either lack sufficient
schemata or are unable to activate them appropriately might experience more difficulties with
understanding a text (Aebersold & Field, 1997). Building on schema theory, Kintsch (1998)
proposed an alternative account of comprehension, which he called the construction-integration
(CI) model. This model has been influential not only in theoretical discussions of what
constitutes comprehension during reading activities (e.g., Grabe, 2009; Weir, Hawkey, Green,
Unaldi, & Devi, 2009), but also in empirical research examining a number of trait factors that
can potentially affect participants' reading performance (e.g., Pae, 2012a).
Taking into account the very fluid and context-sensitive nature of human comprehension,
Kintsch's CI model presents it as a loosely organized bottom-up process which may initially
generate a somewhat chaotic and incoherent textbase that is then integrated into a well-ordered
coherent structure (i.e., the situation model). The steps involved in the construction of the
textbase include: (a) forming propositions that correspond to the linguistic input and represent
the meaning of a text, (b) elaborating each of the constructed elements, (c) inferring additional
propositions that might be required to complete the construction process (i.e., bridging inferences
that are employed whenever the constructed textbase is still largely incoherent), and (d)
establishing the interconnections between all elements. The outcome of the construction process
is a rich textbase structure that represents a cluster of generated propositions, with some of them
being derived directly from the text and others being closely connected to them. During the
integration step, when the situation model is developed, the propositional representation of the
text is then integrated with other knowledge and the material that is irrelevant and/or
inconsistent with the context becomes deactivated (Kintsch, 1988).
For the present study, the CI model provides insight into specific identifiable components
that might potentially account for the difficulty in text comprehension and, therefore, can be
important for assessment purposes. At the surface level of text comprehension, readers' ability to
decode visual input can be potentially affected by the textual (i.e., passage) features, including
the complexity of the reading material, both lexical and syntactic, as well as the extent to which
the macrostructure of the passage is clearly organized and signaled. Likewise, the construction of
the situation model of the text depends on additional contextual features: the extent to which
readers are prompted and able to integrate information from the text with their background
knowledge, as well as readers' overall familiarity with the topic/content discussed in the text.
L2 Reading Comprehension: Additional Considerations
Expanding on Kintsch's CI model of reading comprehension for native speakers, Grabe
(2009) offers further insights about how the comprehension process unfolds for L2 readers.
During the construction of the textbase, Grabe (2009) highlights additional complexities of the
decoding process related to L1-L2 differences, such as learning new letter-to-sound patterns or
acceptable syllable structures, that are likely to have a significant impact on the speed and
accuracy of word-recognition processes in L2 reading development, particularly at lower
proficiency levels (p. 121). In terms of the language skills, L2 readers, especially beginning
learners, have a much more limited vocabulary when they start learning to read in their L2, in
addition to an impoverished knowledge of L2 morphology and syntax. Furthermore, the nature
of L1-L2 differences will dictate the extent to which L1 transfer influences will affect reading
comprehension and which linguistic resources will be affected the most (e.g., morpho-syntactic
patterns, vocabulary use, discourse practices).
In the past, some researchers have proposed that L2 reading ability is likely to include
general language resources that are universal and, therefore, transferable (e.g., Koda, 2007).
However, Grabe (2009) argues that, depending on the structural and semantic differences
between the languages, these universal aspects of reading ability might not always transfer well.
In fact, even when the two language systems share similarities, there are still limitations in how
(and to what extent) the universal aspects of reading ability can be transferred to L2 reading
skills. Therefore, a more productive approach to capturing the complexities of L2 reading ability
seems to be viewing it as a dual-language processing system that evolves over time and presents
a case of learning to read with languages rather than learning to read in another language (p.
129). According to Grabe, the emergent system is dynamic in nature and is sensitive to the task
conditions, the reader, and the context. Based on a reader's specific background knowledge, their
level of language proficiency, the specific task requirements, and the reading goals, L2 readers
might be engaging a range of skills at different levels of processing to achieve comprehension.
Assessment of Reading Comprehension
To this point, our discussion of reading comprehension has focused primarily on
understanding how readers process textual information and integrate it with their background
knowledge in non-testing conditions (i.e., in situations that do not purposefully assess readers'
ability to achieve comprehension). In testing conditions, however, a multifaceted construct of
reading ability takes on an additional layer of complexity associated with the methods used to
assess reading comprehension.
Reflecting on the specific challenges related to the validity of reading comprehension
tests, Ozuru, Rowe, O'Reilly, and McNamara (2008) noted that the complexities of the many
underlying sub-processes that comprise reading comprehension as well as considerable
variability in reading situations present major challenges to adequately measuring reading ability.
This, in turn, often lends an artificial character to reading comprehension tests and, therefore,
justifies current research inquiries into the generalizability of results from large standardized
reading tests that employ different selected-response item formats, such as gap-fill and MC
(McCray & Brunfaut, 2016). The main focus of such inquiries is to examine if (and to what
extent) tests using selected-response items elicit performances that accurately reflect readers'
abilities to comprehend a text, especially when compared to real-life comprehension (i.e.,
when reading outside the testing context).
One of the ways to investigate this issue is to focus on the cognitive processes and the
extent to which the processes engaged during a reading test correspond to the cognitive processes
that readers typically perform during reading activities outside the testing situation (e.g., see
McCray & Brunfaut, 2016; Yamashita, 2003). However, in situations when a researcher only has
access to participants test scores (i.e., no observation data is available), a product- (rather than
process-) oriented approach to examining the validity of reading test items might be a more
appropriate alternative. In this approach, which has been employed extensively in the field, test-
takers' performance is typically examined in light of other relevant factors. The majority of the
investigations that have employed this approach have focused on selected-response items, with
the goal of investigating the extent to which they were able to measure different aspects of
reading ability and, therefore, to serve as an efficient means to assess L2 learners' reading skills
(see Currie & Chiramanee, 2010; Kobayashi, 2004; Rupp, Ferne, & Choi, 2006; Shohamy,
1984). For example, Currie and Chiramanee (2010) argued that MC items had a distorting effect
on the measurement of language ability, as these items seemed to have introduced a greater
proportion of format-related noise than the language performance actually sought (p. 487).
Yet, due to their ease of administration and scoring, as well as their usefulness for
targeting a range of cognitive skills, selected-response items are often preferred for large-scale
standardized tests, including PTE Academic. While considerable efforts have been made to
develop tests that are inexpensive, efficient to administer, reliable, and valid, some concerns
about construct-irrelevant variance that might be attributed to item formats used in PTE
Academic still remain. Specifically, in a study examining the construct validity and the effect of
the testing method utilized in PTE Academic, Pae (2012a) reported that while, in general, the
assessment methods are adequate, mainly for measuring the given traits, […] some portion of the
variance was related to […] question formats, suggesting that the question type might have
assessed different constructs, especially for the prescribed-response format (pp. 6-7). Thus,
according to Pae (2012a), further attention should be devoted to investigating the (potential)
effect of the format features of the reading items on participants' test performance. Following
this call for more research, the present study includes a group of item format features as
additional factors that might affect L2 participants' reading performance, in addition to the
textual (passage) and contextual (question) variables that affect participants' reading
comprehension in non-testing situations. The following section includes an overview of
empirical studies that examined the different categories of features (i.e., passage, question, and
response-format) affecting the difficulty of selected-response test items.
Effect of Item Features on the Difficulty of Selected-Response Items
Empirical research conducted on the effect of item format features on test-takers'
performance has had a very practical rationale. As Bachman, Davidson, and Milanovic (1996)
argue, test developers must determine what characteristics are relevant to test performance, how
best to obtain information about these characteristics, and how to use information about test and
item characteristics to describe tests better and to predict and interpret test performance (p.
129). The influence of test characteristics on L2 reading performance is particularly relevant in
cases where selected-response items are used, as some have argued that these items alone are not
sufficient to measure reading ability (see Haladyna, 2012).
Quantitative investigations into the effect of item format features on test-takers' L2
performance have typically examined how various coded features of passages and questions
accounted for item difficulty and/or item discrimination (see Bachman et al., 1996; Carr, 2006;
Freedle & Kostin, 1993; Garrison, Dowaliby, & Long, 1992; Ozuru, Rowe, O'Reilly, &
McNamara, 2008). For example, Garrison, Dowaliby, and Long (1992) provided evidence for
the importance of the passage-question relationship (or what we refer to as question-related
variables in the present study), as they found that the amount of transformation of text
information needed to be able to answer a question, ranging from verbatim to inference
questions, as well as the background knowledge required to answer a question, proved to be key
predictors of item difficulty. In addition to these two features, the number of words in the
question combined with the number of words in a passage contributed to increased difficulty of
test items. Davey (1988) also reported that question length positively correlated with passage
sentence length, passage coherence, and number of content words included in a question; that is,
passages with longer sentences, which were found to be more coherent yet more difficult, tended
to accompany longer questions.
In a study of TOEFL MC reading items, Freedle and Kostin (1993) examined a set of 75
features that had been identified in previous research as important predictors of reading
comprehension. The results indicated that seven predictor categories jointly accounted for
reading item difficulty: (1) lexical overlap between the correct answer and the text, (2) key
sentence length (for supporting idea items), (3) concreteness of the text, (4) rhetorical
organization of a passage, (5) number of negations in the correct answer, (6) number of referents
across clauses in a passage, and (7) passage length. Since six out of the seven predictors
belonged to the passage and passage-to-item (i.e., contextual/question) categories, and only one
of the format features contributed to item difficulty, the authors concluded that the results of the
study provided sufficient evidence for the construct validity of the TOEFL reading items.
While the findings from this study, and others like it (e.g., Davey, 1988), highlight the
importance of passage and question features in determining item difficulty, they are not
altogether conclusive, as more recent studies have had contradictory results. For instance, in a
more recent study of TOEFL MC reading items, Carr (2006) found that many of the passage-
related features were not significant predictors of item difficulty. The features included in the
analysis were both passage (e.g., subject matter, propositional content, cohesion, and focus
constructions) and key sentence features (e.g., length and location, lexical variety, key sentences
per sentence, and proportion of clauses). Carr (2006) noted that key sentence features emerged
as constituting a key dimension along which reading items may vary (p. 285), and called into
question the applicability of solely using task characteristics to predict examinee performance,
particularly as passage-related features are concerned.
One of the more recent studies which adopted a more holistic approach to exploring
sources of difficulty within an item examined the extent to which three sets of item features
predicted the difficulty of MC items on the Gates-MacGinitie Reading Tests (Ozuru, Rowe,
O'Reilly, & McNamara, 2008). The authors carried out an item-based analysis of 192
comprehension questions to investigate if, and to what extent, various item features could predict
the difficulty of test items targeting two different levels (i.e., lower vs. higher) of reading ability.
The results showed that, whereas the difficulty of reading test items for the lower level could
mostly be predicted by passage features, none of the item features could account for the
difficulty of items included in the higher-level tests. Ozuru, Rowe, O'Reilly, and McNamara's
(2008) study motivated the current investigation, as it considered (and illustrated) the multi-
dimensional nature of reading test performance which incorporated a variety of features,
including text (passage), context (question type), item format, and learner characteristics (e.g.,
level of reading ability) that could potentially impact item difficulty.
Present Study
The purpose of the present study was to investigate the effect of a combination of factors
on the difficulty of selected-response reading items in the PTE Academic. Two of the three item
facets examined in the study were identified based on Kintsch's (1998) CI model of reading
comprehension which we discussed in greater detail in the previous sections. Specifically, we
believe that both groups of features (i.e., textual/passage and contextual/question) can
potentially present sources of difficulty during text comprehension. Depending on participants'
level of L2 ability and previous linguistic experiences in general, their construction of the
textbase might be affected by the linguistic complexity of the reading material (both syntactic
and lexical) as well as by how the information is organized in a text. Thus, L2 readers' ability to
parse a text and establish propositional meaning at the sentence and discourse level is likely to be
related to the number of words in a text that readers can recognize and whose lexical meaning
they know and, therefore, to their ability to identify the idea units. The longer the text, the more likely it is to include
words that are less familiar to the reader and/or present more complex grammatical relationships,
which, in turn, increases the difficulty level of the text. Likewise, the fewer propositional gaps a
text contains and the more explicit the structure of the text is, the easier it is to construct the
appropriate textbase. Furthermore, readers who are able to identify the most essential
information that supports the overall organizing idea conveyed in the text, as well as to integrate
new information from the text with existing background knowledge, are likely to be more
successful during the integration phase of the comprehension process.
The third group of features that is related to testing situations includes format
characteristics of test items. Haladyna (2012) argues that item format in tests of reading
comprehension is an important consideration, as different item formats may present different
cognitive demands and thought processes. Because this study adopts a more holistic approach to
examining groups of factors that might impact item difficulty, including a group of item format
features appears to be worthwhile. Our understanding of reading comprehension as a complex set
of cognitive processes affected by passage, question, and (potentially) item format
characteristics, leads to the first research question we asked in the present study:
(1) To what extent do various passage, question, and response-format features predict the
difficulty of PTE reading items?
Based on the test specifications of the PTE Academic (Zheng & De Jong, 2011), the
reading items appear to target a wide range of reading abilities included in the reading construct
that align well with Kintsch's CI model. Specifically, the tasks are thought to elicit abilities
ranging from understanding the meaning of lexical items, grammatical constructions, and the
structural organization of a text, to identifying the main ideas and details, to reading critically in
order to understand the author's purpose and rhetorical intent, draw logical inferences, evaluate a
hypothesis, and integrate information from multiple sources to generate an organizing frame that
is not explicitly stated (p. 10). It is possible, however, that depending
on the overall proficiency level of the test-takers, not all of these abilities might be tapped during
the test. In fact, in a study examining the factor structure of L2 learners academic English skills
assessed by PTE Academic, Pae (2012b) noted that the factor structure for the high and low
ability groups was different, suggesting that there might be a different latent construct for
different ability groups and, therefore, caution needs to be taken in the interpretation of test
scores according to the skill level (p. 6). Therefore, in the present study we also considered a
learner-related component by comparing the test performance of more and less successful
readers. This is reflected in the second research question we asked in the present study:
(2) Do the specific passage, question, and response-format features affecting the
difficulty of PTE reading items differ for low- and high-performing readers?
Method
Reading test items for PTE Academic were evaluated according to three different facets
(i.e., passage, passage-question relationship, and response-format); a description of these facets
and how they were coded are provided below. The data for the first analysis in the present study
comprised a subset of reading test items (k = 172) taken from 10 separate administrations of PTE
Academic exams administered in four different countries between 2011 and 2014. The reading
test items targeted different levels of reading ability, ranging from B1 to C2 on the Common
European Framework (CEF) scale (Zheng & De Jong, 2011).
In addition, the study also examined the differences between high- and low-performing
readers. From an initial sample of 836 examinees, 434 of those examinees were divided into two
groups based on their reading test results for PTE Academic. Individuals belonging to the high-
performing group (n = 192) represented those examinees who scored at least one standard
deviation (SD) above the mean score for the reading test, while individuals belonging to the low-
performing group (n = 242) included examinees who scored at least one SD unit below the mean
score. Those individuals who fell between +/- one SD of the mean score were excluded from the
analysis (see Table 1). The two groups included both males (n = 251) and females (n = 183)
between the ages of 18 and 38 years old (M = 24.81 years; SD = 4.33).
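For illustration, the following Python sketch shows how such a +/- 1 SD split could be computed; the data frame and score values below are simulated stand-ins for the actual Pearson data, not the study's dataset.

import numpy as np
import pandas as pd

# Simulated stand-in for the examinees' reading scores (the real data are Pearson's).
rng = np.random.default_rng(42)
df = pd.DataFrame({"examinee_id": range(836),
                   "reading_score": rng.normal(loc=50, scale=10, size=836)})

mean, sd = df["reading_score"].mean(), df["reading_score"].std()

# High performers: at least 1 SD above the mean; low performers: at least 1 SD below.
high = df[df["reading_score"] >= mean + sd]
low = df[df["reading_score"] <= mean - sd]

# Examinees within +/- 1 SD of the mean are excluded from the group comparison.
print(f"High-performing n = {len(high)}, low-performing n = {len(low)}")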
[INSERT TABLE 1 ABOUT HERE.]
Reading test for PTE Academic. The reading test consists of five selected-response
item formats: a) MC (choose one answer), b) MC (choose multiple answers), c) re-ordering
paragraphs, d) reading (fill-in-the-blanks), and e) reading and writing (fill-in-the-blanks). For
MC items, examinees are presented with a reading passage and asked to select the correct
response(s), choosing either a single answer (from four options) or multiple answers (from five
options), depending on the question-type. For paragraph re-ordering items, examinees are
presented with a short reading passage and several randomly ordered (but related) sentences
below the passage. Based on the information in the passage, examinees are asked to restore the
original order of the sentences. Finally, for the fill-in-the-blanks items, a short reading passage is
presented with some words missing. To complete the passage, for each blank, examinees are
asked to either drag-and-drop the most appropriate word from a general list of words or they are
asked to select the most appropriate word from a list of four options included in a drop-down
menu. Each of the item formats targeted specific reading subskills (see PTE Academic, 2012).
As illustrated in Table 2, some of the subskills overlapped between the different item types,
while some were unique to a given item type.
For this study, Pearson provided the researchers with test results from 10 different forms
(i.e., administrations) of PTE Academic. It was assumed that the forms were largely equivalent, in
terms of the test content and the difficulty of test items across the test forms. Although the
specific outcomes of the procedures undertaken to ensure the equivalence of the PTE Academic
test forms were not disclosed to the researchers, Pearson provides ample information to suggest
that great lengths are taken to ensure equivalence across test administrations. For example, De
Jong and Zheng (2011) note that "draws [from the test item bank] are stratified to ensure
comparable composition with respect to the item types included and thereby the representation of
skills in the test" (p. 8). Furthermore, as De Jong, Li, and Duvin (2010) explain, test items drawn
from the PTE Academic item bank are calibrated on a single IRT scale, making them equally
sufficient to predict scores on any [PTE Academic] test form. Based on this information, we
concluded that stable estimates of item difficulty were established across the PTE Academic test
forms. Therefore, the decision was made to collapse the results for the 10 forms into one overall
dataset.¹ This led to a total of 172 items for the dataset used in the present study.
¹ The results of a one-way ANOVA further supported this decision, as no significant differences were found
between mean test scores for any of the forms, F(9, 826) = 1.01, p > .05.
As shown in Table 2, each test form included five reading passages, ranging from 30 to
300 words, as well as 2 single-answer MC items, 2-3 multiple-answer MC items, 2-3 re-order
paragraph items, 4-5 drag-and-drop fill-in-the-blank items, and 5-6 drop-down fill-in-the-blank
items. While single-answer MC items were dichotomously scored (0 or 1), all other item
types were scored using a partial-credit scoring system (Pearson Test of English Academic, 2012).
[INSERT TABLE 2 ABOUT HERE.]
Procedures for Data Analysis
Description of Reading Test Item Features
The 172 reading test items were evaluated according to three different groups of facets
(i.e., passage, question, and response-format) that have been identified as potential sources of
item difficulty based on prior research (e.g., Alderson, 2000; Currie & Chiramanee, 2010;
Davey, 1988; Ozuru, Rowe, O'Reilly, & McNamara, 2008). Within these facets, 10 features of
the reading test items were determined, totaling 12 unique measures (see Table 3).
Passage-related features. These features concerned the characteristics of the reading
passages that served as the input for examinees. Four features of reading passages that have been
shown to influence reading comprehension difficulty (and have been reviewed above) were
examined: a) general complexity, b) syntactic complexity, c) lexical complexity, and d) passage
structure. The first two features, general and syntactic complexity, were measured using an
online L2 syntactic complexity analyzer (see Lu, 2010). General complexity was measured as the
total number of words in a reading passage, while syntactic complexity was measured as the
mean length of sentence and clauses per sentence. The third feature, lexical complexity, was
measured using type-token ratio and words off-list. Type-token ratio represented the number of
unique function and content words with respect to the number of repeated words, while the
words off-list represented the proportion of words that fell outside of the General Service List²
(West, 1953) and Academic Word List (Coxhead, 2000) for each reading passage.
² The General Service List, originally published by Michael West in 1953, is a list of roughly 2,000 words selected
to be of the greatest “general service” to learners of English. (Source: http://jbauman.com/aboutgsl.html)
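For readers who wish to approximate these passage-level measures, the following Python sketch computes word count, mean sentence length, and a simple type-token ratio; it is a rough illustration with deliberately naive tokenization, not a reproduction of Lu's (2010) analyzer or the Vocabprofile tool used in the study.

import re

def passage_measures(text: str) -> dict:
    """Naive approximations of the passage-level predictors."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    types = set(tokens)
    return {
        "num_words": len(tokens),                                       # general complexity
        "mean_sentence_length": len(tokens) / max(len(sentences), 1),   # syntactic complexity
        "type_token_ratio": len(types) / max(len(tokens), 1),           # lexical complexity
    }

print(passage_measures("Reading is complex. It involves many interacting processes."))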
[INSERT TABLE 3 ABOUT HERE.]
For the fourth feature in this category, passage structure, a framework for measuring
coherence was adopted from the content structure analysis used in Kobayashi (2002). This
framework categorizes the structure of content in a reading text into four different types: 1)
association, 2) causation, 3) response, and 4) description. Association refers to ideas that are only
loosely associated with each other around a common topic, as well as events related to a time
sequence (i.e., recounting events in chronological order). Causation refers to ideas that are
related both in terms of time (e.g., one event happens before another) and causality (i.e., the
earlier event causes the latter). Response refers to the inter-relationship of ideas such that a
solution is suggested in response to an existing causality. Finally, description refers to at least
two subordinate arguments that are linked by an element of comparison, as well as ideas
arranged in a hierarchical manner (i.e., one argument is superordinate and the other modifies this
superordinate argument).
According to Kintsch's model, the language complexity of the passage, including its
length as well as the number of syntactic constructions and the lexical density of the text, might
affect the ability of some readers to construct the textbase. Furthermore, depending on the extent
to which the passage structure is clear to a reader and the ideas are presented in a logical and
coherent manner, the construction of the textbase can either be facilitated or inhibited.
Question-related features. These features concerned the cognitive operation involved in
answering a question that could potentially affect test-takers' ability to integrate textual
information with other knowledge and develop the situation model of text comprehension. Three
question features were included: 1) number of falsifiable distractor options, 2) comprehension-
type, and 3) abstractness of the targeted answer. Following Ozuru, Rowe, O'Reilly, and
McNamara (2008), the first feature, number of falsifiable distractor options, was determined by
tallying the number of distractors that were inconsistent with the theme of the passage and/or
contained erroneous or unrelated information for each item. Based on Kintsch's (1998) CI
model, test-takers' ability to achieve higher-level comprehension assumes their ability to identify
essential information necessary for comprehension and to deactivate all unnecessary material
that is either inaccurate or redundant. Therefore, it was believed that the occurrence of falsifiable
distractors (i.e., information that is inconsistent with the main theme) might potentially make it
more challenging for a test-taker to comprehend a text.
The second feature, comprehension-type, refers to the level of understanding that is
required to answer a question. Previous research has investigated the level of understanding
needed to comprehend a written text (e.g., Davey, 1988; Kobayashi, 2004; McKenna & Stahl,
2009; Ozuru, Rowe, O'Reilly, & McNamara, 2008), identifying four types of text
comprehension questions: 1) text-based questions, 2) restructuring/rewording questions, 3)
integration questions, and 4) knowledge-based questions. Text-based questions represent those
question-types for which the answer is stated nearly verbatim in the passage, thus making it
easily identifiable. Restructuring questions, while not stated verbatim, simply require the
examinee to re-order the organization of words. Integration questions require the examinee to
synthesize information, whereas knowledge-based questions require the examinee to draw
inferences and/or integrate their world knowledge. While the first two question-types are
considered to be less cognitively demanding, the latter two types are thought to be more
cognitively demanding, as they require higher-order reading skills (Kobayashi, 2004). Similarly,
following Kintsch's (1998) argument about the two different levels of comprehension, it is the
latter two types of questions (i.e., integration and knowledge-based questions) that promote
deep-level understanding of a text which supports learning (as opposed to surface level
comprehension that is memory-based).
The third feature, abstractness of information, addresses the level of concreteness or
abstractness of the information requested by the question. It is our belief that the more abstract
the information requested by the question, the more challenging it is for a reader to rely on their
background knowledge and incorporate it during the integration phase of the comprehension
process. Therefore, questions that ask about abstract concepts are likely to be more challenging
for a reader, especially if they have limited knowledge about a topic and, therefore, require
additional effort to achieve comprehension. Following Ozuru, Rowe, O'Reilly, and McNamara
(2008), five levels of concreteness or abstractness were identified in the present study: 1) most
concrete (identification of persons, animals, or things), 2) highly concrete (information about
time, attributes, or amounts), 3) intermediate (identification of manner, goal, purpose, or
condition), 4) highly abstract (identification of cause, effect, reason, or result), and 5) most
abstract (information about equivalence, difference, or theme).
Response-format features. These features concerned the specific response formats of
test items. Previous research has identified a variety of response-format features that influence
the difficulty of reading test items (e.g., Currie & Chiramanee, 2010; Garrison, Dowaliby, &
Long, 1992; Ozuru, Rowe, O'Reilly, & McNamara, 2008). The present study only focused on
three features, as they have commonly been found to influence item difficulty across studies: 1)
number of response alternatives possible for an item, 2) number of correct options required, and
3) average length of options.
Coding of Individual Items
Ten features of three different facets which could potentially represent sources of
difficulty for reading test items were identified and coded by both researchers. The coding
process included three stages. First, a trial coding of the passage-related features was conducted.
Both individuals coded the reading passages from two of the same test forms and verified the
results of the L2 syntactic complexity analyzer for calculating number of words, mean sentence
length and clauses per sentence, and also discussed the implementation of the scheme for
analyzing the content structure of reading passages (see Kobayashi, 2002). Following this, both
coders then focused on these same passage-related features for the eight remaining reading test
forms and compared their results. Any discrepancies between the coders were discussed and
reconciled to ensure 100% agreement on all features. Finally, for the remaining passage-related
features (i.e., type-token ratio and words off-list), one of the coders utilized the Vocabprofile
function in Compleat Lexical Tutor (Cobb, n.d.) to determine counts for these measures for each
reading test item. The first two steps of the coding procedures described above were likewise
followed for question features (stage two) and response-format features (stage three).
Statistical Analysis
Item analysis. Item difficulty scores were provided by Pearson for each of the 172
reading test items included in the present study. As mentioned earlier, the forms of the reading
tests included both dichotomous and polytomous items. Therefore, a Partial Credit/Rasch model
analysis was implemented by the Pearson test development team for item inspection. As De Jong
and Zheng (2011) explain, the complete item response dataset [taken from a pilot study spanning
2007-2009] was equally split into odd- and even-numbered items, whereupon Partial
Credit/Rasch model analysis was applied separately to both sets of items. Item fit statistics were
then evaluated for both odd- and even-numbered items, with misfitting items subsequently
deleted. Following this, even-item calibration was linked to odd-item calibration, with odd-item
calibration serving as the base metric. For a more detailed explanation of this analysis, see pages
6-7 in De Jong and Zheng (2011).
As a result of the initial item analysis (described above), it was possible to directly
compare the performance of all items across the 10 different test forms. For the present study,
item parameter estimates for the difficulty (threshold, b parameter) of all 172 reading test items
ranged from -1.43 (least difficult) to +2.85 (most difficult). INFIT and OUTFIT statistics (t <
2.0) confirmed the fit of test items, with a mean of 1.69 and standard deviation of .36.
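For intuition about the logit scale on which these difficulty estimates are expressed, the following Python sketch implements only the dichotomous Rasch model; it does not reproduce the Partial Credit/Rasch calibration carried out by the Pearson test development team.

import numpy as np

def rasch_prob(theta: float, b: float) -> float:
    """Probability of a correct response for ability theta and item difficulty b
    under the dichotomous Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# An average examinee (theta = 0) facing the easiest and hardest items reported above.
for b in (-1.43, 2.85):
    print(f"b = {b:+.2f}: P(correct) = {rasch_prob(0.0, b):.2f}")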
In addition, using classical test theory (CTT), separate item difficulty estimates for all
172 reading test items were also determined for the high- and low-performing groups. CTT was
deemed suitable to use with the relatively small sample sizes of the two groups, as there was
stability across test items and the examinee sample was representative of the intended test-
taker population. For dichotomous items, item difficulty represented the proportion of examinees
that got the item correct. For polytomous items, item difficulty represented the mean score of
each item. Furthermore, because the unit of measurement for item difficulty values was not
constant across all test items, an order-preserving arcsin transformation was performed (see
Garrison, Dowaliby, & Long, 1992). Item difficulty estimates ranged from 0.92 to 0.19 for the
high-performing examinees, while estimates ranged from 0.74 to 0.08 for the low-performing
group. Overall, there appeared to be a strong relationship between the item difficulty estimates
determined by the IRT-based model and the CTT-based model (r = .956).
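The CTT computations can be illustrated with the following Python sketch, which uses a simulated response matrix in place of the actual data and applies one common form of the arcsine transformation, 2*arcsin(sqrt(p)); the exact variant used in the study is not reported, so this form should be read as an assumption.

import numpy as np
import pandas as pd

# Simulated response matrix standing in for the real data: rows = examinees,
# columns = items; dichotomous items scored 0/1, polytomous items rescaled to 0-1.
rng = np.random.default_rng(7)
responses = pd.DataFrame((rng.random((434, 172)) < 0.6).astype(float))

# CTT item difficulty: proportion correct (or mean item score for polytomous items).
p_values = responses.mean(axis=0)

# Order-preserving arcsine transformation (a common variance-stabilizing form).
p_arcsin = 2 * np.arcsin(np.sqrt(p_values))

print(p_arcsin.describe())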
Regression analysis. To address the research questions, correlations and standard
multiple regression analyses were employed. Since regression analysis treats all independent
variables as numerical (i.e., interval or ratio scale variables), dummy variables had to be created
in order to analyze the nominal scale variables included in the present study (i.e., abstractness of
information, coherence of passages, and comprehension-type). A dummy variable (scored
dichotomously, 0 or 1) was created for each level of the three predictor variables listed above,
with one dummy variable then being dropped from each set for the regression analysis. Each of
the remaining dummy variables was then treated as a separate predictor variable throughout the
regression analysis. In total, 10 dummy variables were created for the three aforementioned
predictor variables (see Table 4).
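The dummy-coding step can be illustrated with pandas, as in the sketch below; the category labels are hypothetical shorthand for the levels described above, not the authors' actual coding labels.

import pandas as pd

# Hypothetical item-level data with the three nominal predictors.
items = pd.DataFrame({
    "abstractness": ["most_concrete", "highly_abstract", "intermediate"],
    "coherence": ["association", "causation", "description"],
    "comprehension_type": ["text_based", "integration", "knowledge_based"],
})

# drop_first=True drops one dummy per set, mirroring the approach described above
# (one dummy variable dropped from each set before the regression analysis).
dummies = pd.get_dummies(
    items,
    columns=["abstractness", "coherence", "comprehension_type"],
    drop_first=True,
)
print(dummies.columns.tolist())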
[INSERT TABLE 4 ABOUT HERE.]
Results
Descriptive statistics for the reading test item features were first calculated (see Table 5).
Then, Pearson product-moment correlation coefficients were determined to assess the
relationship between reading item difficulty scores and each of the item features. Next, two
separate regression models were considered (i.e., Regression I and Regression II). Prior to
conducting the regression analyses, assumptions for using this statistical test were first evaluated.
In what follows, the results of the correlation analyses are discussed, followed by the outcomes
of the assumption tests and the results for the regression models.
[INSERT TABLE 5 ABOUT HERE.]
Relationship between Reading Item Variables and Item Difficulty Scores
Results showed that correlation coefficients among the reading test item variables ranged
from +0.57 to -0.82, while correlations between the features and the item difficulty scores ranged
from +0.65 to -0.48 (see Table A-1 in Appendix). There were significant correlations between
item difficulty scores and number of correct options, as well as between the item difficulty
scores and two of the dummy variables (integrative comprehension-type and most abstract
information). A relatively strong negative correlation between type/token ratio and number of
words was found (-0.82); meanwhile, none of the other correlations were larger than +/-0.57,
suggesting that multicollinearity was moderate-to-low among the majority of the test item
features (Meyers, Gamst, & Guarino, 2006).
Effects of Reading Item Variables on Item Difficulty
As a result of the strong negative correlation between type/token ratio and number of
words, two standard regression analyses were conducted, as the inclusion of both variables
within the same model might create some concern with respect to multicollinearity (Meyers et
al., 2006). Overall, item difficulty scores for 172 reading test items were regressed against eight
continuous predictor variables, incorporating either type/token ratio (the Regression I model) or
number of words (the Regression II model), and 10 dummy variables.
Evaluation of assumptions. The assumptions of the regression analysis were evaluated
from three different perspectives. First, regarding the assumption of normality, the skewness and
kurtosis values ranged from +1.94 to -1.28; as the values were between +/-2, they were
considered acceptable (Field, 2009). Furthermore, the histograms of the residuals for all
regression models were normally distributed (see Figure A-1 in Appendix), satisfying the
assumption of normality of residuals. Second, the scatterplots of standardized residuals against
standardized predicted values indicated that the data were randomly and evenly scattered for the
regression analyses (refer again to Figure A-1 in Appendix). This implies that the assumptions of
linearity and homoscedasticity were met (Field, 2009). Finally, all of the values of tolerance or
VIF (Variance Inflation Factor) were between 0.10 and 10 for Regression I, suggesting that there
were no problems with multicollinearity among the predictor variables in this particular model
(Meyers et al., 2006). However, some of the values were not within this range for Regression II,
suggesting possible problems with multicollinearity.
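The multicollinearity screen can be illustrated with variance inflation factors, as in the sketch below; the design matrix is simulated purely to show how predictors exceeding the commonly cited VIF cut-off of 10 (tolerance below 0.10) would be flagged.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Compute a variance inflation factor for every column of a design matrix."""
    return pd.DataFrame({
        "predictor": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })

# Illustrative design matrix: two highly correlated predictors plus one independent one.
rng = np.random.default_rng(0)
x1 = rng.normal(size=172)
X = pd.DataFrame({"x1": x1,
                  "x2": x1 + rng.normal(scale=0.1, size=172),
                  "x3": rng.normal(size=172)})

# Predictors with VIF > 10 (tolerance < 0.10) would signal multicollinearity problems.
print(vif_table(X))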
Regression models. Adjusted R2 is reported for both models (as opposed to reporting total
R2), as it is considered more appropriate when a large number of predictor variables are included,
which can result in overfitting the model. For Regression I, adjusted R2 = 0.81, while for
Regression II, adjusted R2 = 0.75, indicating that 81% and 75% of the variance in the item
difficulty scores, respectively, could be explained by the test item features incorporating either
type/token ratio or number of words [F(16,155) = 36.99, p = 0.001; F(16,155) = 36.71, p = 0.001].
As for the individual predictor variables, mean length of sentence, number of falsifiable
distractors, and number of correct options were significant in both regressions, while the variable
for questions that requested highly abstract information was only significant in Regression I.
However, due to problems with multicollinearity associated with Regression II (discussed above),
the model for Regression I was seen as more favorable. Therefore, only the results for Regression
I are presented in detail (see Table 6). Following this initial analysis, the four significant predictor
variables from the Regression I model (mean length of sentence, number of falsifiable distractors,
number of correct options, and highly abstract information) were subjected to another (standard)
regression analysis. Overall, adjusted R2 = 0.59 for the four variables.
[INSERT TABLE 6 ABOUT HERE.]
Effects of Reading Item Features for Low- and High-performing Readers
Two additional (standard) regression analyses were conducted to investigate the extent to
which the reading test features predicted item difficulty scores for low- and high-performing
readers. The same set of predictor and dummy variables from Regression I were regressed against
the item difficulty scores for both groups. The predictor variable, number of words, was excluded
from the analysis, as collinearity statistics for the initial analysis indicated that there were
problems when it was used in place of type/token ratio.
Low-performing readers. Correlation coefficients among the reading test item features
ranged from +0.56 to -0.48, while correlations between the features and the item difficulty scores
ranged from +0.58 to -0.44 (see Table 7). Adjusted R2 = 0.73 [F(16,155) = 27.49, p = 0.010]. As
shown in Table 8, only one variable, number of falsifiable distractors, appeared to significantly
predict item difficulty scores for the low-performing readers. This one variable was then subjected
to another (standard) regression analysis, with adjusted R2 = 0.32.
High-performing readers. Correlation coefficients among the reading test item features
ranged from +0.57 to -0.54, while correlations between the features and the item difficulty scores
ranged from +0.88 to -0.49 (again, see Table 7). For the high-performing group, adjusted R2 =
0.78 [F(16,155) = 32.97, p < 0.001]. A number of predictor variables, including mean length of
sentence, causation (coherence structure), number of falsifiable distractors, number of correct
options, and questions requesting highly abstract information, appeared to significantly predict
item difficulty scores for the high-performing readers; information for the significant predictors in
the regression analysis for the high-performing readers is also presented in Table 8. These five
variables were then subjected to another (standard) regression analysis, with adjusted R2 = 0.56.
[INSERT TABLE 7 ABOUT HERE.]
[INSERT TABLE 8 ABOUT HERE.]
Discussion
For the entire group of test-takers, variables from three facets (i.e., passage, question, and
response-format) accounted for 81% of systematic variance in item difficulty. Comparing low-
performing and high-performing readers, 73% and 78% of the variance in item difficulty scores
could be attributed to the combined contribution of features included in the regression analyses,
respectively. Four significant predictors (mean length of sentence, number of falsifiable
distractors, number of correct options, and questions requesting highly abstract information) were
identified as making a unique contribution to item difficulty for the overall group, with one
additional predictor, causation (coherence structure), identified for the high-performing readers.
Only one variable, the number of falsifiable distractors, was found to be a significant predictor for the low-
performing readers. As indicated by the squared partial correlations, which can be interpreted as
effect size estimates, each of the five predictors helped to individually explain a sizable amount of
the variability associated with item difficulty scores, ranging from effect sizes of 36% to 68% for
the entire group and from 52% to 77% for the high-performing readers, with an effect size of 59%
for the number of falsifiable distractors for the low-ability readers.
Two predictors, mean length of sentence and coherence (causation), were the only passage-
related features found to be significant. As for sentence length, items were more difficult as the
mean length of sentence increased. This outcome is supported by prior empirical research, which
has found that longer sentences require the reader to retain more information in the short-term
memory, since they are likely to contain more clauses and propositions (e.g., see White, 2010).
From a theoretical standpoint, the increase in sentence length makes it more challenging to
process the text in order to complete word recognition, syntactic parsing, and to establish the
relationships among various elements of a proposition, all of which are the processes that L2
readers employ in order to build a text model of comprehension (Kintsch, 1998). On the other
hand, coherent passage organization, in which the propositions are presented following a well-
established structure, seems to facilitate readers' construction of the textbase (Grabe, 2009;
Kintsch, 1998), albeit only for the high-performing readers. It is likely the case that the tightly-
organized structure of causation texts, which requires a reader to relate ideas both in terms of
time sequence and the connection between antecedent and consequent (Kobayashi, 2002, 2004),
may be more comprehensible for better readers, as they are more aware of overall text
organization and can utilize that awareness to enhance understanding. This supports the claim
that passage cohesion, particularly as it relates to structured and unstructured texts, can be
influential for discriminating different levels of L2 reading comprehension (Kobayashi, 2002).
Here, it is also worth noting that none of the passage features were significant predictors
of item difficulty for low-performing readers. While we can only speculate about the reasons for
this finding, it seems plausible that, overall, the reading passages were quite challenging for many
of the low-performing readers, who therefore showed a much more random response pattern when
completing the reading test items.
In terms of question (i.e., contextual) features, there were two significant predictors of
item difficulty (i.e., number of false distractors and abstractness of the targeted answer). As the
number of falsifiable distractors increased, so did item difficulty. While some studies found the
effect of falsifiable distractors to be minimal (e.g., Ozuru, Rowe, O'Reilly, & McNamara, 2008),
the squared-partial correlations (0.68 for the entire group, 0.59 for low-performing readers and
0.55 for high-performing readers) indicate large effect sizes for the present study. This might be
explained by the rather unique format of selected-response items used in PTE Academic. That is,
for most test items that required more than one correct answer, there were typically only 1-2
falsifiable distractors present, sometimes leaving 2-3 plausible options. Therefore, the examinees
likely had to employ different strategies to answer the varying item formats in PTE Academic,
which might have forced them to more closely attend to information in reading passages and
how it related to the question, thereby increasing cognitive demands and leading to greater item
difficulty (Garrison, Dowaliby, & Long, 1992). While the falsifiability of distractors was found
to be a significant predictor of item difficulty for both groups of examinees, we have to wonder
about whether both groups used the same set of strategies and resources to complete the task. For
high-performing readers, who might have been able to build a text model of the passage, the
information in the answer options might have required additional confirmation of the initially
established textbase in order to distinguish between false and plausible distractors and to deactivate
unnecessary (or inaccurate) information that was inconsistent with the main theme. If, however,
the participants were already struggling with constructing the textbase (as was likely the case
with low-performing readers in the present study), any additional inconsistencies between
information in the text and information in the response options could have increased the
difficulty of test items.
Furthermore, for the entire group and the high-performing readers, items became more difficult
when the information requested in a question was highly abstract. Research largely
supports the proposition that more abstract content in reading passages increases comprehension
difficulty (e.g., Freedle & Kostin, 1993; Lumley, Routitsky, Mendelovits, & Ramalingam, 2012;
Ozuru et al., 2008). As Freedle and Kostin (1993) argue, information is more difficult to identify
and scan for in an abstract passage because details are obscured and not explicitly stated.
The abstractness of the information requested by a question likewise increased item difficulty.
Interestingly, all questions targeting highly abstract information (i.e., cause, effect, reason, and
result) accompanied passages in which this information was not part of the main content. Since the
information structure of these passages did not involve a causal (or time-sequence) relationship as
a means of organizing ideas, it is likely that the test-takers had to
additionally activate and integrate this information when constructing the situation model of
passage interpretation. However, it was surprising that abstractness of information was not also a
significant predictor for low-performing readers in the present study. While these findings are
difficult to account for, they could be attributed to the effect of other features of the source
passage. For example, Ozuru, Rowe, O'Reilly, and McNamara (2008) found that some features
(e.g., propositions, sentence length, and word frequency) influenced the abstractness of
questions, albeit not systematically. Therefore, based on their findings, as well as the findings of
our own study, it appears that there is not a systematic influence across all proficiency levels.
Finally, the number of correct options, the only response-format feature, was found to be a
significant predictor for the overall group as well as the high-performing readers. This is not an
altogether surprising finding, as it appears logical that an increase in task demands (i.e., requiring
readers to identify more pieces of information and relate them back to the passage) would lead to
an overall increase in the difficulty of a reading item.
Overall, test-takers appear to have encountered various sources of difficulty, with features clearly
associated with passage, question, and response format. In the present study, it appears that for
low-performing readers there was a single contextual feature that predicted the difficulty of reading
test items. The two-process model of comprehension suggests that their lower level of reading ability
likely prevented them from building a well-developed text model and caused them to rely more on
contextual features to complete the task, possibly without a thorough comprehension of the text. As
Grabe (2009) argues, low-performing readers often give up trying to build an appropriate text model
when working through a difficult text (p. 49). For high-performing readers, however, there was a
more even split between the textual
and contextual features, suggesting that participants at higher levels of reading ability were more
consistent in their sensitivity to features of text organization and discourse-signaling mechanisms
during passage comprehension.
Limitations and Implications
There are several limitations of the present study that should be acknowledged. First, not
all features that have been found to significantly predict item difficulty in previous research
were investigated in the present study (e.g., see Carr, 2006; Freedle & Kostin, 1993). While a
combination of factors was found to affect the difficulty of selected-response items for the
present study, additional research might investigate the effect of different selected-response item
formats on examinees' performance on L2 reading comprehension tests (see Haladyna, 2012).
Also, the quantitative analysis of the features carried out in the present study did not
explicitly address the processes that test-takers employed during the test. Ideally, we would have
carried out a follow-up qualitative study employing verbal reports in order to investigate the effect
of text and item factors on test-takers' responses to the test questions, using Rupp et al.'s (2006)
study as a model. Rupp and colleagues employed semi-structured interviews and think-aloud
protocols to investigate the effect of text and item characteristics on test-takers' responses
to reading MC questions from the CanTEST, a large-scale paper-and-pencil proficiency test. The
results of their analyses indicated that participants were selecting different strategies to answer
MC questions based on a number of factors, including their previous experiences with MC tests,
their experiences with instructors, and the perceived characteristics of the text and the questions.
However, as we did not have access to examinees, this type of analysis could not be carried out.
Beyond these limitations, the results of the present study have several important implications for L2
reading assessment. First, the results indicated that the majority of significant predictors identified
in the study were textual/passage-related features (e.g., sentence length, coherence structure) and
contextual/question-related features (e.g., number of falsifiable distractors, questions requesting
highly abstract information), suggesting that both facets of features were largely responsible for the
difficulty participants experienced during text comprehension. Therefore, we believe that the
two-model account of reading comprehension (Grabe, 2009;
Kintsch, 1998) is informative for understanding the nature of L2 reading comprehension and
should serve as an important theoretical model for designing and constructing tests of L2
reading, as well as interpreting the results of such tests. This account is particularly useful in that
it depicts reading as a combination of top-down and bottom-up processes, whereby individuals
(at different levels of reading ability) engage in a range of strategic processing skills to
comprehend a written text.
Second, with multiple passage, question, and response-format features found to be
important predictors of difficulty for selected-response items used in the PTE Academic reading
test, the results of the study highlight the need to carefully consider the test item characteristics
that contribute to the difficulty of selected-response items used in similar high-stakes language
tests; it is crucial that we understand how these features, particularly those associated with
response-format, impact the interpretation of test results. Issues related to test item format (i.e.,
response-format features) have great potential to introduce construct-irrelevant variance, a major
threat to the validity of high-stakes tests. Therefore, a clearer understanding of the types of
features associated with easier or harder reading items is an important piece of knowledge that
test developers and language instructors need to have when developing reading assessments.
The final implication concerns the differences in test performance found between
low- and high-performing readers. The grouping technique in the present study enabled us to
sample readers at the lowest and highest levels of reading ability, illustrating that the
performance of the overall group did not necessarily reflect the more nuanced differences found
between the low- and high-performing readers. As the low- and high-performing readers varied
significantly across a number of predictors used in the present study, it is important for language
test developers to also consider how various test item characteristics might interact with learner-
related variables, such as language proficiency. This is particularly important when designing
norm-referenced language tests, in which the goal is to differentiate test-takers.
References
Alderson, J.C. (2000). Assessing reading. Cambridge, UK: Cambridge University Press.
Aebersold, J.A., & Field, M.L. (1997). From reader to reading teacher. Cambridge: Cambridge
University Press.
Bachman, L.F., Davidson, F., & Milanovic, M. (1996). The use of test method characteristics in
the content analysis and design of EFL proficiency tests. Language Testing, 13, 125-150.
Carr, N.T. (2006). The factor structure of test task characteristics and examinee performance.
Language Testing, 23, 269-289.
Cobb, T. Web Vocabprofile [accessed January 2013 from http://www.lextutor.ca/vp/], an
adaptation of Heatley, Nation, & Coxhead's (2002) RANGE program.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213-238.
Currie, M., & Chiramanee, T. (2010). The effect of the multiple-choice item format on the
measurement of knowledge of language structure. Language Testing, 27, 471-491.
Davey, B. (1988). Affecting the difficulty of reading comprehension items for successful and
unsuccessful readers. The Journal of Experimental Education, 56, 67-76.
De Jong, J.H.A.L., Li, J., & Duvin, J. (2010). Setting requirements for entry-level nurses on PTE
Academic. Internal Report, Pearson Test of English Academic.
De Jong, J.H.A.L., & Zheng, Y. (2011). Applying EALTA Guidelines: A practical case study
on Pearson Test of English Academic. Research Notes, Pearson Test of English
Academic. Retrieved from http://www.ealta.eu.org/documents/archive/
EALTA_GGP_PTE_Academic.pdf
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage Publications.
Freedle, R., & Kostin, I. (1993). The prediction of TOEFL reading comprehension item difficulty
for expository prose passages for three item types: main idea, inference, and supporting
idea items (ETS Research Report). Princeton, NJ: ETS.
Garrison, W., Dowaliby, F., & Long, G. (1992). Reading comprehension test item difficulty as a
function of cognitive processing variables. American Annals of the Deaf, 137, 22-30.
Grabe, W. (2009). Reading in a second language. Cambridge University Press.
Haladyna, T.M. (2012). Developing and validating multiple-choice test items (3rd ed.). New
York, NY: Routledge.
Khalifa, N., & Weir, C. J. (2009). Examining reading: Research and practice in assessing
second language reading. New York: Cambridge University Press.
Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-
integration model. Psychological Review, 95, 163-182.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. New York, NY: Cambridge
University Press.
Kobayashi, M. (2002). Method effects on reading comprehension test performance: Text
organization and response format. Language Testing, 19, 193-220.
Kobayashi, M. (2004). Reading comprehension assessment: From text perspectives. Scientific
Approaches to Language (Center for Language Sciences, Kanda University of
International Studies), 3, 129-157.
Koda, K. (2007). Reading and language learning: Crosslinguistic constraints on second language
reading development. In K. Koda (Ed.), Reading and language learning (pp. 1-44).
Special issue of Language Learning Supplement, 57, 1-44.
Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing.
International Journal of Corpus Linguistics, 15, 474-496.
Lumley, T., Routitsky, A., Mendelovits, J., & Ramalingam, D. (2012). A framework for
predicting item difficulty in reading tests. Retrieved from
http://research.acer.edu/au/pisa/5.
McCray, G., & Brunfaut, T. (2016). Investigating the construct measured by banked gap-fill
items: Evidence from eye-tracking. Language Testing, 35, 51-73.
McKenna, M.C., & Stahl, K.A.D. (2009). Assessment for reading instruction (2nd ed.). New
York, NY: The Guilford Press.
Meyers, L.S., Gamst, G., & Guarino, A.J. (2006). Applied multivariate research: Design and
interpretation. London: Sage Publications.
Ozuru, Y., Rowe, M., O'Reilly, T., & McNamara, D.S. (2008). Where's the difficulty in
standardized tests: The passage or the question? Behavior Research Methods, 40, 1001-
1015.
Pae, H. (2011). Differential item functioning and unidimensionality in the Pearson Test of
English Academic. Pearson Research Notes. Retrieved from
http://pearsonpte.com/research/research-summaries-notes/.
Pae, H. (2012a). Construct validity of the Pearson Test of English Academic: A multitrait-
multimethod approach. Pearson Research Notes. Retrieved from:
http://pearsonpte.com/research/research-summaries-notes/.
Pae, H. (2012b). A model for receptive and expressive modalities in adult English learners'
academic L2 skills. Pearson Research Notes. Retrieved from
http://pearsonpte.com/research/research-summaries-notes/.
Pearson Test of English Academic (2012). Scoring Guide. Retrieved from
http://pearsonpte.com/wp-content/uploads/2014/07/PTEA_Score_Guide.pdf
Rumelhart, D. (1980). Schemata: The building blocks of cognition. In: R. J. Spiro, B. C. Bruce &
W. F. Brewer. (Eds.), Theoretical issues in reading comprehension (pp. 33-58). Hillsdale,
NJ: Erlbaum.
Rupp, A., Ferne, T., & Choi, H. (2006). How assessing reading comprehension with multiple-
choice questions shapes the construct: a cognitive processing perspective. Language
Testing, 23(4), 441-474.
Shohamy, E. (1984). Does the testing method make a difference? The case of reading
comprehension. Language Testing, 1(2), 147-170.
Weir, C., Hawkey, R., Green, A., Unaldi, A., & Devi, S. (2009). The relationship between the
academic reading construct as measured by IELTS and the reading experiences of
students in their first year of study at a British university. IELTS Research Reports,
Volume 9. Retrieved from http://www.ielts.org/researchers/research/volume_9.aspx.
West, M. (1953). A General Service List of English Words. London: Longman, Green & Co.
White, S. (2010). Understanding adult functional literacy: Connecting text features, task
demands, and respondent skills. New York, NY: Routledge.
Yamashita, J. (2003). Processes of taking a gap-filling test: Comparison of skilled and less
skilled EFL readers. Language Testing, 20(3), 267-293.
Zheng, Y., & De Jong, J. (2011). Establishing construct and concurrent validity of Pearson Test
of English Academic. Pearson Research Note. Retrieved from
http://pearsonpte.com/research/research-summaries-notes/.
Appendices
Table A-1: Intercorrelations among predictor variables and item difficulty scores
[INSERT TABLE A-1 ABOUT HERE]
Figure A-1: Histograms and scatterplots for the residuals
[INSERT FIGURE A-1 ABOUT HERE]
Table 1. Distribution of Test Takers across PTE Academic Test Forms

Form   No. of test takers^a    No. of L1s represented^b
A      30 (H = 13; L = 17)     3
B      25 (H = 11; L = 14)     3
C      23 (H = 16; L = 7)      4
D      48 (H = 20; L = 28)     4
E      54 (H = 26; L = 28)     6
F      56 (H = 22; L = 34)     8
G      50 (H = 18; L = 32)     7
H      48 (H = 23; L = 25)     8
I      54 (H = 24; L = 30)     6
J      46 (H = 19; L = 27)     5
Notes: ^a H = high performers; L = low performers; ^b 16 distinct L1s were represented among the 434 test takers.
Table 3. Features examined in PTE Academic reading comprehension test

Passage-related
  General complexity: Total number of words in a reading passage
  Syntactic complexity: Average number of words per sentence for a reading passage; average number of independent and dependent clauses per sentence for a reading passage
  Lexical complexity: Number of unique function and content words divided by the number of repeated words; number of off-list words divided by the total number of words in a reading passage
  Passage structure: Content structure of a reading passage

Question-related
  No. of falsifiable distractors: Number of falsifiable distractor options for an item
  Comprehension-type: Type of question used to demonstrate comprehension of a passage
  Abstractness of targeted answer: Level of concreteness or abstractness of the information requested by a question

Response-format
  No. of response alternatives: Number of alternatives provided for an item
  No. of correct options required: Number of correct answers needed to obtain full credit for an item
  Average length of options: Average number of words per item option
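As an illustration of how passage-level features of the kind listed in Table 3 can be operationalized, the sketch below computes total words, mean sentence length, a standard type-token ratio, and the proportion of off-list words for a passage. The tokenization and the base word list are deliberate simplifications, standing in for the vocabulary-profiling tools cited in the reference list rather than reproducing them.

```python
# Sketch: computing a few passage-level features in the spirit of Table 3.
# The type-token ratio below is the conventional unique/total ratio, a
# simplified stand-in for the lexical-complexity measure described above.
import re

def passage_features(text, base_word_list):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "total_words": len(tokens),
        "mean_sentence_length": len(tokens) / max(len(sentences), 1),
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),
        "prop_off_list": sum(t not in base_word_list for t in tokens)
                         / max(len(tokens), 1),
    }

# Hypothetical usage with a tiny base word list standing in for the
# frequency lists used in vocabulary profiling.
base_list = {"the", "of", "and", "a", "to", "in", "is", "that", "it", "on"}
print(passage_features("Reading is complex. It draws on many processes.", base_list))
```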
Table 4. List of dummy variables included in regression analysis

Feature              No. of original levels   Final dummy variables
Abstractness         5                        1. Highly concrete; 2. Intermediate; 3. Highly abstract; 4. Most abstract
Coherence            4                        1. Causation; 2. Description; 3. Response
Comprehension-type   4                        1. Integration; 2. Knowledge-based; 3. Restructure
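The reduction of categorical features to dummy variables summarized in Table 4 can be illustrated with the following sketch. The column names and levels are illustrative, and pandas' drop_first option simply omits the first observed level of each factor, which need not coincide with the reference categories chosen in the study.

```python
# Sketch: dummy-coding categorical item features before regression,
# analogous to Table 4 (one level of each factor serves as the reference).
import pandas as pd

items = pd.DataFrame({
    "abstractness": ["most concrete", "intermediate", "highly abstract"],
    "coherence": ["association", "causation", "description"],
})

# drop_first=True omits one reference level per factor, mirroring the
# reduced sets of dummy variables reported in Table 4 (the dropped level
# here is whichever comes first, not necessarily the study's choice).
dummies = pd.get_dummies(items, columns=["abstractness", "coherence"],
                         drop_first=True)
print(dummies.head())
```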
Table 5. Descriptive statistics for reading test item features

Facet / Feature                        Mean      SD    Skewness   Kurtosis
Passage-related
  Words                              107.20   64.24      -0.74      -0.42
  Length of sentence                  20.67    5.14       0.10       1.25
  Clauses per sentence                 1.74    0.55       0.84      -0.03
  Coherence^a
    Association                        0.32    0.23      -0.97      -1.00
    Causation                          0.12    0.28       1.22       0.97
    Response                           0.08    0.20       1.04       0.71
    Description                        0.48    0.51      -0.87      -1.11
  Type-token ratio                     0.72    0.11       0.71      -0.85
  Words off-list                       0.10    0.06      -1.21      -0.67
Question-related
  No. of falsifiable distractors       0.72    0.74       1.94       0.93
  Comprehension-type^a
    Text-based                         0.42    0.69       0.43      -0.52
    Restructure/reword                 0.16    0.33      -0.33       0.70
    Integration                        0.32    0.46      -0.89      -1.22
    Knowledge-based                    0.10    0.28       1.55       1.18
  Abstractness^a
    Most concrete                      0.26    0.35      -0.68       0.54
    Highly concrete                    0.16    0.41      -0.21      -0.46
    Intermediate                       0.30    0.48       0.47       0.39
    Highly abstract                    0.20    0.44       0.29       1.24
    Most abstract                      0.08    0.28      -0.37      -0.46
Response-format
  No. of response alternatives         5.16    1.52      -1.04      -1.28
  No. of correct options               3.24    1.40       0.99       1.45
  Length of options                    6.59    7.63      -0.80       0.32
Note: ^a Values for this variable represent proportion scores, ranging from 0 to 1.
Table 6. Regression I (incorporating type/token ratio); Total R2 = 0.81

Predictor                             Standardized      t     Sig.   Squared partial   Tolerance    VIF
                                      Coeff. Beta                    correlation
1. Mean length of sentence                 0.43       2.55   0.044        0.52            0.89     1.12
2. Clauses per sentence                   -0.09      -0.54   0.609        0.05            0.94     1.06
3. Coherence (causation)                  -0.25      -1.86   0.112        0.37            0.95     1.05
4. Coherence (response)                   -0.07      -0.57   0.593        0.05            0.24     4.12
5. Coherence (description)                -0.03      -0.28   0.790        0.01            0.24     4.11
6. Type-token ratio                       -0.07      -0.07   0.662        0.04            0.44     2.25
7. Words off-list                         -0.08      -0.63   0.551        0.06            0.84     1.19
8. No. of falsifiable distractors          0.52       3.65   0.011        0.68            0.33     2.99
9. No. of response alternatives           -0.14      -0.70   0.512        0.07            0.17     5.98
10. No. of correct options                 0.82       3.01   0.024        0.60            0.98     1.02
11. Length of options                      0.18       1.07   0.324        0.20            0.31     3.22
12. Comp-type (restructure)               -0.38      -1.77   0.127        0.35            0.20     4.94
13. Comp-type (integration)                0.09       0.30   0.777        0.01            0.31     3.20
14. Comp-type (knowledge-based)           -0.01      -0.05   0.960        0.00            0.32     3.12
15. Abstractness (highly concrete)         0.29       1.53   0.748        0.28            0.33     3.03
16. Abstractness (intermediate)            0.13       0.93   0.390        0.12            0.20     4.94
17. Abstractness (highly abstract)         0.27       1.87   0.048        0.36            0.20     4.89
18. Abstractness (most abstract)           0.34       1.23   0.266        0.20            0.33     2.99
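As a rough illustration of how the quantities reported in Table 6 can be obtained outside SPSS, the sketch below standardizes the outcome and predictors to obtain beta weights and then computes variance inflation factors; the data file and column names are hypothetical placeholders.

```python
# Sketch: obtaining standardized regression coefficients and variance
# inflation factors (VIF) like those reported in Table 6.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

items = pd.read_csv("item_features.csv")            # hypothetical file
predictors = ["mean_sentence_length", "clauses_per_sentence",
              "n_falsifiable_distractors", "n_correct_options"]

# Standardize the outcome and predictors so the fitted slopes are beta weights.
z = (items[["item_difficulty"] + predictors]
     .apply(lambda col: (col - col.mean()) / col.std()))

X = sm.add_constant(z[predictors])
fit = sm.OLS(z["item_difficulty"], X).fit()
print(fit.params)                                   # standardized betas
print(fit.pvalues)

# VIF for each predictor (the constant column is excluded from the report).
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```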
Table 7. Correlations among features for low- and high-performing readers

Feature                                Low group   High group
1. Mean length of sentence               -0.03        0.09
2. Clauses per sentence                   0.01        0.06
3. Coherence (causation)                 -0.19       -0.20
4. Coherence (response)                  -0.27       -0.24
5. Coherence (description)                0.20        0.32
6. Type-token ratio                       0.36*       0.33
7. Words off-list                        -0.34       -0.16
8. No. of false distractors               0.57**      0.40*
9. No. of response alternatives           0.05        0.35*
10. No. of correct options                0.58**      0.88**
11. Length of options                     0.08       -0.38*
12. Comp-type (restructure)              -0.38*      -0.31
13. Comp-type (integration)               0.34        0.51**
14. Comp-type (knowledge-based)          -0.27       -0.40*
15. Abstractness (highly concrete)       -0.28       -0.12
16. Abstractness (intermediate)           0.02        0.25
17. Abstractness (highly abstract)       -0.44*      -0.49*
18. Abstractness (most abstract)         -0.41*      -0.18
** Correlation is significant at the 0.01 level (2-tailed); * Correlation is significant at the 0.05 level (2-tailed).
Table 8. Regressions for low- and high-performing readers

Significant predictors^a            Standardized      t     Sig.   Squared partial   Tolerance    VIF
                                    Coeff. Beta                    correlation
Low-performers
  No. of false distractors               0.62       2.95   0.026        0.59            0.25     4.01
High-performers
  Mean length of sentence                0.38       2.65   0.038        0.55            0.55     1.82
  Coherence (causation)                 -0.29      -2.54   0.044        0.52            0.51     1.96
  No. of false distractors               0.33       2.67   0.037        0.55            0.28     3.62
  No. of correct options                 1.08       4.62   0.004        0.77            0.44     2.25
  Abstractness (highly abstract)         0.30       2.86   0.029        0.58            0.36     2.77
Note: ^a Only significant predictors from the regression models are reported here (p < .05).