This is an Accepted Manuscript of an article published by Taylor & Francis Group in
Educational Assessment on 9/20/2018, available online at:
https://www.tandfonline.com/doi/full/10.1080/10627197.2018.1517023
ABSTRACT
While previous research has identified numerous factors that contribute to item difficulty, studies
involving large-scale reading tests have provided mixed results. This study examined five
selected-response item types used to measure reading comprehension in the Pearson Test of
English Academic: a) multiple-choice (choose one answer), b) multiple-choice (choose multiple
answers), c) re-order paragraphs, d) reading (fill-in-the-blanks), and e) reading and writing (fill-
in-the-blanks). A multiple regression approach was utilized, with item difficulty scores for 172 items serving as the criterion measure. Eighteen passage, passage-question, and response-format
variables served as predictors. Overall, four significant predictors were identified for the entire
group (i.e., sentence length, falsifiable distractors, number of correct options, and abstractness of
information requested) and five variables were found to be significant for high-performing
readers (including the four listed above and passage coherence); only the number of falsifiable
distractors was a significant predictor for low-performing readers. Implications for assessing
reading comprehension are discussed.
Keywords: format, item difficulty, L2 reading, multiple-choice, regression, standardized test
Introduction
The complex nature of second-language (L2) reading has been acknowledged in both
theory- and pedagogy-oriented research (Alderson, 2000; Grabe, 2009). Grabe (2009) described
two types of cognitive processes that affect L2 readers’ ability to comprehend texts: lower-level
processes (including word recognition, syntactic parsing, and propositional encoding), and
higher-level processes (including understanding main ideas, interpreting a text, activating
background knowledge, and making inferences). Because these processes operate simultaneously
at various phases during reading, operationalizing the construct of reading ability is challenging
for assessment purposes, as one has to not only confirm that test items measure the same latent
trait, but also consider a number of factors that impact the difficulty level of test items.
A clear definition of reading ability is especially important in large standardized
proficiency tests, such as Test of English as a Foreign Language (TOEFL), International English
Language Test System (IELTS), and Pearson Test of English (PTE) Academic, all of which
include reading ability as a sub-construct within a more global communicative ability. Because
the test scores are used for admission and placement purposes at English-medium institutions,
high-stakes decisions are made based on test-takers’ performance. To date, a large body of
empirical research has been carried out to establish the construct of reading ability and to
examine several factors affecting the difficulty of test items in TOEFL and IELTS (e.g., Freedle
& Kostin, 1993; Carr, 2006; Weir, Hawkey, Green, Unaldi, & Devi, 2009).
PTE Academic, a relatively new test with a rapidly developing trajectory of use
internationally, has been drawing more attention from researchers and assessment specialists.
Specifically, recent studies have focused on establishing the construct validity of PTE Academic,
defining sub-skills targeted in the test, examining response formats and their effect on test-takers,
investigating differential item functioning, and examining rater behavior (e.g., Pae, 2011; Zheng
& De Jong, 2011). Yet, to our knowledge, no study to date has empirically investigated the
features that could potentially affect reading item difficulty on PTE Academic. The present
study, therefore, attempted to fill this void by determining the extent to which various features of
passages, question types, and response formats accounted for the difficulty of PTE Academic
reading test items, all of which are presented using different selected-response item formats.
It is worth noting here that a variety of terms are used throughout this paper to describe
aspects of test items, including those items used in the PTE Academic reading test. For the
present study, we took the perspective that a (reading) test item typically includes three
distinctive elements: (1) input, in the form of a written text or passage; (2) a stimulus, in the form
of a question or a stem that requires completion; and (3) a response, in the form of a constructed-
or selected-response format. Specifically, for the purposes of our study, we define a test item as
any part of the PTE Academic reading test in which an examinee must read a passage and choose among response options provided to them in order to answer a question or complete a stem. Furthermore, when discussing item formats or types throughout this
paper, we are generally referring to the multiple response formats that are possible when
designing selected-response test items. For example, in the case of the PTE Academic reading
test, the selected-response item formats or types that are utilized include multiple choice (MC),
re-ordering paragraphs, and filling-in-the-blanks. These specific item types are discussed in more
detail in the methods section of this paper.
In order to contextualize this study within existing research on L2 reading and to justify
the specific questions explored in the study, the following sections include a discussion of a
theoretical perspective on L2 reading as well as an overview of the main findings obtained from
empirical research that motivated the focus of the present study.
Reading Comprehension: Theoretical Underpinnings
In this study we adopted the current perspective for understanding reading as a dynamic
multi-dimensional construct that can be operationalized through a range of component processes
occurring at both lower and higher levels and, at certain points, engaging in interactions with
each other (Grabe, 2009; Khalifa & Weir, 2009). This approach to reading allows one to
conceptualize and examine the multiple cognitive processes that occur during reading and to
explore the complex nature of each of these component processes. While individual reading
comprehension frameworks and models organized under this larger multi-dimensional
componential perspective may have unique elements, they all view reading as a purposeful and
interactive activity during which a reader extracts and co-constructs the meaning encoded in a
text by employing a number of skills, processes, and strategies to achieve comprehension within
a specific context.
Current understanding of the construct of reading comprehension in both L1 and L2 has
been greatly influenced by schema theory, which emphasizes the importance of readers’
knowledge, experiences, values, and strategies that they bring to the reading process to achieve
comprehension (Rumelhart, 1980). Schemas, or schemata, are the cognitive constructs that
represent complex networks of information which can be utilized by people to organize new
experiences, stimuli, events, and situations. Researchers who have explored the role of schema in
the reading process seem to agree that readers who activate appropriate schemata, such as
possessing sufficient background knowledge about a topic, being aware of the discourse
structure of a text, and being able to decode input in a language, are in a better position to
achieve comprehension. In contrast, readers (both L1 and L2 learners) who either lack sufficient
schemata or are unable to activate them appropriately might experience more difficulty understanding a text (Aebersold & Field, 1997). Building on schema theory, Kintsch (1998) proposed an alternative account of comprehension, which he called the construction-integration
(CI) model. This model has been influential not only in theoretical discussions of what
constitutes comprehension during reading activities (e.g., Grabe, 2009; Weir, Hawkey, Green,
Unaldi, & Devi, 2009), but also in empirical research examining a number of trait factors that
can potentially affect participants’ reading performance (e.g., Pae, 2012a).
To account for the very fluid and context-sensitive nature of human comprehension, Kintsch’s CI model presents comprehension as a loosely organized bottom-up process which may initially
generate a somewhat chaotic and incoherent textbase that is then integrated into a well-ordered
coherent structure (i.e., the situation model). The steps involved in the construction of the
textbase include: (a) forming propositions that correspond to the linguistic input and represent
the meaning of a text, (b) elaborating each of the constructed elements, (c) inferring additional
propositions that might be required to complete the construction process (i.e., bridging inferences
that are employed whenever the constructed textbase is still largely incoherent), and (d)
establishing the interconnections between all elements. The outcome of the construction process
is a rich textbase structure that represents a cluster of generated propositions, with some of them
being derived directly from the text and others being closely connected to them. During the
integration step, when the situation model is developed, the propositional representation of the
text is then integrated with other knowledge and the material that is irrelevant and/or
inconsistent with the context becomes deactivated (Kintsch, 1988).
For the present study, the CI model provides insight into specific identifiable components
that might potentially account for the difficulty in text comprehension and, therefore, can be
important for assessment purposes. At the surface level of text comprehension, readers’ ability to
decode visual input can be potentially affected by the textual (i.e., passage) features, including
the complexity of the reading material, both lexical and syntactic, as well as the extent to which
the macrostructure of the passage is clearly organized and signaled. Likewise, the construction of
the situation model of the text depends on additional contextual features – the extent to which
readers are prompted and able to integrate information from the text with their background
knowledge as well as readers’ overall familiarity with the topic/content discussed in the text.
L2 Reading Comprehension: Additional Considerations
Expanding on Kintsch’s CI model of reading comprehension for native speakers, Grabe
(2009) offers further insights about how the comprehension process unfolds for L2 readers.
During the construction of the textbase, Grabe (2009) highlights additional complexities of the
decoding process related to L1-L2 differences, such as learning new letter-to-sound patterns or
acceptable syllable structures, that “are likely to have a significant impact on the speed and
accuracy of word-recognition processes in L2 reading development, particularly at lower
proficiency levels” (p. 121). In terms of language skills, L2 readers, especially beginning
learners, have a much more limited vocabulary when they start learning to read in their L2, in
addition to an impoverished knowledge of L2 morphology and syntax. Furthermore, the nature
of L1-L2 differences will dictate the extent to which L1 transfer influences will affect reading
comprehension and which linguistic resources will be affected the most (e.g., morpho-syntactic
patterns, vocabulary use, discourse practices).
In the past, some researchers have proposed that L2 reading ability is likely to include
general language resources that are universal and, therefore, transferable (e.g., Koda, 2007).
However, Grabe (2009) argues that, depending on the structural and semantic differences
between the languages, these universal aspects of reading ability might not always transfer well.
In fact, even when the two language systems share similarities, there are still limitations in how
(and to what extent) the universal aspects of reading ability can be transferred to L2 reading
skills. Therefore, a more productive approach to capturing the complexities of L2 reading ability
seems to be viewing it as a dual-language processing system that evolves over time and presents
“a case of learning to read with languages” rather than “learning to read in another language” (p.
129). According to Grabe, the emergent system is dynamic in nature and is sensitive to the task
conditions, the reader, and the context. Based on a reader’s specific background knowledge, their
level of language proficiency, the specific task requirements, and the reading goals, L2 readers
might be engaging a range of skills at different levels of processing to achieve comprehension.
Assessment of Reading Comprehension
To this point, our discussion of reading comprehension has focused primarily on
understanding how readers process textual information and integrate it with their background
knowledge in non-testing conditions (i.e., in situations that do not purposefully assess readers’
ability to achieve comprehension). In testing conditions, however, a multifaceted construct of
reading ability takes on an additional layer of complexity associated with the methods used to
assess reading comprehension.
Reflecting on the specific challenges related to the validity of reading comprehension
tests, Ozuru, Rowe, O’Reilly, and McNamara (2008) noted that the complexities of the many
underlying sub-processes that comprise reading comprehension as well as considerable
variability in reading situations present major challenges to adequately measuring reading ability.
This, in turn, often results in “the artificial nature” of reading comprehension tests and, therefore,
justifies current research inquiries into the generalizability of results from large standardized
reading tests that employ different selected-response item formats, such as gap-fill and MC
(McCray & Brunfaut, 2016). The main focus of such inquiries is to examine if (and to what
extent) tests using selected-response items elicit performances that accurately reflect readers’
abilities to comprehend a text, especially when compared to “real-life” comprehension (i.e.,
when reading outside the testing context).
One of the ways to investigate this issue is to focus on the cognitive processes and the
extent to which the processes engaged during a reading test correspond to the cognitive processes
that readers typically perform during reading activities outside the testing situation (e.g., see
McCray & Brunfaut, 2016; Yamashita, 2003). However, in situations when a researcher only has
access to participants’ test scores (i.e., no observation data is available), a product- (rather than
process-) oriented approach to examining the validity of reading test items might be a more
appropriate alternative. In this approach, which has been employed extensively in the field, test-
takers’ performance is typically examined in light of other relevant factors. The majority of the
investigations that have employed this approach have focused on selected-response items, with
the goal of investigating the extent to which they were able to measure different aspects of
reading ability and, therefore, to serve as an efficient means to assess L2 learners’ reading skills
(see Currie & Chiramanee, 2010; Kobayashi, 2004; Rupp, Ferne, & Choi, 2006; Shohamy,
1984). For example, Currie and Chiramanee (2010) argued that MC items had a distorting effect
on the measurement of language ability, as these items seemed to have introduced a “greater
proportion of format-related ‘noise’ than the language performance actually sought” (p. 487).
Yet, due to their ease of administration and scoring, as well as their usefulness for
targeting a range of cognitive skills, selected-response items are often preferred for large-scale
standardized tests, including PTE Academic. While considerable efforts have been made to
develop tests that are inexpensive, efficient to administer, reliable, and valid, some concerns
about construct-irrelevant variance that might be attributed to item formats used in PTE
Academic still remain. Specifically, in a study examining the construct validity and the effect of
the testing method utilized in PTE Academic, Pae (2012a) reported that while, in general, “the
assessment methods are adequate, mainly for measuring the given traits, […] some portion of the
variance was related to […] question formats, suggesting that the question type might have
assessed different constructs, especially for the prescribed-response format” (pp. 6-7). Thus,
according to Pae (2012a), further attention should be devoted to investigating the (potential)
effect of the format features of the reading items on participants’ test performance. Following
this call for more research, the present study includes a group of item format features as
additional factors that might affect L2 participants’ reading performance, in addition to the
textual (passage) and contextual (question) variables that affect participants’ reading
comprehension in non-testing situations. The following section includes an overview of
empirical studies that examined the different categories of features (i.e., passage, question, and
response-format) affecting the difficulty of selected-response test items.
Effect of Item Features on the Difficulty of Selected-Response Items
Empirical research conducted on the effect of item format features on test-takers’
performance has had a very practical rationale. As Bachman, Davidson, and Milanovic (1996)
argue, test developers must determine “what characteristics are relevant to test performance, how
best to obtain information about these characteristics, and how to use information about test and
item characteristics to describe tests better and to predict and interpret test performance” (p.
129). The influence of test characteristics on L2 reading performance is particularly relevant in
cases where selected-response items are used, as some have argued that these items alone are not
sufficient to measure reading ability (see Haladyna, 2012).
Quantitative investigations into the effect of item format features on test-takers’ L2
performance have typically examined how various coded features of passages and questions
accounted for item difficulty and/or item discrimination (see Bachman et al., 1996; Carr, 2006;
Freedle & Kostin, 1993; Garrison, Dowaliby, & Long, 1992; Ozuru, Rowe, O’Reilly, &
McNamara, 2008). For example, Garrison, Dowaliby, and Long (1992) provided evidence for
the importance of the passage-question relationship (or what we refer to as question-related
variables in the present study), as they found that the amount of transformation of text
information needed to be able to answer a question, ranging from verbatim to inference
questions, as well as the background knowledge required to answer a question, proved to be key
predictors of item difficulty. In addition to these two features, the number of words in the
question combined with the number of words in a passage contributed to increased difficulty of
test items. Davey (1988) also reported that question length positively correlated with passage
sentence length, passage coherence, and number of content words included in a question; that is,
passages with longer sentences, which were found to be more coherent yet more difficult, tended
to accompany longer questions.
In a study of TOEFL MC reading items, Freedle and Kostin (1993) examined a set of 75
features that had been identified in previous research as important predictors of reading
comprehension. The results indicated that seven predictor categories jointly accounted for
reading item difficulty: (1) lexical overlap between the correct answer and the text, (2) key
sentence length (for supporting idea items), (3) concreteness of the text, (4) rhetorical
organization of a passage, (5) number of negations in the correct answer, (6) number of referents
across clauses in a passage, and (7) passage length. Since six out of the seven predictors
belonged to the passage and passage-to-item (i.e., contextual/question) categories, and only one
of the format features contributed to item difficulty, the authors concluded that the results of the
study provided sufficient evidence for the construct validity of the TOEFL reading items.
While the findings from this study, and others like it (e.g., Davey, 1988), highlight the
importance of passage and question features in determining item difficulty, they are not
altogether conclusive, as more recent studies have had contradictory results. For instance, in a
more recent study of TOEFL MC reading items, Carr (2006) found that many of the passage-
related features were not significant predictors of item difficulty. The features included in the
analysis were both passage (e.g., subject matter, propositional content, cohesion, and focus
constructions) and key sentence features (e.g., length and location, lexical variety, key sentences
per sentence, and proportion of clauses). Carr (2006) noted that key sentence features “emerged
as constituting a key dimension along which reading items may vary” (p. 285), and called into
question the applicability of solely using task characteristics to predict examinee performance,
particularly where passage-related features are concerned.
One of the more recent studies which adopted a more holistic approach to exploring
sources of difficulty within an item examined the extent to which three sets of item features
predicted the difficulty of MC items on the Gates-MacGinitie Reading Tests (Ozuru, Rowe,
O’Reilly, & McNamara, 2008). The authors carried out an item-based analysis of 192
comprehension questions to investigate if, and to what extent, various item features could predict
the difficulty of test items targeting two different levels (i.e., lower vs. higher) of reading ability.
The results showed that, whereas the difficulty of reading test items for the lower level could
mostly be predicted by passage features, none of the item features could account for the
difficulty of items included in the higher-level tests. Ozuru, Rowe, O’Reilly, and McNamara’s
(2008) study motivated the current investigation, as it considered (and illustrated) the multi-
dimensional nature of reading test performance which incorporated a variety of features,
including text (passage), context (question type), item format, and learner characteristics (e.g.,
level of reading ability) that could potentially impact item difficulty.
Present Study
The purpose of the present study was to investigate the effect of a combination of factors
on the difficulty of selected-response reading items in the PTE Academic. Two of the three item
facets examined in the study were identified based on Kintsch’s (1998) CI model of reading
comprehension which we discussed in greater detail in the previous sections. Specifically, we
believe that both groups of features (i.e., textual/passage and contextual/question) can
potentially present sources of difficulty during text comprehension. Depending on participants’
level of L2 ability and previous linguistic experiences in general, their construction of the
textbase might be affected by the linguistic complexity of the reading material (both syntactic
and lexical) as well as by how the information is organized in a text. Thus, L2 readers’ ability to
parse a text and establish propositional meaning at the sentence and discourse levels is likely to be related to the number of words in a text that readers can recognize and whose lexical meaning they know, which in turn allows them to work out the idea units. The longer the text, the more likely it is to include
words that are less familiar to the reader and/or present more complex grammatical relationships,
which, in turn, increases the difficulty level of the text. Likewise, the fewer propositional gaps a
text contains and the more explicit the structure of the text is, the easier it is to construct the
appropriate textbase. Furthermore, readers who are able to identify the most essential
information that supports the overall organizing idea conveyed in the text, as well as to integrate
new information from the text with existing background knowledge, are likely to be more
successful during the integration phase of the comprehension process.
The third group of features that is related to testing situations includes format
characteristics of test items. Haladyna (2012) argues that item format in tests of reading
comprehension is an important consideration, as different item formats may present different
cognitive demands and thought processes. Because this study adopts a more holistic approach to
examining groups of factors that might impact item difficulty, including a group of item format
features appears to be worthwhile. Our understanding of reading comprehension as a complex set
of cognitive processes affected by passage, question, and (potentially) item format
characteristics, leads to the first research question we asked in the present study:
(1) To what extent do various passage, question, and response-format features predict the
difficulty of PTE reading items?
Based on the test specifications of the PTE Academic (Zheng & De Jong, 2011), the
reading items appear to target a wide range of reading abilities included in the reading construct
that align well with Kintsch’s CI model. Specifically, the tasks are thought to elicit abilities
ranging from understanding the meaning of lexical items, grammatical constructions, and the structural organization of a text in order to identify main ideas and details, to reading critically in order to understand the author’s purpose and rhetorical intent, draw logical inferences, evaluate a hypothesis, and integrate information from multiple sources “to generate an organizing frame that is not explicitly stated” (p. 10). It is possible, however, that depending
on the overall proficiency level of the test-takers, not all of these abilities might be tapped during
the test. In fact, in a study examining the factor structure of L2 learners’ academic English skills
assessed by PTE Academic, Pae (2012b) noted that the factor structure for the high and low
ability groups was different, suggesting that “there might be a different latent construct for
different ability groups” and, therefore, “caution needs to be taken in the interpretation of test
scores according to the skill level” (p. 6). Therefore, in the present study we also considered a
learner-related component by comparing the test performance of more and less successful
readers. This is reflected in the second research question we asked in the present study:
(2) Do the specific passage, question, and response-format features affecting the
difficulty of PTE reading items differ for low- and high-performing readers?
Method
Reading test items for PTE Academic were evaluated according to three different facets
(i.e., passage, passage-question relationship, and response-format); a description of these facets
and how they were coded are provided below. The data for the first analysis in the present study
comprised a subset of reading test items (k = 172) taken from 10 separate administrations of PTE Academic conducted in four different countries between 2011 and 2014. The reading
test items targeted different levels of reading ability, ranging from B1 to C2 on the Common
European Framework (CEF) scale (Zheng & De Jong, 2011).
In addition, the study also examined the differences between high- and low-performing
readers. From an initial sample of 836 examinees, 434 of those examinees were divided into two
groups based on their reading test results for PTE Academic. Individuals belonging to the high-
performing group (n = 192) represented those examinees who scored at least one standard
deviation (SD) above the mean score for the reading test, while individuals belonging to the low-
performing group (n = 242) included examinees who scored at least one SD unit below the mean
score. Those individuals who fell between +/- one SD of the mean score were excluded from the
analysis (see Table 1). The two groups included both males (n = 251) and females (n = 183)
between the ages of 18 and 38 years old (M = 24.81 years; SD = 4.33).
[INSERT TABLE 1 ABOUT HERE.]
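To make the grouping rule concrete, the following minimal Python sketch shows one way the +/- 1 SD split described above could be implemented. The data frame, column names, and simulated scores are our own assumptions for illustration, not the actual PTE Academic data files.

```python
import numpy as np
import pandas as pd

# Hypothetical illustration of the +/- 1 SD grouping rule; simulated scores only.
rng = np.random.default_rng(0)
scores = pd.DataFrame({"examinee_id": np.arange(836),
                       "reading_score": rng.normal(50, 10, 836)})

mean, sd = scores["reading_score"].mean(), scores["reading_score"].std()
high_group = scores[scores["reading_score"] >= mean + sd]  # high-performing readers
low_group = scores[scores["reading_score"] <= mean - sd]   # low-performing readers
# Examinees falling within +/- 1 SD of the mean are excluded from the group comparison.
```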
Reading test for PTE Academic. The reading test consists of five selected-response
item formats: a) MC (choose one answer), b) MC (choose multiple answers), c) re-ordering
paragraphs, d) reading (fill-in-the-blanks), and e) reading and writing (fill-in-the-blanks). For
MC items, examinees are presented with a reading passage and asked to select the correct
response(s), choosing either a single answer (from four options) or multiple answers (from five
options), depending on the question-type. For paragraph re-ordering items, examinees are
presented with a short reading passage and several randomly ordered (but related) sentences
below the passage. Based on the information in the passage, examinees are asked to restore the
original order of the sentences. Finally, for the fill-in-the-blanks items, a short reading passage is
presented with some words missing. To complete the passage, for each blank, examinees are
asked either to drag and drop the most appropriate word from a general list of words or to select the most appropriate word from a list of four options in a drop-down menu. Each of the item formats targets specific reading subskills (see PTE Academic, 2012). As illustrated in Table 2, some of the subskills overlap between the different item types, while others are unique to a given item type.
For this study, Pearson provided the researchers with test results from 10 different forms
(i.e., administrations) of PTE Academic. It was assumed that the forms were largely equivalent, in
terms of the test content and the difficulty of test items across the test forms. Although the
specific outcomes of the procedures undertaken to ensure the equivalence of the PTE Academic
test forms were not disclosed to the researchers, Pearson provides ample information to suggest
that great lengths are taken to ensure equivalence across test administrations. For example, De
Jong and Zheng (2011) note that, “draws [from the test item bank] are stratified to ensure
comparable composition with respect to the item types included and thereby the representation of
skills in the test” (p. 8). Furthermore, as De Jong, Li, and Duvin (2010) explain, test items drawn
from the PTE Academic item bank are calibrated on a single IRT scale, making them equally
sufficient to predict scores on any [PTE Academic] test form. Based on this information, we
concluded that stable estimates of item difficulty were established across the PTE Academic test
forms. Therefore, the decision was made to collapse the results for the 10 forms into one overall
dataset.¹ This led to a total of 172 items for the dataset used in the present study.
As shown in Table 2, each test form included five reading passages, ranging from 30 to 300 words, as well as 2 single-answer MC items, 2-3 multiple-answer MC items, 2-3 re-order paragraph items, 4-5 “drag-and-drop” fill-in-the-blank items, and 5-6 “drop-down” fill-in-the-blank items. While single-answer MC items were dichotomously scored (0 or 1), all other item types were scored using a partial-credit scoring system (Pearson Test of English Academic, 2012).
¹ The results of a one-way ANOVA test further supported this decision, as no significant differences were found between mean test scores for any of the forms, F(9, 826) = 1.01, p > .05.
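As a rough illustration of the kind of equivalence check reported in the footnote above, the sketch below runs a one-way ANOVA across form-level score samples; the scores are simulated and the group sizes are assumptions, not the actual study data.

```python
import numpy as np
from scipy.stats import f_oneway

# Simulated scores for 10 test forms; illustrative of the footnoted ANOVA check only.
rng = np.random.default_rng(1)
form_scores = [rng.normal(50, 10, 84) for _ in range(10)]

f_stat, p_value = f_oneway(*form_scores)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # a non-significant p supports collapsing forms
```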
[INSERT TABLE 2 ABOUT HERE.]
Procedures for Data Analysis
Description of Reading Test Item Features
The 172 reading test items were evaluated according to three different groups of facets
(i.e., passage, question, and response-format) that have been identified as potential sources of
item difficulty based on prior research (e.g., Alderson, 2000; Currie & Chiramanee, 2010;
Davey, 1988; Ozuru, Rowe, O’Reilly, & McNamara, 2008). Within these facets, 10 features of
the reading test items were determined, totaling 12 unique measures (see Table 3).
Passage-related features. These features concerned the characteristics of the reading
passages that served as the input for examinees. Four features of reading passages that have been
shown to influence reading comprehension difficulty (and have been reviewed above) were
examined: a) general complexity, b) syntactic complexity, c) lexical complexity, and d) passage
structure. The first two features, general and syntactic complexity, were measured using an
online L2 syntactic complexity analyzer (see Lu, 2010). General complexity was measured as the
total number of words in a reading passage, while syntactic complexity was measured as the
mean length of sentence and clauses per sentence. The third feature, lexical complexity, was
measured using type-token ratio and words off-list. Type-token ratio represented the number of unique function and content words relative to the total number of running words in a passage, while the
words off-list represented the proportion of words that fell outside of the General Service List² (West, 1953) and Academic Word List (Coxhead, 2000) for each reading passage.
² The General Service List, originally published by Michael West in 1953, is a list of roughly 2,000 words selected to be of the greatest “general service” to learners of English (source: http://jbauman.com/aboutgsl.html).
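The study itself used Lu’s (2010) syntactic complexity analyzer and the VocabProfile function of Compleat Lexical Tutor for these counts. The sketch below only approximates those measures with simple tokenization, to make the definitions concrete; the tokenizer and sentence splitter are our own simplifications.

```python
import re

def tokens(text: str) -> list[str]:
    """Crude tokenizer used for illustration only."""
    return re.findall(r"[A-Za-z']+", text.lower())

def type_token_ratio(text: str) -> float:
    """Unique word forms divided by total running words."""
    toks = tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0

def off_list_proportion(text: str, word_lists: set[str]) -> float:
    """Proportion of tokens not covered by the supplied word lists
    (e.g., a combined GSL + AWL set loaded elsewhere)."""
    toks = tokens(text)
    return sum(t not in word_lists for t in toks) / len(toks) if toks else 0.0

def mean_sentence_length(text: str) -> float:
    """Average number of tokens per sentence (very rough sentence splitting)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(tokens(s)) for s in sentences) / len(sentences) if sentences else 0.0
```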
[INSERT TABLE 3 ABOUT HERE.]
For the fourth feature in this category, passage structure, a framework for measuring
coherence was adopted from the content structure analysis used in Kobayashi (2002). This
framework categorizes the structure of content in a reading text into four different types: 1)
association, 2) causation, 3) response, and 4) description. Association refers to ideas that are only
loosely associated with each other around a common topic, as well as events related to a time
sequence (i.e., recounting events in chronological order). Causation refers to ideas that are
related both in terms of time (e.g., one event happens before another) and causality (i.e., the
earlier event causes the latter). Response refers to the inter-relationship of ideas such that a
solution is suggested in response to an existing causality. Finally, description refers to at least
two subordinate arguments that are linked by an element of comparison, as well as ideas
arranged in a hierarchical manner (i.e., one argument is superordinate and the other modifies this
superordinate argument).
According to Kintsch’s model, the language complexity of the passage, including its
length as well as the number of syntactic constructions and the lexical density of the text, might
affect the ability of some readers to construct the textbase. Furthermore, depending on the extent
to which the passage structure is clear to a reader and the ideas are presented in a logical and
coherent manner, the construction of the textbase can either be facilitated or inhibited.
Question-related features. These features concerned the cognitive operation involved in
answering a question that could potentially affect test-takers’ ability to integrate textual
information with other knowledge and develop the situation model of text comprehension. Three
question features were included: 1) number of falsifiable distractor options, 2) comprehension-
type, and 3) abstractness of the targeted answer. Following Ozuru, Rowe, O’Reilly, and
McNamara (2008), the first feature, number of falsifiable distractor options, was determined by
tallying the number of distractors that were inconsistent with the theme of the passage and/or
contained erroneous or unrelated information for each item. Based on Kintsch’s (1998) CI
model, test-takers’ ability to achieve higher-level comprehension assumes their ability to identify
essential information necessary for comprehension and to deactivate all unnecessary material
that is either inaccurate or redundant. Therefore, it was believed that the occurrence of falsifiable
distractors (i.e., information that is inconsistent with the main theme) might potentially make it
more challenging for a test-taker to comprehend a text.
The second feature, comprehension-type, refers to the level of understanding that is
required to answer a question. Previous research has investigated the level of understanding
needed to comprehend a written text (e.g., Davey, 1988; Kobayashi, 2004; McKenna & Stahl,
2009; Ozuru, Rowe, O’Reilly, & McNamara, 2008), identifying four types of text
comprehension questions: 1) text-based questions, 2) restructuring / rewording questions, 3)
integration questions, and 4) knowledge-based questions. Text-based questions represent those
question-types for which the answer is stated nearly verbatim in the passage, thus making it
easily identifiable. Restructuring questions, while not stated verbatim, simply require the
examinee to re-order the organization of words. Integration questions require the examinee to
synthesize information, whereas knowledge-based questions require the examinee to draw
inferences and/or integrate their world knowledge. While the first two question-types are
considered to be less cognitively demanding, the latter two types are thought to be more
cognitively demanding, as they require higher-order reading skills (Kobayashi, 2004). Similarly,
following Kintsch’s (1998) argument about the two different levels of comprehension, it is the
latter two types of questions (i.e., integration and knowledge-based questions) that promote
deep-level understanding of a text which supports learning (as opposed to surface level
comprehension that is memory-based).
The third feature, abstractness of information, addresses the level of concreteness or
abstractness of the information requested by the question. It is our belief that the more abstract
the information requested by the question, the more challenging it is for a reader to rely on their
background knowledge and incorporate it during the integration phase of the comprehension
process. Therefore, questions that ask about abstract concepts are likely to be more challenging
for a reader, especially if they have limited knowledge about a topic and, therefore, require
additional effort to achieve comprehension. Following Ozuru, Rowe, O’Reilly, and McNamara
(2008), five levels of concreteness or abstractness were identified in the present study: 1) most
concrete – identification of persons, animals, or things, 2) highly concrete – information about
time, attributes, or amounts, 3) intermediate – identification of manner, goal, purpose, or
condition, 4) highly abstract – identification of cause, effect, reason, or result, and 5) most
abstract – information about equivalence, difference, or theme.
Response-format features. These features concerned the specific response formats of
test items. Previous research has identified a variety of response-format features that influence
the difficulty of reading test items (e.g., Currie & Chiramanee, 2010; Garrison, Dowaliby, &
Long, 1992; Ozuru, Rowe, O’Reilly, & McNamara, 2008). The present study only focused on
three features, as they have commonly been found to influence item difficulty across studies: 1)
number of response alternatives possible for an item, 2) number of correct options required, and
3) average length of options.
Coding of Individual Items
Ten features of three different facets which could potentially represent sources of
difficulty for reading test items were identified and coded by both researchers. The coding
process included three stages. First, a trial coding of the passage-related features was conducted.
Both individuals coded the reading passages from two of the same test forms and verified the
results of the L2 syntactic complexity analyzer for calculating number of words, mean sentence
length and clauses per sentence, and also discussed the implementation of the scheme for
analyzing the content structure of reading passages (see Kobayashi, 2002). Following this, both
coders then focused on these same passage-related features for the eight remaining reading test
forms and compared their results. Any discrepancies between the coders were discussed and
reconciled to ensure 100% agreement on all features. Finally, for the remaining passage-related
features (i.e., type-token ratio and words off-list), one of the coders utilized the Vocabprofile
function in Compleat Lexical Tutor (Cobb, n.d.) to determine counts for these measures for each
reading test item. The first two steps of the coding procedures described above were likewise
followed for question features (stage two) and response-format features (stage three).
Statistical Analysis
Item analysis. Item difficulty scores were provided by Pearson for each of the 172
reading test items included in the present study. As mentioned earlier, the forms of the reading
tests included both dichotomous and polytomous items. Therefore, a Partial Credit/Rasch model
analysis was implemented by the Pearson test development team for item inspection. As De Jong
and Zheng (2011) explain, the complete item response dataset [taken from a pilot study spanning
2007-2009] was equally split into odd- and even-numbered items, whereupon Partial
Credit/Rasch model analysis was applied separately to both sets of items. Item fit statistics were
then evaluated for both odd- and even-numbered items, with misfitting items subsequently
deleted. Following this, even-item calibration was linked to odd-item calibration, with odd-item
calibration serving as the base metric. For a more detailed explanation of this analysis, see pages
6-7 in De Jong and Zheng (2011).
As a result of the initial item analysis (described above), it was possible to directly
compare the performance of all items across the 10 different test forms. For the present study,
item parameter estimates for the difficulty (threshold, b parameter) of all 172 reading test items
ranged from -1.43 (least difficult) to +2.85 (most difficult). INFIT and OUTFIT statistics (t <
2.0) confirmed the fit of test items, with a mean of 1.69 and standard deviation of .36.
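For readers unfamiliar with the difficulty (b) parameter reported here, the dichotomous Rasch model on which the Partial Credit model builds can be written in its standard textbook form; this is offered only as a point of reference, not as a description of Pearson’s calibration procedure:

$$P(X_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}$$

where θ_j is the ability of examinee j and b_i is the difficulty threshold of item i, so that higher b values correspond to more difficult items. The Partial Credit model generalizes the single threshold b_i to a set of category thresholds for polytomously scored items.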
In addition, using classical test theory (CTT), separate item difficulty estimates for all
172 reading test items were also determined for the high- and low-performing groups. CTT was
deemed suitable to use with the relatively small sample sizes of the two groups, as there was
stability across test items and the examinee sample was representative of the intended test-
taker population. For dichotomous items, item difficulty represented the proportion of examinees
that got the item correct. For polytomous items, item difficulty represented the mean score of
each item. Furthermore, because the unit of measurement for item difficulty values was not
constant across all test items, an order-preserving arcsin transformation was performed (see
Garrison, Dowaliby, & Long, 1992). Item difficulty estimates ranged from 0.92 to 0.19 for the
high-performing examinees, while estimates ranged from 0.74 to 0.08 for the low-performing
group. Overall, there appeared to be a strong relationship between the item difficulty estimates
determined by the IRT-based model and the CTT-based model (r = .956).
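The sketch below illustrates the CTT difficulty values and the arcsine transformation described above. The arcsine-of-square-root form shown is a common variant of the order-preserving transformation cited from Garrison, Dowaliby, and Long (1992); the exact variant used in the original study is not reported here, so treat this as an assumption.

```python
import numpy as np

def ctt_difficulty(item_scores: np.ndarray, max_score: float = 1.0) -> float:
    """CTT item difficulty: proportion correct for dichotomous items (max_score = 1),
    or the mean score expressed as a proportion of the maximum for polytomous items."""
    return float(item_scores.mean()) / max_score

def arcsine_transform(p: float) -> float:
    """Order-preserving arcsine (angular) transformation of a proportion-type value."""
    return float(np.arcsin(np.sqrt(p)))

# Example: a dichotomous item answered correctly by 13 of 20 examinees.
responses = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0])
p = ctt_difficulty(responses)          # 0.65
print(p, arcsine_transform(p))
```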
Regression analysis. To address the research questions, correlations and standard
multiple regression analyses were employed. Since regression analysis treats all independent
variables as numerical (i.e., interval or ratio scale variables), dummy variables had to be created
in order to analyze the nominal scale variables included in the present study (i.e., abstractness of
information, coherence of passages, and comprehension-type). A dummy variable (scored
dichotomously, 0 or 1) was created for each level of the three predictor variables listed above,
with one dummy variable then being dropped from each set for the regression analysis. Each of
the remaining dummy variables was then treated as a separate predictor variable throughout the
regression analysis. In total, 10 dummy variables were created for the three aforementioned
predictor variables (see Table 4).
[INSERT TABLE 4 ABOUT HERE.]
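A minimal sketch of the dummy-coding step follows. The level labels mirror the coding scheme described above, but the data frame itself is hypothetical; dropping one dummy per set corresponds to the reference-level approach used in the analysis (k levels yield k - 1 dummy variables).

```python
import pandas as pd

# Hypothetical item-level coding for two of the nominal predictors.
items = pd.DataFrame({
    "comprehension_type": ["text-based", "restructuring", "integration",
                           "knowledge-based", "integration", "text-based"],
    "passage_structure": ["association", "causation", "response",
                          "description", "causation", "association"],
})

# drop_first=True drops one dummy per nominal variable, leaving k - 1 dummies per set.
dummies = pd.get_dummies(items, columns=["comprehension_type", "passage_structure"],
                         drop_first=True)
print(dummies.head())
```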
Results
Descriptive statistics for the reading test item features were first calculated (see Table 5).
Then, Pearson product-moment correlation coefficients were determined to assess the
relationship between reading item difficulty scores and each of the item features. Next, two
separate regression models were considered (i.e., Regression I and Regression II). Prior to
conducting the regression analyses, assumptions for using this statistical test were first evaluated.
In what follows, the results of the correlation analyses are discussed, followed by the outcomes
of the assumption tests and the results for the regression models.
[INSERT TABLE 5 ABOUT HERE.]
Relationship between Reading Item Variables and Item Difficulty Scores
Results showed that correlation coefficients among the reading test item variables ranged
from +0.57 to -0.82, while correlations between the features and the item difficulty scores ranged
from +0.65 to -0.48 (see Table A-1 in Appendix). There were significant correlations between
item difficulty scores and number of correct options, as well as between the item difficulty
scores and two of the dummy variables (integrative comprehension-type and most abstract
information). A relatively strong negative correlation between type/token ratio and number of
words was found (-0.82); meanwhile, none of the other correlations were larger than +/-0.57,
suggesting that multicollinearity was moderate-to-low among the majority of the test item
features (Meyers, Gamst, & Guarino, 2006).
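The correlation screening step can be sketched as follows; the feature matrix and difficulty values here are simulated placeholders standing in for the coded item variables and the Rasch-based difficulty estimates.

```python
import numpy as np
import pandas as pd

# Simulated stand-ins for the coded item features and item difficulty estimates.
rng = np.random.default_rng(2)
features = pd.DataFrame(rng.normal(size=(172, 3)),
                        columns=["type_token_ratio", "n_words", "n_correct_options"])
difficulty = pd.Series(rng.normal(size=172), name="item_difficulty")

print(features.corr(method="pearson"))   # inter-feature correlation matrix
print(features.corrwith(difficulty))     # feature-by-difficulty correlations
```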
Effects of Reading Item Variables on Item Difficulty
As a result of the strong negative correlation between type/token ratio and number of
words, two standard regression analyses were conducted, as the inclusion of both variables
within the same model might create some concern with respect to multicollinearity (Meyers et
al., 2006). Overall, item difficulty scores for 172 reading test items were regressed against eight
continuous predictor variables, incorporating either type/token ratio (the Regression I model) or
number of words (the Regression II model), and 10 dummy variables.
Evaluation of assumptions. The assumptions of the regression analysis were evaluated
from three different perspectives. First, regarding the assumption of normality, the skewness and
kurtosis values ranged from +1.94 to -1.28; as the values were between +/-2, they were
considered acceptable (Field, 2009). Furthermore, the histograms of the residuals for all
regression models were normally distributed (see Figure A-1 in Appendix), satisfying the
assumption of normality of residuals. Second, the scatterplots of standardized residuals against
standardized predicted values indicated that the data were randomly and evenly scattered for the
regression analyses (refer again to Figure A-1 in Appendix). This implies that the assumptions of
linearity and homoscedasticity were met (Field, 2009). Finally, all tolerance values were above 0.10 and all VIF (variance inflation factor) values were below 10 for Regression I, suggesting that there
were no problems with multicollinearity among the predictor variables in this particular model
(Meyers et al., 2006). However, some of the values were not within this range for Regression II,
suggesting possible problems with multicollinearity.
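The screening checks described above (skewness/kurtosis and VIF) can be reproduced with standard tools, as in the sketch below; the predictor matrix is a hypothetical stand-in for the coded item features.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

# Hypothetical predictor matrix standing in for the coded item features.
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(172, 4)),
                 columns=["type_token_ratio", "mean_sentence_length",
                          "falsifiable_distractors", "n_correct_options"])

print(skew(X), kurtosis(X))  # screen for values outside roughly +/- 2 (Field, 2009)

Xc = add_constant(X)
vif = pd.Series([variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
                index=X.columns)
print(vif)  # VIF well below 10 (tolerance = 1/VIF above 0.10) suggests low multicollinearity
```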
Regression models. Adjusted R2 is reported for both models (as opposed to total R2), as it is considered more appropriate when a large number of predictor variables are included, which can result in overfitting the model. For Regression I, adjusted R2 = 0.81, while for Regression II, adjusted R2 = 0.75, indicating that 81% and 75% of the variance in the item difficulty scores, respectively, could be explained by the test item features incorporating either type/token ratio or number of words [F(16, 155) = 36.99, p = 0.001; F(16, 155) = 36.71, p = 0.001]. As for the individual predictor variables, mean length of sentence, number of falsifiable distractors, and number of correct options were significant in both regressions, while the dummy variable for questions requesting highly abstract information was significant only in Regression I. However, due to the problems with multicollinearity associated with Regression II (discussed above), the Regression I model was seen as more favorable. Therefore, only the results for Regression I are presented in detail (see Table 6). Following this initial analysis, the four significant predictor variables from the Regression I model (mean length of sentence, number of falsifiable distractors, number of correct options, and highly abstract information) were subjected to another (standard) regression analysis. Overall, adjusted R2 = 0.59 for the four variables.
[INSERT TABLE 6 ABOUT HERE.]
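A minimal sketch of the standard multiple regression reported above is given below. The predictor names mirror the significant variables, but the values are simulated placeholders rather than the study data; adjusted R2 is the statistic reported for each model.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated predictors standing in for the coded item features.
rng = np.random.default_rng(4)
X = pd.DataFrame({
    "mean_sentence_length": rng.normal(20, 5, 172),
    "falsifiable_distractors": rng.integers(0, 4, 172),
    "n_correct_options": rng.integers(1, 4, 172),
    "highly_abstract": rng.integers(0, 2, 172),   # dummy-coded question feature
})
y = pd.Series(rng.normal(0, 1, 172), name="item_difficulty")  # stand-in for b estimates

model = sm.OLS(y, sm.add_constant(X.astype(float))).fit()
print(model.rsquared_adj)   # adjusted R2
print(model.summary())
```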
Effects of Reading Item Features for Low- and High-performing Readers
Two additional (standard) regression analyses were conducted to investigate the extent to
which the reading test features predicted item difficulty scores for low- and high-performing
readers. The same set of predictor and dummy variables from Regression I were regressed against
the item difficulty scores for both groups. The predictor variable, number of words, was excluded
from the analysis, as collinearity statistics for the initial analysis indicated that there were
problems when it was used in place of type/token ratio.
Low-performing readers. Correlation coefficients among the reading test item features
ranged from +0.56 to -0.48, while correlations between the features and the item difficulty scores
ranged from +0.58 to -0.44 (see Table 7). Adjusted R2 = 0.73 [F(16,155) = 27.49, p = 0.010]. As
shown in Table 8, only one variable, number of falsifiable distractors, appeared to significantly
predict item difficulty scores for the low-performing readers. This one variable was then subjected
to another (standard) regression analysis, with adjusted R2 = 0.32.
High-performing readers. Correlation coefficients among the reading test item features
ranged from +0.57 to -0.54, while correlations between the features and the item difficulty scores
ranged from +0.88 to -0.49 (again, see Table 7). For the high-performing group, adjusted R2 =
0.78 [F(16, 155) = 32.97, p < .001]. A number of predictor variables, including mean length of
sentence, causation (coherence structure), number of falsifiable distractors, number of correct
options, and questions requesting highly abstract information, appeared to significantly predict
item difficulty scores for the high-performing readers; information for the significant predictors in
the regression analysis for the high-performing readers is also presented in Table 8. These five
variables were then subjected to another (standard) regression analysis, with adjusted R2 = 0.56.
[INSERT TABLE 7 ABOUT HERE.]
[INSERT TABLE 8 ABOUT HERE.]
Discussion
For the entire group of test-takers, variables from three facets (i.e., passage, question, and
response-format) accounted for 81% of systematic variance in item difficulty. Comparing low-
performing and high-performing readers, 73% and 78% of the variance in item difficulty scores
could be attributed to the combined contribution of features included in the regression analyses,
respectively. Four significant predictors (mean length of sentence, number of falsifiable distractors, number of correct options, and questions requesting highly abstract information) were identified as making a unique contribution to item difficulty for the overall group, with one additional predictor, causation (coherence structure), identified for the high-performing readers. Only one variable, the number of falsifiable distractors, was found to be a significant predictor for the low-performing readers. As indicated by the squared partial correlations, which can be interpreted as
effect size estimates, each of the five predictors helped to individually explain a sizable amount of
the variability associated with item difficulty scores, ranging from effect sizes of 36% to 68% for
the entire group and from 52% to 77% for the high-performing readers, with an effect size of 59%
for the number of falsifiable distractors for the low-ability readers.
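As a hedged illustration of the effect size used here, a squared partial correlation for a given predictor can be recovered from a fitted regression via the standard identity r_partial² = t² / (t² + df_residual); the sketch below uses simulated data and placeholder predictor names.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data; illustrates r_partial^2 = t^2 / (t^2 + df_resid) for each predictor.
rng = np.random.default_rng(5)
X = pd.DataFrame(rng.normal(size=(172, 3)),
                 columns=["mean_sentence_length", "falsifiable_distractors", "n_correct_options"])
y = 0.6 * X["falsifiable_distractors"] + rng.normal(size=172)

fit = sm.OLS(y, sm.add_constant(X)).fit()
t = fit.tvalues.drop("const")
squared_partial_r = t**2 / (t**2 + fit.df_resid)
print(squared_partial_r)
```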
Two predictors, mean length of sentence and coherence (causation), were the only passage-
related features found to be significant. As for sentence length, items were more difficult as the
mean length of sentence increased. This outcome is supported by prior empirical research, which
has found that longer sentences require the reader to retain more information in short-term
memory, since they are likely to contain more clauses and propositions (e.g., see White, 2010).
From a theoretical standpoint, the increase in sentence length makes it more challenging to
process the text, that is, to complete word recognition and syntactic parsing and to establish the relationships among the various elements of a proposition, all of which are processes that L2 readers employ to build a text model of comprehension (Kintsch, 1998). On the other
hand, coherent passage organization, in which the propositions are presented following a well-
established structure, seems to facilitate readers’ construction of the textbase (Grabe, 2009;
Kintsch, 1998), albeit only for the high-performing readers. It is likely the case that the tightly-
organized structure of causation texts, which requires a reader to relate ideas both in terms of
time sequence and the connection between antecedent and consequent (Kobayashi, 2002, 2004),
may be more comprehensible for better readers, as they are more aware of overall text
organization and can utilize that awareness to enhance understanding. This supports the claim
that passage cohesion, particularly as it relates to structured and unstructured texts, can be
influential for discriminating different levels of L2 reading comprehension (Kobayashi, 2002).
Here, it is also worth noting that none of the passage features were significant predictors
of item difficulty for low-performing readers. While we can only speculate about potential
reasons to explain this finding, it seems plausible to suggest that, overall, the reading passages
were quite challenging for many of the low-performing readers who, therefore, showed a much
more random response pattern in their completion of reading test items.
In terms of question (i.e., contextual) features, there were two significant predictors of
item difficulty (i.e., number of falsifiable distractors and abstractness of the targeted answer). As the
number of falsifiable distractors increased, so did item difficulty. While some studies found the
effect of falsifiable distractors to be minimal (e.g., Ozuru, Rowe, O’Reilly, & McNamara, 2008),
the squared-partial correlations (0.68 for the entire group, 0.59 for low-performing readers and
0.55 for high-performing readers) indicate large effect sizes for the present study. This might be
explained by the rather unique format of selected-response items used in PTE Academic. That is,
for most test items that required more than one correct answer, there were typically only 1-2
falsifiable distractors present, sometimes leaving 2-3 plausible options. Therefore, the examinees
likely had to employ different strategies to answer the varying item formats in PTE Academic,
which might have forced them to more closely attend to information in reading passages and
how it related to the question, thereby increasing cognitive demands and leading to greater item
difficulty (Garrison, Dowaliby, & Long, 1992). While the falsifiability of distractors was found
to be a significant predictor of item difficulty for both groups of examinees, we have to wonder
about whether both groups used the same set of strategies and resources to complete the task. For
high-performing readers, who might have been able to build a text model of the passage, the
information in the answer options might have required additional confirmation of the initially
established textbase in order to distinguish between false and plausible distractors to deactivate
unnecessary (or inaccurate) information that was inconsistent with the main theme. If, however,
the participants were already struggling with constructing the textbase (as was likely the case
with low-performing readers in the present study), any additional inconsistencies between
information in the text and information in the response options could have increased the
difficulty of test items.
Furthermore, for the entire group and the high-performing readers, as the information
requested in a question became more abstract, item difficulty increased. Research largely supports
the proposition that more abstract content in reading passages increases comprehension difficulty
(e.g., Freedle & Kostin, 1993; Lumley, Routitsky, Mendelovits, & Ramalingam, 2012; Ozuru et
al., 2008). As Freedle and Kostin (1993) argue, information is more difficult to identify and scan
for in an abstract passage because details are obscured and not explicitly stated. In the present
study, the abstractness of the information requested by a question likewise increased item
difficulty. Interestingly, all questions targeting highly abstract information (i.e., cause, effect,
reason, and result) accompanied passages in which this information did not constitute the main
content of the passage. Since the information structure of these passages did not involve a causal
(or time-sequence) relationship as a means of organizing ideas, it is likely that the test-takers had
to additionally activate and integrate this information when constructing the situation model of
passage interpretation. However, it was surprising that abstractness of information was not also a
significant predictor for the low-performing readers in the present study. While this finding is
difficult to account for, it could be attributed to the effect of other features of the source passage.
For example, Ozuru et al. (2008) found that some features (e.g., propositions, sentence length, and
word frequency) influenced the abstractness of questions, albeit not systematically. Therefore,
based on their findings, as well as the findings of our own study, it appears that the abstractness of
the information requested does not exert a systematic influence across all proficiency levels.
Finally, the number of correct options was the only response-format feature found to be a
significant predictor, both for the overall group and for high-performing readers. This is not an
altogether surprising finding, as it seems logical that increasing the task demands (i.e., requiring
readers to identify more pieces of information and relate them back to the passage) would lead to
an overall increase in the difficulty of a reading item.
Overall, the difficulty that test-takers experienced appears to have stemmed from multiple
sources, with features clearly associated with the passage, the question, and the response format.
In the present study, a single contextual feature predicted the difficulty of reading test items for
the low-performing readers. Under the two-process model of comprehension, their lower level of
reading ability likely prevented them from building a well-developed text model and caused them
to rely more on contextual features to complete the task, possibly without a thorough
comprehension of the text. As Grabe (2009) argues, low-performing readers "often give up trying
to build an appropriate text model when working through a difficult text" (p. 49). For the high-
performing readers, however, there was a more even split between textual and contextual features,
suggesting that participants at higher levels of reading ability were more consistently sensitive to
features of text organization and discourse-signaling mechanisms during passage comprehension.
Limitations and Implications
There are several limitations of the present study that should be acknowledged. First, not
all features that have been found to significantly predict item difficulty in previous research were
investigated in the present study (e.g., see Carr, 2006; Freedle & Kostin, 1993). While a
combination of factors was found to affect the difficulty of selected-response items in the present
study, additional research might investigate the effect of different selected-response item formats
on examinees' performance on L2 reading comprehension tests (see Haladyna, 2012).
Also, the quantitative analysis of item features carried out in the present study did not
explicitly address the processes that test-takers employed during the test. Ideally, we would have
carried out a follow-up qualitative study employing verbal reports to investigate the effect of text
and item factors on test-takers' responses to the test questions, using Rupp et al.'s (2006) study as
a model. Rupp and colleagues employed semi-structured interviews and think-aloud protocols to
investigate the effect of text and item characteristics on test-takers' responses to reading MC
questions from the CanTEST, a large-scale paper-and-pencil proficiency test. The results of their
analyses indicated that participants selected different strategies to answer MC questions based on
a number of factors, including their previous experiences with MC tests, their experiences with
instructors, and the perceived characteristics of the text and the questions. However, as we did not
have access to examinees, this type of analysis could not be carried out.
The results of the present study also have several important implications for L2 reading
assessment. First, the majority of significant predictors identified in the study were
textual/passage-related features (e.g., sentence length, coherence structure) and
contextual/question-related features (e.g., number of falsifiable distractors, questions requesting
highly abstract information), suggesting that both facets were largely responsible for the difficulty
participants experienced during text comprehension. Therefore, we believe that the two-model
account of reading comprehension (Grabe, 2009; Kintsch, 1998) is informative for understanding
the nature of L2 reading comprehension and should serve as an important theoretical model for
designing and constructing tests of L2 reading, as well as for interpreting the results of such tests.
This account is particularly useful in that it depicts reading as a combination of top-down and
bottom-up processes, whereby individuals (at different levels of reading ability) engage in a range
of strategic processing skills to comprehend a written text.
Second, with multiple passage, question, and response-format features found to be
important predictors of difficulty for selected-response items used in the PTE Academic reading
test, the results of the study highlight the need to carefully consider the test item characteristics
that contribute to the difficulty of selected-response items used in similar high-stakes language
tests; it is crucial to understand how these features, particularly those associated with response
format, affect the interpretation of test results. Issues related to test item format (i.e., response-
format features) have great potential to introduce construct-irrelevant variance, a major threat to
the validity of high-stakes tests. Therefore, a clearer understanding of the types of features
associated with easier or harder reading items is an important piece of knowledge for test
developers and language instructors when developing reading assessments.
The final implication concerns the differences in test performance found between the low-
and high-performing readers. The grouping technique used in the present study enabled us to
sample readers at the lowest and highest levels of reading ability, illustrating that the performance
of the overall group did not necessarily reflect the more nuanced differences found between the
low- and high-performing readers. As the low- and high-performing readers varied significantly
across a number of the predictors used in the present study, it is important for language test
developers to also consider how various test item characteristics might interact with learner-
related variables, such as language proficiency. This is particularly important when designing
norm-referenced language tests, in which the goal is to differentiate test-takers.
References
Alderson, J.C. (2000). Assessing reading. Cambridge, UK: Cambridge University Press.
Aebersold, J.A., & Field, M.L. (1997). From reader to reading teacher. Cambridge: Cambridge
University Press.
Bachman, L.F., Davidson, F., & Milanovic, M. (1996). The use of test method characteristics in
the content analysis and design of EFL proficiency tests. Language Testing, 13, 125-150.
Carr, N.T. (2006). The factor structure of test task characteristics and examinee performance.
Language Testing, 23, 269-289.
Cobb, T. (n.d.). Web Vocabprofile [accessed January 2013 from http://wwwlextutor.ca/vp/], an
adaptation of Heatley, Nation, and Coxhead's (2002) RANGE program.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213-238.
Currie, M., & Chiramanee, T. (2010). The effect of the multiple-choice item format on the
measurement of knowledge of language structure. Language Testing, 27, 471-491.
Davey, B. (1988). Affecting the difficulty of reading comprehension items for successful and
unsuccessful readers. The Journal of Experimental Education, 56, 67-76.
De Jong, J.H.A.L., Li, J., & Duvin, J. (2010). Setting requirements for entry-level nurses on PTE
Academic. Internal Report, Pearson Test of English Academic.
De Jong, J.H.A.L., & Zheng, Y. (2011). Applying EALTA Guidelines – A practical case study
on Pearson Test of English Academic. Research Notes, Pearson Test of English
Academic. Retrieved from http://www.ealta.eu.org/documents/archive/
EALTA_GGP_PTE_Academic.pdf
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage Publications.
Freedle, R., & Kostin, I. (1993). The prediction of TOEFL reading comprehension item difficulty
for expository prose passages for three item types: main idea, inference, and supporting
idea items (ETS Research Report). Princeton, NJ: ETS.
Garrison, W., Dowaliby, F., & Long, G. (1992). Reading comprehension test item difficulty as a
function of cognitive processing variables. American Annals of the Deaf, 137, 22-30.
Grabe, W. (2009). Reading in a second language. Cambridge University Press.
Haladyna, T.M. (2012). Developing and validating multiple-choice test items (3rd ed.). New
York, NY: Routledge.
Khalifa, N., & Weir, C. J. (2009). Examining reading: Research and practice in assessing
second language reading. New York: Cambridge University Press.
Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-
integration model. Psychological Review, 95, 163-182.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. New York, NY: Cambridge
University Press.
Kobayashi, M. (2002). Method effects on reading comprehension test performance: Text
organization and response format. Language Testing, 19, 193-220.
Kobayashi, M. (2004). Reading comprehension assessment: From text perspectives. Scientific
Approaches to Language (Center for Language Sciences, Kanda University of
International Studies), 3, 129-157.
Koda, K. (2007). Reading and language learning: Crosslinguistic constraints on second language
reading development. In K. Koda (Ed.), Reading and language learning (pp. 1-44).
Special issue of Language Learning Supplement, 57, 1-44.
Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing.
International Journal of Corpus Linguistics, 15, 474-496.
Lumley, T., Routitsky, A., Mendelovits, J., & Ramalingam, D. (2012). A framework for
predicting item difficulty in reading tests. Retrieved from
http://research.acer.edu/au/pisa/5.
McCray, G., & Brunfaut, T. (2016). Investigating the construct measured by banked gap-fill
items: Evidence from eye-tracking. Language Testing, 35, 51-73.
McKenna, M.C., & Stahl, K.A.D. (2009). Assessment for reading instruction (2nd ed.). New
York, NY: The Guilford Press.
Meyers, L.S., Gamst, G., & Guarino, A.J. (2006). Applied multivariate research: Design and
interpretation. London: Sage Publications.
Ozuru, Y., Rowe, M., O’Reilly, T., & McNamara, D.S. (2008). Where’s the difficulty in
standardized tests: The passage or the question? Behavior Research Methods, 40, 1001-
1015.
Pae, H. (2011). Differential item functioning and unidimensionality in the Pearson Test of
English Academic. Pearson Research Notes. Retrieved from
http://pearsonpte.com/research/research-summaries-notes/.
Pae, H. (2012a). Construct validity of the Pearson Test of English Academic: A multitrait-
multimethod approach. Pearson Research Notes. Retrieved from
http://pearsonpte.com/research/research-summaries-notes/.
Pae, H. (2012b). A model for receptive and expressive modalities in adult English learners’
academic L2 skills. Pearson Research Notes. Retrieved from
http://pearsonpte.com/research/research-summaries-notes/.
Pearson Test of English Academic (2012). Scoring Guide. Retrieved from
http://pearsonpte.com/wp-content/uploads/2014/07/PTEA_Score_Guide.pdf
Rumelhart, D. (1980). Schemata: The building blocks of cognition. In R. J. Spiro, B. C. Bruce, &
W. F. Brewer (Eds.), Theoretical issues in reading comprehension (pp. 33-58). Hillsdale,
NJ: Erlbaum.
Rupp, A., Ferne, T., & Choi, H. (2006). How assessing reading comprehension with multiple-
choice questions shapes the construct: a cognitive processing perspective. Language
Testing, 23(4), 441-474.
Shohamy, E. (1984). Does the testing method make a difference? The case of reading
comprehension. Language Testing, 1(2), 147-170.
Weir, C., Hawkey, R., Green, A., Unaldi, A., & Devi, S. (2009). The relationship between the
academic reading construct as measured by IELTS and the reading experiences of
students in their first year of study at a British university. IELTS Research Reports,
Volume 9. Retrieved from http://www.ielts.org/researchers/research/volume_9.aspx.
West, M. (1953). A General Service List of English Words. London: Longman, Green & Co.
White, S. (2010). Understanding adult functional literacy: Connecting text features, task
demands, and respondent skills. New York, NY: Routledge.
Yamashita, J. (2003). Processes of taking a gap-filling test: Comparison of skilled and less
skilled EFL readers. Language Testing, 20(3), 267–293.
Zheng, Y., & De Jong, J. (2011). Establishing construct and concurrent validity of Pearson Test
of English Academic. Pearson Research Note. Retrieved from
http://pearsonpte.com/research/research-summaries-notes/.
Appendices
Table A-1: Intercorrelations among predictor variables and item difficulty scores
[INSERT TABLE A-1 ABOUT HERE]
Figure A-1: Histograms and scatterplots for the residuals
[INSERT FIGURE A-1 ABOUT HERE]
Table 1. Distribution of Test Takers across PTE Academic Test Forms

Form    No. of test takers (a)     No. of L1s represented (b)
A       30 (H = 13; L = 17)        3
B       25 (H = 11; L = 14)        3
C       23 (H = 16; L = 7)         4
D       48 (H = 20; L = 28)        4
E       54 (H = 26; L = 28)        6
F       56 (H = 22; L = 34)        8
G       50 (H = 18; L = 32)        7
H       48 (H = 23; L = 25)        8
I       54 (H = 24; L = 30)        6
J       46 (H = 19; L = 27)        5

Notes: (a) H = high performers; L = low performers; (b) 16 distinct L1s were represented among the 434 test takers.
Table 2. Overview of reading test forms for PTE Academic

MC (single answer)
  Subskills tested (sample): identifying the topic, theme, or main ideas; evaluating the quality and usefulness of texts; reading for information to infer meanings or find relationships
  Passages (per form): 1; No. of items (per form): 2; Total no. of items (a): 20
  Scoring: correct/incorrect

MC (multiple answers)
  Subskills tested (sample): identifying the topic, theme, or main ideas; evaluating the quality and usefulness of texts; identifying specific details, facts, opinions, definitions, or sequences of events
  Passages (per form): 1; No. of items (per form): 2-3; Total no. of items (a): 24
  Scoring: partial credit (for each correct response)

Re-order paragraphs
  Subskills tested (sample): identifying the topic, theme, or main ideas; identifying supporting points or examples; identifying the relationships between sentences and paragraphs; following a logical or chronological sequence of events
  Passages (per form): 1; No. of items (per form): 2-3; Total no. of items (a): 24
  Scoring: partial credit (for each correctly ordered, adjacent pair)

Fill-in-the-blank (drag-and-drop)
  Subskills tested (sample): identifying the topic, theme, or main ideas; understanding the difference between connotation and denotation; inferring the meaning of unfamiliar words; comprehending concrete and abstract information
  Passages (per form): 1; No. of items (per form): 4-5; Total no. of items (a): 48
  Scoring: partial credit (for each correctly completed blank)

Fill-in-the-blank (drop-down)
  Subskills tested (sample): identifying the topic, theme, or main ideas; identifying words and phrases appropriate to the context; understanding the difference between connotation and denotation; comprehending concrete and abstract information
  Passages (per form): 1; No. of items (per form): 5-6; Total no. of items (a): 56
  Scoring: partial credit (for each correctly completed blank)

TOTAL: Passages (per form): 5; No. of items (per form): 15-20; Total no. of items (a): 172

Note: (a) Total number of items for all 10 forms.
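As an aside on the partial-credit scoring summarized above, the Python sketch below illustrates one plausible reading of the re-order paragraphs rule (one point per correctly ordered, adjacent pair). It is an illustration only; the operational rules are those of the PTE Academic Scoring Guide, and the function name and example data are ours.

    def adjacent_pair_score(response, key):
        """Count response pairs that are also adjacent, in the same order, in the key."""
        key_pairs = set(zip(key, key[1:]))
        return sum(1 for pair in zip(response, response[1:]) if pair in key_pairs)

    # Example: only the pair (C, D) matches the key order, so the item earns 1 point.
    print(adjacent_pair_score(["B", "A", "C", "D"], ["A", "B", "C", "D"]))  # -> 1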
Table 3. Features examined in PTE Academic reading comprehension test

Passage-related
  General complexity: total number of words in a reading passage
  Syntactic complexity: average number of words per sentence for a reading passage; average number of independent and dependent clauses per sentence for a reading passage
  Lexical complexity: number of unique function and content words divided by the number of repeated words; number of off-list words divided by the total number of words in a reading passage
  Passage structure: content structure of a reading passage

Question-related
  No. of falsifiable distractors: number of falsifiable distractor options for an item
  Comprehension-type: type of question used to demonstrate comprehension of a passage
  Abstractness of targeted answer: level of concreteness or abstractness of the information requested by a question

Response-format
  No. of response alternatives: number of alternatives provided for an item
  No. of correct options required: number of correct answers needed to obtain full credit for an item
  Average length of options: average number of words per item option
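For readers who want to see how two of the simpler passage measures in Table 3 can be computed, the sketch below uses naive tokenization in Python. The study itself relied on dedicated tools (e.g., Lu's syntactic complexity analyzer and Web Vocabprofile), and its lexical measure divides unique words by repeated words, so the conventional type-token ratio shown here is only an approximation.

    import re

    def mean_sentence_length(passage):
        """Average number of words per sentence, using a naive sentence split."""
        sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
        word_counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
        return sum(word_counts) / len(sentences)

    def type_token_ratio(passage):
        """Unique word forms divided by total word tokens (conventional TTR)."""
        tokens = re.findall(r"[A-Za-z']+", passage.lower())
        return len(set(tokens)) / len(tokens)

    text = "Reading is complex. It draws on many processes at once."
    print(round(mean_sentence_length(text), 1), round(type_token_ratio(text), 2))  # 5.5 1.0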
Table 4. List of dummy variables included in regression analysis

Feature              No. of original levels    Final dummy variables
Abstractness         5                         1. Highly concrete
                                               2. Intermediate
                                               3. Highly abstract
                                               4. Most abstract
Coherence            4                         1. Causation
                                               2. Description
                                               3. Response
Comprehension-type   4                         1. Integration
                                               2. Knowledge-based
                                               3. Restructure
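A minimal sketch of the dummy coding summarized in Table 4, using pandas. The category labels come from the table; treating "most concrete" and "association" as the omitted reference levels is our inference from the dummies listed, and the three-item data frame is invented.

    import pandas as pd

    items = pd.DataFrame({
        "abstractness": ["most concrete", "intermediate", "highly abstract"],
        "coherence": ["description", "causation", "association"],
    })

    # One 0/1 column per category level, then drop the reference level so that
    # k original levels yield k - 1 dummy predictors in the regression.
    dummies = pd.get_dummies(items, columns=["abstractness", "coherence"])
    dummies = dummies.drop(columns=["abstractness_most concrete",
                                    "coherence_association"])
    print(dummies)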
Table 5. Descriptive statistics for reading test item features

Facet / Feature                       Mean      SD   Skewness   Kurtosis
Passage-related
  Words                             107.20   64.24      -0.74      -0.42
  Length of sentence                 20.67    5.14       0.10       1.25
  Clauses per sentence                1.74    0.55       0.84      -0.03
  Coherence (a)
    Association                       0.32    0.23      -0.97      -1.00
    Causation                         0.12    0.28       1.22       0.97
    Response                          0.08    0.20       1.04       0.71
    Description                       0.48    0.51      -0.87      -1.11
  Type-token ratio                    0.72    0.11       0.71      -0.85
  Words off-list                      0.10    0.06      -1.21      -0.67
Question-related
  No. of falsifiable distractors      0.72    0.74       1.94       0.93
  Comprehension-type (a)
    Text-based                        0.42    0.69       0.43      -0.52
    Restructure/reword                0.16    0.33      -0.33       0.70
    Integration                       0.32    0.46      -0.89      -1.22
    Knowledge-based                   0.10    0.28       1.55       1.18
  Abstractness (a)
    Most concrete                     0.26    0.35      -0.68       0.54
    Highly concrete                   0.16    0.41      -0.21      -0.46
    Intermediate                      0.30    0.48       0.47       0.39
    Highly abstract                   0.20    0.44       0.29       1.24
    Most abstract                     0.08    0.28      -0.37      -0.46
Response-format
  No. of response alternatives        5.16    1.52      -1.04      -1.28
  No. of correct options              3.24    1.40       0.99       1.45
  Length of options                   6.59    7.63      -0.80       0.32

Note: (a) Values for this variable represent proportion scores, ranging from 0 to 1.
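The descriptives in Table 5 (mean, SD, skewness, kurtosis) can be produced for any feature matrix with a few lines of pandas; the values below are invented and do not reproduce the study's data. Note that pandas reports excess kurtosis, as most packages do.

    import pandas as pd

    features = pd.DataFrame({
        "length_of_sentence": [18.2, 25.1, 20.4, 19.7, 22.9],
        "clauses_per_sentence": [1.4, 2.1, 1.8, 1.6, 2.4],
    })

    summary = pd.DataFrame({
        "Mean": features.mean(),
        "SD": features.std(),
        "Skewness": features.skew(),
        "Kurtosis": features.kurt(),  # excess kurtosis
    }).round(2)
    print(summary)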
Table 6. Regression I (incorporating type/token ratio); Total R² = 0.81

Predictor                            Std. Beta      t     Sig.   Sq. partial corr.   Tolerance    VIF
 1. Mean length of sentence              0.43     2.55   0.044         0.52             0.89     1.12
 2. Clauses per sentence                -0.09    -0.54   0.609         0.05             0.94     1.06
 3. Coherence (causation)               -0.25    -1.86   0.112         0.37             0.95     1.05
 4. Coherence (response)                -0.07    -0.57   0.593         0.05             0.24     4.12
 5. Coherence (description)             -0.03    -0.28   0.790         0.01             0.24     4.11
 6. Type-token ratio                    -0.07    -0.07   0.662         0.04             0.44     2.25
 7. Words off-list                      -0.08    -0.63   0.551         0.06             0.84     1.19
 8. No. of falsifiable distractors       0.52     3.65   0.011         0.68             0.33     2.99
 9. No. of response alternatives        -0.14    -0.70   0.512         0.07             0.17     5.98
10. No. of correct options               0.82     3.01   0.024         0.60             0.98     1.02
11. Length of options                    0.18     1.07   0.324         0.20             0.31     3.22
12. Comp-type (restructure)             -0.38    -1.77   0.127         0.35             0.20     4.94
13. Comp-type (integration)              0.09     0.30   0.777         0.01             0.31     3.20
14. Comp-type (knowledge-based)         -0.01    -0.05   0.960         0.00             0.32     3.12
15. Abstractness (highly concrete)       0.29     1.53   0.748         0.28             0.33     3.03
16. Abstractness (intermediate)          0.13     0.93   0.390         0.12             0.20     4.94
17. Abstractness (highly abstract)       0.27     1.87   0.048         0.36             0.20     4.89
18. Abstractness (most abstract)         0.34     1.23   0.266         0.20             0.33     2.99
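The kind of model reported in Table 6 (standardized coefficients plus tolerance/VIF collinearity diagnostics) can be sketched as follows with statsmodels. This is not the authors' code; the simulated predictors and effect sizes are arbitrary, and the sketch only shows the mechanics.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(1)
    X = pd.DataFrame(rng.normal(size=(172, 3)),
                     columns=["mls", "false_distractors", "correct_options"])
    y = 0.4 * X["mls"] + 0.5 * X["false_distractors"] + rng.normal(size=172)

    # Standardize predictors and outcome so coefficients are comparable to betas.
    Xz = (X - X.mean()) / X.std()
    yz = (y - y.mean()) / y.std()

    model = sm.OLS(yz, sm.add_constant(Xz)).fit()
    print(model.params)            # standardized coefficients
    print(model.pvalues)           # significance of each predictor

    design = sm.add_constant(Xz)
    vif = [variance_inflation_factor(design.values, i)
           for i in range(1, design.shape[1])]       # skip the constant
    print(dict(zip(Xz.columns, np.round(vif, 2))))   # tolerance = 1 / VIF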
Table 7. Correlations among features and item difficulty for low- and high-performing readers

Feature                                   Low      High
 1. Mean length of sentence             -0.03      0.09
 2. Clauses per sentence                 0.01      0.06
 3. Coherence (causation)               -0.19     -0.20
 4. Coherence (response)                -0.27     -0.24
 5. Coherence (description)              0.20      0.32
 6. Type-token ratio                     0.36*     0.33
 7. Words off-list                      -0.34     -0.16
 8. No. of false distractors             0.57**    0.40*
 9. No. of response alternatives         0.05      0.35*
10. No. of correct options               0.58**    0.88**
11. Length of options                    0.08     -0.38*
12. Comp-type (restructure)             -0.38*    -0.31
13. Comp-type (integration)              0.34      0.51**
14. Comp-type (knowledge-based)         -0.27     -0.40*
15. Abstractness (highly concrete)      -0.28     -0.12
16. Abstractness (intermediate)          0.02      0.25
17. Abstractness (highly abstract)      -0.44*    -0.49*
18. Abstractness (most abstract)        -0.41*    -0.18

** Correlation is significant at the 0.01 level (2-tailed); * Correlation is significant at the 0.05 level (2-tailed).
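The group-wise correlations in Table 7 amount to computing, separately for the low- and high-performing groups, the Pearson correlation between each item feature and that group's item difficulty estimates. A minimal sketch with invented data:

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(2)
    feature = rng.normal(size=40)                    # e.g., no. of correct options
    difficulty = {"low": 0.6 * feature + rng.normal(size=40),
                  "high": 0.9 * feature + rng.normal(size=40)}

    for group, scores in difficulty.items():
        r, p = pearsonr(feature, scores)
        flag = "**" if p < .01 else "*" if p < .05 else ""
        print(f"{group}: r = {r:.2f}{flag} (p = {p:.3f})")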
Table 8. Regressions for low- and high-performing readers

Significant predictors (a)           Std. Beta      t     Sig.   Sq. partial corr.   Tolerance    VIF
Low-performers
  No. of false distractors               0.62     2.95   0.026         0.59             0.25     4.01
High-performers
  Mean length of sentence                0.38     2.65   0.038         0.55             0.55     1.82
  Coherence (causation)                 -0.29    -2.54   0.044         0.52             0.51     1.96
  No. of false distractors               0.33     2.67   0.037         0.55             0.28     3.62
  No. of correct options                 1.08     4.62   0.004         0.77             0.44     2.25
  Abstractness (highly abstract)         0.30     2.86   0.029         0.58             0.36     2.77

Note: (a) Only significant predictors from the regression models are reported here (p < .05).
Table A-1. Intercorrelations among predictor variables and item difficulty scores

Variable                        1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18     19     20
 1. Scores                   1.00
 2. MLS                       .10   1.00
 3. Clauses                   .10    .38   1.00
 4. Coherence (causation)    -.22    .16   -.03   1.00
 5. Coherence (response)     -.28   -.02   -.17   -.06   1.00
 6. Coherence (description)   .31    .12    .31   -.28   -.20   1.00
 7. Type-token ratio          .35   -.32   -.16   -.12   -.20    .01   1.00
 8. Words off-list           -.33    .10   -.35    .38    .22   -.32    .03   1.00
 9. False distractors         .32    .05    .36   -.33    .18    .05   .42*   -.30   1.00
10. Response alt.             .25    .26    .14   -.23    .12    .26    .38    .01    .26   1.00
11. Correct options         .65**   -.08   -.08   -.05   -.19    .18    .18    .02    .10    .26   1.00
12. Option length            -.21   -.29    .01    .03    .13   -.29    .08   -.02    .30  -.54*   -.46   1.00
13. Comp-type (restructure)  -.38    .20   .43*   -.11   .55*   -.11   -.25    .13    .33    .04   -.34    .16   1.00
14. Comp-type (integration)  .44*   -.28   -.25    .18   -.33    .06    .34    .01   -.19    .37   .57*  -.46*  -.59*   1.00
15. Comp-type (knowledge)    -.30   .45*    .08   -.09   -.06    .01  -.44*   -.22   -.33   -.23  -.48*   -.06   -.11  -.47*   1.00
16. Abstractness (HC)        -.21    .06    .29   -.15    .41    .12   -.08   -.03    .24    .22   -.16   -.02   .56*   -.36   -.15   1.00
17. Abstractness (INT)        .16    .29   -.17    .11   -.14    .03    .48    .21  -.43*   -.13    .38   -.33   -.25    .24    .11   -.34   1.00
18. Abstractness (HA)         .33   -.05    .09   -.17   -.12    .02    .27   -.36    .31    .13   -.03    .04   -.21    .14    .18   -.28   -.39   1.00
19. Abstractness (MA)       -.48*   -.29   -.12   .46*   -.06    .01    .15    .17   -.33   -.23  -.48*    .13   -.11    .18   -.09   -.15   -.20   -.17   1.00
20. No. of words             -.38    .30    .34    .05    .18    .11  -.82**   .09   -.37   -.34   -.23   -.13    .32   -.35    .39    .06    .29   -.09   -.05   1.00

** Correlation is significant at the 0.01 level (2-tailed); * Correlation is significant at the 0.05 level (2-tailed).
Models estimated: without number of words; without type/token ratio.