Article

The Three‐Option Format for Knowledge and Ability Multiple‐Choice Tests: A Case for Why it Should Be More Commonly Used in Personnel Testing

Authors: Edwards, Arthur, and Bruce

Abstract

Multiple‐choice (MC) tests are arguably the most widely used testing format in applied settings. In the psychometric and education literatures, research on the optimal number of options for knowledge and ability MC tests has revealed that three‐option tests are psychometrically equivalent and, in some cases, superior to five‐option tests. In addition, there are a number of practical, economic, and administrative advantages associated with the use of three‐option MC tests. Yet, despite its advantages, the three‐option format is underutilized in personnel selection. Across two studies, we compared test‐taker perceptions, criterion‐related validity, and sex‐based subgroup differences, and in Study 1, we compared race‐based subgroup differences on three‐ and five‐option tests. Participants in the two studies completed a three‐ or five‐option version of the ACT. Test perceptions, criterion‐related validity, and race‐ and sex‐based subgroup differences were similar across test formats. The implications for the expanded use of three‐option tests in applied settings and future directions for research are discussed.


... Multiple-choice knowledge and ability tests are frequently administered in educational and personnel selection contexts (Edwards, Arthur, & Bruce, 2012). From a diagnostic perspective, a test should collect as much information as possible using only as many items as necessary. ...
... Multiple-choice tests are in widespread use in educational and personnel selection contexts. They are frequently employed, for example, to measure cognitive ability (e.g., Carretta & Ree, 2018; Edwards et al., 2012), financial literacy (e.g., Calcagno & Monticone, 2015), and job knowledge (e.g., Ones & Viswesvaran, 2007). Multiple-choice tests are also often used in situational judgement tests (St-Sauveur, Girouard, & Goyette, 2014) and in online assessments for personnel selection (Scott & Lezotte, 2012). ...
Article
Full-text available
Multiple‐choice tests are frequently used in personnel selection contexts to measure knowledge and abilities. Option weighting is an alternative multiple‐choice scoring procedure that awards partial credit for incomplete knowledge reflected in applicants’ distractor choices. We investigated whether option weights should be based on expert judgment or on empirical data when trying to outperform conventional number‐right scoring in terms of reliability and validity. To obtain generalizable results, we used repeated random sub‐sampling validation and found that empirical option weighting, but not expert option weighting, increased the reliability of a knowledge test. Neither option weighting procedure improved test validity. We recommend improving the reliability of existing ability and knowledge tests used for personnel selection by computing and publishing empirical option weights.
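To make the option-weighting idea concrete, here is a minimal Python sketch of empirical option weighting. It assumes weights are derived from examinees' rest scores (total score excluding the item in question), which is one common operationalization; the cited study's exact weighting procedure may differ, and all names here are illustrative.

```python
import numpy as np

def empirical_option_weights(responses, key):
    """Derive option weights from the data: each option of each item is
    weighted by the mean rest score (total score excluding that item) of
    the examinees who chose it, standardized within item.

    responses: (N, J) array of chosen option indices
    key:       (J,) array of correct option indices
    """
    N, J = responses.shape
    K = responses.max() + 1
    correct = (responses == key)
    total = correct.sum(axis=1)                  # conventional number-right score
    weights = np.zeros((J, K))
    for j in range(J):
        rest = total - correct[:, j]             # rest score excludes item j
        for o in range(K):
            chose = responses[:, j] == o
            if chose.any():
                weights[j, o] = rest[chose].mean()
        # standardize within item so weights are comparable across items
        weights[j] = (weights[j] - weights[j].mean()) / (weights[j].std() + 1e-12)
    return weights

def option_weighted_scores(responses, weights):
    """Score each examinee as the sum of the weights of the options chosen."""
    J = responses.shape[1]
    return sum(weights[j, responses[:, j]] for j in range(J))
```

Partial credit for incomplete knowledge arises because distractors favored by otherwise high-scoring examinees receive higher weights than distractors favored by low-scoring examinees.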
... Six previous studies investigated whether the number of answer options affected criterion-related evidence of the validity of MC test scores. To this end, validity coefficients were obtained by correlating test scores with some external criteria (e.g., Edwards et al., 2012); these studies found no meaningful or systematic relation between option number and validity coefficients. However, to detect differences in criterion-related evidence of validity, it is necessary to compare correlation coefficients, and this requires very large sample sizes. ...
... The mean number of participants per condition was 144. Edwards et al. (2012) conducted two experiments to investigate criterion-related evidence of the validity of 3- and 5-option MC items. The mean sample sizes per condition were 107 and 205, respectively. ...
Article
In multiple-choice tests, the quality of distractors may be more important than their number. We therefore examined the joint influence of distractor quality and quantity on test functioning by providing a sample of 5,793 participants with five parallel test sets consisting of items that differed in the number and quality of distractors. Surprisingly, we found that items in which only the one best distractor was presented together with the solution provided the strongest criterion-related evidence of the validity of test scores and thus allowed for the most valid conclusions on the general knowledge level of test takers. Items that included the best distractor produced more reliable test scores irrespective of option number. Increasing the number of options increased item difficulty, but did not increase internal consistency when testing time was controlled for.
... Later, this theory was supported by many other studies [10,31]. The principle of random guessing applies only if all the options have an equal probability of being chosen; however, tests are usually designed to meet the learning objectives of a taught curriculum, and test takers are expected to have full or partial knowledge of the test items. They therefore approach each option with some degree of knowledge, which is certainly not random guessing [46], because examinees are familiar with the exam subject. ...
Article
Full-text available
Background Studies that have investigated the effect of the number of options in MCQ tests used in the assessment of senior medical students are scarce. This study aims to compare exam psychometrics between three- and five-option MCQ tests in final-year assessments. Methods A cluster randomized study was applied. Participants were classified into three groups, according to their academic levels. Students in each of those levels were randomized into either the three- or five-option test groups. Results Mean time to finish the five-option test was 45 min, versus 32 min for the three-option group. Cronbach’s alpha was 0.89 for the three-option group, versus 0.81 for the five-option group, p-value = 0.19. The mean difficulty index for the three-option group was 0.75, compared to 0.73 for the five-option group, p-value = 0.57. The mean discrimination index was 0.53 for the three-option group, and 0.45 for the five-option group, p-value = 0.07. The frequency of non-functioning distractors was higher in the five-option test, 111 (56%), versus 39 (39%) in the three-option test, with p-value < 0.01. Conclusions This study has shown that three-option MCQs are comparable to five-option MCQs in terms of exam psychometrics. Three-option MCQs are superior to five-option tests regarding distractors’ effectiveness and saving administrative time.
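The psychometric indices compared in studies like this one follow standard definitions and are straightforward to compute. The sketch below (illustrative names, dichotomous scoring assumed) uses the conventional conventions: difficulty as proportion correct, discrimination as the corrected item-total correlation, reliability as Cronbach's alpha (equivalent to KR-20 for 0/1 items), and non-functioning distractors as those endorsed by fewer than 5% of examinees.

```python
import numpy as np

def item_analysis(responses, key, nfd_cut=0.05):
    """responses: (N, J) chosen option indices; key: (J,) correct options."""
    N, J = responses.shape
    correct = (responses == key).astype(float)
    total = correct.sum(axis=1)

    difficulty = correct.mean(axis=0)            # proportion correct per item

    # discrimination: correlation of each item with the rest of the test
    discrimination = np.array([
        np.corrcoef(correct[:, j], total - correct[:, j])[0, 1]
        for j in range(J)
    ])

    # Cronbach's alpha (equals KR-20 for dichotomously scored items)
    alpha = (J / (J - 1)) * (1 - correct.var(axis=0, ddof=1).sum()
                             / total.var(ddof=1))

    # distractors endorsed by fewer than 5% of examinees: "non-functioning"
    K = responses.max() + 1
    nfd = {j: [o for o in range(K) if o != key[j]
               and np.mean(responses[:, j] == o) < nfd_cut]
           for j in range(J)}
    return difficulty, discrimination, alpha, nfd
```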
... The first guideline in creating answer choice sets is that all distractors should be plausible (Weitzman, 1970; Haladyna and Steven, 1989; Haladyna et al., 2010). This is because non-plausible distractors can be eliminated easily by students, decrease item difficulty, increase reading time, and lessen the number of items that can be included in an exam (Ascalon et al., 2007; Tarrant and Ware, 2010; Edwards et al., 2012; Schneid et al., 2014; Papenberg and Musch, 2017). One way to attempt to ensure that distractors are plausible is to create them from student errors or misconceptions (Case and Swanson, 2002; Moreno and Martı, 2006; Tarrant et al., 2009; Gierl et al., 2017). ...
Thesis
Assessment of student learning is ubiquitous in higher education chemistry courses because it is the mechanism by which instructors can assign grades, alter teaching practice, and help their students to succeed. One type of assessment that is popular in general chemistry courses, yet difficult to create effectively, is the multiple-choice assessment. Despite its popularity, little is known about the extent to which multiple-choice general chemistry exams adhere to accepted design practices or the processes that general chemistry instructors engage in while creating these assessments. Further understanding of multiple-choice assessment quality and the design practices of general chemistry instructors could inform efforts to improve the quality of multiple-choice assessment practice in the future. This work attempted to characterize multiple-choice assessment practices in undergraduate general chemistry classrooms by (1) conducting a phenomenographic study of general chemistry instructors’ assessment practices and (2) designing an instrument that can detect violations of item writing guidelines in multiple-choice chemistry exams. The phenomenographic study of general chemistry instructors’ assessment practices included 13 instructors from the United States who participated in a three-phase interview. They were asked to describe how they create multiple-choice assessments, to evaluate six multiple-choice exam items, and to create two multiple-choice exam items using a think-aloud protocol. It was found that the participating instructors considered many appropriate assessment design practices yet did not utilize, or were not familiar with, all the appropriate assessment design practices available to them. Additionally, an instrument was developed that can be used to detect violations of item writing guidelines in multiple-choice exams. The instrument, known as the Item Writing Flaws Evaluation Instrument (IWFEI), was shown to be reliable between users of the instrument. Once developed, the IWFEI was used to analyze 1,019 general chemistry exam items. This instrument provides a tool for researchers to use to study item writing guideline adherence, as well as a tool for instructors to use to evaluate their own multiple-choice exams. It is hoped that use of the IWFEI will improve multiple-choice item writing practice and quality. The results of this work provide insight into the multiple-choice assessment design practices of general chemistry instructors and an instrument that can be used to evaluate multiple-choice exams for item writing guideline adherence. Conclusions, recommendations for professional development, and recommendations for future research are discussed.
Article
Multiple-choice (MC) exams are common in undergraduate general chemistry courses in the United States and are known for being difficult to construct. With their extensive use in the general chemistry classroom, it is important to ensure that these exams are valid measures of what chemistry students know and can do. One threat to MC exam validity is the presence of flaws, known as item writing flaws, that can falsely inflate or deflate a student’s performance on an exam, independent of their chemistry knowledge. Such flaws can disadvantage (or falsely advantage) students in their exam performance. Additionally, these flaws can introduce unwanted noise into exam data. With the numerous possible flaws that can be made during MC exam creation, it can be difficult to recognize (and avoid) these flaws when creating MC general chemistry exams. In this study, a rubric, known as the Item Writing Flaws Evaluation Instrument (IWFEI), has been created that can be used to identify item writing flaws in MC exams. The instrument was developed based on a review of the item writing literature and was tested for inter-rater reliability using general chemistry exam items. The instrument was found to have a high degree of inter-rater reliability, with an overall percent agreement of 91.8% and a Krippendorff’s alpha of 0.836. Using the IWFEI in an analysis of 1,019 general chemistry MC exam items, it was found that 83% of items contained at least one item writing flaw, with the most common flaw being the inclusion of implausible distractors. From the results of this study, an instrument has been developed that can be used in both research and teaching settings. As the IWFEI is used in these settings we envision an improvement in MC exam development practice and quality.
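For readers wanting to reproduce the two reliability statistics reported for the IWFEI, a minimal sketch follows. The ratings matrix is fabricated for illustration; the krippendorff package is a third-party library, and treating the flaw codes as nominal is an assumption consistent with categorical ratings.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# rows = raters, columns = exam items; values = nominal flaw-category codes
ratings = np.array([[1, 0, 2, 1, 0, 1],
                    [1, 0, 2, 1, 1, 1]])

# overall percent agreement between the two raters
agreement = (ratings[0] == ratings[1]).mean()

# Krippendorff's alpha for nominal data
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"percent agreement = {agreement:.1%}, alpha = {alpha:.3f}")
```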
... Empirical support for this recommendation is provided by a thorough meta-analysis completed by Rodriguez (2005), who found that most distractors are selected by so few examinees that they are essentially nonfunctional. In addition, several studies have shown that eliminating one option from a four-option MCQ, or two options from a five-option MCQ, has a negligible impact on measurement precision (Delgado and Prieto 1998; Edwards et al. 2012; Rodriguez 2005; Tarrant et al. 2009; Kilgour and Tayyaba 2016). Test items with fewer options should, in theory, reduce the amount of reading time per item, thereby decreasing exam speededness or allowing more test items to be covered on an exam (Tversky 1964). ...
Article
Full-text available
Research suggests that the three-option format is optimal for multiple choice questions (MCQs). This conclusion is supported by numerous studies showing that most distractors (i.e., incorrect answers) are selected by so few examinees that they are essentially nonfunctional. However, nearly all studies have defined a distractor as nonfunctional if it is selected by fewer than 5% of examinees. A limitation of this definition is that the proportion of examinees available to choose a distractor depends on overall item difficulty. This is especially problematic for mastery tests, which consist of items that most examinees are expected to answer correctly. Based on the traditional definition of nonfunctional, a five-option MCQ answered correctly by greater than 90% of examinees will be constrained to have only one functional distractor. The primary purpose of the present study was to evaluate an index of nonfunctional that is sensitive to item difficulty. A secondary purpose was to extend previous research by studying distractor functionality within the context of professionally-developed credentialing tests. Data were analyzed for 840 MCQs consisting of five options per item. Results based on the traditional definition of nonfunctional were consistent with previous research indicating that most MCQs had one or two functional distractors. In contrast, the newly proposed index indicated that nearly half (47.3%) of all items had three or four functional distractors. Implications for item and test development are discussed.
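The contrast between the two definitions can be made explicit in code. Below, the traditional index divides a distractor's endorsements by all examinees, while the difficulty-sensitive version divides by only those examinees who answered incorrectly, i.e., those actually available to choose a distractor. The conditional formulation is one plausible reading of a difficulty-sensitive index, not necessarily the study's exact formula.

```python
import numpy as np

def distractor_functionality(responses, key, cut=0.05):
    """responses: (N,) chosen option indices for one item; key: correct option.
    Returns, per distractor, whether it is functional under each definition."""
    N = responses.size
    counts = np.bincount(responses, minlength=responses.max() + 1)
    n_wrong = N - counts[key]            # pool available to choose distractors
    result = {}
    for o, n_o in enumerate(counts):
        if o == key:
            continue
        traditional = n_o / N >= cut                       # share of everyone
        adjusted = n_wrong > 0 and n_o / n_wrong >= cut    # share of the errors
        result[o] = {"traditional": traditional,
                     "difficulty_sensitive": adjusted}
    return result

# easy mastery-style item: 92% correct, so at most 8% remain for distractors
item = np.array([0] * 92 + [1] * 5 + [2] * 2 + [3] * 1)
print(distractor_functionality(item, key=0))
```

On this easy item, only one distractor clears the traditional 5% bar, but all three clear the conditional bar, which is precisely the pattern the study describes.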
... Arthur, & Bruce, 2012; Motowidlo et al., 1990), the format does not correspond to real life. ...
... The results reported here for the rank-SJT must also be interpreted in the context of the fact that we used a five-option SJT. Hence, one can envisage a situation where a four- or maybe even a three-option SJT (e.g., see Edwards, Arthur, & Bruce, 2012) might be considered to be less difficult and engender less information processing and cognitive demands than a five-response format. In addition, the rank-SJT may also be influenced by the structural dispersion or dissimilarity of the response options (i.e., the extent to which the relative effectiveness of the options are similar or dissimilar). ...
Article
Full-text available
As a testing method, the efficacy of situational judgment tests (SJTs) is a function of a number of design features. One such design feature is the response format. However, despite the considerable interest in SJT design features, there is little guidance in the extant literature as to which response format is superior or the conditions under which one might be preferable to others. Using an integrity-based SJT measure administered to 31,194 job applicants, we present a comparative evaluation of 3 response formats (rate, rank, and most/least) in terms of construct-related validity, subgroup differences, and score reliability. The results indicate that the rate-SJT displayed stronger correlations with the hypothesized personality traits; weaker correlations with general mental ability and, consequently, lower levels of subgroup differences; and higher levels of internal consistency reliability. A follow-up study with 492 college students (Study 2; details of which are presented in the online supplemental materials) also indicates that the rate response format displayed higher levels of internal consistency and retest reliability as well as favorable reactions from test takers. However, it displayed the strongest relationships with a measure of response distortion, suggesting that it is more susceptible to this threat. Although there were a few exceptions, the rank and most/least response formats were generally quite similar in terms of several of the study outcomes. The results suggest that in the context of SJTs designed to measure noncognitive constructs, the rate response format appears to be the superior, preferred response format, with its main drawback being that it is susceptible to response distortion, although not any more so than the rank response format.
... That is, researchers should check for unintended side-effects on validity of strategies designed to be fairer or more efficient. For example, Edwards et al. (2012) demonstrated that three-option multiple-choice items, which have efficiency advantages, were similar to five-option tests in terms of candidate reactions, criterion-related validity, and sex-based group differences. Thus, Edwards and colleagues recommended increased use of the more efficient alternative. ...
Article
Candidate perceptions of test fairness have significant consequences for organizations. However, little attention has been directed toward interventions that may promote favorable reactions, despite recognition that the identification of such management practices is important. The current study investigated a strategy designed to improve candidate reactions—field vetting of items. This strategy involves comprehensive test item review sessions with a variety of applied subject matter experts. The extent to which the field vetting process resulted in more favorable reactions among police officers vying for promotion was investigated using a naturally-occurring quasi-experimental design. Results showed support for the field vetting process to improve reactions to high-stakes tests.
Article
There is nearly a century of educational research demonstrating that three-option multiple-choice questions (MCQs) are as valid and reliable as four- or five-option questions, yet this format continues to be underutilized in educational institutions. This replication study used a quasi-experimental between-groups design conducted at three Canadian schools of nursing to examine the psychometric properties of three-option MCQs when compared to the more traditional four-option questions. Data analysis revealed that there were no statistically significant differences in item discrimination, difficulty, or mean examination scores when MCQs were administered with three versus four answer options.
Article
The evidence is mounting regarding the guidance to employ more three-option multiple-choice items. From theoretical analyses, empirical results, and practical considerations, such items are of equal or higher quality than four- or five-option items, and more items can be administered to improve content coverage. This study looks at 58 tests, including state achievement, college readiness, and credentialing tests. The evidence here supports previous assertions. The article also clarifies distractor functioning criteria and offers a typology of items via distractor functioning.
Article
Few articles contemplate the need for good guidance in question item-writing in the continuing education (CE) space. Although many of the core principles of sound item design translate to the CE health education team, the need exists for specific examples for nurse educators that clearly describe how to measure changes in competence and knowledge using multiple choice items. In this article, some key points and specific examples for nursing CE providers are shared. J Contin Educ Nurs. 2015;46(11):481-483.
Article
Nursing Professional Development (NPD) specialists frequently design test items to assess competence, to measure learning outcomes, and to create active learning experiences. This article presents six valuable tips for improving test items and using test results to strengthen validity of measurement. NPD specialists can readily apply these tips and examples to measure knowledge with greater accuracy.
Article
Full-text available
On the basis of a distinction between test content and method of testing, the present study examined several conceptually and practically important effects relating race, reading comprehension, method of assessment, face validity perceptions, and performance on a situational judgment test using a sample of 241 psychology undergraduates (113 Blacks and 128 Whites). Results showed that the Black–White differences in situational judgment test performance and face validity reactions to the test were substantially smaller in the video-based method of testing than in the paper-and-pencil method. The Race × Method interaction effect on test performance was attributable to differences in reading comprehension and face validity reactions associated with race and method of testing. Implications of the findings were discussed in the context of research on adverse impact and examinee test reactions.
Article
Full-text available
A method for investigating measurement equivalence across subpopulations is developed and applied to an instrument frequently used to assess job satisfaction (the Job Descriptive Index; JDI). The method is based on Jöreskog's simultaneous factor analysis in several populations. Several adaptations are necessary to overcome problems with violations of assumptions that occur with rating scale data. Two studies were conducted to evaluate the measurement equivalence of the JDI across different subpopulations. Investigation of five relatively homogeneous subpopulations within one industry revealed invariant measurement properties for the JDI. In the second study, measurement equivalence of the JDI was examined across health care, retailing, and military samples. Generally small violations of measurement equivalence were found. The results in both studies indicate that mean differences in JDI scores (i.e., differences in job satisfaction across groups) are due to group differences rather than lack of measurement equivalence.
Article
Full-text available
Previous studies have indicated that as many as 25% to 50% of applicants in organizational and educational settings are retested with measures of cognitive ability. Researchers have shown that practice effects are found across measurement occasions such that scores improve when these applicants retest. In this study, the authors used meta-analysis to summarize the results of 50 studies of practice effects for tests of cognitive ability. Results from 107 samples and 134,436 participants revealed an adjusted overall effect size of .26. Moderator analyses indicated that effects were larger when practice was accompanied by test coaching and when identical forms were used. Additional research is needed to understand the impact of retesting on the validity inferences drawn from test scores.
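The overall effect in such a meta-analysis is, at its simplest, a sample-size-weighted average of the per-study standardized mean gains. The numbers below are fabricated for illustration; a full meta-analysis would also correct for artifacts such as measurement unreliability.

```python
import numpy as np

# hypothetical per-study retest gains (Cohen's d) and sample sizes
d = np.array([0.31, 0.18, 0.42, 0.22])
n = np.array([220, 1480, 96, 540])

d_bar = np.average(d, weights=n)   # simplest meta-analytic pooled effect
print(f"weighted mean practice effect: d = {d_bar:.2f}")
```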
Article
Full-text available
This article summarizes the practical and theoretical implications of 85 years of research in personnel selection. On the basis of meta-analytic findings, this article presents the validity of 19 selection procedures for predicting job performance and training performance and the validity of paired combinations of general mental ability (GMA) and the 18 other selection procedures. Overall, the 3 combinations with the highest multivariate validity and utility for job performance were GMA plus a work sample test (mean validity of .63), GMA plus an integrity test (mean validity of .65), and GMA plus a structured interview (mean validity of .63). A further advantage of the latter 2 combinations is that they can be used for both entry level selection and selection of experienced employees. The practical utility implications of these summary findings are substantial. The implications of these research findings for the development of theories of job performance are discussed.
Article
Full-text available
The rigorous construction of items constitutes a field of great current interest for psychometric researchers and practitioners. In previous studies we have reviewed and analyzed the existing guidelines for the construction of multiple-choice items. From this review emerged a new proposal for guidelines that is now, in the present work, subjected to empirical assessment. This assessment was carried out by users of the guidelines and by experts in item construction. The results endorse the proposal for the new guidelines presented, confirming the advantages in relation to their simplicity and efficiency, as well as permitting identification of the difficulties involved in drawing up and organizing some of the guidelines. Taking into account these results, we propose a new, refined set of guidelines that constitutes a useful, simple, and structured instrument for the construction of multiple-choice items.
Article
Full-text available
Examiners seeking guidance on multiple‐choice and true/false tests are likely to encounter various faulty or questionable ideas. Twelve of these are discussed in detail, having to do mainly with the effects on test reliability of test length, guessing and scoring method (i.e. number‐right scoring or negative marking). Some misunderstandings could be based on evidence from tests that were badly written or administered, while others may have arisen through the misinterpretation of reliability coefficients. The usefulness of item response theory in the analysis of academic test items is briefly dismissed.
Article
Full-text available
Self-reported grades are heavily used in research and applied settings because of the importance of grades and the convenience of obtaining self-reports. This study reviews and meta-analytically summarizes the literature on the accuracy of self-reported grades, class ranks, and test scores. Results based on a pairwise sample of 60,926 subjects indicate that self-reported grades are less construct valid than many scholars believe. Furthermore, self-reported grade validity was strongly moderated by actual levels of school performance and cognitive ability. These findings suggest that self-reported grades should be used with caution. Situations in which self-reported grades can be employed more safely are identified, and suggestions for their use in research are discussed.
Article
Full-text available
This study examined the validity of an item-writing rule concerning the optimal number of options in the design of multiple-choice test items. Although measurement textbooks typically recommend the use of four or five options - and most ability and achievement tests still follow this rule - theoretical papers as well as empirical research over a period of more than half a century reveal that three options may be more suitable for most ability and achievement test items. Previous results show that three-option items, compared with their four-option versions, tend to be slightly easier (i.e., with higher traditional difficulty indexes) without showing any decrease in discrimination. In this study, two versions (with four and three options) of 90 items comprising three computerized examinations were applied in successive years, showing the expected trend. In addition, there were no systematic changes in reliability for the tests, which adds to the evidence favoring the use of the three-option test item.
Article
Full-text available
If a test taker possesses test-wiseness and relevant partial knowledge, and if a test contains susceptible items, then the combination of these factors can result in improved or higher scores. In this article, test-wiseness is defined and described in terms of the elements that comprise test-wiseness. A model of test-wise test taking behavior is presented that shows the need for relevant partial knowledge in the application of test-wiseness. A review of the correlates, including race or ethnicity, of test-wiseness is then provided, followed by a review of the effects of training programs designed to minimize the differences in test-wiseness among examinees. The paper concludes with the recommendation, made in the interest of fairness to all examinees, that multifaceted, multimedia training programs directed toward the acquisition of test-wiseness be included as a regular part of the school program at the junior high school level.
Article
Full-text available
Studied the optimum number of alternatives in multiple choice items using a procedure that is based on item response theory and that requires only a single sample. A 5-alternative, 221-item English vocabulary test (J. Olea et al., 1996) was administered to 452 secondary school students, undergraduate students, and university teachers in Spain. Subjects' responses to the 3 worst alternatives were reassigned randomly to generate hypothetical answers to the 2nd, 3rd, and 4th alternatives, respectively. Changes in item parameters, test information function, and ability estimation were analyzed. The data show that the 2-option condition provided the worst results and that the 3-option condition produced the best results. The methodological limitations of this research are discussed.
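In an item response theory framing like the one used here, the number of options enters most directly through the pseudo-guessing parameter. A minimal sketch follows (parameter values illustrative; tying the guessing parameter c to 1/k is a simplifying assumption, since real guessing is rarely uniform):

```python
import numpy as np

def p_correct(theta, a, b, k):
    """3PL response function with pseudo-guessing c = 1/k, the chance
    success rate for a k-option item under blind random guessing."""
    c = 1.0 / k
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, k):
    """Fisher information of the 3PL item at ability theta."""
    c = 1.0 / k
    p = p_correct(theta, a, b, k)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# fewer options raise the chance floor, which mainly costs information
# at low ability levels
for k in (2, 3, 5):
    info = item_information(np.array([-2.0, 0.0, 2.0]), a=1.2, b=0.0, k=k)
    print(k, np.round(info, 3))
```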
Article
Full-text available
This study provides a global perspective on gender differences in performance of 9- and 13-year-olds on mathematics and science exams by reanalyzing and interpreting results on the 1991 International Assessment of Educational Progress. The analyses were performed across 20 countries that tested 13-year-olds and 14 countries that tested 9-year-olds. A random sample of 3,300 students was selected from each population at each age level; half were assessed in mathematics and half in science. The gender effect sizes on the mathematics assessment at both the subdomains level and the total scores were found to be small, especially among 9-year-olds. In general, the gender effects for science were substantially larger than those for mathematics (0.16 and 0.26 SDs on the total score, in favor of boys, for 9- and 13-year-olds, respectively). Analyses were carried out in seven selected countries—Hungary, Ireland, Israel, Korea, Scotland, Spain, and the United States. Gender differences in variability, reliability, and the structure of the intercorrelations among the subdomains were discussed as well.
Article
Full-text available
A common practice in applications of structural equation modeling techniques is to create composite measures from individual items. The purpose of this article was to provide an empirical comparison of several composite formation methods on model fit. Data from 1,177 public school teachers were used to test a model of union commitment in which alternative composite formation methods were used to specify the measurement components of the model. Bootstrapping procedures were used to generate data for two additional sample sizes. Results indicated that the use of composites, in general, resulted in improved overall model fit as compared to treating all items as individual indicators. Lambda values and explained criterion variance indicated that this improved model fit was due to the creation of strong measurement models. Implications of these results for researchers using composites are discussed.
Article
Full-text available
Procedural and distributive justice were examined in an employee selection situation. Along procedural justice dimensions, job relatedness of and explanation offered for the selection procedures were manipulated. Distributive justice was examined through manipulation of a selection decision and collection of a priori hiring expectations. Dependent measures included fairness reactions, recommendation intentions, self-efficacy, and actual work performance. Undergraduates (n = 260) were selected/rejected for paid employment. Job relatedness influenced performance and interacted with selection decision on perceptions of distributive fairness and self-efficacy. Explanations influenced recommendations of rejected applicants. Interactions between hiring expectations and selection decision were observed on perceived fairness and recommendation intentions. Discussion focuses on theoretical and practical implications of the observed interactions.
Article
Full-text available
Mean subgroup (gender, ethnic/cultural, and age) differences are summarized across studies for several predictor domains – cognitive ability, personality and physical ability – at both broadly and more narrowly defined construct levels, with some surprising results. Research clearly indicates that the setting, the sample, the construct and the level of construct specificity can all, either individually or in combination, moderate the magnitude of differences between groups. Employers using tests in employment settings need to assess accurately the requirements of work. When the exact nature of the work is specified, the appropriate predictors may or may not have adverse impact against some groups. The possible causes and remedies for adverse impact (measurement method, culture, test coaching, test-taker perceptions, stereotype threat and criterion conceptualization) are also summarized. Each of these factors can contribute to subgroup differences, and some appear to contribute significantly to subgroup differences on cognitive ability tests, where Black–White mean differences are most pronounced. Statistical methods for detecting differential prediction, test fairness and construct equivalence are described and evaluated, as are statistical/mathematical strategies for reducing adverse impact (test-score banding and predictor/criterion weighting strategies).
Article
The current study reported the results of a meta-analytic investigation of the effects on test scores and test completion times of three aspects of writing test items: the number of answers in multiple-choice exams, the order of item difficulty, and the organization of items by content. The results of the meta-analysis indicated that three-choice questions are slightly easier than four-choice questions (d = .90) and take significantly less time to complete (d = −.61). Exams beginning with easier items and then moving to more difficult items are slightly easier than exams with randomly ordered items (d = .11) or exams beginning with difficult items (d = .22). Exams in which the items are organized by content are slightly easier than exams containing randomly ordered items (d = .04). All of the above effect sizes are small.
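The d statistics reported above are standardized mean differences; for reference, a minimal computation with a pooled standard deviation (all values fabricated for illustration):

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

# e.g., mean scores on hypothetical 3- vs. 4-choice forms of the same exam
print(round(cohens_d(38.1, 6.0, 150, 36.2, 6.4, 150), 2))
```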
Article
Although limited proficiency in the language of a test is known to depress aptitude test scores, the changes that occur as proficiency rises over time have been less well studied. The objective here is to contrast longitudinal changes in test performance for persons who indicated that English was (EBL) or was not their best language (ENBL). Analyses were based on a sample of U.S. citizens and permanent residents (N=65,987 EBL and N=1,592 ENBL); each individual had taken both the Scholastic Assessment Test (SAT) and the Graduate Record Examinations (GRE) General Test some years later. The gap in verbal mean scores between the EBL non-Hispanic White group and the ENBL groups (such as non-Hispanic White, Asian-American, Black, Puerto Rican, and other Hispanic) narrowed by 0.21 to 0.48 standard deviation units from the taking of the SAT to the taking of the GRE. These findings have implications for the interpretation of test scores in longitudinal studies of linguistic minorities and for selective admissions of students with limited proficiency in English.
Article
A taxonomy of 43 multiple-choice item-writing rules was developed on the basis of an analysis of 46 references in the educational measurement literature. In this comparative review, the results of 96 theoretical and empirical studies were analyzed to determine if support existed for each rule. In some instances, rules were revised, and for nearly one half the rules, no research was found.
Article
A taxonomy of 31 multiple-choice item-writing guidelines was validated through a logical process that included two sources of evidence: the consensus achieved from reviewing what was found in 27 textbooks on educational testing and the results of 27 research studies and reviews published since 1990. This taxonomy is mainly intended for classroom assessment. Because textbooks have potential to educate teachers and future teachers, textbook writers are encouraged to consider these findings in future editions of their textbooks. This taxonomy may also have usefulness for developing test items for large-scale assessments. Finally, research on multiple-choice item writing is discussed both from substantive and methodological viewpoints.
Article
Theoretical and test simulation work reveals that under the knowledge-or-random-guessing assumption, three-option item tests are at least as good as four-option item tests in terms of item discrimination and internal consistency. Of concern, however, is the finding that multiple-choice items may be susceptible to testwiseness, thereby contradicting the random-guessing assumption. Both item-level and test-level characteristics were examined for items included in a high-stakes school-leaving mathematics examination. As expected, the influence of testwiseness is lessened when three-option items are used instead of four-option items. Differences and nondifferences between the psychometric characteristics of the three-option and four-option test forms tend to agree with the findings of earlier studies: Tests consisting of three-option items are at least equivalent to tests composed of four options in terms of internal consistency score reliability, difficulty is inversely related to the number of options, and the findings for item discrimination are not conclusive.
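The knowledge-or-random-guessing assumption referenced above can be stated compactly. In LaTeX form (a sketch, where p is the probability the examinee knows the answer and k the number of options):

```latex
P(\text{correct}) = p + (1 - p)\,\frac{1}{k}
```

Testwiseness violates the uniform 1/k guessing term by letting examinees eliminate implausible options first, which is why its influence is expected to shrink as weak distractors are removed.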
Article
This study examines the question of the optimal number of choices on a multiple-choice test from an information theory perspective. Results are compared to addressing this question using more traditional statistical approaches. Based on information theory, the study reveals that, in general, three choices to a multiple-choice test item seem optimal. This finding verifies what other researchers have found from statistical and observational (item analysis) approaches.
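One classic information-theoretic argument for this result (a sketch, not necessarily the study's exact derivation) fixes the total number of options as a proxy for reading time and asks which option count maximizes the bits a test can carry:

```python
import math

TOTAL_OPTIONS = 120   # assumed fixed budget of options (reading-time proxy)
for k in (2, 3, 4, 5):
    n_items = TOTAL_OPTIONS / k
    bits = n_items * math.log2(k)   # upper bound if all options function
    print(f"k = {k}: {n_items:4.0f} items, {bits:5.1f} bits")
# n * log2(k) with n*k held fixed peaks at k = e (about 2.7),
# i.e., three options in practice
```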
Article
This study examined the validity of two item-writing rules in the design of test items: (a) the desirable number of options for a multiple-choice test item and (b) use of the inclusive "none of these" option. An experimental repeated measures design found that items with three options were more difficult than those with four options and items employing the "none of these" option were more difficult than those not using this inclusive option format. Neither format manipulation affected item discrimination. Therefore, evidence allows no recommendation for the "none of these" option but suggests an advantage for multiple-choice items with fewer than the traditional four or five options.
Article
Textbook writers often recommend four or five options per multiple-choice item, and most, if not all, testing programs in the United States also employ four or five options. Recent reviews of research on the desirable number of options for a multiple-choice test item reveal that three options may be suitable for most ability and achievement tests. A study of the frequency of acceptably performing distractors is reported. Results from three different testing programs support the conclusion that test items seldom contain more than three useful options. Consequently, testing program personnel and classroom teachers may be better served by using 2- or 3-option items instead of the typically recommended 4- or 5-option items.
Article
Students from two consecutive semesters were given multiple-choice tests over five units of an undergraduate course in psychology. During the first semester, students were given five 50-question 4-option multiple-choice tests, and during the second semester students were given five 50-question 3-option multiple-choice tests. One hundred forty-four (57.6%) of the questions were identical between semesters except for second semester test items having only 3 options. Results indicate that students performed significantly better on 3-option items than on 4-option items (corrected for chance guessing), and that this improvement may be due to improved validity of the test items.
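The "corrected for chance guessing" scores mentioned here follow the classical formula-scoring rule R − W/(k − 1), which subtracts the number of right answers expected from blind guessing. A minimal illustration:

```python
def corrected_score(right, wrong, k):
    """Classical correction for guessing: R - W/(k - 1)."""
    return right - wrong / (k - 1)

# identical raw performance is penalized more on a 3-option form, because
# random guessing succeeds more often when there are fewer options
print(corrected_score(40, 10, k=3))   # 35.0
print(corrected_score(40, 10, k=4))   # ~36.67
```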
Article
This study addressed the hypothesis that, after the systematic elimination of nonfunctioning options, four-option test items would perform as well as five-option test items having one or more dysfunctional distracters. The study consisted of two investigations involving an examination administered to 700 candidates for certification in a medical specialty. In the first investigation, it was found that content experts exhibited a high degree of accuracy in identifying nonfunctioning options where the criterion was empirical item analysis data. The second phase of the study compared five-option versions of multiple-choice items with four-option versions in which a nonfunctioning option had been removed. Results indicated that (a) removal of a nonfunctioning option resulted in a slight, nonsignificant overall increase in item difficulty and no significant differences in item discrimination, (b) a test consisting of items with a nonfunctioning option removed was nearly equally reliable compared with a set of the same items in a five-option format, and (c) the use of empirical or judgmental methods of identifying nonfunctioning options was not related to changes in item performance. Implications, cautions, and suggestions for future research are provided.
Article
Reliability and validity of multiple-choice examinations were computed as a function of the number of options per item and student ability for junior-class parochial high school students administered the verbal section of the Washington Pre-College Test Battery. The least discriminating options were deleted to create 3- and 4-option test formats from the original 5-option item test. Students were placed into ability groups by using noncontiguous grade point average (GPA) cutoffs. The GPAs were the criteria for the validity coefficients. Significant differences (p < 0.05) were found between reliability coefficients for low-ability students. The optimum number of options was three when the ability groups were combined. None of the validity coefficients followed the hypothesized trend. These results are part of the mounting evidence that suggests the efficacy of the 3-option item. An explanation is provided.
Article
Despite evidence supporting 3-option items, text authors and practitioners continue to advocate the use of four or five options. We designed an experiment to test further the efficacy of 3-option achievement items. Parallel tests of 3- and 5-option items were built and distributed randomly to college students. Results showed no differences in mean item difficulty, mean discrimination, or total test score, but a substantial reduction in time spent on 3-option items. The straightforward implication is that content validity may be boosted by writing additional 3-option items to tap more content.
Article
Achievement test reliability and validity as a function of ability were determined for multiple sections of a large University of Washington French class. Previous empirical and theoretical papers suggested that reliabilities of tests with 3-option items were as high as or higher than those of tests with 2, 4, or 5 options. Lord (1977) and Weber (1978), however, argued that decreasing the number of options resulted in a more efficient test for high-level examinees but a less efficient test for low-level examinees. Results of this study did not support this argument for test reliability in a classroom situation. An explanation for the discrepancy is presented.
Article
Previous research has suggested that there exists a bias in the social sciences against no-effect hypotheses. This is regrettable given the importance of establishing not only when an effect does occur but also the boundary conditions of that effect. The purposes of this article are twofold. The first purpose is to review relevant portions of the history of hypothesis testing in an attempt to identify the sources of bias against hypotheses of no effect. The second purpose is to develop and describe rigorous methods for providing evidence in support of no-effect hypotheses, methods that avoid some of the problems traditionally associated with no-effect conclusions.
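One rigorous way to support a no-effect conclusion, in the spirit of this article, is equivalence testing via two one-sided tests (TOST): rather than failing to reject a nil null, one rejects the hypothesis that the difference exceeds a preset equivalence bound. A self-contained sketch follows (bounds and data fabricated; whether TOST is among the specific methods this article develops is an assumption):

```python
import numpy as np
from scipy import stats

def tost_ind(x1, x2, low, upp):
    """Two one-sided t-tests for equivalence of two independent means.
    A small returned p-value supports low < mu1 - mu2 < upp."""
    n1, n2 = len(x1), len(x2)
    diff = np.mean(x1) - np.mean(x2)
    sp2 = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) \
          / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    p_low = stats.t.sf((diff - low) / se, df)    # H0: diff <= low
    p_upp = stats.t.cdf((diff - upp) / se, df)   # H0: diff >= upp
    return max(p_low, p_upp)

rng = np.random.default_rng(0)
three_opt = rng.normal(50, 10, 200)   # hypothetical 3-option form scores
five_opt = rng.normal(50, 10, 200)    # hypothetical 5-option form scores
print(tost_ind(three_opt, five_opt, low=-2.0, upp=2.0))
```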
Article
This review critically examines the literature from 1985 to 1999 on applicant perceptions of selection procedures. We organize our review around several key questions: What perceptions have been studied? What are determinants of perceptions? What are the consequences or outcomes associated with perceptions applicants hold? What theoretical frameworks are most useful in examining these perceptions? For each of these questions, we provide suggestions for key research directions. We conclude with a discussion of the practical implications of this line of research for those who design and administer selection processes.
Article
Selection system designers seek to create processes that do not negatively impact the attraction of applicants. While there is considerable research on applicant reactions to selection procedures, it is not at the level of specificity needed to be of most practical use for designers. This paper reviews this body of research with a focus on how to move toward the more practical advice that will result in system changes.
Article
Social Forces 84.4 (2006) 2367-2368 In one of the most comprehensive collections on workplace discrimination, the contributing authors make a strong case that workplace discrimination is alive and well, but workplaces are not beyond repair. As the title suggests, this collection of 18 chapters draws mainly from psychology to frame a discussion of workplace discrimination although some authors incorporate perspectives from management, economics and sociology into their discussions. In addition to synthesizing research on workplace discrimination, this volume also establishes a new framework for studying discrimination and new directions for future research. The chapters are grouped into three sections with overlapping themes. The first section provides a comprehensive explanation of the individual, group, organization and extra-organizational causes of discrimination at work. While each chapter in this section covers one source of discrimination, the authors recognize that all levels contribute to and shape how people exhibit, experience and respond to discrimination. One chapter in this section, Chapter 3, critiques relational demography's ability to explain discrimination. The authors of this chapter suggest four ways relational demography can improve our understanding of discrimination. Their suggestions bridge the gap between psychology and demography in ways that easily translate into practice. Consistent with the section's focus, Chapter 4 discusses the role of group membership and demographic composition on workplace discrimination. This chapter offers tangible ways for organizations to avoid group-based discrimination: construct new identities for members or allow members' multiple identities to coexist. I anticipated greater specificity in the authors' solutions, but understand that space limitations and complexity of their solutions made that difficult. In this section's last chapter, the authors chart the environmental inputs, organizational throughputs and behaviors/processes, and multi-level outputs of discrimination. This framework will be extremely useful for scholars studying the organizational and environmental causes and consequences of discrimination. We learn from the chapters in this section that discrimination stems from a set of complex, inter-related processes and systems and that the elimination of discrimination requires equally complex solutions. The second section considers discrimination on the basis of group membership. Chapter 6, which addresses organizational race composition, clearly demonstrates the need for a multi-disciplinary approach to the study of workplace discrimination as well as a major limitation of all disciplines – the failure to study race beyond a black-white dichotomy. Less a criticism than a caveat, Chapter 7, on gender discrimination, concisely summarizes a substantial body of work, but does so at the expense of identifying the specific ways employers discriminate on the basis of sex. This section's remaining chapters discuss research on salient yet understudied dimensions of discrimination: sexual orientation, age, disability, personality and physical appearance. Together, these chapters demonstrate a crucial point about workplace discrimination – the underlying mechanisms linking different group memberships to disparate outcomes are varied. As a result, the theories that apply to one type of discrimination (e.g., race discrimination) do not necessarily apply to other forms of discrimination (e.g., age discrimination). 
For example, paternalism and pity on the part of a discriminator may set in motion disability discrimination. An entirely different mechanism, homophobia, may play a role in the discrimination that lesbians and gays face at work. That said, the chapters in this section offer researchers a theoretical framework for investigating multiple types of discrimination and point out just how little we know about discrimination outside of race and sex discrimination. The volume's final section consists of six chapters that focus on the practice, policy and legal implications for the research summarized here. With the exception of Chapter 16 (a summary of the causes and consequences of employment discrimination outside of the United States) and Chapter 17 (a description of the process of studying Wyoming's sex wage gap), the chapters in this section offer researchers and practitioners advice on what to do about workplace discrimination. Readers should read Chapters 13 and 15 together. The former identifies human resource practices that can diversify organizations; the latter highlights the unintended (negative) consequences of some human resource...
Article
The establishment of measurement invariance across groups is a logical prerequisite to conducting substantive cross-group comparisons (e.g., tests of group mean differences, invariance of structural parameter estimates), but measurement invariance is rarely tested in organizational research. In this article, the authors (a) elaborate the importance of conducting tests of measurement invariance across groups, (b) review recommended practices for conducting tests of measurement invariance, (c) review applications of measurement invariance tests in substantive applications, (d) discuss issues involved in tests of various aspects of measurement invariance, (e) present an empirical example of the analysis of longitudinal measurement invariance, and (f) propose an integrative paradigm for conducting sequences of measurement invariance tests.
Article
Traditionally, multiple choice tests have included four or five alternatives. Data from public sector employment tests are presented that indicate that tests composed of multiple choice items containing three alternatives have psychometric properties similar to those offered by tests composed of items containing five alternatives. Given the similarity of the psychometric properties and the likely reductions in cost of development and administration time, three-alternative multiple choice items may be preferable to five-alternative multiple choice items for some testing purposes.
Article
We note that applicant reactions to selection procedures may be of practical importance to employers because of influences on organizations’ attractiveness to candidates, ethical and legal issues, and possible effects on selection procedure validity and utility. In Study 1, after reviewing sample items or brief descriptions of 14 selection tools, newly hired entry-level managers (n = 110) and recruiting/employment managers (n = 44) judged simulations, interviews, and cognitive tests with relatively concrete item-types (e.g., vocabulary, standard written English, mathematical word problems) to be significantly more job related than personality, biodata, and cognitive tests with relatively abstract item-types (e.g., quantitative comparisons, letter sets). A measure of new managers’ cognitive abilities was positively correlated with their perceptions of the job relatedness of selection procedures. In Study 2, applicant reactions to a range of entry-level to professional civil service examinations (assessed immediately after taking the exam) were positively related to (procedural and distributive) justice perceptions and willingness to recommend the employer to others (assessed one month after the exam, n = 460).
Article
The cognitive ability levels of different ethnic groups have interested psychologists for over a century. Many narrative reviews of the empirical literature in the area focus on the Black-White differences, and the reviews conclude that the mean difference in cognitive ability (g) is approximately 1 standard deviation; that is, the generally accepted effect size is about 1.0. We conduct a meta-analytic review that suggests that the one standard deviation effect size accurately summarizes Black-White differences for college application tests (e.g., SAT) and overall analyses of tests of g for job applicants in corporate settings. However, the 1 standard deviation summary of group differences fails to capture many of the complexities in estimating ethnic group differences in employment settings. For example, our results indicate that job complexity, the use of within job versus across job study design, focus on applicant versus incumbent samples, and the exact construct of interest are important moderators of standardized group differences. In many instances, standardized group differences are less than 1 standard deviation. We conduct similar analyses for Hispanics, when possible, and note that Hispanic-White differences are somewhat less than Black-White differences.
Article
Pyburn, Ployhart, and Kravitz (this issue, 2008) introduced the diversity–validity dilemma: that some of the most valid predictors of job performance are also associated with large racioethnic and sex subgroup predictor score differences. This article examines 16 selection strategies hypothesized to minimize racioethnic and sex subgroup differences and adverse impact and, hence, balance diversity and validity. Rather than presenting a highly technical review, our purpose is to provide practitioners with a concise summary, paying particular attention to comparing and contrasting the effectiveness of the strategies and reporting new developments. The paper is organized around 4 key questions: (a) Which strategies are most effective for reducing subgroup differences? (b) Which strategies do not involve a validity tradeoff? (c) What are the major new developments in strategies for reducing adverse impact? (d) What are the major new developments in alternative predictor measurement methods (e.g., interviews, situational judgment tests, assessment centers) for reducing adverse impact? We then conclude with recommendations and caveats for how to best balance diversity and validity. These ideas are developed further in Kravitz (this issue, 2008), who considers even broader approaches for solving the diversity–validity dilemma.
Article
We present an example of an innovative constructed response test format, a write-in/mark-in paper-and-pencil test, as an alternative to the traditional multiple-choice paper-and-pencil test, with the potential for reducing subgroup differences. We present subgroup differences data on these 2 paper-and-pencil test formats on an operational promotional exam in a sample of African American and White firefighters. The tests were designed to measure the same content domain. Using within-subjects data that compared the performance of 13 African American and 14 White fire captains, and between-subjects data that compared the performance of 21 African American and 49 White fire captains, several results were in the predicted direction such that subgroup differences were reduced on the constructed response test. However, these results did not reach statistical significance. Therefore, the study points to the need for additional research to further evaluate the promise of the constructed response test format.
Article
This study applied the attribution framework described by Weiner (1986) to understand the psychological reasons applicants withdrew from a police officer selection process, as well as the consequences of their attributions for withdrawal. Individuals (n = 196) who withdrew from the selection process were contacted and asked to indicate their primary reason for withdrawal; they then rated this reason on locus, stability, and controllability dimensions. They also reported future application expectancies. Results indicate that minority and female applicants appeared to indicate different reasons for withdrawing than did White and male applicants. Finally, race and controllability interacted in the prediction of reapplication expectancies, such that the relationship between expectancies and controllability was negative for White applicants and positive for minority applicants.
Article
Two approaches in the literature for determining the optimal number of choices for a test item are compared with two new approaches.
Article
The first phase of this research effort describes an effort to directly measure the attitudes and opinions of employment test takers toward the tests they just took; the instrument is called the Test Attitude Survey (TAS). Nine factors were developed which reflect test takers' expressed effort and motivation on the test, the degree of concentration, perceived test ease, and the like. Several studies were conducted showing that TAS factors were significantly sensitive to differences in test types and administration, permitting the inference that the TAS possessed construct validity. The second phase of this study tested several propositions and hypotheses. In one study, it is shown that applicants report significantly higher effort and motivation on employment tests compared to incumbents, even when ability is held constant. A second study showed that a small but significant relationship exists between TAS factor scores, test performance, and person factors. Moreover, some of the racial differences on test performance can be accounted for via the TAS factor scores; it is observed that after holding these TAS factors constant, racial differences on employment test scores diminished. In a third study, very limited evidence was found for the incremental and moderating effects of these attitudes, but there were several limitations to the study associated with small sample sizes, unknown reliabilities in the criterion scales, and so forth. Discussion focused on the potential practical applications of the TAS instrument and factor scores. It is suggested that further research could have some utility in this domain.
Article
The role of test-taking attitudes in the decisions of applicants to withdraw from a selection process was examined. Measures of test-taking attitudes were administered to 3,290 police officer applicants. Interviews were conducted with 618 applicants who withdrew from the selection process. Comparative anxiety, motivation, and literacy scales were found to predict withdrawal, but the effects were quite small. African-Americans were more likely to withdraw. Small race differences were found on test attitude scales. The requirement of taking a test was not a major factor in applicant withdrawal; procedural fairness and several other factors appeared to play a greater role. A model of applicant withdrawal is proposed based on the qualitative data from applicants who withdrew.
Article
Multiple-choice items are a mainstay of achievement testing. The need to adequately cover the content domain to certify achievement proficiency by producing meaningful, precise scores requires many high-quality items. More 3-option items can be administered than 4- or 5-option items per testing time while improving content coverage, without detrimental effects on psychometric quality of test scores. Researchers have endorsed 3-option items for over 80 years with empirical evidence—the results of which have been synthesized in an effort to unify this endorsement and encourage its adoption.
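The practical payoff of administering more 3-option items in the same testing time can be projected with the Spearman-Brown prophecy formula. The 25% length gain below is an assumed figure for illustration, not a value from the article:

```python
def spearman_brown(r, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return length_factor * r / (1 + (length_factor - 1) * r)

# if faster 3-option items let ~1.25x as many items fit in the same time,
# a form with reliability .80 projects to roughly .83
print(round(spearman_brown(0.80, 1.25), 3))
```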