
Differential Performance Across Differences in the Ability to Employ Test-Wiseness Strategies According to Contemporary Measurement Theory


Abstract

The present research aims to make use of the Andrich model, one of the polytomous-response models of contemporary measurement theory. The researcher focused on measuring differential performance with respect to the ability to use test-wiseness strategies. To achieve this aim, the researcher relied on the test-wiseness scale developed by (Hamad, 2010). A stratified random sample of 447 male and female students from the tenth, eleventh, and twelfth preparatory grades was selected. The assumptions of item response theory (IRT) were verified, including the unidimensionality assumption, through factor analysis of the test items using principal component analysis (PCA) of the respondents' item responses: the eigenvalue, the proportion of explained variance, and the cumulative explained variance were computed for each factor. Through this assumption, the local independence assumption was confirmed as well. To analyze the scale's item data, the researcher applied the Andrich model using the computer program ConstructMap-4.6. The estimated item locations on the latent trait ranged from −2.55 to 2.71 logits, with a mean of 0.027 logits, indicating that the scale covers a wide range of the measured trait, from the lowest to the highest levels. The standard error of the mean item-difficulty estimates was 0.032, a low value close to zero, indicating that the item locations on the test-wiseness trait underlying the test responses were estimated precisely.
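The unidimensionality check described in the abstract (eigenvalues, explained-variance proportions, and cumulative explained variance from a PCA of the item responses) can be sketched as follows. This is an illustrative sketch on synthetic data, not the study's actual analysis; the function name and the simulated responses are assumptions for the example.

```python
import numpy as np

def pca_unidimensionality(responses):
    """Eigenvalues, explained-variance ratios, and cumulative explained
    variance of the item correlation matrix -- a PCA-style check that one
    dominant component underlies the responses."""
    R = np.corrcoef(responses, rowvar=False)        # item-by-item correlations
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # largest first
    explained = eigvals / eigvals.sum()
    cumulative = np.cumsum(explained)
    return eigvals, explained, cumulative

# Synthetic, roughly unidimensional data: one latent trait drives 12 items.
rng = np.random.default_rng(0)
theta = rng.normal(size=(500, 1))                   # latent trait per person
responses = theta + 0.8 * rng.normal(size=(500, 12))
eigvals, explained, cumulative = pca_unidimensionality(responses)
# A first eigenvalue far larger than the second suggests unidimensionality.
print(round(eigvals[0] / eigvals[1], 1))
```

In practice the eigenvalue of the first component is compared against the remaining ones (and against the cumulative explained variance) before fitting a unidimensional IRT model such as the Andrich model.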

Article
The current study examined the effect of sample size on the detection of differential item functioning (DIF). Three sample sizes (300, 500, 1000) were used with a test of twenty polytomous items, each with five response categories. The Graded Response Model, one of the polytomous item response theory models, was used to estimate item and person parameters, and the Mantel-Haenszel (MH) procedure was used to detect DIF in each sample-size condition. The results showed an inverse relationship between sample size and the number of items flagged as functioning differentially.
Article
The aim of the study was to provide empirical verification of Stocking's equations for estimating item parameters in IRT. For the purpose of this study, a mental ability test developed in a previous study was used. The software packages Bilog-MG and Excel were used to analyze the data. The findings were as follows: 1. Stocking's equations were satisfied in specifying ability levels matching maximum information for estimating item parameters for the three logistic models. 2. Stocking's equations were a suitable approach for specifying ability levels for estimating item parameters; the optimal ability levels specified by Stocking's equations were mentioned in the text. 3. The accuracy of estimating the discrimination and guessing parameters was inconsistent with the accuracy of estimating the ability parameter, whereas estimating the difficulty parameter was consistent with the accuracy of estimating the ability parameter. 4. Ability levels specified by Stocking's equations contributed more than ability levels specified by the equations introduced by Hambleton and Swaminathan in terms of increasing the accuracy of estimating the guessing parameter for the 3PL model and the discrimination parameter for the 2PL model, whereas the accuracy of estimating the difficulty parameter was unchanged. (Keywords: IRT, item parameter estimation, item information function).
Article
In this study the Mantel-Haenszel procedure, widely used in studies for identifying differential item functioning, is proposed as an alternative to the delta-plot method and applied in a test-equating context for flagging common items that behave differentially across cohorts of examinees. The Mantel-Haenszel procedure has the advantage of conditioning on ability when making comparisons of performance of two examinee groups on an item. There are schemes for interpreting the effect size of differential performance, which can inform the decision as to whether to retain those items in the common-item pool, or to discard them. Data from a statewide assessment are analyzed to illustrate the use of this procedure. Advantages of this methodology are discussed and limitations regarding test design that may make its application difficult are described.
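The conditioning-on-ability idea described above can be illustrated with a minimal sketch: examinees are stratified by their matching total score, a 2×2 table (group × item correctness) is formed in each stratum, and a common odds ratio is pooled across strata. The function and the toy data below are assumptions for illustration, not material from the article.

```python
from collections import defaultdict

def mantel_haenszel_odds_ratio(item, group, total_score):
    """Pooled Mantel-Haenszel odds-ratio estimate for one dichotomous
    item, conditioning on the matching total score (the ability proxy).
    group: 1 = reference, 0 = focal; item: 1 = correct, 0 = incorrect."""
    strata = defaultdict(lambda: [0, 0, 0, 0])  # [A, B, C, D] per score level
    for y, g, s in zip(item, group, total_score):
        cell = strata[s]
        if g == 1 and y == 1:
            cell[0] += 1   # A: reference correct
        elif g == 1:
            cell[1] += 1   # B: reference incorrect
        elif y == 1:
            cell[2] += 1   # C: focal correct
        else:
            cell[3] += 1   # D: focal incorrect
    num = den = 0.0
    for A, B, C, D in strata.values():
        n = A + B + C + D
        num += A * D / n
        den += B * C / n
    return num / den if den else float("inf")

# Toy data: two score levels, identical behaviour in both groups at
# each level, so the common odds ratio is 1 (no uniform DIF).
item  = [1, 0, 1, 0, 1, 0, 1, 0]
group = [1, 1, 0, 0, 1, 1, 0, 0]
score = [4, 4, 4, 4, 6, 6, 6, 6]
print(mantel_haenszel_odds_ratio(item, group, score))  # → 1.0
```

An odds ratio well away from 1 at matched score levels is the signal that an item may be functioning differentially and is a candidate for removal from the common-item pool.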
Book
That a test not be biased is an important consideration in the selection and use of any psychological test. That is, it is essential that a test is fair to all applicants and is not biased against a segment of the applicant population. Bias can result in systematic errors that distort the inferences made in selection and classification. In many cases, test items are biased because they contain sources of difficulty that are irrelevant or extraneous to the construct being measured, and these extraneous or irrelevant factors affect performance. Perhaps the item is tapping a secondary factor or factors over and above the one of interest. This issue, known as test bias, has been the subject of a great deal of recent research, and a technique called Differential Item Functioning (DIF) analysis has become the new standard in psychometric bias analysis. The purpose of this handbook is to provide a perspective and foundation for DIF, a review and recommendation of a family of statistical techniques (i.e., logistic regression) to conduct DIF analyses, and a series of examples to motivate its use in practice. DIF statistical techniques are based on the principle that if different groups of test-takers (e.g., males and females) have roughly the same level of something (e.g., knowledge), then they should perform similarly on individual test items regardless of group membership. In their essence, all DIF techniques match test takers from different groups according to their total test scores and then investigate how the different groups performed on individual test items to determine whether the test items are creating problems for a particular group.
Article
In this study we investigated the effects of the average signed area (ASA) between the item characteristic curves of the reference and focal groups and three test purification procedures on the uniform differential item functioning (DIF) detection via the Mantel-Haenszel (M-H) method through Monte Carlo simulations. The results showed that ASA, rather than the percentage of DIF items in the test, determines the performances of the conventional one-stage M-H method. For example, the M-H method performed appropriately even when there were 50% DIF items in the test, as long as ASA approaches 0. Under most of the simulated conditions, the two-stage and iterative M-H methods performed much superior to the one-stage M-H method. When tests were short and the mean abilities of the reference and focal groups were far apart, all the three M-H methods yielded inflated Type I error under the two-parameter or three-parameter logistic model. For most of the other situations, the two-stage and iterative M-H methods can be used safely.
Article
Differential item functioning (DIF) is a key topic as regards tests applied in educational, psychological, social, and health settings. Measurement by means of tests may be invalidated by the presence of items that show different psychometric properties with respect to groups of people from different demographic, social, cultural, and linguistic backgrounds. This article introduces the concept of DIF and distinguishes it from bias and impact. The main statistical techniques applied in the detection of DIF are presented and a guide to the advantages and shortcomings associated with each one is included. Finally, some practical recommendations are offered for professionals who need to ensure the correct detection of DIF.
Book
Constructing Measures introduces a way to understand the advantages and disadvantages of measurement instruments, how to use such instruments, and how to apply these methods to develop new instruments or adapt old ones. The book is organized around the steps taken while constructing an instrument. It opens with a summary of the constructive steps involved. Each step is then expanded on in the next four chapters. These chapters develop the "building blocks" that make up an instrument--the construct map, the design plan for the items, the outcome space, and the statistical measurement model. The next three chapters focus on quality control. They rely heavily on the calibrated construct map and review how to check if scores are operating consistently and how to evaluate the reliability and validity evidence. The book introduces a variety of item formats, including multiple-choice, open-ended, and performance items; projects; portfolios; Likert and Guttman items; behavioral observations; and interview protocols. Each chapter includes an overview of the key concepts, related resources for further investigation and exercises and activities. Some chapters feature appendices that describe parts of the instrument development process in more detail, numerical manipulations used in the text, and/or data results. A variety of examples from the behavioral and social sciences and education including achievement and performance testing; attitude measures; health measures, and general sociological scales, demonstrate the application of the material. An accompanying CD features control files, output, and a data set to allow readers to compute the text's exercises and create new analyses and case archives based on the book's examples so the reader can work through the entire development of an instrument. 
Constructing Measures is an ideal text or supplement in courses on item, test, or instrument development, measurement, item response theory, or Rasch analysis taught in a variety of departments including education and psychology. The book also appeals to those who develop instruments, including industrial/organizational, educational, and school psychologists, health outcomes researchers, program evaluators, and sociological measurers. Knowledge of basic descriptive statistics and elementary regression is recommended. © 2005 by Lawrence Erlbaum Associates, Inc. All rights reserved.
Book
A revision will be coming out in the next few months.
Article
This article reports the first results of a long-term research project focusing on the detection and possible linguistic causes of differential item functioning (DIF) for second generation immigrant students in the Final Test of Primary Education in the Netherlands. The main aim of the project is to provide test constructors with information which is as detailed as possible about sources of DIF in order to help them avoid item bias in future forms of the test. The project was carried out in three steps. First, DIF items were identified using two statistical procedures: IRT and Mantel-Haenszel. The second stage was an investigation of sources of DIF, beginning with an inventory of subject matter elements which can cause difficulty for immigrant students, according to previous educational research. This was followed by a careful evaluation of DIF items by the researchers and by external experts, and a think-aloud experiment with students. At the third stage, judges decided whether each DIF item was biased and thus reduced the construct validity of the test. The article concentrates on the classification of linguistic sources of DIF and a consideration of whether they work to the advantage or disadvantage of immigrant students.
Article
Differential item functioning (DIF) assessment procedures for items with more than 2 ordered score categories were evaluated. Three descriptive statistics—the standardized mean difference (SMD; Dorans & Schmitt, 1991) and 2 procedures based on SIBTEST (Shealy & Stout, 1993)—were considered, along with 5 inferential procedures: 2 based on SMD, 2 based on SIBTEST, and the Mantel (1963) method. A simulation showed that, when the 2 examinee groups had the same distribution, the descriptive index that performed best was the SMD. When the group means differed by 1 SD, a modified form of the SIBTEST DIF effect size measure tended to perform best. The 5 inferential procedures performed almost indistinguishably when the 2 groups had identical distributions. When the groups had different distributions and the studied item was highly discriminating, the SIBTEST procedures showed much better Type I error control than did the SMD and Mantel methods, particularly in short tests. The power ranking of the 5 procedures was inconsistent; it depended on the direction of DIF and other factors. Routine application of these polytomous DIF methods seems feasible when a reliable test is available for matching examinees. The Type I error rates of the Mantel and SMD methods may be a concern under certain conditions. The current version of SIBTEST cannot easily accommodate matching tests that do not use number-right scoring. Additional research in these areas would be useful.
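The standardized mean difference idea evaluated above can be sketched briefly: within each matching-score level, the item means of the focal and reference groups are compared, the differences are weighted by the focal group's distribution over levels, and the result is standardized. The exact Dorans and Schmitt formulation has variants; the sketch below divides by the item's overall standard deviation as one common standardization, and the function name and toy data are assumptions for illustration.

```python
from collections import defaultdict
import statistics

def standardized_mean_difference(item, group, matching):
    """Sketch of an SMD-style index for a polytomous item: the
    focal-group-weighted difference in conditional item means across
    matching-score levels, divided by the item's overall SD.
    group: 0 = focal, 1 = reference."""
    by_level = defaultdict(lambda: {"F": [], "R": []})
    for y, g, s in zip(item, group, matching):
        by_level[s]["F" if g == 0 else "R"].append(y)
    n_focal = sum(1 for g in group if g == 0)
    diff = 0.0
    for cell in by_level.values():
        if cell["F"] and cell["R"]:
            w = len(cell["F"]) / n_focal
            diff += w * (statistics.mean(cell["F"]) - statistics.mean(cell["R"]))
    return diff / statistics.pstdev(item)

# Groups behave identically at every matching level, so the index is 0.
item     = [0, 1, 2, 0, 1, 2]
group    = [1, 1, 1, 0, 0, 0]
matching = [3, 4, 5, 3, 4, 5]
print(standardized_mean_difference(item, group, matching))  # → 0.0
```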
Article
Differential item functioning (DIF) occurs when items on a test or questionnaire have different measurement properties for one group of people versus another, irrespective of group-mean differences on the construct. Methods for testing DIF require matching members of different groups on an estimate of the construct. Preferably, the estimate is based on a subset of group-invariant items called designated anchors. In this research, a quick and easy strategy for empirically selecting designated anchors is proposed and evaluated in simulations. Although the proposed rank-based approach is applicable to any method for DIF testing, this article focuses on likelihood-ratio (LR) comparisons between nested two-group item response models. The rank-based strategy frequently identified a group-invariant designated anchor set that produced more accurate LR test results than those using all other items as anchors. Group-invariant anchors were more difficult to identify as the percentage of differentially functioning items increased. Advice for practitioners is offered.
Article
Type I error rates of Lord's χ2 test for differential item functioning were investigated using Monte Carlo simulations. Two- and three-parameter item response theory (IRT) models were used to generate 50-item tests for samples of 250 and 1,000 simulated examinees. Item parameters were estimated using two algorithms (marginal maximum likelihood estimation and marginal Bayesian estimation) for three IRT models (the three-parameter model, the three-parameter model with a fixed guessing parameter, and the two-parameter model). Proportions of significant χ2s at selected nominal α levels were compared to those from joint maximum likelihood estimation as reported by McLaughlin & Drasgow (1987). Type I error rates for the three-parameter model consistently exceeded theoretically expected values. Results for the three-parameter model with a fixed guessing parameter and for the two-parameter model were consistently lower than expected values at the α levels in this study. Index terms: differential item functioning, item response theory, Lord's χ2.
Article
Various methods for determining unidimensionality are reviewed and the rationale of these methods is assessed. Indices based on answer patterns, reliability, components and factor analysis, and latent traits are reviewed. It is shown that many of the indices lack a rationale, and that many are adjustments of a previous index to take into account some criticisms of it. After reviewing many indices, it is suggested that those based on the size of residuals after fitting a two- or three-parameter latent trait model may be the most useful for detecting unidimensionality. An attempt is made to clarify the term unidimensional, and it is shown how it differs from other terms often used interchangeably, such as reliability, internal consistency, and homogeneity. Reliability is defined as the ratio of true score variance to observed score variance. Internal consistency denotes a group of methods that are intended to estimate reliability, are based on the variances and covariances of test items, and depend on only one administration of a test. Homogeneity seems to refer more specifically to the similarity of the item correlations, but the term is often used as a synonym for unidimensionality. The usefulness of the terms internal consistency and homogeneity is questioned. Unidimensionality is defined as the existence of one latent trait underlying the data.
Article
This book provides information on educational and psychological testing and the use and interpretation of test scores. This 6th edition contains updated content and references, as well as an expanded glossary of testing terms. Additional attention has been paid to ethical issues in test usage. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
A logistic regression model for characterizing differential item functioning (DIF) between two groups is presented. A distinction is drawn between uniform and nonuniform DIF in terms of the parameters of the model. A statistic for testing the hypothesis of no DIF is developed. Through simulation studies, it is shown that the logistic regression procedure is more powerful than the Mantel-Haenszel procedure for detecting nonuniform DIF and as powerful in detecting uniform DIF.
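The model described in this abstract can be sketched directly: item correctness is regressed on the matching score, group membership, and their interaction; a clearly nonzero group coefficient indicates uniform DIF, while a nonzero interaction indicates nonuniform DIF. The hand-rolled gradient fit and the simulated data below are assumptions for a self-contained illustration, not the procedure from the article.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=5000):
    """Plain gradient-ascent logistic regression (a sketch, not a
    production estimator)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    return w

# DIF design matrix: intercept, matching score, group, score-by-group.
# A non-zero group coefficient signals uniform DIF; a non-zero
# interaction coefficient signals nonuniform DIF.
rng = np.random.default_rng(1)
n = 2000
score = rng.normal(size=n)
group = rng.integers(0, 2, size=n).astype(float)
# Simulated uniform DIF: the item is easier for group 1 at every level.
logit = 1.2 * score + 0.8 * group
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)
X = np.column_stack([np.ones(n), score, group, score * group])
w = fit_logistic(X, y)
print(np.round(w, 2))  # group coefficient near 0.8, interaction near 0
```

In practice the hypothesis of no DIF is tested by comparing nested models (score only; score plus group; score plus group plus interaction) with likelihood-ratio statistics rather than by inspecting coefficients.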
Article
1965 ed. published under title: "Measuring educational achievement". Incl. bibl., index.
Article
We empirically test existing theories on the provision of public goods, in particular air quality, using data on sulfur dioxide (SO2) concentrations from the Global Environment Monitoring Projects for 107 cities in 42 countries from 1971 to 1996. The results are as follows: First, we provide additional support for the claim that the degree of democracy has an independent positive effect on air quality. Second, we find that among democracies, presidential systems are more conducive to air quality than parliamentary ones. Third, in testing competing claims about the effect of interest groups on public goods provision in democracies we establish that labor union strength contributes to lower environmental quality, whereas the strength of green parties has the opposite effect.
Psychological measurement: Theory and application
  • S Abdul-Rahman
Abdul-Rahman, S. (1998). Psychological measurement: Theory and application. Kuwait: Alfalah Library.
Contemporary developments in psychological measurement
  • S M Alam
Alam, S. M. (1986). Contemporary developments in psychological measurement. Kuwait University.
Educational and psychological measurement and evaluation: Fundamentals, applications, and contemporary approaches
  • S M Alam
Alam, S. M. (2000). Educational and psychological measurement and evaluation: Fundamentals, applications, and contemporary approaches (1st ed.). Cairo: Dar Alfikr Alarabi.
Models of unidimensional and multidimensional item response and their applications in educational and psychological measurement
  • S M Alam
Alam, S. M. (2005). Models of unidimensional and multidimensional item response and their applications in educational and psychological measurement. Cairo: Dar Alfikr Alarabi.
Comparison of four methods for detecting differential item functioning
  • N Allabadi
Allabadi, N. (2008). Comparison of four methods for detecting differential item functioning. (PhD dissertation), Yarmouk University, Jordan.
The relationship between test anxiety and test wiseness among a sample of secondary school students in Allaith educational province
  • Dh B A Almaliki
Almaliki, Dh. B. A. (2010). The relationship between test anxiety and test wiseness among a sample of secondary school students in Allaith educational province. [Unpublished master's thesis], College of Education, Umm Alqura University.
Principles of measurement and assessment in education
  • Z M Alzaher
  • T Jacqueline
  • A A Judat
Alzaher, Z. M., Jacqueline, T., & Judat, A. A. (1999). Principles of measurement and assessment in education. Amman: Dar Althaqafa for Publishing and Distribution.
Psychometric characteristics of the test wiseness scale among university students in the Saudi environment
  • M R Alzahrani
Alzahrani, M. R. (2015). Psychometric characteristics of the test wiseness scale among university students in the Saudi environment. Scientific Journal of the College of Education, 3(4), 217-266.
Psychological testing (4th ed.)
  • A Anastasi
Anastasi, A. (1976). Psychological testing (4th ed.). New York: Macmillan, p. 206.
Measurement and evaluation in the teaching process (2nd ed.)
  • A Awda
Awda, A. (1998). Measurement and evaluation in the teaching process (2nd ed.). Irbid: Dar Alamal for Publishing and Distribution.
Methods for identifying biased test items
  • G Camilli
  • L Shepard
Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage Publications.
Teaching test presentation strategies. Education Magazine, Qatar National Committee for Education
  • H Dawood
Dawood, H. (2005). Teaching test presentation strategies. Education Magazine, Qatar National Committee for Education, Culture and Science, 34(125), 102.
Statistical test theory for education and psychology
  • D N M DeGruijter
  • L J Th Van der Kamp
DeGruijter, D. N. M., & Van der Kamp, L. J. Th. (2005). Statistical test theory for education and psychology.
Two approaches to psychometric process: Classical test theory and item response theory
  • M Erguven
Erguven, M. (2014). Two approaches to psychometric process: Classical test theory and item response theory. Journal of Education, 26.
Applied psychometrics using SPSS and AMOS
  • H W Finch
  • J C Immekus
  • B F French
Finch, H. W., Immekus, J. C., & French, B. F. (2016). Applied psychometrics using SPSS and AMOS. Information Age Publishing, Inc.
The relationship of test wiseness to achievement test performance with a multiple-choice test constructed according to the Rasch model among female college of education students in the literary departments at Umm Alqura university
  • D Hamad
Hamad, D. F. A. (2010). The relationship of test wiseness to achievement test performance with a multiple-choice test constructed according to the Rasch model among female college of education students in the literary departments at Umm Alqura university. Journal of Arab Studies in Education and Psychology, 4(4), Umm Alqura University, 297-338.
Constructing a scale of attitudes towards cyberbullying among a sample of social media users at Al albayt university
  • I M Hamadneh
  • M S Bani Kh
Hamadneh, I. M., & Bani Kh. M. S. (2013). Constructing a scale of attitudes towards cyberbullying among a sample of social media users at Al albayt university. Almanarah Journal, 19(3).
Measurement and evaluation in physical education
  • M S Hassanein
Hassanein, M. S. (2001). Measurement and evaluation in physical education. (Part 1), (4 th ed.). Cairo: Arab Thought House.
Dress norms in nadhriyyah discourse: A sociolinguistic analysis of the modesty of Saudi women's clothing. (Master's thesis)
  • A Kaafarani
Kaafarani, A. (1988). Dress norms in nadhriyyah discourse: A sociolinguistic analysis of the modesty of Saudi women's clothing. (Master's thesis), Sultan Qaboos University, Oman.
Test item bias. Beverly Hills
  • S Osterlind
Osterlind, S. (1983). Test item bias. Beverly Hills; Sage publications.
Test wiseness and its relationship with attentional control among graduate students
  • A A Saleh
  • M A Obaid
Saleh, A. A., & Obaid, M. A. (2020). Test wiseness and its relationship with attentional control among graduate students. Alqadisiyah Journal for Humanities, 23(4), 121-148.
Differential item functioning detection in reading comprehension test using mantel-haenszel, item response theory, and logical data analysis
  • T M Salubayba
Salubayba, T. M. (2013). Differential item functioning detection in reading comprehension test using Mantel-Haenszel, item response theory, and logical data analysis. The International Journal of Social Sciences, 14(1), 76-82.
Exploratory and confirmatory factor analysis: Concepts and methodology using SPSS and LISREL. Amman: Dar Almaseera for Publishing, Distribution, and Printing
  • A B Tigza
Tigza, A. B. (2012). Exploratory and confirmatory factor analysis: Concepts and methodology using SPSS and LISREL. Amman: Dar Almaseera for Publishing, Distribution, and Printing, p. 90.
Gender-related differential item functioning in mathematics assessment on the third international mathematics and science study-repeat (TIMSS-R). The University of Toledo
  • S Yan
Yan, S. (2005). Gender-related differential item functioning in mathematics assessment on the Third International Mathematics and Science Study-Repeat (TIMSS-R). The University of Toledo, ProQuest Dissertations Publishing, 3177610.
Identifying differential item functioning of the "EMBU" test of parental rearing styles among a sample of secondary school students
  • A M Zakri
Zakri, A. M. (2020). Identifying differential item functioning of the "EMBU" test of parental rearing styles among a sample of secondary school students. Journal of the Faculty of Education, Al-Azhar University, 39(3), 677-720.