Article

Introduction to Educational and Psychological Measurement Using R

Abstract

This book provides an introduction to the theory and application of measurement in education and psychology. Topics include test development, item writing, item analysis, reliability, dimensionality, and item response theory. These topics come together in overviews of validity and, finally, test evaluation. Validity and test evaluation are based on both qualitative and quantitative analysis of the properties of a measure. This book addresses the qualitative side using a simple argument-based approach. The quantitative side is addressed using descriptive and inferential statistical analyses, all of which are presented and visualized within the statistical environment R (R Core Team 2017). The intended audience for this book includes advanced undergraduate and graduate students, practitioners, researchers, and educators. Knowledge of R is not a prerequisite to using this book. However, familiarity with data analysis and introductory statistics concepts, especially ones used in the social sciences, is recommended.

... According to Ferketich, as cited by Albano (2018), in the best-case scenario the initial pool of candidate test items should be at least twice as large as the final number of items needed. With the 45 items generated in the Kinematics test, the 110 respondents were satisfied, and therefore, can be considered suitable for item analysis. ...
Article
Kinematics, a fundamental structure in Mechanics, is a critical concept that students need to understand for a more complex analysis of subsequent topics in Physics. One way to determine the effectiveness of Physics teachers in teaching at these trying times is to measure the conceptual understanding of Grade 12 Senior High School (SHS) students in the Science, Technology, Engineering, and Mathematics (STEM) track. With the goal of establishing a valid and reliable test questionnaire in Kinematics that can be administered either in a paper-and-pencil approach (asynchronous learning) or an online approach (synchronous learning), this study focused on the development and validation process of a 45-item conceptual test in Kinematics. Adhering to the Most Essential Learning Competencies (MELC) set by the Department of Education (DEPED), the initial pool of items was pilot tested using a Google form with 110 SHS students after the items had undergone face and content validation by a panel of experts. Furthermore, Classical Item Analysis, calculating the difficulty and discrimination indices, was carried out to establish test validity. Reliability analysis was also conducted using Cronbach’s Alpha (α = 0.758) and the Kuder-Richardson formula (KR-20 = 0.761), which resulted in a deletion of 15 items. In general, this Physics concept test in Kinematics showed an acceptable standard of measurement for classroom use and can be utilized by teachers as a diagnostic, formative, or summative test.
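The classical item analysis and reliability statistics named in this abstract can be sketched as follows. This is a minimal illustration, not the study's actual procedure: the function names, the toy response matrix, and the 27% upper/lower grouping rule (a common convention) are assumptions made here, and population variances are used throughout, under which KR-20 coincides with Cronbach's alpha for items scored 0/1.

```python
from statistics import mean, pvariance


def item_difficulty(responses):
    """Proportion of correct answers per item (rows = respondents, 0/1 scores)."""
    n_items = len(responses[0])
    return [mean(row[j] for row in responses) for j in range(n_items)]


def discrimination_index(responses, fraction=0.27):
    """Upper-group minus lower-group proportion correct, per item.

    The 27% grouping fraction is an assumed convention, not taken from the study.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    n_group = max(1, round(fraction * len(ranked)))
    upper, lower = ranked[:n_group], ranked[-n_group:]
    n_items = len(responses[0])
    return [
        mean(row[j] for row in upper) - mean(row[j] for row in lower)
        for j in range(n_items)
    ]


def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomous items (population variances).

    Undefined when every respondent has the same total score (zero variance).
    """
    k = len(responses[0])
    p = item_difficulty(responses)
    total_variance = pvariance([sum(row) for row in responses])
    item_variance_sum = sum(pj * (1 - pj) for pj in p)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Made-up toy data: 5 respondents, 4 items.
toy_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
```

Calling `kr20(toy_responses)` and `item_difficulty(toy_responses)` on the toy matrix shows how each statistic is derived from the same table of scored responses.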
... The comparison between eigenvalue to the number of items, and can identify the factors that found the most variability in scores. We used EFA as the first step of validating the structure to examine the relevance factor of subjective well-being and to get the initial number of factors according to the analysis of the current data (Albano, 2018). There is no index in EFA to reveal which factor is better, which is the main restriction that would be the part of the next step of the analysis. ...
Article
Bringing up a child with disabilities has particular challenges and demands, as these children's disabilities may affect their caregivers' well-being. In most studies, caregivers exhibited high scores of negative emotions that led to low subjective well-being. Efforts to improve caregivers' well-being have been carried out, and one of the ways is through subjective well-being research. Diener et al. (2009) define Subjective Well-Being (SWB) as a person’s evaluation of their life events in terms of cognitive and affective aspects. The higher the rating score on these aspects, the higher the person's level of SWB. The aspects of SWB can be measured well if the instrument has good psychometric properties. The validity of the instruments is crucial for producing good-quality research. In the present study, we examined the construct validity of SWB using the Satisfaction with Life Scale (SWLS) and the Scale of Positive and Negative Experience (SPANE). The data were collected from 209 parents who had children with intellectual disability in Tangerang and Jakarta. The construct was validated by Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) in the software R version 3.6.2. The EFA results showed that the construct consisted of four factors: one for the cognitive aspect, one for positive affect, and two for negative affect. The CFA results further demonstrated that this model fitted the empirical data.
... This theory also estimates the individual's marks/grades in these traits. A group of models has developed from this theory, known as the latent traits model (Albano, 2017;Steel & Klingsieck, 2016;Zanon et al., 2016). ...
Article
This study aims to compare the effect of test length on the degree of ability parameter estimation in the two-parameter and three-parameter logistic models, using the Bayesian method of expected prior mode and maximum likelihood. The experimental approach is followed, using the Monte Carlo method of simulation. The study population consists of all subjects with the specified ability level. The study includes random samples of subjects and of items. Results reveal that estimation accuracy of the ability parameter in the two-parameter logistic model according to the maximum likelihood method and the Bayesian method increases with the increase in the number of test items. Results also show that with long and average length tests, the effectiveness is related to the maximum likelihood method and to all conditions of the sample size, whereas in short tests, the Bayesian method of prior mode outperformed in all conditions. Results indicate that the increase of the ability parameter in the three-parameter logistic model increases with the increase of test items number. The Bayesian method outperforms with respect to the accuracy of estimation at all conditions of the sample size, whereas in long tests the maximum likelihood method outperforms at all different conditions. Received: 17 September 2021 / Accepted: 24 November 2021 / Published: 3 January 2022
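For readers unfamiliar with the two models compared in this abstract, the sketch below shows the standard two- and three-parameter logistic item response functions and a simple grid-search maximum likelihood estimate of ability. It is an illustration only, not the study's simulation code: the function names, the grid from -4 to 4, and the omission of the optional scaling constant (often written D = 1.7) are assumptions made here, and the Bayesian estimator is not shown.

```python
import math


def p_correct(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PL model.

    a = discrimination, b = difficulty, c = pseudo-guessing.
    With c = 0 this reduces to the 2PL model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern given item parameters (a, b, c)."""
    ll = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p_correct(theta, a, b, c)
        ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return ll


def ml_ability(items, responses, grid=None):
    """Grid-search maximum likelihood estimate of ability.

    The bounded grid is an assumption; for all-correct or all-incorrect
    patterns the unbounded estimate does not exist, so the result simply
    lands on a grid endpoint.
    """
    if grid is None:
        grid = [g / 100.0 for g in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))
```

With more items in `items`, the likelihood becomes more sharply peaked around the maximizing value, which is one way to see why estimation accuracy tends to improve with test length.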
... The course duration in this study is short, with only 16 weeks; a more prolonged period can help reduce random errors. Another limitation is the sample size, with 83 students; we believe that a larger sample size could increase the generalizability of the result (nevertheless, our sample size meets the minimum requirement suggested by [2]). We will expand the study in the following semesters for future work, and we expect that researchers can conduct related studies to confirm our findings. ...
Chapter
Having a good Human-Computer Interaction (HCI) design is challenging. Previous works have contributed significantly to fostering HCI, including design principles with reported studies from the instructor's view. The questions of how and to what extent students perceive the design principles are still left open. To answer this question, this paper conducts a study of HCI adoption in the classroom. The studio-based learning method is adapted to teach 83 graduate and undergraduate students over 16 weeks with four activities. A standalone presentation tool for instant online peer feedback during the presentation session is developed to help students justify and critique others' work. Our tool provides a sandbox, which supports multiple application types, including Web-applications, Object Detection, Web-based Virtual Reality (VR), and Augmented Reality (AR). After presenting one assignment and two projects, our results show that students acquired a better understanding of the Golden Rules principle over time, which is demonstrated by the development of visual interface design. The Wordcloud reveals the primary focus was on the user interface and sheds light on students’ interest in user experience. The inter-rater score indicates the agreement among students that they have the same level of understanding of the principles. The results show a high level of guideline compliance with HCI principles, in which we witness variations in visual cognitive styles. Regardless of diversity in visual preference, the students present high consistency and a similar perspective on adopting HCI design principles. The results also elicit suggestions for the development of the HCI curriculum in the future.
... The course duration in this study is short, with only 16 weeks; a more prolonged period can help reduce random errors. Another limitation is the sample size, with 83 students; we believe that a larger sample size could increase the generalizability of the result (nevertheless, our sample size meets the minimum requirement suggested by [2]). We will expand the study in the following semesters for future work, and we expect that researchers can conduct related studies to confirm our findings. ...
Preprint
Having a good Human-Computer Interaction (HCI) design is challenging. Previous works have contributed significantly to fostering HCI, including design principles with reported studies from the instructor's view. The questions of how and to what extent students perceive the design principles are still left open. To answer this question, this paper conducts a study of HCI adoption in the classroom. The studio-based learning method was adapted to teach 83 graduate and undergraduate students over 16 weeks with four activities. A standalone presentation tool for instant online peer feedback during the presentation session was developed to help students justify and critique others' work. Our tool provides a sandbox, which supports multiple application types, including Web-applications, Object Detection, Web-based Virtual Reality (VR), and Augmented Reality (AR). After presenting one assignment and two projects, our results showed that students acquired a better understanding of the Golden Rules principle over time, which was demonstrated by the development of visual interface design. The Wordcloud revealed the primary focus was on the user interface and shed some light on students' interest in user experience. The inter-rater score indicates the agreement among students that they have the same level of understanding of the principles. The results show a high level of guideline compliance with HCI principles, in which we witnessed variations in visual cognitive styles. Regardless of diversity in visual preference, the students presented high consistency and a similar perspective on adopting HCI design principles. The results also elicited suggestions for the development of the HCI curriculum in the future.
... They are critical components of any assessment framework because they reflect the selected SC concept and definition. As indicated by Albano (2017), scale systems are categorized in four classes that range, in terms of the conveyed information, from very general to specific scale values. These classes are: (1) nominal (descriptive names), (2) ordinal (ranking without meaningful intervals), (3) interval (meaningful intervals with relative benchmark), and (4) ratio (meaningful intervals with absolute benchmark) scaling systems. ...
Conference Paper
As a response to the challenges of population and urban growth, the concept of smart city/community (SC) promises more intelligent, sustainable, and resilient communities that provide better services and quality of life. However, the SC as an ecosystem is an evolving concept; hence, there is no universally-shared definition or assessment tool. Additionally, each municipality worldwide has its own unique characteristics, challenges, and opportunities. Therefore, any SC definition and assessment method should be adopted or developed specifically for each city and agreed participatively by the SC initiative leaders. In terms of an SC assessment, most of the available tools are based on evaluating the performance of urban systems. Hence, the developed indicators are mainly used for ranking or comparison purposes. However, these performance and ranking indicators face many challenges due to the broad, multidisciplinary, and rapidly evolving and changing nature of SCs. For instance, due to the rapid technological evolution of SCs, some of the currently-accepted performance indicators will be obsolete in just a few years. Therefore, our research attempts to adapt a generic SC definition with three dimensions. These dimensions include the "connectivity" that can be achieved through intelligent technologies, "sustainability" in terms of long-term viable performance, and "resiliency" in terms of preventive and proactive considerations. Based on these dimensions, a maturity-based scale that is compatible with the evolving nature of SC is proposed for SC maturity assessment. The significance of the research outcome is that it will help the public and managers of the municipalities focus on advancing city maturity which is essential for continuously improving citizens' well-being.
... Thus, we estimate two temporal phase annotations (upper and lower face) per video. We evaluate accuracy with a standard annotation evaluation metric [1], namely Pearson's correlation. Specifically, we correlate the estimated upper (lower) face annotation with the temporal phase annotations of all the upper (lower) facial action units (AUs) [13,42], and consider an estimation successful if the correlation is at least 0.85. ...
Finding the largest subset of sequences (i.e., time series) that are correlated above a certain threshold, within large datasets, is of significant interest for computer vision and pattern recognition problems across domains, including behavior analysis, computational biology, neuroscience, and finance. Maximal clique algorithms can be used to solve this problem, but they are not scalable. We present an approximate, but highly efficient and scalable, method that represents the search space as a union of sets called ϵ-expanded clusters, one of which is theoretically guaranteed to contain the largest subset of synchronized sequences. The method finds synchronized sets by fitting a Euclidean ball on ϵ-expanded clusters, using Jung's theorem. We validate the method on data from the three distinct domains of facial behavior analysis, finance, and neuroscience, where we respectively discover the synchrony among pixels of face videos, stock market item prices, and dynamic brain connectivity data. Experiments show that our method produces results comparable to, but up to 300 times faster than, maximal clique algorithms, with speed gains increasing exponentially with the number of input sequences.
... IRT models provide information about item parameters and latent traits of test respondents, helping gain insights and assessments about their performance as well as the items. It is also useful for test development, item analysis, equating, item banking, and computer aided test (CAT) [5]. As a group of statistical models with probabilistic and stochastic procedures, IRT connects the pattern of responses to a group of items to predict a latent trait/ability, and then, converts discrete item responses into the levels or locations of probability estimates which respondents possess underlying the latent trait [6,7]. ...
Article
This paper explores a way to apply Item Response Theory (IRT), one of the popular statistical methodologies in measurement and psychometrics, to evaluate Financial Transmission Rights (FTR) paths in the U.S. electricity market. FTR is an energy derivative product to hedge congestion cost risks inherent in constrained transmission lines. In New England, with about 1200 pricing locations, the theoretical combinations of FTR paths amount to 1.4 million in prevailing flows alone. With capital constraints, it is imperative that FTR market participants build the capability to evaluate FTR paths to bid on. IRT provides a framework of how well tests work, and how individual items work on tests, estimating respondents' latent abilities, and individual item parameters. IRT is utilized to analyze historical electricity data of 2019 for a daily congestion cost of eight customer load zones and one hub in the U.S., New England, for the evaluation of FTR paths. In the analysis, an item represents an FTR path, while item difficulty, item discrimination, and a latent trait variable for the path correspond to the path profitability, risk level, and daily congestion ability, respectively. This paper explores the experimental procedures by which IRT, a psychometric tool, may also be applicable in complex energy markets, providing a consistent and standardized analytical framework to address the issues of selection and prioritization among multiple opportunities. FTR path evaluation is conducted in three steps to determine bid priority paths in FTR auctions: parameter significance tests, ranking on path profitability and risk level, and weighting scores of individual rankings on the two criteria.
... When the two remaining cases were coded by the first and third authors, and combined with the first three, IRR remained strongly positive. The final number of agreements (n = 38) divided by the total possible agreements (n = 45) and multiplied by 100 yielded a final IRR of 84% across all cases (Albano, 2017). Scorers then met to assess IRR and discuss disagreements until discrepancies were resolved. ...
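The percent-agreement calculation described in this excerpt can be reproduced in a few lines. The function name is an assumption made for illustration; the counts (38 agreements out of 45 possible) are the ones quoted above.

```python
def percent_agreement(agreements, total_possible):
    """Inter-rater reliability as simple percent agreement."""
    if total_possible <= 0:
        raise ValueError("total_possible must be positive")
    return agreements / total_possible * 100


# Counts quoted in the excerpt: 38 agreements out of 45 possible.
irr = percent_agreement(38, 45)
```

Here `irr` is about 84.4, which rounds to the 84% reported. Percent agreement does not correct for agreement expected by chance, which is why chance-corrected statistics such as Cohen's kappa are often reported alongside it.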
Article
With transition litigation on the rise in recent years, educators need access to current legal trends in special education. Traditionally, educators have been dependent on researchers and attorneys to report on the implications of legal cases to guide the education and services for students with disabilities. In response to this, the Three Dimensions of FAPE Rubric (FAPE3DR) was created to help educators analyze legal cases in a timely manner. Specifically, the authors applied this rubric to five recent legal cases that were decided in favor of the family or transition-age youth. Findings are reported within the scope of broader transition issues.
Article
CoSAGE Community Advisory and Ethics Committee; Age-related hearing impairment yields many negative outcomes, including alterations in mental health, functional impairments, and decreased social engagement. The purpose of the current study was to examine perceived hearing impairment and its relationship with person-centered outcomes among adults in a rural community setting. A cross-sectional, descriptive correlational design was used. Survey packets of validated instruments were distributed following all weekend services at a rural community church; 72 completed surveys were returned (26% response rate). Descriptive and inferential statistics, including Spearman's rank correlations (rs), were used to address the study aims. Mean age of participants was 54 years (SD = 17 years), 58% were female, and 97% attended church regularly. Thirty-one percent of respondents reported moderate to severe hearing impairment. Perceived hearing impairment was associated with more depressive symptoms (rs = 0.24, p = 0.052), poorer attentional function (rs = -0.29, p = 0.016), and decreased quality of life in the mental health domain (rs = -0.21, p = 0.081). Findings expand evidence supporting the relationship between hearing and person-centered outcomes, including a functional measure of cognition. These results serve as a foundation for the design of a community-driven, church-based hearing health intervention. [Research in Gerontological Nursing, 16(1), 21-32.].
Article
Early identification of language delay is important as it has a serious impact on a child’s life in terms of educational, social, and emotional development. Among the early language screening tools, there are some parent-administered tools; however, they are not culturally appropriate or freely available. This article documents the development and preliminary validation of a quick and easy-to-administer language screening tool for babies from 6 to 18 months of age. Parents of 100 babies ranging in age from 6 to 21 months were included in the study. The babies were classified into five screening levels according to their age. The items of a Screening Test of Early Language Development-Test version (STELD-T) were created and validated through expert opinion. The STELD-T was administered along with the Receptive Expressive Emergent Language Scale (REELS). Internal consistency using the Kuder-Richardson Formula-20 ranged from 0.457 to 0.853 across the five levels, which is acceptable owing to the short tool length and item heterogeneity. Kappa coefficients of 0.459 to 0.875 between the STELD-T and the REELS indicated satisfactory criterion validity. After calculating the percentage of babies with a “refer” result as well as Kappa statistics with three different pass-refer criteria, a pass-refer criterion of 75% seemed to be appropriate for screening. The STELD seems to be a reliable and valid tool to screen language development in babies from 6 months to 18 months of age in urban areas of Maharashtra. Items representing a range of language skills, including pragmatics, make it a unique tool.
Article
The present paper aims to analyze the psychological properties of the fourth-grade mathematics items for Omani and Iranian students using IRT and CDM item analysis. The statistical samples were selected from all Omani and Iranian fourth-grade students who took the TIMSS 2015 mathematics test. The research methodology was secondary analysis. The IRT results showed that no items were difficult for both countries. However, 22 items were identified as very easy for both Oman and Iran. Furthermore, IRT showed that only 5 items were appropriate for both Omani and Iranian students. In addition, the CDM approach identified 9 items as the most difficult for both Omani and Iranian students. Consequently, the CDM analysis better analyzed the psychological properties of the items.
Conference Paper
Virtual 3D conferences are emerging as a substitute for face-to-face communication owing to advances in technology and the COVID-19 pandemic. Current efforts focus on bringing content into 3D virtual space, while delivering it to people with color vision deficiency has not been taken into account. To alleviate this issue, this paper presents a prototype that lets color-blind people have the same experience as people with normal color vision. Our method helps users 1) understand the presented content through adjusted color filtering, such that similar colors can be differentiated by brightness, and 2) distinguish apparently identical colors through color transformation. The proposed prototype is demonstrated through three use cases set up in a virtual meeting room: traffic lights, fruit color differentiation, and graph reading. A pilot study conducted with 29 participants shows that our proposed method can improve color differentiation and accuracy for color-blind
Book
The basic issues in psychometrics for MA students in educational and behavioral sciences are explained. Classical Test Theory and Item Response Theory are the major focus.
Article
Objectives: Emergency department thoracotomy (EDT) is a rare and challenging procedure. Emergency medicine (EM) residents have limited opportunities to perform the procedure in clinical or educational settings. Standardized, reliable, validated checklists do not exist to evaluate procedural competency. The objectives of this project were twofold: 1) to develop a checklist containing the critical actions for performing an EDT that can be used for future procedural skills training and 2) to evaluate the reliability and validity of the checklist for performing EDT. Methods: After a literature review, a preliminary 22-item checklist was developed and disseminated to experts in EM and trauma surgery. A modified Delphi method was used to revise the checklist. To assess usability of the checklist, EM and trauma surgery faculty and residents were evaluated performing an EDT while inter-rater reliability was calculated with Cohen's kappa. A Student's t-test was used to compare the performance of participants who had or had not performed a thoracotomy in clinical practice. Item-total correlation was calculated for each checklist item to determine discriminatory ability. Results: A final 22-item checklist was developed for EDT. The overall inter-rater reliability was strong (κ = 0.84) with individual item agreement ranging from moderate to strong (κ = 0.61 to 1.00). Experts (attending physicians and senior residents) performed well on the checklist, achieving an average score of 80% on the checklist. Participants who had performed EDT in clinical practice performed significantly better than those that had not, achieving an average of 80.7% items completed versus 52.3% (p < 0.05). Seventeen of 22 items had an item-total correlation greater than 0.2. Conclusions: A final 22-item consensus-based checklist was developed for the EDT. Overall inter-rater reliability was strong. This checklist can be used in future studies to serve as a foundation for curriculum development around this important procedure.
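Cohen's kappa, the inter-rater statistic reported in this abstract, corrects observed agreement for the agreement two raters would reach by chance. The sketch below is a minimal two-rater version written for illustration; the function name and the example ratings are made up and are not the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal category proportions.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(
        (counts_a[cat] / n) * (counts_b[cat] / n)
        for cat in set(counts_a) | set(counts_b)
    )
    if p_expected == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up checklist scores (1 = step performed, 0 = step missed).
example_a = [1, 1, 0, 0]
example_b = [1, 1, 0, 1]
```

In the made-up example the raters agree on three of four items, but half of that agreement is expected by chance, so `cohens_kappa(example_a, example_b)` is noticeably lower than the raw 75% agreement.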
Thesis
Teaching computational thinking as early as basic education (Educação Básica) is of utmost importance for preparing students for the challenges of the 21st century. This creates a need to assess the competencies acquired in relation to computational thinking. Assessment through analysis of the code a student creates as the outcome of open-ended activities makes it possible to verify which concepts were effectively applied in the teaching and learning process. Although some isolated approaches already exist, mainly for the Scratch programming language, there is not yet a more comprehensive and systematically validated assessment model. The objective of this work is therefore to systematically develop a generic assessment model, independent of any particular visual programming language (VPL), based on the literature and the state of the art. The model is instantiated through a rubric for assessing programs created with the VPL App Inventor and is implemented by evolving the CodeMaster web tool. The reliability and validity of the model are evaluated through a large-scale evaluation with more than 88 thousand apps developed with App Inventor. The results of the evaluation indicate that the model is valid and reliable. By making the model available, the work is expected to facilitate and reduce the effort required to assess programming activities in the context of computing education in basic education, thereby supporting its broad application in Brazilian schools.