Science topic

# Item Response Theory - Science topic

Explore the latest questions and answers in Item Response Theory, and find Item Response Theory experts.

Questions related to Item Response Theory

Query1)

Can the mirt exploratory factor analysis method be used to establish factor structure in marketing/management research? Most of the studies I have come across relate to educational testing.
My objective is to extract factors to be used in subsequent analysis (Regression/SEM)
My data is comprised of questions like:
Data sample for Rasch Factors
Thinking about your general shopping habits, do you ever:
a. Buy something online
b. Use your cell phone to buy something online
c. Watch product review videos online
RESPONSE CATEGORIES:
Yes = 1
No = 0
Data sample for graded Factors
Thinking about ride-hailing services such as Uber or Lyft, do you think the following statements describe them well?
a. Are less expensive than taking a taxi
c. Use drivers who you would feel safe riding with
d. Save their users time and stress
e. Are more reliable than taking a taxi or public transportation
RESPONSE CATEGORIES:
Yes = 3
Not sure = 2
No = 1
Query2) If we use mirt exploratory factor analysis with the Rasch model for dichotomous items and the graded model for polytomous items, do these models by default use tetrachoric correlations for the Rasch model and polychoric correlations for the graded model?
My objective is to extract factors to be used in subsequent analysis (Regression/SEM)
Note: I am using R for data analysis


Dear all,

I have an English Listening Comprehension test consisting of 50 items taken by about 400 students. I would like to score the test on the TOEFL scale (max 58, I think), and TOEFL is claimed to be scored using IRT (the 3PL model). I am using the mirt package in R to obtain the three item parameters, as I used the 3PL model.

```r
library(readxl)
library(mirt)

TOEFL006 <- read_excel("TOEFL Prediction Mei 2021 Form 006.xlsx")
TOEFL006_LIST <- TOEFL006[, 9:58]

sv <- mirt(TOEFL006_LIST,    # data frame (ordinal only)
           1,                # 1 for unidimensional, 2 for exploratory
           itemtype = '3PL') # models, i.e. Rasch, 1PL, 2PL, and 3PL

sv_coeffs <- coef(sv, simplify = TRUE, IRTpars = TRUE)
sv_coeffs
```

The result is shown below:

| Item | a | b | g | u |
|------|-------|--------|-------|---|
| L1 | 2.198 | 0.165 | 0.198 | 1 |
| L2 | 2.254 | 0.117 | 0.248 | 1 |
| L3 | 2.103 | -0.049 | 0.232 | 1 |
| L4 | 4.663 | 0.293 | 0.248 | 1 |
| L5 | 1.612 | -0.374 | 0.001 | 1 |
| ... | ... | ... | ... | ... |

The problem is that I do not know how to use the parameters above to weight each item. The formula should be like this, right?

Would anyone help me show how I can insert the parameters into the formula in R? Or maybe there are other ways of obtaining students' scores without having to manually weight each item. Your help is much appreciated.

The data can be downloaded here: https://drive.google.com/file/d/1WwwjzgxJRBByCXAjdlNkGNRCtXjMlddW/view?usp=sharing

Thank you very much for your help everyone.
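In case it helps anyone with the same problem: in R, mirt can produce ability estimates directly via `fscores(sv, method = "EAP")`, so manual weighting is not required. As a language-agnostic illustration of what that scoring does, here is a minimal EAP sketch in Python, using the first five items' printed (a, b, g) values; the response patterns and grid settings are only for illustration.

```python
import math

def p3pl(theta, a, b, g):
    """3PL probability of a correct response: g + (1 - g) / (1 + exp(-a*(theta - b)))."""
    return g + (1.0 - g) / (1.0 + math.exp(-a * (theta - b)))

def eap_theta(responses, items, grid_lo=-4.0, grid_hi=4.0, n=161):
    """EAP ability estimate: posterior mean of theta under a standard-normal prior."""
    num = den = 0.0
    for i in range(n):
        theta = grid_lo + i * (grid_hi - grid_lo) / (n - 1)
        like = math.exp(-0.5 * theta * theta)  # N(0,1) prior (unnormalised)
        for x, (a, b, g) in zip(responses, items):
            p = p3pl(theta, a, b, g)
            like *= p if x == 1 else (1.0 - p)
        num += theta * like
        den += like
    return num / den

# First five items from the coef() table above: (a, b, g)
items = [(2.198, 0.165, 0.198), (2.254, 0.117, 0.248),
         (2.103, -0.049, 0.232), (4.663, 0.293, 0.248),
         (1.612, -0.374, 0.001)]

print(eap_theta([1, 1, 1, 1, 1], items))  # all correct -> positive theta
print(eap_theta([0, 0, 0, 0, 0], items))  # all wrong  -> negative theta
```

The resulting thetas can then be rescaled linearly onto whatever reporting scale is needed.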

Among the other Item Response Theory models, is there one that is better than the PCM and accessible for analyzing the psychometric properties of tests? I want to be sure that the PCM is the best option.

I’m preparing a study whose main goal it is to

**explore whether a set of related behaviours are best conceptualized as one-dimensional or two-dimensional.** Traditionally, such questions have been answered using methods such as **exploratory and confirmatory factor analysis** (as well as variations from item response theory). More recently, some researchers have argued that these approaches are somewhat problematic, because they assume that there is a latent factor that “causes” the phenomena of interest. This assumption doesn’t always make sense theoretically. A possible alternative is to use

**network models** (e.g., https://doi.org/10.1080/21642850.2018.1521283, https://doi.org/10.1177/1948550617709827, or https://doi.org/10.3758/s13428-017-0862-1). Ideally, the choice for either of these methods is based on theoretical considerations: If I expect that there is a latent trait that “causes” the behaviours of interest, I choose EFA/CFA (or a similar IRT variation of it). If I assume a strong interaction between these behaviours and that there isn’t any latent cause, I choose a network approach.

**My question:**

What if I start from a relatively uninformed position and do not have such a priori theoretical assumptions?

**Which approach (EFA/CFA vs. network analysis) is better suited when the goal is to explore the dimensionality of the behaviours of interest and to develop theory?** Any arguments and pointers to previous work that provides guidance on this would be highly appreciated.

Hello everyone!

I am currently analysing a questionnaire from a Rasch perspective. Results of the Andersen Likelihood Ratio test (random split) and the Martin-Löf test (median split) turned out to be significant. I know what significant results mean and which assumptions are violated. However, I am not sure about possible reasons for the lack of subgroup invariance and for item heterogeneity. What are some possible causes of significant results?

I hope that someone of you can help me answer this question. Thank you very much already in advance :)

Best regards,

Lili

Based on the published item response datasets LSAT and pcmdat2, the General Total Score and Total Score are applied for numerical (high-stakes) scoring. The Item Response Theory (IRT) 2PL model is also applied for numerical scoring and compared with the General Total Score and Total Score. Pieces of R code for computing the General Total Score are also offered.

______________________________________________________

Please click following link for full essay:

The cross-cultural adaptation of a health status questionnaire or tool for use in a new country, culture, and/or language requires a unique methodology in order to reach equivalence between the original source and target languages. It is now recognized that if measures are to be used across cultures, the items must not only be translated well linguistically, but also be adapted culturally to maintain the content validity of the instrument across different cultures. In this way, we can be more confident that we are describing the impact of a disease or its treatment in a similar manner in multi-national trials or outcome evaluations. The term "cross-cultural adaptation" is used to encompass a process which looks at both language (translation) and cultural adaptation issues in the process of preparing a questionnaire for use in another setting (Hill et al. and Kirwan et al., cited in Beaton et al., 2000).

The process of cross-cultural adaptation strives to produce equivalency based on content. This suggests that other statistical properties, such as internal consistency, validity and reliability, might be retained. However, this is not necessarily the case. For example, if the new culture has a different way of doing a task included within a disability scale that makes it inherently more or less difficult to do relative to other items in the scale, the validity would likely change, particularly in terms of item-level analyses (such as item response theory or Rasch analysis). Further testing should be done on an adapted questionnaire to verify the psychometric properties. What's your opinion?

I am developing a tool and am trying to decide which software package to use. Unfortunately, packages such as RUMM and BILOG are not free through UCL; however, we do have R. Can I use this? Will it impact my analysis?

I am working on a research paper and I want to use PLS-SEM. How can I find out sample size for PLS-SEM with item response theory?

I have a measurement where the indicators influence/give rise to the construct. This makes it formative, however, I don't think I can apply IRT on formative measurement models. Can anyone confirm whether this is the case with an explanation on why?

Most software packages for analyzing polytomous models under the Item Response Theory approach show option characteristic curves as an output. Given that I have the data on those option characteristic curves, how would I calculate the item characteristic curve?
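One way to see the relationship: for a polytomous item, the item (expected score) characteristic curve is the category-score-weighted sum of the option characteristic curves, E[X | θ] = Σ_k k · P_k(θ). A minimal sketch in Python, with made-up category probabilities standing in for the curves a program would output:

```python
def expected_score(option_probs, scores):
    """Collapse option characteristic curves into one expected-score curve.

    option_probs: one probability vector per theta point, each summing to 1
                  across the response categories.
    scores:       numeric score attached to each category (e.g. 0, 1, 2).
    """
    return [sum(s * p for s, p in zip(scores, probs)) for probs in option_probs]

# Hypothetical 3-category item evaluated at three theta points
option_probs = [
    [0.70, 0.25, 0.05],  # low theta: category 0 most likely
    [0.20, 0.60, 0.20],  # middle theta
    [0.05, 0.25, 0.70],  # high theta: category 2 most likely
]
curve = expected_score(option_probs, scores=[0, 1, 2])
print(curve)  # expected score rises with theta
```

Applied to the option-curve data exported by your software, the same weighted sum at each theta point traces out the item characteristic (expected score) curve.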

(1) Total Score is defined as the number of correctly-responded (dichotomous) items;
(2) Sub-score is defined as the Total Score associated with a sub-scale;
(3) Overall Score is defined as the Total Score associated with all testing items;
(4) For Total Score, the Overall Score is the summation of its Sub-scores, which is called Additivity;
(5) For Item Response Theory (IRT) ability (the theta parameter), the relationship between Overall Score and Sub-scores is unavailable.
Comment: (5) implies that IRT has no Additivity. Therefore, with IRT ability, the Sub-scores and Overall Score cannot be available simultaneously. This fact strongly indicates that IRT is not a correct theory for high-stakes scoring, while Total Score in (4) is (although only as a special case).

Dear colleagues,

I have obtained thetas from a 2PL IRT model based on a (12-item) test of financial literacy. I am curious whether I can further use those theta scores (in original or modified form) in a logit regression as one of the independent variables. I cannot seem to find any literature to support or dissuade from doing so.

Hi all,

I'm currently conducting research for which I have the data for every submission within an e-learning platform. The design is such that every exercise can be handed in an unlimited number of times, usually within a deadline of 1 or 2 weeks. Because of this, I've created my own variables, such as the number of attempts it took a student to solve an exercise, the time spent on an exercise from the moment of the first submission, the number of times a student switched exercises without having solved the previous one (indicating it was likely a very hard question), and whether or not the exercise was eventually solved.

I started from the idea of implementing some sort of item response theory model, but because of the nature of the data (users do not all get the same exercises, users are in different courses, etc.) this causes too many (in my opinion) intractable problems, since, for example, I have lots of missing data by design (there are 1000+ exercises in total, but users only complete between 60 and 100 exercises, and different users complete different exercises). This makes maximum likelihood estimation either very imprecise or impossible. Also, I don't want to measure difficulty as whether or not the exercise was solved (dichotomous response) or in how many attempts (polytomous response), but rather as a combination of all those factors.

This is why I ended up with both factor analysis and principal component analysis approaches. Since the three variables I want to combine are, in my opinion, all reasonable proxies of difficulty, this should work, and fitting those models appears to give a reasonable or even good fit, purely based on intuition so far (I have no test data to verify). However, my concern is that within the literature I don't find many examples of a single-factor analysis, and especially not of principal component analyses with just one principal component. This is why I was wondering if anyone could confirm this is actually a viable approach for the problem I'm trying to solve, as well as point me to some literature which did similar things in the past.

So, to summarize: I would like to estimate a latent variable 'difficulty' (because I have no test data) which should be a reasonable approximation of all three variables (attempts, time it took to solve, and switches), and as far as I'm aware both factor analysis and principal components analysis should be able to do that. However, within the literature I fail to find many, if any, examples of similar previous studies (because they both require only one factor or one principal component), so I would like to know whether this is actually a viable approach and, if possible, be pointed towards some literature that did similar things.

Thanks in advance!

I took the opinions of public health specialists about different indicators to create a city health profile. I tried to draw item characteristic curves for each indicator with R software. Please help me with the interpretation of this chart.

The data are the scores for each indicator, i.e., a continuous variable ranging from 1.00 to 5.00 (with decimals).

I'm using a test in which each item could vary from 0 to 6 points. It is a short-term memory test, and each point refers to how many features the subject could remember. Can I use Item Response Theory in this case?

I conducted an opinion survey to select feasible indicators to assess the health profile of a city. I asked the participants to give a score from 1-5 (1 for low, 5 for high) for each indicator on six aspects, viz. importance, specificity, measurability, attainability and time-bound character. That means each respondent gives a score from 1-5 for each character of each indicator. The total score of each indicator is 30. I collected opinions about 60 different indicators.

If I treat feasibility as the latent trait of every indicator, how can I select highly feasible indicators with the help of Item Response Theory analysis? How do I draw the item characteristic curve for each indicator, and how do I select indicators?

Anyone, please help me overcome this hurdle.

Dear all, I am searching for a comparison of CTT with IRT. Unfortunately, I mostly get an outline of both theories, but they are not compared. Furthermore, generally the "old" interpretation of CTT using axioms is used and not the correct interpretation provided by Zimmerman (1975) and Steyer (1989). The only comparison of both theories that I found was in Tenko Raykov's book "Introduction to Psychometric Theory". Does anybody know of any other sources?

Kind regards, Karin

"When in 1940, a committee established by the British Association for the Advancement of Science to consider and report upon the possibility of quantitative estimates of sensory events published its final report (Ferguson et al., 1940) in which its non-psychologist members agreed that psychophysical methods did not constitute scientific measurement, many quantitative psychologists realized that the problem could not be ignored any longer. Once again, the fundamental criticism was that the additivity of psychological attributes had not been displayed and, so, there was no evidence to support the hypothesis that psychophysical methods measured anything. While the argument sustaining this critique was largely framed within N. R. Campbell's (1920, 1928) theory of measurement, it stemmed from essentially the same source as the quantity objection." by Joel Michell

(1) Why was there "no evidence to support the hypothesis that psychophysical methods measured anything" just because "the additivity of psychological attributes had not been displayed"?

(2) Item Response Theory (IRT) has no additivity. Can IRT correctly measure educational testing performance?

If the height of a test information curve (not an item information curve) indicates the discriminability of a test at a given level of a trait, then doesn't it follow that test scores in the tails of this distribution are uninformative (unreliable)? It seems to me that an inevitable conclusion from IRT is that, for most published scales, extreme test scores will be inherently noisy and therefore should not be given prominence in data analysis (e.g., by treating test scores as a continuous variable and using all of the data, including extreme scores), because of the high leverage these data points will have in determining the solutions. At the very least, it seems IRT would compel researchers either to trim their data (e.g., omit the top and bottom 5 or 10% of scores) or, in some cases, to treat the data discretely and perform ANOVAs instead. How does one reconcile the test information curve with the prescription to analyze data as a continuous variable without trimming extreme scores?

Hi everyone,

I am looking for a suitable method to perform a factor analysis of binary data. When going through the literature on factor analysis of binary response data, it appeared to me that both *hierarchical item response theory (HIRT)* and *structural equation modelling (SEM)* could be suitable methods. I was therefore wondering what the main differences between the two are. Can HIRT be considered a special case of SEM? Any literature on that topic is welcome, as well as advice on performing an HIRT analysis (mainly, which software is recommended?).

Thank you!

I have been looking at the Chen and Thissen LD chi-square calculations done in IRTPRO, and I compared them to the mirt package. The values are different: in IRTPRO they subtract the degrees of freedom from the chi-square value and then divide by the square root of 2 × the degrees of freedom.

What I am confused about is this: when I want to assess whether my questions have local dependency, should I look at the chi-square critical table or the z-score table?

Moreover, in the IRTPRO guide they mention that they don't consider values like 2 and 3 to be high, and that values above 10 indicate local dependence.

"Because the standardized LD X2 statistic is only approximately standardized, and is known to be based on a statistic with a long-tailed (X2) distribution, we do not consider values larger than 2 or 3 to be large. Rather, we consider values larger than 10 large, indicating likely LD; values in the range 5-10 lie in a gray area, and may either indicate LD or they may be a result of sparseness in the underlying table of frequencies."

Note: I have 14 questions with 126 responses, my data is dichotomous, and using the LD chi-square test my degrees of freedom will be 1.

- Application of **Rasch analysis** and **IRT** models is becoming increasingly popular for developing and validating patient-reported outcome measures. Rasch analysis is a confirmatory model in which the data have to meet the Rasch model's requirements to form a valid measurement scale, whereas IRT models are exploratory models aiming to describe the variance in the data. Researchers seem to be divided on the preference for one over the other. *What is your opinion about this dilemma?*

Hello,

I am currently trying to work out how to conduct an item analysis of my Likert-scale questionnaire.

The questionnaire consists of 34 questions, which are split between 13 subdomains. I want to determine how the scoring of these subdomains varies between the quartiles of the overall questionnaire score.

I was looking at item response theory, but I am given to understand that this is not appropriate, as Likert scales do not assume that item difficulty varies.

Any guidance is most appreciated!

Is there any possible way?

I understand that if the options point to the same trait, it can be done. For example, a question of the type:

I work better:

(a) individually

(b) with other persons

Either of the two options is valid for the person (helping avoid bias), and, for example, if I'm measuring the trait of teamwork, I may think that a person who selects option (b) will have a higher degree of the teamwork trait. Am I making a mistake in assuming this?

Now, is there any way to do this when the response options point to different traits? I want to be able, based on the data from forced-choice items, to carry out normative analysis (to be able to compare with other subjects).

PS: I'm aware that with ipsative items you can't make comparisons between people; however, if you handle the scoring in a different way, could you do it somehow?

I am currently working on the theoretical framework of my thesis (a new scale for measuring scientific creative thinking), and I am a bit stuck on the theoretical method for creating a new scale: the design stage, Classical Test Theory vs. Item Response Theory...

So any recommendation on the theoretical aspects of these issues would be very welcome.

Thanks in advance

Good evening, everyone,

I would like to conduct an *Item Response Theory* Differential Item Functioning (DIF) analysis. I have two groups that answered a test. This test has a bifactor structure with dichotomous responses.

What is, in your opinion, the most appropriate technique (and software) for conducting this kind of DIF analysis?

Thank you,

José Ángel

I'm exploring the concept that scales can be ordered and that certain items should carry more weight in a scale. I came across Guttman scalograms and Mokken scaling techniques. Creating the initial Mokken scale makes sense. What I don't get is this: after I get my H coefficients and run the AISP with Mokken, how do I evaluate the data in a meaningful way?

If I use a Mokken analysis on a 10-item Likert survey, it wouldn't make sense to take an overall composite mean score to represent the latent ability, since I established that items are ordered by difficulty. Do the H coefficients determine item weights? How can I sum a participant's score on my newly created Mokken scale?

I am interested in test theory.

Now I am studying Classical Test Theory and Item Response Theory.

I understand the fundamentals of item response theory (e.g., how to estimate models using R, how to think about equating, etc.).

So, what is the current hot topic in the field of test theory?

In my opinion, there has been a lot of research on Cognitive Diagnosis Models recently.

Please tell us your opinion.

(I used a google translation so I may have made mistakes.)

We have data for a dichotomous numerical ability test that we would like to refine.

We are looking for someone with experience in Rasch modeling who can collaborate on this small project. We see this as a learning process, so it would not simply be a matter of doing the analyses.

Suppose that the phenomenon we study occurs in about 3% of the population, and we made a two-item screening instrument (with dichotomous items). If we think in terms of Rasch/IRT, what is the optimal item difficulty for these two items? And if it's contingent on their covariance, how do we test it?
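One way to frame the first part: a Rasch item is most informative where P(θ) = 0.5, i.e., at θ = b, so for a cut at the top 3% of a normally distributed latent trait both items would ideally sit near the corresponding cut-point (about 1.88 SD). A sketch, treating the normal-trait assumption and the candidate difficulties as illustrative only; local dependence between the two items would shift this and is not modeled here:

```python
from statistics import NormalDist
import math

prevalence = 0.03
cut = NormalDist().inv_cdf(1.0 - prevalence)  # latent cut-point, ~1.88 SD
print(round(cut, 2))

def rasch_info(theta, b):
    """Rasch item information: P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

# Information at the cut is maximised when item difficulty b equals the cut
candidates = [0.0, 1.0, cut, 3.0]
best = max(candidates, key=lambda b: rasch_info(cut, b))
print(best == cut)
```

Testing the dependence question empirically would then mean comparing a model with an item covariance (or testlet) term against the locally independent one.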

To be more specific, this question does not concern central tendency bias/error -- where respondents are inclined towards centralized responses -- but the concept that mean values expressed on Likert-type response data tend to be centralized due to the issue of using truncated variables (e.g. 5 or 7 points on a finite scale with no continuum).

For example, if you administer a 5-point scale to two respondents, the possible number of combinations for them arriving at a combined mean score of 1 or 5 is **one**. However, the possible number of combinations for them arriving at a combined mean score of 3 is **five**. Obviously, if you increase the respondents to three, four, five, etc., the possible number of combinations to reach a combined mean score of 3 **grows exponentially;** a plethora of combinations is possible with even a few dozen respondents. Yet the possible number of combinations for arriving at a mean score of 1 or 5 **remains stagnant at one**. How do you approach this dilemma when analyzing data? How can you associate a degree of 2 or 4 with more "oneness" and "fiveness" respectively, to account for the central tendency of respondents?

**Forced distribution** seems feasible, but the practice of imposing a hypothetical normal distribution curve on data seems to me a sub-optimal and outdated practice.

Keyword searches into this problem have brought up concepts like

**entropy** or **ordinal regression**, but I am not sure if they address the issue (or perhaps they do, but their application simply goes over my head). Many thanks for reading. This question is attempting to 'fix' the dilemma of differentiating centralized mean values (e.g., 2.3/5 and 2.8/5) to account for the aforementioned issue of centralization when assessing their differences (e.g., 2.8 - 2.3 = 0.5), so that "lower" or "higher" values (e.g., 2.3) can be interpreted as "closer" to the end of the scale (e.g., 1) than to the middle of the scale (e.g., 2 or 3).
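The combinatorial claim above is easy to verify by brute force. A small sketch counting response patterns on a k-point scale that produce a given mean (the respondent counts and scale points are just examples):

```python
from itertools import product

def n_combinations(n_respondents, scale_points, target_mean):
    """Count response patterns on a 1..k scale whose mean equals target_mean."""
    target_sum = target_mean * n_respondents
    return sum(1 for combo in product(range(1, scale_points + 1),
                                      repeat=n_respondents)
               if sum(combo) == target_sum)

# Two respondents on a 5-point scale, as in the example above
print(n_combinations(2, 5, 1))  # only (1, 1)
print(n_combinations(2, 5, 3))  # (1,5), (2,4), (3,3), (4,2), (5,1)
print(n_combinations(3, 5, 3))  # grows quickly with more respondents
```

Counts like these are exactly what makes extreme means "rarer" per pattern than central means, which is the asymmetry the question is probing.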

Is calculating an item's discrimination value through a t-test the only method? Are there other methods? Most importantly, in personality scales, when eliminating items, should one consider only the item discrimination value?

Title: From Total Score to General Total Score

Key Words: Total Score, General Total Score, Item difficulty level, Partial credits, Items interaction or duplication, General Total Score of Shared ability, General Total Score of Unique Ability, General Total Score of total ability/all items, General Total Score of subscale, Relation of Total Score and Sub-scores, General Total Score Decomposition.

Introduction:

Total Score, which is the number of correctly-responded items (testing questions), is a major tool for scoring individual examinees (high-stakes scoring). In practice, although Total Score plays an important role in many settings, it is still not fully qualified for high-stakes scoring, because Total Score is only true under the following three assumptions:

(1) All items (testing questions) are dichotomous items which have two possible values "Right" or "Wrong" (in practice, "1" for "right"; "0" for "wrong"). That is to say that any item must not have partial credit(s), say, an item having possible values: 0,1,2,3,... is not allowed in Total Score.

(2) All items are of the same difficulty level. Therefore, the examinee who correctly responds to 10 difficult items receives the same Total Score as the student who correctly responds to 10 easy items.

(3) All items are jointly independent. Therefore, the Total Score for two (correctly-responded) highly-duplicated or even identical items is the same as the Total Score for two (correctly-responded) less-duplicated or even independent items (a fair rule is that the score for two correctly-responded identical items should be the same as the score for correctly responding to one of them). Also, "all items are jointly independent" implies that the score for the ability shared across different students is always 0, but, in the real world, as we know, this is not true (this issue is related to the ability's decomposition, which will be discussed in more detail).

In this talk, we will introduce a Total Score without the above three assumptions, called the General Total Score. That is to say, the General Total Score must be true whether or not the above three assumptions hold. Therefore, in its scoring, the General Total Score must be able to correctly handle

(a) The partial credits associated with each item;

(b) The different difficulty level associated with each item;

(c) Those interactions among any combinations of the items.

The theory of the General Total Score has been published in "N. Kong. A Mathematical Theory of Ability Measure. Journal of Applied Measurement (2015, Vol. 16, pp. 1-12)". In this paper, we present a simplified version of the General Total Score theory with artificial numerical examples. Readers are welcome to make comments or join the discussion during the presentation. Readers are also strongly encouraged to carry out independent numerical analyses to verify the General Total Score and compare it with results obtained from other approaches, such as Total Score or Item Response Theory.

The full paper "From Total Score to General Total Score" is available at

During my research, I asked several companies to complete a questionnaire to measure their risk maturity. At the end of the questionnaire, each respondent receives a score from 1 to 5 (1 = low maturity, 5 = high maturity).

Now I want to analyze the results of the questionnaire. The data are in the form of a single vector, where each row is the score received by a single participant. On this vector, I want to perform a PCA to find the first 3 PCs.

The idea I want to follow comes from the Arbitrage Pricing Theory (Ross 1976), where the author performs the PCA analysis on the returns of several stocks to understand how many factors influence these returns.

I know that PCA is usually used when I have several dimensions and want to reduce them by finding "factors" as combinations of those dimensions. However, Ross in his paper (attached to this post) uses just the returns to investigate the factors, and I would like to perform a similar analysis on my values to find how many factors can explain the variability of my sample. Even though I know what my objective is, I am not able to get there. Any suggestions?

We have a test with 16 items to measure student achievement. The items comprise multiple-choice, closed-ended and open-ended questions.

We have three different scores on student achievement (0: wrong answer, 1: partial credit, 2: full credit).

Which (preferably free) software would you suggest for the partial credit IRT model analysis to calculate student total scores?
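On the software side, the free R packages TAM, eRm and mirt all fit partial credit models. For readers who want to see what the PCM itself computes, here is a minimal sketch of its category probabilities in Python; the step difficulties are made up for illustration:

```python
import math

def pcm_probs(theta, deltas):
    """Partial credit model category probabilities for one item.

    deltas: step difficulties delta_1..delta_m; categories are 0..m.
    P(X = x) is proportional to exp(sum_{j <= x} (theta - delta_j)).
    """
    logits = [0.0]  # category 0 (empty sum)
    run = 0.0
    for d in deltas:
        run += theta - d
        logits.append(run)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # numerically stabilised
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical item scored 0/1/2 with step difficulties -0.5 and 0.8
print([round(p, 3) for p in pcm_probs(0.0, [-0.5, 0.8])])
```

Person total scores then come from the package's ability estimates (e.g., WLE or EAP) rather than from the raw category sums.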

I am looking to convert raw FIM scores from a sample of patients from their ordinal scale to an interval scale so that I can use summary indices that are currently used in the rehab literature.

A recent paper (http://www.ncbi.nlm.nih.gov/pubmed/24068767) suggests that when using these indices, the raw data should first be transformed to an equal-interval scale via Rasch analysis before applying any of the measures designed to assess rehab efficacy using the FIM, due to the fact that there is no way to establish that one patient's score (e.g., a 5 on UE dressing) is the same as another patient's score of 5 on the same measure, because of inherent patient differences.

I was hoping that there was a method someone may know of that could be used to easily accomplish this score transformation for someone with limited knowledge of Rasch analysis and Item Response theory as it pertains to the FIM.

Which transformation should I apply to an angular variable (i.e. slope of a landslide) in order to normalize it (like arcsin to percentages)?

Even though ANOVA are robust to not normal distributions, I'd like to know if I could adapt my cases without altering them.

what are the implications behind choosing Extraction and Rotation methods while doing EFA?

I have a question about PCA (Principal Components Analysis). I have data with 200 cases about sellers, and each case has 500 variables. One book (Editorial ANUIES, 2007, ISBN: 970704103X, 9789707041035) recommends having 5 cases per variable, so I don't know whether I can apply PCA; perhaps someone can tell me about another method to reduce the number of variables.

My CFA model is an eight-item scale assessing one latent variable (schizophrenia).

Hi everyone,

I have a question regarding the negative values I am obtaining when calculating first-order sensitivity indices.

For a given number of samples (N), the results show the expected values of the first-order indices (all positive and relative importance according to the function defined) when the parameters are normalized between [0,1]. However, using the same number of samples (N), when I leave the parameters in their original range [-0.5,0.5], I obtain negative values for some of the first-order indices. In addition to this, when the parameters are left in the original range [-0.5,0.5], the values of the first-order indices reflect a completely different order of importance of the parameters.

I understand that increasing the number of samples is important, but for a fixed number of samples (N), could you help me to understand why normalizing between [0,1] is so important?

Thanks in advance,

David

I think I can't do it with Excel, and I haven't got SPSS, so suggestions are welcome!

I have an IRT question regarding calibration of items. I am aware that item parameters are invariant to the sample from which they are estimated.

Suppose there are three schools, namely A, B, and C. Students in school A have below-average intelligence (IQ) and no exposure to computerized adaptive testing (CAT); students in school B have moderate IQ and partial familiarity with CAT technology; students in school C are the brightest and are also experienced with CAT.

Suppose we have estimated the item parameters of an item bank (say, for mathematics) using response data from school A (the students with below-average intelligence who are not aware of CAT).

Now suppose I take the item-invariance assumption of IRT to be true and ask the school C students (the bright ones) to take a test on the same item bank, and it turns out they all perform well.

What adjustments should I make to the item bank so that I can compare students from schools A, B, and C using the numerical scores obtained from the test?

Is this test a good way to compare these students?

Would the results be different had I estimated the item parameters using the response data from school C? (The first thought that comes to mind is yes.)

Am I missing something here?

I am open to the discussion.

Thank you in advance.

I have a dataset with a lot of missing values due to the fact that this data was generated by a computer adaptive test.

I want to show unidimensionality of the test, but factor analysis is a problem because of the missing values. The paper below seems to suggest that we can use each person's theta estimate (I'm using IRTPRO to estimate it with the EAP method) to calculate expected scores and impute them, which seems intuitive. Is this a correct approach?

Can I simply proceed with factor analysis after this, or would I need to round the probabilities to 0s and 1s? Are there other methods?

Regards, Dirk
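Not an answer from the paper, but the imputation idea in the question can be made concrete. The sketch below (in Python, since the logic is language-agnostic) assumes a 2PL model; all person names, theta estimates, and item parameters are hypothetical, and both variants (keeping the expected probability vs. rounding to 0/1) are shown, since that is exactly the open question.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1 / (1 + math.exp(-a * (theta - b)))

thetas = {"p1": 0.4, "p2": -1.1}              # EAP estimates (hypothetical)
items = {"i1": (1.0, 0.0), "i2": (1.3, 0.8)}  # (a, b) per item (hypothetical)
observed = {("p1", "i1"): 1}                  # responses actually administered

def impute(rounded=False):
    """Fill CAT-induced gaps with expected scores (optionally rounded)."""
    full = {}
    for p, theta in thetas.items():
        for i, (a, b) in items.items():
            if (p, i) in observed:
                full[(p, i)] = observed[(p, i)]
            else:
                prob = p_correct(theta, a, b)
                full[(p, i)] = round(prob) if rounded else prob
    return full

print(impute(rounded=True))  # ('p1','i1') keeps its observed 1; the imputed cells round to 0 here
```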

Context : Performance test, dichotomous, one-dimensional (at least in theory), at least 3PL (variable pseudo-guessing). Theta is assumed normal. But I'm also interested in answers in general in IRT.

Problem : It seems to me that EFA factor loadings provide clear guidelines to rank/select a subset of items from a pool (with referenced rules of thumb, etc.) when one does not have any prior info/assumption of theta (aka for "all" test-takers).

On the other hand, IRT is, in my opinion, a much more accurate representation of the psychological situation that underlies test situations, but it seems to me that there are a variety (especially in IRT-3PL / 4PL) of parameters to take into account all at once to select items, at least without any prior estimation of theta.

So I'm wondering if you know of any guidelines/packages that can be cited as a clear basis for item selection here (meaning, not eyeballing item response functions). At this stage I'm considering a very non-parsimonious solution: generate all possible item subsets (I'd get a LOT of models, but why not), fit an IRT model to each, compute marginal reliability (and/or information, or why not CFI, RMSEA, etc.) for thetas ranging between -3 SD and +3 SD, and rank the subsets by descending marginal reliability (though I'm afraid this would bias towards subsets with more items, so I might have to weight by item count).

Anyway, you get the idea. Any known referenced procedures/packages?
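The brute-force idea sketched in the question can at least be illustrated. The Python sketch below is not a referenced procedure: it assumes known 2PL parameters (all values hypothetical), scores each subset by a crude marginal reliability, the mean of I/(I+1) over a theta grid, and ranks the subsets.

```python
import itertools
import math

# Illustrative only: hypothetical 2PL (a, b) parameters for four items.
items = {
    "i1": (1.2, -0.5), "i2": (0.8, 0.0), "i3": (1.5, 0.7), "i4": (0.6, 1.2),
}

def info(a, b, theta):
    """Fisher information of a 2PL item at ability theta."""
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a * a * p * (1 - p)

def marginal_reliability(subset, grid=None):
    """Crude marginal reliability: mean of I/(I+1) over a theta grid."""
    grid = grid if grid is not None else [g / 10 for g in range(-30, 31)]
    rels = []
    for theta in grid:
        i = sum(info(*items[name], theta) for name in subset)
        rels.append(i / (i + 1))
    return sum(rels) / len(rels)

# Generate every subset of size >= 2 and rank by marginal reliability.
ranked = sorted(
    (s for r in range(2, len(items) + 1)
     for s in itertools.combinations(items, r)),
    key=marginal_reliability, reverse=True,
)
print(ranked[0])  # as the question fears, the largest subset wins on raw reliability
```

This also makes the bias the question worries about visible: information is additive and non-negative, so the full set dominates every smaller subset pointwise, which is why some per-item weighting would be needed.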

What is the best book explaining these advantages and disadvantages?

So, I've been stuck trying to get item locations from mirt while using the GPCM. I know that eRm usually gives you both item locations and thresholds, but somehow I haven't been able to find where they are in mirt. By using:

coefG <- coef(modG, IRTpars = TRUE, simplify = TRUE)

I've only been able to fetch discriminations and thresholds; apparently the location is hidden somewhere. I need mirt because of the MHRM estimation method, due to some sample issues... Also, mirt seems to work with a different type of object, otherwise I'd be able to list its elements and find things at once. So does anyone have a clue how I can find the item locations among my results, or even calculate and extract them from the thresholds?

Best!
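One possible route, assuming the thresholds reported by coef() are the usual step difficulties: under a common GPCM convention (as in eRm's output), an item's location is simply the mean of its step difficulties, so it can be recovered from what mirt does report. A language-agnostic sketch in Python, with hypothetical b1..b4 values:

```python
# Not mirt-specific: location = mean of the step difficulty (threshold)
# parameters, a common GPCM convention. Threshold values are hypothetical.

def item_location(thresholds):
    """Mean of the step difficulty (threshold) parameters."""
    return sum(thresholds) / len(thresholds)

# Hypothetical thresholds for two items, e.g. as exported from R.
items = {
    "item1": [-1.2, -0.4, 0.3, 1.5],
    "item2": [-0.8, 0.1, 0.9, 2.0],
}

locations = {name: round(item_location(b), 2) for name, b in items.items()}
print(locations)  # {'item1': 0.05, 'item2': 0.55}
```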

I was wondering whether someone has a suggestion on how to create a unidimensional IRT model where item pairs (on symptoms) are conditional so that the one item asks about the presence of a symptom (Yes/No) and the other about the severity of the endorsed item (on a 5-step scale).

Conceivably reporting a symptom but rating the lowest severity could be indistinguishable on the latent trait from the "No" response, but how can I test this?

If I combine the item pairs into items with 6 response options, will a nominal model do the trick? Or, assuming that the symptom and severity items measure the same thing, should I just use all variables in a Graded Response Model, accepting the systematic missingness, and compare the a parameters for the item pairs? In the latter case the dependence isn't modelled in any way.
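For what it's worth, the 6-category recoding mentioned in the question can be sketched as follows (hypothetical data; 0 = symptom absent, 1-5 = endorsed at that severity):

```python
# Collapse a presence item (0 = No, 1 = Yes) and its conditional 1-5
# severity rating into a single 6-category item. None marks the severity
# rating that was skipped when presence == 0.

def combine(presence, severity):
    if presence == 0:
        return 0          # symptom absent
    if severity is None:
        raise ValueError("endorsed symptom needs a severity rating")
    return severity       # 1..5 on the combined scale

pairs = [(0, None), (1, 1), (1, 4)]
print([combine(p, s) for p, s in pairs])  # [0, 1, 4]
```

Whether category 1 of the combined item is then empirically distinguishable from category 0 is exactly what the fitted thresholds (e.g. in a nominal or graded model) would show.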

I am trying to develop a composite indicator of socio-economic class using data on ownership of various assets (such as a car, fridge, etc.). I have total household expenditures, which can be used to assess the strength of the model.

I am confused as to whether to use factor analysis, regression, Rasch models (item response theory), or data envelopment techniques.

Hello,

If a knowledge structure satisfies the conditions of smoothness and consistency, it is called a learning space. I was referring to the KST (R package) function *kstructure_is_wellgraded*; as we know, if a knowledge structure is well-graded then it is also a learning space (according to Theorem 2.2.4 in Learning Spaces). It turns out that when the knowledge structure is big, the above method takes forever to confirm whether the given knowledge structure is a learning space. That is why I wanted to know if there is another way, preferably a fast one, to verify this.

We are planning to apply a specific test to measure problem solving skills of students. Students will be given full or partial credit for the 16 question items they will answer. Are there any practical guidelines on how to score and measure student success with Item Response Theory?

Actually, although the results demonstrate certain limitations of classical test theory and advantages of using IRT, have we surpassed the classical model?

Ideally, such a tool would use a computerized adaptive testing approach.

I am looking for a method that can more precisely screen for a spectrum of behavioral health issues without using an extensive battery of questions.

Such a tool would ideally be used to anonymously screen large numbers of individuals. The validity and reliability would need to have been established so that the tool could stand up to some scrutiny.

The cross-sectional IRT analysis in Stata 14 is quite simple. But what about longitudinal data / repeated measures? I just don't know where to start from. I would really appreciate if somebody can push me into the right direction. If it is not possible with the Stata built-in IRT module then any suggestions what software (except 'R') would be relatively simple to use for the task?

Hello,

I want to experiment with one of the functions from the "DAKS" package in R to create a knowledge map/graph. The package already has a dataset of 5 questions in item-response matrix format (items across the columns, responses along the rows), but I need a bigger dataset with at least 50 questions. Does anyone have such data, or an algorithm to simulate such a matrix in which the responses to some items depend on each other? It need not be the case that all the questions are dependent on each other.

Thank you for any help.
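In case it helps, here is a hedged sketch of one way to simulate such a matrix: responses generated under a Rasch-like model, with an artificial prerequisite structure so that some items depend on earlier ones. All parameter choices (sample size, which items depend on which) are made up for illustration.

```python
import math
import random

def simulate_responses(n_persons=200, n_items=50, seed=42):
    """Binary item-response matrix: rows = respondents, columns = items."""
    rng = random.Random(seed)
    difficulties = [rng.gauss(0, 1) for _ in range(n_items)]
    # Every 5th item is only solvable if the previous item was solved.
    prerequisite = {j: j - 1 for j in range(1, n_items) if j % 5 == 0}
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0, 1)  # latent ability
        row = []
        for j in range(n_items):
            p = 1 / (1 + math.exp(-(theta - difficulties[j])))
            x = 1 if rng.random() < p else 0
            if j in prerequisite and row[prerequisite[j]] == 0:
                x = 0  # dependence: fail whenever the prerequisite failed
            row.append(x)
        data.append(row)
    return data

matrix = simulate_responses()
print(len(matrix), len(matrix[0]))  # 200 50
```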

Hello everyone,

I need to know what properties an item bank should have to support a precise CAT. I am particularly interested in the 1PL model.

Properties I am looking for

1. What should be the optimum size of the item bank?

2. What should be the distribution of item difficulty?

Any references are also welcome.

Thank you in advance.

There is software available for item response theory, but it is very hard for me to understand how it works. Can anyone provide information on this?

Thank you

If a, b, and c parameters have been obtained and published for a normative sample, and I calculate them for a new data set, would it be possible to statistically examine DIF on the basis of these parameters? Or would I need the full raw data for the normative sample?

Is there any classification of the ranges of values for item difficulty and item discrimination in item response theory (IRT) and multidimensional item response theory (MIRT)?

According to Baker in "The basics of item response theory"

item discrimination in IRT is classified as follows:

none: 0
very low: 0.01 - 0.34
low: 0.35 - 0.64
moderate: 0.65 - 1.34
high: 1.35 - 1.69
very high: > 1.70
perfect: +infinity

According to Baker, Hambleton (Fundamentals of Item Response Theory), and Hasmy (Compare Unidimensional and Multidimensional Rasch Model for Test with Multidimensional Construct and Items Local Dependence), item difficulty is classified as follows:

very easy: below -2
easy: -2 to -0.5
medium: -0.5 to 0.5
hard: 0.5 to 2
very hard: above 2

**Could the item discrimination and item difficulty classifications also be used in MIRT?**

Can anybody share information about item information in IRT?

Are there any interpretation ranges for item information values (IIF and TIF)?

I'm looking into the development of an online IRT adaptive test for a specific test instrument. Any pointers to help me get started, such as published research or case studies, would be appreciated. I've come across the Concerto platform but would be interested to know what else is out there.

I want to calculate the AIC of the GPCM. From the Parscale output, I know the value of the -2 log-likelihood (namely G^2, or -2ln(L)), and AIC = 2k - 2ln(L), where k is the number of estimated parameters of the GPCM. My question is: if the scale has 8 items and each item has 5 categories, what is k?

There are many different programs for conducting IRT; some are stand-alone products and some are components within larger statistical programs. What are the *most popular* programs in common use? Additionally, does anyone know of any independent comparisons of the various programs to ensure that they produce identical results?
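For the GPCM question above, under the usual count (one slope plus m - 1 step parameters per item, where m is the number of categories), k works out as sketched below; this is just the arithmetic, not anything Parscale-specific, and the -2ln(L) value used is invented for illustration.

```python
def gpcm_param_count(n_items, n_categories):
    """GPCM: one slope plus (m - 1) step parameters per item."""
    slopes = n_items                      # one discrimination per item
    steps = n_items * (n_categories - 1)  # m - 1 step parameters per item
    return slopes + steps

k = gpcm_param_count(8, 5)
print(k)  # 40

def aic(k, minus2_loglik):
    # AIC = 2k - 2 ln(L); Parscale reports -2 ln(L) directly, so just add.
    return 2 * k + minus2_loglik

print(aic(40, 5000.0))  # 5080.0 for a hypothetical -2 ln(L) of 5000
```

So for 8 items with 5 categories each, k = 8 + 8 x 4 = 40 (one fewer per item if a Rasch-type constraint fixes the slopes).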

I would ideally like to use a single threshold - eg category 0/1 vs 2/3.

Does anyone know of publications that compare IRT-based item information curves or item information functions of questions/test items with different response formats (but equal content)?

Response formats may differ in number of response options, item wording, etc.

I am running an IRT analysis on an instrument in XCalibre, and the analysis reports substantially different means for the items than those calculated in Excel. Is there some weighting happening of which I am unaware?

Point-biserial correlation is used to determine the discrimination index of items in a test. It correlates the dichotomous response on a specific item with the total test score. According to the literature, items with a point-biserial correlation above 0.2 are accepted. According to Crocker (Introduction to Classical and Modern Test Theory, p. 234), the threshold for the point-biserial correlation is 2 standard errors above 0.00, where the standard error can be approximated by 1/sqrt(N), N being the sample size. What is not clear to me is this: in tests we need items with high discrimination (correlation), and since the point-biserial correlation is a special case of the Pearson correlation, accepting 0.2 as a threshold means accepting a coefficient of determination of 0.04, i.e. the total score captures only 4% of the item variance.
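The two numbers in the question above (the 0.2 cutoff and the 4% shared variance) can be checked directly; a small sketch, using the SE approximation quoted from Crocker:

```python
import math

def crocker_threshold(n):
    """Two standard errors above 0.00, with SE approximated as 1/sqrt(N)."""
    return 2 / math.sqrt(n)

print(round(crocker_threshold(100), 3))  # 0.2: the common 0.2 rule matches N = 100
print(round(crocker_threshold(400), 3))  # 0.1: larger samples lower the cutoff

r = 0.2
print(round(r ** 2, 2))  # 0.04: only 4% of item variance shared with the total score
```

So the flat 0.2 rule coincides with Crocker's criterion only at N = 100; with larger samples the criterion admits weaker (and, as the question notes, still very weakly determining) items.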

I'm specifically looking for papers where the expectations are manipulated (e.g. "this test is going to be very difficult" vs "very easy") and ultimately influence results. I'm already aware of all the literature on stereotype-based performance change (e.g. Levy, 1996; Aronson,1999), but I am looking for other kinds of manipulations. Thanks to anybody who'll help me!

Hello, I want to compute the so-called IRT item information function for individual items as well as for the latent variable using Stata. Several methods are available for IRT analysis, such as clogit, gllamm, or raschtest, but so far I could not find any syntax to draw the information function using any one of these methods. So, any help or example is much appreciated.

I am using

**mirt** (an R package) for IRT (item response theory) analysis. The data I am dealing with is sparse (contains missing values). The responses are missing because the test is adaptive: in an adaptive test, not all questions in the item bank are presented to a test-taker, so the responses to the questions that were not presented, and to those he could not attempt, are missing. The "mirt" function in the **mirt** package can calibrate data with missing values (i.e., fit the Rasch, 2PL, and 3PL IRT models). However, when it comes to item-fit analysis (using the "itemfit" function), you cannot supply the sparse data: in this package, item-fit analysis on sparse data requires imputation. I have two questions here:

1. Are there any methods besides imputation for item-fit analysis when you have sparse data?

2. What is the maximum percentage of sparseness in the response data matrix, where you can use imputation method to get reliable results?

Can anybody help to interpret the ICC curve of IRT for polytomous response category?

Does anybody know why partial credit model is better as compared to graded response model?

My experiment consists of three tests: test one has 14 items (126 responses), test two 16 items (89 responses), and test three 14 questions (responses). I started applying several unidimensional and multidimensional item response theory models to find the best fit, and ended up with a MIRT (2PL) model for test one and UIRT (2PL) models for tests two and three. Now I need to show that my models are reliable, i.e. that the items' difficulty and discrimination remain the same regardless of the sample of students. I tried splitting each test into two sub-samples (50% of the sample each), fitting my model to each part, and then correlating the sub-samples with each other and with the whole sample. The results were not that good, and I suspect this is because the sample size is small and the error becomes larger. Is there any other way to assess the reliability of item parameters?

I need to estimate hierarchical IRT polytomous models, with one general ability and v specific sub-abilities.

Specifically, I am interested in the following two models:

1) Each specific ability theta_j is a linear function of the general ability theta_0, that is:

theta_j = Beta_j * theta_0 + error, with j=1,...,v

2) A model in which the general ability is a linear function of the specific abilities, that is:

theta_0 = sum(Beta_j * theta_j) + error, with j=1,...,v

Can you suggest any software (preferably an R package) that may allow this?