Science topic

# IRT - Science topic

Explore the latest questions and answers in IRT, and find IRT experts.

Questions related to IRT

Hi,

I'm trying to list different CAT models. I'm used to working with IRT-based and Rasch-based CATs, but I've never wandered outside IRT/Rasch. Has anyone heard of CATs outside these frameworks, developed for real-life use or only theoretically?

Thanks a lot!

Christian Bourassa

Dear all,

I have an English Listening Comprehension test consisting of 50 items, taken by about 400 students. I would like to score the test on the TOEFL scale (max 58, I think), and it is claimed that TOEFL is scored using IRT (the 3PL model). I am using the mirt package in R to obtain the three parameters, as I used the 3PL model.

```r
library(readxl)
TOEFL006 <- read_excel("TOEFL Prediction Mei 2021 Form 006.xlsx")
TOEFL006_LIST <- TOEFL006[, 9:58]

library(mirt)
sv <- mirt(TOEFL006_LIST,      # data frame (ordinal only)
           1,                  # 1 for unidimensional, 2 for exploratory
           itemtype = '3PL')   # models, i.e. Rasch, 1PL, 2PL, and 3PL
sv_coeffs <- coef(sv,
                  simplify = TRUE,
                  IRTpars = TRUE)
sv_coeffs
```

The result is shown below:

| item | a | b | g | u |
|------|-------|--------|-------|---|
| L1 | 2.198 | 0.165 | 0.198 | 1 |
| L2 | 2.254 | 0.117 | 0.248 | 1 |
| L3 | 2.103 | -0.049 | 0.232 | 1 |
| L4 | 4.663 | 0.293 | 0.248 | 1 |
| L5 | 1.612 | -0.374 | 0.001 | 1 |
| ... | ... | ... | ... | ... |

The problem is that I do not know how to use the parameters above to weight each item. The formula should be like this, right?

Would anyone help me show how I can insert the parameters into the formula in R? Or maybe there are other ways of obtaining students' scores without having to manually weight each item. Your help is much appreciated.
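For what it's worth, the scoring can be done from the estimated parameters directly. Below is a minimal, hypothetical sketch (in Python rather than R, and using only the first five items' estimates from the table above) of how the 3PL model turns the a/b/g parameters into response probabilities and an EAP ability estimate. In R, mirt's `fscores()` returns these ability estimates directly.

```python
import numpy as np

# 3PL parameters for the first five listening items, copied from the
# mirt output above (the real test has 50 items).
a = np.array([2.198, 2.254, 2.103, 4.663, 1.612])   # discrimination
b = np.array([0.165, 0.117, -0.049, 0.293, -0.374])  # difficulty
g = np.array([0.198, 0.248, 0.232, 0.248, 0.001])    # guessing

def p3pl(theta, a, b, g):
    """3PL probability of a correct response at ability theta."""
    return g + (1.0 - g) / (1.0 + np.exp(-a * (theta - b)))

def eap_theta(x, a, b, g, grid=np.linspace(-4, 4, 161)):
    """EAP ability estimate for response pattern x under a N(0,1) prior."""
    prior = np.exp(-0.5 * grid**2)
    like = np.ones_like(grid)
    for xi, ai, bi, gi in zip(x, a, b, g):
        p = p3pl(grid, ai, bi, gi)
        like *= p**xi * (1.0 - p)**(1 - xi)
    post = like * prior
    return float(np.sum(grid * post) / np.sum(post))

print(eap_theta([1, 1, 1, 1, 1], a, b, g))  # all correct -> positive theta
print(eap_theta([0, 0, 0, 0, 0], a, b, g))  # all wrong  -> negative theta
```

A scaled score (e.g. onto a TOEFL-like 0-58 range) would then be a monotone transformation of theta; the actual TOEFL conversion table is proprietary, so that mapping is left open here.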

The data can be downloaded here: https://drive.google.com/file/d/1WwwjzgxJRBByCXAjdlNkGNRCtXjMlddW/view?usp=sharing

Thank you very much for your help everyone.

Looking to measure stress in cats using IRT. One study used a FLIR, but it is quite pricey. Any other good options? I think FLIR is a good brand, but I don't know enough about the specs to determine what I need.

My goal is to test the equivalence of the model between countries. The model is relatively complex. It consists of 35 items, some of which are continuous and some ordinal, divided into 7 dimensions.

I would like to ask, what method would you recommend (e.g. MGCFA, IRT, ESEM, Alignment, LCFA, BSEM)? I would like to test more methods and compare the results. However, each method is suitable for either continuous (e.g. MGCFA) or ordinal (e.g. IRT), and I do not know if it is possible to apply them when I have both types of variables.

Alternatively, whether the solution would be to transform the response scales to be uniform?

Thank you very much for your answers.

Hello

I am using six 500 W halogen lamps to heat the surface of an HDPE plate in order to create the thermal gradient over its thickness required for IR thermography (IRT). I am using IRT to detect subsurface defects. For this purpose, I heated the HDPE plate with the six halogen lamps for 180 s and, after removing the heat source, started monitoring the target. I wanted to use Fourier's equation to calculate the temperature at the location of each subsurface defect. My problem is how to calculate Q in Fourier's equation. (I think I am wrong to calculate Q as 6 * 500 * 180, because that gives unrealistic values for the temperature at the defect location.)
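One likely source of the unrealistic values: in a conduction calculation, Q is usually an absorbed surface heat flux (W/m²), not the total electrical energy 6 × 500 W × 180 s = 540 kJ. A rough bookkeeping sketch, where every efficiency and geometry factor (radiant efficiency, view factor, absorptivity, heated area) is an assumed placeholder value, not measured data:

```python
# Lamp power and heating time come from the question; everything else
# below is an illustrative assumption.
n_lamps = 6
p_electric = 500.0   # W per lamp
t_heat = 180.0       # s, heating duration

eta_radiant = 0.85   # assumed: fraction of electrical power emitted as IR
view_factor = 0.30   # assumed: fraction of emitted radiation reaching the plate
absorptivity = 0.90  # assumed: fraction absorbed by the HDPE surface
area = 0.25          # m^2, assumed heated surface area

# Absorbed surface heat flux (W/m^2) -- the quantity to use in a
# surface-flux boundary condition, not the 540 kJ of electrical energy.
q_flux = n_lamps * p_electric * eta_radiant * view_factor * absorptivity / area

# Total absorbed energy (J), for an energy-balance sanity check.
e_absorbed = q_flux * area * t_heat

print(q_flux)      # W/m^2
print(e_absorbed)  # J, well below the 540 kJ electrical input
```

The temperature at a defect depth then comes from the transient conduction solution with that flux as the boundary condition; the point here is only that Q must be an absorbed flux, not the raw electrical energy.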

I am examining results from an exploratory factor analysis (using Mplus) and it seems like the two-factor solution fits the data better than the one factor solution (per the RMSEA, chi-square LRT, CFI, TLI, and WRMR). Model fit for the one factor model was, in fact, poor (e.g., RMSEA = .10, CFI = .90). In the two factor model, the two latent factors were strongly correlated (.75) and model fit was satisfactory (e.g., RMSEA = .07, CFI = .94). The scree plot, a parallel analysis, and eigenvalue > 1, however, all seem to point to the one-factor model.

I am not sure whether I should retain the one- or two-factor model. I'm also not sure whether I should look at other parameters/model estimates to determine how many factors to retain. Theoretically, both models make sense. I intend to use these models to conduct an IRT analysis (a uni- or multidimensional graded response model, depending on the number of factors I retain).

Thank you in advance!

I am comparing IRT results for 2 samples who took the same exam. I am aware of DIF detection methods (MH, LR, Lord-Wald, etc.), but for my specific purpose I only want to test whether questions discriminate equally (ignoring differences in the difficulty parameter). I have the model's standard errors for each discrimination parameter, so I can build confidence intervals, but I need to know what distribution these parameter estimates follow. I believe they follow the Z (standard normal) distribution, but I was looking for input/confirmation since I can't find conclusive evidence in the literature. Thank you in advance!

Dear all,

I am trying to understand the similarities between conducting CFA on categorical data using WLSMV in Mplus and IRT. I would appreciate any input. Moreover, is there a way to convert Mplus parameters into difficulty and discrimination parameters?

Thanks in advance,

Nikos

I am analysing an instrument of 7 dichotomously scored items using IRT. I calibrated the data using the 2PL; the item fit statistics are good, but the p-value of the M2 statistic is far less than 0.05. In this case, how should I interpret model-data fit? Is the model appropriate for my data or not? Thanks

There is an interesting **Law of the Hammer** (see link), which states the following truism: "*If the only tool you have is a hammer, it is tempting to treat everything as if it were a nail.*" This Law may be at work in some areas of psychometrics, too. If the only cognitive tools you have are **ordinary arithmetic and statistics on the real line** (i.e., all negative and positive numbers), then you may be tempted to treat ALL measures as REAL numbers and apply the usual arithmetic and statistical operations (plus, times, arithmetic mean, etc.) on REAL numbers to get whatever you want, e.g. estimates and predictions.

**The Hammer of IRT/Rasch**

This seems to be the case for **IRT based on a naive concept of scores**. Scores, in this view, are just real numbers, X for short, which can be neatly associated with (cumulative) probabilities via an elegant transformation (conversion) rule:

- P(X) = exp(X) / (1 + exp(X))

Once you have those probabilities, the whole machinery of probability theory and statistics can be invoked for descriptive or testing purposes. Adding one, two or three parameters (which have to be estimated) increases the explanatory power of the approach, even though the responses are still binary 0s and 1s. Generalizing to polytomous and multidimensional IRT etc. is possible within the same framework.

Still, all these developments rest on a highly **questionable, implausible interpretation of X as the genuine measurement** of an underlying (psychological) feature, and of P(X) as its associated probability. There is no empirical evidence or formal argument that such an interpretation is possible, necessary, meaningful or even correct. On the contrary: it is **highly counter-intuitive**. There are no (latent) psychological features or variables like X which are strictly unbounded on either side, i.e. that could become minus infinity or plus infinity. If you happen to know one, please let me know.

**An alternative view avoiding the concept of probability**

The alternative is to view the real number X as a **transformed score** f(S) from the double-bounded interval [A,B], between given A and B, to the real numbers.

- For instance, A could be 1 and B could be n, with equally spaced anchor points 1, 2, 3, ..., n, so that [A,B] is the underlying continuous scale of a discrete n-point ruler or scale for the user/respondent.

Now let S be such a score on the n-point scale from 1 to n, and define X such that:

- exp(X) := (S - A) / (B - S)

Then X is well-defined: it is the (natural) logarithm of (S - A) / (B - S). After a few simple manipulations we get:

- S = (1 - P(X)) * A + P(X) * B, so that P(X) / (1 - P(X)) = (S - A) / (B - S).

In other words, **P(X) is just a weighting factor** required to get the position (location) of the score S on the chosen n-point scale from A to B. Indeed, we could replace P(X) by W(S), i.e. the weight of S on [A,B], to get:

- S = (1 - W(S)) * A + W(S) * B, so that W(S) / (1 - W(S)) = (S - A) / (B - S).

In other words, W(S) = P(X) is just a normalized version of S, given that S is constrained to [A,B].
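These identities are easy to check numerically. A sketch (Python purely for illustration; `quasi_add` implements the quasi-sum S (+) T of this discussion on the unit scale [0,1], where its formula reduces to s*t / (s*t + (1-s)*(1-t))):

```python
import math

A, B = 1.0, 5.0  # a 5-point scale

def to_logit(S, A=A, B=B):
    """X such that exp(X) = (S - A) / (B - S)."""
    return math.log((S - A) / (B - S))

def P(X):
    """Logistic weight P(X) = exp(X) / (1 + exp(X))."""
    return math.exp(X) / (1.0 + math.exp(X))

# The weighting identity: S = (1 - P(X)) * A + P(X) * B.
S = 3.7
W = P(to_logit(S))
assert abs((1 - W) * A + W * B - S) < 1e-12

def quasi_add(s, t):
    """Quasi-sum on [0,1]: s (+) t = s*t / (s*t + (1-s)*(1-t))."""
    return s * t / (s * t + (1 - s) * (1 - t))

# On [0,1] the quasi-sum is exactly addition of logits...
s, t = 0.6, 0.8
assert abs(quasi_add(s, t) - P(to_logit(s, 0.0, 1.0) + to_logit(t, 0.0, 1.0))) < 1e-12
# ...and it is closed on (0,1), never escaping the bounds.
assert 0.0 < quasi_add(0.999, 0.999) < 1.0
print(quasi_add(s, t))
```

This mirrors the claim that P(X) is only a normalized weight: the "probability" machinery reduces to logit bookkeeping on a bounded scale.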

**No probability interpretation is required!** What's more, we can define the **quasi-addition (+) of scores S and T on any scale [A,B]**. It is, in fact, relatively simple:

- S (+) T := (S - A) * (T - A) / [ (S - A) * (T - A) + (B - S) * (B - T) ]

This **quasi-sum** (+) of S and T has (almost) ALL the properties of the usual sum of two real numbers; however, it is closed on the interval between A and B. Defining the quasi-multiplication (*) of a score S by a real number is also quite simple. And once we have quasi-multiplication, the quasi-additive inverse is correctly defined as (-1) (*) S. Together, **Module Theory** provides the right framework for working out the algebra of bounded scales and scores.

**Introducing the interval scale on [A,B]**

The biggest surprise of all: it turns out that **the n-point scale endowed with this quasi-arithmetic addition operation is indeed an interval scale**. The same holds for the standard percentage scale [0,1], so often used in educational contexts, but also in Fuzzy Set Theory and Fuzzy Logic.

**CONCLUSION:**

*Rasch*, one of the founders of IRT, was an original mathematician and statistician, but he was rather focused on classical statistics and measurement theory from the physical sciences, and he probably didn't know about the then-recent developments in what is now known as **quasi-arithmetic calculus** and **module theory**.

Also, in his time, *Stevens* was one of the heroes of a rather naive (= operationalist) scale theory for psychology and other social sciences (see e.g. the critiques of J. Michell, Luce, and other measurement theorists).

I am sure that **Rasch would be delighted** to see that it is indeed possible to do serious measurement (theory) in the social sciences without the counter-intuitive assumptions that he had to make.

________________________________________________________________________

Based on the published item response datasets LSAT and pcmdat2, the General Total Score and the Total Score are applied to numerical (high-stakes) scoring. The Item Response Theory (IRT) 2PL model is also applied to numerical scoring, for comparison with the General Total Score and Total Score results. Pieces of R code for computing the General Total Score are also offered.

______________________________________________________

Please click following link for full essay:

Hi! I would like to know if there is any convention or standard for evaluating how long a national, large-scale educational assessment should stay in use. I've heard that even with an extensive item bank and equating, a test has a kind of "lifespan" (for example, a given test should be used for only 10 years before its psychometric properties degrade substantially). But how is this evaluated or measured, or is there an implicit rule? And is there a keyword for this?

Could you please provide me with a *published* **reference**? Thanks in advance!

I am writing a paper assessing the unidimensionality of multiple-choice mathematics test items. The test was scored right/wrong, which implies the data are dichotomous. Some earlier studies I have consulted used exploratory factor analysis, but with my limited experience in data management I think factor analysis may not work. Unidimensionality is one of the assumptions for dichotomously scored items in IRT. Please, I need professional guidance, if possible including the software and its manual.

I intend to investigate the psychometric properties of a SAT with the 3PLM of IRT. The total population of my study is 16,328, and I was told that Krejcie and Morgan's table cannot be used as a yardstick for determining my sample size since I'm running my analysis with IRT software, and I will thus require a much larger sample.

I used the R package plink to link two scales (polytomous items), and I was able to get transformed item parameters through linking. However, plink does not seem to have a function to obtain the test information function for the transformed item parameters.

1) If I have the item parameters, can I get the GPCM's test information function?

2) If you have implemented this, please let me know how.

Thanks!

I do research in educational assessment and focus on developing assessment instruments (tests). Thanks.

I am using Item Response Theory (IRT) with a 3-Parameter Logistic Model (3PL) for a logic test. After training the model, I use the posterior means of the item parameters α, β and γ to estimate the person trait θ during the adaptive test. I want to introduce covariates, e.g. age, gender, etc., into the model for estimating a person's ability using latent regression, but I am not able to find any research on introducing covariates into an IRT model. Any guidance would be much appreciated.

When I showed my refined scale, after running a CFA and EFA (11 items, 3 factors with 4, 4, and 3 items), to a statistician at UCL, I was told that I had too few items per factor to run IRT.

Does anyone have any references to back this up?

Many thanks,

Leng

I would be thankful if someone could give me a short breakdown of how to test the unidimensionality of polytomous (Likert-scale) data using IRT models in R.

As I understand it, the ltm package is obsolete, so I'm looking for an equivalent of ltm::unidimTest() (perhaps within the mirt package?).

Thank you
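While looking for a package equivalent, note that `ltm::unidimTest()` implements a modified parallel analysis, and the generic idea behind it, comparing observed eigenvalues against eigenvalues from comparable random data, is easy to reproduce by hand. A language-agnostic sketch (Python for illustration, with Pearson correlations; polychoric correlations would be preferable for real Likert items):

```python
import numpy as np

rng = np.random.default_rng(1)

def eigvals_of_corr(X):
    """Descending eigenvalues of the item correlation matrix."""
    return np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

def parallel_check(X, n_sims=200):
    """Compare observed eigenvalues with those of random data of the same
    shape; unidimensionality is plausible when only the first observed
    eigenvalue clearly exceeds its random benchmark."""
    n, p = X.shape
    obs = eigvals_of_corr(X)
    sims = np.array([eigvals_of_corr(rng.normal(size=(n, p)))
                     for _ in range(n_sims)])
    return obs, sims.mean(axis=0)

# Simulated unidimensional Likert-like data: one latent trait, 6 items.
n, p = 500, 6
theta = rng.normal(size=(n, 1))
raw = 0.7 * theta + 0.7 * rng.normal(size=(n, p))
likert = np.clip(np.round(raw * 1.5 + 3), 1, 5)  # crude 5-point scoring

obs, bench = parallel_check(likert)
print(obs[:2], bench[:2])  # only the first observed eigenvalue should dominate
```

This is a rough stand-in, not a replacement for a proper test with polychorics and a simulated reference distribution of the second eigenvalue.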

Hello RG community,

I have a questionnaire to validate. Most of the questions are so-called gated, e.g.: Do you have symptom XX? a. Yes, b. No. If yes, proceed to the next question: How much does symptom XX bother you? 1. not that much, 2. to some extent, 3. very much.

My solution is to make two questions into one and re-code the derived question: How much does symptom XX bother you? 0. not at all, because I don't have it; 1. not that much; 2. to some extent; 3. very much.

The proposition comes with at least two risks: a. I am not sure the distance between zero and one is the same as that between the other categories; and b. I am not sure what I am validating! The new question is a re-framed version of the two questions, and I don't know to what extent the response patterns would differ if we had originally given patients the made-up (derived) questionnaire to fill out.

The same problem arises with IRT.

Any comments?

I have a measurement where the indicators influence/give rise to the construct. This makes it formative, however, I don't think I can apply IRT on formative measurement models. Can anyone confirm whether this is the case with an explanation on why?

In the literature on calculating IRT standard errors (SE), I have repeatedly found Fisher information mentioned.

Being curious, I started playing around with the Fisher information in order to obtain the typical information reported as a^2 P(theta) Q(theta).

My understanding of the process failed me when I started to check why the variance of the score is defined as follows:

- score: s = d/dtheta ln f(x, theta)
- Var(s) = E[s^2]

Given that the variance is

- Var(s) = E[s^2] - E[s]^2

I started looking into why E[s]^2 is zero. As long as f(x, theta) is a density function, I can write:

- E[s]^2 = [ ∫ (d/dtheta ln f(x, theta)) * f(x, theta) dx ]^2
- = [ ∫ (d/dtheta f(x, theta)) * f(x, theta) / f(x, theta) dx ]^2
- = [ ∫ d/dtheta f(x, theta) dx ]^2
- = [ d/dtheta ∫ f(x, theta) dx ]^2
- = [ d/dtheta (1) ]^2
- = 0

But as soon as we use the IRF (Item Response Function), which gives the probability of score x given theta, the computations above no longer work. The reason is that the integral of the IRF over theta is not finite, hence

- [ d/dtheta (1) ]^2 = 0

is not valid. I have shown that

- E[ (d/dtheta ln f(x, theta))^2 ] = -E[ d/dtheta ( d/dtheta ln f(x, theta) ) ]

but that holds only when f(x, theta) integrates to one and the simplifications can be done.

Any input on my approach and my (lack of) understanding of the problem?
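One way out of the puzzle: for a dichotomous item the IRF defines f(x, theta) = P(theta)^x Q(theta)^(1-x), a probability mass function over x ∈ {0, 1} at fixed theta. It is not a density over theta, so its integral over theta is irrelevant; the expectations are sums over x, E[s] = 0 holds, and Var(s) recovers a^2 P Q. A numerical check for the 2PL, where the score works out to s = a(x - P):

```python
import math

def info_2pl(theta, a, b):
    """Fisher information for a 2PL item, computed two ways."""
    P = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    Q = 1.0 - P
    # Score s(x) = d/dtheta ln f(x, theta) = a * (x - P), for x in {0, 1}.
    scores = {0: a * (0 - P), 1: a * (1 - P)}
    probs = {0: Q, 1: P}
    # Expectations are SUMS over x, not integrals over theta.
    e_s = sum(scores[x] * probs[x] for x in (0, 1))
    var_s = sum(scores[x] ** 2 * probs[x] for x in (0, 1)) - e_s ** 2
    closed_form = a ** 2 * P * Q
    return e_s, var_s, closed_form

e_s, var_s, closed = info_2pl(0.5, 1.3, -0.2)
print(e_s, var_s, closed)  # e_s is 0; var_s equals a^2 * P * Q
```

Algebraically: E[s] = a(0 - P)Q + a(1 - P)P = 0, and Var(s) = a^2 P^2 Q + a^2 Q^2 P = a^2 P Q, the familiar item information.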

Most software packages for analyzing polytomous models under the Item Response Theory approach show option characteristic curves as an output. Given that I have the data on those option characteristic curves, how would I calculate the item characteristic curve?

Researchers do shorten questionnaires, but very little is known (?) about good practice in this field. There are probably methods rooted in both IRT and CTT, but has anyone empirically tested which method is more useful in which circumstances?

1. Total Score is defined as the number of correctly answered (dichotomous) items;
2. Sub-score is defined as the total score associated with a sub-scale;
3. Overall Score is defined as the total score associated with all test items;
4. for Total Score, the Overall Score is the sum of its Sub-scores, a property called Additivity;
5. for Item Response Theory (IRT) ability (the theta parameter), no such relationship between Overall Score and Sub-scores is available.
6. Comment: (5) implies that IRT has no Additivity. Therefore, with IRT ability, the sub-scores and the Overall Score cannot be available simultaneously. This strongly indicates that IRT is not a correct theory for high-stakes scoring, while the Total Score in (4) is (although only as a special case).

**1PL IRT model**

The mathematical formula for item (question) difficulty is:

- P(1|theta) = e^(theta - b) / (1 + e^(theta - b))
- ability: theta = ln(p / (1 - p)), where p is the proportion of correct answers (count of 1s) for each student
- difficulty: b = ln((1 - p) / p), where p is the proportion of 1s on each question (note the inversion: easier items, with higher p, get lower b)
- that is the 1PL model.

The formula for the 2PL model is:

- P(1|theta) = e^(a(theta - b)) / (1 + e^(a(theta - b)))
- a = discrimination parameter

What is the formula for finding the discrimination parameter a?
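There is no closed-form formula for a in the 2PL; it is estimated by maximum likelihood (or a similar method). If abilities are treated as known, per-item ML reduces to a logistic regression of the responses on theta. A hypothetical simulation sketch (Python for illustration; the true parameter values and the simple gradient-ascent fitting loop are this sketch's own choices, not a standard named algorithm):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate one 2PL item with known parameters and known abilities.
n = 5000
theta = rng.normal(size=n)
a_true, b_true = 1.5, 0.3
p = 1.0 / (1.0 + np.exp(-a_true * (theta - b_true)))
x = rng.binomial(1, p)

# Maximum likelihood by gradient ascent on the 2PL log-likelihood:
#   loglik = sum x*ln(p) + (1-x)*ln(1-p),  p = sigmoid(a*(theta - b))
a_hat, b_hat = 1.0, 0.0
lr = 0.5
for _ in range(2000):
    p_hat = 1.0 / (1.0 + np.exp(-a_hat * (theta - b_hat)))
    resid = x - p_hat
    a_hat += lr * np.mean(resid * (theta - b_hat))  # d loglik / d a
    b_hat += lr * np.mean(resid * (-a_hat))         # d loglik / d b

print(a_hat, b_hat)  # close to the true values 1.5 and 0.3
```

With abilities unknown (the usual case), the same likelihood is maximized jointly or marginally over theta, which is what IRT software does internally.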

Dear colleagues,

I have obtained thetas from a 2PL IRT model based on a 12-item test of financial literacy. Can I further use those theta scores (in original or modified form) in a logit regression as one of the independent variables? I cannot seem to find any literature to support or dissuade this.

Dear all

I am working on research that requires me to simulate data with LD item pairs (surface local dependence). I am not familiar with R, so I ran into problems using code from other researchers, such as Houts (2011). I set up the placeholders for the LD pairs and completed the code, but all elements of the matrices are zero, and no raw data files are produced. I would appreciate your help, or a detailed explanation of the code you used to simulate raw data with LD.

Dear all, I am searching for a comparison of CTT with IRT. Unfortunately, I mostly get an outline of both theories, but they are not compared. Furthermore, generally the "old" interpretation of CTT using axioms is used and not the correct interpretation provided by Zimmerman (1975) and Steyer (1989). The only comparison of both theories that I found was in Tenko Raykov's book "Introduction to Psychometric Theory". Does anybody know of any other sources?

Kind regards, Karin

"When in 1940, a committee established by the British Association for the Advancement of Science to consider and report upon the possibility of quantitative estimates of sensory events published its final report (Ferguson eta/., 1940) in which its non-psychologist members agreed that psychophysical methods did not constitute scientific measurement, many quantitative psychologists realized that the problem could not be ignored any longer. Once again, the fundamental criticism was that the additivity of psychological attributes had not been displayed and, so, there was no evidence to support the hypothesis that psychophysical methods measured anything. While the argument sustaining this critique was largely framed within N. R. Campbell's (1920, 1928) theory of measurement, it stemmed from essentially the samesource as the quantity objection." by Joel Michell

(1) Why "there was no evidence to support the hypothesis that psychophysical methods measured anything" because "the additivity of psychological attributes had not been displayed"?

(2) Item Response Theory (IRT) has no Additivity, Can IRT correctly measure educational testing performance?

My supervisor has suggested using IRT after a long discussion about using CTT. I am confused as to whether to use both, or one or the other.

If I were to choose IRT only, which model should I use: Rasch, or the 2- and 3PL models? And for any of these models, what would be the ideal sample size?

Any advice would be very helpful.

If the height of a test information curve (not an item information curve) indicates the discriminability of a test at a given level of a trait, then doesn't it follow that test scores in the tails of this distribution are uninformative (unreliable)? It seems to me that an inevitable conclusion from IRT is that, for most published scales, extreme test scores will be inherently noisy and therefore should not be given prominence in data analysis (e.g., treating test scores as a continuous variable and using all of the data, including extreme scores), because of the high leverage these data points have in determining the solutions. At the very least, it seems IRT would compel researchers either to trim their data (e.g., omit the top and bottom 5 or 10% of scores) or, in some cases, to treat the data discretely and perform ANOVAs instead. How does one reconcile the test information curve with the prescription to analyze data as a continuous variable without trimming extreme scores?

Hi everyone,

I need to find an IRT course, upon my supervisor's advice. Does anyone have any recommendations? My Google search has been limited, and UCL only offers generic statistics courses.

I am UK based however, but am open to online courses/tutorials as well.

Thanks,

Leng

I'm using IRT for a DIF analysis by gender. However, it occurs to me that it might be possible to do DIF in a Mokken scale analysis framework. Is there a standard protocol for doing this?

- Application of **Rasch analysis** and **IRT** models is becoming increasingly popular for developing and validating patient-reported outcome measures. Rasch analysis is a confirmatory model, where the data have to meet the Rasch model's requirements to form a valid measurement scale, whereas IRT models are exploratory models aiming to describe the variance in the data. Researchers seem to be divided on the preference of one over the other. *What is your opinion on this dilemma?*

(1) Item Response Theory (IRT) is an incorrect theory for high-stakes scoring;
(2) IRT has no additive structure; IRT has no decomposition theory;
(3) IRT is not consistent with the Total Score and the General Total Score;
(4) IRT is not self-consistent; IRT is not complete.

For full article, please click
https://www.researchgate.net/publication/322076733_Item_Response_Theory

Good evening, everyone.

I would like to conduct an *Item Response Theory* Differential Item Functioning (DIF) analysis. I have two groups that answered a test. The test has a bifactor structure with dichotomous responses.

What is, in your opinion, the most appropriate technique (and software) for conducting this kind of DIF analysis?

Thank you,

José Ángel

Immunoreactive trypsinogen (IRT) levels can be measured in blood as well as serum. I wonder how stable IRT is in -20 °C and -80 °C storage, how long serum can be stored at these temperatures, and how freeze-thaw cycles influence IRT levels.

I'm exploring the concept that scales can be ordered and that certain items should carry more weight in a scale. I came across Guttman scalograms and Mokken scaling techniques. Creating the initial Mokken scale makes sense. What I don't get is: after I get my H coefficients and AISP from Mokken, how do I evaluate the data in a meaningful way?

If I use a Mokken analysis on a 10-item Likert survey, it wouldn't make sense to compute an overall composite mean score to represent the latent ability, since I established that items are ordered by difficulty. Do the H coefficients determine item weights? How can I sum a participant's score on my newly created Mokken scale?

Hi,

Can anyone tell me whether it is possible to predict factor scores from an IRT model for each factor (latent trait), as in ordinary PCA?

Best,

Davit

I'm looking for a way to compute fit measures for polytomous Item Response Theory models, more specifically the Graded Response Model, the Generalized Partial Credit Model and the Generalized Graded Unfolding Model. The ltm package for R provides some fit measures, but, for example, in the analyses I've run on a variety of scales, the p-value from the goodness-of-fit test was always exactly .01, which I find weird. The GGUM2004 software only provides item-fit measures, with no global measure of model fit. I was wondering if there are ways to compute additional fit indicators, such as RMSEA and other statistics commonly used in SEM, and if so, how I could calculate them.

In IRT models, item difficulty is on the same metric as people's ability (both are in logits). I want to do the same with a CFA model: put the items on the same metric as the factor scores of my single latent variable. How can I do this? I am using the lavaan package in R and have all the matrices (item means, variance-covariance matrix, and residual matrix of the items), but I am stuck at that step. I used the WLSMV estimator because my data are not multivariate normal.

Can somebody help me please? Thank you!

I want to examine the factor structure of two scales (scale A: 14 items; scale B: 26 items). We are trying several methods: CFA followed by EFA, Rasch analysis, and IRT. IRT seems to be an increasingly popular methodology for this purpose. But I have also come across some studies using meta-SEM to examine factor structure. It basically means getting a pooled inter-item correlation matrix from the respective inter-item correlation matrices of a batch of related studies, then running SEM on the pooled matrix to test the factor structure.

For meta-SEM, is an item-level intercorrelation matrix a must? Can it be based on intercorrelations among subscales instead of all individual items?

Which method would be relatively more rigorous? Any recommended examples or resources? Thanks a lot!

I have a dataset with many missing values, because the data were generated by a computer adaptive test.

I want to show the unidimensionality of the test, but factor analysis is a problem due to the missings. The paper below seems to suggest that we can use one's theta estimate (I'm using IRTPRO to estimate it with the EAP method) to calculate expected scores and impute them, which seems intuitive. Is this a correct way?

Can I simply proceed with factor analysis after this, or would I need to round the probabilities to 0 and 1's? Are there other methods?

Regards, Dirk

Context : Performance test, dichotomous, one-dimensional (at least in theory), at least 3PL (variable pseudo-guessing). Theta is assumed normal. But I'm also interested in answers in general in IRT.

Problem : It seems to me that EFA factor loadings provide clear guidelines to rank/select a subset of items from a pool (with referenced rules of thumb, etc.) when one does not have any prior info/assumption of theta (aka for "all" test-takers).

On the other hand, IRT is, in my opinion, a much more accurate representation of the psychological situation that underlies test situations, but it seems to me that there are a variety (especially in IRT-3PL / 4PL) of parameters to take into account all at once to select items, at least without any prior estimation of theta.

So I'm wondering if you know of any guidelines/packages that can be referenced as a clear basis (meaning, not eyeballing item response functions) for item selection there. At this stage I'm thinking of a very non-parsimonious solution, like generating all possible item subsets (I'd get a LOT of models, but why not) -> fit an IRT model -> compute marginal reliability (and/or information, or why not CFI, RMSEA, etc.) for thetas ranging between -3 SD and +3 SD -> rank the subsets by descending marginal reliability (but I'm afraid this would bias towards more items, so I'd perhaps have to weight by item count).

Anyway, you get the idea. Any known referenced procedures/packages?

So, I've been stuck trying to get item locations out of mirt when using the GPCM. I know that eRm usually gives you both item locations and thresholds, but I haven't been able to find them in mirt. Using:

coefG <- coef(modG, IRTpars = TRUE, simplify = TRUE)

I've only been able to fetch discrimination and thresholds; apparently the location is hidden somewhere. I need mirt because of the MHRM method, due to some sample issues... Also, mirt works with a different type of object, otherwise I'd be able to list its elements and find things out at once. So, does anyone have a clue how I can find the item locations among my results, or how to calculate and extract them from the thresholds?

Best!

I was wondering whether someone has a suggestion on how to create a unidimensional IRT model where item pairs (on symptoms) are conditional so that the one item asks about the presence of a symptom (Yes/No) and the other about the severity of the endorsed item (on a 5-step scale).

Conceivably reporting a symptom but rating the lowest severity could be indistinguishable on the latent trait from the "No" response, but how can I test this?

If I combine the item pairs into items with 6 response options, will a nominal model do the trick? Or -- assuming that symptom and severity items measure the same thing -- should I just use all variables in a Graded Response Model, accepting the systematic missingness, and compare a parameters for the item pairs? In the latter case the dependence isn't modelled in any way.

We are working on the development of diagnostic systems for cystic fibrosis. We are having problems with the storage of our IRT proteins. The protein degrades upon freeze-thawing, and storage in PBS seems to disrupt the structure. We tried storage at pH 4.8; it works better, but we are still not sure what the best storage method is. Any suggestion will help while we pursue our optimization work.

Hi all!

I am running some IRT models using the mirt package for R (polytomous 2PL GRM). The scales I am analysing consist of 6 items, comprising 3 contrait and 3 protrait items measured on 9-point Likert scales, and the sample size is around 3200.

When I run the item-fit statistic (S-X2), the output shows that the protrait items (items 3-5) fit the model and the contrait ones (items 6-8) do not (output below). Several of the other scales I am testing have similar issues, but not always along the protrait/contrait divide. Assumption testing showed a very strong one-factor solution and acceptable local dependence (LD) (Yen's Q pairs were around .02 - .25). Although many texts recommend the S-X2 test for polytomous item-level fit, none that I have found suggests what to do when the model does not fit the data (although several just say "close enough", which seems like troublesome logic).

| item | Zh | S_X2 | df | p |
|----------|------|--------|-----|------|
| ethnic_3 | 7.12 | 210.88 | 199 | 0.27 |
| ethnic_4 | 8.33 | 199.63 | 192 | 0.34 |
| ethnic_5 | 6.43 | 219.67 | 196 | 0.12 |
| ethnic_6 | 5.31 | 286.77 | 211 | 0.00 |
| ethnic_7 | 4.48 | 254.58 | 213 | 0.03 |
| ethnic_8 | 1.87 | 308.25 | 234 | 0.00 |

I was wondering what levels of violation are acceptable, and what we can do to improve item-level fit? Is the only solution to change the model (1PL, 2PL, Partial Credit, etc.) or the estimator used?

Thanks in advance,

Conal Monaghan

Dear all

I tried to get freeware to check local item dependence (LID), such as IRTNEW and LDIP, but I cannot find any links for them on the internet. Could you help me with links for them, or with any other freeware (GUI) to check LID?

Thank you for your help.

I want to use IRTPRO for item analysis of a public examination. The subject I intend to analyse has 40 multiple-choice items; can all items fit into the model?

If a, b, and c parameters have been obtained and published for a normative sample, and I calculate them for a new data set, would it be possible to statistically examine DIF on the basis of these parameters? Or would I need the full raw data for the normative sample?
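Without the raw normative data, a formal DIF test such as Lord's chi-square is typically out of reach (it needs standard errors or the parameter covariance matrix, which published tables often omit), but a descriptive comparison of the two sets of item characteristic curves is possible from the published a, b, and c values alone, in the spirit of Raju's area measures. A hedged sketch (Python; the logistic scaling constant D = 1.7 is an assumption):

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve P(theta)."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def max_icc_gap(params_ref, params_focal, lo=-4.0, hi=4.0, steps=161):
    """Largest absolute ICC difference between two (a, b, c) sets
    over an evenly spaced theta grid -- a descriptive DIF index."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(abs(icc_3pl(t, *params_ref) - icc_3pl(t, *params_focal))
               for t in grid)

# identical parameters -> zero gap; a shifted difficulty -> positive gap
print(max_icc_gap((1.2, 0.0, 0.2), (1.2, 0.0, 0.2)))
print(max_icc_gap((1.2, 0.0, 0.2), (1.2, 0.5, 0.2)))
```

Note this assumes the two calibrations are already on a common scale; otherwise the parameters would need linking first.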

Is there any classification of the ranges of values for item difficulty and item discrimination in item response theory (IRT) and multidimensional item response theory (MIRT)?

According to Baker in "The Basics of Item Response Theory", item discrimination in IRT is classified as follows:

none: 0
very low: 0.01 - 0.34
low: 0.35 - 0.64
moderate: 0.65 - 1.34
high: 1.35 - 1.69
very high: > 1.70
perfect: +infinity

According to Baker, Hambleton (Fundamentals of Item Response Theory), and Hasmy (Compare Unidimensional and Multidimensional Rasch Model for Test with Multidimensional Construct and Items Local Dependence), item difficulty is classified as follows:

very easy: below -2
easy: (-2, -0.5)
medium: [-0.5, 0.5]
hard: (0.5, 2)
very hard: above 2

**Could the item discrimination and item difficulty classifications also be used in MIRT?**
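Purely as a labelling convenience, the discrimination bands quoted from Baker can be wrapped in a small helper (Python sketch; the cut-points are exactly those listed above). For MIRT, one hedged option is to classify Reckase's MDISC (the root sum of squared slopes) with the same bands, since it plays the role of an overall discrimination, though I am not aware of a published classification specific to MIRT:

```python
import math

def classify_discrimination(a):
    """Label a discrimination (a) parameter using the bands quoted
    from Baker, 'The Basics of Item Response Theory'."""
    if a < 0:
        raise ValueError("expected a non-negative discrimination")
    if a == 0:
        return "none"
    if a <= 0.34:
        return "very low"
    if a <= 0.64:
        return "low"
    if a <= 1.34:
        return "moderate"
    if a <= 1.69:
        return "high"
    return "very high"

def mdisc(a_vec):
    """Reckase's multidimensional discrimination: sqrt(sum of a_k^2)."""
    return math.sqrt(sum(a * a for a in a_vec))

print([classify_discrimination(a) for a in (0.2, 0.5, 1.0, 1.5, 2.2)])
# -> ['very low', 'low', 'moderate', 'high', 'very high']
```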

I'm looking into the development of an online IRT adaptive test for a specific test instrument. Any pointers to help me start out, such as published research or case studies, would be appreciated. I've come across the Concerto platform but would be interested to know what else is out there.

There are many different programs for conducting IRT; some are stand-alone products and some are components within larger statistical programs. What are the *most popular* programs in common use? Additionally, does anyone know of any independent comparisons of the various programs to verify that they produce identical results?

I am interested in comparing test equating procedures

I am running an IRT analysis on an instrument in XCalibre, and the analysis reports substantially different means for the items than those calculated in Excel. Is there some weighting happening of which I am unaware?

Hello everyone,

Is there any commercial software available for multidimensional adaptive testing (including calibration of items, ability estimation, and next-item selection procedures)? If yes, which are they?

Thanking you,

Irshad

For her dissertation research relating item content to slopes/loadings, my student is seeking papers that fit simple (i.e., no cross-loadings) structural models using IRT or CFA for Likert-style responses to measures of personality, attitude, etc. We will also fit such models ourselves if the data are made available to us.

In order to analyze the item content and relate it to the structural parameter, the items have to be provided or publicly available and the paper must provide the slopes/loadings.

We are particularly interested in measures of content that has shown wording effects, like Rosenberg's self-esteem scale, big-five personality, positive and negative affect, etc. We cannot consider measures whose sole use is clinical (e.g., MMPI).

Hello, I want to compute the IRT item information function for individual items, as well as for the latent variable, using Stata. Several methods are available for IRT analysis, like clogit, gllamm, or raschtest, but so far I could not find any syntax to draw the information function using any one of these methods. Any help or example is much appreciated.
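Whichever routine produces the estimates, the information function itself is a closed-form expression you can compute from the fitted a and b values: for the 2PL, I(theta) = D^2 a^2 P(theta)(1 - P(theta)). In Stata you could compute this in a generated variable over a theta grid and pass it to twoway line; if you have Stata 14 or newer, the built-in irt suite can also draw it directly with irtgraph iif. A language-agnostic sketch of the quantity to plot (Python; D = 1.7 is an assumption):

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """2PL response probability."""
    return 1 / (1 + math.exp(-D * a * (theta - b)))

def info_2pl(theta, a, b, D=1.7):
    """2PL item information: I(theta) = D^2 * a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b, D)
    return (D * a) ** 2 * p * (1 - p)

# information peaks where theta equals the item difficulty b
print(info_2pl(0.0, 1.0, 0.0) > info_2pl(1.0, 1.0, 0.0))  # -> True
```

The test information function is then just the sum of the item information functions at each theta.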

I am using **mirt** (an R package) for IRT (Item Response Theory) analysis. The data I am dealing with are sparse (contain missing values). Responses are missing because the test is adaptive: in an adaptive test, not all questions in the item bank are presented to the test taker, so the responses to questions that were not presented, and to those he could not solve, are missing. The "mirt" function in the **mirt** package can calibrate data with missing values (i.e., fit the IRT models: Rasch, 2PL, 3PL). However, when it comes to item-fit analysis (using the "itemfit" function), you cannot supply the sparse data; in this package, to run item-fit analysis on sparse data you must use imputation. I have two questions:

1. Are there any methods available besides imputation for item-fit analysis when you have sparse data?

2. What is the maximum percentage of sparseness in the response-data matrix at which the imputation method still gives reliable results?
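On question 1: besides multiple imputation, one simple model-based option is to fill each missing cell with a Bernoulli draw from the model-implied probability at that examinee's estimated theta, which at least respects the calibrated model rather than treating omissions as wrong. A sketch of that idea for dichotomous 2PL items (Python; item parameters and theta estimates are assumed to come from your calibration, and D = 1.7 is an assumption):

```python
import math
import random

def p_2pl(theta, a, b, D=1.7):
    """2PL response probability."""
    return 1 / (1 + math.exp(-D * a * (theta - b)))

def impute_model_based(responses, thetas, items, seed=0):
    """Fill None entries of a persons-by-items 0/1 matrix with
    Bernoulli draws from the model-implied probabilities.

    responses : list of lists, entries 0, 1 or None (missing)
    thetas    : one estimated ability per person
    items     : list of (a, b) pairs, one per item
    """
    rng = random.Random(seed)
    filled = []
    for resp_row, theta in zip(responses, thetas):
        row = []
        for resp, (a, b) in zip(resp_row, items):
            if resp is None:
                resp = int(rng.random() < p_2pl(theta, a, b))
            row.append(resp)
        filled.append(row)
    return filled

data = [[1, None, 0], [None, 1, 1]]
print(impute_model_based(data, [0.5, -0.2], [(1.0, 0.0)] * 3))
```

Repeating the draw-and-refit several times (a multiple-imputation loop) and pooling the item-fit statistics gives a sense of how sensitive the conclusions are to the imputation.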

I would like to conduct a Rasch model analysis of the Olweus Bullying Questionnaire with a Pakistani sample. I haven't done it before. Any suggestions and help, especially from an expert in Lahore, would be appreciated. ConQuest is the available software.

I am using the IRTPRO software to find item difficulty and discrimination for my dichotomous tests. I realized that the program implements different algorithms for IRT parameter estimation, such as the Bock-Aitkin EM algorithm and the Metropolis-Hastings-Robbins-Monro (MH-RM) algorithm, and I want to understand how each of them works. Could anyone help me find some good material that explains the algorithms in a simple way? I also came across "invariant item parameters in IRT" and the claim that they don't change regardless of the sample of students we have. How is this proved, and how confident can we be about such an assumption?
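For intuition before diving into the original papers (Bock & Aitkin, 1981, for the EM approach; Cai, 2010, for MH-RM), the heart of Bock-Aitkin is easy to see in miniature: marginal maximum likelihood replaces each unknown theta with a quadrature grid, the E-step computes each examinee's posterior weight at every grid point, and the M-step refits item parameters against those weighted pseudo-counts. A stripped-down sketch of the E-step for the 2PL (Python; a rectangular grid with a standard-normal prior stands in for the Gauss-Hermite points real software uses):

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """2PL response probability."""
    return 1 / (1 + math.exp(-D * a * (theta - b)))

def posterior_weights(pattern, items, n_points=21, lo=-4.0, hi=4.0):
    """E-step for one response pattern: posterior probability of each
    quadrature point given the 0/1 responses and current (a, b) values."""
    grid = [lo + k * (hi - lo) / (n_points - 1) for k in range(n_points)]
    prior = [math.exp(-t * t / 2) for t in grid]   # standard-normal shape
    joint = []
    for t, pr in zip(grid, prior):
        like = pr
        for x, (a, b) in zip(pattern, items):
            p = p_2pl(t, a, b)
            like *= p if x == 1 else 1 - p
        joint.append(like)
    total = sum(joint)
    return grid, [j / total for j in joint]

items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)]
grid, w = posterior_weights([1, 1, 0], items)
print(round(sum(w), 6))  # -> 1.0
```

MH-RM replaces the deterministic quadrature with Metropolis-Hastings draws of theta and a Robbins-Monro stochastic-approximation update, which scales better to high-dimensional (multidimensional) models.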

I am using CTT and IRT to analyse three different tests I gave to students. Applying CTT was straightforward; however, using IRT was a bit confusing to me. I started by assessing the tests' unidimensionality using principal component analysis in SPSS. The results showed that each test assesses more than one latent variable (I was able to identify 4 or 5 components in each test). I then decided to follow a different approach by examining model-data fit for each test: fitting each test to different unidimensional and multidimensional IRT models to choose the best model, and then obtaining the difficulty and discrimination indices. In this case, even if my test is not unidimensional, I could still find that a unidimensional model fits my data better than a multidimensional one, and I could determine the number of dimensions that best represents my data.

My question is how reliable this approach is: if the principal component analysis results in more than one dimension, but the model-data fit shows that a unidimensional item response model is the best fit, should we just assume that my test is unidimensional?

I have the following situation: a number of texts have been rated for readability, and the research question is whether the readability score used is valid. We ran a study where participants first read a text and then answered a multiple-choice question. Because the choices were exclusive (exactly one correct choice), there is a 25% chance of guessing.

We ran a logistic regression with correct/incorrect as the response and readability score as the predictor. We find the desired relationship, but we also want to predict performance on texts of lower readability than in our sample. How can we formulate the model, or adjust the estimates, so that the predicted probability of a correct answer never falls below 25%?

Note that the data analysis is being done by a bachelor student, so pragmatic solutions using standard software are preferred over profound ones. Also note that we have to use either a mixed-effects model or GEE to account for the repeated measures.
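A pragmatic fix that stays within standard logistic machinery is to build the guessing floor into the link function: model P(correct) = 0.25 + 0.75 * logistic(eta), which is exactly the role of the lower-asymptote (c) parameter in the 3PL. Some packages allow a user-defined link for this while keeping the mixed-effects structure in eta. A minimal illustration of the response curve itself (Python sketch; the linear predictor eta is hypothetical):

```python
import math

def p_correct(eta, guess=0.25):
    """Probability of a correct answer with a fixed guessing floor:
    p = guess + (1 - guess) * logistic(eta).  Never drops below `guess`,
    no matter how unreadable the text (eta -> -infinity)."""
    return guess + (1 - guess) / (1 + math.exp(-eta))

# even at extreme predictor values the floor holds
print(round(p_correct(-20.0), 4))  # -> 0.25
```

With this link, eta can still contain the fixed effect of readability plus random effects for the repeated measures; only the response function changes, so extrapolation to less readable texts can no longer dip below chance.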