
IRT - Science topic

Explore the latest questions and answers in IRT, and find IRT experts.
Questions related to IRT
  • asked a question related to IRT
Question
3 answers
Hi,
I'm trying to list different CAT models. I'm used to working with IRT-based and Rasch-based CATs, but I've never wandered outside IRT/Rasch. Has anyone heard of CATs built on other frameworks, whether developed for real-life use or only theoretically?
Thanks a lot!
Christian Bourassa
Relevant answer
Answer
This essay is about a new theory called General Total Score:
  • asked a question related to IRT
Question
4 answers
Dear all,
I have an English Listening Comprehension test consisting of 50 items, taken by about 400 students. I would like to score the test on the TOEFL scale (max 58, I think), and it is claimed that TOEFL is scored using IRT (the 3PL model). I am using the mirt package in R to obtain the three item parameters of the 3PL model.
library(readxl)
library(mirt)

TOEFL006 <- read_excel("TOEFL Prediction Mei 2021 Form 006.xlsx")
TOEFL006_LIST <- TOEFL006[, 9:58]      # the 50 listening items

sv <- mirt(TOEFL006_LIST,              # data frame of item responses
           1,                          # 1 = unidimensional; 2 = exploratory two-factor
           itemtype = '3PL')           # model: Rasch, 1PL, 2PL, or 3PL
sv_coeffs <- coef(sv,
                  simplify = TRUE,
                  IRTpars = TRUE)
sv_coeffs
The result is shown below:
| Item | a     | b      | g     | u |
|------|-------|--------|-------|---|
| L1   | 2.198 |  0.165 | 0.198 | 1 |
| L2   | 2.254 |  0.117 | 0.248 | 1 |
| L3   | 2.103 | -0.049 | 0.232 | 1 |
| L4   | 4.663 |  0.293 | 0.248 | 1 |
| L5   | 1.612 | -0.374 | 0.001 | 1 |
| ...  | ...   | ...    | ...   | ... |
The problem is that I do not know how to use the parameters above to weight each item. The formula should be like this, right?
Could anyone show me how to insert the parameters into the formula in R? Or perhaps there are other ways of obtaining students' scores without manually weighting each item. Your help is much appreciated.
Thank you very much for your help everyone.
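One way to avoid manual weighting altogether, assuming the fitted sv object above, is to let mirt compute ability estimates directly; a minimal sketch (the 0-100 rescaling is purely illustrative and is not the official TOEFL conversion):

library(mirt)

# EAP ability estimates use the 3PL item parameters as weights internally,
# so no manual per-item weighting is needed
theta <- fscores(sv, method = "EAP")   # one theta estimate per student

# Purely illustrative linear rescaling of theta to a 0-100 reporting scale
scaled <- 100 * (theta - min(theta)) / (max(theta) - min(theta))
head(cbind(theta, scaled))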
Relevant answer
Answer
This essay may be of interest for your research; it argues that the IRT theta parameter is not an ability measure, although the two are positively correlated:
  • asked a question related to IRT
Question
1 answer
Looking to measure stress in cats using IRT. One study used a FLIR, but it is quite pricey. Any other good options? I think FLIR is a good brand, but I don't know enough about the specs to determine what I need.
Relevant answer
Answer
Hello Kathryn, FLIR specializes in thermal imaging and makes very high-quality cameras, but they are expensive. We use a Testo camera to measure stress in animals, specifically the testo 890-2 model. It is a camera for scientific purposes, has suitable accuracy and many useful functions, and it is cheaper than a FLIR. The price corresponds to the quality.
  • asked a question related to IRT
Question
4 answers
My goal is to test the equivalence of the model between countries. The model is relatively complex. It consists of 35 items, some of which are continuous and some ordinal, divided into 7 dimensions.
I would like to ask, what method would you recommend (e.g. MGCFA, IRT, ESEM, Alignment, LCFA, BSEM)? I would like to test more methods and compare the results. However, each method is suitable for either continuous (e.g. MGCFA) or ordinal (e.g. IRT), and I do not know if it is possible to apply them when I have both types of variables.
Alternatively, would the solution be to transform the response scales so that they are uniform?
Thank you very much for your answers.
Relevant answer
Answer
The two one-sided tests (TOST) procedure. See 'Equivalence Testing for Psychological Research: A Tutorial' (Lakens et al., 2018):
  • asked a question related to IRT
Question
11 answers
Hello
I am using six 500 W halogen lamps to heat the surface of an HDPE plate in order to create the thermal gradient over its thickness required for IR thermography (IRT). I am using IRT to detect subsurface defects. For this purpose, I heated the HDPE plate with the six halogen lamps for 180 s, and after removing the heat source I started monitoring the target. I wanted to use Fourier's equation to calculate the temperature at the location of each subsurface defect. My problem is how to calculate Q in Fourier's equation. (I think I am wrong to calculate Q as 6*500*180, because it gives me unrealistic values for the temperature at the defect locations.)
Relevant answer
Answer
500 W is the electrical power consumption of a halogen lamp. To know the actual heat transferred to the HDPE plate, you need a heat flux sensor. With that you can measure the heat flux incident on the plate, i.e. Q/A (W/m2). Then you need to know the absorptivity (or, from the emissivity) of the plate surface. The product of these two gives you the actual heat transferred to the plate (ignoring other losses).
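A small worked example of that arithmetic, with purely hypothetical numbers:

# Hypothetical values for illustration only
q_incident   <- 1500   # incident heat flux measured by the sensor, W/m^2
absorptivity <- 0.95   # absorptivity of the HDPE surface
t_heating    <- 180    # heating time, s

q_absorbed <- q_incident * absorptivity   # absorbed heat flux, W/m^2
E_per_area <- q_absorbed * t_heating      # absorbed energy per unit area, J/m^2
E_per_area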
  • asked a question related to IRT
Question
8 answers
I am examining results from an exploratory factor analysis (using Mplus) and it seems like the two-factor solution fits the data better than the one factor solution (per the RMSEA, chi-square LRT, CFI, TLI, and WRMR). Model fit for the one factor model was, in fact, poor (e.g., RMSEA = .10, CFI = .90). In the two factor model, the two latent factors were strongly correlated (.75) and model fit was satisfactory (e.g., RMSEA = .07, CFI = .94). The scree plot, a parallel analysis, and eigenvalue > 1, however, all seem to point to the one-factor model.
I am not sure whether I should retain the one- or two-factor model. I'm also not sure whether I should look at other parameters/model estimates to determine how many factors to retain. Theoretically, both models make sense. I intend to use these models to conduct an IRT analysis (a uni- or multidimensional graded response model, depending on the number of factors I retain).
Thank you in advance!
Relevant answer
Answer
0.70 is acceptable
  • asked a question related to IRT
Question
4 answers
I am comparing IRT results for 2 samples who took the same exam. I am aware of DIF detection methods (MH, LR, Lord-Wald, etc.), but for my specific purpose I only want to compare whether questions discriminate equally (ignoring differences in the difficulty parameter). I have the model's standard errors for each discrimination parameter, so I can construct confidence intervals, but I need to know what distribution these parameter estimates follow. I believe they follow the Z distribution, but I was looking for some other input/confirmation since I can't find conclusive evidence in the literature. Thank you in advance!
Relevant answer
Answer
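A note on the question itself: the maximum likelihood estimates of the discrimination parameters are asymptotically normal, which is the usual justification for a Wald-type z comparison; a minimal sketch with placeholder values:

# Placeholder estimates for one item from the two calibrations
a1 <- 1.42; se1 <- 0.18   # sample 1: discrimination and its standard error
a2 <- 1.10; se2 <- 0.21   # sample 2

# Wald z statistic for the difference in discriminations
z <- (a1 - a2) / sqrt(se1^2 + se2^2)
p <- 2 * pnorm(-abs(z))   # two-sided p-value from the standard normal
c(z = z, p = p)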
  • asked a question related to IRT
Question
2 answers
Dear all,
I am trying to understand the similarities between conducting CFA on categorical data using the WLSMV in Mplus and IRT. I would appreciate any input. Moreover, is there a way to convert Mplus parameters to difficulty and discrimination parameters?
Thanks in advance,
Nikos
Relevant answer
Answer
Dear Vignesh,
Thank you for the input.
Nikos
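On the conversion part of the question: a commonly cited mapping between a categorical CFA (standardized factor, delta parameterization) and the two-parameter normal-ogive IRT model (see, e.g., Kamata & Bauer, 2008) is a = lambda / sqrt(1 - lambda^2) and b = tau / lambda. A rough R sketch, with lambda and tau standing in for the Mplus loadings and thresholds (placeholder values; check that your Mplus parameterization matches):

# Placeholder standardized loadings and thresholds from Mplus output
lambda <- c(0.62, 0.71, 0.55)
tau    <- c(-0.20, 0.35, 0.10)

a_probit <- lambda / sqrt(1 - lambda^2)   # discrimination, normal-ogive metric
b        <- tau / lambda                  # difficulty
a_logit  <- 1.7 * a_probit                # approximate logistic-metric discrimination
data.frame(a_probit, a_logit, b)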
  • asked a question related to IRT
Question
3 answers
I am analysing an instrument of 7 dichotomously scored items using IRT. I calibrated the data with the 2PL model; the item fit statistics are good, but the p-value of the M2 statistic is far less than 0.05. In this case, how should I interpret model-data fit? Is the model appropriate for my data or not? Thanks
Relevant answer
Answer
The M2 statistic tells you that the model does not fit exactly. If you cannot find a better model, the question is whether it is at least actionable. I would compute an RMSEA to determine that and use a cutoff of .05. The reference is 'Assessing approximate fit in categorical data analysis', where we explain how to compute the RMSEA in IRT, what cutoffs to use, and why.
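If the calibration was done with R's mirt package, the RMSEA comes out of the same limited-information M2 computation; a minimal sketch, assuming a fitted 2PL object called fit:

library(mirt)
# fit <- mirt(responses, 1, itemtype = "2PL")   # fitted elsewhere
M2(fit)   # reports M2, df, p, and RMSEA (plus SRMSR/CFI/TLI) for the approximate-fit view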
  • asked a question related to IRT
Question
18 answers
There is an interesting Law of the Hammer (see link *), which states the following truism: "If the only tool you have is a hammer, it is tempting to treat everything as if it were a nail." This Law may be working in some areas of Psychometrics, too.
If the only cognitive tools you have are ordinary arithmetic and statistics on the real line (i.e., all negative and positive numbers), then you may be tempted to treat ALL measures as REAL numbers and apply the usual arithmetic and statistical operations (plus, times, arithmetic mean, etc.) on REAL numbers to get whatever you want, e.g. estimates and predictions.
The Hammer of IRT/Rasch
This seems to be the case for IRT based on a naive concept of scores. Scores, in IRT's view, are just like real numbers, X for short, which can be neatly associated with (cumulative) probabilities via an elegant transformation (conversion) rule:
  • P(X) = exp(X) / (1 + exp(X) )
Once you have those probabilities, the whole machinery of probability theory and statistics can be invoked for descriptive or testing purposes. Adding one, two or three parameters (which have to be estimated) increases the explanatory power of the approach even if your responses are still binary 0 and 1's. Generalizing to polytomous and multidimensional IRT etc. is possible within the same adopted framework.
Still, all these developments rest on a highly questionable, implausible interpretation of X as the genuine measurement of an underlying (psychological) feature and P(X) as its associated probability. However, there is no empirical evidence or formal argument that such an interpretation is possible, necessary, meaningful or even correct.
On the contrary: it is highly counter-intuitive. There are no (latent) psychological features or variables like X that are strictly unbounded on either side, i.e. that could become minus infinity or plus infinity. If you happen to know one, please let me know.
An alternative view avoiding the concept of probability
The alternative is to view the real number X as a transformed score f(S), mapping the double-bounded interval [A,B] between given A and B to the real numbers.
  • For instance, A could be 1 and B could be n, with equal-spaced anchor points 1, 2, 3, ..., n, so that [A,B] is the underlying continuous scale of a discrete n-point ruler or scale for the user/respondent.
Now let S be such a score on the n-point scale from 1 to n, and define X such that :
  • exp(X) := (S - A) / (B - S)
Then X is well-defined: it is the (natural) logarithm of (S - A) / (B - S). Now, after a few simple manipulations we get:
  • S = (1- P(X)) * A + P(X) * B so that P(X) / (1 - P(X)) = (S - A) / (B - S).
In other words: P(X) is just a weighting factor required to get the position (location) of the score S on the chosen n-point scale from A to B. Indeed, we could replace P(X) simply by W(S), i.e. the weight of S on [A,B], to get:
  • S = (1- W(S)) * A + W(S) * B so that W(S) / (1 - W(S)) = (S - A) / (B - S).
In other words, W(S) = P(X) is just a normalized version of S given that S is constrained to [A,B]. No probability interpretation is required!
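A quick numerical check of the identities above (just the algebra, in R):

A <- 1; B <- 5                      # a 5-point scale
S <- 4                              # an observed score on [A, B]

X <- log((S - A) / (B - S))         # so that exp(X) = (S - A) / (B - S)
P <- exp(X) / (1 + exp(X))          # P(X), i.e. the weight W(S)

(1 - P) * A + P * B                 # recovers S = 4
P / (1 - P)                         # equals (S - A) / (B - S) = 3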
What's more: we can define the quasi-addition (+) of scores S and T on any scale [A,B]. It is, in fact, relatively simple:
  • S (+) T := [(S - A) * (T - A)] / [(S - A) * (T - A) + (B - S) * (B - T)]
This quasi-sum (+) of S and T has (almost) ALL the properties of the usual sum of two real numbers, however, it is closed on the interval between A and B. Defining quasi-multiplication (*) of a score S with a real number is also quite simple. And once we have quasi-multiplication, the quasi-additive inverse is correctly defined as (-1) (*) S. Together, Module Theory provides the right framework for working out the algebra of bounded scales and scores.
Introducing the interval scale on [A,B]
The biggest surprise of all: it turns out that the n-point scale endowed with this quasi-arithmetic addition operation is indeed an interval scale. The same holds for the standard percentage scale [0,1], so often used in educational contexts, but also in Fuzzy Set Theory and Fuzzy Logic.
CONCLUSION:
Rasch, one of the founders of IRT, was an original mathematician and statistician, but he was rather focussed on classical statistics and measurement theory from the physical sciences, and probably he didn't know about the then-recent developments from what is known now as quasi-arithmetic calculus and module theory.
Also, in his time, Stevens was one of the heroes of a rather naive (= operationalist) scale theory for psychology and other social sciences (see e.g. the critiques of J. Michell, Luce, and other measurement theorists).
I am sure, that Rasch would be delighted to see that it is indeed possible to do serious measurement (theory) in the social sciences without the counter-intuitive assumptions that he had to make.
________________________________________________________________________
  • asked a question related to IRT
Question
1 answer
Based on the published item response datasets LSAT and pcmdat2, the General Total Score and the Total Score are applied for numerical (high-stakes) scoring. The Item Response Theory (IRT) 2PL model is also applied for numerical scoring and compared with the results from the General Total Score and Total Score. Pieces of R code for computing the General Total Score are also offered.
______________________________________________________
Please click following link for full essay:
  • asked a question related to IRT
Question
6 answers
Hi! I would like to know whether there is any convention or standard for how long a national, large-scale educational assessment should remain in use. I've heard that even with an extensive item bank and regular equating, a test has a kind of "lifespan" (for example, a given test should be used for only 10 years before its psychometric properties degrade substantially). But how is this evaluated or measured, and is there any implicit rule? Is there a keyword for this?
Could you please provide me with published reference?
Thanks in advance!
Relevant answer
The American Psychological Association (APA) has issued clear standards for psychological and educational tests, so I agree with Dr. Marius Babici.
And this link may assist you too:
  • asked a question related to IRT
Question
9 answers
I am writing a paper assessing the unidimensionality of multiple-choice mathematics test items. The test is scored right/wrong, which means the data are on a nominal (dichotomous) scale. Some earlier research studies I have consulted used exploratory factor analysis, but with my limited experience in data management I think ordinary factor analysis may not work. Unidimensionality is one of the assumptions for dichotomously scored items in IRT. I would appreciate professional guidance, if possible including the software and the manual.
  • asked a question related to IRT
Question
6 answers
I intend to investigate the psychometric properties of a SAT with the 3PLM of IRT. The total population of my study is 16,328, and I was told that Krejcie and Morgan's table cannot be used as a yardstick for determining my sample size since I'm running my analysis with IRT software, which will require a much larger sample.
Relevant answer
Answer
I agree that the number of items, scales, and type of model will play a big part in determining the minimum sample size! That said, have a look at general rules of thumb for this type of analysis. Sorry I can't be of much help! Good luck.
  • asked a question related to IRT
Question
3 answers
I used the R package plink to link two scales (polytomous items).
I was able to get transformed item parameters through linking.
plink does not seem to have a function to obtain the test information function for the transformed item parameters.
1) If I have the item parameters, can I get the GPCM's test information function?
2) If you have implemented this, please let me know how you did it.
Thanks!
Relevant answer
Answer
Hello Myeonggi,
Yes, with item parameter estimates you may estimate test information by iteratively incrementing theta over the desired logit range (e.g., -3 to +3). See the classic reference by Frank Baker: http://echo.edres.org:8080/irt/baker/chapter6.pdf
Good luck with your work.
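A rough sketch of doing this by hand in R for the GPCM, using the standard result that GPCM item information equals a^2 times the conditional variance of the category score at each theta; the parameter values below are placeholders rather than real linked estimates:

# GPCM category probabilities for one item at ability theta
# a: slope; b: vector of step parameters (b_1 ... b_m)
gpcm_probs <- function(theta, a, b) {
  z <- c(0, cumsum(a * (theta - b)))   # cumulative sums; the 0th category is fixed at 0
  exp(z) / sum(exp(z))                 # probabilities for categories 0..m
}

# Item information: a^2 * Var(category score | theta)
gpcm_info <- function(theta, a, b) {
  p <- gpcm_probs(theta, a, b)
  k <- seq_along(p) - 1
  a^2 * (sum(k^2 * p) - sum(k * p)^2)
}

# Placeholder linked parameters for a 3-item test with 4 categories each
items <- list(list(a = 1.2, b = c(-1.0, 0.0, 1.1)),
              list(a = 0.8, b = c(-0.5, 0.4, 1.5)),
              list(a = 1.5, b = c(-1.4, -0.2, 0.9)))

theta_grid <- seq(-3, 3, by = 0.1)
tif <- sapply(theta_grid,
              function(th) sum(sapply(items, function(it) gpcm_info(th, it$a, it$b))))
plot(theta_grid, tif, type = "l", xlab = "theta", ylab = "Test information")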
  • asked a question related to IRT
Question
3 answers
I do research in educational assessment and focus on developing assessment instruments (tests). Thanks.
Relevant answer
Answer
The Rasch model encompasses dichotomous, partial credit, and rating scale data. Any or all can be used in one analysis.
  • asked a question related to IRT
Question
4 answers
I am using Item Response Theory (IRT) with the 3-Parameter Logistic Model (3PL) for a logic test. After training the model, I use the posterior means of the item parameters α, β and γ to estimate the person trait θ during the adaptive test. I want to introduce covariates, e.g. age, gender, etc., into the model for estimating a person's ability using latent regression. But I am not able to find any research on introducing covariates into an IRT model. Any guidance would be much appreciated.
Relevant answer
Answer
Hello Fahad,
Though you may have already considered this, it might be useful first to determine whether your proposed control variables make any difference. Tests of DIF (differential item functioning) would be one way to determine whether estimated item parameters are influenced by gender (and any other variables under consideration). If there is no evidence of DIF, then ability estimates should be consistent across subgroups; no control variables would be needed.
If you're using ability estimates for some other purpose (e.g., as a dependent variable in some subsequent analysis), and you want to add covariates, then there is nothing special about IRT ability estimates to prevent you from doing that.
Sorry if I've misunderstood your intent.
Good luck with your work.
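For the literature search, useful keywords are "latent regression IRT" and "explanatory item response models" (De Boeck & Wilson, 2004). If the model is fitted with R's mirt package, latent regression on covariates is available directly; a minimal sketch, assuming a reasonably recent mirt version, with resp as the response matrix and person_covs a data frame containing age and gender (placeholder names):

library(mirt)
fit <- mirt(resp, 1, itemtype = "3PL",
            covdata = person_covs,
            formula = ~ age + gender)   # latent regression of theta on the covariates
coef(fit, simplify = TRUE)              # item parameters; the latent regression coefficients are also reported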
  • asked a question related to IRT
Question
3 answers
When I showed my refined scale after running a CFA and EFA (11 items; 3 factors with 4, 4, and 3 items), a statistician at UCL told me that I had too few items per factor to run IRT.
Does anyone have any references to back this up?
Many thanks,
Leng
Relevant answer
Answer
Hello Leng,
Minimum items per scale isn't really an assumption associated with IRT. It's more a matter of how much information there needs to be in order to construct a stable scale/measure.
While you can conduct Rasch analysis (an IRT variant that posits just one item parameter, difficulty) with just two items (see Ben Wright's note on this: https://www.rasch.org/rmt/rmt122b.htm), in general, more items per scale is better in that you'll tend to get: (a) lower uncertainty associated with score/trait estimates; (b) better score reliability; and (c) more stable estimates of item parameters. However, there is also an inevitable point beyond which the additional precision does not outweigh the time, effort, and energy needed for testing with more items (e.g., diminishing returns).
Studies that look at the impact of test length in IRT research tend to use values such as 10-, 20-, or 30-item measures (here's an example: https://files.eric.ed.gov/fulltext/EJ1130806.pdf).
Can you conduct IRT with 3 or 4 items per scale? Yes.
Will the resultant scores be the best possible estimates of persons' locations on your scale, or yield the best estimates of item characteristics? No.
Good luck with your work.
  • asked a question related to IRT
Question
7 answers
I would be thankful if someone could give me a short breakdown of how to test the unidimensionality of polytomous (Likert-scale) data using IRT models in R.
Since my understanding is that the ltm package is obsolete, I'm looking for an equivalent to ltm::unidimTest() (perhaps within the mirt package?).
Thank you
Relevant answer
Answer
You can use the sirt package for this purpose. Both the DETECT and NOHARM procedures can be run with sirt: the expl.detect() function for the DETECT method and the noharm.sirt() function for the NOHARM method.
Hope this helps.
AFK
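An alternative route, not part of the answer above, is to compare one- and two-factor models in mirt and inspect overall fit, which serves a purpose similar to ltm::unidimTest(); a minimal sketch, with resp standing in for the polytomous response matrix:

library(mirt)
fit1 <- mirt(resp, 1, itemtype = "graded")   # unidimensional graded response model
fit2 <- mirt(resp, 2, itemtype = "graded")   # exploratory two-factor alternative
anova(fit1, fit2)                            # AIC/BIC and likelihood-ratio comparison
M2(fit1)                                     # limited-information fit of the one-factor model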
  • asked a question related to IRT
Question
2 answers
IRT - Item Response Theory
Relevant answer
Answer
If you plot the test information, this will show how much information there is at any given point of your latent trait scale, and the target value where the test information is maximized. The TIF is also an inverse function of the standard error of measurement, so where you have the most information you also have the least error. If you also plot the persons' abilities, you can see whether most of your persons fall in a range of the latent trait with a lot or a little information.
Was this the type of interpretation information you were looking for?
best
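If the model happens to be fitted with R's mirt package, both curves can be plotted directly; a small sketch assuming a fitted object called fit:

library(mirt)
plot(fit, type = "info")     # test information across the latent trait
plot(fit, type = "infoSE")   # information together with the standard error curve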
  • asked a question related to IRT
Question
4 answers
Hello RG community,
I have a questionnaire to validate. Most of the questions are so-called gated, e.g.: Do you have symptom XX? a. Yes, b. No. If yes, proceed to the next question: How much does symptom XX bother you? 1. not that much, 2. to some extent, 3. very much.
My solution is to make two questions out of one and re-code the derived question: How much does symptom XX bother you? 0. not at all, because I don't have it, 1. not that much, 2. to some extent, 3. very much.
The proposal comes with at least two risks: a. I am not sure whether the distance between zero and one is the same as that between the other categories, and b. I am not sure what I am validating! The new question is a re-framed version of the two questions, and I don't know the extent to which the response patterns would differ if patients had originally been given the made-up (derived) questionnaire to fill out.
The same problem applies with IRT.
Any comments?
Relevant answer
Answer
Hello Saeid,
It's not difficult to recode the answers from the two questions into a single variable (as your query suggests); I see no risk to response validity as a result. I agree that the scale is not interval (from 0 on); polytomous IRT would be a sensible option, presuming that you have a set of comparable symptoms to consider.
But you can still fruitfully compare cases with and without a symptom/diagnosis either way.
Good luck with your work.
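A small sketch of the recode described above, with hypothetical column names has_sx (the Yes/No gate) and bother (the 1-3 follow-up):

# Hypothetical columns: has_sx ("Yes"/"No") and bother (1, 2, 3, or NA when gated out)
dat$bother_recoded <- ifelse(dat$has_sx == "No", 0, dat$bother)
table(dat$bother_recoded, useNA = "ifany")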
  • asked a question related to IRT
Question
3 answers
I have a measure in which the indicators influence/give rise to the construct. This makes it formative; however, I don't think I can apply IRT to formative measurement models. Can anyone confirm whether this is the case, with an explanation of why?
Relevant answer
Answer
I think not: in IRT, item responses are not causes of the construct; instead, they are indicators of it. So using IRT for a formative model would be conceptually wrong.
  • asked a question related to IRT
Question
1 answer
In the literature on calculating IRT standard errors, I have repeatedly found Fisher information mentioned.
Being curious, I started playing around with the Fisher information in order to obtain the typical information reported as a^2 P(theta) Q(theta).
My understanding of the process failed me when I started to check why the variance of the score is defined as follows
score = s = d/dtheta ln( f(x,theta) )
Var(s) = E[s^2]
Given that the variance is
Var(s) = E[s^2] - E[s]^2
I started looking into why E[s]^2 is zero. As long as f(x,theta) is a density function I can write:
E[s]^2 = [ ∫ d/dtheta ln( f(x,theta) ) * f(x,theta) dx ]^2
       = [ ∫ d/dtheta( f(x,theta) ) * f(x,theta) / f(x,theta) dx ]^2
       = [ ∫ d/dtheta( f(x,theta) ) dx ]^2
       = [ d/dtheta( ∫ f(x,theta) dx ) ]^2
       = [ d/dtheta( 1 ) ]^2
       = 0
But as soon as we use the IRF (item response function), which gives us the probability of getting score x given theta, the computations above no longer work. The reason is that the integral of the IRF is not finite, hence the step
[ d/dtheta( 1 ) ]^2
is no longer valid.
I have demonstrated that
E[ ( d/dtheta ln( f(x,theta) ) )^2 ] = -1 * E[ d/dtheta( d/dtheta ln( f(x,theta) ) ) ]
but that holds only when f(x,theta) integrates to one and the simplifications above can be made.
Any input on my approach and (not) understanding of the problem?
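A standard way to see where the reasoning recovers (general IRT theory, not specific to this thread): for a dichotomous item the IRF defines a proper probability mass function over x in {0, 1}, so the integral argument goes through with a sum. For the 2PL, with P(theta) = 1 / (1 + exp(-a(theta - b))):

f(x,theta) = P(theta)^x * (1 - P(theta))^(1 - x),  x in {0, 1}

E[s] = sum over x of f(x,theta) * d/dtheta ln( f(x,theta) )
     = sum over x of d/dtheta f(x,theta)
     = d/dtheta [ sum over x of f(x,theta) ]
     = d/dtheta (1) = 0

and, since d/dtheta P(theta) = a * P(theta) * (1 - P(theta)),

I(theta) = E[s^2] = ( d/dtheta P(theta) )^2 / ( P(theta) * (1 - P(theta)) ) = a^2 * P(theta) * Q(theta),

which is exactly the familiar form.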
Relevant answer
Answer
I'm not sure it needs to be, but I'd recommend the Embretson and Reise (2000) IRT text, Item Response Theory for Psychologists.
Matt
  • asked a question related to IRT
Question
11 answers
Most software packages for analyzing polytomous models under the Item Response Theory approach show option characteristic curves as an output. Given that I have the data on those option characteristic curves, how would I calculate the item characteristic curve?
Relevant answer
Answer
How can I draw the item category response functions for a five-category item in SPSS? Please guide me.
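On the original question: for a polytomous item, the usual "item characteristic curve" is the expected score curve, i.e. the category scores weighted by their option probabilities at each theta. A small sketch in R with made-up option characteristic curves (replace occ with your own output):

theta <- seq(-3, 3, by = 0.1)
# Made-up option characteristic curves for a 3-category item (columns = categories 0..2)
occ <- cbind(plogis(-(theta + 1)),                  # P(category 0)
             plogis(theta + 1) - plogis(theta - 1), # P(category 1)
             plogis(theta - 1))                     # P(category 2)
scores <- 0:2
icc <- as.vector(occ %*% scores)                    # expected item score E[X | theta]
plot(theta, icc, type = "l", xlab = "theta", ylab = "Expected item score")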
  • asked a question related to IRT
Question
4 answers
Researchers do shorten questionnaires, but very little is known (?) about good practice in this field. There are probably methods rooted in both IRT and CTT, but has anyone empirically tested which method is more useful in which circumstances?
Relevant answer
Answer
Hi Emilia,
probably a bit late for you, but maybe interesting for other researchers struggling with the issue of shortening scales:
Scott Tonidandel and colleagues presented a Pareto optimization procedure for shortening scales according to a set of criteria (e.g. predictive validity of a chosen set of items) at SIOP 2019. There is a beta version of the tool to determine the best items for the purposes of a specific study:
However, pilot data are required.
Best,
Oliver
  • asked a question related to IRT
Question
2 answers
  1. The Total Score is defined as the number of correctly answered (dichotomous) items;
  2. A Sub-score is defined as the total score associated with a sub-scale;
  3. The Overall Score is defined as the total score associated with all testing items.
  4. For the Total Score, the Overall Score is the summation of its Sub-scores, which is called Additivity.
  5. For the Item Response Theory (IRT) ability (theta parameter), the relationship between the Overall Score and the Sub-scores is unavailable.
  6. Comment: (5) implies IRT has no Additivity. Therefore, with IRT ability, the Sub-scores and the Overall Score cannot be available simultaneously. This fact strongly indicates that IRT is not a correct theory for high-stakes scoring, while the Total Score in (4) is (although only as a special case).
Relevant answer
Answer
Hi, Matthew,
Thank you for your interest in this topic.
(1) In MIRT, all the latent variables represent the scores associated with sub-scales, but no latent variable stands for the Overall scale. Further, the covariance between sub-scales captures only the linear part of their mutual relations; the mutual information beyond the linear part, and the interactions involving more than two sub-scales, are totally missing in MIRT. Again, the key argument is that, in MIRT, there is no latent variable representing the Overall Score.
(2) In multivariate statistics, the ONLY reason to put all the variables into a single system is that they interact; otherwise, (jointly) independent variables should be studied individually (and therefore more easily). In IRT, the assumption of conditional independence shouldn't be there because, in the real world, it is rarely true. Now, the issue is: without any unrealistic precondition, why can IRT not express its Overall score in terms of its sub-scores? That is, does IRT have an Overall score, and what does the latent variable (theta parameter) in IRT stand for?
  • asked a question related to IRT
Question
3 answers
1PL IRT model
The mathematical formulas for finding the item (question) difficulty are:
  1. P(X = 1) = e^(theta - b) / (1 + e^(theta - b))
  2. ability (theta) = ln(p / (1 - p)), where p = the count of correct answers, i.e. the count of 1s for each student
  3. difficulty (b) = ln(p / (1 - p)), where p = the count of 1s for each question.
  4. That is the 1PL model.
  5. The formula for the 2PL model is
  6. P(X = 1) = e^(a(theta - b)) / (1 + e^(a(theta - b))), where a = the discrimination parameter.
  7. What is the formula for finding the discrimination parameter a?
Relevant answer
Answer
Hello Snehali,
The discrimination parameter, a, is proportional to the slope of the fitted item response curve. Obviously, in the 2PL model, there is the potential for each item/variable to have a unique value.
Good luck with your work.
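To add to that: there is no simple count-based formula for a in the 2PL; it is estimated (typically by marginal maximum likelihood) together with b. A classical normal-ogive approximation is a ≈ r / sqrt(1 - r^2), where r is the item's biserial correlation with the trait, but in practice you would let software estimate it; a minimal sketch in R's mirt package, with resp as the 0/1 response matrix:

library(mirt)
fit <- mirt(resp, 1, itemtype = "2PL")
coef(fit, IRTpars = TRUE, simplify = TRUE)$items   # columns a (discrimination) and b (difficulty)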
  • asked a question related to IRT
Question
4 answers
Dear colleagues,
I have obtained thetas from a 2PL IRT model based on a 12-item test of financial literacy. I am curious whether I can further use those theta scores (in original or modified form) in a logit regression as one of the independent variables? I cannot seem to find any literature that supports or dissuades from doing so.
Relevant answer
Answer
Hello Dmitrij,
I think you can, provided the 2PL model is appropriately estimated with a sufficient sample size.
Good luck
  • asked a question related to IRT
Question
1 answer
Dear all
I am working on research that requires me to simulate data with LD (locally dependent) item pairs (surface LD). I am not very familiar with R packages, so I faced some problems when I used code from other researchers such as Houts (2011). I set up the holders for the LD pairs and completed the code, but all the elements of the matrices are equal to zero, and no raw data files are produced. I would appreciate it if you could help me, or if you have a detailed explanation of the code you used to simulate raw data with LD.
Relevant answer
Answer
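In case it helps, a rough base-R sketch (not Houts' code) of one common way to build surface local dependence into simulated dichotomous data: give one item pair a shared nuisance dimension on top of the target trait. All parameter values are placeholders:

set.seed(123)
N <- 1000; n_items <- 10
a <- rep(1.2, n_items)                      # slopes on the target trait
d <- seq(-1.5, 1.5, length.out = n_items)   # intercepts
a_ld <- c(rep(0, 8), 1.0, 1.0)              # items 9 and 10 form the LD pair

theta <- rnorm(N)                           # target trait
nuis  <- rnorm(N)                           # nuisance factor shared only by the LD pair

eta  <- outer(theta, a) + outer(nuis, a_ld) + matrix(d, N, n_items, byrow = TRUE)
resp <- (matrix(runif(N * n_items), N, n_items) < plogis(eta)) * 1
write.csv(resp, "sim_LD_data.csv", row.names = FALSE)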
  • asked a question related to IRT
Question
27 answers
Dear all, I am searching for a comparison of CTT with IRT. Unfortunately, I mostly get an outline of both theories, but they are not compared. Furthermore, generally the "old" interpretation of CTT using axioms is used and not the correct interpretation provided by Zimmerman (1975) and Steyer (1989). The only comparison of both theories that I found was in Tenko Raykov's book "Introduction to Psychometric Theory". Does anybody know of any other sources?
Kind regards, Karin
Relevant answer
Answer
Greg, thanks for your detailed reply. I know that it is always described in text books, but the equation X = T + E is not a model. Therefore, I do not agree with you concerning this point. Raykov (2011, p. 121) explains this quite clearly:
“In a considerable part of the literature dealing with test theory (in particular, on other approaches to test theory), the version of Equation (5.1) [X = T + E] for a given test is occasionally incorrectly referred to as the ‘CTT model’. There is, however, no CTT model (cf. Steyer & Eid, 2001; Zimmerman, 1975). In fact, for a given test, the CTT decomposition (5.1) of observed score as the sum of true score and error score can always be made (of course as long as the underlying mathematical expectation of observed score—to yield the true score—exists, as mentioned above). Hence, Equation (5.1) is always true. Logically and scientifically, any model is a set of assumptions that is made about certain objects (scores here). These assumptions must, however, be falsifiable in order to speak of a model. In circumstances where no falsifiable assumptions are made, there is also no model present. Therefore, one can speak of a model only when a set of assumptions is made that can in principle be wrong (but need not be so in an empirical setting). Because Equation (5.1) is always true, however, it cannot be disconfirmed or falsified. For this reason, Equation (5.1) is not an assumption but rather a tautology. Therefore, Equation (5.1)—which is frequently incorrectly referred to in the literature as ‘CTT model’—cannot in fact represent a model. Hence, contrary to statements made in many other sources, CTT is not based on a model, and in actual fact, as mentioned earlier, there is no CTT model.”
But there are models developed within the framework of CTT. If one posits assumptions about true scores and errors for a given set of observed measures (items), which assumptions can be falsified, then one obtains models. This is closely related to confirmatory factor analysis, because the CTT-based models can be tested using CFA. If one assumes unidimensionality and uncorrelated errors, this would be a model of tau-congeneric variables, because these assumptions can be tested.
Raykov, T. & Marcoulides, G.A. (2011). Introduction to Psychometric Theory. New York, NY: Routledge.
Steyer, R. & Eid, M. (2001). Messen und Testen (Measurement and Testing). Heidelberg: Springer.
Zimmerman, D. W. (1975). Probability spaces, Hilbert spaces, and the axioms of test theory. Psychometrika, 40, 395-412.
Kind regards,
Karin
  • asked a question related to IRT
Question
4 answers
"When in 1940, a committee established by the British Association for the Advancement of Science to consider and report upon the possibility of quantitative estimates of sensory events published its final report (Ferguson eta/., 1940) in which its non-psychologist members agreed that psychophysical methods did not constitute scientific measurement, many quantitative psychologists realized that the problem could not be ignored any longer. Once again, the fundamental criticism was that the additivity of psychological attributes had not been displayed and, so, there was no evidence to support the hypothesis that psychophysical methods measured anything. While the argument sustaining this critique was largely framed within N. R. Campbell's (1920, 1928) theory of measurement, it stemmed from essentially the samesource as the quantity objection." by Joel Michell
(1) Why "there was no evidence to support the hypothesis that psychophysical methods measured anything" because "the additivity of psychological attributes had not been displayed"?
(2) Item Response Theory (IRT) has no Additivity, Can IRT correctly measure educational testing performance?
Relevant answer
Answer
Dear Nan Kong,
Framing measurement as strictly true or false is very rigid and is not even valid for the physical sciences; any method measures with a certain degree of accuracy. What differentiates measurement theories is how they increase that degree of accuracy. It is not right to ignore the evolution of psychological measurement and all existing theories. Additivity is an important concept in all theories of psychological measurement, and each theory has its own way of collecting specific evidence for additivity.
Best
  • asked a question related to IRT
Question
11 answers
My supervisor has suggested using IRT after a long discussion about using CTT. I am confused about whether to use both, or one or the other.
If I were to choose IRT only, which model should I use: Rasch, or the 2- and 3PLMs? And if I use any of these models, what would be the ideal sample size?
Any advise would be very helpful.
Relevant answer
Answer
Hi Leng
Although IRT is generally put forward, both theories have advantages and disadvantages. For example, if you are working with a small sample, it would not be appropriate to use IRT models (except for non-parametric IRT models). Many studies suggest a sample size of at least 500 for IRT, while an ideal sample size of 1000 or more is recommended. In fact, the ideal sample size depends on the data structure, the number of items, and the item types. More information on the ideal sample size can be found in sources on IRT (1 - DeMars, C. (2010). Item response theory. New York: Oxford University Press; 2 - Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Springer). However, many studies show that both theories give similar results when the test length and sample size are sufficient. It can be said that the differences that do occur are practically insignificant.
On the other hand, in almost all scale development studies, statistics based on classical test theory are reported. These two theories are now treated as complementary rather than as alternatives. The main decision-making process lies in the choice of IRT model. Which model can be used depends on the item type, data structure, distribution pattern, and model-data fit statistics. Finally, it should not be forgotten that there are alternatives such as Rasch and Mokken scaling. Nevertheless, I would like to repeat that reporting results based on classical test theory is a widely accepted approach, whichever alternative theory is used.
good luck
  • asked a question related to IRT
Question
4 answers
If the height of a test information curve (not the item information curve) indicates the discriminability of a test at a given level of a trait, then doesn't it follow that the range of test scores in the tails of this distribution is uninformative (unreliable)? It seems to me that an inevitable conclusion from IRT is that, for most published scales, extreme test scores will be inherently noisy and therefore should not be given prominence in data analysis (e.g., treating test scores as a continuous variable and using all of the data including extreme scores), because of the high leverage these data points will have in determining the solutions. At the very least, it seems IRT would compel researchers either to trim their data (e.g., omit the top and bottom 5 or 10% of scores) or in some cases to treat the data discretely and perform ANOVAs instead. How does one reconcile the test information curve with the prescription to analyze data as a continuous variable without trimming extreme scores?
Relevant answer
Answer
My general feeling is that IRT and CTT aren't *that* different from each other. Take a look at R. P. McDonald's Test Theory: A Unified Treatment, 1999, LEA for an articulation of this view; Rod essentially considered CTT to be a highly restricted version of the more general factor model. (McDonald was one of my professors but he should speak for himself.) The underlying error model is quite similar and if you want a good link between the two, consider the congeneric test model (aka Spearman factor model). Charlie Lewis wrote a nifty little article on this that was published as a chapter in the edited volume by C. R. Rao and Sandhip Sinharay (2007, Handbook of Statistics Vol 26: Psychometrics.) See also the excellent book by Anders Skrondal & Sophia Rabe-Hesketh (2004, Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models). Warning, this book is highly technical.
IRT quantities are conditional, whereas CTT marginalizes over the sample. This is a substantial point and allows for the power of IRT, such as the ability to generate vertical equated scales, deal with missing-at-random incomplete observations, and to tailor tests for specific positions. IRT parameters are much less sample-dependent than CTT parameters. (The logic is exactly the same for why mixed logit and marginal logit differ.) In my experience, though, a poor IRT analysis is typically paralleled by poor CTT statistics, which is one reason applied testing programs almost always report CTT statistics as well as IRT analyses.
IRT's benefits come along with much stronger assumptions, which may, of course, be wrong and are much more demanding of the analyst. In addition, its benefits really depend on the use of pattern scoring. If you just revert to sum scoring, there is little point to IRT.
Quantities like the information curve CAN tell you something about whether you should switch from IRT to LCA. My general feeling is that if the information curve is "weird" looking, it's possibly a sign that the model is wrong. For instance, I fit an analysis where LCA was more appropriate in the spring. What I saw was an information curve that was highly unstable to model assumptions and had a rather unbelievable peak. This suggested that some of the items were very near to being Guttman items (though not quite). More examination revealed three not strongly distinguished latent classes and some unusual interactions, where examinees that had a higher state of knowledge were falling for attractive distractors on some items, so that lower knowledge state examinees got more credit.
If you want a book (kind of so-so, but with many interesting ideas) that discusses why you may want to use an IRT perspective with a discrete latent variable, take a look at Bartolucci, Bacci, & Gnaldi's Statistical Analysis of Questionnaires (2015, CRC). They advocate for the use of a discrete latent variable which represents a restricted LCA model. There are other examples.
  • asked a question related to IRT
Question
3 answers
Hi everyone,
I need to find an IRT course on my supervisor's advice. Does anyone have any recommendations? My Google search has turned up little, and UCL only offers generic statistics courses.
I am UK-based, but I am open to online courses/tutorials as well.
Thanks,
Leng
Relevant answer
Answer
The Psychometrics Centre at Cambridge has strong work in IRT and CAT, with people like Profs. John Rust, David Stillwell, and Michal Kosinski (now at Stanford). The most/best IRT/CAT work in Europe is generally in the Netherlands though, so I recommend you check over there (Profs. Theo Eggen, Cees Glas, Bernard Veldkamp, and more).
If you are willing to take some American courses, there are a few online. The University of Illinois at Chicago has an entire online Masters degree in educational assessment and psychometrics, however they are most definitely in the Rasch sect and not IRT in general. The best person to contact there is Prof Everett Smith.
I have plans to make a free course myself someday, but currently only have posted some videos (youtube.com/ascpsychometrics) and lecture notes from a grad course I used to teach (http://www.assess.com/lecture-notes-graduate-course-assessment/). These are probably not as deep as you need.
Also, my friend Fernando is right about Templin having a lot of good resources, especially about CDMs, but that is another topic.
  • asked a question related to IRT
Question
7 answers
I'm using IRT for a DIF analysis by gender. However it occurs to me that it might be possible to do DIF in a Mokken scale analysis framework. Is there a standard protocol for doing this?
Relevant answer
Answer
There is a way to investigate DIF in Mokken, but it is not as simple as in IRT models. In Mokken you can fix the order of the item steps to see if they are in the same order in all groups. In MSP5Win, you can get a group difference test. In R, you can calculate the same results if you follow the instructions in the latest manual. I have used the procedure in 'Apotheken door Clienten Bekeken: schaalanalyses en profielscores 2009' (tables, page 11) to test six variables for DIF. The article is in Dutch and unpublished, as the method was replaced with the CQI. The item step orders are calculated for the total group, but if you take the item step order for a specific subgroup, no DIF can be shown in any of the variables (see Grafiek 3 and 4). The very risky variable in the tables is 'ontevreden' (dissatisfied with the pharmacist's service and having a complaint). I expected this variable could potentially destroy the scale altogether: the group having a complaint could use the categories and items very differently. It did not. I had some DIF, but the critical value stayed below 80 (see the MSP manual). Hope this helps. I hope you have MSP5Win; in R it is harder to do.
  • asked a question related to IRT
Question
27 answers
  • Applications of Rasch analysis and IRT models are becoming increasingly popular for developing and validating patient-reported outcome measures. Rasch analysis is a confirmatory approach in which the data have to meet the Rasch model's requirements to form a valid measurement scale, whereas IRT models are exploratory models aiming to describe the variance in the data. Researchers seem to be divided on the preference of one over the other. What is your opinion about this dilemma in the development of patient-reported outcome measures?
Relevant answer
Answer
Rasch requires the data to fit the model in order to generate invariant, interval-level measures (sic.) of items and persons. It is prescriptive. IRT models attempt to create a model that will fit the data. They are descriptive. While IRT users see Rasch as a particular IRT model, most Rasch proponents see it as distinctly different from other IRT models. The key differences are philosophical. Wiki provides a suitable introduction:
You might recall Fan's infamous comparison paper. I can add a critique of that if you wish.
  • asked a question related to IRT
Question
12 answers
(1) Item Response Theory (IRT) is an incorrect theory for high-stakes scoring; (2) IRT has no additive structure and no decomposition theory; (3) IRT is not consistent with the Total Score and the General Total Score; (4) IRT is not self-consistent, and IRT is not complete.
Relevant answer
Answer
Measurement theory, not Item Response Theory, is what is needed for high stakes assessments, as well as for low stakes classroom assessments. Recent dialogues between metrologists (metric system engineers) and psychometricians elaborate the value of developing scientific models of psychological and social constructs, instead of focusing on talking at cross purposes with advocates of statistical modeling (i.e., IRT). For more information, see Mari and Wilson (2014, 2015), Pendrill (2014), Pendrill and Fisher (2013, 2015), Wilson (2013a, b), Wilson, et al. (2015), Wilson and Fisher (2016).
Results of IRT analyses typically vary depending on what starting values are used for the discrimination and guessing parameters; it is this condition that makes the model unidentifiable (San Martin, Gonzalez, & Tuerlinckx, 2015; Verhelst & Glas, 1995; also see Bamber & van Santen, 1985). Identifiability is a condition in which the model parameter estimates must be associated with a reproducible probability distribution. Different estimates should result in different probability distributions, and the same ones should not. Wright (1984, 1997) provides further background.
1. Verhelst and Glas (1995, pp. 235-236) point out that: "Although use of a model like the OPLM [One Parameter Logistic (Rasch) Model] may seem old-fashioned to some psychometricians in view of the availability of the 2PLM, which seems to be much more flexible, one should be careful…. In Section 12.1 it was pointed out that joint ML-estimation of person and item parameters in the 2PLM is equivalent to assuming an over-parameterized multinomial model that is not identified. The consequence of this is that either unique estimates do not exist or, if they exist, that some restrictions hold between the estimates, entailing inconsistency of the estimators, because no restriction is imposed on the parameter space. ... Notwithstanding this identification problem, the computer program LOGIST (Wingersky, Barton, & Lord, 1982), where this estimation procedure is implemented, is still used.”
2. Stocking (1989, p. 7) concurs, saying that:
a. "...both LOGIST and BILOG [popular IRT software programs] results depend not only on the information contained in the response data, but also on information supplied by the researcher as may be required to produce reasonable and efficient solutions (e.g., starting values, boundaries, prior distributions, etc.). Thus the empirical properties of estimates produced by either program can be expected to differ across tests and across samples of examinees."
b. Because IRT estimates tend to diverge toward infinity in the course of analysis, LOGIST implements a measurement model on alternative iterations, with the effect that some degree of convergence (though not complete convergence) can be obtained (Stocking, 1989, p. 20).
3. IRT advocates themselves admit that “The [IRT 2-PL, 3-PL] theta-scale, or any linear transformation of it, however, does not possess the properties of a ratio or interval scale, although it is popular and reasonable to assume that the theta-scale has equal-interval properties" (Hambleton, Swaminathan, & Rogers 1991, p. 87). Is "popular and reasonable" enough of a basis for high stakes accountability, especially given the ready availability of models based in mathematical proofs of a separability theorem, where the observed score is shown to be necessary and sufficient to the estimation of the parameters? See Andersen (1977, 1999), Fischer (1981), Andrich (2010).
4. Andrich (1988, p. 67) agrees, saying, “The [IRT 2-parameter] model destroys the possibility of explicit invariance of the estimates of the person and item parameters.”
5. Lumsden (1978, p. 22), writing in the British Journal of Mathematical and Statistical Psychology, said "The two- and three-parameter logistic and normal ogive scaling models should be abandoned,” since, as Wood (1978, p. 31) explains, “test scaling models are self-contradictory if they assert both unidimensionality and different slopes for the item characteristic curves," as IRT does.
6. Finally, Embretson (1996, p. 211) also agrees, saying “It is sometimes maintained that the Rasch model is too restrictive and does not fit real test data sufficiently well. However, even if a more complex IRT model is required to fit the data, the total score scale would not provide a relatively better metric. In fact, if item discrimination parameters are required to obtain fit, total score is not even monotonically related to the IRT theta parameters. The IRT trait score, even for equal total scores, would depend on which items were answered correctly."
References
Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.
Andersen, E. B. (1999). Sufficient statistics in educational measurement. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 122-125). New York: Pergamon.
Andrich, D. (1988). Sage University Paper Series on Quantitative Applications in the Social Sciences. Vol. series no. 07-068: Rasch models for measurement. Beverly Hills, California: Sage Publications.
Andrich, D. (2010). Sufficiency and conditional estimation of person parameters in the polytomous Rasch model. Psychometrika, 75(2), 292-308.
Bamber, D., & van Santen, J. P. H. (1985). How many parameters can a model have and still be testable? Journal of Mathematical Psychology, 29, 443-73.
Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201-212.
Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.
Hambleton, R. K., Swaminathan, H., & Rogers, L. (1991). Fundamentals of item response theory. Newbury Park, California: Sage Publications.
Lumsden, J. (1978). Tests are perfectly reliable. British Journal of Mathematical and Statistical Psychology, 31, 19-26.
Mari, L., & Wilson, M. (2014, May). An introduction to the Rasch measurement approach for metrologists. Measurement, 51, 315-327. Retrieved from http://www.sciencedirect.com/science/article/pii/S0263224114000645
Mari, L., & Wilson, M. (2015, 11-14 May). A structural framework across strongly and weakly defined measurements. Instrumentation and Measurement Technology Conference (I2MTC), 2015 IEEE International, pp. 1522-1526.
Pendrill, L. (2014, December). Man as a measurement instrument [Special Feature]. NCSLi Measure: The Journal of Measurement Science, 9(4), 22-33.
Pendrill, L., & Fisher, W. P., Jr. (2013). Quantifying human response: Linking metrological and psychometric characterisations of man as a measurement instrument. Journal of Physics: Conference Series, 459, http://iopscience.iop.org/1742-6596/459/1/012057.
Pendrill, L., & Fisher, W. P., Jr. (2015). Counting and quantification: Comparing psychometric and metrological perspectives on visual perceptions of number. Measurement, 71, 46-55. doi: http://dx.doi.org/10.1016/j.measurement.2015.04.010
San Martin, E., Gonzalez, J., & Tuerlinckx, F. (2015). On the unidentifiability of the fixed-effects 3 PL model. Psychometrika, 80(2), 450-467.
Stocking, M. L. (1989). Empirical estimation errors in item response theory as a function of test properties (Educational Testing Service Research Report 89-05 No. ERIC Document ED395027). Princeton, New Jersey: Educational Testing Service. (ETS Research Reports).
Verhelst, N. D., & Glas, C. A. W. (1995). The one parameter logistic model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations recent developments, and applications (pp. 215-237). New York: Springer.
Wilson, M. R. (2013, April). Seeking a balance between the statistical and scientific elements in psychometrics. Psychometrika, 78(2), 211-236.
Wilson, M. R. (2013). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46, 3766-3774. Retrieved from http://www.sciencedirect.com/science/article/pii/S0263224113001061
Wilson, M., & Fisher, W. (2016). Preface: 2016 IMEKO TC1-TC7-TC13 Joint Symposium: Metrology Across the Sciences: Wishful Thinking? Journal of Physics Conference Series, 772(1), 011001. Retrieved from http://iopscience.iop.org/article/10.1088/1742-6596/772/1/011001/pdf
Wilson, M., Mari, L., Maul, A., & Torres Irribara, D. (2015). A comparison of measurement concepts across physical science and social science domains: Instrument design, calibration, and measurement. Journal of Physics: Conference Series, 588(012034), http://iopscience.iop.org/1742-6596/588/1/012034.
Wingersky, M. S., Barton, M. A., & Lord, F. M. (1982). LOGIST [Computer program]. Princeton, NJ: Educational Testing Service.
Wood, R. (1978). Fitting the Rasch model: A heady tale. British Journal of Mathematical and Statistical Psychology, 31, 27-32.
Wright, B. D. (1984). Despair and hope for educational measurement. Contemporary Education Review, 3(1), 281-288 [http://www.rasch.org/memo41.htm].
Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].
  • asked a question related to IRT
Question
4 answers
Good evening everyone,
I would like to conduct an Item Response Theory differential item functioning (DIF) analysis.
I have two groups that answered a test. This test has a bifactor structure with dichotomous responses.
What is, in your opinion, the most appropriate technique (and software) for conducting this kind of DIF analysis?
Thank you,
José Ángel
Relevant answer
Answer
There's only a handful of software that perform bifactor-based IRT estimation, so you're limited in that respect. The ones that come to mind are flexMIRT, IRTPRO, and the mirt package in R (the last of which is free, of course).
Performing likelihood ratio tests on nested models, or Wald-based tests, is perfectly valid for these models, though due to the dimension-reduction estimation structure, what you can end up testing may be a little limited from a full-blown DIF perspective.
  • asked a question related to IRT
Question
3 answers
Immunoreactive trypsinogen (IRT) levels can be obtained from blood as well as serum. I wonder how stable IRT is in -20 and -80 °C storage, how long serum can be stored at these temperatures, and how freeze-thaw cycles can influence IRT levels.
Relevant answer
Answer
Hi, Lilianna Bakinowska
How long serum can be stored at -20 and -80 °C:
at -20 °C: 3 months
at -40 °C: 1 year (frozen)
at -80 °C: 3 years (cryo)
at -173 °C: 9 years (liquid nitrogen)
at -273 °C: unlimited (absolute zero)
regards
  • asked a question related to IRT
Question
6 answers
I'm exploring the concept that scales can be ordered and that certain items should carry more weight in a scale. I came across Guttman scalograms and Mokken scaling techniques. Creating the initial Mokken scale makes sense. What I don't get is, after I get my H coefficients and run the AISP with Mokken, how do I evaluate the data in a meaningful way?
If I use a Mokken analysis on a 10-item Likert survey, it wouldn't make sense to get an overall composite mean score to represent the latent ability, since I established that items are ordered by difficulty. Do the H coefficients determine item weights? How can I sum a participant's score on my newly created Mokken scale?
Relevant answer
Answer
With regard to IRT, I would suggest you start with non-parametric IRT (the KernSmoothIRT package in R) to evaluate/describe how the response options function for all items. You may want to drop some if they do not work well with overall scores. If the response options look OK, move to a parametric graded response model (the ltm package in R) for item parameter estimation, identification of response patterns, and construction of a standard error of measurement that reflects where on the continuum the scale performs best/worst. It might also help to start with the theory behind the scale's development.
  • asked a question related to IRT
Question
4 answers
Hi,
Can anyone tell me whether it is possible to predict factor scores from an IRT model for each factor (latent trait), as in ordinary PCA?
Best,
Davit
Relevant answer
Answer
The Rasch model, instantiated in Winsteps, and PCA have completely different purposes. Factor scores presume more than one factor; Rasch requires just one "factor", i.e. just one person measure per test.
  • asked a question related to IRT
Question
2 answers
I'm looking for a way to compute fit measures for polytomous Item Response Theory models, more specifically for the Graded Response Model, the Generalized Partial Credit Model, and the Generalized Graded Unfolding Model. The ltm package for R provides some fit measures, but, for example, in the analyses I've run on a variety of scales, the p-value from the goodness-of-fit test was always exactly .01, which I find weird. The GGUM2004 software only provides item-fit measures, with no global measure of model fit. I was wondering whether there are ways to compute additional fit indicators, such as RMSEA and other statistics commonly used in SEM, and if yes, how I could calculate them?
Relevant answer
Answer
Dear Rana,
Thank you for your kind suggestion. I will surely check out the IRTPRO software!
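For later readers: SEM-style global fit indices (RMSEA, SRMSR, CFI/TLI) can also be obtained for polytomous IRT models from the limited-information M2 statistics in the R package mirt. A minimal sketch, assuming 'resp' is a polytomous response matrix:
library(mirt)

grm_mod  <- mirt(resp, 1, itemtype = "graded")  # graded response model
gpcm_mod <- mirt(resp, 1, itemtype = "gpcm")    # generalized partial credit model

M2(grm_mod)    # M2 with RMSEA, SRMSR, TLI and CFI
M2(gpcm_mod)   # for short scales with many categories, M2(mod, type = "C2") may be needed
anova(grm_mod, gpcm_mod)   # AIC/BIC comparison of the two (non-nested) models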
  • asked a question related to IRT
Question
8 answers
In IRT models the difficulty of items has the same metric and equivalence to the ability of the people (both are in logits) I want to do the same with a CFA model. I want to have items in the same metric and equivalence as the factor scores of my only latent variable. How can I do this? I am using lavaan package in R and I have all the matrices (mean of items, variance covariance matrix and residual matrix of items) but I am stuck in that step. I have used a WLSMV algorithm because my data does not show a multivariate normal distribution. 
Can somebody help me please? Thank you!
Relevant answer
Answer
I just tried, and
(1) the factor scores are perfectly correlated whatever anchoring is used (var(FS) = 1, or loading(item1) = 1, or loading(item2) = 1, etc.), but
(2) the metrics are different: both the means and the variances differ.
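To make that concrete, a minimal lavaan sketch of the two anchorings (the model syntax and item names x1-x4 are hypothetical):
library(lavaan)

model <- 'F =~ x1 + x2 + x3 + x4'

# Marker-item identification: loading of x1 fixed to 1.
fit_marker <- cfa(model, data = dat, ordered = c("x1", "x2", "x3", "x4"), estimator = "WLSMV")
# Standardized-latent identification: factor variance fixed to 1.
fit_stdlv  <- cfa(model, data = dat, ordered = c("x1", "x2", "x3", "x4"), estimator = "WLSMV",
                  std.lv = TRUE)

fs1 <- lavPredict(fit_marker)
fs2 <- lavPredict(fit_stdlv)
cor(fs1, fs2)        # essentially 1
c(sd(fs1), sd(fs2))  # but the scales (metrics) differ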
  • asked a question related to IRT
Question
9 answers
I want to examine the factor structure of two scales (scale A: 14 items; scale B: 26 items). We are trying several methods: CFA followed by EFA, Rasch analysis, and IRT. IRT seems to be an increasingly popular methodology for this purpose. But I have also come across some studies using meta-SEM to examine factor structure. It basically involves getting a pooled inter-item correlation matrix from the respective inter-item correlation matrices of a batch of related studies, then running SEM on the pooled matrix to test the factor structure.
For meta-SEM, is an item-level intercorrelation matrix a must? Can it be based on intercorrelations among subscales instead of all individual items?
Which method would be relatively more rigorous? Any recommended examples or resources? Thanks a lot!
Relevant answer
Answer
IRT has benefits but generally assumes unidimensionality of the scale.  If you think that your scale will contain more than 1 factor then you may consider some other means to establish the structure of your scale.  IRT, however, will permit fewer items and greater precision of those items.
More traditional methods of scale validation include the reflective models of EFA and CFA. Your choice of EFA depends upon your a priori conceptualization of the theoretical orientation of your scale, but principal axis factoring is often a good choice. Once you have conducted your EFA, refine your model (i.e., by removing items that have poor loadings (< .30 or .40) or cross-loadings). You should then use CFA (ML, assuming multivariate normality) to further refine and establish the viability of your model.
As for ESEM (and Bayesian SEM), it is indeed a newer technique that holds promise. However, the more traditional methods of EFA/CFA have a long-standing research tradition, and you will face less criticism from reviewers should you go with those techniques.
My two cents.
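A minimal sketch of that EFA-then-CFA workflow with the psych and lavaan packages (the item names, number of factors, and factor assignment are all hypothetical):
library(psych)
library(lavaan)

fa_res <- fa(scaleA, nfactors = 2, fm = "pa", rotate = "oblimin")  # principal axis factoring
print(fa_res$loadings, cutoff = 0.30)   # inspect loadings; drop weak or cross-loading items

cfa_model <- '
  F1 =~ a1 + a2 + a3 + a4 + a5
  F2 =~ a6 + a7 + a8 + a9 + a10
'
cfa_fit <- cfa(cfa_model, data = scaleA, estimator = "ML")
fitMeasures(cfa_fit, c("cfi", "tli", "rmsea", "srmr"))
Ideally the EFA and the CFA would be run on separate halves of the sample.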
  • asked a question related to IRT
Question
9 answers
I have a dataset with a lot of missing values due to the fact that this data was generated by a computer adaptive test. 
I want to show unidimensionality of the test, but factor analysis is a problem due to the missing values. The paper below seems to suggest that we can use one's theta estimate (I'm using IRTPRO to estimate it with the EAP method) to calculate expected scores and impute them, which seems intuitive. Is this a correct approach?
Can I simply proceed with factor analysis after this, or would I need to round the probabilities to 0s and 1s? Are there other methods?
Regards, Dirk
Relevant answer
Answer
You are taking on something that is very challenging because of the amount of missing data and the pattern of missingness. Also, the number of observations per test item will differ, and I imagine that some will have very low frequency. I would recommend first creating a large person-by-item matrix and then sorting the item dimension by the frequency of observations. Then plan on doing your analysis on the items that have a frequency greater than some number like 100 (I would prefer 200, but I don't know how large your dataset is), and check the dimensionality on that smaller set of items. Using a program that will handle the missing data is a necessity. It is also possible to impute the missing values, but I would not impute the missing item responses with a unidimensional IRT model; that will cause underestimation of the dimensionality. When I have done this, I have computed the correlation matrix for the items and then imputed the correlations for missing item pairs. The resulting correlation matrix may need to be smoothed to avoid negative eigenvalues (Bock has a procedure for doing this). Then I prefer doing parallel analysis on the correlation matrix. This all needs to be done cautiously because of all of the issues involved.
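A minimal R sketch of that correlation-based route with the psych package, assuming 'resp' is the person-by-item matrix with many NAs, a 200-observation cutoff, and that every retained item pair has at least some joint responses (pairs with none give NA correlations and would need to be imputed first, as described above):
library(psych)

keep  <- colSums(!is.na(resp)) >= 200           # retain items with enough observations
resp2 <- resp[, keep]

R <- cor(resp2, use = "pairwise.complete.obs")  # pairwise-complete correlations
                                                # (tetrachoric() is the usual choice for dichotomous items)
R_smooth <- cor.smooth(R)                       # smooth to a positive semi-definite matrix

fa.parallel(R_smooth, n.obs = nrow(resp2), fa = "pc")  # parallel analysis on the smoothed matrix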
  • asked a question related to IRT
Question
3 answers
Context : Performance test, dichotomous, one-dimensional (at least in theory), at least 3PL (variable pseudo-guessing). Theta is assumed normal. But I'm also interested in answers in general in IRT.
Problem : It seems to me that EFA factor loadings provide clear guidelines to rank/select a subset of items from a pool (with referenced rules of thumb, etc.) when one does not have any prior info/assumption of theta (aka for "all" test-takers).
On the other hand, IRT is, in my opinion, a much more accurate representation of the psychological situation that underlies test situations, but it seems to me that there are a variety (especially in IRT-3PL / 4PL) of parameters to take into account all at once to select items, at least without any prior estimation of theta.
So I'm wondering if you know of any guidelines/packages that can be referenced as a clear basis (meaning, not eye-balling item response functions) for item selection there. At this stage I'm thinking of a very non-parsimonious solution, like generating all possible item subsets (I'd get a LOT of models, but why not) -> fit IRT model -> compute marginal reliability (and/or information, or why not CFI, RMSEA, etc.) for thetas ranging between -3 SD and +3 SD -> rank the subsets by descending marginal reliability (but I'm afraid it would bias towards more items, so I'd perhaps have to weight by item count).
Anyway, you get the idea. Any known referenced procedures/packages?
Relevant answer
Answer
What often happens is that the IRT parameters, the proportion correct, and DIF statistics are computed for field-test items and all taken into account when constructing the final test, along with what the items are on (since field tests often have several items for one content area but only a few for others). So yes, this is "eye-balling" them, but it starts from criteria set by psychometricians, and the decisions are often made by the test developers. This is for large-scale tests.
As far as packages go, I'm not sure exactly what you want (if it is just choosing your own thresholds for various statistics, that could be coded fairly easily), but a big list of IRT packages is at:
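Separately from that list, a minimal mirt sketch of the brute-force idea in the question (not a referenced procedure): score candidate item subsets by marginal reliability and average test information between -3 and +3 SD. The object names and candidate subsets are hypothetical, and in practice the 3PL g parameters usually need priors to estimate stably.
library(mirt)

evaluate_subset <- function(items, data) {
  mod   <- mirt(data[, items], 1, itemtype = "3PL")
  theta <- matrix(seq(-3, 3, length.out = 61))
  list(items        = items,
       marginal_rxx = marginal_rxx(mod),           # marginal reliability
       mean_info    = mean(testinfo(mod, theta)))  # average information over -3 to +3 SD
}

subset_list <- list(1:10, c(1:5, 11:15))           # hypothetical candidate subsets
results <- lapply(subset_list, evaluate_subset, data = resp)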
  • asked a question related to IRT
Question
6 answers
So, I've been stuck trying to get item locations from mirt while using the GPCM. I know that eRm usually gives you both item locations and thresholds, but somehow I haven't been able to find out where they are in mirt. Using:
$ coefG <- coef(modG, IRTpars = TRUE, simplify = TRUE)
I've only been able to fetch discrimination and thresholds, but apparently the location is hidden somewhere. I need mirt because of the MHRM method, due to some sample issues... Also, it seems that mirt works with a different type of object, otherwise I'd be able to list its elements and find things out at once. So, does anyone have a clue how I can find the item locations among my results, or how to calculate and extract them from the thresholds?
Best!
Relevant answer
Answer
Assuming you define 'location' and 'thresholds' analogously to how they are defined in rating-scale models (Muraki, 1992), what coef(modG, IRTpars = TRUE, simplify = TRUE) returns as the 'bi' parameters is:
(item_location - item_category_threshold)
Because the item_category_thresholds for a given item sum to 0, you can compute the item location as the mean of the 'bi' parameters, and then compute the item_category_thresholds by subtracting the 'bi' parameters from the item location.
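A minimal sketch of that computation on the mirt output, assuming the GPCM b parameters appear as columns b1, b2, ... in the simplified coefficient matrix:
library(mirt)

coefG <- coef(modG, IRTpars = TRUE, simplify = TRUE)
b_mat <- coefG$items[, grep("^b", colnames(coefG$items)), drop = FALSE]

location   <- rowMeans(b_mat, na.rm = TRUE)  # item location = mean of the b parameters
thresholds <- location - b_mat               # category thresholds; each item's thresholds sum to 0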
  • asked a question related to IRT
Question
4 answers
I was wondering whether someone has a suggestion on how to create a unidimensional IRT model where item pairs (on symptoms) are conditional so that the one item asks about the presence of a symptom (Yes/No) and the other about the severity of the endorsed item (on a 5-step scale).
Conceivably reporting a symptom but rating the lowest severity could be indistinguishable on the latent trait from the "No" response, but how can I test this?
If I combine the item pairs into items with 6 response options, will a nominal model do the trick? Or -- assuming that symptom and severity items measure the same thing -- should I just use all variables in a Graded Response Model, accepting the systematic missingness, and compare a parameters for the item pairs? In the latter case the dependence isn't modelled in any way.
Relevant answer
Answer
Thank you all for your thoughtful responses. I'm summarizing what I learned in case someone later stumbles upon this discussion.
I believe latent response modelling is a way of making explicit our assumptions about how the world and our "instruments" interact, and, in addition, of making those assumptions plausible. Latent models can also be argued to simplify our data by finding the common factor driving responses. Furthermore, for science to be replicable and cumulative, we have to operationalize and quantify, even though this may currently be inferior to intuitive understanding.
Most of the literature suggested was unfortunately not applicable, as it pertained to, for instance, (1) local dependence between items, but not missingness dependent on the other item, (2) "hierarchical" as in second-order latent factor models, or (3) multidimensional IRT models (MIRT).
Phil Chalmers, the author of the mirt package for R, responded on the forum for that tool:
"I think you are correct that merging these two-stimulus questions into one is the best approach to avoid the use of NA's. Otherwise, the MAR assumption would be violated if the dataset contained NA's where participants reported no symptoms present (i.e., it is more likely to see an NA for low participants than high participants, which violates independence). I think an ordinal/graded model should be just fine; no need for a nominal model unless you are really worried about category ordering and wanted to verify."
I tried the nominal model, and the ordering of "strongly disagree" versus "disagree" with respect to distress was indeed indistinguishable for many items, suggesting that these categories could be collapsed and a graded model applied.
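For later readers, a minimal mirt sketch of the merging-plus-graded-model approach summarised above (the vector and object names are hypothetical):
library(mirt)

# Combine each Yes/No symptom item with its 5-step severity follow-up into one
# ordinal item: 0 = symptom absent, 1-5 = reported severity.
merged <- ifelse(symptom == 0, 0, severity)
# ... repeat (or vectorise) for every item pair and collect into 'merged_items'.

mod_graded  <- mirt(merged_items, 1, itemtype = "graded")
mod_nominal <- mirt(merged_items, 1, itemtype = "nominal")  # to check category ordering
anova(mod_graded, mod_nominal)                              # AIC/BIC comparison (not strictly nested)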
  • asked a question related to IRT
Question
1 answer
We are working on the development of diagnostic systems for cystic fibrosis. We are having problems in the storage of our IRT proteins. The protein is degraded upon freeze thawing. Storage in PBS seems to disrupt the structure. We tried storage at pH:4.8. It works better, but we are still not sure what is the best method for storage. Any suggestion will help while we are pursuing our optimization work.
  • asked a question related to IRT
Question
7 answers
Hi all!
I am running some IRT models using the mirt package for R (polytomous 2PL GRM). The scales I am analysing consist of 6 items, comprising 3 con-trait and 3 pro-trait items measured on 9-point Likert scales, and the sample size is around 3200.
When I run the item fit (S-X2), the output shows that the pro-trait items (items 3-5) fit the model and the con-trait ones (items 6-8) do not (output below). Several of the other scales I am testing have similar issues, but not always along the pro-trait/con-trait divide. Assumption testing showed a very strong one-factor solution and acceptable local dependence (LD) (Yen's Q3 pairs were around .02 - .25). Although many texts recommend the S-X2 test for polytomous item-level fit, none that I have found suggest what to do when the model does not fit the data (although several just say "close enough", which seems like troublesome logic).
     item          Zh      S_X2          df           p
1 ethnic_3    7.12    210.88       199         0.27
2 ethnic_4    8.33   199.63        192         0.34
3 ethnic_5    6.43    219.67       196         0.12
4 ethnic_6    5.31    286.77       211         0.00
5 ethnic_7    4.48    254.58       213         0.03
6 ethnic_8    1.87    308.25       234         0.00
I was wondering what levels of violation are acceptable and what we can do to increase the item-level fit? Is the only solution to change the model (1PL, 2PL, Partial Credit etc., ) or estimator used? 
Thanks in advance,
      Conal Monaghan
Relevant answer
Answer
Hi, Conal:
I think you are unlikely to find blanket recommendations for responding to significant S-X2 tests because while S-X2 can tell you that there is a problem, I don't believe that it can tell you why. For that, you have to interpret your other sources of information in the context of your theory. 
Given that the approach to each significant S-X2 test is unique, I can't offer any guesses about the item sets you do not describe here. In the case of the items you describe in your question, my guess is that you may have a multi-dimensionality issue. This guess is based on two pieces of evidence. First, the Q3 cutoff of .2 is a little arbitrary, so sometimes S-X2 picks up problems that Q3 misses. Second, you mention that the first three items are "pro" and the second three are "con." There's a fair amount of literature suggesting that negatively-worded or negatively-valenced items can form separate factors from their positive counterparts, even though they theoretically shouldn't. This "method effect" can often be controlled for by using a bifactor model. 
If what I just said sounds correct, you might want to consider running a bi-factor model to account for the possibility of the method effect. I've seen this done in a couple of different ways, most typically with all the items loading onto one factor (i.e., items 3-8 would load onto one factor, and that would be your content factor), and the negatively-worded items loading onto a second factor (i.e., items 6-8 would also load onto a second factor, and that would be your negative-item-wording method-effect factor). An example: doi.org/10.1037/a0036472
The second--and less common--way I've seen this done is to also model a third factor to account for the positively-worded items (i.e., items 3-5 would also load onto a third factor, and that would be your positive-item-wording method-effect factor). An example: doi:10.1016/j.paid.2014.03.034 
Unfortunately, I am not familiar with mirt(), so can't tell you how it would be programmed.
Best.
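Although the poster above was not familiar with mirt, a bifactor model of the kind described can be specified there with bfactor(); a minimal sketch, assuming the six items ethnic_3 to ethnic_8 are the columns of 'dat6' and that your mirt version accepts NA in the specific-factor vector to mean "general factor only":
library(mirt)

spec   <- c(NA, NA, NA, 1, 1, 1)      # specific factor for the negatively-worded items 6-8
bf_mod <- bfactor(dat6, spec)         # general factor on all items plus the specific factor

summary(bf_mod)                       # standardized loadings on general and specific factors
itemfit(bf_mod, fit_stats = "S_X2")   # re-check item fit under the bifactor structure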
  • asked a question related to IRT
Question
4 answers
Dear all
I tried to get freeware to check local item dependence, such as IRTNEW and LDIP, but I cannot find any links for them on the internet. Could you help me if you have links for them, or for any other freeware (GUI) to check LID?
Thank you for your help.
Relevant answer
Answer
Hello again! I have attached a different compilation of non-commercial software for IRT. I hope you find something that is helpful for your study. I particularly like Dr. Hanson's web page. Best, Patricia.
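If an R-based (non-GUI) route is also acceptable, local item dependence can be checked for free with the mirt package; a minimal sketch with an assumed response matrix 'resp':
library(mirt)

mod <- mirt(resp, 1, itemtype = "2PL")
residuals(mod, type = "Q3")   # Yen's Q3 pairwise residual correlations
residuals(mod, type = "LD")   # Chen & Thissen's LD X2 statistics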
  • asked a question related to IRT
Question
6 answers
I want to use IRTPRO for item analysis of a public examination. The subject I intend to analyse has 40 multiple-choice items; can all items fit into the model?
Relevant answer
Answer
Dear Ibrahim,
The most important thing is the sample size, i.e., the number of examinees. Parameter stability depends on this. The larger the sample, the more information you have for testing the 3-parameter IRT model.
Good luck with your study.
Best.
Dr. Albert Sesé
  • asked a question related to IRT
Question
6 answers
If a, b, and c parameters have been obtained and published for a normative sample, and I calculate them for a new data set, would it be possible to statistically examine DIF on the basis of these parameters?  Or would I need the full raw data for the normative sample?
Relevant answer
Answer
There is an excellent treatment of this in Embretson & Reise (2000), Item Response Theory for Psychologists.
  • asked a question related to IRT
Question
7 answers
Is there any classification of the ranges of values for item difficulty and item discrimination in item response theory (IRT) and multidimensional item response theory (MIRT)?
According to Baker in "The Basics of Item Response Theory", item discrimination in IRT is classified as follows:
none: 0
very low: 0.01 - 0.34
low: 0.35 - 0.64
moderate: 0.65 - 1.34
high: 1.35 - 1.69
very high: above 1.70
perfect: +infinity
According to Baker, Hambleton (Fundamentals of Item Response Theory), and Hasmy (Compare Unidimensional and Multidimensional Rasch Model for Test with Multidimensional Construct and Items Local Dependence), item difficulty is classified as follows:
very easy: below -2
easy: -2 to -0.5
medium: -0.5 to 0.5
hard: 0.5 to 2
very hard: above 2
Could the item discrimination and item difficulty classifications also be used in MIRT?
  • asked a question related to IRT
Question
11 answers
I'm looking into the development of an online IRT adaptive test for a specific test instrument. Any pointers to help me start out, such as published research or case studies, would be appreciated. I've come across the Concerto platform but would be interested to know what else is out there.
Relevant answer
Answer
Hello Edmund,
In case you are looking for IRT-based adaptive testing R packages, please look at the catR, mirt and mirtCAT packages. The mirtCAT package in particular provides tools to generate an HTML interface for adaptive testing.
Hope it helps you.
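A minimal mirtCAT sketch, assuming a pre-calibrated graded-response item bank and hypothetical item text, just to show the pieces involved (item bank generation, the question data frame, and the CAT launch):
library(mirtCAT)

set.seed(1)
a <- rlnorm(20, 0.2, 0.3)                                              # hypothetical discriminations
d <- matrix(c(1.5, 0.5, -0.5, -1.5), 20, 4, byrow = TRUE) + rnorm(20)  # ordered intercepts
pars <- data.frame(a1 = a, d1 = d[, 1], d2 = d[, 2], d3 = d[, 3], d4 = d[, 4])
bank <- generate.mirt_object(pars, itemtype = "graded")                # pre-calibrated item bank

questions <- data.frame(Question = paste("Statement", 1:20),
                        Option.1 = "Strongly disagree", Option.2 = "Disagree",
                        Option.3 = "Neutral", Option.4 = "Agree", Option.5 = "Strongly agree",
                        Type = "radio", stringsAsFactors = FALSE)

result <- mirtCAT(df = questions, mo = bank,
                  method = "EAP", criteria = "MI",   # EAP scoring, maximum-information selection
                  design = list(max_items = 10))     # launches the HTML (shiny) interface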
  • asked a question related to IRT
Question
20 answers
There are many different programs for conducting IRT, some are stand-alone products and some are components within larger statistical programs. What are the most popular programs in common use?
Additionally, does anyone know of any independent comparisons of the various programs to ensure that they produce identical results?
Relevant answer
Answer
I prefer open-source R with the mirt package, but there are a number of different IRT packages in R. Popular closed-source programs are PARSCALE and the more recent IRTPRO; I think Mplus may have IRT capabilities too, but I am not sure.
I don't know about comparisons between programs. Best, Felix
  • asked a question related to IRT
Question
10 answers
I am interested in comparing test equating procedures
Relevant answer
Answer
I wrote a review a few years ago on IRTEQ, which is a free program to help your research.  My review is pretty sparse (intentionally) but I highly recommend the software. 
  • asked a question related to IRT
Question
4 answers
I am running an IRT analysis on an instrument in Xcalibre, and the analysis reports substantially different means for the items than those calculated in Excel. Is there some weighting happening of which I am unaware?
Relevant answer
Answer
As Alan suggested, there is indeed likely an algorithmic reason that Xcalibre might be calculating differently.  I know of one situation for sure.  A few weeks ago I received a support email with the same question, and the issue was that they were running a 5-point polytomous calibration on a sample of only 36. (!!!!)  Xcalibre automatically combines response levels with N=0.  In that case, there was, for example, an item where no one responded as 1 or 2, only 3-4-5.  The 1 and 2 levels were dropped and it was treated as a 3-option item, and since numbering starts at 0 or 1 (depending on if doing PCM or RSM approach), it gets renumbered.  So the researcher anticipated an item mean of 4 or so and it was reported as 2 or so.
I'd encourage you to contact the support team about the issue.
(Disclosure: I am the author of Xcalibre.)
  • asked a question related to IRT
Question
6 answers
Hello everyone,
Is there any commercial software available for multidimensional adaptive testing (including calibration of items, ability estimation, and next-item selection procedures)? If yes, what is it?
Thanking you,
Irshad
Relevant answer
Answer
Currently, I am not aware of any single commercial software package for both MIRT scaling and operational multidimensional adaptive testing.
Some commercial products for MIRT scaling have been mentioned above (e.g., ConQuest).
To my knowledge, no commercial software exists for multidimensional adaptive testing. The Multidimensional Adaptive Testing Environment (MATE) can be used for teaching and research purposes: 
However, no support is provided and you are not allowed to use it for commercial purposes.
Andreas 
  • asked a question related to IRT
Question
4 answers
For her dissertation research relating item content to slopes/loadings, my student is seeking papers that fit simple (i.e., no cross-loadings) structural models using IRT or CFA for Likert-style responses to measures of personality, attitude, etc. We will also fit such models ourselves if the data are made available to us. 
In order to analyze the item content and relate it to the structural parameter, the items have to be provided or publicly available and the paper must provide the slopes/loadings.
We are particularly interested in measures of content that have shown wording effects, like Rosenberg's self-esteem scale, Big Five personality, positive and negative affect, etc. We cannot consider measures whose sole use is clinical (e.g., MMPI).
Relevant answer
Answer
I had a student from Oman some years ago studying the Big Five with Omanis. You may google her name, Muna Alkalbani, to get her email or to access some of her published papers.
  • asked a question related to IRT
Question
5 answers
Hello, I want to compute the so-called IRT item information function for individual items as well as for the latent variable using Stata. Several methods are available for IRT analysis, like clogit, gllamm or raschtest, but so far I have not been able to find any syntax to draw the information function using any of these methods. So, any help or example is much appreciated.
Relevant answer
Answer
Dear Richard,
This morning I did not see the test-information options in the help file of icc_2pl.
Now I do, and the great news is that, thanks to icc_2pl, I am also able to create the item information plots and the test information plot.
So, I really have to thank you again for your ado file, and I think you should share it with the Stata community too. It is a great time saver to have this.
Best regards,
Eric
  • asked a question related to IRT
Question
3 answers
I am using mirt (an R package) for IRT (item response theory) analysis. The data I am currently dealing with are sparse (they contain missing values). Responses are missing because the test is adaptive (in an adaptive test, not all questions in the item bank are presented to each test taker, so the responses to questions that were not presented, or that the test taker could not attempt, are missing). Now, the "mirt" function in the mirt package can calibrate data with missing values (i.e., fit the IRT models: Rasch, 2PL, 3PL). However, when it comes to item fit analysis (using the "itemfit" function), you cannot supply sparse data; in this package, for sparse data you must use imputation before item fit analysis. I have two questions here:
1. Are there any more methods available besides imputation for item fit analysis  when you have the sparse data?
2. What is the maximum percentage of sparseness in the response data matrix, where you can use imputation method to get reliable results?
Relevant answer
Answer
Irshad, have you considered using different software? Missing data is no problem for Rasch methodology. Imputation is not necessary; in fact, it can be counter-productive. Even 99%+ missing data is acceptable. See, for instance, "Rasch Lessons from the Netflix® Prize Challenge Competition" at the link.
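For readers who do want to stay within mirt, the package's own route is model-based imputation before item-fit analysis; a minimal sketch (the model choice and fit statistic are assumptions, and the imputation should be repeated several times to check that conclusions are stable):
library(mirt)

mod   <- mirt(sparse_resp, 1, itemtype = "2PL")   # calibration handles the NAs
theta <- fscores(mod, method = "EAP")
full  <- imputeMissing(mod, theta)                # plausible values for the unadministered items

mod2 <- mirt(full, 1, itemtype = "2PL")
itemfit(mod2, fit_stats = "S_X2")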
  • asked a question related to IRT
Question
4 answers
I would like to conduct a Rasch model analysis of the Olweus Bullying Questionnaire for a Pakistani sample. I haven't done it before. Any suggestions and help, especially from an expert in Lahore, would be appreciated. ConQuest is the software available.
Relevant answer
Answer
Thanks dear
  • asked a question related to IRT
Question
3 answers
I am using the IRTPRO software to find item difficulty and discrimination for my dichotomous tests. I realized that the program implements different algorithms for IRT parameter estimation, such as the Bock-Aitkin EM algorithm and the Metropolis-Hastings Robbins-Monro (MH-RM) algorithm, and I want to understand how each of them works. Could anyone help me find some good material which explains the algorithms in a simple way? I also came across "invariant item parameters in IRT" and the claim that they don't change regardless of the sample of students we have. How is this proved? And how confident can we be about such an assumption?
Relevant answer
Answer
As regards the first question:
The short answer is the the MH-RM is faster in many cases.
At the risk of being misleading, I will try to give a simple explanation:
When we start out with the data in the scale, we don't know any of the parameters, so we have to estimate them somehow. Unfortunately our IRT models express item parameters in terms of person parameters and probabilities, and we don't know the person parameters either. Typically this is done with some kind of a maximum likelihood approach. The likelihood is the probability of observing the parameters that are observed (which we don't initially know) given some distribution underlying those parameters, which is why we want to solve for the parameters which maximize the likelihood.
The Bock Aitkin approach, also called marginal maximum likelihood estimation, assumes that while we might not know each person's ability, we know the distribution (typically the normal distribution) of abilities from which the sample is drawn. This allows for a numerical solution to the likelihood maximization problem by integrating out the person parameter but requires some computationally intensive methods since the integral does not have a closed form solution. As the number of latent trait dimensions increases, there is a rapid increase in the computational demands for this solution.
The MH-RM takes a different approach with no integration involved. Basically, the computer simply estimates model parameters from an "observed" distribution. However, that "observed" distribution is not truly observed real data; it comes from observations of simulated data. (That simulated data is arrived at via a Markov Chain - wikipedia is a good place to start reading about this)
The reason to use the MH-RM method is because it is computationally faster for high dimensional latent traits.
To go to the literature:
I give a mix of papers here based on readability, convenience, and historical importance. However, I think that these papers are not that clear by themselves; some reading on Markov chains and Metropolis-Hastings will probably be helpful.
There are two articles that describe the methods you are asking about, both published in Psychometrika:
The first is the article by Bock and Aitkin, in which they introduce marginal maximum likelihood estimation:
Bock, R. Darrell, and Murray Aitkin. "Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm." Psychometrika 46.4 (1981): 443-459.
They discuss the generalization of marginal maximum likelihood estimation to two dimensions toward the end of the article. You get a hint there of the computational difficulties that can be associated with high-dimensional data, but in my opinion it is not made very obvious.
As such, I think that perhaps the single best resource is the original article by Li Cai in Psychometrika, who I believe is one of the developers of IRTPRO; it is available here (notably, it is published as an open-access article): http://link.springer.com/article/10.1007/s11336-009-9136-x
This paper is about using the MH-RM algorithm, and its introduction does a pretty good job of laying out in plain English the role of the EM algorithm and the reason the MH-RM algorithm was developed. However, the paper somewhat assumes that the reader understands Bayesian methods.
I hope that these are at least helpful to get you started.
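Although the question concerns IRTPRO, both algorithm families are also exposed in the open-source mirt package, which makes the trade-off easy to try out; a minimal sketch with an assumed dichotomous response matrix 'resp':
library(mirt)

fit_em   <- mirt(resp, 2, itemtype = "2PL", method = "EM")    # Bock-Aitkin MML via EM (quadrature)
fit_mhrm <- mirt(resp, 2, itemtype = "2PL", method = "MHRM")  # Metropolis-Hastings Robbins-Monro

# With one or two factors, EM is usually fine; as the number of latent dimensions
# grows, the quadrature grid explodes and MHRM typically becomes faster.
cbind(EM   = coef(fit_em,   simplify = TRUE)$items[, "a1"],
      MHRM = coef(fit_mhrm, simplify = TRUE)$items[, "a1"])   # compare the slope estimates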
As for the second question:
I am not sure what you mean by "invariant item parameters," but I assume that you are referring to the insensitivity of item parameter estimates to the sample in which they are estimated.
Let me point out that there is a difference between the parameters and the estimation of the parameters.
Now we don't have to prove that our model parameters are invariant. We don't have to prove anything about our model - it is our model, so we can make whatever assumptions we would like in it. Whether or not our model has anything to do with reality is another matter. The question is whether or not the model fits the data and whether or not we can estimate the model's parameters.
The way that we estimate parameters is by obtaining some data from a sample. The question then is whether or not the sample that we choose will affect the estimates of the parameters? Let me stress that we are referring to sample selection within a single population. Two very different samples taken from the same population will ideally give the same estimates, but two samples from two different populations might not.
Now, to answer the question:
If we use a marginal maximum likelihood estimation procedure, for example, we might find that changes in the sample do change the estimated parameters, because this estimation method relies on assumptions about the distribution of the observed participants. However, conditional maximum likelihood estimation in a sense allows for sample independence. The most clearly laid-out explanation of this that I am aware of was published in a somewhat obscure journal (at least to the English-speaking world):
It is about a software package implementing the CML estimation procedure and therefore explains why this estimation method is the closest thing to "sample independent" that we have.
However, I don't think that multidimensional conditional maximum likelihood has been developed (unless one considers the linear logistic models multidimensional models). Only IRT models in which a location parameter has been estimated are suitable for CML estimation.
(There are other IRT estimation methods as well.)
Let me also add that IRT estimation methods are a contentious issue, with both money and academic reputation at stake. Each IRT estimation program has an estimation method, and the developers of each program have generally published many articles explaining why the estimation procedure they use is best. You will notice that the article I give you for CML estimation is by a group that has developed a software program using CML to estimate the Rasch model. That being said, CML is probably the gold standard, if there is such a thing, and other estimation methods are sometimes compared against it to demonstrate that they yield good estimates.
So how confident can we be? Well, we can be absolutely confident that our model meets our assumptions, but our model does whatever we want it to do. As for our confidence about the parameter estimates, that is debatable, and depends in part on which estimation method you use.
Hope that this helps.
  • asked a question related to IRT
Question
1 answer
I am using CTT and IRT to analyse three different tests I gave to students. Applying CTT was straightforward; however, using IRT was a bit confusing to me. I started by assessing the tests' unidimensionality using principal component analysis in SPSS. The results showed that each test is assessing more than one latent variable (I identified 4 or 5 components in each test). I then decided to follow a different approach by examining model-data fit for each test: fitting each test with different unidimensional and multidimensional IRT models to choose the best model, and then obtaining the difficulty and discrimination indices. In this case, even if my test is not unidimensional, I could still find that a unidimensional model fits my data better than a multidimensional model, and I could also determine the number of dimensions that represents my data best.
My question is how reliable this approach is, and if the principal component analysis results in more than one dimension while the model-data fit shows that a unidimensional item response theory model is the best fit, should we just assume that my test is unidimensional?
Relevant answer
Answer
First, dimensionality is not a single concept with a well-defined meaning in psychometrics. Dimensionality as assessed by principal component analysis (PCA) is different from dimensionality as assessed by item response theory (IRT). Below is a citation of an old review highlighting the ambiguity of the term "unidimensional." Though it is old, I still see papers whose authors seem unaware of some of the issues raised in it.
Reference:
Hattie, John. "Methodology review: Assessing unidimensionality of tests and items." Applied Psychological Measurement 9.2 (1985): 139-164.
To move more specifically to your question:
PCA is a variance-based method that seeks to explain the total observed variance in a set of responses. A similar technique is factor analysis, which seeks to explain the common variance. IRT is not concerned with explaining variance but with ordering items along a latent trait, something that variance-based methods have no regard for.
That being said, variance-based approaches and IRT-based approaches often (though far from always) suggest the same number of dimensions.
Another point worth mentioning here is that IRT is designed for ordinal and categorical data, while PCA is designed for continuous variables. Frequently, ordinal data is treated as continuous and used in PCA, which is not necessarily a problem but can be responsible for quirks in the results. One way to use variance-based methods with ordinal data is to use polychoric or tetrachoric correlations, which explicitly assume that the data are ordinal but represent thresholds along a continuous response. Polychoric correlations are sometimes different from Pearson correlations, and as such, statistical techniques aimed at explaining variance (on which correlations are based) will give different answers when the polychoric correlations differ. Notably, the technique used to find polychoric correlations makes some assumptions of its own that your data might violate.
Though you did not ask, I will point out that you can use polychoric correlations to do factor analysis and from the results of the factor analysis and the polychoric correlation thresholds, you can recover the IRT parameters - whether unidimensional or multidimensional. The mathematical relationships between IRT parameters and factor analysis has been explored in the reference below:
Kamata, Akihito, and Daniel J. Bauer. "A note on the relation between factor analytic and item response theory models." Structural Equation Modeling 15.1 (2008): 136-153.
This can be done even in multiple dimensions. Notably, if you have binary items, the factor-analytic estimates of IRT parameters will essentially be Birnbaum's 2 parameter model. If you have ordinal / polytomous items, then the factor analytic IRT estimates will be estimates of a graded response model (as opposed to a generalized partial credit model).
In summary:
Is your approach of using PCA followed by IRT reasonable? Sure, but if the two approaches suggest different numbers of dimensions, then you have to ask what it is that you mean by the term "dimension." Do you mean explained variance, or the ability to locate items in terms of difficulty along a latent trait? If what you are interested in is the latter, then it is very reasonable to take the results of the IRT models over the PCA-based ones.
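A minimal sketch of the polychoric route described above, using the psych package ('items' is an assumed data frame of ordinal responses):
library(psych)

pc <- polychoric(items)                               # polychoric correlations and thresholds
fa.parallel(pc$rho, n.obs = nrow(items), fa = "fa")   # dimensionality on the polychoric matrix

fa_res <- fa(items, nfactors = 1, cor = "poly", fm = "wls")  # factor analysis on polychorics
fa_res$loadings                                       # these loadings map onto 2PL/GRM parameters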
  • asked a question related to IRT
Question
1 answer
I have the following situation: A number of texts have been rated for readability. The research question is: Is the used readability score valid? We did a study where participants first read the text and then answer a multple choice question. Because the choices were exclusive (exactly one correct choice), there is a 25% chance of guessing.
We ran a logistic regression with correct/incorrect as response and readability score as predictor. We find the desired relationship, but we also want to predict performance at texts of lower readability than in our sample. How can we formulate the model or adjust the estimates such that predicted probability of correct answer would never fall below 25% ?
Note that the data analysis is done by a bachelor student. Pragmatic solutions, using standard software, are preferred over profound. Also note that we have to use either a mixed-effects model or GEE to account for repeated measures.
Relevant answer
Answer
Issue #1: "we also want to predict performance at texts of lower readability than in our sample." A simple reminder that extrapolation is always dangerous.
Issue #2: You have to allow the model to produce the full range of possibilities, from predicting all 0s to all 1s. It's possible that under some conditions people would actually perform below 25% (e.g., if a subject were wired backward such that wrong answers looked more correct than right answers).
I'm guessing that the question arose because in trying to extrapolate (Issue #1), you generated predicted values below chance which isn't what you expected. Well, that's the trouble with extrapolation!
So, the real answer is not statistical - be sure that your data set includes texts of lower readability to ensure that the model fits these conditions and thus your prediction doesn't require extrapolation.
A completely different answer is to aggregate data to create a percentage score and then convert the score to a d' using signal detection theory where 25% = d' of zero. See Hacker & Ratcliff (1979) in Perception and Psychophysics. The drawback is that aggregation loses information, but I think this matches your conceptualization of the problem. It may not help the extrapolation issue, however....
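As a pragmatic illustration of building the 25% floor directly into the model, so that predictions cannot drop below chance, here is a minimal sketch that ignores the repeated-measures structure purely for simplicity (a nonlinear mixed model, e.g. via nlme::nlme, would be needed to honour it) and uses least squares rather than a binomial likelihood; 'dat' is assumed to have columns 'correct' (0/1) and 'readability':
fit <- nls(correct ~ 0.25 + 0.75 / (1 + exp(-(b0 + b1 * readability))),
           data = dat, start = list(b0 = 0, b1 = 0))
summary(fit)

newdat <- data.frame(readability = seq(min(dat$readability) - 10,
                                       max(dat$readability), length.out = 50))
predict(fit, newdata = newdat)   # bounded between 0.25 and 1 by construction
Keep in mind the warning above: predictions for readability values outside the observed range are still extrapolations.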