
Item Response Theory - Science topic

Explore the latest questions and answers in Item Response Theory, and find Item Response Theory experts.
Questions related to Item Response Theory
  • asked a question related to Item Response Theory
Question
4 answers
Query 1) Can the mirt exploratory factor analysis method be used to establish factor structure for marketing/management research studies? Most of the studies I have gone through relate to educational testing. My objective is to extract factors to be used in subsequent analysis (regression/SEM).
My data is comprised of questions like:
Data sample for Rasch factors:
Thinking about your general shopping habits, do you ever:
a. Buy something online
b. Use your cell phone to buy something online
c. Watch product review videos online
RESPONSE CATEGORIES: Yes = 1, No = 0
Data sample for graded factors:
Thinking about ride-hailing services such as Uber or Lyft, do you think the following statements describe them well?
a. Are less expensive than taking a taxi
c. Use drivers who you would feel safe riding with
d. Save their users time and stress
e. Are more reliable than taking a taxi or public transportation
RESPONSE CATEGORIES: Yes = 3, Not sure = 2, No = 1
Query 2) If we use mirt exploratory factor analysis with the Rasch model for dichotomous items and the graded model for polytomous items, do these models by default use tetrachoric correlations for the Rasch model and polychoric correlations for the graded model? My objective is to extract factors to be used in subsequent analysis (regression/SEM).
Note: I am using R for data analysis.
Relevant answer
Answer
I really appreciate you sparing some time to answer my question. Maybe I am unable to phrase the question properly, but my objective here is to create factors from an underlying battery of items with different scales. So my question is simply: can I use mirt to perform EFA to create factors to be used in subsequent analysis (regression/SEM)?
One more thing I would like to know: what exactly do you mean by "easy" and "difficult" items in your answer?
  • asked a question related to Item Response Theory
Question
1 answer
Can the mirt exploratory factor analysis method be used to establish factor structure for marketing/management research studies? Most of the studies I have gone through relate to educational testing. My objective is to extract factors to be used in subsequent analysis (regression/SEM).
My data is comprised of questions like:
Data sample for Rasch factors:
Thinking about your general shopping habits, do you ever:
a. Buy something online
b. Watch product review videos online
RESPONSE CATEGORIES: 1 Yes, 2 No
Data sample for graded factors:
Thinking about ride-hailing services such as Uber or Lyft, do you think the following statements describe them well?
a. Are less expensive than taking a taxi
b. Save their users time and stress
RESPONSE CATEGORIES: 1 Yes, 2 No, 3 Not sure
Relevant answer
Answer
The mirt package concentrates on item response theory. You can perform the exploratory factor analysis you want more easily with the "fa" function in the "psych" package in R. You will also find many different and up-to-date factor analysis estimation methods in this package.
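For what it's worth, a minimal sketch of that suggestion, assuming a data frame shopping_items that holds the dichotomous and three-category items as numeric codes (the object name and the choice of two factors are placeholders):
library(psych)
# EFA based on polychoric (tetrachoric for binary items) correlations
efa_fit <- fa(shopping_items, nfactors = 2, fm = "ml", rotate = "oblimin", cor = "poly")
print(efa_fit$loadings, cutoff = 0.30)
# Factor scores that could feed the subsequent regression/SEM step
head(factor.scores(shopping_items, efa_fit)$scores)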
  • asked a question related to Item Response Theory
Question
4 answers
Dear all,
I have an English Listening Comprehension test consisting of 50 items, taken by about 400 students. I would like to score the test using the TOEFL scale (max 58, I think), and it is claimed that TOEFL is scored using IRT (the 3PL model). I am using the mirt package in R to obtain the three item parameters, since I fit a 3PL model.
library(readxl)
TOEFL006 <- read_excel("TOEFL Prediction Mei 2021 Form 006.xlsx")
TOEFL006_LIST <- TOEFL006[, 9:58]
library(mirt)
sv <- mirt(TOEFL006_LIST,    # data frame (ordinal only)
           1,                # 1 for unidimensional, 2 for exploratory
           itemtype = '3PL') # models, e.g. Rasch, 1PL, 2PL, and 3PL
sv_coeffs <- coef(sv,
                  simplify = T,
                  IRTpars = T)
sv_coeffs
The result is shown below:
| Item | a | b | g | u |
|L1 |2.198 | 0.165 |0.198 | 1 |
|L2 |2.254 | 0.117 |0.248 | 1 |
|L3 |2.103 |-0.049 |0.232 | 1 |
|L4 |4.663 | 0.293 |0.248 | 1 |
|L5 |1.612 |-0.374 |0.001 | 1 |
|... |... |... | ... |... |
The problem is that I do not know how to use the parameters above to weight each item. The formula should be like this, right?
Would anyone help me show how I can insert the parameters into the formula in R? Or maybe there are other ways of obtaining students' scores without having to manually weight each item. Your help is much appreciated.
Thank you very much for your help everyone.
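One possible route, sketched here under the assumption that the fitted object sv from the code above is available: rather than weighting items by hand, mirt can return theta (ability) estimates directly via fscores(), and a reporting scale is then just a transformation of theta. The 0-58 rescaling below is purely illustrative and is not the official TOEFL conversion.
library(mirt)
# EAP ability estimates (theta) for each student from the fitted 3PL model
theta <- fscores(sv, method = "EAP")[, 1]
# Illustrative linear rescaling of theta to a 0-58 range (NOT the official TOEFL table)
rng <- range(theta)
scaled <- (theta - rng[1]) / diff(rng) * 58
head(round(scaled))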
Relevant answer
Answer
This essay may be of interest for your research; it argues that the IRT theta parameter is not an ability measure, although the two are positively correlated:
  • asked a question related to Item Response Theory
Question
4 answers
Among the models of Item Response Theory, is there one that is better than the PCM and accessible for analyzing the psychometric properties of tests? I want to be sure whether the PCM is the best option.
Relevant answer
Answer
I believe your choice of model should depend on which psychometric properties you would like to investigate, and which properties you would like your scale to have (provided that the model fits of course). Then there is the issue of the nature of your items - dichotomous or polytomous - which also weighs in. Lastly, having chosen the appropriate model for your study, there is the choice of software, as the estimation methods differ!
  • asked a question related to Item Response Theory
Question
7 answers
I'm preparing a study whose main goal is to explore whether a set of related behaviours is best conceptualized as one-dimensional or two-dimensional. Traditionally, such questions have been answered using methods such as exploratory and confirmatory factor analysis (as well as variants from item response theory).
More recently, some researchers have argued that these approaches are somewhat problematic, because they assume that there is a latent factor that “causes” the phenomena of interest. This assumption doesn’t always make sense theoretically. A possible alternative is to use network models (e.g., https://doi.org/10.1080/21642850.2018.1521283, https://doi.org/10.1177/1948550617709827, or https://doi.org/10.3758/s13428-017-0862-1).
Ideally, the choice for either of these methods is based on theoretical considerations: If I expect that there is a latent trait that “causes” the behaviours of interest, I choose EFA/CFA (or a similar IRT variation of it). If I assume a strong interaction between these behaviours and that there isn’t any latent cause, I choose a network approach.
My question:
What if I start from a relatively uninformed position and do not have such a priori theoretical assumptions? Which approach (EFA/CFA vs. network analysis) is better suited when the goal is to explore the dimensionality of the behaviours of interest and to develop theory?
Any arguments and pointers to previous work that provides guidance on this would be highly appreciated.
Relevant answer
Answer
There is no magic bullet (IMHO). Methodologies all rest on presumptions (some explicit, but many implicit). For example, in studies of how people feel about things, a common methodology is the feeling thermometer, but the approach-avoidance presumption behind it is not tested by the FT; rather, that presumption dictates the method. Similarly, affect checklists (such as the PANAS or PANAS-X) are not interpretable without presumptions: should latent factors be orthogonal, or should correlated factors (e.g., using SEM procedures rather than EFA with orthogonality selected) be allowed or precluded? Data can help, but absent one or more theoretical formulations one can rarely (if ever?) decide which data collection method is best, or which statistical approach to data analysis is suitable. At least some data collections enable core presumptions to be tested (affect word checklists enable one to test the valence model of affect, generally finding that the model fails, but cannot, absent other theoretical claims, determine the character or number of latent factors). So it is not that theory has to come first, but it has to be part of the dynamic interplay between scholar, data, method, and theory (and between literatures and other scholars).
  • asked a question related to Item Response Theory
Question
4 answers
Hello everyone!
I am currently analysing a questionnaire from a Rasch perspective. Results of the Andersen likelihood ratio test (random split) and the Martin-Löf test (median split) turned out to be significant. I know what the significant results mean and which assumptions are violated. However, I am not sure about possible causes of the lack of subgroup invariance and of item heterogeneity. What are some possible causes of significant results?
I hope that someone of you can help me answer this question. Thank you very much already in advance :)
Best regards,
Lili
Relevant answer
Answer
Dear Lili,
I can only second what David and Georg already said. Other than that I would not recommend using a median split for the Andersen LRT. Simulation studies show that the Andersen LRT performs rather poorly with a random split (https://dx.doi.org/10.22237/jmasm/1555594442).
All the best for your research,
Georg
  • asked a question related to Item Response Theory
Question
1 answer
Based on the published item response data sets LSAT and pcmdat2, the General Total Score and the Total Score are applied for numerical (high-stakes) scoring. An Item Response Theory (IRT) 2PL model is also applied for numerical scoring and compared with the General Total Score and Total Score. R code for computing the General Total Score is also provided.
______________________________________________________
Please click the following link for the full essay:
  • asked a question related to Item Response Theory
Question
9 answers
The cross-cultural adaptation of a health status questionnaire or tool for use in a new country, culture, and/or language requires a unique methodology in order to reach equivalence between the original source and target languages. It is now recognized that if measures are to be used across cultures, the items must not only be translated well linguistically, but also be adapted culturally to maintain the content validity of the instrument across different cultures. In this way, we can be more confident that we are describing the impact of a disease or its treatment in a similar manner in multi-national trials or outcome evaluations.
The term "cross-cultural adaptation" is used to encompass a process which looks at both language (translation) and cultural adaptation issues in the process of preparing a questionnaire for use in another setting (Hill et al. and Kirwan et al., cited in Beaton et al., 2000). The process of cross-cultural adaptation strives to produce equivalency based on content. This suggests that other statistical properties such as internal consistency, validity and reliability might be retained. However, this is not necessarily the case. For example, if the new culture has a different way of doing a task included within a disability scale that makes it inherently more or less difficult to do relative to other items in the scale, the validity would likely change, particularly in terms of item-level analyses (such as item response theory or Rasch analysis). Further testing should be done on an adapted questionnaire to verify its psychometric properties. What's your opinion?
Relevant answer
Answer
Dear Khasanov,
I think you are right. It's very important to fill lexical gaps in a language. I'm not an expert on language, but I still feel it's important for endangered languages.
Regards
Md. Israt Hasan
  • asked a question related to Item Response Theory
Question
11 answers
I am developing a tool and am trying to decide which software package to use. Unfortunately, packages such as RUMM and BILOG are not freely available at UCL; however, we do have R. Can I use this? Will it impact my analysis?
Relevant answer
Answer
I use the mirt package, developed by R. Philip Chalmers, for most of my IRT analyses. It has the advantage that multidimensional models are implemented as well, but you can also use it for a simple unidimensional Rasch analysis. If you want to run a nonparametric analysis, I recommend the mokken package.
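A minimal sketch of both suggestions, assuming a data frame resp of dichotomous item responses (the object name is a placeholder):
library(mirt)
library(mokken)
# Parametric: unidimensional Rasch model
rasch_fit <- mirt(resp, 1, itemtype = "Rasch")
coef(rasch_fit, simplify = TRUE, IRTpars = TRUE)$items
# Nonparametric: Mokken scalability coefficients and automated item selection
coefH(resp)
aisp(resp)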
  • asked a question related to Item Response Theory
Question
4 answers
I am working on a research paper and I want to use PLS-SEM. How can I determine the sample size for PLS-SEM with item response theory?
Relevant answer
Answer
I agree with Asim Aziz; 10 or 15 respondents per item is the rule of thumb usually used.
  • asked a question related to Item Response Theory
Question
2 answers
IRT - Item Response Theory
Relevant answer
Answer
If you plot the test information function, it will show how much information there is at any given point of your latent trait scale, and the target value where the test information is maximized. The TIF is also an inverse function of the standard error of measurement, so where you have the most information, you also have the least error. If you also plot the persons' abilities, you can see whether most of your persons are in a range of the latent trait with a lot or a little information.
Was this the type of interpretation information you were looking for?
best
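A small sketch of how this could be inspected with the mirt package, assuming a fitted unidimensional model object called fit (a placeholder name):
library(mirt)
# Test information across the latent trait
plot(fit, type = "info")
# Test information together with the conditional standard error of measurement
plot(fit, type = "infoSE")
# Person ability estimates, to compare their location with the information peak
theta_hat <- fscores(fit, method = "EAP")
hist(theta_hat[, 1], main = "Estimated abilities", xlab = "theta")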
  • asked a question related to Item Response Theory
Question
3 answers
I have a measurement in which the indicators influence/give rise to the construct. This makes it formative; however, I don't think I can apply IRT to formative measurement models. Can anyone confirm whether this is the case, with an explanation of why?
Relevant answer
Answer
I think not. In IRT, responses on items are not causes of the construct; instead, they are indicators of it. So using IRT for a formative model would be conceptually wrong.
  • asked a question related to Item Response Theory
Question
11 answers
Most software packages for analyzing polytomous models under the Item Response Theory approach show option characteristic curves as an output. Given that I have the data on those option characteristic curves, how would I calculate the item characteristic curve?
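For polytomous items, the item-level curve usually reported is the expected score function, E(X | theta) = sum over categories of (category score x category probability), i.e., the option characteristic curves weighted by their category scores. A minimal sketch with the mirt package, assuming a fitted polytomous model called mod (a placeholder name):
library(mirt)
Theta <- matrix(seq(-4, 4, length.out = 201))   # grid of latent trait values
item1 <- extract.item(mod, 1)
P <- probtrace(item1, Theta)                    # category (option) probabilities
E1 <- expected.item(item1, Theta)               # equivalent to P %*% (0:(ncol(P) - 1))
plot(Theta, E1, type = "l", xlab = "theta", ylab = "Expected item score")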
Relevant answer
Answer
How can I draw the item category response functions for a five-category item in SPSS? Please guide me.
  • asked a question related to Item Response Theory
Question
2 answers
  1. The Total Score is defined as the number of correctly answered (dichotomous) items;
  2. A Sub-score is defined as the total score associated with a sub-scale;
  3. The Overall Score is defined as the total score associated with all testing items;
  4. For the Total Score, the Overall Score is the sum of its Sub-scores; this property is called Additivity;
  5. For the Item Response Theory (IRT) ability (theta parameter), the relationship between the Overall Score and the Sub-scores is unavailable;
  6. Comment: (5) implies that IRT has no Additivity. Therefore, with IRT ability, the Sub-scores and the Overall Score cannot be available simultaneously. This strongly indicates that IRT is not a correct theory for high-stakes scoring, while the Total Score in (4) is (although only as a special case).
Relevant answer
Answer
Hi, Matthew,
Thank you for your interest in this topic.
(1) In MIRT, all the latent variables represent the scores associated with sub-scales, but no latent variable stands for the Overall scale. Further, the covariance between sub-scales captures only the linear part of the mutual relations between sub-scales; the mutual information beyond the linear part, and the interactions involving more than two sub-scales, are totally missing in MIRT. Again, the key argument is that, in MIRT, there is no latent variable representing the Overall Score.
(2) In multivariate statistics, the ONLY reason to put all the variables into a single system is that they are interactive; otherwise, (jointly) independent variables should be studied individually (and therefore more easily). In IRT, the assumption of conditional independence shouldn't be there because, in the real world, it is rarely true. Now, the issue is: without any unrealistic precondition, why can IRT not express its Overall score in terms of its sub-scores? That is, does IRT have an Overall score, and what does the latent variable (theta parameter) in IRT stand for?
  • asked a question related to Item Response Theory
Question
4 answers
Dear colleagues,
I have obtained thetas from a 2PL IRT model based on a 12-item test of financial literacy. I am curious whether I can further use those theta scores (in original or modified form) in a logit regression as one of the independent variables. I cannot seem to find any literature to support or dissuade from doing so.
Relevant answer
Answer
Hello Dmitrij,
I think you can, provided that the 2PL model is appropriately fitted with a sufficient sample size.
Good luck
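A minimal sketch of that workflow, assuming a data frame fin_items with the 12 financial literacy items and an analysis data set dat containing a binary outcome y (all names, including the extra covariates, are hypothetical):
library(mirt)
fit_2pl <- mirt(fin_items, 1, itemtype = "2PL")       # 2PL model for the 12 items
dat$fin_lit <- fscores(fit_2pl, method = "EAP")[, 1]  # EAP theta scores as a new variable
# Logit regression with the IRT score as one of the independent variables
logit_mod <- glm(y ~ fin_lit + age + income, family = binomial, data = dat)
summary(logit_mod)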
  • asked a question related to Item Response Theory
Question
6 answers
Hi all,
I'm currently conducting research for which I have the data for every submission within an e-learning platform. The design is such that every exercise can be handed in an unlimited number of times, usually within a deadline of 1 or 2 weeks. Because of this, I've created my own variables, such as the number of attempts it took a student to solve the exercise, the time spent on an exercise from the moment of the first submission, the number of times a student switched exercises without having solved the previous one (indicating it was likely a very hard question), and whether or not the exercise was eventually solved.
I started from the idea of implementing some sort of item response theory model, but because of the nature of the data (users do not all get the same exercises, users are in different courses, etc.) this causes too many, in my opinion intractable, problems. For example, I have lots of missing data by design (there are 1000+ exercises in total, but users only complete between 60 and 100 exercises, and different users complete different exercises). This makes maximum likelihood estimation either very imprecise or impossible. Also, I don't want to measure difficulty as whether or not the exercise was solved (dichotomous response) or in how many attempts (polytomous response), but rather as a combination of all those factors.
This is why I ended up with both factor analysis and principal component analysis approaches. Since the three variables I want to combine are, in my opinion, all reasonable proxies of difficulty, this should work, and the fitted models appear to range from a reasonable to a good fit, purely based on intuition so far (I have no test data to verify). However, my concern is that within the literature I don't find many examples of a single-factor analysis, and especially not of principal component analyses with just one principal component. This is why I was wondering if anyone could confirm that this is actually a viable approach for the problem I'm trying to solve, as well as point me to some literature that did similar things in the past.
So, to summarize, I would like to estimate a latent variable 'difficulty' (because I have no test data) which should be a reasonable approximation of all three variables (attempts, time it took to solve, and switches), and as far as I'm aware, both factor analysis and principal component analysis should be able to do that. However, within the literature I fail to find many, if any, examples of similar previous research (because they require only one factor or one principal component), so I would like to know whether it is actually a viable approach and, if possible, be pointed towards some literature that did similar things.
Thanks in advance!
Relevant answer
Answer
You're welcome, Olivier Lattrez .
I'd review the relevant literature before conceptualizing a variable or set of variables. That would give relevant answers to your questions. My answers above relate to measuring subjective evaluations, i.e., attitudinal variables. It would be a different story if you were to collect objective data (e.g., gender). See our brand new article; it is about constructing and validating a perceptual/subjective measure.
Thanks,
A
  • asked a question related to Item Response Theory
Question
4 answers
I collected opinions from public health specialists about different indicators to create a city health profile. I tried to draw item characteristic curves for each indicator with R software. Please help me interpret this chart.
The data are the scores for each indicator, i.e., a continuous variable ranging from 1.00 to 5.00 (with decimals).
Relevant answer
Answer
Hi, I think you used the wrong IRT model; your data are continuous, while standard IRT models assume categorical (dichotomous or polytomous) responses.
  • asked a question related to Item Response Theory
Question
8 answers
I'm using a test in which each item could vary from 0 to 6 points. It is a short-term memory test, and each point refers to how many features the subject could remember. Can I use Item Response Theory in this case?
Relevant answer
Answer
You could look at the mirt package, which includes polytomous IRT models suitable for items scored 0-6.
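A minimal sketch, assuming a data frame memory_items holding the 0-6 scored items (the name is a placeholder); the graded response model is shown, and a generalized partial credit model would simply use a different itemtype:
library(mirt)
# Graded response model for items with 7 ordered categories (scores 0-6)
grm_fit <- mirt(memory_items, 1, itemtype = "graded")
coef(grm_fit, simplify = TRUE, IRTpars = TRUE)$items
# Alternative: generalized partial credit model
# gpcm_fit <- mirt(memory_items, 1, itemtype = "gpcm")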
  • asked a question related to Item Response Theory
Question
2 answers
I conducted an opinion survey to select feasible indicators to assess the health profile of a city. I asked the participants to give a score from 1 to 5 (1 for low, 5 for high) for each indicator on six aspects, viz. importance, specificity, measurability, attainability and time-bound character. That means each respondent gives a score from 1 to 5 for each characteristic of each indicator. The maximum total score of each indicator is 30. I collected opinions about 60 different indicators.
If I treat feasibility as the latent trait of every indicator, how can I select highly feasible indicators with the help of item response theory analysis? How do I draw an item characteristic curve for each indicator, and how do I select indicators?
Can anyone please help me to overcome this hurdle?
Relevant answer
Answer
Developing feasible indicators using IRT involves a series of steps. The IRT assumptions are essential in this regard: unidimensionality, local independence, and speededness should be examined. The fit of your data to the Rasch Rating Scale Model (probably the appropriate model for Likert-type items) needs to be investigated in a series of steps to identify misfitting persons as well as misfitting items. The criteria used are that the outfit and infit mean-square indices (MNSQ) should be between 0.7 and 1.3, whereas the standardized fit statistics (ZSTD) should be between -2 and 2. The process should be repeated until you reach a set of items whose calibrations do not depend on the particular persons sampled (person-free measurement). In addition, since the Rasch rating scale model is a one-parameter model, you may need to check that the discrimination index is equal for all items and that the guessing index is equal to zero for all items. Furthermore, the person separation index and the item separation index should be used to estimate scale reliability.
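A minimal sketch of the Rasch rating scale workflow described above, using the eRm package and assuming a data frame ratings of 1-5 responses (eRm expects the lowest category to be coded 0, hence the subtraction):
library(eRm)
rsm_fit <- RSM(ratings - 1)          # rating scale model
pp <- person.parameter(rsm_fit)      # person parameter estimates
itemfit(pp)                          # infit/outfit MNSQ and ZSTD for items
personfit(pp)                        # the same fit statistics for persons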
  • asked a question related to Item Response Theory
Question
27 answers
Dear all, I am searching for a comparison of CTT with IRT. Unfortunately, I mostly get an outline of both theories, but they are not compared. Furthermore, generally the "old" interpretation of CTT using axioms is used and not the correct interpretation provided by Zimmerman (1975) and Steyer (1989). The only comparison of both theories that I found was in Tenko Raykov's book "Introduction to Psychometric Theory". Does anybody know of any other sources?
Kind regards, Karin
Relevant answer
Answer
Greg, thanks for your detailed reply. I know that it is always described in text books, but the equation X = T + E is not a model. Therefore, I do not agree with you concerning this point. Raykov (2011, p. 121) explains this quite clearly:
“In a considerable part of the literature dealing with test theory (in particular, on other approaches to test theory), the version of Equation (5.1) [X = T + E] for a given test is occasionally incorrectly referred to as the ‘CTT model’. There is, however, no CTT model (cf. Steyer & Eid, 2001; Zimmerman, 1975). In fact, for a given test, the CTT decomposition (5.1) of observed score as the sum of true score and error score can always be made (of course as long as the underlying mathematical expectation of observed score—to yield the true score—exists, as mentioned above). Hence, Equation (5.1) is always true. Logically and scientifically, any model is a set of assumptions that is made about certain objects (scores here). These assumptions must, however, be falsifiable in order to speak of a model. In circumstances where no falsifiable assumptions are made, there is also no model present. Therefore, one can speak of a model only when a set of assumptions is made that can in principle be wrong (but need not be so in an empirical setting). Because Equation (5.1) is always true, however, it cannot be disconfirmed or falsified. For this reason, Equation (5.1) is not an assumption but rather a tautology. Therefore, Equation (5.1)—which is frequently incorrectly referred to in the literature as ‘CTT model’—cannot in fact represent a model. Hence, contrary to statements made in many other sources, CTT is not based on a model, and in actual fact, as mentioned earlier, there is no CTT model.”
But there are models developed within the framework of CTT. If one posits assumptions about true scores and errors for a given set of observed measures (items), which assumptions can be falsified, then one obtains models. This is closely related to confirmatory factor analysis, because the CTT-based models can be tested using CFA. If one assumes unidimensionality and uncorrelated errors, this would be a model of tau-congeneric variables, because these assumptions can be tested.
Raykov, T. & Marcoulides, G.A. (2011). Introduction to Psychometric Theory. New York, NY: Routledge.
Steyer, R. & Eid, M. (2001). Messen und Testen (Measurement and Testing). Heidelberg: Springer.
Zimmerman, D. W. (1975). Probability spaces, Hilbert spaces, and the axioms of test theory. Psychometrika, 40, 395-412.
Kind regards,
Karin
  • asked a question related to Item Response Theory
Question
4 answers
"When in 1940, a committee established by the British Association for the Advancement of Science to consider and report upon the possibility of quantitative estimates of sensory events published its final report (Ferguson eta/., 1940) in which its non-psychologist members agreed that psychophysical methods did not constitute scientific measurement, many quantitative psychologists realized that the problem could not be ignored any longer. Once again, the fundamental criticism was that the additivity of psychological attributes had not been displayed and, so, there was no evidence to support the hypothesis that psychophysical methods measured anything. While the argument sustaining this critique was largely framed within N. R. Campbell's (1920, 1928) theory of measurement, it stemmed from essentially the samesource as the quantity objection." by Joel Michell
(1) Why "there was no evidence to support the hypothesis that psychophysical methods measured anything" because "the additivity of psychological attributes had not been displayed"?
(2) Item Response Theory (IRT) has no Additivity, Can IRT correctly measure educational testing performance?
Relevant answer
Answer
Dear Nan Kong,
The statement that measurement is either true or false is very rigid and is not even valid for the physical sciences. Measurement can be made with a certain degree of accuracy, and the difference between measurement theories lies in how they increase that degree of accuracy. It is not right to ignore the evolution of psychological measurement and all existing theories. Additivity is an important concept in all theories of psychological measurement, and each theory has its own way of collecting specific evidence for additivity.
Best
  • asked a question related to Item Response Theory
Question
4 answers
If the height of a test information curve (not an item information curve) indicates the discriminability of a test at a given level of a trait, then doesn't it follow that the test scores in the tails of this distribution are uninformative (unreliable)? It seems to me that an inevitable conclusion from IRT is that, for most published scales, extreme test scores will be inherently noisy and therefore should not be given prominence in data analysis (e.g., treating test scores as a continuous variable and using all of the data, including extreme scores), because of the high leverage these data points will have in determining the solutions. At the very least, it seems IRT would compel researchers either to trim their data (e.g., omit the top and bottom 5 or 10% of scores) or, in some cases, to treat the data discretely and perform ANOVAs instead. How does one reconcile the test information curve with the prescription to analyze data as a continuous variable without trimming extreme scores?
Relevant answer
Answer
My general feeling is that IRT and CTT aren't *that* different from each other. Take a look at R. P. McDonald's Test Theory: A Unified Treatment, 1999, LEA for an articulation of this view; Rod essentially considered CTT to be a highly restricted version of the more general factor model. (McDonald was one of my professors but he should speak for himself.) The underlying error model is quite similar and if you want a good link between the two, consider the congeneric test model (aka Spearman factor model). Charlie Lewis wrote a nifty little article on this that was published as a chapter in the edited volume by C. R. Rao and Sandhip Sinharay (2007, Handbook of Statistics Vol 26: Psychometrics.) See also the excellent book by Anders Skrondal & Sophia Rabe-Hesketh (2004, Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models). Warning, this book is highly technical.
IRT quantities are conditional, whereas CTT marginalizes over the sample. This is a substantial point and allows for the power of IRT, such as the ability to generate vertical equated scales, deal with missing-at-random incomplete observations, and to tailor tests for specific positions. IRT parameters are much less sample-dependent than CTT parameters. (The logic is exactly the same for why mixed logit and marginal logit differ.) In my experience, though, a poor IRT analysis is typically paralleled by poor CTT statistics, which is one reason applied testing programs almost always report CTT statistics as well as IRT analyses.
IRT's benefits come along with much stronger assumptions, which may, of course, be wrong and are much more demanding of the analyst. In addition, its benefits really depend on the use of pattern scoring. If you just revert to sum scoring, there is little point to IRT.
What quantities like the information curve CAN tell you is whether you should switch from IRT to LCA. My general feeling is that if the information curve looks "weird", it is possibly a sign that the model is wrong. For instance, in the spring I fit an analysis where LCA was more appropriate. What I saw was an information curve that was highly unstable to model assumptions and had a rather unbelievable peak. This suggested that some of the items were very near to being Guttman items (though not quite). More examination revealed three not strongly distinguished latent classes and some unusual interactions, where examinees with a higher state of knowledge were falling for attractive distractors on some items, so that lower knowledge state examinees got more credit.
If you want a book (kind of so-so, but with many interesting ideas) that discusses why you may want to use an IRT perspective with a discrete latent variable, take a look at Bartolucci, Bacci, & Gnaldi's Statistical Analysis of Questionnaires (2015, CRC). They advocate for the use of a discrete latent variable which represents a restricted LCA model. There are other examples.
  • asked a question related to Item Response Theory
Question
6 answers
Hi everyone,
I am looking for a suitable method to perform a factor analysis of binary data. When going through the literature on factor analysis of binary response data, it appeared to me that both hierarchical item response theory (HIRT) and structural equation modelling (SEM) could be suitable methods. I was wondering, therefore, what the main differences between the two are. Can HIRT be considered a special case of SEM?
Any literature on that topic is welcome, as well as advice on performing an HIRT analysis (mainly, which software is recommended?).
Thank you!
Relevant answer
Answer
Bartholomew et al. ( ) show the relationship between latent variable models for metric and binary manifest variables. As noted above, the ones with binary variables are often called IRT or latent trait models. IRT is common in education and there are lots of commercial and free packages (e.g., mirt in R). The hierarchical in HIRT can mean a few things so I won't comment on that. SEM is also a latent variable approach, and different people include lots of different things within its boundaries. A really broad definition would be anything that includes direct relations between any latent variables. Some IRT models would include this, some wouldn't. So I'd say both are part of the latent variable family of models.
  • asked a question related to Item Response Theory
Question
2 answers
I have been looking at the Chen and Thissen LD chi-square calculations done in IRTPRO and I compared them to the mirt package. The values are different because IRTPRO subtracts the degrees of freedom from the chi-square value and then divides by the square root of two times the degrees of freedom.
What I am confused about is: when I want to assess whether my questions show local dependence, should I look at the chi-square critical table or the z-score table?
Moreover, in the IRTPRO guide they mention that they don't consider values like 2 and 3 to be high, and that values above 10 indicate local dependence:
"Because the standardized LD X2 statistic is only approximately standardized, and is known to be based on a statistic with a long-tailed (X2) distribution, we do not consider values larger than 2 or 3 to be large. Rather, we consider values larger than 10 large, indicating likely LD; values in the range 5-10 lie in a gray area, and may either indicate LD or they may be a result of sparseness in the underlying table of frequencies."
Note: I have 14 questions with 126 responses, my data are dichotomous, and using the LD chi-square test my degrees of freedom will be 1.
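For reference, the usual standardization of a chi-square statistic (which appears to be what IRTPRO reports) is z = (X2 - df) / sqrt(2 * df), and the quoted IRTPRO guidance (roughly: flag values above 10, treat 5-10 as a gray area) refers to this standardized value rather than to a chi-square critical table. A tiny sketch with a hypothetical LD X2 value:
# Usual standardization of a chi-square statistic
standardize_ld <- function(chisq, df) (chisq - df) / sqrt(2 * df)
standardize_ld(chisq = 14.2, df = 1)   # hypothetical LD X2 for one item pair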
Relevant answer
Answer
See this thread on the mirt-package forum: https://groups.google.com/forum/#!topic/mirt-package/v1Tolmzi9To
  • asked a question related to Item Response Theory
Question
27 answers
  • Application of Rasch analysis and IRT models is becoming increasingly popular for developing and validating patient-reported outcome measures. Rasch analysis is a confirmatory approach in which the data have to meet the Rasch model's requirements to form a valid measurement scale, whereas IRT models are exploratory models aiming to describe the variance in the data. Researchers seem to be divided on the preference of one over the other. What is your opinion about this dilemma in the development of patient-reported outcome measures?
Relevant answer
Answer
Rasch requires the data to fit the model in order to generate invariant, interval-level measures (sic.) of items and persons. It is prescriptive. IRT models attempt to create a model that will fit the data; they are descriptive. While IRT users see Rasch as a particular IRT model, most Rasch proponents see it as distinctly different from other IRT models. The key differences are philosophical. Wikipedia provides a suitable introduction:
You might recall Fan's infamous comparison paper. I can add a critique of that if you wish.
  • asked a question related to Item Response Theory
Question
4 answers
Hello,
I am currently trying to work out how to conduct an item analysis on my Likert-scale questionnaire.
The questionnaire consists of 34 questions, which are split between 13 subdomains. I want to determine how the scoring of these subdomains varies between the quartiles of the overall questionnaire score.
I was looking at item response theory, but I am given to understand that this is not appropriate, as Likert scales do not assume that item difficulty varies.
Any guidance is most appreciated!
Relevant answer
Answer
Classical test theory can indeed answer some of your research questions, and there are a number of software programs that can calculate what you need since CTT is so incredibly simple. It wouldn't even take long to analyze it in Excel. However, you might want software like Iteman that specializes in it. CFA on the classical results might be useful.
Nevertheless, as Roberto recommended, I also highly recommend IRT. It is not unreasonably complex, and there are a number of models that are specifically designed for Likert type data. The two important ones you should examine are the Rasch Rating Scale model and the Graded Response model. You can download a free version of Xcalibre to try out on a small sample (http://www.assess.com/xcalibre/). Winsteps is also worth looking at, as it is the gold standard for Rasch analysis.
  • asked a question related to Item Response Theory
Question
9 answers
Is there any possible way?
I understand that it can be done if the options point to the same trait. For example, a question of the type:
I work better:
(a) individually
(b) with other persons
Either of the two options is valid for the person (helping avoid bias), and, for example, if I'm measuring the trait of teamwork, I may think that a person who selects option (b) will have a higher degree of the teamwork trait. Am I making a mistake in assuming this?
Now, is there any way to do this when the response options point to different traits? I want to be able, based on the data from forced-choice items, to carry out normative analysis (to be able to compare with other subjects).
PS: I'm clear that with ipsative items you can't make comparisons between people; however, if you handle the scoring in a different way, could you do it somehow?
Relevant answer
Answer
Hi Ale,
there are recent developments in IRT that allow extracting normative scores from forced-choice questionnaires. The Thurstonian IRT (TIRT) model by Brown and Maydeu-Olivares and the MUPP model by Stark and colleagues are good examples.
From my own experience, the TIRT model works best in practice (i.e., in terms of reliability and validity).
  • asked a question related to Item Response Theory
Question
4 answers
I am currently working on the theoretical framework of my thesis (a new scale for measuring scientific creative thinking), and I am a bit stuck on the theoretical method for creating a new scale: the design stage, Classical Test Theory vs. Item Response Theory...
So any recommendation on the theoretical aspects of these issues would be very welcome.
Thanks in advance
Relevant answer
Answer
I recommend the APA Handbook of Testing and Assessment in Psychology (3-volume set) by Kurt F. Geisinger and Bruce A. Bracken.
Have a nice day,
  • asked a question related to Item Response Theory
Question
4 answers
Good evening, everyone,
I would like to conduct an Item Response Theory-based Differential Item Functioning (DIF) analysis.
I have two groups that answered a test. The test has a bifactor structure with dichotomous responses.
What is, in your opinion, the most appropriate technique (and software) to conduct this kind of DIF analysis?
Thank you,
José Ángel
Relevant answer
Answer
There's only a handful of software packages that perform bifactor-based IRT estimation, so you're limited in that respect. The ones that come to mind are flexMIRT, IRTPRO, and the mirt package in R (the last of which is free, of course).
Performing likelihood ratio tests on nested models, or Wald-based tests, is perfectly valid for these models, though due to the dimension-reduction estimation structure, what you can end up testing may be a little limited from a full-blown DIF perspective.
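A heavily hedged sketch of what an IRT-based DIF run could look like in the mirt package, assuming dichotomous responses in resp, a grouping vector grp, and a purely illustrative assignment of ten items to one general and two specific factors (all names and the factor structure are placeholders, not a recommendation for your actual test):
library(mirt)
spec <- mirt.model("
  G  = 1-10
  S1 = 1-5
  S2 = 6-10
")
# Fit both groups with item parameters constrained equal, freeing group means/variances
mg <- multipleGroup(resp, spec, group = grp, itemtype = "2PL",
                    invariance = c("slopes", "intercepts", "free_means", "free_var"))
# Likelihood-ratio DIF tests, dropping equality constraints item by item
DIF(mg, which.par = c("a1", "d"), scheme = "drop")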
  • asked a question related to Item Response Theory
Question
6 answers
I'm exploring the concept that scales can be ordered and that certain items should carry more weight in a scale. I came across Guttman scalograms and Mokken scaling techniques. Creating the initial Mokken scale makes sense. What I don't get is: after I run Mokken and get my H coefficients and AISP results, how do I evaluate the data in a meaningful way?
If I use a Mokken analysis on a 10-item Likert survey, it wouldn't make sense to take an overall composite mean score to represent the latent ability, since I established that items are ordered by difficulty. Do the H coefficients determine item weights? How can I sum a participant's score on my newly created Mokken scale?
Relevant answer
Answer
With regard to IRT, I would suggest you start with nonparametric IRT (the KernSmoothIRT package) in R to evaluate/describe how the response options function for all items. You may want to drop some items if their options do not work well relative to overall scores. If the response options look OK, move to a parametric graded response model (the ltm package in R) for item parameter estimation, identification of patterns of responding, and construction of a standard error of measurement that reflects where on the continuum the scale performs best/worst. It might also be helpful to start from the theory behind the scale development.
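A minimal sketch of the parametric step with the ltm package, assuming a data frame items of ordinal responses (a placeholder name):
library(ltm)
grm_fit <- grm(items)                   # graded response model
coef(grm_fit)                           # discrimination and extremity parameters
plot(grm_fit, type = "IIC", items = 0)  # items = 0 plots the test information curve
factor.scores(grm_fit)                  # pattern-based ability estimates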
  • asked a question related to Item Response Theory
Question
3 answers
I am interested in test theory.
Now I am studying Classical Test Theory and Item Response Theory.
I understand the fundamentals of item response theory (e.g., how to estimate parameters using R, how to think about equating, etc.).
So, what is the current hot topic in the field of test theory?
In my opinion, there has been a lot of research on Cognitive Diagnosis Models recently.
Please tell me your opinion.
(I used Google Translate, so I may have made mistakes.)
Relevant answer
Answer
I agree, I think DMMs is one of the hottest new topics. As the previous answers noted, CTT and IRT/Rasch are still sufficient in most cases as the psychometric paradigm for a testing program. The hot topics then come in how to use those most efficiently, such as adaptive testing algorithms, automated test assembly, AI & machine learning, etc. A great way to see what is hot is to look at the agendas of recent or upcoming conferences, like https://marces.org/workshop.html on AI in psychometrics or http://iacat.org/2017-iacat-conference-niigata-japan on adaptive testing.
  • asked a question related to Item Response Theory
Question
5 answers
We have data for a dichotomous numerical ability test that we would like to refine.
We are looking for someone with experience in Rasch modeling who can collaborate on this small project. We are thinking that this would be a learning process, and so it would not simply be a matter of doing the analyses.
Relevant answer
Answer
Thanks for the suggestions everyone! We found someone in the Chicago area to work with on this project.
  • asked a question related to Item Response Theory
Question
12 answers
Suppose that the phenomenon we study occurs in about 3% of the population and we have made a two-item screening instrument (with dichotomous items). If we think in terms of Rasch/IRT, what is the optimal item difficulty for these two items? And if it is contingent on their covariance, how do we test it?
Relevant answer
Answer
Excellent, thanks! 
  • asked a question related to Item Response Theory
Question
7 answers
To be more specific, this question does not concern central tendency bias/error -- where respondents are inclined towards centralized responses -- but the concept that mean values expressed on Likert-type response data tend to be centralized due to the issue of using truncated variables (e.g. 5 or 7 points on a finite scale with no continuum).
For example, if you administer a 5-point scale to two respondents, the possible number of combinations for them arriving at a combined mean score of 1 or 5 is one. However, the possible number of combinations for them arriving at a combined mean score of 3 is five.
Obviously, if you increase the respondents to three, four, five, etc. the possible number of combinations to reach a combined mean score of 3 grows exponentially; a plethora of combinations is possible with even a few dozen respondents. Yet, the possible number of combinations for arriving at a mean score of 1 or 5 remains stagnant at one.
How do you approach this dilemma when analyzing data? How can you associate a degree of 2 or 4 with more "oneness" and "fiveness" respectively to account for the central tendency of respondents?
Forced distribution seems feasible, but the practice of imposing a hypothetical normal distribution curve on data seems to me a sub-optimal and outdated practice.
Keyword searches into this problem have brought up concepts like entropy, or ordinal regression, but I am not sure if they address the issue (or perhaps they do, but their application simply goes over my head).
Many thanks for reading. This question is attempting to 'fix' the dilemma of differentiating centralized mean values (e.g. 2.3/5 and 2.8/5) to account for the aforementioned issue of centralization when assessing their differences (e.g. 2.8 - 2.3 = 0.5) so that "lower" or "higher" values (e.g. 2.3) can be interpreted as "closer" to the end of the scale (e.g. 1) than towards the middle of the scale (e.g. 2 or 3).
Relevant answer
Answer
You could run your data through a paired comparisons model with the public vs. private as a predictor variable (possibly even an interaction variable, though I have never really tried that). There are also Rasch rating scale models that have similar functionality.
  • asked a question related to Item Response Theory
Question
3 answers
Is calculating the item discrimination value through a t-test the only method? Are there other methods? Most importantly, in personality scales, when eliminating items, should one consider only the item discrimination value?
Relevant answer
Answer
The world of scale construction has more to offer than just EFA or CFA. You might want to learn about IRT models (Rasch, Graded Response, Mokken, etc.). EFA and CFA are not really appropriate when you want to analyze categorical Likert-ordered items. In IRT models one also uses item-rest correlations, but one needs to know what one wants and does.
  • asked a question related to Item Response Theory
Question
2 answers
Title: From Total Score to General Total Score
Key Words: Total Score, General Total Score, Item difficulty level, Partial credits, Items interaction or duplication, General Total Score of Shared ability, General Total Score of Unique Ability, General Total Score of total ability/all items, General Total Score of subscale, Relation of Total Score and Sub-scores, General Total Score Decomposition.
Introduction:
The Total Score, which is the number of correctly answered items (testing questions), is a major tool for scoring individual examinees (high-stakes scoring). In practice, although the Total Score plays an important role in many settings, it is still not fully qualified for high-stakes scoring because it is only valid under the following three assumptions:
(1) All items (testing questions) are dichotomous items with two possible values, "Right" or "Wrong" (in practice, "1" for "right" and "0" for "wrong"). That is, no item may carry partial credit; an item with possible values 0, 1, 2, 3, ... is not allowed in the Total Score.
(2) All items are of the same difficulty level. Therefore, an examinee who correctly responds to 10 difficult items receives the same Total Score as a student who correctly responds to 10 easy items.
(3) All items are jointly independent. Therefore, the Total Score for two correctly answered, highly duplicated (or even identical) items is the same as the Total Score for two correctly answered, less duplicated (or even independent) items. (A fair rule is that the score for two correctly answered identical items should be the same as the score for correctly answering one of these two identical items.) Also, "all items are jointly independent" implies that the score for the ability shared across different students is always 0, but, in the real world as we know it, this is not true (this issue is related to the decomposition of ability, which will be discussed in more detail).
In this talk, we introduce a Total Score free of the above three assumptions. A Total Score without the above three assumptions is called the General Total Score. That is to say, the General Total Score must hold whether or not the above three assumptions are satisfied. Therefore, in its scoring, the General Total Score must be able to correctly handle
(a) the partial credits associated with each item;
(b) the different difficulty level associated with each item;
(c) the interactions among any combination of the items.
The theory of the General Total Score has been published in "N. Kong. A Mathematical Theory of Ability Measure. Journal of Applied Measurement (2015, Vol. 16, pp. 1-12)". In this paper, we present a simplified version of the General Total Score theory with artificial numerical examples. Readers are welcome to make comments or join the discussion during the presentation. Readers are also strongly encouraged to carry out independent numerical analyses to verify the General Total Score and compare it with results obtained before using other approaches such as the Total Score or Item Response Theory.
The full paper "From Total Score to General Total Score" is available at 
Relevant answer
Answer
Hi, Ken,
Thank you for your interest in this topic.
You are right that the General Total Score is "... a new way to measure student or individual performance in tests ...".
The goal of presenting this topic is to find more applications of the General Total Score and to have users from independent third parties verify it, because the General Total Score is not only a mathematical theory but also a practical business tool.
In terms of information, the General Total Score fully utilizes the information in the students' responses to the testing questions, because it is defined directly from the joint probability of those testing questions (the information carried by random variables is fully described by their joint distribution, nothing more).
The entropy mentioned in this topic is information entropy (also called Shannon entropy), which is a measure of uncertainty. The reason for mentioning entropy here is that the General Total Score is an ability measure that shares the same fundamental structure with entropy, namely additivity, according to measure theory.
I am a mathematician currently working in psychometric data analysis.
Thank you again for this.
Nan Kong
  • asked a question related to Item Response Theory
Question
4 answers
During my research, I asked several companies to complete a questionnaire measuring their risk maturity. At the end of the questionnaire, each respondent receives a score from 1 to 5 (1 = low maturity, 5 = high maturity).
Now I want to analyze the results of the questionnaire. The data are then in the form of a single vector, where each row is the score received by a single participant. On this vector, I want to perform a PCA to find the first 3 PCs.
The idea I want to follow comes from the Arbitrage Pricing Theory (Ross, 1976), where the author performs PCA on the returns of several stocks to understand how many factors influence these returns.
I know that PCA is usually used when I have several dimensions and want to reduce them by finding "factors" as combinations of those dimensions. However, Ross in his paper (attached to this post) uses just the returns to investigate the factors, and I would like to perform a similar analysis on my values to find how many factors can explain the variability of my sample. Even though I know what my objective is, I am not able to get there. Any suggestions?
Relevant answer
How many variables (questions) do you have? PCA is easy to do.
  • asked a question related to Item Response Theory
Question
6 answers
We have a test with 16 items to measure student achievement. The items comprise multiple-choice, closed-ended and open-ended questions.
We have three different scores on student achievement (0: wrong answer, 1: partial credit, 2: full credit).
Which (preferably free) software would you suggest for a partial credit IRT model analysis to calculate student total scores?
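(For reference, a minimal sketch of one free option, the R package mirt, assuming a data frame ach_items with the 16 items scored 0/1/2; the name is a placeholder.)
library(mirt)
# Partial credit model: itemtype = "Rasch" constrains all slopes to 1,
# which gives the PCM for polytomous 0/1/2 items
pcm_fit <- mirt(ach_items, 1, itemtype = "Rasch")
head(fscores(pcm_fit, method = "EAP"))   # IRT-based student scores (EAP estimates)
# Generalized partial credit alternative: itemtype = "gpcm"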
Relevant answer
Answer
Thanks, David!
  • asked a question related to Item Response Theory
Question
6 answers
 I am looking to convert raw FIM scores from a sample of patients from their ordinal scale to an interval scale so that I can use summary indices that are currently used in the rehab literature.
A recent paper (http://www.ncbi.nlm.nih.gov/pubmed/24068767) suggests that when using these indices, the raw data should first be transformed to an equal-interval scale via Rasch analysis before applying any of the measures designed to assess rehab efficacy using the FIM, because there is no way to quantify that one patient's score (e.g. a 5 on UE dressing) means the same as another patient's score of 5 on the same measure, due to inherent patient differences.
I was hoping that there was a method someone may know of that could be used to easily accomplish this score transformation for someone with limited knowledge of Rasch analysis and Item Response theory as it pertains to the FIM.
Relevant answer
Answer
This is accomplished through the use of specialized software. There really is no avoiding such software, either by learning how to use it or by having the local neighborhood psychometrician help out.
  • asked a question related to Item Response Theory
Question
2 answers
Which transformation should I apply to an angular variable (i.e. the slope of a landslide) in order to normalize it (like the arcsine for percentages)?
Even though ANOVA is robust to non-normal distributions, I'd like to know whether I could adapt my cases without altering them.
Relevant answer
Answer
1) If it is cyclic, as in 365 day in a year, then a nonlinear regression using a cosine function should work.
2) If it is dispersal of an insect placed at the center of an arena, then I would look in the literature to find relevant citations. I know that this sort of thing has been published, though the analysis may be too simple to be useful for you.
3) If the problem is the slope of a landslide: are you sure it is not normally distributed? I haven't seen too many slopes at 284 degrees. I would bet that the angles are all between zero and 90 degrees. A slope of zero will not generate much of a landslide, and the only stable "slope" at 90 degrees is solid rock. The rock could collapse, but that is not exactly a landslide. So maybe 30 to 70 degrees? Within that range, do you have a normal distribution? If so, there is no problem here; use ANOVA.
It depends a bit on how you are using the data. So I suspect that as slope increases the probability of a landslide increases. The cumulative probability follows an S shaped curve that reaches 1 at 80 degrees. You could consider logistic regression or a logit transformation for the probability of a landslide. However, if I look at the distribution of slopes at which landslides occur I should see an increase up to some value, and then a decrease. The decrease occurs because there will be very few cases where I observe a landslide at 78 degrees because it is unlikely that the soil will hold sufficiently to achieve this slope. The distribution of slopes might be normally distributed.
However, I am not a geologist. There may be a feature of the data that I have not considered. There may also be a standard method in the geology literature that I am entirely unaware of.
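A minimal sketch of the logistic-regression idea in point 3, with a hypothetical data frame slides containing the slope in degrees and a 0/1 landslide indicator:
logit_fit <- glm(landslide ~ slope, family = binomial, data = slides)
summary(logit_fit)
# Predicted landslide probability across the observed slope range
newdat <- data.frame(slope = seq(min(slides$slope), max(slides$slope), length.out = 100))
newdat$p_hat <- predict(logit_fit, newdata = newdat, type = "response")
plot(newdat$slope, newdat$p_hat, type = "l", xlab = "Slope (degrees)", ylab = "Predicted probability of a landslide")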
  • asked a question related to Item Response Theory
Question
3 answers
What are the implications of choosing extraction and rotation methods when doing EFA?
Relevant answer
Answer
Dear Shinaj,
of course I would say that reading the literature recommended above is inevitable; however, I think it is possible to break the answer down into a few core considerations, after which you may find your way through your specific case.
1) Principal component analysis (PCA) vs. factor (axis) analysis (FA). When you want to reduce the data set (lots of items) to a few variables (e.g., summed factor scores for participants that cover the data best), choose principal component analysis. In other words, the goal is to describe the participants' data with the smallest possible number of variables. These would then be the sum scores of the items that belong to the components (at least if it makes apparent sense to put the found items together; and be careful: explaining all variance also means binning 'item-specific' variance and systematic (factor-explained) variance together, which inflates the overall amount of explained variance. Example: you measure response time for each item in a questionnaire, and PCA will find a component that explains the 'longer' reading time, which is due to the length of the questions the participants read. This will not happen as easily in FA, which seeks to separate item-specific variance from the 'meaningful' covariance between items). Hence, when you believe there is a fine-grained factor structure behind the data, and this is the only thing you want to know, choose an axis analysis, which is more sensitive to this (usually simply termed factor analysis).
2) Rotation. First, rotation means that you optimize the solution by rotating the (hypothetical) factors in the space of observations so that as much variance as possible is explained by the factors, and to further differentiate the factor loadings on the items. When you perform PCA, rotation is futile, because the analysis by definition searches for orthogonal components that explain most of the variance (highest factor loadings); rotation will only give you (nearly) the same results. When using FA, rotation makes sense, because FA searches for structure in the first place, not maximum variance explanation. After conducting a rotation, which again optimizes factor loadings, you can more easily see which item belongs to which factor. The basic choice is between orthogonal and oblique rotation, e.g. varimax rotation, which assumes that factors are independent (orthogonal), and promax rotation, which allows factors to be dependent (oblique/correlated). The latter is to be preferred in EFA, because it is exploratory; and when the factors are in fact independent, promax will give (nearly) the same solution as varimax.
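As a minimal R illustration of these two choices (assuming 'items' is a data frame of questionnaire items; this is a sketch, not a complete EFA workflow):
library(psych)
pc  <- principal(items, nfactors = 3, rotate = "varimax")     # PCA with orthogonal rotation
fa1 <- fa(items, nfactors = 3, rotate = "promax", fm = "pa")  # principal-axis FA, oblique rotation
print(fa1$loadings, cutoff = 0.30)   # which items load on which factor
fa1$Phi                              # factor correlations under the oblique rotation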
There are more points, but I hope this helps prime the in-depth reading into this complex topic :)
Best, René
  • asked a question related to Item Response Theory
Question
7 answers
I have a doubt about PCA (principal components analysis). I have data with 200 cases about sellers, and each case has 500 variables. In one book (Editorial ANUIES, 2007, ISBN: 970704103X, 9789707041035) the author recommends having 5 cases per variable, so I don't know if I can apply PCA, or maybe someone can tell me about another method to reduce the number of variables.
Relevant answer
Answer
You can also use CPT Climate Predictability Tool developed by Columbia University, the input does not have to be climatic data.
  • asked a question related to Item Response Theory
Question
5 answers
My CFA model is an eight-item scale assessing one latent variable (schizophrenia).
Relevant answer
Answer
Hi Mohammad.
Even if a factor model does fit, it may nonetheless be misspecified. Low loadings generally indicate low correlations among the indicators. Hence, there is a low chance for fit measures to detect a misfit, because the main implications of such a model (i.e., local independence) are not violated to an extent that becomes notable.
That being said, although I am not an expert in clinical psychology, you may be inspired by some recent development of alternative models for psychopathological constructs: 
Borsboom, D., & Cramer, A. O. J. (2013). Network analysis: An integrative approach to the structure of psychopathology. Annual Review of Clinical Psychology, 9(1), 91-121. doi:10.1146/annurev-clinpsy-050212-185608
HTH
Holger
  • asked a question related to Item Response Theory
Question
3 answers
Hi everyone, 
I have a question regarding the negative values I am obtaining when calculating first-order sensitivity indices.
For a given number of samples (N), the results show the expected values of the first-order indices (all positive and relative importance according to the function defined) when the parameters are normalized between [0,1]. However, using the same number of samples (N), when I leave the parameters in their original range [-0.5,0.5], I obtain negative values for some of the first-order indices. In addition to this, when the parameters are left in the original range [-0.5,0.5], the values of the first-order indices reflect a completely different order of importance of the parameters.
I understand that increasing the number of samples is important, but for a fixed number of samples (N), could you help me to understand why normalizing between [0,1] is so important? 
Thanks in advance,
David
Relevant answer
Answer
Dear Alexandre Allard and Andrew Paul McKenzie Pegman,
Thank you very much for your interesting comments. In particular, thank you for the reference. I saw in the paper that they rescale variables by multiplying by a factor, but that should be equivalent to moving the range from [-0.5,0.5] to [0,1].
I actually found a possible reason for the negative values of my first-order indices. Initially, I was using the parameter space defined by the simulations/experiments I received data from, and this was giving me the negative values for Si. However, when I define each variable in the parameter space using a uniform distribution, everything works smoothly (Si > 0 and STi greater than or equal to Si). I guess this is because my original data (from simulations/experiments) was not uniformly distributed, and this is a requirement according to my reading of Sobol's and Saltelli's papers. What do you think?
Now I am trying to define variables for the sensitivity analysis using different types of distributions (Bernoulli, normal, ...).
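For reference, here is a small R sketch of a pick-and-freeze estimator of first-order Sobol indices for a toy function with independent uniform inputs (the function, dimension, and sample size below are assumptions, not the actual model):
set.seed(1)
N <- 10000
k <- 3
f <- function(X) X[, 1] + 2 * X[, 2] + 0.5 * X[, 1] * X[, 3]   # hypothetical model
A <- matrix(runif(N * k), ncol = k)   # two independent uniform samples on [0,1]
B <- matrix(runif(N * k), ncol = k)
yA <- f(A)
varY <- var(yA)
Si <- sapply(seq_len(k), function(i) {
  ABi <- A
  ABi[, i] <- B[, i]                  # all columns from A except column i, which comes from B
  mean(f(B) * (f(ABi) - yA)) / varY   # Saltelli-style first-order estimator
})
Si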
Thanks again,
David
  • asked a question related to Item Response Theory
Question
8 answers
I think I can't do it with Excel, and I haven't got SPSS, so suggestions are welcome!
  • asked a question related to Item Response Theory
Question
4 answers
I have an IRT question regarding calibration of the items. I am aware that item parameters are invariant to the sample from which they are estimated.
Suppose there are three schools: A, B, and C. Students in school A have below-average intelligence (IQ) and the school has no exposure to computerized adaptive testing (CAT); students in school B have moderate IQ and partial familiarity with CAT technology; students from school C are the brightest and are also experienced with CAT.
Suppose we have estimated the item parameters of an item bank (say, for Mathematics) using response data from school A (the students with lower intelligence who are not aware of CAT).
Now suppose I take the item-invariance assumption of IRT to be true and ask school C students (the bright ones) to take a test from the same item bank, and it turns out they all perform well.
What adjustments should I make to the item bank so that I can compare the results of students from schools A, B, and C using the numerical scores I get from the test?
Is this test a good way to compare these students?
Would the results be different had I estimated the item parameters using the response data from school C? (My first thought is yes.)
Am I missing something here?
I am open to the discussion.
Thank you in advance.
Relevant answer
Answer
Yes, the Rasch model is sample-free and test-free provided that (a) the data fit the Rasch model (as indicated by item and model fit) and (b) there is no DIF present in the items. So assessing DIF is important for determining whether the measure has the property of being sample-free.
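One hedged way to check this in R is via a multiple-group model in the mirt package; the object names below ('resp' for the responses, 'school' for the grouping vector) are assumptions:
library(mirt)
# anchored model: item parameters equal across groups, group means/variances free
mg <- multipleGroup(resp, 1, itemtype = "Rasch", group = school,
                    invariance = c(colnames(resp), "free_means", "free_var"))
# test each item for DIF by freeing its intercept across groups
dif_res <- DIF(mg, which.par = "d", scheme = "add", items2test = 1:ncol(resp))
dif_res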
  • asked a question related to Item Response Theory
Question
9 answers
I have a dataset with a lot of missing values because the data were generated by a computer adaptive test.
I want to show unidimensionality of the test, but factor analysis is a problem due to the missing values. The paper below seems to suggest that we can use one's theta estimate (I'm using IRTPRO to estimate it with the EAP method) to calculate expected scores and impute them, which seems intuitive. Is this a correct approach?
Can I simply proceed with factor analysis after this, or would I need to round the probabilities to 0s and 1s? Are there other methods?
Regards, Dirk
Relevant answer
Answer
You are taking on something that is very challenging because of the amount of missing data and the pattern of missingness. Also, the number of observations per test item will differ, and I imagine that some will have very low frequency. I would recommend first creating a large person-by-item matrix and then sorting the item dimension by the frequency of observations. Then plan on doing your analysis on the items that have a frequency greater than some number like 100. I would prefer 200, but I don't know how large your data set is. Then check the dimensionality on the smaller set of items. Using a program that will handle the missing data is a necessity. It is also possible to impute the missing values, but I would not impute the missing item responses with a unidimensional IRT model; that will cause underestimation of the dimensionality. When I have done this, I have computed the correlation matrix for the items and then imputed the correlations for missing item pairs. The resulting correlation matrix may need to be smoothed to avoid negative eigenvalues; Bock has a procedure for doing this. Then I prefer doing parallel analysis on the correlation matrix. This all needs to be done cautiously because of all of the issues that are involved.
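A rough R sketch of that workflow, assuming 'resp' is a person-by-item matrix with NAs for unadministered items (the 200-observation cut-off is the arbitrary threshold mentioned above):
library(psych)
keep <- colSums(!is.na(resp)) >= 200
r <- cor(resp[, keep], use = "pairwise.complete.obs")   # pairwise item correlations
r_s <- cor.smooth(r)                                    # smooth to avoid negative eigenvalues
# parallel analysis on the smoothed matrix; the n.obs choice here is deliberately conservative
fa.parallel(r_s, n.obs = min(colSums(!is.na(resp[, keep]))), fa = "fa")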
  • asked a question related to Item Response Theory
Question
3 answers
Context : Performance test, dichotomous, one-dimensional (at least in theory), at least 3PL (variable pseudo-guessing). Theta is assumed normal. But I'm also interested in answers in general in IRT.
Problem : It seems to me that EFA factor loadings provide clear guidelines to rank/select a subset of items from a pool (with referenced rules of thumb, etc.) when one does not have any prior info/assumption of theta (aka for "all" test-takers).
On the other hand, IRT is, in my opinion, a much more accurate representation of the psychological situation that underlies test situations, but it seems to me that there are a variety (especially in IRT-3PL / 4PL) of parameters to take into account all at once to select items, at least without any prior estimation of theta.
So I'm wondering if you know of any guidelines/packages that can be referenced as a clear basis (meaning, not eye-balling item response functions) for item selection there. At this stage I'm thinking of a very non-parsimonious solution, like generating all possible item subsets (I'd get a LOT of models, but why not) -> fit the IRT model -> compute marginal reliability (and/or information, or why not CFI, RMSEA, etc.) for thetas ranging between -3 SD and +3 SD -> rank the subsets by descending marginal reliability (but I'm afraid it would bias towards more items, so I'd maybe have to weight by item count).
Anyway, you get the idea. Any known referenced procedures/packages?
Relevant answer
Answer
What often happens is that the IRT parameters, the proportion correct, and DIF statistics are computed for field-test items and all taken into account when constructing the final test, along with what content the items cover (since field tests will often have several items for one content area but only a few for others). So, yes, these are "eye-balled", but the process starts with criteria from psychometricians, and the decisions are often made by the test developers. This is for large-scale tests.
As far as packages, I'm not sure exactly what you want (if it is just choose your thresholds for various statistics, that could be coded fairly easily), but a big list of IRT packages is on:
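As a rough illustration of the brute-force idea in the question (not an established procedure), one could rank fixed-size item subsets by marginal reliability with mirt; 'resp' and the subset size are assumptions, and the number of subsets explodes very quickly:
library(mirt)
subset_rxx <- function(items, data) {
  mod <- mirt(data[, items], 1, itemtype = "3PL", verbose = FALSE)
  marginal_rxx(mod)
}
subsets <- combn(ncol(resp), 10, simplify = FALSE)   # all 10-item subsets
rxx <- sapply(subsets, subset_rxx, data = resp)
best <- subsets[[which.max(rxx)]]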
  • asked a question related to Item Response Theory
Question
9 answers
What is the best book explaining these advantages and disadvantages?
Relevant answer
Answer
Check out the book Psychological Testing by Anastasi & Urbina. You should find some useful information. I have also attached two PDF documents. You might find them useful too, if you haven't accessed them before. Thanks and good luck with your research.
Steven.
  • asked a question related to Item Response Theory
Question
6 answers
So, I've been stuck trying to get item locations from mirt while using the GPCM. I know that eRm usually gives you both item locations and thresholds, but somehow I haven't been able to find out where they are in mirt. By using:
coefG <- coef(modG, IRTpars = TRUE, simplify = TRUE)
I've only been able to fetch discrimination and thresholds, but apparently the location is hidden somewhere. I need mirt because of the MHRM method, due to some sample issues. Also, it seems that mirt works with a different type of object, otherwise I'd be able to list its elements and find things out at once. So does anyone have a clue how I can find the item locations among my results, or how to calculate and extract them from the thresholds?
Best!
Relevant answer
Answer
Assuming you define 'location' and 'thresholds' in a way analogous to how it is defined in rating-scale models (Muraki, 1992), what 'coef(modG, IRTpars = TRUE, simplify = TRUE) ' returns you as 'bi' parameters is:
(item_location - item_category_threshold)
Because the item_category_thresholds for a given item sum to 0, you can compute the item location as the mean of the 'bi' parameters, and then compute the item_category_thresholds by subtracting the 'bi' parameters from the item location.
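A minimal sketch following that logic (object names are assumed; check the column labels in your own output):
coefG <- coef(modG, IRTpars = TRUE, simplify = TRUE)
b <- coefG$items[, grep("^b", colnames(coefG$items))]   # the 'bi' columns
item_location <- rowMeans(b)                            # location = mean of the bi's
item_thresholds <- item_location - b                    # thresholds sum to 0 within each item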
  • asked a question related to Item Response Theory
Question
4 answers
I was wondering whether someone has a suggestion on how to create a unidimensional IRT model where item pairs (on symptoms) are conditional so that the one item asks about the presence of a symptom (Yes/No) and the other about the severity of the endorsed item (on a 5-step scale).
Conceivably reporting a symptom but rating the lowest severity could be indistinguishable on the latent trait from the "No" response, but how can I test this?
If I combine the item pairs into items with 6 response options, will a nominal model do the trick? Or -- assuming that the symptom and severity items measure the same thing -- should I just use all variables in a Graded Response Model, accepting the systematic missingness, and compare the a parameters for the item pairs? In the latter case the dependence isn't modelled in any way.
Relevant answer
Answer
Thank you all for your thoughtful responses. I'm summarizing what I learned in case someone later stumbles upon this discussion.
I believe latent response modelling is a way of making explicit our assumptions about how the world and our "instruments" interact, and in addition to make them plausible. Latent models can also be argued to simplify our data, in finding the common factor driving responses. Furthermore, for science to be replicable and cumulative, we have to operationalize and quantify, even though it may currently be inferior to intuitive understanding.
Most of the literature suggested was unfortunately not applicable, as it pertained to, for instance, (1) local dependence between items, but not missingness dependent on the other item, (2) "hierarchical" as in second-order latent factor models, or (3) multidimensional IRT models (MIRT).
Phil Chalmers, the author of the mirt package for R, responded on the forum for that tool:
"I think you are correct that merging these two-stimulus questions into one is the best approach to avoid the use of NA's. Otherwise, the MAR assumption would be violated if the dataset contained NA's where participants reported no symptoms present (i.e., it is more likely to see an NA for low participants than high participants, which violates independence). I think an ordinal/graded model should be just fine; no need for a nominal model unless you are really worried about category ordering and wanted to verify."
I tried the nominal model, and the order between strongly disagree vs. disagree re distress was indeed indistinguishable for many items, suggesting that they could be collapsed, and a graded model applied.
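For anyone wanting to reproduce this, a sketch of the merging-and-fitting step with mirt, assuming 'present' (0/1) and 'sev' (1-5, NA when the symptom is absent) are person-by-symptom matrices of the same size:
library(mirt)
merged <- ifelse(present == 0, 0, sev)        # 0 = absent, 1-5 = endorsed severities
dat <- as.data.frame(merged)
modG <- mirt(dat, 1, itemtype = "graded")     # graded response model
modN <- mirt(dat, 1, itemtype = "nominal")    # nominal model to check category ordering
anova(modG, modN)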
  • asked a question related to Item Response Theory
Question
4 answers
I am trying to develop a composite indicator for socio-economic classes using data on ownership of various assets (such as a car, fridge, etc.). I have total household expenditures, which can be used to estimate the strength of the model.
I am confused as to whether to use factor analysis, regression, Rasch models (item response theory), or data envelopment techniques.
Relevant answer
Answer
Dear Jawaid,
The technique depends on your variable types and data analysis objectives. Some of your variables seem to be categorical, so latent class analysis could be an option. But more information is needed in order to give a suggestion.
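If latent class analysis turns out to be the right direction, a hedged R sketch with the poLCA package might look like this (the asset variables are hypothetical and must be coded as positive integers, e.g. 1 = no, 2 = yes):
library(poLCA)
f <- cbind(car, fridge, tv, phone) ~ 1
lca <- poLCA(f, data = assets, nclass = 3, maxiter = 1000)
lca$predclass   # modal class membership for each household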
Best,
Ruben.
  • asked a question related to Item Response Theory
Question
2 answers
Hello,
If a knowledge structure satisfies the conditions of smoothness and consistency, it is called a learning space. I was referring to the "kstructure_is_wellgraded" function in the KST R package; as we know, if a knowledge structure is well-graded then it is also a learning space (according to Theorem 2.2.4 in Learning Spaces).
It turns out that when the knowledge structure is big, the above method takes forever to confirm whether the given knowledge structure is a learning space. That is why I wanted to know if there is any other way, preferably a fast one, to verify the same.
Relevant answer
Answer
Thank you for your reply, Rusiadi. I only need a method which can verify whether a given knowledge structure is a learning space or not. You may call it whatever you like: descriptive or quantitative.
  • asked a question related to Item Response Theory
Question
6 answers
We are planning to apply a specific test to measure problem solving skills of students. Students will be given full or partial credit for the 16 question items they will answer. Are there any practical guidelines on how to score and measure student success with Item Response Theory? 
Relevant answer
Answer
Most IRT packages will make available estimates of the student's latent variable score for "ability", often labelled with the Greek letter theta. Whether you want to use this or some measure that is more easily understood by students is something to consider. The attached addresses some of the questions Elin Skagerberg and I got when we started using multiple choice questions at the university we were at.
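A minimal mirt sketch of that scoring step, with assumed object names ('resp' holds the partial-credit item scores) and an arbitrary reporting scale:
library(mirt)
mod <- mirt(resp, 1, itemtype = "gpcm")   # generalized partial credit model for full/partial credit items
theta <- fscores(mod, method = "EAP")     # latent ability estimates
scaled <- 500 + 100 * theta               # rescale theta to a more familiar reporting metric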
  • asked a question related to Item Response Theory
Question
3 answers
Actually, although the results demonstrate certain limitations of classical test theory and advantages of using IRT, have we surpassed the classical model?
Relevant answer
Answer
Classical test theory (CTT) and item response theory (IRT) are widely perceived as representing two very different measurement frameworks. However, few studies have empirically examined the similarities and differences in the parameters estimated using the two frameworks.
Although CTT has served the measurement community for most of this century, IRT has witnessed exponential growth in recent decades. The major advantage of CTT is its relatively weak theoretical assumptions, which make it easy to apply in many testing situations. Relatively weak theoretical assumptions characterize not only CTT but also its extensions (e.g., generalizability theory). Although CTT's major focus is on test-level information, item statistics (i.e., item difficulty and item discrimination) are also an important part of the CTT model.
For more details, please see:
  • asked a question related to Item Response Theory
Question
4 answers
Ideally such a tool would use a computerized adaptive testing approach?
I am looking for a method which can be used to more precisely screen for a spectrum of behavioral health issues without using an extensive battery of questions?
Such a tool would ideally be used to anonymously screen large numbers of individuals.  The validity and reliability would need to have been established so that the tool could stand up to some scrutiny.
Relevant answer
Answer
Thank you all for your conscientious answers.  Do you know of similar IRT based tools for adults?
  • asked a question related to Item Response Theory
Question
3 answers
The cross-sectional IRT analysis in Stata 14 is quite simple. But what about longitudinal data / repeated measures? I just don't know where to start from. I would really appreciate if somebody can push me into the right direction. If it is not possible with the Stata built-in IRT module then any suggestions what software (except 'R') would be relatively simple to use for the task?
Relevant answer
Answer
I think that your design is three-level: level 3 - individuals; level 2 - items; level 1 - the occasions when the tests were taken. Multilevel models can fit such models where the response is a discrete outcome.
Here is a reference to
A REPEATED MEASURES, MULTILEVEL RASCH MODEL WITH APPLICATION TO SELF-REPORTED CRIMINAL BEHAVIOR
Stata, through GLLAMM, has capabilities for multilevel IRT models - see
How do I fit multilevel IRT models?
Mplus also has such facilities
You said you were not interested in R, but this paper gives good insights as to what you can do with lme4
Estimating the Multilevel Rasch Model: With the lme4 Package
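Since lme4 was mentioned, here is a sketch of the repeated-measures Rasch model as a GLMM in long format; the data frame 'long' and its columns (resp, item, person, occasion) are assumptions:
library(lme4)
m <- glmer(resp ~ -1 + item + (1 | person) + (1 | person:occasion),
           data = long, family = binomial)
summary(m)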
  • asked a question related to Item Response Theory
Question
2 answers
Hello,
I want to experiment with one of the functions from the "DAKS" R package to create a knowledge map/graph. The package already has a dataset with 5 questions in item-response matrix format (items across the columns and responses along the rows), but I need a bigger dataset with at least 50 questions. Does anyone have such data, or an algorithm to simulate such a matrix, where the response to one item can depend on another? It need not be the case that all the questions depend on each other.
Thank you for any help.
Relevant answer
Answer
Hello Irshad.
I suggest you to consult Prof. Cord Hockemeyer, here in ResGate.
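One hedged way to simulate a larger binary item-response matrix whose items are related through a common latent trait is mirt::simdata (the parameter values below are arbitrary):
library(mirt)
set.seed(123)
n_items <- 50
n_persons <- 1000
a <- matrix(rlnorm(n_items, 0.2, 0.3))   # discriminations
d <- matrix(rnorm(n_items, 0, 1))        # intercepts (easiness)
resp <- simdata(a, d, N = n_persons, itemtype = "dich")
dim(resp)   # 1000 x 50 matrix of 0/1 responses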
  • asked a question related to Item Response Theory
Question
3 answers
Hello everyone,
I need to know what properties an item bank should have to carry out a precise CAT. I am particularly interested in the 1PL model.
Properties I am looking for
1. What should be the optimum size of the item bank?
2. What should be the distribution of item difficulty?
Any references are also welcome.
Thank you in advance.
Relevant answer
Answer
Hello Irshad,
In this paper, we describe setting up the foundations for an item bank that could be used as a basis for CAT. It focuses on the 2PL model, but the approach could be used for the 1PL model too.
Yours,
  • asked a question related to Item Response Theory
Question
23 answers
There is software available for item response theory, but it is very hard for me to understand how it works. Can anyone provide information on this?
Thank you
Relevant answer
Answer
Hey Marijana, I wrote an R package for doing IRT with Shiny. It might be helpful for you, if you have any feedback please feel free to let me know!
  • asked a question related to Item Response Theory
Question
6 answers
If a, b, and c parameters have been obtained and published for a normative sample, and I calculate them for a new data set, would it be possible to statistically examine DIF on the basis of these parameters?  Or would I need the full raw data for the normative sample?
Relevant answer
Answer
There is an excellent text on this in: Embretson & Reise (2000) Item Response Theory for Psychologists
  • asked a question related to Item Response Theory
Question
7 answers
Is there any classification for the item difficulty and item discrimination ranges of values in item response theory (IRT) and multidimensional item response theory (MIRT)?
According to Baker in "The basics of item response theory"
item discrimination in IRT is classified into the following:
none: 0
very low: 0.01 - 0.34
low: 0.35 - 0.64
moderate: 0.65 - 1.34
high: 1.35 - 1.69
very high: > 1.70
perfect: + infinity
According to Baker, Hambleton (Fundamentals of Item Response Theory), and Hasmy (Compare Unidimensional and Multidimensional Rasch Model for Test with Multidimensional Construct and Items Local Dependence), item difficulty is classified into the following:
very easy: below -2
easy: -2 to -0.5
medium: -0.5 to 0.5
hard: 0.5 to 2
very hard: above 2
Could the item discrimination and item difficulty classifications also be used in MIRT?
Relevant answer
Answer
  • asked a question related to Item Response Theory
Question
12 answers
Can anybody share about item information in IRT?
Do we have any interpretation ranges value for item information (IIF and TIF)?
Relevant answer
Answer
Ahmad
there are no fixed rules for the amount of information an item or test provides, but information is useful for selecting items: an item with higher information has a lower amount of error, which helps in building a test with high information and minimum error. As Boris mentioned, items vary in how much information they provide across the ability range, so you select items with maximum information in the range you are interested in. See Baker, 1990 and Hambleton & Swaminathan, 1987.
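A small mirt example for inspecting item and test information (object names are assumed):
library(mirt)
mod <- mirt(resp, 1, itemtype = "2PL")
Theta <- matrix(seq(-4, 4, length.out = 200))
ii1 <- iteminfo(extract.item(mod, 1), Theta)   # item information for item 1
ti <- testinfo(mod, Theta)                     # test information
plot(mod, type = "infoSE")                     # test information with conditional standard error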
good luck
  • asked a question related to Item Response Theory
Question
11 answers
I'm looking into the development of an on-line IRT adaptive test for a specific test instrument. Any pointers to help me start out such has published research or case studies would be grateful. I've come across the Concerto platform but would be interested to know what else is out there.
Relevant answer
Answer
Hello Edmund,
In case you are looking for IRT-based adaptive testing R packages, please look at the catR, mirt, and mirtCAT packages. The mirtCAT package in particular provides tools to generate an HTML interface for adaptive testing.
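A rough mirtCAT sketch of a simulated adaptive session from a calibrated mirt model (the argument values are assumptions to adapt, not recommendations):
library(mirtCAT)
mod <- mirt(resp, 1, itemtype = "2PL")              # calibrated item bank
pat <- generate_pattern(mod, Theta = matrix(0.5))   # simulate one examinee
res <- mirtCAT(mo = mod, method = "MAP", criteria = "MI",
               local_pattern = pat,
               design = list(min_SEM = 0.3, max_items = 20))
summary(res)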
Hope it helps you.
  • asked a question related to Item Response Theory
Question
5 answers
I want to calculate the AIC of the GPCM. From the PARSCALE output, I know the value of the -2 log-likelihood (namely G2, or -2ln(L)). AIC = 2k - 2ln(L), where k is the number of estimated parameters of the GPCM. My question is: if the scale has 8 items and each item has 5 categories, what is k?
Relevant answer
Answer
Ying Ling,
as far as I can see, k should be the number of step parameters plus the number of item discriminations. With 8 items and 5 categories, you should have 8*4 = 32 step parameters plus 8 item discrimination parameters.
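Worked out in R for this example (the G2 value is just a placeholder):
k <- 8 * (5 - 1) + 8     # 32 step parameters + 8 discriminations = 40
G2 <- 1234.5             # the -2 log-likelihood reported by PARSCALE
AIC <- 2 * k + G2        # since AIC = 2k - 2ln(L) and G2 = -2ln(L)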
Here is a nice reference in which this is explained (just google for it):
Anne Corinne Huggins-Manley & James Algina (2015): The Partial Credit Model and Generalized Partial Credit Model as Constrained Nominal Response Models, With Applications in Mplus, Structural Equation Modeling: A
Multidisciplinary Journal, DOI: 10.1080/10705511.2014.937374
Regards, Karin
  • asked a question related to Item Response Theory
Question
20 answers
There are many different programs for conducting IRT, some are stand-alone products and some are components within larger statistical programs. What are the most popular programs in common use?
Additionally, does anyone know of any independent comparisons of the various programs to ensure that they produce identical results?
Relevant answer
Answer
I prefer open-source R with the mirt package, but there are a number of different IRT packages in R. Popular closed-source programs are PARSCALE and the more recent IRTPRO. I think Mplus might have IRT capabilities, but I am not sure.
I don't know about comparisons between programs. Best, Felix
  • asked a question related to Item Response Theory
Question
3 answers
I would ideally like to use a single threshold - eg category 0/1 vs 2/3.
Relevant answer
Answer
Thanks, Pat!
Once a mentor, always a mentor.
  • asked a question related to Item Response Theory
Question
3 answers
Does anyone know publications, that compare IRT based item information curves or item information functions of questions/testitems with different response format (but equal content)?
Response formats may differ in number of response options, item wording, etc.
Relevant answer
Answer
Some years ago, we did some analyses on the Beck Depression Inventory looking at the ordinality of the original wording of the response sets. We ended up with "corrected" (collapsed) categories for several items, meaning that a partial credit model was adequate to  the new (and varying) response set. You may look up the article here:
Frick, Ulrich, Jürgen Rehm, and Uta Thien. "On the Latent Structure of the Beck Depression Inventory (BDI): Using the „Somatic “Subscale to Evaluate a Clinical Trial." Applications of Latent Trait and Latent Class Models in the Social Sciences (1997).
  • asked a question related to Item Response Theory
Question
4 answers
I am running an IRT analysis on an instrument in XCalibre, and the analysis reports substantially different means for the items than those calculated in Excel? Is there some weighting happening of which I am unaware?
Relevant answer
Answer
As Alan suggested, there is indeed likely an algorithmic reason that Xcalibre might be calculating differently.  I know of one situation for sure.  A few weeks ago I received a support email with the same question, and the issue was that they were running a 5-point polytomous calibration on a sample of only 36. (!!!!)  Xcalibre automatically combines response levels with N=0.  In that case, there was, for example, an item where no one responded as 1 or 2, only 3-4-5.  The 1 and 2 levels were dropped and it was treated as a 3-option item, and since numbering starts at 0 or 1 (depending on if doing PCM or RSM approach), it gets renumbered.  So the researcher anticipated an item mean of 4 or so and it was reported as 2 or so.
I'd encourage you to contact the support team about the issue.
(Disclosure: I am the author of Xcalibre.)
  • asked a question related to Item Response Theory
Question
8 answers
Point-biserial correlation is used to determine the discrimination index of items in a test. It correlates the dichotomous response on a specific item with the total score on the test. According to the literature, items with a point-biserial correlation above 0.2 are accepted. According to Crocker (Introduction to Classical and Modern Test Theory, p. 234), the threshold for the point-biserial correlation is 2 standard errors above 0.00, and the standard error can be approximated by 1/sqrt(N), where N is the sample size. What is not clear to me is this: in tests we need items that have high discrimination (correlation), and since the point-biserial correlation is a special case of the Pearson correlation, accepting 0.2 as a threshold means accepting that the coefficient of determination is 0.04, i.e. the total score captures only 4% of the item variance.
Relevant answer
Answer
I guess you have to remember back in the old days before desk top computer processing became available and calculations were done by hand or by punch-cards, you would want a rule of thumb that was easy to derive. So rpb=.20 is just a convention for ensuring that the item has a non-chance positive association with the total. Of course arbitrary cut-offs are just that--arbitrary. If you rounded the example of .199 and .201 to .20 you would make the same rule of thumb decision. However, with modern computing you can determine rpb very quickly and accurately. Indeed, if the correlation is >0.00 you could probably keep the item even though it might not help discriminate candidates and it might never get used in any kind of selection or adaptive testing. Rules of thumb are how our industry started over 100 years ago when statistical and computing technology was primitive....keep it in mind when approaching conventions.
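A small R sketch of the rule of thumb being discussed, assuming 'resp' is a person-by-item 0/1 matrix:
total <- rowSums(resp)
rpb <- apply(resp, 2, function(item) cor(item, total - item))   # corrected item-total correlation
se <- 1 / sqrt(nrow(resp))   # approximate standard error of r under the null
threshold <- 2 * se          # the 2-SE convention
which(rpb < threshold)       # items flagged for review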
  • asked a question related to Item Response Theory
Question
2 answers
I'm specifically looking for papers where the expectations are manipulated (e.g. "this test is going to be very difficult" vs "very easy") and ultimately influence results. I'm already aware of all the literature on stereotype-based performance change (e.g. Levy, 1996; Aronson,1999), but I am looking for other kinds of manipulations. Thanks to anybody who'll help me!
Relevant answer
Answer
There is an old popular study concerned with the effect of self-efficacy on performance:
Expectations and performance: An empirical test of Bandura's self-efficacy theory.
Weinberg, Robert S.; Gould, Daniel; Jackson, Allen
Journal of Sport Psychology, Vol 1(4), 1979, 320-331.
Hope, this helps.
  • asked a question related to Item Response Theory
Question
5 answers
Hello, I want to compute the so-called IRT Item Information Function for individual items as well as the latent variable using Stata. Several methods are available for IRT analysis, like clogit, gllamm or raschtest, but so far I could not find any syntax to draw the Information Function using anyone of these methods. So, any help or example is much appreciated.
Relevant answer
Answer
Dear Richard,
This morning I did not see the options test info in the help file of icc_2pl.
Now I did, and the great news is that thanks to icc_2pl, I am also able to create the item information plots and the test information plot.
So, I really have to thank you again for your ado file, and I think you should share it with the Stata community too. It is a great time saver to have this.
Best regards,
Eric
  • asked a question related to Item Response Theory
Question
3 answers
I am using mirt (an R package) for IRT (item response theory) analysis. The data I am currently dealing with are sparse (they contain missing values). Responses are missing because the test is adaptive (in an adaptive test, not all the questions in the item bank are presented to the test taker, so the responses to the questions that were not presented, and the ones the test taker could not solve, are missing). The "mirt" function in the mirt package allows you to calibrate data with missing values (i.e., to fit the IRT models: Rasch, 2PL, 3PL). However, when it comes to item fit analysis (using the "itemfit" function), you cannot pass the sparse data. In this package, if you need item fit analysis for sparse data, you must use imputation. I have two questions here:
1. Are there any other methods available, besides imputation, for item fit analysis when you have sparse data?
2. What is the maximum percentage of sparseness in the response data matrix at which you can use the imputation method and still get reliable results?
Relevant answer
Answer
Irshad, have you considered using different software? Missing data is no problem for Rasch methodology. Imputation is not necessary. In fact, counter-productive. 99%+ missing data is acceptable. See, for instance, "Rasch Lessons from the Netflix® Prize Challenge Competition" at the link.
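For completeness, a sketch of the imputation route that mirt itself offers (adapt the model and object names to your data):
library(mirt)
mod <- mirt(resp, 1, itemtype = "2PL")   # calibration handles the NAs
Theta <- fscores(mod)                    # EAP theta estimates
full <- imputeMissing(mod, Theta)        # plausible values for the missing responses
mod2 <- mirt(full, 1, itemtype = "2PL")
itemfit(mod2)                            # e.g. S-X2 item fit statistics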
  • asked a question related to Item Response Theory
Question
4 answers
Can anybody help interpret the ICC curves of IRT for polytomous response categories?
Does anybody know why the partial credit model is considered better compared to the graded response model?
Relevant answer
Answer
Interpretation depends on what kind of model you are using: Rasch or non-Rasch models... http://www.rasch.org/rmt/rmt191a.htm
  • asked a question related to Item Response Theory
Question
4 answers
My experiment consists of three tests; test one consists of 14 items (126 responses), test two of 16 items (89 responses), and test three of 14 questions (responses). I started applying several unidimensional and multidimensional item response theory models to find the best fit. I ended up with a MIRT (2PL) model for test one and UIRT (2PL) for tests two and three. Now I need to prove that my models are reliable, i.e., that the items' difficulty and discrimination will remain the same regardless of the sample of students I have. I tried to split each test into two sub-samples (50% of the sample), fit my model to each part, and then correlate the sub-samples with each other and with the whole sample. The results were not that good, and I expect this is because the sample size is small and the error becomes higher. Is there any other way to assess the reliability of the item parameters?
Relevant answer
Answer
Because of your small sample, you should try Bayesian IRT. However, some models can be fitted well in exploratory and confirmatory multidimensional IRT. IRTPro 2.1 is a good software, but "mirt" R package does the same algorithms and it is free. http://cran.r-project.org/web/packages/mirt/mirt.pdf
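If you stay with mirt, one hedged way to gauge parameter stability in a small sample is bootstrapping the item parameters (R = 200 is arbitrary):
library(mirt)
mod <- mirt(resp, 1, itemtype = "2PL", SE = TRUE)   # asymptotic standard errors
bs <- boot.mirt(mod, R = 200)                       # nonparametric bootstrap of the parameters
bs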
  • asked a question related to Item Response Theory
Question
3 answers
I need to estimate hierarchical IRT polytomous models, with one general ability and v specific sub-abilities.
Specifically, I am interested in the following two models:
1) Each specific ability theta_j is a linear function of the general ability theta_0, that is:
theta_j = Beta_j * theta_0 + error, with j=1,...,v
2) A model in which the general ability is a linear function of the specific abilities, that is:
theta_0 = sum(Beta_j * theta_j) + error, with j=1,...,v
Can you suggest any software (preferably an R package) that may allow this?
Relevant answer