Science topic

Item Analysis - Science topic

Explore the latest questions and answers in Item Analysis, and find Item Analysis experts.
Questions related to Item Analysis
  • asked a question related to Item Analysis
Question
12 answers
Hi all,
I'm having trouble finding general guidelines for what an appropriate response rate would be during the item analysis phase of a low-stakes questionnaire.
I'm currently working with a dataset of just under 600 cases who submitted responses to a 5-option Likert-scale questionnaire, which will be used to evaluate the effectiveness of a program. There are 30 questions in total, measuring certain personality traits.
The problem I'm running into is that some cases answered only 28 out of 30 questions, or 15 out of 30, or 18 out of 30. Can I include cases with missing responses in my analysis? If so, what would the cut-off be? Could someone who completed 70% of the questionnaire be included? I'm having trouble tracking down empirical evidence for this situation.
Thank you in advance!
Relevant answer
Answer
Missingness is always interesting, as well as being bothersome. When there are large numbers of missing responses, my immediate response is to look for missingness patterns. These can range from simple (stopped answering part-way through) to revealing and important.
As an example of the latter, we used a stigma questionnaire that had a whole clump of missing data. The missing data were concentrated in a small number of items which all related to work-related stigma. It was easy to cross-check: they were coming from people who were not in work. We omitted those items as not relevant to the whole study population.
Other missing items may just be vague or badly-worded so that even the data you have may be suspect – the people who did answer them were not really sure what they were doing. I recommend an item analysis using Mokken scaling to ensure that the items you are using form a definable scale, but even looking at item-rest correlations can help to spot these items. Actually – now I think of it – critically reading the items is the first thing you should do!
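To make the checks suggested above (missingness patterns, a completion cut-off, and item-rest correlations) concrete, here is a minimal pandas sketch. The DataFrame, the 5% missingness rate, and the 70% completion cut-off are all illustrative assumptions, not the poster's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
items = [f"q{i+1}" for i in range(30)]
# Hypothetical 1-5 Likert responses for 600 cases, with ~5% of answers knocked out at random
df = pd.DataFrame(rng.integers(1, 6, size=(600, 30)), columns=items).astype(float)
df = df.mask(rng.random(df.shape) < 0.05)

# 1) Missingness patterns: items skipped per respondent and proportion missing per item
per_case = df.isna().sum(axis=1)
per_item = df.isna().mean(axis=0)
print(per_case.value_counts().sort_index())
print(per_item.sort_values(ascending=False).head())

# 2) Keep cases above a completion cut-off (here: answered at least 70% of items)
complete_enough = df[per_case <= 0.30 * df.shape[1]]

# 3) Corrected item-rest correlations on the retained cases
total = complete_enough.sum(axis=1)
item_rest = {c: complete_enough[c].corr(total - complete_enough[c])
             for c in complete_enough.columns}
print(pd.Series(item_rest).sort_values())
```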
  • asked a question related to Item Analysis
Question
5 answers
Hello,
I have a set of items that would need to be slightly adapted to fit my research.
1) Let's assume I have the following item: "Introduce a new generation of products/services."
Is it possible to change the tense to: "introducED a new generation of products/services"?
2) Let's assume I have the following item: "We introduce a new generation of products/services."
Is it possible to change the personal pronoun from we to I: "I introduce a new generation of products/services."?
Are these two changes possible without any further testing?
David
Relevant answer
Answer
Questionnaire adaptation to suit the research context is common.
  • asked a question related to Item Analysis
Question
5 answers
For example, I prefer to study in (A) Morning (B) Evening (C) Late Night
Relevant answer
Answer
Hello Partha,
No matter what kind of measure you are building, you no doubt have in mind some intended use(s) of the scores from that measure.
Gathering data and demonstrating that the scores have technical adequacy for your intended uses (both score reliability and score validity) is always a good idea. This may well involve at least some aspects of item analysis.
If your items/questions/stimuli are nominal, as in the example you give, you likely won't be computing things such as item-total correlations in the traditional sense. But, for determining temporal stability, you would like to see that the answers given by respondents on one occasion agree to a very high extent with those they might give, say, three weeks later. That could be quantified as a correlation (e.g., Cramer's V) or as a percent agreement, or via Cohen's kappa, computed for each item.
Good luck with your work.
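To make the per-item test-retest check described above concrete, here is a minimal sketch computing Cohen's kappa and simple percent agreement for each nominal item; time1 and time2 are simulated stand-ins for two administrations of the same items a few weeks apart.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
items = ["study_time", "study_place", "note_style"]
# Hypothetical nominal answers (A/B/C) from the same 200 respondents on two occasions
time1 = pd.DataFrame(rng.choice(list("ABC"), size=(200, 3)), columns=items)
time2 = time1.where(rng.random((200, 3)) < 0.8,               # ~80% give the same answer again
                    rng.choice(list("ABC"), size=(200, 3)))

for item in items:
    kappa = cohen_kappa_score(time1[item], time2[item])
    pct_agree = (time1[item] == time2[item]).mean()
    print(f"{item}: kappa = {kappa:.2f}, percent agreement = {pct_agree:.0%}")
```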
  • asked a question related to Item Analysis
Question
6 answers
Sir/ Madam,
I am doing research on the study habits of university students. To measure study habits I have developed a tool in which each item has multiple categorical options. So, how can I analyse each item?
Relevant answer
Answer
By categorical, do you mean the responses are not a scale?
  • asked a question related to Item Analysis
Question
3 answers
Dear RG Community, can you please share how to do item analysis in scale development? I have developed a scale and now need to do the item analysis before the try-out.
Relevant answer
Answer
Maria Dolores Tolsá Thank you very much for your kind answer. I would love to collaborate on my ongoing project at the research publication stage.
  • asked a question related to Item Analysis
Question
5 answers
I am working on a scale with a dichotomous response pattern. I would like to know how to conduct item analysis and assess item reliability.
Relevant answer
Answer
If the "ipsative" scales are all dichotomous (such as introverted-extraverted) you should not have any major problems. The issue arises when all the scales are dependent on one another. My advice is to treat each dichotomy as a single scale rather than as two. Then follow the usual procedures: nothing extra-fancy. For each item you might calculate:
(1) The percentage of respondents answering in each direction
(2) the correlation of the score on the item (say, +1 for the extraverted option) with total score on the scale.
Ideally, (2) would be done after removing the item in question. Doing this one by one is very tedious, though, so some folks don't bother. If you are working in SPSS or a similar program, however, these statistics are readily available when running a reliability analysis (such as Cronbach's alpha).
Again: the main point is that you don't really have two separate scales for extraversion and introversion, even if you are going to use the results to classify people into one crude category or another. You have one overall measure of how extraverted (vs. introverted) they are.
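A minimal sketch of the two statistics listed above for a dichotomous scale: the proportion answering in the keyed direction and the corrected item-total correlation. The 0/1 response matrix below is simulated purely for illustration; SPSS's reliability procedure reports the same corrected correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical 0/1 responses (1 = extraverted option) for 300 people on 20 items
X = pd.DataFrame((rng.random((300, 20)) > 0.4).astype(int),
                 columns=[f"item{i+1}" for i in range(20)])

total = X.sum(axis=1)
summary = pd.DataFrame({
    "p_keyed": X.mean(),                                            # (1) % answering in each direction
    "corrected_item_total": [X[c].corr(total - X[c]) for c in X],   # (2) item vs. rest-of-scale score
})
print(summary.round(3))
```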
  • asked a question related to Item Analysis
Question
4 answers
There are many programs for conducting item analysis. However, the only free program I could find was the Lertap software, which is Excel-based. The issue with Lertap is that its light version is limited to processing scores from fewer than 100 participants.
The Lertap software can be found at this link:
Relevant answer
Answer
Lertap5 is an item analysis program packaged as an Excel "app". The free version allows for an unlimited number of items and up to 250 respondents. The "pro" version has a perpetual licence and costs USD 78, for both Windows and Mac. See:
  • asked a question related to Item Analysis
Question
8 answers
For my research I have a pretest and a post-test, between which there is an instructional treatment. I need to pilot test the items of the pre- and post-tests. Which statistical procedure for calculating reliability and item analysis in SPSS is appropriate?
Relevant answer
Answer
  1. Cronbach's alpha is a very good option for calculating the reliability of a questionnaire. It is also very easy to calculate through SPSS. You can find more detail on this technique in this article: Hasan, S.S., M.Z. Turin, M.K. Ghosh and M.I. Khalil. 2017. Assessing Agricultural Extension Professionals Opinion towards Sustainable Agriculture in Bangladesh. Asian Journal of Agricultural Extension, Economics and Sociology, 17(1): 1-13. DOI: 10.9734/AJAEES/2017/33338.
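For readers without SPSS, Cronbach's alpha is also easy to compute directly from its definition. The sketch below uses simulated Likert data purely for illustration.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: respondents x items matrix (no missing values)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(3)
# Hypothetical 1-5 Likert responses: 150 people x 10 items
data = rng.integers(1, 6, size=(150, 10)).astype(float)
print(f"alpha = {cronbach_alpha(data):.3f}")
```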
  • asked a question related to Item Analysis
Question
2 answers
**Deleted**
Relevant answer
You may want to check the following paper out:
Using XBRL to conduct a large-scale study of discrepancies between the accounting numbers in Compustat and SEC 10-K filings
R. Chychyla, A. Kogan - Journal of Information Systems, 2015 - meridian.allenpress.com
  • asked a question related to Item Analysis
Question
10 answers
I was studying the following paper: Park, D.H., Lee, J. and Han, I., 2007. The effect of on-line consumer reviews on consumer purchasing intention: The moderating role of involvement. International Journal of Electronic Commerce, 11(4), pp. 125-148.
My query is that the first 3 items in the scale used to measure attitude in this paper are definitely about positive attitude, while the next 3 items seem to be about negative attitude. Is my understanding correct? If yes, are these items (#4 to #6) reverse coded?
Relevant answer
Answer
I agree that items 5 and 6 are reverse coded, but #4 essentially refers to "not worrying," so I would expect it to correlate positively with the first 3 items.
  • asked a question related to Item Analysis
Question
4 answers
Hello everyone!
I am currently analysing a questionnaire from a Rasch perspective. The results of the Andersen likelihood ratio test (random split) and the Martin-Loef test (median split) turned out to be significant. I know what significant results mean and which assumptions are violated. However, I am not sure about the possible reasons for the lack of subgroup invariance and for item heterogeneity. What are some of the possible causes of significant results?
I hope that someone of you can help me answer this question. Thank you very much already in advance :)
Best regards,
Lili
Relevant answer
Answer
Dear Lili,
I can only second what David and Georg already said. Other than that I would not recommend using a median split for the Andersen LRT. Simulation studies show that the Andersen LRT performs rather poorly with a random split (https://dx.doi.org/10.22237/jmasm/1555594442).
All the best for your research,
Georg
  • asked a question related to Item Analysis
Question
5 answers
Hi everyone! I am trying to find the "how to" on item difficulty and discrimination in SPSS and its interpretation. I have read about and performed the command Analyze > Scale > Reliability Analysis to get the corrected item-total correlations (which I believe can be interpreted as a discrimination analysis...?). I also watched videos teaching that you can calculate the difficulty index by computing means and sums under Analyze > Descriptive Statistics > Frequencies and interpreting the means. But this is a different method to the one I read about in books and articles (rbis).
I am using this article as reference:
It describes item correlation, discrimination index and difficulty index as different methods for item reduction. Is it adequate to use just one or two of these analyses? Must we use all of them in the same scale construction project? According to the item correlation method, I only keep 4 of the 30 initial items from the scale so far.
Project description: scale construction with 30 initial dichotomous items (True/False)
Relevant answer
Answer
Maria Acevedo Velazquez, just quickly in case it helps, I think that obtaining item-total correlations (through the reliability section of SPSS) is not going to give you the kind of answers that you want.
I think you also need to be clear about what is meant by discrimination. In some contexts, it means that scores on one construct are different from scores on another construct, so that researchers are assured they're not just measuring the same construct in two different ways. In your context (from what I can pick up from your question), you want to be able to discriminate participants one from the other. That's a different kettle of fish.
I hope my pointing out those different things (both of which are dealt with in the Boateng et al. article that you mention) helps you set your sights a bit more clearly.
  • asked a question related to Item Analysis
Question
7 answers
I want to gather data to validate new psychological measures (e.g., personality, attitudes, abilities) I've created (using EFA, CFA, correlations, item analysis, etc)
However, the survey will be very long if I include all items for everyone (~60 minutes).
I've read about split questionnaire designs where you give a different subset of items to random groups of respondents so that they could have, let's say, half the questions but where across the respondents you get data for all items. Then you impute the 'planned missing data'; e.g. Rhemtulla, M., & Little, T. D. (2012). Planned missing data designs for research in cognitive development. Journal of Cognition and Development, 13(4), 425-438.
However, I can't find information about whether this design is appropriate when the purpose of the data collection is to establish the validity of a new measure. As far as I can tell, it *should* be appropriate, but it could be problematic because of the way I'd have to impute large parts of the data.
Has anyone here looked into this or used split questionnaires for this purpose? Has anyone come across articles that have done this?
Relevant answer
Answer
Hi Daniel, oh my... merely hearing terms such as "test banks" gives me goosebumps. In my very personal view, the whole of personality psychology suffers to this day from its historical approach to "measurement"--that is, creating masses of items and then aggregating these items with the help of principal component analysis. Even 10 years ago I had weekly battles with colleagues to convince them of the fundamental differences between PCA and the factor model. While this may now be understood by most folks, most "item banks" are still a result either of that procedure (PCA) or of an even more fundamental positivist/operationalist perspective of constructs being constituted by their measures.
This reminds me of a passage Denny Borsboom wrote in his "attack of the psychometricians" paper (2006) about the failure of CFA to support the model structure proposed for the Big Five:
"Now it turns out that, with respect to the Big Five, CFA gives Big Problems. For instance, McCrae, Zonderman, Costa, Bond, & Paunonen (1996) found that a five factor model is not supported by the data, even though the tests involved in the analysis were specifically designed on the basis of the PCA solution. What does one conclude from this? Well, obviously, because the Big Five exist, but CFA cannot find them, CFA is wrong. “In actual analyses of personality data [. . .] structures that are known to be reliable [from principal components analyses] showed poor fits when evaluated by CFA techniques. We believe this points to serious problems with CFA itself when used to examine personality structure” (McCrae et al., 1996, p. 563)."
You will find even deeper ontological problems in related personality constructs, for instance the nonsensical core self-evaluations construct, which proponents sometimes describe as an aggregate of primary traits (self-esteem, self-efficacy, neuroticism, locus of control) and at other times as a second-order factor. You cannot have it both ways, but the ontological flexibility is breathtaking--as is the non-falsifiability of their models (achieved by completely ignoring model test results). As mentioned, this goes along with very crude perceptions of what a factor means, for instance the undying conviction that a factor has "facets", "parts", or "breadth", which is utter nonsense (as a factor is ONE dimension). Just look at the labels of factors you sometimes see! How can ONE dimension (= one varying attribute) be labeled in a pluralistic way (e.g., core self-evaluationS)?
Okay, I stop my rant here. Here's more ;))
Best
--Holger
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425-440. doi:10.1007/s11336-006-1447-6
  • asked a question related to Item Analysis
Question
3 answers
Hello, as part of my dissertation project I am conducting an item analysis, using a 2x2 mixed-design ANOVA (recommended by my supervisor). Unfortunately I got no significant interactions for any of my items... I am not sure what to say about this, or whether I should dwell on it for long in the discussion. At the moment I think it means that my items do not act in the way they were meant to... so does this undermine my whole experiment? Or should I just advise that future research use different items to measure what I intend to measure? Thank you in advance.
Relevant answer
Answer
No problem at all. In fact, the more creative you get with explanations, the better (for your advisors and for you - don't despair).
  • asked a question related to Item Analysis
Question
6 answers
I am currently working on my master's thesis, which is focused on validating a skills test. The panel members at our university suggested that, in conducting item analysis on the test I prepared, I should divide it into two forms (with an equal number of items) and administer them to two different groups of students. I have followed their recommendation, but I looked for literature or studies to support it and found none. I also figured out that this can't fall under parallel-forms reliability testing, because that requires the two tests to be administered to the same group of students separated by time. I hope you can help me with this.
Thank you very much, and I hope you all stay safe from the COVID-19 virus.
Relevant answer
Answer
Thank you very much Yusuf F Zakariya !
  • asked a question related to Item Analysis
Question
3 answers
My tool has 7 different dimensions; how can I run item analysis? With t-tests, product-moment correlations, or something else?
Relevant answer
Answer
Thanks, Cristian Ramos-Vera and David Morse. I appreciate your suggestions and guidance. Is the t-test useful for making item inclusion or exclusion decisions?
  • asked a question related to Item Analysis
Question
5 answers
I am writing a paper, and a dataset I am using (secondary data) contains, among other things, seven questions related to a corporate website. Every question is of the form "Does your website have attribute _?", followed by one of seven characteristics. In all seven instances, interviewees were asked to give a yes/no answer, where yes = 1 and no = 0.
My question is: is it possible to add these seven answers together for each interviewee in order to derive one 0-7 scale item per interviewee, for the purpose of adding the newly derived item to my SEM analysis?
I would be very grateful if you could provide your thoughts on this matter. If you could advise me of relevant references in which this problem was handled in the same manner, I would be much obliged.
Relevant answer
Answer
Hello Aleksa,
Yes, it's possible; yes, it's done quite often. Since you're building a scale, you might wish to consider evaluating the technical properties of the items and scale. IRT modeling is a useful approach here.
Good luck with your work.
  • asked a question related to Item Analysis
Question
1 answer
In the literature on calculating IRT standard errors (SE), I have found Fisher information mentioned multiple times.
Being curious, I started to play around with the Fisher information in order to obtain the typical information reported as P(theta)Q(theta)a^2.
My understanding of the process failed me when I started to check why the variance of the score is defined as follows
score = s = d/dtheta ln( f(x,theta) )
Var(s) = E[s^2]
Given that the variance is
Var(s) = E[s^2] - E[s]^2
I started looking into why E[s]^2 is zero. As long as f(x,theta) is a density function, I can write
E[s]^2 = [\integral{ d/dtheta ln( f(x,theta) ) * f(x,theta) dx }]^2
= [\integral{ d/dtheta( f(x,theta) ) * f(x,theta)/f(x,theta) dx }]^2
= [\integral{ d/dtheta( f(x,theta) )dx }]^2
= [d/dtheta( \integral{ f(x,theta)dx } ) ]^2
= [d/dtheta( 1 ) ]^2
= 0
But as soon as we use the IRF (item response function), which gives us the probability of getting score x given theta, the computations above no longer work. The reason is that the integral of the IRF is not finite, hence
[d/dtheta( 1 ) ]^2
is not valid.
I have demonstrated that
E[d/dtheta ln( f(x,theta) ) ^ 2] = -1 *E[d/dtheta(d/dtheta ln( f(x,theta) ))]
but that holds only when the integral of f(x, theta) is one and the simplifications above can be made.
Any input on my approach and (not) understanding of the problem?
Relevant answer
Answer
I'm not sure it needs to be, but I'd recommend the Embretson and Reise (2000) IRT text.
Matt
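Tying this back to the P(theta)Q(theta)a^2 expression in the question: for a 2PL item the Fisher information at theta is a^2 * P(theta) * (1 - P(theta)), and the test information is the sum over items. A small numeric sketch with invented item parameters:

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * Q."""
    p = p_2pl(theta, a, b)
    return a**2 * p * (1 - p)

theta = np.linspace(-3, 3, 7)
a_params = np.array([1.2, 0.8, 1.5])   # hypothetical discriminations
b_params = np.array([-1.0, 0.0, 1.0])  # hypothetical difficulties

tif = sum(item_information(theta, a, b) for a, b in zip(a_params, b_params))
se = 1 / np.sqrt(tif)                  # IRT standard error at each theta
for t, i, s in zip(theta, tif, se):
    print(f"theta = {t:+.1f}: TIF = {i:.2f}, SE = {s:.2f}")
```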
  • asked a question related to Item Analysis
Question
2 answers
In this essay, General Total Score (GTS)-based (testing) item analysis is discussed in terms of (1) item difficulty analysis; (2) item decomposition; (3) the K-dependence coefficient (uncertainty coefficient) and item dependence analysis; and (4) item structure analysis across different populations.
Relevant answer
Answer
Good discussion and analysis.
  • asked a question related to Item Analysis
Question
16 answers
I am in the middle of questionnaire development and validation processes. I would like to get expert opinion on these processes whether the steps are adequately and correctly done.
1. Items generation
Items were generated through literature review, expert opinion and target population input. The items were listed exhaustively until saturation.
2. Content validation
The initial item pool was then pre-tested with 10-20 members of the target population to ensure comprehensibility. The items were then reworded based on feedback.
3. Construct validity
a) Bivariate correlation matrix to ensure no item correlations >0.8
b) Principal axis factoring with varimax rotation. KMO statistic >0.5. Bartlett's test of sphericity significant. Items with communalities less than 0.2 were removed one by one, then items with high cross-loadings were removed one by one, and then items with factor loadings <0.5 were removed one by one. This eventually yielded 17 variables with 6 factors, but 4 factors had only 2 items. So I ran 1-, 2-, 3-, 4-, 5- and 6-factor models, and found the 4-factor model the most stable (each factor had at least 3 items with factor loadings >0.4). The next analysis is only on the 4-factor model.
c) Next, I ran principal component analysis without rotation on each factor (there are 4 factors altogether), and each resulted in a correlation matrix determinant >0.01, KMO >0.5, Bartlett's test significant, total variance explained >50%, and no factor loading <0.5.
d) I ran reliability analysis on each factor (there are 4 factors altogether) and found Cronbach's alpha >0.7, while overall reliability is 0.8.
e) I ran a bivariate correlation matrix and found no pair correlations >0.5.
f) Finally, I am satisfied and have decided to choose the four-factor model with 17 variables and 4 factors (with 5, 4, 4 and 4 items respectively), each factor having at least 3 items with loadings >0.5. Reliability for each factor is >0.7, while overall it is 0.8.
.
My question is: am I doing this correctly and adequately?
Your response is highly appreciated.
Thanks.
Regards,
Fadhli
Relevant answer
Answer
The attached file may help...
  • asked a question related to Item Analysis
Question
5 answers
While developing a questionnaire to measure several personality traits in a somewhat unconventional way, I now seem to be facing a dilemma due to the size of my item pool. The questionnaire contains 240 items, theoretically deduced from 24 scales. Although 240 items isn't a "large item pool" per se, the processing time for each item averages ~25 seconds. This yields an overall processing time of over 1.5 hours - way too much, even for the bravest participants!
In short, this results in a presumably common dilemma: what aspects of the data from my item analysis sample do I have to jeopardize?
  • Splitting the questionnaire into parallel tests will reduce processing time, but hinder factor analyses.
  • Splitting the questionnaire into within-subject parallel tests over time will require unfeasible sample sizes due to a) drop-out rates and b) eventual noise generated by possibly low stability over time.
  • An average processing time over 30 minutes will tire participants and jeopardize data quality in general.
  • Randomizing the item order and tolerating the >1.5 hours of processing time will again require an unfeasible sample size, due to lower item-intercorrelations.
I'm aware that this probably has to be tackled by conducting multiple studies, but that doesn't solve most of the described problems.
This must be a very common practical obstacle, and I am curious to know how other social scientists tackle it. Maybe there is even some best-practice advice?
Many thanks!
Relevant answer
Answer
Sounds like you've created an instrument which is sound as it is based on theory - but this version should only be the pilot version, rather than the final instrument. As you see - there are too many items, which will affect the quality of the data collected. Have you conducted any sort of pilot study - factor analysis - and tested to see how those items relate? You'll probably find some items are redundant, and possibly even some scales ... use EFA to explore how the items load - then as you delete items - the number of items in each scale will decrease - you can then delete scales on the basis of insufficient items.
I find David De Vaus' work on survey design and validation very useful:
  • asked a question related to Item Analysis
Question
1 answer
In this essay, General Total Score (GTS)-based (testing) item analysis is discussed in terms of (1) item difficulty analysis; (2) item decomposition; and (3) the K-dependence coefficient (uncertainty coefficient) and item dependence analysis.
For the full essay, please click:
Relevant answer
Answer
Please also read the following essay:
  • asked a question related to Item Analysis
Question
4 answers
I collected opinions from public health specialists about different indicators to create a city health profile. I tried to draw item characteristic curves for each indicator with R software. Please help me with the interpretation of this chart.
The data are the scores for each indicator, i.e. a continuous variable ranging from 1.00 to 5.00 (with decimals).
Relevant answer
Answer
Hi... I think you used the wrong IRT model... your data are continuous...
  • asked a question related to Item Analysis
Question
3 answers
I used a mixed-design ANOVA when analysing my accuracy data and also my RT; some of the results were significant in the by-subject analysis but not in the by-item analysis. The question is how I can explain this. Should I say there is no relation between factor A and factor B, since it is not significant in the by-item analysis? I am a little bit confused and would appreciate it if someone could help.
Many thanks
Relevant answer
Answer
I would say to make sure there is a definite explanation with a varied analysis, offer it both ways, and see which one is most accepted in your region or in that research area. As long as it can be replicated and the hypothesis is shown to be supported or refuted by your research, you should be fine.
  • asked a question related to Item Analysis
Question
2 answers
I conducted an opinion survey to select feasible indicators to assess the health profile of a city. I asked the participants to give a score from 1-5 (1 for low, 5 for high) for each indicator on six aspects, viz. importance, specificity, measurability, attainability and time-bound character. That means each respondent gives a score from 1-5 for each characteristic of each indicator. The maximum total score for each indicator is 30. I collected opinions about 60 different indicators.
If I treat feasibility as the latent trait of every indicator, how can I select highly feasible indicators with the help of item response theory analysis? How do I draw an item characteristic curve for each indicator, and how do I select indicators?
Can anyone please help me to overcome this hurdle?
Relevant answer
Answer
Developing feasible indicators using IRT involves a series of steps. The IRT assumptions are essential in this regard: unidimensionality, local independence and speededness should be examined. The fit of your data to the Rasch rating scale model (it may be the appropriate model for a Likert scale) needs to be investigated in a series of steps to identify misfitting persons as well as misfitting items. The criteria used are that the outfit and infit mean-square indices (MNSQ) should be between 0.7 and 1.3, whereas the standardized values (ZSTD) should be between -2 and 2. The process should be repeated until you reach a set of items whose estimates are person-free. In addition, as a one-parameter Rasch model may be used, in such situations you may need to check that the discrimination index is equal across all items and that the guessing index is equal to zero for all items. Furthermore, the person separation index and item separation index should be used to estimate scale reliability.
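A small illustration of applying the fit criteria quoted above (infit/outfit MNSQ between 0.7 and 1.3, ZSTD between -2 and 2). The fit statistics themselves would come from Rasch software such as Winsteps; the table below is invented purely to show the filtering step.

```python
import pandas as pd

# Hypothetical item fit statistics exported from a Rasch analysis
fit = pd.DataFrame({
    "item":        ["ind1", "ind2", "ind3", "ind4"],
    "infit_mnsq":  [0.95, 1.42, 1.10, 0.66],
    "outfit_mnsq": [1.02, 1.55, 0.88, 0.71],
    "infit_zstd":  [-0.3, 2.6, 1.1, -2.4],
    "outfit_zstd": [0.1, 3.0, -0.9, -1.8],
})

ok_mnsq = fit[["infit_mnsq", "outfit_mnsq"]].apply(lambda s: s.between(0.7, 1.3)).all(axis=1)
ok_zstd = fit[["infit_zstd", "outfit_zstd"]].apply(lambda s: s.between(-2, 2)).all(axis=1)
fit["keep"] = ok_mnsq & ok_zstd
print(fit)   # ind2 and ind4 would be flagged for review before re-running the analysis
```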
  • asked a question related to Item Analysis
Question
3 answers
I want to run correlations between EFA factors to test the orthogonality assumption (that they are uncorrelated). I have a six-factor final structure. Prior to running the varimax rotation, I ran oblimin, and the item correlations were less than .25. With the item analysis suggesting the items were uncorrelated, I am now planning to run a simple correlation among factors to verify the assumption of orthogonality.
Relevant answer
Answer
Hi Salim
Yes, you can get SPSS to add up the scores from the items that constitute a subscale. Alternatively you could ask SPSS to produce factor scores for each factor, these are probably better than summed scores as the factor scores will be weighted whereby the 'best' items contribute more to the factor scores.
However it may be that your research question could be better approached through confirmatory factor analysis. It appears that you know which items are the best indicators for each factor, so why not set up a CFA model and this will allow you to estimate the correlations among the factors/latent variables. You can then see which are correlated or uncorrelated. EFA rotation is a blunt tool, where the factors have to be all correlated or all uncorrelated. It may be that some of your factors are correlated and some are uncorrelated - CFA will tell you this.
Mark
  • asked a question related to Item Analysis
Question
5 answers
I am building a test item analysis portfolio and am reaching out to inquire about resources you may recommend... God's blessings for the day... debe
Relevant answer
Answer
You can also use the Psychometric Tool Box, which is very similar to Excel: http://psico.fcep.urv.cat/utilitats/PsychometricTools/index.html
You can also check the free version of CITAS at http://www.assess.com/citas/
  • asked a question related to Item Analysis
Question
9 answers
Recently I've been reviewing how to handle social desirability in testing.
After much theoretical review I have come to the conclusion that the best way to do this is to neutralize the items with respect to social desirability. The format I will use is a Likert scale.
For example, an item that says "I fight with my coworkers" would be transformed into " sometimes I react strongly to my coworkers" (the second is somewhat more neutral).
The idea comes from the work done by Professor Martin Bäckström.
Now the question I have is: is there any methodology that can help make this neutralization?
If not, what would be good ways to achieve it? What elements should I consider?
I think a good idea might be to "depersonalize" the item. For example, instead of "I fight with my bosses," it would become "I think an employee has the right to fight with his or her boss."
Another option I've thought of is to modify the frequency. For example, instead of "I get angry easily," I'd use "Sometimes I get angry easily."
However, I do not know if these options would affect the validity of the item to measure the construct.
Thank you so much for the help.
Relevant answer
Answer
This problem is something I tried to solve in my study on developing a spiritual serenity scale... In the first study, I found social desirability in the items... A Rasch model analysis found that the items were below the respondents' ability level...
In the next study, I will try changing the response descriptors of the items to be more difficult... For example: never, seldom, sometimes, often, very often, and always...
The options "often", "very often" and "always" will then be chosen only by persons with high ability...
  • asked a question related to Item Analysis
Question
2 answers
I am preparing a value scale to be used with a rural population. The six value areas are theoretical, economic, aesthetic, social, political and religious. How do I go about item analysis for the scale?
Should it be considered as one whole scale, or should item analysis of each subset (religious, economic, etc.) be done separately?
What methods can be used for item analysis and item improvement?
I have a doubt: all subsets depend upon each other. A higher score on one subset will decrease the scores on the other subsets. How can this be handled?
Relevant answer
Answer
It sounds like you are using Allport's scale to measure Spranger's types of men. While they are somewhat useful, researchers usually rely more often on Schwartz's value model. The measures developed by Schwartz (1992) and Schwartz et al. (2001) are more reliable and better validated than Allport's scale. The "Schwartz Value Survey" is a reliable measure of Schwartz's 10 value types, as is the Portrait Value Questionnaire (PVQ) or the PVQ-RR, which measures 19 value types. Gouveia's functional theory would also be an option.
So the best way to improve your survey is to use another value model. If you want to use Spranger's types of men, however, then you should keep each of the six types separate. Combining them would mean that you would lose a lot of information and the interpretation of the data would be difficult.
  • asked a question related to Item Analysis
Question
2 answers
Hi all,
I'm having some theoretical ponderings about polytomous items and item-total correlation. In the binary case, we have a modification of the Pearson correlation of the item X and scale Y
(M1-M0)*var(X)^(1/2)/var(Y)^(1/2)
Does anyone know a polytomous modification/generalization of this form of coefficient? There could be something like [(M1-M0) + (M2-M1)...] x var(X)^(1/2) in the denominator.
See better the formulae in the appendix.
Relevant answer
Answer
Hello Jari,
The point-biserial formula was derived as a short cut to the usual definitional or computational version of the Pearson product-moment correlation formula in the days before statistical computing software or cheap/free calculators. Both the short-cut and PPMC formulae yield the same numeric result. In the case of polytomous items, so long as one is willing to assume interval scale strength to the polytomous scores, the usual PPMC formula will also apply. If interval scale isn't a realistic claim for the data set, then a rank-order correlation would need to be used.
That doesn't address your question directly, but I strongly suspect that the corresponding "short-cut" polytomous generalization version would be more trouble to code than the basic PPMC formula. However, as an intellectual exercise, it may be of interest to some folks.
Good luck with your work!
  • asked a question related to Item Analysis
Question
4 answers
Hello,
I am currently trying to work out how to conduct item analysis on my Likert-scale questionnaire.
The questionnaire consists of 34 questions, which are split between 13 subdomains. I want to determine how the scoring of these subdomains varies between the quartiles of the overall questionnaire score.
I was looking at item response theory but I am given to understand that this is not appropriate as Likert scales do not assume that item difficulty varies.
Any guidance is most appreciated!
Relevant answer
Answer
Classical test theory can indeed answer some of your research questions, and there are a number of software programs that can calculate what you need since CTT is so incredibly simple. It wouldn't even take long to analyze it in Excel. However, you might want software like Iteman that specializes in it. CFA on the classical results might be useful.
Nevertheless, as Roberto recommended, I also highly recommend IRT. It is not unreasonably complex, and there are a number of models that are specifically designed for Likert type data. The two important ones you should examine are the Rasch Rating Scale model and the Graded Response model. You can download a free version of Xcalibre to try out on a small sample (http://www.assess.com/xcalibre/). Winsteps is also worth looking at, as it is the gold standard for Rasch analysis.
  • asked a question related to Item Analysis
Question
4 answers
I want to understand what the reliability index of an item in a questionnaire with a Likert-type scale is. How is it calculated? Is there any other term used for it, and what type of item analysis is this?
Relevant answer
Answer
Hi
The item reliability index is equal to the product of the discrimination and the standard deviation of the item. So item reliability index = rjx * Sx.
This index can be calculated for dichotomous or polytomous item types.
Allen & Yen (1979) explained this index in their book. Take a look for more information.
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Monterey: Brooks/Cole.
Good luck.
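A tiny sketch of the item reliability index described above (item-total correlation times item standard deviation), using made-up 0-4 polytomous responses for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Hypothetical 0-4 Likert-type responses: 120 people x 8 items
X = pd.DataFrame(rng.integers(0, 5, size=(120, 8)),
                 columns=[f"item{i+1}" for i in range(8)]).astype(float)

total = X.sum(axis=1)
r_jx = X.apply(lambda col: col.corr(total))      # item-total correlation (discrimination)
s_x = X.std(ddof=1)                              # item standard deviation
item_reliability_index = r_jx * s_x
print(item_reliability_index.round(3))
```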
  • asked a question related to Item Analysis
Question
9 answers
Is there any possible way?
I understand that if the options point to the same trait, it can be done. For example, a question of the type:
I work better:
(a) individually
(b) with other persons
Either of the two options is valid for the person (helping avoid bias), and, for example, if I'm measuring the trait of teamwork, I may think that a person who selects option (b) will have a higher degree of the teamwork trait. Am I making a mistake in assuming this?
Now, is there any way to do this when the response options point to different traits? I want to be able, based on the data from forced-choice items, to carry out normative analysis (to be able to compare subjects with one another).
PS: I'm clear that with ipsative items you can't make comparisons between people; however, if you handle the scoring in a different way, could you do it somehow?
Relevant answer
Answer
Hi Ale,
there are recent developments in IRT that allow extracting normative scores from forced-choice questionnaires. The Thurstonian IRT (TIRT) model by Brown and Maydeu-Olivares and the MUPP model by Stark and colleagues are good examples.
From my own experience, the TIRT model works best in practice (i.e., in terms of reliability and validity).
  • asked a question related to Item Analysis
Question
3 answers
Hello,
I am evaluating the Pearson correlation between items and the score on the totality of the test. I have been reading about corrected item-test correlation and ended up with this paper
but it looks like it is only for dichotomous items.
Can you suggest any articles regarding corrected Pearson correlation?
Since the correlation is calculated for the total score with the exclusion of the score from the current question, would it be enough to remove the question's score from the average test score across students and from the test score for a specific student?
Thanks
Relevant answer
Answer
The main condition for calculating the Pearson linear correlation coefficient is that the variables involved are measured on the interval/ratio scale, along with the existence of a distribution that does not deviate severely from the normal curve.
If the conditions for using the Pearson test are not met, the following non-parametric alternatives may be used: the chi-square test (for nominal data) or the Spearman or Kendall correlation coefficients (for ordinal data).
The SPSS procedure for calculating the Pearson (r) linear correlation coefficient is: Statistics > Correlate > Bivariate.
Have a nice day,
  • asked a question related to Item Analysis
Question
13 answers
I'm planning to use the KET (Key English Test, prepared by Cambridge University, which tests the skills of Reading, Writing, Listening and Speaking, with each skill equally weighted at 25%) in my research. Although it's a well-known international test, I need to examine the validity and reliability of the test before using it in my research. However, I couldn't find any research about the reliability and validity of the KET in the literature.
Actually, I think I need to do a pilot test and calculate KR-20 or Spearman-Brown or Pearson r for RELIABILITY and do item analysis (item discrimination and item difficulty) for VALIDITY. On the other hand, what should I do if I need to discard some items according to the item analysis result?
Do you have any suggestions for it?
Relevant answer
Answer
It is perhaps good to bear in mind that the psychometric criteria (i) agreement, (ii) reliability and (iii) validity are hierarchically ordered. If there is no agreement among judges, a test cannot be reliable (for judgment-based scores); if a test is not reliable (i.e. does not have reproducible results), it cannot be valid. So, validity (does the test truly measure what it is supposed to measure?) is, in the final analysis, the only criterion that matters.
  • asked a question related to Item Analysis
Question
6 answers
In testing whether questionnaire items have been correctly assigned to a cluster (scale), test developers often look at the correlations between each item and each cluster. To arrive at these, unweighted sum scores of the items per cluster per individual are often used (corrected for self-correlation if the item is part of the cluster). However, a sum score of 36 for a cluster of, for example, 12 items could consist of, for example, (12 x 3), (6 x 1 + 6 x 5), or (2 x 1 + 2 x 2 + 4 x 3 + 2 x 4 + 2 x 5). These cluster compositions are not qualitatively equivalent.
Therefore, I experimented with an alternative: for each item, I had the program average the correlations of that item with each other item of each cluster. This results in values similar to the item-sum correlations but about 2/3 smaller. (If these, in turn, are averaged per cluster, the result is the mean inter-item correlation per cluster, on which Cronbach's alpha is based.)
The distribution of both types of item-cluster correlations is highly similar, but certainly not 100%. I experimented with both in my cluster optimization program (see reference), and found that they ended in somewhat divergent results.
My question is: has this approach to item-cluster correlations been tried out before, meaning I have reinvented the wheel? If so, why has it not become widespread practice; is, perhaps, this wheel not quite circular?
P.S. For a good understanding: the clusters I investigate are symptom clusters of mental disorders. These do not obey the common factor model (see Borsboom et al.). That makes them unsuited for confirmatory factor analysis, unless all residual correlations have been specified. For that, it is still too early.
Relevant answer
Answer
Thanks again. One remark: I do not use Cronbach's alpha for estimating the cluster's reliability but for increasing its homogeneity within proportions. If I would merely use the mean inter-item correlation (cohesion) of the cluster for that purpose, then the cluster would become very small and would only consist of very similar items. Alpha is a function of both cohesion and the size of the cluster. A larger size compensates for a smaller cohesion. So using alpha guards against reducing the cluster too much.
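For readers who want to try the two variants described in the question above, here is a minimal sketch comparing the corrected item-sum correlation with the mean of an item's correlations with the other items of a cluster; the data and the cluster assignment are simulated, illustrative stand-ins.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
items = [f"s{i+1}" for i in range(12)]
X = pd.DataFrame(rng.integers(0, 4, size=(300, 12)), columns=items).astype(float)
cluster = items[:6]                         # hypothetical symptom cluster of 6 items

corr = X.corr()
rows = []
for item in items:
    others = [c for c in cluster if c != item]
    rest_sum = X[others].sum(axis=1)        # unweighted sum score, self-correlation removed
    rows.append({
        "item": item,
        "item_sum_r": X[item].corr(rest_sum),
        "mean_inter_item_r": corr.loc[item, others].mean(),
    })
print(pd.DataFrame(rows).round(3))
```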
  • asked a question related to Item Analysis
Question
6 answers
The purpose is only to store items and item properties and to construct tests. We are using IRTPRO for item analysis, but right now we have all the item information in Excel and we need better organization.
Relevant answer
Answer
Why don't you try my software, X-Pro Milestone.
  • asked a question related to Item Analysis
Question
4 answers
I am developing and pilot testing a screening questionnaire. Two groups filled out the questionnaires: the experimental group (individuals already diagnosed with the disorder) and a control group (individuals not diagnosed with the disorder). As part of my analysis I have to do an item analysis, and I was wondering how to go about it. Should I use my entire dataset (control and experimental group together) and run an item analysis on it, or should I run a separate item analysis for each group and then compare findings? Or perhaps there is another possibility that I did not consider?
Would appreciate the help!
Relevant answer
Answer
Ertuğrul, thank you!
  • asked a question related to Item Analysis
Question
7 answers
I've been working with IRT, and usually I interpret the item information function and test information function based on the purpose of the measure I'm developing. A screening measure will most likely have more information at the low levels of ability. A performance-based test might have a more balanced TIF or might need a curve located at the high levels of the trait.
One thing that caught my attention is that we find people saying in papers that item "y" has little information at any given location, or that the test provides little information across the trait continuum, etc. So... my question is: are there guidelines for the amount of information that an item or a test should provide? Is there any rule of thumb for interpreting how high the TIF should peak?
I understand that the peak of the TIF or IIF isn't the most important information, as we need to pay attention to the area to understand the distribution of the information; still, I had that question on my mind...
Relevant answer
Answer
Elaborating on Ivanouw's answer, for dichotomous models like the 3PL, the standard error of measurement (SEM) is equal to 1/sqrt(TIF(theta)). So, if your TIF has a height of 10, then SEM = 1/sqrt(10) = 0.316. So if that were at the point theta=1, that means that (assuming normally distributed errors) the 95% confidence interval for the true theta with an observed theta = 1 is [0.38,1.62].
So, that's your rule-of-thumb... You pick how narrow you want the 95% CI and work back to the height of the TIF required. I think SEM << 0.5  would be ideal.
However, that said, for most tests most of the time, we don't have the luxury of a rule of thumb, because the best items have some peak IIF and we're limited in how many items we can put on a form.  In other words, there are usually strong limits on how good we can make an exam. The huge advantage of using IRT and examining the TIF is that we avoid making really bad errors, like having holes in our scales where SEM is practically infinite.
In other words, the reason you haven't come across a rule of thumb is because that's not important. But spreading information across the range of scores that we care about is extremely important.
This is also the point where people start wishing for adaptive tests... If I can make the best 15-item test out of a pool of 100 good items, I'll be giving each person a much better exam than if I have to pick 15 items that everyone sees.
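The arithmetic behind the rule of thumb above, written out with the same values used in the answer:

```python
import math

tif = 10.0                      # test information at the theta of interest
sem = 1 / math.sqrt(tif)        # standard error of measurement ~ 0.316
theta_hat = 1.0
ci = (theta_hat - 1.96 * sem, theta_hat + 1.96 * sem)
print(f"SEM = {sem:.3f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")   # roughly [0.38, 1.62]

# Working backwards: for a target SEM of 0.5 you would need TIF >= 1 / 0.5**2 = 4
target_sem = 0.5
required_tif = 1 / target_sem**2
print(f"TIF needed for SEM <= {target_sem}: {required_tif:.0f}")
```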
  • asked a question related to Item Analysis
Question
7 answers
I'm looking for an SPSS macro to do item analysis (dichotomous items). IRT or CTT, it doesn't matter. Thanks
Relevant answer
Answer
I would like to know, please, the SPSS commands for running item response theory combined with multilevel models. Can someone help me?
  • asked a question related to Item Analysis
Question
2 answers
I'm studying the use of CHIC and want to know how I can use the A.S.I. (CHIC) in research about test validation. Can it be applied to assist in item analysis?
Relevant answer
Answer
Hello Rodrigo, it seems to be used with some frequency. Examples:
Validation of the French version of the alcohol, smoking and ... (University of Geneva, by R.A. Khan, 2011)
"Validation of the French version of the alcohol, smoking and substance ... Screening Test (ASSIST) was developed to detect substance use disorders. ... significant correlations between ASSIST scores and scores from ASI, AUDIT and RTQ ... Substance-Related Disorders/diagnosis/epidemiology/psychology."
Also see:
[PDF] Psychological Testing on the Internet (American Psychological Association, by J.A. Naglieri)
"there is a corresponding need for the ethical and professional use of test results. We ... changed, or translated without appropriate permission or validation. ... prior computer use, who has reported difficulty in using the ASI-MV, including some."
The Use of Psychological Testing for Treatment Planning and Outcomes ... (Mark E. Maruish, 2004, Psychology)
"... had little prior experience (e.g., the costs and worth of psychological testing). ... validation of a computer-administered addiction severity index: The ASI-MV ..."
ASI-6 (SciELO, by F. Kessler, 2012)
"OBJECTIVE: To test the psychometric properties of the ASI in its sixth version (ASI-6). ... Transcultural adaptation and validation of an instrument demands careful methodological ... They are psychometrically derived using nonparametric item response theory ... The 25 interviewers were either psychologists or psychiatrists."
Regards
  • asked a question related to Item Analysis
Question
8 answers
In order to score students' responses to 6 open-ended mathematics questions, I am going to use a rubric with 4 scales. The total score for each student on each question would be between 4 and 16. Please advise me on finding the formula for calculating the difficulty and discrimination indexes for non-multiple-choice questions. Thank you.
Relevant answer
Answer
Mehraneh,
Define difficulty as the mean observed score for an item (in your example, between 1-4).  Higher mean = easier task.
Define discrimination as Pearson correlation between item score and total score.  If you want to be a bit more precise, subtract item score from the total score first (so-called corrected item-total correlation).  Classical discrimination indices are all some variation on how well item scores relate to some larger criterion (e.g., total score).
You can, of course, use Qasim's approach, but that requires an extra step of first computing total scores for all respondents, then disaggregating cases into 'high' and 'low' scoring groups.
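A short sketch of both definitions above for rubric-scored (non-multiple-choice) items, with the upper/lower group alternative included for comparison; the scores are simulated purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Hypothetical rubric scores (1-4) for 80 students on 6 open-ended questions
scores = pd.DataFrame(rng.integers(1, 5, size=(80, 6)),
                      columns=[f"q{i+1}" for i in range(6)]).astype(float)

total = scores.sum(axis=1)
difficulty = scores.mean()                                          # higher mean = easier task
discrimination = scores.apply(lambda col: col.corr(total - col))    # corrected item-total correlation

# Alternative (upper/lower group) discrimination: mean difference between top and bottom 27%
cut_hi, cut_lo = total.quantile(0.73), total.quantile(0.27)
upper, lower = scores[total >= cut_hi], scores[total <= cut_lo]
group_discrimination = upper.mean() - lower.mean()

print(pd.DataFrame({"difficulty": difficulty,
                    "corrected_r": discrimination,
                    "upper_minus_lower": group_discrimination}).round(2))
```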
  • asked a question related to Item Analysis
Question
3 answers
Will this be called an adapted questionnaire, or will it be considered a new questionnaire which will have to go through the scale construction procedure of item analysis, etc.?
Relevant answer
Answer
The questionnaire will have been "standardized" (this word can mean different things in different fields) for a particular culture or cultures. Thus, if you are working in a different culture than it was "standardized" it may not have the same psychometric properties. In short, often you will want to adapt it. Sometimes wording changes are necessary to have it make sense (e.g., from American English to British English) and for some populations some items do not make sense. This is one of the issues that makes cross-cultural research difficult.
There are copyright issues with some questionnaires. Also, if you make changes, stress this so that people do not falsely assume that you have not. This includes how you code responses.
  • asked a question related to Item Analysis
Question
8 answers
Shabir Ahmad
Relevant answer
Answer
I think you should form two groups, one with high scores on the scale (top 25% of the sample) and the other with low scores (i.e. bottom 25% of the sample), then apply a t-test between these two groups for each item. The items with significant t-values will be accepted, while those with non-significant t-values will be rejected.
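A minimal sketch of this high/low group item analysis, assuming a DataFrame of Likert item scores; the data and the p < .05 acceptance criterion are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(6)
X = pd.DataFrame(rng.integers(1, 6, size=(200, 12)),
                 columns=[f"item{i+1}" for i in range(12)]).astype(float)

total = X.sum(axis=1)
high = X[total >= total.quantile(0.75)]     # top 25% of total scores
low = X[total <= total.quantile(0.25)]      # bottom 25% of total scores

for item in X.columns:
    t, p = stats.ttest_ind(high[item], low[item], equal_var=False)
    verdict = "keep" if p < 0.05 else "drop"
    print(f"{item}: t = {t:5.2f}, p = {p:.3f} -> {verdict}")
```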
  • asked a question related to Item Analysis
Question
14 answers
Assuming the scale is only measuring a single phenomenon.
Would it be wise to remove one of the two items to make the questionnaire slimmer?
Relevant answer
Answer
In addition, such a high correlation (>.85) is classified as multicollinearity in regression analysis, and the researcher should drop one of the highly correlated variables.
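For completeness, a quick way to scan an item set for such near-duplicate pairs; the .85 threshold follows the comment above, and the data (with one deliberately redundant item) are simulated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(size=(250, 10)), columns=[f"item{i+1}" for i in range(10)])
X["item10"] = 0.95 * X["item9"] + 0.05 * rng.normal(size=250)   # deliberately redundant pair

corr = X.corr().abs()
pairs = [(a, b, round(corr.loc[a, b], 3))
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if corr.loc[a, b] > 0.85]
print(pairs)   # candidates for dropping one item of each pair
```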
  • asked a question related to Item Analysis
Question
3 answers
Hello everyone,
I need to know what properties an item bank should have to carry out a precise CAT. I am particularly interested in the 1PL model.
Properties I am looking for
1. What should be the optimum size of the item bank?
2. What should be the distribution of item difficulty?
Any references are also welcome.
Thank you in advance.
Relevant answer
Answer
Hello Irshad,
In this paper, we describe setting up the foundations for an item bank that could be used as a basis for CAT. It focuses on the 2PL model, but the approach could be used for the 1PL model too.
Yours,
  • asked a question related to Item Analysis
Question
3 answers
I have a problem while analyzing items using the LISREL program. I'm trying to validate four items to measure the extrinsic dimension of learning motivation, but the factor loadings of all the items are negative. I then dropped one item and analyzed the remaining three items. The result is that the model is over-fit, with a p-value of 1.000 and RMSEA of 0.000. So what should I do?
Relevant answer
Answer
If you have one fewer explanatory variable than data points, the model fit will be perfect by definition. Imagine you're trying to model Y as a linear function of X but have only two points: the fitted line will pass through both of them exactly.
More generally, model overfitting is when your function fits the training data a lot better than it fits new data.  It tends to happen when you have too many explanatory variables and/or too small a sample.
  • asked a question related to Item Analysis
Question
3 answers
Many researchers and many handbooks state that the cutoff point for item-total correlations is .3. But Field (2005) specified that with bigger samples, smaller correlation coefficients are acceptable, and Kline (1998) reported that with bigger samples a .2 cutoff point may be accepted.
My main question: is there any reference that defines the exact sample size of these "bigger samples"?
Kline, P. (1998). The New Psychometrics: Science, Psychology, and Measurement. London: Routledge.
Field, A. (2005). Discovering statistics using SPSS (2nd ed.). London: Sage Publication.
Relevant answer
Answer
Hi
I agree with Joan, and the following reference is very helpful:
Nunnally J.C. & Bernstein I.H. (1994) Psychometric Theory.
McGraw-Hill, New York.
Good luck
  • asked a question related to Item Analysis
Question
3 answers
Does anyone know of publications that compare IRT-based item information curves or item information functions for questions/test items with different response formats (but equal content)?
Response formats may differ in number of response options, item wording, etc.
Relevant answer
Answer
Hi Nathan, thank you for your answer. My aim is to evaluate which item format is the best/most informative, as this is a crucial decision in the process of psychometric measure development. In detail, we want to provide evidence that so-called ability items ("Are you able to wash your car?" - 5 ordinal response options) are more informative than performance items ("Did you wash your car last week?" - yes/no) for measuring physical impairment. I've already analyzed the data with graphical/descriptive illustrations. What I am interested in is how other authors present comparable findings in their publications.
  • asked a question related to Item Analysis
Question
8 answers
Point-biserial correlation is used to determine the discrimination index of items in a test. It correlates the dichotomous response on a specific item with the total score on the test. According to the literature, items with a point-biserial correlation above 0.2 are accepted. According to Crocker (Introduction to Classical and Modern Test Theory, p. 234), the threshold for the point-biserial correlation is 2 standard errors above 0.00, and the standard error can be determined by 1/sqrt(N), where N is the sample size. What is not clear to me is that in tests we need items with high discrimination (correlation), and if the point-biserial correlation is a special case of the Pearson correlation, then by accepting 0.2 as a threshold we are accepting that the coefficient of determination is 0.04 and the total score captures only 4% of item variance.
Relevant answer
Answer
I guess you have to remember back in the old days before desk top computer processing became available and calculations were done by hand or by punch-cards, you would want a rule of thumb that was easy to derive. So rpb=.20 is just a convention for ensuring that the item has a non-chance positive association with the total. Of course arbitrary cut-offs are just that--arbitrary. If you rounded the example of .199 and .201 to .20 you would make the same rule of thumb decision. However, with modern computing you can determine rpb very quickly and accurately. Indeed, if the correlation is >0.00 you could probably keep the item even though it might not help discriminate candidates and it might never get used in any kind of selection or adaptive testing. Rules of thumb are how our industry started over 100 years ago when statistical and computing technology was primitive....keep it in mind when approaching conventions.
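A small sketch of the check described in the question above: compute each item's point-biserial correlation with the corrected total and compare it with 2/sqrt(N) and the conventional .20. The 0/1 responses are simulated, so most items will (correctly) fall below the threshold here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n_people, n_items = 400, 25
responses = (rng.random((n_people, n_items)) > 0.5).astype(int)   # hypothetical 0/1 item scores
total = responses.sum(axis=1)

threshold = 2 / np.sqrt(n_people)        # two standard errors above 0.00
for j in range(n_items):
    rest = total - responses[:, j]       # corrected: remove the item from the criterion
    r_pb, _ = stats.pointbiserialr(responses[:, j], rest)
    flag = "ok" if r_pb >= max(threshold, 0.20) else "review"
    print(f"item {j+1:2d}: r_pb = {r_pb:+.3f} ({flag})")
```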
  • asked a question related to Item Analysis
Question
3 answers
I have a 30-item scale, and I want to calculate item bias using SPSS.
Relevant answer
Answer
Dear Mahmoud, there are several available options for estimating item bias (or, more adequately termed, differential item functioning, DIF) using SPSS if the items are dichotomous. Two of the most well-known non-IRT methods are available using the Mantel-Haenszel statistic in Crosstabs (you'll need to cross item response by group by total test score, but it will yield only uniform DIF results). Alternatively, you may use logistic regression methods with item response as the dependent variable and group, total score, and the total score by group interaction as predictors for detecting non-uniform DIF. You'll need to interpret significance and pseudo-R2 values.
If items are ordinal, you cannot use the Mantel-Haenszel test "as is", but you may still use ordinal logistic regression for detecting DIF in this case.
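A rough sketch of the logistic-regression DIF approach described above (uniform and non-uniform), written with statsmodels rather than SPSS; the variable names and simulated data are illustrative, and in practice you would also inspect the pseudo-R2 changes between the nested models as suggested.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(9)
n = 500
group = rng.integers(0, 2, n)                       # 0 = reference, 1 = focal
total = rng.normal(0, 1, n)                         # matching criterion (e.g., standardized total score)
# Hypothetical item response with some uniform DIF against the focal group
logit = 0.8 * total - 0.6 * group
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

def fit(predictors):
    X = sm.add_constant(np.column_stack(predictors))
    return sm.Logit(y, X).fit(disp=False)

m1 = fit([total])                                   # baseline: total score only
m2 = fit([total, group])                            # + group        -> uniform DIF
m3 = fit([total, group, total * group])             # + interaction  -> non-uniform DIF

lr_uniform = 2 * (m2.llf - m1.llf)
lr_nonuniform = 2 * (m3.llf - m2.llf)
print("uniform DIF:     LR =", round(lr_uniform, 2), "p =", round(stats.chi2.sf(lr_uniform, 1), 4))
print("non-uniform DIF: LR =", round(lr_nonuniform, 2), "p =", round(stats.chi2.sf(lr_nonuniform, 1), 4))
```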
  • asked a question related to Item Analysis
Question
4 answers
Can anybody help interpret the ICC curves of IRT for polytomous response categories?
Does anybody know why the partial credit model is considered better than the graded response model?
Relevant answer
Answer
Interpretation depends on what kind of model you are using: Rasch or non-Rasch models... http://www.rasch.org/rmt/rmt191a.htm
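To make the polytomous case concrete, here is a small sketch of the category response curves under the graded response model (the item parameters are invented); each curve gives the probability of endorsing one category as a function of theta, which is what the ICC plot for a polytomous item displays.

```python
import numpy as np

def grm_category_probs(theta, a, thresholds):
    """Graded response model: P(X = k | theta) for ordered categories 0..K."""
    theta = np.atleast_1d(theta)
    # Cumulative probabilities P(X >= k), with P(X >= 0) = 1 and P(X >= K+1) = 0
    cum = [np.ones_like(theta)]
    cum += [1 / (1 + np.exp(-a * (theta - b))) for b in thresholds]
    cum.append(np.zeros_like(theta))
    return np.array([cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)])

# Hypothetical 5-category item (scores 0-4): a = discrimination, thresholds ordered
probs = grm_category_probs(np.linspace(-3, 3, 7), a=1.5, thresholds=[-1.5, -0.5, 0.5, 1.5])
for k, row in enumerate(probs):
    print(f"category {k}:", np.round(row, 2))
```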