Science topic

# Item Analysis - Science topic

Explore the latest questions and answers in Item Analysis, and find Item Analysis experts.

Questions related to Item Analysis

Hi all,

I've been having trouble finding general guidelines on what an appropriate response (completion) rate would be during the item analysis phase of a low-stakes questionnaire.

I'm currently working with a dataset of just under 600 cases who submitted responses to a 5-option Likert-scale questionnaire, the responses to which will be used to evaluate the effectiveness of a program. There are a total of 30 questions measuring certain personality traits.

The problem I'm running into is that some cases answered only 28 out of 30 questions, or 15 out of 30, or 18 out of 30. My question is: can I include cases with missing responses in my analysis? If so, what should the cut-off be? Could someone who completed 70% of the questionnaire be included? I'm just having trouble tracking down empirical evidence for this situation.

Thank you in advance!
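There is no universal standard here, but the filtering step itself is simple: compute each case's completion proportion and keep cases above a cutoff. A minimal sketch in Python with NumPy; the 70% cutoff and the toy data are purely illustrative, not a recommendation:

```python
import numpy as np

# Toy response matrix: 4 cases x 30 items, np.nan marks a missing answer.
# The 0.70 cutoff below is illustrative only, not an established standard.
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(4, 30)).astype(float)  # 5-option Likert
responses[1, :15] = np.nan   # case 1 answered 15/30 (50%)
responses[2, :2] = np.nan    # case 2 answered 28/30 (~93%)

completion = 1 - np.isnan(responses).mean(axis=1)  # proportion answered per case
keep = completion >= 0.70
print(completion)  # e.g. [1.0, 0.5, 0.933..., 1.0]
print(keep)        # [True, False, True, True]
```

Whatever cutoff you choose, reporting it together with the number of excluded cases makes the decision transparent.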

Hello,

I have a set of items that would need to be slightly adapted to fit my research.

1) Let's assume I have the following item: "Introduce a new generation of products/services."

Is it possible to change the tense to: "introducED a new generation of products/services"?

2) Let's assume I have the following item: "We introduce a new generation of products/services."

Is it possible to change the personal pronoun from we to I: "I introduce a new generation of products/services."?

Are these two changes possible without any further testing?

David

For example, I prefer to study in (A) Morning (B) Evening (C) Late Night

Sir/Madam,

I am doing research on the study habits of university students. To measure study habits I have developed a tool in which each item has multiple categorical options. How can I analyze each item?

Dear RG Community, can you please share how to do item analysis in scale development? I have developed a scale and now need to do the item analysis before the try-out.

I am working on a scale with a dichotomous response pattern. I would like to know how to conduct item analysis and assess item reliability.
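For dichotomous (0/1) items, classical item analysis usually starts with each item's difficulty (proportion answering correctly) and its discrimination as the point-biserial correlation with the total score. A minimal sketch; the data matrix is made up for illustration:

```python
import numpy as np

# Toy 0/1 score matrix: 6 examinees x 4 items (made-up data)
X = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
total = X.sum(axis=1)

difficulty = X.mean(axis=0)  # proportion correct per item (higher = easier)

# Point-biserial discrimination: Pearson r between each item and the total.
# A "corrected" variant would correlate the item with (total - item) instead.
discrimination = np.array(
    [np.corrcoef(X[:, j], total)[0, 1] for j in range(X.shape[1])]
)
print(difficulty)
print(discrimination)
```

Reliability for such items is then typically summarized with KR-20, the dichotomous special case of Cronbach's alpha.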

There are many programs for conducting item analysis. However, the only free program I could find was

*Lertap Software*, which is Excel-based. The issue with Lertap is that its light version is limited to processing scores from fewer than 100 participants. The Lertap software can be found at this link:

For my research I have a pretest and a post-test, between which there is an instructional treatment. I need to pilot test the items of the pre- and post-tests. Which statistical procedures for calculating reliability and item analysis in SPSS are appropriate?

I was studying the following paper:
Park, D.H., Lee, J. and Han, I., 2007. The effect of on-line consumer reviews on consumer purchasing intention: The moderating role of involvement. *International Journal of Electronic Commerce*, 11(4), pp. 125-148.

My query is that the first 3 items in the scale used to measure attitude in this paper are definitely about positive attitude, while the next 3 items seem to be about negative attitude. Is my understanding correct? If yes, are these items (#4 to #6) reverse coded?
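If items #4-#6 are indeed negatively worded, reverse coding on a k-point scale is just (min + max) − x. A sketch; the 7-point default below is only an assumption for illustration, so check the paper for the actual response format:

```python
import numpy as np

def reverse_code(x, low=1, high=7):
    """Reverse-score responses on a low..high scale (low <-> high)."""
    return (low + high) - np.asarray(x)

responses = np.array([1, 4, 7, 2])
print(reverse_code(responses))  # [7 4 1 6]
```

After reverse coding, all six items should correlate positively with each other if they measure a single attitude dimension.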

Hello everyone!

I am currently analysing a questionnaire from a Rasch perspective. The results of the Andersen Likelihood Ratio test (random split) and the Martin-Löf test (median split) turned out to be significant. I know what significant results mean and which assumptions are violated. However, I am not sure about possible reasons for the lack of subgroup invariance and for item heterogeneity. What are some possible causes of significant results?

I hope that someone of you can help me answer this question. Thank you very much already in advance :)

Best regards,

Lili

Hi everyone! I am trying to find a "how to" for item difficulty and discrimination in SPSS, and their interpretation. I have read about and run Analyze > Scale > Reliability Analysis to get the corrected item-total correlations (which I believe can be interpreted as a discrimination analysis...?). I have also watched videos showing that you can calculate a difficulty index by computing means and sums under Analyze > Descriptive Statistics > Frequencies and interpreting the means. But this is a different method from the one I read about in books and articles (r_bis).

I am using this article as reference:

It describes item correlation, the discrimination index and the difficulty index as different methods for item reduction. Is it adequate to use just one or two of these analyses? Must we use all of them in the same scale construction project? According to the item correlation method, I only keep 4 of the 30 initial items of the scale so far.

Project description: scale construction with 30 initial dichotomous items (True/False)
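Besides the corrected item-total correlation from the SPSS reliability procedure, the classical upper-lower discrimination index can be computed by hand: sort examinees by total score, take the top and bottom groups (27% is the conventional, not mandatory, split), and compute D = p_upper − p_lower per item. A sketch on fabricated dichotomous data:

```python
import numpy as np

rng = np.random.default_rng(1)
# 100 examinees x 10 toy 0/1 items with varying difficulty
X = (rng.random((100, 10)) < np.linspace(0.3, 0.8, 10)).astype(int)
total = X.sum(axis=1)

order = np.argsort(total)
n_group = int(round(0.27 * X.shape[0]))        # conventional 27% groups
lower, upper = order[:n_group], order[-n_group:]

difficulty = X.mean(axis=0)                    # p-value of each item
D = X[upper].mean(axis=0) - X[lower].mean(axis=0)  # discrimination index
print(difficulty.round(2))
print(D.round(2))
```

The D index and the item-total correlation usually rank items similarly, which is why many projects report only one of them plus difficulty.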

I want to gather data to validate new psychological measures (e.g., personality, attitudes, abilities) I've created (using EFA, CFA, correlations, item analysis, etc)

However, the survey will be very long if I include all items for everyone (~60 minutes).

I've read about split questionnaire designs where you give a different subset of items to random groups of respondents, so that each might answer, say, half the questions, but across respondents you get data for all items. Then you impute the 'planned missing data'; e.g. Rhemtulla, M., & Little, T. D. (2012). Planned missing data designs for research in cognitive development. *Journal of Cognition and Development*, 13(4), 425-438.

However, I can't find information about whether this design is appropriate when the purpose of the data collection is to establish the validity of a new measure. As far as I can tell, it *should* be appropriate, but it could be problematic because of the way I'd have to impute large parts of the data.

Has anyone here looked into this or used split questionnaires for this purpose? Has anyone come across articles that have done this?
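One concrete layout from that planned-missingness literature is the 3-form design: split the pool into a common block X plus rotating blocks A, B and C, and give each respondent X plus two of the three blocks, so every item pair is still jointly observed in part of the sample. A sketch of the assignment logic; the 60-item pool and 15-item blocks are hypothetical sizes for illustration:

```python
import itertools

# Hypothetical 60-item pool: common block X + three rotating blocks A, B, C
blocks = {
    "X": list(range(0, 15)),    # administered to everyone
    "A": list(range(15, 30)),
    "B": list(range(30, 45)),
    "C": list(range(45, 60)),
}

# Each of the three forms contains X plus two of {A, B, C},
# so every pair of items co-occurs on at least one form.
forms = [sorted(blocks["X"] + blocks[p] + blocks[q])
         for p, q in itertools.combinations("ABC", 2)]

for i, form in enumerate(forms):
    print(f"form {i}: {len(form)} of 60 items")
```

Because every item pair is observed for a subsample, the covariance matrix needed for EFA/CFA can be estimated (e.g. with FIML or multiple imputation) rather than being entirely imputed.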

Hello, as part of my dissertation project I am conducting an item analysis. I used a 2x2 mixed-design ANOVA (recommended by my supervisor) for this. Unfortunately, I got no significant interactions for any of my items. I am not sure what I should say about this, or whether I should dwell on it for long in the discussion. At the moment I think it means that my items do not act the way they were meant to. Does this undermine my whole experiment? Or should I just advise that future research use different items to measure what I am intending to measure? Thank you in advance.

I am currently working on my master's thesis, which is focused on validating a skills test. The panel members at our university suggested that, in conducting item analysis on the test I prepared, I should divide it into two forms (with equal numbers of items) and administer them to two different groups of students. I have followed their recommendation, but I have looked for literature or studies to support it and found none. I also figured out that this can't fall under parallel-forms reliability, because that requires the two tests to be administered to the same group of students, separated by time. I hope you can help me with this.

Thank you very much, and I hope you all stay safe from the COVID-19 virus.

My tool has 7 different dimensions; how can I run an item analysis? Should I use t-tests, product-moment correlations, or something else?

I am writing a paper, and a dataset that I am using (secondary data) contains, among others, seven questions related to a corporate website. Every question is of the form "Does your website have attribute _", followed by seven characteristics. In all seven instances, interviewees were asked to give a yes/no answer, where **yes = 1 and no = 0**.

My question is: is it possible to **add together these seven answers** for each interviewee **in order to derive one 0-7 scale item** per interviewee, for the purpose of adding the newly derived item to my SEM analysis? I would be very grateful if you could provide your thoughts on this matter. If you could advise me on relevant **references** in which this problem was handled in the same manner, I would be much obliged.

In the literature on calculating IRT standard errors, I found Fisher information mentioned multiple times.

Being curious, I started to play around with the Fisher information in order to obtain the typical information reported as P(theta)Q(theta)a^2.

My understanding of the process failed me when I started to check why the variance of the score is defined as follows

score: s = d/dtheta ln f(x, theta)

Var(s) = E[s^2]

Given that the variance is

Var(s) = E[s^2] - E[s]^2

I started looking into why E[s]^2 is zero. As long as f(x, theta) is a density function, I can write

E[s]^2 = [ ∫ (d/dtheta ln f(x, theta)) * f(x, theta) dx ]^2

= [ ∫ (d/dtheta f(x, theta)) * f(x, theta)/f(x, theta) dx ]^2

= [ ∫ d/dtheta f(x, theta) dx ]^2

= [ d/dtheta ∫ f(x, theta) dx ]^2

= [ d/dtheta (1) ]^2

= 0

But as soon as we use the IRF (item response function), which gives the probability of getting score x given theta, the computations above no longer work, the reason being that the integral of the IRF over theta is not finite, hence

[ d/dtheta (1) ]^2

is not valid.

I have demonstrated that

E[ (d/dtheta ln f(x, theta))^2 ] = -E[ d/dtheta ( d/dtheta ln f(x, theta) ) ]

but that holds when the integral of f(x, theta) is one and the simplifications can be made.

Any input on my approach and (not) understanding of the problem?
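One possible resolution: the expectation in E[s] is taken over x, not over theta. For a dichotomous item, f(x, theta) is a probability mass function in x, so the sum over x ∈ {0, 1} plays the role of the integral and does equal one; the E[s] = 0 argument therefore still goes through for the IRF. A numeric sketch for a single 2PL item (parameters a, b, theta chosen arbitrarily), where the score works out to s = a(x − P), E[s] = 0, and Var(s) = a²PQ:

```python
import math

# 2PL item with illustrative parameters
a, b, theta = 1.5, -0.2, 0.3
P = 1 / (1 + math.exp(-a * (theta - b)))
Q = 1 - P

# Score s(x) = d/dtheta ln f(x, theta) = a * (x - P) for x in {0, 1}
s = {x: a * (x - P) for x in (0, 1)}
f = {1: P, 0: Q}  # f(x, theta) is a Bernoulli pmf: it sums to 1 over x

E_s = sum(f[x] * s[x] for x in (0, 1))        # expectation of the score over x
Var_s = sum(f[x] * s[x] ** 2 for x in (0, 1)) # = E[s^2] since E[s] = 0

print(E_s)                  # 0 (up to floating point)
print(Var_s, a**2 * P * Q)  # both equal the item information a^2 * P * Q
```

The divergent integral of P(theta) over theta never enters, because Fisher information integrates (here, sums) over the data x at a fixed theta.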

In this essay, General Total Score (GTS)-based (testing) item analysis is discussed in terms of (1) item difficulty analysis; (2) item decomposition; (3) the K-dependence coefficient (uncertainty coefficient) and item dependence analysis; and (4) item structure analysis across different populations.

I am in the middle of questionnaire development and validation processes. I would like to get expert opinion on these processes whether the steps are adequately and correctly done.

1. Item generation

Items were generated through literature review, expert opinion and target population input. The items were listed exhaustively until saturation.

2. Content validation

The initial item pool was then pre-tested with 10-20 members of the target population to ensure comprehensibility. The items were then reworded based on feedback.

3. Construct validity

a) Bivariate correlation matrix to ensure no items correlation >0.8

b) Principal Axis Factoring with Varimax rotation. KMO statistic >0.5; Bartlett's Test of Sphericity significant. Items with communalities less than 0.2 were removed one by one, then items with high cross-loadings were removed one by one, and then items with factor loadings <0.5 were removed one by one. This eventually yielded 17 variables with 6 factors, but 4 factors had only 2 items. So I ran 1-, 2-, 3-, 4-, 5- and 6-factor models and found the 4-factor model to be the most stable (each factor had at least 3 items with factor loadings >0.4). Subsequent analyses are on the 4-factor model only.

c) Next, I ran a Principal Component Analysis without rotation on each of the 4 factors; each resulted in a correlation matrix determinant >0.01, KMO >0.5, a significant Bartlett's test, total variance explained >50%, and no factor loadings <0.5.

d) I ran a reliability analysis on each of the 4 factors and found Cronbach's alpha >0.7, while overall reliability is 0.8.

e) I ran a bivariate correlation matrix and found no pair correlating >0.5.

f) Finally, I am satisfied and have decided to choose the four-factor model with 17 variables across 4 factors (with 5, 4, 4 and 4 items respectively), each factor having at least 3 items with loadings >0.5. Reliability for each factor is >0.7, while overall it is 0.8.
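A per-factor reliability check like the one in step (d) can be reproduced outside SPSS; Cronbach's alpha is alpha = k/(k−1) · (1 − Σ var_item / var_total). A sketch on simulated data (the 200 respondents and 5 items per factor are made up for illustration):

```python
import numpy as np

def cronbach_alpha(items):
    """items: n_respondents x k_items matrix of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 1))
factor_items = latent + 0.8 * rng.normal(size=(200, 5))  # 5 correlated toy items
print(round(cronbach_alpha(factor_items), 2))
```

Running such a check on each of your four factors should reproduce the per-factor alphas reported by the SPSS reliability procedure.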


My question is: am I doing this correctly and adequately?

Your response is highly appreciated.

Thanks.

Regards,

Fadhli

While developing a questionnaire to measure several personality traits in a somewhat unconventional way, I now seem to be facing a dilemma due to the size of my item pool. The questionnaire contains 240 items, theoretically deduced from 24 scales. Although 240 items isn't a *"large item pool"* per se, the processing time per item averages ~25 seconds. This yields an overall processing time of over 1.5 hours - way too much, even for the bravest participants!

In short, this results in a presumably common dilemma: **What aspects of the data from my item analysis sample do I have to jeopardize?**

- Splitting the questionnaire into parallel tests will reduce processing time, but hinder factor analyses.
- Splitting the questionnaire into within-subject parallel tests over time will require unfeasible sample sizes, due to a) drop-out rates and b) eventual noise generated by possibly low stability over time.
- An average processing time over 30 minutes will tire participants and jeopardize data quality in general.
- Randomizing the item order and tolerating the >1.5 hours of processing time will again require an unfeasible sample size, due to lower item intercorrelations.

I'm aware that this probably has to be tackled by conducting multiple studies, but that doesn't solve most of the described problems.

This must be a very common practical obstacle, and I am curious to know how other social scientists tackle it. Maybe there is even some best-practice advice?

Many thanks!

In this essay, General Total Score (GTS)-based (testing) item analysis is discussed in terms of (1) item difficulty analysis; (2) item decomposition; and (3) the K-dependence coefficient (uncertainty coefficient) and item dependence analysis.

For full essay, please click:

I collected opinions from public health specialists about different indicators to create a city health profile. I tried to draw item characteristic curves for each indicator with the R software. Please help me interpret this chart.

The data are the scores for each indicator, i.e. a continuous variable ranging from 1.00 to 5.00 (with decimals).

I used a mixed-design ANOVA when analysing my accuracy data and also my RTs; some of the results were significant in the by-subject analysis but not in the by-item analysis. The question is: how can I explain this? Should I say there is no relation between factor A and factor B, since it is not significant in the by-item analysis? I am a little bit confused and would appreciate it if someone could help.

Many thanks

I conducted an opinion survey to select feasible indicators for assessing the health profile of a city. I asked the participants to give a score from 1-5 (1 for low, 5 for high) for each indicator on six aspects, viz. importance, specificity, measurability, attainability and time-bound character. That means each respondent gives a score from 1-5 for each character of each indicator. The total score for each indicator is 30. I collected opinions about 60 different indicators.

If I treat feasibility as the latent trait of every indicator, how can I select highly feasible indicators with the help of item response theory analysis? How do I draw an item characteristic curve for each indicator, and how do I select indicators?

Can anyone please help me overcome this hurdle?

I want to run correlations between EFA factors to test the orthogonality assumption (uncorrelated factors). I have a six-factor final structure. Prior to running the Varimax rotation, I ran oblimin, and the item correlations were less than .25. With the item analysis suggesting the items were uncorrelated, I am now planning to run a simple correlation among the factors to verify the assumption of orthogonality.

I am building a test item analysis portfolio and am reaching out to inquire about resources you may recommend... God's blessings for the day..... debe

Recently I've been reviewing how to handle social desirability in testing.

After much theoretical review I have come to the conclusion that the best way to do this is to neutralize the items from social desirability. The format I will use is likert scale.

For example, an item that says "I fight with my coworkers" would be transformed into " sometimes I react strongly to my coworkers" (the second is somewhat more neutral).

The idea comes from the work done by Professor Martin Bäckström.

Now the question I have is:

**Is there any methodology that can help make this neutralization?** If not, **what would be good ideas to realize it? What elements should I consider?**

I think a good idea might be to "depersonalize" the item. For example, instead of "I fight with my bosses," it would become "I think an employee has the right to fight with his or her boss".

Another option I've thought of is to modify the frequency. For example, instead of "I get angry easily," I'd use "Sometimes I get angry easily."

However, I do not know if these options would affect the validity of the item to measure the construct.

Thank you so much for the help.

I am preparing a value scale to be used with a rural population. The six value areas are Theoretical, Economic, Aesthetic, Social, Political and Religious. How should I go about item analysis for the scale?

Should it be considered as one whole scale, or should item analysis of each subset (Religious, Economic, etc.) be done separately?

What methods can be used for item analysis and item improvement?

I have a doubt in my mind - all the subsets depend on each other. A higher score on one subset will decrease the score on another. How can this be handled?

Hi all,

I'm having some theoretical ponderings about polytomous items and item-total correlation. In the binary case, we have a modification of the Pearson correlation of the item X and scale Y:

(M1 - M0) * var(X)^(1/2) / var(Y)^(1/2)

Does anyone know a polytomous modification/generalization of this form of coefficient? There could be something like [(M1 - M0) + (M2 - M1) + ...] * var(X)^(1/2) in the numerator.

The formulae can be seen better in the appendix.
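For the binary starting point, the identity can be verified directly: with population standard deviations, (M1 − M0)·sqrt(p·q)/sd(Y) equals the plain Pearson correlation between the 0/1 item and the scale, since var(X) = pq. A numeric sketch on fabricated data (I am not aware of a standard closed-form polytomous analogue, so this only checks the binary case):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=200)    # dichotomous item (0/1)
Y = X * 1.5 + rng.normal(size=200)  # scale score related to the item

p = X.mean()
q = 1 - p
M1, M0 = Y[X == 1].mean(), Y[X == 0].mean()

r_formula = (M1 - M0) * np.sqrt(p * q) / Y.std()  # population sd (ddof=0)
r_pearson = np.corrcoef(X, Y)[0, 1]
print(r_formula, r_pearson)  # identical up to floating point
```

For polytomous items, the Pearson item-total correlation itself is the usual generalization; the (M1 − M0) decomposition is a special property of the two-category case.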

Hello,

I am currently trying to work out how to conduct item analysis on my Likert-scale questionnaire.

The questionnaire consists of 34 questions, split between 13 subdomains. I want to determine how the scoring of these subdomains varies between the quartiles of the overall questionnaire score.

I was looking at item response theory, but I am given to understand that this is not appropriate, as Likert scales do not assume that item difficulty varies.

Any guidance is most appreciated!

I want to understand what the reliability index of an item in a questionnaire with a Likert-type scale is. How is it calculated? Is there any other term used for it, and what type of item analysis is this?

Is there any possible way?

I understand that if the options point to the same trait, it can be done. For example, a question of the type:

I work better:

(a) individually

(b) with other persons

Either of the two options is valid for the person (helping avoid bias), and if, for example, I'm measuring the trait of teamwork, I may think that a person who selects option (b) will score higher on the teamwork trait. Am I making a mistake in assuming this?

Now, is there any way to do this when the response options point to different traits? I want to be able, based on the data from forced-choice items, to carry out normative analysis (to be able to compare with other subjects).

PS: I'm clear that with ipsative items you can't make comparisons between people; however, if you handle the scoring in a different way, could you do it somehow?

Hello,

I am evaluating the Pearson correlation between items and the score on the totality of the test. I have been reading about corrected item-test correlation and ended up with this paper,

but it looks like it is only for dichotomous items.

Can you suggest any articles regarding corrected Pearson correlation?

Since the correlation is calculated for the total score with the exclusion of the score on the current question, would it be enough to remove the question's score from the averageTestScore across students and from the test score for a specific student?

Thanks
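In the usual formulation this works per student rather than via the average: the corrected item-total correlation is simply the Pearson correlation between the item and (total − item), computed case by case, and the means adjust automatically inside the correlation. A sketch on fabricated data:

```python
import numpy as np

rng = np.random.default_rng(4)
scores = rng.integers(0, 2, size=(100, 8)).astype(float)  # 100 students x 8 toy items
total = scores.sum(axis=1)

corrected = np.array([
    np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]  # item vs rest of test
    for j in range(scores.shape[1])
])
uncorrected = np.array([
    np.corrcoef(scores[:, j], total)[0, 1] for j in range(scores.shape[1])
])
print(np.round(corrected, 3))
print(np.round(uncorrected, 3))  # inflated relative to corrected, since the item is part of the total
```

The correction matters most for short tests, where a single item is a large share of its own total.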

I'm planning to use the KET (the Key English Test, prepared by Cambridge University, which tests the skills of Reading, Writing, Listening and Speaking, with each skill equally weighted at 25%) in my research. Although it's a well-known international test, I need to examine its validity and reliability before using it in my research. However, I couldn't find any research on the reliability and validity of the KET in the literature.

Actually, I think I need to do a pilot test and calculate KR-20, Spearman-Brown or Pearson's r for **RELIABILITY**, and do item analysis (item discrimination and item difficulty) for **VALIDITY**. On the other hand, what should I do if I need to discard some items according to the item analysis results? Do you have any suggestions?
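KR-20 is straightforward to compute after the pilot: KR20 = k/(k−1) · (1 − Σ p_j·q_j / var_total). A sketch with a fabricated 0/1 score matrix (one convention shown; some texts use the population rather than the sample variance of the total):

```python
import numpy as np

def kr20(X):
    """X: n_examinees x k_items matrix of 0/1 scores."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    p = X.mean(axis=0)                     # item difficulties
    total_var = X.sum(axis=1).var(ddof=1)  # sample variance of total scores
    return k / (k - 1) * (1 - (p * (1 - p)).sum() / total_var)

X = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
])
print(round(kr20(X), 3))  # 0.903 for this toy matrix
```

If items are discarded after the item analysis, KR-20 should be recomputed on the retained set, since removing items changes both k and the total-score variance.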

In testing whether questionnaire items have been correctly assigned to a cluster (scale), test developers often look at the correlations between each item and each cluster. To arrive at these, unweighted sum scores of the items per cluster per individual are often used (corrected for self-correlation if the item is part of the cluster). However, a sum score of 36 for a cluster of, for example, 12 items could consist of, for example, (12 x 3), (6 x 1 + 6 x 5), or (2 x 1 + 2 x 2 + 4 x 3 + 2 x 4 + 2 x 5). These cluster compositions are not qualitatively equivalent.

Therefore, I experimented with an alternative: for each item, I had the program average the correlations of that item with each other item of each cluster. This results in values similar to the item-sum correlations, but about 2/3 smaller. (If these, in turn, are averaged per cluster, the result is the mean inter-item correlation per cluster, on which Cronbach's alpha is based.)

The distributions of the two types of item-cluster correlations are highly similar, but certainly not 100%. I experimented with both in my cluster optimization program (see reference) and found that they ended in somewhat divergent results.

My question is: has this approach to item-cluster correlations been tried out before, meaning I have reinvented the wheel? If so, why has it not become widespread practice; is this wheel, perhaps, not quite circular?

P.S. For a good understanding: the clusters I investigate are symptom clusters of mental disorders. These do not obey the common factor model (see Borsboom et al.). That makes them unsuited for confirmatory factor analysis, unless all residual correlations have been specified. For that, it is still too early.
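The two item-cluster statistics described above can be put side by side in a few lines, which makes the "similar but smaller" pattern easy to inspect. A sketch on toy data for a single cluster (the 6-item, one-latent-variable setup is fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
latent = rng.normal(size=(300, 1))
cluster = latent + rng.normal(size=(300, 6))  # 6 toy items of one cluster

R = np.corrcoef(cluster, rowvar=False)        # 6 x 6 inter-item correlation matrix
total = cluster.sum(axis=1)

for j in range(cluster.shape[1]):
    # (i) corrected item-sum correlation: item vs sum of the other items
    item_sum = np.corrcoef(cluster[:, j], total - cluster[:, j])[0, 1]
    # (ii) mean correlation of item j with the other items of the cluster
    mean_inter = (R[j].sum() - 1) / (R.shape[0] - 1)
    print(f"item {j}: corrected item-sum r = {item_sum:.2f}, "
          f"mean inter-item r = {mean_inter:.2f}")
```

The mean inter-item value is systematically smaller because summing items averages out their unique variance, which inflates the item-sum correlation relative to any single pairwise correlation.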

The purpose is only to store items and item properties and to construct tests. We are using IRTPRO for item analysis, but right now we have all the item information in Excel, and we need better organization.

I am developing and pilot testing a screening questionnaire. Two groups filled out the questionnaire: the experimental group (individuals already diagnosed with the disorder) and a control group (individuals not diagnosed with the disorder). As part of my analysis I have to do an item analysis, and I was wondering how to go about it. Should I use my entire dataset (control and experimental groups together) and run an item analysis on it, or should I run a separate item analysis for each group and then compare the findings? Or perhaps there is another possibility that I did not consider?

Would appreciate the help!

I've been working with IRT, and I usually interpret the item information function and test information function based on the purpose of the measure I'm developing. A screening measure will most likely have more information at the low levels of ability. A performance-based test might have a more balanced TIF, or might need a curve located at the high levels of the trait.

One thing that caught my attention is that people say in papers that item "y" has little information at a given location, or that the test provides little information along the trait continuum, etc. So my question is: are there guidelines for the amount of information that an item or a test should provide? Is there any rule of thumb for interpreting how high the TIF should peak?

I understand that the peak of the TIF or IIF isn't the most important information, as we need to pay attention to the area to understand the distribution of the information, but I still had that question on my mind.

I'm looking for an SPSS macro to do item analysis (dichotomous). IRT or CTT, it doesn't matter. Thanks.

I'm studying the use of CHIC and want to know how I can use the A.S.I. (CHIC) in research about test validation. Can it be applied to assist in item analysis?

In order to score students' responses to 6 open-ended mathematics questions, I am going to use a 4-scale rubric. The total score for each student on each question would be between 4 and 16. Please advise me on a formula for calculating the difficulty and discrimination indexes for non-multiple-choice questions. Thank you.
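One common convention (among several) for constructed-response items generalizes the difficulty index to (mean − min)/(max − min) and uses the corrected item-total correlation as the discrimination index. A sketch, assuming each question is scored 4-16 as described, with fabricated scores:

```python
import numpy as np

rng = np.random.default_rng(6)
# 80 students x 6 questions, each question scored 4..16 (toy data)
scores = rng.integers(4, 17, size=(80, 6)).astype(float)
lo, hi = 4, 16
total = scores.sum(axis=1)

difficulty = (scores.mean(axis=0) - lo) / (hi - lo)  # 0 = hardest, 1 = easiest
discrimination = np.array([
    np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]  # corrected item-total r
    for j in range((scores.shape[1]))
])
print(difficulty.round(2))
print(discrimination.round(2))
```

The upper-lower group index D = mean_upper − mean_lower (rescaled by the score range) is another option reported for polytomous items in some texts.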

Will this be called an adapted questionnaire, or will it be considered a new questionnaire that will have to go through the scale construction procedure of item analysis, etc.?

Assuming the scale is only measuring a single phenomenon.

Would it be wise to remove one of the two items to make the questionnaire slimmer?

Hello everyone,

I need to know what properties an item bank should have to support a precise CAT. I am particularly interested in the 1PL model.

Properties I am looking for:

1. What should be the optimum size of the item bank?

2. What should be the distribution of item difficulty?

Any references are also welcome.

Thank you in advance.

I have a problem while analyzing items using the LISREL program. I'm trying to validate four items measuring the extrinsic dimension of learning motivation, but the factor loadings of all items are negative. I then dropped one item and analyzed the remaining three. The result is that the model is overfitted, with a p-value of 1.000 and an RMSEA of 0.000. What should I do?

Many researchers and many handbooks state that the cutoff point for item-total correlations is .3. But Field (2005) specified that with bigger samples, smaller correlation coefficients are acceptable, and Kline (1998) reported that with bigger samples a .2 cutoff point may be accepted.

My main question: is there any reference that defines the exact sample size of these "bigger samples"?

Kline, P. (1998). The New Psychometrics: Science, Psychology, and Measurement. London: Routledge.

Field, A. (2005). Discovering statistics using SPSS (2nd ed.). London: Sage Publication.

Does anyone know of publications that compare IRT-based item information curves or item information functions of questions/test items with different response formats (but equal content)?

Response formats may differ in the number of response options, item wording, etc.

The point-biserial correlation is used to determine the discrimination index of items in a test. It correlates the dichotomous response on a specific item with the total test score. According to the literature, items with a point-biserial correlation above 0.2 are accepted. According to Crocker (Introduction to Classical and Modern Test Theory, p. 234), the threshold for the point-biserial correlation is 2 standard errors above 0.00, and the standard error can be determined by 1/sqrt(N), where N is the sample size. What is not clear to me is this: in tests we need items that have high discrimination (correlation), and if the point-biserial correlation is a special case of the Pearson correlation, then by accepting 0.2 as a threshold we are accepting a coefficient of determination of 0.04, i.e. the total score captures only 4% of the item variance.
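The two rules can be put side by side with a few lines of arithmetic; note that Crocker's 2/sqrt(N) criterion and the fixed 0.2 threshold coincide exactly at N = 100, and the r² = 0.04 observation is correct:

```python
import math

# Crocker's rule: keep items whose point-biserial exceeds 2 standard errors,
# with SE = 1/sqrt(N)
for N in (50, 100, 400, 1000):
    threshold = 2 / math.sqrt(N)
    print(f"N={N:4d}: keep items with r_pb > {threshold:.3f}")

r = 0.2
print(f"r = {r} -> shared variance r^2 = {r**2:.2f}")  # 4% of item variance
```

So the fixed 0.2 rule is conservative for large samples and lenient for small ones; the small r² is tolerated in CTT because a total score aggregates many such weakly informative items.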

I have a 30-item scale, and I want to calculate item bias using SPSS.
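As far as I know, base SPSS has no dedicated item-bias (DIF) procedure, but the Mantel-Haenszel statistic behind many item-bias analyses is available via CROSSTABS (Cochran's and Mantel-Haenszel statistics) and is simple enough to compute by hand: stratify examinees by total score, build a 2x2 (group x correct/incorrect) table per stratum, and pool the odds ratios. A sketch with fabricated stratum counts:

```python
# Mantel-Haenszel common odds ratio for one item (fabricated counts).
# Each stratum (one total-score level) is a tuple (A, B, C, D):
#   A = reference correct, B = reference incorrect,
#   C = focal correct,     D = focal incorrect.
strata = [
    (30, 10, 25, 15),
    (40, 20, 35, 25),
    (20, 25, 15, 30),
]

num = sum(A * D / (A + B + C + D) for A, B, C, D in strata)
den = sum(B * C / (A + B + C + D) for A, B, C, D in strata)
or_mh = num / den
print(round(or_mh, 3))  # 1.571 here; values far from 1 suggest uniform DIF
```

Repeating this per item across the 30-item scale (with the item removed from its own stratifying total) gives a basic uniform-DIF screen.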

Can anybody help interpret the ICC curves of an IRT model with polytomous response categories?

Does anybody know why the partial credit model would be better than the graded response model?