8th Nov, 2021

Portland State University


Question

Asked 2nd Nov, 2021

Hello,

In my dataset, some of the categorical variables have 18 or more levels. Can I merge levels based on their frequency, or should I keep them in the model as they are? And on what basis should I combine them?

Thank you


If you have a range of different jobs grouped together into an "Other" category, check whether that group's mean on your dependent variable is close to the overall mean. If so, that would make the "Other" category a good candidate for the "omitted" (reference) category in a dummy variable analysis.
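A rough sketch of that check in plain Python, in case it helps. The job labels, the numeric outcome, and the helper name `group_means` are all made up for illustration; they are not from the poster's data.

```python
from statistics import mean

def group_means(groups, outcomes):
    """Return {group: mean outcome} for two parallel lists."""
    by_group = {}
    for g, y in zip(groups, outcomes):
        by_group.setdefault(g, []).append(y)
    return {g: mean(ys) for g, ys in by_group.items()}

# Hypothetical data: job category and a numeric dependent variable.
jobs = ["teacher", "teacher", "Other", "Other", "engineer", "engineer"]
score = [4.0, 6.0, 5.0, 7.0, 7.0, 7.0]

overall = mean(score)              # mean over all respondents
means = group_means(jobs, score)   # mean within each job category
gap = means["Other"] - overall     # how far "Other" sits from the overall mean
print(means, overall, gap)
```

A gap near zero suggests the merged group behaves like the average respondent; how small is "near" remains a judgment call for the analyst.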

1 Recommendation


Whatever logic you use for combining categories will have to be easily understood by your reviewers/readers. The most frequently used strategy is to combine the less common categories into a single "other" category.
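For what it is worth, the frequency-based strategy could be sketched like this in plain Python. The threshold `min_count=5`, the label "Other", and the job data are assumptions chosen for illustration, not a recommended cut-off.

```python
from collections import Counter

def collapse_rare_levels(values, min_count=5, other_label="Other"):
    """Replace every level seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical job data: three common levels and three rare ones.
jobs = (["teacher"] * 6 + ["engineer"] * 5 + ["farmer"] * 5
        + ["pilot", "chef", "diver"])

collapsed = collapse_rare_levels(jobs, min_count=5)
print(Counter(collapsed))
```

Whatever threshold is used should be reported, so readers can see exactly which levels ended up in the merged category.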

1 Recommendation

Collapsing the levels/categories of a categorical variable could be useful when there is a theoretical reason (e.g., reducing respondents’ education level into just two categories, signifying *university graduate* or *non-university graduate*). It could also be driven by a decision after conducting data evaluation (e.g., having few observations in some categories). Either way, you could do the merge in SPSS following the procedure for recoding a categorical variable as illustrated by van den Berg (2021) and KSU Libraries (2021). You might refer to Rutkowski et al. (2019) and DiStefano et al. (2021) for inputs on rationalizing the collapse of the categories. Here are the full citations.

DiStefano, C., Shi, D., & Morgan, G. B. (2021). Collapsing categories is often more advantageous than modeling sparse data: Investigations in the CFA framework. *Structural Equation Modeling: A Multidisciplinary Journal*, *28*(2), 237–249. https://doi.org/10.1080/10705511.2020.1803073

KSU Libraries. (2021, October 4). *LibGuides: SPSS tutorials: Recoding variables*. LibGuides at Kent State University. https://libguides.library.kent.edu/spss/recodevariables

Rutkowski, L., Svetina, D., & Liaw, Y.-L. (2019). Collapsing categorical variables and measurement invariance. *Structural Equation Modeling: A Multidisciplinary Journal*, *26*(5), 790–802. https://doi.org/10.1080/10705511.2018.1547640

van den Berg, R. (2021, August). *SPSS - Merge categories of categorical variable*. SPSS tutorials | The Ultimate Guide to SPSS. https://www.spss-tutorials.com/spss-merge-categories-of-categorical-variable/

Good luck,
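Outside SPSS, the same kind of theory-driven recode can be sketched as an explicit lookup table. This is only an illustration in plain Python: the education labels, the grouping, and the helper name `recode` are assumptions, and the real grouping should come from theory.

```python
# Hypothetical education levels mapped onto two theory-driven categories.
EDUCATION_MAP = {
    "primary": "non-university graduate",
    "secondary": "non-university graduate",
    "vocational": "non-university graduate",
    "bachelor": "university graduate",
    "master": "university graduate",
    "doctorate": "university graduate",
}

def recode(values, mapping):
    """Recode each value through mapping; fail loudly on unmapped levels."""
    missing = sorted(set(values) - set(mapping))
    if missing:
        raise KeyError(f"no recode rule for levels: {missing}")
    return [mapping[v] for v in values]

education = ["primary", "bachelor", "secondary", "doctorate", "master"]
recoded = recode(education, EDUCATION_MAP)
print(recoded)
```

Writing the mapping out in full keeps the recode auditable, which matters when reviewers ask how the categories were merged.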

2 Recommendations

You can also regroup the modalities (levels) of the categorical variable by frequency as a preprocessing step before applying a machine learning model. Be aware, though, that combining levels may lead to a loss of information.

Mohialdeen Alotumi thank you for the useful explanation. Can I do that in R as well?

Yes, you can. The R package, *Collapse*, by Krantz et al. (2016) could be helpful. Here is the full citation.

Krantz, S., Dowle, M., Srinivasan, A., Berge, L., Eddelbuettel, D., Pasek, J., & Tappe, K. (2016). *Collapse 1.6.5*. Advanced and fast data transformation in R. https://sebkrantz.github.io/collapse/

Good luck,

1 Recommendation

Sam H.Bahreini though it is permissible to combine categorical data, it very much depends on how you want to use that data, that is, whether it will be used only for descriptive statistics or whether you also plan to apply some inferential statistics.

Muhammad Zia Aslam Dear Muhammad, thank you for the answer.

In my dataset, for example, income has 20 levels and education has 9 levels, and I have 14 independent variables. After the descriptive analysis, I should run ordered and unordered logistic models and a count model, and compare them to find the best fit for my data.

Some of the income levels, for example, have very low frequency; that is why I want to combine them together.

Mohialdeen Alotumi thanks a lot for your help.

As I wrote to Muhammad Zia Aslam, after preparing the data I should run ordered and unordered logistic regression. I read some papers and chapters and found that I also need to do cross-validation. Should I do the cross-validation before running the model? And after modelling, should I use other methods as well to find goodness of fit?

Thank you

Dear Sam H.Bahreini, it is quite natural to have some groups with very few respondents when there are too many categories. As I understand it, you have measured variables such as income and education as groups or categories coded 1, 2, 3, 4, 5, and so on. You can easily expand the membership of each group by recoding to combine categories. To my understanding, this is neither manipulation nor an issue to worry about. By "expand" I mean raising the bar, for example widening the first income group from 1-1000 to 1-50,000, or whatever your specification of the higher income groups is. **For Logistic Regression you can take guidance from Prof. Mike Crowson's YouTube tutorials.** Tq.

1 Recommendation

Muhammad Zia Aslam thank you very much. Yes, all of my variables are categorical, coded as you said: 1, 2, 3, ... 18. I just wanted to ask whether there are specific rules for combining, for example, 3 levels. I did not find anything online; I suppose I should combine the less frequent groups together.

If you have any continuous variables such as age, years of education, or income, then you should recode each of these to the midpoint of the category.

Sorry David, could you please explain what you mean? My income and age are categorical too; for example, I have 18 levels of income and 5 levels of age.

An example of recoding to the midpoint of an income category would be to convert $20,000 to $40,000 into a value of $30,000.
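As a minimal sketch of that midpoint recode in plain Python: the category codes, the bracket bounds, and the helper name `to_midpoint` below are hypothetical and would need to match the actual questionnaire.

```python
# Hypothetical income brackets: category code -> (lower, upper) in dollars.
INCOME_BRACKETS = {
    1: (0, 20_000),
    2: (20_000, 40_000),
    3: (40_000, 60_000),
}

def to_midpoint(code, brackets):
    """Return the midpoint of the bracket that a category code stands for."""
    lower, upper = brackets[code]
    return (lower + upper) / 2

incomes = [2, 1, 3, 2]  # category codes as recorded in the dataset
midpoints = [to_midpoint(c, INCOME_BRACKETS) for c in incomes]
print(midpoints)
```

An open-ended top category (for example "above $60,000") has no natural midpoint, so it would need an explicit assumption of its own.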

1 Recommendation

If you have 18 ordinal categories (that is, each category is higher or lower than another) as an exposure variable, I would treat them as quasi-continuous.

If the categories of the exposure variable are nominal (that is, different and not ordered), I would group them if needed based on theory, while also being aware of the frequencies, but theory trumps.

1 Recommendation

@Kelvyn Jones I agree with you about the ordinal DV. Here I would use truncated regression and not OLS. Full details can be found in the attached screenshot. I certainly agree with all of your other suggestions. Best wishes to all, David Booth

- Attachment: Screenshot_20211105-214219.png (306.82 KB)

1 Recommendation

Sam H.Bahreini yes, only one rule applies in this case, and that is "meaningfulness," as I understand it. Your issue is basically related to the pre-data-collection stage and can be resolved easily by making the group representation meaningful. In doing so you will not disturb any individual responses; you will just make the group representation meaningful. So I think you should move forward with your analysis by regrouping the categories. Good luck. Tq

1 Recommendation

Muhammad Zia Aslam thank you for your consideration. I want to ask: is it possible to combine less frequent job levels together? I cannot find any reference on it. I have more than 20 levels of jobs. Can I combine the less frequent levels together, or should they enter the model as they are? I cannot give a meaningful reason to combine, for example, housekeepers, jobless people, and students just because of the frequency of the data. Do you have any idea?

Muhammad Zia Aslam the idea of "collapsing" less frequent categories into an "Other" category is basically common sense, so you do not need a reference for it.

3 Recommendations

I totally agree with you, Prof. David L Morgan. I think Sam H.Bahreini wanted to be on the safe side by asking for a reference on the "common sense" of "collapsing" categories :). It is surely not needed. Regards.

1 Recommendation

Muhammad Zia Aslam thank you again for your answer, I really appreciate your help. My concern was mainly about the job categories, I mean about interpretation: if I combine the less frequent ones together, then I have housekeepers, students, jobless people, managers, and so on in one level.
