Question
Asked 2nd Nov, 2021

Should I merge different levels of categorical data?

Hello,
In my dataset, some of the categorical variables have 18 or more levels. Can I merge levels based on their frequency, or should I keep them in the model as they are? And on what basis should I combine them?
Thank you.

All Answers (21)

2nd Nov, 2021
David L Morgan
Portland State University
Whatever logic you use for combining categories will have to be easily understood by your reviewers/readers. The most frequently used strategy is to combine the less common categories into a single "Other" category.
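As a rough illustration of that strategy, the Python sketch below lumps every level whose share of the observations falls below a cutoff into one "Other" level. The function name `collapse_rare_levels`, the 5% cutoff, and the sample job labels are assumptions made for this example; nothing in this thread prescribes a particular threshold.

```python
from collections import Counter

def collapse_rare_levels(values, min_share=0.05, other_label="Other"):
    """Replace levels whose relative frequency is below min_share with other_label."""
    counts = Counter(values)
    total = len(values)
    # Levels that are frequent enough to keep unchanged.
    keep = {level for level, n in counts.items() if n / total >= min_share}
    return [v if v in keep else other_label for v in values]

# Hypothetical job variable: two common levels and three rare ones.
jobs = ["clerk"] * 50 + ["teacher"] * 40 + ["pilot"] * 4 + ["chef"] * 3 + ["diver"] * 3
collapsed = collapse_rare_levels(jobs, min_share=0.05)
print(Counter(collapsed))
```

Whatever cutoff is chosen should be reported, since it changes which levels end up in the combined category.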
1 Recommendation
2nd Nov, 2021
Mohialdeen Alotumi
Sana'a University
Collapsing the levels/categories of a categorical variable could be useful when there is a theoretical reason for it (e.g., reducing respondents' education level into just two categories, signifying university graduate or non-university graduate). It could also be driven by a decision after conducting data evaluation (e.g., having few observations in some categories). Either way, you could do the merge in SPSS following the procedure for recoding a categorical variable as illustrated by van den Berg (2021) and KSU Libraries (2021). You might refer to Rutkowski et al. (2019) and DiStefano et al. (2021) for input on justifying the collapse of the categories. Here are the full citations.
DiStefano, C., Shi, D., & Morgan, G. B. (2021). Collapsing categories is often more advantageous than modeling sparse data: Investigations in the CFA framework. Structural Equation Modeling: A Multidisciplinary Journal, 28(2), 237–249. https://doi.org/10.1080/10705511.2020.1803073
KSU Libraries. (2021, October 4). LibGuides: SPSS tutorials: Recoding variables. LibGuides at Kent State University. https://libguides.library.kent.edu/spss/recodevariables
Rutkowski, L., Svetina, D., & Liaw, Y.-L. (2019). Collapsing categorical variables and measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 26(5), 790–802. https://doi.org/10.1080/10705511.2018.1547640
van den Berg, R. (2021, August). SPSS - Merge categories of categorical variable. SPSS tutorials | The Ultimate Guide to SPSS. https://www.spss-tutorials.com/spss-merge-categories-of-categorical-variable/
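Outside SPSS, a theory-driven recode like the education example above amounts to an explicit lookup table. The following Python sketch is only an illustration: the level names, the mapping `EDUCATION_GROUPS`, and the helper `recode` are all hypothetical, and the grouping itself would have to come from the researcher's own theoretical reasoning.

```python
# Hypothetical education levels mapped to two theory-driven groups.
EDUCATION_GROUPS = {
    "primary": "non-university graduate",
    "secondary": "non-university graduate",
    "vocational": "non-university graduate",
    "bachelor": "university graduate",
    "master": "university graduate",
    "doctorate": "university graduate",
}

def recode(values, mapping):
    """Recode each value through mapping; fail loudly on an unmapped level."""
    missing = sorted(set(values) - set(mapping))
    if missing:
        raise ValueError(f"unmapped levels: {missing}")
    return [mapping[v] for v in values]

education = ["primary", "bachelor", "master", "secondary", "doctorate"]
recoded = recode(education, EDUCATION_GROUPS)
print(recoded)
```

Raising an error on unmapped levels is a design choice for this sketch: it keeps a misspelled or unexpected level from silently disappearing into the wrong group.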
Good luck,
2 Recommendations
3rd Nov, 2021
Semeh Ben Salem
Ecole Polytechnique de Tunisie
You can also regroup the modalities of the categorical variable according to their frequency before applying a machine learning model. Be aware, though, that combining them may lead to a loss of information.
3rd Nov, 2021
Sam H.Bahreini
University of Liège
Mohialdeen Alotumi, thank you for the useful explanation. Can I do that in R as well?
3rd Nov, 2021
Mohialdeen Alotumi
Sana'a University
Yes, you can. The R package collapse by Krantz et al. (2016) could be helpful. Here is the full citation.
Krantz, S., Dowle, M., Srinivasan, A., Berge, L., Eddelbuettel, D., Pasek, J., & Tappe, K. (2016). Collapse 1.6.5. Advanced and fast data transformation in R. https://sebkrantz.github.io/collapse/
Good luck,
1 Recommendation
4th Nov, 2021
Muhammad Zia Aslam
Superior University
Sam H.Bahreini, although it is permissible to combine categorical data, it very much depends on how you want to use the data, i.e., whether it will be used only for descriptive statistics or you also plan to apply inferential statistics.
By the way, what do you mean by combining categories based on their frequencies? Furthermore, combining is only possible when the resulting categories are meaningful, as with education level in Mohialdeen Alotumi's example. Thanks.
4th Nov, 2021
Sam H.Bahreini
University of Liège
Muhammad Zia Aslam Dear Muhammad, thank you for the answer.
In my dataset, for example, income has 20 levels and education has 9 levels, and I have 14 independent variables. After the descriptive analysis, I should run ordered and unordered logistic models and a count model, and compare them to find the best fit for my data.
Some of the income levels, for example, have very low frequencies; that is why I want to combine them.
4th Nov, 2021
Sam H.Bahreini
University of Liège
Mohialdeen Alotumi, thanks a lot for your help.
As I wrote to Muhammad Zia Aslam, after preparing the data I should run ordered and unordered logistic regression. I read some papers and chapters and found that I also need to do cross-validation. Should I do the cross-validation before running the model? And after modelling, should I again use other methods to assess goodness of fit?
Thank you.
5th Nov, 2021
Muhammad Zia Aslam
Superior University
Dear Sam H.Bahreini, it is to be expected that some groups will have very few respondents when there are many categories. As I understand it, you measured variables such as income and education as groups or categories coded 1, 2, 3, 4, 5, ..., so you can easily enlarge the groups by recoding to combine them. In my understanding this is neither manipulation nor an issue to worry about. By "enlarge" I mean widening the range, for example changing the first income group from 1-1000 to 1-50,000, or whatever suits your specification of the higher income groups. For logistic regression you can take guidance from Prof. Mike Crowson's YouTube tutorials. Thanks.
1 Recommendation
5th Nov, 2021
Sam H.Bahreini
University of Liège
Muhammad Zia Aslam, thank you very much. Yes, all of my variables are categorical, coded 1, 2, 3, ..., 18 as you said. I just wanted to ask whether there are specific rules for combining, for example, 3 levels. I did not find anything online; I suppose I should combine the less frequent groups together.
5th Nov, 2021
David L Morgan
Portland State University
If you have any continuous variables such as age, years of education, or income, then you should recode each of these to the midpoint of the category.
5th Nov, 2021
Sam H.Bahreini
University of Liège
Sorry David, could you please explain what you mean? My income and age are categorical too; for example, I have 18 levels of income and 5 levels of age.
5th Nov, 2021
David L Morgan
Portland State University
An example of recoding to the midpoint of an income category would be to convert $20,000 to $40,000 into a value of $30,000.
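A minimal Python sketch of midpoint recoding follows. The bracket codes and bounds in `INCOME_BRACKETS` and the helper `bracket_midpoint` are hypothetical; only the $20,000 to $40,000 bracket is taken from the example above.

```python
# Hypothetical income brackets: code -> (lower bound, upper bound) in dollars.
INCOME_BRACKETS = {
    1: (0, 20_000),
    2: (20_000, 40_000),
    3: (40_000, 60_000),
}

def bracket_midpoint(code, brackets):
    """Return the midpoint of the bracket identified by code."""
    low, high = brackets[code]
    return (low + high) / 2

income_codes = [2, 1, 3, 2]
income_midpoints = [bracket_midpoint(c, INCOME_BRACKETS) for c in income_codes]
print(income_midpoints)
```

An open-ended top bracket (for example "$100,000 or more") has no midpoint, so it would need a separate, explicitly stated choice.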
1 Recommendation
5th Nov, 2021
Kelvyn Jones
University of Bristol
If you have 18 ordinal categories (that is, each category is higher or lower than another) as an exposure variable, I would treat them as quasi-continuous.
If the categories of the exposure variable are nominal (that is, different and not ordered), I would group if needed on the basis of theory (while also being aware of the frequencies), but theory trumps frequency.
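One way to read "quasi-continuous" is to score each ordered level by its rank and enter that score as a single numeric predictor. The short Python sketch below assumes this reading; the level labels and the names `ORDERED_LEVELS` and `LEVEL_SCORE` are made up for illustration.

```python
# Hypothetical ordered income levels, lowest to highest.
ORDERED_LEVELS = ["very low", "low", "middle", "high", "very high"]

# Score each level by its rank (1 = lowest).
LEVEL_SCORE = {level: rank for rank, level in enumerate(ORDERED_LEVELS, start=1)}

responses = ["middle", "very low", "high", "middle"]
scores = [LEVEL_SCORE[r] for r in responses]
print(scores)
```

Rank scores treat neighbouring levels as equally spaced, which is an assumption worth checking against how the original categories were defined.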
1 Recommendation
6th Nov, 2021
David Eugene Booth
Kent State University
@Kelvyn Jones I agree with you about the ordinal DV. Here I would use truncated regression and not OLS. Full details can be found in the attached screenshot. I certainly agree with all of your other suggestions. Best wishes to all, David Booth
1 Recommendation
6th Nov, 2021
Muhammad Zia Aslam
Superior University
Sam H.Bahreini, yes, in my understanding only one rule applies in this case, and that is "meaningfulness". Your issue basically relates to the pre-data-collection stage and can be resolved easily by making the group representation meaningful. In doing so you will not disturb any individual responses; you will just make the group representation meaningful. So I think you should move forward with your analysis by regrouping the categories. Good luck. Thanks.
1 Recommendation
8th Nov, 2021
Sam H.Bahreini
University of Liège
Muhammad Zia Aslam, thank you for your consideration. I want to ask: is it possible to combine less frequent job levels? I cannot find any reference on it. I have more than 20 job levels; can I combine the less frequent ones, or should they enter the model as they are? I cannot find a meaningful reason to combine, for example, housekeepers, jobless people, and students just because of the frequency of the data. Do you have any idea?
8th Nov, 2021
David L Morgan
Portland State University
Muhammad Zia Aslam the idea of "collapsing" less frequent categories into an "Other" category is basically common sense, so you do not need a reference for it.
3 Recommendations
10th Nov, 2021
Muhammad Zia Aslam
Superior University
I totally agree with you, Prof. David L Morgan. I think Sam H.Bahreini wanted to be safe by asking for a reference on the "common sense" of "collapsing" categories :). It is surely not needed. Regards.
1 Recommendation
10th Nov, 2021
Sam H.Bahreini
University of Liège
Muhammad Zia Aslam, thank you again for your answer; I really appreciate your help. My concern was mainly about job categories, that is, about interpretation: if I combine the less frequent ones, I end up with housekeepers, students, jobless people, managers, and so on in one level.
10th Nov, 2021
David L Morgan
Portland State University
If you have a range of different jobs grouped together, then check to see if their mean on your dependent variable equals the overall mean. If so, then that would make this "Other" category a good candidate for the "omitted" category in a dummy variable analysis.
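That check can be sketched in a few lines of Python. The data, the helpers `group_means` and `dummy_code`, and the choice of "Other" as the reference level are assumptions for this example; with real data the two means will rarely match exactly, so how close is "close enough" remains a judgment call.

```python
from statistics import mean

def group_means(groups, outcome):
    """Mean of the outcome within each group."""
    by_group = {}
    for g, y in zip(groups, outcome):
        by_group.setdefault(g, []).append(y)
    return {g: mean(ys) for g, ys in by_group.items()}

def dummy_code(groups, reference):
    """One 0/1 indicator per level except the omitted reference level."""
    levels = sorted(set(groups) - {reference})
    return [{level: int(g == level) for level in levels} for g in groups]

# Hypothetical data: job group and a numeric dependent variable.
jobs = ["clerk", "clerk", "teacher", "teacher", "Other", "Other"]
y = [2.0, 4.0, 6.0, 8.0, 4.0, 6.0]

means = group_means(jobs, y)
overall = mean(y)
print(means["Other"], overall)

rows = dummy_code(jobs, reference="Other")
print(rows[0], rows[4])
```

With "Other" omitted, each remaining dummy coefficient in a regression is read as a difference from the "Other" group, which is easiest to interpret when that group sits near the overall mean.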
1 Recommendation
