
Data Cleaning - Science topic

Explore the latest questions and answers in Data Cleaning, and find Data Cleaning experts.
Questions related to Data Cleaning
  • asked a question related to Data Cleaning
Question
5 answers
Dear scholars,
I have a database of 100 geolocated samples from a given area; in each sample, 38 chemical elements were quantified.
Some of these samples contain values below the detection level (BDL) of the instrument. Clearly, when 100% of the samples are BDL for an element there is not much to do, but what can be done when, for example, only 20% are BDL? What do we do with them, and with what value do we replace a BDL sample?
Some papers show that a BDL sample can be replaced by the detection level (the instrument's minimum detection level for that chemical element) divided by 0.25; others show that you have to divide it by 0.5... What would you do in each case, and is there any literature you would recommend? If it matters, I am mostly interested in copper and arsenic.
Regards
Relevant answer
Answer
What fraction of BDL values is acceptable?
Why are you making the measurements? The why determines what is acceptable.
If you are concerned about an upper limit, then BDLs are of no concern.
If you are concerned about a lower limit, it will depend upon the nature of your concern.
There is no recommendation. There is no rule of thumb.
You decide from the criteria associated with WHY if you have enough information.
Too many BDLs might mean you need a different technique, but it always returns to WHY.
If, say, a customer wants an answer at each location, use the actual result and note the uncertainty. The result is usually meaningless because of the high uncertainty.
  • asked a question related to Data Cleaning
Question
3 answers
Dear scholars,
I have a database of 100 geolocated samples from a given area; in each sample, 38 chemical elements were quantified.
Some of these samples contain values below the detection level (BDL) of the instrument. Clearly, when 100% of the samples are BDL for an element there is not much to do, but what can be done when, for example, only 20% are BDL? What do we do with them, and with what value do we replace a BDL sample?
Some papers show that a BDL sample can be replaced by the detection level (the instrument's minimum detection level for that chemical element) divided by 0.25; others show that you have to divide it by 0.5... What would you do in each case, and is there any literature you would recommend? If it matters, I am mostly interested in copper and arsenic.
Regards
Relevant answer
Answer
You don't give a research question, so I would simply report exactly what you said in full detail. You may also want to give information on the detection limits of the methods you used. Just let Prof Kan know that detection limits are determined by Nature's own chemistry and not by personal choice. Best wishes, David Booth
  • asked a question related to Data Cleaning
Question
5 answers
Hi all,
I'm having trouble converting one particular variable in my dataset from string to numeric. I've tried manually transforming/recoding into a different variable and automatic recoding. I've also tried writing syntax (see below). The same syntax has worked for every other variable I needed to convert except this one. With all methods (manual recode, automatic recode, and writing syntax), I end up with missing data.
recode variablename ('Occurred 0 times' = 0) ('Occurred 1 time' = 1) ('Occurred 2 times' = 2) ('Occurred 3+ times' = 3) into Nvariablename.
execute.
VALUE LABELS
Nvariablename
0 'Occurred 0 times'
1 'Occurred 1 time'
2 'Occurred 2 times'
3 'Occurred 3+ times'.
EXECUTE.
Thank you in advance for your help!
Relevant answer
Answer
Konstantinos Mastrothanasis, by introducing manual copying and pasting etc., you make reproducibility much more difficult. IMO, anything that can be done via command syntax ought to be done via command syntax. The basic code Angela H. posted will work for the particular values she showed in her post (see the example below). If it is not working, that suggests there are values present in the dataset other than the ones she has shown us. But we are still waiting for her to upload a small file including the problematic cases.
Meanwhile, here is the aforementioned example that works.
* Read in the values Angela showed in her post.
NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / svar(A20).
BEGIN DATA
'Occurred 0 times'
'Occurred 1 time'
'Occurred 2 times'
'Occurred 3+ times'
END DATA.
LIST.
* Recode svar to nvar.
RECODE svar
('Occurred 0 times' = 0)
('Occurred 1 time' = 1)
('Occurred 2 times' = 2)
('Occurred 3+ times' = 3) into nvar.
FORMATS nvar (F1).
VALUE LABELS nvar
0 'Occurred 0 times'
1 'Occurred 1 time'
2 'Occurred 2 times'
3 'Occurred 3+ times'
.
CROSSTABS svar BY nvar.
  • asked a question related to Data Cleaning
Question
4 answers
Hello,
I have recently created a multiple response set from a check all that apply question. I see that I can compute frequencies for that response set, but is there a way to use the new response set as a variable for other inferential analyses? For example, a linear regression? If not, is there a better way to handle my check all that apply question that would allow for future analyses like regressions?
Thank you!
Relevant answer
You can compute new variables in SPSS for linear regression analyses. MANOVA is also a possibility.
  • asked a question related to Data Cleaning
Question
3 answers
Dear experts,
I am struggling with my master's dissertation and SPSS at the moment. Participants filled in a Qualtrics survey in which they were directed to one of two questionnaires (PRFQ-C or PRFQ-A) depending on whether they had a child over 18 or under 18. Now I'm struggling to clean the data in SPSS: how do I remove all the missing data for participants who didn't need to fill in one of the two questionnaires?
I hope that makes sense.
Relevant answer
Answer
Hello Daria,
Your skip logic reflects an instance wherein missing data (e.g., the under 18 survey wasn't completed because the respondent was over 18) are generally ignorable (but see paragraph #3, below). Ignorable missing data are not a cause for concern; it just means that you are reporting on a subset of the entire sample for one or more variables (here, potentially an entire survey form).
If the two surveys are completely different, then it might make more sense to have two independent data files, and report the results for each survey separately.
If the two surveys have at least some identical questions, and you wish to compare the results for the under-18 respondents with those of the over-18 respondents on those questions, then you'll have to merge the two files such that the matched questions are coded the same way for both groups and represent the same variable/column in your SPSS data file.
Good luck with your work.
  • asked a question related to Data Cleaning
Question
7 answers
Hello all,
I have a dataset which contains some couples and some single people. I want to keep all the singles and keep one person from each couple. Specifically, from the couples, I want to keep the person who has the highest score between the two partners on a specific variable.
I cannot figure out how to do this with syntax. Any advice would be greatly appreciated. Thank you!
Relevant answer
Answer
Hello Laurien Meijer. I think it is easier to tackle this with the data in the LONG format. E.g., suppose you have a variable called PairID that shows which pair each row belongs to. And suppose you want to keep for each pair the row with the higher age. This would do the trick:
* Save the max age for each pair as new variable age_max.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
/BREAK=PairID
/age_max=MAX(age).
* For each pair, keep the record with the higher age.
SELECT IF Age EQ age_max.
DESCRIPTIVES Age PairID.
Note that if both members of a pair have exactly the same age, both would be kept in the dataset. I don't know if that is possible for your data.
  • asked a question related to Data Cleaning
Question
10 answers
Hi, I have been working on some natural language processing research, and my dataset has several duplicate records. I wonder whether I should delete those duplicate records to increase the performance of the algorithms on test data.
I'm not sure whether duplication has a positive or negative impact on the test or training data. I found some contradictory answers online about this, which left me confused!
For reference, I'm using ML algorithms such as Decision Tree, KNN, Random Forest, Logistic Regression, MNB, etc., as well as DL algorithms such as CNN and RNN.
Relevant answer
Answer
Hi Abdus,
I would suggest checking the performance of your model with and without the duplicated records. Generally, duplication may increase the bias in the data, which may lead to a biased model. To address this, you can use a data augmentation approach.
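A minimal scikit-learn sketch of that comparison, assuming a pandas DataFrame with hypothetical 'text' and 'label' columns; the test split is fixed first so both runs are scored on exactly the same test data:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_csv("dataset.csv")  # hypothetical file with 'text' and 'label' columns
train, test = train_test_split(df, test_size=0.2, random_state=42, stratify=df["label"])

def fit_and_score(train_df):
    # Bag-of-words baseline; swap in any of the classifiers listed in the question.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_df["text"], train_df["label"])
    return model.score(test["text"], test["label"])

print("with duplicates:   ", fit_and_score(train))
print("without duplicates:", fit_and_score(train.drop_duplicates(subset=["text", "label"])))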
  • asked a question related to Data Cleaning
Question
2 answers
Hi all! I know this is a very basic question, but since I've never had to do this myself, I was curious how to clean data for a check all that apply question. I've uploaded my Excel file from Qualtrics into SPSS. The way it currently reads in data view, every choice is listed as a single string variable with commas in between. In all of the videos I have watched, each answer choice has its own column in SPSS already, and then multiple response sets are generated. But that is a step further than where I am currently at. Do I have to go back in Excel and create separate columns, or is there an easier way to do this in SPSS? Thank you in advance!!
Relevant answer
Answer
Hello Angela,
Well, at least next time you use Qualtrics, you'll know to set up "choose any/all" options as separate variables! (There is probably at least one other option; it wouldn't hurt to touch base with tech support at Qualtrics, as they are usually quite helpful.)
Assume your option list has 5 possible values: "A", "B", "C", "D", and "E" and the choices are currently saved as a text variable called 'text'.
In spss, open up a syntax box (file/new/syntax) and do the following:
Create and run a set of statements of this form:
compute option1 = (index(upcase(text), 'A')>0) .
compute option2 = (index(upcase(text), 'B')>0) .
....
execute .
The newly created variables (option1, etc.) will have a value of 1 if the designated choice was included, and a value of 0 otherwise.
Good luck with your work.
  • asked a question related to Data Cleaning
Question
4 answers
Hello, I am stuck. As the question suggests, I am trying to combine two different datasets that have the same columns, but neither the merge() nor the join() functions (nor their variants) worked. What can I do? I heard this can be done in Python, but I am not very good at Python, so I prefer to use R, SPSS, or Excel.
Relevant answer
Answer
Use the rbind() function to combine two data frames (datasets) vertically. The variables in the two data frames must be the same, but they do not have to be in the same order. If data frame A contains variables that data frame B does not, you will need to remove the unneeded variables from data frame A (or add them, filled with NA, to data frame B).
rbind() is used to join two R data frames.
This method stacks the two data frames on top of one another, adding the second to the first. Both data frames must have the same number of columns and the same column names for this function to work.
This is how you generally connect different data sets in R. If you like, you may use dplyr's left_join() instead of merge(). Reduce(function(dtf1, dtf2) left_join(dtf1, dtf2, by = "index"), list(x, y, z)) is an example.
To join data frames with different columns, use the bind_rows() function. The column names and numbers of columns in the input data frames may differ; NA is used to fill in missing columns. A column appears in the output data frame if it appears in at least one of the input data frames.
  • asked a question related to Data Cleaning
Question
4 answers
My doctoral research is based on LBSN data, and the data is quite dirty, so it needs cleaning. Would you share the script for automatic data cleaning using ML, as described in your article? I would prefer a script in Python. Thank you very much. I have attached a CSV file as a sample of the data.
Relevant answer
Answer
Dear Dr Rayat,
The reference below may be useful for finding a tool to clean data automatically using Python.
  • asked a question related to Data Cleaning
Question
9 answers
I have some raw text data which is pretty "dirty". My goal is to evaluate whether proper data cleaning can improve the accuracy of an NLP classification model. The data is first split into training and testing sets, and it is natural to train two models, one on the cleaned training set and one on the raw training set. The problem is whether both models should be tested on the raw test set, or on a test set cleaned by the same method as the training set.
Relevant answer
Hello!
Your goal is: "to evaluate whether proper data cleaning can improve the accuracy of an NLP classification model"
To my understanding, you want to evaluate the classification. That means you need two models, one trained on raw data and one trained on cleaned data. But to compare the two models you need to use the very same raw test data. That will show how each model classifies when it faces dirty data.
If you want to evaluate the cleaning process itself, then you need to test on cleaned test data. That will show how necessary it is to clean your data before doing the classification.
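As a sketch of that protocol in scikit-learn terms (the clean() function and the column names are placeholders, not part of the original question):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def clean(texts):
    # Placeholder cleaning step: lower-case and strip non-alphanumeric characters.
    return texts.str.lower().str.replace(r"[^a-z0-9\s]", " ", regex=True)

df = pd.read_csv("corpus.csv")  # hypothetical 'text' and 'label' columns
train, test = train_test_split(df, test_size=0.2, random_state=0)

model_raw = make_pipeline(CountVectorizer(), MultinomialNB()).fit(train["text"], train["label"])
model_clean = make_pipeline(CountVectorizer(), MultinomialNB()).fit(clean(train["text"]), train["label"])

# Same raw test set for both models: this isolates the effect of cleaning the training data.
print("raw-trained model on raw test:  ", model_raw.score(test["text"], test["label"]))
print("clean-trained model on raw test:", model_clean.score(test["text"], test["label"]))
# Optionally also score model_clean on clean(test["text"]) to evaluate the full cleaned pipeline.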
  • asked a question related to Data Cleaning
Question
4 answers
I am collecting questionnaire data in an epidemiological study from large numbers of participants (>1000). What is the best data management system/software for: data entry, data validation and data checking?
Relevant answer
Answer
Epi Info is free and has support available to back you up.
  • asked a question related to Data Cleaning
Question
4 answers
We are doing a project that integrates two different fields, civil engineering and machine learning. We have completed the dataset preparation and data cleaning steps.
Now I have the challenging task of building a model for this data. Our data has 9 independent variables and 3 target attributes. Each target attribute correlates with all 9 inputs, and the 3 target attributes also correlate with each other.
I cannot decide which algorithm/model I should apply to this data. It would be very helpful if anybody could suggest one or give me an idea. I am attaching a small sample of our data.
Relevant answer
Answer
Thank you, sir.
After taking suggestions from Kiran sir, I understood that the relationship between the inputs and outputs is linear, so I tried a multi-output linear regression model.
I will also work on your suggestion and compare the accuracy of both.
Thank you.
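For reference, a minimal scikit-learn sketch of a multi-output linear regression on data shaped like that (9 inputs, 3 targets; the file name and column positions are hypothetical):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("civil_data.csv")  # hypothetical file
X = df.iloc[:, :9]                  # the 9 independent variables
Y = df.iloc[:, 9:12]                # the 3 target attributes

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# LinearRegression handles multiple targets natively (one coefficient vector per target).
model = LinearRegression().fit(X_train, Y_train)
print(model.score(X_test, Y_test))  # R^2 averaged uniformly over the 3 targets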
  • asked a question related to Data Cleaning
Question
3 answers
Hello everyone
One of the obstacles for me in an ML project is the data cleaning phase. I am fairly good at model implementation, but I have major issues with data preprocessing. Could someone give me some advice?
Thanks
  • asked a question related to Data Cleaning
Question
8 answers
I know this has no fixed answer. To give the question a direction: what is the best set of feature engineering, data validation, data cleaning, feature extraction, and feature selection techniques (these cover the data-centric approach), plus image data modeling, language modeling, hyper-parameter optimization, and deployment techniques, to adopt in order to get optimal models with fewer rounds of trial and error? Answering this question demands experience and a structured account of the best-yielding methods you have learned to date. The question is open so that experts from different domains can give their own views; e.g., a researcher who has worked extensively in NLP can suggest the pipeline he or she follows.
Relevant answer
Answer
Good question. Generally, industry practitioners work with AutoML frameworks that cover the complete pipeline.
If you want to know more, please visit this site.
  • asked a question related to Data Cleaning
Question
3 answers
If one is having problems related to contamination of iron in the glassware, what is the best way to clean it? I have tried to clean with nitric acid 10% solution (v/v), but I'm not sure if that's enough. If someone already had the same problem and wants to share the solution, I would be thankful.
  • asked a question related to Data Cleaning
Question
3 answers
Hi all,
Can anybody help me find a reference on the fixation duration threshold for cognitive processing? We are currently cleaning eye movement data. One key procedure is to exclude fixations that are not relevant to cognitive processing; for example, very short fixations may be driven by incidental viewing rather than voluntary viewing. We want to set a threshold on fixation duration, below which recordings are classified as incidental viewing and filtered out. I once read a paper mentioning such a fixation duration threshold (200 ms or 400 ms) but have forgotten the title. We would appreciate any clues.
Thanks a lot!
Relevant answer
Answer
Speed test
  • asked a question related to Data Cleaning
Question
1 answer
Dear all,
In the context of our learning study, we used a go/no-go spoken recognition task in which participants were asked to recognize the spoken words and to reject the pseudowords.
Following signal detection theory, we calculated both a d' score (sensitivity) and a c score (bias). However, several participants exhibited a response strategy (rejecting all words and pseudowords, or accepting all items as if they were words).
Thus, we are looking for a reference or an objective criterion to rely on for cleaning our data, and we are asking for your help to solve this issue. Has anyone ever been confronted with this issue of participants' response strategies? If so, how did you solve it?
I look forward to your answers and thank you in advance for your interest in my question.
Best regards,
F.Salomé
Relevant answer
Answer
Florian,
I recommend using a multilevel logistic regression approach so that you don't have to 'clean' your data. Such an approach will naturally incorporate floor and ceiling performance. Message me privately if you have questions.
  • asked a question related to Data Cleaning
Question
1 answer
The study involved mathematics students completing a survey with several scales at three time points.
1) What is a reasonable threshold for missing data to remove an observation? Does this threshold apply to any given time point or to the entire longitudinal data set?
2) Should we remove people missing entire scales (at any time point) or impute them (e.g. using maximum likelihood)?
3) Should we impute at each time point or impute all observations from the longitudinal study at once?
4) If the latter, for cross-sectional analysis at time point 1, should we impute the time point 1 data set separately or should we take the imputations from the full data set, and then impute remaining missing data for extra students at time point 1 (i.e. who dropped the course or did not complete the other surveys)?
5) Should we remove observations who would be cleaned out at a specific time point (e.g., for straight-lining) but seem okay at other time points?
6) What is the current status of the debate on rounding and/or restricting range when imputing data?
Relevant answer
Answer
Hi Kaitlin
The link below shows how to impute longitudinal data using Stata:
To answer your other questions, the standard approach (if you have enough data) is to remove a few known values from your data, impute them, and then compare the observed values to the imputed ones. If the error of your imputation method is low, it is safe to assume that the imputed dataset was sampled from the same underlying population as the unimputed one. If not, the data do not support imputation, and instances with missing values should be discarded.
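A rough sketch of that check in Python, using scikit-learn's SimpleImputer (mask a few known cells, impute, compare; the file name is hypothetical, the columns are assumed to be numeric scale scores, and no column is assumed to be entirely missing):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("survey_wave1.csv")  # hypothetical numeric scale scores
rng = np.random.default_rng(1)

# Mask 5% of the observed (non-missing) cells and remember their true values.
observed = df.notna().to_numpy()
mask = observed & (rng.random(df.shape) < 0.05)
df_masked = df.mask(mask)

# Impute (mean imputation here; swap in a fancier imputer as needed).
imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df_masked),
                       columns=df.columns, index=df.index)

# Compare the imputed values to the held-out true values.
rmse = np.sqrt(((imputed.to_numpy()[mask] - df.to_numpy()[mask]) ** 2).mean())
print("RMSE on masked cells:", rmse)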
  • asked a question related to Data Cleaning
Question
4 answers
I have 1391 answers from a pre-test and 1261 from a post-test, and I want to match them using a self-generated identifier. I am not sure how best to do this. Should I first remove any duplicates in each dataset, then match on ID and test score and remove the cases with only one test score?
  • asked a question related to Data Cleaning
Question
5 answers
In machine learning, the first step after collecting the data is to clean it in several ways before pre-processing and applying different algorithms, in order to decide what is optimal in terms of accuracy and privacy.
Now, for weather data that we want to use in a machine learning process (for example, for a pollution challenge):
what data needs to be cleaned before we can apply the algorithms?
Relevant answer
Answer
In general, yes, you need data cleaning as a preprocessing step, along with other steps such as normalization, feature selection, and feature extraction.
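A minimal scikit-learn sketch of such a preprocessing chain for tabular weather data (the file, the columns, and the pollution target are hypothetical):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

df = pd.read_csv("weather.csv")   # hypothetical numeric weather features
X = df.drop(columns=["pm25"])     # hypothetical pollution target
y = df["pm25"]

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # cleaning: fill missing sensor readings
    ("scale", StandardScaler()),                   # normalization
    ("select", SelectKBest(f_regression, k=10)),   # feature selection
    ("model", Ridge()),
])
pipe.fit(X, y)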
  • asked a question related to Data Cleaning
Question
4 answers
Hello everyone,
I have multiple datasets that contain a large number of species occurrence records, and I would like to process these records through an automated coordinate cleaning package in R. I am looking for recommendations for a robust cleaning package with a straightforward script that can easily be replicated across multiple datasets.
Does anyone have packages they have previous experience with in R? I have used a new package called "CoordinateCleaner" (Zizka et al., 2018), which I recommend, but I would like to find out which package is most recommended by the scientific community. Many thanks in advance, Connor.
Relevant answer
Answer
Thank you Malaika Mathew Chawla, I'll bear this in mind if I need to work with these types of data again. In the end I used CoordinateCleaner, which worked well.
  • asked a question related to Data Cleaning
Question
3 answers
Hi,
I am doing a GWAS analysis of European bison and dog SNPs. Right now I am doing several rounds of data clean-up, following previous studies and what I have learned.
I was wondering whether it is OK to test for HWE and discard data that does not fit when I am working with animals where mating is controlled and the inbreeding coefficient is high.
And what about tests for population structure etc.? It is obvious that there is going to be some structure in reintroduced and domestic animals. I don't want to lose any important data, and I want to be sure that I have done everything correctly.
Thank you for your help.
Relevant answer
Answer
Lenka Ungrová, if your population is large enough, it is recommended to perform the HWE test.
I think these papers will give you some insight into which decision might be right for you:
  • asked a question related to Data Cleaning
Question
3 answers
Hi everyone, I'm looking for software that helps me clean data from bibliographic databases. I don't know how to program, so I really need a tool that is easy to use (maybe with a little bit of programming). I used VantagePoint before, but I no longer have access to it :( Help me, please.
Relevant answer
Answer
Hi Ana, you should try the bibliometrix package in R ( https://bibliometrix.org/ ). Even if you don't know anything about programming, there is a web-based application (biblioshiny). With this package, you can analyze everything you want by dragging and dropping the items you want to investigate.
  • asked a question related to Data Cleaning
Question
3 answers
Hi
I have conducted a master's dissertation on information security frameworks and standards.
I collected responses through a Google Forms questionnaire consisting of 30 statements, all based on a 5-point Likert scale, and received 230 responses (cases).
I then wanted to remove any careless answers that might affect the final result, so using IBM SPSS I did the following.
  1. I used the "Count values within cases" option under the Transform menu and assigned the value that I wanted to count, for instance, 1 = Strongly Agree. I did that for all five values of the Likert scale, so now I have five new variables.
  2. Then, under Descriptive Statistics, I selected Frequencies with the Maximum option and applied it to the five new variables, with the purpose of showing me the cases with the most repetitive answers. For instance, I found that some respondents had 28 neutral responses out of 30 statements, and another had 25 "agree" responses out of 30 statements, so I deleted these kinds of answers.
But I'm not sure if this is an effective method of handling careless answers.
So my question is: is there a better, more effective method to identify careless responses on a Likert scale, using either Microsoft Excel or IBM SPSS? If yes, could you please explain the method?
Another question: is it better to conduct the internal reliability (Cronbach's alpha) test before data cleaning or after (for instance, before or after removing careless responses)?
The same question applies to principal component analysis and factor analysis (even though I know there is a slight difference between PCA and FA).
I really appreciate your cooperation and consideration.
Thank you.
Relevant answer
Answer
Thank you, Dr David L Morgan, for your reply.
Here is my justification.
  • One respondent gave 27 "Neutral" responses out of the 29 statements of the whole questionnaire. This indicates that the respondent either did not understand the terms within the statements (the questionnaire is about information security) or just wanted to finish the questionnaire as fast as possible without trying to provide honest responses.
  • Another respondent gave 25 "Disagree" responses out of the 29 statements of the whole questionnaire, which is not acceptable in my questionnaire. Why? Because, based on the structure of the questionnaire and its statements, having 25 responses on one point is self-contradictory. It can't be.
  • I'm not just deleting responses based on the number of occurrences of a specific point (for instance, 3 = Neutral). I actually look at the pattern of the respondent's answers. For example, suppose one question says "I don't understand the information security policies of the organization" and another says "The information security policies of the organization are confusing". If the participant answers Disagree to the first and Agree to the second, then I know the respondent either deliberately provided misleading answers or was not paying attention.
However, I have since found a better technique for identifying them (though I am not sure it is the best one).
  1. Quoting from the main question: "I used the 'Count values within cases' option under the Transform menu, then I assigned the value that I want to count, for instance, 1 = Strongly Agree. I did that for all five values of the Likert scale, so now I have five new variables."
  2. Then I chose Explore from Descriptive Statistics in the Analyze menu, moved the 5 new variables to the "Dependent List", and selected "Outliers" under Statistics to show me the highest number of occurrences per case for each Likert scale point, along with the boxplot. The Extreme Values table showed me five cases with 29 "Neutral" responses out of the 29 statements of the whole questionnaire, so I checked these cases and removed them.
But, as mentioned before, I am not sure if this is the best way.
Furthermore, I have read several research papers on identifying careless responses in Likert-scale questionnaires; they discuss various methods for handling these kinds of responses, but they are not clear to me.
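For anyone who prefers to script this check, a pandas sketch of the same "maximum identical responses per case" screen (the file, item columns, and cutoff are hypothetical and would need justifying for your own questionnaire):

import pandas as pd

df = pd.read_csv("responses.csv")  # hypothetical: one row per case, Likert items coded 1-5
items = df.filter(like="Q")        # hypothetical item columns named Q1, Q2, ...

# For each case, count how often its most frequent answer occurs (e.g., 29 "Neutral" out of 29).
max_same = items.apply(lambda row: row.value_counts().max(), axis=1)

cutoff = 27                        # hypothetical threshold
flagged = df[max_same >= cutoff]
print(f"{len(flagged)} cases flagged as potentially careless")
clean = df[max_same < cutoff]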
  • asked a question related to Data Cleaning
Question
2 answers
We had about 60 short stories rated on valence and arousal.
We suspect that a few scorers did not read the short stories and marked the possible answers more or less at random or according to a pattern. My goal now is to find these bad scorers and remove them from the dataset. I want to be very careful and, in case of doubt, leave scorers in the dataset.
I have chosen the following two criteria for exclusion: 1. Deviation from a range of expected values. 2. Deviation from the expected distribution of all rating values.
Criterion 1: If a scorer deviates by more than one standard deviation from the averaged ratings across all scorers in more than 6 ratings, that scorer's ratings are removed from the dataset. Assuming a given per-rating probability that a scorer randomly answers outside the expected range, after 6 short stories there is a probability of less than one percent.
Criterion 2: The distribution of all ratings across all scorers is roughly uniform. If each scorer is allowed an average deviation of 4 points from the distribution of rating values of all scorers, then all scorers with an average deviation of more than 4 points drop out of the dataset.
If either or both of these criteria apply to a scorer, the scorer's ratings are removed from the dataset.
  • Is this legitimate?
  • Are there better practices?
Thank you very much for your answers.
Yours sincerely
Relevant answer
Answer
Hello David
Thank you for your response.
In fact, I compared the data with and without the 8 excluded cases.
The differences are rather small, and all correlations for valence and for arousal between the whole sample and the 'reduced' sample are >.99 for all rating groups.
Best regards
Egon
  • asked a question related to Data Cleaning
Question
11 answers
I have some MRI images and 2D phase-contrast blood flow velocity data of healthy patients obtained from MRI. I need to obtain the vortex lines of blood flow inside the heart from these data. To clean the data, I have applied Gaussian and Wiener filters of a certain radius, but I still think proper vortex lines are not emerging. I am really stuck. Any help or suggestions will be received with gratitude.
Relevant answer
Answer
This is the type of image which I have
  • asked a question related to Data Cleaning
Question
49 answers
I have some data in which a number of the analyte concentrations (data points) obtained were below the quantitation limit (LOQ) of the method used. This is possibly a result of both the low values of the analytes of interest expected in the test samples and the method and instrument detection limits. In such a case, how are these "missing" data points treated during statistical analysis and subsequent interpretation and representation?
Relevant answer
Answer
There are a number of ways to handle values below detection limits; I list some of them:
1- Substitute LOD/2 for all of them.
2- If you want to use substitution, it is better to substitute LOD/sqrt(2) (to account for the variance as well).
3- Impute the values below the detection limit using multiple imputation.
4- A more advanced statistical approach is to use a mixture model: a binomial model for the below-detection-limit data and a log-normal model for the above-detection-limit data.
Hope it helps.
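A small pandas sketch of substitution options (1) and (2), assuming BDL cells are stored as missing values and the per-analyte detection limits are known (the file, columns, and limit values here are all hypothetical):

import numpy as np
import pandas as pd

df = pd.read_csv("analytes.csv")  # hypothetical: one column per analyte, NaN where below LOQ/LOD
lod = {"Cu": 0.5, "As": 0.1}      # hypothetical detection limits, in the same units as the data

subst_half = df.copy()
subst_sqrt2 = df.copy()
for analyte, limit in lod.items():
    subst_half[analyte] = df[analyte].fillna(limit / 2)            # option 1: LOD/2
    subst_sqrt2[analyte] = df[analyte].fillna(limit / np.sqrt(2))  # option 2: LOD/sqrt(2)

# Report how sensitive the summary statistics are to the choice of substitution.
print(pd.concat({"LOD/2": subst_half.mean(numeric_only=True),
                 "LOD/sqrt(2)": subst_sqrt2.mean(numeric_only=True)}, axis=1))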
  • asked a question related to Data Cleaning
Question
2 answers
We collected household data using the ODK system, and we ended up with one main dataset and six other linked datasets. We need to merge these six datasets into the main one. Once we have a flat file, we will perform the data cleaning and analysis. Thank you in advance for your kind answers!
Relevant answer
Answer
Hi Sisay
You need to use the merge command. Details of this can be found on https://www.stata.com/manuals13/dmerge.pdf
Good luck with your research.
Best Wishes
Colin
  • asked a question related to Data Cleaning
Question
3 answers
I'm taught to treat missing values and straight-lining errors as part of data cleaning, and then to run validity and reliability checks before going back to normality and outliers. Why not check normality and remove the outliers before testing the constructs' validity and reliability?
  • asked a question related to Data Cleaning
Question
4 answers
Please advise me.
Relevant answer
Answer
Thank you sir
  • asked a question related to Data Cleaning
Question
6 answers
Dear All,
I have been analyzing data from surveys collected on Amazon MTurk for the last year, and a lot of the time it is obvious (and understandable) that people do a pretty awful job of responding. I can completely understand that a lot of the time people will be tired, drunk, or stoned, and will be filling in surveys to make ends meet, but I need to find a widely accepted way of dealing with these responses so they don't add noise to the results.
I come from a neuroscience/psychophysics background where I had loads of freedom with cleaning data (as long as I did it transparently), but now in consumer research and marketing a justified but somewhat arbitrary cleaning of the data is less accepted, both in terms of the reports I produce and the journals I am targeting.
I have an open question at the end of the survey, for ethical reasons, where I ask people what they think the purpose of the study was. These are some of the responses I get (real responses):
- NOTHING
- i ave impressed
- no
- NOTHING FOR LIKE THAT UNCLEAR. IT'S ALMOST FOR SATISFIED.
Clearly one cannot expect anything from a respondent that answers in such a way, and, in fact, when I eliminate such respondents the results make much more sense. I have already set my sample to US residents only, and stated I want English speakers. But linguistically impaired or non-English speakers seem to wriggle their way in.
What do you advise me to do? What is acceptable in science and business, in terms of dealing with random, bad, non-sensical responses?
Some people tell me that they eliminate up to 50% of data from M-turk because it is crappy, and that is normal to them. Other people say that is unacceptable. The people who eliminate up to 50% of data seem to not report it. I would like to have a reasonable procedure that most reasonable people would see as acceptable, and report it.
I am thinking about investing time creating a little program that processes English language and that detects text that cannot be considered as functional, grammatically-sound English statements. Is that something someone has tried?
Lastly, I have heard about an elusive statistical procedure that detects random responses when rating items on a 5- or 7-point scale. I cannot find anything concrete on this, which makes me think it's not widely accepted, well-known, or generalizable.
Any tips or thoughts on the matter will be well appreciated.
Michael
Relevant answer
Answer
If you are working in a KAP (Knowledge-Attitude-Perception) framework, it makes sense to use your Knowledge questions to rank response adequacy. From there on, if you want to proceed with an outlier analysis followed by normalization of some kind, or go with a weighted scheme is up to you.
You can detect (/avoid) random answers by incorporating 'validation questions' (rephrasing of questions asked previously). If respondents answer both the original question and its validation counterpart consistently you're good; if not you have grounds to suspect they are not doing their best.
Optional questions mean you already have a way to handle imbalanced data: not all respondents will opt to answer. It's similar with open questions with no word limit.
Using linguistic criteria doesn't seem appropriate, unless the survey itself is linguistics-oriented.
You can always report versions of your analysis on both 'as-is' and 'clean' data.
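A pandas sketch of the "validation question" consistency screen described above, assuming pairs of original/rephrased items scored on the same scale (the file, column pairs, and tolerance are hypothetical; reverse-worded items would need recoding first):

import pandas as pd

df = pd.read_csv("mturk_responses.csv")            # hypothetical wide file, one row per respondent
pairs = [("q3", "q3_check"), ("q8", "q8_check")]   # hypothetical original/validation item pairs
tolerance = 1                                      # max acceptable absolute difference on a 7-point scale

# Count, per respondent, how many validation pairs were answered inconsistently.
df["n_inconsistent"] = sum((df[a] - df[b]).abs() > tolerance for a, b in pairs)

suspect = df[df["n_inconsistent"] == len(pairs)]   # inconsistent on every pair
print(f"{len(suspect)} of {len(df)} respondents flagged for review")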
  • asked a question related to Data Cleaning
Question
4 answers
I am trying to get hold of the WHO GPAQ (Global Physical Activity Questionnaire) data cleaning instructions for Stata software. The information is available from their website upon request, and versions for other software packages are already on our system.
However, I have been unable to get in contact with anyone from the organisation, and I am wondering whether a fellow researcher doing similar work on physical activity perhaps has a copy of the cleaning program?
Thanks
Eileen
Relevant answer
Answer
Hi Mohammad, I didn't receive a response from them either.
I ended up doing some of the corrections manually in Excel, which was very slow.
  • asked a question related to Data Cleaning
Question
5 answers
I have got data on Pressure Injuries. Each row tells me how many pressure injuries each person has. Then I also have variables such as toe, foot, hand, and so on, which specify the stage of pressure injury. I want to break these variables apart but I am missing something. Current data is as follows:
ID No of PI Foot1 Arm Hand
1 1 Stage1
2 3 Stage 3 Stage 3 Stage 1
3 2 Stage 4 Stage 1
I want the data to look like:
D No of PI PI 1 Location PI 1 Stage PI 2 Location PI 2 stage
1 1 Foot 1 Stage1 Arm Stage 1
2 3 Foot 1 Stage 4 Hand Stage 2
3 2 Foot 1
I have tried using the recode and compute functions, and it works for the first pressure injury, but it doesn't seem to work for the second and so on.
Can you please offer suggestions? Thanks in advance.
Relevant answer
Answer
Please refer to:
1. Landau, S. and Everitt, B. S. (2004). A Handbook of Statistical Analyses Using SPSS.
2. Howitt, D. and Cramer, D. (2008). Introduction to SPSS.
Regards,
Zuhair
  • asked a question related to Data Cleaning
Question
1 answer
Dear Prof. Chen,
I used the latest CiteSpace to analyze CSSCI download files. However, when I follow the procedures to transform the data, it shows that 0 of 0 references have been converted. I also downloaded csscirec(new).jar, but only the first record was converted. I'm wondering whether you could tell me how to fix this problem.
By the way, I'm also wondering how to do the data cleaning more efficiently if I download the file from WOK.
Thanks a lot for your kind consideration and sincere help in advance.
By the way, I once attended your workshop held in Dalian.
Best,
Jie
Relevant answer
Answer
Hi, Thank you for your interest in using CiteSpace.
The latest version, 5.0.R7, has a few minor versions with different build dates. Check the What's New page to make sure it mentions the new CSSCI converter; it should work. The only thing missing from the new CSSCI converter is the keywords.
  • asked a question related to Data Cleaning
Question
1 answer
The dataset is an incomplete and potentially biased representation of academic journal and conference publishing, just like any other known dataset that attempts to fulfil a similar mission. Herrmannova and Knoth (2016) describe both the aspects that can be relied upon and those that need to be used with care, but are there specific, operational ways to improve the quality of Microsoft Academic Graph data?
Relevant answer
Answer
Aleksandr, sir, please explain how citation analysis can be done and how it is related to data mining. What is the aim of this research question?
  • asked a question related to Data Cleaning
Question
4 answers
I am analysing movement data from 18 satellite-tagged individuals from an Old World vulture species. Upon inspecting the data I found there were 3 sampling regimes used:
(i) hourly, 2am to 7pm,
(ii) hourly, 3am to 8pm, and
(iii) every 3 hrs, continually, day and night.
I wonder how this will affect my calculations of monthly home range size (using 95% MCPs and kernels). For the birds where fixes were not taken at night, will the importance of their nocturnal roost sites be underestimated in the home range estimates? Do I need to do some data cleaning before I start estimating home range size? And if so, how can I account for these 3 different sampling regimes in my calculations of monthly home range size?
Also, would a cross-validated bandwidth be a better smoothing method than href for these birds (which I expect to move mainly between the roost/nest site and the same feeding site each day)?
And is there any point in including MCPs? The recent literature seems to imply that they are generally included only because other researchers include them, so 'for comparison with other studies', but they don't seem to be the best way of estimating home range size for a large bird that has a few sites it spends most of its time at (roost/nest site and daily feeding site, for those birds that feed at vulture restaurants).
Lastly, would the package 'adehabitatHR' in R be the best way to get monthly home range sizes? Or would folks recommend T-LoCoH?
Thanks.
Relevant answer
Answer
I expect your regime (iii) to produce the smallest home range size, and with only 5 readings during daylight hours it is not very useful. While (i) and (ii) should be identical, my questions about Hooded (?) Vultures are: when do they go to roost, when do they fly off in the morning, and do they ever change position during the night, perhaps especially on moonlit nights? Readings at 7/8 pm and 6 am will answer that. Readings during the day at one-hourly intervals for a bird that can glide at >80 kph are not enough, but too bad. I'd have thought that MCPs are OK, because we all understand them, but certainly a vulture (as would a seabird) needs a smarter approach.
  • asked a question related to Data Cleaning
Question
9 answers
Hello everyone, I want to ask about multichannel EEG datasets: what will happen if an already-cleaned dataset is cleaned again using ICA or wavelet-enhanced ICA?
Should the resulting signal be the same as before (the original data that was already cleaned), or will it be different?
The point of my question is that we sometimes don't know whether the data has already been cleaned or not (even though we can look for artifacts manually, that is time-consuming).
Relevant answer
Answer
Dear Farrikh, 
I am not so sure about that. Please consider, in fact, that:
1. In general, repeating the same procedure twice is a sign of inefficiency;
2. Ideal filters or procedures do not exist, even in the numerical world: every computation you do introduces plenty of noise terms related, first of all, to the non-linear behaviour of the phase in the filter response.
Anyhow, I suggest you verify how large this non-linear distortion is and evaluate whether it is negligible or not. Having a try is always a good idea.
Hope it helps,
Ciao! 
  • asked a question related to Data Cleaning
Question
4 answers
I am working with two groups. Group A has a sample size of 37 and Group B has a sample size of 29. Can ANOVA or t-test results be affected by the non-equivalent sample sizes? How can I randomly shrink the number of subjects in Group A to 29, and is this a correct approach to data cleaning/analysis?
Another question: in the case of missing values due to subject dropout or non-response, which approach is best, multiple imputation (which will fill in the missing values) or deletion of the subject (non-respondent) and the relevant data? I am using SPSS for cleaning and analysis.
Relevant answer
Answer
For missing values, in your situation imputation doesn't help. If you have other variables, you may check the homogeneity of the distributions of those variables between the missing and non-missing groups, and discuss your findings.
  • asked a question related to Data Cleaning
Question
6 answers
I am currently studying outliers, but I am confused about whether outliers are defined at the attribute level or the record level. For example, say I have to predict house prices by building a model on training data, where the features are #bedrooms and #floors and the target value is house price. Will we then define outliers on the basis of #floors, #bedrooms, or house price?
Thanks!
Relevant answer
Answer
I think the answer is, as usual, a little bit of both.
First, it is sensible to check for very large values in the response variable, i.e., housing prices. Such values might harm your regression model.
Second, one often thinks of outliers in terms of the variables, very much as you say. From a practical point of view, examining features individually is a good starting point. For example, is there a house with 24 bedrooms in your data? That would probably be unusual and could thus represent an outlier. Tukey's rule can be used to decide whether a large/small value qualifies as an outlier; other heuristics are also available.
Third, in addition to examining features individually, one can also look at outliers from a multivariate perspective. This requires that you consider the values of bedrooms and floors in parallel to decide whether a record qualifies as an outlier. I think that when you speak about record-level outliers, this relates to the multivariate perspective. Clearly, when considering many features in parallel, detecting outliers gets more difficult. A relatively simple method is to compute the Mahalanobis distance for all pairs of records and then check for records that are, on average, very far away from all other records.
Clearly, there are many books that discuss the topic. Personally, I like the recent book from Baesens, Analytics in a Big Data World, because it is written in a very approachable manner. But there are many other great books ;)
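A short Python sketch of both checks (the column names and cutoffs are hypothetical; the Mahalanobis version below measures each record's distance from the feature mean, the more common variant, rather than all pairwise distances):

import numpy as np
import pandas as pd

df = pd.read_csv("houses.csv")  # hypothetical columns: bedrooms, floors, price

# Univariate: Tukey's rule on the response (price).
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
univariate_out = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

# Multivariate: Mahalanobis distance of each record from the feature mean.
X = df[["bedrooms", "floors"]].to_numpy(dtype=float)
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
md = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))
multivariate_out = df[md > np.quantile(md, 0.99)]  # hypothetical cutoff: top 1% of distances

print(len(univariate_out), "univariate outliers;", len(multivariate_out), "multivariate outliers")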
  • asked a question related to Data Cleaning
Question
13 answers
I now have comments on several products from Amazon, and I want to apply the LDA model. First, I have to construct the word-document matrix, but it is hard for me to build it because I find it hard to select the words. Although I am from a school of statistics and have taught myself Java, I don't know how to remove signs such as "," and "%%", or how to remove the stopwords. Can anyone recommend some books or other materials? Thank you!
Relevant answer
Answer
I believe it would be easier for you to use a data analysis/statistics tool for this job. Examples are RapidMiner, KNIME, Weka, the R system (you'll need an additional plugin for R), Statistica, etc. Have a look at this survey [1] to learn more about the methods and features supported by different tools.
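If you do end up scripting it yourself, a short scikit-learn sketch of the cleaning-plus-matrix step feeding into LDA (the file and column names are hypothetical):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = pd.read_csv("amazon_comments.csv")["text"]  # hypothetical column of raw comments

# token_pattern keeps alphabetic tokens only (drops ",", "%%", digits); stop_words removes common English words.
vectorizer = CountVectorizer(stop_words="english", token_pattern=r"[a-zA-Z]{2,}", max_df=0.95, min_df=2)
dtm = vectorizer.fit_transform(comments)                # the word-document matrix

lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(dtm)
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(k, [terms[i] for i in topic.argsort()[-8:][::-1]])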
  • asked a question related to Data Cleaning
Question
4 answers
It seems like data cleaning is a bit hectic but still an easy task. Even applying a data mining algorithm to the database is easy. But the problem arises when you want to create an interactive dashboard and visualize the data for analysis. Can anyone guide me on this?
P.S.: I am using a spreadsheet (Excel) as my database file.
Relevant answer
Answer
Check out LabVIEW. Most people think it is strictly for instrumentation control; however, it has developed into a full programming language. It was originally designed for use in laboratories, but I use it every day for data analysis, control of instrumentation, data acquisition, and modeling of simple problems. I have written many LabVIEW programs that never touch instrumentation or hardware. LabVIEW can interface with Simulink, Matlab, C, C++, C#, and many other existing programs.
I agree: get away from Excel (other than for the comma-separated-values data format), especially for large data sets. When it comes to GUI development, which I think is what you are asking about, my golden rule is "Write software that any idiot can use, because you will have to use it later." Restated: keep it simple and intuitive, because you'll forget how to use it after a few months.
  • asked a question related to Data Cleaning
Question
4 answers
I collected the data using a Bruker instrument, and they told us that there might be a K-beta residual. In TOPAS there is a function where you can add Cu Kβ (1.39222 Å). I am now trying to use GSAS to refine the same data. Is it possible to add this?
Relevant answer
Answer
"I collect the data using Bruker and they told us that there might be Kbeta residual."
Where is your K-beta filter, on the X-ray tube side or the detector side?
It would be better if you posted your data.
What are the space group and the chemical formula?
  • asked a question related to Data Cleaning
Question
12 answers
I have a database in Excel; is there any way I can create a multidimensional data cube for further processing? Furthermore, is there any way we can perform data warehousing on such (Excel) data?
Relevant answer
Answer
Dear Mr Manish, I suggest you take a look at Pentaho Data Integration (http://community.pentaho.com/projects/data-integration/) and Data Cleaner (http://datacleaner.org/)
Best regards.
  • asked a question related to Data Cleaning
Question
10 answers
I manage a research participant registry. I have just come on board, and the data from participant surveys has been entered by volunteers. I have checked the first 100 surveys for errors and found around a 60% error rate (around 60% of the surveys have at least one entry error). I plan to double-enter all of the current surveys at 100%. However, beyond more extensive volunteer training, I am looking for measures to ensure data integrity for future surveys.
Relevant answer
Answer
One error on 60% of surveys does not constitute a 60% error rate; the rate depends on the number of responses entered. Actually, only one error on 60 out of 100 surveys doesn't sound unusual. All manual data entry is subject to error. The best approach is to double-enter, as you are doing. If you could load the survey onto telephone survey software, you would reduce your error rate and lessen the number of hours spent on data entry.
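If both entry passes end up as spreadsheets, a small pandas sketch can list the cells where the two passes disagree (the file and ID column names are hypothetical; both files are assumed to contain the same surveys and columns):

import pandas as pd

first = pd.read_csv("entry_pass1.csv").set_index("survey_id").sort_index()
second = pd.read_csv("entry_pass2.csv").set_index("survey_id").sort_index()

# Cells where the two independent entries disagree; each one is then checked against the paper form.
discrepancies = first.compare(second)
print(f"{len(discrepancies)} surveys with at least one disagreement")
discrepancies.to_csv("entry_discrepancies.csv")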
  • asked a question related to Data Cleaning
Question
4 answers
My studies aim to evaluate how psychological constructs (type of motivation, habit, self-efficacy, intention) interact to predict physical activity maintenance (weekly frequency). Some participants get temporarily injured at a time wave but come back for subsequent questionnaires where they are apparently healed and back to their usual exercise level.
I would like to know the optimal procedure to control this confound without sacrificing too much data.
[EDIT 22/01/14] After discussion, my problem boils down to the following dilemma:
1) Should I keep all the data in the analyses and statistically control for the influence of injuries with a set of binary variables (not injured/injured) for each time wave (/4)
2) Should I delete only the behavioral data at the time wave where the injury occurred and impute it using psychological data from the same time wave that is unaffected by injuries (i.e., type of motivation)?
3) Should I delete all data from this time wave, as if the injured participant was only absent and impute it with data from other waves?
4) Is there anything that prevents me from using data from subsequent time waves if the participant reports not being injured anymore?
My goal is not to test hypotheses regarding the onset of injuries or their effect on behavioral or psychological variables but to control them to obtain unbiased predition coefficients of psychological constructs on exercise frequency, as well as test mediations and moderations.
A secondary problem where I need validation is whether or not I should keep active participants that report dealing with a sports injury at baseline given that their exercise frequency appears to be unaffected. On the other hand I would exclude inactive participants that report being injured (sports or medical condition) at baseline unless they show improvements at the second wave of measurement.
Being guided through the rules and steps of data cleanup and imputation in regard to those temporarily injured participants would be a life-saver.
Any advice, general or specific to the scenarios below, would be welcome. I don't know if someone has written on the topic, so I would be grateful for documentation on it.
Thank you very much for your help!
-Pier
Relevant answer
Answer
Dear Professor Carrington,
Thanks for your intervention. I shortened my question and the details for clarity.
I am curious about the topic of censoring, although I am unsure whether it answers my problem: if those temporarily injured participants stayed in the study, and their subsequent data is supposedly unaffected by the now-healed injury, should I avoid treating these participants as simple attrition? Thank you!
  • asked a question related to Data Cleaning
Question
2 answers
I have several datasets with thousands of variables. These different datasets have different variables for the same thing. 
Are there ways to automatically/semi-automatically check compatible variables of several datasets and make them consistent?
If there is such a thing, that would save me months of tedious work.
The data is stored in SPSS format.
Many thanks
Relevant answer
Answer
Do a factor analysis and see which ones are highly correlated with each other. The ones that are will load together, and that is a sign that they are related. With these factors (groupings) you can see what "quality" they represent and whittle them down.
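Before (or alongside) a factor analysis, the variable names and labels themselves can be compared across files; a hedged Python sketch using the pyreadstat package to read only the SPSS metadata (the file names are hypothetical):

import pyreadstat

files = ["wave1.sav", "wave2.sav", "wave3.sav"]  # hypothetical SPSS files

# Read only the metadata (variable names and labels), not the data itself.
metas = {f: pyreadstat.read_sav(f, metadataonly=True)[1] for f in files}
name_sets = {f: set(m.column_names) for f, m in metas.items()}

common = set.intersection(*name_sets.values())
print("variables present in every file:", len(common))
for f, names in name_sets.items():
    print(f, "has", len(names - common), "variables not shared by all files")

# The same name with different labels hints that two files store different things under one name.
for var in sorted(common):
    labels = {m.column_labels[m.column_names.index(var)] for m in metas.values()}
    if len(labels) > 1:
        print("label mismatch for", var, labels)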
  • asked a question related to Data Cleaning
Question
12 answers
I realize there are some situations where a true dummy coding (0, 1)  is needed to make coefficients interpretable (and situations where you have to do contrast coding, etc), but in terms of general practice and data cleaning/coding, which do you prefer? I can see pros and cons for each. Obviously, the numbers would always increase (i.e., 1, 2 not 1, 0). 
Pros for 0,1,2...: 
1) When binary, you can get a proportion by taking an average
Pros for 1, 2, 3...: 
1) The highest category equals the number of categories (if no categories are missing), whereas in the other method you have to remember to add 1.
2) Similarly, you avoid confusion when saying "the first category" versus "category 1" (in the other system, the latter could be confused with "the category labelled 1").
Just looking for some tips from practice that folks have used to help keep things clear (particularly when working with data that isn't fully labelled, or in situations in which labels won't display). Thanks!
Relevant answer
Answer
Matt,
(1) Interpretation of the coefficient of a variable coded 1 and 2 does not change if you code it 0 and 1. If your regression equation is y = 4 + 3x + {error} when x is coded 1 and 2, then it simply becomes y = 7 + 3x' + {error} when x' is the copy of x coded 0 and 1. Acknowledged limitations: (a) the interpretation of interactions will change; (b) the value and the interpretation of the constant/intercept term will change; (c) in the multilevel modeling world, where people are obsessed with the right centering, the question of coding 1/2 or 0/1 could matter too.
Having said that the coding does not really matter to me in statistical terms (and if you think otherwise, you should revisit your linear models class), the data management and cleaning tasks take the higher priority. Hence, other rules for me are:
(2) Being brought up using Stata, not Some Alternative Software, nor the Software that Prohibits Sophisticated Surveys (and having learned the C/C++ ways earlier in high school and college), this is a non-question: 0 is a logical "No", 1 is a logical "Yes", and Stata parses -if female do this- as -if female!=0 do this-, which provides concise and readable syntax. When coded 1 and 2, you have to be more explicit, as in -if female==1-, which is fine, but the Stata convention is 0/1 coding.
(3) For nominal or ordinal variables that have more than two categories, I use 1, 2, 3, 4, etc., with an understanding that I will never use this variable as is in the RHS of a regression equation (which would force the distance between categories 1 and 2 to be interpreted as the same as the distance between 2 and 3, which is rarely meaningful), but rather specify it as categorical (CLASS in SAS, i.agegroup in Stata, as.factor in R).
(4) With interactions of categorical variables, I prefer semi-interpretable coding: agegroup5_female <- agegroup5*10 + female, so that agegroup5_female = 31 stands for agegroup5==3 interacted with female==1. With this, I will be careful in (a) making sure that the leading variable does not have zeroes, as the category I would expect to be "01" would naturally look like a number 1, and would be more difficult to identify in the output; (b) making sure that the order of the variables in the generated interaction variable corresponds to the order of digits, as in my example above. Also, I can easily generate the labels for this new variable in a double cycle in Stata:
forvalues a = 1/5 {
  forvalues g=0/1 {
    label define agegroup5_female_lbl `a'`g' "`: label (agegroup5) `a'', `: label (female) `g''", modify
  }
}
label values agegroup5_female agegroup5_female_lbl
(5) A codebook that mixes conventions is particularly confusing, so ideally you would open the codebook with a universal statement like  "All binary variables are coded 0/1 with 1 being the value identified by the variable name (e.g., female==1 indicating female vs. female==0 indicating male), and all categorical variables with more than two categories are coded incrementally as 1, 2, etc., with 1 being the lowest/worst performing category when the variable is ordinal". However, if you have inherited a survey with a mix of codes (as you apparently have), a lot of researchers would have their old code that is adapted to these weird mixes, and it is better to keep the existing variable names with the existing weird codes... although nobody would prevent you from creating a uniformly coded version of variables that do not conform to the adopted convention.
P.S. Matt, from your earlier posts, I know that you are aware of J Scott Long's gem of "Workflow of Data Analysis Using Stata". Many of my considerations are coming from his book. In Stata in particular, I don't bother even trying to figure out what's behind the variable, as I can run -logistic obese i.race##i.year- followed by -margins race year- that will produce all the necessary labels in the output.
  • asked a question related to Data Cleaning
Question
18 answers
I have read a couple of articles which try to sell the idea that an organization should basically choose between either implementing Hadoop (which is a powerful tool when it comes to unstructured and complex datasets) or implementing a data warehouse (which is a powerful tool when it comes to structured datasets). But my question is: can't they actually go together, since Big Data is about both structured and unstructured data?
Relevant answer
Answer
It's very hard to answer this question in general without taking into consideration what your specific needs are. Also, "data warehouse" is a pretty general term which basically can mean any kind of technology where you put your data for later analysis. It can be a classical SQL database, Hadoop (yes, Hadoop can be a data warehouse, too), or anything else. Hadoop is a general MapReduce framework you can also use for a lot of different tasks, including data warehousing, but also many other things. You also have to bear in mind that Hadoop itself is a piece of infrastructure which will require a significant amount of coding on your part to do anything useful. You might want to look into projects like Pig or Hive, which build on Hadoop and provide a higher-level query language to actually do something with your data.
Ultimately you have to ask yourself what existing infrastructure is already in place, how much data you have, what the kind of questions are you want to extract from your data and so on, and then use something which fits your needs.
  • asked a question related to Data Cleaning
Question
18 answers
What would you suggest for big data cleansing techniques? Is your big data messy? If so, how (or how much) does it affect your work / research? How do you get rid of noise? How do you verify big data veracity, esp. if the source is social media? I would appreciate any suggestions and/or pointers to recent articles in the media, research papers, or documented best practices on big data verification, quality assessment or assurance.
Relevant answer
Answer
Hi Victoria. I do not really agree with the previous comments/answers. Data cleaning has two parts. The first one, and often the most time-consuming, is checking that all values (quantitative or categorical) are as expected (e.g., format: do you allow NULL/NaN, ...). For this I would use dedicated Perl code, for example. Then (second part), once you are sure about the content/values, you can use statistics. But remember that "outlier removal" is, first, a question of where you put the threshold and, second, and even more important, something that should always be justified/confirmed by the experimenters. Variability is inherent to any experimental data. Never remove outliers if you cannot understand why they are outliers. Last but not least: statisticians/biostatisticians often use an arbitrary threshold of 95%. This is a purely arbitrary decision, and it is quite possible that it leads to under- or over-fitting. If you want some help, just contact Pierre.lecocq@ngyx.eu.
  • asked a question related to Data Cleaning
Question
3 answers
The funding agencies and societies have commissioned a multitude of biological databases, 2334 at last count according to the NIF Registry, and from many of them we keep hearing "my renewal is dead", largely because of innovation. I would really like to know whether there is a good reason to rebuild a database every 2-5 years.
Relevant answer
Answer
Dear Kim,
Granted, there is a good reason to archive the data from each publication. Thank you; the work on Dataverse and the data citation principles in FORCE11 are highly relevant. However, when people work to create tools and databases, I am still not clear on what exactly gets archived. Furthermore, do you archive if there is no publication associated? I am thinking not just about some desk drawer in my old lab, but about significant and well-known resources such as some of the Allen Institute brain maps, which may be many terabytes of unpublished data that are widely referenced in the literature. Should there be an archive of the whole dataset at each time point, or of parts of it?
My original question deals with how to handle databases, like Dataverse, in terms of long-term support. What happens to Dataverse data in five years? There is no proper paper published on Dataverse, so what happens to all the data archived there? If there were a paper published on it, would it back up all the data in there?
  • asked a question related to Data Cleaning
Question
3 answers
My question is very general in nature and applies to a wide range of research fields. Very often, physicians and researchers give me data in Excel format for sophisticated data analysis and modeling, which needs extensive treatment in terms of cleaning, processing, formatting, and numerous transformations before it is ready to be put through the mill. One thing very common to such datasets is the lack of a data dictionary, which includes short variable descriptions and value labels (coding descriptions). I wonder whether anyone has developed a script to generate such information, especially value labels, from the raw data in any of the statistical packages such as R, SAS, Stata, or SPSS, and to produce the usual summary estimates. Just to give an idea: categorical variables such as gender, race, death, diseased, etc. all need to be coded such that numerical values such as 0, 1, ... represent the categories, which then need to be labeled, for example female/male, died/alive, White/Black/Hispanic, etc.
Relevant answer
Answer
I don't think this would be reliable because people can't be trusted to name their columns sensibly.
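That caveat granted, a rough draft codebook can still be scripted and then corrected by hand. A pandas sketch (the Excel file name is hypothetical):

import pandas as pd

df = pd.read_excel("clinical_data.xlsx")  # hypothetical raw file from the physicians

rows = []
for col in df.columns:
    s = df[col]
    entry = {"variable": col,
             "dtype": str(s.dtype),
             "n_missing": int(s.isna().sum()),
             "n_unique": int(s.nunique())}
    # For low-cardinality columns, list the observed codes so a human can attach labels (e.g., 0 = female, 1 = male).
    if s.nunique() <= 10:
        entry["observed_values"] = sorted(s.dropna().unique().tolist())
    rows.append(entry)

pd.DataFrame(rows).to_csv("draft_codebook.csv", index=False)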