# Survey Methodology and Data Analysis - Science topic

Survey Methodology and Data Analysis covers, among other areas of interest:

- Constructing paper-based or online surveys
- Questionnaire design and question testing
- Types of data collection (e.g., interviews)
- Data analysis
- *Special topic*: Stated preference techniques (e.g., discrete choice experiments)

## Questions related to Survey Methodology and Data Analysis

In 2007 I did an Internet search for others using cutoff sampling, and found a number of examples, noted at the first link below. However, it was not clear that many used regressor data to estimate model-based variance. Even if a cutoff sample has nearly complete 'coverage' for a given attribute, it is best to estimate the remainder and have some measure of accuracy. Coverage could change. (Some definitions are found at the second link.)

Please provide any examples of work in this area that may be of interest to researchers.

More exactly, do you know of a case with repeated sample surveys of continuous data, perhaps monthly, and an occasional census of the same data items, perhaps annual, likely used to produce Official Statistics? These would likely be establishment surveys, perhaps of volumes of products produced by those establishments.

I have applied a method which is useful under such circumstances, and I would like to know of other places where this method might also be applied. Thank you.

I am currently studying cross-sectional research design. I have found that these studies are often associated with surveys and structured interviews but can also include other methods such as structured observation, content analysis, official statistics, and diaries (Bryman, 2016). I wonder if the focus group technique can be used in a cross-sectional research design and in what situations it could be classified as such.

Could you help me with literature or examples to resolve this question?

A number of people have asked on ResearchGate about acceptable response rates and others have asked about using nonprobability sampling, perhaps without knowing that these issues are highly related. Some ask how many more observations should be requested over the sample size they think they need, implicitly assuming that every observation is at random, with no selection bias, one case easily substituting for another.

This is also related to two different ways of 'approaching' inference: (1) the probability-of-selection-based/design-based approach, and (2) the model-based/prediction-based approach, where "prediction" means estimation for a random variable, not forecasting.

Many may not have heard much about the model-based approach. For that, I suggest the following reference:

Royall(1992), "The model based (prediction) approach to finite population sampling theory." (A reference list is found below, at the end.)

Most people may have heard of random sampling, and especially simple random sampling, where selection probabilities are all the same, but many may not be familiar with the fact that all estimation and accuracy assessments would then be based on the probabilities of selection being known and consistently applied. You can't take just any sample and treat it as if it were a probability sample.

Nonresponse is therefore more than a problem of replacing missing data with some other data without attention to "representativeness." Missing data may be replaced by imputation, or the sample data may be weighted or reweighted to fully account for the population, but results may be degraded too much if this is not applied with caution.

Imputation may be accomplished in various ways, such as by trying to match characteristics of importance between the nonrespondent and a new respondent (a method which I believe has been used by the US Bureau of the Census), or, my favorite, by regression, a method that easily lends itself to variance estimation, though variance in probability sampling is technically different. Weighting can be adjusted by grouping or regrouping members of the population, or just by recalculation with a changed number, but grouping needs to be done carefully.
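As a toy illustration of the regression approach to imputation mentioned above (invented data, not any agency's procedure), one might fit a simple regression on the respondents and predict the nonrespondents:

```python
import numpy as np

def regression_impute(x, y):
    """Impute missing y values from a simple linear regression of y on x.

    x: complete covariate array; y: array with np.nan for nonrespondents.
    Returns a copy of y with missing entries replaced by fitted values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    # Fit y = a + b*x on the respondents only.
    A = np.column_stack([np.ones(obs.sum()), x[obs]])
    coef, *_ = np.linalg.lstsq(A, y[obs], rcond=None)
    y_out = y.copy()
    y_out[~obs] = coef[0] + coef[1] * x[~obs]
    return y_out

# Example: y is close to 2*x, with two nonrespondents.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, np.nan, 8.0, np.nan])
filled = regression_impute(x, y)
```

In practice one would also want a variance estimate for the prediction error, which is one reason the regression route is attractive.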

Recently, work has been done that uses covariates either for modeling or for forming pseudo-weights for quasi-random sampling, to deal with nonprobability sampling. For reference, see Elliott and Valliant(2017), "Inference for Nonprobability Samples," and Valliant(2019), "Comparing Alternatives for Estimation from Nonprobability Samples."

Thus, methods used for handling nonresponse and methods used to deal with nonprobability samples are basically the same. Either missing data are imputed, possibly using regression (which is basically also the model-based approach to sampling, working to use an appropriate model for each situation, with TSE, total survey error, in mind), or weighting is done, which attempts to cover the population with appropriate representation, mostly a design-based approach.

If I am using it properly, the proverb "Everything old is new again" seems to fit here: in Brewer(2014), "Three controversies in the history of survey sampling," Ken Brewer showed that we have been down all these routes before, leading him to believe in a combined approach. If Ken were alive and active today, I suspect he might see things going a little differently than he may have hoped, in that the probability-of-selection-based aspect is not maintaining as much traction as I think he would have liked. This, even though he first introduced 'modern' survey statistics to the model-based approach in a paper in 1963. Today it appears that there are many cases where probability sampling may not be practical or feasible.

On the bright side, I do not find it a particularly strong argument that your sample would give you the 'right' answer if you did it infinitely many times, when you are doing it once, assuming no measurement error and no bias of any kind; so relative standard error estimates there are of great interest, just as they are when using a prediction-based approach, where the estimated variance is that of the prediction error associated with a predicted total, with model misspecification as a concern.

In a probability sample, if you miss an important stratum of the population when doing, say, a simple random sample because you don't know the population well, you could greatly over- or underestimate a mean or total. If you have predictor data on the population, you will know the population better. (Thus, some combine the two approaches: see Brewer(2002) and Särndal, Swensson, and Wretman(1992).)

..........

So, does anyone have other thoughts on this and/or examples to share for this discussion: Comparison of Nonresponse in Probability Sampling with Nonprobability Sampling?

..........

Thank you.

References:

Brewer, K.R.W.(2002), Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold, London, and Oxford University Press.

Brewer, K.R.W.(2014), "Three controversies in the history of survey sampling," Survey Methodology, Dec 2013 (Waksberg Award paper).

Elliott, M.R., and Valliant, R.(2017), "Inference for Nonprobability Samples," Statistical Science, 32(2):249-264,

https://www.researchgate.net/publication/316867475_Inference_for_Nonprobability_Samples, where the paper is found at

https://projecteuclid.org/journals/statistical-science/volume-32/issue-2/Inference-for-Nonprobability-Samples/10.1214/16-STS598.full (Project Euclid, Open Access).

Royall, R.M.(1992), "The model based (prediction) approach to finite population sampling theory," Institute of Mathematical Statistics Lecture Notes - Monograph Series, Volume 17, pp. 225-240. Information is found at

https://www.researchgate.net/publication/254206607_The_model_based_prediction_approach_to_finite_population_sampling_theory, but not the paper.

The paper itself is available under Project Euclid, open access.

Särndal, C.-E., Swensson, B., and Wretman, J.(1992), Model Assisted Survey Sampling, Springer-Verlag.

Valliant, R.(2019), "Comparing Alternatives for Estimation from Nonprobability Samples," Journal of Survey Statistics and Methodology, Volume 8, Issue 2, April 2020, Pages 231–263.

At the US Energy Information Administration (EIA), for various establishment surveys, Official Statistics have been generated using model-based ratio estimation, particularly the model-based classical ratio estimator. Other uses of ratios have been considered at the EIA and elsewhere as well. Please see

At the bottom of page 19 there it says "... on page 104 of Brewer(2002) [Ken Brewer's book on combining design-based and model-based inferences, published under Arnold], he states that 'The classical ratio estimator … is a very simple case of a cosmetically calibrated estimator.'"
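As a hedged illustration (a sketch with invented numbers, not the EIA's production code), the classical ratio estimator of a total has this shape:

```python
import numpy as np

def ratio_estimate_total(y_sample, x_sample, x_pop_total):
    """Classical ratio estimator of a population total.

    Predicts the y-total as R_hat * X, where R_hat = sum(y_s)/sum(x_s)
    and X is the known population total of the auxiliary variable x
    (for example, a census-year volume for each establishment).
    """
    y_sample = np.asarray(y_sample, dtype=float)
    x_sample = np.asarray(x_sample, dtype=float)
    r_hat = y_sample.sum() / x_sample.sum()
    return r_hat * x_pop_total

# Toy sample of 4 establishments; x totals 1000 over the whole population.
y_s = [12.0, 30.0, 21.0, 9.0]   # current-period volumes (sampled)
x_s = [10.0, 25.0, 20.0, 8.0]   # census-period volumes for the same units
t_hat = ratio_estimate_total(y_s, x_s, 1000.0)
```

Under the model-based view, a variance estimate for the prediction error would accompany this; that part is omitted here.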

Here I would like to hear of any and all uses made of design-based or model-based ratio or regression estimation, including calibration, for any sample surveys, but especially establishment surveys used for official statistics.

Examples of the use of design-based methods, model-based methods, and model-assisted design-based methods are all invited. (How much actual use is the GREG getting, for example?) This is just to see what applications are being made. It may be a good repository of such information for future reference.

Thank you. - Cheers.

I am analyzing a nationally representative survey, and I wonder whether recoding categorical variables like gender or education would mess up the weights.

Each row of the data has a weight, a stratum, and a PSU. Does recoding the categorical variables impact the results of my regression analysis?

I am looking for online survey tools with the ability to upload a photo file and send your exact, present location (in terms of longitude and latitude).

The tool should work perfectly on mobile devices, and preferably be free of charge.

The online survey will be completed on mobiles while outdoors, so **sending location needs to be easy and user friendly**. I was considering just using LimeSurvey and asking participants to copy and paste their location URL from the Google Maps app, but that is inconvenient and inaccurate. Thank you!

How can I validate a questionnaire for a small sample of hospitals' senior executive managers?

Hello everyone

-I performed a systematic review for the strategic KPIs that are most used and important worldwide.

-Then, I developed a questionnaire in which I asked the senior managers at 15 hospitals to rate these items based on their importance and their performance at that hospital on a scale of 0-10 (Quantitative data).

-The sample size is 30 because the population is small (however, it is an important one to my research).

-How can I perform construct validation for the 46 items, especially since EFA and CFA will not be suitable for such a small sample?

-These 45 items can be classified into 6 components based on the literature (such as financial, managerial, customer, etc.).

-Bootstrapping in validation was not recommended.

-I found a good article with a close idea but they only performed face and content validity:

Ravaghi H, Heidarpour P, Mohseni M, Rafiei S. Senior managers’ viewpoints toward challenges of implementing clinical governance: a national study in Iran. International Journal of Health Policy and Management 2013; 1: 295–299.

-Do you recommend using EFA for each component separately, which would contain around 5–9 items, treating each as a separate scale and defining its sub-components? (I tried this option and it gave good results and sample adequacy, but I am not sure whether this is acceptable.) If you can think of other options, I will be thankful if you can enlighten me.


Hi, I am an undergraduate currently working on a project that is using a quantitative survey.

I have developed 3 scenarios that share the same 5 Likert-scale questions. The questions are split into confidence and experience, as they ask respondents to self-rate their confidence and experience in the skills specified.

My question is: how should I analyse the Likert-scale responses across all 3 scenarios? Can I sum them and divide to get the mean value of each response to each question? I can't seem to find papers describing a situation similar to mine.

I have found Cronbach's alpha to be >0.7 across all the questions, and there is a significant positive correlation between confidence and experience across all 3 scenarios. Are these reasons valid enough to justify summing the responses across the 3 scenarios? I can't find any research saying when I am able to add the responses up.

Please help as I am quite lost. Please cite sources in your statement so I can read up further too.
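For reference, the Cronbach's alpha figure mentioned above can be reproduced directly from the item-score matrix; a small numpy sketch on invented ratings (not the poster's data):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Five respondents answering three 5-point Likert items (toy data).
scores = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
])
alpha = cronbach_alpha(scores)
```

Alpha speaks to internal consistency, not to whether scenarios are interchangeable, so it supports but does not by itself justify summing across scenarios.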

Hi all,

6 of the items in the 5-point Likert-type scale are positively worded and 6 are negatively worded. When I turn the negatives into positives, the meaning changes a bit.

E.g:

In a municipality where women are a minority, women feel excluded.

Is it appropriate to take scale averages and apply t-tests and ANOVA without reversing these items? Will there be a problem if I do this?
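For a 5-point scale, the usual reverse-keying arithmetic is simply 6 minus the raw score; a minimal sketch (illustrative values only):

```python
import numpy as np

def reverse_code(scores, scale_max=5):
    """Reverse-key an item on a 1..scale_max Likert scale (5-point: 6 - x)."""
    return (scale_max + 1) - np.asarray(scores)

# One negatively worded item answered by four respondents.
raw = np.array([1, 2, 4, 5])
rev = reverse_code(raw)  # -> [5, 4, 2, 1]
```

If reversing changes the item's meaning, that is a wording problem rather than a scoring one; averaging positively and negatively keyed items without reversing would cancel them out.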

I am a Master's student specializing in development economics (especially rural development), and I am eager to publish my thesis in an academic journal. The topic concerns determinants of vulnerability and the roles of livelihood assets, so I would like to ask what kind of journal would be most suitable. If possible, I would like to publish in a highly ranked journal in terms of impact factor. Thank you in advance.

Hello!

So, here is the story. I was given this Likert-scale data for analysis, and I just can't work out how I should deal with it. It is a 1-7 scale with answers ranging from 1 being "extremely worse" to 7 being "extremely better". But here is the problem: 4 is "same as before", and the questions frame the changes as an effect of a different variable, which is work from home (for example, "Compared to work from the office, how much has your ability to plan your work so that it was done on time changed when working at home?").

Questions are grouped to form variables, and the mean should presumably show each person's opinion on the change, right? But it seems strange to me to work with just one parameter rather than a full comparison of now vs. before as two different constructs.

If you have any works or insight on the topic, can you please help me?

All the best and take care!

I am trying to perform the cell-weighting procedure on SPSS, but I am not familiar with how this is done. I understand cell-weighting in theory but I need to apply it through SPSS. Assume that I have the actual population distributions.
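In SPSS the usual route is to compute a weight variable and apply it with WEIGHT BY; the arithmetic behind cell weighting itself, sketched outside SPSS with hypothetical cell labels and assumed population proportions, looks like this:

```python
import numpy as np

def cell_weights(sample_cells, pop_props):
    """Post-stratification ('cell') weights.

    sample_cells: list of cell labels, one per respondent.
    pop_props: dict mapping cell label -> known population proportion.
    Each respondent's weight is pop_share / sample_share for their cell,
    so the weighted sample reproduces the population cell distribution.
    Assumes every cell in pop_props appears in the sample.
    """
    cells = np.asarray(sample_cells)
    n = len(cells)
    weights = np.empty(n, dtype=float)
    for cell, pop_p in pop_props.items():
        mask = cells == cell
        sample_p = mask.sum() / n
        weights[mask] = pop_p / sample_p
    return weights

# Population is 50/50 male/female, but the sample is 3 males to 1 female.
w = cell_weights(["m", "m", "m", "f"], {"m": 0.5, "f": 0.5})
```

The resulting weight variable (here 2/3 for each male, 2 for the female) is exactly what one would compute into a new column and hand to SPSS's weighting facility.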

I have a:

- "Retrospective panel survey": In each year all units are asked "who (X) first told you about us (in the year you first learned about us)?"
- There is lots of attrition from the panel, which may vary by X, as well as new people entering in each year

My question of interest: How is X, the 'how did you first learn about us' thing changing across time? I.e., is the 'point of first contact' (referrer) changing from year to year?

*Possible approaches*

**A. Single-retrospect:**

If I use only the most recent (2020) retrospective data this may lead to a bias from differential attrition related to X (as well as issues of imperfect recall).

If people who 'heard about us through Spaghetti Monster' have dropped out at twice the average rate, and Spaghetti Monster was the referrer for 1/2 of those who learned about us in 2015, we will falsely report that "only 1/4 of people who heard about us in 2015 heard about us through SM".

**B. Recent-retrospect for each year**

I could look instead at the 2016 recall data only for 2015, 2017 data for 2016 etc., as there will be less attrition between the shorter time intervals.

But this has its own problem: the share of survey respondents coming from each referrer fluctuates from year to year. Suppose in 2020 there is a particularly low SM response rate vs. 2019; we would falsely claim that SM referrals fell dramatically in 2020 relative to 2019. This should *not* be a problem for the single retrospect.

I vaguely remember seeing papers dealing with similar issues, but I can't recall them. Before I try to reinvent the wheel, any suggestions? Thanks!

*I'm a bit new to these aspects of survey design and analysis. What should I read and what are some approaches to the following situation and question?*

**Suppose:**

- We've a *population-of-interest* based on an affiliation, certain actions, or a set of ideas (e.g., 'vegetarians' or 'tea-party conservatives')... call it the "Movement"
- There has never been a nationally representative survey nor a complete enumeration of this group. There is no 'gold standard'
- For several years we've advertised a survey (with a donation reward) in several outlets (web pages, forums, listserves which we call 'referrers') associated with the 'movement'
- We can track responses from each referrer. We suspect some referrers are more broadly representative of the movement as a whole than others, but of course there is no gold standard.

This is essentially a **'convenience sample'**, perhaps more specifically a 'river sample' (using the notation of Baker et al., 2013) or an 'opt-in web-based sample'.

**It is probably non-representative because of:**

- Exclusion/coverage bias: Some members of the movement will not be aware of the survey (they don't visit any of the outlets or they don't notice it)
- Participation/non-response bias: Among those aware (through visiting the 'referrers'), only a smallish share complete the survey (and these likely tend to be the more motivated and time-rich individuals). Some outlets/referrers may also promote the survey more prominently than others.

**We wish to measure:**

- The (changing) demographics (and size) of the movement
- Measures of the demographics, beliefs, behavior, and attitudes of people in the movement (and how these have changed from year to year)

**Our methodological questions**

*Analysis*: Are there any approaches (e.g., weighting, cross-validating something or other) that would be better than 'reporting the unweighted raw results' when using this 'convenience/river' sample, to either:

i. Get results (either levels or changes) likely to be more 'representative of the movement as a whole' than our unweighted raw measures of the responses in each year?

ii. Get measures of the extent to which our reports are likely to be biased... perhaps bounds on this bias.

*Survey design*: In designing future years' surveys, is there a better approach?

**Brainstorming some responses...**

*Analysis*

- E.g., as we can separately measure demographics (as well as stated beliefs/attitudes) for respondents from each referrer, we could consider testing the sensitivity of the results to how we weight responses from each referrer.
- Or we might consider using the demographics derived from some weighted estimate of surveys in all previous years to re-weight the survey data in the present year to be "more representative."
- As noted, we subjectively think that some referrers are more representative than others, so maybe we can do something with this using Bayesian tools
- We may have some measures of the demographics of participants on some of the referrers, which might be used to consider weighting to deal with differential non-response
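One concrete version of the re-weighting ideas in the bullets above is raking (iterative proportional fitting) toward assumed demographic targets. The margins, categories, and respondents below are entirely invented for illustration:

```python
import numpy as np

def rake(base_w, margins, attrs, tol=1e-8, max_iter=100):
    """Iterative proportional fitting (raking) of respondent weights.

    base_w: starting weights, one per respondent.
    margins: list of dicts, each mapping category -> target population share.
    attrs: list of arrays (parallel to margins) giving each respondent's category.
    """
    w = np.asarray(base_w, dtype=float).copy()
    for _ in range(max_iter):
        w_before = w.copy()
        for target, cats in zip(margins, attrs):
            cats = np.asarray(cats)
            total = w.sum()
            factors = np.ones_like(w)
            # Scale each category so its weighted share hits the target share.
            for cat, share in target.items():
                mask = cats == cat
                factors[mask] = share * total / w[mask].sum()
            w *= factors
        if np.max(np.abs(w - w_before)) < tol:
            break
    return w

# Six respondents raked to assumed 50/50 gender and 60/40 age targets.
gender = np.array(["m", "m", "f", "f", "f", "m"])
age = np.array(["young", "old", "young", "old", "young", "young"])
w = rake(np.ones(6),
         [{"m": 0.5, "f": 0.5}, {"young": 0.6, "old": 0.4}],
         [gender, age])
```

The caveat is that raking only corrects for the margins you rake on; selection related to unmeasured traits (motivation, time available) remains.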

*Survey design*

- Would 'probability sampling' within each outlet (randomly choosing a small share within each to actively recruit/incentivize, perhaps stratifying within each outlet if the outlet itself provides us demographics) somehow be likely to lead to a more representative sample?

It's not immediately obvious to me why this would improve things. The non-response within probability samples would seem to be an approximately equivalent problem to the limited participation rate in the convenience sample. The possible advantages I see would be:

i. We could offer somewhat-stronger incentives for the probability sample, and perhaps reduce this non-response/non-participation rate and consequent biases.

ii. If we can connect to an independent measure of participant demographics from the outlets themselves, this might allow us to get a better measure of the differential rates of non-participation by different demographics, and adjust for it.

**Some references (what else should I read?)**

Baker, R., Brick, J.M., Bates, N.A., Battaglia, M., Couper, M.P., Dever, J.A., Gile, K.J., Tourangeau, R., 2013. Summary report of the AAPOR task force on non-probability sampling. Journal of survey statistics and methodology 1, 90–143.

Salganik, M.J., Heckathorn, D.D., 2004. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological methodology 34, 193–240.

Schwarcz, S., Spindler, H., Scheer, S., Valleroy, L., Lansky, A., 2007. Assessing Representativeness of Sampling Methods for Reaching Men Who Have Sex with Men: A Direct Comparison of Results Obtained from Convenience and Probability Samples. AIDS Behav 11, 596. https://doi.org/10.1007/s10461-007-9232-9

Let's say that I'd like to compare male and female respondents' perceptions of government across 30 countries, and test whether there are statistically significant gender differences in perceptions in each country. The data come from surveys conducted in the 30 countries. Simple t-tests on each country sample show that the two group means are statistically different in all countries. But if I want to claim that gender differences exist in every country regardless of the values of country-level variables, including polity score, GDP, and a gender equality index, what is my next step other than regression? I don't need to run a regression; I just want to show group mean differences across countries.
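If the goal is simply to display per-country group mean differences, one hedged sketch is a loop of Welch t-statistics per country (the country labels and scores below are invented, and survey weights/design effects are ignored for simplicity):

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic for a difference in two group means."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    se2 = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(se2)

# Hypothetical perception scores by gender in two of the 30 countries.
data = {
    "A": {"men": [3.1, 2.9, 3.4, 3.0], "women": [2.1, 2.0, 2.4, 2.2]},
    "B": {"men": [4.0, 3.8, 4.1, 3.9], "women": [3.2, 3.1, 3.4, 3.0]},
}
t_by_country = {c: welch_t(g["men"], g["women"]) for c, g in data.items()}
```

A table of these statistics (with means and confidence intervals) per country makes the "differences everywhere, regardless of country-level covariates" point descriptively, without a pooled regression.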

I am doing survey data analysis using Stata's svyset command. I would like to check whether collinearity exists between any of the independent variables used in the regression analysis; which post-estimation command should I use? Does the margins command work after svyset? Thanks in advance.

Please find attached the Word file with the output I received and the command used.

Context: secondary analysis of large survey data.

Logistic regression for prediction of a binomial outcome.

Hi,

My question is in two parts:

I have collected data from a set of people completing a course. The data concern their confidence in certain procedures. Data were collected pre-course, immediately post-course, and 6 months following course completion.

Unfortunately 33 people completed pre course, 28 people completed post course, and 21 people completed 6 month surveys.

I am not sure whether this makes the data 'non-matching' and therefore means I could not use the Friedman test.

Hoping someone will be able to help me!

Thanks
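The Friedman test does require matched data, so the usual practice is to restrict it to the participants observed at all three time points. A hedged numpy sketch of the statistic on invented toy ratings (not the poster's data; in practice the 21 fully matched trainees would form the matrix):

```python
import numpy as np

def friedman_stat(scores):
    """Friedman chi-square for an (n_subjects x k_conditions) matrix.

    Assumes matched data: only subjects measured at every time point are
    included, and (for this sketch) no tied scores within a subject.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    # Rank conditions 1..k within each subject (row).
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    col_rank_sums = ranks.sum(axis=0)
    return 12.0 / (n * k * (k + 1)) * (col_rank_sums ** 2).sum() - 3 * n * (k + 1)

# Toy confidence ratings (pre, post, 6-month) for four matched subjects.
toy = np.array([
    [2, 4, 5],
    [1, 3, 4],
    [2, 5, 4],
    [3, 4, 5],
])
stat = friedman_stat(toy)
```

Dropping the unmatched respondents costs power and can introduce bias if dropout is related to confidence, which is worth acknowledging in the write-up.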

I hope to conduct a series of interviews/questionnaire surveys to collect information regarding urban flood management and the use of software tools for the same.

Fundamentally, decision-makers, flood modellers, general public and software modellers/developers are in my expected audience.

Could you please suggest what personal information should be considered when weighting them?

My assumptions are as follows:

1. Decision makers: age, level of education, years of service, level in the organization, number of participations/decisions in actual flood management activities

2. Flood modellers: educational status (MSc/PhD, etc.), years of experience, number of participations/decisions in actual flood management activities

3. Software developers: years of experience, number of contributions to actual flood management software development and the role he/she played

4. General public: age, the degree to which the person has been flood-affected, educational level, experience with floods

Does the weighting occur in handling the collected data, or in the way the data are collected?

I'm working on an R&R and a reviewer is questioning our response rate. We e-mailed a link to an online survey to student-athletes on our campus. We targeted specific teams, offered an incentive (entered into a drawing for a $25.00 amazon.com gift card) and got a response rate of 37.6%. I thought that was decent, but the reviewer specifically asked "Why was the response rate so low?" Are there published/expected response rates that I can cite?

Thanks!

KS

I have a question in my questionnaire regarding purchase intention, and the options to choose the answer are:

- Definitely Not
- Probably Not
- Possibly
- Probably
- Definitely

From this question, I need to figure out the relation between purchase intention and three other factors (asked in 15 different Likert-scale questions).

I am doing a pretest, and so far 8 people have filled in the questionnaire; 7 of them chose **Possibly** as their answer to the purchase intention question. So my question is: if, for example, 90 percent of respondents in the final questionnaire choose the same answer to that question, can I still get a meaningful analysis from my data?

My data consists of observations rated on 5 different dimensions (5-point Likert scale). A sixth variable describes the outcome (binary 0-1) I am interested in. Rather than understanding the individual contribution of these dimensions to the outcome variable, I am interested in finding the optimal combination of dimensions resulting in the highest probability of the outcome being equal to 1.

Do you have any advice regarding a methodological approach for me?

Thank you very much in advance, your help is highly appreciated!

Kind regards,

Jessica Birkholz

Hi, I am planning a community survey in a remote community. I have designed it as a paper-based survey but am considering transferring it to an online survey platform such as SurveyMonkey, Survey Gizmo, or Survey Sparrow (the latter two have offline functionality, so they are looking more appealing). However, since the community is remote and there is a mix of young and old, I would still like to offer paper-based surveys for those who prefer them. Are there any risks or considerations I should be aware of if I go down this route?

I have utilized **Qualtrics** and **SurveyMonkey** in the past, and I truly appreciate the look and feel of Qualtrics for both researcher and participant. What do **you** use, and what would you recommend if funding is difficult to secure for a more expensive survey application (i.e., Qualtrics) for a three- to six-month study?

Hi! I noticed that Mplus offers two alternative approaches to modeling when the measurements are not independent. One approach is multilevel modeling (TYPE=TWOLEVEL), and the other, TYPE=COMPLEX, handles the non-independence of observations in a different way. Multilevel modeling is very popular, but the second approach seems less frequent. What do you think about the second approach in cases where you have no reason to model random effects and you just want to account for the non-independence of observations? Please find below a passage from the Mplus manual where the authors describe the second approach:

"Complex survey data refers to data obtained by stratification, cluster sampling and/or sampling with an unequal probability of selection. Complex survey data are also referred to as multilevel or hierarchical data. For an overview, see Muthén and Satorra (1995). There are two approaches to the analysis of complex survey data in Mplus. One approach is to compute standard errors and a chi-square test of model fit taking into account stratification, non-independence of observations due to cluster sampling, and/or unequal probability of selection. Subpopulation analysis is also available. With sampling weights, parameters are estimated by maximizing a weighted loglikelihood function. Standard error computations use a sandwich estimator. This approach can be obtained by specifying TYPE=COMPLEX in the ANALYSIS command in conjunction with the STRATIFICATION, CLUSTER, WEIGHT, and/or SUBPOPULATION options of the VARIABLE command. Observed outcome variables can be continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types. The implementation of these methods in Mplus is discussed in Asparouhov (2005, 2006) and Asparouhov and Muthén (2005, 2006a)."

I wonder whether anyone has already tried it before and what are your impressions. Thank you!

Recently I have been doing research to examine the psychometric characteristics of the Satisfaction With Life Scale (SWLS) (developed by Diener, Emmons, Larsen, & Griffin, 1985) in the Indonesian context. The scale consists of 5 items with a 7-point Likert scale.

I'm planning to compare which works better in the Indonesian context: a 7-point or a 5-point Likert scale for the Satisfaction With Life Scale. How can I compare the two? What analysis should I conduct?

Thank you in advance.

As I recall, I saw an estimator which combined a ratio estimator and a product estimator using one independent variable, x, where the same x was used in each part. Here is my concern: A ratio estimator is based on a positive correlation of x with y. A product estimator is based on a negative correlation of x with y. (See page 186 in Cochran, W.G(1977), Sampling Techniques, 3rd ed., John Wiley & Sons.) So how can the same x variable be both positively and negatively correlated with y at the same time?

Can anyone explain how this is supposed to work?

Thank you for your comments.
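One way to see the tension is by simulation: with a strongly *positively* correlated x and y, the ratio estimator should be much more accurate than the product estimator, and the reverse holds for negative correlation. A hedged sketch with entirely synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population where x and y are strongly positively correlated.
N = 1000
x = rng.uniform(10.0, 20.0, N)
y = 2.0 * x + rng.normal(0.0, 1.0, N)
X_bar, T_y = x.mean(), y.sum()

def ratio_total(ys, xs):
    """Ratio estimator of the y-total: efficient when corr(x, y) > 0."""
    return ys.mean() / xs.mean() * X_bar * N

def product_total(ys, xs):
    """Product estimator of the y-total: intended for corr(x, y) < 0."""
    return ys.mean() * xs.mean() / X_bar * N

# Repeated simple random samples: compare mean squared errors.
reps, n = 1000, 30
err_r, err_p = [], []
for _ in range(reps):
    idx = rng.choice(N, size=n, replace=False)
    err_r.append(ratio_total(y[idx], x[idx]) - T_y)
    err_p.append(product_total(y[idx], x[idx]) - T_y)
mse_ratio = np.mean(np.square(err_r))
mse_product = np.mean(np.square(err_p))
```

Since the two estimators pull in opposite directions, a combined ratio-product estimator using the same x would seem to require the weight on each part to reflect the estimated sign and strength of the correlation, rather than both parts being appropriate simultaneously.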

As you are doubtless aware, the paper-based survey has been one of the most common methods for gathering data on people's behavior (either revealed preferences or stated preferences). I want to establish how much we can rely on newer methods like Internet (web-based) surveys instead of traditional paper-based surveys. In particular, my research concerns travel behavior analysis, and my sample should cover all socioeconomic groups and almost all geographical areas of a city.

I would be happy if somebody shared with me his/her opinion or the valid references.

Thanks in advance

Design-based classical ratio estimation uses a ratio, R, which corresponds to a regression coefficient (slope) whose estimate implicitly assumes a regression weight of 1/x. Thus, as can be seen in Särndal, C.-E., Swensson, B. and Wretman, J. (1992), Model Assisted Survey Sampling, Springer-Verlag, page 254, the most efficient probability-of-selection design would be unequal probability sampling, where we would use probability proportional to the square root of x for sample selection.

So why use simple random sampling for design-based classical ratio estimation? Is this only explained by momentum from historical use? For certain applications, might it, under some circumstances, be more robust in some way? It does not appear to conform to a reasonable data or variance structure.

I'm working on a financial inclusion project that wants to use a survey to engage with clients of digital financial systems (DFS) in rural areas and am looking for a tool recommendation - possibly SMS based, or Voice interactive response system, or mobile phone app (but less desirable).

Would love to hear of people's experiences and suggestions.

Thanks

We are working on a survey design for youths, but we will include children who are 12-14 years old. The classic measures of time preferences and risk will be too complex for children. Any recommendations on general approaches to asking these questions in a survey experiment would be perfect.

I'm at US National Science Foundation -- National Center for Science and Engineering Statistics. We sponsored the National Survey of Recent College Graduates (https://nsf.gov/statistics/srvyrecentgrads/), conducted from 1973 through 2010, a cross-sectional biennial survey that provided demographic and career information about individuals holding a bachelor's or master's degree. I am compiling a list of publications using the Survey data. We plan on adding this list to our website. If you have used the Survey data, please post the citation to your research below, and/or send me a copy of your paper. Thanks!

Hi,

I am trying to combine parent data information to their children file.

1. I have one dta file for the children's information and another for their parents, who are in a separate data file. The children's data set contains a unique identifier for each parent's ID, which I can use to determine who a child's mother or father is. What I want to do is match details such as the father's and mother's employment and education level to the children's database. Is there an efficient way to do this?

2. If I have combined the children and adult datasets into one dta file, can I then do what I describe above, or will I have to do it separately as in (1)?

3. What if I have organised the children's data into a panel dataset and now want to add the information about parents. Is there an efficient way to do it from there?
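In Stata this is typically a `merge m:1` on the parent identifier (many children matched to one parent). As a language-neutral sketch of the same keyed-join logic (field names are hypothetical):

```python
# Sketch: attach parent attributes to each child record via the parent-ID key.
# Field names (parent_id, education, employment) are hypothetical.

parents = {
    "P1": {"education": "degree", "employment": "full-time"},
    "P2": {"education": "secondary", "employment": "self-employed"},
}

children = [
    {"child_id": "C1", "parent_id": "P1"},
    {"child_id": "C2", "parent_id": "P1"},
    {"child_id": "C3", "parent_id": "P2"},
]

def merge_parent_info(children, parents):
    """Left-join parent attributes onto each child record."""
    merged = []
    for child in children:
        row = dict(child)                           # copy the child record
        info = parents.get(child["parent_id"], {})  # {} if parent not found
        row["parent_education"] = info.get("education")
        row["parent_employment"] = info.get("employment")
        merged.append(row)
    return merged

merged = merge_parent_info(children, parents)
```

For (2), keeping the two files separate and merging on the key is usually cleaner than appending them into one file first; for (3), the same m:1 merge works on a panel-format child file as long as the parent ID appears on every row.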

Looking for ideas.

Thank You.

I'm at US National Science Foundation -- National Center for Science and Engineering Statistics. We sponsor the Survey of Earned Doctorates (https://nsf.gov/statistics/srvydoctorates/), a census of all individuals receiving a research doctorate from a US university. I am compiling a list of publications using the Survey data. We plan on adding this list to our website. If you have used the Survey data, please post the citation to your research below, and/or send me a copy of your paper. Thanks!

Hi,

So I'm a beginner at SPSS and I want to know what one should do when inputting survey data into SPSS and the question can yield both continuous numerical data as well as categorical data.

Example: I'm working on a survey of migrant workers' incomes, and there is one question that asks about workers' incomes *pre-migration*. There are five options:

1) Write down your income: ________$

2) I was self-employed

3) I worked in agricultural labour

4) I did not have a paid job

5) Don't know/can't remember

Unlike straightforward closed questions where one can code responses to particular values, there is the possibility for continuous input. As such, what is the best practice under such a circumstance?
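One widely used practice is to store such a question as two SPSS variables: a categorical variable for the response type (codes 1-5) and a separate numeric variable for the income amount that is system-missing unless option 1 was chosen. A sketch of the recode logic (variable and code names are illustrative, not SPSS syntax):

```python
def recode_income(option, amount=None):
    """Split a mixed question into (income_type, income_amount).

    option: 1-5, the response option from the questionnaire
    amount: the income figure, supplied only when option == 1
    income_amount is None (system-missing in SPSS) for options 2-5.
    """
    if option == 1:
        return 1, float(amount)
    return option, None

# Example rows: (option, amount) as captured on the form
raw = [(1, 350.0), (2, None), (5, None)]
coded = [recode_income(opt, amt) for opt, amt in raw]
```

With this layout, frequencies on `income_type` describe the whole sample, while descriptive statistics on `income_amount` automatically apply only to those who reported a figure.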

I have 4 populations of different races that I am following over a period of time. I am looking at the incidence of a particular condition (let's call it condition A) after an event x (let's say joining a particular business). I find that one particular population (let's call it population k) has a significantly lower incidence of condition A compared to the other populations. Upon further analysis I realize that event x occurred in most members of population k a lot later than in the other three populations (i.e., most members of population k joined the business a lot later than the other 3 groups).

I want to know whether the lower incidence of condition A is due to the late occurrence of event x or do they actually have a lower incidence.

How do I approach this problem? I am thinking of getting Kaplan-Meier curves of condition A in all 4 subgroups. What do I do thereafter?
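Since the groups have different time origins, one option is to measure each subject's time from their own event x (date of joining) and censor at end of follow-up; the Kaplan-Meier curves then compare like with like, and a log-rank test can compare the four curves. As a sketch of what the product-limit estimate computes (plain Python, hypothetical data):

```python
from itertools import groupby

def kaplan_meier(times, events):
    """Product-limit estimate of the survival function S(t).

    times:  follow-up time for each subject, measured from event x
    events: 1 if condition A occurred at that time, 0 if censored
    Returns [(t, S(t))] at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s, curve = 1.0, []
    for t, group in groupby(data, key=lambda pair: pair[0]):
        group = list(group)
        d = sum(e for _, e in group)   # events observed at time t
        if d > 0:
            s *= 1 - d / n_at_risk
            curve.append((t, s))
        n_at_risk -= len(group)        # events and censorings leave the risk set
    return curve

# Hypothetical: four subjects, the third censored at t = 3
curve = kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1])
```

This is only the estimation step; in practice a survival package (e.g. `survival` in R) also gives the log-rank test and, if needed, a Cox model with population as a covariate.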

I am still collecting survey tools that researchers have used in studies of transgender men and women. The tools will be used to inform the development of a new survey tool that **may** be added to CDC's National HIV Behavioral Surveillance (NHBS) survey set. If you are willing to share your survey tool (and have not already done so), please do so.

All the best,

Stephen

Suppose you are going to introduce an innovative therapy and you want to do a market survey. What will your population size be, and how do you determine it?

I have two sets of data from the same survey with 55 responses each. The samples are independent. Subjects were asked to select from a list of 16 skill types (nominal) their top five most important skills. They were then asked to select their five least important skills. Each of these 10 skills that were selected are unique (no repeats).

After selecting those 5 most (least) important skills, I asked respondents to rank them in order of their importance (non-importance).

I would like to know what statistical methods I can use to analyze each data set and how can I compare the two data sets. I am trying to find out if there are any meaningful differences in skills that were chosen between these two groups.
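One way to make the two samples comparable is to convert each respondent's choices into a score per skill (a Borda-style coding: +5 … +1 for the ranked most-important skills, -1 … -5 for the least important, 0 for unchosen), after which the per-skill score distributions of the two groups can be compared with, e.g., a Mann-Whitney U test. A sketch of the coding step (skill indices 0-15 are hypothetical):

```python
def skill_scores(top5, bottom5, n_skills=16):
    """Convert one respondent's ranked top-5 / bottom-5 into a score per skill.

    top5, bottom5: lists of skill indices, ordered most to least important
    (for bottom5, most to least UNimportant). Rank 1 in top5 scores +5,
    rank 5 scores +1; rank 1 in bottom5 scores -5, rank 5 scores -1;
    skills not chosen score 0.
    """
    scores = [0] * n_skills
    for rank, skill in enumerate(top5):
        scores[skill] = 5 - rank
    for rank, skill in enumerate(bottom5):
        scores[skill] = -(5 - rank)
    return scores
```

Summing these vectors within each group gives an overall importance profile per group; note the scoring weights are an assumption, and a simpler "chosen in top 5: yes/no" coding with a chi-square test per skill is a defensible alternative.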

I'm planning to perform a SEM in order to investigate intention to innovate. This dependent variable depends on attitude toward innovation (ATI) and entrepreneurial self-efficacy (ESE) - based on the scheme of the Theory of Planned Behaviour. Literature indicates that other variables play a role, such as gender and family exposure to entrepreneurship (FEE).

How should I perform the analysis, in order to see how gender and FEE impacts the scheme? The data comes from a survey with 1200 answers.

At first I thought of a mediating model, but I believe the effect is not a causal relationship between gender and the dependent variable, but rather a moderator effect. Nevertheless, I am struggling to implement it in Stata. Most of the literature discusses mediation and moderation at the same time, which is not my case.

Other suggestions of analysis would also be really helpful.

Best regards, Pedro

I'm at US National Science Foundation -- National Center for Science and Engineering Statistics. We sponsor the National Survey of College Graduates (https://nsf.gov/statistics/srvygrads/), a longitudinal biennial survey conducted since the 1970s that provides data on the nation's college graduates. I am compiling a list of publications using the Survey data. We plan on adding this list to our website. If you have used the Survey data, please post the citation to your research below, and/or send me a copy of your paper. Thanks!

I have seen several references to "impure heteroscedasticity" online as heteroscedasticity caused by omitted variable bias. However, I once saw an Internet reference, as I recall, which reminds me of a phenomenon where data that should be modeled separately are modeled together, causing an appearance of increased heteroscedasticity. I think there was a youtube video. That seems like another example of "impure" heteroscedasticity to me. Think of a simple linear regression, say with zero intercept, where the slope, b, for one group/subpopulation/population is slightly larger than another, but those two populations are erroneously modeled together, with a compromise b. The increase in variance of y for larger cases of x would be at least partially due to this modeling problem. (I'm not sure that "model specification error" covers this case where one model is used instead of the two - or more - models needed.)

I have not found that reference online again. Has anyone seen it?

I am interested in any reference to heteroscedasticity mimicry. I'd like to include such a reference in the background/introduction to a paper on analysis of heteroscedasticity which, in contrast, is only from the error structure for an appropriate model, with attention to unequal 'size' members of a population. This would then delineate what my paper is about, in contrast to 'heteroscedasticity' caused by other factors.

Thank you.

Dear Fellows

I have got the survey data “World Bank’s Enterprise Survey 2013” in SPSS form. My research objective is to find out the obstacles faced by the firms while doing business in Pakistan. There are 15 obstacles listed below:

1. Electricity to Operations of This Establishment

2. Transport

3. Customs and Trade Regulations

4. Practices of competitors in informal sector

5. Access to Land

6. Crime, Theft and Disorder

7. Access to Finance

8. Tax Rates

9. Tax Administrations

10. Business Licensing and Permits

11. Political Instability

12. Corruption

13. Courts

14. Labor Regulations

15. Inadequately Educated Workforce

They are measured on a 5-point Likert scale (No Obstacle, Minor Obstacle, Moderate Obstacle, Major Obstacle, Very Severe Obstacle).

Sampling Technique: Disproportionate Stratified Random Sampling

Three levels of stratification have been used in the Survey: firm size, business sector, and geographic region within a country. Firm size levels are 5-19 (small), 20-99 (medium), and 100+ employees (large firms). The business sector has been broken down into manufacturing (Food, Textiles, Garments, Chemicals, Non-metallic Minerals, Motor Vehicles, Other Manufacturing) and services (Retail and other services). Five regions were used for geographic stratification.

However, I am not interested in particular strata (groups) within the population. What kind of statistical tools can be applied here?

Thank you

Dear all,

Can anyone suggest some contributions about:

-the historical development of survey methodology in the social sciences

-the possible integrations between surveys and big data?

Thanks for your attention!

Francesco Molteni

Sample size = min 100 respondents; Approach = Email > Web-based questionnaire; Number of UK universities = 133;

For both random sampling and stratified random sampling, there should be a list of all elements. In theory, the list could be assembled, as staff names and emails are readily available online; in practice, however, it seems infeasible because there is no guarantee that I could find the contact details (emails) of all academic staff at each university. Therefore, if I do not have a full list of staff, I do not know the population size, and so I cannot use random sampling. Could you give me advice on which technique I might use and how to apply it? Thank you for your help.

If I take data from a survey for only 10% of cases and randomly generate the remaining 90% from an application (based on the 10%), will this work? I am in the IS discipline.

I think many people do simulate things in other domains too.

I'm looking for good tools to document the development of pain/discomfort over a timespan of approximately 10-15 weeks for a small study with 5-10 patients per group.

So far I have only come across the VAS and SF-12; does anyone have other suggestions for me?

Cheers

You want to do a survey of student opinion in a district regarding outdoor sports. What will be your sampling plan? How will you ensure randomness?

Hi!

I've noticed that most questionnaires use a sample size of 100 to 1000 when testing a new positive psychology intervention. How would you decide on a reasonably representative sample? What method would you use? How would that method compare to experiments that decide the efficacy and safety of new medication?

Put another way: my interest is testing the efficacy of a psychology intervention, not a questionnaire. Let's say I want to design a randomized placebo-controlled experiment; how would I decide on the sample size needed to ascertain whether the intervention brings about a statistically significant result?
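The conventional answer is an a priori power analysis: fix a significance level, a desired power, and the smallest effect size worth detecting, then solve for n. A sketch for a two-arm comparison using the normal approximation (d is Cohen's d, the mean difference in standard-deviation units; dedicated software such as G*Power gives the exact t-based answer, which is slightly larger):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sample comparison of means.

    effect_size: Cohen's d, the smallest difference worth detecting
    alpha:       two-sided significance level
    power:       desired probability of detecting that difference
    Uses the normal-approximation formula n = 2 * ((z_{1-a/2} + z_{power}) / d)^2.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_beta = z(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Medium effect (d = 0.5) at alpha = .05, 80% power:
n_medium = n_per_arm(0.5)
```

This makes the trade-off explicit: halving the detectable effect size quadruples the required n, which is why intervention trials expecting small effects need samples in the hundreds per arm.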

Thank you so much

Ibrahim

Does anybody have experience/guidance on whether to use the standard strongly disagree, disagree, neither agree nor disagree, agree, strongly agree format or an alternative which I find attractive: fully disagree, mainly disagree, neither agree nor disagree, mainly agree, fully agree?

I am pretty sure it will not make a big difference, but my 2nd option is not widely used (at least I know only of one or two examples). The 2nd option appears to be more attractive when people may have complex opinions (along the lines - yes, but), and might be suitable for e.g. a survey amongst university staff members.

My study is to identify:

- The motivations of the local grassroots volunteers by using the construct of Self-Determination Theory

- Explore for any differences in motivation between new and old volunteers

- Explore the impact of grassroots volunteerism on national identity

- Evaluate effectiveness in recruiting new volunteers

**As of now, I am preparing to run the Volunteer Motivation Scale, which has 7-point Likert scale questions for the different types of motivation. However, my question is: how may I go about analysing the results to answer my research objectives? What kind of statistical test should I perform?**

**In addition, should I add in questions to measure how their volunteering work can impact on national identity?**

In addition to the questionnaire, I also intend to perform face-to-face semi-structured interviews to ask them about the Basic Needs that fuel their motivation and their sustained motivation, and also how their work can impact national identity.

Hi there. I am currently doing a BA in TESOL and this is my first research project, so bear with me if I sound a little clueless!

My question is: can I adapt a research instrument (survey) to fit my needs, or will this invalidate it? To clarify; I want to measure to what extent my students' motivations for using the learning management system are internalized and autonomous in nature. I want to use the LLOS-IEA (Noels et al, 2003), however, I will need to change the instrument to be asking questions about the "Flipped Learning" system we use.

I've carried out a questionnaire where participants had to rate, on a Likert scale of 1-5, a list of response strategies with respect to their ability to address a certain issue. In total there are five such major issues, and participants were asked to rank the strategies under each separately. What is the best technique to analyse and summarize these data?

Going to conduct student interviews for qualitative data - so survey identifying science skills.

Science self-efficacy survey before science fair participation and one before competition.

Thank you!

Where might I find a finite population dataset, with one dependent and two independent variables, with population size approximately 15 < N < 100, and all continuous (no count) variables? (A sample from an infinite population, with approximately 15 < n < 100, perhaps preferably a random sample, might possibly do. However it is important that two predictor variables are deemed adequate for the illustration I have in mind.)

I have a method for comparing regression model performances graphically, for cases with all continuous data, for which I want to write a clear explanation, including a simple graphical illustration. I am retired and would just be relying on Excel, and possibly a small population data set to illustrate this. I tried searching the internet and was at first encouraged by the number of multiple regression datasets available, but quickly found that locating such a dataset for clear illustrative purposes, and with my limited programming resources, was elusive.

Any suggestions would be appreciated. - Thank you.

Does the World Bank Enterprise Survey database include the indicator on firm innovation?

In order to carry out a survey/questionnaire on inhabitants' degree of satisfaction with walkability in the different urban fabrics forming the city, I would need a few measurable indicators of walkability (in relation to urban morphology). I am thinking of the concepts of pedestrian **Comfort** and **Protection**. Are there any others? Thank you.

"Survey" is a very broad term, having widely different meanings to a variety of people, and applies well where many may not fully realize, or perhaps even consider, that their scientific data may constitute a survey, so please interpret this question broadly across disciplines.

It is to the rigorous, scientific principles of survey/mathematical statistics that this particular question is addressed, especially in the use of continuous data. Applications include official statistics, such as energy industry data, soil science, forestry, mining, and related uses in agriculture, econometrics, biostatistics, etc.

Good references would include

Cochran, W. G. (1977), Sampling Techniques, 3rd ed., John Wiley & Sons.

Lohr, S. L. (2010), Sampling: Design and Analysis, 2nd ed., Brooks/Cole.

and

Särndal, C.-E., Swensson, B. and Wretman, J. (1992), Model Assisted Survey Sampling, Springer-Verlag.

For any scientific data collection, one should consider the overall impact of all types of errors when determining the best methods for sampling and estimation of aggregate measures and measures of their uncertainty. Some historical considerations are given in Ken Brewer's Waksberg Award article:

Brewer, K. R. W. (2014), “Three controversies in the history of survey sampling,” Survey Methodology (December 2013/January 2014), Vol. 39, No. 2, pp. 249-262. Statistics Canada, Catalogue No. 12-001-X.

In practice, however, it seems that often only certain aspects are emphasized, and others virtually ignored. A common example is that variance may be considered, but bias not-so-much. Or even more common, sampling error may be measured with great attention, but nonsampling error such as measurement and frame errors may be given short shrift. Measures for one concept of accuracy may incidentally capture partial information on another, but a balanced thought of the cumulative impact of all areas on the uncertainty of any aggregate measure, such as a total, and overall measure of that uncertainty, may not get enough attention/thought.

What are your thoughts on promoting a balanced attention to TSE?

Thank you.

We are planning a study to compare a set of specific daily motivations between job families.

We will be developing a survey specifically for the constructs we are interested in using combination of grounded analysis and PCA in a larger sample, but we would like to compare our results to existing instruments and/or to be able to base our constructs on elements that have already appeared in the context of work motivation.

I have used various intrinsic motivation-based surveys in the past, so I would especially like something on specific intrinsic / extrinsic motivators at work - that is, not just whether the job as a whole is motivating.

I have a background in mixed methods and experimental research, but this is my first real foray in to work/organizational psychology so a bit of help to get started would go a long way!

Which dataset should I use for the analysis?

Which technique should I apply to the dataset?

Can I use R or another analytical tool?

I want to identify the relationship between a DV (on a 1-5 Likert scale) and 3 to 5 IVs, which are on Likert scales too. I have computed the mean of the items of each scale. This is for my thesis, and from the bibliography I have found that there is a relationship between the constructs. I have made the attached model and I want to confirm it. I must say that my data have a non-normal distribution.
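Given Likert composites and non-normal data, a common choice is Spearman's rank correlation between the DV and each IV (with ordinal or robust regression for the full model). Spearman's rho is just Pearson's r computed on ranks; a self-contained sketch:

```python
def ranks(xs):
    """Average ranks, 1-based, with ties receiving their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over the tied block
        avg = (i + j) / 2 + 1           # mean rank of positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

In practice `scipy.stats.spearmanr` (Python) or `cor(..., method = "spearman")` (R) does the same computation and also returns a p-value.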

I am conducting a project in three different centers. I sent a questionnaire to 1,000 respondents in each center and got responses as follows:

1) fewer than 200 in the first two centers, and

2) more than 750 in the third center.

The response from the last center is extremely large compared to the first two.

How should I compare the data from these three centers? Kindly guide me; I want to apply a t-test and ANOVA.
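Unequal group sizes do not rule out ANOVA: the one-way F statistic is defined for unbalanced groups (though with very unequal n it is worth checking homogeneity of variances, and using Welch's ANOVA if it fails). A minimal sketch of the computation:

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples.

    groups: list of lists of observations, one list per center.
    Unequal group sizes are fine -- the sums of squares weight each
    group mean by its own n.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_b, df_w = k - 1, n - k
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w
```

The larger worry here is not the unequal n but the very different response rates (under 20% vs. 75%), which may signal nonresponse bias between centers; comparing respondent demographics across centers before testing means is advisable.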

thank you so much

I am a PhD researcher and have utilised an exploratory sequential instrument development design. I conducted semi-structured interviews and have consequently developed a survey from the findings. The survey is descriptive and cross-sectional and is simply a follow-up quantitative phase to the main qualitative phase of the research.

Due to issues with potential response rates (it is challenging to engage the study population), I envisage piloting to be difficult, as I am asking pilot testers additional questions beyond merely completing the survey. Is it appropriate for me to go back to the original interview participants and pilot test the survey with them, in addition to identifying new pilot testers? Or would that not be methodologically approved of, given their participation in the original interviews and their existing awareness of the topic area?

Any guidance would be appreciated,

Thanks

I have purchased the PASE scoring manual and have been trying to use the Syntax in SPSS. However, the Syntax won’t calculate the PASE score when there are missing data in the questionnaire. This occurs specifically in questions 2,3,4,5,6, and 10 where the questions include multiple parts. Has anyone had a similar experience? Do I need to use a different software?

Also, there is no clear instruction in the manual on how to deal with missing data when scoring the questionnaire manually.

Does anyone know how to deal with missing data when scoring the questionnaire?

Thanks.

I am doing research on FDI in retail.

To examine "consumers' perception towards FDI in retail," I collected data from two cities, Meerut and Agra. What type of test should I apply to analyse the data? The questionnaire is attached.

Dear all, in order to separate the predictor from the criterion, I plan to collect data from two different groups of raters from the same company. Group A, 600 employees, will fill in questionnaire 1, and Group B, also 600 employees, will fill in questionnaire 2.

1. How do I select these groups with minimum differences which might impact the cause-effect relationship?

2. How do I measure or control for these differences?

Thank you very much in advance

Dear all, I want to check the correlation between two variables, knowledge and attitude. My knowledge variable contains 9 questions and my attitude variable contains 13 questions. Is it possible? If yes, then how?

Both variables have all responses measured on a Likert scale.
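Yes: a standard approach is to form one composite score per respondent for each variable (e.g., the mean of the 9 knowledge items and of the 13 attitude items) and then correlate the two composites. With ordinal Likert data, Spearman's correlation is often preferred; the sketch below uses Pearson for simplicity, on hypothetical data:

```python
def composite(rows):
    """Mean of each respondent's item responses: one composite score per person."""
    return [sum(r) / len(r) for r in rows]

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical: 3 respondents, 9 knowledge items and 13 attitude items each
knowledge = composite([[4] * 9, [3] * 9, [5] * 9])
attitude = composite([[4] * 13, [2] * 13, [5] * 13])
r = pearson(knowledge, attitude)
```

Before correlating, it is worth checking the internal consistency of each composite (e.g., Cronbach's alpha) so that the item sets can justifiably be averaged into single scores.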

I am interested in whether using AI biases the direction of an evaluation survey and, if so, whether this is a limitation of the methodology in terms of rigour.

I have interviewed 40 women regarding their experiences of birth injury, drawn from a quantitative database of medically diagnosed damage, and analyzed their responses. I need to check that my framework and analysis are valid.