Regression Analysis - Science method
Procedures for finding the mathematical function which best describes the relationship between a dependent variable and one or more independent variables. In linear regression (see LINEAR MODELS) the relationship is constrained to be a straight line and LEAST-SQUARES ANALYSIS is used to determine the best fit. In logistic regression (see LOGISTIC MODELS) the dependent variable is qualitative rather than continuously variable and LIKELIHOOD FUNCTIONS are used to find the best relationship. In multiple regression, the dependent variable is considered to depend on more than a single independent variable.
Questions related to Regression Analysis
I need to test the relationship between two different variables. One of the variables is calculated as 1-5 points, the other as 1-7 points. Does having different scale scores cause an error in the correlation or regression analysis results? Can you recommend a publication on the subject?
A dummy variable is a variable that takes on specific numeric values to represent the different attributes of a categorical variable; the resulting set of dummies should be of full rank.
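For readers who want to see this concretely, here is a minimal base-R sketch (with made-up data) of how a categorical variable is expanded into dummies and why one level is dropped so the design matrix stays full rank:

```r
# A categorical variable with three attributes, encoded as dummy variables.
# One level is dropped (the reference), otherwise the dummies plus the
# intercept would be perfectly collinear and the design would not be full rank.
region <- factor(c("north", "south", "east", "south", "north"))
model.matrix(~ region)   # intercept + 2 dummies; "east" is the reference level
```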
Since I found that there is a correlation between Timeliness and Semantic Accuracy (I am studying the assessment of linked-data quality dimensions, trying to evaluate one quality dimension, in this case Timeliness, from another, Semantic Accuracy), I presumed that regression analysis is the next step in this matter.
- The Semantic Accuracy formula I used is: msemTriple = |G ∧ S| / |G|
msemTriple measures the extent to which the triples in the repository G (the original LOD dataset) and in the gold standard S have the same values.
- The Timeliness formula I used is:
Timeliness(de) = 1 - max{1 - Currency(de)/Volatility(de), 0}
where:
Currency(de) = (1 - (lastmodificationTime(de) - lastmodificationTime(pe)) / (currentTime - startTime)) * Ratio
(the Ratio measures the extent to which the triples in the LOD dataset (in my case Wikidata) and in the gold standard (Wikipedia) have the same values)
and
Volatility(de) = (ExpiryTime(de) - InputTime(de)) / (ExpiryTime(pe) - InputTime(pe))
(de is the entity document of the datum in the linked-data dataset and pe is the corresponding entity document in the gold standard).
NB: I worked on Covid-19 statistics per country as a sample dataset, specifically the number of cases, recoveries, and deaths.
this is my spss file: https://drive.google.com/file/d/1DqMqVv4JHPbo3-pAXmavuC91pMlImFlu/view?usp=drive_link
this is the output of my spss file: https://drive.google.com/file/d/1JxVf542Kq9KfxeWIqmm1deLfJv67HOUh/view?usp=drive_link
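As an aside for readers, here is a small R sketch of the Timeliness computation exactly as the formulas in the question are written (function and argument names are my own, and all times are assumed to be plain numeric timestamps):

```r
# Sketch following the question's formulas; inputs are numeric timestamps.
currency <- function(lastmod_de, lastmod_pe, current_time, start_time, ratio) {
  (1 - (lastmod_de - lastmod_pe) / (current_time - start_time)) * ratio
}

volatility <- function(expiry_de, input_de, expiry_pe, input_pe) {
  (expiry_de - input_de) / (expiry_pe - input_pe)
}

timeliness <- function(curr, vol) {
  1 - max(1 - curr / vol, 0)   # Timeliness(de) = 1 - max{1 - Currency/Volatility, 0}
}

# Toy example with made-up timestamps and a ratio of 0.8
timeliness(currency(100, 90, 200, 0, 0.8), volatility(300, 100, 320, 90))
```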
I have performed hypothesis testing using a simple regression analysis model. What action must I take after testing the hypothesis?
I have a dataset with a spatial map of census tracts and the nearest distances to nearby hospitals and cities. I need advice on how to process this data for regression analysis and generate maps in ArcGIS (to see the correlation). If you have a guide, that would be great. Thanks
After collecting my data, I decided to test my hypothesis using regression analysis. I also recognize that my data must meet the assumptions of the tool before I can use it. Therefore I would like to know: after testing two assumptions and finding that the data met them, can I just run the analysis, or must I test all the assumptions first?
You can read my analysis about that
I am using SPSS version 28 and I want to know how to run the regression analysis. I have one dependent variable (BMC) and one independent variable (MVPA). I have two other variables, age and height, and I don't know how to adjust for them in SPSS. Are these independent variables too? There is no covariate box in the SPSS version I am using.
Hello,
I'm working on a panel multiple regression, using R.
I want to deal with outliers. Is there a predefined function to do that?
If so, could you please give me an example of how to use it?
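To my knowledge there is no single built-in "remove outliers" function for panel regressions; a common manual route is to fit the model, flag observations with large residuals, and check how sensitive the estimates are, or to winsorize extreme predictor values before fitting. A hedged sketch using the plm package (data frame and column names are hypothetical):

```r
library(plm)

# Hypothetical panel data frame `df` with columns id, year, y, x1, x2, x3
fit <- plm(y ~ x1 + x2 + x3, data = df,
           index = c("id", "year"), model = "within")

# Flag observations whose residuals are more than 3 SDs from zero,
# then refit without them (or with them winsorized) and compare coefficients.
res     <- residuals(fit)
extreme <- abs(res) > 3 * sd(res)
sum(extreme)

# Alternative: winsorize a predictor at its 1st/99th percentiles before fitting
q <- quantile(df$x1, c(0.01, 0.99), na.rm = TRUE)
df$x1_w <- pmin(pmax(df$x1, q[1]), q[2])
```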
Of late, some journal editors have insisted that authors justify the order in which predictor variables are entered into hierarchical regression models.
a) Is there a particular way of ordering such variables in regression models?
b) Which are the statistical principles to guide the ordering of predictor variables in regression models?
c) Could someone suggest the literature or other bases for decisions regarding the ordering of predictor variables?
In a psychology study of N = 149, I was testing for moderation using a three-step hierarchical regression analysis using SPSS. I had two independent variables, X1 and X2, an outcome variable, Y, and the moderator, M. Step 1 uses the variables X1, X2, the interaction X1X2, and 5 covariates. Step 2 adds M. Step 3 adds the interaction variables X1M and X2M.
In my collinearity statistics, VIF is all under 10 for Steps 1 & 2 (VIF of 6 is found between X2 and X1X2 in both steps). For Step 3, VIF is high for X1, X2, M, X1M, and X2M. When I go look at the collinearity diagnostics box, the variance proportions are high for the constant, X1, M, and X1M. I'm understanding that there is multicollinearity.
My question is, what does it mean when the constant shows a high VIF? What would it mean if only one predictor variable and the constant coefficient were collinear?
I am currently trying to perform a mediation analysis with a long panel dataset, including control variables, in Stata.
In trying to do this, I found solutions for running a moderated mediation regression with and without control variables, and I also found ways to run a regression with panel data, but I did not find a way to combine the two.
Is there a way to include mediator variables in my panel regression?
Does anyone have information, links, advice on how to approach this challenge?
Greetings of peace!
My study is about the effect of servicescape on quality perception and behavioral intentions.
Independent variable: under servicescape there are 4 indicators:
Layout Accessibility - 10 items
Ambience condition - 3 items
Facility Aesthetics - 6 items
Facility cleanliness - 4 items
Quality perception serves as the mediator, with 3 items.
Dependent variable: behavioral intentions - 4 items
All were measured using Likert Scale (N = 400)
I tried ordinal regression analysis, but I don't know how to combine the items, and the independent variables are ordinal. Also, the Pearson value is <0.001 and the Deviance is 1.000.
I need to get the effect of individual indicators in servicescape on the quality perception and behavioral intentions.
Thank you in advance
The variable physical environment effect is only a subset of the independent variable (environmental factors) in my research; there are social and cultural environment effects as well. Each is measured in my questionnaire with five questions, and the responses are: never, rarely, often, and always. The dependent variable, student performance, was measured in the same format as the environmental factors (i.e., with five questions and never, rarely, ... as the responses). I have coded them into SPSS with the measure set to Ordinal. I want to answer the research questions: 1. How does the physical environment affect student performance? 2. How does the social environment affect student performance? 3. To what extent does the cultural environment influence student performance? I have computed the composite score (mean) for the questions; can I use these scores in the ordinal regression analysis? Or is there another way to compute the questions into a single variable, for both the independent and dependent variables?
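One possible workflow, sketched in R with hypothetical item names: average each block of items into a composite, keep the outcome ordinal by cutting its composite into ordered bands, and fit a proportional-odds model with MASS::polr (treat this as one option, not the only one):

```r
library(MASS)

# Hypothetical data frame `df` with items phys1..phys5, soc1..soc5,
# cult1..cult5 and perf1..perf5, each coded 1 = never ... 4 = always
df$physical <- rowMeans(df[, paste0("phys", 1:5)], na.rm = TRUE)
df$social   <- rowMeans(df[, paste0("soc",  1:5)], na.rm = TRUE)
df$cultural <- rowMeans(df[, paste0("cult", 1:5)], na.rm = TRUE)
df$perform  <- rowMeans(df[, paste0("perf", 1:5)], na.rm = TRUE)

# Keep the dependent variable ordinal by cutting its composite into ordered bands
df$perform_ord <- cut(df$perform, breaks = c(0, 2, 3, 4),
                      labels = c("low", "medium", "high"), ordered_result = TRUE)

fit <- polr(perform_ord ~ physical + social + cultural, data = df, Hess = TRUE)
summary(fit)
```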
Hi,
I want to predict the traffic vehicle count at different junctions in a city. Right now, I am modelling this as a regression problem, so I am scaling the traffic volume (i.e., the count of vehicles) between 0 and 1 and using these scaled attributes for the regression analysis.
As part of the regression analysis, I am using an LSTM with mean squared error (MSE) as the loss function. I convert the predicted and actual outputs back to the original scale (using `inverse_transform`) and then calculate the RMSE value.
But as a result of the regression, I get the output variable in decimals (for example 520.4789), whereas the actual count is an integer (for example 510).
Is there any way to predict the output as an integer?
(i.e., my model should predict 520, and I do not want to round off to the nearest integer)
If so, what loss function should I use?
If the correlation is insignificant or negligible, should we run a regression analysis or not? Obviously the regression will be insignificant too; is it necessary to mention it in the article?
Are correlation and regression analysis part of descriptive or inferential statistics?
Hello researchers,
I am facing a problem doing a regression analysis with three independent variables, one mediating variable, and one dependent variable. How can I do this in SPSS? Can anyone please help me?
When I do a regression analysis, the coefficients table in SPSS shows that my 3 main effects are significant. When I do a regression analysis for my 6 moderating effects, for which I created interaction terms, the coefficients table also shows these are significant. But when I run the regression with the 3 main effects and the 6 moderating effects at the same time, none is significant. How should I interpret this? And how should I continue?
While examining some students at their final-year project defence, I discovered that a student reported an adjusted R² greater than 99% in the regression analysis of her work. Is that possible?
This question is concerned with understanding the degree and direction of association between two variables, and is often addressed using correlation or regression analysis.
This question is concerned with determining whether two or more groups differ in some meaningful way on a particular variable or set of variables, and is often addressed using statistical tests such as t-tests, ANOVA, or regression analysis.
Can you give all the criteria to evaluate the forecasting performance of the regression estimators?
I am performing a cross-country regression analysis with a sample of 101 countries. Most of my variables are averages of annual data across a period of 7 years. Every one of my primary variables has data available in each of these 7 years. However, certain countries have data missing in certain years for variables used in my robustness checks.
How should I handle this missing data for each robustness variable? Here are a few ideas I have considered
A. Average data for each country, regardless of missing years
B. Exclude any country with any missing years from data for that respective variable
C. Exclude countries that are missing data up to a certain benchmark, perhaps removing countries that are missing more than 2 or 3 of the 7 years that are being averaged for that respective regressor
D. Only use robustness variables that have available data for every country in every year that is being averaged
Please offer the best solution and any other solutions that would be acceptable.
Dear fellows,
Maybe you have done interesting measurements to test some model?
I can always use such data as examples and tests for my regression analysis software, and it's a win-win, since I might give you a second opinion on your research.
It's important that I also get the imprecision (measurement error/confidence interval) on the independent and dependent variables. At this moment, my software only handles one of each, but I'm planning to expand it for more independent variables.
Thanks in advance!
Assumptions of multinomial and linear regression analysis?
In finding the correlation and regression of a multivariable distribution, what is the significance of R and R²? What is the main relationship between them?
Hello,
I am doing a multiple regression with 2 predictors. The predictors correlate moderately/strongly, r = 0.45. When the first predictor is put into a regression analysis on its own, it explains 16.8% of the variance of the dependent variable. The second predictor on its own explains 17.5% of the variance of the dependent variable. When both predictors are put into the regression analysis, the VIF = 1.26, so multicollinearity should not be a problem. The predictors together explain 23.4% of the variance of the dependent variable.
First of all, I would like to ask whether the change in explained variance from 16.8-17.5% to 23.4% is a big change; more specifically, whether the predictors together are better at predicting the dependent variable than either one alone. Also, as the predictors correlate but the VIF is okay, is it safe to say that they probably explain some of the same variance in the dependent variable, i.e., that each predictor explains little unique variance?
I would like to create a factor of the interaction effect of two variables for regression analysis.
I was wondering how to create this factor.
I was thinking of multiplying the scores of the two variables, but I would like to hear from other researchers. Thank you.
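For what it is worth, multiplying the two (preferably mean-centred) scores is the usual way to build such an interaction term, and in R the `*` operator in a model formula does this while keeping both main effects in the model. A minimal sketch with hypothetical variable names:

```r
# Hypothetical data frame `df` with predictors x1, x2 and outcome y
df$x1c <- df$x1 - mean(df$x1, na.rm = TRUE)   # mean-centre to ease interpretation
df$x2c <- df$x2 - mean(df$x2, na.rm = TRUE)

fit <- lm(y ~ x1c * x2c, data = df)           # expands to x1c + x2c + x1c:x2c
summary(fit)
```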
Suppose I want to predict energy consumption for my building using regression analysis. What factors should I consider including in my model, and how can I determine their relative importance?
I am using a fixed-effects panel dataset with 100 observations (20 groups), one dependent and three independent variables. I would like to get a regression output from it. My question is: is it necessary to run any normality test and linearity test for panel data? And what difference would it make if I don't run these tests?
Situation: the moderating variable can explain up to 25 percent, while the remaining 75 percent is explained by other factors outside the model. What does this mean? Or would it mean that the moderating variable did not significantly moderate the relationship between the IV and the DV? Thank you to anyone who responds!
Propensity score matching (PSM) and endogenous switching regression (ESR) by full information maximum likelihood (FIML) are among the most commonly applied models in impact evaluation when there are no baseline data. Sometimes the results from these two methods differ. In such cases, which one should be trusted more, given that both models have their own drawbacks?
I would like to know if I am wrong in doing this: I made quartiles out of my independent variable and from those I made dummy variables. When I do linear regression, I have to report the betas with 95% CIs per quartile per model (I adjust model 1 for age and sex). Can I enter all the dummies into the model at the same time, or do I have to enter them separately (while also adjusting for age and sex, for example)?
So far I have entered all the dummies and adjusted for age and sex at the same time, but now I wonder whether SPSS fails to adjust for the second and third dummy variables. So I think I need to redo my calculations and run my models with only one dummy in each.
Thank you.
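Entering all the dummies together in one model (with the reference quartile omitted), alongside age and sex, is the standard setup; each dummy's coefficient is then adjusted for the others. A hedged R equivalent of that single model (variable names are made up):

```r
# Hypothetical data frame `df` with exposure x, outcome y, and covariates age, sex
df$quartile <- cut(df$x,
                   breaks = quantile(df$x, probs = seq(0, 1, 0.25), na.rm = TRUE),
                   include.lowest = TRUE, labels = c("Q1", "Q2", "Q3", "Q4"))

# One model: Q1 is the reference; the Q2-Q4 dummies are entered simultaneously
fit <- lm(y ~ quartile + age + sex, data = df)
cbind(beta = coef(fit), confint(fit))   # betas with 95% CIs per quartile
```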
Hello everyone,
I am currently working on my thesis but I have encountered a problem and I am not sure how to solve it. I would like to measure the impact that ESG (Environmental, Social, Governance) has on financial performance (ROA, ROE) from 2016 to 2021. Some important details about my study:
- I would like to compare two samples of companies: a first group with ESG that is part of the DJ Sustainability Index (DJSI) and another group without ESG (not part of the DJSI).
- I intend to analyze companies that have been part of the DJSI between 2016 and 2021. However, some companies don't have an ESG score (independent variable) for some years. Should I still collect information for my dependent variables for all the years? For example, company X has ESG scores for 2016 and 2017 only, would I need data for ROA and ROE for all the years or just for 2016 and 2017?
- Any other aspects I should consider?
Thanks!!
Hi,
I have a dependent variable that represents ridership on each street. The numbers are bucketed to the nearest 50, so the values are 50, 100, 150, and so forth.
My independent variable is also discrete: 1, 2, 3, 4, etc., representing street continuity.
Would it be appropriate to run a linear regression analysis to see if there is a correlation between these two variables?
Note that I will run the analysis on multiple cities.
I'm working on the below topic for my master thesis.
“Investigating the stages in a customer’s buying journey and determining the factors influencing the decision to switch between a retailer’s online sales channels – marketplace and own website.”
Considering this, my plan was to apply a logit regression analysis with the customer's channel choice (in this case "Marketplace" vs. "retailer's own website") as the dependent variable and the interaction between the independent variables "Age" and subjective norms (recommendations from peers, product reviews) for the three stages.
I'm struggling to ascertain whether the customer's channel choice of either marketplace or own website can be considered as the dependent variable. I have not used a Likert scale for this, as it was a scenario-based survey; the respondents have chosen the channel they would use at every stage.
Could you please advise if using this choice as a dependent variable makes sense? And, if using Logit regression is the right way to go?
Also, how to calculate/analyze relative importance of the predictor variables (independent variables) in Logit Regression analysis?
Is there any explanation for strong, adequate or low value of it? Thank you
I'm doing a regression analysis of the effect of housing type on resident depression. When I included all samples in a single model, housing type had a significant effect on depression (p=0.000). But when I divided the sample into males and females, and performed regression analysis on the two separately, the analysis results of both males and females showed that housing type had no significant effect on depression (p=0.1-0.2). I wonder how to explain this result
I have 667 participants in my sample, and the outcome is continuous. I tested the normality, and the data on the histogram are bell-shaped, but the test results show that they are not normally distributed.
1- What is the cause of the discrepancy between the chart and the test results?
2- Can I still perform linear regression analysis on this data?

I am testing hypotheses about the relationships between CEA and Innovation Performance (IP). If I am testing the relationship of one construct, say management support, to IP, is it OK to use simple linear regression? Or should I test it in a multiple regression with all the constructs?
We want to analyze the relationship and impact between two variables in the education sector. The first variable is the independent variable (intellectual capital), measured on a sample of workers and leaders of size 150; the second is the dependent variable (quality of service provided), measured on a sample of students and parents of size 330.
Regression analysis is used for models that have covariables.
Adjusted R² = 5.99%
F value = 9.61
p-value = 0.00
Oral Com = 21.36 - 1.194 × (Dissatisfaction with one's linguistic skills)
My questionnaire consists of 20 questions, and five of them relate to the dependent variable. The problem is that those questions are not on a Likert scale; they use different scales with fixed answers, plus one multiple-choice question.
For example, the 5th question has options 1-4,
the 6th question is dichotomous (1-2),
the 7th question is a multiple-choice question with 7 options,
and the 8th question has 5 options from which to pick one.
Can I build a composite index for the dependent variable by standardizing these variables using z-scores?
Can I use the standardized variables to perform a correlation and regression analysis?
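A brief sketch of the z-score idea in R (question names are hypothetical); note that standardizing is only meaningful for items with an ordered numeric coding, so a purely nominal multiple-choice item would first need to be recoded into a numeric score or into dummies:

```r
# Hypothetical data frame `df`; q5, q6, q8 are ordered numeric items,
# q7_score is the multiple-choice item after recoding it to a numeric score
items <- c("q5", "q6", "q7_score", "q8")
z     <- scale(df[, items])                 # z-standardize each item
df$dep_index <- rowMeans(z, na.rm = TRUE)   # composite index for the DV

cor(df$dep_index, df$iv_score, use = "complete.obs")   # iv_score is hypothetical
summary(lm(dep_index ~ iv_score, data = df))
```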
Dear Altruist,
I want to do a regression analysis based on accident data, but I am not finding the necessary information about which type of regression is most applicable for analyzing accident data.
Suggestions will be highly appreciated.
Thanks in advance.
My research involves finding correlations between variables, comparing groups, and running a regression analysis to examine whether one variable predicts another. The research is certainly descriptive, but can I state that it is both correlational and causal-comparative?
I have done a Spearman correlation analysis and all of my independent variables correlated with the dependent variable. However, when I did a multiple regression analysis, the results show the IVs are not significant. Is this possible?
This is actually a simple linear regression analysis question:
- For diabetics with an initial weight the same as yours, calculate the 95% confidence interval on their predicted mean weight loss at one year after DBI therapy.
- For an individual diabetic with an initial weight the same as yours, calculate the 95% confidence interval on his predicted weight loss at one year after DBI therapy
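In R terms, the difference between the two parts is exactly `interval = "confidence"` (the mean response) versus `interval = "prediction"` (a single individual). A sketch with hypothetical data and variable names:

```r
# Hypothetical data frame `dbi` with initial_weight and weight_loss at one year
fit <- lm(weight_loss ~ initial_weight, data = dbi)
new <- data.frame(initial_weight = 80)   # substitute your own initial weight

predict(fit, new, interval = "confidence", level = 0.95)  # 95% CI for the mean loss
predict(fit, new, interval = "prediction", level = 0.95)  # 95% PI for one individual
```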
Hi all,
I am interested in whether the strength of correlation estimates between personality & depression scores is equally strong on two measurement occasions - when adolescents are 13yrs old vs. 15yrs old. This means I am comparing the coefficients in the same sample and the same set of variables, just measured at a different time point.
So far I have only been able to figure out how to compare coefficients from independent observations (e.g., using Fisher's r-to-z transformation).
Can I simply perform regression analysis and use "time" (categorical) as a moderator or this is only ok with independent observations?
Thanks a lot for your help!
Kind regards
Michaela
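One possible route, if I recall the interface correctly, is the cocor package, which implements tests for comparing dependent (same-sample) correlations; the formula syntax and variable names below are assumptions, so please verify against ?cocor:

```r
# install.packages("cocor")   # assuming the cocor package's formula interface
library(cocor)

# Hypothetical wide data frame `df` with pers13, dep13, pers15, dep15:
# compare r(pers13, dep13) with r(pers15, dep15), measured in the same sample
cocor(~ pers13 + dep13 | pers15 + dep15, data = df)
```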
Kindly, let me know which regression model I can use specifically for hypothesis testing.
I want to perform multi-input multi-output regression analysis. Can you suggest some tutorials?
Help will be appreciated 🙏
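For linear models, base R already handles multiple outputs: passing a matrix of responses to lm() via cbind() fits one equation per output. A minimal sketch with hypothetical column names:

```r
# Hypothetical data frame `df` with outputs y1, y2 and inputs x1, x2, x3
fit <- lm(cbind(y1, y2) ~ x1 + x2 + x3, data = df)
summary(fit)   # one coefficient table per response
coef(fit)      # columns = responses, rows = predictors
```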
Hello Everyone,
I am using SPSS with the PROCESS plugin by Hayes. I performed an OLS regression analysis (in SPSS) with dependent variable X and 3 dummy variables (I have 4 groups, one of which is the reference group). This gave me coefficients for the dummies that made sense. But when I continued on to a mediation analysis (using PROCESS), with the same X and the same 3 dummies (same reference group), the total effect model returned different values for the coefficients (quite a bit larger). The same happened for another variable when PROCESS calculated the effect of X on M (the mediator), i.e., it returned different results from my "normal" regression analysis without PROCESS. From what I understand of Hayes's 2018 book, the equations used to calculate these coefficients in mediation should be the same as those in my earlier analysis, but they differ. Any ideas why?
Thank you,
Stefan
Greetings Fellow Researchers,
I am a newbie in using survival analysis (Cox regression). In my dataset, 10-40% of cases have missing values (depending on the variable I include in the analysis). Based on this, I have two questions:
1 - Are there any recommendations on the acceptable percentage of cases dropped (due to missing values) from the analysis?
2 - Should I impute the missing values for all the cases that would otherwise be dropped (let's say a maximum of 40%)?
Thank you so much for your time and kind consideration.
Best,
Sarang
I'm planning a study with a twice repeated measurement but I'm primarily interested in the correlation between the two measurements. As there are a number of factors potentially influencing the agreement or otherwise between these two measurements, can Cohen's k or Pearson's r be used as the DV in a multiple regression analysis ? If so, are there any conditions or parameters that need to be taken into account ?
I have a question for the members, on which I request your comments and recommendations:
In my research, I use a national-level entrance test score, which has high significance for students in their pursuit of higher education. The test is conducted all over the country every year at the same time, but has province-wise differences in terms of test content and total marks. Moreover, the cut-offs vary across provinces and from year to year. There are three cut-offs for the scores, called cut-off 1, 2 & 3. The total scores in many provinces are the same, but for others they may vary based on total marks and subject combinations. However, they all serve a common purpose under the common guidelines of the government.
The structure and procedure of conducting the exam are the same nationally, and the method of assessment is also similar, even though the questions asked in the exam differ (a somewhat complex scenario).
My requirement is to derive a common Gaokao score (I am not sure it is possible) that can be used as a dependent variable in a regression. Considering the province-wise differences and year-over-year variations in cut-offs, I would like to derive a common score that nullifies these influences.
(1) For this purpose, my first task is to make the total scores for all the provinces the same (i.e., 750, because the majority of provinces have this total). Here the first challenge is to increase or decrease the scale width for provinces whose total is not 750. For example, if province X has a total score of 850, it should be rescaled to 750. How can I do that? Is simply using the equation (score/850)*750 sufficient, or will it introduce an error? (I have seen some other equations for this conversion, but I am not sure whether they are reliable.)
Does this rescaling have any impact on using the scores in the analysis?
(2) The second step is to "centre" the scores (after equating the scale size) on the tier-3 cut-off value (using the converted score for those that were rescaled). Is that meaningful?
With these two procedures, can I develop a new set of scores that can be used as the DV in a regression analysis? Does this procedure eliminate the province-wise and cut-off variations in the test scores?
OR
Do you have any other standard procedure to suggest?
Thank you very much to all of you for your valuable suggestions and comments!
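A sketch of the two steps described above in R (column names are hypothetical; the linear rescaling implicitly assumes the provincial tests are comparable in difficulty, which is the substantive assumption to defend):

```r
# Hypothetical data frame `df` with raw_score, province_total (e.g. 750 or 850)
# and cutoff3 (the tier-3 cut-off on the province's own scale)
df$score_750   <- df$raw_score / df$province_total * 750   # step 1: rescale to 750
df$cutoff3_750 <- df$cutoff3   / df$province_total * 750   # rescale the cut-off too
df$score_ctr   <- df$score_750 - df$cutoff3_750            # step 2: centre on the cut-off
```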
To define quantitative analysis as such in a mixed methods approach, is it necessary to include a regression analysis?
Following a principal component analysis, 3 factors were identified. The composite scores for these 3 factors were calculated for the T1 data, giving 3 new variables. The same will be done for the T2 data. How do I then use regression analysis to measure change and the impact of the intervention?
Dear RG members.
I conducted a Cox regression analysis and unfortunately the HRs of some of the variables turned out extremely large, like 1.17e+09 and 1.31e+10, and extremely small for some others (1.87e-21).
FYI: 24 variables were included in the regression, and before running the regression I checked for interactions; around three variables were excluded because of this. Why am I encountering this problem, and is there any solution, please?
Thank you in advance.
Hi
I wonder what the differences (pros and cons) are between a multivariable regression analysis in which several observed variables predict a dependent variable, and a piecewise SEM in which one latent factor is constructed from these observed variables and that factor is used in the regression analysis.
I used a 5-point Likert scale (1 strongly disagree to 5 strongly agree) for measuring satisfaction. I want to examine the relationship between 5 independent variables and DV satisfaction and see which one is the best predictor. I have a large sample of 533 participants. The problem is that the assumption of normality is violated and I would like to know if there is a technique that will allow me to proceed with the intended inferential analysis of regression.
Thank you !
The application of multilevel regression models has become common practice in the field of social sciences. Multilevel regression models take into account that observations on individual respondents are nested within higher-level groups such as schools, classrooms, states, and countries.
In the application of multilevel models in country-comparative studies, however, it has long been overlooked that on the country-level only a limited number of observations are available. As a result, measurements on single countries can easily influence the regression outcomes.
Diagnostic tools for detecting influential data in multilevel regression are becoming available, but what are your experiences with influential cases in country-comparative (multilevel) studies? How do you deal with influential cases if you encounter them?
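One toolset I have seen used for this in R is the influence.ME package, which refits an lme4 model leaving out one higher-level unit (here, a country) at a time; the sketch below assumes that interface, so check the package documentation:

```r
library(lme4)
library(influence.ME)   # assuming influence.ME's interface for merMod models

# Hypothetical two-level model: respondents nested in countries
m   <- lmer(outcome ~ x1 + x2 + (1 | country), data = df)
inf <- influence(m, group = "country")   # refit, deleting one country at a time

cooks.distance(inf)   # overall influence of each country
dfbetas(inf)          # country-wise change in each fixed-effect estimate
```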
How to implement the Passing Bablok regression analysis in R language?
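One option, assuming the mcr package's mcreg() interface (please verify against ?mcreg), is roughly:

```r
# install.packages("mcr")
library(mcr)

# Hypothetical method-comparison data: method_A and method_B measured on the same samples
fit_pb <- mcreg(df$method_A, df$method_B, method.reg = "PaBa")  # Passing-Bablok
getCoefficients(fit_pb)   # slope and intercept with confidence intervals
plot(fit_pb)              # scatter plot with the fitted line
```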
I wanted to study whether study habits predict academic performance, for which I did a regression analysis and found an R-square value of .022, which is a very low predictive value. I would like to know how to present this in my research. Any suggestions are highly appreciated.
How to implement the Deming regression analysis in R language?
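As with the Passing-Bablok sketch above, the mcr package should also cover Deming regression through the same mcreg() call; treat the exact arguments as an assumption to verify:

```r
library(mcr)

# Hypothetical method-comparison data, as before
fit_dem <- mcreg(df$method_A, df$method_B, method.reg = "Deming")
getCoefficients(fit_dem)
```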
Hi everyone,
I have performed a Spearman's Rho with several variables: 1 continuous dependent variable and 5 continuous independent variables. I did this as normality was violated so I couldn't do a Pearson's Correlation. From the Spearman's Rho, I have ordered the independent variables from the strongest correlation to the weakest. I am planning to run a regression where I enter the independent variables in order (from the strongest correlation to the weakest) but I cannot figure out which regression analysis I should run. Someone suggested a Stepwise regression but I am not sure if this is the correct analysis. Do you think I should just run a multiple regression (where I cannot choose the order of variables to be entered) or some other regression?
Thank you in advance!
I have 186 respondents who participated in my study. My questionnaire uses a double-bounded dichotomous choice format and asked about their willingness to pay for the conservation programme.
To examine the determinants of willingness to pay, I need to run a regression analysis, but I am still not clear whether to use logistic or probit regression.
Hello
I have seen several studies with only one binary independent variable where both crude and adjusted odds ratios were reported. I am having difficulty knowing which variable(s) were adjusted for.
I have attached one such example from a study I wish to understand. How do I know the variables that were adjusted for?
How do I also determine the variables to adjust for in my research?
In my SEM analysis, all the paths from constructs to the outcome construct were shown to be insignificant, although the model fit indices were all acceptable. My particular focus is on whether an A variable is directly related to the B variable or the A variable is fully mediated by C.
Considering that this result may reflect a Type II error due to multicollinearity among the latent constructs, I tried a regression analysis to test whether there is a significant direct effect of the A variable on the outcome variable B. In this regression analysis, the measured variables for A were used. My question is whether this process, i.e., using regression analysis to detect a significant direct effect that was not shown in the SEM analysis with latent variables, is statistically valid.
Is it possible to check the normality of my data (kurtosis and skewness) via SmartPLS 3.0, and what is the procedure? Help needed.
I have a research project with one independent variable, namely Role of Internal Auditor (X), and two dependent variables, namely Fraud Prevention (Y1) and Fraud Detection (Y2). What regression analysis can be used in this situation?
I'm planning to use regression analysis in my study, but I am confused about this: if both of the predictors or independent variables (e.g., Academic Resilience and Academic Procrastination) were already correlated with the dependent variable (e.g., Test Anxiety) in previous studies, is it still possible to pursue it? It's an undergraduate thesis, by the way, and research is really not my forte, which is why I'm having a hard time.
For example, I want to know how baseline characteristics of patients (age, BMI...) and the confounding factors (smoking, diabetes or other chronic diseases) affect the serum vitamin D value. Which regression model should I use?
I use SPSS and R for analyzing the data
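If serum vitamin D is treated as a continuous outcome, an ordinary multiple linear regression with the baseline characteristics and confounders entered together is a natural starting point; a hedged R sketch with hypothetical variable names:

```r
# Hypothetical data frame `patients` with vitamin_d, age, bmi, smoking, diabetes
fit <- lm(vitamin_d ~ age + bmi + smoking + diabetes, data = patients)
summary(fit)    # adjusted association of each factor with serum vitamin D
confint(fit)    # 95% confidence intervals
```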