
# Regression Analysis - Science method

Procedures for finding the mathematical function which best describes the relationship between a dependent variable and one or more independent variables. In linear regression (see LINEAR MODELS) the relationship is constrained to be a straight line and LEAST-SQUARES ANALYSIS is used to determine the best fit. In logistic regression (see LOGISTIC MODELS) the dependent variable is qualitative rather than continuously variable and LIKELIHOOD FUNCTIONS are used to find the best relationship. In multiple regression, the dependent variable is considered to depend on more than a single independent variable.
## Questions related to Regression Analysis
Question
I need to test the relationship between two different variables. One of the variables is calculated as 1-5 points, the other as 1-7 points. Does having different scale scores cause an error in the correlation or regression analysis results? Can you recommend a publication on the subject?
Before doing any numerical calculations here, it would be sensible to just do a visual assessment using cross-plots. These would give an immediate indication of any concerns.
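On the numerical side, using a 1-5 scale for one variable and a 1-7 scale for the other is not a problem for correlation: Pearson's r is invariant to linear rescaling. A small sketch with made-up data (the scales and values here are invented for illustration) confirms this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical responses: one item on a 1-5 scale, a related item on a 1-7 scale.
x = rng.integers(1, 6, size=200).astype(float)
y = np.clip(np.round(x * 1.2 + rng.normal(0, 1, size=200)), 1, 7)

r_raw = np.corrcoef(x, y)[0, 1]

# Linearly rescale both variables to 0-1; the correlation is unchanged.
x01 = (x - 1) / 4
y01 = (y - 1) / 6
r_scaled = np.corrcoef(x01, y01)[0, 1]

print(abs(r_raw - r_scaled) < 1e-12)  # True: correlation is unchanged by rescaling
```

The same invariance applies to standardized regression coefficients; only the unstandardized slope changes with the units of measurement.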
Question
A dummy variable is a variable that takes specific numeric values to represent the different attributes of a categorical variable. How does dummy coding relate to the full-rank requirement of the design matrix?
Abdur Rahman, I am not sure about removing one level of a categorical variable. In regression, the omitted level becomes the reference category, so you lose the ability to report an effect for it directly. For example, if your variable is age group with levels 20-30, 31-40, 41-50, 51-60, and 61-70, and 20-30 is the reference, the odds ratio for each other age group is computed relative to it; if you simply remove the 61-70 level, you no longer have that information. For my ongoing research project, to handle multicollinearity I prefer classification gradient boosting decision tree models (and since these models are difficult to interpret, I use SHAP diagrams).
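To make the reference-category point concrete, here is a small pandas sketch (the age-group values are hypothetical): dropping one level does not discard it, it becomes the implicit baseline that the remaining dummy columns are compared against.

```python
import pandas as pd

# Hypothetical age-group variable with five levels.
ages = pd.Series(["20-30", "31-40", "41-50", "51-60", "61-70"], name="age_group")

# Full dummy coding: five columns, which is rank-deficient alongside an intercept.
full = pd.get_dummies(ages)

# Reference coding: drop the first level; "20-30" becomes the baseline,
# and each remaining column's coefficient is interpreted relative to it.
ref = pd.get_dummies(ages, drop_first=True)

print(full.shape[1], ref.shape[1])  # 5 4
```

The dropped level is still represented: an observation with zeros in all four reference-coded columns is, by construction, in the 20-30 group.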
Question
Since I found out that there is a correlation between Timeliness and Semantic Accuracy (I'm studying linked data quality dimension assessment, trying to evaluate one quality dimension, in this case Timeliness, from another dimension, Semantic Accuracy), I presumed that regression analysis is the next step in this matter.
-The Semantic Accuracy formula I used is: msemTriple = |G ∩ S| / |G|
msemTriple measures the extent to which the triples in the repository G (the original LOD dataset) and in the gold standard S have the same values.
-The Timeliness formula I used is:
Timeliness(de) = 1 - max{1 - Currency(de)/Volatility(de), 0}
where:
Currency(de) = (1 - (lastmodificationTime(de) - lastmodificationTime(pe))/(currentTime - startTime)) * Ratio (the Ratio measures the extent to which the triples in the LOD dataset (in my case Wikidata) and in the gold standard (Wikipedia) have the same values.)
and
Volatility(de) = (ExpiryTime(de) - InputTime(de))/(ExpiryTime(pe) - InputTime(pe))
(de is the entity document of the datum in the linked data dataset and pe is the corresponding entity document in the gold standard.)
NB: I worked on Covid-19 statistics per country as a sample dataset, specifically the numbers of cases, recoveries, and deaths.
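Assuming the definitions above, the Timeliness computation can be sketched in a few lines of Python. The timestamps below are invented example values (days since an arbitrary origin), purely to show the mechanics:

```python
# Sketch of the Currency / Volatility / Timeliness formulas above.
def currency(last_mod_de, last_mod_pe, current_time, start_time, ratio):
    return (1 - (last_mod_de - last_mod_pe) / (current_time - start_time)) * ratio

def volatility(expiry_de, input_de, expiry_pe, input_pe):
    return (expiry_de - input_de) / (expiry_pe - input_pe)

def timeliness(curr, vol):
    return 1 - max(1 - curr / vol, 0)

# Hypothetical times in days since an arbitrary origin.
curr = currency(last_mod_de=90, last_mod_pe=80, current_time=100, start_time=0, ratio=0.9)
vol = volatility(expiry_de=120, input_de=60, expiry_pe=130, input_pe=70)
print(round(timeliness(curr, vol), 3))  # 0.81
```

Computing Timeliness per entity like this would give you the paired (Semantic Accuracy, Timeliness) observations needed for a regression.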
Not sure why you are re-asking this question.
Question
I have performed a hypothesis testing using the simple regression analysis model. What action must I take after testing the hypothesis?
I think, if THIS is unclear to you, you should take some statistics classes and consult a local statistician. If you don't know what to do after getting your results, I would presume that you also did not know what to do before the analysis. No offense!
Question
I have a dataset with a spatial map of census tracts and nearest distances to nearby hospitals and cities. I need advice on how to process this data for regression analysis and generate maps in ArcGIS (to see correlation). If you have a guide, that would be great. Thanks
ArcGIS has Geographically Weighted Regression (GWR) which can be used to regress a dependent variable over independent variables. It also takes into account spatial autocorrelation effects.
Question
After collecting my data, I decided to test my hypothesis using regression analysis. I also recognized that my data must meet the assumptions of the tool before I can use it. Therefore, I would like to know: after testing two assumptions and finding that the data meet them, can I just run the analysis, or must I test all the assumptions first?
Enoch KABINAA Suglo -
Even a curved line may be modeled by a quadratic 'linear' regression. You would like your estimated residuals to be approximately 'normally' distributed. (Your data may be skewed, or otherwise nonnormal.) One often tests for heteroscedasticity, but you should look into weighted least squares (WLS) regression, as heteroscedasticity is to be expected, with larger predicted values associated with larger expected variance of residuals. For time series, or even some spatial considerations in finite probability cases, you may have autocorrelation, and need to consider generalized least squares (GLS).
I suggest you first graph your data. A scatterplot can tell you a great deal. Then if you have further questions, you could show a graph along with your questions, and include the context.
Cheers - Jim
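A rough numpy sketch of the WLS idea mentioned above: simulate heteroscedastic data where the error spread grows with x, then weight each observation by the inverse of its (here known) error scale. This is purely illustrative; in practice the weights must be estimated or modeled.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)

# Simulated heteroscedastic data: error spread grows with x.
y = 2 + 3 * x + rng.normal(0, 0.5 * x)

X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares for comparison.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted least squares: multiply each row by 1/sigma_i (here sigma_i is
# proportional to x), which is equivalent to minimizing the weighted SSR.
w = 1 / x
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

print(beta_ols.round(2), beta_wls.round(2))  # both near (2, 3)
```

Both estimators are unbiased here, but WLS uses the variance structure and so gives more precise estimates and honest standard errors under heteroscedasticity.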
Question
Regression analysis can be a powerful tool for forecasting project outcomes at completion, offering insights to enhance resource allocation, risk management, and decision-making. However, its success hinges on accurate data, appropriate model selection, and a clear understanding of the project's dynamics.
Question
I am using SPSS version 28 and I want to know how to run the regression analysis. I have one dependent variable, BMC, and one independent variable, MVPA. I have two other variables, age and height, and I don't know how to adjust for them in SPSS. Are these independent variables too? There is no covariate box in the SPSS version I'm using.
To adjust for age and height in a linear regression in SPSS, the simplest approach is to enter them as additional independent variables alongside MVPA: linear regression has no separate covariate box because every predictor in the model acts as a covariate. Alternatively, you can use the "Transform" function to create new variables that are adjusted (residualized) for age and height and then use those new variables in your linear regression analysis.
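Outside SPSS, the underlying adjustment is just a multiple regression with the covariates entered alongside the predictor. A numpy sketch with simulated data (the variable names bmc, mvpa, age, height mirror the question; the coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Simulated data: bmc depends on mvpa plus the covariates age and height.
age = rng.normal(40, 10, n)
height = rng.normal(170, 8, n)
mvpa = rng.normal(30, 5, n)
bmc = 1.5 * mvpa + 0.02 * age + 0.03 * height + rng.normal(0, 1, n)

# Adjusting for age and height = entering them as additional predictors.
X = np.column_stack([np.ones(n), mvpa, age, height])
beta, *_ = np.linalg.lstsq(X, bmc, rcond=None)
print(beta.round(2))  # intercept, mvpa, age, height coefficients
```

The coefficient on mvpa is then its effect on bmc holding age and height constant, which is exactly what "adjusting" means in this context.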
Question
Hello,
I'm working on a panel multiple regression, using R.
And I want to deal with outliers. Is there a predefined function to do that?
If yes, could you please give me an example of how to use it?
Chuck A Arize, removing outliers is a bad idea, unless you are absolutely sure that those data points are bad measurements.
Nonlinear transformations are not a good idea, see why:
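One alternative to deleting outliers, sketched here in Python rather than R purely for illustration, is robust regression, which downweights points with extreme residuals instead of removing them. Below is a minimal Huber-type iteratively reweighted least squares loop on simulated contaminated data (the tuning constant 1.345 is the conventional Huber choice):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 100)
y = 1 + 2 * x + rng.normal(0, 0.5, 100)
y[::10] += 15  # contaminate every 10th point with a large outlier

X = np.column_stack([np.ones_like(x), x])

# Huber-type iteratively reweighted least squares: outliers get small
# weights instead of being deleted.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
for _ in range(50):
    resid = y - X @ beta
    scale = np.median(np.abs(resid)) / 0.6745  # robust scale estimate (MAD)
    w = np.minimum(1.0, 1.345 * scale / np.maximum(np.abs(resid), 1e-12))
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print(beta.round(2))  # close to the true (1, 2) despite the outliers
```

In R the analogous predefined function is `MASS::rlm`, and for panel models robust or median regression variants exist; the loop above just shows what such functions do internally.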
Question
Of late, some journal editors are insistent on authors providing a justification for the ordering of the entering of the predictor variables in the hierarchical regression models.
a) Is there a particular way of ordering such variables in regression models?
b) Which are the statistical principles to guide the ordering of predictor variables in regression models?
c) Could someone suggest the literature or other bases for decisions regarding the ordering of predictor variables?
Jochen Wilhelm, some people use the word "hierarchical" when not all variables are entered together but separately or in blocks of variables. For the former, automatic algorithms like stepwise regression may be used (but are not recommended, as you know), whereas for the latter the researcher typically decides which (blocks of) variables are entered in which order (blockwise regression).
As Bruce Weaver already mentioned, this is done very frequently in psychology, although it may be questioned whether it is necessary or useful, or just a habit because everyone does it. In most psychological papers of this kind, the authors are only interested in the delta R^2, for example to show that the increase in explained variance from an interaction term is significant (but they miss that this could also be shown via the squared semipartial correlation, with no blockwise regression needed).
I remember a paper (I believe about job satisfaction or something similar) where they entered sociodemographic variables in a first block, typical predictors in a second block, and new predictor variables in a third block: the second block to show that the typical predictors explain more than sociodemographic variables alone, and the third block to show that the new predictors explain variance over and above the former ones. (In light of causality, I would cast some doubt on the results...)
But you are right: within each block, and for the full model, the order of the variables does not have any meaning.
Does it help?
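The blockwise logic described above (compare R^2 before and after adding a block) can be sketched with simulated data; the "sociodemographic" and "typical predictor" variables here are invented stand-ins:

```python
import numpy as np

def r_squared(X, y):
    # R^2 of an OLS fit of y on X (X includes an intercept column).
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(4)
n = 500
demo = rng.normal(size=n)       # block 1: a sociodemographic variable
typical = rng.normal(size=n)    # block 2: a typical predictor
y = 0.3 * demo + 0.8 * typical + rng.normal(size=n)

ones = np.ones(n)
r2_block1 = r_squared(np.column_stack([ones, demo]), y)
r2_block2 = r_squared(np.column_stack([ones, demo, typical]), y)

# Delta R^2: variance explained by block 2 over and above block 1.
print(round(r2_block2 - r2_block1, 3))
```

The order of entry changes how the shared variance is attributed across blocks, which is exactly why journal editors ask for a substantive justification of that order.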
Question
Regression Analysis
What is the meaning of a categorical variable with more than 2 categories (ordinal)?
But still, if a variable is categorical and has more than two categories, you should enter it as a set of dummy (indicator) variables. Mahfooz Alam
Question
In a psychology study of N = 149, I was testing for moderation using a three-step hierarchical regression analysis using SPSS. I had two independent variables, X1 and X2, an outcome variable, Y, and the moderator, M. Step 1 uses the variables X1, X2, the interaction X1X2, and 5 covariates. Step 2 adds M. Step 3 adds the interaction variables X1M and X2M.
In my collinearity statistics, VIF is all under 10 for Steps 1 & 2 (VIF of 6 is found between X2 and X1X2 in both steps). For Step 3, VIF is high for X1, X2, M, X1M, and X2M. When I go look at the collinearity diagnostics box, the variance proportions are high for the constant, X1, M, and X1M. I'm understanding that there is multicollinearity.
My question is, what does it mean when the constant shows a high VIF? What would it mean if only one predictor variable and the constant coefficient were collinear?
Mikayla Franczak and I discussed this yesterday (offline). For the benefit of other followers of this thread, here is a summary of what we talked about, plus one or two other things that have occurred to me since.
1) Recommended VIF cut-offs are just rules of thumb. They also vary a lot--see the attached PDF. For both of those reasons, they should be taken with a rather large grain of salt.
2) The VIFs that Mikayla has flagged as high in her Step 3 model are all for variables that are involved in product terms. As Paul Allison says, high VIFs that are "are caused by the inclusion of powers or products of other variables" can be "safely ignored".
3) Centering the variables that are involved in product terms will reduce the VIFs, but it is only necessary when one encounters "numerical/computational problems" that result in a term being excluded from the model. See Allison's reply to comment 744 in that same blog post.
4) If supervisors, committee members, or collaborators are skeptical when you say that high VIFs that are due to product terms are not problematic, you can do this to reassure them:
• a. Estimate the model with the variables as they are, and save the fitted values. (This model will have the troubling high VIFs.)
• b. Estimate the model using centered variables, and save the fitted values. (This model will have much lower VIFs that do not alarm anyone.)
• c. Show that the R2 and Adjusted R2 values for the two models are exactly the same.
• d. Show that the fitted values for the two models are exactly the same--e.g., the difference between them = 0 for every observation, the correlation between them = 1.
5) Mikayla asked, "what does it mean when the constant shows a high VIF?" Let me point out that on page 1 of the output that was shared, no VIF or Tolerance is reported on the Constant lines.
Cheers,
Bruce
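Points 2-4 above can be checked numerically: centering the variables involved in a product term shrinks the VIFs, while leaving the fitted values (and hence R^2) untouched. A numpy sketch with simulated data:

```python
import numpy as np

def vif(X, j):
    # Variance inflation factor of column j, regressed on the other columns.
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ beta
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

def fitted(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(10, 1, n)   # far from zero, so x1 and x1*m correlate highly
m = rng.normal(10, 1, n)
y = x1 + m + 0.5 * x1 * m + rng.normal(size=n)

raw = np.column_stack([x1, m, x1 * m])
cen = np.column_stack([x1 - x1.mean(), m - m.mean(),
                       (x1 - x1.mean()) * (m - m.mean())])

print(vif(raw, 0) > vif(cen, 0))                    # True: centering shrinks the VIF
print(np.allclose(fitted(raw, y), fitted(cen, y)))  # True: fitted values identical
```

The two models span the same column space, which is why the fitted values (and R^2) agree exactly, as in steps (c) and (d) above.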
Question
I am currently trying to perform a mediation analysis with a long panel dataset, including control variables, in Stata.
I found solutions for running a moderated mediation regression with and without control variables, and I also found ways to run a regression with panel data, but I did not find a way to combine the two.
Is there a way to include mediator variables in my panel regression?
Does anyone have information, links, or advice on how to approach this challenge?
Yes, you can.
Question
Greetings of peace!
My study is about the effect of servicescape on the quality perception and behavioral intentions
independent Variable-Under servicescape there are 4 indicators
Layout Accessibility - 10 items
Ambience condition- 3 items
Facility Aesthetics - 6 items
Facility cleanliness -4 items
Quality perception serve as mediator with 3 items
Dependent Variable-Behavioral intentions - 4 items
All were measured using Likert Scale (N = 400)
I tried ordinal regression analysis, but I don't know how to combine the items, and the independent variables are ordinal. Also, the Pearson value is <0.001 and the Deviance is 1.000.
I need to get the effect of individual indicators in servicescape on the quality perception and behavioral intentions.
There are a few options with ordinal outcomes but also treating the predictors (IV) as ordinal is trickier. As David L Morgan noted you can do this with some SEM software. You could also use ordinal logistic regression treating the predictors as continuous, dummy coding them or treating them as monotonic (the latter only available in the R brms package as far as I'm aware https://cran.r-project.org/web/packages/brms/vignettes/brms_monotonic.html ).
It's worth noting that these are all parametric models; it's just that they don't assume a normal distribution of residuals in the model.
Question
The variable physical environment effect is only a subset of the independent variable (environmental factors) in my research; there are social and cultural environment effects as well. They are measured in my questionnaire with five questions each, and the responses are: never, rarely, often, and always. The dependent variable, student performance, was also measured in the same format (i.e., with five questions and the same response options). I have coded them into SPSS with the measure set to Ordinal. I want to answer the research questions: 1. How does the physical environment affect student performance? 2. How does the social environment affect student performance? 3. To what extent does the cultural environment influence student performance? I've computed the composite score (mean) for the questions; can I use these scores in the ordinal regression analysis? Or is there another way to combine the questions into a single variable, for both the independent and dependent variables?
In your study, where you have measured the effects of physical, social, and cultural environments on student performance using ordinal scales, you can use ordinal regression analysis to answer your research questions.
To conduct the ordinal regression analysis, you do not necessarily need to compute a composite score for the questions. Instead, you can use each question as a separate predictor in the analysis. Each question represents a different aspect of the environment (physical, social, or cultural), and using them as separate predictors allows you to assess the unique contribution of each aspect to student performance.
Here's a general outline of the steps you can follow:
1. Prepare your data: Ensure that your data is properly coded and formatted in SPSS. Make sure the independent and dependent variables are coded as ordinal variables.
2. Run ordinal regression analysis: In SPSS, go to "Analyze" -> "Regression" -> "Ordinal..." and select the dependent variable (student performance) and the independent variables (physical environment, social environment, and cultural environment). Specify the appropriate link function (e.g., logit or probit) based on the distributional assumptions of your data.
3. Interpret the results: Examine the coefficient estimates, their significance levels, and the odds ratios associated with each independent variable. These results can provide insights into the effects of physical, social, and cultural environments on student performance.
By using each question as a separate predictor in the ordinal regression analysis, you retain the specificity and granularity of the different aspects of the environment being studied. This approach allows you to explore the individual effects of physical, social, and cultural environments on student performance.
Keep in mind that ordinal regression assumes proportional odds, which means it assumes that the relationship between the predictors and the outcome is consistent across different levels of the outcome variable. It's important to assess this assumption, for example, by conducting tests of parallel lines.
Additionally, be cautious when interpreting the results as causal relationships. While ordinal regression can help identify associations between variables, establishing causal relationships requires rigorous experimental designs or the consideration of other potential confounding factors.
Overall, by using ordinal regression analysis and examining the effects of different environmental factors on student performance separately, you can address your research questions and gain insights into the influence of physical, social, and cultural environments on student performance.
Question
Hi,
I want to predict the traffic vehicle count at different junctions in a city. Right now, I am modelling this as a regression problem: I scale the traffic volume (i.e., the count of vehicles) between 0 and 1 and use these scaled values for the regression analysis.
As part of the regression analysis, I am using an LSTM with mean squared error (MSE) as the loss function. I convert the predicted and actual outputs back to the original scale (using `inverse_transform`) and then calculate the RMSE.
But as a result of the regression, I get the output variable in decimals (for example 520.4789), whereas the actual count is an integer (for example 510).
Is there any way, where I will be predicting the output in an integer?
(i.e my model should predict 520 and I do not want to round off to the nearest integer )
If so, what loss function should I use?
If you want your regression model to predict integer values instead of decimal values, you can modify your approach by treating the problem as a classification task rather than regression. Instead of scaling the traffic volume between 0 and 1, you can map the integer values to a set of discrete classes. For example, you can define different classes such as 0-100, 101-200, 201-300, and so on, and assign each traffic volume to the corresponding class. Then, you can use a classification model like a neural network with softmax activation and categorical cross-entropy loss function to predict the class of each traffic volume. This way, your model will output integer predictions representing the class labels rather than decimal values.
Question
If the value of the correlation is insignificant or negligible, should we run a regression analysis or not? Obviously it will be insignificant; is it necessary to mention this in the article?
"Significance" is not what should concern you. It is a function of sample size, and even if you compare x-variables with the same sample size in a multiple regression, issues such as collinearity could foul such comparisons. Of course if your sample size is too small, you won't be able to discover much of anything. But you might try plotting the points with a regression "line," and put curves around it using the estimated variance of the prediction error. Don't forget heteroscedasticity. I know SAS does a good job of this. You could put predicted-y on the x-axis and the y-values on the y-axis (or estimated residuals on the y-axis when looking for heteroscedasticity, which is natural, associated with the predicted-yi as a size measure).
Penn State has some good introductory material on this, though they did not include heteroscedasticity, the last time I looked. You can find information from Penn State by searching on a term, and including "Pennsylvania State University" in the search.
Best wishes.
Question
Are correlation and regression analysis part of descriptive or inferential statistics?
Descriptive statistics is a vague term, usually applied to describing the properties of a single variable. However, there is no law that says a correlation cannot be a descriptive statistic. We might make the distinction between descriptive and hypothesis-testing statistics, for example.
I would imagine that a table of the intercorrelations of the items of a scale would be descriptive rather than hypothesis testing, for example.
But there's no legal definition, so no worries!
Question
Hello researchers,
I am facing a problem doing a regression analysis with three independent variables, one mediating variable, and one dependent variable. How can I do this in SPSS? Can anyone please help me?
Hello again Md.,
I would recommend you try either path analysis (available in any SEM package) or the associated multiple linear regression models.
The SEM model would have these proposed relationships:
1. IV1 -> Med
2. IV2 -> Med
3. IV3 -> Med
4. Med -> DV
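The path model above boils down to a set of linear regressions. An illustrative numpy sketch on simulated data, estimating the a-paths (IVs to mediator), the b-path (mediator to DV, controlling for the IVs), and the indirect effects a*b:

```python
import numpy as np

def ols(X, y):
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

rng = np.random.default_rng(6)
n = 500
iv1, iv2, iv3 = rng.normal(size=(3, n))

# Simulated mediation: the IVs affect the mediator, which affects the DV.
med = 0.5 * iv1 + 0.3 * iv2 + 0.2 * iv3 + rng.normal(size=n)
dv = 0.6 * med + rng.normal(size=n)

a_paths = ols(np.column_stack([iv1, iv2, iv3]), med)[1:]       # IVs -> mediator
b_path = ols(np.column_stack([med, iv1, iv2, iv3]), dv)[1]     # mediator -> DV

indirect = a_paths * b_path  # indirect effect of each IV via the mediator
print(indirect.round(2))
```

An SEM package (or PROCESS in SPSS) does essentially this in one step and additionally provides standard errors, typically via bootstrapping, for the indirect effects.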
Question
When doing a regression analysis, the coefficients table in SPSS shows that my 3 main effects are significant. When I do a regression analysis for my 6 moderating effects, for which I created interaction terms, the coefficients table also shows these are significant. But when I do the regression analysis and include the 3 main effects and 6 moderating effects at the same time, none is significant. How should I interpret this? And how should I continue?
Let's simplify things a bit and consider two models with X1 and X2 as the only explanatory variables, both quantitative.
1) Y = b0 +b1X1 + b2X2 + error
2) Y = b0 +b1X1 + b2X2 + b3X1X2 + error
In model 1:
• b1 shows the effect (on the fitted value of Y) of increasing X1 by one unit while holding X2 constant (at any value you wish)
• b2 shows the effect (on the fitted value of Y) of increasing X2 by one unit while holding X1 constant (at any value you wish)
But those interpretations of b1 and b2 do not work for model 2. In model 2:
• b1 shows the effect (on the fitted value of Y) of increasing X1 by one unit while holding X2 constant at a value of 0
• b2 shows the effect (on the fitted value of Y) of increasing X2 by one unit while holding X1 constant at a value of 0
Putting it another way, b1 and b2 in model 2 are like simple main effects in a two-way ANOVA model, not like main effects in a two-way ANOVA model. Some authors describe them as "main effects", but I think that can cause confusion, and so I prefer to call them first order effects. HTH.
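The first-order-effect point can be demonstrated numerically: in model 2, b1 is the slope of X1 at X2 = 0, so recentering X2 changes b1 even though the model is the same. A numpy sketch with simulated data:

```python
import numpy as np

def coefs(X1, X2, y):
    A = np.column_stack([np.ones(len(y)), X1, X2, X1 * X2])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

rng = np.random.default_rng(7)
n = 1000
X1 = rng.normal(size=n)
X2 = rng.normal(5, 1, n)  # X2 centered far from zero
y = 1 + 2 * X1 + 0.5 * X2 + 1.5 * X1 * X2 + rng.normal(size=n)

b_raw = coefs(X1, X2, y)
b_cen = coefs(X1, X2 - X2.mean(), y)

# b1 in the raw model estimates the X1 slope at X2 = 0 (about 2 here);
# b1 in the centered model estimates it at the mean of X2 (about 2 + 1.5*5 = 9.5).
print(round(b_raw[1], 1), round(b_cen[1], 1))
```

This is also why "everything became non-significant" after adding interactions is not alarming by itself: the first-order coefficients now answer a different, conditional question.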
Question
While examining some students on their final-year project defences, I discovered that a student had an Adjusted R² greater than 99% in the regression analysis of her work. Could that be possible?
Take a closer look at the data matrix.
Next, check the steps leading up to the computation of the R².
Question
This question is concerned with understanding the degree and direction of association between two variables, and is often addressed using correlation or regression analysis.
If the aim is "understanding", then the place to start is graphical presentation (visualisation).
Question
This question is concerned with determining whether two or more groups differ in some meaningful way on a particular variable or set of variables, and is often addressed using statistical tests such as t-tests, ANOVA, or regression analysis.
This is an extremely generic question: you must be more specific and provide details.
Question
Can you give all the criteria to evaluate the forecasting performance of the regression estimators?
To check how good your regression model is, you can use the following metrics:
1. R-squared: the proportion of variance in the dependent variable that the model explains.
2. Average error: summaries such as the mean absolute error (MAE) or root mean squared error (RMSE) of the differences between predicted and actual values.
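These criteria can be computed directly from predictions and actuals; a small numpy sketch of the standard formulas, using made-up values (MAPE is added as a further common forecasting criterion):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0, 11.5])

resid = y_true - y_pred
rmse = np.sqrt(np.mean(resid**2))              # root mean squared error
mae = np.mean(np.abs(resid))                   # mean absolute error
mape = np.mean(np.abs(resid / y_true)) * 100   # mean absolute percentage error
r2 = 1 - resid @ resid / np.sum((y_true - y_true.mean())**2)  # R-squared

print(round(rmse, 3), round(mae, 3), round(mape, 1), round(r2, 3))  # 0.632 0.6 9.5 0.962
```

For an honest assessment of forecasting performance, these metrics should be computed on held-out (out-of-sample) data, not on the data used to fit the model.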
Question
I am performing a cross-country regression analysis with a sample of 101 countries. Most of my variables are averages of annual data across a period of 7 years. Every one of my primary variables has data available in each of these 7 years. However, certain countries have data missing in certain years for variables used in my robustness checks.
How should I handle this missing data for each robustness variable? Here are a few ideas I have considered
A. Average data for each country, regardless of missing years
B. Exclude any country with any missing years from data for that respective variable
C. Exclude countries that are missing data up to a certain benchmark, perhaps removing countries that are missing more than 2 or 3 of the 7 years that are being averaged for that respective regressor
D. Only use robustness variables that have available data for every country in every year that is being averaged
Please offer the best solution and any other solutions that would be acceptable.
Using multiple imputation or full information maximum likelihood (FIML) would probably be your best options. Under the assumption of missing at random (MAR) data, these techniques allow you to include all available data points in your analyses. Most other techniques either lead to a loss of data (and therefore statistical power), make more restrictive assumptions about the missing data mechanism, or both. See
Enders, C. K. (2022). Applied missing data analysis (2nd ed.). Guilford Press.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. https://doi.org/10.1037/1082-989X.7.2.147
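As a minimal illustration of the idea behind regression-based imputation (a single imputation pass in numpy; a real analysis would use dedicated multiple-imputation software and pool estimates across several imputed datasets, as in the references above):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200

# Two correlated country-level indicators; the second has missing values.
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(0, 0.6, n)
z_obs = z.copy()
z_obs[rng.random(n) < 0.2] = np.nan  # roughly 20% missing at random

# Regression imputation: fit z ~ x on complete cases, predict the missing z,
# and add random noise so imputed values mimic the residual spread.
mask = np.isnan(z_obs)
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A[~mask], z_obs[~mask], rcond=None)
resid_sd = np.std(z_obs[~mask] - A[~mask] @ beta)
z_imp = z_obs.copy()
z_imp[mask] = A[mask] @ beta + rng.normal(0, resid_sd, mask.sum())

print(np.isnan(z_imp).sum())  # 0: no missing values remain
```

Multiple imputation repeats this stochastic step several times and combines the resulting estimates, so that the extra uncertainty from the missing data is reflected in the standard errors.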
Question
Dear fellows,
Maybe you have done interesting measurements to test some model?
I can always use such data to use as examples and tests for my regression analysis software, and it's a win-win, since I might give you a second opinion on your research.
It's important that I also get the imprecision (measurement error/confidence interval) on the independent and dependent variables. At this moment, my software only handles one of each, but I'm planning to expand it for more independent variables.
Carlos Araújo Queiroz I don't see a dataset, or am I missing something?
Question
Assumptions of multinomial and linear regression analysis?
Multinomial logistic regression is a statistical method in which a single categorical variable is predicted from one or more other factors; it also establishes the numerical relationship between such variable pairs. Your target (predicted) variable should be categorical. Linearity (of the logit), independence of observations, and the absence of outliers and of multicollinearity are among the assumptions of multinomial logistic regression. For details on these assumptions you may refer to the following link:
A linear relationship, a zero conditional mean of the error terms, a Gaussian distribution of the error terms, homoskedasticity of the error terms, and the absence of outliers, of multicollinearity, and of autocorrelation of the error terms are among the assumptions for linear regression. For more details you may use the following link:
Hope this is useful.
Question
In finding the correlation and regression of multivariable distribution what is the significance of R and R^2? What is the main relation between them?
R represents the correlation coefficient between two variables in a multivariable distribution. It measures the strength and direction of the linear relationship between the two variables. R ranges from -1 to +1, where -1 indicates a perfect negative correlation, +1 indicates a perfect positive correlation, and 0 indicates no correlation.
R^2, on the other hand, represents the coefficient of determination. It measures the proportion of variance in one variable that is predictable from the other variable(s) in a multivariable distribution. R^2 ranges from 0 to 1, where 0 indicates no variance in the dependent variable is explained by the independent variable(s), and 1 indicates that all the variance in the dependent variable is explained by the independent variable(s).
The main relationship between R and R^2 is that R is the square root of R^2. In other words, R^2 is the proportion of variance in the dependent variable that is explained by the independent variable(s), and R is the correlation coefficient between the dependent variable and the predicted values based on the independent variable(s). Therefore, R^2 is a measure of how well the regression line fits the data, while R is a measure of the strength and direction of the linear relationship between two variables.
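The relationship can be verified directly: in an OLS regression with an intercept, the correlation between the observed values and the fitted values, squared, equals R^2. A numpy sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

resid = y - y_hat
r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(y, y_hat)[0, 1]  # correlation between y and fitted values

print(np.isclose(r**2, r2))  # True: R is the square root of R^2
```

With several predictors this r is the "multiple R" reported by statistical packages, which is why the identity carries over to multiple regression.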
Question
Hello,
I am doing a multiple regression with 2 predictors. The predictors correlate moderately/strongly, r = 0.45. When the first predictor is put in a regression analysis on its own, it explains 16.8% of the variance of the dependent variable. The second predictor on its own explains 17.5%. When both predictors are put into the regression analysis, the VIF = 1.26, so multicollinearity should not be a problem. The predictors together explain 23.4% of the variance of the dependent variable.
First of all, I would like to ask whether the change in explained variance from 16.8-17.5% to 23.4% is a big change; more specifically, whether the predictors together are better at predicting the dependent variable than either one alone. Also, as the predictors correlate but the VIF is okay, is it safe to say that they probably explain some of the same parts of the variance in the dependent variable, i.e., that each predictor explains little unique variance?
Have you compared the two univariate b's with their values in the full regression? Unstable estimates of these coefficients are the main problem in multicollinearity.
Question
I would like to create a factor of the interaction effect of two variables for regression analysis.
I was wondering how to create the factor.
I was thinking of multiplying the scores of the two, but I would like to hear from other researchers. Thank you.
I suggest you try running the regression I suggested (two primary variables plus the multiplicative interaction effect) and see what happens.
Question
Suppose I want to predict energy consumption for my building using regression analysis. What factors should I consider including in my model, and how can I determine their relative importance?
It depends a little on how your model will be designed and what data are available.
In one of the simplest cases you have a mean energy consumption per human/animal/anything, and you simply multiply by the number of individuals.
But I think that is not your intention.
Probably you have a dataset with different independent variables and you want one final response that is not as simply related as in my example above.
There are a couple of standard software tools available for multiple (linear/non-linear) regression. I would prefer R and models of the lm type. Pretty good manuals are found around the web. But you may choose whatever you want; the results should be the same.
An interesting introduction to the math behind it is given in a paper by the US Geological Survey:
However, depending on the specification of your problem, it might be interesting to have a closer look at some other statistics.
If you can define specific groups (like high, medium, low, or extraordinary energy-consumption periods), a linear discriminant function analysis might be an interesting choice. It allows you to identify specific variables that are characteristic (high contribution) of a predefined group.
A factor analysis might also be a good choice to identify higher/lower variable contributions.
If you do not really need to present the contributions of the variables, a random forest analysis might be interesting. This machine-learning classifier can also be run as a regression model. You get less insight into the variables' contributions, but the results can be more precise...
Nevertheless, it depends a bit on the available data. Did you already prepare a pairs plot and look at the correlations among the variables? This may also give a first impression...
Jan
Question
I am using fixed-effects panel data with 100 observations (20 groups), 1 dependent and three independent variables. I would like to get a regression output from it. My question is: is it necessary to run any normality test or linearity test for panel data? And what difference would it make if I don't run these tests?
Rede Ganeshkumar Dilipkumar So, to test hypotheses about the relationship between regressor and regressand, do we definitely need a normality test? In causal-relationship methods we usually have to establish whether the alternative hypothesis or the null hypothesis holds when assessing the influence of the regressor on the regressand. So after finding the best-fitting regression model (pooled effects, fixed effects, or random effects), we continue by testing the influence of the regressor on the regressand. Are you saying that these hypothesis tests definitely require a normality test and the other assumption tests? Please cite a recommended theory or reference to strengthen your argument. Thank you for the enlightenment.
Question
Situation: The moderating variable can explain up to 25 percent, while the remaining 75 percent is explained by other factors outside the model. What does this mean? Or would this mean that the moderating variable did not significantly moderate the relationship between the IV and DV? Thank you to anyone who responds!
Percent increase in explained variance (R square change) is not a very intuitive way to assess the practical significance of an interaction/moderation effect in my opinion. The percent increase could be small despite the fact that the interaction may be important. If the interaction is statistically significant, you should plot the regression lines for different meaningful values of the moderator to interpret the meaning and practical relevance of the interaction effect.
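Probing an interaction as suggested above amounts to computing the simple slope of the IV at chosen moderator values, b1 + b3*m. A minimal sketch; the coefficients b1 and b3 are made up for illustration, not taken from any dataset in this thread:

```python
# Probe a moderation effect: slope of X on Y at chosen moderator values.
# Model: Y = b0 + b1*X + b2*M + b3*X*M  (hypothetical coefficients below).
b1, b3 = 0.50, 0.20  # main effect of X and interaction coefficient

def simple_slope(m):
    """Slope of Y on X when the moderator equals m."""
    return b1 + b3 * m

# Evaluate at the moderator mean and +/- 1 SD (assume mean 0, SD 1 here).
for m in (-1.0, 0.0, 1.0):
    print(f"moderator = {m:+.1f}: slope = {simple_slope(m):.2f}")
```

Plotting these slopes as regression lines for low, mean, and high moderator values gives the interaction plot described above.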
Question
Propensity score matching (PSM) and Endogenous Switching Regression (ESR) by full information maximum likelihood (FIML) are most commonly applied models in impact evaluation when there are no baseline data. Sometimes, it happens that the results from these two methods are different. In such cases, which one should be trusted the most because both models have their own drawbacks?
What is the advantage of PSM over ESR?
Question
I would like to know if I am wrong by doing this. I made quartiles out of my independent variable and from that I made dummy variables. When I do linear regression I have to record the betas with 95%CI per quartile per model (I adjust my model 1 for age and sex). Can I enter all the dummies into the model at the same time or do I have to enter them separately (while also adjusting for age and sex for example)?
So far I entered all the dummies and adjusted for age and sex at the same time, but now I wonder whether SPSS fails to adjust for the second and third dummy variables. So I think I need to redo my calculations and run my models with one dummy in each.
Thank you.
What you are looking for is called linear regression. The good news is that linear regression is quickly done and easy to interpret. It will also give you more statistical power, as you lose information through categorization.
And don't worry, I've seen this categorization nonsense done by seasoned professors (and sometimes forced upon their students). That's from a past era, when you literally had to crunch the numbers with pencil and paper, which is easier with categories.
Question
Hello everyone,
I am currently working on my thesis but I have encountered a problem and I am not sure how to solve it. I would like to measure the impact that ESG (Environmental, Social, Governance) has on financial performance (ROA, ROE) from 2016 to 2021. Some important details about my study:
• I would like to compare two samples of companies: One first group with ESG part of the DJ Sustainability Index (DJSI) and another group without ESG (no part of DJSI).
• I intend to analyze companies that have been part of the DJSI between 2016 and 2021. However, some companies don't have an ESG score (independent variable) for some years. Should I still collect information for my dependent variables for all the years? For example, company X has ESG scores for 2016 and 2017 only, would I need data for ROA and ROE for all the years or just for 2016 and 2017?
• Any other aspects I should consider?
Thanks!!
Hi,
In any regression output there are ANOVA statistics; check the regression options provided. If I understood you well, your study has sub-variables: if you treat your dependent variable as a single measure while your independent variable has sub-variables, a multivariate analysis may be suitable first, after which you can run the regression and examine the combined effects.
In the situation where some companies do not have an ESG score, you could run a Student's t-test, since all companies still have a measured financial performance (the dependent variable). You would then test whether there is a significant difference in the financial performance of companies that have an ESG score and those that do not.
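A sketch of the suggested comparison, using Welch's two-sample t statistic (which does not assume equal variances); the ROA figures below are invented purely for illustration:

```python
# Hedged sketch: Welch's t statistic for comparing mean ROA of firms
# with vs. without an ESG score (all numbers are hypothetical).
import math
import statistics as st

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom (unequal variances)."""
    va, vb = st.variance(a), st.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (st.mean(a) - st.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

roa_esg = [5.1, 6.0, 4.8, 5.5, 6.2]      # hypothetical ROA, ESG firms
roa_no_esg = [4.0, 4.5, 3.9, 5.0, 4.2]   # hypothetical ROA, non-ESG firms
t, df = welch_t(roa_esg, roa_no_esg)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The t value would then be compared against the t distribution with the computed degrees of freedom; in practice one would use a statistics package rather than this hand-rolled version.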
Question
Hi,
I have a dependent variable that represents ridership on each street. The numbers are bucketed to the nearest 50, so the values are 50, 100, 150, and so forth.
My independent variable is also discrete: 1, 2, 3, 4, etc., representing street continuity.
Would it be appropriate to run a linear regression analysis to see if there is a correlation between these two variables?
Note that I will run the analysis on multiple cities.
"Would it be appropriate to execute linear regression analysis ... ?"
The answer depends on what you are trying to do, what you are trying to estimate and how much model-development you are prepared to do in the search for meaning and "optimal estimates".
In the present context, what you could do is a simple linear estimation by least squares, but ignore the usual summary statistics that produce "error bars" for estimates. Keep only a single statistic. If you had only a single independent variable, this could be just the regression coefficient. For multiple independent variables (as you later mention), you could use the reduction in the sum of squares from the regression. Then, to test whether there is a real statistical association in your observed values, you could apply the principle of permutation testing, whereby you evaluate the same summary statistic from exactly the same algorithm applied to randomised versions of the original dataset. Here the randomisation provides a representation of the case of "no statistical association". The randomisation can be applied by randomly permuting the column of data for the dependent variable.
The principle here is that you can construct any measure of association you like and obtain a valid statistical inference by a first-principles argument involving randomisation. You mention "multiple cities". Notionally, it is just a matter of constructing an overall measure of association to summarize across all cities and then doing the randomisation separately for each city (but within the same step).
Note that the above centres on the question "to see if there is a correlation between these two variable". If you really want some sort of predictive model, or are extremely concerned about extracting as much information as possible from the data, then you would need to develop a full statistical model and this would take you well away from a simple least squares analysis.
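The permutation procedure described above can be sketched as follows. The data are synthetic (ridership generated from continuity plus noise, then bucketed to the nearest 50), so only the mechanics, not the numbers, are meant to carry over:

```python
# Permutation test for a least-squares slope: compare the observed slope
# against slopes computed after shuffling the dependent variable, which
# destroys any real association (synthetic data, fixed seed).
import random

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(1)
x = [float(i) for i in range(30)]                                         # street continuity
y = [50.0 * round((20.0 * xi + random.gauss(0, 30)) / 50.0) for xi in x]  # bucketed ridership

observed = slope(x, y)

n_perm, extreme = 999, 0
for _ in range(n_perm):
    y_perm = y[:]              # shuffled copy: the "no association" case
    random.shuffle(y_perm)
    if abs(slope(x, y_perm)) >= abs(observed):
        extreme += 1
p_value = (extreme + 1) / (n_perm + 1)
print(f"observed slope = {observed:.2f}, permutation p = {p_value:.3f}")
```

For multiple cities, the same idea applies with an overall association measure, shuffling within each city in every permutation step.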
Question
I'm working on the below topic for my master thesis.
“Investigating the stages in a customer’s buying journey and determining the factors influencing the decision to switch between a retailer’s online sales channels – marketplace and own website.”
Considering this, my plan was to apply logit regression analysis on the channel choice of the customer (in this case “Marketplace” and “retailer’s own website”) as the dependent variable and the interaction between the independent variables “Age” and subjective norms (recommendation from peers, product reviews) for the three stages.
I’m struggling to ascertain if using the customer channel choice of either marketplace or own website be considered as the dependent variable. I have not used a Likert scale for this as this was a scenario-based survey. So, the respondents have chosen the channel they would use in every stage.
Could you please advise if using this choice as a dependent variable makes sense? And, if using Logit regression is the right way to go?
Also, how to calculate/analyze relative importance of the predictor variables (independent variables) in Logit Regression analysis?
Good morning. I understand fully, as I've been doing independent work also. http://sites.google.com/site/deborahhilton/ Most of my work is on ResearchGate, though. I just try to work out the statistics myself with the textbooks I have here. If you have some specific questions, I may be able to answer them for a nominal fee if you want some reference material. Good luck. Thank you.
Question
Is there any guidance on what counts as a strong, adequate, or low value of it? Thank you
It is a measure of goodness of fit in logistic regression analysis. It is a modification of the Cox and Snell R Square, which is derived from the likelihood ratio test statistic.
Nagelkerke R Square ranges from 0 to 1, with values closer to 1 indicating a better fit of the model. However, unlike in linear regression analysis, where R Square can be interpreted as the proportion of variance explained by the model, Nagelkerke R Square cannot be interpreted as easily.
A common rule of thumb for interpreting it: a value of 0.2 or less indicates a weak relationship between the predictors and the outcome.
A value of 0.2 to 0.4 indicates a moderate relationship.
A value of 0.4 or higher indicates a strong relationship.
However, it's important to note that the interpretation of Nagelkerke R Square should be taken with caution and should be supplemented by other measures of model fit, such as the Hosmer-Lemeshow test, AIC, or BIC. Additionally, it's essential to consider the practical significance of the relationship between the predictors and the outcome, rather than solely relying on statistical significance or goodness of fit measures.
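For reference, both the Cox & Snell and Nagelkerke pseudo-R² values can be computed directly from the null and fitted log-likelihoods; a small sketch with hypothetical likelihood values:

```python
# Cox & Snell and Nagelkerke R^2 from model log-likelihoods.
# The log-likelihood values below are hypothetical, for illustration only.
import math

def pseudo_r2(ll_null, ll_model, n):
    """Return (Cox & Snell, Nagelkerke) R^2 given log-likelihoods and n."""
    cox_snell = 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))
    max_cs = 1.0 - math.exp((2.0 / n) * ll_null)   # upper bound of Cox & Snell
    return cox_snell, cox_snell / max_cs           # Nagelkerke rescales to [0, 1]

cs, nk = pseudo_r2(ll_null=-120.0, ll_model=-95.0, n=200)
print(f"Cox & Snell = {cs:.3f}, Nagelkerke = {nk:.3f}")
```

Since Cox & Snell cannot reach 1, Nagelkerke's rescaled version is always at least as large, which is why the two are often reported together.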
Question
I'm doing a regression analysis of the effect of housing type on resident depression. When I included all samples in a single model, housing type had a significant effect on depression (p=0.000). But when I divided the sample into males and females, and performed regression analysis on the two separately, the analysis results of both males and females showed that housing type had no significant effect on depression (p=0.1-0.2). I wonder how to explain this result
It could be a power problem. When your samples get smaller, you reduce the statistical power to detect an effect. A better way to do this would be to include gender as a binary predictor into the overall analysis (and also potentially the interaction/product term housing*gender). That way, you can examine a potential gender (and interaction) effect without splitting your sample or running separate analyses.
Also, I would examine the data graphically using histograms, scatterplots etc. to see whether there are any peculiarities in the scores and their distributions (e.g., outliers, non-linear effects, etc.).
Question
I have 667 participants in my sample, and the outcome is continuous. I tested the normality, and the data on the histogram are bell-shaped, but the test results show that they are not normally distributed.
1- What is the cause of the discrepancy between the chart and the test results?
2- Can I still perform linear regression analysis on this data?
What the model requires is that the errors are mutually independent and identically distributed, approximately following a normal distribution with mean 0 and variance σ².
Also, remember that linear regression is a mathematical function based on the equation of a straight line.
The rest follows by itself.
Question
I am testing hypotheses about relationships between CEA and Innovation Performance (IP). If I am testing the relationship of one construct, say management support, to IP, is it OK to use simple linear regression? Or should I test it in a multiple regression with all the constructs?
Question
We want to analyze the relationship and impact between two variables in the education sector. The first is the independent variable (intellectual capital), measured on a sample of workers and leaders of size 150; the second is the dependent variable (quality of service provided), measured on a sample of students and parents of size 330.
Is this a nested sampling design where the students and/or parents from the second sample are nested within the workers and/or leaders? If that is the case, then perhaps multilevel (hierarchical linear) regression modeling could be an option. However, this requires nesting of observations, e.g., that the parents in Sample 2 can be linked to workers/leaders in Sample 1.
If the data aren't nested but there's some other connection/dependency between the observations in both samples, perhaps you could account for missing data by using full information maximum likelihood estimation or multiple imputation so that you wouldn't waste any data. But again, this requires that at least some observations can be linked across the two samples.
Question
F value 9.61
p-value 0.00
Oral Com = 21.36 - 1.194 Dissatisfaction with one’s linguistic skills
It is good that you carefully review the data matrix.
Also, try to understand carefully which are the dependent and possibly independent variables that you are interested in analyzing.
Question
My questionnaire consists of 20 questions, five of which relate to the dependent variable. The problem is that those questions are not on a Likert scale; they use different scales with fixed answers, and one is a multiple-choice question.
For example, the 5th question is scored 1-4,
the 6th question is dichotomous (1-2),
the 7th question is multiple choice with 7 options,
and the 8th question has 5 options to pick one from.
Can I build a composite index variable for the dependent variable by standardizing these variables using z-scores?
Can I use the standardized variables to perform correlation and regression analysis?
You haven't said what your research question(s) is(are), or what your goal of reducing the dimensionality of these 20 items is. It doesn't make sense to me (obviously it does to others ... see above!) to try to answer your question without this information (and again, those above obviously don't think these are important questions).
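If, given the research question, a composite DV were defensible at all, z-standardizing the mixed-scale items and averaging them is one common (if debatable) construction. A minimal sketch with hypothetical responses from five participants:

```python
# Build a composite from items on different scales by z-standardizing
# each item, then averaging per respondent (hypothetical data).
import statistics as st

def zscores(values):
    """Standardize a list to mean 0, SD 1 (sample SD)."""
    m, s = st.mean(values), st.stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical responses from 5 participants on differently scaled items.
item_1to4 = [1, 2, 4, 3, 2]
item_1to2 = [1, 2, 2, 1, 2]
item_1to5 = [2, 5, 4, 3, 1]

z_items = [zscores(it) for it in (item_1to4, item_1to2, item_1to5)]
composite = [st.mean(vals) for vals in zip(*z_items)]   # one score per respondent
print([round(c, 2) for c in composite])
```

Note this treats every item as interval-scaled and equally weighted, which is exactly the kind of assumption the comment above says should be justified by the research question first.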
Question
Dear Altruist,
I want to run a regression analysis based on accident data, but I cannot find the necessary information about what type of regression is most applicable for analyzing accident data.
Suggestions will be highly appreciated.
The first thing is to check what types of data you have collected or plan to collect. The nature of the data and variables will dictate the type of analysis.
Question
My research involves finding correlations between variables, comparing groups, and running a regression analysis to examine whether one variable predicts another. The research is descriptive for sure, but can I state that it is both correlational and causal-comparative?
I have this VERY same question as my research looks to do the same. Is there further guidance others can provide?
Question
I have done a Spearman correlation analysis and all of my independent variables correlated with the dependent variable. However, when I did a multiple regression analysis, the results show the IVs are not significant. Is this possible?
Yes it is possible that the four coefficients of your linear regression are not significant even if each of your four variables have a significant correlation with your dependent variable. As David Eugene Booth suggested, collinearity can be the cause. Try omitting the variables one by one. You mentioned non-normal data which is not necessarily a big problem but are you sure that a linear model is appropriate?
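One quick way to check the collinearity suspected above: with two predictors, the variance inflation factor reduces to 1/(1 − r²), where r is their correlation. A toy sketch with deliberately near-duplicate predictors:

```python
# Collinearity check sketch: two nearly identical predictors produce
# a huge variance inflation factor (toy data, for illustration only).
import statistics as st

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0]   # nearly a copy of x1

r = pearson_r(x1, x2)
vif = 1.0 / (1.0 - r ** 2)            # VIF > 10 is a common red flag
print(f"r = {r:.3f}, VIF = {vif:.1f}")
```

With more predictors, the r² in the denominator is replaced by the R² from regressing each predictor on all the others; most statistics packages report VIFs directly.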
Question
It is actually the Simple Linear Regression Analysis Question
1. For diabetics with an initial weight the same as yours, calculate the 95% confidence interval on their predicted mean weight loss at one year after DBI therapy.
2. For an individual diabetic with an initial weight the same as yours, calculate the 95% confidence interval on his predicted weight loss at one year after DBI therapy
You are not allowed to ask exam questions or other assignments here on RG!
Question
Hi all,
I am interested in whether the strength of correlation estimates between personality & depression scores is equally strong on two measurement occasions - when adolescents are 13yrs old vs. 15yrs old. This means I am comparing the coefficients in the same sample and the same set of variables, just measured at a different time point.
So far I was only able to figure out how to compare coefficients with independent observations (e.g. using Fisher’s r to z transformation).
Can I simply perform regression analysis and use "time" (categorical) as a moderator or this is only ok with independent observations?
Thanks a lot for your help!
Kind regards
Michaela
In case you are familiar with programs for structural equation modeling (SEM) such as Mplus, a convenient way to approach this problem is by setting up a simple correlation "model" with equality constraints on the correlations in question. A model test statistic (chi-square) then allows you to test whether the null hypothesis of equal correlations can be rejected at the chosen level of alpha. In Mplus, this can very easily be done with the MODEL CONSTRAINT option. For 6 or fewer variables, you can use the free demo version of the program. Other SEM programs such as lavaan (package in the free R software) have similar options for setting up and testing parameter equality constraints.
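For completeness, the Fisher r-to-z comparison mentioned in the question looks like the sketch below. Note that it assumes the two correlations come from independent samples, which does not hold for the same adolescents measured twice, so a method for dependent correlations (or the SEM approach) is more appropriate here. The numbers are invented:

```python
# Fisher r-to-z test for H0: rho1 == rho2, valid only for INDEPENDENT
# samples (hypothetical correlations and sample sizes below).
import math

def fisher_z_test(r1, n1, r2, n2):
    """z statistic comparing two correlations from independent samples."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of the difference
    return (z1 - z2) / se

z = fisher_z_test(r1=0.45, n1=250, r2=0.30, n2=250)
print(f"z = {z:.2f}")
```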
Question
Kindly, let me know which regression model I can use specifically for hypothesis testing.
Have you three hypotheses or one?
Please tell us a little more. It's well-nigh impossible to answer the question as it stands.
Question
I want to perform multi-input multi-output regression analysis. Can you suggest some tutorials?
Help will be appreciated 🙏
Think about the multiple outputs first: their types tell you what kind of regression you need, especially if they are of different types. Once you know that, ask again. Best wishes, David Booth
Question
Hello Everyone,
I am using SPSS with the PROCESS plugin by Hayes. I performed a (OLS) regression analysis (in SPSS) with dependent variable X and 3 dummy variables (I have 4 groups, one is the reference group). This gave me some coefficients for the dummies that made sense. But then, I continued with my analysis to a mediation analysis (using PROCESS), I used the same X and the same 3 dummies (same reference group), and the total effect model spat out different values for the coefficients (quite a bit larger). The same happened for another variable when PROCESS calculated an effect of X on M (the mediator), i.e. it returned different results from my "normal" regression analysis without PROCESS. From what I understand in Hayes 2018 book, the equations used to calculate these coefficients in mediation should be the same as my previously calculated ones. But they differ. Any ideas why?
Thank you,
Stefan
Did you ever figure it out? I have the same question..
Question
Greetings Fellow Researchers,
I am a newbie in using survival analysis (Cox regression). In my data set, 10-40% of cases have missing values (depending on the variables I include in the analysis). Based on this, I have two questions:
1- Are there any recommendations on an acceptable percentage of cases dropped (due to missing values) from the analysis?
2- Should I impute the missing values for all the cases that were dropped (let's say a maximum of 40%)?
Thank you so much for your time and kind consideration.
Best,
Sarang
I have no first-hand experience to offer you either. But this systematic review article may give you some ideas.
Question
I'm planning a study with a twice-repeated measurement, but I'm primarily interested in the correlation between the two measurements. As a number of factors potentially influence the agreement (or lack of it) between these two measurements, can Cohen's kappa or Pearson's r be used as the DV in a multiple regression analysis? If so, are there any conditions or parameters that need to be taken into account?
Question
I have a question for the members, on which I request your comments and recommendations:
In my research, I use a national-level entrance test score, which has high significance for students in their pursuit of higher education. The test is conducted all over the country every year at the same time, but has province-wise differences in test content and total marks. Again, the cut-offs vary across provinces and also year over year. There are three cut-offs for the scores, called cut-off 1, 2 and 3. The total scores in many provinces are the same, but others may vary based on total marks and subject combinations. However, they all serve a common purpose under common government guidelines.
The structure and procedure of conducting the exam is the same and is conducted nationally and the method of assessment is also similar even though differences in questions asked in the exam ( a little bit complex scenario).
My requirement is to derive a common Gaokao score (not sure it is possible) that can be used as a dependent variable in a regression. Considering the province wise differences and year-over-year variations in cut-offs, I would like to derive a common score nullifying these influences.
(1). For this purpose, my first task is to make the total scores for all provinces the same (i.e., 750, because the majority of provinces use this total). The first challenge is to increase or decrease the scale width for provinces whose total is other than 750. For example, if province X has a total score of 850, it should be rescaled to 750. How can I do that? Would simply using the equation (score/850)*750 be sufficient, or would it introduce an error? (I have seen some other equations for this conversion, but am not sure whether they are reliable.)
Could this rescaling have any impact on using the scores in the analysis?
2. The second step is 'centering' the scores (after equating the scale size) on the tier-3 cut-off value (using the converted cut-off for provinces that were rescaled). Is that meaningful?
With these two procedures, can I develop a new set of scores that can be used as the DV for regression analysis? Does this procedure eliminate the province-wise and cut-off variations in the test scores?
OR
Do you have any other standard procedure to suggest?
Thank you very much to all of you for your valuable suggestions and comments!
Please see "Kolen MJ, Brennan RL. Test equating, scaling, and linking: Methods and practices. 2nd. New York: Springer-Verlag; 2004." ...based on your described task I think Chapter 2 should give you the statistically sound approach to apply.
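A sketch of the two-step transformation proposed in the question (rescale each province's total to 750, then center on the rescaled tier-3 cut-off). The province totals and cut-offs below are hypothetical, and whether such a linear transformation is psychometrically adequate is exactly what the equating literature cited above addresses:

```python
# Two-step transformation: linear rescale to a common total, then
# center on the province's (rescaled) tier-3 cut-off.
# Province totals and cut-offs are hypothetical examples.
def common_score(raw, province_total, tier3_cutoff, target_total=750):
    """Rescale a raw score to target_total, then center on the cut-off."""
    scale = target_total / province_total
    return raw * scale - tier3_cutoff * scale

# Province X: total 850, tier-3 cut-off 400; a student scoring 500.
print(round(common_score(500, 850, 400), 1))
# Province Y: total 750, tier-3 cut-off 360; a student scoring 420.
print(round(common_score(420, 750, 360), 1))
```

Both results are now "points above the tier-3 cut-off on a 750-point scale", but this assumes the tests differ only by a linear change of scale, which formal equating methods do not take for granted.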
Question
To define quantitative analysis as such in a mixed methods approach, is it necessary to include a regression analysis?
I agree with Muhammad Tanveer Afzal, i.e., you conduct an apt statistical analysis for each approach and then integrate the results. So you may or may not conduct a regression analysis for your quantitative data, depending on your research questions or hypotheses.
Question
Following a principal component analysis, 3 factors were identified. Composite scores for these 3 factors were calculated for the T1 data, giving 3 new variables. The same will be done for the T2 data. Then, how do I use regression analysis to measure change and the impact of the intervention?
Whatever method you use (it is not clear whether you are using PCA or factor analysis) for T1, you will be optimizing for those data, so it is likely the solution will not fit as well for T2. What is your research question?
Question
Dear RG members.
I conducted a Cox regression analysis and unfortunately the HRs of some variables turned out extremely large (e.g., 1.17e+09, 1.31e+10) and extremely small for some others (1.87e-21).
FYI: 24 variables were included in the regression, and before running it I checked for interaction, with around three variables excluded as a result. So, why am I encountering this problem, and is there any solution, please?
If you have one predictor that is near perfect and it makes scientific sense, why do you need any more?
Question
Hi
I wonder what the differences (pros and cons) are between a multiple regression analysis with several observed variables predicting a dependent variable, and a piecewise SEM where one latent factor is constructed from these observed variables and that factor is used in the regression analysis.
A model with a single factor as predictor would imply that the indicators (observed predictor variables) are unidimensional (their covariances/correlations are fully explained by a single latent variable) and that they have no direct effects on the dependent variable. Those are strong assumptions that could be violated in practice.
Question
I used a 5-point Likert scale (1 strongly disagree to 5 strongly agree) for measuring satisfaction. I want to examine the relationship between 5 independent variables and DV satisfaction and see which one is the best predictor. I have a large sample of 533 participants. The problem is that the assumption of normality is violated and I would like to know if there is a technique that will allow me to proceed with the intended inferential analysis of regression.
Thank you !
For Likert data, the normality assumption is violated by definition. The assumption is clearly unreasonable for single items (which are not even numeric!). If you have a Likert score (calculated as the sum or average of several item ranks), the assumption may or may not be okay. This depends on the actual distribution of the score. If there are values close to or at the upper and/or lower limit, the assumption is clearly unjustified. If the score distribution is more central and unimodal, the assumption may be justified. But even then I'd prefer assuming something like a beta distribution.
There are many more (and possibly more severe) things that can go wrong: the score may not measure what is intended, important covariables may not be considered, the sample may be a convenience sample, the model may be misspecified (missing interactions, wrong functional form), and so on.
Is it "acceptable practice"? Well, in practice, all kinds of crap are done with Likert data, and all kinds of crap get published. In that sense, it is "accepted". But just because something is published (or publishable) does not imply that it is good science or good practice.
Question
The application of multilevel regression models has become common practice in the field of social sciences. Multilevel regression models take into account that observations on individual respondents are nested within higher-level groups such as schools, classrooms, states, and countries.
In the application of multilevel models in country-comparative studies, however, it has long been overlooked that on the country-level only a limited number of observations are available. As a result, measurements on single countries can easily influence the regression outcomes.
Diagnostic tools for detecting influential data in multilevel regression are becoming available, but what are your experiences with influential cases in country-comparative (multilevel) studies? How do you deal with influential cases if you encounter them?
Multilevel modelling is a statistical approach used to model the relationship between dependent and independent data when there is correlation between observations. These models are also known as hierarchical models, mixed-effects models, nested data models, or random-coefficient models.
Question
How to implement the Passing Bablok regression analysis in R language?
Take a look at the attached search for details of the method and its R implementation. Best wishes, David Booth
Question
I wanted to study whether study habits predicted academic performance, for which I ran a regression analysis and found an R-square value of .022, which is a very low prediction value. I would like to know how to report this in my research. Any kind of suggestion is highly appreciated.
I support the professors' view that the data could be a factor behind a low R-square, both through missing data and outliers. However, you should also check the reliability and validity of the questionnaire items. If you delete some items, Cronbach's alpha and the factor loadings could increase, and sometimes the R-square could increase as well. Otherwise, a multicollinearity effect should be considered. In your case, the explanatory power is only about 2.2%, which is very low and atypical.
Question
Hi everyone,
I have performed a Spearman's Rho with several variables: 1 continuous dependent variable and 5 continuous independent variables. I did this as normality was violated so I couldn't do a Pearson's Correlation. From the Spearman's Rho, I have ordered the independent variables from the strongest correlation to the weakest. I am planning to run a regression where I enter the independent variables in order (from the strongest correlation to the weakest) but I cannot figure out which regression analysis I should run. Someone suggested a Stepwise regression but I am not sure if this is the correct analysis. Do you think I should just run a multiple regression (where I cannot choose the order of variables to be entered) or some other regression?
When you employ a five-point Likert scale, Pearson correlation is used for relationships among variables, such as the relationships among the marketing-mix elements (7Ps). But if you consider the marketing mix (7Ps) as predictors of customer satisfaction (the dependent variable), you should employ multiple regression analysis (MRA) for prediction. The R-square then describes the explained variance, judged at a significance level of 0.05, 0.01, or 0.001.
Kindly visit the links for multiple regression analysis and Pearson correlation analysis.
Question
I have 186 respondents who participated in my study. My questionnaire uses a double-bounded dichotomous format and asked about their willingness to pay for the conservation programme.
To examine the determinants of willingness to pay, I need to run a regression analysis but am still unclear whether to use logistic or probit regression.
With a sample size of less than 500, it may be better to run a chi-square test of association between the variables.
Question
Hello
I have seen several studies with only one binary independent variable where both crude and adjusted odds ratios were reported. I am having difficulty knowing which variable(s) were adjusted for.
I have attached one such example from a study I wish to understand. How do I know the variables that were adjusted for?
How do I also determine the variables to adjust for in my research?
Take a look at this attachment and I hope this will help. The adjustment is for the IVs in your regression. Best wishes, David Booth.
Question
In my SEM analysis, all the paths from constructs to the outcome construct were shown to be insignificant, although the model fit indices were all acceptable. My particular focus is on whether an A variable is directly related to the B variable or the A variable is fully mediated by C.
Considering that this result may reflect a Type II error due to multicollinearity among the latent constructs, I tried a regression analysis to test whether there is a significant direct effect of the A variable on the outcome variable B. In this regression analysis, the measured variables for A were used. My question is whether this process, which uses regression analysis to find a significant direct effect that was not shown in the SEM analysis with latent variables, is statistically valid.
Hello Hyunsoon,
Shifting from estimated factor scores as an IV to individual constituent manifest variable scores as IVs will do several things (none of which is particularly good):
1. Your comparison (does a path exist among latent variables in sem vs. does some linear combination of manifest variables relate to some other score, regardless of whether it is for a latent or manifest variable) is no longer "apples to apples" or the same research question, so no inference may be made as to whether one is better than another.
2. Regressing your "B" score on multiple A manifest indicators almost guarantees that the weights which might apply from your sem measurement model will be ignored in order to maximize the multiple R/R-squared in your regression. Hence, a relationship may or may not help make the case for the constructs being related.
3. The approach is somewhat like modifying the model in order to yield the results you would like to see, and so any interpretation of p-values (for a model changed after having looked at a prior analysis of the same data) is likely misleading, and the opportunity for overfitting is higher.
Finally, please note that model fit indices are driven by how well estimated model parameters serve to reproduce the observed correlations/covariances among the measured variables. You certainly can have good fit with no significant paths among latent variables, if the latent variables are unrelated.
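Point 2 above can be illustrated numerically: OLS chooses indicator weights to maximize R-squared, so it will generally beat (and differ from) a composite formed with the SEM measurement-model loadings. A simulation sketch, with all loadings and sample sizes hypothetical:

```python
import numpy as np

# Three manifest indicators of a latent "A" with fixed loadings, and an
# outcome "b" driven by the latent variable. Data are simulated.
rng = np.random.default_rng(7)
n = 300
latent_a = rng.normal(0, 1, n)
loadings = np.array([0.9, 0.7, 0.5])      # measurement-model weights
A = latent_a[:, None] * loadings + rng.normal(0, 0.5, (n, 3))
b = 0.6 * latent_a + rng.normal(0, 1, n)

# R-squared using the loading-weighted composite of the indicators
comp = A @ loadings
r2_comp = np.corrcoef(comp, b)[0, 1] ** 2

# R-squared when OLS is free to pick its own indicator weights
X = np.column_stack([np.ones(n), A])
beta, *_ = np.linalg.lstsq(X, b, rcond=None)
resid = b - X @ beta
r2_ols = 1 - resid.var() / b.var()
print(r2_ols >= r2_comp)   # OLS can only match or exceed the composite
```

Because the OLS weights are tuned to this sample, the apparent gain over the loading-based composite is partly overfitting, which is the concern raised in points 2 and 3.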
Question
Is it possible to check the normality of my data (kurtosis and skewness) via SmartPLS 3.0, and if so, what is the procedure? Help needed.
Greetings! May I know how I can obtain the skewness and kurtosis of a variable in the student version of SmartPLS? Thank you.
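Independent of SmartPLS, a quick cross-check of per-indicator skewness and kurtosis is easy to run outside the program; a sketch with scipy on simulated data:

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Simulated data: 100 cases on 4 indicators. For real use, load your
# exported indicator matrix instead.
rng = np.random.default_rng(3)
data = rng.normal(size=(100, 4))

print(skew(data, axis=0))       # per-indicator skewness, ~0 for normal data
print(kurtosis(data, axis=0))   # per-indicator excess kurtosis, ~0 for normal
```

Note that `kurtosis` here returns excess kurtosis (normal distribution = 0), whereas some packages report raw kurtosis (normal = 3), so check which convention your software uses before comparing.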
Question
I have a research project with one independent variable, namely Role of Internal Auditor (X), and two dependent variables, namely Fraud Prevention (Y1) and Fraud Detection (Y2). What regression analysis can be used in this situation?
Assuming your dependent variables are continuous, multivariate regression or path analysis would be options for you.
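As a minimal sketch of the multivariate-regression option: least squares with a two-column outcome matrix estimates one intercept and one slope per dependent variable in a single pass. The variable names and effect sizes below are hypothetical, simulated stand-ins for the study's constructs.

```python
import numpy as np

# One predictor x (role of internal auditor) and two continuous outcomes:
# Y[:, 0] = fraud prevention, Y[:, 1] = fraud detection. Simulated data.
rng = np.random.default_rng(5)
n = 150
x = rng.normal(0, 1, n)
Y = np.column_stack([0.7 * x, 0.4 * x]) + rng.normal(0, 1, (n, 2))

X = np.column_stack([np.ones(n), x])        # intercept + predictor
B, *_ = np.linalg.lstsq(X, Y, rcond=None)   # B is 2x2: rows = (intercept, slope)
print(B[1])   # slope of x for each of the two outcomes
```

Path analysis (e.g. in an SEM package) gives the same coefficients here but additionally lets you model the correlation between the two outcomes' residuals.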
Question
I'm planning to use regression analysis in my study, but I am confused about this: if both of the predictors or independent variables (e.g., Academic Resilience and Academic Procrastination) were already found to correlate with the dependent variable (e.g., Test Anxiety) in previous studies, is it still possible to pursue it? It's an undergraduate thesis, by the way, and research is really not my forte, which is why I'm having a hard time.