Science method

Regression Analysis - Science method

Procedures for finding the mathematical function which best describes the relationship between a dependent variable and one or more independent variables. In linear regression (see LINEAR MODELS) the relationship is constrained to be a straight line and LEAST-SQUARES ANALYSIS is used to determine the best fit. In logistic regression (see LOGISTIC MODELS) the dependent variable is qualitative rather than continuously variable and LIKELIHOOD FUNCTIONS are used to find the best relationship. In multiple regression, the dependent variable is considered to depend on more than a single independent variable.
Questions related to Regression Analysis
  • asked a question related to Regression Analysis
Question
3 answers
In Brewer, K.R.W. (2002), Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold: London and Oxford University Press, Ken Brewer proved not only that heteroscedasticity is the norm for business populations when using regression, but also showed the range of values possible for the coefficient of heteroscedasticity. I discussed this in "Essential Heteroscedasticity," https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity, and further developed an explanation for the upper bound.
Then, in an article in the Pakistan Journal of Statistics (PJS), "When Would Heteroscedasticity in Regression Occur," https://www.researchgate.net/publication/354854317_WHEN_WOULD_HETEROSCEDASTICITY_IN_REGRESSION_OCCUR, I discussed why this might sometimes not seem to be the case, but argued that homoscedastic regression is artificial, as can be seen from my abstract for that article. That article was cited by other authors in another article, an extract of which was sent to me by ResearchGate, and it seemed to me to say, incorrectly, that I supported OLS regression. However, the abstract for that paper is available on ResearchGate, and it makes clear that they are pointing out problems with OLS regression.
Notice, from "Essential Heteroscedasticity" linked above, that a larger predicted value used as a size measure (for a ratio model, x alone will do, since bx gives the same relative sizes) implies a larger sigma for the residuals; hence the term "essential heteroscedasticity." This is important for finite population sampling.
So, weighted least squares (WLS) regression should generally be the case, not OLS regression. Thus OLS regression really is not "ordinary." The abstract for my PJS article supports this. (Generalized least squares (GLS) regression may even be needed, especially for time series applications.)
Relevant answer
Answer
Shaban Juma Ally, that is why one should use weighted least squares (WLS) regression. When the coefficient of heteroscedasticity is zero - which should not happen - then WLS regression becomes OLS regression.
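As a minimal, hedged R sketch of this point (all data hypothetical; gamma stands in for the coefficient of heteroscedasticity, with the residual standard deviation assumed proportional to x^gamma):
set.seed(1)
x <- runif(100, 1, 10)
gamma <- 0.75                                  # hypothetical coefficient of heteroscedasticity
y <- 2 * x + rnorm(100, sd = x^gamma)          # sigma grows with the size measure x
ols <- lm(y ~ x)                               # OLS: equal weights (the gamma = 0 case)
wls <- lm(y ~ x, weights = 1 / x^(2 * gamma))  # WLS: weights inversely proportional to variance
summary(ols)$coefficients
summary(wls)$coefficients                      # typically a smaller standard error for the slope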
  • asked a question related to Regression Analysis
Question
1 answer
I would like to perform a regression analysis with 3 categorical predictors (each with two levels), one continuous predictor, and one criterion variable in SPSS. I want to be able to look at the simple effects, so I was planning to use PROCESS. However, there is only room for one X variable in the dialogue box. How can I enter all 4 predictor variables into the model, get statistics for all interaction terms, and also obtain the simple slopes?
Relevant answer
Performing a linear regression in SPSS with three categorical independent variables (IVs) and one continuous IV involves several steps. Here’s a step-by-step guide:
  1. Prepare Your Data: Ensure your categorical variables are coded appropriately (e.g., 0/1 for binary variables or dummy coding for variables with more than two categories). Ensure your continuous variable is properly scaled.
  2. Open SPSS: Load your dataset into SPSS.
  3. Define Your Variables: Go to the "Variable View" tab and ensure your variables are correctly defined (e.g., categorical variables should be set as "Nominal" or "Ordinal" and continuous variables as "Scale").
  4. Create Dummy Variables (if needed): For categorical variables with more than two categories, create dummy variables. You can do this by using the "Transform" menu and selecting "Create Dummy Variables".
  5. Run the Regression Analysis: Go to "Analyze" > "Regression" > "Linear". Move your dependent variable (DV) to the "Dependent" box. Move your continuous IV and dummy variables (representing your categorical IVs) to the "Independent(s)" box.
  6. Specify the Model: Click on "Statistics" and select the options you need (e.g., estimates, confidence intervals). Click "Continue" and then "OK" to run the regression.
  7. Interpret the Output: Review the coefficients table to understand the impact of each IV on the DV. Check the R-squared value to see how well your model explains the variance in the DV. Look at the significance values (p-values) to determine which predictors are statistically significant.
For more detailed instructions and examples, you can refer to resources such as the UCLA statistical consulting guides or TidyStat tutorials.
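If you are open to stepping outside PROCESS and SPSS, here is a hedged R sketch of the full factorial model with all interaction terms and simple slopes (the names y, f1, f2, f3, xc, and the data frame dat are hypothetical):
fit <- lm(y ~ f1 * f2 * f3 * xc, data = dat)  # all main effects and all interactions
summary(fit)                                  # tests for every interaction term
library(emmeans)
emtrends(fit, ~ f1 * f2 * f3, var = "xc")     # simple slopes of xc per factor combination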
  • asked a question related to Regression Analysis
Question
1 answer
Let variable x be the compression index Cc and variable y be the plasticity index IP. I have found the relationship in the form (x/y) versus x. Is the (x/y)-x relationship valid for regression analysis? How do you explain the importance of such a correlation in regression analysis? Please try to explain why authors used this kind of relationship instead of x-y.
Relevant answer
Up
  • asked a question related to Regression Analysis
Question
3 answers
Dear Researchers:
When we do regression analysis using SPSS and want to measure a specific variable, some researchers take the average of the items under each measure while others add the values of the items. Which one is more reliable? Which one produces better results?
Thanks in advance
Relevant answer
Answer
Summing really is not a great approach unless the summed scores are scaled in a way that's well known in application (and even then it is problematic).
The mean has two advantages. 1) If the items have a fixed range, interpretation is usually easier: on a 1 to 7 scale, 1 is lowest, 7 is highest, and 4 is in the middle. A summed scale is hard to interpret because of variation in the number of items. 2) With missing items, the sum treats missing as zero, so it distorts the score. The mean treats them as if they were missing completely at random. This is not perfect, but better than treating values as 0. Also, with only a few missing items the problems are minor.
Ideally one would impute missing values but I’ve seen quite a few analyses that sum scores with missing data and end up with essentially garbage outcomes (if 0 is an impossible value it can really mess with results).
If there’s no missing data the analyses will be identical except for the interpretation issue. So generally the mean is the better and safer default.
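A small, hedged R illustration of the missing-data point (hypothetical 1-7 items):
items <- data.frame(i1 = c(5, 6, NA), i2 = c(4, 7, 6), i3 = c(3, 7, 5))
rowSums(items, na.rm = TRUE)    # row 3 is deflated: the missing item counts as 0
rowMeans(items, na.rm = TRUE)   # row 3 stays on the 1-7 metric, using the observed items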
  • asked a question related to Regression Analysis
Question
5 answers
Hypothetical:
I conduct a survey and ask the respondents to answer a yes/no question, "do you consider yourself to be an alcoholic". Their responses correlate with high/low averages on another measure, "Qualities of Alcoholism" (QoA, e.g. yes=high, no=low). Most of the responses are in the 'no' camp. My questions are:
- Should I include everyone in the sample (yes and no) if I'm comparing QoA to another measure (e.g. an 'Effects of Drinking' scale), or should I just section out the 'yes' cases?
- What would the effects be on regression analyses if I took either approach?
Relevant answer
Answer
- Include everyone: provides a broader, more generalizable analysis, but may weaken the relationship in the regression due to the inclusion of the 'no' responses.
- Section out the 'yes' cases: focuses on a specific subgroup, likely strengthening the relationship in the regression but limiting generalizability.
  • asked a question related to Regression Analysis
Question
1 answer
I am confused with the regression analysis and need more clarity on what kind of regression analysis should be used. Also, is there any database or tool that can create a prognostic model, or is using R the only option?
Relevant answer
Answer
There are other statistical tests and tools to develop a prognostic model for cancer using TCGA data. First, the data should be recorded in a systematic manner, which facilitates an easier analysis for a prognostic model for cancer.
  • asked a question related to Regression Analysis
Question
4 answers
Hi,
I am trying to do an Ordinal Logistic Regression (OLR) in R since this is the regression analysis that I need to use for my research.
I followed the tutorial video I found in YouTube in setting up the two sample models. Here are the 2 codes for the models:
library(ordinal)  # clm() comes from the ordinal package
modelnull <- clm(as.factor(PRODPRESENT) ~ 1,
data = Ordinaldf,
link = "logit")
model1 <- clm(as.factor(PRODPRESENT) ~ Age + as.factor(Gender) + `Civil Status`,
data = Ordinaldf,
link = "logit")  # a variable name containing a space must be backquoted
Next, I followed what the instructor did for the anova, and an error message appeared. It says: Error in UseMethod("anova") : no applicable method for 'anova' applied to an object of class "c('double', 'numeric')"
Is there something wrong in setting up the two sample models that prompts this error message? What needs to be done to fix the error?
Please help.
Thank you in advance.
Relevant answer
Answer
This comes from the depth of R. It means that the command anova() expects a different object type. Usually, it would simply be two model objects, like:
anova(modelnull, model1)
The object you pass to anova() is a vector of numbers. Looks like you have passed a vector of coefficients to anova, not a model object.
  • asked a question related to Regression Analysis
Question
7 answers
I want to conduct a multivariable linear regression analysis with 15 independent variables; however, when I was checking the assumptions, not a single IV correlated with the dependent variable. Can I still conduct a linear regression analysis or should I do something else?
Relevant answer
Answer
If none of your independent variables (IVs) correlate with the dependent variable (DV), it raises questions about the appropriateness of conducting a multivariable linear regression analysis. In linear regression, the independent variables are expected to have some level of correlation with the dependent variable to explain variation in the DV.
  • asked a question related to Regression Analysis
Question
4 answers
Are there new projects and studies considering recursive approaches, in energy transition and resources, being empirically tested with regression analysis and models?
Relevant answer
Answer
Yes, there are new developments in how econometrics combines economic factors and energy modeling
  • Machine learning helps: Economists are using machine learning to better understand complex relationships between the economy and energy use. This can lead to more accurate predictions and policy evaluations.
  • Fancy models get updates: DSGE models, which help analyze how the economy and energy markets interact, are still being improved to better reflect real-world dynamics.
  • asked a question related to Regression Analysis
Question
7 answers
I have a mixed-effect model with two random-effect variables. I wanted to rank the relative importance of the variables. The relaimpo package doesn't work for mixed-effect models. I am interested in the fixed-effect variables anyway, so will it be okay if I only take the fixed variables and use relaimpo? Or use Akaike weights for synthetic models with the variables alternately omitted?
which one is more acceptable?
Relevant answer
Answer
install.packages("glmm.hp")   # hierarchical partitioning of R-squared for mixed models
library(glmm.hp)
library(MuMIn)   # for r.squaredGLMM()
library(lme4)    # for lmer()
# Example mixed model: Species as a random intercept
mod1 <- lmer(Sepal.Length ~ Petal.Length + Petal.Width + (1 | Species), data = iris)
r.squaredGLMM(mod1)   # marginal (fixed effects) and conditional R-squared
glmm.hp(mod1)         # partitions the marginal R-squared among the fixed effects
a <- glmm.hp(mod1)
plot(a)               # plots each predictor's relative contribution
  • asked a question related to Regression Analysis
Question
8 answers
I want to perform an analysis using Poisson/negative binomial regression. There are 90 observations and about 20 variables (predictors). I read somewhere that there should be at least 10 observations per variable, so to prevent overfitting I have to remove some of them. What is the best way to do this? I tried Boruta feature selection and stepwise AIC, but I'm not sure about the results.
Thanks!
Relevant answer
Answer
When dealing with a large number of potential predictors and a limited number of observations, it's important to select variables carefully to avoid overfitting. The rule of thumb of having at least 10 observations per variable is a good guideline, but it's not a hard and fast rule. In your case, with 90 observations and about 20 variables, you should indeed consider reducing the number of predictors.
Both Boruta feature selection and stepwise AIC are valid approaches to variable selection, but they have their limitations. Here are a few suggestions to help you choose the best variables for your Poisson/negative binomial regression:
  1. Domain knowledge: Use your understanding of the subject matter to identify the most relevant predictors. Consider variables that have a plausible theoretical or logical relationship with the dependent variable.
  2. Univariate analysis: Conduct univariate Poisson or negative binomial regressions between each predictor and the dependent variable. Select variables that show a significant association with the outcome.
  3. Correlation matrix: Examine the correlation matrix of the predictors. If two variables are highly correlated (e.g., r > 0.7), consider removing one of them to avoid multicollinearity.
  4. Regularization: Consider using regularization techniques, such as Lasso (L1) or Ridge (L2) regression, which can help with variable selection by shrinking the coefficients of less important predictors towards zero (see the sketch after this list).
  5. Cross-validation: Use cross-validation techniques to assess the predictive performance of different variable subsets. This can help you identify the most informative predictors while minimizing overfitting.
  6. Stepwise selection: Stepwise selection methods (forward, backward, or mixed) can be useful for identifying important predictors. However, be cautious when interpreting the results, as these methods can sometimes lead to biased estimates and inflated p-values.
  7. Combine methods: Use a combination of the above methods to triangulate your variable selection. For example, you could start with domain knowledge to identify a subset of potentially important predictors, then use univariate analysis and regularization to further refine your selection.
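As a hedged sketch of the regularization suggestion (point 4), assuming X is a hypothetical 90 x 20 matrix of predictors and y the vector of counts:
library(glmnet)
set.seed(42)
cvfit <- cv.glmnet(X, y, family = "poisson", alpha = 1)  # alpha = 1 gives the lasso
coef(cvfit, s = "lambda.1se")   # sparse coefficients: zeroed predictors are dropped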
  • asked a question related to Regression Analysis
Question
3 answers
Which regression analysis should be applied if there are 7 nominal dependent variables (yes/no categories) and 1 ordinal independent variable (3 categories: low, normal, high)?
Relevant answer
Answer
Kiran Tahir You need to evaluate the significance of the log odds using the coefficient's p-value, or you can also use the confidence interval to determine the significance of your estimates.
A positive log odds does not necessarily mean that the association is significant between the independent and dependent variables.
  • asked a question related to Regression Analysis
Question
1 answer
In the article "THE TRANSITION FROM GEL SEPARATORY SERUM TUBES TO LITHIUM HEPARIN GEL TUBES IN THE CLINICAL LABORATORY" by Oğuzhan Zengi, how do the results of Bland-Altman plots and regression analysis contribute to our understanding of the comparability between serum tubes and LIH tubes for different clinical chemistry and immunoassay tests, and what implications do these findings have for clinical practice?
Relevant answer
Answer
This question explores the interpretation and consequences of statistical analyses, such as Bland-Altman plots and regression analysis, for assessing the comparability of serum tubes and LIH tubes in clinical chemistry and immunoassay testing. A Bland-Altman plot is a graphical representation of the difference between two measurements, here the performance of lithium heparinized plasma versus serum in clinical chemistry and immunoassays. By clarifying the statistical subtleties and identifying noteworthy findings, researchers can improve our understanding of the mechanisms driving differences, or the absence thereof, between the two types of tubes. Clinicians and laboratory staff who choose tests and interpret results can benefit from this research. In this particular article, regression analysis and Bland-Altman plots showed that the majority of tests did not exhibit appreciable differences between serum and LIH; nevertheless, a few analytes exceeded the total allowable error (TEa) limits derived from the biological variation (BV) database.
  • asked a question related to Regression Analysis
Question
12 answers
It is known that we can use regression analysis to limit the confounding variables affecting the main outcome. But what if the entire sample has a confounding variable affecting the main outcome; will regression analysis still be applicable and reliable?
For example, a study was done to investigate the role of a certain intervention in cognitive impairment, and the entire population included was old aged (more than 60 years old), which means that age here is a risk factor (covariate) in the entire sample, and it is well known that age is a significant independent risk factor for cognitive impairment.
My question here is: will the regression here be of real value? Will it totally remove the effect of age and give us the clear effect of the intervention on cognitive impairment?
Relevant answer
Answer
Yes, of course, adjusting for age will remove the confounding effect of age. Actually, adjusting the model for a confounding factor is one of two ways to remove its effect when checking the effect of another variable.
Look at this example:
if the equation of the model with only the cognitive score as x1 and the outcome as y is:
y = 1 + 5*x1
and the equation of a model with only age as x2 is:
y = 2 + 3*x2
then, assuming that both variables have an additive effect (which may not be true), adding the two equations gives:
2*y = (1 + 5*x1) + (2 + 3*x2) = 3 + 5*x1 + 3*x2
and dividing both sides by 2:
y = 1.5 + 2.5*x1 + 1.5*x2
And adding more and more variables to the model will definitely affect the coefficients
The other way is by stratification: you can just select a homogeneous subsample and fit a model, and so on, as I mentioned in the previous comment.
There is another way, "propensity score matching / propensity-weighted analysis", which uses the same model as the previous regression and adds weights to the patients, which can then be used in a weighted analysis or scoring analysis (but I don't encourage it; in most cases it doesn't work unless you have a large sample).
This is in general, but the question remains:
Are these results conclusive? Absolutely not; the only things that can eliminate the confounding factors are randomized controlled trials or a very big sample of the population, like big clinical registries or a census.
In my opinion, and from experience, most studies group patients like this: ">18", "18-45", "46-65", "> 65".
In your case it deserves a good letter to the editor to criticize an obvious mistake, and I would ask the authors to send me their data so I could replicate their results; there is a very big mistake if they did what you describe.
>> Tooth loss and age are correlated variables and can't be included in the same model, as they will produce a collinearity problem. <<
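To illustrate the adjustment point above, a small hedged R simulation (all numbers hypothetical): age confounds the intervention effect, and including age in the model recovers the true effect.
set.seed(7)
n <- 500
age <- runif(n, 60, 90)
treat <- rbinom(n, 1, plogis((age - 75) / 5))    # older patients are treated more often
score <- 50 - 0.8 * age + 5 * treat + rnorm(n)   # true intervention effect = 5
coef(lm(score ~ treat))         # unadjusted: biased by age
coef(lm(score ~ treat + age))   # adjusted: close to the true effect of 5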
  • asked a question related to Regression Analysis
Question
8 answers
I want to examine the relationship between school grades and self-esteem and was planning to do a linear regression analysis.
Here's where my problem is. I have three more variables: socioeconomic status, age and sex. I wanted to treat those as moderator variables, but I'm not sure if that's the best solution. Maybe a multiple regression analysis would be enough? Or should I control for those variables?
Also if I'd go for a moderation analysis, how'd I go about analysing with SPSS? I can find a lot of videos about moderation analysis, but I can't seem to find cases with more than one moderator.
I've researched a lot already but can't seem to find an answer. Also, my statistics skills aren't the best, so maybe that's why.
I'd be really thankful for your input!
Relevant answer
Answer
Hi Daniel Wright. Sure, I'm fine with just calling it an interaction. I'm just saying that if one wanted to use some other term, I prefer effect modification over moderation because it is neutral with respect to the nature of the interaction.
  • asked a question related to Regression Analysis
Question
10 answers
My question is looking at the influence of simulation on student attitudes. My professor would like me to do regression analysis, but he says to do two regressions. I have my pre-test data and post-test data the only other information I have is student college. What I found in my class materials seems to indicate that I can complete a regression using the post-test as my dependent variable and the pre-test as my independent variable in SPSS. How would I do another regression? Should I work in the colleges as another dependent variable and if so, do I do them as a group or do I need to create a variable for each college?
Relevant answer
Answer
I have some questions.
1) Was there some treatment (or intervention) between the baseline and followup scores? If so, did all subjects receive it, or only some of them? And if so to that, how were they allocated to intervention vs control?
2) How many colleges are there? If the number is fairly large, it may be preferable to estimate a multilevel model with subjects at level 1 clustered within colleges at level 2 (a minimal sketch follows).
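A hedged R sketch of that multilevel suggestion (variable and data names hypothetical):
library(lme4)
m <- lmer(post ~ pre + (1 | college), data = dat)  # students nested in colleges
summary(m)   # fixed effect of the pre-test plus between-college variance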
  • asked a question related to Regression Analysis
Question
4 answers
Dear all,
I am planning to conduct an experiment for 2 IVs (categorical variable - each IV has 2 categories) and 1 mediator (continuous variable - 7-point Likert scale) on an ordinal DV (6 categories). I understand that usually mediation analysis involves regression analysis to examine the indirect and direct effect of IV --> DV and mediator --> DV, and I will be able to use the PROCESS SPSS by Hayes (2013) to estimate the moderated mediation model. However, since it is a between subject design, I am not sure if I can separate the IVs when conducting the regression analysis.
I would deeply appreciate it if anyone can recommend tests and models I can use for this study, or have any resources that I may look into to better find a suitable test. Thank you very much!
Relevant answer
Answer
You can use path analysis. The lavaan package in R will allow this and compute all relevant direct and indirect effects for you.
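A minimal, hedged lavaan sketch, assuming hypothetical names (two dummy-coded IVs, a continuous mediator M, and the ordinal DV declared via the ordered argument):
library(lavaan)
model <- '
  M ~ a1*IV1 + a2*IV2
  DV ~ b*M + c1*IV1 + c2*IV2
  ind1 := a1*b   # indirect effect of IV1
  ind2 := a2*b   # indirect effect of IV2
'
fit <- sem(model, data = dat, ordered = "DV")
summary(fit, standardized = TRUE)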
  • asked a question related to Regression Analysis
Question
2 answers
How can decision trees improve regression analysis?
Relevant answer
Answer
Decision trees can be combined into ensembles of models. You can use bagging methods such as random forest (many trees are fit on bootstrap samples and the prediction is the average over the models) or boosting methods (each model is fit based on the previous model, weighting the errors made and correcting them). However, the more complex these models become as you tune their parameters (e.g., depth of the tree, number of trees, how trees learn from previous model errors), the more difficult they are to interpret. Simple regressions are used more to understand how predictors affect your target variable and tolerate less multicollinearity between predictors, implying a reduction of the number of predictors. Boosting models will be better at making predictions.
The right model depends on your research question and the bias-variance tradeoff, i.e., the interaction between the complexity of your model, the accuracy of predictions, and the ability of the model to make predictions on new data.
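A hedged R sketch of the contrast, using the built-in airquality data:
library(randomForest)
set.seed(1)
aq <- na.omit(airquality)
rf <- randomForest(Ozone ~ ., data = aq)   # bagged ensemble of regression trees
lmfit <- lm(Ozone ~ ., data = aq)          # simple linear regression for comparison
rf                  # out-of-bag error and % variance explained
importance(rf)      # relative importance of each predictor
summary(lmfit)      # interpretable coefficients, by contrast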
  • asked a question related to Regression Analysis
Question
5 answers
Dear Colleagues,
Which type of regression analysis is best for testing the effect of a treatment on single/multiple outcomes?
The dependent variable is continuous, such as thickness of the Achilles tendon (in mm).
The independent variable is categorical (treatment/no treatment).
Best regards
Relevant answer
Answer
Hello dear researcher, I agree with Martin's opinion.
good luck with your research
  • asked a question related to Regression Analysis
Question
2 answers
In some studies, in addition to structural equation modeling, regression analyses or mediating-variable tests are performed. Is this necessary?
Relevant answer
Answer
If the SEM model specifies all the same "cause" and "effect" paths linking the (latent) variables as are implied by a regression/mediation model, then the regression results are pretty much redundant. The researcher might choose to include the regression results if such results are more likely to be familiar to, and understood by, the audience for the report than SEM would be.
Sometimes, SEM models only include linkages for the indicators of each latent variable separately, with those results being used to inform the construction of multiple-indicator scales for the theoretical variables. In that case the regression modeling is done directly on the (observed) scales, and hence the results do not correspond to anything in the SEM.
  • asked a question related to Regression Analysis
Question
5 answers
Hi all,
We are working on EFA/CFA analysis on a dataset.
We chose 'randomly select the cases' in the SPSS 'Select Cases' tab. Our rationale is that we'll use the samples purely for factor analysis. We will not do any regression analysis, so simply splitting the sample at random seems appropriate.
We wonder if we should split the sample by controlling for a variable, such as gender. We are not sure what difference it will make. Is this action necessary for a straightforward factor analysis?
Many thanks.
Relevant answer
Answer
Hello Xiaoyi Shu
again, you did not answer why you want to split the sample. I can imagine that you want to adhere to the practice of cross-validation in the literature. This is the idea that when you change the model post hoc in the "calibration sample" and a check in the "validation sample" brings the same results, then the post-hoc change is supported.
This idea is one of many examples of how predictive modeling and causal modeling (SEM, CFA) became intertwined over the decades. Cross-validation recently became very potent in the context of machine learning (= predictive modeling), where you apply very flexible functions that can fit the hell out of a feature space (i.e., the pattern of relationships among all variables). Thus, you can fit complex, nonlinear and interactive relationships. This flexibility comes at the price of overfitting (fitting the noise). In this scenario, cross-validation is very potent, as the samples resulting from the split will not contain exactly the same permutations of all data points; hence the overfitting becomes visible, and the procedure makes sense.
In the realm of SEM/CFA (or any factor model), relationships are linear and thus not very prone to this kind of overfitting. The danger here is rather "fitting wrong structures" to the data (wrong parameters or factors). This will not change from sample to sample, and hence cross-validation will not provide evidence for the viability of your decisions. One can easily show by simulation that if you fit a wrongly specified model to the data (by tinkering) and repeat the exercise in a split part of the sample, you will get the same seemingly supporting result.
Christian's and my note that you only lose valuable N by splitting does not need special references, as it is the main implication of the role of N in the imprecision/uncertainty of any parameter estimation. Every introductory textbook on statistics will tell you that.
HTH
Holger
  • asked a question related to Regression Analysis
Question
2 answers
What are the key considerations, methodologies, and interpretive techniques for correctly applying and interpreting regression analysis in quantitative research, and how do they compare in terms of their accuracy, reliability, and suitability for different research contexts?
Relevant answer
Answer
The primary goal of regression analysis is to identify the significant predictors of a dependent variable. I have a YouTube video on how to carry it out and to interpret it.
  • asked a question related to Regression Analysis
Question
11 answers
According to these results, is the regression analysis significant?
Analysis of Variance
- SS = 1904905
- DF = 8
- MS = 238113
- F (DFn, DFd)= F (8, 17) = 1.353
- P value P=0.2843
Goodness of Fit
- Degrees of Freedom = 17
- Multiple R = 0.6237
- R squared = 0.389
- Adjusted R squared = 0.1015
Relevant answer
Answer
The p value for the overall F statistic indicates that the overall multiple correlation R (and R squared) value is not significantly different from zero at the .05 level (p is greater than .05). The reason for the non-significant result may be a small sample (as indicated by your degrees of freedom) and/or the presence of too many predictor variables. This also explains the large discrepancy between R squared and adjusted R squared. You are lacking statistical power to "detect" an effect.
  • asked a question related to Regression Analysis
Question
7 answers
I got 1.41 and 1.7 for two independent variables.
Relevant answer
Answer
Can you please show the output? And are you sure that you are talking about the standardized beta regression coefficient and not the unstandardized b coefficient? How many predictor variables?
And sample size shouldn't be an issue, since beta is just beta_x = b_x * sd(x)/sd(y).
  • asked a question related to Regression Analysis
Question
18 answers
Whenever I like an article in which regression analysis is used, I ask the authors if they can share some raw (!) data, because I'm writing a book and software about this topic, and I want to include very diverse real examples.
But, to my disappointment, practically nobody even reacts! Why?
Are people afraid that a new light on their data might disrupt their conclusions?
I thought openness was considered a virtue in the world of science?
But if I want to see articles that include data, I have to dig in the very old ones!
What are your thoughts?
P.S.: I can still use simple datasets from physics to psychology, from chemistry to sociology, anything...(just 1 independent variable, preferably with information about the measurement imprecision). Of course I quote you as the source. Thanks in advance!
Relevant answer
Answer
Christian Geiser , you wrote that "After all, the researchers spent time and money collecting the data; they don't want others to benefit "for free.""
They have a publication, don't they? So they did earn their pay.
Often researchers are paid by the public, they work in public universities, they serve the public. Public money is spent in their institution, their salary, their instruments, their consumables, and their publications (many journals take money!). And after all, the data - the hard stuff that counts most - are kept secret. Even when the publication is in journals that require that the data are made available (as per their Instructions for Authors).
Of course, some human-related (typically medical patient) data may not be made fully available for data protection reasons, if the data would allow a nearly unambiguous identification of individuals from the combination of subject-related data given (I think this is what you meant with possible IRB concerns). But this applies only to (some) clinical studies where a lot of variables are used, or to combinations of extremely rare features.
In my opinion, publication includes publication of data, not just arbitrary summaries and conclusions drawn from it. When the data are not available, then it should not be regarded as a scientific publication. We (the scientific community) should set the bar as high as reasonably achievable, and today this means that the data should be available (ideally following the FAIR principles [https://en.wikipedia.org/wiki/FAIR_data]), either via the supplement or via some dedicated public database.
I have also seen papers where the authors claimed that the data are published because the individual data points were shown in scatter plots. This is not publishing data; the data shown in scatter plots is extremely hard to use, often data points are overlapping, and multivariate relationships can not be recovered at all.
  • asked a question related to Regression Analysis
Question
7 answers
I need to test the relationship between two different variables. One of the variables is scored from 1 to 5 points, the other from 1 to 7 points. Does having different scale ranges cause an error in the correlation or regression analysis results? Can you recommend a publication on the subject?
Relevant answer
Answer
I agree. Except when cleaning and examining data prior to running the analysis, data on different scales are difficult to compare. My example is a real-world problem of two very different scales: straight percentages and graphs that are not standardized/normalized can be tough to eyeball. So much so that I've seen many people get it wrong (including in my example).
  • asked a question related to Regression Analysis
Question
10 answers
Is it possible to run a regression with both secondary and primary data in the same model? I mean, when the dependent variable is primary data sourced via questionnaire and the independent variable is secondary data gathered from published financial statements?
For example: if the topic is capital budgeting moderators and shareholders' wealth (SHW), the capital budgeting moderators are proxied by inflation, management attitude to risk, economic conditions and political instability, while SHW is proxied by market value, profitability and retained earnings.
Relevant answer
Answer
There should be a causal effect of the independent variables on the dependent variable in regression analysis. Primary data gathered through a questionnaire for the dependent variable would be influenced by current happenings, while independent variables based on secondary data were influenced by past or historical happenings. Therefore, there would not be true linkages between the independent variables and the dependent variable, and running a regression with both secondary and primary data in the same model would not give you the best outcome.
  • asked a question related to Regression Analysis
Question
4 answers
A dummy variable is a variable that takes specific numeric values for different attributes. Full rank.
Relevant answer
Answer
Suppose you need to estimate m coefficients in a linear regression model. The sum of all real variables and the dummy variables you have entered (m-1 plus the free coefficient) cannot be greater than the number of your observations, since you will be solving a system of linear equations when estimating the coefficients.
Exact equality is also undesirable, because in this case (assuming linear independence of rows and columns) you will get a single solution. Thus you must have at least one more observation than m.
If this sum is greater than the number of your observations, it means that the amount of data does not allow you to build such a detailed model.
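A small, hedged R illustration of how a categorical attribute expands into dummy columns, and why dropping one level keeps the design matrix full rank:
f <- factor(c("low", "normal", "high", "low", "high"))
model.matrix(~ f)   # intercept plus k - 1 = 2 dummy columns for a 3-level factor
# Including all k levels plus an intercept would make the columns linearly
# dependent (the "dummy variable trap"), i.e., the matrix would not be full rank.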
  • asked a question related to Regression Analysis
Question
7 answers
Basically I am looking at a dichotomous dependent variable and other variables which possibly predict it. The problem is that the odds ratio for each of these predictor variables changes depending on how many of them I have added into the binary logistic regression analysis. When I just look at one or two of them in the regression analysis, the odds ratios seem more accurate. Any advice would be much appreciated.
Relevant answer
Answer
Hi Kevin!
This is not an answer, but I have the same doubts, I guess, if I understood your query correctly. When I add all my independent variables in the same analysis window for computing logistic regression with a categorical dependent variable, the significance of the beta coefficients for each variable differs from when I analyze them individually. In effect, I am getting differing predictors for my DV depending on whether I have clubbed the IVs together or analyzed them separately. I am confused! If you found your answers, please help me understand what this means. Thank you! Kevin Glynn
  • asked a question related to Regression Analysis
Question
2 answers
You can read my analysis about that
Relevant answer
Answer
Please read my idea about it and send me your opinion about IEAC(R)
  • asked a question related to Regression Analysis
Question
2 answers
Since I found out that there is a correlation between Timeliness and Semantic Accuracy (I'm studying linked data quality dimension assessment, trying to evaluate one quality dimension, in this case Timeliness, from another dimension, Semantic Accuracy), I presumed that regression analysis is the next step in this matter.
- The Semantic Accuracy formula I used is: msemTriple = |G ∧ S| / |G|
msemTriple measures the extent to which the triples in the repository G (the original LOD dataset) and in the gold standard S have the same values.
- The Timeliness formula I used is:
Timeliness(de) = 1 - max{1 - Currency(de)/Volatility(de), 0}
where:
Currency(de) = (1 - (lastModificationTime(de) - lastModificationTime(pe)) / (currentTime - startTime)) * Ratio
(the Ratio measures the extent to which the triples in the LOD dataset (in my case Wikidata) and in the gold standard (Wikipedia) have the same values)
and
Volatility(de) = (ExpiryTime(de) - InputTime(de)) / (ExpiryTime(pe) - InputTime(pe))
(de is the entity document of the datum in the linked data dataset and pe is the corresponding entity document in the gold standard).
NB: I worked on Covid-19 statistics per country as a dataset sample, precisely Number of cases, recoveries and deaths
Relevant answer
Answer
  • asked a question related to Regression Analysis
Question
7 answers
I have performed a hypothesis testing using the simple regression analysis model. What action must I take after testing the hypothesis?
Relevant answer
Answer
I think, if THIS is unclear to you, you should take some statistics classes and consult a local statistician. If you don't know what to do after you got your results, I would presume that you also did not know what to do before the analysis. No offense!
  • asked a question related to Regression Analysis
Question
3 answers
I have a dataset with a spatial map with census tracts and nearest distances to nearby hospitals and cities. I need advice on how to process this data for regression analysis and generate maps in ArcGIS (to see the correlation). If you have a guide, that would be great. Thanks.
Relevant answer
Answer
ArcGIS has Geographically Weighted Regression (GWR) which can be used to regress a dependent variable over independent variables. It also takes into account spatial autocorrelation effects.
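For the R side, a hedged sketch with the spgwr package (ArcGIS has its own GWR tool; the data, coordinates, and variable names here are hypothetical):
library(spgwr)
xy <- cbind(dat$lon, dat$lat)
bw <- gwr.sel(outcome ~ dist_hospital, data = dat, coords = xy)   # choose a bandwidth
fit <- gwr(outcome ~ dist_hospital, data = dat, coords = xy, bandwidth = bw)
fit   # local coefficients, which can then be mapped (e.g., in ArcGIS)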
  • asked a question related to Regression Analysis
Question
5 answers
After collecting my data, I decided to test my hypothesis using regression analysis. I also recognized the fact that my data must meet the assumptions of the tool before I can use it. Therefore, I would like to know: after testing two assumptions and finding that the data met them, can I just run the analysis, or must I test all the assumptions first?
Relevant answer
Answer
Can you be more specific please, and tell us something about your data and expectations?
  • asked a question related to Regression Analysis
Question
10 answers
In a psychology study of N = 149, I was testing for moderation using a three-step hierarchical regression analysis using SPSS. I had two independent variables, X1 and X2, an outcome variable, Y, and the moderator, M. Step 1 uses the variables X1, X2, the interaction X1X2, and 5 covariates. Step 2 adds M. Step 3 adds the interaction variables X1M and X2M.
In my collinearity statistics, VIF is all under 10 for Steps 1 & 2 (VIF of 6 is found between X2 and X1X2 in both steps). For Step 3, VIF is high for X1, X2, M, X1M, and X2M. When I go look at the collinearity diagnostics box, the variance proportions are high for the constant, X1, M, and X1M. I'm understanding that there is multicollinearity.
My question is, what does it mean when the constant shows a high VIF? What would it mean if only one predictor variable and the constant coefficient were collinear?
Relevant answer
Answer
Thanks, I will review
  • asked a question related to Regression Analysis
Question
6 answers
This question is concerned with understanding the degree and direction of association between two variables, and is often addressed using correlation or regression analysis.
Relevant answer
Answer
It (relationship) refers to the degree and direction of association or dependence between them. It seeks to understand how changes in one variable are related to changes in another variable.
Correlation analysis is a statistical technique used to measure the strength and direction of the linear relationship between two continuous variables. It provides a numerical value, called the correlation coefficient, which ranges between -1 and +1. A correlation coefficient of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. Correlation coefficients closer to 1 or -1 indicate stronger associations, while coefficients closer to 0 indicate weaker associations.
Regression analysis, on the other hand, is a statistical technique that examines the relationship between a dependent variable and one or more independent variables. It allows for quantifying the impact of the independent variables on the dependent variable and estimating the regression coefficients. Regression analysis can determine not only the direction but also the magnitude and statistical significance of the relationship between the variables.
In simple linear regression, the relationship between two variables is modeled using a straight line. The slope of the line represents the change in the dependent variable associated with a unit change in the independent variable. The intercept term represents the expected value of the dependent variable when the independent variable is zero.
In multiple linear regression, the relationship is extended to include multiple independent variables. It allows for examining the individual effects of each independent variable while controlling for the effects of other variables.
Correlation does not imply causation. A strong correlation between two variables does not necessarily mean that one variable causes the other. Causal relationships require further investigation through experimental designs or other rigorous methods.
Good luck!!
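A minimal R sketch contrasting the two techniques on the built-in cars data:
cor(cars$speed, cars$dist)          # correlation: unitless strength and direction
fit <- lm(dist ~ speed, data = cars)
coef(fit)                           # regression: intercept and slope in data units
summary(fit)$r.squared              # for simple regression, the squared correlation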
  • asked a question related to Regression Analysis
Question
4 answers
I am using SPSS version 28 and I want to know how to run the regression analysis. I have one dependent variable (BMC) and one independent variable (MVPA). I have two other variables, age and height, and I don't know how to adjust for them in SPSS. Are these independent variables too? There is no covariate box in the SPSS version I am using.
Relevant answer
Answer
What are "bmc" and "mvpa"?
Why would there be any linear dependency on age and height?
  • asked a question related to Regression Analysis
Question
4 answers
Hello, 
I'm working on a panel multiple regression, using R. 
And I want to deal with outliers. Is there a predefined function to do that?
If yes, would you please give me an example of how to use it?
Relevant answer
Answer
Chuck A Arize Removing is a bad idea, unless you are absolutely sure that those data points are bad measurements.
Nonlinear transformations are not a good idea either.
  • asked a question related to Regression Analysis
Question
9 answers
Of late, some journal editors are insistent on authors providing a justification for the ordering of the entering of the predictor variables in the hierarchical regression models.
a) Is there a particular way of ordering such variables in regression models?
b) Which are the statistical principles to guide the ordering of predictor variables in regression models?
c) Could someone suggest the literature or other bases for decisions regarding the ordering of predictor variables?
Relevant answer
Answer
Jochen Wilhelm some instances use the word "hierarchical" when not all variables are entered together, but separately or in blocks of variables. For the former, automatic algorithms may be used (but not recommended, as you know), like stepwise regression, whereas for the latter the researcher typically decides which (blocks of) variables are entered in which order (blockwise regression).
As Bruce Weaver already mentioned, this is done very frequently in psychology, although it may be questioned whether this is necessary or useful, or just a habit because everyone does it. In most psychological papers, in such cases they are only interested in the delta R^2, for example to show that the increase in explained variance of an interaction term is significant (but miss that this could also be done by the squared semipartial correlation, and no blockwise regression is needed).
I remember a paper (I believe about job satisfaction or something similar) where they entered sociodemographic variables in a first block, typical predictors in a second block, and new predictor variables in a third block, to show that the typical predictors explain more than sociodemographic variables alone, with the third block showing that the new predictors explain variance over and above the former ones. (In light of causality, I would cast some doubt on the results....)
But you are right, within each block and for the full model, the order of the variables do not have any meaning.
Does it help?
  • asked a question related to Regression Analysis
Question
2 answers
Regression Analysis
Relevant answer
Answer
What is the meaning of "categorical with more than 2 (ordinal)"?
But still, if a variable is categorical and has more than two categories, you should use that variable as a dummy variable or indicator variable. Mahfooz Alam
  • asked a question related to Regression Analysis
Question
6 answers
Greetings of peace!
My study is about the effect of servicescape on the quality perception and behavioral intentions
independent Variable-Under servicescape there are 4 indicators
Layout Accessibility - 10 items
Ambience condition- 3 items
Facility Aesthetics - 6 items
Facility cleanliness -4 items
Quality perception serve as mediator with 3 items
Dependent Variable-Behavioral intentions - 4 items
All were measured using a Likert scale (N = 400).
I tried ordinal regression analysis, but I don't know how to combine the items, and the independent variable is ordinal. Also, the p value of the Pearson test is <0.001 and that of the Deviance test is 1.000.
I need to get the effect of the individual servicescape indicators on quality perception and behavioral intentions.
Thank you in advance
Relevant answer
Answer
There are a few options with ordinal outcomes, but treating the predictors (IVs) as ordinal is also trickier. As David L Morgan noted, you can do this with some SEM software. You could also use ordinal logistic regression, treating the predictors as continuous, dummy coding them, or treating them as monotonic (the latter only available in the R brms package as far as I'm aware: https://cran.r-project.org/web/packages/brms/vignettes/brms_monotonic.html ).
It's worth noting that these are all parametric models; it's just that they don't assume a normal distribution of residuals in the model.
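To make the continuous-scoring option concrete, here is a minimal, hedged R sketch using MASS::polr (a proportional-odds model), assuming hypothetical variable names and that each servicescape indicator has been scored as the mean of its items:
library(MASS)
dat$intention <- factor(dat$intention, ordered = TRUE)   # 6-category ordinal DV
fit <- polr(intention ~ layout + ambience + aesthetics + cleanliness,
            data = dat, Hess = TRUE)
summary(fit)     # one coefficient per servicescape indicator
exp(coef(fit))   # proportional odds ratios for each indicator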
  • asked a question related to Regression Analysis
Question
2 answers
The variable physical environment effect is only a subset of the independent variable (environmental factors) in my research; there are social and cultural environment effects as well. They are measured in my questionnaire with five questions, and the responses are: never, rarely, often and always. The dependent variable, student performance, was also measured in the same format as the environmental factors (i.e., with five questions and never, rarely... as the responses). I have coded them into SPSS with the measure Ordinal. I want to answer the research questions: 1. How does the physical environment affect student performance? 2. How does the social environment affect student performance? 3. To what extent does the cultural environment influence student performance? I've computed the composite score (mean) for the questions; can I use these scores in the ordinal regression analysis? Or is there another way to combine the questions into a single variable, for both the independent and dependent variables?
Relevant answer
Answer
In your study, where you have measured the effects of physical, social, and cultural environments on student performance using ordinal scales, you can use ordinal regression analysis to answer your research questions.
To conduct the ordinal regression analysis, you do not necessarily need to compute a composite score for the questions. Instead, you can use each question as a separate predictor in the analysis. Each question represents a different aspect of the environment (physical, social, or cultural), and using them as separate predictors allows you to assess the unique contribution of each aspect to student performance.
Here's a general outline of the steps you can follow:
  1. Prepare your data: Ensure that your data is properly coded and formatted in SPSS. Make sure the independent and dependent variables are coded as ordinal variables.
  2. Run ordinal regression analysis: In SPSS, go to "Analyze" -> "Regression" -> "Ordinal..." and select the dependent variable (student performance) and the independent variables (physical environment, social environment, and cultural environment). Specify the appropriate link function (e.g., logit or probit) based on the distributional assumptions of your data.
  3. Interpret the results: Examine the coefficient estimates, their significance levels, and the odds ratios associated with each independent variable. These results can provide insights into the effects of physical, social, and cultural environments on student performance.
By using each question as a separate predictor in the ordinal regression analysis, you retain the specificity and granularity of the different aspects of the environment being studied. This approach allows you to explore the individual effects of physical, social, and cultural environments on student performance.
Keep in mind that ordinal regression assumes proportional odds, which means it assumes that the relationship between the predictors and the outcome is consistent across different levels of the outcome variable. It's important to assess this assumption, for example, by conducting tests of parallel lines.
Additionally, be cautious when interpreting the results as causal relationships. While ordinal regression can help identify associations between variables, establishing causal relationships requires rigorous experimental designs or the consideration of other potential confounding factors.
Overall, by using ordinal regression analysis and examining the effects of different environmental factors on student performance separately, you can address your research questions and gain insights into the influence of physical, social, and cultural environments on student performance.
  • asked a question related to Regression Analysis
Question
4 answers
Hi,
I want to predict the traffic vehicle count at different junctions in a city. Right now, I am modelling this as a regression problem, so I am scaling the traffic volume (i.e., count of vehicles) between 0 and 1 and using these scaled attributes for regression analysis.
As a part of Regression Analysis, I am using LSTM, where I am using Mean Squared Error (MSE) as the loss function. I am converting the predicted and the actual output to original scale (by using `inverse_transform`) and then calculating the RMSE value.
But, as a result of the regression, I am getting the output variable in decimals (for example 520.4789), whereas the actual count is an integer (for example 510).
Is there any way, where I will be predicting the output in an integer?
(i.e my model should predict 520 and I do not want to round off to the nearest integer )
If so, what loss function should I use?
Relevant answer
Answer
If you want your regression model to predict integer values instead of decimal values, you can modify your approach by treating the problem as a classification task rather than regression. Instead of scaling the traffic volume between 0 and 1, you can map the integer values to a set of discrete classes. For example, you can define different classes such as 0-100, 101-200, 201-300, and so on, and assign each traffic volume to the corresponding class. Then, you can use a classification model like a neural network with softmax activation and categorical cross-entropy loss function to predict the class of each traffic volume. This way, your model will output integer predictions representing the class labels rather than decimal values.
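As a hedged illustration of this binning idea, with nnet::multinom standing in for the neural classifier (variable names and bin widths are hypothetical):
library(nnet)
dat$class <- cut(dat$count, breaks = seq(0, 1000, by = 100))  # bins: 0-100, 101-200, ...
fit <- multinom(class ~ hour + junction, data = dat)          # softmax-style classifier
predict(fit, newdata = dat[1:5, ])                            # predicts a count bin, not a decimal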
  • asked a question related to Regression Analysis
Question
3 answers
If the value of the correlation is insignificant or negligible, should we run the regression analysis or not? Obviously it will be insignificant; is it necessary to mention this in the article?
Relevant answer
Answer
"Significance" is not what should concern you. It is a function of sample size, and even if you compare x-variables with the same sample size in a multiple regression, issues such as collinearity could foul such comparisons. Of course if your sample size is too small, you won't be able to discover much of anything. But you might try plotting the points with a regression "line," and put curves around it using the estimated variance of the prediction error. Don't forget heteroscedasticity. I know SAS does a good job of this. You could put predicted-y on the x-axis and the y-values on the y-axis (or estimated residuals on the y-axis when looking for heteroscedasticity, which is natural, associated with the predicted-yi as a size measure).
Penn State has some good introductory material on this, though they did not include heteroscedasticity, the last time I looked. You can find information from Penn State by searching on a term, and including "Pennsylvania State University" in the search.
Best wishes.
  • asked a question related to Regression Analysis
Question
27 answers
Are correlation and regression analysis part of descriptive or inferential statistics?
Relevant answer
Answer
Descriptive statistics is a vague term, usually applied to describing the properties of a single variable. However, there is no law that says that a correlation cannot be a descriptive statistic. We might make the distinction between descriptive and hypothesis-testing statistics, for example.
I would imagine that a table of the intercorrelations of the items of a scale would be descriptive rather than hypothesis testing, for example.
But there's no legal definition, so no worries!
  • asked a question related to Regression Analysis
Question
5 answers
Hello researchers,
I am facing a problem doing a regression analysis with three independent variables, one mediating variable, and one dependent variable. How can I do this in SPSS? Can anyone please help me?
Relevant answer
Answer
Hello again Md.,
I would recommend you try either path analysis (available in any SEM package) or the associated multiple linear regression models.
The SEM model would have these proposed relationships:
1. IV1 -> Med
2. IV2 -> Med
3. IV3 -> Med
4. Med -> DV
Good luck with your work.
  • asked a question related to Regression Analysis
Question
3 answers
When doing a regression analysis, the coefficients table in SPSS shows that my 3 main effects are significant. When I do a regression analysis for my 6 moderating effects, for which I created interaction terms, the coefficients table also shows these are significant. But when I include the 3 main effects and 6 moderating effects in the regression analysis at the same time, none is significant. How should I interpret this? And how should I continue?
Relevant answer
Answer
Let's simplify things a bit and consider two models with X1 and X2 as the only explanatory variables, both quantitative.
1) Y = b0 +b1X1 + b2X2 + error
2) Y = b0 +b1X1 + b2X2 + b3X1X2 + error
In model 1:
  • b1 shows the effect (on the fitted value of Y) of increasing X1 by one unit while holding X2 constant (at any value you wish)
  • b2 shows the effect (on the fitted value of Y) of increasing X2 by one unit while holding X1 constant (at any value you wish)
But those interpretations of b1 and b2 do not work for model 2. In model 2:
  • b1 shows the effect (on the fitted value of Y) of increasing X1 by one unit while holding X2 constant at a value of 0
  • b2 shows the effect (on the fitted value of Y) of increasing X2 by one unit while holding X1 constant at a value of 0
Putting it another way, b1 and b2 in model 2 are like simple main effects in a two-way ANOVA model, not like overall main effects. Some authors describe them as "main effects", but I think that can cause confusion, so I prefer to call them first-order effects. HTH.
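To see this concretely, here is a small simulated sketch in Python (the variable names and coefficients are made up): with the product term included, the first-order coefficients refer to the other predictor being 0, and mean-centering moves that reference point to the average, which usually restores sensible-looking coefficients:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"X1": rng.normal(5, 1, 200), "X2": rng.normal(10, 2, 200)})
df["Y"] = 1 + 0.5*df.X1 + 0.3*df.X2 + 0.2*df.X1*df.X2 + rng.normal(0, 1, 200)

# Model 2: b1 is the effect of X1 *at X2 = 0*, which may lie far
# outside the observed data and hence look "non-significant".
m2 = smf.ols("Y ~ X1 * X2", data=df).fit()   # X1*X2 expands to X1 + X2 + X1:X2

# Mean-centering makes b1 the effect of X1 at the average X2.
df["X1c"] = df.X1 - df.X1.mean()
df["X2c"] = df.X2 - df.X2.mean()
m2c = smf.ols("Y ~ X1c * X2c", data=df).fit()
print(m2.params, m2c.params, sep="\n")
```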
  • asked a question related to Regression Analysis
Question
3 answers
Examining some students on their final-year project defences, I discovered that a student's adjusted R² in the regression analysis of her work was greater than 99%. Could that be possible?
Relevant answer
Answer
Hello Ibikunie,
1. Can an adjusted R-squared exceed 0.99? Yes.
2. Can an adjusted R-squared exceed the associated, unadjusted R-squared? No, though they can be equal (as pointed out by Debopam Ghosh )
Here's one version of the most commonly used formula for adjusted R-squared:
Adjusted R2 = 1 – [(1-R2)*(n-1)/(n-k-1)]
where:
R2 = unadjusted/observed R-squared
n = sample size used in model
k = number of IVs included in the model
Note that the ratio (n - 1) / (n - k - 1) will always be greater than one for k = 1, 2, ..., n - 2. So, unless the unadjusted R-squared = 1, the adjusted R-squared will be less than the unadjusted R-squared. As n increases relative to k, the difference between the two values decreases as well.
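As a quick numeric illustration of that formula (the values are made up):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A very high unadjusted R2 with a decent n/k ratio easily
# yields an adjusted R2 above 0.99:
print(adjusted_r2(r2=0.995, n=100, k=3))  # about 0.9948
```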
Good luck with your work.
  • asked a question related to Regression Analysis
Question
3 answers
This question is concerned with determining whether two or more groups differ in some meaningful way on a particular variable or set of variables, and is often addressed using statistical tests such as t-tests, ANOVA, or regression analysis.
Relevant answer
Answer
This is an extremely generic question: you need to specify and provide detail.
  • asked a question related to Regression Analysis
Question
11 answers
Can you give all the criteria to evaluate the forecasting performance of the regression estimators?
Relevant answer
Answer
To check how good your regression model is, you can use the following metrics:
  1. R-squared: indicates the proportion of the variance in the dependent variable that is explained by the model.
  2. Average error (e.g., mean absolute error or root mean squared error): the typical magnitude of the difference between the predicted and the actual values.
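For what it's worth, a minimal sketch of computing these metrics with scikit-learn (the toy arrays are placeholders):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # actual values
y_pred = np.array([2.8, 5.4, 7.0, 9.5])    # model forecasts

print("R-squared:", r2_score(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
```

Other common criteria include MAPE, AIC/BIC for comparing models, and out-of-sample (holdout or cross-validated) versions of all of the above.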
  • asked a question related to Regression Analysis
Question
3 answers
I am performing a cross-country regression analysis with a sample of 101 countries. Most of my variables are averages of annual data across a period of 7 years. Every one of my primary variables has data available in each of these 7 years. However, certain countries have data missing in certain years for variables used in my robustness checks.
How should I handle this missing data for each robustness variable? Here are a few ideas I have considered
A. Average data for each country, regardless of missing years
B. Exclude any country with any missing years from data for that respective variable
C. Exclude countries that are missing data up to a certain benchmark, perhaps removing countries that are missing more than 2 or 3 of the 7 years that are being averaged for that respective regressor
D. Only use robustness variables that have available data for every country in every year that is being averaged
Please offer the best solution and any other solutions that would be acceptable.
Relevant answer
Answer
Using multiple imputation or full information maximum likelihood (FIML) would probably be your best options. Under the assumption of missing at random (MAR) data, these techniques allow you to include all available data points in your analyses. Most other techniques either lead to a loss of data (and therefore statistical power), make more restrictive assumptions about the missing data mechanism, or both. See
Enders, C. K. (2022). Applied missing data analysis (2nd ed.). Guilford Press.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. https://doi.org/10.1037/1082-989X.7.2.147
  • asked a question related to Regression Analysis
Question
13 answers
Dear fellows,
Maybe you have done interesting measurements to test some model?
I can always use such data as examples and tests for my regression analysis software, and it's a win-win, since I might give you a second opinion on your research.
It's important that I also get the imprecision (measurement error/confidence interval) on the independent and dependent variables. At this moment, my software only handles one of each, but I'm planning to expand it for more independent variables.
Thanks in advance!
Relevant answer
Answer
Carlos Araújo Queiroz I don't see a dataset, or am I missing something?
  • asked a question related to Regression Analysis
Question
4 answers
Assumptions of multinomial and linear regression analysis?
Relevant answer
Answer
Multinomial logistic regression is a statistical method in which a single categorical variable is predicted using one or more other factors; it also quantifies the numerical relation between such variable pairs. The target (outcome) variable should be categorical. Linearity, independence, and the absence of outliers and multicollinearity are among the assumptions of multinomial logistic regression.
A linear relationship, a zero conditional mean of the error terms, a Gaussian distribution of the error terms, homoscedasticity of the error terms, and the absence of outliers, multicollinearity, and autocorrelation of the error terms are among the assumptions of linear regression.
Hope this is useful.
  • asked a question related to Regression Analysis
Question
4 answers
In finding the correlation and regression of multivariable distribution what is the significance of R and R^2? What is the main relation between them?
Relevant answer
Answer
R represents the correlation coefficient between two variables in a multivariable distribution. It measures the strength and direction of the linear relationship between the two variables. R ranges from -1 to +1, where -1 indicates a perfect negative correlation, +1 indicates a perfect positive correlation, and 0 indicates no correlation.
R^2, on the other hand, represents the coefficient of determination. It measures the proportion of variance in one variable that is predictable from the other variable(s) in a multivariable distribution. R^2 ranges from 0 to 1, where 0 indicates no variance in the dependent variable is explained by the independent variable(s), and 1 indicates that all the variance in the dependent variable is explained by the independent variable(s).
The main relationship between R and R^2 is that R^2 is the square of R (equivalently, R is the square root of R^2). R^2 is the proportion of variance in the dependent variable that is explained by the independent variable(s), and R is the correlation coefficient between the dependent variable and the predicted values based on the independent variable(s). Therefore, R^2 is a measure of how well the regression line fits the data, while R is a measure of the strength and direction of the linear relationship between two variables.
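A quick numerical check of that relationship for simple (one-predictor) regression, with simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)

r = np.corrcoef(x, y)[0, 1]  # correlation coefficient R
r2 = r ** 2                  # coefficient of determination R^2

# With one predictor, R^2 equals the squared correlation exactly.
print(r, r2)
```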
  • asked a question related to Regression Analysis
Question
3 answers
Hello,
I am doing a multiple regression with 2 predictors. The predictors correlate moderately/strongly, r = 0.45. When the first predictor is put in a regression analysis on its own, it explains 16.8% of the variance of the dependent variable. The second predictor on its own explains 17.5% of the variance of the dependent variable. When both predictors are put into the regression analysis, the VIF = 1.26, so multicollinearity should not be a problem. The predictors together explain 23.4% of the variance of the dependent variable.
First of all, I would like to ask whether the change in explained variance from 16.8-17.5% to 23.4% is a big change; more specifically, whether the predictors together are better at predicting the dependent variable than either one alone. Also, as the predictors correlate but the VIF is okay, is it safe to say that they probably explain some of the same parts of the variance in the dependent variable, i.e., that each predictor explains little unique variance?
Relevant answer
Answer
Have you compared the two univariate b's with their values in the full regression? Unstable estimates of these coefficients are the main problem in multicollinearity.
  • asked a question related to Regression Analysis
Question
3 answers
I would like to create a factor of the interaction effect of two variables for regression analysis.
I was wondering how to create the factor.
I was thinking of multiplying the scores of the two, but I would like to hear from other researchers. Thank you.
Relevant answer
Answer
I suggest you try running the regression I suggested (two primary variables plus the multiplicative interaction effect) and see what happens.
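If it helps, the usual construction is simply the product of the (ideally mean-centered) scores; a minimal pandas sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2.0, 1.5, 3.0, 2.5]})

# Mean-center first so the product term is less collinear with A and B,
# then multiply the scores to form the interaction factor.
df["AxB"] = (df.A - df.A.mean()) * (df.B - df.B.mean())
```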
  • asked a question related to Regression Analysis
Question
3 answers
Suppose I want to predict energy consumption for my building using regression analysis. What factors should I consider including in my model, and how can I determine their relative importance?
Relevant answer
Answer
It depends a little on the way your model will be designed and what data are available.
In one of the simplest approaches you would have a mean energy consumption per human/animal/anything and simply multiply by the number of individuals.
But I think that is not your intention.
Probably you have a data set with different independent variables and you want one final response that is not as simply related as in my example above.
There are a couple of standard software tools available for multiple (linear/non-linear) regression. I would prefer R and models of the lm type. Pretty good manuals are found around the web. But you may choose whatever you want; the results should be the same.
An interesting introduction to the math behind it is given in a paper by the US Geological Survey.
However, depending on the specification of your problem, it might be interesting to have a closer look at some other statistics.
When you can define specific groups (like high energy consumption time, low energy consumption time, medium, extraordinary, ...), a linear discriminant function analysis might be an interesting choice. It allows identifying specific variables that are characteristic (high contribution) for a predefined group.
A factor analysis might also be a good choice to identify higher/lower variable contributions.
If you are not strictly required to present the contributions of the variables, a random forest analysis might be interesting. This machine learning classifier can also be run as a regression model. You get less insight into the variables' contributions, but the results can be more precise.
Nevertheless, it depends a bit on the available data. Did you already prepare a pairs plot and the correlations among the variables? This may also give a first impression.
Jan
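For readers who end up in Python rather than R, here is a minimal analogue of the lm-type model mentioned above, with hypothetical building variables (statsmodels' formula interface mirrors R's):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("building_energy.csv")  # hypothetical data file

# Multiple linear regression of consumption on candidate drivers;
# the coefficient t-statistics give a first hint at relative importance.
model = smf.ols("energy_kwh ~ temp_c + occupancy + floor_area_m2",
                data=df).fit()
print(model.summary())
```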
  • asked a question related to Regression Analysis
Question
3 answers
I am using fixed-effects panel data with 100 observations (20 groups), 1 dependent and 3 independent variables. I would like to get a regression output from it. My question is: is it necessary to run any normality test and linearity test for panel data? And what difference would it make if I don't go for these tests?
Relevant answer
Answer
Rede Ganeshkumar Dilipkumar So, to test the hypothesis about the relationship between regressor and regressand, we surely need a normality test, right? Often, in the causal-relationship approach, we have to establish whether the alternative hypothesis or the null hypothesis answers the question of the regressor's influence on the regressand. So after we find the best-fitting regression model (pooled, fixed, or random effects), we continue by testing the influence of the regressor on the regressand. Are you saying that this hypothesis testing surely needs a normality test and the other assumption tests? Please provide a recommended theory or reference to strengthen your argument. Thank you for the enlightenment.
  • asked a question related to Regression Analysis
Question
4 answers
Situation: the moderating variable can explain up to 25 percent, while the remaining 75 percent is explained by other factors outside the model. What does this mean? Or would this mean the moderating variable did not significantly moderate the relationship between the IV and DV? Thank you to anyone who responds!
Relevant answer
Answer
Percent increase in explained variance (R square change) is not a very intuitive way to assess the practical significance of an interaction/moderation effect in my opinion. The percent increase could be small despite the fact that the interaction may be important. If the interaction is statistically significant, you should plot the regression lines for different meaningful values of the moderator to interpret the meaning and practical relevance of the interaction effect.
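A minimal sketch of such a simple-slopes plot (the coefficients are invented, and the moderator is assumed standardized so that -1, 0, +1 correspond to -1 SD, the mean, and +1 SD):

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical fitted model: y = b0 + b1*x + b2*m + b3*x*m
b0, b1, b2, b3 = 1.0, 0.4, 0.2, 0.3
x = np.linspace(-2, 2, 50)

# One regression line of y on x per meaningful moderator value.
for m in (-1, 0, 1):
    plt.plot(x, b0 + b1*x + b2*m + b3*x*m, label=f"moderator = {m}")
plt.xlabel("IV"); plt.ylabel("Predicted DV"); plt.legend(); plt.show()
```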
  • asked a question related to Regression Analysis
Question
10 answers
Propensity score matching (PSM) and Endogenous Switching Regression (ESR) by full information maximum likelihood (FIML) are most commonly applied models in impact evaluation when there are no baseline data. Sometimes, it happens that the results from these two methods are different. In such cases, which one should be trusted the most because both models have their own drawbacks?
Relevant answer
Answer
What is the advantage of PSM over ESR?
  • asked a question related to Regression Analysis
Question
6 answers
I would like to know if I am wrong in doing this. I made quartiles out of my independent variable and from those I made dummy variables. When I do linear regression, I have to record the betas with 95% CIs per quartile per model (I adjust my model 1 for age and sex). Can I enter all the dummies into the model at the same time, or do I have to enter them separately (while also adjusting for age and sex, for example)?
So far I have entered all the dummies and adjusted for age and sex at the same time, but now I wonder whether SPSS doesn't adjust for the second and third dummy variables. So I think I need to redo my calculations and just run my models with one dummy in each.
Thank you. 
Relevant answer
Answer
What you are looking for is plain linear regression with the continuous variable itself. The good news is that linear regression is quickly done and easy to interpret. It will also give you more statistical power, as by categorizing you lose information.
And don't worry, I've seen this categorization nonsense done by seasoned professors (and sometimes forced upon their students). It's from a past era, when you literally had to crunch the numbers with pencil and paper, which is easier with categories.
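To make the contrast concrete, here is a simulated sketch of both specifications (the C(...) term enters all quartile dummies simultaneously, with one quartile as the reference, so entering them all at once is fine; the continuous model is the recommendation above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.normal(size=300),
                   "age": rng.integers(20, 70, 300),
                   "sex": rng.integers(0, 2, 300)})
df["y"] = 0.5*df.x + 0.02*df.age + rng.normal(size=300)

# Quartile dummies, adjusted for age and sex in the same model.
df["x_q"] = pd.qcut(df.x, 4, labels=False)
m_dummies = smf.ols("y ~ C(x_q) + age + sex", data=df).fit()

# Keeping x continuous retains the information lost by categorizing.
m_continuous = smf.ols("y ~ x + age + sex", data=df).fit()
print(m_dummies.params, m_continuous.params, sep="\n")
```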
  • asked a question related to Regression Analysis
Question
2 answers
Hello everyone,
I am currently working on my thesis but I have encountered a problem and I am not sure how to solve it. I would like to measure the impact that ESG (Environmental, Social, Governance) has on financial performance (ROA, ROE) from 2016 to 2021. Some important details about my study:
  • I would like to compare two samples of companies: One first group with ESG part of the DJ Sustainability Index (DJSI) and another group without ESG (no part of DJSI).
  • I intend to analyze companies that have been part of the DJSI between 2016 and 2021. However, some companies don't have an ESG score (independent variable) for some years. Should I still collect information for my dependent variables for all the years? For example, company X has ESG scores for 2016 and 2017 only, would I need data for ROA and ROE for all the years or just for 2016 and 2017?
  • Any other aspects I should consider?
Thanks!!
Relevant answer
Answer
Hi,
In any regression output there is an ANOVA statistic; you have to check it in the regression options provided. If that is the case, then a multivariate analysis is suitable, after which you can run the regression. Your study has sub-variables, if I understood you well; if you want to treat your dependent variable as a single variable while your independent variable has sub-variables, you can examine their combined effects.
In a situation where some companies do not have the ESG score, you could run a Student's t-test, since those companies still have financial-performance measures (the dependent variables). You would then ascertain whether there is a significant difference in the financial performance of companies that have an ESG score and those that do not.
  • asked a question related to Regression Analysis
Question
12 answers
Hi,
I conducted a survey where all the question items corresponded to the components of each variable I wanted to measure. I consolidated these components following a literature review. I built the survey with 5-point Likert scale questions.
I'm measuring the impact of an independent variable on a dependent variable, and I am considering a linear regression analysis.
My question is, how can I combine all those components of the independent variable into one? I have read that averaging is a frequently used method if all components measure the same thing, but I am a bit worried that the components may not have the same weight or influence in determining the outcome.
Would you have any recommendations on this topic? I'm happy to read any research articles.
Thanks for your help
Relevant answer
Answer
You can use factor analysis to group these items
  • asked a question related to Regression Analysis
Question
4 answers
Hi,
I have a dependent variable that represents ridership on each street. The numbers are bucketed to the nearest 50, so the values are 50, 100, 150, and so forth.
My independent variable is also discrete: 1, 2, 3, 4, etc., representing street continuity.
Would it be appropriate to execute a linear regression analysis to see whether there is a correlation between these two variables?
Note that I will execute the analysis on multiple cities.
Relevant answer
Answer
"Would it be appropriate to execute linear regression analysis ... ?"
The answer depends on what you are trying to do, what you are trying to estimate and how much model-development you are prepared to do in the search for meaning and "optimal estimates".
In the present context what you could do is a simple linear estimation by least squares, but ignore the usual summary statistics that produce "error-bars" for estimates. Keep only a single statistic. If you had only a single independent variable this could be just the regression coefficient. For multiple independent variables (as you later mention) you could use the reduction in the sum-of-squares from the regression. Then, to get a test for whether there is a real statistical association between your observed values, you could apply the principle of permutation testing, whereby you evaluate the same summary statistic from exactly the same algorithm applied to randomised versions of the original dataset. Here the randomisation provides a representation of the case of "no statistical association". The randomisation can be applied by doing a random permutation of the column of data for the dependent variable.
The principle here is that you can construct any measure of association you like and obtain a valid statistical inference by a first-principles argument involving randomisation. You mention "multiple cities". Notionally, it is just a matter of constructing an overall measure of association to summarize across all cities and then doing the randomisation separately for each city (but within the same step).
Note that the above centres on the question "to see if there is a correlation between these two variable". If you really want some sort of predictive model, or are extremely concerned about extracting as much information as possible from the data, then you would need to develop a full statistical model and this would take you well away from a simple least squares analysis.
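A minimal sketch of that permutation idea for a single independent variable, using the regression sum-of-squares reduction as the summary statistic (the data values are made up):

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=5000, seed=0):
    """Permutation p-value for association, with the reduction in the
    residual sum of squares from the regression as the statistic."""
    rng = np.random.default_rng(seed)

    def ss_reduction(xv, yv):
        slope, intercept = np.polyfit(xv, yv, 1)
        resid = yv - (intercept + slope * xv)
        return np.sum((yv - yv.mean()) ** 2) - np.sum(resid ** 2)

    observed = ss_reduction(x, y)
    # Permuting the dependent-variable column represents "no association".
    perms = [ss_reduction(x, rng.permutation(y)) for _ in range(n_perm)]
    return np.mean([p >= observed for p in perms])

x = np.arange(1, 9, dtype=float)  # e.g. street continuity
y = np.array([50, 100, 100, 150, 150, 200, 250, 250], dtype=float)  # ridership
print(permutation_pvalue(x, y))
```

For multiple cities, compute one pooled statistic and permute within each city, as described above.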
  • asked a question related to Regression Analysis
Question
5 answers
I'm working on the below topic for my master thesis.
“Investigating the stages in a customer’s buying journey and determining the factors influencing the decision to switch between a retailer’s online sales channels – marketplace and own website.”
Considering this, my plan was to apply logit regression analysis on the channel choice of the customer (in this case “Marketplace” and “retailer’s own website”) as the dependent variable and the interaction between the independent variables “Age” and subjective norms (recommendation from peers, product reviews) for the three stages.
I’m struggling to ascertain if using the customer channel choice of either marketplace or own website be considered as the dependent variable. I have not used a Likert scale for this as this was a scenario-based survey. So, the respondents have chosen the channel they would use in every stage.
Could you please advise if using this choice as a dependent variable makes sense? And, if using Logit regression is the right way to go?
Also, how to calculate/analyze relative importance of the predictor variables (independent variables) in Logit Regression analysis?
Relevant answer
Answer
Good morning. I understand fully, as I've been doing independent work also. http://sites.google.com/site/deborahhilton/ Most of my work is on ResearchGate though. I just try to work out the statistics myself with the textbooks I have here. If you had some specific questions, I may be able to answer for a nominal fee if you were wanting some reference material. Good luck. Thank you.
  • asked a question related to Regression Analysis
Question
4 answers
Is there any guideline for what counts as a strong, adequate, or low value of it? Thank you
Relevant answer
Answer
It is a measure of goodness of fit in logistic regression analysis. It is a modification of the Cox and Snell R Square, which is derived from the likelihood ratio test statistic.
Nagelkerke R Square ranges from 0 to 1, with values closer to 1 indicating a better fit of the model. However, unlike in linear regression analysis, where R Square can be interpreted as the proportion of variance explained by the model, Nagelkerke R Square cannot be interpreted as easily.
A common rule of thumb for interpretation:
  • A value of 0.2 or less indicates a weak relationship between the predictors and the outcome.
  • A value of 0.2 to 0.4 indicates a moderate relationship.
  • A value of 0.4 or higher indicates a strong relationship.
However, it's important to note that the interpretation of Nagelkerke R Square should be taken with caution and should be supplemented by other measures of model fit, such as the Hosmer-Lemeshow test, AIC, or BIC. Additionally, it's essential to consider the practical significance of the relationship between the predictors and the outcome, rather than solely relying on statistical significance or goodness of fit measures.
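For reference, Nagelkerke's statistic can be computed directly from the fitted and null log-likelihoods; a small sketch with invented numbers:

```python
import numpy as np

def nagelkerke_r2(ll_model: float, ll_null: float, n: int) -> float:
    """Cox & Snell R2 rescaled by its maximum so it can reach 1."""
    cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_model))
    max_cox_snell = 1 - np.exp((2 / n) * ll_null)
    return cox_snell / max_cox_snell

# Hypothetical log-likelihoods from a fitted logistic regression:
print(nagelkerke_r2(ll_model=-85.3, ll_null=-120.6, n=200))  # about 0.42
```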
  • asked a question related to Regression Analysis
Question
4 answers
I'm doing a regression analysis of the effect of housing type on resident depression. When I included all samples in a single model, housing type had a significant effect on depression (p=0.000). But when I divided the sample into males and females, and performed regression analysis on the two separately, the analysis results of both males and females showed that housing type had no significant effect on depression (p=0.1-0.2). I wonder how to explain this result
Relevant answer
Answer
It could be a power problem. When your samples get smaller, you reduce the statistical power to detect an effect. A better way to do this would be to include gender as a binary predictor into the overall analysis (and also potentially the interaction/product term housing*gender). That way, you can examine a potential gender (and interaction) effect without splitting your sample or running separate analyses.
Also, I would examine the data graphically using histograms, scatterplots etc. to see whether there are any peculiarities in the scores and their distributions (e.g., outliers, non-linear effects, etc.).
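A minimal sketch of that pooled specification in statsmodels (the file and column names are hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("residents.csv")  # hypothetical: depression, housing, gender

# Full-sample model with gender and the housing-by-gender interaction,
# instead of splitting the sample and losing power.
model = smf.ols("depression ~ C(housing) * C(gender)", data=df).fit()
print(model.summary())
```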
  • asked a question related to Regression Analysis
Question
3 answers
I have 667 participants in my sample, and the outcome is continuous. I tested the normality, and the data on the histogram are bell-shaped, but the test results show that they are not normally distributed.
1- What is the cause of the discrepancy between the chart and the test results?
2- Can I still perform linear regression analysis on this data?
Relevant answer
Answer
What the model requires is that the errors are mutually independent and identically distributed, approximately following a normal distribution with mean 0 and variance σ².
Also, remember that linear regression is a mathematical function based on the equation of a straight line.
The rest follows from there.
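In practice this means checking the residuals rather than the raw outcome; a small simulated sketch (note that with n = 667 a formal test will flag trivial departures, so the Q-Q plot is usually more informative):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(667, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=667)

fit = sm.OLS(y, X).fit()

# Assess normality of the estimated errors, not of y itself.
print(stats.shapiro(fit.resid))
sm.qqplot(fit.resid, line="s")
```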
  • asked a question related to Regression Analysis
Question
10 answers
I am testing hypotheses about relationships between CEA and Innovation Performance (IP). If I am testing the relationship of one construct, say Management Support, to IP, is it OK to use simple linear regression? Or should I be testing it in a multiple regression with all the constructs?
  • asked a question related to Regression Analysis
Question
2 answers
We want to analyze the relationship and impact between two variables in the education sector. The first variable is the independent variable (intellectual capital), measured on a sample of workers and leaders of size 150; the second is the dependent variable (quality of service provided), measured on a sample of students and parents of size 330.
Relevant answer
Answer
Is this a nested sampling design where the students and/or parents from the second sample are nested within the workers and/or leaders? If that is the case, then perhaps multilevel (hierarchical linear) regression modeling could be an option. However, this requires nesting of observations, e.g., that the parents in Sample 2 can be linked to workers/leaders in Sample 1.
If the data aren't nested but there's some other connection/dependency between the observations in both samples, perhaps you could account for missing data by using full information maximum likelihood estimation or multiple imputation so that you wouldn't waste any data. But again, this requires that at least some observations can be linked across the two samples.
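If the nesting can be established, here is a minimal random-intercept sketch in statsmodels (the file, column names, and leader_id link are all hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("linked_samples.csv")  # one row per student/parent

# Respondents nested within leaders: random intercept per leader.
model = smf.mixedlm("service_quality ~ intellectual_capital",
                    data=df, groups=df["leader_id"]).fit()
print(model.summary())
```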
  • asked a question related to Regression Analysis
Question
4 answers
Regression analysis is used for models that have covariables.
Relevant answer
Answer
Yes, it can.
  • asked a question related to Regression Analysis
Question
3 answers
Adjusted R² = 5.99%
F value = 9.61
p-value = 0.00
Oral Com = 21.36 − 1.194 × Dissatisfaction with one's linguistic skills
Relevant answer
Answer
It would be good to carefully review the data matrix first.
Also, try to be clear about which are the dependent and independent variables you are interested in analyzing.
  • asked a question related to Regression Analysis
Question
5 answers
My questionnaire consists of 20 questions, and five of them relate to the dependent variable. The problem is that those questions are not on a Likert scale; they use different scales with fixed answers, and one is a multiple-choice question.
For example, the 5th question has options 1-4,
the 6th question is dichotomous (1-2),
the 7th question is multiple-choice with 7 options,
the 8th question has options 1-5, of which one is picked.
Can I build a composite index variable for the dependent variable by standardizing these variables using z-scores?
Can I use the standardized variables to perform a correlation and regression analysis?
Relevant answer
Answer
You haven't said what your research question(s) is(are), or what your goal of reducing the dimensionality of these 20 items is. It doesn't make sense to me (obviously it does to others ... see above!) to try to answer your question without this information (and again, those above obviously don't think these are important questions).
  • asked a question related to Regression Analysis