# Multiple Linear Regression - Science topic

Explore the latest questions and answers in Multiple Linear Regression, and find Multiple Linear Regression experts.
Questions related to Multiple Linear Regression
Question
I have gathered data from 60 companies, across 5 years.
I want to do a Multiple Linear Regression Analysis with some variables, from 2017 to 2021.
1) Should I average the 5 years for each variable for each company and run the regression on 60 observations, or
2) Should I treat the data as individual points, meaning observations = n*5 = 300?
How do I lay out that information (columns and rows) in the Excel sheet (if option 2)?
Thanks a lot
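For option 2, the usual layout is a "long" sheet with one row per company-year and one column per variable. Below is a minimal sketch of the reshaping in pandas, assuming the data start out with one column per variable-year. The variable names `roa` and `lev` and all values are invented placeholders, not the poster's data.

```python
import pandas as pd

# Hypothetical wide sheet: one row per company, one column per variable-year.
# "roa" and "lev" are invented variable names; the values are placeholders.
companies = ["A", "B", "C"]
wide = pd.DataFrame({"company": companies})
for k, year in enumerate(range(2017, 2022)):
    wide[f"roa_{year}"] = [0.01 * (k + i) for i in range(len(companies))]
    wide[f"lev_{year}"] = [0.10 * (k + i + 1) for i in range(len(companies))]

# Long (panel) layout: one row per company-year, one column per variable.
panel = (
    pd.wide_to_long(wide, stubnames=["roa", "lev"], i="company", j="year", sep="_")
    .reset_index()
    .sort_values(["company", "year"])
    .reset_index(drop=True)
)
panel["year"] = panel["year"].astype(int)

print(panel.head(6))
print(panel.shape)
```

With 60 companies the same reshaping would give 300 rows. Note that the five rows belonging to one company are repeated measurements and are unlikely to be independent, which a plain pooled regression ignores.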
Question
Optimization and prediction of the kWh/m3 ratio in a pumping station using mathematical modelling with multiple linear regression (MLR) and artificial neural networks (ANN)
Question
I generated a scatterplot for a multiple linear regression analysis using SigmaPlot and I want to display the regression equation, but I was not able to do it. Can anybody suggest how to display the equation in SigmaPlot?
Dear Fiseha, for a multiple linear regression, the usual way of representing the fit in a bivariate space is to use as axes:
X axis = expected value of Y from the fitted equation
Y axis = observed value of Y.
This representation will give rise (if the equation has a decent fit) to a cloud of points elongated along the main diagonal of the plot (Y = X), allowing you to spot outliers, correctly fitted points and so forth by eye.
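The observed-versus-fitted representation can be sketched in a few lines of NumPy. The data below are simulated and the coefficient values are arbitrary assumptions. For an ordinary least squares fit with an intercept, regressing the observed Y on the fitted Y gives slope 1 and intercept 0, which is why the cloud lies along the main diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated example data: three predictors with arbitrary true coefficients.
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta          # X axis of the plot
resid = y - fitted         # vertical distance from the diagonal

# Line through the (fitted, observed) cloud: slope 1, intercept 0 for OLS.
slope, intercept = np.polyfit(fitted, y, deg=1)

# A simple, arbitrary flag for points far from the diagonal.
far_from_diagonal = np.abs(resid) > 3 * resid.std()

print(round(slope, 6), round(intercept, 6), int(far_from_diagonal.sum()))
```

The pairs `(fitted, y)` are what one would pass to any scatterplot routine; the 3-standard-deviation flag is only one possible rule of thumb for spotting outliers.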
Question
I am conducting a meta-analysis on a research study and will need to build a model for my data to conduct a multiple linear regression analysis. Can anyone provide a sample model or step-by-step approach to building a model? Thanks.
Here is a playlist of videos of how to conduct Meta-Analysis & Meta-Regression In R: https://www.youtube.com/playlist?list=PLrLWLaG7yx85X7ZjN4ySllDQ-hiB9Q4nm
Question
I have clinical parameters, and I need to find the relationship between the variables; any Ideas or suggestions on how to do the calculation?
Dear Lina Naji
You may look at the following article.
Best of luck
Question
I have fitted a multiple linear regression model with three continuous independent variables and one independent categorical variable. How to visualize or display the fit?
It's obviously going to be difficult to try to display all those variables in a single plot.
A) A simple approach is to plot the observed values vs. the predicted values from the model.
B) You could plot the dependent variable vs. one continuous independent variable grouped by the categorical variable. Three plots like this may be informative (e.g. : https://rcompanion.org/rcompanion/images/e_04_04.jpg )
Question
The correlation coefficient between one of the independent variables and the dependent variable has a p value greater than the alpha value of 0.05.
Bivariate screening of candidate predictors for a multivariable regression model is considered a bad practice that tends to produce overfitted models. See Mike Babyak's 2004 article, for example:
And having non-significant regressions in a multivariable model is not problematic either. There is more likely a problem when all variables are statistically significant, in fact--unless n is quite large. See these two sections in the DataMethods.org author checklist, for example (link below):
• Use of stepwise variable selection
• Lack of insignificant variables in the final model
HTH.
Question
I am running a multiple linear regression between demographics and performance on a measure. However, despite all the other assumptions being met, linearity hasn't (example image attached). I have applied a log10 transformation to the data and that hasn't helped. Are there any other corrections that can be applied?
Daniel Wright Unfortunately I can no longer see what you had written as Rgate will not let me see previous replies; so I am remembering/guessing.
By 'corrective' action I mean using a method that does not make the strong assumptions of the standard model. For example if you see heteroscedasticity in a catch-all plot I would model that heterogeneity and see if it makes a difference to the result; I think that is much better than some test for it . As Box (1953) notes of Bartlett's test "To make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!" The GAM approach allows for non-linearity, displays the partial residuals so you can see what is going on (very helpful in the multiple predictor case) and gives you the linear model if there is not strong evidence for a curve.
I think we are all getting more aware of the "Forking paths" in the practice of modelling. I am reminded of the phrase that you would not eat sausages if you saw them being made! The good thing about GAM is the inbuilt cross validation which is aimed at reducing overfitting. I have enjoyed reading Stuart Ritchie's Science Fictions and think the only way is to have a plan, being open about what you have done and use cross validation to guard against implicit p-hacking.
Back to the original question - it looks to be an underlying linear relation to me and I would stick with that.
Question
To be more precise, my dependent variable was the mental well-being of students. The first analysis was chi-square (mental well-being x demographic variable), hence I treated the dv as categorical. Then, in order to find the influence of mental well-being on my independent variable, I treated the dv as a continuous variable so that I can analyse it using multiple regression.
Is it appropriate and acceptable? and is there any previous study that did the same thing?
Need some advice from all of you here. Thank you so much
If we do not use the precise statistical tools appropriate to this field for the research variables, this will lead to statistical errors in the accuracy of the results; however, it can be used as a dependent variable in two separate studies.
Question
This post and the following questions are based on the "you are probably modelling seasonality in the wrong way" post by a company called Recast. https://getrecast.com/seasonality/
As well as the Uber publication "Bayesian time varying coefficients model for media mix modelling" https://arxiv.org/pdf/2106.03322.pdf
Although the papers above refer to media mix modelling, the question is applicable to a vast range of different regression contexts.
The post by Recast captures a very interesting aspect of regression in a nonstationary environment that is often overlooked.
Modellers often assume an additive model of the form y = intercept + effect of the variables of interest + controls, where the control variables are often regressed on previous sales (y) as a function of time and then added into the model; they might, for example, include trend and seasonality by decomposing the time series with e.g. a Fourier decomposition.
Often the effect of the variables of interest is assumed to be constant with respect to time, i.e. to have constant coefficients. This will probably lead to the control variables absorbing the effect our advertising had on sales. The effect of advertising is most likely time-varying and exhibits seasonality, trends etc., which is not captured; instead it will probably be absorbed by our control variables, which we are not really interested in.
In their post they don't break this down with respect to a multiplicative model. Can this be avoided by using a multiplicative model? Have a look at Uber's paper: they use a multiplicative model which also incorporates time-varying coefficients for the variables of interest. I can't really grasp how they explicitly calculate the trend and seasonality component, even though I understand the kernel part and the Fourier decomposition. Is it decomposed with knots with respect to sales and then plugged into the model?
Now this leads me to my second question: why do researchers decompose the trend and seasonality with respect to sales and then plug them into the formula when using additive models? Why don't they capture these patterns in the coefficients corresponding to the variables of interest?
It seems much more reasonable, or am I missing something crucial here?
Can anyone shed some light on this?
Thanks for your reply to my post. I don't know the answer to the last question you pose. I do know, however, that prediction and explanation are very different goals. You might look at the work of G. Shmueli on this topic. It may give you a framework for your exploration. The reference is cited in the attached paper. I know that it's available on ResearchGate. Best wishes, David Booth
Question
Hi, I am new here but really hope someone can help.
I ran a hierarchical multiple linear regression with 3 predictors in the first block and 1 in the second block. In the first model, only one of my predictors was significant, but I only included it to control for it as I expected it would be highly correlated to the DV. In my second model, one of the predictors that was non-significant in the first model is now significant. Can anyone explain what that means and how I can discuss those results? It would be great if you could also point me towards books or papers that explain this.
Thanks,
Charlenne
This could be due to a suppressor effect. A suppressor effect can occur when predictor variables are highly correlated with one another. In that case, one predictor can "suppress" variance in the other predictor and thereby enhance the predictive value of that other predictor. It is even possible that a predictor that is not significantly correlated with the DV turns out to be a "significant" predictor in a multiple regression due to suppression.
If you do a literature search for "suppression" or "suppressor effect" in regression, you should be able to find a number of relevant resources.
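A suppressor effect is easy to reproduce in a small simulation. The sketch below uses an invented setup (not the poster's data): `x1` measures a signal contaminated with noise, while `x2` measures only that noise. On its own, `x2` is essentially uncorrelated with the outcome, yet it receives a clearly non-zero coefficient in the multiple regression because it removes the irrelevant variance from `x1`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Invented data-generating process for illustration only.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
x1 = signal + noise                      # predictor contaminated by noise
x2 = noise                               # suppressor: measures only the contamination
y = signal + 0.1 * rng.normal(size=n)    # outcome depends on the signal only

# Bivariate view: x2 looks useless.
r_x2_y = np.corrcoef(x2, y)[0, 1]

# Multiple regression: y ~ 1 + x1 + x2.
A = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print("corr(x2, y):", round(r_x2_y, 3))
print("coefficients (intercept, x1, x2):", np.round(beta, 3))
```

In this construction the coefficients come out near 1 for `x1` and near -1 for `x2`, since `x1 - x2` recovers the signal. Real data will rarely be this clean.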
Question
Hello everyone,
I would like to do a multiple regression with 4 independent variables and 1 dependent variable. I also have a dichotomous moderator "gender", which is coded female = 1 and male = 2.
How do I test the moderator with SPSS to see if it is linear?
I have already checked the assumptions of the multiple linear regression for the dependent variable and the independent variables using partial regression plots. But how can I check whether the dichotomous moderator is linear?
Again, there is no way that a dichotomous variable could have a non-linear relationship with another variable. Therefore, no need or possibility to check for non-linearity.
You can see this by plotting the dichotomous variable against the DV in a scatter plot. The dichotomous variable has only two possible values on the x axis. The OLS regression line will go through the group means. You could not fit anything other than a straight line through the two means.
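The point about the two group means can be checked numerically. This is a minimal sketch on simulated data, using the 1/2 coding from the question; the outcome values and effect size are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Simulated data: gender coded 1 (female) and 2 (male), as in the question.
gender = rng.integers(1, 3, size=n)
y = 10 + 3 * (gender == 2) + rng.normal(size=n)

# OLS of y on an intercept and the dichotomous variable.
A = np.column_stack([np.ones(n), gender])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
b0, b1 = coef

# Fitted values at the only two x positions that exist.
fitted_at_1 = b0 + b1 * 1
fitted_at_2 = b0 + b1 * 2

mean_1 = y[gender == 1].mean()
mean_2 = y[gender == 2].mean()

print(round(fitted_at_1, 4), round(mean_1, 4))
print(round(fitted_at_2, 4), round(mean_2, 4))
```

The fitted line passes exactly through both group means, and the slope equals their difference, so there is nothing left over for a curve to fit.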
Question
My dataset is not very big (between 100-200). the residuals are not normally distributed. so:
1. Is there any other statistical method similar to multiple linear regression but suitable for this case?
2. If not, what can be the solution?
Thank you
Look at the severity of the violation. It may be just negligible given the large sample size.
Question
When performing sensitivity analysis of the activated sludge model, multiple regression analysis of the different parameters (variables) with the model output is required. However, the model output is also a function of time, so I am confused how to implement a multiple linear regression of the model output with the parameters (variables) in MATLAB to find the linear regression coefficients.
Add the function of time as another IV just like you would any other eg an interaction term. David Booth
Question
Hello dear community,
I have a question regarding multiple linear regression (MLR), moderation analysis and median splits (dichotomization). For context: I have a dichotomous independent variable OKO, a continuous moderator KS and the dependent variable IPL. The question I am asking myself is whether I should dichotomize the moderator variable KS. The reason I am asking is the following results.
Regressing IPL on OKO and OKO*KS (KS as continuous variable) yields the unstandardized regression coefficients:
OKO - 2.017
KS - 0.2189
OKO*KS 0.6475
=> meaning that KS dampens the negative relation between OKO and IPL.
However, if I include the dichotomous variant of KS (KS_M) into the regression instead of KS (continuous variable) I get the following unstandardized regression coefficients:
OKO 0.272
KS_M -1.312
OKO*KS_M 1.820
=> meaning that KS amplifies the positive relation between OKO And IPL.
Can someone explain to me why I get contrary results?
THANK YOU
Thou shalt not dichotomize...
There is really no good reason to dichotomize a continuous moderator as this can introduce all kinds of problems. Simply use the continuous version and plot the interaction as suggested by David Booth.
Question
For my data analysis I am conducting a multiple linear regression model. Currently I am testing the following assumptions:
1. Normality of residual errors
2. Multicollinearity
3. Homoskedasticity
Since the normality of residual errors was violated, I transformed my dependent variable into a log variable. However, I am wondering whether I should now continue testing multicollinearity and homoskedasticity with this new log variable or still use the variable before data transformation? Moreover, when writing descriptives & correlation matrix should I include the variable before data transformation or the log?
Hopefully someone can help me! :)
Julie Schipper -
For heteroscedasticity, even after a transformation - which I would only do if necessary, and not to address heteroscedasticity - there can still be heteroscedasticity. I recall an example among those that Penn State puts on the internet where a transformation had been done for purposes of addressing heteroscedasticity in a real estate problem, and they still had substantial heteroscedasticity remaining. You can have a kind of artificial, I say "nonessential" heteroscedasticity, which may be the result of a problem with your model, but the natural, I say "essential" heteroscedasticity that you should expect can be impaired, again, by a problem with your model. Data quality issues can also be involved, etc. See https://www.researchgate.net/publication/354854317_WHEN_WOULD_HETEROSCEDASTICITY_IN_REGRESSION_OCCUR.
With enough data, you can measure it well using the following:
Defaults are mentioned in one sheet of that Excel file, and it would be reasonable for the coefficient of heteroscedasticity as defined here to be from 0.5 to 1.0. See https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity, as determined by Ken Brewer.
If you are working with data which have already been transformed, then that changes things in ways perhaps both good and problematic. The problematic can also include problems of interpretation, as Jochen noted. I suggest only using transformations if you absolutely must, even if you are going to "back-transform" as Salvatore mentioned. I think that the simpler you can keep things, the better.
Best wishes - Jim
Question
My data analyst has used a concept called "pre-hypothesis"; it isn't equivalent to the hypothesis equation, but it is for checking Durbin Watson.
My problem is that I haven't found any source that has used the same concept (i.e., pre-hypothesis) for Multiple Linear Regression in general and Durbin Watson in particular.
I would be more than grateful if anyone could provide me with:
1. an explanation of "pre-hypothesis" in this concept, and if that equals "assumption"; and
2. a source in which "pre-hypothesis" is used and explained.
Ask your analyst to explain it coz nobody else says stuff like this. Good luck on finding a better analyst
Best wishes David Booth
Question
Hello, I'm a student researcher looking to apply a multiple linear regression to my data set, in which all values are continuous. My question is mainly about a step-by-step idea of how to do this, as I believe I understand the general gist of it all, but not quite the whole picture.
To start, it is my understanding that you make scatterplots of your independent vs dependent data and of your independent vs independent data. This helps determine whether there's a linear relationship between the variables, and whether there's any collinearity between your independent variables. After this step I can eliminate the independent variables that either show strong collinearity or have no relationship with the dependent variable.
After the initial screen, I am under the impression that I can run a stepwise (or forward/backward) multiple linear regression, slowly adding in the variables that fit the BEST MODEL. Is that a correct way to look at it?
Doing a meaningful analysis is, unfortunately, not that simple.
The problem is possible interactions between variables, which may either obscure non-linear relationships with the response variable or wrongly imply a non-linear relationship if the impact of the other variables is not considered at the same time (i.e. when you look at the marginal relationships). There is no easy way out. You need to make some assumptions, and they should be reasonable; justifying them is not done by means of statistics but by expert knowledge of the subject matter. Residual diagnostics can help to a certain, limited extent: if the residuals do not show any pattern and seem to be a sample from a normal distribution, then this is a slight hint that the model might be appropriate, or at least that it does not miss anything really important (like an important interaction or non-linear relationship).
Stepwise variable selection is a very bad approach to building a model. You should definitely avoid it. The selection of variables (and non-linear relationships, and interactions) should be based on theoretical considerations, expert knowledge of the subject matter and the purpose of the model. If you consider a variable important enough to include it in the model, then there should not be a statistical reason (like the lack of a statistically significant effect estimate given your data) to exclude it from the model.
If you have many variables, and you are just looking at whether they allow you to somehow predict the outcome, modelling with linear models (or the like) may actually not be the best choice. You may consider regression trees, random forests or deep learning approaches. It's evident that this needs quite a lot of data, but this is the price you have to pay if you have no theoretical concept but still want to build some tool that uses data to make good predictions.
And, finally, there is no BEST MODEL. There are many different possible models, all with their own pros and cons. Only if you have very, very specific demands on what the model should do (and if you are able to formulate this in a mathematical way) may you be able to identify a specific model that is best suited for exactly this purpose. And again, deciding on this best model is not (only) a matter of statistics you calculate for your observed data; it remains mainly a task requiring expert knowledge of the subject matter.
If you are an expert in the subject matter, it may be very helpful to cooperate with a local statistician to identify a good solution.
Question
Hi! I am trying to run the following multiple linear regression in R:
r ~ condition (1 = Control, 2 = Active Control, 3 = Treatment, 4 = Importance Treatment) + type (0 = false, 1 = true) + age (13, 14, 15, adult) + domain (1 = eco, 2 = health, 3 = society, 4 = culture)
r is the intention to share a certain headline for a certain participant (initially given on a 6-points scale but the score being transformed into a variable comprised between 0 and 1) , where participants are randomly assigned to 4 conditions, and we want to test the respective effect of 2 treatment conditions vs 2 controls on the intention to share false headlines (we predict it will reduce the sharing of fake news, without impacting the sharing of real news), knowing that in all the conditions the main task consists in assessing the intention to share 24 headlines successively presented to the participant, half false half true (so we also want to know the effect of the "type" of the headline, being true or false; and eventually the effect of its category - within the 2 sets of headlines, we always have 1/4 of headlines with an economic subject, 1/4 on health, 1/4 on "society", and 1/4 on culture, although it is less important); and finally we are testing different age groups (13, 14, 15 years old, as well as an "adult" group with participants aged from 25 to 36 years old pooled together), to know if the effect predicted varies with age etc.
The main hypotheses of our study are:
-that treatments will improve discernment, defined as the difference between the intentions (r > 0.5) to share true news and the intentions to share fake news (ie it will really improve the quality of sharing, and not merely causing general skepticism),
-that this effect will be higher for headlines perceived as the most inaccurate (there will be an analysis on headlines as well, based on pretest), as we suppose the effect of the treatment works by refocusing the attention of the participant on the accuracy criterion, hence being greater for headlines that were generally (we will calculate a mean perceived accuracy for each headline across participants) perceived as the least accurate ones, and that will be consequently the ones for which sharing intentions will drop the most thanks to the treatment
-We have no particular prediction concerning age, which is precisely the novelty of the study (the literature on adolescents' ability to evaluate fake news and on their sharing online leaves the door open to quite different scenarios)
-and no prediction for domain as well (we expect it won't play a role as the headlines have been chosen to be quite similar in tone, whatever the category)
Also, I need to cluster by participant (since there are repeated measures for each participants) and headline (multiple ratings for each headline).
I thought of using lm_robust but I don't know if we can put 2 clusters directly?
I also wonder what the simplest way is to check for potential effects of other secondary measures (like scores on a Cognitive Reflection Task, gender, socio-professional category (CSP) membership etc.): do I have to run regressions testing all simple effects, interactions etc., or can I just add them to the global formula?
Thank you very much for all your answers! I have unfortunately no choice but to use R... so I am going to go for the lme4 method.
Thanks again to you all!
Question
Through a survey, I have measured the Big Five Personality traits and Task Performance.
Now I'm about to do the regression analysis, however, I'm not sure if I should do a simple linear or multiple linear regression.
In the case of openness when I run a simple linear regression (DV: task performance IV: openness) my results are:
Sign: .001
Unstandardized B= .494
so the results are positive & significant
But when I run multiple linear regression (DV: task performance, IV: openness, conscientiousness, agreeableness, extraversion, neuroticism) my results for example for openness are:
Sign: .311
Unstandardized B= .110
so the results are positive & not significant
Now, I'm confused about which one should I use and why.
The multiple regression result shows you that there is partial redundancy between the independent variables (they are correlated and overlap in terms of "explaining" the DV). Openness is no longer a significant predictor once the other predictors are in the model because one or more of them accounts for (partially) the same portion of variance as does openness.
Another issue could be overfitting. With additional correlated predictors, standard errors will become larger, reducing statistical power. Therefore, with more predictors in the model, previously significant predictor variables may become insignificant.
The question is what you want to find out. If your goal is to determine the most relevant predictor variables then the multiple regression model would be preferred because it accounts for the redundancy of the predictors. Depending on your sample size, you may not have enough power though given the number of predictors.
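The redundancy described above can be illustrated with a simulation. In the sketch below all numbers are invented (this is not the poster's survey): openness is correlated with conscientiousness, and performance depends mostly on conscientiousness. The simple-regression slope for openness is large because it borrows the effect of the correlated predictor; once both are in the model the slope shrinks.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Invented data-generating process: two correlated traits.
consc = rng.normal(size=n)
openness = 0.8 * consc + 0.6 * rng.normal(size=n)
perf = 0.1 * openness + 1.0 * consc + rng.normal(size=n)

def ols(columns, y):
    """OLS coefficients for y on an intercept plus the given predictor columns."""
    A = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(A, y, rcond=None)[0]

simple_b = ols([openness], perf)[1]            # openness alone
multiple_b = ols([openness, consc], perf)[1]   # openness adjusted for conscientiousness

print("simple slope for openness:  ", round(simple_b, 3))
print("multiple slope for openness:", round(multiple_b, 3))
```

Neither number is wrong: the simple slope answers "how does performance differ between people who differ in openness?", while the multiple slope answers the same question among people with equal conscientiousness.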
Question
I want to know how age, obesity (0 for no and 1 for yes), smoking status (0 for no smoking and 1 for having smoking) affect the serum Vitamin D level using regression with SPSS.
The best-fit curve for age reported: Linear (R2=0.108), Quadratic (R2=0.109), Cubic (R2=0.106)
So I assumed that the E(VitaminD)= b0 + b1*Age +b2*Age^2
But if want to put this into a multiple regression analysis with age, obesity, and smoking status as independent variables, which one should I use? Age+obesity+smokingStatus or Age^2+obesity+smokingStatus?
Ho Nguyen Tuong -
Are you sure you just want "yes" or "no" for obesity? How about a BMI value? Also, how much smoking? At any rate, you can compare model results for a given sample using a "graphical residual analysis." (You can research that online.) Also, you probably want to consider a "cross-validation," as you do not want to overfit a model to a particular sample when that may mean your model won't fit so well to other data which you wanted to cover. Trying more than one sample could help.
Best wishes - Jim
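One way to act on the cross-validation suggestion is to compare the model with a linear age term against the model with linear and squared age terms by their out-of-sample error. The sketch below hand-rolls k-fold cross-validation in NumPy on simulated data; the variable names and the data-generating values are assumptions, not the poster's data. Age is centred before squaring to reduce the correlation between the two age terms.

```python
import numpy as np

def cv_mse(X, y, k=5, seed=0):
    """Mean squared prediction error from k-fold cross-validation of an OLS fit."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(4)
n = 300

# Simulated stand-in data (NOT the poster's): age in years, two 0/1 indicators.
age = rng.uniform(20, 80, size=n)
obese = rng.integers(0, 2, size=n)
smoker = rng.integers(0, 2, size=n)
age_c = age - age.mean()
vitd = 60 - 0.25 * age_c - 5 * obese - 3 * smoker + rng.normal(scale=8, size=n)

ones = np.ones(n)
X_lin = np.column_stack([ones, age_c, obese, smoker])
X_quad = np.column_stack([ones, age_c, age_c ** 2, obese, smoker])

mse_lin = cv_mse(X_lin, vitd)
mse_quad = cv_mse(X_quad, vitd)
print("CV MSE, linear age:   ", round(mse_lin, 2))
print("CV MSE, quadratic age:", round(mse_quad, 2))
```

If the two errors are practically equal, as the nearly identical R2 values in the question (0.108 vs 0.109) hint, the simpler linear term is the natural choice. Note that a quadratic specification keeps the linear age term alongside the squared one rather than replacing it.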
Question
I have a query regarding which type of regression analysis to use for my study. I have used a scale (dependent variable) that contains 9-items and each item is marked on a 5-point Likert scale. Scores of each item are summed and ranges from 9 to 45. Higher score indicates the respondent has more characteristics of that construct.
Similarly, there are two independent variable. One IV has 20-items and each item is marked on a 4-point Likert scale. Score ranges from 20 to 80. Second IV has 7-items and each item is marked on a 5-point Likert scale. Score ranges from 7 to 28.
The reviewer has suggested me to use non-parametric tests since my data is ordinal. However, previous studies have used Multiple Linear Regression using similar types of constructs.
Which type of regression analysis is appropriate in this case - ordinal regression or multiple linear regression? Any literature explaining this would be highly useful.
I'd normalize the DV (that is, rescale it so that the values lie within 0...1) and use a beta model or a quasi-binomial (regression) model.
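The rescaling step can be sketched as follows, using the 9 to 45 score range from the question. Because a beta model needs values strictly between 0 and 1, the second helper applies a commonly used squeeze, (p(n-1) + 0.5)/n, to pull exact 0s and 1s inward; whether to use it is a modelling choice and is an addition to the suggestion above. The function names are hypothetical.

```python
import numpy as np

def rescale_to_unit(scores, low, high):
    """Map sum scores from [low, high] onto [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    return (scores - low) / (high - low)

def squeeze_open(p, n):
    """Pull values in [0, 1] into the open interval (0, 1); n is the sample size."""
    p = np.asarray(p, dtype=float)
    return (p * (n - 1) + 0.5) / n

# Three example respondents: minimum, midpoint and maximum score on the 9-item scale.
raw = [9, 27, 45]
unit = rescale_to_unit(raw, low=9, high=45)
open_unit = squeeze_open(unit, n=len(raw))

print(unit)
print(open_unit)
```

The rescaled variable can then be passed to whichever beta or quasi-binomial routine the chosen software provides.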
Question
I have been working with a GAM model with numerous features(>10). Although I have tuned it to satisfaction in my business application, I was wondering what is the correct way to fine tune a GAM model. i.e. if there is any specific way to tune the regularizers and the number of splines; and if there is a way to say which model is accurate.
The question is actually coming from the point that on different level of tuning and regularization, we can reduce the variability of the effect of a specific variable i.e. reduce the number of ups and downs in the transformed variable and so on. So I don't understand at this point that what model represents the objective truth and which one doesn't; since other variables end up influencing the single transformed variables too.
Cross validation algorithm in scikit-learn of Python is working very well in tuning hyper-parameters.
Question
Hi everyone, I am working with a disjunctive model for decision-making. But here I'm a bit confused: how can I figure out the value of Co (Co is a constant above the largest Xi to ensure Y will not be infinite) when doing multiple regression analysis in SPSS? Does it have any contact value?
I'm confused about the model you are trying to use... Is there one value for Y and several values for X and for B ? And then you need to determine the best value for Co ? ...
Question
I seem to have a hard time with the statistics for linear regression. I have been scrolling on the Internet, but did not find an answer.
I am testing the assumptions for linear regression, of which one is homoscedasticity. My data however shows a heteroscedastic pattern in the scatterplot. But how do I check if this is correct? Whether it is actually heteroscedastic...
Could I still perform the linear regression even though my data seems to be heteroscedastic?
I have tried transforming my DV with ln, lg10 and sqrt but the heteroscedasticity remains visible.
Ellen Kroon, an alternative would be to estimate/use Heteroskedasticity Robust Standard Errors in your significance testing. This can be done easily in Stata or R. See the link below for more: https://www.r-econometrics.com/methods/hcrobusterrors/
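To show what heteroskedasticity-robust standard errors actually compute, here is a hand-rolled NumPy sketch of the "sandwich" estimator in its HC1 form, next to the classical standard errors. The data are simulated so that the error spread grows with |x|; everything about the data is an assumption for illustration. In practice one would use the routine built into the statistics package rather than this code.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

# Simulated heteroscedastic data: the error spread grows with |x|.
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + np.abs(x) * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
k = X.shape[1]

# OLS coefficients and residuals.
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical standard errors: assume one common error variance.
sigma2 = resid @ resid / (n - k)
classical_se = np.sqrt(np.diag(sigma2 * XtX_inv))

# Robust (sandwich) standard errors: each observation keeps its own squared residual.
meat = X.T @ (X * resid[:, None] ** 2)
hc0_cov = XtX_inv @ meat @ XtX_inv
hc1_se = np.sqrt(np.diag(hc0_cov * n / (n - k)))   # HC1 small-sample scaling

print("coefficients:", np.round(beta, 3))
print("classical SE:", np.round(classical_se, 4))
print("robust SE:   ", np.round(hc1_se, 4))
```

The coefficient estimates are unchanged; only the standard errors, and therefore the significance tests and confidence intervals, differ. So the linear regression can still be run on heteroscedastic data, with the inference based on the robust standard errors.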
Question
There are different regression models used in wind speed forecasting. I need some relevant papers, reading sources and published writings on multiple linear regression, moving average regression, and K-nearest neighbour classification and regression methodology to develop a good understanding.
Question
I have 3 DVs and 5 predictor variables (technically, only 1 IV and the other 4 are controls).
I ran it on Stata and all seems to have worked (since mathematically, all predictor variables are treated the same way), but technically I only identify one of my predictor variables as the IV of interest. I would rather stay away from SEM. Thank you so much!
Thank you very much and hope you are staying warm, Xingyu Zhou
Question
Hello everyone!
Can we report the results of independent sample t-tests and multiple linear regression analysis in one table? If yes, what information should be put in that table? Thank you very much.
Hello Ayyu,
Reporting conventions differ by outlet, so you may wish to consider what is usual and customary for your target.
If you're talking about summary statistics for the variables involved (means, SDs, correlations among variables), then yes; one table can suffice.
If you're talking about the outcomes of the two methods, then no, generally not. The exception would be if the table was restricted to: (a) test name; (b) p-value; and (c) indicator of whether result was statistically significant.
For MLR, you'd generally want to present: (a) regression coefficients (both unstandardized and standardized) for each independent variable; (b) test of significance for that variable's coefficient; and in a summary, the overall R-square and test of whether the overall regression was significant.
For t-test, a table really isn't needed, beyond that for summary statistics. In text, something like: "t(42) = 4.22, p < .001" can suffice.
Question
I want to study the risk factors for a dependent variable, which is the percentage of lame cows at the farm level, using a multiple linear regression in SPSS. Is that possible?
Question
Hello my friends - I have a set of independent variables and the Likert scale was used on them and I have one dependent variable and the Likert scale was used as well. I made the analysis and I want to be sure that I'm doing this right - how I can use control variables such as age, gender, work experience and education level as control variables to measure their effect on the relationship between the independent variables and the dependent variable? Please give me one example. Thanks
This link will be useful for you: Check:-
Question
I'm interested in comparing multiple linear regression and artificial neural networks in predicting the production potential of animals using certain dependent variables. However, I have obtained negative R squared values for certain model architectures. Please explain the reason for negative prediction accuracy or R squared values.
If R2 for your regression is negative, it means that your regression predicts worse than the simple mean-value predictor (i.e., when you simply predict that every y equals mean(y)).
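This follows directly from the definition R2 = 1 - SS_res / SS_tot, where SS_tot is the squared error of the mean predictor. Here is a small sketch with made-up numbers; the function name is hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = [1, 2, 3, 4, 5]

r2_mean = r_squared(y, [3, 3, 3, 3, 3])   # always predicting the mean
r2_bad = r_squared(y, [5, 4, 3, 2, 1])    # predictions that run the wrong way

print(r2_mean, r2_bad)
```

An OLS model with an intercept cannot produce a negative R2 on the data it was fitted to, so negative values usually point to evaluation on held-out data, a model without an intercept, or a network that has not trained properly.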
Question
A manpower deployment model built using the Multiple Linear Regression method got a high p-value (above 0.05) for the final model. This research is based on an empirical study. The reason for the high p-value was a problem with the data we received for the independent variables (not highly accurate). The sample size is 240. My final question: is it reasonable to conclude this research with a final model that has high p-values? We found several other reasons related to data errors (organizational data entry problems). We have good reasons to present the overall model. Is it okay to conclude in that way?
**The overall model has one dependent variable and two independent variables (the two independent variables also show a high correlation with each other)**
For "graphical residual analyses," you could compare different models on the same scatterplot for the same sample. (Different models here would have one predictor, or the other, or both, with or without an intercept if that is in question, though subject matter knowledge could tell you if y is zero when the predictor or predictors are all zero.) For "cross-validation," you could try this on different subsamples, but results for other data not collected could still be different.
Question
Hi,
I have 2 categorical and 1 continuous predictors (3 predictors in total), and 1 continuous dependent variable. The 2 categorical variables have 3 and 2 levels, respectively, and I have only dummy coded the variable with 3 levels, but directly assigned 0 and 1 to the variable with only 2 levels (my understanding is that if the categorical variable has only 2 levels, dummy coding is not necessary?).
In this case, how do I do and interpret the assumption tests of multicollinearity, linearity and homoscedasticity for multiple linear regression in SPSS?
Thank you!
Yufan Ye -
Have you looked at a "graphical residual analysis?" You can search on that term if you aren't familiar. It will help you study model fit, including heteroscedasticity. Also, a "cross-validation" may help you to avoid overfitting to the sample at hand to the point that you do not predict so well for the rest of that population or subpopulation which you wish to be modeling.
If this model is a good fit, I expect you will likely see heteroscedasticity. See https://www.researchgate.net/publication/354854317_WHEN_WOULD_HETEROSCEDASTICITY_IN_REGRESSION_OCCUR.
Cheers - Jim
Question
• A question about multiple linear regression: if the dependent variable is, for example, time, do the independent variables also have to be measured in time? Here the MLR is used to build a model for task completion time. Do all the independent variables we use have to be in units of time? Can't we use the number of people as an independent variable?
The units in which the dependent and independent variables are measured do not have to be the same for multiple linear regression. For example, you could try to predict job success in salary dollars (DV) from IQ scores (IV1), number of years of education (IV2), and gender (IV3). In this example, all variables would have different units of measurement.
Question
R2 for MLR is approximately 10-20% at various train:test ratios,
but for ANN it is 1-4%.
Why?
What is the size of your data? Is it big data or a small dataset? MLR is a parametric model that makes certain assumptions. Neural networks need large amounts of data because they estimate many parameters. That may be one reason for this difference. Please read more about parametric vs non-parametric models.
Question
Hi everyone! I'm running multiple regression models on incomplete data using R, and I applied the MICE algorithm to deal with the missing data. I've been able to get the pooled coefficients (B, t-tests, p-values) with no effort using the available scripts, but I couldn't find a way to obtain "goodness of fit" measures (like adjusted R squared and F) for the pooled data (not individual imputations or original data, but pooled values). I got the same problem using multiple imputation in SPSS.
Thank you very much for your attention, any help will be greatly appreciated!
I would suggest using the pool.r.squared() function from the mice package.
It should give your desired results if you are using the lm modelling function.
Best wishes,
Francesco
Question
Dear colleagues,
I want to raise this question to the community. I know that in the ANOVA test where we compare means, for example, in different groups we will have problems with multiple comparisons since we only know the F-test results but group-to-group difference is unknown. Therefore, we would choose multiple comparison correction methods, such as Tukey's, Scheffe's, or Bonferroni, to adjust the p-value and explore each pair difference and significance.
However, I am conducting a multiple linear regression analysis by applying a stepwise (backward selection) method. I have a DV (QoL scores in eight different domains and two summary components; the scores are continuous; the Rand-36 or SF-36) and a group of potentially associated IVs (factors; categorical and continuous).
For the models that I get from the auto-selection process (stepwise and backward), I would like to ask: will there be multiple comparison problems? Why? And what would be the recommended solutions to this kind of multiple comparison problem in multiple linear regression model building? Thank you!
There will be problems because you used an auto-selection process.
You can fit a given model to observed data, but you cannot use the same observed data to specify the model that should be fitted to them. Model specification should be driven by theory. Surely, observed data may give you ideas about an improved model specification ("exploratory data analysis"), and such an improved model can in future be fitted to newly observed data.
Automated model selection is a kind of extreme exploratory analysis; you completely lose the possibility of interpreting any statistical significance in the data. This is in principle nothing bad. Just don't interpret significance and everything is "fine". However, the automated process is clearly not recommended because it inevitably will just find the best fit to all the peculiarities of the observed data set, disregarding any theory, which is never a very clever thing to do.
Question
I am looking at whether stress levels reduce from time point 1 to time point 2 when engaging in recommended resources.
DV's - stress levels from time point 1 and time point 2
IV's - engaged in resource 1 and resource 2
Yep, would agree, and recommend to use residualised change scores.
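For readers unfamiliar with residualised change scores, here is a minimal sketch: regress the time 2 score on the time 1 score and keep the residuals, which represent the part of time 2 stress not predicted by baseline. The stress values are hypothetical and `simple_ols` is an illustrative helper.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical stress scores for five participants.
stress_t1 = [20.0, 25.0, 30.0, 35.0, 40.0]
stress_t2 = [18.0, 20.0, 29.0, 30.0, 38.0]

intercept, slope = simple_ols(stress_t1, stress_t2)

# Residualised change: observed time 2 score minus the score predicted from time 1.
residual_change = [
    t2 - (intercept + slope * t1) for t1, t2 in zip(stress_t1, stress_t2)
]
print(residual_change)  # [1.0, -2.0, 2.0, -2.0, 1.0]
```

The residual scores could then serve as the DV in a regression on the resource-engagement variables. An equivalent and often simpler route is to regress the time 2 score on the time 1 score and the engagement variables in a single model.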
Question
Regarding either a linear or a machine-learning-based regression analysis, how should we perform the normality test of the model? Should we consider all data, or just the training or testing dataset? I would be grateful if anyone could describe it in more detail!
Non-normality will not matter much, outliers will.
Question
Hello,
I am using a Multiple Linear Regression model to analyze the UTAUT2 model. However, the Durbin-Watson value came out as 2.080. How can I check whether the survey data has any negative autocorrelation, given that the result is >2? The sample size is 209.
Checked the DW model table from Savin and White (Durbin-Watson Statistic: 1 Per Cent Significance Points of dL and dU) and it lies in dU (k=10). Can anyone help explain my situation?
If the negative autocorrelation exists, how to avoid this without increasing the sample size?
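As a rough guide to reading the statistic, here is a minimal sketch of how Durbin-Watson is computed from residuals and how it maps to lag-1 autocorrelation through the standard approximation DW ≈ 2(1 - r). The alternating residuals are made up; only the value 2.080 comes from the question.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def approx_lag1_autocorrelation(dw):
    """Invert the approximation DW ~ 2 * (1 - r)."""
    return 1.0 - dw / 2.0

# Made-up residuals that alternate in sign: strong negative autocorrelation.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
dw_alternating = durbin_watson(alternating)  # 20 / 6, well above 2

# The value reported in the question implies a lag-1 correlation near zero.
r_reported = approx_lag1_autocorrelation(2.080)  # about -0.04
print(dw_alternating, r_reported)
```

Values near 2 indicate little autocorrelation, values well above 2 suggest negative autocorrelation, and values well below 2 suggest positive autocorrelation. The statistic also assumes the rows have a meaningful order (usually time); for cross-sectional survey data the row order is typically arbitrary.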
Question
Hey all,
For my master's thesis I am conducting research to determine the influence of certain factors (ghost games, lack of fan interaction, esports) on fan loyalty. For my statistical analyses, I will first conduct confirmatory factor analysis to validate which items (e.g., purchased merchandise) belong to which latent factor (e.g., behavioral loyalty).
However, I am unsure about my next step. Can I use multiple linear regression with my latent variables to identify the relationship between the factors and loyalty? The data is collected through a survey of mainly 7-point Likert scale questions. Can I use linear regression, or is ordinal regression a must with Likert scale data?
In statistics, it's crucial to test the assumptions that go along with the analysis and to determine the suitability of the data set for various analyses. Factor analysis reduces a large number of related variables to a smaller, more manageable number of factors before they are used in other analyses such as multiple regression or multivariate analysis of variance:
Question
My research is on the accuracy of ECG among doctors.
Dependent: ECG score (normally distributed)
Independent: sociodemographic characteristics (not normally distributed)
Which do I use for one-to-one correlation:
Pearson/Spearman, or simple linear regression?
What are the differences?
I was told they were the same,
but the results are different.
I want to proceed with multiple linear regression.
All the fellows have contributed well. Whenever we analyze data, we ask ourselves whether we want to check if variables are related or if one is said to have an impact on the other. If the aim is to measure the strength of a relationship, then we can use cross-tabulation (nominal or ordinal data), Spearman (ordinal data), or Pearson (continuous data).
But if the question is regarding one variable being the reason for the other then we need to use regression.
Question
Hi there, essentially, I have collected data for my study that includes the frequency of particular injuries, e.g., sprained ankle x 5. I also have the experience of each individual, e.g., they have played for 5 years. Because the number of years played varies, I standardised the data to make it comparable by dividing injury frequency by the total number of years each person has played; e.g., 5 injuries across 5 years would result in a standardised frequency of 1.
My question is: when I look at the effect of experience on my standardised injury frequency, does the fact that I used experience as my standardising factor affect the results? I ask only because my multiple linear regression (there are other variables) shows that experience has a significant negative effect, which logically shouldn't happen, since more exposure should increase the probability of sustaining an injury.
George
In regards to experience, there are several ways to look at it. Many believe that an inverse correlation between skill level (which is closely aligned with experience) and injury exists--which, based on my research and many others, favors that point of view. However, on the other hand, since the most experienced athletes will be on the field or will play more than their inexperienced peers, the opportunity for injury exposure would be greater. Lastly, more experienced players may be more prone to attempt riskier maneuvers during play, increasing the potential for injury.
Question
As part of my research, I have to analyse 10 years of time series financial data from four public limited companies using Multiple Linear Regression. First I analysed each company separately using regression. The adjusted R-squared value is above 95% and the VIF is well within limits, but the Durbin-Watson statistic is 2.4, 2.6, or 1.1, etc., which signifies either negative or positive autocorrelation. Then I tried the model with the combined data of all four companies. This results in a very low adjusted R-squared value (35%) and again positive autocorrelation, with a Durbin-Watson statistic of 0.94. As I am attempting a DuPont analysis, where the dependent variable is Return on Equity and the independent variables are Net Profit Margin, Total Asset Turnover, and Equity Multiplier, which are fixed, I cannot change the independent variables to reduce the effect of autocorrelation. Please suggest what I should do.
Time series analysis offers various models, such as autoregressive models, that can explain the behavior of your data and let you project the DV into the future.
Question
If the dependent variable is continuous, is it justifiable to use both categorical variables (region, gender, history of disease, etc.) and continuous variables (direct cost, indirect cost) as independent variables in a multiple linear regression model in epidemiological studies?
Hi,
Categorical variables used as IVs can be converted into dummy variables and studied.
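A minimal sketch of what that conversion looks like: a categorical variable with k levels becomes k - 1 indicator (0/1) columns, with one level left out as the reference category. The region values and the `dummy_code` helper are hypothetical, for illustration only.

```python
def dummy_code(values, reference):
    """Return the non-reference levels and one 0/1 column per level for each row."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows

# Hypothetical categorical predictor with three levels.
region = ["north", "south", "east", "north", "east"]

levels, dummies = dummy_code(region, reference="north")
print(levels)   # ['east', 'south']
print(dummies)  # [[0, 0], [0, 1], [1, 0], [0, 0], [1, 0]]
```

Each dummy coefficient is then read as the difference in the DV between that level and the reference level, holding the other predictors constant. A two-level variable coded 0/1 is already a single dummy column.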
Question
Hi everyone,
My results show for some models that the model itself is not significant, but some independent variables within the model are significant (including the constant) in SPSS.
I was wondering how I should interpret these results? What can I say/conclude and what not?
Minke
Bruce Weaver, thank you for this clarification. Greatly appreciated.
Question
Dear All,
I am working on data with cost of care as the DV. The data are genuinely skewed, reflecting the socioeconomic gap, and therefore the healthcare financing gap, among the population of a developing country. Because of this skewness, my data violated the normality assumption and were therefore reported using the median and IQR. But I would like to analyze predictors of cost of care among these patients.
I need to know whether I can go ahead and use MLR, or are there alternatives?
The sample size is 1,320 and I am thinking of applying the Central Limit Theorem.
Dimeji
I am conducting research on the topic "assessment of Socioeconomic of GSM Masts on residents living in close proximity".
Please can you help me with relevant literature on the subject matter?
If you don't have any, do you know anyone who can help me with related literature?
I am an undergraduate student of Environmental Management Technology
Thanks
Question
If my model passes all the assumption tests for Multiple Linear Regression (MLR) except the autocorrelation test, and there is negative autocorrelation, can my model still be considered to meet the assumptions? Or should I fix the autocorrelation?
If I want to fix it, what should I do? Could you please help me and explain it to me.
Thanks,
Mansour
Question
I was wondering how you can interpret a square root independent variable on a multiple linear regression? Furthermore, what is the best way to visualize your predictions with a square root transformation?
Hello Antonio,
The interpretation is now in the metric of square root of the IV: For every unit increase, the regression coefficient gives the estimated amount of change in the DV (in its scale), if none of the other IVs change in value. Using standardized regression coefficients means that you're now talking about the estimated number of SDs of change in the DV, per SD change in the IV.
How to think about this in the metric of the original (untransformed) scores? Well, you're now forced to think about a nonlinear linkage between transformed & untransformed scores. Consider these untransformed scores, which represent a unit change on the transformed (square root) scale:
1 -> 1
4 -> 2
9 -> 3
16 -> 4
25 -> 5
36 -> 6
49 -> 7
So, where one is on the untransformed scale makes a difference as to how many (untransformed) units represent a one unit difference on the transformed scale.
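A small sketch of the same idea in code, assuming a fitted model of the form DV = a + b * sqrt(IV). The coefficient value `b_sqrt` is hypothetical; the raw scores follow the list above.

```python
import math

def dv_change_per_raw_unit(b_sqrt, x):
    """Predicted change in the DV when the raw IV goes from x to x + 1."""
    return b_sqrt * (math.sqrt(x + 1) - math.sqrt(x))

b_sqrt = 2.0  # hypothetical coefficient on sqrt(IV)
raw_scores = [1, 4, 9, 16, 25]

# Raw units needed to move one full unit on the square-root scale.
raw_units_per_sqrt_unit = [(math.isqrt(x) + 1) ** 2 - x for x in raw_scores]
print(raw_units_per_sqrt_unit)  # [3, 5, 7, 9, 11]

# Predicted DV change for a one-unit raw increase shrinks as the raw score grows.
changes = [dv_change_per_raw_unit(b_sqrt, x) for x in raw_scores]
print([round(c, 3) for c in changes])
```

For visualization, one option is to compute predictions over a grid of raw IV values (holding the other IVs at fixed values) and plot the resulting curve on the original scale, rather than a straight line on the transformed scale.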
Question
I ran a multiple linear regression model which has one dependent variable and four independent variables influencing it. The R -square of the model was very high (reached 95%) but when I used the approximation for some cases, there was a significant difference in the calculated value compared to my research results.
According to the model, only one predictor has a very large impact on the output while the other three predictors are minimal. (This should not be the case, since they all affect the result)
Is there anything else I should consider when using regression? How can I improve the influence of the predictors? More data?
Howdy,
Not an unusual finding--most people I know don't investigate their models.
If the dependent measure is categorical ordinal (e.g., a Likert-type variable), here is a relevant paper: https://odajournal.com/2013/09/20/maximizing-the-accuracy-of-multiple-regression-models-using-unioda-regression-away-from-the-mean/
Finally, here is a paper which shows that regression-based analyses are adept at
Best wishes...
Question
Dears,
Hope all is fine ,
I'm trying to rank 10 factors (independent variables) that affect the performance of organizations (dependent variable). I have measured the availability of the 10 factors in multiple organizations with a questionnaire, and I have also measured the performance of these organizations with another questionnaire.
Now I'm thinking of ranking these 10 factors based on their effect on performance, and I'm wondering which of the following methods is more professional and can be defended scientifically:
The first method is to conduct a multiple linear regression with performance as the dependent variable and the 10 factors as its predictors. Based on the regression model, I will summarize the standardized coefficient of each predictor; the largest coefficient will be the most critical factor affecting performance, and so on.
The second method is to create a new survey to measure expert opinion about the effect of these 10 factors on performance by asking: "On a scale from 1-5, kindly evaluate the effect of each factor on performance." Based on the experts' answers, I will take the mean for each factor, and the highest mean indicates the most important factor according to the experts.
I'm looking for your discussion of which method is more robust and scientifically correct. Thanks.
My friend, that is the problem: you seem confused about methods in the two different situations. I suggest you follow up with Gareth James et al., Introduction to Statistical Learning, available at z-library. Best wishes, David Booth
Question
I am working on a multiple linear regression problem, using simulated annealing optimization. I need to know which is more accurate, the simulated annealing or Bayesian model averaging; their pros and cons, if possible. Thanks in advance.
Hi! I agree with the previous answers: there's no single correct method between the two you propose. I think one way to decide may be to take known historical data and analyze it using both methods (as Jonathan Davis said); then you can choose the method that best suits your intended analysis according to the results obtained. Good luck with your analysis!
Question
Hello, for my undergrad dissertation I have a model where the dependent variable is Behavioral Intention (BI), and it has many independent variables. I first ran a regression analysis in SPSS by putting BI in the dependent box and the rest of the variables (as well as the control variables) in the independent box. Almost all of my hypotheses were accepted, except 2 where the significance was over 0.05. Then I decided to run the analysis by testing the variables one by one instead of putting them all together (I still included the control variables). I then realized that, this way, the standardized b coefficients were higher and the significance was almost always 0.000 (i.e., stronger relationships, and all hypotheses accepted). I know that the first method (multiple linear regression analysis) is probably more correct, but why does this happen? Note: there are no issues of multicollinearity
"Almost all of my hypotheses were accepted"
How? Can you say what your hypotheses were? Were they of the type B_1 = 0?
Question
Hello all,
Although the total effect (.18) and the indirect effects through mediators are positive, we have a negative direct effect of X on Y. How can it be possible?
Note 1: The analysis was run via PROCESS model 4.
Note 2: All variables are continuous.
It is certainly possible for a variable to have a negative direct effect X --> Y and at the same time a positive effect X --> M on one or more mediator variables that then in turn positively affect the outcome Y. However, in your case, the direct effect (.09) seems small (assuming you're showing standardized path/regression coefficients in the graph). Is the direct effect X --> Y statistically significant? If not, then this may simply indicate that the effect of X on Y is fully mediated (in that case, the direct effect may simply be zero in the population and negative in your sample only due to random sampling error).
Otherwise, you should ask yourself whether a negative direct effect X --> Y is substantively plausible in your study and whether all the other direct paths (X --> M and M --> Y) are also plausible in terms of their sign based on your expectations/theory.
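The arithmetic behind this can be sketched as follows. In a linear mediation model estimated by OLS, the total effect equals the direct effect plus the sum of the indirect effects, so a positive total effect is compatible with a negative direct effect as long as the indirect effects are large enough. The total effect is the value from the question; the direct effect of -.09 is an assumption, combining the magnitude quoted in the reply with the negative sign described in the question.

```python
def indirect_from_total_and_direct(total_effect, direct_effect):
    """In a linear OLS mediation model: total = direct + indirect."""
    return total_effect - direct_effect

total_effect = 0.18    # reported in the question
direct_effect = -0.09  # assumed: magnitude from the reply, sign from the question

indirect_effect = indirect_from_total_and_direct(total_effect, direct_effect)
print(round(indirect_effect, 2))  # 0.27
```

Under these assumed numbers, the combined indirect effect through the mediators is larger than the total effect, and the negative direct path offsets part of it.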
Question
In a correlation analysis between two variables sign was –22 but in a multiple linear regression due to influence of other variables it became 21. Conceptually it must be a negative correlation.
The change of sign depends on the presence of partial correlations between the regressors and the dependent variable; this happens when the regressors are not mutually independent. See: https://en.wikipedia.org/wiki/Partial_correlation.
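A minimal numerical sketch of such a sign flip, using the first-order partial correlation formula. All three correlations below are hypothetical values chosen for illustration (they form a valid correlation matrix); they are not the asker's data.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator

# Hypothetical correlations: x and y are negatively related overall,
# but x is negatively related to z while z is positively related to y.
r_xy = -0.22
r_xz = -0.60
r_yz = 0.60

r_xy_given_z = partial_correlation(r_xy, r_xz, r_yz)
print(round(r_xy_given_z, 3))  # 0.219
```

The bivariate association is negative, yet once z is held constant the association between x and y is positive. The regression coefficient of x in a model that also contains z has the same sign as this partial correlation.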
Question
Hello all,
This is a real dbRDA plot using real invertebrate abundance data (taxa-station matrix) with environmental data (substrate characteristics-station matrix) as predictor variables. The plot is produced in PRIMER v.7. Invertebrate data is 4th root transformed, Bray-Curtis similarity was used. Environmental data is normalized, Euclidean distance was used.
My question is: why is the vector overlay not centered at 0,0 in the plot? Interpreting this plot, one would conclude that every sampling station within the study area has values below the mean for predictor variables 2 and 13, which is impossible. Why would the center of the vector overlay be displaced -40 units? How can this be? Why is the plot centered on the dbRDA2 axis but not on the dbRDA1 axis?
The analysis is fine. The position of the vector diagram relative to the ordination is arbitrary - it could just as easily be in a separate key. The diagram indicates the direction across the ordination plane in which values of the selected variables increase. The length of each line indicates how much of the total variation in that variable is explained in the chosen ordination plane. If all of the variation is explained, the line reaches the circle.
Question
I have an enormous dataset, and each row has a predicted value along with a few characteristics (independent variables). I have fitted an ideal linear regression for this dataset. Now, I want to compare the set of independent variables of my ideal regression with the regression of each row of my dataset. I would appreciate any help. Thank you!
Chi-square tests
Question
I'm currently working on my master's thesis, in which I have a model with two IVs and two DVs. I proposed a hypothesis that the two IVs are substitutes for each other in improving the DVs, but I cannot figure out how to test this in SPSS. Maybe I'm thinking too 'difficult'. In my research, the IVs are contracting and relational governance, and thus they might be complementary in influencing my DVs or they might function as substitutes.
I hope anyone can help me, thanks in advance!
I think you can check the sign of the coefficients. If the sign is positive it might be complementary, otherwise supplementary effects can be deduced.
Question
I am new to statistics and trying to analyse my quantitative data. I am referring to "https://stats.idre.ucla.edu/other/mult-pkg/whatstat/" and "Laerd statistics" for the choice of statistical tests. The previous source suggests tests ranging from chi-square to multiple linear regression based on the number and nature of variables. Whereas the latter source has a clear distinction between "tests for group differences" and "tests for association and prediction".
I am confused if I have to phrase my question, looking for group difference or association according to the number and nature of my variables, for example- "Is there a difference between males and females based on their physical fitness?" OR "Is there an association between gender and physical fitness?"
In other words, Can I only test either a group difference or association for a given set of variables? Or I can test both?
Please let me know if my question is not clear.
Regards,
Noopur
In statistics, they have different implications for the relationships among your variables.
Association between two variables means the values of one variable relate in some way to the values of the other. Association is usually measured by correlation for two continuous variables and by cross tabulation and a Chi-square test for two categorical variables.
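For the categorical case, here is a minimal sketch of the Pearson chi-square statistic for a cross tabulation. The counts are hypothetical (rows could be two gender groups, columns "fit" and "not fit"), and `chi_square_statistic` is an illustrative helper without continuity correction.

```python
def chi_square_statistic(table):
    """Pearson chi-square: sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical 2 x 2 cross tabulation of counts.
table = [
    [30, 20],
    [20, 30],
]

chi2 = chi_square_statistic(table)
print(chi2)  # 4.0
```

With 1 degree of freedom, the 5% critical value is about 3.84, so a statistic of 4.0 would count as a significant association in this made-up table. The same table can be described either as a group difference (the proportion "fit" differs between groups) or as an association between the two variables; the test is identical.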
Question
I am developing a questionnaire and first performing an exploratory factor analysis. After I have the final factor structure, I plan on regressing the factor scores on some demographic covariates. Since I am anticipating missing item responses, I am thinking of imputing the item scores before combining them into factor scores (by average or sum).
I came across a paper that suggested using mice in stata and specifying the factor scores as passive variables. I am wondering if this is the best approach since I read somewhere that says passive variables may be problematic. Or, are there any alternative solutions? Thank you!
Here is a link to the paper, and the stata codes are included in the Appendix.
Yes... I would like to go with Alvis
Question
My dependent variable, when graphed, is heavily right-skewed. All attempts to transform it have failed (Log, Log10, Sqrt, ...).
The most ideal test I would like to run is a multiple linear regression as I have one dependent variable, which is continuous, and many independent variables. Can someone suggest the best statistical test to run?
Assa -
Did you try a "graphical residual analysis?" If you plot predicted y on the x-axis, and either y or estimated residuals on the y-axis, you can consider model fit. (Holding out some data to see how well you would 'predict' for them can help you avoid overfitting your model to a particular sample. See "cross-validation.")
If your model fits without logs, and you are only doing them to "fix" heteroscedasticity, I suggest that you not do that. Heteroscedasticity is a natural feature, to be expected. See https://www.researchgate.net/project/OLS-Regression-Should-Not-Be-a-Default-for-WLS-Regression and the updates and references there. Regarding hypothesis tests, there is an update to that project which explains why that is not helpful. Estimation of the coefficient of heteroscedasticity for use in the regression weight 'formula' is much more practical. A paper giving examples of how to make such estimates is included, and an excel tool for doing this for your data is included in the references. Another update notes that transformations are not optimal. There is a paper which discusses why you should expect heteroscedasticity, as described by Ken Brewer regarding survey statistics. There are other factors to consider, found in another reference. The other of my active projects notes an invited presentation to the US Energy Information Administration (EIA) which deals with a great deal of highly skewed data. Heteroscedasticity is prominent there.
Cheers - Jim
Question
My research is about patient safety culture among healthcare workers using the Hospital survey on patient safety culture (HSOPSC). I am performing a multiple linear regression to predict the outcome "overall perception of patient safety" score composite from the independent variables which include the other patient safety culture composite scores (11 variables) and sociodemographic factors (9 variables). I can say that the sociodemographic factors can be potential confounders but I need to make sure. How do I perform that using SPSS? and after that, how to run MLR adjusting for confounders? Thank you
Question
I have pH data for 1851 soil samples covering a study area of 7482 sq. km in northern Ghana, and I am using 52 long-term average environmental variables (Relief, Climate, MODIS Reflectances and derived products) to fit a model that explains the variability for pH prediction. So far, all models tested have shown low explained variance, and sometimes even negative values.
How may I improve the Explained Variance below?
Kindly see attached, the spatial distribution of points, and metadata excel file showing details about the covariates used.
Below is the summary of my models explained variance.
Multiple Linear Regression: 0.03
Step-wise Multiple Linear Regression: 0.04
RandomForest: -7.02
Support Vector Machines with Polynomial Kernel: 0.03
1. Avg. distance between two sample points using nearest neighbor analysis: 2551.29m
2. Data source: Student research data
3. Sampling method: Grid/Management Zone Hybrid Soil Sampling method. The grid size is 2-4sq.km, management zones are subdivisions within the grid.
My assessment so far
1. Removed spatial outliers
2. Removed value outliers
3. Normality check was ok (please see attached)
4. Variography shows a spatial structure (please see attached)
5. Tried Recursive Feature Elimination to reduce the dimensionality but did not show any improvement
6. Tried reducing dimensionality by removing highly correlated covariates at a threshold of 0.75
I would be most grateful for insights into any techniques that could help improve the model explained variance.
Dear all,
Many thanks for helping me to think through my Digital Soil Mapping Analysis. Your references were also valuable.
In the end, Ordinary Kriging was the best estimator, as the machine learning models were not able to capture enough variability in the observation data, which was due to poor spatial dependence (I mean that many points close to each other have dissimilar pH values).
I reckon that if I have to use machine learning, I may need to use many hundreds of covariates where each covariate tries to explain a small portion of the variability and scale it down to the best predictors. This process is iterative and highly time-consuming but would be considered later.
I attach here, the Ordinary Kriging pH prediction results.
Thanks a lot!
Question
I would like to test the effectiveness of an educational module, and this will be tested using MANCOVA. But before that, do I have to use multiple linear regression to see the relationship between variables?
I assume that you are using MANCOVA to adjust post means for initial differences in groups? You may want to look at Fraas, Newman and Pool (2007) to get some MLR viewpoints.
Question
I want to model the following:
DV: Y at time 3
IVs: X at time 1, X at time 2, change of X from time 1 to time 2
Which is the most appropriate way (preferably in SPSS)?
I am afraid I cannot use multiple linear regression due to multicollinearity, right?
Thanks a lot!
There are several biased multiple linear regression methods that allow you to estimate model parameters in the presence of multicollinearity, e.g., Principal Component Regression and Ridge Regression. These methods purposefully introduce bias into the estimation to reduce the variance of the estimator.
In your research problem, it seems your third variable is just a linear combination of the first two. Do you really need to include this regressor in your model? It provides no new information in this linear framework, and you can simply work with a full-rank model by respecifying the model. Perhaps in a nonlinear regression problem it would be more appropriate and more informative about the process.
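The rank problem can be checked directly. In this sketch, with hypothetical scores, the change column is exactly X at time 2 minus X at time 1, so the cross-product matrix of the three predictor columns is singular (determinant zero) and no unique least-squares solution exists. The intercept column is left out to keep the example short, and the helper names are illustrative.

```python
def gram_matrix(columns):
    """Matrix of cross-products between predictor columns (X'X without intercept)."""
    return [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]

def det2(m):
    (a, b), (c, d) = m
    return a * d - b * c

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Hypothetical integer scores for five cases.
x_t1 = [3, 5, 2, 8, 6]
x_t2 = [4, 7, 1, 9, 9]
change = [b - a for a, b in zip(x_t1, x_t2)]  # exact linear combination

det_full = det3(gram_matrix([x_t1, x_t2, change]))  # 0: rank deficient
det_reduced = det2(gram_matrix([x_t1, x_t2]))       # 839: full rank
print(det_full, det_reduced)
```

Dropping any one of the three columns restores a full-rank model. The coefficients of the two retained variables then carry the same information, only parameterised differently.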
Question
Specifically, I want to know if I can use multiple linear regression to predict a response value 'y' from its associated "treatment" or "group" averages in several variables. Let's say: predicting a person's weight (y) from the average weight of people in the same country (x1), the average weight of people of the same age (x2), and the average weight of people of the same ethnicity (x3). This should be an ANOVA problem, assuming we have data on 3 or 4 countries, 3 or 4 age ranges, and 3 or 4 ethnicities.
But, can I also propose a linear model as:
y = A +B1x1 + B2x2 + B3x3 ...?
...where, x1 to 3 are the averages previously mentioned.
Can this be simply solved using a linear regression algorithm?
Are there any statistical biases from doing this?
I'm also looking for literature references on this one. All over the internet people claim ANOVA and linear regression are pretty much the same thing. However, I would like to read an academic article where multiple linear regression has actually been used to solve a multi-way ANOVA, just to know if somebody has used it the same way as I propose.
Lots of resources on cross classified model here:
Question
I have N=400 and three independent variables, two of which are dummy variables, and the dependent variable is interval. I have already checked the normality of the residuals, and they are not normal. Is it possible for the residuals to be normal if we use dummy variables? Or is there some special classical assumption for dummy variables in linear regression? And is the plot below considered normal?
Samithamby Senthilnathan your answer is absolutely wrong! Neither the dependent variable, nor the predictor(s) have to be normal!! Just think about the simplest form of a linear regression with a dichotomous predictor, which is equivalent to a t-test. The predictor cannot be normal, since it is dichotomous and in case of a group difference, you would expect a bimodal distribution with peaks at the mean value of each group, the dependent variable is also absolutely not normal. Yet, if the distribution within each group is normal, i.e. the model just subtracts the mean value, the residuals will be normal!!
So it is not the case that "some argue" that the residuals, or better errors(!), should be normal; it is simply the only thing that matters in OLS, not the variables themselves.
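A small sketch of the t-test equivalence described above, with hypothetical data: regressing y on a 0/1 predictor gives an intercept equal to the mean of group 0 and a slope equal to the difference between the group means, so the residuals are simply each observation's deviation from its own group mean.

```python
# Hypothetical data: two groups with clearly different means.
group = [0, 0, 0, 0, 1, 1, 1, 1]
y = [9.0, 10.0, 10.0, 11.0, 19.0, 20.0, 20.0, 21.0]

n = len(y)
mean_x = sum(group) / n
mean_y = sum(y) / n

# Ordinary least squares for y = intercept + slope * group.
slope = sum((g - mean_x) * (v - mean_y) for g, v in zip(group, y)) / sum(
    (g - mean_x) ** 2 for g in group
)
intercept = mean_y - slope * mean_x

# Group means for comparison.
mean_group0 = sum(v for g, v in zip(group, y) if g == 0) / group.count(0)
mean_group1 = sum(v for g, v in zip(group, y) if g == 1) / group.count(1)

residuals = [v - (intercept + slope * g) for g, v in zip(group, y)]
print(intercept, slope)  # 10.0 10.0
print(residuals)         # [-1.0, 0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 1.0]
```

Here y is bimodal and the predictor is dichotomous, yet the residuals have the same shape as the within-group distributions, which is what the normality assumption refers to.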
Question
I have constructed a multiple linear regression model for hatchability. I have 360 samples: I used 260 samples to construct the model and kept 100 samples to validate model accuracy.
If I got RMSE values for the testing and validation samples of 6.8468 and 13.6909, respectively, how could the accuracy of the model be computed?
See the relevant material here: https://b-ok.cc/book/5243562/0e8c4b
Best wishes, D. Booth
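As a rough sketch of the calculation involved: RMSE is the square root of the mean squared prediction error, expressed in the units of the DV, and one simple check is to compare the RMSE on the held-out samples with the RMSE on the samples used to build the model. The observed and predicted values below are made up; only the two RMSE figures come from the question.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up observed and predicted values to show the calculation.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
example_rmse = rmse(observed, predicted)

# RMSE values reported in the question.
rmse_first = 6.8468
rmse_second = 13.6909
ratio = rmse_second / rmse_first
print(round(example_rmse, 4), round(ratio, 2))
```

A held-out RMSE roughly twice the other RMSE suggests the model predicts new cases noticeably worse than the cases it was built on. RMSE is not a percentage accuracy on its own; it is usually judged relative to the scale and spread of the DV.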
Question
I have a healthy control group (obese and non-obese) and patients with breast cancer. I am evaluating whether protein intake is an independent predictor of phase angle (body composition) results. It is not clear whether I have to include all subjects or just the patient group in my regression analysis.