Multiple Linear Regression - Science topic
Questions related to Multiple Linear Regression
I have gathered data from 60 companies, across 5 years.
I want to do a Multiple Linear Regression Analysis with some variables, from 2017 to 2021.
1) Should I average each variable over the 5 years for each company and run the regression on 60 observations, or
2) Should I treat the data as individual points, meaning observations = n*5 = 300?
How do I lay out that information (columns and rows) in the Excel sheet (if option 2)?
Thanks a lot
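A possible layout for option 2, sketched in Python with pandas rather than Excel. The column names (`company`, `year`, `roa`) and all values are made up for illustration. Each company-year becomes one row, so 60 companies over 5 years would give 300 rows. Note that the five rows belonging to one company are not independent observations, so a pooled regression on such data usually calls for panel methods or cluster-robust standard errors.

```python
import pandas as pd

# Hypothetical wide sheet: one row per company, one column per year (toy values).
wide = pd.DataFrame({
    "company": ["A", "B", "C"],
    "roa_2017": [0.10, 0.08, 0.12],
    "roa_2018": [0.11, 0.07, 0.13],
})

# Long layout: one row per company-year, one column per variable.
long_df = wide.melt(id_vars="company", var_name="year", value_name="roa")
long_df["year"] = long_df["year"].str.replace("roa_", "", regex=False).astype(int)
long_df = long_df.sort_values(["company", "year"]).reset_index(drop=True)

print(long_df)
```

The same layout works in a spreadsheet: one column for the company, one for the year, and one column per variable.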
Optimization and prediction of the kWh/m3 ratio in a pumping station using mathematical modelling: multiple linear regression (MLR) and artificial neural networks (ANN)
I generated a scatterplot for a multiple linear regression analysis using SigmaPlot and I want to display the regression equation, but I was not able to do it. Can anybody suggest how to display the equation on a SigmaPlot graph?
I am conducting a meta-analysis on a research study and will need to build a model for my data to conduct a multiple linear regression analysis. Can anyone provide a sample model or step-by-step approach to building a model? Thanks.
I have clinical parameters, and I need to find the relationship between the variables. Any ideas or suggestions on how to do the calculation?
I have fitted a multiple linear regression model with three continuous independent variables and one independent categorical variable. How to visualize or display the fit?
Thanking you in advance
The correlation coefficient between one of the independent variables and the dependent variable has a p-value greater than the alpha value of 0.05.
I am running a multiple linear regression between demographics and performance on a measure. However, although all the other assumptions are met, linearity is not (example image attached). I have applied a log10 transformation to the data and that hasn't helped. Are there any other corrections that can be applied?
To be more precise, my dependent variable was the mental well-being of students. The first analysis was a chi-square test (mental well-being x demographic variable), so I treated the DV as categorical. Then, in order to find the influence of mental well-being on my independent variable, I treated the DV as a continuous variable so that I could analyse it using multiple regression.
Is it appropriate and acceptable? And is there any previous study that did the same thing?
Need some advice from all of you here. Thank you so much
This post and the following questions are based on the post "you are probably modelling seasonality in the wrong way" by a company called Recast: https://getrecast.com/seasonality/
as well as the Uber publication "Bayesian time varying coefficients model for media mix modelling": https://arxiv.org/pdf/2106.03322.pdf
Although the papers above refer to media mix modelling, the question is applicable to a wide range of regression contexts.
The post by Recast captures a very interesting aspect of regression in a nonstationary environment that is often overlooked.
Modellers often assume an additive model such as y = intercept + effect of the variables of interest + controls, where the control variables are often regressed on previous sales (y) as a function of time and then added into the model; they might, for example, include trend and seasonality by decomposing the time series with e.g. Fourier decomposition.
The effect of the variables of interest is often treated as constant with respect to time, i.e. with constant coefficients, which will probably lead to the control variables absorbing the effect our advertising had on sales. The effect of advertising is most likely time-varying and exhibits seasonality, trends, etc. that are not captured; instead they will probably be absorbed by the control variables, which we are not really interested in.
In their post they don't break this down with respect to a multiplicative model. Can this be avoided by using a multiplicative model? Have a look at Uber's paper: they use a multiplicative model which also incorporates time-varying coefficients for the variables of interest. I can't really grasp how they explicitly calculate the trend and seasonality components, even though I understand the kernel part and the Fourier decomposition. Is it decomposed with knots with respect to sales and then plugged into the model?
This leads me to my second question: why do researchers decompose the trend and seasonality with respect to sales and then plug them into the formula when using additive models? Why don't they capture these patterns in the coefficients corresponding to the variables of interest?
That seems much more reasonable, or am I missing something crucial here?
Can anyone shed some light on this?
Hi, I am new here but really hope someone can help.
I ran a hierarchical multiple linear regression with 3 predictors in the first block and 1 in the second block. In the first model, only one of my predictors was significant, but I only included it to control for it as I expected it would be highly correlated to the DV. In my second model, one of the predictors that was non-significant in the first model is now significant. Can anyone explain what that means and how I can discuss those results? It would be great if you could also point me towards books or papers that explain this.
I would like to run a multiple regression with 4 independent variables and 1 dependent variable. I also have a dichotomous moderator, "gender", which is coded female = 1 and male = 2.
How do I test the moderator in SPSS to see if it is linear?
I have already checked the assumptions of the multiple linear regression for the dependent variable and the independent variables using partial regression plots. But how can I check whether the dichotomous moderator is linear?
Thanks in advance!
My dataset is not very big (between 100 and 200). The residuals are not normally distributed. So:
1. Is there any other statistical method similar to multiple linear regression but suitable for this case?
2. If not, what can be the solution?
When performing sensitivity analysis of the activated sludge model, multiple regression analysis of the different parameters (variables) with the model output is required. However, the model output is also a function of time, so I am confused how to implement a multiple linear regression of the model output with the parameters (variables) in MATLAB to find the linear regression coefficients.
Hello dear community,
I have a question regarding multiple linear regression (MLR), moderation analysis and median splits (dichotomization). For context: I have a dichotomous independent variable OKO, a continuous moderator KS and the dependent variable IPL. The question I am asking myself is whether I should dichotomize the moderator variable KS. The reason I am asking is the following set of results.
Regressing IPL on OKO and OKO*KS (KS as continuous variable) yields the unstandardized regression coefficients:
OKO - 2.017
KS - 0.2189
=> meaning that KS dampens the negative relation between OKO and IPL.
However, if I include the dichotomous variant of KS (KS_M) into the regression instead of KS (continuous variable) I get the following unstandardized regression coefficients:
=> meaning that KS amplifies the positive relation between OKO and IPL.
Can someone explain to me why I get contrary results?
For my data analysis I am conducting a multiple linear regression model. Currently I am testing the following assumptions:
1. Normality of residual errors
Since the normality of residual errors was violated, I transformed my dependent variable into a log variable. However, I am wondering whether I should now continue testing multicollinearity and homoskedasticity with this new log variable or still use the variable before data transformation? Moreover, when writing descriptives & correlation matrix should I include the variable before data transformation or the log?
Hopefully someone can help me! :)
My data analyst has used a concept called "pre-hypothesis"; it isn't equivalent to the hypothesis equation, but it is for checking Durbin Watson.
My problem is that I haven't found any source that has used the same concept (i.e., pre-hypothesis) for Multiple Linear Regression in general and Durbin Watson in particular.
I would be more than grateful if anyone could provide me with:
1. an explanation of "pre-hypothesis" in this concept, and if that equals "assumption"; and
2. a source in which "pre-hypothesis" is used and explained.
Hello, I'm a student researcher looking to apply a multiple linear regression to my data set of all continuous values. My question is mainly about a step-by-step idea of how to do this, as I believe I understand the general gist of it all, but not quite the whole picture.
To start, it is my understanding that you plot scatterplots of your independent vs dependent data and of your independent vs independent data. This helps determine whether there is a linear relationship between the variables, and whether there is any collinearity between your independent variables. After this step I can eliminate the independent variables that either are strongly collinear or have no relationship with the dependent variable.
After the initial screen, I am under the impression I can run a step-wise(or forward/backward) multiple linear regression slowly adding in the variables that fits the BEST MODEL. Is that a correct way to look at it?
Hi! I am trying to run the following multiple linear regression in R:
r ~ condition (1 = Control, 2 = Active Control, 3 = Treatment, 4 = Importance Treatment) + type (0 = false, 1 = true) + age (13, 14, 15, adult) + domain (1 = eco, 2 = health, 3 = society, 4 = culture)
r is the intention to share a given headline for a given participant (initially rated on a 6-point scale, with the score then transformed into a variable between 0 and 1). Participants are randomly assigned to 4 conditions, and we want to test the respective effects of the 2 treatment conditions versus the 2 controls on the intention to share false headlines (we predict the treatments will reduce the sharing of fake news without affecting the sharing of real news). In all conditions the main task consists of rating the intention to share 24 headlines presented successively to the participant, half false and half true. We therefore also want to know the effect of the "type" of the headline (true or false) and, eventually, the effect of its category: within each of the 2 sets of headlines, 1/4 have an economic subject, 1/4 are on health, 1/4 on "society" and 1/4 on culture, although this is less important. Finally, we are testing different age groups (13, 14 and 15 years old, as well as an "adult" group pooling participants aged 25 to 36) to find out whether the predicted effect varies with age.
The main hypotheses of our study are:
-that treatments will improve discernment, defined as the difference between the intentions (r > 0.5) to share true news and the intentions to share fake news (ie it will really improve the quality of sharing, and not merely causing general skepticism),
-that this effect will be higher for headlines perceived as the most inaccurate (there will be an analysis on headlines as well, based on pretest), as we suppose the effect of the treatment works by refocusing the attention of the participant on the accuracy criterion, hence being greater for headlines that were generally (we will calculate a mean perceived accuracy for each headline across participants) perceived as the least accurate ones, and that will be consequently the ones for which sharing intentions will drop the most thanks to the treatment
-We have no particular prediction concerning age, which is precisely the novelty of the study (the literature on adolescents' ability to evaluate fake news and on their online sharing leaves the door open to quite different scenarios)
-and no prediction for domain as well (we expect it won't play a role as the headlines have been chosen to be quite similar in tone, whatever the category)
Also, I need to cluster by participant (since there are repeated measures for each participants) and headline (multiple ratings for each headline).
I thought of using lm_robust but I don't know if we can put 2 clusters directly?
I also wonder what the simplest way is to check for potential effects of other secondary measures (such as scores on a Cognitive Reflection Task, gender, socio-professional category (CSP) membership, etc.): do I have to run regressions testing all simple effects, interactions, etc., or can I just add them to the global formula?
Thanks in advance!
Through a survey, I have measured the Big Five Personality traits and Task Performance.
Now I'm about to do the regression analysis, however, I'm not sure if I should do a simple linear or multiple linear regression.
In the case of openness when I run a simple linear regression (DV: task performance IV: openness) my results are:
Unstandardized B= .494
so the results are positive & significant
But when I run multiple linear regression (DV: task performance, IV: openness, conscientiousness, agreeableness, extraversion, neuroticism) my results for example for openness are:
Unstandardized B= .110
so the results are positive & not significant
Now, I'm confused about which one should I use and why.
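One way to see why the two coefficients differ is a small simulation. The sketch below uses simulated data, not the survey, and the effect sizes are arbitrary assumptions: when openness is correlated with another trait that also predicts performance, the simple-regression slope for openness absorbs part of that trait's effect, while the multiple-regression slope estimates the effect of openness with the other trait held constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated (not real) traits: openness is correlated with conscientiousness.
consc = rng.normal(size=n)
openness = 0.7 * consc + rng.normal(scale=0.7, size=n)
# Assumed true effects: small for openness, larger for conscientiousness.
perf = 0.1 * openness + 0.5 * consc + rng.normal(size=n)

def ols(y, *cols):
    """Return OLS coefficients (intercept first) of y on the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_simple = ols(perf, openness)[1]           # openness alone
b_multiple = ols(perf, openness, consc)[1]  # openness, controlling for consc

print(f"simple slope: {b_simple:.3f}, multiple slope: {b_multiple:.3f}")
```

Which model to report depends on the research question: the simple slope describes the overall association, the multiple slope the association net of the other traits.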
I want to know how age, obesity (0 for no and 1 for yes) and smoking status (0 for non-smoker and 1 for smoker) affect the serum vitamin D level, using regression in SPSS.
The best-fit curve for age reported: Linear (R2=0.108), Quadratic (R2=0.109), Cubic (R2=0.106)
So I assumed that the E(VitaminD)= b0 + b1*Age +b2*Age^2
But if I want to put this into a multiple regression analysis with age, obesity, and smoking status as independent variables, which one should I use? Age+obesity+smokingStatus or Age^2+obesity+smokingStatus?
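A common convention is to keep Age in the model whenever Age^2 is included, and then check how much the squared term actually adds. The sketch below does this on simulated data (the variable names and coefficients are assumptions for illustration, not the poster's data) by comparing the in-sample R^2 of a model with Age, obesity and smoking against the same model with a centred Age^2 term added.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Simulated (not real) data with a purely linear age effect.
age = rng.uniform(20, 70, size=n)
obese = rng.integers(0, 2, size=n)
smoker = rng.integers(0, 2, size=n)
vit_d = 40 - 0.2 * age - 3 * obese - 2 * smoker + rng.normal(scale=5, size=n)

def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

age_c = age - age.mean()  # centring reduces collinearity between age and age^2
X_linear = np.column_stack([np.ones(n), age_c, obese, smoker])
X_quadratic = np.column_stack([X_linear, age_c ** 2])

r2_linear = r_squared(X_linear, vit_d)
r2_quadratic = r_squared(X_quadratic, vit_d)
print(f"linear: {r2_linear:.3f}, with age^2 added: {r2_quadratic:.3f}")
```

With R2 values as close as those reported above (0.108 linear vs 0.109 quadratic), the squared term appears to add very little, which would argue for the simpler Age+obesity+smokingStatus model.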
I have a query regarding which type of regression analysis to use for my study. I have used a scale (dependent variable) that contains 9-items and each item is marked on a 5-point Likert scale. Scores of each item are summed and ranges from 9 to 45. Higher score indicates the respondent has more characteristics of that construct.
Similarly, there are two independent variables. One IV has 20 items and each item is marked on a 4-point Likert scale. Score ranges from 20 to 80. The second IV has 7 items and each item is marked on a 5-point Likert scale. Score ranges from 7 to 28.
The reviewer has suggested that I use non-parametric tests since my data are ordinal. However, previous studies have used multiple linear regression with similar types of constructs.
Which type of regression analysis is appropriate in this case - ordinal regression or multiple linear regression? Any literature explaining this would be highly useful.
I have been working with a GAM model with numerous features(>10). Although I have tuned it to satisfaction in my business application, I was wondering what is the correct way to fine tune a GAM model. i.e. if there is any specific way to tune the regularizers and the number of splines; and if there is a way to say which model is accurate.
The question is actually coming from the point that on different level of tuning and regularization, we can reduce the variability of the effect of a specific variable i.e. reduce the number of ups and downs in the transformed variable and so on. So I don't understand at this point that what model represents the objective truth and which one doesn't; since other variables end up influencing the single transformed variables too.
Hi everyone, I am working with a disjunctive model for decision-making. But here I'm a bit confused: how can I figure out the value of Co (Co is a constant above the largest Xi, to ensure Y will not be infinite) when doing multiple regression analysis in SPSS? Does it have any constant value?
I seem to have a hard time with the statistics for linear regression. I have been scrolling on the Internet, but did not find an answer.
I am testing the assumptions for linear regression, of which one is homoscedasticity. My data however shows a heteroscedastic pattern in the scatterplot. But how do I check if this is correct? Whether it is actually heteroscedastic...
Could I still perform the linear regression even though my data seems to be heteroscedastic?
I have tried transforming my DV with ln, lg10 and sqrt but the heteroscedasticity remains visible.
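A formal check that is often used alongside the scatterplot is the Breusch-Pagan test. The sketch below is a hand-rolled version of its studentized (Koenker) form on simulated data with a single predictor, so it is an illustration under those assumptions rather than a replacement for a statistics package: the squared residuals are regressed on the predictor, and n times the R^2 of that auxiliary regression is compared with a chi-square distribution with 1 degree of freedom.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Simulated (not real) data whose error spread grows with x.
x = rng.uniform(1, 10, size=n)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x)

# Main regression and its residuals.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Auxiliary regression: squared residuals on the same predictor.
u2 = resid ** 2
gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
fitted = X @ gamma
r2_aux = 1 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)

lm = n * r2_aux                          # LM statistic, chi-square with 1 df here
p_value = math.erfc(math.sqrt(lm / 2))   # chi-square(1) upper-tail probability

print(f"LM = {lm:.2f}, p = {p_value:.4g}")
```

A small p-value indicates heteroscedasticity. The regression can still be run in that case; a common remedy when transformations do not help is to keep the coefficients and report heteroscedasticity-robust standard errors.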
There are different regression models used in wind speed forecasting. I need relevant papers and reading sources on multiple linear regression, moving average regression, and k-nearest neighbour classification and regression methodology to develop a good understanding.
I have 3 DVs and 5 predictor variables (technically, only 1 IV and the other 4 are controls).
I ran it on Stata and all seems to have worked (since mathematically, all predictor variables are treated the same way), but technically I only identify one of my predictor variables as the IV of interest. I would rather stay away from SEM. Thank you so much!
Can we report the results of independent sample t-tests and multiple linear regression analysis in one table? If yes, what information should be put in that table? Thank you very much.
I want to study the risk factors for a dependent variable, the percentage of lame cows at farm level, using multiple linear regression in SPSS. Is that possible?
Hello my friends - I have a set of independent variables and the Likert scale was used on them and I have one dependent variable and the Likert scale was used as well. I made the analysis and I want to be sure that I'm doing this right - how I can use control variables such as age, gender, work experience and education level as control variables to measure their effect on the relationship between the independent variables and the dependent variable? Please give me one example. Thanks
I'm interested in comparing multiple linear regression and artificial neural networks for predicting the production potential of animals using certain dependent variables. However, I have obtained negative R-squared values for certain model architectures. Please explain the reason for the negative prediction accuracy or R-squared values.
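A negative value is possible because, on held-out data, R-squared is usually computed as 1 - SS_res / SS_tot. Whenever a model's predictions are worse than simply predicting the mean of the observed values, SS_res exceeds SS_tot and the result drops below zero. The toy numbers below are made up purely to show the arithmetic.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, as commonly computed on test data."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up test targets and two sets of predictions.
y_test = [10.0, 12.0, 14.0, 16.0]
good_pred = [10.5, 11.5, 14.5, 15.5]   # close to the targets
bad_pred = [16.0, 14.0, 12.0, 10.0]    # worse than predicting the mean (13)

r2_good = r_squared(y_test, good_pred)
r2_bad = r_squared(y_test, bad_pred)
print(r2_good, r2_bad)
```

Here the good predictions give 0.95 and the bad ones give -3.0. For a neural network, a negative test R-squared typically points to overfitting, poor training, or a mismatch in scaling between training and test data.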
A manpower deployment model built using multiple linear regression got a high p-value (above 0.05) for the final model. This research is based on an empirical study. The reason for the high p-value was a problem with the data we received for the independent variables (not highly accurate). The sample size is 240. My final question: is it reasonable to conclude this research with a final model that has high p-values? We found several other reasons related to data errors (organizational data entry problems). We have good reasons to present the overall model. Is it okay to conclude in that way?
**The overall model has one dependent variable and two independent variables (two independent variables also shows a high correlation between each other)**
I have 2 categorical and 1 continuous predictors (3 predictors in total), and 1 continuous dependent variable. The 2 categorical variables have 3 and 2 levels, respectively, and I have only dummy coded the variable with 3 levels, but directly assigned 0 and 1 to the variable with only 2 levels (my understanding is that if the categorical variable has only 2 levels, dummy coding is not necessary?).
In this case, how do I do and interpret the assumption tests of multicollinearity, linearity and homoscedasticity for multiple linear regression in SPSS?
- A question about multiple linear regression: if the dependent variable is, for example, time, do the independent variables also have to be measured in time? Here the MLR is used to build a model of task completion time. Do all the independent variables we collect have to be time-based? Can't we use the number of people as an independent variable?
r2 for MLR = 10-20 % approx at various train:test ratio
but for ANN it is 1-4%.
thanks in advance
Hi everyone! I'm running multiple regression models on incomplete data using R, and I applied the MICE algorithm to deal with the missing data. I've been able to get the pooled coefficients (B, t-tests, p-values) with no effort using the available scripts, but I couldn't find a way to obtain "goodness of fit" measures (like adjusted R squared and F) for the pooled data (not individual imputations or original data, but pooled values). I got the same problem using multiple imputation in SPSS.
Thank you very much for your attention, any help will be greatly appreciated!
I want to raise this question to the community. I know that in the ANOVA test where we compare means, for example, in different groups we will have problems with multiple comparisons since we only know the F-test results but group-to-group difference is unknown. Therefore, we would choose multiple comparison correction methods, such as Tukey's, Scheffe's, or Bonferroni, to adjust the p-value and explore each pair difference and significance.
However, I am conducting a multiple linear regression analysis using a stepwise (backward selection) method. I have a DV (QoL scores in eight different domains and two summary components; scores are continuous; the RAND-36 or SF-36) and a group of potentially associated IVs (factors; categorical and continuous).
For the models that I get from the automatic selection process (stepwise and backward), I would like to ask: will there be multiple comparison problems, and why? And what would be the recommended solutions to this kind of multiple comparison problem in multiple linear regression model building? Thank you!
I am looking at whether stress levels reduce from time point 1 to time point 2 when engaging in recommended resources.
DVs - stress levels at time point 1 and time point 2
IVs - engaged in resource 1 and resource 2
Regarding either a linear or a machine learning based regression analysis, how should we perform the normality test of the model? should we consider all data or just the training or testing dataset? I would be grateful if anyone could describe it with more details!
I am using a multiple linear regression model to analyze the UTAUT2 model. However, the Durbin-Watson value came out as 2.080. How can I check whether the survey data has any negative autocorrelation, given that the result is slightly above 2? The sample size is 209.
Checked the DW model table from Savin and White (Durbin-Watson Statistic: 1 Per Cent Significance Points of dL and dU) and it lies in dU (k=10). Can anyone help explain my situation?
If the negative autocorrelation exists, how to avoid this without increasing the sample size?
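For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. It is approximately 2(1 - r), where r is the lag-1 autocorrelation of the residuals, so values near 2 indicate little autocorrelation, values well below 2 positive autocorrelation, and values well above 2 negative autocorrelation. By that approximation a value of 2.080 corresponds to r of about -0.04. The residuals below are made-up sequences that only illustrate the two directions.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in their observed order."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residual sequences for illustration.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips: negative autocorrelation
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # long runs: positive autocorrelation

dw_alternating = durbin_watson(alternating)
dw_persistent = durbin_watson(persistent)
print(round(dw_alternating, 3), round(dw_persistent, 3))
```

The statistic depends on the order of the rows, so for cross-sectional survey data, where the row order is often arbitrary, it may not be very informative.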
For my master's thesis I am conducting research to determine the influence of certain factors (ghost games, lack of fan interaction, esports) on fan loyalty. For my statistical analyses, I will first conduct a confirmatory factor analysis to validate which items (e.g., purchased merchandise) belong to which latent factor (e.g., behavioral loyalty).
However, I am unsure about my next step. Can I use multiple linear regression with my latent variables to identify the relationship between the factors and loyalty? The data were collected through a survey of mainly 7-point Likert scale questions. Can I use linear regression, or is ordinal regression a must with Likert scale data?
Thanks in advance for answering!
My research is on the accuracy of ECG among doctors.
Dependent: ECG score (normally distributed)
Independent: sociodemographic characteristics (not normally distributed)
Which do I use for one-to-one correlation:
Pearson/Spearman or simple linear regression?
What are the differences?
I was told they were the same,
yet the results are different.
I want to proceed with multiple linear regression.
Hi there, essentially, I have collected data for my study that includes the frequency of particular injuries, e.g., sprained ankle x 5. I also have the experience of each individual, e.g., they have played for 5 years. Because the number of years played varies, I standardised the data to make it comparable by dividing injury frequency by the total number of years each person has played; e.g., 5 injuries across 5 years would result in a standardised frequency of 1.
My question is: when I want to look at the effect of experience on my standardised injury frequency, does the fact that I have used experience as my standardising factor affect the results? The only reason I ask is that the results of a multiple linear regression (there are other variables) show that experience has a significant negative effect, which logically shouldn't happen, as the more exposure you have, the higher the probability of sustaining an injury.
Thanks in advance
As part of my research, I have to analyse 10 years of time series financial data for four public limited companies using multiple linear regression. First I analysed each company separately. The adjusted R-squared value is above 95% and VIF is well within limits, but the Durbin-Watson statistic is 2.4, 2.6, 1.1, etc., which signifies either positive or negative autocorrelation. Then I tried the model with the combined data of all four companies. This results in a much lower adjusted R-squared value (35%) and again positive autocorrelation, with a Durbin-Watson value of 0.94. As I am doing a DuPont analysis, where the dependent variable is Return on Equity and the independent variables are Net Profit Margin, Total Asset Turnover and Equity Multiplier, which are fixed, I cannot change the independent variables to reduce the effect of autocorrelation. Please suggest what I should do.
If the dependent variable is continuous, is it justifiable to use both categorical variables (region, gender, history of disease, etc.) and continuous variables (direct cost, indirect cost) as independent variables in a multiple linear regression model in epidemiological studies?
My results show for some models that the model itself is not significant, but some independent variables within the model are significant (including the constant) in SPSS.
I was wondering how I should interpret these results? What can I say/conclude and what not?
Thank you in advance,
I am working on data with cost of care as the DV. These are genuinely skewed data, reflecting the socioeconomic gap and therefore the healthcare financing gap among the population of a developing country. Because of this skewness, my data violated the normality assumption and were therefore reported using median and IQR. But I would like to analyze predictors of cost of care among these patients.
I need to know whether I can go ahead and use MLR, or whether there are alternatives.
The sample size is 1,320 and I am thinking of relying on the Central Limit Theorem.
Thanking you for your anticipated answers.
If my model passes all the assumption tests for multiple linear regression (MLR) except the autocorrelation test, and there is negative autocorrelation, can I still treat the model as meeting the assumptions, or should I fix the autocorrelation?
If I want to fix it, what should I do? Could you please help me and explain it to me.
I was wondering how you can interpret a square root independent variable on a multiple linear regression? Furthermore, what is the best way to visualize your predictions with a square root transformation?
I ran a multiple linear regression model which has one dependent variable and four independent variables influencing it. The R -square of the model was very high (reached 95%) but when I used the approximation for some cases, there was a significant difference in the calculated value compared to my research results.
According to the model, only one predictor has a very large impact on the output while the other three predictors are minimal. (This should not be the case, since they all affect the result)
Is there anything else I should consider when using regression? How can I improve the influence of the predictors? More data?
Hope all is fine,
I'm trying to rank 10 factors (independent variables) that affect the performance of organizations (dependent variable). I have measured the availability of the 10 factors in multiple organizations with a questionnaire, and I have also measured the performance of these organizations with another questionnaire.
Now I'm thinking of ranking these 10 factors based on their effect on performance, and I'm wondering which of the following methods is more professional and can be defended scientifically:
The first method is to conduct a multiple linear regression, with performance as the dependent variable and the 10 factors as its predictors. Based on the regression model I will list the standardized coefficient of each predictor, and the predictor with the largest coefficient will be the most critical one affecting performance, and so on.
The second method is to create a new survey to measure expert opinion about the effect of these 10 factors on performance by asking "on a scale from 1-5, kindly evaluate the effect of each factor on performance". Based on the experts' answers I will take the mean for each factor, and the highest mean indicates the most important factor according to the experts.
I'm looking for your views on which method is more robust and scientifically correct. Thanks.
I am working on a multiple linear regression problem, using simulated annealing optimization. I need to know which is more accurate, the simulated annealing or Bayesian model averaging; their pros and cons, if possible. Thanks in advance.
The question has been answered- Closed Thread
Hello, for my undergrad dissertation I have a model where the dependent variable is Behavioral Intention (BI), and it has many independent variables. I first ran a regression analysis in SPSS by putting BI in the dependent box and the rest of the variables (as well as the control variables) in the independent box. Almost all of my hypotheses were accepted, except 2 where the significance was over 0.05. Then I decided to run the analysis by testing the variables one by one instead of putting them all together (however, I still included the control variables). I then realized that this way the standardized b coefficients were higher and the significance was almost always 0.000 (i.e. stronger relationships, and all hypotheses accepted). I know that the first method (multiple linear regression analysis) is probably more correct, but why does this happen? Note: there are no issues of multicollinearity.
Although the total effect (.18) and the indirect effects through mediators are positive, we have a negative direct effect of X on Y. How can it be possible?
Note 1: The analysis was run via PROCESS model 4.
Note 2: All variables are continuous.
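In a simple mediation model estimated by ordinary least squares, the total effect equals the direct effect plus the indirect effect (c = c' + a*b). A negative direct effect is therefore possible whenever the indirect effect is larger than the total effect. The sketch below illustrates this with simulated data and a single mediator; the path values are assumptions chosen so that the total effect is about .18, and it does not reproduce the PROCESS output.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Simulated (not real) data: a = 0.8, b = 0.6, direct effect c' = -0.3,
# so the implied total effect is -0.3 + 0.8 * 0.6 = 0.18.
x = rng.normal(size=n)
m = 0.8 * x + rng.normal(size=n)
y = -0.3 * x + 0.6 * m + rng.normal(size=n)

def ols(target, *cols):
    """Return OLS coefficients (intercept first) of target on the given columns."""
    X = np.column_stack([np.ones(len(target)), *cols])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return beta

c_total = ols(y, x)[1]             # total effect of X on Y
a = ols(m, x)[1]                   # path X -> M
_, c_direct, b = ols(y, x, m)      # direct effect of X, and path M -> Y

print(f"total {c_total:.3f} = direct {c_direct:.3f} + indirect {a * b:.3f}")
```

This pattern, where the direct and indirect effects have opposite signs, is usually called inconsistent mediation or suppression.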
In a correlation analysis between two variables the coefficient was –22, but in a multiple linear regression it became 21 due to the influence of other variables. Conceptually it must be a negative correlation.
This is a real dbRDA plot using real invertebrate abundance data (taxa-station matrix) with environmental data (substrate characteristics-station matrix) as predictor variables. The plot is produced in PRIMER v.7. Invertebrate data is 4th root transformed, Bray-Curtis similarity was used. Environmental data is normalized, Euclidean distance was used.
My question is: why is the vector overlay not centered at 0,0 in the plot? Interpreting this plot, one would conclude that every sampling station within the study area has values below the mean for predictor variables 2 and 13, which is impossible. Why would the center of the vector overlay be displaced by -40 units? How can this be? Why is the plot centered on the dbRDA2 axis but not on the dbRDA1 axis?
Please let me know if anyone needs more information. Thank you!
I have an enormous dataset and for each row, I have a predicted value, and in the same row, there are a few characteristics(independent variables). I have set an ideal linear regression for this dataset. Now, I want to compare the set of independent variables of my ideal regression with the regression of each every row of my dataset. I appreciate for any help....thank you!
I'm currently working on my master's thesis, in which I have a model with 2 IVs and 2 DVs. I proposed a hypothesis that the two IVs are substitutes for each other in improving the DVs, but I cannot figure out how to test this in SPSS. Maybe I'm thinking too 'difficult'. In my research, the IVs are contractual and relational governance, so they might be complementary in influencing my DVs or they might function as substitutes.
I hope someone can help me. Thanks in advance!
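One common way to frame a substitutes-versus-complements hypothesis is as an interaction (moderation) test, sketched below on simulated data. The variable names `contract`, `relational`, and `performance`, and all coefficient values, are hypothetical. The idea is to add the product of the two IVs to the regression: a negative product-term coefficient is usually read as substitution, a positive one as complementarity. In SPSS the equivalent would be to compute the product of the (typically mean-centered) IVs and enter it as an additional predictor.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical, already mean-centered governance scores.
contract = rng.normal(size=n)
relational = rng.normal(size=n)

# Simulated outcome in which each IV helps, but less so when the other is high.
performance = (0.5 * contract + 0.5 * relational
               - 0.3 * contract * relational + rng.normal(size=n))

# Design matrix: intercept, both main effects, and their product term.
X = np.column_stack([np.ones(n), contract, relational, contract * relational])
beta, *_ = np.linalg.lstsq(X, performance, rcond=None)
interaction = beta[3]

print("main effects:", np.round(beta[1:3], 3))
print("interaction (negative suggests substitutes):", round(interaction, 3))
```

With two DVs, the same model would be fitted once per DV; whether the product term is statistically significant still needs to be checked with the usual t-test from the regression output.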
I am new to statistics and trying to analyse my quantitative data. I am referring to "https://stats.idre.ucla.edu/other/mult-pkg/whatstat/" and "Laerd statistics" for the choice of statistical tests. The former source suggests tests ranging from chi-square to multiple linear regression based on the number and nature of the variables, whereas the latter source makes a clear distinction between "tests for group differences" and "tests for association and prediction".
I am unsure whether I should phrase my question as looking for a group difference or for an association, given the number and nature of my variables. For example: "Is there a difference between males and females based on their physical fitness?" OR "Is there an association between gender and physical fitness?"
In other words, can I test only a group difference or only an association for a given set of variables, or can I test both?
Looking forward to your inputs.
Please let me know if my question is not clear.
I am developing a questionnaire and first performing an exploratory factor analysis. After I have the final factor structure, I plan on regressing the factor scores on some demographic covariates. Since I am anticipating missing item responses, I am thinking of imputing the item scores before combining them into factor scores (by average or sum).
I came across a paper that suggested using mice in Stata and specifying the factor scores as passive variables. I am wondering if this is the best approach, since I read somewhere that passive variables may be problematic. Are there any alternative solutions? Thank you!
Here is a link to the paper; the Stata code is included in the Appendix.
My dependent variable, when graphed, is heavily right-skewed. All attempts to transform it have failed (Log, Log10, Sqrt, ..).
Ideally, I would like to run a multiple linear regression, as I have one continuous dependent variable and many independent variables. Can someone suggest the best statistical test to run?
My research is about patient safety culture among healthcare workers, using the Hospital Survey on Patient Safety Culture (HSOPSC). I am performing a multiple linear regression to predict the outcome, the "overall perception of patient safety" composite score, from the independent variables, which include the other patient safety culture composite scores (11 variables) and sociodemographic factors (9 variables). I suspect that the sociodemographic factors are potential confounders, but I need to make sure. How do I check that using SPSS? And after that, how do I run the MLR adjusting for confounders? Thank you
I have pH data for 1851 soil samples covering a study area of 7482 sq. km in northern Ghana, and I am using 52 long-term average environmental variables (Relief, Climate, MODIS Reflectances and derived products) to fit a model that explains the variability in pH for prediction. So far, all models tested have shown low explained variance, and sometimes it is even negative.
How may I improve the Explained Variance below?
Kindly see attached the spatial distribution of points and the metadata Excel file showing details about the covariates used.
Below is a summary of my models' explained variance.
Multiple Linear Regression: 0.03
Step-wise Multiple Linear Regression: 0.04
Extreme Gradient Boosting: 0.03
Support Vector Machines with Polynomial Kernel: 0.03
Additional information about the sample data
- Avg. distance between two sample points using nearest neighbor analysis: 2551.29m
- Data source: Student research data
- Sampling method: Grid/Management Zone Hybrid Soil Sampling method. The grid size is 2-4 sq. km; management zones are subdivisions within the grid.
My assessment so far
- Removed spatial outliers
- Removed value outliers
- Normality check was ok (please see attached)
- Variography shows a spatial structure (please see attached)
- Tried Recursive Feature Elimination to reduce the dimensionality but did not show any improvement
- Tried reducing dimensionality by removing highly correlated covariates at a threshold of 0.75
I would be most grateful for insights into any techniques that could help improve the models' explained variance.
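On the negative values mentioned above: assuming explained variance is computed on held-out data as 1 - SS_res / SS_tot (an assumption, since the post does not state the formula), it becomes negative whenever the model predicts worse than simply using the mean of the held-out values. The pH numbers in this sketch are made up purely to show the arithmetic.

```python
def r_squared(observed, predicted):
    """Explained variance as 1 - SS_res / SS_tot on the observed values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out pH values and two sets of predictions.
held_out = [5.0, 6.0, 7.0]
mean_only = [6.0, 6.0, 6.0]   # always predicting the mean
misleading = [7.0, 6.0, 5.0]  # predictions that rank the samples the wrong way

r2_mean = r_squared(held_out, mean_only)
r2_bad = r_squared(held_out, misleading)

print("mean-only baseline:", r2_mean)
print("worse than the mean:", r2_bad)
```

Values near zero therefore mean the covariates add almost nothing beyond the overall mean, and negative values mean the fitted relationship does not carry over to the validation points.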
I would like to test the effectiveness of an educational module, and MANCOVA will be used for this test. But before that, should I use multiple linear regression to see the relationship between the variables?
I want to model the following:
DV: Y at time 3
IVs: X at time 1, X at time 2, change of X from time 1 to time 2
What is the most appropriate way to do this (preferably in SPSS)?
I am afraid I cannot use multiple linear regression due to multicollinearity, right?
Thanks a lot!
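The concern can be checked numerically. Assuming the change score is defined as X at time 2 minus X at time 1, it is an exact linear combination of the other two IVs, so the design matrix loses a rank and the three slopes cannot be estimated separately. The sketch below uses simulated values and hypothetical variable names.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Hypothetical measurements of X at two time points, and their difference.
x_t1 = rng.normal(size=n)
x_t2 = 0.5 * x_t1 + rng.normal(size=n)
change = x_t2 - x_t1

# Intercept plus all three IVs: four columns, but only three are independent.
full = np.column_stack([np.ones(n), x_t1, x_t2, change])

# Dropping any one of the three IVs restores full column rank.
reduced = np.column_stack([np.ones(n), x_t1, change])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print("columns:", full.shape[1], "rank:", rank_full)
print("columns:", reduced.shape[1], "rank:", rank_reduced)
```

Any two of the three IVs carry the same information as all three, so a model with two of them (for example, X at time 1 and the change score) is estimable, and software will typically drop or flag the third.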
Specifically, I want to know if I can use multiple linear regression to predict a response value 'y' from its associated "treatment" or "group" averages in several variables. Let's say: predicting a person's weight (y) from the average weight of people in the same country (x1), the average weight of people of the same age (x2) and the average weight of people of the same ethnicity (x3). This should be an ANOVA problem, assuming we have data on 3 or 4 countries, 3 or 4 age ranges and 3 or 4 ethnicities.
But, can I also propose a linear model as:
y = A + B1x1 + B2x2 + B3x3 ...?
...where x1 to x3 are the averages previously mentioned.
Can this be simply solved using a linear regression algorithm?
Are there any statistical biases from doing this?
I'm also looking for literature references on this one. All over the internet, people claim ANOVA and linear regression are pretty much the same thing. However, I would like to read an academic article where multiple linear regression has actually been used to solve a multi-way ANOVA, just to know if somebody has used it the same way I propose.
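For the single-factor case, the proposal can be tried on a toy example. The sketch assumes that the group averages are computed from the same sample as y (if they came from external data, the result below would not hold exactly), and the weights and country codes are made up. Regressing y on its own group mean gives a slope of 1 and an intercept of 0, so the fitted values are just the group means, which is what a one-way ANOVA model predicts.

```python
import numpy as np

# Hypothetical weights (kg) for three people in each of three countries.
weight = np.array([60.0, 62.0, 64.0, 70.0, 72.0, 74.0, 80.0, 85.0, 90.0])
country = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# x1: the average weight of the person's own country, from the same sample.
group_means = np.array([weight[country == g].mean() for g in range(3)])
x1 = group_means[country]

# Simple linear regression of weight on the country average.
X = np.column_stack([np.ones(len(weight)), x1])
(intercept, slope), *_ = np.linalg.lstsq(X, weight, rcond=None)
fitted = intercept + slope * x1

print("intercept:", round(float(intercept), 6), "slope:", round(float(slope), 6))
print("fitted values equal the group means:", np.allclose(fitted, x1))
```

One caveat worth checking in the multi-factor case: each average enters as a single regressor with one slope, whereas a factor with k levels uses k - 1 degrees of freedom in ANOVA, so the standard errors and tests from the averages-based regression are not the same as those from the ANOVA.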
I have N=400 and three independent variables, two of which are dummy variables, and the dependent variable is interval-scaled. I have already checked the normality of the residuals, and they are not normal. Is it possible for the residuals to be normal if we use dummy variables? Or is there some special classical assumption for dummy variables in linear regression? And would the plot below be considered normal?
I have constructed a multiple linear regression model for hatchability. I have 360 samples: I used 260 samples to construct the model and kept 100 samples to validate the model's accuracy.
If I got RMSE values of 6.8468 and 13.6909 for the testing and validation samples, respectively, how can the accuracy of the model be computed?
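As a rough sketch of how such figures are usually produced and put on a common scale, the snippet below computes the RMSE on a hold-out set and divides it by the mean of the observed values to get a relative error. The hatchability numbers are invented for illustration, and normalizing by the mean is only one of several conventions (the range or the standard deviation are also used).

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


# Hypothetical hold-out hatchability values (%) and model predictions.
observed = [80.0, 75.0, 90.0, 60.0]
predicted = [78.0, 79.0, 85.0, 66.0]

error = rmse(observed, predicted)  # in the units of the outcome
relative_error = error / (sum(observed) / len(observed))

print("RMSE:", error)
print("RMSE relative to the mean observed value:", round(relative_error, 4))
```

RMSE is an error measure rather than an accuracy percentage, so it is usually reported alongside a scale reference like this. A validation RMSE that is roughly twice the RMSE on the other sample, as in the figures above, is commonly taken as a sign that the model fits the data it was built on better than it fits new data.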
I have a healthy control group (obese and non-obese) and patients with breast cancer. I am evaluating whether protein intake is an independent predictor of phase angle (body composition) results. It is not clear whether I should include all subjects or just the patient group in my regression analysis.
I appreciate your help.
I have a healthy control group and patients. I am evaluating whether a certain parameter is an independent predictor of hyperinsulinemia. It is not clear whether I should include all subjects or just the patient group in my regression analysis.
What is a common approach?
Thanks in advance for any help ;)