# Regression Modeling - Science topic

Questions related to Regression Modeling
Question
I'm trying to compare several linear regression models; however, many of them do not report their standard error of the estimate (SEE). Is there an alternative way to calculate the SEE when we do not have access to the original data to calculate residuals?
Provide output, and say sample size.
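One possible route, assuming each paper reports R-squared (or adjusted R-squared), the sample size n, the number of predictors k, and the sample standard deviation of the dependent variable: because SSE = (1 - R^2) * SST and SST = (n - 1) * sd_y^2, the SEE can be recovered without residuals. The sketch below is illustrative only; the function names and the numbers in the usage line are hypothetical.

```python
import math

def see_from_r2(sd_y, r2, n, k):
    """SEE = sqrt(SSE / (n - k - 1)), with SSE = (1 - R^2) * (n - 1) * sd_y^2.

    sd_y is the sample SD of the dependent variable (n - 1 denominator),
    n the sample size, k the number of predictors (intercept not counted).
    """
    return sd_y * math.sqrt((1.0 - r2) * (n - 1) / (n - k - 1))

def see_from_adjusted_r2(sd_y, adj_r2):
    """Equivalent shortcut when adjusted R^2 is reported instead."""
    return sd_y * math.sqrt(1.0 - adj_r2)

# Hypothetical reported values, not taken from any real study
print(round(see_from_r2(sd_y=10.0, r2=0.75, n=102, k=1), 3))
```

If a paper reports neither the standard deviation of the dependent variable nor anything from which it can be derived, this route is not available.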
Question
Hi,
I am using a pool of cross-sections over years (a cross-section of 255 firms over 11 years). I have used the classical linear regression model (CLRM) and a logit model, given a continuous DV in the former and a dichotomous DV in the latter.
In logit regression, the econometric model is
Price_Deviation = B0 + B1 (prestegious_investor) + B2 (default_score) +Bk (controls)
where,
--> Price_Deviation is a dummy variable that takes the value 0 or 1,
--> prestegious_investor is a dummy variable that takes the value 0 or 1,
--> default_score is a continuous variable with range -inf to +inf, although in my case it takes a minimum value of -37.12 and a maximum value of 24.25.
I want to include interaction effects. I want to know whether they can be included the way we introduce them in the CLRM (or simple linear regression models), or whether any adjustment is needed in the specific case of logistic models.
Thanks and regards,
Sahil Narang
When using nonlinear models such as logistic regression models interaction terms can't be interpreted in the same way as in linear models (OLS-regression). Berry et al. (2012) show that here you have to distinguish between interaction terms and interaction effects.
I would recommend calculating and plotting marginal effects. How this can be done in principle is shown in Mize (2019); how to do this in Stata is shown in lecture material by Williams here: https://www3.nd.edu/~rwilliam/stats3/ (see "Interpreting results: Adjusted Predictions and Marginal effects").
Helpful literature explaining what is different in nonlinear models (and hence in logistic regression models) and what to do to interpret results appropriately can be found here (in alphabetical order):
- Berry, W. D., Golder, M., & Milton, D. (2012). Improving tests of theories positing interaction. The Journal of Politics, 74, 653--671.
- Mize, T. D. (2019). Best practices for estimating, interpreting, and presenting nonlinear interaction effects. Sociological Science, 6, 81-117.
[Strongly recommended!]
- Mood, C. (2010). Why we cannot do what we think we can do and what to do about it. European Sociological Review, 26, 67--82.
[A classic to explain the general problem]
Additionally, if you are interested in mediation effects, as a starter you should have a look at:
- Kohler, U., Karlson, K. B., & Holm, A. (2011). Comparing coefficients of nested nonlinear probability models. The Stata Journal, 11, 420--438.
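The distinction drawn in this answer between an interaction term and an interaction effect can be illustrated numerically. The sketch below uses made-up logit coefficients (b0 to b3 are hypothetical, not estimates from the question's data) and computes the second difference in predicted probabilities: the effect of prestegious_investor at a high default_score minus its effect at a low default_score.

```python
import math

def prob(investor, score, b0, b1, b2, b3):
    """Predicted probability from a logit model with a product term."""
    eta = b0 + b1 * investor + b2 * score + b3 * investor * score
    return 1.0 / (1.0 + math.exp(-eta))

def second_difference(score_lo, score_hi, **coefs):
    """Change in the investor effect (on the probability scale) across scores."""
    effect_hi = prob(1, score_hi, **coefs) - prob(0, score_hi, **coefs)
    effect_lo = prob(1, score_lo, **coefs) - prob(0, score_lo, **coefs)
    return effect_hi - effect_lo

# Hypothetical coefficients; note the product-term coefficient b3 is zero
coefs = dict(b0=-2.0, b1=1.0, b2=0.1, b3=0.0)
print(second_difference(score_lo=0.0, score_hi=10.0, **coefs))
```

Even with b3 set to zero, the second difference is not zero, because the logistic curve is nonlinear. This is why the coefficient on the product term alone does not describe the interaction on the probability scale.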
Question
As for cumulative link models (also called ordinal logistic regression models), Christensen (2016) reports that, in the case the proportional odds assumption is not met, one or more covariates need to be scaled to relax the assumption.
However, in the model summary it may happen that a covariate is non-significant when it is not scaled but significant when it is scaled. What does that mean?
Marcello
Thank you.
Question
I am working on a logistic regression model with a binary outcome, and my model looks like this:
Intercept: the coefficient is -2.034.
Variable: the coefficient is 0.031, and the p-value is 0.130 (not significant).
However when I calculate the Odds Ratio and CI 95% I get
Odds ratio 1.031, Lower Limit = 0.991, Upper Limit = 1.073
However, since the variable is not significant, I shouldn't get a P-value greater than 1 correct?
Can anyone explain why this is happening?
I'm not exactly sure what you are asking, but a logistic regression slope coefficient of 0 (corresponding to an OR of 1) indicates no relationship. Your regression slope is not significantly different from zero at the .05 level (p = .13), and your 95% CI for the OR therefore includes 1.0. Both these values thus indicate that the relationship is not statistically significantly different from zero.
If your regression slope coefficient were significant at the .05 level (i.e., p < .05), you would expect to see a 95% CI for the OR that does not include 1.0.
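The link between the coefficient, its p-value, and the confidence interval for the odds ratio can be sketched as follows. The standard error of 0.0203 is an assumption, back-calculated from the confidence limits reported in the question; it was not stated by the poster.

```python
import math

def odds_ratio_summary(beta, se, z_crit=1.959964):
    """Wald summary for one logit coefficient: OR, 95% CI, two-sided p-value."""
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return {
        "odds_ratio": math.exp(beta),
        "lower": math.exp(beta - z_crit * se),
        "upper": math.exp(beta + z_crit * se),
        "p_value": p_value,
    }

# beta from the question; se = 0.0203 is an assumed, back-calculated value
result = odds_ratio_summary(beta=0.031, se=0.0203)
print(result)
```

Because the interval is built as exp(beta +/- 1.96 * SE), it contains 1.0 exactly when the interval for beta contains 0, which is the same condition as p > .05 for the Wald test.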
Question
In my next research project, I plan to use a surface regression model to estimate a model for a target variable and exogenous strategy variables. Then, using this function, I will try to find the optimal X's that maximize Y. Do you know of any applications like the one I have in mind? If so, can you share their names?
This is kind of a strange answer, but look at response surface methods. Redo the search in the attached screenshot and see if the results help you. Best wishes, David Booth
Question
Suppose, we want to check association between A and Z using regression model. We add different covariates in model 1 let's say b, c, d, e. The regression shows significant association for A and Z after adjustment for these covariates. However, in model 2, we add another covariate let's say f, but when we run the model now, association between A and Z becomes non-significant? What caused this non-significance? What could be the reason for this?
There are, I think, three possibilities here (not mutually exclusive).
1) Collinearity between A and f. You state "Collinearity has been checked and there isnt any with F in the model and even without F in the model". This is implausible except in designed experiments, as it implies that there is exactly zero correlation with any of the covariates. More likely, I think, the collinearity is low. However, even a low correlation between A and f reduces your effective sample size (relative to a model with zero collinearity). For example, modest collinearity such as a tolerance of 0.90 amounts to a 10% reduction in effective sample size.
2) Each predictor uses up a degree of freedom in the model and reduces the error d.f. by 1. Even if there is zero collinearity this produces a modest decrease in statistical power because the estimate of the error variance (SS residual) is larger and the test statistics smaller. If overall n is low this effect can be very pronounced. If n is large then this effect is usually negligible.
3) There are missing data on f and model 2 has smaller n and may lose influential cases in the A-Z relationship.
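The tolerance example in point 1 can be made concrete. The sketch below uses the standard relations VIF = 1 / tolerance and tolerance = 1 - R^2 (from regressing A on the other predictors); the "effective sample size" reading follows the answer above, and the function name is hypothetical.

```python
import math

def collinearity_summary(tolerance):
    """Translate a predictor's tolerance into variance and SE inflation."""
    vif = 1.0 / tolerance
    return {
        "vif": vif,                          # variance inflation factor
        "se_inflation": math.sqrt(vif),      # factor by which the SE grows
        "effective_n_fraction": tolerance,   # share of n that is 'retained'
    }

print(collinearity_summary(0.90))
```

With a tolerance of 0.90 the standard error of A's coefficient grows by about 5%, which can be enough to push a borderline p-value across the .05 threshold.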
Question
I am performing a meta-analysis of odds ratio per unit of a continuous variable with a dichotomous outcome (dichotomized continuous variable). One of the studies reports a mixed linear regression model with coefficient and standard error for the continuous variable regressed on the continuous outcome variable. Is there any acceptable method to estimate the odds ratio I need?
Apologies for the sending. I don't have any idea why the site did that. David Booth
Question
I am supposed to predict land use/land cover changes using rainfall, climate and socio-economic data. What regression model can be recommended?
Hi Morrison
Is your dependent variable binary? In other words, can it take only one of two values (0 or 1, true or false, black or white, green or not green, etc.)? In this case, you may want to use logistic regression to analyze your data.
Kind Regards
Ernur
Question
I am running 6 separate binomial logistic regression models with all dependent variables having two categories, either 'no' which is coded as 0 and 'yes' coded as 1.
4/6 models are running fine, however, 2 of them have this error message.
I am not sure what is going wrong, as each dependent variable has the same 2 values on the cases being processed, either 0 or 1.
Any suggestions what to do?
That error message says that for 2 of your DVs, everyone has the same value. Try this:
TEMPORARY.
SELECT IF NMISS(x1, x2, x3) EQ 0.
FREQUENCIES y1 y2.
Replace x1, x2, x3 with the list of explanatory variables in your model(s). And replace y1 y2 with the two DVs that are causing the error message to occur.
PS- Be sure to highlight and run all 3 lines together. Do not execute the SELECT IF line separately from TEMPORARY.
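For readers working outside SPSS, the same check can be sketched in Python: keep only the cases with no missing explanatory variables, then tabulate each problem DV. The variable names and the toy rows are hypothetical, as in the syntax above.

```python
from collections import Counter

def dv_frequencies(rows, predictors, outcomes):
    """Tabulate each outcome among cases with complete predictor data."""
    complete = [r for r in rows if all(r.get(p) is not None for p in predictors)]
    return {
        y: Counter(r[y] for r in complete if r.get(y) is not None)
        for y in outcomes
    }

# Hypothetical data: y2 equals 1 only in a case that has a missing predictor
rows = [
    {"x1": 1.0, "x2": 2.0, "y1": 0, "y2": 0},
    {"x1": 0.5, "x2": None, "y1": 1, "y2": 1},
    {"x1": 2.0, "x2": 1.0, "y1": 1, "y2": 0},
]
print(dv_frequencies(rows, predictors=["x1", "x2"], outcomes=["y1", "y2"]))
```

In this toy example y2 has both values in the full data but only one value once incomplete cases are dropped, which is the situation the error message describes.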
Question
Suppose we split our data into training and test groups and use the training data alone to create a logistic regression model in SPSS. Does this model produce different results from a logistic regression model created in Python?
And how can we convert the predicted probabilities from the SPSS logistic regression model for the test group into interpretable categorical outputs (0 or 1)?
I don't understand what you want to do. In any case, the attached may be helpful to you. Notice that the validation step there works better than a single split. Best wishes, David Booth
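On the second part of the question, one common way to turn predicted probabilities into 0/1 labels is to apply a cut-off. The sketch below assumes a cut-off of 0.5, which is a conventional default rather than anything stated in the thread; the probabilities are hypothetical.

```python
def classify(probabilities, cutoff=0.5):
    """Label a case 1 when its predicted probability reaches the cut-off."""
    return [1 if p >= cutoff else 0 for p in probabilities]

# Hypothetical predicted probabilities for a test group
test_probabilities = [0.12, 0.50, 0.73, 0.49]
print(classify(test_probabilities))
print(classify(test_probabilities, cutoff=0.3))
```

The cut-off is a design choice: lowering it labels more cases as 1, trading specificity for sensitivity, so it may be worth choosing it on the training data rather than fixing it in advance.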
Question
Dear All,
I want to develop a regression model that relates years of experience in an area to the ability to handle problems in this area. The ability to solve problems is assessed through easy, average, and hard problems. A panel of experts will give these estimates.
Should I use fuzzy regression for this case? Can anyone recommend a paper that I can start with (an easy one to replicate)? Also, if I have 5 experts, how do I combine their different answers into one?
Is there any other technique, other than fuzzy regression, that has been used in the literature?
Last, did anyone deal with a problem similar to the one I am describing?
Shahab Hosseini thanks for the papers. They are good
Question
I am currently investigating the differences between financial and non-financial firms and, apart from the traditional output of the regression model, I need to test for differences in their coefficients. I am using a dummy variable equal to 0 for non-financial and 1 for financial firms. Could someone please assist me with the Stata code/command in this regard? Thank you.
One term in your regression should be beta*(0,1 dummy). Testing H0: beta=0 is the answer to your question. This is sometimes called the test for parallelism. Best wishes, David Booth
Question
I have developed a multiple regression model, and my sample size was 164. Now I want to validate the model using a new data set. Is there any rule for the sample size that I should use? One colleague suggested I use 20% of the sample used to develop my model. Please, I need advice.
With smaller samples (depending on what you have as a population), you can try to validate your model with additional data collection, especially when you are modelling social phenomena that are highly fragile and thus exhibit unexpected variances. I would not suggest a fixed sample size here, but advise that you continue expanding your validation sample until the change in goodness of fit in response to new data is insignificant. Of course, this assumes a very good sampling design and strategy.
Question
I made the following logistic regression model for my master's thesis.
FAIL= LATE + SIZE + AGE + EQUITY + PROF + SOLV + LIQ + IND.
Here I examine whether late filing of financial statements (the independent variable of interest) is an indicator of failure of small companies (the dependent variable). FAIL is a dummy variable equal to 1 when a company failed during one of the years studied. I use data covering 3 years (2017, 2018 and 2019). Should I include a dummy variable YEAR to account for year effects, or not? I have searched online, but I don't understand exactly what it means, which is why I don't know whether it is necessary to include it in this regression model. I hope you can help me. Thank you in advance!
Any new variable will only improve R^2, but one must not introduce variables that one's theoretical model does not encompass. If you have no guesses (even better if others have already discussed it) about how a new independent variable might influence the dependent variable, it's best not to include it. The more variables, the more multicollinearity. Any additional variation that you explain by adding the year variable will be misleading.
Adding time or space variables, for that matter, is almost always a bad idea. If you run LOO tests by year, they will show that adding year only worsens the actual predictive power of your model.
However, you might want to consider including some time-dependent macroeconomic indicators that impact bankruptcy, although they will cause more multicollinearity. I still strongly recommend running LOO cross-validation, and perhaps target-oriented cross-validation if you can handle it, to show that such indicators do not spoil your model.
Question
I am looking to remove collinearity from my results. I have performed chi-square tests on two of the independent variables that I thought may have collinearity. The chi-square test is significant showing there is an association between the two categorical variables, so do I now need to exclude one of the variables from my final logistic regression model?
Hello Lauren,
Presence of a non-zero correlation among IVs isn't enough to confirm that any regression coefficient estimates will be biased. If researchers insisted on zero correspondence among IVs, then virtually every regression model would end up as a bivariate model (one IV, one DV).
If you're looking for "optimal" models/subsets of IVs (especially when there are other issues, like smaller than desirable sample and/or IV collinearity), then David Booth's recommendation for lasso (esp. adaptive lasso) is certainly worth your consideration.
But, if the concern is to evaluate the performance of a given or hypothesized model, perhaps it makes sense (as Christian Thrane suggests) to have in mind some specific criterion for declaring excess overlap among IVs that is more informative than whether their pairwise correlation is different from zero.
Question
I have 215 samples and 9 independent variables. I want to know which independent variables better predict the DV. However, the 9 IVs are highly correlated with each other, so I don't think I can use a multiple regression model and compare the standardized betas (some of the VIFs are close to 20; in addition, the assumption of normality of residuals is violated). I wonder if I can compute the bivariate correlations and compare the correlation coefficients (the assumption of normality is violated, so I used Spearman's rho to test the bivariate correlation between the DV and each IV). Do I have to transform the Spearman's rho values into Fisher Z, or can I compare them directly?
Thank you so much for any advice!
Hello Fennie Chang. If all you want to do is determine which pairwise correlation is strongest, compute the 9 correlations and rank order them by absolute value. But I assume you want to statistically compare each of the 9 correlations to the other 8. These correlations will be non-independent, with the DV common to each of them. There is a standard method for that situation. You can find SAS code on Karl Wuensch's website and SPSS code on my website (file 6 in both cases).
Notice too that comparing standardized regression coefficients does not work when the variables are correlated with each other, as they are in your case (see pp. 542-543).
HTH.
Question
I heard about a software called πonix, used for creating regression models. It is useful when the scales have many items. However, when I search for it, I find nothing.
Has anyone heard of this software?
Thank you for the link to that program.
Ωnyx is for SEM and path analysis.
Question
I'm attempting to build a binary response (single species presence/absence) logistic regression model with random intercept (for site variable level). I surveyed 30 sites 1-3 times; approx half of the sites were only visited once. Ideally, I would model site as the random effects variable and include one or two habitat variable(s) as the fixed effects variable(s). I recognize that my sample size is very low and suspect that I also have insufficient replication of observations per level of the random effects variable.
Is it possible to use random effects in this situation? If not what other approach would you recommend?
The only alternative that I can think of is to build a regular binary response logistic regression including only one observation per site, and repeat this for every possible combination of 30 sites. I figure this would allow me to use all of my data to infer which covariates are most influential, although it makes getting coefficient and coefficient confidence estimates, AICc values, etc. difficult as far as I can tell.
Question
In general, an optimized model gives more accurate results than the local model, but in my case the traditional regression model shows a higher R2 value than the optimized model. Why might that be? Thank you in advance for your kind feedback.
GridSearchCV is a technique for searching for the best parameter values within a given grid of parameters. It is basically a cross-validation method: the model and the parameter grid are fed in, the best parameter values are extracted, and then the predictions are made.
Question
More exactly, do you know of a case where there are repeated, continuous data, sample surveys, perhaps monthly, and an occasional census survey on the same data items, perhaps annually, likely used to produce Official Statistics?   These would likely be establishment surveys, perhaps of volumes of products produced by those establishments.
I have applied a method which is useful under such circumstances, and I would like to know of other places where this method might also be applied.   Thank you.
Besides some surveys I worked on extensively with help at the US Energy Information Administration (EIA), the Canadian Monthly and Annual Miller's Surveys noted in the poster linked below for an upcoming conference (the paper is not yet finished) would be an example that appears to be a good response to this question.
..............................
Monthly Miller's Survey:
Annual Miller's Survey:
Question
Hello, many articles report accuracy when using ANFIS (or regression models), rather than RMSE or MAE.
How are they calculating it?
Thank you.
Question
There are several independent variables and several dependent variables. I want to see how those independent variables affect the dependent variables. In other words, I want to analyze:
[y1, y2, y3] = [a1, a2, a3] + [b1, b2, b3]*x1 + [c1, c2, c3]*x2 + [e1, e2, e3]
The main problem is that y1, y2, y3 are correlated. An increase in y1 may lead to a decrease in y2 and y3. In this situation, which multivariate multiple regression models can I use? And what are the assumptions of those models?
Hello Jialing,
The fact that DVs are correlated is often one argument for choosing a multivariate method to analyze the data rather than generating a univariate model for each individual DV.
If what you are saying is that Y1 influences Y2, causally or temporally, then perhaps you'd be better off evaluating a path model which incorporates such proposed relationships rather than simply allowing for correlations/covariances other than zero among the DV pairs in the multivariate regression.
Question
Hi, I am running a logit regression in Stata. I do not see standard errors and p-values for some regressors (in a few models) or for any regressor (in some other models).
Pseudo R-squared is 1 in the models where Stata reports no SE or p-value for any regressor.
I understand that this is happening due to either the problem of separation (quasi or full) or reduced degrees of freedom due to inclusion of higher number of regressors (sort of k>n).
Details for dataset are as follow:
--> Model includes a regressor in the level form as well as the square term (quadratic function).
--> In addition, the model includes 9 control variables plus industry and year controls.
--> Industry control includes 34 unique industries in the form of 2-digit industry codes.
--> Year control includes 11 unique years. Cross-sectional data is pooled across those 11 years.
If I remove the industry controls, the results become inconsistent across models: in some the coefficients change sign, and in others they become insignificant.
I would be grateful for your kind suggestions.
Thank you!
Hello Sahil, based on what you just said, the problem seems to be too many IVs for your number of observations. I understand that you would like to include industry effects, but you don't have enough observations for that big a model. Therefore increasing n or decreasing the number of predictors seem to be your options. Best wishes, David Booth
Question
As there are several software packages for estimating threshold panel regressions, such as Matlab, R, and Stata, I'm wondering which one is accurate and easier to use (dynamic balanced panel data).
Regards
Maryam
Stata
Question
We define linearity of a model in terms of the coefficients, not the variables; therefore we treat Y=βX as a linear regression model (here the coefficient and the variable are both linear).
If the coefficient is linear but the variable is non-linear, as in Y=βX^2, then why do some texts call this a non-linear regression model? (Here the variable is non-linear, but the coefficient is still linear.)
Dear Khursheed Dar, the non-linear function can be transformed into a linear function. Kindly visit the link.
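The point about transformation can be shown in a few lines. Defining Z = X^2 turns Y = βX^2 into Y = βZ, which ordinary least squares fits directly because the model is still linear in β. A minimal sketch for the no-intercept case, with made-up data generated from β = 3:

```python
def fit_through_origin(z, y):
    """OLS slope for y = beta * z (no intercept): sum(z*y) / sum(z*z)."""
    return sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)

# Hypothetical data generated exactly from y = 3 * x^2
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0 * xi ** 2 for xi in x]

z = [xi ** 2 for xi in x]        # transformed regressor
beta_hat = fit_through_origin(z, y)
print(beta_hat)
```

A model such as Y = exp(βX) plus additive error, by contrast, cannot be written as a linear function of β, and is non-linear in the sense used for estimation.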
Question
I have data for my research in a qualitative format, namely ratings: AAA, AA, A, BBB, BB, B, CCC, CC, C.
How can I convert these qualitative values into quantitative ones to include them in a regression model? I probably cannot use the dummy variable method, as a dummy can take only two values, 0 or 1. In my scenario, I have more than two categories.
Could anyone please help me convert qualitative data into quantitative data?
If there is no ranking relationship between the categories, you should create several columns containing only 0 and 1. For example, if you have three categories A, B and C, then across two columns A will be 0-0, B will be 0-1 and C will be 1-0. If the qualitative variable contains some kind of ranking, just sort the categories and use numbers.
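Both options in this answer can be sketched for the ratings in the question. The ordinal codes 1 to 9 assume equal spacing between adjacent ratings, which is an assumption of this sketch rather than something the data guarantee; the choice of "C" as the reference category is also arbitrary.

```python
# Ratings ordered from lowest to highest
RATINGS = ["C", "CC", "CCC", "B", "BB", "BBB", "A", "AA", "AAA"]

def ordinal_code(rating):
    """Rank-based score: C -> 1, ..., AAA -> 9 (assumes equal spacing)."""
    return RATINGS.index(rating) + 1

def dummy_code(rating, reference="C"):
    """One 0/1 column per rating except the reference category."""
    return {level: int(rating == level) for level in RATINGS if level != reference}

print(ordinal_code("BBB"))
print(dummy_code("BBB"))
```

With nine categories, dummy coding yields eight columns; the reference category is the one for which all eight are 0.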
Question
How do we calculate beta coefficient (standardized R) for results of a mixed regression model with negative binomial distribution please?
Standardized coefficients have uniform measurement units and are used to compare the strength of variables. I have trouble finding a suitable R package for calculating beta for a glmer.nb model.
Question
I have all the data regarding landslide susceptibility mapping, and I want to analyze the data by using the logistic regression model, but still, I have no idea how to process it.
In my experience, if your maps are in shapefile format (boundary maps), you can transform them into raster files (grid format). With this data structure, you can then use R for logistic regression.
My paper "Logistic Regression Model of Built-Up Land Based on Grid-Digitized Data Structure: A Case Study of Krabi, Thailand" explains this.
Question
I am running a Tobit model, where Pseudo R square is 3.21 (positive value). Is it okay to have Pseudo R square greater than 1 or there is some problem in the regression model?
My advice is to ignore pseudo Rsquares. I have never found one to be useful. Best wishes David Booth
Question
I am trying to code this optimizer for a linear regression model. What I want to confirm is whether the model parameters are updated even if the update increases the cost function.
Or do we only update the coefficient values if they decrease the value of the cost function?
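The optimizer referred to in the question is not shown, so the following sketch assumes plain batch gradient descent on the mean squared error. In that algorithm every update is applied unconditionally; nothing checks whether the cost went down. The data and learning rates are made up for illustration.

```python
def cost(xs, ys, b0, b1):
    """Half mean squared error of the line b0 + b1 * x."""
    n = len(xs)
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

def gradient_descent(xs, ys, lr, steps):
    """Plain batch gradient descent; updates are never rejected."""
    b0 = b1 = 0.0
    n = len(xs)
    history = [cost(xs, ys, b0, b1)]
    for _ in range(steps):
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad_b0 = sum(errors) / n
        grad_b1 = sum(e * x for e, x in zip(errors, xs)) / n
        b0 -= lr * grad_b0   # applied whether or not the cost decreases
        b1 -= lr * grad_b1
        history.append(cost(xs, ys, b0, b1))
    return b0, b1, history

# Hypothetical data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

b0, b1, good = gradient_descent(xs, ys, lr=0.05, steps=2000)
_, _, bad = gradient_descent(xs, ys, lr=0.5, steps=20)
print(b0, b1, good[-1], bad[-1])
```

With a small learning rate the cost falls at every step; with a learning rate that is too large the same update rule makes the cost grow. Variants that accept a step only if the cost decreases (for example, backtracking line search) exist, but they are a different algorithm.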
Question
Hi all,
I have annual river run-off data and a catchment composite mean of PDSI values (a drought index). A simple regression model tells me that the PDSI predicts the run-off well (as can be physically expected). I checked the residuals, normality, homoscedasticity, and p-values. All perform fine, and everything is significant. The data span 1869-2012.
Because I have the PDSI values for the period 0-2012, would it be possible to use the results of the model to reconstruct the run-off for the same period?
runoff_1869_2012 = β0 + β1*PDSI_1869_2012
-> summary(model): β0 = 1055.363 and β1 = 55.668
hence, would this make sense:
runoff_0_2012 = 1055.363 + 55.668*PDSI_0_2012
Would I then get the reconstructed run-off for the whole period, with some uncertainty?
Many thanks for suggestions on this very basic question!
all best,
Michael
Michael -
Are you saying that each year gives you a data point, so 144 data points, and assuming all things else unchanged, you use those points in a simple linear regression, regardless of dates? (I don't know why you have 1869-2012 one place and 0-2012 another???) I would expect heteroscedasticity based on size, as measured by predicted-y, b0 + b1*PDSI_year, with years out of order. Anyway, did you use a "graphical residual analysis" to check fit? "Cross-validation?" What does your intercept represent, or is it noise, and you should use a ratio estimator?
Time series are not my interest, but I have to wonder if you should be considering that. Perhaps there is autocorrelation.
Perhaps I do not understand exactly what you are doing. If you can show a graph, or graphs, that might be helpful for everyone.
If you can demonstrate that what you have works by dropping data you have and seeing how close you come to predicting it, that could be helpful.
Cheers - Jim
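Mechanically, the reconstruction proposed in the question amounts to applying the fitted line to the longer PDSI series. The sketch below uses the two coefficients reported in the question; the PDSI inputs are hypothetical, and the output is point predictions only, without prediction intervals or any of the checks (autocorrelation, heteroscedasticity, hold-out validation) raised in the reply.

```python
INTERCEPT = 1055.363   # reported in the question
SLOPE = 55.668         # reported in the question

def reconstruct_runoff(pdsi_values):
    """Apply the fitted line to a PDSI series (point predictions only)."""
    return [INTERCEPT + SLOPE * pdsi for pdsi in pdsi_values]

# Hypothetical PDSI values for years outside the calibration period
print(reconstruct_runoff([-2.0, 0.0, 1.0]))
```

The hold-out idea at the end of the reply could be tried by refitting on, say, 1869-1980 and comparing predictions with the observed run-off for 1981-2012 before trusting the reconstruction further back.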
Question
Hello,
I am studying the effect of ESG on financial performance measured as ROA. When I am running my regression model without control variables the ESG variable has a negative coefficient while when including them it has a positive coefficient. The control variables are log assets, log revenue, leverage, employees etc.
Does anyone know why this could be the case?
If your library has it, take a look at Chapter 13 (Woes of Regression Coefficients) in the classic book by Mosteller & Tukey (1977).
And see slide 16 here:
As the first bullet point on that slide notes, "predictors usually change together!"
HTH.
Question
Hello,
I would like to examine the association between a disease and psychological load. The disease can be determined by different diagnostic methods. In our study, we performed four accepted procedures to determine the disease. Each participant (total sample n=45) underwent all four procedures at different times. Additionally, we examined psychometric data, and these are the main variables I am focusing on.
My idea is to examine the association between the disease and psychological load, as a function of the diagnostic method chosen. In other words: Is the association between the disease - diagnosed by method A - and psychological load significantly different/stronger than the association between the disease - diagnosed by method B - and psychological load.
As for the statistical methods, I initially thought of logistic regression with the disease as criterion and the psychometric variables as predictors. This would lead to 4 regression models: SB diagnosed with A; SB diagnosed with B etc. as criterion. My idea is to compare the AICs of the four different models: Do the psychometric variables predict the disease better and explain more variance when diagnosed with method X or method Y.
I hope my question and concept is comprehensible.
Is this an appropriate approach or does anyone have another idea?
Thank you very much for your replies!
Kind regards,
Nicole
Could you simply compare the point-biserial correlations between disease (binary variables) and psychological load (a continuous/metric variable) using statistical tests for comparing dependent correlations, that is, correlation coefficients obtained from the same sample for different variables?
Question
For example, I want to know how baseline characteristics of patients (age, BMI...) and the confounding factors (smoking, diabetes or other chronic diseases) affect the serum vitamin D value. Which regression model should I use?
I use SPSS and R for analyzing the data
Use multiple regression analysis: enter the baseline characteristics and the confounding factors together as independent variables, with serum vitamin D as the dependent variable, so that the associations are adjusted for the confounders.
Question
Dear all
I'm testing the parsimonious Technology Acceptance Model in the context of new retail technologies. As some studies found a significant influence of variables such as age, gender and education, I thought about adding them to my regression model together with my main independent variables, "perceived usefulness" and "perceived ease of use".
I saw on ResearchGate that it is possible to just add the control variables together with the independent variables to the multiple regression model in SPSS. But to me it doesn't make sense, because these variables are not controlled and are just treated as independent variables.
Is that correct? Or do I need to do a hierarchical regression?
Here is the discussion that talks about putting them together with the independent variables:
Thank you very much!
I think if you can control the demographic variables to show more appropriate results, you can also use ANCOVA analysis by following the assumptions.
Question
Hi,
We would really appreciate some help building regression model(s) using a panel data set. The dataset consists of data from retail stores over two years. Our research question is as follows: How do different payment methods affect retail stores' unregistered shrinkage?
We want to see which payment method gives the highest increase in shrinkage, and to compare the payment methods.
We have tried the following models, but the results do not make sense. y = x1 + x2 + x3 + u
1. Shrinkage = sale self-service checkout + revenue + region + u
2. Shrinkage = sale ShopExpress + revenue + region + u
3. Shrinkage = sale served checkout + revenue + region + u
We want to control for store size and region, which is why these variables are included. Also, all stores in the dataset have served checkouts, and some have self-service and ShopExpress (scan and go). Are we therefore unable to use dummies for the payment methods?
We have also tried this model, but also get weird results:
4. Shrinkage = dummy served checkout + dummy self-service + dummy ShopExpress + revenue + region
• Is there a better way to create the regression models? Is it possible to gather them into one model?
• Do you have suggestions for other control variables that we should include?
• To be able to run the random effects model, we had to transform the variables into natural logarithm form. Does it make sense to use ln?
Thank you!
See the attached reference. Best wishes David Booth
Question
Dear all,
I am developing a new method for soil analysis, and have several candidates.
I am comparing the results derived from the candidates with that from the reference method using linear regressions. So, I have several linear regression models (same independent variable, different dependent variables), with different R-squares and slopes.
I am wondering which (statistical) method should be used to choose the best regression model (i.e., to choose the most appropriate candidate as the new method). Can R-squared and slopes alone be used, or do I have to use other statistics, such as RMSE, SEE, or MAE?
Thanks a lot
Tho Nguyen -
You have "A direct comparison of the regression lines can not be made as I draw them using different response variables," but that is not true. A "graphical residual analysis" can be used to compare two or more different regression models on the same scatterplot, for the same sample. "Cross-validation" is used because fitting a model too tightly to one sample may mean it will not fit well for other data in the intended population or subpopulation. But you can try other samples on other scatterplots. Each graphical residual analysis scatterplot will be used to compare different models for a given sample, if you plot the results for each model for that sample on the same scatterplot, one scatterplot per sample.
The "graphical residual analysis" (you can research that term in quotes) will show the model fit, including heteroscedasticity, a naturally occurring feature which can be modeled by including regression weights. Typically variance of residuals becomes greater with larger predicted-y-values. This can be seen as the density in the y-direction on the scatterplot becomes less, indicating more variance in the y-direction, regardless of range, as you move to larger predicted-y-values. See https://www.researchgate.net/project/OLS-Regression-Should-Not-Be-a-Default-for-WLS-Regression, and the various updates and references there.
If you research "graphical residual analysis," you need to remember that you can put the results of more than one model for the same sample on the same scatterplot.
Best wishes - Jim Knaub
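As a numerical companion to the graphical comparison, the statistics named in the question (RMSE, MAE) can be computed for each candidate directly against the reference method. The following is only a minimal Python sketch: the measurements and the names `candidate_A` and `candidate_B` are invented for illustration and are not from the original post. It shows why R-squared and slope alone can mislead: a candidate can track the reference closely yet carry a constant offset, which appears in the bias and RMSE.

```python
import math

# Hypothetical measurements (invented for illustration): the reference method
# and two candidate methods applied to the same eight soil samples.
reference = [1.2, 2.5, 3.1, 4.8, 5.0, 6.7, 7.9, 9.4]
candidates = {
    "candidate_A": [1.0, 2.7, 3.0, 5.1, 4.8, 6.9, 8.2, 9.1],
    "candidate_B": [1.9, 3.4, 3.8, 5.9, 6.1, 7.5, 9.0, 10.6],
}

def agreement(ref, cand):
    """Return (mean bias, MAE, RMSE) of a candidate against the reference."""
    diffs = [c - r for r, c in zip(ref, cand)]
    n = len(diffs)
    bias = sum(diffs) / n
    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return bias, mae, rmse

results = {name: agreement(reference, values) for name, values in candidates.items()}
for name, (bias, mae, rmse) in results.items():
    print(f"{name}: bias={bias:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```

With these made-up numbers, candidate_B follows the reference closely in shape but sits roughly one unit too high, so it would do well on R-squared and poorly on RMSE and MAE.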
Question
Can anybody send me solar radiation data for regression modeling?
Question
Dear all,
I am working on a research study exploring the impact of air pollution on peak expiratory flow rate (PEFR). I have five areas within a city - Rambagh, Alopibagh, Ashoknagar, Katra and Johnstonganj - with varying levels of air pollution in each area. What I expect is that residents of areas with high levels of air pollutants will have low PEFR.
Variable png is a dummy variable representing those who consume tobacco and those who do not. Gender dummy variable has usual meaning.
I have run a double-log regression model, with results as shown in the image. Kindly help me interpret the coefficients of the dummy variable area and of the interaction term area#c.log_income.
Any further suggestions to improve the model are highly welcome. Thank you for your time.
Regards
It depends on the sign of the dummy variable's coefficient in the estimated equation. If the sign is negative, the group coded with the higher value of the dummy has a lower predicted dependent variable than the group coded with the lower value; if the sign is positive, it has a higher one. This is the basis for explaining the effect of a dummy variable. The results of the analysis also show the significance of each of these variables through its p-value.
Question
Dear professor and Researcher
I have estimated the export supply function. I have taken the dependent variable (exports) in natural logs but one explanatory variable (producer price index) in levels.
The estimated coefficient on the producer price index is 0.00745 (PPI * 0.00745). Please guide me on how to interpret this value, that is, how to convert it into a percentage.
thanks and best regards
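The coefficients in a linear-log model (level dependent variable, logged independent variable) represent the estimated unit change in your dependent variable for a percentage change in your independent variable: a 1% change in X is associated with a change of about b/100 units in Y. The term on the right-hand side is the percent change in X, and the term on the left-hand side is the unit change in Y. Note that the model described in the question is the reverse case (logged dependent variable, level regressor), in which 100*b approximates the percentage change in Y for a one-unit change in X.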
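For the specification in the question (exports in natural logs, PPI in levels), the percentage reading can be worked out directly. This is a minimal Python sketch, assuming that the reported 0.00745 is the slope on PPI and that all other regressors are held fixed; the variable names are illustrative.

```python
import math

b_ppi = 0.00745  # coefficient on PPI as reported in the question

# In ln(Y) = a + b*X, a one-unit increase in X multiplies Y by exp(b).
approx_pct = 100 * b_ppi                  # small-coefficient approximation
exact_pct = 100 * (math.exp(b_ppi) - 1)   # exact percentage change in Y

print(f"approximate: {approx_pct:.3f}%   exact: {exact_pct:.3f}%")
```

For a coefficient this small the two numbers nearly coincide, so "a one-point rise in PPI is associated with about a 0.75% rise in exports" is a reasonable summary.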
Question
Can we use endogenous switching regression model in analyzing impact of any intervention using R-software? if so, can any one tell me the syntax?
BTW I forgot to say to be sure that you have the intervention coded on the RHS. Best wishes David Booth
Question
I have fitted a logistic regression model and I know that the p-value and R^2 alone are not enough. Could anyone please suggest other measures or offer any help?
Nitin Sharma, you may consider choosing the model with the lowest AIC as you run more and more iterations. Also make sure that the RMSE on the test set and on the training set do not differ by more than about 10%.
Question
I was constructing a multiple regression model inspired by two papers, neither of which tested for stationarity. When I tested for stationarity and took differences, all the variables were insignificant and R-squared was low (0.22), but when I estimated the model without differencing, the variables were significant and R-squared was 0.78.
If by "stationarity" you mean constant variance of the errors, then no, you shouldn't test it; it is an *assumption*. However, it is always a good idea to examine your residuals for curvature and heteroskedasticity. Finding either (or both) suggests you may need to re-express your response variable. But you don't need a hypothesis test for that.
Question
In a regression model, if the number of regressors k is very large, is there any systematic procedure to select the best possible model out of the very large number (i.e., 2^k - 1) of possible models without computing the AIC/BIC or another such index for each of the possible models?
I think subject matter expertise can be used to try to avoid a spurious result though you could also research model selection techniques. (Avoid forward and backward elimination and comparing p-values. The former may not allow the best combination of variables, and the latter assumes they all act 'independently,' but they do influence each other in different ways, notably collinearity.) The sample matters. You can compare two or more models' results on the same "graphical residual analysis" scatterplot (please research "graphical residual analysis") which will give you a good overall picture of model fit, including heteroscedasticity, for each model, for that sample. But a "cross-validation" (please research that term) could help you see if you have overfit to that particular sample. If you could try a variety of samples, that may help. Any time there are a large number of predictors, you may have a collinearity problem, and not need them all. However, you could have a bias problem if you remove a variable that is needed. Some combinations of variables may perform consistently better than others. Best wishes to all.
Question
Would it be better if I don't enter into logistic regression model those variables with extremely unbalanced distribution in the 2 groups, even if statistically significant (p<0.05 ) in the bivariate analysis, for example: a frequency 0% or 100% in one of the two groups?
I am in agreement with both David and Jochen.
To be clear, for LR this is not only a matter of having a "large" sample; the issue of "rare events" and how that manifests in MLE regards the absolute size of the smallest group, not its relative size. That is, you will have the same problem with a 100/10 split and a 10000/10 split. This is why various rules-of-thumb regarding N for LR are in respect to the size of the smallest group for the number of predictors. Obviously, I have no knowledge of the data, but the applicability of this depends on what is meant by "unbalanced."
Firth's clever work, although often simply used to get LR estimates when they would otherwise be intractable with MLE (complete or quasi-complete separation), is a general attempt to penalize MLE weights in respect to the "smallness" of the sample, with those penalized weights being asymptotic to the MLE weights as sample size increases. As it has a Bayesian basis, Firth's penalization of MLE can be seen as conceptually parallel to the penalization of OLS in ridge or LASSO regression estimation.
John
Question
For instance, demographics could be potential confounders. Would it be better if I include them as the predictors in the logistic regression model along with other predictors or I should first factor them out and then use the residuals for further analysis?
Hello Guo-Hua,
The two approaches you propose are identical in impact (e.g., as tactics for "statistical control"). Once the data are collected, these are about your only options.
The usual array of methods by which to deal with nuisance / noise / extraneous / confounding variables in studies includes:
1. Randomization (doesn't eliminate variables, but tends to equalize their distribution across groups, over many trials);
2. Matching (on the target variable/s; this can be challenging with multiple variables);
3. Selecting cases with a fixed value (on the target variable/s; though this restricts generalizability);
4. Building the variables into the design (e.g., as a blocking factor);
5. Statistical control (e.g., as a covariate, in the ways you have proposed).
Question
Dear all, I employed the xtabond2 command (System GMM) to estimate a dynamic panel regression model. I used lags of differences and levels as instruments. However, I also want to apply the Kleibergen-Paap LM statistic to test for underidentification. Unfortunately, I don't know of any test implemented in the xtabond2 command for underidentification. (The ivreg2 command contains the Kleibergen-Paap LM statistic. However, the ivreg2 command only allows results to be computed separately for the differenced or levels equations.) Is there any possibility to compute an aggregated Kleibergen-Paap LM statistic (levels and differences) with the xtabond2 command? Thank you in advance.
See the attached screenshot and search on the same thing but add stata to the query to see how to do this by computer. Best wishes David Booth
Question
Hi,
I'm trying to fit a line for subgroups in a scatterplot in SPSS. Since the covariates in the linear regression model affect the direction of the regression, I need the plotted line to be adjusted for the covariates. Do you know how I can do this with the legacy dialogs tool or how I can change the syntax? Thanks!
Hello Claudia Cicognola. As a general rule, I think you'll have better luck getting good answers about how to do X with SPSS if you post to SPSS discussion forums. Here are two you could try:
Y = b0 + b1*X + b2*G + b3*X*G + other variables + error
where X = the focal variable, G = a group indicator variable or a set of indicator variables if there are more than 2 groups, and the other variables are potential confounders you wish to adjust for. From what you said, it is not completely clear whether you have included the X*G interaction or not. But I'll assume you have.
If all of that is (more or less) correct, I'll speculate further that you want to plot lines for each group separately showing the relationship between X and Y with the potential confounders set to their means (or some other value). Is that right? Is my ESPss working? ;-)
If all of my guesses are reasonably close to the mark, look up the ApplyModel command. You can use it to apply your model information to a new dataset that has all of the desired combinations of explanatory variable values for which you want fitted values, and allows you to plot those fitted values. Method 2 on this UCLA page shows a very simple example:
You can likely find more complex examples by searching on <spss applymodel examples> or something like that. HTH.
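The idea behind the approach above (fitted values for each group with the confounders held at their means) can also be sketched outside SPSS. The following Python sketch uses invented coefficients for the model Y = b0 + b1*X + b2*G + b3*X*G + b4*C, where C stands for a single hypothetical covariate; none of the numbers come from the original post, and in practice they would be taken from the fitted model.

```python
# Hypothetical coefficients for Y = b0 + b1*X + b2*G + b3*X*G + b4*C.
# Every number here is invented for illustration.
b0, b1, b2, b3, b4 = 2.0, 0.5, 1.0, -0.3, 0.8
c_mean = 10.0               # assumed sample mean of the covariate C
x_grid = [0.0, 5.0, 10.0]   # X values at which each adjusted line is evaluated

def fitted(x, g, c=c_mean):
    """Model-implied Y for focal value x, group g (0 or 1) and covariate value c."""
    return b0 + b1 * x + b2 * g + b3 * x * g + b4 * c

# One adjusted line per group: covariate fixed at its mean, X varied over the grid.
lines = {g: [fitted(x, g) for x in x_grid] for g in (0, 1)}
for g, ys in lines.items():
    print(f"group {g}:", [round(y, 2) for y in ys])
```

Plotting `x_grid` against `lines[0]` and `lines[1]` gives the covariate-adjusted line for each group; the slope for group 1 is b1 + b3, which is why the interaction term matters for the picture.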
Question
I used logistic regression model for analysis which has over 17,000 observations. Although, the model results in several statistically significant predictors, McFadden's Adj R squared/ Cragg & Uhler's R2 are very low! For my model McFadden's Adj R squared is 0.026 and Cragg & Uhler's R squared is 0.044. Can I proceed with these R squared? I would really appreciate your suggestion on accepted level of R squared, which has to be backed up by relevant literature. Thank you!!
Hello Tuhinur,
The implied question in your query is, can a study be overpowered (by huge N) so as to flag as significant models which don't account for a meaningful amount of the observed differences (e.g., trivial effects)? The answer, of course, is "Yes."
It's always a judgment call, however, as to whether an effect is trivial.
Now, let me try to address your query. First, I would not rely on pseudo-R-square values as the measure of model adequacy for a logistic regression model. The reason is three-fold: (a) the value does not truly represent variance accounted for (as in OLS regression), so trying to adapt guidelines you may have seen for OLS models may not make sense; (b) context matters: the variables involved, the target population, the perspective of the decision-maker, and the intended use(s) of the model; and (c) many times people choose to focus on the exponentiated regression coefficients for IVs (the "OR" or odds ratio estimates) and/or the classification accuracy of the model and/or AIC/BIC indicators. For me, I'd stick with OR and classification accuracy, or information, and try to interpret those in context of my research aims.
Here's a link you may find handy that compares a number of the common pseudo R-square values: https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
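To make the point about pseudo-R-square values concrete, here is a minimal Python sketch of how McFadden's measure, its adjusted version, a likelihood-ratio statistic and an odds ratio are computed. The log-likelihoods, the number of predictors and the coefficient are invented for illustration (chosen only so that the pseudo-R-square lands near the 0.026 mentioned in the question); they are not the poster's results.

```python
import math

# Hypothetical values, invented for illustration.
ll_null = -11500.0    # log-likelihood of the intercept-only model
ll_model = -11200.0   # log-likelihood of the fitted model
k = 8                 # number of predictors in the fitted model
b = 0.35              # one logit coefficient

mcfadden = 1 - ll_model / ll_null             # McFadden's pseudo R-squared
mcfadden_adj = 1 - (ll_model - k) / ll_null   # adjusted for the number of predictors
lr_stat = 2 * (ll_model - ll_null)            # likelihood-ratio chi-square with k df
odds_ratio = math.exp(b)                      # exponentiated coefficient

print(f"McFadden R2 = {mcfadden:.4f}, adjusted = {mcfadden_adj:.4f}")
print(f"LR chi-square = {lr_stat:.1f} on {k} df, OR = {odds_ratio:.3f}")
```

In this made-up case the likelihood-ratio statistic is very large even though the pseudo-R-square is small, which is the pattern a large N tends to produce; the odds ratios are what indicate whether the effects matter in context.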
Question
I am using DecisionTree as a regression model (not as a classifier) on my continuous dataset using Python. However, I get mean squared error equals ZERO. Does it mean my model is overfitting or it can be possible?
Question
Dear community,
I ran a logistic regression with continuous IVs in SPSS. In the table "Variables in the Equation" one variable is missing (despite using the Enter method), without any message from SPSS. When browsing the web I understood that this might happen due to collinearity. However, the collinearity diagnostics did not return a clear sign of it: the highest VIF value is 6.1 and the highest condition index is 21.1.
So my questions are:
1. Is my regression model still valid despite SPSS dropping one variable?
2. Are there other reasons than collinearity why the IV is missing in the model?
Thanks Ilka
Try the stats package in R.
Question
I am testing hypotheses about relationships between CEA and Innovation Performance (IP). If I am testing the relationship of one construct, say Management support, to IP, is it OK to use simple linear regression? Or should I be testing it in a multiple regression with all the constructs?
Chuck, et.al., as I noted above, you can have too few or too many independent variables. Adding one you do not need will generally increase variance. In Brewer, K.R.W.(2002), Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold: London and Oxford University Press, it says this is "well known." Also, it is the basis of the "bias-variance tradeoff" found in statistical learning. (See, for example, The Elements of Statistical Learning, Data Mining, Inference, and Prediction, 2nd ed, 2009 (corrected at 7th printing 2013), Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer.)
A variable should not be added unless it is needed.
It is best to use a "graphical residual analysis" to check model fit to your sample, and consider a "cross-validation" to help avoid overfitting to your particular sample. The graphical residual analysis may also help you to consider heteroscedasticity, which should generally be part of the model. See https://www.researchgate.net/publication/354854317_WHEN_WOULD_HETEROSCEDASTICITY_IN_REGRESSION_OCCUR. - Jim
Question
The library optmatch that is used with the MatchIt library is not compatible with R 4.1.2 and seems to be discontinued. Is anyone aware of an alternative that does not require downgrading to an old version? I'm trying to run full matching using a probit regression model.
Thank you
TWANG from Rand Corporation is what I can think of.
I would also reach out to the maintainer of the optmatch package
Question
While working on a dataset that has about 10 measured variables and two experimentally manipulated variables, it has become a cumbersome process for me to find a model that explains a good amount of variance (high R squared). I am quite new in this area and would like to know if there are any resources that explain the art and science of building good regression models. What I understand from Google search is that the stepwise regression procedure is not appreciated among the experts.
It depends on the purpose of the model, if you need a "prediction tool" or if you want a model representing your knowledge about functional relationships between different factors (which can be used for prediction, too, but can also help improving our theoretical understanding of the underlying processes as well as help in inductive reasoning).
Yes, stepwise selection procedures are dangerous. Any procedure that tries to rate the quality of a model only with data that was used to build and fit the model is dangerous, because it is bound to the observed data and therefore tends to overrate specific ("random") patterns in the observed data that usually won't generalize well. Such procedures also usually start from a very restricted set of possible models, necessarily accounting only for the variables you have actually observed (missing variable bias) and often accounting only for linear relationships (failing to include the appropriate class of models), no or only lower-order interactions (a kind of missing variable bias), and so on.
When it comes to prediction, you can only judge the performance of a model based on independent data. Using the result of this to modify or further select a model destroys this independence. This also applies to cross-validation techniques. With an independent set of data (aka test set), it actually does not matter how you came up with your model (using a training set, pure intuition, or some theory). You show how well it works on the test set, and that is it. This independence is difficult to achieve. If you have one large set of data, generated under more or less homogeneous conditions, this set may contain specific, irrelevant "random" patterns that your model-building process will try to capture. Dividing this data into a training set and a test set (AFAIK this is pretty much the standard today) just generates a test set with the very same random patterns, and you will overestimate the performance of your model.
Question
I have multiple regression models, and I need to validate them with leave-one-out (LOO) cross-validation.
How can I calculate leave-one-out cross-validation using SPSS and Minitab?
Example: Leave-One-Out Cross-Validation in R
Suppose we have the following dataset in R:
#create data frame
df <- data.frame(y=c(6, 8, 12, 14, 14, 15, 17, 22, 24, 23),
                 x1=c(2, 5, 4, 3, 4, 6, 7, 5, 8, 9),
                 x2=c(14, 12, 12, 13, 7, 8, 7, 4, 6, 5))

#view data frame
df

 y x1 x2
 6  2 14
 8  5 12
12  4 12
14  3 13
14  4  7
15  6  8
17  7  7
22  5  4
24  8  6
23  9  5
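The quoted example stops after creating the data. To show what leave-one-out cross-validation does with such a dataset, here is a minimal sketch in plain Python (not R, SPSS or Minitab). It assumes the model of interest is the linear regression y ~ x1 + x2, which the excerpt above does not state; the helper functions `solve` and `ols` are written here for illustration and are not library functions.

```python
import math

# Same data as in the R data frame above.
y = [6, 8, 12, 14, 14, 15, 17, 22, 24, 23]
x1 = [2, 5, 4, 3, 4, 6, 7, 5, 8, 9]
x2 = [14, 12, 12, 13, 7, 8, 7, 4, 6, 5]
rows = [[1.0, a, b] for a, b in zip(x1, x2)]  # design matrix with intercept column

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols(X, yv):
    """Least-squares coefficients from the normal equations X'X beta = X'y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(X, yv)) for i in range(p)]
    return solve(xtx, xty)

# Leave-one-out: refit without observation i, then predict observation i.
errors = []
for i in range(len(y)):
    beta = ols(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
    pred = sum(bj * xj for bj, xj in zip(beta, rows[i]))
    errors.append(y[i] - pred)

loo_rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
loo_mae = sum(abs(e) for e in errors) / len(errors)
print(f"LOO RMSE = {loo_rmse:.3f}, LOO MAE = {loo_mae:.3f}")
```

Each of the ten fits uses nine observations, and the prediction errors on the held-out observations are summarised by the RMSE and MAE; the same loop can be repeated for each competing model and the summaries compared.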
Question
I’m in need of materials on exponential regression model estimation using maximum likelihood estimation or other methods
You can try this easy to use program:
The help file explains how it works.
Question
How can I test the Panel Smooth Transition Regression Models? which software can I use to test this model? Is there any code for that?
As far as I can understand, you could use logistic regression with this easy to use program: www.lerenisplezant.be/fitting.htm
What kind of data are you studying?
Question
Respected all,
I am using regression modelling for a crash prediction model. Can anyone tell me what the minimum value of chi-square should be for acceptance in a publication?
I'm afraid your question doesn't make a lot of sense. Could you be more specific, e.g., the kind of regression, the type of DV, and the hypothesis of your chi-square test? These should get us started. Best wishes, David Booth
Question
There is a study on the impact of HIV on mortality in patients with COVID-19 (Article link: https://www.thelancet.com/journals/lanhiv/article/PIIS2352-3018(20)30305-2/fulltext). According to Figure 3 of the article, the unadjusted impact of HIV on death risk was not significant, however the adjusted effect was significant. What could be the reason for this?
Krishnan is right. While I think that any correlation with merely the diagnosis is not so relevant (since, fortunately, with early initiation of ART people living with HIV surely have a near-normal immune system, etc.), statistically, adjustment for different factors in a cohort can go either way, positive or negative, for the respective factors. I do not think this study is too helpful for understanding what is going on; it is also outdated (2021 paper).
The authors write:
"For similar reasons, we were unable to include data on antiretroviral therapy use, viral suppression, CD4 count, or previous AIDS-defining illnesses likely to be captured only in specialist care, so it was not possible to stratify results or quantify how much of the increased risk was driven by the minority of individuals with poorly controlled HIV. "
Question
I am running a panel regression model and I want to rank two of my independent variables on the basis of their degree of influence over the dependent variable.
Both of them are significant and have the same sign; the coefficient on V1 is 0.001 and on V2 is 0.003, which I believe is not significant.
Is there any other way in which I can do this?
I have read that some people say that coefficient is not the correct measure to rank the variables.
How about comparing the variance explained (R^2) per predictor?
Question
In a set of spatial data I used the Spatial Error Model. However, I am asked if there are other strategies that can help to absorb the autocorrelation.
Any help is welcome,
Thank you
Thank you William Durkan , Víctor Morales-Oñate and Noura Anwar Abdel-Fatah for the answers. I will take them into account.
Best regards,
Ferran
Question
Hi,
My regression model has better results when I included an intercept dummy variable. However, the dummy variable is insignificant... why is that? How can I interpret that?
Thanks
Mohamed Hashem -
I would not pay much attention to "significance." Effect size matters.
I suggest you try a "graphical residual analysis" with and without this variable, on the same scatterplot, for the same sample. This could be enlightening. You also could consider a cross-validation, as results could be different for a different sample.
If you still have questions, I suggest that you post your graphical residual analysis or analyses.
Cheers - Jim
PS -
I wonder if that dummy intercept variable may really not be useful, and just a way to make a model 'fit' to a sample. This can happen, especially if the sample size is too small. Consider this: If all predictors were zero, would you expect y to be zero? If so, do not include an intercept term. It is just a reaction to random 'error.' (On page 110 in Brewer, K.R.W.(2002), Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold: London and Oxford University Press, Ken Brewer provides a warning about intercept terms which may not be needed.) If you don't need it in any case, you should probably drop it, I think.
Question
In our research, the dependent variable is measured on a 5-point Likert scale, using 6 different questions. We are using an ordered probit regression model. When taking the value of the dependent variable, should we use the mode or the mean of all questions together? If there is a specific way to enter it, we would also like to know the reasoning behind it.
The point of the Likert scale is that it can be used to estimate order within a population. Order contains a lot more information value than we generally assume. However, we cannot just go assuming things and believe that makes them true.
Question
I have run an ARDL model for a Time Series Cross Sectional data but the output is not reporting the R.squared. What could be the reason/s.
Thank you.
Maliha Abubakari
I thought the PMG estimator (a form of ARDL) is more appropriate.
Question
I am using Stata to run a linear regression where I have used Student Engagement as a Dependent variable while the rest of the planning, teaching strategies, use of resources, the language proficiency of teachers are independent variables. The data against these variables is gathered on a Likert scale of 0 to 4, and the data is continuous.
You say that all variables (both dependent and independent) are on a continuous scale. But, what about the Likert scale you've used (0,1,2,3,4)? Clearly, the scores on a Likert Scale format will give you discrete numbers in your data frame. Or did you transform them to a continuous scale?
If you want I can look at the data and give you specific advice. For this, you have to send me a CSV or Excel format file of the data to my email address: total-health@kpnmail.nl.
Question
Dear all,
I am running several bivariate linear regressions to measure the impact of biases on controlling processes. These are all significant. I would like to measure which cognitive bias has the greatest impact on controlling processes. Can I compare the beta weights of different bivariate regression models to find the bias with the biggest impact? My dependent variable always remains the same.
I have searched for hours for information but unfortunately found conflicting information.
Alexander -
The problem is that predictors actually influence one another, not just through collinearity, but because of other considerations such as suppressor variables, etc., and when you look at a bivariate regression, the predictor is missing the influence of the other predictors. There will, however, often be an intercept term, which will vary by predictor, and be of varying importance. If y will be zero when x is zero, you will have a ratio estimator, where only b for the one predictor determines predicted-y. Otherwise the intercept will be whatever is needed so that you will have the estimated residuals sum to zero for your sample (so you have model-unbiasedness). But this is just an estimate from your sample. If you use more than one predictor, the influences among them change according to which predictors are used together, and the intercept changes. The best set of predictors used in the best way will determine the model for predicted-yi so that it approaches a perfect size measure, and heteroscedasticity will be expected. See
So looking to see how one variable does at a time is not only subject to the varying importance of each intercept, but also not very practical when it comes to that predictor's part in the ideal regression, if that ideal regression is multiple regression. If one predictor alone does better than any other model, then that would mean it does perform better than the others, but a direct comparison would be possible only if these were all ratio estimators, i.e., with no intercept term, so the intercept is set to zero in each case. (Cochran, Sampling Techniques, 1st ed, 1953, pages 205-206 suggests that for surveys, the best size measure, x, for y in such a case for a sample survey might be the same data item in a previous, recent census.)
There may be a temptation to use every variable you can find in a regression, and although that might help you tailor the best fit model to your sample, such model would likely be overfit to your sample at the expense of general applicability. However, looking at one predictor at a time may not get you very far either. It might be useful to graph such relations to see if there are curvilinear indications. (But note that a quadratic linear regression, for example, is still a linear regression, though you might want a nonlinear form.)
So when you ask if you can compare "...beta weights of different regression models...," the short answer is no, not directly, though you might obtain a general idea of their usefulness.
Cheers - Jim
Question
Dear Sir/Mam,
In time series regression, when should we use static regression model and when should we use distributed lag or auto regressive distributed lag model.
Dear Vaishnavi,
If your Y variable depends only on the value of X in the same period t, then you can use a static model. That is, each Y is related to the inflation value of the same period: Y_t = B * Inflation_t
If your variable also depends on past values of the regressor, then you should use a distributed lag model, e.g. Y_t = B1 * Inflation_t + B2 * Inflation_{t-1}; if lagged values of Y itself also enter the right-hand side, the model is an autoregressive distributed lag (ARDL) model.
If you don't know your data well, you can fit both and choose the one with the lower RMSE or BIC.
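The suggestion to fit both specifications and compare RMSE or BIC can be sketched as follows. This is a minimal Python illustration on synthetic data, not the poster's series: the data-generating process, the seed, and the helper functions `solve`, `ols` and `fit_stats` are all assumptions made for the example. BIC is computed in its Gaussian form up to a constant, m*ln(RSS/m) + k*ln(m).

```python
import math
import random

random.seed(0)
n = 60
# Synthetic data (assumption): y depends on the current and the lagged regressor.
x = [random.gauss(0, 1) for _ in range(n)]
yv = [1.0 + 0.5 * x[t] + 0.8 * x[t - 1] + random.gauss(0, 0.3) for t in range(1, n)]

def solve(A, b):
    """Solve the linear system A z = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            factor = M[r][col] / M[col][col]
            for c in range(col, p + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * p
    for i in reversed(range(p)):
        z[i] = (M[i][p] - sum(M[i][j] * z[j] for j in range(i + 1, p))) / M[i][i]
    return z

def ols(X, target):
    """Least-squares coefficients from the normal equations X'X beta = X'y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(X, target)) for i in range(p)]
    return solve(xtx, xty)

def fit_stats(X, target):
    """Return in-sample RMSE and Gaussian BIC (up to a constant) of an OLS fit."""
    beta = ols(X, target)
    rss = sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(X, target))
    m, k = len(target), len(beta)
    return math.sqrt(rss / m), m * math.log(rss / m) + k * math.log(m)

# Both models are fitted to the same observations (t = 1, ..., n-1).
X_static = [[1.0, x[t]] for t in range(1, n)]        # Y_t on X_t only
X_dl = [[1.0, x[t], x[t - 1]] for t in range(1, n)]  # Y_t on X_t and X_{t-1}
static_rmse, static_bic = fit_stats(X_static, yv)
dl_rmse, dl_bic = fit_stats(X_dl, yv)
print(f"static:          RMSE={static_rmse:.3f}  BIC={static_bic:.1f}")
print(f"distributed lag: RMSE={dl_rmse:.3f}  BIC={dl_bic:.1f}")
```

Because the lagged model nests the static one, its in-sample RMSE can never be higher on the same observations, so BIC (which penalises the extra coefficient) or an out-of-sample RMSE is the more informative of the two comparisons.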
Question
How do you perform a purposeful selection model?
Which variables should be excluded or included?
You need to understand the causal relationships that are implied by your hypothesis. I suggest that you visit http://www.dagitty.net which allows you to build a graphic representation of the relationships between your variables and, based on this, will allow you to build statistical models that test informative hypotheses.
Question
Dear Sir/Mam,
In secondary data time series analysis, when should we use static regression model and when should we use other regression models ( distributed lag or auto regressive distributed lag model)
When the response variables are autocorrelated (e.g., data collected over varying times or space), and also cross correlated with the predictors at their lagged values, typically distributed lag models will be more useful than static models. Basically, if your data set is like (y_t, x_t), t=1,2,... where y_t is the response at time t and x_t is the predictor (vector) at time t, then if you find y_t depends on y_{t-l} and/or x_{t-l} for lag values l=0,1,2,.., then use the distributed lag model. If you are familiar with R check out this package:
Question
I am working on a multiple regression model with quadratic terms. Once I've got the first multiple regression equation, I checked if adding the quadratic/cubic terms could increase my Adj r-square.
But I have a question. Since I am using backward elimination, can I load all of them (12 terms) and remove the less significant (as backward elimination does) proceeding until I get the equation?
In general this approach (stepwise forward and backward) is not recommended. This is covered in several RG answers so the search you made before adding another question should have found these.
Question
Dear researcher community.
I want to ask if it's possible to perform a meta-analysis on an index, that is, on a value reported without a standard deviation. For instance, in regression modeling, many indexes don't have a standard deviation.
I want to conduct a meta-analysis to compare two analysis approaches in the regression model, so I need to know if I could use a goodness index for the same research that reported the two methods. In this sense, I will have a value for n (data population), and index in a conventional regression method (control) and n (data population), and index in an alternative regression method (experimental) for several papers.
Wilson
Dear Wilson Barragán,
It is an old question, but I would like to know whether you figured out how to use standard errors for an index in meta-analysis. I would like to perform a meta-analysis using an index of feeding preference for mosquitoes (the number of mosquitoes fed on a particular host divided by the relative abundance of that host). Every study will include only the number of mosquitoes, the percentage of them fed on that host, and the abundance (%) of that host. It will not include a mean or a standard error. If you were able to perform your meta-analysis and can give me some pointers on how you did it, I would be very grateful. Thanks in advance, Kevin.
Question
Dear altruists,
I am fitting a spatial panel data model in R using the splm package. My code is:
SModxl <- read_excel("E:/Data/Spatial/SPLM/SpPanelMod 3079.xlsx", sheet = "Sheet1")
SMod <- pdata.frame(SModxl, index=c("Year", "Fips"))
Qnlw=mat2listw(WMat)
#SARAR
reg.Sarar1 = spml(formula = y ~ x1 + x2 + x3, data = SMod,
                  index = c("Year", "Fips"), listw = Qnlw,
                  tol.solve = 1.0e-20, lag = TRUE, spatial.error = "b",
                  model = "within", effect = "individual", method = "eigen",
                  na.action = na.omit, quiet = TRUE, zero.policy = TRUE,
                  interval = NULL, control = list(), legacy = FALSE)
summary(reg.Sarar1)
summary(impacts(reg.Sarar1, listw=Qnlw, time = 15, R=5), zstats=TRUE)
My model runs fine. But when I try to get the marginal effects (direct, indirect and total effects, with significance tests), I use the following code, as with the spdep package: "summary(impacts(reg.Sarar1, listw=Qnlw, time = 15, R=5), zstats=TRUE)".
But it is not working.
Do you know any other techniques to estimate the marginal effects using the splm package in R? Or are there any other packages available in R that can do the task?