
# Regression Modeling - Science topic


Questions related to Regression Modeling

I'm trying to compare several linear regression models, but many of them do not report their SEE (standard error of the estimate). Is there an alternative method to calculate the SEE when we do not have access to the original data to compute residuals?

Hi,

I am using a pool of cross-sections over years (a cross-section of 255 firms over 11 years). I have used a simple/classical linear regression model (CLRM) and a logit model, given a continuous DV in the former and a dichotomous DV in the latter.

In logit regression, the econometric model is

Price_Deviation = B0 + B1(prestegious_investor) + B2(default_score) + Bk(controls)

where,

--> Price_Deviation is a dummy variable taking the value 0 or 1,

--> prestegious_investor is a dummy variable taking the value 0 or 1,

--> default_score is a continuous variable with range -inf to +inf, although it takes a minimum value of -37.12 and a maximum of 24.25 in my case.

I want to include interaction effects. Can they be included the same way as in the CLRM (or simple linear regression models), or is there an adjustment specific to logistic models?

please guide!
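As a sketch of the usual approach: the interaction enters the logit linear predictor as a product term, exactly as in OLS; what changes is the interpretation, because the effect on the probability scale then depends on the other covariates (Ai & Norton, 2003). All coefficients below are hypothetical:

```python
import math

def prob(prestigious, score, b0=-1.0, b1=0.8, b2=0.05, b3=0.3):
    # b3 is the interaction coefficient on prestigious * score
    z = b0 + b1 * prestigious + b2 * score + b3 * prestigious * score
    return 1 / (1 + math.exp(-z))  # logistic link

# The prestige "effect" on the probability scale differs across
# default_score values even with a single interaction coefficient:
effect_low = prob(1, 0) - prob(0, 0)
effect_high = prob(1, 10) - prob(0, 10)
print(round(effect_low, 3), round(effect_high, 3))  # → 0.181 0.587
```

This is why, in a logit, the coefficient on the product term alone does not summarize the interaction on the probability scale.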

Thanks and regards,

Sahil Narang

As for cumulative link models (also called ordinal logistic regression models), Christensen (2016) reports that, in the case the proportional odds assumption is not met, one or more covariates need to be scaled to relax the assumption.

However, when visualizing the summary, it may happen that the non-scaled covariate appears non-significant while the same scaled covariate appears significant. What is the meaning of that?

Thanks in advance,

Marcello

I am working on a logistic regression model with a binary outcome, and my model looks like this:

Intercept coefficient: **-2.034**. Variable coefficient: **0.031**, with a p-value of **0.130** (not significant).

However, when I calculate the odds ratio and its 95% CI I get an odds ratio of **1.031**, lower limit = **0.991**, upper limit = **1.073**. Since the variable is not significant, I shouldn't get an odds ratio greater than 1, correct?

Can anyone explain why this is happening?

In my next research project I will use a surface regression model to estimate a model for a target variable and exogenous strategy variables. Then, using this function, I will try to find the optimal X's maximizing Y. Do you know of any applications like the one I have in mind? If so, can you share their names?

Suppose we want to check the association between A and Z using a regression model. In model 1 we add covariates, say b, c, d, e; the regression shows a significant association between A and Z after adjustment for these covariates. However, in model 2 we add another covariate, say f, and when we run the model the association between A and Z becomes non-significant. What caused this non-significance? What could be the reason?

I am performing a meta-analysis of odds ratio per unit of a continuous variable with a dichotomous outcome (dichotomized continuous variable). One of the studies reports a mixed linear regression model with coefficient and standard error for the continuous variable regressed on the continuous outcome variable. Is there any acceptable method to estimate the odds ratio I need?
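One commonly cited approximation (Chinn, 2000) rescales a standardized effect to the log odds scale via the factor π/√3 ≈ 1.81. A minimal sketch, assuming the reported linear coefficient is first standardized by the SD of the continuous outcome; beta, se and sd_y below are hypothetical, and whether this approximation is acceptable for a given meta-analysis is a judgment call:

```python
import math

beta, se, sd_y = 0.40, 0.10, 1.6   # hypothetical reported values

factor = math.pi / math.sqrt(3)    # ≈ 1.8138 (normal-to-logistic rescaling)
log_or = factor * beta / sd_y      # approximate log odds ratio per unit
log_or_se = factor * se / sd_y     # its approximate standard error
print(round(math.exp(log_or), 3))  # approximate odds ratio per unit
```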

I am supposed to predict land use/land cover changes using rainfall, climate and socio-economic data. What regression model can be recommended?

I am running 6 separate binomial logistic regression models with all dependent variables having two categories, either 'no' which is coded as 0 and 'yes' coded as 1.

4 of the 6 models run fine; however, 2 of them produce this error message.

I am not sure what is going wrong, as each dependent variable has the same 2 values on the cases being processed, either 0 or 1.

Any suggestions what to do?

Suppose we split our data into training and test groups, and we use the training data alone to create a logistic regression model in SPSS. Does this model produce different results than a logistic regression model created in Python?

And how could we convert the probabilities the logistic regression model (SPSS model) produces for the test group into interpretable categorical outputs (0 or 1)?
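On the second part: the standard (and here assumed) rule is to apply a cutoff to the predicted probabilities, 0.5 by default, though the cutoff can be tuned. A minimal sketch with made-up probabilities:

```python
# Hypothetical predicted probabilities exported from SPSS for the test group:
probs = [0.12, 0.55, 0.49, 0.91]

cutoff = 0.5  # default; can be tuned, e.g. toward the event prevalence
preds = [1 if p >= cutoff else 0 for p in probs]
print(preds)  # → [0, 1, 0, 1]
```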

Dear All,

I want to develop a regression model that relates years of experience in an area to the ability of handling problems in this area. The ability to solve problems is answered through easy, average, and hard problems. A panel of experts will give these estimates.

Am I supposed to use fuzzy regression in this case? Any recommendation for a paper that I can start with (an easy one to replicate)? Also, if I have 5 experts, how do I derive one answer from their different answers?

Is there any other technique, besides fuzzy regression, that has been used in the literature?

Last, did anyone deal with a problem similar to the one I am describing?

I am currently investigating the differences between financial and non-financial firms, and apart from the traditional output of the regression model, I need to test for differences in their coefficients. I am using dummy variables 0 and 1 for non-financial and financial firms respectively. Could someone please assist me with the Stata code/command in this regard? Thank you.

I have developed a multiple regression model with a sample of 164 responses. Now I want to validate the model using a new data set. Is there any rule for the sample size that I can use? One colleague suggested I use 20% of the sample used to develop the model. Please, I need advice.

I made the following logistic regression model for my master's thesis.

FAIL= LATE + SIZE + AGE + EQUITY + PROF + SOLV + LIQ + IND.

I examine whether late filing of financial statements (independent variable) is an indicator of failure of small companies (dependent variable). FAIL is a dummy variable equal to 1 when a company failed during one of the researched years. I use data covering 3 years (2017, 2018 and 2019). Should I include a dummy variable YEAR to account for year effects, or not? I have searched online but I don't understand exactly what it means, and that is why I don't know if it is necessary to include it in this regression model. I hope you can help me. Thank you in advance!

I am looking to remove collinearity from my results. I have performed chi-square tests on two independent variables that I thought might be collinear. The chi-square test is significant, showing an association between the two categorical variables. Do I now need to exclude one of the variables from my final logistic regression model?

Thank you for your help

I have 215 samples and 9 independent variables. I want to know which independent variables best predict the DV. However, the 9 IVs are highly correlated with each other, so I don't think I can use a multiple regression model and compare the standardized betas (some of the VIFs are close to 20; in addition, the assumption of normality of residuals is violated). I wonder if I can instead run bivariate correlations and compare the correlation coefficients (since normality is violated, I used Spearman's rho for the bivariate correlation between the DV and each IV)? Do I have to transform Spearman's rho into Fisher's z, or can I compare the rho values directly?

Thank you so much for any advice!
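On the Fisher z question, a sketch under the usual assumptions: the transform is defined for Pearson's r, so applying it to Spearman's rho is an approximation, and comparing coefficients that share the same DV (same sample) strictly calls for a dependent-correlations test rather than the independent-samples formula noted below. The rho values are hypothetical:

```python
import math

def fisher_z(r):
    # Fisher z-transform: 0.5 * ln((1 + r) / (1 - r))
    return math.atanh(r)

# Hypothetical rho values of two IVs against the same DV:
z1, z2 = fisher_z(0.60), fisher_z(0.45)
# For *independent* samples of sizes n1, n2, the difference z1 - z2
# has standard error sqrt(1/(n1 - 3) + 1/(n2 - 3)).
print(round(z1, 4), round(z2, 4))
```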

I heard about a software called πonix, used for creating regression models. It is useful when the scales have many items. However, when I search for it, I find nothing.

Has anyone heard of this software?

Thanks for your help.

I'm attempting to build a binary response (single species presence/absence) logistic regression model with random intercept (for site variable level). I surveyed 30 sites 1-3 times; approx half of the sites were only visited once. Ideally, I would model site as the random effects variable and include one or two habitat variable(s) as the fixed effects variable(s). I recognize that my sample size is very low and suspect that I also have insufficient replication of observations per level of the random effects variable.

Is it possible to use random effects in this situation? If not what other approach would you recommend?

The only alternative I can think of is to build a regular binary-response logistic regression including only one observation per site, and repeat this for every possible combination of the 30 sites. I figure this would allow me to use all of my data to infer which covariates are most influential, although as far as I can tell it makes getting coefficient estimates, confidence intervals, AICc values, etc. difficult.

In general, an optimized model gives more accurate results than the local model, but in my case the traditional regression model shows a higher R² value than the optimized model. Why might that be? Thank you in advance for your kind feedback.

More exactly, do you know of a case where there are repeated, continuous data, sample surveys, perhaps monthly, and an occasional census survey on the same data items, perhaps annually, likely used to produce Official Statistics? These would likely be establishment surveys, perhaps of volumes of products produced by those establishments.

I have applied a method which is useful under such circumstances, and I would like to know of other places where this method might also be applied. Thank you.

Hello, many articles report "accuracy" when using ANFIS (or regression models) rather than RMSE or MAE.

How are they calculating it?
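A sketch of the usual regression error measures, plus one common (but by no means universal) way papers turn them into an "accuracy" figure, 1 − MAPE; the data below are made up:

```python
import math

actual = [3.0, 5.0, 2.5, 7.0]
pred = [2.5, 5.0, 3.0, 8.0]

n = len(actual)
mae = sum(abs(a - p) for a, p in zip(actual, pred)) / n
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / n)
mape = sum(abs(a - p) / a for a, p in zip(actual, pred)) / n
acc = 1 - mape  # one common "accuracy" convention for regression
print(round(mae, 4), round(rmse, 4), round(acc, 4))  # → 0.5 0.6124 0.8726
```

If a paper reports "accuracy" for a regression model without defining it, it is worth checking which of these (or another) convention is in use.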

There are several independent variables and several dependent variables. I want to see how the independent variables affect the dependent variables. In other words, I want to analyze:

[y1, y2, y3] = [a1, a2, a3] + [b1, b2, b3]*x1 + [c1, c2, c3]*x2 + [e1, e2, e3]

The main problem is that y1, y2, y3 are correlated: an increase in y1 may lead to a decrease in y2 and y3. In this situation, what multivariate multiple regression models can I use, and what are their assumptions?

Hi, I am running a logit regression in Stata. I do not see standard errors and p-values for some regressors (in a few models) or for all regressors (in some other models).

The pseudo R-squared is 1 in the models where Stata reports no SE or p-value for any regressor.

I understand that this is happening due to either the problem of separation (quasi or complete) or reduced degrees of freedom due to the inclusion of a large number of regressors (roughly k > n).

Details for dataset are as follow:

--> Model includes a regressor in the level form as well as the square term (quadratic function).

--> In addition, the model includes 9 control variables plus industry and year controls.

--> Industry includes 34 unique industries in the form of 2-digit industry codes.

--> Year control includes 11 unique years. Cross-sectional data is pooled across those 11 years.

If I remove the industry controls, the results become inconsistent across models: in some they give opposite signs, and in some they become insignificant.

I would be grateful for your kind suggestions.

Thank you!

As there are several software packages for estimating threshold panel regressions, such as Matlab, R, and Stata, I'm wondering which one is accurate and easier to use (for dynamic balanced panel data).

Regards

Maryam

We define the linearity of a model in terms of the coefficients, not the variables; therefore we treat Y = βX as a linear regression model [here both the coefficient and the variable are linear].

If the coefficient is linear but the variable is non-linear, as in Y = βX², why do some texts call this a non-linear regression model? [Here the variable is non-linear but the coefficient is still linear.]
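The usual convention is "linear in the parameters": Y = βX² can still be fitted by ordinary least squares after treating X² as the regressor, which is why many texts reserve "non-linear regression" for models non-linear in β (e.g. Y = e^(βX)). A toy sketch (noise-free data generated with β = 2):

```python
# Y = b * X**2 is linear *in the parameter* b, so OLS applies
# after transforming the regressor.
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 8.0, 18.0, 32.0]

Z = [x ** 2 for x in X]  # treat X^2 as the regressor
b = sum(z * y for z, y in zip(Z, Y)) / sum(z * z for z in Z)  # OLS through origin
print(b)  # → 2.0, the generating coefficient
```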

I have qualitative data for my research, in the form of ratings: AAA, AA, A, BBB, BB, B, CCC, CC, C.

How can I convert these qualitative values into quantitative ones to include in a regression model? I probably cannot use the simple dummy-variable method, as a single dummy can only take two values, 0 or 1, and in my scenario I have more than two categories.

Could anyone please advise on converting this qualitative data into quantitative form?
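Two standard options, sketched below: assign ordinal scores (which assumes equal spacing between adjacent grades), or create k−1 dummy columns for k categories, with one category as the reference level. The mapping is illustrative only:

```python
ratings = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C"]

# 1) Ordinal scores (assumes equal spacing between adjacent grades):
score = {r: len(ratings) - i for i, r in enumerate(ratings)}  # AAA=9 ... C=1

# 2) Dummy (one-hot) coding: k categories -> k-1 indicator columns,
#    with one rating (here "C") as the reference level.
def dummies(r):
    return [1 if r == level else 0 for level in ratings[:-1]]

print(score["AAA"], dummies("AA"))
```

More than two categories is not a problem for dummy coding; it simply takes several 0/1 columns rather than one.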

How do we calculate the beta coefficient (standardized coefficient) for the results of a mixed regression model with a negative binomial distribution, please?

I have all the data for landslide susceptibility mapping, and I want to analyze it using a logistic regression model, but I still have no idea how to process it.

I am running a Tobit model where the pseudo R-squared is 3.21 (a positive value). Is it okay to have a pseudo R-squared greater than 1, or is there some problem with the regression model?

I am trying to code this optimizer for a linear regression model. What I want to confirm is: are the model-parameter updates applied even if they increase the cost function?

Or do we only update the coefficient values if they decrease the value of the cost function?
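For plain gradient descent, yes: the update is applied every iteration regardless of whether it lowers the cost; only methods with a line search or an acceptance test (e.g. some trust-region schemes) check the cost before accepting a step. A minimal sketch on a toy one-parameter model:

```python
# Batch gradient descent for y = w*x on toy data with true w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w, lr = 0.0, 0.05
for _ in range(200):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # unconditional update, applied every iteration
print(round(w, 3))  # → 2.0
```

With a well-chosen learning rate the cost still trends downward overall, even though no step is ever "rejected".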

Hi all,

I have annual river run-off data and a catchment composite mean of PDSI values (a drought index). A simple regression model tells me that the PDSI predicts the run-off well (as can be physically expected). I checked the residuals, normality, homoscedasticity, and p-values. All perform fine; everything is significant. The data span 1869-2012.

Because I have the PDSI values for the period 0-2012, would it be possible to use the results of the model to reconstruct the run-off for the same period?

the model reads:

runoff_1869_2012 = ß0 + ß*PDSI_1869_2012

-> summary(model): ß0 = 1055.363 and ß=55.668

hence, would this make sense:

runoff_0_2012 = 1055.363 + 55.668*PDSI_0_2012

Would I then get the reconstructed run-off for the whole period, with a certain uncertainty?
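Mechanically the fitted equation can be applied to the earlier PDSI values, though extrapolating outside the 1869-2012 calibration period assumes the runoff-PDSI relationship was stationary, and an honest reconstruction should also carry the model's prediction interval, not just the point values. A sketch using the fitted coefficients from the question and hypothetical PDSI values:

```python
# Fitted (reported) coefficients: runoff = b0 + b1 * PDSI
b0, b1 = 1055.363, 55.668
pdsi_series = [-1.2, 0.3, 2.1]  # hypothetical pre-1869 PDSI values

runoff = [b0 + b1 * p for p in pdsi_series]  # point reconstruction only
print([round(r, 2) for r in runoff])  # → [988.56, 1072.06, 1172.27]
```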

Many thanks for suggestions on this very basic question!

all best,

Michael

Hello,

I am studying the effect of ESG on financial performance, measured as ROA. When I run my regression model without control variables, the ESG variable has a negative coefficient, while when I include them it has a positive coefficient. The control variables are log assets, log revenue, leverage, employees, etc.

Does anyone know why this could be the case?

Thanks in advance

Hello,

I would like to examine the association between a disease and psychological load. The disease can be determined by different diagnostic methods. In our study, we performed four accepted procedures to determine the disease. Each participant (total sample n=45) underwent all four procedures at different times. Additionally, we examined psychometric data, and these are the main variables I am focusing on.

My idea is to examine the association between the disease and psychological load, as a function of the diagnostic method chosen. In other words: Is the association between the disease - diagnosed by method A - and psychological load significantly different/stronger than the association between the disease - diagnosed by method B - and psychological load.

As for the statistical methods, I initially thought of logistic regression with the disease as criterion and the psychometric variables as predictors. This would lead to 4 regression models: SB diagnosed with A, SB diagnosed with B, etc. as the criterion. My idea is to compare the AICs of the four models: do the psychometric variables predict the disease better, and explain more variance, when it is diagnosed with method X rather than method Y?

I hope my question and concept are comprehensible.

Is this an appropriate approach or does anyone have another idea?
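The AIC comparison described above can be sketched as follows; the log-likelihoods are hypothetical stand-ins for the four fitted models, and it is the AIC *differences* (not the raw values) that carry meaning:

```python
def aic(log_lik, k):
    # AIC = 2k - 2*ln(L); lower is better
    return 2 * k - 2 * log_lik

# Hypothetical maximized log-likelihoods for the four diagnostic methods:
models = {"A": -20.5, "B": -18.9, "C": -22.1, "D": -19.4}
k = 3  # assumed: same number of estimated parameters in each model

aics = {m: aic(ll, k) for m, ll in models.items()}
best = min(aics, key=aics.get)
print(best, round(aics[best], 1))  # → B 43.8
```

Note that plain AIC comparison is only clean when all four models are fitted to the same cases; with different diagnostic outcomes as criteria, the comparison is informal rather than a formal test.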

Thank you very much for your replies!

Kind regards,

Nicole

For example, I want to know how baseline characteristics of patients (age, BMI...) and the confounding factors (smoking, diabetes or other chronic diseases) affect the serum vitamin D value. Which regression model should I use?

I use SPSS and R for analyzing the data

Dear all

I'm testing the parsimonious Technology Acceptance Model in the context of new retail technologies. As some studies found a significant influence of variables such as age, gender and education, I thought about adding them to my regression model together with my main independent variables, perceived usefulness and perceived ease of use.

I saw on ResearchGate that it is possible to just add the control variables together with the independent variables into the multiple regression model in SPSS. But to me it doesn't make sense, because these variables are then not really "controlled", just treated as additional independent variables.

Is that correct? Or do I need to do a hierarchical regression?

Here is the discussion that talks about entering them together with the independent variables:

Thank you very much!

Hi,

We would really appreciate some help building regression model(s) using a panel data set. The data set consists of data from retail stores over two years. Our research question is as follows: How do different payment methods affect retail stores' unregistered shrinkage?

As a result, we want to see which payment method gives the highest increase in shrinkage, and compare the payment methods.

We have tried the following models, but the results do not make sense: y = x1 + x2 + x3 + u

- Shrinkage = sale self-service checkout + revenue + region + u
- Shrinkage = sale ShopExpress + revenue + region + u
- Shrinkage = sale served checkout + revenue + region + u

We want to control for store size and region, which is why these variables are included. Also, all stores in the dataset have served checkouts, and some have self-service and ShopExpress (scan and go). Are we therefore unable to use dummies for the payment methods?

We have also tried this model, but also get weird results:

4. Shrinkage = dummy served checkout + dummy self-service + dummy ShopExpress + revenue + region

- Is there a better way to create the regression models? Is it possible to gather them into one model?
- Do you have suggestions for other control variables that we should include?
- To be able to run the random effects model, we had to transform the variables into natural logarithm form. Does it make sense to use ln?

Thank you!

Dear all,

I am developing a new method for soil analysis, and have several candidates.

I am comparing the results derived from the candidates with those from the reference method using linear regressions. So I have several linear regression models (same independent variable, different dependent variables) with different R-squared values and slopes.

I am wondering which statistical method should be used to choose the best regression model (i.e. the most appropriate candidate as the new method). Can R-squared and the slopes alone be used, or do I have to use other statistics, such as RMSE, SEE, or MAE?

Thanks a lot

Dear all,

I am working on a research study exploring the impact of air pollution on peak expiratory flow rate (PEFR). I have five areas within a city - Rambagh, Alopibagh, Ashoknagar, Katra and Johnstonganj - with a varying level of air pollution in each area. What I expect is that residents of an area with a high level of air pollutants will have a low PEFR.

The variable png is a dummy variable distinguishing those who consume tobacco from those who do not. The gender dummy variable has its usual meaning.

I have run a double-log regression model with results as shown in the image. Kindly help me interpret the coefficients of the area dummy variables and the coefficients of the interaction variable, **area#c.log_income**. Any further suggestions to improve the model are highly welcome. Thank you for your time.

Regards

Dear professor and Researcher

I have estimated an export supply function. I have taken the dependent variable (exports) in natural log, but one explanatory variable (producer price index) in levels.

The coefficient on the producer price index is 0.00745. Please guide me on how to interpret this value, i.e. how to convert it into a percentage.
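For a log-level model ln(exports) = α + β·PPI, the exact conversion is 100·(exp(β) − 1) percent per one-unit rise in PPI; for a coefficient this small it is effectively 100·β, i.e. about 0.745%. A minimal sketch:

```python
import math

b = 0.00745  # reported log-level coefficient on PPI

pct = 100 * (math.exp(b) - 1)  # exact percent change per unit of PPI
print(round(pct, 4))  # → 0.7478, close to the small-b shortcut 100*b = 0.745
```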

thanks and best regards

Irshad

Can we use an endogenous switching regression model to analyze the impact of an intervention using R software? If so, can anyone tell me the syntax?

I have built a logistic regression model and I know that the p-value and R² alone are not enough. Could anyone please suggest additional measures?

I was constructing a multiple regression model, inspired by two papers. The two papers didn't test for stationarity. When I tested stationarity and took differences, all the variables were insignificant and R-squared was very low (0.22); but without differencing, the variables were significant and R-squared was 0.78.

In a regression model, if the number of regressors k is very large, is there any systematic procedure to select the best possible model out of the very large number (i.e. 2^k - 1) of possible models without computing the AIC/BIC or another such index for each of them?
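One common family of answers is greedy search: forward (or backward) stepwise selection fits at most about k(k+1)/2 candidate models instead of 2^k − 1, and penalized methods such as the LASSO avoid enumeration altogether. A sketch of forward selection with a stand-in scoring function (any criterion the user supplies, e.g. AIC or cross-validated error; lower is better):

```python
def forward_select(features, score):
    # Greedily add the feature that most improves the score; stop
    # when no remaining candidate improves it.
    selected, remaining = [], list(features)
    best_score = float("inf")
    while remaining:
        s, f = min((score(selected + [f]), f) for f in remaining)
        if s >= best_score:
            break  # no candidate improves the fit
        best_score = s
        selected.append(f)
        remaining.remove(f)
    return selected

# Toy score: pretend only x1 and x3 reduce the criterion.
useful = {"x1": -2.0, "x3": -1.0}
toy_score = lambda feats: sum(useful.get(f, 0.5) for f in feats)
print(forward_select(["x1", "x2", "x3"], toy_score))  # → ['x1', 'x3']
```

Greedy search is not guaranteed to find the globally best subset; it trades optimality for a number of fits that grows quadratically rather than exponentially in k.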

Would it be better not to enter into the logistic regression model those variables with an extremely unbalanced distribution in the 2 groups, even if statistically significant (p < 0.05) in the bivariate analysis, for example a frequency of 0% or 100% in one of the two groups?

For instance, demographics could be potential confounders. Would it be better to include them as predictors in the logistic regression model along with the other predictors, or should I first factor them out and then use the residuals for further analysis?

Dear all,
I employed the xtabond2 command (System GMM) to estimate a dynamic panel regression model. I used lags of

**differences** and **levels** as instruments. However, I also want to apply the Kleibergen-Paap LM statistic to test under-identifying restrictions. Unfortunately, I don't know of any test implemented in the xtabond2 command for under-identifying restrictions. (The ivreg2 command contains the Kleibergen-Paap LM statistic; however, ivreg2 only allows computing results separately for the differenced or levels equations.) Is there any possibility to compute an aggregated Kleibergen-Paap LM statistic (levels and differences) with the xtabond2 command? Thank you in advance.

Hi,

I'm trying to fit a line to subgroups in a scatterplot in SPSS. Since the covariates in the linear regression model affect the direction of the regression, I need the plotted line to be adjusted for the covariates. Do you know how I can do this with the legacy dialogs tool, or how I can change the syntax? Thanks!

I used a logistic regression model for an analysis with over 17,000 observations. Although the model yields several statistically significant predictors, McFadden's adjusted R-squared (0.026) and Cragg & Uhler's R-squared (0.044) are very low. Can I proceed with these values? I would really appreciate suggestions on accepted levels of R-squared, backed up by relevant literature. Thank you!

I am using a decision tree as a regression model (not as a classifier) on my continuous dataset in Python. However, I get a mean squared error of exactly zero. Does this mean my model is overfitting, or can this legitimately happen?

Dear community,

I ran a logistic regression with continuous IVs in SPSS. In the table "Variables in the Equation" one variable is missing (despite using the Enter method), without any message from SPSS. Browsing the web, I understood that this might happen due to collinearity. However, collinearity diagnostics did not return a clear sign of it: the highest VIF is 6.1 and the highest condition index is 21.1.

So my questions are:

1. Is my regression model still valid despite SPSS dropping one variable?

2. Are there reasons other than collinearity why the IV is missing from the model?

Thanks Ilka

I am testing hypotheses about relationships between CEA and innovation performance (IP). If I am testing the relationship of one construct, say management support, to IP, is it OK to use simple linear regression? Or should I test it in a multiple regression with all the constructs?

The optmatch library used with the MatchIt library is not compatible with R 4.1.2 and seems to be discontinued. Is anyone aware of an alternative that doesn't require downgrading to an old version? I'm trying to run full matching using a probit regression model.

Thank you

While working on a dataset with about 10 measured variables and two experimentally manipulated variables, finding a model that explains a good amount of variance (high R-squared) has become a cumbersome process for me. I am quite new to this area and would like to know if there are any resources that explain the art and science of building good regression models. What I understand from a Google search is that the stepwise regression procedure is not appreciated among experts.

I have multiple regression models, and I need to validate them with leave-one-out (LOO) cross-validation.

How do I calculate leave-one-out validation using SPSS and Minitab?
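For OLS models, the leave-one-out prediction errors coincide with the deleted residuals, whose sum of squares is the PRESS statistic that Minitab's regression output reports (SPSS, as far as I know, requires doing it by hand or via syntax). A self-contained sketch with made-up data, written as an explicit refit-without-each-point loop:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1]

def ols(x, y):
    # simple-regression OLS: returns (intercept, slope)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return my - b * mx, b

press = 0.0
for i in range(len(xs)):
    # refit without observation i, then predict it
    a, b = ols(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    press += (ys[i] - (a + b * xs[i])) ** 2  # squared deleted residual
print(round(press, 4))  # → 0.2601 (the PRESS statistic; lower is better)
```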

I’m in need of materials on exponential regression model estimation using maximum likelihood estimation or other methods

How can I estimate panel smooth transition regression models? Which software can I use to test this model, and is there any code for that?

Respected all,

I am using regression modelling for a crash prediction model. Can anyone tell me what the minimum value of chi-square should be for acceptance in publication?

Thank you all in advance.

There is a study on the impact of HIV on mortality in patients with COVID-19 (Article link: https://www.thelancet.com/journals/lanhiv/article/PIIS2352-3018(20)30305-2/fulltext). According to Figure 3 of the article, the

**unadjusted** impact of HIV on death risk **was not** significant; however, the **adjusted** effect **was** significant. What could be the reason for this?

I am running a panel regression model and I want to rank two of my independent variables on the basis of their degree of influence on the dependent variable.

Both are significant and have the same sign; the coefficient on V1 is 0.001 and on V2 is 0.003, which I believe is not a meaningful difference.

Is there any other way in which I can do this?

I have read that some people say the coefficient is not the correct measure for ranking variables.

For a set of spatial data I used the spatial error model. However, I am asked whether there are other strategies that can help absorb the autocorrelation.

Any help is welcome,

Thank you

Hi,

My regression model gives better results when I include an intercept dummy variable. However, the dummy variable is insignificant... why is that? How can I interpret it?

Thanks

In our research, the dependent variable is measured on a 5-point Likert scale. Also, to measure the dependent variable, **6 different questions are used**. We are using an ordered probit regression model. When constructing the value of the dependent variable, should we use the mode or the mean of all questions together? If there is a specific method for entering it, we would also like to know the reason why.

I have run an ARDL model on time-series cross-sectional data, but the output does not report the R-squared. What could be the reason(s)?

Thank you.

Maliha Abubakari

I am using Stata to run a linear regression with student engagement as the dependent variable, while planning, teaching strategies, use of resources, and the language proficiency of teachers are independent variables. The data for these variables were gathered on a Likert scale of 0 to 4 and are treated as continuous.

Dear all,

I am running several bivariate linear regressions to measure the impact of biases on controlling processes. These are all significant. I would like to determine which cognitive bias has the greatest impact on controlling processes. Can I compare the beta weights of the different bivariate regression models to find the bias with the biggest impact? My dependent variable always remains the same.

I have searched for hours for information but unfortunately found conflicting information.

Thanks in advance!

Dear Sir/Madam,

In time series regression, when should we use a static regression model, and when should we use a distributed lag or autoregressive distributed lag (ARDL) model?

How do you perform a purposeful selection model?

Which variables should be excluded or included?

Dear Sir/Madam,

In secondary-data time series analysis, when should we use a static regression model, and when should we use other regression models (distributed lag or autoregressive distributed lag)?

I am working on a multiple regression model with quadratic terms. Once I obtained the first multiple regression equation, I checked whether adding quadratic/cubic terms could increase my adjusted R-squared.

But I have a question: since I am using backward elimination, can I load all 12 terms and remove the least significant ones (as backward elimination does), proceeding until I get the final equation?

Dear researcher community.

I want to ask if it's possible to perform a meta-analysis on an index, i.e. on a value without a standard deviation. For instance, in regression models many indexes don't have a standard deviation.

I want to conduct a meta-analysis comparing two analysis approaches in regression modeling, so I need to know if I could use a goodness-of-fit index from each study that reports both methods. In this sense, for several papers I would have an n (data population) and an index for a conventional regression method (control), and an n and an index for an alternative regression method (experimental).

Many thanks in advance for your help.

Wilson

Dear altruists,

I am performing a spatial panel data model in r using SPLM package. My codes are:

SModxl <- read_excel("E:/Data/Spatial/SPLM/SpPanelMod 3079.xlsx", sheet = "Sheet1")

SMod <- pdata.frame(SModxl, index=c("Year", "Fips"))

WMat <- read_excel("E:/Data/Spatial/SPLM/Weights.xlsx")

Qnlw=mat2listw(WMat)

#SARAR

reg.Sarar1=spml(formula = y ~ x1+x2+x3, data = SMod, index = c("Year", "Fips"), listw=Qnlw, tol.solve=1.0e-20, lag = TRUE, spatial.error = "b", model = "within", effect = "individual", method = "eigen", na.action = na.omit, quiet = TRUE, zero.policy = TRUE, interval = NULL, control = list(), legacy = FALSE)

summary(reg.Sarar1)

summary(impacts(reg.Sarar1, listw=Qnlw, time = 15, R=5), zstats=TRUE)

My model runs fine. But when I try to get the marginal effects, to see the direct, indirect and total effects with significance tests, I use the following code, as in the spdep package: "summary(impacts(reg.Sarar1, listw=Qnlw, time = 15, R=5), zstats=TRUE)".

But it is not working.

Do you know any other techniques to estimate the marginal effects using splm package in r? Or is there any other packages available in r can do the task?

Your answer is very valuable to me and will be appreciated. Thanks in advance.

I have a dataset with some data missing completely at random (mostly < 10% missing). So far, so good. I am planning to compute regression models. However, the variables in my regression model, and in the entire dataset, are at best only weakly correlated (if correlated at all). The total N of my sample is 95. I am using SPSS 27.

As far as I know, multiple imputation is a regression-based technique. So I am wondering: does it make any sense to impute data? I read that one can use 'auxiliary' variables for imputation from the entire dataset (usually people would, for instance, use items from a questionnaire with complete data to impute the missings; but I don't have that luxury). In my case, the only auxiliary variables that I could use from my dataset are correlated around .20 - .30 at best; and measure different constructs and also were often measured at different time points than the data I would need to impute.

An additional question: is median- or mean-based imputation for covariates (control variables) okay in this case, where I really cannot/should not impute based on regression due to the lack of correlation in my dataset and missingness < 10%?

I find these issues are really under-reported in publications that use imputation techniques.