
# Regression Modeling - Science topic


Questions related to Regression Modeling

I'm trying to compare several linear regression models, but many of them do not report their SEE (standard error of the estimate). Is there an alternative method to calculate the SEE when we do not have access to the original data to compute residuals?

Hi,

I am using a pool of cross-sections over years (a cross-section of 255 firms over 11 years). I have used a simple/classical linear regression model (CLRM) and a logit model, given a continuous DV in the former and a dichotomous DV in the latter.

In logit regression, the econometric model is

Price_Deviation = B0 + B1(prestegious_investor) + B2(default_score) + Bk(controls)

where,

--> Price_Deviation is a dummy variable taking the value 0 or 1,

--> prestegious_investor is a dummy variable taking the value 0 or 1,

--> default_score is a continuous variable with range -inf to +inf, although it takes a minimum value of -37.12 and a maximum of 24.25 in my case.

I want to include interaction effects. Can they be included the same way as in the CLRM (or simple linear regression models), or is there an adjustment specific to logistic models?

please guide!
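As a sketch of the usual approach: the interaction enters the logit linear predictor as a product term, exactly as in OLS; what changes is the interpretation, because the effect on the probability scale then depends on the other covariates (Ai & Norton, 2003). All coefficients below are hypothetical:

```python
import math

def prob(prestigious, score, b0=-1.0, b1=0.8, b2=0.05, b3=0.3):
    # b3 is the interaction coefficient on prestigious * score
    z = b0 + b1 * prestigious + b2 * score + b3 * prestigious * score
    return 1 / (1 + math.exp(-z))  # logistic link

# The prestige "effect" on the probability scale differs across
# default_score values even with a single interaction coefficient:
effect_low = prob(1, 0) - prob(0, 0)
effect_high = prob(1, 10) - prob(0, 10)
print(round(effect_low, 3), round(effect_high, 3))  # → 0.181 0.587
```

This is why, in a logit, the coefficient on the product term alone does not summarize the interaction on the probability scale.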

Thanks and regards,

Sahil Narang

As for cumulative link models (also called ordinal logistic regression models), Christensen (2016) reports that, in the case the proportional odds assumption is not met, one or more covariates need to be scaled to relax the assumption.

However, when visualizing the summary, it may happen that the non-scaled covariate appears non-significant while the same scaled covariate appears significant. What is the meaning of that?

Thanks in advance,

Marcello

I am working on a logistic regression model with a binary outcome, and my model looks like this:

Intercept coefficient: **-2.034**. Variable coefficient: **0.031**, with a p-value of **0.130** (not significant).

However, when I calculate the odds ratio and its 95% CI I get an odds ratio of **1.031**, lower limit = **0.991**, upper limit = **1.073**. Since the variable is not significant, I shouldn't get an odds ratio greater than 1, correct?

Can anyone explain why this is happening?

In my next research project I will use a surface regression model to estimate a model for a target variable and exogenous strategy variables. Then, using this function, I will try to find the optimal X's maximizing Y. Do you know of any applications like the one I have in mind? If so, can you share their names?

Suppose we want to check the association between A and Z using a regression model. In model 1 we add covariates, say b, c, d, e; the regression shows a significant association between A and Z after adjustment for these covariates. However, in model 2 we add another covariate, say f, and when we run the model the association between A and Z becomes non-significant. What caused this non-significance? What could be the reason?

I am performing a meta-analysis of odds ratio per unit of a continuous variable with a dichotomous outcome (dichotomized continuous variable). One of the studies reports a mixed linear regression model with coefficient and standard error for the continuous variable regressed on the continuous outcome variable. Is there any acceptable method to estimate the odds ratio I need?
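One commonly cited approximation (Chinn, 2000) rescales a standardized effect to the log odds scale via the factor π/√3 ≈ 1.81. A minimal sketch, assuming the reported linear coefficient is first standardized by the SD of the continuous outcome; beta, se and sd_y below are hypothetical, and whether this approximation is acceptable for a given meta-analysis is a judgment call:

```python
import math

beta, se, sd_y = 0.40, 0.10, 1.6   # hypothetical reported values

factor = math.pi / math.sqrt(3)    # ≈ 1.8138 (normal-to-logistic rescaling)
log_or = factor * beta / sd_y      # approximate log odds ratio per unit
log_or_se = factor * se / sd_y     # its approximate standard error
print(round(math.exp(log_or), 3))  # approximate odds ratio per unit
```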

I am supposed to predict land use/land cover changes using rainfall, climate and socio-economic data. What regression model can be recommended?

I am running 6 separate binomial logistic regression models with all dependent variables having two categories, either 'no' which is coded as 0 and 'yes' coded as 1.

4 of the 6 models run fine; however, 2 of them produce this error message.

I am not sure what is going wrong, as each dependent variable has the same 2 values on the cases being processed, either 0 or 1.

Any suggestions what to do?

Suppose we split our data into training and test groups, and we use the training data alone to create a logistic regression model in SPSS. Does this model produce different results than a logistic regression model created in Python?

And how could we convert the probabilities the logistic regression model (SPSS model) produces for the test group into interpretable categorical outputs (0 or 1)?
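On the second part: the standard (and here assumed) rule is to apply a cutoff to the predicted probabilities, 0.5 by default, though the cutoff can be tuned. A minimal sketch with made-up probabilities:

```python
# Hypothetical predicted probabilities exported from SPSS for the test group:
probs = [0.12, 0.55, 0.49, 0.91]

cutoff = 0.5  # default; can be tuned, e.g. toward the event prevalence
preds = [1 if p >= cutoff else 0 for p in probs]
print(preds)  # → [0, 1, 0, 1]
```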

Dear All,

I want to develop a regression model that relates years of experience in an area to the ability of handling problems in this area. The ability to solve problems is answered through easy, average, and hard problems. A panel of experts will give these estimates.

Am I supposed to use fuzzy regression in this case? Any recommendation for a paper that I can start with (an easy one to replicate)? Also, if I have 5 experts, how do I derive one answer from their different answers?

Is there any other technique, besides fuzzy regression, that has been used in the literature?

Last, did anyone deal with a problem similar to the one I am describing?

I am currently investigating the differences between financial and non-financial firms, and apart from the traditional output of the regression model, I need to test for differences in their coefficients. I am using dummy variables 0 and 1 for non-financial and financial firms respectively. Could someone please assist me with the Stata code/command in this regard? Thank you.

I have developed a multiple regression model with a sample of 164 responses. Now I want to validate the model using a new data set. Is there any rule for the sample size that I can use? One colleague suggested I use 20% of the sample used to develop the model. Please, I need advice.

I made the following logistic regression model for my master's thesis.

FAIL= LATE + SIZE + AGE + EQUITY + PROF + SOLV + LIQ + IND.

I examine whether late filing of financial statements (independent variable) is an indicator of failure of small companies (dependent variable). FAIL is a dummy variable equal to 1 when a company failed during one of the researched years. I use data covering 3 years (2017, 2018 and 2019). Should I include a dummy variable YEAR to account for year effects, or not? I have searched online but I don't understand exactly what it means, and that is why I don't know if it is necessary to include it in this regression model. I hope you can help me. Thank you in advance!

I am looking to remove collinearity from my results. I have performed chi-square tests on two independent variables that I thought might be collinear. The chi-square test is significant, showing an association between the two categorical variables. Do I now need to exclude one of the variables from my final logistic regression model?

Thank you for your help

I have 215 samples and 9 independent variables. I want to know which independent variables best predict the DV. However, the 9 IVs are highly correlated with each other, so I don't think I can use a multiple regression model and compare the standardized betas (some of the VIFs are close to 20; in addition, the assumption of normality of residuals is violated). I wonder if I can instead run bivariate correlations and compare the correlation coefficients (since normality is violated, I used Spearman's rho for the bivariate correlation between the DV and each IV)? Do I have to transform Spearman's rho into Fisher's z, or can I compare the rho values directly?

Thank you so much for any advice!
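On the Fisher z question, a sketch under the usual assumptions: the transform is defined for Pearson's r, so applying it to Spearman's rho is an approximation, and comparing coefficients that share the same DV (same sample) strictly calls for a dependent-correlations test rather than the independent-samples formula noted below. The rho values are hypothetical:

```python
import math

def fisher_z(r):
    # Fisher z-transform: 0.5 * ln((1 + r) / (1 - r))
    return math.atanh(r)

# Hypothetical rho values of two IVs against the same DV:
z1, z2 = fisher_z(0.60), fisher_z(0.45)
# For *independent* samples of sizes n1, n2, the difference z1 - z2
# has standard error sqrt(1/(n1 - 3) + 1/(n2 - 3)).
print(round(z1, 4), round(z2, 4))
```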

I heard about a software called πonix, used for creating regression models. It is useful when the scales have many items. However, when I search for it, I find nothing.

Has anyone heard of this software?

Thanks for your help.

I'm attempting to build a binary response (single species presence/absence) logistic regression model with random intercept (for site variable level). I surveyed 30 sites 1-3 times; approx half of the sites were only visited once. Ideally, I would model site as the random effects variable and include one or two habitat variable(s) as the fixed effects variable(s). I recognize that my sample size is very low and suspect that I also have insufficient replication of observations per level of the random effects variable.

Is it possible to use random effects in this situation? If not what other approach would you recommend?

The only alternative I can think of is to build a regular binary-response logistic regression including only one observation per site, and repeat this for every possible combination of the 30 sites. I figure this would allow me to use all of my data to infer which covariates are most influential, although as far as I can tell it makes getting coefficient estimates, confidence intervals, AICc values, etc. difficult.

In general, an optimized model gives more accurate results than the local model, but in my case the traditional regression model shows a higher R² value than the optimized model. Why might that be? Thank you in advance for your kind feedback.

More exactly, do you know of a case where there are repeated, continuous data, sample surveys, perhaps monthly, and an occasional census survey on the same data items, perhaps annually, likely used to produce Official Statistics? These would likely be establishment surveys, perhaps of volumes of products produced by those establishments.

I have applied a method which is useful under such circumstances, and I would like to know of other places where this method might also be applied. Thank you.

Hello, many articles report "accuracy" when using ANFIS (or regression models) rather than RMSE or MAE.

How are they calculating it?
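A sketch of the usual regression error measures, plus one common (but by no means universal) way papers turn them into an "accuracy" figure, 1 − MAPE; the data below are made up:

```python
import math

actual = [3.0, 5.0, 2.5, 7.0]
pred = [2.5, 5.0, 3.0, 8.0]

n = len(actual)
mae = sum(abs(a - p) for a, p in zip(actual, pred)) / n
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / n)
mape = sum(abs(a - p) / a for a, p in zip(actual, pred)) / n
acc = 1 - mape  # one common "accuracy" convention for regression
print(round(mae, 4), round(rmse, 4), round(acc, 4))  # → 0.5 0.6124 0.8726
```

If a paper reports "accuracy" for a regression model without defining it, it is worth checking which of these (or another) convention is in use.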

There are several independent variables and several dependent variables. I want to see how the independent variables affect the dependent variables. In other words, I want to analyze:

[y1, y2, y3] = [a1, a2, a3] + [b1, b2, b3]*x1 + [c1, c2, c3]*x2 + [e1, e2, e3]

The main problem is that y1, y2, y3 are correlated: an increase in y1 may lead to a decrease in y2 and y3. In this situation, what multivariate multiple regression models can I use, and what are their assumptions?

Hi, I am running a logit regression in Stata. I do not see standard errors and p-values for some regressors (in a few models) or for all regressors (in some other models).

The pseudo R-squared is 1 in the models where Stata reports no SE or p-value for any regressor.

I understand that this is happening due to either the problem of separation (quasi or complete) or reduced degrees of freedom due to the inclusion of a large number of regressors (roughly k > n).

Details for dataset are as follow:

--> Model includes a regressor in the level form as well as the square term (quadratic function).

--> In addition, the model includes 9 control variables plus industry and year controls.

--> Industry includes 34 unique industries in the form of 2-digit industry codes.

--> Year control includes 11 unique years. Cross-sectional data is pooled across those 11 years.

If I remove the industry controls, the results become inconsistent across models: in some they give opposite signs, and in some they become insignificant.

I would be grateful for your kind suggestions.

Thank you!

As there are several software packages for estimating threshold panel regressions, such as Matlab, R, and Stata, I'm wondering which one is accurate and easier to use (for dynamic balanced panel data).

Regards

Maryam

We define the linearity of a model in terms of the coefficients, not the variables; therefore we treat Y = βX as a linear regression model [here both the coefficient and the variable are linear].

If the coefficient is linear but the variable is non-linear, as in Y = βX², why do some texts call this a non-linear regression model? [Here the variable is non-linear but the coefficient is still linear.]
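The usual convention is "linear in the parameters": Y = βX² can still be fitted by ordinary least squares after treating X² as the regressor, which is why many texts reserve "non-linear regression" for models non-linear in β (e.g. Y = e^(βX)). A toy sketch (noise-free data generated with β = 2):

```python
# Y = b * X**2 is linear *in the parameter* b, so OLS applies
# after transforming the regressor.
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 8.0, 18.0, 32.0]

Z = [x ** 2 for x in X]  # treat X^2 as the regressor
b = sum(z * y for z, y in zip(Z, Y)) / sum(z * z for z in Z)  # OLS through origin
print(b)  # → 2.0, the generating coefficient
```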

I have qualitative data for my research, in the form of ratings: AAA, AA, A, BBB, BB, B, CCC, CC, C.

How can I convert these qualitative values into quantitative ones to include in a regression model? I probably cannot use the simple dummy-variable method, as a single dummy can only take two values, 0 or 1, and in my scenario I have more than two categories.

Could anyone please advise on converting this qualitative data into quantitative form?
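Two standard options, sketched below: assign ordinal scores (which assumes equal spacing between adjacent grades), or create k−1 dummy columns for k categories, with one category as the reference level. The mapping is illustrative only:

```python
ratings = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C"]

# 1) Ordinal scores (assumes equal spacing between adjacent grades):
score = {r: len(ratings) - i for i, r in enumerate(ratings)}  # AAA=9 ... C=1

# 2) Dummy (one-hot) coding: k categories -> k-1 indicator columns,
#    with one rating (here "C") as the reference level.
def dummies(r):
    return [1 if r == level else 0 for level in ratings[:-1]]

print(score["AAA"], dummies("AA"))
```

More than two categories is not a problem for dummy coding; it simply takes several 0/1 columns rather than one.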

How do we calculate the beta coefficient (standardized coefficient) for the results of a mixed regression model with a negative binomial distribution, please?

I have all the data for landslide susceptibility mapping, and I want to analyze it using a logistic regression model, but I still have no idea how to process it.

I am running a Tobit model where the pseudo R-squared is 3.21 (a positive value). Is it okay to have a pseudo R-squared greater than 1, or is there some problem with the regression model?

I am trying to code this optimizer for a linear regression model. What I want to confirm is: are the model-parameter updates applied even if they increase the cost function?

Or do we only update the coefficient values if they decrease the value of the cost function?
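For plain gradient descent, yes: the update is applied every iteration regardless of whether it lowers the cost; only methods with a line search or an acceptance test (e.g. some trust-region schemes) check the cost before accepting a step. A minimal sketch on a toy one-parameter model:

```python
# Batch gradient descent for y = w*x on toy data with true w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w, lr = 0.0, 0.05
for _ in range(200):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # unconditional update, applied every iteration
print(round(w, 3))  # → 2.0
```

With a well-chosen learning rate the cost still trends downward overall, even though no step is ever "rejected".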

Hi all,

I have annual river run-off data and a catchment composite mean of PDSI values (a drought index). A simple regression model tells me that the PDSI predicts the run-off well (as can be physically expected). I checked the residuals, normality, homoscedasticity, and p-values. All perform fine; everything is significant. The data span 1869-2012.

Because I have the PDSI values for the period 0-2012, would it be possible to use the results of the model to reconstruct the run-off for the same period?

the model reads:

runoff_1869_2012 = ß0 + ß*PDSI_1869_2012

-> summary(model): ß0 = 1055.363 and ß=55.668

hence, would this make sense:

runoff_0_2012 = 1055.363 + 55.668*PDSI_0_2012

Would I then get the reconstructed run-off for the whole period, with a certain uncertainty?
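Mechanically the fitted equation can be applied to the earlier PDSI values, though extrapolating outside the 1869-2012 calibration period assumes the runoff-PDSI relationship was stationary, and an honest reconstruction should also carry the model's prediction interval, not just the point values. A sketch using the fitted coefficients from the question and hypothetical PDSI values:

```python
# Fitted (reported) coefficients: runoff = b0 + b1 * PDSI
b0, b1 = 1055.363, 55.668
pdsi_series = [-1.2, 0.3, 2.1]  # hypothetical pre-1869 PDSI values

runoff = [b0 + b1 * p for p in pdsi_series]  # point reconstruction only
print([round(r, 2) for r in runoff])  # → [988.56, 1072.06, 1172.27]
```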

Many thanks for suggestions on this very basic question!

all best,

Michael

Hello,

I am studying the effect of ESG on financial performance, measured as ROA. When I run my regression model without control variables, the ESG variable has a negative coefficient, while when I include them it has a positive coefficient. The control variables are log assets, log revenue, leverage, employees, etc.

Does anyone know why this could be the case?

Thanks in advance

Hello,

I would like to examine the association between a disease and psychological load. The disease can be determined by different diagnostic methods. In our study, we performed four accepted procedures to determine the disease. Each participant (total sample n=45) underwent all four procedures at different times. Additionally, we examined psychometric data, and these are the main variables I am focusing on.

My idea is to examine the association between the disease and psychological load, as a function of the diagnostic method chosen. In other words: Is the association between the disease - diagnosed by method A - and psychological load significantly different/stronger than the association between the disease - diagnosed by method B - and psychological load.

As for the statistical methods, I initially thought of logistic regression with the disease as criterion and the psychometric variables as predictors. This would lead to 4 regression models: SB diagnosed with A, SB diagnosed with B, etc. as the criterion. My idea is to compare the AICs of the four models: do the psychometric variables predict the disease better, and explain more variance, when it is diagnosed with method X rather than method Y?

I hope my question and concept are comprehensible.

Is this an appropriate approach or does anyone have another idea?
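The AIC comparison described above can be sketched as follows; the log-likelihoods are hypothetical stand-ins for the four fitted models, and it is the AIC *differences* (not the raw values) that carry meaning:

```python
def aic(log_lik, k):
    # AIC = 2k - 2*ln(L); lower is better
    return 2 * k - 2 * log_lik

# Hypothetical maximized log-likelihoods for the four diagnostic methods:
models = {"A": -20.5, "B": -18.9, "C": -22.1, "D": -19.4}
k = 3  # assumed: same number of estimated parameters in each model

aics = {m: aic(ll, k) for m, ll in models.items()}
best = min(aics, key=aics.get)
print(best, round(aics[best], 1))  # → B 43.8
```

Note that plain AIC comparison is only clean when all four models are fitted to the same cases; with different diagnostic outcomes as criteria, the comparison is informal rather than a formal test.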

Thank you very much for your replies!

Kind regards,

Nicole

For example, I want to know how baseline characteristics of patients (age, BMI...) and the confounding factors (smoking, diabetes or other chronic diseases) affect the serum vitamin D value. Which regression model should I use?

I use SPSS and R for analyzing the data

Dear all

I'm testing the parsimonious Technology Acceptance Model in the context of new retail technologies. As some studies found a significant influence of variables such as age, gender and education, I thought about adding them to my regression model together with my main independent variables, perceived usefulness and perceived ease of use.

I saw on ResearchGate that it is possible to just add the control variables together with the independent variables into the multiple regression model in SPSS. But to me it doesn't make sense, because these variables are then not really "controlled", just treated as additional independent variables.

Is that correct? Or do I need to do a hierarchical regression?

Here is the discussion that talks about entering them together with the independent variables:

Thank you very much!

Hi,

We would really appreciate some help building regression model(s) using a panel data set. The data set consists of data from retail stores over two years. Our research question is as follows: How do different payment methods affect retail stores' unregistered shrinkage?

As a result, we want to see which payment method gives the highest increase in shrinkage, and compare the payment methods.

We have tried the following models, but the results do not make sense: y = x1 + x2 + x3 + u

- Shrinkage = sale self-service checkout + revenue + region + u
- Shrinkage = sale ShopExpress + revenue + region + u
- Shrinkage = sale served checkout + revenue + region + u

We want to control for store size and region, which is why these variables are included. Also, all stores in the dataset have served checkouts, and some have self-service and ShopExpress (scan and go). Are we therefore unable to use dummies for the payment methods?

We have also tried this model, but also get weird results:

4. Shrinkage = dummy served checkout + dummy self-service + dummy ShopExpress + revenue + region

- Is there a better way to create the regression models? Is it possible to gather them into one model?
- Do you have suggestions for other control variables that we should include?
- To be able to run the random effects model, we had to transform the variables into natural logarithm form. Does it make sense to use ln?

Thank you!

Dear all,

I am developing a new method for soil analysis, and have several candidates.

I am comparing the results derived from the candidates with those from the reference method using linear regressions. So I have several linear regression models (same independent variable, different dependent variables) with different R-squared values and slopes.

I am wondering which statistical method should be used to choose the best regression model (i.e. the most appropriate candidate as the new method). Can R-squared and the slopes alone be used, or do I have to use other statistics, such as RMSE, SEE, or MAE?

Thanks a lot

Dear all,

I am working on a research study exploring the impact of air pollution on peak expiratory flow rate (PEFR). I have five areas within a city - Rambagh, Alopibagh, Ashoknagar, Katra and Johnstonganj - with a varying level of air pollution in each area. What I expect is that residents of an area with a high level of air pollutants will have a low PEFR.

The variable png is a dummy variable distinguishing those who consume tobacco from those who do not. The gender dummy variable has its usual meaning.

I have run a double-log regression model with results as shown in the image. Kindly help me interpret the coefficients of the area dummy variables and the coefficients of the interaction variable, **area#c.log_income**. Any further suggestions to improve the model are highly welcome. Thank you for your time.

Regards

Dear professor and Researcher

I have estimated an export supply function. I have taken the dependent variable (exports) in natural log, but one explanatory variable (producer price index) in levels.

The coefficient on the producer price index is 0.00745. Please guide me on how to interpret this value, i.e. how to convert it into a percentage.
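For a log-level model ln(exports) = α + β·PPI, the exact conversion is 100·(exp(β) − 1) percent per one-unit rise in PPI; for a coefficient this small it is effectively 100·β, i.e. about 0.745%. A minimal sketch:

```python
import math

b = 0.00745  # reported log-level coefficient on PPI

pct = 100 * (math.exp(b) - 1)  # exact percent change per unit of PPI
print(round(pct, 4))  # → 0.7478, close to the small-b shortcut 100*b = 0.745
```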

thanks and best regards

Irshad

Can we use an endogenous switching regression model to analyze the impact of an intervention using R software? If so, can anyone tell me the syntax?

I have built a logistic regression model and I know that the p-value and R² alone are not enough. Could anyone please suggest additional measures?

I was constructing a multiple regression model, inspired by two papers. The two papers didn't test for stationarity. When I tested stationarity and took differences, all the variables were insignificant and R-squared was very low (0.22); but without differencing, the variables were significant and R-squared was 0.78.

In a regression model, if the number of regressors k is very large, is there any systematic procedure to select the best possible model out of the very large number (i.e. 2^k - 1) of possible models without computing the AIC/BIC or another such index for each of them?
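One common family of answers is greedy search: forward (or backward) stepwise selection fits at most about k(k+1)/2 candidate models instead of 2^k − 1, and penalized methods such as the LASSO avoid enumeration altogether. A sketch of forward selection with a stand-in scoring function (any criterion the user supplies, e.g. AIC or cross-validated error; lower is better):

```python
def forward_select(features, score):
    # Greedily add the feature that most improves the score; stop
    # when no remaining candidate improves it.
    selected, remaining = [], list(features)
    best_score = float("inf")
    while remaining:
        s, f = min((score(selected + [f]), f) for f in remaining)
        if s >= best_score:
            break  # no candidate improves the fit
        best_score = s
        selected.append(f)
        remaining.remove(f)
    return selected

# Toy score: pretend only x1 and x3 reduce the criterion.
useful = {"x1": -2.0, "x3": -1.0}
toy_score = lambda feats: sum(useful.get(f, 0.5) for f in feats)
print(forward_select(["x1", "x2", "x3"], toy_score))  # → ['x1', 'x3']
```

Greedy search is not guaranteed to find the globally best subset; it trades optimality for a number of fits that grows quadratically rather than exponentially in k.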

Would it be better not to enter into the logistic regression model those variables with an extremely unbalanced distribution in the 2 groups, even if statistically significant (p < 0.05) in the bivariate analysis, for example a frequency of 0% or 100% in one of the two groups?

For instance, demographics could be potential confounders. Would it be better to include them as predictors in the logistic regression model along with the other predictors, or should I first factor them out and then use the residuals for further analysis?

Dear all,
I employed the xtabond2 command (System GMM) to estimate a dynamic panel regression model. I used lags of

**differences** and **levels** as instruments. However, I also want to apply the Kleibergen-Paap LM statistic to test under-identifying restrictions. Unfortunately, I don't know of any test implemented in the xtabond2 command for under-identifying restrictions. (The ivreg2 command contains the Kleibergen-Paap LM statistic; however, ivreg2 only allows computing results separately for the differenced or levels equations.) Is there any possibility to compute an aggregated Kleibergen-Paap LM statistic (levels and differences) with the xtabond2 command? Thank you in advance.

Hi,

I'm trying to fit a line to subgroups in a scatterplot in SPSS. Since the covariates in the linear regression model affect the direction of the regression, I need the plotted line to be adjusted for the covariates. Do you know how I can do this with the legacy dialogs tool, or how I can change the syntax? Thanks!

I used a logistic regression model for an analysis with over 17,000 observations. Although the model yields several statistically significant predictors, McFadden's adjusted R-squared (0.026) and Cragg & Uhler's R-squared (0.044) are very low. Can I proceed with these values? I would really appreciate suggestions on accepted levels of R-squared, backed up by relevant literature. Thank you!

I am using a decision tree as a regression model (not as a classifier) on my continuous dataset in Python. However, I get a mean squared error of exactly zero. Does this mean my model is overfitting, or can this legitimately happen?

Dear community,

I ran a logistic regression with continuous IVs in SPSS. In the table "Variables in the Equation" one variable is missing (despite using the Enter method), without any message from SPSS. Browsing the web, I understood that this might happen due to collinearity. However, collinearity diagnostics did not return a clear sign of it: the highest VIF is 6.1 and the highest condition index is 21.1.

So my questions are:

1. Is my regression model still valid despite SPSS dropping one variable?

2. Are there reasons other than collinearity why the IV is missing from the model?

Thanks Ilka

I am testing hypotheses about relationships between CEA and innovation performance (IP). If I am testing the relationship of one construct, say management support, to IP, is it OK to use simple linear regression? Or should I test it in a multiple regression with all the constructs?

The optmatch library used with the MatchIt library is not compatible with R 4.1.2 and seems to be discontinued. Is anyone aware of an alternative that doesn't require downgrading to an old version? I'm trying to run full matching using a probit regression model.

Thank you

While working on a dataset with about 10 measured variables and two experimentally manipulated variables, finding a model that explains a good amount of variance (high R-squared) has become a cumbersome process for me. I am quite new to this area and would like to know if there are any resources that explain the art and science of building good regression models. What I understand from a Google search is that the stepwise regression procedure is not appreciated among experts.

I have multiple regression models, and I need to validate them with leave-one-out (LOO) cross-validation.

How do I calculate leave-one-out validation using SPSS and Minitab?
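For OLS models, the leave-one-out prediction errors coincide with the deleted residuals, whose sum of squares is the PRESS statistic that Minitab's regression output reports (SPSS, as far as I know, requires doing it by hand or via syntax). A self-contained sketch with made-up data, written as an explicit refit-without-each-point loop:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1]

def ols(x, y):
    # simple-regression OLS: returns (intercept, slope)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return my - b * mx, b

press = 0.0
for i in range(len(xs)):
    # refit without observation i, then predict it
    a, b = ols(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    press += (ys[i] - (a + b * xs[i])) ** 2  # squared deleted residual
print(round(press, 4))  # → 0.2601 (the PRESS statistic; lower is better)
```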

I’m in need of materials on exponential regression model estimation using maximum likelihood estimation or other methods

How can I estimate panel smooth transition regression models? Which software can I use to test this model, and is there any code for that?

Respected all,

I am using regression modelling for a crash prediction model. Can anyone tell me what the minimum value of chi-square should be for acceptance in publication?

Thank you all in advance.

There is a study on the impact of HIV on mortality in patients with COVID-19 (Article link: https://www.thelancet.com/journals/lanhiv/article/PIIS2352-3018(20)30305-2/fulltext). According to Figure 3 of the article, the

**unadjusted** impact of HIV on death risk **was not** significant; however, the **adjusted** effect **was** significant. What could be the reason for this?

I am running a panel regression model and I want to rank two of my independent variables on the basis of their degree of influence on the dependent variable.

Both are significant and have the same sign; the coefficient on V1 is 0.001 and on V2 is 0.003, which I believe is not a meaningful difference.

Is there any other way in which I can do this?

I have read that some people say the coefficient is not the correct measure for ranking variables.

For a set of spatial data I used the spatial error model. However, I am asked whether there are other strategies that can help absorb the autocorrelation.

Any help is welcome,

Thank you

Hi,

My regression model gives better results when I include an intercept dummy variable. However, the dummy variable is insignificant... why is that? How can I interpret it?

Thanks

In our research, the dependent variable is measured on a 5-point Likert scale. Also, to measure the dependent variable, **6 different questions are used**. We are using an ordered probit regression model. When constructing the value of the dependent variable, should we use the mode or the mean of all questions together? If there is a specific method for entering it, we would also like to know the reason why.

I have run an ARDL model on time-series cross-sectional data, but the output does not report the R-squared. What could be the reason(s)?

Thank you.

Maliha Abubakari

I am using Stata to run a linear regression with student engagement as the dependent variable, while planning, teaching strategies, use of resources, and the language proficiency of teachers are independent variables. The data for these variables were gathered on a Likert scale of 0 to 4 and are treated as continuous.

Dear all,

I am running several bivariate linear regressions to measure the impact of biases on controlling processes. These are all significant. I would like to determine which cognitive bias has the greatest impact on controlling processes. Can I compare the beta weights of the different bivariate regression models to find the bias with the biggest impact? My dependent variable always remains the same.

I have searched for hours for information but unfortunately found conflicting information.

Thanks in advance!

Dear Sir/Madam,

In time series regression, when should we use a static regression model, and when should we use a distributed lag or autoregressive distributed lag (ARDL) model?

How do you perform a purposeful selection model?

Which variables should be excluded or included?

Dear Sir/Madam,

In secondary-data time series analysis, when should we use a static regression model, and when should we use other regression models (distributed lag or autoregressive distributed lag)?

I am working on a multiple regression model with quadratic terms. Once I obtained the first multiple regression equation, I checked whether adding quadratic/cubic terms could increase my adjusted R-squared.

But I have a question: since I am using backward elimination, can I load all 12 terms and remove the least significant ones (as backward elimination does), proceeding until I get the final equation?

Dear researcher community.

I want to ask if it's possible to perform a meta-analysis on an index, i.e. on a value without a standard deviation. For instance, in regression models many indexes don't have a standard deviation.

I want to conduct a meta-analysis comparing two analysis approaches in regression modeling, so I need to know if I could use a goodness-of-fit index from each study that reports both methods. In this sense, for several papers I would have an n (data population) and an index for a conventional regression method (control), and an n and an index for an alternative regression method (experimental).

Many thanks in advance for your help.

Wilson

Dear altruists,

I am performing a spatial panel data model in r using SPLM package. My codes are:

SModxl <- read_excel("E:/Data/Spatial/SPLM/SpPanelMod 3079.xlsx", sheet = "Sheet1")

SMod <- pdata.frame(SModxl, index=c("Year", "Fips"))

WMat <- read_excel("E:/Data/Spatial/SPLM/Weights.xlsx")

Qnlw=mat2listw(WMat)

#SARAR

reg.Sarar1=spml(formula = y ~ x1+x2+x3, data = SMod, index = c("Year", "Fips"), listw=Qnlw, tol.solve=1.0e-20, lag = TRUE, spatial.error = "b", model = "within", effect = "individual", method = "eigen", na.action = na.omit, quiet = TRUE, zero.policy = TRUE, interval = NULL, control = list(), legacy = FALSE)

summary(reg.Sarar1)

summary(impacts(reg.Sarar1, listw=Qnlw, time = 15, R=5), zstats=TRUE)

My model runs fine. But when I try to get the marginal effects, to see the direct, indirect and total effects with significance tests, I use the following code, as in the spdep package: "summary(impacts(reg.Sarar1, listw=Qnlw, time = 15, R=5), zstats=TRUE)".

But it is not working.

Do you know any other techniques to estimate the marginal effects using splm package in r? Or is there any other packages available in r can do the task?

Your answer is very valuable to me and will be appreciated. Thanks in advance.

I have a dataset with some data missing completely at random (mostly < 10% missing). So far, so good. I am planning to compute regression models. However, the variables in my regression model, and in the entire dataset, are at best only weakly correlated (if correlated at all). The total N of my sample is 95. I am using SPSS 27.

As far as I know, multiple imputation is a regression-based technique. So I am wondering: does it make any sense to impute data? I read that one can use 'auxiliary' variables for imputation from the entire dataset (usually people would, for instance, use items from a questionnaire with complete data to impute the missings; but I don't have that luxury). In my case, the only auxiliary variables that I could use from my dataset are correlated around .20 - .30 at best; and measure different constructs and also were often measured at different time points than the data I would need to impute.

An additional question: is median- or mean-based imputation for covariates (control variables) okay in this case, where I really cannot/should not impute based on regression due to the lack of correlation in my dataset and missingness < 10%?

I find these issues are really under-reported in publications that use imputation techniques.