
# Logistic Regression - Science topic

Explore the latest questions and answers in Logistic Regression, and find Logistic Regression experts.
Questions related to Logistic Regression
• asked a question related to Logistic Regression
Question
I see in some discussions people saying logistic regression is a general linear model, while others say that it is not a linear model. Looking at the math behind it, I think that it is not a linear model.
A logistic model can be linear or non-linear. If you use a linear predictor function, then it's a "linear model". But you can also use a nonlinear predictor function, in which case it is a "non-linear model". The predictor function is linked to the expected value (µ) by a link function. In a logistic model this link function is (usually) the logit.
Example:
logit(µ) = β0 + β1*X1 + ... + βk*Xk
is linear (in the parameters β - which are all simple coefficients here)
logit(µ) = sin( β0 + β1*X1 )^β3 + β4*X2
is not linear (not all parameters β can be expressed as simple coefficients).
The same relation between linear and nonlinear applies to other link functions (here it was the logit function), like the log ("log-linear" or "Poisson" model), the inverse, or the identity function (that is, the linear predictor represents µ directly; no transformation).
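To make the linear case concrete: the linear predictor is mapped back to a probability through the inverse logit. A minimal Python sketch with made-up coefficients (b0, b1, and x1 are purely illustrative):

```python
import math

def inv_logit(eta):
    """Map a linear predictor (log odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients for logit(mu) = b0 + b1*X1 (linear in the betas).
b0, b1 = -1.0, 0.5
x1 = 2.0
eta = b0 + b1 * x1    # linear predictor: -1.0 + 0.5*2.0 = 0.0
mu = inv_logit(eta)   # expected probability: 0.5
```

The same mechanic applies with other link functions; only the mapping from linear predictor to µ changes.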
• asked a question related to Logistic Regression
Question
If multivariate logistic regression is calculated in a study, is there any need to still calculate the adjusted multivariate logistic regression? If yes, is it appropriate to show both of them in research papers?
Hello Puneet,
Like Jamie Wallis , I don't attach any special meaning to "adjusted multivariate logistic regression" vs. "multivariate logistic regression;" they are the same.
1. A number of studies will report both the crude (one IV, one DV) odds ratios and the adjusted odds ratios (full set of IVs, one DV) for logistic regression models. Please note that the latter can differ from the former because of overlap among IVs, as well as the potential for suppressor variables.
2. It may seem trivial, but the correct term would be "multiple" or "multivariable" logistic regression here, not "multivariate." Among statisticians, multivariate refers to the case of having more than one DV in a model; univariate refers to the case of a single DV in a model.
• asked a question related to Logistic Regression
Question
I am working on a logistic regression model, with a binary outcome and my model look like this:
Intercept: coefficients is -2.034
Variable: coefficients are 0.031, and the P-value is 0.130. Not Significant.
However when I calculate the Odds Ratio and CI 95% I get
Odds ratio 1.031, Lower Limit = 0.991, Upper Limit = 1.073
However, since the variable is not significant, I shouldn't get a P-value greater than 1 correct?
Can anyone explain why this is happening?
I'm not exactly sure what you are asking, but a logistic regression slope coefficient of 0 (corresponding to an OR of 1) indicates no relationship. Your regression slope is not significantly different from zero at the .05 level (p = .13), and your 95% CI for the OR therefore includes 1.0. Both these values thus indicate that the relationship is not statistically significantly different from zero.
If your regression slope coefficient were significant at the .05 level (i.e., p < .05), you would expect to see a 95% CI for the OR that does not include 1.0.
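For illustration, the reported OR, CI, and p-value are mutually consistent and can be reconstructed from the coefficient. The standard error below (~0.0202) is an assumption, back-calculated from the reported CI, not a figure from the original output:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

b = 0.031     # reported slope coefficient
se = 0.0202   # ASSUMED standard error, back-calculated from the reported CI

or_point = math.exp(b)               # ~1.031
ci_low = math.exp(b - 1.96 * se)     # ~0.991, below 1
ci_high = math.exp(b + 1.96 * se)    # ~1.073, above 1
p_value = 2.0 * (1.0 - norm_cdf(b / se))   # two-sided Wald p, ~0.13
```

The CI straddling 1 and the p-value above .05 are two views of the same Wald test, which is why they always agree.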
• asked a question related to Logistic Regression
Question
Hi All,
I hope everything is fine.
I'm running various separate logistic regressions on my ordinal response variable (score). I did not include all my (continuous) independent variables (10) in one single model as the model wouldn't run. So, I have a separate model for variable 1 and the score, variable 2 and the score, variable n and the score.
So, my question is whether it is possible to use Bonferroni correction to adjust the significance level (p = 0.05). I would do so by dividing my significance level by the number of models I run.
Thanks a lot!
Laura
Two points before I can offer an answer:
1. Are you interested in how each of the 10 variables is associated with the response variable, or are you interested in how well some subset of the 10 can be combined to predict the response variable? Or do you have more specific research questions, such as: does variable A predict Y after conditioning on variable B?
2. Say more about why including all ten does not work. Is the sample small? Are the predictor variables collinear? etc.?
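As an aside, the Bonferroni adjustment the question describes is simply a division of the significance level by the number of tests. A minimal Python sketch (the p-values are made up):

```python
# Bonferroni: divide the significance level by the number of models (here 10).
alpha = 0.05
n_models = 10
alpha_bonf = alpha / n_models   # 0.005

# Equivalent view: multiply each p-value by the number of tests (capped at 1)
# and compare to the original alpha.
p_values = [0.001, 0.020, 0.049]           # hypothetical per-model p-values
adjusted = [min(1.0, p * n_models) for p in p_values]
significant = [p < alpha_bonf for p in p_values]
```

Note that Bonferroni controls the family-wise error rate at the cost of power, which is one reason the clarifying questions above matter before deciding whether to correct at all.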
• asked a question related to Logistic Regression
Question
I'm trying to construct a binary logistic regression model. The first model includes 4 predictor variables, and the intercept is not statistically significant. Meanwhile, in the second model, I exclude one variable from the first model, and the intercept is significant.
The consideration that I take here is that:
The pseudo R² of the first model is higher, suggesting it explains the data better than the second model.
Any suggestion which model should I use?
You should use the model that makes more sense, practically and/or theoretically. A high R² is not an indication of the "goodness" of the model. A higher R² can also mean that the model makes more wrong predictions with a higher precision.
Do not build your model based on observed data. Build your model based on understanding (theory) and the targeted purpose (simple prediction, extrapolation (e.g., forecasting), testing meaningful hypotheses, etc.).
Removing a variable from the model changes the meaning of the intercept. The intercepts in the two models have different meanings. They are (very usually) not comparable. The hypothesis tests of the intercepts of the two models test very different hypotheses.
PS: a "non-significant" intercept term just means that the data are not sufficient to statistically distinguish the estimated value (the log odds given all X=0) from 0, which means that you cannot distinguish the probability of the event (given all X=0) from 0.5 (the data are compatible with probabilities both larger and smaller than 0.5). This is rarely a sensible hypothesis to test.
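To make the PS concrete: testing the intercept against 0 is exactly testing whether the baseline probability (at all X = 0) differs from 0.5. A small Python sketch with a hypothetical intercept:

```python
import math

def inv_logit(eta):
    """Map log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical intercept: the log odds of the event when all X = 0.
intercept = -2.0
p_baseline = inv_logit(intercept)   # ~0.12, clearly below 0.5

# The test "intercept = 0" is exactly the test "baseline probability = 0.5".
p_at_zero = inv_logit(0.0)          # 0.5
```

Whether p = 0.5 at "all predictors zero" is a meaningful hypothesis depends entirely on whether that point is even inside the range of the data.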
• asked a question related to Logistic Regression
Question
Hello, is it possible to use logistic regression on pooled panel data? The dependent variable is whether or not the respondent has diabetes. The independent variables are income, gender, and education. Should the individual income observations be adjusted to reflect increasing (average) income over time? Are there any specific considerations that should be addressed?
Thank you.
Jakub
Why is this panel data and what do you mean by pooled? I am attaching what may be a similar study. Best, David Booth
• asked a question related to Logistic Regression
Question
I am performing a meta-analysis of odds ratio per unit of a continuous variable with a dichotomous outcome (dichotomized continuous variable). One of the studies reports a mixed linear regression model with coefficient and standard error for the continuous variable regressed on the continuous outcome variable. Is there any acceptable method to estimate the odds ratio I need?
Apologies for the sending; I don't have any idea why the site did that. David Booth
• asked a question related to Logistic Regression
Question
300 participants in my study viewed 66 different moral photos and had to make a binary choice (yes/no) in response to each. There were 3 moral photo categories (22 positive images, 22 neutral images, and 22 negative images). I am running a multilevel logistic regression (we manipulated two other aspects of the images) and have found unnaturally high odds ratios (see below). We have no missing values. Could anyone please help me understand what the below might mean? I understand I need to approach with extreme caution, so any advice would be highly appreciated.
Yes choice: morally negative compared to morally positive (OR = 441.11; 95% CI [271.07, 717.81]; p < .001)
Yes choice: morally neutral compared to morally positive (OR=0.94; 95% CI [0.47,1.87]; p=0.86)
It should be noted that when I plot the data, very very few participants chose yes in response to the neutral and positive images. Almost all yes responses were given in response to the negative images.
I think you have answered your question: "It should be noted that when I plot the data, very very few participants chose yes in response to the neutral and positive images. Almost all yes responses were given in response to the negative images."
This is what you'd expect even in a simple 2x2 design. If the probability of a yes response is very high in one condition and very low in the other, then the OR can be huge, as it's the ratio of the odds of a very large probability to the odds of a very small one.
This isn't unnatural unless the raw probabilities don't reflect this pattern. (There might still be issues, but not from what you described.)
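To illustrate numerically with assumed probabilities (illustrative, not the study's actual values), an OR in the reported neighborhood falls out of quite ordinary-looking proportions:

```python
# Assumed response probabilities (illustrative, not the study's actual values):
p_negative = 0.90   # P(yes | morally negative image)
p_positive = 0.02   # P(yes | morally positive image)

odds_negative = p_negative / (1.0 - p_negative)   # 9.0
odds_positive = p_positive / (1.0 - p_positive)   # ~0.0204
odds_ratio = odds_negative / odds_positive        # ~441
```

The OR blows up because the denominator odds are tiny, not because anything is wrong with the model.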
• asked a question related to Logistic Regression
Question
I got highly significant negative logit values from the logistic regression. Is a Nagelkerke R² of 0.22 large enough to explain the changes in the variables?
I agree with David Eugene Booth here. I wrote this brief explainer as a supplement to a book a few years back. I think Nagelkerke's is probably the least useful in practice as it hasn't really got an intuitive interpretation.
• asked a question related to Logistic Regression
Question
Dear All,
I understand the basic concept of forward and backward stepwise logistic regression. However, I am unsure of the indication for using a forward model in preference to a backward model, and vice versa.
Should you not end up with a final selection of variables that is similar regardless of whether you started with all the variables and subtracted (backward), or started with no variables and added (forward)? This is the reason I'm having difficulty justifying/selecting forward vs. backward.
Thank you very much
Ryan Ting
Understand Forward and Backward Stepwise Regression – Quantifying Health
• asked a question related to Logistic Regression
Question
In my current study, I am identifying the association between some independent variables and a dependent variable, for which I am using bivariate analysis (cross-tabs with p-values) and multivariate analysis (multiple regression with adjusted odds ratios). Some previous studies on my topic used different p-value cut-off points, e.g. p < 0.25 or 0.05, and others included some variables without such a restriction.
What should I do? Should I include the same variables in both of the bivariate and multivariate analyses?
You should not select variables by their p-value. You should build a model based on theory. If you consider a variable interesting or important enough to include it in the model, then include it. Otherwise don't.
• asked a question related to Logistic Regression
Question
Hello Experts,
I am working on a pooled country dataset (of 19 countries) to look at the association of linear growth (both as a continuous and a binary variable) with IFA intake during pregnancy, using linear and logistic regression.
I followed the model-building process suggested by Hosmer & Lemeshow, but the final predictor model had a p-value of <.05 for the goodness-of-fit test and a pseudo R-square of .06 (almost the same R-square value as for linear regression). The VIF was <10. I also tried the modification of the Hosmer-Lemeshow test for large samples proposed by Nattino, Pennell, and Lemeshow, as well as a calibration test (using Stata 15.1). Even these showed model fit issues.
I then went on to add interactions; it was pure data mining. I added all combinations of interactions one at a time, then added all significant ones to the final predictor model, followed by the elimination process. Some of these were nothing but noise, as evidenced by probability graphs, but the model fit turned out to be good. The pseudo R-squared became 0.07. If I remove the interactions which look like noise (in the graphs) or cause high multicollinearity (individual VIF values > 10), the good model fit goes away.
I understand that I should be using a multilevel model but there are multiple studies out there that have used pooled data with even larger sample sizes. Unfortunately, no one describes anything on model fit. Is it not necessary? Also the ICC value for the most basic model was .06.
Though R-square and pseudo R-square are not crucial statistics, such low values are making me question the models. Omitted variables come to mind, but I have used all important predictors found in the literature review.
I also understand that data mining is the wrong way to approach but if I try to use only the plausible interactions, none of them are significant. It doesn't make sense to use them and it brings me back to the final predictor model which was not a good fit.
I am also aware that some experts propose not using statistical significance as the basis to decide on predictors. Does that mean that we don't need to look at model fit either?
I am not sure what is the right way to decide on the final model. I have attached a sheet that shows all the fit statistics that I have used for my models. I will really appreciate guidance on this.
Many thanks
Deepali
Could it simply be a result of the analysis that the overall explanatory power of these variables is low (or lower than expected)? Perhaps it just is what it is...
• asked a question related to Logistic Regression
Question
I am running 6 separate binomial logistic regression models with all dependent variables having two categories, either 'no' which is coded as 0 and 'yes' coded as 1.
4/6 models are running fine, however, 2 of them have this error message.
I am not sure what is going wrong, as each dependent variable has the same 2 values on the cases being processed, either 0 or 1.
Any suggestions what to do?
That error message says that for 2 of your DVs, everyone has the same value. Try this:
TEMPORARY.
SELECT IF NMISS(x1, x2, x3) EQ 0.
FREQUENCIES y1 y2.
Replace x1, x2, x3 with the list of explanatory variables in your model(s). And replace y1 y2 with the two DVs that are causing the error message to occur.
PS- Be sure to highlight and run all 3 lines together. Do not execute the SELECT IF line separately from TEMPORARY.
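For readers working outside SPSS, the same diagnostic (tabulate each DV on the cases that are complete on the predictors) can be sketched in Python. The variable names and toy data below are made up:

```python
# Toy data: three predictors x1..x3 and two binary DVs y1, y2 (None = missing).
rows = [
    {"x1": 1, "x2": 0, "x3": 2,    "y1": 0, "y2": 1},
    {"x1": 3, "x2": 1, "x3": None, "y1": 1, "y2": 0},
    {"x1": 2, "x2": 1, "x3": 0,    "y1": 1, "y2": 1},
]

# Keep only cases with no missing predictors (the SELECT IF NMISS(...) EQ 0 step).
complete = [r for r in rows if all(r[k] is not None for k in ("x1", "x2", "x3"))]

# Tabulate each DV on the retained cases (the FREQUENCIES step): a DV with a
# single remaining value cannot be modeled by logistic regression.
constant_dvs = [dv for dv in ("y1", "y2") if len({r[dv] for r in complete}) < 2]
```

Here y2 looks fine in the full file but becomes constant once incomplete cases are dropped, which is exactly the situation the error message describes.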
• asked a question related to Logistic Regression
Question
Suppose we split our data into training and test groups, and we use the training data alone to create a logistic regression model in SPSS. Does this model produce different results than a logistic regression model created in Python?
And how could we convert the probabilities from the logistic regression model (the SPSS model) for the test group into interpretable categorical outputs (0 or 1)?
I don't understand what you want to do. In any case, the attached may be helpful to you. Notice the validation step here works better than a single split, etc. Best wishes, David Booth
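On the second part of the question, converting predicted probabilities into 0/1 outputs is just applying a cutoff; 0.5 is the common default, though it need not be optimal for a given problem. A minimal Python sketch with hypothetical probabilities:

```python
# Predicted probabilities for the test group (hypothetical values):
probs = [0.12, 0.55, 0.49, 0.91]

# Apply a cutoff to obtain 0/1 class labels; 0.5 is the usual default.
cutoff = 0.5
labels = [1 if p >= cutoff else 0 for p in probs]   # [0, 1, 0, 1]
```

The same thresholding works on the saved predicted probabilities exported from SPSS.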
• asked a question related to Logistic Regression
Question
I made the following logistic regression model for my master's thesis.
FAIL = LATE + SIZE + AGE + EQUITY + PROF + SOLV + LIQ + IND
where I look at whether late filing of financial statements (independent variable) is an indicator of failure of small companies (dependent variable). FAIL is a dummy variable that is equal to 1 when a company failed during one of the researched years. I use data covering 3 years (2017, 2018, and 2019). Should I include a dummy variable YEAR to account for year effects, or not? I have searched online, but I don't understand what it exactly means, and that is why I don't know if it is necessary to include it in this regression model. I hope you guys can help me. Thank you in advance!
Any new variable will only improve R², but one must not introduce variables that the theoretical model does not encompass. If you have no guesses (even better if others have already discussed it) on how a new independent variable might influence the dependent variable, it's best not to include it. The more variables, the more multicollinearity. Any additional variation that you explain by adding the year variable will be misleading.
Adding time or space variables, for that matter, is almost always a bad idea. If you run LOO tests by year, they will show that adding year only worsens the actual predictive power of your model.
However, you might want to consider including some time-dependent macroeconomic indicators that impact bankruptcy, although they will cause more multicollinearity. I still strongly recommend running LOO cross-validation, and perhaps target-oriented cross-validation if you can handle it, to show that such indicators do not spoil your model.
• asked a question related to Logistic Regression
Question
For my bachelor thesis I'm conducting a study on the relationship between eye movements and memory. One of the hypotheses is that the number of fixations made during the viewing of a movie clip will be positively related to memory of that movie clip.
Each participant viewed 100 movie clips, and the number of fixations were counted for each movie clip for each participant. Later participants' memory of the clips were tested and each movie was categorized as "remembered" or "forgotten" for each participant.
So, for each participants there are 100 trials with the corresponding number of fixations and categorization as "remembered" or "forgotten".
My first idea was to do a paired-samples t-test (to compare the number of fixations between "remembered" and "forgotten"), but I didn't find a way to do that in SPSS with this file format, as there are 100 rows for each participant. I thought of calculating the average number of fixations for the remembered vs. forgotten movies per participant and doing a t-test on these means (one mean per participant for each category), but this way the means get distorted, because some subjects remember far more clips than others (so the "mean of the means" is not the same as the overall mean).
Now I'm thinking that a t-test might not be appropriate at all, and that logistic regression would be a better choice (to see how well the number of fixations predicts whether a clip will be remembered vs. forgotten), but I couldn't find out how to do this in SPSS for a within-subjects design with multiple trials per participant. Any help/suggestions would be highly appreciated.
I believe Blaine Tomkins meant to describe the data as having a LONG format, not a wide format. Apart from that, I concur with his advice. SPSS can estimate that model. Look up the GENLINMIXED command:
A good resource is the book by Heck, Thomas & Tabata (if you can get your hands on it):
HTH.
• asked a question related to Logistic Regression
Question
I am baffled by the ZTC term. In this case it would be ZTC = 1 and course mode = 0 (traditional). So why is the odds ratio higher than the odds ratio for ZTC*hybrid? The graph shows that the yield is greater for hybrid courses (red line). I see that there is a 78% increase in the odds of success in ZTC for traditional courses, but that is higher than the effect for ZTC*hybrid, which I thought would itself have been higher!
(any blanks you see were not a part of this model)
Hello again Lisa,
It's not clear why Test 1 result list omits ORs/p-values for so many variables, and ditto for Test 2...that's one of the reasons for asking for clarification.
Looking at "Test 2" output,
1. Having ztc significantly improves odds of success;
2. Online course significantly improves odds of success (vs. traditional);
3. Hybrid course significantly reduces odds of success (vs. traditional);
4. ztc combined with online significantly reduces odds of success (more than you would expect from the related "main effects" directions, _and_ relative to traditional);
5. ztc combined with hybrid significantly increases odds of success (more than you would expect from the related main effects directions, _and_ relative to what happens with traditional classes).
If you look at the slopes of the Test 2 output plot, you'll see that ztc appears to have the sharpest increase in likelihood of success for the hybrid condition (but that the hybrid condition is still lower in odds of success than either of the other two methods). That is completely consistent with the significance test outcomes.
As well, ztc has less influence on success likelihood for online class than for traditional class. Again, consistent with the test results.
Good luck with your summary and presentation.
• asked a question related to Logistic Regression
Question
I want to draw a graph of predicted probabilities vs. observed probabilities. For the predicted probabilities I use the R code below. Is this code OK or not?
Could anyone tell me how I can get the observed probabilities and draw a graph of predicted vs. observed probability?
analysis10 <- glm(Response ~ Strain + Temp + Time + Conc.Log10 +
                  Strain:Conc.Log10 + Temp:Time,
                  family = binomial, data = household)  # family/data arguments assumed; 'Conc.Log1' corrected to 'Conc.Log10'
predicted_probs <- data.frame(probs = predict(analysis10, type = "response"))
I have attached that data file
Plotting observed vs predicted is not sensible here.
You don't have observed probabilities; you have observed events. You might use "Temp", "Time", and "Conc.Log10" as factors (with 4 levels) and define 128 different "groups" (all combinations of all levels of all factors) and use the proportion of observed events within each of these 128 groups. But you have only 171 observations in total, so there is no chance to get any reasonable proportions (you would need tens or hundreds of observations per group for this to work reasonably well).
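The grouping-and-proportions idea can be sketched in Python (the data below are invented, and in practice each group would need many more trials than shown here):

```python
from collections import defaultdict

# Invented binary outcomes, keyed by a combination of factor levels:
observations = [
    (("A", "low"), 1), (("A", "low"), 0), (("A", "low"), 1), (("A", "low"), 1),
    (("B", "high"), 0), (("B", "high"), 0), (("B", "high"), 1),
]

counts = defaultdict(lambda: [0, 0])   # group -> [events, trials]
for group, y in observations:
    counts[group][0] += y
    counts[group][1] += 1

# Observed proportion of events per group; only trustworthy with many trials.
observed_props = {g: events / trials for g, (events, trials) in counts.items()}
```

With only a handful of trials per group, these proportions are far too noisy to plot against model predictions, which is the point made above.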
• asked a question related to Logistic Regression
Question
I want to estimate the half-life value for the virus as a function of strain and concentration, and as a continuous function of temperature.
Could anybody tell me, how to calculate the half-life value in R programming?
I have attached a CSV file of the data
Ok, that's corrected.
What is the "Response"? Is it an indicator of whether the virus is detected? If so, then it's hard to estimate the half-life, as both the concentration and the sensitivity of the assay are unknown. I also wonder how Response could be 0 at Time 0.
It would be possible to fit a Cox model (as David suggested); you could also fit a binomial model and find the roots of the prediction function (on the logit scale), which is the log odds of "Response = 1" vs. "Response = 0". But it's not clear whether log odds = 0 corresponds to 50% survival.
• asked a question related to Logistic Regression
Question
Hi,
I am doing a study using logistic regression, where we want to control for a few possible confounders. However, many of the confounders, as well as the independent variable, are categorical and have been recoded as dummies. How do we use the 10% rule for dummies? Do they all have to differ by 10%, or is something considered a confounder if one of the dummies causes a difference of 10%?
In binary logistic regression, the outcome is usually coded 1/0; then I am assuming you have an exposure of interest, and all other variables are potential confounders coded as indicator variables. Assuming a large sample size relative to the number of potential confounders, you can run a model with the exposure of interest and all confounders. The adjusted OR for the exposure-disease association (controlling for all confounders) is the most valid estimate of that association (assuming accurate measurements, etc.). You could stop there.
Depending on the research question, you could consider simplifying the model by removing some of the variables which appear to be weak confounders. Remove one of these variables, then compare the adjusted OR for the exposure-disease association from this reduced model with the adjusted OR controlling for all variables (which is considered the "gold standard"). If the reduced-model OR is similar to the gold-standard OR (within 10%), leave that variable out. Repeat this simplification, always comparing to the gold-standard OR; if dropping a variable moves the adjusted OR more than 10% away from the gold standard, place that variable back into the model and assess another variable. While not necessarily perfect, this change-in-estimate approach is reasonable.
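The change-in-estimate bookkeeping described above can be sketched as follows; the ORs are hypothetical:

```python
def pct_change_in_or(or_gold, or_reduced):
    """Percent change of a reduced-model OR relative to the full ('gold standard') model."""
    return abs(or_reduced - or_gold) / or_gold * 100.0

# Hypothetical adjusted ORs for the exposure of interest:
or_full = 2.40           # model with all candidate confounders
or_without_c1 = 2.47     # confounder C1 removed
or_without_c2 = 3.10     # confounder C2 removed

keep_c1 = pct_change_in_or(or_full, or_without_c1) > 10.0   # C1 is a weak confounder
keep_c2 = pct_change_in_or(or_full, or_without_c2) > 10.0   # C2 must stay in the model
```

Here dropping C1 moves the exposure OR by under 3%, so it can be left out, while dropping C2 moves it by nearly 30%, so it goes back in.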
• asked a question related to Logistic Regression
Question
I have to investigate
1) how the response depends on the strain, temperature, time, and concentration.
I applied logistic regression (glm) and got the reduced model. When I tried to make the logistic regression line and confidence interval, it looks like that in the picture. (pic attached below)
Could anybody tell me, how to resolve this issue (I want only one logistic regression and two confidence interval lines, not many)?
I have attached the data
For confidence interval I use this
prediction <- as.data.frame(predict(analysis13, se.fit = TRUE))
prediction.data <- data.frame(pred = prediction$fit,
  upper = prediction$fit + (1.96 * prediction$se.fit),
  lower = prediction$fit - (1.96 * prediction$se.fit))
plot(household$Conc.Log10, prediction.data$pred, type = "l",
  xlab = "width", ylab = "linear predictor", las = 1, lwd = 2, ylim = c(-10, 6))
lines(household$Conc.Log10, prediction.data$upper, lwd = 2, lty = 2, col = "dark red")
lines(household$Conc.Log10, prediction.data$lower, lwd = 2, lty = 2, col = "dark red")
Dear Zuffain
It appears you are looking to draw counterfactual plots (one predictor changing while the other predictors are held at a constant value, usually the average). Read from page 89 onwards of Gelman, A., & Hill, J. (2007), Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press, for more details.
However, I am using Monte Carlo approximation instead of the glm function in R, but you can figure it out.
Best wishes
• asked a question related to Logistic Regression
Question
I'm attempting to build a binary response (single species presence/absence) logistic regression model with random intercept (for site variable level). I surveyed 30 sites 1-3 times; approx half of the sites were only visited once. Ideally, I would model site as the random effects variable and include one or two habitat variable(s) as the fixed effects variable(s). I recognize that my sample size is very low and suspect that I also have insufficient replication of observations per level of the random effects variable.
Is it possible to use random effects in this situation? If not what other approach would you recommend?
The only alternative that I can think of is to build a regular binary response logistic regression including only one observation per site, and repeat this for every possible combination of 30 sites. I figure this would allow me to use all of my data to infer which covariates are most influential, although it makes getting coefficient and coefficient confidence estimates, AICc values, etc. difficult as far as I can tell.
• asked a question related to Logistic Regression
Question
Hello there,
I ran a logistic regression of several variables on a dichotomous-dependent variable. One of the independent variables showed the following results: exponent B = 2.16, 95% CI for exponent B is 0.99 - 4.71, and p-value = 0.053.
I wonder if these results are treated as significant or not?
Many thanks
I think the p-value of 0.053 is the two-tailed result, and I agree with the professors that it is significant at the 10% level (p < 0.1). However, you may check the reliability and validity of the variables; if any item could be deleted, and Cronbach's alpha and the factor loadings are over 0.7, the p-value might be accepted at p < 0.05.
• asked a question related to Logistic Regression
Question
In a time-stratified case-crossover study, a case day is matched to several control days, which makes conditional logistic regression the suitable analysis. But how should the data be formatted in Excel, and how is the model fitted in R? Are there any detailed papers that I can refer to?
Take a look at the attached screenshot information. PS: don't start collecting data before you know what you are going to do with it. David Booth
• asked a question related to Logistic Regression
Question
Hello, fellow researchers! I'm hoping to find someone well familiar with Firth's logistic regression. I am trying to analyse whether certain emotions predict behaviour. My outcomes are 'approached', 'withdrew', & 'accepted' - all coded 1/0 & tested individually. However, in some conditions the outcome behaviour is a rare event, leading to extremely low cell frequencies for my 1's, so I decided to use Firth's method instead of standard logistic regression.
However, I can't get the data to converge & get warning messages (see below). I've tried to reduce predictors (from 5 to 2) and increase iterations to 300, but no change. My understanding of logistic regression is superficial so I have felt too uncertain to adjust the step size. I'm also not sure how much I can increase iterations. The warning on NAs introduced by coercion I have ignored (as per advice on the web) as all data looks fine in data view.
My skill-set is only a very 'rusty' python coding, so I can't use other systems. Any SPSS friendly help would be greatly appreciated!
***
Warning messages:
1: In dofirth(dep = "Approach_Binom", indep = list("Resent", "Anger"), :
NAs introduced by coercion
2: In options(stringsAsFactors = TRUE) :
'options(stringsAsFactors = TRUE)' is deprecated and will be disabled
3: In (function (formula, data, pl = TRUE, alpha = 0.05, control, plcontrol, :
logistf.fit: Maximum number of iterations for full model exceeded. Try to increase the number of iterations or alter step size by passing 'logistf.control(maxit=..., maxstep=...)' to parameter control
4: In (function (formula, data, pl = TRUE, alpha = 0.05, control, plcontrol, :
logistf.fit: Maximum number of iterations for null model exceeded. Try to increase the number of iterations or alter step size by passing 'logistf.control(maxit=..., maxstep=...)' to parameter control
5: In (function (formula, data, pl = TRUE, alpha = 0.05, control, plcontrol, :
Nonconverged PL confidence limits: maximum number of iterations for variables: (Intercept), Resent, Anger exceeded. Try to increase the number of iterations by passing 'logistpl.control(maxit=...)' to parameter plcontrol
You seem to be using R's logistf function (I guess from within SPSS). Therefore you need to fine-tune the related parameters. There are functions like logistf.control, logistf.mod.control, and logistpl.control which handle this tuning, as described in the help document (https://cran.r-project.org/web/packages/logistf/logistf.pdf).
I don't know if these functions are accessible from within SPSS, but I presume so.
You can try fine-tuning parameters with these functions to push the procedure to converge.
Alternatively, you can try other software or another penalization procedure, but the latter requires broader computation skills.
• asked a question related to Logistic Regression
Question
I am conducting research on determinants of entry mode, where equity vs. non-equity is my DV and I have a few IVs: family ownership (dummy), international experience, market competition, and dynamism. Furthermore, I have a moderating variable, host-country network (dummy variable). Is the interpretation of the interaction effect of Family Ownership x Host-Country Network on Entry Mode different than for the other variables? I haven't found any literature on the interaction effect between 3 categorical variables.
Two questions. Have you done an interaction plot? the second one is are you using logistic regression? Best wishes David Booth
• asked a question related to Logistic Regression
Question
My question concerns the problem of calculating odds ratios in logistic regression analysis when the input variables are on different scales (e.g., 0.01-0.1, 0-1, 0-1000). Although the coefficients of the logistic regression look fine, the odds ratio values are, in some cases, enormous (see example below).
In the example there were no outlier values in any of the input variables.
What is the general rule: should we normalize all input variables before the analysis to obtain reliable OR values?
Sincerely
Mateusz Soliński
You need to interpret the OR using the exponential of the estimates.
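As a sketch of this advice (the coefficient names and values below are hypothetical, not from the question), exponentiating each coefficient gives its odds ratio. Note that the scale of the predictor matters: the OR applies per one-unit change, which is why variables on very different scales can produce very different-looking ORs.

```python
import math

# Hypothetical logistic-regression coefficients on the log-odds scale;
# names and values are illustrative, not taken from the question.
coefs = {"age": 0.05, "treated": -0.7, "income_k": 0.002}

# The odds ratio for each predictor is exp(coefficient),
# interpreted per ONE-UNIT increase in that predictor.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, or_ in odds_ratios.items():
    print(f"{name}: OR = {or_:.3f}")

# Scale matters: for a predictor spanning 0-1000, the OR across its whole
# range is exp(1000 * b), which can look enormous even when b is small.
or_full_range = math.exp(1000 * coefs["income_k"])
print(f"income_k OR across full 0-1000 range = {or_full_range:.2f}")
```

Standardizing predictors (e.g., per standard deviation) before fitting makes the ORs comparable across variables without changing the model's fit.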
• asked a question related to Logistic Regression
Question
Hello, I have a question about the multinomial logit model and the conditional logit model. I have read a book (Logistic Regression Using SAS: Theory and Application) which states that the multinomial logit model is a special case of the conditional logit model, while also noting that the two models differ in two ways: the conditional logit model can include characteristics of the choice options, and the set of available options can vary across individuals in the analysis.
Suppose I have a research in which different participants may be facing different options, therefore, which model (multinomial logit model and conditional logit model ) should I use? Can I keep using multinomial logit model since it is just a special case of conditional logit model?
Thank you.
You might find the search in the attachment to be helpful to you. Best wishes, David Booth
• asked a question related to Logistic Regression
Question
Please, I divided my data (patients) into two groups, a (Yes) and b (No), and I need to examine the preoperative factors (categorical, with 2 or 3 levels, and nominal) for group b only.
I tried to use logistic regression, but in the independent-variables box I cannot restrict the analysis to group b, which forces me to put both groups in the dependent area. However, I need to examine which factors affect group b and make it negative (No); that is, I need to determine whether there is any relationship between group b and the factors.
Thanks
I guess your task is to define factors that predict response b (No).
If so, you need all the data and a logistic regression predicting the outcome a (Yes) or b (No).
If not, you should define a DV for group b (No).
• asked a question related to Logistic Regression
Question
How does Stata calculate the pseudo R squared that it displays after logistic regression and how is it best interpreted?
This is McFadden's R^2 (Stata labels it "Pseudo R2") and it is a measure of goodness of fit. Values of 0.2-0.4 usually indicate very good fit.
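McFadden's pseudo R^2 is 1 minus the ratio of the fitted model's log-likelihood to the null (intercept-only) model's log-likelihood. A minimal sketch with made-up outcomes and fitted probabilities:

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y given predicted probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

# Toy data: observed outcomes and fitted probabilities from some model
# (values are illustrative, not from the question).
y     = [1, 1, 1, 0, 0, 0, 1, 0]
p_hat = [0.9, 0.8, 0.7, 0.2, 0.1, 0.4, 0.6, 0.3]

ll_model = log_likelihood(y, p_hat)

# Null model: predict the overall event rate for everyone.
p_null  = sum(y) / len(y)
ll_null = log_likelihood(y, [p_null] * len(y))

mcfadden_r2 = 1 - ll_model / ll_null
print(f"McFadden's pseudo R^2 = {mcfadden_r2:.3f}")
```

Unlike OLS R^2, this is not a proportion of variance explained, which is why its typical "good" range is much lower.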
• asked a question related to Logistic Regression
Question
I have all the data regarding landslide susceptibility mapping, and I want to analyze the data by using the logistic regression model, but still, I have no idea how to process it.
In my experience, if your maps are in shapefile format (boundary maps), you can transform them into a raster file (grid format). Then, with this data structure, you can use R for logistic regression.
My paper "Logistic Regression Model of Built-Up Land Based on Grid-Digitized Data Structure: A Case Study of Krabi, Thailand" explains this approach.
• asked a question related to Logistic Regression
Question
Hello,
I have conducted a logistic regression (method: forced entry) in R to analyze a group of independent variables (n=4). My criterion is a disease (yes/no). My sample size is n=105. I selected these variables based on theoretical consideration (some of them are known risk factors). My aim was to examine how well these variables predict the outcome and which of them are significant. My results indicate that only one of the 4 variables is significant (one trending).
Now I am wondering what might be a reasonable next step:
1) Is it appropriate to report the model "as it is" with R^2 (Nagelkerke etc.), a goodness of fit statistic (chi-square test), even if only one variable is significant? My idea behind this is to demonstrate the status quo and show that some variables indeed have no predictive power (at least in our data set)
2) In the next step (after looking at the first model), should I run a stepwise backward regression to find a model that best fits the data and contains only significant predictors? If I understand correctly, stepwise methods are more useful for exploratory analysis with a large sample size.
3) Can I just exclude all non-significant predictors in model 1, rerun my logistic regression analysis and report model 2 with the one significant predictor and perhaps the one trending (I'm not sure if this is actually a stepwise backward regression).
I'm looking forward to hearing from you and thank you all for your help!
Best regards,
Nicole
Nicole Walentek, yes, you may check univariate logistic regression first. Also, you can try removing the constant from the model :)
• asked a question related to Logistic Regression
Question
I ran a logistic regression model with PTSD, MDD, Nativity, and (PTSD*Born outside the US) interaction term predicting Nicotine Dependence (yes/no). The main effect of Born Outside the US (ref: born in the US) has OR=9.13, PTSD main effect OR=2.12. However, the interaction term of PTSD and Born outside the US has OR=0.31. I find it very strange that the OR changed direction. Can anyone advise on the potential explanations for such results?
Hi Stanislava Klymova, by including this interaction term in your model, you're assuming that the effect of being born outside the US is different for those with/without PTSD. Another way of interpreting this coefficient is the 'extra effect' of being born outside the US while having PTSD (assuming you included these predictors as binary variables with yes = 1). It may be a good idea to first run a likelihood ratio test comparing two models - one with/one without the (PTSD*Born outside the US) interaction term to test whether including this effect modification improves the fit of your model.
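The suggested likelihood-ratio test compares the log-likelihoods of the two nested models; twice their difference follows a chi-square distribution with df equal to the number of extra parameters (here 1). A sketch with made-up log-likelihood values standing in for the two fitted models:

```python
import math

# Hypothetical log-likelihoods of the two fitted logistic models
# (with and without the PTSD x nativity interaction term).
ll_without_interaction = -210.4   # made-up value
ll_with_interaction    = -207.1   # made-up value; never below the reduced model's

lr_stat = 2 * (ll_with_interaction - ll_without_interaction)

# p-value from a chi-square distribution with df = 1 (one extra parameter);
# for df = 1, P(X > x) equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2))
print(f"LR = {lr_stat:.2f}, p = {p_value:.4f}")
```

A small p-value would indicate that the interaction term improves model fit, supporting its retention.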
• asked a question related to Logistic Regression
Question
Hi,
I want to present the results of multinomial logistic regression at a conference in a visual way,
is it enough to present the table of the results in the power point,
thank you
Sam H.Bahreini I would also consider showing ROC curves and maybe the resultes of your Hosmer-Lemeshow goodness of fit tests. I have examples of each; DM me if interested.
• asked a question related to Logistic Regression
Question
I applied Ordinal Logistic Regression as my main statistical model, because my response variable is 7 Point-Likert Scale data.
After testing for Goodness of Fit using AIC, i got my best fit model, including 4 independent variables (3 explanatory and 1 factor variable).
However, one explanatory variable has a negative coefficient (odds ratio 0.44); all explanatory variables are also on a 7-point Likert scale.
My theoretical assumption is simple: the more frequently the activities captured by the explanatory variables occur, the higher the score on the response variable (mutual understanding).
That's why I am confused that one independent variable has a negative coefficient.
In this case, how should I interpret this IV?
Thank you very much,
Just like in any other regression. If you want examples, simply Google your question. Best wishes, David Booth
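To make that concrete (using only the 0.44 odds ratio quoted in the question): in an ordinal (proportional-odds) logistic model, a negative coefficient means higher values of the predictor lower the odds of being in a higher response category.

```python
import math

# The 0.44 odds ratio from the question corresponds to a negative coefficient:
odds_ratio = 0.44
coef = math.log(odds_ratio)        # about -0.821 on the log-odds scale
print(f"coefficient = {coef:.3f}")

# Interpretation: each one-point increase on the 7-point explanatory scale
# multiplies the odds of a HIGHER mutual-understanding score by 0.44,
# i.e. reduces those odds by about 56%.
print(f"percent change in odds = {(odds_ratio - 1) * 100:.0f}%")
```

A negative coefficient contradicting theory is worth investigating (e.g., suppression by correlated predictors), but its mechanical interpretation is as above.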
• asked a question related to Logistic Regression
Question
I have a simple model with only one independent variable, and it is binary/categorical (as is the dependent variable). The log-odds estimate is 4.6821 with a standard error of 0.4978. The point estimate for the odds ratio is 108, with a confidence interval of 40.708 , 286.527.
I ran the same model, simply changing the reference and the log-odds estimate is -4.6821, same standard error, but the point estimate for the odds ratio is 0.009 with a confidence interval of 0.003 , 0.025. This seems reasonable.
Is an odds ratio of 108 valid? Since both my dependent and independent variables are binary/categorical, it isn't an issue of outliers. I only have 2 missing values. Sample sizes are sufficient (275 negative, 102 positive, 73 Y, and 304 N). For Y and negative there are only 5. Could that account for such an inflated odds ratio?
OR = exp(logOR). Plugging in your two logOR values yields the ORs you reported. Using Stata, I got this:
logOR OR
1. 4.6821 107.9966
2. -4.6821 .0092595
Those two ORs show effects of exactly the same magnitude, just as the two logOR values do. HTH.
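The reciprocity described in the answer can be checked directly: switching the reference category negates the log-odds coefficient, so the two odds ratios are exact reciprocals.

```python
import math

# Log-odds estimate reported in the question.
log_or = 4.6821

or_a = math.exp(log_or)    # ~108.0, original reference category
or_b = math.exp(-log_or)   # ~0.00926, reference category switched

print(f"OR = {or_a:.4f}, reciprocal OR = {or_b:.7f}")

# Reciprocity check: the product of the two odds ratios must be 1.
print(f"product = {or_a * or_b:.10f}")
```

So an OR of 108 is not inherently invalid; it is the same effect as an OR of 0.009, just viewed from the other reference category. With only 5 cases in one cell, however, the estimate will be very imprecise, as the wide confidence interval shows.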
• asked a question related to Logistic Regression
Question
Hello,
I would like to examine the association between a disease and psychological load. The disease can be determined by different diagnostic methods. In our study, we performed four accepted procedures to determine the disease. Each participant (total sample n=45) underwent all four procedures at different times. Additionally, we examined psychometric data, and these are the main variables I am focusing on.
My idea is to examine the association between the disease and psychological load, as a function of the diagnostic method chosen. In other words: Is the association between the disease - diagnosed by method A - and psychological load significantly different/stronger than the association between the disease - diagnosed by method B - and psychological load.
As for the statistical methods, I initially thought of logistic regression with the disease as criterion and the psychometric variables as predictors. This would lead to 4 regression models: SB diagnosed with A; SB diagnosed with B etc. as criterion. My idea is to compare the AICs of the four different models: Do the psychometric variables predict the disease better and explain more variance when diagnosed with method X or method Y.
I hope my question and concept is comprehensible.
Is this an appropriate approach or does anyone have another idea?
Thank you very much for your replies!
Kind regards,
Nicole
Could you simply compare the (pointbiserial) correlations between disease (binary variables) and psychological load (continuous/metrical variable) using statistical tests for comparing dependent correlations, that is, correlation coefficients obtained from the same sample for different variables?
• asked a question related to Logistic Regression
Question
I am currently replicating a study in which the dependent variable describes whether a household belongs to a certain category. Therefore, for each household the variable either takes the value 0 or the value 1 for each category. In the study that I am replicating the maximisation of the log-likelihood function yields one vector of regression coefficients, where each independent variable has got one regression coefficient. So there is one vector of regression coefficients for ALL households, independent of which category the households belong to. Now I am wondering how this is achieved, since (as I understand) a multinomial logistic regression for n categories yields n-1 regression coefficients per variable as there is always one reference category.
• asked a question related to Logistic Regression
Question
Which of the two models is better to analyze factors that influence the appearance of a certain event when the data is not censored? Cox Regression or Logistic Regression? Let´s add that time to the event is not really relevant.
Cox Regression.
• asked a question related to Logistic Regression
Question
I want to do univariate and multivariate binary logistic regression in SPSS. I am wondering about the timing of the Box-Tidwell test, is this both applicable to univariate and multivariate binary logistic regression? I am using a forward LR model, do I perform Box-Tidwell tests on all predictors that I placed in block 1 or just on the predictors that SPSS included in the forward LR model?
Moreover, what should I do with a predictor if the linearity test is significant? Can I still include that predictor in the model in some way, or should I leave it out?
Hello Michelle,
When you say, multivariate (binary) logistic regression, do you simply mean using a model with more than one independent variable/predictor? If so, it should more correctly be multivariable LR, as the usual definition of "multivariate" is inclusion of two or more dependent variables in a model.
The Box-Tidwell test may be applied whether your model has one continuous IV or multiple continuous IVs. I would assess it only for IVs that you elect to include in your final model; whether IVs you don't intend to add to the model do or don't show linearity is irrelevant. Do the check for all included continuous IVs.
Finally, I'm not a fan of automated variable inclusion (e.g., step methods, such as forward inclusion). There's a host of technical reasons underlying this, but one of the chief ones is, you are not guaranteed to identify the best ensemble of IVs for a given situation.
• asked a question related to Logistic Regression
Question
I have a set of independent variables (from 1 to 8 depending) which are all continuous variables. My dependent variable of interest is an ordinal value that is a Likert-scale representation of an employee's intent to remain at their current job from 1 to 5.
I attempted to run a binary logistic regression but I appear to fail the proportionality conditions there, so I want to give mlogit a try.
I believe a downside to this is the loss of "rank" information; in any event, I am not entirely clear on how to do this in SPSS (or R). In particular, I am struggling to interpret my results.
In the attached, Factor1-8 are my independent.
My dependent variable is the aforementioned ordinal. I chose 5 to be my reference. My questions are as follows
1. Am I barking up the right tree here with this approach?
2. How do I interpret the results?
I think the following references may support you:
1. Harrell, F. E., Jr., Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis
2. O'Connell, A. A., Logistic Regression Models for Ordinal Response Variables, University of Connecticut
• asked a question related to Logistic Regression
Question
Dear researchers, I have read a lot about Likert scales and Likert-like questions. However, the answer is always "it depends" and needs to be evaluated from situation to situation.
My aim is to examine factors that correlate with attitudes among public health workers, measured on 5-point Likert scales. The dependent variables are the Q7 items. Should the independent variables be all the variables listed below?
For instance.
The dependent variables are Likert-like responses to these questions (only the first two are shown; in total there are nine):
1. I feel trained enough to ask the client about the use of psychoactive substances
2. I feel qualified enough to ask the client about the amount and frequency of use of psychoactive substances daily activities
The independent variables are: gender, age (number of years), experience (number of years), profession (4 groups), training (yes/no), knowledge about different aspects of drug use (on a 5-point Likert-like scale from no knowledge to excellent knowledge).
sincerely
Aleksandar
If your dependent (outcome) variable is an ordinal categorical type, then ordinal logistic regression is one regression technique you may consider using. However, interpretation of ordinal regression can be confusing because the distance (difference) between one level and the next is not necessarily consistent. The UCLA website below provides a tutorial on ordinal regression; even if you don't use R, just read the output and its interpretation. https://stats.oarc.ucla.edu/r/dae/ordinal-logistic-regression/
• asked a question related to Logistic Regression
Question
I am creating a risk score from some variables using the following steps:
1- Dividing data into training and validation cohorts.
2- Selecting the variables (p<0.05) in the fully adjusted model.
3- Transforming Bs into scores.
4- ROC curve.
5- Calibration using the validation cohort.
I have problems with the last 3 steps. I am using SAS. So I will be grateful if you can give me sources for the codes.
Development of scoring system for risk stratification in clinical medicine: a step-by-step tutorial
• asked a question related to Logistic Regression
Question
Hello,
I would like to conduct a logistic regression, however, I have an independent variable that is not normally distributed and some values seem to be a bit extreme.
Shall I be worried about the effect of potential outliers in an independent variable that is not normally distributed or it will not impact the results of the logistic regression?
If yes, any advice on how to deal with these outliers would be very much appreciated.
Many thanks.
Regards,
Outliers can wreck an analysis.
There are various residual diagnostics for logit models that you can use to identify the effects of outliers in the predictor variables; e.g., DFBETA will give you the change in a particular regression coefficient if that observation were dropped. This will tell you how sensitive your model is to particular observations. You can then investigate these observations further to understand what is going on.
see section on influential observations in this
The original development was by Pregibon, D. (1981) Logistic Regression Diagnostics, Annals of Statistics, Vol. 9, 705-724.
They are quite widely implemented in software.
• asked a question related to Logistic Regression
Question
Would it be better not to enter into the logistic regression model those variables with an extremely unbalanced distribution across the 2 groups, even if statistically significant (p<0.05) in the bivariate analysis, for example a frequency of 0% or 100% in one of the two groups?
I am in agreement with both David and Jochen.
To be clear, for LR this is not only a matter of having a "large" sample; the issue of "rare events" and how that manifests in MLE regards the absolute size of the smallest group, not its relative size. That is, you will have the same problem with a 100/10 split and a 10000/10 split. This is why various rules-of-thumb regarding N for LR are in respect to the size of the smallest group for the number of predictors. Obviously, I have no knowledge of the data, but the applicability of this depends on what is meant by "unbalanced."
Firth's clever work, although often simply used to get LR estimates when they would otherwise be intractable with MLE (complete or quasi-complete separation), is a general attempt to penalize MLE weights with respect to the "smallness" of the sample, with those penalized weights being asymptotic to the MLE weights as sample size increases. As it has a Bayesian basis, Firth's penalization of MLE can be seen as conceptually parallel to the penalization of OLS in ridge or LASSO regression estimation.
John
• asked a question related to Logistic Regression
Question
For instance, demographics could be potential confounders. Would it be better if I include them as the predictors in the logistic regression model along with other predictors or I should first factor them out and then use the residuals for further analysis?
Hello Guo-Hua,
The two approaches you propose are identical in impact (e.g., as tactics for "statistical control"). Once the data are collected, these are about your only options.
The usual array of methods by which to deal with nuisance / noise / extraneous / confounding variables in studies includes:
1. Randomization (doesn't eliminate variables, but tends to equalize their distribution across groups, over many trials);
2. Matching (on the target variable/s; this can be challenging with multiple variables);
3. Selecting cases with a fixed value (on the target variable/s; though this restricts generalizability);
4. Building the variables into the design (e.g., as a blocking factor);
5. Statistical control (e.g., as a covariate, in the ways you have proposed).
• asked a question related to Logistic Regression
Question
I used logistic regression model for analysis which has over 17,000 observations. Although, the model results in several statistically significant predictors, McFadden's Adj R squared/ Cragg & Uhler's R2 are very low! For my model McFadden's Adj R squared is 0.026 and Cragg & Uhler's R squared is 0.044. Can I proceed with these R squared? I would really appreciate your suggestion on accepted level of R squared, which has to be backed up by relevant literature. Thank you!!
Hello Tuhinur,
The implied question in your query is, can a study be overpowered (by huge N) so as to flag as significant models which don't account for a meaningful amount of the observed differences (e.g., trivial effects)? The answer, of course, is "Yes."
It's always a judgment call, however, as to whether an effect is trivial.
Now, let me try to address your query. First, I would not rely on pseudo-R-square values as the measure of model adequacy for a logistic regression model. The reason is three-fold: (a) the value does not truly represent variance accounted for (as in OLS regression), so trying to adapt guidelines you may have seen for OLS models may not make sense; (b) context matters: the variables involved, the target population, the perspective of the decision-maker, and the intended use(s) of the model; and (c) many times people choose to focus on the exponentiated regression coefficients for IVs (the "OR" or odds ratio estimates) and/or the classification accuracy of the model and/or AIC/BIC indicators. For me, I'd stick with OR, classification accuracy, or information criteria, and try to interpret those in the context of my research aims.
Here's a link you may find handy that compares a number of the common pseudo R-square values: https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
• asked a question related to Logistic Regression
Question
I have a dependent variable with 3 categories. When I performed a chi-square test, it showed an association between the DV and 9 independent variables (p-value less than 0.05). But when I ran ordinal logistic regression on the same data, the p-values were totally different and only 3 variables were significant.
How can I interpret these results?
The 2 tests are different. David Booth
• asked a question related to Logistic Regression
Question
Hello,
I am trying to adjust for the confounding effect of a third variable on the association between ethnicity (has multiple categories) and death (binary). I am using fixed effect conditional logistic regression to build multivariable model. I know that for a factor to be considered an important confounder it has to change the crude odds ratio by more than 10% (besides the other criteria of being associated with the exposure and outcome).
However, in case I have many categories for the exposure, how can I know if a third factor is an important confounder? Should it change the odds ratio of "ALL CATEGORIES" by 10% or more, or even a change in "one out of all categories" makes it an important confounder? or is there another more appropriate way to deal with the situation?
In principle, confounding variables are often known as risk factors in the literature, whereas exposure variables are what we want to test in our work.
• asked a question related to Logistic Regression
Question
Hi,
I am interested in how to interpret the odds ratio in logistic regression when the OR is < 1.
Let's say the odds ratio for the variable higher education = 0.343721.
Now I calculated the probabilities of staying and exit by applying the formula P = odds/(1 + odds): P(staying) = 0.343721/(1 + 0.343721) = 0.2558.
Then the probability of exit will be 1 - 0.2558 = 0.7442. Can I interpret it in the following way:
Farmers with higher education (bachelor and above) are 0.34 times as likely to stay in the agricultural sector as farmers with lower education, i.e. there is almost a 74% chance of not staying in the agricultural sector?
Best regards,
Davit
If your model is multivariable, you must adjust for the reference categories of the other independent variables included in the model.
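On the arithmetic in the question: the formula p = odds/(1 + odds) converts an odds value into a probability, but an odds ratio is not itself an odds. Applying the formula directly to OR = 0.343721 implicitly assumes the comparison group has odds of 1 (a 50% baseline probability). A sketch, where the baseline staying rate is a made-up assumption:

```python
# Converting odds to a probability, and why an odds RATIO alone is not enough.
def odds_to_prob(odds):
    return odds / (1 + odds)

# The question applies p = OR / (1 + OR) directly to OR = 0.343721.
# That only yields a valid probability if the baseline odds equal 1
# (i.e. a 50% baseline probability) -- an assumption, not a given.
or_higher_edu = 0.343721
print(f"naive p(stay) = {odds_to_prob(or_higher_edu):.4f}")

# Safer reading: multiply the baseline group's odds by the OR first.
baseline_prob = 0.60                     # hypothetical staying rate, lower education
baseline_odds = baseline_prob / (1 - baseline_prob)
higher_edu_odds = baseline_odds * or_higher_edu
print(f"p(stay | higher education) = {odds_to_prob(higher_edu_odds):.4f}")
```

The safe verbal interpretation of OR = 0.34 is about odds, not probabilities: higher-educated farmers have 66% lower odds of staying than lower-educated farmers.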
• asked a question related to Logistic Regression
Question
Dear community,
I ran a logistic regression with continuous IVs in SPSS. In the table "Variables in the Equation" one variable is missing (despite using the Enter method), without any message from SPSS. Browsing the web, I understood that this might happen due to collinearity; however, the collinearity diagnostics did not return a clear sign of it. The highest VIF value is 6.1 and the highest condition index is 21.1.
So my questions are:
1. Is my regression model still valid despite SPSS dropping one variable?
2. Are there reasons other than collinearity why the IV is missing from the model?
Thanks, Ilka
Try the stats package in R.
• asked a question related to Logistic Regression
Question
I am running a multinomial regression analysis between a categorical (3) dependent variable and a continuous independent variable. My independent variable is arranged into quartile. I want to know how to get relative risk ratio/odds ratio/ coefficients for each quartile while keeping quartile 1 as base/reference. I am using stata.
It would be illuminating to know why you categorize a continuous predictor variable? Why?
• asked a question related to Logistic Regression
Question
Based on my contingent valuation survey (double-bounded dichotomous choice), I have decided to run the models (e.g. Model 1: WTPa; Model 2: WTPb; Model 3: WTPc, and WTPmax) for only positive response. The question is how can I run these models separately?
The dependent variable is dichotomous WTP response (e.g. WTPa-first answer for offered price; WTPb- second answer for next question...). While the independent variables are sociodemographic info (e.g. age, income, occupation, etc.).
Should I first split the data by model? Then the problem is what the dependent variable would be, because there is only one answer (e.g., Yes).
Well Done !!! interesting...
• asked a question related to Logistic Regression
Question
Hello, I'm an undergraduate student completing my dissertation (using SPSS) so please bear with my very limited understanding of binary logistic regression.
My outcome variable = referral outcome for ADHD assessment (dichotomous: accepted or rejected)
Significant predictor variable = gender
However, Exp(B), which I understand to be the odds ratio, is 20.520 with confidence interval LL = 4.139, UL = 101.731.
I've been told by my supervisor that 20.520 is implausibly high, e.g. it wouldn't be right for me to report that males have 20.520 times higher odds of being accepted for ADHD assessment (which is what I've interpreted these results to mean).
I've done lots of research to try to figure out what went wrong
- there is no multicollinearity (VIF are all between 1 and 2)
- there are 106 cases so I don't think the sample size is too small?
- the other predictor variables are all on the same scale (only one other variable is sig)
- there are 3 outliers, but the data have been input correctly and I believe it's not ultimately helpful to just remove them without good reason. Also, when I tried removing them, the Exp(B) just got larger...
- gender is set as a nominal variable in SPSS
Have attached table for reference.
Any help would be hugely appreciated about how to interpret this number, whether it could in fact be plausible? Or if not, what I could do as a next step?
Thank you.
Thanks greatly for your responses - they are very helpful and much appreciated!
• asked a question related to Logistic Regression
Question
I carried out binary logistic regression and it provided a crude OR; how can I calculate the adjusted OR?
Suppose we are interested in understanding whether a mother’s age affects the probability of having a baby with a low birthweight.
To explore this, we can perform logistic regression using age as a predictor variable and low birthweight (yes or no) as a response variable.
Suppose we collect data for 300 mothers and fit a logistic regression model. Here are the results:
[regression output table not shown]
To obtain the odds ratio for age, we simply need to exponentiate the coefficient estimate from the table: e0.173 = 1.189.
This tells us that each additional year of a mother's age multiplies the odds of a baby having low birthweight by 1.189. In other words, the odds of having a baby with low birthweight increase by 18.9% for each additional year of age.
This odds ratio is known as a “crude” odds ratio or an “unadjusted” odds ratio because it has not been adjusted to account for other predictor variables in the model since it is the only predictor variable in the model.
But suppose we were interested in understanding whether a mother’s age and her smoking habits affect the probability of having a baby with a low birthweight.
To explore this, we can perform logistic regression using age and smoking (yes or no) as predictor variables and low birthweight as a response variable.
Suppose we collect data for 300 mothers and fit a logistic regression model. Here are the results:
[regression output table not shown]
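The exponentiation step described above can be sketched as follows (the age coefficient 0.173 comes from the answer; the adjusted-model coefficients are made-up placeholders, since the actual table is not shown):

```python
import math

# Exponentiating a logistic-regression coefficient gives the odds ratio.
crude_age_coef = 0.173                      # age coefficient quoted in the answer
crude_age_or = math.exp(crude_age_coef)
print(f"crude OR for age = {crude_age_or:.3f}")
print(f"percent change in odds per year = {(crude_age_or - 1) * 100:.1f}%")

# In a model that also includes smoking, exponentiating the (generally
# different) age coefficient gives the ADJUSTED odds ratio for age --
# its effect holding smoking constant. These coefficients are hypothetical:
adjusted_age_coef, smoking_coef = 0.146, 0.915
print(f"adjusted OR for age = {math.exp(adjusted_age_coef):.3f}")
print(f"OR for smoking      = {math.exp(smoking_coef):.3f}")
```

So the "adjusted" OR is not a separate calculation on the crude OR; it comes from refitting the model with the additional covariates and exponentiating the new coefficients.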
• asked a question related to Logistic Regression
Question
Hi, we are comparing the mortality of two therapies in COVID patients. We have identified 74 patients in our hospital records (details in the attached image). We also have vital-sign and lab data for these patients taken at different intervals.
Our idea is using logistic regression with mortality/discharge as endpoint, adjusted by patient status on admission.
Is this sample size enough for this kind of analysis? If not, how do you suggest we analyze this data?
Use one of the following.
A- odds ratio,
B- relative risk,
C- risk differences.
Put group on the rows and outcomes on the columns.
• asked a question related to Logistic Regression
Question
I am running a multinomial logistic regression. How do I change the reference category of independent variables that are categorical? I am familiar with changing the reference category of the dependent variable. But how does it work with independent variables? By default, the last category is taken as the reference group.
In SPSS, you can easily set the reference category of a categorical independent variable to the first or last category.
• asked a question related to Logistic Regression
Question
I am running multivariable confounding bootstrap (1000 iterations) logistic regressions to see how 5 ethnoracial groups (IV's) differ in terms of PrEP barriers (outcome).
For one of my ethnoracial identities, though, there are no cases. Yet the bootstrap estimate was significant, with a beta of -18 and a CI that does not cross zero. The other IVs were not significant for that barrier. I've never dealt with this scenario before; should I remove that IV and try to rerun the model?
Although I understand what bootstrap distributions are, I'm not sure how I could go from no cases, and thus non-significance (with p > .999 before bootstrapping), to a p-value of .03 in the bootstrapped analysis. Any help is appreciated.
I don't understand what you did. How did you resample cases? Can you show the code that you used? In general, if no one in your sample is in a category, and you are sampling from the observed data, then there would also be no one in that category in any of your bootstrap samples. You could sample in such a way that there is a probability of being in different categories; however, the norm in any analysis would be to remove that category. If you had been interested in that category, you likely would have done your sampling differently so that a larger proportion were in that group.
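The resampling point can be demonstrated directly: drawing bootstrap samples with replacement from the observed rows can never produce a category that has zero observed cases. A sketch with made-up group labels:

```python
import random

# Made-up group labels for illustration; category "D" has zero cases,
# mirroring the empty ethnoracial category in the question.
random.seed(0)
observed_groups = ["A"] * 40 + ["B"] * 35 + ["C"] * 25

n_boot = 1000
d_counts = []
for _ in range(n_boot):
    # Standard nonparametric bootstrap: resample rows with replacement.
    resample = random.choices(observed_groups, k=len(observed_groups))
    d_counts.append(resample.count("D"))

# Every bootstrap sample contains zero "D" cases.
print(f"max 'D' count across {n_boot} resamples: {max(d_counts)}")
```

So a "significant" bootstrap estimate for an empty category points to a software artifact (e.g., an unconstrained coefficient drifting to an extreme value like -18) rather than a real effect; removing that category is the usual remedy.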
• asked a question related to Logistic Regression
Question
Can anyone give me a hint on how to determine the cut-off p-value in bivariable logistic regression for selecting variables to carry forward into the multivariable regression?
That's a very poor method of selection because it just doesn't work. Here are two papers on the subject: the first shows why the method doesn't work, and the second shows a much better way to deal with this.
Best wishes, David Booth
• asked a question related to Logistic Regression
Question
Greetings,
We have been conducting a retrospective cohort study. The variables we are examining are assumed to be very age-dependent and the exposed population is very small (~40 patients), therefore we have considered matching for age and sex at a 1:2 or 1:3 ratio to increase statistical power and limit confounding.
Which statistical test would be most appropriate for calculating risk ratios for dichotomous categorical variables?
This article ( https://academic.oup.com/epirev/article/25/1/43/718675 ) suggests conditional Poisson regression, which I have attempted in Stata, but it appears to work only for 1:1 matched pairs.
It also suggests an adjustment of Cox regression so as to yield the same results as conditional Poisson regression (" if the time to death or censoring is set to some arbitrary constant value and if the Breslow or Efron methods are used to account for tied survival times, the results will be the same as those from conditional Poisson regression, as the likelihoods for these methods are identical when the data come only from matched pairs ").
I have recently attempted a similar adjustment (as described here: https://www.ibm.com/support/pages/conditional-logistic-regression-using-coxreg ) to yield the same results as conditional logistic regression (odds ratio) for a 1:N matched case-control study using Cox regression.
If such an adjustment is possible, how exactly could it be implemented in SPSS? If not, what other alternatives are available to us at this juncture?
Maybe not useful for the original question anymore, but still good to know: a very detailed exposition on how to perform a matched cohort analysis is present in Kleinbaum's Logistic Regression 3rd edition. It gives the specifics on how to do it in SPSS using the Cox regression module in the Appendix. I used it some years ago, and it works (obviously). Hope this helps.
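For readers working outside SPSS/Stata: conditional logistic regression for 1:N matched data is also available in Python via statsmodels' ConditionalLogit, which conditions on the matched set and so handles 1:2 or 1:3 matching directly (note it yields odds ratios, not the risk ratios discussed above). A sketch with simulated 1:2 matched data (all numbers hypothetical):

```python
import numpy as np
from statsmodels.discrete.conditional_models import ConditionalLogit

rng = np.random.default_rng(0)
# Simulated 1:2 matched data: 40 matched sets, each with one exposed
# patient and two unexposed controls
n_sets = 40
groups = np.repeat(np.arange(n_sets), 3)   # matched-set identifier
exposure = np.tile([1.0, 0.0, 0.0], n_sets)

# Outcome generated with a true log-odds ratio of 1.0 for exposure
logit = -1.0 + 1.0 * exposure
y = (rng.random(logit.size) < 1 / (1 + np.exp(-logit))).astype(int)

# Conditioning on the matched set removes set-level confounders (age, sex)
res = ConditionalLogit(y, exposure[:, None], groups=groups).fit()
print(np.exp(res.params))  # conditional odds ratio for exposure
```

Sets in which all members have the same outcome carry no information and are dropped automatically, which is the expected behaviour of the conditional likelihood.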
• asked a question related to Logistic Regression
Question
I have slope data classified based on whether an area is vulnerable to erosion. Below is an example: vulnerability is a categorical variable (yes/no) and slope is a continuous variable ranging from 0 to 90.
I want to know if the slope of vulnerable areas is significantly different from that of non-vulnerable areas. At first, I performed an unpaired two-samples t-test by classifying the slope data into two groups based on vulnerability. Then, while looking into statistics for another dataset, I realized this dataset might be framed differently: one continuous variable (i.e., slope) and one categorical variable (i.e., vulnerable or non-vulnerable). If that framing is correct, could ANOVA or logistic regression be used? Also, I'm wondering which analysis (two continuous variables vs. one categorical and one continuous variable) is more appropriate in my case. Thanks.
You have one continuous, independent variable and one dichotomous dependent variable, so an unpaired t-test is appropriate.
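Both framings test the same association, and with made-up slope data you can see that they agree. A sketch (all numbers are illustrative only):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(3)
# Illustrative slope data (degrees): vulnerable areas drawn steeper on average
slope_vuln = rng.normal(35, 8, 60).clip(0, 90)
slope_safe = rng.normal(25, 8, 80).clip(0, 90)

# Framing 1: slope as outcome, vulnerability as grouping -> unpaired t-test
t_stat, p_t = stats.ttest_ind(slope_vuln, slope_safe, equal_var=False)

# Framing 2: vulnerability as outcome, slope as predictor -> logistic regression
slope = np.concatenate([slope_vuln, slope_safe])
vulnerable = np.r_[np.ones(slope_vuln.size), np.zeros(slope_safe.size)]
fit = sm.Logit(vulnerable, sm.add_constant(slope)).fit(disp=0)

print(p_t, fit.pvalues[1])  # both detect the same association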
• asked a question related to Logistic Regression
Question
I have developed a logistic regression based prognostic model in Stata. Is there any way to develop an app using this logistic regression equation (from Stata)?
Most of the resources I found require me to develop the model from scratch in Python/R and then develop the app using streamlit/Shiny etc.
However, I am looking for a resource where I could use the coefficient and intercept values from the Stata-based model rather than building the model from scratch in Python.
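One lightweight option is to hard-code the Stata estimates in the app and apply the logistic equation directly, so no model fitting happens in Python at all. A sketch with made-up coefficient values (replace INTERCEPT and COEFS with the actual estimates from your Stata output):

```python
import math

# Hypothetical coefficients transcribed from Stata `logit` output;
# replace these with the actual estimates from your fitted model
INTERCEPT = -3.2
COEFS = {"age": 0.045, "male": 0.60, "biomarker": 1.10}

def predicted_probability(**features):
    """Apply the fitted logistic equation: p = 1 / (1 + exp(-(b0 + X*b)))."""
    xb = INTERCEPT + sum(COEFS[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-xb))

p = predicted_probability(age=55, male=1, biomarker=0.8)
print(round(p, 3))
```

A Streamlit (or Shiny) front end then only needs to collect the feature values and display predicted_probability(...); the model itself is never re-estimated.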
• asked a question related to Logistic Regression
Question
Gender is a negative predictor contributing to the model in an ordinal logistic regression predicting pornography.
How can we interpret this?
| Predictor | Est. | Std. E | Wald | df | Sig. | LL | UL |
|---|---|---|---|---|---|---|---|
| gender | -.555 | -.555 | 4.694 | 1 | .030 | -1.057 | -.053 |
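Exponentiating the estimate turns it into a (cumulative) odds ratio, which is usually the easiest way to read a negative ordinal coefficient: here exp(-.555) ≈ 0.574, i.e. the group coded higher on gender has roughly 43% lower odds of falling into a higher outcome category, holding other predictors constant. A one-line check:

```python
import math

# Cumulative-logit coefficient for gender from the output above
estimate = -0.555
odds_ratio = math.exp(estimate)
print(round(odds_ratio, 3))  # ≈ 0.574
```

The same transformation applied to the confidence limits (-1.057, -.053) gives the CI for the odds ratio.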
This is a free online course that covers ordinal logistic models and much else.
These are the modules:
1. Using quantitative data in research (watch video introduction)
2. Introduction to quantitative data analysis (watch video introduction)
3. Multiple regression
4. Multilevel structures and classifications (watch video introduction)
5. Introduction to multilevel modelling
6. Regression models for binary responses
7. Multilevel models for binary responses
8. Multilevel modelling in practice: Research questions, data preparation and analysis
9. Single-level and multilevel models for ordinal responses
10. Single-level and multilevel models for nominal responses
11. Three-level multilevel models
12. Cross-classified multilevel models
13. Multiple membership multilevel models
14. Missing Data
15. Multilevel Modelling of Repeated Measures Data
• asked a question related to Logistic Regression
Question
Hello All,
For logistic regression, the model may be specified as:
Pr(yi = 1 | Xi) = G(Xiβ), where G(z) = e^z / (1 + e^z)
What would the corresponding model be for Firth Logistic Regression?
Pr(yi = 1 | Xi) = G(Xiβ), where G(z) = ...
How (if at all) would the penalty feature in the model?
Any help would be most appreciated!
Thank you,
Navya
Firth's logistic regression has become a standard approach for the analysis of binary outcomes with small samples. Whereas it reduces the bias in maximum likelihood estimates of coefficients, bias towards one-half is introduced in the predicted probabilities. The stronger the imbalance of the outcome, the more severe is the bias in the predicted probabilities.

We propose two simple modifications of Firth's logistic regression resulting in unbiased predicted probabilities. The first corrects the predicted probabilities by a post hoc adjustment of the intercept. The other is based on an alternative formulation of Firth's penalization as an iterative data augmentation procedure. Our suggested modification consists in introducing an indicator variable that distinguishes between original and pseudo-observations in the augmented data.

In a comprehensive simulation study, these approaches are compared with other attempts to improve predictions based on Firth's penalization and to other published penalization strategies intended for routine use. For instance, we consider a recently suggested compromise between maximum likelihood and Firth's logistic regression. Simulation results are scrutinized with regard to prediction and effect estimation. We find that both our suggested methods do not only give unbiased predicted probabilities but also improve the accuracy conditional on explanatory variables compared with Firth's penalization. While one method results in effect estimates identical to those of Firth's penalization, the other introduces some bias, but this is compensated by a decrease in the mean squared error. Finally, all methods considered are illustrated and compared for a study on arterial closure devices in minimally invasive cardiac surgery.
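To Navya's original question: the model form G(·) is unchanged under Firth's method; what changes is the objective, which becomes the penalized log-likelihood l*(β) = l(β) + ½ log|I(β)|, with I(β) = X'WX the Fisher information and W = diag(p_i(1 − p_i)). A bare-bones numerical sketch of that objective (toy data, not a production implementation):

```python
import numpy as np
from scipy.optimize import minimize

def neg_penalized_loglik(beta, X, y):
    """Negative Firth-penalized log-likelihood:
    -(l(beta) + 0.5*log|I(beta)|), with I(beta) = X'WX, W = diag(p(1-p))."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    fisher = X.T @ (X * (p * (1 - p))[:, None])
    _, logdet = np.linalg.slogdet(fisher)
    return -(loglik + 0.5 * logdet)

# Toy data with complete separation: plain ML pushes the slope to infinity,
# whereas the Jeffreys-prior penalty keeps the maximizer finite
X = np.column_stack([np.ones(8), np.arange(1.0, 9.0)])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
res = minimize(neg_penalized_loglik, x0=np.zeros(2), args=(X, y), method="BFGS")
print(res.x)  # finite intercept and slope despite separation
```

As |β| grows, the likelihood term flattens out but log|I(β)| → −∞, which is exactly why the penalized estimate stays finite even under separation.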
• asked a question related to Logistic Regression
Question
Hello!
I'm writing my thesis and first performed a multinomial logistic regression.
I found out this was wrong since my dependent variable is ordinal. Now I'm trying to perform an ordinal logistic regression but end up with dots in my table. Can someone please explain to me why these appear?
If a parameter estimate is fixed at 0, there is nothing further to estimate for it. That's why wherever the estimate is zero, those entire rows show dots. :)
• asked a question related to Logistic Regression
Question
How do you perform a purposeful selection model?
Which variables should be excluded or included?
You need to understand the causal relationships that are implied by your hypothesis. I suggest that you visit http://www.dagitty.net which allows you to build a graphic representation of the relationships between your variables and, based on this, will allow you to build statistical models that test informative hypotheses.
• asked a question related to Logistic Regression
Question
Hello,
In order to test the assumptions of a logistic regression I tried to conduct the Box-Tidwell test. So far so good... I encountered the problem that I quite often have the value 0 in my independent variables, which leads to no values for the x*ln(x) term. This means a very considerable reduction in includable cases (19 instead of 198!). Any ideas how I could deal with it?
Many thanks and kind regards, Ilka
Thanks Chris and David! Sorry, I might not have been clear in my question! I am aware of the fact that I cannot take the log of 0 and that this is the reason why I lose cases. My question is: how can I test the assumption of linearity of the logit for logistic regression then? Is there another approach besides the Box-Tidwell test?
• asked a question related to Logistic Regression
Question
Let us suppose we have a new cheap and simple diagnostic test we want to evaluate against the expensive and complex gold standard for a highly lethal disease.
The gold standard test is dichotomous (positive or negative), but the new test returns two continuous results: let's call them "Result A" and "Result B".
Assuming the disease can be accurately diagnosed with the gold standard test, we want to
1) estimate the posterior probability of disease given the prior and the new test results A and B, i.e. P(D+|A,B)
2) define the best threshold values for both A and B
Given the high lethality, we're more interested in avoiding false negatives.
Let's suppose we have data like those in figure 1 (randomly generated data). Big red dots and small grey dots are patients whose gold standard test result was positive and negative, respectively.
Which is the best model to evaluate such a test?
Logistic regression and ROC curve?
Clustering in machine learning?
Other?
Thank you.
Hello Max Pierini. In trying to find the best balance between Se and Sp, you are using Youden's Index (or some variation on it). Let me remind you of what Jochen Wilhelm said in his reply (with emphasis added):
The performance of the new method can be evaluated by a ROC analysis. You may decide on a useful combination of sensitivity and specificity you can get (standard indices like the Youden index or similar are ignoring all practical consequences of false positives and false negatives and should not be used).
The only change I would make to what Jochen said is that Youden's index (or similar) should only be used when false positives and false negatives are deemed equally costly. But you want to limit false negatives (because of lethality), so trying to strike a balance between Se and Sp does not make sense. Rather, you need a cut-point that guarantees whatever level of Se you deem necessary (IMO). HTH.
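The suggested approach can be implemented directly: fit the logistic model to estimate P(D+|A,B), then, instead of Youden's index, take the largest probability cutoff whose sensitivity still meets the target. A sketch on simulated data (the 95% target and all distributions are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

rng = np.random.default_rng(42)
# Simulated data: results A and B both shift upward in diseased patients
n = 500
disease = rng.random(n) < 0.2
A = rng.normal(np.where(disease, 2.0, 0.0), 1.0)
B = rng.normal(np.where(disease, 1.5, 0.0), 1.0)
X = np.column_stack([A, B])

model = LogisticRegression().fit(X, disease)
prob = model.predict_proba(X)[:, 1]          # estimated P(D+ | A, B)

# Rather than Youden's index, take the largest probability cutoff whose
# sensitivity still meets the target (95% here), limiting false negatives
fpr, tpr, thresholds = roc_curve(disease, prob)
cutoff = thresholds[tpr >= 0.95][0]
sens = ((prob >= cutoff) & disease).sum() / disease.sum()
print(cutoff, sens)
```

In practice the cutoff should be chosen on a validation set (or via cross-validation) rather than on the same data used to fit the model, otherwise the achieved sensitivity will be optimistic.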
• asked a question related to Logistic Regression
Question
Hi,
I'm a fish biologist and I'm interested in assessing the uncertainty around the L50, which is the (sex-specific) length (L) at which you expect 1 fish out of 2 (50%) to exhibit developed gonads and thus, participate in the next reproductive event.
Using a GLM with a binomial distribution family and a logit link, you can get the prediction from your model with the predict() function in R on the logit (link) scale, asking it to also generate the estimated SE (se.fit=TRUE), and then back-transform the result (i.e., fit) to the response scale.
For the uncertainty (95% CI), one can estimate the commonly-used Wald CIs by multiplying the SE by ±1.96 on the logit scale and then back-transforming these values to the response scale (see the Figure below). From the same logistic regression model, one can also estimate the CI on the response scale with the Delta method, using the "emdbook" package and its deltavar() function or the "MASS" package and its dose.p() function, still presuming that the variance of the linear predictors on the link scale is approximately normal, which does not always hold true.
For the profile likelihood function, which seems to better reflect the sometimes non-normal distribution of the variance on the logit scale compared to the two previous methods (Brown et al. 2003), it unfortunately seems that no R package exists to estimate CIs of logistic regression model predictions according to this approach. You can, however, get the profile likelihood CI estimates for your beta parameters with the confint() function or using the "ProfileLikelihood" package, but for a logistic regression prediction it seems that one would need to write one's own R scripts, which we will likely end up doing.
Any suggestion would be welcome. Either regarding specifically the profile likelihood function (Venzon & Moolgavkar 1988) or any advice/idea on this topic.
Briefly, we are currently trying to find out which of these methods (and others: parametric and non-parametric bootstrapping, Bayesian credible intervals, Fieller analytical method) is/are the most optimal at assessing the uncertainty around the L50 for statistical/biological inferences, pushing a bit further the simulation study of Roa et al (1999).