Regression Modeling - Science topic

Explore the latest questions and answers in Regression Modeling, and find Regression Modeling experts.
Questions related to Regression Modeling
  • asked a question related to Regression Modeling
Question
1 answer
Currently I'm running a project to compare the effect of an exposure on mortality in different disease stages (i.e. stage 1, stage 2, stage 3) using a Cox proportional hazards model. I've read some answers to previous questions, but most of them focused on logistic regression models. I wonder whether there is a way to compare the hazard ratios across stages to see if there is a statistically significant difference between them. Thank you!
Relevant answer
Answer
Understanding the Challenge
While directly comparing Hazard Ratios (HRs) from different strata of a Cox Proportional Hazard model isn't straightforward, there are several statistical methods to assess if the differences between HRs are statistically significant.
Key Considerations:
  1. Statistical Significance of Individual HRs: Ensure that the individual HRs within each stratum are statistically significant. This is typically assessed using Wald tests or likelihood ratio tests.
  2. Heterogeneity of Treatment Effects: Formal tests: Use statistical tests like Cochran's Q test or the I² statistic to assess whether there is significant heterogeneity in the treatment effect across strata. Visual inspection: Plot the HRs and their confidence intervals to visually inspect for differences.
Statistical Methods for Comparing HRs:
  1. Subgroup Analysis: Divide the sample into subgroups based on relevant factors (e.g., age, gender, disease stage). Fit separate Cox models for each subgroup. Compare the HRs and their confidence intervals. Caution: subgroup analysis can increase the risk of false-positive findings, especially with small sample sizes.
  2. Interaction Terms: Include interaction terms between the treatment (exposure) variable and the stratification variable in the Cox model. A significant interaction term indicates that the effect of treatment differs across strata (a hedged R sketch follows this list).
  3. Meta-Analysis: If you have multiple studies with similar designs, a meta-analysis can be used to pool the HRs and assess the overall treatment effect and heterogeneity.
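A minimal R sketch of the interaction-term approach (option 2), assuming the survival package and a data frame dat with columns time, status, exposure, and stage (all names hypothetical):
library(survival)
# Model with stratum-specific exposure effects via an exposure-by-stage interaction
fit_int  <- coxph(Surv(time, status) ~ exposure * stage, data = dat)
# Model assuming a common exposure effect across stages
fit_main <- coxph(Surv(time, status) ~ exposure + stage, data = dat)
# Likelihood ratio test of the interaction: a significant result suggests the exposure HR differs by stage
anova(fit_main, fit_int)
# The exposure:stage coefficients in summary(fit_int) are ratios of HRs relative to the reference stage
summary(fit_int)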
Choosing the Right Approach:
The best approach depends on your specific research question and the design of your study. Consider the following factors:
  • Hypothesis Testing: If you have a specific hypothesis about the difference between HRs, use formal statistical tests.
  • Exploratory Analysis: If you're exploring potential differences, visual inspection and subgroup analysis can be helpful.
  • Clinical Significance: Even if a statistical difference is found, it's important to consider the clinical significance of the difference.
Statistical Software:
Most statistical software packages (e.g., R, SAS, Stata) can perform these analyses. Consult the specific documentation for your software to learn how to implement these methods.
Remember to consult with a statistician to ensure that the appropriate methods are used and the results are interpreted correctly.
By carefully considering these factors and employing appropriate statistical methods, you can effectively compare HRs from stratified Cox Proportional Hazard models and draw meaningful conclusions from your analysis.
  • asked a question related to Regression Modeling
Question
4 answers
I have been studying a particular set of issues in methodology, and looking to see how various texts have addressed this.  I have a number of sampling books, but only a few published since 2010, with the latest being Yves Tille, Sampling and Estimation from Finite Populations, 2020, Wiley. 
In my early days of survey sampling, William Cochran's Sampling Techniques, 3rd ed, 1977, Wiley, was popular. I would like to know which books are most popularly used today to teach survey sampling (sampling from finite populations).
I posted almost exactly the same message as above to the American Statistical Association's ASA Connect and received a few recommendations, notably Sampling: Design and Analysis,  Sharon Lohr, whose 3rd ed, 2022, is published by CRC Press.  Also, of note was Sampling Theory and Practice, Wu and Thompson, 2020, Springer.
Any other recommendations would also be appreciated. 
Thank you  -  Jim Knaub
Relevant answer
Answer
Here are some recommended ones:
  1. "Sampling Techniques" by William G. Cochran. This classic book covers a wide range of sampling methods with practical examples. It's comprehensive and delves into both theory and application, making it valuable for students and professionals.
  2. "Survey Sampling" by Leslie Kish. This is another foundational text, known for its detailed treatment of survey sampling design and estimation methods. Kish's book is especially useful for those interested in practical survey applications.
  3. "Model Assisted Survey Sampling" by Carl-Erik Särndal, Bengt Swensson, and Jan Wretman. This book introduces model-assisted methods for survey sampling, which blend traditional design-based methods with model-based techniques. It's ideal for more advanced readers interested in complex survey designs.
  4. "Sampling of Populations: Methods and Applications" by Paul S. Levy and Stanley Lemeshow. This text is widely used in academia and provides thorough explanations of different sampling methods with a focus on real-world applications. It also includes case studies and practical exercises, making it helpful for hands-on learners.
  5. "Introduction to Survey Sampling" by Graham Kalton. This introductory book offers a concise and accessible overview of survey sampling methods. It's well-suited for beginners who need a straightforward introduction to key concepts.
  6. "Designing Surveys: A Guide to Decisions and Procedures" by Johnny Blair, Ronald F. Czaja, and Edward A. Blair. This book focuses on the practical aspects of designing and conducting surveys, with particular emphasis on decision-making and procedural choices in the survey process.
  • asked a question related to Regression Modeling
Question
3 answers
What are simple effects and main effects in a regression model? While I'm familiar with main and interaction effects in the multinomial logistic regression model, I have no idea what simple effects are and how they are involved in the regression model. I'd greatly appreciate it if you could explain this and recommend useful resources. Thank you.
Relevant answer
Answer
Nothing special. There are many web pages on the topic that you will find when searching for these keywords. This should also be part of any introductory stats book dealing with factorial designs (multiway-ANOVA models). Unfortunately I don't have one around that I could look up, but maybe someone else can recommend a good book on that topic.
  • asked a question related to Regression Modeling
Question
6 answers
The threshold least squares regression model by Hansen (2000) divides the series into two regimes endogenously: the regime above the threshold and the regime below the threshold. Each regime is then regressed individually by OLS, and the method also involves bootstrap replication. In my case, the regime above the threshold is left with only 17 observations. Does this create a loss-of-degrees-of-freedom issue in the data?
Relevant answer
Answer
It is not possible to answer that question without more information. First, you should say how many observations you have. A regression with 17 observations is questionable but it depends on the number of explanatory variables. Since these observations correspond to above the threshold, I fear that they are outliers, hence we are in the worst situation. Did you consider a transformation on the dependent variable? That would perhaps improve the situation provided the relationship with the other variables allows for it. I would say that each of the two regimes should have enough observations.
  • asked a question related to Regression Modeling
Question
2 answers
I have a data set with a binary dependent variable, where 95% of observations take the value 1 and 5% take the value 0. Which regression model would be best in such a case?
Relevant answer
Answer
Binary Logistic Regression. I have a YouTube video on it.
  • asked a question related to Regression Modeling
Question
5 answers
Our dependent variable is stationary at level, while the independent variables are stationary at level and at first difference.
  • asked a question related to Regression Modeling
Question
2 answers
I'm currently working on a project where I need to understand the impact of outliers on different regression algorithms, specifically Random Forest, Gradient Boosting, and XGBoost. I have a few questions that I'd like to get some insights on:
  1. How do outliers typically affect the performance of Random Forest, Gradient Boost, and XGBoost regression models? Are these models generally robust to outliers, or do outliers significantly skew their predictions?
  2. If these models are affected by outliers, what are some common strategies to mitigate this issue? Should I consider preprocessing steps like outlier removal, or are there model-specific techniques that are more effective?
  3. Could you recommend any reliable sources (research papers, books, articles) that delve into this topic further? I’m particularly interested in the literature comparing these models' robustness to outliers in regression tasks.
Thank you in advance for your help! I appreciate any guidance you can provide.
Relevant answer
Answer
Outliers affect regression algorithms differently. Random Forests are robust to outliers due to their ensemble approach, which minimizes their impact. In contrast, Gradient Boosting is more sensitive as each model corrects errors from previous ones, which can lead to overfitting on outliers. XGBoost, a variant of Gradient Boosting, also faces challenges with outliers but has regularization techniques to mitigate their effects. Generally, Random Forests handle outliers better, while Gradient Boosting and XGBoost may need extra strategies to manage them effectively.
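As a rough, hedged illustration of the point above (not code from the question), one can inject a few outliers into synthetic data and compare held-out RMSE of a random forest and an XGBoost model in R; the package choices and all names are assumptions:
library(randomForest)
library(xgboost)
set.seed(1)
n <- 500
x <- matrix(rnorm(n * 3), ncol = 3)
y <- x[, 1] + 0.5 * x[, 2]^2 + rnorm(n, sd = 0.3)
idx <- sample(n, 10)
y[idx] <- y[idx] + 15                      # inject a few large outliers into the response
train <- sample(n, 400)
rf  <- randomForest(x = x[train, ], y = y[train])
xgb <- xgboost(data = x[train, ], label = y[train], nrounds = 200,
               objective = "reg:squarederror", verbose = 0)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(y[-train], predict(rf, x[-train, ]))    # held-out RMSE, random forest
rmse(y[-train], predict(xgb, x[-train, ]))   # held-out RMSE, XGBoost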
  • asked a question related to Regression Modeling
Question
6 answers
Hi -
I have two models.
(1) y = b0 + b1*x1 + b2*x2 + e
(2) y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + e
x1 and x2 are the independent variables, and x1*x2 is the interaction term. In Equation (1), neither b1 nor b2 is statistically significant. But in Equation (2), b1, b2, and b3 are all statistically significant. Equation (2) has a slightly better model fit. The adjusted R-squared of Equation (1) is 0.2; that of Equation (2) is 0.217. The interaction term (x1*x2) is the only difference between the two equations.
Why would this happen, and which of the two should I believe regarding the main effect?
Relevant answer
Answer
This is an old question, but I don't think any of the answers address the central point. In a regression model with an intercept, each coefficient is estimated (and interpreted) at the point where the other variables it is combined with take the value zero.
In the first model this does not matter:
y = b0 + b1*x1 + b2*x2 + e
the effects of x1 are constant for all values of x2, while the effects of x2 are constant for all values of x1.
This changes in the second model because of the interaction:
y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + e
The interaction by definition means that we can't consider the effects of x1 in isolation from x2, and vice versa (unless the interaction effect b3 is exactly zero, in which case it behaves like the first model). Now the "main effects" of x1 vary as x2 increases or decreases (and vice versa). In technical terms it isn't really a main effect but a "simple main effect". Because we evaluate it when x2 is zero, it can be very different from the effect in our first model.
In fact, in practice this estimate is usually not remotely helpful, because much real-world data (probably most) is in the positive range - so we are evaluating the effect of x1 at a value of x2 that is unusual (or perhaps impossible). For instance, if x1 is age, x2 is income, and y is happiness, we'd be evaluating how happy someone is at different ages when their income is zero, or how happy they are at different incomes when their age is zero. In both cases these are not typical values of our data and probably don't address a useful research question.
To get round this there are two main solutions:
a) Centre x1 and x2 before computing the interaction/product term x3. To centre a variable, simply subtract its mean so that the new centred variable has a mean of zero. This produces the same coefficients in the first model. In the second model, b1 and b2 will be similar to the values in model 1. This is because you are evaluating them at the mean of the other variable (0 is the mean of both x1 and x2 in the centred model). This is still a simple main effect, but it has a nice interpretation as the effect of x1 for an average data point (average on x2). Interestingly, b3 is the same in both models. (This can be proven mathematically, but we don't need to worry about that.)
b) For a full understanding, graph the predictions or marginal effects of the model. See how y varies as a function of x1 and x2 together. There are various ways to do this, but a plot with x1 on the x axis, y on the y axis, and different lines for x2 (low, medium, high, often defined as -1 SD, mean, +1 SD) is usually a good way to see what's going on. An interaction means that y depends on both x1 and x2, so plotting the pattern together is really the best way to see what's going on. You can also plot the simple main effects separately, which is useful if that's your focus. Adding confidence bands to the plots is also useful.
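A small R sketch of suggestions (a) and (b), assuming a data frame d with columns y, x1, x2 (all names hypothetical):
library(ggplot2)
d$x1c <- d$x1 - mean(d$x1)   # centre the predictors
d$x2c <- d$x2 - mean(d$x2)
m2 <- lm(y ~ x1c * x2c, data = d)
summary(m2)                  # b1 and b2 are now simple effects evaluated at the mean of the other variable
# Plot predicted y against x1 at low/medium/high x2 (-1 SD, mean, +1 SD)
x2_levels <- c(-1, 0, 1) * sd(d$x2c)
newdat <- expand.grid(x1c = seq(min(d$x1c), max(d$x1c), length.out = 50), x2c = x2_levels)
newdat$pred <- predict(m2, newdata = newdat)
ggplot(newdat, aes(x1c, pred, colour = factor(round(x2c, 2)))) +
  geom_line() +
  labs(colour = "x2 (centred)", y = "predicted y")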
  • asked a question related to Regression Modeling
Question
5 answers
Hello,
I have run into an issue that has sparkled a debate at work: is the LLOQ/ULOQ impacted by injection volume under the following circumstances:
1) Same dilution factor of all samples
2) Volumetric deviations corrected for by internal standard (separate compound from analyte, not labelled)
3) Matrix effects studied and found to be negligible at all injection volumes
4) Good linearity of the regression model; not impacted negatively by inclusion of standards injected at lower volume
The problem arose when I, instead of performing a dilution, chose to rely on the correcting function of the internal standard and injected 1/5th of the volume (0.2 µL rather than 1 µL) of my most highly concentrated samples and standards, and evaluated them as part of the same calibration curve as the samples injected at 1 µL. As I mentioned, matrix effects were investigated at all concentration levels as well as injection volumes and they were very low.
The software settings I use employ the "signal response" which is [area analyte/area IS]-ratio and the regression model from the calibration curve to calculate concentrations. Hence, under the circumstances I mentioned above the response is independent from injection volume - providing one does not drop below the background level or saturates the detector.
We ended up in a situation where one of my colleagues claims that the LLOQ for the samples injected at 1/5 volume should be adjusted and multiplied by the factor difference in injection volume (i.e. ×5), and the ULOQ divided by the same factor. Another colleague claims it should be done the other way around.
In contrast to both, I claim that the volumetric correction by the IS has already compensated for this and that the LLOQ/ULOQ should remain at the set concentrations of their respective standards.
Can anyone help me out here?
Best regards and thanks in advance,
// Karl
Relevant answer
Answer
All samples (standards) should be injected at the same VOLUME. Concentration should vary, not injection volume. This follows good chromatography fundamentals. The reasons are many, but one key issue is that an autoinjector's ability to accurately deliver very low volumes, esp. 1 µL or 0.1 µL, often varies, showing poor linearity (this of course requires more information regarding the exact make, model and configuration of the A/I used). Changes in injection volume may also contribute to changes in peak shape, retention time and peak area (which may change the results). To avoid problems, and for these reasons, we always maintain the same volume.
  • asked a question related to Regression Modeling
Question
6 answers
Hi guys,
In the context of my master thesis I analyze the statistical relationship between income and subjective well-being (panel: SOEP, n: 300,000 observations over 10 years).
After creating a model that is in harmony with the existing literature, I conducted a fixed-effects "within" regression (with robust standard errors) that includes all relevant control variables.
I got a highly significant (0.01 level) regression coefficient of 0.1 for my income variable.
Despite that, I received an R-squared value of 0.06 and a negative adjusted R-squared of -0.19.
I do not really know how to interpret the negative R-squared. Does it mean my model doesn't fit and has no explanatory power?
I was expecting a small R-squared due to the many factors influencing subjective well-being, but not a negative one.
Does anyone have advice on how to interpret this result? Can I still draw conclusions regarding the statistically significant coefficient and a causal link between income and SWB?
I'm thankful for any advice!
Relevant answer
Answer
Sometimes it can be that easy!
I think I solved the problem.
It seems that the package (plm) I'm using has problems calculating the adjusted R-squared when you use fixed effects (model = "within").
I ran my regression with alternative packages (e.g. fixest), and these properly calculated the R-squared (0.5) and the within R-squared, which is around 0.08.
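For anyone hitting the same issue, a rough sketch of the comparison described above (column names id, year, swb, income, age are placeholders, not the SOEP variable names):
library(plm)
library(fixest)
m_plm <- plm(swb ~ income + age, data = dat, index = c("id", "year"), model = "within")
summary(m_plm)                       # plm's reported (adjusted) R-squared for the within model
m_fe <- feols(swb ~ income + age | id, data = dat, cluster = ~id)
summary(m_fe)
r2(m_fe)                             # fixest reports overall and within R2, plus adjusted versions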
  • asked a question related to Regression Modeling
Question
4 answers
How is Equation 8 obtained? Please refer to the research article and kindly let me know. Thank you.
Relevant answer
Answer
To find the regression model that expresses the relationship between adsorption and process variables for predicting fluoride removal (%), you can follow these steps:
  1. Collect Data: Gather data on adsorption levels, process variables, and corresponding fluoride removal percentages. Make sure you have a dataset with sufficient observations to analyze the relationship.
  2. Identify Potential Variables: Identify the process variables that may influence adsorption and fluoride removal. These variables could include factors such as pH, temperature, contact time, adsorbent dosage, etc.
  3. Perform Regression Analysis (a short R sketch follows these steps):
     a. Choose a Regression Model: Decide on the type of regression model to use based on the nature of your data. For example, you could use simple linear regression if there is a single predictor variable or multiple linear regression if there are multiple predictor variables.
     b. Fit the Regression Model: Use statistical software (e.g., R, Python, SPSS) to fit the regression model to your data. The regression model will estimate the coefficients that represent the relationship between the predictor variables (process variables) and the response variable (fluoride removal %).
     c. Assess Model Fit: Evaluate the goodness of fit of the regression model by examining metrics such as R-squared, adjusted R-squared, the F-test, and p-values of coefficients. These measures help you understand how well the model explains the variation in fluoride removal percentages.
     d. Interpret Results: Interpret the coefficients of the regression model to understand the direction and strength of the relationships between the process variables and fluoride removal %. Positive coefficients indicate a positive relationship, while negative coefficients indicate a negative relationship.
  4. Validate the Model: Once you have developed the regression model, validate it using techniques such as cross-validation or split-sample validation to ensure its reliability and generalizability.
  5. Use the Model for Prediction: Once you have a validated regression model, you can use it to predict fluoride removal % based on the values of the process variables. Plug in the values of the process variables into the regression equation to estimate the expected fluoride removal %.
By following these steps, you can find a regression model that effectively captures the relationship between adsorption, process variables, and fluoride removal %, allowing you to predict fluoride removal based on the chosen process variables.
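A hedged sketch of step 3 in R (column names removal, pH, dose, contact_time, temp are placeholders, not taken from the article):
fit <- lm(removal ~ pH + dose + contact_time + temp, data = dat)
summary(fit)                                             # coefficients, R-squared, F-test, p-values
newdat <- data.frame(pH = 7, dose = 0.5, contact_time = 60, temp = 25)
predict(fit, newdata = newdat, interval = "prediction")  # predicted fluoride removal (%) with uncertainty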
  • asked a question related to Regression Modeling
Question
3 answers
What is the approach to determine uncertainty in energy and inflation time vs. inflation pressure of tyre control unit?
How do you measure the uncertainty of equipment adopted in the Regression Models developed to measure the energy and inflation time of a tyre pressure control Unit?
How do I ascertain the measurement uncertainty?
[Figure: Energy vs. inflation pressure for three radii of the tyre]
Relevant answer
Answer
To determine uncertainty in energy and inflation time versus inflation pressure control in a tyre control unit, the approach typically involves:
  1. Experimental Design: Conduct controlled experiments or tests where you systematically vary energy input and inflation time while measuring inflation pressure using the tyre control unit.
  2. Data Collection: Collect data on energy consumption (input), inflation time, and resulting inflation pressure for each experiment.
  3. Statistical Analysis: Use statistical methods such as regression analysis or analysis of variance (ANOVA) to quantify the relationship between energy input, inflation time, and the resulting inflation pressure, and to estimate the uncertainty in that relationship (e.g., via standard errors or prediction intervals).
  • asked a question related to Regression Modeling
Question
5 answers
Dear colleagues with particular interests in r software
I have completed an ordinal regression model using two different R functions (polr from the MASS package and clm from the ordinal package). The outcome was measured on a 5-level Likert scale (0-4). Predictors are numerical as well as categorical; N = 722 observations. Using both functions, the results for coefficient values and p-values are very similar, except for the intercepts. In one case (polr) the intercepts are significantly different from zero. In the other case (clm), 3 out of 4 intercept values are not different from zero. The difference obviously comes from the calculation of the standard errors for the intercepts (the intercept values themselves are the same in both models). Has anyone had a similar experience? How do I decide which function is the most appropriate?
Many thanks in advance
Dr Jeoffrey DEHEZ
Relevant answer
Answer
Jeoffrey Dehez, your thanks should not go to Joseph - he just used GPT and pasted the answer into RG. There is no intellectual work associated with that. And I am a bit sorry that you find that answer useful, since it is very generic and nowhere does it give any specific answer.
  • asked a question related to Regression Modeling
Question
3 answers
I am trying to perform a count regression model, but I first want to determine which predictors have a significant relationship with my response variable.
Relevant answer
Answer
Why do you want to determine the significant predictors before conducting the regression? Instead, you should just run the regression (with a set of dummy variables to capture the categories of your nominal IV). That will tell you which predictors are significant when you take the full set into account.
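A minimal R sketch of that suggestion, assuming a count outcome y, a nominal predictor group, and a numeric predictor x (all names hypothetical):
d$group <- factor(d$group)                      # R builds the dummy variables automatically
fit <- glm(y ~ group + x, family = poisson(link = "log"), data = d)
summary(fit)                                    # Wald z-tests for each dummy and for x
drop1(fit, test = "Chisq")                      # likelihood-ratio style test for the whole factor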
  • asked a question related to Regression Modeling
Question
3 answers
Hello !!!
Since simple multiple linear regression takes into account neither the within-subject design nor the fact that my dependent variable is ordinal, I need to control for individual heterogeneity. For this, it's possible to run a regression model with clustered standard errors to take the within-subject structure into account.
Can anyone explain how to run a regression model with clustered standard errors on ibm spss?
Explanation of my survey:
112 participants indicated on a scale of 1 to 7 their level of agreement with different statements after reading a scenario (a 7-point Likert scale, from strongly disagree to strongly agree).
The statements measured my dependent variables: work motivation, bonus satisfaction, collaboration and help. There were a total of 7 different scenarios. I also control for gender, age, and status ( employed, self-employed, student, unemployed, retired, other).
My aim is to see how the different scenarios affect the dependent variables.
My thesis supervisor advised me to take into account clustered standard errors in my regression model, but I have no idea how to do this on spss. I can't find the right test and command to do this.
Could someone help me?
Thanks in advance,
Best regards
Relevant answer
Answer
Dear Onipe Adabenege Yahaya,
Thank you very much for your explanations!
However, I am working on IBM SPSS.
Do you know how I could run a regression for a within-subject design with clustered standard errors in IBM SPSS?
Since my dependent variables are ordinal, I thought about doing an ordinal regression: Analyze > Generalized Linear Model > then I could specify the type of model as "ordinal logistic". But does this test take the within-subject design into account?
I also thought about GEE in SPSS, but I don't know if I can do it.
My study:
112 participants had to face 7 different scenarios (the factors) and indicate their level of agreement on a Likert scale from strongly disagree to strongly agree to measure their motivation, satisfaction, collaboration and help. I also controlled for age, gender, and professional situation.
Thank you in advance
  • asked a question related to Regression Modeling
Question
7 answers
I have a mixed-effects model with two random-effect variables. I want to rank the relative importance of the variables. The relaimpo package doesn't work for mixed-effects models. I am interested in the fixed-effect variables anyway, so would it be okay if I only take the fixed variables and use relimp? Or should I use Akaike weights for a set of models that alternately omit each variable?
Which one is more acceptable?
Relevant answer
Answer
# Install and load the packages needed for variance partitioning in mixed models
install.packages("glmm.hp")
library(glmm.hp)
library(MuMIn)
library(lme4)
# Fit a linear mixed model with two fixed effects and a random intercept for Species
mod1 <- lmer(Sepal.Length ~ Petal.Length + Petal.Width + (1 | Species), data = iris)
# Marginal and conditional R-squared (fixed effects only vs. fixed + random effects)
r.squaredGLMM(mod1)
# Hierarchical partitioning of the relative importance of the fixed effects
glmm.hp(mod1)
a <- glmm.hp(mod1)
plot(a)
  • asked a question related to Regression Modeling
Question
1 answer
I'm interested in investigating which individual factors contribute to the choice of three outcome variables. Additionally, I want to test whether there are pairwise interactions between the predictor variables under investigation. Is it valid to build two separate models, given that there are two research questions? I would appreciate it if you could recommend books or articles related to this issue. Thank you.
Relevant answer
Answer
Yes, it's valid to build separate models to address different research questions. In your case, one model could focus on identifying the individual factors that contribute to the choice of outcome variables, while the other model could examine pairwise interactions between predictor variables.
For the first research question, you might consider using techniques such as multiple regression analysis or logistic regression analysis, depending on the nature of your outcome variables (continuous or categorical, respectively). These techniques can help identify which individual predictor variables significantly contribute to the outcome variables.
For the second research question, exploring pairwise interactions between predictor variables, you could use techniques like interaction terms in regression models or more advanced methods like structural equation modeling (SEM) or generalized linear models (GLMs) with interaction effects.
Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2018). Multivariate Data Analysis (8th ed.). Cengage Learning.
Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). Sage Publications.
  • asked a question related to Regression Modeling
Question
2 answers
I am currently researching phishing detection using URLs, with a logistic regression model. I have a data set with a 1:10 ratio: 20k legitimate and 2k phishing.
Relevant answer
Answer
You can rely on classical prediction metrics such as:
  • Precision, also known as 'positive predictive value', measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It is calculated as the number of true positive results divided by the number of all samples predicted to be positive. You want high precision in scenarios where minimizing false positives is essential.
  • Recall, also known as 'sensitivity', quantifies the proportion of correctly predicted positive instances out of all actual positive instances. It is calculated as the number of true positive results divided by the number of all samples that should have been identified. You want high recall in scenarios where missing actual positives (false negatives) is costly.
A suitable combination of the above metrics for your purposes is the F2-score, which is a variant of the F1 score that puts a stronger emphasis on recall compared to the standard F1 score. Placing a stronger emphasis on recall rather than precision makes it suitable for tasks where capturing all positive instances is crucial.
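A small illustrative helper in R for computing these from predicted and actual labels (1 = phishing, 0 = legitimate); all object names are assumptions:
f_beta <- function(actual, predicted, beta = 2) {
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  fb <- (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
  c(precision = precision, recall = recall, f_beta = fb)
}
# Example: f_beta(actual = y_test, predicted = as.numeric(prob > 0.5), beta = 2)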
Hope you found my answer useful.
Bests
  • asked a question related to Regression Modeling
Question
1 answer
I have panel data with two waves, where the same individuals answer in wave 1 and wave 2. In my regression model with individual fixed effects, I want to add a time trend variable that increases by one for each day since the start of the survey. Does the time-demeaning step of the fixed-effects estimator ruin this time trend variable? I am using PanelOLS from linearmodels.panel (Python package) to implement the fixed-effects model.
Relevant answer
Answer
Yes.
In a panel data model, the fixed effects estimator is used to control for time-invariant unobserved individual heterogeneity. This is done by allowing the intercept to vary across individuals (or entities).
  • asked a question related to Regression Modeling
Question
4 answers
Can someone suggest a R package for Blinder Oaxaca decomposition for logistic regression models?
Relevant answer
Answer
I have recently added easy-to-use R functions to GitHub for multivariate decomposition (non-linear models, complex survey designs, etc.).
  • asked a question related to Regression Modeling
Question
5 answers
no
Relevant answer
Answer
Of course. Why shouldn't that be possible?
  • asked a question related to Regression Modeling
Question
1 answer
I would like to create a forest plot for a study we are conducting, in order to present the results of a logistic regression model with an OR and CI for each variable included. However, I'm struggling to do it with Meta-Essentials resources. Is it possible, or does it work exclusively for meta-analysis? Thank you.
Relevant answer
Answer
Hello João Simões. I had never heard of Meta-Essentials, but the website says this about it:
Meta-Essentials is a free tool for meta-analysis. It facilitates the integration and synthesis of effect sizes from different studies. The tool consists of a set of workbooks designed for Microsoft Excel that, based on your input, automatically produces all the required statistics, tables, figures, and more. The workbooks can be downloaded from here. We also provide a user manual to guide you in using the tool (PDF / online) and a text on how to interpret the results of meta-analyses (PDF / online).
But if I understand your question, you are not doing meta-analysis. Rather, you appear to be estimating a logistic regression model for one sample, and you want to generate a forest plot displaying the OR and CI for each explanatory variable in the model. I.e., I think you want to do something like the first example shown here:
Have I understood you correctly? If so, what other statistical software do you use (besides Meta-Essentials)? Thanks for clarifying.
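If that is indeed the goal, one hedged alternative outside Meta-Essentials is to build the plot in R with broom and ggplot2; here `fit` is assumed to be a fitted glm(..., family = binomial):
library(broom)
library(ggplot2)
or_tab <- tidy(fit, exponentiate = TRUE, conf.int = TRUE)
or_tab <- or_tab[or_tab$term != "(Intercept)", ]      # drop the intercept row
ggplot(or_tab, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 1, linetype = "dashed") +    # OR = 1 reference line
  scale_x_log10() +
  labs(x = "Odds ratio (95% CI, log scale)", y = NULL)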
  • asked a question related to Regression Modeling
Question
1 answer
What is the difference between the Hierarchical Bayesian spatiotemporal model and the Poisson regression model to analyze count data?
Relevant answer
Answer
What do you mean by "Hierarchical Bayesian spatiotemporal model"? If I were to fit a Bayesian model to the data with a Poisson likelihood, a "random effect", and an autoregressive component, I could call it a "Hierarchical Bayesian spatiotemporal model".
If I use a generalized linear mixed model (GLMM) with a Poisson error distribution and an autoregressive component, I would fit the same model. However, I would only estimate the maximum likelihood, not the posterior distribution. Some more information (a reference for the Hierarchical Bayesian spatiotemporal model) would be needed to answer your question properly in detail.
Luckily, both the Bayesian and the frequentist frameworks can fit GLMMs. The difference is the inclusion of prior information and the interpretation of the results. Unfortunately, the technical terminology makes it difficult for many to distinguish between them.
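A rough frequentist analogue of what is described above, as a sketch only (the data frame d with columns count, year, region is hypothetical, and no explicit spatial or temporal correlation structure is included):
library(lme4)
m <- glmer(count ~ year + (1 | region), family = poisson(link = "log"), data = d)
summary(m)
# A fully Bayesian hierarchical spatiotemporal model would add priors and structured
# spatial/temporal random effects (e.g. CAR or random-walk terms), typically fitted
# with INLA, brms, or Stan.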
Best,
  • asked a question related to Regression Modeling
Question
3 answers
Hello,
I estimated a mixed-effect logistic regression model (GLMM) and I need to evaluate it. Specifically, I tried a few combinations of the independent variables in the model and I need to compare between them.
I know that for a regular logistic regression model (GLM), the Nagelkerke R-squared fits (a pseudo R-squared measure). But does it fit also for a mixed-effect model? If not, what is the correct way for evaluating a mixed-effect logistic regression model?
Thanks!
Relevant answer
Answer
If your model outputs the likelihood, there are different pseudo r-square measures you can calculate, including Nagelkerke. See: https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
But there is a question as to what model you would consider the null model when you have random effects.
However, for logistic regression, you might consider Efron's pseudo R-squared described at that link. Or maybe a count pseudo R-squared, which evaluates the proportion of observations the model predicts correctly.
Also, as Girma Beressa mentions, measures like AIC, BIC, or AICc may be more appropriate to decide among models.
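One hedged way to compute a likelihood-based pseudo R-squared and information criteria for a mixed-effects logistic model in R (all names hypothetical; the choice of a "null" model that keeps the random effects is itself a judgment call):
library(lme4)
m_full <- glmer(y ~ x1 + x2 + (1 | subject), family = binomial, data = d)
m_null <- glmer(y ~ 1 + (1 | subject), family = binomial, data = d)
1 - as.numeric(logLik(m_full)) / as.numeric(logLik(m_null))   # McFadden-style pseudo R2
AIC(m_full, m_null)                                           # compare candidate specifications
BIC(m_full, m_null)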
  • asked a question related to Regression Modeling
Question
3 answers
Hello,
I need to estimate a generalized linear mixed model (GLMM).
When using R, I need to choose a specific fixed effect (one of the independent variables in the model) to have a random slope according to a random variable such as subject ID.
But when using SPSS, I can't choose a specific fixed effect to have a random slope.
I asked chatGPT about it and it responded as follows:
The difference you're observing in the specification of random slopes for specific fixed effects between SPSS and R might be related to how the two software packages handle mixed-effects modeling. This difference is not necessarily due to the capabilities of the software but rather the way they present the options to the user. Here's why this might be the case:
SPSS:
In SPSS, the interface for specifying mixed-effects models is designed to be user-friendly and may abstract away some of the technical details. It allows users to specify random slopes and intercepts for subjects using the Repeated option or Mixed Linear Models (MIXED) procedure.
SPSS may automatically set up random slopes for all fixed effects by default, which is a more automated approach, but it doesn't give you explicit control over which fixed effects have random slopes.
R (lme4 package):
R, specifically with the lme4 package, provides more flexibility and control in specifying mixed-effects models. This can be both an advantage and a challenge for users.
In R, you have to explicitly specify which fixed effect(s) should have random slopes by including them in the model formula. This explicit control allows for more customized modeling but can be more complex and requires users to have a good understanding of their data and the modeling process.
1. Can someone please confirm this answer?
2. Is there a way to estimate a mixed-effect logistic regression model in R that would be the same as the estimated model in SPSS?
I would appreciate any clues about this issue! Thanks!
Relevant answer
Answer
Beyond that, practically speaking, you might make a small data set available, be specific about what model you are using in SPSS, present the SPSS output, and then I'm sure someone can offer the equivalent model in R.
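For reference, a minimal lme4 sketch of the random-slope specification the question describes (all variable names are hypothetical):
library(lme4)
# Random intercept and a random slope for x1 across subjects; x2 gets a fixed effect only
m <- glmer(y ~ x1 + x2 + (1 + x1 | subject), family = binomial, data = d)
summary(m)
# (1 | subject) alone would give a random intercept only; listing x1 before the bar is
# the explicit choice of which fixed effect receives a random slope.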
  • asked a question related to Regression Modeling
Question
3 answers
I am running FGLS using Stata and I want to know how I can get the value of Pseudo R2. Your help would be much appreciated.
Relevant answer
Answer
In summary, the method to obtain a pseudo R2 value in Stata depends on the type of regression model used. Here are some ways to obtain a pseudo R2 in Stata:
  • For a feasible generalized least squares (xtgls) regression model, run the command "xtgls Y X1 X2 Xn, panels( ...)" in Stata.
  • For a generalized linear model (GLM) regression, Nagelkerke's R2 can be obtained by using the "roctab" command to create a table of predicted probabilities and then calculating the pseudo R2 value.
  • For a random-effects probit model using xtprobit, the pseudo R2 value is not directly available in the output. However, it can be calculated using the formula 1 - (log likelihood of the model / log likelihood of the null model).
  • If a Stata command does not supply an R-squared value, a pseudo R2 can be calculated using different approximations or analogues to R-squared. These approximations are often labelled "pseudo" and can be found in the literature of the field.
Best of luck.
  • asked a question related to Regression Modeling
Question
5 answers
The objective here is to determine factor sensitivities or slope coefficients in a multiple OLS regression model.
Relevant answer
Answer
I would certainly do that, just to check whether your independent variables have a linear influence on the DV or not!
  • asked a question related to Regression Modeling
Question
3 answers
My dependent variable consists of cross-sectional data with 8 observations, while the independent variables consist of time series data with 45 observations.
Relevant answer
Answer
It depends on the nature of your study and the data you have. Can you please provide more details about your study?
  • asked a question related to Regression Modeling
Question
4 answers
I am going to plot an overfitting-corrected calibration curve (using the cross-validation method) for a regression model with the Poisson family (log link function) in R.
I know that this can be done for logistic, OLS, and Cox models using the rms package.
Does anyone know the R code for producing an overfitting-corrected calibration curve for a Poisson model?
Thanks in advance
Relevant answer
Answer
Shadi Naderyan Fe'li I would like to see your RAW data points and some information: what did you measure exactly and what kind of relationship do you expect?
  • asked a question related to Regression Modeling
Question
4 answers
Hello, 
I'm working on a panel multiple regression, using R. 
And I want to deal with outliers. Is there a predefined function to do that?
If yes, would you please give me an example of how to use it?
Relevant answer
Answer
Chuck A Arize Removing is a bad idea, unless you are absolutely sure that those data points are bad measurements.
Nonlinear transformations are not a good idea, see why:
  • asked a question related to Regression Modeling
Question
9 answers
Of late, some journal editors insist that authors provide a justification for the order in which predictor variables are entered in hierarchical regression models.
a) Is there a particular way of ordering such variables in regression models?
b) Which are the statistical principles to guide the ordering of predictor variables in regression models?
c) Could someone suggest the literature or other bases for decisions regarding the ordering of predictor variables?
Relevant answer
Answer
Jochen Wilhelm some instances use the word "hierarchical" when not all variables are entered together, but separately or in blocks of variables. For the former, automatic algorithms may be used (but are not recommended, as you know), like stepwise regression, whereas for the latter the researcher typically decides which (blocks of) variables are entered in which order (blockwise regression).
As Bruce Weaver already mentioned, this is done very frequently in psychology, although it may be questioned whether this is necessary, or useful, or just a habit because everyone does it. In most psychological papers in such cases they are only interested in the delta R^2, for example to show that the increase in explained variance due to an interaction term is significant (but they miss that this could also be done via the squared semipartial correlation, and no blockwise regression is needed).
I remember a paper (I believe about job satisfaction or something similar) where they entered sociodemographic variables in a first block, typical predictors in a second block, and new predictor variables in a third block, to show that the typical predictors explain more than sociodemographic variables alone. The third block was meant to show that the new predictors explain variance over and above the former ones. (In light of causality, I would cast some doubt on the results....)
But you are right: within each block, and for the full model, the order of the variables does not have any meaning.
Does it help?
  • asked a question related to Regression Modeling
Question
12 answers
I have 3 independent variables: height (100, 300, 500), angle (0, 30, ..., 180), and pressure point (1, 2, ..., 112), and each pressure point has 30,000 values of the wind pressure coefficient. I want to predict the unknown data using a machine learning algorithm, so which regression model is best suited for this?
And if I consider only 1 dependent variable, i.e. the mean of the 30,000 values, then which algorithm is best for that case?
Relevant answer
Answer
Shushank Sengar, was the data file you shared the full dataset? Or was it a subset of a larger data file that has 30,000 rows? Basically, I think everyone is still uncertain about where the 30,000 you talk about comes in. As you saw in my earlier post, I speculate that the "mean value" in the final column is the mean of 30,000 raw wind pressure measurements. But Jonatan disagrees. Only you can clarify this. Thanks.
  • asked a question related to Regression Modeling
Question
4 answers
My dependent variable consists of cross-sectional data with 8 observations, while the independent variables consist of time series data with 45 observations.
Relevant answer
Answer
A time series regression model will have the form
y_t = b1 + b2*x2t + b3*x3t + ... + bk*xkt + u_t
where y is the time series of the dependent variable and
x2, x3, ..., xk are explanatory (independent) time series variables. (Certain conditions must be satisfied if the coefficients are to be estimated. There are additional problems to be addressed if any of the series are non-stationary.)
While I have seen studies with 30 observations of each variable that have provided useful results, considerably more observations are usually required.
Look at your data. As far as I can tell, you do not have the necessary time series data to do a time series regression. Perhaps I still do not understand.
  • asked a question related to Regression Modeling
Question
5 answers
Hello;
I have built different regression models: linear regression, MLTP, and a stacking regressor.
I need to compare the metrics and performance of these models. How can I do that scientifically and professionally?
Relevant answer
Answer
Regression problems dealing with class or categorical variables may need model metrics like sensitivity and specificity. A confusion matrix gives you the model precision metrics - a good example is performing logistic regression predictions of a variable like groundwater salinity or a phenomenon like landslide susceptibility, given an array of predictors in a study.
For conventional/Bayesian regression problems that seek to predict a continuous variable (i.e. linear regression problems), model metrics like precision and the correlation coefficient may be useful in determining whether your model is reliable.
One may make similar predictions using a simpler algorithm like k-nearest neighbours, and compare the results with the summary outputs generated by the conventional (or even Bayesian) regression.
Overall, the kind of data one has and the size of the data will greatly influence the regression outcomes during multiple or simple linear regression predictions.
The other key factor in classification metrics output is the size of the datasets and class imbalance: the training and testing subsets greatly determine the model metrics and accuracy. When one class of data dominates the other(s), the model may need fine-tuning before being analyzed.
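For the continuous-outcome case in the question, a simple hedged sketch in R that scores each fitted model on the same held-out test set (object names are placeholders):
reg_metrics <- function(obs, pred) {
  c(RMSE = sqrt(mean((obs - pred)^2)),
    MAE  = mean(abs(obs - pred)),
    R2   = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2))
}
# rbind(linear = reg_metrics(y_test, predict(fit_lm, newdata = x_test)),
#       stack  = reg_metrics(y_test, predict(fit_stack, newdata = x_test)))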
  • asked a question related to Regression Modeling
Question
3 answers
I am conducting a study to assess the incidence and predictors of X disease progression from 10 X diseased patients retrospectively.
#Small sample size
#disease progression
#poisson model
#Cox proportional hazard model
Relevant answer
Answer
With small sample sizes you should
  1. craft your model with extra care. In particular getting the response distribution right is important. Never use tests for that (e.g. K-S test), but follow the principles in this chapter: https://schmettow.github.io/New_Stats/glm.html
  2. use exact estimation methods, such as MCMC sampling, rather than asymptotic methods, such as max likelihood estimation.
  • asked a question related to Regression Modeling
Question
14 answers
I am running 6 separate binomial logistic regression models with all dependent variables having two categories, either 'no' which is coded as 0 and 'yes' coded as 1.
4/6 models are running fine, however, 2 of them have this error message.
I am not sure what is going wrong, as each dependent variable has the same 2 values on the cases being processed, either 0 or 1.
Any suggestions what to do?
Relevant answer
Answer
Sure Bruce Weaver I was running logistic regression as a part of the propensity score matching technique. While watching the tutorial video on Youtube (https://youtu.be/2ubNZ9V8WKw) I realized I am including variables in the equation that I shouldn't have. So, I excluded them and magically, the error did not appear anymore. It was that simple.(: Good luck to everyone who's facing this error!
  • asked a question related to Regression Modeling
Question
3 answers
Dear everyone,
I have imputed my data using Stata's "mi impute" command (m = 200) - only one variable had missing values. Stata's "mi estimate" command was used when running the logistic regression model.
How would you interpret an OR close to 1 (or equal to 1 after rounding) when the p-value is < 0.05 and the 95% CI does not cross 1? An example from the regression model, for a continuous variable, is: OR = 0.9965 with 95% CI 0.9938 to 0.9992, which Stata rounds to OR = 1.00 and 95% CI 1.00; 1.00. Am I right in thinking that this is possible because the p-value is not perfectly tied to the magnitude of an effect? Thus, statistically the predictor has an effect on the outcome, but the effect is very small and may be of little clinical or otherwise "practical" significance?
And do you have any suggestions on how to report such findings in a research paper?
Best regards
Regine
Relevant answer
Answer
"Thus, statistically the predictor has an effect on the outcome, but that this effect is very small and may be of little clinically or otherwise "practically" significant?"
Exactly.
The statistical significance is an indication that your data provides enough information about the effect to be able to distinguish the estimate (here: 0.9965) from the hypothesized value (here: 1.0). Since the estimate is lower than the hypothesized value, you can conclude that the (still unknown!) effect is also lower.
The confidence interval is the range of all hypothesized values for which the data would not be statistically significant. Hence, this is a range of effect values that are "not too incompatible" with your data. Values outside the interval are "too incompatible" with your data. You see that the (0.95-)confidence interval in your case does not include 1.0, which corresponds to your p value being < 0.05. So, whatever hypothetical value is not too incompatible with your data is lower than 1.0, but very close to 1.0. This may not include any value of clinical or practical relevance.
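For reporting, one hedged option is to rescale the per-unit OR to a more interpretable increment of the continuous predictor, e.g. per 10 units (a quick R illustration using the numbers above):
or_per_unit <- 0.9965
ci_per_unit <- c(0.9938, 0.9992)
or_per_unit^10   # ≈ 0.966, i.e. roughly 3.4% lower odds per 10-unit increase
ci_per_unit^10   # rescaled 95% CI, ≈ 0.94 to 0.99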
  • asked a question related to Regression Modeling
Question
3 answers
Hello everyone! I am currently doing moderation/mediation analyses with Hayes Process.
As you can see, model 3 is significant with R² = .48.
The independent variables have no significant direct effect on the dependent variable, but there are significant interaction effects. The curious thing is: toptMEAN does not correlate with any of the variables, but still fits into the regression model. Should I take this as confirmation that toptMEAN has an effect on the dependent variable even though it does not correlate? Or am I missing something in the interpretation of these results?
(Maybe you could also suggest a different model for me to run. Model 3 is currently the one with the highest R² I have found.)
Relevant answer
Answer
It is very well possible to have significant interaction effects but no significant "main" effects. It is also possible to have significant "main" effects and no significant interactions. And it is possible that variables have significant zero-order correlations with the dependent variable but no significant regression coefficients (e.g., due to redundancies and/or overfitting of the regression model). Finally, it is possible to find non-significant zero-order correlations between IVs and DV but significant regression coefficients (e.g., due to suppression effects).
Many of these effects cannot easily be identified from the zero-order correlations. That's one reason why we run multiple regression analyses--to identify redundancies, interactions, suppression effects, etc. that are not easy to see in a bivariate correlation analysis.
Note also that the "main" (lower-order) effects (and their significance) in the moderated regression model depend on whether you centered the predictors or not. This can make a huge difference for the interpretation. Especially when predictor variables do not have a meaningful zero point, centering prior to calculating the interaction terms is recommended (e.g., Aiken & West, 1991; Cohen et al., 2003). Otherwise, the lower-order terms (and their significance) may not be interpretable at all.
In any case, it would be a good idea for you to plot the effects to get a better understanding of what is going on. That is, look at the regression lines for different values of your moderators to understand the meaning of the interaction effects.
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park: Sage.
Cohen, J., Cohen, P., West, S. G. & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. Mahwah, NJ: Erlbaum. (Chapters 7 and 9)
  • asked a question related to Regression Modeling
Question
3 answers
Hi fellow geeks, I am trying to understand the various statistical methods/tools used for studying risk factors associated with dementia, especially the 40% of modifiable risk factors (Dementia prevention, intervention, and care: 2020 report of the Lancet Commission).
A recent article (PMID: 36394871) demonstrated the use of Least absolute shrinkage & selection operator and multivariate Cox proportional hazards regression models.
And a previous work (DOI: 10.1002/alz.12802) Bayesian regression modeling & Regression analysis over time prior to diagnosis.
Looking at these, I see it as using different tools on a Swiss Army knife to get the job done... So if I were to design a specific tool and test it out...
How do I go about it?
Relevant answer
Answer
Rana Hamza Shakil
Given the subjective nature of dementia, each of the tools/techniques you have mentioned would need to be applied individually, right?
In that case, that would require individual data collection and curation.
Which translates into hardware and personal time requirements/allocation.
How does one solve this challenge then?
  • asked a question related to Regression Modeling
Question
6 answers
In 2007 I did an Internet search for others using cutoff sampling, and found a number of examples, noted at the first link below. However, it was not clear that many used regressor data to estimate model-based variance. Even if a cutoff sample has nearly complete 'coverage' for a given attribute, it is best to estimate the remainder and have some measure of accuracy. Coverage could change. (Some definitions are found at the second link.)
Please provide any examples of work in this area that may be of interest to researchers. 
Relevant answer
Answer
I would like to restart this question.
I have noted a few papers on cutoff or quasi-cutoff sampling other than the many I have written, but in general, I do not think those others have had much application. Further, it may be common to ignore the part of the finite population which is not covered, and to only consider the coverage, but I do not see that as satisfactory, so I would like to concentrate on those doing inference. I found one such paper by Guadarrama, Molina, and Tillé which I will mention later below.
Following is a tutorial I wrote on quasi-cutoff (multiple-item survey) sampling with ratio modeling for inference, which can be highly useful for repeated official establishment surveys:
"Application of Efficient Sampling with Prediction for Skewed Data," JSM 2022: 
This is what I did for the US Energy Information Administration (EIA) where I led application of this methodology to various establishment surveys which still produce perhaps tens of thousands of aggregate inferences or more each year from monthly and/or weekly quasi-cutoff sample surveys. This also helped in data editing where data collected in the wrong units or provided to the EIA from the wrong files often showed early in the data processing. Various members of the energy data user community have eagerly consumed this information and analyzed it for many years. (You might find the addenda nonfiction short stories to be amusing.)
There is a section in the above paper on an article by Guadarrama, Molina, and Tillé(2020) in Survey Methodology, "Small area estimation methods under cut-off sampling," which might be of interest, where they found that regression modeling appears to perform better than calibration, looking at small domains, for cutoff sampling. Their article, which I recommend in general, is referenced and linked in my paper.
There are researchers looking into inference for nonprobability sampling cases which are not so well-behaved as what I did for the EIA, where multiple covariates may be needed for pseudo-weights, or for modeling, or both. (See Valliant, R.(2019)*.) But when many covariates are needed for modeling, I think the chances of a good result are greatly diminished. (For multiple regression, from an article I wrote, one might not see heteroscedasticity that should theoretically appear, which I attribute to the difficulty in forming a good predicted-y 'formula'. For pseudo-inclusion probabilities, if many covariates are needed, I suspect it may be hard to do this well either, but perhaps that is more hopeful. However, in Brewer, K.R.W.(2013)**, he noted an early case where the failure of what appears to be an early version of that approach helped convince people that probability sampling was a must.)
At any rate, there is research on inference from nonprobability sampling which would generally be far less accurate than what I led development for at the EIA.
So, the US Energy Information Administration makes a great deal of use of quasi-cutoff sampling with prediction, and I believe other agencies could make good use of this too, but in all my many years of experience and study/exploration, I have not seen much evidence of such applications elsewhere. If you do, please respond to this discussion.
Thank you - Jim Knaub
..........
*Valliant, R.(2019), "Comparing Alternatives for Estimation from Nonprobability Samples," Journal of Survey Statistics and Methodology, Volume 8, Issue 2, April 2020, Pages 231–263, preprint at 
**Brewer, K.R.W.(2013), "Three controversies in the history of survey sampling," Survey Methodology, Dec 2013 -  Ken Brewer - Waksberg Award article: 
  • asked a question related to Regression Modeling
Question
7 answers
Dear community,
I am currently doing a research project including a moderated mediation that I am analysing with R (model 8). I did not find a significant direct effect of the IV on the DV. Furthermore, the moderator did not have a significant effect on any path. As a follow-up, I thus calculated a second model that excluded the moderator (now mediation only, model 6). In this model, the effect of the IV on the DV is significant. Is it possible that the mere presence of the moderator in the first model influences my direct effect, even if it does not have an effect on the relationship between IV and DV? Is my assumption wrong that direct effects only depict direct effects, without including the influence of other variables in the model?
Can anybody help me with an explanation and maybe also literature for this?
Thank you very much in advance!
KR, Katharina
Relevant answer
Answer
Thom Baguley thank you very very much, I get it now. Very helpful explanations and sources!
  • asked a question related to Regression Modeling
Question
3 answers
The same model gives an R² of 0.94 on one test set (9 observations) and an R² of 0.73 on another test set (95 observations); however, the 0.73 R² is associated with lower RMSE and MAE. How can this situation be explained?
Relevant answer
Answer
With so few observations (n = 9), there will be a lot of sampling error plus the potential for outliers/extreme observations to have a large influence on R squared. With a larger sample (e.g., n = 95), there is less sampling error and less potential for extreme cases to influence the results. I would trust the estimate of .94 (for n = 9) less than the estimate of .73 (which is based on a more substantial sample size).
  • asked a question related to Regression Modeling
Question
2 answers
I am doing my thesis and in my design I have two independent variables, both with 3 levels (no frame, low and high). In the model I am also using the two-way interaction between these two and the two-way and three-way interactions with moderators (party preference and political trust). I found some tutorials online that say I should code each independent variable as a set of two dummy variables to use both in the regression model (D1: no frame (0), low (1), high (0) and D2: no frame (0), low (0) and high (1)). However, when I enter D1.1 and D1.2 for independent variable 1 and D2.1 and D2.2 for independent variable 2 into the model, SPSS excludes one of the variables in the output (D1.1).
Moreover, I am not sure how to code and add the interactions to the model. Initially, I coded one variable for each independent variable as follows: no frame (-1), low (0), high (1) and used these for the interactions. Is that also an option, or does this cause problems?
Relevant answer
Answer
Hello Bieke,
If you genuinely believe that the unit difference separating "no frame" and "low" is exactly the same as that separating "low" and "high" for an IV, then yes, you may code them as -1, 0, 1 (or any other linear transformation thereof). However, if you don't believe this, then you must opt for recoding, replacing a three-level variable with two dummy variates.
I suspect that the reason one of your variables is being dropped is due to redundancy, and this could be caused by: (a) no instances of one of the three categories in the data set; (b) IV1 and IV2 are redundant in some way (e.g., every case is in the same level for both variables); (c) a mistaken recoding.
Are you also including all four of the dummy variate products to capture the information about their interaction?
Good luck with your work.
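If it helps to see the coding concretely, here is a minimal sketch in R (hypothetical data; the same dummies and products can be built in SPSS with Recode/Compute). It constructs the two dummy variates per three-level factor, the four products that carry their two-way interaction, and the equivalent factor-based specification that lets the software do the coding for you.
# Hypothetical data: two three-level framing factors and a continuous outcome
set.seed(1)
df <- expand.grid(
  frame1 = factor(c("none", "low", "high"), levels = c("none", "low", "high")),
  frame2 = factor(c("none", "low", "high"), levels = c("none", "low", "high")),
  rep    = 1:20
)
df$y <- rnorm(nrow(df))
# Two dummy variates per factor, with "none" as the reference category
df$D1.1 <- as.numeric(df$frame1 == "low")
df$D1.2 <- as.numeric(df$frame1 == "high")
df$D2.1 <- as.numeric(df$frame2 == "low")
df$D2.2 <- as.numeric(df$frame2 == "high")
# The four products carry the 2 x 2 df two-way interaction
fit_dummies <- lm(y ~ D1.1 + D1.2 + D2.1 + D2.2 +
                    D1.1:D2.1 + D1.1:D2.2 + D1.2:D2.1 + D1.2:D2.2, data = df)
# Equivalent model, letting R build the dummies and products automatically
fit_factors <- lm(y ~ frame1 * frame2, data = df)
If a manually built dummy gets dropped in SPSS, a crosstab of that dummy against the others will usually reveal the redundancy described in points (a)-(c) above.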
  • asked a question related to Regression Modeling
Question
18 answers
The Seemingly Uncorrelated Regression Models
Relevant answer
Answer
Bruce Weaver, he also states that SPSS has a procedure for SUR with correlated errors. Given the specifics that Samuel Oluwaseun Adeyemo's post includes, either he is working with a system beyond what (https://www.ibm.com/support/pages/seemingly-unrelated-regression) refers to that we don't know about, or there is some other explanation for how he came up with this information. It is a shame that in his response above he did not say where he got this information. Since he didn't answer on the other thread, the 'using a new version' explanation seems unlikely.
  • asked a question related to Regression Modeling
Question
3 answers
While examining some students at their final-year project defences, I discovered that one student's regression analysis reported an Adjusted R² greater than 99%. Could that be possible?
Relevant answer
Answer
Hello Ibikunie,
1. Can an adjusted R-squared exceed 0.99? Yes.
2. Can an adjusted R-squared exceed the associated, unadjusted R-squared? No, though they can be equal (as pointed out by Debopam Ghosh )
Here's one version of the most commonly used formula for adjusted R-squared:
Adjusted R2 = 1 – [(1-R2)*(n-1)/(n-k-1)]
where:
R2 = unadjusted/observed R-squared
n = sample size used in model
k = number of IVs included in the model
Note that the ratio (n - 1) / (n - k - 1) will always be greater than one for k = 1, 2, ..., n - 2. So, unless the unadjusted R-squared = 1, the adjusted R-squared will be less than the unadjusted R-squared. As n increases relative to k, the difference between the two values will decrease as well.
Good luck with your work.
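A quick way to sanity-check a reported value is to plug the numbers into that formula; here is a minimal sketch in R with made-up values:
# Adjusted R-squared from the unadjusted R-squared, sample size n, and k predictors
adj_r2 <- function(R2, n, k) 1 - (1 - R2) * (n - 1) / (n - k - 1)
adj_r2(R2 = 0.995, n = 30, k = 3)   # about 0.994: exceeding 0.99 is possible
adj_r2(R2 = 0.50,  n = 30, k = 3)   # about 0.44: never above the unadjusted value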
  • asked a question related to Regression Modeling
Question
4 answers
Hi all,
I was wondering if anyone knew of alternatives to zero-truncated Poisson (ZTP) models. I am trying to fit a ZTP regression model using demographic predictors. The ZTP is doing okay, except that the residuals are not behaving that well. Anything in the literature would help.
Relevant answer
Answer
As for literature, I found Joseph Hilbe's books Negative Binomial Regression and Modeling Count Data helpful for my models.
  • asked a question related to Regression Modeling
Question
7 answers
I have a heterogeneous panel data model (N = 6, T = 21). What is the appropriate regression model? I have applied the CD test, and it shows that the data have cross-sectional dependence.
I then used second-generation unit root tests, and the results show that my data are stationary in levels.
Is it possible to use PMG? Would you please explain the appropriate regression model?
Relevant answer
  • asked a question related to Regression Modeling
Question
3 answers
I am using a fixed-effects panel data model with 100 observations (20 groups), 1 dependent and three independent variables. I would like to get a regression output from it. My question is: is it necessary to run any normality test and linearity test for panel data? And what difference would it make if I don't run these tests?
Relevant answer
Answer
Rede Ganeshkumar Dilipkumar So, to test the hypotheses about the relationship between regressor and regressand, do we really need a normality test? In causal-relationship methods we often have to decide whether the alternative hypothesis or the null hypothesis is supported regarding the influence of the regressor on the regressand. So after we have found the best-fitting regression model (pooled, fixed effects, or random effects), we then have to assess the influence of the regressor on the regressand. Are you saying that this hypothesis testing really requires a normality test and the other assumption tests? Please give a recommended theory or reference to strengthen your argument. Thank you for the enlightenment.
  • asked a question related to Regression Modeling
Question
2 answers
Hello everyone, I have only found additive interaction calculation methods based on the logistic regression model on the internet. Can anyone provide the name of an R package, or SAS code, for calculating RERI, AP, SI and their 95% CIs based on a log-binomial regression model? Many thanks!
Relevant answer
Answer
Hi,
did you see this paper?
They refer to reference [9] to calculate CIs based on logreg for RERI, AP and S. Perhaps that could work for you?
This is reference [9] in the paper mentioned above:
Hosmer DW, Lemeshow S. Confidence interval estimation of interaction. Epidemiology. 1992;3:452–456.
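While waiting for a packaged solution, the point estimates themselves are easy to compute from a fitted log-binomial model. Here is a minimal sketch in R with simulated data (hypothetical binary exposures A and B); the 95% CIs are omitted and would still need the delta method of Hosmer and Lemeshow or a bootstrap of the whole procedure.
# Minimal sketch: RERI, AP and S from a log-binomial model (point estimates only)
set.seed(42)
dat <- data.frame(A = rbinom(2000, 1, 0.4), B = rbinom(2000, 1, 0.3))
dat$y <- rbinom(2000, 1, plogis(-2 + 0.5 * dat$A + 0.4 * dat$B + 0.3 * dat$A * dat$B))
# Log-binomial fits can fail to converge; supplying starting values often helps
fit <- glm(y ~ A * B, family = binomial(link = "log"), data = dat,
           start = c(-2, 0, 0, 0))
b    <- coef(fit)
RR10 <- unname(exp(b["A"]))                       # risk ratio for A alone
RR01 <- unname(exp(b["B"]))                       # risk ratio for B alone
RR11 <- unname(exp(b["A"] + b["B"] + b["A:B"]))   # risk ratio for joint exposure
RERI <- RR11 - RR10 - RR01 + 1
AP   <- RERI / RR11
S    <- (RR11 - 1) / ((RR10 - 1) + (RR01 - 1))
c(RERI = RERI, AP = AP, S = S)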
  • asked a question related to Regression Modeling
Question
6 answers
I would like to know if I am wrong in doing this. I made quartiles out of my independent variable and from those I made dummy variables. When I do linear regression I have to record the betas with 95% CIs per quartile per model (I adjust my model 1 for age and sex). Can I enter all the dummies into the model at the same time, or do I have to enter them separately (while also adjusting for age and sex, for example)?
So far I have entered all the dummies and adjusted for age and sex at the same time, but now I wonder whether SPSS fails to adjust for the second and third dummy variables. So I think I need to redo my calculations and just run my models with one dummy in each.
Thank you. 
Relevant answer
Answer
What you are looking for is simply linear regression with the continuous variable entered directly, rather than categorized. Good news is that linear regression is quickly done and easy to interpret. It will also give you more statistical power, as by categorization you lose information.
And don't worry, I've seen this categorization non-sense done by seasoned professors (and sometimes forced upon their students). That's from a past era, where you literally had to crunch the numbers with pencil and paper, which is easier with categories.
  • asked a question related to Regression Modeling
Question
9 answers
The regression model is built on log-transformed data and later back-transformed. To validate the model I am using the k-fold cross-validation method in Weka. Obviously, here I am getting the validation of the model on the log-transformed data. My question is: can we perform k-fold cross-validation with the back-transformed model?
Relevant answer
Answer
Anureet Kaur Can you please point me to the answer you found? I seem to have the same dilemma here. I can do cross-validation on the log-transformed data (independent variable log-transformed), but I have several models to compare, some using log-transformed data and some not. It is obviously not possible to simply inverse-transform the validated MSE and RMSE of the log model, since that gives results that are not comparable with those of the models without transformation, which suggests such inversely transformed MSE and RMSE are invalid.
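One way to make the comparison fair is to validate every model on the original scale: back-transform each model's predictions within each fold and compute RMSE/MAE on the raw outcome for all models. A minimal sketch in R with simulated data (the plain exp() back-transformation below ignores retransformation bias; a smearing-type correction could be added):
# k-fold CV comparing a log-outcome model and a raw-outcome model on the same scale
set.seed(1)
n   <- 200
x   <- runif(n, 1, 10)
dat <- data.frame(x = x, y = exp(0.5 + 0.3 * x + rnorm(n, sd = 0.3)))
k     <- 10
folds <- sample(rep(1:k, length.out = n))
rmse_raw <- rmse_log <- numeric(k)
for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  m_raw <- lm(y ~ x, data = train)
  m_log <- lm(log(y) ~ x, data = train)
  pred_raw <- predict(m_raw, test)
  pred_log <- exp(predict(m_log, test))    # back-transform to the original scale
  rmse_raw[i] <- sqrt(mean((test$y - pred_raw)^2))
  rmse_log[i] <- sqrt(mean((test$y - pred_log)^2))
}
c(raw = mean(rmse_raw), log = mean(rmse_log))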
  • asked a question related to Regression Modeling
Question
7 answers
Dear Team,
I am running a multinomial logistic regression model for one of the fictitious data before i implement the same on my real data.
Here I am trying to predict, based on some scores and economic group, whether a person will go for a diploma, general or honors program.
The code below:
m11$prog2 <- relevel(m11$prog, ref = "honors")
I have already loaded the nnet library. However, I got the error below:
Error in relevel.factor(m11$prog, ref = "honors") :
'ref' must be an existing level
I have tried searching on SO and nabble but did not find an answer that could help.
Please suggest what is incorrect. I also checked the class of the variable, and it is a factor.
Relevant answer
Answer
I am experiencing the same issue. Could not figure out why.
Error in nnet::multinom(OverallCondition ~ ., data = data_house, family = "binomial") :
need two or more classes to fit a multinom model
In addition: Warning message:
In nnet::multinom(OverallCondition ~ ., data = data_house, family = "binomial") :
groups ‘Poor’ ‘Average’ ‘Good’ are empty
I have been sitting with the same error for around 2 hours.
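In both threads the message points at the factor levels rather than at multinom itself: 'ref' must match an existing level exactly (spelling and case), and the 'groups ... are empty' warning means the outcome factor has levels with no observations. A minimal diagnostic sketch in R, reusing the object names from the posts above (the alternative spelling "Honors" is only a hypothetical example):
# Check the exact spelling/case of the levels: 'ref' must match one of these verbatim
levels(m11$prog)
table(m11$prog, useNA = "ifany")
# If the level exists under a different spelling, use that spelling, e.g.:
# m11$prog2 <- relevel(m11$prog, ref = "Honors")
# Empty levels (left over after subsetting or recoding) can be dropped:
# data_house$OverallCondition <- droplevels(data_house$OverallCondition)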
  • asked a question related to Regression Modeling
Question
6 answers
Let's say I have an SLR with correlation coefficient r = -0.81; then the coefficient of determination r² will be around 0.656. From what I know, this would mean that while the variables have a strong negative correlation, the independent variable is only able to account for about 65.6% of the variation in the dependent variable. My question, based on this, is: what does this suggest? Is the relationship between the two variables non-linear, or do I need to add more independent variables?
Relevant answer
Answer
This (-.81) seems like a pretty substantial correlation to me! Nonetheless, it would not hurt to take a look at the scatterplot to check for potential non-linearity. Also, if you have other variables that should also be related to the DV in theory, you could add them to the model.
  • asked a question related to Regression Modeling
Question
9 answers
Dear,
I have conducted a study in which 18 patients were included. I ran a logistic regression; the model is significant but none of my predictors are. My R-square is 1, which is also a bit strange. What do I need to do here? How do I report this?
Relevant answer
Answer
You have quasi-complete or complete separation. Ideally you need more data - or to use an approach that adds more information (e.g., Bayesian regression - though other options exist).
  • asked a question related to Regression Modeling
Question
5 answers
If we have a regression model built from given data, how can we quickly rebuild the model if one training observation is removed? I know the leave-one-out error can be quickly approximated, but how can the model itself be rebuilt quickly?
Relevant answer
Answer
Dear Dr. Li,
Thank you for your attention. Your answer gives me some inspiration. I will try the approach. Thank you again.
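For ordinary least squares there is a closed-form answer: the coefficients after deleting observation i follow from the full fit by a rank-one update, with no need to re-solve the normal equations. A minimal sketch in R with simulated data, using the standard leave-one-out identity beta(-i) = beta - (X'X)^(-1) x_i e_i / (1 - h_ii):
# OLS coefficients after removing observation i, via a rank-one (leave-one-out) update
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))            # design matrix with intercept
y <- X %*% c(1, 2, -1) + rnorm(n)
XtX_inv <- solve(crossprod(X))
beta    <- XtX_inv %*% crossprod(X, y)       # full-data fit
i   <- 7                                     # observation to remove
xi  <- X[i, , drop = FALSE]                  # 1 x p row vector
ei  <- y[i] - xi %*% beta                    # residual of observation i
hii <- xi %*% XtX_inv %*% t(xi)              # leverage of observation i
beta_loo <- beta - (XtX_inv %*% t(xi)) %*% (ei / (1 - hii))
# Check against a direct refit without row i
beta_refit <- solve(crossprod(X[-i, ])) %*% crossprod(X[-i, ], y[-i])
max(abs(beta_loo - beta_refit))              # essentially zero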
  • asked a question related to Regression Modeling
Question
3 answers
Can a person's perception of one thing influence their perception of another (both measured using a Likert scale)? If so, can the degree of influence be measured using a simple linear regression model, or something else? Please give me any references for that. Thank you very much for your comments.
Relevant answer
Answer
If there is data, you can build a model. The key is to see the relevance of the model.
  • asked a question related to Regression Modeling
Question
6 answers
I am conducting a panel data regression for research on the economic growth of a few countries. In real life, it is hard to find data that are normally distributed, and most of the control variables are correlated with each other in one country or another.
However, the regression test results are satisfactory and all show that the residuals are normally distributed and that there is no serial correlation or heteroscedasticity. Even the CUSUM and CUSUMSQ tests show that the model is stable.
In such a case, are the diagnostic tests enough to justify that the results of the regression model are reliable and valid even when data are not normally distributed and there exists correlation among them?
Thank you in advance for your responses.
Relevant answer
Answer
There is no assumption about normality of predictor or outcome variables in linear regression (only about the normality of residuals/errors). Also, linear regression is designed to handle correlated predictors, so there should be no problem unless the correlations are extremely high.
  • asked a question related to Regression Modeling
Question
12 answers
More exactly, do you know of a case where there are repeated, continuous data, sample surveys, perhaps monthly, and an occasional census survey on the same data items, perhaps annually, likely used to produce Official Statistics?   These would likely be establishment surveys, perhaps of volumes of products produced by those establishments. 
I have applied a method which is useful under such circumstances, and I would like to know of other places where this method might also be applied.   Thank you. 
Relevant answer
Answer
This is for the crushed stone industry in the US:
I'm told these quarterly surveys are for "a select set of companies" which reminds me of how the quasi-cutoff sample of electric sales in the US got started. The electric sales survey of a select group of entities was later modified and used as a sample, first a stratified random sample with a large company censused stratum, and then only the censused stratum as a quasi-cutoff sample, all after starting an annual census of all electric sales by economic sector (residential, etc.), from the production/supply side. If one wanted to monitor the crushed stone industry the same way, I would suggest this approach using a quasi-cutoff sample with a ratio model for prediction, as is done at the US Energy Information Administration (EIA).
Does anyone know of other surveys, each of a select group of larger establishments being followed, where there is a chance to instead have an occasional census to be used for regressor data for the same data items in a more frequent sample?
  • asked a question related to Regression Modeling
Question
10 answers
I am testing hypotheses about relationships between CEA and Innovation Performance (IP). If I am testing the relationship of one construct, say Management Support, to IP, is it OK to use simple linear regression? Or should I be testing it in a multiple regression with all the constructs?
  • asked a question related to Regression Modeling
Question
3 answers
I have constructed a multiple regression model. The results show that the selected variables are significant but the intercept is not. Is that model useful? How should I interpret it? Thanks.
Relevant answer
Answer
It's very hard to answer without more information. Generally the intercept isn't of much interest in a model, but there are exceptions. It's the expected value of the outcome Y when all predictors are zero. Usually that's not an interesting quantity, and testing whether it is different from zero is also rarely of interest. So in most cases it's fine, unless there is good reason to believe the intercept should not be zero or close to zero in this context.
  • asked a question related to Regression Modeling
Question
5 answers
I am trying to find the relation between the dependent variable (membrane fouling rate) and independent variables (SMP, EPS, RH, ...) in treating wastewater in an MBR. I want to build a regression model with these variables that shows which of them are more important (based on coefficients).
Relevant answer
Answer
Old-fashioned stepwise regression is not recommended. Here are some links you may find useful.
Standardized regression coefficients are also problematic. See John Fox's comment in the attached file, for example.
The 2nd link above includes some suggestions about other approaches you could try (e.g., LASSO). You could also read up on dominance analysis.
HTH.
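If you want to try the LASSO route mentioned above, here is a minimal sketch with the glmnet package; the predictor names are hypothetical stand-ins for SMP, EPS, RH, etc., and glmnet standardizes the predictors internally by default, so the penalty treats them on a common scale.
# LASSO with a cross-validated penalty using glmnet
library(glmnet)
set.seed(1)
n <- 100
X <- cbind(SMP = rnorm(n), EPS = rnorm(n), RH = rnorm(n), other = rnorm(n))
y <- 2 * X[, "SMP"] + 0.5 * X[, "EPS"] + rnorm(n)   # hypothetical fouling-rate outcome
cvfit <- cv.glmnet(X, y, alpha = 1)      # alpha = 1 gives the LASSO penalty
coef(cvfit, s = "lambda.min")            # coefficients at the best cross-validated lambda
Variables whose coefficients are shrunk exactly to zero at the selected lambda are effectively dropped, which is a more principled notion of 'importance' than stepwise selection or comparing raw coefficients.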
  • asked a question related to Regression Modeling
Question
3 answers
I have a dependent variable: childcare take up, with answers 1 or 0.
I have several other independent variables such as education level, household income, migration background, etc. I have dichotomized all of them. For example, observations with college education have been classified as 1, otherwise 0.
I would like to know which regression model would be the best to predict the childcare take up.
Relevant answer
Answer
For dichotomous dependent variables, you could use logistic regression.
However, I would discourage you from dichotomizing your independent variables (leave them in their original metric). Dichotomizing leads to a loss of information and can reduce statistical power. There is almost never a need to dichotomize variables from a statistical perspective because there are many statistical procedures available that can handle ordinal-polytomous and/or continuous (metrical, interval-scale) variables. When you type "dichotomizing continuous variables" into Google Scholar, you will find a large number of relevant references that explain why this is a bad idea.
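A minimal sketch of that logistic regression in R, keeping the predictors in their original metric; the data and variable names below are hypothetical.
# Logistic regression for a 0/1 outcome with undichotomized predictors
set.seed(1)
n <- 500
childcare <- data.frame(
  education_years      = rnorm(n, 14, 3),
  household_income     = rnorm(n, 40000, 12000),
  migration_background = rbinom(n, 1, 0.3)
)
childcare$takeup <- rbinom(n, 1, plogis(-3 + 0.15 * childcare$education_years +
                                          0.00002 * childcare$household_income))
fit <- glm(takeup ~ education_years + household_income + migration_background,
           family = binomial, data = childcare)
summary(fit)       # coefficients on the log-odds scale
exp(coef(fit))     # odds ratios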
  • asked a question related to Regression Modeling
Question
9 answers
I want to conduct a regression for count outcome data with underdispersion. Which regression model is appropriate?
Relevant answer
Answer
Thank you, Prof. Kelvyn Jones, for your usual informative response to my concerns. I downloaded the first reference (1st ed., freely available) and I'm exploring it.
  • asked a question related to Regression Modeling
Question
27 answers
Hello,
I have a question about the multiple regression model. I want to fit a multiple regression model on a dataset in which the dependent variable is continuous and all of the independent variables are categorical. Should I also check the assumptions of the multiple regression model for categorical variables? If they are not met, which model may I use instead of the multiple regression model to analyze the effects of these variables on the dependent variable?
thank you
Relevant answer
Answer
Get a book on regression. This is a standard ANOVA-type regression problem. I am attaching such a set of notes for anyone who is interested. David Booth
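In R, for example, this is just a linear model with factors on the right-hand side; the software builds the dummy coding for you, and the usual checks on the residuals (normality, constant variance) still apply. A minimal sketch with hypothetical data:
# Continuous outcome, categorical predictors (ANOVA-type regression)
set.seed(1)
dat <- data.frame(
  y      = rnorm(120),
  group  = factor(sample(c("A", "B", "C"), 120, replace = TRUE)),
  region = factor(sample(c("north", "south"), 120, replace = TRUE))
)
fit <- lm(y ~ group + region, data = dat)   # R builds the dummy coding automatically
summary(fit)
plot(fit, which = 1:2)                      # residual and Q-Q plots for the usual checks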
  • asked a question related to Regression Modeling
Question
6 answers
I am interested in extracting the actual probability values (between 0 and 1) from a logistic regression curve (sigmoid curve) in Python, as shown in pink in the attached image.
Relevant answer
Answer
You can use
model.predict_proba(x_test) to get the probability values if using the sklearn library.
  • asked a question related to Regression Modeling
Question
12 answers
I would like to scale a number of variables by average total assets in the regression model, but I am not sure how, or what command I should use, to scale the variables.
For example, I need to scale earnings by average total assets.
Relevant answer
Answer
Thank you very much David. I always appreciate your efforts to educate me.
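For anyone looking for a concrete starting point, here is a minimal sketch in R with a hypothetical panel, assuming 'average total assets' means the mean of the prior-year and current-year balances; adjust the definition to whatever your study uses.
# Scaling earnings by average total assets in a firm-year panel (hypothetical data)
library(dplyr)
panel <- data.frame(
  firm         = rep(c("A", "B"), each = 3),
  year         = rep(2019:2021, times = 2),
  earnings     = c(10, 12, 11, 5, 6, 7),
  total_assets = c(100, 110, 120, 50, 55, 60)
)
panel <- panel %>%
  group_by(firm) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(avg_assets      = (lag(total_assets) + total_assets) / 2,
         scaled_earnings = earnings / avg_assets) %>%   # first year per firm is NA
  ungroup()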
  • asked a question related to Regression Modeling
Question
2 answers
I am working on a health data set and trying to capture health inequality in India. Further, I am using a non-parametric (kernel) regression model to measure health inequality. To calculate the standard error of the concentration index, should I use the delta method or the bootstrap, and why?
Relevant answer
Answer
Sujata Sujata Both methods have contextual applications depending on how the data are analyzed. The idea of bootstrapping the SE is to capture variability, especially if the sample is not randomized; otherwise repeated-sampling arguments may apply. The delta method, in contrast, is an analytic approximation for already randomized observations.
You cannot choose either without justifying the context.
Best of luck
  • asked a question related to Regression Modeling
Question
11 answers
In the latest approaches to trying to infer from nonprobability samples, multiple covariates are encouraged.  For example, see https://www.researchgate.net/publication/316867475_Inference_for_Nonprobability_Samples.  However, in my experience, when a simple ratio model can be used with the only predictor being the same data item in a previous census (and results can be checked and monitored with repeated sample and census surveys), results can be very good.  When more complex models are needed, I question how often this can be done suitably reliably.  With regard to that, I made comments to the above paper.  (That paper is available through Project Euclid, using the DOI found at the link above.) 
Analogously, for heteroscedasticity in regression, for Yi associated with larger predicted-yi, sigma should be larger.  However, when a more complex model is needed, this is less likely to be empirically apparent.  For a one-predictor ratio model where the predictor is the same data item in a previous census, and you have repeated sample and census surveys for monitoring, this, I believe, is much more likely to be successful, and heteroscedasticity is more likely to be evident. 
This is with regard to finite population survey statistics.  However, in general, when multiple regression is necessary, this always involves complications such as collinearity and others.  Of course this has been developed for many years with much success, but the more variables required to obtain a good predicted-y "formula," the less "perfect" I would expect the modeling to be.  (This is aside from the bias variance tradeoff which means an unneeded predictor tends to increase variance.) 
[By the way, back in Cochran, W.G.(1953), Sampling Techniques, 1st ed, John Wiley & Sons, pages 205-206, he notes that a very good size measure for a data item is the same data item in a previous census.] 
People who have had a lot of experience successfully using regression with a large number of predictors may find it strange to have this discussion, but I think it is worth mulling over. 
So, "When more predictors are needed, how often can you model well?"
Relevant answer
Answer
Prediction-based inference from finite population sampling is well-established.   See
Valliant, R, Dorfman, A.H., and Royall, R.M.(2000), Finite Population Sampling and Inference: A Prediction Approach, Wiley Series in Probability and Statistics,
and
Chambers, R, and Clark, R(2012), An Introduction to Model-Based Survey Sampling with Applications, Oxford Statistical Science Series. 
This is also in parts of other sampling books such as Thompson, S.K.(2012), Sampling, 3rd ed, John Wiley & Sons. 
A classical ratio model is based on less heteroscedasticity than one may usually find in survey data, except when the effective coefficient of heteroscedasticity is reduced by data quality issues from the small responders.  It can be very useful in Official Statistics.  We can estimate sample size requirements for a subpopulation or a population modeled well by a single such model, using a sample drawn in any way which does not miss any important subdivision which would indicate more than one model is needed.  The format for the "formula" for estimating the sample size in such a case is similar to what is found for a simple random sample in Cochran, W.G.(1977), Sampling Techniques, 3rd ed, John Wiley & Sons.  See https://www.researchgate.net/publication/261947825_Projected_Variance_for_the_Model-based_Classical_Ratio_Estimator_Estimating_Sample_Size_Requirements.  Both cutoff/quasi-cutoff sampling and balanced sampling are discussed. 
Thus we can estimate the sample size required here to make satisfactory inference from a population or subpopulation which falls under the purview of one model-based classical ratio estimator.  This simple result is relatively easily verified. 
So we can see that when a simple model is appropriate, as when a single predictor is the same data item in a previous census, we may more feasibly infer from a nonprobability sample.  
Does anyone have any other experience to discuss?
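For concreteness, here is a minimal sketch of the model-based classical ratio estimator discussed above, with simulated data standing in for a real survey: x is the same data item from a previous census (known for the whole population), y is collected only for a quasi-cutoff sample of the largest establishments, and the total is the observed part plus a prediction for the unobserved remainder rather than the coverage alone.
# Classical ratio estimator as a prediction for the non-sampled remainder
set.seed(1)
N <- 500
x <- rgamma(N, shape = 2, scale = 50)        # previous-census sizes (skewed)
y <- 1.1 * x + rnorm(N, sd = 2 * sqrt(x))    # current values; sd grows with size
# Quasi-cutoff sample: the largest establishments by the census value x
s <- x >= quantile(x, 0.80)
b_hat <- sum(y[s]) / sum(x[s])               # slope of the ratio model y = b x + e
T_hat <- sum(y[s]) + b_hat * sum(x[!s])      # observed total + predicted remainder
c(estimated_total = T_hat, true_total = sum(y))
Here b_hat = sum(y)/sum(x) over the sample is the weighted-least-squares slope under the ratio model with variance proportional to x, so the non-covered part of the population is predicted rather than ignored.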
  • asked a question related to Regression Modeling
Question
5 answers
Let us suppose that we have only a small amount of data (say, subtractive manufacturing data). Owing to the costs involved, the number of experiments conducted is limited, and therefore the number of data points covered is also limited.
Consequently, the plotted graphs are quite sparse and therefore do not give a clear picture of the relationship between the explanatory and predicted variables.
The question is, do I fit linear regression models or do I go for an ANN?
Relevant answer
Answer
Was the experiment used to collect the data planned through any of the statistical methods of Design of Experiments?
  • asked a question related to Regression Modeling
Question
3 answers
Explain why it is not possible to estimate a linear regression model that contains all dummy variables associated with a particular categorical explanatory variable?
Relevant answer
Answer
Start with the idea of a single dichotomous variable scored 1 and 0. When you enter this into your regression, the observations scored 0 will become the omitted category, which will then be represented as the intercept of the regression equation. Regardless of how many categories you have, you will always need to omit one to serve as the intercept.
Another way to think of this is in terms of degrees of freedom. A dichotomous variable only has one degree of freedom because if you know the grand mean and the mean of either category, then you can automatically calculate the mean of the remaining category.
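You can see the redundancy directly in software: with an intercept in the model, the full set of dummies always sums to one, so any one of them is an exact linear combination of the intercept and the others. A minimal sketch in R:
# The dummy-variable trap
set.seed(1)
grp <- factor(sample(c("A", "B", "C"), 90, replace = TRUE))
y   <- rnorm(90)
d_A <- as.numeric(grp == "A")
d_B <- as.numeric(grp == "B")
d_C <- as.numeric(grp == "C")
all(d_A + d_B + d_C == 1)    # TRUE: perfectly collinear with the intercept column
lm(y ~ d_A + d_B + d_C)      # one coefficient is reported as NA (aliased)
lm(y ~ grp)                  # the usual fix: the software drops one level as the reference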
  • asked a question related to Regression Modeling
Question
6 answers
Explain the benefits of using natural logarithms of variables, either of Y or of the X's, as opposed to other possible nonlinear functions, when scatterplots (or possibly economic considerations) indicate that nonlinearities should be taken into account
Relevant answer
Answer
Hello Hong,
Here's one: Consider a variable that generally shows strong positive skew (for example, annual income among residents of a country). There's actually several issues which arise from this:
1. There can be many orders of magnitude separating the largest and smallest values in the data set. Too many can threaten the accuracy of the computed model coefficients.
2. The strong asymmetry can result in model residuals being non-normally distributed.
3. The strong asymmetry can result in a fitted model being more non-linear in form than linear.
For reasons such as this, transforms such as log (regardless of base) can be helpful.
Good luck with your work.
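A small simulated illustration of points 2 and 3, with a hypothetical income-like outcome: fitting on the log scale makes the residuals far more symmetric and the relationship closer to linear.
# Log transform of a positively skewed outcome
set.seed(1)
x      <- runif(300, 0, 10)
income <- exp(9 + 0.2 * x + rnorm(300, sd = 0.6))   # multiplicative, right-skewed
m_raw <- lm(income ~ x)
m_log <- lm(log(income) ~ x)
summary(m_raw$residuals)   # strongly asymmetric around zero
summary(m_log$residuals)   # roughly symmetric
hist(m_raw$residuals); hist(m_log$residuals)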