Science topic

Regression - Science topic

Explore the latest questions and answers in Regression, and find Regression experts.
Questions related to Regression
  • asked a question related to Regression
Question
3 answers
I am using an L8 mixed-level orthogonal array: one categorical factor at 4 levels and four other factors at 2 levels. How do I develop the regression equation?
Relevant answer
Answer
Yes and no: with k - 1 = 3 dummies, it should be structured something like this (attached). I've never used Minitab; in R it would simply be
# B-E are the two-level factors; D1-D3 are the k-1 = 3 dummies for the four-level factor
model <- lm(Y ~ B + C + D + E + D1 + D2 + D3, data = data)
summary(model)
  • asked a question related to Regression
Question
4 answers
Hello, I have used machine learning regression algorithms, including Random Forest, Decision Tree, and K-Nearest Neighbors (KNN), to estimate soil moisture parameters. What other suitable and practical algorithms do you recommend for evaluating and estimating soil moisture?
Thank you.
Relevant answer
Answer
  • asked a question related to Regression
Question
9 answers
I used regression to plot a B-H plot, but I am getting a negative intercept.
Relevant answer
Answer
Well, constrained optimization can do the job. That is, the usual linear regression, which minimizes the sum of squared errors as its objective function, can be extended with one (or many) constraints on the parameters, i.e., on the slope and intercept. See, e.g.
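As an illustration, here is a minimal sketch in R of least squares with a non-negativity constraint on the intercept, using the quadprog package (the data are simulated for the example):
library(quadprog)
set.seed(1)
x <- 1:20
y <- 0.5 * x + rnorm(20)            # toy data; the true intercept is 0
X <- cbind(1, x)                    # design matrix: intercept and slope
Dmat <- crossprod(X)                # X'X
dvec <- drop(crossprod(X, y))       # X'y
Amat <- matrix(c(1, 0), ncol = 1)   # constraint: 1*intercept + 0*slope >= 0
fit <- solve.QP(Dmat, dvec, Amat, bvec = 0)
fit$solution                        # (intercept, slope); the intercept is forced >= 0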
  • asked a question related to Regression
Question
3 answers
In three randomly selected primary care centers, we conducted a pilot intervention. We also have three control centers. Results are available for one month before the intervention, the first month of the intervention, and the second month of the intervention. However, I do not have individual-level data; I only have aggregated numbers. For example, in the first center, X individuals tested positive and Y tested negative in the month before the intervention. Also, all individuals were different, since our healthcare centers include everyone screened during the study period.
My main question is: Which statistical test should I use? Can I use regressions, and specifically, which regression should I use?
Thank you
Relevant answer
Answer
If you have the counts for both the positive and negative results, logistic regression would be the better approach. The model is slightly complicated because you have multiple centers (locations) within each treatment group, and two time periods.
You could probably do an initial, simpler analysis by pooling the centers within each treatment and just looking at the proportions. Maybe just plot the proportions with confidence intervals around them. A plot like this may tell the whole story.
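For the aggregated counts themselves, a minimal sketch of the logistic regression in R (the column names and counts below are made up for illustration; center could also enter as a fixed or random effect):
set.seed(1)
dat <- data.frame(
  center    = factor(rep(1:6, each = 3)),
  treatment = rep(c("intervention", "control"), each = 9),
  period    = factor(rep(c("pre", "month1", "month2"), times = 6),
                     levels = c("pre", "month1", "month2")),
  positive  = rpois(18, 30),
  negative  = rpois(18, 70)
)
# cbind(successes, failures) lets glm() fit a logistic model on aggregated counts
fit <- glm(cbind(positive, negative) ~ treatment * period,
           family = binomial, data = dat)
summary(fit)  # the treatment:period interaction captures the intervention effect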
  • asked a question related to Regression
Question
1 answer
Hello everyone,
I have a dataset with a panel structure: 78 individuals observed over 5 three-year periods. I have 10 dependent variables and 1 independent variable, and I applied logarithmic transformations to all of them due to differences in scale. I have found that the true model is the FE model and have tested for homoskedasticity and serial correlation, with the following results: both heteroskedasticity and serial correlation are present.
However, I have read that serial correlation is not really an issue in panels with fewer than 20 time observations, so I would like to know whether it is safe to ignore it.
This leaves me with the problem of dealing with heteroskedasticity. I'm using R and cannot use Stata. So far I have largely followed this presentation to run the regression: http://www.princeton.edu/~otorres/Panel101R.pdf
The image below shows the explanation of the covariance matrices:
Given all the above, I would like to know which covariance matrix is the best option. Should I treat only heteroskedasticity, or serial correlation as well?
Relevant answer
Answer
Eigenvectors in Python might seem incorrect due to several potential issues, such as numerical precision errors, incorrect input matrices, or misunderstanding the output format. Eigenvalue problems can be sensitive to the scaling or conditioning of the matrix, and small computational errors may arise in floating-point arithmetic, leading to seemingly incorrect results. It's also important to check if the eigenvectors are normalized or if they are presented as column vectors or row vectors, as the orientation might differ based on the solver or the specific Python package (like NumPy or SciPy) used. Additionally, eigenvectors corresponding to distinct eigenvalues are not unique and can differ by a scalar multiple, which might also lead to apparent discrepancies.
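On the covariance-matrix question itself, a minimal sketch in R with the plm package (its bundled Grunfeld data stand in for the actual panel): the Arellano estimator is robust to both heteroskedasticity and serial correlation within individuals, so it addresses both problems at once.
library(plm)
library(lmtest)
data("Grunfeld", package = "plm")
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")
# Arellano covariance matrix: robust to heteroskedasticity and serial correlation
coeftest(fe, vcov = vcovHC(fe, method = "arellano", type = "HC1"))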
  • asked a question related to Regression
Question
4 answers
I learned that ANOVA and regression are both general linear models, but that they have some differences. Shouldn't there then also be a difference between MANOVA and multivariate regression (multiple dependent variables)? When I look on the internet, there are only instructions on how to conduct MANOVA in SPSS. Is that because MANOVA and multivariate regression are the same, or does SPSS just not provide multivariate regression?
Which software is best for conducting multivariate regression?
Relevant answer
Answer
Nuri M. Abduali I'd argue MANOVA doesn't test "... for differences in multiple dependent variables", or at least not in the sense that people think or that is generally useful.
MANOVA creates a weighted average of the various DVs and uses this as its DV. This weighting is opportunistic and atheoretical: it is based on the optimal combination in the data (the weighted linear function of the DVs that maximizes the variance). So if you repeated the study with new data for 4 DVs, the second analysis using the same 4 DVs would have a different linear combination. This creates interpretability and other issues. Most researchers end up falling back on univariate analyses (incorrectly thinking the MANOVA has added adequate Type I error protection).
  • asked a question related to Regression
Question
5 answers
Hello to all dear friends
I have 2 questions
1- Why use regression to estimate heritability between parents and offspring?
2- Why use correlation to estimate heritability between full and half-siblings?
Relevant answer
Answer
Correlation is used to estimate heritability between full and half-siblings because it captures the degree of phenotypic similarity due to shared genetic and environmental factors. Full siblings share 50% of their genetic material on average, while half-siblings share 25%. By comparing their phenotypic correlations, researchers can separate additive genetic effects from shared environmental influences. This approach helps estimate narrow-sense heritability (h²) while accounting for differences in genetic relatedness and environmental contributions.
  • asked a question related to Regression
Question
3 answers
In statistical analysis?
Relevant answer
Answer
IMO, this is a prime example of what Ronán Michael Conroy recently described (in another thread) as a JUNK QUESTION. What do you hope to learn that you cannot learn from a simple search using Google, Duck-Duck-Go, or some other search engine? (I find it hard to imagine that you have access to ResearchGate, but not to any search engines.)
  • asked a question related to Regression
Question
6 answers
In Brewer, K.R.W.(2002), Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold: London and Oxford University Press, Ken Brewer proved not only that heteroscedasticity is the norm for business populations when using regression, but he also showed the range of values possible for the coefficient of heteroscedasticity.  I discussed this in "Essential Heteroscedasticity," https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity, and further developed an explanation for the upper bound. 
Then in an article in the Pakistan Journal of Statistics (PJS), "When Would Heteroscedasticity in Regression Occur," https://www.researchgate.net/publication/354854317_WHEN_WOULD_HETEROSCEDASTICITY_IN_REGRESSION_OCCUR, I discussed why this might sometimes not seem to be the case, but argued that homoscedastic regression is artificial, as can be seen from my abstract for that article. That article was cited by other authors in another article, an extract of which was sent to me by ResearchGate, and it seemed to me to incorrectly say that I supported OLS regression. However, the abstract for that paper is available on ResearchGate, and it makes clear that they are pointing out problems with OLS regression.
Notice, from "Essential Heteroscedasticity" linked above, that a larger predicted-value as a size measure, where simply x will do for a ratio model as bx still gives the same relative sizes, means a larger sigma for the residuals, and thus we have the term "essential heteroscedasticity."  This is important for finite population sampling.
So, weighted least squares (WLS) regression should generally be the case, not OLS regression. Thus OLS regression really is not "ordinary." The abstract for my PJS article supports this. (Generalized least squares (GLS) regression may even be needed, especially for time series applications.)
Relevant answer
Answer
Shaban Juma Ally, that is why one should use weighted least squares (WLS) regression. When the coefficient of heteroscedasticity is zero - which should not happen - then WLS regression becomes OLS regression.
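As a concrete illustration, a minimal WLS sketch in R with the residual standard deviation proportional to x^gamma (gamma = 0.75 is an arbitrary illustrative value for the coefficient of heteroscedasticity; gamma = 0 would reduce WLS to OLS):
set.seed(42)
x <- runif(100, 1, 10)
gamma <- 0.75
y <- 2 * x + rnorm(100, sd = x^gamma)  # sigma_i proportional to x_i^gamma
# lm() weights are inverse variances: w_i = 1 / x_i^(2*gamma)
fit_wls <- lm(y ~ x, weights = x^(-2 * gamma))
summary(fit_wls)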
  • asked a question related to Regression
Question
1 answer
Hello dear academics/researchers
I want to use a test for robustness after conducting KRLS. When I was looking for alternative tests, I found lpoly and localp (an advanced version of lpoly). Local polynomial regression is nice for visualizing graphs, but I'm not sure it would be a valid test after conducting KRLS, and I couldn't find any paper to help me with this. That's why I prefer to ask you, dear academics/researchers. Thank you in advance.
Relevant answer
Answer
Metin Dogan Combining Kernel-based Regularized Least Squares (KRLS) with Local Polynomial Regression (LPR) can indeed be an interesting approach, but it largely depends on your research objectives and the type of robustness or validation you are seeking.
KRLS is a non-parametric regression method that excels in capturing complex, non-linear relationships in your data. On the other hand, LPR, especially with tools like lpoly or its advanced version localp, is often used for visualizing trends in the data or performing exploratory smoothing to understand local patterns.
If your goal is to use LPR as a complementary tool for robustness checks or graphing after KRLS, it can be valid, provided you are clear about the specific purpose:
  1. Visualization: Using LPR to visualize trends in your data after applying KRLS is reasonable, as it can help you explore the consistency between your KRLS results and the observed data patterns. Ensure you communicate that this is a descriptive visualization rather than a statistical validation.
  2. Robustness Testing: If you intend to use LPR as a formal robustness test, you should consider whether the assumptions and properties of the two methods align. KRLS models data globally, while LPR focuses on local smoothing, which may lead to different interpretations. Ensure the robustness checks align with your theoretical framework.
  3. Validation in Literature: As you mentioned, finding explicit examples of these methods used together might be challenging. However, this could also present an opportunity to contribute new insights to the field. Consider documenting your approach clearly and providing justification for why this combination addresses your research question.
  4. Alternative Robustness Tests: If LPR does not seem suitable for formal robustness testing after KRLS, you may explore other statistical tests or resampling methods (e.g., bootstrapping) that align more closely with KRLS.
Ultimately, while combining these methods is not inherently invalid, the validity will depend on your methodological rigor and how you communicate the intent behind using both.
  • asked a question related to Regression
Question
2 answers
Hi
I have a paper where the reviewer suggested the Benjamini-Hochberg Correction.
I have the following hypotheses/tests:
Mean differences across three groups: 5 DVs
Correlations of an ADOS score with various fluency scores: 5 correlations
Mean differences for two groups: 1 DV
Regressions with moderating variables: 6 regressions
I found the original (1995) paper and it seems that instead of using all tests across the whole study, they are grouped into families. My questions are:
1. Do I use the whole hypothesis total or do I do it by hypothesis? That is, is my n-tests=17 or 5, 5, 1, and 6 and I do the correction 4 times?
2. When I am doing mean differences across three groups, and especially for the regressions with all the moderators, am I counting the hypotheses correctly? In particular for the regressions, each beta weight is being tested along with the interactions. With the covariate and moderators I have 6 significance tests under the 1 regression analysis for 5 regressions, and 4 significance tests under 1 regression for the last one. Do I count the regression analysis as 6 (original) + 30 (beta weights for each regression with 6 beta weights) + 4 (beta weights for the regression with 4 beta weights) = 40? Relatedly, do I count the post-hocs in the ANOVAs or the covariates in the ANCOVAs?
3. Also, if p-values are identical (e.g., <.000 in SPSS), do they get the same rank?
4. Preliminary analyses are excluded, yes? I checked to see if groups were equivalent on age, IQ, etc. I suppose I could be fancy and do a BHC for that too but....the point is they are considered separately and not part of the hypothesis being tested, correct?
Thank you
Amy C
Relevant answer
Answer
thank you Sal Mangiafico
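For reference, a minimal sketch of the BH correction applied within one family of tests (here the 5 correlations, with made-up p-values); R's p.adjust() also handles tied p-values automatically:
p_family <- c(0.003, 0.012, 0.030, 0.041, 0.220)  # one family, not all 17 tests
p.adjust(p_family, method = "BH")                 # BH-adjusted p-values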
  • asked a question related to Regression
Question
4 answers
I applied regression and correlation analysis, but the analysis gives a negative result. I then analysed on the Likert scale, and the result is still negative.
Relevant answer
Answer
Correlation and regressions can produce "negative results"...
What is your problem?
  • asked a question related to Regression
Question
1 answer
I am using multiple regression, Correlation Analysis and simple regression in comparing variables for my dissertation. Any insights or resources?
Relevant answer
Answer
Kindly mention more details about the paper.
  • asked a question related to Regression
Question
2 answers
Logistic regression can handle small datasets by using shrinkage methods such as penalized maximum likelihood or Lasso. These techniques reduce regression coefficients, improving model stability and preventing overfitting, which is common in small sample sizes (Steyerberg et al., 2000).
Steyerberg, E., Eijkemans, M., Harrell, F. and Habbema, J. (2000) ‘Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets’, Statistics in medicine, 19(8), pp. 1059-1079.
Relevant answer
Answer
I have the book Statistical methods in medical research by Armitage and Berry
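As a minimal sketch of one such shrinkage approach, here is lasso-penalized logistic regression with the glmnet package (simulated data standing in for a small sample):
library(glmnet)
set.seed(1)
n <- 60; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))
# alpha = 1 gives the lasso penalty; cv.glmnet() picks lambda by cross-validation
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")  # shrunken coefficients; some are exactly zero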
  • asked a question related to Regression
Question
2 answers
There is a research model with one IV, one DV and one mediator. So, when checking the parametric assumptions in SPSS, what will be done to the mediator?
Is the following answer correct?
Path 1: IV to Mediator
For the path where the IV predicts the mediator, the mediator is treated as the outcome.
  • Run a simple regression with the IV as the predictor and the mediator as the dependent variable. Check parametric assumptions for this regression:
    • Linearity
    • Normality of Residuals
    • Homoscedasticity
    • Independence of Errors
Path 2: Mediator to DV (and IV to DV)
In this step, the mediator serves as a predictor of the DV, alongside the IV.
  • Run a regression with the IV and the mediator as predictors and the DV as the outcome. Check assumptions for this second regression model:
    • Linearity
    • Normality of Residuals
    • Homoscedasticity
    • Independence of Errors
    • Multicollinearity
Relevant answer
Answer
This sounds correct to me.
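A minimal sketch in R of the checks listed above for the second model (simulated data; variable names are illustrative):
library(car)     # vif()
library(lmtest)  # dwtest()
set.seed(4)
d <- data.frame(IV = rnorm(150))
d$M  <- 0.5 * d$IV + rnorm(150)
d$DV <- 0.4 * d$M + 0.3 * d$IV + rnorm(150)
fit2 <- lm(DV ~ IV + M, data = d)
par(mfrow = c(2, 2)); plot(fit2)  # linearity, normality of residuals, homoscedasticity
vif(fit2)                         # multicollinearity
dwtest(fit2)                      # independence of errors (Durbin-Watson)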
  • asked a question related to Regression
Question
1 answer
I came across this article on Nature Human Behavior. I did not quite understand the analysis reporting here. What do the two decimal points mean?
Heiserman, N., Simpson, B. Discrimination reduces work effort of those who are disadvantaged and those who are advantaged by it. Nat Hum Behav 7, 1890–1898 (2023). https://doi.org/10.1038/s41562-023-01703-9
Relevant answer
Answer
Oh, that's definitely a typo; it should be, for instance, b = 0.167 (or 1.67? It doesn't say whether the coefficients are standardized, so it's impossible to tell).
  • asked a question related to Regression
Question
1 answer
Hi everyone, I intended to use AMOS 24 to run a regression design with a moderator variable, which is why I installed the software. It installed without any problems, but when I select the data file option, the data from SPSS 27 does not get listed in Amos. The file itself is visible to me, and when I select the data view option, SPSS opens. What is the problem?
Relevant answer
Answer
Try exporting your data as a CSV file.
  • asked a question related to Regression
Question
4 answers
To my knowledge, the total effect in mediation reflects the overall impact of X on Y, including the magnitude of the mediator (M) effects. A mediator is assumed to account for part or all of this impact. In mediation analysis, statistical software typically calculates the total effect as: Total effect = Direct effect + Indirect effect.
When all the effects are positive (i.e., the direct effect of X on Y (c’), the effect of X on M (a), and the effect of M on Y (b)), the interpretation of the total effect is straightforward. However, when the effects have mixed or negative signs, interpreting the total effect can become confusing.
For instance, consider the following model: X: Chronic Stress, M: Sleep Quality, Y: Depression Symptoms. Theoretically, all paths (a, b, c’) are expected to be negative. In this case, the indirect effect (a*b) should be positive. Now, assume the indirect effect is 0.150, and the direct effect is -0.150. The total effect would then be zero. This implies the overall impact of chronic stress on depression symptoms is null, which seems illogical given the theoretical assumptions.
Let’s take another example with mixed signs: X: Social Support, M: Self-Esteem, Y: Anxiety. Here, the paths for a and c’ are theoretically positive, while b is negative. The indirect effect (a*b) should also be negative. If the indirect effect is -0.150 and the direct effect is 0.150, the total effect would again be zero, suggesting no overall impact of social support on anxiety.
This leads to several key questions:
1. Does a negative indirect effect indicate a reduction in the impact of X on Y, or does it merely represent the direction of the association (e.g., social support first improves self-esteem, which in turn reduces anxiety)? If the second case holds, should we consider the absolute value of the indirect effect when calculating the total effect? After all, regardless of the sign, the mediator still helps to explain the mechanism by which X affects Y.
2. If the indirect effect reflects a reduction or increase (based on the coefficient sign) in the impact of X on Y, and this change is explained by the mediator, then the indirect effect should be added to the direct effect regardless of its sign to accurately represent the overall impact of both X and M.
3. My main question is: Should I use the absolute values of all coefficients when calculating the total effect?
Relevant answer
Answer
Yes, the signs of the direct and indirect effects do matter when calculating the total effect in mediation analysis. Here's how the signs influence the total effect:
Breakdown of Effects in Mediation:
  1. Direct Effect: The effect of the independent variable (X) on the outcome variable (Y) without considering the mediator.
  2. Indirect Effect: The effect of X on Y through the mediator (M). This is calculated as the product of the effect of X on M (a) and the effect of M on Y while controlling for X (b): Indirect effect = a × b.
  3. Total Effect: This is the combined effect of X on Y, accounting for both the direct path and the mediated (indirect) path: Total effect = Direct effect + Indirect effect.
How Signs Matter:
  • If both the direct effect and the indirect effect have the same sign (both positive or both negative), the total effect will increase in magnitude.
  • If the direct effect and indirect effect have opposite signs, they will work against each other, and the total effect will decrease in magnitude or potentially even change direction (depending on the relative sizes of the effects).
Example:
  1. Positive direct effect and positive indirect effect: Direct effect = +0.5; Indirect effect = a × b = +0.3 × +0.4 = +0.12; Total effect = +0.5 + 0.12 = +0.62.
  2. Negative direct effect and positive indirect effect: Direct effect = -0.5; Indirect effect = +0.12; Total effect = -0.5 + 0.12 = -0.38.
  3. Opposing signs: Direct effect = +0.5; Indirect effect = -0.12 (e.g., if a = -0.3 and b = +0.4); Total effect = +0.5 - 0.12 = +0.38.
Interpretation:
  • The signs of the direct and indirect effects influence whether the mediator amplifies or reduces the overall effect of the independent variable on the outcome.
  • If the signs are opposite, the mediator might be suppressing the effect of X on Y, or even reversing it, depending on the magnitude of the indirect effect.
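To see these signed effects reported directly, a minimal sketch with the mediation package in R (simulated data mimicking the chronic-stress example, with negative a, b, and c' paths):
library(mediation)
set.seed(7)
n <- 200
X <- rnorm(n)
M <- -0.5 * X + rnorm(n)            # path a is negative
Y <- -0.4 * M - 0.3 * X + rnorm(n)  # paths b and c' are negative
med_fit <- lm(M ~ X)
out_fit <- lm(Y ~ X + M)
res <- mediate(med_fit, out_fit, treat = "X", mediator = "M", sims = 500)
summary(res)  # ACME (indirect), ADE (direct), and total effect keep their signs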
  • asked a question related to Regression
Question
1 answer
There are many works and programs in which the closeness of measured and calculated response variables assesses regression quality. Is this enough?
Relevant answer
One way to assess the accuracy of the PLS method is by calculating figures of merit (RMSEC and RMSEP alone are not enough), using REP or others, according to the attached articles. Another very effective way is to compare the predicted data with the values found by a reference analytical technique for your method (e.g., HPLC).
  • asked a question related to Regression
Question
2 answers
Hi all,
I was hoping I could get some support with a difficulty in Amos; please note that I am a novice with the software. I have a good model fit, which I now wish to test with the common latent factor (CLF) approach. I am using Amos version 27. I run the model without the CLF and copy the standardized regression weights into Excel. Then I create the CLF, constraining its paths to a common parameter "a" and its regression weight to 1. When I run the model and inspect the standardized regression weights, they are identical to those from the model without the CLF. I think this cannot be correct. I am adding a photo of the model in question.
Any ideas or help would be much appreciated!
Relevant answer
Answer
A similar problem of mine was solved by removing all the optional correlation paths in the structural model.
  • asked a question related to Regression
Question
11 answers
Hello all,
I want to assess the effect of one categorical independent variable, along with a few confounders, on one continuous outcome variable. I fitted a multiple regression and the adjusted R-squared is negative, around -0.002.
I added a few interactions, and it merely moved from -0.002 to -0.02, so I believe linear regression is not a good model here.
The outcome variable is not strictly positive, so I cannot fit a gamma regression, nor does it lie between zero and one for a beta regression. I am not sure what to do in this case.
There is also no multicollinearity.
Sample size = 200
One independent variable
Two confounders
Any input appreciated.
Relevant answer
Answer
There may be a substantive problem (all B's are approximately zero), the correlation matrix may be inconsistent, or there are too many predictors.
As R2 is zero or positive by definition, and Adj. R2 is 1 - (1 - R2)*(NCases - 1)/(NCases - NPredictors - 1), Adj. R2 may become negative when NPredictors is large. For instance, if R2 = 10% and N = 100, Adj. R2 is 1% with 9 predictors, but -0.1% with 10 predictors. If this is the case, remove irrelevant and/or insignificant predictors.
Inconsistent matrices may be the result of pairwise deletion, when the correlations are not calculated on exactly the same cases. In that case, use listwise deletion and/or remove predictors with many missing values.
Finally, if R2 was near zero to begin with, so that all B's were near zero as well, you (and especially your client) have to accept that 'nothing' was found. But remember the old saying: 'absence of proof is not proof of absence'. Null results may occur because there really is nothing in the population, but also because the research simply had not enough power to detect the effects, for instance because of too few cases (note that the 'alpha' error is something other than the 'beta' error).
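The worked example above, computed directly in R:
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)
adj_r2(0.10, 100, 9)   #  0.01    -> +1%
adj_r2(0.10, 100, 10)  # -0.0011  -> about -0.1%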
  • asked a question related to Regression
Question
1 answer
Hi everyone,
I am trying to identify relevant predictors of an outcome in multi-level data. I am wondering what the best approach is here: how do you go about the selection process? By significance in the overall model (recalculate, select again, recalculate again, etc.)? Using a normal stepwise, forward, or backward regression procedure? LASSO (and do I need to model the multi-level structure if the ICC is low)?
Thank you so much!
Relevant answer
Answer
It is recommended that a multi-level model be used to account for the hierarchical structure, even in cases where the intraclass correlation coefficient (ICC) is low.
Variable Selection:
LASSO: This method is useful for selecting predictors, but it is essential to ensure that the multi-level structure is considered.
Stepwise: This method is less ideal for multi-level data, but it can be used with caution.
Significance: It is important to evaluate the significance of predictors in the overall model, but it is crucial to avoid overfitting.
Focus on modeling the hierarchical structure, careful selection, and validation to ensure accuracy.
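A minimal sketch of the LASSO option with the multi-level structure retained, using the glmmLasso package (simulated data; in practice lambda would be chosen by cross-validation or BIC):
library(glmmLasso)
set.seed(3)
d <- data.frame(group = factor(rep(1:20, each = 10)),
                x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
d$y <- 0.8 * d$x1 + rnorm(20)[d$group] + rnorm(200)  # random intercept per group
# Lasso penalty on the fixed effects plus a random intercept for the grouping factor
fit <- glmmLasso(y ~ x1 + x2 + x3, rnd = list(group = ~1),
                 data = d, lambda = 10)
summary(fit)  # irrelevant predictors are shrunk toward exactly zero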
  • asked a question related to Regression
Question
2 answers
I have monthly data of DJIA and S&P 500 (BSE) from 2003-2024 and I wish to test if there are any structural breaks in the data and to identify the same.
Relevant answer
Answer
In the context of testing for structural breaks in time series data, the typical approach involves regressing the dependent variable (e.g., DJIA) on one or more independent variables (e.g., S&P 500) and examining for changes in the relationship over time.
To identify structural breaks, one may utilize the Chow Test in instances where a specific breakpoint is suspected, such as in the case of a known event. Alternatively, the Bai-Perron Test may be employed to identify multiple breakpoints without the necessity of prior knowledge regarding their potential occurrence. These tests assist in determining whether and when a significant change in the relationship between variables has occurred over the period from 2003 to 2024.
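A minimal sketch of both tests in R with the strucchange package (the built-in Nile series, with a well-known break, stands in for the monthly index data):
library(strucchange)
bp <- breakpoints(Nile ~ 1)  # Bai-Perron: multiple breaks at unknown dates
summary(bp)                  # candidate breaks, with BIC for each number of breaks
# Chow test at a single suspected break point (observation 28 here):
sctest(Nile ~ 1, type = "Chow", point = 28)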
  • asked a question related to Regression
Question
2 answers
You are probably collecting data with tools such as interview guides, where every respondent gives different answers/opinions.
Relevant answer
Answer
You can code and arrange the data into various themes. That way, the size of the data is significantly reduced and is more meaningful as well.
  • asked a question related to Regression
Question
3 answers
I am trying to analyse data from a survey examining which variables affect teachers' perceived barriers to incorporating technology into their classrooms. I have 5 predictor variables; however, my DV (perceived barriers) is nominal, and the answers are not mutually exclusive, as participants were allowed to select all options that applied to them. I was hoping to do a multinomial regression, but as the DV categories are not mutually exclusive, this is not possible. I am wondering whether there is a similar analysis that allows for a non-mutually-exclusive DV.
Relevant answer
Answer
It would be worthwhile to start, as Marius Ole Johansen suggests, by becoming curious about the structure of the DV. Do the categories in the variable represent a smaller number of underlying dimensions? Do they form a scale or several scales? You can work on this from the bottom up, using exploratory statistical methods, but also from the top down by using what you know of the situation to identify likely barrier "clusters" that might be treated together.
But I'd explore the DV before you start modelling it, for sure.
  • asked a question related to Regression
Question
2 answers
I would like to use the correct regression equation for conducting objective optimisations using MATLAB's Optimisation Tool.
When using Design Expert, I am presented with the regression equation in terms of either actual factors or coded factors. With the actual factors, however, I am presented with multiple regression equations, since one of my input factors was categorical, with levels Linear, Triangular, Hexagonal, and Gyroid. As a result, I am unsure which regression equation from the actual-factors output to use.
Otherwise, should I use the single regression equation that incorporates all of them? I feel like I'm answering my own question and really should be using the coded-factors equation, but I would like some confirmation.
I used one of the regression equations under "Actual Factors" where Linear is shown, but I fear that this did not incorporate all of the information from the experiment. So any advice would be most appreciated.
Most appreciated!
Relevant answer
Answer
The coded equations enable comparison between the factors in terms of the directions and magnitudes of their effects, simply by comparing the coefficients. The equations in actual terms, on the other hand, are not helpful for that comparison, though they can be used for predicting responses. So you can use either the coded or the actual equations to predict a response, but you have to substitute the factor levels and other regression terms in coded values for the coded equations, or in actual values for the actual equations.
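For reference, the standard coding convention behind the two forms, sketched in R (the low/high values are illustrative):
# Coded value = (actual - midpoint) / half-range, so low maps to -1 and high to +1
code_factor <- function(x, low, high) (x - (low + high) / 2) / ((high - low) / 2)
code_factor(c(10, 15, 20), low = 10, high = 20)  # -1, 0, +1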
  • asked a question related to Regression
Question
3 answers
Why is direct inversion of the multivariate regression equation not preferred, and why are optimization techniques used instead?
Relevant answer
Answer
Can you show your models and explain the variables? I find it strange to have an R^2 of .99 (at least in the social sciences, psychology, and biology...).
Why would you want to predict predictors from your dependent variables? I cannot find a good reason why this would make sense, but again, this may depend on the research field.
How did you come up with the interaction? Don't you have any theory behind the association of the variables?
As you can see, I am quite confused, it would be very helpful to get more information.
  • asked a question related to Regression
Question
4 answers
I'm confused about having a very large R-squared, over 0.9, in my regression. Does it indicate overfitting, multicollinearity, or spurious regression? How can I address it?
Relevant answer
Answer
A high or low adjusted R-squared is not necessarily good or bad. Generally, you get a low adjusted R-squared in cross-sectional data and a high adjusted R-squared in time series analysis. However, in time series analysis, a high adjusted R-squared combined with low or insignificant t-values indicates a multicollinearity problem in the regression analysis.
  • asked a question related to Regression
Question
3 answers
Can I easily do regression with an indicator variable in SPSS? Or is there a website that calculates it online?
Relevant answer
Answer
Onipe Adabenege Yahaya, when posting AI-generated content, please label it as such. You might also care to explain how that particular content is meant to be helpful in answering the question that was posted. TY.
Simindokht Kalani, the two main commands for estimating OLS via SPSS are REGRESSION and UNIANOVA (or GLM). The REGRESSION command is older, and does not support factor variables (i.e., categorical variables). However, indicator variables are special, as they have only 2 categories. So you can include them when using REGRESSION. For convenience, though, it is wise to use 0/1 coding for them.
The UNIANOVA command, on the other hand, does support factor variables. In the GUI, categorical variables are called "fixed factors" IIRC, and quantitative variables are "covariates". In the command syntax, factor variables follow the key word BY and covariates follow the key word WITH.
So, to answer your question, yes, you can easily include indicator variables with either REGRESSION or UNIANOVA in SPSS.
  • asked a question related to Regression
Question
2 answers
I used a Bayesian state space model to analyze the data. The 95% credible interval ranges from 0.0 to 1.2, which prompts the question: is the regression coefficient significantly different from zero?
Relevant answer
Answer
No. Since your interval includes 0.0, you can't say with 95% certainty that the regression coefficient is different from 0.
  • asked a question related to Regression
Question
5 answers
I am getting a positive correlation between two variables, but in a multiple linear regression I find the regression coefficient between those two variables to be negative. Is this normal?
Relevant answer
Answer
Well, the results can be said to be normal, given that the two analyses answer related but distinct questions.
Specifically, the correlation matrix is a diagnostic analysis that shows the degree of relationship between the independent variables and the outcome variable(s), both individually and cumulatively. The multiple linear regression model, on the other hand, is an estimation technique that examines the impact or effect of the explanatory variables on the outcome variable of the study. The key similarity is that both examine relationships between the explanatory variables and the explained variable(s). The major difference is that, while correlation matrix coefficients are used to determine the presence or absence of high correlations among variables, the coefficients in the multiple linear regression (MLR) model are used to ascertain the explanatory power of the independent parameters on the outcome variable of the study.
It should be noted, therefore, that a correlation coefficient can serve as a signal of the likely significance level of the corresponding coefficient in the MLR model. However, variations normally exist between the two, especially regarding significance levels (p-values) and the corresponding signs, which can be negative or positive. A difference in sign (+ or -) between the correlation and the regression coefficient is therefore a normal scenario, particularly in panel data analysis, but such variations must be backed by strong justification from practical, theoretical, and methodological perspectives.
I hope this could be of great help in your data analysis.
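A minimal sketch of how such a sign flip can arise (classic suppression), with simulated data in R:
set.seed(11)
x1 <- rnorm(500)
x2 <- 0.9 * x1 + rnorm(500, sd = 0.3)  # x2 strongly correlated with x1
y  <- 2 * x2 - 1 * x1 + rnorm(500)
cor(x1, y)             # positive: in the bivariate view, x1 carries x2's effect
coef(lm(y ~ x1 + x2))  # x1's partial coefficient is negative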
  • asked a question related to Regression
Question
4 answers
Hi my peers,
I have a study with panel data from 2009 to 2018 examining a causal relationship. I need to highlight the temporal effect in the regression results. How can I do that?
I tried two methods, but each produced a somewhat different result.
The first was creating year dummies with the command: tab year, gen(yr_). This produces ten year dummies.
The second method was using i.year, which produced only 9 dummies, omitting the first year, i.e., 2009. The regression results are largely unchanged, but fewer years are significant with the i.year specification.
So, please advise:
I don't know which of these is the correct approach, or whether there is another way to show the temporal effects in the study findings.
Thank you in advance for your help
Relevant answer
Answer
To demonstrate temporal effects in regression results:
1. Use time dummy variables to capture seasonality or periodic trends.
2. Include a time trend variable to account for gradual changes over time.
3. Test for interactions between time variables and other factors.
4. Plot the dependent variable over time, or the residuals against time, to visualize trends.
5. Consider lagged effects or autocorrelation adjustments if applicable.
Note also that your two methods are essentially equivalent: i.year simply omits 2009 as the base category (including all ten dummies alongside the constant would be perfectly collinear), so the year coefficients are interpreted relative to 2009.
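A quick R illustration of the base-category point (R's factor() behaves like Stata's i.year, dropping one level):
set.seed(2)
d <- data.frame(year = factor(rep(2009:2018, each = 5)), x = rnorm(50))
d$y <- d$x + rnorm(50)
coef(lm(y ~ x + year, data = d))  # 9 year coefficients; 2009 is the base level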
  • asked a question related to Regression
Question
4 answers
I performed a panel data regression analysis and selected the FEM model, and I also used GLS to address autocorrelation and heteroscedasticity. However, my R-squared value is so high that it almost reaches 1. Is that plausible? My lecturer doubts it.
Relevant answer
Answer
A very high R-squared in panel data regression indicates that the independent variables explain a large portion of the variation in the dependent variable, which is generally good. However, it could also suggest overfitting or issues with data quality or model complexity. Further analysis is needed to ensure the robustness and validity of the results.
  • asked a question related to Regression
Question
6 answers
I ran a linear regression of X (the independent variable) on M (the mediator),
then used survival regression to fit X to Y (the dependent variable).
With these questions:
a. How do I correctly do a mediation analysis from X to Y through M with survival regression?
b. If the mediate() function is applicable here, why are the results so weird? I.e., the ACME and ADE are very large and have negative values.
c. If the negative values are fine, how do I explain them? As I understand it, they might be explained as suppression effects.
I'm new to mediation analysis and I'm using mediate() in R. My results are very strange and I'm not sure they are correct. I haven't found a very detailed treatment of mediation analysis with survival regression, so any discussion is very welcome, and if anyone can give me some hints I would appreciate it!
Here is the code:
library(survival)   # survreg(), Surv()
library(mediation)  # mediate()
# Mediator model: linear regression of the (scaled) mediator on the IV and covariates
mediator_formula <- paste("scale(", mediator_var, ") ~ ", iv_name, " + ", paste(covariates, collapse = " + "))
model_mediator <- lm(as.formula(mediator_formula), data = data_with_residuals)
lm_sum <- summary(model_mediator)
# Outcome model: parametric survival regression of Surv(time, status) on the IV, mediator, and covariates
model_dv_formula <- paste("Surv(time, status) ~ ", iv_name, " + ", "scale(", mediator_var, ")", " + ", paste(covariates, collapse = " + "))
model_dv <- survreg(as.formula(model_dv_formula), data = data_with_residuals)
surv_sum <- summary(model_dv)
# Mediation: the mediator name passed to mediate() must match the term used in both models
mediator_name <- paste("scale(", mediator_var, ")", sep = "")
mediation_results <- mediate(model_mediator, model_dv, treat = iv_name, mediator = mediator_name, sims = 500)
Relevant answer
Answer
Be careful when using a ratio scale, especially for parametric tests in any inferential data analysis. The rule is that the dependent variable must be on a ratio scale and must be normally distributed. If the dependent variable is not normally distributed, transform the data using a log transformation or another method.
  • asked a question related to Regression
Question
4 answers
Hello,
I have a question regarding the interpretation of the results of an experiment I conducted. Each participant answered 4 questions measuring motivation, satisfaction, help, and collaboration (my dependent variables) in 7 different scenarios (my independent variables). To analyze the results, I used three methods: a Wilcoxon signed-rank test; a regression with standard errors clustered at the individual level (CRSE), to control for individual heterogeneity; and an ordinal regression (using GENLIN), to account for the ordinal nature of the dependent variable.
The aim of this analysis was to verify if the significant results obtained with the Wilcoxon test were consistent across the other two methods. I conclude that significant results found with the Wilcoxon test, if they are also significant in the other two regressions, are robust.
Conversely, if an effect is significant in the Wilcoxon test and in the regression with CRSE (standard errors clustered at the individual level), but not in the ordinal regression (GENLIN ordinal), I consider the effect not robust: the result is not consistent across the three tests, so there is an indication of the effect, but only a weak one.
I am wondering how to properly interpret this ? What does it really mean ?
The majority of my results are robust, but in some scenarios significant effects on certain dependent variables are no longer significant in the ordinal regression, although they are in the Wilcoxon test and the regression with clustered standard errors. I am wondering why this happens and how to explain it.
I am working with SPSS version 27. Could you help me better understand these results and their interpretation?
Thank you in advance for your help.
Relevant answer
Answer
Really, my personal opinion is that the whole approach does not make much sense.
1) The different models test different statistical hypotheses. Therefore, even IF all three approaches showed a significant result, this would not mean that the results converge, nor that the general result is robust. You cannot conclude that; it's like comparing apples with pears.
2) You state yourself that the DV is ordinal. In that case, using an approach for metric variables, even if you account for the clustered data, is not optimal and, again, tests something different than the Wilcoxon test.
3) Which ordinal approach did you use, logit or probit regression? I have never used it with SPSS, so I do not know the options there.
4) Again, you stated yourself that you did not account for the clustered data in your ordinal approach. How do you think that result is comparable to the other two?
5) You say: "I do know how to do an ordinal regression with clustered data on spss". Did you mean that you don't know how to do it? If so, why did you do it in the first place, if you already knew it was the wrong approach? I am really puzzled... Basically, you knew from the beginning that 2 of the 3 analyses are not suitable for your data (one is not for ordinal data and the other does not account for clustered data); what do you expect the results to tell you? Garbage in, garbage out.
Without knowing more, I would use a probit regression to account for the ordinal data structure. The model assumes a normally distributed latent variable (it uses the normal CDF instead of the logistic CDF, which makes the interpretation easier), which seems reasonable if you already used an OLS regression and thought it would work. To account for the clustered data (repeated measures), I would use a linear mixed model (a.k.a. multilevel model). I don't know whether this is possible in SPSS, but it surely is in R with the ordinal package (I haven't used it so far, but it is capable of cumulative-link mixed models and uses the lme4 syntax), or go Bayesian (which I would recommend) and use the brms package, which also uses the lme4 syntax.
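A minimal sketch of such a cumulative-link mixed model with the ordinal package (its bundled wine data stand in for the scenario ratings; probit link, random intercept per judge):
library(ordinal)
data(wine)  # rating is an ordered factor; judge is the clustering unit
fit <- clmm(rating ~ temp + contact + (1 | judge),
            data = wine, link = "probit")
summary(fit)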
  • asked a question related to Regression
Question
2 answers
I'm working on a cancer model in mice. In my free-drug treatment group, tumor growth is slow compared to the control (saline) group. But in the nanoparticle-treated group, the tumor underwent apoptosis and disappeared completely while the study was ongoing. In this situation, how can the study be justified if, in the paper, we submit histology only from the free-drug-treated group and not the nanoparticle-treated group?
Should I re-plan the study for the nanoparticle-treated group alone and remove the tumor just before it disappears from the site of implantation, so that I can at least perform histology on apoptotic tissue?
Relevant answer
Answer
When providing a rationale for studying the total regression of a tumor, it is crucial to highlight the importance of histology in comprehending the underlying mechanisms and verifying the full regression. Here are a few factors you might take into account:
1. Significance of histology: Histology offers intricate insights into the cellular and tissue structure, enabling a thorough evaluation of the tumor and its reaction to treatment. It aids in verifying the lack of remaining tumor cells and evaluating the degree of tumor regression.
2. Treatment response validation: Histological investigation can confirm the clinical response to treatment by verifying the absence of live tumor cells. It is essential to demonstrate the efficacy of the treatment regimen and ensure accurate assessment of treatment outcomes.
3. Mechanistic understanding of regression: Histology can offer valuable insights into the underlying mechanisms that drive tumor regression. This technique enables the evaluation of cellular and molecular alterations occurring in the tumor microenvironment, such as the infiltration of immune cells, the formation of new blood vessels, and programmed cell death. Gaining insight into these pathways can provide valuable guidance for future treatment approaches and contribute to the advancement of individualized medicines.
4. Documentation and reporting: Histological evaluation offers a uniform and unbiased assessment of tumor regression. It enables precise documenting and reporting of therapy response, allowing for comparison with other trials and aiding data analysis for future research.
To summarize, the justification for studying cases where a tumor completely regresses and the necessity of demonstrating histology involves highlighting the significance of histology in verifying the absence of any remaining tumor cells, validating the effectiveness of treatment, comprehending the mechanisms of regression, and guaranteeing precise documentation and reporting of treatment results.
  • asked a question related to Regression
Question
5 answers
In what situations should each of these be used? Please explain; I am having difficulty with this.
Relevant answer
Answer
The ARDL (Autoregressive Distributed Lag) model and its related concepts—ARDL Error Correction Model (ECM) and ARDL Long Run Bounds Test—are different approaches used in econometrics to analyze relationships between variables, particularly in the context of time series data. Here’s how they differ:
  1. ARDL Model (Autoregressive Distributed Lag Model): Definition: a dynamic regression model that includes lagged values of both the dependent and independent variables. Purpose: to examine long-run relationships between variables that may be cointegrated, meaning they move together over time. Structure: the model includes lagged levels of the dependent variable and lagged differences of both the dependent and independent variables. Usage: ARDL models are suitable when the variables are integrated of different orders (e.g., one is I(1) and another is I(0)), allowing long-run relationships to be tested even when short-run dynamics are present.
  2. ARDL Error Correction Model (ARDL ECM): Definition: the ARDL ECM extends the basic ARDL model by incorporating an error correction term. Purpose: to capture short-term dynamics and adjustments towards long-run equilibrium following a deviation from it. Structure: the error correction term represents the speed of adjustment towards equilibrium, capturing the short-term deviations from the long-run relationship. Usage: the ARDL ECM is particularly useful when variables are cointegrated, indicating a long-run relationship, and when there is evidence of short-run dynamics that need to be accounted for.
  3. ARDL Long Run Bounds Test: Definition: a procedure used to determine the existence and direction of long-run relationships between variables. Purpose: to test whether cointegration exists between variables and, if so, whether the relationship is positive or negative. Procedure: the test involves estimating the ARDL model for different combinations of lag lengths, then testing the significance of the coefficients to determine whether they support the existence of a long-run relationship. Usage: the bounds test helps researchers identify appropriate lag lengths in the ARDL model and ascertain whether the relationship between variables is stable in the long run.
Key Differences:
  • Model Structure: ARDL is a general model structure that includes lagged levels and differences of variables. ARDL ECM adds an error correction term to capture short-term adjustments towards long-run equilibrium.
  • Focus: ARDL and ARDL ECM focus on estimating long-run relationships and dynamics, while the ARDL Long Run Bounds Test specifically tests for the existence and direction of these relationships.
  • Application: ARDL and ARDL ECM are regression models used for estimation and inference, while the ARDL Long Run Bounds Test is a diagnostic test used to determine the appropriate lag structure and the presence of long-run relationships.
In summary, ARDL is the general framework for modeling long-run relationships, ARDL ECM incorporates short-term dynamics through an error correction term, and the ARDL Long Run Bounds Test is a diagnostic tool to test for cointegration and determine the appropriate lag structure in the ARDL model. Each serves a distinct purpose in analyzing time series data and understanding the relationships between variables over time.
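A minimal sketch of this workflow in R with the ARDL package (its bundled denmark dataset stands in for the user's series):
library(ARDL)
data(denmark)
models <- auto_ardl(LRM ~ LRY + IBO + IDE, data = denmark, max_order = 5)
best <- models$best_model      # ARDL order selected by information criterion
ecm <- uecm(best)              # the ARDL in unrestricted error-correction form
bounds_f_test(best, case = 2)  # Pesaran bounds F-test for a long-run relationship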
  • asked a question related to Regression
Question
7 answers
Can I run a mixed-effects logistic regression with only one single outcome variable, which is a time-varying covariate? There is no confounder for the study.
Or should I have more than one independent variable?
Is there any article related to mixed-effects logistic regression with a time-varying covariate that you could send me?
I want to write up the statistics and I need some references.
Thanks
Relevant answer
Answer
I assume this question is a modification or elaboration of the question posted here:
If so, please delete the first question.
Second, a time-dependent covariate is an explanatory variable that can change over time, while an outcome variable is a dependent variable. So it is not possible for your outcome variable to be a time-dependent covariate. It would help to clarify things if you posted a small dataset that shows what your data look like. CSV format would be good, as it would be accessible to everyone regardless of what statistical software they use. What software do you use, by the way?
  • asked a question related to Regression
Question
2 answers
Hello,
I ran a questionnaire in which all participants answered four questions measuring satisfaction, motivation, collaboration, and help in seven different scenarios. The aim was to measure the dependent variables in the first scenario and see how they evolved across the scenarios. The dependent variables were measured on a 7-point Likert scale (from strongly disagree to strongly agree). All participants answered my questionnaire, so it is a within-subject design.
I created dummies for the scenarios (1 through 7), and also for the gender variable (female, male) and for occupational situation (employed, unemployed, student, retired, other, self-employed).
First, I performed a Wilcoxon test. Then I performed an ordinal regression. But I don't understand how to interpret the values in the "Estimate" column of the "Parameter Estimates" table.
For example, the Wilcoxon tests showed that scenarios 2 and 3 had a significantly negative impact on satisfaction, whereas scenario 6 had a significantly positive impact. But here I see that for scenarios 2 and 3 the "Estimate" is positive, not negative. I don't understand.
When I run the ordinal regression in SPSS and select the dummies for scenario (except scenario 1, which I use as the basis for comparison), I get this:
The table shows:
Scenario 2 = 0 Estimate: 1.037 p: <.001
Scenario 2 = 1 "This parameter is set to 0 because it is redundant"
I would rather have expected a negative value for scenario 2, as the Wilcoxon test and simple linear regression had shown me.
And when I don't put in the dummies, but just enter the scenario variable, I get this:
Scenario = 2 Estimate: -1.118 p: <.001
This seems more logical to me than the other Wilcoxon and regression tests I had done before. But the problem is that the table shows no value for scenario 7:
Scenario = 7 "This parameter is set to 0 because it is redundant"
Sorry, I'm a beginner and I'm still learning SPSS.
If anyone can help me, that would be very kind.
Relevant answer
Answer
Thank you very much
  • asked a question related to Regression
Question
1 answer
In Amos, I am explaining a dependent construct with one exogenous construct and get a squared multiple correlation of .48 for the dependent construct. I then add a second exogenous construct as an additional predictor, and the squared multiple correlation for the dependent construct falls to .37. How is that possible? When I do a normal regression, the R2 goes up, as it should.
I have asked two local experts here that could not help. Any help here? Thanks in advance!
Relevant answer
Answer
Hi Klaus, I don't know anything about Amos, but my guess would be that it has to do with what kind of R2 you are looking at.
The classical R2 measures how well your model fits the data it was trained on. This is sometimes called "calibrated" explained variance and, as you say, it can only go up or remain the same by adding additional explanatory variables.
However, this is not true for another type of R2, the one that measures how well your model generalizes to new data (so-called "validated" explained variance). This can be obtained by applying the model to a new dataset or by leaving out part of the original dataset for this purpose (cross-validation). In this case, it is theoretically possible for the R2 to decrease when adding a new variable, if the second variable does not bring additional information relevant to explaining the new data (e.g., it might just fit noise in the original dataset rather than some underlying trend).
My guess would be that your software by default runs some kind of cross-validation on your model, so the R2 you get is in fact the validated explained variance. If the value goes down when you add the second variable, it might indicate that the new variable does not provide useful information for predicting new data but instead results in overfitting.
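A minimal demonstration of the distinction in R: training R2 never decreases when a pure-noise predictor is added, but leave-one-out (validated) R2 can.
set.seed(9)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))  # x2 is pure noise
d$y <- d$x1 + rnorm(n)
summary(lm(y ~ x1, d))$r.squared       # training R2, model 1
summary(lm(y ~ x1 + x2, d))$r.squared  # never lower than model 1
cv_r2 <- function(form) {              # leave-one-out cross-validated R2
  press <- sum(sapply(seq_len(n), function(i) {
    (d$y[i] - predict(lm(form, data = d[-i, ]), d[i, , drop = FALSE]))^2
  }))
  1 - press / sum((d$y - mean(d$y))^2)
}
cv_r2(y ~ x1); cv_r2(y ~ x1 + x2)      # the second may come out lower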
  • asked a question related to Regression
Question
4 answers
How do we evaluate the importance of individual features for a specific property using ML algorithms (say, GBR), and how do we construct an optimal feature set for the problem?
image taken from: 10.1038/s41467-018-05761-w
Relevant answer
Answer
You can do it in many ways. PCA is a nice way to gather important parameters. Another way would be to train multiple models with and without specific features and see how that influences the error. Correlations can also help. However, in most cases you need to use your head and see which parameters affect your results, and why and how. In some cases ANOVA is a nice technique, but only if you think rather than blindly trust the results. For example, speed in metres and speed in centimetres are both just speed, so using one of them is enough. I know that was a silly example, but it shows the point. Know your data, analyse what impacts the results, and you will do great. Good luck; I hope this helps even a bit.
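A minimal sketch of one such approach in R, reading relative feature influence off a gradient boosting model (gbm package, simulated data):
library(gbm)
set.seed(5)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
d$y <- 2 * d$x1 + 0.5 * d$x2 + rnorm(300)  # x3 is pure noise
fit <- gbm(y ~ ., data = d, distribution = "gaussian",
           n.trees = 500, interaction.depth = 2, shrinkage = 0.05)
summary(fit, n.trees = 500)  # relative influence of each feature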
  • asked a question related to Regression
Question
4 answers
Hello all,
I am running into a problem I have not encountered before with my mediation analyses. I am running a simple mediation X > M > Y in R.
Generally, I concur that the total effect does not have to be significant for there to be a mediation effect, and in the case I am describing this would be a logical occurrence, since the a and b paths are both significant and are respectively -.142 and .140, thus resulting in a 'null effect' for the total effect.
However, my c path X > Y is not merely 'non-significant' as I would expect; rather, the regression does not fit (see below):
(Residual standard error: 0.281 on 196 degrees of freedom Multiple R-squared: 0.005521, Adjusted R-squared: 0.0004468 F-statistic: 1.088 on 1 and 196 DF, p-value: 0.2982).
Usually I would say you cannot interpret models that do not fit, and since this path is part of my model, I hesitate to interpret the mediation at all. However, the other paths do fit and are significant. Could the non-fitting also be a result of the paths cancelling one another?
Note: I am running bootstrapped results for the indirect effects, but the code does utilize the 'total effect' path, which does not fit on its own, therefore I am concerned.
Note 2: I am working with a clinical sample, therefore the samplesize is not as great as I'd like group 1: 119; group2: 79 (N = 198).
Please let me know if additional information is needed and thank you in advance!
Relevant answer
Answer
Somehow it is not clear to me what you mean by "does not fit". Could you please provide the output of the whole analysis? I think this would be helpful.
  • asked a question related to Regression
Question
2 answers
I am evaluating the impacts of major health financing policy changes in Georgia (the country). The database is household-level and it is not panel data. The continuous outcome variable is out-of-pocket health spending (OOPs), and it exhibits a skewed distribution as well as seasonality. The residuals are positively autocorrelated. The regression also takes independent variables connected with each household's characteristics into account. My goal is to evaluate the impact of health policies on the financial wellbeing of the population in connection with health care utilization determinants. Should I aggregate the dataset or keep it as it is?
Relevant answer
Answer
Thank you for the information
  • asked a question related to Regression
Question
9 answers
Here is the case: as I said, I am working on how macroeconomic variables affect REIT index returns. Which tests or estimation methods should I use to understand how macroeconomic variables affect REITs?
I know I can use OLS, but is there any other method to use? All my time series are stationary at I(0).
Relevant answer
Answer
You can use econometric methods such as regression analysis, Vector Autoregression (VAR), or Granger causality tests to analyze how macroeconomic variables affect REIT index returns.
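As a minimal sketch of how the VAR and Granger-causality suggestions look in Python (statsmodels), with random placeholder series standing in for the real REIT and macro data:
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import grangercausalitytests

# Placeholder stationary series; replace with the real I(0) data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'reit_return': rng.normal(size=200),
    'inflation': rng.normal(size=200),
    'rate_change': rng.normal(size=200),
})

# Vector autoregression over all series jointly, lag order chosen by AIC
var_res = VAR(df).fit(maxlags=8, ic='aic')
print(var_res.summary())

# Granger test: do lags of 'inflation' help predict 'reit_return'?
grangercausalitytests(df[['reit_return', 'inflation']], maxlag=4)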
  • asked a question related to Regression
Question
10 answers
In the domain of clinical research, where the stakes are as high as the complexities of the data, a new statistical aid emerges: bayer: https://github.com/cccnrc/bayer
This R package is not just an advancement in analytics - it’s a revolution in how researchers can approach data, infer significance, and derive conclusions
What Makes `Bayer` Stand Out?
At its heart, bayer is about making Bayesian analysis robust yet accessible. Born from the powerful synergy with the wonderful brms::brm() function, it simplifies the complex, making the potent Bayesian methods a tool for every researcher’s arsenal.
Streamlined Workflow
bayer offers a seamless experience, from model specification to result interpretation, ensuring that researchers can focus on the science, not the syntax.
Rich Visual Insights
Understanding the impact of variables is no longer a trudge through tables. bayer brings you rich visualizations, like the one above, providing a clear and intuitive understanding of posterior distributions and trace plots.
Big Insights
Clinical trials, especially in rare diseases, often grapple with small sample sizes. `Bayer` rises to the challenge, effectively leveraging prior knowledge to bring out the significance that other methods miss.
Prior Knowledge as a Pillar
Every study builds on the shoulders of giants. `Bayer` respects this, allowing the integration of existing expertise and findings to refine models and enhance the precision of predictions.
From Zero to Bayesian Hero
The bayer package ensures that installation and application are as straightforward as possible. With just a few lines of R code, you’re on your way from data to decision:
# Installation
devtools::install_github("cccnrc/bayer")

# Example Usage: Bayesian Logistic Regression
library(bayer)
model_logistic <- bayer_logistic(
  data = mtcars,
  outcome = 'am',
  covariates = c('mpg', 'cyl', 'vs', 'carb')
)
You then have plenty of functions to further analyze your model; take a look at bayer.
Analytics with An Edge
bayer isn’t just a tool; it’s your research partner. It opens the door to advanced analyses like IPTW, ensuring that the effects you measure are the effects that matter. With bayer, your insights are no longer just a hypothesis — they’re a narrative grounded in data and powered by Bayesian precision.
Join the Brigade
bayer is open-source and community-driven. Whether you’re contributing code, documentation, or discussions, your insights are invaluable. Together, we can push the boundaries of what’s possible in clinical research.
Try bayer Now
Embark on your journey to clearer, more accurate Bayesian analysis. Install `bayer`, explore its capabilities, and join a growing community dedicated to the advancement of clinical research.
bayer is more than a package — it’s a promise that every researcher can harness the full potential of their data.
Explore bayer today and transform your data into decisions that drive the future of clinical research: bayer - https://github.com/cccnrc/bayer
Relevant answer
Answer
Many thanks for your efforts!!! I will try it out as soon as possible and will provide feedback on github!
All the best,
Rainer
  • asked a question related to Regression
Question
2 answers
SVM regression based on PSO optimisation
Relevant answer
Answer
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from pyswarm import pso

# Note: this example tunes an SVM *classifier*; for SVM regression, swap SVC
# for sklearn.svm.SVR and score with e.g. 'neg_mean_squared_error'.

# Create a simple dataset
X, y = make_classification(n_samples=100, n_features=5, n_informative=3, n_redundant=2, random_state=42)

# Define the objective function to be minimized by PSO
def svm_pso_loss(params):
    C, gamma = params
    # Ensure the parameters are positive and within a reasonable range
    if C <= 0 or gamma <= 0:
        return float('inf')  # return infinity if parameters are out of bounds
    # Define the SVM classifier
    model = SVC(C=C, gamma=gamma)
    # Negative mean cross-validated score (we need to minimize it)
    neg_accuracy = -cross_val_score(model, X, y, cv=5).mean()
    return neg_accuracy

# Set bounds for C and gamma
lb = [0.01, 0.001]  # lower bounds of C and gamma
ub = [100, 10]      # upper bounds of C and gamma

# Run PSO
best_params, best_score = pso(svm_pso_loss, lb, ub, swarmsize=50, maxiter=100)
print("Best Parameters: C={}, gamma={}".format(best_params[0], best_params[1]))
print("Best Score: {}".format(-best_score))
  • asked a question related to Regression
Question
1 answer
I have three variables (A, B, C) and do a multilevel SEM with R - Lavaan.
I do not understand why the following two models render different regression coefficients:
In the first one I use the already-aggregated composite variables from the data sheet directly; in the second one I define the latent variables within the model, but the underlying data are of course the same.
Could anybody please explain why that is and which model would be the right one to use?
1.) "
level: 1
A ~ B + C
level: 2
A ~ B + C
"
2.)"
level: 1
A =~ a1 + a2 + a3
B =~ b1 + b2 + b3 + b4
C =~ c1 + c2 + c3
A ~ B + C
level: 2
A =~ a1 + a2 + a3
B =~ b1 + b2 + b3 + b4
C =~ c1 + c2 + c3
A ~ B + C
"
thanks so much for any help!
Relevant answer
Answer
Hello! I have the same question - did you find out the cause of the problem? I get really weird results when defining my latent variables in the SEM model, but just fine results when I use the aggregated variables.
  • asked a question related to Regression
Question
2 answers
Dear All,
I have imagery with a single fish (one species) in each image, along with a list of morphometric measurements of the fish (length, width, length of tail, etc.). I would like to train a CNN model that predicts these measurements with only the images as input. Any ideas what kind of architecture is ideal for this task? I read about multi-output learning, but I haven't found a practical implementation in Python.
Thank you for your time.
Relevant answer
Answer
Thank you Aldo for your suggestion. I can see the general framework.
Cheers!
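For readers with the same question, here is a minimal sketch (my own, not the framework referred to above) of a multi-output CNN regressor in Keras: one image in, several morphometric measurements out; the input size and measurement count are placeholders:
import tensorflow as tf

n_measurements = 4  # e.g. length, width, tail length, ... (placeholder)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(n_measurements),  # linear outputs, one per measurement
])
model.compile(optimizer='adam', loss='mse')
# model.fit(images, measurements, ...) with arrays of shape
# (n_samples, 128, 128, 3) and (n_samples, n_measurements)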
  • asked a question related to Regression
Question
2 answers
I have collected data at the community level using cluster sampling. The ICC shows >10% variability at the cluster level. However, I don't have any relevant variables at the cluster level (all variables are at the household and individual levels).
Can I still run a multilevel regression without having a cluster-level variable?
Thanks!
Relevant answer
Answer
Yes, you can run a multilevel model without level 2 predictors. This is sometimes referred to as a random coefficient regression analysis. In that analysis, you would simply model potential variability in the level-1 regression intercept and/or slope coefficients across clusters (level-2 units) without explaining that variability by level-2 predictors. This allows you to properly take into account the non-independence that arises from cluster sampling that is shown by your ICC.
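A minimal sketch of such a random coefficient model in Python (statsmodels MixedLM), with simulated stand-ins for households nested in clusters:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_clusters, n_per = 30, 40
cluster = np.repeat(np.arange(n_clusters), n_per)
u = rng.normal(0, 0.5, n_clusters)[cluster]        # cluster-level random intercepts
x = rng.normal(size=n_clusters * n_per)            # household-level predictor
y = 1.0 + 0.6 * x + u + rng.normal(size=x.size)
df = pd.DataFrame({'y': y, 'x': x, 'cluster': cluster})

# Random intercept per cluster; add re_formula='~x' for a random slope as well
m = smf.mixedlm('y ~ x', df, groups=df['cluster']).fit()
print(m.summary())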
  • asked a question related to Regression
Question
1 answer
My topic is the study of the energy status of construction materials, which is why I need all their parameters: calculation, comparison, regression, correlation, etc.
If you have any ideas about these, please share.
Thank you all.
Relevant answer
Answer
Certainly! Conducting a study on the energy status of construction materials is a fascinating and relevant topic. Here are some reference databases and research articles that you can explore to gather information on energy status, efficiency, and related parameters of construction materials:
Research Articles and Journals:
  1. Energy and Buildings - A peer-reviewed journal focusing on the energy efficiency, sustainability, and environmental impacts of buildings and construction materials.
  2. Construction and Building Materials - A journal covering research on the properties, performance, and sustainability of construction materials, including energy-related aspects.
  3. Journal of Cleaner Production - A multidisciplinary journal publishing research on cleaner production, sustainable development, and energy efficiency in various industries, including construction.
  4. Applied Energy - A journal focusing on energy engineering, energy efficiency, and sustainable energy systems. It includes articles related to the energy status and efficiency of construction materials.
  5. Building Research & Information - A journal publishing research on building science, construction technology, and sustainable building practices, including energy-efficient materials and systems.
Keywords to Search:
When searching these databases and journals, consider using the following keywords and phrases to find relevant articles and research papers:
  • Energy efficiency in construction materials
  • Energy status of building materials
  • Thermal properties of construction materials
  • Life cycle energy analysis of materials
  • Embodied energy of building materials
  • Energy-efficient construction materials
  • Sustainable construction materials
  • Energy consumption and efficiency in construction
  • asked a question related to Regression
Question
2 answers
Hello everyone and thank you for reading my question.
I have a data set with around 2000 data points. It has 5 inputs (4 well rates, and the 5th is time) and 2 outputs (cumulative oil and cumulative water). See the attached image.
I want to build a proxy model to simulate the cumulative oil & water.
I have built 5 models (ANN, Extreme Gradient Boosting, Gradient Boosting, Random Forest, SVM) and used GridSearch to tune the hyperparameters, and the training results are good. Of course, I split the data into training, test and validation sets.
I also have another data set that I did not include in any of the training, test and validation sets, and when I use the models to predict the output for this data set, the results are bad (the models fail to predict).
I think the problem lies in the data itself, because the only input parameter that changes is the time (days) parameter, while the others remain constant.
But I can't remove the well rates or join them into a single variable, because after the proxy model has been built I want to optimize the well rates to maximize cumulative oil and minimize cumulative water, respectively.
Is there a solution to such an issue?
Relevant answer
Answer
To everyone who faced this problem: this type of data is time-series data, which calls for specific algorithms for building the proxy models (e.g., RNN, LSTM).
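For illustration, here is a minimal sketch of an LSTM proxy model in Keras (an assumption about a workable setup, not the solution actually used): sliding windows over the 5 input channels in, the 2 cumulative outputs out:
import numpy as np
import tensorflow as tf

timesteps, n_features, n_outputs = 30, 5, 2
X = np.random.rand(1000, timesteps, n_features).astype('float32')  # placeholder windows
y = np.random.rand(1000, n_outputs).astype('float32')              # placeholder targets

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(timesteps, n_features)),
    tf.keras.layers.Dense(n_outputs),  # cumulative oil and water
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)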
  • asked a question related to Regression
Question
2 answers
Hi,
I am trying to evaluate the impact of gender quotas on women's political engagement. I am using World Values Survey data on different countries over the period 1981-2009. I wish to run a country- and time-fixed-effects regression of gender quotas on a variable while controlling for age. However, age in the survey is divided into categories; how can I recode it for my regression? Should I use binning to control for age, or should I use the mean values of the categories?
Relevant answer
Answer
If age is an ordinal (ordered categorical) variable in your data set, you can simply add that variable as another predictor to your regression model.
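Both recoding options from the question are easy to compare in practice; here is a minimal sketch in Python (pandas/statsmodels) with hypothetical category labels and midpoints (the real model would of course add the country and time fixed effects):
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'engagement': [3.1, 4.2, 2.8, 3.9, 4.5, 2.5, 3.3, 4.0],
    'age_cat': ['18-29', '30-44', '45-59', '18-29', '60+', '45-59', '60+', '30-44'],
})

# Option A: keep age categorical/ordinal - one dummy per category
m_cat = smf.ols('engagement ~ C(age_cat)', data=df).fit()

# Option B: substitute category midpoints and treat age as continuous
midpoints = {'18-29': 23.5, '30-44': 37.0, '45-59': 52.0, '60+': 67.0}
df['age_mid'] = df['age_cat'].map(midpoints)
m_mid = smf.ols('engagement ~ age_mid', data=df).fit()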
  • asked a question related to Regression
Question
12 answers
It is known that we can use regression analysis to limit the confounding variables affecting the main outcome. But what if the entire sample shares a confounding variable affecting the main outcome: will regression analysis still be applicable and reliable?
For example, suppose a study investigated the role of a certain intervention in cognitive impairment, and the entire population included was old-aged (more than 60 years old), which means that age here is a risk factor (covariate) in the entire sample; and it is well known that age is a significant independent risk factor for cognitive impairment.
My question is: will the regression here be of real value? Will it totally remove the effect of age and give us the clear effect of the intervention on cognitive impairment?
Relevant answer
Answer
Yes, of course: adjusting for age will remove the confounding effect of age. In fact, adjusting the model for a confounding factor is one of two main ways to remove its effect when checking the effect of another variable.
Look at this example:
if the equation of the model with only the cognitive score as x and the outcome as y is:
y = 1 + 5*x1
and the equation of a model with only age is:
y = 2 + 3*x2
then, assuming that both variables have additive effects (which may not be true), the combined equation should be something like this:
2*y = 1 + 5*x1 + 2 + 3*x2 = 3 + 5*x1 + 3*x2
which can be written as: 2*y/2 = (3 + 5*x1 + 3*x2)/2
y = 1.5 + 2.5*x1 + 1.5*x2
And adding more and more variables to the model will definitely affect the coefficients.
The other way is stratification: you can select a homogeneous subsample and fit a model to it, and so on, as I mentioned in the previous comment.
There is another way, propensity score matching / propensity-weighted analysis, which uses the same kind of regression model and derives weights for the patients that can then be used in a weighted analysis or scoring analysis (but I don't encourage it; in most cases it doesn't work unless you have a large sample).
That is the general picture, but the question remains: are these results conclusive? Absolutely not. The only designs that can eliminate confounding factors are randomized controlled trials or very large population samples such as big clinical registries or a census.
In my opinion, and from experience, most studies group patients like this: "<18", "18-45", "46-65", ">65".
In your case it deserves a good letter to the editor to criticize an obvious mistake, and I would ask the authors to send me their data to replicate their results; there is a very big mistake if they did so.
Note also: tooth loss and age are correlated variables and can't be included in the same model, as they will produce a collinearity problem.
  • asked a question related to Regression
Question
4 answers
In using spectral indices to estimate corn yield, why is it that when I put the farm-level average of the index into the equation generated from the regression, the predicted yield is closer to the actual yield, even though the coefficient of determination is weak?
# spectralindices
#predictedyield
#RS
Relevant answer
Answer
Thank you. You helped a lot.
  • asked a question related to Regression
Question
17 answers
I am doing land-use projection using the Dyna-CLUE model, but I am stuck with the error "Regression can not be calculated due to a large value in cell 0,478". I would appreciate any advice you can provide to solve this error.
Relevant answer
Answer
Thanks to your suggestion I identified the variable that was causing the problem.
Do you have any idea about this error?
ERROR: no solution; program terminated
Thanks for your help.
  • asked a question related to Regression
Question
3 answers
I am conducting a meta-analysis and I want to use the nonlinear polynomial regression and splines functions to model the dose-response relationship between the parameters of interest.
I would appreciate any help or suggestions.
Thank you very much.
Relevant answer
Answer
You can use this:
But be careful: polynomial regression is dangerous to use, since it explains nothing mechanistically and should never be extrapolated.
  • asked a question related to Regression
Question
10 answers
My question is looking at the influence of simulation on student attitudes. My professor would like me to do regression analysis, but he says to do two regressions. I have my pre-test data and post-test data the only other information I have is student college. What I found in my class materials seems to indicate that I can complete a regression using the post-test as my dependent variable and the pre-test as my independent variable in SPSS. How would I do another regression? Should I work in the colleges as another dependent variable and if so, do I do them as a group or do I need to create a variable for each college?
Relevant answer
Answer
I have some questions.
1) Was there some treatment (or intervention) between the baseline and followup scores? If so, did all subjects receive it, or only some of them? And if so to that, how were they allocated to intervention vs control?
2) How many colleges are there? If the number is fairly large, it may be preferable to estimate a multilevel model with subjects at level 1 clustered within colleges at level 2.
  • asked a question related to Regression
Question
1 answer
I regress X to Y: ,direct effect (c)
M: mediator: I regress X to M (a), M to Y (b)
Total effect = c + a*b
now i introduce a moderator effect between X and Y
How i calculate the total effect with moderator and mediator effect
Relevant answer
Answer
If the moderation effect is different from zero (i.e., if there is a moderation/interaction effect), the c path would differ for different values of the moderator variable (Z). Consequently, the total effect would also differ for different values of Z.
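To make this concrete, here is a sketch under the usual linear specification (an assumed model form, with Z moderating only the X -> Y path):
Y = b0 + c1*X + c2*Z + c3*X*Z + b*M + e_Y
M = a0 + a*X + e_M
conditional direct effect: c(Z) = c1 + c3*Z
conditional total effect: total(Z) = (c1 + c3*Z) + a*b
So the total effect is no longer a single number; it is typically reported at chosen values of Z (e.g., the mean and one SD above/below it).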
  • asked a question related to Regression
Question
1 answer
I have daily sales data and stock availability for items in a supermarket chain. My goal is to estimate the sales quantity elasticity with respect to availability (represented as a percentage). With this model, I want to understand how a 1% change in availability impacts sales. Currently, single-variable regressions yield low R-squared values. Should I include lagged sales values in the regression to account for other endogenous factors influencing sales? This would isolate availability as the primary exogenous variable
Relevant answer
Answer
Well, I don't get a clear picture of the variables you are considering for your study. Since you are considering daily data and have more than one variable, you can apply either non-linear ARDL models or multivariate volatility models, depending on the objectives of your study. Bear in mind that the pre-tests are highly instrumental to the choice of model.
Best wishes!
  • asked a question related to Regression
Question
1 answer
I am doing a study focusing on analyzing differences in fish assemblages due to temperature extremes. I calculated Shannon diversity, evenness, richness, and total abundance for each year sampled. The years are grouped into 2 temperature periods essentially as well, which is what I want to overall compare.
On viewing results, there appears to be consistency across years, and when comparing the two groupings. I do have multivariate tests to follow this after for community composition, but when describing univariate results, are there any statistical tests that can be followed up with to better show there is no difference, rather than simply describing the numbers and their mean differences?
Relevant answer
Answer
Hi Alana Barton, it would be good to have more information here to be able to help more, but have you tried a GLM with both year and temperature period included in the model? Perhaps you'd also need to add an interaction effect between temperature and year (as from what you said there seems to be an interaction). Further explanatory variables could be added to the model if you have measured them.
  • asked a question related to Regression
Question
7 answers
Dear all,
I am sharing the model below that illustrates the connection between attitudes, intentions, and behavior, moderated by prior knowledge and personal impact perceptions. I am seeking your input on the preferred testing approach, as I've come across information suggesting one may be more favorable than the other in specific scenarios.
Version 1 - Step-by-Step Testing
Step 1: Test the relationship between attitudes and intentions, moderated by prior knowledge and personal impact perceptions.
Step 2: Test the relationship between intentions and behavior, moderated by prior knowledge and personal impact perceptions.
Step 3: Examine the regression between intentions and behavior.
Version 2 - Structural Equation Modeling (SEM)
Conduct SEM with all variables considered together.
I appreciate your insights on which version might be more suitable and under what circumstances. Your help is invaluable!
Regards,
Ilia
Relevant answer
Answer
Ilia, some thoughts on your model. According to your path diagram you have 4 moderator effects. For such a large model, you need a large sample size to detect all moderator effects simultaneously. Do you have a justification for all of these nonlinear relationships?
Some relationships in the path diagram are missing. First, prior knowledge, personal impact, and attitude should be correlated - these are the predictor variables. Second, prior knowledge and personal impact should have direct effects on the dependent variables behavioral intentions and behavior (this is necessary).
As this model is quite complex, I would suggest to start with analyzing the linear model. If this model fits the data well, then I would include the interaction effects one by one. Keep in mind that you need to use a robust estimation method for parameter estimation because of the interaction effects. If these effects exist in the population, then behavioral intentions and behavior should be non-normally distributed.
Kind regards, Karin
  • asked a question related to Regression
Question
1 answer
Hello everyone, for my dissertation I have two predictor variables and one criterion variable. One of the predictor variables further has 5 domains and no global score. In that case, can I use multiple regression, or do I have to perform stepwise linear regression separately for the 6 predictors (5 domains plus the other predictor), keeping in mind the assumption of multicollinearity?
Relevant answer
Answer
There are two different issues here. The first is with regard to stepwise regression, which is a very old-fashioned technique that is no longer widely accepted. Instead, you should indeed use multiple regression.
The other issue is with regard to multicollinearity. Since your predictors will almost certainly be inter-correlated, you will have some degree of multicollinearity. But this goes back to your wanting to keep the 5 domains separate, since it is their degree of inter-correlation that creates the multicollinearity.
Have you considered using structural equation modeling or exploratory factor analysis to clarify whether your 5 domains truly are statistically distinct, as opposed to being indicators of a single larger domain?
  • asked a question related to Regression
Question
3 answers
Hi, I'm currently writing my psychology dissertation, in which I am investigating "how child-oriented perfectionism relates to behavioural intentions and attitudes towards children in a chaotic versus calm virtual reality environment".
I therefore have 3 predictor/independent variables: calm environment, chaotic environment, and child-oriented perfectionism.
My outcome/dependent variables are: behavioural intentions and attitudes towards children.
My hypotheses are:
  1. Participants will have more negative behavioural intentions and attitudes towards children in the chaotic environment than in the calm environment.
  2. These differences (highlighted above) will be magnified in participants high in child-oriented perfectionism compared to participants low in child-oriented perfectionism.
I used a questionnaire measuring child-oriented perfectionism, which yields a score. Participants then watched the calm environment video and answered the behavioural intentions and attitudes questionnaires in relation to the children shown in that video; they then watched the chaotic environment video and answered the same questionnaires in relation to the children in that video.
I am unsure whether to use a multiple linear regression or a repeated-measures ANOVA with a continuous moderator (child-oriented perfectionism) to answer my research question and hypotheses. Please can someone help!
Relevant answer
Answer
1. participants will have more negative behavioural intentions and attitudes towards children in the chaotic environment than in the calm environment.
--- because there were only two conditions (levels of your factor), you can use a paired t-test (or a Wilcoxon signed-rank test if nonparametric) to compare the behavioural intentions/attitudes between the calm and chaotic environments, where the same participants were subjected to both.
2. these differences (highlighted above) will be magnified in participants high in child-oriented perfectionism compared to participants low in child oriented perfectionism.
--- indeed, this is a simple linear regression (not a multiple one). You can start by creating a new dependent variable (y) as the difference in behavioural intentions/attitudes between the calm and chaotic environments, then run a regression on the independent variable of the perfectionism score (x).
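If it helps, here is a minimal sketch of both steps in Python (scipy/statsmodels) with simulated stand-in scores; all names and numbers are placeholders:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 80
calm = rng.normal(3.5, 0.6, n)            # attitude scores, calm condition
chaotic = calm - rng.normal(0.4, 0.5, n)  # attitude scores, chaotic condition
perfectionism = rng.normal(0, 1, n)       # child-oriented perfectionism score

# Hypothesis 1: within-subject difference between the two conditions
t, p = stats.ttest_rel(calm, chaotic)

# Hypothesis 2: does perfectionism predict the size of the difference?
df = pd.DataFrame({'att_diff': calm - chaotic, 'perf': perfectionism})
m = smf.ols('att_diff ~ perf', data=df).fit()
print(t, p, m.params)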
  • asked a question related to Regression
Question
5 answers
How can I interpret these two examples below in the mediation analysis? Help me
1) with negative indirect and total effect, positive direct effect
Healthy pattern (X)
Sodium Consumption (M)
Gastric Cancer (Y)
Total Effect: Negative (-0.29)
Indirect Effect: Negative (-0.44)
Direct Effect: Positive (0.14)
Mediation percentage: 100%
2) With total and direct negative effect, positive indirect effect
Healthy pattern (x)
Sugar consumption (m)
Gastric Cancer (Y)
Total Effect: Negative (-0.42)
Indirect Effect: Positive (0.03)
Direct Effect: Negative (-0.29)
Mediation percentage: 10.3%
Relevant answer
Answer
The interpretation depends on all aspects, whether positive or negative: simply weigh the advantages and disadvantages.
  • asked a question related to Regression
Question
2 answers
I ran an OLS regression on panel data in EViews, and then 2SLS and GMM regressions.
I introduced all the independent variables of the OLS as instrumental variables.
I am getting exactly the same results under the three methods.
Is there any mistake in how I ran the models? I am also attaching the results.
Thanks in advance.
Relevant answer
Answer
OLS (Ordinary Least Squares), 2SLS (Two-Stage Least Squares), and GMM (Generalized Method of Moments) are all estimation methods used in econometrics. They rest on different assumptions, so in general they do not produce identical results.
In your case, however, identical results are exactly what theory predicts: if you supply the OLS regressors themselves as the instruments, each variable instruments itself, the first stage simply reproduces the regressors, and 2SLS (and exactly-identified GMM) collapses to OLS. To get different estimates, the instrument set must include variables from outside the regressor set that are correlated with the endogenous regressors but uncorrelated with the error term.
  • asked a question related to Regression
Question
2 answers
In his 1992 paper, (Psychological Assessment 1992, Vol.4, No. 2,145-155) Tellegen proposed a formula to calculate the uniform T score.
UT = B0 + B1*X + B2*X^2 + B3*X^3
B0 being the intercept, X the raw score, and B1, B2 and B3 the regression coefficients of the linear, squared and cubic terms, respectively.
What is the intercept? How do you calculate the intercept (B0)?
How do you calculate the regression coefficients? Is the regression between the raw score and the percentile? Why three different regression coefficients?
Relevant answer
Answer
Yes, the formula to obtain a linear T score is quite simple to apply. The question is more about the uniform T score, or normalized T score, which is meant to address the skewness and kurtosis of the different scales.
  • asked a question related to Regression
Question
2 answers
Suppose I compute a least squares regression with the growth rate of y against the growth rate of x and a constant. How do I recover the elasticity of the level of y against the level of x from the estimated coefficient?
Relevant answer
Answer
The elasticity of y with respect to x is defined as the percentage change in y resulting from a one-percent change in x, holding all else constant. In the context of your regression model, where you have regressed the growth rate of y (which can be thought of as the percentage change in y) against the growth rate of x (the percentage change in x), the estimated coefficient on the growth rate of x is an estimate of this elasticity directly.
Here's why: If you run the following regression:
Δ%y=a+b(Δ%x)+ϵ
where Δ%y is the growth rate of y (dependent variable), Δ%x is the growth rate of x (independent variable), a is the constant term, b is the slope coefficient, and ϵ is the error term, the coefficient b represents the change in Δ%y for a one-unit change in Δ%x. Because Δ%y and Δ%x are already in percentage terms, the coefficient b is the elasticity of y with respect to x.
So, if you have estimated the coefficient b from this regression, you have already estimated the elasticity. There is no need to recover or transform the coefficient further; the estimated coefficient b is the elasticity of y with respect to x.
It's important to note that this interpretation assumes that the relationship between y and x is log-linear, meaning the natural logarithm of y is a linear function of the natural logarithm of x, and the model is correctly specified without omitted variable bias or other issues that could affect the estimator's consistency.
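For readers who want to verify this numerically, here is a minimal simulation sketch (Python/statsmodels, all data synthetic): generate y = A*x^e, regress log-differences, and recover the elasticity e as the slope:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = np.cumprod(1 + rng.normal(0.02, 0.01, 200))         # a smooth positive series
y = 3.0 * x ** 0.7 * np.exp(rng.normal(0, 0.005, 200))  # true elasticity = 0.7

gx = np.diff(np.log(x))  # growth rates ~ percentage changes
gy = np.diff(np.log(y))
res = sm.OLS(gy, sm.add_constant(gx)).fit()
print(res.params)  # the slope should be close to 0.7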
  • asked a question related to Regression
Question
2 answers
In most of the studies Tobit regression is used, but in my Tobit model the independent variable is not significant. Is fractional logistic regression also an appropriate technique to explore the determinants of efficiency?
Relevant answer
Answer
When using efficiency scores as a dependent variable in subsequent regression analysis, researchers often encounter the issue of these scores being bounded between 0 and 1, which violates the assumption of unboundedness in standard linear regression models. To address this issue, fractional regression models, such as the fractional logistic regression, are employed as they are designed specifically for dependent variables that are proportions or percentages confined to the (0,1) interval.
Fractional logistic regression, based on the quasi-likelihood estimation, can be used to model relationships where the dependent variable is a fraction or proportion, which is exactly the nature of technical efficiency scores resulting from DEA. Therefore, it is suitable to apply fractional logistic regression in a two-stage DEA analysis where the first stage involves calculating the efficiency scores, and the second stage seeks to regress these scores on other explanatory variables to investigate what might influence the efficiency of the DMUs.
This two-stage approach, where the DEA is used first to compute efficiency scores and then fractional logistic regression is used in the second stage, helps to avoid the potential biases and inconsistencies that might arise if standard linear regression techniques were used with bounded dependent variables. It is an appropriate statistical technique for dealing with the special characteristics of efficiency scores and can provide more reliable insights into the factors influencing DMU efficiency.
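A minimal sketch of the second stage in Python (statsmodels): a Binomial-family GLM fitted to simulated efficiency scores in (0, 1) with robust standard errors, in the spirit of Papke and Wooldridge's quasi-likelihood approach (statsmodels may warn about non-integer outcomes; that warning is expected for fractional responses):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
# Simulated efficiency scores bounded in (0, 1)
eff = 1 / (1 + np.exp(-(0.3 + 0.8 * x))) * rng.uniform(0.8, 1.0, n)

X = sm.add_constant(x)
res = sm.GLM(eff, X, family=sm.families.Binomial()).fit(cov_type='HC1')
print(res.summary())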
  • asked a question related to Regression
Question
2 answers
If I want to carry out innovative research based on Wasserstein regression, from what other perspectives can I pursue statistical innovation? Specifically: (1) combining it with a Bayesian framework, introducing a prior distribution and performing parameter estimation based on Bayes' rule to obtain more reliable estimates; (2) introducing variable selection techniques to automatically select the predictor distributions that have explanatory power for the response distribution, so as to obtain a sparse interpretation.
Can the above questions be regarded as a highly innovative research direction?
Relevant answer
Answer
Yes, I have the mathematical and statistical knowledge for research. I also have some knowledge of SPSS and R programming for data analysis and visualization.
  • asked a question related to Regression
Question
4 answers
Can one independent variable in a multilevel regression be derived from two other independent variables, such as type, token, and the type/token ratio?
Relevant answer
Answer
You can see the following books for more details:
1. Using Multivariate Statistics - Tabachnick & Fidell
2. Multivariate Data Analysis (7th ed.) - Hair et al.
3. Applied Multivariate Techniques - Subhash Sharma
  • asked a question related to Regression
Question
17 answers
Can we use inferential statistics like correlation and regression if we use non-probability sampling technique like convenient or judgement sampling?
Relevant answer
Answer
Yes, it is possible.
Statistical tests such as relationship tests, regression, comparisons, etc., do not require randomness of the sample.
Each test has its own specific conditions, and it can be applied to any block of data that meets those conditions, regardless of the sampling method used.
However, the sampling method affects the generalizability of the results.
Best wishes.
  • asked a question related to Regression
Question
1 answer
Phonetics - What is progressive and regressive assimilation and dissimilation in the Romance languages (especially Spanish), and how do you recognize it?
I am searching for an explanation and a good source recommendation where I could read more about this topic, plus some examples of assimilation or dissimilation in the Romance languages, especially Spanish. Thank you for helping me!
Relevant answer
Answer
Assimilation phenomena are the influences that one sound exerts on another. We have progressive assimilation when the preceding sound changes the characteristics of the following one: in [aθ't̪̟eka] the voiceless interdental fricative interdentalized the voiceless dental occlusive. On the other hand, regressive assimilation takes place when the following sound influences the preceding one: in [an̪dalu'θia] the voiced dental occlusive dentalized the alveolar nasal; but here we also have progressive assimilation, because the alveolar nasal made the voiced dental occlusive maintain its occlusion. Finally, dissimilation is the process by which two sounds change their characteristics in order to emphasise the difference between them: an example could be the evolution of the medieval Spanish sibilants.
  • asked a question related to Regression
Question
4 answers
I found information on Google but couldn't understand it properly.
Relevant answer
Answer
Correlation measures the strength and direction of a linear relationship between two variables, while regression goes a step further by modeling and predicting the impact of one or more independent variables on a dependent variable. Correlation does not imply causation, merely showing association, whereas regression can provide insights into potential cause-and-effect relationships.
  • asked a question related to Regression
Question
3 answers
Hi,
I have some confusion as to which model is better for my outcomes: binomial regression or logistic regression. I am currently working on judicial decisions, i.e. outcomes in tax courts, where cases go either in favour of the assessee or of the taxman. The factors influencing the judges, as reflected in the cases, are represented by their presence (1) or absence (0): if a factor is not considered in the final judgment it takes '0', else '1'. If the outcome is favourable to the assessee it is '1', else '0'. Which would be the best approach to put this into a regression model showing the relationship between the outcome (dependent) and the factors (independent, maybe 5-6 variables)? I need some guidance on this. Can I use any other, better model for forecasting after I perform a bootstrap run of, say, 1000 simulations and then compute average outcomes and related statistics?
Relevant answer
Answer
In your case, the appropriate choice would be logistic regression rather than binomial regression. Logistic regression is specifically designed for binary outcomes, which seems to align with your scenario where the judicial decisions can go either in favor of the assessee (1) or the taxman (0).
Logistic regression models the probability of a binary outcome, and it's well-suited for situations where the dependent variable is categorical and has two possible outcomes. Binomial regression, on the other hand, is a more general term that can encompass logistic regression as a special case, but it's not the same thing. Logistic regression is a type of binomial regression.
Given that you have a binary outcome and you want to model the relationship between this outcome and several independent variables (factors), logistic regression would be the more appropriate choice.
As for incorporating bootstrapping for more robust estimates, that's a good approach. By running simulations and generating multiple bootstrap samples, you can assess the stability of your model and obtain more reliable estimates of model parameters. This can be especially helpful when dealing with a limited number of observations.
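A minimal sketch of that workflow in Python (statsmodels), with invented factor names and simulated outcomes standing in for the real case data:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({f'f{i}': rng.integers(0, 2, n) for i in range(1, 6)})
logit_p = -0.5 + 1.2 * df['f1'] + 0.7 * df['f2'] - 0.9 * df['f3']
df['outcome'] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

model = smf.logit('outcome ~ f1 + f2 + f3 + f4 + f5', data=df).fit(disp=0)
print(np.exp(model.params))  # odds ratios per factor

# Bootstrap: refit on 1000 resamples to get empirical coefficient intervals
# (an occasional resample may fail to converge if a factor separates perfectly)
boot = [smf.logit('outcome ~ f1 + f2 + f3 + f4 + f5',
                  data=df.sample(n, replace=True)).fit(disp=0).params
        for _ in range(1000)]
print(pd.DataFrame(boot).quantile([0.025, 0.975]))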
  • asked a question related to Regression
Question
4 answers
I've often seen the following steps used to test for moderating effects in a number of papers, and I don't quite understand the usefulness of Model 1 (which only tests the effect of the control variable on the dependent variable) and Model 4 (which only adds the cross-multiplier term of one of the moderating variables with the independent variable). These two models seem redundant.
Relevant answer
Answer
Burke D. Grandjean Thank you very much for your patience! I agree with you, though I still don't think there is much value in performing this manipulation in the results section.
If the difference between the two coefficients is not large, I cannot be sure if this is because the risk of confounding is low in my study or because my control variables are null (not eliminating the risk of confounding). Conversely, if the difference between the two coefficients is large, again I cannot determine whether this is because I am controlling for confounding well or because of some other problem (such as the multicollinearity problem you describe).
What is the point of this manipulation, given my difficulty in determining the reason for a particular situation? I think the relevant discussion should focus on the relationship of the control variables to the independent and dependent variables, with a dedicated section for detailed discussion and further testing.
  • asked a question related to Regression
Question
1 answer
Can we model each of the following by regression?
1. A first-degree equation
2. A 7th-degree polynomial equation
3. A non-linear differential equation
4. A system of two equations in two variables
5. A system of second-order differential equations
6. A system of nonlinear equations
and
7. as the next step, a non-linear differential equation?
If we can model all of these by regression and get the output correctly, or with good accuracy, then we can have an approximate model of the system using only data. This can be a starting point for using control or troubleshooting on complex systems for which exact models are not available.
As a test I will start with 1 and 2, but is it possible to achieve good accuracy with regression for the rest of the cases?
Relevant answer
Answer
While regression analysis is a powerful tool for modeling relationships in data, its applicability to different types of equations varies. Let's explore each case:
First-degree equation:
Regression is highly effective for linear relationships. You can achieve good accuracy modeling a first-degree equation through linear regression.
7th-degree polynomial equation:
Polynomial regression can be used to model higher-degree polynomials. However, as the degree increases, there's a risk of overfitting, and the model may not generalize well to new data. Careful consideration of model complexity is crucial.
Non-linear differential equation:
Non-linear differential equations often involve dynamic systems. While regression may provide some insights, more specialized techniques like differential equation modeling or system identification methods would be more appropriate to capture the dynamic behavior accurately.
System of two variables and 2 equations:
Linear regression can be extended to multiple variables, making it suitable for simple systems. However, for more complex systems, you might need to explore system identification methods or machine learning techniques tailored for system modeling.
System of second-order differential equations:
Modeling second-order differential equations involves understanding the system's dynamics. Traditional regression may not capture these complexities well. System identification methods or dynamic modeling approaches are more suitable.
System of nonlinear equations:
Nonlinear regression techniques can be employed for systems represented by nonlinear equations. However, the accuracy depends on the nature of nonlinearity and the quality and quantity of data.
The next step is the non-linear differential equation:
Similar to the third case, modeling non-linear differential equations requires specialized methods. Ordinary Differential Equation (ODE) solvers or system identification techniques may be more suitable than standard regression.
In summary, while regression is valuable for simple relationships, more complex systems often require advanced techniques tailored to the specific characteristics of the equations involved. Always validate the model's accuracy and generalizability, and consider consulting experts in the relevant field for a comprehensive understanding of the system.
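For cases 1 and 2, which the questioner plans to test first, here is a minimal numpy sketch that also illustrates the overfitting warning above (synthetic data generated from a known first-degree law):
import numpy as np

x = np.linspace(-1, 1, 50)
rng = np.random.default_rng(8)
y = 2 * x + 1 + rng.normal(0, 0.1, x.size)  # data from a first-degree law

line = np.polynomial.Polynomial.fit(x, y, deg=1)
poly7 = np.polynomial.Polynomial.fit(x, y, deg=7)

# The degree-7 fit tracks the data slightly better in-sample, but it is the
# classic overfitting risk: outside [-1, 1] it can diverge wildly.
print(line.convert().coef)     # ~ [1, 2]: intercept and slope recovered
print(line(1.5), poly7(1.5))   # extrapolation: compare the two predictions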
  • asked a question related to Regression
Question
3 answers
What does it mean when the constant (intercept) term has a VIF greater than 10? Do all the variables in the model exhibit multicollinearity? How can multicollinearity be reduced? It could be reduced by removing variables with VIF > 10, but I don't know what to do with the constant coefficient.
Thank you very much
Relevant answer
Answer
Looking further - your package may be reporting an uncentred VIF in place of, or in addition to, a centred VIF. There is an apparently unresolved debate in the literature about when or why that is useful. For practical purposes, in most regressions a high uncentred VIF is likely not problematic. I've never seen an uncentred VIF used in a published paper.
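A minimal sketch of computing (centred) VIFs in Python with statsmodels, on simulated data with deliberate collinearity; the VIF reported for the constant column itself is usually ignored:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)  # deliberately collinear with x1
x3 = rng.normal(size=100)
X = sm.add_constant(pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3}))

vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                 index=X.columns)
print(vifs)  # expect high VIFs for x1 and x2, low for x3; ignore 'const'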
  • asked a question related to Regression
Question
5 answers
Hello,
I am measuring the thermal stability of a small protein (131 aa) using circular dichroism, following the loss of its secondary structure. The data obtained are normalized to lie between 0 and 1, where 0 is folded protein and 1 is completely unfolded. The CD of the fully unfolded state was calculated from a different experiment on the same batch and taken as reference. Plotting my data in GraphPad Prism 9, I fit a standard 4PL curve using non-linear regression, constraining the regression to use 0 as the bottom value and 1 as the top value (see attached file). The Tm is reported as IC50 in this screenshot because this formula is often used for calculating IC50 and EC50. However, the resulting fitted line does not seem able to represent my data correctly. I performed this experiment twice, and the replicates test shows that the model inadequately represents the data. Should I look for a different equation to model my data? Or am I making a mistake in performing this regression? Thank you for the help!
Relevant answer
Answer
To ascertain that your model is applicable to your data, plot the residuals (y - ŷ) against the independent variable; they should be randomly distributed. You can perform a "runs test" as a non-parametric check, or (more work, but also more powerful) a t-test to compare the residuals with the standard deviation of the data. Both methods test H0: the data are reasonably well described by the regression curve, against H1: the data deviate significantly from the regression curve.
Most commercial software performs non-linear curve fitting with the Marquardt-Levenberg algorithm. I have had this fail me on occasion and found that Nelder-Mead's simplex algorithm is more reliable (DOI: 10.1093/comjnl/7.4.308). Disadvantage: you need to get the errors of the fitting parameters from bootstrapping (DOI: 10.1016/0076-6879(92)10009-3), whereas ML gives them directly.
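A minimal sketch of Nelder-Mead least-squares fitting of a 4PL melting curve in Python (scipy), with simulated data as a placeholder; the residuals can then be inspected or run-tested as described above:
import numpy as np
from scipy.optimize import minimize

def four_pl(T, bottom, top, tm, slope):
    # Sigmoidal melting curve rising from `bottom` to `top` around Tm
    return bottom + (top - bottom) / (1 + np.exp(slope * (tm - T)))

rng = np.random.default_rng(6)
T = np.linspace(20, 90, 40)
y = four_pl(T, 0, 1, 55, 0.3) + rng.normal(0, 0.03, T.size)

# Sum of squared errors, minimized with the Nelder-Mead simplex
sse = lambda p: np.sum((y - four_pl(T, *p)) ** 2)
fit = minimize(sse, x0=[0, 1, 50, 0.1], method='Nelder-Mead')
print(fit.x)  # fitted bottom, top, Tm, slope

# Residuals should scatter randomly around zero if the model is adequate
residuals = y - four_pl(T, *fit.x)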
  • asked a question related to Regression
Question
16 answers
There are two ways we can go about testing the moderating effect of a variable (assuming the moderating variable is a dummy variable). One is to add an interaction term to the regression equation, Y=b0+b1*D+b2*M+b3*D*M+u, to test whether the coefficient of the interaction term is significant; an alternative approach could also be to equate the interaction term model to a grouped regression (assuming that the moderating variables are dummy variables), which has the advantage of directly showing the causal effects of the two groups. However, we still need to test the statistical significance of the estimated D*M coefficients of the interaction terms by means of an interaction term model. Such tests are always necessary because between-group heterogeneity cannot be resorted to intuitive judgement.
One of the technical details is that if the group regression model includes control variables, the corresponding interaction term model must include all the interaction terms between the control variables and the moderator variables in order to ensure the equivalence of the two estimates.
If in equation Y=b0+b1*D+b2*M+b3*D*M+u I do not add the cross-multiplication terms of the moderator and control variables, but only the control variables alone, is the estimate of the coefficient on the interaction term still accurate at this point? At this point, can b1 still be interpreted as the average effect of D on Y when M = 0?
In other words, when I want to test the moderating effect of M in the causal effect of D on Y, should I use Y=b0+b1*D+b2*M+b3*D*M+b4*C+u or should I use Y=b0+b1*D+b2*M+b3*D*M+b4*C+b5*M*C+u?
Reference: Jiang, T. Mediation and Moderation Effects in Empirical Research on Causal Inference [J]. China Industrial Economics, 2022(05): 100-120. DOI: 10.19581/j.cnki.ciejournal.2022.05.005.
Relevant answer
Answer
You are welcome! Don't hesitate to ask further questions.
  • asked a question related to Regression
Question
14 answers
I am writing my bachelor thesis, I'm stuck with the data analysis, and I wonder if I am doing something wrong.
I have four independent variables and one dependent variable, all measured on a five point likert scale and thus ordinal data.
I cannot use a normal type of regression (since my data are ordinal, not normally distributed and never will be - transformations could not change that - and they also violate homoscedasticity), so I opted for ordinal logistic regression. Everything worked out, but the test of parallel lines in SPSS was significant, so the proportional odds assumption is violated. I am now considering multinomial logistic regression as an alternative.
However, here I could not find out how to test one assumption in SPSS: a linear relationship between the continuous variables and the logit transformation of the outcome variable. Does somebody know how to do this?
Plus, I have a more fundamental question about my data. To measure my variables, I asked respondents several questions. My dependent variable, for example, is turnover intention, and I used 4 questions on a 5-point Likert scale, so I got 4 values from everyone about their turnover intention. For my analysis I took the average, since I only want one value of turnover intention per respondent (and not four). However, the data no longer take the values 1, 2, 3, 4 and 5 as with the original 5-point scale; after averaging I have decimals like 1.25 or 1.75. This leaves me with very many distinct data points, and I wonder whether my approach makes sense. I was thinking of grouping them together, since my analysis is biased by having so many different categories due to the many decimals.
Can somebody provide any sort of guidance on this??
Relevant answer
Answer
Lisa Ss, it doesn't make sense to pool the data that way if you believe that you have ordinal data. You cannot simply calculate a mean or a sum score from the items, since ordinal data don't provide that information; that requires a metric scale. Therefore your "average" score is not appropriate, and hence neither is the "grouping".
In my opinion you have several options:
1) Use an ordinal multilevel model to account for the repeated measures and the ordinality.
2) Conduct an ordinal confirmatory factor analysis, calculate the factor score for the latent variable and use this as a dependent variable in a OLS regression.
3) Do everything with an ordinal SEM, the structural and the measurement model.
4) Treat the ordinal items as metric (not recommended).
Maybe others have different approaches, please share.
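For readers who want to experiment outside SPSS, here is a minimal sketch of an ordinal (proportional-odds) logistic regression in Python with statsmodels; the data are simulated placeholders:
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(9)
n = 300
x = rng.normal(size=n)
latent = 0.8 * x + rng.logistic(size=n)
# Cut the latent variable into 4 ordered levels (0..3)
y = pd.cut(latent, bins=[-np.inf, -1, 0, 1, np.inf], labels=False)

mod = OrderedModel(y, x[:, None], distr='logit')
res = mod.fit(method='bfgs', disp=0)
print(res.summary())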
  • asked a question related to Regression
Question
5 answers
In my MS thesis I analysed my data using linear regression, but my supervisor also asked me to add stepwise linear regression.
Relevant answer
Answer
Stepwise methods seem like the answer to the problem of having a lot of possible predictors and not knowing which ones to put into your model. In fact, the problem is that you don't know which variables to put into your model. Specifying the variables in the model should be done based on your hypothesis, which should build on previous research. Collecting data without a planned model is like shopping without having a recipe in mind: you end up with half the ingredients needed for half a dozen recipes. Science - same thing.
If you don't have a hypothesised model and you've gone ahead anyway and accumulated data, there are still very important reasons for not letting stepwise methods act as a substitute for having a theory and a hypothesis. Briefly, these are:
1. The p-values for the variables in a stepwise model do not have the interpretation you think they do. It's hard to define what hypothesis they actually test, or the chances that they are false-positive or false-negative.
2. The variables selected may not be the best subset of variables either. There may be other equally good, or even better, combinations of variables. One simple solution is to test all possible subsets of variables. And, like all simple solutions to complex problems, it's wrong. You end up with an unreproducible, atheoretical model that has sacrificed any generalisability to the task you gave it, which was fitting one particular sample of data.
3. The overall model fit statistics are wrong. The adjusted R2 is too big, and if there were a lot of variables not included in the final model, the adjusted R2 will be a massive overestimate. R2 should be adjusted based on the number of variables entered into the process, not on the number actually selected.
4. Stepwise models produce unreproducible results. A different dataset will most likely give a different model, and a stepwise model from one dataset fitted to a new dataset will fit badly.
5. But the most important argument is that stepwise models break a fundamental assumption of statistics, which is that the model is specified in advance and then the model coefficients are calculated from the data. If you allow the data to specify the model as well as the coefficients, all bets are off. See the Stata FAQ:
I can do no better than quote Kelvyn Jones, a geography researcher significant enough to have his own Wikipedia page: there is no escaping the need to think; you just cannot press a button.
Essentially, stepwise methods break the first rule of data analysis:
The software should work; the analyst should think.
  • asked a question related to Regression
Question
3 answers
I have retrieved a study that reports a logistic regression; the OR for the dichotomous outcome is 1.4 for the continuous variable ln(troponin). This means the odds increase by 40% for every 2.7-fold increase in troponin; but is there any way of calculating the OR for a 1-unit increase in the raw troponin variable?
I want to meta-analyze many logistic regressions, for which I need them to be in the same format (some use the variable ln(troponin) and others raw troponin; no individual patient data are available).
Relevant answer
Answer
Just for the sake of completeness: it might be possible if there is a meaningful reference concentration of troponin you could refer to, but I doubt that there is such a value.
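For completeness, here is a short sketch of what can and cannot be rescaled. If the fitted model is logit(p) = b0 + b1*ln(T), then multiplying T by a factor k adds b1*ln(k) to the linear predictor, so:
OR for a k-fold change in T = exp(b1 * ln k) = OR^(ln k)
e.g., per doubling of troponin: 1.4^(ln 2) ≈ 1.26.
An OR per +1 unit of raw troponin, however, cannot be recovered: on the log scale the effect of adding one unit depends on the starting concentration, which is exactly why a meaningful reference value would be needed.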
  • asked a question related to Regression
Question
3 answers
For instance, when using OLS the objective could be to determine the effect of A on B. Could this kind of objective hold when using threshold regression?
Relevant answer
Answer
Thank you
  • asked a question related to Regression
Question
2 answers
Hi folks!
Let's say that I have two lists / vectors "t_list" and "y_list" representing the relationship y(t). I also have numerically computed dy/dt and stored it into "dy_dt_list".
The problem is that "dy_dt_list" contains a lot of fluctuations, and I know from physical theory that it MONOTONICALLY DECREASES.
1) Is there a simple way in R or Python to carry out a spline regression that reproduces the numerical values of dy/dt(t) in "dy_dt_list" as well as it can UNDER THE CONSTRAINT that the fit keeps decreasing? I thus want a monotonically decreasing (dy/dt)_spline as the output.
2) Is there a simple way in R or Python to carry out a spline regression that reproduces the numerical values of y(t) as well as it can UNDER THE CONSTRAINT that (dy/dt)_spline keeps decreasing? I thus want y_spline as the output, given that the above constraint is fulfilled.
I'd like to avoid having to reinvent the wheel!
P.S: I added an example to clarify things!
Relevant answer
Answer
Hi!
There is the C library "GNU Scientific Library" (see chapter 29, "Numerical Differentiation"); it is free software.
There is also the FORTRAN/C/.../Python library "IMSL"; perhaps its documentation, which is free, will be sufficient support.
Best regards.
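Not a spline, but a simple monotone alternative in Python: isotonic regression in scikit-learn fits the closest non-increasing step function to dy/dt (a sketch, assuming a monotone fit rather than a smooth spline specifically is acceptable); y(t) can then be recovered by integrating the fit:
import numpy as np
from sklearn.isotonic import IsotonicRegression

t = np.linspace(0, 10, 200)
# Noisy but truly decreasing placeholder for dy_dt_list
dy_dt = np.exp(-0.3 * t) + np.random.default_rng(10).normal(0, 0.05, t.size)

iso = IsotonicRegression(increasing=False)  # enforce a non-increasing fit
dy_dt_fit = iso.fit_transform(t, dy_dt)

# Recover y(t) by trapezoidal cumulative integration of the monotone fit
y_fit = np.concatenate([[0], np.cumsum(0.5 * (dy_dt_fit[1:] + dy_dt_fit[:-1]) * np.diff(t))])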
  • asked a question related to Regression
Question
4 answers
I have the OR from a logistic regression that used the independent variable as continuous. I also have the ORs from 2x2 tables that dichotomized the variable (high if > 0.1, low if < 0.1).
Is there any way I can merge them for a meta-analysis? I.e., can the OR of the regression (the OR for a 1-unit increase) be converted to an OR for high vs low?
Relevant answer
Answer
Hello Santiago Ferriere Steinert. These two ORs are from different studies, right? How many ORs do you have in total? If I had only the two ORs you describe, I think I would just report them separately. If they were two ORs of a much larger number of ORs, and all but that one were from models that treated the X-variable as continuous, I might compare the OR from the 2x2 table to the pooled estimate of the OR from the other studies. But I think more information is needed. HTH.
  • asked a question related to Regression
Question
5 answers
#QuestionForGroup Good day! I'm using the Lovibond photometer for water analysis, but I noticed that the calibration function in its handbook looks like the attached photo. Is it the inverted form of Beer's law, and why does it use polynomial regression? Can you clarify the derivation and purpose of this equation?
Relevant answer
Answer
Apparently they use linear regression. The higher order coefficients are 0.
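For what it's worth, such handbook equations are usually inverse calibrations: concentration expressed as a polynomial in absorbance, so the instrument can report concentration directly, with the higher-order terms available to absorb deviations from Beer's law at high absorbance. A minimal R sketch of fitting such a curve (the standard values are made up):
# standards with known concentration and measured absorbance (made-up values)
std <- data.frame(conc = c(0, 0.5, 1, 2, 4, 8),
                  A = c(0.00, 0.11, 0.21, 0.40, 0.74, 1.28))
cal <- lm(conc ~ poly(A, 3, raw = TRUE), data = std)  # c = a0 + a1*A + a2*A^2 + a3*A^3
coef(cal)                                             # the stored calibration coefficients
predict(cal, newdata = data.frame(A = 0.55))          # concentration for a new reading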
  • asked a question related to Regression
Question
3 answers
We know that bone is active tissue with continuous remodelling (bone growth and resorption). Is atherosclerosis a static process once formed, or could it regress? If the conditions of lipid oxidation stopped, could an atheroma regress spontaneously?
Relevant answer
Answer
Dear Dr. Ahmed Mahdy
When one is diagnosed with atherosclerosis, the most one can do is prevent progression and further complications. Medical treatment, regular exercise, and dietary changes can be used to keep atherosclerosis from getting worse and to stabilize the plaque, but they are not able to reverse the disease. For instance, aspirin's blood-thinning qualities are beneficial in reducing blood clots and thus preventing strokes and heart attacks, but it has no effect in reducing arterial plaque.
The same holds for statins, which are the most effective and commonly used cholesterol-lowering medications. They work by blocking the protein in the liver that the body uses to make low-density lipoprotein (LDL), or bad cholesterol. The lower one knocks the LDL down, the more likely it is that the plaque will stop growing.
Best.
  • asked a question related to Regression
Question
4 answers
Suppose one has 40 or 50 survey questions for an exploratory analysis of a phenomenon, several of which are intended to be dependent variables, but most independent. An MLR is conducted with e.g. 15 IVs to explain the DV, and maybe half turn out to be significant. Now suppose an interesting IV warrants further investigation, and you think you have collected enough data to at least partially explain what makes this IV so important to the primary DV. Perhaps another, secondary model is in order... i.e. you'd like to turn a significant IV from the primary model into the DV in a new model.
Is there a name for this regression or model approach? It is not exactly nested, hierarchical, or multilevel (I think). The idea, again, is simply to explore what variables explain the presence of IV.a in Model 1, by building Model 2 with IV.a as the DV, and employing additional IVs that were not included in Model 1 to explain this new DV.
I am imagining this as a sort of post-hoc follow up to Model 1, which might sound silly, but this is an exploratory social science study, so some flexibility is warranted, imo.
Relevant answer
Answer
If you have coherent subsets of your variables (i.e., sets that all measure essentially the same thing), then you can create scales that are stronger measures than any of the variables taken alone.
I have consolidated a set of references on this approach here:
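As a minimal R sketch of the scale-building step (assuming the psych package; the data frame survey and all item names are hypothetical):
library(psych)
items <- c("q1", "q2", "q3", "q4")                        # items believed to measure one construct
alpha(survey[items])                                      # internal consistency of the subset
survey$scale_a <- rowMeans(survey[items], na.rm = TRUE)   # composite scale score
summary(lm(scale_a ~ q10 + q11 + q12, data = survey))     # Model 2: the former IV, now a DV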
  • asked a question related to Regression
Question
13 answers
When I run a regression analysis, the R-square in the Model Summary table is very weak (e.g., 0.001 or 0.052), and the sig. value in the ANOVA table is greater than 0.05. How can I fix this?
Relevant answer
Answer
Unless you have an error in your data, this may simply be the result of the analysis (i.e., your predictor(s) is/are only weakly related to, and do not significantly predict, the dependent variable).
  • asked a question related to Regression
Question
4 answers
I have a data set with six categorical variables, with responses on a scale of 1-5. The reliability test for the individual variables is very strong, but when combined across all variables the reliability test gives very low figures. What could be the problem? Also, what would be an appropriate regression for this analysis?
Relevant answer
Answer
@Christian, I used Cronbach's alpha in both instances.
  • asked a question related to Regression
Question
7 answers
If we have a research study (an analysis of factors affecting sustainable agriculture...), most previous studies have analyzed such data with techniques such as regression. To identify effective factors, is it possible to use the exploratory factor analysis technique?
Relevant answer
Answer
Yes, you can use exploratory factor analysis.
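A minimal R sketch (assuming the psych package and that the survey items sit in a data frame ag; the number of factors is illustrative):
library(psych)
fa.parallel(ag, fa = "fa")                  # parallel analysis: how many factors to retain
efa <- fa(ag, nfactors = 3, rotate = "oblimin", fm = "ml")
print(efa$loadings, cutoff = 0.3)           # inspect which items load on which factor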
  • asked a question related to Regression
Question
2 answers
I am planning to assess the effect of different income diversification strategies on rural household welfare. Considering the simultaneous causality between livelihood strategies and welfare indicators, the two-stage least squares (2SLS) method with instrumental variables will be applied to estimate the impact of the strategies on household welfare.
Please check the attached file also. I just need to know which regression was used in Table 4 of this paper, and which tool (SPSS, Stata, R, etc.) I should use to analyse the data.
Relevant answer
Answer
To perform two-stage least squares (2SLS) estimation, you can follow these steps:
  1. Identify your variables. The outcome variable (Y) is what you want to explain; the endogenous regressor is an explanatory variable that is correlated with the error term; the exogenous variables are the remaining explanatory variables.
  2. Find instrumental variables. Instruments are variables that are correlated with the endogenous regressor but uncorrelated with the error term of the outcome equation.
  3. Run the first-stage regression: regress the endogenous regressor on the instruments and all exogenous variables.
  4. Run the second-stage regression: regress the outcome variable on the exogenous variables and the predicted values of the endogenous regressor from the first stage.
The coefficient on the predicted endogenous regressor in the second stage is the 2SLS estimate of its effect on the outcome.
Here is a minimal sketch of how 2SLS might be done in R, using ivreg from the AER package (the variable names are hypothetical):
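library(AER)  # provides ivreg()
# y: outcome; x_endog: endogenous regressor; x_exog: exogenous control; z: instrument
fit <- ivreg(y ~ x_endog + x_exog | z + x_exog, data = df)
summary(fit, diagnostics = TRUE)  # reports weak-instrument and Wu-Hausman diagnostics
In the formula, everything after the | is the instrument set, which must also include the exogenous regressors.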
  • asked a question related to Regression
Question
2 answers
What does the unstandardized regression coefficient in simple linear regression mean?
In multiple linear regression, unstandardized regression coefficients tell how much change in Y is predicted to occur per unit change in that independent variable (X), *when all other IVs are held constant*. But in simple linear regression we have only one independent variable, so how should I interpret it?
Relevant answer
Answer
The same, just that there is no "*when all other IVs are held constant*".
It's simply the (expected) change in Y per unit change of X; for example, a coefficient of 2.5 means each 1-unit increase in X predicts a 2.5-unit increase in Y. There are no other variables involved that would need to be "held constant".
  • asked a question related to Regression
Question
4 answers
Hi,
How do I interpret a significant interaction effect between my moderator (Coh) and independent variable (Hos)? The literature states that Hos and my dependent variable (PDm) have a negative relationship, and that the moderator (Coh) has a positive relationship with the DV (PDm). My regression coefficient for the interaction effect is negative. Does this mean Coh is exacerbating the negative effect (i.e., making it worse) or weakening the effect (i.e., making it better)?
I have attached the SPSS output and simple slopes graph.
Thank you!
Relevant answer
Answer
Dear Np No,
First, check the sign of the correlation coefficient (Pearson, Spearman, or another, depending on the condition of your data). Before studying the regression model, you should calculate the correlation coefficients, which give you a clear result about the direction and strength of the relationship.
Regarding the interpretation of negative regression or correlation coefficients: an inverse relationship means one of two situations.
First: an increase in the value of the independent variable leads to a decrease in the value of the dependent variable.
Second: a decrease in the value of the independent variable leads to an increase in the value of the dependent variable.
The concepts of increase and decrease are fundamentally tied to the variable being studied and its logic. For example, a shorter treatment period means a faster recovery, which is a positive thing, whereas a drop in vital indicators below a certain level, such as blood pressure, means something negative. The direction is expressed by the number once it has been calculated. If you need more detailed help, you can contact me; you are welcome.
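To answer the concrete question, the sign algebra can be checked with simple slopes: the slope of Hos equals b_Hos + b_interaction x Coh, so a negative interaction coefficient makes the (already negative) Hos slope more negative as Coh increases, i.e., Coh strengthens (exacerbates) the effect. A minimal base-R sketch (assuming the raw data are in a data frame d with the variable names from the question):
fit <- lm(PDm ~ Hos * Coh, data = d)
b <- coef(fit)
coh_vals <- mean(d$Coh) + c(-1, 0, 1) * sd(d$Coh)   # low, mean, and high moderator values
slopes <- b["Hos"] + b["Hos:Coh"] * coh_vals        # simple slopes of Hos on PDm
data.frame(Coh = round(coh_vals, 2), slope_Hos = round(slopes, 3))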
  • asked a question related to Regression
Question
2 answers
Hello, I am trying to analyze the factors that influence the adoption of technology, and while doing so I am facing issues with rbiprobit estimation. I have seven years (2015-2021) of balanced panel data containing 2,835 observations. The dependent variable y1 ("Adopt2cat"), the endogenous variable "BothTechKnowledge", and the instrumental variable "SKinfoAdoptNew" all take the values 0 and 1. Although the regression works, I am unsure how to include panel effects in the model. I am using the following commands:
rbiprobit Adopt2cat ACode EduC FarmExp HHCat LaborHH LandSizeDec LandTypeC landownership SoilWaterRetain SoilFertility CreditAvail OffFarmCode BothTechAware IrriMachineOwn, endog(BothTechKnowledge = ACode EduC FarmExp HHCat LaborHH LandSizeDec LandTypeC landownership SoilWaterRetain SoilFertility CreditAvail OffFarmCode BothTechAware IrriMachineOwn SKinfoAdoptNew)
rbiprobit tmeffects, tmeff(ate)
rbiprobit margdec, dydx(*) effect(total) predict(p11)
If we do not add time variables (year dummies), can we say we have obtained a pooled panel estimation? I kindly request you to guide me through both the panel and pooled panel estimation procedures. I have attached the data file for your kind consideration. Thank you very much in advance. Kind regards, Faruque
Relevant answer
Answer
Thank you very much, Mr. Usman, for your kind reply. It would be a great help if you could kindly share the code.
  • asked a question related to Regression
Question
5 answers
I have 4 groups in my study, and I want to analyse the effect of treatment in the 4 groups at 20 time points. Which test should I choose?
Relevant answer
Answer
If I understand your question correctly, I suggest using a randomized complete block design (RCBD). At the same time, you still have the option to analyze the data by regression, either for each set of 20 time points separately or for all 80 observations together. Regards.
  • asked a question related to Regression
Question
7 answers
I did a principal component analysis on several variables to generate one component measuring compliance with medication, but I need help understanding how to use the regression scores generated for that component.
Relevant answer
Answer
Nicco Lopez Tan, thanks so much.
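For completeness, a minimal R sketch of that workflow (assuming the psych package; the data frame meds, the item names, and the outcome variable are all illustrative):
library(psych)
pc <- principal(meds[, c("adh1", "adh2", "adh3", "adh4")], nfactors = 1, scores = TRUE)
meds$compliance <- pc$scores[, 1]                     # regression-method component scores
summary(lm(outcome ~ compliance + age, data = meds))  # use the score like any other predictor
The scores are roughly standardized (mean 0, SD 1), so a coefficient on the component is interpreted per standard deviation of compliance.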