OLS - Science method
Explore the latest questions and answers in OLS, and find OLS experts.
Questions related to OLS
What are the pre-estimation tests or cautions for dynamic panel data analysis? For example, in pooled OLS we run the unit root test, cointegration test, VIF test, etc.
The threshold least squares regression model of Hansen (2000) divides the series into two regimes endogenously, one above and one below the threshold, and then estimates each regime separately by OLS. The method also involves bootstrap replication. In my case, the regime above the threshold is left with only 17 observations. Does this create a degrees-of-freedom problem in the data?
Dear researchers,
Selecting a function of the right form (linear, polynomial, exponential, power law ...) to fit a set of data usually requires the use of some knowledge and some trial-and-error experimentation.
In practice, I guess researchers:
- first select a function form and
- then use a chosen method (e.g., ordinary least squares, OLS) to estimate the parameters of that model by minimising a defined objective (e.g., minimising the RMSE).
The web contains numerous guidelines on how to estimate the parameters for a given objective. However, from my understanding, the functional form must be assumed first.
My concern comes from a very concrete issue.
I have numerous inputs and one output. I would like to build a model to predict the output. I've checked numerous laws in the form:
ex:
test 1: F1 = a*X^2 + b*X + c*Y^2 + d*Y + e*(X*Y)^2 + f*(X*Y)
test 2: F2 = a*ln(X) + Y^b + c
...
For each test, I've used train/test subsets, OLS method to find parameters and then RMSE computation, ... very usual process I guess.
Is there research work/tools to automatically generate the functions to evaluate?
I've been searching online for days so any help will be very much appreciated.
Regards,
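For what it's worth, the trial-and-error loop described in the question (pick a form, fit it by OLS, compare test RMSE) can be automated over a small library of candidate forms. Below is a minimal pure-Python sketch; the data-generating process, candidate list, and noise level are all assumptions chosen for illustration:

```python
import math
import random

random.seed(0)

# toy data generated from y = 2*ln(x) + 1 plus noise (an assumed example)
xs = [random.uniform(1.0, 50.0) for _ in range(200)]
ys = [2.0 * math.log(x) + 1.0 + random.gauss(0.0, 0.1) for x in xs]

train_x, train_y = xs[:150], ys[:150]
test_x, test_y = xs[150:], ys[150:]

def fit_simple_ols(g, xs, ys):
    """OLS for y = a*g(x) + b via the closed-form slope and intercept."""
    z = [g(x) for x in xs]
    mz, my = sum(z) / len(z), sum(ys) / len(ys)
    a = (sum((zi - mz) * (yi - my) for zi, yi in zip(z, ys))
         / sum((zi - mz) ** 2 for zi in z))
    return a, my - a * mz

def rmse(g, a, b, xs, ys):
    return math.sqrt(sum((a * g(x) + b - y) ** 2
                         for x, y in zip(xs, ys)) / len(xs))

# a small library of candidate functional forms
candidates = {
    "linear": lambda x: x,
    "square": lambda x: x * x,
    "log": math.log,
    "sqrt": math.sqrt,
}

scores = {}
for name, g in candidates.items():
    a, b = fit_simple_ols(g, train_x, train_y)
    scores[name] = rmse(g, a, b, test_x, test_y)   # out-of-sample RMSE

best = min(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Tools that search functional forms more systematically fall under the name "symbolic regression" (e.g., genetic-programming-based fitters); the loop above is the brute-force version restricted to a fixed candidate list.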
Long story short:
I use a long unbalanced panel data set.
All tests indicate that 'fixed effects' is more appropriate than 'random effects' or 'pooled OLS'.
No serial correlation.
BUT, heteroskedasticity is present, even with robust White standard errors.
Can someone suggest a way to either 'remove' or just 'deal' with heteroskedasticity in panel data model?
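For intuition, the mechanical difference between classical and White (HC0) standard errors can be seen in a minimal pure-Python sketch for a simple bivariate regression on simulated heteroskedastic data (all numbers below are assumptions for illustration). For a panel, a common approach is to cluster standard errors by panel unit, or to model the variance via FGLS, rather than to "remove" the heteroskedasticity:

```python
import math
import random

random.seed(1)

# toy heteroskedastic data: noise standard deviation grows with x (assumed)
n = 500
x = [random.uniform(0.0, 10.0) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0.0, 0.3 * xi) for xi in x]

# OLS slope and intercept for y = b0 + b1*x
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# classical standard error of b1 (assumes homoskedastic errors)
s2 = sum(e ** 2 for e in resid) / (n - 2)
se_classical = math.sqrt(s2 / sxx)

# White / HC0 heteroskedasticity-robust standard error of b1
se_white = math.sqrt(sum(((xi - mx) ** 2) * e ** 2
                         for xi, e in zip(x, resid)) / sxx ** 2)

print(round(b1, 3), round(se_classical, 4), round(se_white, 4))
```

With variance growing in x, the robust standard error typically exceeds the classical one; the slope estimate itself is unchanged. Robust errors only fix inference, which is why heteroskedasticity can still "be present" in the residuals afterwards.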
I need to understand which model I should select based on the attached file. Are these results valid for model selection? The LM test is insignificant, which argues against choosing the RE model. If pooled OLS is adequate, should I still go for the FE model or not? And if FE is preferred as per the Hausman test, the Pesaran and Wooldridge tests come out significant, indicating dependence and autocorrelation.
How to move from there, please guide.
I am running a panel data model and the Hausman test shows a p-value of 0.0113, which means that the FE model is better. What test is required to choose between the fixed-effects model and pooled OLS? Please explain the test, the steps, and the Stata command. Thank you very much.
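In case it helps, the usual choice between FE and pooled OLS is the poolability F test on the joint significance of the unit effects; Stata's xtreg, fe reports it at the bottom of the output as "F test that all u_i=0". A worked example of the statistic with assumed (hypothetical) fit numbers:

```python
# Worked example of the poolability F test.
# The R-squared values and panel dimensions below are hypothetical.
r2_fe, r2_pooled = 0.62, 0.48   # assumed fits of the FE and pooled models
N, T, k = 30, 10, 3             # 30 units, 10 periods, 3 regressors

df1 = N - 1                     # restrictions tested: the N-1 unit dummies
df2 = N * T - N - k             # residual degrees of freedom of the FE model
F = ((r2_fe - r2_pooled) / df1) / ((1 - r2_fe) / df2)
print(round(F, 2))              # prints 3.39
```

A large F (relative to the F(df1, df2) critical value) rejects poolability, favouring FE over pooled OLS.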
I am performing the Breusch-Pagan test and it shows an insignificant p-value. The Hausman test also shows an insignificant p-value.
I wanted to ask if OLS, Fixed-effect or Random-effect model will be suitable for my dataset, depending on the above test statistics.
Here is the case: as I said, I am working on how macroeconomic variables affect the REIT index return. To understand how macroeconomic variables affect REITs, which tests or estimation methods should I use?
I know I can use OLS but is there any other method to use? All my time series are stationary at I(0).
I run OLS regression on panel data in Eviews and then 2SLS and GMM regression.
I introduced all the independent variables of OLS as instrumental variables.
I am getting exactly the same results under the three methods.
Is there any mistake in how I ran the models?
I am also attaching the results.
thanks in advance
For instance, when using OLS, the objective of the analysis could be
# to determine the effect of A on B
Could this kind of objective hold when using threshold regression?
Dear community,
I spent the last few days reading many questions and responses, as well as several journal articles and research papers, on the topic of time series analysis and its individual steps. Unfortunately, the more I read, the more confused I became... now I hope that you can help me out.
I am writing my master thesis about FDI determinants in Ethiopia. I have selected FDI per GDP as dependent variable and several independent variables (exchange rate change, inflation rate, labor force participation rate, telephone subscriptions per 100 citizens, a governance index (ranking from 0 to 100), and exports + imports as ratio of GDP). I am also thinking about adding a dummy variable to reflect the effects of the Growth and Transformation Plan in 2010.
The next step is to analyse the data empirically in order to investigate the impact of the explanatory variables on the FDI inflow. Due to data availability I only cover a period of 20 years (1996-2016). I read several papers on this topic but somehow everyone performed/applied different steps/tests/models. Also the order of the test performed varies in the papers.
I am really confused by now with regards to the differences between OLS, ECM, ARDL, and VAR and what is the most appropriate method in my case.
In addition, authors performed (some didn't) different tests for unit root/stationary, for co-integration, for multicollinearity, for autocorrelation, for heteroskedasticity. Also in a different order.
Besides, I am confused about the lag selection process and the meaning/difference of AR(1) and I(1).
Moreover, many authors transformed the variables first with logs. I cannot do that, as I have negative observations (inflation rate).
Earlier I also read something about trend and difference stationary and depending on this different unit root test.
Like I said, I am just so confused by now that I don't even know how and where to start.
I am working mainly with SPSS but will perform the unit root tests in Eviews, as SPSS does not have this function.
I really hope that you can help me by providing a little guideline on what I need to do and in which order. Thank you so much!
Since OLS and fixed-effects estimation differ: for a panel data model estimated using a fixed-effects (within) regression, what assumptions (for example, no heteroskedasticity, linearity) do I need to test before I can run the regression?
I'm using the xtreg, fe and xtscc, fe commands in Stata.
Hello,
Need your help,
My research topic is the impact of dividend policy and profitability on share prices. I have taken 29 non-financial companies over a 10-year time span. I ran a simple OLS model in EViews 12 Student Version, but I faced the problem of positive autocorrelation. I also applied panel data analysis: pooled OLS, fixed effects, and random effects models. Is it required to fulfil the assumptions, i.e., normality of residuals, homoskedasticity of residuals, no autocorrelation, and no multicollinearity? EViews doesn't allow diagnostic tests in panel data analysis. I do not have much background in econometrics. Please guide me so I can complete my multiple linear regression analysis.
Thank you in Advance.
I am using a Durbin-Watson test as a method through which to test for autocorrelation in a time series. Said data forms the basis of an interrupted time series analysis. My question is whether the absence of detected autocorrelation in the DW test on a simple model (OLS) is sufficient to inform modelling thereafter (i.e., should I undertake the DW test on other plausible model types, or does the DW test on an OLS model suffice?).
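For reference, the DW statistic is cheap to compute on any model's residuals, so rerunning it for each candidate specification costs little. A minimal pure-Python sketch on a simulated AR(1) residual series (the series and its autocorrelation coefficient are assumptions for illustration):

```python
import random

random.seed(2)

# toy residual series with positive AR(1) autocorrelation (assumed rho = 0.7)
e = [0.0]
for _ in range(199):
    e.append(0.7 * e[-1] + random.gauss(0.0, 1.0))

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).

    Values near 2 indicate no first-order autocorrelation; values well
    below 2 indicate positive autocorrelation (roughly DW = 2*(1 - rho))."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(r ** 2 for r in resid)
    return num / den

dw = durbin_watson(e)
print(round(dw, 2))   # well below 2 for this positively autocorrelated series
```

Since DW only detects first-order autocorrelation in the residuals of the specific model fitted, a clean DW on the OLS model does not automatically carry over to other specifications; recomputing it per model is the safer habit.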
I am already familiar with the process of calculating hedge ratios with linear regression (OLS). I am already running 4 different regressions for calculating hedge ratios between emerging markets and different hedging assets like gold. This is done on in-sample data.
That would look something like this: EM=a+b*GOLD+e
I then construct a portfolio and test the standard deviation of the portfolio, comparing it with the non-hedged portfolio of only emerging market equities out-of-sample: R - b*GOLD
However, I want to compare these OLS hedge ratios to conditional hedge ratios from for instance a BEKK GARCH or a DCC GARCH.
I have already tried to work with R and I used the rugarch and rmgarch packages and created a model, modelspec and modelfit, but I do not know how to go from there.
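For comparison purposes: the static OLS hedge ratio is simply Cov(EM, GOLD)/Var(GOLD), while a conditional (GARCH-based) hedge ratio replaces that with the time-varying ratio h_EM,GOLD,t / h_GOLD,t taken from the fitted conditional covariance matrix at each t. A minimal pure-Python sketch of the static case on simulated returns (the true loading and noise level are assumptions):

```python
import random

random.seed(3)

# simulated returns: EM loads on GOLD with a true hedge ratio of 0.6 (assumed)
n = 1000
gold = [random.gauss(0.0, 1.0) for _ in range(n)]
em = [0.6 * g + random.gauss(0.0, 0.5) for g in gold]

def mean(s):
    return sum(s) / len(s)

def var(s):
    m = mean(s)
    return sum((v - m) ** 2 for v in s) / (len(s) - 1)

# OLS slope of EM on GOLD = Cov(EM, GOLD) / Var(GOLD),
# i.e. the minimum-variance hedge ratio
mg, me = mean(gold), mean(em)
b = (sum((g - mg) * (e - me) for g, e in zip(gold, em))
     / sum((g - mg) ** 2 for g in gold))

# the hedged position EM - b*GOLD should have lower variance than unhedged EM
hedged = [e - b * g for e, g in zip(em, gold)]
print(round(b, 2), var(hedged) < var(em))
```

With rmgarch, the analogous conditional hedge-ratio series would come from the fitted DCC model's conditional covariance array (if I recall the API correctly, via its rcov() accessor), taking H[1,2,t]/H[2,2,t] at each t; the exact extraction call is worth checking against the rmgarch documentation.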
I am using a VECM to test long-run relationships of variables affecting housing prices. In my OLS, I have seasonal variables, lag 1 of the dependent variable, and other economic fundamental variables. The Johansen cointegration test and the VECM did not work when I included these seasonal variables and lag 1 of the dependent variable. When I excluded them, Johansen and the VECM worked. Can I conclude that the error term in the VECM corrects disequilibrium? Thanks.
I conducted research on the effects of natural disasters on credit growth, using 2008-2022 panel data. All of my independent variables enter as lags, and I also include the lagged dependent variable as a regressor. When comparing the coefficient of the lagged dependent variable between GMM, OLS, and FEM, the GMM estimate is always greater than the FEM and OLS estimates. Is there an explanation for this? Is there any treatment that can be done so that the GMM value lies between the OLS and FEM values?
I have data related to households with a number of variables with the dependent variable being household consumption. I need to specify an OLS regression to identify the treatment effect of interest but I do not have interest as a variable within the data provided. How would I go about creating this variable and introducing it as a shock to the data?
Hello!
I have a non-normally distributed variable (income), and although I tried to transform it into a normally distributed variable, the skewness and kurtosis values are still very high and there are lots of outliers. But I can't delete the outliers, because they reflect the nature of the income variable. So I didn't delete a single one (by the way, N = 9918; I am not sure it is acceptable to delete 200 or 300 of them). I read that after conducting the OLS, if the residuals are normally distributed, it is acceptable to use the OLS results, but I couldn't find any academic source or strong reference for this.
I wonder whether, when I have normally distributed residuals, I can use the OLS results even if the variable has outliers and high skewness and kurtosis values. If this is an acceptable way to conduct this analysis, can you suggest an academic resource that I can reference to support this usage?
Thank you in advance.
I am trying to run an OLS regression, with two continuous variables, both with negative and positive values. How should one deal with this? How do you plot a regression line in this case? Is it necessary to add a constant to counteract any negative values? Thank you.
Hi Everyone,
I am trying to use the SURE model in Nlogit; the software uses generalized least squares regression for estimation. Is there any way or command to use OLS instead of GLS in the SURE model?
Thanks
Hi, Everyone!
I am developing a model in which the dependent variable (y) is the number of days a company takes to complete an acquisition (or a merger) of another company. Since it can take from 1 to 1,000 or more days according to the records, I think the Poisson model would not be suitable, since the number of days is not necessarily small count data. Can I apply traditional OLS, or which model do you recommend?
Thank You
I tested a multiple linear regression analysis with my Likert-scale data and it violated the normality assumption of OLS. After that, I found ordinal logistic regression and tested it, but the p-values of the parallel-lines test and the goodness-of-fit (Pearson) test are less than 5%. What should I do?
Hello Everyone,
I am using SPSS with the PROCESS plugin by Hayes. I performed an (OLS) regression analysis (in SPSS) with dependent variable X and 3 dummy variables (I have 4 groups, one being the reference group). This gave me some coefficients for the dummies that made sense. But then I continued my analysis with a mediation analysis (using PROCESS), using the same X and the same 3 dummies (same reference group), and the total effect model returned different values for the coefficients (quite a bit larger). The same happened for another variable when PROCESS calculated the effect of X on M (the mediator), i.e., it returned different results from my "normal" regression analysis without PROCESS. From what I understand of Hayes's 2018 book, the equations used to calculate these coefficients in mediation should be the same as in my previous calculation. But they differ. Any ideas why?
Thank you,
Stefan
I have 5-point Likert data for 54 items. These items are broken into several scales, some adapted from validated measures. I have two waves of data with the same survey, with different samples drawn from the same populations. In each wave I have three groups and a supplemental sample (so really 4 groups). I would like to compare the responses between the two waves and between the groups. My inclination is to use an OLS regression with everything in the model (wave, group, demographics). However, I have some hesitation about treating Likert data as interval. Yet a categorical approach would require analysis item by item, which would be impractical given the number of items and the 5-level response options. I am looking forward to some takes on this: can I use a categorical approach and still collapse items into scales, or is that a "having your cake and eating it too" kind of question?
I am working on my thesis and I have a few questions about which method I should use for the analysis of my data. My research is about inequality in Europe, specifically households in Europe that have access to broadband internet connections and the effects of that access on their educational attainment. The IVs here would be:
- % of households that have access to a broadband connection
- GINI index of the country (to measure inequality between the countries)
The dependent variable can be divided into 3 groups: % of the population that completed primary education, % that completed secondary education, and % that completed tertiary education.
The data is available in the World Bank database and covers about 20 countries over a period of 15 years. After doing some research, I figured out that the type of data I'm using is panel data. I have done some reading about it, but I can't figure out how to continue, because most of the tutorials only use one IV and one DV. What I have read suggests using OLS (my promotor also told me that OLS would be best suited) for the type of variables and data I'm using, and that I will need control variables like "population" or "unemployment", but I don't get it.
I don't know if I'm being clear here; I basically want to know if I need to do what I read (but then I have no clue how to work with 2 IVs and 1 DV), or if it's something completely different.
If something isn't clear, let me know, and I'll try to explain better. Thank you very much.
I assume that volatility of dependent variable for a population is explained by volatility of same variable for one subpopulation rather than another. Comparison of those subpopulations is the focus of my study. Thus, I proposed a model:
Y for population A = b0 + b1 * time + b2 * Y for subpopulation A1 + b3 * Y for subpopulation A2
My question is whether such a model is correct. If not, what kind of method should be applied to measure the impact of the subpopulations' variability on the volatility of the population in total?
Good day scholars,
Please, I need your suggestions on single-equation estimators that can best address the serial-correlation problem in a single-equation regression setting. The single-equation model has been estimated using OLS, and we found the Durbin-Watson statistic to be 1.042838 and the R-squared to be 0.967900.
The dataset used for the study is time series data with mixed orders of integration: four I(1) variables and one I(2) variable.
Thanks and God bless
I was constructing a multiple regression model inspired by two papers. The two papers didn't test for stationarity. When I tested for stationarity and took differences, all the variables were insignificant and the R-squared was too low (0.22); but when I tried without differencing, the variables were significant and the R-squared was 0.78.
I found different ways in estimating long-run and short-run when processing time series data through ARDL-ECM method in Eviews 10.
Some use the OLS method to estimate the long run and the short run; however, others directly use the ARDL output for the long run and the ECM for the short run.
In your opinion, which of these two methods should I use, and why?
Thank you
Hello,
I am attempting a two-part model on semi-continuous data (zero inflated).
As I understand, the first part is a binary logistic regression (or probit) model for the dichotomous event of having zero or positive values.
logit[P(Yi = 0)] = xβ ......... Equation (1)
Conditional on a positive / non-zero value, the second part (continuous) can be modelled using either an OLS regression (with or without log transformation of the outcome variable) or a generalized linear model (GLM).
log(yi|yi > 0) = xβ2 + e where e is normally distributed .... Eq (2)
Combining the above two parts, the overall mean can be written as the PRODUCT of expectations from the first and second parts of the model (refer Eq 1 and 2), as follows:
E(y|x)=Pr(y> 0|x) ×E(y|y> 0, x) .... Eq (3).
I could find a number of papers which have employed the 'twopm' command in Stata. However, I am using SPSS. I have conducted the two parts (binary and continuous) using the same set of predictors.
Please suggest how to multiply the results from both the parts using SPSS.
Thank you.
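Since the combination step in Eq (3) is just arithmetic on the two sets of fitted values, it can be done outside the estimation software entirely, e.g., on predicted values saved from SPSS, in a spreadsheet or a few lines of code. A pure-Python sketch (all coefficients and the smearing factor below are made-up illustrations, not estimates):

```python
import math

# Assumed illustrative coefficients from the two separately fitted parts.
logit_b = [-0.5, 0.8]   # part 1: logit of P(y > 0) on [1, x]  (hypothetical)
ols_b = [1.2, 0.3]      # part 2: E(log y | y > 0) on [1, x]   (hypothetical)
smear = 1.05            # Duan smearing factor = mean(exp(part-2 residuals))

def two_part_mean(x):
    # Pr(y > 0 | x) from the logistic part (Eq 1)
    xb = logit_b[0] + logit_b[1] * x
    p_pos = 1.0 / (1.0 + math.exp(-xb))
    # E(y | y > 0, x) retransformed from the log scale (Eq 2),
    # with the smearing factor correcting the retransformation bias
    cond_mean = math.exp(ols_b[0] + ols_b[1] * x) * smear
    # Eq (3): overall mean is the product of the two parts
    return p_pos * cond_mean

print(round(two_part_mean(2.0), 3))   # prints 4.766
```

Note that if the second part is fitted on log(y), naively exponentiating the linear predictor understates E(y | y > 0, x); that is why the smearing-type correction factor appears in the product.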
Dear Community,
I am facing a rather odd problem. I am running an OLS multivariate regression, and my independent variables change signs compared to the signs in the univariate regressions, although the VIF is below 2.0. Can somebody please help me and explain the matter to me? Many thanks.
Dear Community,
I was wondering whether or not it is possible to validate hypothesis testing based on an OLS regression with only 2 independent variables. Or will such a thing decrease the credibility of the OLS regression results?
If we push it to the extreme, is it possible to validate hypothesis testing with univariate regression analysis?
Results from the OLS method illustrate a strong relationship between 3 explanatory variables and a dependent variable; however, when it comes to the GWR method (using the same explanatory variables), the run fails with Error 040038 (Results cannot be computed because of severe model design problems).
What is the reason, and how can I solve it?
Thanks in advance.
I am doing my master's essay. I am trying to find the determinants of the Environmental Performance Index using the Human Development Index as the income variable. I am a bit confused by my results. The EPI and HDI scatter plot showed a linear relationship, or perhaps that was my mistake and I did not look at it closely enough. I have already designed all my work around this, and I do not have much time on my hands either. Will my OLS regression be legitimate under the circumstances?
Good evening,
I am currently working on my master thesis document and I do have a potentially basic question.
I try to understand the impact of having female in the acquirer's board of directors on the premium paid during an acquisition (percentage of price paid above or under the real value of the target).
To do so, I tried to perform an OLS on 800 observations (R Studio), but I kept finding heteroskedasticity in my model (even after transforming the data).
I tried to perform a plm regression, but I keep having heteroskedasticity according to either the residuals vs. fitted plot or the bptest. Additionally, the variable I am trying to analyse has a p-value of 0.8 rather than <0.05; it is thus not significant and not something I could use in my report.
Do you have any idea how I could resolve the heteroskedasticity problem, or might this be something I will not be able to get rid of? Additionally, do you think I could write a master's thesis with the studied variable being non-significant, or should I keep searching?
Thank you in advance for your help
I ran the IPS and Fisher unit root tests and found that all variables were stationary at level except the real exchange rate. Furthermore, the real exchange rate was stationary at I(2).
Then I ran the Pedroni test, and the null hypothesis was rejected, which means there is a long-term relationship between the variables.
Could I continue with OLS, FE, or RE? If not, what should I do next?
Note that my data is an unbalanced panel with N = 12 and T = 20.
Hi all,
I hope anyone can help me and save my life!
I have been trying to gather as much information as possible to decide which analysis method I should use for my collected data, but I can't make a decision.
In particular, I am struggling to decide whether I should treat my data as non-normal or normal; then I can decide which statistical method to adopt.
For my thesis, I am currently working on analyses with the collected primary data (n=110) based on a cross-sectional, non-probability and self-reported survey for a theory-testing purpose.
All constructs are reflective measures and multi-item scales (between 3 and 6 items) across 2 IVs, 2 DVs, and 2 moderators.
1 IV is a higher-order construct (3 dimensions), and the other IV is unidimensional (5 items).
All of them are seven-point Likert scales previously developed and validated in other papers.
Positive sides:
After the preliminary analysis, reliability (Cronbach’s a >.7), correlation, and regression analyses (OLS) show a good sign and are mostly in line with the conceptual and theoretical models from the role model papers.
Negative sides:
· The results of the EFA (to check unidimensionality) and the CFA (to test the model) were horrible. The EFA results were messy, as per the image below. Many items are cross-loaded or load on unmeaningful components.
· KMO is >.850, suggesting that sample size is not an issue?
· Basically, many OLS assumptions are violated because of leptokurtic distributions, heteroscedasticity, and non-random sampling.
· Normality tests (e.g., the Shapiro-Wilk test) say all of them are non-normally distributed. Besides, across all constructs, the skewness and kurtosis values are around -2 and +6.5, respectively. Although I am aware of the +/-2 threshold for BOTH skewness and kurtosis, I could choose to follow Hair's (2010) guideline, which suggests +/-2 for skewness and +/-7 for kurtosis for data to be considered "normal".
· Lastly, common method bias was checked through Harman's single-factor test (34% of total variance).
Based on this information, I have questions.
(1) Which regression model should I use (OLS, PLS, and so on)? Please provide me with some justification for the choice (e.g., sample size).
(2) Then, which statistical tool do you recommend? (I am currently using SPSS and AMOS, but I can use R, SmartPLS, or Mplus if needed.)
(3) Should I consider my data as normal or non-normal?
Thank you so much for your help in advance!!
Ted
Hello everyone,
Currently I am doing research on factors affecting exports using OLS, FEM, or REM with N = 12 and T = 20. When I ran a unit root test, it turned out that all independent variables are stationary except the dependent variable. Furthermore, the dependent variable was stationary at I(1).
Could I use the data at level? My dependent variable has some missing values, so is it acceptable to run the regression at I(1)? What should I do next?
I am really confused so please help me, thank you a lot.
Hi everyone,
recently, I have been working on a study where I examine the impact of American tariffs, customs, and other import duties on European exports to the US. I have three variables (y = EU exports to the US, x1 = US tariffs, x2 = US customs and other duties). I use quarterly data from 1995Q1 until 2017Q1 (89 observations). My tutor has emphasized that I need to control for year and country fixed effects and maybe introduce dummies per year and country. I am quite clueless about how to do that. Why is it necessary? What is the equation? How do I do that in Excel or EViews? I would appreciate step-by-step instructions so much!
Thank you in advance for any help or comments.
Why do the signs of coefficients change when moving from OLS regression to Fixed Effect Regression? my theory stands on the assumption that PS would be positive and significant, the same as in the OLS result, but the fixed effect changes that. What's the solution?
Is it possible to use backward regression for all variables in an OLS model, but to leave the year dummies out of the backward selection, so that these dummies appear in the final model even if they are not significant?
Thank you.
I need data to perform OLS where the dependent variable will be the shadow economy in Greece and the independent variables will be corruption in Greece and the amount of e-transactions per person in the country, but I can't find the data. Can someone help me, please?
I am looking to do a power analysis with assumptions about dependent observations.
In the research design, each respondent is assigned 4 short pieces of text, each of which they are asked to rate (on a continuous scale). Each text belongs to either treatment A or treatment B, so each respondent reads and rates 2 texts from each treatment. Therefore, the 4 ratings that each respondent gives will depend on the respondent, i.e., they will be correlated.
In the following OLS regression, I would normally cluster the standard errors by respondent, but I am unsure how to implement this in a power analysis. I will assume either a small or medium treatment effect (e.g. Cohen's d of 0.2 or 0.4). Then, I am interested in knowing the power if I have a) 100, 200, 300 or 400 respondents and b) 2, 4 or 6 texts rated by each respondent.
I have tried to use the clusterPower package in R but I am not sure which function to use exactly and how.
I hope you can help. Thanks!
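One alternative to an analytic clustered-power formula is a Monte-Carlo power analysis: simulate the design (including the respondent-level dependence), run the planned test on each simulated dataset, and count rejections. A minimal pure-Python sketch; the random-intercept and residual variances, effect size, and the within-respondent difference test are all simplifying assumptions, not the clusterPower API:

```python
import math
import random
import statistics

random.seed(4)

def simulate_power(n_resp, texts_per_treat, effect_d, n_sims=200):
    """Monte-Carlo power for a within-respondent design.

    Each respondent rates texts from both treatments; a respondent random
    intercept induces the clustering. The analysis here is a t-test (normal
    approximation) on respondent-level A-minus-B mean differences, in which
    the respondent intercept cancels out."""
    hits = 0
    for _ in range(n_sims):
        diffs = []
        for _ in range(n_resp):
            u = random.gauss(0.0, 1.0)   # respondent random intercept
            a = [u + effect_d + random.gauss(0.0, 1.0)
                 for _ in range(texts_per_treat)]
            b = [u + random.gauss(0.0, 1.0)
                 for _ in range(texts_per_treat)]
            diffs.append(sum(a) / len(a) - sum(b) / len(b))
        m = statistics.mean(diffs)
        se = statistics.stdev(diffs) / math.sqrt(len(diffs))
        if abs(m / se) > 1.96:           # two-sided test at alpha = .05
            hits += 1
    return hits / n_sims

power = simulate_power(n_resp=100, texts_per_treat=2, effect_d=0.2)
print(round(power, 2))
```

Looping this over the grid of (100, 200, 300, 400 respondents) x (2, 4, 6 texts) gives the full power table; swapping in a cluster-robust OLS instead of the difference test is a straightforward extension if the final analysis will use between-respondent comparisons.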
Dear everyone,
I am in great distress and desperately need your advice. I have the cumulated (disaggregated) data of a survey of an industry (total exports, total labour costs, etc.) of 380 firms. The original paper uses a two-stage least squares (TSLS) model in order to analyze several industries, with one independent variable having a relationship with the dependent variable, which, according to the author, was the limitation preventing the use of OLS. However, I want to conduct a single-industry analysis and exclude the variable with the relationship, BUT instead analyze the model over 3 years. What is the best econometric model to use? Can I use an OLS regression over a period of 3 years? If yes, what tests are applicable then?
Thank you so much for your help, you are helping me out so much !!!!!!!
I have a panel dataset with T = 15 and N = 5 countries. I used the Hausman test to decide between the fixed and random effects models and got a p-value of 1, which implies the random effects model is preferred. Then I also used the Breusch-Pagan LM test to decide between the RE model and the pooled OLS model, which gives an LM statistic of 0 and a p-value of 1. This implies that pooled OLS is better. Could anyone tell me whether getting a chi-square test statistic of 0 and a p-value of 1 is fine, or whether it suggests something is not right with the data? Any suggestions/comments in this regard will be highly appreciated.
Hi,
This is my first ever research I am doing for my MSc so please bear with me.
N = 422. Looking at the histogram and the skewness coefficients, it can be confirmed that the data are skewed. Scatter plots tell me there are 3 weak-to-moderate linear relationships between the IVs and the DV and 1 curvilinear relationship, but they also seem homoscedastic. Kolmogorov-Smirnov values for all variables were significant, once again confirming non-normal distribution.
I tried removing outliers, but it did not help the distribution. I transformed the variables in SPSS, following Pallant's (2016) recommendation, using Reflect and Logarithm and Reflect and Inverse, as these shapes matched my histograms. The results are still far from normal.
Following the SPSS survival manual (Pallant, 2016), it seems there are limited options for what I can do at this point. I can report the descriptive part of my analysis for the dissertation; what usually happens is that people report correlations next and then test their model with some form of regression. And if they test the moderating effect of a variable, they seem to do SEM.
Now, I chose Spearman's correlation and described the association between the variables (rather than correlation); then I learned that it can also be used for hypothesis testing? But it can only be used for hypotheses such as H1: There is a relationship between this and that. Is that correct? If so, that is fine; three of my hypotheses are formulated accordingly. But what about the moderating effect? Is there a non-parametric way of testing it?
As an answer to my question regarding regression with data like this, my supervisor said OLS does not assume normal distribution (referring to the Gauss Markov theorem) so I could look into that and that it is recommended not to rely on normality tests but the diagnostic plots of residuals. Or stick to Spearman's but then he says I need to change my approach accordingly.
So once I use Spearman can I in addition use OLS? And if so, is there a non-parametric way of testing for moderating effects of a variable on the other relationships? Sorry about the long message, just wanted to give you enough context.
Dear colleagues,
I am planning to investigate a panel data set containing three countries and 10 variables. The time frame is a bit short, which concerns me (2011-2020 for each country). What should the sample size be in this case? Can I apply fixed effects, random effects, or pooled OLS?
Thank you for your responses beforehand.
Best
Ibrahim
I ran an OLS regression on my data and found issues with autocorrelation due to non-stationarity of the data (time series data). I need to conduct a generalized least squares (GLS) regression, as it is robust against biased estimates.
I have multiple measurements of two variables in different settings; my hypothesis is that the relationship between the variables _in general_ is described by a Kuznets-type curve (U-shape). I tried quadratic curve fitting with OLS, and the results seem to confirm my hypothesis (out of 25 sets of measurements, only 1 relationship is not U-shaped). Any advice on a better test?
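A common sharpening of the quadratic check is to require not only the right sign on the squared term but also a turning point that falls inside the observed data range (this is the idea behind Lind and Mehlum's U-test, which additionally tests the slopes at both ends of the range). A pure-Python sketch of the fit-plus-turning-point check on simulated data (the data-generating process is an assumption for illustration):

```python
import random

random.seed(5)

# toy data with a true U-shape: y = (x - 3)^2 + noise (an assumed example)
xs = [random.uniform(0.0, 6.0) for _ in range(300)]
ys = [(x - 3.0) ** 2 + random.gauss(0.0, 0.5) for x in xs]

# build the normal equations X'X b = X'y for the design [1, x, x^2]
rows = [[1.0, x, x * x] for x in xs]
XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]

def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    A = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        for r in range(3):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * c for a, c in zip(A[r], A[i])]
    return [A[i][3] / A[i][i] for i in range(3)]

b0, b1, b2 = solve3(XtX, Xty)
turning_point = -b1 / (2.0 * b2)
# a credible U-shape needs b2 > 0 AND the turning point inside the x range;
# a significant b2 with the minimum outside the data is just monotone curvature
print(round(b2, 2), round(turning_point, 2),
      min(xs) < turning_point < max(xs))
```

Applying this check across the 25 measurement sets (and, ideally, the formal U-test with its slope conditions) guards against counting relationships as U-shaped when the estimated minimum lies outside the observed data.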
Hello everyone
I have extensively read throughout the platform about the usage of different models to analyze likert scale and ordered dependent variables. I wanted to share my plans and see your opinions if it is the best model.
My context is the following: We asked how comfortable they would feel downloading an app with different characteristics (factors) from 1 - 11, 1 being under no circumstance I would download such app and 11 being I would download and use that app everyday. There are three factors,with 2, 3 and 7 levels accordingly (open-source, security and app provider). We deckerized with open source, as our previous research showed it wasn't significant, meaning that respondents were asked to evaluate a set of vignettes either from open-source or non-open source. We used clustered sampling and our sample data is representative of our objective population (with 600 answers).
I have read from sociological methodology that given that the likert scale is of 11 points (bringing a number of benefits) and it is set in an experimental manner, you can use ANOVA, OLS and Random Intercepts Models. However, I feel a bit uncomfortable using these, as some assumptions are broken. Thus, I decided to use an ordered logit regression (OLR) , as for me the dependent variable (willingness to download) is ordered. The parallel line assumptions isn't broken and all variable as significant, so that gives me confidence I can use this model. However, I started doubting if maybe a multinomial logistic regression.
I'm using R for the analysis, with the MASS package (specifically the polr function for the OLR, plus brant and poTest for checking the parallel-lines assumption). I have cross-checked that I get the same results in STATA, and they match.
In the article I also plan to include the ANOVA, OLS, and random-intercepts model to add robustness to the analysis. What's interesting is that, although some specific coefficients change from OLS to OLR, the conclusions are the same.
Thus: should I use the multinomial logistic regression or not? Any comments on what to report, or suggested improvements?
Edit/PS: My ANOVA shows that the independent variables don't interact. Should I still include an interaction in the OLR? Currently the model is dep.var ~ x1 + x2. Would you suggest dep.var ~ x1 + x2 + x1:x2 as a better fit, even though the ANOVA with interaction says the interaction isn't significant? And if you think the OLR should include the interaction, how exactly do I check whether it is significant?
I have a sample of 138 observations (cross-sectional data) and am running an OLS regression with 6 independent variables.
My adjusted R2 always comes out negative, even if I include only 1 independent variable in the model. All the beta coefficients, as well as the regression models, are insignificant. The value of R2 is close to zero.
My queries are:
(a) Is a negative adjusted R2 possible? If yes, how should I justify it in my study, and are there any references I can quote to support my results?
(b) Please suggest what I should do to improve my results. It is not possible to increase the sample size, and I have already checked my data for inconsistencies.
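A negative adjusted R2 is indeed possible whenever R2 is small relative to the number of predictors, since adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1). A quick numerical check with the questioner's sample size:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# With n = 138, k = 6 and an R-squared near zero, the degrees-of-freedom
# correction pushes the statistic below zero
adj = adjusted_r2(r2=0.01, n=138, k=6)
```

So a near-zero R2 plus several regressors mechanically produces a negative adjusted R2; it is a sign of no explanatory power, not a computational error.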
Dear Readers,
I am writing this short series of steps, which I have formulated to help those who are in a state of confusion regarding the "Analysis" of their research. As a professional from industry, I encountered several difficulties in understanding research; I will post that discussion separately. Please bear in mind that this information applies to the "Chapter 4 - Analysis" section of your research, and the same principles apply at the Masters, MPhil, and PhD levels. If any expert can help me further refine the concepts, I would very much appreciate it. I do understand the information might be too much to process at first, so give it a read several times; I guarantee you will find most of the answers. If you are confused about the concept of research, I will post another refresher for clarity. I write this because I have been a victim of the same confusion.
P.S. A word of warning: DO NOT confuse "Model" with "Method"; I made the same mistake. The model is the equation you formulate in Chapter 3 (Y = alpha + beta * X1 + error term); the method is what we (some of us) carelessly call models (OLS, Fixed Effects, Random Effects, etc.).
Steps for Analysis (What to do When confused):
1. Identify Data Type (Cross-Sectional, Time Series, Panel)
2. Run Descriptive Statistics
3. Check for Multicollinearity (tested by correlation among variables or VIF tests). (Books also prescribe heteroscedasticity and autocorrelation tests, but researchers have given these little importance and mostly use them only to defend their methodology; so if you run them, good, and if not, it's not the end of the world - BUT the multicollinearity check is a MUST.)
4. Identify whether there is any trend in the data using unit root tests - this will decide the next steps.
a) If there is no trend in the data, we can use OLS. The next steps for point (a) are as follows:
i. Run OLS and check for heteroscedasticity. If heteroscedasticity is detected, plain OLS is not applicable; we should move towards Fixed Effects and Random Effects, and then decide which is better using the Hausman test. Researchers also point out the issue of heterogeneity, which is sometimes used as an acid test for choosing between FE and RE. Roughly: if your data are homogeneous (i.e., you are working on a single industry), Fixed Effects is appropriate, because the only heterogeneity is within the sample (firm size, capital structure, etc.; i.e., the ungeared CAPM industry-risk beta - NOT the beta we estimate in research - is similar across firms). If the data are heterogeneous (say, an index that includes companies from several different industries, which contributes unidentified heterogeneity), the Random Effects model is more appropriate. However, you still have to check the rho values and correlations for the presence of endogeneity, which can still overturn the choice of RE/FE model.
b) If a trend is detected (which will usually be the case for time series and panel data), then levels regression is not applicable. (Do not get confused if you still see studies using this method; the method is just a choice.) In case of a trend, the following steps are to be followed:
a. Take logs of the variables or perform differencing (each observation minus the previous one) - 1st order, 2nd order, etc. - and identify the "level" at which the variables become stationary, i.e., the trend is removed. (Eventually, after some order of differencing, the data automatically become stationary, so there is no need to get confused.)
If some variables are stationary at "level" (initially), while some become stationary at different orders of differencing (1st order, 2nd order, ...), then a mixed approach is used. In this instance the following models are applicable:
1. ARDL 2. VAR 3. GMM/LSDV
GMM is applicable when: 1. the panel is dynamic; 2. the panel is short (the literature states N < 15, with T between 5 and 15); 3. the panel is unbalanced (not all observations are available for all time periods - for example, some companies close down during the research period). In this case the closest competitor to GMM is LSDV, but GMM is still preferred. It is further tested by: 1. an under-identification test; 2. an over-identification test. Additional tests include the Kruskal-Wallis test, which you can use if your variables are challenged for being "too similar" - say Return on Assets and Return on Equity, as both are derived from returns - to prove that the variables are not "similar". The Ramsey test can also augment the findings.
"Old Men" in the field, adamantly advocate the application of 2SLS/3SLS model in such situations. the basic functioning of all of these models is the same- They take lags up to several stages. However, As per Research Papers, i ve found that 2SLS/3SLS can offer a remedy instead of GMM, ARDL,LSDC, because of instrumental variables, but developing instrumental variables is another complex task and is often challenged, secondly researchers also believe that 2SLS/3SLS will be applicable for "Survey Data". Using these logics and citing references, you can easily defend your choice.
My apologies for the rough language; I only speak from my personal experience, when people merely pointed out "something wrong" and never gave me the solution, until I dove into the literature to find both the problem and the solution myself. I will post the story of my transition from a "professional in the industry" to a "researcher and professional in the industry", and how I was able to complete a paragon piece of research in my industry - the first ever based on secondary data from telecommunications in my country. I never got recognition for it, but I wanted to give something back to the industry in the form of a contribution to the literature.
Hi,
I am using recall of words on the first trial as a dependent variable. It captures the number of words an individual remembered, ranging from 0 to 10 words.
Do I need to use an ordered model, since 10 is preferred above 0 and the values are all integers, or can I just run it with OLS?
Thank you!
Dear researchers
I am estimating a gravity equation. With OLS everything looks normal, but when I include country and time fixed effects, the coefficient on Distance goes wrong. Can anyone explain this, and how to get Distance to affect trade volume negatively?
I have attached the screenshot of my equation and results.
I want to create variable_3 by taking the difference of variable_1 and variable_2.
In order to run the OLS regression containing variable_3 correctly, what should I do for the dates on which variable_1 or variable_2 has no data?
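One common approach, sketched here in pandas with hypothetical dates and the questioner's column names, is to align the two series on their dates and drop the rows where either side is missing, so the OLS sample contains only complete observations:

```python
import numpy as np
import pandas as pd

# Hypothetical daily data with non-overlapping gaps
idx = pd.date_range("2024-01-01", periods=6, freq="D")
df = pd.DataFrame(
    {
        "variable_1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
        "variable_2": [0.5, np.nan, 1.5, 2.0, 2.5, 3.0],
    },
    index=idx,
)

# Subtraction propagates NaN wherever either input is missing
df["variable_3"] = df["variable_1"] - df["variable_2"]

# Keep only complete rows for the regression sample
sample = df.dropna(subset=["variable_3"])
```

Dropping incomplete rows (listwise deletion) is the default most OLS routines apply anyway; interpolation or imputation is an alternative, but it manufactures data and should be justified explicitly.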
This is my research problem so far:
In this scientific paper I will conduct an empirical investigation whose objective is to discover whether the number of newly reported corona cases and deaths contributed to the huge spike in volatility of the S&P 500 during the pandemic phase of the corona outbreak. The paper will try to answer the following questions: "Is there any evidence of a significant correlation between stock market volatility on the S&P 500 and the newly reported number of corona cases and deaths in the US?" and "If there is significant evidence, can the surge in volatility be explained mostly by the national number of daily reported cases, or was the mortality number the largest driver?"
So far I have constructed a time series object in RStudio containing the variables VIX, newly reported US corona cases, and deaths. I have also converted my data into a stationary process and will later test some assumptions. I have a total of 82 observations for each variable, stretching from 15 February to 15 June.
I do not have a lot of knowledge regarding all the different statistical models and which ones are logical to use in my case. My first thought was to implement a GARCH or OLS regression, although I am not sure whether this is a smart choice. Hence, I ask you for some advice.
Thank you in advance :)
Best regards, stressed out student!
I am working on a panel of 36 countries over the period 1990-2018, and I want to compute marginal effects for the various regression models employed (OLS, RE, FE, and system GMM) using STATA. Any time I run the marginal effects in Stata with the commands "margins, dydx" or "margins, atmeans", the coefficients for the marginal effects are the same as the original regression coefficients. I learnt I must consider conduit variables.
Please, I need your help on how to compute marginal effects in Stata.
I have a moderating variable W, and my dependent variable is Y. OLS results show that the interaction term is positive and significant, but the main effect becomes negative, whereas it was positive without the moderation. So my main relationship changes from positive to negative after adding the moderator. Can I report these results? What could be the possible justification?
I am attempting to construct an error-correction model. I have 7 variables, including the dependent variable. I already ran an ADF unit root test at both level and first difference, and I am looking at the results both with constant, and with constant and trend. One of the variables already has a p-value smaller than 0.05 at level (constant and trend), but for all the other variables the null hypothesis was not rejected. Would that one variable that rejected the null cause any problem?
My second problem is that at first difference, one variable still has a p-value bigger than 0.05, and when I ran the test for cointegration the p-value came out at 0.7881, which is bigger than 0.05. Does this mean that I cannot run an OLS and then estimate an error-correction model?
Dear colleagues,
I ran several models in OLS and found these results (see the attached screenshot, please). My main concern is that some coefficients are extremely small yet statistically significant. Is this a problem? Can it be because my dependent variables are index values ranging between -2.5 and +2.5, while some of my explanatory variables are measured in, e.g., thousands of tons? Thank you beforehand.
Best
Ibrahim
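The unit-of-measurement point in the question above can be checked directly: rescaling a regressor only rescales its coefficient, so a tiny coefficient on a variable measured in thousand tons is not suspicious in itself. A minimal sketch on synthetic data:

```python
import numpy as np

# Regressor measured in thousand tons, outcome an index in [-2.5, 2.5]
rng = np.random.default_rng(5)
tons_thousands = rng.uniform(100.0, 5000.0, size=200)
y = 0.0004 * tons_thousands + rng.normal(scale=0.1, size=200)

X = np.column_stack([np.ones(200), tons_thousands])
beta_raw = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Express the regressor in million tons instead:
# the coefficient scales up by exactly 1000
X_scaled = np.column_stack([np.ones(200), tons_thousands / 1000.0])
beta_scaled = np.linalg.lstsq(X_scaled, y, rcond=None)[0][1]
```

Significance is unchanged by such rescaling (t-statistics are scale-free), so a small-but-significant coefficient usually just reflects mismatched units, not a modelling error.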
I am analyzing data using propensity score matching and path analysis. However, one of the assumptions in selecting matching variables for propensity score matching is that they should simultaneously influence selection into treatment and the outcome, but not be affected by the treatment variable (endogeneity problem and unconfoundedness assumption) (Caliendo & Kopeinig, 2008).
1. Am I making an analytical mistake (violating the assumption) if I run both propensity score matching and a normal regression (OLS with the outcome variable as dependent, or probit with the treatment variable as dependent) in the same study? I ask because I am using the same matching variables as independent variables in the OLS and probit.
2. Is it analytically correct to run a path analysis in addition to propensity score matching using the same matching variables? I am running the path analysis to trace the paths towards the outcome variables.
I need to run an OLS regression on a dataset of stock index returns. To control for the Monday and holiday effects, I need to add dummy variables. Another study that ran the same regression described this as follows:
"D_t = {D1_t, D2_t, D3_t, D4_t} are dummy variables for Monday through Thursday, and Q_t = {Q1_t, Q2_t, Q3_t, Q4_t, Q5_t} are dummy variables for days for which the previous 1 through 5 days are non-weekend holidays."
Let's say there is no trading on Tuesday, May 1, because of a non-weekend holiday. How do I correctly use the dummies? Which dummy from Q1t through Q5t has to be used for Wednesday, May 2, and are dummies also needed for Thursday, May 3, and Friday, May 4? Do I only need one dummy for the Wednesday, or do I need to set more than one dummy for the days following Tuesday, May 1?
I am new to this kind of regression, so I do not know how to use this.
Can somebody maybe help me? Thanks in advance.
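One way to read the quoted definition (a sketch with a hypothetical holiday calendar; the dates and column names are illustrative) is that Qk on day t switches on when the calendar day k days before t was a non-weekend holiday, so several Qk dummies fire on the days after a holiday:

```python
import pandas as pd

dates = pd.bdate_range("2018-04-23", "2018-05-11")  # business days
holidays = pd.to_datetime(["2018-05-01"])           # hypothetical Tuesday holiday
trading_days = dates[~dates.isin(holidays)]
df = pd.DataFrame(index=trading_days)

# Monday..Thursday dummies (Friday is the omitted base category)
for i, name in enumerate(["D1_Mon", "D2_Tue", "D3_Wed", "D4_Thu"]):
    df[name] = (df.index.dayofweek == i).astype(int)

# Qk = 1 if the calendar day k days earlier was a non-weekend holiday
for k in range(1, 6):
    lagged = df.index - pd.Timedelta(days=k)
    df[f"Q{k}"] = (lagged.isin(holidays) & (lagged.dayofweek < 5)).astype(int)
```

Under this reading, Wednesday May 2 gets Q1 = 1, Thursday May 3 gets Q2 = 1, and Friday May 4 gets Q3 = 1, i.e. each post-holiday day carries exactly one Qk dummy rather than the Wednesday carrying all of them.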
Hi, dear researchers.
I am running a panel regression in STATA. My panel dataset contains 3 countries covering 25 years, with 7 variables.
The variable I need is significant in the panel regression. However, when I run a simple OLS regression for each country separately, the required variable is significant for only two countries. What would you recommend in this case? Which model should I use?
I have asked students a series of questions relating to the service quality of a university from before and after the pandemic, on a 7-point Likert scale.
For instance
"Instructors motivate the students - PRE COVID-19" (Strongly Disagree - Strongly Agree)
and then again straight after
"Instructors motivate the students - DURING COVID-19" (Strongly Disagree - Strongly Agree)
So I have about 30 of these questions, and I want to see if changes in scores across the two periods have impacted a range of dependent variables; for example, one of them is student engagement, proxied by how many lectures they attend per week.
"How many lectures a week do you attend (Pre COVID-19)?" 1/2/3/4/5/6/7/8/9+
"How many lectures a week do you attend (During COVID-19)?" 1/2/3/4/5/6/7/8/9+
I am not very knowledgeable in using Stata. What is the best regression model for this? Do I need a fixed-effects model, and if so, how? Or can I just pool the data by creating new variables, for instance the change in scores from pre- to during-COVID, and then run a basic OLS?
Dear Colleagues,
I estimated the OLS models and subjected them to several tests; however, instability in the CUSUMSQ persists, as shown in the attached photo. What should I do in this case?
Best
Ibrahim
Hello,
I am in the process of estimating panel data regressions, where I regress stock returns on the Fama-French factors plus one dummy capturing ESG initiatives. To first test which estimation technique is most appropriate, I run the Hausman test to compare FE and RE. The result is that we fail to reject the null that Cov(alpha_i, x_it) = 0, which suggests that random effects yields the most efficient estimates.
However, when I estimate the regression using RE, I get a theta (also referred to as lambda) of 0, indicating that RE is equivalent to OLS, since the cross-sectional mean disappears from the model (it is multiplied by zero). I have also tried estimating with pooled OLS, and it indeed gives exactly identical estimates and SEs. The problem is that when I run a poolability test (H0: beta_i = beta), it rejects the null that the data are fit for pooled OLS.
This feels somewhat contradictory, as the RE model estimated the regression as a pooled OLS, yet the F-test for poolability rejects the null.
Am I missing something here, or has anyone been in a similar situation? I should state that I do the econometric modelling in Matlab with the Panel Data Toolbox (Álvarez et al., 2017).
I built a model of 4 regression equations (probit and OLS) in RStudio. I need to take the observation weights into consideration. While it is simple to do so for each regression equation separately, I can't find how to do so for the SUR model.