Questions related to Applied Econometrics
I am using an ARDL model; however, I am having some difficulties interpreting the results. I found that there is cointegration in the long run. I provided pictures below.
Hi all, I'm doing my FYP (final-year project) on the determinants of healthcare expenditure in 2011-2020. Here are my variables: government financing, GDP, total population.
The first model is: HealthcareExpenditure_t = β0 + β1·GovFinancing_t + β2·GDP_t + β3·Population_t + ε_t
The second, causal-relationship model is: HealthcareExpenditurePerCapita_t = β0 + β1·GDPPerCapita_t + ε_t
Is it possible to use a unit root test and then ARDL for the first model, and what test can I use for the second model?
Thank you in advance to those who reply :)
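For illustration, a minimal sketch of the unit-root-then-ARDL workflow in Python (statsmodels), with hypothetical column names; note that with only ten annual observations any such test will have very little power:

    # Sketch: ADF unit root checks, then an ARDL bounds test via a UECM.
    # `df` and its columns ('health_exp', 'gov_fin', 'gdp', 'pop') are hypothetical.
    import pandas as pd
    from statsmodels.tsa.stattools import adfuller
    from statsmodels.tsa.ardl import UECM

    for col in ['health_exp', 'gov_fin', 'gdp', 'pop']:
        stat, pval = adfuller(df[col].dropna())[:2]   # bounds testing needs no I(2) series
        print(col, 'ADF p-value:', round(pval, 3))

    uecm = UECM(df['health_exp'], lags=1,
                exog=df[['gov_fin', 'gdp', 'pop']], order=1)
    res = uecm.fit()
    print(res.bounds_test(case=3))   # Pesaran et al. (2001) bounds test

For the second, bivariate model the same bounds-testing logic applies with GDP per capita as the single regressor.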
I am currently replicating a study in which the dependent variable describes whether a household belongs to a certain category. Therefore, for each household the variable either takes the value 0 or the value 1 for each category. In the study that I am replicating the maximisation of the log-likelihood function yields one vector of regression coefficients, where each independent variable has got one regression coefficient. So there is one vector of regression coefficients for ALL households, independent of which category the households belong to. Now I am wondering how this is achieved, since (as I understand) a multinomial logistic regression for n categories yields n-1 regression coefficients per variable as there is always one reference category.
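For reference (this may or may not be what the replicated study does): a standard multinomial logit estimates one coefficient vector per non-reference category, whereas a conditional (McFadden) logit, in which the covariates vary over alternatives rather than over households, yields a single coefficient vector for all observations:

    Multinomial logit: P(y_i = j | x_i) = exp(x_i'β_j) / Σ_k exp(x_i'β_k), with β = 0 for the reference category (J-1 vectors)
    Conditional logit:  P(y_i = j) = exp(z_ij'γ) / Σ_k exp(z_ik'γ), a single vector γ for all households

If the study reports a single vector, it may be using the conditional-logit form.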
Hey! I need to improve already existing panel data model by adding 1 variable for access to technology. Is it possible, and what is the best variable to measure for technology accessibility. If is possible I would like to measure technological advancment as well. What should be my variables fpr this? What are the common practises so far? Thank you!
I use a conditional logit model with income, leisure time, and interaction terms of these two variables with other variables (describing individuals' characteristics) as independent variables.
After running the regression, I use the predict command to obtain probabilities for each individual and category. These probabilities are then multiplied by the median working hours of the respective categories to compute expected working hours.
The next step is to increase wage by 1%, which increases the variable income by 1% and thus also affects all interaction terms which include the variable income.
After running the modified regression, again I use the predict command and should obtain slightly different probabilities. My problem is now that the probabilities are exactly the same, so that there would be no change in expected working hours, which indicates that something went wrong.
On the attached images with extracts of the two regression outputs, one can see that the regression coefficients of the affected variables are indeed very similar, and that both the R² values and the values of the log-likelihood iterations are exactly the same. To my mind these observations explain why the probabilities are very similar, but I am wondering why they are exactly the same and what I possibly did wrong. I am replicating a paper where they did the same and were able to compute different expected working hours for the different scenarios.
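One mechanism consistent with identical log-likelihoods is pure rescaling: if every term containing income is multiplied by the same constant and the model is re-estimated, maximum likelihood simply divides the affected coefficients by that constant, and the fitted probabilities are unchanged. A minimal sketch of this invariance (binary logit for brevity, simulated data):

    # Multiplying a regressor by 1.01 and re-fitting rescales its ML coefficient
    # by 1/1.01, leaving the fitted probabilities exactly identical.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    income = rng.normal(10, 2, 1000)
    other = rng.normal(0, 1, 1000)
    y = (0.3 * income - 0.5 * other + rng.logistic(size=1000) > 3).astype(int)

    X1 = sm.add_constant(np.column_stack([income, other]))
    X2 = sm.add_constant(np.column_stack([1.01 * income, other]))  # "1% wage rise"
    p1 = sm.Logit(y, X1).fit(disp=0).predict(X1)
    p2 = sm.Logit(y, X2).fit(disp=0).predict(X2)
    print(np.allclose(p1, p2))   # True

If that is what is happening, the usual fix for a counterfactual is to keep the original estimates and predict on the modified data (old coefficients, scaled income), rather than re-estimating the model on the scaled data.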
In the file that I attached below, there is a line above the theta(1) coefficient and another one exactly below C(9). In addition, what is the number below C(9)? There is no description.
Hello, I am facing a problem concerning the computation of regression coefficients (necessary information in attached image):
Three regression coefficients (alpha y, m and f) of the main regression (2) are generated through three separate regressions.
Now I was wondering which would be the appropriate way to compute the alphas and gammas.
If I first run regression (2) and obtain the three regression coefficients alpha y, m and f, can I then use these as dependent variables in the separate regressions (3) in order to obtain the gammas?
What strikes me about this approach is that the values of the dependent variables alpha y, m and f would then be the same for every observation.
In the paper they state that the alphas are vectors, but I don't properly understand how they could be vectors (maybe that's the issue after all?).
Or is there a way to directly merge the regressions / directly integrate the regressions (3) into regression (1)? Preferably in Stata.
I appreciate any help, thank you!
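Without the paper at hand this is only a guess, but one common pattern behind "the alphas are vectors" is that regression (2) is estimated separately for each unit or group, so each group gets its own alpha-hat, and those estimates then vary across observations in regressions (3). A purely hypothetical sketch of that two-step pattern (all names invented):

    # Hypothetical two-step: first-stage coefficients estimated per group,
    # then regressed on group-level characteristics (regressions (3)).
    import pandas as pd
    import statsmodels.formula.api as smf

    rows = []
    for g, d in df.groupby('group'):   # `df` and all columns are placeholders
        r = smf.ols('outcome ~ y + m + f', data=d).fit()
        rows.append({'group': g, 'alpha_y': r.params['y'],
                     'alpha_m': r.params['m'], 'alpha_f': r.params['f']})
    alphas = pd.DataFrame(rows).merge(group_chars, on='group')
    print(smf.ols('alpha_y ~ z1 + z2', data=alphas).fit().summary())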
I’m conducting an event study for the yearly inclusion and exclusion of some stocks (from different industry sectors) in an index.
I need to calculate the abnormal return for each stock upon inclusion in or exclusion from the index.
I have some questions:
1- How do I decide on the length of the look-back period for the "estimation window", and how do I justify it?
2- Stock return is calculated either as
(price today - price yesterday) / (price yesterday), or as
ln(price today / price yesterday).
I see both ways used, although they give different results.
Can either of them be used to calculate the CAR?
3- When calculating the abnormal return as the difference between the stock return and a benchmark (market) return, should the benchmark be the index itself (the one on which stocks are included or excluded), or the sector index related to the stock?
Appreciate your advice with justification.
Many thanks in advance.
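On question 2, both definitions are standard: the simple return (P_t - P_{t-1})/P_{t-1} and the log return ln(P_t/P_{t-1}). Either can be used for CARs as long as you are consistent; log returns are additive over time, which is one reason they are often preferred for cumulating. A minimal pandas sketch (the series names and event window are hypothetical):

    # Simple vs. log returns, market-adjusted abnormal returns, and a CAR.
    import numpy as np

    r_simple = price.pct_change()              # (P_t - P_{t-1}) / P_{t-1}
    r_log = np.log(price / price.shift(1))     # ln(P_t / P_{t-1})
    mkt = np.log(index_price / index_price.shift(1))
    ar = r_log - mkt                           # market-adjusted abnormal return
    car = ar.loc[event_window].sum()           # log returns cumulate by summing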
I have a big dataset (n > 5,000) on corporate indebtedness and want to test whether SECTOR and FAMILY-OWNED are significant in explaining it. The information is a percentage (total liabilities/total assets) but is NOT bounded: many companies have indebtedness above 100%. My hypotheses are that the SERVICES sector is more indebted than other sectors, and that FAMILY-OWNED companies are less indebted than other companies.
If the data were normally distributed and had equal variances, I'd perform a two-way ANOVA.
If the data were normally distributed but heteroscedastic, I'd perform a two-way robust ANOVA (using the R package "WRS2").
As the data are neither normally distributed nor homoscedastic (according to the many tests I performed), and there is no such thing as a "two-way Kruskal-Wallis test", which is the best option?
1) perform a generalized least squares regression (thereby correcting for heteroscedasticity) to check for the effect of the two factors on my dependent variable?
2) perform a non-parametric ANCOVA (with the R package "sm"? Or "fANCOVA"?)
What are the pros and cons of each alternative?
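For option 1, one pragmatic route with n > 5,000 is a regression with the two factors (and their interaction) and heteroskedasticity-robust standard errors, which drops the normality and equal-variance assumptions from inference. A minimal sketch in Python (the R analogue with lm() plus the sandwich/car packages is equivalent; column names are hypothetical):

    # Two-factor "ANOVA" as a regression with HC3 robust standard errors.
    import statsmodels.formula.api as smf

    res = smf.ols('indebtedness ~ C(sector) * C(family_owned)',
                  data=df).fit(cov_type='HC3')
    print(res.wald_test_terms())   # ANOVA-style table using the robust covariance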
I have run an ARDL model on time-series cross-sectional data, but the output does not report the R-squared. What could be the reason(s)?
Dear research community,
I am currently working with Hofstede's dimensions; however, I do not use his questionnaire exactly. In order to calculate my index in accordance with his procedure, I am looking for the meaning of the constants in front of the mean scores.
For example: PDI = 35(m07 – m02) + 25(m20 – m23) ... What do 35 and 25 mean? How could I calculate them with regard to my research?
Thank you very much for your help!
I'm working on research using the DEA (Data Envelopment Analysis) method to measure provincial energy efficiency. However, due to data constraints, provincial energy consumption data are not available. Can I assume that provincial energy consumption is proportional to provincial GDP, i.e.
province i energy consumption ≈ (national energy consumption / national GDP) × province i GDP?
Can I use the Granger causality test on a monetary variable only, or do I need non-monetary variables as well?
Also, do I need to run any test before Granger causality, like a unit root test, or can I just use the raw data?
What free programs can I use for the computations?
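On the software point: Python, R and gretl are all free and cover this. The usual order of operations is unit root tests first, differencing if needed, then the Granger test (which, by construction, needs at least two series). A minimal Python sketch with hypothetical series names:

    # ADF unit root checks, then a pairwise Granger causality test.
    import pandas as pd
    from statsmodels.tsa.stattools import adfuller, grangercausalitytests

    print('ADF p-values:', adfuller(m1.dropna())[1], adfuller(m2.dropna())[1])
    data = pd.concat([m1, m2], axis=1).diff().dropna()   # difference if both are I(1)
    grangercausalitytests(data, maxlag=4)   # H0: column 2 does not Granger-cause column 1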
I have heard some academics argue that the t-test can only be used for hypothesis testing, and that it is too weak a tool for analysing a specific objective in academic research. For example, is a t-test an appropriate analytical tool for determining the effect of credit on farm output?
What is the most widely accepted method for measuring the impact of regulation/policy so far?
I only know that Difference-in-Differences (DiD), Propensity Score Matching (PSM) and two-step system GMM (for dynamic models) are common methods. I would appreciate your opinions for a 20-year panel of firm-level data.
I would like to analyze the effect of innovation in one industry over a time period of 10 years. The dependent variable is exports, and the independent variables are R&D and labour costs.
What is the best model to use? I am planning to use a log-linear model.
Thank you very much for your greatly needed help!
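If both sides enter in logs, the coefficients read directly as elasticities, which suits this question well; with only 10 annual observations, though, the specification should be kept small. A minimal sketch (hypothetical names):

    # Log-log specification: coefficients are elasticities.
    import numpy as np
    import statsmodels.formula.api as smf

    res = smf.ols('np.log(exports) ~ np.log(rd_exp) + np.log(labour_cost)',
                  data=df).fit(cov_type='HAC', cov_kwds={'maxlags': 1})
    print(res.summary())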
I am planning to investigate a panel data set covering three countries and 10 variables. The time frame is a bit short, which concerns me (2011-2020 for each country). What should the sample size be in this case? Can I apply fixed effects, random effects, or pooled OLS?
Thank you for your responses beforehand.
I am investigating the change of a dependent variable (Y) over time (years). I have plotted the dependent variable across time as a line graph, and it seems to be correlated with time (i.e., Y increases over time, though not in every year).
I was wondering whether there is a formal statistical test to determine if this relationship between the time variable and Y exists.
Any help would be greatly appreciated!
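A simple formal check is to regress Y on a time index and test the slope, using autocorrelation-robust (Newey-West) standard errors since the observations are ordered in time; the Mann-Kendall test is a nonparametric alternative. A minimal sketch (hypothetical names):

    # Test for a linear time trend with Newey-West (HAC) standard errors.
    import statsmodels.formula.api as smf

    df['t'] = range(len(df))   # simple time index
    res = smf.ols('y ~ t', data=df).fit(cov_type='HAC', cov_kwds={'maxlags': 2})
    print(res.t_test('t = 0'))   # H0: no linear trend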
Dear Research Community,
I would like to check for structural breaks in a polynomial regression that predicts expected excess returns from excess equity-to-bond market volatility. I have found some useful references, but none deals with polynomials. For instance:
- Andrews, D.W.K., 1993. Tests for Parameter Instability and Structural Change With Unknown Change Point. Econometrica 61, 821-856.
- Bai, J. and P. Perron, 1998. Estimating and Testing Linear Models With Multiple Structural Changes. Econometrica 66, 47-78.
- Bai, J. and P. Perron, 2003. Computation and Analysis of Multiple Structural Change Models. Journal of Applied Econometrics 18, 1-22.
- Bai, J. and P. Perron, 2004. Multiple Structural Change Models: A Simulation Analysis. In Econometric Essays, Eds. D. Corbae, S. Durlauf, and B.E. Hansen. Cambridge, U.K.: Cambridge University Press.
As my polynomial is of order 3, I am wondering whether structural breaks have to be checked for all three orders' parameters (X, X² and X³) in a time-varying way, or whether there is a more efficient way to handle this issue.
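One transparent option at a candidate break date is a Chow-type test on all three polynomial terms at once: interact a post-break dummy with X, X² and X³ and jointly test the shifts; a Bai-Perron style search over unknown dates can then use the same specification. A minimal sketch (hypothetical names and date):

    # Chow-type break test on all three polynomial coefficients jointly.
    import statsmodels.formula.api as smf

    df['post'] = (df.index >= '2008-09-01').astype(int)   # candidate break date
    full = smf.ols('excess_ret ~ (x + I(x**2) + I(x**3)) * post', data=df).fit()
    restr = smf.ols('excess_ret ~ x + I(x**2) + I(x**3)', data=df).fit()
    print(full.compare_f_test(restr))   # (F statistic, p-value, df difference)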
I am looking to test for unit roots in a panel data series. In this regard, I want to use the Hadri and Rao (2008) test with structural breaks. Is there any way I can perform this test in Stata or similar statistical software?
My research aims to find the determinants of FDI. I am using the bounds test to check for a long-run relationship, together with a cointegration test and other diagnostic tests.
I ran an error correction model, obtaining the results depicted below. The model comes from the literature, where Dutch disease effects were tested in the case of Russia. My dependent variable was the real effective exchange rate, while oil prices (OIL_Prices), terms of trade (TOT), the public deficit (GOV) and industrial productivity (PR) were independent variables. My main concern is that only the error correction term, the dummy variable and the intercept are statistically significant. Moreover, the residuals are not normally distributed and are heteroscedastic. There is no serial correlation issue according to the LM test. How can I improve my findings? Thank you beforehand.
I estimated an autoregressive model in EViews. I got a parameter estimate for one additional variable which I had not included in the model; the variable is labelled 'SIGMASQ'.
What is this variable and how do I interpret it?
I am attaching the results of the autoregressive model.
Thanks in advance.
I applied the Granger causality test in my paper and the reviewer wrote me the following: "the statistical analysis was a bit short – usually Granger causality is followed by some vector autoregressive modelling..."
What can I respond in this case?
P.S. I had a small sample size and serious data limitations.
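If the sample allows it, one way to meet the reviewer halfway is to report a small VAR and the Granger test computed within it; with serious data limitations, you can instead argue that a full VAR would be overparameterised and keep the pairwise tests, stating the lag choice and stationarity checks explicitly. A minimal sketch of the VAR route (hypothetical, stationary series):

    # VAR with lag selection by AIC, then a Granger test inside the VAR.
    from statsmodels.tsa.api import VAR

    res = VAR(data[['x', 'y']]).fit(maxlags=4, ic='aic')
    print(res.test_causality('y', ['x'], kind='f'))   # H0: x does not Granger-cause y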
Hello, dear network. I need some help.
I'm working on research using the event study approach. I have a couple of doubts about the significance of the treatment variable's lead and lag coefficients.
I'm not sure I am satisfying the pre-treatment parallel trends assumption: all the lags are statistically insignificant and sit around the zero line. Is that enough to satisfy the identification assumption?
Also, I'm not sure about the lead coefficients' significance and their interpretation. The table with the coefficients is attached.
Thank you so much for your help.
I am working on formulating a hydrological model where runoff (the output variable) is available at a monthly time step while rainfall (the input variable) is at a daily time step.
I first wanted to explore mathematical models and techniques that can be used here. I found the MIDAS regression method, which relates variables sampled at mixed frequencies (output at a monthly time step, input at a daily time step). But the problem is that the variables in hydrological models are at the same time step, so that technique will not work, because the MIDAS model relates variables sampled at different frequencies.
So can anyone suggest relevant literature in which both the output and input variables of the model are related at high frequency (say, daily), but the model learns from low-frequency (monthly) output data and high-frequency (daily) input data?
The pairwise Granger causality test can be done using EViews. Is doing only this test reliable enough to establish causality? And is it only a long-run causality test, or does it test both long-run and short-run causality?
Hello everyone. I am using a VECM and I want to use variance decomposition, but as you know, variance decomposition is very sensitive to the ordering of the variables. I read in some papers that it is better to use generalized variance decomposition because it is invariant to the ordering of the variables. I am using Stata, R and EViews, and my problem is how to perform the generalized variance decomposition; if anyone knows, please help me.
My aim is to find out whether there is a significant relationship between FDI and its determinants. I am using the bounds test and an error correction model.
I am running an ARDL model in EViews and I need to know the following, if anyone could help!
1. Is the optimal number of lags for annual data (30 observations) 1 or 2, OR should a VAR be applied to determine the optimal number of lags?
2. When we apply the VAR, the maximum number of lags applicable was 5 (beyond 5 we got a singular matrix error). The problem is that as we increase the maximum number of lags, the optimal number of lags increases (when we allow 2 lags, we get 2 as optimal; when we allow 5, we get 5). So what should be done?
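A data-driven compromise with 30 annual observations is to cap the maximum lag at something small (1 or 2) and let an information criterion choose within that cap, rather than letting an unrestricted VAR chase ever-longer lags. Outside EViews this looks like the following sketch (statsmodels, hypothetical names):

    # Information-criterion lag selection for an ARDL with a small cap.
    from statsmodels.tsa.ardl import ardl_select_order

    sel = ardl_select_order(df['y'], maxlag=2, exog=df[['x1', 'x2']],
                            maxorder=2, ic='aic')   # cap at 2 lags for 30 obs
    print(sel.model.ardl_order)
    print(sel.model.fit().summary())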
In one of my papers, I applied Newey-West standard errors to panel data for robustness purposes. I want to differentiate this approach from the FMOLS and DOLS models. On what grounds can we justify this approach over FMOLS and DOLS?
My research is on foreign direct investment and its determinants, so I need to see whether there is any significant relationship between the variables by looking at the p-values. Should I interpret all the variables, including the lagged ones?
I ran several OLS models and found these results (see the attached screenshot, please). My main concern is that some coefficients are extremely small, yet statistically significant. Is this a problem? Could it be because my dependent variables are index values ranging between -2.5 and +2.5, while some of my explanatory variables are measured in, e.g., thousands of tons? Thank you beforehand.
I have GDP and MVA data, and although the MVA is stationary, the GDP is non-stationary even after log transformation followed by detrending followed by differencing. I want to build a VAR/VEC model for ln(GDP) and ln(MVA), but this data has been haunting me for the past 3 days. I have also tried both methods of differencing, i.e., linear-regression detrending and direct differencing, but nothing seems to work.
Also, ln(GDP) and ln(MVA) satisfy the cointegration test, and their trends are very similar. But for a VAR/VEC model I need them to be I(1), which is not the case. Any suggestions on how to handle this data would be highly appreciated!
I have attached the snapshot of the data and also the data itself.
I would like to employ the within transformation in a panel data analysis. Market value added is the dependent variable; various value drivers (advertising expenses, number of patents, etc.) are the explanatory variables. Is it appropriate to use standardized coefficients? Or maybe a logarithmic form of the regression is more suitable?
I noticed that when I estimate an equation by least squares in EViews, under the Options tab there is a tick box for a degrees-of-freedom (d.f.) adjustment. What is its importance and role? When I estimate an equation without the d.f. adjustment, I get two statistically significant coefficients out of five explanatory variables; however, when I estimate with the d.f. adjustment, I do not get any significant results.
Thank you beforehand.
I built a nested logit model. Level 1: 8 choices; level 2: 22 choices.
In type 4, I have only one choice in level 1, which corresponds to one choice in level 2.
The dissimilarity parameter is equal to 1 in this case (not surprising).
Can I run the model normally when I have an IV parameter that is equal to one?
Can the results be interpreted normally, or what should I do in this case?
I tried the command "constraint 1 [type4_tau]_cons = 1", but the model does not run.
What can I do?
Thanks in advance for your advice.
I am trying to run a regression of a Cobb-Douglas function.
The problem is that my dataset captures each firm at a single point in time,
so I have a dataset over the period 1988-2012,
but each firm appears only once!
(I cannot tell whether it is panel / time series / cross-section data.)
I want to find the effect of labour and capital on value added.
I have information on intermediate inputs.
I am using two methods: Olley & Pakes and Levinsohn & Petrin.
But Stata keeps telling me that there are no observations!
levpet lvalue, free(labour) proxy(intermediate_input) capital(capital) valueadded reps(250)
Why is the command not working and reporting no observations?
(Is this due to the fact that each firm appears only once in the data?)
(If yes, what are the possible corrections for simultaneity and selection bias in this data?)
Thanks in advance for your help,
I have panel data that fits difference-in-differences.
I regress Bilateral FDI on Bilateral Investment Treaties (BIT). BIT is a dummy taking 1 if a BIT exists and zero otherwise, while Bilateral FDI is the amount of FDI between the two economies. Objective: examine whether BITs enhance bilateral FDI.
The issue is that each country started its BIT with its partner country at its own fixed time (different from the others): there is NO common treatment time for the whole dataset.
I am willing to assume different time periods in a random way and run my diff-in-diff (for robustness).
My questions:
(1) Do you think this method is valid?
(2) Any suggestions for the random selection of time periods?
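For what it's worth, staggered adoption does not require picking an artificial common date: the standard baseline is the two-way fixed effects regression (pair and year effects), with randomised placebo treatment dates as a robustness check, as you propose. A minimal sketch with linearmodels (hypothetical names):

    # Two-way fixed effects diff-in-diff with staggered BIT adoption.
    from linearmodels.panel import PanelOLS

    panel = df.set_index(['pair', 'year'])   # country-pair / year panel
    mod = PanelOLS.from_formula('fdi ~ bit + EntityEffects + TimeEffects',
                                data=panel)
    print(mod.fit(cov_type='clustered', cluster_entity=True))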
I would like to know the difference between first-, second- and third-generation panel data techniques.
I am currently trying to estimate the effect of energy crises on food prices. Given the link between energy and food prices, I am inclined to think that an ECM will be best for estimating the relationship between food prices and energy (fuel) prices. Additionally, I would like to include dummy variables in the model to estimate the effects of periods of energy crises on food prices. This, I know, is simple to do.
Where I am confused is how to model price volatility in the context of an ECM. I am interested in how fuel prices, as well as the structural dummies for energy crises, influence not just the determination of food prices but their volatility as well.
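One commonly used (two-step) way to get at volatility here is to estimate the ECM for the conditional mean first and then fit a GARCH model to its residuals; whether the crisis dummies belong in the mean equation, the variance equation, or both is a modelling choice, and a one-step ECM-GARCH would require joint ML estimation instead. A rough sketch of the two-step version with the arch package (hypothetical names):

    # Step 1: ECM for food prices by OLS. Step 2: GARCH(1,1) on its residuals.
    import statsmodels.formula.api as smf
    from arch import arch_model

    ecm = smf.ols('d_food ~ d_fuel + ect_lag + crisis_dummy', data=df).fit()
    garch = arch_model(ecm.resid, mean='Zero', vol='GARCH', p=1, q=1).fit(disp='off')
    print(garch.summary())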
Can anyone help me carry out mean group (MG) and pooled mean group (PMG) analysis? I have used Microfit and EViews before. I would appreciate advice on how to use these panel data methods in Microfit, EViews and Stata.
I am estimating a bivariate probit model, where the errors of the two probit equations are correlated and therefore not independent. However, I suspect that one of the explanatory variables in both equations may also cause endogeneity problems. My question is whether there is perhaps a two-stage procedure to correct this situation. Instrumental variables, maybe? Could you suggest literature on this problem?
I would like to perform event study analysis through the website https://www.eventstudytools.com/.
Unfortunately, they ask for the data to be uploaded in a format I don't understand; I don't know how to put the data into this form, and I can't find a user manual or an email address to contact them.
Can anyone kindly advise how to use this service and explain it in a plain, easy way?
Thanks in advance.
I'm conducting an event study for a sample of 25 firms that each went through a certain yearly event (inclusion in an index).
(The 25 firms (events) are collected from the last 5 years.)
I'm using daily abnormal returns (AR), and I average the daily returns across the 25 firms to get daily "average abnormal returns" (AAR).
Estimation window (before the event) = 119 days
Event window = 30 days
1- I tested the significance of the daily AARs with a t-test and corresponding p-value. How can I calculate the statistical power of those daily tests?
(significance level used = 0.05, two-tailed)
2- I calculated "cumulative average abnormal returns" (CAAR) over some period in the event window and tested its significance with a t-test and corresponding p-value. How can I calculate the statistical power of this CAAR significance test?
(significance level used = 0.05, two-tailed)
Thank you for your help and guidance.
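Post-hoc power for a one-sample t-test needs only an effect size (mean over standard deviation of the ARs across the 25 firms), the sample size and alpha. A minimal sketch (hypothetical input):

    # Power of a two-sided one-sample t-test, n = 25 firms.
    from statsmodels.stats.power import TTestPower

    d = ar_day.mean() / ar_day.std(ddof=1)   # Cohen's d for one day's ARs
    print(TTestPower().power(effect_size=d, nobs=25, alpha=0.05,
                             alternative='two-sided'))

The same call works for the CAAR test, with the effect size computed from the components being averaged.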
The original series is non-stationary, as it has a clear increasing trend and its ACF plot dampens gradually. What is the optimum order of differencing (d) needed to make the series stationary?
Furthermore, if the ACF and PACF plots of the differenced series do not cut off after a definite number of lags but have peaks at certain intermittent lags, how does one choose the optimum values of p and q?
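A standard recipe: difference until a unit root test no longer rejects (d is almost always 0, 1 or 2, and over-differencing shows up as a large negative lag-1 ACF), and when the ACF/PACF give no clean cut-off, choose (p, q) by information criterion over a small grid rather than by eye. A minimal sketch (hypothetical series y):

    # Choose d by ADF test, then (p, q) by AIC over a small grid.
    from statsmodels.tsa.stattools import adfuller
    from statsmodels.tsa.arima.model import ARIMA

    d, s = 0, y.copy()
    while adfuller(s.dropna())[1] > 0.05 and d < 2:
        s, d = s.diff(), d + 1

    best = min(((p, q) for p in range(4) for q in range(4)),
               key=lambda pq: ARIMA(y, order=(pq[0], d, pq[1])).fit().aic)
    print('Selected order:', (best[0], d, best[1]))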
I have seen that some researchers just compare the difference in R² between two models: one in which the variables of interest are included and one in which they are excluded. However, in my case this difference is small (0.05). Is there any method by which I can be sure (or at least have some support for the argument) that this change is not just due to luck or noise?
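For nested OLS models, the standard formal check is the joint F-test on the added block of variables, which is exactly a test of whether the R² increment could be noise; a stricter alternative is out-of-sample comparison. A minimal sketch (hypothetical names):

    # F-test of the R-squared increment between nested OLS models.
    import statsmodels.formula.api as smf

    restricted = smf.ols('y ~ x1 + x2', data=df).fit()
    full = smf.ols('y ~ x1 + x2 + z1 + z2', data=df).fit()   # adds the block
    f_stat, p_value, df_diff = full.compare_f_test(restricted)
    print(f_stat, p_value)   # small p: the R2 gain is unlikely to be pure noise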
To illustrate my point, I present a hypothetical case with the following equation:
wage = C + 0.5·education + 0.3·rural_area
where the variable "education" measures the number of years of education a person has, and "rural area" is a dummy variable taking the value 1 if the person lives in a rural area and 0 if she lives in an urban area.
In this situation (and assuming no other relevant factors affecting wage), my questions are:
1) Does the 0.5 coefficient on education reflect the difference between (1) the mean marginal return of an extra year of education on the wage of an urban worker and (2) the mean marginal return of an extra year of education for a rural worker?
a) If my reasoning is wrong, what is the intuition behind the mechanism of "holding constant"?
2) Mathematically, how does just adding the rural variable "hold constant" the effect of living in a rural area on the relationship between education and wage?
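On question 2, the algebra behind "holding constant" is the Frisch-Waugh-Lovell theorem: the multiple-regression coefficient on education equals the simple-regression coefficient obtained after purging both wage and education of the part explained by the rural dummy. In this two-regressor case,

    β̂_education = Cov(ẽ, w̃) / Var(ẽ),

where ẽ and w̃ are the residuals from regressing education and wage, respectively, on the rural dummy (and a constant). Note also that without an education × rural interaction term, the specification imposes the same return to education in both areas; it does not estimate a rural-urban difference in returns.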
I am trying to learn the use of the augmented ARDL, but I could not find a command for it in Stata. Can anyone refer me to user-written code for the augmented ARDL? Is there a good paper that describes the difference between the ARDL bounds test and the augmented ARDL procedure? I would be happy if you could answer these questions.
In the augmented ARDL, there are three tests to confirm long-run cointegration: the overall F-test, the t-test on the lagged dependent variable, and the F-test on the lagged independent variables.
- How do I find/calculate the t-statistic for the lagged dependent variable?
- How do I find/calculate the F-statistic for the lagged independent variables?
Using Stata, I find that the bounds test produces two test statistics, an F-statistic and a t-statistic, but both of them are for the overall test of cointegration. How can I find the t-statistic for the lagged dependent variable and the F-statistic for the lagged independent variables?
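For what it's worth, outside Stata the same three statistics can be read off an estimated unrestricted ECM; a rough statsmodels sketch (hypothetical names, and the exact parameter labels should be checked in the fitted model):

    # The three augmented-ARDL cointegration checks from a UECM.
    from statsmodels.tsa.ardl import UECM

    res = UECM(df['y'], lags=1, exog=df[['x1', 'x2']], order=1).fit()
    print(res.bounds_test(case=3))   # overall F (and t) bounds test
    print(res.summary())             # t-stat on the lagged level of y appears here
    print(res.params.index)          # inspect the names of the lagged-level terms,
                                     # then Wald-test them jointly, e.g.:
    # res.wald_test('x1.L1 = 0, x2.L1 = 0')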
The paper on which I am working is a multivariate study. I am planning to use this model as it has two advantages:
1. It tests the stability of the long-term relationship across quantiles and provides a more flexible econometric framework.
2. It can explain the possible asymmetry in the response of one variable to changes in another variable.
For these two reasons, I prefer it to NARDL.
As I am not good at Stata coding, any help with coding this method would be highly appreciated.
I'm conducting an event study for the inclusion of companies in a certain index.
The event is the "inclusion event" for companies in this index over the last 5 years.
For the events, we have a yearly announcement date (AD) for inclusions, and also an effective change date (CD) for the inclusion in the index.
Within the same year, I have aligned all companies on AD as day 0, and since they are companies from the same year, CD also aligns for all of them.
The problem comes when I try to aggregate companies from different years: although I align them all to have the same AD, the CD differs from one year to another, so CDs don't align for companies from different years.
How can I overcome this misalignment of CDs across years, so that I can aggregate all the companies together?
I'm conducting an event study on the effect of a news announcement at a certain date on stock returns.
Using the market model to estimate expected stock returns in the estimation window, we need to regress the returns of the stock under study on the returns of a market portfolio index.
1- How can we decide on the market portfolio index to use in this regression?
Is it just the main index of the market?
The sector index to which the stock under study belongs? Etc.?
2- Is it necessary that the stock under study be among the constituents of this market index?
I would appreciate answers justified with research citations, if possible.
I am currently assisting with research on cross-border capital flows.
A common problem seems to be that both the acquisition of assets and valuation effects determine cross-border asset holdings as reported, for example, in the CPIS data. Hobza and Zeugner (2014) use the BoP statistics on portfolio investments to derive valuation effects on portfolio debt and equity (change in asset holdings minus acquisitions).
I am wondering whether the valuation effect could also be estimated at a finer level, because I want to distinguish not only between portfolio debt and equity but also between different types of instruments,
for instance, between different debt maturities.
I am struggling with statistics for price comparison.
I would like to check whether the mean market price of a given product A differs over two consecutive time periods, namely December-January and February-March.
My H0 would be that the means are equal, and H1 that the February-March mean is lower.
For this I have all the necessary data as time series, sampled at the same frequency.
I thought of using the paired t-test, but the price distribution is not normal (extremely low p-value on the Shapiro-Wilk test).
I also guess that the two samples cannot be treated as independent, as my intuition is that the price in February depends on the price in January.
Do you know of any test that would fit here, given the nature of the problem?
Thanks in advance
I would like to estimate changes in the Gini index over 15 years. Should I log-transform the independent variables? The raw data for some independent variables are not normally distributed.
Every generation seems to strive for a better future while ignoring the betterment of the immediate present in any real sense. Time is essential to all activities, and results are only obtained at the end, irrespective of their scale, magnitude or span; thus present behaviour is driven by the anticipation of a future outcome. In this way society fails to address the present and gets trapped in a vicious cycle of future-driven momentum, ignoring the true future. By isolating this driving force itself, mankind could realize the present and secure the future by securing the present.
Is this not the economic problem of society at this very juncture? Are we actually addressing sustainability?
I'm working with life satisfaction as my dependent variable and some other independent variables that measure purchasing power (consumption, income and specific expenditures). To take into account the diminishing marginal returns of these last variables (following the literature), I transformed them into their natural logarithms. However, now I want to compare the size of the coefficients on specific expenditures with those on consumption and income. Specifically, I would like a procedure that allows me to interpret the result like this: one unit of resources directed to a type of expenditure (say, culture) is more/less effective at improving life satisfaction than the same unit would be under the category of income. If I just do this without the natural logarithm (that is, expressed in dollars), the coefficients change in counterintuitive ways, so I would prefer to avoid that.
I was thinking about using beta coefficients, but I don't know if it makes sense to standardize an already logarithmic coefficient.
Please, could somebody help me by indicating the right direction for analyzing salaries? I asked students about their salary expectations in 6 different situations, so I have 6 independent groups. I also asked about gender, age, employment status, course and GPA.
I have to 1) compare some groups and identify side effects, and 2) select one group and find out the impact of gender, GPA, etc. on salary expectations. But my problem is that, although I have quite a lot of data, my distribution is strange. The more I read, the more confused I get.
For example, I have 2 groups, N1 = 175, N2 = 202.
How can I compare salary expectations in these groups? I read that "overlapping kernel density plots can be a powerful way to compare groups". OK, I can overlay them, but then what? I also want some quantitative result about salaries, not just a visualization.
1) What do you think about a) quantile regression or b) kernel regression in this case?
Or maybe there is some other way that I don't see.
Thank you for reading my question!
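Quantile regression does look like a natural fit for a skewed outcome: it compares, say, median (or 90th-percentile) expected salaries across groups without any normality assumption and gives quantitative contrasts, not just a picture. A minimal sketch (hypothetical names):

    # Median regression of salary expectations on group and covariates.
    import statsmodels.formula.api as smf

    m = smf.quantreg('salary ~ C(group) + C(gender) + gpa', data=df)
    print(m.fit(q=0.5).summary())   # q=0.5 is the median; also try 0.25, 0.75, 0.9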
My observations are points along a transect, irregularly spaced.
I aim to find the distance values that maximize the clustering of my observations' attribute, in order to use them in a subsequent LISA analysis (local Moran's I).
I iteratively run a global Moran's I function with PySAL 2.0, recreating a different distance-based weight matrix (binary, assigning 1 to neighbours and 0 to non-neighbours) with a search radius 0.5 m longer at every iteration.
At every iteration, I save z_sim, p_sim and the I statistic, together with the distance at which these statistics were computed.
From this information, what strategy is best for finding distances that potentially reveal underlying spatial processes that (pseudo-)significantly cluster my point data?
- ESRI style: the ArcMap Incremental Global Moran's I tool identifies peaks of z-values where p is significant as interesting distances.
- Literature: I found many papers that simply choose the distance with the highest absolute significant value of I.
Because the number of observations considered in the neighbourhood changes with the search radius, and thus the weight matrix changes too, the I values are not comparable.
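For reference, the loop described above looks roughly like the sketch below (PySAL 2.x is split into libpysal and esda). On the comparability point: raw I values under different weight matrices are indeed not directly comparable, which is why the ESRI-style practice compares the standardized z_sim values (with their pseudo p-values) across distances and picks significant peaks.

    # Incremental Global Moran's I over growing distance bands.
    import numpy as np
    from libpysal.weights import DistanceBand
    from esda.moran import Moran

    results = []
    for dist in np.arange(0.5, 20.5, 0.5):   # search radii in 0.5 m steps (upper bound hypothetical)
        w = DistanceBand(coords, threshold=dist, binary=True,
                         silence_warnings=True)
        mi = Moran(values, w, permutations=999)
        results.append((dist, mi.I, mi.z_sim, mi.p_sim))
    # candidate distances: significant local maxima of z_sim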
In order to analyze whether there is a mediation effect using Baron & Kenny's steps, is it necessary to include the control variables of my model, or is it enough to do the analysis just with the independent variable, the mediator variable and the dependent variable of interest?
I have a dummy variable as the possible mediator of a relationship in my model. Reading Baron and Kenny's (1986) steps, I see that in the second one you have to test the relationship between the independent variable and the mediator, using the latter as the dependent variable. However, normally you wouldn't use OLS when you have a dummy as the dependent variable. Should I use a probit in this case?
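A sketch of the three steps as described, with a probit in step 2 for the binary mediator (this is the usual adaptation); most applications keep the same controls in every step so the equations stay comparable:

    # Baron & Kenny steps with a binary mediator.
    import statsmodels.formula.api as smf

    s1 = smf.ols('y ~ x + c1 + c2', data=df).fit()      # X -> Y
    s2 = smf.probit('m ~ x + c1 + c2', data=df).fit()   # X -> M (binary M)
    s3 = smf.ols('y ~ x + m + c1 + c2', data=df).fit()  # X and M -> Y
    # c1, c2 stand in for your control variables (hypothetical names)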
In my investigation of the determinants of subjective well-being (life satisfaction), I have some variables that measure access to food and other variables that measure affect (whether, in the last week, the interviewee felt sad/happy, for example). These variables don't show high simple Pearson correlations or high VIFs. In experimenting with different models (including and excluding some variables), I see that access to food has a positive and significant coefficient, except in the models where the affective variables are included. Can I make the case that this is because the affective variables mediate the effect of access to food on life satisfaction? I also tried an interaction between access to food and the affective variables, but it is not significant.
Reading Wooldridge's introductory econometrics book, I see that the F-test allows us to check whether, in a group of coefficients, at least one is statistically significant. However, in my model one of the variables of the group I want to test is already individually significant (by its t-test). So I expect that, no matter which other variables I include in the group, the F-test will be significant as long as it contains the individually significant one. Is there any useful way I can still use the F-test in this case?
I have a well-endowed database with almost 29,000 observations, and I want to run an analysis with more than 50 variables. What problems can arise from this situation? Can the model be overfitted? If so, why?
I have reported life satisfaction as my dependent variable and many independent variables of different kinds. One of them is the area in which the individual lives (urban/rural) and another is access to publicly provided water services. When the area variable is included in the model, the second variable is not significant. However, when it is excluded, the public service variable gains enough significance for a 95% confidence level. The two variables are moderately and negatively correlated (r = -0.45).
What possible explanations do you see for this phenomenon?
I'm studying the determinants of subjective well-being in my country, with reported satisfaction with life as my dependent variable and almost 40 independent variables. I ran multicollinearity tests and didn't find values bigger than 5 (in fact, just two variables had a VIF above 2). Also, my N = 22,000, so I don't expect to have an overfitted model. At the beginning, all was going well: the variables maintained their significance and coefficient values when I added or deleted variables to test the robustness of the model, and the adjusted R² increased with the inclusion of more variables.
However, when I finally included some variables that measure satisfaction with specific life domains (family, work, profession, community, etc.), the problem started: my adjusted R² tripled, and the significance and even the signs of some variables changed dramatically, in some cases in a counterintuitive way. I also tested for multicollinearity and for correlation between these variables and the other regressors, and I didn't find a problem there.
The literature says it is very likely that there are endogeneity problems between satisfaction with life domains and satisfaction with life, since it is not so much objective life conditions that affect life satisfaction as the propensity to self-report satisfaction. Can this be the cause of my problem? If so, how?
PS: I'm not trying to demonstrate causality.
The Gini index is used as a measure of inequality in the income distribution of a nation. However, there may be cases where a person's income is negative (debts, etc.). In such scenarios, how do we proceed to calculate the Gini index?
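For concreteness, the textbook formula G = Σᵢ Σⱼ |xᵢ - xⱼ| / (2 n² x̄) goes through mechanically with negative values, but the result is no longer bounded by 1, which is why some studies truncate negative incomes at zero and others use adjusted (normalised) versions proposed in the literature. A minimal sketch:

    # Plain Gini via mean absolute difference; can exceed 1 with negative incomes.
    import numpy as np

    def gini(x):
        x = np.asarray(x, dtype=float)
        mad = np.abs(x[:, None] - x[None, :]).mean()   # mean absolute difference
        return mad / (2 * x.mean())

    print(gini([1, 2, 3, 4]))       # 0.25
    print(gini([-4, 1, 2, 3, 4]))   # 1.2: outside the usual [0, 1] range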
I have read that it's relatively unproblematic to include dummy variables (for structural breaks, for instance) in ARDL cointegration analysis. How is this done? I have included them simply as independent variables (Chow and Bai-Perron tests support the break dates). Is that OK? I use Stata, with the Stata module "ARDL".
There are arguments for and against adjusting data for seasonality before estimating a VAR model (and then Granger causality). I have monthly tourist arrival data for three countries (for 18 years) and am interested in spill-over effects or causality among the arrivals. I would appreciate your views on the following.
1. Is seasonal adjustment compulsory before estimating a VAR?
2. If I take 12-month seasonally differenced data without adjusting for seasonality, will that be okay?
I'm investigating the determinants of subjective well-being in my country, and I have a well-endowed database in which I found a lot of environmental, psychosocial and political variables (plus the common ones) that are theoretically related to my dependent variable (subjective well-being). In this context, do you see any trouble with including them all (almost 35, and that's after deleting the ones that measure the same concept) in one single model (judging by the adjusted R²)?