# Statistics

What software would you recommend for multilevel modelling?
I would be very grateful if you could present the limitations and unique advantages of multilevel modelling software, based on your experience. Thank you!

Hi,

This is a very insightful discussion and I am so pleased to have found it.

I am a PhD student working on a policy capturing study and still figuring out what my data analysis needs are. I am thinking of purchasing HLM7 but the user manual was not written for novices like me. Oh yes, I have limited statistical and programming knowledge.

Thanks to Kelvyn I have found the extremely helpful resources on CMM's site and signed up for a LEMMA course.

Going back to Luis' comment on "What's the best car out there?", I would start with Hierarchical - multivariate analysis and Non-linear - cross-classified analysis.

Which software would you recommend given the above 'constraints'?

Thank you.

How do I estimate the variance inflation factor (VIF) theoretically?

The VIF quantifies how much the variance of a regression coefficient is inflated when predictors are correlated in a multiple regression analysis. How do I estimate the VIF theoretically?

The VIF of variable i is the reciprocal of (1 - R_squared(i)), where R_squared(i) is the coefficient of determination of a model in which X(i) is regressed on the remaining X's.
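With only two predictors, R_squared(i) is simply the squared correlation between them, so the formula above can be checked by hand. A minimal pure-Python sketch (the data are made up for illustration):

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Two strongly correlated predictors (illustrative data)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.1, 5.9, 8.2, 10.0]

r2 = pearson_r(x1, x2) ** 2   # R_squared of x1 regressed on x2
vif = 1.0 / (1.0 - r2)        # VIF(i) = 1 / (1 - R_squared(i))
print(round(r2, 4), round(vif, 1))
```

A VIF above roughly 5-10 is a common rule-of-thumb flag for problematic collinearity.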

When I use the AIC (Akaike information criterion) to find the best-fitting model, do I need to consider p-values?
Using the AIC method, I selected the parameters that best explain the variability in my dependent variable. My question is: when I want to publish my results, do I need to report the p-values that the AIC method gave me, or the p-values that the regression model calculated?

Use the AIC alone to draw a conclusion about the best model: the model with the lowest AIC value is considered the best.
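As a sketch of how that comparison works for Gaussian linear models, the AIC can be computed (up to an additive constant) as n*ln(RSS/n) + 2k and compared across candidate models. The data and models below are made up for illustration:

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 4.2, 5.9, 8.1, 10.0, 12.2, 13.8, 16.1]  # roughly y = 2x
n = len(x)

def aic_gaussian(rss, n, k):
    """AIC of a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k

# Model 0: intercept only (k = 2: intercept + error variance)
ybar = sum(y) / n
rss0 = sum((yi - ybar) ** 2 for yi in y)

# Model 1: simple linear regression (k = 3: intercept, slope, error variance)
xbar = sum(x) / n
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar
rss1 = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

aic0 = aic_gaussian(rss0, n, 2)
aic1 = aic_gaussian(rss1, n, 3)
print(aic0, aic1)  # the lower AIC wins
```

The model with the lower AIC is preferred regardless of the p-values of its individual coefficients.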

What is a likelihood ratio test with zero degrees of freedom?

Would the results from a likelihood ratio test with zero degrees of freedom be uninterpretable? From what I understand, by definition when the degrees of freedom = 0, chi-squared = 0, thus making the p-value quite low -- which makes me hesitant about interpreting the results. I am comparing two linear mixed-effects models, one of which had a max gradient of 0.006 during fitting, triggering convergence warnings (in case that is related). Thanks in advance for any help.

What would be the net effect of streptozotocin on glucose transporter type 4 (GLUT4) translocation in a tissue sample?

Western blot results showed that streptozotocin-induced diabetic mice have significantly downregulated levels of GLUT4 protein in both the cytosol and the membrane fraction. However, the membrane-to-cytosol ratio (m/c) was not significantly decreased.

I think your animals' diet plays a major role in Glut4 gene expression. L-glutamine and L-serine may be effective.

Mol Cell Biol. 2001 Nov;21(22):7852-61.
Activation of protein kinase C zeta induces serine phosphorylation of VAMP2 in the GLUT4 compartment and increases glucose transport in skeletal muscle.
Braiman L, Alt A, Kuroki T, Ohba M, Bak A, Tennenbaum T, Sampson SR.

L-glutamine supplementation induces insulin resistance in adipose tissue and improves insulin signalling in liver and muscle of rats with diet-induced obesity
P. O. Prada, S. M. Hirabara, C. T. de Souza, A. A. Schenka, H. G. Zecchin, J. Vassallo, L. A. Velloso, E. Carneiro, J. B. C. Carvalheira, R. Curi, M. J. Saad
Received: 19 March 2007 / Accepted: 30 April 2007 / Published online: 29 June 2007
© Springer-Verlag 2007

Abstract
Aims/hypothesis: Diet-induced obesity (DIO) is associated with insulin resistance in liver and muscle, but not in adipose tissue. Mice with fat-specific disruption of the gene encoding the insulin receptor are protected against DIO and glucose intolerance. In cell culture, glutamine induces insulin resistance in adipocytes, but has no effect in muscle cells. We investigated whether supplementation of a high-fat diet with glutamine induces insulin resistance in adipose tissue in the rat, improving insulin sensitivity in the whole animal.
Materials and methods: Male Wistar rats received standard rodent chow or a high-fat diet (HF) or an HF supplemented with alanine or glutamine (HFGln) for 2 months. Light microscopy and morphometry, oxygen consumption, hyperinsulinaemic–euglycaemic clamp and immunoprecipitation/immunoblotting were performed.
Results: HFGln rats showed reductions in adipose mass and adipocyte size, and a decrease in the activity of the insulin-induced IRS–phosphatidylinositol 3-kinase (PI3-K)–protein kinase B–forkhead transcription factor box O1 pathway in adipose tissue. These results were associated with increases in insulin-stimulated glucose uptake in skeletal muscle and insulin-induced suppression of hepatic glucose output, and were accompanied by an increase in the activity of the insulin-induced IRS–PI3-K–Akt pathway in these tissues. In parallel, there were decreases in TNFα and IL-6 levels and reductions in c-Jun N-terminal kinase (JNK), IκB kinase subunit β (IKKβ) and mammalian target of rapamycin (mTOR) activity in the liver, muscle and adipose tissue. There was also an increase in oxygen consumption and a decrease in the respiratory exchange rate in HFGln rats.
Conclusions/interpretation: Glutamine supplementation induces insulin resistance in adipose tissue, and this is accompanied by an increase in the activity of the hexosamine pathway. It also reduces adipose mass, consequently attenuating insulin resistance and activation of JNK and IKKβ, while improving insulin signalling in liver and muscle.
Keywords: Akt · Glutamine · High-fat diet · Insulin resistance · Insulin signalling · Phosphatidylinositol 3-kinase · Obesity · PI3-K

Should I transform non-normal independent variables in logistic regression?

I want to do a binomial logistic regression in SPSS. My independent variables are, however, not normally distributed (moderately positively skewed). A square root transformation was successful in normalising the distribution of the IVs. However, after running the logistic regression on the normalised data, I get some very strange results - huge Odds Ratios and Confidence Intervals.

• Why are the odds ratios so different after applying square root transformation?
• Should one apply the square root transformation to non-normal predictors when doing logistic regression?

A logistic regression is very similar (though not identical) to a binomial logit model. This means the logarithm of the odds ratio can be interpreted as the deterministic utility difference Delta V between the two options (Y=1 and Y=0). Since in the standard logistic regression this utility difference is formulated linearly in the exogenous variables, Delta V = beta'x, the primary guidance for a possible transformation has to be that the resulting independent variables contribute *linearly* to the utility difference. Thus, it depends on the specific problem. Formal criteria such as normality are more or less irrelevant. Notice that a nonlinear transformation influences both the x values *and* the estimated parameters beta. For example, when applying a sqrt transform to x components which are all above 1, the values of those components decrease but the (absolute value of the) corresponding beta estimate increases. So it is not a priori evident whether the odds ratio, which depends only on beta'x, becomes more or less extreme.

In a nutshell: linearity in the utility difference is the only criterion for a possible transformation, and this depends on the problem at hand, not on formal criteria.
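A toy numerical sketch of the point that a transformation affects both x and beta (illustrative numbers only, not fitted to any real data): if the "true" utility difference is linear, Delta V = 0.05*x, then refitting it through the origin on sqrt(x) shrinks the regressor values but inflates the coefficient, so beta'x itself need not become more extreme:

```python
import math

# "True" linear utility difference: Delta V = 0.05 * x, for x = 1..100
x = list(range(1, 101))
v = [0.05 * xi for xi in x]

# Least-squares fit of v on sqrt(x) through the origin:
# b = sum(v_i * s_i) / sum(s_i^2), with s_i = sqrt(x_i)
s = [math.sqrt(xi) for xi in x]
b_sqrt = sum(vi * si for vi, si in zip(v, s)) / sum(si ** 2 for si in s)

print(b_sqrt)  # much larger than the original 0.05
```

Since sqrt(x) < x for x > 1, the coefficient grows to compensate; an odds ratio read off exp(beta) is then a per-unit change on the sqrt scale, not the raw scale, which is one reason the ratios can look "huge" after transformation.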

Can you recommend free (open) software that offers statistical analyses similar to SPSS/PASW?
I would like to know whether there are packages other than SPSS/PASW, or open/free-access software, that enable the use of similar statistical analysis tools and produce similar graphs.

I absolutely recommend R; the range of things you can do with it is practically unlimited, not just quantitative analysis but qualitative analysis as well.

How would you explain hypothesis testing and significance to your parents(*)?
The shorter, the better!

(*) assuming that your parents don't have a formal statistical training. If so, go up your family tree as many levels as needed.

One can always improvise on the (classical) difference between "guilty or not guilty" and "guilty or innocent".

What would be a good application for a spline regression?

Fellow RGers:

A recent survey statistics and methodology journal had an article on the use of splines, and as one example they used data with which I am very familiar. In that case, however, I know that they combined categories of data for which results are also needed, and at the more aggregate level those categories could be treated as strata. Each stratum, with its own separate simple model, could then contribute to perhaps the best overall results, rather than using a linear spline regression for the combined data set. Each category would appear to be logically best modeled by a simple WLS regression through the origin (which has worked well for these data). However, with the data grouped together, a linear spline regression or a lowess model would seem logical if one were unaware of the categories/strata that made up the combined data set.

But I would think that there would be much better uses for spline regressions.  I'm curious to hear of some applications, especially - but not limited to - your own applications, with comments on why it was a beneficial approach, and why you chose it.

Thank you - Jim

P.S.

In order to avoid the problems shown above, we can also use alternatives such as linear splines, Akima interpolation, Hermite splines, wavelets or trigonometric approximation.

In our paper we proposed applying linear splines to the first difference of the time series, and using judgmental estimates for some future points so that the spline extrapolation does not go in the wrong direction (section 2.2):

https://www.researchgate.net/publication/282136270_A_joint_Bayesian_forecasting_model_of_judgment_and_observed_data

• Source
##### Technical Report: A joint Bayesian forecasting model of judgment and observed data
ABSTRACT: This paper presents a new approach that aims to incorporate prior judgmental forecasts into a statistical forecasting model. The result is a set of forecasts that are consistent with both the judgment and latest observations. The approach is based on constructing a model with a combined dataset where the expert forecasts and the historical data are described by means of corresponding regression equations. Model estimation is done using numeric Bayesian analysis. Semiparametric methods are used to ensure finding adequate forecasts without any prior knowledge of the specific type of the trend function. The expert forecasts can be provided as estimates of future time series values or as estimates of total or average values over any particular time intervals. Empirical analysis has shown that the approach is operable in practical settings. Compared to standard methods of combining, the approach is more flexible and in empirical comparisons proves to be more accurate.
Report number: LUMS Working Paper 2012:4, Affiliation: Lancaster University: The Department of Management Science

DIRECT AND TOTAL EFFECTS
           Coeff     s.e.       t         Sig(two)
b(YX)      .0300     .0934      .3213     .7488
b(MX)      .4522     .0645     7.0137     .0000
b(YM.X)    .6767     .1470     4.6038     .0000
b(YX.M)   -.2760     .1065    -2.5924     .0114

Can someone please help me interpret this output? The total effect is not significant; however, the direct effect is significant and negative. I get a Sobel Z = 3.8, p < .001.

I got lost as well. I do not know what your variables represent. Kindly give further information on this.
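For what it's worth, the Sobel statistic quoted (Z = 3.8) can be reproduced from the a path (b(MX)) and b path (b(YM.X)) in the output above; a small sketch:

```python
import math

# Path estimates and standard errors taken from the output above
a, se_a = 0.4522, 0.0645   # X -> M  (b(MX))
b, se_b = 0.6767, 0.1470   # M -> Y controlling for X  (b(YM.X))

# Sobel test: z = a*b / sqrt(b^2 * se_a^2 + a^2 * se_b^2)
se_ab = math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
z = (a * b) / se_ab
print(round(z, 2))  # about 3.85
```

A significant indirect effect alongside a non-significant total effect and a negative direct effect is often described as inconsistent mediation (suppression), which may be what is happening here; the variable definitions would be needed to say more.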

I am trying to match age between my controls and patients. Which statistic is better: the median with range, or the mean with SD?

I am comparing gene expression between healthy and sick individuals.

Bokang, most folks use the mean and SEM for graphing. It is partly preference, since the test for a significant (p < 0.05) difference between means will (ideally) be the same for all investigators, irrespective of how you wish to present your variation. Best.

Low R-squared values in multiple regression analysis?
In my regression analysis I found R-squared values from 2% to 15%. Can I include such low R-squared values in my research paper? Or do R-squared values always have to be 70% or more? If anyone can refer me to any books or journal articles about the validity of low R-squared values, it would be highly appreciated.

The real question is what type of regression you are doing. If you have cross-sectional data, an R^2 value ranging from 2% to 15% can be good enough, provided the coefficients of the variables have the signs envisaged by the model specification, one or two variables have significant t values, and so on. On the contrary, a high R^2 value coupled with very low t values for all the coefficients casts doubt on the model specification and/or the data. If your model specification rests on a solid economic argument, then don't worry too much about the R^2 value, particularly in the case of cross-sectional and pooled regression.

Coefficient of determination or correlation coefficient: which is better for describing the precision of a regression equation?
As a non-native speaker of English, I am not clear which expression (coefficient of determination or correlation coefficient) is better for describing the precision of a regression equation. The regression equation has three independent variables and one dependent variable. Or are both acceptable, the two being related simply by a square? Thank you very much for your kind help.

Dear John, as if I could consult you!

I don't have much knowledge about using R² in GLMs. In my understanding and opinion, it does not add any value to the analysis. You have the log-likelihood, the deviance, and information measures like AIC, BIC and what else. R² is a measure associated with the residual variance, and it could possibly (but please don't ask me where!) make some sense to look at R² when the reduction of the residual variance by the model is a major aspect of the analysis. To my knowledge this may only be the case for "normal" LMs and not in GLMs, where the variance is usually not independent of the mean.

There are many ways to specify a kind of a surrogate measure for R² in GLMs, which are called "pseudo-R²". A nice list can be found here:

http://www.ats.ucla.edu/stat/mult_pkg/faq/general/Psuedo_RSquareds.htm

So for GLMs there is not one (pseudo!) R²; there are plenty, and all have different properties, some nice, some ugly.

Another related source I found is a document about "R-Squared Measures for Count Data Regression Models - With Applications to Health Care Utilization" (but I guess you already know it):

--

http://stats.stackexchange.com/questions/13314/is-r2-useful-or-dangerous/13317#13317

How do I compute and quantify the content per sample with this linear regression data?

Hi,

I am currently doing my thesis and one of the experiments I did was to quantify the Fucose (a sugar) in my sample, using a colorimetric reaction.

I have already obtained the line equation, but I am not confident that I will arrive at the correct figure for the fucose content (mg/mL or ug/mL) in the sample.

Ultimately, I would like to know the %Fucose in the sample.

I have made and attached a file outlining all the procedures I followed and the data obtained. I humbly hope that you can show me how to get the concentration of fucose, and the percentage (%), in the sample.

Cheers,

Gene
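In case it helps later readers, the generic back-calculation from a colorimetric standard curve looks like the sketch below. The slope, intercept, dilution factor, extract volume and sample mass here are hypothetical placeholders, not Gene's actual values:

```python
# Hypothetical calibration line fitted to the standards: A = m*C + b,
# with A = absorbance and C = fucose concentration in ug/mL
m = 0.012   # slope (absorbance per ug/mL) -- placeholder
b = 0.05    # intercept -- placeholder

A_sample = 0.50          # measured absorbance of the diluted sample -- placeholder
dilution_factor = 10     # how much the sample was diluted before reading

# 1. Interpolate the concentration of the diluted sample from the line
c_diluted = (A_sample - b) / m            # ug/mL in the cuvette

# 2. Undo the dilution to get the concentration of the original extract
c_sample = c_diluted * dilution_factor    # ug/mL in the original extract

# 3. Convert to % fucose (w/w), given extract volume and dry mass used
volume_mL = 5.0          # placeholder
mass_mg = 20.0           # placeholder
fucose_mg = c_sample * volume_mL / 1000.0   # ug/mL * mL -> ug -> mg
percent_fucose = 100.0 * fucose_mg / mass_mg

print(c_diluted, c_sample, percent_fucose)
```

Only absorbances falling within the range of the standards should be interpolated; dilute and re-read anything outside it.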

Hi Nitesh and Christopher,

I humbly appreciate all your input. This has been a learning experience for me, and it just goes to show that I should seek further training to become more than skilled in the scientific field I would like to venture into.

To Nitesh: Thank you, that was the calculation I arrived at. I would like to thank Peter too, for both of you went on to show me the basics of calculating concentrations from such data. Thank you for being humble too, and for staying on this thread to share, add, and correct any information you hold.

To Christopher: Thank you, too! I will make it a point to account for these misadventures in the scope and limitations section of the paper. Likewise, this is a reminder that I will need further training to become skilled in the laboratory.

To all: I am grateful for your input! Everything is much appreciated!

Cheers,

Gene

What is a binomial distribution?
I wanted to understand it with some real world examples rather than a definition.
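One concrete real-world reading: a binomial distribution counts successes in n independent yes/no trials with a fixed success probability p, such as heads in 10 fair coin flips or defective items in a batch. A small sketch of the probability mass function (the defect rate below is an invented example):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success prob p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Heads in 10 fair coin flips: 5 heads is the most likely single count,
# yet it still occurs well under half the time
p5 = binom_pmf(5, 10, 0.5)
print(round(p5, 4))

# Quality control: exactly 3 defects in a batch of 20 with a 10% defect rate
p3 = binom_pmf(3, 20, 0.1)
print(round(p3, 4))
```

The two defining assumptions are independence between trials and a constant p; when either fails, the counts are no longer binomial.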

For those of you who would like to see how I am inviting the public to help science to self-correct, please feel free to visit: http://temptdestiny.com

A revised "Flawed Scientific Method" document has been uploaded to replace the previous version. This version is designed to go with the public invitation to help science self-correct. In essence, this one page document illustrates for the public the mechanics of the discovery of Einstein's nonlocal hidden variables which in turn revealed how the scientific method is flawed (see link below).

• Source
##### Dataset: A Flawed Scientific Method - Mechanics Of The Two Acts Of Selection
ABSTRACT: Albert Einstein held the belief that quantum mechanics was an incomplete theory and that there were local hidden variables that would give us a complete sense of reality. As the findings show, he was correct about there being hidden variables. However, he was incorrect as to where to find them. The basketball examples serve to illustrate the findings of the Tempt Destiny experiment and the mechanics involved. The "Flawed Scientific Method" illustrations were designed to go with the public invitation to help science self-correct. In essence, this one page document illustrates for the public the mechanics of the discovery of Einstein's nonlocal hidden variables which in turn revealed how the scientific method is fundamentally flawed and how to fix it.
In FTIR intensity-ratio analysis, researchers sometimes use p<0.05 and sometimes p<0.001 for statistical difference. Any insights why?

For example, for the ratio Int(1000 cm-1)/Int(1646 cm-1), I sometimes see researchers use p<0.05 and sometimes p<0.001. Which one should be used, and why? Are there any guidelines?

Thank you,

Hi Rekha,

Thank you very much for the response. It's helpful.

What statistical tests and parameters are appropriate to study the relations of pharmacokinetics and pharmacodynamics of drugs?
(specifically I am trying to do that with antihistaminics concentration / weal and flare inhibition)

Please see this paper; it may be interesting.

Regards

Where can I find websites to get free scientific publications?

Please, I need links like http://gen.lib.rus.ec/ or http://www.freefullpdf.com/. I'm from Bolivia, and sometimes it is too expensive to buy scientific papers; usually it is not just one or three. It would also help if students could have access without needing an account for which you must be endorsed by an institution (as on ResearchGate). Publications in other fields such as art, music, etc. would be welcome too. Thank you for your answers.

http://fieldguides.fieldmuseum.org/guides

7
How can I prepare input files for case-control permutation analysis using PHASE 2.1.1 software?

I'm having trouble performing a case-control permutation analysis in PHASE 2.1.1. The PHASE documentation says that for this test, individuals should be identified by putting "0" or "1" (for controls and cases, respectively) followed by a space just before the individual's identifier. However, when I run the program this way, the following error message appears: "Error in input file: more than 2 alleles at SNP locus. Individual = C; Locus = 1, Alleles are: #, L and C". My individuals are identified by #CICL followed by a number. The weird thing is that when I substitute "0 #" or "1 #" with "0#" or "1#" (in other words, when I delete the space), the software reads the file normally but does not perform the permutation test, because it does not identify cases and controls in the file; it identifies the individuals as a single group.

Has anyone had this same issue, and do you have any suggestions regarding input file preparation? Is there an undocumented bug in this test? Is there a restriction on the nomenclature of individuals that may be causing this problem? Is there any other software option for performing this test once all individuals' haplotypes have been identified?

Thanks!

You're welcome!

Feel free to ask me whatever you want.

How many times must I repeat one experiment to find the appropriate probability of occurrence?
For example, we have a coin with unknown probability of landing on each side: how many times must I repeat throwing the coin to find the best answer for probability of landing on each side.

If you conduct the experiment a practically large number of times, you will find that the estimated probability converges toward a point; when that point of convergence stops changing, stop repeating the experiment.
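The "how many flips" question can also be answered up front with the usual normal-approximation sample-size formula for a proportion, n = z^2 * p*(1-p) / E^2, where E is the margin of error you can tolerate. Using the worst case p = 0.5 (an assumption, since p is unknown here):

```python
import math

def flips_needed(margin, z=1.96, p=0.5):
    """Trials needed so the estimate of p is within +/- margin at ~95% confidence.
    Normal approximation: n = z^2 * p*(1-p) / margin^2 (worst case p = 0.5)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(flips_needed(0.05))   # to pin p down to +/- 0.05
print(flips_needed(0.01))   # +/- 0.01 takes about 100x more trials
```

Halving the margin of error quadruples the number of flips required, which is why the estimate seems to "settle down" only slowly.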

Anyone know of a package in R or Stata that will estimate quantile regression with endogenous regressors for data that is purely cross-sectional?
I am aware of (and have used) the rq.ivpanel package for R by Charlie Lamarche and ivqreg package for Stata by Do Wan Kwak, but those both work only for panel data.

Thank you so much all!

How do I calculate SSMD in an HTS?

I have a screen in which I assess the effect of compounds on colony number. My output is colony number as a percentage of control (fold change). I carried out the screen in 96-well plates, with one well of negative control and one well of positive control. I only carried out the screen once, so is it possible to calculate SSMD on this data set? I have screened roughly 2,000 compounds.

Hi Paul,

how many replicates do you have? SSMD needs at least two replicates because it requires calculating standard deviations.

It looks like your data are similar to ours in the PNAS paper at http://www.pnas.org/content/110/30/12426.full. If you have replicates, I think you can use the SSMD method on your data as we did (for details, see the statistical analysis section of the supplementary document at http://www.pnas.org/content/suppl/2013/07/08/1305207110.DCSupplemental/sapp.pdf).

Yaoyong
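For reference, with replicate wells the SSMD for a compound against a control is just the mean difference scaled by the combined variability; a minimal sketch with made-up colony counts:

```python
import math

def ssmd(sample_a, sample_b):
    """Strictly standardized mean difference between two replicate sets:
    (mean_a - mean_b) / sqrt(var_a + var_b), using sample variances."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(var(sample_a) + var(sample_b))

# Made-up colony counts (% of control) for the negative control vs. a hit compound
negative_control = [100, 98, 102, 101]
compound = [50, 52, 49, 53]

print(round(ssmd(negative_control, compound), 2))
```

This is why a single well per condition is not enough: with n = 1 the sample variances in the denominator cannot be computed at all.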

Any advice on spatio-temporal variogram and kriging using gstat?

Hello, I am trying to carry out a spatio-temporal analysis of some data recorded at monthly intervals at different locations, using the gstat package of Edzer Pebesma. The data are recorded once a month for a few years. The problem is that despite following all the instructions in many documents (e.g. st.pdf by Edzer Pebesma and many variants, such as Benedikt's), I am still unable to create the required spatio-temporal object. Can anybody help me in this regard? I will upload the data if somebody turns up to help. The data are simple readings of SO2 at 33 places, with 12 readings per year for 4 years. What I want is an analysis like this: http://www.r-bloggers.com/spatio-temporal-kriging-in-r/.

1. Be aware that the product of two variograms is not a variogram (variograms are conditionally negative definite, whereas the product of two variograms is conditionally positive definite).

2. In order to "split" the problem into a spatial problem and a time-series problem, you need very different data than what you need for spatio-temporal kriging. There would still be the problem of how to put the results back together.

3. See a series of papers by De Iaco, Myers and Posa, including FORTRAN programs, in Mathematical Geology as well as in Computers & Geosciences (and several statistics journals).

4. To fit data to a space-time variogram, the data file must have four columns (five if working in 3-D space): one for each of the spatial coordinates, one for the time coordinate, and one for the data value. Each row corresponds to a point in space-time together with the data value at that point.
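The layout described in point 4 — one row per space-time point — can be sketched as a plain long-format table before it ever reaches gstat. The station coordinates and SO2 values below are invented placeholders:

```python
import csv
import io

# Invented monitoring stations (x, y) and monthly time stamps
stations = {"S1": (10.0, 20.0), "S2": (11.5, 20.3), "S3": (10.8, 21.1)}
months = ["2010-01", "2010-02", "2010-03", "2010-04"]

# One row per space-time point: x, y, time, value
rows = []
for i, (name, (x, y)) in enumerate(sorted(stations.items())):
    for j, t in enumerate(months):
        value = 5.0 + i + 0.1 * j      # placeholder SO2 reading
        rows.append((x, y, t, value))

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["x", "y", "time", "SO2"])
writer.writerows(rows)
print(len(rows))   # stations x months rows
```

From a table like this, each row can be mapped to a spatial location and a time index when constructing the spatio-temporal object.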

Can anyone suggest a website with a good statistical decision tree?

... in order to select the right kind of analysis.

Thanks!

As a general resource, I'd recommend the site mentioned by Sofia Cividini.

A somewhat simpler, though maybe not as clear, site is:

http://www.biostathandbook.com/testchoice.html

If the chi-square difference between the model with constraints and the model without is non-significant, do we pick the more parsimonious one?

Hello there,

I am just trying to decide upon the model doing multigroup analysis.

I ran regression analysis for males and females separately and obtained "similarly looking" coefficients (all sig.). So I wonder if the differences between them are significant or not.

Subsequently, I constrained the coefficient to be the same for both groups.

The change in chi-square between the two models was non-significant. Does this mean that the differences in coefficients are statistically no different from zero?

Just want to be sure that my understanding is right... thanks for any answer.

Measures such as CFI were created to deal with situations where the chi-square value was significant, but the overall fit appeared to be adequate. Most of those indices deal with the problem that chi-square is highly sensitive to the number of degrees of freedom, so that data from large samples often produces significant chi-square values.

In other words, many analysts feel direct chi-square assessments of models are too "conservative" and thus use things like CFI to provide a more "realistic" assessment of fit. In the present case, the actual chi-square indicates no difference, so there is little reason to delve into alternative fit indices (especially when they require subjective judgments to assess how much difference is enough to merit further attention).
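As a sketch, when exactly one coefficient is constrained (as in this multigroup setup) the chi-square difference test has 1 df, and its p-value has a closed form via the complementary error function, P(chi2_1 > x) = erfc(sqrt(x/2)). The chi-square values below are illustrative, not from the poster's models:

```python
import math

def chisq_diff_p(chisq_constrained, chisq_free):
    """p-value of a chi-square difference test with df = 1.
    For 1 df, P(chi2 > x) = erfc(sqrt(x / 2))."""
    delta = chisq_constrained - chisq_free
    return math.erfc(math.sqrt(delta / 2.0))

# Illustrative values: constraining one coefficient raises the model
# chi-square from 50.00 to 53.84 (delta = 3.84, the 5% critical value)
p = chisq_diff_p(53.84, 50.00)
print(round(p, 3))
```

A p-value above .05 here would mean the equality constraint is tenable, so the more parsimonious (constrained) model is retained.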

What is the formula for a non-parametric Tukey's HSD?

I ran a Kruskal-Wallis test, which turned out significant. Now I need to do a post hoc test, like Tukey's HSD, manually.

If you are using R, then based on the website below you can install the pgirmess package and use the function kruskalmc(OutcomeVariable, Expla. Variable). It's called the Nemenyi-Damico-Wolfe-Dunn test.

http://stats.stackexchange.com/questions/17342/is-there-a-nonparametric-equivalent-of-tukey-hsd
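If you want the arithmetic behind such a post hoc (Dunn-style) comparison done by hand, the z statistic for groups i and j uses the pooled mean ranks: z = (Rbar_i - Rbar_j) / sqrt( N(N+1)/12 * (1/n_i + 1/n_j) ). A minimal pure-Python sketch on made-up data (no tie correction applied):

```python
import math

def mean_ranks(groups):
    """Average rank of each group in the pooled data (midranks for ties)."""
    pooled = sorted((v, g) for g, vals in groups.items() for v in vals)
    ranks = {g: [] for g in groups}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0          # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[pooled[k][1]].append(midrank)
        i = j
    return {g: sum(r) / len(r) for g, r in ranks.items()}

def dunn_z(groups, a, b):
    """Dunn's z statistic comparing groups a and b (no tie correction)."""
    n_total = sum(len(v) for v in groups.values())
    rbar = mean_ranks(groups)
    se = math.sqrt(n_total * (n_total + 1) / 12.0
                   * (1.0 / len(groups[a]) + 1.0 / len(groups[b])))
    return (rbar[a] - rbar[b]) / se

groups = {"A": [1, 2, 3], "B": [7, 8, 9]}
print(round(dunn_z(groups, "B", "A"), 3))
```

Each |z| is then compared against a normal critical value adjusted for the number of pairwise comparisons (e.g. Bonferroni), which is what the packaged implementations do for you.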