# Statistics

What is representative sampling?
I know about probability sampling, but representative sampling seems to be different.
Is there any standard / robust method to identify outliers?
I have performed a linear regression analysis. Is there any standard procedure to identify outliers with precision? I am using MATLAB for statistical analysis; if anyone has come across a specific MATLAB function for this, kindly let me know.
Aria Tsam · Aristotle University of Thessaloniki
Good evening. Please refer to this site: http://www.rsc.org/images/robust-statistics-technical-brief-6_tcm18-214850.pdf
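As a sketch of the kind of screen the question asks about (written in Python for illustration rather than MATLAB; the data below are simulated), one common approach is to flag points whose regression residuals are extreme on a robust, MAD-based scale, so the outliers themselves do not inflate the spread estimate:

```python
import numpy as np

def flag_outliers(x, y, threshold=3.0):
    """Fit a simple linear regression and flag points whose residuals
    are extreme on a robust (median/MAD) scale."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))  # robust spread estimate
    robust_z = 0.6745 * (residuals - med) / mad
    return np.abs(robust_z) > threshold

x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + np.random.default_rng(0).normal(0, 0.5, 20)
y[7] += 10.0  # inject one gross outlier
print(np.flatnonzero(flag_outliers(x, y)))  # should include index 7
```

The 0.6745 factor makes the MAD consistent with the standard deviation under normality, so the threshold is comparable to a usual z-score cut-off.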
What is the best way to estimate the PDF from a given distribution obtained numerically?
Given a probability density function, supported on a semi-infinite interval, that was obtained numerically: what is the best methodology for identifying or estimating the underlying distribution law (beta prime, chi, gamma, log-normal, etc.)? Enclosed is an example of a PDF obtained numerically.
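One simple screening approach (not a full goodness-of-fit analysis) is to fit each candidate family's density to the numerical PDF values by least squares and compare the residual errors. A Python sketch, where the "numerical" PDF is generated from a known gamma law so the right answer is known; replace it with your own grid and values:

```python
import numpy as np
from scipy import stats, optimize

# Hypothetical numerical PDF on a grid (here generated from a gamma law)
x = np.linspace(0.01, 15, 300)
p = stats.gamma.pdf(x, a=2.5, scale=1.2)

candidates = {
    "gamma":   lambda x, a, s: stats.gamma.pdf(x, a, scale=s),
    "lognorm": lambda x, s, sc: stats.lognorm.pdf(x, s, scale=sc),
}

errors = {}
for name, pdf in candidates.items():
    try:
        popt, _ = optimize.curve_fit(pdf, x, p, p0=[1.0, 1.0], maxfev=5000)
        errors[name] = np.sum((pdf(x, *popt) - p) ** 2)
    except RuntimeError:  # fit failed to converge
        errors[name] = np.inf

best = min(errors, key=errors.get)
print(best, errors)
```

If you can draw samples from the distribution instead, maximum-likelihood fits (`stats.gamma.fit`, etc.) compared via AIC or a Kolmogorov-Smirnov statistic would be more principled.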
Is anyone aware of studies that have addressed the specific issue of identification of demand and supply in agricultural (or other) markets?
Recent approaches to identification are those proposed by Leamer (ReSTAT, 1981), Rigobon (ReSTAT, 2003), and Roberts and Schlenker (AER, 2013). Is anyone aware of other approaches and/or of empirical applications of these approaches?
Aria Tsam · Aristotle University of Thessaloniki
Good evening. Please refer to this site: http://highered.mcgraw-hill.com/sites/dl/free/0073523208/931865/Borjas_6e_Chapter_4.pdf
How can I perform time series data similarity measures and get a significance level (p-value)?
I have two sets of time series data. How can I measure their similarity or difference, and how can I get a significance level (p-value)?
Albert Galick · State University of New York College at Buffalo
Not exactly an answer to your question, but let me bring to your attention my patent for characterizing system behavior from time-series (see attached).
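For the question itself, one simple route is a correlation between the two series with a permutation p-value; a Python sketch with simulated data. The caveat in the comment matters: plain shuffling assumes exchangeability, which autocorrelated series violate, so for strongly autocorrelated data use block or phase permutations instead.

```python
import numpy as np

def correlation_pvalue(a, b, n_perm=2000, seed=0):
    """Pearson correlation between two series, with a permutation
    p-value. Caveat: plain shuffling ignores autocorrelation; for
    strongly autocorrelated series use block or phase permutations."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(a, b)[0, 1]
    count = 0
    for _ in range(n_perm):
        # Shuffle one series to break any real association
        if abs(np.corrcoef(rng.permutation(a), b)[0, 1]) >= abs(observed):
            count += 1
    return observed, (count + 1) / (n_perm + 1)

t = np.linspace(0, 4 * np.pi, 200)
a = np.sin(t)
b = np.sin(t) + 0.1 * np.random.default_rng(1).normal(size=200)
r, p = correlation_pvalue(a, b)
print(round(r, 3), round(p, 4))
```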
What is the basic difference between the maximum likelihood estimator and the least square estimator?
In statistics, when estimating parameters, sometimes the least squares estimator is used and sometimes the MLE. Which one is better, and when can each be applied?
Fausto Galetto · Politecnico di Torino
Dear Demetris, here are the ideas I followed to find a formula for the 40 data points of Emilio's problem.
1) The motivation for choosing such a class of functions. My solution was "Engineering aided by Statistics"; as soon as I saw the data, I thought of the "bathtub curve". I computed numerically the function h(t) = f(t)/[1 - F(t)], which confirmed my idea. I then decided to use (t/eta1)^beta1 + (t/eta2)^beta2 for H(t), the integral of h(x) from 0 to t.
2) The specific method used for the MLE. From my experience as a manager and my scholarly expertise, I decided that certain values of the 4 parameters were suitable for the case. I "asked the data" to confirm them: the confidence intervals, with CL = 90%, computed from the MLE confirmed my intuition. I used a method of my own.
3) The initial guess values for the 4 parameters. They had no effect on the estimates or on the confidence intervals; in any case I do not remember them.
4) The number of iterations, the stopping criterion, and the final likelihood value. Very few iterations (>10); the stopping criterion was "difference between two successive iterated values < 0.0001". No final likelihood was computed.
5) The machine epsilon of the program(s) used. Again, "difference between two successive iterated values < 0.0001".
After that, I compared the estimated H(t) with the empirical H(t), simply by eye: the approximation was good enough for my engineering purpose. Then I computed F(t) = 1 - exp(-H(t)) and compared it with the empirical F(t), again simply by eye, with the same conclusion. I repeat: this was "Engineering aided by Statistics"; had I not had that experience, I would not have proceeded this way. Perhaps in the future I will provide something different, because the 40 data points have a peculiar characteristic. I hope we can stop with the 40 data points; I used them only to show that ML is a powerful method, one that was useful in my managerial work before joining Politecnico di Torino.
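One concrete link between the two estimators in the question: under i.i.d. Gaussian errors, maximizing the likelihood of a linear model gives exactly the least squares estimates, which is why the two often coincide in practice. A minimal Python sketch with simulated data:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, 50)

# Least squares estimates of slope and intercept
b_ls = np.polyfit(x, y, 1)

# Maximum likelihood under i.i.d. Gaussian errors: minimize the
# negative log-likelihood over slope, intercept and log(sigma)
def neg_log_lik(theta):
    slope, intercept, log_sigma = theta
    resid = y - (slope * x + intercept)
    return 0.5 * np.sum(resid ** 2) / np.exp(2 * log_sigma) + len(y) * log_sigma

b_ml = minimize(neg_log_lik, x0=[1.0, 0.0, 0.0]).x

# The slope/intercept point estimates agree
print(np.round(b_ls, 3), np.round(b_ml[:2], 3))
```

The equivalence breaks as soon as the error model is non-Gaussian (e.g. the Weibull-type hazard model above), which is when MLE and least squares genuinely differ.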
How to approximate the cdf of t-distribution efficiently?
Approximation of the t-distribution is essential for finding the p-value in a computer program (when testing hypotheses about means). Is three-decimal accuracy of the approximation enough? How many decimals should be correctly approximated by a function? Finally, how can the CDF of the t-distribution be approximated efficiently? Please share some thoughts. Thank you.
Naveen Boiroju · Osmania University
Thanks Kevin.
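One standard, efficient route is the exact relation between the t CDF and the regularized incomplete beta function, for which fast library routines exist; a Python sketch, checked against `scipy.stats.t`:

```python
import numpy as np
from scipy import special, stats

def t_cdf(t, df):
    """CDF of Student's t via the regularized incomplete beta function:
    P(T <= t) = 1 - 0.5 * I_x(df/2, 1/2) with x = df/(df + t^2) for t >= 0,
    and by symmetry P(T <= t) = 0.5 * I_x(df/2, 1/2) for t < 0."""
    t = np.asarray(t, dtype=float)
    x = df / (df + t * t)
    tail = 0.5 * special.betainc(df / 2.0, 0.5, x)
    return np.where(t >= 0, 1.0 - tail, tail)

ts = np.array([-2.5, -1.0, 0.0, 1.0, 2.5])
err = np.max(np.abs(t_cdf(ts, 10) - stats.t.cdf(ts, 10)))
print(err)  # near machine precision
```

Since the incomplete beta routine is accurate to close to machine precision, the accuracy question reduces to how precisely your own approximation of that special function is implemented; three decimals is usually too coarse for small p-values.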
Can a variable assume the role of a predictor as well as a confounder in the same study? If yes, why? If no, why not?
Example: In the association between education and use of family planning, economic status was considered a confounder. In the same study, can we consider education a confounder in the association between economic status and use of family planning? Economic status would then be considered both a confounder and a predictor.
Whether a variable is treated as a predictor or a confounder will depend on the study objectives.
Low R-squared values in multiple regression analysis?
In my regression analysis I found R-squared values from 2% to 15%. Can I include such low R-squared values in my research paper, or do R-squared values always have to be 70% or more? If anyone can refer me to books or journal articles about the validity of low R-squared values, it would be highly appreciated.
Kelvyn Jones · University of Bristol
For me the regression coefficient is what matters, as it estimates the size of the effect of the underlying relationship - R-squared, like a correlation coefficient, assesses the scatter around that relation. It is the (squared) correlation between the observed and predicted response values. You can get a close fit to a very shallow line. As always, estimates and effects have to be put in context - I have seen the following benchmarks stated for odds ratios: small but not trivial: 1.5; medium: 3.5; large: 9. And for R-squared: aggregate time series, expect .9+; cross sections, .5 is good; and for large survey data sets, .2 is not bad. [see people.stern.nyu.edu/wgreene/.../Statistics-14-RegressionModeling.pptx] And yet in a trial of the effect of taking aspirin on heart attack, the odds ratio was so dramatic that the trial was stopped and the placebo group advised to take aspirin. The odds ratio of a heart attack for placebo compared to taking aspirin was a lowly 1.83, while the R-squared was a puny 0.0011; yet this was sufficient for action.
Is there any alternative to the ridge regression method?
In my model, I experience a multicollinearity problem in least squares estimation, so I decided to use the ridge regression method. I examined the variance inflation factors (VIFs). In the beginning, some of the VIFs for the variables were above 10 and the R-squared statistic was 61.24%. The value of the ridge parameter in my model is 0.1. Now the VIFs are around 2 for all the variables in my model. However, the R-squared statistic indicates that my model as fitted explains only 57.59% of the variability, and I believe that for my model this R-squared statistic is too low.
Yasin Asar · Necmettin Erbakan Üniversitesi
You can use Liu or Liu type estimators as well.
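For readers following along, the two ingredients of the question (VIF diagnosis, ridge shrinkage) can be sketched with plain linear algebra; a Python illustration on simulated, hypothetical data with one nearly collinear pair:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x3 + rng.normal(size=n)

def vif(X, j):
    """VIF of column j: 1/(1 - R^2) from regressing X[:, j] on the rest."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

def ridge(X, y, k):
    """Ridge estimate (Z'Z + kI)^{-1} Z'y on standardized columns."""
    Z = (X - X.mean(0)) / X.std(0)
    return np.linalg.solve(Z.T @ Z + k * np.eye(X.shape[1]), Z.T @ (y - y.mean()))

print([round(vif(X, j), 1) for j in range(3)])  # first two columns inflated
print(np.round(ridge(X, y, 10.0), 2))           # shrunken, stabilised coefficients
```

The Liu and Liu-type estimators mentioned above replace the ridge shrinkage rule with a different (biased) adjustment of X'X but fit into the same closed-form framework.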
What is the best software for multilevel modelling?
I work with effects of contexts, like place of residence, and use different software packages that fit multilevel models (R, Stata, MLwiN, Mplus). Almost any software does this analysis nowadays (SAS, SPSS, HLM), and all provide similar estimates for coefficients, especially for linear models. I noticed, however, some differences in the variances (i.e. second-level variance), and I am aware they use different estimators (IGLS, REML, MLR, and so on). What are the advantages and disadvantages of the main packages? Is there any published paper comparing them for discrete variables and nonlinear models (binomial, Poisson, negative binomial, zero-inflated, etc.)?
Kelvyn Jones · University of Bristol
I have added two volumes on Stat-JR to ResearchGate:
- An Advanced User's Guide to Stat-JR version 1.0.0 Programming and Documentation: https://www.researchgate.net/publication/259640684_An_Advanced_User%27s_Guide_to_Stat-JR_version_1.0.0_Programming_and_Documentation_by?ev=prf_pub
- A Beginner's Guide to Stat-JR's TREE interface version 1.0.0 Programming and Documentation: https://www.researchgate.net/publication/259640463_A_Beginner%27s_Guide_to_Stat-JR%27s_TREE_interface_version_1.0.0_Programming_and_Documentation_by?ev=prf_pub
Mentalising Score - How to work it out?
Hi All, Forgive me if this seems a silly question, and do pardon my statistical ignorance, but I am attempting to isolate a person's 'Mentalising Score' by following advice from this: http://www.psychologytoday.com/blog/the-imprinted-brain/200912/the-diametric-revolution-in-psychotherapy Basically, the author suggests that: "the first step in diagnosis would be to ascertain a person’s standing on the mentalistic continuum. Measures such as the Autism Spectrum Quotient (AQ) already exist to calibrate hypo-mentalism, and comparable ones could easily be devised to correspondingly calibrate hyper-mentalism: a Psychotic Spectrum Quotient (PQ, comparable to the existing Magical Ideation Scale). Ideally, AQ might be expressed negatively and PQ positively, meaning that perfect normality would get a summed Mentalism Quotient of zero." I have used the 28-item Autism Quotient and the Schizotypal Personality Questionnaire-Brief. I plan to somehow work out a theoretical mentalising score by using the Cognitive-Perceptual factor of the SPQ (9 items) and integrating it (somehow) into the 28-item AQ. Once I have this 'mentalising score' I will be able to examine how it maps onto various aspects of cognition. Any idea what the author means? Any help would be much appreciated. Cheers Guys.
Isabel Carvalho · Nova School of Business and Economics
Hi Marcus, the "diametric model" of mental illness looks to be a very exciting line of research. Coming from organizational cognition, I am not familiar with mentalising scores. However, your words "… once I have this 'mentalising score' I will be able to examine how it maps on to various cognition" lead me to think the other way around, namely of a person-centered analytical approach to your data. Taking into consideration the proposition that "normal mentalism" occupies a central ground (a) and represents a balance between enough mentalizing ability to understand other people's minds, but not so much as to be paranoid (b), nor so little that you become autistic (c), a distinct and yet complementary approach to your goal could be to identify the empirical distribution of individuals' mentalism, based on individual reports. A person-centered strategy (e.g. latent profiles instead of regression or SEM) assumes that a sample can contain subgroups and that the variables of interest (e.g., hypo-mentalism, hyper-mentalism) might combine and relate differently to other variables within these subgroups. Assuming that the ability to understand other people's minds can be measured properly (e.g. AQ, PQ), the identification and comparison of at least these three subgroups of individuals (a, b, c) would allow you to identify the most likely scores of the individuals belonging to each group and the differences along the mentalistic continuum. This information would help support the development of an empirically based, theory-driven score.
Recent developments in mixture modeling, for instance latent profile analysis (LPA; Muthén, 2002) and factor mixture analysis (FMA; Lubke & Muthén, 2005), represent a model-based approach to clustering that allows for the direct specification of alternative models that can be compared with various fit statistics, and for the simultaneous inclusion of continuous, ordinal, and categorical measurement scales in the same model (McLachlan & Peel, 2000). A great source of information is the Mplus site: http://www.statmodel.com/discussion/messages/13/13.html?1394751196 Some references are: Lubke, G. H., & Muthén, B. (2005). Investigating population heterogeneity with factor mixture models. Psychological Methods, 10, 21-39. Magidson, J., & Vermunt, J. K. (2004). Latent class models. In D. Kaplan (Ed.), Handbook of quantitative methodology for the social sciences (pp. 175-198). Newbury Park, CA: Sage. McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: John Wiley. Muthén, B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29, 81-117. Muthén, B. (2004). Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In D. Kaplan (Ed.), Handbook of quantitative methodology for the social sciences (pp. 345-368). Newbury Park, CA: Sage. Just a thought. My best
Recommended statistics books to learn R?
Some time ago, there was a discussion on a listserv to which I subscribe regarding statistical software preference. Someone had mentioned a strong preference for the use of R and since that time, I have downloaded the software package (seeing as how it's freeware). However, in looking at the interface, I am at a loss regarding how to actually use the application, and I currently cannot commit the time necessary to pore over the hundreds of help articles or forums. That being said, I looked into some R tutorial books and I wanted to see if anyone has any experience with the books I have listed below or if there are any other recommendations (the ones listed are based on reviews). I am currently gravitating towards Andy Field's book because his writing style is accessible and entertaining, but I also feel that there may be some "wasted chapters" because I already have the SPSS version of his book and I assume that there will be some redundancy. I am also open to the idea that I might need to buy 2 books. I will likely be conducting traditional statistical analyses (e.g., factor analysis, discriminant function analysis, MANOVA/MANCOVA, ANOVA/ANCOVA, regression), but I would also like to learn how to conduct other analyses through R (e.g., canonical correlation analysis, structural equation modeling, path analysis, time series analysis, etc.). I have not used some of these techniques, so a book that includes didactics regarding the nature of these analyses would also be ideal. I appreciate any insight into this. Thank you for your time and I hope everyone has a nice day. Discovering Statistics using R (Andy Field, Jeremy Miles, & Zoe Field) The R Book (Michael J. Crawley) R Cookbook (Paul Teetor) R for Dummies (Joris Meys and Andrie de Vries) (they have one of these books for everything, don't they?) Introductory Statistics with R (Peter Dalgaard) R by Example (Use R!) (Jim Albert and Maria Rizzo) R in Action (Robert Kabacoff)
Hans Sieburg · Sanford-Burnham Medical Research Institute
Dear Thomas, Go with "The R Book (Michael J. Crawley)". Michael describes R beyond what most people think of it, i.e. statistics. As for statistics, what impresses me are the well-selected data sets used. His presentation of R's visualization capabilities is exemplary. And so forth. Best, Hans
Which statistical package or software application is easiest to use by non-statisticians?
Can you recommend a simple-to-use statistical software for a non-statistician?
Hans Sieburg · Sanford-Burnham Medical Research Institute
Dear Isaac, InStat, produced by GraphPad, Inc., is what you are looking for. This package teaches while it works on your data, thus helping you become more proficient in using statistics. Here is the web address: http://www.graphpad.com/scientific-software/instat/ Best regards, Hans
In hypotheses tests can you make assumptions under H0 only?
Let's consider the (standard) two-sample t-test for a difference in means. The t-value is calculated as the empirical difference in sample means divided by the SE of this difference (in turn derived from a pooled variance estimate). Therefore, by construction, the test is sensitive to differences in location, but not in dispersion. The p-value is calculated from a t-distribution (with n1+n2-2 d.f.). As I understand it, this t-distribution is derived under one single condition: the data from both groups are sampled from *one* population with a normal distribution. The normal distribution is often called an "assumption", and the fact that both samples are from the same population is called the "null hypothesis" (H0). Is there any reasonable argument to assume that under H0(!) the two samples come from two different populations, with possibly different variances but the same means? My question has two aspects: I wonder if such a test is justified when we believe/know that the samples must have been taken from different populations (that might have the same mean or not, but surely have different variances). What is the rationale behind judging differences in mean values when I seem to be comparing apples and peaches anyway? (Just as a side note: often the difference in variances under H0 can be explained by inhomogeneities within one of the groups; this is for instance often observed in studies with diseased and control animals, where the diseased group suffers several side-effects increasing the variability of the response, but not necessarily the mean. Wouldn't it be more appropriate, if possible, to adjust for these indirect effects instead of simply "assuming different variances"?) (And I know that if we ignore all these logical issues there is the Welch correction.) The second aspect is: since the p-value relates specifically to H0, only the conditions under H0 are relevant. Right?
Again, a typical example from biomedical research: mean and variance of concentrations are usually correlated; the higher the observed mean, the higher the observed variability. I know that a log-normal or gamma GLM with log link is most appropriate here for analysis, but here I am asking about the simple t-test again (considering the violation of the normal-distribution assumption negligible!). The observed differences in variances are related to different (sample) means. And under H0(!) I think it is justified to assume equal variances (otherwise see the previous paragraph). Having said this, it follows that the t-test (without Welch adjustment) would be perfectly fine, although the sample data has apparently very different variances in the groups. The problem might be to get a good estimate for the SE. Using a pooled estimate might result in unnecessarily low power, but nothing could go so wrong as to accidentally inflate the type-I error rate. Right? The same questions apply to the "non-parametric" alternative, the Wilcoxon test. The p-value here is again derived under the assumption that both samples are taken from the *same* population (which doesn't need to have a normal distribution). It is often stated that this is a test of location shift (equality of the medians) if and only if all other moments of the distributions are the same. Again I wonder if H0 does not automatically and necessarily imply that all moments must be identical, since there is only one distribution under H0.
Hans Sieburg · Sanford-Burnham Medical Research Institute
Dear Jochen, comments regarding the second paragraph first. I have heard things like this before, though not as blunt as your co-author put it. Fortunately, the information measures have "their own" p-values; see, for example, the Akaike information criterion (often quoted as AIC). So they are "safe" as far as stars in the plots are concerned :). I saw that you posted a question concerning R, so I gather that you are using it for your work. In R, you will find ways and means to determine the AIC or other such quantities. In Mathematica, these are delivered as standard as part of the built-in statistics package, which, in my book, speaks volumes in favor of using Mathematica (the plots can have lots of stars, too, if warranted :) ). Now to the first paragraph. The assumptions that we make for any particular test are reasonable in the sense that they influence the mathematics we use to derive the validity of the test. It follows that those who apply a particular test should verify that the assumptions hold (actually, they should have designed their experiments beforehand according to the assumptions of that particular test). Variance must be estimated for both groups, no doubt, and not just the one that we define to be the "healthy" control, since you have to establish whether the assumption of equal variance is true or not. Therefore, a comparison must be made in a separate test to ascertain whether they are similar. And, alas, we must ascertain the likelihood that the data are generated from the same process (represented by one distribution). Hence the need to establish that there is only one distribution (or, at the very least, distributions from the same family). To do this cleanly, you will inevitably end up doing a goodness-of-fit test, which is another way of saying "information distance". That assumptions can be made differently is evidenced by the fact that there are other tests that are variations of the t-test and have different names.
Finally, there is the topic of variance-adjusting transformations that can be applied to the data without harm. I do not like to do this, since I feel that variance is one of the most important pieces of information ever, particularly in biology and medicine, but the technique is useful for exploratory purposes. The story that you tell is prototypical of how statistical work is still perceived by a large part of the biomedical and clinical research community. This behavior appears to be unique: I have never found engineers, finance people, economists, physicists, etc., to put up such a fight against clean procedure. Indeed, that's a whole different story. Best, Hans
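On the question's claim that a pooled t-test could at worst lose power but never inflate the type-I error: a small simulation (hypothetical setup, Python) suggests otherwise. When the smaller group has the larger variance, the pooled test's type-I error rate rises well above the nominal 5%, while Welch's stays close to it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, alpha = 2000, 0.05
rej_pooled = rej_welch = 0
for _ in range(n_sims):
    # H0 true (equal means), but the SMALL group has the LARGE variance
    a = rng.normal(0, 5, 10)
    b = rng.normal(0, 1, 40)
    rej_pooled += stats.ttest_ind(a, b, equal_var=True).pvalue < alpha
    rej_welch += stats.ttest_ind(a, b, equal_var=False).pvalue < alpha

print(rej_pooled / n_sims, rej_welch / n_sims)
```

The direction matters: if the larger group has the larger variance instead, the pooled test becomes conservative rather than liberal. Either way, the "pooled can only cost power" intuition holds only when the per-group variances (or sample sizes) are equal.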
Log transformation of values that include 0 (zero) for statistical analyses?
When trying to search for linear relationships between variables in my data I seldom come across "0" (zero) values, which I have to remove to be able to work with Log transformation (normalisation) of the data. However, it would be important to consider these values in the analysis. How can I do this? Should I assign a very low number to the missing data?
Donald Singer · United States Geological Survey
A good source of information on these issues is: Helsel, D.R., 2005, Nondetects and data analysis: Statistics for censored environmental data: Wiley-Interscience, 250p.
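Two common quick fixes can be sketched in a few lines of Python (hypothetical data below); for genuinely censored nondetects, the Helsel reference describes more principled methods:

```python
import numpy as np

values = np.array([0.0, 0.0, 1.2, 3.5, 10.0, 250.0])

# Option 1: log(x + c) with a small constant, e.g. half the smallest
# non-zero value; note that results can be sensitive to the choice of c.
c = values[values > 0].min() / 2
shifted = np.log(values + c)

# Option 2: log1p(x) = log(1 + x), defined at zero and ~x for small x;
# sensible only when the variable's scale makes "+1" a small perturbation.
log1p_vals = np.log1p(values)

print(shifted[:3], log1p_vals[:3])
```

Because the choice of constant can change the fitted relationships, it is worth reporting it and checking the sensitivity of the conclusions to it.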
Missing statistical method in textbook "MATLAB Recipes for Earth Sciences"?
I am currently working on the 4th edition of my textbook "MATLAB Recipes for Earth Sciences" with Springer. There will be an interactive ebook for tablet computers in addition to the regular PDF ebook and the printed book. Furthermore, I am expanding the existing chapters of the book including statistical and numerical methods that became popular during the last couple of years. Is there anything you think I should add to the book, a method which is widely used in your field of expertise? Thanks for your help, contents of the book attached!
Jorge Parra · Southwest Research Institute
Dear Martin, you can add geostatistical algorithms, such as variograms, co-variograms, kriging and cokriging. There are several open-source MATLAB codes that you can obtain from Computers & Geosciences. I have a good application for groundwater in the paper titled: Parra, J., and Emery, X., Geostatistics applied to cross-well reflection seismic for imaging carbonate aquifers, Journal of Applied Geophysics 92 (2013) 68–75. This work was done using open-source MATLAB codes. Best regards, Jorge
Testing equality of slopes across several regression models
I have several regression models, each with the same outcome and the same predictor variables. All observations are independent. I would like to test the null hypothesis that the coefficient for a specific covariate does not differ across the models. I have seen methods for doing multiple pairwise tests, but am looking for a single test. Any methods easily implementable in R would be appreciated.
Jochen Wilhelm · Justus-Liebig-Universität Gießen
So this is simply the interaction of "Predictor X" and "Smoking Prevalence".
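The interaction idea can be run as a single joint test by pooling the data and comparing a model with per-group slopes against one with a common slope. A numpy-only Python sketch on simulated, hypothetical groups (the same comparison is `anova(lm(y ~ x + g), lm(y ~ x * g))` in R):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
groups, n = 3, 100
slopes = [1.0, 1.0, 2.0]                    # third group's slope differs
x = rng.normal(size=(groups, n))
y = np.array([s * xi for s, xi in zip(slopes, x)]) + rng.normal(size=(groups, n))

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

# Full model: per-group intercepts AND per-group slopes.
# Reduced model: per-group intercepts, one common slope.
X_full = np.zeros((groups * n, 2 * groups))
X_red = np.zeros((groups * n, groups + 1))
Y = y.ravel()
for g in range(groups):
    rows = slice(g * n, (g + 1) * n)
    X_full[rows, g] = 1.0
    X_full[rows, groups + g] = x[g]
    X_red[rows, g] = 1.0
    X_red[rows, groups] = x[g]

df_num = groups - 1                          # slope-equality restrictions
df_den = groups * n - 2 * groups
F = ((rss(X_red, Y) - rss(X_full, Y)) / df_num) / (rss(X_full, Y) / df_den)
p = stats.f.sf(F, df_num, df_den)
print(F, p)
```

A single F statistic with (groups − 1) numerator degrees of freedom replaces all the pairwise comparisons.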
Can anyone help me with my problem involving Benjamini-Hochberg correction in complex general linear models?
I have the following problem with statistics (imposed by a journal reviewer): I have to use the Benjamini-Hochberg correction for the false discovery rate. In our work we tested three separate hypotheses: (1) bird abundance depends on habitat type, (2) bird numbers are higher in January than in December, and (3) there is an interaction between habitat type and month. In order to test these hypotheses I built generalized linear models (GLMs) in which all three hypotheses are tested in one model. However, I have 50 bird species to test these hypotheses on, so I built 50 GLMs. My question is: should I use only the p-values for a specific hypothesis (50 p-values for 50 species) when calculating the Benjamini-Hochberg correction, or all 150 values derived from the models for the 50 species (3 hypotheses x 50 models)? Personally, I believe I should use the 50 values for a specific hypothesis. Moreover, as I mentioned, all three hypotheses are tested in one statistical test. Any idea or different point of view? P.S. Let's leave aside the problem of the very high number of tests and power.
Sven Krackow · University of Zurich
If you are interested, for each species, in whether it is affected by your designed model effects, there is no reason to "correct" the p-values for anything, as you are testing 50 different hypotheses. Corrections are needed if you test the same hypothesis multiple times!
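For completeness, the Benjamini-Hochberg step-up procedure applied to one family of p-values (e.g. the 50 per-species values for a single hypothesis) is only a few lines; a Python sketch with made-up p-values:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of discoveries, controlling FDR at level q.
    Step-up rule: find the largest k with p_(k) <= q*k/m and reject the
    k smallest p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask

# Hypothetical family: 5 strong signals among 50 tests
pvals = np.concatenate([np.full(5, 0.001), np.linspace(0.2, 0.9, 45)])
print(benjamini_hochberg(pvals).sum())  # 5 discoveries
```

Whichever family you decide on (50 or 150 p-values), the key is that all p-values entered together are treated as one family for the FDR guarantee.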
Can anyone help with multicollinearity in nonlinear regression models and model simplification (deletion tests)?
While collinearity among explanatory variables is acknowledged in the case of multiple linear regression, for me a question mark still remains in the case of nonlinear regression. I am working with a multiplicative nonlinear model Y~(X1^a1)*(X2^a2)*…*(Xn^an). Some of the explanatory variables are linearly related, some even nonlinearly. The interactions are complex and the data are not orthogonal, the distribution of the explanatory variables being non-uniform and rather right-skewed. I thought that fitting the most complex model first (using all the explanatory variables) and then dropping variables in the order dictated by their significance would make a good strategy for obtaining the most parsimonious model, as explained by Crawley (2007). The same author underlines that the order of dropping variables matters when dealing with non-orthogonal data. What is not clear to me is: does it matter in any kind of model, linear or nonlinear, or is it just valid for linear ones? What if, by strictly following the significance levels, I risk dropping 'good' explanatory variables due to 'ill-conditioning' (see below)? Given the 'ill-conditioning', even if I obtain convergence for a complex model (with many correlated explanatory variables) but not necessarily significant estimates, does that model make any sense as a starting point for the simplification process? Though I cannot provide references, it came to my attention that I should not be afraid of collinearity in the case of MULTIPLICATIVE nonlinear models… On the other hand, Seber & Wild (2003) say 'multicollinearity in linear models can lead to highly correlated estimates and ill-conditioned matrices […] Unfortunately, all these problems are inherited by nonlinear-regression models', with the added complication that the confidence contours are curved. They end with 'However, with nonlinear models ill-conditioning can be a feature of the model itself […] good experimental design can reduce the problem, but it may not be able to eliminate it'. This is rather scary.
If anyone has dug deeper into the issue of model simplification of nonlinear models, considering non-orthogonal data and collinearity, any advice would be of great help. References: Crawley, M., 2007. The R Book. John Wiley & Sons. Seber, G.A.F., Wild, C.J., 2003. Nonlinear Regression. John Wiley & Sons.
Aurélie Cailleau · Afrique One Consortium
Is the relationship between explanatory variables linear, and is the collinearity between pairs of variables or between all variables? You may try to write the model in a way that accounts for collinearity, by estimating parameters that are independent (a1, a2, etc.) and parameters that account for the collinearity (r1, r2, etc.). If the relationship is linear and all variables are correlated, this is something like: Y~(X1^a1)*(X2^(r1*a1))*…*(Xn^(r^(n-1)*a1)). If the relationship is linear and variables are correlated in pairs, this is something like: Y~(X1^a1)*(X2^(r1*a1))*(X3^a3)*(X4^(r2*a3))*…*(Xn^(r^(n-1)*a1)). Well, I am not sure of the exact shape (whether a and r have to be multiplied or added), and you may actually need a few more parameters, but do you see what I mean? If the relationship is not linear, it is going to be more complex, but not impossible. Of course, for model selection you then have to start by deleting the variables that are described as correlated with other variables... Or I would rather suggest making AIC comparisons instead of stepwise hierarchical selection; that would allow you to compare models where variables are assumed to be correlated with models where they are not.
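One practical note: if the noise is multiplicative, taking logarithms converts Y = X1^a1 * … * Xn^an into a linear regression, so the familiar linear collinearity diagnostics (VIFs, condition numbers) apply directly to the log-transformed design. A Python sketch with simulated, hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X = rng.uniform(0.5, 5.0, size=(n, 3))
# Multiplicative model with multiplicative (log-normal) noise
y = (X[:, 0] ** 1.5) * (X[:, 1] ** -0.5) * (X[:, 2] ** 0.8) \
    * np.exp(rng.normal(0, 0.05, n))

# log Y = a1*log X1 + a2*log X2 + a3*log X3 + noise: an ordinary
# linear least squares problem in the log-variables
A = np.column_stack([np.ones(n), np.log(X)])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
print(np.round(coef[1:], 2))  # close to the true exponents 1.5, -0.5, 0.8
```

This only works for strictly positive data and assumes the errors enter multiplicatively; if the noise is additive on the original scale, a genuine nonlinear fit is still needed, but the log-linear fit remains a useful source of starting values and collinearity diagnostics.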
What prevents you from using a p-value other than 0.05 as your statistical significance cut-off?
I've been researching this problem for several years from a statistical perspective. I haven't heard many rational, purely scientific reasons to always use p<0.05, so I suspect this is more of a psychological issue. I'm hoping to find an enlightening answer that may serve as a basis for helping to change people's minds. I feel it's necessary to lay out some guidelines for answering this question: 1) I don't want this question to devolve into a discussion of why we shouldn't use NHST, or of better alternatives to the method. The criticisms are well known, but the fact of the matter is that the majority of scientists still use and publish results using NHST. Have a look through any journal if you don't believe me. 2) Please do not criticize others' responses. I am interested in honest answers.
James Schmidt · Ghent University
An alpha of .05 is, of course, partially arbitrary and conventional. But it is desirable that there is some convention on the matter. Were researchers able to "play" with the alpha level, then it would introduce extra experimenter degrees of freedom. For instance, effect I wanted: p = .09. "We set our alpha level at .10." Or conversely, effect I didn't want (e.g., because it confirms a competing account): p = .04. "We set our alpha level at .01." If we want to continue with NHST (which maybe we shouldn't, but I digress), then there does need to be a conventional cutoff that we all stick by (and that we also, of course, don't hack). An alpha of .05 is a reasonable number for this.
Jorita Krieger asked a question:
How can I test normal distribution and variance homogeneity in split-plot designs with SAS?
I need SAS codes for statistical analysis of field experiments (split-plot design: A/(B*C)-R). Are there any differences between one year analysis and the analysis over several years (A/(B*C*D)-R)? It wasn't a static experiment. In the worst-case scenario, which non-parametric post-hoc test can I use? Is the "HOLM-procedure for contrasts" a suitable method? I already used it for block-designs.
I stratified a dataset based on a categorical variable to get 10 strata. The purpose of the stratification is to increase observation homogeneity and remove potentially confounding unobserved effects. The observations are independent. Within each stratum, I fit a weighted multiple regression with the same outcome and predictor variables. I am interested in meta-analyzing the coefficients for one of the predictor variables. Specifically, I want an estimate of the coefficient across all strata and the corresponding confidence interval and p-value. Any methods which are easily implementable in R would be appreciated.
Huy Nguyen · University of Nagasaki
I think you can treat each stratum as an individual study and perform a meta-analysis of B1. Based on your figure, the data appear highly heterogeneous, so you should use a random-effects model. The pooled B1 would then be a good summary estimate. Hope it helps
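A minimal sketch of such a meta-analysis: inverse-variance pooling with a DerSimonian-Laird random-effects model (in R, metafor::rma implements this). The per-stratum coefficients and standard errors below are made up for illustration:

```python
import numpy as np
from scipy import stats

# hypothetical per-stratum regression coefficients (b) and standard errors (se)
b = np.array([0.42, 0.55, 0.30, 0.61, 0.25, 0.48, 0.39, 0.52, 0.35, 0.58])
se = np.array([0.10, 0.12, 0.08, 0.15, 0.09, 0.11, 0.10, 0.14, 0.08, 0.13])

v = se ** 2
w = 1 / v                                   # fixed-effect (inverse-variance) weights
b_fe = np.sum(w * b) / np.sum(w)            # fixed-effect pooled estimate
Q = np.sum(w * (b - b_fe) ** 2)             # Cochran's Q heterogeneity statistic
k = len(b)

# DerSimonian-Laird between-stratum variance estimate
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))

w_re = 1 / (v + tau2)                       # random-effects weights
b_re = np.sum(w_re * b) / np.sum(w_re)      # random-effects pooled estimate
se_re = 1 / np.sqrt(np.sum(w_re))
ci = (b_re - 1.96 * se_re, b_re + 1.96 * se_re)   # 95% CI (normal approximation)
p_value = 2 * stats.norm.sf(abs(b_re / se_re))

print(f"pooled B1 = {b_re:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), p = {p_value:.4g}")
```

In practice you would extract b and se from each stratum's fitted model (e.g., coef(summary(fit)) in R) rather than typing them in.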
Where can I find proof of Scheffé's method for multiple comparisons?
Any bibliographical reference is welcome. Thanks in advance!
George Seber · University of Auckland
Scheffe's proof is geometrical, using parallel tangent planes. An algebraic proof is much shorter (e.g., Seber and Lee, Linear Regression Analysis, 2nd edition, page 123; also in the first edition).
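As a small illustration of the result (not the proof itself): for all contrasts among p group means, Scheffé replaces the per-comparison t quantile with sqrt((p-1) * F_{p-1, N-p; 1-alpha}). A minimal Python sketch, with p, N, and alpha chosen arbitrarily:

```python
import numpy as np
from scipy import stats

# Scheffé multiplier for simultaneous CIs over ALL contrasts of p group means,
# based on N total observations (values below are arbitrary for illustration)
p, N, alpha = 4, 40, 0.05

scheffe_mult = np.sqrt((p - 1) * stats.f.ppf(1 - alpha, p - 1, N - p))
t_mult = stats.t.ppf(1 - alpha / 2, N - p)   # per-comparison t multiplier

# a Scheffé interval for a contrast c'ybar is: c'ybar ± scheffe_mult * se(c'ybar)
print(f"Scheffé multiplier {scheffe_mult:.3f} vs t multiplier {t_mult:.3f}")
```

The Scheffé multiplier is always the larger of the two; that is the price of protecting every possible contrast simultaneously.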
Standardized variables in weighted least squares regression?
I would like to fit a multiple regression model with a continuous response and several continuous predictor variables. Each observation has a corresponding weight, which I incorporate into the least squares fit (by setting the weights parameter in R's lm() function). I am interested in standardized coefficients, so I transform the variables into z-scores before fitting the model. However, since my observations have weights, do I need to use the weighted mean and weighted standard deviation when calculating the z-scores?
Daniel Himmelstein · University of California, San Francisco
Here is an R script implementing the method John suggests.
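The script referenced above is in an attachment not reproduced here. As an independent sketch of one reasonable approach (standardize with the weighted mean and weighted SD, then fit weighted least squares), here is a Python version with synthetic data; the variable names and data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)                 # observation weights
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

def weighted_zscore(v, w):
    """Standardize v using the weighted mean and weighted SD."""
    m = np.average(v, weights=w)
    sd = np.sqrt(np.average((v - m) ** 2, weights=w))
    return (v - m) / sd

# standardize predictors and response with weighted moments
z1, z2, zy = (weighted_zscore(v, w) for v in (x1, x2, y))
X = np.column_stack([np.ones(n), z1, z2])

# WLS via sqrt-weight scaling; equivalent to R's lm(zy ~ z1 + z2, weights = w)
sw = np.sqrt(w)
beta, *_ = np.linalg.lstsq(X * sw[:, None], zy * sw, rcond=None)
print(beta)
```

One nice consequence of using the weighted moments: the fitted WLS intercept is exactly zero, just as the unweighted intercept vanishes for ordinarily standardized data in OLS.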
Matlab or R?
I've been using Matlab in my research for years, including applications of statistics. Recently, I've been particularly interested in robust statistical methods (such as those suggested by Wilcox), which are not (as far as I can see) built into the Matlab Statistics Toolbox. One alternative may be R, which seems to be a very popular package for statistical applications. I'd like to ask this question of researchers and practitioners who are familiar with both packages: Is shifting from Matlab to R worth the effort? Is learning R and then applying it to my problems worth the additional effort, compared with implementing modern statistical methods myself in Matlab with the Statistics Toolbox, both of which I am already very familiar with?
L. Sanabria · Geoscience Australia
Burak, an 80000x500 matrix is not large for R, especially considering the computers available these days. I have worked with matrices of 1.3Mx250 in R with no problem (in a Linux environment, mind you). If you don't find what you want in R, develop it yourself; it is not as difficult as you may think. Besides, imagine the number of people who would benefit from your effort! We modern researchers have received a lot from other people; it is time for us to repay the debt. That is the philosophy of open source. Regards, Augusto
Is p=0.2 enough evidence to reject the H0?
In the article “Response of tree phenology to climate change across Europe” (F.-M. Chmielewski and T. Rötzer, 2001, Agricultural and Forest Meteorology 108, 101-112), the problem of multiplicity in statistics (see my old questions which I asked in January and February 2014) arises again. Their table 4 is a list of 78 correlation coefficients between different temperatures and the beginning of the growing season (B). Each of them is assumed to be significant at p<0.05. Because the multiplicity is neglected, the probability of one or more type 1 errors is 98% (assuming independence of the tests/data), even if the H0 is true.

And worse: for the last column (the average temperature from month x to y) in this table and for two natural regions (NR), x and y were selected in such a way that a maximum regression coefficient was achieved. This is contrary to proper statistical procedure, and the computed significances are useless. In table 5, we find 26 (2*13) trends. Here even p<0.2 is called “significant”. The chance of one or more false rejections of the H0 is now 99.7% (assuming the same preconditions as above). Hence I do not believe the significances in this table. Even the p<0.01 entries would not be significant at a global 0.95 (or 0.90) significance level if one took the multiplicity into account and used, e.g., Holm’s sequentially rejective Bonferroni method (to be significant, the lowest p-value would have to be lower than 0.05/12=0.0042 (0.10/12=0.008), but table 5 only tells us that p has some value less than 0.01).

Nevertheless, because almost all “trend in B” values in table 5 are negative, it is likely that the H1 is true (i.e., the trend is negative or unequal to zero) for most of the Natural Regions (NRs). I am wondering if anybody knows a statistical test that tells us which of the 12 trends in B are significant?
Maybe one could argue that the spatial correlation in temperature and “begin of growing season” is high, so the 12 trend tests are highly dependent and one could neglect the multiplicity. Another problem is that, even if there were only one single test, a relatively large p-value, e.g. p=0.05, does not necessarily mean that the H0 is more likely than the H1. Fisherian p-values do not measure the strength of the evidence against the null hypothesis! (We only know P(D | H0), the probability of the data D given that the H0 is true (or, in the case of p, the probability that the data are more extreme than those observed); but to decide that the H1 is more likely than the H0 we would have to know the ratio P(H0 | D) : P(H1 | D).) See, e.g., the very well written paper “A practical solution to the pervasive problems of p values” (E.-J. Wagenmakers, Psychonomic Bulletin & Review 2007, 14 (5), 779-804; http://www.ejwagenmakers.com/2007/pValueProblems.pdf), with its corrigendum: http://www.ejwagenmakers.com/2007/CorrigendumPvalues.pdf. Hence all results (also in other articles) showing p-values as large as or larger than 0.05 must be considered very carefully, especially if N (the number of “years”) is large. And significant results at p=0.2 in a multiple-testing environment are senseless in my judgement.

Some further remarks: concerning the significances, the authors did not mention whether they used one- or two-sided tests, and they did not define the H0 (and H1). Furthermore, I am missing confidence intervals, at least for the “EU trend of B” and the correlation coefficients between B and T24, and between NAO and T24. These intervals seem to be relatively wide (see table 5 and the widely scattered values for “trend of B” from different authors in the introduction).
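For completeness, Holm's sequentially rejective Bonferroni method mentioned above can be sketched in a few lines of Python; the example p-values at the bottom are hypothetical:

```python
def holm(pvals, alpha=0.05):
    """Holm's sequentially rejective Bonferroni procedure.

    Returns a list of booleans: True where the corresponding null
    hypothesis is rejected at family-wise error rate alpha.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        # step-down thresholds: alpha/m, alpha/(m-1), ..., alpha/1
        if pvals[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected

# hypothetical p-values from 4 trend tests
print(holm([0.001, 0.02, 0.04, 0.2]))  # [True, False, False, False]
```

Note how 0.02 would be "significant" on its own, but fails the Holm threshold of 0.05/3, so it and everything above it are not rejected.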
Fausto Galetto · Politecnico di Torino
Dear Klaus, my post was addressed to Peter.
How can we find the optimum K in K-Nearest Neighbor?
Sometimes it's mentioned that, as a rule of thumb, setting K to the square root of the number of training patterns/samples can lead to better results. Is there any justification for that rule, or have you ever seen it in a paper? Any other straightforward solutions?
Theodoros Anagnostopoulos · National and Kapodistrian University of Athens
Much could be said, but in the end you should evaluate the K value through a set of training/test evaluations. Note that the best K may vary from dataset to dataset, even within the same conceptual model. Further gains can come from data cleaning and preprocessing, while a feature selection algorithm should capture the contribution of each specific attribute and of the instance set as a whole, which in turn has an impact on K and on the classification result.
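The evaluation loop described above can be sketched as follows: cross-validate a simple KNN classifier over a grid of K values and keep the best. Everything here (the synthetic two-blob dataset, fold count, and K grid) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# two Gaussian blobs as a toy two-class dataset
X = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(2.5, 1.0, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

def knn_predict(X_train, y_train, X_test, k):
    """Majority vote among the k nearest training points (binary labels)."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    return (y_train[idx].mean(axis=1) > 0.5).astype(int)

def cv_accuracy(X, y, k, folds=5, seed=1):
    """Mean accuracy of k-NN over a simple k-fold split."""
    perm = np.random.default_rng(seed).permutation(len(y))
    accs = []
    for f in range(folds):
        test = perm[f::folds]
        train = np.setdiff1d(perm, test)
        accs.append((knn_predict(X[train], y[train], X[test], k) == y[test]).mean())
    return float(np.mean(accs))

# odd K values avoid voting ties in the binary case
scores = {k: cv_accuracy(X, y, k) for k in range(1, 16, 2)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

With scikit-learn available, KNeighborsClassifier plus cross_val_score does the same thing in fewer lines; the point is that K is picked by validation performance, not by a formula.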
Different result of skewness and kurtosis - any thoughts?
I am surprised to get different results for skewness and kurtosis from different applications or libraries. For example: with the moments package in R, skewness = -0.1450712, kurtosis = 1.464111; with the e1071 package in R, skewness = -0.123864, kurtosis = -1.81407; with Excel and SAS I obtain the same result, skewness = -0.1720333, kurtosis = -1.7509470. I think going with the last result is better, since two different applications agree. What do you think?
Kostas Katselidis · Aristotle University of Thessaloniki
I would strongly suggest reading the reference manual or help pages: library(e1071); help(skewness) or help(kurtosis). The signatures are kurtosis(x, na.rm = FALSE, type = 3) and skewness(x, na.rm = FALSE, type = 3), where type is an integer between 1 and 3 selecting one of the algorithms for computing skewness detailed below. Type 1: g_1 = m_3 / m_2^(3/2); this is the typical definition used in many older textbooks. Type 2: G_1 = g_1 * sqrt(n(n-1)) / (n-2); used in SAS and SPSS. Type 3: b_1 = m_3 / s^3 = g_1 * ((n-1)/n)^(3/2); used in MINITAB and BMDP.
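The same type distinctions can be reproduced in Python with scipy, where bias=False gives the SAS/SPSS-style small-sample adjustment. Note also that, if I recall correctly, R's moments package reports Pearson (non-excess) kurtosis, i.e. m_4/m_2^2 without subtracting 3, which accounts for most of the discrepancy in the question. The toy data below are an assumption:

```python
import numpy as np
from scipy.stats import skew, kurtosis

x = np.array([2.0, 4.0, 7.0, 3.0, 9.0, 1.0, 8.0, 6.0])  # toy data, n = 8

# Type-1 style (biased moment estimators): g1 = m3 / m2^(3/2)
g1 = skew(x, bias=True)
g2 = kurtosis(x, fisher=True, bias=True)     # excess kurtosis: m4/m2^2 - 3

# Type-2 style (SAS/SPSS/Excel): small-sample bias-corrected
G1 = skew(x, bias=False)
G2 = kurtosis(x, fisher=True, bias=False)

# Pearson (non-excess) kurtosis, comparable to R's moments package
pearson = kurtosis(x, fisher=False, bias=True)

print(g1, G1, g2, G2, pearson)
```

So the packages are not "wrong"; they simply default to different estimators, and the help page tells you which type you are getting.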