# Normal Distribution - Science topic

Continuous frequency distribution of infinite range. Its properties are as follows: (1) continuous, symmetrical distribution with both tails extending to infinity; (2) arithmetic mean, mode, and median identical; and (3) shape completely determined by the mean and standard deviation.
## Questions related to Normal Distribution
• asked a question related to Normal Distribution
Question
I am conducting a panel data regression for research on the economic growth of a few countries. In real life it is hard to find data that are normally distributed, and most of the control variables are correlated with each other in one country or another.
However, the regression test results are satisfactory and all show that the residuals are normally distributed and that there is no serial correlation or heteroscedasticity. Even the CUSUM and CUSUMSQ tests show that the model is stable.
In such a case, are the diagnostic tests enough to justify that the results of the regression model are reliable and valid, even when the data are not normally distributed and there is correlation among them?
There is no assumption about normality of predictor or outcome variables in linear regression (only about the normality of residuals/errors). Also, linear regression is designed to handle correlated predictors, so there should be no problem unless the correlations are extremely high.
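To illustrate the point, here is a minimal Python sketch with synthetic data (all values made up): the predictor is strongly non-normal, yet the residuals of the fitted line are approximately normal, which is what the assumption actually concerns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic example: a skewed (non-normal) predictor is fine in OLS,
# as long as the *errors* are roughly normal.
x = rng.exponential(scale=2.0, size=500)        # clearly non-normal predictor
y = 1.5 + 0.8 * x + rng.normal(0, 1, size=500)  # normal errors

# Fit a simple OLS line and inspect the residuals, not the raw variables.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Shapiro-Wilk on the residuals: a large p-value gives no evidence
# against normality of the errors.
stat, p = stats.shapiro(residuals)
print(f"slope={slope:.2f}, Shapiro-Wilk p on residuals={p:.3f}")
```

The same check applies to real data: test (or better, plot) the residuals of the fitted model rather than the raw variables.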
• asked a question related to Normal Distribution
Question
Hello!
I have a non-normally distributed variable (income), and although I tried to transform it into a normally distributed variable, the skewness and kurtosis values are still very high and there are lots of outliers. But I can't delete the outliers because they are in the nature of the income variable, so I didn't delete a single one (by the way, N=9918; I am not sure it is acceptable to delete 200 or 300 of them). I read that if, after conducting OLS, the residuals are normally distributed, it is acceptable to use the OLS results. But I couldn't find any academic source/strong reference about it.
I wonder: when I have normally distributed residuals, can I use the OLS results even if the variable has outliers and high skewness and kurtosis values? If this is an acceptable way to conduct the analysis, can you suggest an academic resource that I can reference to support this usage?
If you look up the assumptions for an OLS general linear model, you'll see that there is no assumption that the dependent variable is normally distributed.
Usually the assumption is written as the errors from the model being normally distributed. These are approximated by the residuals.
Pretty much any textbook on the design and analysis of experiments should be a reference for this. Slides in the following presentation are taken from Montgomery, Design and Analysis of Experiments. See slide 8, and then slides 19 - 22.
• asked a question related to Normal Distribution
Question
I am trying to model the NGN/USD exchange rate using some linear forecast methods. After transforming the data using the Box-Cox method, the data still failed to be normal. I checked for possible outliers, yet no outliers were found. So I am wondering what can be done to make the transformed data normal.
Daniel Wright , no, it just results in the default being the Elfving method. I don't think I had any great reason to choose one method over another for the default.
• asked a question related to Normal Distribution
Question
Which test do I use for a non-normally distributed data set (n=10)? I originally wanted to use a t-test. I want to compare two groups with each other.
You state "I want to compare two groups with each other." What aspect of the two groups do you want to compare?
• asked a question related to Normal Distribution
Question
I have 667 participants in my sample, and the outcome is continuous. I tested the normality, and the data on the histogram are bell-shaped, but the test results show that they are not normally distributed.
1- What is the cause of the discrepancy between the chart and the test results?
2- Can I still perform linear regression analysis on this data?
What the model requires is that the errors are mutually independent and identically distributed, approximately following a normal distribution with mean 0 and variance σ².
Also, I remind you that linear regression fits a mathematical function based on the equation of a straight line.
The rest follows from that.
• asked a question related to Normal Distribution
Question
My data were not normally distributed even after transformation, so I used the Kruskal-Wallis test and then went further to do a Bonferroni post hoc. I am a little mixed up about Dunn's test, the Bonferroni test, and Dunn's test with Bonferroni correction.
My favorite reference is attached here. David Booth
• asked a question related to Normal Distribution
Question
Hello there,
I just have a quick question when it comes to deciding whether a dataset is normally distributed. I have come across the situation where I have checked that the values of skewness and kurtosis remained between -1 and 1. However, when I checked these data using normality tests (the Kolmogorov-Smirnov and Shapiro-Wilk tests), these gave significant values, indicating that the data were not normally distributed. I have tried to transform the data and encountered the same situation: kurtosis and skewness values remaining between -1 and 1, but normality tests giving significant values. What do you decide in this case? Is the data normally distributed?
(data contains more than 2000 participants).
Joel
Joel Coll Ferrer , your data are clearly not normally distributed, and assuming that they are a sample from a normal distribution is very clearly wrong. This is independent of the sample size.
However, you want to make inference about the mean of that (unknown, clearly non-normal) distribution. For this inference, the distribution of the statistic (the sample mean) is what matters.
If, hypothetically, the data were taken from a normal distribution, then we would know that the sample mean could also be interpreted as a value taken from a normal distribution. In that case we could make inference based on the normal distribution model. This is what linear models (t-tests, ANOVAs, linear regressions) do. If the assumption that the data are approximately normally distributed is reasonable, then the inferences made are also reasonable, or approximately correct. This is again (quite) independent of the sample size.
Your data are very clearly not from a normal distribution. Assuming that they come from a distribution that is at least approximately normal is clearly not reasonable. Hence, inference based on such assumptions may be considerably wrong. Making reasonably correct inference would require knowing the distribution of the sample mean. It might be possible to derive this, but it may be difficult and require the help of a mathematical statistician.
The distribution of the sample mean depends on the distribution of the data. The central limit theorem tells us that for large sample sizes it will approximate the normal distribution reasonably well. Hence, large samples allow you to make reasonable inference about the mean using the standard linear models, even though we know that the distribution of the data is not normal.
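The last point can be illustrated with a small simulation (parameters are made up): data drawn from a strongly skewed distribution still produce sample means that are close to normally distributed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Draw many samples from a strongly skewed distribution (exponential)
# and look at the distribution of their sample means.
n = 200  # size of each sample
means = rng.exponential(scale=3.0, size=(10_000, n)).mean(axis=1)
raw = rng.exponential(scale=3.0, size=10_000)

skew_raw = stats.skew(raw)      # theoretical skewness of the data is 2
skew_means = stats.skew(means)  # the means are nearly symmetric

print("skew of raw data    :", round(skew_raw, 2))
print("skew of sample means:", round(skew_means, 2))
print("mean of sample means:", round(means.mean(), 2))  # close to 3.0
```

Repeating this with smaller n shows the approximation degrading, which is exactly the sample-size caveat in the answer above.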
• asked a question related to Normal Distribution
Question
I have one three-digit integer, with known parameters of a normal distribution (expectation and standard deviation). How do I calculate the appropriate value of this normal distribution for this integer?
Calculate with the help of this neat Normal Distribution Applet/Calculator:
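As a non-applet alternative, the density and cumulative probability at a given value can be computed directly, e.g. in Python with SciPy (the mean, SD, and integer below are made-up illustration values):

```python
from scipy.stats import norm

# Hypothetical values: mean 100, standard deviation 15, observed integer 115.
mu, sigma, x = 100.0, 15.0, 115

z = (x - mu) / sigma                 # standardized (z) value
density = norm.pdf(x, mu, sigma)     # height of the density curve at x
prob_below = norm.cdf(x, mu, sigma)  # P(X <= x), the cumulative probability

print(f"z={z:.2f}, density={density:.4f}, P(X<=x)={prob_below:.4f}")
```

Whether the density or the cumulative probability is the "appropriate value" depends on what the number is needed for.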
• asked a question related to Normal Distribution
Question
In my research, I have four categories in my questionnaire; each category comprises a set of items. My data proved not normally distributed, so a non-parametric statistical procedure will be used to determine the relationship among variables, i.e., the Spearman rank-order correlation is chosen. My question is: what do I choose from the function list in the Compute Variables option in SPSS, so that I can create new variables representing each set of items to use when I carry out the Spearman correlation? Is it the overall mean or the median for each set of items?
Thank you.
Beyond the lack of clarity of your exposition, is it possible that the university you attend doesn't have adequately trained statisticians?
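For the computation itself, here is a minimal Python sketch of the usual approach (the item mean per respondent as the composite, then a Spearman correlation); the data are made up for illustration, and SPSS's MEAN() function computes the same kind of composite:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical Likert responses (1-5): 100 respondents, two sets of items.
set_a = rng.integers(1, 6, size=(100, 4))  # 4 items in category A
set_b = rng.integers(1, 6, size=(100, 5))  # 5 items in category B

# The usual composite is the item mean per respondent.
score_a = set_a.mean(axis=1)
score_b = set_b.mean(axis=1)

# Spearman rank-order correlation between the two composite scores.
rho, p = stats.spearmanr(score_a, score_b)
print(f"rho={rho:.3f}, p={p:.3f}")
```

Since Spearman works on ranks, using the median instead of the mean as the composite often gives similar results, but the mean is the more common convention for summated scales.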
• asked a question related to Normal Distribution
Question
Can I run the ARDL method even if my residuals do not follow a normal distribution? I have already taken the log of the variables, and they follow a normal distribution on their own, but the residuals are not normal when I estimate the ARDL model.
In small samples, the residuals of a model need to be normally distributed for hypothesis tests on the coefficients to follow the standard distributions. In large samples the central limit theorem applies, and the coefficient estimates follow the conventional distributions even when the residuals are not normal. However, the tests for a long-run levels relationship in an ARDL model are non-standard. Nevertheless, my understanding is that in large samples the ARDL model can have non-normally distributed residuals and the simulated non-standard critical values will still be valid. This will not be the case in small samples. Others may have more information on whether the non-standard critical values are valid in large samples when the ARDL model's residuals are non-normal.
• asked a question related to Normal Distribution
Question
In my research question there are two independent variables and one dependent variable. The first independent variable has two levels (gender), the second independent variable has three levels, and the dependent variable has two levels, which are normally distributed. Can I use a two-way ANOVA?
Zoiya Naz ANOVA requires the dependent variable be measured on a continuous scale. I recommend binary logistic regression.
• asked a question related to Normal Distribution
Question
I am planning to analyze my data through SEM. However, my data are not normally distributed and some variables are heavily negatively skewed. I have tried to transform my data using log, square root, and inverse transformations, but normality tests still indicate a non-normal distribution. What else can I do to arrive at a normal distribution? Note that I have tried to run the data through AMOS and I keep getting very large numbers for the CMIN/DF fit indicator. I have a sample of over 500 cases.
Non-normality mostly affects the test statistics (chi-square test of model fit) and parameter standard errors when using maximum likelihood estimation in SEM. To obtain adjusted ("robust") test statistics, standard errors, and p-values with non-normal data, you can use the Bollen-Stine bootstrap option in AMOS.
• asked a question related to Normal Distribution
Question
Hello everyone,
I have normally distributed data. Before conducting CFA, is it necessary to remove outliers?
First, I removed two of the outliers and conducted CFA: my CMIN/df value was <3, but one of the factor loadings was .25. Then I conducted CFA without removing any outliers: my CMIN/df value was slightly over 3, but the factor loading I mentioned became .29. The other indices (GFI, CFI, TLI, RMSEA) did not change much between the two runs.
It depends on the outliers, but always remember that normally distributed means normally distributed: if there are outliers, your distribution is at best a contaminated normal. So how much contamination there is really matters. Investigate that if you can, but please remember that there are robust methods to help you. See the attachment for more details. Best wishes, David Booth
• asked a question related to Normal Distribution
Question
Regarding SPSS.
I have an empirical dataset with a mix of scale variables and dichotomous variables. I am seeking to test the relation between the scale variables and dichotomous, and the relation between two dichotomous variables.
Can I perform a linear regression for the scale and dichotomous variables? I have heard that dichotomous variables are not normally distributed, and that normality is an essential assumption for linear regression.
And can I perform a chi-square test between two dichotomous variables? Spearman was suggested as well, but those two tests give the same results. Is the sampling supposed to be random?
Thanks
You have to decide from which direction you want to look at the relationships. When you have a continuous and a dichotomous variable, you may ask how the expectation of the continuous response variable depends on the value of the dichotomous predictor, or how the expectation of the dichotomous response variable depends on the value of the continuous predictor. Either way, the question can be addressed by some form of regression model that depends on the characteristics of the response variable.
When the response is dichotomous, a logistic regression model is used, because the distribution of a dichotomous variable must be Bernoulli. When the response is continuous, the form of the regression model depends on the assumed distribution model of the response variable, which without further specification can be any continuous distribution model.
Such models can include any kind and any number of predictor variables. Nominal predictor variables in such regression models are typically 0/1 coded. If the predictor is dichotomous, one of the model parameters then reflects the expected difference in the response between the "groups".
For continuous predictors, the form of the relationship with the response must be specified, which may be simply linear but may also follow some arbitrary nonlinear relationship.
One can test hypotheses about parameter values in such models, typically in the form of likelihood-ratio tests (the test statistic is the ratio of the likelihood under the unrestricted model to the likelihood under the model restricted at the hypothesized parameter values). In most cases the sampling distribution of the likelihood ratio is not known, but Wilks' theorem says that the distribution of twice the log likelihood ratio is at least approximately chi-squared. So one can do chi-squared tests, which are usually approximate likelihood-ratio tests. Tests about a single parameter can be equivalently formulated as z-tests (when Z is standard-normally distributed, Z² is chi-squared distributed).
If the distribution of the response is normal, then the sampling distribution of twice the log likelihood ratio is exactly chi-squared, and z-tests about individual parameters are also exact tests.* However, the normal distribution has an independent parameter for the variance, which must be known/given. If this, too, must be estimated from the data, the additional uncertainty can be accounted for with F- and t-tests, respectively. These tests are exact for normally distributed response variables with unknown variance. For response variables with other distributions that have an independent variance parameter (e.g., gamma), the F/t tests are superior to the chi-squared/z tests but not exact as for the normally distributed response.
* "Exact" means that the derivation is mathematically exact and not approximate. It does not mean that the actual results are "exact". This "exact" business is not very relevant in practice. More relevant is whether the assumed distribution model for the response is reasonable, and whether the assumed functional forms of the relationships are reasonable, too.
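The likelihood-ratio logic above can be made concrete with a deliberately simple sketch: a restricted (one common probability) versus an unrestricted (two probabilities) binomial model, with hypothetical counts. Per Wilks' theorem, twice the log likelihood ratio is referred to a chi-squared distribution with df equal to the difference in free parameters.

```python
import numpy as np
from scipy import stats

# Hypothetical data: successes/trials in two groups.
k1, n1 = 30, 100
k2, n2 = 45, 100

def binom_loglik(k, n, p):
    # Binomial log-likelihood kernel (constant terms omitted; they cancel).
    return k * np.log(p) + (n - k) * np.log(1 - p)

# Restricted model: one common probability for both groups (1 parameter).
p_common = (k1 + k2) / (n1 + n2)
ll_restricted = binom_loglik(k1, n1, p_common) + binom_loglik(k2, n2, p_common)

# Unrestricted model: each group has its own probability (2 parameters).
ll_full = binom_loglik(k1, n1, k1 / n1) + binom_loglik(k2, n2, k2 / n2)

# Wilks: 2 * log(LR) is approximately chi-squared with df = 2 - 1 = 1.
lr_stat = 2 * (ll_full - ll_restricted)
p_value = stats.chi2.sf(lr_stat, df=1)
print(f"LR statistic={lr_stat:.2f}, p={p_value:.4f}")
```

The same comparison run through a logistic regression with a single 0/1 predictor would give an equivalent likelihood-ratio test.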
• asked a question related to Normal Distribution
Question
I have studied the effect of hyperglycemia on subarachnoid hemorrhage (preclinical) and I used PET and MRI for this purpose. I am now comparing the data between hyperglycemic and normoglycemic animals on days 1 and 3 (I measured the volumes of damage). For the statistics, I am using GraphPad Prism. I put all my data in columns and performed a normality test; the data do not follow a normal distribution, so parametric analysis is automatically dismissed. Then I performed a Kruskal-Wallis analysis, but here is my concern:
As I am interested in assessing the difference between hyperglycemia and normoglycemia (differences on day 1 and differences on day 3), I thought that maybe a Mann-Whitney test choosing the variables two by two would be correct as well (I have two groups, not four). What do you think? Is it correct?
Mann–Whitney U test
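For what it's worth, a minimal sketch of that two-by-two comparison in Python (the lesion volumes below are simulated placeholders, not real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical lesion volumes (arbitrary units) for the two groups on day 1.
normoglycemic = rng.lognormal(mean=1.0, sigma=0.5, size=10)
hyperglycemic = rng.lognormal(mean=1.6, sigma=0.5, size=10)

# Two-sided Mann-Whitney U test comparing the two independent groups.
u, p = stats.mannwhitneyu(hyperglycemic, normoglycemic, alternative="two-sided")
print(f"U={u}, p={p:.4f}")
```

The same call would be repeated for the day-3 data; with two separate pre-planned comparisons, a multiplicity correction (e.g. Bonferroni) may be worth considering.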
• asked a question related to Normal Distribution
Question
In the use of Expectation Maximization, which is one of the methods used in missing data analysis, is it necessary for the variable containing missing data to be normally distributed?
• asked a question related to Normal Distribution
Question
Hi
My dependent variable is categorical and independent variables are ratio.
However, dependent and independent variables are not normally distributed and have outliers.
What suitable correlation test can be used?
Thank you
(Multinomial) logistic regression. Possibly after a transformation of the IVs to linearize the relationship between the IVs and the log odds ratios. I suggest you contact a local statistician.
• asked a question related to Normal Distribution
Question
One way ANOVA has multiple assumptions. Amongst them, what do these two mean?
-There should be homogeneity of variances
-Dependent variable should be normally distributed for each category of the independent variable
-what does 95% confidence interval in one-way ANOVA mean? Does this mean F statistic is 95% likely to be true compared to population mean?
-For a significant F statistic, I should use Dunnett's post hoc test to compare group means to the control mean. Tukey's post hoc test would be used if I want to compare the means of each group to each other. In simple terms, what are the Bonferroni and Sidak post hoc tests for?
-There should be homogeneity of variances
The variances of the populations from which the groups are sampled are all assumed to be identical. This assumption is required to make and use a single pooled variance estimate from all the observed data.
-Dependent variable should be normally distributed for each category of the independent variable
The distribution of the dependent variable is assumed to be normal, independent of which population was sampled. This assumption is required to derive the sampling distribution of the test statistic from which the p-values are calculated.
-what does 95% confidence interval in one-way ANOVA mean? Does this mean F statistic is 95% likely to be true compared to population mean?
No. Your idea makes no sense. A 95% confidence interval is (very typically) the interval of all hypothesized parameter values that cannot be rejected at the 5% level of significance.
-For a significant F statistic, I should use Dunnett's post hoc test to compare group means to the control mean. Tukey's post hoc test would be used if I want to compare the means of each group to each other.
Both procedures, Dunnett's and Tukey's, are based on series of t-tests performed using the pooled variance estimate to get the standard errors of the mean differences. Dunnett's procedure controls the family-wise error rate of "multiple-to-one" comparisons; Tukey's procedure controls the family-wise error rate for "all-pairwise" comparisons. No ANOVA is required, and there is no need for the ANOVA to be significant. Fisher's "least significant difference" method, for exactly 3 tests between 3 groups, is an ANOVA-protected procedure for controlling the family-wise error rate of these 3 tests (performed only when the ANOVA is significant at the same level).
- In simple terms, what are the Bonferroni and Sidak post hoc tests for?
Bonferroni's procedure is a particularly simple way to control the family-wise error rate across any family of tests. You calculate the least-significant-difference tests (t-tests using the pooled variance estimate), and then you simply compare the p-values from these tests to the level of significance divided by the number of tests (this is called the Bonferroni correction). Sidak's procedure is extremely similar, using a slightly more difficult to calculate correction factor. The differences seem mostly of academic (not practical) interest to me (for instance, see Journal of Modern Applied Statistical Methods 14(1), 12-23).
Both correction methods can be applied to any series of p-values, which may come from entirely unrelated tests.
There are improvements of these procedures that slightly increase the power, particularly when the number of tests is large. The most frequently used improvement is that by Holm (aka Bonferroni-Holm or Sidak-Holm), which starts with sorting the p-values and then adjusts the stringency of the tests while "stepping down" the list of sorted p-values.
Note: "post-hoc tests" are tests about subgroups of the data that can be performed only after the entire data of all groups is lying on the table. The typical post-hoc tests are t-tests, that are performed using a pooled variance estimate to calculate the standard errors. This pooled variance estimate can be calculated only when all the data are available: you can compare group A and B only after you also have the data for the groups C, D, E, .... This logically is unrelated to omnibus procedures like ANOVA.
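The Bonferroni and Holm corrections described above are easy to compute by hand; here is a small self-contained sketch with made-up p-values (statsmodels' `multipletests` offers the same adjustments):

```python
# Bonferroni and Holm adjustments applied to an arbitrary set of p-values.
raw_p = [0.004, 0.010, 0.030, 0.200]
m = len(raw_p)

# Bonferroni: multiply each p by the number of tests (cap at 1).
bonferroni = [min(p * m, 1.0) for p in raw_p]

# Holm: sort ascending, multiply the i-th smallest p by (m - i),
# then enforce monotonicity while "stepping down" the sorted list.
order = sorted(range(m), key=lambda i: raw_p[i])
holm = [0.0] * m
running_max = 0.0
for rank, i in enumerate(order):
    adjusted = min(raw_p[i] * (m - rank), 1.0)
    running_max = max(running_max, adjusted)  # keep adjusted p-values monotone
    holm[i] = running_max

print("Bonferroni:", bonferroni)
print("Holm      :", holm)
```

Comparing the two outputs shows the power gain: Holm's adjusted p-values are never larger than Bonferroni's.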
• asked a question related to Normal Distribution
Question
Hi
I'm analysing data coming from a QoL scale where lower scores indicate better quality of life.
As the data in the pre and post intervention reviews is not normally distributed, I've used a Wilcoxon Signed Ranks Test for 2 related samples.
The sample size is not tiny but not very large either (n=51); there is a clear difference in the rankings, with the sum of ranks for negative ranks (which in this case indicate improved QoL) being higher than the sum for positive ranks (598.50 vs 262.50), Z=-2.188 and the p value =.029.
However there is no change in the median values (12 at pre and post).
Would it make sense to report only the ranks and not the median? Or to report the median change by ranks (negative, positive, ties)? I'm getting a tad confused.
The sums of ranks aren't always very helpful for the reader because they depend on the sample size of the data analyzed, and aren't in the same units as the original data.
One simple way to express the change in values is to present the median of the differences in scores. Or a five-number summary of these differences. Or a histogram of these differences:
A helpful plot is a bivariate plot of Before and After, with a one-to-one line superimposed. Here, points above and to the left of the line indicate an observation where the value increased.
You could also calculate the number of observations that increased, decreased, or were tied, and express this as a proportion of the total observations.
Another effect size statistic is the matched-pairs rank-biserial correlation coefficient. This is a standardized effect size measure that matches the signed-rank test. See: King, B.M., P.J. Rosopa, and E.W. Minium. 2000. Statistical Reasoning in the Behavioral Sciences, 6th ed. Wiley.
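The suggestions above (median of differences, proportions of increases/decreases/ties, and the matched-pairs rank-biserial correlation) can be sketched in Python with simulated placeholder scores; the rank-biserial formula used here (difference of positive and negative rank sums over the total rank sum) is one common definition:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical QoL scores (lower = better) before and after an intervention.
before = rng.integers(5, 20, size=51)
after = before - rng.integers(-2, 4, size=51)  # simulated changes

diff = after - before
print("median difference  :", np.median(diff))
print("five-number summary:", np.percentile(diff, [0, 25, 50, 75, 100]))

# Proportions of decreases (improvements here), increases, and ties.
decreased = np.mean(diff < 0)
increased = np.mean(diff > 0)
tied = np.mean(diff == 0)
print(f"improved={decreased:.2f}, worsened={increased:.2f}, tied={tied:.2f}")

# Matched-pairs rank-biserial correlation from the signed-rank construction:
# rank the non-zero |differences|, then take
# (sum of positive ranks - sum of negative ranks) / total rank sum.
nz = diff[diff != 0]
ranks = stats.rankdata(np.abs(nz))
r_rb = (ranks[nz > 0].sum() - ranks[nz < 0].sum()) / ranks.sum()
print(f"rank-biserial r = {r_rb:.3f}")
```

These summaries can be reported alongside the Wilcoxon Z and p-value even when the two medians happen to coincide.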
• asked a question related to Normal Distribution
Question
Dear Scholars,
I have 4 independent variables, 1 mediator, and 1 dependent. The sample size is 214
1- After conducting normality tests (Kolmogorov–Smirnov and Shapiro–Wilk), I found the following:
A- 2 independent variables were non-normally distributed according to Kolmogorov–Smirnov, and 2 were normally distributed.
B- However, I found all were normally distributed according to the Shapiro–Wilk test.
C- The skewness, kurtosis, and histograms clearly indicated normal distributions.
2- After conducting a multicollinearity test, I found the independent variables were not significantly correlated.
3- I conducted a correlation test and found that 1 of the variables that was non-normal according to Kolmogorov–Smirnov was not significantly correlated with the mediator or with the dependent variable.
Because of that, I will not consider it in further analysis steps.
Thus I now have 3 independent variables, 1 of which is non-normally distributed according to Kolmogorov–Smirnov (but normal according to Shapiro–Wilk, skewness, kurtosis, and the histogram).
4- Now, I am going to use Baron and Kenny's steps for mediation analysis with Preacher and Hayes (2004) bootstrap method.
The questions are as follows:
1-if my approach is right or not?
2- If I have to conduct CFA or not? (The questionnaire was adopted after translation, content validation, and reliability test)
3- If there are missing steps in my analysis approach?
Your model may be saturated or "just identified" (df = 0). This would be the case if you estimated all possible paths between variables. You should still be able to look at the parameter estimates for a saturated/just-identified model. You just don't get an overall chi-square test of model fit (because a saturated model by definition reproduces the observed data perfectly).
• asked a question related to Normal Distribution
Question
Hi everyone,
I have samples from diagnosed AD patients distributed in three groups (Ctrl, mild and severe groups) with n=10 for each of the groups. I have performed WB for each of those 10 samples and assessed the expression levels of different proteins. I have normalised those proteins to a loading control and then I have normalised the mild group and severe group to the CTRL group.
The data obtained does not seem to follow a normal distribution and there is a great variability between patients.
Considering that, and that my n is below 30 samples, I decided to run a non-parametric test (Kruskal-Wallis H test) followed by Dunn's correction for multiple comparisons when the p-value was <0.05.
I am unsure whether I should have chosen an ANOVA to compare the means of the groups even though the data do not follow a normal distribution. I have considered doing a log transformation (the graph looks much better because I had high SEM), which apparently helps make your data normally distributed, and then applying a parametric test (ANOVA), but I am still in doubt.
Any suggestions would be helpful. I have found very different opinions in previous similar discussions, but none of them helped me decide which test to choose.
Many thanks.
Kr,
Nerea
When I hear things like "...patients distributed in three groups (Ctrl, mild and severe groups)..." my alarm starts ringing. Very often (but not always), something quantitative is measured (e.g. beta-amyloid, reaction time, number of correctly performed tasks, etc.) to get a "diagnosis", which is then broken down by often arbitrarily defined cut-offs into mild/moderate/severe and so on. This might be reasonable for the clinical physician but definitely not for research work. But that is not the point of your question...
But there is another point before I come to your question: quantification by WB densitometry very often lacks appropriate controls. You should make sure that changes in the protein concentration (in the relevant range) can be and will be detected. You need an external standard for this. This is certainly laborious, but without it the whole procedure is pretty much based on hope.
Ok, now to your question: protein concentration is gamma- or log-normally distributed. So the log-transformed values should be approximately normally distributed, and this is exactly what you see. Testing hypotheses about differences in log concentrations (-> ANOVA, t-test, regression) is ok.
Jos Feys , normalizing the band intensity values of the protein of interest from a WB to those of a loading control is common practice in the field. Each protein preparation has a different yield. Measuring a protein that is assumed to have a constant concentration in all samples (in the living cells) allows one to eliminate most of these differences between samples by normalization. Additionally, there are differences between every blot/membrane (it depends very sensitively on many things), and these effects may also be protein/antibody specific. Hence, comparing data from different membranes requires additional normalization to a standard (that is measured on each membrane).
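The log-then-ANOVA route suggested above might look like this in Python (group levels and spread are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical normalized band intensities for three groups (n=10 each).
# Log-normal data: multiplicative noise around group-specific levels.
ctrl = rng.lognormal(mean=0.0, sigma=0.4, size=10)
mild = rng.lognormal(mean=0.3, sigma=0.4, size=10)
severe = rng.lognormal(mean=0.7, sigma=0.4, size=10)

# On the log scale the data are (approximately) normal, so a standard
# one-way ANOVA on the log intensities is reasonable.
f, p = stats.f_oneway(np.log(ctrl), np.log(mild), np.log(severe))
print(f"ANOVA on log intensities: F={f:.2f}, p={p:.4f}")
```

Post hoc pairwise comparisons (e.g. t-tests with a multiplicity correction, or Tukey's procedure) would then also be run on the log scale.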
• asked a question related to Normal Distribution
Question
Hi RG community,
I have a dataset which has a non-normal distribution and I want to perform linear regression on it (using Graphpad Prism 9.3.1). Here is the process I follow:
1. First determine the normality of the data set using the D'Agostino-Pearson omnibus normality test.
2. Find correlation coefficient (r): Pearson Correlation if data has normal distribution or Spearman Correlation if data has non-normal distribution.
3. Then I perform linear regression in Graphpad Prism and find out the fit equation and the goodness of fit (R-squared or R2).
Usually, for normally distributed data, the R-squared term numerically turns out to be the square of the correlation coefficient r (the Pearson correlation coefficient),
i.e., R-squared, or R², = r²
However, this is not true for non-normally distributed data. (Is this normal?)
Hence, I just wanted to get clarified on a few silly things: (haven't really worked much with non-normally distributed data)
1. Is it ok to have R-squared or R2 not equal to (r)2 for non-normally distributed data?
2. Is it ok to perform linear regression on non-normally distributed data as is or does it need some modification before linear regression can be performed?
I appreciate any help regarding this topic.
Thank you!
Linear regression only makes a normality assumption with regard to the residuals (errors). I would take a look at residual distribution plots (e.g., histograms) and residual scatter plots and see what they look like. Unless the residual plots reveal strong non-normality or other issues (e.g., non-linearity), you may be fine simply running conventional OLS linear regression and computing the corresponding R-squared.
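A quick numerical check of point 1 (with made-up, non-normal data): for simple linear regression, R² always equals the square of Pearson's r, whatever the distributions; the mismatch appears only when R² is compared against the square of Spearman's rho, which is a different (rank-based) statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Deliberately non-normal data and non-normal errors.
x = rng.exponential(2.0, size=100)
y = 0.5 * x + rng.exponential(1.0, size=100)

res = stats.linregress(x, y)           # simple OLS fit
r_pearson = stats.pearsonr(x, y)[0]    # Pearson correlation

# Identical by construction, regardless of the data's distribution:
print(f"R^2 from regression: {res.rvalue**2:.4f}")
print(f"Pearson r squared  : {r_pearson**2:.4f}")

# Spearman's rho is computed on ranks, so its square need not match R^2.
rho = stats.spearmanr(x, y)[0]
print(f"Spearman rho squared: {rho**2:.4f}")
```

So if the R² reported by Prism does not equal the square of the correlation you computed, the likely reason is that a Spearman (not Pearson) coefficient was being squared.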
• asked a question related to Normal Distribution
Question
Is normal distribution of data necessary in methods used to complete missing data such as Markov Chain Monte Carlo (MCMC) , Expectation Maximization (EM), k-NN Algorithm, Multiple Imputation (MI)?
If there is a need to transform the data to a normal distribution, then any transformation that does that would be acceptable. The one I proposed is very general, but certainly not the only option.
• asked a question related to Normal Distribution
Question
I have 2 groups in which I'm measuring the effect of an intervention. There is a large variation in SD between the intervention and non-intervention groups: the intervention SD is 74.24 and the non-intervention SD is 63. How can this reflect the intervention as a true predictor vs. non-intervention? Do I focus on normality (both groups are normally distributed), or something else? How do I discuss this as a limitation? Thanks
Hello Mariyana,
I'm not quite clear on what you have. Is it: (a) two groups, one of which received an intervention and one of which did not, and each group was subsequently measured on some relevant variable; or (b) two groups, each measured pre-intervention and post-intervention on some relevant measure? (or something else?)
If (a), then assuming groups were in fact comparable prior to the intervention, the fact that the intervention group has a slightly higher SD than the non-intervention group simply means the impact of the intervention wasn't equal for all participants in the intervention condition. Of course, if groups weren't comparable at the outset, then you simply can't be sure this wasn't merely an a priori difference between groups/batches.
If (b), pretty much the same thing: post-intervention scores are more variable than pre-intervention scores, which implies some persons benefited more from the intervention than did others.
It's not apparent why this might be a study limitation, unless you're worried about a homogeneity of variance assumption for a significance test (as in option "a", above). If that's the case, then you can:
1. Opt for bootstrap/resampling methods to estimate group differences (and the CI for same);
2. Apply an exact/permutation test;
3. Use a robust regression model;
4. If it makes sense to do so, try transforming scores on the measure of interest so as to stabilize variances;
5. Use one of the common methods for adjusting/penalizing t-test/anova models for violation of homogeneity, such as: Welch method; Brown-Forsythe method.
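Two of these options (the bootstrap CI and the Welch test) can be sketched in a few lines. The group values below are simulated stand-ins using the SDs from the question; the means and sample sizes are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated stand-ins for the two groups, using the SDs from the
# question (74.24 and 63); means and n are invented for illustration.
intervention = rng.normal(loc=250.0, scale=74.24, size=40)
control = rng.normal(loc=220.0, scale=63.0, size=40)

# Option 5: Welch's t-test drops the equal-variance assumption.
t_stat, p_val = stats.ttest_ind(intervention, control, equal_var=False)
print(f"Welch t = {t_stat:.2f}, p = {p_val:.4f}")

# Option 1: percentile bootstrap CI for the difference in means.
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    a = rng.choice(intervention, size=intervention.size, replace=True)
    b = rng.choice(control, size=control.size, replace=True)
    diffs[i] = a.mean() - b.mean()
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the mean difference: [{ci_low:.1f}, {ci_high:.1f}]")
```

If the Welch p-value and the bootstrap CI tell the same story, the unequal SDs are unlikely to be a practical problem.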
• asked a question related to Normal Distribution
Question
Hi,
the two samples show a significant Shapiro-Wilk p-value when I'm applying it for each of them. When I'm running the t-Test my stats software (JASP) allows me to check assumptions and now it says that normal distribution is not violated. How can that be? Is it about the distribution of the mean differences?
Thanks and have a nice day!
The paired t test makes no assumption about the distribution of the two samples, just (as you suspect) their differences. These are assumed to be sampled from a normal distribution. (This is also equivalent to assuming normality of residuals/errors in a regression model).
In any case I see little value in a Shapiro-Wilk test (or similar) for normality in testing assumptions of tests. The issue is statistical power. In large samples the tests detect minor violations of assumptions that are not likely to impact the interpretation. In small samples they fail to detect major violations. Just use graphical methods or descriptive statistics.
If you have trouble interpreting these, simulation can be a useful learning tool. For a variable with mean m and SD s, simulate draws from a perfect normal distribution with the same n and plot the shape (or calculate descriptives). Repeat this a few times. This gives you a feel for how "normal" data should look when the assumption is met.
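A minimal sketch of that simulation exercise (m, s and n below are placeholder values, not data from any question in this thread):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Placeholder values for the observed sample's mean, SD and size.
m, s, n = 50.0, 10.0, 30

# Draw repeatedly from a *perfect* normal distribution with the same
# parameters and look at the descriptives each draw produces.
for rep in range(5):
    sim = rng.normal(loc=m, scale=s, size=n)
    w, p = stats.shapiro(sim)
    print(f"rep {rep}: skew={stats.skew(sim):+.2f}, "
          f"excess kurtosis={stats.kurtosis(sim):+.2f}, Shapiro p={p:.3f}")
```

Even data drawn from a truly normal distribution show wobbly skew and kurtosis at n = 30, which is exactly the point of the exercise.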
• asked a question related to Normal Distribution
Question
Hello. I am testing a model fit in AMOS. The data follow a normal distribution and meet the assumptions of linearity, absence of multicollinearity, and homoscedasticity. I have used maximum likelihood (ML).
The problem is that in the standardized weights, the correlations and the R-squared, no information about the significance and confidence intervals are provided.
I have seen that this information can be accessed using ML together with bootstrapping, also the confidence intervals. Can ML be used together with bootstrapping when the data follows a normal distribution?
I don't see why not. If you're meeting the assumptions, then I would expect that your asymptotic/theoretical ML standard errors are similar to the robust/bootstrap standard errors (you should verify this for the unstandardized solution--simply compare the ML and bootstrap standard errors and p values).
It is unlikely that you are "perfectly" meeting the multivariate normality assumption anyway. The fact that the normality hypothesis is not rejected in your sample does not mean it is true. It could just be a sign of "insufficient" power.
• asked a question related to Normal Distribution
Question
Hello all! As part of my master's thesis, I did an experiment. I now have 5 groups with n=45 participants each. When I look at the data for the manipulation checks, they are not normally distributed. In theory, however, an ANOVA needs normally distributed data. I know that an ANOVA is a robust instrument and that I don't have to worry about it with my group size. But now to my question: Do I in fact need normally distributed data at all for a manipulation check measure or is it not in the nature of the question that the data is skewed? E.g. If I want to know if a Gain-Manipulation worked, do i not want to have data skewed either to the left (or right - depending on the scale)?
Would be great if somebody could give me feedback on that!
Best
Carina
I want to emphasize Daniel Wright's comment about assumptions applying to the populations from which you sampled. Textbooks often present the F-test for one-way ANOVA as if it is an exact test. But in order for it to be an exact test, you would need to have random samples from k populations that are perfectly normally distributed with exactly equal variances. (In addition to that, each observation would have to be perfectly independent of all other observations.) Even if it was possible to meet those conditions (which it is not if you are working with real data), the samples would not be perfectly normal, and would not have exactly equal sample variances.
Because it is not possible to meet the conditions described above (at least when you are working with real data, not simulated data), the F-test for ANOVA is really an approximate test. And when you are using an approximate test, the real question is whether the approximation is good enough to be useful.* That's how I see it. YMMV. ;-)
* Yes, I am borrowing "useful" from George Box's famous statement(s) about all models being wrong, but some being "useful". Several variations on that statement can be found here:
• asked a question related to Normal Distribution
Question
Hello!
I am in the process of adapting the questionnaire for my research. The data on one of the scales has a strong deviation from the normal distribution - 64% of the answers have the lowest possible value (the sample consisted of 282 people). The rest of the responses were also distributed towards the low values. There were 3 items on the scale. A Likert scale of 1 to 7 was used for responses.
However, it seems to me that the construct being measured implies such a result in this sample.
Can you tell me whether such a scale can be considered usable? Or is it too heavily skewed? Are there any formal criteria for this?
I've already done a CFA, and I've checked the test-retest reliability and convergent validity (the results are acceptable).
It depends on what statistical analyses/estimators you are planning to use. For example, for conventional CFA/SEM with maximum likelihood (ML) estimation (assuming continuous variables), extreme skew could be a problem because ML estimation rests on the assumption of multivariate normality of the variables. Although robust ML estimators are available, skewness could still be a problem because the inter-item covariances/correlations can be affected/biased by skewness (and, as a consequence, also the factor structure). In that situation, it may be useful to try out alternative estimation methods (e.g., WLSMV estimation using polychoric correlations vs. ML estimation) and conduct sensitivity analyses (i.e., checking how much the results differ across different estimation methods).
I recommend studying the extensive literature on non-normality and use of categorical (ordinal) variables in CFA/SEM. A Google Scholar search using terms like "ordinal items CFA", "non-normal data CFA" and the like will point you to the relevant papers.
• asked a question related to Normal Distribution
Question
I run a two way random effects model and in residual diagnostics, I found that the residuals are not normally distributed. What should I do now?
Dewan -
I have not really worked with panels, but in general, the estimated variance of the prediction error is made up of a part for the model (coefficients), and a part for the estimated residuals.  (The former also involves the latter.)  The former becomes smaller with larger sample sizes, and the central limit theorem applies to estimating the coefficients.  The difference between the former and latter parts is a difference between standard errors and a standard deviation.  If the latter is a big part of your estimated variance of the prediction error, then your prediction interval will not be in a t or normal shape, and I suppose rather skewed, which may make it harder to interpret.
Residuals that are not close to normally distributed therefore impact your prediction interval. I think it may also mean the prediction is not the best possible, but I'm not so familiar with that.
Cheers - Jim
• asked a question related to Normal Distribution
Question
If I need to find the correlation between a dependent variable that is not normally distributed and two independent variables, one normally distributed and the other not, which test is preferable: Pearson or Spearman?
Note: All my variables are continuous (scale), not ordinal.
Yes, it is possible to estimate Spearman's correlation (rank correlation). It gives you the monotone association (rather than the linear association) of the two variables.
But I don't see that septal wall thickness is not approximately normal:
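To illustrate the difference between the two coefficients, here is a small sketch with made-up data where the association is monotone but non-linear:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Made-up variables with a monotone but non-linear relationship:
# Spearman (rank) correlation captures it almost fully, while
# Pearson (linear) correlation understates it.
x = rng.normal(size=200)
y = np.exp(x) + rng.normal(scale=0.1, size=200)

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)
print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {r_spearman:.3f}")
```

Because Spearman works on ranks, it makes no distributional assumptions about either variable, which is why it is the usual suggestion when normality is in doubt.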
• asked a question related to Normal Distribution
Question
If, for example, I want to compare BMI between two groups, and when I use the Shapiro-Wilk test to check normality of BMI in each group, one group is normally distributed and the other is not, what test should be used to compare their means: the t-test or Mann-Whitney?
With the group that is not normally distributed: is it close to being normally distributed, or far from it?
• asked a question related to Normal Distribution
Question
If, for example, I want to compare BMI for two groups, and when I use the Shapiro-Wilk test to check normality of BMI in each group, one group is normally distributed and the other is not, what test should I use: the t-test or Mann-Whitney?
I think the dispositive questions are: How different are the distributions of the two groups? What hypothesis do you want to test? If the distributions of the two groups are quite different, you might ask whether comparing the means is the most meaningful approach for your situation.
• asked a question related to Normal Distribution
Question
I'm developing a set of firm-level risk and uncertainty indices. I have examined different density functions, for instance the PDF and CDF. In my case, most of the indices are normally distributed. Do you think they should ideally be normally distributed? Thanks in advance for your response.
Not necessarily!
How did you determine the index is normally distributed?
consider the discussion here:
Good luck,
Hamid
• asked a question related to Normal Distribution
Question
Hello everybody
I hope you can assist me. To complete my multiple regression analysis with a sample of 80, I examined my data for kurtosis and skewness as part of my normality analysis. Two of the variables showed skewness and kurtosis z-scores in the region of -2.20, which is somewhat beyond the conventional cutoff of ±1.96.
My questions are:
1. Should I conduct a log 10 transformation to correct a small skewness, or should I simply report it in the results section?
2. If the response is that I must log 10 the data, should I then utilise the modified log 10 data to conduct my regression analysis (instead of my original data without the log 10)? I believe this will not affect the total outcome.
3. In addition, I tested the log10-transformed variable: although it looked normally distributed, the Shapiro-Wilk and Kolmogorov-Smirnov tests revealed that the p-value was less than 0.05. Should I report this, or am I required to apply a non-parametric test to continue?
Thank you beforehand for your assistance.
Hello again Brian,
Should you opt for transforming one or more variables, I'd suggest reporting summary statistics for both original & transformed values (especially if DV was transformed). Do include correlations (transformed values) as part of summary stats.
However, for interpreting estimated regression coefficients, your discussion needs to make clear that you're talking about predicted score differences in the transformed metric(s), not the original.
• asked a question related to Normal Distribution
Question
I found that some people tend to use Welch T-test for analysis even if the results are not normally distributed. Should I use Welch T-test or the Mann-Whitney u test? The sample size is slightly different between the 2 groups. The sample size is around 30 in each group. The results from the 1 group are not normally distributed.
There are several confusions and misconceptions already in the question, so let me start with a small refresher to make these things clear:
The t-test tests a mean difference. It assumes that the values are sampled from a normal distribution. In Student's version, the standard error of the mean difference is estimated from the pooled variance of the observations in both groups; there is therefore the additional assumption that the population variances in both groups are identical. The Welch t-test calculates the standard error of the mean difference from the two individual sample standard deviations and uses a t-distribution with modified degrees of freedom. It therefore relaxes the assumption of equal variances. It has been shown that the t-test is quite robust against inhomogeneous variances, particularly when the sample sizes are similar. It is also very robust against non-normality, as long as the distributions are more or less symmetric. The robustness increases with increasing sample size (because it is the distribution of the mean difference that must eventually be normal, and the central limit theorem says that this distribution approximates the normal distribution increasingly well with increasing sample size). Skewed distributions are problematic because they strongly decrease the power of the t-test: the test still faithfully tests the mean difference, but it needs rather large sample sizes to "detect" differences (give small p-values).
The Mann-Whitney U-test tests stochastic equivalence and makes no assumptions about the distributions from which the values are sampled. Note that the tested hypothesis is not the same as that of the t-test. Only when additional assumptions are imposed on the distributions can the Mann-Whitney U-test effectively test a mean difference, e.g. that the distributions differ only by a location shift, which means, in particular, that they have the same variance! This is quite a strong assumption, and I have no practical example where it might be justified (happy if someone could point out some, though!). Also note that two distributions A and B may be constructed so that the mean difference A-B is positive while at the same time Pr(A>B) < 0.5. If the U-test is used as a "substitute" for the t-test, it would lead to the opposite interpretation ("B is kind of larger than A", although the data suffice to see that the expectation of B is smaller than the expectation of A).
So, to clear up the confusions:
- The t-tests are quite robust against violations of their assumptions.
- The robustness depends on the actual distributions and on the sample size.
- Violations do not invalidate the test, but limit its power.
- The U-test is generally NOT testing a mean difference. It's therefore generally not a "substitute" for the t-test.
and further:
- The assumptions are about the distributions of the variables (population distributions, if you like). But what you see is only a sample. A sample can be misleading (a sample from a symmetric distribution may be skewed and vice versa), and a sample may be too small to reveal any reliable information about the distribution it was sampled from. This means it is hard to justify distributional assumptions by looking at sample data.
Finally,
- No real variable has a distribution that would be described EXACTLY (fully correctly) by a distribution model (be it the normal distribution or some other distribution). Hence one can say that your variables are never normally distributed. But this is not the point. The point is whether the distributions can be described reasonably well by a normal distribution, and the meaning of "reasonably" depends also on the sample size.
Now, when you write "the results from the 1 group are not normally distributed", this implies that you measure two different variables in the two groups, and with this in mind your task is like comparing apples and peaches. It makes no sense.
Practically: how did you come to the conclusion that one variable is normally distributed and the other is not? It is likely that this procedure is already logically flawed. I presume that neither variable is normally distributed, and that the difference between the sample distribution and a normal distribution is just a bit more obvious in one sample. This is often the case when the variable is strictly positive and accordingly has a skewed distribution with a particular mean-variance relationship, and the mean of one sample is smaller.
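The point above, that the mean difference and Pr(A>B) can point in opposite directions, can be demonstrated directly. The distributions below are constructed purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Construct A and B so that E[A] > E[B] while Pr(A > B) < 0.5:
# A is usually 0 but occasionally very large; B is always 5.
a = rng.choice([0.0, 100.0], size=500, p=[0.9, 0.1])
b = np.full(500, 5.0)

print("mean A - mean B:", a.mean() - b.mean())  # positive
print("Pr(A > B):", np.mean(a > 5.0))           # well below 0.5

# The U-test ranks B above A most of the time -- the opposite of the
# mean comparison, illustrating why it is not a t-test substitute.
u, p = stats.mannwhitneyu(a, b, alternative="two-sided")
print("Mann-Whitney p:", p)
```

Here E[A] = 10 > E[B] = 5, yet A exceeds B only about 10% of the time, so the rank-based test and the mean comparison disagree about direction.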
• asked a question related to Normal Distribution
Question
I would like to analyze the p-value for 7 groups. The sample size is slightly different among the groups. The sample size is around 30 in each group. The results from the 3 groups are not normally distributed. Could I still use ANOVA for analysis?
The description "not normal" isn't very useful. ANOVA can be somewhat robust against deviations from normality, but if the underlying distributions of the group populations are very skewed or distinctively different, the results from ANOVA may not be very reliable.
I recommend thinking of ANOVA as a form of general linear model. See slide 8 here: https://staff.emu.edu.tr/adhammackieh/Documents/courses/ieng581/lecture-notes/ch03.pdf . You'll see it's the errors from the model that are assumed to be normally distributed. From this perspective, you can fit the model and then examine the residuals with a histogram or quantile-quantile plot.
Also with ANOVA, heteroscedasticity may be more of an issue. I like to plot the residuals vs. the predicted values to examine this. There are ways to address heteroscedasticity. In the one-way case or t-test, you can use Welch's anova or t-test.
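The residual-based check described above can be sketched like this. The data are simulated stand-ins for 7 groups of 30 observations each:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(3)

# Simulated stand-in for 7 groups of 30 observations each.
groups = np.repeat([f"g{i}" for i in range(7)], 30)
y = rng.normal(size=groups.size) + np.repeat(rng.normal(size=7), 30)
df = pd.DataFrame({"y": y, "group": groups})

# One-way ANOVA as a linear model; the normality assumption applies
# to these residuals, not to the raw group data.
fit = smf.ols("y ~ C(group)", data=df).fit()
resid = fit.resid

# Quantile-quantile coordinates for the residuals (plot them, or
# eyeball the fit correlation r, which is near 1 for normal residuals).
(osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm")
print(f"Q-Q fit correlation: {r:.3f}")
```

Plotting `resid` against `fit.fittedvalues` in the same way gives the heteroscedasticity check mentioned above.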
• asked a question related to Normal Distribution
Question
I have designed a framework containing 3 variables, 2 dependent and 1 independent. The constructs are reflective type. I need to test direct relationship as well as the possibility of existence of mediation and moderation relationship. There are two control variables as well. The sample size is 183 and the data is normally distributed. I am facing difficulty in deciding whether to use CB-SEM or PLS-SEM.
• asked a question related to Normal Distribution
Question
Hi,
My study sample size is 355, and after performing the Mahalanobis distance test, 7 of the 355 cases were identified as outliers. The initial reliability of each sub-scale ranged from 0.7 to 0.9, but after removing these 7 outliers the reliability of all sub-scales dropped drastically, to below 0.6 or even 0.5. Does this mean I cannot remove these outliers because they are naturally occurring? (My data set is not normally distributed.)
many thanks!!
In general, I would not expect such a large drop in reliability for removing 7 out of 355 observations. So, compare the correlation matrices from the full data set and the reduced data set.
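A sketch of that comparison, with simulated data standing in for the real item scores. The chi-square cutoff used for flagging cases is one common convention, not the only one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Illustrative multivariate data standing in for 355 cases x 4 items.
n, k = 355, 4
x = rng.multivariate_normal(np.zeros(k), np.eye(k) * 0.5 + 0.5, size=n)

# Squared Mahalanobis distances against a chi-square cutoff.
mu = x.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(x, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", x - mu, cov_inv, x - mu)
cutoff = stats.chi2.ppf(0.999, df=k)
keep = d2 <= cutoff

# Compare the inter-item correlation matrices with and without the
# flagged cases; a large change here would explain a drop in alpha.
r_full = np.corrcoef(x, rowvar=False)
r_trim = np.corrcoef(x[keep], rowvar=False)
print("max |change in r|:", np.abs(r_full - r_trim).max())
```

If the maximum change in the inter-item correlations is large, the "outliers" were carrying much of the covariance that reliability coefficients depend on.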
• asked a question related to Normal Distribution
Question
Dear all,
I have a questionnaire with 20 questions. Average score of question 1,6,7,9 (say FACTOR-1) and 2,3,5,8,10 (say FACTOR-2) are taken. The two factors are not normally distributed.
What non-parametric techniques mixed effect models are possible for this situation?
Thank you
Aligned ranks transformation anova may work for your situation. Maybe also ordinal regression or quantile regression. Or an appropriate generalized linear model.
• asked a question related to Normal Distribution
Question
My design is a pre-post control group design. Due to the small sample size (16+16) and the absence of normality in the pre-test scores, I performed a Wilcoxon signed-rank test to test the differences between pre- and post-test scores for both groups. However, I found that the post-test data are normally distributed for both groups. My question is:
In this case, in order to test the difference between groups, should I perform Mann–Whitney U Test or Independent- T test?
If I understood correctly, you have two groups, and for each group you have pre and post data. In this case, the distribution of pre- or post-data is irrelevant. Only the distribution of post minus pre is relevant, and here you should also separate the two groups.
PS: I don't know how you came to the conclusion that some scores are normally distributed and others are not. Often, normality tests are used in this context, which is rather stupid. Such a test will sometimes allow you to reject the tested hypothesis (that the sample is from a normally distributed RV), and sometimes it won't. If it does, you don't know whether the difference from the normal distribution is of any relevance; and if it does not, you don't know whether there are relevant differences that the sample size is too small to recognize. It's better to make an informed judgement about which hypotheses make sense and which are clearly nonsensical.
• asked a question related to Normal Distribution
Question
I have one environmental risk score for psychosis (ERS) value for each participant in my database. The scores vary from -4.5 to 10. And the data is not normally distributed.
I would like to ask whether there is a proper way to transform these scores into positive values.
After transforming these values into positive values, is there any suggestion to have a normal distribution of the data? I thought after having the positive values, I could do a log transformation to have a more normal distribution. However, I am open to all your suggestions.
Laxman Singh Bisht Thank you for the reply!
Since my database included negative numbers, I added the absolute value of the most negative number, -4.5, to all the values, so they became zero or positive. I then applied a Box-Cox transformation. After checking normality with the Shapiro-Wilk test, the ERS variable was normally distributed.
However, I am not sure whether it is valid to manipulate the data like that, since we want to publish an article with the ERS data.
Jochen Wilhelm Thank you for the reply!
Actually, my professor asked me to find a way to make the current data more normally distributed in order to do a regression analysis on twin data. And since there are negative values in the data, I believe it is not possible to log-transform them directly. That's why I am looking for a suitable transformation.
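For what it's worth, the shift-plus-Box-Cox route discussed in this thread can be sketched as follows, using simulated skewed scores in roughly the ERS range. Note that Box-Cox needs strictly positive input, so the shift should slightly exceed |min| rather than equal it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical skewed scores roughly on the ERS range (-4.5 to 10);
# these are simulated, not the actual ERS data.
ers = rng.gamma(shape=2.0, scale=2.0, size=282) - 4.5

# Box-Cox requires strictly positive input, so shift by the minimum
# plus a small constant (adding exactly |min| would leave a zero).
shifted = ers - ers.min() + 0.01
transformed, lam = stats.boxcox(shifted)

print(f"estimated lambda: {lam:.2f}")
print(f"skew before: {stats.skew(ers):.2f}, after: {stats.skew(transformed):.2f}")
```

Whatever the analysis, the shift constant and lambda should be reported so that readers can map results back to the original scale.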
• asked a question related to Normal Distribution
Question
I am currently writing my thesis on the effects of prolonged handling on the behavior and physiology of mice, by comparing a handled and non-handled group. I have worked out the duration spent per behaviors observed from four handled individuals and four non-handled individuals. I discovered that the data values from certain individuals of each group are not normally distributed and data values from other individuals are. Can anyone give me advice on which statistical test I can use to determine if there is significance between the duration spent per behavior from all of the individuals between both groups? I'd also like to test for significance between duration spent per behavior from only handled individuals and non-handled individuals. Hope this makes sense.
I'm not very clued up on statistics.
A few comments. First, it doesn't matter whether each individual mouse's data is normally distributed, but whether the residuals from the entire group (handled, nonhandled) approximate a normal distribution. Second, the issue is not whether the data are normally distributed, but whether the residuals are (roughly) normally distributed. I recommend generating Q-Q plots of the residuals to inspect this assumption if you have not already done so. Third, duration (i.e., time) data are rarely normally distributed, so tests based on a normally-distributed model may not be appropriate. In most cases, time data are best modeled by a gamma distribution. Fourth, before transforming the data, you need to consider whether you're interested in making absolute time comparisons between the groups or only relative comparisons. If you're interested in absolute time differences, do NOT transform the data. If you're only interested in making relative comparisons, transforming the data should not be a problem.
• asked a question related to Normal Distribution
Question
Let X be a random variable which follows a distribution, say S, with parameters a, b and c. Knowing, or assuming, that a, b and c are independent of one another, which is the more reasonable approach?
a) Is it okay to form the joint Jeffreys prior as the product of three individual Jeffreys priors (for a, b and c)?
b) Or is it better to use the square root of the determinant of the 3x3 Fisher information matrix?
Even for a normal distribution with two unknown parameters, the joint Jeffreys prior based on the product of two individual Jeffreys priors (for mu and sigma) yields a slightly different prior than the one based on the determinant of the 2x2 Fisher information matrix.
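For the two-parameter normal case mentioned above, the discrepancy can be written out explicitly. With X ~ N(mu, sigma^2), the per-observation Fisher information is

```latex
\[
I(\mu,\sigma) =
\begin{pmatrix}
1/\sigma^{2} & 0\\
0 & 2/\sigma^{2}
\end{pmatrix}
\]
so the joint Jeffreys (determinant-rule) prior is
\[
\pi_J(\mu,\sigma) \propto \sqrt{\det I(\mu,\sigma)} \propto \frac{1}{\sigma^{2}},
\]
while the independence (product) construction gives
\[
\pi(\mu) \propto 1, \qquad \pi(\sigma) \propto \frac{1}{\sigma},
\qquad \pi(\mu)\,\pi(\sigma) \propto \frac{1}{\sigma}.
\]
```

The two priors differ by a factor of 1/sigma, which is the "slightly different prior" referred to in the question.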
Jeffreys prior (also called the Jeffreys-rule prior), named after the English mathematician Sir Harold Jeffreys, is used in Bayesian parameter estimation. It is an uninformative prior, which means that it conveys only vague information about probabilities.
(From "Jeffreys Prior / Jeffreys Rule Prior: Simple Definition", Statistics How To)
• asked a question related to Normal Distribution
Question
My study design is quasi-experimental. Basically, I am trying to explore the effect of an intervention on two groups of participants (control vs experiment) over two times (pre- and post-test), and the effect of this intervention on three dependent variables (self-efficacy, motivation and self-regulation). All three DVs were measured using surveys of 5 and 6-point Likert scales. So, the data here are ordinal. However, the skewness and kurtosis fall between normal and acceptable ranges, i.e., ±2 for skewness and kurtosis. The data are normally distributed as indicated by the approximately small range of skewness and kurtosis values. My sample size is = 124. Some scholars suggested using repeated measures ANOVA in my case, but how to carry out ANOVAs when my DVs are ordinal? I still believe I should be running Wilcoxon signed-rank test instead.
Any suggestions?
Jim McLean has provided some very good advice, and David Morgan has raised a good point about individual items vs. summated scales... You might also consider a resampling/bootstrapping approach for "back up."
• asked a question related to Normal Distribution
Question
I have 6 faecal samples which are divided into 3 groups. The Shannon indices are: group 1, 2.20 and 2.91; group 2, 2.52 and 2.78; group 3, 2.28 and 3.30. The Simpson indices are: group 1, 0.798 and 0.878; group 2, 0.841 and 0.851; group 3, 0.910 and 0.799. My question is: which would be the best statistical analysis to determine whether there is a significant difference in the species present in each group? I've seen web pages suggesting a one-way ANOVA, but that assumes normally distributed data, and the Shapiro-Wilk test indicates my data are not normally distributed.
It is widely believed, but unfortunately untrue, that we must check the data for normality before we do a statistical test. In fact it is the errors that must be normal. So we run the ANOVA to compare means and look at the residuals to see if they are normal. Looking at your data, the residuals may well be normal and homogeneous, even though the response variable (all 6 values) is not.
With only 6 values, the ANOVA will have little power to detect even large differences, and the means of your data differ little anyway. A result of "not significant" will then be due either to there being no real difference or to the test having little power to detect one. Another way to say this is that you will not be able to reject the null of no difference for either of two reasons: because there is no real difference, or because the sample size was too small to detect even a large one.
A non-parametric test such as Kruskal Wallis is of little help. It does not test differences in means. It gives a p-value that is not about the difference in means. So you can't use it to reject a null hypothesis of no difference between means.
I would recommend that you consider a larger sample size.
David Schneider
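Running the ANOVA on the six Shannon indices quoted in the question takes one line with scipy. This illustrates the power problem described above rather than settling the scientific question:

```python
from scipy import stats

# The six Shannon indices quoted in the question, two per group.
g1 = [2.20, 2.91]
g2 = [2.52, 2.78]
g3 = [2.28, 3.30]

f_stat, p_val = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.3f}, p = {p_val:.3f}")
# A large p-value here mostly reflects the tiny sample (2 per group),
# not evidence that the groups are truly equal.
```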
• asked a question related to Normal Distribution
Question
Dear all,
In my research, I hypothesize that a variable X (among others) will have an inverted U-shaped relationship with Y.
To make the residuals normally distributed I am using inverse transformation on Y.
Then, I perform a linear regression, adding X, a squared term for X, and other variables. Both X and X squared are significant, and the coefficient of X squared is positive, implying a U-shaped relationship.
Below is the scatterplot of X and Y (inverse). When I report this, do I show the scatter plot of X against the inverse of Y (the transformed variable) and conclude that, if the inverse has a U-shaped effect, the untransformed variable has an inverted U-shaped effect? Would that be legitimate?
Or should I show a scatter plot of X against the original Y?
I'm not sure I understand why you transformed Y. Perhaps I'm missing something, but leaving Y untransformed and simply including the quadratic term X^2 should give you what you are looking for (an inverted u-shaped relationship). If the residual non-normality is due to the quadratic effect, then including X^2 should account for this.
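A quick simulated sketch of that suggestion, fitting untransformed y on X and X² (the data-generating curve below is invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Toy data with a known inverted-U: y peaks at x = 5.
x = rng.uniform(0, 10, size=200)
y = -(x - 5) ** 2 + rng.normal(scale=2.0, size=200)
df = pd.DataFrame({"x": x, "y": y})

# Regress *untransformed* y on x and x^2; a significant negative
# coefficient on the squared term indicates the inverted U directly,
# with no inverse transformation of Y to reason through.
fit = smf.ols("y ~ x + I(x**2)", data=df).fit()
print(fit.params)
```

Reporting on the original scale this way also makes the scatterplot question moot: you plot X against the original Y with the fitted quadratic curve overlaid.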
• asked a question related to Normal Distribution
Question
My dataset is not very big (between 100 and 200 observations), and the residuals are not normally distributed, so:
1. Is there any other statistical method similar to multiple linear regression but suitable for this case?
2. If not, what can be the solution?
Thank you
Look at the severity of the violation. It may well be negligible: with 100-200 observations, the central limit theorem usually makes mild non-normality of the residuals unimportant for inference about the coefficients.
• asked a question related to Normal Distribution
Question
I am running an experiment to compare the difference between control and intervention groups with sample sizes of 5 and 4. From the literature, the smallest sample size for which an independent-samples t-test can be used is 4. However, one of the major assumptions of the t-test is that the data from each group are normally distributed, and normality cannot be assessed on such a small sample because the dispersion of the data cannot be estimated adequately.
Can I continue using the independent t-test, or should I consider a non-parametric test? I had already run 70% of the experiment before the normality assumption came to mind.
The assumption should be sensible and reasonable. Judging whether this is the case requires understanding the characteristics of the variable you are measuring. If you don't understand your variable, you possibly should not test hypotheses at all, as you will not be able to interpret the tested hypotheses anyway. If you have some justifiable reason to expect that the variable has an approximately normal distribution, but you are not sure whether this is sufficiently correct, the assumption is quite safe or conservative. The only problem may be that the data remain inconclusive, which is your personal problem but not a scientific one.
That brings me to using non-parametric tests: usually, these tests address a different hypothesis, and again you should be clear about which hypothesis you actually want to test. It is not useful to test some weird or misunderstood hypothesis just because one knows how to test it. There should be a scientific rationale behind your choice (a feasibility argument is not a scientific rationale). Beyond this, your tiny sample size prohibits non-parametric analyses anyway: there is not enough information in the data to reliably estimate the sampling distribution, and non-parametric tests will have almost no power (again: power to test a hypothesis that may not be the one you actually want to test!).
• asked a question related to Normal Distribution
Question
Test normality
Please remember that most people who do normality tests get the interpretation wrong: for these tests, large p-values favor the normality hypothesis, not small p-values. See the Wikipedia article for your test. Best wishes, David Booth
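A short illustration of this interpretation with scipy, comparing one sample drawn from a normal distribution with one from a clearly skewed distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

normal_sample = rng.normal(size=100)
skewed_sample = rng.exponential(size=100)

w_n, p_n = stats.shapiro(normal_sample)
w_s, p_s = stats.shapiro(skewed_sample)

# Large p: no evidence AGAINST normality (it does not prove normality).
# Small p: evidence against normality.
print(f"normal sample: W = {w_n:.3f}, p = {p_n:.4f}")
print(f"skewed sample: W = {w_s:.3f}, p = {p_s:.4f}")
```

The skewed sample produces a very small p-value, i.e. evidence against normality; a large p-value for the other sample is merely an absence of such evidence.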
• asked a question related to Normal Distribution
Question
As is known, the coefficient of variation is the ratio between the standard deviation and the mean. What is the coefficient of variation if the mean is zero? What about the Normal(0, 1) distribution? Does that mean this normal distribution has no variation?
"In the case of the coefficient of variation, we measure the standard deviation (a non-zero value) against something that can be zero."
No, this is not true. The CV is not a useful measure in this case and no sensible person would use it in such instances.
The CV is helpful in cases where you measure something positive (length, mass, intensity, concentration, activity etc) with different methods/principles so that it is reported in different units. In such cases, the CV serves as a measure for the relative precision that can be compared between the methods/principles.
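A small sketch of that use of the CV, with made-up length measurements of the same objects recorded in two different units:

```python
import numpy as np

# Hypothetical lengths of the same five objects, in centimeters and inches
lengths_cm = np.array([10.2, 11.5, 9.8, 10.9, 10.4])
lengths_inch = lengths_cm / 2.54

def cv(x):
    """Coefficient of variation: sample SD relative to the mean."""
    return np.std(x, ddof=1) / np.mean(x)

# The CV is unit-free, so relative precision is comparable across
# methods that report in different units
print(cv(lengths_cm), cv(lengths_inch))
```

Both calls print the same value, which is exactly what makes the CV useful for comparing methods — and exactly what breaks down when the mean can be zero or negative.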
• asked a question related to Normal Distribution
Question
In case of not, what is the non parametric test could be used?
Try the Aligned Rank Transform (ART) method via the R package ARTool, proposed by Wobbrock et al. (2011), to prepare data for a non-parametric ANOVA. There is also a non-R-based version.
Wobbrock, J. O., Findlater, L., Gergle, D., & Higgins, J. J. (2011, May). The aligned rank transform for nonparametric factorial analyses using only ANOVA procedures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 143-146).
• asked a question related to Normal Distribution
Question
My null hypothesis is that "there is no significant relationship between socioeconomic characteristics of the victims and their perception of treatment by police." Perception of treatment by police consists of three statements and is measured using a 5-point Likert scale. I opted to perform ANOVA and tested for normality, but the data failed to achieve normality. Please suggest a suitable test to find an association when the data are not normally distributed.
The first thing you have to decide is whether you are okay with treating the Perception scale responses as a continuous variable or would rather treat them as an ordinal variable.
a) If you have values from 3 to 15 (13 unique values), it might be okay to treat the response as a quasi-continuous variable. In that case, you might try a regular Ordinary Least Squares (OLS) general linear model. This is the same as anova or multiple regression, except that you can include whatever terms and interactions you want on the right hand side. ... This model has assumptions, that you should understand. In particular for the normality assumption, understand that the model assumes a response variable that represents a population that is conditionally normal, not that the data itself is normal. One way to assess this is to fit the model and then look at the residuals from the model.
a2) A non-parametric approach like aligned ranks transformation anova may work for what you need.
b) If you want to treat your response variable as ordinal in nature, ordinal regression will likely work. This is a good idea especially if you have maybe 10 or fewer unique response values. (Like if all responses were, say, 9, 10, 11, 12, 13, 14, or 15).
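Point (a) above — fit the model, then assess conditional normality through the residuals — can be sketched as follows (the quasi-continuous Likert-sum responses and the binary predictor are simulated purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical Likert-sum response (range 3-15) and a binary group predictor
group = rng.integers(0, 2, size=120)
score = np.clip(np.round(rng.normal(9 + 1.5 * group, 2)), 3, 15)

# OLS fit of score ~ intercept + group (equivalent to a one-way anova here)
X = np.column_stack([np.ones_like(group), group])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
residuals = score - X @ beta

# The normality assumption concerns these residuals, not the raw response;
# in practice a Q-Q plot of the residuals is more informative than a test
w, p = stats.shapiro(residuals)
print(beta, p)
```

The key point is that `score` itself need not look normal; what matters is the distribution of `residuals` after the model terms are accounted for.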
• asked a question related to Normal Distribution
Question
One dependent variable (continuous) ~ two continuous and two categorical (nominal) independent variables
I'm seeking the best method for predicting a data set with more than 100 sites. None of the continuous variables is normally distributed.
Beyond the scarcity of information, are you sure of the relationship between variables?
• asked a question related to Normal Distribution
Question
I have samples, and in my study we are going to compare 2 or 3 of them.
If I am going to compare 2 groups and one of the samples has a normal distribution while the other doesn't, should I run a parametric or a nonparametric test?
The same for 3 samples: if 2 samples are normally distributed and 1 isn't (or 1 isn't normally distributed and the other 2 are), should I run a parametric or a non-parametric test?
Thank you.
In general, the t-test (and equivalently ANOVA) is relatively robust to non-normality, especially with large sample sizes. So, if you have a small sample I would use non-parametric tests, but otherwise I would analyze the data with the parametric approach.
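A quick simulation sketch of that robustness claim: two moderately sized samples drawn from a clearly skewed (exponential) distribution under a true null, where the t-test still rejects at roughly its nominal 5% rate (the distribution and sample sizes here are illustrative choices, not a general proof):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, n = 2000, 50
rejections = 0
for _ in range(n_sim):
    # Two skewed (exponential) samples with identical means: H0 is true
    a = rng.exponential(scale=1.0, size=n)
    b = rng.exponential(scale=1.0, size=n)
    _, p = stats.ttest_ind(a, b)
    rejections += p < 0.05

rate = rejections / n_sim
print(rate)  # close to the nominal 0.05 despite the non-normality
```

With much smaller n (say 5 per group), the same simulation drifts further from the nominal rate, which is why small samples are the problematic case.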
• asked a question related to Normal Distribution
Question
For my data analysis, I used the Kruskal-Wallis test because there is no variance homogeneity and no normal distribution. As a post hoc test, I used the Games-Howell test with Tukey's p-value. Can I now calculate the effect size, which JASP does not show me for the individual post-hoc tests, simply by hand from my descriptive data?
Games-Howell test is the sound choice since it is designed for situations in which the variances are unequal. Concerning the effect size calculation, you could consider free online effect size calculators such as the one referenced below.
Uanhoro, J. O. (2017). Effect Size Calculators. https://effect-size-calculator.herokuapp.com/
Good luck,
• asked a question related to Normal Distribution
Question
Any help with what is probably not a very complex question: I am unsure as to what statistical test to use. I am examining the relationship between a categorical variable (tree species) with 5 levels (5 different species) and two categorical explanatory variables, size (small, medium, large) and orientation (N, S, E, W). The data do not have a normal distribution. What test should I use? A multinomial regression?
Looking for the association among three categorical variables is generally done with log-linear modeling, though you seem to want to predict species from the other two variables. There are methods around for plotting these associations, but one of the more straight-forward would be to combine what you are calling your explanatory variables into one with 12 categories and do a biplot from a correspondence analysis. Details will be covered in most categorical data analysis books and some general intro to stats books.
• asked a question related to Normal Distribution
Question
Shapiro-Wilk p-value is 0.0072, W = 0.8844. Data points: 26. Table data for W: n = 26, W = 0.891 at α = 0.01; W = 0.92 at α = 0.05. Is the distribution normal in my case?
With almost certainty: it is not normal (unless it is a synthetic variable, and even then it may be doubtful, depending on the RNG used).
Note 1: the test cannot give any indication or evidence for the assumption that the tested distribution is normal. It only checks if the given sample size is already sufficient to reject this hypothesis. Failing to reject only means that the sample size is too small.
Note 2: I bet that answering the question "are my data sampled from a normal distribution?" is not what you need. I guess that the correctly formulated question is rather: "are my data sampled from a distribution that is sufficiently similar to a normal distribution to warrant conclusions from analysis methods assuming that the data are samples from a normal distribution?". This is a considerably different question, and this cannot be answered by "Normality test" like Shapiro-Wilk etc.
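Note 1 can be illustrated directly: sampling the same non-normal (but bell-shaped) distribution at two sample sizes shows that rejection is largely a matter of n (the t-distribution and the sample sizes here are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Draws from a t-distribution with 5 df: bell-shaped, but not normal
small = rng.standard_t(df=5, size=30)
large = rng.standard_t(df=5, size=5000)

p_small = stats.shapiro(small).pvalue
p_large = stats.shapiro(large).pvalue

# With only 30 points the test will often fail to reject; with 5000 points
# from the very same distribution it rejects decisively
print(p_small, p_large)
```

The distribution did not change between the two tests — only the sample size did, which is why "failing to reject" says more about n than about normality.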
• asked a question related to Normal Distribution
Question
I would like to run a Monte Carlo simulation for stocks. Since stock returns are not normally distributed, I am wondering what the best distribution function is.
What about Weibull, Fréchet, Gumbel, Rossi, etc.?
My biggest concern is to incorporate the difference between median and mean. My data are:
Mean: 10.67%, Median: 14.77%, Standard deviation: 17.41%
For the highest returns, I would recommend the Generalized Extreme Value distribution. Our analysis of the Turkish Stock Exchange using this distribution showed that the highest returns follow the Fréchet distribution.
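Fitting a Generalized Extreme Value distribution to block maxima can be sketched as follows; the simulated "returns" and their parameters are purely illustrative, not the Turkish Stock Exchange data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical block-maximum returns, simulated here from a Gumbel law
maxima = rng.gumbel(loc=0.10, scale=0.05, size=500)

# Fit the Generalized Extreme Value distribution; in scipy's convention
# c ~ 0 is the Gumbel case and c < 0 is the heavy-tailed Frechet case
c, loc, scale = stats.genextreme.fit(maxima)
print(c, loc, scale)
```

The sign of the fitted shape parameter is what distinguishes Fréchet-type (heavy-tailed) behavior from Gumbel or Weibull-type tails, so it is the first thing to inspect after fitting.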
• asked a question related to Normal Distribution
Question
Determining intervals for the common language effect size (CLES), probability of superiority (PS), Area Under the Curve (AUC) or Exceedance Probability (EP) is possible via multiple methods (Ruscio and Mullen, 2012). However, is this also possible via Fisher's z transformation? For simplicity I will call the "effect size" EP.
If we make the following assumptions: we have a (real) value that can range between -1 and 1, and we assume the error distribution is (approximately) normally distributed (also invoking the CLT), then we would be able to obtain intervals via Fisher's z transformation (I think?).
The rationale is: EP does not range from -1 to 1, but from 0 to 1, whereby 0.5 represents the null. However, it is possible to transform the EP to a value between -1 and 1 assuming a "directionality": < 0.5 is negative and >= 0.5 is positive. Then,
EPd = (EP - 0.5)*2
EPz = ln[ (1 + EPd)/(1 - EPd) ]*0.5 = atanh(EPd)
SE = √[ 1/(n1 - 3) + 1/(n2 - 3) ]
Lower = EPz - SE*1.96
Upper = EPz + SE*1.96
Transformation back to the original scale (EP) is possible for both positive and negative values:
EP = [ (exp(EPz*2) - 1)/(exp(EPz*2) + 1) ]/2 + 0.5 = tanh(EPz)/2 + 0.5 = [ 1 + tanh(EPz) ]/2
However, when comparing the analytical intervals to Monte Carlo (MC) simulations, the analytical intervals are much broader at smaller sample sizes, although the extreme bounds (the upper bound when EP < 0.5, or the lower bound when EP > 0.5) are comparable. Below an example of Ruscio and Mullen (2012) where n1 and n2 are both 15, and another example with the same mu and sd where nx and ny are both 150. The intervals by Ruscio and Mullen (2012) are also much smaller. The question is: why are these intervals broader? Is the rationale completely wrong, did I make a mistake, or is what I am doing simply impossible? I know there are other ways of obtaining the intervals, but using Fisher's z transformation would make it rather "elegant".
Hello Wim,
I think your scenario doesn't match well to the situation for which Fisher initially developed the z-transformation (r-to-z). The issue with the distribution of the Pearson correlation (r) is that it does not have a constant variance and the r values don't behave linearly very well (unlike r-squared). So, the z-transform is principally a variance-stabilizing measure, which does tend to yield a z-variate that conforms better to the normal distribution than does r.
But, you're presuming that the variate to be transformed has an underlying normal distribution; that's unlike the r-to-z situation. As well, I'm guessing that your intention is to convert some proportion of area (of non-overlap) to a z-score or some other standard score (not the z-transformation, above), which is also presuming a normal distribution for whatever attribute is under investigation, which may not be the case (and frequently is not in practice). Why not stick with proportions as your explanatory ES, as in:
1. CLES of McGraw & Wong: Proportion of times randomly selected case from batch A has score that exceeds that of randomly selected case from batch B?
2. Cliff's dominance statistic: CLES - proportion of times that B cases exceed A cases?
3. Vargha & Delaney's A statistic (similar to Cliff's d, though it also splits tied cases and assigns half to one batch and half to the other).
Each of these is exact. There is a pretty good closed form expression for the variance of d, if this is important.
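Of the exact, proportion-based effect sizes listed above, Vargha & Delaney's A is straightforward to compute directly; a minimal sketch (the two sample vectors are made up for illustration):

```python
import numpy as np

def vd_a(x, y):
    """Vargha-Delaney A: P(X > Y) + 0.5 * P(X == Y), computed exactly
    over all n1 * n2 pairs (ties split half-and-half)."""
    x, y = np.asarray(x), np.asarray(y)
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    return (greater + 0.5 * ties) / (len(x) * len(y))

a = vd_a([4, 5, 6, 7], [1, 2, 3, 6])
print(a)  # (13 + 0.5) / 16 = 0.84375
```

Because it is an exact count over pairs, no distributional assumption (and hence no transformation) is needed to compute the point estimate itself.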
• asked a question related to Normal Distribution
Question
I have two species of birds, for each species I measured 4 replicates of an enzyme for each individual. When I ran the tests for normality it came out that my data are not normally distributed so I cannot use the repeated measures ANOVA to test for differences between species. Transformations didn't work for my data. Which non parametric test should I use?
Thank you!
• asked a question related to Normal Distribution
Question
The data collected is on time to appearance of different developmental stages, of fish eggs in concentrations of toxicant. Five concentrations of toxicant and eight egg developmental stages were monitored (total of 40 cases). The data was ranked with the following results: the distribution of 15 variables were normally distributed (Shapiro-Wilk test) both before and after ranking using SPSS version 25 (37.5%). Twenty-one variables became normally distributed (52.5%), while two cases were not affected. I read somewhere that data normalized by ranking can be analyzed using parametric methods. Does this apply here? Thanks.
• Basically, whatever transformations you apply to your variables is allowed if it results in data that meet the assumptions of the model you want to use.
• But it may not be the most desirable approach.
• I suspect that you have some misconceptions about the assumptions of anova. I assume you are conducting a two-way anova. In this case, the model is a general linear model, and the assumption is that the errors are normally distributed. We can check this by looking at the residuals from the analysis.
• It's not very helpful to use a hypothesis test to test the assumptions of a model. It's better to look at plots of the residuals, or to understand the nature of the population that was sampled.
• asked a question related to Normal Distribution
Question
Effects size for repeated measures of not normally distributed data
• One thing, be sure you understand what are the assumptions of the model you are considering. For a general linear model, the assumption is that the errors are normally distributed, not the data. This could be assessed by looking at the residuals. It's not a good idea to use a hypothesis test to test the normality or homoscedasticity.
• A good place to start is with the nature of the dependent variable. For example, if it is count data, or ordinal data, or bound on the left by zero and right-skewed. Knowing this may suggest a generalized linear model, or other model that would be appropriate. This coincides with the suggestion by Raju Rimal , # 2.
• A flexible nonparametric approach, like aligned ranks transformation (ART) anova may work well. This coincides with the suggestion by Jos Feys .
• Friedman's test works only for data arranged in unreplicated complete block design. This is common, but won't work for other designs. There is also a Quade test, which is used in similar situations, that may be more desirable.
• Note that "effect size" and identifying "significant" effects are two distinct concepts.
• asked a question related to Normal Distribution
Question
The data I collected for my research yielded a non-normal distribution.
I aim to test a hypothetical model using SEM, and AMOS is said to be better for confirmatory research. However, I don't want an inflated model (since the data are not normally distributed).
Accordingly, I have the following questions:
1. Is SmartPLS a good fit for conducting SEM and path analyses, and is that more accurate than Amos for the data that are not normally distributed?
2. Moreover, is it better to use VB-SEM?
I asked the second question because VB-SEM is said to be more flexible regarding non-normality.
I sincerely thank the researchers who will answer these questions.
• asked a question related to Normal Distribution
Question
Recently I am trying to reproduce results from this paper Janich, P., Toufighi, K., Solanas, G., Luis, N. M., Minkwitz, S., Serrano, L., Lehner, B., & Benitah, S. A. (2013). Human Epidermal Stem Cell Function Is Regulated by Circadian Oscillations. Cell Stem Cell, 13(6), 745–753.
Here is the difficulty I met:
The authors performed a microarray analysis to detect gene expression in multiple overlapping time windows.
Take time window 1, for example: say there are 100 genes whose expression is measured at 5 h, 10 h, 15 h, and 20 h.
The authors then applied a quadratic regression model, "expression = a(time.point)^2 + b(time.point) + c", to determine whether these genes change periodically within each time window (time.point can be 5, 10, 15, or 20 in this example). If the coefficient "a" < 0 and the p-value for the coefficient "a" < 0.05, the gene is identified as a "peak gene"; otherwise, if the coefficient "a" > 0 and the p-value for the coefficient "a" < 0.05, it is labelled a "trough gene". But the problem is that the authors calculate the p-value with two methods. The first is based on the t distribution, i.e. pvalue = Pr(>|t|), and the R code would be:
summary(lm(expression ~ poly(time.point, 2, raw = TRUE)))$coefficients[3, 'Pr(>|t|)']
the other way is based on the normal distribution, i.e.
pvalue = 2 * pnorm(-abs(t.score))
(That means if |t.statistics| > 1.96, the pvalue is guaranteed to be < 0.05.)
The author chose the latter one as the final pvalue. But is it right to do so in this situation?
From what I have learned, the t-distribution should be preferred when the population standard deviation is unknown and the sample size is < 30 (for each regression model there are only 4 observations). Since the different p-values calculated by these two methods could greatly affect the final result and conclusion, could someone give me a detailed explanation? Any help would be appreciated!
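The gap between the two conventions is dramatic here because each regression leaves only 1 residual degree of freedom (4 time points minus 3 estimated coefficients), where the t-distribution is the Cauchy distribution. A quick numerical check (t.score = 2.5 is an arbitrary illustrative value):

```python
from scipy import stats

t_score = 2.5
df = 1  # 4 time points minus 3 quadratic coefficients

p_t = 2 * stats.t.sf(abs(t_score), df)    # t-based two-sided p-value
p_norm = 2 * stats.norm.sf(abs(t_score))  # normal-based two-sided p-value
print(p_t, p_norm)  # ~0.242 vs ~0.012
```

The same t-statistic is far from significant under the t reference distribution but comfortably "significant" under the normal one, which is exactly why the choice matters so much with 4 observations per model.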
To follow on from the comment by Salvatore S. Mangiafico: you might think about what difference this step makes in the eventual results.
In this small part of the procedure, you have a test statistic consisting of a ratio, where the divisor is an estimated standard deviation, and the two versions you mention either do or do not take into account the fact that you have an estimated standard deviation.
(i) there is a possible variant where you replace the estimated standard deviation (estimated locally in the sequence) by something obtained more broadly, and so less subject to error, and for which the normal approximation is better.
(ii) the cases where the test statistic might be most misleading arise when the estimated standard deviation is either unusually small or unusually large. So you might want to look at whether the actual pattern of data in such cases really do justify being counted as turning points. Perhaps the short section of the series is too smooth or too rough to support a conclusion.
(iii) you might consider an alternative way of assessing this, or any other test statistic, (getting a "p-value") which could be via a permutation test of some sort.
But, for the two versions you mention, if the later arguments of the overall procedure really only choose to look at smallest apparent p-values without using a formal threshold (such as 0.05), then there may be no contradiction between the lists of points selected as turning points (except perhaps in the numbers in the lists).
• asked a question related to Normal Distribution
Question
Hi,
so I am investigating the impact of sex on inspiratory muscle training as a recovery modality from long COVID. I have 30 participants (15 female), and I take 4 physiological readings and 12 psychosocial health readings at pre- and post-intervention. I was going to use a mixed-measures ANOVA, but on testing for normality I found that all my pre values are normally distributed while all of my post values are not. What test should I run?
You can look at the conditional distribution of the data. That is, run the model, and then look at the residuals from the analysis. ... There may be other appropriate approaches like generalized linear model or a nonparametric approach like aligned ranks anova.
• asked a question related to Normal Distribution
Question
Hi,
I have a question regarding of ddCT method of qPCR.
Various threads have pointed out that all statistics should be done with dCT or ddCT because dCT follows a normal distribution.
I understand that dCT in single cells follows a normal distribution, since gene expression in single cells often follows a log-normal distribution, as shown in various papers (i.e, Bengtsson et al. Genome Research (2005)).
However, if mRNA is extracted from a large number of cells (e.g., more than 1.0x10^6 cells), the mRNA solution is the sum of the mRNA extracted from a cell population whose gene expression follows a log-normal distribution. What happens in this case?
Does this mean that the amount of mRNA of interest between samples follows a normal distribution, according to the central limit theorem?
In this case, can dCT follow a normal distribution?
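The central-limit intuition in the question can be checked by simulation; the log-normal parameters and cell counts below are arbitrary assumptions, chosen only to illustrate the effect of pooling:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# Per-cell mRNA amounts ~ log-normal (hypothetical parameters);
# each "extraction" pools 10,000 cells
n_cells, n_samples = 10_000, 200
per_cell = rng.lognormal(mean=1.0, sigma=0.8, size=(n_samples, n_cells))
pooled = per_cell.sum(axis=1)

# Per-cell amounts are strongly right-skewed, but by the central limit
# theorem the pooled totals across many cells are approximately normal
print(stats.skew(per_cell[0]), stats.skew(pooled))
```

The per-cell skewness is large while the skewness of the pooled totals is close to zero, consistent with the CLT argument in the question (for the total amount of mRNA; whether this carries through the log-scale Ct readout is a separate issue).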
Dr. Wilhelm,
• asked a question related to Normal Distribution
Question
Majority of our questions were in Likert scale (from 5 very frequent to 1 never), and we use a pretest-posttest methodology. To compare the pretest and the posttest, we wanted to use paired sample t-test. However, this is a parametric test wherein the data should be normally distributed.
I have also read in the work of Norman, G. (2010) that parametric statistics can still be used with Likert data even with non-normal distributions.
What would be the best option here? Should we proceed in using the paired sample t-test, or go for Wilcoxon tests since the data is not normal? Thank you for answering in advance.
Norman, G. (2010). Likert scales, levels of measurement and the “laws” of statistics. Advances in health sciences education, 15(5), 625-632.
Just go for normal. The means and standard errors will be close.
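A small simulation illustrates why the two approaches tend to agree on Likert-type pre/post data; the responses and the size of the shift below are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 40
# Hypothetical 5-point Likert responses with a modest upward shift at post
pre = rng.integers(1, 6, size=n)
post = np.clip(pre + rng.integers(0, 2, size=n), 1, 5)

t_stat, p_t = stats.ttest_rel(pre, post)     # paired-sample t-test
w_stat, p_w = stats.wilcoxon(pre, post)      # Wilcoxon signed-rank test
print(p_t, p_w)
```

For a genuine shift of this kind, both tests typically reach the same conclusion, which is in line with Norman's (2010) argument; reporting whichever you choose alongside descriptive statistics is usually more important than the parametric/non-parametric distinction itself.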
• asked a question related to Normal Distribution
Question
I'm trying to figure out what this means:
"the distribution of score for all malignant cells was fitted to a normal distribution and a threshold of p < 0.001 was used for distinguishing cycling cells."
I fitted my data to a normal distribution using R (fitdist comes from the fitdistrplus package):
library(fitdistrplus)
fit <- fitdist(data, "norm")
plot(fit)
However, I have no idea what p <0.001 means.
According to the paper's figure, the score should be around 0.8
However, p <0.001 would be like score of 3~4.
I don't get it.
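One way to read that sentence: fit a normal distribution to the scores and call a cell "cycling" when its score exceeds the point whose upper-tail probability under the fitted normal is 0.001. A minimal sketch with made-up parameters (mu and sigma are assumptions for illustration, not values from the paper):

```python
from scipy import stats

# Hypothetical parameters of the normal fitted to the malignant-cell scores
mu, sigma = 0.2, 0.2

# "p < 0.001" refers to the tail of the *fitted* distribution: the
# threshold is the score whose right-tail probability is 0.001
threshold = stats.norm.ppf(1 - 0.001, loc=mu, scale=sigma)
print(threshold)
```

Note that the corresponding standard-normal quantile is z ≈ 3.09, so a "score of 3~4" would only apply on the standardized scale; on the raw score scale the cutoff is mu + 3.09*sigma, which for parameters like these lands near 0.8, matching the paper's figure.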
A normal distribution means that the collected data are distributed in a bell shape: starting from the low measured values, the frequency increases until it reaches a peak and then decreases again. The normal distribution curve is symmetrical. The data can be tested using the Shapiro-Wilk test: if the test is not significant (p > 0.05), the data are consistent with a normal distribution, while if p < 0.05 (the test is significant), the distribution deviates from normality.
• asked a question related to Normal Distribution
Question
I am aware that a high degree of normality in the data is desirable when maximum likelihood (ML) is chosen as the extraction method in EFA and that the constraint of normality is less important if principal axis factoring (PAF) is used as the method of extraction.
However, we have a couple of items in which the data are highly skewed to the left (i.e., there are very few responses at the low end of the response continuum). Does that put the validity of our EFAs at risk even if we use PAF?
This is a salient issue in some current research I'm involved in because the two items are among a very small number of items that we would like, if possible, to load on one of our anticipated factors.
Christian and Ali, thanks for your posts. Appreciated. I'll follow up on both of them.
Robert
• asked a question related to Normal Distribution
Question