# Multivariate Statistics - Science topic

Multivariate statistics is a form of statistics encompassing the simultaneous observation and analysis of more than one outcome variable. The application of multivariate statistics is multivariate analysis.
Questions related to Multivariate Statistics
Question
Hello
I have been in constant discussions with my friends and colleagues in recent years. In my experience, I generally use multivariate statistics because most data sets do not meet the assumptions of classical frequentist statistics. However, I know some people who use univariate and Bayesian methods to answer the same hypothesis questions. So my question is: what would be the most appropriate way to answer our research questions?
Question
For example, I want to explore the relationships between soil nematode communities and microbial communities under different treatments, including the relationships between their functional and taxonomic compositions. Nematodes and microbes belong to different trophic levels: bacterivore nematodes feed on bacteria, fungivore nematodes feed on fungi, herbivore nematodes feed on plant roots, and omnivore nematodes prey on bacterivore, fungivore, and herbivore nematodes. In short, which statistical tools are suitable for analysing relationships in a complex soil food web?
You can use orthogonal partial least squares discriminant analysis (OPLS-DA), a supervised multivariate regression method for identifying what discriminates between different datasets, or some kind of machine learning algorithm. Here are some manuscripts:
Question
I ran an MCA using FactoMineR. I know how to interpret cos2, contributions and coordinates, but I don't know how the v.test values should be interpreted.
Thank you
(An absolute v.test value above 1.96 corresponds to a two-sided p-value below 0.05.)
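The correspondence between a v.test value and a p-value can be checked numerically. The sketch below assumes the v.test is approximately standard normal under the null hypothesis and that a two-sided test is intended; the function name `two_sided_p` is illustrative and is not part of FactoMineR.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2))

p_at_threshold = two_sided_p(1.96)   # just under 0.05
p_strong = two_sided_p(-3.2)         # the sign of z does not change the p-value
```

Under this reading, the magnitude of the v.test carries the significance, while its sign only tells on which side of the origin the category lies.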
Question
I want to investigate the relationship between differences in coral physiological variables based on Euclidean distances and seawater environmental variables using DISTLM and dbRDA in PRIMER, but I am not sure if this analysis is suitable given the lack of replication I have in my predictor variable (environmental) matrix.
I have attached an excel file illustrating the structure of my data set (the response and predictor variables). Briefly, I have a multivariate data set of measured physiological variables (e.g. lipid concentration, protein concentration, tissue biomass etc.) for corals collected from five different locations (A-E), where each site is very unique in its seawater physico-chemical parameters. I collected 12 corals per site (total of 60 samples). I have constructed a resemblance matrix of the physiological data in PRIMER based on Euclidean distances, and there is clear grouping of data points in the NMDS, which coincides with the different collection sites for each coral. I want to investigate the proportion of the observed variation in the multivariate data cloud that can be explained by the environmental characteristics of each collection site (e.g. mean annual sea surface temperature, seawater chlorophyll concentration, salinity etc.). However, the dataset of environmental variables does not have replication. i.e. for each site (A-E), I only have one value for mean annual sea surface temp, one value of salinity etc.
All of the case-study examples I have read about distance-based redundancy analysis in R or PRIMER have two resemblance matrices (predictor and response) both of which have replication. However, in my case, my response variables have replication (i.e. 12 samples per site), whereas my environmental variables do not have replication (i.e. one measurement per variable per site).
Can someone advise me whether or not dbRDA is suitable in this instance? If as I predict, it is not suitable, can you recommend a better approach? I am not an expert in multivariate statistics, but I want to make sure that the approach I take is sound.
Any and all advice is welcome. Thanks
Hi Rowan, I am in a similar situation. What I did was use an average of the response variables, but I do not know whether that is the optimal solution. Did you solve this riddle in the end?
Question
It is possible to learn a standard SVM in a kernel space. But is it possible to do the same with L1 regularization?
Yes exactly. You can check the article produced by Yuesheng Xu.
Question
Discriminant analysis has the assumption of normal distributions, homogeneity of variances, and correlations between means and variances. If those assumptions are not fulfilled, is there any non-parametric method that can be used as a "substitute" for discriminant analysis?
Clustering or classification methods
Question
Hi everyone
I have a dependent categorical variable of three levels, corresponding to three sectors of activity in the agricultural field, A, B and C and within each of them there are sub-levels, for example under A there are four sub-levels a1, a2, a3 and a4.
Over time, for more profit, farmers change their activity, this change is subject to several factors like demands, financial support, etc. (we are talking here about independent, quantitative and categorical variables).
For example, a farmer who practiced activity A may change to activity B, i.e. he has completely changed the activity sector, or he may change only to another subsector, for example moving from “a2” to “a1” (the same applies to other farmers).
Is there a statistical technique that can be used to model these changes?
Hi, maybe you should consider a Markov (or semi-Markov) chain model for your data. Parametrization could be tricky, as it may lead to numerous parameters, but some parametrizations are more parsimonious than a direct one.
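As a starting point for the Markov chain idea, the transition probabilities between activities can be estimated by counting observed moves. This is a minimal sketch assuming a first-order chain without covariates; the function name and the activity histories are made up for illustration.

```python
from collections import Counter, defaultdict

def transition_matrix(sequences):
    """Estimate first-order transition probabilities from state sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each observed move from one period to the next.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    # Normalise each row of counts into probabilities.
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }

# Hypothetical activity histories, one list per farmer (made-up data).
histories = [
    ["a2", "a1", "a1", "B"],
    ["a2", "a2", "a1"],
    ["B", "B", "C"],
]
P = transition_matrix(histories)
```

Here `P["a2"]["a1"]` is the estimated probability of moving from subsector a2 to a1 in one period. Letting these probabilities depend on factors such as demand or financial support would require a regression model for the transitions, which this sketch does not attempt.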
Question
I have a dataset of plant species recorded as presence/absence (1/0) in 133 samples within an archaeological area. Every sample has a particular substrate, a position in the monument, and the monument it was taken from (substrate, position and monument are recorded as words, not numbers; e.g. for substrate I have green rock or black rock). My aim is to see whether some species or groups of species are more associated with certain substrates, positions or monuments, or whether there is any other pattern. What method would you recommend I try? Thank you very much in advance!
You need to apply binary models then. Logit and Probit models are good in this case.
Besides, you can apply some non-parametric techniques that may be good to compare between the groups you have. For example, you can apply kernel densities. This helps you to compare microsites.
You can also categorize your microsites into ordered classes such as small, medium and large, and then apply an ordered logit model.
Question
I'm trying to build up a model to use 'chemical data' to predict 'sensory evaluation data'.
I've got multiple sensory data blocks (Matrix a1, a2, a3, ..., an) and one chemical data block (Matrix b). Now I'm trying to build up a chemical parameters - sensory evaluation model.
Are there any multivariate statistical methods that I can use? As far as I know, L-PLS can only deal with 3 matrices. Or are there any machine-learning methods you'd recommend?
Many thanks!
Question
Hi!
I have 3 IVs (metric):
1. affect sensitivity (subscales: positive affect, negative affect)
2. affect regulation (self-relaxation, self-motivation)
3. stress (threat, demand)
and I want to measure 3 DVs (metric):
a. self-access
b. well being
c. symptoms
Moreover, I want to measure a moderator effect (culture, nominal).
How can I measure both multiple regression (for all IVs and each subscales) and moderator analysis for my case?
Very interesting question, worth following.
Question
It is known that PCA maximizes the total variance explained. Let's say it hits a total variance of 51%. Yet, doing a factor analysis instead would reduce such a total variance to 30% (naturally, since FA looks at the common variance only).
In such a case, is calculating the AVE (convergent validity) based on the FA loadings irrelevant since one is only going to use the PCA loadings/variance/scores in the first place and in subsequent analysis? If yes, can one conclude that his 'AVEs' based on PCA are > 0.5? The goal is just data reduction for multiple regression and not (C)FA or any other SEM modeling.
The average variance extracted has often been used to assess discriminant validity based on the following "rule of thumb": Based on the corrected correlations from the CFA model, the AVE of each of the latent constructs should be higher than the highest squared correlation with any other latent variable.
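The rule of thumb above can be written out as a small check. The sketch assumes standardized loadings, so that the AVE is the mean of the squared loadings; the construct names, loadings and correlation are made-up numbers, and the function names are illustrative.

```python
def ave(loadings):
    """Average variance extracted: mean of squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)

def fornell_larcker_ok(ave_by_construct, correlations):
    """Check that each AVE exceeds every squared correlation involving that construct."""
    for (a, b), r in correlations.items():
        if r ** 2 >= ave_by_construct[a] or r ** 2 >= ave_by_construct[b]:
            return False
    return True

# Hypothetical standardized loadings and latent correlation (made-up numbers).
loadings = {"F1": [0.8, 0.7, 0.75], "F2": [0.6, 0.65, 0.7]}
aves = {name: ave(vals) for name, vals in loadings.items()}
passes = fornell_larcker_ok(aves, {("F1", "F2"): 0.55})
```

In this example F2 has an AVE below 0.5, so it would fail the usual convergent validity threshold even though the discriminant validity check passes.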
Question
I would like to use multivariate statistical methods to characterize the hydrogeochemical processes and controlling factors of groundwater quality in a semi-arid region of Algeria.
You can check these links :
Question
Please suggest the best software for developing a regression equation with more than five independent variables.
Can anyone guide me on how to work with R?
Question
I need to use multivariate statistical techniques to extract information about the similarities or dissimilarities between sampling sites, to identify the water quality variables responsible for spatial variations in groundwater quality, and to assess the influence of possible sources (natural and anthropogenic) on the water quality parameters.
There is no "best way". What matters is your problem, although several approaches exist, as already mentioned. The first PC is always important, and if you manage to interpret it you will have the essential or regional effects. The remaining PCs are more associated with local effects, and the number to use depends on how much of the original system they can explain, as well as on the knowledge you have of your problem.
Question
Hello,
I am using multivariate multiple regression for my master's thesis but I'm not sure if I am doing the analysis and reporting it in the right way. I have very limited time till the deadline to submit thesis. So any help is very much appreciated
I would be really glad if someone can recommend/send articles/dissertations using this analysis.
Yağmur
Question
How can I do mean separation for treatments arranged in factorial ways by SAS? Or could you tell me other software which can do this? Thank you
Dear Zabihollah
I do not understand what you want to do now. Do you have an example?
Question
I ran a repeated measures ANOVA for an intervention study. I have 3 intervention groups and 3 time points. The SPSS output showed no significant main effect of time and no significant group*time interaction in the Tests of Within-Subjects Effects table. When I checked Tests of Between-Subjects Effects, I also did not find a significant result for group. However, in the pairwise comparison tables with Bonferroni correction, there is a significant difference between two time points in my experimental group (one of my intervention groups). Also, the Multivariate Tests table indicated that Wilks' lambda is significant for the experimental group.
I got confused by these findings and looked up what people report in articles. In some papers, people reported Wilks' lambda, while in others they reported main and interaction effects. What would you recommend I do? Is there any rule of thumb?
Hello Başak,
I agree with Abjulmuhsin that you might be better served by connecting with a local statistician to assist with understanding your specific research questions and recommending the most suitable analysis and interpretation thereof.
That said, I can give you some general suggestions and observations based on what you present. However, without your research questions and all the relevant output, I'm somewhat at a disadvantage!
1. The no significant main effect of time (for the repeated measures analysis) indicates that combined scores across groups do not differ on average for Time 1 vs. Time 2 vs. Time 3.
2. The no significant main effect of group (for the repeated measures analysis) indicates that combined time scores (Time1 + Time 2 + Time 3) do not differ on average among groups (Group 1 vs. Group 2 vs. Group 3).
3. Note that these first two tests are seldom the most informative, since they ignore or collapse data across levels of the other factor.
4. The no significant group x time interaction indicates that the differences in times (e.g., Time 1 vs. Time 3) are consistent across groups. As well, the differences in groups (e.g., Group 2 vs. Group 3) are consistent across times. This test is usually the more informative one for one-between, one-within (aka, split-plot/mixed/RM anova). So, profiles of mean performance across the times do not differ by group.
5. The multivariate test (given when you run a RM analysis in SPSS, whether it makes sense to do so or not) evaluates a different hypothesis from those that the one-between, one-within (repeated measures) address. The multivariate (null) hypothesis is that there is no linear combination of measures on which the groups differ. In RM, the main effect of group evaluates differences for the summed time scores. In manova, the main effect of group evaluates equality of all possible linear combinations of the DVs (which, for your data, would be the time scores).
6. If time scores for your study are, in fact, the same measure/variable, then RM is arguably the better framework compared to manova.
7. I'm not clear as to what specific multivariate table you might be referring to, but if it's the default manova from within the RM analysis (Analyze/General Linear Model/Repeated measures...), then note #5 above is likely the explanation. I presume you meant to say, for the main effect of group, instead of "for the experimental group." If you in fact ran a separate manova with the three time scores for a single group, that tests the null hypothesis that the three DV (time) means are zero for that group. Usually, that test would not be very informative, unless scores were scaled such that zero was a meaningful value.
8. So you found one pairwise comparison to be significant for one of your three groups, is that correct? There are 9 such tests possible, and these are typically run as ordinary t-tests, so that the error terms are different from those used for the three general effects in 1B, 1W design. Hence, you could see what appears to be "discrepant" results.
9. For your two significant results: (a) a pairwise test of two times for one treatment group in the RM analysis; and (b) the manova for (apparently, as questioned in #7) a single group, are you certain that these tests are as you described? I'm asking about the single group aspect, not significance; the reason being that such tests do not come automatically from the analysis in spss, even if you ask for "post hoc" and pairwise comparisons on the within-subjects factor via EM Means... option.
Question
Say I am interested in examining individual differences in cognition and behavior and am interested in how specific survey scores and parameters predict/covary with performance on a task. How would I analyze this data based on the literature?
Are there conventional methods for analyzing differences in psychological phenomena across individuals? Is that exactly what uni/multivariate statistics is for? Or are there alternative methods? Is that where advanced statistics comes in?
Is it more compelling and/or informative to analyze individual differences in a single subject design, an aggregate model/submodel (GLM), or as a dynamical system?
What does the basic and current literature say? What papers or books explicitly discuss this?
Thanks,
JD
I recommend considering AI methods. More specifically, machine learning methods such as k-NN (k nearest neighbors) might bring deeper insight into your problem.
I recommend studying similar problems already solved with this and similar ML methods, and trying to understand how they could be implemented for your specific problem.
The best way to tackle the problem is to use Python and the relevant libraries, which already contain all the necessary ML methods. If you are new to the field, try to find someone around you who is familiar with Python programming. He or she can speed up your learning substantially.
Some review papers on this topic would be an excellent start.
Question
In your experience, how do you know, or what do you need to check, in order to declare econometric data bad (in the univariate and multivariate cases)?
No honestly obtained data is bad. My ideas are expressed in the attached. D. Booth
Question
Recently several measures for testing independence of multiple random variables (or vectors) have been developed. In particular, these allow the detection of dependencies also in the case of pairwise independent random variables, i.e., dependencies of higher order.
Thus, if you had a dataset that was considered uninteresting because no pairwise dependence was detected, it might be worth retesting it.
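A textbook illustration of dependence that no pairwise test can see is the XOR construction: two independent fair bits X and Y together with Z = X XOR Y. The sketch below is not taken from the package discussed here; it enumerates the four equally likely outcomes and checks both properties exactly.

```python
from itertools import product
from fractions import Fraction

# Two fair, independent bits X and Y; Z = X XOR Y. Each (x, y) pair has probability 1/4.
outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
prob = Fraction(1, len(outcomes))

def p(event):
    """Probability that `event` (a predicate on an outcome) holds."""
    return sum(prob for o in outcomes if event(o))

# Pairwise independence: P(A=1, B=1) == P(A=1) * P(B=1) for every pair of variables
# (for binary variables this single equality implies independence of the pair).
pairwise_independent = all(
    p(lambda o: o[i] == 1 and o[j] == 1) == p(lambda o: o[i] == 1) * p(lambda o: o[j] == 1)
    for i, j in [(0, 1), (0, 2), (1, 2)]
)

# Joint dependence: all three equal to 1 is impossible, yet the product of marginals is 1/8.
p_joint = p(lambda o: o == (1, 1, 1))
p_product = p(lambda o: o[0] == 1) * p(lambda o: o[1] == 1) * p(lambda o: o[2] == 1)
```

Every pair of variables looks independent, but the triple is not: `p_joint` is 0 while `p_product` is 1/8.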
If your data is provided in a matrix x where each column corresponds to a variable, then the following lines of R code perform such a test with a visualization.
install.packages("multivariance")
library(multivariance)
dependence.structure(x)
If the plot output is just separated circles (these represent the variables) then no dependence is detected. If you get some lines connecting the variables to clusters then dependence is detected, e.g.
dependence.structure(dep_struct_several_26_100)
dependence.structure(dep_struct_iterated_13_100)
dependence.structure(dep_struct_ring_15_100)
dependence.structure(dep_struct_star_9_100)
Depending on the number of samples and number of variables the algorithm might take some time, the above examples with up to 26 variables and 100 samples run quickly.
Due to publication bias, datasets are usually only published if some (pairwise) dependence is present. Thus there should be plenty of cases where data was considered uninteresting, but a test for higher order dependence shows dependencies. If you have such datasets, it would be great if you shared them.
Comments and replies - public and private - are welcome.
For those interested in a bit more theoretic background: arXiv:1712.06532
The underlying package has been updated (current version: 2.2.0). Based on the refined methods higher order dependencies can be detected in more cases and with better accuracy.
In particular, new features for the dependence structure detection are:
* The approximate probability of a type I error is provided.
* New option 'type': In particular, 'type = "pearson_approx"' provides a fast and approximately sharp detection, in contrast to the original 'type="conservative"' which is still faster but much more conservative.
E.g.
dependence.structure(x, type = "pearson_approx")
* New option 'structure.type': The original algorithm clusters dependent variables and treats them thereafter as one variable. This is still the default option 'structure.type = "clustered"'. But in the case of many variables this can cluster variables which are only indirectly dependent via some other variables. In contrast the new option 'structure.type = "full"' treats always each variable separately and detects dependence for all tuples which are lower order independent. E.g.
dependence.structure(x, type = "pearson_approx", structure.type = "full")
Based on this, many datasets feature higher order dependence. I am looking forward to hearing from field experts who can also provide an explanation within their subject for the occurrence of these higher order dependencies.
Question
I have data on the total dissolved solids of apples as references (y-variable).
I also have near-infrared spectra as predictors (x-variables).
I have the StatSoft Statistica software for the analysis.
Various software packages can be used to build an ANN predictive model from NIR spectra. My suggestions are Unscrambler and IBM Modeler. Before ANN modeling, PCA should be done to reduce the large number of spectral variables to PC1, PC2, and so on. After calibrating the ANN, you have the values predicted by the calibrated model as well as the reference values for each sample. Now use the RPD index to assess the goodness of your calibrated model.
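The RPD step can be sketched as follows. The code assumes the common definition of RPD as the standard deviation of the reference values divided by the root mean squared error of prediction; some authors use a bias-corrected standard error instead. The data are made up.

```python
import math
import statistics

def rpd(reference, predicted):
    """Ratio of performance to deviation: SD of reference values / RMSE of prediction."""
    rmse = math.sqrt(
        sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference)
    )
    return statistics.stdev(reference) / rmse

# Made-up reference values and model predictions for a validation set.
reference = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [11.0, 11.0, 15.0, 15.0, 19.0]
value = rpd(reference, predicted)
```

A larger RPD means the prediction error is small relative to the natural spread of the reference values.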
Question
I am doing a study on classifying fruits on visible region taking spectra with Vis-NIR spectroscopy. I want to classify the fruit based on maturity in terms of skin color. I am trying to use SVMC, SIMCA and KNN with PLS toolbox. I went to the wiki site of the eigenvector, but the procedure described there is a bit blurry to me. It would be great if somebody could tell me the stepwise procedure on how to perform these on PLStoolbox using Matlab.
For classification, you can perform the PCA (Principal Component Analysis)
Question
I've been through google search, signed up to a specialised statistical website and checked on my texts (though not advanced), and I can't find a nonparametric analog to the one-way MANOVA. Any accurate advice please?
Dear Dr. Merino,
I suggest asking either Prof. Frank Konietschke (UT Dallas, Texas, email: <fxk141230@utdallas.edu>) or Prof. Arne Bathke (University of Salzburg, Austria, email: <Arne.Bathke@sbg.ac.at>) whether they could provide an R package. There are also some more recent papers on this topic; just ask Arne or Frank about those as well.
Kind regards,
Edgar Brunner
Question
The multivariate statistical framework MaAsLin detected the abundance of bacterial class TK17 positively correlated with mean salinity (q=0.19) in the roots and rhizosphere of my study tree. I can't find much information on this class and I'm wondering if anyone out there is familiar with this class of bacteria.
that is a good question
Question
Hi everyone, I am new to statistics and would like to consult you about some statistical issues.
I have a judgement task with multiple-choice questions. The design is a 5x2 factorial design, so I have 10 conditions; each condition has 10 questions, for a total of 100 questions. For each question, each participant chooses one of 6 options, which correspond to 6 categories (i.e., Categories A-F).
In this case, I will collect frequency data, and I think I could use a chi-square test for them. I want to know (1) whether any two of the 6 categories in each condition differ significantly from each other and (2) whether the frequency of option A differs between any 2 conditions (and, by the same logic, option B between any 2 conditions, and so on).
I have a few questions:
1. Since I have 10 items for each condition for each participant, how should I manage the frequency data, e.g. in a tabulated table or in SPSS? Should I just add up the frequencies, so that with 50 participants I would have a total of 50 people x 10 items = 500 counts for each condition? Would that be inappropriate because each count does not represent one person? I wonder whether there is a better way to handle this.
2. I think the frequency data can in fact be turned into percentage data. In that case, we would have ratio-scale data and might be able to use ANOVA. But I wonder whether I should do it this way.
I follow
Question
I want to compare more than 50 groups, but SPSS could not perform the ANOVA. Which test shall I use to compare these groups? Kindly give your suggestions.
I have a somewhat different view on this question than the previous answers. But step by step.
(1) The conceptual question: is it meaningful to perform an ANOVA with so many groups? You test the null hypothesis that all groups are equal. This can be done with two groups, with five or with fifty. There is no difference in principle between different numbers of groups, and certainly no magic limit of the type "if N_groups > x, then ANOVA is wrong". So yes, it is OK to perform an ANOVA here.
(2) In case the ANOVA gives a significant result at, e.g., the 5% level, then you have exactly what you tested. In 5% of the cases where all 50 groups have identically distributed values, you reject the null hypothesis nevertheless. That's the logic of null-hypothesis testing. Thus, when the ANOVA is significant, well, it is significant and you can say that the 50 groups are not all the same. However, the value of that statement is limited. Here the large number of groups comes in after all. You will want to know which groups differ from the rest. This calls for post-hoc tests, and here you need multiple comparison correction. Bonferroni does not exploit the dependence among the pairwise tests and is therefore overly conservative. Tukey is the way to go. You might argue that LSD, i.e. no correction, is appropriate as the ANOVA is significant. There is something to this argument, but in fully exploratory settings I'd be sceptical. The difference between Tukey and LSD increases with the number of groups. Here the discussion above kicks in.
(3) Pooling groups depends on the type of variable. If it is arbitrarily binned, e.g. like age, you should think about a different test altogether! Use linear regression, spline fitting, or a full blown model. If it is different categories, e.g. like brands of car makers, think about your scientific question. It might allow meaningful grouping or it might not.
(4) Finally, which software to use? This is completely detached from the above. My version of SPSS can manage 99 groups. In case this is not sufficient, calculate all those sums of squares in your favorite programming language and do the ANOVA yourself. As long as you do not need fancy corrections and the like, it is not that difficult or tedious. Have a look at Jarad Niemi's wonderful tutorials on YouTube.
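To illustrate point (4), here is a minimal sketch of a one-way ANOVA computed directly from sums of squares, using only the Python standard library. The function name and the data are made up; turning F into a p-value additionally needs the F distribution, which is available in statistics libraries (for example `scipy.stats.f.sf`, if SciPy is installed).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group sizes times squared mean deviations.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: squared deviations from each group's own mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up data with three groups; the same function accepts 50 or more groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova(groups)
```

Nothing in the computation depends on the number of groups, which is the point made in (1).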
Question
I have three dependent variables and 10 predictors, and I am analyzing the data with multivariate regression. However, I need to compare the model and the contribution of each predictor with other groups. Any ideas how to proceed?
The interaction approach already suggested by everybody here is the way to go for comparing models. As for the importance of variables, I would suggest trying the lasso (a special case of the elastic net). Programs are available in R. I have added some references. Best wishes, David Booth
Question
I should analyse some biomarkers predicting development of AKI using NRI. I have no idea how to do it via SPSS.
Hello,
Does anyone have experience with calculating the NRI and IDI in R when comparing models with two different thresholds? Is this possible with the PredictABEL or survNRI packages?
Question
I am carrying out a multinomial analysis (dependent variable with 5 categories). I have read multiple discussions on the question what the minimum number of cases is per independent variable. However, I cannot find whether there is a recommended number of cases per category of the dependent variable. Can anyone help me with this?
I was also wondering if you had an answer? I thought the number of people required would be influenced by the number of categories in the DV. When you do power calculations for multinomial logistic regression (e.g. in Gpower), it does this based on a series of binary logistic regressions, to say the sample size you'd need per comparison of the categories. As multinomial does this multiple times (with one fixed reference category), the number required increases (i.e. an additional category).
Question
Dear all,
do you know if:
1 - Can I run an RDA with negative (taxa) values (as delta Control - Treatment)?
2 - Do I have to apply the decostand function to these delta values before performing the RDA?
3 - Shall I use the Bray-Curtis distance (dist="bray") in the RDA function?
Best
Alessandro
I have no idea yet. Good question.
Question
What is the leave-one-out classification method used in discriminant analysis for classifying cases? How are the cases classified under this method?
Hi Srikanth,
If you have N cases, then for each case i in (1, 2, ..., N) you use the remaining data (all cases except the ith) to build a classifier, and then apply this model to the ith case to get its predicted class. After repeating this procedure N times, every case has been assigned a class label, and you can evaluate the accuracy of your classification model by comparing these labels with the true classes.
Hope this helps.
Wen
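The leave-one-out procedure described above can be sketched as follows. Note that the classifier used here is a simple nearest-centroid rule chosen for brevity, not discriminant analysis itself; the leave-one-out loop is the same whichever classifier is plugged in. All names and data are made up.

```python
from collections import defaultdict

def nearest_centroid_fit(X, y):
    """Compute one mean vector (centroid) per class."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }

def nearest_centroid_predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )

def leave_one_out_accuracy(X, y):
    """Hold out each case in turn, fit on the rest, and classify the held-out case."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        centroids = nearest_centroid_fit(train_X, train_y)
        hits += nearest_centroid_predict(centroids, X[i]) == y[i]
    return hits / len(X)

# Made-up two-class data with two measurements per case.
X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
y = ["a", "a", "a", "b", "b", "b"]
accuracy = leave_one_out_accuracy(X, y)
```

Because each case is classified by a model that never saw it, the resulting accuracy is a less optimistic estimate than classifying the same cases used for fitting.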
Question
Looking for recent (preferably meta-analytic) findings that yielded estimates of the proportion of shared variance among common personnel selection methods such as structured and unstructured interviews, assessment centers, general cognitive abilities tests, personality tests, etc. Thank you
This isn't a meta-analysis, but the following paper provides correlations among assessment center exercises, cognitive ability tests, and personality tests:
Spector, P. E., Schneider, J. R., Vance, C. A., & Hezlett, S. A. (2000). The relation of cognitive ability and personality traits to assessment center performance. Journal of Applied Social Psychology, 30, 1474-1491.
Question
I want to perform an O2PLS-DA analysis of multi-omics data (from different metabolomics, lipidomics and proteomics experiments) using SIMCA 130.2. I have the data in matrix format (samples in rows with labels, variables in columns). I can run PCA, PLS-DA and OPLS-DA, but the O2PLS-DA tab is not active. I think I have a problem with the data arrangement, but I am not sure whether it's the only problem. Any help will be highly appreciated!
If you want to integrate the six of them in one go I would rather use multiple co-inertia analysis, regularized generalized CCA or STATIS. Refer to Meng et al. 2016:
I personally dislike SIMCA, it is highly limited. In any case I presume you have to provide two separate matrices at a time for O2PLS-DA. As for the other methods, you can run PCA, PLS-DA, etc. on the entire thing - however note that you run the risk of contaminating results with the technical noise specific to each separate omics readout. Hope this is clear? Cheers
Question
I am trying to analyse the degree of influence of a few environmental factors on benthic mollusc assemblage structure using DistLM and dbRDA plots. After selecting the best model, I used a forward-stepping selection procedure based on Bray-Curtis distance measures to run both the adj R^2 and AIC selection criteria. Two things are odd: the results of the marginal tests came out almost identical for adj R^2 and AIC, and there are no factors/values listed at all for either of the sequential tests! What have I done wrong, and what should I have done?
Jen, I have the same question as you!
Do you already know how to solve it?
Regards
Question
In many studies, it is observed that geochemical and environmental data do not follow a normal distribution. This may be because the samples come from different populations or origins.
Basic statistics (mean, standard deviation, etc.) are sometimes computed from such data anyway, which may lead to biased or wrong results, because classical statistical methods are generally based on the assumption of normally distributed data.
For these types of data, can we compute median (as a measure of location) and median absolute deviation (as a measure of spread) instead of mean and standard deviation?
Can we use non-parametric methods for multivariate statistics or statistical tests for such type of data?
What are the suggestions of statisticians, environmentalists and geochemists?
1) I don't know, actually, but I suppose yes, because factor analysis is based on the same principles as linear models.
2) Yes... for the purpose of testing expected values.
3) Example:
in https://www.iii.org/fact-statistic/facts-statistics-lightning you find the numbers of lightning fatalities in the US states in 2016. The data is
number of cases: 0, 1, 2, 3, 4, 9
number of states: 31, 8, 5, 1, 2, 1
Imagine the data was summarized as mean and standard deviation:
mean number of cases per state: m = 0.792
standard deviation: s = 1.597
Assuming that the number of cases were normally distributed, one would expect that most states would have about 1 case, but in fact the vast majority of states don't have a single case. One would further expect that there shouldn't be states with more than 5 cases (m+3s = 5.6, and based on the normal probability model P(x>5.6) is about 0.001, which is pretty unlikely), but we observed a state with 9 cases. The probability, under the normal model, of getting x>=9 is lower than 0.000001! We would never expect that to happen if we assume the normal probability model. Finally, under the normal probability model the probability P(x<0) is 0.31, so we would expect about one third of the states to have a negative number of cases, which is obviously absurd.
Giving order statistics to summarize the data is not very instructive here because there is not very much data (n=48). However, it is clearly more instructive than giving m and s. The median, for instance, is 0, and the interquartile range is from 0 to 1, indicating that about 50% of the states have 0 or 1 cases, most having 0 cases.
A better strategy, in my opinion, is to think about the kind of data and a possibly more sensible probability model for that kind of data. Here we have counts, which suggests that a Poisson model would be better. However, that model assumes that the incidence rate is identical in all states, which results in the fact that the mean and the variance of the Poisson distribution are equal. For our data, the variance is s² = 2.55, which is more than 3 times larger than the mean. So the assumption of the Poisson model might not be very sensible for our data. If one relaxes that assumption and models the incidence rate itself as gamma-distributed, one gets a more flexible model for counts that can deal with overdispersion (that is, when variance > mean): the negative-binomial model, which can be parametrized with a location parameter m' and a dispersion parameter s'. If one fits the parameters of that model to the observed data, one gets m' = m = 0.792 and s' = 0.397 (that's not a standard deviation; it is the estimated value of the dispersion parameter of the negative-binomial model). Thus, reporting that the number of cases was negative-binomial distributed with mean 0.792 and dispersion 0.397 would be the most instructive summary of the data.
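The summary numbers in this example can be checked with a few lines of Python. This is only a sketch using the standard library; the helper names (`normal_cdf`, `normal_sf`) are made up for illustration, and it does not fit the negative-binomial model itself.

```python
import math

# Lightning fatalities per state, as tabulated in the example above.
cases = [0, 1, 2, 3, 4, 9]
states = [31, 8, 5, 1, 2, 1]

n = sum(states)
mean = sum(c * k for c, k in zip(cases, states)) / n
# Sample variance (divisor n - 1) and standard deviation.
var = sum(k * (c - mean) ** 2 for c, k in zip(cases, states)) / (n - 1)
sd = math.sqrt(var)


def normal_cdf(x, mu, sigma):
    """P(X <= x) under a normal model, via the complementary error function."""
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2)))


def normal_sf(x, mu, sigma):
    """P(X >= x) under a normal model (upper tail)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))


p_negative = normal_cdf(0, mean, sd)       # implied share of "negative counts"
p_nine_or_more = normal_sf(9, mean, sd)    # implied chance of 9 or more cases
dispersion_ratio = var / mean              # equals 1 for a Poisson model

print(f"n={n}, mean={mean:.3f}, sd={sd:.3f}, var/mean={dispersion_ratio:.2f}")
print(f"P(x<0)={p_negative:.2f}, P(x>=9)={p_nine_or_more:.1e}")
```

A variance-to-mean ratio well above 1 is the overdispersion that motivates moving from the Poisson to the negative-binomial model.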
Question
How many people do I need to recruit if I conduct randomized, between subject pilot study using 4 different condition for 4 type of manipulations?
Pilot study
Saunders et al. (2007) state that a questionnaire should be pilot tested before it is used to collect data. They point out that the purpose of the pilot test is to refine the questionnaire so that respondents will have no problems answering the questions and there will be no problems recording the data.
Fink (2003b), as cited in Saunders et al. (2007), states that the minimum number for a pilot study is 10.
Reference
Saunders, M.N., 2007. Research methods for business students, 5/e. Pearson Education India.
Question
I read here that principal components scores are always in Euclidean distance and the distance of PCA is Euclidean: https://www.mii.lt/zilinskas/uploads/visualization/lectures/lect4/lect4_pca/PCA1.ppt
Is it true? I have a list of 20 principal component scores and have never been told what distance measure they represent. I want to calculate the Manhattan distance and similarity indices between my samples according to these 20 principal components, but it would be pointless if the principal components are already based on Euclidean distance and I then calculate the Manhattan distance from them. So do the PC scores always represent the Euclidean distance measure, or some other measure? Or are they not based on any distance measure? I hope they are not, so I can go ahead and obtain an accurate Manhattan distance between the samples.
Is your original data euclidean? If so then doing PCA first will not matter anyway.
PCA always gives Euclidean distance because it is calculated from variance, which is part of classical Euclidean geometry and is based on the squared distance between each data point and the origin (the mean, after centring). Manhattan distance is a stepwise distance that depends on point-to-point spacing and connections rather than reference to a central origin.
Manhattan distance is more relevant where you believe your data has limited discrete possible outcomes. Euclidean distance is relevant where you believe your data comes from a continuous range of possible outcomes.
In many real world applications the distinction is fudged and there may not be a clear winner.
Also, using Euclidean techniques first does not change the structure of the data, it just rotates it. To me that suggests that any discrete characteristics are retained, so Manhattan distances can still be worked out afterwards. It would be interesting to see what others think. However, transforming into Manhattan distance, as it assumes discreteness, may affect the data structure and so may disrupt subsequent Euclidean-based processing of the data.
If the number of discrete possibilities increases, or the variance between them starts to blur the discreteness, then it looks more and more like a continuous situation.
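The "it just rotates it" point can be illustrated with a toy rotation. The sketch below uses two made-up 2-D points and a 45-degree rotation as a stand-in for what PCA does when all components are kept; it is not a PCA implementation. It suggests that Euclidean distances survive the rotation, while a Manhattan distance computed on the rotated coordinates need not equal the Manhattan distance on the original variables.

```python
import math


def rotate(point, angle):
    """Rotate a 2-D point about the origin by `angle` radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


# Two hypothetical samples measured on two variables.
a, b = (0.0, 0.0), (1.0, 1.0)
angle = math.pi / 4  # hypothetical rotation, standing in for the PCA axes

ra, rb = rotate(a, angle), rotate(b, angle)

print("Euclidean before/after:", euclidean(a, b), euclidean(ra, rb))
print("Manhattan before/after:", manhattan(a, b), manhattan(ra, rb))
```

So Manhattan distances can be computed from PC scores, but they describe the rotated axes rather than the original variables, which is worth keeping in mind when interpreting them.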
Question
When conducting a one - way ANOVA, the F ratio is defined as the sum of squares between/sum of squares within.
However, when you actually do the math, the F ratio is the mean square between/mean square within.
For example:
(sum of squares between/degrees of freedom) = mean square (i.e., for variance explained)
And
(sum of squares within/degrees of freedom) = mean square (i.e., for variance not explained, or error).
My question is why do we need to adjust the sum of squares by the degrees of freedom in order to determine the F ratio?
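A small worked example may help with the definitions above. The sketch below uses three made-up groups and computes the sums of squares, degrees of freedom, mean squares and F ratio by hand. Dividing by the degrees of freedom turns each sum of squares into a variance estimate; under the null hypothesis both mean squares estimate the same error variance, which is why their ratio is expected to be near 1, whereas the raw sums of squares depend on how many groups and observations there are.

```python
# Three hypothetical groups of three observations each.
groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]

all_values = [v for g in groups for v in g]
grand_mean = sum(all_values) / len(all_values)

# Between-groups sum of squares: group means around the grand mean.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-groups sum of squares: observations around their own group mean.
ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

df_between = len(groups) - 1                # number of groups minus 1
df_within = len(all_values) - len(groups)   # observations minus groups

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

print(f"SS between={ss_between:.2f}, SS within={ss_within:.2f}")
print(f"MS between={ms_between:.2f}, MS within={ms_within:.2f}, F={f_ratio:.2f}")
```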
Thank you for the recommendation Subrata. I will check out the site.
Jolie
Question
I am looking for suggestions for analyses that can compare different taxa in terms of the relative difference in composition among sites.
I have 4 parallel datasets of species abundance data from 4 different taxa sampled in the same sites (n=12).
Each site was sampled between 4 - 10 times.  Usually (not always) sampling was done at the same time for all taxa within a site, but not all sites were sampled at the same time so the data are unbalanced.
I can create balanced subsets if needed but this would severely truncate the data.
I've heard of co-correspondence analysis, co-inertia analysis, and possibly multiple-factor analysis as potential candidates for doing this type of comparison but I'm not sure about the differences or which is most appropriate.
Are there pros and cons/restrictions/assumptions for each of these?
Is there an alternative method that I have not mentioned that would be better?
Also, what do these analyses allow me to test exactly? Is their intention to be able to say, for example, that taxa A and B had high correlation in terms of variation in composition across sites, while taxon C showed low correlation with any other taxa, etc.?
Thanks
Tania
Thank you for all your responses.
The question I would like to ask is
1) do the spatial patterns of diversity differ among taxa? For example one taxa may show high clustering of sites based on habitat type, while another will show similar composition across all sites.
2) do spatial trends differ over time- for example, one taxa may show stable composition in all habitats over time while another taxa may show convergence of composition between habitats, and a third taxa may show high variability over time for one particular habitat...
The intention is to demonstrate the different taxa have different spatial and temporal distributions and therefore can or cannot be used as surrogates for each other based on composition.
3) Characterise the similarity (betadiversity) between and within habitat types, based on all taxa.
I can of course compare univariate measures of diversity in each site using anova but I would like to compare the taxa based on their composition.
Thanks for further suggestions to address these specific research questions..
Question
I'm really confused about familywise error. Here are my questions:
1. I know that multiple comparisons (e.g., Group A vs Group B, Group B vs Group C, Group A vs Group C) based on the same dependent variables will increase type 1 error, and that's the reason why we use ANOVA instead of multiple independent t-tests. But what if the multiple comparisons are based on multiple dependent variables (DV1, DV2, DV3) within the same two groups (eg. Group A vs Group B)? Does it need correction, such as FDR?
2. What about multiple one-way repeated-measures ANOVAs? If there are several dependent variables, do I need to adjust the p-value? How can I adjust it (the p-value for the main effect and interaction effect)?
3. I have learned some about robust statistics recently, and I'm not sure if multiple comparisons / familywise error in robust statistics is also a problem. If so, how can I correct it? I read some books and searched online, but found no information concerning familywise error in robust statistics. So could you give me some ideas?
Any ideas are appreciated. Thanks a lot!
ANOVA does not control the family-wise error rate (FWER). ANOVA is used to compare two nested linear models, which is a somewhat different question from a "family" of questions: it asks whether all additional coefficients in the larger model are zero. These tests can refer to coefficients from one or from several different predictors. Since all these tests are based on the same set of data, the complete information from the whole data can be used to get better estimates of the residual variance and thus of the standard errors. An ANOVA is only made to calculate these interim statistics, but it is not the ANOVA that is interesting. The tests on individual coefficients are t-tests, and these tests use the residual variance estimate. They can account for multiple testing to control the FWER. So it's not about ANOVA, it's only about using a testing procedure that controls the FWER.
The "family of tests" is not given by the analysis technique you use. It is the intellectual question behind your analysis. You can see tests of different predictors as independent or "unconnected", or you can see them as a family. Whenever you would say that "if at least one of these tests is significant, then ..(bla bla)" then you have a family of tests. An example: given you have treated cells with some cytokine and you test several different inhibitors, it may be the aim to show that at least one of the inhibitors shows an effect (because you know that not every inhibitor must really work; but if you could show it for at least one, you would see your task accomplished). In this case you have a family of tests and you should check the FWER. If, in contrast, you test the inhibitor at several different times, it is more a consecutive series of tests and you would stop testing after the first time-point where the inhibition is "significant". There is no family and thus no need to control a FWER.
So the definition what a family of tests is and what not must be decided by you. It is sometimes difficult and it may be that different people have a different opinion on that. That's science. And as Daniel wisely noted: the type-I error is not the only thing to consider.
Within a linear model, the FWER of tests can be controlled by neat functions like Tukey's HSD, Dunnett's procedure and so on. Otherwise, methods like that of Bonferroni and Holm have to be used. This has nothing to do with what method the data was analyzed. It is only about the p-values. They can be from "parametric" or "non-parametric" (rank-based) tests, from "robust" or "sensitive" tests.
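Because Bonferroni and Holm work on the p-values alone, they are easy to sketch. The Python below is a minimal illustration with made-up p-values; the function names are only for this example, and in practice a statistics package's own adjustment routine would normally be used.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each p by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted p-values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical p-values from three tests that form one family.
p_values = [0.01, 0.04, 0.03]
p_bonferroni = bonferroni(p_values)
p_holm = holm(p_values)

print("Bonferroni:", p_bonferroni)
print("Holm:      ", p_holm)
```

Holm's adjusted p-values are never larger than Bonferroni's, so it controls the FWER with at least as much power.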
Question
I want to run a before/after test probing the change of variability.
This is my setup: 7 operators, 6 samples, 2 measurements (1 before training and 1 after the training) being the two measurements made on the same sample. Roughly speaking I have a 6x7x2 matrix of measurements of many different physical quantities.
I want to demonstrate that, for each physical quantity, the measurements made before training are more scattered than those made after the training ("the training is useful and serves to standardize the operators' skills").
I can not figure out how to demonstrate this. Running a two-way anova on a single physical quantity I get the results reported in the figure test_anova2.bmp.
It is clear that the variability between operators is largely decreased. This proof is quite naive and not rigorous, moreover this is only one physical property but I have more than 20 features to take into account.
Finally my question:
Is there a rigorous method to prove what I see naively? Do I have to run the test on every single physical property or is there a way to use all of them together?
Any help will be appreciated.
Thanks
Seven people at each of six locations were tested before and after training. Assessment was based on a 20 question survey on 42 people.
Problem 1) A person is a replicate.
Your perception of variability is across people. You have no within-person assessment of variability.
Your assessment of variability after is correlated with your measure of variability before.
You have replicates for estimating the mean, but you have no replicates for estimating the variability.
What you should have done is one of two approaches: 1) Give the same people multiple tests before and after. 2) Not used the same people. I had 100 participants, and I surveyed 42 of them at random before training. I then sampled another 42 of the remaining 58 after training. The existing before-after strategy is good if you were testing means.
Let's say that the data are independent (or that the degree of dependence is trivial relative to the strength of the signal that you are trying to detect). You now have a null hypothesis that the existing organization of the 84 data points results in two standard deviations (one for each group of 42) that are equal.
If they are equal, then it makes no difference how you arrange the data. So put all the data into an 84-cell matrix. Randomly reorder the observations. Find the standard deviation of the first 42 and the second 42. Take the difference. Store the result. Do this some 20,000 times. Given the difference in standard deviations between the observed groups, how common is this value among the randomized data? This is a randomization test.
Do this for each question. With 20 questions, it is likely that at least one will give a significant difference by chance alone. So if only one question shows a significant difference with a p-value of 0.03 I would strongly suggest that you consider this an artifact. Also be aware that failing to find a statistically significant difference is not sufficient to prove that the null hypothesis is correct. It is simply an inconclusive result.
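The randomization test described above can be sketched in Python as follows. This is a minimal illustration under the stated independence assumption, which the before/after design does not strictly satisfy; the function name, the toy data and the default of 2,000 shuffles (rather than 20,000, to keep it quick) are choices made for this example only.

```python
import random
import statistics


def sd_randomization_test(before, after, n_perm=2000, seed=0):
    """Two-sided randomization test for a difference in standard deviations."""
    rng = random.Random(seed)
    observed = abs(statistics.stdev(before) - statistics.stdev(after))
    pooled = list(before) + list(after)
    n_before = len(before)
    hits = 0
    for _ in range(n_perm):
        # Randomly reorder the pooled data and split it into two groups.
        rng.shuffle(pooled)
        diff = abs(statistics.stdev(pooled[:n_before])
                   - statistics.stdev(pooled[n_before:]))
        if diff >= observed:
            hits += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical measurements: wide spread before, narrow spread after.
before = [float(i) for i in range(-10, 10)]
after = [0.1 * i for i in range(-10, 10)]

p_value = sd_randomization_test(before, after)
print(f"randomization p-value: {p_value:.4f}")
```

Running this once per measured quantity gives one p-value each, which then need a multiplicity correction for the reason given in the last paragraph above.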
Question
Hi
I have data from a survey intended to measure the degree of conservative behaviours. First I will conduct hierarchical cluster analysis and then k-means clustering to create my blocks. Since clustering algorithms have few pre-analysis requirements, I suppose outliers will not be a problem at the first stage.
However, I am planning to define my clusters using factor scores, which I am going to produce by factor analysis. Although outliers are not a problem for cluster analysis, at the factor analysis and discriminant analysis stages I will be more exposed to outlier effects.
Although all my variables have the same range, there are some significant outliers in the data. Because these are survey data collected from individuals, I believe I should only eliminate outliers stemming from data entry errors?
Do you think this is a proper way?
Thanks a lot
Lov what you suggest is suitable when finding univariate outliers. When multivariate outliers are concerned, one might use Mahalanobis distance.
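To show what the Mahalanobis distance adds over looking at each variable separately, here is a minimal two-variable sketch in plain Python. The data and the function name are made up for illustration; with more than two variables a linear-algebra library would be the practical choice.

```python
import math


def mahalanobis_2d(point, data):
    """Mahalanobis distance of `point` from the centre of 2-D `data`."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d' S^-1 d, using the closed-form inverse of a 2x2 matrix.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


# Hypothetical scores on two strongly correlated items.
data = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 3.8), (5, 5.0)]

along = mahalanobis_2d((4, 4), data)    # follows the correlation pattern
against = mahalanobis_2d((4, 2), data)  # breaks the correlation pattern

print(f"along the trend: {along:.2f}, against the trend: {against:.2f}")
```

Both test points lie within the range of each variable and are equally far from the centre in Euclidean terms, yet the one that breaks the correlation pattern gets a much larger Mahalanobis distance, which is what makes it a multivariate outlier.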
Question
I am trying to obtain transfer curve (Vg, gate voltage vs. Id, drain current) with graphene transistor using 2636A Keithley source-meter. Device basic structure is attached.
I have put drain voltage fixed as 30 mV and made the gate voltage sweep from -20 V to 20 V. When the gate voltage is negative, I am getting perfect curve as it should be theoretically. But when the gate voltage appears as positive, I am getting negative current. I have attached the curve I have obtained. Your kind suggesting is requested to come out from this predicament.
If the structure is large enough, you could call in the assistance of a theorist and ask them to construct a simulation based on my latest QED model, please see link to my latest paper on the subject for more details, if required:
Question
As objects for my cluster analysis, I compare three different types of Sport Mega Events (Summer Olympics, Winter Olympics and World Cups) and their corresponding impact factors (Costs, Surface area, New Venues etc.). As I don't want to compare just the absolute variation of the impact but rather the relative variation of the impact of the events, I want to cancel out the 'between variation' between the different types of events. If I compared just the absolute variation, it is quite reasonable that bigger events simply have bigger impacts. But that is obvious and not very astonishing.
Therefore I apply the following homogenisation technique to impact factors to cancel out the between variation:
x(E,T) − x̄(E) + x̄
where
x̄ is the overall average,
x̄(E) is the average per event type, and
x(E,T) is the individual impact factor value before homogenisation.
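The homogenisation step above amounts to subtracting each event type's mean and adding back the overall mean. A minimal Python sketch, with made-up cost figures and a hypothetical function name, could look like this:

```python
def homogenise(values, event_types):
    """Return x(E,T) - mean(E) + overall mean for each observation."""
    overall_mean = sum(values) / len(values)
    by_type = {}
    for value, event_type in zip(values, event_types):
        by_type.setdefault(event_type, []).append(value)
    type_mean = {e: sum(vs) / len(vs) for e, vs in by_type.items()}
    return [v - type_mean[e] + overall_mean
            for v, e in zip(values, event_types)]


# Hypothetical impact factor (e.g. costs) for two event types.
costs = [10.0, 14.0, 2.0, 4.0]
types = ["summer", "summer", "winter", "winter"]

adjusted = homogenise(costs, types)
print(adjusted)
```

After the adjustment every event type has the same mean (the overall mean), so only the within-type variation is left for the cluster analysis to work on.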
It all depends on how you measure distance, i.e., what metric you are using for the analysis.
If you are to use scale-dependent indexes such as Euclidean Distance (squared or not) or Manhattan City-Block (Taxicab Distance), then you do need to standardize. The form you mention is quite OK, but the use of z-scores (individual value minus mean, divided by the standard deviation) for all the variables is quite common and has the advantage of preserving the shape of the original variable's distribution (it's a linear transformation).
On the other hand, if you are using an index based on association, such as 1-Pearson r or the Guttman Monotonicity Coefficient, then scale is not an issue and you can simply use the raw data as it is.
The choice of a metric for the measurement of distance is an arbitrary one, so, you can simply go for whatever you please (though there will be implications for interpretation).
I suggest using a non-parametric index of association such as the Guttman Coefficient or maybe 1-Spearman Rho. Then you don't have to worry about linearity, distribution, equivalent intervals, etc. Besides, my personal experience is that association-based measures usually are MUCH more revealing than those based on score values. This is because the first will yield clusters based on relationship, whereas the second will simply bunch together similar values.
I also suggest you look into Multidimensional Scaling (MDS), particularly Smallest Space Analysis (SSA) and Facet Theory as a very flexible and rich alternative to Cluster Analysis. It is based on the whole idea of distances, so that all the above arguments apply, but the results are expressed in a very intuitive diagram that has more possibilities and is easier to interpret than a Dendrogram.
When you know what you are looking for (i.e., have clear expectations of what the groupings might be), then MDS is probably the best choice. However, if you are just exploring, with no clear idea of what you may find, then stick to Cluster Analysis.
Question
Hello,
I would like to check multigroup hypothesis with the help of R Studio/STATA/Mplus Softwares but I would like to read some documents that used a multigroup analysis
If you are looking to perform multigroup analyses within an SEM framework, then an option in R would be the lavaan package.
Question
After running a multivariate model with 4 dependent variables, I am struggling to calculate the marginal effects of explanatory variables on the dependent. can anyone help
You are perhaps referring to the Test for Additional Information which is a sub protocol after Analysis of Dispersion according to C.R. Rao (1965)  Linear Statistical  Inference and Its Applications. John Wiley & Sons, New York, 522pp.
I have produced two R scripts, Andy.R and Adinfo.R that run the initial Analysis of Dispersion and provides the data allowing Test for Additional Information which allows the question:
For Y1, Y2, Y3, .... Yp of a n by p Y observation matrix on the X n by q design matrix:
does Y1 provide any additional information beyond Y2 - Yp on design feature w of design matrix X?
The R-scripts are available at URL:
These were designed for R running on a Mac which has minor graphical problems of compatibility with Windows PCs.   If you have problems running them we can communicate over the issues.
Question
how can I calculate the marginal effects of explanatory variables after running a multivariate probit model on STATA?
Dear Ermias,
In Stata 11, 12 and 13 this is certainly possible for probit, visit:
The command margins was preceded by the command mfx, visit:
and for an example, visit:
Best,
D.A
Question
I have sampled 6 populations of lizards, and I presented each lizard with 4 odour treatments. I want to know if lizards flick their tongue less when presented with a control (no odour) than with a specific odour.
My mixed model (individual and trial as random effects, because of repeated measures) was very overdispersed with a Poisson distribution, so I used a negative binomial distribution (now a value of 3; is this ok?) in the glmmadmb function. I have a significant 3-way interaction between the population (factor, 6 levels), treatment (factor, 4 levels) and walk (continuous, time spent walking) and get the below output with the summary.
If I am correct, the reference/intercept for the 3-way interaction is popBru : every level of treatment : I(walk/10). This would mean that the estimate for the first comparison popDol:treatctrl:I(walk/10) is obtained by comparing this with popBru:treatctrl:I(walk/10). The estimate of popVis:treathie:I(walk/10) is then from the comparison with popBru:treathie:I(walk/10), etc. So the output is giving me the differences between populations for each level of I(walk/10):treatment, correct?
I am actually more interested in the comparison between treatments within each level of I(walk/10):population. I have tried with the releveling code and by reordering the variables in the model, but it keeps giving me the same. Does anyone know how I can put the focus more on comparing treatments than on comparing populations?
Many thanks,
Charlotte
You may want to try the lsmeans package to make comparisons among treatments.
The lsmeans package does handle model objects of the type "glmerMod", which are produced by the glmer.nb function in the lme4 package.  glmer.nb can be used for mixed-effects negative binomial regression.
If you are interested, let me know, and I will send you some sample code to try.
Question
My study has a control group and an experimental group and involves a pre-test and post-test for each group. The pre-test and post-test are identical and have a subjective measure (ratings of anxiety) and an objective measure (word production ability). I would like to know if the difference between pre and post tests are significant for each group and if there is a significant difference between the two groups. I would also like to know if the subjective and objective measures are correlated. Which statistical analysis or analyses should I use?
You can try a paired t-test on the differences between the outcomes of pre and post tests of the same individuals.
To examine relationships, a simple graphical view for pre vs post outcomes (scatter-plot command in MINITAB) may reveal some facts about the data for further statistical inference.
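For illustration, the paired t statistic can be computed by hand from the pre/post differences. The scores below are made up, and the sketch stops at the t statistic and its degrees of freedom; the p-value would come from a t table or a statistics package.

```python
import math
import statistics

# Hypothetical pre-test and post-test scores for the same five participants.
pre = [10.0, 12.0, 9.0, 14.0, 11.0]
post = [12.0, 15.0, 9.0, 17.0, 13.0]

# A paired t-test is a one-sample t-test on the within-person differences.
differences = [b - a for a, b in zip(pre, post)]
n = len(differences)
mean_diff = statistics.mean(differences)
se_diff = statistics.stdev(differences) / math.sqrt(n)

t_statistic = mean_diff / se_diff
degrees_of_freedom = n - 1

print(f"mean difference={mean_diff:.2f}, t={t_statistic:.3f}, df={degrees_of_freedom}")
```

This would be run separately within the control group and within the experimental group; comparing the two groups with each other is a different test, for example on the change scores of each group.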
Question
This question is for multivariate nonlinear regression analysis.
You can use mvregress for multivariate regression, default is linear reg.
See this:
To add a link function (for nonlinearity), you can explicitly transform the 'X' variables/covariates.
nlinfit does multiple regression, not multivariate regression.
Question
The model  includes the following variables (N=367):
(i)   X  is an independent variable
(ii)  Y is dependent variable
The result shows following significant relationships:
(i)   c= X--->Y; Beta value=0.47 (p<0.005)
On including the mediator "M" it is found that "M" partially mediates the relationship between "X" and "Y" (Baron & Kenny, 1986):
a=X-->M; Beta value=0.80 (p<0.005)
b=M-->Y ; Beta value=0.24(p<0.005)
c*=X-->Y ; Beta value=0.34 (p<0.005)
However, on measuring the moderating effect of Gender (Male=262 and Female=105) in the mediating model following results are obtained:
No mediating effect in the case of males (Baron & Kenny, 1986):
c=X-->Y (without mediation) ; Beta value=0.49(p<0.005)
On including mediator "M"
a=X--->M ; Beta value=0.79 (p<0.005)
b=M-->Y; Beta value=0.16 (p=0.21) insignificant path ~absence of mediation
While for female; complete mediation occurs
c=X--->Y (Without mediator) ; Beta value=0.40 (p<0.005)
a=X-->M; Beta value=0.74(p<0.005)
b=M-->Y ; Beta value=0.56(p<0.005)
c*=X-->Y ; Beta value=-0.047 (p=0.796 i.e. insignificant and beta value is negative)
Although the stated relationship is a case of complete mediation (c* is insignificant and its beta value approaches zero), it is observed that the effect size is greater than 1, i.e. a*b (indirect effect) / c (total effect) = 0.74*0.56/0.40 ≈ 1.04, which is considered inconsistent mediation.
Kindly advise me how I should report such findings, and is there any literature that supports inconsistent but full mediation moderated by gender?
For one, your obtained ratio of the indirect to the total effect is very nearly one and could be attributed to some sort of rounding error. That being said, I am not sure how important exploring this particular value is to be honest. See page 15 of the Hayes (2009) ref I have linked to. Also, there is a general trend away from talking about mediation as being "full" vs. "partial". It should be sufficient to demonstrate that your indirect effect is significantly different from zero, preferably using a bootstrapping approach (here is an easy web utility that would work with your existing analyses: http://www.quantpsy.org/medmc/medmc.htm).
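The arithmetic behind that ratio is easy to reproduce from the coefficients reported for the female subsample. This is only a sketch of the calculation; the variable names are made up for this example. For ordinary least-squares estimates from the same cases, the total effect equals the direct effect plus the indirect effect, so the last line shows the total effect implied by the reported paths.

```python
# Path coefficients reported in the question for the female subsample.
a = 0.74          # X -> M
b = 0.56          # M -> Y, controlling for X
c_total = 0.40    # X -> Y without the mediator
c_direct = -0.047 # X -> Y with the mediator (c*)

indirect = a * b
ratio = indirect / c_total
implied_total = c_direct + indirect

print(f"indirect effect a*b = {indirect:.4f}")
print(f"indirect / total    = {ratio:.3f}")
print(f"c* + a*b            = {implied_total:.4f}")
```

A ratio slightly above 1 is what a small negative direct effect produces, since the indirect effect then has to exceed the total effect.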
Question
Dear all,
I am using Smart PLS. In my measurement model, I noticed I have to delete quite a number of indicators (> 20%) whose loadings are below 0.4 (Hulland, 1999). The current construct has Composite Reliability (> 0.7) and AVE (> 0.5) and the estimated model fit (SRMR < 0.08). The measurement model is now fit for hypothesis testing. My question is, is it okay to delete such a number of indicators? If okay, do you have a reference to help me learn more? If not okay, why? I would love to hear from you. Thanks.
Dear Michael,
This is a rule of thumb and you don't always need to follow it. The minimum score is .30. It depends on your sample size. According to Stevens (1996), for N = 50 you need loadings higher than 2*.361, for N = 80, higher than 2*.286, for N = 140, you need 2*.217 = .434, (..) and for N = 1000 you only need loadings higher than 2*.081.
Also, according to Hair, Anderson, Tatham, & Black, 2005, p. 107, for N = 50 you need loadings higher than .50, but for N = 350 you only need loadings higher than .30.
Stevens, J. (1996). Applied multivariate statistics for the social sciences (3rd ed.).Mahwah, NJ: Lawrence Erlbaum Associates.
Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1999). Multivariate data analysis. NJ: Prentice-Hall.
Question
I am looking for a way to incorporate observation weights into a partial least-squares regression.
More specifically I want to extract the first pair of singular vectors u and v from the matrix XᵀY, where X is an n observations x k predictors matrix and Y is an n observations x p response variables matrix.
When the observations are unweighted these singular vectors maximize the covariance between projections onto u and projections onto v. I would like this to emphasize maximizing the covariance between some observations more than others.
Any pointers are greatly appreciated.
Harry
Dear Harold,
I suggest you "do think out of the box" and consider regression methods other than those in linear algebra. Especially since the most of real world problems are non-linear in nature and as I understand, you got somehow a similar case in your work.
Regards
Peyman
Question
I have data from a business plan competition. There are a total of 127 judges and 201 business plans in which each of the judges rate each plan they are assigned on 6 items.
Judges are randomly assigned to plans such that each plan has anywhere from 4-10 judges rating them. Plans are also randomly assigned to judges such that no plan has the exact same panel of judges.
Can I calculate ICCs in this case? If so, how would I do that in SPSS?
Hi Jason!
Using the MS(Between) and MS(Within) values from a one-way ANOVA*, the formula (can be computed in excel) for ICC1 and ICC2 are as follows:
ICC1: (MSB - MSW) / [MSB + (k-1)*MSW], where k = the average group size
ICC2: (MSB - MSW) / MSB, or equivalently
ICC2: k*ICC1 / [1 + (k-1)*ICC1]
*Set up the one-way ANOVA in SPSS as business plan ID by rating item.
ICC1 Tells us between team variance. Index of interrater reliability (raters are substitutable). Acceptable values range from .10 to .20 or higher to support aggregation (Bliese 2000).
ICC2 tells us the reliability of the group means.
Bliese, P.D. Within-group agreement, non-independence, and reliability: Implications for data aggregation and analysis. In K.J. Klein & S.W. Kozlowski (Eds), Multilevel theory, research, and methods in organizations. San Francisco, CA: Jossey-Bass, 2000, pp. 349–81.
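The ICC formulas above can be written as small functions. The sketch below uses made-up mean squares and a made-up average group size, and the function names are only for this example; in practice MSB and MSW would be read off the one-way ANOVA table.

```python
def icc1(msb, msw, k):
    """ICC(1) from one-way ANOVA mean squares; k is the average group size."""
    return (msb - msw) / (msb + (k - 1) * msw)


def icc2(msb, msw):
    """ICC(2): reliability of the group means."""
    return (msb - msw) / msb


def icc2_from_icc1(icc1_value, k):
    """ICC(2) obtained from ICC(1) by the Spearman-Brown step-up."""
    return k * icc1_value / (1 + (k - 1) * icc1_value)


# Hypothetical ANOVA output: mean squares and average number of judges per plan.
msb, msw, k = 6.0, 2.0, 5.0

icc1_value = icc1(msb, msw, k)
icc2_value = icc2(msb, msw)
icc2_check = icc2_from_icc1(icc1_value, k)

print(f"ICC1={icc1_value:.3f}, ICC2={icc2_value:.3f}, ICC2 via ICC1={icc2_check:.3f}")
```

The two routes to ICC2 agree, which is a handy check that the brackets in the ICC1 formula have been placed correctly.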
Question
Hi, I am testing for an association between gender and an ordinal variable. From what I have read online, when dealing with ordinal data, normally we use the linear-by-linear association output in SPSS. However, over 20% of my cells have an expected count less than 5 (in my case, it's 30%). When both variables are nominal and this issue arises, we can use the G-Test (Likelihood Ratio). Is there a similar alternative test for when this issue arises with ordinal data? What can I do at this point? Thanks!
I think these answers have missed a critical point.  The question is about the linear-by-linear test, which is for contingency tables with ordered (ordinal) categories.
The test is discussed by Agresti, Introduction to Categorical Data Analysis.
The answer to the original question probably depends on how SPSS performs the test.
Agresti uses a type of linear model, and the test reduces to a G-squared goodness-of-fit test.  I don't know if low cell counts are a problem for this test.
In R, I think the only way to perform the test is by permutation test.
What's also nice about the way the test is conducted in R is that 1) You can have a contingency table with one ordinal variable and one nominal variable.  2) You can specify the distance between the ordinal categories.  That is, they don't need to be equally-spaced.
For the R code, see:
Question
How can I extract two (some) samples out of the original sample? Is there any criteria for that?
Can I just arrange them in ascending order and divide them into two samples with 2500 elements in each?
How can I find the weight of the each sample?
Thank y'all ! :)
Faraz - Let Ali Tohidi and I know if you need further help - the link above should help!
Question
"Age and sex differences in relation between frugality and self efficacy" - Here there are 2 Independent variables - age and sex , each having 2 levels (I intend to keep adolescents and young adults as the 2 age groups). There are 2 Dependent variables - frugality and self efficacy.
Objectives :
1. Are there any age differences in frugality and self efficacy?
2. Are there any sex differences?
3. What is the relation between frugality and self efficacy?
Sample :
N = 100 Male ( 50 - adolescents, 50 - young adults ) ; 100 Females (50 - adolescents , 50 young adults )
Data analysis : I would be greatly obliged if I could get the views of Fellow Researchers and Professors about the statistics that are best suited for this research.
I was wondering if MANOVA would be suitable.
Hi,
I suggest the multivariate covariance generalized linear models (McGLMs). In this framework, you can easily deal with bivariate data (two response variables). As part of the model, you can assess the effect of your covariates (age and sex) in each response variable and compute the correlation between responses taken into account the covariate effects. You can find more details in this paper
The mcglm (https://cran.r-project.org/web/packages/mcglm/index.html) package in R fits McGLMs using an interface similar to the glm function.
Even if your response variables are not Gaussian, you can still use the mcglm package for binary, binomial, bounded, count and continuous responses, including the case of mixed types of response variables. If you decide to pursue this direction, I can provide more examples and assist you in the analysis.
All the best!
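For comparison, the MANOVA mentioned in the question can be sketched in base R as follows. The data are simulated to match the 2 x 2 design with 50 people per cell; the column names and effect sizes are made up for illustration only.

```r
# Simulated data: 2 age groups x 2 sexes, 50 respondents per cell (assumed).
set.seed(123)
d <- expand.grid(id  = 1:50,
                 age = c("adolescent", "young_adult"),
                 sex = c("male", "female"))

# Hypothetical scores for the two dependent variables.
d$frugality     <- rnorm(nrow(d), mean = 50 + 3 * (d$age == "young_adult"), sd = 10)
d$self_efficacy <- 0.4 * d$frugality + rnorm(nrow(d), mean = 30, sd = 8)

# Objectives 1 and 2: age and sex differences on both outcomes jointly.
fit <- manova(cbind(frugality, self_efficacy) ~ age * sex, data = d)
summary(fit, test = "Pillai")   # multivariate tests for age, sex, age:sex
summary.aov(fit)                # follow-up univariate ANOVAs per outcome

# Objective 3: relation between the two outcomes.
cor.test(d$frugality, d$self_efficacy)
```

MANOVA assumes multivariate normal residuals; the McGLM approach above is the more flexible option when that is doubtful.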
Question
I have run my univariate normality test with the rules of -2 and +2 for skewness and kurtosis (George & Mallery, 2010) and my multivariate normality test with the rules of < 3 for skewness and between -2 and 2 for kurtosis (Chemingui & lallouna, 2013). Thank God, the dataset passed all these rules. That means I have normally distributed data. My question is, do I still need to bother about outliers? Or should I just report outliers as not applicable in my study? Even if there are outliers, my data are still normally distributed. What's your opinion?
*I'm using Likert Scale for my entire questionnaire.
Please note that an outlier is an observation that appears to deviate markedly from other observations in the sample. Identification of potential outliers is important for the following reasons:
1) An outlier may indicate bad data. For example, the data may have been coded incorrectly or an experiment may not have been run correctly. If it can be determined that an outlying point is in fact erroneous, then the outlying value should be deleted from the analysis (or corrected if possible).
2) In some cases, it may not be possible to determine if an outlying point is bad data. Outliers may be due to random variation or may indicate something scientifically interesting. In any event, we typically do not want to simply delete the outlying observation. However, if the data contains significant outliers, we may need to consider the use of robust statistical techniques.
Identifying an observation as an outlier depends on the underlying distribution of the data. If the normality assumption for the data being tested is not valid, then a determination that there is an outlier may in fact be due to the non-normality of the data rather than the presence of an outlier.
For this reason, it is recommended that you generate a normal probability plot of the data before applying an outlier test. A box plot is also a useful graphical tool for checking the normality assumption and for identifying potential outliers: points beyond the lower and upper whiskers are the candidates to examine.
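The two graphical checks can be sketched in base R like this. The data are simulated, with two extreme values appended on purpose, so the numbers are purely illustrative.

```r
# Hypothetical sample: 98 roughly normal values plus two extreme ones.
set.seed(7)
x <- c(rnorm(98, mean = 50, sd = 5), 85, 90)

# Coordinates of the normal probability (Q-Q) plot, without drawing it.
qq <- qqnorm(x, plot.it = FALSE)

# In an interactive session the plots themselves would be drawn with:
#   qqnorm(x); qqline(x)
#   boxplot(x)

# Values beyond the box plot whiskers (more than 1.5 * IQR from the hinges).
flagged <- boxplot.stats(x)$out
flagged
```

Values listed in `flagged` are only candidates; whether they are errors or genuine observations still has to be judged as described in points 1) and 2).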
Question
I am using this method to interpret the contribution of each predictor variable to disease occurrence. The data contain 3 outliers, so I cannot use multiple linear regression. What are the assumptions of the probit/logit model?
Parallel regression/ proportional odds assumption: all coefficients on the predictors/independent variables are equal for every category of the outcome. Hence, the slopes of the estimated equations are identical. Brant's test is used to test this assumption.
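As a rough illustration of what the assumption means, the sketch below fits a proportional-odds model with `polr()` from the MASS package and then compares the slopes of separate binary logits at each cut-point. The data are simulated so that the assumption holds; this is only an informal look at the slopes, not Brant's test itself.

```r
library(MASS)  # provides polr(); MASS ships with standard R installations

# Simulated ordinal outcome driven by one predictor with a common slope of 1,
# so the proportional odds assumption is true by construction.
set.seed(11)
n <- 600
x <- rnorm(n)
latent <- 1.0 * x + rlogis(n)
y <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
         labels = c("low", "medium", "high"), ordered_result = TRUE)

# Proportional-odds (ordered logit) model: one slope for all cut-points.
po_fit <- polr(y ~ x, Hess = TRUE)
coef(po_fit)

# Informal check: a separate binary logit at each cut-point.
# Under proportional odds the two slopes should be similar.
slope_cut1 <- coef(glm(I(y > "low")    ~ x, family = binomial))["x"]
slope_cut2 <- coef(glm(I(y > "medium") ~ x, family = binomial))["x"]
c(slope_cut1, slope_cut2)
```

If the cut-point-specific slopes differ markedly, a model that relaxes the assumption (for example a multinomial logit) may be more appropriate.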
Question
In PROC LIFETEST there can't be missing data, because observations with missing values are dropped. Can some of the procedures below be used for survival analysis?
"CALIS Procedure — Fits structural equation models
GEE Procedure — Generalized estimating equations approach to generalized linear models
MCMC Procedure — General purpose Markov chain Monte Carlo (MCMC) simulation procedure that is designed to fit Bayesian models with arbitrary priors and likelihood functions
MI Procedure — Performs multiple imputation of missing data
MIANALYZE Procedure — Combines the results of the analyses of imputations and generates valid statistical inferences
SURVEYIMPUTE Procedure — Imputes missing values of an item in a data set by replacing them with observed values from the same item and computes replicate weights (such as jackknife weights) that account for the imputation" (Sas documentation)
Thank you for the help. The solution was to use PROC MI to impute the data, and then PROC PHREG and PROC MIANALYZE to obtain the combined statistics.
Question
I have 14 treatments that I have analysed using ANOVA. There were no significant differences. Just by looking at the treatment means I am convinced there are differences. So which other statistical test can I use? I even used the Kruskal-Wallis test, since the data are not normal.
In line with Stephen's comment, I would suggest in future that you have a robust analysis plan in place *before* collecting data, to avoid the kind of situation where you are running multiple tests on the same data - there is of course the risk of false positive when you do this.
Question
I am trying to find out the association between risk and disease (education and reporting of NCD) of populations in three different areas (i.e. in a spatial context). There are six population groups. Each pair of population groups belongs to a particular neighborhood, and each pair itself is made up of two population groups; one from higher and one from lower. In some cases the Relative Risk is greater than 1 (significant), indicating causation. What can be the probable explanations if the value is not significant?
A lower confidence limit for the RR above 1 is needed to feel reasonably confident that there is a positive association. An upper confidence limit below 1 is needed for a negative association.
If the confidence interval crosses 1, then the effect you are testing cannot be distinguished from no effect with any statistical confidence. The aim of experimentation is to try and falsify your hypothesis, not to confirm your hypothesis. I'm afraid a negative result does not give any licence to continue with a hypothesis.
If there is independent evidence that would suggest your negative result is wrong the only course open to you is to evaluate your methodology and identify any shortcomings then re-run an improved study.
On a final note, even a significant result does not in and of itself imply causation; it merely indicates association, without any indication of which factor precedes the other. Causation requires a lot more work to demonstrate.
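To make the confidence-interval reasoning concrete, here is a small base-R calculation of a relative risk with its 95% interval, using the standard large-sample formula on the log scale. The counts are invented for illustration.

```r
# Hypothetical counts: disease cases and group sizes.
cases_exposed   <- 30; n_exposed   <- 100
cases_unexposed <- 20; n_unexposed <- 100

# Relative risk: ratio of the two risks.
rr <- (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)

# Standard error of log(RR) (large-sample approximation).
se_log_rr <- sqrt(1 / cases_exposed   - 1 / n_exposed +
                  1 / cases_unexposed - 1 / n_unexposed)

# 95% confidence interval, computed on the log scale and back-transformed.
ci <- exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log_rr)

rr   # point estimate is above 1
ci   # but with these counts the interval spans 1
```

Here the point estimate suggests a higher risk in the exposed group, yet the interval includes 1, so the data are also compatible with no association. Small group sizes are one common reason for this.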
Question
I would like to analyse three breeding grounds, one active, one recently abandoned and one abandoned for several years. The predictors are continuous, rough percentage of the coverage of a particular habitat type. Their distribution is quite skewed, many times only either 100 or 0%. The units are patches, altogether around 300, with 30 predictors. The dependent variable is therefore multinomial (three classes exactly). Which would be a good statistical technique to analyse this situation? I thought of multinomial boosted regression trees but cannot find any good instructions to do it. Thanks!!!
Maximum-accuracy methods work well with this design, and there are no distributional assumptions required. Here are a few intuitive articles (two are open-access) that introduce this method (novometric analysis--Latin for new measurement--handles binary, multicategorical, and ordered "dependent" variables).
Question
I plan to conduct a study that includes one continuous dependent variable (attitudes) and seven categorical independent variables (teaching position (general or special education teacher), gender, level of education, previous inclusive teaching experience, years of teaching, training in inclusive education, and the presence or absence of family members with disabilities).
I will use descriptive research to obtain information about the target population and describe the characteristics of the teachers in my study. The second method is correlational research to determine whether or not there is a relationship, without exploring the cause-effect links, between the dependent and independent variables.
So, I developed three questions and one of them is the following:
R3: Are teachers’ attitudes toward the inclusion of hard of hearing students in general education classrooms in public schools differentiated by factors including current teaching position, training in inclusive education, the teacher’s gender and level of education, previous inclusive teaching experience, years of teaching, and the presence or absence of family members with disabilities?
And for analyzing this question
I was going to use t tests, and one-way ANOVAs to determine the relationship between the independent variables, and the dependent variable as following:
1. The differences in teachers’ attitudes toward the inclusion of students who are hard of hearing based on teaching position (independent t-test to compare differences in group means).
2. The differences in teachers’ attitudes toward the inclusion of students who are hard of hearing based on gender (independent t-test to compare differences in group means).
3. The differences in teachers’ attitudes toward the inclusion of students who are hard of hearing based on level of education (one-way ANOVA to compare differences between group means).
4. The differences in teachers’ attitudes toward the inclusion of students who are hard of hearing based on previous inclusive teaching experience (independent t-test to compare differences in group means).
5. The differences in teachers’ attitudes toward the inclusion of students who are hard of hearing based on the number of years of teaching (one-way ANOVA to compare differences in group means).
6. The differences in teachers’ attitudes toward the inclusion of students who are hard of hearing based on teachers’ training in inclusive education (independent t-test to compare differences in group means).
7. The differences in teachers’ attitude toward the inclusion of students who are hard of hearing based on having family members with disabilities (independent t-test to compare differences in group means).
But some of my colleagues suggest using another analysis model in which I can control for the effects of all the independent variables on the dependent variable. In other words, they indicated that there may be interactions between the IVs that might influence the relationship between one independent variable and the dependent variable, so I have to control for that. I was thinking of using Analysis of Covariance (ANCOVA), but I am a little confused about how to control for the effect of all these independent variables. Any suggestions?
If the dependent variable is continuous, use a t-test when the distribution is sufficiently symmetrical; for an asymmetrical distribution, use the Mann-Whitney U test or a median test.
We can also use a one-way analysis of variance (one-way ANOVA), or we can code the groups as dummy variables and use multiple regression.
If the independent variable is categorical, use analysis of variance or multiple regression, depending on what we want to find out. In that case, comparisons between the means must then be made to determine where the differences between groups lie.
If the independent variable is continuous, it is worth regressing the dependent variable on the independent variable. This means determining whether the influence of the independent variable on the dependent variable follows a straight line (a linear relationship) or a curve (a curvilinear relationship), and examining the cases that deviate from that relationship.
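The dummy-coding and multiple-regression route can be sketched in R as below, where each factor is tested while adjusting for the others, which is what the colleagues' suggestion amounts to. Only four of the seven predictors are shown, and the data, level names and effect sizes are simulated assumptions.

```r
# Simulated teacher data (hypothetical levels and effects).
set.seed(2024)
n <- 240
teachers <- data.frame(
  position  = factor(sample(c("general", "special"), n, replace = TRUE)),
  gender    = factor(sample(c("female", "male"), n, replace = TRUE)),
  education = factor(sample(c("bachelor", "master", "doctorate"), n, replace = TRUE)),
  training  = factor(sample(c("no", "yes"), n, replace = TRUE))
)
teachers$attitude <- 3 + 0.5 * (teachers$position == "special") +
  0.3 * (teachers$training == "yes") + rnorm(n, sd = 0.8)

# One model with all predictors; R dummy-codes the factors automatically.
fit <- lm(attitude ~ position + gender + education + training, data = teachers)

# F test for each factor, adjusted for all the other factors in the model.
drop1(fit, test = "F")

# Optional: test one specific interaction by comparing nested models.
fit_int <- update(fit, . ~ . + position:training)
anova(fit, fit_int)
```

Compared with seven separate t-tests and one-way ANOVAs, a single model avoids attributing to one variable a difference that is really due to another, correlated one.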
Question
I collected data from 107 respondents on their three cognitive styles and five types of behavioral characters. So, every respondent gave me their response on eight variables (three cognitive styles and five behavioral characters).
Now I want to examine the relations among the eight variables. Please suggest what types of statistical test (in SPSS) are suitable for examining these relations.
If you have any material (with data analysis) like the above problem, please share with me.
With Regards,
Surajit Saha
(PhD Scholar, IME Department, IIT Kanpur, INDIA)
Hello Surajit,
As Potlizer-Ahles and Bauer indicate, the underlying question of interest should drive the intended analysis.
For example, if your question was, how do the selected behavioral characteristics relate to the selected style scores, you could use canonical correlation.
If you wanted to know whether individual (one at a time) style scores were predictable from the behavioral characteristics, you could use regression analysis.
If you wanted to know whether style scores were influenced by a different underlying trait than were the behavioral scores, you could try factoring the set of 8 measures.
And so on...
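The question asks about SPSS, but the first two options can be illustrated compactly in R with the base function `cancor()` and `lm()`. The scores below are simulated for 107 respondents; the variable names and the shared latent trait are assumptions for the sake of the example.

```r
# Simulated scores: 3 cognitive styles and 5 behavioral characteristics
# for 107 respondents, sharing one hypothetical latent trait.
set.seed(99)
n <- 107
latent    <- rnorm(n)
styles    <- sapply(1:3, function(i) 0.6 * latent + rnorm(n))
behaviors <- sapply(1:5, function(i) 0.5 * latent + rnorm(n))
colnames(styles)    <- paste0("style", 1:3)
colnames(behaviors) <- paste0("behavior", 1:5)

# Canonical correlation between the two sets of variables.
cc <- cancor(styles, behaviors)
cc$cor   # at most min(3, 5) = 3 canonical correlations, largest first

# Regression: is one style score predictable from the behavioral scores?
summary(lm(styles[, "style1"] ~ behaviors))

# Plain pairwise correlations among all eight variables.
round(cor(cbind(styles, behaviors)), 2)
```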
Question
I have an experiment with 6 factors. When I use BBD I get 54 runs, but I want to increase the number of runs for more accuracy. If I use CCD I get 86 runs, but 3 or 4 of them have negative values. Is it OK to delete these negative runs and carry on with the experiment normally, i.e. does this have no effect on the model? And which is better in this case: BBD with replicates, or CCD?
You should be able to use some type of Optimal Response Surface to minimize the amount of work you have to do. An optimal Response surface will use about 40 samples.
6 factors seems like a lot though. Did you test those 6 factors using a screening design to make sure all those factors are significant? If not, try using a Definitive Screening Design first. 6 continuous factors will use a minimum of 13 runs, though 17 runs would be better. The DSD will give you estimates of the linear and quadratic terms in the model. It can also give you some insight into 2-way interactions. If you find some of the factors do not have any effect, you can eliminate them and move on with what is significant.
Question
I want to run a regression analysis to study the effect of material porosity and PPI on velocity loss in a metal foam. Porosity and PPI are the predictor variables and velocity loss is the dependent variable. I have two separate metal foams: copper and aluminum. I am confused as to whether I should use multilevel regression or nested regression (typical multi-variable regression).
Thank you people. I will try your suggestions.
Question
Hi all,
I want to do multiple imputation at the item level, since several studies have shown the superiority of item-level imputation compared to scale-level imputation.
I want to do a sensitivity analysis. Some authors suggest adding or subtracting a constant to the imputed values. However, many of the applications used continuous scale scores for such an analysis. Since I do imputation at the item level, I could add such a constant; however, I have never seen any comparable application. I know many of the NMAR models (selection models and so on); however, since I am not interested in growth, they do not fit my application (at least I think so).
So, do you think this strategy is reasonable, and do you know of any papers that have used this strategy? Or any other suggestions?
best wishes, Manuel
Dear Manuel,
See if the attached document helps you.
Question
I have a matrix with species as columns and sites as rows and it contains counts of individuals per site per species. It was suggested that due to the high variability in counts (they go from 0 to ~3000), I should transform my matrix. Basically, I am doing multivariate statistics to find differences in community compositions among bioregions.
My knowledge of stats is by no means great, but I do know transformations are used to normalize your data and perform parametric statistical analyses, and you can check whether the transformation applied does this by doing a test (e.g., Shapiro-Wilk). But this is different from what I want to do, and I don't understand why and when transformations should be applied to the data, or how to decide that the transformation chosen is the correct one or the most suited to my data and research question. Is there a rule of thumb for the application of transformations (e.g., if the SD is over 2 or the counts vary by more than two orders of magnitude, etc., the data should be transformed)?
Thanks!
The question stated by Denisse is a quite standard one but not easy to answer. The appropriate answer depends critically on the underlying assumptions for the data and the stated problem. For a problem in species composition, to use a "compositional data approach" is becoming standard. For a general review of the state of the art you can have a look at "Modeling and Analysis of Compositional Data" (Pawlowsky-Glahn, Egozcue, Tolosana-Delgado, 2015)
The main assumption for compositional data is that the relevant information is contained in the ratios between the abundances of species. Therefore, what matters is not the number of individuals measured but the proportions of the species. When this is so, one realizes that the scale of proportions is not absolute but relative. That is, observing 5 individuals compared with 10 individuals is just observing half the abundance; observing 1000 individuals compared with an abundance of 1005 is almost the same. In these circumstances, standard mean values and variances are no longer valid, and a suitable transformation is mandatory in order to move to a new scale which can be assumed to be nearly absolute. If you are interested in compositional information (I think this is your case) the immediate recipe is: "transform your data using an isometric log-ratio transformation (ilr)", so that you represent your proportions by ilr-coordinates. The interesting result is that standard statistics can be safely applied to those ilr-coordinates (see Mateu-Figueras, G., Pawlowsky-Glahn, V. and Egozcue, J. J.: The principle of working on coordinates. In Pawlowsky-Glahn, V. and Buccianti, A. (Eds.) Compositional Data Analysis: Theory and Applications, ISBN-10: 0-470-71135-3, Wiley, Chichester UK, 2011).
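As a minimal sketch of what the ilr transformation does, the base-R code below builds ilr-coordinates from a small made-up site-by-species table. The orthonormal basis is taken from normalized Helmert contrasts, which is one valid choice among many and an assumption of this example; adding 1 to every count is only a crude device to keep the toy data strictly positive and is not a recommended zero treatment.

```r
# Toy site-by-species count table (hypothetical); +1 keeps all counts positive.
set.seed(5)
counts <- matrix(rpois(6 * 4, lambda = 50) + 1, nrow = 6,
                 dimnames = list(paste0("site", 1:6), paste0("sp", 1:4)))

# Closure: turn counts into proportions per site.
props <- counts / rowSums(counts)

# Centered log-ratio (clr): log proportions minus their row mean.
clr <- log(props) - rowMeans(log(props))

# Orthonormal basis of the clr plane from normalized Helmert contrasts.
D <- ncol(props)
V <- contr.helmert(D)
V <- sweep(V, 2, sqrt(colSums(V^2)), "/")

# ilr-coordinates: D - 1 unconstrained real coordinates per site.
ilr <- clr %*% V

# "Isometric": distances between sites are the same in clr and ilr.
all.equal(as.vector(dist(ilr)), as.vector(dist(clr)))
```

Ordinary multivariate methods (PCA, MANOVA, clustering) can then be run on `ilr`. The packages named later in this answer provide ready-made versions of these steps.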
However, your data are counts that may be very low, including zeros, and large counts. Large counts, divided by the number of observations can be identified with proportions or probabilities. However, low counts cannot. If you observe 1000 individuals and you get 0 of species A, it does not mean that the actual proportion for species A is 0 but a small proportion. In these cases with low counts, the counts by themselves are not considered compositional. What is effectively compositional are the probabilities (proportions of each species) in a multinomial sampling. This is what is done in multinomial logistic regression. For more information see
Josep-Antoni Martín-Fernández et al.
Bayesian-multiplicative treatment of count zeros in compositional data sets, Statistical Modelling 2015; 15(2): 134-158
With respect to other transformations. The logit transformation is a limiting case of Box-Cox transformation. The ilr-coordinates are a multivariate generalization of the univariate logit. Componentwise logarithmic transformation produces analyses that are dependent on the total number of counts. No transformation confronts you with spurious correlation if the data is assumed compositional.
Compositional data packages are available in R: "compositions", "zCompositions", "robCompositions". See also the free stand-alone program CoDaPack (teaching software) for exploratory analysis of compositional data.
Question
I am not a statistician and would be very thankful if someone could clarify this for me.
I am reading about the Generalized Additive Models that 'they don't handle interaction well. Rather than fitting multiple variables simultaneously, the algorithm fits a smooth curve to each variable and then combines the results additively, thus giving rise to the name Generalized Additive Models.' (in http://ecology.msu.montana.edu/labdsv/R/labs/lab5/lab5.html)
Could someone give an example of two interacting variables and how they are being handled by GAMs?
Hello,
I am not sure about the "don't handle interaction well" part. I believe that there are actually some structures available in R (mgcv package) that take interaction into account. For instance, you can use a tensor product smooth (which by definition is a bivariate function) with te(). A very simple example for a model of the form y ~ f(x0, x1), where y = x0^2 + x1^3, would be
library(mgcv)
# simulate data: y depends explicitly on x0 and x1
x0 <- rnorm(1000)
x1 <- rnorm(1000)
y <- x0^2 + x1^3
# fit the model
model <- gam(y ~ te(x0, x1))
summary(model)
plot(model)
Since we are dealing with a function of two variables (compared to a single-variable function when there is no interaction), the main difference is that you now get a contour plot.
There are many options available (see the attached link), but I am not familiar with all of them. For more details, I suggest you read chapter 5 of Generalized Additive Models: an introduction with R (Wood, 2010).
Hope this helps!
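One caveat worth illustrating: the simulated response in the example above is a sum of a function of x0 and a function of x1, so a purely additive GAM would fit it just as well. The sketch below uses a made-up response with a genuine interaction (the product x0 * x1 plus noise) and compares an additive fit with a tensor product fit. The simulation settings are assumptions chosen only to make the contrast visible.

```r
library(mgcv)

# Simulated data with a true interaction: y depends on the product x0 * x1.
set.seed(42)
n  <- 500
x0 <- runif(n, -1, 1)
x1 <- runif(n, -1, 1)
y  <- x0 * x1 + rnorm(n, sd = 0.1)

# Additive model: separate smooths, no interaction possible.
additive <- gam(y ~ s(x0) + s(x1))

# Tensor product smooth: one bivariate surface f(x0, x1).
tensor <- gam(y ~ te(x0, x1))

# Compare the fits: adjusted R-squared and AIC.
c(additive = summary(additive)$r.sq, tensor = summary(tensor)$r.sq)
AIC(additive, tensor)
```

In this setup the additive model explains almost none of the variation while the tensor product model captures most of it, which is the sense in which te() "handles" the interaction.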
Question