# Multivariate Analysis - Science topic

A set of techniques used when variation in several variables has to be studied simultaneously. In statistics, multivariate analysis is interpreted as any analytic method that allows simultaneous study of two or more dependent variables.
Questions related to Multivariate Analysis
Question
The experiment was designed as follows:
There are three field plots, each planted with three plant species, with 3 replicates per species.
The species were not randomized within the plots, i.e. they were planted in the same positions in every plot.
One fertilization was applied to each plot. In total there are three plots that received three different fertilizations.
There are two treatments:
The fertilization (3 levels = three plots)
The species (3 levels with 3 replicates)
I want to compare the species between the plots, but the species are not randomized within the plots and there is no replication of the plots.
I have read that I can run an ANOVA for each plot separately, as for different experiments done in different locations, and then apply a combined ANOVA. How should I proceed with the Df decomposition for the combined ANOVA?
This experiment is already finished and I have to do the statistical analysis.
I understand you have one approximately normally distributed outcome
and multiple predictors
- species, 3 levels
- fertilization-plot combinations, 3 levels.
There were also 3 within-plot repeats of species (3x3 plants were planted per plot), so altogether 3x3x3=27 measurements.
Theoretically the plots may influence the outcome, for example plants may have had different outcomes in different plots even under the same fertilization. There may be interaction between plots and fertilization (e.g. if one plot is deficient in a nutrient and not all fertilizers supply this nutrient the fertilizers may have different effects across plots). There may be species-fertilizer and species-plot interactions, too; finally a species-plot-fertilizer interaction.
In your setting it will be impossible to study the plot-fertilizer interaction and the triple interaction - or even to distinguish between the effects of fertilizers and plots. You can disregard any interactions if you have strong reasons to believe they are not present in your setting.
Finally, if you can underpin why plots should have no effect at all in your research setting, you can ignore them. In this case the problem reduces to a 2-way ANOVA.
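As a rough illustration of the 2-way case just mentioned, the degrees of freedom for a balanced 3 x 3 design with 3 replicates per cell can be tallied as below. This is only a sketch: it assumes plots are ignored, that fertilization and species are treated as crossed fixed factors, and the function name `two_way_anova_df` is made up for this example.

```python
def two_way_anova_df(levels_a, levels_b, replicates):
    """Degrees of freedom for a balanced two-way ANOVA with interaction."""
    total_obs = levels_a * levels_b * replicates
    return {
        "factor_a": levels_a - 1,                        # e.g. fertilization
        "factor_b": levels_b - 1,                        # e.g. species
        "interaction": (levels_a - 1) * (levels_b - 1),
        "error": levels_a * levels_b * (replicates - 1),
        "total": total_obs - 1,
    }

# 3 fertilizations x 3 species x 3 replicates = 27 measurements.
df = two_way_anova_df(3, 3, 3)
for source, value in df.items():
    print(f"{source:12s} {value}")
```

The four component rows add up to the total row, which is a quick check that no source of variation has been forgotten.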
Question
I used EPC a lot for my project, but I don't understand how the principal components could have no percentage of variation explained by each one out of the total matrix. I cannot get this with Mesquite.
You can do the analysis in R. There is a function in Liam Revell's phytools package, phyl.pca. I don't know Mesquite so I don't know how to get the eigenvalues in that program.
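Whatever program produces the eigenvalues, the percentage of variation explained by each component is its eigenvalue divided by the sum of all eigenvalues. The sketch below shows this for an ordinary (non-phylogenetic) PCA on two made-up traits, using the closed-form eigenvalues of a 2 x 2 covariance matrix. The data and the function name are assumptions for illustration; this is not what phyl.pca does internally.

```python
import math
import statistics

def pca_variance_shares(x, y):
    """Eigenvalues of the 2x2 covariance matrix and their percentage of total variance."""
    var_x = statistics.variance(x)
    var_y = statistics.variance(y)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    # Closed-form eigenvalues of the symmetric matrix [[var_x, cov_xy], [cov_xy, var_y]].
    half_trace = (var_x + var_y) / 2
    radius = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    eigenvalues = [half_trace + radius, half_trace - radius]
    total = sum(eigenvalues)
    return eigenvalues, [100 * ev / total for ev in eigenvalues]

# Made-up measurements of two correlated traits.
trait_1 = [2.1, 2.9, 3.8, 5.2, 6.1, 6.8]
trait_2 = [1.0, 1.7, 2.1, 2.9, 3.2, 4.1]
eigenvalues, percent = pca_variance_shares(trait_1, trait_2)
print(percent)
```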
Question
I have a question on multilevel analysis. I hope I explain everything clearly, if not please let me know!
I have a dichotomous dependent variable (burglary yes/no), which I try to explain by variables on three levels (house, street and neighbourhood). I am conducting multi level analysis. I just did some cross level interactions, which show a significant influence. With a normally distributed dependent variable, it is now possible to calculate the explained slope variance, to see how much percent of the effect is explained by this variable to see if the influence is indeed relevant. Since for a dichotomous dependent variable it is not useful to calculate the explained variance I used the ICC. But I cannot seem to find anything on how to calculate the ICC for the slope variance. Is this even possible? And if so, how do I do this?
If you are using a multilevel logit model, the level 1 variance is usually taken to be the variance of a standard logistic distribution, that is π²/3 ≈ 3.29. The complication is that this cannot go down, so you get some quite strange effects as you put variables into the model. The Snijders and Bosker multilevel book is very good on this, and you may also want to look at the material in the LEMMA course on discrete outcomes.
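To make the fixed level 1 variance concrete, here is a minimal sketch of the latent-variable ICC for a multilevel logit model. It assumes random-intercept variances only, so it does not answer the slope-variance part of the question, and the variance values are hypothetical rather than taken from any fitted model.

```python
import math

# Variance of the standard logistic distribution, about 3.29.
LEVEL1_VARIANCE = math.pi ** 2 / 3

def latent_icc(higher_level_variances):
    """Share of latent-scale variance located at each higher level."""
    total = LEVEL1_VARIANCE + sum(higher_level_variances.values())
    return {level: var / total for level, var in higher_level_variances.items()}

# Hypothetical random-intercept variances for the two higher levels.
icc = latent_icc({"street": 0.40, "neighbourhood": 0.25})
print(icc)
```

Because the level 1 term is a constant, adding predictors can only change the higher-level variances, which is one way to see the "strange effects" described above.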
There is a specific online resource on this topic at
Question
I am referring to an old work, but I guess most of the multivariate analysis techniques implemented and software systems developed could usefully be re-adapted to modern software environments. Does anyone have knowledge of such work being performed? Does anyone know of such techniques being used in the mining / prospecting industry today?
To help foster the discussion, I have added a paper in English published In book: Use of Microcomputers in Geology, Edition: Print ISBN 978-1-4899-2337-0, Chapter: 3, Publisher: Plenum Publishing Corporation - Springer Science+Business Media New York 1992, Editors: Hans-Kürzl and Daniel F. Merriam, pp.25-71
Question
Dear all (mathematicians and ecologists mainly):
The Renyi spectrum of fractals contains all the main fractal dimensions in a multivariate structure. This structure is very useful for comparative analysis. I want to know which values of Q in this Renyi spectrum correspond exactly to the fractal dimensions of gyration and variance. I have not found this information in any published paper so far.
Many thanks.
Daniel, the work by Renyi on spectra of fractals is well worth careful study. I was surprised to find Renyi's 2006 Socratic dialogue, which reveals his ways of thinking. That Socratic dialogue goes along nicely with the more abstract treatment of spectra of fractals.
Question
Hi Lauro!
Look at the citations from the first one, because they may be important for understanding the work.
Cheers!
Question
If the factor levels are measured values and the measurement instrument has a published error of ± some value, how does one deal with that error? For instance, suppose one of my class variables (factors) is the length of a component and I need to study how that length affects performance at 0.01 cm increments. If I use a vernier caliper that is only accurate to ± 0.01 cm, then my levels potentially overlap.
Is there some way of propagating the uncertainty through the MANOVA?
Steven,
I think the gold standard here would be a bootstrap approach, but I honestly don't know how readily accessible that is for MANOVA in standard software, depending on how much programming you want to do.
Otherwise, a random-effects MANOVA should offer some improvement.
What I'd be very careful about, though, is the possibility of correlations between the measurement error in your predictors and your outcomes.
Pat
Question
I want to calculate odds ratios using multivariate regression. How can I do that using SPSS version 16.0?
Run a binary logistic regression in SPSS with the given dependent variable, include the independent variables as covariates, and define them as categorical. In the output, Exp(B) is the odds ratio for the outcome.
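To show what the Exp(B) column means, here is a small sketch outside SPSS. For a single binary predictor, the logistic regression coefficient B equals the log of the odds ratio from the 2 x 2 table, so exponentiating B recovers the odds ratio. The counts and function names are made up for illustration, and the interval is a plain Wald interval.

```python
import math

def log_odds_ratio_2x2(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Coefficient B and its standard error for one binary predictor, from a 2x2 table."""
    b = math.log((exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases))
    se = math.sqrt(1 / exposed_cases + 1 / exposed_noncases
                   + 1 / unexposed_cases + 1 / unexposed_noncases)
    return b, se

def odds_ratio_from_coefficient(b, se, z=1.96):
    """Exp(B) with an approximate 95% Wald confidence interval."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# Hypothetical counts: 30 of 100 exposed and 15 of 100 unexposed have the outcome.
b, se = log_odds_ratio_2x2(30, 70, 15, 85)
odds_ratio, lower, upper = odds_ratio_from_coefficient(b, se)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

With several predictors in the model, Exp(B) is read the same way, but each odds ratio is then adjusted for the other covariates.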
Question
I have demonstrated that the FA may be a better solution than the Successive Projections Algorithm for variable selection in a multivariate calibration problem. However, I would like to know whether anyone has demonstrated that the FA may be a better solution than other variable selection techniques.
Hi, as a first step you can review the Firefly Algorithm in MATLAB; this code is useful for your research area:
Description
Firefly algorithm for nonlinear constrained optimization.
For simple demo in 2D, please use firefly_simple.m
For unconstrained functions in higher dimensions, please use fa_ndim.m
For nonlinear constrained optimization,
MATLAB 7 (R14)
Question
Currently, I am analyzing data from a small study of 1500 participants. There are only 63 disease cases (asthma). The exposure is a continuous score (questionnaire data). I ran separate regression models to predict asthma using a) the continuous exposure and b) the dichotomized exposure (median split).
My question: Am I right to assume that the sample / case numbers are too limited to run analyses based on tertiles (-> 500 participants per category and fewer than 20 cases, especially in the intermediate category)? I am not aware of any rule saying that 20 cases are too few, but that is what I have often heard. If this is true, can you recommend a reference to support this?
P.S. The estimates are multivariate-adjusted (7 confounders?)
There is no such thing as a "minimal sample size for analysis of a kind". An analysis can be performed on a sample of any size, but the sensitivity (or precision) of the results depends heavily on the sample size. Hence, the "minimal sample size" depends on what precision is enough for your particular problem, i.e. the magnitude of the differences that you want to show to be statistically significant.
So, in your shoes, I'd calculate confidence intervals for the estimated regression coefficients (or, preferably, a confidence set) and look at which coefficients are statistically different from the neutral/"does-not-matter" value. If a coefficient turns out to be statistically indistinguishable from neutral, it would likely mean that the sample size is too small to capture the effect of the respective parameter on the outcome.
On the other hand, one should check the predictive power of the model: if the predictions are good enough (the outcome is correctly predicted in most cases) but some regression coefficients are still "statistically neutral", then the respective predictors likely have no or almost no effect on the outcome (and thus should be dropped from the model).
Question
I have parallelized (on GPU) and used the SPA for variable selection in multivariate calibration problems, and would like to know if there are other parallelized algorithms that have been used in the same context.
Thanks, Guilherme! I will research about them too.
Using a Likert scale, how can I deal with the neutral part of responses?
Question
A Likert scale usually gives five responses: highly disagree, disagree, neutral, agree, highly agree. How should neutral responses be understood?
The neutral response should be given a weight: 3 when the scale is 1, 2, 3, 4, 5, and 0 when the scale is -2, -1, 0, +1, +2.
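The two codings differ only by a shift, so the neutral category lands on 3 or on 0 depending on where the scale starts. A minimal sketch, with a made-up helper name and made-up responses:

```python
def center_likert(responses, points=5):
    """Shift 1..points Likert codes so that the neutral midpoint becomes 0."""
    midpoint = (points + 1) // 2  # 3 on a five-point scale
    return [r - midpoint for r in responses]

raw = [1, 2, 3, 4, 5, 3]
centered = center_likert(raw)
print(centered)  # [-2, -1, 0, 1, 2, 0]
```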
Question
Hello, I tried using binary logistic regression in SPSS, and the following error message is shown:
Warnings
The dependent variable has more than two non-missing values. For logistic regression, the dependent value must assume exactly two values on the cases being processed.
This command is not executed.
The DV has 5 categories:
1,00 = No player
2,00 = Social player
3,00 = Risk player
4,00 = Pathologic player
5,00 = Clinical pathological player
Best regards
Do not try to force a specific statistical model on your data set. It is the other way around; the experiment dictates how to analyze the data.
What are the independent variables (X's) in your study? I will assume that they are all continuous. Given what we know about the dependent variable (Y), I suggest these possible ways of modeling:
1. Do a Chi-square test on the dependent variable Y only. This will give you some information on the distribution of the values of Y.
2. Do the multivariate analysis of variance F test where Y is the treatment factor and the X's are the response variables. This will be the main statistical analysis that you need. Test for normality first.
3. Do a logistic regression with 0=no player 1=otherwise.
4. Do a correlation analysis on the X's.
Use all of the above to learn about the experiment. Each part gives you some information.
Question
The GENLINMIXED analysis covers a wide variety of models, from simple linear regression to complex multilevel models for non-normal longitudinal data but it is difficult to run and analyze.
Question
In order to use the most important variables among 100+, I am thinking to use factor analysis or PCA. Therefore I need to know if any differences exist between factor analysis and PCA. Thank you in advance.
The term factor analysis refers to a group of different but related statistical methods. The main distinction is between two methods, principal component analysis (PCA) and factor analysis (FA). These two methods are similar and are often used interchangeably in research. Both reduce a large set of variables to as few dimensions as possible that account for most of the variance in the dataset (relying on linear correlation). In PCA the original variables are transformed into a smaller set of linear combinations using the entire variance of the dataset, whereas in FA the dimensions are estimated by mathematical modelling in which only the shared variance is analysed. Both approaches yield similar results; however, PCA seems to be more popular. Some authors state:
“Tabachnick and Fidell (1996, pp. 662-663), in their review of PCA and FA, conclude that: ‘If you are interested in a theoretical solution uncontaminated by unique and error variability, FA is your choice. If on the other hand you want an empirical summary of the data set, PCA is the better choice’ (p. 664).”
Pallant J (2001) SPSS Survival Manual: A Step By Step Guide to Data Analysis Using SPSS for Windows (Version 10). 1st edn. Open University Press, Philadelphia, pp. 151-152
Question
The electronic edition would also be ok.
I have this version with the instruction manual. If you want, I can send it to you by email. It is a very easy and excellent program.
Question
Validating logistic regression models.
I derived a logistic regression model for predicting clinical response to a drug. I have two groups: responders and non-responders, classified according to a clinical scale. The predictors of the model were two polymorphisms and three environment variables such as smoking. I generated the predicted probabilities using logistic model for all patients. Then using predicted probabilities i constructed a roc curve for responders vs non-responders.
No. Validation necessarily requires new data. The predictive performance is far overestimated when it is evaluated on the same data that were already used for fitting.
In the foggy in-between there are methods like cross-validation (CV), for instance leave-one-out CV.
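To make the leave-one-out idea concrete, here is a minimal sketch. It deliberately uses a toy one-predictor threshold classifier instead of a logistic regression, and made-up data, so only the resampling logic carries over: each case is predicted by a model that was refitted without that case.

```python
def fit_threshold(scores, labels):
    """Toy classifier: cut halfway between the two class means."""
    ones = [s for s, y in zip(scores, labels) if y == 1]
    zeros = [s for s, y in zip(scores, labels) if y == 0]
    mean_1, mean_0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    return (mean_0 + mean_1) / 2, mean_1 >= mean_0

def predict(model, score):
    threshold, higher_is_positive = model
    return int((score > threshold) == higher_is_positive)

def apparent_accuracy(scores, labels):
    """Accuracy on the same data that were used for fitting (optimistic)."""
    model = fit_threshold(scores, labels)
    return sum(predict(model, s) == y for s, y in zip(scores, labels)) / len(scores)

def loo_accuracy(scores, labels):
    """Leave-one-out accuracy: every case is predicted by a model fitted without it."""
    hits = 0
    for i in range(len(scores)):
        train_scores = scores[:i] + scores[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = fit_threshold(train_scores, train_labels)
        hits += predict(model, scores[i]) == labels[i]
    return hits / len(scores)

# Made-up predictor values and responder status (1 = responder).
scores = [0.10, 0.25, 0.40, 0.60, 0.45, 0.55, 0.75, 0.90]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
apparent = apparent_accuracy(scores, labels)
loo = loo_accuracy(scores, labels)
print(apparent, loo)
```

With a flexible model and several predictors the gap between the two numbers is usually larger than in this toy case, which is the overestimation described above.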
Question
Suppose X ~ N(µ, σ^2). Then the sample mean is unbiased and is also the MVUE for the population mean µ, and hence COV(sample mean, U_0) = 0, where U_0 is any statistic with E(U_0) = 0. Now, if I consider the sample median, which is also unbiased for the population mean, then E(sample mean - sample median) = 0. If we now consider COV(sample mean, {sample mean - sample median}), what will the result be? Is it zero? I am facing a problem in this regard. Please explain this to me, if possible.
Dhruba, as a skeptic (from intuition) about variances and covariances, I have no good arguments to contribute to the question, though I should have them. At the moment I am reading an article that is related to the topic: http://www.phil.vt.edu/dmayo/personal_website/Error_Statistics_2011.pdf by Deborah Mayo and Aris Spanos. It questions why some important statistical theories are centered on "error" rather than on "data", and it deals with the same terms as your question. I hope it helps. Thanks, Emilio
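The question can also be checked numerically. Since (sample mean - sample median) has expectation zero, the property stated in the question implies that its covariance with the sample mean should be zero for normal data. The simulation below is a sketch of such a check; the sample size, number of replications, parameter values and seed are arbitrary choices, and the function name is made up.

```python
import random
import statistics

def simulate_covariance(n=11, replications=20000, mu=5.0, sigma=2.0, seed=1):
    """Monte Carlo estimate of Cov(mean, mean - median) for normal samples of size n."""
    rng = random.Random(seed)
    means, differences = [], []
    for _ in range(replications):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = statistics.fmean(sample)
        means.append(mean)
        differences.append(mean - statistics.median(sample))
    mean_m = statistics.fmean(means)
    mean_d = statistics.fmean(differences)
    cov = sum((m - mean_m) * (d - mean_d)
              for m, d in zip(means, differences)) / (replications - 1)
    return cov, statistics.variance(means)

cov, var_mean = simulate_covariance()
print(cov, var_mean)  # cov should be close to 0, var_mean close to sigma^2 / n
```

Equivalently, Cov(sample mean, sample median) should come out close to Var(sample mean) = σ²/n.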
Question
I have checked whether my ordinal scale-dependent variable has a normal distribution. The Kolmogorov-Smirnov test says it is acceptable, but the Shapiro-Wilk test says it is not. My ranking has a scale of 1 to 150. Please see attached files for tests of normal distributions.
Hi Jason, when ranks are based on a pattern (such as a Likert scale) or on judgments (such as a VAS), the ranks behave like scores. Also, you can use latent variable (random effect) models to account for the discreteness of the ranks. A main reference is Agresti, Categorical Data Analysis (pages 277-278).
Question
I would like to work with well log data and core data using multivariate methods in conjunction with fuzzy logic and neural networks. I need some ideas or a research problem.
Yes, thank you very much, but I already work with all these features, log data and imagery, and my question is: how can all these data be used to address a specific problem in the oil industry using artificial intelligence? What is the main parameter to predict from these data?
How does the output differ between one-way ANOVA and regression?
Question
Regression and one way anova, similarities and differences?
I think that from one point of view the methods are identical. However, the conceptual bases of the two are rather different. In ANOVA you partition the variance into various components and then focus on the one that interests you. In regression, on the other hand, you have some given data and you try to fit the best line or curve.
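The "identical from one point of view" part can be seen in the simplest case of two groups: regressing the response on a 0/1 dummy variable gives a slope equal to the difference in group means, and the regression F statistic equals the one-way ANOVA F statistic. The data and function names below are made up for this sketch.

```python
import statistics

def one_way_anova_f(groups):
    """F statistic from the between/within partition of the sum of squares."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(all_values)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - statistics.fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def simple_regression_f(x, y):
    """Least-squares slope and F statistic for a regression of y on a single x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_regression = sum((intercept + slope * a - mean_y) ** 2 for a in x)
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return slope, ss_regression / (ss_residual / (len(x) - 2))

group_a = [4.1, 5.0, 5.6, 4.7]
group_b = [6.2, 6.9, 5.8, 7.1]
f_anova = one_way_anova_f([group_a, group_b])
dummy = [0] * len(group_a) + [1] * len(group_b)  # group membership coded as 0/1
slope, f_regression = simple_regression_f(dummy, group_a + group_b)
print(f_anova, f_regression, slope)
```

With more than two groups the same equivalence holds when the factor is coded with one dummy variable per non-reference level.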
Question
I have 15 treatments. My main interest is to find the best treatment. The response is measured every day up to 30 days. My model has an interaction effect between time and treatments. I use suitable effect size, but I need power of 80% with type I error 5%. How can I calculate sample size by simulation?
You have to decide on the minimum difference between two treatments that you expect.
If the variable is qualitative, it will be expressed in percentages, and if the variable is numeric, it will be expressed as a mean and SD.
If you need further clarification, send me your proposal or synopsis at this email address: waqas341@gmail.com
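As for the simulation part of the question, the general loop is: pick a sample size, simulate many data sets under the assumed effect, test each one, and take the rejection rate as the estimated power; then increase the sample size until the target is reached. The sketch below is a heavy simplification: it compares only two treatments on a single endpoint with a large-sample z-test, not 15 treatments over 30 days with an interaction. The difference, SD and all function names are assumptions.

```python
import math
import random
import statistics

def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return mean, sum((v - mean) ** 2 for v in values) / (n - 1)

def simulated_power(n_per_group, difference, sd, alpha=0.05, replications=1000, seed=1):
    """Share of simulated two-group trials in which a two-sided z-test rejects."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(replications):
        a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        b = [rng.gauss(difference, sd) for _ in range(n_per_group)]
        mean_a, var_a = mean_and_variance(a)
        mean_b, var_b = mean_and_variance(b)
        z = (mean_b - mean_a) / math.sqrt(var_a / n_per_group + var_b / n_per_group)
        rejections += abs(z) > z_crit
    return rejections / replications

def smallest_n(difference, sd, target_power=0.80, start=5, step=5, max_n=500):
    """First sample size per group on the grid whose simulated power reaches the target."""
    n = start
    while n <= max_n:
        if simulated_power(n, difference, sd) >= target_power:
            return n
        n += step
    return None

# Assumed minimum difference of 1.0 with SD 2.0 (standardized effect 0.5).
n_needed = smallest_n(difference=1.0, sd=2.0)
print(n_needed)
```

For the real design, the data-generating step and the test inside the loop would be replaced by the longitudinal model that is planned for the analysis.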
Question
1. I am working on some research on female unemployment and want to use multilevel modelling techniques. My residual variance is .043 for an empty model, but when I include some explanatory variables at level 1 it increases. I cannot interpret this. Please guide me on how to interpret it, and also on whether it is better for it to become smaller or to increase.
Some results are attached for your information.
2. Are there any criteria for selecting the explanatory variables to be included at level 2?
Re question 2, I follow the procedure described in Zuur et al. (2009).
Zuur, A., Ieno, E. N., Walker, N., Saveliev, A. A. and Smith, G. M. (2009). Mixed effects models and extensions in ecology with R, Springer.
Question
I have the dependent variable Diabetes Mellitus categorized as 1 = Normal, 2 = Pre-Diabetes, 3 = Diabetes.
I have applied ordinal logistic regression for multivariate analysis.
Independent variables are;
Heart Disease (Binary), BMI (Ordinal), Central Obesity (Binary), Sex (Binary), Hypertension (Binary), Age (Continuous), Income (Continuous), Number of Cigarettes smoked per day (Continuous), family History of Diabetes (Binary).
Am I using the right statistical procedure?
Interestingly, in the regression model all independent variables are insignificant, but R square is 0.52. When I remove the variable 'number of cigarettes smoked', R square drops to 0.42. The same goes for HTN and Heart Disease.
Can anyone guide me on what is going on? Is it collinearity, confounding or something else? And how do I resolve it?
You need 30 patients per predictor variable in your dataset, otherwise you may be experiencing sample size bias. I am puzzled as to why BMI is ordinal. How did you establish the cut-off values for the three or more groups? You decrease robustness converting continuous to nominal or ordinal. Age and hypertension probably have a degree of collinearity, as well as age and income, Income and tobacco use. Effects screening and measures of association would be my next steps.
Question
I would like to use the annealing data set from the UCI repository. However, the description of the data does not match the contents. Treating '?' as NaNs and, e.g., removing features that are mostly NaN leaves only 9 features... Does anybody have a clean annealing data set? Thanks!!!
You can apply data cleaning methods from various tools like Weka or Matlab.
NaNs can be replaced by taking the average or median row-wise. I don't know precisely what your situation is, but things can be sorted out with these tools.
Also, I am in favour of Marcelo's opinion: if all else fails, try contacting the person who uploaded the dataset; it will be a worthwhile experience.
Question
I conducted a driving simulator study in which each participant (32 participants in total) passed each of the four infrastructural conditions in a randomized order. The road segment near the infrastructural condition was subdivided into ten sections of 50 meters. For each section, we recorded the mean speed and the mean lateral position for each participant. My dataset thus has 40 columns (4 conditions x 10 road sections).
Based on this research design, I would like to perform a 4 (condition) x 10 (section) within-subjects MANOVA for mean speed and mean lateral position. In SPSS I run a GLM Repeated Measures with two within-subject factors (condition and section), which have 4 and 10 levels respectively. The measure names are speed and LP (from lateral position). Then I select my columns and drag them to the field "within-subjects variables".
In my SPSS output I find two tables which attract my attention: first there is the table called “Multivariate Tests” and second a table “Multivariate” under the heading “Tests of within-subjects effects”. Because the study has a full within-subjects design, my question is “Which table do I have to use in my analysis description?”. It is important to note that some of the test statistics differ between both tables and that some cells of the table “Multivariate Tests” are empty because SPSS “Cannot produce multivariate test statistics because of insufficient residual degrees of freedom”.
Can someone explain the difference between the two tables (Multivariate Tests and Tests of within-subjects effect_Multivariate) and which table is preferable to use in my data analysis?
If you are conducting a repeated measures analysis, then you are just doing an ANOVA; a MANOVA would begin with a Multivariate GLM. Within-subjects effects is what you would need to focus on.
Question
I'm currently working with macrobenthic communities in Admiralty Bay (King George Island ~ South Shetlands, Antarctic Peninsula) using a functional grouping approach to elucidate their assemblages.
Some preliminary research in one of its inlets (Mackellar Inlet) indicates that ocean currents are an important variable that strongly influences the coupling of the sampling stations. In order to run a multivariate analysis, I would like to incorporate these values into the analysis. Is there any special treatment that I have to apply? Any further suggestions or recommendations?
Hi there, you could use Principal Component Analysis to see how much environmental variability among your sampling stations is explained by your abiotic parameters of interest (current velocity, speed and others). You may explore several ordination analyses using CANOCO. If the question is whether the variability of those abiotic parameters is indeed important in explaining the variability of your community data, then you could correlate your biotic matrix with the environmental one. For that I would suggest using BIOENV, a test available in PRIMER. Hope this helps, good luck. Aldo
Question
I'm working with a table of demographic information about my patients and controls, such as job information and whether or not they have been exposed to heavy metals, to pesticides or to any water body such as rivers, lakes, etc.
I was told that multivariable analyses could give me an idea of which of these parameters could've been a more important source of exposure to heavy metal.
Question
I have a problem solving a double integral with a Jacobian transformation. I've read a journal article, but it uses the transformation beta=beta.
Could I solve the integral using this transformation? Thanks for your help.
The variables of integration are x and beta.
I would like to evaluate the double integral by the Jacobian method, but the journal article I read explains that the transformation is y=x^beta together with the transformation beta=beta. The latter is called a companion transformation. I do not understand the transformation beta=beta. Do you know much about companion transformations? Please help me to solve this problem. Thank you.
Question
Hi everybody,
I'm trying to apply the methodology of the paper by Wang et al. to select some OTUs that I will put in a multivariate analysis. I performed the canonical analysis, but I'm a beginner with R and I don't understand how to use the envfit function with my data. Could somebody help, please?
The article says: "We performed CCA using the CCorA function in the vegan package (software R version 2.7) to detect the interactions between the selected metadata and the given microbiota dataset at OTU level (100 OTUs) and used the envfit function to get the p-value of correlation of each variable with overall bacterial communities and the p-value of each correlation between each OTU and all variables"
Best Regards, Vanina
The envfit function needs two pieces of data, your initial data matrix and a table of metadata that groups the individual samples according to some environmental groupings. Usually in R the hardest part of an analysis is getting your data into R, once you have the data the actual analysis is very simple.
Since you've done your CCA already, you should have a matrix of your data, probably an OTU table. This should be in the format of samples as rows, and OTU abundances as columns. You then need a table that lists your environmental data (pH, temperature, salinity or whatever), where the rows of the table correspond to the rows of your OTU table, and each column reports a different environmental factor.
As a completely made-up example, this code will make a quick OTU table called 'myTable'. In this table the first 10 rows contain values between 1 and 10, and rows 11 - 20 contain between 15 and 25. This is meaningless data, but gives a strong difference between these row groupings.
myTable = matrix(nrow=20,ncol=10)
for(i in 1:10) {myTable[i,] = sample(1:10,10)}
for(i in 11:20) {myTable[i,] = sample(15:25,10)}
Then this code will make a data frame that contains some metadata about the OTU table. In this, the first column of my metadata table groups the rows into two groups, 1-10 and 11-20. The second column groups all the odd-numbered rows together and all the even-numbered rows together. Again, this is meaningless but gives us one metadata grouping that should provide a very strong signal and one with a very weak, insignificant signal.
myMetadata = data.frame(FirstColumn = rep(c("Group1", "Group2"), each = 10), SecondColumn = rep(c("Odd", "Even"), times = 10), stringsAsFactors = TRUE)
To run the envfit, all you need to do is call the command(s):
library(vegan)
envfit(myTable ~ myMetadata$FirstColumn, data = myMetadata, perm=1000)
This will perform the envfit function against your OTU table, grouping the individual samples (rows) according to how they are grouped in the first column of myMetadata. The perm parameter specifies how many random permutations to test for assigning significance to the vectors. You should see a high R2 value with strong statistical significance (p-value). If you test the data using the other column:
envfit(myTable ~ myMetadata$SecondColumn, data = myMetadata, perm=1000)
You should see an extremely weak (~0) R2 with no significance.
With your own data, you'll just need to step through each metadata column in order to record the fit/significance of each parameter. I hope that helps. If you need anything else just let me know.
Question
We performed some analyses with multivariate logistic regression (several continuous independent variables and a single dichotomous dependent variable).
A referee reviewing our paper asked us to show the effect sizes of each predictor, but (s)he was satisfied neither with ORs (which I agree are unstandardized) nor with standardized B coefficients.
Which standardized effect-size is best to calculate and report for predictors of multivariate logistic regression?
Thank you!
Following Hosmer and Lemeshow (2000), my preferred effect size indicator for multivariate logistic regression is the area under the receiver operating characteristic curve.  In this analysis, the predicted values (estimated logits) are the predictors and the outcome is the dichotomous outcome.  Note that it does not matter whether the predicted values are probabilities or logits, as the AUC is an ordinal measure, such that the rank order is all that matters. I prefer the AUC over pseudo-r because the variance is inversely proportional to the base rate, making the proportion of variance accounted for dependent on factors related to study design.
Karl.
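The point that only the rank order matters can be checked directly. In the sketch below the AUC is computed as the proportion of (positive, negative) pairs in which the positive case has the higher predicted value, with ties counted as one half, and the result is the same for probabilities and for their logits. The predicted values and outcomes are made up.

```python
import math

def auc(scores, outcomes):
    """AUC as the chance that a random positive case outranks a random negative case."""
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def logit(p):
    return math.log(p / (1 - p))

# Hypothetical predicted probabilities and observed dichotomous outcomes.
probabilities = [0.10, 0.35, 0.40, 0.55, 0.62, 0.80, 0.91]
outcomes = [0, 0, 1, 0, 1, 1, 1]
auc_prob = auc(probabilities, outcomes)
auc_logit = auc([logit(p) for p in probabilities], outcomes)
print(auc_prob, auc_logit)
```

Any strictly increasing transformation of the predicted values leaves the pairwise comparisons, and therefore the AUC, unchanged.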
Question
To my knowledge, you can obtain more than 2 cut-offs if you use a scatterplot. I don't think the same can be achieved using a ROC analysis, since it has a binary function. Any suggestions?
Rabin, a parallel discussion emerged since ours, which I'm confident that you will find interesting, and perhaps useful:
Question
I have 3 Dependent Variables and 3 Independent Variables in a study. Some of my IVs are categorical and some of them are continuous. I think I need to run a higher-order factorial MANOVA instead of performing MANOVA 3 times separately for each of my independent variables. Is there any simple reference explaining and interpreting the output of higher-order factorial MANOVA in SPSS?
I'm responding to your statement that you "need to run a higher-order factorial MANOVA instead of performing MANOVA 3 times separately for each of my independent variables".
First, I think you meant to say for each of your dependent variables.
Second, when I see people using MANOVA, I always wonder if they are using it as a precursor to univariate ANOVAs, and if they think it is buying them some protection against Type I error.  If that is why you are doing MANOVA, I suggest you read the classic 1989 article by Huberty & Morris (link below).
HTH.
Question
Hi,
I have a data set with 90 variables and more than 6000 observations. I am going to use a subset selection method before starting the analysis, but the problem is that regsubsets() is very slow. Even with really.big=TRUE it gives no output; after waiting 2 hours I gave up and started to look for alternative techniques in R. What other methods or techniques are there?
Thank you,
Idris
I understand that you are using regsubsets() from the {leaps} package. If that is true, you're not simply subsetting your data based on self-specified criteria.
The answers you have been given so far concern regular subsetting of data with the general subsetting functions subset() and select(), from {base} and {dplyr}, respectively.
This shows how important it is to phrase questions as clearly as possible. It also helps to be precise when naming a function or package you are working with, because you might otherwise mislead readers.
With regsubsets() you select a number of best fitting linear models based on something like the Bayesian or Akaike Information Criterion (BIC, AIC). That can help you in subsetting your set of possible variables for your model.
But it is vastly different from normally subsetting your dataset. So to give you a clear answer, we would need a more precise question. What are you really trying to do? If you just want to subset your data, use the base function subset(). If you still need a faster method for that (simple subsetting of data), have a look at this post:
Best,
Angelo
Question
Which other discrete time probability distributions can be used instead of binomial distributions?
You cannot. Don't you know the difference between the Poisson and geometric distributions?
Question
I am trying to find any paper that proposes a Multiobjective Firefly Algorithm for Variable Selection, but I am unable to find. If anybody knows a paper related to this issue, please inform me.
Question
Theorem: Let X be a bivariate random variable with distribution function F. Let A and B be two nonsingular 2×2 matrices such that A⁻¹B or B⁻¹A has no zero element. If the components of BX and AX are independent, then F is a bivariate normal distribution function.
How do we construct bivariate normality test using the above theorem ?
Choose two non-singular matrices A and B such that A⁻¹B has no zero element. If you have data from F, first look at the Pearson correlation coefficient between AX and BX. If it is around 0, you may say that the hypothesis "H0: X has a bivariate normal distribution" cannot be rejected. You can construct the critical values for different significance levels by Monte Carlo simulation. But I think that it is not a good idea to use the above theorem to test bivariate normality, because choosing A and B is a big problem.
Question
Short run - Granger Causality Test
I do agree with Amer: if you could explain with an example, then we can discuss. I have used it in the Indian stock market.
Question
I have got a model with one continuous dependent variable and 100 categorical predictors (candidate SNPs) with 3 levels each (homozygous for one allele, heterozygous, homozygous for the other allele) and 288 observations. What is the best method to select a more parsimonious model (with, say, just 5-20 independent variables)?
I work on discriminating many small samples, so evaluating the best model is a difficult problem. I developed a "k-fold cross validation" approach and chose K=100. I choose as the best model the one with the minimum mean error rate in the validation samples. See "Comparison of Linear Discriminant Function by K-fold Cross Validation. Data Analytics 2014". In regression analysis, you would choose as the best model the one with the minimum sum of squared deviations in the validation samples.
Question
I am analysing 1 group of 12 subjects. The design was 2x2 (arm role x visual feedback), repeated measures (unfortunately condition order was not randomised). I have 24 trials per condition. I have 7 response variables. I want to determine which response variables co-varied with condition. Which stats method is most appropriate? Thank you.
A quick answer is difficult simply because there are various ways of looking at this and the choice of method can make a practical difference. The issue is how many methods it takes to get a good understanding of the data. If you are able to look at Wilcox (2012, Introduction to Robust Estimation and Hypothesis Testing, Elsevier) you will get an idea about what I mean.
Question
Let's say, we have treatments A and B (decided by experimenter) that we use as response variable and factors 1, 2, 3...n that we use as predictors (measured variables). Intuitively, this is not correct, because we should model the outcome, not the controlled factor, but classification methods that are based on very similar math still do that. Any ideas/references?
Excellent! I will discuss this with a colleague of mine (who actually owns the data) and try to convince her that this is exactly what we need.
I really appreciate your offer. More later/E
Question
Can someone provide me a link maybe a toolbox for multivariate analysis in MATLAB? How to install it later on?
Question
Here's a slight upshot of my problem, thank you :)
That's a Cauchy-Schwarz inequality.
Question
Hi.
I ran a PCA with 5 variables, and it seems that I should retain only one PC, which accounts for 70% of the variation. The PC2 eigenvalue is 0.9.
I was wondering:
1- if it makes any sense to use varimax rotation in this particular case retaining only one PC
2- in case I retained two PC, should I rotate the whole loadings matrix (with the five PC) or just those I retain?
Thanks!
David
The second eigenvalue is less than one, indicating that it accounts for less variance than a single variable (the theoretical basis of Kaiser's stopping rule). Thus, eigenvectors 2 through 5 are all scree (error), and you have a one-factor solution. For interpretative analysis the factor loading coefficients indicate the correlation between each variable and the factor, and corresponding R-squared values indicate the relative contribution of each variable to the overall constitution of the factor. 
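As a rough illustration of the stopping rule described above: assuming the PCA was run on the correlation matrix, the five eigenvalues sum to 5, so a first component explaining 70% has an eigenvalue of about 3.5. Only the first two eigenvalues below come from the question; the last three are hypothetical values chosen so that the total is 5.

```python
def kaiser_summary(eigenvalues):
    """Return (component, eigenvalue, share of variance, retained?) per component."""
    total = sum(eigenvalues)
    rows = []
    for i, ev in enumerate(eigenvalues, start=1):
        # Kaiser's rule: keep components whose eigenvalue exceeds 1,
        # i.e. that explain more variance than one standardized variable.
        rows.append((i, ev, ev / total, ev > 1.0))
    return rows


# 3.5 and 0.9 follow from the question; 0.3, 0.2, 0.1 are hypothetical.
eigenvalues = [3.5, 0.9, 0.3, 0.2, 0.1]
summary = kaiser_summary(eigenvalues)
retained = [component for component, _, _, keep in summary if keep]
```

With these numbers only the first component is retained, and with a single retained component there is nothing left for a rotation to redistribute.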
In multi-factorial solutions, if there is an a priori theoretical reason to hypothesize that the factors are independent (uncorrelated) then an orthogonal rotation to achieve simple structure is appropriate. However, if there is an a priori theoretical reason to hypothesize that the factors are correlated, then an oblique rotation is appropriate. Only retained factors are rotated. [1,2]
As is true of all linear models, it is important to evaluate (when possible) whether the underlying data (correlations) and/or the results (eigenvalues) are paradoxically confounded. [3,4]
REFERENCES
 Bryant, F. B., & Yarnold, P. R. (1995). Principal-components analysis and confirmatory factor analysis. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding multivariate statistics (pp. 99-136). Washington, DC: American Psychological Association.
 Yarnold, P.R. (1996). Characterizing and circumventing Simpson’s paradox for ordered bivariate data. Educational and Psychological Measurement, 56, 430-442.
Question
For my Matlab code, as soon as the number of random variables becomes 3, the acceptance rate of MCMC using the Metropolis-Hastings algorithm drops to less than 1%.
You need to choose a different proposal density. If that does not help, you may want to use some kind of adaptive MCMC algorithm.
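To show how strongly the proposal density can drive the acceptance rate, here is a small random-walk Metropolis sketch in Python (not the asker's Matlab code). The 3-dimensional standard normal target, the step sizes and the seed are assumptions made purely for illustration; the step 2.38/sqrt(d) is a commonly quoted rule of thumb for Gaussian-like targets, not something taken from the thread.

```python
import math
import random


def log_target(x):
    """Log density (up to a constant) of an illustrative standard normal target."""
    return -0.5 * sum(v * v for v in x)


def acceptance_rate(step, dim=3, n_iter=5000, seed=0):
    """Run random-walk Metropolis with isotropic normal proposals of sd `step`."""
    rng = random.Random(seed)
    x = [0.0] * dim
    lp = log_target(x)
    accepted = 0
    for _ in range(n_iter):
        proposal = [v + rng.gauss(0, step) for v in x]
        lp_prop = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x))
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = proposal, lp_prop
            accepted += 1
    return accepted / n_iter


rate_too_wide = acceptance_rate(step=10.0)
rate_tuned = acceptance_rate(step=2.38 / math.sqrt(3))
```

A proposal that is far too wide for the target is rejected almost always, which matches the symptom described in the question. An adaptive scheme would adjust `step` (or a full proposal covariance) during burn-in instead of fixing it by hand.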
Question
I am trying to perform a partial CCA in CANOCO 4.5. When choosing some groups of variables as variables and the rest of the environmental variables as covariables to calculate the net effect of the group, I get the error message "No explanatory variables remained" and the analysis fails. The variables in the groups could be numerical or dummy coded categorical, it happens in either case. When regrouping the respective group to another, it does increase the effect of the other group, so there should be some explanatory power! I already checked for linear combinations. All variables concerned were significant in forward selection. Does anybody have an idea what could be wrong with the data?
I meant environmental variables, but actually I already got some help, it was caused by linear dependences between some of the environmental variables. Thank you for answering!
Question
The current toolbox solves for continuous variables, and that's the default algorithm. Can we customize it and ask the GA to solve the same problem by defining intervals on the design variables?
Malay is right. Almost any metaheuristic may work. Just be careful with the operators. Start with a basic GA and check whether you need to improve it or not. If you are using a commercial toolbox, I do not know how to make the customization.
Question
Testicular volume & scrotal circumference of twelve bulls were measured repeatedly. The bulls were divided into two categories (young versus old). We are interested in investigating the effect of age group (Independent variable) on testicular volume & scrotal circumference (dependent variables). Can someone please advise which test we should use? I think repeated measures MANOVA?
SAS performs both univariate and multivariate repeated measures analyses. It can assist you in analyzing these repeated measures data.
Question
I don't have a good basis upon which to express my data statistically. I would like to learn more about multivariate analysis. The courses that I attended during my graduate studies were very advanced since I did not have any statistics background at that time. I can't go back to school now, so I would be grateful if someone could guide me or suggest the best way to teach myself the basics of ecological analysis. Thanks.
Hello.
I would suggest taking basic statistics courses on Coursera (coursera.org). There is also a very nicely written, easy to understand statistics book by Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", which includes the basics of both univariate and multivariate statistics.
Question
The different units being cm, kg, etc.
The determinant is related to scale; so choosing cm vs m or mm matters; as does kg vs g. The value of the determinant will be different if you use different units.
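A small worked example of this point, using a hypothetical 2x2 covariance matrix (one variable in cm, one in kg). Rescaling a variable by a factor c multiplies its row and column of the covariance matrix by c, so the determinant changes by c squared, whereas the determinant of the correlation matrix does not change.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def rescale_cov(cov, scales):
    """Covariance matrix after multiplying variable i by scales[i]."""
    size = len(cov)
    return [[cov[i][j] * scales[i] * scales[j] for j in range(size)]
            for i in range(size)]


def to_correlation(cov):
    """Correlation matrix implied by a covariance matrix."""
    size = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(size)]
            for i in range(size)]


# Hypothetical covariance matrix: first variable in cm, second in kg.
cov_cm_kg = [[25.0, 12.0], [12.0, 16.0]]
# Same data with the first variable expressed in m instead of cm.
cov_m_kg = rescale_cov(cov_cm_kg, [0.01, 1.0])

det_cm = det2(cov_cm_kg)
det_m = det2(cov_m_kg)
det_corr_cm = det2(to_correlation(cov_cm_kg))
det_corr_m = det2(to_correlation(cov_m_kg))
```

Here the covariance determinant shrinks by a factor of 0.01 squared when cm become m, while the correlation determinant is the same in both unit systems.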
Question
I have a set of categorical functional traits (growth form, photosynthetic pathway, etc). To pool them in a single variable (this is just a part of a more complex design) I reorganized these categorical functional traits into binomial variables and constructed a matrix with species as rows and binomial traits as columns. The number of columns was equal to the number of categories of the traits (e.g., growth form had three columns: woody [yes/no], grass [yes/no], forb [yes/no]). To obtain a single variable from it, I conducted a non-metric multidimensional scaling using Euclidean distance. However, I'm not sure if multivariate techniques are suitable when you have only binomial data and, in case they are, if I selected the most appropriate technique. I couldn't find this particular case in the literature and I would like to be sure.
Thanks very much for your responses, Kevin, Pierre and Lasse. I finally chose to use NMDS since it is well suited to handle non-normal and non-continuous data (McCune & Grace 2002). Moreover, the axes that I obtained from this ordination were very well related to most of my traits, providing an excellent continuous variable pooling all the binomial data. Anyway, I agree that MCoA would be another very good solution, but I had problems finding good (and reliable) scripts to run it in R.
Regarding PCA, I don't think that it would be appropriate for non-continuous data, Lasse; at least that is what many multivariate stats books say...
As a summary for the readers of this question: MCoA and NMDS are both valid and good techniques to ordinate binomial datasets, although you need to be careful about the distance used to build the resemblance matrix, depending on your question.
Thanks to all who answered this question, I learned a lot from you!! :-)
Question
I have a matrix [n individuals X 3 variables]. My 3 variables are proportions (summing up to 1 for each individual). I want to compute a distance matrix (between all pairs of individuals) using Euclidean distance. But I would like to give two of the variables twice as much weight as the third. I thought of transforming my variables as follows: V1'=2*V1, V2'=2*V2, V3'=1*V3, and then computing the distance matrix. Does it make sense? Thanks in advance
NB: Subsequent analysis will consist of a permutational MANOVA
Seems to make sense - although it is questionable whether you really attribute twice the weight to V1 and V2, because the differences are squared and summed under the root.
It would be easier using the block metric (Manhattan distance) - but if you want Euclidean distance, this is a way to do it.
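The remark about the sum under the root can be checked numerically. In this sketch (function names and the two example compositions are hypothetical), multiplying a variable by 2 before computing the Euclidean distance is the same as giving its squared difference a weight of 4; a weight of 2 on the squared difference corresponds to multiplying by sqrt(2) instead.

```python
import math


def weighted_euclidean(a, b, weights):
    """Euclidean distance with a weight on each squared difference."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def scaled_euclidean(a, b, factors):
    """Plain Euclidean distance after multiplying variable i by factors[i]."""
    return math.sqrt(sum((f * x - f * y) ** 2 for f, x, y in zip(factors, a, b)))


# Two hypothetical individuals described by three proportions.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.4, 0.4]

d_scaled_by_2 = scaled_euclidean(p, q, [2.0, 2.0, 1.0])
d_weight_4 = weighted_euclidean(p, q, [4.0, 4.0, 1.0])
d_scaled_by_sqrt2 = scaled_euclidean(p, q, [math.sqrt(2), math.sqrt(2), 1.0])
d_weight_2 = weighted_euclidean(p, q, [2.0, 2.0, 1.0])
```

Whether "twice the weight" should apply to the differences or to the squared differences is a modelling choice; the sketch only makes the two readings explicit.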
Question
Does anybody know which options one has to select in G3 to calculate the power in Multiple Regressions post-hoc on the basis of the R²? I do not know which R² values one has to put in; my results were incorrect so far.
F-test, Multiple regression Omnibus, then choose Determine, input R2 and then transfer. I hope this will be helpful.
Question
I have a proteomic dataset with more than a hundred proteins in different conditions. I would like to run a stepwise discriminant analysis to select a subgroup that discriminates among my conditions. However, there is multicollinearity in the dataset. How can I deal with it and still compute a discriminant analysis?
Warning: this is a weird answer, but I think it might work for you.
Have you thought of trying exploratory factor analysis to pull out profiles of proteins? That way, you might be able to find clusters of collinear proteins and rename the clusters as singular covariates with states (e.g., high, medium, low, or some sort of integer or ratio as a numerical index). This is essentially what people do when trying to develop psychometric instruments that have indices (e.g., the NEO-PI-R personality test, which has 5 indices for personality traits).
Here's an example of confirmatory factor analysis (not exploratory like yours. This is when you have hypothesized clusters) that is a typical way factor analysis is used: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218373/
Here is a recent way exploratory factor analysis is being used for lipid profiles that mirrors pretty much what I am suggesting you try with the proteins: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3552685/
My opinion is that the reason why this wonderful tool, factor analysis, has not been used as much in biomedical research is because it has been mainly on the radar of the sociologists and psychologists. It is one of their main tools. I am in public health, and it is not emphasized in our basic biostatistics curriculum. In fact, I have not noticed it taught at all anywhere in public health. I learned it myself while collaborating with an organizational psychologist on an interdisciplinary project.
It is possible you have too many proteins and this would not work because you have too much data. Sometimes that is good, and sometimes that is a problem!
The genetics people who have too much data try to use multifactor dimensionality reduction (MDR): http://en.wikipedia.org/wiki/Multifactor_dimensionality_reduction At first blush, it looks great, but the reality is, to quote what happens to be on Wikipedia today, "As with any data mining algorithm there is always concern about overfitting. That is, data mining algorithms are good at finding patterns in completely random data."
This is why I have never used this MDR approach; I find factor analysis more informative. I just try to get rid of some columns empirically if I have too many and make it so I can use factor analysis.
I hope this helps you. Good luck with your project!
-Monika
Question
I used this statistics to test repeated measurements, but I have some significant interactions, and I want to know where the differences are. I use the program Statistica and IBM SPSS.
DATASET ACTIVATE DataSet1.
MIXED cv BY fixedeffect1 fixedeffect2 randomeffect
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0,
ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED=fixedeffect1 fixedeffect2 fixedeffect1*fixedeffect2 | SSTYPE(3)
/METHOD=REML
/RANDOM=randomeffect | COVTYPE(VC)
/EMMEANS=TABLES(OVERALL)
You cannot do this from the Windows interface, which is frustrating.
The bottom two lines should give you all the pairwise post-hocs you need.
Question
I used multivariate analysis to define understory plant communities in Nothofagus forests, however, I think the results can be improved. I saw some results obtained using this kind of software, but, I am not able to find any papers related to this topic.
It is clear, that it is a very innovative methodology...
Question
The context is risk perception
Question
We need non parametric multivariate analysis of Variance for comparing k groups (partitions) of a large multivariate data set produced by a particular clustering method.
Also check into anova.cca(), which can be used in conjunction with capscale()... Both in vegan.
Question
I have a dataset consisting of proportion variables as independent variables. I need to run a linear regression however there is the issue of multicollinearity. I've read that using a centered log ratio transformation can fix the problem but I have no idea how to implement in R. Here's what I've done so far.
#My table
a = data.frame(score = c(12,321,411,511),yapa = c(1,2,1,1),ran=c(3,4,5,6),aa=c(0.1,0.4,0.7,0.8),bb=c(0.2,0.2,0.2,0.1),cc=c(0.7,0.4,0.1,0.1))
library(compositions)
dd = clr(a[,4:6]) #centered log ratio transform
summary(lm(score~aa+bb+cc,a))
summary(lm(score~dd,a))
but I get the same result essentially with the last variable being omitted because of multicollinearity.
There is an alternative that does work if I introduce jitter in the variables aa,bb,cc, however I need something that can directly be implemented in the lm function because I use other variables in my real dataset as well.
library(robCompositions)
lmCoDaX(a$score, a[,4:6], method="classical")
Does anyone have any experience with this type of data?
Glad to read that! Yes, the issue with the interpretation of the coefficients is one of the hardest aspects of the log-ratio approach to compositional data analysis. And in a regression framework with compositional input that is even more complex... In a given sense, you can see it as the gradient of "scores" with respect to "dd". And as a gradient, it shows the "compositional direction" in which "scores" increases fastest. To get a better idea of what that means, you can compute all possible pairwise logratios of your coefficient:
compcoef = ilrInv(mdl$coefficients[-1], orig=cmp)
D = length(compcoef)
z = log(outer(compcoef, compcoef, "/")) / (4*D^2)
If the element z[i,j] of this matrix is positive this means that "scores" increases z[i,j] units for each time that the ratio of variables i and j is increased by a factor of e=2.71~3. The same applies to negative elements, meaning a decrease, of course. This may help in deciding which components are relevant to predict "scores", and of course, in interpreting the whole thing. The (2D)^2 constant appears because of the own Euclidean structure of compositions. Perhaps a look at
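On the collinearity raised in the question: a centered log ratio (clr) transform does not remove the linear dependence, because the clr coordinates of each observation sum to zero. The sketch below re-implements clr in plain Python (it is not the `compositions` package) and applies it to the aa, bb, cc columns from the question's table.

```python
import math


def clr(composition):
    """Centered log ratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Rows of (aa, bb, cc) taken from the data frame in the question.
rows = [
    [0.1, 0.2, 0.7],
    [0.4, 0.2, 0.4],
    [0.7, 0.2, 0.1],
    [0.8, 0.1, 0.1],
]
clr_rows = [clr(r) for r in rows]
# Every row of clr coordinates sums to (numerically) zero.
row_sums = [sum(r) for r in clr_rows]
```

Since the three clr columns always add up to zero, any one of them is a linear combination of the other two, which is consistent with a regression dropping the last one. Coordinates with one fewer dimension than the number of parts, such as the ilr coordinates that the ilrInv call above maps back from, avoid this.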
Question
http://en.wikipedia.org/wiki/High-dimensional_model_representation
Question
Suppose I am assessing a bunch of risk factors and their associations with an infection (odds ratio will be the final measure). Outcome variable is the infection (yes vs. no)
Normally, I will select a priori covariates to adjust for based on DAG, biological mechanism or evidence from previously published journal articles - if I have a specific exposure and outcome to evaluate. Then I will use a backward selection method to retain those significant ones (based on 10% change-in-estimate rule of thumb). Apparently I don't think I can do it like this because I don't have a specific exposure, as I aim to know what risk factors are significantly associated with the infection. What I am trying to do is to perform bivariate analysis of each factor with the outcome and pick those with a p-value less than 0.1 to be included in the multivariable model. Then I will use a backward selection procedure to generate a parsimonious model as the final model, in which final estimates for each would yield an adjusted OR for each factor retained in the model. However, this method is considered data driven and somehow suboptimal.
What do you think would be the better method for variable selection in this case?
Hello, Yu Liu! Perhaps the Successive Projections Algorithm (SPA) may be a very good alternative for you to use. Some papers have demonstrated that it is a viable solution for variable selection in multivariate calibration problems. I think that it may be helpful for you too.
Another alternative may be the use of a Firefly Algorithm. It is a bio-inspired metaheuristic, and I have demonstrated that it is a good variable selection technique. If you wish, feel free to contact me or have a look at my papers available on my profile.
Question
I would like to investigate the effect of an environmental factor simultaneously on a species abundance matrix and a functional trait related to these species. To be clearer: I have sampled species in 10 sites, each site is characterized by one out of three levels of isolation and each species has a level of specialization (4 levels). I would like to know if species characterized by a given level of specialization are to be found at sites with a given level of isolation. Any help is welcome.
Why do we always think that everything that Einstein uttered had to be taken seriously? If one has a problem, surely the most important factor is its solution. One has to define the problem before one can commence working on a solution but the definition just cannot be worth more than a solution.
Question
I want to do a multiple regression analysis and some values of my dependent variable are negative (from -100 to +100). Can I run the analysis with negative values, or do I have to recode the variable in order for it to have only positive values?
Hello Mihai,
as was indicated already, you can indeed have negative values in your dependent variable.
1.) Fit a regression model to your data
2.) Create a new variable (y2) which is exactly the same as your original dependent variable, but add 100 to each observation. The new variable should now have a range 0 to 200.
3.) Fit exactly the same regression model as in step 1, but use the new variable.
4.) All parameter estimates should be exactly the same, except for the intercept. The intercept should be 100 higher than in the original model.
Hope this is useful.
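The four steps can be tried out in a few lines. This sketch uses a single predictor for brevity (so it is simple rather than multiple regression) and made-up data; the same shift of the intercept holds with several predictors.

```python
def ols_fit(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return intercept, slope


# Hypothetical data; the dependent variable lies between -100 and +100.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [-80.0, -35.0, 10.0, 40.0, 95.0]

# Step 2: shifted copy of the dependent variable, now between 0 and 200.
y2 = [v + 100.0 for v in y]

intercept_1, slope_1 = ols_fit(x, y)    # step 1
intercept_2, slope_2 = ols_fit(x, y2)   # step 3
```

Comparing the two fits (step 4), the slope is unchanged and only the intercept moves by 100, so recoding to positive values gains nothing.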
Question
We have heavy metal data for coastal water and sediment for a couple of years for several locations.
If your heavy metal concentrations are not percentage concentrations i.e. < 1000 to 5000 mg/kg (they will be very contaminated sediments if they are !) then the data will approximate to independent variables and you can run them without transformation. This will make it easier to interpret your results. Given the ease of applying these methods in various software packages I would suggest you try a number of different options (i.e. do some preliminary exploratory data analysis) to see what will suit your application the best.
Question
-
I should probably add that the c-index addresses discrimination, but not calibration or accuracy. I think the best way to assess calibration is to break your data into meaningful risk strata (or maybe quintiles or deciles of risk) and then do a scatter plot of the predicted values from the Cox model versus observed values from a KM curve for each stratum at a particular time point of interest (e.g. one year or five years). If the data you're using for validation and model development are one and the same, you will likely want to do some cross-validation.
Question
I need assistance with how to formulate a case-mix variable for a project involving assisted living. I have seen case mix used as a coefficient primarily in reimbursement formulas. I also have looked at anova analysis and multiple logistic regressions. It would be simpler for me to have a single variable representing the case mix because I am using multinomial analysis. I am using STATA 12 for analysis.
Case mix construction needs a frame of reference, that is, what are you using it for? To study variation in reimbursement? Patient outcomes? What is your key dependent variable? Then, knowing that, what are your options in available independent variables that could be construed as exogenous variables, or variables beyond the control of the groups you wish to compare, that could influence the dependent variable, like patient age, sex, and comorbidities? Regressing your dependent variable on the exogenous variables will help identify the strong variables to use in building a case mix variable. How to combine them into a single case mix indicator is something of a creative task. You could do hi/lo splits on some and then add points for scoring high to create an additive single indicator of case mix. Always test as you go to see whether the levels of your case mix variable show the expected variation in your outcome variable(s) as you build. It's possible there may be some "standard" case mix groupings used in the relevant literature that could be drawn upon, saving you from this otherwise more "organic" approach for trying to build a case mix model for your particular study, outcome variables, and independent variables. I hope this gives you some ideas to work with.
Some examples of this kind of organic approach for building risk adjustment variables, which are like case mix variables, can be found here:
Best of luck.
Question
It seems journals are considering Bonferroni adjustment for p-values of terms within a multiple regression model. Has anyone else noticed this? What do you think of the trend?
The challenge of dealing with multiple testing in the same body of data is real. Performing many stochastically dependent tests may result in serious inflation of type I errors (i.e. accepting 'spurious' significance results as 'real'). There are many less conservative modifications to deal with the Bonferroni inequality. One mode that preserves much statistical power would be the false discovery rate approach (there was a nice review paper on that in about 2006 in Ecoscience).
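To make the difference between the two corrections concrete, here is a minimal sketch of the Bonferroni rule next to the Benjamini-Hochberg step-up procedure, which is the usual false discovery rate method. The p-values are invented for the example.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values for six terms of a regression model.
pvals = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]
bonf = bonferroni_reject(pvals)
bh = benjamini_hochberg_reject(pvals)
```

With these numbers Bonferroni keeps two terms and Benjamini-Hochberg keeps three, which illustrates the gain in power mentioned above.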
With highly multivariate regression models, one may alternatively go for a different approach. Instead of classical hypothesis testing, you could follow a model-building strategy, for example guided by Akaike's AIC (or AICc with small sample numbers) or the Bayesian information criterion (BIC).
Both these information criteria allow you to compare models on the basis of (a) how good the fit to the data is PLUS (b) how few parameters are required to meet that. In this framework, the 'significance' of individual predictor variables no longer attracts much interest.
Frequently, you may then end up with a number of models of more or less equal goodness / parsimony. Instead of significance testing, then, model averaging would be a viable strategy.
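The criteria and the weights used for model averaging can be written down compactly. The formulas below are the standard definitions (k parameters, n observations, maximized log-likelihood logL); the three candidate models and their log-likelihoods are hypothetical.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 logL."""
    return 2 * k - 2 * log_lik


def aicc(log_lik, k, n):
    """Small-sample corrected AIC."""
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_lik, k, n):
    """Bayesian (Schwarz) information criterion: k ln(n) - 2 logL."""
    return k * math.log(n) - 2 * log_lik


def akaike_weights(criterion_values):
    """Relative support for each model, normalized to sum to 1."""
    best = min(criterion_values)
    rel = [math.exp(-0.5 * (v - best)) for v in criterion_values]
    total = sum(rel)
    return [r / total for r in rel]


# Hypothetical candidate models: (maximized log-likelihood, number of parameters)
models = [(-120.0, 3), (-118.5, 5), (-118.0, 8)]
n_obs = 60

aic_values = [aic(ll, k) for ll, k in models]
bic_values = [bic(ll, k, n_obs) for ll, k in models]
weights = akaike_weights(aic_values)
```

Models whose weights are of similar size are the ones that "end up more or less equal"; a model-averaged estimate would weight each model's coefficient by its Akaike weight.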
Anyhow, as long as you go for multiple hypothesis testing, it is certainly wise to keep the Bonferroni problem in mind. Otherwise marginally 'significant' relationships may easily be given undue weight.
I've been reading Professor Tõnu Kollo's Advanced Multivariate Statistics with Matrices and I've been struggling to solve exercises 1, 2 and 3 on page 275.
Question
How to find the expectation of the product of an inverse generalized inverse Wishart with a Wishart, or vice-versa? Thank you.
Question
What statistical program would one use to test multivariate generalized hyperbolic (GH) distribution?
Question
We have evaluated many parameters for predicting e.g. healthy and diseased individuals. In univariate logistic analysis some parameters showed a high standard error as a result of the logistic analysis. Standard error is greater than 400. Is a multivariate logistic analysis meaningful if I include parameters with a high standard error?
Usually a p-value cut-off (0.15 or so) on the univariate regressions is used to filter variables out of the multivariate model. The s.e. is going to depend on the magnitude of the variable - if you want to use s.e. then you need to standardize (center & divide by the standard deviation) the variables before fitting the (univariate) regressions.
Question
I have a sample of 210 (Convenience-drien) and the dependent variable is a continuous variable made of a composite index. The independent variables are dummy and dichotomized variables.
Maybe I am wrong, but I thought you would do a multiple regression ? If you go to the statistical book on my website on the favourite links page and there is a chapter there and also the chapter on GLM is useful to read as that explains differences.
This is the definition they give;
The Purpose of Multiple Regression
The general linear model can be seen as an extension of linear multiple regression for a single dependent variable, and understanding the multiple regression model is fundamental to understanding the general linear model. The general purpose of multiple regression (the term was first used by Pearson, 1908) is to quantify the relationship between several independent or predictor variables and a dependent or criterion variable
Do you have missing data / what do you do in this instance? Do you check for wrong data or outliers? I gather all your independent variables are dichotomised. What does the word drien mean ? it doesn't come up in my dictionary on word in the synonyms list ?
Can you dichotomise your dependent variable into pass or fail or achieve or not achieve - some sort of yes / no categorisation? Let me know if I can be of further assistance.
Thanks Debbie
Deborah Hilton Statistics Online
Question
How to do the computation of ARL for multivariate EWMA using the R program?
Question
I am interested in selecting interesting variables based on the pls-da model. In a PLS-DA with multiple components, how are the interesting variables selected based on their VIP scores? Variables have a different VIP score for each component, hence the confusion. I have earlier worked with opls-da and in that case it's just one predictive component and just one VIP score per variable.
Hello, Aakash! I suggest you read one of my papers:
Parallelization of a Modified Firefly Algorithm using GPU for Variable Selection in a Multivariate Calibration Problem
I believe it might be helpful.
Question
Linear regression and correlation play an important part in the interpretation of quantitative method comparison studies. How to use linear regression and correlation in quantitative method comparison studies?
Assuming that we want to relate two variables: one dependent and one independent.
First we try a scatter plot to see if there is a linear (+ or -) relationship between them without any outliers and then we quantify this relationship with the use of the correlation coefficient. If there is a linear correlation (r tends to + or - 1) between them then we move on to a regression analysis. If there is no linear relationship we may transform the variables.
Please inform me if I can be of any further help.
Kind regards
Question
.
Thanks Dr. Imran for your valuable input
Question
When one has a data set containing eight dependent variables and three independent variables where all three IVs are factors having an unequal number of levels, which kinds of multivariate models can one use to analyse such a data set?
The DVs are all continuous, taking the same measurements for each level of the IVs. However, the measurements use different units of measurement. I ran some tests and discovered that there is a lot of within variance and also a lot of between variance, however, the within group variance is more than the between group variance.
I would like to know:
1. How does one account for the high variability observed in the data? Potential sources of variability include: the three categorical IVs have an unequal number of levels; some of the IVs have null values in some of the levels and high values in the other levels.
2. Which would be most appropriate to use between a correlation matrix and a covariance matrix in terms of both analysis and interpretation?
Hi Bancy, You didn't say if your response variables are ordinal or just categorical, but this whole class of variables is generally best modeled with a class of models known as generalized linear mixed models--they are actually known by a bunch of names, but they are essentially extensions of general linear models that add "link functions" to the outcome that allow for a variety of unusual outcome types. So, once you have these things set up, you can work with them pretty much how you would work with a regression in terms of putting in predictor variables. You can search these models on the Internet and learn more about them. I haven't checked it, but the UCLA stats site can be helpful if you need help getting started, because it usually includes program code from popular stats programs, too. For some issues, such as large numbers of zeroes, there are specific models, such as zero-inflated Poisson, that can help, but not with categorical data. Bob
Question
Any ideas on statistics that measure the distances between the observed values and the expected values? Apart from using the T-square statistic.
Stress function in a Non-metric Multidimensional Scaling?
Question
I have 8 response variables in different scales. The variance of 4 variables are very high and 4 variables are low. The all observation are separated in 3 groups. By MANOVA, there is a significant effect of grouping. Which correlation matrix should be used in PCA in this situation?
As Anders points out, scaling variables from diverse sources is important. Correlation matrix is less sensitive to scaling, which is why the variation explained is lower, but this in itself does not indicate better or worse. My preferred method to use is whatever one is more interpretable. Pre-scaling your data means you understand exactly what transformations it has been through. This should make it easier to tie the results back to the original data. As well as unit variance scaling, consider mean centering if the different variables cluster around different mean values - but if using both always do mean centering first.
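The link between pre-scaling and the choice of matrix can be shown directly: after mean centering and unit variance scaling, the covariance between two variables equals their correlation in the raw data, so a covariance-based PCA on autoscaled data is a correlation-based PCA. The two measurement columns below are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def autoscale(column):
    """Mean-center first, then divide by the sample standard deviation."""
    m = mean(column)
    centered = [v - m for v in column]
    sd = math.sqrt(sum(c * c for c in centered) / (len(column) - 1))
    return [c / sd for c in centered]


def covariance(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical responses on very different scales.
length_mm = [1200.0, 1350.0, 990.0, 1500.0, 1100.0]
mass_kg = [1.2, 1.5, 0.9, 1.4, 1.3]

z_length = autoscale(length_mm)
z_mass = autoscale(mass_kg)

cov_of_scaled = covariance(z_length, z_mass)
corr_of_raw = correlation(length_mm, mass_kg)
```

On the raw data the covariance matrix would be dominated by the variable recorded in millimetres simply because of its units.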
Question
I have three factors A, B and C with 15, 2, and 2 levels respectively. The population standard deviation, estimated from a pilot survey, is 1.8. I want to fit a three-way ANOVA model:
y_ijkl = mu + alpha_i + beta_j + gamma_k + (alpha*beta)_ij + (alpha*gamma)_ik + (alpha*beta*gamma)_ijk + error_ijkl
Our main hypothesis is to find the best level of factor A with interaction levels of B and C. How do you calculate the sample size for testing this hypothesis? And could you give me R/SAS code for calculating the sample size by simulation?
G*Power is a flexible, free tool for calculating sample sizes. http://www.gpower.hhu.de/en.html
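If you prefer simulation, here is a rough sketch in Python with NumPy (not R or SAS). It rests on several assumptions of mine that are not in the question: a balanced 15 x 2 x 2 design, the test of interest is the main effect of A, the error SD is 1.8, and the hypothetical alternative is that one level of A is 1.0 unit better than the rest. The critical value is taken from a simulated null distribution rather than from F tables, and the function names are my own.

```python
import numpy as np


def f_stat_factor_a(y):
    """F statistic for the main effect of A in a balanced a x b x c design.

    y has shape (sims, a, b, c, n), with n replicates per cell.
    The error term is the within-cell mean square (full factorial model).
    """
    sims, a, b, c, n = y.shape
    grand = y.mean(axis=(1, 2, 3, 4), keepdims=True)
    a_means = y.mean(axis=(2, 3, 4), keepdims=True)
    cell_means = y.mean(axis=4, keepdims=True)
    ms_a = b * c * n * ((a_means - grand) ** 2).sum(axis=(1, 2, 3, 4)) / (a - 1)
    ms_e = ((y - cell_means) ** 2).sum(axis=(1, 2, 3, 4)) / (a * b * c * (n - 1))
    return ms_a / ms_e


def simulated_power(n, effects_a, sigma=1.8, alpha=0.05, sims=2000, seed=0):
    """Monte Carlo power of the A main-effect test with n replicates per cell."""
    rng = np.random.default_rng(seed)
    a, b, c = len(effects_a), 2, 2
    shape = (sims, a, b, c, n)
    # Null distribution of F gives an empirical critical value.
    null_f = f_stat_factor_a(rng.normal(scale=sigma, size=shape))
    critical = np.quantile(null_f, 1 - alpha)
    # Alternative: add the assumed A effects to every cell of the matching level.
    shift = np.asarray(effects_a, dtype=float).reshape(1, a, 1, 1, 1)
    alt_f = f_stat_factor_a(shift + rng.normal(scale=sigma, size=shape))
    return float((alt_f > critical).mean())


# Hypothetical alternative: level 1 of A is 1.0 unit above the other 14 levels.
effects = np.zeros(15)
effects[0] = 1.0

power_by_n = {n: simulated_power(n, effects) for n in (2, 5, 10, 20)}
for n, power in power_by_n.items():
    print(f"n per cell = {n:2d}  estimated power = {power:.2f}")
```

You would increase n until the estimated power reaches your target (say 0.8). The effect pattern you plug in drives the answer, so it should come from subject-matter knowledge or the pilot data.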
Question
I came across some code in Matlab for generating data with autocorrelation, "X=cumsum(rand(n,p)-r)", where n is the number of observations, p is the number of variables and r is the correlation coefficient. The results I am getting show a structured pattern when plotted on a scatter plot or control chart (MEWMA), and the autocorrelation (rk) is very close to 1 (0.9998). That is not what I want.
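One likely reason for the near-1 autocorrelation: a cumulative sum of random numbers is a random walk (with drift, because rand has mean 0.5), so r in that expression only shifts the drift and does not set the correlation. A common alternative, sketched here in Python with NumPy rather than Matlab, is a first-order autoregressive (AR(1)) process, whose lag-1 autocorrelation equals the coefficient phi. The function names and the choice of independent columns are my assumptions.

```python
import numpy as np


def simulate_ar1(n, p, phi, seed=None):
    """Generate n observations of p independent stationary AR(1) series.

    Each column follows x[t] = phi * x[t-1] + e[t] with e[t] ~ N(0, 1),
    so its theoretical lag-1 autocorrelation is phi (requires |phi| < 1).
    """
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between -1 and 1")
    rng = np.random.default_rng(seed)
    x = np.empty((n, p))
    # Draw the first row from the stationary distribution (no burn-in needed).
    x[0] = rng.normal(scale=1.0 / np.sqrt(1.0 - phi**2), size=p)
    noise = rng.normal(size=(n, p))
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x


def lag1_autocorr(series):
    """Sample lag-1 autocorrelation of a 1-D series."""
    d = series - series.mean()
    return float(np.dot(d[:-1], d[1:]) / np.dot(d, d))


X = simulate_ar1(n=5000, p=3, phi=0.6, seed=1)
r1 = [lag1_autocorr(X[:, j]) for j in range(X.shape[1])]

# For contrast: the cumulative-sum recipe from the question with r = 0.6.
walk = np.cumsum(np.random.default_rng(1).random((5000, 3)) - 0.6, axis=0)
r1_walk = [lag1_autocorr(walk[:, j]) for j in range(walk.shape[1])]

print("AR(1) lag-1 autocorrelations:      ", np.round(r1, 3))
print("Random-walk lag-1 autocorrelations:", np.round(r1_walk, 4))
```

If the p variables should also be cross-correlated (as a MEWMA chart usually assumes), the noise rows would need to be drawn from a multivariate normal instead; that is not shown here.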
Question
In a binary unconditional logistic model that I am working on, one of the variables (let's say X) is a confounder. Removing it changes the odds ratios (ORs) of several other variables by more than 10% (in fact it's changing some values by 50% or more). However, X also has missing information, and including it reduces the number of cases included in the analysis (N) by about 2000. It makes me wonder about two things: 1. Is the change in ORs due to the change in N and not due to confounding effects of X? 2. Given the change in N and the change in ORs if I include X in the model, should I keep X or not?
As a general principle, when comparing different models, e.g. progressive adjustment for confounders, one should always restrict the data to the same set of participants. Otherwise the models aren't comparable.
In terms of your actual question, you can get an idea of which is happening by: 1) running your model without the confounder but restricted to people who are not missing the confounder, and then comparing to 2) running the model with the confounder. That way both models will be in exactly the same people. Look at the ORs and 95% CIs to decide whether there is any material/important change. You could also do a loglikelihood ratio test of the two models to see whether they are significantly different. Since you have so many missing observations for the confounder, consider including the confounder only as part of a sensitivity analysis.
Hope this helps.
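The comparison described above can be sketched as follows, in Python with NumPy. Everything here is simulated and hypothetical: the variable names, the effect sizes, and the assumption that about one third of confounder values are missing completely at random. The small Newton-Raphson fitter is only there to keep the sketch self-contained; in practice you would use your statistics package.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


rng = np.random.default_rng(42)
n = 6000
confounder = rng.normal(size=n)                     # plays the role of X
exposure = 0.8 * confounder + rng.normal(size=n)    # correlated with X
logit = -0.5 + 0.4 * exposure + 0.9 * confounder
outcome = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

# Hypothetical missingness: about one third of confounder values are missing.
complete = rng.random(n) >= 1.0 / 3.0

# Model 1: without the confounder, but restricted to complete cases.
beta_unadj = fit_logistic(exposure[complete].reshape(-1, 1), outcome[complete])
# Model 2: with the confounder, on exactly the same people.
both = np.column_stack([exposure, confounder])[complete]
beta_adj = fit_logistic(both, outcome[complete])

or_unadj = float(np.exp(beta_unadj[1]))
or_adj = float(np.exp(beta_adj[1]))
print(f"N used in both models: {int(complete.sum())}")
print(f"Exposure OR without confounder: {or_unadj:.2f}")
print(f"Exposure OR with confounder:    {or_adj:.2f}")
```

Because both fits use the same rows, any difference between the two ORs reflects adjustment for the confounder rather than the change in N.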
Question
Can anyone provide me with an extensive explanation for which type of statistical tool should be used for:
1) Independent variable (Ethnicity): Chinese, Indian and Myanmar, coded as 1, 2, 3.
2) Several dependent variables: TTCP, LCP, HCP, WCP, TCP (all are continuous data).
Initially, I used ANOVA to compare the difference in the MEANS for each DV among these ethnicities but found only LCP to be statistically significant in the overall ANOVA table. I then proceeded to the post hoc test for LCP, while the rest of the DVs didn't show any significant difference. I think I have to draw a conclusion for those DVs without proceeding to the post hoc test, since the overall ANOVA table shows no significant value, but I am not sure. What am I supposed to do in this case? Can I use MANOVA since I have several DVs? Which test is suitable for these problems?
In my opinion, if your DVs are correlated (perform a correlation analysis on your DVs), then the MANOVA approach will be helpful; otherwise you may perform separate ANOVAs for your DVs.
Question
I am working on a problem in which I have derived a set of D formulae relating a different dependent variable to a grouping of independent variables.
D1 = intercept + ax1 + bx2 + cx3 + dx4
D2 = intercept + ex2 + fx7 + gx8
D3 = intercept + hx1 + ix3 + jx7
etc to ... D8.
I have 3 categorical variables P, Q and A [which are actually hierarchical, with A within Q within P, each containing a different number of classes]. I want to look at each of the categorical variables as a separate issue, clustering each of the D formulae into classes, so I can say something about how the D's vary / interact across classes.
Intuitively this seems to be a discriminant function problem because the classes are already known. However, a PCA or FA might be necessary – and then do a DFA on the clusters. Either way I am not sure how to set it up or even if I can interpret it to make sense.
Alternatively, I might be climbing up/down the wrong tree [pun intended]. Other methods might be better.
Help!
George F. Hart
I'm sending this to a number of statistics groups so apologize if you get this note more than once.
Question
I am trying to determine the effect of autocorrelation on the performance of standard control chart, but first have to model the auto-correlative structure of the standard control chart for the data set before I use the residuals for the control chart.
Autocorrelated data are, in most cases, time series, that is, they depend on time. Time-series models are usually applied to such data. So you can build a regression model based on the order of succession and the value. Most of the time the model is non-linear and, most importantly, when you get a new data point you should know where it belongs in the series.
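As a concrete (and deliberately simple) sketch of modelling the autocorrelation first and charting the residuals, here is a Python/NumPy example. It assumes a linear AR(1) structure, which may not fit your data, and uses simulated values with a hypothetical process mean of 10 and coefficient 0.7. The limits are plain 3-sigma Shewhart limits on the residuals.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated autocorrelated process (hypothetical): AR(1) around a mean of 10.
n, phi_true, mean = 2000, 0.7, 10.0
x = np.empty(n)
x[0] = mean + rng.normal()
for t in range(1, n):
    x[t] = mean + phi_true * (x[t - 1] - mean) + rng.normal()

# Fit x[t] = intercept + phi * x[t-1] by ordinary least squares.
design = np.column_stack([np.ones(n - 1), x[:-1]])
(intercept, phi_hat), *_ = np.linalg.lstsq(design, x[1:], rcond=None)

# Residuals should be roughly uncorrelated if the AR(1) model is adequate.
residuals = x[1:] - (intercept + phi_hat * x[:-1])
centered = residuals - residuals.mean()
resid_lag1 = float(np.dot(centered[:-1], centered[1:]) / np.dot(centered, centered))

# Shewhart-type 3-sigma limits applied to the residuals.
center = residuals.mean()
sigma = residuals.std(ddof=1)
ucl, lcl = center + 3 * sigma, center - 3 * sigma
n_signals = int(((residuals > ucl) | (residuals < lcl)).sum())

print(f"Estimated phi: {phi_hat:.3f}")
print(f"Lag-1 autocorrelation of residuals: {resid_lag1:.3f}")
print(f"Control limits: [{lcl:.2f}, {ucl:.2f}], points outside: {n_signals}")
```

To study the effect of autocorrelation, you could apply the same limits to the raw series x and compare the number of false signals with the residual chart.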
Question
I am a little confused about this. I've read a lot but I really need some support from others who have knowledge about LARS and LASSO. Thanks in advance to all who are willing to contribute.
Hi Russel,
I found the link below provided a very nice thumbnail explanation -
LARS represents an efficient algorithm to calculate the entire family of LASSO solutions - that is for the entire range of bound parameters (s constraint) for |beta|
I've played around a little with the R package glmnet for fitting these models- but have been focusing more on RandomForest models in my work, though Andrew Gelman had a very interesting recent blog post on practical advantages of the LASSO approach that is making me want to use it more...
Gelman post:
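To make the idea of a "family of LASSO solutions" concrete, here is a minimal sketch of my own (not taken from either link) in Python with NumPy. Note that it uses cyclic coordinate descent on a small grid of penalties, not the LARS algorithm itself; LARS would trace the same solutions along the whole path more efficiently. The data and the true coefficients are simulated and hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent.

    Assumes the columns of X and the response y have been mean-centered.
    """
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X**2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * b[j]          # remove feature j from the fit
            rho = X[:, j] @ resid / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * b[j]          # put the updated feature back
    return b


rng = np.random.default_rng(5)
n, p = 200, 6
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])  # hypothetical truth
y = X @ true_beta + rng.normal(size=n)
y -= y.mean()

# Smallest penalty at which every coefficient is exactly zero.
lam_max = np.abs(X.T @ y).max() / n
lams = [1.01 * lam_max, 0.5, 0.01]
path = [lasso_cd(X, y, lam) for lam in lams]
counts = [int(np.count_nonzero(b)) for b in path]

for lam, b in zip(lams, path):
    print(f"lambda = {lam:5.2f}  coefficients = {np.round(b, 2)}")
```

As the penalty shrinks, variables enter the model one after another; the bound parameter s mentioned above corresponds to moving along this sequence.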
Question
Does anyone know much about this topic? I would truly appreciate if you could share some good resources related to this topic. The articles/studies related to this were scarce.
Question
Most multivariate techniques, such as Linear Discriminant Analysis (LDA), Factor Analysis, MANOVA and Multivariate Regression are based on an assumption of multivariate normality. On occasion when you report such an application, the Editor or Reviewer will challenge whether you have established the applicability of that assumption to your data. How does one do that and what sample size do you need relative to the number of variables? You can check for certain properties of the multivariate normal distribution, such as marginal normality, linearity of all relationships between variables and normality of all linear combinations. But is there a definitive test or battery of tests?
You can test multivariate normality by assessing the multivariate skewness and kurtosis through the software in the following link
• Cain, M. K., Zhang, Z., & Yuan, K. H. (2017). Univariate and multivariate skewness and kurtosis for measuring nonnormality: Prevalence, influence and estimation. Behavior research methods, 49(5), 1716-1735.
• Ramayah, T., Yeap, J. A., Ahmad, N. H., Halim, H. A., & Rahman, S. A. (2017). Testing a confirmatory model of facebook usage in smartpls using consistent PLS. International Journal of Business and Innovation, 3(2), 1-14.
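For readers who want to see what multivariate skewness and kurtosis look like in code, here is a minimal sketch of Mardia's two measures in Python with NumPy. It is not the software referred to above, and the function name and return layout are my own. Under multivariate normality, n times the skewness divided by 6 is approximately chi-square with p(p+1)(p+2)/6 degrees of freedom, and the kurtosis is approximately normal with mean p(p+2) and variance 8p(p+2)/n.

```python
import numpy as np


def mardia(X):
    """Mardia's multivariate skewness and kurtosis with their test statistics.

    Returns (skewness, kurtosis, skew_stat, skew_df, kurt_z).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / n                    # ML covariance (divide by n)
    D = centered @ np.linalg.solve(S, centered.T)    # n x n Mahalanobis products
    skewness = float((D**3).sum() / n**2)
    kurtosis = float((np.diag(D) ** 2).sum() / n)
    skew_stat = n * skewness / 6.0                   # ~ chi-square(skew_df) under H0
    skew_df = p * (p + 1) * (p + 2) // 6
    kurt_z = (kurtosis - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n)
    return skewness, kurtosis, skew_stat, skew_df, kurt_z


rng = np.random.default_rng(11)
normal_data = rng.normal(size=(1000, 3))        # should look multivariate normal
skewed_data = rng.exponential(size=(1000, 3))   # clearly non-normal

res_normal = mardia(normal_data)
res_skewed = mardia(skewed_data)

for name, res in (("normal", res_normal), ("exponential", res_skewed)):
    skewness, kurtosis, skew_stat, skew_df, kurt_z = res
    print(f"{name:12s} skew stat = {skew_stat:8.1f} (df = {skew_df}), "
          f"kurtosis z = {kurt_z:6.2f}")
```

A skewness statistic far above its degrees of freedom, or a kurtosis z far from 0, suggests departure from multivariate normality. These are large-sample approximations, so with small n relative to p they should be read with caution.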
Question
I have four different groups of independent samples with nonparametric data (low number of samples). To compare the frequency of data between groups, I used Kruskal-Wallis with Dunn post-test. In addition, I want to determine the trend between groups. Which nonparametric test for trend should I use?
Jose, the sample size is neither indication nor contra-indication of any statistical analysis.(*) If you can reasonably assume a particular error model (**), then you can (and should) perform a parametric analysis. If you don't have such a model but you can reasonably assume that the (unknown) model is the same for all groups, then you can do a non-parametric analysis. If you cannot reasonably assume the same model for all groups, then there is nothing (sensible) you can do.
Further note that non-parametric tests have a considerably low power for small sample sizes. This also applies to any permutation- or bootstrap-based procedure.
(*) I suspect you mean that in small samples you can't expect the central limit theorem to do a good job, so that the frequency distribution of sample means still deviates considerably from a normal distribution (so you can't use the normal distribution as error model). If your errors have a symmetric distribution, n=5 is already fine. Skewed distributions may cause problems, and considerably larger sample sizes (n>30-50) are required to assure a reasonably good normal approximation.
(**) For frequencies, a Poisson model may be reasonable. If you doubt a strict correlation of mean and variance you can use a negative binomial model. There are also zero-inflated Poisson models available, when the expected frequencies of zero counts are higher than predicted from pure Poisson processes.
Question
I am using a multivariate (trivariate) probit in Stata 12. How do I calculate the conditional marginal effects for each equation while the remaining two dependent variables are held at 1?
You could try computing predicted marginal probabilities of success for each outcome, and the joint probabilities of success and failure in every outcome, using mvppred in Stata 12.
Question
I know some basics, but want to expand my skills focusing on regression, survival and advanced data management.
Question
Recently we conducted a prospective study to evaluate whether BMI is associated with treatment response to a combination DMARD therapy. We found an inverse association between BMI and disease activity at baseline. We also found an inverse association between BMI and response to treatment (change in DAS28 after 6 months); those with higher BMI had lower DAS28 at baseline and also had smaller changes in DAS28 after 6 months of therapy. When we ran a multivariate analysis and considered the baseline DAS28 as a confounder, the association between BMI and treatment response disappeared.
The question is that:
- Should we conclude that the whole association is confounded by baseline DAS28 and there is no real association between BMI and treatment response?
- How could we find out how much of the association between BMI and treatment response is confounded by baseline DAS28, and how much is a real association?
I agree with Audrey,
multicollinearity should be no problem; you actually want a relation between your independent and dependent variables. In your case, by including the DAS28 value of the pretest you are able to control for differences before the intervention.
Also plotting is a very good idea! But why not try a repeated measures AN(C)OVA? Have you tried that? With a model like that you would be able to
a) take a look at the interaction of time and BMI (or other independent variables)
b) include other variables such as gender, age, etc. (I'm just guessing, you can include whatever variables you assessed that might have an impact on either BMI or change in DAS28).
Question
I am analyzing taxa community composition in relation to explanatory variables by distance-based redundancy analysis (db-RDA), based on Bray-Curtis dissimilarities on untransformed abundances, with R. I have a stratified design (3 stream reaches) and initially I had the same number of replicates per site. As predictors, I'm using, among other variables, treatments 1, 2, and 3. At the end of the study, I lost all replicates from, say, treatment 1 at a given site, and a couple of replicates per site for, say, treatment 2. I ran the db-RDA model for sites combined (but with permutations stratified within sites), and then ran a separate model per site. What are the consequences of an unbalanced design when using db-RDA?
When running an F test with an unbalanced design, the rates of type I and type II statistical errors may change depending on the type of sum of squares you are using. This is related to the way the main effects of factors are estimated.
If you have an unbalanced design, make sure that the F-test you are using calculates the Type III sum of squares, which seems to be the case in anova.cca() function when setting the argument by = "margin".
Cheers
Is there any way to find the underlying factors of a set of observations besides Factor Analysis and Principal Component Analysis?
Question
I am working on the Arbitrage Pricing Model and I found that most researchers just performed a multivariate regression and concluded that they found a good model. I think that instead of finding a good-fit model through trial and error, a better approach should be asking the data. Thus I was trying Factor Analysis and it worked pretty well so far. However, I wonder whether there are other methods so that I can deal with a more general model, where linearity of the factor needs not be assumed? Thanks!
Here is one more suggestion: work in a regression tree context, which does not assume linearity or even monotonicity. Since you may have collinear predictors, I would suggest trying random forest, which is an ensemble of regression trees. Random forest gives you the importance of each predictor in explaining the value of your response variable(s). It also gives you the proximity matrix, so you can have a pairwise (among observations) dissimilarity matrix that you can visualize using NMDS and/or use to cluster the observations. If you have multivariate response data and want to do multivariate random forest in R, have a look at: Segal, M. and Y. Xiao (2011). "Multivariate random forests." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(1): 80-87. You can contact the authors to get the R source code. HTH Cheers Pierre
Question
Bivariate Normal Distribution
First, the bivariate normal probability density function f(x,y) reduces to the product of two one-dimensional normal probability density functions if the random variables X and Y are uncorrelated (\rho = 0): f(x,y) = fx(x) * fy(y). [http://www.maths.qmul.ac.uk/~ig/MTH5118/Notes11-09.pdf]
Now, the two random variables X and Y are independent iff,
F(x,y) = Fx(x) * Fy(y), where F(x,y) is the joint distribution function, and Fx and Fy are the marginal ones. Using the definition of the distribution function:
F(x,y) = P(X<=x, Y<=y) [probability that X<=x and Y<=y] = \int_{-infinity}^x dx \int_{-infinity}^y dy f(x,y),
where f(x,y) is the joint probability density function, which based on the fact that X and Y are uncorrelated can be written as:
f(x,y) = fx(x) * fy(y).
By replacing in the above integral, we get
F(x,y) = \int_{-infinity}^x fx(x) dx \int_{-infinity}^y fy(y)dy = P(X<=x)P(Y<=y) = Fx(x) * Fy(y),
which proves the independence.
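For completeness, the factorization used in the first step can be written out explicitly. The following is the standard bivariate normal density; the symbols \mu_X, \mu_Y, \sigma_X, \sigma_Y for the means and standard deviations are notation I am adding, since the answer above does not name them.

```latex
% Bivariate normal density with correlation \rho
f(x,y) = \frac{1}{2\pi\,\sigma_X\sigma_Y\sqrt{1-\rho^{2}}}
  \exp\!\left\{ -\frac{1}{2(1-\rho^{2})}
  \left[ \frac{(x-\mu_X)^{2}}{\sigma_X^{2}}
       - \frac{2\rho\,(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y}
       + \frac{(y-\mu_Y)^{2}}{\sigma_Y^{2}} \right] \right\}

% Setting \rho = 0 removes the cross term and the factor \sqrt{1-\rho^{2}}:
f(x,y)\big|_{\rho=0}
  = \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_X}
      \exp\!\left\{-\frac{(x-\mu_X)^{2}}{2\sigma_X^{2}}\right\}}_{f_X(x)}
    \cdot
    \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_Y}
      \exp\!\left\{-\frac{(y-\mu_Y)^{2}}{2\sigma_Y^{2}}\right\}}_{f_Y(y)}
```

Note that this relies on X and Y being jointly normal. Two variables that are each normal but not jointly normal can be uncorrelated without being independent.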
Question
Is there any free online place for learning reporting formats?
Thank you for sharing this question with me. Unfortunately, it is not my field. You can try a Google search or YouTube.
Question
Is there any trial or free version of programs with capability of detrended canonical analysis?
Yes, you should try R. There is a package called Vegan for the community of ecologists, that contains popular methods of multivariate analysis. If you are new to R try also RStudio as development environment.
Question
I've seen a number of trials published that had crossover, but the statistical analysis proceeded as if no crossover occurred. It seems likely to me that ignoring crossover can introduce biases and/or decrease power, but I'm unfamiliar with the literature on this. Any suggestions?
Scott Got ya. I think the key statistical point you are interested in is the independence of the predictor variables and the assumed linear dependence between these variables. Since the Cox method is modelling the variability of the survival probabilities on the predictor variables, if one of the variables does have an unexpected crossover effect like you mentioned above, (ie. diabetic status and people who become diabetic midway through the study) then that confounding effect only affects that predictor variable. As you mentioned two methods of analysis can then be performed. 1) Remove those individuals for the predictor for which the crossover occurred. This only limits the power of that regression coefficient. 2) Determine if the crossover events can be categorized meaningfully into either of the two original groups or possibly be classified as an additional class possibility (diabetic, non-diabetic, converted diabetic).
The variability in the survivability probabilities due to the other predictor variables are unchanged by the crossover events. This is true by definition since the predictor variables are assumed to be independent, but this assumption can also be tested to see if there is a statistically significant interaction effect in the standard ANOVA/regression analysis. This is one reason regardless of the statistical method used it is always important to test the assumptions of the model in question. If the assumption of independence of predictor variables does not hold up then there may be additional predictor variables which would also be affected by the crossover events and those predictors would also have to be handled as described above. Fortunately this all falls under the standard statistical analyses performed in any ANOVA/regression and nothing additional needs to be done beyond testing the assumptions of the model.
How to deal with the crossover events is a choice that needs to be made, but it is even better if both methods of dealing with the confounding events are tested (as the study you described did) and if both models yield essentially identical results the "choices" don't affect the predictions of the model.
Question
I was reading a paper in which the authors used a principal component analysis to obtain scores from two highly correlated biological variables. How correct is that? Mathematically, why not use just one of the two variables, if they are highly correlated? Maybe it is my stats background, but I am having a hard time dealing with the idea and am not sure if that's correct.
I will try to give a very brief answer to your questions... If two variables are highly correlated and there is a significant amount of error associated with each one of them, there is an advantage in using a score provided by the PCA. As you are aware, the PCA provides the direction of maximum variability of both variables. Thus, in the presence of lab error, scores can provide a better estimate to follow the state of a given system, since they optimally combine two sources of (noisy) measurements.
Of course, in the presence of a very small lab error (rare in biological variables) it is mathematically equivalent to use either one or the score (linear combination of both).
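A small simulation can illustrate this argument. The sketch below, in Python with NumPy, assumes a hypothetical unobserved "true state" measured twice with independent lab error of the same size; all numbers are made up. It compares how well each single variable and the first principal component score track the true state.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Hypothetical setup: one underlying state, two noisy measurements of it.
latent = rng.normal(size=n)
x1 = latent + rng.normal(scale=0.7, size=n)
x2 = latent + rng.normal(scale=0.7, size=n)

# PCA on the standardized pair (equivalent to PCA on the correlation matrix).
X = np.column_stack([x1, x2])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
pc1 = eigvecs[:, -1]          # eigh sorts eigenvalues in ascending order
score = Z @ pc1

# Absolute correlations with the true state (the sign of a PC is arbitrary).
r_x1 = abs(np.corrcoef(x1, latent)[0, 1])
r_x2 = abs(np.corrcoef(x2, latent)[0, 1])
r_score = abs(np.corrcoef(score, latent)[0, 1])

print(f"corr(x1, true state)        = {r_x1:.3f}")
print(f"corr(x2, true state)        = {r_x2:.3f}")
print(f"corr(PC1 score, true state) = {r_score:.3f}")
```

With two standardized variables the first component is essentially their average, so the independent errors partly cancel. If the error SD is set close to zero, the three correlations become practically identical, which matches the remark about very small lab error.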
Question
Is there somebody with experience using Multidimensional Scaling for Java (MDSJ) libraries to produce 3D diagrams from dimensional / dissimilarity matrices or using another java open source or free java library?
I am using the java library mdsj.jar from http://www.inf.uni-konstanz.de/algo/software/mdsj/ version 0.8 2008 (there is a newer one from 2009) but I am having problems with verifying the results against results provided using R.