Science topic

# Multivariate Analysis - Science topic

A set of techniques used when variation in several variables has to be studied simultaneously. In statistics, multivariate analysis is interpreted as any analytic method that allows simultaneous study of two or more dependent variables.

Questions related to Multivariate Analysis

The experiment was designed as follows :

There are three field plots, each planted with the same three plant species, with 3 replicates per species.

The species were not randomized through the plots, i.e. they were planted in the same positions in every plot.

One fertilization was applied to each plot; in total there are three plots receiving three different fertilizations.

There are two treatments:

The fertilization (3 levels = three plots)

The species (3 levels with 3 replicates)

I want to compare the species between the plots, but the species are not randomized within the plots and the plots are not replicated.

I have read that I can run an ANOVA for each plot separately, as if they were different experiments done in different locations, and then apply a combined ANOVA. How do I carry out the degrees-of-freedom decomposition for the combined ANOVA?

The experiment is already finished, and I have to do the statistical analysis.
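A sketch of the degrees-of-freedom bookkeeping, assuming the combined ANOVA simply pools the three per-plot analyses (3 plots x 3 species x 3 replicates = 27 observations). One caution: because fertilization is applied to whole, unreplicated plots, the fertilization effect is completely confounded with the plot effect and has no valid error term of its own.

```python
# Degrees-of-freedom decomposition for the combined ANOVA (layout from the
# question: 3 plots x 3 species x 3 replicates = 27 observations).
plots, species, reps = 3, 3, 3
n_total = plots * species * reps           # 27 observations in total

df_total = n_total - 1                     # 26
df_plot = plots - 1                        # 2 (fertilization, confounded with plot)
df_species = species - 1                   # 2
df_interaction = df_plot * df_species      # 4 (plot x species)
df_error = plots * species * (reps - 1)    # 18 (within-cell error pooled over plots)

# The components must add back up to the total df:
assert df_plot + df_species + df_interaction + df_error == df_total
for name, df in [("plots (fertilization)", df_plot), ("species", df_species),
                 ("plot x species", df_interaction), ("pooled error", df_error),
                 ("total", df_total)]:
    print(f"{name:22s}df = {df:2d}")
```

The species and plot x species terms can be tested against the pooled error; the plot (fertilization) term cannot be separated from plot-to-plot differences without replicated plots.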

I have used EPC a lot for my project, but I don't understand how any of the principal components could have no percentage of the total variation explained by it. I cannot do this with Mesquite.

I have a question on multilevel analysis. I hope I explain everything clearly, if not please let me know!

I have a dichotomous dependent variable (burglary yes/no), which I try to explain with variables on three levels (house, street and neighbourhood), so I am conducting a multilevel analysis. I just added some cross-level interactions, which show a significant influence. With a normally distributed dependent variable it is possible to calculate the explained slope variance, to see what percentage of the effect is explained by this variable and whether the influence is indeed relevant. Since for a dichotomous dependent variable it is not useful to calculate the explained variance, I used the ICC. But I cannot seem to find anything on how to calculate the ICC for the slope variance. Is this even possible? And if so, how do I do it?
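On the ICC question: for a dichotomous outcome a commonly used option is the latent-variable ICC, which fixes the level-1 residual variance at that of the standard logistic distribution, pi^2/3 (about 3.29). To my knowledge this gives an ICC for the random-intercept variance only; there is no standard ICC analogue for slope variance. A minimal sketch (the intercept variance 0.5 below is a made-up example value):

```python
import math

def latent_icc(sigma2_u):
    """Latent-scale intercept ICC for a two-level logistic model.
    sigma2_u: estimated level-2 (random-intercept) variance; the level-1
    residual variance is fixed at pi^2/3 for the standard logistic."""
    return sigma2_u / (sigma2_u + math.pi ** 2 / 3)

print(round(latent_icc(0.5), 3))  # hypothetical intercept variance of 0.5
```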

I am referring to an old work, but I guess most of the multivariate analysis techniques implemented and software systems developed could usefully be re-adapted to modern software environments. Does anyone have knowledge of such work being performed? Does anyone know of such techniques being used in the mining / prospecting industry today?

To help foster the discussion, I have added a paper in English published in: Use of Microcomputers in Geology, Chapter 3, pp. 25-71, Plenum Publishing Corporation / Springer Science+Business Media, New York, 1992, Editors: Hans Kürzl and Daniel F. Merriam (Print ISBN 978-1-4899-2337-0).

Dear all (Mathematics and Ecologists mainly):

The Rényi spectrum of fractals contains all the main fractal dimensions in a multivariate structure. This structure is very useful for comparative analysis. I want to know which values of Q in the Rényi spectrum correspond exactly to the fractal dimensions of gyration and variance. I have not found this information in any published paper so far.

Many thanks.

If the factor levels are measured values and the measurement instrument has a published error of ± some value, how does one deal with that error? For instance, suppose one of my class variables (factors) is the length of a component and I need to study how that length affects performance in 0.01 cm increments. If I use a vernier caliper that is only accurate to ±0.01 cm, then my levels potentially overlap.

Is there some way of propagating the uncertainty through the MANOVA?

I want to calculate odds ratios using multivariate regression. How can I do that in SPSS version 16.0?
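In SPSS, Analyze > Regression > Binary Logistic reports Exp(B) for each predictor, which is the (adjusted) odds ratio; there is no separate "odds ratio" command. As a sanity check, here is how the same quantity is computed by hand for a single binary exposure from a 2x2 table (the counts below are hypothetical):

```python
import math

def odds_ratio(a, b, c, d):
    """OR and 95% CI from a 2x2 table:
       a = exposed cases,   b = exposed controls,
       c = unexposed cases, d = unexposed controls."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se_log)
    hi = math.exp(math.log(or_) + 1.96 * se_log)
    return or_, lo, hi

or_, lo, hi = odds_ratio(20, 80, 10, 90)   # hypothetical counts
print(f"OR = {or_:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With several predictors in the model, the regression coefficient exponentiated (Exp(B)) plays the same role, adjusted for the other covariates.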

I have demonstrated that FA may be a better solution than the Successive Projections Algorithm for variable selection in a multivariate calibration problem. However, I would like to know if anyone has ever demonstrated that FA may be a better solution than other variable selection techniques.

Currently, I am analyzing data from a small study of 1500 participants. There are only 63 disease cases (asthma). The exposure is a continuous score (questionnaire data). I ran separate regression models to predict asthma using a) the continuous exposure and b) the dichotomized exposure (median split).

My question: am I right to assume that the sample / case numbers are too limited to run analyses based on tertiles (about 500 participants per category and fewer than 20 cases, especially in the intermediate category)? I am not aware of any rule saying that 20 cases are too few, but that is what I have often heard. If this is true, can you recommend a reference to support it?

P.S. The estimates are multivariate-adjusted (7 confounders?)

I have parallelized (on GPU) the SPA and used it for variable selection in multivariate calibration problems, and would like to know if there are other parallelized algorithms that have been used for the same purpose.

Using a Likert scale, how can I deal with the neutral part of the responses?

A Likert scale usually offers five responses: strongly disagree, disagree, neutral, agree, strongly agree. How should neutral responses be understood?

Hello, I tried using binary logistic regression in SPSS, and the following error message is shown:

Warnings

The dependent variable has more than two non-missing values. For logistic regression, the dependent value must assume exactly two values on the cases being processed.

This command is not executed.

The DV has 5 categories:

1,00 = no player

2,00 = social player

3,00 = risk player

4,00 = pathological player

5,00 = clinical pathological player

Could you please help me?

Best regards
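SPSS's binary logistic procedure refuses to run because the DV takes five values, so either switch to multinomial (or ordinal) logistic regression, which accepts all five categories, or collapse the DV to two values first. A minimal sketch of the recode (the cut-point between "social" and "risk" player is a hypothetical choice you would have to justify substantively):

```python
# Hypothetical recode: "risk player or worse" = 1, otherwise 0. Where to cut
# is a substantive decision; a multinomial/ordinal model avoids it entirely.
recode = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

dv = [1, 2, 3, 4, 5, 2, 3]                  # example raw codes
binary_dv = [recode[v] for v in dv]
print(binary_dv)                            # [0, 0, 1, 1, 1, 0, 1]
```

In SPSS itself the same recode is Transform > Recode into Different Variables, or use Analyze > Regression > Multinomial Logistic to keep all five categories.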

The GENLINMIXED analysis covers a wide variety of models, from simple linear regression to complex multilevel models for non-normal longitudinal data, but it is difficult to run and analyze.

In order to use the most important variables among 100+, I am thinking of using factor analysis or PCA. Therefore I need to know what differences exist between factor analysis and PCA. Thank you in advance.

The electronic edition would also be ok.

Validating logistic regression models.

I derived a logistic regression model for predicting clinical response to a drug. I have two groups, responders and non-responders, classified according to a clinical scale. The predictors in the model were two polymorphisms and three environmental variables, such as smoking. I generated the predicted probabilities from the logistic model for all patients. Then, using the predicted probabilities, I constructed a ROC curve for responders vs. non-responders.

Suppose X ~ N(µ, σ^2). We know the sample mean is unbiased and also the MVUE for the population mean µ, and hence Cov(sample mean, U_0) = 0, where U_0 is any statistic with E(U_0) = 0. Now, the sample median is also unbiased for the population mean, so E(sample mean - sample median) = 0. If we now consider Cov(sample mean, sample mean - sample median), what will the result be? Is it zero? I am facing a problem in this regard. Please explain this to me if possible.
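A Monte Carlo sketch of the answer: because the sample mean is the UMVUE of µ under normality, it is uncorrelated with every unbiased estimator of zero, and (sample mean - sample median) is such an estimator, so the covariance is exactly 0 (equivalently, Cov(mean, median) = Var(mean) = σ²/n). A quick simulation check:

```python
import random
import statistics

# Simulate many normal samples of size n and estimate
# Cov(sample mean, sample mean - sample median); theory says it is 0.
random.seed(1)
n, reps = 5, 50_000
means, diffs = [], []
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(x) / n
    means.append(m)
    diffs.append(m - statistics.median(x))

mb, db = sum(means) / reps, sum(diffs) / reps
cov = sum((a - mb) * (b - db) for a, b in zip(means, diffs)) / (reps - 1)
print(cov)  # close to 0, up to Monte Carlo noise
```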

I have checked whether my ordinal scale-dependent variable has a normal distribution. The Kolmogorov-Smirnov test says it is acceptable, but the Shapiro-Wilk test says it is not. My ranking has a scale of 1 to 150. Please see attached files for tests of normal distributions.

I would like to work with some well-log data, using core data, with multivariate methods in conjunction with fuzzy logic and neural networks. I need some ideas or a problem statement.

How does the output differ between one-way ANOVA and regression?

What are the similarities and differences between regression and one-way ANOVA?

I have 15 treatments. My main interest is to find the best treatment. The response is measured every day for up to 30 days. My model has an interaction effect between time and treatment. I will use a suitable effect size, but I need 80% power with a type I error of 5%. How can I calculate the sample size by simulation?

1. I am working on some research related to female unemployment and want to use multilevel modelling techniques. The residual variance is .043 for the empty model, but when I include some explanatory variables at level 1 it increases. I do not know how to interpret this. Please guide me on how to interpret it, and on whether the model is better off when this variance decreases or increases.

Some results are attached for your information.

2. Are there any criteria for selecting the explanatory variables to be included at level 2?

I have the dependent variable Diabetes Mellitus, categorized as 1 = Normal, 2 = Pre-Diabetes, 3 = Diabetes.

I have applied ordinal logistic regression for multivariate analysis.

Independent variables are;

Heart Disease (binary), BMI (ordinal), Central Obesity (binary), Sex (binary), Hypertension (binary), Age (continuous), Income (continuous), Number of Cigarettes Smoked per Day (continuous), Family History of Diabetes (binary).

Am I using the right statistical procedure?

Interestingly, in the regression model all independent variables are insignificant, but R² is 0.52. When I remove the variable 'number of cigarettes smoked', R² drops to 0.42. The same goes for hypertension and heart disease.

Can anyone tell me what is going on? Is it collinearity, confounding, or something else? And how do I resolve it?

I would like to use the annealing data set from the UCI repository. However, the description of the data does not match the contents. Treating '?' as NaNs and, e.g., removing features that are mostly NaNs leaves 9 features... does anybody have a clean annealing data set? Thanks!

I conducted a driving simulator study in which each participant (32 participants in total) passed each of the four infrastructural conditions in a randomized order. The road segment nearby the infrastructural condition was subdivided in ten sections of 50 meter. For each section, we recorded the mean speed and the mean lateral position for each participant. My dataset has thus 40 columns (4 conditions x 10 road sections).

Based on this research design, I would like to perform a 4 (condition) x 10 (section) within-subjects MANOVA for mean speed and mean lateral position. In SPSS I run GLM > Repeated Measures with two within-subject factors (condition and section), which have 4 and 10 levels respectively. The measure names are speed and LP (for lateral position). Then I select my columns and drag them to the field "within-subjects variables".

In my SPSS output I find two tables which attract my attention: first there is the table called “Multivariate Tests” and second a table “Multivariate” under the heading “Tests of within-subjects effects”. Because the study has a full within-subjects design, my question is “Which table do I have to use in my analysis description?”. It is important to note that some of the test statistics differ between both tables and that some cells of the table “Multivariate Tests” are empty because SPSS “Cannot produce multivariate test statistics because of insufficient residual degrees of freedom”.

Can someone explain the difference between the two tables (Multivariate Tests and Tests of within-subjects effect_Multivariate) and which table is preferable to use in my data analysis?

I'm currently working with **macrobenthic communities** in Admiralty Bay (King George Island, South Shetlands, **Antarctic Peninsula**), using a **functional grouping approach** to elucidate their assemblages. Some preliminary research in one of its inlets (Mackellar Inlet) indicates that the ocean currents are an important variable that strongly influences the **coupling** of the sampling stations. In order to run a multivariate analysis, I would like to incorporate these values into the analysis. Is there any special treatment that I have to apply? Any further suggestions or recommendations? I appreciate your response.

I'm working with a table of demographic information about my patients and controls, such as job information and whether or not they have been exposed to heavy metals, to pesticides, or to any water body, such as rivers, lakes, etc.

I was told that multivariable analyses could give me an idea of which of these parameters could have been a more important source of exposure to heavy metals.

I have a problem solving a double integral with a Jacobian transformation. I've read a journal article, but it uses the transformation beta = beta.

Could I solve the integral using that transformation? Thanks for your help.

Hi everybody,

I'm trying to apply the methodology of the paper by Wang et al. to select some OTUs that I will put into a multivariate analysis. I performed the canonical analysis, but I'm a beginner with the R software and I don't understand how to use the envfit function with my data. Would somebody help, please?

The article says:

*"We performed CCA using the CCorA function in the vegan package (software R version 2.7) to detect the interactions between the selected metadata and the given microbiota dataset at OTU level (100 OTUs) and used the envfit function to get the p-value of correlation of each variable with overall bacterial communities and the p-value of each correlation between each OTU and all variables"*

Best regards, Vanina

We performed an analysis with multivariate logistic regression (several continuous independent variables and a single dichotomous dependent variable).

A referee reviewing our paper asked us to show the effect sizes of each predictor, but (s)he was satisfied neither with the ORs (which I agree are unstandardized) nor with the standardized B coefficients.

Which standardized effect size is best to calculate and report for the predictors of a multivariate logistic regression?

Thank you!

To my knowledge, you can obtain more than 2 cut-offs if you use a scatterplot. I don't think the same can be achieved using ROC analysis, since it works with a binary outcome. Any suggestions?

I have 3 dependent variables and 3 independent variables in a study. Some of my IVs are categorical and some are continuous. I think I need to run a higher-order factorial MANOVA instead of performing a MANOVA 3 times, separately for each of my independent variables. Is there any simple reference explaining and interpreting the output of a higher-order factorial MANOVA in SPSS?

Hi,

I have a data set with 90 variables and more than 6000 observations. I am going to use a subset selection method before starting the analysis, but the problem is that regsubsets() is very slow: even with really.big=TRUE it does not produce any output. After waiting 2 hours I gave up and started to look for alternative techniques in R. What other methods or techniques are there?

Thank you,

Idris

Which other discrete time probability distributions can be used instead of binomial distributions?

I am trying to find a paper that proposes a multiobjective firefly algorithm for variable selection, but I am unable to find one. If anybody knows of a paper related to this issue, please let me know.

Conference Paper Multiobjective Firefly Algorithm for Variable Selection in M...

Theorem: Let X be a bivariate random variable with distribution function F. Let A and B be two nonsingular 2×2 matrices such that A^(-1)B or B^(-1)A has no zero element. If the components of BX and AX are independent, then F is a bivariate normal distribution function.

How do we construct a bivariate normality test using the above theorem?

Short run - Granger Causality Test

I have a model with one continuous dependent variable and 100 categorical predictors (candidate SNPs) with 3 levels each (homozygous for one allele, heterozygous, homozygous for the other allele) and 288 observations. What is the best method to select a more parsimonious model (with, say, just 5-20 independent variables)?

I am analysing 1 group of 12 subjects. The design was 2x2 (arm role x visual feedback), repeated measures (unfortunately condition order was not randomised). I have 24 trials per condition. I have 7 response variables. I want to determine which response variables co-varied with condition. Which stats method is most appropriate? Thank you.

Let's say we have treatments A and B (decided by the experimenter) that we use as the response variable, and factors 1, 2, 3, ..., n that we use as predictors (measured variables). Intuitively this is not correct, because we should model the outcome, not the controlled factor, but classification methods based on very similar mathematics do exactly that. Any ideas/references?

Can someone provide me a link, or perhaps a toolbox, for multivariate analysis in MATLAB, and tell me how to install it?

Here's a brief outline of my problem, thank you :)

Hi.

I ran a PCA with 5 variables, and it seems that I should retain only one PC, which accounts for 70% of the variation. The PC2 eigenvalue is 0.9.

I was wondering:

1. Does it make any sense to use varimax rotation in this particular case, retaining only one PC?

2. If I retained two PCs, should I rotate the whole loadings matrix (with all five PCs) or just the ones I retain?

Thanks!

David

In my MATLAB code, as soon as the number of random variables reaches 3, the acceptance rate of MCMC using the Metropolis-Hastings algorithm drops to less than 1%.

I am trying to perform a partial CCA in CANOCO 4.5. When choosing some groups of variables as variables, and the rest of the environmental variables as covariables, to calculate the net effect of the group, I get the error message "No explanatory variables remained" and the analysis fails. The variables in the groups can be numerical or dummy-coded categorical; it happens in either case. When regrouping the respective group into another, it does increase the effect of the other group, so there should be some explanatory power! I have already checked for linear combinations. All the variables concerned were significant in forward selection. Does anybody have an idea what could be wrong with the data?

The current toolbox solves for continuous variables, and that is the default algorithm. Can we customize the GA to solve the same problem by defining intervals on the design variables?

The testicular volume and scrotal circumference of twelve bulls were measured repeatedly. The bulls were divided into two categories (young versus old). We are interested in investigating the effect of age group (independent variable) on testicular volume and scrotal circumference (dependent variables). Can someone please advise which test we should use? I think a repeated-measures MANOVA?

I don't have a good basis upon which to express my data statistically. I would like to learn more about multivariate analysis. The courses that I attended during my graduate studies were very advanced since I did not have any statistics background at that time. I can't go back to school now, so I would be grateful if someone could guide me or suggest the best way to teach myself the basics of ecological analysis. Thanks.

The different units being cm, kg, etc.

I have a set of categorical functional traits (growth form, photosynthetic pathway, etc.). To pool them into a single variable (this is just a part of a more complex design), I reorganized these categorical functional traits into binary variables and constructed a matrix with species as rows and binary traits as columns. The number of columns was equal to the number of categories of the traits (e.g., growth form had three columns: woody [yes/no], grass [yes/no], forb [yes/no]). To obtain a single variable from it, I conducted a non-metric multidimensional scaling using Euclidean distance. However, I'm not sure whether multivariate techniques are suitable when you have only binary data and, in case they are, whether I selected the most appropriate technique. I couldn't find this particular case in the literature and I would like to be sure.

I have a matrix [n individuals x 3 variables]. My 3 variables are proportions (summing to 1 for each individual). I want to compute a distance matrix (between all pairs of individuals) using Euclidean distance. But I would like to give two of the variables twice as much weight as the third. I thought of transforming my variables as follows: V1' = 2*V1, V2' = 2*V2, V3' = 1*V3, and then computing the distance matrix. Does this make sense? Thanks in advance.

NB: the subsequent analysis will be a permutational MANOVA.
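One caveat with the pre-scaling trick: multiplying a variable by w multiplies its contribution to the squared distance by w², so V1' = 2*V1 gives that variable four times, not twice, the weight. To double the weight, scale by sqrt(2), or weight the squared differences directly. A sketch of both equivalent routes:

```python
import math

def weighted_euclidean(u, v, weights):
    """Euclidean distance where weights[i] multiplies the squared
    difference on variable i (so weight 2 means 'counts twice')."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))

u, v = (0.5, 0.3, 0.2), (0.2, 0.3, 0.5)   # two example composition rows
w = (2, 2, 1)                              # first two variables count double

d1 = weighted_euclidean(u, v, w)

# Equivalent pre-scaling: multiply each variable by sqrt(weight), not by weight.
u_s = [x * math.sqrt(wi) for x, wi in zip(u, w)]
v_s = [x * math.sqrt(wi) for x, wi in zip(v, w)]
d2 = math.dist(u_s, v_s)

print(round(d1, 6), round(d2, 6))  # identical distances
```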

Does anybody know which options one has to select in G*Power 3 to calculate post-hoc power for multiple regression on the basis of R²? I do not know which R² values one has to enter; my results have been incorrect so far.

I have a proteomic dataset with more than a hundred proteins in different conditions. I would like to run a stepwise discriminant analysis to select a subgroup that discriminates among my conditions. However, there is multicollinearity in the dataset. How can I deal with it and still compute a discriminant analysis?

I used this statistic to test repeated measurements, but I have some significant interactions and I want to know where the differences are. I use Statistica and IBM SPSS.

I used multivariate analysis to define understory plant communities in Nothofagus forests; however, I think the results can be improved. I saw some results obtained using this kind of software, but I am not able to find any papers related to this topic.

We need a non-parametric multivariate analysis of variance for comparing k groups (partitions) of a large multivariate data set produced by a particular clustering method.

I have a dataset whose independent variables are proportions. I need to run a linear regression, but there is the issue of multicollinearity. I've read that using a centered log-ratio transformation can fix the problem, but I have no idea how to implement it in R. Here's what I've done so far.

#My table

a = data.frame(score = c(12,321,411,511),yapa = c(1,2,1,1),ran=c(3,4,5,6),aa=c(0.1,0.4,0.7,0.8),bb=c(0.2,0.2,0.2,0.1),cc=c(0.7,0.4,0.1,0.1))

library(compositions)

dd = clr(a[,4:6]) #centered log ratio transform

summary(lm(score~aa+bb+cc,a))

summary(lm(score~dd,a))

but essentially I get the same result, with the last variable being omitted because of multicollinearity.

There is an alternative that does work if I introduce jitter into the variables aa, bb, cc; however, I need something that can be used directly in the lm function, because I also use other variables in my real dataset.

library(robCompositions)

lmCoDaX(a$score, a[,4:6], method="classical")

Does anyone have experience with this type of data?
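A sketch of why lm() still drops a column: the clr coordinates of a composition always sum to zero, so they are exactly collinear, just like the raw proportions. The usual fix is the ilr transform, which maps a D-part composition to D-1 unconstrained coordinates (this is, if I recall correctly, what lmCoDaX does internally); alternatively, drop one clr column by hand. The arithmetic is the same in any language; illustrated here in Python:

```python
import math

def clr(row):
    """Centered log-ratio of one composition (parts must be positive)."""
    g = math.exp(sum(math.log(v) for v in row) / len(row))  # geometric mean
    return [math.log(v / g) for v in row]

def ilr(row):
    """Isometric log-ratio (pivot coordinates): D-1 unconstrained coordinates."""
    out = []
    for i in range(1, len(row)):
        g = math.exp(sum(math.log(v) for v in row[:i]) / i)  # gm of first i parts
        out.append(math.sqrt(i / (i + 1)) * math.log(g / row[i]))
    return out

comp = [0.1, 0.2, 0.7]
z = clr(comp)
print(sum(z))          # 0 up to rounding: the clr columns stay collinear
print(len(ilr(comp)))  # 2 coordinates for a 3-part composition: no singularity
```

In R, regressing score on the ilr coordinates (vegan/compositions aside, `ilr()` is also in the compositions package) avoids the aliasing that clr produces in lm().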

Does anybody know about HDMR (high-dimensional model representation) methods?

Suppose I am assessing a bunch of risk factors and their associations with an infection (odds ratio will be the final measure). Outcome variable is the infection (yes vs. no)

Normally, I would select a priori covariates to adjust for based on a DAG, biological mechanisms, or evidence from previously published journal articles - if I had a specific exposure and outcome to evaluate. Then I would use a backward selection method to retain the significant ones (based on the 10% change-in-estimate rule of thumb). I don't think I can do it like this here, because I don't have a specific exposure: I aim to find out which risk factors are significantly associated with the infection. What I am trying to do is to perform a bivariate analysis of each factor with the outcome and pick those with a p-value less than 0.1 to be included in the multivariable model. Then I will use a backward selection procedure to generate a parsimonious final model, in which the final estimate for each factor retained in the model is an adjusted OR. However, this method is considered data-driven and somewhat suboptimal.

What do you think would be the better method for variable selection in this case?

I would like to investigate the effect of an environmental factor simultaneously on a species abundance matrix and on a functional trait related to these species. To be clearer: I have sampled species in 10 sites, each site is characterized by one of three levels of isolation, and each species has a level of specialization (4 levels). I would like to know if species characterized by a given level of specialization are to be found at sites with a given level of isolation. Any help is welcome.

I want to do a multiple regression analysis and some values of my dependent variable are negative (from -100 to +100). Can I run the analysis with negative values, or do I have to recode the variable in order for it to have only positive values?

We have heavy metal data for coastal water and sediment for a couple of years for several locations.

I need assistance with how to formulate a case-mix variable for a project involving assisted living. I have seen case mix used as a coefficient primarily in reimbursement formulas. I have also looked at ANOVA and multiple logistic regression. It would be simpler for me to have a single variable representing the case mix, because I am using multinomial analysis. I am using STATA 12 for the analysis.

It seems journals are considering Bonferroni adjustment for p-values of terms within a multiple regression model. Has anyone else noticed this? What do you think of the trend?

I've been reading Professor Tõnu Kollo's Advanced Multivariate Statistics with Matrices, and I've been struggling to solve exercises 1, 2 and 3 on page 275.

How does one find the expectation of the product of a generalized inverse Wishart matrix with a Wishart matrix, or vice versa? Thank you.

What statistical program would one use to test for a multivariate generalized hyperbolic (GH) distribution?

We have evaluated many parameters for predicting, e.g., healthy versus diseased individuals. In univariate logistic analysis some parameters showed a high standard error (greater than 400). Is a multivariate logistic analysis meaningful if I include parameters with such a high standard error?

I have a sample of 210 (convenience-driven), and the dependent variable is a continuous variable made of a composite index. The independent variables are dummy and dichotomized variables.

How do I compute the ARL for a multivariate EWMA using R?
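For ready-made R routines, the spc package computes EWMA ARLs numerically (e.g. xewma.arl; if I remember correctly it also covers the multivariate case). If you prefer to roll your own, the ARL can be estimated by straightforward Monte Carlo: simulate the chart until it signals and average the run lengths. A univariate sketch of that recipe (the MEWMA version replaces |z| with its chi-square-type statistic; λ = 0.1 and L = 2.7 are example design values):

```python
import random

def ewma_run_length(lam, L, mu_shift=0.0, sigma=1.0):
    """One simulated run of a univariate EWMA chart.
    lam: smoothing constant; L: control-limit width in asymptotic sigmas."""
    limit = L * sigma * (lam / (2.0 - lam)) ** 0.5  # asymptotic control limit
    z, t = 0.0, 0
    while True:
        t += 1
        z = lam * random.gauss(mu_shift, sigma) + (1.0 - lam) * z
        if abs(z) > limit:
            return t

random.seed(0)
runs = 2000
arl0 = sum(ewma_run_length(0.1, 2.7) for _ in range(runs)) / runs  # in control
arl1 = sum(ewma_run_length(0.1, 2.7, mu_shift=1.0)                 # 1-sigma shift
           for _ in range(runs)) / runs
print(round(arl0, 1), round(arl1, 1))
```

The same loop ported to R (replacing the statistic with the MEWMA T² form and the limit with its h4 value) gives the multivariate ARL by simulation.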

I am interested in selecting variables based on a PLS-DA model. In a PLS-DA with multiple components, how are the interesting variables selected based on their VIP scores? Variables have a different VIP score for each component, hence the confusion. I have earlier worked with OPLS-DA, where there is just one predictive component and therefore one VIP score per variable.

Linear regression and correlation play an important part in the interpretation of quantitative method-comparison studies. How are linear regression and correlation used in quantitative method-comparison studies?

When one has a data set containing eight dependent variables and three independent variables where all three IVs are factors having an unequal number of levels, which kinds of multivariate models can one use to analyse such a data set?

The DVs are all continuous, taking the same measurements for each level of the IVs. However, the measurements use different units of measurement. I ran some tests and discovered that there is a lot of within-group variance and also a lot of between-group variance; however, the within-group variance is greater than the between-group variance.

I would like to know:

1. How does one account for the high variability observed in the data? Potential sources of variability include: the three categorical IVs have an unequal number of levels; some of the IVs have null values in some of the levels and high values in the other levels.

2. Which would be most appropriate to use between a correlation matrix and a covariance matrix in terms of both analysis and interpretation?

Any ideas on statistics that measure the distances between the observed and expected values, apart from Hotelling's T² statistic?

I have 8 response variables on different scales. The variances of 4 variables are very high and those of the other 4 are low. All observations are separated into 3 groups. By MANOVA, there is a significant effect of grouping. Which matrix (correlation or covariance) should be used in the PCA in this situation?

I have three factors A, B and C, with 15, 2 and 2 levels respectively. The population standard deviation is 1.8, from a pilot survey. I want to fit the three-way ANOVA model:

y_ijkl = mu + alpha_i + beta_j + gamma_k + (alpha*beta)_ij + (alpha*gamma)_ik + (beta*gamma)_jk + (alpha*beta*gamma)_ijk + error_ijkl

Our main hypothesis concerns finding the best level of factor A in interaction with the levels of B and C. How do I calculate the sample size for testing this hypothesis? Could you give me R/SAS code for calculating the sample size by simulation?
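Sample-size-by-simulation follows one recipe in any language (R, SAS IML, or, below, a Python sketch): pick a candidate cell size, simulate many datasets under the alternative (cell means chosen from the pilot, σ = 1.8), run the ANOVA on each, record the rejection rate, and increase n until the power reaches 80%. A minimal one-way version of that loop (the cell means are made-up example values; extend the cell-mean table to the full 15 x 2 x 2 layout in the same way):

```python
import random

def one_way_f(groups):
    """F statistic for a balanced one-way ANOVA."""
    k, n = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (k * n)
    ssb = n * sum((sum(g) / n - grand) ** 2 for g in groups)
    ssw = sum((x - sum(g) / n) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (k * (n - 1)))

def simulated_power(cell_means, n_per_cell, sigma, alpha=0.05,
                    null_reps=2000, alt_reps=500):
    """Monte Carlo power; the critical value is simulated too (no F tables)."""
    k = len(cell_means)
    def draw(means):
        return [[random.gauss(m, sigma) for _ in range(n_per_cell)] for m in means]
    null = sorted(one_way_f(draw([0.0] * k)) for _ in range(null_reps))
    crit = null[int((1 - alpha) * null_reps)]
    return sum(one_way_f(draw(cell_means)) > crit
               for _ in range(alt_reps)) / alt_reps

random.seed(7)
power = simulated_power([0.0, 0.5, 1.0], n_per_cell=20, sigma=1.8)
print(round(power, 2))  # increase n_per_cell until this reaches 0.80
```

In R the same loop would simulate with rnorm(), fit aov(y ~ A*B*C), and extract the p-value for the term of interest.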

I came across MATLAB code for generating data with autocorrelation, "X=cumsum(rand(n,p)-r)", where n is the number of observations, p is the number of variables and r is the correlation coefficient. The results I get show a structured pattern when plotted on a scatter plot or control chart (MEWMA), and the autocorrelation (r_k) is very close to 1 (0.9998). That is not what I want.
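The snippet itself is the likely culprit: cumsum() turns the series into a random walk, whose lag-1 autocorrelation is near 1 no matter what r is. To generate data whose lag-1 autocorrelation actually equals r, use an AR(1) recursion x_t = r*x_{t-1} + e_t (in MATLAB, a simple loop or filter(1, [1 -r], e)). A sketch of the same idea:

```python
import random

def ar1_series(n, r, sigma=1.0, seed=None):
    """AR(1) process x_t = r * x_{t-1} + e_t, lag-1 autocorrelation r (|r| < 1).
    Unlike a cumulative sum, this does NOT drift off as a random walk."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, sigma / (1 - r * r) ** 0.5)]  # stationary starting value
    for _ in range(n - 1):
        x.append(r * x[-1] + rng.gauss(0.0, sigma))
    return x

def lag1_autocorr(x):
    m = sum(x) / len(x)
    num = sum((a - m) * (b - m) for a, b in zip(x, x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den

x = ar1_series(10_000, 0.5, seed=42)
print(round(lag1_autocorr(x), 3))  # close to the requested r = 0.5
```

For p variables, generate p such series (or use a vector AR(1) with a cross-correlation matrix) before feeding them to the MEWMA chart.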

In a binary unconditional logistic model that I am working on, one of the variables (let's say X) is a confounder. Removing it changes the odds ratios (ORs) of several other variables by more than 10% (in fact it changes some values by 50% or more). However, X also has missing information, and including it reduces the number of cases included in the analysis (N) by about 2000. This makes me wonder about two things: 1. Is the change in ORs due to the change in N and not due to the confounding effect of X? 2. Given the change in N and the change in ORs when I include X in the model, should I keep X or not?

Can anyone provide me with an extensive explanation for which type of statistical tool should be used for:

1) Independent variable (ethnicity): Chinese, Indian and Myanmar, coded as 1, 2, 3.

2) Several dependent variables: TTCP, LCP, HCP, WCP, TCP (all are continuous data).

Initially, I used ANOVA to compare the difference in means for each DV among these ethnicities, but found only LCP to be statistically significant in the overall ANOVA table. I then proceeded to the post-hoc test for it, while the rest of the DVs didn't show any significant difference. I think I have to draw a conclusion without proceeding to the post-hoc test when the overall ANOVA table shows no significant value, but I am not sure. What am I supposed to do in this case? Can I use MANOVA, since I have several DVs? Which test is suitable for these problems?

I am working on a problem in which I have derived a set of D formulae relating a different dependent variable to a grouping of independent variables.

D1 = intercept + ax1 + bx2 + cx3 + dx4

D2 = intercept + ex2 + fx7 + gx8

D3= intercept + hx1 + ix3 + jx7

etc to ... D8.

I have 3 categorical variables P, Q and A [which are actually hierarchical, with A within Q within P, each containing a different number of classes]. I want to look at each of the categorical variables as a separate issue, clustering each of the D formulae into classes, so I can say something about how the D's vary / interact across classes.

Intuitively this seems to be a discriminant function problem, because the classes are already known. However, a PCA or FA might be necessary first, followed by a DFA on the clusters. Either way, I am not sure how to set it up, or even whether I can interpret it so it makes sense.

Alternatively, I might be climbing up/down the wrong tree [pun intended]. Other methods might be better.

Help!

George F. Hart

I'm sending this to a number of statistics groups, so I apologize if you get this note more than once.

I am trying to determine the effect of autocorrelation on the performance of a standard control chart, but first I have to model the autocorrelation structure of the data set so that I can use the residuals for the control chart.

I am a little confused about this. I've read a lot but I really need some support from others who have knowledge about LARS and LASSO. Thanks in advance to all who are willing to contribute.

Does anyone know much about this topic? I would truly appreciate if you could share some good resources related to this topic. The articles/studies related to this were scarce.

Most multivariate techniques, such as Linear Discriminant Analysis (LDA), Factor Analysis, MANOVA and Multivariate Regression are based on an assumption of multivariate normality. On occasion when you report such an application, the Editor or Reviewer will challenge whether you have established the applicability of that assumption to your data. How does one do that and what sample size do you need relative to the number of variables? You can check for certain properties of the multivariate normal distribution, such as marginal normality, linearity of all relationships between variables and normality of all linear combinations. But is there a definitive test or battery of tests?
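One widely used battery is Mardia's multivariate skewness and kurtosis tests, which can be computed directly from the data. A sketch in Python (the function name `mardia` and the simulated data are mine; the formulas are the standard asymptotic ones):

```python
import numpy as np
from scipy import stats

def mardia(X):
    """Mardia's multivariate skewness and kurtosis tests for MVN."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))
    G = Xc @ S_inv @ Xc.T                      # pairwise Mahalanobis products
    skew = (G**3).sum() / n**2                 # Mardia's b_{1,p}
    kurt = (np.diag(G)**2).mean()              # Mardia's b_{2,p}
    # Asymptotic null distributions
    chi2_stat = n * skew / 6
    df = p * (p + 1) * (p + 2) // 6
    p_skew = stats.chi2.sf(chi2_stat, df)
    z = (kurt - p * (p + 2)) / np.sqrt(8 * p * (p + 2) / n)
    p_kurt = 2 * stats.norm.sf(abs(z))
    return p_skew, p_kurt

rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(3), np.eye(3), size=200)
p_skew, p_kurt = mardia(X)
print(p_skew, p_kurt)   # should both be non-small for MVN data
```

Note these are asymptotic tests; with small n relative to p, their size can be poor, which is part of why reviewers ask about sample size.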

I have four different groups of independent samples with nonparametric data (low number of samples). To compare the frequency of data between groups, I used Kruskal-Wallis with Dunn's post-test. In addition, I want to determine whether there is a trend across the groups. Which nonparametric test for trend should I use?

PS. Statistical software: Graphpad Prism
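A standard nonparametric trend test for independent ordered groups is the Jonckheere-Terpstra test. If your software does not offer it, it is straightforward to compute by hand; a sketch in Python using the normal approximation without a tie correction in the variance (the function name and example data are mine):

```python
import numpy as np
from scipy import stats

def jonckheere_terpstra(groups):
    """Jonckheere-Terpstra test for an ordered trend across k groups.

    groups: list of 1-D samples, in the hypothesised order.
    Returns the JT statistic and a two-sided normal-approximation p-value.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    jt = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            # count pairs where the later group exceeds the earlier one
            diff = groups[j][None, :] - groups[i][:, None]
            jt += (diff > 0).sum() + 0.5 * (diff == 0).sum()
    sizes = np.array([len(g) for g in groups])
    n = sizes.sum()
    mean = (n**2 - (sizes**2).sum()) / 4
    var = (n**2 * (2 * n + 3) - (sizes**2 * (2 * sizes + 3)).sum()) / 72
    z = (jt - mean) / np.sqrt(var)
    return jt, 2 * stats.norm.sf(abs(z))

# Hypothetical small samples with an increasing trend
jt, p = jonckheere_terpstra([[1, 2, 3], [2, 3, 4], [4, 5, 6], [6, 7, 9]])
print(jt, p)   # JT = 51.0, p well below 0.05
```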

I am using a multivariate (trivariate) probit in STATA 12. How do I calculate the conditional marginal effects for each equation while the remaining two dependent variables are set to 1?

I know some basics, but want to expand my skills focusing on regression, survival and advanced data management.

Recently we conducted a prospective study to evaluate whether BMI is associated with treatment response to combination DMARD therapy. We found an inverse association between BMI and disease activity at baseline, and also an inverse association between BMI and response to treatment (change in DAS28 after 6 months): those with higher BMI had lower DAS28 at baseline and also showed smaller changes in DAS28 after 6 months of therapy. When we ran a multivariate analysis with baseline DAS28 as a confounder, the association between BMI and treatment response disappeared.

The question is that:

- Should we conclude that the whole association is confounded by baseline DAS28, and that there is no real association between BMI and treatment response?

- How could we determine what proportion of the association between BMI and treatment response is confounded by baseline DAS28, and what proportion is a real association?
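One way to build intuition for the first question is to simulate a mechanism in which baseline DAS28 fully accounts for the BMI effect, and check that adjustment removes the crude association. A hypothetical sketch in Python (all coefficients invented for illustration; this is not a substitute for a formal mediation/confounding decomposition):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000

# Hypothetical mechanism: BMI lowers baseline DAS28, and baseline DAS28
# (not BMI itself) drives the 6-month change.
bmi = rng.normal(27, 4, n)
das_base = 6.0 - 0.08 * bmi + rng.normal(0, 0.5, n)
das_change = -0.5 * das_base + rng.normal(0, 0.3, n)

def ols(y, *cols):
    """Least-squares fit with an intercept; returns the coefficients."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_crude = ols(das_change, bmi)[1]              # unadjusted BMI effect
b_adj = ols(das_change, bmi, das_base)[1]      # adjusted for baseline
print(f"crude: {b_crude:.3f}, adjusted: {b_adj:.3f}")
# The crude coefficient is positive (higher BMI -> less improvement),
# yet vanishes once baseline DAS28 is in the model.
```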

I am analyzing taxa community composition in relation to explanatory variables by distance-based redundancy analysis (db-RDA), based on Bray-Curtis dissimilarities on untransformed abundances, with R. I have a stratified design (3 stream reaches) and initially I had the same number of replicates per site. As predictors, I'm using, among other variables, treatments 1, 2, and 3. At the end of the study, I lost all replicates from, say, treatment 1 at a given site, and a couple of replicates per site for, say, treatment 2. I ran the db-RDA model for sites combined (but with permutations stratified within sites), and then ran a separate model per site. What are the consequences of an unbalanced design when using db-RDA?

Is there any way to find the underlying factors of a set of observations besides Factor Analysis and Principal Component Analysis?

I am working on the Arbitrage Pricing Model, and I found that most researchers simply perform a multivariate regression and conclude that they have found a good model. I think that instead of finding a well-fitting model through trial and error, a better approach is to ask the data. I have therefore been trying Factor Analysis, and it has worked well so far. However, I wonder whether there are other methods that would let me handle a more general model, in which linearity of the factors need not be assumed? Thanks!

Are there any free online resources for learning reporting formats?

Is there any trial or free version of a program with detrended canonical analysis capability?

I've seen a number of trials published that had crossover, but the statistical analysis proceeded as if no crossover occurred. It seems likely to me that ignoring crossover can introduce biases and/or decrease power, but I'm unfamiliar with the literature on this. Any suggestions?

I was reading a paper in which the authors used principal component analysis to obtain scores from two highly correlated biological variables. How correct is that? Mathematically, why not just use one of the two variables if they are highly correlated? Maybe it is my stats background, but I am having a hard time with the idea and am not sure whether it is correct.
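The usual rationale is that PC1 of two highly correlated variables captures their shared signal and carries almost all of the variance, so a single composite score loses little relative to the pair while averaging out some measurement noise. A quick numpy illustration with simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

# Two highly correlated hypothetical biological variables
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # correlation close to 1

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / (s**2).sum()
print(np.round(explained, 3))   # PC1 carries nearly all the variance
```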

Does anybody have experience using the Multidimensional Scaling for Java (MDSJ) library to produce 3D diagrams from dimensional / dissimilarity matrices, or using another open-source or free Java library?

I am using the Java library mdsj.jar from http://www.inf.uni-konstanz.de/algo/software/mdsj/ version 0.8, 2008 (there is a newer one from 2009), but I am having problems verifying its results against results produced with R.
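One way to debug the discrepancy is to compute classical (Torgerson) MDS independently and compare all three outputs, keeping in mind that MDS solutions are only determined up to rotation and reflection, so the recovered distances, not the raw coordinates, should be compared. A numpy sketch (the function name is mine):

```python
import numpy as np

def classical_mds(D, k=3):
    """Classical (Torgerson) MDS from a distance matrix D into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D**2) @ J                  # double-centred Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]              # largest k eigenvalues
    L = np.sqrt(np.clip(w[idx], 0, None))      # drop tiny negative eigenvalues
    return V[:, idx] * L                       # n x k coordinates

# Verify on known 3-D points: recovered distances should match the input
rng = np.random.default_rng(6)
P = rng.normal(size=(10, 3))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
Y = classical_mds(D, k=3)
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.allclose(D, D_hat))   # True: exact recovery up to rotation
```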

There are many statistical programs produced by software companies, enough that one must decide which program is the best fit for presenting and analyzing the data. If we have data on the ages of trees, size, growth rate, vitality, and seed production, what is the best statistical program for a multivariate analysis of these parameters?