Multivariate Data Analysis - Science topic

Questions related to Multivariate Data Analysis
  • asked a question related to Multivariate Data Analysis
Question
4 answers
Hello,
I am using multivariate multiple regression for my master's thesis, but I'm not sure whether I am doing the analysis and reporting it correctly. I have very limited time until the deadline to submit my thesis, so any help is very much appreciated.
I would be really glad if someone could recommend/send articles/dissertations using this analysis.
Thanks in advance,
Yağmur
  • asked a question related to Multivariate Data Analysis
Question
3 answers
Hi! I'm looking for an open-source program for exploratory multivariate data analysis, in particular techniques based on Euclidean vector spaces.
Ideally, it should be capable of handling databases as a set of k matrices.
There is a software package known as ACT-STATIS (or an older version named SPAD) which performs this task, but as far as I know they are not open source. Thanks!
Relevant answer
Answer
Thanks a lot ! I'll check and try it
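For later readers of this thread: as far as I know, the ade4 package in R covers STATIS-type analyses of a set of k matrices (as a "k-table"). A minimal sketch, assuming df1, df2 and df3 are data frames describing the same observations with different variable sets:
library(ade4)
ktab <- ktab.list.df(list(df1, df2, df3))    # bundle the k matrices into a k-table
res <- statis(ktab, scannf = FALSE, nf = 3)  # STATIS compromise analysis
plot(res)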
  • asked a question related to Multivariate Data Analysis
Question
19 answers
I ran a PCA in SPSS using varimax rotation. There were 14 variables and three components were extracted. I noticed that some of the variables loaded strongly on two different components; that is, one variable loaded strongly on more than one component. What is wrong, please?
Relevant answer
Answer
Loadings are interpreted as the coefficients of the linear combination of the initial variables from which the principal components are constructed. From a numerical point of view, the loadings are equal to the coordinates of the variables divided by the square root of the eigenvalue associated with the component.
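If it helps to see the cross-loadings concretely, here is a small R sketch with the psych package (an alternative route to SPSS; mydata and the 0.4 cutoff are illustrative assumptions, not fixed rules):
library(psych)
fit <- principal(mydata, nfactors = 3, rotate = "varimax")  # PCA + varimax, as above
print(fit$loadings, cutoff = 0.4, sort = TRUE)  # blank small loadings so cross-loading variables stand out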
  • asked a question related to Multivariate Data Analysis
Question
2 answers
Multivariate data analyses were and still an effective tool to solve problems related to several topics, however I wonder if anyone can use it in the field of artificial intelligence. How can we do it and what are the restrictions as well as the aims that should be kept in mind when we do that?
Relevant answer
Answer
Thanks for your reply.
  • asked a question related to Multivariate Data Analysis
Question
9 answers
There are several independent variables and several dependent variables. I want to see how those independent variables affect the dependent variables. In other words, I want to analyze:
[y1, y2, y3] = [a1, a2, a3] + [b1, b2, b3]*x1 + [c1, c2, c3]*x2 + [e1, e2, e3]
The main problem is that y1, y2, y3 are correlated: an increase in y1 may lead to a decrease in y2 and y3. In this situation, what multivariate multiple regression models can I use? And what are the assumptions of those models?
Relevant answer
Answer
Hello Jialing,
The fact that DVs are correlated is often one argument for choosing a multivariate method to analyze the data rather than generating a univariate model for each individual DV.
If what you are saying is that, causally or temporally, Y1 influences Y2, then perhaps you'd be better off evaluating a path model which incorporates such proposed relationships, rather than simply allowing for correlations/covariances other than zero among the DV pairs in the multivariate regression.
Good luck with your work.
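To make the path-model suggestion concrete, here is a minimal lavaan sketch in R; the variable names follow the notation above, and the particular ordering (y1 driving y2 and y3) is just an assumption to illustrate the idea:
library(lavaan)
model <- '
  y1 ~ x1 + x2
  y2 ~ y1 + x1 + x2   # y1 is allowed to influence y2 directly
  y3 ~ y1 + x1 + x2   # ... and y3
  y2 ~~ y3            # residual covariance between y2 and y3
'
fit <- sem(model, data = mydata)
summary(fit, fit.measures = TRUE)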
  • asked a question related to Multivariate Data Analysis
Question
5 answers
I did a MCA analysis using FactoMineR. I know how to interpret cos2, contributions and coordinates, but I don't know how values of v.test should be interpreted.
Thank you
Relevant answer
Answer
(A v.test with absolute value over 1.96 is equivalent to a p-value below 0.05, under the usual normal approximation.)
  • asked a question related to Multivariate Data Analysis
Question
10 answers
Hi all,
For my master's thesis, I conducted a multivariate multiple regression, since my three dependent variables are correlated with each other. I used Stata 16 and the command "mvreg".
However, I can't find out how to get the adjusted R-squared, and I really want to report the model fit, but the only value I got is the R-squared. Is there a specific reason that I can't get the adjusted R-squared? Or does someone know how to obtain the adjusted R-squared the right way so I can report it?
Furthermore, I can't find out how to check for multicollinearity with VIF values after MMR, and I think that noting that my independent variables and control variables are not highly correlated with each other in my correlation table is not sufficient to exclude multicollinearity. Does someone know how to do this?
Thanks in advance!
Relevant answer
Answer
Qiqi van der Kolk Yes, it is common to report adjusted R2 when conducting a multivariate multiple regression, because (unadjusted) R2 increases as you add predictors to the model, which may overstate the variance actually explained. Adjusted R2 takes the number of predictors into account. However, with multivariate regression you will have multiple R2 values rather than just one: specifically, an R2 for each dependent variable (criterion). If you have, say, 3 dependent variables, you will have 3 adjusted R2 results.
I would assume it doesn't matter how you calculate the adjusted R2 for your thesis, unless you wanted to include the computer output from Stata directly in the paper, or your program requires that statistical tests be conducted via certain software. I'm confident adjusted R2 can be calculated in Stata; it might just take a bit of digging.
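If Stata keeps resisting, the same model is easy to cross-check in R, where summary() reports an adjusted R2 per equation; and since VIFs depend only on the predictors, any one of the single-DV fits will do (names below are placeholders):
fit <- lm(cbind(dv1, dv2, dv3) ~ iv1 + iv2 + iv3, data = mydata)
summary(fit)   # one block per DV, each with R-squared and adjusted R-squared
library(car)   # VIFs are a property of the IVs alone
vif(lm(dv1 ~ iv1 + iv2 + iv3, data = mydata))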
  • asked a question related to Multivariate Data Analysis
Question
5 answers
Hi there,
For my hierarchical regression model, I have planned to report the VIF values to indicate collinearity. I worry that as I am including interaction terms, these values will be high. Would entering the interaction terms separately (i.e., without the proposed moderator) help with this?
Secondly, what is the recognised threshold for a critical VIF value? This stats resource states that a value lower than 10 is acceptable (Hair, Joseph F., et al. Multivariate Data Analysis: A Global Perspective. 7th ed. Upper Saddle River: Prentice Hall, 2009. Print.), but some studies have standardised / mean-centred the variables even at a VIF of approx. 4.0. Lastly, any advice on choosing between standardisation and mean centring would be greatly appreciated.
Thanks a lot,
Esther
Relevant answer
Answer
Multicollinearity can make some of the otherwise significant variables under study appear statistically insignificant.
Good Luck
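On the mean-centring question, a short R illustration (variable names are hypothetical): centring the predictors before forming the product term usually lowers the VIF of the interaction without changing its significance test:
mydata$x_c <- as.numeric(scale(mydata$x, scale = FALSE))  # mean-centre predictor
mydata$m_c <- as.numeric(scale(mydata$m, scale = FALSE))  # mean-centre moderator
fit <- lm(y ~ x_c * m_c, data = mydata)  # expands to x_c + m_c + x_c:m_c
library(car)
vif(fit)  # compare with the VIFs from the uncentred model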
  • asked a question related to Multivariate Data Analysis
Question
5 answers
Discriminant analysis has the assumptions of normal distributions, homogeneity of variances, and correlations between means and variances. If those assumptions are not fulfilled, is there any non-parametric method that can be used as a "substitute" for discriminant analysis?
Many thanks in advance.
Relevant answer
Answer
Clustering or classification methods
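To make that concrete: k-nearest-neighbours is a common distribution-free stand-in for discriminant analysis, since it assumes neither normality nor equal covariances. A minimal R sketch (train, test and the label vectors are placeholders):
library(class)
pred <- knn(train, test, cl = train_grp, k = 5)  # classify test rows by their nearest neighbours
table(predicted = pred, actual = test_grp)       # confusion matrix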
  • asked a question related to Multivariate Data Analysis
Question
6 answers
Hello,
is it possible to use the linear discriminant analysis (LDA) to determine which of the analyzed variables best separates the different groups (which are already known)?
For example, I want to understand how 3 different croplands are different in terms of ecosystem services provisioning. So, I decide to measure 4 variables for each ecosystem (Soil Carbon, Dry matter, Biodiversity, and GHG) and then I run an LDA analysis (on PAST 3.4 here)
I get this result (see the attached picture). Here clearly the Grassland seems to be more different than the other two croplands (because it is more displaced than the other two croplands on the X-axis).
Would it be correct to conclude that this grassland differs most from the other 2 crops and this seems to be determined by its level of biodiversity?
Thanks (and of course, these data are not real. That's just an example)
Relevant answer
Answer
Hello Matteo,
It would be correct to say that the centroid (mean on the linear composite of the variables forming the first discriminant root or function) for the grasslands group is further from the aggregate (all cases) centroid, or from the centroids of the other two groups. However, the display doesn't inform you as to what variable(s) are most influential in the function.
For that, you'd have to look at both: (a) standardized function coefficients, and (b) variable-function correlations (sometimes called loadings or structure coefficients). If variables are uncorrelated, then standardized function coefficients alone will let you know the relative magnitude of emphasis placed on each variable in the function. If they are correlated, then you also have to look at loadings, to be sure that you're not letting collinearity confound your interpretation of relative emphasis as also indicating importance. So, the information presented isn't enough to presume that a single variable (biodiversity) is the reason for the separation in group centroids.
Good luck with your work.
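A small R sketch of the two quantities described above, using MASS (the data frame and column names mirror the example and are assumptions):
library(MASS)
vars <- c("SoilC", "DryMatter", "Biodiversity", "GHG")
fit <- lda(cropland ~ SoilC + DryMatter + Biodiversity + GHG, data = mydata)
fit$scaling                  # discriminant function coefficients
scores <- predict(fit)$x     # case scores on the discriminant functions
cor(mydata[, vars], scores)  # variable-function correlations ("loadings")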
  • asked a question related to Multivariate Data Analysis
Question
3 answers
Dear colleagues, I'm working with Path analyses in lavaan and MVN packages. There are some results that for me are confusing and features from the MVN that I do not know how to set.
My dataset is composed of 130 rows and 7 variables. Using the MVN package to run the Mardia, HZ, and Royston tests, the results indicate that my data are multivariate normal. However, the Mahalanobis distance at different alpha levels gives me different numbers of candidate outliers: alpha 0.5 = 14; alpha 0.6 = 8; alpha 0.65 = 5; alpha 0.7 = 0.
I have different questions: 1. What is the meaning of 'candidate' outliers? If the tests and the Q-Q plot indicate a good fit to multivariate normality, should I not consider the outliers? Based on what?
2. If outliers are the main obstacle to multivariate normality, how can it be that I obtain multivariate normality by 3 different methods while having 0-14 candidate outliers?
3. Most importantly, where can I read which alpha and tolerance values I have to use in my case? Is there any tutorial, or recommendations for choosing alpha according to sample size and other data characteristics?
I was looking on the web and I have found no answer to this. So, any literature recommendation or advice will be welcome.
Thanks and sorry for taking your time. Sincerely,
Relevant answer
Answer
Hello Mario,
If your check of multivariate normality suggests a satisfactory conformance of your data set to the target distribution, I'd forget about jettisoning any cases as outliers unless there was some more compelling reason (such as, impossible values, coding errors, etc.) for doing so.
Good luck with your work.
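For reference, a typical call looks like the sketch below (argument names as in MVN 5.x; the interface has changed across versions, so check ?mvn):
library(MVN)
res <- mvn(mydata, mvnTest = "mardia",
           multivariateOutlierMethod = "quan")  # flag outliers via Mahalanobis distance quantiles
res$multivariateNormality  # Mardia skewness/kurtosis results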
  • asked a question related to Multivariate Data Analysis
Question
6 answers
I'm currently working on my master's thesis, in which I have a model with two IVs and two DVs. I proposed a hypothesis that the two IVs are substitutes for each other in improving the DVs, but I cannot figure out how to test this in SPSS. Maybe I'm overcomplicating it. In my research, the IVs are contracting and relational governance, and thus they might be complementary in influencing my DVs or they might function as substitutes.
I hope anyone can help me, thanks in advance!
Relevant answer
Answer
I think you can check the sign of the coefficients. If the sign is positive, the effects might be complementary; otherwise, substitution effects can be deduced.
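A more direct test of complements vs. substitutes is an interaction term; a sketch in R (names hypothetical; the same product-term logic works in SPSS):
fit <- lm(dv ~ contracting * relational, data = mydata)
summary(fit)
# A positive, significant contracting:relational coefficient suggests the two
# IVs reinforce each other (complements); a negative one suggests each adds
# less when the other is high (substitutes).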
  • asked a question related to Multivariate Data Analysis
Question
6 answers
For my research there are five independent variables (risk, usefulness, awareness, complexity, and ease of use) and a single dependent variable (adoption of e-banking). I want to test the relationship between the IVs and the DV using Pearson correlation analysis and multiple regression analysis to test the hypotheses in SPSS. I know independent variables can be collected through a survey, but I don't know whether the dependent variable can also be collected from a survey. So, my question is: can I collect the dependent variable through a survey using a 5-point Likert scale?
Relevant answer
Answer
Independent variables are controllable variables. Researchers pick their independent variables and manipulate them. The independent variables affect the dependent variables. In the survey, researchers can view the respondents as dependent variables and the questions as independent variables. The researchers can choose what the questions will be, giving them control over them. The ways in which the respondents behave in the survey depend on the questions. Researchers want to develop questions that will ensure that respondents give the most exact answers possible so the survey results are reflective of the real views of people surveyed.
  • asked a question related to Multivariate Data Analysis
Question
2 answers
Hello everyone
I would like to compute the Hosking & Wallis discordancy test based on L-comoments for multivariate regional frequency analysis. Please help me by answering these questions.
1- Is that the same transpose of the U matrix in the attached file?
2- Are there any R packages related to multivariate L-comoments homogeneity tests?
I worked with “lmomco” and “lmomRFA” before this.
Thanks for any help.
Relevant answer
Answer
Dear Muhammad A. El Hameedy I am very thankful to you for your answers to my questions.
  • asked a question related to Multivariate Data Analysis
Question
14 answers
I am making a research design for estimating the relation of three independent variables, namely organic, conventional, and integrated agriculture, to the decrease of soil erosion. I need to decide what kind of analysis I need, that is, bivariate or multivariate data analysis. I feel that taking correlations between pairs of variables is more appropriate than looking for an interaction that leads to an aggregate measurement, that is, multiple correlation or multiple linear regression. Any comment that helps to answer my question will be appreciated as feedback.
Relevant answer
Answer
I agree. Find the Pearson correlation between each independent variable and the dependent variable first and then proceed to use Multiple Linear regression
  • asked a question related to Multivariate Data Analysis
Question
7 answers
According to Hair et al., not only the number of parameters or observed items but also model complexity and the number of constructs, along with their communalities, should be considered in sample size calculation. They mention that the minimum sample size should be at least 500 if the model has a large number of constructs (>7), some with lower communalities and/or fewer than three measured items.
Hair JF, Black WC, Babin BJ, Anderson RE. Structural equations modeling overview. Multivariate Data Analysis, 7th edition. United Kingdom: Prentice Hall PTR; 2013.
Relevant answer
Please, I need this publication:
Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate data analysis (7 ed.). Upper Saddle River, NJ: Pearson Education.
  • asked a question related to Multivariate Data Analysis
Question
3 answers
My dataset has 5 variables. One of the variables is the group; there are 10 different groups. How do I check the relationships between the groups based on the other 4 variables?
I want to check which groups are most similar to each other and which are very different.
Also, how do I plot the hierarchical nature and the spatial nature of these relationships between groups?
Which multivariate technique should I choose? I am using R.
Thank you very much in advance for your help.
Relevant answer
Answer
One way is using PCA Biplot.
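A base-R sketch of both views (the column names v1..v4 and group are placeholders): a PCA biplot for the spatial picture, and a dendrogram of group centroids for the hierarchical one:
vars <- c("v1", "v2", "v3", "v4")
biplot(prcomp(mydata[, vars], scale. = TRUE))  # spatial view of all cases
cent <- aggregate(mydata[, vars], list(group = mydata$group), mean)  # group centroids
rownames(cent) <- cent$group
hc <- hclust(dist(scale(cent[, vars])))  # cluster the 10 centroids
plot(hc, main = "Group similarity (centroid distances)")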
  • asked a question related to Multivariate Data Analysis
Question
16 answers
Hi,
I have analysed my data using multivariate multiple regression (8 IVs, 3 DVs), and significant composite results have been found.
Before reporting my findings, I want to discuss in my results chapter (briefly) how the composite variable is created.
I have done some reading, and in the sources I have found, authors simply state that a 'weighted linear composite' or a 'linear combination of DVs' is created by SPSS (the software I am using).
They do not explain how they are weighted, and as someone relatively new to multivariate statistics, I am still unclear.
Are the composite DVs simply a mean score of the three DVs I am using, or is a more sophisticated method used on SPSS?
If the latter is true, could anyone either a) explain what this method is, or b) signpost some useful (and accessible) readings which explain the method of creating composite variables?
Many thanks,
Edward Noon
Relevant answer
Answer
OK.
MANOVA works by analysing a linear function of the DVs rather than the raw DVs. For example it could be
Ycomp = a*DV1 + b*DV2 + c*DV3
The weights a, b and c are optimised so that the variance in Ycomp explained by the predictors is maximised (as if in a one-way ANOVA).
This seems like a reasonable thing to do, but it has some strange consequences. First, it's atheoretical, so the composite might not be interpretable even if the separate DVs are. Second, it optimises on the data, so it will capitalise on sampling variability and other sample characteristics; thus an identical replication using the same analysis will actually use a different composite DV. Third, standardised effect size statistics relate to the composite, so they are even harder to interpret than normal.
I'm not a fan of MANOVA for these and other reasons (e.g., it doesn't protect against Type I error in the way people assume).
  • asked a question related to Multivariate Data Analysis
Question
4 answers
I am doing a study on temperature compensation for fruit dry matter prediction using NIRS spectra. As I don't know much about MATLAB and mostly perform my multivariate regression using Unscrambler software, I am looking for a simplified version of the external parameter orthogonalization (EPO) algorithm.
Relevant answer
Answer
I'm using the MATLAB PLS toolbox for analyzing my data, and I have a problem using EPO preprocessing for PLS-DA.
When I use EPO, I get an error saying that the toolbox cannot perform cross-validation. Can anyone help me with this situation?
  • asked a question related to Multivariate Data Analysis
Question
4 answers
For instance comparing satisfaction levels coded on a scale from 0 (completely dissatisfied) to 10 (completely satisfied) between 2010  and 2011.
Relevant answer
Answer
Regarding the Wilcoxon signed rank test, be very careful. The test is based on a ranking of the differences between pre-test and post-test scores. In other words, it silently assumes that the "distance" between two consecutive points is defined and - !!! - equal: 2-1 = 6-5 = 4-3, etc. Moreover, what does it mean to subtract completely dissatisfied (0) from completely satisfied (10)? Completely satisfied (10)? In ordinal data you operate on labels, not numbers. And what if you get a fractional outcome, like 5.5? "Neutral and a half"? Technically you can do everything you want, but it should have an underlying sense. Also, quantile regression won't help there. But there is a good tool for that purpose: ordinal regression (aka the proportional odds model) - ideal for dealing with Likert items. It also "integrates" with mixed models, through which you can account for the repeated observations, i.e. choose the covariance structure (via either a random effects covariance structure, a residual covariance structure, or both). And of course, you're not limited to just one covariate.
If, however, you believe such "distance measure" exists for your Likert items, or that fractional outcomes still make sense, having 10 points, you could try classic parametric methods, assuming interval scale or non-parametric tests in case of distributional issues.
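A minimal sketch of the suggested proportional-odds model with a random effect for the repeated measurements, using the ordinal package in R (the data and column names are hypothetical):
library(ordinal)
mydata$rating <- factor(mydata$rating, ordered = TRUE)  # 0..10 treated as ordered labels
fit <- clmm(rating ~ year + (1 | respondent), data = mydata)  # cumulative link mixed model
summary(fit)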
  • asked a question related to Multivariate Data Analysis
Question
4 answers
The PCA results for a multi-layer aquifer (the carbonate karst layer and the alluvial aquifer being the main reservoirs) give three factors (eigenvalues above 1). PC1 shows positive weightings for electrical conductivity, Cl, Na, Ca and negative weightings for HCO3. PC2 shows moderate positive weightings for pH and SO4 and moderate negative weightings for Mg. PC3 shows moderate negative weightings for K and moderate positive weightings for NO3 and pH. What is the meaning of the last two factors (PC2 & PC3)? Thank you in advance.
Relevant answer
Answer
Hello Belkacem,
While someone more familiar with aquifer variables might "see" meaning in these arrangements, it is also entirely possible that there is none, since:
PCA arranges the variable set into linearly independent composite variables, such that the first one extracted will always account for the maximum amount of total variation in the data set, while successive components will account for successively lesser amounts. There's nothing inherent in the process that assures the resultant component/composite variates will necessarily be meaningful.
I'd strongly suggest you try rotation of the extracted components; this will frequently aid in trying to make sense of what each component represents. As well, you should set some criterion for the variable-component loading (correlation) to be declared "salient" for purposes of trying to characterize the meaning of each component. In other words, pay attention only to the variables that show salient relationships to a component when trying to characterize what that component might represent.
Good luck with your work!
  • asked a question related to Multivariate Data Analysis
Question
6 answers
I want to correlate meteorological data and particulate matter data. Can I use both the Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA)? Or is there any preliminary test to determine which one to use? Thanks!
Relevant answer
Answer
If you use PCA, you can run any analysis you want on the newly created PCA variables, including correlation with other variables.
If you use CCA, you need an environmental matrix.
Your answer lies in your data structure and what you hope to accomplish.
  • asked a question related to Multivariate Data Analysis
Question
5 answers
Hello,
a seemingly simple design question: The aim is to visualize the dependence of A and B by connecting A and B by a straight line (possibly with a label). The design options are: line type, line strength, text or symbolic label.
How would you visualize the "significance" and/or "strength" of the dependence?
Details:
- A and B are either independent (no line) or dependent. They are considered dependent if the likelihood of being independent (the p-value / "significance") is small (which corresponds in each setting to a certain value of a test statistic).
- The "strength" of dependence of A and B might be given on a scale, e.g. [-1,1] if one considers classical correlation.
(The use of colour is a further design option, which breaks down in black and white print. Therefore it was excluded.)
### all below can be skipped, it provides only further details for the reader interested in the background of the question ###
The detection of dependence and its quantification are usually separate procedures, thus a mixture of both might be confusing...
Background:
Apart from many other new contributions the paper arXiv:1712.06532
introduces a visualization scheme for higher order dependencies (including consistent estimators for the dependence structure).
Based on feedback there seems to be a tendency to interpret the method/visualization by a wrong intuition (rather than by its description given in the paper)... so I wonder if this can be moderated by an improved visualization.
If you want to test your intuition use in R:
install.packages("multivariance")
library(multivariance)
dependence.structure(dep_struct_several_26_100,alpha = 0.001)
dependence.structure(dep_struct_star_9_100,alpha = 0.01)
dependence.structure(dep_struct_ring_15_100,alpha = 0.01)
# which performs dependence structure detections on sample datasets
The current visualization does NOT include the "strength" of dependence, but that's what some seem to believe to see.
The paper is concerned with dependencies of higher order, thus it is beyond the simple initial example of this question. But still, it depicts dependencies by lines and uses as a label usually the value of the test statistic. Redundancy is introduced by using colour, line type and in certain cases also the label to denote the order of dependence.
It seems that using the value of the test statistic as the label causes irritation. The fastest detection method is based on conservative tests; in this setting there is a one-to-one correspondence (independent of sample sizes and marginal distributions) between the value of the test statistic and the p-value - thus it provides a very reasonable label (for the educated user). In general, the value of the test statistic gives only a rough indication of the significance.
A further comment to the distinction between "significance" and "strength": In the paper also several variants of correlation-like measures are introduced, which are just scaled version of the test statistics. Thus (for a fixed sample size and fixed marginals) there is also a one-to-one correspondence between the "strength" and the conservative "significance". These measures also satisfy certain dependence measure axioms. But one should keep in mind that these axioms are not sufficient to provide a sensible interpretation of different (or identical) values of the "strength" in general (e.g., when varying the marginal distributions). ... that's why currently all methods are based on "significance".
Relevant answer
Answer
Go to the paper; that will definitely help you in this regard.
  • asked a question related to Multivariate Data Analysis
Question
2 answers
Recently several measures for testing independence of multiple random variables (or vectors) have been developed. In particular, these allow the detection of dependencies also in the case of pairwise independent random variables, i.e., dependencies of higher order.
Thus, if you have a dataset which was considered uninteresting - because no pairwise dependence was detected - it might be worth retesting it.
If your data is provided in a matrix x where each column corresponds to a variable, then the following lines of R code perform such a test with a visualization.
install.packages("multivariance")
library(multivariance)
dependence.structure(x)
If the plot output is just separated circles (these represent the variables) then no dependence is detected. If you get some lines connecting the variables to clusters then dependence is detected, e.g.
dependence.structure(dep_struct_several_26_100)
dependence.structure(dep_struct_iterated_13_100)
dependence.structure(dep_struct_ring_15_100)
dependence.structure(dep_struct_star_9_100)
Depending on the number of samples and number of variables the algorithm might take some time, the above examples with up to 26 variables and 100 samples run quickly.
Due to publication bias, datasets are usually only published if some (pairwise) dependence is present. Thus there should be plenty of cases where data was considered uninteresting, but a test of higher order dependence would show dependencies. If you have such datasets, it would be great if you shared them.
Comments and replies - public and private - are welcome.
For those interested in a bit more theoretic background: arXiv:1712.06532
Relevant answer
Answer
The underlying package has been updated (current version: 2.2.0). Based on the refined methods higher order dependencies can be detected in more cases and with better accuracy.
In particular, new features for the dependence structure detection are:
* The approximate probability of a type I error is provided.
* New option 'type': In particular, 'type = "pearson_approx"' provides a fast and approximately sharp detection, in contrast to the original 'type="conservative"' which is still faster but much more conservative.
E.g.
dependence.structure(x, type = "pearson_approx")
* New option 'structure.type': The original algorithm clusters dependent variables and treats them thereafter as one variable. This is still the default option 'structure.type = "clustered"'. But in the case of many variables this can cluster variables which are only indirectly dependent via some other variables. In contrast the new option 'structure.type = "full"' treats always each variable separately and detects dependence for all tuples which are lower order independent. E.g.
dependence.structure(x, type = "pearson_approx", structure.type = "full")
Based on this, many datasets feature higher order dependence. I am looking forward to hearing from field experts who can also provide an explanation, within their subject, for the occurrence of these higher order dependencies.
  • asked a question related to Multivariate Data Analysis
Question
2 answers
I'm looking for examples and analyses of autocorrelation tests performed on cross-sectional data and the issues encountered (data prop for analysis, problems with estimations, etc). Not interested in spatial autocorrelation.
Many thanks in advance!
Relevant answer
Answer
The richer you are, the more you save.
The taller and better looking you are, the higher your income.
The more fashionable your neighbourhood, the more expensive your house will be.
OK? You get it: the residuals on each of the auto-correlated variables will move in the same direction.
This should not be confused with multicollinearity, where 2 or more variables mean the same thing and become redundant.
  • asked a question related to Multivariate Data Analysis
Question
13 answers
I have a dataset with 56 variables: 4 dependent and 52 independent. Of the independent variables, 45 are categorical; 3 of the dependent variables are categorical and the rest of the variables are continuous. Each variable has 1500 observations. The independent variables are nominal, and the categorical dependent variables are ordinal. I want to check whether there is any effect of the independent variables on each dependent variable.
Relevant answer
Answer
If you are interested in the general properties of your data set - like any multivariate data set with non-trivial signals encoded in a matrix, and its capacity to infer groups - you should give network-based exploratory data analysis a try.
You'll find a bunch of related posts with ideas and links to literature on the Genealogical World of Phylogenetic Networks (GWoN).
The neighbour-net in particular is a most versatile tool for multivariate analysis, yet largely unknown and massively underused. Its applications in phylogenetics and linguistics are manifold, but it can be very revealing for entirely different data (you didn't mention whether you are working with biological or other data); here are a few of the examples we explored on GWoN:
A network of gun legislation in the U.S. (to illustrate diversity) – https://phylonetworks.blogspot.com/2018/03/visualizing-us-gun-laws.html
A network of moons (for the purpose of classification) – https://phylonetworks.blogspot.com/2018/06/to-boldy-go-where-no-one-has-gone.html
A network of party programmes (to illustrate overall similarity and dissimilarity, also over time) – https://phylonetworks.blogspot.com/2018/10/jumping-political-parties-in-germanys.html
Visualisation of the data behind rankings, e.g.
  • asked a question related to Multivariate Data Analysis
Question
5 answers
Dear all,
My question is the following:
I have a large dataset: 100,000 observations, 20 numerical and 2 categorical variables (i.e. mixed variables).
I need to cluster these observations based on the 22 variables; I have no idea how many clusters/groups I should expect a priori.
Because of the large dataset I use the clara() function in R (based on "pam").
Because of the large number of observations, there is no way to compare distance matrices (R does not allow such calculations, and it is not a problem of RAM), therefore the common way of cluster selection using treeClust() and pamk() and comparison of "silhouette" values does not work.
My main question is: can I use quantities like total SS, within SS, and between SS to get an idea of the best performing tree (in terms of number of clusters)? Do you have any other ideas for how I can select the right number of clusters?
Best regards
Alessandro
Relevant answer
Answer
The book
" Experimental Design" by Federer
will be helpful in this situation.
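On the number-of-clusters question itself: clara() returns silhouette information computed on its best sample, so you can scan candidate k values without ever building a full distance matrix. A minimal sketch, assuming x holds the variables in numeric form (e.g. dummy-coded categoricals):
library(cluster)
ks <- 2:10
asw <- sapply(ks, function(k)
  clara(x, k, samples = 50)$silinfo$avg.width)  # average silhouette width per k
plot(ks, asw, type = "b", xlab = "number of clusters k", ylab = "average silhouette width")
ks[which.max(asw)]  # candidate "best" k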
  • asked a question related to Multivariate Data Analysis
Question
3 answers
I'm trying to use Mettler Toledo's IC Quant software to generate calibration models for the reaction components. I run my reactions at high temperature and pressure, since I cannot collect the reaction standards needed for the calibration at the reaction temperature, I prepare the reaction standards first by running the reactions to different conversions of my limiting reagent (so that I have different concentrations for the components). I then collect the spectrum of these reaction standards at room temperature and use these spectra and the measured GC-FID  concentrations for multivariate data analysis. 
Now the problem is that the absorptions become less intense with increasing temperature. Hence, when I try to apply the calibration model (built using reaction standards collected at 25 C) to the real-time reaction spectra collected at the reaction temperature of 140 C, I see a significant offset of the predicted concentrations from their actual values (the predicted concentrations have negative values). I also notice that the temperature dependence is linear in the range that I tested (25 - 140 C). I'd like to know if there is a standard procedure for applying a temperature correction to spectra collected at a different temperature in real time, so as to get accurate concentration predictions.
Relevant answer
Answer
A physical law is always good, but I think that an empirical correlation that you found under your experimental conditions, with your specific materials, is better. When I do FTIR-ATR experiments I prefer my own calibrations over external ones like PNNL, and I would prefer using them also over theoretical ones (but maybe that's just me)... good luck.
  • asked a question related to Multivariate Data Analysis
Question
7 answers
Power analysis software such as G*Power 3 can determine the minimum required sample size for logistic regression, but I can't find software to determine the sample size for a multinomial logit regression.
Relevant answer
Answer
Joanne M Eaves , G*Power can do *multiple* logistic regression, but I don't think it can do *multinomial* logistic regression.
  • asked a question related to Multivariate Data Analysis
Question
7 answers
I'm interested in forecasting stock market data. I tried to predict the close price and volume separately using ARIMA, but did not get good results, so I tried an LSTM model. If I want to perform a multivariate analysis, considering the correlation of both variables, which are the best multivariate analysis techniques?
Relevant answer
Answer
Hi,
How many variables do you have, and what is the length of your time series?
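If you want a classical multivariate baseline alongside the LSTM, a vector autoregression models close price and volume jointly; a sketch with the vars package in R (the series names are placeholders):
library(vars)
y <- cbind(close = close_ts, volume = volume_ts)    # aligned multivariate series
VARselect(y, lag.max = 8, type = "const")$selection # choose the lag order p
fit <- VAR(y, p = 2, type = "const")
predict(fit, n.ahead = 5)  # joint forecasts that use the cross-correlations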
  • asked a question related to Multivariate Data Analysis
Question
3 answers
I have data on the total dissolved solids of apples as reference values (y-variable).
I also have near-infrared spectral data as predictors (x-variables).
I have the StatSoft Statistica software for the analysis.
Relevant answer
Answer
Various related software packages can be used for building an ANN predictive model from NIR spectra. My suggestions are Unscrambler and IBM Modeler. Before ANN modeling, PCA should be done to reduce the large number of spectral variables to PC1, PC2, .... After calibrating the ANN, you have the values predicted by the calibrated model as well as the reference values for each sample. Now use the RPD index to assess the goodness of your calibrated model.
  • asked a question related to Multivariate Data Analysis
Question
9 answers
When we do EFA, do we need to include all variables' items together, or each variable's items separately? (For example, imagine I have 3 latent variables and each latent variable has 10 items. When we do EFA, do we need to put all 30 items together, or each set of 10 items separately, i.e., doing the same procedure 3 times, once for each variable?)
Relevant answer
Answer
Try a discriminant validity test. It can happen that your constructs are overlapping (measuring the same thing). This could explain why you have fewer factors than expected from the EFA.
I also recommend the Common Method Bias test. I have seen situations where uninterested participants selected answers 'on the right' or 'in the middle' and it caused discriminant validity issues.
If these tests are fine and your constructs come from the literature, report alphas and composite reliability and conduct your main tests. Otherwise: 1. Delete constructs (I would not reduce them to 8, with items from different constructs). 2. Try second-order constructs. If some constructs are too close, maybe they are part of the same latent second-order construct?
Good luck
  • asked a question related to Multivariate Data Analysis
Question
5 answers
I am doing a study on classifying fruits in the visible region, taking spectra with Vis-NIR spectroscopy. I want to classify the fruit based on maturity in terms of skin color. I am trying to use SVMC, SIMCA and KNN with the PLS toolbox. I went to the Eigenvector wiki site, but the procedure described there is a bit blurry to me. It would be great if somebody could tell me the stepwise procedure for performing these in the PLS toolbox using MATLAB.
Relevant answer
Answer
For classification, you can perform the PCA (Principal Component Analysis)
  • asked a question related to Multivariate Data Analysis
Question
9 answers
Hi all,
I have a data frame of multivariate abundances, measured from sites under two different treatments.  These sites have been sampled each summer for ~15 years, and thus are repeated measures.
I am modelling my data using the mvabund package in R, which fits Generalized Linear Models.  
My model is of the form abundance~Year*Treatment
When I perform an ANOVA on my model, I need to account for the lack of independence between years as the samples are repeated measures.  I want to do this using restricted permutations with the 'permute' package in R.
However, I am struggling to theoretically understand how I need to permute my data in order to account for this.
Any help or suggestions would be greatly appreciated.  First time poster so sorry if this is at all unclear.
Relevant answer
Answer
Hi Stacey,
Seems that you did it well. In his example about Tikus data (same transects sampled several years), D Warton says that: "the transects serve as blocks, but we want to test for an effect of time which operates within blocks so blocks cannot be resampled. Instead, time labels could be resampled within blocks if you input a matrix of bootstrap labels using the permute package".
You could therefore construct something like:
permID=shuffleSet(n=...,nset=...,control=how(block=data$treatment))
Then Fit 2 models (with and within treatment) and compare them with anova using bootID=permID in the function.
Hope that helps.
  • asked a question related to Multivariate Data Analysis
Question
3 answers
For example, I run ADONIS (http://cc.oulu.fi/~jarioksa/softhelp/vegan/html/adonis.html) to investigate an effect of BMI (a continuous variable) on beta-diversity (weighted UniFrac) and I get a significant p-value. This means that BMI has an effect on beta-diversity. How was the p-value calculated?
H0 for ADONIS is "the centroids of the groups, as defined in the space of the chosen resemblance measure, are equivalent for all groups." It's pretty clear how this hypothesis is tested when there are two groups, but it is not clear how this is done in the case of a continuous variable. Does anyone know?
One more question :) - I often see PERMANOVA and ADONIS used interchangeably. Is this correct?
Relevant answer
Answer
and the answer is:
"Good question. I should probably change that definition. Variation explained is directly analogous to that of general linear models. With a continuous variable, it acts like simple linear regression, where each point is associated with its own "centroid" which is the best fit linear approximation." Now everything makes sense.
  • asked a question related to Multivariate Data Analysis
Question
7 answers
Data Transform, Nominal Scale, Ordinal Scale, Interval Scale, SPSS
Relevant answer
Answer
It has an outcome (INTENSITY, with two categories: positive and negative) and four independent variables (commerce and industry, farming, fishing, and law and order). Thanks.
  • asked a question related to Multivariate Data Analysis
Question
6 answers
I am trying to complete a multiple imputation of some missing data in my dataset using SPSS. I have three continuous variables that are each missing data. However, when I attempt to conduct a multiple imputation for these variables I get the message:
Warning: There are no missing values to impute in the requested variables.
This is obviously incorrect, as I have several missing values in each of these columns, and when I ran the "analyze patterns" function it revealed missing data in these three categories. Any thoughts on how I might resolve this issue? Thanks in advance!
Relevant answer
Answer
You need to include a variable with no missing data, otherwise it won't work - or at least each participant should have some data; completely missing cases don't compute. Hope that helps.
  • asked a question related to Multivariate Data Analysis
Question
10 answers
I'm running an exploratory study where for three weeks I get participants to report their perception of a phenomenon using a self-report Likert scale whilst wearing an array of sensors.
Data types:
  • Ordinal Likert data (1-5), collected 15 times per day
  • Skin temperature data collected once per second
  • Room temperature data collected once every 3 minutes
  • Room humidity data collected once every 3 minutes
  • Heart rate data collected once per second
This data will be collated into a table for each participant, where a brief example of the data is shown in the attached image.
For each participant, I might expect between 50-120 complete entries like this. However, as participants are from a hard to recruit medical population, I can only realistically expect to get about 5-6 participants.
So, I'll have a reasonable number of data instances (roughly 250-720), but from few participants. I also expect the Likert data to be unbalanced and to centre around a neutral (3) rating, with the extremes of the scale being relatively rare.
The aim is to explore whether the Likert (particularly the class) data has any form of relationships, or whether it's possible to form a model of some sort. So, say, it would be great to be able to find a correlation between Likert ratings of 5 and SkinTemp. However, I would be astounded to find such a simple relationship - instead, I imagine it is far more likely that the Likert data will be influenced by each of the data points at any one time.
My current thinking will be to try out PCA and maybe an LDA approach. However, as I'm relatively new to multivariate data analysis I'm still unsure if this would be a good way to explore the data.
If I had a larger dataset, I'd be tempted to try some supervised machine learning classification algorithms (e.g. RF, adaboost or SVM), however, I'm even more cautious that such a small dataset might make this approach inappropriate.
The question- If you were faced with this type of data and this amount of data, what would be the predominant statistical approaches and methods you'd use to interrogate this data set?
The cheeky extra question- As I'm new to this, can anyone recommend any good books or tutorials which specifically deal with reasonably sized datasets from few participants? Most seem to have the situation of medium-large datasets with 'high' N, so it's hard to know if it would be appropriate to employ the same analytic process.
Relevant answer
Answer
Following
  • asked a question related to Multivariate Data Analysis
Question
5 answers
In most real-life situations I have come across, the number of active effects is quite low. I need a situation where a large number (i.e. more than half) of the contrasts are active, to illustrate a method for analysing unreplicated factorial experiments.
  • asked a question related to Multivariate Data Analysis
Question
12 answers
Hello all,
I want to use Spearman's rank correlation to measure the association between two constructs - culture and ethics. I have coded the Likert-scale data and aggregated the responses from the several questions in the questionnaires into two sets (culture and ethics) using the median values of the responses to each question. However, in one of the constructs, of 7 questions, the median values are all the same number (2). Now SPSS will not perform Spearman's rank correlation for the two constructs because 'one of the variables is constant'. Am I right to have used median values? What did I do wrong? How do I remedy this?
Many thanks! 
Relevant answer
Answer
SPSS newbie here. Feel free to mock me; I now realize how ridiculous what I did was. But it might help others searching for this answer in the future.
I ran a bivariate correlation of one measure of my scale against only one demographic. Thus that column is constant - all 1's - and there is nothing to correlate against.
  • asked a question related to Multivariate Data Analysis
Question
10 answers
How do I draw a dot plot for different groups and denote comparisons?
(See attached images.)
If there is a procedure in SPSS/Excel, or any free, user-friendly online software, please guide me.
Thank you
Relevant answer
Answer
I found this software very convenient to create boxplot or violin plots with datapoints: http://vinci.bioturing.com
  • asked a question related to Multivariate Data Analysis
Question
7 answers
I'd like to compare two software packages, SPSS and Statgraphics Centurion version 15, for principal component analysis and factor analysis. The factorability tests include the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett's test of sphericity in Statgraphics Centurion version 15.
Relevant answer
Answer
Dear RG colleague,
The KMO statistic, which can vary from 0 to 1, indicates the degree to which each variable in a set is predicted without error by the other variables. A value of 0 indicates that the sum of partial correlations is large relative to the sum of correlations, indicating factor analysis is likely to be inappropriate. A KMO value close to 1 indicates that the sum of partial correlations is not large relative to the sum of correlations, and so factor analysis should yield distinct and reliable factors. Hair et al. (2006) suggest accepting a value of 0.5 or more; values between 0.5 and 0.7 are mediocre, and values between 0.7 and 0.8 are good.
Bartlett's test of sphericity is a statistical test for the presence of correlations among variables, providing the statistical probability that the correlation matrix has significant correlations among at least some of the variables. For factor analysis to work, some relationships between variables are needed; thus, a significant Bartlett's test of sphericity is required, say p<.005.
In SPSS: Analyze > Dimension Reduction > Factor, then select Descriptives.
Regards,
  • asked a question related to Multivariate Data Analysis
Question
3 answers
How do I get discriminant loadings in multiple discriminant analysis using SPSS?
SPSS reports canonical discriminant function coefficients. Do these refer to discriminant loadings?
What is the difference between a discriminant function coefficient and a loading?
Thanks in advance.
Relevant answer
Answer
Dear Srikanth, thank you for your question.
A discriminant function is a model or equation generated in Discriminant analysis for differentiating or discriminating between the groups or classes given from the original variables. The discriminating functions will subsequently produce the coefficients and discriminant scores as desired.
As in factor analysis, where the factor functions or models are used to produce coefficients and factor scores. The difference is that in factor analysis the most correlated variables are grouped together, whereas in discriminant analysis the different groups are being differentiated.
  • asked a question related to Multivariate Data Analysis
Question
14 answers
How do you calculate the Average Variance Extracted (AVE) in SPSS for SEM?
I know it can be estimated by software such as AMOS, LISREL, ....
I want to know whether SPSS can be used for the calculation of AVE.
Relevant answer
Answer
Here is the software
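In case it helps while you look for software: for a single construct, AVE is just the mean of the squared standardized indicator loadings from your CFA, so it can be computed by hand in any tool. A sketch in R with hypothetical loadings:
loadings <- c(0.72, 0.81, 0.66, 0.78)  # standardized loadings of one construct's indicators
AVE <- mean(loadings^2)
AVE  # >= 0.50 is the usual adequacy benchmark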
  • asked a question related to Multivariate Data Analysis
Question
5 answers
It is essential for multivariate data analysis.
Relevant answer
Answer
Govinda,
If the problem you are talking about is related to encountering non-constant variance in your model, I would recommend weighted least square approaches.
Hope this helps.
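A minimal weighted least squares sketch in R, under the assumption that the error variance grows with x, so weights proportional to 1/x^2 are sensible (both the data and the variance structure are illustrative):
fit <- lm(y ~ x, data = mydata, weights = 1 / mydata$x^2)  # WLS via the weights argument
summary(fit)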
  • asked a question related to Multivariate Data Analysis
Question
4 answers
I want to perform an O2PLS-DA analysis of multi-omics data (from different metabolomics, lipidomics and proteomics experiments) using SIMCA 13.0.2. I have data in matrix format (samples in rows with labels and variables in columns). I can perform PCA, PLS-DA and OPLS-DA, but the O2PLS-DA tab is not active. I think I have a problem with data arrangement; however, I am not sure it is the only problem. Any help will be highly appreciated!
Relevant answer
Answer
If you want to integrate the six of them in one go, I would rather use multiple co-inertia analysis, regularized generalized CCA, or STATIS. Refer to Meng et al. 2016:
I personally dislike SIMCA; it is highly limited. In any case, I presume you have to provide two separate matrices at a time for O2PLS-DA. As for the other methods, you can run PCA, PLS-DA, etc. on the entire thing; however, note that you run the risk of contaminating the results with the technical noise specific to each separate omics readout. Hope this is clear? Cheers
  • asked a question related to Multivariate Data Analysis
Question
7 answers
Field, A. (2009). Discovering statistics using SPSS. Sage publications.
Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (1998). Multivariate data analysis (Vol. 5, No. 3, pp. 207-219). Upper Saddle River, NJ: Prentice hall.
Relevant answer
Answer
As Jos notes, this issue has been (and will continue to be) debated many times.  I like Jeremy Miles' post on the Stack Exchange page linked below. 
Also, when it comes to Likert scales, I think some people forget about the distinction between Likert-type items and Likert scales.  See the second link below. 
HTH.
  • asked a question related to Multivariate Data Analysis
Question
3 answers
I would like to use propensity score matching to measure the effect of treatment between the control and treated groups.
Doing it in SPSS 22 after installing the R plugin is easy, but I would like to understand the output and measure the effect.
Relevant answer
Answer
Please let me know if the following references/sites are helpful to you in your quest:
1. Why Propensity Scores Should Not Be Used for Matching - Gary King
https://gking.harvard.edu/files/gking/files/psnot.pdf
2. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3144483/
3. Propensity Score Matching: Definition & Overview - Statistics How To
http://www.statisticshowto.com/propensity-score-matching/
4. Propensity-Score Matching (PSM)
http://cega.berkeley.edu/assets/cega_events/31/Matching_Methods.ppt
5. Understanding Propensity Scores
http://bayes.cs.ucla.edu/BOOK-09/ch11-3-5-final.pdf
Dennis
Dennis Mazur
  • asked a question related to Multivariate Data Analysis
Question
1 answer
I tried the OFFSET function but it is quite complex. I have 60,000 rows of data from UV detection. I need to offset them all downwards by +1 so that only the y-axis of the graphed chromatogram(s) will be affected. If there is a way of manipulating the figure to do this, please let me know.
The percentage deviation and multiplication did not work.
I am using Excel 2017. If you know Origin, it won't matter; it is similar. I couldn't download SPSS.
End goal is attached below. 
Relevant answer
Answer
As for statistical analysis: you can use the R language as effectively as SPSS, and it is now used more widely than either SPSS or SAS. There are a lot of tutorials and demonstrations of ways to use it on YouTube. I think a few might even be in Spanish.
Also see:
  • asked a question related to Multivariate Data Analysis
Question
24 answers
I have generated a new variable from a factor analysis ("religiosity"), the same as in the article in the attached picture. But I do not know how to standardize the values of this variable to a 0-100 scale as is done in the article. I want to take the same steps but I am lost here. Could anyone help?
Relevant answer
Answer
Yes, do use this formula to make the scales (factor scores) 100 point:
y = ((factor - min)/ (max - min)) * 100
Forget the T-score. Sorry for the confusion.
You can transform the factor scores with the above formula.
Example:
min = -1.47203
max = 1.8796
factor1 =  0.96967
y = ((0.96967+ 1.47203)/ (1.8796  + 1.47203)) * 100 
y = 72.851
Remember that factor score are standardized (mean = 0, std = 1)
  • asked a question related to Multivariate Data Analysis
Question
4 answers
I have interval-scale survey data. I have defined the socioeconomic status of the survey site with 5 parameters and wish to create a composite index. I do not have any prior weights for these parameters. Can I use PCA to create an index? My data satisfy the KMO and Bartlett sphericity tests, but for the results to be robust, is it important for the data to pass normality and "no non-linearity" tests as well? Some of my parameters are normal while some are not. Please guide.
Relevant answer
Answer
Why don't you try:
It is an online app based on the following R package:
Good Luck !
  • asked a question related to Multivariate Data Analysis
Question
2 answers
Hi
When constructing a principal coordinate analysis (PCoA) to see (dis)similarity in a set of data, you have to choose between "Co-variance standardized" and "distance standardized" measures of variability. If you are using genetic distances between populations, which measure do you choose (between co-variance standardized and distance standardized) to construct your PCoA?
Thanks in advance for all answers 
Relevant answer
Answer
Hi Paul
Thank you for your answer.
  • asked a question related to Multivariate Data Analysis
Question
4 answers
The project is based on following patients longitudinally at different age categories, with high variability in time of follow-up as well as missing follow-ups.  
The data that we are analyzing is parametric and continuous.  
We are thinking of fitting a mixed-effects model, but is there a robust test to detect a difference among the samples?
Thanks in advance for any suggestions!
Relevant answer
Answer
Many thanks to both Hendrik and Mehmet.  We'll see if we can incorporate your feedback into our project!
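For other readers: a minimal sketch of the mixed-effects approach in R with lme4/lmerTest, which tolerates irregular follow-up times and missing visits (variable names are hypothetical):
library(lmerTest)  # lme4 models plus tests for the fixed effects
fit <- lmer(outcome ~ age_category + (1 | patient), data = mydata)  # random intercept per patient
summary(fit)
anova(fit)  # test for differences among age categories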
  • asked a question related to Multivariate Data Analysis
Question
6 answers
Dear all,
I wonder if it is possible not only to combine different variables into one variable via "compute" or "transform" in SPSS but to add subcategories to that variable too.
For example, I have 24 items (answered either yes -> 1 or no -> 0), of which 12 items belong to subcategory 1, 8 items to subcategory 2, and again 8 items to subcategory 3. I would like to make one variable out of all 24 variables but without "losing" my subcategory arrangement.
I hope my question/explanation is clear, and I appreciate every help!
Thanks a lot,
Anne
Relevant answer
Answer
Hello Anne.  You wrote:
"For example, I have 24 items (answers either yes--> 1 or no-->0) of which 12 items belong to subcategory 1 and 8 items to subcategory 2 and again 8 items to subcat. 3. I would like to make one variable out of all 24 variables but without "loosing" my subcategory arrangement."
I have two questions:
  1. 12 + 8 + 8 = 28. Some items must belong to more than one category. Please clarify.
  2. What are the possible values (and value labels) for the new variable you want to create?
HTH.
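Once the 12 + 8 + 8 = 28 inconsistency is resolved, one way to keep the subcategory structure is to compute a score per subcategory plus a total; a minimal pandas sketch, in which the item names and subscale assignments are hypothetical:
import pandas as pd
df = pd.read_csv("items.csv")  # hypothetical file with columns item01..item24 coded 0/1
# Hypothetical assignment of items to subcategories (4 items in subcat3 here so
# the example sums to 24; adjust to the real layout once membership is clarified).
subcats = {
    "subcat1": [f"item{i:02d}" for i in range(1, 13)],
    "subcat2": [f"item{i:02d}" for i in range(13, 21)],
    "subcat3": [f"item{i:02d}" for i in range(21, 25)],
}
for name, items in subcats.items():
    df[name] = df[items].sum(axis=1)  # count of "yes" answers per subcategory
df["total"] = df[[*subcats]].sum(axis=1)  # overall score without losing subscales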
  • asked a question related to Multivariate Data Analysis
Question
2 answers
Hi, I am analyzing the data from this questionnaire (link below). Most of the dependent variables are dichotomous/binary, and the independent variables are ordinal (educational level), scale (age), and binary (gender).
I know I could use logistic regression, for example to predict the effect of an independent variable on a binary response variable like "I use a tablet computer".
But I would like to create a model of latent variables (so first I need some factors) built from the dependent variables, which would show something similar to a structural equation model. For example, that latent variables created from items covering different internet usage, different skills, and different options of internet availability (exogenous) explain about xx percent of the variance of the latent variable created from items describing e-government use (endogenous).
But maybe I am mixing apples and oranges :-)
Thank you for any suggestions
(I use SPSS and AMOS, also heard about the FACTOR software)
Relevant answer
Answer
Thanks a lot, I will check the source.
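As one concrete route outside SPSS/AMOS, here is a minimal sketch using the third-party Python package semopy, assuming it is installed and that its lavaan-style syntax fits your items; every variable name here is hypothetical:
import pandas as pd
from semopy import Model
df = pd.read_csv("survey.csv")  # hypothetical file with the observed items
# Hypothetical model: two latent predictors explaining latent e-government use.
desc = """
skills   =~ skill1 + skill2 + skill3
access   =~ access1 + access2
egov_use =~ egov1 + egov2 + egov3
egov_use ~ skills + access
"""
model = Model(desc)
model.fit(df)
print(model.inspect())  # parameter estimates, including the structural paths
Note that with mostly binary indicators, an estimator designed for categorical data (e.g., WLSMV in lavaan or Mplus) would be more defensible than default maximum likelihood.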
  • asked a question related to Multivariate Data Analysis
Question
3 answers
Hi, I conducted PCA on a set of 28 variables capturing various economy-related data using Stata. The eigenvalues are greater than 1 for 7 components. When I run KMO, the output just states "The correlation matrix is singular". Can I still go ahead with the results of the PCA? Is KMO a necessary condition for PCA? If yes, how can I fix the singular-matrix problem?
Relevant answer
Answer
Hi,
Matrix singularity can have multiple causes, but a common one is that two or more variables in the analysis are perfectly correlated. Have a look at the correlation matrix of your variables to see if you can spot the source of the trouble. In the meantime, here's a good Stack Exchange discussion about singularity:
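If you can export the data, a quick numpy/pandas check can confirm that diagnosis; the file name is hypothetical:
import numpy as np
import pandas as pd
df = pd.read_csv("economy.csv")  # hypothetical file with the 28 variables
R = df.corr().to_numpy()
# Rank below the number of variables means the correlation matrix is singular.
print(np.linalg.matrix_rank(R), R.shape[0])
# Flag (near-)perfectly correlated pairs as likely culprits.
cols = df.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(R[i, j]) > 0.99:
            print(cols[i], cols[j], round(R[i, j], 4))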
  • asked a question related to Multivariate Data Analysis
Question
3 answers
Hi, 
I have conducted a study to see whether a training intervention has an impact on delegates' perceptions of the supportiveness of 3 different behavioural factors (i.e. I have 3 DVs, each measured at 3 different time points). I assessed delegates before training (time 1), immediately after training (time 2), and 6 months later (time 3). Currently, I have run three separate one-way repeated-measures ANOVAs to test this, one for each behavioural-category DV, and have used Bonferroni-corrected post-hoc tests to see where there are significant differences between the three time points.
Is this correct, or is there some way of running a repeated-measures MANOVA?
I also have data from a control group at time 1 and time 2 (but not time 3). Therefore, I have run three mixed ANOVAs (one for each behavioural-category DV), where group (intervention or control) is the between-subjects factor and time (time 1 or time 2) is the within-subjects factor. As only 2 time points are assessed, no post-hoc tests are conducted. Is this approach correct, or should I be using a mixed MANOVA, as I again have multiple (related) DVs? Can you perform a mixed MANOVA when your intervention and control groups are very different in size (i.e. N = 1283 vs. N = 40)?
I know that MANOVA is good for multiple dependent variables, but I am struggling to find many tutorials that use a study design similar to mine.
Finally, when using the Bonferroni-corrected post-hoc tests, I know that this accounts for multiple comparisons within one test, but as I have run several separate ANOVA tests, should I be controlling for this too somehow (as, in theory, couldn't all the separate ANOVAs I've run be considered part of one overall 'family' of tests, thereby increasing the likelihood of Type 1 error)?
Thanks in advance
Relevant answer
Answer
A few points:
1) if you want to be able to compare the 3 behavioral factors then that should be added as another repeated factor.
2) You could do this as a repeated measures ANOVA but using the MIXED procedure in SPSS would be more straightforward given the missing time point for the control group, etc. However, you will need to have the data in long form.
3) Scrap the Bonferroni control and focus on effect sizes, CIs, future replications, etc. instead of just p-values. I have attached a paper that discusses this issue.
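On point 2, restructuring to long form is mechanical if you happen to work in Python alongside SPSS; here is a minimal pandas sketch of the reshape, with hypothetical column names like b1_t1 (behaviour 1, time 1):
import pandas as pd
df = pd.read_csv("training.csv")  # hypothetical wide file: id, group, b1_t1..b3_t3
# Wide columns like b1_t1 become rows: one subject-time-behaviour observation each.
long = df.melt(id_vars=["id", "group"], var_name="measure", value_name="rating")
long[["behaviour", "time"]] = long["measure"].str.split("_", expand=True)
long = long.drop(columns="measure")
# Control-group rows for the missing time 3 are simply absent; a mixed model
# (e.g., SPSS MIXED or statsmodels MixedLM) can handle that imbalance.
print(long.head())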
  • asked a question related to Multivariate Data Analysis
Question
5 answers
I wrote an algorithm based on a GMM. After a few iterations the following problem occurs: the density values for some observations are zero, and the entire algorithm breaks down.
For example, for observation i = 67203, the density value in every component is zero (not even a small number).
With the following code, I calculated the density of all components for this observation, and all of them came out zero.
for l=1:9
pr(l)=mvnpdf(y(67203,:),mu(l,:),sigma(:,:,l))
end
Hence, the posterior probabilities and then the parameters (the mu and sigma matrices) become NaN.
Is there a way to prevent these zero values for the densities and the NaN values for the posterior probabilities?
Relevant answer
Answer
Your code seems very messy and difficult to check. In various places "l" (ell) is the first subscript and in others the second. Your use of the transpose (especially in mu) seems incorrect. The core is P(t)[l|i]: what is the sum over i of that measure? All the rest is algebra. I never use "l" (ell) as a subscript; it is too easy to mix up with 1 (one), I (capital eye), | (vertical bar), i (small i) ... you get the idea.
Assume we have data as P(ell given eye), where "ell" is the column and "eye" is the row. Is your convention that vectors are row vectors and that a transpose gives a column? Have you been 100% consistent with that? You can make a column vector of 1s (ones) and get the sum over "i" as P^T x 1, or you can use a row vector of 1s and compute 1 x P. You have to do all this carefully.
If this is all correct, you then have to go back and see why the sum over i of P(t)[l|i] is 0.
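Beyond the indexing issues, the zero densities themselves are the classic floating-point underflow in EM for GMMs, and the standard cure is to work with log densities and the log-sum-exp trick. A minimal Python sketch of an underflow-safe E-step, matching the 9-component setup in the question (using subscript j rather than l, per the advice above):
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp
def e_step(y, w, mu, sigma):
    # y: (n, d) data; w: (k,) weights; mu: (k, d) means; sigma: (k, d, d) covariances.
    n, k = y.shape[0], w.shape[0]
    log_r = np.empty((n, k))
    for j in range(k):
        # logpdf stays finite where pdf would underflow to exactly 0.
        log_r[:, j] = np.log(w[j]) + multivariate_normal.logpdf(
            y, mean=mu[j], cov=sigma[j], allow_singular=True)
    # Normalize in log space: responsibilities never become 0/0 = NaN.
    log_r -= logsumexp(log_r, axis=1, keepdims=True)
    return np.exp(log_r)
The same idea carries over to MATLAB: compute log densities directly and normalize with a log-sum-exp rather than dividing raw mvnpdf values.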
  • asked a question related to Multivariate Data Analysis
Question
4 answers
I am looking for suggestions for analyses that can compare different taxa in terms of the relative difference in composition among sites.
I have 4 parallel datasets of species abundance data from 4 different taxa sampled in the same sites (n=12).
Each site was sampled between 4 and 10 times. Usually (though not always) sampling was done at the same time for all taxa within a site, but not all sites were sampled at the same time, so the data are unbalanced.
I can create balanced subsets if needed but this would severely truncate the data.
I've heard of co-correspondence analysis, co-inertia analysis, and possibly multiple factor analysis as potential candidates for this type of comparison, but I'm not sure about the differences or which is most appropriate.
Are there pros and cons/restrictions/assumptions for each of these?
Is there an alternative method that I have not mentioned that would be better?
Also, what exactly do these analyses allow me to test? Is their intention to be able to say, for example, that taxa A and B were highly correlated in terms of variation in composition across sites, while taxon C showed low correlation with any other taxon, etc.?
Thanks
Tania
Relevant answer
Answer
Thank you for all your responses.
The questions I would like to ask are:
1) Do the spatial patterns of diversity differ among taxa? For example, one taxon may show strong clustering of sites based on habitat type, while another may show similar composition across all sites.
2) Do spatial trends differ over time? For example, one taxon may show stable composition in all habitats over time, while another may show convergence of composition between habitats, and a third may show high variability over time in one particular habitat.
The intention is to demonstrate that the different taxa have different spatial and temporal distributions and therefore can or cannot be used as surrogates for each other based on composition.
3) Characterise the similarity (beta diversity) between and within habitat types, based on all taxa.
I can of course compare univariate measures of diversity in each site using ANOVA, but I would like to compare the taxa based on their composition.
Thanks for further suggestions addressing these specific research questions.
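For question 1, one simple, assumption-light approach is to ordinate each taxon's site-by-species table and compare the ordinations with a Procrustes analysis; a minimal Python sketch with simulated stand-in data:
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import procrustes
rng = np.random.default_rng(1)
taxonA = rng.poisson(3, size=(12, 30))  # 12 sites x 30 species, taxon A
taxonB = rng.poisson(3, size=(12, 25))  # same 12 sites, taxon B
# Ordinate each taxon's sites (PCA here; PCoA on Bray-Curtis is common too).
ordA = PCA(n_components=2).fit_transform(taxonA)
ordB = PCA(n_components=2).fit_transform(taxonB)
# Procrustes rotates/scales one ordination onto the other; low disparity
# means the two taxa arrange the sites similarly.
_, _, disparity = procrustes(ordA, ordB)
print(disparity)
For a significance test of the Procrustes fit, the PROTEST permutation procedure (available in R's vegan package) is the usual companion.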
  • asked a question related to Multivariate Data Analysis
Question
3 answers
Hi folks,
I want to survey employees in three waves and need to link the three time points to each individual. Furthermore, I need to survey one leader of every employee at each time point for causal inference.
Do you have any suggestions for software which is easy to use for this purpose?
Thank you
Relevant answer
Answer
Dear Phil,
you can collect the data with any software.
The key point later, before data analysis, is to be able to restructure the database in the way you want. It is a matter of the "statistical unit". For example, in your case the statistical unit could be the employee, with a large number of variables such as evaluations from period 1, period 2, period 3, etc. One good program I know for changing the statistical unit is Sphinx.
Is it clear enough? Regards.
Stephane
  • asked a question related to Multivariate Data Analysis
Question
3 answers
Is there any source on the acceptable size of the smallest cluster, or on a threshold for the ratio of cluster sizes, in cluster analysis output?
Thank you for your help.
Relevant answer
Answer
Thanh
This question is a conundrum that has been bugging investigators using cluster analysis ever since the technique was developed. In the past, you simply selected a fixed level of distance similarity or dissimilarity and used that as a threshold to define groups (clusters) with no minimum number of samples. As Moacyr correctly states above, even one sample would be acceptable if it was sufficiently different from all of the other samples.
Now there is a statistical technique that applies permutation tests to let you identify significantly different clusters of samples. You need the software, but perhaps someone at your university will pay for it, or perhaps there is something similar in R. See the attached citation. I don't sell their software or anything, so I have nothing to gain here, but I've used it and it works well. The paper describes the technique I mention, references earlier papers, and describes several other interesting techniques.
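Whatever threshold you adopt in the meantime, the smallest-cluster size and the size ratio are easy to inspect directly; a minimal sketch with simulated stand-in data:
import numpy as np
from sklearn.cluster import KMeans
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))  # stand-in for the real data
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
sizes = np.bincount(labels)
print(sizes)                                   # samples per cluster
print(sizes.min(), sizes.max() / sizes.min())  # smallest cluster, size ratio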
  • asked a question related to Multivariate Data Analysis
Question
3 answers
I want to show the distribution of species along a transect line. Can anyone guide me on how to draw it?
Thanks in advance for your time.
Relevant answer
Answer
Hi Mohammad, did you have any success with this? I'm looking for software to accomplish something similar.
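In case plain plotting software is enough, a transect diagram can be drawn directly; here is a minimal matplotlib sketch in which the species names and positions are invented:
import matplotlib.pyplot as plt
# Hypothetical records: positions along the transect (m) per species.
records = {
    "Species A": [5, 12, 18, 44, 51],
    "Species B": [2, 8, 9, 15],
    "Species C": [30, 36, 42, 60, 75, 80],
}
fig, ax = plt.subplots(figsize=(7, 2.5))
for row, (species, positions) in enumerate(records.items()):
    ax.scatter(positions, [row] * len(positions))  # one horizontal band per species
ax.set_yticks(range(len(records)), labels=list(records))
ax.set_xlabel("Distance along transect (m)")
plt.tight_layout()
plt.show()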
  • asked a question related to Multivariate Data Analysis
Question
4 answers
Subject closed
Relevant answer
Answer
Two items are not enough for a factor analysis because they generate only a single correlation, but you certainly can combine them by taking an average, as you suggest.
  • asked a question related to Multivariate Data Analysis
Question
3 answers
Suppose I want to compare two independent groups X1 and X2 with respect to one latent variable Z (comprising 3 indicators Z1, Z2, and Z3) in SPSS. Is it possible to use Z as the test variable instead of using Z1, Z2, and Z3 separately?
Relevant answer
Answer
You need to create the latent variable "trust", e.g. by doing a factor analysis with the variables trust1, ..., trust5 (which supposedly define one factor) and saving the factor scores. Then you enter the factor scores into the U-test.
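For anyone doing this outside SPSS, the same two-step workflow can be sketched in Python; the column names Z1, Z2, Z3, and group follow the question, and the file name is hypothetical:
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from scipy.stats import mannwhitneyu
df = pd.read_csv("survey.csv")  # hypothetical file with Z1, Z2, Z3 and a group column
# One-factor model for the latent variable Z; save the factor scores.
fa = FactorAnalysis(n_components=1)
df["Z"] = fa.fit_transform(df[["Z1", "Z2", "Z3"]]).ravel()
g1 = df.loc[df["group"] == "X1", "Z"]
g2 = df.loc[df["group"] == "X2", "Z"]
print(mannwhitneyu(g1, g2))  # U test on the single latent score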
  • asked a question related to Multivariate Data Analysis
Question
3 answers
I measured ad liking with two repeated-measures samples: participants in each sample saw 7 different ads, 14 in total. I had to run two waves because otherwise the study would have been too long. Participants differed between the first and second waves. I would like to compare the ad-liking scores of all 14 items, but I don't know which statistical test to use: if I compare only one wave, it would be a one-way repeated-measures ANOVA, but adding the second wave, where participants are different, it is no longer a pure repeated-measures design but a combination of repeated measures and independent samples. Does anyone have an idea? I use SPSS.
Relevant answer
Answer
It seems superior, well done.
  • asked a question related to Multivariate Data Analysis
Question
3 answers
I have quite a huge dataset including data from different years. I would like to create one variable from 4 different variables. These variables take the value 1 or 5 (or -2, -1 for missing). Some of the variables overlap with the same value, which is why I can't just use the "compute variable - sum" function.
Relevant answer
Answer
Nevertheless, as it is described, the new variable may be bedeviled. 
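If the intended combination is "first available answer across the four variables" (an assumption, since the overlap rule is not fully specified), a minimal pandas sketch with hypothetical variable names would be:
import numpy as np
import pandas as pd
df = pd.read_csv("panel.csv")  # hypothetical file with v1..v4 coded 1/5, and -1/-2 missing
cols = ["v1", "v2", "v3", "v4"]
clean = df[cols].replace([-1, -2], np.nan)  # recode missing before combining
# Because the variables overlap, take the first non-missing value per row
# rather than a sum, so overlapping answers are not double-counted.
df["combined"] = clean.bfill(axis=1).iloc[:, 0]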
  • asked a question related to Multivariate Data Analysis
Question
5 answers
I have measured the task performance (Performance) of 48 people under two different conditions (Condition 1 and Condition 2). The time taken by participants to complete the task (Time) and their accuracy (Score) were measured in both conditions as two variables representing Performance. I have also measured participants' sensitivity to noise (Sensitivity). Can I use MANCOVA for this type of analysis, using Time and Score as the two outcome variables and Sensitivity as a covariate? And if so, what would a significant (p < .05) Time*Score*Sensitivity interaction mean? Thanks
Relevant answer
Answer
Dear Zanyar,
    You can use MANCOVA on data with this sort of arrangement. The question you need to consider first is what hypothesis is really being tested. The reason for jointly considering Time and Score is that you believe that they will jointly rise or fall. If the impact of condition is not necessarily the same over both variables then a joint test may not make sense.
   In the context of cognitive psychology experiments, there is often a trade-off between speed and accuracy that varies by subject. Before interpreting your result, you need to consider whether this is likely to be the case; negative or clearly nonlinear correlations between Time and Score would be a telltale sign. I will also point out that accuracy measures are likely to show ceiling effects and time measures often show right skew. Both should be managed as likely violations of the MANOVA assumptions.
   The traditional solutions revolve around either analyzing time and score separately or building a composite score based on ROC curves. For example A' can be used to capture any tradeoff. There are also plenty of penalty functions that people have tried. 
   Returning to your original plan, a T X S X S interaction suggests that the changes in Time vary with both Score and Sensitivity. Alternately you can say that  the changes in Score vary with both Time and Sensitivity. This is simple enough, but suggests that you need to conduct follow up tests to describe how Time and Score are interacting and how this interaction changes over the range of sensitivity. This can all be done within the MANCOVA framework, assuming the distributions make sense.
Good luck.
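If it helps to see the joint test concretely, here is a minimal sketch of this design in Python's statsmodels MANOVA, with hypothetical column names; the speed-accuracy caveats above still apply:
import pandas as pd
from statsmodels.multivariate.manova import MANOVA
df = pd.read_csv("performance.csv")  # hypothetical: time, score, condition, sensitivity
# Time and Score jointly on the left; condition and the covariate on the right.
m = MANOVA.from_formula("time + score ~ condition * sensitivity", data=df)
print(m.mv_test())  # Wilks' lambda etc. for each term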
  • asked a question related to Multivariate Data Analysis
Question
4 answers
I am conducting an experiment with 1 dependent variable and 3 independent variables; the IVs have 2, 2, and 3 levels, so I will have 12 treatments in total (12 experimental groups). The dependent variable is credibility, and I know that the most appropriate way of making the analysis is a multifactorial ANOVA. For this reason I would like to know: which is the most appropriate measurement scale for assessing the DV in terms of statistical applications (ANOVA)?
Relevant answer
Answer
Karen, the ratio scale of measurement is of an even higher level than the interval scale, and ANOVA is even better suited to ratio-scale than to interval-scale data. In such a case, a parametric test based on the usual Gaussian distributional assumptions would be applicable.
  • asked a question related to Multivariate Data Analysis
Question
1 answer
I am calculating effect sizes for a literature review; however, several of the papers use neuropsychological tests, and group means are reported as standardised T-scores (i.e. mean = 50, SD = 10).
Can I calculate an effect size from these types of statistics?
Relevant answer
Answer
If all you have is a T Score for a single sample of people, then no - it's a descriptive statistic and there is no comparison being made.
However, if you have T Scores for at least two groups (or conditions), then yes! In fact, the T Scores make it easy. For instance, suppose you have scores on a memory test from before and after an intervention (such as alcoholics entering treatment, then retested after a month of sobriety). All you have to do is subtract one score from the other, then divide by 10. 
Many effect size statistics work this way: take the difference in mean scores and divide by the standard deviation. Where they differ is in what standard deviation is used (e.g., that for the control group? that for the groups combined? etc.). But if you know the standard deviation is 10, you're all set.
One caveat, though. Just because it's a T Score doesn't necessarily mean that the standard deviation of scores in your sample will be 10. It could easily be a bit higher or lower. And if so, then you should use the standard deviation you actually have, not the theoretical one. 
One more thought. I'd be surprised if the papers you're consulting failed to report inferential statistics (such as t-tests or correlations). If they do report them, then you don't need to start from scratch, so to speak: to convert from t to d for two independent groups, you can use d = t * sqrt(1/n1 + 1/n2).
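For completeness, the arithmetic described above is one line of code; the example means are invented:
def cohens_d_from_t_scores(mean1: float, mean2: float, sd: float = 10.0) -> float:
    # Effect size as a mean difference divided by the (T-score) standard deviation.
    return (mean1 - mean2) / sd

# Hypothetical example: memory T-scores after vs. before treatment.
print(cohens_d_from_t_scores(54.0, 47.0))  # 0.7, i.e. a 7-point T-score gain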