University of Otago
Question
Asked 28 June 2015
Could someone help me decide the ideal number of clusters from the pseudo t-squared graph in SAS?
In SAS, I ran the ACECLUS and CLUSTER procedures to analyze a data set of 16 morphometric characters of a bird.
All Answers (4)
PPD
It would be helpful if you provided your SAS code for PROC CLUSTER. For all we know, you made a mistake. See, for example, the examples in the SAS documentation for PROC CLUSTER.
Assuming your code is correct and there aren't any problems with the data itself, these results suggest to me that there isn't any clustering in your data. Because of the peak in pseudo t squared, you should graphically investigate the data using 5 clusters (see below), but my intuition is that you won't see clear separation. Let's look at each statistic individually:
CCC is the cubic clustering criterion; the idea behind it is to compare the R squared you get with a specific number of clusters against the R squared you would expect from clustering a uniformly distributed set of points. That is, you interpret it much as you would R squared. You are getting STRICTLY negative values (and, in fact, they decrease as the number of clusters grows before increasing again; I would interpret that increase as overfitting). This means that, with X clusters, the R squared for your data is lower than what you would expect from uniformly distributed points. This is evidence of a lack of clustering (or of problems with the data).
In addition, according to SAS Technical Report A-108 (CCC was developed by SAS), "If all values of the CCC are negative and decreasing for two or more clusters, the distribution is probably unimodal or long-tailed." The report goes on to say that very negative values may be due to outliers. I would check your data and your code to make sure, but this strongly suggests that the data are unimodal, and thus that there is no clustering.
Pseudo F is the ratio of between-cluster variance to within-cluster variance. That is, it provides a measure of how well separated the clusters are. Your plot shows that this ratio is essentially unchanged regardless of the number of clusters you define. This implies a lack of clustering behavior.
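For reference, here is the pseudo F ratio written out, assuming the definition given in the SAS/STAT documentation for PROC CLUSTER. The symbols are introduced here for illustration: $n$ is the number of observations, $G$ the number of clusters at the level in question, $T$ the total sum of squares, and $P_G$ the within-cluster sum of squares summed over the $G$ clusters.

$$
F = \frac{(T - P_G)/(G - 1)}{P_G/(n - G)}
$$

The numerator is the between-cluster mean square and the denominator is the within-cluster mean square, so a flat pseudo F curve means that adding clusters does not improve separation relative to within-cluster spread.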
Pseudo T squared is an index that quantifies the difference in the ratio of between-cluster variance to within-cluster variance when clusters are merged at a given step (put another way, pseudo T squared works "backwards", from right to left on the plot). If there is a distinct jump in pseudo T squared with X clusters, then X+1 represents the optimal number of clusters. In your case, you see a jump in pseudo T squared with 4 clusters, so you should graphically investigate the data with 5 clusters to look for clear separation. However, given the CCC and pseudo F values, I would guess you don't have any clustering and that this pseudo T squared peak is just an artifact due to an outlier.
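As a sketch of what is being computed, again assuming the definition in the SAS/STAT documentation for PROC CLUSTER: when clusters $K$ and $L$ (with $N_K$ and $N_L$ observations and within-cluster sums of squares $W_K$ and $W_L$) are merged into cluster $M$, the pseudo t squared for that merge is

$$
t^2 = \frac{B_{KL}}{(W_K + W_L)/(N_K + N_L - 2)}, \qquad B_{KL} = W_M - W_K - W_L .
$$

A large value at the step that leaves X clusters means the two clusters just merged were well separated, which is why the reading points to X+1 clusters.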
University of Otago
Thank you very much, Ryan, for your descriptive explanation. These data are from two sister species and they SHOULD cluster. I've attached my SAS code.
PPD
I don't see any obvious errors. So, two notes:
1) Plot the data with scatterplots using the PROC CLUSTER output. See the example in the linked SAS help page. What do the data look like? Can you see distinct clusters in the data?
2) Again, you need to do some exploratory data analysis. I notice you are using the Ward minimum-variance method for determining cluster distance. Note that this method is extremely sensitive to outliers; check your data for skew and outliers that may be driving the results.
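The two checks above could be sketched in SAS roughly as follows. This is only a sketch under stated assumptions: WORK.ACE, WORK.TREE and WORK.CLUS are hypothetical data set names, WORK.ACE is assumed to be the OUT= data set from PROC ACECLUS containing the canonical variables CAN1 and CAN2, and 5 clusters is simply the number suggested by the pseudo t squared peak.

```sas
/* Hypothetical names: WORK.ACE is assumed to be the PROC ACECLUS OUT= data set
   holding the canonical variables CAN1 and CAN2. */
proc cluster data=work.ace method=ward ccc pseudo outtree=work.tree noprint;
   var can1 can2;
run;

/* Cut the tree at the number of clusters to inspect (5 here). */
proc tree data=work.tree nclusters=5 out=work.clus noprint;
   copy can1 can2;
run;

/* 1) Scatterplot of the canonical variables, grouped by assigned cluster. */
proc sgplot data=work.clus;
   scatter x=can1 y=can2 / group=cluster;
run;

/* 2) Skewness, extreme observations and histograms for outlier checks.
      Replace CAN1 CAN2 with the raw morphometric variables as needed. */
proc univariate data=work.ace;
   var can1 can2;
   histogram can1 can2;
run;
```

PROC UNIVARIATE reports skewness and the most extreme observations by default, which is usually enough to spot the kind of outlier that Ward's method is sensitive to.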
University of Otago
Thank you for your continued support, Ryan. I plotted the data, and there seem to be clusters. What do you think?
In the ACECLUS procedure I specified p=0.03 to get rid of outliers.
Could you please tell me the code for changing the 4 colors of the clusters into 4 shapes?
The code used was:
LEGEND1 FRAME CFRAME=LIGR CBORDER=BLACK
POSITION=CENTER VALUE=(JUSTIFY=CENTER);
AXIS1 LABEL=(ANGLE=90 ROTATE=0) MINOR=NONE ORDER=(-10 TO 20 BY 5);
AXIS2 MINOR=NONE ORDER=(-10 TO 20 BY 5);
PROC GPLOT DATA=ZOHARA.NEW;
PLOT CAN2*CAN1=CLUSTER/FRAME CFRAME=LIGR
LEGEND=LEGEND1 VAXIS=AXIS1 HAXIS=AXIS2;
RUN;
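One possible way to get four shapes instead of four colors is to define one SYMBOL statement per cluster before the PROC GPLOT step. This is a sketch, not a tested answer from the thread: the particular shapes are arbitrary choices, and it assumes the LEGEND1, AXIS1 and AXIS2 statements above are still in effect.

```sas
/* Sketch only: one SYMBOL definition per cluster value.
   Giving every definition the same COLOR= makes SAS/GRAPH advance to the
   next SYMBOL definition (the next shape) for each cluster, rather than
   reusing one shape across a list of colors. */
SYMBOL1 VALUE=DOT      COLOR=BLACK;
SYMBOL2 VALUE=TRIANGLE COLOR=BLACK;
SYMBOL3 VALUE=SQUARE   COLOR=BLACK;
SYMBOL4 VALUE=DIAMOND  COLOR=BLACK;

PROC GPLOT DATA=ZOHARA.NEW;
   PLOT CAN2*CAN1=CLUSTER / FRAME CFRAME=LIGR
        LEGEND=LEGEND1 VAXIS=AXIS1 HAXIS=AXIS2;
RUN;
QUIT;
```

If more than four clusters are plotted, additional SYMBOLn statements would be needed, one per cluster value.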