Question
Asked 28 June 2015

Could someone help me decide the ideal number of clusters from the pseudo T-squared graph in SAS?

In SAS, I ran the ACECLUS and CLUSTER procedures to analyze a data set of 16 morphometric characters of a bird.

All Answers (4)

It would be helpful if you provided your SAS code for PROC CLUSTER; for all we know, you made a mistake. See, for example, the examples in the SAS documentation for PROC CLUSTER.
Assuming your code is correct and there aren't any problems with the data itself, these results suggest to me that there isn't any clustering in your data. Given the peak in pseudo T squared, you should graphically investigate the data using 5 clusters (see below), but my intuition is that you won't see clear separation. Let's look at each statistic individually:
CCC is the cubic clustering criterion; the idea behind it is to compare the R squared you get with a specific number of clusters against the R squared you would get by clustering a uniformly distributed set of points. That is, you interpret it similarly to R squared. You are getting STRICTLY negative values (and, in fact, they are decreasing as the number of clusters increases before increasing again; I would interpret that increase as overfitting). This means that the model you are fitting to the data with X clusters fits worse than it would for uniformly distributed points. This is evidence of a lack of clustering (or of problems with the data).
In addition, according to SAS Technical Report A-108 (CCC was developed by SAS), "If all values of the CCC are negative and decreasing for two or more clusters, the distribution is probably unimodal or long-tailed." The report goes on to say that very negative values may be due to outliers. I would check your data and your code to make sure, but this strongly implies that the data are unimodal, and thus that there is no clustering.
Pseudo F is the ratio of between-cluster variance to within-cluster variance. That is, it provides a measure of how separated the clusters are. Your plot shows that this ratio is essentially unchanged regardless of the number of clusters you define. This implies a lack of clustering behavior.
Pseudo T squared is an index that quantifies the change in the ratio of between-cluster variance to within-cluster variance when clusters are merged at a given step (put another way, pseudo T squared works "backwards", from right to left on the plot). If there is a distinct jump in pseudo T squared at X clusters, then X+1 represents the optimal number of clusters. In your case, you see a jump in pseudo T squared at 4 clusters, so you should graphically investigate the data with 5 clusters to look for clear separation. However, given the CCC and pseudo F values, I would guess you don't have any clustering and the pseudo T squared jump is just an artifact due to an outlier.
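For reference, here is a minimal sketch of how these three criteria can be requested and plotted; the data set name (WORK.BIRDS), the variable names, and the P= value are placeholders, not your actual code:
* Hypothetical sketch: ACECLUS output feeding PROC CLUSTER with the CCC and PSEUDO options requested;
PROC ACECLUS DATA=WORK.BIRDS OUT=WORK.ACE P=0.03;
   VAR MEAS1-MEAS16;               * placeholder names for the 16 morphometric variables;
RUN;

PROC CLUSTER DATA=WORK.ACE METHOD=WARD CCC PSEUDO OUTTREE=WORK.TREE;
   VAR CAN1-CAN16;                 * canonical variables created by ACECLUS;
RUN;

* The OUTTREE= data set stores _CCC_, _PSF_, and _PST2_ by _NCL_ (the number of clusters), so the three criteria can be plotted directly;
PROC GPLOT DATA=WORK.TREE;
   PLOT _CCC_*_NCL_ _PSF_*_NCL_ _PST2_*_NCL_;
RUN;
QUIT;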
Zohara Rafi
University of Otago
Thank you very much, Ryan, for your descriptive explanation. These data are from two sister species, so they SHOULD cluster. I've attached my SAS code.
I don't see any obvious errors. So, two notes:
1) Plot the data with scatterplots using the PROC CLUSTER output (see the sketch after these notes). See the example in the linked SAS help page. What do the data look like? Can you see distinct clusters in the data?
2) Again, you need to do some exploratory data analysis. I notice you are using Ward's minimum-variance method for determining cluster distance. Note that this method is extremely sensitive to outliers; check your data for skew and outliers that may be driving the results.
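For note 1, a minimal sketch (assuming the OUTTREE= data set from PROC CLUSTER is called WORK.TREE and that the canonical variables CAN1 and CAN2 are carried along; both names are assumptions, not your code) could look like this:
* Hypothetical sketch: cut the dendrogram into 5 clusters and scatterplot two canonical variables by cluster;
PROC TREE DATA=WORK.TREE OUT=WORK.CLUS5 NCLUSTERS=5 NOPRINT;
   COPY CAN1 CAN2;                 * keep the variables needed for plotting;
RUN;

PROC GPLOT DATA=WORK.CLUS5;
   PLOT CAN2*CAN1=CLUSTER;         * CLUSTER is created by PROC TREE;
RUN;
QUIT;
If the two species really do separate, this scatterplot should show visibly distinct groups rather than one continuous cloud.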
Zohara Rafi
University of Otago
Thank you for your continued support, Ryan.... I plotted the data, and there seem to be clusters. What do you think?
In the ACECLUS procedure I invoked P=0.03 to get rid of outliers.
Could you please tell me the code for changing the 4 colors of the clusters into 4 shapes?
The code used was:
LEGEND1 FRAME CFRAME=LIGR CBORDER=BLACK
        POSITION=CENTER VALUE=(JUSTIFY=CENTER);
AXIS1 LABEL=(ANGLE=90 ROTATE=0) MINOR=NONE ORDER=(-10 TO 20 BY 5);
AXIS2 MINOR=NONE ORDER=(-10 TO 20 BY 5);

* Plot the second canonical variable against the first, grouped by cluster;
PROC GPLOT DATA=ZOHARA.NEW;
   PLOT CAN2*CAN1=CLUSTER / FRAME CFRAME=LIGR
        LEGEND=LEGEND1 VAXIS=AXIS1 HAXIS=AXIS2;
RUN;
QUIT;
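For reference, one way to switch the four cluster colors to four marker shapes with the legacy SAS/GRAPH code above is to add SYMBOL statements before the GPLOT step; with PLOT CAN2*CAN1=CLUSTER, each level of CLUSTER picks up the next SYMBOLn definition in order. A minimal sketch, reusing the data set and statements already shown (the marker choices are arbitrary):
* Hypothetical sketch: one SYMBOL definition per cluster; only the shape distinguishes the groups;
SYMBOL1 V=CIRCLE   C=BLACK;
SYMBOL2 V=SQUARE   C=BLACK;
SYMBOL3 V=TRIANGLE C=BLACK;
SYMBOL4 V=DIAMOND  C=BLACK;

PROC GPLOT DATA=ZOHARA.NEW;
   PLOT CAN2*CAN1=CLUSTER / FRAME CFRAME=LIGR
        LEGEND=LEGEND1 VAXIS=AXIS1 HAXIS=AXIS2;
RUN;
QUIT;
Specifying a color in each SYMBOL statement matters: without C=, a single SYMBOL definition is recycled across the color list before the next shape is used.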
