RSS MattersLink to the last RSS article here: Got EBCDIC? Take This PROC and Call Me in the Morning. -- Ed. Basics of Cluster AnalysisBy Mike Clark, Research and Statistical Support Services ConsultantCluster analysis is a statistical technique used to categorize cases or variables into like groups or ‘clusters’ with the usual goal to then proceed with other more conventional analyses using the information gained from the cluster analysis. Essentially it is a statistical method of classification and is often performed often when one has but a vague idea of what to expect from the data, and so is in that sense largely exploratory in nature. An example would be the biological classification of animals into kingdom, phylum, class and so forth down to species. The primary goal in cluster analysis is to find some meaningful structure in the data, and when it comes to cluster analysis, there are many options to choose from on how to go about this. This article will discuss approaches available in SPSS with a bit about S-Plus at the end. Hierarchical cluster analysisThe goal of cluster analysis is to choose cluster membership that will minimize the variability within the clusters, and maximize the differences between clusters. In SPSS, the first step toward conducting a cluster analysis is to Click Analyze/Classify Figure1
Already we have three choices of clustering techniques available (the “Discriminant” refers to discriminant function analysis, which can be thought of as the confirmatory counterpart to the exploratory procedure of cluster analysis). We will start with hierarchical clustering first. This is to be used when one truly has little idea of what to expect with the data. First you’ll select which variables are of interest, and then you’ll tell SPSS whether it’s the cases/individuals you are trying to classify or whether you are looking for structure within the variables (e.g. items of some questionnaire). Clicking on ‘statistics’ will bring up a dialog box where we can choose some things to include in our output. The agglomeration schedule will show us at what point various cases become part of a cluster, and this will be different depending on the linkage method (see below) chosen. The proximity matrix will tell us how far apart the cases/variables are from one another and is a good option to choose. Also we can specify a certain number of clusters or range of clusters if we at least have an inkling as to how many groups to expect. Figure 2
Next, clicking on ‘plots’ we can choose some options for visual display- the dendrogram or icicle plot. Both of these can be quite unwieldy when large numbers of cases or variables are used, in which case SPSS won’t even display the entire dendrogram without an extra step. Suffice it to say, the dendrogram is this branching sort of thing that shows each variable or case as its own individual cluster at one end (left) and then how they are combined until they are all are eventually a part of one cluster. The length of the branch shows how far apart each case or variable is from the other(s) in its cluster. In figure 3, the depression and gsit variables are vary close to one another (purple) while the somatization and phobia variables (red) are not as alike though still clump together to form their own cluster. If we went with a 2 cluster scenario, the somatization and phobia indices would make up one cluster while the other variables would make up the other cluster. Figure 3
The icicle plot below is our other ‘visual’ representation. I don’t know about anyone else but these are just annoying to me and seem a throwback to when we had to do this sort of thing because it was about as visual as you could get back then. The way to read them is from bottom to top. Our variables are all nice and separate to start off with (except for depression and gsit which have already been combined), and when they are joined in the middle (i.e. an X fills up the space between them) they have joined a cluster with that neighbor. The look of this will depend on the linkage method chosen. Figure 4
I’ve mentioned linkage method twice so it’s about time to explain it, or them, I suppose. After choosing our stats and plots we click on the ‘Method’ button. The measure option refers to what you want use as your measure of distance between two points (cases or variables). If using count or binary data you’ll need to select something from one of those menus. If you are using data that has different scales of measurement you may want to standardize or transform the values for analysis. The cluster (linkage) method is the option you’ll choose that determines distances between clusters (rather than the individual cases), and when mini-clusters will combine into a larger cluster. Figure 5
So which options should I choose? That’s up to you really. There is no real standard though some are used more than others by convention and some believe that some particular methods might be better in particular situations. Sometimes you can choose different options and get the same result, sometimes fairly different ones. Use the past literature of studies of the sort your conducting as a guide, or read up on the different methods yourself and decide which one you like best. Use the one that ends up with clusters that adhere more to your theory even. Remember that cluster analysis is inherently exploratory. As long as you use accepted methods your results will be taken in the descriptive light in which they are presented. No one is going to change an entire way of practice or system of belief based on a cluster analysis alone. K-means and Two-step cluster analysisThe K-means analysis can be used when you already know how many clusters you are expecting to find. There are less options to choose from (it has basically done the leg work for you by means of using its own procedure) though will give you meaningful output that you won’t get in the hierarchical method, such as cluster centers (the mean of a variable for that cluster), distances for a cluster from the other cluster centers, and ANOVA tables that may give information about which variables may be contributing more to the solution. One can save cluster membership and distance from cluster center, and then use the graph procedure to create a boxplot that can point to outliers. The K-means procedure also may be preferable for data containing a large number of cases. The Two-step procedure will allow you to use categorical variables to formulate clusters. It is also useful for when you already have one or two possible solutions as far as number of clusters, and gives a choice of statistics that can be used to compare different solutions to determine the best number of clusters to retain. One must have normally distributed continuous variables and categorical variables must have a multinomial distribution. S-plus for cluster analysisFigure 6
S-plus also provides quite a bit as far as cluster analysis is concerned. It offers a couple things that SPSS does not, e.g. fuzzy partitioning and visual representations of the cluster. With fuzzy partitioning, cases are allowed to have partial membership in more than one cluster (see below), and this might be important information to have depending on your research scenario. The visual representations are very helpful in getting a better feel for the data and spotting potential outliers. Figure 7
Remember that the purpose of conducting cluster analysis is largely an exploratory one. It is useful for when we’d like to determine whether there are possibly distinct groups based on particular variables for which we have data, though we might also be interested in whether or not the variables themselves might be grouped in some meaningful fashion. The point is we may not be sure what is going on, and the goal would usually be to use the results of the cluster analysis to engage in more structured research later. There are many details to sort out in carrying out such an analysis, and the reader is encouraged to rely heavily on not only the literature on cluster analysis, but also the previous research in their field using such analyses to help them choose the best method for their situation. ReferencesBailey, Kenneth (1975). Cluster Analysis. Sociological Methodology, Vol. 6, 59-128. |