Cluster analysis is one of those techniques I don’t get to use very often. About once every couple of years, someone doing a study of types of companies, patients or clients will need a cluster analysis. The best description of cluster analysis I have read came from a book many years ago, by Kaufman & Rousseeuw, that began,

“Cluster analysis is the art of finding groups in data.”

It falls into that gray area between descriptive statistics, which asks how many people like programming and Twizzlers, and inferential statistics, which asks about the daily consumption of Twizzlers by programmers and non-programmers and whether any difference between the two groups is greater than would be predicted by chance 95% of the time.

Cluster analysis is an exploratory method, usually, and is incorporated in what the young ‘uns now call data mining.

However, it can also be confirmatory in a hypothesis-testing sort of way. Say I hypothesize that there are three groups of people who have eating disorders (anorexics, bulimics and anorexic-bulimics) and that they differ in their treatment. I can classify people into those three groups using a cluster analysis, then do an ANOVA or MANOVA on the clusters to see if there are in fact significant differences among clusters in days hospitalized, total inpatient costs, total outpatient costs or other variables of interest.
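
As a rough sketch of that confirmatory workflow in Stata (the software used later in this post), something like the following could work. The outcome names dayshosp, inpcost and outpcost and the names ed3sol and ed3 are made up for illustration; the clustering variables are the ones from the example further down.

```stata
* Sketch only: dayshosp, inpcost, outpcost, ed3sol and ed3 are hypothetical names
* Step 1: form three clusters; generate() stores each person's cluster number in ed3
cluster kmeans fast binge vomit purge hyper mens weight, k(3) name(ed3sol) generate(ed3)

* Step 2: one-way ANOVA of a single outcome across the three clusters
oneway dayshosp ed3, tabulate

* Step 3: MANOVA on several outcomes at once
manova dayshosp inpcost outpcost = ed3
```

Whether the clusters that come back actually correspond to the three hypothesized diagnoses is a separate question that the cluster means would have to answer.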

Personally, when I think of cluster analysis, the first type that always comes to mind is the partitioning, k-means clustering method. I suppose it says a little about my level of weirdness that there actually IS a type that comes to mind. The second type, in case you were dying to know, is fuzzy clustering, because it is something I have been pondering lately. Fuzzy clusters are NOT, contrary to the vicious rumors spread by my enemies, what can be found under my bed because I last cleaned sometime during the Mesozoic era, but rather a method where observations are allowed to fall into two clusters at once. Can people be only anorexic or bulimic, or can they fall into both groups? Fuzzy clustering says yes. K-means partitioning says no. With k-means you can, though, have a third group of people who are anorexic-bulimic.

Rambling note: When Maria came home for Christmas after having gotten her first job as a sportswriter, someone asked her if she had a favorite sport, she responded:

“Well, since Sports Illustrated is paying me to write about football, this week my favorite sport is football.”

Similarly, since I am meeting with someone tomorrow on how to do a cluster analysis with Stata, it has now become my favorite software for cluster analysis.

How to do it:


and then from the pop-up window select the number of groups and the variables

OR type the following in the command window

cluster kmeans list-of-variables , k(#) measure(type) start(seed)

FOR EXAMPLE, if I wanted to use data on anorexia and bulimia looking for two groups I would do this

cluster kmeans fast binge vomit purge hyper mens weight, k(2) measure(L2) start(krandom)

Euclidean distance (L2) is the default dissimilarity measure, and start(krandom), which picks k observations at random as the starting centers, is the default too, so both options are spelled out here only for clarity. The output of cluster analysis in Stata might be disconcerting to some people by virtue of the fact that there really isn’t any. It will come back and say something singularly unenlightening like “cluster name: _clus_1” and that’s it.
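
If you want a little more feedback than that, a couple of standard commands will show what was stored. This is only a sketch, and it assumes the solution kept the default name _clus_1 reported above, which is also the name of the grouping variable Stata adds to the data.

```stata
* Assumes the cluster solution and its grouping variable are both named _clus_1
* Show what Stata stored about this cluster analysis
cluster list _clus_1

* Count how many observations landed in each cluster
tabulate _clus_1
```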

The first thing I recommend adding to your cluster analysis command is the keepcenters option, so my command looks like this:

cluster kmeans fast binge vomit purge hyper mens weight, k(2) measure(L2) start(krandom) keepcen

I am assuming you have a relatively modest number of observations, like I do, so you can open up your Stata data file in the Data editor and take a look at it. When I scroll down to the bottom I find that the last two observations are the means for each of the variables for the two clusters. My first cluster has very low values for body weight, moderate values for absence of menses, and high values for fasting, binging, vomiting, purging and hyperactivity. My second cluster has medium body weight, medium scores on hyperactivity, low values for amenorrhea and fasting, and high values for binging, vomiting and purging.
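
If the data set is too large to scroll through comfortably, the appended rows can be listed directly instead. A small sketch, assuming k(2) with the keepcenters option, so that the two rows of cluster means are the last two observations in the data:

```stata
* Assumes the two rows of cluster means were appended at the end by keepcenters
* "in -2/l" selects the second-to-last through the last observation
list fast binge vomit purge hyper mens weight in -2/l
```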

It is starting to appear that I have two groups, anorexics and bulimics. My next step in this exploration would be to use the tabstat command, by cluster, to see if other expected (or unexpected) differences emerge.
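
A sketch of what that tabstat step might look like, assuming the grouping variable has the default name _clus_1. The variables age and bmi are placeholders for whatever other variables are of interest; they are not part of the example above.

```stata
* age and bmi are hypothetical variables; _clus_1 is the default grouping variable
* Summary statistics for each variable, broken out by cluster
tabstat age bmi weight, by(_clus_1) statistics(n mean sd min max)
```

If the rows added by keepcenters are still at the bottom of the data, it is worth checking how they are coded on the grouping variable before trusting the counts.
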
Currently, there has been an unexpected emergence of my daughter, Jenn, who is neither anorexic nor bulimic, in the lobby downstairs, so we are off to conduct our own experiment to explore whether Chardonnay is best grouped with angel hair pasta primavera or does Pinot Noir fit better in this cluster.

I’ll let you know.


16 Responses to “Cluster Analysis: Finding Groups in Data”

  1. Will on March 16th, 2010 10:13 pm

    I use clustering almost exclusively in gene-expression analysis. I’m a big fan of k-means as well, but the problem I get into is that it’s tough to know the number of clusters a priori. I use tricks like scanning across many “reasonable” K’s and finding an optimal guess or using a “Chinese restaurant process” technique.

    Do you have anything cool in your bag of tricks for these sorts of situations?

  2. Lou on March 17th, 2010 2:36 pm

    I suggest you look into Latent Class Cluster
    Analysis via LatentGold. LCCA offers many
    advantages over distance based measures.

  3. Franz on March 18th, 2010 7:34 am

    If you have enough data: split into model and validation set. Generate a model with c clusters on the model set. Apply the model to the validation set – then model with the same c on the validation set and check, if the same clusters pop up (count the numbers on the diagonal after suitable rearranging).
    Choose c with the highest stability.

  4. ffernandez on March 19th, 2010 9:49 am

    Hmm, interesting. I’ve spent the last two months looking for ways of also estimating transition probabilities between clusters. Could LCCA lead to this? Seems to.

  5. Thomas on May 23rd, 2010 8:35 am

    I just did a cluster analysis on my sample of 244 and found that there are only two clusters. I’ve been reading some articles and books, and the examples they used resulted in more than 2 clusters at the end of the analysis, so I’m wondering whether this 2-cluster result is acceptable.
    Also, the variables that are used to cluster cases are simply the scale items, right? Without having to obtain the construct scores?

  6. kathy on August 7th, 2010 10:35 am

    I am really interested in listening to the second part of your research about fuzzy cluster using Stata, because at this time, I am lost with the fuzzy option.
    Looking forward to hearing from you.
    Thanks in advance.

  7. Mohit on August 16th, 2010 12:50 am

    If you wanna decide on number of clusters, you can use hierarchical clustering. And k-means is really too old and shitty. Why not use the algorithms of Rousseeuw (Finding Groups in Data)? We have had great success in using their algorithms.

  8. basab on June 8th, 2011 1:49 pm


    I am very new to cluster analysis field and my question is –

    can I fix the number of observations (say 4)in each cluster?

  9. Simon on September 8th, 2011 8:44 am

    On the number of clusters

    The initial assumption should be that no optimal nr exists. However, there are several methods to obtain a “reasonable” nr of clusters.

    The easiest way is probably to do it mathematically. You can, for example, use this simple approach: k=(n/2)^(1/2) (if you would like to go deeper into this, see e.g. Mardia et al, 1979: 360-384). For example, if n=200, then a reasonable nr of clusters would be k=(200/2)^(1/2), so k=10.

    However this method does not consider the consistency of the clusters. Hierarchical cluster methods can help you with this (as suggested above).

    Among several hierarchical cluster methods, the Ward agglomerating method is often used because it aims at minimizing the variance within clusters, thus maximizing the homogeneity within each cluster. I.e. it creates groups of observations which are similar to each other, while separating the very “different observations”. This is often what researchers want to do. Technically, Ward iterates till a statistical equilibrium has been obtained.

    In other words, the nr of groups and the clustering itself are statistically optimized. From a scientific point of view this is well considered. However, if you are into more behavioral studies, then the grouping suggested by Ward may not be optimal. Yet it can still be helpful and work as an indicator of how to proceed in your work.

    If you like to read more about cluster analysis, try: Romesburg (2004) Cluster Analysis.

    Good luck!

  10. AnnMaria on September 8th, 2011 10:18 am

    I haven’t read that book. I’ll check it out. Thanks a lot for the tip.

  11. Rahul on December 26th, 2011 12:08 am

    Cluster analysis is basically used to reduce the number of variables that have the same behavior (pattern). E.g., suppose we have different brands like pepe, arrow, spykar, PA, biba, etc., so we have to cluster those which have the same behaviour.

  12. roya danesh on March 14th, 2012 12:07 pm

    I want to know the history of cluster analysis and who used it for the first time.

  13. roya danesh on March 14th, 2012 12:20 pm

    1. What are the uses of cluster analysis? 2. What is the basic presumption in cluster analysis? 3. How should we assess the basic presumption in cluster analysis? 4. Can you give a research example of cluster analysis usage? Which software has the ability to perform cluster analysis in practice? 5. Can you explain the manual way of performing cluster analysis?

  14. July Salazar on August 24th, 2012 6:33 pm

    Good afternoon
    I would like to know which Stata command lets you see the individuals that ended up in each cluster, after running either the hierarchical method (dendrogram) or kmeans.


  15. Koma on December 3rd, 2012 6:15 pm

    Hi AnnMaria!

    I’m facing a really similar problem: I’m investigating firm strategies, and according to an old theory there are 2 optimal ones. However, I would like to show that the mix of these two strategies leads to a higher performance. Should I use fuzzy clustering?
    Fortunately I have more or less 14000 observations, but I would like to use about 15-20 “explanatory” variables.
    I have never done any cluster analysis, so I am a bit lost… I hope you can help me, thank you in advance

  16. Vishal on March 27th, 2014 2:37 am

    How can one fix the number of objects in a cluster (e.g. 5 objects in each cluster for a 7 cluster solution)
