# Cluster Analysis: Finding Groups in Data

March 16, 2010 | 20 Comments

Cluster analysis is one of those techniques I don’t get to use very often. About once every couple of years someone will be doing a study of types of companies, patients or clients and have a need for a cluster analysis. The best description of cluster analysis I have read came from a book many years ago, by Kaufman & Rousseeuw, that began,

*“Cluster analysis is the art of finding groups in data.”*

It falls into that gray area between descriptive statistics, which ask how many people like programming and Twizzlers, and inferential statistics, which ask about the daily consumption of Twizzlers by programmers and non-programmers and whether any difference between the two groups is greater than would be predicted by chance 95% of the time.

Cluster analysis is an exploratory method, usually, and is incorporated in what the young ‘uns now call data mining.

However, it can also be confirmatory in a hypothesis-testing sort of way. Say I hypothesize that there are three groups of people who have eating disorders (anorexics, bulimics and anorexic-bulimics) and that they differ in their treatment. I can classify people into those three groups using a cluster analysis, then do an ANOVA or MANOVA on the clusters to see if there are in fact significant differences among clusters in days hospitalized, total inpatient costs, total outpatient costs or other variables of interest.
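For the curious, here is a rough sketch of what that confirmatory workflow might look like in Stata. The variable names are made up for illustration: edgroup is a hypothetical name for the grouping variable, days_hosp, inpat_cost and outpat_cost are hypothetical outcome variables, and the symptom variables are the ones from the example later in this post.

```stata
* Hypothetical sketch: form three clusters, then test whether they differ.
* edgroup, days_hosp, inpat_cost and outpat_cost are made-up names.
cluster kmeans fast binge vomit purge hyper mens weight, k(3) name(edgroup)

* One outcome at a time: one-way ANOVA across the three clusters
oneway days_hosp edgroup, tabulate

* Several outcomes at once: MANOVA with cluster as the factor
manova days_hosp inpat_cost outpat_cost = edgroup
```

Whether the clusters line up with anorexia, bulimia and the mixed group is something you would still have to check by looking at the cluster means; the tests only tell you the clusters differ.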

Personally, when I think of cluster analysis the first type that always comes to mind is the partitioning, k-means clustering method. I suppose that says a little about my level of weirdness that there actually IS a type that comes to mind. The second type, in case you were dying to know, is fuzzy clusters, because it is something I have been pondering lately. Fuzzy clusters are NOT, contrary to the vicious rumors spread by my enemies, what can be found under my bed because I last cleaned sometime during the Mesozoic era, but rather, a method where observations are allowed to fall into more than one cluster at once. Can people be only anorexic or bulimic, or can they fall in both groups? Fuzzy clustering says yes. K-means partitioning says no. You can perhaps, though, have a third group of people who are anorexic-bulimic.

Rambling note: When Maria came home for Christmas after having gotten her first job as a sportswriter, someone asked her if she had a favorite sport. She responded:

*“Well, since Sports Illustrated is paying me to write about football, this week my favorite sport is football.”*

Similarly, since I am meeting with someone tomorrow on how to do a cluster analysis with Stata, it has now become my favorite software for cluster analysis.

How to do it:

Either from the STATISTICS menu select MULTIVARIATE ANALYSIS > CLUSTER ANALYSIS > CLUSTER DATA > KMEANS

and then from the pop-up window select the number of groups and the variables

OR type the following in the command window

**cluster kmeans** *list-of-variables*, **k**(#) **measure**(type) **start**(seed)

FOR EXAMPLE, if I wanted to use data on anorexia and bulimia looking for two groups I would do this

cluster kmeans fast binge vomit purge hyper mens weight, k(2) measure(L2) start(krandom)

Here measure(L2) asks for Euclidean distance, which is also the default similarity/dissimilarity measure, and start(krandom) picks k observations at random as the starting cluster centers. The output of cluster analysis in Stata might be disconcerting to some people by virtue of the fact that there really isn’t any. It will come back and say something singularly unenlightening like “cluster name: _clus_1” and that’s it.
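If you want to see something anyway, a few follow-up commands will at least show what the analysis did. This is a minimal sketch, assuming the analysis kept the default name _clus_1 and that the grouping variable Stata created carries that same name (which is the default when you don't specify otherwise).

```stata
* Describe the stored cluster analysis (method, measure, variables used)
cluster list _clus_1

* How many observations landed in each cluster
tabulate _clus_1

* Peek at cluster membership for the first ten observations
list fast binge weight _clus_1 in 1/10
```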

The first thing I recommend adding to your cluster analysis command is the keepcenters option, so my command looks like this:

cluster kmeans fast binge vomit purge hyper mens weight, k(2) measure(L2) start(krandom) keepcen

I am assuming you have a relatively modest number of observations, like I do, so you can open up your Stata data file in the Data editor and take a look at it. When I scroll down to the bottom I find that the last two observations are the means for each of the variables for the two clusters. My first cluster has very low values for body weight, moderate values for absence of menses, and high values for fasting, binging, vomiting, purging and hyperactivity. My second cluster has medium body weight, medium scores on hyperactivity, low values for amenorrhea and fasting, and high values for binging, vomiting and purging.

It is starting to appear that I have two groups, anorexics and bulimics. My next step in this exploration would be to use the tabstat command, by cluster to see if other expected (or unexpected) differences emerge.
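A sketch of that tabstat step might look like the following. The name grp2 and the seed 12345 are my own arbitrary choices for illustration; naming the analysis gives the grouping variable a name that is easier to remember than the default, and passing a seed to krandom makes the random starting centers repeatable.

```stata
* Re-run the two-cluster solution with a named grouping variable.
* grp2 and the seed 12345 are arbitrary, illustrative choices.
cluster kmeans fast binge vomit purge hyper mens weight, k(2) measure(L2) start(krandom(12345)) name(grp2)

* Mean, SD and n of each clustering variable within each cluster
tabstat fast binge vomit purge hyper mens weight, by(grp2) statistics(mean sd n)
```

This version leaves out keepcenters so that the summary is computed only on real observations. Any other variable in the data set can be added to the tabstat variable list to look for differences that were not part of the clustering.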

Currently, there has been an unexpected emergence of my daughter, Jenn, who is neither anorexic nor bulimic, in the lobby downstairs, so we are off to conduct our own experiment to explore whether Chardonnay is best grouped with angel hair pasta primavera or does Pinot Noir fit better in this cluster.

I’ll let you know.

# Comments

20 Comments so far

I use clustering almost exclusively in gene-expression analysis. I’m a big fan of k-means as well, but the problem I get into is that it’s tough to know the number of clusters a priori. I use tricks like scanning across many “reasonable” K’s and finding an optimal guess or using a “Chinese restaurant process” technique.

Do you have anything cool in your bag of tricks for these sorts of situations?

I suggest you look into Latent Class Cluster Analysis via LatentGold. LCCA offers many advantages over distance-based measures.

If you have enough data: split it into a model set and a validation set. Generate a model with c clusters on the model set. Apply the model to the validation set, then fit a model with the same c on the validation set and check whether the same clusters pop up (cross-tabulate the two assignments and count the numbers on the diagonal after suitable rearranging).

Choose c with the highest stability.

Franz

Hmm, interesting. I’ve spent the last two months looking for ways of also estimating transition probabilities between clusters. Could LCCA lead to this? It seems so.

I just did a cluster analysis on my sample of 244 and found that there are only two clusters. I’ve been reading some articles and books, and the examples that were used resulted in more than 2 clusters at the end of the analysis. So I’m just wondering if this 2-cluster result is acceptable.

Also, the variables that are used to cluster cases are simply the scale items, right? Without having to obtain the construct scores?

AnnMaria,

I am really interested in hearing the second part of your research about fuzzy clustering using Stata, because at this time I am lost with the fuzzy option.

Looking forward to hearing from you.

Thanks in advance.

If you wanna decide on the number of clusters, you can use hierarchical clustering. And k-means is really too old and shitty. Why not use the algorithms of Rousseeuw (Finding Groups in Data)? We have had great success in using their algorithms.

Hi

I am very new to the cluster analysis field and my question is:

Can I fix the number of observations (say 4) in each cluster?

On the number of clusters

The initial assumption should be that no optimal number exists. However, there are several methods to obtain a “reasonable” number of clusters.

The easiest way is probably to do it mathematically. You can, for example, use this simple approach: k = (n/2)^(1/2) (for more depth, see e.g. Mardia et al., 1979: 360-384). For example, if n = 200, then a reasonable number of clusters would be k = (200/2)^(1/2) = 10.

However this method does not consider the consistency of the clusters. Hierarchical cluster methods can help you with this (as suggested above).

Among the several hierarchical cluster methods, the Ward agglomerative method is often used because it aims at minimizing the variance within clusters, thus maximizing the homogeneity within each cluster. I.e., it creates groups of observations which are similar to each other, while separating the very “different observations”. This is often what researchers want to do. Technically, Ward iterates till a statistical equilibrium has been obtained.

In other words, the number of groups and the clustering itself are statistically optimized. From a scientific point of view this is well considered. However, if you are into more behavioral studies, then the grouping suggested by Ward may not be optimal. Yet it can still be helpful and work as an indicator of how to proceed in your work.

If you like to read more about cluster analysis, try: Romesburg (2004) Cluster Analysis.

Good luck!

I haven’t read that book. I’ll check it out. Thanks a lot for the tip.

Cluster analysis is basically used to reduce the number of variables that have the same behavior (pattern). E.g., suppose we have different brands like pepe, arrow, spykar, PA, biba, etc.; we have to cluster those which have the same behaviour.

Hello

I want to know the history of cluster analysis, and who used it for the first time?

1- What are the usages of cluster analysis? 2- What is the basic presumption in cluster analysis? 3- How should we assess the basic presumption in cluster analysis? 4- Can you give a research example of cluster analysis usage? Which software has the ability to perform cluster analysis practically? 5- Can you explain the manual way of performing cluster analysis?

Good afternoon

I would like to know which Stata command lets you see the individuals that finally ended up in a cluster, after running either the hierarchical method (dendrogram) or kmeans.

Thanks!!!

Hi AnnMaria!

I’m facing a really similar problem: I’m investigating firm strategies, and according to an old theory there are 2 optimal ones. However, I would like to show that a mix of these two strategies leads to higher performance. Should I use fuzzy clustering?

Fortunately I have more or less 14000 observations, but I would like to use about 15-20 “explanatory” variables.

I have never done any cluster analysis, so I am a bit lost… I hope you can help me, thank you in advance

Hi,

How can one fix the number of objects in a cluster (e.g. 5 objects in each cluster for a 7-cluster solution)?

Hi AnnMaria,

I am not sure if you can provide an answer to my questions, but I am pretty desperate and would be glad of any help I can get! I am conducting a study in which I first applied factor analysis (to a question that was asked on a Likert scale) to reduce the variables, and named the resulting factors. Then I did a cluster analysis with these factors (hierarchical method, because I didn’t know how many groups I should keep), which suggested keeping 3 groups. In STATA, a new variable was created, which I called “hierarg” and which represents the 3 groups. So I checked which of the factors are most represented in the groups and was then able to name the groups. Now I face the problem: I want to test whether, for example, the people in Group 1 (people preferring product A) are older than those in the other groups (preferring product B or C), or whether education might have an influence on being in a specific group. I know how many observations are in each group but I do not know which command I can use in STATA to get such results… I tried to solve it with mlogit, taking this variable hierarg as a categorical one (i.hierarg), and got the result that if people are older, the probability of being in Group 2 decreases, and so on. But is that the only option? I hope I have explained myself clearly and would be more than happy if you or anybody else could help me! Kind regards, Kathrin

It is actually the same thing that July Salazar asked on August 24th, 2012 6:33 pm. Is there an answer to it?

Hello, Kathrin –

If your question is simply “Are there differences between clusters on variable X” you can solve it in multiple ways. Your mlogit is one.

For example, you could do an ANOVA with cluster as the independent variable and see if education is different between the clusters, get means by cluster, etc.
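Something along these lines might do it. This is only a sketch: hierarg is the grouping variable you described, while age and educ are my guesses at what your age and education variables are called, and I am assuming age is continuous and educ is categorical.

```stata
* Is mean age different across the three clusters?
* Bonferroni-adjusted pairwise comparisons show which groups differ.
oneway age hierarg, tabulate bonferroni

* Descriptive statistics for age within each cluster
tabstat age, by(hierarg) statistics(mean sd n)

* For a categorical education variable, cross-tabulate it with cluster
tabulate educ hierarg, chi2 column
```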

If you want to look at the data visually, you could check out the graphs produced by some procedures http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_cluster_examples02.htm

Dear Anna Maria,

can I just say THANK YOU! we have no stats support here and I was trying to find a practical tutorial but nothing really hit the nail until I found your blog 🙂 🙂