Cluster Analysis: Finding Groups in Data

By AnnMaria De Mars, March 16, 2010

Cluster analysis is one of those techniques I don’t get to use very often. About once every couple of years, someone will be doing a study of types of companies, patients or clients and need a cluster analysis. The best description of cluster analysis I have read came from a book by Kaufman & Rousseeuw, many years ago, that began,

“Cluster analysis is the art of finding groups in data.”

It falls into that gray area between descriptive statistics, which asks how many people like programming and Twizzlers, and inferential statistics, which asks about the daily consumption of Twizzlers by programmers and non-programmers and whether any difference between the two groups is greater than would be predicted by chance 95% of the time.

Cluster analysis is an exploratory method, usually, and is incorporated in what the young ‘uns now call data mining.

However, it can also be confirmatory in a hypothesis-testing sort of way. Say I hypothesize that there are three groups of people who have eating disorders (anorexics, bulimics and anorexic-bulimics) and that they differ in their treatment. I can classify people into those three groups using a cluster analysis, then do an ANOVA or MANOVA on the clusters to see if there are in fact significant differences among clusters in days hospitalized, total inpatient costs, total outpatient costs or other variables of interest.
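As a rough sketch of that confirmatory step in Stata (the package used later in this post), the commands below run a one-way ANOVA and a MANOVA across clusters. The variable names are made up for illustration: cluster3 is assumed to hold cluster membership, and days_hosp, inpatient_cost and outpatient_cost are assumed outcome variables.

```stata
* Hypothetical variable names: cluster3 holds cluster membership (1, 2, 3);
* days_hosp, inpatient_cost and outpatient_cost are the outcome measures.

* One-way ANOVA on a single outcome, with a table of means by cluster
oneway days_hosp cluster3, tabulate

* MANOVA on several outcomes at once
manova days_hosp inpatient_cost outpatient_cost = cluster3
```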

Personally, when I think of cluster analysis, the first type that always comes to mind is the partitioning, k-means clustering method. I suppose it says a little about my level of weirdness that there actually IS a type that comes to mind. The second type, in case you were dying to know, is fuzzy clusters, because it is something I have been pondering lately. Fuzzy clusters are NOT, contrary to the vicious rumors spread by my enemies, what can be found under my bed because I last cleaned sometime during the Mesozoic era, but rather a method where observations are allowed to fall into two clusters at once. Can people be only anorexic or bulimic, or can they fall in both groups? Fuzzy clustering says yes. K-means partitioning says no. With k-means you can, though, perhaps have a third group of people who are anorexic-bulimic.

Rambling note: When Maria came home for Christmas after having gotten her first job as a sportswriter, someone asked her if she had a favorite sport. She responded:

“Well, since Sports Illustrated is paying me to write about football, this week my favorite sport is football.”

Similarly, since I am meeting with someone tomorrow on how to do a cluster analysis with Stata, it has now become my favorite software for cluster analysis.

How to do it:

Either from the STATISTICS menu select MULTIVARIATE ANALYSIS > CLUSTER ANALYSIS > CLUSTER DATA > KMEANS

and then from the pop-up window select the number of groups and the variables

OR type the following in the command window

cluster kmeans list-of-variables , k(#) measure(type) start(seed)

FOR EXAMPLE, if I wanted to use data on anorexia and bulimia looking for two groups I would do this

cluster kmeans fast binge vomit purge hyper mens weight, k(2) measure(L2) start(krandom)

Here measure(L2) asks for Euclidean distance, which is also the default similarity/dissimilarity measure, and start(krandom) picks k observations at random as the starting centers. The output of cluster analysis in Stata might be disconcerting to some people by virtue of the fact that there really isn’t any. It will come back and say something singularly unenlightening like “cluster name: _clus_1” and that’s it.
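If you want a bit more to look at, one option is to name the analysis and the grouping variable yourself and then ask Stata what it stored. This is only a sketch: the names km2 and grp and the seed 12345 are arbitrary choices of mine, and the /// line continuation assumes you are running this from a do-file and not typing it in the command window.

```stata
* Same k-means run, but with a fixed seed and explicit names.
* km2, grp and the seed 12345 are arbitrary choices for this sketch.
cluster kmeans fast binge vomit purge hyper mens weight, ///
    k(2) measure(L2) start(krandom(12345)) name(km2) generate(grp)

* Show what Stata stored about this cluster analysis
cluster list km2

* Count how many observations landed in each cluster
tabulate grp
```

Fixing the seed means the same starting centers are drawn each time, so rerunning the do-file should give the same clusters.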

The first thing I recommend adding to your cluster analysis command is the keepcenters option, so my command looks like this:

cluster kmeans fast binge vomit purge hyper mens weight, k(2) measure(L2) start(krandom) keepcen

I am assuming you have a relatively modest number of observations, like I do, so you can open up your Stata data file in the Data editor and take a look at it. When I scroll down to the bottom I find that the last two observations are the means for each of the variables for the two clusters. My first cluster has very low values for body weight, moderate values for absence of menses, and high values for fasting, binging, vomiting, purging and hyperactivity. My second cluster has medium body weight, medium scores on hyperactivity, low values for amenorrhea and fasting, and high values for binging, vomiting and purging.
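If scrolling through the Data editor is not practical, the appended means can also be listed directly. This sketch assumes the cluster command with keepcenters was the last thing to add observations, so the two cluster centers are still the final two rows of the data.

```stata
* List the last two observations, which hold the cluster means
* appended by the keepcenters option (assumes k(2) and that no
* other observations were added afterwards).
list fast binge vomit purge hyper mens weight in -2/l
```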

It is starting to appear that I have two groups, anorexics and bulimics. My next step in this exploration would be to use the tabstat command, by cluster to see if other expected (or unexpected) differences emerge.
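A minimal sketch of that tabstat step is below. It assumes cluster membership is stored in a variable I am calling grp. If you did not name one with the generate() option, Stata uses the cluster name, for example _clus_1. If you used keepcenters, the appended rows of means are still in the data, so you may want to drop them or rerun without that option first.

```stata
* Means, standard deviations and counts for each variable, by cluster.
* grp is an assumed name for the cluster membership variable.
tabstat fast binge vomit purge hyper mens weight, ///
    by(grp) statistics(mean sd n)
```

Other variables of interest that were not used to form the clusters can be added to the variable list in the same way.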

Currently, there has been an unexpected emergence of my daughter, Jenn, who is neither anorexic nor bulimic, in the lobby downstairs, so we are off to conduct our own experiment to explore whether Chardonnay is best grouped with angel hair pasta primavera or whether Pinot Noir fits better in this cluster.

I’ll let you know.

20 Comments

Will says:

March 16, 2010 at 10:13 pm

I use clustering almost exclusively in gene-expression analysis. I’m a big fan of k-means as well, but the problem I get into is that it’s tough to know the number of clusters a priori. I use tricks like scanning across many “reasonable” K’s and finding an optimal guess or using a “Chinese restaurant process” technique.

Do you have anything cool in your bag of tricks for these sorts of situations?
Lou says:

March 17, 2010 at 2:36 pm

I suggest you look into Latent Class Cluster Analysis via LatentGold. LCCA offers many advantages over distance based measures.
Franz says:

March 18, 2010 at 7:34 am

If you have enough data: split into a model set and a validation set. Generate a model with c clusters on the model set. Apply the model to the validation set, then model with the same c on the validation set and check whether the same clusters pop up (count the numbers on the diagonal after suitable rearranging).
Choose the c with the highest stability.
Franz
ffernandez says:

March 19, 2010 at 9:49 am

Hmm, interesting. I’ve spent the last two months looking for ways of also estimating transition probabilities between clusters. Could LCCA lead to this? Seems to.
Thomas says:

May 23, 2010 at 8:35 am

I just did a cluster analysis on my sample of 244 and found that there are only two clusters. I’ve been reading some articles and books, and the examples that were used resulted in more than 2 clusters at the end of the analysis. So I’m just wondering if this 2-cluster result is acceptable.
Also, the variables that are used to cluster cases are simply the scale items, right? Without having to obtain the construct scores?
kathy says:

August 7, 2010 at 10:35 am

AnnMaria,
I am really interested in hearing the second part of your research about fuzzy clusters using Stata, because at this time I am lost with the fuzzy option.
Looking forward to hearing from you.
Thanks in advance.
Mohit says:

August 16, 2010 at 12:50 am

If you wanna decide on the number of clusters, you can use hierarchical clustering. And k-means is really too old and shitty. Why not use the algorithms of Rousseeuw (Finding Groups in Data)? We have had great success in using their algorithms.
basab says:

June 8, 2011 at 1:49 pm

Hi

I am very new to cluster analysis field and my question is —

Can I fix the number of observations (say 4) in each cluster?
Simon says:

September 8, 2011 at 8:44 am

On the number of clusters

The initial assumption should be that no optimal number exists. However, there are several methods to obtain a “reasonable” number of clusters.

The easiest way is probably to do it mathematically. You can, for example, use this simple approach: k = (n/2)^(1/2) (if you would like to go deeper into this, see e.g. Mardia et al, 1979: 360-384). For example, if n = 200, then a reasonable number of clusters would be k = (200/2)^(1/2) = 10.

However, this method does not consider the consistency of the clusters. Hierarchical cluster methods can help you with this (as suggested above).

Among the several hierarchical cluster methods, the Ward agglomerative method is often used because it aims at minimizing the variance within clusters, thus maximizing the homogeneity within each cluster. I.e., it creates groups of observations which are similar to each other, while separating the very “different observations”. This is often what researchers want to do. Technically, Ward iterates till a statistical equilibrium has been obtained.

In other words, the number of groups and the clustering itself are statistically optimized. From a scientific point of view this is well considered. However, if you are into more behavioral studies, then the grouping suggested by Ward may not be optimal. Yet it can still be helpful and work as an indicator of how to proceed in your work.

If you like to read more about cluster analysis, try: Romesburg (2004) Cluster Analysis.

Good luck!
AnnMaria says:

September 8, 2011 at 10:18 am

I haven’t read that book. I’ll check it out. Thanks a lot for the tip.
Rahul says:

December 26, 2011 at 12:08 am

Cluster analysis is basically used to reduce the number of variables that have the same behavior (pattern). E.g., suppose we have different brands like pepe, arrow, spykar, PA, biba, etc.; we have to cluster those which have the same behaviour.
roya danesh says:

March 14, 2012 at 12:07 pm

Hello,
I want to know the history of cluster analysis and who used it for the first time.
roya danesh says:

March 14, 2012 at 12:20 pm

1- usages of cluster analysis?
2- what is the basic presumption in cluster analysis?
3- who should we access that the basic presumption in cluster analysis?
4- give a research example of cluster analysis usage? which software has the ability to perform cluster analysis practically?
5- explain the manual way of performing cluster analysis?
July Salazar says:

August 24, 2012 at 6:33 pm

Good afternoon,
I would like to know which Stata command lets you see the individuals that finally ended up in each cluster, after clustering either with the hierarchical method (dendrogram) or with kmeans.

Thanks!!!
Koma says:

December 3, 2012 at 6:15 pm

Hi AnnMaria!

I’m facing a really similar problem: I’m investigating firm strategies, and according to an old theory there are 2 optimal ones. However, I would like to show that a mix of these two strategies leads to higher performance. Should I use fuzzy clustering?
Fortunately I have more or less 14000 observations, but I would like to use about 15-20 “explanatory” variables.
I have never done any cluster analysis, so I am a bit lost… I hope you can help me. Thank you in advance.
Vishal says:

March 27, 2014 at 2:37 am

Hi,
How can one fix the number of objects in a cluster (e.g. 5 objects in each cluster for a 7 cluster solution)?
Kathrin says:

March 27, 2015 at 8:53 am

Hi AnnMaria,

I am not sure if you can provide me an answer to my questions, but I am pretty desperate and would be glad of any help I can get! I am conducting a study in which I first applied Factor Analysis (for a question that was asked with a Likert scale) to reduce variables and name these factors. Then I did a cluster analysis with these factors (hierarchical method, because I didn’t know how many groups I should keep), which suggested keeping 3 groups. In STATA, a new variable was created, which I called “hierarg” and which represents the 3 groups. So I checked which of the factors are most represented in the groups and was then able to name the groups. Now I face the problem: I want to test whether, for example, the people in Group 1 (preferring product A) are older than those in the other groups (preferring product B or C), or whether education might have an influence on being in a specific group. I know how many observations are in each group but I do not know which command I can use in STATA to get such results… I tried to solve it with mlogit, taking this variable hierarg as a categorical one (i.hierarg), and got a result that if people are older, the probability of being in Group 2 decreases, and so on. But is that the only option? I hope I could explain myself right and would be more than happy if you or anybody else could help me! Kind regards, Kathrin
Kathrin says:

March 27, 2015 at 8:54 am

It is actually the same thing that July Salazar asked on August 24th, 2012 at 6:33 pm. Is there an answer to it?
AnnMaria says:

March 27, 2015 at 10:20 am

Hello, Kathrin –
If your question is simply “Are there differences between clusters on variable X” you can solve it in multiple ways. Your mlogit is one.

For example, you could do an ANOVA with cluster as the independent variable and see if education is different between the clusters, get means by cluster, etc.

If you want to look at the data visually, you could check out the graphs produced by some procedures http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_cluster_examples02.htm
Uta says:

July 15, 2015 at 3:24 am

Dear Anna Maria,

can I just say THANK YOU! we have no stats support here and I was trying to find a practical tutorial but nothing really hit the nail until I found your blog 🙂 🙂
