Cluster Analysis: Finding Groups in Data

By AnnMaria De Mars, March 16, 2010

Cluster analysis is one of those techniques I don’t get to use very often. About once every couple of years someone will be doing a study of types of companies, patients or clients and have a need for a cluster analysis. The best description of cluster analysis I have read came from a book many years ago, by Kaufman & Rousseeuw, that began,

“Cluster analysis is the art of finding groups in data.”

It falls into that gray area between descriptive statistics, which ask how many people like programming and Twizzlers, and inferential statistics, which question the daily consumption of Twizzlers by programmers and non-programmers and whether any difference between the two groups is greater than would be predicted by chance 95% of the time.

Cluster analysis is an exploratory method, usually, and is incorporated in what the young ‘uns now call data mining.

However, it can also be confirmatory in a hypothesis-testing sort of way. Say I hypothesize that there are three groups of people who have eating disorders (anorexics, bulimics and anorexic-bulimics) and that they differ in their treatment. I can classify people into those three groups using a cluster analysis, then do an ANOVA or MANOVA on the clusters to see if there are in fact significant differences among clusters in days hospitalized, total inpatient costs, total outpatient costs or other variables of interest.
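In Stata, which is the software the rest of this post uses, that confirmatory check might look something like the sketch below. The outcome names days_hosp, inpatient_cost and outpatient_cost and the cluster name grp3 are made up for illustration; the symptom variables are the ones from the example later in the post.

```stata
* Sketch only: days_hosp, inpatient_cost, outpatient_cost and grp3 are hypothetical names.
* Step 1: ask k-means for three groups; name() also names the new group variable.
cluster kmeans fast binge vomit purge hyper mens weight, k(3) name(grp3)

* Step 2: one outcome at a time, a one-way ANOVA across the three clusters.
oneway days_hosp grp3, tabulate

* Step 3: several outcomes at once, a MANOVA with cluster as the factor.
manova days_hosp inpatient_cost outpatient_cost = grp3
```

The outcomes tested here are deliberately not the variables used to build the clusters. The clusters were formed to be as different as possible on those, so a significant difference on them would not tell you much.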

Personally, when I think of cluster analysis the first type that always comes to mind is the partition, k-means clustering method. I suppose it says a little about my level of weirdness that there actually IS a type that comes to mind. The second type, in case you were dying to know, is fuzzy clusters, because it is something I have been pondering lately. Fuzzy clusters are NOT, contrary to the vicious rumors spread by my enemies, what can be found under my bed because I last cleaned sometime during the Mesozoic era, but rather a method where observations are allowed to fall into two clusters at once. Can people be only anorexic or bulimic, or can they fall in both groups? Fuzzy clustering says yes. K-means partitioning says no. You can perhaps, though, have a third group of people who are anorexic-bulimic.

Rambling note: When Maria came home for Christmas after having gotten her first job as a sportswriter, someone asked her if she had a favorite sport, and she responded:

“Well, since Sports Illustrated is paying me to write about football, this week my favorite sport is football.”

Similarly, since I am meeting with someone tomorrow on how to do a cluster analysis with Stata, it has now become my favorite software for cluster analysis.

How to do it:

Either from the STATISTICS menu select MULTIVARIATE ANALYSIS > CLUSTER ANALYSIS > CLUSTER DATA > KMEANS

and then from the pop-up window select the number of groups and the variables

OR type the following in the command window

cluster kmeans list-of-variables , k(#) measure(type) start(seed)

FOR EXAMPLE, if I wanted to use data on anorexia and bulimia looking for two groups I would do this

cluster kmeans fast binge vomit purge hyper mens weight, k(2) measure(L2) start(krandom)

L2 is Euclidean distance, which is also the default similarity/dissimilarity measure, and krandom means you started with random seeds, that is, k observations picked at random as the initial cluster centers. The output of cluster analysis in Stata might be disconcerting to some people by virtue of the fact that there really isn’t any. It will come back and say something singularly unenlightening like “cluster name: _clus_1” and that’s it.
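If the silence bothers you, a few follow-up commands show what Stata actually stored. This is only a sketch. It assumes this was the first cluster analysis run on the dataset, so the default name is _clus_1, and id is a hypothetical identifier variable. The number inside krandom() is a random-number seed (not the same thing as the cluster seeds above), and its value here is arbitrary.

```stata
* Same command as above, but with a random-number seed inside krandom() so the
* random starting centers, and so the clusters, come out the same on every run.
* Run this in place of the earlier command. The value 20100316 is arbitrary.
cluster kmeans fast binge vomit purge hyper mens weight, k(2) measure(L2) start(krandom(20100316))

* List the cluster analyses Stata is keeping track of, then the details of one.
cluster dir
cluster list _clus_1

* How many observations landed in each group?
tabulate _clus_1

* Who is in group 1? (id is a hypothetical identifier variable)
list id if _clus_1 == 1
```

The group variable _clus_1 is an ordinary variable in your dataset, so anything you can do with a categorical variable you can do with it.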

The first thing I recommend adding to your cluster analysis command is the keepcenters option, so my command looks like this:

cluster kmeans fast binge vomit purge hyper mens weight, k(2) measure(L2) start(krandom) keepcenters

I am assuming you have a relatively modest number of observations, like I do, so you can open up your Stata data file in the Data Editor and take a look at it. When I scroll down to the bottom, I find that the last two observations are the means for each of the variables for the two clusters. My first cluster has very low values for body weight, moderate values for absence of menses, and high values for fasting, binging, vomiting, purging and hyperactivity. My second cluster has medium body weight, medium scores on hyperactivity, low values for amenorrhea and fasting, and high values for binging, vomiting and purging.
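With a bigger file, scrolling gets old. A minimal alternative, assuming keepcenters appended the two rows of cluster means to the end of the dataset as described above, is to list just the last two observations:

```stata
* The final k rows hold the cluster means; with k(2) that is the last two.
* In the range -2/l, the "l" is a lowercase letter L, meaning the last observation.
list fast binge vomit purge hyper mens weight in -2/l
```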

It is starting to appear that I have two groups, anorexics and bulimics. My next step in this exploration would be to use the tabstat command, by cluster, to see if other expected (or unexpected) differences emerge.
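Here is a sketch of that tabstat step. It assumes the membership variable is _clus_1 and that keepcenters added two rows of cluster means to the bottom of the file. The variables age and days_hosp are hypothetical stand-ins for whatever other variables you have.

```stata
* The last two rows are cluster means, not people, so leave them out.
* Run these lines together (for example, from a do-file) so the local macro survives.
local n_people = _N - 2

* Means, standard deviations and counts of the clustering variables, by cluster.
tabstat fast binge vomit purge hyper mens weight in 1/`n_people', by(_clus_1) statistics(mean sd n)

* Same idea for variables that were not used to form the clusters
* (age and days_hosp are hypothetical names).
tabstat age days_hosp in 1/`n_people', by(_clus_1) statistics(mean sd n)
```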

Currently, there has been an unexpected emergence of my daughter, Jenn, who is neither anorexic nor bulimic, in the lobby downstairs, so we are off to conduct our own experiment to explore whether Chardonnay is best grouped with angel hair pasta primavera or does Pinot Noir fit better in this cluster.

I’ll let you know.


20 Comments

Will says:

March 16, 2010 at 10:13 pm

I use clustering almost exclusively in gene-expression analysis. I’m a big fan of k-means as well, but the problem I get into is that it’s tough to know the number of clusters a priori. I use tricks like scanning across many “reasonable” K’s and finding an optimal guess or using a “Chinese restaurant process” technique.

Do you have anything cool in your bag of tricks for these sorts of situations?

Lou says:

March 17, 2010 at 2:36 pm

I suggest you look into Latent Class Cluster Analysis via LatentGold. LCCA offers many advantages over distance-based measures.

Franz says:

March 18, 2010 at 7:34 am

If you have enough data: split into a model set and a validation set. Generate a model with c clusters on the model set. Apply the model to the validation set, then model with the same c on the validation set and check whether the same clusters pop up (count the numbers on the diagonal after suitable rearranging).
Choose the c with the highest stability.
Franz

ffernandez says:

March 19, 2010 at 9:49 am

Hmm, interesting. I’ve spent the last two months looking for ways of also estimating transition probabilities between clusters. Could LCCA lead to this? Seems to.

Thomas says:

May 23, 2010 at 8:35 am

I just did a cluster analysis on my sample of 244 and found that there are only two clusters. I’ve been reading some articles and books, and the examples that were used resulted in more than 2 clusters at the end of the analysis. So I’m just wondering if this 2-cluster result is acceptable.
Also, the variables that are used to cluster cases are simply the scale items, right? Without having to obtain the construct scores?

kathy says:

August 7, 2010 at 10:35 am

AnnMaria,
I am really interested in hearing the second part of your research about fuzzy clusters using Stata, because at this time I am lost with the fuzzy option.
Looking forward to hearing from you.
Thanks in advance.

Mohit says:

August 16, 2010 at 12:50 am

If you wanna decide on the number of clusters, you can use hierarchical clustering. And k-means is really too old and shitty. Why not use the algorithms of Rousseeuw (Finding Groups in Data)? We have had great success in using their algorithms.

basab says:

June 8, 2011 at 1:49 pm

Hi

I am very new to the cluster analysis field and my question is:

Can I fix the number of observations (say 4) in each cluster?

Simon says:

September 8, 2011 at 8:44 am

On the number of clusters

The initial assumption should be that no optimal number exists. However, there are several methods to obtain a “reasonable” number of clusters.

The easiest way is probably to do it mathematically. You can, for example, use this simple approach: k = (n/2)^(1/2) (if you would like to go deeper into this, see e.g. Mardia et al., 1979: 360-384). For example, if n = 200, then a reasonable number of clusters would be k = (200/2)^(1/2) = 10.

However this method does not consider the consistency of the clusters. Hierarchical cluster methods can help you with this (as suggested above).

Among several hierarchical cluster methods, the Ward agglomerating method is often used because it aims at minimizing the variance within clusters, thus maximizing the homogeneity within each cluster. I.e., it creates groups of observations which are similar to each other, while separating the very “different observations”. This is often what researchers want to do. Technically, Ward iterates till a statistical equilibrium has been obtained.

In other words, the number of groups and the clustering itself are statistically optimized. From a scientific point of view this is well considered. However, if you are into more behavioral studies, then the grouping suggested by Ward may not be optimal. Yet it can still be helpful and work as an indicator of how to proceed in your work.

If you would like to read more about cluster analysis, try: Romesburg (2004) Cluster Analysis.

Good luck!

AnnMaria says:

September 8, 2011 at 10:18 am

I haven’t read that book. I’ll check it out. Thanks a lot for the tip.

Rahul says:

December 26, 2011 at 12:08 am

Cluster analysis is basically used to reduce the number of variables that have the same behavior (pattern). E.g., suppose we have different brands like Pepe, Arrow, Spykar, PA, Biba, etc.; we have to cluster those that have the same behaviour.

roya danesh says:

March 14, 2012 at 12:07 pm

Hello
I want to know the history of cluster analysis and who used it for the first time.

roya danesh says:

March 14, 2012 at 12:20 pm

1. Usages of cluster analysis?
2. What is the basic presumption in cluster analysis?
3. How should we assess the basic presumption in cluster analysis?
4. Give a research example of cluster analysis usage. Which software has the ability to perform cluster analysis practically?
5. Explain the manual way of performing cluster analysis.

July Salazar says:

August 24, 2012 at 6:33 pm

Good afternoon
I would like to know which Stata command lets you see the individuals who finally ended up in a cluster, after clustering with either the hierarchical method (dendrogram) or kmeans.

Thanks!!!

Koma says:

December 3, 2012 at 6:15 pm

Hi AnnMaria!

I’m facing a really similar problem: I’m investigating firm strategies, and according to an old theory there are 2 optimal ones. However, I would like to show that the mix of these two strategies leads to higher performance. Should I use fuzzy clustering?
Fortunately I have more or less 14000 observations, but I would like to use about 15-20 “explanatory” variables.
I have never done any cluster analysis, so I am a bit lost… I hope you can help me. Thank you in advance.

Vishal says:

March 27, 2014 at 2:37 am

Hi,
How can one fix the number of objects in a cluster (e.g. 5 objects in each cluster for a 7-cluster solution)?

Kathrin says:

March 27, 2015 at 8:53 am

Hi AnnMaria,

I am not sure if you can provide me an answer to my questions, but I am pretty desperate and would be glad of any help I can get! I am conducting a study in which I first applied Factor Analysis (for a question that was asked with a Likert scale) to reduce variables and name these factors. Then, I did a cluster analysis with these factors (hierarchical method, because I didn’t know how many groups I should keep), which suggested keeping 3 groups. In Stata, a new variable was created, which I called “hierarg” and which represents the 3 groups. So I checked which of the factors are most represented in the groups and was then able to name the groups. Now I face the problem: I want to test if, for example, the people in Group 1 (people preferring product A) are older than those in the other groups (preferring product B or C), or, for example, if education might have an influence on being in a specific group. I know how many observations are in each group but I do not know which command I can use in Stata to get such results… I tried to solve it with mlogit, taking this variable hierarg as a categorical one (i.hierarg), and hence got a result that if people are older, the probability of being in Group 2 decreases and so on. But is that the only option? I hope I could explain myself right and would be more than happy if you or anybody else could help me! Kind regards, Kathrin

Kathrin says:

March 27, 2015 at 8:54 am

It is actually the same thing that July Salazar asked on August 24th, 2012 at 6:33 pm. Is there an answer to it?

AnnMaria says:

March 27, 2015 at 10:20 am

Hello, Kathrin,
If your question is simply “Are there differences between clusters on variable X” you can solve it in multiple ways. Your mlogit is one.

For example, you could do an ANOVA with cluster as the independent variable and see if education is different between the clusters, get means by cluster, etc.

If you want to look at the data visually, you could check out the graphs produced by some procedures: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_cluster_examples02.htm

Uta says:

July 15, 2015 at 3:24 am

Dear Anna Maria,

Can I just say THANK YOU! We have no stats support here and I was trying to find a practical tutorial, but nothing really hit the nail on the head until I found your blog 🙂 🙂
