# Finding Groups in Data

Today, Dr. De Mars is — happy.

One of the fun things about my job is that I get to do lots of different things. That can be a bit troubling some days, because “statistical software consultant” covers a wide range: different types of models, coding, various operating systems, and all of the non-parametric, parametric, Bayesian and other statistics that I cannot remember at the moment.

Because the range of people I work with continually increases, I am now more often running into questions I cannot answer off the top of my head. I do know how Mahalanobis’ distance is used, even though I had not thought about it in years until someone asked me a question yesterday. I do know the calculation for pooled variance, which should be used when Levene’s test is not significant, that is, when you cannot reject the hypothesis of equal variances. Still, once a day or so, someone asks me a question I have to look up. Sometimes these are on techniques I have not used before, and just as many times the question relates to something that I KNOW can be done, because I personally have used that statistic or written that code before. I just can’t remember how.

You know that saying,

“I have forgotten more about statistics than you’ll ever know.”

Well, that is my problem. I keep forgetting it. Fortunately for me, and this is why I am happy, I get to consult on a lot of different projects each week that remind me of things I used to know. For example, cluster analysis, as the Stata multivariate statistics guide so poetically says, is used for finding groups in data. You can use it to identify or validate specific diagnostic groups, or you can try to group just about anything. Most often, cluster analysis is used as an exploratory technique, which is my favorite type of statistics, where you are turning a bunch of numbers into knowledge.

The most common way to use cluster analysis is the k-means technique. You assume there are k groups (with k being a number you specify) and the program iterates to a solution. The program starts with k “seeds”, which serve as the initial means for each group. Every observation is assigned to the group whose mean is closest to it. New group means are calculated based on the observations in each group. If an observation is now closer to the mean of a different group, it is moved into that group. Then, group means are calculated again. This continues until a step is reached where none of the observations change groups. And that is one way to do cluster analysis.
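For anyone who would rather see those steps than read them, here is a minimal Python sketch. It is only an illustration, not how Stata, SAS or SPSS actually implement k-means. The function names (`k_means`, `squared_distance`, `mean_point`), the choice of picking the seeds at random from the data, the squared Euclidean distance, and the toy data are all my own assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two observations (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of observations."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def k_means(data, k, seed=0, max_iter=100):
    """Assign each observation to one of k groups.

    Returns (assignments, means), where assignments[i] is the group
    index of data[i] and means[g] is the mean of group g.
    """
    rng = random.Random(seed)
    # Seeds: k observations picked at random (an assumption; packages
    # offer several ways of choosing starting values).
    means = rng.sample(data, k)
    assignments = None
    for _ in range(max_iter):
        # Assign every observation to the group whose mean is closest.
        new_assignments = [
            min(range(k), key=lambda g: squared_distance(obs, means[g]))
            for obs in data
        ]
        # Stop when no observation changed groups.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recalculate each group mean from its current members.
        for g in range(k):
            members = [obs for obs, a in zip(data, assignments) if a == g]
            if members:  # keep the old mean if a group ends up empty
                means[g] = mean_point(members)
    return assignments, means


# Toy data: two obviously separate clumps of points.
toy_data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
            (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
groups, centers = k_means(toy_data, k=2)
print(groups)
print(centers)
```

The group numbers themselves are arbitrary labels. Which clump ends up as group 0 depends on the seeds, so what matters is which observations land together.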

# SPSS: Does not play well with others


SPSS does some pretty cool things, and I have written about some of them here. However, there are also some annoyances. First of all, it is one of the hardest packages to use when moving datasets from one format or platform to another. The regular SPSS dataset, the one that ends in .sav, crashes many mail and file transfer programs when you try to email or upload it.

Recognizing this problem, SPSS created the .por (for portable) file that makes it easier to move files across systems. These files can be easily uploaded or downloaded. I routinely move mine from a Mac to a Unix server to a Windows machine. For students who are used to clicking on an Excel file and having it download, it can be a bit daunting when they click on a .por file and it comes up with gibberish that looks something like this:

`1233 SPSS some stuff 12397346 9400- -9774 45556 “ ~~~ 34455`

for an entire page.

And yes, I know the solution is to just right-click on the file name and then you can select “Save Target As” or “Save Link As” and then save the file to disk. For once, the problem cannot be blamed on Internet Explorer (the archetypal does-not-play-well-with-others) and Microsoft’s on-going efforts to take over the world, because this gibberish-appearing trick happens with every browser.

Incredibly, SPSS does not even play well with itself. I discovered this less than lovely characteristic when students sent me .spo output files from SPSS 15 and I could not read these because I had not installed the legacy viewer that comes with SPSS 16 for Windows.

Well, I guess no one is perfect.

Lately, I have been playing with Stata, which, although it is not going to become my all-time favorite statistical software any time soon, is starting to grow on me, like that unattractive, ill-behaved child in a class who slowly charms himself into being the teacher’s pet. Stata: Bart Simpson of statistical packages.

I guess that would make SAS Lisa, and SPSS Maude Flanders.