In my copious spare time, of which I have none, I occasionally get the urge to actually read technical books from beginning to end.
I think my life took the path of most grown-ups in my field. You get a degree, or two or three or four. Perhaps during the course of that, but certainly at the end, you get what my mother refers to as “a real job”, which is a job outside of a university. In the course of this real job, they require you to do stuff – produce reports, answer questions, write research designs – whatever your real job happens to be. In producing these reports, answering questions and so on, you read PARTS OF the manual. The operative phrase in this sentence being “parts of”. You read the part that tells you how to obtain a Wald statistic using Stata – but you skip the part on what a Wald statistic actually is because you have a meeting at 2:30. You read the article on odds ratios in logistic regression but you skip the part on parallel processing for maximum likelihood methods because you have a report due tomorrow.
So maybe you have been skipping right over very useful features in your software because you never had the time to notice them. I am sure I must have mentioned this book before:
Programming and Data Management for IBM SPSS Statistics 18: A Guide for IBM SPSS Statistics and SAS© Users. It is very well-written and very free. One of the smartest things SPSS has done is make a ton of its documentation available for free, based, I think, on the reasonable notion that the better people can use its software the more likely they are to buy it. As for the title, it should be noted that 90% of the book is how to use SPSS and the other 10% is how to use SPSS if you know SAS pretty well. I’ve actually found that last section extremely useful.
Anyway, on to aggregate, which you might have guessed I was going to discuss because it is in the title. Aggregate is an incredibly cool feature in SPSS that you may never have noticed. My friend works in an Emergency Room in a large city. She is quite concerned that some people are using the ER for primary care or even just for attention. One evening she said to a patient:
“You have a serious problem because I KNOW YOUR NAME! Do you know what the definition of the word ‘emergency’ is? No one should be in the emergency room so often that they and the staff are on a first name basis!”
Let’s say you work in this ER. You have a database of client records, one record per visit. Most clients come once; some come more than once. You’d like to attach a variable to each client that is “Number of Visits”. You could then do all kinds of analyses, say, pulling out all the patients with 10 or more visits this year and seeing how many visits that represents. Or, you might want to know how much total time these chronic emergencies take up.
Here is what you do:
Go to the DATA menu and select AGGREGATE
For BREAK VARIABLE select “ClientID” or whatever your variable is named.
Check the button next to NUMBER OF CASES. The default name is N_BREAK but I changed it to “Visits” because that was a lot more obvious.
Check the button next to “ADD AGGREGATED VARIABLES TO ACTIVE DATASET”.
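For the syntax-inclined, those four clicks should paste to something like the following. This is a sketch, assuming the break variable really is named ClientID and the counter has been renamed Visits as above; what your version of SPSS pastes may differ slightly.

```spss
* Count the records (visits) for each client and attach the count to every record.
* ClientID and Visits are the names assumed in the steps above.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=ClientID
  /Visits=N.
```

MODE=ADDVARIABLES is what keeps one row per visit and simply adds the new column. Without it, the active dataset would be replaced by one row per client.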
Now I want to know how many total visits were from “chronic emergencies” and how many total minutes they took up in my ER. First, I select out these folks:
From the DATA menu choose SELECT CASES
Click the button next to IF CONDITION IS SATISFIED
In the pop-up window, enter Visits > 9
Go to ANALYZE
Then select DESCRIPTIVE STATISTICS and then DESCRIPTIVES
Move Length of Visit under Variables
Click on the OPTIONS button
Click the button next to SUM
If you prefer syntax to pointing and clicking, here you go:
COMPUTE filter_$=(Visits > 9).
VARIABLE LABEL filter_$ 'Visits > 9 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
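* LengthOfVisit is an assumed variable name; substitute your own.
DESCRIPTIVES VARIABLES=LengthOfVisit
  /STATISTICS=MEAN SUM STDDEV MIN MAX.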
There are plenty of other ways you could aggregate your data and add the counter to each record but this way is just so simple it is worth remembering.
TRUE CONFESSION: I hadn’t used aggregate in well over a year. Someone asked me how to do this and I was thinking of the LAG function and PROC SUMMARY in SAS, and there was this dim memory that there was some other way to do it. So, I just started reading the data management book from page one. A lot of it I skipped over, of course. The second chapter, on programming tips and best practices, is either new or I had skipped it when I read the book originally. It was good enough to warrant mentioning randomly, which I just did. Anyway, some time after the second chapter I came across the mention of aggregate and it all came back to me.
I may have told this story before and forgotten it, so I am telling it again, which is kind of the point. A few years ago, I was writing a proposal on increasing parental involvement in special education. I said to my incredibly helpful research assistant that someone must have published articles on this, I mean, it isn’t exactly an esoteric topic, so please run through a few databases of scientific articles and bring me the references. She came in less than an hour later, laughing, with a list of articles. She said,
“Yes, you’re right. Someone had done research on this. Four of the first twelve references to pop up were by someone named Rousey.”
This is funny because that was my name before I remarried, and not only had someone done research on it but that someone was me! Come on, a couple of them were from 1990. Can you remember what YOU were doing in 1990? If you’re the age of a lot of people I meet at conferences and my helpful young research assistants, you were probably drinking juice from a box in Ms. Campbell’s kindergarten class.
So, now you know one of the main reasons I write this blog. I’ll vaguely recall something about crontab or aggregate or a geometry column in Proc GMAP and remember that I used that a year or two ago. If I write a post about it, maybe it will be helpful to someone else, and, in 2012, when I need to do the same thing again, I can search my blog and well, what do you know!