I’ve been asked several times what made me change my mind about SAS Enterprise Guide. SAS EG and my husband have a lot in common. For one thing, neither made a great first impression.

When I first met Dennis and saw that he was the same size as me (which my third daughter says is only the perfect height if your aim is to be shipped in a box), the first thought that went through my mind was,

“Oh my God, I’m dating a munchkin!”

When I first used Enterprise Guide, probably at version 1, my first thought was,

“Who would ever use this $#@ ?”

Here is another similarity between my husband and SAS EG: the more time I spent with both, the more I realized,

“Hey, you are really brilliant.”

Since I have four daughters, I have given a lot of talks about men, types of men, things to look for and what to avoid. Contrary to my friend who believed that men only come in three types, I think they are more complicated than that.

There are brilliant men who are great to be around because they are just so deep and insightful, you never get bored. At the same time, they are really high maintenance. They expect you to be always perfectly dressed, ready to go to Lago at the drop of a hat, take their laundry to the dry cleaners and be interested in their favorite sports teams. That’s SAS. Brilliant, high maintenance, but worth it.

Then there are men who are just as brilliant but a lot more comfortable to be around. They can glance at the third iteration and tell you the final equation before the computer finds the solution. At the same time, they are happy to stay home, drink beer and watch the Daily Show. If you want to go to Lago, that’s fine, too. Brilliant, easy to get along with, but able to rise to any challenge.

How is that SAS Enterprise Guide? First of all, as I said the other day, the time it takes to do the data cleaning and checking can be cut to a FRACTION of what it is with SAS. That is a lot of the hours that go into a research project.

Second, there is the 80-20 rule, where 80% of your research projects are going to use 20% of all possible techniques (ANOVA, linear regression, logistic regression), and those are extremely easy to do in SAS EG.

[Image: infdads2]

Third, and really important – think of the children! SAS EG would make a good dad because good dads make things easy for the kids to understand. EG has a much gentler learning curve than does SAS. This is one reason I think it was a brilliant move for the company. As a consultant, most people come to me interested in learning SPSS because it is easy to get started with the pointing and the clicking. Lately, the big interest in classes on campuses has been with SAS Enterprise Guide. EG fits with how people are used to using computers. We have a whole younger generation that expects to use a computer to solve just about every problem in life and certainly does not expect to learn programming.

Fourth, and equally important, let’s say you WANT to clean up nice and go to Chinois on Main – well Enterprise Guide includes a code window (under the PROGRAM menu option in 4.2) where you can write code to your heart’s delight. Although Enterprise Guide is relatively easy and comfortable to use, it combines that with the limitless range of SAS.

Does that sound like an ad? Not quite. There are still two significant disadvantages. While SAS EG is easy to learn relative to programming in SAS, it is still not exactly intuitive. If you have been using SAS for a long time, you’ll probably find EG a piece of cake. If not, it’s kind of like Dreamweaver: compared to coding HTML from scratch it is easy, but compared to typing it is hard. For those people who want statistical analysis to be like the genie in Aladdin,

“Computer, bring me a repeated measures Analysis of Variance.”

well, this isn’t quite it. But it IS closer.

The other really bad thing about SAS EG it also has in common with my husband – it’s difficult to get to know. (Hint to women: If you like the guy, ignore your friends who tell you that anyone who is 42 years old and never been married must be gay. See photo of baby above as evidence of not-gayness. )

We spent frustrating months trying to get the installation working at our site. We finally (I think) have a usable method for individuals but SAS EG is still not working error-free in our labs. That’s my project for next week.  So, yes, it is brilliant, flexible, comfortable and has limitless possibilities. However, if they don’t fix that installation mess, SAS EG may end up like my husband and be forty-two years old before it gets a really great user base.

At the JMP seminar on Monday, when Dick De Veaux said that 65-70% of time in all research projects is spent on data cleaning, everyone in the audience groaned in agreement.

One of the biggest problems I run into is recoding those simple textboxes. For example, we often want to look at data for one reservation or tribe. Now, one would THINK that the answer to a question like:

If you are an enrolled tribal member, in what tribe?

would be SIMPLE? I mean, you know what tribe you’re in, right? Most likely it is on the sign on every **^ building on the reservation.

[Image: Spirit Lake Tribe sign]

No. People need to abbreviate because it takes too much time to write, for example, “Turtle Mountain”, so they have to abbreviate it as TM because that extra 4 seconds they saved could be so profitably spent driving slowly on the back roads in front of me when I need to be somewhere. Other people, helpfully, put TMT (for Turtle Mountain Tribe), or the legal name, “Turtle Mountain Band of Chippewa Indians”, which is abbreviated by some as TMBCI. Still others put Ojibwe, which is also sometimes spelled Ojibwa, which is the name for Chippewa in Ojibwe (or is it Ojibwa?). I can go on for another several paragraphs on this, because there are also Red Lake, White Earth and more reservations of the same tribe.

So … (hyperventilating here), in the old days, if I wanted to do an analysis of just the respondents from the Chippewa tribe,  I would do something like :

1. Do a frequency distribution to find all the zillion permutations of this ONE question.

2. Be restrained by the wiser, kinder members of our office staff from beating the participants in our research with sticks.

3. Explain for the 59th time in the staff meeting why tribe cannot be a multiple choice item (there are 562 federally recognized tribes. We cannot have a multiple choice item with 562 choices.)

4. Write SAS statements that look something like this:

Tribe = upcase(Tribe) ;   /* compare in one case only */

Tribe_s = substr(Tribe,1,4) ;   /* first four characters */

If tribe_s in ("TMBC", "TURT", "CHIP", "OJIB")
   OR tribe in ("RED LAKE", "WHITE EARTH") then chippewa = 1 ;

Except that the IF statement would be much, much, much longer. This is a common type of problem and you can find lots of solutions in many SAS Global Forum papers. Here is my new one, no beatings required.

In SAS Enterprise Guide, go to TASKS, select DATA, select FILTER & SORT. Click on the thing that looks like a filter.

[Image: select in list]

From the drop-down variable list, I pick TRIBE. From the next list, I select IN A LIST. I click on the three dots and a list of all values appears. I can hold down the shift key and select several in a row, all the Chip, Chippewa, Chippewa Tribe and so on. After clicking OK, I can go back and select more values to add to the list.

Done. Pointing and clicking and no clever uses of SUBSTR, UPCASE or other nifty functions required. (Note to self: Find out what new careers the SAS function users have now.)
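For anyone who wants to see roughly what those clicks amount to in code, here is a minimal sketch. The data set name WORK.SURVEY and most of the spellings in the list are made up for illustration, and this is hand-written, not the code Enterprise Guide generates behind the scenes.

```sas
/* Assumed: a data set WORK.SURVEY with a character variable TRIBE. */

/* Same idea as clicking the three dots: list every distinct value. */
proc freq data=work.survey;
    tables tribe / missing nocum;
run;

/* Same idea as IN A LIST: keep only the rows whose value was picked. */
/* The values below are hypothetical examples of what the list holds. */
data work.chippewa;
    set work.survey;
    where tribe in ('Chip', 'Chippewa', 'Chippewa Tribe', 'TM', 'TMT',
                    'Turtle Mountain', 'Ojibwe', 'Ojibwa');
run;
```

The comparison is case-sensitive, so every spelling has to appear in the list exactly as it appears in the data. That is the price of skipping UPCASE and SUBSTR.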

So, now, with a few points and clicks, by going to TASKS, selecting DESCRIBE, then SUMMARY TABLES, I can produce this table that tells me if you are on one of the Chippewa reservations surveyed, your internet usage is related to your years of education and age. The relationship between internet usage and age seems to be curvilinear here.

[Image: table results]

My suspicion is that older people are slower to adopt new technologies. However, technology adoption is also enabled by money, and those with more money tend to have more education (which is somewhat related to age; you don’t have a lot of 18-year-old college graduates). I can begin to examine some multivariate relationships now.
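A rough code equivalent of that kind of summary table would be PROC TABULATE. This is only a sketch: the data set and the variable names (EDUC_GROUP, AGE_GROUP, NET_USE) are hypothetical, and the layout is a guess at a reasonable table, not a reproduction of the one above.

```sas
/* Assumed: WORK.CHIPPEWA with three categorical variables.      */
/* EDUC_GROUP, AGE_GROUP and NET_USE are hypothetical names.     */
proc tabulate data=work.chippewa;
    class educ_group age_group net_use;
    /* Rows: education groups, then age groups.                  */
    /* Columns: count and row percent for each usage category.   */
    table educ_group age_group, net_use * (n rowpctn);
run;
```

Row percentages make it easy to read across and see how usage shifts from one education or age group to the next.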

Notice what is going on here! In about 12 seconds I have motored through the data cleaning part, combining the frequency distributions, recoding and selection, and am now delving into data analysis.

I would not go so far as to say that this is better than sex (hence explaining the four children) but it is definitely way cool and makes me happy.

In the interest of full disclosure, I must say this. If you have never used SAS before, it will take you longer than 12 seconds. Here is why:

  1. The filter on filter and query is three blanks followed by a box with three dots. The reaction of an experienced SAS programmer, especially one who ever used the analyst application, is to recognize that as an IF statement with the first box as the variable (click arrow for drop down list of variables), the second box as the operation (click arrow for drop down list of operations) and the third box as whatever you want to select (click the three dots for more). The reaction of the rest of humanity is going to be WTF?
  2. In creating computed columns, which I did to recode the internet usage variables, I immediately knew that the format I wanted was $CHARw. and that I needed a length of 10. Someone who has never used SAS will not.
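To make the second point concrete, here is a sketch of what a computed column like that looks like when written by hand. The source variable NET_HOURS, the cut-offs and the labels are all invented for illustration; only the $CHAR10. format and the length of 10 come from the description above.

```sas
/* Hypothetical recode of a numeric NET_HOURS into a short text label. */
data work.chippewa_recode;
    set work.chippewa;              /* assumed input data set           */
    length net_use $ 10;            /* room for the longest label       */
    format net_use $char10.;        /* $CHARw. with w = 10              */
    if missing(net_hours)  then net_use = ' ';
    else if net_hours = 0  then net_use = 'None';
    else if net_hours < 10 then net_use = 'Light';
    else                        net_use = 'Heavy';
run;
```

The LENGTH statement has to come before the first assignment. Otherwise SAS sets the length from the first value it sees and quietly truncates anything longer.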

So, SAS EG is a great thing for any researcher who is a SAS programmer. It is also a great thing for any researcher who wants to be a SAS programmer. I won’t lie to you and say it will be completely easy and painless, but it is true that less beating of subjects with sticks will be required.

The model is non-significant, therefore my theory is supported.

Huh?

Just when you thought it was safe to get back into statistics… It took you two years of graduate school but now you have it down. P-value low = good, relationship detected, publication, tenure, Abercrombie & Fitch models at your feet.

P-value high = bad, no relationship, no publications, no money, dating the creepy guy next door.

Enter Hosmer to screw things up.

There are a whole bunch of reasons you might want to do a logistic regression (no, I’m serious). If you want to predict a categorical dependent variable like death, drop-out or watching Afghan Star. If you were going to do a propensity score match you would start with logistic regression. If you plain can’t think of anything else to do with your evenings.

The first thing would be to see if your dependent variable has a relationship with your grouping variable; if not, you really are wasting your time. Okay, now that is settled, you have found that people seen in hospitals with Intensive Care Units are more likely to die than those seen at other hospitals.

You also want to see if the variables on which they differ have anything to do with the outcome. For example, I ran an analysis where I coded their favorite colors of pants – blue, brown, white, black or green pants (seriously, who buys green pants?). People who went into intensive care were more likely to own green pants. To test if this is significant, I run a logistic regression with death as the outcome variable and pants color as the predictor.

In SPSS you go to ANALYZE > REGRESSION > BINARY LOGISTIC

So, the Hosmer and Lemeshow Test is statistically significant with a chi-square of 349.06, df = 4 and  p < .001. Is that exciting? Do I immediately publish an article on “The American Apparel Effect” and how poor fashion taste is dangerous to your health?

Not so fast. You see, Hosmer & Lemeshow tests the Goodness of Fit of the model predictions to the observed data. If you reject the hypothesis that your model fits the data, that is bad!
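The example here was run in SPSS. For SAS users, a minimal sketch of the same test is the LACKFIT option in PROC LOGISTIC. The data set name and the variables DIED and PANTS_COLOR are hypothetical stand-ins.

```sas
/* Assumed: WORK.PATIENTS with DIED (1 = died, 0 = survived)  */
/* and a categorical PANTS_COLOR. Both names are made up.     */
proc logistic data=work.patients;
    class pants_color / param=ref;
    /* LACKFIT requests the Hosmer and Lemeshow goodness-of-fit test. */
    model died(event='1') = pants_color / lackfit;
run;
```

The null hypothesis of that test is that the model fits, so a small p-value in the Hosmer and Lemeshow table is the bad news, not the good news.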

In my next logistic regression, I used age over 65 as a dichotomous variable. My second variable was the Dr. MechOth scale. Dr. MechOth (not her real name) was a friend of mine when I was a young Assistant Professor who occasionally hung out in bars. Dr. MechOth rated all men on a 1 to 3 scale, where 1 = “Yes”, 2 = “Maybe if I was drunk” & 3 = “I couldn’t get drunk enough”.

The results of the Hosmer & Lemeshow test shown below, with a chi-square = 4.52, df = 3, p > .20, give no evidence that the model fits the data poorly, although the fit could be better.

[Image: significlogistic]

Does this mean that in logistic regression high p-values are always a good thing? Nope, that would be too easy for you to remember. In fact, no sooner have we inverted our understanding of p-values than it is time to do it again. When interpreting the COEFFICIENTS, a low p-value is a good thing. So, which of Dr. MechOth’s groups one is in, and being really, really old, are related to probability of death.

[Image: significlogistic2]

Sadly, my original hypothesis about death by green pants is not supported and all I have discovered is that if you are really, really old and no one would go home from a bar with you if you were the last person on earth, you are more likely to keel over dead from natural causes or suicide, whichever comes first, than hot, young people.

I do not think I will be winning the Nobel Prize for Medicine any time soon. I wonder if that guy next door likes Cup-A-Noodle soup.

Lately I have been on a roll looking at relatively less common statistical techniques: proportional hazards, survival analysis, etc.

In keeping with that, I have been taking a look at propensity score matching, fondly known as PSM by, well, by no one actually.

The problem to be solved ….

Think about some of these comparisons:

  • Hospitals with special burn, cardiac or neonatal units versus general hospitals
  • Public schools versus parochial, private or charter schools
  • People who watch TV > 40 hours weekly versus those surfing the Internet > 40 hours

In all of these cases, and probably a lot more you can think of, there are very likely differences in certain “outcome” variables, whether it be survival in the case of hospital patients, academic achievement of students or annual income of TV versus Internet users. However, all of these comparisons also begin with groups who are already different.

For example …

You have two groups, say people who are treated at a hospital with a specialized unit for terminally ill patients and patients from another hospital without any such specialized unit.  Your outcome variable of interest is whether the patient lived or died.

The simplest way to test this is a chi-square. You compare the percentage of people who survived at St. George of Money Hospital versus Heart of Despair County Hospital. There is a problem with that, though. A simple comparison will almost always show WORSE outcomes for hospitals with special units for patients who are terminally ill, seriously burned, born extremely premature, etc. The reason is probably obvious: if you get sicker patients, they are less likely to live. If your interest is in knowing whether having a specialized unit increases your chances of survival, you would want to compare similar groups.
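That naive comparison is a two-line job in SAS. A minimal sketch, assuming a data set WORK.PATIENTS with one row per patient and hypothetical variables HOSPITAL and SURVIVED:

```sas
/* Assumed: WORK.PATIENTS with HOSPITAL and SURVIVED (hypothetical names). */
proc freq data=work.patients;
    /* CHISQ adds the chi-square test. NOCOL and NOPERCENT leave  */
    /* only the row percentages: percent surviving per hospital.  */
    tables hospital * survived / chisq nocol nopercent;
run;
```

The test will happily report a significant difference. What it cannot tell you is whether the difference comes from the hospital or from who walks in the door.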

It isn’t as simple as just controlling for severity of condition, though. There are other variables, for example, people who are better educated, who have private insurance and who live in urban areas all may be more likely to be patients at more “elite” hospitals. Some of those factors may be related to survival as well. What we’d really like is to compare a group of people from St. Money’s that is similar to patients from Despair.

In short, certain types of people have a greater propensity to be admitted to one type of place than the other.

Enter propensity score matching — to the sounds of trumpets and wearing a cape.

In fact, the first step is to do a logistic regression analysis and I will admit that it is not strictly necessary to wear a cape while doing so but it would probably be more comfortable than this business suit from Filene’s that I am wearing.

Using SPSS, go to the ANALYZE menu, select REGRESSION, then select BINARY LOGISTIC. Your dependent variable will be the hospital to which the patient was admitted. Covariates are the variables, such as education, severity of illness and insurance, that you want to control. For variables that are categorical, e.g., insurance, which could be private, public (a.k.a. Medi-Cal, if it hasn’t disappeared in the latest round of state budget cuts) or none, click on the CATEGORICAL button and move those over to the “Categorical covariate” window.

Here’s the really important part  — click on SAVE and select PREDICTED PROBABILITIES – that is your propensity score.

This is what you are going to match on. Hence the name.
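The steps above are for SPSS. A comparable sketch in SAS uses the OUTPUT statement in PROC LOGISTIC to save the predicted probabilities. The data set, the variable names and the value 'Money' are all hypothetical.

```sas
/* Assumed: WORK.PATIENTS with HOSPITAL ('Money' or 'Despair'),   */
/* EDUCATION, SEVERITY and a categorical INSURANCE.               */
/* Every name and value here is made up for illustration.         */
proc logistic data=work.patients;
    class insurance / param=ref;
    model hospital(event='Money') = education severity insurance;
    /* P= saves each patient's predicted probability of being at  */
    /* the event hospital. That is the propensity score.          */
    output out=work.scored p=pscore;
run;
```

WORK.SCORED then holds every original row plus PSCORE, ready for the matching step.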

This is step one. I would say it gets easier after this point – but it doesn’t.