# Beware mean substitution! (And the importance of mothers)

Today was a lesson in why one should always be a little leery of mean substitution. I had downloaded a data set to use as a logistic regression example for my class tomorrow. It happened to be the 2010 Monitoring the Future study and I was particularly interested in school drop out.

This is a sample of around 15,000 students in their senior year of high school. You would think that once students had made it to their senior year they would stick around and graduate. There were 90 students who said they didn’t expect to graduate, and about 14,000 who expected to graduate on time. (The rest either expected to graduate in the summer or did not answer.)

Because this was a student assignment, I didn’t want to bother with a huge data set. I used PROC SURVEYSELECT to pull a random sample of 500 students from those who expected to graduate and combined that with the 90 who didn’t for a comfortable sample size of 590.
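That sampling step could look roughly like the sketch below. The data set name `work.mtf2010` and the variable `expect_grad` (1 = expects to graduate on time, 0 = does not expect to graduate) are made up for illustration; the real Monitoring the Future file uses its own variable names and codes.

```sas
/* Hypothetical names: work.mtf2010 and expect_grad are assumptions */
data work.grads work.nongrads;
  set work.mtf2010;
  if expect_grad = 1 then output work.grads;
  else if expect_grad = 0 then output work.nongrads;
run;

/* Simple random sample of 500 students who expect to graduate */
proc surveyselect data=work.grads out=work.grads500
                  method=srs sampsize=500 seed=2010;
run;

/* Stack the 500 sampled students with all of the non-graduates */
data work.dropout_sample;
  set work.grads500 work.nongrads;
run;
```

Setting a seed is optional, but it means students who rerun the program get the same 500 people.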

One of the variables I wanted to use in my equation was Mother’s Education. This was on a scale from 1 (= grade school) to 6 (= graduate school). There is also a category 7 (= don’t know).

There were only 4% of the subjects who had put “Don’t know” for mother’s education and you might think it wouldn’t be a big deal to just use the mean. As an alternative, there are multiple imputation procedures for handling missing data. I could have gone with either of those alternatives and moved on. Instead, though, I got to thinking …

These aren’t little kids in elementary school. These are high school seniors. Why DON’T they know how much education their mother has?

My husband died when my children were 8, 9 and 12 years old. I had a good friend whose wife died when his children were almost the exact same ages. On more than one occasion when we’ve been discussing how life turned out, he’s said to me very seriously,

I think my kids would have been much better off if I had died and my wife had lived.

It seems like a pretty harsh thing to say, and my friend is a very hard-working, good person who has tried his damnedest to be a good father, but he seems very sincere about his opinion. So … the first thing I did was run a cross-tabulation of mother’s education by whether or not the student expected to graduate. I found that students who did not expect to graduate were more than four times as likely to not know their mother’s level of education (17%) as students who did expect to graduate (4%).
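A cross-tabulation like that one might be run along these lines. This is only a sketch: `work.dropout_sample`, `expect_grad` and `mom_educ` are assumed names, and the format simply collapses codes 1 through 6 into "Knows" so the table compares knowing versus not knowing.

```sas
/* Hypothetical variable names; 7 = "don't know" as described in the post */
proc format;
  value momknow 1-6 = 'Knows'
                7   = "Don't know";
run;

proc freq data=work.dropout_sample;
  format mom_educ momknow.;
  /* Row percents give the share of "Don't know" within each graduation group */
  tables expect_grad * mom_educ / nocol nopercent chisq;
run;
```

PROC FREQ groups on formatted values, so no new variable is needed just to get the two-by-two table.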

That intrigued me enough that I went back to the original data set and pulled out some additional variables, including whether or not there was a mother in the home. When I ran my first analysis with independent variables of student self-rating of school ability (1 = far below average to 7 = far above average), race (white, black, Hispanic), gender and whether or not there was a mother in the home, I got this lovely ROC curve.
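For anyone who wants to try something similar, a model of this shape could be requested roughly as follows. The variable names (`expect_grad`, `ability`, `race`, `gender`, `mother_home`) are placeholders I made up, and the choice to model the probability of NOT expecting to graduate is an assumption about the coding.

```sas
ods graphics on;

/* Hypothetical names; expect_grad is assumed coded 0 = does not expect to graduate */
proc logistic data=work.dropout_sample plots(only)=roc;
  class race gender mother_home / param=ref;
  model expect_grad(event='0') = ability race gender mother_home;
run;

ods graphics off;
```

The `plots(only)=roc` option asks for just the ROC curve rather than the full set of default graphs.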

Only ability was more important than whether or not there was a mother in the home. Now you could argue there are all sorts of reasons why a mother might not be in the home. Principal among these is that the student may not be living at home but rather with a significant other or spouse. Still, if you got married and moved out of the house, you wouldn’t forget how much education your mother had.

In fact, in another analysis I looked at being married versus single as a predictor and it was significant but not as much as having a mother in the home.

My point (and by now you may have despaired of me ever having one) is that if you just go blithely ahead with mean substitution, you may overlook some very interesting questions that arise in your data, such as why you have missing values in the first place.
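If you do end up substituting the mean, one hedged suggestion is to keep a flag for who was missing, so the question of why they were missing stays answerable. A minimal sketch, assuming a variable named `mom_educ` where 7 means "don’t know" (the data set and variable names are hypothetical):

```sas
/* Flag the "don't know" responses BEFORE touching the values */
data work.mom;
  set work.dropout_sample;
  mom_unknown = (mom_educ = 7);       /* 1 = did not know, 0 = otherwise */
  if mom_educ = 7 then mom_educ = .;  /* treat "don't know" as missing   */
run;

/* REPONLY replaces only the missing values, here with the variable mean */
proc stdize data=work.mom out=work.mom_imputed reponly method=mean;
  var mom_educ;
run;
```

With `mom_unknown` still in the data set, it can go into the model as its own predictor, which is where the interesting finding above came from.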

I have much more to say about this, but I have a child who wants me to come upstairs and read her Little Women, so it will have to wait.

# Can you say Caveat Emptor if the data are free?

Let the buyer beware – that phrase certainly applies to open data, as does the less historical but equally true statement that students always want to work with real data until they get some.

Lately, I have had students working with two different data sets that have led me to the inevitable conclusion that self-report data is another way of proving that people are big fat liars. One data set is on campus crime and, according to the data provided by personnel at college campuses, the modal number of crimes committed per year – rape, burglary, vehicle theft, intimidation, assault, vandalism – is zero. Having taught at a very wide variety of campuses, from tribal colleges to extremely expensive private universities, and never having seen one that was 100% crime free, I suspected – and this is a technical term here – that these data were complete bullshit. I looked up a couple of campuses that were in high crime areas and where I knew faculty or staff members who verified crime had occurred and been reported to the campus and city police. These institutions conveniently had not reported their data, which is morally preferable to lying through their teeth, with the added benefit of requiring less effort.

Jumping from there to a second study on school bullying, we found, as reported by school administrators in a national sample of thousands of public elementary, middle and secondary schools, that bullying and sexual harassment never, or almost never, occur and there are no schools in the country where these occur on a daily basis. Are you fucking kidding me? Have you never walked down the hall at a middle school or high school? Have you never attended a school? What the administrators thought to gain or avoid by lying on these surveys, I don’t know, but it certainly reduces the value of the data for, well, for anything.

So … the students learn a valuable life lesson about not trusting their data too much. In fact, this may be the most valuable lesson they learn, Stamp’s Law:

The government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases.

From an analysis standpoint, this is my soapbox that I am ranting on every day. Before you do anything, do a reality check. If you use SAS Enterprise Guide, the Characterize Data task is good for this, but any statistical software, or any programming language, for that matter, will have options for descriptive statistics  – means, frequencies, standard deviations.
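In SAS code, a bare-bones reality check might look like the sketch below. The data set `work.campus_crime` and the variables `burglary` and `vehicle_theft` are stand-ins I invented; substitute whatever you actually downloaded.

```sas
/* Hypothetical data set and variable names */

/* Counts, missing counts and ranges for every numeric variable */
proc means data=work.campus_crime n nmiss mean std min max;
  var _numeric_;
run;

/* Frequency tables, with missing values shown as their own category */
proc freq data=work.campus_crime;
  tables burglary vehicle_theft / missing;
run;
```

If the mode of every crime variable is zero, or half the rows are missing, this is where you find out, before you have built a model on top of it.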

This isn’t to say that all open data sucks. On the contrary, I’m working with two other data sets at the moment that are brilliant. One used abstracts of medical records data over nine years plus state vital records to record data on medical care, diagnoses and death for patients over 65. Since Medicare doesn’t pay unless you have data on the care provided and diagnosis, and the state is kind of insistent on recording deaths, these data are beautifully complete and probably pretty accurate.

I’ve also been working with the TIMSS data. You can argue all you want about teaching to the test, but it’s not debatable that the TIMSS data are pretty complete and pass the reality test with flying colors. Distributions by race, gender, income and every other variable you can imagine are pretty much what you would expect based on U.S. Census data.

So, I am not saying open data = bad. What I am saying is let the buyer beware, and run some statistics to verify the data quality before you start believing your numbers too much.

# I’m Claiming my Love Stats Award

It’s about time I got some recognition!

You can claim your own Love Stats award here.

Be careful of undeserved awards, though. The last person who falsely claimed a Love Stats award had multicollinearity in his measures, a high VIF and died of complications of homoscedasticity. You have been warned.

# Geeks Guide to Women: AnnMaria Explains it All

I am tired of hearing that men in technical fields cannot meet women. Some stupid company even went so far as to say that they had hired hot women to work at their company so now they were going to have no problem attracting good programmers and engineers. This is stupid on so many levels. As the house rocket scientist said,

“Right. That’s exactly what guys want, more good-looking women to turn them down. I think the way  to attract talent is have interesting problems and pay people well.”

To continue on the logic – some of the technical people you want to hire are women, a lot of those men are married, and explaining to your wife that you are taking a job at another company because there are hot women is probably not the kind of conversation conducive to marital happiness. I presume this company is looking for young men who are very bright with no social life so they will work all of the time. I do have to wonder, though, what the “hot women” hired by the company think of being used as recruiting bonuses.

As a public service, I have decided to explain the three steps to actually meeting a woman. This is based on the extensive research of being a woman, having been married three times (less exciting than it sounds, I was divorced at 25 and my second husband passed away after 10 years of marriage) and accomplished “till death do us part” 67% of the time – in excess of the national average I might add – being the mother of three daughters in their twenties and spending well over half my life as a programmer, statistician and engineer. I believe anthropologists call this “participant observation”. All of these same steps apply in on-line relationships as well.

Jenn after finishing her M.A.

1. Say “Hi”.  This is the first step where many men fail. Have you actually talked to that woman in accounting or who you see every day at the coffee shop? Next time you walk by her cubicle, see her in front of you in line or you pass each other in the halls in your apartment complex, say “Hello”.  That’s all. Just say, “Hi”.  The next time you see her, do it again. (I don’t mean walking by Janet’s cubicle to the water cooler every hour and saying hello. That puts you in the weirdo camp.) I mean, when you walk into the office in the morning or by her desk as you head out for lunch.

After this has gone on for a few days, if you are behind her in line or coming out of the elevator, say, “You know, I don’t think we’ve been introduced. My name is Jack. I work over in the systems administration group.” (Unless your name is actually Jack and you really are in systems administration, you should substitute the correct information in that sentence.) DON’T ASK HER OUT. She probably already knows who you are, but that doesn’t matter. She’ll tell you her name, as if you didn’t know, and something about herself. If she tells you she works in accounting or for Goldman Sachs, ask her about it: how she likes working there, how long she has been there, how she ended up there, whether she always planned on being an accountant. You might discover she is the most boring person on God’s green earth and you don’t want to go out with her, but that is okay, too. After you’ve walked to where her desk is or picked up your coffee, say “Nice talking to you.” And leave.

Exhibit A of how seldom men actually even say, “Hi” is darling daughter number two, shown above right after she finished her master’s degree. For four semesters, while she was getting her M.A., she would come over to the building where I worked and we would car pool home. Often, I was stuck in a meeting and she would be sitting downstairs waiting for me in a building where hundreds of the exact men who complain about meeting women worked. At the time, she wasn’t dating anyone, having just moved back to L.A. to start graduate school. How many people in that time actually stopped and said, “Hi” to her? About seven – four married men who worked with me and knew she was my daughter, and the three security guards who rotated at the front desk, one of whom was a woman and another old enough to be her grandfather.

An even better time to say hi is when someone new starts at work or moves into the neighborhood. When you pass the person in the hall, say, “Hi, I’m Jack. I know you’re new here – if you need any help, I’m in apartment 12 / I work in the Unix group – feel free to ask. I know it is hard getting used to everything in a new place.”

Here is a really key point – do this whenever someone new moves in/ comes in to the area. She may be married, old or not the type of woman you would date in a million years. You ought to offer to help out new men at work too. You don’t want to be seen as that creepy guy who hits on all the attractive women.

Why anyone thinks all of these suggestions go out the window when they are online is beyond me. Do you know how many women on on-line sites get messages that start out, “Show me your tits”? If you wouldn’t do that in the hallway, don’t do it on-line.

2. Don’t be a jerk. If Janet does come over and ask how to log in to the server, don’t start out with “Well, you see we have these things called computers…” I cannot count the number of men I have seen trying to impress women with their technical knowledge and instead coming off as a pompous ass (and I can count pretty high). If she asks a question, tell her the answer. Then, if you can, add something like, “Did anyone tell you where the company handbook/ procedures/ secret site of insider knowledge is?” and show her where she can find the answer for herself in the future. DON’T ASK HER OUT.

From talking to Janet in the elevator, you know that she works in accounting, is into fly fishing, whatever. The next time you see her in the hallway, ask her if she could recommend an accountant, or tell her you are interested in taking up fly fishing and wonder if Bass is a good place to buy bait. Don’t lie. If fly-fishing makes you want to puke, go with the accountant angle or you may be trying to find some way to get out of cutting bait. DON’T ASK HER OUT. On the other hand, if she says, “I’m going to Bass on Friday after work, do you want to come?” Don’t be an idiot. Say yes.

Keep in mind, though, that when a woman asks you if you want to go to lunch or go to Mandelbrot’s lecture (well, before he was dead) or go to Macworld, she may just think it would be nice to talk with you at lunch or to have company to hear Mandelbrot talk about fractals or go to Macworld.

Again, all of this applies on-line as well. Darling daughter number three (shown here) has thousands of twitter followers and Facebook friends. Fairly often, she’ll get a tweet along the lines of,

“I would like you to have my babies.”

and many, many more graphic ones. What the hell are these people thinking? If she ever in a psychotic break went out with one of these idiots, I’d smack them in the head the instant they came into the house. The only one that I did not think was super-creepy was the guy who asked her to the Marine Corps Ball. If she hadn’t been in the middle of training for a fight this Friday, it wouldn’t have surprised me at all if she had taken him up on it.

So, when do you get to go out with her?

3. Find the right woman. You might think this would come first. You’re wrong. If you do steps 1 & 2 first, you’ll find out if Janet is married. DON’T EVER ASK HER OUT. What are you, a moron? If Janet has a boyfriend, don’t ask her out either. Seriously, do you want to date a woman who cheats on her boyfriend? If she breaks up with him, and she’s interested in you (and often even if she isn’t) she’ll bring it up in the conversation. By this point, you know if Janet is dumb as a rock, is a born-again Christian or spends her evenings kick-boxing. All of those may be a turn-off for you or they may be exactly what you have been looking for your whole life. You may find that you thought dating someone very religious was off the table but that Janet’s views on always being honest, the importance of family and the way she walks the walk by volunteering at a soup kitchen every weekend are pretty amazing. In short, you’ve gotten to know Janet as a person. Worst case scenario is you have made an acquaintance who you don’t want to date. However, that same person may have given you a new perspective, may refer you for a job some day or introduce you to someone you DO want to date. Women tend to have a lot more friends who are other women than men do. A second possible outcome is you and Janet become friends. She’s not your type, maybe she never was. She’s married, too old, too young, just not compatible. However, having a friend is a good thing and as a bonus, she may give you some good insight into starting a relationship with a woman.

Ronda trains two or three times a day and loses things at such a rate that it is almost a super-power. She has lost so many passports that I think the State Department has her on a watch list.  If hiking in the mountains for three hours is not your idea of a good time, the two of you are not going to get along. On the other hand, Jenn, lovely daughter number two above, had a minor in Film Studies and teaches history. If your idea of a good time is going to the gym at 5 a.m. on Sunday morning, she is probably going to quote some line from a movie I never saw to describe you. Whatever it means, it won’t be good. They are both massive Dr. Who fans, though. So, one day, when you are hanging out with your friend, Ronda, watching a Dr. Who marathon because she has trained for five hours that day and is too exhausted to move, you mention you really liked the Ken Burns Civil War documentary. She rolls her eyes at your boringness and then a light goes off, “You know, you really ought to meet my sister …”  and then you are off again at Step #1.

# Computing in the Cloud – Squared: Survival Analysis & SAS On-Demand

Giving a whole new meaning to “computing in the cloud”, I finished up my paper “A gentle introduction to survival analysis” for the Nevada SAS Users Group from 30,000 feet up using on-board wi-fi and SAS On-demand.  I was shocked to find that the performance was much better in-flight. Presumably, if you charge people \$12.95 for three hours to access your wireless system on the plane there are fewer people using it than if everyone can use it as long as they want for free, say, at a university. I believe this was some principle about supply and demand I learned in microeconomics.

In case you did not know, SAS On-Demand is the *FREE* (as in free puppy, although occasionally as in free beer) offering from SAS. It comes in three flavors, Enterprise Guide (which I am using), Enterprise Miner and JMP.

Let’s say I wanted to do a survival analysis in SAS Enterprise Guide. More specifically, let’s say I wanted to fit a proportional hazards regression model using PROC PHREG.

Step 1: Go to TASKS in the top menu, select SURVIVAL ANALYSIS and then PROPORTIONAL HAZARDS, as shown below.

Next, you need to specify your variables.

In this case,

• I drag the variable Survival Time in Days under  “Survival Time”
• I drag the variable named Status under “Censoring variable”
• Finally, I drag the variable named Rx under “Explanatory variables”

For this analysis, I only have one explanatory variable, so I am done. But nothing happens. I would click on the RUN button but SAS On-Demand won’t let me. It’s greyed out. If this happens to you, it’s one of those simple things to fix once you know how  – so, hey, you came to the right blog. Click on that variable under Censoring variable. A pane on the right will appear.

Click on the value that denotes censoring. In this case, the value of 0 means there was no event (the patient survived to the end of the study). If nothing shows up in the box at the bottom to check, never fear. You can always enter the value that denotes censoring in the box above that says Enter Custom Value and click ADD. Then, you can click on the RUN button.

To see the results, exactly as produced, you can click here for the pdf file.

What about code? We like code!

When your program runs, you get the pdf file as output, but if you look at the top of your screen, next to the output you can see three other tabs, for log, code and input data. Click on the Code button and you can see the code SAS Enterprise Guide created.
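I won’t paste the generated program here, since Enterprise Guide wraps everything in its own housekeeping statements, but the heart of it is a PROC PHREG step. The sketch below is my hand-written approximation, not the actual generated code. The data set name `work.trial` and the variable name `survtime` are assumptions; `status` and `rx` follow the variables dragged into the task above, with 0 as the censoring value.

```sas
/* Hand-written approximation, not the code Enterprise Guide generates */
/* work.trial and survtime are hypothetical names                      */
proc phreg data=work.trial;
  /* status(0): a status of 0 marks a censored observation */
  model survtime * status(0) = rx;
run;
```

The `status(0)` part is the code equivalent of clicking on the censoring value in the task pane, which is why the RUN button stays greyed out until you pick one.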

Maria Burns Ortiz, a sports writer for Fox News Latino and my darling daughter number one, has commented,

“You write a blog on statistics and other people read it? That must be nerd-squared.”

Well, now you have SAS computing in the cloud while in the clouds, so that must be cloud-squared. Does that make this blog nerd to the fourth?

# Survival analysis tip from Ogden Nash

I might as well give you my opinion of these two kinds of sin as long as, in a way, against each other we are pitting them,
And that is, don’t bother your head about the sins of commission because however sinful, they must at least be fun or else you wouldn’t be committing them.
It is the sin of omission, the second kind of sin,
That lays eggs under your skin.
The way you really get painfully bitten
Is by the insurance you haven’t taken out and the checks you haven’t added up the stubs of and the appointments you haven’t kept and the bills you haven’t paid and the letters you haven’t written.

I had to respond to the post by Mike Nemecek  and tweets by Rick Wicklin quoting Shakespeare with some culture of my own. Not having any degrees in liberal arts, though, the best I could do was this excerpt from the poem by Ogden Nash, Portrait of the Artist as a Prematurely Old Man.

Lately I’ve been playing around with PROC LIFETEST and it occurred to me that a way to get painfully bitten with this, and other survival analysis procedures, is not to think about some obvious facts. I’m assuming you are new to these procedures, or else in a big hurry, and you don’t scrutinize your output carefully. In that case, you may misinterpret the mean survival time.

The mean survival time is the mean length of time people / bacteria / rats survived, right? Not necessarily.

Many procedures – factor analysis and regression, say – automatically drop observations with missing data. Survival analysis procedures don’t work exactly the same way.

I am telling you this because it is a mistake I have seen people make who were familiar with other statistical procedures, and I can only presume in a hurry. Their solution to not knowing the length of survival time for some of their subjects was to drop those for whom the data are unknown.

Let’s try this with a data set I have lying around. I only use those observations for which I have complete data, that is, I know the survival time. It gives me the following information on survival time in days.

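| Mean | Standard Error |
|---|---|
| 360.934 | 22.183 |

Quartile Estimates

| Percent | Point Estimate | Transform | 95% Confidence Interval [Lower | Upper) |
|---|---|---|---|---|
| 75 | 532.00 | LOGLOG | 489.00 | 612.00 |
| 50 | 308.00 | LOGLOG | 244.00 | 393.00 |
| 25 | 167.00 | LOGLOG | 117.00 | 193.00 |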

Easy, right? The mean survival time is about 361 days. The median is 308 days.

However, this is only using those people for whom we have a survival time. What about the other people? When I include EVERYONE, whether they died or not, I get the following

| Mean | Standard Error |
|---|---|
| 431.466 | 22.506 |

So, is this the correct value then?  Are these the correct percentile points?

| Percent | Point Estimate | Transform | 95% Confidence Interval [Lower | Upper) |
|---|---|---|---|---|
| 75 | 652.00 | LOGLOG | 560.00 | 755.00 |
| 50 | 428.00 | LOGLOG | 341.00 | 512.00 |
| 25 | 192.00 | LOGLOG | 157.00 | 237.00 |

Well, not exactly. In fact, if you are using SAS, you will see this helpful note in your log:

`The mean survival time and its standard error were underestimated because the largest observation was censored and the estimation was restricted to the largest event time.`

In my sample data set, 25% of the observations were censored, that is, we don’t know when they died.
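The two runs compared above can be sketched as follows. The names are hypothetical: `work.trial` is the data set, `survtime` is survival time in days, and `status` is assumed to be 1 for a death and 0 for a censored observation.

```sas
/* Hypothetical names: work.trial, survtime, status (0 = censored) */

/* The mistake: throw away everyone whose survival time is censored */
proc lifetest data=work.trial method=km;
  where status = 1;
  time survtime * status(0);
run;

/* Keep everyone and tell the procedure which observations are censored */
proc lifetest data=work.trial method=km;
  time survtime * status(0);
run;
```

The second run is the one that produces the log note about the mean being underestimated, whenever the largest observed time happens to be censored.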

Can we then say that the median survival time really is 428 days? Okay, the mean is not correct because some of those 25% may have died years later. What about the median? The answer is, “It depends.”

Depends on what, you might ask. Well, it depends on why you don’t know when they died. If they dropped out of your study and you have no idea what happened to them, then I would say that you want to be a bit cautious in your interpretation of both the mean and the median survival times.

If everyone whose survival time you don’t know is censored because your study ended and they were still alive, then I would say, yes, the median survival time is accurate, assuming you had data on all of those people at the beginning and the end of the study.

Although survival analysis is generally thought of as predicting whether one survived in the sense of not dying, it is also used for other applications, such as predicting how long people last in a treatment program, without committing another crime, without drinking or using drugs. In those cases, when people are lost by dropping out of your study and out of sight, I would suspect it is very possible that at least some of those people began drinking, committing crimes or whatever prior to the end of your study. So, when people are censored due to having missing data, I would be a bit skeptical of both the median and the mean, and, as with all data problems, the larger percentage of your sample this affects, the more worried about it I would be.

An interesting point came up the other day when I was listening to a lecture. I’d assumed in animal studies that you would only have the problem of right-censored data due to the subjects living past the end of the study period. The lecturer mentioned that a couple of her subjects were censored because the rats had died of other causes unrelated to the disease under study. Not sure what occupational hazards a rat faces, but it’s getting pretty bad when you can’t even count on a rat these days.

Completely unrelated to statistics, software, or anything, really, my darling daughter number three is ranked in the top ten in the world in mixed martial arts.

She has been nominated for female fighter of the year, which is decided in a scientific, objective manner by how many votes the nominees get on the Internet. She said,

“Hey mom, you have a blog and I bet not too many people who read it have a favorite fighter  they’re going to vote for, so why don’t you ask them all to vote for me.”

You need to click on the very large VOTE NOW button, and give your name, email and a password. Ronda is under female fighters, which is category #12.

If you need some actual statistics, her current record is 6-0. The longest fight that she has had to date is about 57 seconds. It has taken her a total of less than three minutes to win six fights. You can watch one of her fights on youtube here.

In an odd coincidence, her next fight is in Las Vegas on November 18, a few hours after I finish speaking to the Nevada SAS Users Group on, of all things, survival analysis. It has been suggested to me that one of the examples I could use is the number of seconds that the opponent will last fighting Ronda. When I went to her last fight, I was having a martini and the man in front of me at the bar asked why I looked so nervous. I told him that my daughter was fighting. He said, “Ronda Rousey? The bookies are giving six to one odds on her. It’s three to one odds against her opponent even getting out of the first round.”

See? Statistics are everywhere.

If you’re in Las Vegas, the fight is at The Palms, after the SAS Users Group meeting, and it will also be shown on Showtime (the fight, not the users group meeting).

So, yeah, vote for Ronda.

She’s a good kid, she’s funny and she has a big-ass white dog. What more could you ask?

# LSMestimate statement – you’ll think it’s cool

Jon Peltier and I were going back and forth on twitter about why it is that people will post answers on a forum or mailing list that are completely incorrect. As Peter Flom says, “They are often in error but never in doubt.”

Jon suggested there are three types; those who don’t know and admit it, those who don’t know they don’t know and those who know they don’t know but won’t admit it.

I’m going to be in the first type today and admit several things I did not know.

I have always hated the idea of custom hypotheses using the ESTIMATE or CONTRAST statements because they are a pain in the ass to do, but sometimes they just make sense. Essentially, any time your interest is not in whether “any of these cell means are different from any other”, the standard test, but in whether one or more specific cell means are different from the others, then you want a CONTRAST or ESTIMATE statement.

The first cool thing about the LSMESTIMATE statement is that it is easier to write. Here (from the SAS documentation) is an example of ESTIMATE statements replaced by LSMESTIMATE:

```
estimate 'AB12' intercept 1
                a 1 0
                b 0 1 0
                a*b 0 1 0 0 0 0;
estimate 'avg ABij' intercept 6
                    a 3 3
                    b 2 2 2
                    a*b 1 1 1 1 1 1 / divisor=6;
estimate 'AB12 vs avg ABij' a 3 -3
                            b -2 4 -2
                            a*b -1 5 -1 -1 -1 -1 / divisor=6;
```

is replaced by

`lsmestimate a*b 'AB12 vs avg ABij' -1 5 -1 -1 -1 -1 / divisor=6;`

Not only is it a lot less trouble to write, and, I think, to interpret, but I have very bad vision. I wear nuclear-strength contacts to see past the end of my nose, put glasses on top of them to read, and enlarge everything on my screen 125% or more, so the odds of me typing something like the first several statements without making a mistake somewhere are very slim. It is very hard for me to tell if there is actually a space there or not, which is why I was very enthusiastic about the other syntax option for LSMESTIMATE.

Phil Gibbs, in a paper at the Denver SAS Users Group meeting, gave some very good examples of exactly when you would want to use custom hypotheses, for example if you thought one drug was more effective for disease A and a second drug was more effective for disease B, which seems a perfectly reasonable set of hypotheses, barring you are testing some incredible drug that cures all ills (according to my grandmother, one already exists and it is called rum).

Even better, he pointed out that you can use non-positional syntax where rather than listing all of the cells with zeroes for those you don’t want to contrast  you can just have the ones you are interested in, like this …

```
LSMESTIMATE drug "drug pair 1,2 vs drug pair 3,4" [ 1,1] [ 1,2] [-1,3] [-1,4] / divisor=2;
```
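To see that statement in context, here is a minimal sketch of a complete step. I am assuming a data set `work.drugtrial` with a four-level classification variable `drug` and a continuous outcome `response`; those names are mine, not from the paper, and PROC GLIMMIX is just one of the procedures that accept LSMESTIMATE.

```sas
/* Hypothetical data set and variable names */
proc glimmix data=work.drugtrial;
  class drug;
  model response = drug;
  /* Each bracket is [coefficient, level of drug]; levels you leave out get 0 */
  lsmestimate drug "drug pair 1,2 vs drug pair 3,4"
              [ 1, 1] [ 1, 2] [-1, 3] [-1, 4] / divisor=2;
run;
```

With the divisor of 2, the estimate is the average of the first two drug means minus the average of the last two.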

This isn’t all that new of a statement and I don’t know how I overlooked it when it came out. I was probably in a session where it was mentioned, didn’t have any use for it at the moment and just remembered how much I hated ESTIMATE and CONTRAST statements.

Most interesting to me was the fact that this paper was originally presented at PharmaSUG , which I have never attended because I haven’t done anything with pharmaceuticals in years (and no, I’m not referring to those parties in college).

Right before Phil spoke, Dr. Patrick Thornton had given a talk on ODS that he had presented at PharmaSUG. He mentioned that there is way more than just information of interest to the pharmaceutical industry there and you should check it out. Although he was not speaking to me personally, I did check it out and found that there really are a LOT of interesting papers presented there, and it is in San Francisco next year, right close to home, so I just might head up that way.

# R vs SAS/SPSS in Corporations: A view from the other side

I read Allen Engelhardt’s post this morning, on R vs SAS/SPSS in corporations and it motivated me to set aside my infinite to-do list and write about something I’ve been thinking for a long time.

Since Allen writes on R-bloggers, it will surprise no one that his conclusion was that R is preferable to SAS and the main obstacle to its use is the inertia and ignorance of executives and HR departments. What may surprise some people is that I agree with him that there may be cases where R is preferable, although not for the same reasons he gives, and that SAS Institute has some serious issues it needs to address, although looking at it from the side of someone who likes and uses SAS, I see different problems.

As someone who has used SAS daily for 29 years, I disagree with some of Mr. Engelhardt’s reasons both for and against SAS. I do agree, though, that there are some serious issues that, unless SAS Institute starts taking them seriously, may eventually end up with SAS going the way of WordPerfect or COBOL.

Engelhardt said that one reason R is not the choice for corporations is

“R takes talent to use. (That is kind of why we like it.) It takes talent to maintain. My problem as the manager of a commercial analytical insights team is that it is very hard for me to retain that talent.”

I quoted this so you would not think I made it up. I thought of incredibly brilliant people like Rick Wicklin, the author of Statistical Programming with SAS/IML software. The first paper I pulled up at random in my notes from SAS Global Forum was An Overview of Survival Analysis using Complex Sample Survey Data, by Dr. Patricia Berglund. I could add a vast number of examples of SAS users who are not talent-less hacks, but you get my point.

He’s incorrect in assuming most of the people who use SAS use the menu-driven SAS Enterprise Guide, Enterprise Miner, etc. I’ve been to many user group meetings and conferences where, when the audience is asked how many do, fewer than 10% of the people in the room raise their hands. (Non-random sample, I know.) But in 29 years in diverse organizations I have seen the same thing – the great majority of people who use SAS write code. Those who use it for very long write macros, create their own formats, and extend it with CSS, Perl, Python, IML and sometimes even R. Assuming R = talented, SAS = pointing, clicking drone is a bit over-simplistic.

With SPSS, I’ve seen the opposite, and I agree on that point. People who are SPSS users are hardly likely to abandon it for R – yet (see below for why they may). I was once speaking with a developer at SPSS about a problem and he asked me, as one of the standard questions, “Do you write syntax?” Then, because we had been talking for a while already, he caught himself and said, “Of course you do.” My point is that the assumption was that you did not use syntax, and, again, in my admittedly non-random sample over 25 years of using SPSS, that assumption has been increasingly borne out ever since menus became an option.

So, I disagree with his assumption that R people are just more talented (although that was popular with readers of R-bloggers) and I am not completely sold on his disadvantage that SAS costs corporations a lot of money. I think  Mr. Engelhardt over-estimated the ignorance of executives and under-estimated the cost of the vast body of legacy code out there.

As I have said before,

Re-writing everything to run on free software is only a good deal if your time has no value.

I think he under-emphasized an enormous COST for corporations: replacing legacy code. You’d need to re-write the code, re-write the documentation and re-train the employees. Anyone who has written much code, especially for a complex system, realizes that it will not work right out of the gate. For a while, you will be running two parallel systems. That’s expensive. You will need to keep all of your SAS people until you have your new system up and running. Will you have those people learn R? As Engelhardt notes, there is a difference between reading an introductory book and being an expert. Will you hire new people with years of experience with R? Then what will you do with your SAS people? Fire them all? I presume they have other knowledge of statistics, your industry, etc. that you might want. Will you just take the SAS code and re-write it in R? As anyone who has worked in corporations on large systems will guarantee, a lot of that code “grew like Topsy”. It can be improved because you probably have patches on top of patches. What do you say to your manager when your R code has a bug and quits running? (This happens to everyone, but remember, you are replacing a system that was running with a new one that, though made with free software and better in many ways, is not running.) Also, does that mean your people who are writing the R code are going to be well-versed in SAS, too? Or are you going to have one of those talent-less SAS people you are going to fire sit next to you and tell you what each piece does?

I said this before, but who is going to write the documentation of everything the program does and how to maintain it for when your talented R person leaves?

So why should SAS (and SPSS) be worried about R?

First of all, for those people and organizations that do NOT have legacy code, the major barrier I just talked about is removed. If you are a new company, you don’t have any legacy. There is no cost of re-writing, re-documenting anything. If you are a student, your time doesn’t have any value to anyone but you. This is why R is so popular among students, and this should make SAS very, very worried. Yes, lots of students hate R, but lots of them hate SAS, too (more about that in a minute).

A few days ago, I was at a SAS USERS GROUP MEETING and three people sitting around me were discussing using R to teach students. One person said that the students would hate it because it was too difficult, while a second professor countered that he had used RStudio and it was not that difficult. The third chimed in that he had used it in graduate school. Again, this is not a random sample but rather one that should be biased toward SAS. These are people who are interested enough in SAS to attend users group meetings, and yet they were discussing the benefits of switching to R. One had already done it, a second was at least considering it, though unconvinced, and the third saw no problem with it.

I am using SAS On-Demand for the statistics course I am teaching. Here is what I did:

1. Tested everything myself and registered a month before class.
2. Made a PowerPoint showing, step by step, how to get the software.
3. Made a MOVIE of how to get and install the software that students can watch to review the steps
4. Demonstrated in class how to get a SAS profile,  register for the course and download and install the software.

Obviously I did this because I believed learning SAS would benefit my students, but it took quite a bit of time I would not have had when I was an assistant professor trying to get tenure.

As it is, about half of my students have been able to use SAS On-Demand. Why? Mostly because it doesn’t run on a Mac (more on that later). Those who had Windows were able to get it to run by the third week of class. One student, however, could not get it to run. I tried uninstalling and re-installing it. Still didn’t run. In the end, he received this message from SAS Technical Support, who were no doubt correct:

It sounds like you may have a registry key that is acting up.  Lets try the following:

3. Close all applications including anti-virus software (even if it is just running in the background).
4. Go to the system registry by clicking on Start>Run and type:
regedit.
5. Examine the following Windows registry key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager

If it contains FileRenameOperations or PendingFileRenameOperations, delete this key, and retry your SAS installation.
Warning: Always back up your registry before you make any registry changes. For assistance, see Windows Help, Microsoft documentation, or the Microsoft Windows Web site. SAS is not responsible when you edit the Windows registry: changes in the Windows registry can render your system unusable and will require that you reinstall the operating system.
After removing these keys, continue on with the installation.

I am not faulting SAS Technical Support. They are probably right, this was probably the problem and it probably would have worked. I have done similar things getting Enterprise Miner to work on a computer once and it did work. The problem is that when you send this to a student who is just trying to pass a statistics class, and Advanced Quantitative Data Analysis is not a fluff course to begin with, their response is going to be, and I believe this is a direct quote, “Fuck it!”

The student asked if he could use a different software package he had used as an undergraduate and I said sure, go ahead.

This type of problem does not occur often – this was one out of 10 or 11 students who tried to install SAS – but when it does, this student becomes like the SAS Administrator I mentioned above. They both hate SAS. This cannot be good.

After the problem of installation, the biggest problem SAS has is that it does not run natively on a Mac, and SAS On-Demand doesn’t run on virtual machines, either.

Of the 17 students in my class, 7 or 8 have Macs. When I required SAS On-Demand, I found that it did not run on a virtual machine, so I had to partition my hard drive, install Boot Camp, buy a copy of Windows 7 and install that. Since I am using this for a class, I was able to get Windows 7 for under \$50, so it was not a big deal for me. Still, my “free” version of SAS has now cost me \$50, which is as much as or more than many student licenses for statistical software. Also, there is the time part. I like playing with computers, so installing Boot Camp and partitioning the hard drive was pretty effortless (your mileage may vary), and downloading and installing SAS On-Demand took very little time with the very, very good connection we have in our office.

I have taught statistics at three private universities in California in the last several years (again, a non-random sample). At one, 25% of the students had a Mac. At the other two, it was closer to 50%. According to tech support, this was what they saw campuswide. Perhaps if you can afford \$30K and up for tuition you buy more expensive computers. This was also something the folks at the SAS user group mentioned about R – you know it runs on Mac and Unix, too.

A few of the students did what I did, installed Boot Camp, installed SAS On-Demand, and it worked fine. The only problem now is that much of your other software, such as PowerPoint and Word, is probably on the Mac side. You can do what I do and install OpenOffice, which I really like, but now you are taking more time to install Boot Camp and install OpenOffice – so the time advantage of using SAS over R is starting to disappear.

The final problem – the free cloud-based service, SAS On-Demand, is pathetically slow. I’m holding out hope for that one, though, because it has improved so much from a year ago, when it was just useless. Useless to usable and decent but slow is a pretty big leap.

Why I Recommend SAS Anyway – for now

First of all, amazing technical support. Engelhardt just brushes this aside, but SAS tech support is AMAZING. See the answer above. If I really wanted to get SAS working, and I was that student, I’ll bet it would work. I called them the other day because a client needed the equations used to calculate power in PROC POWER because her dissertation committee required it (no, I very seldom have clients who are students because they can’t afford our fees, but this was a special case). I got transferred to the right person and got an answer in 10 minutes. Or you can read about the amazing Tom from SAS Technical Support (not to be confused with the creepy Tom from MySpace). See the post “I see smart people” for more details on both problems with SAS installation and the amazingness of technical support.

Compare this to SPSS where I have sat on hold for 45 minutes, as the norm. (This was before they were bought by IBM, it may be better now.)

Second, SAS has a huge user group base. Their user groups are amazing. I know R has meet-ups and meetings that are becoming more common around the country. From what I have seen, though, the SAS user groups are growing in size and activity as well. Orange County is starting a new user group, the one in San Diego meets quarterly, LA has annual meetings and we were discussing at WUSS possibly making this semi-annual. There is SAS-L and its archives, which are a fountain of information, and the growing SAScommunity.org. Did I mention their user groups are amazing? They have regional user group meetings, PharmaSUG and SAS Global Forum, which is amazing cubed. All of the regional user groups offer student and junior professional scholarships, including travel, to allow people starting their career to attend for free, learn and network.

Third, SAS does EVERYTHING. This might be why it takes the sacrifice of a flamingo to get it to install sometimes, but once installed it can be used for anything. More than once, when someone has had a problem computing a statistic, I’ve heard someone sniff, “Well, you could do that in R.” Believe me, whether it is reporting with columns in alternating chartreuse and magenta, running a nightly analysis of your data that is uploaded to the web at 2 a.m. or analyzing a complex national survey, SAS does it.

Because SAS does everything, including being great for analyzing huge and complex data sets, really great statistical graphics, maps, every flavor of report and every type of statistic, there are jobs out there in those corporations now. That is the main reason I chose it for my students. Many of them are mid-career professionals getting a Ph.D. and there will be SAS jobs available when they graduate and for the 10 or 15 years remaining until they retire.

For younger students, and down the road, I think unless SAS Institute can get SAS On-demand working and fix its installation fiasco, there are going to be some serious problems. That makes me sad because I think SAS On-Demand could be insanely great and SAS Institute is completely missing it. This must be how Steve Jobs and Steve Wozniak felt when they saw the first GUI interface and mouse at Xerox.

Dudes! This could be insanely great! Don’t you see that?

Apparently, they don’t. If nothing else, they should license it to some start-up that will realize that potential. If you are interested in that, holler and I’ll holler back.

More on that later, this post is already thousands of words longer than I meant to write today, I have a paper to write, need to price a contract and the rocket scientist is asking why we live by the beach in Santa Monica if I won’t walk down and have a drink with him while overlooking the ocean. Having no answer for that, I’m heading out for Chardonnay.

# Random SAS tips from Colorado

Filed Under Software | 3 Comments

Two things, first of all.

1. Those folks in Colorado are unbelievable studs to come out in inches of snow and more falling to a full house at the Denver & northern Colorado / Wyoming SAS users group meeting.
2. I’ll bet the people in North Dakota are laughing their asses off right now thinking, “If that little bit of snow bothers her, wait until she comes up here next month.” I can hardly wait.

As with any time I go to a user group meeting, I learned some things and was reminded of other things I had forgotten. In random order of coolness:

Robert Gately showed some interesting uses of logistic regression, but in the process brought up some little SAS tricks I had not thought of in a long time. For example:

You can use a colon (:) as a wildcard character in variable names. For example,

ARRAY survey {*}  Q:  ;

creates an array containing all of the variables whose names start with Q.
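Here is a minimal sketch of how that array might be used, in this case to count how many survey items each respondent answered. The data set names `survey_raw` and `survey_scored` and the variable `n_answered` are hypothetical. It also assumes every variable starting with Q is numeric, since all elements of a single array must be the same type.

```sas
data survey_scored;
    set survey_raw;          /* hypothetical input; SET comes first so the Q variables exist */
    array survey {*} Q: ;    /* every variable defined so far whose name starts with Q */

    n_answered = 0;
    do i = 1 to dim(survey); /* DIM returns however many variables the prefix matched */
        if not missing(survey{i}) then n_answered = n_answered + 1;
    end;
    drop i;
run;
```

The `Q:` prefix list only picks up variables that are already defined at the point where the ARRAY statement appears, which is why it follows the SET statement here.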

Haping Luo wrote a whole paper on the use of the colon in SAS. And I read it, which according to my darling daughters, is nerd squared.

Steve Anderson had an interesting use for ridge regression to handle multicollinearity. Along with that, he had a formula that he used to get the best estimate of the k factor. He showed an application of it and it seemed to work pretty well. Unfortunately, I wasn’t fast enough to write it down but he promised to send me his paper.

Denise Poll from SAS discussed SAS options – did you know that there are around 1,500 SAS options? Pretty amazing, huh? Oddly, she did not discuss any options I hadn’t heard of before in her paper (she obviously only had time to discuss a limited number). However, she did mention a function I had never heard of – the getoption function, which will tell you a lot of information about any option you specify, including the default setting, current setting and even help rat out who changed it.
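As a rough sketch of what that looks like in practice, the code below queries the LINESIZE option, which I picked arbitrarily. This is my own illustration, not from Denise's paper, and it assumes a SAS release recent enough (9.3 or later, as far as I know) to support the `DEFAULTVALUE` and `HOWSET` return-value arguments.

```sas
/* In open code, via %SYSFUNC: the current value of LINESIZE */
%put Current LINESIZE: %sysfunc(getoption(linesize));

/* KEYWORD returns it in OPTION=value form, handy for restoring it later */
%put As keyword: %sysfunc(getoption(linesize, keyword));

/* The same function inside a DATA step */
data _null_;
    length current default_value how_set $ 200;
    current       = getoption('linesize');
    default_value = getoption('linesize', 'defaultvalue'); /* assumes SAS 9.3 or later */
    how_set       = getoption('linesize', 'howset');       /* assumes SAS 9.3 or later */
    put current= / default_value= / how_set=;
run;

/* PROC OPTIONS writes similar details about one option to the log */
proc options option=linesize define value;
run;
```

If your release does not recognize those extra arguments, the PROC OPTIONS step at the end is another way to see how an option's value was set.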

Despite aspersions cast by my enemies, the SAS VERBOSE option was not named after me, although I have used it from time to time. It is actually a useful diagnostic tool sometimes to see the settings of your SAS system options.

I also rambled on some about categorical data analysis but I didn’t learn anything from that because I already knew it (you thought I just made this shit up, didn’t you? Au contraire). On the other hand, I did learn something from a question someone asked about computing a kappa for more than two raters. I told her I did not know that PROC FREQ did it but I was sure at a minimum there was a macro out there. When I got back to somewhere with Internet access, it took me five seconds to find out that yes, there is such a macro by Bin Chen et al. from Westat. You can also do it with JMP 9, according to their documentation. Look up categorical kappa.

And that was just in the morning.

There were some really good speakers in the afternoon. I would like to write about the cool stuff about ODS, Report (okay, I don’t really use report but it is cool if you like that sort of thing) and LSMestimate.

I’d also like to talk about the wrong direction I think SAS is taking in neglecting the Mac market and the complete missed market opportunity with SAS on demand. However, it will have to wait because I just got home, it’s midnight and there is a bubble bath and a glass of wine calling my name.  Also, the recording of The Daily Show is playing, with Pat Robertson talking about having sex with ducks (I’m not kidding). If I was a more timid soul, I might be thrown off balance by talking alcoholic beverages and bath water. That doesn’t scare me. Robertson, though, is a little creepy.