R vs SAS/SPSS in Corporations: A view from the other side
October 29, 2011 | 22 Comments
I read Allen Engelhardt’s post this morning on R vs SAS/SPSS in corporations, and it motivated me to set aside my infinite to-do list and write about something I’ve been thinking about for a long time.
Since Allen writes on R-bloggers, it will surprise no one that his conclusion was that R is preferable to SAS and the main obstacle to its use is the inertia and ignorance of executives and HR departments. What may surprise some people is that I agree with him that there may be cases where R is preferable, although not for the same reasons he gives, and that SAS Institute has some serious issues it needs to address, although looking at it from the side of someone who likes and uses SAS, I see different problems.
As someone who has used SAS daily for 29 years, I disagree with some of Mr. Engelhardt’s reasons both for and against SAS. I do agree, though, that there are some serious issues that, unless SAS Institute starts taking them seriously, may eventually end with SAS going the way of WordPerfect or COBOL.
Engelhardt said that one reason R is not the choice for corporations is
“R takes talent to use. (That is kind of why we like it.) It takes talent to maintain. My problem as the manager of a commercial analytical insights team is that it is very hard for me to retain that talent.”
I quoted this so you would not think I made it up. I thought of incredibly brilliant people like Rick Wicklin, the author of Statistical Programming with SAS/IML software. The first paper I pulled up at random in my notes from SAS Global Forum was An Overview of Survival Analysis using Complex Sample Survey Data, by Dr. Patricia Berglund. I could add a vast number of examples of SAS users who are not talent-less hacks, but you get my point.
He’s incorrect in assuming most of the people who use SAS use the menu-driven SAS Enterprise Guide, Enterprise Miner, etc. I’ve been to many user group meetings and conferences where, when the audience is asked how many do, less than 10% of the room raise their hands. That is a non-random sample, I know, but in 29 years in diverse organizations I see the same thing – the great majority of people who use SAS write code. Those who use it for very long write macros, create their own formats, extend it with CSS, Perl, Python, IML and sometimes even R. Assuming R = talented, SAS = pointing, clicking drone is a bit over-simplistic.
With SPSS, I’ve seen the opposite, and I agree on that point. People who are SPSS users are hardly likely to abandon it for R – yet (see below for why they may). I was once speaking with a developer at SPSS about a problem and he asked me, as one of the standard questions, “Do you write syntax?” Then, because we had been talking for a while already, he caught himself and said, “Of course you do.” My point is that the assumption was that you did not use syntax, and, again, in my admittedly non-random sample over 25 years of using SPSS, that assumption has been increasingly borne out ever since menus became an option.
So, I disagree with his assumption that R people are just more talented (although that was popular with readers of R-bloggers) and I am not completely sold on his disadvantage that SAS costs corporations a lot of money. I think Mr. Engelhardt over-estimated the ignorance of executives and under-estimated the cost of the vast body of legacy code out there.
As I have said before,
Re-writing everything to run on free software is only a good deal if your time has no value.
I think he under-emphasized, for corporations, the enormous COST of replacing legacy code. You’d need to re-write the code, re-write the documentation and re-train the employees. Anyone who has written much code, especially for a complex system, realizes that it will not work right out of the gate. For a while, you will be running two parallel systems. That’s expensive. You will need to keep all of your SAS people until you have your new system up and running. Will you have those people learn R? As Engelhardt notes, there is a difference between reading an introductory book and being an expert. Will you hire new people with years of experience with R? Then what will you do with your SAS people? Fire them all? I presume they have other knowledge of statistics, your industry, etc. that you might want. Will you just take the SAS code and re-write it in R? As anyone who has worked in corporations on large systems will guarantee, a lot of that code “Grew like Topsy”. It can be improved because you probably have patches on top of patches. What do you say to your manager when your R code has a bug and quits running? (This happens to everyone, but remember, you are replacing a system that was running with a new one that, although made with free software and better in many ways, is not running.) Also, does that mean your people who are writing the R code are going to be well-versed in SAS, too? Or are you going to have one of those talent-less SAS people you are going to fire sit next to you and tell you what each piece does?
I said this before, but who is going to write the documentation of everything the program does and how to maintain it, for when your talented R person leaves?
So why should SAS (and SPSS) be worried about R?
First of all, for those people and organizations that do NOT have legacy code, the major barrier I just talked about is removed. If you are a new company, you don’t have any legacy. There is no cost of re-writing, re-documenting anything. If you are a student, your time doesn’t have any value to anyone but you. This is why R is so popular among students, and this should make SAS very, very worried. Yes, lots of students hate R, but lots of them hate SAS, too (more about that in a minute).
A few days ago, I was at a SAS USERS GROUP MEETING and three people sitting around me were discussing using R to teach students. One person said that the students would hate it because it was too difficult, whereas a second professor countered that he had used RStudio and it was not that difficult. The third chimed in that he had used it in graduate school. Again, this is not a random sample but rather one that should be biased toward SAS. These are people who are interested in SAS enough to attend users group meetings, and yet they were discussing the benefits of switching to R. One had already done it, a second was at least considering it, though unconvinced, and the third saw no problem with it.
A major reason that people, especially in academia, consider switching to R – or to a piece of slate and a sharp rock, for that matter – is that the SAS installation process BLOWS. If you have never had to install SAS, let me just tell you that it is bad beyond imagination and has been for thirty years. I remember in graduate school using SAS 5 how every time we had to renew our license and I had to get things working again the SPSS people in sociology would laugh at me. It has only gotten worse. A month ago, I was having lunch with the SAS administrator at a large university and she told me she hated SAS. She tells people to switch to JMP or SPSS every chance she gets. I asked about SAS On-demand and she said that almost every single person had a problem installing it. At one point, I was the SAS administrator for a large university and about 10% of the people had trouble installing SAS. These are not stupid, lazy people. They’re faculty and researchers at a prestigious institution.
I am using SAS On-demand for the statistics course I am teaching. Here is what I did:
- Tested everything myself and registered a month before class.
- Made a PowerPoint showing, step by step, how to get the software.
- Made a MOVIE of how to get and install the software that students can watch to review the steps
- Demonstrated in class how to get a SAS profile, register for the course and download and install the software.
Obviously I did this because I believed learning SAS would benefit my students, but it took quite a bit of time I would not have had when I was an assistant professor trying to get tenure.
As it is, about half of my students have been able to use SAS On-demand. Why? Mostly because it doesn’t run on a Mac (more on that later). Those who had Windows were able to get it to run by the third week of class. One student, however, could not get it to run. I tried uninstalling and re-installing it. Still didn’t run. In the end, he received this message from SAS Technical Support, who were no doubt correct:
It sounds like you may have a registry key that is acting up. Lets try the following:
1. Reboot your system.
2. Log in as the Administrator.
3. Close all applications including anti-virus software (even if it is just running in the background).
4. Go to the system registry by clicking on Start>Run and type:
regedit.
5. Examine the following Windows registry key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager
If it contains FileRenameOperations or PendingFileRenameOperations, delete this key, and retry your SAS installation.
Warning: Always back up your registry before you make any registry changes. For assistance, see Windows Help, Microsoft documentation, or the Microsoft Windows Web site. SAS is not responsible when you edit the Windows registry: changes in the Windows registry can render your system unusable and will require that you reinstall the operating system.
After removing these keys, continue on with the installation.
I am not faulting SAS Technical Support. They are probably right, this was probably the problem and it probably would have worked. I have done similar things getting Enterprise Miner to work on a computer once and it did work. The problem is that when you send this to a student who is just trying to pass a statistics class, and Advanced Quantitative Data Analysis is not a fluff course to begin with, their response is going to be, and I believe this is a direct quote, “Fuck it!”
The student asked if he could use a different software package he had used as an undergraduate and I said sure, go ahead.
This type of problem does not occur often – this was one out of 10 or 11 students who tried to install SAS – but when it does, this student becomes like the SAS Administrator I mentioned above. They both hate SAS. This cannot be good.
After the problem of installation, the biggest problem SAS has is that it does not run natively on a Mac, and SAS On-Demand doesn’t run on virtual machines, either.
Of the 17 students in my class, 7 or 8 have Macs. When I required SAS On-Demand, I found that it did not run on a virtual machine, so I had to partition my hard drive, install Boot Camp, buy a copy of Windows 7 and install that. Since I am using this for a class, I was able to get Windows 7 for under $50, so it was not a big deal for me, but my “free” version of SAS has now cost me $50, which is as much as or more than many student licenses for statistical software. Also, there is the time part. I like playing with computers, so installing Boot Camp and partitioning the hard drive was pretty effortless (your mileage may vary), and downloading and installing SAS On-Demand took very little time with the very, very good connection we have in our office.
I have taught statistics at three private universities in California in the last several years (again, a non-random sample). At one, 25% of the students had a Mac. At the other two, it was closer to 50%. According to tech support, this was what they saw campuswide. Perhaps if you can afford $30K and up for tuition you buy more expensive computers. This was also something the folks at the SAS user group mentioned about R – you know it runs on Mac and Unix, too.
A few of the students did what I did: installed Boot Camp, installed SAS On-demand, and it worked fine. The only problem now is that much of your other software, like PowerPoint and Word, is probably on the Mac side. You can do what I do and install OpenOffice, which I really like, but now you are taking more time to install Boot Camp and install OpenOffice – so the time advantage of using SAS over R is starting to disappear.
The final problem – the free cloud-based service, SAS On-Demand, is pathetically slow. I’m holding out hope for that one, though, because it has improved so much from a year ago when it was just useless. Useless to usable and decent but slow is a pretty big leap.
Why I Recommend SAS Anyway – for now
There are advantages, too.
First of all, amazing technical support. Engelhardt just brushes this aside, but SAS tech support is AMAZING. See the answer above. If I really wanted to get SAS working, and I was that student, I’ll bet it would work. I called them the other day because a client needed the equations used to calculate power in PROC POWER because her dissertation committee required it (no, I very seldom have clients who are students because they can’t afford our fees, but this was a special case). I got transferred to the right person and got an answer in 10 minutes. Or you can read about the amazing Tom from SAS Technical Support (not to be confused with the creepy Tom from MySpace). See the post “I see smart people” for more details on both the problems with SAS installation and the amazingness of technical support.
Compare this to SPSS where I have sat on hold for 45 minutes, as the norm. (This was before they were bought by IBM, it may be better now.)
Second, SAS has a huge user group base. Their user groups are amazing. I know R has meet-ups and meetings that are becoming more common around the country. From what I have seen, though, the SAS user groups are growing in size and activity as well. Orange County is starting a new user group, the one in San Diego meets quarterly, LA has annual meetings and we were discussing at WUSS possibly making this semi-annual. There is SAS-L and its archives, which are a fountain of information, and the growing SAScommunity.org. Did I mention their user groups are amazing? They have regional user group meetings, PharmaSUG and SAS Global Forum, which is amazing cubed. All of the regional user groups offer student and junior professional scholarships, including travel, to allow people starting their career to attend for free, learn and network.
Third, SAS does EVERYTHING. This might be why it takes the sacrifice of a flamingo to get it to install sometimes, but once installed it can be used for anything. More than once, when someone has had a problem computing a statistic, I’ve heard someone sniff, “Well you could do that in R”. Believe me, whether it is reporting with columns in alternating chartreuse and magenta, running a nightly analysis of your data that is uploaded to the web at 2 a.m. or analyzing a complex national survey, SAS does it.
Because SAS does everything, including being great for analyzing huge and complex data sets, really great statistical graphics, maps, every flavor of report and every type of statistic, there are jobs out there in those corporations now. That is the main reason I chose it for my students. Many of them are mid-career professionals getting a Ph.D. and there will be SAS jobs available when they graduate and for the 10 or 15 years remaining until they retire.
For younger students, and down the road, I think unless SAS Institute can get SAS On-demand working and fix its installation fiasco, there are going to be some serious problems. That makes me sad because I think SAS On-Demand could be insanely great and SAS Institute is completely missing it. This must be how Steve Jobs and Steve Wozniak felt when they saw the first GUI and mouse at Xerox.
Dudes! This could be insanely great! Don’t you see that?
Apparently, they don’t. If nothing else, they should license it to some start-up that will realize that potential. If you are interested in that, holler and I’ll holler back.
More on that later. This post is already thousands of words longer than I meant to write today, I have a paper to write, I need to price a contract and the rocket scientist is asking why we live by the beach in Santa Monica if I won’t walk down and have a drink with him while overlooking the ocean. Having no answer for that, I’m heading out for Chardonnay.
Random SAS tips from Colorado
October 27, 2011 | 3 Comments
Two things, first of all.
- Those folks in Colorado are unbelievable studs to come out in inches of snow and more falling to a full house at the Denver & northern Colorado / Wyoming SAS users group meeting.
- I’ll bet the people in North Dakota are laughing their asses off right now thinking, “If that little bit of snow bothers her, wait until she comes up here next month.” I can hardly wait.
As with any time I go to a user group meeting, I learned some things and was reminded of other things I had forgotten. In random order of coolness:
Robert Gately – showed some interesting uses of logistic regression, but in the process brought up some little SAS tricks I had not thought of in a long time. For example:
Using a : as a wildcard character for variable names, as in
ARRAY survey {*} Q: ;
will create an array with all of the variables starting with Q.
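As a minimal sketch of why this is handy, the step below cleans up every Q variable in one loop. The data set names survey_raw and survey_clean, and the use of 9 as a "no answer" code, are hypothetical and made up purely for illustration. It also assumes all of the Q variables are numeric.

```sas
/* Hypothetical example: survey_raw, survey_clean and the code 9 are assumptions */
DATA survey_clean ;
  SET survey_raw ;
  /* Q: only picks up variables that already exist, so SET must come first */
  ARRAY survey {*} Q: ;
  DO i = 1 TO DIM(survey) ;
    /* Recode the assumed "no answer" value to missing */
    IF survey{i} = 9 THEN survey{i} = . ;
  END ;
  DROP i ;
RUN ;
```

The point of the colon is that the list keeps working when someone adds Q51 next year, with no variable names to retype.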
Haping Luo wrote a whole paper on the use of the colon in SAS. And I read it, which according to my darling daughters, is nerd squared.
Steve Anderson had an interesting use for ridge regression to handle multicollinearity. Along with that, he had a formula that he used to get the best estimate of the k factor. He showed an application of it and it seemed to work pretty well. Unfortunately, I wasn’t fast enough to write it down but he promised to send me his paper.
Denise Poll from SAS discussed SAS options – did you know that there are around 1,500 SAS options? Pretty amazing, huh? Oddly, she did not discuss any options I hadn’t heard of before in her paper (she obviously only had time to discuss a limited number). However, she did mention a function I had never heard of – the GETOPTION function, which will tell you a lot of information about any option you specify, including the default setting and the current setting, and even helps rat out who changed it.
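For anyone who wants to poke at this, here is a minimal sketch of two ways to look up an option. LINESIZE is only an example option picked for illustration, and this is not taken from the paper itself.

```sas
/* Minimal sketch: LINESIZE is just an example option */
/* Write the current setting to the log in keyword=value form */
%PUT %SYSFUNC(GETOPTION(LINESIZE, KEYWORD)) ;

/* DEFINE describes the option; VALUE reports its current value and how it was set */
PROC OPTIONS OPTION = LINESIZE DEFINE VALUE ;
RUN ;
```

Both write to the log, so nothing in your session changes when you run them.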
Despite aspersions cast by my enemies, the SAS VERBOSE option was not named after me, although I have used it from time to time. It is actually a useful diagnostic tool sometimes to see the settings of your SAS system options.
I also rambled on some about categorical data analysis but I didn’t learn anything from that because I already knew it (you thought I just made this shit up, didn’t you? Au contraire). On the other hand, I did learn something from a question someone asked about computing a kappa for more than two raters. I told her I did not know whether PROC FREQ did it, but I was sure that at a minimum there was a macro out there. When I got back to somewhere with Internet access, it took me five seconds to find out that yes, there is such a macro by Bin Chen et al. from Westat. You can also do it with JMP 9, according to their documentation. Look up categorical kappa.
And that was just in the morning.
There were some really good speakers in the afternoon. I would like to write about the cool stuff about ODS, Report (okay, I don’t really use report but it is cool if you like that sort of thing) and LSMESTIMATE.
I’d also like to talk about the wrong direction I think SAS is taking in neglecting the Mac market and the complete missed market opportunity with SAS on demand. However, it will have to wait because I just got home, it’s midnight and there is a bubble bath and a glass of wine calling my name. Also, the recording of The Daily Show is playing, with Pat Robertson talking about having sex with ducks (I’m not kidding). If I was a more timid soul, I might be thrown off balance by talking alcoholic beverages and bath water. That doesn’t scare me. Robertson, though, is a little creepy.
Open Data, SAS On-Demand & African-American Women
October 25, 2011 | 3 Comments
Let me just say off the bat that open data is awesome and there should be more of it available. This semester, I have been using SAS On-Demand in my statistics class and creating the data sets to meet students’ interests. Despite the aspersions I read on Twitter that some statisticians know no more than what PROC to use to get a p-value, it is, unfortunately, not all that easy.
I was going to write about adjusted survival curves and log log curves with PHREG tonight but it is already past 1 a.m. and both my time and Chardonnay are exhausted creating analytic data sets for my students.
I did hear back from the helpful folks at the National Center for Education Statistics (thank you!) and downloaded the School Survey on Crime and Safety for a group of students interested in bullying. Awesome public use data set. Check it out!
After that, I had another group of students interested in testing the hypothesis that African-American women are less likely to get married the more education they have. Conveniently, I had the American Community Survey data for California on my desktop from some analyses I had done earlier, so I pulled out the subset of people they were interested in, which is native-born African-American women over 15 years of age. (Actually, the picture is my cousin who has never, as far as I know, been to America, but hey, Ashelle, if you’re reading this, come and visit. It’s nice here.)
I downloaded the data, created a few new variables to fit the students’ interest and emailed them the file and documentation. For example, they wanted to break education down into categories, thinking, rightly, I believe, that getting a high school diploma or college diploma is a better way of categorizing education than by years; it’s not a linear relationship with most other variables.
I did run some of the analyses myself because I was curious, and I will say that the preliminary results are very, very interesting. I am looking forward to their presentation.
So, that is the plus of open data – real data, real experience and questions the students really want to answer.
The minus – well, it took me a lot of time to locate and download the data. The data set for the study on African-American women I had on my computer, but the one on school crime I had to track down and it still wasn’t exactly what they originally planned – although it ended up working perfectly.
A second minus is that SAS On-demand is SLOW. It is several times better than it was originally. When first released it was so slow as to be useless. Now, based on, I don’t know what – sunspots – there are times it works perfectly, just a tiny bit slower than SAS on my desktop, and other times when it is really tedious. I’m sticking with it this semester because it is a) free, b) used in lots of organizations where my students may work one day and c) showing the potential to be really useful.
A third minus is that one of the students has not been able to get it to install and run, for reasons I cannot figure out. I referred him to SAS Tech support today.
If the professor teaching a statistics, research methods or data mining course did not have a lot of SAS programming experience, I think using SAS on-demand would be a challenge.
So – why bother? I think it comes back to the study one group is doing on African-American women and marriage, another group is doing on bullying in school, and a third group is doing on the relationship between arts education and academic achievement using the National Education Longitudinal Study.
Years ago, when my daughter, Maria Burns Ortiz, was a little girl, I asked her how science class was at her new school. We had recently moved and she had gone from a magnet school for gifted children to a regular parochial school. She said, “We don’t have science.”
I corrected her, “Mija, you must have science. You got an A in it on your progress report.”
She said, “No, we don’t have science at this school. We just read about it.”
So, that is why I am putting together data sets at 1 a.m. My students have statistics, they don’t just read about it.
Survivor Functions, Hazard Functions and Pictures
October 21, 2011 | Leave a Comment
Unfamiliar jargon like Kaplan-Meier curves, PROC PHREG, right-censored and hazard functions can be daunting to the newcomer. Survival analysis is really quite straightforward; it is simply a set of statistical techniques used when the focus is “time to event”. The event can be death, divorce, arrest, substance abuse or literally anything else. You’ve been wanting survival analysis if you have ever asked any of these questions:
- “How long can the average person with X be expected to survive before the event occurs?”
- “Given that a person has made it up to this point, how much longer can she be expected to survive?”
- “Are chances of survival higher for people in group A or group B?”
So, I was thinking today, why not just show pictures to calm people down and begin introducing concepts in survival analysis. Everybody likes pictures, right?
Let’s imagine a picture of a survivor function. It looks like this. In the beginning we are all alive and in the long-run we are all dead.
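To get this curve, I used the following SAS code:
/* Turn on ODS statistical graphics so the survival plots are produced */
ods graphics on ;
LIBNAME in "C:\Users\AnnMaria\Documents\survive" ;
/* Kaplan-Meier estimates for clinic 1 only, where status = 0 marks a censored case */
PROC LIFETEST DATA = in.addicts METHOD = KM PLOTS = ALL ;
TIME survt*status(0) ;
WHERE clinic = 1;
RUN ;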
Now let’s compare some survival curves from actual data. These are from the addicts data set in David Kleinbaum’s book on survival analysis. Here we have two clinics and we want to know whether one clinic has a higher probability of patients surviving the treatment than the other. In this case survival means not dropping out of the treatment.
If you look at the curves above, you’ll see that Clinic 2 clearly had a better survival rate than did Clinic 1. That’s nice and all but what about the possibility that Clinic 1 had “tougher customers” ? Is Clinic 2 still superior when controlling for possible pre-existing differences?
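The post does not show the code behind the two-clinic picture, so treat the following as one plausible way to get it rather than exactly what I ran. It is a minimal sketch that assumes the same in.addicts data set and the same survt, status and clinic variables as the code earlier in this post, with the LIBNAME statement already submitted.

```sas
ODS GRAPHICS ON ;
/* Sketch only: assumes the in.addicts data set and variables used earlier */
PROC LIFETEST DATA = in.addicts METHOD = KM PLOTS = SURVIVAL ;
  TIME survt*status(0) ;
  /* One survival curve per clinic, plus tests of whether the curves are equal */
  STRATA clinic ;
RUN ;
```

The STRATA statement is what puts both clinics on one plot. It also produces log-rank and Wilcoxon tests of equality across the strata, which is the unadjusted answer to "is one clinic better?"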
That, my dears, is an adjusted survival curve, which is a picture for another day.
Using SAS to test whether “It gets better” makes you gay
October 17, 2011 | 3 Comments
Next question on categorical data analysis …
Correlated proportions. There are a lot of reasons why you might have correlated data in a two-way contingency table. The most common is that you have measured people twice.
I have heard people say that including discussion of homosexuality in school makes it more likely that children would become gay. Personally, I think, this is – and this is a technical term here – total bullshit.
If I were to test this hypothesis, I could survey a group of 141 male students and ask them several questions, including,
“Would you consider having sex with Bob?”
I would include the picture of Bob above so we are clear what we are talking about here and there is no misunderstanding that I really meant to say Bobbette or Bobbi or Bobby Lou.
Six months later, after having read about people like Alan Turing, the same students would take the same survey. I do not have 282 students here, I have 141 students tested twice.
Some people might say the only satisfactory outcome shows at a minimum all of the students who previously stated that Bob was not their type still saying, “No”. Even better would be if some of those who previously said they would consider it now are in the anti-Bob category.
In fact, we instead get something like shown in the output below, with 1 of the students who said no previously now saying “Yes” and one of those who previously said, “Yes” now being on a no-Bob diet.
Having taught adolescents, I suspect that our two who changed boxes either were not paying attention the first time, were being a smart-ass by checking “Yes”, or were too timid to admit that Bob is indeed their cup of tea.
Statistically speaking, my hypothesis is that learning about famous people who were homosexual and learning about intolerance and discrimination against homosexuals does not make one gay. My null hypothesis is that there is zero difference between time one and time two. Another hypothesis I could test is that the level of agreement in Bob-attraction is 1.0 between time1 and time2.
To test both of these hypotheses using SAS all I need to do is this:
TITLE "MCNEMAR AND KAPPA WITH COMPLETELY FABRICATED DATA" ;
PROC FREQ DATA = AREYOUGAY ;
/* AGREE requests the McNemar test and the kappa coefficient for the 2 x 2 table */
TABLES BOB*BOB2 / AGREE ;
RUN ;
Using my completely made up data, you can see that the value of McNemar’s Test is 0 and the probability of a greater S = 1.00. This being a very far cry from .05, we cannot reject the null hypothesis that there is no difference between the proportion of male students who are gay (or, at least interested in guys like Bob) pre- and post class discussions of historical contributions and issues of gay people.
In the next table, we see that the Kappa coefficient is .9153 and that 1.00 is within the 95% confidence interval, so we can conclude it is plausible that there is perfect agreement. Of course, one could point out that .79 is also a plausible value, so maybe those classes did make one student gay after all. I would counter with, but I already accepted the null hypothesis of no difference based on the McNemar test, so there!
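If you want to try this yourself, here is a minimal sketch of a data set that should behave like mine. The cell counts are an assumption: they are simply one set of numbers consistent with 141 students, one switch in each direction and a kappa of about .915, entered as a summary table with a count per cell rather than one row per student.

```sas
/* Assumed cell counts, chosen only to be consistent with the numbers in the post */
DATA areyougay ;
  INPUT bob $ bob2 $ count ;
  DATALINES ;
No  No  127
No  Yes 1
Yes No  1
Yes Yes 12
;
RUN ;

TITLE "MCNEMAR AND KAPPA WITH COMPLETELY FABRICATED DATA" ;
PROC FREQ DATA = areyougay ;
  /* Each row is a cell of the table, so weight it by its count */
  WEIGHT count ;
  TABLES bob*bob2 / AGREE ;
RUN ;
```

McNemar only looks at the two students who changed their answer. With b = 1 and c = 1 in the off-diagonal cells, the statistic (b - c)^2 / (b + c) is 0, which is why the p-value is 1.00.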
There you have it, two statistical tests to decide if the “It gets better” movement and classes on gay history make you gay.
Please note, since we want to be correct here (statistically, not politically), that McNemar is only used for two by two tables. If you had multiple options like,
“Yes”, “Only if he looks like a real man under that damn My Little Pony costume.” and “No” then you would not use McNemar. You would use Bowker’s test of symmetry, which is what the AGREE option reports for square tables larger than two by two. (Cochran’s Q is for a yes/no answer measured on more than two occasions.) That, however, is a post for some other day. My next post, in case you are dying to know, is on survival analysis in pictures.
Does your candidate have a prayer in hell of getting elected?
October 16, 2011 | Leave a Comment
The other day I wrote about how I had never had the need to do a binomial test for a proportion with SAS. In my business, at least, the question of whether the proportion in the population equals X just doesn’t come up very often. Actually, given that I had taught not one, but two courses on non-parametric statistics in the past and then forgotten about them, it is possible that I did actually do such a test in the past and forgot about it.
I was trying to think about a real-life useful application for this test and it being election season and all, this occurred to me…
Let’s say your candidate, Bill W, is running for dogcatcher. A poll of nearly 2,500 people shows him to have 48% of the vote. Does he have a chance in hell of winning this election?
The question you want answered is whether or not 50.1% is within the realm of probability (the 95% confidence interval).
As a reminder, the SAS code is
proc freq data = dsname ;
tables candvote / binomial (exact equiv p = .501) alpha = .05 ;
run ;
SAS will test for the proportion for the FIRST level in the frequency table, but I’m assuming they are in alphabetical order and Bill comes first.
You can see two things from these results. First, your buddy Bill W does not have a chance of getting 50.1% of the vote. However, it is within the 95% confidence interval that he could get 50.02% of the vote.
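To play with this without a real poll, here is a minimal sketch using summary counts. The data set name poll and the counts are assumptions: 1,200 of 2,500 is simply a round-number stand-in for "48% of nearly 2,500", so the limits you get will be close to, but not exactly, the ones I describe.

```sas
/* Assumed counts: 1,200 of 2,500 respondents (48%) for Bill */
DATA poll ;
  INPUT candvote $ count ;
  DATALINES ;
Bill  1200
Other 1300
;
RUN ;

PROC FREQ DATA = poll ;
  /* Each row is a summary count, not a single respondent */
  WEIGHT count ;
  /* Test H0: proportion = .501 for the first level (Bill) and print 95% limits */
  TABLES candvote / BINOMIAL (P = .501) ALPHA = .05 ;
RUN ;
```

"Bill" sorts before "Other", so he is the first level and the one being tested. As a rough check by hand, the standard error is the square root of .48 × .52 / 2500, or about .01, so the 95% interval runs roughly two points either side of 48%.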
Still, if I were you, I would be looking for a different job. Maybe as campaign manager for his opponent.
I had a lot more I had to say about categorical data analysis but I did not have time to put it all in one post and I already have seven draft posts that I started and didn’t finish.
Tomorrow we will look at one possible use of the McNemar test – to assess the plausibility of my personal opinion that the claim that teaching about the contributions of famous gay people to science and literature, or anti-bullying classes makes children decide that,
“Hey, I thought I was straight, but on second thought, now I want to have sex with Bob!”
is total bullshit.
My Reading Week Schedule, thanks to WUSS
October 15, 2011 | 2 Comments
Anyone who claims to know all of SAS is clinically insane – Ernest Hemingway (not intended to be a factual attribution)
Okay, I admit it, Hemingway didn’t really say that, but he would have, except for being dead and all.
As usual, this year I didn’t have time to do everything I wanted at the Western Users of SAS Software conference. As usual, I spent the majority of my time when I wasn’t presenting in the Statistics and Analytics section. Also as usual, I didn’t get to attend any presentations by David Pasta. He presents every year and I have READ several of his papers online afterward and they are really good. For some reason, his presentations are always at the exact time that I am presenting or when I have a meeting with someone. Yet another reason to go back next year and try again.
As usual, there were so many presentations I wanted to attend that I did not get outside for days. As one programmer said to another,
“I went outside once. The graphics kind of sucked.”
Unlike most years, nothing came up in the papers I attended that I could immediately use, unless you count Maura Stokes mentioning that ODS GRAPHICS ON will be the default in SAS 9.3. That had me go up to the hotel room and add an * to several slides in my presentations for the next two days. Wherever a program began with ODS GRAPHICS ON, a footnote at the bottom now says the statement is no longer necessary once you get SAS 9.3.
Even though I didn’t find anything I could go back and use on Monday, which is unusual, that doesn’t mean the conference wasn’t worth attending. Very often, I will learn something and three or six months later I’ll have a need for PROC X. PROC GLMSELECT, also mentioned in the presentation by Maura Stokes (I did go to other presentations, really), is one I haven’t used before. It is used (surprise, surprise) to help you select the best model. According to the documentation, “The procedure offers extensive capabilities for customizing the selection with a wide variety of selection and stopping criteria, from traditional and computationally efficient significance-level-based criteria to more computationally intensive validation-based criteria. The procedure also provides graphical summaries of the selection search.” It is way cooler when you see the results than it sounds from that description.
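To get a feel for what that description means in practice, here is a minimal sketch of a stepwise selection. It is not from the presentation: the SASHELP.CARS sample data, the choice of MSRP as the outcome, and the 0.05 entry and stay levels are all assumptions picked just to have something to run.

```sas
ods graphics on;   /* not needed where ODS Graphics is already on by default */

/* Stepwise selection on a sample data set (assumed example, not from the talk). */
proc glmselect data=sashelp.cars plots=all;
   class Type Origin DriveTrain;
   model MSRP = EngineSize Cylinders Horsepower MPG_City MPG_Highway
                Weight Wheelbase Length Type Origin DriveTrain
         / selection=stepwise(select=sl sle=0.05 sls=0.05);
run;
```

Swapping the SELECT= criterion, or adding CHOOSE= or STOP=, is where the "wide variety of selection and stopping criteria" from the documentation comes in.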
PROC VARCLUS was mentioned by David Pasta in the Q & A after one of the papers. That is another procedure I haven’t used, but that I want to explore further. According to the documentation “The VARCLUS procedure divides a set of numeric variables into either disjoint or hierarchical clusters. Associated with each cluster is a linear combination of the variables in the cluster, which may be either the first principal component or the centroid component.”
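As a starting point for exploring it, a minimal sketch might look like the following. The SASHELP.CARS sample data and the limit of three clusters are assumptions chosen only for illustration.

```sas
/* Group the numeric variables into at most three clusters.        */
/* By default each cluster component is the first principal        */
/* component; adding the CENTROID option requests centroid         */
/* components instead.                                              */
proc varclus data=sashelp.cars maxclusters=3;
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;
```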
Something I can see coming in handy sooner rather than later is the %GetTweet macro to analyze (what else?) tweets from twitter, discussed in a paper by Satish Garla who is a student at Oklahoma State University.
Really interesting to me was the Java object talk by Diane Shealy. I did not even know there was a Java object in SAS. Apparently, the Javaobj has been hiding from me for a long time because once I started looking into it, I found an interesting paper by Richard DeVenezia about using JavaObj in the DATA step that was written back in 2004. We are likely to be doing a lot with Java in our company over the next year so if there is an easy way to connect it to SAS, I am intensely interested. Diane is also a student.
What is it with these bright, motivated students? Didn’t they read that article by Joel Stein in Time magazine this week saying they were the lazy generation?
When I was their age I was drinking beer at frat parties, not presenting at conferences.
So, there you have it, my homework from WUSS:
- GLMSELECT
- VARCLUS
- %GetTweet
- Javaobj
Even cooler, Martha McLean on twitter had mentioned #readingweek. When I asked what that was she said it was something they had at the university she attended and now she applies it to her work. Every now and then she sets aside two or three days and does all the reading of work-related blogs, articles and books she has been meaning to read.
What an awesome idea. I hereby declare November 1-7 my reading week. Don’t try to stop me.
Making a Difference: Different views from WUSS
October 13, 2011 | 4 Comments
At the opening session, Randy Guard from SAS talked about making a difference. That sounded promising, but then the examples he gave were how analyses could be run on large databases of stock market data so much more quickly that, instead of having market value overnight, traders could get the data hourly. It sounded like the big difference was that it made high-frequency trading easier to do and, of course, gave a big advantage to people who could afford SAS Business Intelligence software. His second example was use by retailers to identify when they should mark down merchandise to move more of their inventory. Well, certainly lower prices help consumers, although even that is not as simple as it sounds because if the long-term result is that the producers are getting paid less and thus cutting wages and benefits, the good is questionable. It’s, of course, also plausible that moving more inventory would increase overall profits and more people would be hired. I’m not saying that maximizing sales is a bad thing. What I am saying is that when I personally think of making a difference, it isn’t in helping Wal-Mart sell more at low, low prices or increasing the amount of high-frequency stock trading.
In her talk on what is new in SAS 9.2, 9.22 and 9.3, Maura Stokes mentioned PROC PLM. In case you don’t know (and really, why would you?) “The PLM procedure performs postfitting statistical analyses for the contents of a SAS item store that was previously created with the STORE statement in some other SAS/STAT procedure. An item store is a special SAS-defined binary file format used to store and restore information with a hierarchical structure.”
Speaking of making a difference, I can see where this could be really useful to the Department of Education. A few months ago, I was in Washington at a seminar with dozens of other researchers around the country, analyzing the National Indian Education Study (NIES) data. Because the data are confidential and restricted, none of us could take the data home. We all left and then had to send in forms asking for access to the data and promising to lock it in a closet guarded by a leopard, or something like that. Just think how much more efficient we could have been if we had been able to output the model information using the STORE statement and then take that – aggregated, not personally identifiable – back home and do some more work with it.
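To make the idea concrete, here is a minimal sketch of the store-then-restore workflow, assuming SAS/STAT 9.22 or later. The SASHELP.CLASS sample data, the model, and the item store name classmodel are stand-ins I made up; they have nothing to do with the NIES data. Whether an item store would actually satisfy a given disclosure policy is a separate question that the data owner would have to answer.

```sas
/* Step 1: fit the model where the data live and save the fit to an item store. */
proc glm data=sashelp.class;
   class Sex;
   model Weight = Height Sex / solution;
   store work.classmodel;          /* hypothetical item store name */
run;
quit;

/* Step 2: later, and without the original data, do postfitting analyses. */
proc plm restore=work.classmodel;
   show parameters;                /* parameter estimates saved in the store */
   lsmeans Sex / diff;             /* least squares means and their difference */
run;
```

In real use the item store would go in a permanent library rather than WORK, so the file could be carried home while the raw data stayed behind.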
I’m not against making money. I’m no Mother Teresa and I haven’t given everything I own to charity. On the other hand, it troubles me when really educated people focus only on maximizing sales and profits. I think that the attitude that there is no more to making a difference than making a buck is part of the reason our country is a mess.
I’m surrounded by really intelligent people and I would hope that some of them would apply at least a small part of their considerable gifts to making their communities better.
It doesn’t have to be selling your house and joining the Peace Corps. I don’t think what many rural communities in third world countries need most is SAS programmers anyway. It could be something as simple as contacting the Institute of Education Sciences and saying, “Hey, here’s a way that your outreach to researchers could be done more effectively.”
It could be coming to the regional users group meeting and talking to new programmers about how they can use open data to gain experience in more advanced and varied programming techniques while doing projects for their communities, ranging from analysis of education data to guest lectures explaining statistics to school children. (Yes, I am talking about that tomorrow, how did you guess?)
If we could focus, each of us, a little more on balance and a little less on balance sheets, I think THAT would make a difference.
What Rocket Scientists Do on the Weekends
October 8, 2011 | 1 Comment
The rocket scientist decided the most important use of his spare time was to pose Homer as an R6σ specialist.
I was looking at this in my email and wondered why I ever decided to actually have a child with this person. Then, I looked at the books in the background.
That’s why. If you can’t be more mature than your children, you can at least be smarter and more educated.
What does everybody already know about categorical data?
October 3, 2011 | 6 Comments
I’m teaching a class on categorical data analysis after the Western Users of SAS Software conference next week. As always, I have WAY more information than I can cover. Handouts are limited to 40 pages so I sent the organizers 80 slides but I know I am going to cover way more than that. Why not put them three to a page? Because that is just silly. I’d rather have people have 80 slides they can read than 120 they can’t.
From lectures and papers over the years, I have way, way more material than will fit in three hours. Now, the question is what do I include and what do I leave out? There are some obvious things to leave in:
- How to code and interpret a logistic regression analysis.
- How to interpret model fit statistics.
- What is an odds ratio and how do you get it?
The above points all address things that people will commonly want to do, like use multiple variables to predict which category a person will fall into (hence the need for logistic regression).
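For readers who have not coded one before, a bare-bones logistic regression looks something like the sketch below. The SASHELP.HEART sample data and the choice of predictors are assumptions for illustration only, not material from the class.

```sas
/* Predict which category (Dead vs. Alive) a person falls into. */
proc logistic data=sashelp.heart;
   class Sex / param=ref;
   model Status(event='Dead') = AgeAtStart Cholesterol Systolic Sex;
run;
```

The default output includes the Model Fit Statistics table (AIC, SC, and -2 Log L) and the Odds Ratio Estimates table, which are the pieces the three bullet points above are about.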
What can everyone be expected to know already?
Okay, that’s the easy part. The not as easy part is to know what everyone can be expected to already know, as I don’t want to waste anyone’s time.
How many people really look at those different chi-square values like Pearson and the maximum likelihood chi-square? Does everyone know WHY the expected frequency is expected to be that number?
Does pretty much everybody know what a phi coefficient is? Yes, I know we all learned it in basic statistics but how many people never thought about it again?
Can I just skip discussing the marginal distribution and conditional distribution because “everyone knows that”?
How about computing confidence intervals with PROC FREQ?
What about testing the null hypothesis that the population value is a specific proportion, also using PROC FREQ?
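For reference, most of the items in these questions come out of a couple of TABLES statements. This is a minimal sketch, using the SASHELP.HEART sample data and a null proportion of .5 purely as assumed examples.

```sas
proc freq data=sashelp.heart;
   /* Pearson and likelihood ratio chi-square, the phi coefficient, */
   /* and the expected frequency for every cell                     */
   tables Sex*Status / chisq expected;

   /* Confidence limits for the proportion who died, plus a test of */
   /* the null hypothesis that the population proportion is .5      */
   tables Status / binomial(level='Dead' p=.5);
run;
```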
In a normal household one might ask one’s spouse, to at least get some indication if “the man on the street” would be familiar with a topic. I, on the other hand, married someone who decided to pursue a doctorate in particle physics because he found nuclear physics too easy. Somehow, I don’t think he does a very good imitation of the man in the street.
Here is my plan right now:
- Collate everything I have on categorical data analysis, including the material from the two courses I taught years ago on non-parametric statistics, which I had forgotten that I had taught until I found the PowerPoint presentations in a folder. Then I remembered, oh yeah, THOSE courses on non-parametric statistics!
- Put these in order in an outline under “Questions you want answered”.
These are the questions I have so far:
- Are your data any good? (Always a good question to ask first)
- What is the distribution of X?
- What is the distribution of X given Y?
- Is there a significant relationship between X and Y?
- Given X, what are the odds of Y?
- How well, and with what variables, can we predict which category of X a person falls into?
- Is this set of variables significantly better for predicting X than that other set of variables lying over there?
Then, there are those questions of special cases:
- What if you only have a small number of cases in one or more cells?
- What if your data are repeated measures?
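Two common answers to those special cases are Fisher's exact test for sparse tables and McNemar's test for paired yes/no responses. The sketch below shows one way to request each in PROC FREQ. Every number and name in it is invented for illustration.

```sas
/* Small cells: a hypothetical 2x2 table with very few cases per cell. */
data small;
   input treat $ outcome $ count;
   datalines;
A Yes 7
A No  2
B Yes 3
B No  8
;
run;

proc freq data=small;
   weight count;
   tables treat*outcome / fisher;   /* Fisher's exact test */
run;

/* Repeated measures: the same hypothetical people answering before and after. */
data prepost;
   input pre $ post $ count;
   datalines;
Yes Yes 12
Yes No   3
No  Yes  9
No  No   6
;
run;

proc freq data=prepost;
   weight count;
   tables pre*post / agree;         /* McNemar's test */
   exact mcnem;                     /* exact p-value for McNemar's test */
run;
```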
Any suggestions, experience or good categorical data analysis jokes are welcome.
(Hey, if there are SQL jokes, there must be some categorical data analysis jokes.)
Tip of the day: A three-way interaction has an entirely different meaning in categorical data analysis than it does in an X-rated video. I actually found a three-way interaction with sex within military service. It was not at all exciting. It only meant that the relationship between school experiences and plans to enter the military varied by gender.