I was reading this article in the Wall Street Journal on SAS software and CEO Jim Goodnight said,
“We’re producing so many new products. [They’re all] in the funnel and we’ve got to get them in production and it’s taking us longer to get them out the door…. A lot of the problems are testing issues. It’s taking too long to solve all the problems. Every piece of software has its bugs. And with more and more products, we’re struggling with compatibility to make sure the next release is easy to migrate to for certain customers. With the sheer number of software solutions we have, it’s making it harder and harder to develop and test them all.”
Which got me to wondering why SAS does not take advantage of its considerable user base. In another article from CBR Business Intelligence, it’s claimed that SAS has 45,000 customer sites. (Not that I’m stalking y’all or anything). I have no idea what the average number of users is at each site but let’s try a really low estimate of around 22. I say really low because I know that many of those sites are universities and corporations that have hundreds of users and it would take a heck of a lot of small shops to bring it down to 22, but let’s go with that. So, this gives them around a million users.
You would think that somewhere in those million users you would have people who would be happy to dive in and break things (also known as testing). One benefit of open source software is that you have people trying things all the time. When they come up with problems, they create fixes. There are sometimes disadvantages to open source software, too – lack of documentation, lack of support. That’s not always true, though; for large open source projects, like Linux, the documentation is massive, as is the support within the user community. There can also be legal issues if a company is marketing something and it has someone else’s code included.
However, there is a hybrid option that can happen on two fronts:

1. User testing

When I was at USC, whenever we got new versions of any software or operating system, there were several people (including me) who had the immediate reaction of,
“Let’s try to break it!”
We’d install it on every configuration of virtual machine, hardware and operating system. We’d try to use enormous files and analyze huge matrices of dependent and independent variables. We’d compare results from different applications. Ostensibly we did this as part of our job so that when anyone asked us a question we had a better answer than,
“Your guess is as good as mine.”
There was also the aspect of, as one of my co-workers said,
“They pay me to play with computers all day. How cool is that?”
The fact is, when you create a piece of software of a significant degree of complexity (and SAS is certainly on the far right of the bell curve), it is damn near impossible to test every possible permutation –
“What if I run SAS on Linux running on VMware on a MacBook to read a 100GB dataset that was originally created by SAS on Windows in China using a double-byte character system and then exported using PROC COPY?”
With a large enough user base, SAS could make copies of work in progress available for testing which would allow identification of problems. I can see many reasons that users would be happy to do this:
- Consultants would see this as an opportunity to get ahead of the curve, using the newest software before it was available to the general public.
- Students could use this chance to learn more about the latest software.
- Just for the hell of it. (This is my motivation for most things.)
Fixing those bugs would still have to be done in-house to allow for quality control, documentation and liability issues. Hence the hybrid part. I’d be interested to see how having a large number of users turning the software inside and out could supplement whatever is done internally to minimize the bugs in software when it is released (and just accept the fact that there will always be bugs).
Someone suggested that SAS is possibly worried about damaging its reputation if they let out software with lots of bugs in it. I don’t think that would occur if they were very upfront about it,
“Look, this is a work in progress. Run it through the paces and see what you think.”
as opposed to doing what some companies do and essentially releasing their beta version.
And, of course, no matter what you do or say, some people will complain and criticize because some people are just stupid that way.
“What! I can’t believe you did not see the importance of including a Serbo-Croatian to Mandarin translation function! And you say you have a comprehensive set of character functions, you fools!”
2. User-written macros
Back in the 1980s, there used to be a book (yes, an actual physical phone book type of book) of SAS user-written macros. Over the years, of course, many of those have morphed into SAS procedures. I remember learning SPSS because SAS didn’t do loglinear models, so this was quite some time ago, and there is not as much need now.
Of course, there are a zillion packages for R, which is true open source, but Stata, which is not, also has a host of Stata procedures that are written by users. Raynald Levesque’s website has 140 macros for SPSS. SAS macros exist in diverse places – individual websites, SUGI/SGF and local user group proceedings.
With the data.gov initiative alone I think there is a lot of extensibility of SAS that is going untapped.
Lately, I’ve been pondering if I should be picking up some other language, partly just because it is good to learn new stuff, but also because it seems as if SAS is a bit behind on getting involved with some of these opportunities like with open government.
All the same points could be made about SPSS but they at least have the excuse that there are not as many users, not as many people writing syntax versus pointing-and-clicking and they have been bought by IBM so who knows what direction they are taking.
It’s just puzzling to me why, with such a strong user community, SAS is overlooking so much of the potential for being faster, better, smarter.
I went to the Western Users of SAS Software (WUSS) conference a couple of weeks ago. One of the great benefits of this conference, more than most others, is the extent to which you see new people discussing new ideas.
What WUSS is not
For both statistical software and academics, I have found diminishing marginal returns from the “big name” conferences. There are conferences I attended 20 years ago that still have many of the same people presenting on essentially the same topics. I have been told haughtily that this is because they have a “program of research” and they are the “recognized experts”. It’s nice to be a recognized expert, compared to, say, being Paris Hilton or the Octomom or the Unabomber. I certainly understand, if you have done a really good presentation or a really good piece of research, the value of presenting it more than once for an audience of 50 people or publishing more than one twelve-page journal article. But seriously, enough already!
Regional conferences often have young people presenting for the first time, and they may not all have the polished skills of their more experienced colleagues, but they often have new perspectives.
For example, Dmitry Rozhetskin gave a paper on different ways to manage lists in SAS, a topic to which I had never given a second’s thought. Today I was working on a problem where I need to use a macro to run some code for each url contained in a SAS dataset. One possibility that occurred to me, since I had attended this 10-minute talk, was that I could append all of the records to one another for a single long list and then scan it. I don’t think that’s the most efficient way; it was just the first solution that popped into my mind when I looked at the problem. My point is that I never would have even thought of that a month ago.
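For what it’s worth, one common way to do the “run some code for each value in a dataset” part is to dump the values into one long macro variable and walk it with %SCAN. A minimal sketch, assuming a dataset called urls with a character variable named url (both names are made up, and this breaks if the urls contain % or & characters):

```sas
*** Put every url in the dataset into one long, space-separated macro variable ;
proc sql noprint ;
   select url into :urllist separated by ' '
   from urls ;
quit ;

*** Walk the list, pulling out one url per pass with %SCAN ;
%macro runall ;
   %local i thisurl ;
   %let i = 1 ;
   %let thisurl = %scan(&urllist, &i, %str( )) ;
   %do %while(%length(&thisurl) > 0) ;
      %put Now processing: &thisurl ;   /* replace with whatever code runs per url */
      %let i = %eval(&i + 1) ;
      %let thisurl = %scan(&urllist, &i, %str( )) ;
   %end ;
%mend runall ;
%runall
```

The %DO %WHILE stops when %SCAN runs off the end of the list and returns a null string.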
Rick Wicklin talked about PROC IML, which is something I have looked at and played with a very little bit but never given much thought. I had briefly considered it as a possible means of teaching about regression, since you could easily transpose, invert and multiply matrices. However, I’m not teaching statistics right now, and I haven’t taught in a program that got to that level of detail in the mathematics of regression for years. So, here is what I learned:
- PROC IML has over 300 functions (called modules)
- Creating a new function is a lot like creating a SAS macro
- It’s easy to call SAS from within PROC IML, just type SUBMIT — your SAS code — ENDSUBMIT
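To give a flavor of the last two points, here is a tiny sketch of my own (the module name and data are mine, not Rick’s, and SUBMIT/ENDSUBMIT requires a recent release of SAS/IML):

```sas
proc iml ;
   /* defining a module feels a lot like defining a macro */
   start center(x) ;
      return ( x - x[:] ) ;      /* x[:] is the mean of the elements of x */
   finish center ;

   x = {3, 5, 10} ;
   c = center(x) ;
   print c ;

   /* calling ordinary SAS code from inside IML */
   submit ;
      proc means data = sashelp.class ;
      run ;
   endsubmit ;
quit ;
```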
He also gave a really good explanation of bootstrap permutation which I did not write down so I am going to go back and read his paper.
Last week, on the tram ride back from Disneyland to the parking lot, my husband was talking about a problem he was working on. He hadn’t been making progress using the first method, and his next effort was to use a polynomial equation. This, however, meant that instead of dealing with 10,000 variables, he was now dealing with 10,000 squared, or 100,000,000 variables. The number of clauses required to solve the problem would be in the millions. I suggested that if he could recast this as a matrix problem he could code it and debug it more simply, although, of course, the number of operations would be the same. Also, if the code could be written using macros, it might be doable.
My oldest daughter, who had been sitting behind us, leaned over and said,
“This must be the most boring conversation I have ever eavesdropped on.”
My first point is that this is why I did not marry her. Well, it wasn’t the only reason; temporal incompatibilities, sexual orientation and other factors came into play.
My second is that I would not have thought of the matrix algebra solution if I hadn’t still been pondering what I could do with PROC IML.
WUSS – the gift that keeps on giving.
Even better, the WUSS proceedings are now available online. The papers can all be found at http://www.wuss.org/proceedings10/ . I intend to spend this weekend reading several I was not able to attend, and one, on SGPANEL, that I was able to attend but want to learn more about. For those of you who don’t know – you only get twenty minutes to present contributed papers, and ten minutes for Coder’s Corner, so often the published paper has more information than was given during the talk at the conference.
Hello, world –
It’s me, on my soapbox again. The latest straw was a discussion of articles in Science which have been the subject of retraction or “expression of concern”.
According to Retraction Watch there are between 30 and 95 retractions each year. The ones you hear about are usually in medicine and the “hard” sciences – biology, physics, chemistry.
Retractions of articles published in Wiley-Blackwell journals over a two-year period showed social science to be dead last in the proportion of articles retracted, a contest I’m sure social scientists are happy to lose.
Let’s exclude the sheer stupidity of plagiarism. (I mean, seriously, you managed to get a Ph.D. without learning that copying off the kid in the desk next to you is wrong? And you haven’t heard that we have computers now where people, including the original author, can see what you wrote and compare it with that other kid’s work?)
No, let’s talk about subtler stupidity. First up is the type of stupidity, hubris or willingness to believe one’s own numbers too much that leads you to grab a significant r-squared and run with it. Every project I have ever worked on, every contract I have ever written, allows substantial time for “data quality” and “convergence”.
My fellow (experienced) statistical consultants never get the answer to this particular question wrong.
If a client objects that his/ her data has no problems, I add a clause to the contract stating that we will refund that percentage of our fee that was applied to these analyses if we don’t find any problems with the data. How many times, in 25 years, have I had to pay a refund? (Extra bonus points if you can guess how many dollars this added up to.)
The correct answer is zero and zero.
Examples of the kinds of errors we see:
- Columns are off by one or two so from point X onward all of your data are actually for the next gene/survey item/test question etc.
- Dataset is supposed to only be for one state / school/ organism/ race etc. but includes others.
- Some of your survey respondents are dead. That is, they have a date of death entered in their medical records we merged with the survey data.
- Your data were entered wrong. People in the control group were entered as experimental or vice-versa.
- Your data were scored wrong. Questions that should have been marked right were marked wrong. Questions that should have been reverse-coded weren’t.
You can see these are not small, inflate-your-standard-errors-a-bit-beyond-the-stated-rate type of errors but completely-f—ed-up-your-data type of errors. And yet, it is no problem, because we EXPECT these. We run a PROC CONTENTS in SAS or codebook in Stata. We run descriptive statistics – means, frequencies, standard deviations – on everything. We run graphs of distributions and stare at them. And we catch these errors.
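In SAS, the whole screening pass can be as unglamorous as this (the dataset and variable names here are placeholders, obviously; substitute your own):

```sas
*** Variable names, types, lengths and record counts first ;
proc contents data = mydata ;
run ;

*** Means, standard deviations, minima and maxima flag out-of-range values ;
proc means data = mydata n nmiss mean std min max ;
run ;

*** Frequencies on categorical variables catch impossible codes,
*** like dead survey respondents or the wrong state ;
proc freq data = mydata ;
   tables state group date_of_death ;
run ;

*** Graphs of the distributions, to stare at ;
proc univariate data = mydata ;
   histogram ;
run ;
```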
On the list of things I don’t understand, right after teenagers, comes why people will spend so much time worrying about whether they should use a mixed model or a general linear model, or puzzling out endogenous versus exogenous variables in a structural equation model, and not spend the time checking their data.
Stupid crap to stop doing, take two:
So, you checked all of your statistics every way from Tuesday, your data are great. You coded everything correctly, fixed all data entry errors.
You have 22,000 people in your organization. For reasons beyond me, you randomly sampled 3,000 of them to receive the emailed survey. (Why? Because bytes are so expensive?) From your 3,000 people you get 92 responses, and based on this you write a report saying people are generally happy with all aspects of your organization. (When the person who conducted this study asked for my opinion, I sent a two-page report detailing the inadequacies in the sampling and the survey used. He later complained to someone that I was overly negative. It’s good I did not go with my initial reaction on reading it, which was to send an email that said, “You’re f—ing kidding me, right?”)
Who are you going to vote for president in 2008?
A. Obama – the African-American, Muslim, born in either Kenya or Indonesia who is a socialist/communist who wants to kill your grandmother and take your guns
B. McCain – the old white guy, two steps from death’s door (may already be dead for all we know), who looks eerily like the evil emperor in Star Wars (coincidence, I think not!) and hates gays and immigrants (especially gay immigrants)
C. Someone who has less chance of getting elected than a naked mole rat, who, in fact, for all we know, may actually be running from a political party controlled by naked mole rats.
Those kinds of surveys generally produce valid data. Yours, on the other hand, is more along the lines of:
How many people in your life would you consider to be “like family”? ___
Rate your level of happiness on a scale of 1 (= extremely unhappy) to 10 (= extremely happy).
You have actual numbers. How much more objective could you be? And then you go on to talk about correlations with social support and happiness. You put these numbers into extremely complex models and control for stratification, distributions of the independent and dependent variables. You use the appropriate statistics.
Somewhere, you are missing the point. The point was where you asked those questions. My daughter is a pretty good athlete (as in, having won world and Olympic medals and now competing in MMA – she’s the uninjured one in this video). Coaches, managers, fans and random people are always telling me, “She’s just like family to me.”
My husband has suggested I send the tuition and car insurance bills to these people and see if they pay them. I’ve gotten so sick of it, I tell them,
“Excuse me, but you have a different definition of family than me. NO ONE is like family to me but my family. I may like other people a whole lot, but if you threaten one of these four children, I will shoot you dead if I have to, and not feel one bit of remorse looking down at your bleeding corpse.”
My children have a whopping amount of social support, even if it’s only from me. Just because you SAID that some number you got from a random question you asked is a measure of social support doesn’t mean that it IS.
One of the reasons fewer retractions are required in social science research (in my not-the-least-bit-humble opinion) is that it is far more difficult to replicate the sampling method and measures used to actually question the findings in the first place.
Now that I am really on my soapbox, I think I will just make this an ongoing, perhaps infinite, series on all of the things that are wrong with research, and particularly social science research (and yeah, marketing research guys, I’m including you!).
Unstructured data is to the usual database as Toontown is to Irvine Ranch (or Diamond Bar or Porter Ranch or any other white bread community that has two names, six types of floor plans and where half the children are named Buffy, Jessica, Jason or Justin – you know who you are).
If one were to have structured data to answer the question,
“What are the two websites you visit the most?”
one could have a list of the most common websites and have people pick from it in a drop-down menu. The problem, of course, is that you might need a very, very long menu and it would still leave some sites off.
Instead, we just have a question and people write whatever the hell they want, which can include the word “none”, one website, two websites, or five or more websites, because they either cannot read or are terribly indecisive.
Here is a start on putting some of these data into a reasonable structure for analysis. I happened to use SAS for this but you could probably do it just as easily using a bunch of other languages.
/* THIS FIRST PART READS IN THE DATASET, CREATES TWO VARIABLES LONG ENOUGH */
/* TO HOLD THE WEBSITE NAMES AND MAKES EVERYTHING LOWER CASE SO THAT ALL */
/* OF THOSE PEOPLE WHO PUT Yahoo, yahoo, YAHOO etc. ARE COUNTED AS ONE WEBSITE */
data readsites ;
   set urls ;
   length website1 website2 $ 32 ;
   web = lowcase(web) ;
   **** If someone put "none" (with trailing blanks trimmed), then
   **** website1 is equal to blank and the number of sites named = 0 ;
   if trim(web) = "none" then do ;
      website1 = "" ;
      namesite = 0 ;
   end ;
   else do ;
      **** I use the INDEX function to see if there is a "," in the field.
      **** If there is, the person entered two websites (or more) separated
      **** by a comma, so I set the value for number of sites named to 2,
      **** read the first website from the beginning of the string to just
      **** before the comma, and the second website from just after the
      **** comma to the end ;
      fnd = index(web,",") ;
      if fnd > 0 then do ;
         namesite = 2 ;
         website1 = substr(web,1,(fnd - 1)) ;
         website2 = substr(web,(fnd + 1),(length(web) - fnd)) ;
      end ;
      **** If there is only one website listed, then the first website
      **** is equal to the whole string ;
      else do ;
         website1 = web ;
         namesite = 1 ;
      end ;
      **** If there is no "." then the person just entered something like
      **** 'yahoo', so I tack an http://www. in the front and a .com at
      **** the end. If there is a "." but no www, the person entered
      **** something like bebo.com, so I tack http://www. at the beginning ;
      if website1 ne "" then do ;
         if index(website1,".") = 0 then
            website1 = "http://www." || trim(website1) || ".com" ;
         else if index(website1,"www") = 0 then
            website1 = "http://www." || trim(website1) ;
      end ;
      if website2 ne "" then do ;
         if index(website2,".") = 0 then
            website2 = "http://www." || trim(website2) || ".com" ;
         else if index(website2,"www") = 0 then
            website2 = "http://www." || trim(website2) ;
      end ;
   end ;
run ;
When I get done, for all of those who entered 0, 1 or 2 websites I have a correctly formed url, thanks to the INDEX, SUBSTR, LENGTH and TRIM functions in SAS. The || concatenates strings, but you probably guessed that. I also have a variable that tells me if the person entered zero, one or two or more websites. You might have wondered about appending the .com. I just happened to know that in this dataset the few sites that were .edu or .gov had that entered.
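If I had needed to keep more than two websites – remember the terribly indecisive people – the COUNTW and SCAN functions would generalize the same idea. A sketch only; the array size of 5 is an arbitrary choice of mine, and COUNTW requires a reasonably recent version of SAS:

```sas
data readsites2 ;
   set urls ;
   array website {5} $ 32 ;             /* keeps up to 5 sites; pick any limit */
   web = lowcase(web) ;
   if trim(web) = "none" then namesite = 0 ;
   else do ;
      namesite = countw(web, ',') ;     /* number of comma-separated entries */
      do i = 1 to min(namesite, dim(website)) ;
         website{i} = strip(scan(web, i, ',')) ;
      end ;
   end ;
   drop i ;
run ;
```

The same tack-on-the-http://www.-and-.com fix-ups would then loop over the array instead of being written out twice.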
My point – and it may surprise you to hear that I have one – is that while several years ago character functions were mildly interesting and occasionally useful, with the explosion in the amount of unstructured data being collected, functions like these are now one of the basic prerequisites of life. They are right up there with, say, coffee at any hour before 9 a.m. when one is expected to simulate a coherent human being.
If you are not at WUSS, you missed world-famous statistician Stanley Azen SINGING the conclusion of his keynote, accompanied by an original score for piano that he had written himself, thus setting an impossible standard for any other presenter to top.
My only talent being judo (I was the world champion 26 years ago), I tried to get WUSS co-chair Marian Oshiro to be my partner and open my presentation this morning by lifting her over my head and then throwing her into the audience. She refused. You would think a conference co-chair would have more commitment than that, wouldn’t you?
Despite the nice weather and being right on the bay, I have yet to spot anyone sneaking out of sessions to lie by the pool. If you are not here, other stuff you missed:
– Coder’s Corner, where I spent most of the day. This consists of 10-minute talks on anything, really. You get people giving their first presentation (like my co-author on the supercomputing talk, Ernesto Flores), since it seems less threatening to do ten minutes. And you get a lot of very experienced people showing some cool thing they learned and thought might interest other people, too.
* Qiweng talked about shading the area between a curve and reference line in different colors.
* Nate talked about using a macro that defaults to ALL (sites, flights, categories, whatever) but allows you to put in a parameter for the macro variable if you do want just one.
* Dmitry talked about the different ways to handle lists with SAS and why you would use each, an interesting topic I had never really thought about before.
* James talked about outputting SAS charts directly to a worksheet in Excel using XML
There was more way cool stuff, but now I have to go give three presentations back to back starting at 8:30 a.m. This being the first time I have been up at this hour since an all-night party in 1978, I suspect the session chair is getting nervous, right about now, that I’ve overslept.
I received two doctoral fellowships, and, years later, a post-doctoral fellowship. My grandmother, who was born in Caracas, Venezuela, and had never been to college, was quite worried. On more than one occasion, she grilled me about what I actually did, and when I told her that I went to the library, read, and then wrote about what I read, she kept insisting that no one could be collecting a check every month and just doing that. I’m sure she died still suspecting that I was secretly a “kept woman” and just not willing to ‘fess up to a sugar daddy named Elmer in the woodwork.
Decades later, I’m still spending a good bit of my time reading stuff and writing about it. Tonight, I thought instead of writing a post on statistics, I’d read some. Besides, Elmer’s asleep.
According to Caslon, the average blog has the lifespan of a fruit fly, lasting around a month because, “blogging met human nature and human nature won”. In other words, people’s intention to continue blogging is about as successful as my intention to clean off my desk.
LoveStats – a blog on social media, sampling, survey design and the occasionally random. Besides, the name is cool.
John D. Cook’s blog - pretty much all statistics all the time. When I read it today he was discussing bias and consistency.
Blog-normal, by John Sall, co-founder of SAS and one of the masterminds behind JMP, is sometimes rah-rah company stuff but sometimes, like his posts on Goldilocks, the research bears and negative R-square, it is just too cool. As someone who spent decades teaching statistics to graduate students who didn’t want to learn it, I wish his blog had been around sooner.
StatChat, from The Analysis Factor, by Karen Grace-Martin – I liked this blog because along with statistics it includes SAS and SPSS syntax and some general discussion of statistical software that is not “I like this and you suck”. Also, the only mention I have heard of BMDP in forever (not missed by me).
Andrew Gelman’s blog is more statistics applied to important (or, at least, interesting) issues than discussions of sums of squares, residuals and domain sampling. That’s okay with me, though. I like interesting.
Michael O’Brien’s blog is another one that is useful if you are teaching statistics. It is a relatively new blog and relatively basic stuff so far. I’d be more likely to recommend it to students than read it myself but it was kind of him to put it out there as a public service.
500 hats – doesn’t have much to do with statistics, but it is funny, thought-provoking and written by the only old person who swears more than me now that George Carlin is dead.
Well, this was fun, but from my non-random sample, I did not find a lot of statistics blogs that were interesting, exceeding the lifespan of a fruit fly or not. The many daughters would no doubt have some sarcastic comments on this finding, but I hear Elmer stirring so I gotta go.