March 25, 2009

You know that feeling after Thanksgiving dinner when you have had all of your favorite foods at once and are so full that you think you will never eat again? Well, the past week has left my brain like that.

Within twelve days, I attended the SPSS Higher Education Road Show at UCLA, then the Innovations in Medical Education conference in Pasadena, then flew across the country to spend four stimulating days at the SAS Global Forum. The SPSS Higher Education Road Show was small but very interesting. There was an excellent presentation on teaching data mining to business students, and another on using SPSS to teach basic statistics with data the students and their friends enter into SurveyMonkey. Plus, I had the chance to talk with colleagues in the field, which is always beneficial.

Oddly, while information on sasopedia/sascommunity, SAS Global Forum and SAS-L is all over the place, SPSS appears to want to keep its road show (excellent!), academic resources (just beginning, but good) and other web resources secret. After 15 minutes of searching, including on the SPSS web site, I couldn’t find any of them. Maybe next week I will do a blog on the secret SPSS sites and post some of these links that I have bookmarked on my computer back in California. (Yes, I know I should have bookmarked them on delicious, thank you for bringing it up – not!)

I am finally catching up with this blog, and I could spend the next three weeks writing every day on what I have learned, the fascinating people I have met or caught up with again, and my thoughts on emerging technology. Let’s start with common trends:

Data mining – this is one of the areas of greatest opportunity for both business intelligence and the growth of scientific knowledge. The data are out there and the tools are out there, whether it is Clementine from SPSS or SAS Enterprise Miner. We have the ability, far more than we are using at present, to determine who is likely to make a purchase, who is likely to constitute a terrorist threat and which person is likely to engage in binge drinking. Why aren’t we using this information?

Lack of technically and scientifically competent people – (Hint: The answer is NOT to have more visas to hire foreign engineers and scientists.) At every conference I attended, there was a discussion of the need for more people knowledgeable about new technology, more people who understand the basic science and mathematics that underlie everything from diseases to parallel processing. Hiring staff from other countries doesn’t increase the number of available personnel; it just moves them around, sometimes switching the focus of a biology major from stemming the spread of tuberculosis in her home country of India to making the tastiest formula for Twinkies in the United States. There are a whole lot of reasons why that is not a good thing. Just one of them is that it does not address our national problem: we are not producing enough people in our own schools who are interested in and prepared to undertake basic and applied research on cancer, Alzheimer’s, the stock market, preventing alcohol abuse, identifying terrorist threats or why other people (but not me) lost 40% of their 401(k). On top of all of this is the scary notion of “statistics without math”. We have people who can get the software to work by following rules they have memorized but don’t really know what they are doing. It’s sort of like having someone who cannot read road signs or maps drive across the country with a GPS. As long as the GPS works, there are no roads closed due to construction or bad weather, and the area they are going to is not so new it doesn’t show up in the GPS, they are fine. Once anything out of the routine happens, though, you’re screwed.

GUI is in, even if you don’t know what GUI is. GUI is graphical user interface, and it is what SAS Enterprise Guide has, for example, or SAS Forecast Server and most of the newer offerings from both SAS and everyone else.
For example, this image combines two SAS datasets using Enterprise Guide. This is actually how it is done, with the pointing and the clicking and the menus dropping down. Previously, this would have been done as follows:

libname in "c:\annmaria\data" ;

proc sort data = in.pre ;
   by name ;
run ;

/* assuming the second dataset is named in.post */
proc sort data = in.post ;
   by name ;
run ;

data in.alldata ;
   merge in.pre in.post ;
   by name ;
run ;
While I found this a perfectly acceptable way of doing things, sad to say, the world is composed of people who are not me. (There will be a brief pause while I mourn this unfortunate state of affairs.) Now I may seem to be contradicting myself when I say that we need people who understand the underlying mathematics and logic of data analysis and then say that the move to GUI is a good thing. First of all, I didn’t say it was good; like my children growing up, it is simply inevitable. Whether it is good or not is a moot point. Second, there is nothing keeping anyone from using Enterprise Guide or Forecast Server in extremely sophisticated ways. I don’t mean to imply that one needs to be able to transpose covariance matrices in one’s head or build a nuclear reactor out of wood to interpret an Analysis of Variance. I do think, however, that you should know that one cannot do an Analysis of Variance when all of your variables are categorical. (“Are you sure?” someone asked me recently, having already collected their dissertation data. Yes. I am sure. Very, very sure. If you laughed at this, seriously, you need to sell your slide rule and buy a life.)

The next two big things are text mining and social network analysis.
If you combine textual analysis software with data mining and social network analysis and know what the hell you are doing (note all of my points above converging here), you could get some predictive power akin to science fiction. This was all brought together in a fantastic presentation co-authored by Howard Moskowitz, who is, among other things, the author of Selling Blue Elephants and fifteen other books, and president of Moskowitz Jacobs.

Speaking of which, this illustrates yet another one of the amazingly cool things about conferences like these. After his presentation, I had a few questions about data modeling, and he was gracious enough to ask where I worked and where I attended graduate school. As people often do, he started naming people who had been his professors, well-known in the field, who were at USC or at UC and asking, “Well, have you heard of -”

After one or two whom I remembered and several who drew a blank, I finally gathered as much in the way of manners as I could muster (which, if you met me, you would know is not much) and said,

“Well, sir, the one I have really heard of is you!”



Since I have written about odds ratios and logs lately, I was going to write about the natural log of the odds ratio; however, random events have caused me to do otherwise.

I read an interesting blog by Adam Jackson lately, in which he is concerned that robots will take over the world. At the very least, he worries, all of the data we are submitting on ourselves via our friends on Facebook, our pictures on Flickr, our personal web pages on iGoogle, our Travelocity accounts, grocery store frequent shopper cards and a thousand other sources from billions of people will be combined and inserted into a series of equations that will predict our every move.

I say it ain’t so. Billions and billions have been spent on data mining and we still cannot predict behavior with any degree of accuracy. Two obstacles are imperfect models and random error. The latter might be better described as “The State of Being Human”. For example, I hate SQL. I even wrote a blog about how much I hate it. That doesn’t mean I never use SQL, or even that, if offered an unimaginably enormous amount of money, I would not become an SQL programmer. In fact, I was going to write on my blog about types of sums of squares a few days ago (since it was Square Root Day), but instead, I have decided to write about how robots are not taking over the world.

Let’s go with the imperfect models first. I have a tank of sea monkeys on my desk that I bought for the purpose of collecting random data. Among the variables I have entered into an SPSS dataset are the number of sea monkeys, the number of specks and the degree of sea monkey activity.

I have tried various models to predict the dependent variable WTF, which is my co-workers’ behavior. I have tried all of the independent variables, linear functions, quadratic functions, t-tests and multiple regression. To date, my attempts could only be characterized as a success if success has a completely different meaning than the one with which I am familiar. In fact, in one spectacular non-success event, I actually achieved an adjusted R-squared of -.232, meaning in statistical terms that my equation was, in fact, WORSE than doing nothing in terms of predictability.


This can be interpreted as SPSS’s way of saying,

“Why don’t you just shut the hell up, because a better equation could be developed by, say, a sea monkey?!”

Part of my problem may be that I just don’t have enough data. It could be that I need more data, and once I have it, like Adam Jackson’s megalomaniac robots, I could take over the world. Or, my failure to attain predictive nirvana may be due to having an imperfect model; it could be that I have the wrong variables. In that case, I could have variables on a billion people on a billion days and run it on a supercomputer and still end up with no better prediction than nothing.


It is very likely that my dependent variable is not explained by the number of sea monkeys, the number of specks, the degree of sea monkey activity or anything at all external to my co-workers. Perhaps the variables I should have measured are these:

BOREDOM: Total rating of each worker

ATTEND : Number of people at work today on the third floor

PERSONBORE: The product of the first two variables for a boredom factor for the floor

WEATHER: This could be an ordinal scale from nasty, cold and rainy to perfect beach weather (March 4 in Los Angeles – everyone else, eat your heart out!)

Even if I were to have the right variables, you cannot measure everything. Believe me, I have been trying most of my life. There is that random error factor. Co-worker A may have a boredom index of 4,000,000, the weather outside may be hailing bowling balls so she is not leaving the building, she may walk by my desk 762 times, but she is NOT going to speak to me, not even if the sea monkeys get impacted by radiation leaking out of the pipes, grow to the size of a dinosaur and swallow our new CIO whole, because once, three months ago, she arrived at the elevator shortly after me and I did not hold the door for her and she had to wait for the next elevator, thus getting to her meeting thirty seconds late and blowing her chance to make a great impression on the new CIO.

But what if you had a billion billion data points… wouldn’t that work? No. If you had that many data points, you would be uniquely identifying each person. That would work kind of like tests such as the Myers-Briggs, where you check off boxes saying you’d rather go to a party than solve algebra problems, you’d rather exercise than paint pictures and you’d rather drink a martini than write a blog. You get results back that say you like parties, exercise and drinking better than math, art and writing. No new information is added.

Mathematically, if you use a billion different data points to predict a billion people’s outcomes, you get just what I got with a much smaller number – an adjusted explained variance of less than nothing.
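Here is a toy sketch (with entirely made-up names and outcomes) of what a model with one parameter per person amounts to: a lookup table that reproduces its training data perfectly and has learned nothing that transfers to a new person.

```python
# A "model" with one parameter per observed person is just a lookup
# table: it reproduces the training data exactly, but it has learned
# no rule that generalizes to anyone it has not already seen.

training = {"alice": 1, "bob": 0, "carol": 1}  # hypothetical outcomes

def saturated_model(person):
    # Perfect "prediction" for anyone already in the training data...
    if person in training:
        return training[person]
    # ...and nothing better than a coin-flip guess for anyone new.
    return 0.5

print(saturated_model("alice"))  # 1 (memorized, not predicted)
print(saturated_model("dave"))   # 0.5 (no new information was added)
```

In-sample, this thing is flawless; out-of-sample, it is exactly as useful as knowing nothing.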

According to the Myers-Briggs, I am probably one of the most introverted people on earth. Although I loathe meetings, I spent decades as a consultant flying around the country, because my husband died, I had three small children and the money was good. That is just one random event, but here is what I have noticed: for all of us, life is a whole series of random events strung together.

So, I don’t think we need to worry about the robots rising up against us just yet. On the other hand, maybe we ought to keep a closer eye on those sea monkeys.

