I’m on Twitter a lot, and more to the point, I read a whole lot of blogs and web pages, all of which point to three related questions:
- Why do I so seldom read anything on how to DO predictive analytics or modeling from people who are always tweeting that these are (** Drum roll **) – THE WAVE OF THE FUTURE?
- Even in the small minority of people on the planet who are writing about analytics, there is an even smaller minority who actually explain statistical concepts underlying those techniques. Is this because they don’t think these are important to know or because they have just given up on getting anyone to care?
- How the hell do people get time to spend all day on Twitter and posting on blogs? Don’t they have jobs?
Well, I do have a job, but today has been a kick-ass, rocking, awesome day when all of my programs ran and my output was interpretable. This followed an equally good day yesterday when my program did not run perfectly, but well enough to do what the client wanted. So, life under the blue skies is just pretty damn great. Sorry if you live some place it snows. Sucks to be you.
I was taking a break this morning reading a book, and on page 42 of Advanced Statistics with SPSS (or PASW or whatever they are calling it these days) I came to this line,
“The ANOVA method (for variance component options) sometimes indicates negative variance estimates, which can indicate an incorrect model … ”
and I thought, “Well, obviously” –
and then I stopped, because I could think of several people off the top of my head to whom that would not be obvious. So, let’s start here.
Variance is the degree to which things vary from each other. Some people, including me, consider science to be the search for explained variance. Why do some people score high on a test while others score low? Why do birds fall out of the sky in Arkansas in January but not in California?
We calculate variance by taking each score’s difference from the mean (average), squaring it, adding up the squares (hence the amazingly popular statistical term, Sum of Squares), and dividing by the number of cases. Let’s say we have a population of people with a very rare disorder that causes them to become stuck to the walls of large aquariums. There are only three such people in the world. You can see them here. Any resemblance of the smaller one to the child pictured in the swing above is purely coincidental.
The mean of the population is 4.5 feet tall. One of our sufferers is exactly 3 feet tall. The difference between her and the mean is -1.5, which squared is 2.25. Since the squared differences can never be negative, the sum of squares can never be negative. You can’t have a negative square. You can’t have a negative sum of squares. Since the variance is the sum of squares divided by the number of cases, the only way it could possibly be negative is if you had a negative number of people in your population. That doesn’t make sense, though, does it? I mean, the lowest number you can have in your species/population/study is one. Don’t write and tell me you can have zero, because you can’t. If you have zero, you don’t have a study, you just have a wish for a study that never happened.
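If it helps to see the whole calculation at once, here it is in a few lines of Python. The three heights are made up for illustration, rigged so the mean comes out to the 4.5 feet from the example:

```python
# Heights (in feet) of our three-person aquarium-wall population; mean rigged to 4.5.
heights = [3.0, 4.5, 6.0]
mean = sum(heights) / len(heights)  # 4.5

# Squared deviations from the mean are never negative...
squared_deviations = [(h - mean) ** 2 for h in heights]  # [2.25, 0.0, 2.25]

# ...so the sum of squares is never negative...
sum_of_squares = sum(squared_deviations)  # 4.5

# ...and dividing by N (a positive count) keeps the variance non-negative too.
variance = sum_of_squares / len(heights)  # 1.5
```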
So… lesson number one. If you have a negative variance or a negative sum of squares of any type, your model totally blows. It makes no sense and you should not use it for anything.
(I once worked for a large organization where a middle manager weenie was quite aghast at the way I explained statistics. She stormed over to me in outrage and said,
“This is a professional setting! I cannot think of a single situation in my twenty years here that using “blow” in a sentence is appropriate.”
“I can. Blow me.”
Subsequently, my boss maintained admirable composure as he promised her that he would speak to me severely about my attitude. )
How to tell if your model sucks less
Only really, really terrible models have a negative variance. Let’s say your model just kind of sucks, and you would like to know if a different model sucks less. Here is where the Akaike Information Criterion comes in handy. You may have seen it on printouts from SAS, SPSS or other handy-dandy statistical software. You don’t recall any such thing, you say? That is what AIC stands for. Go back and look through your output again.
Generally, when we look at statistics like an F-value, t-value, chi-square or standardized regression coefficient we are used to thinking that bigger is better. In fact, it is so easy to get confused that some of the newer versions and newer procedures (for example, SAS PROC MIXED) tell you specifically on the output that smaller is better.
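For the curious, the formula behind the printout is simple: AIC = 2k − 2·ln(L), where k is the number of parameters in the model and ln(L) is its log-likelihood. That 2k term is a penalty for complexity, which is exactly why smaller is better. A quick sketch in Python (the log-likelihoods here are invented for illustration):

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*ln(L). Smaller is better."""
    return 2 * n_params - 2 * log_likelihood

# Two hypothetical models: B fits slightly better (higher log-likelihood)
# but spends two more parameters to do it.
aic_a = aic(log_likelihood=-95.0, n_params=2)  # 194.0
aic_b = aic(log_likelihood=-94.5, n_params=4)  # 197.0
# B's small gain in fit doesn't cover the penalty, so A wins (sucks less).
```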
Let’s take a few models I have lying around. All of them are from an actual study where we trained people who provide direct care for people with disabilities. We wanted to predict who passed a test at the end of the training. We include two predictors (covariates), education and group (trained versus control).
SAS gives us two AIC model fit statistics
Intercept only: 193.107
Intercept and Covariates: 178.488
We are happy to see that our model has a lower AIC than just the intercept, so we are doing better than nothing. However, we are sad to see that while education is a significant predictor (p < .001), group is not (p > .10). Since we have already spent the grant money, we are sad.
At this point, one of us (okay, it was me), gets the brilliant idea of looking at that screening test we gave all of the subjects. So, we do a second model with the screening test.
We see that our screening test is significantly related to whether they passed (p < .0001), education is still significant (p < .001) and, joy of joys, group is also significant (p < .05).
Let's look at our two AIC model fit statistics
Intercept only: 193.107 (still the same, of course)
Intercept and Covariates: 141.25
Not only is our model now much better than the intercept alone, but it is also much better than our earlier model that didn't include the screening test.
Won't that always happen? Won't you always get a better fit to the data when you add a new variable? No – AIC charges you a penalty for every parameter you add, so a new variable has to earn its keep.
Okay, fine, you want another example? This training was a combination of on-line and classroom training. We thought perhaps people who were more computer proficient would benefit more. We included in our third model a scale that included their use of email, Internet access and whether they had a computer at home. Here are our final results:
Akaike Information Criterion (AIC)
Intercept only: 193.107
Intercept, Education & group: 178.488
Intercept, Education, group & pretest: 141.25
Intercept, Education, group, pretest & computer literacy: 142.83
The third model is the best of our four options (one of the options being to say the hell with using anything to predict).
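That is also why the computer literacy model lost: adding a predictor can only improve the raw fit, but AIC charges two points per extra parameter, and here the improvement didn't cover the bill. Picking the winner is then nothing fancier than taking the minimum. Here is the table above in Python (the model labels are just my shorthand):

```python
# AIC for each of the four models from the table above; smaller is better.
models = {
    "intercept only": 193.107,
    "education & group": 178.488,
    "education, group & pretest": 141.25,
    "education, group, pretest & computer literacy": 142.83,
}
best = min(models, key=models.get)
# best == "education, group & pretest"
```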
As they will tell you in everything decent ever written on the topic, Akaike’s Information Criterion (see, it IS fun to say) cannot give you a good model. It can just tell you which of the models you have is the best. So, if they all suck, it will pick out the one that sucks less.
Speaking of decent things written on AIC, I recommend
Making sense out of Akaike’s Information Criterion.
Also, it just so happens that the model I selected did not suck, based on such criteria as the percent of concordant and discordant pairs, but I don’t have time for that right now as I must take the world’s most spoiled twelve-year-old to her drum lesson and then drive to Las Vegas, not for the Consumer Electronics Show but to see my next-to-youngest at the Orleans Casino in her last amateur fight before she goes professional next month.
I read an article on Salon.com today by a stay-at-home mom who was regretting her decision. Thanks to her, I feel no guilt about writing this blog post before drum lessons instead of making my child a home-cooked meal.
The world’s most spoiled twelve-year-old is also grateful because she got chocolate and glazed doughnuts for supper.
Yet, despite my lack of parenting skills, my children nevertheless continue to survive and even frequently thrive. Yes, it amazes all of us.