### Jan

#### 6

# Choosing models that suck less: Akaike is more than just fun to say

Filed Under statistics | 6 Comments

*I’m on Twitter a lot, and more to the point, I read a whole lot of blogs and web pages, all of which point to three related questions:*

- Why do I so seldom read anything on how to DO predictive analytics or modeling from people who are always tweeting how these are (** Drum roll **) – THE WAVE OF THE FUTURE?
- Even in the small minority of people on the planet who are writing about analytics, there is an even smaller minority who actually explain statistical concepts underlying those techniques. Is this because they don’t think these are important to know or because they have just given up on getting anyone to care?
- How the hell do people get time to spend all day on Twitter and posting on blogs? Don’t they have jobs?

Well, I do have a job, but today has been a kick-ass rocking awesome day when all of my programs ran and my output was interpretable. This followed an equally good day yesterday when my program did not run perfectly, but well enough to do what the client wanted. So, life under the blue skies is just pretty damn great. Sorry if you live some place it snows. Sucks to be you.

I was taking a break this morning reading a book, Advanced Statistics with SPSS (or PASW or whatever they are calling it these days), and on page 42 I came to this line,

“The ANOVA method (for variance component options) sometimes indicates negative variance estimates, which can indicate an incorrect model … ”

and I thought,

“Yeah, duh!”

and then I stopped because I could think of several people off the top of my head to whom that would not be obvious. So, let’s start here.

Variance is the degree to which things vary from each other. Some people, including me, consider science to be the search for explained variance. Why do some people score high on a test while others score low? Why do birds fall out of the sky in Arkansas in January but not in California?

We calculate variance by taking each value’s difference from the mean (average), squaring it, adding up the squares (hence the amazingly popular term in statistics, Sum of Squares) and dividing by the number of values. Let’s say we have a population of people with a very rare disorder that causes them to become stuck to the walls of large aquariums. There are only three such people in the world. You can see them here. Any resemblance of the smaller one to the child pictured in the swing above is purely coincidental.

The mean height of the population is 4.5 feet. One of our sufferers is exactly 3 feet tall. The difference between her and the mean is -1.5, which squared is 2.25. Since the differences squared will never be negative, the sum of squares will never be negative. You can’t have a negative square. You can’t have a negative sum of squares. Since the variance is the sum of squared numbers divided by the number of people in your population, the only way it could possibly be negative is if you had a negative number of people in your population. That doesn’t make sense, though, does it? I mean, the lowest number you can have in your species/ population/ study is one. Don’t write and tell me you can have zero because you can’t. If you have zero, you don’t have a study, you just have a wish for a study that never happened.
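Here is a minimal sketch of that arithmetic in Python. Only the 3-foot sufferer and the mean of 4.5 feet come from the example above; the other two heights are made up so that the mean works out.

```python
# Hypothetical heights in feet. Only the 3.0 and the mean of 4.5 come from
# the post; the other two values are assumptions chosen to give that mean.
heights = [3.0, 4.5, 6.0]

mean = sum(heights) / len(heights)

# Squared difference from the mean for each person; a square is never negative.
squared_diffs = [(h - mean) ** 2 for h in heights]

# Sum of squares: add up the squared differences.
sum_of_squares = sum(squared_diffs)

# Population variance: sum of squares divided by the number of people.
variance = sum_of_squares / len(heights)

print("mean:", mean)
print("squared differences:", squared_diffs)
print("sum of squares:", sum_of_squares)
print("variance:", variance)
```

Every entry in `squared_diffs` is zero or larger, and `len(heights)` is at least one, so there is no way for `variance` to come out negative.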

So… lesson number one. If you have a negative variance or a negative sum of squares of any type, your model totally blows. It makes no sense and you should not use it for anything.

(I once worked for a large organization where a middle manager weenie was quite aghast at the way I explained statistics. She stormed over to me in outrage and said,

“This is a professional setting! I cannot think of a single situation in my twenty years here that using “blow” in a sentence is appropriate.”

I said,

“I can. Blow me.”

Subsequently, my boss maintained admirable composure as he promised her that he would speak to me severely about my attitude. )

How to tell if your model sucks less

Only really, really terrible models have a negative variance. Let’s say your model just kind of sucks. And you would like to know if a different model sucks less. Here is where the Akaike Information Criterion comes in handy. You may have seen it on printouts from SAS, SPSS or other handy-dandy statistical software. You don’t recall any such thing, you say? That is what AIC stands for. Go back and look through your output again.

Generally, when we look at statistics like an F-value, t-value, chi-square or standardized regression coefficient, we are used to thinking that bigger is better. With AIC, it is the other way around: the model with the smaller AIC is the better one. In fact, it is so easy to get confused that some of the newer versions and newer procedures (for example, SAS PROC MIXED) tell you specifically on the output that smaller is better.
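If you wonder why smaller is better, it helps to see the usual definition, AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. Below is a minimal sketch in Python. The log-likelihoods and parameter counts are invented for illustration and are not from the study discussed here.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical log-likelihoods and parameter counts, made up for illustration.
baseline = aic(log_likelihood=-95.0, n_params=1)
better_fit = aic(log_likelihood=-85.0, n_params=3)
barely_better = aic(log_likelihood=-84.5, n_params=4)

print("baseline:", baseline)
print("better fit:", better_fit)
print("barely better fit, one more parameter:", barely_better)
```

A higher likelihood pulls AIC down, while every extra parameter pushes it up by 2. In this made-up case the last model fits slightly better than the one before it, but not by enough to cover the cost of its extra parameter, so its AIC is higher.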

Let’s take a few models I have lying around. All of them are from an actual study where we trained people who provide direct care for people with disabilities. We wanted to predict who passed a test at the end of the training. We include two predictors (covariates), education and group (trained versus control).

SAS gives us two AIC model fit statistics

Intercept only: 193.107

Intercept and Covariates: 178.488

We are happy to see that our model has a lower AIC than just the intercept, so we are doing better than nothing. However, we are sad to see that while education is a significant predictor (p < .001), group is not (p > .10 ). Since we have already spent the grant money, we are sad.

At this point, one of us (okay, it was me), gets the brilliant idea of looking at that screening test we gave all of the subjects. So, we do a second model with the screening test.

We see that our screening test is significantly related to whether they passed (p < .0001), education is still significant (p < .001) and, joy of joys, group is also significant (p < .05).

Let’s look at our two AIC model fit statistics:

Intercept only: 193.107 (still the same, of course)

Intercept and Covariates: 142.25

Not only is our model now much better than the intercept alone, but it is also much better than our earlier model that didn’t include the screening test.

Won’t you always get a better fit to the data when you add a new variable?

No. AIC charges a penalty for every parameter you add, so a new variable has to improve the fit by enough to pay its way.

Okay, fine, you want another example? This training was a combination of on-line and classroom training. We thought perhaps people who were more computer proficient would benefit more. We included in our third model a scale that included their use of email, Internet access and whether they had a computer at home. Here are our final results:
**Akaike Information Criterion (AIC)**

Intercept only: 193.107

Intercept, Education & group: 178.488

Intercept, Education, group & pretest: 141.25

Intercept, Education, group, pretest & computer literacy: 142.83

The third option, the model with education, group and pretest, is the best of our four options (one of the options being to say the hell with using anything to predict).
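The comparison above is nothing more than picking the smallest number. Here is a minimal sketch in Python using the four AIC values from the table. The dictionary and variable names are mine, not anything SAS produces.

```python
# AIC values as reported in the table above.
aic_by_model = {
    "Intercept only": 193.107,
    "Intercept, Education & group": 178.488,
    "Intercept, Education, group & pretest": 141.25,
    "Intercept, Education, group, pretest & computer literacy": 142.83,
}

# Smaller is better, so the best model is the one with the minimum AIC.
best_model = min(aic_by_model, key=aic_by_model.get)

# How far each model's AIC is from the best one (0 for the winner).
delta = {
    name: round(value - aic_by_model[best_model], 3)
    for name, value in aic_by_model.items()
}

for name, value in sorted(aic_by_model.items(), key=lambda item: item[1]):
    print(f"{value:8.3f}  (+{delta[name]:.3f})  {name}")

print("Sucks least:", best_model)
```

Note that the script only ranks the models it is given. It has no opinion on whether the winner is any good.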

As they will tell you in everything decent ever written on it, Akaike’s Information Criterion (see, it IS fun to say) cannot give you a good model. It can just tell you which of the models you have is the best. So, if they all suck, it will pick out the one that sucks less.

Speaking of decent things written on AIC, I recommend

Making sense out of Akaike’s Information Criterion.

Also, it just so happens that the model I selected did not suck, based on such criteria as the percent of concordant and discordant pairs, but I don’t have time for that right now as I must take the world’s most spoiled twelve-year-old to her drum lesson and then drive to Las Vegas, not for the Consumer Electronics Show but to see my next-to-youngest at the Orleans Casino in her last amateur fight before she goes professional next month.

I read an article in Salon.com today by a stay-at-home mom who was regretting her decision. I am grateful to her that I do not feel guilt about writing this blog post before drum lessons instead of making my child a home cooked meal.

The world’s most spoiled twelve-year-old is also grateful because she got chocolate and glazed doughnuts for supper.

Yet, despite my lack of parenting skills, my children nevertheless continue to survive and even frequently thrive. Yes, it amazes all of us.

### Jan

#### 5

# Never Believe the User

Filed Under Software | 2 Comments

*You know that guy, supposedly a program, in Tron, the one that yells,*

“I serve the user”.

*Well, he never met the first lead engineer I worked with.*

Reading Donald Farmer’s post “Is it really so?”, I was reminded of something that happened decades ago and it was a lesson I never forgot.

I was responsible for maintaining a program for inventory control, written in some language no one uses any more. It ran monthly, but sometimes we needed a special run in the middle of the month when the clouds would part and some Very Important Person would request an up-to-the-minute report.

The clerk only needed to submit the JCL (Job Control Language) by opening a file, typing SUBMIT at the command line and the program would run. Well, it didn’t run and I got a call. My first question was,

“Did you change ANYTHING in the JCL file?”

She insisted she had not changed anything.

I am sure any experienced programmer can see where this is going, but I was only in my early twenties and still trusting. I spent hours reading over the code (this was not a simple program). I tried everything and could not find a thing wrong. I took it to our senior engineer, who asked me whether I had reviewed the JCL. I said,

“No, I didn’t bother because she told me she hadn’t changed anything.”

He said,

“That was your first mistake. Never believe the user.”

I assured him that she was a very nice person who would never lie to me. Hey, we’d even gone out for drinks after work together, and she’d tried to fix me up with a friend of hers. He just shook his head.

While he was standing at my desk I opened the JCL file and it was an unbelievable mess. I called and asked her what the hell happened and she said, (I am not making this up),

“Well, I didn’t change any of the words, but there were a lot of extra commas in there and I learned in secretarial school that was wrong, so I took out all the commas.”

I don’t know what was better, the look on my face when she said this or the look on our lead engineer’s face as he nearly choked to death trying not to laugh.

### Jan

#### 4

# Software Books I Want That I Have Not Got

Filed Under Software, statistics | 5 Comments

*The new year is a popular time for blogs to give lists of favorite books one read over the last year. Reading several of these posts did not inspire in me any desire to update my Amazon wish list. Novels really aren’t my cup of tea. I don’t care about any girls who knocked over bee hives or whatever.*

I was thinking this morning about books that I would like to have read if they existed, or maybe books I did read that I would like to have been written differently. Lately, I have read several hundred pages of documentation of SAS software. Stata documentation, by the way, is written exactly the same, only more so.

I had the 224-page PROC MIXED book excerpt on my desktop, so I just opened a random page in the first twenty pages and here is what it says (click to see larger font, as if that will help – ha!) :

Now, maybe I am just grumpy because I have to teach this stuff to graduate students who generally don’t want to learn it, and professionals who do want to learn it but have rather unreasonable expectations, like being made an expert by Thursday.

That being said, the reaction of the average student is generally along the lines of, and I may be paraphrasing here (or not):

“Are you fucking kidding me?”

I’d like a book that for the first 20 pages provided a general description of the procedure and when it is used, and compared and contrasted it with other procedures. The next 100 pages would give examples of appropriate uses of mixed models (or whatever the particular procedure happened to be) with the appropriate code after each one. The book would introduce, say, the Akaike Information Criterion, and show how it could be used to compare models, using one model with several predictor variables and then a second model without one of those variables.

The examples used would be real ones with real data. Picking mixed models again, the first example in the SAS manual is predicting height from the variables family (with a random sample of families) and gender. These are good variables from the standpoint of an example of random effects (randomly sampled from all possible families) and fixed effects (gender having two fixed levels, male and female). However, as I read this example, I tried to think of any possible scenario in which it would matter to predict height from these two variables. I failed. Perhaps it would matter if one were a biologist who had discovered a new species, say, the Pine-baby Tree, and wanted to determine if the male of the species was significantly larger than the female.

(As no expense is spared in the researching of this blog, a photo of the Pine-baby Tree in its natural environment of living room sofas next to smart phones, is included. I had to brave suburbia to take this picture. You’re welcome.)

My complaint, as is the complaint of the 50% of students who begin majoring in science and then switch majors, is that the examples presented early on are not in any context. I know this demand is hard on the authors, because you are asking for an example that is simple for someone new to a language or procedure to understand, general enough that it will make sense to the majority of readers and at the same time a real world application.

This challenge is addressed in an interesting way by a book I’m reading now, **Beginning Ruby: From Novice to Professional.** The author starts off with the example of Pets as a class and then discusses dog, cat and snake as subclasses and gets into the issue of inheritance. Now, it no doubt helped that I already knew what classes and inheritance were (as well as knowing about pets, dogs, cats and snakes) but it also helps that he continually draws specific generalizations:

*“Now, you can see how this would apply if the class was Person or Tickets.”*

One could argue that the Ruby book is more of a textbook or self-teaching tool while the SAS documentation is meant for reference, like the Unix man pages (man as in manual, not as in only meant for men). However, SAS is unlike Unix in that, for Unix, one can find lots of well-written, helpful books.

For statistical software, once you get past the most basic statistics (for which there are some good books available), all of the books and articles I read seem to follow the same frustrating format – a few pages of introduction, if any, and then pages of formulas, with 20 pages at the end of stuff I really need to know.

I feel like someone who wants to drive from Los Angeles to San Francisco and the first 195 pages of the map are a discussion of the manufacture, operation and quality testing of internal combustion engines. A few pages mixed in there are important points about how you have to put gas in when the gauge is near empty, what windshield wipers do, and so on. Somewhere else in there are all of the possible routes one can take to go anywhere in the United States, one of which includes going from Los Angeles to San Francisco with different routes through all California cities of over 50,000. At the end of the book is an example of driving from San Diego to Sacramento. However, since you don’t know which and where are those important things like putting in gas, you have to read the entire book, making you two days late for your meeting in San Francisco.

Let me give a real-life example for statistics, since I just complained about people not doing that. If you are using PROC LOGISTIC, GLM or MIXED, you need to use a CLASS statement to define your categorical variables. For example, I used five different schools where I administered an experimental training program. At each I had an experimental and control group.

If I did this:

`Proc mixed data = mystudy ;`

`model score = group school ;`

I would get an error message because school is not a numeric variable and therefore needs to be specified in the CLASS statement. The fix is to add `class school ;` before the MODEL statement. That’s the sort of thing you need to know up front.

The discussion of the asymptotic covariance matrix and what the ODS object name is for it, well, that can wait (AsyCov, if you really just couldn’t).

I’d like to have read about ten books like that in 2010 but Santa didn’t bring me any for Christmas. If you have any to recommend, I’d be extremely grateful.

### Jan

#### 1

# People who annoy me: Mathematicians who pretend to be statisticians

Filed Under statistics | 3 Comments

The first course I ever took in statistics was in the math department, over thirty years ago, and Dr. Spitznagel, at Washington University in St. Louis, taught me a good deal despite my best efforts, assisted by Fraternity Row, to major in partying (please don’t tell my mom). So, math people, thanks for that.

HOWEVER … please, please, please do me a favor and realize that mathematics is not statistics.

On the list of people who annoy me, mathematicians who pretend to be statisticians come in at number two. If you are giving an explanation of any statistical concept and spend three-fourths of your article or more on deriving the equations to obtain the results and skim over in the remaining few pages the interpretation of the output and the situations in which it should be applied, then you are a mathematician. That’s a perfectly fine thing to be and you should teach courses in mathematics and we will both be happy.

As the official word-chooser of this blog, I am going with the Wikipedia definition of statistics as “the science of the collection, organization, and interpretation of data”.

I just read an article where DOZENS of pages into equation after equation, the author gave an example with values for specificity and sensitivity, without explaining either one. This was the first bit of information in the whole paper that one could actually apply and yet he didn’t even tell you what it was. No wonder people hate math and statistics! I would, too, if this is how it was explained to me every day.

If I were me, which, I, in fact, am, I would start explaining logistic regression by discussing when you would use it and a bit about how to interpret the overall model. Since I am, in fact, me, I did that several days ago.

Next, I would give a bit of information on useful ways of interpreting your logistic regression results:

Two useful measures are sensitivity and specificity.

**Sensitivity is the percent of actual positives you correctly predicted,** for example, the percentage of people who actually died whom you predicted would die. (Only in statistics could this be considered a positive outcome.)

**Specificity is the percent of actual negatives you correctly predicted,** for example, the percentage of people who survived whom you predicted would NOT die.

There, now, see how easy that was? You’re welcome.
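If you would rather see it as arithmetic, here is a minimal sketch in Python using the standard definitions: sensitivity = true positives / (true positives + false negatives), and specificity = true negatives / (true negatives + false positives). The counts are hypothetical, invented purely for illustration.

```python
def sensitivity(true_pos, false_neg):
    """Share of actual positives that the model predicted as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of actual negatives that the model predicted as negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts from a made-up classification table.
true_pos = 40   # died, and we predicted they would die
false_neg = 10  # died, but we predicted they would survive
true_neg = 45   # survived, and we predicted they would survive
false_pos = 5   # survived, but we predicted they would die

sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)

print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
```

Both denominators are counts of what actually happened (everyone who died, everyone who survived), not counts of what you predicted.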