This is the third and last part of my attempt to explain logistic regression in pictures. You can see a picture of odds ratios here,  and a picture of two charts of predicted probabilities, to compare models, here.

If people only know one chart associated with logistic regression, it is usually the ROC chart, though many of them cannot tell you what ROC stands for (not that it really matters) or how to interpret the chart – which kind of does matter, because it’s useful.

ROC is an abbreviation for receiver operating characteristic (I told you it didn’t matter). The ROC curve is a plot of

SENSITIVITY – the true positive rate: of the people who did die, the percentage we predicted would die, and

SPECIFICITY – the true negative rate: of the people who did NOT die, the percentage we said would not die

We actually plot (1 – specificity) by sensitivity. If we predicted no one would die, our rate of true negatives would be 100%. Since we predicted nobody would die, we would be exactly right for all of the people who didn’t die. 1 – 1.0 = 0  so we’d be at 0 on the X axis.

On the other hand, we’d have zero sensitivity. Since we predicted no one would die, we would have zero true positives.

At the other extreme, if we predicted everyone would die, we would have 100% true positives and 0 true negatives. Since 1-0 = 1 , that would be at the upper right corner here.

The straight line is what we would get without any predictor variables, if we just randomly guessed whether a person would live or die. The top left corner, where we have correctly predicted all of our positives and all of our negatives is what we would get in a perfect model.

The more that curve is bowed toward the top left and away from the straight line, the better our model.
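If it helps to see the corners of the chart as numbers, here is a minimal Python sketch. The ten outcomes are made up for illustration (they are not the Kaiser-Permanente data), and `roc_point` is just a hypothetical helper name.

```python
def roc_point(actual, predicted):
    """Return (1 - specificity, sensitivity) for 0/1 outcomes and 0/1 predictions."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    true_neg = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    positives = sum(actual)               # people who actually died
    negatives = len(actual) - positives   # people who actually lived
    sensitivity = true_pos / positives
    specificity = true_neg / negatives
    return (1 - specificity, sensitivity)

# Hypothetical outcomes: 1 = died, 0 = did not die
died = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

nobody_dies = [0] * len(died)     # predict that no one dies
everybody_dies = [1] * len(died)  # predict that everyone dies
perfect = list(died)              # predict every outcome correctly

print("nobody dies:   ", roc_point(died, nobody_dies))     # lower left corner
print("everybody dies:", roc_point(died, everybody_dies))  # upper right corner
print("perfect model: ", roc_point(died, perfect))         # top left corner
```

A real ROC curve is what you get by sliding the cutoff on the predicted probability from one extreme to the other and plotting a point like these for each cutoff.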

Curve with substantial bow toward the upper left corner

Let’s take a look at our actual curve from the Kaiser-Permanente data, where we used gender, age, number of emergency room visits and nursing home residence (yes or no) to predict whether or not a person would die within the next nine years.

From this, we can conclude that our model is substantially better than random guessing – a conclusion that is consistent with what we saw in our previous charts. We can also see that there is definitely room for improvement. Perhaps future research could improve prediction by including behavioral risk indicators such as amount of alcohol and tobacco usage, as well as socioeconomic status and diagnosis of chronic illness.

So, there you have it  – logistic regression in three blog posts and four pictures.




I tweeted that I believed one could explain logistic regression results in three or four charts, and Alberta Soranzo tweeted back,

“Try me.”

Challenge accepted. 

These data are taken from the Kaiser-Permanente study of the Oldest Old,  a sample of 5,986 people who were aged 65-95 when recruited into the study. Participants were followed for nine years, or until they died, at which time it was pointless to continue following them as they weren’t going anywhere.

Last post I showed the predicted probabilities charts for two different models. I pointed out that it was quite clear the first model was superior. Using the same sample of older adults, nursing home status and gender were much better at predicting who died than were race and alcohol consumption in the past year (coded only as if a person drank alcohol or not).

Nursing home status and gender are better than those other variables, but are they actually any good? Are they statistically significant? Is the effect substantial?

The next chart to examine is the odds ratios with 95% confidence limits.

odds ratio charted for each variable

If the odds of living vs dying are equal for people in a nursing home and not in a nursing home, then the odds ratio will be 1.0.  If the odds of people dying are LESS for people who are not in a nursing home (NO vs YES) then the odds ratio will be less than 1.  As you can see, the odds of people not in nursing homes dying are considerably less than those who are in nursing homes. Females have lower odds of dying than males. Being in a nursing home (or not) is a better predictor of dying within the next nine years than is gender.

The dots on the chart are the odds ratio for each variable and the bars extend across the 95% confidence interval. If the bars cross 1.0 then the odds being equal is a value that falls within the 95% confidence limits  – or, in other words, the predictor is not statistically significant.
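The "does the bar cross 1.0" check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard large-sample (Wald) interval for a single 2x2 table rather than the standard errors from a fitted multi-predictor model, it borrows the made-up marriage counts from the marriage example elsewhere in this series (90 of 100 computer scientists married, 45 of 100 French literature people), and `odds_ratio_with_ci` is a hypothetical helper name.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% limits from 2x2 counts.

    a, b = events / non-events in group 1
    c, d = events / non-events in group 2
    """
    odds_ratio = (a / b) / (c / d)
    log_or = math.log(odds_ratio)
    # Standard error of the log odds ratio for a 2x2 table
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper

# Made-up counts: married / not married, computer science vs French literature
odds_ratio, lower, upper = odds_ratio_with_ci(90, 10, 45, 55)

# The predictor is "significant" in the chart's sense if the bar misses 1.0
crosses_one = lower <= 1.0 <= upper
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2), crosses_one)
```

The interval is built on the log scale and then exponentiated, which is why the bars on an odds ratio chart are not symmetric around the dot.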

You can also see from this chart that all four of our predictors are significant. You can also see that people who are older and have more visits to the emergency room are more likely  to die.



If you have a mad desire to do logistic regression with SAS On-Demand with SAS Enterprise Guide, here is a movie that shows how to do it. It is a .avi file so you may want to just download it and run it on your PC.

Here is why the movie is not all that good – Grrr – SAS On-Demand does not run on a Mac. Unfortunately, the Mac version of Quicktime does screen capture video, but on Windows only the professional version does that. I used Debut Video Capture on Windows, which I actually paid for. I made one movie, made a mistake in the middle of it and the guinea pigs were raising a ruckus because they wanted parsley. You could hear them squeaking all through it.  So, on the second try, when I was doing logistic regression, the sound track was about 15 or 20 seconds ahead of the video! So, as you were listening to the video, you were seeing something different on the screen! That was annoying. So … this third video is a bit sparse.

I also ran tasks before I did the video so I did not have to wait forever for them to run. I ran it on this old, old windows machine we use for testing because I did not want to take the time to re-boot my shiny new 12GB RAM Mac into Windows. That was stupid. It would have been quicker to re-boot the Mac than re-do the movie twice. Also, my Mac has a wired Internet connection so it is much faster all the way around.

Lessons learned today in addition to logistic regression.

1. When using SAS On-Demand, use the fastest computer

2. When using SAS On-Demand, use the fastest Internet connection

3. Get the Windows version of Quicktime to replace Debut Video Capture (there are other reasons  I don’t like it, chief among them being the default format is .avi and if you change it to some other format, it does NOT remember that)

I have had SO much more of a positive experience with the SAS Web Editor – runs on a Mac, faster, no install problem – that I wonder why I ever used SAS Enterprise Guide to begin with. Actually, if you are running it in your office or home with a good Internet connection on a good computer, it’s not too bad. Not only is the lack of programming attractive to many students but from a learning statistics standpoint the fact that it kind of is in your face with the “Dependent variable”, “Classification variable”, “Quantitative variable” distinctions is kind of nice.

Most of all, though, I remembered how clunky SAS Enterprise Guide for the desktop was when it first came out and now I find it very useful, so I am HOPING this will be the direction for SAS On-Demand EG as well. Personally, the single biggest improvement I hope for is that it starts to run on the Mac. The simplest way for that to happen would be if it just ran as a client like the web editor does. Here’s hoping.





When people ask me what type of statistical software to use, I run through the advantages and disadvantages, but always conclude,

“Of course, whatever you choose is going to give you the same results. It’s not as if you’re going to get a F-value of 67.24 with SAS and one of 2.08 with Stata. Your results are still going to be significant, or not, the explained variance is going to be the same.”

There are actually a few cases where you will get different results and last week I ran across one of them.

A semi-true story

While I was under the influence of alcohol/ drugs that caused me to hallucinate about having spare time during the current decade, I agreed to give a workshop on categorical data analysis at the WUSS conference this fall . After I sobered up (don’t you know, all of my saddest stories begin this way), I realized it would be a heck of a lot easier to use data I already had lying around than go to any actual, you know, effort.

I had run a logistic regression with SPSS with the dependent variable of marriage (0 = no, 1 = yes) and independent variable of career choice (computer science or French literature ). There were no problems with missing data, sample size, quasi-complete separation, because like all data that has no quality issues, I had just completely made it up. I thought I would just re-use the same dataset for my SAS class.

So, here we have the SPSS syntax



/CONTRAST (Career)=Indicator

/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

As I went on at great boring length in another post, if you take e to the parameter estimate B, (  Exp(B) in other words) you get the odds ratio for computer scientists getting married versus French literature majors, which are 11 to 1.

Also, I don’t show it here but you can just take my word for it,

the Cox & Snell R-square was .220 and the Nagelkerke R-square was .306 .

If you are familiar with Analysis of Variance and multiple regression, you can think of these as two different approximations of the R-squared and read more about pseudo R-squared values on the UCLA Academic Technology Services page.

So, I run the same analysis with SAS, same data set and again, I just accept whatever the default is for the program.
proc logistic data = in.marriage    ;
class cs  ;
model married = cs /   expb  rsquare;
run ;

If you look at the results, you see there is an

R-squared value of .220 and something called a Max-rescaled R-squared of .306

Okay, so far so good, but what is this?

Output from logistic regression

For our parameter estimates for both the intercept and our predictor variable we get completely different values, and, in fact, the relationship with career choice is NEGATIVE, but the Wald chi-square and significance level for the independent variable are exactly the same. (This is what we care about most.)

The odds ratio is different, but wait, isn’t this just the inverse? That is .091 is 1 /11  so SAS is just saying we have 1:11 odds instead of 11:1

Difference number 1: By default, SAS models the probability of the lower value of the dependent variable, in this example NOT being married.

That’s easy to fix. I do this:

Title "Logistic - Default Descending" ;
proc logistic data = in.marriage descending   ;
class cs  ;
model married = cs /   expb  rsquare;
run ;

This is a little better. The two R-squared values are still the same, the odds ratio is now the same, and at least the relationship between the CS variable and marriage is now positive. You can see the results here or the most relevant table is pasted below if you are too lazy to click or you have no arms (in which case, sorry for my insensitivity about that and if you lost your arms in the war, thank you for your service <– Unlike everything else in this blog, I meant that.)

Output with DESCENDING option on Proc Logistic statement

BUT, the parameter values are still not the same as what you get from SPSS and Exp(B) still does not equal the odds ratio.

Since actual work is calling me, I will give you the punch line thanks very much to David Schlotzhauer of SAS,

“If the predictor variable in question is specified in the CLASS statement with no options, then the odds ratio estimate is not computed by simply exponentiating the parameter estimate as discussed in this usage note:

If you use the PARAM=REF option in the CLASS statement to use reference parameterization rather than the default effects parameterization, then the odds ratio estimate is obtained by exponentiating the parameter estimate.  For either parameterization the correct estimates are automatically provided in the Odds Ratio Estimates table produced by PROC LOGISTIC for any variable not involved in interactions.”

So, the SAS code to get the exact same results as SPSS is this (notice the PARAM = ref option on the class statement)

Title "Logistic Param = REF" ;
proc logistic data = in.marriage descending   ;
class cs/ param = ref ;
model married = cs /   expb  rsquare;
run ;

You can see the output here.

Did you notice that the estimate with the PARAM = REF (the same estimate as SPSS  produces by default)  is exactly double the estimate you get by default with the DESCENDING option? That can’t be a coincidence, can it?

If you want to know why, read the section on odds ratios in the SAS/STAT User Guide section on the LOGISTIC Procedure. You’ll find your answer at the bottom of page 3,952  (<— I did not make that up. It’s really on page 3,952 ).
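If you would rather see the doubling than look up page 3,952, here is a rough Python sketch. It assumes the default effects parameterization codes the two career levels as +1 and -1 (which level gets +1 depends on the sort order, so the sign may flip), while reference parameterization codes them 1 and 0. It uses the made-up counts behind this dataset (90 of 100 computer scientists married, 45 of 100 French literature people); the variable names are mine, not SAS output.

```python
import math

# Log odds of being married in each group, from the made-up counts
logit_cs = math.log(90 / 10)
logit_french = math.log(45 / 55)

# Reference coding: x = 1 for computer science, 0 for French literature.
# The slope is the whole difference between the two log odds.
b_ref = logit_cs - logit_french
intercept_ref = logit_french

# Effects coding: x = +1 for computer science, -1 for French literature.
# The two groups sit two units apart on x, so the slope is half the difference.
b_effect = (logit_cs - logit_french) / 2
intercept_effect = (logit_cs + logit_french) / 2

print(round(b_ref, 3), round(b_effect, 3))

# Exponentiating the reference-coded slope gives the odds ratio directly;
# with effects coding you have to exponentiate twice the slope.
print(round(math.exp(b_ref), 2), round(math.exp(2 * b_effect), 2))
```

Both codings describe exactly the same two groups, which is why the Wald chi-square and the Odds Ratio Estimates table do not change.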




All my programs are working today and I am sad.

Fortunately for everyone else, but unfortunately for me today, SAS has increasingly automated or semi-automated fixing errors. It’s unfortunate for me because I wanted to talk about errors and how to fix them. I could create a simulation dataset but I hate doing that. I think if whatever the issue is occurs often enough to bother talking about it, you ought to be able to find a real dataset it applies to without going to the extreme of making one up.

Ever notice how programs like SAS have 300-page manuals for something that you can code in three statements, like logistic regression? How does that make any sense?

How it makes sense is that coding those three or five or seven statements correctly, understanding your output like Nagelkerke pseudo-R-squared and fixing many of the errors that you encounter require understanding a lot of terms and some underlying mathematics.

On the other hand (where you have different fingers), very commonly the errors that occur when you are first learning a language have nothing to do with a Hessian matrix that is not positive definite and everything to do with having misspelled a word.

Years ago, I used to say that I could make a billion dollars if I could come up with a language that did what I meant to tell it instead of what I told it. SAS did this a version or two back. They also made a billion dollars. So I was right, but they still didn’t give me any of it. How rude!

(There is a lesson in here to silly young people who ask for NDAs, by the way. What is worth the billion dollars is not the idea, it’s the implementation.)

Now if you type DAAT instead of DATA, your program will run anyway with a polite note in your SAS log telling you that it has assumed you meant DATA and went ahead and executed based on that assumption, but if not, hey, feel free to let it know. (Am I the only one who feels this bears a little creepy resemblance to the happy doors in the Hitchhiker’s Guide to the Galaxy?)

So now, with almost no fanfare whatever, SAS has gone on and done this with its statistical procedures as well. There is the ODS graphics, which guesses what diagnostic graphs you would probably want, and there are also automatic self-correcting mechanisms.

I TRIED to get PROC LOGISTIC to screw up by doing some of the basic errors I see and here is what happened. Just so you know, the dependent variable in all cases except #3 was whether the person was employed, coded as 0 for no, 1 for yes.

1. I used two variables that were perfectly correlated as independent variables. Whether the subject had difficulty in job training was coded as 1 or 0, for a variable I named “difficulty”. Whether the person found job training easy was coded as

ease = 1 - difficulty ;

I see people do this when they are unfamiliar with the concept of dummy variables. In fact, (I’m not making this up), they think I am insulting them when I tell them that their problem is with having too many dummy variables.

What did SAS do?

Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

ease = 0.5 * Intercept + 0.5 * difficulty0

So, it dropped the redundant parameter, then ran with the corrected code.
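That note in the log is easy to check by hand. The sketch below assumes that `difficulty0` in the SAS note is the effects-coded design column for difficulty (+1 when difficulty is 0, -1 when difficulty is 1) and that the intercept column is always 1; that reading is my assumption, not something the log states.

```python
# Check ease = 0.5 * Intercept + 0.5 * difficulty0 for both possible values
for difficulty in (0, 1):
    ease = 1 - difficulty                         # how the variable was created
    intercept = 1                                 # intercept column is all ones
    difficulty0 = 1 if difficulty == 0 else -1    # assumed effects coding
    rebuilt = 0.5 * intercept + 0.5 * difficulty0
    print(difficulty, ease, rebuilt)
    assert ease == rebuilt
```

Since ease can be rebuilt exactly from columns already in the model, it carries no new information, and there is nothing left for its parameter to estimate.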

2. I created a constant, that I creatively named const.

const = 5 ;

This happens to people usually when their sample size is too small or restricted. Say one of your variables is divorced, coded 0 or 1. The variable really does vary in the population. However, if your sample is of high school students, very few of them are divorced and you would need a large sample size (or luck) to have any variation. If your sample size is 15 people, since only 10.4% of the U.S. population is divorced, it is perfectly possible you might not have anyone who is divorced.
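How possible is "perfectly possible"? A quick back-of-the-envelope calculation in Python, assuming the 15 people are an independent random draw from a population where 10.4% are divorced (real samples rarely are, so treat this as a rough sketch):

```python
def prob_no_one(rate, sample_size):
    """Chance that nobody in the sample has the characteristic."""
    return (1 - rate) ** sample_size

p_none_divorced = prob_no_one(0.104, 15)
print(round(p_none_divorced, 3))
```

That works out to roughly a 19% chance of ending up with a "variable" that is a constant.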

So, what does SAS do in this situation?

Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

const = 5 * Intercept

Didn’t really think about that, did you? A constant is just the intercept multiplied by some number. So, SAS goes ahead and runs, as if your code did not have that variable in the equation.

3. Okay, now I’m getting pissed. I set the dependent variable to be equal to a constant where everyone has a job. Now it finally does give me an error.

ERROR: All observations have the same response. No statistics are computed.
NOTE: The SAS System stopped processing this step because of errors.

You might think,

“Why didn’t it just say that the dependent variable was a constant?”

In its computer brain, it did. Remember, in logistic regression, the dependent is called the response variable. (That was in that 300-page manual.)

4. Finally, I am getting really annoyed trying to create an obscure error message that would justify having invested the time to read a 300-page manual (not to mention all of those statistics courses in graduate school). I create a variable that has very little variance.

if _n_ < 10 then wrong_job = 0 ; else wrong_job = 1 ;

With the result that the first nine people have no job while the other 470 or so do. I finally get SAS to run and give me a kind of obscure message:

Model Convergence Status
Quasi-complete separation of data points detected.

Warning: The maximum likelihood estimate may not exist.
Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

Just to make sure you don't overlook this message by skipping over all of the other tables to the end and looking at the hypothesis tests and seeing if you have significance (oh, yes, both SAS and I have met people like you before), it helpfully prints this heading on EVERY SINGLE PAGE for the remainder of the output.

WARNING: The validity of the model fit is questionable.

AND, in my log it adds, just for good measure and a little extra nagging:

WARNING: There is possibly a quasi-complete separation of data points. The maximum likelihood
estimate may not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on
the last maximum likelihood iteration. Validity of the model fit is questionable.

If it was my grandmother, it would have added,

And if you go ahead with what you want to do any way after I warned you, on your head be it!

I expect to see that included in my log with SAS version 9.3

As for quasi-complete separation, you can find that on page 84 of the SAS/STAT 9.22 User's Guide: The Logistic Procedure. No, that's not on page 84 of the guide, that's on page 84 of the LOGISTIC PROCEDURE SECTION of the guide.

There is also a really good article by Paul Allison on complete and quasi-complete separation, called "Convergence Failures in Logistic Regression".
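For a feel of why "the maximum likelihood estimate may not exist", here is a small Python sketch with toy data I made up (it is not the employment dataset). The predictor separates the outcomes perfectly except for a tie at x = 0, which is the quasi-complete case, and the model is a one-coefficient logistic regression with no intercept to keep it short.

```python
import math

# Toy data: every x < 0 has y = 0, every x > 0 has y = 1, and x = 0 has one of each
xs = [-2, -1, 0, 0, 1, 2]
ys = [0, 0, 0, 1, 1, 1]

def log_likelihood(b, xs, ys):
    """Log-likelihood of a no-intercept logistic model with coefficient b."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = b * x
        # log P(y=1) = -log(1 + exp(-z)); log P(y=0) = -log(1 + exp(z))
        total += -math.log1p(math.exp(-z if y == 1 else z))
    return total

# The two tied points at x = 0 contribute log(0.5) each no matter what b is,
# so this is the value the log-likelihood creeps toward but never reaches
limit = 2 * math.log(0.5)

for b in (1, 3, 5):
    print(b, round(log_likelihood(b, xs, ys), 4))
print("limit:", round(limit, 4))
```

Every bigger value of b fits a little better than the last one, so the iterations never settle on a finite estimate; the software simply stops at some point and reports where it was.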

So, even when you do get an error, it is not too hard to find fairly well-written explanations of the problem.

Two points occur to me:

The world's most spoiled twelve-year-old is rolling her eyes and saying,

"I can't believe you forgot to write a blog about whatever that is either! But it's not like it's something really important, like that you forgot you were going to take me to Becca's dad's house for a sleepover."

Um, I think I have to go drive someone to Torrance for a sleepover.



My friend, Gokor Chivichyan, a mixed martial arts instructor, once gave this questionable advice to his students,

“Women usually just say they’re fat to get attention. So, me, I agree with them. If she says she’s fat, I say, yes, you fat but we like you anyway. If she’s really fat, though, you just have to dump her. Not if she’s your wife, though. Then, it’s too bad but you have to keep her anyway and take care of her because she’s the mother of your kids.”

(For the record, I have met Gokor’s wife and she is both lovely and charming.)

We tend to keep our private life extremely private. Dennis has been referred to as “your alleged husband” by my friends, who have never met him.  Recently, another friend of mine, my former business partner from Spirit Lake Consulting was visiting and, after knowing me for 20 years, met my husband for the first time. My friend commented,

“Your husband really loves you.”

I was a bit miffed that he found this so surprising.  I think one reason this is a surprise is that we so often categorize people into single boxes. I was a world-class athlete while my husband hates all exercise. Being a sixth-degree black belt, I once suggested to him that perhaps he could learn judo for exercise. His response was,

I’m not doing anything where people touch me unless I get to have sex with them at the end of it.

As I was lying in bed this morning with my eyes closed, trying to avoid morning, Dennis was carrying on about the Complete Works of Euclid, which proofs were not really proofs, but axioms, the incompatibility of irrational numbers with early Greek geometry, the inefficiency of using geometry for certain proofs rather than algebra or calculus. This is why my husband loves me. Not only did I not find this boring and throw a pillow at him, which most of the women of my acquaintance might have done, but I actually opened my eyes and made a comment or two about how it fit exactly with what I was thinking about, which happened to be …

The geometric concept of a line is that it extends infinitely in both directions. What most people think of as a line, with two end points,  is really a line segment. Most of us learned this in high school or middle school and don’t really think about it much. However, sometimes it becomes relevant.

Let’s say you were trying to predict a dichotomous dependent variable. Since it is around Christmas time, let’s pick whether a person is traveling home for the holidays or not, which we have coded 1= no, 2 = yes.  That might be a very useful fact for people to know who were in either the travel or family therapy industries to target their marketing/ determine outpatient clinic hours.

This is a dichotomous variable and you can see that it plots pretty terribly against a continuous predictor variable – say, income.

You can see that the linear equation

Y =  a + bX

is just plain wrong here. It doesn’t fit. Very, very far from the assumption that a line extends infinitely we are stuck with a stupid line that goes from 1 to 2.

How about probability then? We could use the probability of going home for Christmas by income. That will extend from 0 to 100, which is certainly closer to infinite.

Well, this is better. It sort of approximates a line. In the graph above, you can see the obtained regression equation

Y = -.1236 + .0313X

(I know you were dying to know.)

You can also see the predicted values it gave me for incomes below $5,000 are negative. I guess those are the people who are not coming home even IF hell freezes over. The probabilities for people with incomes over $40,000 are above 1.0.  I guess that means they are going home twice, once to mom’s house and once to dad’s place in the Hamptons with his trophy wife.
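You can reproduce the impossible "probabilities" from the equation itself. The sketch below assumes X is income in thousands of dollars, which is my guess from the dollar figures mentioned; the post does not state the units.

```python
INTERCEPT = -0.1236
SLOPE = 0.0313

def linear_prediction(income_thousands):
    """Predicted 'probability' from the straight-line equation Y = a + bX."""
    return INTERCEPT + SLOPE * income_thousands

for income in (2, 20, 45):
    print(income, round(linear_prediction(income), 3))

# Incomes (in thousands, under the same assumption) where the line leaves the 0-1 range
crosses_zero = -INTERCEPT / SLOPE
crosses_one = (1 - INTERCEPT) / SLOPE
print(round(crosses_zero, 1), round(crosses_one, 1))
```

A straight line has no idea that probabilities stop at 0 and 1, so it happily sails past both.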

So, we have one case, with just the binary outcome, which is clearly not linear. We have another case, predicting the probability of the outcome, which may be linear, but is actually a line segment and not a line. That may be true in theory for lots of things. I doubt income extends from negative infinity to positive infinity, although Bill Gates and Warren Buffett are doing their bit to extend the right side of the distribution for themselves while the Republicans and certain banking and investment firms are making a best effort to extend it on the left for all the rest of us.

There are a whole bunch of reasons that using linear regression is wrong when you have a binary dependent variable, and the fact that it is flat-out not a linear relationship is just one of those.

Now, if I were an ancient Greek, I would include a lot of geometric examples, not really proofs. If I were an ancient Greek that had access to JMP 8 software I might include another variable graphed against probability and say, “Looky here”, or however you say that in Greek.

This is a very important point – Greek or not – even though the relationship charted above is very high – R-squared = .78 to be precise, it is clearly not a linear relationship. It is an S-curve and it looks very much like a logistic relationship.

Three very important points emerged from this:

  1. The potential to teach kids the basic understanding of some of the more abstract concepts of mathematics by pictures. I can see how you could start with these graphs and do a linear relationship, then log one variable, log both and start to see the different types of pictures. Those Greeks were on to something. Too bad they didn’t have JMP. Never know what they could have achieved. (Click here for link to random JMP page.)
  2. The idea of using graphs to teach students is intriguing, and yet I am puzzling how I could drag the world’s most spoiled twelve-year-old away from the Disney channel downstairs and get her to see it that way. The use of graphing calculators in mathematics is not new, but neither does it seem to be particularly effective. This is all fascinating TO ME because I see the end point of making predictions. Perhaps we should spend the first few weeks of mathematics on why what we are about to do is important?
  3. I was thinking that I had failed miserably on most of the #reverb10 prompts because, well, frankly, I’m more interested in examining logistic and linear relationships than ruminating on my life. Then, it came to me – what’s the one thing I have come to appreciate in the past year? That I’m married to someone who would wake me up by bringing me coffee in bed and talking about the complete works of Euclid!



The #reverb10 prompt for December 3rd was to write about a time when you felt most truly alive in 2010. There were more prompts,  about what you wonder about and other examining-your-soul type of introspection. This isn’t that kind of blog. I don’t think I’m that type of person. For the record, the time I feel most alive is when I am with my family but I wasn’t the least bit interested in writing about how much I love my family right now. In fact, I was very interested in logistic regression.

I'll get this down eventually

Should YOU wonder about logistic regression? Well, that depends.

Logistic regression is the statistical technique of choice when you have a single categorical dependent variable and multiple independent variables from which you would like to predict it.

With logistic regression, what you are modeling is the PROBABILITY of Y being a certain value divided by ONE MINUS THE PROBABILITY, in other words, the odds. Let’s start with the simplest model, binary logistic regression. There are two possible outcomes, married or not. We are modeling the probability that an individual is married, yes or no.  [Logistic regression is NOT what you would use to model how long a marriage lasted. That would be survival analysis.]

The logistic regression formula models the log of the odds. The odds are

The probability of y = 1 / probability of y = 0

So, the left side of your equation is

ln(p / (1- p) )

**** Very, mega- super-important point here – the p in this equation is NOT the same old p as in p < .05. No, au contraire. Completely different. This is the probability of event = 1. For example, the probability of being married. 1-p then would be 1 – the probability of being married.  Yes, that second number is the same as the probability of being single. You aren’t missing anything.

I was, in this post going to use the probability of being a dumb-ass but some people have written and told me that I am too hostile for a statistician so I am trying to mend my ways, it being around the holidays and all.

The right side of the equation is the same old ß0 + ß1X1 + …ßnXn
that you are used to with Ordinary Least Squares (OLS) regression also known as multiple regression or multiple linear regression, or, if you are a complete weirdo, Monkey-Bob .
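Here is the left side of that equation as a couple of lines of Python, a minimal sketch with function names of my own choosing. The point is that the log of the odds can be any number from very negative to very positive, and converting back always lands you between 0 and 1.

```python
import math

def logit(p):
    """Log of the odds: ln(p / (1 - p)), for a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Turn a log odds value back into a probability."""
    return 1 / (1 + math.exp(-z))

for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3))

# Any value of the right side of the equation maps to a legal probability
for z in (-5, 0, 5):
    print(z, round(inverse_logit(z), 4))
```

A probability of .5 is odds of 1:1, and the log of 1 is 0, which is why 0 is the "no idea either way" point on the log odds scale.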


The odds ratio is

The probability of y = 1 / probability of y = 0  when x = 1

divided by

The probability of y = 1/ probability of y = 0 when x = 0

I presume the only reason you have read this far is that you have some deep-rooted need or desire to understand logistic regression. An example will help. I have discovered lately that I love my husband for a very important reason. He is not a dumb ass. I have had multiple husbands (not simultaneously, that would be polyandry and illegal in most states and immoral according to certain anal-retentive religions) what they all had in common, other than the obvious being married to me, is that they were all in technical fields and pretty good at what they did. Let’s go with the hypothesis that people who are in a technical field are more likely to be married.  Further, let’s say that we have sampled 100 people in computer science and 100 people in French literature. We find that 90 of the computer scientists are married and 45 of the French literature people.

So, if the probability of marriage is 90/100 and the probability of not being married is 10/100, then the odds are 9:1 for the computer scientists, = 9

For the French literature people, the probability of marriage is 45/100 and the probability of not being married is 55/100, so the odds are 45/55 = .818

So, 9/.818  = 11.00

This tells you that the odds of a computer scientist being married versus single are 11 times that of a French literature professor. Also, that you should study computer science instead of French.
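If you want to check my arithmetic, the whole calculation fits in a few lines of Python. The counts are the made-up ones above; the variable names are just for illustration. The last two lines take natural logs, since the log of the odds is what a logistic regression actually estimates.

```python
import math

# Made-up sample: 100 people in each field
married_cs, not_married_cs = 90, 10
married_french, not_married_french = 45, 55

odds_cs = married_cs / not_married_cs              # odds of marriage, computer science
odds_french = married_french / not_married_french  # odds of marriage, French literature
odds_ratio = odds_cs / odds_french

print(round(odds_cs, 3), round(odds_french, 3), round(odds_ratio, 2))

# On the log scale: log odds for the French literature group and the log odds ratio
print(round(math.log(odds_french), 3), round(math.log(odds_ratio), 3))
```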

If you really had nothing else to do in your life and wanted to run this using SPSS just to see if I was correct (really, now!) you would get this output.

Gasp! The value of  ß0, that being our constant, is -.201 . The inverse of the log is Exp(x) also shown as “e to the x”. This is a function in SPSS, if you want to double-check. Also a function in SAS, Stata and Excel, but NOT on the calculator on my iPhone. Steve Jobs should feel shame.

The value of  Exp(-.201) = .818  –  the odds  for French literature people.

The value of Exp (2.398) = 11.00  – the odds ratio for computer scientists versus French literature whatever you call them (unemployed would be my guess).

Coincidence? I think not!

In interpreting a logistic regression analysis you want to look at the significance of the parameter estimates (.000) and the parameter estimate, in this case the ß = 2.398. A positive coefficient says that the dependent is MORE likely if the variable has the value in question. In SPSS, that value is shown in parentheses. Notice it says cs(1) – that means when cs has the value of 1, the outcome is more likely to occur. How much more likely? Look to your right. (On the table, in this blog post, not to your right in your room. What are you thinking?) The odds are 11 times greater for computer scientists than for French literature whats-its.

A really good reference if you want a plain language introduction to logistic regression is by Newsom . There are a lot of really bad references to logistic regression in very obscure language but I decided not to bother mentioning them.

The syntax for producing this table in SPSS is below.

/CONTRAST (cs)=Indicator(1)




Going from the phi coefficient to odds-ratios. Remember the numerator for the phi coefficient was

Remember the numerator for the phi coefficient, the difference between two products? Well, the odds ratio is the same two numbers DIVIDED rather than subtracted. You might think it is four numbers, but really it is not. The first number is the product of the diagonal cells (see below). The second number is the product of the off-diagonal cells. Let’s take a look at our data again, first in symbolic form and then the actual numbers.


So the odds of a woman doing the dishes are 9:1 , that is for every one woman who doesn’t do the dishes, there are nine who do. The odds of a man doing the dishes are 1:3, that is, for every three men who don’t do the dishes, there is one who does.

Here is our formula for the odds ratio:


Odds ratio = (10*25) / (75*90) = 250/6750 = 1/27

The odds of a man doing the dishes (1/3) are one-twenty-seventh the odds of a woman doing them (9/1).
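Here is a minimal Python sketch of that calculation, done two ways to show they agree. The four cell counts are inferred from the formula and the odds quoted above (90 women who do the dishes and 10 who don’t, 25 men who do and 75 who don’t), and the variable names are made up for illustration.

```python
from fractions import Fraction

# Cell counts inferred from the formula above (an assumption, not the original table)
women_do, women_dont = 90, 10
men_do, men_dont = 25, 75

# Odds within each group: those who do, divided by those who don't
odds_women = Fraction(women_do, women_dont)   # 9:1
odds_men = Fraction(men_do, men_dont)         # 1:3

# Odds ratio as one set of odds divided by the other
odds_ratio = odds_men / odds_women

# Same thing as a ratio of two cross-products of the cells
cross_product_ratio = Fraction(men_do * women_dont, men_dont * women_do)

print(odds_men, odds_women, odds_ratio, cross_product_ratio)
```

Using Fraction keeps everything as exact fractions, so the result comes out as 1/27 instead of a long decimal.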

Tomorrow, I will try to find the time to explain how this is intimately related to logistic regression.

But for now, I am going to go home, and, no doubt, eventually do the dishes.



A few weeks ago, I ended my post with “there is one thing a statistical consultant absolutely must have” and promised to say what that is in the next post. Maria and I had just picked up our rental car at the Detroit airport when she turned to me and asked:

So, what is the one thing a statistical consultant has to have?

I told her,

“I have absolutely no idea what I was thinking last month!”

In my defense, I have been in five states and 22 cities in the past 21 days. Maria says it is only 16 because I was in Minneapolis, Fargo and Denver twice each. She also says I can’t count Denver, Chicago or San Francisco since I only changed planes there. Poo!

In Long Beach thinking about statistical consultants

Now that I am back in Los Angeles and my brain has unfrozen I think there are actually five things you must have but one of these is the most important. In my not at all humble opinion, though, you need ALL FIVE.

The five skills a statistical consultant actually must have

Drum roll, please
  1. COMMUNICATION – This is the number one most important skill. If you don’t have the rest, you’ll still suck and be unemployed, but a terrific communicator with mediocre statistical analysis skills will get more business. I don’t just mean shaking hands and small talk at conferences, either. Communication includes documentation, both in your code and in codebooks, an internal wiki, etc. It includes letting clients know what you’re going to do, what it’s going to cost, what that cost includes, what your results were and what those results mean. If you’re good at communicating with clients, colleagues and your future self, you’re half-way to success.
  2. TESTING – I’ve ranted on this blog a lot about testing because it is one of the areas where people often seem to fall short early in their careers. I got a lot of hate for this post when I said I don’t hire self-taught developers because there are things they don’t teach themselves adequately, like testing.
  3. Statistics – Well, duh. Props to the person in the Chronicle of Higher Education forum whose signature read, “Being able to find SPSS in the start menu does not qualify you to run a multinomial logistic regression.” Your clients may not know what power, quasi-complete separation or multicollinearity mean in interpreting an analysis. They do trust that you understand whatever is necessary to be understood for the work. Don’t let them down.
  4. Programming – When I was a graduate student, Very Important Professors had lowly peon graduate students and programmers to write their code for them. All of those people had started their careers using punched cards (honest!); it was that long ago. All of the statistical consultants I know do programming, or can code their own analyses if necessary. Even if you aren’t doing it all yourself – I’m certainly not these days – you need to know enough to review the code your minions wrote or help said minions when they get stuck. Sometimes, it’s just quicker to do it yourself than explain to someone else, especially if you need to fix a bug in code that a client is waiting on.
  5. Be a generalist – I’ll have more to say about this in future posts. In brief, even the consultants I know who are well-known specialists in one language know and use others. If you think your career is going to be you sitting on a mountain or in a penthouse office, pontificating to others about sums of squares, the computation of Wilks’ lambda or options for PROC GLMSELECT, you are going to be sadly disappointed. On the other hand, if you do know of a job like that, I would consider taking it for a sufficiently large quantity of money.



I’ll be speaking about being a statistical consultant at SAS Global Forum in D.C. in March/April. While I will be talking a little bit about factor analysis, repeated measures ANOVA and logistic regression, that is the end of my talk. The first things a statistical consultant should know don’t have much to do with statistics.

A consultant has paying clients.

In History of Psychology (it was a required course, don’t judge me) one of my fellow students chose to give her presentation as a one-woman play, with herself as Sigmund Freud. “Dr. Freud” began his meeting with a patient discussing his fee. In fact, Freud did not accept charity patients. He charged everyone. There’s a winning trivial pursuit fact for you.*

Why am I starting with telling you this? Because I have had plenty of graduate students whose goal is “to be a consultant” but they seem to think their biggest problem when they start out is going to be whether they should do propensity score matching using the nearest neighbor or caliper method.

Here are the biggest problems you’ll face:

Let’s start with getting clients. I can think of four ways to do this: referrals, as part of a consulting company, through your online presence and through an organization. I’ve done three of them. First, and most effective, I think, is through referrals. I got my first two clients when professors who did consulting on the side recommended me. I do this myself. If someone can’t afford my fees or I am just booked at the moment, I will refer potential clients to students, former students or other professionals I know who are getting started as consultants. It’s not competing with my business. I am never going to work for $30 an hour again and if that’s all that’s in your budget, I understand. If all you need is someone to do a bunch of frequency distributions and a chi-square for you, you don’t need me, although I’m happy to do it as a part of a larger contract.

Lesson number one: Don’t be a jerk.

Referrals mean I’m using my own reputation to help you get a job and so I’m going to refer students who are good statisticians and who I think will be respectful and honest with the client. Don’t underestimate the latter half of that statement.

Lesson number two: It helps if you really love data analysis.

I’d be the first to say that I’m a much nicer person now than when I was in graduate school. Yes, it took me a while to learn lesson one, I am embarrassed to say. However, I really did love statistics and if any of my fellow students had trouble, I was the first person they asked and I was really happy to help. When those students later became superintendents of schools or principal investigators of grants, they thought of me and became some of my earliest clients. Some of my professors also became clients, although those were after I’d had several years of experience.

Lesson number three: Don’t think you are smarter than your clients.

A young relative, who has a Ph.D. in math, asked me, “No offense but isn’t what you do relatively easy, like anyone who understood statistics could do it? Why are you so in demand?”
Corollary to this lesson: If you find yourself saying, “No offense” just stop talking right then.

One reason a lot of want-to-be consultants go bankrupt or have to find another line of work is they do think they are smarter than their clients. This manifests itself in a lot of ways so we’ll return to it later, but one way is that they charge much more than the work is worth.

How do you know how much your work is worth?

Lesson number four: Ask yourself, if I had twice as many grants/contracts as I could do and I was paying someone to do this work, what would I be willing to pay?

That’s a good place to start.

I’ve met a lot of people over the years who charged much more than me and bragged to me about it. In the long run, though, I’m sure I made a lot more money. Clients talk. They find out that you are charging them three times as much as their friend down the block is getting charged by their consultant. You may think you’re getting away with it, but you won’t. You may get paid on those first few contracts but you’ll have a very hard time getting work in the future.

Lesson number five: Know multiple languages, multiple packages

I’ve had discussions with colleagues on whether it is better to be a generalist or a specialist.

I have had a few jobs where they just needed propensity score matching or just a repeated measures ANOVA but those have been the small minority over the past 30 years.

I would argue that even those who consider themselves specialists actually have a wide range of skills. Maybe they are only an expert in SAS but that includes data manipulation, macros, IML and most statistical procedures.

In my case, I would not claim to be the world’s greatest authority on anything but if you need data entry forms created in JavaScript/HTML/CSS, a database back end with PHP and MySQL, your data read into SAS, cleaned and analyzed in a logistic regression, I can do it all from end to end. No, I’m not equally good at all of those. It’s been so long since I used Python, that I’d have to look everything up all over again.

I’ve used SPSS, Stata, JMP and Statistica, depending on what the client wanted. I think I might have even had a couple of clients using RapidMiner. For the last few years, though, the only packages I’ve used have been SAS and Excel. Why Excel? Because that’s what the clients were familiar with and wanted to use and it worked for their purposes. (See lesson three.)

I was really surprised to read Bob Muenchen saying SPSS surpassed R and SAS in scholarly publications. Almost no one I know uses SPSS any more, but, of course, my personal acquaintances are hardly a random sample. I suppose it depends on the field you are in.

I have never used R.

Some people think this is a political statement about being a renegade. Others think it’s because I’m too old to learn new things or in subservience to corporate overlords or some other interesting explanation. (The Invisible Developer, who has been reading over my shoulder, says he never got past C, much less D through Q.)

Since I fairly often get asked why not, I will tell you the real reasons, which is a complete digression but this is my blog so there.

  1. In my spare time that I don’t have, I teach Multivariate Statistics at a university that uses SAS. Since I’m using SAS in my class anyway and need real life data for examples, when a client has licenses for multiple packages and doesn’t care what I use (almost always the case), I use SAS.
  2. About the time that R was taking off, my company was also taking off in a different direction. The Invisible Developer and I own the majority of 7 Generation Games which is an application of a lot of the research done by The Julia Group. When we started developing math games, we needed to learn Unity, C#, PHP, SQL, JavaScript, HTML/CSS. We also needed to analyze the data to assess test reliability, efficacy, etc. I called the analysis piece and told The Invisible Developer I was interested in all of it so I’d do whatever was left. He was really interested in 3D game programming so he did the Unity/C# part. I did everything else. Then, after a few years, I moved to Chile, where the language I most had to improve was my Spanish.
Games in Spanish, English and Lakota

It worked out for me. We have a dozen games available from 7 Generation Games and now we’re coming out with a new line on decision-making.

I mention all this because I want to emphasize there isn’t a single path to succeeding as a consultant. There isn’t a specific language or package you have to learn. There is one thing you absolutely must have, though, and that’s the next post.

* (See Warner, S. L. Sigmund Freud and Money. (1989) Journal of the American Academy of Psychoanalysis. Winter;17(4):609-22)
