Back in 1976, Howard Wainer published an article in Psychological Bulletin entitled “Estimating coefficients in linear models: It don’t make no nevermind.”
Since I read this sometime in graduate school and I took my last statistics course in 1989, spending the rest of the time writing my dissertation, I believe I should win some kind of prize for remembering this.
In short, Wainer said that as long as your regression coefficients are non-zero, using equal weights is just about as good as using the supposedly optimal weights estimated by a linear regression. So, for example, assuming standard scores,
College_GPA = HS_GPA + SAT + Class_Rank
would predict about as well as, say,
College_GPA = .4*HS_GPA + .2*SAT + .3*Class_Rank
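To get a feel for Wainer's point, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the data are simulated rather than real admissions data, the three predictors are standardized and share a hypothetical common factor (so they correlate about .5), and the "true" weights are simply the .4, .2 and .3 from the example above.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def simulate(n=5000, seed=1976):
    """Compare equal-weight and 'true'-weight composites on simulated data."""
    rng = random.Random(seed)
    equal, weighted, outcome = [], [], []
    for _ in range(n):
        shared = rng.gauss(0, 1)  # hypothetical common factor (assumption)
        # Three standardized predictors, each half shared and half unique variance
        hs_gpa, sat, class_rank = (
            math.sqrt(0.5) * shared + math.sqrt(0.5) * rng.gauss(0, 1)
            for _unused in range(3)
        )
        signal = 0.4 * hs_gpa + 0.2 * sat + 0.3 * class_rank
        equal.append(hs_gpa + sat + class_rank)    # unit weights
        weighted.append(signal)                    # the weights that generated the data
        outcome.append(signal + rng.gauss(0, 1))   # simulated College_GPA plus noise
    return (pearson(equal, outcome),
            pearson(weighted, outcome),
            pearson(equal, weighted))

r_equal, r_weighted, r_between = simulate()
print(f"equal weights vs outcome:   r = {r_equal:.3f}")
print(f"'true' weights vs outcome:  r = {r_weighted:.3f}")
print(f"equal vs weighted composite: r = {r_between:.3f}")
```

Note that the weighted composite here gets the best possible treatment, since it uses the very weights that generated the outcome. Weights estimated from a sample would do no better than that, which is the comparison Wainer cared about.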
Yesterday, MP posted a link on my blog to an apparently very interesting article in a journal called Quality and Quantity, making a similar point, that it didn’t matter if you used linear regression or logistic regression.
It actually is true, to some extent, that more complex techniques don’t always yield wildly different results. For example, depending on how different the proportions are in your strata, the SURVEYMEANS procedure in SAS may not get you very different results than a simpler weighted means procedure.
I was skeptical about the linear / logistic equivalence. I tried to read the article but could not get it through any of the libraries at universities where I am an adjunct, and I was not inclined to pay $34 for it. This is a pet peeve of mine: journals that do not pay the authors who did the research and wrote the article nonetheless charge to read it.
Anyway… I was reminded of Nassim Taleb’s The Black Swan, wherein he says that to disprove the assertion that all swans are white you only need to find one black swan.
So ... I deliberately chose a dependent variable that was skewed away from a 50-50 split, to make my dependent variable even less normal. I did these analyses using SPSS because it runs natively on a Mac and I was using my MacBook Pro. Yes, I could have opened up VMWare and used SAS but that would have required at least 45 seconds and that time could be spent going downstairs to get more cognac. It is New Year’s Eve, you know.
I used the example dataset on anorexia that comes with SPSS and used TRANSFORM > RECODE INTO DIFFERENT VARIABLES to create a new variable, diagnosis, that was 1 if the person had a diagnosis of atypical eating disorder and 0 otherwise. This gave me a distribution of about 13% = 1 and 87% = 0.
Then, I created a scale for anorexia symptoms and a dichotomous variable for binge eating. Thus, it replicated a fairly usual logistic regression problem with a binary dependent, a numeric predictor and a categorical predictor. Code, with regression, is shown below, for you syntax lovers (not that there is anything wrong with that).
RECODE diag (4=0) (Lowest thru 3=1) INTO typicaldisorder.
VARIABLE LABELS typicaldisorder 'Typical Disorder'.
COMPUTE anorexia=weight + mens + fast + hyper + preo + body.
RECODE binge (Lowest thru 2=1) (3 thru Highest=2) INTO binges.
VARIABLE LABELS binges 'binge eater'.
REGRESSION
/STATISTICS COEFF OUTS R ANOVA
/DEPENDENT typicaldisorder
/METHOD=ENTER anorexia binges.
Then, I ran the same analysis with logistic regression:
LOGISTIC REGRESSION VARIABLES typicaldisorder
/METHOD=ENTER anorexia binges
/CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).
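For anyone without SPSS handy, the same kind of comparison can be sketched in plain Python. This is an illustration under loud assumptions, not a re-run of the analysis above: the data are simulated (it is not the SPSS anorexia file), the variable names `symptoms` and `binge` and the generating coefficients are made up, and both fits are hand-rolled (normal equations for the linear model, Newton-Raphson capped at 20 iterations for the logistic one) rather than taken from any statistics package.

```python
import math
import random

def solve(a, b):
    """Solve a small linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def xtwx(rows, w):
    """Weighted cross-product matrix X'WX."""
    k = len(rows[0])
    return [[sum(wi * x[i] * x[j] for x, wi in zip(rows, w)) for j in range(k)]
            for i in range(k)]

def xtv(rows, v):
    """Cross-product vector X'v."""
    k = len(rows[0])
    return [sum(x[i] * vi for x, vi in zip(rows, v)) for i in range(k)]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(beta, x):
    return sum(b * v for b, v in zip(beta, x))

def fit_linear(rows, y):
    """Ordinary least squares via the normal equations."""
    return solve(xtwx(rows, [1.0] * len(rows)), xtv(rows, y))

def fit_logistic(rows, y, iterations=20):
    """Maximum likelihood by Newton-Raphson with a fixed number of iterations."""
    beta = [0.0] * len(rows[0])
    for _ in range(iterations):
        p = [sigmoid(dot(beta, x)) for x in rows]
        step = solve(xtwx(rows, [pi * (1 - pi) for pi in p]),
                     xtv(rows, [yi - pi for yi, pi in zip(y, p)]))
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) *
                           sum((y - mb) ** 2 for y in b))

# Simulated data (assumption): a rare binary outcome, one numeric predictor,
# one dichotomous predictor. Each row is [intercept, symptoms, binge].
rng = random.Random(2010)
rows, y = [], []
for _ in range(5000):
    symptoms = rng.gauss(0, 1)
    binge = 1.0 if rng.random() < 0.3 else 0.0
    rows.append([1.0, symptoms, binge])
    prob = sigmoid(-2.2 + 0.4 * symptoms - 0.7 * binge)
    y.append(1.0 if rng.random() < prob else 0.0)

linear = fit_linear(rows, y)
logistic = fit_logistic(rows, y)
linear_fit = [dot(linear, x) for x in rows]
logistic_fit = [sigmoid(dot(logistic, x)) for x in rows]

print("linear coefficients:  ", [round(b, 3) for b in linear])
print("logistic coefficients:", [round(b, 3) for b in logistic])
print("correlation of fitted values:", round(pearson(linear_fit, logistic_fit), 3))
```

The coefficients are on different scales (probability units versus log-odds), so what is worth comparing is their signs, their relative sizes, and how closely the two sets of fitted values track each other.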
Both of these are selected from the ANALYZE menu then REGRESSION. You may need the regression or advanced statistics add-on modules for the logistic. Since I’m on the faculty at a few schools I got the faculty pack with 13 modules. It is awesomely awesome for $250 a year and no, I don’t get a kickback from SPSS, I just like statistics.
SO! The results (see detail below)
What was different:
- The total model statistics (both significance and R-square estimates).
- Significance for the binge eater variable (significant in the linear regression, not in logistic).
What was the same:
- In both, anorexia was the less important of the two predictors.
- In both cases, the model kind of sucked, which was not surprising since it was an arbitrarily constructed dependent variable and a couple of predictors selected based on no more theory than “These were in the same dataset”.
So, no, the two techniques do not produce identical results. They did produce similar results in terms of the relative importance of the predictors and in showing that the model just wasn’t very good.
Decades ago, when I was in graduate school, I remember Dr. Donald MacMillan saying that,
“If you need to give a kid an IQ test to figure out whether or not he is mentally retarded, he isn’t.”
I think perhaps the same can apply in a lot of statistical decisions. If your model really does fit well in reality, you’re going to find significance regardless of the method and if it really is a crappy model, it’s not going to come out no matter how perfect your statistical technique.
If I had to conclude, I would say categorical dependent variables do make some never mind in selecting your technique. (If English is your second language, good luck with understanding that! )
The details …..
Linear regression –
Model F=3.06 (p = .049)
R-Square = .028 Adjusted R-square = .019
anorexia beta coefficient = .084 ( p =.218)
binge eater beta coefficient = -.157 (p = .022)
Logistic regression –
Model Chi-square = 7.52 (p = .023)
Cox & Snell R-square = .034, Nagelkerke R-square = .064
anorexia coefficient p = .23
binge eater coefficient p = .053
A very observant person might have noted an inconsistency in the logistic results here: the overall model is significant but neither of the coefficients is. Sometimes this happens because the two use different types of chi-square tests. The model chi-square is a likelihood-ratio test, while the individual coefficients are tested with Wald chi-squares. Read more about it at the UCLA stats website. (Scroll down, it’s at the bottom of that page.)
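To see that a likelihood-ratio chi-square and a Wald chi-square really are different animals, here is a small Python sketch that computes both for the same slope. It is a simplified illustration, not the analysis above: the data are simulated with made-up coefficients, there is only one predictor (so both tests have 1 degree of freedom and address the same hypothesis), and the model is fit with a hand-written Newton-Raphson loop.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y given fitted probabilities p."""
    return sum(math.log(pi) if yi else math.log(1 - pi) for yi, pi in zip(y, p))

def information(b0, b1, x):
    """Fitted probabilities and the 2x2 information matrix entries (a, b, c)."""
    p = [sigmoid(b0 + b1 * xi) for xi in x]
    w = [pi * (1 - pi) for pi in p]
    a = sum(w)
    b = sum(wi * xi for wi, xi in zip(w, x))
    c = sum(wi * xi * xi for wi, xi in zip(w, x))
    return p, a, b, c

# Simulated data (assumption): rare outcome, one standardized predictor.
rng = random.Random(46)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [1 if rng.random() < sigmoid(-2.0 + 0.5 * xi) else 0 for xi in x]

# Newton-Raphson for intercept b0 and slope b1.
b0, b1 = 0.0, 0.0
for _ in range(25):
    p, a, b, c = information(b0, b1, x)
    g0 = sum(yi - pi for yi, pi in zip(y, p))              # gradient, intercept
    g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))  # gradient, slope
    det = a * c - b * b
    b0 += (c * g0 - b * g1) / det
    b1 += (a * g1 - b * g0) / det

# Recompute the information matrix at the final estimates.
p, a, b, c = information(b0, b1, x)
slope_variance = a / (a * c - b * b)

# Likelihood-ratio test: fitted model versus intercept-only model.
base_rate = sum(y) / len(y)
ll_null = log_likelihood(y, [base_rate] * len(y))
ll_model = log_likelihood(y, p)
lr_chi2 = 2 * (ll_model - ll_null)

# Wald test: squared ratio of the slope to its standard error.
wald_chi2 = b1 ** 2 / slope_variance

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

p_lr, p_wald = p_value_1df(lr_chi2), p_value_1df(wald_chi2)
print(f"likelihood-ratio chi-square = {lr_chi2:.3f}, p = {p_lr:.4f}")
print(f"Wald chi-square             = {wald_chi2:.3f}, p = {p_wald:.4f}")
```

The two statistics agree only asymptotically, so in a modest sample they can land on opposite sides of .05. Add the fact that the overall model test has 2 degrees of freedom and pools both predictors, and a significant model with no individually significant coefficient stops looking so mysterious.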