“Can you explain multicollinearity statistics?” she asked.

Why, yes, yes I can.

First of all, as noted in the Journal of Polymorphous Perversity,

“Multicollinearity is not a life-threatening condition except when a depressed graduate student employs multiple, redundant measures.”

What is multicollinearity, then, and how do you know if you have it?

Multicollinearity is a problem that occurs in regression analysis when at least one independent variable is highly correlated with a combination of the other independent variables. The most extreme example would be using two completely overlapping variables. Say you were predicting income from the Excellent Test for Income Prediction (ETIP). Unfortunately, you are a better test designer than statistician, so your two independent variables are Number of Answers Correct (CORRECT) and Number of Answers Incorrect (INCORRECT). Those two are going to have a perfect negative correlation of -1. Multicollinearity. You are not going to be able to find a single least squares solution. For example, if you have this equation:

Income = .5*Correct + 0*Incorrect

or

Income = 0*Correct -.5*Incorrect

You will get the exact same predictions (the intercept simply shifts to absorb the difference). Now that is a pretty trivial example, but you can have a similar problem if you use two or more predictors that are very highly correlated. Let’s assume you’re predicting income from high school GPA, college GPA and SAT score. It may be that high school GPA and SAT score together have a very high multiple correlation with college GPA.
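If you want to see that perfect collinearity in action, here is a minimal sketch in SPSS syntax. The variable names (income, correct, incorrect) and the 20-item test are made up for illustration; the point is that once INCORRECT is completely determined by CORRECT, the tolerance of each predictor given the other is zero, and SPSS will typically refuse to enter one of them into the equation at all.

* Made-up example: INCORRECT is just 20 minus CORRECT, so the two predictors are perfectly collinear.
COMPUTE incorrect = 20 - correct.
EXECUTE.
* The collinearity statistics will flag the problem.
REGRESSION
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/DEPENDENT income
/METHOD=ENTER correct incorrect.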

For more about why multicollinearity is a bad thing, read this very nice web page by a person in Michigan whom I don’t know. Let’s say you already know multicollinearity is bad and you want to know how to spot it, kind of like cheating boyfriends. Well, I can’t help you with THAT (although you can try looking for lipstick on his collar), but I can help you with multicollinearity.

One suggestion some people give is to look at your correlation matrix and see if you have any independent variables that correlate above some level with one another. Some people say .75, some say .90, some say potato. I say that looking at your correlation matrix is fine as far as it goes, but it doesn’t go far enough. Certainly if I had variables correlated above .90 I would not include both in the equation. Even if it were above .75, I would look a bit askance, but I might go ahead and try it anyway and see the results.

The problem with just looking at the correlation matrix is this: what if you have four variables that together explain 100% of the variance in a fifth independent variable? You aren’t going to be able to tell that just by looking at the correlation matrix. Enter the Tolerance Statistic, wearing a black cape, here to save the day. Okay, I lied, it isn’t really wearing a black cape – it’s a green cape. (By the way, if you have a mad urge to buy said green cape, or a Viking tunic, you can fulfill your desires here. I am not affiliated with this website in any way. I am just impressed that they seem to be finding a niche in the Pirate Garb / Viking tunic / cloak market.)

In complete seriousness now, ahem ….

To compute a tolerance statistic for an independent variable, to test for multicollinearity, you run a multiple regression with that variable as the new dependent variable and all of the other independent variables in the model as the independent variables. The tolerance statistic is 1 – R-square for this second regression. (R-square, just to remind you, is the amount of variance in the dependent variable of a multiple regression explained by the combination of all of the independent variables.) In other words, tolerance is 1 minus the amount of variance in that independent variable explained by all of the other independent variables. A tolerance statistic below .20 is generally considered cause for concern. Of course, in real life, you don’t actually compute a bunch of regressions with all of your independent variables as dependents; you just look at the collinearity statistics.
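Written out in the same plain style as the equations above, for any one independent variable:

Tolerance = 1 – R-square (from regressing that independent variable on all of the other independent variables)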

Let’s take a look at an example in SPSS, shall we?

The code is below or you can just pick REGRESSION from the ANALYZE menu. Don’t forget to click on the STATISTICS button and select COLLINEARITY STATISTICS.

Here I have a dependent variable that is the rating of problems a person has with sexual behavior, sexual attitudes and mental state. The three independent variables are ratings of symptoms of anorexia, symptoms of bulimia and problems in body perception.

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT problems
/METHOD=ENTER anorexic perceptprob bulimia.


Let’s just take a look at the first variable, “anorexic”. It has a Tolerance of .669. What does that mean? It means that if I ran a multiple regression with anorexic as the dependent variable, and perceptprob and bulimia as the independent variables, I would get an R-square value of .331. Don’t take my word for it. Let’s try it. Notice that now anorexic is the dependent variable.

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT anorexic
/METHOD=ENTER perceptprob bulimia.

Now, look at that. When we do a regression with anorexic as the dependent variable and bulimia and perceptprob as the two independent variables, the R-square is .331. If we take 1 – .331 we get .669, which is exactly the Tolerance Statistic for anorexic in the previous regression analysis above. Don’t you just love it when everything works out?

So WHY is a tolerance below .20 considered a cause for concern? It means that at least 80% of the variance of this independent variable is shared with some other independent variables. It also means that the multiple correlation of the other independent variables with this independent variable is nearly .90 or higher (since .89 * .89 is roughly .80, and .9 * .9 = .81).

Another statistic sometimes used for multicollinearity is the Variance Inflation Factor (VIF), which is just the reciprocal of the tolerance statistic. A VIF greater than 5 is generally considered evidence of multicollinearity. If you divide 1 by .669 you’ll get 1.495, which is exactly the same as the VIF statistic shown above.
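In the same plain notation:

VIF = 1 / Tolerance = 1 / (1 – R-square)

which also means the two rules of thumb are the same rule in disguise: a VIF of 5 corresponds to a tolerance of 1/5 = .20.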

And you thought I just made this sh*t up as I went along, didn’t you?

Comments

22 Responses to “Multicollinearity statistics with SPSS”

  1. james on April 15th, 2012 7:16 am

    Thank you so much for such a great explanation.

  2. gakul chandra saikia on October 1st, 2012 1:42 pm

    I am grateful to you for this excellent explanation. May I request that you explain the condition index: what it is and how it works? And how to solve the multicollinearity problem? Thanks a lot

  3. M on October 31st, 2012 9:52 pm

    Thank you for a wonderful, simple and entertaining blog post. The link to the Chicago guy was extremely helpful, too :)

    You’ve made one little psychology student much less stressed, thank you!

    xx

  4. Andres on February 22nd, 2013 9:40 am

    Thank you for this. Working on my thesis now and was looking for some help with collinearity – the article was very helpful. I have some results!

  5. Calvin Veltman on March 29th, 2013 11:08 am

    Hi,

    I know that some smart-ass student is going to ask me about the tolerance statistic–even if the other 99% couldn’t care less. I couldn’t find anything really helpful in French so I did a more general search–just to be sure of my facts! Your explanation is delicious.

    Thank you.

  6. Kate on April 8th, 2013 3:52 pm

    Very simple and clear! Thanks for making this interesting :)

  7. How to solve collinearity problems in OLS regression? | Question and Answer on April 24th, 2013 3:36 pm

    [...] other articles, I have read this article about dealing with collinearity. Often-suggested tips are removing the variable with highest [...]

  8. Maryam on May 15th, 2013 2:29 pm

    This was the most fun I have had reading about statistics in forever. You are hilarious and you can explain something in simple terms, which means you are very knowledgeable as well.

    Thanks!

  9. Nurilng on June 10th, 2013 8:41 am

    It is a nice explanation, but is it possible to test multicollinearity using VIF for both categorical and continuous variables?

  10. AnnMaria on June 11th, 2013 11:34 pm

    You can use VIF if you dummy-code your variables, e.g., code Male = 1, Female = 0 for a variable gender. Now include another variable rsex and code it Male = 0, Female = 1. See what happens if you put both into an equation.
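    Here is a quick sketch of what I mean in SPSS syntax (the dependent variable income is just a placeholder; rsex is literally 1 minus gender, so the two are perfectly collinear):

    * rsex carries exactly the same information as gender, just coded the other way around.
    COMPUTE rsex = 1 - gender.
    EXECUTE.
    REGRESSION
    /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
    /DEPENDENT income
    /METHOD=ENTER gender rsex.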

  11. Indrajeet on December 20th, 2013 12:54 am

    Cool explanation! Thanks! :)

  12. abas on January 1st, 2014 3:44 pm

    Master’s student of clinical psychology at UT; it was great, useful and practical. Thanks!

  13. ursula on January 8th, 2014 9:13 am

    What shall I do if VIF values oscillate between 1,000 and 2,000, but the coefficients and significance still change when I remove ‘suspicious’ predictors?

  14. AnnMaria on January 8th, 2014 4:05 pm

    Can you give a little more detail, Ursula? How many variables do you have and how many observations?

  15. Karen on February 17th, 2014 1:05 pm

    I can now ignore everything else I’ve found on the Internet regarding multicollinearity thanks to your post. You’ve made it SOOO much easier to understand. I really appreciate it!

  16. Carolyn on April 6th, 2014 9:56 am

    Thanks. This was very helpful. Just to confirm (I am quite statistically challenged), I am doing binary logistic regression (an assignment) and I am just using the linear regression to do the multicollinearity diagnostic. I do not have to run all my independents with one as dependent each time for this? That was the impression I was getting from reading other explanations. I just run all my independents with my “real” dependent and use the Tolerance level to tell me whether or not I have multicollinearity, as you have done? Thanks very much.

  17. AnnMaria on April 6th, 2014 1:39 pm

    Carolyn -

    The answer is yes and if you need a citation for the “Yes”, see Applied Logistic Regression Analysis by Scott Menard

  18. Carolyn on April 9th, 2014 4:47 am

    Thank you very much.

  19. Carlos on May 23rd, 2014 10:04 pm

    Hi: I am doing a logistic regression. My dependent variable is categorical (0/1) and so are my 5 independent variables (0/1), such as gender. Please tell me if it is correct to get the VIF, considering that R-square is only for linear regression. Please cite an author to support your answer if you can. Thank you very much, sorry for my English, I’m from Mexico

  20. AnnMaria on May 23rd, 2014 10:45 pm

    First, I would look at my standard errors in the logistic regression – large standard errors are associated with multicollinearity. Although you should be aware that you could also have large standard errors for other reasons.

    IBM (which now owns SPSS) suggests that you dummy code your categorical variables and then run a linear regression to get the multicollinearity statistics, then, after dropping variables as appropriate, run a logistic regression to get the predicted probabilities, etc.

    http://www-01.ibm.com/support/docview.wss?uid=swg21476696
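    Something along these lines, where outcome and the dummy-coded predictors d1 to d3 are just placeholder names:

    * Step 1: linear regression, only to get the collinearity statistics for the dummy-coded predictors.
    REGRESSION
    /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
    /DEPENDENT outcome
    /METHOD=ENTER d1 d2 d3.
    * Step 2: after dropping variables as appropriate, run the logistic regression you actually wanted.
    LOGISTIC REGRESSION VARIABLES outcome
    /METHOD=ENTER d1 d2.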

  21. Jay on September 5th, 2014 12:49 am

    Stats made sexy ! Thanks for simplifying the explanations. Much appreciated.

  22. Arjit on September 18th, 2014 3:59 am

    Hi,
    I have just read your comments about multicollinearity with categorical variables. As per your instructions I have created dummy variables (dichotomous). Now when I run the collinearity stats keeping age band as my dependent variable and gender=1 and gender=0 as my independent variables, I get a VIF of 15.02 for gender=1 and a VIF of 1.67 for gender=0, so which one should I remove?

    Further, gender=1 comes out to be statistically significant in the t test (p value less than 0.05). Another possibility: if the p value of gender=1 were greater than 0.05, that is, not significant, should we remove it in that case? Please suggest what should be done.
