Multicollinearity statistics with SPSS

“Can you explain multicollinearity statistics?”

Why, yes, yes I can.

First of all, as noted in the Journal of Polymorphous Perversity,

“Multicollinearity is not a life-threatening condition except when a depressed graduate student employs multiple, redundant measures.”

What is multicollinearity, then, and how do you know if you have it?

Multicollinearity is a problem that occurs in regression analysis when at least one independent variable is highly correlated with a combination of the other independent variables. The most extreme example would be using two completely overlapping variables. Say you were predicting income from the Excellent Test for Income Prediction (ETIP). Unfortunately, you are a better test designer than statistician, so your two independent variables are Number of Answers Correct (CORRECT) and Number of Answers Incorrect (INCORRECT). Those two are going to have a perfect negative correlation of -1. Multicollinearity. You are not going to be able to find a single least squares solution. For example, if you have this equation:

Income = .5*Correct + 0*Incorrect

or

Income = 0*Correct -.5*Incorrect

You will get the exact same prediction, because CORRECT and INCORRECT always sum to the total number of test items, so the intercept simply absorbs the difference. Now that is a pretty trivial example, but you can have a similar problem if you use two or more predictors that are very highly correlated. Let’s assume you’re predicting income from high school GPA, college GPA and SAT score. It may be that high school GPA and SAT score together have a very high multiple correlation with college GPA.
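
To see why there is no single least squares solution, here is a quick sketch in Python (not SPSS) with made-up scores, assuming a hypothetical 100-item test so CORRECT and INCORRECT always sum to 100:

```python
import numpy as np

# Hypothetical 100-item test: INCORRECT is completely determined
# by CORRECT, so the two predictors are perfectly collinear.
rng = np.random.default_rng(0)
correct = rng.integers(0, 101, size=20)
incorrect = 100 - correct

print(round(np.corrcoef(correct, incorrect)[0, 1], 6))  # -1.0

# Two "different" equations: the slopes trade off and the intercept
# absorbs the difference, yet every prediction is identical.
pred_a = 0.5 * correct            # Income = .5*Correct + 0*Incorrect
pred_b = 50.0 - 0.5 * incorrect   # Income = 50 - .5*Incorrect

print(np.allclose(pred_a, pred_b))  # True
```

Least squares has no way to prefer one of those coefficient sets over the other, which is exactly the problem.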

For more about why multicollinearity is a bad thing, read this very nice web page by a person in Michigan who I don’t know. Let’s say you already know multicollinearity is bad and you want to know how to spot it, kind of like cheating boyfriends. Well, I can’t help you with THAT (although you can try looking for lipstick on his collar), but I can help you with multicollinearity.

One suggestion some people give is to look at your correlation matrix and see if you have any independent variables that correlate above some level with one another. Some people say .75, some say .90, some say potato. I say that looking at your correlation matrix is fine as far as it goes, but it doesn’t go far enough. Certainly if I had variables correlated above .90 I would not include both in the equation. Even if it was above .75, I would look a bit askance, but I might go ahead and try it anyway and see the results.

The problem with just looking at the correlation matrix is this: what if you have four variables that together explain 100% of the variance in a fifth independent variable? You aren’t going to be able to tell that by just looking at the correlation matrix. Enter the Tolerance Statistic, wearing a black cape, here to save the day. Okay, I lied, it isn’t really wearing a black cape – it’s a green cape. (By the way, if you have a mad urge to buy said green cape, or a Viking tunic, you can fulfill your desires here. I am not affiliated with this website in any way. I am just impressed that they seem to be finding a niche in the Pirate Garb / Viking tunic / cloak market.)

In complete seriousness now, ahem ….

To compute a tolerance statistic for an independent variable, to test for multicollinearity, a multiple regression is performed with that variable as the new dependent and all of the other independent variables in the model as independents. The tolerance statistic is 1 − R² for this second regression. (R-square, just to remind you, is the amount of variance in the dependent variable of a multiple regression explained by a combination of all of the independent variables.) In other words, tolerance is 1 minus the amount of variance in the independent variable explained by all of the other independent variables. A tolerance statistic below .20 is generally considered cause for concern. Of course, in real life, you don’t actually compute a bunch of regressions with all of your independent variables as dependents; you just look at the collinearity statistics.
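
Outside SPSS, the mechanics fit in a few lines. This is a sketch in Python with simulated data (the variable names and coefficients are made up for illustration): regress one independent variable on the others, then take 1 − R².

```python
import numpy as np

# Simulated data: x1 is partly predictable from x2 and x3.
rng = np.random.default_rng(1)
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.6 * x2 + 0.3 * x3 + rng.normal(size=n)

# Auxiliary regression: x1 as the dependent, the other
# independents (plus an intercept) as predictors.
X = np.column_stack([np.ones(n), x2, x3])
beta, *_ = np.linalg.lstsq(X, x1, rcond=None)
resid = x1 - X @ beta

r_squared = 1 - resid.var() / x1.var()
tolerance = 1 - r_squared   # below .20 would be cause for concern

print(round(tolerance, 3))
```

SPSS does this auxiliary regression for you behind the scenes for every independent variable when you request collinearity statistics.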

Let’s take a look at an example in SPSS, shall we?

The code is below or you can just pick REGRESSION from the ANALYZE menu. Don’t forget to click on the STATISTICS button and select COLLINEARITY STATISTICS.

Here I have a dependent variable that is the rating of problems a person has with sexual behavior, sexual attitudes and mental state. The three independent variables are ratings of symptoms of anorexia, symptoms of bulimia and problems in body perception.

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT problems
/METHOD=ENTER anorexic perceptprob bulimia.

Let’s just take a look at the first variable, “anorexic”. It has a Tolerance of .669. What does that mean? It means that if I ran a multiple regression with anorexic as the dependent, and perceptprob and bulimia as the independent variables, I would get an R-square value of .331. Don’t take my word for it. Let’s try it. Notice that now anorexic is the dependent variable.

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT anorexic
/METHOD=ENTER perceptprob bulimia.

Now, look at that. When we do a regression with anorexic as the dependent variable and bulimia and perceptprob as the two independent variables, the R-square is .331. If we take 1 − .331 we get .669, which is exactly the Tolerance statistic for anorexic in the previous regression analysis above. Don’t you just love it when everything works out?

So WHY is a tolerance below .20 considered cause for concern? It means that at least 80% of the variance of this independent variable is shared with some other independent variables. It means that the multiple correlation of the other independent variables with this independent variable is at least .89 (because .89 × .89 ≈ .80).

Another statistic sometimes used for multicollinearity is the Variance Inflation Factor (VIF), which is just the reciprocal of the tolerance statistic. A VIF greater than 5 is generally considered evidence of multicollinearity. If you divide 1 by .669 you’ll get 1.495, which is exactly the same as the VIF statistic shown above.
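
The arithmetic behind both statistics is small enough to check in a couple of lines (Python here, but a calculator works just as well):

```python
import math

# VIF is just the reciprocal of tolerance, so the usual cutoffs
# are two views of the same thing: tolerance .20 <-> VIF 5.
tolerance = 0.669                 # the value for "anorexic" above
print(round(1 / tolerance, 3))    # 1.495, matching the VIF in the output
print(1 / 0.20)                   # 5.0

# A tolerance of .20 means the other predictors explain 80% of this
# variable's variance, i.e., a multiple correlation of about .89.
print(round(math.sqrt(0.80), 3))  # 0.894
```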

And you thought I just made this sh*t up as I went along, didn’t you?

Have kids? Know anyone who has kids? Like kids? Own a computer? Fish Lake will teach fractions and Native American history, with no whining and all for under ten bucks.

May 28, 2011

50 thoughts on “Multicollinearity statistics with SPSS”

• james says:

thanking you so much for such a great explanation

• gakul chandra saikia says:

I am grateful to you for this better explanation. May I request you to explain the condition index! What it is, how it works. How to solve the multicollinearity problem. Thanks a lot

• M says:

Thankyou for a wonderful, simple and entertaining blog post. The link to the Chicago guy was extremely helpful, too 🙂

You’ve made one little psychology student much less stressed, thankyou!

xx

• Andres says:

Thank you for this. Working on my thesis now and was looking for some help with collinearity – the article was very helpful. I have some results!

• Calvin Veltman says:

Hi,

I know that some smart-ass student is going to ask me about the tolerance statistic–even if the other 99% couldn’t care less. I couldn’t find anything really helpful in French so I did a more general search–just to be sure of my facts ! Your explanation is delicious.

Thank you.

• Kate says:

Very simple and clear! Thanks for making this interesting 🙂

• Maryam says:

This was the most fun I have had reading about statistics in forever. You are hilarious and you can explain something in simple terms, which means you are very knowledgeable as well.

Thanks!

• Nurilng says:

it is a nice explanation, but is it possible to test multicollinearity using VIF for both categorical and continuous variables?

• You can use VIF if you dummy-code your variables, e.g., code Male = 1, Female = 0 for a variable gender. Now include another variable rsex and code it Male = 0, Female = 1. See what happens if you put both into an equation.
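
(A quick illustration in Python with made-up codes, showing the redundancy you would see:)

```python
import numpy as np

# Hypothetical sample: rsex is always 1 - gender, so the two
# dummies are perfectly (negatively) correlated -- tolerance 0.
gender = np.array([1, 0, 1, 1, 0, 0, 1, 0])
rsex = 1 - gender

print(round(np.corrcoef(gender, rsex)[0, 1], 6))  # -1.0
```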

• Indrajeet says:

Cool explanation! Thanks! 🙂

• abas says:

master student of clinical psychology at UT; it was great, useful and practical. Thanks

• ursula says:

What shall I do if VIF values oscillate between 1,000-2,000, but while removing ‘suspicious’ predictors coefficients and significance still changes?

• Can you give a little more detail, Ursula? How many variables do you have and how many observations?

• Karen says:

I can now ignore everything else I’ve found on the Internet regarding multicollinearity thanks to your post. You’ve made it SOOO much easier to understand. I really appreciate it!

• Carolyn says:

Thanks. This was very helpful. Just to confirm (I am quite statistically challenged), I am doing binary logistic regression (an assignment) and I am just using the linear regression to do the multicollinearity diagnostic. I do not have to run all my independents with one as dependent each time for this? That was the impression I was getting from reading other explanations. I just run all my independents with my “real” dependent and use the Tolerance level to tell me whether or not I have multicollinearity, as you have done? Thanks very much.

• Carolyn –

The answer is yes and if you need a citation for the “Yes”, see Applied Logistic Regression Analysis by Scott Menard

• Carolyn says:

Thank you very much.

• Carlos says:

Hi: I am doing a logistic regression, I have my dependent variable categorical (0/1) and 5 independent categorical (0/1) such as gender. Please tell me if it is correct to get VIF considering that R2 is only for linear regression. Please, if you can, cite the author to support your answer. Thank you very much, sorry for my English, I’m from Mexico

• First, I would look at my standard errors in the logistic regression – large standard errors are associated with multicollinearity. Although you should be aware that you could also have large standard errors for other reasons.

IBM (which now owns SPSS) suggests that you dummy code your categorical variables and then run a linear regression to get the multicollinearity statistics, then, after dropping variables as appropriate, running a logistic regression to get the predicted probabilities, etc.

http://www-01.ibm.com/support/docview.wss?uid=swg21476696

• Jay says:

Stats made sexy ! Thanks for simplifying the explanations. Much appreciated.

• Arjit says:

Hi,
I have just read you comments about multicollinearity in categorical variables. As per your instructions I have created dummy variables (dichotomous) now when I run the collinearity stats keeping age band as my Dependent Variable and gender=1 and gender=0 as my independent variable. I get a VIF = 15.02 against gender =1 and VIF 1.67 against gender=0 so which one should I remove.

Further gender=1 comes out to be statistically significant in t test (having p value less than 0.05), another possibility is if pvalue of gender =1 is greater than 0.05 that is not significant then in that case should we remove it? Please suggest what should be done.

• Joselyne says:

hello!
I’m Joselyne. I would like to know how I can remove multicollinearity using SPSS without removing any independent variable, because all independent variables are very important (they are imports, exports and exchange rate). The VIF of imports and exports are very high.
thank you

• Amelia says:

Thank you so much for such a clear, concise and funny explanation. I appreciate your help loadddsss! 🙂

• Thajudeen Hassan says:

Thank you, very good explanation. I would like to know the steps to find CFI and TLI.

• Astha says:

I am going to remember this forever. Thank you for explaining it so well…

• Shawnda says:

Thanks so much!! That was awesome (I mean it)! xo.

• By the way, your explanation is really accurate. Congrats

• Gabby says:

Hi! For each regression I compute, I am finding a Tolerance below .2 and a VIF above 5 for the same 2 independent variables. However, when I put either of these 2 problem children as the dependent variable, I do not find multicollinearity (i.e., Tolerance above .2 and VIF below 5). What does this mean? Do I not have multicollinearity?

• Joyeeta deb says:

Such an easy and crystal clear explanation. Thanks

• Sam says:

Hi, thank you very much for the clear blog. I had a severe problem regarding multicollinearity in my research. You made my life easier. Your blog is great. Highly appreciate your work. 🙂

• Osman says:

Thanks for this post. It helped me a lot. Your jokes are excellent by the way 🙂

• ORODHO JOHN ALUKO says:

Quite simply explained. Most doctorate students find the interpretation of tolerance and VIF confusing. They hardly conduct a multicollinearity test before embarking on regression analysis. We shall continue sharing experiences as we demystify statistics to all learners.
Prof. John Aluko Orodho [Associate Professor of Research and Statistics]

• sayid says:

Hi, when should we conduct multicollinearity test? at the beginning of variable screening or after the regression (meaning after we have a final variables)

• Darlene Nguyen says:

Thank you so much for the detailed explanation!! I’m so bad at SPSS and always have to count on others. But thanks to your detailed explanation, my mind is exploded! haha not really, but I really appreciate your time writing this article for clueless students like me!

Hahaha. I am doing the final touches on preparation for my research proposal in the next 3-5 hrs. I was trying to understand these tests. Wow, you have made my day. Thanks

• Hamzah says:

Dear all. I have the final model, using survey logistic regression, with collinearity especially in the dummy variables. Nevertheless, for the logistic model for disease, the goodness of fit test Prob > F: 0.002 and each of the variables in the model is significant. May I keep the model with the collinearity, or should I remove the variables that have collinearity? Please give me a suggestion. Thank you so much.

• Maria Gonzalez says:

Thank you very much for this wonderful explanation and for making me laugh while learning through! I am working on my PhD and needed to handle with collinearity, that has been really helpful and so easy to read!
I will sign up for a seminary with you at any time
Thank you again
Maria
Barcelona

• Your explanation about tolerance & VIF is very easy to follow. Thanks a lot for your great service & contribution to the community.

• Som Nwegbu says:

Hello,

I have quite a different sort of problem. I’m working with survey data for a huge dataset (DHS 2013 for Nigeria). I’m trying to use the tolerance and VIF score to determine if I have multicollinearity. After running a regression model of my predictor variables and then the follow up VIF command (in Stata) this is what I get:

tolerance=. & VIF=.

I’m stumped. I got a nice output for my regression model though, unfortunately I have no way of showing it to you. What does it mean that my tolerance and VIF values are missing? Is that bad?

What could I be doing wrong? Thanks in advance.

• mariam says:

Hi
pls how can I get a combined VIF and tolerance for multicollinearity using one dependent and 5 independent variables

• yenew says:

In spss I am using enter method under regression analysis but still results show some excluded variable. I need to include all. Kindly tell me the way out. Thank you in advance. Kindly revert back.

• Ruby says:

This was fun to read and informative. Thank you.

• Lynn Murton says:

Hi there,

So if these were my output would this be multicollienarity because they are greater than .9?
Collinearity
Tolerance VIF
.952 1.050
.835 1.197
.872 1.147

Thanks!

• Amani says:

how do we compute collinearity with dummy values in SPSS?

• Nashon says:

Very useful article on multi collinearity. It has really helped me. Thanks
