
Multicollinearity statistics with SPSS

By AnnMaria De Mars, May 28, 2011 / April 5, 2017

“Can you explain multicollinearity statistics?”

she asked.

Why, yes, yes I can.

First of all, as noted in the Journal of Polymorphous Perversity,

“Multicollinearity is not a life-threatening condition except when a depressed graduate student employs multiple, redundant measures.”

What is multicollinearity, then, and how do you know if you have it?

Multicollinearity is a problem that occurs in regression analysis when at least one independent variable is highly correlated with a combination of the other independent variables. The most extreme example would be two completely overlapping variables. Say you were predicting income from the Excellent Test for Income Prediction (ETIP). Unfortunately, you are a better test designer than statistician, so your two independent variables are Number of Answers Correct (CORRECT) and Number of Answers Incorrect (INCORRECT). Those two are going to have a perfect negative correlation of -1. Multicollinearity. You are not going to be able to find a single least squares solution. For example, compare these two equations:

Income = .5*Correct + 0*Incorrect

Income = 0*Correct -.5*Incorrect

You will get the exact same prediction from both (the constant difference between them is simply absorbed by the intercept), so least squares has no way to choose one over the other. Now that is a pretty trivial example, but you can have a similar problem if you use two or more predictors that are very highly correlated. Let’s assume you’re predicting income from high school GPA, college GPA and SAT score. It may be that high school GPA and SAT score together have a very high multiple correlation with college GPA.
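The ETIP example can be checked in a few lines. This is a minimal sketch in Python rather than SPSS, and the test length (`TOTAL_ITEMS`) and the scores are made up purely for illustration. It shows that CORRECT and INCORRECT correlate at -1 and that the two equations give identical predictions once the second one is allowed an intercept of .5 times the test length.

```python
from math import sqrt

TOTAL_ITEMS = 40  # hypothetical test length, not from the original post

# Made-up scores: every incorrect answer is just an item that was not correct.
correct = [12, 25, 31, 18, 40, 7]
incorrect = [TOTAL_ITEMS - c for c in correct]


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson(correct, incorrect)

# Equation 1: all of the weight on Correct, intercept 0.
pred_1 = [0.5 * c + 0.0 * i for c, i in zip(correct, incorrect)]

# Equation 2: all of the weight on Incorrect; the intercept soaks up
# the constant .5 * TOTAL_ITEMS that separates the two equations.
pred_2 = [0.5 * TOTAL_ITEMS + 0.0 * c - 0.5 * i
          for c, i in zip(correct, incorrect)]

print("correlation:", round(r, 6))
print("same predictions:", all(abs(p - q) < 1e-9 for p, q in zip(pred_1, pred_2)))
```

Because both sets of weights fit the data equally well, there is no unique least squares solution, which is the whole problem in miniature.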

For more about why multicollinearity is a bad thing, read this very nice web page by a person in Michigan who I don’t know. Let’s say you already know multicollinearity is bad and you want to know how to spot it, kind of like cheating boyfriends. Well, I can’t help you with THAT (although you can try looking for lipstick on his collar), but I can help you with multicollinearity.

One suggestion some people give is to look at your correlation matrix and see if you have any independent variables that correlate above some level with one another. Some people say .75, some say .90, some say potato. I say that looking at your correlation matrix is fine as far as it goes, but it doesn’t go far enough. Certainly if I had variables correlated above .90 I would not include both in the equation. Even if it was above .75, I would look a bit askance, but I might go ahead and try it anyway and see the results.

The problem with just looking at the correlation matrix is this: what if you have four variables that together explain 100% of the variance in a fifth independent variable? You aren’t going to be able to tell that by just looking at the correlation matrix. Enter the Tolerance Statistic, wearing a black cape, here to save the day. Okay, I lied, it isn’t really wearing a black cape – it’s a green cape. (By the way, if you have a mad urge to buy said green cape, or a Viking tunic, you can fulfill your desires here. I am not affiliated with this website in any way. I am just impressed that they seem to be finding a niche in the Pirate Garb / Viking tunic / cloak market.)
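To see why the correlation matrix can miss this, here is a small Python sketch (not SPSS) with invented data. The variables `x1` through `x4` are deliberately built to be uncorrelated with each other, and `x5` is nothing more than their sum. That construction is an assumption chosen to make the point as starkly as possible; real data will be messier.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Four mutually uncorrelated predictors (made-up +1/-1 patterns).
x1 = [1, 1, 1, 1, -1, -1, -1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
x3 = [1, -1, 1, -1, 1, -1, 1, -1]
x4 = [1, -1, -1, 1, -1, 1, 1, -1]

# The fifth predictor is completely determined by the other four.
x5 = [a + b + c + d for a, b, c, d in zip(x1, x2, x3, x4)]

variables = {"x1": x1, "x2": x2, "x3": x3, "x4": x4, "x5": x5}
names = list(variables)

# Every pairwise correlation in the "correlation matrix".
pairwise = {
    (p, q): pearson(variables[p], variables[q])
    for i, p in enumerate(names)
    for q in names[i + 1:]
}
largest = max(abs(r) for r in pairwise.values())

print("largest pairwise correlation:", largest)
```

The largest pairwise correlation here is only .50, well under any of the usual cutoffs, even though `x5` is 100% explained by the other four variables.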

In complete seriousness now, ahem ….

To compute a tolerance statistic for an independent variable to test for multicollinearity, a multiple regression is performed with that variable as the new dependent variable and all of the other independent variables in the model as independent variables. The tolerance statistic is 1 – R² for this second regression. (R-square, just to remind you, is the proportion of variance in a dependent variable in a multiple regression explained by a combination of all of the independent variables.) In other words, Tolerance is 1 minus the proportion of variance in the independent variable explained by all of the other independent variables. A tolerance statistic below .20 is generally considered cause for concern. Of course, in real life, you don’t actually compute a bunch of regressions with all of your independent variables as dependents, you just look at the collinearity statistics.
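If you would like to see the definition spelled out, the following is a rough Python sketch of it, not what SPSS runs internally. The three predictors `a`, `b` and `c` are invented numbers, and `solve`, `r_squared` and `tolerance` are hypothetical helper names written for this illustration. Each tolerance is obtained exactly as described above: regress one predictor on the others and take 1 minus the R-square.

```python
def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, predictors):
    """R-square from an ordinary least squares fit of y on predictors (with intercept)."""
    n = len(y)
    cols = [[1.0] * n] + [[float(v) for v in p] for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(bk * c[i] for bk, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def tolerance(predictors, j):
    """Tolerance of predictor j: 1 - R-square of predictor j regressed on all the others."""
    others = [p for k, p in enumerate(predictors) if k != j]
    return 1.0 - r_squared(predictors[j], others)


# Made-up predictor values, for illustration only.
a = [2, 4, 5, 7, 8, 10, 11, 13]
b = [1, 3, 2, 5, 4, 6, 8, 7]
c = [5, 1, 4, 2, 6, 3, 7, 2]
predictors = [a, b, c]

tolerances = [tolerance(predictors, j) for j in range(len(predictors))]
for name, tol in zip("abc", tolerances):
    print(name, "tolerance:", round(tol, 3), "VIF:", round(1.0 / tol, 3))
```

The normal-equations solver keeps the sketch free of outside libraries; for real work a statistics package is the safer choice, since this approach is not numerically robust when predictors are nearly collinear.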

Let’s take a look at an example in SPSS, shall we?

The code is below or you can just pick REGRESSION from the ANALYZE menu. Don’t forget to click on the STATISTICS button and select COLLINEARITY STATISTICS.

Here I have a dependent variable that is the rating of problems a person has with sexual behavior, sexual attitudes and mental state. The three independent variables are ratings of symptoms of anorexia, symptoms of bulimia and problems in body perception.

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT problems
/METHOD=ENTER anorexic perceptprob bulimia.

Let’s just take a look at the first variable, “anorexic”. It has a Tolerance of .669. What does that mean? It means that if I ran a multiple regression with anorexic as the dependent variable, and perceptprob and bulimia as the independent variables, I would get an R-square value of .331. Don’t take my word for it. Let’s try it. Notice that now anorexic is the dependent variable.

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT anorexic
/METHOD=ENTER perceptprob bulimia.

Now, look at that. When we do a regression with anorexic as the dependent variable and bulimia and perceptprob as the two independent variables, the R-square is .331. If we take 1 – .331 we get .669, which is exactly the Tolerance statistic for anorexic in the previous regression analysis above. Don’t you just love it when everything works out?

So WHY is a tolerance below .20 considered a cause for concern? It means that more than 80% of the variance of this independent variable is shared with some combination of the other independent variables. It means that the multiple correlation of the other independent variables with this independent variable is roughly .90 or higher (strictly, above the square root of .80, which is about .894; for comparison, .9 * .9 = .81).

Another statistic sometimes used for multicollinearity is the Variance Inflation Factor, which is just the reciprocal of the tolerance statistic. A VIF greater than 5 (the reciprocal of a tolerance of .20) is generally considered evidence of multicollinearity. If you divide 1 by .669 you’ll get 1.495, which is exactly the same as the VIF statistic shown above.
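The arithmetic connecting R-square, Tolerance and VIF is easy to verify. Here is a short Python sketch (not SPSS) that uses only the numbers quoted in this post; the variable names are just labels picked for the illustration.

```python
from math import sqrt

# Values reported in the post for the variable "anorexic".
tolerance_anorexic = 0.669

# Tolerance = 1 - R-square of the auxiliary regression, so R-square = 1 - Tolerance.
r_square_aux = 1 - tolerance_anorexic

# VIF is the reciprocal of Tolerance.
vif_anorexic = 1 / tolerance_anorexic

print("auxiliary R-square:", round(r_square_aux, 3))   # 0.331
print("VIF:", round(vif_anorexic, 3))                  # 1.495

# The two rules of thumb are the same rule stated two ways.
tolerance_cutoff = 0.20
vif_cutoff = 1 / tolerance_cutoff                      # 5.0

# Multiple correlation implied by a tolerance sitting exactly at the cutoff.
multiple_r_at_cutoff = sqrt(1 - tolerance_cutoff)      # about 0.894

print("VIF cutoff:", vif_cutoff)
print("multiple R at cutoff:", round(multiple_r_at_cutoff, 3))
```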

And you thought I just made this sh*t up as I went along, didn’t you?



51 Comments

james says:

April 15, 2012 at 7:16 am

thanking you so much for such a great explanation
gakul chandra saikia says:

October 1, 2012 at 1:42 pm

I am greatful to you for this better explanation. May I request you to explain about the condition index ! what it is ,how it is . How to solve the multicollinearity problem . Thanks a lot
M says:

October 31, 2012 at 9:52 pm

Thankyou for a wonderful, simple and entertaining blog post. The link t the Chicago guy was extremely helpful, too 🙂

You’ve made one little psychology student much less stressed, thankyou!

xx
Andres says:

February 22, 2013 at 9:40 am

Thank you for this. Working on my thesis now and was looking for some help with collinearity – the article was very helpful. I have some results!
Calvin Veltman says:

March 29, 2013 at 11:08 am

Hi,

I know that some smart-ass student is going to ask me about the tolerance statistic–even if the other 99% couldn’t care less. I couldn’t find anything really helpful in French so I did a more general search–just to be sure of my facts ! Your explanation is delicious.

Thank you.
Kate says:

April 8, 2013 at 3:52 pm

Very simple and clear! Thanks for making this interesting 🙂
Maryam says:

May 15, 2013 at 2:29 pm

This was the most I have had reading about statistics in forever. you are hilarious and you can explain sth in simple terms which means you are very knowledgeable as well.

Thanks!
Nurilng says:

June 10, 2013 at 8:41 am

it is nice explanation but is it possible to test multidisciplinarity using VIF for both categorical and continuous variables?
AnnMaria says:

June 11, 2013 at 11:34 pm

You can use VIF if you dummy-code your variables, e.g., code Male = 1, Female = 0 for a variable gender. Now include another variable rsex and code it Male = 0, Female = 1. See what happens if you put both into an equation.
Indrajeet says:

December 20, 2013 at 12:54 am

Cool explanation! Thanks! 🙂
abas says:

January 1, 2014 at 3:44 pm

master student of clinical psychology at UT; it was great.useful and piratical.tks
ursula says:

January 8, 2014 at 9:13 am

What shall I do if VIF values oscillate between 1,000-2,000, but while removing ‘suspicious’ predictors coefficients and significance still changes?
AnnMaria says:

January 8, 2014 at 4:05 pm

Can you give a little more detail, Ursula? How many variables do you have and how many observations?
Karen says:

February 17, 2014 at 1:05 pm

I can now ignore everything else I’ve found on the Internet regarding multicollinearity thanks to your post. You’ve made it SOOO much easier to understand. I really appreciate it!
Carolyn says:

April 6, 2014 at 9:56 am

Thanks. This was very helpful. Just to confirm (I am quite statistically challenged), I am doing binary logistic regression (an assignment) and I am just using the linear regression to do the multicollinearity diagnostic. I do not have to run all my independents with one as dependent each time for this? That was the impression I was getting from reading other explanations. I just run all my independents with my “real” dependent and use the Tolerance level to tell me whether or not I have multicollinearity, as you have done? Thanks very much.
AnnMaria says:

April 6, 2014 at 1:39 pm

Carolyn –

The answer is yes and if you need a citation for the “Yes”, see Applied Logistic Regression Analysis by Scott Menard
Carolyn says:

April 9, 2014 at 4:47 am

Thank you very much.
Carlos says:

May 23, 2014 at 10:04 pm

Hi: I am doing a logistic regression, I have my dependent variable categorical (0/1) and 5 independent categorical (0/1) such as gender. Please tell me if it is correct to get VIF considerig that R2 is only for lineal regression. Please if you can cite the author to support your answer. Thank you very much, sorry for my english i´m from Mexico
AnnMaria says:

May 23, 2014 at 10:45 pm

First, I would look at my standard errors in the logistic regression – large standard errors are associated with multicollinearity, although you should be aware that you could also have large standard errors for other reasons.

IBM (which now owns SPSS) suggests that you dummy code your categorical variables and run a linear regression to get the multicollinearity statistics, then, after dropping variables as appropriate, run a logistic regression to get the predicted probabilities, etc.

http://www-01.ibm.com/support/docview.wss?uid=swg21476696
Jay says:

September 5, 2014 at 12:49 am

Stats made sexy ! Thanks for simplifying the explanations. Much appreciated.
Arjit says:

September 18, 2014 at 3:59 am

Hi,
I have just read you comments about multicollinearity in categorical variables. As per your instructions I have created dummy variables (dichotomous) now when I run the collinearity stats keeping age band as my Dependent Variable and gender=1 and gender=0 as my independent variable. I get a VIF = 15.02 against gender =1 and VIF 1.67 against gender=0 so which one should I remove.

Further gender=1 comes out to be statistically significant in t test (having p value less than 0.05), another possibility is if pvalue of gender =1 is greater than 0.05 that is not significant then in that case should we remove it? Please suggest what should be done.
FASIL says:

September 26, 2014 at 11:34 pm

THANKS
Joselyne says:

October 8, 2014 at 6:04 am

hello!
I’m Joselyne i would like to know how can i remove multicollinearity by using spss without removing any independent variable because all independent variables are very important.(are imports,exports and exchange rate)the vif of imports and exports are very high.
thank you
Amelia says:

November 25, 2014 at 6:32 am

Thank you so much for such a clear, concise and funny explanation. I appreciate your help loadddsss! 🙂
Thajudeen Hassan says:

January 25, 2015 at 11:17 pm

Thank you,very good explanation.I would like to know, the steps to find CFI and TLI.
Astha says:

February 25, 2015 at 3:25 pm

I am going to remember this forever. Thank you for explaining it so well…
Shawnda says:

February 26, 2015 at 11:22 am

Thanks so much!! That was awesome (I mean it)! xo.
randomwalk says:

March 26, 2015 at 4:04 pm

I think that collinearity is somethink we have to deal with. It is like being invisible, inflates out coefficients drinving us to misinterpreting. Anyway on my website i wrote an interesting article about
http://randomwalkproject.it/2015/01/11/why-collinearity-matters-in-regression-analysis/
randomwalk says:

March 26, 2015 at 4:05 pm

By the way, your explanation is really accurate. Congrats
Gabby says:

May 29, 2015 at 11:17 am

Hi! For each regression I compute, I am finding a Tolerance below .2 and a VIF above 5 for the same 2 independent variables. However, when I put either of these 2 problem children as the dependent variable, I do not find multicollinearity (i.e., Tolerance above .2 and VIF below 5). What does this mean? Do I not have multicollinearity?
Joyeeta deb says:

August 16, 2015 at 2:29 am

Such a easy and Crystal clear explanation. Thanks
Sam says:

September 22, 2015 at 8:36 pm

Hi, Thank you very much for clear blog. I had a severe problem regarding multicollinearity in my research. You made my life easier. Your blog is great. Highly appreciated your work. 🙂
Osman says:

November 1, 2015 at 4:20 pm

Thanks for this post. It helped me a lot. Your jokes are excellent by the way 🙂
ORODHO JOHN ALUKO says:

November 2, 2015 at 4:42 am

Quite simply explained. Most Doctorate students find the interpretation of tolerance and VIF confusing . They hardly conduct multi-linearity test before embarking on regression analysis. We shall continue sharing experiences as we demystify statistics to all learners.
Prof. John Aluko Orodho[Associate Professor of Research and Statistics ]
sayid says:

November 19, 2015 at 1:10 am

Hi, when should we conduct multicollinearity test? at the beginning of variable screening or after the regression (meaning after we have a final variables)
Darlene Nguyen says:

January 10, 2016 at 8:40 am

Thank you so much for the detail explanation!! I’m so bad at SPSS and always have to count on others. But thanks to your detail explanation, my mind is exploded! haha not really, but I’m very appreciate your time writing this article for clueless student like me!
gladys mwaniki says:

February 10, 2016 at 8:50 pm

Hahaha. I am doing the final touches on preparation for my research proposal in the next3-5hrs. I was trying to understand these tests. Wow, you have made my day. Thanks
Hamzah says:

March 17, 2016 at 12:12 pm

Dear all. I have the final model, using survey logistic regression, with the collinearity especially in dummy variables. Nevertheless,logistic model for desease, goodnes of fit of test Prob > F : 0.002 and each of variables in the model is significance. May I kept the model with the collinearity or should remove the variable that have collineaeity?. Please give me sugestion. Thank you so much.
Maria Gonzalez says:

April 9, 2016 at 3:49 am

Thank you very much for this wonderful explanation and for making me laugh while learning through! I am working on my PhD and needed to handle with collinearity, that has been really helpful and so easy to read!
I will sign up for a seminary with you at any time
Thank you again
Maria
Barcelona
Abed Nadir says:

July 22, 2016 at 8:16 am

Great! Thank you!
Chris Lee says:

October 6, 2016 at 9:01 am

Your explanation about tolerance & VIF is very easy to follow. Thanks a lot for your great service & contribution to the community.
Som Nwegbu says:

January 31, 2017 at 11:12 am

Hello,

I have quite a different sort of problem. I’m working with survey data for a huge dataset (DHS 2013 for Nigeria). I’m trying to use the tolerance and VIF score to determine if I have multicollinearity. After running a regression model of my predictor variables and then the follow up VIF command (in Stata) this is what I get:

tolerance=. & VIF=.

I’m stumped. I got a nice output for my regression model though, unfortunately I have no way of showing it to you. What does it mean that my tolerance and VIF values are missing? Is that bad?

What could I be doing wrong? Thanks in advance.
mariam says:

March 24, 2017 at 6:43 am

Hi
pls how can I get a combined VIF and tolerance for multicollinearity using one dependent and 5 independent variables
yenew says:

June 6, 2017 at 2:55 am

Tolerance above .2 and VIF below 5). What does this mean? Do I not have multicollinearity?
MaDhu sarda says:

July 14, 2017 at 12:53 pm

In spss I am using enter method under regression analysis but still results show some excluded variable. I need to include all. Kindly tell me the way out. Thank you in advance. Kindly revert back.
Ruby says:

September 13, 2017 at 5:46 pm

This was fun to read and informative. Thank you.
Lynn Murton says:

October 18, 2017 at 3:12 pm

Hi there,

So if these were my output would this be multicollienarity because they are greater than .9?
Collinearity
Tolerance    VIF
.952         1.050
.835         1.197
.872         1.147

Thanks!
Amani says:

August 29, 2018 at 7:54 am

how do we compute collinearity with dummy values in SPSS?
Nashon says:

November 24, 2018 at 6:27 am

Very useful article on multi collinearity. It has really helped me. Thanks
