Recently, someone asked me if I could explain chi-square in a few words. The short answer is, “No, I am incapable of using only a few words for any purpose whatsoever. If you doubt this, ask any of my children.”
What is chi-square?
Chi-square is a measure of relationship between two categorical variables. For example, let’s pick Proposition 8, the recent initiative on the ballot regarding gay marriage. The two categories of voters were “For” and “Against”. This initiative was passed 52% to 48%. Remember these proportions. They are important later on.
The gender of voters also fell into two categories, “Male” and “Female”, who are around 49% and 51% of the population. If I wanted to test whether there was a relationship between votes on Prop 8 and gender, a chi-square would be a great test to use.
The null hypothesis being tested is: “There is no relationship between gender and how one voted on Proposition 8”.
I tried to find actual data on this relationship but after searching through a lot of websites and articles trying to find facts on this issue, I was depressed by the number of people who hate other groups of people and were not at all reluctant to write about it, data or no, so I just gave up. Here, proceeding without any interference from real data, is a hypothetical example.
We find 1,000 people who are willing to tell us how they voted and their gender. Just to make life easier, we deliberately select 500 males and 500 females. This gives us a two by two table:
         Yes   No
Female   237   263
Male     280   220
More males voted yes and more females voted no. Was this just random or are males really more likely to vote against gay marriage?
The formula for a chi-square is the sum, over every cell in the table, of the observed number minus the expected number, squared, and divided by the expected number.
In this case, if there were no relationship between gender and which way you voted, the expected number in each cell would be 260 yes (52%) and 240 no (48%) for both male and female.
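Here is a minimal Python sketch of where those expected numbers come from. The variable names are mine, not part of any package. It also computes the expected counts the way most textbooks do for a test of independence (row total times column total, divided by the grand total), which is an addition on my part; with these made-up data the two approaches land very close to each other.

```python
# Observed counts from the hypothetical two by two table.
observed = {
    "Female": {"yes": 237, "no": 263},
    "Male": {"yes": 280, "no": 220},
}

# Approach used in this post: apply the 52% / 48% election result
# to each group of 500 voters.
statewide_share = {"yes": 0.52, "no": 0.48}
expected_from_vote = {}
for gender, cells in observed.items():
    group_size = cells["yes"] + cells["no"]
    expected_from_vote[gender] = {}
    for vote in ("yes", "no"):
        expected_from_vote[gender][vote] = group_size * statewide_share[vote]

# Textbook approach (an assumption added here, not the post's method):
# expected = row total * column total / grand total.
grand_total = 0
column_totals = {"yes": 0, "no": 0}
for cells in observed.values():
    for vote, count in cells.items():
        column_totals[vote] += count
        grand_total += count

expected_from_margins = {}
for gender, cells in observed.items():
    row_total = cells["yes"] + cells["no"]
    expected_from_margins[gender] = {}
    for vote in ("yes", "no"):
        expected_from_margins[gender][vote] = (
            row_total * column_totals[vote] / grand_total
        )

print(expected_from_vote)
print(expected_from_margins)
```

The first approach gives 260 and 240 in each row; the second gives 258.5 and 241.5, because the sample itself split 517 yes to 483 no.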
In the first cell, we have (237 - 260)**2 / 260 = 2.03
In the second cell, we have (263 - 240)**2 / 240 = 2.20
In the third cell, (280 - 260)**2 / 260 gives us 1.54
And, in the fourth cell, (220 - 240)**2 / 240 = 1.67
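If you would rather let a computer do the arithmetic, the four cell calculations can be sketched in a few lines of Python. This is just an illustration using the hypothetical counts above and the expected values of 260 and 240; the labels and variable names are my own.

```python
# Each entry: (cell label, observed count, expected count under the null).
cells = [
    ("Female, yes", 237, 260),
    ("Female, no", 263, 240),
    ("Male, yes", 280, 260),
    ("Male, no", 220, 240),
]

chi_square = 0.0
for label, observed, expected in cells:
    # Contribution of one cell: (observed - expected) squared, over expected.
    contribution = (observed - expected) ** 2 / expected
    chi_square += contribution
    print(f"{label}: {contribution:.2f}")

print(f"chi-square = {chi_square:.2f}")
```

Summing the unrounded contributions gives a chi-square of roughly 7.44.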
I end up with a chi-square value of 7.4, which is statistically significant. A two by two table has one degree of freedom, and if the null hypothesis were true, the probability of obtaining a chi-square value of 7.4 or larger is less than .01, or one out of 100. Therefore, if these data were real and not some random numbers that I made up, I could conclude that women are less likely to be opposed to gay marriage than men.
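To check the "less than .01" claim without a table of critical values, here is a small sketch using only Python's standard library. It leans on a shortcut that works only for one degree of freedom: a chi-square variable with 1 df is a squared standard normal, so the upper-tail probability can be written with the complementary error function. For tables larger than two by two you would need a proper chi-square distribution function from a statistics package.

```python
import math

observed = [237, 263, 280, 220]
expected = [260, 240, 260, 240]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom for a table: (rows - 1) * (columns - 1).
degrees_of_freedom = (2 - 1) * (2 - 1)

# Valid for df = 1 only: P(chi-square >= x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}")
print(f"p-value = {p_value:.4f}")
print("significant at .01" if p_value < 0.01 else "not significant at .01")
```

The p-value comes out a little above .006, comfortably under the .01 cutoff.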
Why did I detour into chi-square when I said I was going to spend the next week talking about categorical models? It’s not a detour, really.
Understanding chi-square is one of the building blocks of getting into log-linear models and more. Next, I want to talk about another basic statistic, the phi coefficient, and how, like marzipan, it really isn’t all it’s cracked up to be.
How to get a chi-square in SAS:
Proc freq data = datasetname ;
tables variable1 * variable2 / chisq ;
run ;
How to get a chi-square in SPSS:
CROSSTABS
/TABLES = variable1 BY variable2
/STATISTICS = CHISQ.
How to get a chi-square in Stata:
tabulate variable1 variable2 , chi2
Now you know more than you wanted to know about chi-square.