Recently, someone asked me if I could explain chi-square in a few words. The short answer is, “No, I am incapable of using only a few words for any purpose whatsoever. If you doubt this, ask any of my children.”
What is chi-square?
Chi-square is a measure of relationship between two categorical variables. For example, let’s pick Proposition 8, the recent ballot initiative regarding gay marriage. The two categories of voters were “For” and “Against”. This initiative passed 52% to 48%. Remember these proportions. They are important later on.
The gender of voters also fell into two categories, “Male” and “Female”, who are around 49% and 51% of the population. If I wanted to test whether there was a relationship between votes on Prop 8 and gender, a chi-square would be a great test to use.
The null hypothesis being tested is: “There is no relationship between gender and how one voted on Proposition 8.”
I tried to find actual data on this relationship but after searching through a lot of websites and articles trying to find facts on this issue, I was depressed by the number of people who hate other groups of people and were not at all reluctant to write about it, data or no, so I just gave up. Here, proceeding without any interference from real data, is a hypothetical example.
We find 1,000 people who are willing to tell us how they voted and their gender. Just to make life easier, we deliberately select 500 males and 500 females. This gives us a two by two table:
         Yes    No
Female   237   263
Male     280   220
More males voted yes and more females voted no. (A yes vote on Proposition 8 was a vote against gay marriage.) Was this just random, or are males really more likely to vote against gay marriage?
The formula for a chi-square is the sum, over all of the cells, of the observed number in each cell minus the expected number, squared, and divided by the expected number.
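For those who prefer symbols to sentences, the same formula can be written as follows, where O is the observed count in a cell and E is the expected count. The letters are just shorthand chosen for this sketch.

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```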
In this case, if there were no relationship between gender and which way you voted, the expected counts would be 260 yes (52% of 500) and 240 no (48% of 500) for both males and females.
In the first cell, we have (237 - 260)**2 / 260 = 2.03
In the second cell, we have (263 - 240)**2 / 240 = 2.20
In the third cell, (280 - 260)**2 / 260 gives us 1.54
And, in the fourth cell, (220 - 240)**2 / 240 = 1.67
Adding these up (2.03 + 2.20 + 1.54 + 1.67), I end up with a chi-square value of 7.4, which is statistically significant. The probability of obtaining a chi-square value of 7.4 or larger, if there really were no relationship, is less than .01, or one out of 100. Therefore, if these data were real and not some random numbers that I made up, I could conclude that women are less likely to be opposed to gay marriage than men.
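If you would rather let a computer do the arithmetic, here is a minimal Python sketch of the hand calculation above. The variable names are made up for illustration, the expected counts are the 260/240 split used above, and the p-value line assumes one degree of freedom (the usual value for a two by two table). That is an assumption on my part, since the degrees of freedom are not stated above.

```python
import math

# Observed counts from the hypothetical table: (yes, no) for each gender.
observed = {"Female": (237, 263), "Male": (280, 220)}

# Expected counts under the null hypothesis: 52% yes and 48% no of 500 voters.
expected = (260, 240)

# Sum (observed - expected)^2 / expected over all four cells.
chi_square = sum(
    (obs - exp) ** 2 / exp
    for counts in observed.values()
    for obs, exp in zip(counts, expected)
)

# ASSUMPTION: one degree of freedom. For 1 df, the upper-tail probability
# of a chi-square value x equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.2f}")  # about 7.44
print(f"p < .01? {p_value < 0.01}")
```

The erfc shortcut works only for one degree of freedom. For anything larger you would want a statistics library or a printed table.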
Why did I detour into chi-square when I said I was going to spend the next week talking about categorical models? It’s not a detour, really.
Understanding chi-square is one of the building blocks of getting into log-linear models and more. Next, I want to talk about another basic statistic, the phi coefficient, and how, like marzipan, it really isn’t all it’s cracked up to be.
How to get a chi-square in SAS:
Proc freq data = datasetname ;
tables variable1 * variable2 / chisq ;
run ;
How to get a chi-square in SPSS:
CROSSTABS
/TABLES = variable1 BY variable2
/STATISTICS = CHISQ.
Chi-square in Stata
tabulate variable1 variable2 , chi2
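For comparison, here is a rough Python sketch of the Pearson chi-square these procedures report for a test of independence. It takes the expected counts from the row and column totals of the table itself rather than from outside percentages. The function name is hypothetical, and the p-value shortcut holds only for one degree of freedom, which is what a two by two table has.

```python
import math

def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if rows and columns were unrelated.
            exp = row_totals[i] * col_totals[j] / n
            statistic += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Rows are Female, Male; columns are yes, no (the hypothetical data above).
stat, df = pearson_chi_square([[237, 263], [280, 220]])

# Shortcut that is valid only when df == 1.
p_value = math.erfc(math.sqrt(stat / 2))
print(f"chi-square = {stat:.2f}, df = {df}, p < .01: {p_value < 0.01}")
```

With these made-up numbers the expected counts come out to 258.5 yes and 241.5 no, so the statistic is about 7.40 rather than the 7.44 from the hand calculation, and the conclusion is the same.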
Now you know more than you wanted to know about chi-square.
Categorical data analysis used to be simple. You had two nominal variables and you did a chi-square analysis. If it was statistically significant, that was all it took to make life good.
Then, logistic regression came along, with the reasonable notion that:
A. Dichotomous choices such as bought a candy apple/ate tofu & bean sprouts instead, lived/died or voted/didn’t vote were a very far cry from a normal distribution and pretending otherwise was just a bad thing.
B. It could be useful to be able to predict such dichotomous choices from a combination of other variables, not just one, and it would be even nicer if the variables could be categorical, continuous or a combination.
If you are interested, and who wouldn’t be, a good basic discussion of logistic regression can be found on this page at the University of South Florida site.
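To make the idea concrete, here is a tiny Python sketch of how a logistic model turns a combination of predictors into a probability. The coefficients and the predictors (age and a 0/1 party-member flag) are invented purely for illustration and are not estimates from any data.

```python
import math

def logistic(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL coefficients: intercept, age in years, and a 0/1 flag.
intercept, b_age, b_member = -2.0, 0.03, 1.2

def prob_voted(age, is_member):
    # Linear combination of a continuous and a categorical predictor.
    z = intercept + b_age * age + b_member * is_member
    return logistic(z)

print(round(prob_voted(40, 1), 3))
print(round(prob_voted(40, 0), 3))
```

The point is only that the output always stays between 0 and 1, which is exactly what a straight-line regression cannot promise for a lived/died sort of outcome.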
Just when we got logistic regression settled in, with a pseudo-R squared which made us feel comfortable with something kind of like the ordinary least squares regression we had come to understand, if not exactly know and love, a new complication entered the picture.
What about when you have more than two options, say voting Democrat, Republican or Independent? Maybe we went from plain old caramel apples or red candy apples to a choice of white chocolate covered, dark chocolate, caramel, M&Ms or Reese’s Pieces.
Enter Multinomial Logistic Regression.
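A hedged sketch of the idea in Python: each choice gets its own score, and the scores are turned into probabilities that sum to one. The scores below are invented. In a fitted model they would come from coefficients, with one category's score fixed at zero as the reference.

```python
import math

def choice_probabilities(scores):
    """Turn a dict of {choice: score} into probabilities that sum to 1."""
    # Subtracting the largest score avoids overflow and leaves the result unchanged.
    top = max(scores.values())
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# HYPOTHETICAL scores for one voter; Independent is the reference category.
probs = choice_probabilities({"Democrat": 0.4, "Republican": 0.1, "Independent": 0.0})
for choice, p in probs.items():
    print(f"{choice}: {p:.3f}")
```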
Sometimes, though, choices are predicated on one another. For example, I decide that I am going to take a vacation. I like vacations. I go to Travelocity, on which I have spent approximately 8 zillion dollars over the past ten years, and am confronted with three options: within driving distance, Europe, or somewhere with flamingos. Why not? I like flamingos.
Nested Logistic Regression
Nested logit models are used when one choice is based on, or “nested in”, another. For example, once I have chosen the “flamingo” option, I can choose between two destinations, Florida and the Bahamas. (Did you know that the flamingo is the national bird of the Bahama Islands? Well, now you do.) The Bahamas are nested within the flamingo option, because if I had chosen to take a vacation somewhere within driving distance or in Europe, I could not have chosen the Bahamas.
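Here is one way the nesting could be sketched in Python, using the common textbook form of the nested logit: the probability of a destination is the probability of picking its nest times the probability of picking that destination within the nest. The utilities and the nest parameters (often called dissimilarity parameters) are made-up numbers, and the function name is hypothetical.

```python
import math

def nested_logit(nests, lam):
    """nests: {nest: {alternative: utility}}; lam: {nest: parameter in (0, 1]}."""
    # Inclusive value: log of the summed scaled utilities inside each nest.
    inclusive = {
        k: math.log(sum(math.exp(v / lam[k]) for v in alts.values()))
        for k, alts in nests.items()
    }
    denom = sum(math.exp(lam[k] * inclusive[k]) for k in nests)
    probs = {}
    for k, alts in nests.items():
        p_nest = math.exp(lam[k] * inclusive[k]) / denom
        for alt, v in alts.items():
            # P(alternative) = P(nest) * P(alternative | nest)
            p_within = math.exp(v / lam[k] - inclusive[k])
            probs[alt] = p_nest * p_within
    return probs

# HYPOTHETICAL utilities for the vacation example.
nests = {
    "driving": {"Driving distance": 0.5},
    "europe": {"Europe": 0.2},
    "flamingo": {"Florida": 0.3, "Bahamas": 0.6},
}
lam = {"driving": 1.0, "europe": 1.0, "flamingo": 0.5}
probs = nested_logit(nests, lam)
for place, p in probs.items():
    print(f"{place}: {p:.3f}")
```

If every nest parameter is set to 1, this collapses to the plain multinomial model, which is a handy sanity check.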
Both a nested logit model and a multinomial model take advantage of the fact that you have more information than under a simple yes/no model.
Another model that seems to be gaining popularity is the ordered logit model. This is used when your data can be rank-ordered, for example, responses to “How likely are you to vote for candidate X?” such as:
Not unless winged monkeys descend from the sky and carry me off to the voting booth.
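As a rough Python sketch of how an ordered logit handles ranked responses like these: the model places cutpoints along a single underlying scale and gives each response category the probability of falling between two neighboring cutpoints. The cutpoints, the score, and the number of categories are all invented for illustration.

```python
import math

def ordered_logit_probs(score, cutpoints):
    """Probabilities for len(cutpoints) + 1 ordered categories, lowest first."""
    # Cumulative probability of being at or below each category.
    cumulative = [1.0 / (1.0 + math.exp(-(c - score))) for c in cutpoints]
    cumulative.append(1.0)  # the top category closes the scale
    probs, previous = [], 0.0
    for cum in cumulative:
        probs.append(cum - previous)
        previous = cum
    return probs

# HYPOTHETICAL: four ordered responses, so three increasing cutpoints.
probs = ordered_logit_probs(score=0.8, cutpoints=[-1.0, 0.5, 2.0])
print([round(p, 3) for p in probs])
```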
I will be writing about these models for the next week because they make me happy. For now, though, I am off to meetings.