# When Data Misbehave: Categorical Data

Today I was thinking about categorical modeling. I suppose other people were thinking about art, music, unicorns and bunnies, but they are not me. I was going to title this blog post “Modeling Categorical Data, Part 1” but two things occurred to me. The first is that no one would read a blog post named “Modeling Categorical Data,” and the second was that almost every time I write a post with Part 1 in the title I never get to a Part 2, since sequential processing is inconsistent with my blogging philosophy.

Anyway … I was specifically thinking about the problem people run into when the variable they are interested in is categorical, say, whether a student will major in a STEM (science, technology, engineering or mathematics) field or not.

My dependent variable is clearly categorical. If I had a bunch of numeric predictors, say, family income, academic achievement test scores, and GPA, then logistic regression would be a good choice.

Actual point 1: If you have a categorical dependent variable and numeric predictors, then logistic regression can be a good choice. Regular, ordinary least squares regression is a bad one in this case, and you should know why if you ever took an introductory statistics course. If you did not, read this lovely paper by Osborne and Waters on four assumptions researchers should test to see what you have been missing. It really is well-written. Then think about the fact that a binary variable, like pursued a STEM career (yes/no), cannot be normally distributed.
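To make the point concrete, here is a minimal sketch of logistic regression fit by plain gradient ascent on made-up data. The predictors, the coefficients, and the sample are all invented for illustration; the thing to notice is that the fitted values are genuine probabilities between 0 and 1, which OLS would not guarantee for a binary outcome.

```python
import numpy as np

rng = np.random.default_rng(0)

# Purely synthetic toy data: two numeric predictors (think test score, GPA)
# and a binary outcome (1 = chose a STEM major).
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([1.5, -0.8])          # made-up "true" effects
p_true = 1 / (1 + np.exp(-(X @ true_beta + 0.3)))
y = rng.binomial(1, p_true)

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit logistic regression by gradient ascent on the mean log-likelihood."""
    Xb = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(Xb @ beta)))       # current predicted probabilities
        beta += lr * Xb.T @ (y - p) / len(y)     # gradient of mean log-likelihood
    return beta

beta_hat = fit_logistic(X, y)
p_hat = 1 / (1 + np.exp(-(np.column_stack([np.ones(n), X]) @ beta_hat)))
# Every fitted value lies strictly between 0 and 1, unlike OLS predictions.
```

In practice you would use a library routine rather than rolling your own, but the sketch shows why the logistic link is the right tool: it maps any linear combination of predictors into (0, 1).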

The problem with working with human beings and the institutions that serve them is that data are often not available in neat continuous form. For example, rather than family income (which is far from normally distributed anyway, but that’s a different topic) we have whether the student received a free lunch or not. This is a categorical variable. (Thank you, Captain Obvious.) Rather than GPA, we have the number of advanced placement courses in mathematics and science the student took, which is entered as 0, 1, 2, 3, or 4 or more. This is an ordinal variable. You could also model it as a categorical variable. You don’t have academic achievement test scores, but you do have whether or not the student is in a specialized academy for high school students interested in technology careers, another dichotomous, categorical variable.
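Predictors like these still go into a regression, they just need to be coded first. Here is one hedged sketch, with hypothetical student records, of turning the variables above into a design matrix: the two dichotomous variables become single 0/1 indicators, and the ordinal AP count is coded as dummy variables with "0 courses" as the reference level (treating it as categorical, one of the two options mentioned above).

```python
import numpy as np

# Hypothetical rows: (free lunch yes/no, AP courses "0".."4+", academy yes/no)
students = [("yes", "2", "no"),
            ("no", "4+", "yes"),
            ("no", "0", "no")]

def encode(row):
    free_lunch, ap, academy = row
    # Dichotomous variables become single 0/1 indicators.
    x = [1.0 if free_lunch == "yes" else 0.0,
         1.0 if academy == "yes" else 0.0]
    # The ordinal AP count coded as dummies, with "0" as the reference level.
    levels = ["0", "1", "2", "3", "4+"]
    x += [1.0 if ap == lvl else 0.0 for lvl in levels[1:]]
    return x

design = np.array([encode(r) for r in students])
# Each row: [free_lunch, academy, AP=1, AP=2, AP=3, AP=4+]
```

Alternatively, you could treat the AP count as a single numeric column (0 through 4), which assumes each additional course has the same effect; the dummy coding makes no such assumption.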

## Log-linear or Logistic Regression

There seems to be some confusion about when one should use a log-linear model versus a logistic regression. Here are two very simple questions:

Do you have one and only one categorical dependent variable (some call this a response variable) which you would like to predict with multiple independent variables?

If the answer is yes then loglinear analysis is not your answer. Use logistic regression instead.

Do you have a categorical dependent variable and one or more continuous, numeric independent variables?

If the answer is yes then loglinear analysis is not your answer. Use logistic regression.

There is a nice page by Angela Jeansonne of San Francisco State University that goes into this in more detail. Her description of loglinear models and nested models is excellent, but I wish she had chosen a better example.

In short, the big reason for using a loglinear model is that you don’t have a single dependent variable. Instead, you have multiple variables that are related. The best thing about loglinear models in my opinion (and since I am writing this, it is only my opinion that counts), is that you can test nested models and go with the simplest one that fits the data.
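That nested-model comparison can be shown with a tiny worked example. The numbers below are entirely made up: a 2x2 table of free lunch status by STEM major. The simpler loglinear model (independence, no interaction term) is nested inside the saturated model, and the likelihood-ratio statistic G² tells you whether the simpler one fits well enough to keep.

```python
import math

# Hypothetical counts: rows = free lunch (yes, no), columns = STEM major (yes, no).
table = [[30, 10],
         [25, 35]]

def g_squared(observed, expected):
    """Likelihood-ratio statistic: G^2 = 2 * sum obs * log(obs / exp)."""
    return 2 * sum(o * math.log(o / e)
                   for row_o, row_e in zip(observed, expected)
                   for o, e in zip(row_o, row_e) if o > 0)

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

# Expected counts under the simpler (independence) loglinear model.
expected = [[rt * ct / n for ct in col_tot] for rt in row_tot]

G2 = g_squared(table, expected)
# Compare G2 to a chi-square critical value with 1 df (about 3.84 at the
# 5% level). Small G2: keep the simpler independence model. Large G2: the
# data demand the interaction term, so the saturated model wins.
```

With several variables there is a whole ladder of nested models between independence and saturated, and the same logic applies at each rung: move up only when the simpler model's fit statistic says you must.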