When Data Misbehave: Categorical Data

By AnnMaria De Mars, December 27, 2010

Today I was thinking about categorical modeling. I suppose other people were thinking about art, music, unicorns and bunnies, but they are not me. I was going to title this blog post “Modeling categorical data, Part 1” but two things occurred to me. The first is that no one would read a blog post named “Modeling categorical data” and the second is that almost every time I write a post with Part 1 in the title I never get to a Part 2, since sequential processing is inconsistent with my blogging philosophy.

Anyway … I was specifically thinking about the problem people have when the variable they are interested in is categorical, say, whether or not a student will major in a STEM (science, technology, engineering or mathematics) field.

My dependent variable is clearly categorical. If I had a bunch of numeric predictors, say, family income, academic achievement test scores and GPA, then logistic regression would be a good choice.

Actual point 1: If you have a categorical dependent variable and numeric predictors then logistic regression can be good. Regular, ordinary least squares regression is bad in this case and you should know why if you ever took an introductory statistics course. If you did not, read this lovely paper by Osborne and Waters on four assumptions researchers should test to see what you have been missing. It really is well-written. Then think about the fact that a binary variable, like pursued a STEM career (yes or no), cannot be normally distributed.
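If you would rather see the problem than take my word for it, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the ten scores and STEM outcomes are made up, and `ols_fit`, `logistic_fit` and `logistic_predict` are hypothetical helper names, not functions from any statistics package. It fits a straight line and a logistic curve to the same binary outcome and then predicts for a few scores, two of them outside the observed range.

```python
import math

# Made-up data for illustration only: x is a test score on a 1-10 scale,
# y is 1 if the student chose a STEM major and 0 otherwise.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def ols_fit(x, y):
    """Closed-form least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def logistic_fit(x, y, iterations=25):
    """Newton-Raphson maximum likelihood for one predictor: returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Entries of the 2x2 information matrix
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def logistic_predict(b0, b1, value):
    """Predicted probability from the fitted logistic curve."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * value)))


new_scores = [0, 5, 12]  # 0 and 12 lie outside the observed range
intercept, slope = ols_fit(x, y)
ols_pred = [intercept + slope * v for v in new_scores]
b0, b1 = logistic_fit(x, y)
logit_pred = [logistic_predict(b0, b1, v) for v in new_scores]

print("OLS:     ", [round(p, 3) for p in ols_pred])
print("Logistic:", [round(p, 3) for p in logit_pred])
```

With these made-up numbers the straight line hands you a "probability" below 0 for a score of 0 and above 1 for a score of 12, while the logistic predictions stay between 0 and 1. In real life you would use a statistics package rather than hand-rolled Newton steps; the point here is only the shape of the problem.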

The problem with working with human beings and the institutions that serve them is that data are often not available in neat continuous form. For example, rather than family income (which is far from normally distributed anyway, but that’s a different topic) we have whether the student received a free lunch or not. This is a categorical variable. (Thank you, Captain Obvious.) Rather than GPA, we have the number of advanced placement courses in mathematics and science they took, which is entered as 0, 1, 2, 3, 4 or more. This is an ordinal variable. You could also model it as a categorical variable. You don’t have academic achievement test scores but you do have whether or not the student is in a specialized academy for high school students interested in technology careers, another dichotomous, categorical variable.

Log-linear or Logistic Regression

There seems to be some confusion about when one should use a log-linear model versus a logistic regression. Here are two very simple questions:

Do you have one and only one categorical dependent variable (some call this a response variable) which you would like to predict with multiple independent variables?

If the answer is yes then loglinear analysis is not your answer. Use logistic regression instead.

Do you have a categorical dependent variable and one or more continuous, numeric independent variables?

If the answer is yes then loglinear analysis is not your answer. Use logistic regression.

There is a nice page by Angela Jeansonne of San Francisco State University that goes into this in more detail. Her description of loglinear models and nested models is excellent, but I wish she had chosen a better example.

Another really good explanation of loglinear, logit and probit models can be found at this North Carolina State University site.

In short, the big reason for using a loglinear model is that you don’t have a single dependent variable. Instead, you have multiple variables that are related. The best thing about loglinear models in my opinion (and since I am writing this, it is only my opinion that counts), is that you can test nested models and go with the simplest one that fits the data.
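To make the nested-models idea concrete, here is a small sketch in plain Python for the simplest possible case, a 2 x 2 table. The counts are invented for illustration, and `independence_expected` and `g_squared` are hypothetical helper names. The independence model (no association term) is nested in the saturated model, and the likelihood-ratio statistic G² for the independence model is the test of whether the simpler model fits.

```python
import math

# Hypothetical counts, invented for illustration:
# rows = free lunch (yes, no), columns = STEM major (yes, no).
observed = [[20, 30],
            [40, 10]]


def independence_expected(table):
    """Expected counts under the independence (no association) model."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def g_squared(table, expected):
    """Likelihood-ratio statistic: 2 * sum of O * ln(O / E) over nonzero cells."""
    return 2.0 * sum(
        o * math.log(o / e)
        for o_row, e_row in zip(table, expected)
        for o, e in zip(o_row, e_row)
        if o > 0
    )


expected = independence_expected(observed)
g2 = g_squared(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_05_DF1 = 3.841  # chi-square critical value for alpha = .05 with 1 df
keep_simpler_model = g2 < CRITICAL_05_DF1

print("G-squared:", round(g2, 2), "on", df, "df")
print("Keep the independence model?", keep_simpler_model)
```

If G² is below the critical value, the simpler model fits well enough and you go with it; if not, as with these made-up counts, you keep the association term. With more variables the logic is the same, just with more models to compare, and that is a job for your statistical software rather than a loop you write by hand.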

3 Comments

MP says:

December 29, 2010 at 11:00 am

Have you seen this article?
Linear versus logistic regression when the dependent variable is a dichotomy.

http://www.springerlink.com/content/m2j74066476t0g41/

admin says:

December 29, 2010 at 7:57 pm

Thanks for mentioning it. I had not read the article. I read the abstract but since we don’t subscribe to that journal and I didn’t feel sufficiently motivated to pay $34 to retrieve the article I decided to test the hypothesis myself.

See next blog post (-:

MP says:

January 3, 2011 at 10:13 am

That figures…I tried to access the article again and my old online library access is no longer valid either. I wonder how often people actually pay the ridiculous prices for single article downloads from journals.
