When Data Misbehave: Categorical Data

By AnnMaria De Mars, December 27, 2010

Today I was thinking about categorical modeling. I suppose other people were thinking about art, music, unicorns and bunnies, but they are not me. I was going to title this blog post “Modeling categorical data, Part 1” but two things occurred to me. The first is that no one would read a blog post named “Modeling categorical data” and the second is that almost every time I write a post with Part 1 in the title I never get to a Part 2, since sequential processing is inconsistent with my blogging philosophy.

Anyway … I was specifically thinking about the problem people have when the variable they are interested in is categorical, say, whether or not a student will major in a STEM (science, technology, engineering or mathematics) field.

My dependent variable is clearly categorical. If I had a bunch of numeric predictors (say, family income, academic achievement test scores, GPA) then logistic regression would be a good choice.

Actual point 1: If you have a categorical dependent variable and numeric predictors, then logistic regression can be good. Regular, ordinary least squares regression is bad in this case, and you should know why if you ever took an introductory statistics course. If you did not, read this lovely paper by Osborne and Waters on four assumptions researchers should test to see what you have been missing. It really is well-written. Then think about the fact that a binary variable, like pursued STEM career yes/no, takes only two values, so it cannot be normally distributed, and neither can the errors from a straight line fitted to it.
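Here is a minimal Python sketch of why the straight line misbehaves. Everything in it is an assumption for illustration: the ten test scores and STEM outcomes are made up, and the function names (`fit_ols`, `fit_logistic`) are mine, not from any package. It fits ordinary least squares and a one-predictor logistic regression (by Newton-Raphson) to the same binary outcome and prints the predictions side by side.

```python
import math

# Hypothetical data, invented for illustration only:
# scores = an achievement test score, stem = 1 if the student chose a STEM major.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
stem = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]


def fit_ols(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, iterations=25):
    """Logistic regression with one predictor, fitted by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix entries (the weights are p * (1 - p))
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 ** 2
        # Newton step: solve the 2x2 system by hand
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


ols_b0, ols_b1 = fit_ols(scores, stem)
log_b0, log_b1 = fit_logistic(scores, stem)

ols_pred = [ols_b0 + ols_b1 * s for s in scores]
logit_pred = [sigmoid(log_b0 + log_b1 * s) for s in scores]

for s, o, l in zip(scores, ols_pred, logit_pred):
    print(f"score={s:2d}  OLS={o:6.3f}  logistic={l:5.3f}")
```

With these made-up numbers the least squares line predicts a "probability" below 0 for the lowest score and above 1 for the highest, while the logistic predictions stay between 0 and 1. In real work you would use your statistics package's logistic regression procedure rather than a hand-rolled fit like this one.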

The problem with working with human beings and the institutions that serve them is that data are often not available in neat continuous form. For example, rather than family income (which is far from normally distributed anyway, but that’s a different topic) we have whether the student received a free lunch or not. This is a categorical variable. (Thank you Captain Obvious.) Rather than GPA, we have the number of advanced placement courses in mathematics and science they took, which is entered as 0, 1, 2, 3, or 4 or more. This is an ordinal variable. You could also model it as a categorical variable. You don’t have academic achievement test scores but you do have whether or not the student is in a specialized academy for high school students interested in technology careers, another dichotomous, categorical variable.
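The two ways of handling the advanced placement variable can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the student records are invented, and the helper names `dummy_code` and `ordinal_code` are hypothetical, not from any library. Treating the variable as ordinal gives one score per student; treating it as categorical gives one 0/1 indicator for each level other than the reference level.

```python
# Levels of the AP-course variable as described above: 0, 1, 2, 3, or 4 or more.
AP_LEVELS = ["0", "1", "2", "3", "4+"]


def dummy_code(value, levels, reference=None):
    """Categorical coding: one 0/1 indicator per non-reference level."""
    if reference is None:
        reference = levels[0]
    return [1 if value == level else 0 for level in levels if level != reference]


def ordinal_code(value, levels):
    """Ordinal coding: a single score equal to the level's rank (0, 1, 2, ...)."""
    return levels.index(value)


# Hypothetical student records, invented for illustration.
students = [
    {"free_lunch": "yes", "ap_courses": "2", "tech_academy": "no"},
    {"free_lunch": "no", "ap_courses": "4+", "tech_academy": "yes"},
]

for s in students:
    row = {
        "free_lunch": 1 if s["free_lunch"] == "yes" else 0,
        "tech_academy": 1 if s["tech_academy"] == "yes" else 0,
        "ap_ordinal": ordinal_code(s["ap_courses"], AP_LEVELS),
        "ap_dummies": dummy_code(s["ap_courses"], AP_LEVELS),
    }
    print(row)
```

The ordinal score assumes that each step up the scale means the same thing, which is a strong assumption when the top category is "4 or more". The dummy coding makes no such assumption but uses four parameters instead of one.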

Loglinear or Logistic Regression

There seems to be some confusion about when one should use a loglinear model versus a logistic regression. Here are two very simple questions:

Do you have one and only one categorical dependent variable (some call this a response variable) which you would like to predict with multiple independent variables?

If the answer is yes then loglinear analysis is not your answer. Use logistic regression instead.

Do you have a categorical dependent variable and one or more continuous, numeric independent variables?

If the answer is yes then loglinear analysis is not your answer. Use logistic regression.

There is a nice page by Angela Jeansonne of San Francisco State University that goes into this in more detail. Her description of loglinear models and nested models is excellent, but I wish she had chosen a better example.

Another really good explanation of loglinear, logit and probit models can be found at this North Carolina State University site.

In short, the big reason for using a loglinear model is that you don’t have a single dependent variable. Instead, you have multiple variables that are related. The best thing about loglinear models, in my opinion (and since I am writing this, it is only my opinion that counts), is that you can test nested models and go with the simplest one that fits the data.
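As a rough sketch of what testing nested models looks like, here is the smallest possible case in Python: a 2x2 table with made-up counts (technology academy by STEM major), where the simpler model says the two variables are independent and the saturated model adds the association term. The counts, the function names and the use of the 0.05 critical value are all my assumptions for the example. The likelihood-ratio statistic G² measures how much worse the simpler model fits.

```python
import math

# Hypothetical 2x2 table of counts, invented for illustration.
# Rows: in the technology academy (yes, no). Columns: STEM major (yes, no).
observed = [[30, 20],
            [25, 75]]


def independence_fit(table):
    """Expected counts under the independence model (no row-by-column term)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def g_squared(table, expected):
    """Likelihood-ratio statistic comparing a model's fit with the saturated model."""
    return 2 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0  # a zero cell contributes nothing to the sum
    )


expected = independence_fit(observed)
g2 = g_squared(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# 3.841 is the 0.05 critical value of chi-square with 1 degree of freedom.
CRITICAL_05_DF1 = 3.841
keep_simpler_model = g2 <= CRITICAL_05_DF1

print(f"G^2 = {g2:.2f} on {df} df")
print("independence model is adequate" if keep_simpler_model
      else "independence model fits poorly; keep the association term")
```

If G² is small relative to its degrees of freedom, you go with the simpler model. With more than two variables the same comparison is made between each model and the next simpler one nested inside it, which is what a loglinear procedure in your statistics package does for you.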

3 Comments

MP says:

December 29, 2010 at 11:00 am

Have you seen this article?
Linear versus logistic regression when the dependent variable is a dichotomy.

http://www.springerlink.com/content/m2j74066476t0g41/

admin says:

December 29, 2010 at 7:57 pm

Thanks for mentioning it. I had not read the article. I read the abstract but since we don’t subscribe to that journal and I didn’t feel sufficiently motivated to pay $34 to retrieve the article I decided to test the hypothesis myself.

See next blog post (-:

MP says:

January 3, 2011 at 10:13 am

That figures… I tried to access the article again and my old online library access is no longer valid either. I wonder how often people actually pay the ridiculous prices for single-article downloads from journals.
