I’m teaching a class on categorical data analysis after the Western Users of SAS Software conference next week. As always, I have WAY more information than I can cover. Handouts are limited to 40 pages, so I sent the organizers 80 slides, two to a page, even though I know I am going to cover way more than that. Why not put them three to a page? Because that is just silly. I’d rather have people have 80 slides they can read than 120 they can’t.
From lectures and papers over the years, I have way, way more material than will fit in three hours. Now, the question is what do I include and what do I leave out? There are some obvious things to leave in:
- How to code and interpret a logistic regression analysis.
- How to interpret model fit statistics.
- What is an odds ratio and how do you get it?
The above points all address things that people will commonly want to do, like use multiple variables to predict which category a person will fall into (hence the need for logistic regression).
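For the odds ratio question, a small worked example may help. This is a minimal sketch in Python rather than SAS, and the function name `odds_ratio` and the counts are made up purely for illustration.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table laid out as [[a, b], [c, d]].

    Rows are the two groups; columns are outcome yes / outcome no.
    """
    odds_group1 = a / b  # odds of "yes" in the first row
    odds_group2 = c / d  # odds of "yes" in the second row
    return odds_group1 / odds_group2


# Hypothetical counts, not real data: 30 of 100 in group 1 said yes,
# 15 of 100 in group 2 said yes.
a, b, c, d = 30, 70, 15, 85
or_value = odds_ratio(a, b, c, d)
print(round(or_value, 2))  # about 2.43

# The log of the odds ratio is what a logistic regression with this
# single binary predictor estimates as its slope; exponentiating the
# slope gets you back to the odds ratio.
log_or = math.log(or_value)
print(round(math.exp(log_or), 2))
```

With these invented counts, the odds of "yes" in the first group are roughly 2.4 times the odds in the second group.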
What can everyone be expected to know already?
Okay, that’s the easy part. The not-so-easy part is knowing what everyone can be expected to know already, as I don’t want to waste anyone’s time.
How many people really look at those different chi-square values like Pearson and the maximum likelihood chi-square? Does everyone know WHY the expected frequency is expected to be that number?
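As for why the expected frequency is that number: if the row and column variables are independent, the probability of landing in a cell is the row proportion times the column proportion, so the expected count is n × (row total / n) × (column total / n), which simplifies to row total × column total / n. The sketch below, in Python rather than SAS and with made-up counts and hypothetical function names, computes those expected counts along with the Pearson and likelihood ratio chi-square statistics.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def likelihood_ratio_chi_square(table):
    """2 * sum over cells of observed * ln(observed / expected)."""
    expected = expected_counts(table)
    return 2 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0  # an empty cell contributes zero to the sum
    )


# Hypothetical 2x2 table, not real data.
observed = [[30, 70], [15, 85]]
print(expected_counts(observed))  # [[22.5, 77.5], [22.5, 77.5]]
print(round(pearson_chi_square(observed), 2))  # about 6.45
print(round(likelihood_ratio_chi_square(observed), 2))
```

The two statistics are usually close when the cell counts are reasonably large, which is part of why many people never look at more than one of them.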
Does pretty much everybody know what a phi coefficient is? Yes, I know we all learned it in basic statistics but how many people never thought about it again?
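For anyone who has not thought about it since basic statistics, the phi coefficient for a 2x2 table is (ad − bc) divided by the square root of the product of the four marginal totals, and its square equals the Pearson chi-square divided by n. Here is a minimal sketch in Python, with a hypothetical function name and invented counts.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return (a * d - b * c) / math.sqrt(row1 * row2 * col1 * col2)


# Hypothetical counts, not real data.
phi = phi_coefficient(30, 70, 15, 85)
print(round(phi, 3))  # about 0.18

# Squaring phi and multiplying by n recovers the Pearson chi-square
# for the same table.
n = 30 + 70 + 15 + 85
print(round(phi ** 2 * n, 2))  # about 6.45
```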
Can I just skip discussing the marginal distribution and conditional distribution because “everyone knows that”?
How about computing confidence intervals with PROC FREQ?
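To make the confidence interval question concrete, here is a minimal sketch of the usual asymptotic (Wald) interval for a single proportion. It is written in Python rather than SAS, the function name `wald_interval` and the counts are invented for illustration, and it is not meant to reproduce PROC FREQ output exactly.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Asymptotic (Wald) confidence interval for a proportion.

    z = 1.96 gives an approximate 95% interval.
    """
    p_hat = successes / n
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * std_err, p_hat + z * std_err


# Hypothetical data: 45 "yes" answers out of 200 respondents.
lower, upper = wald_interval(45, 200)
print(round(lower, 3), round(upper, 3))
```

This approximation gets shaky when the proportion is near 0 or 1 or the sample is small, which is when an exact interval is worth asking for instead.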
In a normal household, one might ask one’s spouse, to get at least some indication of whether “the man on the street” would be familiar with a topic. I, on the other hand, married someone who decided to pursue a doctorate in particle physics because he found nuclear physics too easy. Somehow, I don’t think he does a very good imitation of the man on the street.
Here is my plan right now:
- Collate everything I have on categorical data analysis, including the material from two courses on non-parametric statistics that I taught years ago and had forgotten about until I found the PowerPoint presentations in a folder. Then I remembered, oh yeah, THOSE courses on non-parametric statistics!
- Put these in order in an outline under “Questions you want answered”.
These are the questions I have so far:
- Are your data any good? (Always a good question to ask first)
- What is the distribution of X?
- What is the distribution of X given Y?
- Is there a significant relationship between X and Y?
- Given X, what are the odds of Y?
- How well, and with what variables, can we predict which category of X a person falls into?
- Is this set of variables significantly better for predicting X than that other set of variables lying over there?
Then, there are those questions of special cases:
- What if you only have a small number of cases in one or more cells?
- What if your data are repeated measures?
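For the small-cell question, one common answer is Fisher’s exact test, which works from the hypergeometric distribution instead of the chi-square approximation. The sketch below is a bare-bones Python version for a 2x2 table; the function name is hypothetical, the counts are made up, and it uses the common convention that the two-sided p-value sums every table that is no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for a 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # holding all row and column totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    smallest = max(0, col1 - row2)
    largest = min(row1, col1)
    p_observed = prob(a)
    # Add up every possible table that is as likely as, or less likely
    # than, the one observed (small tolerance for floating point ties).
    return sum(
        prob(x)
        for x in range(smallest, largest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical tiny table, not real data: every cell is far too small
# for the chi-square approximation to be trusted.
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # 34/70, about 0.4857
```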
Any suggestions, experience or good categorical data analysis jokes are welcome.
(Hey, if there are SQL jokes, there must be some categorical data analysis jokes.)
Tip of the day: A three-way interaction has an entirely different meaning in categorical data analysis than it does in an X-rated video. I actually found a three-way interaction with sex within military service. It was not at all exciting. It only meant that the relationship between school experiences and plans to enter the military varied by gender.