Univariate statistics for categorical data? How weird!

PROC UNIVARIATE is for numeric data. I often use it as the first step in my categorical data analyses. How weird is that?

They do look weird

Okay, well, maybe it’s not leafy sea dragon level of strange but it does seem an odd thing to do. After all, much of the output that PROC UNIVARIATE gives you is completely nonsensical for categorical data – the mean, standard deviation, tests for mu = 0. Why would I do such an odd thing?

 

Eva, Supergenius Baby

 

Wait, I can explain!

There are two reasons that I often use PROC UNIVARIATE for categorical data.

  • Exhibit A: Character data are stored as numeric. Often I will be using data sets with hundreds, or even thousands, of variables. Even though the variables really do represent categorical data, they are stored in numeric format. That is, 1 = Democrat, 2 = Independent, 3 = Republican and so on.

The soap box I climb on again and again has to do with checking your data, getting to know your data. If there are many questions like “Which party do you belong to?” “Which party did you vote for in the 2008 national election?” “Which party did your spouse vote for?” etc., or a very large number of Yes/No questions like “Do you own a computer?” “Do you own a car?” “Do you own your own home?” “Do you own a kazoo?” then you know the number of categories that should exist.

I have written before at great length about checking data quality using a macro. Also, I’m doing a presentation on it at WUSS this year.

Or you could do this with two tasks in SAS Enterprise Guide:

Process flow beginning with data set, including two tasks

One very quick way to check the data for data entry problems is to run the CHARACTERIZE DATA task with SAS Enterprise Guide. If you take a peek at the back end of what SAS EG is doing (check out the CODE or LOG windows) you can see it is running PROC UNIVARIATE, PROC FREQ and some macros.

The UNIVARIATE output will provide you the following information:

Number of records with non-missing data.

Number of records with missing data

Minimum

Maximum

Minimum and maximum seem like odd statistics to use. What is the maximum for political party? What I am looking for here is data entry errors and missing data. If the only possible answers were 1 to 4 and our minimum is 0 and maximum is 9, we have a problem.
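The range check described above can be sketched in a few lines. This is a minimal illustration in Python rather than SAS, not what the CHARACTERIZE DATA task actually runs; the `summarize` helper and the party codes are made up for the example.

```python
def summarize(values):
    """Return N, number missing, minimum and maximum for one coded variable."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "nmiss": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

# Hypothetical party codes: valid answers are 1 to 4, None is a missing value.
party = [1, 2, 3, 4, 2, None, 0, 9, 3, 1]

summary = summarize(party)
valid_low, valid_high = 1, 4

# A minimum below 1 or a maximum above 4 signals a data entry problem.
out_of_range = summary["min"] < valid_low or summary["max"] > valid_high
print(summary, "out of range:", out_of_range)
```

With these made-up values the minimum is 0 and the maximum is 9, so the variable gets flagged even though nobody would ever interpret a "maximum political party".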

With the CHARACTERIZE DATA task I can get frequency distributions and graphs for the variables that are stored as character (in this particular example data set, there is only one), and all that univariate stuff for all of the other variables that are stored as numeric (even though they’re really character variables).

In this data set, which is SUPPOSED to contain nicely cleaned up data, all of the answers should be on a scale of 1 to 4. So, I create a FILTER just by pointing and clicking to pull out the out-of-range data. This identifies any variables with values out of range, missing more than 5% of the data (there are about 7,300 records so 365 is about 5%) and with a standard error of 0. Why the heck that last one? Because for some odd reason, CHARACTERIZE DATA does not give you a standard deviation or variance. Since the standard error is the standard deviation divided by the square root of N, a standard error of 0 means a standard deviation of 0, which means everyone gave the same answer. I want to see any variables where everyone gave the same answer because, in a data set of over 7,300 people, that’s just plain weird.
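For anyone who wants to see the logic of that filter spelled out, here is a rough sketch in Python (not SAS, and not the Enterprise Guide filter itself). The function name `flag_variable`, the variable names q1 to q4 and the tiny data set are all invented for illustration; the three rules are the ones described above.

```python
import math
import statistics

def flag_variable(values, low=1, high=4, max_missing_share=0.05):
    """Return the reasons a coded variable deserves a closer look."""
    present = [v for v in values if v is not None]
    n = len(present)
    nmiss = len(values) - n
    reasons = []
    # Rule 1: any answer outside the valid 1 to 4 scale.
    if present and (min(present) < low or max(present) > high):
        reasons.append("out of range")
    # Rule 2: more than 5% of the records are missing.
    if nmiss > max_missing_share * len(values):
        reasons.append("too much missing")
    # Rule 3: standard error = standard deviation / sqrt(N);
    # zero means everyone gave the same answer.
    if n > 1:
        std_err = statistics.stdev(present) / math.sqrt(n)
        if std_err == 0:
            reasons.append("no variance")
    return reasons

# Hypothetical variables with 20 records each.
data = {
    "q1": [1, 2, 3, 4, 2, 1, 3, 4, 2, 1] * 2,     # clean
    "q2": [1, 2, 9, 4, 2, 1, 3, 4, 2, 1] * 2,     # contains a 9
    "q3": [1, None, 3, 4, 2, 1, 3, 4, 2, 1] * 2,  # 10% missing
    "q4": [2] * 20,                               # everyone answered 2
}

flags = {name: flag_variable(values) for name, values in data.items()}
for name, reasons in flags.items():
    if reasons:
        print(name, reasons)
```

Only the flagged variables are printed, which mirrors the idea of the filtered output data set: the clean variables drop out and you are left with the ones worth investigating.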


Filter on minimum, maximum, number missing and standard error

This produces the following data set as output. There are more variables than this, but I hid the ones like mean, median and total that were irrelevant for categorical data.

Data set showing variables with a lot of missing data, no variance, etc.

You will also get graphs for all of your variables. The default is to graph the 30 most common categories. Normally, I would not request these graphs if I had hundreds of variables but if your variables are stored as character data, this is probably the quickest way of doing a check for out-of-range values.

Your mileage may vary:

Personally, I find it easier to glance through frequency tables than graphs, but that’s me.  Also, if there is a way to set options for the CHARACTERIZE DATA task, it is well-hidden. As far as I can see, you get the frequency distributions for categorical variables only …

Frequency distribution for 1 categorical variable, univariate statistics for numeric variables

… and if you want any distributional information on the variables stored as numeric you have to go with the default which is to produce graphs. Since the default is to produce graphs, you don’t need to do anything extra to get them.

Of course with truly categorical data the only measure of central tendency you can discuss is the mode, and you can see here that it is the first category. Most eighth-graders say they spend no time playing computer games (they’re probably lying). You can also talk about the distribution. With truly categorical data saying it is positively or negatively skewed doesn’t make much sense because there is no positive or negative direction. You can, however, comment on whether the observations were relatively evenly distributed among categories or predominantly in one or two categories.

Graph of hours per day spent playing computer games

In this example it is obvious but with other variables it may not be so clear. This is just the part where you are getting a first look at your data. More on a detailed look tomorrow (maybe, if I have time).
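When the graph is less obvious, the mode and the spread across categories can be read off a plain frequency count. A small sketch in Python (not SAS), with invented answers standing in for the computer games question:

```python
from collections import Counter

# Hypothetical answers to "hours per day spent playing computer games".
answers = (
    ["0 hours"] * 5
    + ["less than 1 hour"] * 2
    + ["1 to 2 hours"] * 2
    + ["3 or more hours"] * 1
)

counts = Counter(answers)

# The mode is the most common category, the only sensible
# "central tendency" for truly categorical data.
mode, mode_count = counts.most_common(1)[0]

# Share of records in each category, to judge how evenly answers are spread.
shares = {category: count / len(answers) for category, count in counts.items()}

print("mode:", mode, mode_count)
for category, share in shares.items():
    print(f"{category}: {share:.0%}")
```

If one or two categories hold most of the records, that is the thing to comment on; skewness has no meaning until the categories have an order.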

  • Exhibit B: Your data really are ordinal data in disguise. This is kind of obviously the case for the question above. Even if it was stored as character data, you can see that it really is on an ordinal scale. It makes sense in that case to talk about the median and say that half of the students played computer games less than one hour and half played for one hour or more. You can also say that your data are positively skewed. Don’t get all excited though and think just because of that it is a good idea to use this as a dependent variable in a regression equation. You only have five possible answers and that really does not fit the idea of a normal distribution.
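To make the median idea concrete, here is a tiny Python sketch (again, not SAS). The five category labels and the coded answers are assumptions made up for the example; `statistics.median_low` is used so the result is always one of the actual answer codes rather than a value halfway between two categories.

```python
import statistics

# Hypothetical coding of a five-category ordinal question.
labels = {
    1: "none",
    2: "less than 1 hour",
    3: "1 to 2 hours",
    4: "3 to 4 hours",
    5: "5 or more hours",
}

# Made-up coded answers from nine students.
codes = [1, 1, 1, 2, 2, 3, 3, 4, 5]

# median_low returns the lower of the two middle values when N is even,
# so the median is always a real category.
median_code = statistics.median_low(codes)
print("median category:", labels[median_code])
```

The median is only meaningful because the codes are ordered; with party affiliation coded 1 to 3 the same calculation would run fine and tell you nothing.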

And now I have somehow managed to spend three or four hours playing with SAS On-demand with categorical data. (Yes, I did a lot more than what I wrote about in this blog.)

Maybe tomorrow I will write about bivariate descriptive statistics. Or maybe I will just sleep late and be a slug.

If you’re dying to know more, you can come to the class on categorical data analysis I’m doing at WUSS. Or you could just keep reading this blog. Or both.

 

 
