May

11

I used to think that a clean house was a sign of a wasted life. After all, why would anyone spend time cleaning when they could be reading, writing, teaching, running, programming, doing judo or drinking margaritas at the pier watching the sun go down?

Well, I still don’t alphabetize my spices or clean the grout with a toothbrush. In fact, the grout and I have pretty much adopted a truce. It doesn’t think about me and I don’t think about it.

On the other hand, I’ve found that having  the place cleaned up actually saves me time because I waste a whole lot less time looking for things.

The same is true of your programming. So, here, as a public service on behalf of mothers everywhere are several tips to clean up your programming.

1. Get rid of all the useless junk.

When I’m first starting a project, I may run the same program over and over. It runs and there is a 200-line input statement because it has several hundred variables and they just could NOT be named something like Q1 – Q800, now could they? At line 312, it turns out that I misspelled the libref. I did something brilliant like spelled “in” with two i’s . So, now I need to re-run it, and when I am reviewing my log, I really don’t need those first 400 lines with my 312 statements and a bunch of SAS notes. I want to start all over.

OR

I ran several analyses to better understand different variables. I do this all of the time. For example, with the TIMSS data on students, there are two variables on student sex. Neither are as interesting as one might hope. One is the sex recorded on the student’s permanent record (apparently it does exist, and, at least they got your gender right). The other is the sex the student put on the survey. I did a cross-tabulation and found that what the student put always agreed with what was in the administrative record (not even one smart-ass in a group of over 7,300 eighth-graders – amazing) but 58 students did not answer this question. So, I’m going to keep the ITSEX variable and drop the other variable. All well and good, but I did several cross-tabs like that, with birth month, birth year and so on. I don’t need to keep those results cluttering up my output window.

I put this statement FIRST in the program when I am starting on data cleaning:

DM “CLEAR LOG ; CLEAR OUT” ;

If I re-run the program from the beginning, it erases the log and output from the previous runs, leaving only the results from the current run.

The second line I put in a data cleaning program is:

OPTIONS NOFMTERR ;

At a later time I’ll probably be interested in any user-written formats but this isn’t the time. I don’t want my program stopping and giving me a bunch of errors because I don’t have some special format.

Often, the programs I use do come with special formats and a PROC FORMAT with hundreds of lines of VALUE statements. I run that once, see that there are no errors and then move that out of the way. So, the third line I put in a data cleaning program is often

%INCLUDE “c:\projectdir\sascode\theseformats.sas” ;

2. Do not whine that you’ll do it later.

The fourth line in my program is usually PROC MEANS

At this point, I’m only looking for one thing and that is if values are out of range, for example, the items are scored on a 1 – 5 scale and the maximum of many of the variables is 8. Occasionally for perfectly good reasons other than to annoy me, people code data as 8, 99 or whatever to show that it was “not administered”, “not answered” , “did not know” and so on. There are some interesting analyses that can be done of patterns of missing data, but now is not the time to do them. If you want to use those missing value codes later, good for you, but having seen SO many times that these screwed up results, I’m going to set them to missing in my analytic file right away.

(A side note on that – I read in the raw data and save it as a SAS dataset. Then, I create an analytic file. So, I can always go back to the original raw data if I need it.)

The next step is to include an ARRAY statement and DO – loop

ARRAY somename{*}  beginvar — endvar ;

In case you aren’t familiar with this …

somename is the name of the array

{*} means to set the dimension (that is, the number of elements in an array) to however many elements there happen to be.

This will put all of the variables from beginvar to endvar in the array.

DO I = 1 TO DIM(somename) ;

IF someneame{i}  >  5 THEN somename{i} = . ;

END ;

3. Get rid of more useless junk

It makes a lot of sense in collecting data to ask anything you might possibly ever want to know, because it is frightfully expensive to track down the same subjects again. The result, though, is that you may end up with a whole lot of variables that you are never going to use. The first type is where there is no variance. If everyone in your dataset is in the eighth grade and from the USA then you have all of the information you are ever going to get from the “grade” and “country” variables.  Think about it, no matter what you do from here on out, if you pull out a sample of females only – you’re still going to have only American eighth-graders. No matter what kind of correlation, factor analysis or other statistical technique you do, “country” is going to explain exactly none of the variance. Make a note somewhere about the zero variance of those variables and drop them from your dataset.

There is a reason I did these steps in this order. Often, when I recode the data so all of the various missing categories get coded as missing instead of 8, 9 , 742 or whatever, the variance drops to zero or the number of records with non-missing data drops to zero.

4. Think about what you’re throwing out

Is there a good reason that those items have all missing data or the answers are all the same? It may be that the question was if the student had dropped out of school and, very reasonably, it is not asked of third-graders and you are only analyzing data for the third grade.  For obvious reasons, the American Community Survey only asks women if they gave birth in the last year.

5. Use the vacuum

In other words, use machines to help you. I automate the four steps above by combining PROC TRANSPOSE, PROC MEANS, PROC FREQ and a couple of macros.

However, that’s the topic for tomorrow’s post.


Comments

Name (required)

Email (required)

Website

Speak your mind

1 Comment so far

  1. Data quality: I love this macro ! : AnnMaria’s Blog on May 12, 2011 3:13 am

    […] SAS Programming lessons from mom: Clean up this damn mess! […]

Blogroll

WP Themes