May

13

Yesterday, I talked about using a macro for beginning checks on data quality and I promised to explain it today. So, here we go…

options mstored sasmstore = maclib  ;

If you want to store your macro, you need to use two options in the OPTIONS statement.

MSTORED that you want the stored compiled macros in the library specified after SASMSTORE = .

Why you need two options, I don’t know.

LIBNAME MACLIB "C:\Users\MyDir\Documents\My SAS Files" ;

*** That’s the directory where my macro will be stored.
%macro dataqual(dsn,idvar,startvar,endvar,obsnum) / store ;

*** This creates a macro named dataqual  that will take parameters for

dsn = data set name

idvar = the subject identifier in the data set . This is something like social security number, telephone number  (if you’re a phone company), customer number or other variable that should NOT have duplicates.

startvar = this is the first variable in the data set that I want to get descriptive statistics on

endvar = this is the last variable I want descriptive statistics on

obsnum =  the number of variables in the data set

The / store  is an option that tells SAS to store this macro, and it will be stored in the directory specified after the sasmstored option and matched in the LIBNAME statement.

Title "Duplicate ID Numbers" ;
Proc freq data =  lib.&dsn noprint ;
tables &idvar / out = &dsn._freq (where = ( count > 1 )) ;
format &idvar ;

*****    This will create a frequency distribution but not print it (that ‘s the noprint option on the PROC FREQ statement ). This is important to remember because many of my data sets will have hundreds of thousands or millions of records, each with a unique identifier. It will output the duplicate values to a data set named &dsn._freq , with, of course &dsn replaced by whatever name I give it.

Since many of the public data sets I use have formats for every value, and I don’t want the formatted value used, the statement

FORMAT &idvar ;

will cause it to use the unformatted value for &idvar .

proc print data = &dsn._freq (obs = 10 ) ;
run ;

*** This prints the first ten duplicate values. You want to be careful to put in that (obs = 10 ) just in case something went wrong in the FREQ procedure and it ended outputting everybody. For example, if you accidentally put (count = 1 ) . Then you may get 2 million records in your &dsn._freq data set and it would not be very good to print all of those out.

(Skipping story of dumping reams of green and white lined computer paper. If you’re my age, you have one of those stories of your own.)

proc summary data = lib.&dsn mean min n std ;
output out = &dsn._stats ;
var &startvar -- &endvar ;

**** This is going to create a data set , &dsn._stats with the mean, minimum,  n and standard deviation for each variable in your data set from &startvar to &endvar /

proc transpose data = &dsn._stats out = &dsn._stats_trans ;
id _STAT_ ;

This is going to transpose your dataset so that instead of five records with 200 or 500 or however many variables you have, instead you have 500 records with variables for _name_ , _label_ , minimum, mean, n and standard deviation.
data &dsn._chk ;
set &dsn._stats_trans ;
pctmiss = 1 – (n/&obsnum) ;
if min < 0 then neg_min = 1 ;
else neg_min = 0 ;
if std = 0 then constant = 1 ;
else constant = 0 ;
if (pctmiss > .05 or neg_min = 1 or constant = 1) then output ;
Title “Deviant variables to check ” ;
proc print data = &dsn._chk ;
run;

**** This reads in the transposed data set and then does some quality checks – if the standard deviation is 0, more than 5% of the data are missing or there is a negative minimum value, the variable is output. Then, I get a listing of variables that are in some way warranting a second look. The columns will show the descriptive statistics, plus new variables that show the percent missing, whether the variable is a constant or has a negative minimum.

Title “First 10 observations with ALL of the variables unformatted ” ;
proc print data = lib.&dsn (obs= 10)  ;
format _all_ ;

run ;

****  You should always stare at your data. This prints out the first 10 values without formats, just so I can see what it looks like. I use a lot of public data sets that come with user-defined formats. The FORMAT _ALL_  statement removes all formats for this step and will print just the unformatted values.
Title “First 10 observations with ALL of the variables formatted ” ;
proc print data = lib.&dsn (obs= 10)  ;
run ;

*** This prints the first 10 observations with formatted values, obviously. Using formatted values is the default so I didn’t need a FORMAT statement.

Title “Deviant variables to drop ”  ;
proc print data = &dsn._chk  noobs;
var _name_ ;

****  This prints the names of all of those variables that have problems. I can copy this from the output window, type the word

DROP

paste the list of variables, add a semi-colon and I have my drop statement.

%put You defined the following macro variables ;
%put _user_ ;

**** This just helps with trouble-shooting. It will put the user defined macro variables to the log so it looks something like:

dsn :  studentgr8

idvar: IDSTUD

and so on.

run ;
%mend dataqual ;

*** and that is the end of my macro.

I hope you love it as much as I do. While it isn’t all fun sexy cool like doing a proportional hazards regression model or even a factor analysis, if your data are no good all of those fancy statistics are at best, wrong and a waste of time, and , at worst, wrong and a major blow to your career depending on whose life your wrong data screwed up.


Comments

Name (required)

Email (required)

Website

Speak your mind

1 Comment so far

  1. Univariate statistics for categorical data? How weird! : AnnMaria’s Blog on September 2, 2011 3:11 am

    […] have written before at great length about checking data quality using a macro. Also, I’m doing a presentation on it at WUSS this […]

Blogroll

WP Themes