|

Data quality: I love this macro !

I will be missing my lovely daughter Ronda’s fight in Calgary next month because I will be attending the National Association of Hispanic Journalists conference in Orlando, Florida, where sports journalist and equally lovely daughter Maria Burns Ortiz will be. It is completely false that this proves Maria is the favorite child. In addition to which, Julia claims to be the favorite child because “I’m the baby, you gotta love me & you named your company after me when I was a baby, so there!” The perfect Jennifer just smiles because she KNOWS she is the favorite child, after all, she points out, she is the only one with a graduate degree.

They are all wrong. I do not have a favorite child.

I do, however, have favorite macros.

Yesterday, I wrote about how I start my programs to test data quality. In fact, there’s a macro that I call at the beginning of each program that does all of that stuff:

%strt(TIMSS,G8_STUDENT07,studentformats) ;

I like it okay, but I don’t love it. The code is below.

%macro strt(Projdir,dsn,fmts) / store ;
DM “CLEAR LOG; CLEAR OUT”;
options nofmterr ;
libname LIB “C:\Users\AnnMaria\Documents\&projdir\sasdata”;
proc means data = lib.&dsn ;
%include “C:\Users\AnnMaria\Documents\&projdir\sascode\&fmts..sas” ;
%put You defined the following macro variables ;
%put _user_ ;
run ;
%mend strt ;

This macro that goes at the end is one of my favorites. I enter one line in my program like this:

%dataqual(student_int,IDSTUD,BS4GOLAN,USBSGQ6B,7377) ;

This one line results in some very useful stuff:

  • It  creates  data set  of duplicate ID values, and prints the first ten duplicates found.
  • It creates and prints a data set that has variables with more than 5% of the data missing, that have no variance or that have a minimum value that is a negative number.
  • Because you should always look at your data, it prints the first ten observations unformatted
  • It  prints the first ten formatted (remember, I have the formats in that handy %INCLUDE in the first macro above)
  • The last thing it does is print all of the variable names that are flagged as “deviant” that I might want to delete from my data set.

Why that last one? Seriously, you don’t think I’m going to type 300+ variable names do you? I’m a terrible typist. After I look through the results above, I will usually decide to delete most of the variables. There are usually a few I want to keep – for example, father’s education might be missing for more than 5% of the sample but that’s because some people don’t live with their fathers and may not have the data, so I might keep that variable despite missing data.  Besides, 5% is a low threshold, so I may keep variables that are missing 10% of the data. For some variables like income, a negative number can be correct. After examining the results, I can copy and paste the whole list of variables into a DROP statement, and just delete the few I want to keep.

The code for the macro is below. Tomorrow, I’ll go through what each statement does. I would do it today but this post is already long and besides, it’s powered by Chardonnay and we are all out of wine. The resident rocket scientist, who usually brings home the wine as well as the bacon, is on a diet so he’s been bringing home diet soda instead. Good for him, but what about ME, what about MY needs, what about people who need an explanation of

FORMAT _all_ ;

Apparently he did not consider that. Ha!

/* =========^===============================================**=============== [This line fulfills no purpose */

options mstored sasmstore = maclib  ;
LIBNAME MACLIB “C:\Users\MyDir\Documents\My SAS Files” ;
%macro dataqual(dsn,idvar,startvar,endvar,obsnum) / store ;
Title “Duplicate ID Numbers” ;
Proc freq data =  lib.&dsn noprint ;
tables &idvar / out = &dsn._freq (where = ( count > 1 )) ;
format &idvar  ;
proc print data = &dsn._freq (obs = 10 ) ;
run ;
proc summary data = lib.&dsn mean min n std ;
output out = &dsn._stats ;
var &startvar — &endvar ;

proc transpose data = &dsn._stats out = &dsn._stats_trans ;
id _STAT_ ;
data &dsn._chk ;
set &dsn._stats_trans ;
pctmiss = 1 – (n/&obsnum) ;
if min < 0 then neg_min = 1 ;
else neg_min = 0 ;
if std = 0 then constant = 1 ;
else constant = 0 ;
if (pctmiss > .05 or neg_min = 1 or constant = 1) then output ;
Title “Deviant variables to check ” ;
proc print data = &dsn._chk ;
run;

Title “First 10 observations with ALL of the variables unformatted ” ;
proc print data = lib.&dsn (obs= 10)  ;
format _all_ ;

run ;
Title “First 10 observations with ALL of the variables formatted ” ;
proc print data = lib.&dsn (obs= 10)  ;
run ;

Title “Deviant variables to drop ”  ;
proc print data = &dsn._chk  noobs;
var _name_ ;
%put You defined the following macro variables ;
%put _user_ ;
run ;
%mend dataqual ;

Similar Posts

2 Comments

  1. Really? I’ve heard SO many people complain about the macro language I’m surprised when I hear someone say they like it. You’re certainly not the only one, but it is a minority. I love macros but I’ve seen them be overdone, too.

Leave a Reply

Your email address will not be published. Required fields are marked *