|

Making Drinking Easier with SAS Macros & Filenames

SAS macros (well, any macros, really – Excel, whatever), can be a great thing. While at first glance they may look a bit hairy, if you have repetitive tasks it makes it easier to debug your code. That may sound nuts at first glance. How can adding something with the %global %do &var  and call symput(‘fname’,filename) kind of stuff make your code MORE readable.

It really can, though. For example with a project I am working on now, I have to do the same 30 lines of code  80,000 times, reading in 80,000 different files. I don’t know which is worse, the thought of having to write over two million lines of code or the thought of debugging it.

Even though I cannot imagine ever doing this without using macros, there are a couple things you need to keep in mind when using macros with the FILENAME statement.

Let’s just pretend that I am doing an analysis of all of the disciplinary records of all of the schools in the Las Tortugas Unified School district over several years. Every time a student whacks another student on the head with a math book and gets sent to the office, there is a report filed. At the end of each day, each school uploads its disciplinary report files to the main server. The first question is, how do I get the names of all of those 80,000 files?  If there was some logical file structure, like school 1 to school 1700, I could do a %DO loop, but schools actually have names (unless you are New York City), so that option is ruled out. Well, I’m sure as hell not typing them in! And, yes, I can think of a whole lot better ways for people to collect data than this also, but my clients aren’t usually interested in hearing how much easier my job would be if they had been collecting and entering their data differently for the past ten years.

Here’s my solution. This is in Windows but you could do the same thing easily enough with Linux.

Go to the C:\ prompt

Type

dir/s  > filenames.txt

This will output all of the directories and subdirectories into the file filenames.txt .  Here is one of those examples where you can be too smart for your own good. The file will have several lines of information at the top which are not file names but directories. Then, it lists the first directory. Then, every file in that directory. Then, the next directory, and ever file in that, and so on.

I briefly considered a solution with FIRSTOBS = and counting the number of lines until the first directory, then doing an INPUT statement with @@ to keep it on the same line and ….

Then I realized that all the lines that had a file were in this format:

01/12/2011 10:00 AM 33789   filename200901_1.txt

That date was the date that the files were received from the client and copied on to my hard drive, so it was the same for every file.  All I needed was the file name. As is common in projects like this, the files were saved each month to a separate subdirectory, with all of the files for a given year in one subdirectory and each month in a subdirectory under that, so the full reference for the file above would be:

c:\annmaria\ltusd\2009\01\filename200901_1.txt

So, I did this:

FILENAME names "C:\annmaria\ltusd\filenames.txt" ;
DATA in.files ;
ATTRIB a length = $10.  filename length = $40. ;
INFILE names ;
INPUT a $ b $ c $ d $ filename ;
yr = SUBSTR(filename,9,4) ;
mnth = SUBSTR(filename,13,2) ;
KEEP filename yr mnth ;
IF  a =  "01/12/2011"  then output ;

You don’t need to do this next part, but just to test a few things before I really get into the macros and run it 80,000 times, I do this:

%let fname = RIOSECO_2011011.txt ;
%let yr = 2011 ;
%let mnth = 01 ;
filename test "c:\annmaria\incidents\school\&yr\&mnth\&fname" ;

If you aren’t terribly familiar with macros and you run this, you may think it didn’t work to resolve the macro variables because you see this in your log:

filename test "c:\annmaria\incidents\school\&yr\&mnth\&fname" ;

What went wrong? Nothing went wrong. SAS is merely echoing back the statement you gave it. To see if it worked, you can do something like this:

DATA CHECK ;
INFILE test ;
INPUT v1 $ :

and you will find that your notes in your SAS log tell you that test is

"c:\annmaria\incidents\school\2011\01\RIOSECO_201101.txt"

Okay, nice, looks like it all works but I’m not going to change those three %LET statements  and re-run the code 80000 times. I need a macro!

This creates the incidents dataset I am going to build. My log tells me it has zero observations, but I create it here because I am going to add every file, one by one, and if I don’t have a dataset on the first execution, I will get an error message telling me that the file in.incidents does not exist. And I will be sad.

data in.incidents ;
infile test length=len  ;
run;

Here is the first macro, to read in each file as a temporary SAS dataset and add it to the permanent incidents dataset.

%macro readem(yr,mnth,fname,num) ;
/* Reads for every year, month and incident */
filename school ""c:\annmaria\incidents\school\&yr\&mnth\&fname" ;
data check&num ;
infile school ;
*** All the statements to read in and process your data go here;
**** Now I am going to build one long file from those 80,000 records ;
data in.incidents ;
set in.incidents check&num ;
proc datasets lib = work kill memtype= data ;
***** Without the PROC DATASETS you end up with 80,000 temporary files ;
run;
%mend readem ;

The macro below, calls the macro above and runs it 80,000 times.

%macro readall ;
%global yr mnth fname nmbr ;
%do i = 1 %to 80000 ;
Data test5 ;
set in.names ;
if _n_ = &i then do ;
call symput('fname',filename) ;
call symput ('mnth', mnth) ;
call symput ('yr', yr) ;
call symput ('nmbr', _n_) ;
output ;
end ;
run ;
%readem(&yr,&mnth,&fname,&nmbr) ;
%end ;
run;
%mend readall ;
%readall ;
r
un ;

Why the &nmbr? I don’t really need that, do I?  I mean, I’m deleting the temporary dataset after each step so why bother changing the number? I just did it for error checking. If there was a problem with a file, the message would say something like CHECK9876 not found. Since this was data collected over several years and people (I am sure just to annoy me personally) have a habit of changing how they record their data over time, if I saw that, say, from record 17000 on there were errors, I could be pretty sure that some field was added that month, and I’d have to go rewrite my program to account for that. Miraculously, there weren’t any errors, everything ran.

So, there you go. I did not need to write a 2,400,000 lines of code.

With the time I saved, I did research on Chardonnay. The house rocket scientist is also the resident wine procurer because he disapproves of my method, which consists of buying whatever has the weirdest label. I bought smoking loon for a long time, because hey, it has a duck on the label, with a cigar. That is hard to top. However, now in my extra spare time I have discovered there is a label named Barking Lizard. How have I missed this all of my life? I am SO picking up a bottle of this tomorrow! See how useful the time you save can be.

Similar Posts

One Comment

  1. I’m so glad to hear of a California resident who also selects wine based on the most interesting label. Smoking Loon is a favorite in this house; we’ll keep an eye out for Barking Lizard.

    And there’s this bottle of Old Fart wine that seems to just get passed from friend to friend at each milestone birthday.

Leave a Reply

Your email address will not be published. Required fields are marked *