Macros, SQL and Reading CSV with SAS – Part 1

I wish I knew then what I know now

Close to 40 years ago, I was a bright, young industrial engineer who had been using SAS for a year or so. I thought it was the best thing since sliced bread and could solve all problems. An older engineer, who did not have much use for all of those new-fangled computer things was told by his supervisor to talk to me about automating some type of factory loading. I don’t remember the details but I think it had something to do with figuring out how many people to have work each shift based on the hours estimated necessary to make each part and the number of each type of part needed.

I don’t remember the details but I remember that I couldn’t solve the problem

The types of parts and number of each changed daily. The part numbers were sometimes numeric and sometimes a combination of letters and numbers. There was not a defined length for part number and the part numbers and the number required of each were not in specific columns. Both SAS and I have come a long way since then and I am pretty sure I could solve a problem like that now.

I don’t do a lot with SAS these days but I had a problem that reminded me of that first stumbling block

A student had a question about analyzing a longitudinal dataset. Some work had been done on it by a graduate student who has graduated and moved away.

The problem involved the following:

  1. The previous student had used a macro to read in XLS datasets but this year’s datasets were in CSV formats
  2. In some datasets the unique identifier was 25 characters in length, in some it was 12, in others 20.
  3. The database had been tested quite a bit by the graduate student who left, and all of the test students needed to be removed. Fortunately, each had the word TEST somewhere in the username, except for some that had “UNDEFINED” or no username at all because the initial programmer had skipped over the login step to test some feature.
  4. The unique identifier went by different names in different datasets. In some it was called ‘student’, in others it was ‘username’ .
  5. Some of the datasets this year did not exist last year.
  6. They wanted the total number (non-duplicated count) of students for each year, for each school.
  7. Students duplicated across datasets were not really duplicate users. For example, School A might have a JuanGarcia and a JuanGarcia2 but School B could have a different JuanGarcia and it was a different person.

Solving parts 1, 2, 3 and 4

Fixing part 1 was super easy. To change the macro, I just switched from DBMS = XLSX to DBMS = CSV and it read in the CSV files. Easy peasy.

For part 2, I created a variable named student and gave it a length of 25, which was the longest length any of the (many!) individual datasets had for an actual user’s name.

I used the UPCASE function to make all the user names upper case and set the student variable to have that value. I used the INDEX function to find and delete any that had the word TEST or UNDEFINED or were blank.


%macro imp_csv(refd=,outd=,school=) ;
// ALL I NEEDED TO DO  here was change the DBMS = XLSX in the original macro to DMBS = CSV ; 

PROC IMPORT DATAFILE=&refd
   DBMS=CSV
   REPLACE
   OUT=work.&outd;
   GETNAMES=YES;

// CREATING A PERMANENT DATASET HERE SO NEED A LIBNAME STATEMENT; 
DATA in.&outd;
    SET work.&outd ;
    LENGTH student $ 25.  ;
      student = UPCASE(username) ;
      if INDEX(student,"TEST") > 0 or   INDEX(student,"UNDEFINED") > 0 or student = "" THEN DELETE ;
      school = &school;
%mend ;

I used this macro for the half of the datasets that had a unique identifier of “username”. The other half all had a unique identifier of “student” and, in all cases, a length of 25 characters. I actually used a different macro to read in the other half because those had a whole different set of problems, but that is a post for another day.

Calling the macro

FILENAME reffile1 ‘/home/annmaria/data_analysis_examples/schoolA.csv’;

%imp_csv(refd=reffile1,outd=schoolA, school=”Academy of A”);

For each file, I simply needed to give the location of the csv file and then in the macro call, give that file reference, the name I wanted that output dataset and the name of the school. Since I am creating permanent datasets in this macro, I’d need to have a LIBNAME statement in the macro before it is called.

Okay, that is Part 1. As you can probably guess, after I had done Part 1, Part 2 was super simple. Actually, Part 1 was pretty simple, too.

In my day job, I make educational games and teach courses on entrepreneurship at a tribal college. When I am on vacation (and I am currently on a five-day vacation, which is my longest in 49 years , I help students with their research with SAS code and write blogs about it.

I will try to get to Part 2 tomorrow when I get back from the aquarium with my grandchildren.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *