Dec
27
Macros, SQL and Reading CSV with SAS – Part 1
December 27, 2022 | Leave a Comment
I wish I knew then what I know now
Close to 40 years ago, I was a bright, young industrial engineer who had been using SAS for a year or so. I thought it was the best thing since sliced bread and could solve all problems. An older engineer, who did not have much use for all of those new-fangled computer things was told by his supervisor to talk to me about automating some type of factory loading. I don’t remember the details but I think it had something to do with figuring out how many people to have work each shift based on the hours estimated necessary to make each part and the number of each type of part needed.
I don’t remember the details but I remember that I couldn’t solve the problem
The types of parts and number of each changed daily. The part numbers were sometimes numeric and sometimes a combination of letters and numbers. There was not a defined length for part number and the part numbers and the number required of each were not in specific columns. Both SAS and I have come a long way since then and I am pretty sure I could solve a problem like that now.
I don’t do a lot with SAS these days but I had a problem that reminded me of that first stumbling block
A student had a question about analyzing a longitudinal dataset. Some work had been done on it by a graduate student who has graduated and moved away.
The problem involved the following:
- The previous student had used a macro to read in XLS datasets but this year’s datasets were in CSV formats
- In some datasets the unique identifier was 25 characters in length, in some it was 12, in others 20.
- The database had been tested quite a bit by the graduate student who left, and all of the test students needed to be removed. Fortunately, each had the word TEST somewhere in the username, except for some that had “UNDEFINED” or no username at all because the initial programmer had skipped over the login step to test some feature.
- The unique identifier went by different names in different datasets. In some it was called ‘student’, in others it was ‘username’ .
- Some of the datasets this year did not exist last year.
- They wanted the total number (non-duplicated count) of students for each year, for each school.
- Students duplicated across datasets were not really duplicate users. For example, School A might have a JuanGarcia and a JuanGarcia2 but School B could have a different JuanGarcia and it was a different person.
Solving parts 1, 2, 3 and 4
Fixing part 1 was super easy. To change the macro, I just switched from DBMS = XLSX to DBMS = CSV and it read in the CSV files. Easy peasy.
For part 2, I created a variable named student and gave it a length of 25, which was the longest length any of the (many!) individual datasets had for an actual user’s name.
I used the UPCASE function to make all the user names upper case and set the student variable to have that value. I used the INDEX function to find and delete any that had the word TEST or UNDEFINED or were blank.
%macro imp_csv(refd=,outd=,school=) ;
// ALL I NEEDED TO DO here was change the DBMS = XLSX in the original macro to DMBS = CSV ;
PROC IMPORT DATAFILE=&refd
DBMS=CSV
REPLACE
OUT=work.&outd;
GETNAMES=YES;
// CREATING A PERMANENT DATASET HERE SO NEED A LIBNAME STATEMENT;
DATA in.&outd;
SET work.&outd ;
LENGTH student $ 25. ;
student = UPCASE(username) ;
if INDEX(student,"TEST") > 0 or INDEX(student,"UNDEFINED") > 0 or student = "" THEN DELETE ;
school = &school;
%mend ;
I used this macro for the half of the datasets that had a unique identifier of “username”. The other half all had a unique identifier of “student” and, in all cases, a length of 25 characters. I actually used a different macro to read in the other half because those had a whole different set of problems, but that is a post for another day.
Calling the macro
FILENAME reffile1 ‘/home/annmaria/data_analysis_examples/schoolA.csv’;
%imp_csv(refd=reffile1,outd=schoolA, school=”Academy of A”);
For each file, I simply needed to give the location of the csv file and then in the macro call, give that file reference, the name I wanted that output dataset and the name of the school. Since I am creating permanent datasets in this macro, I’d need to have a LIBNAME statement in the macro before it is called.
Okay, that is Part 1. As you can probably guess, after I had done Part 1, Part 2 was super simple. Actually, Part 1 was pretty simple, too.
In my day job, I make educational games and teach courses on entrepreneurship at a tribal college. When I am on vacation (and I am currently on a five-day vacation, which is my longest in 49 years , I help students with their research with SAS code and write blogs about it.
I will try to get to Part 2 tomorrow when I get back from the aquarium with my grandchildren.
Jan
30
Converting to fiscal years, using SAS
January 30, 2022 | Leave a Comment
Often when analyzing data for annual reports, you need to summarize by fiscal year but people normally enter calendar dates. Also, if you didn’t know, now you know …
SAS and Excel use different date formats
If you have uploaded your data from an Excel file, you want to convert from an Excel date to a SAS date by subtracting 21, 916. This is because the reference date from which days are numbered in excel is January 1, 1900 and in SAS it is January 1, 1960. Coincidentally, 365.25*60 = 21,915 and at the century mark, we get shorted one day because 1900, 2000 etc. do not have a February 29. It’s true. Look it up. You can also read this article on SAS and Excel dates by Erik Tilanus. Or you can just quit being difficult and subtract 21,916.
Use four functions to find fiscal year
Annoyingly, my data were read in as character data, even though it was the number of days since January 1, 1900 which is definitely a number.
Function 1: Use input to convert data from character to numeric
The input function can be used to either convert data from character to numeric or vice versa. Simply include two parameters, the name of the variable you are reading in and the informat. Since I am creating a new variable, I may as well subtract the 21,916 from it at the same time.
Functions 2 and 3: Year and Month
If your fiscal year started in January, you wouldn’t be doing this, so clearly you need to get the month and year from your date. These are pretty obvious functions. Year gives you the year, and month gives you -wait for it – the month. Just give the name of the variable holding your date.
Function 4 : Use CATX to create a fiscal year variable
There is certainly more than one way to do this. If, for example, you wanted fiscal year 2020-21 to be referred to as 2020, you could just have something like:
if appmonth < 10 then fiscal_year = appyear ;
else fiscal_year = appyear + 1;
I can certainly see how you might want to do that if, for example, you were going to correlate year with some other variable and needed a numeric variable. In general, though, I find it a bit confusing when a date is January 24, 2022 and the year shows 2021.
The CATX function will concatenate any variables. The first parameter is what you want as a delimiter between the variables. You could use a space, but here I wanted a dash so that the result is something like 2021-2022.
For the current dataset, the fiscal year runs from October 1 to September 30 of the following year, so, if the month is less than 10, the beginning of the fiscal year was the prior year and the fiscal year ends in the year of the application date. If the month is 10 or greater, the beginning of the fiscal year is that year and the end of the fiscal year is the following year.
data test2 ;
set mydata.fixdata22 ;
app_date = input(application_date, 8.) – 21916 ;
appyear = year(app_date) ;
appmonth = month(app_date) ;
if appmonth < 10 then fiscal_year = catx(“-“, appyear -1, appyear) ;
else fiscal_year = catx(“-“, appyear, appyear +1);
So, that’s it. Now you have a new variable that allows you to subset and report on your data by fiscal year.
I wrote this post because my first thought of how to do this was embarrassingly complicated. I honestly don’t know what I was thinking. It involved multiple arrays and a Do-loop and then I gave it 30 seconds more thought and realized the solution was actually super easy.
Blogroll
- Andrew Gelman's statistics blog - is far more interesting than the name
- Biological research made interesting
- Interesting economics blog
- Love Stats Blog - How can you not love a market research blog with a name like that?
- Me, twitter - Thoughts on stats
- SAS Blog for the rest of us - Not as funny as some, but twice as smart. If this is for the rest of us, who are those other people?
- Simply Statistics, simply interesting
- Tech News that Doesn’t Suck
- The Endeavor -John D Cook - Another statistics blog