(If you are too young to remember the song, “Money for nothing and your chicks for free”, I guess the title of this post is not nearly as amusing to you as it was to me, as my lovely daughter so unhelpfully pointed out.)
Since I whined yesterday about Codecademy not providing much explanation of the code in the Quickstart course (where not much is defined as none at all), I thought I should not be so hypocritical.
I am often posting some code and saying I will explain it later. I noticed some of those from 2009 or 2010. Well, 2012 is definitely later. I think I’ll work backward though and start with what I promised to explain “later” this week.
As I’ve rambled on here a lot, open data is a great idea, but it takes some work. I decided to post some of what I’ve been doing here with explanation. In part, this was motivated by a talk I had today with some researchers I’ll be working with over the next few months. Someone said, quite correctly,
“You read the journal article and they say they did propensity score matching but they never tell you exactly how they did it, how they modified the code, which macro they used, because that’s not really the focus of the article. Unfortunately, when you go to replicate that model, you can’t because there isn’t enough detail.”
So, here in great detail from beginning to end is how I banged the data into shape for the analyses of the Kaiser-Permanente Study of the Oldest Old. These are data I downloaded from the Inter-university Consortium for Political and Social Research. (Free, open data).
The analysis was all done using SAS On-Demand for Academics.
LIBNAME mydata "/courses/u_2.edu1/i_123/c_2469/saslib" ;
The LIBNAME statement specifies the directory for my course on the SAS server. This is going to be unique to your course. If you have a SAS On-Demand account and you are the instructor, you know what this directory is. The “mydata” is a libref – that is, it is just used in the program to refer to the library a.k.a. directory. You can use any valid SAS name. Actually, I am not using anything in the class directory in this example, but I put it the first line in every program as a habit so when I DO need the data in the directory available, I have it.
OPTIONS NOFMTERR ;
This prevents the program from stopping when it cannot find the specified format. SAS Enterprise Guide is generally pretty forgiving about format errors, and I used a .stc file which should have the formats included, but I usually include this option anyway. If there are missing formats, you’ll get notes in your SAS log but your program will still run.
FILENAME readit "/courses/u_2.edu1/i_123/c_2469/saslib/04219-0001-data.stc" ;
PROC CIMPORT INFILE = readit LIBRARY = work ISFILEUTF8=TRUE ;
This imports the file, complete with formats. I rambled on about this in an earlier post. In short, because this particular file was created on a different system, you can EITHER have the formats OR have it in your permanent (i.e. course) directory, but not both. Click on the previous link if you need detail on .stc files or CIMPORT. Otherwise, move on.
PROC FORMAT ;
3 = "1-2"
2 = "3-5"
1 = "6+" ;
This is creating a format for one of the variables in the data set. As you can see, the variable was coded 1= 6 or more drinks a day, 2 = 3-5 drinks a day. I want to change that format so the actual values print out, instead of “1″ for the people who had six drinks.
DATA work.alcohol ;
SET work.DA4219P1 ;
The previous CIMPORT step created that dataset DA4219P1 . I am reading it into a new data set, named alcohol, that I’m going to change and create variables for my final dataset to analyze. Everything I am doing in this program COULD be done with tasks in SAS Enterprise Guide, but I found it more efficient to do this way. These are both in temporary (working) memory.
ATTRIB b LENGTH = $10. c1dthyr LENGTH = 8.0 c2dthyr LENGTH = 8.0;
In the ATTRIB statement, I am defining new variables and specifying the length and type. You don’t have to do this in a SAS program but if you don’t, SAS will assign length and type when it first encounters the variables, and it may not do it exactly the way you want.
IF alcopyr = 0 OR alcohol = 0 THEN amntalc = 0 ;
FORMAT amntalc alc. ;
The variable amntalc was missing for a large number of people, but most of those people had previously answered “no” to the questions asking if they drank alcohol in the previous year or ever in their life. If they said, “no” to either of those, I set the amntalc, which is how much they drink per day, to zero. This dramatically cut the amount of missing data. Then I applied the format.
death = INPUT(COMPRESS(dthdate),MMDDYY10. ) ;
b = "01/" ||
COMPRESS(brthmo) || "/" ||
brthdate = INPUT(b,MMDDYY10.) ;
lifespan = (death - brthdate) / 365.25 ;
lifespan = ROUND(lifespan,1) ;
All the above calculates two variables I actually care about and a couple of others I’m going to drop. Death is the date of death. Lifespan is how old the person lived to be. First, I read in the date of death, which had been stored as a text field, that’s what the INPUT function does, and the mmddyy10. gives it the format in which to read the data. I stripped out the blanks (that’s what the COMPRESS function does). Now I have the date of death.
The file doesn’t have actual birth days, just birth month and year. So, b = 01 plus the month and year – assigning everyone the birthdate of the first day of the month when they were born. The statement with birth date reads that as a SAS date. Now I am going to subtract the birthday from death date and divide it by 365.25 to give me how many years the person lived. Finally, I am going to ROUND the result to the nearest year.
Yes, I could have combined a lot of these statements into one. For example, I could easily combine the last two statements calculating lifespan. I did it like this because I use SAS On-Demand to teach and if my students peek at the program, which many do, it is easier for them to understand broken down like this.
if alcopyr = 1 then alcopyr = . ;
if smcigars = 1 then smcigars = . ;
if educ in (3,4,5) then education = 4 ;
else education = educ ;
The above recodes some variables. For a couple, “1″ meant the data were not available, so I changed that to missing. For education, I combined three categories so my data were the more typical categories of “less than high school”, “high school” etc.
IF cohort = 1 then do ;
IF death = . then c1dthyr = 12 ;
ELSE c1dthyr = YEAR(death) - 1971 ;
yrslived = c1dthyr ;
dthy1 = YEAR(death) ;
IF cohort = 2 then do;
IF death = . then c2dthyr = 12 ;
ELSE c2dthyr = YEAR(death) - 1980 ;
yrslived = c2dthyr ;
dthy2 = YEAR(death) ;
There were two cohorts, one for which data collection began in 1971 and one in 1980. I wanted the option to either analyze two different variables (and possibly split the data set later), so I created two variables named c1dthyr and c2dthyr. I also wanted one variable because I wanted to be able to compare survival rates by cohort, so I created a variable named yrslived. I was working with a student who was interested in deaths in a specific range of years, so I created variables dthy1 and dthy2 for her. The YEAR function returns the year part from a SAS date. A DO loop for each cohort took care of all of that.
DROP b bd F_A_X01 -- I_AUTPSY MR1_CS1 -- A_F_DT1 D_A_DT2 -- H_DISDT4 MRCSDT1 -- MR3_DT4 IREVMVA -- NTIS5 LABsum1 -- E_MRDX Vis1y1 -- vis9y9 E_CR1 -- PRCS43 B_INTR -- MRCSDDT23 ;
The last statement drops a bunch of variables I don’t need. The somevar — othervar notation with the two dashes in between includes all of the variables in order in the dataset from the first variable mentioned to the second. There were literally hundreds of variables I wanted to drop, so here they are all dropped. Now, I have the file I want and I am happy and ready to start running stuff for my first day of class.