If you, too, feel more comfortable with a data set you have produced yourself, let me give you a few tips.
- There is a wealth of data available on line, much of it thanks to your federal government. For example, I downloaded the 2014 birth data from the Center for Disease Control site. They have a lot of interesting public use data sets.
- Read the documentation! The nice thing about the CDC data sets is that they include a codebook. This particular one is 183 pages long. Not exciting reading, I know, but I’m sitting here right now watching a documentary where some guy is up to his elbows in an elephant’s placenta (I’m seriously not making this up) and that doesn’t look like a bowl of cherries, either.
- Assuming you did not read all of the documentation even though I told you it was important, (since I have raised four daughters and know all about people not paying attention to my very sound advice), at a MINIMUM you need to read three things; 1) sampling information to find out if there are any problems with selection bias, any sampling weights you need to know about, 2) the record layout – otherwise how the hell are you going to read in the data, and 3) the coding scheme.
Let’s take a look at the code to create the data set I want to use in examples for my class. Uncompressed, the 2014 birth data set is over 5 GB which exceeds the limit for upload to SAS On-Demand for Academics and also isn’t the best idea for a class data set for a bunch of reasons, a big one being that I teach students distributed around the world and ships at sea (for real) and having them access a 5GB data set isn’t really the most feasible idea.
I’m going to assume you downloaded the file into your downloads folder and then unzipped it.
STEP 1: CHECK OUT THE DATA SET
Since I READ the codebook, not being a big hypocrite and saw that the record length is 775 and there are nearly 4 million records in the data set. Opening it up in SAS Enterprise Guide or the Explorer window didn’t seem a good plan. My first step, then , was to use a FILENAME statement to refer to the data I downloaded, data that happens to be a text file.
I just want to take a look at the first 10 records to see that it is what it claims to be in the codebook. (No, I never DO trust anyone.)
The default length for a SAS variable is 8.
I’m going to give the variable a length of 775 characters.
Notice that the INFILE statement has to refer back to the reference used in the FILENAME statement, which I very uncreatively named “in” . Naming datasets, variables and file references is not the place for creativity. I once worked with a guy who named all of his data sets and variables after cartoon characters – until all of the other programmers got together and killed him.
Dead programmers aside, pay attention to that OBS=10 unless you really want to look at 3,998,175 records. The OBS =10 option will limit the number of records read to – you guessed it – 10.
With the INPUT statement, I read in from position 1-775 in the text file.
All of this just allows me to look at the first 10 records without having to open a file of almost 4 million records.
FILENAME in “C:\Users\you\Downloads\Nat2014us\Nat2014PublicUS.c20150514.r20151022.txt ” ;
LENGTH var1 $775;
INFILE in OBS= 10 ;
INPUT var1 1-775;
STEP 2: READ IN A SUBSET OF THE DATA
Being satisfied with my first look, I went ahead to create a permanent data set. I needed a LIBNAME statement to specify where I wanted the data permanently stored.
The out in the LIBNAME and DATA statements need to match. It could be anything, it could be komodo or kangaroo, as long as it’s the same word in both places. So … my birth2014 data set will be stored in the directory specified.
How do I know what columns hold the birth year, birth month, etc. ? Why, I read the codebook (look for “record layout”).
LIBNAME out “C:\Users\me\mydir\” ;
DATA out.birth2014 ;
INFILE in OBS= 10000 ;
INPUT birthyr 9-12 bmonth 13-14 dob_tt 19-22 dob_wk 23 bfacil 32
mager 75-76 mrace6 107 mhisp_r 115 mar_p $ 119 dmar 120
meduc 124 bmi 283-286 cig1_R 262 cig2_R 263 cig3_r 264 dbwt 504-507 bwtr4 511;
STEP 3: RECODE MISSING DATA FIELDS. Think how much it would screw up your results if you did not recode 9999 for birthweight in grams, which means not that the child weighed almost 20 pounds at birth but that the birthweight was missing. In every one of the variables below, the “maximum” value was actually the flag for missing data. How did I know this? You guessed it, I read the codebook. NOTE: these statements below are included in the data step.
IF bmi > 99 THEN bmi = . ;
if cig1_r = 6 then cig1_r = . ;
if cig2_r = 6 then cig2_r = . ;
if cig3_r = 6 then cig3_r = . ;
if dbwt = 9999 then dbwt = . ;
if bwtr4 = 4 then bwtr4 = . ;
STEP 4: LABEL THE VARIABLES – Six months from now, you’ll never remember what dmar is.
NOTE: these statements below are also included in the data step.
LABEL mager = “Mom/age”
bfacil = “Birth/facility” mrace6 = “Mom/race” mhisp_r = “Mom/Hispanic”
dmar = “Marital/Status” meduc = “Mom/Educ”
cig1_r =”Smoke/Tri1″ cig2_r =”Smoke/Tri2″ cig3_r =”Smoke/Tri3″;
So, that’s it. Now I have a data set with 10,000 records and 19 variables that I can upload for my students to analyze.