I can’t believe I haven’t written about this before – I’m going to tell you an easy (yes, easy) way to find and communicate to a non-technical audience standardized mortality rates and relative risk by strata.

It all starts with PROC STDRATE . No, I take that back. It starts with this post I wrote on age-adjusted mortality rates which many cohorts of students have found to be – and this is a technical term here – “really hard”.


Here is the idea in a nutshell – you want to compare two populations, in my case, smokers and non-smokers, and see if one of them experiences an “event”, in my case, death from cancer, at a higher rate than the other. However, there is a problem. Your populations are not the same in age and – news flash from Captain Obvious here – old people are more likely to die of just about anything, including cancer, than are younger people. I say “just about anything” because I am pretty sure that there are more skydiving deaths and extreme sports-related deaths among younger people.

Captain Obvious wearing her obvious hat

So, you compute the risk stratified by age. I happened to have this exact situation here, and if you want to follow along at home, tomorrow I will post how to create the data using the sashelp library’s heart data set.
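(If you can't wait until tomorrow, here is my back-of-the-envelope sketch of one way to build such a data set from sashelp.heart. The variable names agegroup, event_e, count_e, event_ne and count_ne match the STDRATE code below; the age cutoffs, the PROC SQL approach and the handling of missing smoking status are my guesses, not necessarily what tomorrow's post will do.)

/* A SKETCH ONLY: one row per age group, with cancer deaths and totals
for smokers (exposed) and non-smokers (reference).
Note: records with missing smoking status count as exposed here - decide for yourself */
proc sql ;
create table std4 as
select case when ageatstart < 40 then 'Under 40'
when ageatstart < 50 then '40-49'
when ageatstart < 60 then '50-59'
else '60+' end as agegroup,
sum(smoking_status ne 'Non-smoker' and deathcause = 'Cancer') as event_e,
sum(smoking_status ne 'Non-smoker') as count_e,
sum(smoking_status = 'Non-smoker' and deathcause = 'Cancer') as event_ne,
sum(smoking_status = 'Non-smoker') as count_ne
from sashelp.heart
group by calculated agegroup ;
quit ;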
The code is a piece of cake


PROC STDRATE DATA=std4
REFDATA=std4
METHOD=indirect(af)
STAT=RISK
PLOTS(STRATUM=HORIZONTAL);
POPULATION EVENT=event_e TOTAL=count_e;
REFERENCE EVENT=event_ne TOTAL=count_ne;
STRATA agegroup / STATS;
RUN;

The first statement gives the data set that holds your exposed sample, e.g., the smokers, and, with REFDATA=, your reference data set of non-exposed records, in this example, the non-smokers. You don't need these data to be in two different data sets and, in this example, they happen to be in the same one. The method used for standardization is indirect. If you're interested in the different types of standardization, check out this 2013 SAS Global Forum paper by Yang Yuan.

STAT = RISK will actually produce many statistics,  including both crude risk estimates and estimates by strata for the exposed and non-exposed groups, as well as standardized mortality rate – just, a bunch of stuff. Run it yourself and see.  The PLOTS option is what is of interest to me right now. I want plots of the risk by stratum.

The POPULATION statement gives the variable that holds the number of people in the exposed group who had the event, in this case, death by cancer, and the variable with the total count in the exposed group.

The REFERENCE statement names the variable that holds the number in the non-exposed group who had the event, and the variable with the total count in the non-exposed group (both those who died and those who didn't).

The STRATA statement gives the variable by which to stratify. If you don’t need your data set stratified because there are no confounding variables – lucky you – then just leave this statement out.

Below is the graph

risks by strata
The PLOTS statement produces plots of the crude estimate of the risk by strata, with the reference group risk as a single line. If you look at the graph above you can see several useful measures. First, the blue circles are the risk estimate for the exposed group at each age group and the vertical blue bars represent the 95% confidence limits for that risk. The red crosses are the risk for the reference group at each age group. The horizontal, solid blue line is the crude estimate for the study group, i.e., smokers, and the dashed, red line is the crude estimate of risk for the reference group, in this case, the non-smokers.

Several observations can be made at a glance.

  1. The crude risk for non-smokers is lower than for smokers.
  2. As expected, the younger age groups are below the overall risk of mortality from cancer.
  3. At every age group, the risk is lower for the non-exposed group.
  4. The differences between exposed and non-exposed are statistically significant for the two younger age groups only; for the other two groups, the non-smokers, although having a lower risk, do fall within the 95% confidence limits for the exposed group.

There are also a lot more statistics produced in tables but I have to get back to work so maybe more about that later.

I live in opposite world

Speaking of work — my day job is that I make games for 7 Generation Games and for fun I write a blog on statistics and teach courses in things like epidemiology. Actually, though, I really like making adventure games that teach math and since you are reading this, I assume you like math or at least find it useful.


Share the love! Get your child, grandchild, niece or nephew a game from 7 Generation Games.

One of my favorite emails was from the woman who said that after playing the games several times while visiting her house, her grandson asked her suspiciously,

Grandma, are these games on your computer a really sneaky way to teach me math?

You can check out the games here and if you have no children to visit you or to send one as a gift, you can give one to a school – good karma. (But, hey, what’s with the lack of children in your life? What’s going on?)

SENSITIVITY AND SPECIFICITY – TWO ANSWERS TO “DO YOU HAVE A DISEASE?”

Both sensitivity and specificity address the same question – how accurate is a test for disease – but from opposite perspectives. Sensitivity is defined as the proportion of those who have the disease that are correctly identified as positive. Specificity is the proportion of those who do not have the disease who are correctly identified as negative.

Students and others new to biostatistics often confuse the two, perhaps because the names are somewhat similar. If I were in charge of naming things, I would have named one ‘sensitivity’ and the other something completely different, like ‘unfabuloso’. Why I am never consulted on these issues is a mystery to me, too.

Specificity and sensitivity can be computed simultaneously, as shown in the example below using a hypothetical Disease Test. The results are in and the following table has been obtained:

 

                 Disease   No Disease
Test Positive       240           40
Test Negative        60          160

Results from Hypothetical Screening Test

COMPUTING SENSITIVITY AND SPECIFICITY USING SAS

Step 1 (optional): Reading the data into SAS. If you already have the data in a SAS data set, this step is unnecessary.

The example below demonstrates several SAS statements for reading data into a SAS data set when only aggregate results are available. The ATTRIB statement sets the length of the result variable to 10, rather than accepting the SAS default of 8 characters. The INPUT statement uses list input, with a $ signifying character variables.

DATALINES; – a statement on a line by itself – precedes the data. (Trivial pursuit fact: CARDS; will also work, dating back to the days when this statement was followed by cards with the data punched on them.) A semi-colon on a line by itself denotes the end of the data.

DATA diseasetest ;
ATTRIB result LENGTH= $10 ;
INPUT result $ disease $ weight ;
DATALINES ;
positive present 240
positive absent 40
negative present 60
negative absent 160
;
RUN ;

Step 2: PROC FREQ

PROC FREQ DATA= diseasetest ORDER=FREQ ;
TABLES result*disease ;
WEIGHT weight ;
RUN ;

Yes, plain old boring PROC FREQ. The ORDER=FREQ option is not required, but it makes the output more readable, in my opinion, because with these data the first cell will now be those who had a positive result and did, in fact, have the disease. This is the numerator for the formula for sensitivity, which is:

 

Sensitivity = (Number with disease who tested positive) / (Total with disease).

 

TABLES variable1*variable2 will produce a cross-tabulation with variable1 as the row variable and variable2 as the column variable.

WEIGHT weightvariable will weight each record by the value of the weight variable. The variable was named ‘weight’ in the example above, but any valid SAS name is acceptable. Leaving off this statement will result in a table that only has 4 subjects, 1 subject for each combination of result and disease, corresponding to the data lines above.

Results of the PROC FREQ are shown below. The bottom value in each box is the column percent.

Because the first category happens to be the “tested positive” and the first column is “disease present”, the column percent for the first box in the cross-tabulation – positive test result, disease is present – is the sensitivity, 80%. This is the proportion of those who have the disease (the disease present column) who had a positive test result.

 

Table of result by disease
(each cell shows Frequency / Percent / Row Pct / Col Pct)

result      present                        absent                         Total
positive    240 / 48.00 / 85.71 / 80.00    40 / 8.00 / 14.29 / 20.00      280 / 56.00
negative    60 / 12.00 / 27.27 / 20.00     160 / 32.00 / 72.73 / 80.00    220 / 44.00
Total       300 / 60.00                    200 / 40.00                    500 / 100.00

Output from PROC FREQ for Sensitivity and Specificity

The column percentage for the box corresponding to a negative test result and absence of disease is the value for specificity. In this example, the two values, coincidentally, are both 80%.
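For the record, the parallel formula is:

Specificity = (Number without disease who tested negative) / (Total without disease).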

Three points are worthy of emphasis here:

  1. While the location of specificity and sensitivity in the table may vary based on how the data and PROC FREQ are coded, the values for sensitivity and specificity will always be diagonal to one another.
  2. This exact table produces four additional values of interest in evaluating screening and diagnostic tests: positive predictive value, negative predictive value, false positive probability and false negative probability. Further details on each of these, along with how to compute the confidence intervals for each, can be found in Usage Note 24170 (SAS Institute, 2015).
  3. The same exact procedure produces six different statistics used in evaluating the usefulness of a test. Yes, that is pretty much the same as point number 2, but it bears repeating.

Speaking of that SAS Usage Note, you should really check it out.
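If you would rather compute all six of those statistics yourself than read them off the crosstab, here is a minimal sketch using the cell counts from the table above. The variable names are mine, and I am using the common definitions where the false positive probability is 1 minus specificity and the false negative probability is 1 minus sensitivity – check the Usage Note for the exact definitions SAS uses.

data teststats ;
/* cell counts from the hypothetical screening test above */
tp = 240 ; fp = 40 ; fn = 60 ; tn = 160 ;
sensitivity = tp / (tp + fn) ; /* 240/300 = .80 */
specificity = tn / (tn + fp) ; /* 160/200 = .80 */
ppv = tp / (tp + fp) ; /* positive predictive value, 240/280 */
npv = tn / (tn + fn) ; /* negative predictive value, 160/220 */
false_pos = 1 - specificity ;
false_neg = 1 - sensitivity ;
run ;

proc print data=teststats ;
run ;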

Policy makers have very good reason for wanting to know how common a condition or disease is. It allows them to plan and budget for treatment facilities, supplies of medication and rehabilitation personnel. There are two broad answers to the question, “How common is condition X?” and, interestingly, both of these use the exact same SAS procedures. Prevalence is the number of persons with a condition divided by the number in the population. It's often given as per 1,000 or per 100,000, depending on how common the condition is. (For example, if 550 people in a city of 110,000 have a condition, the prevalence is 550/110,000 = .005, or 5 per 1,000.) Prevalence is often referred to as a snapshot. It's how many people have a condition at any given time.

 

Just for fun, let’s take a look at how to compute prevalence with SAS Studio.

Step 1: Access your data set

First, assign a libname so that you can access your data. To do that, you create a new SAS program by clicking on the first tab in the top menu and selecting SAS Program.

Click to create new program

libname mydata "/courses/number/number/" access=readonly;

(Students only have readonly access to data sets in the course directory. This prevents them from accidentally deleting files shared by the whole class. As a professor with many years of experience, let me just tell you that this is a GREAT idea.)

Click on the little running guy at the top of your screen and, voila, your LIBNAME is assigned and the directory is now available for access.

(Didn’t believe me that there is a little running guy that means “run”? Ha!)

running guy

Next, in the left window pane, click on Tasks and in the window to the right, click on the icon next to the data field.

window to select library

From the drop down menu of directories, select the one with your data and then click on the file you need to analyze.

list of files in directory

Step 2: Select the statistic that you want and then select the variable. In this case, I selected one-way frequencies, and one cool thing is that SAS will automatically show you ONLY the roles you need for a specific test. If you were doing a two-sample t-test, for example, it would ask for your groups variable and your analysis variable. Since I am doing a one-way frequency, there is only an analysis variable.

select roles

When you click on the plus next to Analysis Variables, all of the variables in your data set pop up and you can select which you want to use. Then, click on your little running guy again, and voila again, results.

Results of proc freq

So … the prevalence of diabetes is about 11% of the ADULT population in California, or about 110 per 1,000.

You can also code it very simply if you would like:
libname mydata "/courses/number/number/" access=readonly;

PROC FREQ DATA = mydata.datasetname ;
TABLE variable ;
RUN ;

Of course, all of this assumes that your data are cleaned and you have a binary variable coded has disease / doesn't have disease, which is a pretty large assumption.
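If you don't have that binary variable yet, sometimes it is as simple as a recode. Here is a minimal sketch – the variable diabetes_raw and its coding of 1 = diagnosed diabetic are entirely hypothetical, so check your own codebook:

data analyze ;
set mydata.datasetname ;
length diabetes $3 ;
/* diabetes_raw is a made-up variable name for this example */
if missing(diabetes_raw) then diabetes = ' ' ; /* decide what missing means BEFORE lumping it in with No */
else if diabetes_raw = 1 then diabetes = 'Yes' ;
else diabetes = 'No' ;
run ;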

Now, curiously, the PROC FREQ code above is the exact SAME code we used to compute incidence of Down syndrome a few weeks ago. What's up with that, and how can you use the exact same code to compute two different statistics?

Patience, my dear. That is a post for another day.

Who was it who said that asking a statistician about sample size is like asking a jeweler about price? If you have to ask, you can't afford it.

We all know that the validity of a chi-square test is questionable if the expected frequency of the cells is less than five. Well, what do you do when, as happened to me recently, ALL of your cells have expected frequencies less than five?

baby mashing cake

The standard answer might be to collect more data, and we are in the process of that, but having the patience of the average toddler, I wanted that data analyzed NOW because it was very interesting.

It was our hypothesis that rural schools were less likely to face obstacles in installing software than urban schools, due to the extra layers of administrative approval required in the latter (some might call it bureaucracy). On the other hand, we could be wrong (horrors!). Maybe rural schools had more problems because they had more difficulty finding qualified personnel to fill information technology positions. We had data from 17 schools, 9 from urban school districts and 8 from rural districts. To participate in our study, schools had to have a contact person who was willing to attempt to get the software installed on the school computers. This was not a survey asking them whether it would be difficult or how long it would take. We actually wanted them to get software (7 Generation Games) not currently on their school computers installed. To make sure that cost was not an issue, all 17 schools received donated licenses.

You can see the full results here.

In short, 8 of the 9 urban schools had barriers to installation of the games which delayed their use in the classroom by a median of three months. I say median instead of mean because four of the schools STILL have not been able to get the games installed. The director of one after-school program that wanted to use the games decided it was easier for his program to go out and buy their own computers than to get through all of the layers of district approval to use the school computer labs, so that is what they did.

For the rural schools, 7 out of 8 reported no policy or administrative barriers to installation. The median length of time from when they received the software to installation was two weeks. In two of the schools, the software was installed the day it was received.

Here is a typical comment from an urban school staff member,

“I needed to get it approved by the math coach, and she was all on board. Then I got it approved at the building level. We had new administration this year so it took them a few weeks to get around to it, and then they were all for it. Then it got sent to the district level. Since your games had already been approved by the district, that was just a rubber stamp but it took a few weeks until it got back to us, then we had all of the approvals so we needed to get it installed but the person who had the administrator password had been laid off. Fortunately, I had his phone number and I got it from him. Then, we just needed to find someone who had the spare time to put the game on all of the computers. All told, it took us about three months, which was sad because that was a whole semester lost that the kids could have been playing the games.”

And here is a typical comment from a rural staff member.

“It took me, like, two minutes to get approval. I called the IT guy and he came over and installed it.”

The differences sound pretty dramatic, but are they different from what one would expect by chance, given the small sample size? Since we can’t use a chi-square, we’ll use Fisher’s exact test. Here is the SAS code to do just that:

PROC FREQ DATA = install ;
TABLES rural*install / CHISQ ;
RUN ;

Wait a minute! Isn’t that just a PROC FREQ and a chi-square? How the heck did I get a Fisher’s exact test from that?

Well, it turns out that if you have a 2 x 2 table, SAS automatically computes the Fisher exact test, as well as several others. I told you that you could see the full results here but you didn’t look, now, did you?

You can see the full results here.

In case you still didn’t look, the probability of obtaining this table under the null hypothesis that there is no difference in administrative barriers in urban versus rural districts is .0034.
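In case you want to check that at home, here is my reconstruction of the data set from the counts in this post (8 of 9 urban schools with barriers, 1 of 8 rural schools) – the actual variable coding in our data may differ:

data install ;
input rural $ install $ weight ;
datalines ;
urban barriers 8
urban none 1
rural barriers 1
rural none 7
;
run ;

proc freq data=install ;
tables rural*install / chisq ;
weight weight ; /* needed here because these are aggregated counts, not one record per school */
run ;

The two-sided Fisher exact p-value from this table comes out to that same .0034.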

If you think these data suggest it is easier to adopt educational technology in rural districts than in urban ones, well, not exactly. Rural districts have their own set of challenges, but that is a post for another day.

 

Recall that in the last post we were using SAS functions to score a test that had been completed by middle school and upper elementary students. Since we wanted to make it as easy as possible for students to enter their answers, we accepted just about any format.

Picking up where we left off …

SUBSTR FUNCTION – READING ONLY PART OF AN ANSWER

In one question, the correct answer is 1/8 . Students entered 1/8, 1/8 cup, 1/8 cup of beans, and so on.  To score these, we use the substr function to read the first 3 characters and score the problem correct if those are “1/8”

if substr(q22,1,3) = "1/8" then q22 = 1 ;
else q22 = 0 ;

MISSING FUNCTION – TO CHECK FOR MISSING DATA

For q27, students clicked on which of the equations is correct. If they clicked the correct equation, the variable was set to 1. When they didn't click on anything, it was missing. I wanted that changed to a zero, so I used the MISSING function, like this:

if missing(q27) then q27 = 0 ;

ARRAY, DO-LOOP and INPUT FUNCTION

Now, SAS is not the newest kid on the block

and I can relate, because neither am I, not even if I’m on a relatively old block

Old and middle-aged people

The problem with being an older language is that you have static types and you cannot have mixed arrays. What does that mean? It means that if you have defined q1 as a character variable, because it might have a $ in it, then by God it is going to stay a character variable, and you can't be doing any funny stuff like finding its mean and standard deviation. Also, if you are going to have an array, everything in it had better be either all character variables or all numeric variables.

Well, fine, then, here is how I change all those now scored questions to numeric. First, I created a new numeric array of 32 items. You can tell it is numeric because there is no $.

Second, I used a DO loop and the INPUT function. The INPUT function will read a variable and return it in a different form, in this case, a numeric variable with a length of 8 and 0 decimal places.

I dropped the index variables j and i , which I mentioned in a previous post.

Now that I have my variables all nicely numeric, I can use the SUM function and add up all of the scored items into a total score.

 

array items {32} item1 - item32 ;
do j = 1 to 32 ;
items{j} = input(qs{j},8.0);
end;
drop i j ;
total = sum(of item1-item32);

 

Now that I have my data, the fun stuff begins, but that’s for another post because I need to get back to making games.

 

This is my day job. Check it out. Buy a game. Maturity is over-rated!

burning village

Or donate a game to a school for good karma.

Did you ever fill out one of those online forms where you kept trying to submit it and got messages like,

“You need to enter your phone number in the format 311-234-12234”

or

You cannot have any special characters in this field.

That one really irritates me because, in fact, my last name has a space in it and many websites refuse to accept it. Take it up with The Invisible Developer, or his ancestors.

Have you ever just said the hell with it, and skipped filling out the form? Preventing users from entering all but the expected data type saves problems when you analyze your data, but it can also cause people to give up on your stupid web form.

So … when I created the pretest for Forgotten Trail and Aztech, I made it accept just about anything. If you wanted to write in 6, six, 9R6, 6 left over — any and all of those would be accepted and recorded.

You can get the first two games we developed here.

background for hidden pictures game

Forgotten Trail and Aztech are in beta and will not be commercially available for another couple of months.

What now? I have to score that test, but I’d rather the difficulty be on me than 150 or so middle school students who are our first test group.

So… how to fix it, with SAS character functions. Here is me, scoring the first half of the test:

First, I read the data into a new data set because I want to preserve the original data and not write over it. I may want to look at the exact incorrect answers later.

I create a character array of all 32 items on the test, and then I use a DO loop to change all of the questions to upper case.

 

Data in.recode ;
set in.pretestGMS ;
array qs{32} $ q1 - q27 q28a q28b q28c q29 q30 ;
do i = 1 to 32 ;
qs{i} = upcase(qs{i}) ;
end ;

Now, on to the questions. I eventually need all of these items scored 1 = correct, 0 = incorrect.

q1 is a question about money. People put all kinds of wrong answers – $35, $40 – as well as the correct answer as 100 or $100. I used the COMPRESS function to remove the ‘$’, then set q1 to equal 1 if the answer was 100, and 0 otherwise.
q1 = compress(q1,"$") ;
if q1 = 100 then q1 = 1 ;
else q1 = 0 ;

The second use of the COMPRESS function removes blanks – if you don't put any second parameter in the COMPRESS function, removing blanks is all it does. In q2, the answer was 4, but the students put “four”, “four frogs”, “4/14” and so on. All of these are correct. You can have a list in an IF statement and, if the variable matches any of the values in the list, then do something, in this case, set the answer as correct.
q2 = compress(q2) ;
if compress(q2) in ("4","FOUR","FOURFROGS","4/14","4OUTOF","4FROGS") then q2 = 1;
else q2 = 0 ;

*** How to keep only numeric data using a simple SAS function (take that all you regular expression fetishists!)

The third use of the COMPRESS function KEEPS the characters in the second parameter, because I added the optional third parameter “k”, to KEEP the characters in the second parameter instead of discarding them. So, this keeps numbers and deletes everything else from the answer. If what is left is 150, it is scored correct; otherwise, it's wrong.
if compress(q5,”0123456789″,”k”) = 150 then q5 = 1;
else q5 = 0 ;

 

A lot of the items were similar, so that is half of scoring the test. I’ll try to write up the rest from the airport  tomorrow, but for now, I need to write a couple of emails, finish this scoring program and pack before 2 am, and that only gives me about 40 minutes.

On twitter, there were a few comments from people who said they didn’t like to take interns because “More than doing work, they want to watch me work.”

I see both sides of that. You're busy. You're not Netflix. I get it. On the other hand, that's a good way to learn.

The data are part of the evaluation of the effectiveness of 7 Generation Games in increasing math achievement scores. You can read more about us here. Below is a sneak peek of the artwork from the level we are creating now for Forgotten Trail.

characters from Forgotten Trail in Maine

So, here you go. I’m starting on a data analysis project today and I thought I’d give you the blow by blow.


It just so happens that the first several steps are all point-y and click-y. You could do it other ways but this is how I did it today. So, step one, I went to phpMyAdmin on the server where the data were saved and clicked Export.


Step 2: For the format to export, I selected CSV and then clicked on the Go button. Now I have it downloaded to my desktop.


Step 3: I opened SAS Enterprise Guide and selected Import Data.  I could have done this with SAS and written code to read the file, but, well, I didn’t. Nope, no particular reason, just this isn’t a very big data set so I thought, what the heck, why not.
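(If you did want to do it in code, here is a minimal sketch with PROC IMPORT – the file path is made up, and the options mirror the choices described in the next step. One difference: with no field names, PROC IMPORT names the columns something like VAR1, VAR2, while the Enterprise Guide wizard uses F1, F2.)

proc import datafile="C:\Users\you\Desktop\exported.csv" /* hypothetical path */
out=pretest
dbms=csv
replace ;
getnames=no ; /* the file has no field names in the first record */
datarow=1 ; /* the data start at the first record */
run ;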

boxes to check in import data menu

Step 4: DO NOT ACCEPT THE DEFAULTS! Since I have a comma-delimited file with no field names, I need to uncheck the box that says File contains field names on record number. SAS conveniently shows you the data below, so I can see that it is comma-delimited. I know I selected CSV, but it's always good practice to check. I can also see that the data starts at the first record, so I want to change the value in Data records start at record number to 1.

changing names

Step 5: Change the names – I am never going to remember what F1, F2, etc. are, so for the first 5, I click on the row and edit the names to be the name and label I want.

That’s it. Now I click the next button on the bottom of the screen until SAS imports my data.

I could have continued changing all of the variable names, because I KNOW down the line I am not going to remember that F6 is actually the first question or that F25 is question 28a. However, I wanted to do some other things that I thought would be easier to code, so I opened up a program file in SAS Enterprise guide and wrote some code.

/* THIS CREATES TWO ARRAYS BECAUSE I AM TOO LAZY
TO RENAME 32 QUESTIONS INDIVIDUALLY.
THE PRETEST DATA SET WAS CREATED BY THE STEPS ABOVE USING IMPORT DATA */

data pretest2 ;
set pretest ;
** NOTE THAT THERE IS A $ AFTER THE NUMBER OF ELEMENTS IN THE ARRAY
** BECAUSE THIS IS A CHARACTER ARRAY ;
array ren{32} $ f6-f37 ;
array qs{32} $ q1-q27 q28a q28b q28c q29 q30 ;
do i = 1 to 32 ;
qs{i} = ren{i} ;
end ;
** YOU CAN ALSO USE A RENAME STATEMENT TO RENAME THE SAME VARIABLES ;
rename f38 = date_test ;
*** SINCE I NO LONGER NEED THE VARIABLES F6-F37 OR THE INDEX VARIABLE
*** FOR THE ARRAY, I DROP THEM HERE ;
drop f6-f37 i ;
run ;

*** SOME STUDENTS SAVED THE TEST MORE THAN ONCE BECAUSE THEY SAVED BEFORE
THEY WERE DONE AND AT THE END, SO I SORT BY USERNAME AND TEST DATE.
WE WILL ONLY KEEP THE LAST ONE ;

proc sort data=pretest2 ;
by username date_test ;
run ;

*** THIS KEEPS JUST THE LATEST TEST DATE. ALSO, WE TESTED THIS 45 TIMES IN
THE PROCESS OF GETTING READY FOR USE IN THE SCHOOLS. ALL OF OUR STAFF USED
USERNAMES WITH 'TEST' SO I USED THE INDEX FUNCTION TO FIND IF THERE WAS A
'TEST' IN THE USERNAME AND, IF SO, DELETED THAT RECORD ;

data pretest2 ;
set pretest2 ;
by username date_test ;
if last.username ;
if index(username,'TEST') > 0 then delete ;
run ;

Okay, that’s it. Now I have my data all ready to analyze. Pretty painless, isn’t it?

Want to learn more about SAS?

Here is a good paper on Arrays Made Easy.

If you’re interested in character functions like index, here is a good paper by Ron Cody.

Even though Rick Wicklin (buzzkill!) disabused me of the concern that SAS was communicating with aliens through the obscure coding in its sashelp data sets, I still wanted to roll my own.

If you, too, feel more comfortable with a data set you have produced yourself, let me give you a few tips.

  • There is a wealth of data available online, much of it thanks to your federal government. For example, I downloaded the 2014 birth data from the Centers for Disease Control site. They have a lot of interesting public use data sets.
  • Read the documentation! The nice thing about the CDC data sets is that they include a codebook. This particular one is 183 pages long. Not exciting reading, I know, but I’m sitting here right now watching a documentary where some guy is up to his elbows in an elephant’s placenta (I’m seriously not making this up) and that doesn’t look like a bowl of cherries, either.
  • Assuming you did not read all of the documentation even though I told you it was important (since I have raised four daughters and know all about people not paying attention to my very sound advice), at a MINIMUM you need to read three things: 1) the sampling information, to find out if there are any problems with selection bias and any sampling weights you need to know about, 2) the record layout – otherwise, how the hell are you going to read in the data? – and 3) the coding scheme.

Let's take a look at the code to create the data set I want to use in examples for my class. Uncompressed, the 2014 birth data set is over 5 GB, which exceeds the limit for upload to SAS On-Demand for Academics and also isn't the best idea for a class data set for a bunch of reasons, a big one being that I teach students distributed around the world and on ships at sea (for real), and having them access a 5 GB data set isn't really the most feasible idea.

 

I’m going to assume you downloaded the file into your downloads folder and then unzipped it.

STEP 1: CHECK OUT THE DATA SET

Since I READ the codebook, not being a big hypocrite, I saw that the record length is 775 and there are nearly 4 million records in the data set. Opening it up in SAS Enterprise Guide or the Explorer window didn't seem a good plan. My first step, then, was to use a FILENAME statement to refer to the data I downloaded, data that happens to be a text file.

I just want to take a look at the first 10 records to see that it is what it claims to be in the codebook. (No, I never DO trust anyone.)

The default length for a SAS variable is 8.

I’m going to give the variable a length of 775 characters.

Notice that the INFILE statement has to refer back to the reference used in the FILENAME statement, which I very uncreatively named “in”. Naming data sets, variables and file references is not the place for creativity. I once worked with a guy who named all of his data sets and variables after cartoon characters – until all of the other programmers got together and killed him.

Dead programmers aside, pay attention to that OBS=10 unless you really want to look at 3,998,175 records. The OBS=10 option will limit the number of records read to – you guessed it – 10.

With the INPUT statement, I read in from position 1-775 in the text file.

All of this just allows me to look at the first 10 records without having to open a file of almost 4 million records.

FILENAME in "C:\Users\you\Downloads\Nat2014us\Nat2014PublicUS.c20150514.r20151022.txt" ;

DATA example ;
LENGTH var1 $775 ;
INFILE in OBS=10 ;
INPUT var1 $ 1-775 ;
RUN ;

STEP 2: READ IN A SUBSET OF THE DATA

Being satisfied with my first look, I went ahead to create a permanent data set. I needed a LIBNAME statement to specify where I wanted the data permanently stored.

The out in the LIBNAME and DATA statements needs to match. It could be anything – komodo or kangaroo – as long as it's the same word in both places. So … my birth2014 data set will be stored in the directory specified.

How do I know what columns hold the birth year, birth month, etc. ? Why, I read the codebook (look for “record layout”).

 

LIBNAME out "C:\Users\me\mydir\" ;

DATA out.birth2014 ;
INFILE in OBS=10000 ;
INPUT birthyr 9-12 bmonth 13-14 dob_tt 19-22 dob_wk 23 bfacil 32
mager 75-76 mrace6 107 mhisp_r 115 mar_p $ 119 dmar 120
meduc 124 bmi 283-286 cig1_r 262 cig2_r 263 cig3_r 264 dbwt 504-507 bwtr4 511 ;

 

STEP 3: RECODE MISSING DATA FIELDS. Think how much it would screw up your results if you did not recode 9999 for birthweight in grams, which means not that the child weighed almost 20 pounds at birth but that the birthweight was missing.  In every  one of the variables below, the “maximum” value was actually the flag for missing data. How did I know this? You guessed it, I read the codebook. NOTE: these statements below are included in the data step.
IF bmi > 99 THEN bmi = . ;
if cig1_r = 6 then cig1_r = . ;
if cig2_r = 6 then cig2_r = . ;
if cig3_r = 6 then cig3_r = . ;
if dbwt = 9999 then dbwt = . ;
if bwtr4 = 4 then bwtr4 = . ;

STEP 4: LABEL THE VARIABLES – Six months from now, you’ll never remember what dmar is.

NOTE: these statements below are also included in the data step.
LABEL mager = "Mom/age"
bfacil = "Birth/facility" mrace6 = "Mom/race" mhisp_r = "Mom/Hispanic"
dmar = "Marital/Status" meduc = "Mom/Educ"
cig1_r = "Smoke/Tri1" cig2_r = "Smoke/Tri2" cig3_r = "Smoke/Tri3" ;
RUN ;

So, that’s it. Now I have a data set with 10,000 records and 19 variables that I can upload for my students to analyze.
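A quick sanity check I would run at this point – just a suggestion – to confirm that those maximum-value missing flags really did become missing values:

proc means data=out.birth2014 n nmiss min max ;
var bmi cig1_r cig2_r cig3_r dbwt bwtr4 ;
run ;

If NMISS is zero everywhere, or the maximums still show 9999 and friends, something went wrong with the recodes in Step 3.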

I was going to write about prevalence and incidence, and how so-called simple statistics can be much more important than people think and are vastly under-rated. It was going to be cool. Trust me.

In the process, I ran across two things even more important or cooler (I know, hard to believe, right?)


The Spoiled One & Co. find this hard to believe also

Here’s what happened … I thought I would use the sashelp library that comes with SAS On-Demand for Academics -and just about every other flavor of SAS – for examples of difference in prevalence. Since no documentation of all the data sets showed up on the first two pages of Google, and one is prohibited by international law from looking any further, I decided to use SAS to see something about the data sets.

Herein was revealed cool thing #1 – I know about the tasks in SAS Studio but I never really do much with these. However, since I’m teaching epidemiology this spring, I thought it would be good to take a look. You should do this any time you have a data set. I don’t care if it is your own or if it was handed down to you by God on Mount Sinai.

Moses with tablets

Okay, I totally take that back. If your data was handed down to you by God on Mount Sinai, you can skip this step, but only then.

At this point, Buddhists and Muslims are saying, “What the fuck?” and Christians are saying, “She just said, ‘fuck’! Right after a Biblical reference to Moses, too!”

This is why this blog should have some adult supervision.

But I digress. Even more than usual.

KNOW YOUR DATA! I don’t mean in the Biblical sense, because I’m not sure how that is even possible, but in the statistical sense. This is the important thing. I don’t care how amazingly impressive the statistical analyses are that you do, if you don’t know your data intimately (there’s that Biblical thing again) you may as well make your decisions by randomly throwing darts at a dartboard. I once told some people at the National Institutes of Health that’s how I arrived at the p-value for a test. For the record, the Men in Black have more of a sense of humor about these things than the NIH.

Ahem, so … if you are using SAS Studio, here is a lovely annotated image of what you are going to do.


1. Click on the arrow in the left window pane where it says TASKS to bring up a drop-down menu.


2. Click on the arrow next to Data and then select the Characterize Data task. (You might say that was 2 AND 3 if you were a smart ass and who asked you, anyway?)

3. Click the arrow next to the word DATA in the window pane second from the left and it will give you a choice of the available directories. (NOTE: If you are going to use directories not provided by SAS, you'll need a LIBNAME statement in an earlier step, but we're not and you don't in this particular example.) Under the directory, you will be able to select your file; in this case, I want to look at birthweight.


 

 

4. Next to the word VARIABLES, click the + and it will show the variables in the data set. You can hold down the shift key and select more than one. You should do this for all of the variables of interest. In my case, I selected all of the variables – there aren’t many in this dataset.


5. To run your program, click the little running guy at the top of the window. This will give you – ta-da –

RESULTS!


Let's notice something here – the mother's age ranges from -9 (seriously? What's that all about?) to 18. Is this a study of teenage mothers or what? The answer seems to be “what” because the mean age is .416. Say what? The mother's educational level ranges from 0 to 3, which probably refers to something, but I'll bet it's not years of education.
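(For the code-inclined, a rough equivalent of what the Characterize Data task just did for the numeric variables is plain old PROC MEANS – the task also produces frequency tables and graphs, which this one-liner skips:

proc means data=sashelp.bweight n mean min max ;
run ;

With no VAR statement, PROC MEANS summarizes every numeric variable in the data set.)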

 

In a class of 50 students, inevitably, one or two will turn in a paper saying,

“The average mother had 1.22 years of education.”

WHAT? Are you even listening to yourself? Those students will defend themselves by saying that is what “SAS” told them.

According to the SAS documentation, these data are from the 1997 records of the National Center for Health Statistics.

I ran the following code:


proc print data=sashelp.bweight (obs=5) ;
run ;

And either it’s the same data set or there was an amazing coincidence that all of the data in the first five records was the same.

However, because I really need to get a hobby, I went and found the documentation for the Natality Data Set from 1997, and it did not match up with the coding here. This led me to conclude one of the following:

a. SAS Institute re-coded these data for purposes of their own, probably instructional,

b. This is actually some other 1997 birthweight data set from NCHS, in which case, those people have far too much time on their hands.

c. SAS is probably using secretly coded messages in these data to communicate with aliens.

Julia as fat alien

Not being willing to chance it, I went to the NCHS site and downloaded the 2014 birth statistics data set and codebook so I could make my own example data.

So … what have we learned today?

  1. The TASKS in SAS Studio can be used to analyze your data with no programming required.
  2. It is crucial to understand how your data are coded before you go making stupid statements like the average mother is 3 months old.
  3. You can download birth statistics data and other cool stuff from the National Center for Health Statistics site.
  4. The Spoiled One uses any phone not protected by National Security Council level of encryption for photos of herself.

———

Want to make your children as smart as you?

Player needing help

Get them 7 Generation Games. Learn math. Learn social studies. Learn not to fall into holes.

Runs on Mac and Windows for under ten bucks.

 

 

Over the weekend, I wrote a post showing how SAS can be used to make what appears to be a complex problem quite simple.

First of all, am I just being dramatic? Seriously, how can having your variable lengths differ be a disaster?

Simple. You are merging by a variable that is a unique identifier, like username or social security number. Because the two data sets store it with different lengths, values that should match do not. If you are computing the number of unique users, you may overestimate by a huge amount. If you want the number of people who are in both data sets, you may vastly underestimate the number of true matches.

As with anything in programming, there are many ways to do this. My solution is to create a new variable and set it to the identical length and format using the ATTRIB statement. An extra bonus is that this will work when you have variables that are not only different lengths but different types, say, character in one data set and numeric in the other.

You really only need two statements in your data step, an ATTRIB statement and then an assignment statement that sets the value of the variable you created to whatever the variable is you want to merge.

DATA dsname ;
ATTRIB newvar LENGTH = $49 ;
SET mydata2.dsname ;
newvar = oldvar ;
RUN ;

Repeat this step for the second data set and then merge (or concatenate) to your little heart's delight.
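For instance – all of the data set and variable names here are hypothetical – the whole pattern might look like this:

DATA dsname2 ;
ATTRIB newvar LENGTH = $49 ;
SET mydata2.dsname2 ;
newvar = oldvar2 ; /* same length and type as in the first data set */
RUN ;

PROC SORT DATA = dsname ; BY newvar ; RUN ;
PROC SORT DATA = dsname2 ; BY newvar ; RUN ;

DATA merged ;
MERGE dsname dsname2 ;
BY newvar ;
RUN ;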

The voice of experience:
Notice two things here: First, I created a temporary data set from my permanent one. Although SAS has gotten more forgiving over the years about not writing over your existing data sets when there is an error, it is still better to err on the side of caution and make sure all is wonderful before saving over that existing data, especially if it took you a lot of effort to get the data in that form.
Second, I created a new variable and kept the old one as is. I don't always do this, but it is good practice. You may be tempted to just use the first 9 digits because we all know social security numbers are 9 digits – and then later you find that it was entered as 123-45-6789 and now you only have 123-45-67.

Feel smarter after reading this blog?

Fish Lake artwork

Want to feel even smarter? Download and play our games!  You can run around in our virtual world while reviewing your basic math skills. If you are too busy (seriously?) you can still give a game as a gift or donate a game to a classroom or school.
