As I said in my last post, repeated measures ANOVA seems to be one of the procedures that confuses students the most. Let’s go through two ways to do an analysis correctly and the most common mistakes.

Our first example has people given an exam three times, a pretest, a posttest and a follow up and we want to see if the pretest differs from the other two time points.

proc glm data = example ;
model pre post follow = /nouni ;
repeated exams 3 contrast (1) /summary printm ;

Among other things, this will give you a table of Type III Sum of Squares that tells you that you have a significant difference across time. It will also give you contrasts between the 1st treatment and each of the other two.

You can see all of the output produced here.

This is using PROC GLM and so it requires that you have multiple VARIABLES representing each of the multiple times you measured people. This is in contrast to PROC MIXED which requires multiple records for each subject. We’ll get into that another day.

One thing that throws people all of the time is they ask, “Where did you get the exams variable?” In fact, I could have used any valid SAS name. It could have been “Nancy” instead of “exams” and that would have worked just as well. It’s a label we use for the factor measured multiple times. So, as counterintuitive as it sounds, there is NO variable named “exams” in your data set.

Let’s try a different example. This time, I have a treatment variable. I have administered two different treatments to my subjects. I want to see if treatment has any effect on improvement.

proc glm data =example ;
class treatment ;
model pre post follow = treatment/ nouni ;
repeated exams 3 /summary ;

The fixed effect does *not* go in your REPEATED statement

In this case, I do need a CLASS statement to specify my fixed effect of treatment. A really common mistake that students make is to code the REPEATED statement like this:

repeated treatment 3 /summary ; *WRONG! ;

It seems logical, right? Why would you use a completely made up name instead of one of your variables? If you think about it for a minute, though, treatment wasn’t repeated. Each subject only received one type of treatment.

When you are asking whether one group improved more than the other(s) what you are asking is, “Is there an interaction effect?” You can see by the table of Type III Sums of Squares produced below that there was no interaction effect.


A significant effect for the repeated measure does not mean your treatment worked!

A common mistake is to look at the significance for the repeated measure and because a significant change was found between times 1 and 3 to say that the treatment had an effect. In fact, though, we can see by the non-significant interaction effect that there was not an impact of treatment because there was no difference in the change in exam scores across the levels of treatment.

There are a lot of other common mistakes but I need to go back to work so those will have to wait for another blog.

When I teach students how to use SAS to do a repeated measures Analysis of Variance, it almost seems like those crazy foreign language majors I knew in college who were learning Portuguese and Italian at the same time.

I teach how to do a repeated measures ANOVA using both PROC GLM and PROC MIXED. It seems very likely in their careers my students will run into both general linear models and mixed models. The problem is that they confuse the two and the result is buggy code.

Let’s start with mistakes in PROC GLM today. Next time we can discuss mistakes in PROC MIXED.

Let’s say I have the simplest possible analysis – I’ve given the same students a pre- and a post-test and want to see if there has been a significant increase from time one to time two.

This will work just fine:

proc glm data =mydata.fl_pre_post ;
model pretest posttest = /nouni ;
repeated time 2 ;

Coding the repeated statement like this will also work

repeated time 2 (1 2) ;

So will

repeated time ;

It almost seems as if anything or nothing after the variable name will work. That’s not true. First of all,

repeated time 2 (a b) ; IS WRONG

… and will give you an error – Syntax error, expecting one of the following: a numeric constant, a datetime constant.

“Levels gives the number of levels associated with the factor being defined. When there is only one within-subject factor, the number of levels is equal to the number of dependent variables. In this case, levels is optional. When more than one within-subject factor is defined, however, levels is required,”

SAS 9.2 Users Guide

So, this explains why you can be happily using your repeated statement without bothering to specify the number of levels for a factor and then one day it doesn’t work. WHY? Because now you have two within-subject factors and you need to specify the number of levels but you don’t know that. This is why, when teaching I always include the number of levels. It will never cause your program to fail, even if it is unnecessary sometimes.

One more cool thing about the repeated statement for PROC GLM, you can do a planned contrast super easy. Let’s say I have done 3 tests, a pretest, a post-test and a follow-up. I want to compare the posttest and followup to the pretest.

proc glm data =mydata.fl_tests ;
model pretest posttest follow = /nouni ;
repeated test_time 3 contrast (1) /summary ;

What this will do is compare each of the other time points to the first one. A common mistake students make is to use a CONTRAST statement here with test_time. This will NOT work, although it will work with PROC MIXED, but that is a story for another day.

I was going to write more about reading JSON data but that will have to wait because I’m teaching a biostatistics class and I think this will be helpful to them.

What’s a codebook?

If you are using even a moderately complex data set, you will want a code book. At a minimum, it will tell you the name of each variable, the type (character, numeric or date), a label, if it has one and its position in the data set. It will also tell you the number of records and number of variables in a data set. In SAS, you can get all of this by running a PROC CONTENTS. (Also from a PROC DATASETS but we don’t cover that procedure in this class.)

So, for the sashelp.heart data set, for example, you would see:

output from Proc contents

The variable AgeAtDeath is the 12th variable in the data set. It is numeric, with a length of 8 and the label for it is “Age At Death”. Because it is a numeric variable, if you try to use it for any character functions, like finding a substring, you will get an error. (A substring is a subset of a string, so ‘ABC’ is a substring of ‘ABCDE’.)

Similarly, BP_Status is the 15th variable in the data set, it is a character, with a length of 7 and a label of “Blood Pressure Status”. Because it’s a character variable, if you try to do any procedures or functions that expect numeric variables, like find the mean, you will get an error. The label will be used in output, like in the table below.

Frequency distribution of blood pressure status

This is useful because you may have no idea what BP_Status is supposed to mean. HOWEVER, if you use “Blood Pressure Status” in your statements like the example below, you will get an error.

**** WRONG!!!
Proc means data=sashelp.heart ;
Var blood pressure status ;

Seems unfair, but that’s the way it is.

The above statement will assume you want the means for three separate variables named “blood” “pressure” and “status”.

There are no variables in the data set named “blood” or “pressure” so you will get an error. There is a variable named “status”, but it’s something completely different, a variable telling if the subject is alive or dead.

Even if you don’t have a real codebook available, you should at a minimum start any analysis by doing a PROC CONTENTS so you have the correct variable names and types.

What about these errors I was talking about, though? Where will you see them?

LOOK AT YOUR SAS LOG!!

If you are using SAS Studio , it’s the second tab in the middle window, to the right of the tab that says CODE.

Click on that tab and if you have any SYNTAX errors, they will conveniently show up in red.

Also, if you are taking a course and want help from your professor or a classmate, the easiest way for them to help you is if you is to copy and paste your SAS log into an email, or even better, download it and send it as an attachment.

Just because you have no errors in the SAS log doesn’t mean everything is all good, but it’s always the first place you should look.

To get a table of blood pressure status, you may have typed something like

Proc freq data=sashelp.heart ;
Tables status ;

That will run without errors but it will give you a table that gives status as alive or dead, not blood pressure as high, normal or optimal.

PROC CONTENTS is a sort of “codebook light”. A real codebook should also include the mean, minimum, maximum and more for each variable. We’ll talk about that in the next post. Or, who knows, maybe I’ll finally finish talking about reading in JSON data.

Sometimes data changes shape and type over time. In my case, we had a game that was given away free as a demo. We saved the player’s game state – that is, the number of points they had, objects they had earned in the game, etc. as a JSON object. Happy Day! The game became popular. Schools started using it. We came out with a premium version with lots more activities, a bilingual Spanish-English version, a bilingual Lakota- English version. Life is good.
Making Camp Premium on Google Play 
Once schools started using our games, they wanted data on how much students played, how many problems they answered. This is when life started to get complicated. We added more fields to the JSON object to show which activities they had completed and whether they had won. Data for one person might look like this, or much, much longer. “{“”points””:””8″”,””first_trade””:””false””,””first_visit””:””true””, “”has_wigwam””:””true””,””inventory””:””6,3,4,14″”,””inventory_position””:””[{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:176,\””top\””:4},{\””left\””:-309,\””top\””:-254},{\””left\””:0,\””top\””:0},{\””left\””:619,\””top\””:-45},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:293,\””top\””:-44},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0},{\””left\””:0,\””top\””:0}]””,””milestone””:””EarnPage””}”

The JSON engine didn’t work (not surprisingly)

I actually didn’t expect it would work because we didn’t actually have a JSON object but rather a JSON object converted to a string and saved in an SQL database. However, if something would only take a minute, I try it first
libname testmcb JSON fileref =reffile ;

Okay, so what now? We have a variable record length, the values we want might be in any column. It must be really difficult to figure out, right? Not really. This next bit looks complicated but it actually took very little time to code. Before we get into the code, let’s talk about what I needed to do and why. The “why” is that I want to know how much kids played the games. There are three milestones – logging in, getting a wigwam and trading for items. However, some kids just liked playing the mini-games and they played A LOT without pausing to trade in their points. Also, you can’t just look at how many points they have because they may have traded some points for items. Fortunately, all items cost two points, so I need to compute points + (items in inventory/2).
  • First, there were some records that did NOT start with points, they started with “milestone” because we were coding the JSON object one way and then we switched. (Pause to swear under my breath at people here.) So, I need to decide what type of record it was.
  • Second, I want to read in those variables that are milestones in the game play. That is, is this their first time playing, did they get points, did they get a wigwam and did they trade for anything for their wigwam. I need to keep in mind that a variable might be coded 1 or true if the player passed that milestone and 0, false or null if not. (Pause for more swearing.)
  • Third, I want to create Boolean variables that show whether or not a player passed a particular milestone.
  • Fourth, find out the NUMBER of objects in the inventory.
  • Fifth, divide the number in inventory by two and add to the points
This actually took me very little time. Let’s look at it bit by bit (ha ha). The first line just assigns a file reference for where my text file is located. FILENAME reffile '/home/annmaria.demars/data_analysis_examples/data2018/exceldata/mc_bilingualtest.txt'; Data mcb_json ; INFILE reffile DLM="," LRECL = 1337 ; *** READ IN FILE, VARIABLES ARE SEPARATED (DELIMITED) BY A COMMA. LENGTH OF THE RECORD IS 1,337 ; INFORMAT first_trade $40. first_visit $40. has_wigwam $40. ; INPUT  firstcol $ 15-16 @@ ; IF firstcol = '":' THEN  INPUT @43 points $ first_trade $ first_visit $ has_wigwam $ ; ELSE INPUT @15 points $ first_trade  first_visit  has_wigwam  ; ***** READ IN THE VARIABLES FOR TRADING, FIRST VISIT AND WIGWAM AS CHARACTER WITH A LENGTH OF 40 ; ****  READ COLUMNS 15-16 AND STAY ON THAT LINE ; **** IF THE VALUE IN COLUMNS 15-16 = “:  THEN IT HAD MILESTONE AT THE BEGINNING; ***** THE VARIABLES WE WANT START AT COLUMN 43, OTHERWISE, THEY START AT COLUMN 15; *** THESE ARE SEPARATED BY COMMAS SO WE CAN JUST LIST THE VARIABLE NAMES; *** NOTE THAT I NEED TO SPECIFY THAT POINTS IS CHARACTER DATA SINCE IT’S NOT IN MY INFORMAT STATEMENT ; IF INDEX(first_trade,"true") > 0 or index(first_trade,"1") > 0 then traded =0 ; ELSE IF INDEX(first_trade,"false") or index(first_trade,"0") then traded =1 ; IF INDEX(first_visit,"false") or index(first_visit,"0") then play_twice = 1 ; ELSE IF INDEX(first_visit,"true") or index(first_visit,"null") or index(first_visit,"1") then play_twice = 0 ; IF INDEX(has_wigwam,"true") then wigwam = 1 ; ELSE IF INDEX(has_wigwam,"null") > 0 or index(has_wigwam,"0") > 0 then wigwam = 0 ; *** ABOVE I JUST USE A FEW IF THEN STATEMENTS TO CREATE MY NEW VARIABLES ; **** The INDEX function returns the position in the variable in the first argument where it finds the value in the second argument; **** If the value isn’t found, it returns a value of 0 ; PROC FREQ ; TABLES  traded*first_trade first_visit*play_twice wigwam*has_wigwam ; This last step isn’t strictly necessary, except that it is. Here I do a cross-tabulation to make sure that all of the variables were assigned correctly and they were.  For example, you can see that if the value of “has_wigwam” was 0 or null the wigwam variable was set to ). If has_wigwam was equal to “true” the wigwam variable was set to 1. Screen shot of table of has_wigwam by wigwam values This would have worked for the whole JSON object except for the commas in the inventory. If you look at just a piece of the data, you can see that after the variables denoting milestones there is a variable that is actually an array, separated by commas. “first_visit””:””true””, “”has_wigwam””:””true””,””inventory””:””6,3,4,14″”,” We’ll look at how to handle that in the next post.

When I was young and knew everything, I would frequently see procedures or statistics and think, “When am I ever going to use THAT?” That was my thought when I learned about this new procedure to transpose a data set. (It was new then. Keep in mind, I learned SAS when I was pregnant with my first child. She is now CEO of a an educational game company and the mother of three children. )

PROC TRANSPOSE is super-useful. You might only think it is useful for transforming data for use with PROC GLM to use with PROC MIXED, or you might have no idea what the hell that means and it is still super-useful.

Let me give you today’s example. I’m looking for data to use in a biostatistics class I’m teaching next month. It’s a small data set, with data on eight states included in the Center for Disease Control’s Autism and Developmental Disabilities Monitoring Network.

The data looks like this:

As you can see, each state is a column. I would like to know, for example, what percentage of people with autism also have a physical disability. There is a way to do it by finding the mean across variables but I want to use this data set for a few examples and it would be much easier for me if each of those categories was a variable.

The code is super simple:

PROC TRANSPOSE DATA=mydata.autism OUT=mydata.autism2 NAME=state;
ID eligibility ;

The NAME = option is not required nor is the ID statement but they will make your life easier.  First, let’s take a look at our new data.

Data set with one record for each state

Now, instead of state being a variable, we have one record for each state, the percent with autism diagnosis only is one  variable, percent with emotional disturbance another, and so on. What the NAME = option does is give a name to that new variable which was the name of each column. If you don’t use that option, the first column would be named  _name_  . Now, with these data it would still be pretty obvious that this variable is the state but in some cases it wouldn’t be obvious at all.

The ID statement is really necessary in this case because otherwise each column is going to be named “COL1”, “COL2” etc.  Personally, I found the ID statement here confusing because normally the ID statement I think of as the individual ID for each record, like a social security number or student ID. In this case, the variable name you give in the ID statement is going to be used to name the variables. So, as you can see above, the first column is named Autism(%), the second is named Emotional Disturbance (%) and so on.

So, that’s it. All I need to do to get means, standard deviation, minimum and maximum is :

PROC MEANS DATA =mydata.autism2;

So, that’s it.

By the way, I get this data set and a few others from SAS Curriculum Pathways. Nice source for small data sets to start off a course.


I live in opposite world, where my day job is making games and I teach statistics and write about programming for fun.  You can check out our games here. You’re probably already pretty good with division but you’ll learn about the Lakota language and culture with Making Camp Lakota.  A bilingual (English-Lakota) game that teaches math.

feather

I was reminded today how useful a SAS log can be, even when it doesn’t give you any errors.

I’m analyzing data from a study on educational technology in rural schools. The first step is to concatenate 10 different data sets. I want to keep the source of the data, that is, which data set it came from, so if there are issues with these data, outliers, etc. I can more easily pinpoint where it occurred.

I used the IN= option for each data set when I read them in and then some IF statements to assign a source.

DATA mydata.all_users18 ;
    SET  sl_pre_users18 (in=slp )
              aztech_pre_clean (in=azp )
             AZ_maya_students18 (in=azms)
            fl_pretest_new18 (in=flpn)
            fl_pretest_old18 (in=flpo)
            ft_users18(in=ft)
           mydata.fl_students18 (in=fls )
          mc_bilingual_students18 (in=mcb)
        mc_users18 (in=mc)
        mydata.sl_students18 (in=sls)
;

After I run the data step, I see that 425 observations do not have a value for “source”. How would you spot the error?

Of course, there is more than one way, but I thought the simplest thing was to search in the SAS log and see which of the data sets had exactly 425 observations. Yep. There it is. Took me 2 seconds to find.

147 PROC IMPORT DATAFILE=REFFILE
148 DBMS=XLSX
149 OUT=WORK.MC_bilinguaL_students18 replace;
150 GETNAMES=YES;

NOTE: The import data set has 425 observations and 2 variables.

So, I looked at the code again and sure enough, I had misspelled “source”

IF  slp THEN source = “Spirit Pre” ;
      else if azp then source = “Az Pre” ;
     else if fls then source = “Fish Studn”;
     else if mcb then sourc = “M.Camp.Bil” ;

You might think I could have just read through the code, and you are right, but there were a lot of lines of code. In this case, I could immediately identify that it was something to do with that specific data set and reduce the code I needed to look at significantly. I just started with the last place that data set was referenced to work backward. Fortunately for me, it was in the very last place I called it.

The fact is, you will probably spend as much time debugging code as you do writing it. The log and logic are your friends. Also, no matter how long you have been programming you still make typos.

Want to play one of the games from this study? Have a computer? Go ahead, maturity is over-rated.


I know people who are so obsessive about testing and validating their code to the point they spend more time on testing it than actually writing it and analyzing the output. I said I know people like that, I didn’t say I was one of them. However, it is good practice to validate your SAS code and despite false rumors spread by my enemies, I do it sometimes.

Here is a simple example.  I believed that using the COMPRESS function with “l” for lower case or “I” for case-insensitive gave the same results. I wanted to test that. So, I ran two data steps

DATA USE_L;
set mydata.aztech_pre ;
q3 = compress(Q3,’ABCDEFGHIJKLMNOPQRSTUVWXYZ’,’l’);
q5 = compress(Q5,’ABCDEFGHIJKLMNOPQRSTUVWXY’,’l’);

… and a whole bunch more statements like that.

Then, I ran the exact same data step but with an “I” instead of an “l”  .

Finally, I ran a PROC COMPARE step

PROC COMPARE base =USE_L compare=USE_I ;
Title “Using l for lowercase vs I for insenstitive” ;

PROC COMPARE RESULTS SHOW NO DIFFERENCES

But, hey, maybe PROC COMPARE just doesn’t work. Is it really removing everything whether it is upper or lower case? To test this, I ran the procedure again comparing the dataset with the compressed results with the original data set.

PROC COMPARE base =mydata.aztech_pre compare=use_I ;
Title “Comparing with and without compress function” ;

The result was a whole lot of output, which I am not going to reproduce here, but some of the most relevant was:

  Values Comparison Summary                                                      
                                                                                                                                    
Number of Variables Compared with All Observations Equal: 24.                                     
 Number of Variables Compared with Some Observations Unequal: 16.                                  
Number of Variables with Missing Value Differences: 10.                                           
Total Number of Values which Compare Unequal: 694. 

Looking further in the results, I can see comparison of the results for each variable by observation number

          ||  q5                                                                              
           ||  Base Value           Compare Value                                              
       Obs ||  q5                    q5                                                        
 ________  ||  ____________          ____________                                              
            ||                                                                                  
         5  ||  150m                  150                                                       
         6  ||  42 miles              42                                                        
        10  ||  one thousand                                                                    
        12  ||  200 MILES             200       

So, I can see that the data step is doing what I want, which is removing all of the text from the responses and only leaving numbers. This is important because the next step is comparing the responses to the questions with the answer key and I don’t want any mismatches to occur because the student wrote ‘200 miles’ instead of 200.

In case you are interested, this is the pretest for two games that are used to teach fractions and statistics. You can find Aztech: The Story Begins here and play it for free, on your iPad , Mac, Windows or Chromebook computer.

Mayan god
Play Aztech !

Forgotten Trail can be played in a browser on any Mac, Windows or Chromebook computer.

According to that source of all knowledge on the interwebz, Wikipedia,  “Gaslighting is a form of psychological manipulation that seeks to sow seeds of doubt in a targeted individual or in members of a targeted group, making them question their own memory, perception, and sanity.”

Have you ever had a brilliant, super-competent friend who doubted her own competence?

I’ve often seen this happen to women in technical jobs, and it’s happened to me. Here’s what happens. You work with a man or a couple of men. (In theory it could be women, or men could do this to other men but I personally have only seen men do this, and usually to women). No one knows everything (duh!). You are an expert in Python, Ruby, JavaScript, PHP and Objective C. You’ve developed some pretty cool iOS apps, been part of some successful teams.  Bob suggests that the team really needs an Android app, but, 

You don’t know Java, do you, Joan?

You suddenly realize,

“Oh, my God, I don’t! How did I miss learning Java?”

Part of gaslighting is “using what’s important to you as ammunition”. If you’re a woman who has been in software development, mathematics, statistics or science for a long time,  it’s no doubt important to you and you’ve overcome a lot to stick it out and get where you are.  It’s important to you to be competent and knowledgeable and having someone question that is disconcerting. 

Gaslighters wear you down. It’s the death of a thousand cuts. Bob will insist that the prototype of the next app has to be built for Android because it’s the largest market share, “Of course, that leaves you out of the prototype build  because we need experienced Java developers.” “I’ll bet you’ve never used Android Studio.”

Gaslighters are also experts at reframing things, so much so that you don’t think of the fact that the last five prototypes were done for iOS and there was no problem porting to Android.

Gaslighters can also be good at getting other people to go along with them. If Bob repeatedly tells Sam that Joan isn’t a good fit for this project because we really need an Android developer for this prototype and Joan has no expertise in that area, “she mostly just does testing on iPhones”, Sam may believe him, after all, she’s admitted she has no expertise with Java. So, Sam is not going to consult with Joan on any technical issues, which wears Joan down even further. 

I agree with Stephanie Sarkis that some gaslighters may do this unintentionally and subconsciously. They are, in my experience, trying to make up for feelings of inferiority by making themselves look better by comparison and getting other people to depend on them. 

It doesn’t matter whether it is deliberate or not. The effects are insidious.

I used to think, “Suck it up, buttercup. If some clowns don’t think you have the technical chops, prove them wrong.”

I still think that to some extent, but I can see that it can be really difficult if you are constantly pricked with an endless series of whispering questions of your competence, both behind your back and to your face. It’s exhausting to always be trying to prove your abilities in the areas where you are knowledgable at the same time explaining that no, you have never used (insert any language here because no one has used all of them). I’ve seen women who really enjoyed coding move into marketing or project management giving the reason, 

It just wasn’t fun any more.

You may already be the solution.

Three of us, mutual friends, were at lunch one day and one woman mentioned she had been offered a terrific job but it was for “an expert in the field’ and she didn’t consider herself an expert. Her other friend and I immediately interrupted her,

What? Are you nuts? You are the very definition of an expert!

Then, we proceeded to list all of her amazing accomplishments because she really is incredible. 

Stick with people who see you in the best possible light

I have a great advantage in protection against gaslighters in that I married the right person. Recently, we were drinking beer with a friend who referred to me as “testing the games” and The Invisible Developer corrected, 

She doesn’t just test the games. She makes them, too.

It’s not often someone questions my technical ability around my husband, but when it does happen, he speaks up for me 100% of the time. That’s a big deal because he is not at all one to draw attention to himself. He’s not called the Invisible Developer for nothing. 

It’s not just him. I’m super fortunate to have a group of friends and colleagues who are really supportive and collaborative people who always have my back. 

If you are the problem, you have a problem

Maybe you are scoffing dismissively at this point that if Joan was any good none of this would have bothered her. You are making a snide comment over your cubicle that real developers don’t need anyone to tell them they’re good. People often feel uncomfortable around gaslighters, even if they can’t give a reason. They are right, too, because once Joan leaves, you’ll need someone else to disparage to make yourself feel superior, maybe Sam.

If Sam has a choice of his next project, it’s probably not going to be one with you. If he does get stuck working with you, after all of your comments about Joan, Sam is going to expect that you are God’s gift to Android development, that, in fact, your middle name is Java and the language was named after you. Imagine his response when you turn out to be nothing special.

What I’ve seen happen to the gaslighters eventually is that no one wants to work with them. Even though Bob thinks he’s a 10X software developer, for some reason no one wants him on their team. He tells himself it’s because they’re jealous. 

In the meantime, though, Joan is now managing the marketing department.

Don’t end up like Joan

Years ago, on the More than Ordinary podcast, I had my lovely daughter , Julia, as a guest to talk about what it’s like in boarding school. After saying, that “First of all, it’s nothing like Hogwarts … ” she went on to add

No matter where you are, you can find people to study with, to help support you to reach your goals. And, if not, well, just be that person for yourself.

So, if you find yourself being questioned so much that you start questioning yourself, try finding friends and colleagues who support you and remind you of your awesomeness. If for some reason that’s not an option, I suggest this. Remind yourself. Sit down and write down all of your accomplishments. Then, next time Bob questions you tell him, 

“Shut up you little prick. I’ve done amazing things, and I’m going to be here long after you’re gone.”

Okay, well, maybe you shouldn’t say that out loud at work, but if you do, I won’t blame you. 

In my day job, I make educational games, like this one where a Mayan god thing drags you into the past. Yes, it teaches math.

Mayan guy
Aztech Games: Not made with Java – yet – but that’s just a coincidence

About ROUND(100) years ago, I took a couple of COBOL courses. I never coded anything in COBOL after the classes, but the concept of a table look up has stuck with me.

Just like it sounds, this is used when you want to look something up in a table. For example, if I have ICD-10 codes, which are used to classify diagnoses, I could look up J09.X2 in a table and see that is “influenza”.

There are several ways to do this with SAS, including using PROC FORMAT, a one-to-many merge with either the data step or PROC SQL , a bunch of IF statements and a macro.

One way I learned very early on to do this in SAS was to use a PROC FORMAT.

Say I have a bunch of possible codes for outcomes for my program and they are coded 02 = applies all the way to 32 = “Post-employment services” .

PROC FORMAT; 
VALUE stats
0 = “referral”
2 = “applicant”
6 = “evaluation”
8 = “closed”
10 = “eligible”
12 = “IPE complete”
14 = “counseling”
18 = “training”
20 = “service complete”
24 = “service interrupted”
22 = “employed”
26 = “successful closure”
28 = “closed after service”
30 = “closed before service”
32 = “post-employment”
;

Another option, if I wanted to combine categories, like those who had a successful and unsuccessful outcome, is to do it like this :

PROC FORMAT;
VALUE stats
0, 2, 6, 10 , 12, 14, 18, 20  = “open case”
24, 28, 30 = “unsuccessful”
22, 26 , 32 = “successful closure”
;

In either case,  if I just wanted to have the type of service printed , I could use a FORMAT statement , like this.

FORMAT status_type stats. ;

If I wanted to create a new variable I could use the put function to put the original value into a new variable using the format.

DATA testy ;
  SET mydata.vr_codes ;
  recoded_status = PUT(status_type,stats.) ;
  

Is there any advantage of PROC FORMAT over doing 20 or 50 IF statements?

I can think of several. First of all, you can use the same PROC FORMAT repeatedly.  If you need to do the same transformation with several different data sets, you can just do the format procedure once, include one PUT or FORMAT statement in each data step and you are done.  Second, since you can store formats permanently,  if you haven’t gotten around to learning macros yet, this can be one method of using the same code over and over in different programs. Third, it’s just less messy to type, which seems trivial until you have 300 values to recode. 

Some day I might write a post on user-defined formats, especially how to store and re-use them. Today is not that day. In the meantime, I highly recommend reading this paper on building and using  user-defined formats by Art Carpenter while you are waiting.


I live in opposite world. I blog on SAS and statistics for fun and make games for a living. Check out Making Camp Premium. Learn about Ojibwe culture, brush up your math skills, learn more English and have fun. All for under two bucks.

Making Camp scene with buffalo, deer and rabbit

 

“I don’t document my code because if you really understood the language, it should be obvious.”

– Bob

Bob is an arrogant little prick.

Here are just a few reasons to document your code.

  1. Other people may need to modify it because, despite your assumed brilliance, there may be other people in the universe capable of maintaining your code when you get a promotion, take another job or get your sorry ass fired.
  2. Six months from now, you may need to look at this code again. After 11 other projects have intervened, you’ll be trying to figure out what the hell the prev_grant_yrs variable was supposed to measure. Every time I add comments to a project, I say to myself, “Future me will thank me for this.”
  3. If you use Title and Label statements, there will be additional clarity not just for you as a programmer but also for the users.

Here is an example

This comes from a longitudinal analysis of a vocational rehabilitation project. There are only two comment statements in this snippet, however, there is a LABEL statement which explains that the prev_grant_yrs variable is the number of years a consumer was served under the previous grant. There was a significant change in operations in the current grant cycle, but when this five-year cycle started there were a number of people already on the caseload who had been determined eligible under the previous administration.

data by_year ;
set mydata.vr2018 ;

** USES YEAR FUNCTION TO GET THE YEAR OF INDIVIDUAL PLAN OF EMPLOYMENT ;

*** AND OF APPLICATION TO THE PROGRAM ;


ipe_year = year(ipe_date) ;
app_year = year(app_date);

if ipe_year ne . and ipe_year < 2008 then prev_grant_yrs = "5+" ;
else if ipe_year < 2012 then prev_grant_yrs = "2-4" ;

  else if ipe_year > 2011 then prev_grant_yrs = "0-1";

LABEL prev_grant_yrs = "Years Under Previous Grant"
ipe_year = "Year IPE written"

app_year = "Year applied"

;

The first procedure, I simply wanted to get a closer look at the people who had been getting services for more than five years under the previous grants. It’s important to add that second title line so readers know this isn’t ALL long-term consumers but those who had been long-term users coming into the current grant cycle.

TITLE "Check long-term consumers";

TITLE2 "Getting services 5+ under the previous grant" ;
proc print data= by_year ;
where prev_grant_yrs = "+5";
id username ;
var ipe_date app_year prev_grant_yrs ;
format ipe_date mmddyy8. ;

The second procedure, I wanted to see how consumers served in the current year were doing. Why do I have grantyear as a variable in the VAR statement when it is clear from the WHERE statement that only people from 2018 will be included?  Because the person who gets the output won’t see that WHERE statement. Just having “current year” in the title is not enough because next January someone looking at this might think it was for 2019.  I could have included 2018 in the title, but including it as a variable on the output both acts as a validity check for me and lets the user, my customer, know that the data are correct.

TITLE "Current year consumers" ;
proc print data=by_year ;
where grantyear = 2018 ;
id username ;
var grantyear status status_type ipe_year;
format ipe_year mmddyy8. ;

A few of the individuals served by this project did not have an Individual Plan of Employment. I wanted to see if the people missing an IPE just hadn’t had time to complete it yet or if they never came back and did it. An IPE is the first step in getting project services, so, if they had a missing date for a year or more than they had just dropped out. Again, the second title line tells the users what I’m trying to do here.

TITLE "IPE YEAR by Application Year";
TITLE2 "Note: Missing IPE consumers had ample time to complete IPE";
proc freq data=by_year ;
tables ipe_year*app_year/missing ;

So, you get the idea. Elegant code is nice, correct code is essential.

You know what is essential?

A young person once asked me,

“No offense, but why are your services so much in demand? It’s not as if there aren’t a lot of people who can do what you do.”

Okay, first tip, young people, when you find yourself saying, “no offense” you should probably just stop talking and then you definitely won’t offend anyone. Actually, I was pretty amused. It’s true that lots of people can do frequency distributions, if-then-else statements and cross-tabulations (although, in my defense, that’s not ALL I did on this project).

One essential skill is make your analyses easily understood by your co-workers and customers.

As a wise person once said,

“Mystery novels should be figured out. Code should be read.”


Wonder what else I’m writing these days?

You can get A Different Kind of Textbook, our family group text, for $2.99 as an ebook.  We definitely are a different kind of family.

Contacts : bios of family members

 

Next Page →