Did you ever fill out one of those online forms where you kept trying to submit it and got messages like,

“You need to enter your phone number in the format 311-234-12234”


“You cannot have any special characters in this field.”

That one really irritates me because, in fact, my last name has a space in it and many websites refuse to accept it. Take it up with The Invisible Developer, or his ancestors.

Have you ever just said the hell with it, and skipped filling out the form? Preventing users from entering anything but the expected data type saves problems when you analyze your data, but it can also cause people to give up on your stupid web form.

So … when I created the pretest for Forgotten Trail and Aztech, I made it accept just about anything. If you wanted to write in 6, six, 9R6, 6 left over — any and all of those would be accepted and recorded.

You can get the first two games we developed here.

background for hidden pictures game

Forgotten Trail and Aztech are in beta and will not be commercially available for another couple of months.

What now? I have to score that test, but I’d rather the difficulty be on me than 150 or so middle school students who are our first test group.

So… how to fix it, with SAS character functions. Here is me, scoring the first half of the test:

First, I read the data into a new data set because I want to preserve the original data and not write over it. I may want to look at the exact incorrect answers later.

I create a character array of all 32 items on the test, and then I use a DO loop to change all of the answers to upper case.


Data in.recode ;
set in.pretestGMS ;
array qs{32} $ q1-q27 q28a q28b q28c q29 q30 ;
do i = 1 to 32 ;
qs{i} = upcase(qs{i}) ;
end ;

Now, on to the questions. I eventually need all of these items to be scored 1 = correct, 0 = incorrect.

q1 is a question about money. People put all kinds of wrong answers – $35, $40, as well as the correct answer, 100 and $100. I used the COMPRESS function to remove the ‘$’, then set q1 to equal 1 if the answer was 100, and 0 otherwise.
q1 = compress(q1,"$") ;
if q1 = 100 then q1 = 1 ;
else q1 = 0 ;

The second use of the COMPRESS function removes blanks. If you don’t put any second parameter in the compress function, it removes every blank, including the ones between words, not just the trailing ones. In q2, the answer was 4 but the students put “four”, “four frogs”, “4/14” and so on. All of these are correct. You can have a list in an IF statement and if the variable matches any of the values in the list, then do something, in this case, set the answer as correct.
q2 = compress(q2) ;
if q2 in ("4","FOUR","FOURFROGS","4/14","4OUTOF","4FROGS") then q2 = 1;
else q2 = 0 ;

*** How to keep only numeric data using a simple SAS function (take that all you regular expression fetishists!)

The third use of the compress function KEEPS the characters that are the second parameter, because I added an optional third parameter of “k”, to KEEP the characters in the second parameter instead of discarding them. So, this keeps numbers and deletes everything else from the answer. If it is 150, it is scored correct, otherwise, it’s wrong.
if compress(q5,"0123456789","k") = 150 then q5 = 1;
else q5 = 0 ;
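
If you want to try all three uses of COMPRESS without touching real student data, here is a minimal sketch. The data set names (demo, scored), the score variables (s1, s2, s5) and the answers themselves are made up for illustration. Unlike the program above, it writes the 0/1 scores to new numeric variables instead of overwriting the answers, and it compares character values to character constants.

```sas
/* Hypothetical answers, made up for illustration */
data demo ;
   length q1 q2 q5 $ 20 ;
   infile datalines dsd ;      /* comma-delimited, so answers can contain blanks */
   input q1 $ q2 $ q5 $ ;
   datalines ;
$100,four frogs,150 miles
$35,4/14,about 15
;
run ;

data scored ;
   set demo ;
   /* 1. Second argument lists the characters to remove: strip the dollar sign */
   s1 = (compress(q1, "$") = "100") ;
   /* 2. No second argument: remove every blank, then compare in upper case */
   s2 = (upcase(compress(q2)) in ("4", "FOUR", "FOURFROGS", "4/14")) ;
   /* 3. The "k" modifier keeps only the listed characters, here the digits */
   s5 = (compress(q5, "0123456789", "k") = "150") ;
run ;

proc print data=scored ;
run ;
```

A comparison in parentheses evaluates to 1 or 0 in SAS, so the first made-up row should score 1 on all three items and the second row should score 1 only on q2.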


A lot of the items were similar, so that is half of scoring the test. I’ll try to write up the rest from the airport  tomorrow, but for now, I need to write a couple of emails, finish this scoring program and pack before 2 am, and that only gives me about 40 minutes.

On Twitter, there were a few comments from people who said they didn’t like to take interns because “More than doing work, they want to watch me work.”

I see both sides of that. You’re busy. You’re not Netflix. I get it. On the other hand, that’s a good way to learn.

The data are part of the evaluation of the effectiveness of 7 Generation Games in increasing math achievement scores. You can read more about us here. Below is a sneak peek of the artwork from the level we are creating now for Forgotten Trail.

characters from Forgotten Trail in Maine

So, here you go. I’m starting on a data analysis project today and I thought I’d give you the blow by blow.


It just so happens that the first several steps are all point-y and click-y. You could do it other ways but this is how I did it today. So, step one, I went to phpMyAdmin on the server where the data were saved and clicked Export.


Step 2: For the format to export, I selected CSV and then clicked on the Go button. Now I have it downloaded to my desktop.

import data

Step 3: I opened SAS Enterprise Guide and selected Import Data.  I could have done this with SAS and written code to read the file, but, well, I didn’t. Nope, no particular reason, just this isn’t a very big data set so I thought, what the heck, why not.

boxes to check in import data menu

Step 4: DO NOT ACCEPT THE DEFAULTS! Since I have a comma-delimited file with no field names, I need to uncheck the box that says File contains field names on record number. SAS conveniently shows you the data below so I can see that it is comma-delimited. I know I selected CSV but it’s always good practice to check. I can also see that the data starts at the first record, so I want to change that value in Data records start at record number to 1.

changing names

Step 5: Change the names  – I am never going to remember what F1, F2 etc. are, so for the first 5 , I click on the row and edit the names to be the name and label I want.

That’s it. Now I click the next button on the bottom of the screen until SAS imports my data.
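
For the record, here is roughly what the code route might look like. This is a sketch, not what was actually run for this project: the file path and data set name are placeholders. Also note that PROC IMPORT names unlabeled columns VAR1, VAR2 and so on, rather than the F1, F2 names the Enterprise Guide wizard shows, so any later code would need to refer to those names instead.

```sas
/* Hypothetical path: point it at wherever the CSV was downloaded */
proc import datafile="C:\Users\you\Desktop\pretest.csv"
            out=work.pretest
            dbms=csv
            replace ;
   getnames=no ;         /* the file has no field names on the first record */
   datarow=1 ;           /* data records start at record number 1 */
   guessingrows=1000 ;   /* look at more rows before choosing types and lengths */
run ;
```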

I could have continued changing all of the variable names, because I KNOW down the line I am not going to remember that F6 is actually the first question or that F25 is question 28a. However, I wanted to do some other things that I thought would be easier to code, so I opened up a program file in SAS Enterprise Guide and wrote some code.





data pretest2 ;
    set pretest ;
    /* copy the imported columns F6-F37 into meaningfully named items */
    array ren{32} $ f6-f37 ;
    array qs{32} $ q1-q27 q28a q28b q28c q29 q30 ;
    do i = 1 to 32 ;
        qs{i} = ren{i} ;
    end ;
    rename f38 = date_test ;
    drop f6-f37 i ;
run ;

proc sort data=pretest2 ;
    by username date_test ;
run ;

/* keep each user's most recent record and drop the test accounts */
data pretest2 ;
    set pretest2 ;
    by username date_test ;
    if last.username ;
    if index(username,'TEST') > 0 then delete ;
run ;


Okay, that’s it. Now I have my data all ready to analyze. Pretty painless, isn’t it?

Want to learn more about SAS?

Here is a good paper on Arrays made easy .

If you’re interested in character functions like index, here is a good paper by Ron Cody.

Even though Rick Wicklin (buzzkill!) disabused me of the concern that SAS was communicating with aliens through the obscure coding in its sashelp data sets, I still wanted to roll my own.

If you, too, feel more comfortable with a data set you have produced yourself, let me give you a few tips.

  • There is a wealth of data available online, much of it thanks to your federal government. For example, I downloaded the 2014 birth data from the Centers for Disease Control and Prevention site. They have a lot of interesting public use data sets.
  • Read the documentation! The nice thing about the CDC data sets is that they include a codebook. This particular one is 183 pages long. Not exciting reading, I know, but I’m sitting here right now watching a documentary where some guy is up to his elbows in an elephant’s placenta (I’m seriously not making this up) and that doesn’t look like a bowl of cherries, either.
  • Assuming you did not read all of the documentation even though I told you it was important, (since I have raised four daughters and know all about people not paying attention to my very sound advice), at a MINIMUM you need to read three things; 1) sampling information to find out if there are any problems with selection bias, any sampling weights you need to know about, 2) the record layout – otherwise how the hell are you going to read in the data, and 3) the coding scheme.

Let’s take a look at the code to create the data set I want to use in examples for my class. Uncompressed, the 2014 birth data set is over 5 GB, which exceeds the limit for upload to SAS On-Demand for Academics. It also isn’t the best idea for a class data set for a bunch of reasons, a big one being that I teach students distributed around the world and on ships at sea (for real), and having them access a 5 GB data set isn’t really feasible.


I’m going to assume you downloaded the file into your downloads folder and then unzipped it.


Since I READ the codebook, not being a big hypocrite, I saw that the record length is 775 and there are nearly 4 million records in the data set, so opening it up in SAS Enterprise Guide or the Explorer window didn’t seem a good plan. My first step, then, was to use a FILENAME statement to refer to the data I downloaded, data that happens to be a text file.

I just want to take a look at the first 10 records to see that it is what it claims to be in the codebook. (No, I never DO trust anyone.)

The default length for a SAS variable is 8.

I’m going to give the variable a length of 775 characters.

Notice that the INFILE statement has to refer back to the reference used in the FILENAME statement, which I very uncreatively named “in” . Naming datasets, variables and file references is not the place for creativity. I once worked with a guy who named all of his data sets and variables after cartoon characters – until all of the other programmers got together and killed him.

Dead programmers aside, pay attention to that OBS=10 unless you really want to look at 3,998,175 records. The OBS =10 option will limit the number of records read to – you guessed it – 10.

With the INPUT statement, I read in from position 1-775 in the text file.

All of this just allows me to look at the first 10 records without having to open a file of almost 4 million records.

FILENAME in "C:\Users\you\Downloads\Nat2014us\Nat2014PublicUS.c20150514.r20151022.txt" ;

DATA example;
LENGTH var1 $775;
INFILE in OBS= 10 ;
INPUT var1 $ 1-775;
RUN;
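
Another way to peek, sketched here as an alternative rather than what the post actually did: read each record without creating any variables and let the LIST statement write it to the log along with a column ruler, which makes it easier to check positions against the codebook’s record layout. This assumes the FILENAME statement above has already run. LRECL=775 is included only to be explicit about the record length.

```sas
data _null_ ;                     /* no output data set is needed */
   infile in obs=10 lrecl=775 ;   /* same fileref, still only 10 records */
   input ;                        /* load the record into the input buffer */
   list ;                         /* write the raw record to the log with a ruler */
run ;
```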


Being satisfied with my first look, I went ahead to create a permanent data set. I needed a LIBNAME statement to specify where I wanted the data permanently stored.

The out in the LIBNAME and DATA statements need to match. It could be anything, it could be komodo or kangaroo, as long as it’s the same word in both places. So … my birth2014 data set will be stored in the directory specified.

How do I know what columns hold the birth year, birth month, etc. ? Why, I read the codebook (look for “record layout”).


LIBNAME out "C:\Users\me\mydir\" ;
DATA  out.birth2014 ;
INFILE in OBS= 10000 ;
INPUT birthyr 9-12 bmonth 13-14 dob_tt 19-22 dob_wk 23 bfacil 32
mager 75-76 mrace6 107 mhisp_r 115 mar_p $ 119 dmar 120
meduc 124 bmi 283-286 cig1_R 262 cig2_R 263 cig3_r 264 dbwt 504-507 bwtr4 511;


STEP 3: RECODE MISSING DATA FIELDS. Think how much it would screw up your results if you did not recode 9999 for birthweight in grams, which means not that the child weighed about 22 pounds at birth but that the birthweight was missing. In every one of the variables below, the “maximum” value was actually the flag for missing data. How did I know this? You guessed it, I read the codebook. NOTE: these statements below are included in the data step.
IF bmi > 99 THEN bmi = . ;
if cig1_r = 6 then cig1_r = . ;
if cig2_r = 6 then cig2_r = . ;
if cig3_r = 6 then cig3_r = . ;
if dbwt = 9999 then dbwt = . ;
if bwtr4 = 4 then bwtr4 = . ;

STEP 4: LABEL THE VARIABLES – Six months from now, you’ll never remember what dmar is.

NOTE: these statements below are also included in the data step.
LABEL mager = "Mom/age"
bfacil = "Birth/facility" mrace6 = "Mom/race" mhisp_r = "Mom/Hispanic"
dmar = "Marital/Status" meduc = "Mom/Educ"
cig1_r ="Smoke/Tri1" cig2_r ="Smoke/Tri2" cig3_r ="Smoke/Tri3";

So, that’s it. Now I have a data set with 10,000 records and the 17 variables read by the INPUT statement above that I can upload for my students to analyze.
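
A quick sanity check on the recodes, offered as a sketch that assumes the data step above ran and the out libref is still assigned: if the missing-data flags were recoded, none of the maximums should be one of the flag values any more, and the NMISS column shows how many records were flagged for each variable.

```sas
proc means data=out.birth2014 n nmiss min max ;
   var bmi cig1_r cig2_r cig3_r dbwt bwtr4 ;   /* the six recoded variables */
run ;
```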

I was going to write about prevalence and incidence, and how so-called simple statistics can be much more important than people think and are vastly under-rated. It was going to be cool. Trust me.

In the process, I ran across two things even more important or cooler (I know, hard to believe, right?)

Julia and company find this hard to believe

The Spoiled One & Co. find this hard to believe also

Here’s what happened … I thought I would use the sashelp library that comes with SAS On-Demand for Academics, and just about every other flavor of SAS, for examples of difference in prevalence. Since no documentation of all the data sets showed up on the first two pages of Google, and one is prohibited by international law from looking any further, I decided to use SAS to see something about the data sets.

Herein was revealed cool thing #1 – I know about the tasks in SAS Studio but I never really do much with these. However, since I’m teaching epidemiology this spring, I thought it would be good to take a look. You should do this any time you have a data set. I don’t care if it is your own or if it was handed down to you by God on Mount Sinai.

Moses with tablets

Okay, I totally take that back. If your data was handed down to you by God on Mount Sinai, you can skip this step, but only then.

At this point, Buddhists and Muslims are saying, “What the fuck?” and Christians are saying, “She just said, ‘fuck’! Right after a Biblical reference to Moses, too!”

This is why this blog should have some adult supervision.

But I digress. Even more than usual.

KNOW YOUR DATA! I don’t mean in the Biblical sense, because I’m not sure how that is even possible, but in the statistical sense. This is the important thing. I don’t care how amazingly impressive the statistical analyses are that you do, if you don’t know your data intimately (there’s that Biblical thing again) you may as well make your decisions by randomly throwing darts at a dartboard. I once told some people at the National Institutes of Health that’s how I arrived at the p-value for a test. For the record, the Men in Black have more of a sense of humor about these things than the NIH.

Ahem, so … if you are using SAS Studio, here is a lovely annotated image of what you are going to do.


1. Click in the left window pane where it says TASKS on the arrow to bring up a drop down menu


2. Click on the arrow next to Data and then select the Characterize Data task. (You might say that was 2 AND 3 if you were a smart ass and who asked you, anyway?)

3. Click the arrow next to the word DATA in the window pane second from the left and it will give you a choice of the available directories. (NOTE: If you are going to use directories not provided by SAS you’ll need a LIBNAME statement in an earlier step but we’re not and you don’t in this particular example.) Under the directory, you will be able to select your file, in this case, I want to look at birthweight.




4. Next to the word VARIABLES, click the + and it will show the variables in the data set. You can hold down the shift key and select more than one. You should do this for all of the variables of interest. In my case, I selected all of the variables – there aren’t many in this dataset.


5. To run your program, click the little running guy at the top of the window. This will give you – ta- da



Let’s notice something here – the mother’s age ranges from -9 (seriously? What’s that all about?) to 18. Is this a study of teenage mothers or what? The answer seems to be “what” because the mean age is .416. Say, what? The mother’s educational level ranges from 0 to 3, which probably refers to something but I’ll bet it’s not years of education.
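
If you would rather type than click, a couple of procedures give a roughly similar overview. This is only an approximation of what the task reports, not the code the Characterize Data task generates.

```sas
/* Variable names, types and lengths */
proc contents data=sashelp.bweight ;
run ;

/* With no VAR statement, PROC MEANS summarizes every numeric variable */
proc means data=sashelp.bweight n nmiss mean min max ;
run ;
```

A minimum of -9 or a mean of .416 for age jumps out of the PROC MEANS table just as it does in the task output.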


In a class of 50 students, inevitably, one or two will turn in a paper saying,

“The average mother had 1.22 years of education.”

WHAT? Are you even listening to yourself? Those students will defend themselves by saying that is what “SAS” told them.

According to the SAS documentation, these data are from the 1997 records of the National Center for Health Statistics.

I ran the following code:

proc print data=sashelp.bweight (obs=5) ;
run ;

And either it’s the same data set or there was an amazing coincidence that all of the data in the first five records was the same.

However, because I really need to get a hobby, I went and found the documentation for the Natality Data set from 1997 and it did not match up with the coding here. This led me to conclude that either:

a. SAS Institute re-coded these data for purposes of their own, probably instructional,

b. This is actually some other 1997 birthweight data set from NCHS, in which case, those people have far too much time on their hands.

c. SAS is probably using secretly coded messages in these data to communicate with aliens.

Julia as fat alien

Not being willing to chance it, I went to the NCHS site and downloaded the 2014 birth statistics data set and codebook so I could make my own example data.

So … what have we learned today ?

  1. The TASKS in SAS Studio can be used to analyze your data with no programming required.
  2. It is crucial to understand how your data are coded before you go making stupid statements like the average mother is 5 months old.
  3. You can download birth statistics data and other cool stuff from the National Center for Health Statistics site.
  4. The Spoiled One uses any phone not protected by National Security Council level of encryption for photos of herself.


Want to make your children as smart as you?

Player needing help

Get them 7 Generation Games. Learn math. Learn social studies. Learn not to fall into holes.

Runs on Mac and Windows for under ten bucks.



Over the weekend, I wrote a post showing how SAS can be used to make what appears to be a complex problem quite simple.

First of all, am I just being dramatic? Seriously, how can having your variable lengths differ be a disaster?

Simple. You are merging by a variable that is a unique user identifier, like username or social security number. Because the variable has different lengths in the two data sets, values that should match do not. If you are computing the number of unique users, you may overestimate by a huge amount. If you want the number of people who are in both data sets, you may vastly underestimate the number of true matches.

As with anything in programming, there are many ways to do this. My solution is to create a new variable and set it to the identical length and format using the ATTRIB statement. Extra bonus is this will work when you have variables that are not only different lengths but different types, say character in one data set and numeric in the other.

You really only need two statements in your data step, an ATTRIB statement and then an assignment statement that sets the value of the variable you created to whatever the variable is you want to merge.

DATA dsname ;
ATTRIB newvar LENGTH = $49 ;
SET mydata2.dsname ;
newvar = oldvar ;
RUN ;

Repeat this step for the second data set and then merge (or concatenate) to your little heart’s delight.
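
Here is the whole pattern end to end as a small sketch. Everything in it is made up for illustration: the data sets a and b, their usernames, and the score and grade variables. The point is only that both copies of the identifier get the same length before the merge.

```sas
/* Two toy data sets whose username variables have different lengths */
data a ;
   length username $ 10 ;
   input username $ score ;
   datalines ;
bob 1
alice 2
;
data b ;
   length username $ 30 ;
   input username $ grade ;
   datalines ;
alice 5
carol 6
;

/* Give both the same new identifier, declared before the SET statement */
data a2 ;
   attrib userid length = $49 ;
   set a ;
   userid = username ;
run ;
data b2 ;
   attrib userid length = $49 ;
   set b ;
   userid = username ;
run ;

proc sort data=a2 ; by userid ; run ;
proc sort data=b2 ; by userid ; run ;

/* Keep only the users found in both data sets */
data both ;
   merge a2 (in=ina drop=username) b2 (in=inb drop=username) ;
   by userid ;
   if ina and inb ;
run ;
```

In this toy example only alice appears in both, so both should end up with a single record. One caution if the old variable is numeric: automatic conversion right-aligns the value, so it is probably worth left-aligning it yourself, for example with newvar = left(put(oldvar, 12.)), and checking the notes in the log.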

The voice of experience:
Notice two things here: First, I created a temporary data set from my permanent one. Although SAS has gotten more forgiving over the years in not writing over your existing data sets when there is an error, it is still better to err on the side of caution and make sure all is wonderful before saving over that existing data, especially if it took you a lot of effort to get the data in that form.
Second, I created a new variable and kept the old one as is. I don’t always do this but it is good practice. You may be tempted to just use the first 9 digits because we all know social security numbers are 9 digits and then later you find that it was entered as 123-45-6789 and now you only have 123-45-67.

—- Feel smarter after reading this blog?

Fish Lake artwork

Want to feel even smarter? Download and play our games!  You can run around in our virtual world while reviewing your basic math skills. If you are too busy (seriously?) you can still give a game as a gift or donate a game to a classroom or school.

Some problems that seem really complex are quite simple when you look at them in the right way. Take this one, for example:

My hypothesis is that a major problem in math achievement is persistence. Students just give up at the first sign of trouble. I have three different data sets with student data from the Spirit Lake game. Many of the students in the student table are the control group, so they will have no data on game play. There is a table of answers to the math challenges and another table with answers to quizzes, which students took only if they missed a math challenge. When students miss a math challenge in the game, depending on which educational resource they choose, they may do one of two or three different quizzes to get back into the game. Also, some of the quiz records were not from quizzes actually in the game but from supplemental activities we provided. So, how do I identify where in the process students drop out and present it in a simple graphic to discuss with schools? Just to complicate matters, the username was different lengths in the different datasets and the variable for timestamp also had different names.

It turns out, the problem was not that difficult.

  1. Merge the student table with the answers (math challenges) and only include those students with at least one answer.
  2. Merge the student table with the quizzes and only include those students with at least one quiz
  3. Concatenate the data sets from steps 1 & 2
  4. Create a new userid variable and set it equal to the username
  5. Create a new “entered” variable and set it equal to whichever of the datetime fields exists on that record
  6. Delete the quizzes not included in the game.
  7. Sort the dataset by userid and the date and time entered.
  8. Keep the last record for each userid. Now you have their last date of activity.
  9. If there is a value for the math challenge field then that is the name of the last activity, otherwise the quiz name is the name for the last activity.
  10. Use a PROC FORMAT to assign each activity a value equal to the step in the game.
  11. Do a PROC FREQ using that format and the order = FORMATTED option.
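
Steps 1 and 2 are the only ones not covered by the code further down, so here is a sketch of step 1. The table names students and answers and the BY variable username are assumptions for illustration (the post only shows the resulting sl_answers and sl_quizzes tables). Step 2 would follow the same pattern with the quizzes table.

```sas
/* Hypothetical source tables: mydata2.students and mydata2.answers */
proc sort data=mydata2.students out=students ;
   by username ;
run ;
proc sort data=mydata2.answers out=answers ;
   by username ;
run ;

data sl_answers ;
   merge students (in=instudent) answers (in=hasanswer) ;
   by username ;
   /* keep a record only when the student also has at least one answer, */
   /* which drops control-group students with no game play              */
   if instudent and hasanswer ;
run ;
```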

Once I had the frequencies, I just put them into a table in a Word document and shaded the columns to match the percentage. There may be a way in SAS/Graph or something else to do this automatically, but honestly, the table took me two minutes once I had the data.

graph showing students dropping out at each step

I think it illustrates my points pretty clearly, which are:

  • A sizable number of students drop out after the second problem.
  • 25% of the students drop after the first difficulty they have (missing the second problem)
  • Only a minority of students persist all the way to the end, less than 25% of the total sample

This isn’t based on a tiny sample, either. The data above represent a sample of 397 students.

In case you would like to see it, the code for steps 3-11 is below. Particularly useful is the PROC FORMAT. Notice that you can have multiple values have the same format, which was important here because players can take multiple paths that are still the same step in the sequence.

data persist ;
attrib userid length= $49 ;
set mydata2.sl_answers mydata2.sl_quizzes ;
entered = max(date_answered_dt,date_taken_dt) ;
if quiztype in ("problemsolve","divide1long","multiplyby23") then delete ;
userid = new_username ;
format entered datetime20. ;
run ;

proc sort data=persist ;
by quiztype ;
run ;

proc sort data=persist ;
by userid entered ;
run ;

data retention ;
set persist ;
by userid ;
if last.userid ;
attrib last_activity length= $14 ;
if inputform ne "" then last_activity = inputform ;
else last_activity = quiztype ;
run ;

proc freq data= retention ;
tables last_activity ;
run ;

proc format ;
value $activity
"findcepansi" = "01"
"x2x9" = "02"
"math2x" = "02"
"math2_2" = "02"
"wolves1a" = "02"
"multiplyby5" = "03"
"multiplyby4" = "03"
"multiplyby3" = "04"
"wolves1b" = "05"
/* .... AND SO ON .... */
"horseform2" = "21" ;
run ;

ods rtf file = "C:\Users\Spirit Lake\phaseII\pipeline.rtf" ;
proc freq data= retention order=formatted ;
tables last_activity ;
format last_activity $activity. ;
run ;
ods rtf close ;
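
To see the many-values-to-one-format trick in isolation, here is a toy sketch. The format name $step, the activity codes (a1, a2, b1) and the data set demo are all made up. PROC FREQ counts by the formatted value, so the two step-one activities are combined into a single row.

```sas
proc format ;
   value $step
      "a1", "a2" = "01"    /* two different activities, same step in the game */
      "b1"       = "02" ;
run ;

data demo ;
   input last_activity $ ;
   datalines ;
b1
a2
a1
a1
;
run ;

proc freq data=demo order=formatted ;
   tables last_activity ;
   format last_activity $step. ;
run ;
```

The table should show step 01 with a frequency of 3 and step 02 with a frequency of 1, listed in step order instead of the order the raw codes would sort in.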


Many years ago, I was walking through the exhibits at the county fair with my late husband (he was alive then, that’s why he was able to walk with me) and I lamented,

Look at those quilts. My grandmother makes quilts. Look at those crocheted tablecloths. My other grandmother crochets. Look at me – what do I make?

My wonderful husband turned to me and said in his good-old-boy, country accent,

Money. That’s what you make that your grandmothers didn’t make. You make money, darlin’.

Everyone is posting pictures of the cute Halloween costumes their mom made for them or that they made for their children. I never made a Halloween costume in my life, but here is a copy of some code I finished last weekend that makes a graph with different types of pastries. Another function I wrote (not shown here) changes it from Spanish to English. If you get it correct, it takes you to another problem that does bar graphs with actual bars.

I didn’t make a costume but I did make money from working on this project which The Spoiled One can use to buy whatever costume she likes.

graph with pastries

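<script type="text/javascript">
    // NOTE: parts of this listing were lost when the post was archived.
    // The selectors (#drag1, #drop1, ...) and every line marked "assumed"
    // or "lost" are placeholders, not the original code.
    $( window ).load(function() {
        var ncup = 0;
        var nd = 0 ;
        var ncake = 0 ;
        var thisone = 0;
        var sesstries = 0 ;
        document.getElementById("arrow").addEventListener("click", function(){
           if(ncake == 4 && nd == 5 && ncup == 7){
               window.location.href="problem5_go_to.html" ;
           }
           else {goFail();}
        });
        document.getElementById("button1").addEventListener("click", function(){
            // handler body lost
        });
        document.getElementById("button2").addEventListener("click", function(){
           window.location.href ="../learn_more4.html";
        });
        $(function () {
            $("#drag1").draggable({             // selector assumed
                helper: "clone",
                start: function (event, ui) {
                    thisone = 1;
                },
                revert: function (event, ui) {
                    $(this).data("uiDraggable").originalPosition = {
                        top: 0,
                        left: 0
                    };
                    return !event;
                }
            });
            $("#drag2").draggable({             // selector assumed
                helper: "clone",
                start: function (event, ui) {
                    thisone = 100;
                },
                revert: function (event, ui) {
                    $(this).data("uiDraggable").originalPosition = {
                        top: 0,
                        left: 0
                    };
                    return !event;
                }
            });
            $("#drag3").draggable({             // selector assumed
                helper: "clone",
                start: function (event, ui) {
                    thisone = 1000;
                },
                revert: function (event, ui) {
                    $(this).data("uiDraggable").originalPosition = {
                        top: 0,
                        left: 0
                    };
                    return !event;
                }
            });

            $("#drop1").droppable({             // selector assumed
                drop: function (event, ui) {
                    if (thisone != 1) {goFail();}
                    if (thisone == 1) {
                        nd++ ;                  // assumed: the increment was lost
                        if (nd > 5) {
                            // body lost
                            // $(this).draggable('disable');
                        }
                        else {
                            // body lost
                        }
                    }
                }
            });
            $("#drop2").droppable({             // selector assumed
                drop: function (event, ui) {
                    if (thisone == 100) {
                        ncup++ ;                // assumed: the increment was lost
                        if (ncup > 7) {
                            // body lost
                            // $(this).draggable('disable');
                        }
                        else {
                            // body lost
                        }
                    }
                }
            });
            $("#drop3").droppable({             // selector assumed
                drop: function (event, ui) {
                    if (thisone == 1000) {
                        ncake++ ;               // assumed: the increment was lost
                        if (ncake > 4) {
                            // body lost
                            // $(this).draggable('disable');
                        }
                        else {
                            // body lost
                        }
                    }
                }
            });
        });

        // Remember in sessionStorage whether this is the first or second miss
        function goFail(){
            var prev = sessionStorage.getItem("caketries");
            if (prev != 1 ) {
                sessionStorage.setItem("caketries", "1") ;
                prev = sessionStorage.getItem("caketries");
                // rest of this branch lost
            }
            else {
                sessionStorage.setItem("caketries", "0") ;
                // rest of this branch lost
            }
        }
    }) ;
</script>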


Do you really need to document everything?

Those who say that there is no such thing as a stupid question might be reconsidering their position right about now. Of course we need to document everything!

Perhaps my reluctance stems from my hatred of technical writers. If you are a tech writer and are a decent human being, the next time we are at the same event, please come up and introduce yourself. I would like to see what you look like. All of the technical writers I ever knew well were  evil, which made me suspicious of the rest of you. I should also note here that if you respond to this by posting hateful comments you will have only reinforced the stereotype of tech writers = evil.

Supposedly, tech writers are hired because of their ability to communicate well, which makes me wonder why they insist on making such insulting comments as:

I translate what the programmers wrote into English so the normal people can understand it.


I take what the engineers say and put it into complete sentences.


Everyone knows that software engineers can’t communicate with other humans.

Hey, tech writers, you do know we’re standing right here and we can hear you, right?

Animosity toward an entire semi-profession aside, I caught myself wondering how much documentation was really necessary. I was looking through some code I had written months before, which, I have to confess, had almost no comments in the code and no documentation anywhere outside of the code. Even though I hadn’t looked at it in four months (there was a comment with the date created!), I found it pretty easy to read and got to thinking that perhaps documentation was over-sold by literature majors who couldn’t find jobs.

Then, an uncharacteristic burst of rationality overtook me and I realized that our company is growing. The code was pretty clear to me because I’m fairly familiar with the canvas tag and using canvas for graphics. The program used two libraries with which I’m familiar – jQuery and a library for making charts. There were 50 other ways the program was easy for me to read because I wrote it using what was easy and familiar for me. However, we’re a growing company, and as The Invisible Developer reminded me, whether it happens kicking and screaming or I go quietly, the handwriting is on the wall and I’m going to be doing less coding and more CEO’ing.

I know that if HE were to have taken the same program, there would be much swearing going on upstairs right now.

So, yes, documentation appears to be more necessary than originally believed.

Is the solution for me to go through and document everything? Sigh. If only I had infinite time.

What I think might work and be a good idea for a new employee is to have him or her start off with reading some of the code and documenting it. That would be a good way to get familiar with what we are doing before diving into a project. It would also be a good way for us to verify if that person understands what is going on in a particular piece of code.

Or, one might say that was a lazy way for me to get out of writing documentation – if one were a tech writer.


My ISP is currently not allowing me to upload or insert image files. Show your sympathy for me by buying games. The games will make you smarter, amuse you and enable me to afford a better host. Everyone wins!


In assessing whether our Fish Lake game really works to teach fractions, we collect a lot of data, including a pretest and a post-test. We also use a lot of types of items, including a couple of essay questions. Being reasonable people, we are interested in the extent to which the ratings on these items agree.

Lake with fish, divided into quarters

To measure agreement between two raters, we use Cohen’s Kappa coefficient. PROC FREQ produces two types of Kappa coefficients. The Kappa coefficient ranges from -1 to 1, with 1 indicating perfect agreement, 0 indicating exactly the agreement that would be expected by chance and negative numbers indicating less agreement than would be expected by chance. When there are only two categories, PROC FREQ produces only the simple Kappa coefficient. When more than two categories are rated, a weighted Kappa is also produced, which credits categories closer together as partial agreement and categories at the extreme ends as no agreement.

The code is really simple:

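ODS GRAPHICS ON ;
PROC FREQ DATA =datasetname ;
TABLES variable1*variable2 / AGREE PLOTS = AGREEPLOT ;
RUN ;

Including the ODS GRAPHICS ON statement and the AGREE and PLOTS = AGREEPLOT options in your TABLES statement will give you the Kappa statistics plus a plot of both the agreement and distribution of ratings. (AGREE is what requests Kappa; PLOTS = KAPPAPLOT is meant for stratified, multiway tables, not a single two-way table like this one.) Personally, I find these plots, like the example below, to be pretty helpful.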

Kappa plot

This visual representation of the agreement shows that there was a large amount of exact agreement (dark blue shading) for incorrect answers, scored 0, with a small percentage partial agreement and very few with no agreement. With 3 categories, only exact agreement or partial agreement is possible for the middle category. Two other take-away points from this plot are that agreement is lower for correct and partially correct answers than incorrect ones and that the distribution is skewed, with a large proportion of answers scored incorrect. Because it is adjusted for chance agreement, Kappa is affected by the distribution among categories. If each rater scores 90% of the answers correct, the two raters would both score an answer correct 81% of the time by chance alone (82% agreement once you add the 1% where both score it incorrect by chance), thus requiring an extremely high level of agreement to be significantly different from chance. The plot shows agreement and distribution simultaneously, which is why I like it.
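
To make the chance correction concrete, here is the arithmetic for that two-category example, assuming the two raters score independently of each other. The chance agreement is 0.81 (both score correct) plus 0.01 (both score incorrect). The observed agreement of 0.91 is a made-up number for illustration.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
p_e = (0.9)(0.9) + (0.1)(0.1) = 0.82, \qquad
\kappa = \frac{0.91 - 0.82}{1 - 0.82} = 0.5
```

So with raters this lopsided, even 91% raw agreement is only halfway between chance and perfect agreement.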


Want to play the game ? You can download it here, as well as our game for younger players, Spirit Lake.

Sometimes, you can just eyeball it.

Really, if something truly is an outlier, you ought to be able to spot it. Take this plot, for example.

plot with 3 large bars and a few outliers

It should be pretty obvious that the vast majority of our sample for the Fish Lake game were students in grades 4, 5 and 6. Those in the lower grades are clearly exceptions. I don’t know who put 0 as their grade, because I doubt any of our users had no education.

I use these plots especially if I’m explaining why I think certain records should be deleted from a sample. For many people, it seems as if the visual representation makes it clearer that “some of these things don’t belong here.”

Did you know that you can get a plot from PROC FREQ just by adding an option, like so:

ODS GRAPHICS ON ;
PROC FREQ DATA= datasetname ;
TABLES variable1 / PLOTS = FREQPLOT ;
RUN ;

This will produce the frequency plot seen above, as well as a table for your frequency distribution.

Well, if you didn’t know, now you know.
