Over the weekend, I wrote a post showing how SAS can be used to make what appears to be a complex problem quite simple.

First of all, am I just being dramatic? Seriously, how can having your variable lengths differ be a disaster?

Simple. You are merging by a variable that is a unique user identifier like username, social security number. Because the two different data sets have different lengths, they do not match. If you are computing the number of unique users you may overestimate by a huge amount. If you want the number of people who are in both data sets, you may vastly underestimate the amount of true matches.

As with anything in programming, there are many ways to do this. My solution is to create a new variable and set it to the identical length and format using the ATTRIB statement. Extra bonus is this will work when you have variables that are not only different lengths but different types, say character in one data set and numeric in the other.

You really only need two statements in your data step, an ATTRIB statement and then an assignment statement that sets the value of the variable you created to whatever the variable is you want to merge.

DATA dsname ;
ATTRIB newvar LENGTH = $49 ;
SET mydata2.dsname ;
newvar = oldvar ;
Repeat this step for the second data set and then merge (or concatenate) to  your little heart’s delight.

Angry guard faceThe voice of experience: 
Notice two things here:  I created a temporary data set from my permanent one. Although SAS has gotten more forgiving over the years in not writing over your existing data sets when there is an error, it is still better to err on the side of caution and make sure all is wonderful before saving over that existing data, especially if it took you a lot of effort to get the data in that form.
Second, I created a new variable and kept the old one as is. I don’t always do this but it is good practice. You may be tempted to just use the first 9 digits because we all know social security numbers are 9 digits and then later you find that it was entered as 123-45-6789  and now you only have 123-45-67

—- Feel smarter after reading this blog?

Fish Lake artwork

Want to feel even smarter? Download and play our games!  You can run around in our virtual world while reviewing your basic math skills. If you are too busy (seriously?) you can still give a game as a gift or donate a game to a classroom or school.

Some problems that seem really complex are quite simple when you look at them in the right way. Take this one, for example:

My hypothesis is that a major problem in math achievement is persistence. Students just give up at the first sign of trouble. I have three different data sets with student data from the Spirit Lake game. Many of the students in the student table are the control group, so they will have no data on game play. There is a table of answers to the math challenges and another table with answers to quizzes which students took only if they missed a math challenge. When students miss a math challenge in  the game, depending on which educational resource they choose, they may do one of two or three different quizzes to get back into the game.  Also, some of the quiz records were not from quizzes actually in the game but from supplemental activities we provided. So, how do I identify where in the process students drop out and present in a simple graphic to discuss with schools? Just to complicate matters, the username was different lengths in the different datasets and the variable for timestamp also had different names.

It turns out, the problem was not that difficult.

  1. Merge the student table with the answers (math challenges) and only include those students with at least one answer.
  2. Merge the student table with the quizzes and only include those students with at least one quiz
  3. Concatenate the data sets from steps 1 & 2
  4. Create a new userid variable and set it equal to the username
  5. Create a new “entered” variable and set it equal to whichever of the datetime fields exists on that record
  6. Delete the quizzes not included in the game.
  7. Sort the dataset by userid and the date and time entered.
  8. Keep the last record for each userid. Now you have their last date of activity.
  9. If there is a value for the math challenge field then that is the name of the last activity, otherwise the quiz name is the name for the last activity.
  10. Use a PROC FORMAT to assign each activity a value equal to the step in the game.
  11. Do a PROC FREQ using that format and the order = FORMATTED option.

Once I had the frequencies, I just put them into a table in a word document and shaded the columns to match the percentage. There may be a way in SAS/Graph or something else to do this automatically, but honestly, the table took me two minutes once I had the data.

graph showing students dropping out at each step

I think it illustrates my points pretty clearly, which are:

  • A sizable number of students drop out after the second problem.
  • 25% of the students drop after the first difficulty they have (missing the second problem)
  • Only a minority of students persist all the way to the end, less than 25% of the total sample

This isn’t based on a tiny sample, either. The data above represent a sample of 397 students.

In case you would like to see it, the code for steps 3-11 is below. Particularly useful is the PROC FORMAT. Notice that you can have multiple values have the same format, which was important here because players can take multiple paths that are still the same step in the sequence.

data persist ;
attrib userid length= $49 ;
set mydata2.sl_answers mydata2.sl_quizzes ;
entered = max(date_answered_dt,date_taken_dt) ;
if quiztype in (“problemsolve”,”divide1long”,”multiplyby23″) then delete ;
userid = new_username ;
format entered datetime20. ;
proc sort data=persist ;
by quiztype ;

proc sort data=persist ;
by userid entered ;

data retention ;
set persist ;
by userid ;
if last.userid ;
attrib last_activity length= $14 ;
if inputform ne “” then last_activity = inputform ;
else last_activity = quiztype ;

proc freq data= retention ;
tables last_activity ;

proc format ;
“findcepansi” = “01”
“x2x9” = “02”
“math2x” = “02”
“math2_2” = “02”
“wolves1a” = “02”
“multiplyby5” = “03”
“multiplyby4” = “03”
“multiplyby3” = “04”
“wolves1b” = “05”
…. AND SO ON ….

“horseform2” = “21”
ods rtf file = “C:\Users\Spirit Lake\phaseII\pipeline.rtf” ;
proc freq data= retention order=formatted ;
tables last_activity ;
format last_activity $activity. ;
run ;
ods rtf close ;

—- Feel smarter after reading this blog?
Fish Lake artwork
Want to feel even smarter? Download and play our games!  You can run around in our virtual world while reviewing your basic math skills. If you are too busy (seriously?) you can still give a game as a gift or donate a game to a classroom or school

Many years ago, I was walking through the exhibits at the county fair with my late husband (he was alive then, that’s why he was able to walk with me) and I lamented,

Look at those quilts. My grandmother makes quilts. Look at those crocheted tablecloths. My other grandmother crochets. Look at me – what do I make?

My wonderful husband turned to me and said in his good-old-boy, country accent,

Money. That’s what you make that your grandmothers didn’t make. You make money, darlin’.

Everyone is posting pictures of the cute Halloween costumes their mom made for them or that they made for their children. I never made a Halloween costume in my life, but here is a copy of some code I finished last weekend that makes a graph with different types of pastries. Another function I wrote (not shown here) changes it from Spanish to English. If you get it correct, it takes you to another problem that does bar graphs with actual bars.

I didn’t make a costume but I did make money from working on this project which The Spoiled One can use to buy whatever costume she likes.

graph with pastries

<script type="text/javascript">
    $( window ).load(function() {
        var ncup = 0;
        var nd = 0 ;
        var ncake = 0 ;
        var thisone = 0;
        var sesstries = 0 ;
        document.getElementById("arrow").addEventListener("click", function(){
           if(ncake== 4 & nd ==5 & ncup==7){
               window.location.href="problem5_go_to.html" ;
            else {goFail();}
        document.getElementById("button1").addEventListener("click", function(){
        document.getElementById("button2").addEventListener("click", function(){
           window.location.href ="../learn_more4.html";
        $(function () {
                helper: "clone",
                start: function (event, ui) {
                    thisone = 1;
                revert: function (event, ui) {
                    $(this).data("uiDraggable").originalPosition = {
                        top: 0,
                        left: 0
                    return !event;
                helper: "clone",
                start: function (event, ui) {
                    thisone = 100;
                revert: function (event, ui) {
                    $(this).data("uiDraggable").originalPosition = {
                        top: 0,
                        left: 0
                    return !event;
                helper: "clone",
                start: function (event, ui) {
                    thisone = 1000;
                revert: function (event, ui) {
                    $(this).data("uiDraggable").originalPosition = {
                        top: 0,
                        left: 0
                    return !event;

                drop: function (event, ui) {
                    if (thisone != 1) {goFail();}
                    if (thisone == 1) {
                        if (nd > 5) {

                       // $(this).draggable('disable');
                    else {



                drop: function (event, ui) {

                    if (thisone == 100) {
                        if (ncup > 7) {
                        // $(this).draggable('disable');
                    else {

                drop: function (event, ui) {

                    if (thisone == 1000) {
                        if (ncake > 4) {
                        // $(this).draggable('disable');
                    else {

        function goFail(){

            var prev = sessionStorage.getItem("caketries");

            if (prev != 1 )
                sessionStorage.setItem("caketries", "1") ;
                prev = sessionStorage.getItem("caketries");

        else {
                sessionStorage.setItem("caketries", "0") ;


    }) ;


Do you really need to document everything?

Those who say that there is no such thing as a stupid question might be reconsidering their position right about now. Of course we need to document everything!

Perhaps my reluctance stems from my hatred of technical writers. If you are a tech writer and are a decent human being, the next time we are at the same event, please come up and introduce yourself. I would like to see what you look like. All of the technical writers I ever knew well were  evil, which made me suspicious of the rest of you. I should also note here that if you respond to this by posting hateful comments you will have only reinforced the stereotype of tech writers = evil.

Supposedly, tech writers are hired because of their ability to communicate well, which makes me wonder why they insist on making such insulting comments as:

I translate what the programmers wrote into English so the normal people can understand it.


I take what the engineers say and put it into complete sentences.


Everyone knows that software engineers can’t communicate with other humans.

Hey, tech writers, you do know we’re standing right here and we can hear you, right?

Animosity toward an entire semi-profession aside, I caught myself wondering how much documentation was really necessary. I was looking through some code I had written months before, which, I have to confess had almost no comments in the code and no documentation anywhere outside of the code. Even though I hadn’t looked at it in four months (there was a comment with the date created!), I found it pretty easy to read and got to thinking that perhaps documentation was over-sold by literature majors who couldn’t find jobs.

Then, an uncharacteristic burst of rationality overtook me and I realized that our company is growing. The code was pretty clear to me because I’m fairly familiar with the canvas tag and using canvas for graphics. The program used two libraries with which I’m familiar – jquery and a library for making charts. There were 50 other ways the program was easy for me to read because I wrote it using what was easy and familiar for me. However, we’re a growing company, and as The Invisible Developer reminded me, whether it happens kicking and screaming or I go quietly, the handwriting is on the wall and I’m going to be doing less coding and more CEO’ing.

I know that if HE were to have taken the same program, there would be much swearing going on upstairs right now.

So, yes, documentation appears to be more necessary than originally believed.

Is the solution for me to go through and document everything? Sigh. If only I had infinite time.

What I think might work and be a good idea for a new employee is to have him or her start off with reading some of the code and documenting it. That would be a good way to get familiar with what we are doing before diving into a project. It would also be a good way for us to verify if that person understands what is going on in a particular piece of code.

Or, one might say that was a lazy way for me to get out of writing documentation – if one were a tech writer.


My ISP is currently not allowing me to upload or insert image files. Show your sympathy for me by buying games. The games will make you smarter, amuse you and enable me to afford a better host. Everyone wins!


In assessing whether our Fish Lake game really works to teach fractions, we collect a lot of data, including a pretest and a post-test. We also use a lot of types of items, including a couple of essay questions. Being reasonable people, we are interested in the extent to which the ratings on these items agree.

Lake with fish, divided into quarters

To measure agreement between two raters, we use Kappa’s coefficient. PROC FREQ produces two types of Kappa coefficients. The Kappa coefficient ranges from -1 to 1, with 1 indicating perfect agreement, 1 indicating exactly the agreement that would be expected by chance and negative numbers indicating less agreement than would be expected by chance . When there are only two categories, PROC FREQ produces only the Kappa coefficient. When more than two categories are rated, a weighted Kappa is also produced which credits categories closer together as partial agreement and categories at the extreme ends as no agreement.

The code is really simple:

PROC FREQ DATA =datasetname ;
TABLES variable1*variable2 / PLOTS = KAPPAPLOT;

Including the ODS GRAPHICS ON statement and the PLOTS = KAPPAPLOT option in your TABLES statement will give you a plot of both the agreement and distribution of ratings. Personally, I find the kappa plots, like the example below, to be pretty helpful.

Kappa plot

This visual representation of the agreement shows that there was a large amount of exact agreement (dark blue shading) for incorrect answers, scored 0, with a small percentage partial agreement and very few with no agreement. With 3 categories, only exact agreement or partial agreement is possible for the middle category. Two other take-away points from this plot are that agreement is lower for correct and partially correct answers than incorrect ones and that the distribution is skewed, with a large proportion of answers scored incorrect. Because it is adjusted for chance agreement, Kappa is affected by the distribution among categories . If each rater scores 90% of the answers correct, there should be 81% agreement by chance, thus requiring an extremely high level of agreement to be significantly different from chance. The Kappa plot shows agreement and distribution simultaneously, which is why I like it.


Want to play the game ? You can download it here, as well as our game for younger players, Spirit Lake.

Sometimes, you can just eyeball it.

Really, if something truly is an outlier, you ought to be able to spot it. Take this plot, for example.

plot with 3 large bars and a few outliers

It should be pretty obvious that the vast majority of our sample for the Fish Lake game were students in grades, 4, 5 and 6. Those in the lower grades are clearly exceptions. I don’t know who put 0 as their grade, because I doubt any of our users had no education.

I use these plots especially if I’m explaining why I think certain records should be deleted from a sample. For many people, it seems as if the visual representation makes it clearer that “some of these things don’t belong here.”

Did you know that you can get a plot from PROC FREQ just by adding an option, like so:

PROC FREQ DATA= datasetname ;


This will produce the frequency plot seen above, as well as a table for your frequency distribution.

Well, if you didn’t know, now you know.

Previously, I discussed PROC FREQ for checking the validity of your data. Now we are on to data analysis, but, as anyone who does analysis for more than about 23 minutes can tell you, cleaning your data and doing analysis is seldom a two-step process. In fact, it’s more like a loop of two steps, over and over.

First, we have the basic.

PROC FREQ DATA = mydata.quizzes ;

TABLES passed /binomial ;


(NOTE: If you have a screen reader, click here to read the images below. This is for you, Tina! )

This will give me not only what percentage passed a quiz that they took,

frequency table

but also the 95% confidence limits.

95% confidence limitsThis also gives  a test of the null hypothesis that the population proportion equals the number specified. If, as in this case, I did not specify any hypothesized population value, the default of .50.

Test of Ho: proportion = .50

I didn’t have any real justification for hypothesizing any other population value. What proportion of kids should be able to pass a quiz that is ostensibly at their grade level? Half of them – as in, the “average” kid? All of them, since it’s their grade level? I’m sure there are lots of numbers one could want to test.

If you do have a specific proportion, say, 75%, you’d code it like this:

PROC FREQ DATA =in.quizzes ;
TABLES passed / BINOMIAL (P=.75);

Note that the P= has to be enclosed in parentheses or you’ll get an error.

So, out of the 770 quizzes that were taken by students, only 30.65% of them passed. However, the quizzes aren’t all of equal difficulty, are they? Probably not.

So, my next PROC FREQ is a cross-tabulation of quiz by passed. I don’t need the column percent or percent of total. I just want to know what percent passed or failed each quiz and how many players took that quiz. The way the game is designed, you only need to study and take a quiz if you failed one of the math challenges, so there will be varying numbers of players for each quiz.

PROC FREQ DATA =in.quizzes ;

The first variable will be the row variable and the one after the * will be the column variable. Since I’m only interested in the row percent and N, I included the NOCOL and NOPERCENT options to suppress printing of the column and total percentages.

(For an accessible version for screen readers, click here)


Before I make anything of these statistics, I want to ask myself, what is going on with quiz22 (which actually comes after quiz2) and quiz4? Why did so many students take these two quizzes? I can tell at a glance that it wasn’t a coding error that made it impossible to pass the quiz (my first thought), since over a quarter of the students passed each one.

This leaves me three possibilities:

  1. The problem before the quiz was difficult for students, so many of them ended up taking the quiz (another PROC FREQ)
  2. One of the problems in the quiz was coded incorrectly, so some students failed the quiz when they shouldn’t have,
  3. There was a problem with the server repeatedly sending the data that was not picked up in the previous analyses (another PROC FREQ).

Remember what I said at the beginning about data analysis being a loop? So, back to the top!


If you’d like to see the game used to collect these data, even play the demo yourself, click here.

level up screen from Fish Lake

I’m in the middle of data preparation on a research project on games to teach fractions. This is the part of a data analysis project that takes up 80% of the time. Fortunately, PROC FREQ from SAS can simplify things.

1. How many unique records ?

There are multiple quizzes in the game, and you only end up taking a quiz if you miss one of the problems, so knowing how many unique players my 1,000 or so records represent isn’t as simple as dividing the number of players by X, where X is  a fixed number of quizzes.

PROC FREQ DATA = mydata.quizzes NLEVELS ;

TABLES username ;

Gives me the number of unique usernames. If you were dying to know, in the quizzes file for Fish Lake it was 163.

2. Are there data entry problems?

We had a problem early in the history of the project where, when the internet was down, the local computer would keep trying to send the data to our server, so we would get 112 of the same record once the connection was back up.

Now, it is very likely that a player might have the same quiz recorded more than once. Failing it the first time, he or she would be redirected to study and then have a chance to try again. Still, a player shouldn’t have TOO many of the same quiz. I thought this problem had been fixed, but I wanted to check.

To check if we had the same quiz an excessive number of times, I simply did this :

PROC FREQ DATA= in.quizzes ;
TABLES username*quiztype / OUT=check (WHERE = (COUNT > 10)) ;

This creates an output data set of those usernames that had the same quiz more than 10 times.

There were a few of these problems.  The question then became how to identify and delete those without deleting the real quizzes. This took me to step 3.

3. The LAG function

The LAG function provides the value from the prior observation. Assuming that it would take at least 2 minutes for a quiz, I sorted the data by username, quiz type, number correct and the time. I assumed it would take a minimum of 120 seconds for even the fastest student to complete a study activity and complete a test for the second time. Using the code below, I was able to delete all duplicate quizzes that occurred due to dropped internet connections.

proc sort data = check4;
by username quiztype numcorrect date_time ;

data check5 ;
set check4 ;
lagu = lag(username) ;
lagq = lag(quiztype) ;
lagn = lag(numcorrect) ;
lagd = lag(dt) ;
if lagu = username & lagq = quiztype & lagn = numcorrect then ddiff = dt – lagd ;
if ddiff ne . & ddiff < 120 then delete ;
run ;

Having finished off my data cleaning in record time, I’m now ready to do more PROC FREQ ‘ ing for data analysis – tomorrow.

(Actually, being 12:22 am, I guess it is technically tomorrow now.)


If you’d like to see the game that we are analyzing, you can download a free demo here



I’m the world’s biggest hypocrite when it comes to documentation. In every staff meeting, I emphasize documenting whatever code you have written that week, but I always put off doing it myself.  My excuse is always that it isn’t final and when I get the complete project done I will put it in the wiki.

I’ve come to the conclusion that no complex software is ever done. You just quit working on it and go to something else.

If you have run into the awful problem of having your animation run on some browsers and not others, or even run sometimes and sometimes not in the same browser, you may have a timing problem. In brief, you are trying to draw the image before it is loaded.

Here is what I did today and how I fixed that problem…

In this part of the game (Forgotten Trail) , the player has answered a question that asks for the average number of miles the uncle walked each day when he attempted this journey.  If the player gives the correct answer, say, 22 miles, a screen pops up and the mother asks, “Do you really think you can walk 22 miles a day?” If the player says, “No,” then he or she is sent to workout. Each correct answer runs your character 5 miles. After 20 miles, you can go back to the game.

Sam is running

So …. we need animation and sound that occurs when a correct answer is submitted. I had finished the code to randomly generate division problems, so today I was working on the function after the player is correct.

The HTML elements are pretty simple – a div that contains everything, two layers of canvas, a button and two audio elements.


<div id="container">
<canvas id="layer1" style="z-index: 1; position:absolute; left:0px; top:5px;"
 height="400px" width="900"> This text is displayed if your browser does not support HTML5 Canvas. 
<canvas id="layer2" style="z-index: 1; position:absolute;left:0px;top:5px;" 
height="400px" width="900"> This text is displayed if your browser does not support HTML5 Canvas.
 <button id="workout" name="workout" >CLICK TO WORK OUT</button> </div> 
<audio autoplay id="audio1"><source type="audio/ogg"></audio> 
<audio autoplay id="audio2"><source type="audio/mpeg"></audio>

Because the character is only moving horizontally – he is running on a field – there is no dy and no y variable.


<script type="text/javascript">
    var layer1;
    var layer2;
    var ctx1;
    var ctx2;
    var animationFrames =[
        "game_art/sam1.png", "game_art/sam4.png"];
    var frameIndex = 0 ;
    var dx = 0 ;
    var x = 150;
    var width = 900;
    var height = 600;

    var player = new Image();
    var bkgd = new Image() ;


    function drawAll() {



    function draw1() {
        ctx1.clearRect(0, 0, width, height);
        ctx1.drawImage(bkgd, 0, 0);

    function draw2() {
        dx = 10;
        x = x + dx;
        ctx2.clearRect(0, 0, width, height);
         player.src = animationFrames[1];
        ctx2.drawImage(player, x,320);
        frameIndex++ ;
        if (frameIndex == 1) {
            ctx2.clearRect(0, 0, width, height);
            player.src = animationFrames[0];
            player.addEventListener('load', drawImage);
        else  {
            ctx2.clearRect(0, 0, width, height);
            player.src = animationFrames[1];
            player.addEventListener('load', drawImage);
            frameIndex = 0 ;


    function drawImage(){
        ctx2.drawImage(player, x,320);
   window.onload =function(){
        bkgd.src ="game_art/fields.png";
        layer1 = document.getElementById("layer1");
        ctx1 = layer1.getContext("2d");
        layer2 = document.getElementById("layer2");
        ctx2 = layer2.getContext("2d");

        var interval = setInterval(function(){
        }, 100);
        document.getElementById("workout").onclick = function drawnow(){
            var startTime = new Date().getTime();
            var interval = setInterval(function(){
                if(new Date().getTime() - startTime > 9000){
                    document.getElementById("audio1").src = "";
                    document.getElementById("audio2").src = "";
            }, 100);

            document.getElementById("audio1").src = "sounds/footsteps_run.ogg";
            document.getElementById("audio2").src = "sounds/footsteps_run.mp3";



Sometimes, you can know too much programming for your own good. Yesterday, I was working on analyzing a data set and can I just say here that SAS’s inability to handle mixed-type arrays is the bane of my existence. (In case you don’t know, if you mix character and numeric variable types in an array, SAS will give you an error. If you know an easy way around this, you will be my new best friend.)

I started out doing all kinds of fancy things, using

ARRAY answersc{*} _char_ ;

to get all of the character variables in an array and the DIM function to give the dimension of an array.

There were various reasons why creating new variables that were character using the PUT function or numeric using the INPUT function was a bad idea.

It occurred to me that I was making the whole process unnecessarily complicated. I had a relatively small data set with just a few variables that needed to be changed to character. So, I opened the original CSV file in SAS Enterprise Guide by selecting IMPORT DATA and picking COMMA from the pull-down menu for how fields are delimited.

opening in SAS EG

Next, for each of the variables, I changed the name, label and type to what I wanted it to be.

change properties

If you’re one of those people who just click, “NEXT” over and over when you are importing your data you may not be aware that you can change almost everything in those field attributes.  To change the data in your to-be-created SAS data set, click in the box labeled TYPE. Change it from number to string, as shown below. Now you have character variables.
change hereNice! Now I have my variables all changed to character.

One more minor change to make my life easier.

change length

We had some spam in our file, with spambots answering the online pretest and resulting in an input format and output format length of 600. Didn’t I just say that you can change almost anything in those field attributes? Why, yes, yes I did. Click in the box on that variable and a window will pop up that allows you to change the length.

That’s it. Done!

Which left me time to start on the data analysis that you can read about here.

Next Page →