In assessing whether our Fish Lake game really works to teach fractions, we collect a lot of data, including a pretest and a post-test. We also use a lot of types of items, including a couple of essay questions. Being reasonable people, we are interested in the extent to which the ratings on these items agree.

Lake with fish, divided into quarters

To measure agreement between two raters, we use the Kappa coefficient (Cohen’s kappa). PROC FREQ can produce two types of Kappa coefficients, simple and weighted. The Kappa coefficient ranges from -1 to 1, with 1 indicating perfect agreement, 0 indicating exactly the agreement that would be expected by chance and negative numbers indicating less agreement than would be expected by chance. When there are only two categories, PROC FREQ produces only the simple Kappa coefficient. When more than two categories are rated, a weighted Kappa is also produced, which credits ratings in adjacent categories as partial agreement and ratings at opposite extremes as no agreement.
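If you’d like to see what is going on under the hood, here is a minimal sketch of the arithmetic in JavaScript (not SAS, and nothing PROC FREQ requires you to write). The counts in the table are made up for illustration, the function name kappa is my own, and the weights are the linear ones (1 minus the distance between two categories divided by the largest possible distance), which as far as I know match the PROC FREQ default.

```javascript
// Hypothetical counts: rows are rater 1, columns are rater 2,
// categories are 0 = incorrect, 1 = partially correct, 2 = correct.
const ratings = [
  [50, 5, 1],
  [6, 15, 4],
  [2, 5, 12],
];

// Simple or (linearly) weighted kappa for a square table of counts.
function kappa(counts, weighted = false) {
  const k = counts.length;
  const rowTotals = counts.map(row => row.reduce((a, b) => a + b, 0));
  const colTotals = counts[0].map((_, j) => counts.reduce((a, row) => a + row[j], 0));
  const n = rowTotals.reduce((a, b) => a + b, 0);
  let observed = 0; // agreement actually seen
  let expected = 0; // agreement expected by chance from the margins
  for (let i = 0; i < k; i++) {
    for (let j = 0; j < k; j++) {
      const exact = i === j ? 1 : 0;
      // Weighted: full credit on the diagonal, partial credit for near misses.
      const w = weighted ? 1 - Math.abs(i - j) / (k - 1) : exact;
      observed += (w * counts[i][j]) / n;
      expected += (w * rowTotals[i] * colTotals[j]) / (n * n);
    }
  }
  return (observed - expected) / (1 - expected);
}

console.log("simple kappa:", kappa(ratings).toFixed(3));
console.log("weighted kappa:", kappa(ratings, true).toFixed(3));
```

With only two categories the weight for a disagreement is zero, so the weighted and simple versions come out the same, which is why only one Kappa is reported in that case.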

The code is really simple:

ODS GRAPHICS ON;
PROC FREQ DATA = datasetname ;
/* AGREE requests the Kappa statistics that TEST AGREE then tests */
TABLES variable1*variable2 / AGREE PLOTS = KAPPAPLOT;
TEST AGREE ;
RUN ;

Including the ODS GRAPHICS ON statement and the PLOTS = KAPPAPLOT option in your TABLES statement will give you a plot of both the agreement and distribution of ratings. Personally, I find the kappa plots, like the example below, to be pretty helpful.

Kappa plot

This visual representation of the agreement shows that there was a large amount of exact agreement (dark blue shading) for incorrect answers, scored 0, with a small percentage partial agreement and very few with no agreement. With 3 categories, only exact agreement or partial agreement is possible for the middle category. Two other take-away points from this plot are that agreement is lower for correct and partially correct answers than incorrect ones and that the distribution is skewed, with a large proportion of answers scored incorrect. Because it is adjusted for chance agreement, Kappa is affected by the distribution among categories. If each rater scores 90% of the answers correct and 10% incorrect, there should be 82% agreement by chance (81% where both raters say correct plus 1% where both say incorrect), thus requiring an extremely high level of agreement to be significantly different from chance. The Kappa plot shows agreement and distribution simultaneously, which is why I like it.

———

Want to play the game? You can download it here, as well as our game for younger players, Spirit Lake.

Sometimes, you can just eyeball it.

Really, if something truly is an outlier, you ought to be able to spot it. Take this plot, for example.

plot with 3 large bars and a few outliers

It should be pretty obvious that the vast majority of our sample for the Fish Lake game were students in grades 4, 5 and 6. Those in the lower grades are clearly exceptions. I don’t know who put 0 as their grade, because I doubt any of our users had no education.

I use these plots especially if I’m explaining why I think certain records should be deleted from a sample. For many people, it seems as if the visual representation makes it clearer that “some of these things don’t belong here.”

Did you know that you can get a plot from PROC FREQ just by adding an option, like so:

PROC FREQ DATA= datasetname ;

TABLES variable / PLOTS=FREQPLOT ;

This will produce the frequency plot seen above, as well as a table for your frequency distribution.

Well, if you didn’t know, now you know.

Previously, I discussed PROC FREQ for checking the validity of your data. Now we are on to data analysis, but, as anyone who does analysis for more than about 23 minutes can tell you, cleaning your data and doing analysis is seldom a two-step process. In fact, it’s more like a loop of two steps, over and over.

First, we have the basic.

PROC FREQ DATA = mydata.quizzes ;

TABLES passed /binomial ;

RUN;

(NOTE: If you have a screen reader, click here to read the images below. This is for you, Tina! )

This will give me not only what percentage passed a quiz that they took,

frequency table

but also the 95% confidence limits.

95% confidence limits

This also gives a test of the null hypothesis that the population proportion equals the number specified. If, as in this case, I did not specify any hypothesized population value, the default of .50 is used.

Test of Ho: proportion = .50

I didn’t have any real justification for hypothesizing any other population value. What proportion of kids should be able to pass a quiz that is ostensibly at their grade level? Half of them – as in, the “average” kid? All of them, since it’s their grade level? I’m sure there are lots of numbers one could want to test.

If you do have a specific proportion, say, 75%, you’d code it like this:

PROC FREQ DATA =in.quizzes ;
TABLES passed / BINOMIAL (P=.75);

Note that the P= has to be enclosed in parentheses or you’ll get an error.
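For the curious, here is roughly the arithmetic behind that output, sketched in JavaScript rather than SAS. It assumes the large-sample (Wald) confidence limits and a z test that uses the standard error under the null hypothesis; PROC FREQ also reports exact limits, which this sketch skips. The count of 236 passes is my inference from the 30.65% of 770 quizzes reported for this data set, and the function name is made up.

```javascript
// Large-sample summary for a single proportion.
// successes and n are counts; p0 is the hypothesized population proportion.
function binomialSummary(successes, n, p0 = 0.5) {
  const p = successes / n;
  const z95 = 1.959964; // standard normal quantile for 95% limits
  const se = Math.sqrt((p * (1 - p)) / n); // standard error at the observed proportion
  const seNull = Math.sqrt((p0 * (1 - p0)) / n); // standard error if the null hypothesis were true
  return {
    proportion: p,
    lower: p - z95 * se,
    upper: p + z95 * se,
    z: (p - p0) / seNull,
  };
}

// 236 of 770 is inferred from the reported 30.65%, not read off the SAS output.
console.log(binomialSummary(236, 770)); // tested against the default of .50
console.log(binomialSummary(236, 770, 0.75)); // tested against P=.75
```

A large negative z here simply says the observed pass rate is far below the hypothesized proportion, which is no surprise when 30.65% is compared with 50% or 75%.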

So, out of the 770 quizzes that were taken by students, only 30.65% of them passed. However, the quizzes aren’t all of equal difficulty, are they? Probably not.

So, my next PROC FREQ is a cross-tabulation of quiz by passed. I don’t need the column percent or percent of total. I just want to know what percent passed or failed each quiz and how many players took that quiz. The way the game is designed, you only need to study and take a quiz if you failed one of the math challenges, so there will be varying numbers of players for each quiz.

PROC FREQ DATA =in.quizzes ;
TABLES quiz*passed /NOCOL NOPERCENT ;

The first variable will be the row variable and the one after the * will be the column variable. Since I’m only interested in the row percent and N, I included the NOCOL and NOPERCENT options to suppress printing of the column and total percentages.

(For an accessible version for screen readers, click here)

Cross-tabulation

Before I make anything of these statistics, I want to ask myself, what is going on with quiz22 (which actually comes after quiz2) and quiz4? Why did so many students take these two quizzes? I can tell at a glance that it wasn’t a coding error that made it impossible to pass the quiz (my first thought), since over a quarter of the students passed each one.

This leaves me three possibilities:

  1. The problem before the quiz was difficult for students, so many of them ended up taking the quiz (another PROC FREQ)
  2. One of the problems in the quiz was coded incorrectly, so some students failed the quiz when they shouldn’t have,
  3. There was a problem with the server repeatedly sending the data that was not picked up in the previous analyses (another PROC FREQ).

Remember what I said at the beginning about data analysis being a loop? So, back to the top!

————–

If you’d like to see the game used to collect these data, even play the demo yourself, click here.

level up screen from Fish Lake

I’m in the middle of data preparation on a research project on games to teach fractions. This is the part of a data analysis project that takes up 80% of the time. Fortunately, PROC FREQ from SAS can simplify things.

1. How many unique records ?

There are multiple quizzes in the game, and you only end up taking a quiz if you miss one of the problems, so knowing how many unique players my 1,000 or so records represent isn’t as simple as dividing the number of records by X, where X is a fixed number of quizzes.

PROC FREQ DATA = mydata.quizzes NLEVELS ;

TABLES username ;

Gives me the number of unique usernames. If you were dying to know, in the quizzes file for Fish Lake it was 163.

2. Are there data entry problems?

We had a problem early in the history of the project where, when the internet was down, the local computer would keep trying to send the data to our server, so we would get 112 of the same record once the connection was back up.

Now, it is very likely that a player might have the same quiz recorded more than once. Failing it the first time, he or she would be redirected to study and then have a chance to try again. Still, a player shouldn’t have TOO many of the same quiz. I thought this problem had been fixed, but I wanted to check.

To check if we had the same quiz an excessive number of times, I simply did this :

PROC FREQ DATA= in.quizzes ;
TABLES username*quiztype / OUT=check (WHERE = (COUNT > 10)) ;

This creates an output data set of those usernames that had the same quiz more than 10 times.
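If it helps to see the logic of those two checks outside of SAS, here is a small JavaScript sketch of the same idea: count the distinct usernames, then count records for each username and quiz combination and keep only the combinations above a cutoff. The record layout, the sample data and the function name are all hypothetical.

```javascript
// Count unique players and flag username/quiz combinations that occur too often.
function summarizeRecords(records, threshold = 10) {
  const users = new Set(records.map(r => r.username));
  const counts = new Map();
  for (const r of records) {
    const key = JSON.stringify([r.username, r.quiztype]);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  const flagged = [...counts]
    .filter(([, count]) => count > threshold)
    .map(([key, count]) => {
      const [username, quiztype] = JSON.parse(key);
      return { username, quiztype, count };
    });
  return { uniqueUsers: users.size, flagged };
}

// Made-up data: one player with 12 copies of the same quiz, one with 2.
const sample = [
  ...Array(12).fill({ username: "player1", quiztype: "quiz4" }),
  ...Array(2).fill({ username: "player2", quiztype: "quiz2" }),
];
console.log(summarizeRecords(sample));
```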

There were a few of these problems.  The question then became how to identify and delete those without deleting the real quizzes. This took me to step 3.

3. The LAG function

The LAG function provides the value from the prior observation. I sorted the data by username, quiz type, number correct and time, so that repeated copies of the same record end up on consecutive rows. I assumed it would take a minimum of 120 seconds for even the fastest student to complete a study activity and complete a test for the second time, so an identical record arriving less than 120 seconds after the previous one must be a duplicate. Using the code below, I was able to delete all duplicate quizzes that occurred due to dropped internet connections.

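proc sort data = check4;
by username quiztype numcorrect date_time ;
run ;

data check5 ;
set check4 ;
/* dt is assumed to hold the same timestamp as date_time, as a numeric SAS datetime in seconds */
lagu = lag(username) ;
lagq = lag(quiztype) ;
lagn = lag(numcorrect) ;
lagd = lag(dt) ;
/* seconds since the previous record, only when it is the same player, quiz and score */
if lagu = username & lagq = quiztype & lagn = numcorrect then ddiff = dt - lagd ;
if ddiff ne . & ddiff < 120 then delete ;
run ;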
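Here is the same sort-then-compare-with-the-previous-row idea as a JavaScript sketch, in case the LAG logic is easier to follow that way. The field names mirror the SAS variables, dt is assumed to be a time in seconds, and the sample records are invented. Like LAG, it compares each row with the row immediately before it in sorted order, whether or not that row was itself dropped.

```javascript
// Compare two values for sorting (works for strings and numbers).
function cmp(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Drop records that repeat the previous record's player, quiz and score
// within minSeconds of it.
function dropRapidRepeats(records, minSeconds = 120) {
  const sorted = [...records].sort((a, b) =>
    cmp(a.username, b.username) ||
    cmp(a.quiztype, b.quiztype) ||
    cmp(a.numcorrect, b.numcorrect) ||
    cmp(a.dt, b.dt));
  return sorted.filter((rec, i) => {
    if (i === 0) return true; // nothing to compare the first row with
    const prev = sorted[i - 1];
    const sameQuiz = prev.username === rec.username &&
      prev.quiztype === rec.quiztype &&
      prev.numcorrect === rec.numcorrect;
    return !(sameQuiz && rec.dt - prev.dt < minSeconds);
  });
}

// Invented example: three copies sent within seconds, then a real retake.
const quizLog = [
  { username: "alice", quiztype: "quiz1", numcorrect: 3, dt: 0 },
  { username: "alice", quiztype: "quiz1", numcorrect: 3, dt: 5 },
  { username: "alice", quiztype: "quiz1", numcorrect: 3, dt: 10 },
  { username: "alice", quiztype: "quiz1", numcorrect: 3, dt: 400 },
  { username: "bob", quiztype: "quiz1", numcorrect: 3, dt: 2 },
];
console.log(dropRapidRepeats(quizLog));
```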

Having finished off my data cleaning in record time, I’m now ready to do more PROC FREQ’ing for data analysis – tomorrow.

(Actually, being 12:22 am, I guess it is technically tomorrow now.)

—————–

If you’d like to see the game that we are analyzing, you can download a free demo here

catfish

 

I’m the world’s biggest hypocrite when it comes to documentation. In every staff meeting, I emphasize documenting whatever code you have written that week, but I always put off doing it myself.  My excuse is always that it isn’t final and when I get the complete project done I will put it in the wiki.

I’ve come to the conclusion that no complex software is ever done. You just quit working on it and go to something else.

If you have run into the awful problem of having your animation run on some browsers and not others, or even run sometimes and sometimes not in the same browser, you may have a timing problem. In brief, you are trying to draw the image before it is loaded.

Here is what I did today and how I fixed that problem…

In this part of the game (Forgotten Trail), the player has answered a question that asks for the average number of miles the uncle walked each day when he attempted this journey. If the player gives the correct answer, say, 22 miles, a screen pops up and the mother asks, “Do you really think you can walk 22 miles a day?” If the player says, “No,” then he or she is sent to work out. Each correct answer runs your character 5 miles. After 20 miles, you can go back to the game.

Sam is running

So …. we need animation and sound that occurs when a correct answer is submitted. I had finished the code to randomly generate division problems, so today I was working on the function after the player is correct.

The HTML elements are pretty simple – a div that contains everything, two layers of canvas, a button and two audio elements.

HTML

<div id="container">
<canvas id="layer1" style="z-index: 1; position:absolute; left:0px; top:5px;"
 height="400" width="900"> This text is displayed if your browser does not support HTML5 Canvas.
</canvas>
<canvas id="layer2" style="z-index: 1; position:absolute; left:0px; top:5px;"
 height="400" width="900"> This text is displayed if your browser does not support HTML5 Canvas.
</canvas>
<button id="workout" name="workout">CLICK TO WORK OUT</button>
<audio autoplay id="audio1"><source type="audio/ogg"></audio>
<audio autoplay id="audio2"><source type="audio/mpeg"></audio>
</div>

Because the character is only moving horizontally – he is running on a field – there is no dy and no y variable.

JAVASCRIPT


<script type="text/javascript">
    // INITIALIZE VARIABLES FOR CANVAS CONTEXT, LAYERS, HEIGHT & WIDTH ;
    // ALSO INITIALIZE THE PLAYER ELEMENTS WE'LL DRAW;
    // BECAUSE THE PLAYER IS RUNNING, I LEFT OUT THE MIDDLE TWO FRAMES FOR THE SPRITE ;
    var layer1;
    var layer2;
    var ctx1;
    var ctx2;
    var animationFrames =[
        "game_art/sam1.png", "game_art/sam4.png"];
    var frameIndex = 0 ;
    var dx = 0 ;
    var x = 150;
    var width = 900;
    var height = 600;

    var player = new Image();
    var bkgd = new Image() ;


// DRAWS THE TWO LAYERS ;

    function drawAll() {
        draw1();
        draw2();

    }

// DRAWS THE BACKGROUND LAYER ;

    function draw1() {
        ctx1.clearRect(0, 0, width, height);
        ctx1.drawImage(bkgd, 0, 0);
    }

// SOLVES THE PROBLEM THAT WAS DRIVING ME CRAZY BY ADDING AN EVENT LISTENER ;
// NOW, WE DON'T TRY TO DRAW THE IMAGE UNTIL IT IS LOADED ;
    function draw2() {
        dx = 10;
        x = x + dx;
        ctx2.clearRect(0, 0, width, height);
        // ALTERNATE BETWEEN THE TWO FRAMES OF THE SPRITE ;
        frameIndex++ ;
        if (frameIndex == 1) {
     // THE STATEMENT BELOW RESTORED MY SANITY ;
     // THE LISTENER IS ATTACHED BEFORE src IS SET SO THE load EVENT IS NEVER MISSED ;
            player.addEventListener('load', drawImage);
            player.src = animationFrames[0];
        }
        else  {
            player.addEventListener('load', drawImage);
            player.src = animationFrames[1];
            frameIndex = 0 ;
        }
    }

// ONCE THE IMAGE IS LOADED, THIS FUNCTION IS CALLED ;

    function drawImage(){
        ctx2.drawImage(player, x,320);
    }
 
// WE ONLY SET THE BACKGROUND SOURCE ONCE, WHEN THE WINDOW LOADS ;
// 
   window.onload =function(){
        bkgd.src ="game_art/fields.png";
        layer1 = document.getElementById("layer1");
        ctx1 = layer1.getContext("2d");
        layer2 = document.getElementById("layer2");
        ctx2 = layer2.getContext("2d");

// EVERY 100 MILLISECONDS WE DRAW 
        var interval = setInterval(function(){
            draw1();
        }, 100);
        document.getElementById("workout").onclick = function drawnow(){
            var startTime = new Date().getTime();
            var interval = setInterval(function(){
                if(new Date().getTime() - startTime > 9000){
                    document.getElementById("audio1").src = "";
                    document.getElementById("audio2").src = "";
                    clearInterval(interval);
//   LATER I WILL ADD HERE TO STOP AFTER RUNNING 1/4 THE DISTANCE;
                }
                drawAll();
            }, 100);

// THIS MAKES THE SOUND OF RUNNING WHILE THEY ARE RUNNING ;
            document.getElementById("audio1").src = "sounds/footsteps_run.ogg";
            document.getElementById("audio2").src = "sounds/footsteps_run.mp3";
        }

    }
</script>
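Another way to attack the same timing problem, which I have not used in the game itself, is to load every frame once up front and only start animating when all of them are ready. The sketch below is an assumption-laden illustration, not the code from Forgotten Trail: preloadImages and its createImage parameter are names I made up, and the parameter exists only so the function can be tried outside a browser.

```javascript
// Load every image in paths and resolve once all of them have finished loading.
// createImage defaults to the browser's Image constructor.
function preloadImages(paths, createImage = () => new Image()) {
  return Promise.all(paths.map(path => new Promise((resolve, reject) => {
    const img = createImage();
    // Handlers are attached before src is set so the load event cannot be missed.
    img.onload = () => resolve(img);
    img.onerror = () => reject(new Error("Could not load " + path));
    img.src = path;
  })));
}

// In a browser, usage might look like this:
// preloadImages(["game_art/sam1.png", "game_art/sam4.png"]).then(function (frames) {
//     // frames[0] and frames[1] are fully loaded, so drawImage is safe here
//     ctx2.drawImage(frames[0], x, 320);
// });
```

The trade-off is a short wait before the first frame, in exchange for never drawing an image that has not arrived yet.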

 

Sometimes, you can know too much programming for your own good. Yesterday, I was working on analyzing a data set and can I just say here that SAS’s inability to handle mixed-type arrays is the bane of my existence. (In case you don’t know, if you mix character and numeric variable types in an array, SAS will give you an error. If you know an easy way around this, you will be my new best friend.)

I started out doing all kinds of fancy things, using

ARRAY answersc{*} _char_ ;

to get all of the character variables in an array and the DIM function to give the dimension of an array.

There were various reasons why creating new variables that were character using the PUT function or numeric using the INPUT function was a bad idea.

It occurred to me that I was making the whole process unnecessarily complicated. I had a relatively small data set with just a few variables that needed to be changed to character. So, I opened the original CSV file in SAS Enterprise Guide by selecting IMPORT DATA and picking COMMA from the pull-down menu for how fields are delimited.

opening in SAS EG

Next, for each of the variables, I changed the name, label and type to what I wanted it to be.

change properties

If you’re one of those people who just click “NEXT” over and over when you are importing your data, you may not be aware that you can change almost everything in those field attributes. To change the data type in your to-be-created SAS data set, click in the box labeled TYPE. Change it from number to string, as shown below. Now you have character variables.

change here

Nice! Now I have my variables all changed to character.

One more minor change to make my life easier.

change length

We had some spam in our file, with spambots answering the online pretest and resulting in an input format and output format length of 600. Didn’t I just say that you can change almost anything in those field attributes? Why, yes, yes I did. Click in the box on that variable and a window will pop up that allows you to change the length.

That’s it. Done!

Which left me time to start on the data analysis that you can read about here.

I’m preparing a data set for analysis and since the data are scored by SAS I am double-checking to make sure that I coded it correctly. One check is to select out an item and compare the percentage who answered correctly with the mean score for that item. These should be equal since items are scored 0=wrong, 1=correct.

When I look at the output for my PROC MEANS it says that 31% of the respondents answered this item correctly, that is, mean = .310.

However, the correct answer is D and when I look at the results from my PROC FREQ it shows that 35% of the respondents gave the correct answer, ‘D’.

What is going on here? Is my program to score the tests off somewhere? Will I need to score all of these tests by hand?

Real hand soaps

I am sure those of you who are SAS gurus thought of the answer already (and if you didn’t, you’re going to be slapping your head when you read the simple solution).

By default, PROC FREQ gives you percentages based only on the non-missing records. Since many students who did not know the answer to the question left it blank, they were (rightfully) given a zero when the test was automatically scored. To get your FREQ and MEANS results to match, use the MISSING option, like so:

PROC FREQ DATA =in.score ;
TABLES  item1 / MISSING ;

You will find that 31% of the total (including those who skipped the question) got the answer right.
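To make the difference between the two denominators concrete, here is a tiny JavaScript sketch. The counts are invented so that the percentages land near the 35% and 31% above; they are not the real item data, and the function name is hypothetical.

```javascript
// Invented answers to one item: 31 chose D (the key), 57 chose something else,
// and 12 left it blank (null).
const answers = [
  ...Array(31).fill("D"),
  ...Array(57).fill("A"),
  ...Array(12).fill(null),
];

// Percent answering with the key, with blanks either dropped or counted.
function percentCorrect(responses, key, includeMissing) {
  const pool = includeMissing ? responses : responses.filter(r => r !== null);
  const right = pool.filter(r => r === key).length;
  return (100 * right) / pool.length;
}

console.log(percentCorrect(answers, "D", false).toFixed(1)); // blanks dropped from the denominator
console.log(percentCorrect(answers, "D", true).toFixed(1)); // blanks counted, as in the scored mean
```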

Sometimes it’s the simplest things that give you pause.

Do you have a bunch of sites bookmarked with articles you are going to go back and read later? It’s not just me, is it?

One of my (many) favorite things at SAS Global Forum this year was the app. It included a function for emailing links to papers you found interesting. Perhaps the theory is that you would email these links to your friends to rub it in that their employer did not like them well enough to send them. I emailed links to myself to read when I had time. Having finally caught up on coding, email and meetings, today I had a bit of time.

I was reading a paper by Lisa Henley

A Genetic Algorithm for Data Reduction.

It’s a really cool and relatively new concept – from the 1970s – compared to 1900 for the Pearson chi-square, for example.

In brief, here is the idea. You have a large number of independent variables.  How do you select the best subset? One way to do it is to let the variables fight it out in a form of natural selection.

Let’s say you have 40 variables. Each “chromosome” will have 40 “alleles” that will randomly be coded as 0 or 1, either included in the equation or not.

You compute the equation with these variables included or not and assess each equation based on a criterion, say, Akaike Information Criterion or the Root Mean Square Error.

You can select the “winning” chromosome/equation head to head, keeping whichever has the better criterion value (for both AIC and RMSE, lower is better). There are other methods of determination, like giving those with a better criterion value a higher probability of staying.

You do this repeatedly until you have your winning equation. Okay, this is a bit of a simplification but you should get the general idea. I included the link above so you could check out the paper for yourself.
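To make the idea a little less abstract, here is a toy sketch in JavaScript. To be clear, this is my own simplified illustration and not the implementation from the paper: the head-to-head tournament, the single crossover point, the mutation rate and keeping the current champion each generation are all choices I made for the example. The score function stands in for AIC or RMSE (lower is better); here it is a made-up rule that pretends only the first five of 40 variables are useful.

```javascript
// Stand-in for AIC/RMSE: penalize useful variables left out (first 5)
// and, more mildly, useless variables left in. Lower is better.
function toyScore(chromosome) {
  return chromosome.reduce((total, bit, i) =>
    total + (i < 5 ? (bit ? 0 : 10) : (bit ? 1 : 0)), 0);
}

function selectVariables(nVars, score, options = {}) {
  const { popSize = 30, generations = 40, mutationRate = 0.02, rng = Math.random } = options;
  const randomChromosome = () =>
    Array.from({ length: nVars }, () => (rng() < 0.5 ? 1 : 0));
  const best = pop => pop.reduce((x, y) => (score(y) < score(x) ? y : x));
  // Head to head: pick two at random, the one with the lower score wins.
  const tournament = pop => {
    const a = pop[Math.floor(rng() * pop.length)];
    const b = pop[Math.floor(rng() * pop.length)];
    return score(a) <= score(b) ? a : b;
  };

  let population = Array.from({ length: popSize }, randomChromosome);
  const history = []; // best score in each generation
  for (let g = 0; g < generations; g++) {
    const champion = best(population);
    history.push(score(champion));
    const next = [champion]; // the current champion always survives
    while (next.length < popSize) {
      const mom = tournament(population);
      const dad = tournament(population);
      const cut = Math.floor(rng() * nVars); // single crossover point
      const child = mom.slice(0, cut).concat(dad.slice(cut))
        .map(bit => (rng() < mutationRate ? 1 - bit : bit)); // occasional flip
      next.push(child);
    }
    population = next;
  }
  const winner = best(population);
  return { winner, score: score(winner), history };
}

const result = selectVariables(40, toyScore);
console.log("variables kept:", result.winner.join(""));
console.log("score:", result.score);
```

In a real analysis the score function would fit the model with the selected variables and return its AIC or RMSE, which is where nearly all of the computing time would go.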

Then, while I was standing there reading the paper, the ever-brilliant David Pasta walked by and mentioned the name of another paper on use of Genetic Algorithm for Model Selection that was presented  at the Western Users of SAS Software conference a couple of years back.

I don’t have any immediate use for GA in the projects I’m working on at this moment. However, I can’t even begin to count the number of techniques I’ve learned over the years that I had no immediate use for and then two weeks later turned out to be exactly what I needed.

Even though I knew the Genetic Algorithm existed,  I wasn’t as familiar with its use in model selection.

You’ll never use what you don’t know – which is a really strong argument for learning as much as you can in your field, whatever it might be.

We’ve looked at data on Body Mass Index (BMI) by race. Now let’s take a look at our sample another way. Instead of using BMI as a variable, let’s use obesity as a dichotomous variable, defined as a BMI greater than 30. It just so happened (really) that this variable was already in the data set so I didn’t even need to create it.

The code is super-simple and shown below. The reserved SAS keywords are capitalized just to make it easier to spot what must remain the same.  Let’s look at this line by line

LIBNAME mydata "/courses/some123/c_1234/" ACCESS=READONLY;
PROC FREQ DATA = mydata.coh602 ;
TABLES race*obese / CHISQ ;
WHERE race NE "" ;
RUN ;

LIBNAME mydata "/courses/some123/c_1234/" ACCESS=READONLY;

Identifies the directory where the data for your course are stored. As a student, you only have read access.

PROC FREQ DATA = mydata.coh602 ;

Begins the frequency procedure, using the data set in the directory linked with mydata in the previous statement.

TABLES race*obese / CHISQ ;

Creates a cross-tabulation of race by obesity. The CHISQ option following the slash produces the second table you see below, with chi-square and other statistics that test the hypothesis of a relationship between two categorical variables.

WHERE race NE "" ;

Only selects those observations where we have a value for race (where race is not equal to missing).

RUN ;

Pretty obvious? Runs the program.

Cross-tabulation of race by obesity

 

Similar to our ANOVA results previously, we see that the obesity rates for black and Hispanic samples are similar at 35% and 38% while the proportion of the white population that is obese is 25%. These numbers are the percentage for each row. As is standard practice, a 0 for obesity means no, the respondent is not obese and a 1 means yes, the person is obese.

The CHISQ option produces the table below. The first three statistics are all tests of statistical significance of the relationship between the two variables.

Table with chi-square statistics

You can see from this that there is a statistically significant relationship between race and obesity. Another way to phrase this might be that the distribution of obesity is not the same across races.

The next three statistics give you the size of the relationship. A value of 1.0 denotes a perfect relationship (be suspicious if you find that, it’s more often you coded something wrong than that everyone of one race is different from everyone of another race). A value of 0 indicates no relationship whatsoever between the two variables. Phi and Cramer’s V range from -1 to +1 for a 2 x 2 table (for a larger table like this one they cannot be negative), while the contingency coefficient ranges from 0 up to a maximum just under 1. The latter seems more reasonable to me since what does a “negative” relationship between two categorical variables really mean? Nothing.

From this you can conclude that the relationship between obesity and race is not zero and that it is a fairly small relationship.
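If you want to see where those numbers come from, here is a short JavaScript sketch of the standard formulas: the Pearson chi-square, then phi, Cramer’s V and the contingency coefficient computed from it. The counts are hypothetical, chosen only to give row percentages near 25%, 35% and 38%; they are not the counts from the coh602 data set. The sketch returns the magnitude of phi, so it does not reproduce the sign that can appear for a 2 x 2 table.

```javascript
// Chi-square and three effect-size measures for a table of counts.
function associationMeasures(counts) {
  const rows = counts.length;
  const cols = counts[0].length;
  const rowTotals = counts.map(row => row.reduce((a, b) => a + b, 0));
  const colTotals = counts[0].map((_, j) => counts.reduce((a, row) => a + row[j], 0));
  const n = rowTotals.reduce((a, b) => a + b, 0);
  let chiSq = 0;
  for (let i = 0; i < rows; i++) {
    for (let j = 0; j < cols; j++) {
      const expected = (rowTotals[i] * colTotals[j]) / n; // count expected if unrelated
      chiSq += (counts[i][j] - expected) ** 2 / expected;
    }
  }
  return {
    chiSq,
    phi: Math.sqrt(chiSq / n),
    cramersV: Math.sqrt(chiSq / (n * Math.min(rows - 1, cols - 1))),
    contingency: Math.sqrt(chiSq / (chiSq + n)),
  };
}

// Hypothetical counts: columns are not obese (0) and obese (1).
const raceByObese = [
  [750, 250], // 25% obese
  [650, 350], // 35% obese
  [620, 380], // 38% obese
];
console.log(associationMeasures(raceByObese));
```

With a big enough sample, the chi-square can be large and highly significant while all three effect sizes stay small, which is the pattern described above.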

Next, I’d like to look at the odds ratios and also include some multivariate analyses. However, I’m still sick and some idiot hit my brand new car on the freeway yesterday and sped off, so I am both sick and annoyed.  So … I’m going back to bed and discussion of the next analyses will have to wait until tomorrow.

So far, we have looked at

  1. How to get the sample demographics and descriptive statistics for your dependent and independent variable.
  2. Computing descriptive statistics by category 

Now it’s time to dive into step 3, computing inferential statistics.

The code is quite simple. We need a LIBNAME statement. It will look something like this. The exact path to the data, which is between the quotation marks, will be different for every course. You get that path from your professor.

LIBNAME mydata "/courses/ab1234/c_0001/" access=readonly;

DATA example ;
SET mydata.coh602;
WHERE race ne "" ;
run ;

I’m creating a data set named example. The DATA statement does that.

It is being created as a subset from the coh602 dataset stored in the library referenced by mydata. The SET statement does that.

I’m only including those records where they have a non-missing value for race. The WHERE statement does that.

If you already did that earlier in your program, you don’t need to do it again. However, remember, example is a temporary data set (you can tell because it doesn’t have a two-level name like mydata.example). It resides in the temporary WORK library, which is cleared when your SAS session ends. Think of it as if you were working on a document and didn’t save it. If you closed that application, your document would be gone. Okay, so much for the data set. Now we are on to ….. ta da da

Inferential Statistics Using SAS

Let’s start with Analysis of Variance.  We’re going to do PROC GLM. GLM stands for General Linear Model. There is a PROC ANOVA also and it works pretty much the same.

PROC GLM DATA = example ;

CLASS race ;

MODEL bmi_p = race ;

MEANS race / TUKEY ;

The CLASS statement is used to identify any categorical variables. Since with Analysis of Variance you are comparing the means of multiple groups, you need at least one CLASS statement with at least one variable that has multiple groups – in this case, race.

MODEL dependent = independent ;

Our model is of bmi_p  – that is body mass index, being dependent on race. Your dependent variable MUST be a numeric variable.

The model statement above will result in a test of significance of difference among means and produce an F-statistic.

What does an F-test test?

It tests the null hypothesis that there is NO difference among the means of the groups, in this case, among the three groups – White, Black and Hispanic. If the null hypothesis is not rejected, then you have no evidence that the group means differ and you can stop.

However, if the null hypothesis is rejected, you certainly also want to know which groups are different from which other groups. After that significant F-test, you need a post hoc test (Latin for “after this”. Never say all those years of Catholic school were wasted).

There are a lot to choose from but for this I used TUKEY. The last statement requests the post hoc test.

Let’s take a look at our results.

I have an F-value of 300.10 with a probability < .0001 .

Assuming my alpha level was .05 (or .01, or .001, or .0001), this is statistically significant and I would reject my null hypothesis. The differences between means are probably not zero, based on my F-test, but are they anything substantial?

If I look at the R-square, and I should, it tells me that this model explains 1.55% of the variance in BMI – which is not a lot. The mean BMI for the whole sample is 27.56.
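Both of those numbers come out of the same sums of squares, which this JavaScript sketch of a one-way ANOVA shows. The three tiny groups are made-up values, nothing like the real BMI data, and the function name is my own. The F-statistic compares the variation between group means with the variation inside groups, while R-square is simply the share of the total variation that lies between groups.

```javascript
// One-way ANOVA from raw values: returns the F-statistic and R-square.
function oneWayAnova(groups) {
  const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;
  const all = groups.flat();
  const grandMean = mean(all);
  let ssBetween = 0; // variation of group means around the grand mean
  let ssWithin = 0; // variation of values around their own group mean
  for (const g of groups) {
    const m = mean(g);
    ssBetween += g.length * (m - grandMean) ** 2;
    for (const x of g) ssWithin += (x - m) ** 2;
  }
  const dfBetween = groups.length - 1;
  const dfWithin = all.length - groups.length;
  return {
    f: (ssBetween / dfBetween) / (ssWithin / dfWithin),
    rSquare: ssBetween / (ssBetween + ssWithin),
  };
}

// Made-up values for three groups.
console.log(oneWayAnova([[1, 2, 3], [2, 3, 4], [3, 4, 5]]));
```

Because the within-groups degrees of freedom grow with the sample size, a very large sample can produce a huge F even when R-square is tiny, which is exactly the situation here.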

You can see complete results here. Also, that link will probably work better with screen readers, if you’re visually impaired (Yeah, Tina, I put this here for you!).

ANOVA table

 

Next, I want to look at the results of the TUKEY test.

table of post hoc comparisons

 

We can see that there was about a 2-point difference between Blacks and Whites, with the mean for Blacks 2 points higher. There was also about a 2-point difference between Whites and Hispanics. The difference in mean BMI between White and Black samples and White and Hispanic samples was statistically significant. The difference between Hispanic and Black sample means was near zero with the mean BMI for Blacks 0.06 points higher than for Hispanics.

This difference is not significant.

So …. we have looked at the difference in Body Mass Index, but is that the best indicator of obesity? According to the World Health Organization, who you’d think would know, obesity is defined as a BMI of 30 or greater.

The next step we might want to take is to examine our null hypothesis using a categorical variable, obese or not obese. That is our next analysis and next post.
