The results are in! The chart below gladdens my little heart, somewhat.

Graph showing significant improvement from pretest to posttest

One thing to note is that the 95% confidence interval is comfortably above zero. Another point is that it looks like a pretty normal distribution.

What is it? It is the difference between pretest and post-test scores for 71 students at two small, rural schools who played Spirit Lake: The Game.

I selected these schools to analyze first, and held my breath. These were the schools we had worked with the most closely, who had implemented the games as we had recommended (play twice a week for 25-30 minutes). If it didn’t work here, it probably wasn’t going to work.

Two years ago, with a sample of 39 students from 4th and 5th grade from one school, we found a significant difference compared to the control group.


You probably don’t feel nervous reading that statement because you have not spent the last three years of your life developing games that you hope will improve children’s performance in math.

The answer, at least for the first group of data we have analyzed is – YES!

Scores improved 20% from pre-test to post-test. This was not as impressive as the 30% improvement we had found in the first year, but this group also began with a substantially higher score. Two years ago, the average student scored 39% on the pre-test. This year, for the 71 students with complete data, the average pre-test score was 47.9% and the post-test mean was 57.4%. I started this post saying my little heart was gladdened “somewhat” because I still want to see the students improve more.

There is a lot more analysis to do. For a start, there is analysis of data from schools that were not part of our study but that used the pretest and post-test – with them, we can’t really tell how the game was implemented, but at least we can get psychometric data on the tests.

We have data on persistence – which we might be able to correlate with post-test data, but I doubt it, since I suspect students who didn’t finish the game probably didn’t take the post-test.

We have data on Fish Lake, which also looks promising.

Overall, it’s just a great day to be a statistician at 7 Generation Games.

buffalo in the winter

Here is my baby, Spirit Lake. It can be yours for ten bucks. If you are rocking awesome at multiplication and division, including with word problems, but you’d like to help out a kid or a whole classroom, you can donate a copy.

Some problems that seem really complex are quite simple when you look at them in the right way. Take this one, for example:

My hypothesis is that a major problem in math achievement is persistence. Students just give up at the first sign of trouble. I have three different data sets with student data from the Spirit Lake game. Many of the students in the student table are the control group, so they will have no data on game play. There is a table of answers to the math challenges and another table with answers to quizzes that students took only if they missed a math challenge. When students miss a math challenge in the game, depending on which educational resource they choose, they may do one of two or three different quizzes to get back into the game. Also, some of the quiz records were not from quizzes actually in the game but from supplemental activities we provided. So, how do I identify where in the process students drop out and present it in a simple graphic to discuss with schools? Just to complicate matters, the username field had different lengths in the different datasets, and the variable for the timestamp also had different names.

It turns out, the problem was not that difficult.

  1. Merge the student table with the answers (math challenges) and only include those students with at least one answer.
  2. Merge the student table with the quizzes and only include those students with at least one quiz.
  3. Concatenate the data sets from steps 1 & 2.
  4. Create a new userid variable and set it equal to the username.
  5. Create a new "entered" variable and set it equal to whichever of the datetime fields exists on that record.
  6. Delete the quizzes not included in the game.
  7. Sort the dataset by userid and the date and time entered.
  8. Keep the last record for each userid. Now you have their last date of activity.
  9. If there is a value for the math challenge field then that is the name of the last activity; otherwise the quiz name is the name of the last activity.
  10. Use a PROC FORMAT to assign each activity a value equal to the step in the game.
  11. Do a PROC FREQ using that format and the ORDER=FORMATTED option.

Once I had the frequencies, I just put them into a table in a Word document and shaded the columns to match the percentage. There may be a way in SAS/GRAPH or something else to do this automatically, but honestly, the table took me two minutes once I had the data.
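If you did want SAS to draw something similar, here is a minimal sketch of one way to do it (my own sketch, not the code behind the chart below) using PROC SGPLOT, assuming the retention dataset and the $activity. format created in the code further down:

ods graphics on ;
proc sgplot data = retention ;
format last_activity $activity. ;
* one bar per step, bar height = number of students whose last activity was that step ;
vbar last_activity ;
xaxis label = "Last step reached in the game" ;
yaxis label = "Number of students" ;
run ;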

graph showing students dropping out at each step

I think it illustrates my points pretty clearly, which are:

  • A sizable number of students drop out after the second problem.
  • 25% of the students drop out after the first difficulty they have (missing the second problem).
  • Only a minority of students, less than 25% of the total sample, persist all the way to the end.

This isn’t based on a tiny sample, either. The data above represent a sample of 397 students.

In case you would like to see it, the code for steps 3-11 is below. Particularly useful is the PROC FORMAT. Notice that you can have multiple values have the same format, which was important here because players can take multiple paths that are still the same step in the sequence.

data persist ;
attrib userid length = $49 ;
set mydata2.sl_answers mydata2.sl_quizzes ;
* use whichever datetime field exists on this record ;
entered = max(date_answered_dt, date_taken_dt) ;
* drop quizzes that were supplemental activities, not part of the game ;
if quiztype in ("problemsolve", "divide1long", "multiplyby23") then delete ;
userid = new_username ;
format entered datetime20. ;
run ;

proc sort data = persist ;
by quiztype ;
run ;

proc sort data = persist ;
by userid entered ;
run ;

data retention ;
set persist ;
by userid ;
* keep only the last record (latest activity) for each userid ;
if last.userid ;
attrib last_activity length = $14 ;
if inputform ne "" then last_activity = inputform ;
else last_activity = quiztype ;
run ;

proc freq data = retention ;
tables last_activity ;
run ;

proc format ;
value $activity
"findcepansi" = "01"
"x2x9" = "02"
"math2x" = "02"
"math2_2" = "02"
"wolves1a" = "02"
"multiplyby5" = "03"
"multiplyby4" = "03"
"multiplyby3" = "04"
"wolves1b" = "05"
/* .... AND SO ON .... */
"horseform2" = "21"
;
run ;

ods rtf file = "C:\Users\Spirit Lake\phaseII\pipeline.rtf" ;
proc freq data = retention order = formatted ;
tables last_activity ;
format last_activity $activity. ;
run ;
ods rtf close ;
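By the way, steps 1 and 2 (the merges) aren’t shown in the code above. A minimal sketch of that part, assuming a student table named mydata2.sl_students keyed by username (both names are my assumptions, not from the original code), might look like this:

proc sort data = mydata2.sl_students out = students ; by username ; run ;
proc sort data = mydata2.sl_answers out = answers ; by username ; run ;

data step1 ;
* keep only students who have at least one answer record ;
merge students (in = instudent) answers (in = inanswer) ;
by username ;
if instudent and inanswer ;
run ;

* step 2 repeats the same pattern with mydata2.sl_quizzes ;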

Feel smarter after reading this blog?
Fish Lake artwork
Want to feel even smarter? Download and play our games! You can run around in our virtual world while reviewing your basic math skills. If you are too busy (seriously?) you can still give a game as a gift or donate a game to a classroom or school.

Let’s get this out right up front – I have no question that there is discrimination in the tech industry. I gave an hour-long talk on this very subject at MIT a couple of weeks ago, where I pointed out that everyone’s first draft of pretty much everything is crap – your first game, first database – and some people we give encouragement and other people we give up on.

That’s not my point here. My point is that sometimes we are our own barriers by not applying to positions. Let me give you two examples.

First, as I wrote on my 7 Generation Games blog earlier, we reject disproportionately more male applicants for positions, and yet our last four hires have all been men. This may change with the current positions (read on to find out why).

For the six positions we have advertised over the last couple of years, the application pool has looked like this:

           Hired (Yes)   Not hired (No)
Male            4              18
Female          2               4

We had one woman apply for the previous internship position we advertised, and we ended up hiring a male. If you look at this table, the chances of a woman applicant being hired (2 of 6, or 1 in 3) are greater than the chances of a man being hired (4 of 22, or 1 in 5.5). Yet, we hired twice as many men as women.

Why is that? Because more men apply. More unqualified men apply, which explains our higher rejection rate. If we explicitly state, “Must work in office five days a week”, we will get men (but no women) applying who live in, say, Sweden, and want to know if maybe that is negotiable (no.)

banner

We have also recently filled 3 positions, and will soon fill two more, without advertising. In one of those cases, the person (male) contacted us and convinced us that he could do great work. All four of the other positions were filled by personal contacts. We called people we knew who were knowledgeable in the field and asked for recommendations.

We happen to know a lot of people who are Hispanic and Native American, so 3 of those positions ended up going to extremely well-qualified people from those groups. The one woman we hired out of those five positions was actually recommended by my 82-year-old mother who said,

“Your cousin, Jean, is a graphic artist, you should check out her work.”

As you can see from the photo of the 6-foot banner she made for us, she does do good work.

I see two factors at work here:

  1. Women are less likely to nominate themselves. While men will apply even if their meeting the  qualifications seems to be a stretch (or a delusion), women are less likely to do so. I don’t know why. Fear of rejection?
  2. People are recommended by their networks and women seem to be less plugged into those networks. This is also true of minorities. We make no special effort to recruit Hispanic or Native American employees but since that is a lot of who we know, it is a lot of who THEY know and hence a lot of our referrals.

How do you increase your proportion of female applicants? You are going to laugh at this because it is the simplest thing ever. This time around, I wrote a blog post and tweets that specifically encouraged females to apply. And it worked! Well, maybe you would have predicted that, but not me. I would never have guessed.

Do you really want to hire Latino graphic artists or software developers? Come to the next Latino Tech meetup. Bonus: the food is awesome.


My point, which you may have now despaired of me having, is that affirmative action is a good thing on both sides. By affirmative action I mean being pro-active. If you are from an under-represented group, APPLY. Invite yourself to the dance.  If you are an employer, reach out. It could be as easy as having a margarita during Hispanic Heritage Month or writing a blog post.

In both cases, you might be surprised how little effort yields big results.

Don’t forget to buy our games and play them. Fun! Plus, they’ll make you smarter.

man from Spirit Lake

You wouldn’t think there would be that much to say about scree plots. That is because you are like me and sometimes wrong.

The problem I often have teaching is that I assume people know a lot more than is reasonable to expect for someone coming into a course. Sometimes, I’m like a toddler who thinks that because she knows what color hat the baby was wearing yesterday, you do, too.

Toddler with baby


So …. a scree plot is a plot of the eigenvalues by the factor number. In the chart below, the first factor has an eigenvalue of about 5.5 while the eigenvalue of the second factor is around 1.5. (If you don’t know what an eigenvalue is, read this post. )

scree plot with bend in plot after second factor
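If you want to request a scree plot like this yourself, a minimal sketch in SAS looks something like the following; the dataset and variable names are made up for illustration:

ods graphics on ;
proc factor data = mydata.survey method = principal plots = scree ;
var item1-item20 ;
run ;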


As I mentioned in the previous post, an eigenvalue greater than 1 explains more than a single item, but as you can tell by looking at the plot, some of those eigenvalues are barely higher than one. Should you keep them? Or not?

What is scree, anyway? Scree is a pile of debris at the base of a cliff. In a scree plot, the real factors are at the top of the cliff and the scree is the random factors at the bottom you should discard. So, based on this, you might decide you only have one real factor.

The idea is to discard all of the factors after the line starts to flatten out. But is that after 1 factor? It kind of flattens out after four?  Maybe?

Sometimes a scree plot is really clear, but this one, not as much. So, what should you do next?

Hmm … maybe I should write another post on that.

I find this scree plot of eigenvalues very helpful in identifying the number of factors. A scree plot is a plot of the eigenvalues by the factor number.

I realized this is only helpful if one understands what an eigenvalue is.

scree plot with bend in plot after second factor

First of all, go way back to Stat 101 and remember that correlation is the covariance of z-scores. Since z-scores have a standard deviation of 1, and the square of 1 is 1, they also have a variance of 1.

Understand that the default is to factor analyze the correlation matrix and that means that your variables are standardized before analysis, with a variance of 1. So, that is the total amount of variance we are trying to explain for each variable.

Therefore, the total amount of variance to be explained in a matrix will equal the number of variables. If you have 10 variables, the total amount of variance to be explained is 10.

If you’ve ever looked at a correlation matrix, you will have noticed that all of the diagonals are 1. The correlation of an item with itself is 1.

Example of correlation matrix

What percentage of the variance in an item is explained by itself? It should be obvious that it is 100%. If I know your age, for example, I can predict your age with 100% accuracy. Duh.

An eigenvalue is the total amount of variance in the variables in the dataset explained by the common factor. (Mathematically, it’s the sum of the squared factor loadings. If you are interested in that, you can come to my class at WUSS on Wednesday morning. Or, possibly, I’ll blog about it next week.)

Now, if a factor has an eigenvalue of 1, it is pretty useless. That is because the whole purpose of factor analysis is to replace your 20 or 50 items, each of which explains a variance of 1 (its own variance), with a few common factors. A factor with an eigenvalue of 1 doesn’t explain any more variance than a single item.

Let’s say you have 24 items and your first factor has an eigenvalue of 6. Is that good? Yes, because that means that a single factor explains as much of the variance in the matrix of data as six items. If you could get four orthogonal factors, each with an eigenvalue close to 6, then you would have explained nearly 100% of the variance in your 24 items with just 4 factors.
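To spell out that arithmetic (my own illustration, not a result from our data): with 24 standardized items the total variance is 24, so a factor with an eigenvalue of 6 explains 6/24 = 25% of the total variance, and four orthogonal factors like that would account for 4 × 6 = 24, essentially all of it.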

Think about correlation matrices again. It’s not often you see an EXACTLY zero correlation. You’ll find correlations of .08, .03, .12 just by chance. Who knows, maybe the same person had the highest score on both sticks of bubble gum chewed and number of asses kicked (R.I.P. Rowdy Roddy Piper); it doesn’t mean that those two variables really have that much in common. This is why we look at statistical significance and how likely something is to occur by chance.

How do you tell whether a factor has a higher degree of common variance, that is, a higher eigenvalue, than would be expected by chance? One way is the scree plot. You look at the eigenvalue for each factor and see where it drops off.

I would write more about this but my family is urging me to leave for a barbecue for Maria’s birthday, so you will have to last until tomorrow for a more detailed explanation of scree plots and why they are called that.


Random act of advertising: Buy our games – learn history, learn math, find herbs, spear fish 

stack of fish

If you want to be really cool and get a Tourist Visa to our Virtual Worlds, you can apply here.

I read this in a review of a study on teacher expectancy effects but it could really apply to so many other studies.

If these results bear any relationship at all to reality, it is indeed a fortunate coincidence.

Those of us who choose careers in research like to believe that it is all like everyone learns in their textbooks: hypothesis, data collection, analysis, results, conclusion and *PRESTO* knowledge.

In a few weeks, I will be in San Diego at the Western Users of SAS Software conference presenting results of the past year of testing with Fish Lake and Spirit Lake: The Game.

Occasionally, colleagues will ask me about my interest in the nuts and bolts of data analysis and why I ‘bother’ presenting at SAS conferences instead of  ‘the real thing’, like the American Educational Research Association or the National Council on Family Relations. One of the main reasons is that I like to be very transparent about how my data were collected, scored and analyzed. I find it odd that these “details” are given short shrift in publications when, in fact, all of the conclusions ever published rely on the assumption that these “details” were done correctly.

Presenting the nuts and bolts of the data cleaning, coding and analysis assures any funding agency or consumer of the research that it was done correctly. Or, if anyone wants to dispute the way I’ve done the analyses, at least it is crystal clear how exactly the data were processed. In most cases, the reader has no idea and is just taking it on faith that the researcher did everything correctly – which given some of the bozos I know is pretty shaky ground.

Once I have confidence that the data sets are in good shape, have corrected any data entry problems, deleted outliers, accurately scored measures and identified any statistical assumptions that need to be met, then I’m ready to proceed to the analyses with confidence.

Think about that next time someone with a turned up nose says,

“I don’t go to that type of conference.”

Yeah? Well, I do.


Want to see what it’s all about? We make games that make you smarter – 30% smarter, according to my data (-:

Click here to learn more.

stack of fish

In assessing whether our Fish Lake game really works to teach fractions, we collect a lot of data, including a pretest and a post-test. We also use a lot of types of items, including a couple of essay questions. Being reasonable people, we are interested in the extent to which the ratings on these items agree.

Lake with fish, divided into quarters

To measure agreement between two raters, we use the kappa coefficient. PROC FREQ produces two types of kappa coefficients. The kappa coefficient ranges from -1 to 1, with 1 indicating perfect agreement, 0 indicating exactly the agreement that would be expected by chance, and negative numbers indicating less agreement than would be expected by chance. When there are only two categories, PROC FREQ produces only the simple kappa coefficient. When more than two categories are rated, a weighted kappa is also produced, which credits ratings in categories closer together as partial agreement and ratings at the extreme ends as no agreement.

The code is really simple:

ODS GRAPHICS ON ;
PROC FREQ DATA = datasetname ;
TABLES variable1*variable2 / AGREE PLOTS = KAPPAPLOT ; /* AGREE requests the kappa statistics */
RUN ;

Turning ODS GRAPHICS ON and including the AGREE and PLOTS = KAPPAPLOT options in your TABLES statement will give you the kappa statistics plus a plot of both the agreement and distribution of ratings. Personally, I find the kappa plots, like the example below, to be pretty helpful.

Kappa plot

This visual representation of the agreement shows that there was a large amount of exact agreement (dark blue shading) for incorrect answers, scored 0, with a small percentage of partial agreement and very few with no agreement. With 3 categories, only exact agreement or partial agreement is possible for the middle category. Two other take-away points from this plot are that agreement is lower for correct and partially correct answers than incorrect ones and that the distribution is skewed, with a large proportion of answers scored incorrect. Because it is adjusted for chance agreement, kappa is affected by the distribution among categories. If each rater scores 90% of the answers as correct, the raters will agree that an answer is correct 81% of the time (.9 × .9) just by chance, thus requiring an extremely high level of agreement to be significantly different from chance. The kappa plot shows agreement and distribution simultaneously, which is why I like it.


Want to play the game ? You can download it here, as well as our game for younger players, Spirit Lake.

Sometimes, you can just eyeball it.

Really, if something truly is an outlier, you ought to be able to spot it. Take this plot, for example.

plot with 3 large bars and a few outliers

It should be pretty obvious that the vast majority of our sample for the Fish Lake game were students in grades 4, 5 and 6. Those in the lower grades are clearly exceptions. I don’t know who put 0 as their grade, because I doubt any of our users had no education.

I use these plots especially if I’m explaining why I think certain records should be deleted from a sample. For many people, it seems as if the visual representation makes it clearer that “some of these things don’t belong here.”

Did you know that you can get a plot from PROC FREQ just by adding an option, like so:

PROC FREQ DATA = datasetname ;
TABLES grade / PLOTS = FREQPLOT ; /* "grade" is just an example variable name; ODS Graphics must be enabled */
RUN ;


This will produce the frequency plot seen above, as well as a table for your frequency distribution.

Well, if you didn’t know, now you know.

Previously, I discussed PROC FREQ for checking the validity of your data. Now we are on to data analysis, but, as anyone who does analysis for more than about 23 minutes can tell you, cleaning your data and doing analysis is seldom a two-step process. In fact, it’s more like a loop of two steps, over and over.

First, we have the basics.

PROC FREQ DATA = mydata.quizzes ;

TABLES passed /binomial ;


(NOTE: If you have a screen reader, click here to read the images below. This is for you, Tina! )

This will give me not only what percentage passed a quiz that they took,

frequency table

but also the 95% confidence limits.

95% confidence limits

This also gives a test of the null hypothesis that the population proportion equals the number specified. Since, as in this case, I did not specify any hypothesized population value, the default of .50 is used.

Test of Ho: proportion = .50

I didn’t have any real justification for hypothesizing any other population value. What proportion of kids should be able to pass a quiz that is ostensibly at their grade level? Half of them – as in, the “average” kid? All of them, since it’s their grade level? I’m sure there are lots of numbers one could want to test.

If you do have a specific proportion, say, 75%, you’d code it like this:

PROC FREQ DATA =in.quizzes ;
TABLES passed / BINOMIAL (P=.75);

Note that the P= has to be enclosed in parentheses or you’ll get an error.

So, out of the 770 quizzes that were taken by students, only 30.65% of them passed. However, the quizzes aren’t all of equal difficulty, are they? Probably not.

So, my next PROC FREQ is a cross-tabulation of quiz by passed. I don’t need the column percent or percent of total. I just want to know what percent passed or failed each quiz and how many players took that quiz. The way the game is designed, you only need to study and take a quiz if you failed one of the math challenges, so there will be varying numbers of players for each quiz.

PROC FREQ DATA = in.quizzes ;
TABLES quiztype*passed / NOCOL NOPERCENT ; /* variable names assumed from the quizzes dataset described above */
RUN ;

The first variable will be the row variable and the one after the * will be the column variable. Since I’m only interested in the row percent and N, I included the NOCOL and NOPERCENT options to suppress printing of the column and total percentages.

(For an accessible version for screen readers, click here)


Before I make anything of these statistics, I want to ask myself, what is going on with quiz22 (which actually comes after quiz2) and quiz4? Why did so many students take these two quizzes? I can tell at a glance that it wasn’t a coding error that made it impossible to pass the quiz (my first thought), since over a quarter of the students passed each one.

This leaves me three possibilities:

  1. The problem before the quiz was difficult for students, so many of them ended up taking the quiz (another PROC FREQ).
  2. One of the problems in the quiz was coded incorrectly, so some students failed the quiz when they shouldn’t have.
  3. There was a problem with the server repeatedly sending the data that was not picked up in the previous analyses (another PROC FREQ).

Remember what I said at the beginning about data analysis being a loop? So, back to the top!


If you’d like to see the game used to collect these data, even play the demo yourself, click here.

level up screen from Fish Lake

I’m in the middle of data preparation on a research project on games to teach fractions. This is the part of a data analysis project that takes up 80% of the time. Fortunately, PROC FREQ from SAS can simplify things.

1. How many unique records?

There are multiple quizzes in the game, and you only end up taking a quiz if you miss one of the problems, so knowing how many unique players my 1,000 or so records represent isn’t as simple as dividing the number of records by X, where X is a fixed number of quizzes.

PROC FREQ DATA = mydata.quizzes NLEVELS ;

TABLES username ;

Gives me the number of unique usernames. If you were dying to know, in the quizzes file for Fish Lake it was 163.

2. Are there data entry problems?

We had a problem early in the history of the project where, when the internet was down, the local computer would keep trying to send the data to our server, so we would get 112 of the same record once the connection was back up.

Now, it is very likely that a player might have the same quiz recorded more than once. Failing it the first time, he or she would be redirected to study and then have a chance to try again. Still, a player shouldn’t have TOO many of the same quiz. I thought this problem had been fixed, but I wanted to check.

To check if we had the same quiz an excessive number of times, I simply did this :

PROC FREQ DATA= in.quizzes ;
TABLES username*quiztype / OUT=check (WHERE = (COUNT > 10)) ;

This creates an output data set of those usernames that had the same quiz more than 10 times.

There were a few of these problems.  The question then became how to identify and delete those without deleting the real quizzes. This took me to step 3.

3. The LAG function

The LAG function provides the value from the prior observation. I sorted the data by username, quiz type, number correct and the time, and assumed it would take a minimum of 120 seconds for even the fastest student to complete a study activity and take the same quiz a second time. Using the code below, I was able to delete all duplicate quizzes that occurred due to dropped internet connections.

proc sort data = check4 ;
by username quiztype numcorrect date_time ;
run ;

data check5 ;
set check4 ;
lagu = lag(username) ;
lagq = lag(quiztype) ;
lagn = lag(numcorrect) ;
lagd = lag(date_time) ;
* difference in seconds from the previous record for the same student, quiz and score ;
if lagu = username & lagq = quiztype & lagn = numcorrect then ddiff = date_time - lagd ;
* anything resubmitted less than 120 seconds later is treated as a duplicate from a dropped connection ;
if ddiff ne . & ddiff < 120 then delete ;
run ;

Having finished off my data cleaning in record time, I’m now ready to do more PROC FREQ-ing for data analysis – tomorrow.

(Actually, being 12:22 am, I guess it is technically tomorrow now.)


If you’d like to see the game that we are analyzing, you can download a free demo here.


