Recall that in the last post we were using SAS functions to score a test that had been completed by middle school and upper elementary students. Since we wanted to make it as easy as possible for students to enter their answers, we accepted just about any format.

Picking up where we left off …


In one question, the correct answer is 1/8. Students entered 1/8, 1/8 cup, 1/8 cup of beans, and so on. To score these, we use the SUBSTR function to read the first 3 characters and score the problem correct if those are “1/8”.

if substr(q22,1,3) = "1/8" then q22 = 1 ;
else q22 = 0 ;
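To see the SUBSTR trick on its own, here is a minimal sketch with made-up answers. The data set name substr_demo and the variable q22_score are assumptions for illustration (the post overwrites q22 itself), and the four sample answers are hypothetical.

```sas
/* Hypothetical sample answers; the real ones come from the pretest data */
data substr_demo;
   length q22 $ 20;
   infile datalines truncover;
   input q22 $char20.;
   /* Compare only the first three characters of whatever was typed */
   if substr(q22, 1, 3) = "1/8" then q22_score = 1;
   else q22_score = 0;
   datalines;
1/8
1/8 cup
1/8 cup of beans
1/4 cup
;
run;

proc print data=substr_demo;
run;
```

The first three answers should score 1 and the last should score 0. Writing the score to a separate numeric variable is just one option; it avoids the character-to-numeric conversion that comes up later.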


For q27, students clicked on which of the equations are correct. If they clicked the correct equation, the variable was set  to 1.

When they didn’t click on anything, it was missing. I wanted that to be changed to a zero, so I used the missing function, like this:

if missing(q27) then q27 = 0 ;
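A small sketch of the same idea, assuming q27 is stored as a character variable. The student names and responses below are hypothetical. MISSING works on both character and numeric arguments, so the same test applies either way.

```sas
data missing_demo;
   /* Hypothetical responses: "1" for a correct click, "." for no click at all */
   input student $ q27 $;
   /* With list input, a lone period is read as a missing (blank) character value */
   if missing(q27) then q27 = "0";
   datalines;
amy 1
bob .
cat 1
;
run;
```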


Now, SAS is not the newest kid on the block, and I can relate, because neither am I, not even if I’m on a relatively old block.

The problem with being an older language is that  you have static types and you cannot have mixed arrays. What does that mean? It means that if you have defined q1 as a character variable because it might have a $ in it then by God it is going to stay a character variable and you can’t be doing any funny stuff like finding the mean and standard deviation of it. Also, if you are going to have an array, everything in it better be either all character variables or all numeric variables.

Well, fine, then, here is how I change all those now scored questions to numeric. First, I created a new numeric array of 32 items. You can tell it is numeric because there is no $.

Second, I used a DO loop and the INPUT function. The INPUT function reads a character value using an informat and returns the result, in this case a numeric value read with a width of 8 and 0 decimal places.

I dropped the index variables j and i , which I mentioned in a previous post.

Now that I have my variables all nicely numeric, I can use the SUM function and add up all of the scored items into a total score.


array items {32} item1 - item32 ;
do j = 1 to 32 ;
items{j} = input(qs{j},8.0);
end ;
drop i j ;
total = sum(of item1-item32);
run ;
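Here is a cut-down sketch of that conversion you can run by itself. It assumes three already-scored character items instead of 32, and the data set name convert_demo and the sample rows are invented for illustration.

```sas
data convert_demo;
   /* Hypothetical: three scored items stored as character values */
   input q1 $ q2 $ q3 $;
   array qs{3} $ q1-q3;
   array items{3} item1-item3;      /* no $, so this array is numeric */
   do j = 1 to 3;
      items{j} = input(qs{j}, 8.);  /* read the character value as a number */
   end;
   total = sum(of item1-item3);     /* add up the numeric copies */
   drop j;
   datalines;
1 0 1
0 0 1
1 1 1
;
run;
```

The totals for the three rows should come out to 2, 1 and 3.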


Now that I have my data, the fun stuff begins, but that’s for another post because I need to get back to making games.


This is my day job. Check it out. Buy a game. Maturity is over-rated!


Or donate a game to a school for good karma.

Did you ever fill out one of those online forms where you kept trying to submit it and got messages like,

“You need to enter your phone number in the format 311-234-12234”


You cannot have any special characters in this field.

That one really irritates me because, in fact, my last name has a space in it and many websites refuse to accept it. Take it up with The Invisible Developer, or his ancestors.

Have you ever just said the hell with it, and skipped filling out the form? Preventing users from entering all but the expected data type saves problems when you analyze your data, but it can also cause people to give up on your stupid web form.

So … when I created the pretest for Forgotten Trail and Aztech, I made it accept just about anything. If you wanted to write in 6, six, 9R6, 6 left over — any and all of those would be accepted and recorded.

You can get the first two games we developed here.


Forgotten Trail and Aztech are in beta and will not be commercially available for another couple of months.

What now? I have to score that test, but I’d rather the difficulty be on me than 150 or so middle school students who are our first test group.

So… how to fix it, with SAS character functions. Here is me, scoring the first half of the test:

First, I read the data into a new data set because I want to preserve the original data and not write over it. I may want to look at the exact incorrect answers later.

I create a character array of all 32 items on the test, and then I use a DO loop to change all of the questions to upper case.


Data in.recode ;
set in.pretestGMS ;
array qs{32} $ q1 - q27 q28a q28b q28c q29 q30 ;
do i = 1 to 32 ;
qs{i} = upcase(qs{i}) ;
end ;

Now, on to the questions. I eventually need all of these items to be scored 1 = correct, 0 = incorrect.

q1 is a question about money. People put all kinds of wrong answers – $35, $40, as well as the correct answer, 100 and $100. I used the COMPRESS function to remove the ‘$’, then set q1 to equal 1 if the answer was 100, and 0 otherwise.
q1 = compress(q1,"$") ;
if q1 = 100 then q1 = 1 ;
else q1 = 0 ;

The second use of the COMPRESS function removes blanks – if you don’t put any second parameter in the compress function, it removes every blank, not just trailing ones, which is why “four frogs” becomes FOURFROGS. In q2, the answer was 4 but the students put “four”, “four frogs”, “4/14” and so on. All of these are correct. You can have a list in an IF statement and if the variable matches any of the values in the list, then do something, in this case, set the answer as correct.
q2 = compress(q2) ;
if compress(q2) in ("4","FOUR","FOURFROGS","4/14","4OUTOF","4FROGS") then q2 = 1;
else q2 = 0 ;

*** How to keep only numeric data using a simple SAS function (take that all you regular expression fetishists!)

The third use of the compress function KEEPS the characters that are the second parameter, because I added an optional third parameter of “k”, to KEEP the characters in the second parameter instead of discard those. So, this keeps numbers and deletes everything else from the answer. If it is 150, it is scored correct, otherwise, it’s wrong.
if compress(q5,"0123456789","k") = 150 then q5 = 1;
else q5 = 0 ;
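To compare the three uses of COMPRESS side by side, here is a minimal sketch. The data set name compress_demo, the new variable names and the three sample answers are all assumptions for illustration, not part of the scoring program.

```sas
data compress_demo;
   length answer $ 30;
   infile datalines truncover;
   input answer $char30.;
   answer = upcase(answer);
   no_dollar   = compress(answer, "$");               /* remove dollar signs */
   no_blanks   = compress(answer);                    /* remove every blank */
   digits_only = compress(answer, "0123456789", "k"); /* keep only the digits */
   datalines;
$100
four frogs
150 miles
;
run;

proc print data=compress_demo;
run;
```

If you would rather not type out the digits, COMPRESS also accepts modifiers, so compress(answer, , "kd") should keep digits as well.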


A lot of the items were similar, so that is half of scoring the test. I’ll try to write up the rest from the airport  tomorrow, but for now, I need to write a couple of emails, finish this scoring program and pack before 2 am, and that only gives me about 40 minutes.

On twitter, there were a few comments from people who said they didn’t like to take interns because “More than doing work, they want to watch me work.”

I see both sides of that. You’re busy. You’re not netflix. I get it. On the other hand, that’s a good way to learn.

The data are part of the evaluation of the effectiveness of 7 Generation Games in increasing math achievement scores. You can read more about us here. Below is a sneak peek of the artwork from the level we are creating now for Forgotten Trail.

characters from Forgotten Trail in Maine

So, here you go. I’m starting on a data analysis project today and I thought I’d give you the blow by blow.


It just so happens that the first several steps are all point-y and click-y. You could do it other ways but this is how I did it today. So, step one, I went to phpMyAdmin on the server where the data were saved and clicked Export.


For the format to export, I selected CSV and then clicked on the Go button. Now I have it downloaded to my desktop.


Step 3: I opened SAS Enterprise Guide and selected Import Data.  I could have done this with SAS and written code to read the file, but, well, I didn’t. Nope, no particular reason, just this isn’t a very big data set so I thought, what the heck, why not.


Step 4: DO NOT ACCEPT THE DEFAULTS! Since I have a comma-delimited file with no field names, I need to uncheck the box that says File contains field names on record number. SAS conveniently shows you the data below so I can see that it is comma-delimited. I know I selected CSV but it’s always good practice to check. I can also see that the data starts at the first record, so I want to change that value in Data records start at record number to 1.


Step 5: Change the names  – I am never going to remember what F1, F2 etc. are, so for the first 5 , I click on the row and edit the names to be the name and label I want.

That’s it. Now I click the next button on the bottom of the screen until SAS imports my data.
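If you would rather skip the pointing and clicking, the same import can be sketched in code. This is not what I ran. The file path and the output data set name are hypothetical, so point them at wherever your CSV landed.

```sas
/* Hypothetical path and data set name; adjust to your own download */
proc import datafile="C:\Users\me\Desktop\pretest.csv"
            out=work.pretest
            dbms=csv
            replace;
   getnames=no;   /* the file has no header row */
   datarow=1;     /* data begin on the first record */
run;
```

One difference to watch for: with GETNAMES=NO on a delimited file, PROC IMPORT names the variables VAR1, VAR2 and so on, not F1, F2, so any later code that refers to f6-f37 would need to change accordingly.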

I could have continued changing all of the variable names, because I KNOW down the line I am not going to remember that F6 is actually the first question or that F25 is question 28a. However, I wanted to do some other things that I thought would be easier to code, so I opened up a program file in SAS Enterprise Guide and wrote some code.





data pretest2 ;
    set pretest ;
    array ren{32} $ f6-f37 ;
    array qs {32} $ q1-q27 q28a q28b q28c q29 q30;
    do i = 1 to 32 ;
        qs{i} = ren{i} ;
    end ;
    rename f38 = date_test ;
    drop f6-f37 i ;
run ;

proc sort data=pretest2 ;
    by username date_test ;
run ;

data pretest2 ;
    set pretest2;
    by username date_test ;
    if last.username ;
    if index(username,'TEST') > 0 then delete;
run ;


Okay, that’s it. Now I have my data all ready to analyze. Pretty painless, isn’t it?
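The sort followed by last.username is what keeps only each student’s most recent test. Here is a small sketch of that pattern alone, using invented usernames and dates (the data set names attempts and latest are also made up).

```sas
data attempts;
   /* Hypothetical usernames and test dates */
   input username $ date_test :yymmdd10.;
   format date_test yymmdd10.;
   datalines;
amy 2015-11-02
amy 2015-11-09
TEST01 2015-11-01
bob 2015-11-03
;
run;

proc sort data=attempts;
   by username date_test;
run;

data latest;
   set attempts;
   by username date_test;
   if last.username;                            /* keep each user's most recent record */
   if index(username, 'TEST') > 0 then delete;  /* drop test accounts */
run;
```

With these four rows, latest should end up with two records: amy’s second attempt and bob’s only attempt.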

Want to learn more about SAS?

Here is a good paper on Arrays made easy .

If you’re interested in character functions like index, here is a good paper by Ron Cody.

It’s after midnight and officially the weekend so it’s time for the sixth installment of Mama AnnMaria’s Guide to Not Getting Your Sorry Ass Fired.

Click here for all of my advice on getting fired and quitting.


Click here to buy the games I make on my day job.

As I was saying … there are many reasons you can get your sorry ass fired and many of these have nothing to do with your ability to perform competently at your job. I used to think the same as you, that if I was a good programmer/ cashier /secretary/ accountant / dental assistant / teacher or whatever, that my job should be safe. If I was a better than average whatever it was, they were lucky to have me.

No. Read my lips, or, in this case, type … no matter how great you are at your job, there is a point beyond which it is not worth the pain in the ass of putting up with you.

Let me give you a few examples in the “I can’t believe I have to explain this” category.

  1. The stuff at work is for you when you are at work. There are two parts to that sentence you should understand, “FOR YOU” and “WHEN YOU ARE AT WORK”. Maybe your job provides a nice office for you with a nice employee lounge. Your spouse/ mom/ roommate/ homeless guy you met on the street should not be in the employee lounge drinking the free coffee, watching the free cable and eating the free bagels. Now, I’m not saying if your roommate is in the neighborhood one day, he or she can’t relax and have a cup of coffee while waiting to go to lunch. What I am saying is that I shouldn’t see your boyfriend hanging out in the lounge more often than people who work here. There is NO circumstance under which I should find an adult who doesn’t work here sleeping on the couch, floor, across two chairs – either in your office or anywhere else in the workplace. If they are that sick, take them to the hospital. If they are that drunk, take them to rehab.
  2. Don’t take stuff home unless you need it for work, that includes filing cabinets, coffee pots, fax machines, boxes of – well, anything –  and that package of printer paper you took home to print out your roommate’s wedding invitations.
  3. Your work cell phone, iPad, computer and car are for you, for work. I can’t tell you the number of times I have seen someone in trouble at work because their child or friend broke their employer’s equipment or had an accident in a company car. This is the face I make when I hear that.

what the hell?

Seriously, what the hell are you thinking? Why was your kid playing with your company phone that she could drop it in the toilet? Why did you let your friend drive a car that did not belong to you?

4. Your mom shouldn’t be coming to your work place on a regular basis. The only exception I can imagine is if you work in a coffee shop and she comes there for coffee every morning. Otherwise, see face above.

Basically, get this straight, your work place is not your home and your office is not your living room. Wear pants. Wear shoes. Brush your teeth and bathe before you get there and don’t invite in your friends and family or share the place and the stuff in it with them. Let me explain why, because sometimes when I say this to people they think I am being mean, and they try to tell me that they are the only one of their social circle to have a job, have access to these nice things and what is wrong with sharing the wealth.

I will tell you what is wrong – generally, your employer budgets enough money for space, supplies, equipment to meet the needs of the people employed. A few times a year, when I submit budget justifications for granting agencies or investors, I include our estimated costs. Those extra people you are bringing in with you, whether it is space or flash drives they are taking, were not in the budget and as a result, there is not comfortably enough for the people who work here. This in itself may not be enough to get your sorry ass fired, but if your behavior in the workplace is annoying enough, you will find yourself the first one up against the wall when the revolution comes.

How common is a particular disease or condition? It depends on what you mean by common. Do you mean how many people have a condition, or do you mean how many new cases of it are there each year?

In other words, are you asking about the probability of you HAVING a disease or of you GETTING a disease?

Yes, I mentioned Down syndrome in the title and I am going to use it as an example and one could argue, and I would agree, that Down syndrome is not, strictly speaking, a disease. That relates to a second point, though, which is that incidence and prevalence are terms and techniques that can be applied not just to disease but to chromosomal disorders and even to risk factors, such as smoking. Now, I’m getting ahead of myself which just shows that one should pipe down and pay attention.

INCIDENCE RATE is the rate at which new cases are occurring.

I downloaded a data set of 40,002 births that occurred in U.S. territories. Did you know that the U.S. administers 16 territories around the world? Well, now you do. Bet you can’t name more than four of them. I’m not sure whether this is a good deal for the people of the territories or not, but I am sure they had 40,000 babies.

Finding the incidence rate for Down syndrome was super-duper easy.

I made a birthweight data set, but unlike the last one, I selected an additional variable, ca_down, and then I recoded that variable.

FILENAME in "C:\Users\you\Downloads\Nat2014us\Nat2014PublicUS.c20150514.r20151022.txt" ;

LIBNAME out "C:\Users\me\mydir\" ;
DATA out.birth2014 ;
INFILE in ;
INPUT birthyr 9-12 bmonth 13-14 dob_tt 19-22 dob_wk 23 bfacil 32
mager 75-76 mrace6 107 mhisp_r 115 mar_p $ 119 dmar 120
meduc 124 bmi 283-286 cig1_R 262 cig2_R 263 cig3_r 264 dbwt 504-507 bwtr4 511
ca_down $ 552 ;

if ca_down in ("C","P") then down = "Y" ;
else down = "N" ;
RUN ;

You see, the Down syndrome variable at birth is coded as C for Confirmed, N for No and P for Pending. Well, the “pending” means that the medical personnel suspect Down syndrome but they are waiting for the results of a chromosomal analysis and the confirmed means the analysis has confirmed a diagnosis of Down syndrome. Since I presume that most experienced medical personnel recognize Down syndrome, I combined those two categories (which, if you read the codebook, it turns out that the NCHS folks did, too.)

Then I did

PROC FREQ DATA = out.birth2014 ;
TABLES down ;
RUN ;

And the result was

down    Frequency    Percent    Cumulative Frequency    Cumulative Percent
N           39963      99.90                   39963                 99.90
Y              39       0.10                   40002                100.00

The incidence rate is 0.10%, or about 1 per 1,000.

That’s all incidence rate is – total number of new cases / number at risk

The number at risk of Down syndrome would be all babies born, all 40,002 of them. The number new cases was 39. According to the World Health Organization, the incidence of Down syndrome is between 1 in 1,000 and 1 in 1,100 live births.
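If you want to check the arithmetic yourself, here is a minimal sketch that does the division in a DATA step, using the two counts from the frequency table. The variable names are made up for illustration.

```sas
data _null_;
   new_cases = 39;       /* births coded C or P */
   at_risk   = 40002;    /* all births in the file */
   rate      = new_cases / at_risk;
   per_1000  = rate * 1000;
   one_in    = at_risk / new_cases;
   put "Incidence proportion: " rate 8.5;
   put "Per 1,000 births:     " per_1000 8.2;
   put "One case per " one_in 8.0 " births";
run;
```

The log should show roughly 0.97 per 1,000, or about one case per 1,026 births, which sits inside the WHO range.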

So, whatever other fallout there may be of living in a U.S. territory, it doesn’t seem to carry with it any greater incidence of Down syndrome births.

When discussing incidence, a condition like Down syndrome is easy to start with because you have it when you are born, you don’t develop it later. That is not the case with every health condition, though. That’s another blog for another day.

Here’s an interesting note: as often occurs, the most complicated part of this analysis was getting the data. After that, it was easy.



Feel smarter? Want smarter kids? Check out 7 Generation Games. It’s what I do when I’m not blogging.


If you buy a game this month, we’ll send you a log in for our game in beta, Forgotten Trail, for free.

If you donate games to a school or classroom, we’ll give them FOUR games for the year, instead of two, because we have two new awesome games coming out this year.

Even though Rick Wicklin (buzzkill!) disabused me of the concern that SAS was communicating with aliens through the obscure coding in its sashelp data sets, I still wanted to roll my own.

If you, too, feel more comfortable with a data set you have produced yourself, let me give you a few tips.

  • There is a wealth of data available online, much of it thanks to your federal government. For example, I downloaded the 2014 birth data from the Centers for Disease Control and Prevention site. They have a lot of interesting public use data sets.
  • Read the documentation! The nice thing about the CDC data sets is that they include a codebook. This particular one is 183 pages long. Not exciting reading, I know, but I’m sitting here right now watching a documentary where some guy is up to his elbows in an elephant’s placenta (I’m seriously not making this up) and that doesn’t look like a bowl of cherries, either.
  • Assuming you did not read all of the documentation even though I told you it was important, (since I have raised four daughters and know all about people not paying attention to my very sound advice), at a MINIMUM you need to read three things; 1) sampling information to find out if there are any problems with selection bias, any sampling weights you need to know about, 2) the record layout – otherwise how the hell are you going to read in the data, and 3) the coding scheme.

Let’s take a look at the code to create the data set I want to use in examples for my class. Uncompressed, the 2014 birth data set is over 5 GB which exceeds the limit for upload to SAS On-Demand for Academics and also isn’t the best idea for a class data set for a bunch of reasons, a big one being that I teach students distributed around the world and ships at sea (for real) and having them access a 5GB data set isn’t really the most feasible idea.


I’m going to assume you downloaded the file into your downloads folder and then unzipped it.


Since I READ the codebook (not being a big hypocrite), I saw that the record length is 775 and there are nearly 4 million records in the data set, so opening it up in SAS Enterprise Guide or the Explorer window didn’t seem a good plan. My first step, then, was to use a FILENAME statement to refer to the data I downloaded, data that happens to be a text file.

I just want to take a look at the first 10 records to see that it is what it claims to be in the codebook. (No, I never DO trust anyone.)

The default length for a SAS variable is 8.

I’m going to give the variable a length of 775 characters.

Notice that the INFILE statement has to refer back to the reference used in the FILENAME statement, which I very uncreatively named “in” . Naming datasets, variables and file references is not the place for creativity. I once worked with a guy who named all of his data sets and variables after cartoon characters – until all of the other programmers got together and killed him.

Dead programmers aside, pay attention to that OBS=10 unless you really want to look at 3,998,175 records. The OBS =10 option will limit the number of records read to – you guessed it – 10.

With the INPUT statement, I read in from position 1-775 in the text file.

All of this just allows me to look at the first 10 records without having to open a file of almost 4 million records.

FILENAME in "C:\Users\you\Downloads\Nat2014us\Nat2014PublicUS.c20150514.r20151022.txt" ;

DATA example;
LENGTH var1 $775;
INFILE in OBS= 10 ;
INPUT var1 $ 1-775;
RUN ;
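Another way to peek at the raw records, if you do not need them in a data set at all, is to write them straight to the log. This is a sketch of an alternative, not what I did above; it reuses the same file path and the same fileref name, in.

```sas
filename in "C:\Users\you\Downloads\Nat2014us\Nat2014PublicUS.c20150514.r20151022.txt";

data _null_;
   infile in obs=10;   /* stop after the first 10 records */
   input;              /* load each record into the input buffer without creating variables */
   put _infile_;       /* write the raw record to the log */
run;
```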


Being satisfied with my first look, I went ahead to create a permanent data set. I needed a LIBNAME statement to specify where I wanted the data permanently stored.

The out in the LIBNAME and DATA statements need to match. It could be anything, it could be komodo or kangaroo, as long as it’s the same word in both places. So … my birth2014 data set will be stored in the directory specified.

How do I know what columns hold the birth year, birth month, etc. ? Why, I read the codebook (look for “record layout”).


LIBNAME out "C:\Users\me\mydir\" ;
DATA  out.birth2014 ;
INFILE in OBS= 10000 ;
INPUT birthyr 9-12 bmonth 13-14 dob_tt 19-22 dob_wk 23 bfacil 32
mager 75-76 mrace6 107 mhisp_r 115 mar_p $ 119 dmar 120
meduc 124 bmi 283-286 cig1_R 262 cig2_R 263 cig3_r 264 dbwt 504-507 bwtr4 511;


STEP 3: RECODE MISSING DATA FIELDS. Think how much it would screw up your results if you did not recode 9999 for birthweight in grams, which means not that the child weighed about 22 pounds at birth but that the birthweight was missing. In every one of the variables below, the “maximum” value was actually the flag for missing data. How did I know this? You guessed it, I read the codebook. NOTE: these statements below are included in the data step.
IF bmi > 99 THEN bmi = . ;
if cig1_r = 6 then cig1_r = . ;
if cig2_r = 6 then cig2_r = . ;
if cig3_r = 6 then cig3_r = . ;
if dbwt = 9999 then dbwt = . ;
if bwtr4 = 4 then bwtr4 = . ;

STEP 4: LABEL THE VARIABLES – Six months from now, you’ll never remember what dmar is.

NOTE: these statements below are also included in the data step.
LABEL mager = "Mom/age"
bfacil = "Birth/facility" mrace6 = "Mom/race" mhisp_r = "Mom/Hispanic"
dmar = "Marital/Status" meduc = "Mom/Educ"
cig1_r = "Smoke/Tri1" cig2_r = "Smoke/Tri2" cig3_r = "Smoke/Tri3";

So, that’s it. Now I have a data set with 10,000 records and 19 variables that I can upload for my students to analyze.
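Before handing the file to anyone, it is worth a quick check that the recodes and labels took. Here is a minimal sketch, assuming the out library and the birth2014 data set were created as in the steps above.

```sas
/* List the variables with their types, lengths and labels */
proc contents data=out.birth2014;
run;

/* After recoding, the maximums should no longer be the missing-data flags */
proc means data=out.birth2014 n nmiss min max;
   var bmi cig1_r cig2_r cig3_r dbwt bwtr4;
run;
```

If the maximum for dbwt still shows 9999, the recode statements probably ended up outside the data step.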

I was going to write about prevalence and incidence, and how so-called simple statistics can be much more important than people think and are vastly under-rated. It was going to be cool. Trust me.

In the process, I ran across two things even more important or cooler (I know, hard to believe, right?)


The Spoiled One & Co. find this hard to believe also

Here’s what happened … I thought I would use the sashelp library that comes with SAS On-Demand for Academics -and just about every other flavor of SAS – for examples of difference in prevalence. Since no documentation of all the data sets showed up on the first two pages of Google, and one is prohibited by international law from looking any further, I decided to use SAS to see something about the data sets.

Herein was revealed cool thing #1 – I know about the tasks in SAS Studio but I never really do much with these. However, since I’m teaching epidemiology this spring, I thought it would be good to take a look. You should do this any time you have a data set. I don’t care if it is your own or if it was handed down to you by God on Mount Sinai.


Okay, I totally take that back. If your data was handed down to you by God on Mount Sinai, you can skip this step, but only then.

At this point, Buddhists and Muslims are saying, “What the fuck?” and Christians are saying, “She just said, ‘fuck’! Right after a Biblical reference to Moses, too!”

This is why this blog should have some adult supervision.

But I digress. Even more than usual.

KNOW YOUR DATA! I don’t mean in the Biblical sense, because I’m not sure how that is even possible, but in the statistical sense. This is the important thing. I don’t care how amazingly impressive the statistical analyses are that you do, if you don’t know your data intimately (there’s that Biblical thing again) you may as well make your decisions by randomly throwing darts at a dartboard. I once told some people at the National Institutes of Health that’s how I arrived at the p-value for a test. For the record, the Men in Black have more of a sense of humor about these things than the NIH.

Ahem, so … if you are using SAS Studio, here is a lovely annotated image of what you are going to do.


1. Click in the left window pane where it says TASKS on the arrow to bring up a drop down menu


2. Click on the arrow next to Data and then select the Characterize Data task. (You might say that was 2 AND 3 if you were a smart ass and who asked you, anyway?)

3. Click the arrow next to the word DATA in the window pane second from the left and it will give you a choice of the available directories. (NOTE: If you are going to use directories not provided by SAS you’ll need a LIBNAME statement in an earlier step but we’re not and you don’t in this particular example.) Under the directory, you will be able to select your file, in this case, I want to look at birthweight.




4. Next to the word VARIABLES, click the + and it will show the variables in the data set. You can hold down the shift key and select more than one. You should do this for all of the variables of interest. In my case, I selected all of the variables – there aren’t many in this dataset.


5. To run your program, click the little running guy at the top of the window. This will give you – ta- da



Let’s notice something here – the mother’s age ranges from -9 (seriously? What’s that all about?) to 18. Is this a study of teenage mothers or what? The answer seems to be “what” because the mean age is .416. Say, what? The mother’s educational level ranges from 0 to 3, which probably refers to something but I’ll bet it’s not years of education.
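If you prefer code to tasks, a rough equivalent for the two variables that look odd can be sketched with PROC MEANS and PROC FREQ. This is not the code the task generates, just a simple stand-in that uses the MomAge and MomEdLevel variables in sashelp.bweight.

```sas
/* Range and mean for the suspicious variables */
proc means data=sashelp.bweight n mean min max;
   var MomAge MomEdLevel;
run;

/* A coded variable is easier to read as a frequency table */
proc freq data=sashelp.bweight;
   tables MomEdLevel;
run;
```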


In a class of 50 students, inevitably, one or two will turn in a paper saying,

“The average mother had 1.22 years of education.”

WHAT? Are you even listening to yourself? Those students will defend themselves by saying that is what “SAS” told them.

According to the SAS documentation, these data are from the 1997 records of the National Center for Health Statistics.

I ran the following code:

proc print data=sashelp.bweight (obs=5) ;
run ;

And either it’s the same data set or there was an amazing coincidence that all of the data in the first five records was the same.

However, because I really need to get a hobby, I went and found the documentation for the Natality Data set from 1997 and it did not match up with the coding here. This led me to conclude that either:

a. SAS Institute re-coded these data for purposes of their own, probably instructional,

b. This is actually some other 1997 birthweight data set from NCHS, in which case, those people have far too much time on their hands.

c. SAS is probably using secretly coded messages in these data to communicate with aliens.


Not being willing to chance it, I went to the NCHS site and downloaded the 2014 birth statistics data set and codebook so I could make my own example data.

So … what have we learned today ?

  1. The TASKS in SAS Studio can be used to analyze your data with no programming required.
  2. It is crucial to understand how your data are coded before you go making stupid statements like the average mother is 3 months old.
  3. You can download birth statistics data and other cool stuff from the National Center for Health Statistics site.
  4. The Spoiled One uses any phone not protected by National Security Council level of encryption for photos of herself.


Want to make your children as smart as you?


Get them 7 Generation Games. Learn math. Learn social studies. Learn not to fall into holes.

Runs on Mac and Windows for under ten bucks.



I’ll be teaching a graduate course in epidemiology in the spring and giving a talk on biostatistics at SAS Global Forum in April, so I thought I’d jump ahead and start rambling on about it now.

When I tell people that I teach epidemiology, the first question I usually get is,

What’s epidemiology?

In short, epidemiology is quantifying disease. There are five ways (at least) statistics can be applied to the study of disease:

  1. How common is it? This is a question of prevalence (how likely you are to have it) and incidence (how likely you are to get it). If you think those two are the same, you should take a course in epidemiology. Or, if you are busy, you can just read my blog post tomorrow or this paper from the Western Users of SAS Software Conference by Chu and Xie (2013).
  2. What causes it? What are the factors that increase (or decrease) your risk of contracting a disease? My first thought here is PROC LOGISTIC. It’s not my only thought, but my first one.
  3. What pattern(s) does it follow? What is the prognosis? Are you likely to die of it quickly, eventually or never? To determine if a treatment is effective for cancer of the eyelashes, we need to first have an idea of what the probability of disability or death is when one is left untreated and over how long of a period of time, that is, what is the “natural progression” of a disease? PROC LIFETEST and PROC PHREG leap to mind here.
  4. How effective are attempts to prevent or treat a disease? Several options suggest themselves here – PHREG for comparing how long people survive under different conditions, LOGISTIC for testing for significant differences in the probability of death. You could even use ordinary least squares (OLS) regression methods if you were interested in a measure like quality of life scaled scores.
  5. Developing policies to minimize disease.

The last one might not sound like a strictly statistical task to you, but I would argue that it is, that a key feature, perhaps THE key feature of statistics, and what makes it different from pure mathematics, is the application to answer a question.

I would argue (and so would Leon Gordis, who, literally, wrote the book on Epidemiology ) that a major part of epidemiology is applying the results from those first four aspects to make decisions that benefit public health.

Why do developing countries have the types of public health problems that the U.S. and western Europe had over 100 years ago? Because, due to public health programs like improved sanitation and vaccines taken by all the people who were not raised by morons, we have greatly reduced tuberculosis and diarrheal diseases.

Okay, back to work, and more later as I work on my class notes and assignments.


Speaking of random amazingness …. Jarrad Connor, who I follow on twitter tipped me off to the unintended virtual pandemic in World of Warcraft as a model of how disease spreads.

—- My Day Job –

I make games that make people smarter. Learn math, learn history. Figure out why there is a muskrat in the middle of this basket.


Put “AnnMaria said so” in the comments section when you buy a game and I’ll have our staff send you a login for the beta version of Forgotten Trail as a bonus.



This is something that has bothered me for a long time. When I read tables and reports from the National Center for Education Statistics (yes, I do, don’t judge me!), it makes it sound as if all is well with rural education.

According to this table, for example, 100% of rural schools have at least some computers with Internet access and 90% of schools have wireless access at least somewhere in the building.


All is well in the heartland, yes?

As someone who has spent a good bit of my life freezing my ass in the heartland, I am pretty darn certain that all is NOT well when it comes to educational technology.

I read government pronouncements about what we should be doing and who we should be working with (reports I will not quote since I still maintain hopes of some day being funded by certain agencies) and I am convinced their recommendations will NOT work everywhere.

Let’s take the very different access to fiber optic networks in rural versus urban schools.

I read about and meet with Silicon Valley ed tech experts telling me that everything needs to be “in the cloud” and that our insistence on delivering software to schools via flash drives, downloads onto laptops or desktops, even apps installed on tablets, is outmoded.

Well, let me tell YOU something: 12 million children attend schools in rural communities. That’s 24% of the student population. Do you know what the difference is between cable and fiber optic speeds? It is so different that a game that takes less than two minutes to download in my office in Santa Monica takes two HOURS to download at many school sites in North Dakota. Of course, if your connection drops, you may end up starting all over again.

Since I don’t want to base my conclusion on a single state or my own experience, I asked a couple of teachers I know, in central Missouri and in central California, to try the same download and tell me how long it took them. In both cases, it was about an hour. Now, just think about this for a minute:

Download speeds are 30 to 60 times slower in many rural schools than in urban areas.
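If you want to check where that range comes from, here is a minimal sketch of the arithmetic in Python. The times are the ones quoted above. Treating “less than two minutes” as exactly two minutes is my assumption, so if anything the ratios understate the gap. The variable and function names are made up for illustration.

```python
# Download times in minutes, taken from the figures quoted in this post.
# Assumption: "less than two minutes" is treated as exactly 2 minutes.
urban_minutes = 2

rural_minutes = {
    "North Dakota": 120,       # "two HOURS"
    "central Missouri": 60,    # "about an hour"
    "central California": 60,  # "about an hour"
}


def slowdown(rural, urban=urban_minutes):
    """Return how many times longer the rural download takes."""
    return rural / urban


ratios = {site: slowdown(minutes) for site, minutes in rural_minutes.items()}

for site, ratio in ratios.items():
    print(f"{site}: about {ratio:.0f} times slower")
```

The two one-hour downloads give the low end of the range and the two-hour download in North Dakota gives the high end.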

As I said, this has been bugging me for a long time, because I have been working with rural schools for decades and just about everything I witnessed convinced me that this was a problem that many people at the national level never even considered.

So, I was actually relieved to come across this post by the FCC Chair, Thomas Wheeler, who noted, “41% of rural schools could not get fiber optic access if they tried”.

It’s not just me and the people that I know! The fact is that it’s really expensive to install fiber optic connections in rural communities. This isn’t to say that it will never happen, nor that some other, as yet unknown technology might not emerge to supplant fiber optics and solve the rural-urban digital divide when it comes to download speed. However, it definitely hasn’t happened YET.

This is why, although we are working on completely web-based games, we are ALSO continuing to make games that download once, or install from a flash drive, and then require minimal Internet speed to perform exceptionally well. Personally, I think 24% of the population is a whole lot of school children to leave behind.

—— Want to sponsor a classroom or school?

Click here and then pick the game you want to donate.


Or, for a measly $10 you can get a game for yourself or someone you forgot to give a Christmas gift. Tell them you celebrate on January 5th, so you are actually early.




When I first taught multivariate statistics, I was nervous. The material is more difficult than Statistics 101, so I assumed teaching the course would be more difficult as well. Over 25 years of teaching, I’ve found the opposite. The more advanced you get in a field, the easier the courses are to teach. You might expect it is because you have more motivated or capable students, and there is some of that effect. A bigger effect, I’ve found, is that once students have the basic concepts, you have something to generalize from. Also, you have a common vocabulary. It’s much easier to explain that multiple regression is just simple regression with multiple predictor variables than to explain what regression is to someone who has never been exposed to the concepts of correlation and regression.

I’m in the middle of making a game to teach statistics to middle school students and was thinking how to explain to them why what they are learning is important and how to explain statistics to someone who has never been exposed to the idea. On top of this challenge is the fact that I know many of the students playing our games will be limited in English proficiency, either because it is their second language or simply because they have a limited vocabulary.

Why learn statistics? Did you even know that the type of mathematics you are learning at the moment has its own name? If you did, pat yourself on the back for being smart. Go ahead, I’ll wait.

Statistics is the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of making inferences.

We’re going to break down that definition.

Collecting numerical data.

Collecting: bringing or gathering together.  Notice people don’t have a collection of one thing!

Numerical: Numbers that have meaning as a measurement. The fact that 1 bass can feed 2 people is numerical data.

Data: Facts or figures from which conclusions can be drawn.

Analyzing: looking at in detail, examining the basic parts – like looking at each category of animal and how many people it can feed

Let’s take the example of the Mayans hunting, using this graph that shows how many people you can feed with each type of animal.


Based on the data that you have, you know you can feed more people from a peccary than from a bass, so you could draw the conclusion that an area with a lot of peccaries would be a better place to look for food than one with a lot of bass.

This is what a peccary looks like, in case you were wondering.

[Image: a peccary, which looks like a wild pig]

Here is what is important to know about the science of collecting and analyzing numerical data – you are making decisions based on facts.

Why on earth would you hunt peccary? They can be dangerous if threatened, and trying to kill one and eat it is certainly threatening it.

On the other hand, no one ever got injured by a bass, as far as I know.

[Image: a bass]

As you can see from the graph above, you can feed 9 times as many people from a peccary, so maybe it is worth the risk.


You’re just learning to be a baby statistician at this point, working with really small quantities of data.

The same methods (using bar graphs, computing the mean and analyzing variability) are used everywhere with huge amounts of data. The military uses statistics for everything from figuring out how many tanks they need to order to deciding when to move soldiers from one part of the country to another. One of the first uses of statistics was in agriculture, to decide what was working to raise more corn and what wasn’t. You’ll get to see for yourself when you get to the floating gardens of the Aztecs.


Here’s my question to you, oh reader people: what resources have you found useful for teaching statistics? I mean resources you have really watched or used and thought, “Hey, this would be great for teaching!”

There is a lot of mediocre, boring stuff on the interwebz and if any of you could point me to what you think rises above the rest, I’d be super appreciative.


If you want to check out our previous games that teach multiplication and division (Spirit Lake) or fractions (Fish Lake), you can see them here. If you buy a game this month, you can get our newest game, Forgotten Trail (fractions and statistics), as a free bonus.


