On Twitter, there were a few comments from people who said they didn’t like to take interns because “More than doing work, they want to watch me work.”

I see both sides of that. You’re busy. You’re not Netflix. I get it. On the other hand, that’s a good way to learn.

The data are part of the evaluation of the effectiveness of 7 Generation Games in increasing math achievement scores. You can read more about us here. Below is a sneak peek of the artwork from the level we are creating now for Forgotten Trail.

characters from Forgotten Trail in Maine

So, here you go. I’m starting on a data analysis project today and I thought I’d give you the blow by blow.

phpmyadmin

It just so happens that the first several steps are all point-y and click-y. You could do it other ways but this is how I did it today. So, step one, I went to phpMyAdmin on the server where the data were saved and clicked Export.


For the format to export, I selected CSV and then clicked on the Go button. Now I have it downloaded to my desktop.

import data

Step 3: I opened SAS Enterprise Guide and selected Import Data.  I could have done this with SAS and written code to read the file, but, well, I didn’t. Nope, no particular reason, just this isn’t a very big data set so I thought, what the heck, why not.

boxes to check in import data menu

Step 4: DO NOT ACCEPT THE DEFAULTS!  Since I have a comma-delimited file with no field names, I need to uncheck the box that says File contains field names on record number. SAS conveniently shows you the data below, so I can see that it is comma-delimited. I know I selected CSV but it’s always good practice to check. I can also see that the data start at the first record, so I want to change the value in Data records start at record number to 1.

changing names

Step 5: Change the names. I am never going to remember what F1, F2, etc. are, so for the first 5, I click on the row and edit the names to be the name and label I want.

That’s it. Now I click the next button on the bottom of the screen until SAS imports my data.
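If you’d rather not point and click, the same import can be sketched in code with PROC IMPORT. The file name and path here are made up for illustration, and note that PROC IMPORT will name the variables VAR1, VAR2 and so on, where the Import Data task used F1, F2. GETNAMES=NO and DATAROW=1 correspond to the two boxes I changed in the menu:

```
/* A CODE EQUIVALENT OF THE IMPORT DATA STEPS ABOVE.
   THE PATH AND FILE NAME ARE ASSUMPTIONS - USE WHEREVER YOU SAVED THE CSV */
PROC IMPORT DATAFILE = "C:\Users\me\Desktop\pretest.csv"
    OUT = pretest
    DBMS = CSV
    REPLACE ;
    GETNAMES = NO ;   /* THE FILE HAS NO FIELD NAMES ON THE FIRST RECORD */
    DATAROW = 1 ;     /* DATA START AT THE FIRST RECORD */
RUN ;
```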

I could have continued changing all of the variable names, because I KNOW down the line I am not going to remember that F6 is actually the first question or that F25 is question 28a. However, I wanted to do some other things that I thought would be easier to code, so I opened up a program file in SAS Enterprise guide and wrote some code.

/* THIS CREATES TWO ARRAYS BECAUSE I AM TOO LAZY
   TO RENAME 32 QUESTIONS INDIVIDUALLY.
   THE PRETEST DATA SET WAS CREATED BY THE STEPS ABOVE USING IMPORT DATA */

data pretest2 ;
    set pretest ;
    ** NOTE THAT THERE IS A $ AFTER THE NUMBER OF ELEMENTS IN THE ARRAY
       BECAUSE THIS IS A CHARACTER ARRAY ;
    array ren{32} $ f6-f37 ;
    array qs{32} $ q1-q27 q28a q28b q28c q29 q30 ;
    do i = 1 to 32 ;
        qs{i} = ren{i} ;
    end ;

    ** YOU CAN ALSO USE A RENAME STATEMENT TO RENAME THE SAME VARIABLES ;
    rename f38 = date_test ;
    *** SINCE I NO LONGER NEED THE VARIABLES F6-F37 OR THE INDEX VARIABLE FOR
        THE ARRAY, I DROP THEM HERE ;
    drop f6-f37 i ;
run ;

*** SOME STUDENTS SAVED THE TEST MORE THAN ONCE BECAUSE THEY SAVED BEFORE THEY
    WERE DONE AND AGAIN AT THE END, SO I SORT BY USERNAME AND TEST DATE AND
    KEEP ONLY THE LAST ONE ;

proc sort data = pretest2 ;
    by username date_test ;
run ;

*** THIS KEEPS JUST THE LATEST TEST DATE. ALSO, WE TESTED THIS 45 TIMES IN
    THE PROCESS OF GETTING READY FOR USE IN THE SCHOOLS. ALL OF OUR STAFF USED
    USERNAMES WITH "TEST" SO I USED THE INDEX FUNCTION TO FIND IF THERE WAS A
    "TEST" IN THE USERNAME AND, IF SO, DELETED THAT RECORD ;

data pretest2 ;
    set pretest2 ;
    by username date_test ;
    if last.username ;
    if index(username,"TEST") > 0 then delete ;
run ;

Okay, that’s it. Now I have my data all ready to analyze. Pretty painless, isn’t it?

Want to learn more about SAS?

Here is a good paper on Arrays made easy .

If you’re interested in character functions like index, here is a good paper by Ron Cody.

How common is a particular disease or condition? It depends on what you mean by common. Do you mean how many people have a condition, or do you mean how many new cases of it are there each year?

In other words, are you asking about the probability of you HAVING a disease or of you GETTING a disease?

Yes, I mentioned Down syndrome in the title and I am going to use it as an example. One could argue (and I would agree) that Down syndrome is not, strictly speaking, a disease. That relates to a second point, though, which is that incidence and prevalence are terms and techniques that can be applied not just to disease but to chromosomal disorders and even to risk factors, such as smoking. Now I’m getting ahead of myself, which just shows that one should pipe down and pay attention.

INCIDENCE RATE is the rate at which new cases are occurring.

I downloaded a data set of 40,002 births that occurred in U.S. territories. Did you know that the U.S. administers 16 territories around the world? Well, now you do. Bet you can’t name more than four of them. I’m not sure whether this is a good deal for the people of the territories or not, but I am sure they had 40,000 babies.

Finding the incidence rate for Down syndrome was super-duper easy.

I made a birthweight data set, but unlike the last one, I selected an additional variable, ca_down, and then I recoded that variable.

FILENAME in "C:\Users\you\Downloads\Nat2014us\Nat2014PublicUS.c20150514.r20151022.txt" ;
LIBNAME out "C:\Users\me\mydir\" ;

DATA out.birth2014 ;
    INFILE in ;
    INPUT birthyr 9-12 bmonth 13-14 dob_tt 19-22 dob_wk 23 bfacil 32
          mager 75-76 mrace6 107 mhisp_r 115 mar_p $ 119 dmar 120
          meduc 124 bmi 283-286 cig1_r 262 cig2_r 263 cig3_r 264 dbwt 504-507 bwtr4 511
          ca_down $ 552 ;
    if ca_down in ("C","P") then down = "Y" ;
    else down = "N" ;
RUN ;

You see, the Down syndrome variable at birth is coded C for Confirmed, N for No and P for Pending. The “pending” means that the medical personnel suspect Down syndrome but are waiting for the results of a chromosomal analysis; “confirmed” means the analysis has confirmed a diagnosis of Down syndrome. Since I presume that most experienced medical personnel recognize Down syndrome, I combined those two categories (which, if you read the codebook, you’ll find the NCHS folks did, too).

Then I did

PROC FREQ DATA = out.birth2014 ;
    TABLES down ;
RUN ;

And the result was

down    Frequency    Percent    Cumulative Frequency    Cumulative Percent
N       39963        99.90      39963                    99.90
Y       39            0.10      40002                   100.00

The incidence rate is about 0.10%, or roughly 1 per 1,000.

That’s all an incidence rate is: total number of new cases / number at risk.

The number at risk of Down syndrome would be all babies born, all 40,002 of them. The number of new cases was 39. According to the World Health Organization, the incidence of Down syndrome is between 1 in 1,000 and 1 in 1,100 live births.
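If you want SAS to do that arithmetic for you, here is a one-step sketch using the numbers from the PROC FREQ output above:

```
/* INCIDENCE RATE = TOTAL NUMBER OF NEW CASES / NUMBER AT RISK */
data _null_ ;
    incidence = 39 / 40002 ;
    per_1000 = incidence * 1000 ;
    put "Incidence rate: " incidence 8.5 " or about " per_1000 4.2 " per 1,000 births" ;
run ;
```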

So, whatever other fallout there may be of living in a U.S. territory, it doesn’t seem to carry with it any greater incidence of Down syndrome births.

When discussing incidence, a condition like Down syndrome is easy to start with because you have it when you are born, you don’t develop it later. That is not the case with every health condition, though. That’s another blog for another day.

Here’s an interesting note: as often occurs, the most complicated part of this analysis was getting the data. After that, it was easy.

 

—– 

Feel smarter? Want smarter kids? Check out 7 Generation Games. It’s what I do when I’m not blogging.

Boy walking in rain

If you buy a game this month, we’ll send you a log in for our game in beta, Forgotten Trail, for free.

If you donate games to a school or classroom, we’ll give them FOUR games for the year, instead of two, because we have two new awesome games coming out this year.

Even though Rick Wicklin (buzzkill!) disabused me of the concern that SAS was communicating with aliens through the obscure coding in its sashelp data sets, I still wanted to roll my own.

If you, too, feel more comfortable with a data set you have produced yourself, let me give you a few tips.

  • There is a wealth of data available online, much of it thanks to your federal government. For example, I downloaded the 2014 birth data from the Centers for Disease Control and Prevention site. They have a lot of interesting public use data sets.
  • Read the documentation! The nice thing about the CDC data sets is that they include a codebook. This particular one is 183 pages long. Not exciting reading, I know, but I’m sitting here right now watching a documentary where some guy is up to his elbows in an elephant’s placenta (I’m seriously not making this up) and that doesn’t look like a bowl of cherries, either.
  • Assuming you did not read all of the documentation even though I told you it was important (since I have raised four daughters, I know all about people not paying attention to my very sound advice), at a MINIMUM you need to read three things: 1) the sampling information, to find out if there are any problems with selection bias or any sampling weights you need to know about, 2) the record layout (otherwise, how the hell are you going to read in the data?), and 3) the coding scheme.

Let’s take a look at the code to create the data set I want to use in examples for my class. Uncompressed, the 2014 birth data set is over 5 GB, which exceeds the limit for upload to SAS On-Demand for Academics and also isn’t the best idea for a class data set for a bunch of reasons, a big one being that I teach students distributed around the world and on ships at sea (for real), and having them access a 5 GB data set isn’t really feasible.

 

I’m going to assume you downloaded the file into your downloads folder and then unzipped it.

STEP 1: CHECK OUT THE DATA SET

Since I READ the codebook (not being a big hypocrite), I saw that the record length is 775 and there are nearly 4 million records in the data set. Opening it up in SAS Enterprise Guide or the Explorer window didn’t seem a good plan. My first step, then, was to use a FILENAME statement to refer to the data I downloaded, which happens to be a text file.

I just want to take a look at the first 10 records to see that it is what it claims to be in the codebook. (No, I never DO trust anyone.)

The default length for a SAS variable is 8.

I’m going to give the variable a length of 775 characters.

Notice that the INFILE statement has to refer back to the reference used in the FILENAME statement, which I very uncreatively named “in” . Naming datasets, variables and file references is not the place for creativity. I once worked with a guy who named all of his data sets and variables after cartoon characters – until all of the other programmers got together and killed him.

Dead programmers aside, pay attention to that OBS=10 unless you really want to look at 3,998,175 records. The OBS=10 option will limit the number of records read to – you guessed it – 10.

With the INPUT statement, I read in from position 1-775 in the text file.

All of this just allows me to look at the first 10 records without having to open a file of almost 4 million records.

FILENAME in "C:\Users\you\Downloads\Nat2014us\Nat2014PublicUS.c20150514.r20151022.txt" ;

DATA example ;
    LENGTH var1 $ 775 ;
    INFILE in OBS = 10 ;
    INPUT var1 1-775 ;
RUN ;

STEP 2: READ IN A SUBSET OF THE DATA

Being satisfied with my first look, I went ahead to create a permanent data set. I needed a LIBNAME statement to specify where I wanted the data permanently stored.

The out in the LIBNAME and DATA statements need to match. It could be anything, it could be komodo or kangaroo, as long as it’s the same word in both places. So … my birth2014 data set will be stored in the directory specified.

How do I know what columns hold the birth year, birth month, etc. ? Why, I read the codebook (look for “record layout”).

 

LIBNAME out "C:\Users\me\mydir\" ;

DATA out.birth2014 ;
    INFILE in OBS = 10000 ;
    INPUT birthyr 9-12 bmonth 13-14 dob_tt 19-22 dob_wk 23 bfacil 32
          mager 75-76 mrace6 107 mhisp_r 115 mar_p $ 119 dmar 120
          meduc 124 bmi 283-286 cig1_r 262 cig2_r 263 cig3_r 264 dbwt 504-507 bwtr4 511 ;

 

STEP 3: RECODE MISSING DATA FIELDS. Think how much it would screw up your results if you did not recode 9999 for birthweight in grams, which means not that the child weighed almost 20 pounds at birth but that the birthweight was missing. In every one of the variables below, the “maximum” value was actually the flag for missing data. How did I know this? You guessed it, I read the codebook. NOTE: the statements below are included in the data step.
IF bmi > 99 THEN bmi = . ;
if cig1_r = 6 then cig1_r = . ;
if cig2_r = 6 then cig2_r = . ;
if cig3_r = 6 then cig3_r = . ;
if dbwt = 9999 then dbwt = . ;
if bwtr4 = 4 then bwtr4 = . ;
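If you had a longer run of variables that all shared the same missing-data flag, those IF statements could be collapsed with an array. Here is a sketch for the three smoking variables, which all use 6 as the flag (this goes inside the same data step):

```
/* SAME RECODING AS THE THREE cig IF STATEMENTS ABOVE, AS AN ARRAY */
array cigs{3} cig1_r cig2_r cig3_r ;
do c = 1 to 3 ;
    if cigs{c} = 6 then cigs{c} = . ;
end ;
drop c ;
```

With only three variables it’s a wash, but with thirty it saves real typing.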

STEP 4: LABEL THE VARIABLES – Six months from now, you’ll never remember what dmar is.

NOTE: these statements below are also included in the data step.
LABEL mager = "Mom/age"
      bfacil = "Birth/facility" mrace6 = "Mom/race" mhisp_r = "Mom/Hispanic"
      dmar = "Marital/Status" meduc = "Mom/Educ"
      cig1_r = "Smoke/Tri1" cig2_r = "Smoke/Tri2" cig3_r = "Smoke/Tri3" ;
RUN ;

So, that’s it. Now I have a data set with 10,000 records and 19 variables that I can upload for my students to analyze.

I was going to write about prevalence and incidence, and how so-called simple statistics can be much more important than people think and are vastly under-rated. It was going to be cool. Trust me.

In the process, I ran across two things even more important or cooler (I know, hard to believe, right?)

Julia and company find this hard to believe

The Spoiled One & Co. find this hard to believe also

Here’s what happened … I thought I would use the sashelp library that comes with SAS On-Demand for Academics -and just about every other flavor of SAS – for examples of difference in prevalence. Since no documentation of all the data sets showed up on the first two pages of Google, and one is prohibited by international law from looking any further, I decided to use SAS to see something about the data sets.

Herein was revealed cool thing #1: I know about the tasks in SAS Studio but I never really do much with them. However, since I’m teaching epidemiology this spring, I thought it would be good to take a look. You should do this any time you have a data set. I don’t care if it is your own or if it was handed down to you by God on Mount Sinai.

Moses with tablets

Okay, I totally take that back. If your data was handed down to you by God on Mount Sinai, you can skip this step, but only then.

At this point, Buddhists and Muslims are saying, “What the fuck?” and Christians are saying, “She just said, ‘fuck’! Right after a Biblical reference to Moses, too!”

This is why this blog should have some adult supervision.

But I digress. Even more than usual.

KNOW YOUR DATA! I don’t mean in the Biblical sense, because I’m not sure how that is even possible, but in the statistical sense. This is the important thing. I don’t care how amazingly impressive the statistical analyses are that you do, if you don’t know your data intimately (there’s that Biblical thing again) you may as well make your decisions by randomly throwing darts at a dartboard. I once told some people at the National Institutes of Health that’s how I arrived at the p-value for a test. For the record, the Men in Black have more of a sense of humor about these things than the NIH.

Ahem, so … if you are using SAS Studio, here is a lovely annotated image of what you are going to do.


1. Click the arrow in the left window pane where it says TASKS to bring up a drop-down menu.


2. Click on the arrow next to Data and then select the Characterize Data task. (You might say that was 2 AND 3 if you were a smart ass and who asked you, anyway?)

3. Click the arrow next to the word DATA in the window pane second from the left and it will give you a choice of the available directories. (NOTE: If you are going to use directories not provided by SAS you’ll need a LIBNAME statement in an earlier step but we’re not and you don’t in this particular example.) Under the directory, you will be able to select your file, in this case, I want to look at birthweight.


4. Next to the word VARIABLES, click the + and it will show the variables in the data set. You can hold down the shift key and select more than one. You should do this for all of the variables of interest. In my case, I selected all of the variables – there aren’t many in this dataset.


5. To run your program, click the little running guy at the top of the window. This will give you – ta- da

RESULTS!


Let’s notice something here – the mother’s age ranges from -9 (seriously? What’s that all about?) to 18. Is this a study of teenage mothers or what? The answer seems to be “what” because the mean age is .416. Say, what? The mother’s educational level ranges from 0 to 3, which probably refers to something but I’ll bet it’s not years of education.
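If you would rather type than click, a quick range check like the one the Characterize Data task produces can be sketched in two lines with PROC MEANS, here run against the sashelp birthweight data set:

```
/* QUICK SANITY CHECK: N, MIN, MAX AND MEAN FOR EVERY NUMERIC VARIABLE */
proc means data = sashelp.bweight n min max mean ;
run ;
```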

 

In a class of 50 students, inevitably, one or two will turn in a paper saying,

“The average mother had 1.22 years of education.”

WHAT? Are you even listening to yourself? Those students will defend themselves by saying that is what “SAS” told them.

According to the SAS documentation, these data are from the 1997 records of the National Center for Health Statistics.

I ran the following code:


proc print data=sashelp.bweight (obs=5) ;
run ;

And either it’s the same data set or there was an amazing coincidence that all of the data in the first five records was the same.

However, because I really need to get a hobby, I went and found the documentation for the Natality Data set from 1997 and it did not match up with the coding here. This led me to conclude one of the following:

a. SAS Institute re-coded these data for purposes of their own, probably instructional,

b. This is actually some other 1997 birthweight data set from NCHS, in which case, those people have far too much time on their hands.

c. SAS is probably using secretly coded messages in these data to communicate with aliens.

Julia as fat alien

Not being willing to chance it, I went to the NCHS site and downloaded the 2014 birth statistics data set and codebook so I could make my own example data.

So … what have we learned today?

  1. The TASKS in SAS Studio can be used to analyze your data with no programming required.
  2. It is crucial to understand how your data are coded before you go making stupid statements like the average mother is 3 months old.
  3. You can download birth statistics data and other cool stuff from the National Center for Health Statistics site.
  4. The Spoiled One uses any phone not protected by National Security Council level of encryption for photos of herself.

———

Want to make your children as smart as you?

Player needing help

Get them 7 Generation Games. Learn math. Learn social studies. Learn not to fall into holes.

Runs on Mac and Windows for under ten bucks.

 

 

I’ll be teaching a graduate course in epidemiology in the spring and giving a talk on biostatistics at SAS Global Forum in April, so I thought I’d jump ahead and start rambling on about it now.

When I tell people that I teach epidemiology, the first question I usually get is,

What’s epidemiology?

In short, epidemiology is quantifying disease. There are five ways (at least) statistics can be applied to the study of disease:

  1. How common is it? This is a question of prevalence (how likely you are to have it) and incidence (how likely you are to get it). If you think those two are the same, you should take a course in epidemiology. Or, if you are busy, you can just read my blog post tomorrow or this paper from the Western Users of SAS Software Conference by Chu and Xie (2013).
  2. What causes it? What are the factors that increase (or decrease) your risk of contracting a disease? My first thought here is PROC LOGISTIC. It’s not my only thought, but my first one.
  3. What pattern(s) does it follow? What is the prognosis? Are you likely to die of it quickly, eventually or never? To determine if a treatment is effective for cancer of the eyelashes, we first need an idea of the probability of disability or death when it is left untreated, and over how long a period of time; that is, the “natural progression” of the disease. PROC LIFETEST and PROC PHREG come to mind here.
  4. How effective are attempts to prevent or treat a disease? Several options suggest themselves here – PHREG for comparing how long people survive under different conditions, LOGISTIC for testing for significant differences in the probability of death. You could even use ordinary least squares (OLS) regression methods if you were interested in a measure like quality of life scaled scores.
  5. Developing policies to minimize disease.
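To make #2 a little more concrete, here is a minimal PROC LOGISTIC sketch. The data set and variable names are hypothetical, just to illustrate the shape of the analysis:

```
/* DOES SMOKING PREDICT LOW BIRTHWEIGHT? (HYPOTHETICAL NAMES) */
proc logistic data = mydata.births descending ;
    class smoker (ref = "N") / param = ref ;
    model low_bwt = smoker mom_age ;
run ;
```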

The last one might not sound like a strictly statistical task to you, but I would argue that it is: a key feature, perhaps THE key feature, of statistics, and what makes it different from pure mathematics, is its application to answer a question.

I would argue (and so would Leon Gordis, who, literally, wrote the book on Epidemiology ) that a major part of epidemiology is applying the results from those first four aspects to make decisions that benefit public health.

Why do developing countries have the types of public health problems that the U.S and western Europe had over 100 years ago? Because, due to public health programs like improved sanitation and vaccines taken by all the people who were not raised by morons, we have greatly reduced tuberculosis and diarrheal diseases.

Okay, back to work, and more later as I work on my class notes and assignments.

—–

Speaking of random amazingness … Jarrad Connor, whom I follow on Twitter, tipped me off to the unintended virtual pandemic in World of Warcraft as a model of how disease spreads.

—- My Day Job –

I make games that make people smarter. Learn math, learn history. Figure out why there is a muskrat in the middle of this basket.

muskrat in a basket

Put “AnnMaria said so” in the comments section when you buy a game and I’ll have our staff send you a login for the beta version of Forgotten Trail as a bonus.

 

 

When I first taught multivariate statistics, I was nervous. The material is more difficult than Statistics 101 so I assumed teaching the course would be more difficult as well.  Over 25 years of teaching, I’ve found the opposite. The more advanced you get in a field, the easier the courses are to teach. You might expect it is because you have more motivated or capable students, and there is some of that effect. A bigger effect, I’ve found, is because once students have the basic concepts you have something to generalize from. Also, you have a common vocabulary. It’s much easier to explain that multiple regression is just simple regression with multiple predictor variables than to explain what regression is to someone who has never been exposed to the concepts of correlation and regression.

I’m in the middle of making a game to teach statistics to middle school students and was thinking how to explain to them why what they are learning is important and how to explain statistics to someone who has never been exposed to the idea. On top of this challenge is the fact that I know many of the students playing our games will be limited in English proficiency, either because it is their second language or simply because they have a limited vocabulary.

Why learn statistics? Did you even know that the type of mathematics you are learning at the moment has its own name? If you did, pat yourself on the back for being smart. Go ahead, I’ll wait.

Statistics is the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of making inferences.

We’re going to break down that definition.

Collecting numerical data.

Collecting: bringing or gathering together.  Notice people don’t have a collection of one thing!

Numerical: Numbers that have meaning as a measurement. The fact that 1 bass can feed 2 people is numerical data.

Data: facts or figures from which conclusions can be drawn.

Analyzing: looking at something in detail, examining the basic parts, like looking at each category of animal and how many people it can feed.

Let’s take the example of the Mayans hunting, using  this graph that shows how many people you can feed with each type of animal.

Mayan hunting graph

Based on the data that you have, you know you can feed more people from a peccary than from a bass, so you could draw the conclusion that an area with a lot of peccaries would be a better place to look for food than one with a lot of bass.

This is what a peccary looks like, in case you were wondering.

peccary looks like a wild pig

Here is what is important to know about the science of collecting and analyzing numerical data – you are making decisions based on facts.

Why on earth would you hunt peccary? They can be dangerous if threatened, and trying to kill one and eat it is certainly threatening it.

On the other hand, no one ever got injured by a bass, as far as I know.

As you can see from the graph above, you can feed 9 times as many people from a peccary, so maybe it is worth the risk.

 

You’re just learning to be a baby statistician at this point, working with really small quantities of data.

The same methods using bar graphs, computing the mean and analyzing variability are used everywhere with huge amounts of data. The military uses statistics, for everything from figuring out how many tanks they need to order to deciding when to move soldiers from one part of the country to another. One of the first uses of statistics was for agriculture, to decide what was working to raise more corn and what wasn’t. You’ll get to see for yourself when you get to the floating gardens of the Aztecs.

—–

Here’s my question to you, oh reader people: what resources have you found useful for teaching statistics? I mean resources you have really watched or used and thought, “Hey, this would be great for teaching!”

There is a lot of mediocre, boring stuff on the interwebz and if any of you could point me to what you think rises above the rest, I’d be super appreciative.

—–

If you want to check out our previous games, which teach multiplication and division (Spirit Lake) or fractions (Fish Lake), you can see them here. If you buy a game this month you can get our newest game, Forgotten Trail (fractions and statistics), as a free bonus.

 

 

The results are in! The chart below gladdens my little heart, somewhat.

Graph showing significant improvement from pretest to posttest

One thing to note is the fact that the 95% confidence interval is comfortably above zero. Another point is that it looks like a pretty normal distribution.

What is it? It is the difference between pretest and post-test scores for 71 students at two small, rural schools who played Spirit Lake: The Game.

I selected these schools to analyze first, and held my breath. These were the schools we had worked with the most closely, who had implemented the games as we had recommended (play twice a week for 25-30 minutes). If it didn’t work here, it probably wasn’t going to work.

Two years ago, with a sample of 39 students from 4th and 5th grade from one school, we found a significant difference compared to the control group.

COULD WE DO IT AGAIN?

You probably don’t feel nervous reading that statement because you have not spent the last three years of your life developing games that you hope will improve children’s performance in math.

The answer, at least for the first group of data we have analyzed is – YES!

Scores improved 20% from pre-test to post-test. This was not as impressive as the improvement of 30% we had found in the first year, but this group also began with a substantially higher score. Two years ago, the average student scored 39% on the pre-test. This year, for 71 students with complete data, the average pre-test score was 47.9% and the post-test mean was 57.4%. I started this post saying my little heart was gladdened “somewhat” because I still want to see the students improve more.
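For the curious, a confidence interval like the one in the chart comes from a paired comparison of each student’s two scores. A sketch, assuming a data set with one record per student and variables named pretest and posttest:

```
/* PAIRED COMPARISON OF PRE- AND POST-TEST SCORES.
   DATA SET AND VARIABLE NAMES ARE ASSUMPTIONS */
proc ttest data = mydata.scores ;
    paired posttest * pretest ;
run ;
```

PROC TTEST prints the mean difference with its 95% confidence interval, along with a histogram of the differences, which is the kind of output shown in the chart above.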

There is a lot more analysis to do. For a start, there is analysis of data from schools who were not part of our study but who used the pretest and post-test – with them, we can’t really tell how the game was implemented but at least we can get psychometric data on the tests.

We have data on persistence – which we might be able to correlate with post-test data, but I doubt it, since I suspect students who didn’t finish the game probably didn’t take the post-test.

We have data on Fish Lake, which also looks promising.

Overall, it’s just a great day to be a statistician at 7 Generation Games.

buffalo in the winter

Here is my baby, Spirit Lake. It can be yours for ten bucks. If you are rocking awesome at multiplication and division, including with word problems, but you’d like to help out a kid or a whole classroom, you can donate a copy.

Some problems that seem really complex are quite simple when you look at them in the right way. Take this one, for example:

My hypothesis is that a major problem in math achievement is persistence. Students just give up at the first sign of trouble. I have three different data sets with student data from the Spirit Lake game. Many of the students in the student table are the control group, so they will have no data on game play. There is a table of answers to the math challenges and another table with answers to quizzes which students took only if they missed a math challenge. When students miss a math challenge in  the game, depending on which educational resource they choose, they may do one of two or three different quizzes to get back into the game.  Also, some of the quiz records were not from quizzes actually in the game but from supplemental activities we provided. So, how do I identify where in the process students drop out and present in a simple graphic to discuss with schools? Just to complicate matters, the username was different lengths in the different datasets and the variable for timestamp also had different names.

It turns out, the problem was not that difficult.

  1. Merge the student table with the answers (math challenges) and only include those students with at least one answer.
  2. Merge the student table with the quizzes and only include those students with at least one quiz
  3. Concatenate the data sets from steps 1 & 2
  4. Create a new userid variable and set it equal to the username
  5. Create a new “entered” variable and set it equal to whichever of the datetime fields exists on that record
  6. Delete the quizzes not included in the game.
  7. Sort the dataset by userid and the date and time entered.
  8. Keep the last record for each userid. Now you have their last date of activity.
  9. If there is a value for the math challenge field then that is the name of the last activity, otherwise the quiz name is the name for the last activity.
  10. Use a PROC FORMAT to assign each activity a value equal to the step in the game.
  11. Do a PROC FREQ using that format and the order = FORMATTED option.

Once I had the frequencies, I just put them into a table in a Word document and shaded the columns to match the percentage. There may be a way in SAS/GRAPH or something else to do this automatically, but honestly, the table took me two minutes once I had the data.
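If you did want SAS to draw it instead of shading cells in Word, one way would be PROC SGPLOT, using the retention data set and the $activity. format created in the code in this post:

```
/* A BAR CHART ALTERNATIVE TO THE SHADED WORD TABLE.
   ASSUMES THE retention DATA SET AND $activity. FORMAT EXIST */
proc sgplot data = retention ;
    hbar last_activity / stat = percent ;
    format last_activity $activity. ;
run ;
```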

graph showing students dropping out at each step

I think it illustrates my points pretty clearly, which are:

  • A sizable number of students drop out after the second problem.
  • 25% of the students drop after the first difficulty they have (missing the second problem)
  • Only a minority of students persist all the way to the end, less than 25% of the total sample

This isn’t based on a tiny sample, either. The data above represent a sample of 397 students.

In case you would like to see it, the code for steps 3-11 is below. Particularly useful is the PROC FORMAT. Notice that multiple values can map to the same formatted value, which was important here because players can take multiple paths that are still the same step in the sequence.

* Steps 3-6: concatenate, create common userid and timestamp, drop extra quizzes ;
data persist ;
  attrib userid length= $49 ;
  set mydata2.sl_answers mydata2.sl_quizzes ;
  entered = max(date_answered_dt,date_taken_dt) ;
  **** DELETES QUIZZES IN EXTRA AND SUMMER SITE, NOT IN MAIN GAME ;
  if quiztype in ("problemsolve","divide1long","multiplyby23") then delete ;
  userid = new_username ;
  format entered datetime20. ;
run ;

proc sort data=persist ;
  by quiztype ;
run ;

* Step 7: sort by user and time entered ;
proc sort data=persist ;
  by userid entered ;
run ;

* Steps 8-9: keep the last record per userid; name the last activity ;
data retention ;
  set persist ;
  by userid ;
  if last.userid ;
  attrib last_activity length= $14 ;
  if inputform ne "" then last_activity = inputform ;
  else last_activity = quiztype ;
run ;

proc freq data= retention ;
  tables last_activity ;
run ;

* Step 10: several activities can map to the same step in the game ;
proc format ;
  value $activity
    "findcepansi" = "01"
    "x2x9"        = "02"
    "math2x"      = "02"
    "math2_2"     = "02"
    "wolves1a"    = "02"
    "multiplyby5" = "03"
    "multiplyby4" = "03"
    "multiplyby3" = "04"
    "wolves1b"    = "05"
    /* .... AND SO ON .... */
    "horseform2"  = "21"
  ;
run ;

* Step 11: frequencies in game order, written to an RTF file ;
ods rtf file = "C:\Users\Spirit Lake\phaseII\pipeline.rtf" ;
proc freq data= retention order=formatted ;
  tables last_activity ;
  format last_activity $activity. ;
run ;
ods rtf close ;

Feel smarter after reading this blog?
Fish Lake artwork
Want to feel even smarter? Download and play our games! You can run around in our virtual world while reviewing your basic math skills. If you are too busy (seriously?), you can still give a game as a gift or donate a game to a classroom or school.

Let’s get this out right up front – I have no question that there is discrimination in the tech industry. I gave an hour-long talk on this very subject at MIT a couple of weeks ago, where I pointed out that everyone’s first draft of pretty much everything – your first game, your first database – is crap, and that some people we give encouragement while other people we give up on.

That’s not my point here. My point is that sometimes we are our own barriers by not applying to positions. Let me give you two examples.

First, as I wrote on my 7 Generation Games blog earlier, we reject disproportionately more male applicants, and yet our last four hires have all been men. This may change with the current positions (read on to find out why).

For the six positions we have advertised over the last couple of years, the application pool has looked like this:

GENDER    HIRED: Yes    HIRED: No
Male          4            18
Female        2             4

We had one woman apply for the previous internship position we advertised, and we ended up hiring a man. If you look at this table, the chance of a woman being hired – 1 in 3 – is greater than the chance of a man being hired – 1 in 5.5. Yet, we hired twice as many men as women.
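If you want to check the arithmetic from the table, a couple of lines of Python will do it:

```python
# Hiring chances computed from the applicant table above
male_hired, male_not = 4, 18
female_hired, female_not = 2, 4

p_male = male_hired / (male_hired + male_not)          # 4/22 -> about 1 in 5.5
p_female = female_hired / (female_hired + female_not)  # 2/6  -> 1 in 3

print(round(1 / p_female, 1), round(1 / p_male, 1))  # → 3.0 5.5
```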

Why is that? Because more men apply. More unqualified men apply, which explains our higher rejection rate. If we explicitly state, “Must work in office five days a week”, we will get men (but no women) applying who live in, say, Sweden, and want to know if maybe that is negotiable (no.)

banner

We have also recently filled 3 positions, and will soon fill two more, without advertising. In one of those cases, the person (a man) contacted us and convinced us that he could do great work. The other four positions were filled through personal contacts: we called people we knew who were knowledgeable in the field and asked for recommendations.

We happen to know a lot of people who are Hispanic and Native American, so 3 of those positions ended up going to extremely well-qualified people from those groups. The one woman we hired out of those five positions was actually recommended by my 82-year-old mother who said,

“Your cousin, Jean, is a graphic artist, you should check out her work.”

As you can see from the photo of the 6-foot banner she made for us, she does do good work.

I see two factors at work here:

  1. Women are less likely to nominate themselves. Men will apply even if their meeting the qualifications seems to be a stretch (or a delusion); women are less likely to do so. I don’t know why. Fear of rejection?
  2. People are recommended by their networks, and women seem to be less plugged into those networks. The same is true of minorities. We make no special effort to recruit Hispanic or Native American employees, but since that is a lot of who we know, it is a lot of who THEY know, and hence a lot of our referrals.

How do you increase your proportion of female applicants? You are going to laugh, because it is the simplest thing ever: this time around, I wrote a blog post and tweets that specifically encouraged women to apply. And it worked! Maybe you would have predicted that, but not me. I would never have guessed.

Do you really want to hire Latino graphic artists or software developers? Come to the next Latino Tech meetup. Bonus: the food is awesome.

meetup

My point, which you may have now despaired of me having, is that affirmative action is a good thing on both sides. By affirmative action I mean being pro-active. If you are from an under-represented group, APPLY. Invite yourself to the dance.  If you are an employer, reach out. It could be as easy as having a margarita during Hispanic Heritage Month or writing a blog post.

In both cases, you might be surprised how little effort yields big results.

Don’t forget to buy our games and play them. Fun! Plus, they’ll make you smarter.

man from Spirit Lake

You wouldn’t think there would be that much to say about scree plots. If so, you are like me, and sometimes wrong.

The problem I often have teaching is that I assume people know a lot more than is reasonable to expect for someone coming into a course. Sometimes, I’m like a toddler who thinks that because she knows what color hat the baby was wearing yesterday that you do, too.

Toddler with baby

 

So … a scree plot is a plot of the eigenvalues by the factor number. In the chart below, the first factor has an eigenvalue of about 5.5, while the eigenvalue of the second factor is around 1.5. (If you don’t know what an eigenvalue is, read this post.)

scree plot with bend in plot after second factor

 

As I mentioned in the previous post, an eigenvalue greater than 1 explains more than a single item, but as you can tell by looking at the plot, some of those eigenvalues are barely higher than one. Should you keep them? Or not?

What is scree, anyway? Scree is the pile of debris at the base of a cliff. In a scree plot, the real factors are at the top of the cliff, and the scree is the random factors at the bottom that you should discard. Based on this, you might decide you only have one real factor.

The idea is to discard all of the factors after the line starts to flatten out. But is that after one factor? Or does it kind of flatten out after four? Maybe?
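If you like rules of thumb, here are two common ones sketched in Python, using made-up eigenvalues roughly like the plot above: the Kaiser rule (keep factors with eigenvalues over 1) and a crude “biggest drop” elbow rule. Notice that they disagree, which is exactly the problem.

```python
# Made-up eigenvalues roughly like the scree plot above (not the real data)
eigenvalues = [5.5, 1.5, 1.2, 1.1, 0.9, 0.8, 0.7, 0.6]

# Kaiser rule: keep factors with eigenvalue > 1
kaiser = sum(1 for ev in eigenvalues if ev > 1)

# Crude "elbow" rule: keep everything before the single largest drop
drops = [eigenvalues[i] - eigenvalues[i + 1] for i in range(len(eigenvalues) - 1)]
elbow = drops.index(max(drops)) + 1  # number of factors before the biggest drop

print(kaiser, elbow)  # → 4 1
```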

Sometimes a scree plot is really clear, but this one, not as much. So, what should you do next?

Hmm … maybe I should write another post on that.
