# Playing with Open Data – Good to Know Tips from the Random File

From the random file: I’ve been super-busy working on our new startup, 7 Generation Games, and Darling Daughter Number Three had to defend her world title again, which distracted me a bit. So I have a bunch of half-written posts that I thought I’d just put up at random, for the same reason I do everything else on this blog: the hell of it.

902q798q467453q965pq86-34q9e’w5wi34ytrsghsf.ksfbcmn  – random!

I spend some time playing with other people’s data for a whole lot of reasons: for students to analyze as a learning experience, because I’m interested in a problem addressed by the data, or to create presentations for elementary schoolchildren showing what one can learn from statistics.

Here are a few tips that may make your life easier:

Read the user’s guide. Most of all, check to see if this is a random sample. If you are just using the data to teach your students how to compute a t-test, then it really doesn’t matter whether it is a completely random sample or not. However, if you are going to be drawing any conclusions based on these results, make sure you know whether the data should be weighted, stratified, or just really not used to generalize to the population at all. If your sample consists of actuaries who are also equestrian competitors, I’m afraid not too much generalization should occur. (Don’t write and tell me about your horse, Beau, and how the two of you are exactly representative of the state of Vermont. You’re not and I don’t care anyway.)
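To see why the weights matter, here is a minimal sketch. Everything in it is made up for illustration: the data set `toy_survey`, the variables `stratum`, `wt` and `income`, and the six rows of data. It also assumes you have SAS/STAT, since that is where PROC SURVEYMEANS lives. A real open data set will name its own weight and stratum variables in the user’s guide.

```sas
/* Toy data, invented for illustration only: two strata, and a sampling
   weight saying how many people in the population each respondent represents. */
data toy_survey ;
  input stratum $ wt income ;
  datalines ;
urban 100 52
urban 100 61
urban 100 58
rural 400 35
rural 400 41
rural 400 38
;
run ;

/* Unweighted mean: treats every respondent as interchangeable. */
proc means data = toy_survey mean ;
  var income ;
run ;

/* Design-based mean: uses the strata and the sampling weights. */
proc surveymeans data = toy_survey mean ;
  strata stratum ;
  weight wt ;
  var income ;
run ;
```

The two procedures should report different means, because each rural respondent counts four times as much as each urban one in the weighted estimate. If the documentation says the data are weighted and you ignore that, you are answering a question about the sample, not the population.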

Much of the open data I work with comes in very large data sets, and I spend several hours trying to get a feel for the data before I do much with it. If I’m going to use the same data set for a course with a lot of students, I’d like it to have lots of variables, many of them numeric, so the students can combine them into scales, do a factor analysis or find other quantitative uses without all ending up using the same few numeric variables. That way they can have a little individuality in their research question and design.

One way to find the number of numeric variables in a data set using SAS:

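```sas
/* Count the numeric variables in in._500family */
data testmiss ;
  set in._500family ;
  array allnums {*} _numeric_ ;  /* every numeric variable defined so far */
  x = dim(allnums) ;             /* x is created after the array, so it is not counted */
run ;

/* x is the same on every row, so its mean is the number of numeric variables */
proc means data = testmiss ;
  var x ;
run ;
```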
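Another way, if you would rather not read the data at all, is to ask SAS for the metadata. This is a sketch rather than the one true method. It assumes the libref `in` is already assigned, as in the DATA step above, and that the library and member names are stored in upper case in `dictionary.columns`, which is the normal case for SAS data sets. The column name `n_vars` is just a label I made up.

```sas
/* Count variables by type using the DICTIONARY.COLUMNS metadata table. */
/* Assumes the libref IN is already assigned.                           */
proc sql ;
  select type, count(*) as n_vars
    from dictionary.columns
    where libname = 'IN' and memname = '_500FAMILY'  /* normally stored in upper case */
    group by type ;
quit ;
```

This prints one row per variable type (`char` and `num`) with a count for each, and it never touches the observations, which is nice when the data set is very large.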

+++ Equally Random +++

If you buy the beta for Spirit Lake now for \$9.99, you’ll get our version 2.0 for free in May. It will be good. I’ve been working on the newest game, Fish Lake, for the last two weeks, but soon I’m going to swap with The Invisible Developer and do nothing but work on Spirit Lake for another few weeks.