If I were to give one piece of advice to a would-be program evaluator, it would be to get to know your data so intimately it’s almost immoral.
Generally, program evaluation is an activity undertaken by someone with a degree of expertise in research methods and statistics (hopefully!) using data gathered and entered by people whose interest is something completely different, from providing mental health services to educating students.
Because their interest in providing data is minimal, your interest in checking that data had better be maximal. Let’s carry on with the data from the last post. We have now created two data sets that have the same variable formats, so we are good to go with concatenating them.
DATA answers hmph ;
SET fl_answers ansfix1 ;
* Send developer and tester records to hmph instead of deleting them ;
IF username IN("UNDEFINED","UNKNOWN") OR INDEX(username,"TEST") > 0 THEN OUTPUT hmph ;
ELSE OUTPUT answers ;
RUN ;
PRO TIP : I learned from a wise man years ago that one should not just gleefully delete data without looking at it. That is, instead of having a dataset where you put the data you expect and deleting the rest, send the unwanted data to a data set. If it turns out to be what you expected, you can always delete the data after you look at it.
There should be very few records with a username of ‘UNDEFINED’ or ‘UNKNOWN’. The only way to get one is to be one of our developers entering data in forms as they create and test them, rather than logging in and playing the game. The INDEX function searches the variable given as the first argument for the string given as the second and returns the starting position of the string if found, or 0 if not. So, INDEX(username, "TEST") > 0 looks for the string TEST anywhere in the username.
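To see what INDEX returns, here is a minimal sketch on a handful of made-up usernames (the data set name index_demo and the usernames are invented for illustration, not taken from the game data). It also shows that INDEX is case-sensitive, so wrapping the variable in UPCASE is one way to catch testers who typed "test" in lowercase.

```sas
/* Toy usernames, invented for illustration only */
DATA index_demo ;
  LENGTH username $ 20 ;
  INPUT username $ ;
  pos_exact = INDEX(username, "TEST") ;          /* case-sensitive search */
  pos_any   = INDEX(UPCASE(username), "TEST") ;  /* search ignoring case  */
  DATALINES ;
maria12
TESTbob
joeTEST7
mytest3
;
RUN ;

PROC PRINT DATA = index_demo ;
RUN ;
```

Here maria12 gets 0 on both variables, TESTbob gets 1, joeTEST7 gets 4, and mytest3 gets 0 for pos_exact but 3 for pos_any. Whether the real filter needs the UPCASE version depends on how carefully your testers follow instructions.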
Since we ask our software testers to put that word in the username they pick, this should route all of the tester records to hmph. I looked at the hmph data set and the distribution of usernames was just as I expected; most of the records ended up in the answers data set with valid usernames.
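One simple way to do that kind of look is a frequency table of the usernames that were set aside. This is a sketch of the sort of check described, not necessarily the exact step I ran:

```sas
/* List each username routed to hmph, most frequent first */
PROC FREQ DATA = hmph ORDER = FREQ ;
  TABLES username / NOCUM ;
RUN ;
```

If anything other than UNDEFINED, UNKNOWN, or names containing TEST shows up in that table, the filter is catching records it should not.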
Did you remember that we had concatenated the data sets from the old server and the new server?
I hope you did, because if you didn’t you will end up with a whole lot of the same answers in there twice.
Getting rid of the duplicates
PROC SORT DATA = answers OUT=in.all_fl_answers NODUP ;
BY username date_entered ;
RUN ;
The difference between NODUP and NODUPKEY is relevant here. It is possible we could have two students with the same username and date_entered, because different schools could have assigned students the same username. (We do our lookups by username + school.) Some other student with the same username might have been entering data at the same time in a completely different part of the country. The NODUP option only removes records if every value of every variable is the same. The NODUPKEY option removes them whenever the values of the BY variables match, even if the other variables differ.
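A toy example makes the difference concrete. The data set dup_demo, the school and answer variables, and the values below are all invented for illustration; the first two records are exact duplicates and the third is a different student who happens to share the username and date.

```sas
/* Three made-up records: two exact duplicates plus a same-name student */
DATA dup_demo ;
  LENGTH username $ 10 school $ 10 ;
  INPUT username $ school $ date_entered : DATE9. answer ;
  FORMAT date_entered DATE9. ;
  DATALINES ;
eagle1 North 05MAR2014 4
eagle1 North 05MAR2014 4
eagle1 South 05MAR2014 7
;
RUN ;

/* NODUP: drops only the record that matches on every variable */
PROC SORT DATA = dup_demo OUT = nodup_out NODUP ;
  BY username date_entered ;
RUN ;

/* NODUPKEY: drops every record after the first with the same BY values */
PROC SORT DATA = dup_demo OUT = nodupkey_out NODUPKEY ;
  BY username date_entered ;
RUN ;
```

nodup_out keeps two records (North and South), while nodupkey_out keeps only one, which would throw away the South student's answer. One caveat worth knowing: NODUP compares each record with the one before it, so identical records that do not end up next to each other after the sort can survive. Sorting BY _ALL_ is one way to guard against that.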
All righty then, we have the cleaned-up answers data. Now we go back and create a summary data set as explained in this post. You don’t have to do it with SAS Enterprise Guide as I did there. I just did it for the same reason I do most things: for the hell of it.
MERGING THE DATA
PROC SORT DATA = in.answers_summary ;
BY username ;
RUN ;
PROC SORT DATA = in.all_fl_students ;
BY username ;
RUN ;
DATA in.answers_studunc odd ;
MERGE in.answers_summary (IN=a) in.all_fl_students (IN=b) ;
* Match records on username. Without BY, MERGE pairs records by position ;
BY username ;
IF a AND b THEN OUTPUT in.answers_studunc ;
IF a AND NOT b THEN OUTPUT odd ;
RUN ;
The PROC SORT steps sort. The MERGE statement merges. The IN= option creates a temporary variable with the name ‘a’ or ‘b’. You can use any name so I use short ones. If there is a record in both the student record file and the answers summary file then the data is output to a data set of all students with summary of answers.
There should not be any cases where there are answers but no record in the student file. If you recall, that is what set me off on finding that some were still being written to the old server.
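If the IN= flags are new to you, here is a small sketch with invented data sets (summary_demo, students_demo, and their values are made up) that routes every possible case to its own data set, so you can see what each flag combination means.

```sas
/* Toy answers summary: zed has answers but no student record */
DATA summary_demo ;
  INPUT username $ n_answers ;
  DATALINES ;
ann 12
bob 3
zed 5
;
RUN ;

/* Toy student file: cal is a student who answered nothing */
DATA students_demo ;
  INPUT username $ grade ;
  DATALINES ;
ann 4
bob 5
cal 4
;
RUN ;

/* Both inputs are already in username order, so no PROC SORT is needed */
DATA matched answers_only students_only ;
  MERGE summary_demo (IN=a) students_demo (IN=b) ;
  BY username ;
  IF a AND b THEN OUTPUT matched ;        /* in both files         */
  ELSE IF a THEN OUTPUT answers_only ;    /* answers, no student   */
  ELSE OUTPUT students_only ;             /* student, no answers   */
RUN ;
```

Here matched gets ann and bob, answers_only gets zed, and students_only gets cal. In the real merge, answers_only plays the role of the odd data set, and it is the one that should come back empty.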
LOOK AT YOUR LOG FILE!
There is a sad corner of statistical purgatory for people who don’t look at their log files because they don’t know what they are looking for. ‘Nuff said.
This looks exactly as it should. A consistent finding in pilot studies assessing educational games has been a disconcertingly low level of persistence. So, it is expected that many players quit when they come to the first math questions. The fact that, of the 875 players, slightly fewer than 600 had answered any questions was somewhat expected. As expected, there were no records where there were answers but no matching student record.
NOTE: There were 596 observations read from the data set IN.ANSWERS_SUMMARY.
NOTE: There were 875 observations read from the data set IN.ALL_FL_STUDENTS.
NOTE: The data set IN.ANSWERS_STUDUNC has 596 observations and 11 variables.
NOTE: The data set WORK.ODD has 0 observations and 11 variables.
So, now, after several blog posts, we have a data set ready for analysis ….. almost.
For more on SAS character functions check out Ron Cody’s paper An Introduction to Character Functions, an oldie but goodie from WUSS back in 2003.