I’m in the middle of data preparation on a research project on games to teach fractions. This is the part of a data analysis project that takes up 80% of the time. Fortunately, PROC FREQ from SAS can simplify things.
1. How many unique records?
There are multiple quizzes in the game, and you only end up taking a quiz if you miss one of the problems, so knowing how many unique players my 1,000 or so records represent isn’t as simple as dividing the number of records by X, where X is a fixed number of quizzes.
PROC FREQ DATA = mydata.quizzes NLEVELS ;
TABLES username ;
RUN ;
Gives me the number of unique usernames. If you were dying to know, in the quizzes file for Fish Lake it was 163.
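If you want a cross-check on that number, one option is to count distinct usernames in PROC SQL. This is only a sketch using the data set and variable names from above; n_players is a made-up name for the result column. One difference to keep in mind: NLEVELS counts a missing username as a level, while COUNT(DISTINCT ...) skips missing values, so the two results can differ by one if some records have no username.

```sas
/* Cross-check on the NLEVELS result: count distinct usernames directly. */
/* mydata.quizzes and username are the names used above;                 */
/* n_players is a hypothetical label for the output column.              */
proc sql;
  select count(distinct username) as n_players
  from mydata.quizzes;
quit;
```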
2. Are there data entry problems?
We had a problem early in the history of the project where, when the internet was down, the local computer would keep trying to send the data to our server, so we would get 112 of the same record once the connection was back up.
Now, it is very likely that a player might have the same quiz recorded more than once. Failing it the first time, he or she would be redirected to study and then have a chance to try again. Still, a player shouldn’t have TOO many of the same quiz. I thought this problem had been fixed, but I wanted to check.
To check if we had the same quiz an excessive number of times, I simply did this:
PROC FREQ DATA= in.quizzes ;
TABLES username*quiztype / OUT=check (WHERE = (COUNT > 10)) ;
RUN ;
This creates an output data set of those usernames that had the same quiz more than 10 times.
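The OUT= data set from a two-way table holds one row per username and quiz type combination, along with the COUNT and PERCENT variables that PROC FREQ adds. A quick way to eyeball the worst offenders, sketched here with the check data set created above, is to sort by count and print:

```sas
/* List the flagged username/quiz combinations, largest counts first. */
/* check is the OUT= data set from the PROC FREQ step above.          */
proc sort data=check;
  by descending count;
run;

proc print data=check;
  var username quiztype count;
run;
```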
There were a few of these problems. The question then became how to identify and delete those without deleting the real quizzes. This took me to step 3.
3. The LAG function
The LAG function provides the value from the prior observation, as long as it is called on every observation. I sorted the data by username, quiz type, number correct and time, so that repeats of the same quiz with the same score sit next to each other in time order. I assumed it would take a minimum of 120 seconds for even the fastest student to complete a study activity and take the quiz a second time, so any record that matched the one before it and came less than 120 seconds later had to be a duplicate. Using the code below, I was able to delete all duplicate quizzes that occurred due to dropped internet connections.
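proc sort data = check4 ;
by username quiztype numcorrect date_time ;
run ;
data check5 ;
set check4 ;
* Call LAG on every observation so each value comes from the prior record ;
lagu = lag(username) ;
lagq = lag(quiztype) ;
lagn = lag(numcorrect) ;
lagd = lag(date_time) ;
* date_time is assumed to be a SAS datetime value, so differences are in seconds ;
if lagu = username & lagq = quiztype & lagn = numcorrect then ddiff = date_time - lagd ;
if ddiff ne . & ddiff < 120 then delete ;
run ;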
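To see what this logic keeps and drops, here is a small sketch on invented data. The toy data set, its five records and the toy_clean name are all made up for illustration; only the variable names and the 120-second rule come from the steps above. It assumes the time variable is a SAS datetime, so subtraction gives seconds.

```sas
/* Toy data, invented for illustration only. */
data toy;
  input username $ quiztype $ numcorrect date_time :datetime18.;
  format date_time datetime18.;
  datalines;
ana q1 3 01JAN2014:10:00:00
ana q1 3 01JAN2014:10:00:05
ana q1 3 01JAN2014:10:00:09
ana q1 3 01JAN2014:10:15:00
ben q1 3 01JAN2014:10:00:02
;
run;

proc sort data=toy;
  by username quiztype numcorrect date_time;
run;

data toy_clean;
  set toy;
  /* LAG is called on every row, so each value is from the prior row */
  lagu = lag(username);
  lagq = lag(quiztype);
  lagn = lag(numcorrect);
  lagd = lag(date_time);
  /* Seconds since the prior row, only for same player, quiz and score */
  if lagu = username and lagq = quiztype and lagn = numcorrect
    then ddiff = date_time - lagd;
  if ddiff ne . and ddiff < 120 then delete;
  drop lagu lagq lagn lagd;
run;

proc print data=toy_clean;
run;
```

With these five records, the two ana rows at 10:00:05 and 10:00:09 are dropped, and the other three are kept. Note that each row is compared with the row immediately before it, whether or not that row was itself deleted, so a long burst of resent records collapses to its first record even if the burst as a whole lasts longer than 120 seconds.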
Having finished off my data cleaning in record time, I’m now ready to do more PROC FREQ’ing for data analysis – tomorrow.
(Actually, being 12:22 am, I guess it is technically tomorrow now.)