If I were to give one piece of advice to a would-be program evaluator, it would be to get to know your data so intimately it’s almost immoral.
Generally, program evaluation is an activity undertaken by someone with a degree of expertise in research methods and statistics (hopefully!) using data gathered and entered by people whose interest is something completely different, from providing mental health services to educating students.
Because their interest in providing data is minimal, your interest in checking that data had better be maximal. Let’s head on with the data from the last post. We have now created two data sets that have the same variable formats, so we are good to go with concatenating them.
DATA answers hmph ;
SET fl_answers ansfix1 ;
IF username IN("UNDEFINED","UNKNOWN") OR INDEX(username,"TEST") > 0 THEN OUTPUT hmph ;
ELSE OUTPUT answers ;
RUN ;
PRO TIP : I learned from a wise man years ago that one should not just gleefully delete data without looking at it. That is, instead of having a dataset where you put the data you expect and deleting the rest, send the unwanted data to a data set. If it turns out to be what you expected, you can always delete the data after you look at it.
There should be very few people with a username of ‘UNDEFINED’ or ‘UNKNOWN’. The only way to get that is to be one of our developers, who enter data in forms as they create and test them rather than logging in and playing the game. The INDEX function searches the variable given in the first argument for the string given in the second and returns the starting position of the string if found, and 0 if not. So, INDEX(username, "TEST") > 0 looks for the word TEST anywhere in the username.
Since we ask our software testers to put that word in the username they pick, it should catch all of the tester records. I looked at the hmph data set and the distribution of usernames was just as I expected, and most of the records were in the answers data set with valid usernames.
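To see how that filter treats different usernames, here is a minimal sketch. The usernames are made up for illustration and are not from the real data. One thing worth checking in your own data: INDEX is case-sensitive, so a tester who typed "test" in lower case would slip through unless the variable is wrapped in UPCASE.

```sas
/* Hypothetical usernames, for illustration only */
DATA _NULL_ ;
  LENGTH username $12 ;
  DO username = "TESTER1", "MARYTEST01", "maria22", "test_joe" ;
    pos       = INDEX(username, "TEST") ;          /* case-sensitive search */
    pos_upper = INDEX(UPCASE(username), "TEST") ;  /* search that ignores case */
    PUT username= pos= pos_upper= ;
  END ;
RUN ;
```

In the log, MARYTEST01 should show a position of 5 for both searches, maria22 should show 0 for both, and test_joe should show 0 for the plain search but 1 once the name is upper-cased.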
Did you remember that we had concatenated the data set from the old server and the new server?
I hope you did, because if you didn’t you will end up with a whole lot of the same answers in there twice.
Getting rid of the duplicates
PROC SORT DATA = answers OUT=in.all_fl_answers NODUP ;
BY username date_entered ;
RUN ;
The difference between NODUP and NODUPKEY is relevant here. It is possible we could have a student with the same username and date_entered because different schools could have assigned students the same username. (We do our lookups by username + school). Some other student with the same username might have been entering data at the same time in a completely different part of the country. The NODUP option only removes records if every value of every variable is the same. The NODUPKEY option removes them if the variables in the BY statement are duplicates.
All righty then, we have the cleaned up answers data, now we go back and create a summary data set as explained in this post. You don’t have to do it with SAS Enterprise Guide as I did there; I just did it for the same reason I do most things, for the hell of it.
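A small sketch of that difference, using invented records: one exact duplicate and one record that shares the username and date_entered but comes from a different school. The data set names, school names and values here are all hypothetical.

```sas
/* Hypothetical records: rows 1 and 2 are identical, row 3 differs only in school */
DATA demo ;
  INFILE DATALINES DLM="," ;
  INPUT username :$12. date_entered :$19. school :$12. ;
  DATALINES ;
amy01,2016-05-02 10:15:00,Lincoln
amy01,2016-05-02 10:15:00,Lincoln
amy01,2016-05-02 10:15:00,Prairie
bob02,2016-05-02 11:00:00,Lincoln
;
RUN ;

/* NODUP drops only the record that matches on every variable */
PROC SORT DATA = demo OUT = by_record NODUP ;
  BY username date_entered ;
RUN ;

/* NODUPKEY drops every record that repeats the BY values */
PROC SORT DATA = demo OUT = by_key NODUPKEY ;
  BY username date_entered ;
RUN ;
```

With this data, by_record should keep three records (the Prairie student survives) while by_key should keep only two, which is exactly the loss of a real student that NODUP avoids.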
MERGING THE DATA
PROC SORT DATA = in.answers_summary ;
BY username ;
RUN ;
PROC SORT DATA = in.all_fl_students ;
BY username ;
RUN ;
DATA in.answers_studunc odd ;
MERGE in.answers_summary (IN=a) in.all_fl_students (IN=b) ;
BY username ;
IF a AND b THEN OUTPUT in.answers_studunc ;
IF a AND NOT b THEN OUTPUT odd ;
RUN ;
The PROC SORT steps sort. The MERGE statement merges. The IN= option creates a temporary variable with the name ‘a’ or ‘b’. You can use any name so I use short ones. If there is a record in both the student record file and the answers summary file then the data is output to a data set of all students with summary of answers.
There should not be any cases where there are answers but no record in the student file. If you recall, that is what set me off on finding that some were still being written to the old server.
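Here is a minimal sketch of how the IN= variables sort records into the two output data sets. The data are invented, and it assumes a match-merge on username (that is, a BY username statement in the merge step).

```sas
/* Hypothetical data sets, already in username order */
DATA summary_demo ;
  INPUT username $ correct_N ;
  DATALINES ;
amy01 12
cal03 4
zed99 7
;
RUN ;

DATA students_demo ;
  INPUT username $ grade ;
  DATALINES ;
amy01 4
bob02 5
cal03 4
;
RUN ;

DATA matched odd ;
  MERGE summary_demo (IN=a) students_demo (IN=b) ;
  BY username ;
  IF a AND b THEN OUTPUT matched ;    /* answers and a student record */
  IF a AND NOT b THEN OUTPUT odd ;    /* answers with no student record */
RUN ;
```

In this made-up example, amy01 and cal03 should land in matched, zed99 should land in odd, and bob02 (a student with no answers) is written nowhere, just like the players who quit before the first math question.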
LOOK AT YOUR LOG FILE!
There is a sad corner of statistical purgatory for people who don’t look at their log files because they don’t know what they are looking for. ‘Nuff said.
This looks exactly as it should. A consistent finding in the pilot studies of assessment of educational games has been a disconcertingly low level of persistence. So, it is expected that many players quit when they come to the first math questions. The fact that of the 875 players slightly fewer than 600 had answered any questions was somewhat expected. As expected, there were no records where there were answers but no record in the student file.
NOTE: There were 596 observations read from the data set IN.ANSWERS_SUMMARY.
NOTE: There were 875 observations read from the data set IN.ALL_FL_STUDENTS.
NOTE: The data set IN.ANSWERS_STUDUNC has 596 observations and 11 variables.
NOTE: The data set WORK.ODD has 0 observations and 11 variables.
So, now, after several blog posts, we have a data set ready for analysis ….. almost.
For more on SAS character functions check out Ron Cody’s paper An Introduction to Character Functions, an oldie but goodie from WUSS back in 2003.
At the Western Users of SAS Software conference (yes, they DO know that is WUSS), I’ll be speaking about using SAS for evaluation.
“If the results bear any relationship at all to reality, it is indeed a fortunate coincidence.”
I first read that in a review of research on expectancy effects, but I think it is true of all types of research.
Here is the interesting thing about evaluation – you never know what kind of data you are going to get. For example, in my last post I had created a data set that was a summary of the answers players had given in an educational game, with a variable for the mean percentage correct and another variable for number of questions answered.
When I merged this with the user data set so I could test for relationships between characteristics of these individuals – age, grade, gender, achievement scores – and perseverance, I found a very odd thing. A substantial minority were not matched in the users file. This made no sense because you have to log in with your username and password to play the game.
The reason I think that results are often far from reality is just this sort of thing – people don’t scrutinize their data well enough to realize when something is wrong, so they just merrily go ahead analyzing data that has big problems.
In a sense, this step in the data analysis revealed a good problem for us. We actually had more users than we thought. Several months ago, we had updated our games. We had also switched servers for the games. Not every teacher installed the new software so it turned out that some of the records were being written to our old server.
Here is what I needed to do to fix this:
- Download the files from our servers. I exported these as .xls files.
- Read the files into SAS
- Fix the variables so that the format was identical for both files.
- Concatenate the files of the same type, e.g., student file with the student file from the other server.
- Remove the duplicates
- Merge the files with different data, e.g., answers file with student file
I did this in a few easy steps using SAS.
1. Use PROC IMPORT to read in the files
Now, you can use the IMPORT DATA option from the file menu but that gets a bit tedious if you have a dozen files to import.
TIP: If you are not familiar with the IMPORT procedure, do it with the menus once and save the code. Then you can just change the data set names and copy and paste this a dozen times. You could also turn it into a macro if you are feeling ambitious, but let’s assume you are not. The code looks like this:
PROC IMPORT OUT= work.answers DATAFILE= "C:\Users\Spirit Lake\WUSS16\fish_data\answers.xls"
GETNAMES = YES assumes that your Excel file has the names of the columns in the first row. All you need to do for the next 11 data sets is to change the values in lower case: the file name you want for your SAS file goes after the OUT = , the Excel file after DATAFILE = and the sheet in that file that has your data after the RANGE =.
Notice there is a $ at the end of that sheet name.
Done. That’s it. Copy and paste however many times you want and change those three values for output dataset name, location of the input data and the sheet name.
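If you are feeling ambitious after all, a macro version might look something like the sketch below. Treat it as one possible shape rather than the code I actually ran: the macro name import_xls is made up, it assumes DBMS=EXCEL is available in your SAS installation, and it assumes every workbook has a sheet with the same name as the file, which you would need to check against your own files.

```sas
/* Hypothetical macro: assumes each workbook has a sheet named like the file */
%MACRO import_xls(name) ;
  PROC IMPORT OUT= work.&name
    DATAFILE= "C:\Users\Spirit Lake\WUSS16\fish_data\&name..xls"
    DBMS=EXCEL REPLACE ;
    RANGE="&name.$" ;      /* sheet name, with the $ at the end */
    GETNAMES=YES ;         /* first row holds the column names */
  RUN ;
%MEND import_xls ;

/* One call per file instead of a dozen copies of the code */
%import_xls(answers)
```

The two dots in &name..xls are deliberate: the first ends the macro variable name and the second is the dot in the file extension.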
2. Fix the variables so that the format is identical for both files
A. How do you know if the variables are the same format for each file?
PROC CONTENTS DATA = answers ;
RUN ;
This LOOKS good, right?
B. Look at a few records from each file.
OPTIONS OBS = 3 ;
PROC PRINT DATA = fl_answers_new ;
VAR date_entered ;
RUN ;
PROC PRINT DATA = fl_answers_old ;
VAR date_entered ;
RUN ;
OPTIONS OBS = MAX ;
PAY ATTENTION HERE !!! The OPTIONS OBS = 3 only shows the first three records. That’s a good idea because you don’t need to print out all 7,000+ records. However, if you forget to change it back to OBS = MAX then all of your procedures after that will only use the first 3 records, which is probably not what you want.
So, although my PROC CONTENTS showed the files were the same format in terms of variable type and length, here was a weird thing: since the servers were in different time zones, the time was recorded as 5 hours different.
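If you are the forgetful type, one alternative is the OBS= data set option, which limits a single step and leaves everything after it alone. A minimal sketch, using a made-up data set named demo_big as a stand-in for the real files:

```sas
/* Hypothetical stand-in for a large data set */
DATA demo_big ;
  DO record = 1 TO 7000 ;
    OUTPUT ;
  END ;
RUN ;

/* OBS=3 in parentheses applies only to this one step */
PROC PRINT DATA = demo_big (OBS=3) ;
RUN ;

/* Nothing to reset: this step still reads every record */
PROC MEANS DATA = demo_big N ;
RUN ;
```

The trade-off is that you have to type the option on every step where you want the limit, so for a long run of quick looks the OPTIONS statement may still be less work.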
Since this was recorded as a character variable, not a date (see the output for the contents procedure above), I couldn’t just subtract 5 from the hour.
Because the value was not the same, if I sorted by username and date_entered , each one of these that was moved over from the old server would be included in the data set twice, because SAS would not recognize these were the same record.
So, what did I do?
I’m so glad you asked that question.
I read in the data to a new data set and the third statement gives a length of 19 to a new character variable.
Next, I create a variable that is the part of the date_entered variable that starts at the 12th position and goes for the next two characters (that is, the value of the hour).
Now, I add 5 to the hour value. Because I am adding a number to it, this will be created as a numeric value. Even though datefix1 is a character variable (it was created using a character function, SUBSTR), when I add a number to it, SAS will try to make the resulting value a number.
Finally, I’m setting the value of datefixed to be the first 11 characters of the original date value, the part before the hour. I’m using the TRIM function to get rid of trailing blanks. I’m concatenating this value (that’s what the || does) with exactly one blank space. Next, I am concatenating this with the new hour value. First, though, I am left aligning that number and trimming any blanks. Finally, I’m concatenating the last 6 characters of the original date-time value. If I didn’t do this trimming and left alignment, I would end up with a whole bunch of extra spaces and it still wouldn’t match.
I still need to get this to be the value of the date_entered variable so it matches the date_entered value in the other data set.
I’m going to DROP the date_entered variable, and also the datefix1 and datefixn variables since I don’t need them any more.
I use the RENAME statement to rename datefixed to date_entered and I’m ready to go ahead with combining my datasets.
DATA ansfix1 ;
SET flo_answers ;
LENGTH datefixed $19 ;
datefix1 = SUBSTR(date_entered,12,2) ;
datefixn = datefix1 + 5 ;
datefixed = TRIM(SUBSTR(date_entered,1,11)) || " " || TRIM(LEFT(datefixn)) || SUBSTR(date_entered,14,6) ;
DROP date_entered datefix1 datefixn ;
RENAME datefixed = date_entered ;
RUN ;
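To watch those statements work on one value, here is a sketch with a made-up date. It assumes date_entered is laid out as yyyy-mm-dd hh:mm:ss, which fits the positions used above but is my guess at the layout for illustration purposes.

```sas
/* Hypothetical value, assuming the layout yyyy-mm-dd hh:mm:ss */
DATA _NULL_ ;
  LENGTH date_entered datefixed $19 ;
  date_entered = "2016-06-15 14:30:00" ;
  datefix1 = SUBSTR(date_entered,12,2) ;   /* character hour, "14" */
  datefixn = datefix1 + 5 ;                /* numeric hour, 19 */
  datefixed = TRIM(SUBSTR(date_entered,1,11)) || " " || TRIM(LEFT(datefixn)) || SUBSTR(date_entered,14,6) ;
  PUT date_entered= / datefixed= ;
RUN ;
```

The log should show datefixed=2016-06-15 19:30:00. Two cases are worth checking in your own data before trusting this approach: an hour of 19 or later will come out as 24 or more instead of rolling over to the next day, and an hour like 03 will come out as 8 with no leading zero (PUT(datefixn, Z2.) is one way to keep the zero).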
Occasionally, a brave student will ask me,
When will I ever use this?
The “this” can be anything from a mixed model analysis to nested arrays. (I have answers for both of those, by the way.)
I NEVER get that question when discussing topics like filtering data, whether for records or variables, because it is so damn ubiquitous.
Before I headed out to be, literally, testing in the field (you can read why here) , I was working on an evaluation of the usability of one of our games, Fish Lake.
My next thought was that many students played the game for a very short time, got the first answer correct and then quit. I decided to take a closer look at those people.
First step: from the top menu select TASKS, then DATA, then FILTER AND SORT
Second step: Create the filter. Click on the FILTER tab, select from the drop-down menu the variable to use to filter, in this case the one named “correct_Mean” , select the type of filter in the next drop-down menu, in this case EQUAL TO and in the box, enter the value you want it to equal. If you don’t remember all of the values you want, clicking on the three dots next to that box will bring up a list of values. You can also filter by more than one variable, but in this case, I only want one, so I’m done.
Third step: Select the variables. Steps two and three don’t have to be done in a particular order, but you DO have to select variables or your procedure won’t run, since it would end up with an empty data set. I do the filter first so I don’t forget. I know the filter is the whole point and you’re probably thinking you’d never forget that but you’re probably smarter than me or never rushed.
If you click the double arrows in the middle, that will select all of the variables. In this case, I just selected the two variables I wanted and clicked the single arrow (the top one) to move those over.
Why include correct_mean, since obviously that is a constant?
Because I could have made a mistake somewhere and these aren’t all with 100% correct. (Turns out, I didn’t and they were, but you never know in advance if you made a mistake, because if you knew, you wouldn’t make it.)
I click OK and now I have created a data set of just the people who answered 100% correctly.
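If you would rather type than click, the same filter could be sketched in code roughly as follows. This is not the code Enterprise Guide generates. The data set scores_demo and its values are invented, it assumes 100% correct is stored as 1 in correct_mean (it could just as well be 100 in your data), and it assumes the second variable kept is correct_N.

```sas
/* Hypothetical summary data: number answered and mean proportion correct */
DATA scores_demo ;
  INPUT username $ correct_N correct_mean ;
  DATALINES ;
amy01 12 1
bob02 5 0.6
cal03 1 1
;
RUN ;

/* Keep two variables and only the records with a perfect score */
DATA perfect ;
  SET scores_demo (KEEP = correct_N correct_mean) ;
  WHERE correct_mean = 1 ;   /* assumption: 1 means 100% correct */
RUN ;
```

With this made-up data, perfect should end up with two records, amy01's and cal03's.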
For a first look, I graphed the frequency distribution of the number of questions answered by these perfect scorers. To do this,
1. Go to TASKS > GRAPH > Bar Chart
2. Click on the first chart to select it; that’s a simple vertical bar chart
4. Under APPEARANCE click the box next to SPECIFY NUMBER OF BARS. The default here is one bar for each unique data value, which is already clicked. Caution with this if you might have hundreds of values, but I happen to know the max is less than 20.
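For the code-inclined, a similar chart can be sketched with PROC SGPLOT. The data set perfect_demo and its values are invented for illustration; the VBAR statement draws one bar for each unique value of the variable, which matches the default described above.

```sas
/* Hypothetical counts of questions answered by perfect scorers */
DATA perfect_demo ;
  INPUT correct_N @@ ;
  DATALINES ;
1 1 1 2 3 3 5 8 8 12
;
RUN ;

/* One vertical bar per unique value of correct_N */
PROC SGPLOT DATA = perfect_demo ;
  VBAR correct_N ;
RUN ;
```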
I thought I’d find a bunch answered one question and a few answered all of the questions and maybe those few were data entry errors, say teachers who tested the game and shouldn’t be in the database. When I look at this graph, I’m surprised. There are a lot more people who had answered 100% correctly than I expected and they are distributed a lot more across the number of questions than I expected. That’s the fun of exploratory data analysis. You never know what you are going to find.
So, now what?
I want to find out more about the relationship between persistence and performance. To do this, I’m going to need to merge the answers summary data set with demographics.
I’m going to go back to the Summary Data Set I created in the last post (remember that one?) and just filter variables this time, keeping all of the records.
Again, I’m going to go to the TASKS menu, select DATA, then FILTER AND SORT. This time, I’m going to have no filter and just select the variables.
Since the pop-up window opens with the VARIABLES tab selected, I just click the variables I want, which happen to be “correct_N”, “correct_mean” and “username”, click the single arrow in between the panes to move them over, and click OK at the bottom of the pop-up window. Done! My data set is created.
You can always click on PROGRAM from the main menu to write code in SAS Enterprise Guide, but being an old dinosaur type, I’d like to export this data set I just created and do some programming with it using SAS. Personally, I find it easier to write code when I’m doing a lot of merging and data analysis. I find Enterprise Guide to be good for the quick looks and graphics but for more detailed analysis, the old timey SAS Editor is my preference. If you happen to be like me, all you need to do to output your data set is click on it in the process flow and select EXPORT.
You want to export this file as a stand-alone data set, not as a step in a project. Just select the first option and you can save it like any file, select the folder you want, give it the name you want. No LIBNAME statement required.
And it’s a beautiful sunny day in Santa Monica, so that’s it on this project for today.
Since it’s the 4th of July, I figure no one is very work-focused today and it would be a good time for one of my occasional rants.
I read a book this week, The Savage Damsel and the Dwarf. It was a good story for lots of reasons, a primary one being that it didn’t follow the usual narrative of beautiful damsel in distress rescued by charming prince.
In fact, the beautiful damsel is kind of a stupid jerk and it is her overlooked, smarter sister who heads out to find a knight to save the castle. Said knight isn’t the sharpest knife in the drawer, either. In fact, most of the knights in the story seem to have been bonked on the head a few times too many.
In the end, the beautiful damsel is rescued by the not-too-bright knight and they go off to King Arthur’s court. The sister ends up with another guy who pretty much sucks at being a knight, so he gives it up and they live happily ever after together as he takes over the family lands and becomes a highly successful farmer.
One reason I almost never watch TV or movies is that the story is so predictable. Love at first sight. Unappreciated younger brother becomes greatest knight ever through magic potion/ love of a good woman. Bad guy defeated by good guy. Lots of fighting scenes. Overlooked woman develops into a beauty and the guy finally notices her.
More people should write their own story. Truly, fighting occasionally, with intervening sitting around the castle drinking beer waiting for a fight, does sound like an incredibly boring life. A lot of the stuff the knights (and people now) fight over is stupid.
“You insulteth mine honor.”
Yeah, so we should hack each other to pieces with swords? Get over it.
The ‘savage damsel’ falls in love with a knight who falls in love with her beautiful sister. When she has the chance to make him love her forever, she starts thinking past the first minute she imagines him declaring his love for her and tries to see being middle-aged, sitting by the fire with Sir Dumb-As-A-Rock and concludes, “Oh, hell, naw.”
Sir Lancelot loses a joust and disappears. And he doesn’t come back.
I liked the book a lot because it didn’t follow the recipe for fantasy stories. I ordered it for my granddaughter because it has a great life lesson – write your own story.
My usual disclaimer when I write about a product: No one paid me diddly-squat to write this.