If I were to give one piece of advice to a would-be program evaluator, it would be to get to know your data so intimately it’s almost immoral.
Generally, program evaluation is an activity undertaken by someone with a degree of expertise in research methods and statistics (hopefully!) using data gathered and entered by people whose interest is something completely different, from providing mental health services to educating students.
Because their interest in providing data is minimal, your interest in checking that data better be maximal. Let’s head on with the data from the last post. We have now created two data sets that have the same variable formats so we are good to go with concatenating them.
DATA answers hmph;
SET fl_answers ansfix1 ;
/* Send developer and tester records to hmph instead of deleting them */
IF username IN("UNDEFINED","UNKNOWN") OR INDEX(username,"TEST") > 0 THEN OUTPUT hmph;
ELSE OUTPUT answers;
RUN;
PRO TIP : I learned from a wise man years ago that one should not just gleefully delete data without looking at it. That is, instead of having a dataset where you put the data you expect and deleting the rest, send the unwanted data to a data set. If it turns out to be what you expected, you can always delete the data after you look at it.
There should be very few people with a username of ‘UNDEFINED’ or ‘UNKNOWN’. The only way to get that is to be one of our developers who are entering the data in forms as they create and test them, not by logging in and playing the game. The INDEX function checks in the variable in the first argument for the string given in the second and returns the starting position of the string, if found. So, INDEX(username, “TEST”) > 0 looks for the word TEST anywhere in the username.
Since we ask our software testers to put that word in the username they pick, this should send all of the tester records to the hmph data set. I looked at the hmph data set and the distribution of usernames was just as I expected. Most of the records ended up in the answers data set, with valid usernames.
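If you want to see for yourself what INDEX returns, a throwaway DATA _NULL_ step does the job. This is only a sketch: the three usernames are made up for illustration, and the UPCASE variant is my own addition for the case where a tester typed "test" in lower case, since INDEX is case-sensitive.

```sas
DATA _NULL_ ;
  LENGTH username $20 ;
  /* Hypothetical usernames, for illustration only */
  DO username = "MATHTEST01", "greybear", "test_kid" ;
    pos    = INDEX(username, "TEST") ;          /* case-sensitive search */
    pos_uc = INDEX(UPCASE(username), "TEST") ;  /* case-insensitive variant */
    PUT username= pos= pos_uc= ;
  END ;
RUN ;
```

The log shows pos=5 for MATHTEST01 (TEST starts at the fifth character) and pos=0 for greybear. For test_kid, pos is 0 but pos_uc is 1, which is the case the plain INDEX check would miss.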
Did you remember that we had concatenated the data set from the old server and the new server?
I hope you did because if you didn’t you will end up with a whole lot of the same answers in there twice.
Getting rid of the duplicates
PROC SORT DATA = answers OUT=in.all_fl_answers NODUP ;
BY username date_entered ;
RUN ;
The difference between NODUP and NODUPKEY is relevant here. It is possible we could have a student with the same username and date_entered because different schools could have assigned students the same username. (We do our lookups by username + school). Some other student with the same username might have been entering data at the same time in a completely different part of the country. The NODUP option only removes records if every value of every variable is the same. The NODUPKEY removes them if the variables in the BY statement are duplicates.
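A tiny made-up example shows the difference. The data set name, the usernames and the school variable below are all hypothetical; the point is only that two different students can share a username and a timestamp.

```sas
/* Made-up records: two schools happen to share the username "bear" */
DATA demo ;
  LENGTH username $12 date_entered $19 school $8 ;
  INFILE DATALINES DLM = "," ;
  INPUT username date_entered school ;
  DATALINES ;
bear,2016-05-12 14:30:00,A
bear,2016-05-12 14:30:00,A
bear,2016-05-12 14:30:00,B
;
RUN ;

/* NODUP (alias NODUPRECS): drops only the record that matches on every variable */
PROC SORT DATA = demo OUT = nodup_out NODUP ;
  BY username date_entered ;
RUN ;

/* NODUPKEY: drops every record after the first with the same BY values */
PROC SORT DATA = demo OUT = nodupkey_out NODUPKEY ;
  BY username date_entered ;
RUN ;
```

Here nodup_out keeps two records (one from school A, one from school B), while nodupkey_out keeps only one, so the student at school B would silently vanish.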
All righty then, we have the cleaned up answers data, now we go back and create a summary data set as explained in this post. You don’t have to do it with SAS Enterprise Guide as I did there, I just did it for the same reason I do most things, for the hell of it.
MERGING THE DATA
PROC SORT DATA = in.answers_summary ;
BY username ;
RUN ;
PROC SORT DATA = in.all_fl_students ;
BY username ;
RUN ;
DATA in.answers_studunc odd;
MERGE in.answers_summary (IN=a) in.all_fl_students (IN=b) ;
BY username ;
IF a AND b THEN OUTPUT in.answers_studunc ;
IF a AND NOT b THEN OUTPUT odd ;
RUN ;
The PROC SORT steps sort. The MERGE statement merges. The IN= option creates a temporary variable with the name ‘a’ or ‘b’. You can use any name so I use short ones. If there is a record in both the student record file and the answers summary file then the data is output to a data set of all students with summary of answers.
There should not be any cases where there are answers but no record in the student file. If you recall, that is what set me off on finding that some were still being written to the old server.
LOOK AT YOUR LOG FILE!
There is a sad corner of statistical purgatory for people who don’t look at their log files because they don’t know what they are looking for. ‘Nuff said.
This looks exactly as it should. A consistent finding in the pilot studies of assessment of educational games has been a disconcertingly low level of persistence. So, it is expected that many players quit when they come to the first math questions. The fact that of the 875 players slightly fewer than 600 had answered any questions was somewhat expected. As expected, there were no records where there were answers but no matching record in the student file.
NOTE: There were 596 observations read from the data set IN.ANSWERS_SUMMARY.
NOTE: There were 875 observations read from the data set IN.ALL_FL_STUDENTS.
NOTE: The data set IN.ANSWERS_STUDUNC has 596 observations and 11 variables.
NOTE: The data set WORK.ODD has 0 observations and 11 variables.
So, now, after several blog posts, we have a data set ready for analysis ….. almost.
For more on SAS character functions check out Ron Cody’s paper An Introduction to Character Functions, an oldie but goodie from WUSS back in 2003.
At the Western Users of SAS Software conference (yes, they DO know that is WUSS), I’ll be speaking about using SAS for evaluation.
“If the results bear any relationship at all to reality, it is indeed a fortunate coincidence.”
I first read that in a review of research on expectancy effects, but I think it is true of all types of research.
Here is the interesting thing about evaluation – you never know what kind of data you are going to get. For example, in my last post I had created a data set that was a summary of the answers players had given in an educational game, with a variable for the mean percentage correct and another variable for number of questions answered.
When I merged this with the user data set so I could test for relationships between characteristics of these individuals – age, grade, gender, achievement scores – and perseverance I found a very odd thing. A substantial minority were not matched in the users file. This made no sense because you have to login with your username and password to play the game.
The reason I think that results are often far from reality is just this sort of thing – people don’t scrutinize their data well enough to realize when something is wrong, so they just merrily go ahead analyzing data that has big problems.
In a sense, this step in the data analysis revealed a good problem for us. We actually had more users than we thought. Several months ago, we had updated our games. We had also switched servers for the games. Not every teacher installed the new software so it turned out that some of the records were being written to our old server.
Here is what I needed to do to fix this:
- Download the files from our servers. I exported these as .xls files.
- Read the files into SAS
- Fix the variables so that the format was identical for both files.
- Concatenate the files of the same type, e.g., student file with the student file from the other server.
- Remove the duplicates
- Merge the files with different data, e.g., answers file with student file
I did this in a few easy steps using SAS.
1. Use PROC IMPORT to read in the files.
Now, you can use the IMPORT DATA option from the file menu but that gets a bit tedious if you have a dozen files to import.
TIP: If you are not familiar with the IMPORT procedure, do it with the menus once and save the code. Then you can just change the data set names and copy and paste this a dozen times. You could also turn it into a macro if you are feeling ambitious, but let’s assume you are not. The code looks like this:
PROC IMPORT OUT= work.answers DATAFILE= "C:\Users\Spirit Lake\WUSS16\fish_data\answers.xls"
This assumes that the first row of your Excel file has the names of the columns (GETNAMES = YES). All you need to do for the next 11 data sets is to change the values in lower case: the name you want for your SAS file goes after the OUT =, the Excel file after DATAFILE = and the sheet in that file that has your data after the RANGE =.
Notice there is a $ at the end of that sheet name.
Done. That’s it. Copy and paste however many times you want and change those three values for output dataset name, location of the input data and the sheet name.
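For reference, here is a minimal sketch of what a complete PROC IMPORT step and the macro version might look like. Several things here are assumptions rather than the code from the post: the sheet name Sheet1 is hypothetical, the macro name import_xls and its parameters are made up, and DBMS=EXCEL assumes you have SAS/ACCESS Interface to PC Files licensed.

```sas
/* One file, written out in full. "Sheet1$" is a hypothetical sheet name. */
PROC IMPORT OUT = work.answers
            DATAFILE = "C:\Users\Spirit Lake\WUSS16\fish_data\answers.xls"
            DBMS = EXCEL REPLACE ;
  RANGE = "Sheet1$" ;     /* note the $ at the end of the sheet name */
  GETNAMES = YES ;        /* first row holds the column names */
RUN ;

/* Hypothetical macro wrapping the three values that change per file */
%MACRO import_xls(out=, file=, sheet=) ;
  PROC IMPORT OUT = &out
              DATAFILE = "&file"
              DBMS = EXCEL REPLACE ;
    RANGE = "&sheet.$" ;
    GETNAMES = YES ;
  RUN ;
%MEND import_xls ;

%import_xls(out=work.answers,
            file=C:\Users\Spirit Lake\WUSS16\fish_data\answers.xls,
            sheet=Sheet1) ;
```

With the macro, each additional file is one line instead of a pasted block, which mostly pays off once you are past the third or fourth file.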
2. Fix the variables so that the format is identical for both files
A. How do you know if the variables are the same format for each file?
PROC CONTENTS DATA = answers ;
RUN ;
This LOOKS good, right?
B. Look at a few records from each file.
OPTIONS OBS= 3 ;
PROC PRINT DATA = fl_answers_new ;
VAR date_entered ;
RUN ;
PROC PRINT DATA = fl_answers_old ;
VAR date_entered ;
RUN ;
OPTIONS OBS = MAX ;
PAY ATTENTION HERE !!! The OPTIONS OBS = 3 only shows the first three records, that’s a good idea because you don’t need to print out all 7,000+ records . However, if you forget to change it back to OBS = MAX then all of your procedures after that will only use the first 3 records, which is probably not what you want.
So, although my PROC CONTENTS showed the files were the same format in terms of variable type and length, here was a weird thing: since the servers were in different time zones, the time for the same record was recorded as 5 hours different on the two servers.
Since this was recorded as a character variable, not a date (see the output for the contents procedure above), I couldn’t just subtract 5 from the hour.
Because the value was not the same, if I sorted by username and date_entered , each one of these that was moved over from the old server would be included in the data set twice, because SAS would not recognize these were the same record.
So, what did I do?
I’m so glad you asked that question.
I read in the data to a new data set and the third statement gives a length of 19 to a new character variable.
Next, I create a variable that holds the part of the date_entered value that starts at the 12th position and goes for the next two characters (that is, the value of the hour).
Now, I add 5 to the hour value. Because I am adding a number to it , this will be created as a numeric value. Even though datefix1 is a character variable – since it was created using a character function, SUBSTR, when I add a number to it, SAS will try to make the resulting value a number.
Finally, I’m putting the value of datefixed to be the first 11 characters of the original date value , the part before the hour. I’m using the TRIM function to get rid of trailing blanks. I’m concatenating this value (that’s what the || does) with exactly one blank space. Next, I am concatenating this with the new hour value. First, though, I am left aligning that number and trimming any blanks. Finally, I’m concatenating the last 6 characters of the original date-time value. If I didn’t do this trimming and left alignment, I would end up with a whole bunch of extra spaces and it still wouldn’t match.
I still need to get this to be the value of the date_entered variable so it matches the date_entered value in the other data set.
I’m going to DROP the date_entered variable, and also the datefix1 and datefixn variables since I don’t need them any more.
I use the RENAME statement to rename datefixed to date_entered and I’m ready to go ahead with combining my datasets.
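DATA ansfix1 ;
SET flo_answers ;
LENGTH datefixed $19 ;
/* Characters 12-13 hold the hour */
datefix1 = SUBSTR(date_entered,12,2);
/* Adding a number makes SAS convert the character hour to numeric */
datefixn = datefix1 +5 ;
/* Date part + one blank + new hour + the last 6 characters (:mm:ss) */
datefixed = TRIM(SUBSTR(date_entered,1,11)) || " " || TRIM(LEFT(datefixn)) || SUBSTR(date_entered,14,6) ;
DROP date_entered datefix1 datefixn ;
RENAME datefixed = date_entered ;
RUN ;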
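One thing to watch with the string approach: if I read the step above correctly, an hour of 04 becomes 9 rather than 09, and an hour of 22 becomes 27 instead of rolling over to the next day. Whether that matters depends on your data. A hedged alternative is to convert to a real SAS datetime, add five hours, and write the value back out as text. The sketch below assumes date_entered looks like "2016-05-12 22:30:00" (the post only tells us the positions, not the exact layout), and ansfix_alt is a hypothetical data set name.

```sas
DATA ansfix_alt ;                 /* hypothetical data set name */
  SET flo_answers ;
  LENGTH datefixed $19 ;
  /* Assumes a layout like 2016-05-12 22:30:00 */
  d  = INPUT(SUBSTR(date_entered, 1, 10), YYMMDD10.) ;  /* SAS date value */
  t  = INPUT(SUBSTR(date_entered, 12, 8), TIME8.) ;     /* seconds since midnight */
  dt = DHMS(d, 0, 0, t) + 5 * 3600 ;                    /* shift by 5 hours */
  /* Rebuild the same 19-character layout, with leading zeros on the hour */
  datefixed = PUT(DATEPART(dt), YYMMDD10.) || " " || PUT(TIMEPART(dt), TOD8.) ;
  DROP date_entered d t dt ;
  RENAME datefixed = date_entered ;
RUN ;
```

Under that assumed layout, "2016-05-12 22:30:00" becomes "2016-05-13 03:30:00". If your dates are laid out differently, the two informats are the part to change.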
Occasionally, a brave student will ask me,
When will I ever use this?
The “this” can be anything from a mixed model analysis to nested arrays. (I have answers for both of those, by the way.)
I NEVER get that question when discussing topics like filtering data, whether for records or variables, because it is so damn ubiquitous.
Before I headed out to be, literally, testing in the field (you can read why here) , I was working on an evaluation of the usability of one of our games, Fish Lake.
My next thought was that many students played the game for a very short time, got the first answer correct and then quit. I decided to take a closer look at those people.
First step: from the top menu select TASKS, then DATA, then FILTER AND SORT
Second step: Create the filter. Click on the FILTER tab, select from the drop-down menu the variable to use to filter, in this case the one named “correct_Mean” , select the type of filter in the next drop-down menu, in this case EQUAL TO and in the box, enter the value you want it to equal. If you don’t remember all of the values you want, clicking on the three dots next to that box will bring up a list of values. You can also filter by more than one variable, but in this case, I only want one, so I’m done.
Third step: Select the variables. Steps two and three don’t have to be done in a particular order, but you DO have to select variables or your procedure won’t run, since it would end up with an empty data set. I do the filter first so I don’t forget. I know the filter is the whole point and you’re probably thinking you’d never forget that but you’re probably smarter than me or never rushed.
If you click the double arrows in the middle, that will select all of the variables. In this case, I just selected the two variables I wanted and clicked the single arrow (the top one) to move those over.
Why include correct_mean, since obviously that is a constant?
Because I could have made a mistake somewhere and these aren’t all with 100% correct. (Turns out, I didn’t and they were, but you never know in advance if you made a mistake because if you did then you wouldn’t make it.)
I click OK and now I have created a data set of just the people who answered 100% correctly.
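If you prefer code to menus, the same filter is a short DATA step. This is a sketch, not what Enterprise Guide generates: it assumes the summary data set is the in.answers_summary used elsewhere in these posts, that the two variables kept are correct_N and correct_mean, and the output name perfect is made up.

```sas
DATA perfect ;                                   /* hypothetical output name */
  SET in.answers_summary (KEEP = correct_N correct_mean) ;
  WHERE correct_mean = 1 ;                       /* only the 100% correct players */
RUN ;
```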
For a first look, I graphed the frequency distribution of the number of questions answered by these perfect scorers. To do this,
1. Go to TASKS > GRAPH > Bar Chart
2. Click on the first chart to select it, that’s a simple vertical bar chart
4. Under APPEARANCE click the box next to SPECIFY NUMBER OF BARS. The default here is one bar for each unique data value, which is already clicked. Caution with this if you might have hundreds of values, but I happen to know the max is less than 20.
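A rough code equivalent of this bar chart, assuming the summary data set is in.answers_summary with the variables correct_N and correct_mean, would be something like the sketch below. PROC SGPLOT draws one bar per distinct value of correct_N by default, which matches the "one bar for each unique data value" setting.

```sas
PROC SGPLOT DATA = in.answers_summary ;
  WHERE correct_mean = 1 ;     /* perfect scorers only */
  VBAR correct_N ;             /* frequency of each number of questions answered */
RUN ;
```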
I thought I’d find a bunch answered one question and a few answered all of the questions and maybe those few were data entry errors, say teachers who tested the game and shouldn’t be in the database. When I look at this graph, I’m surprised. There are a lot more people who had answered 100% correctly than I expected and they are distributed a lot more across the number of questions than I expected. That’s the fun of exploratory data analysis. You never know what you are going to find.
SO, now what?
I want to find out more about the relationship between persistence and performance. To do this, I’m going to need to merge the answers summary data set with demographics.
I’m going to go back to the Summary Data Set I created in the last post (remember that one) and just filter variables this time, keeping all of the records.
Again, I’m going to go to the TASKS menu, select DATA then FILTER AND SORT, this time, I’m going to have no filter and select the variables.
Since the pop-up window opens with the VARIABLES tab selected, I just click the variables I want, which happen to be “correct_N”, “correct_mean” and “username”, click the single arrow in between the panes to move them over, and click OK at the bottom of the pop-up window. Done! My data set is created.
You can always click on PROGRAM from the main menu to write code in SAS Enterprise Guide, but being an old dinosaur type, I’d like to export this data set I just created and do some programming with it using SAS. Personally, I find it easier to write code when I’m doing a lot of merging and data analysis. I find Enterprise Guide to be good for the quick looks and graphics but for more detailed analysis, the old timey SAS Editor is my preference. If you happen to be like me, all you need to do to output your data set is click on it in the process flow and select EXPORT.
You want to export this file as a stand-alone data set, not as a step in a project. Just select the first option and you can save it like any file, select the folder you want, give it the name you want. No LIBNAME statement required.
And it’s a beautiful sunny day in Santa Monica, so that’s it on this project for today.
Since it’s the 4th of July, I figure no one is very work-focused today and it would be a good time for one of my occasional rants.
I read a book this week, The Savage Damsel and the Dwarf. It was a good story for lots of reasons, a primary one being that it didn’t follow the usual narrative of beautiful damsel in distress rescued by charming prince.
In fact, the beautiful damsel is kind of a stupid jerk and it is her overlooked, smarter sister who heads out to find a knight to save the castle. Said knight isn’t the sharpest knife in the drawer, either. In fact, most of the knights in the story seem to have been bonked on the head a few times too many.
In the end, the beautiful damsel is rescued by the not-too-bright knight and they go off to King Arthur’s court. The sister ends up with another guy who pretty much sucks at being a knight, so he gives it up and they live happily ever after together as he takes over the family lands and becomes a highly successful farmer.
One reason I almost never watch TV or movies is that the story is so predictable. Love at first sight. Unappreciated younger brother becomes greatest knight ever through magic potion/ love of a good woman. Bad guy defeated by good guy. Lots of fighting scenes. Overlooked woman develops into a beauty and the guy finally notices her.
More people should write their own story. Truly, fighting occasionally, with intervening sitting around the castle drinking beer waiting for a fight does sound like an incredibly boring life. A lot of the stuff the Knights (and people now) fight over is stupid.
“You insulteth mine honor.”
Yeah, so we should hack each other to pieces with swords? Get over it.
The ‘savage damsel’ falls in love with a knight who falls in love with her beautiful sister. When she has the chance to make him love her forever, she starts thinking past the first minute she imagines him declaring his love for her and tries to see being middle-aged, sitting by the fire with Sir Dumb-As-A-Rock and concludes, “Oh, hell, naw.”
Sir Lancelot loses a joust and disappears. And he doesn’t come back.
I liked the book a lot because it didn’t follow the recipe for fantasy stories. I ordered it for my granddaughter because it has a great life lesson – write your own story.
My usual disclaimer when I write about a product: No one paid me diddly-squat to write this.
In the last post, I used SAS Enterprise Guide to filter out a couple of ‘bad’ records that came from test data, then I created a summary table of the number of questions answered and the percentage correct. Then, I calculated the mean percentage correct, which came out around 84%. That seemed a bit high to me.
Having (temporarily) answered the first question regarding the number of individual subjects and the average percent of correct answers from the 424 subjects, I turned to the next question:
Is there a correlation between percentage correct and the number of questions attempted? That is, do students who are getting the answers correct persist more often?
Since I had both variables, N and the mean correct (which, since this was scored 0 = incorrect, 1 = correct, gave me the percentage correct) from the summary tables I had created in the previous step, it was a simple procedure to compute the correlation.
I just went to the TASKS menu, selected MULTIVARIATE and then CORRELATIONS
Under ANALYSIS VARIABLES I dragged correct_N, the N for the ‘correct’ variable, which is a variable that holds whether the student answered correctly, 0 (= no) or 1 (= yes). So correct_N is the number of questions each student attempted. Under CORRELATE WITH I dragged correct_mean, which has the percentage each student answered correctly.
Since it is just a bivariate correlation and the correlation of X with Y = the correlation of Y with X , it would make absolutely no difference if I switched the spots where I dragged the two variables.
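In code, the same bivariate correlation is a three-line PROC CORR step. This is a sketch that assumes the summary data set is in.answers_summary, as in the later posts. The VAR and WITH statements play the same roles as the two boxes in the task window, and swapping them changes only the layout of the output, not the coefficient.

```sas
PROC CORR DATA = in.answers_summary ;
  VAR correct_N ;          /* number of questions attempted */
  WITH correct_mean ;      /* proportion answered correctly */
RUN ;
```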
I also note that the minimum number of answers attempted is 1. Now, I have done (and published) analyses of these data elsewhere, as this is an on-going project.
Other analyses from this same project can be found in:
Because of these analyses of ‘Fidelity of Implementation’, that is the degree to which a project is implemented as planned, I am pretty sure that these data include a large proportion of students who only had the opportunity to play the game once.
So … I decided to run a scatter plot and check my suspicion. This is pretty simple. I just go to the TASKS menu and select GRAPH then SCATTER PLOT.
I selected 2-D Scatter Plot
Then, I clicked on the DATA tab, dragged correct_Mean under Horizontal and Correct_N under Vertical, then clicked RUN.
This produced the graph below.
Now, this graph isn’t fancy but it serves its purpose, which is to show me that there IS in fact a correlation of mean correct and the number of problems attempted. Look at that graph a minute and tell me that you don’t see a linear trend – but it is pulled off by the line of 1.0 at the far end.
This did NOT fit my preconceived notion, though, that the lack of correlation was due to the players who played once, and so there would be a bunch of people who had answered 1 or 2 questions and got 100% of them correct. Actually, those 100-percenters were all over the distribution in terms of number of problems attempted.
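For anyone who would rather type than click, a scatter plot like the one described can be sketched with PROC SGPLOT. The data set name in.answers_summary is an assumption borrowed from the later posts; the axis assignments follow the task settings described above.

```sas
PROC SGPLOT DATA = in.answers_summary ;
  SCATTER X = correct_mean Y = correct_N ;   /* horizontal = mean correct, vertical = N attempted */
RUN ;
```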
This reminds me of a great quote by Isaac Asimov,
The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ (I found it!) but ‘That’s funny …’
Well, we shall see, as our analysis continues …
The government is extremely fond of amassing great quantities of statistics. These are raised to the nth degree, the cube roots are extracted, and the results are arranged into elaborate and impressive displays. What must be kept ever in mind, however, is that in every case, the figures are first put down by a village watchman, and he puts down anything he damn well pleases.
Any time you do anything with any data your first step is to consider the wisdom of Sir Josiah Stamp and check the validity of your data. One quick first step is using the Summary Tables task from SAS Enterprise Guide. If you are not familiar with SAS Enterprise Guide, it is a menu driven application for using SAS for data analysis. You can open a program window and write code if you like, and I do that every now and then but that’s another post. In my experience, SAS Enterprise Guide works much better with smaller data sets, defined by me, as the blog owner, as fewer than 400,000 records or so. Your mileage may vary depending upon your system.
How to do it:
1. Open SAS Enterprise Guide
2. Open your data set – (FILE > OPEN > DATA)
3. From the TASKS menu, select DESCRIBE and then SUMMARY TABLES. The window below will pop up
4. Drag the variables to the roles you want for each. Since I have fewer than 450 usernames here, I just quickly want to see are there duplicates, errors (e.g. ‘gret bear’ is really the same kid as ‘grey bear’, with a typo). I also want to find out the number of problems each student attempted and the percent correct. So, I drag ‘username’ under CLASSIFICATION VARIABLES and ‘correct’ under ANALYSIS variables. You can have more than one of each but it just so happens I only have one classification and one analysis variable I’m interested in right now.
5. Next click on the tab at left that says SUMMARY TABLES and drag your variables and statistics where you want them. I want ‘username’ as the row, so I drag it to the side, ‘correct’ as the column, N is already filled in as a statistic if you drag your classification variable to the table first. I also want the mean, so I drag that next to the N. Then, click RUN.
Wait a minute! Didn’t I say I wanted the percent correct for each student? Why would I select mean instead of percent?
Because the pctN will simply tell me what percent of the total N the responses from this username make up. I don’t want that. Since the answers are scored 0 = wrong, 1 = right, the mean will tell me what percentage of the questions were answered correctly by each student. Hey, I know what I’m doing here.
6. Look at the data! In looking at the raw data, I see that there are two erroneous usernames that shouldn’t be there. These data have been cleaned pretty well already, so I don’t find much to fix. Now, I want to re-run the analysis deleting these two usernames.
7. At the top of your table, you’ll see an option that says “Modify Task”. Click that.
8. Under TASK FILTER pull down the first box to show the variable ‘username’. Pull down the second box to show the option NOT EQUAL TO and then click the three dots next to the third box. This will pull up a list of all of your values for usernames. You can select the one you want to exclude and click OK. Next to the three dots, pull down to select AND, then go through this to select the second username you want to delete. You can also just type in the values, but I tend to do it this way because I’m a bad typist with a bad short-term memory.
11. From the DESCRIBE menu again select SUMMARY STATISTICS
12. Drag ‘correct_mean’ under ANALYSIS VARIABLES and click RUN.
The resulting table gives me my answer – the mean is .838 with a standard deviation of .26 for N=424 subjects. So … the average subject answered 84% of the problems correctly. This, however, is just the first step. There are a couple more interesting questions to be answered with this data set before moving on. Read the next step here.
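The menu steps above can also be done in code with two PROC MEANS steps. Treat this as a sketch under assumptions: the input data set name answers and the output name answers_summary are placeholders, the input is assumed to have one row per answer with the variables username and correct (scored 0 or 1), and the filter on the two erroneous usernames is left out because the post does not name them.

```sas
/* One row per username: number of answers and proportion correct */
PROC MEANS DATA = answers NWAY NOPRINT ;
  CLASS username ;
  VAR correct ;
  OUTPUT OUT = answers_summary (DROP = _TYPE_ _FREQ_)
         N = correct_N  MEAN = correct_mean ;
RUN ;

/* Overall N, mean and standard deviation of the per-student proportions */
PROC MEANS DATA = answers_summary N MEAN STD ;
  VAR correct_mean ;
RUN ;
```

Because correct is 0 or 1, the MEAN statistic is the proportion correct for each student, which is the same reasoning as choosing the mean over pctN in the task window.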
It’s been a good week for the darling daughters.
The Spoiled One graduated summa cum laude, also president of the senior class, and is heading to the east coast to attend a small liberal arts college where she has an academic scholarship and a spot on the soccer team.
The book co-authored by Darling Daughter One and Darling Daughter Three won International Sports Biography of the Year, and the two lovelies pictured above flew to London to receive the award.
The Perfect Jennifer has tenure now and is finishing out another year of being an outstanding teacher.
A couple of years ago, there was a book with the thesis that Chinese mothers are superior and all Americans are raising a bunch of lazy slackers. It irritated me and I wrote a blog with the title “Why American mothers are superior” because that seemed more professional than “Go Fuck Yourself” . And no, in all seriousness, I really don’t think that one race or country has better mothers, but I also think the idea that if we don’t regiment our children lock-step for 18 years straight into MIT we are a bunch of losers is irritating as fuck.
You might think this is my rubbing it in post to say, “How you like me now? My kids are doing awesome.”
You’d be wrong. To paraphrase Erma Bombeck yet again, no mother should ever be arrogant because she can’t be sure that at any moment the principal won’t call to tell her that one of her children rode a motorcycle through the gymnasium.
I wanted to talk about something different – definitions of success that Tiger Mom Lady probably would not understand at all.
A friend of mine has a son in his mid-twenties who lives at home. He earned a degree from a two-year college. He is not crushing it as a hedge fund manager, but rather, has a regular job with benefits. I’m sure Tiger Mom would be dismayed if he was her kid.
My friend was distraught over the situation at work. The company had been acquired and reorganized. Her new boss was a nightmare and she came home in tears more often than not. Despite over a decade of good performance, she was afraid she was going to be laid off and was becoming depressed and stressed. They couldn’t afford to make the payments on their house on one income, and they had already lost a home back in 2008 when the housing market imploded. They were the collateral damage of those hedge fund managers.
It was at this point that her son (remember him?) stepped up. He had been living at home to save money for a down payment on a house of his own. Since he is single, has no children and gets along well with his parents, it seemed like a good arrangement, and he was paying them rent, but a lot less than it would cost to go out and get his own apartment. Plus, there were those home-cooked meals. He said something like this,
Look, you took care of me for 26 years. I make enough money now to cover the mortgage. If you are that unhappy about your job, quit. Even if you don’t quit your job, at least quit worrying about being laid off. I’ll pick up any slack. Between Dad and me, we got you covered.
Look at this family – they all love each other, the mom, dad and son. They get along well enough that he feels comfortable living at home to save money. Her son is hard-working and appreciates the fact that his parents have done what they could to support him. He can take the perspective of another person, see the stress his mother is experiencing and offer to do what he can to alleviate it out of appreciation for what they have done for him.
In my view, my friend is a success as a mother and her son is a success as a human being.
Where we left off, I had created some parcels and was going to do a factor analysis later. Now, it’s later. If you’ll recall, I had not found any items that correlated significantly with the food item that also made sense conceptually. For example, it correlated highly with attending church services but that didn’t really have any theoretical basis. So, I left it as a single variable. Here is my first factor analysis.
proc factor data= parcels rotate= varimax scree ;
Var socialp1 - socialp3 languagep spiritualp spiritual2 culturep1 culturep2 food;
run ;
You can see from the scree plot here that there is one factor way at the top of the chart with the rest scattered at the bottom. Although the minimum eigenvalue of 1 criterion would have you retain two factors, I think that is too many, for both logical and statistical reasons. The eigenvalues of the first two factors, by the way, were 4.74 and 1.10.
Even if you aren’t really into statistics or factor analysis, I hope that this pattern is pretty clear. You can see that every single thing except for the item related to food loads predominantly on the first factor.
These results are interesting in light of the discussion on small sample size. If you didn’t read it, the particular quote in there that is relevant here is
“If components possess four or more variables with loadings above .60, the pattern may be interpreted whatever the sample size used .”
Final Communality Estimates: Total = 5.845142
These communality estimates are also relevant but it is nearly 1 am and I have to be up at 6:30 for a conference call, so I’ll ramble on about this some more next time.
First of all, what are parcels? Not the little packages your grandma left on the table in the hall when she came back from shopping. Well, not only that.
In factor analysis, parcels are simply the sum of a small number of items. I prefer using parcels when possible because both basic psychometric theory and common sense tell me that a combination of items will have greater variance and, c.p., greater reliability than a single item.
Just so you know that I learned my share of useless things in graduate school, c.p. is Latin for ceteris paribus which translates to “other things being equal”. The word “etcetera” meaning other things, has the same root.
Now you know. But I digress. Even more than usual. Back to parcels.
As parcels can be expected to have greater variance and greater reliability, harking back to our deep knowledge of both correlation and test theory we can assume that parcels would tend to have higher correlations than individual items. As factor loadings are simply correlations of a variable (be it item or parcel) with the factor, we would assume that – there’s that c.p. again – factor loadings of parcels would be higher.
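Since a parcel is just a sum of a few items, building one takes a line of code per parcel. The sketch below is illustrative only: which items belong in which parcel is a hypothetical assignment, and items such as social4 through social6 are assumed to exist for the sake of the example.

```sas
DATA parcels ;
  SET in.culture ;
  /* Hypothetical item-to-parcel assignment, for illustration only */
  socialp1 = SUM(social1, social2, social3) ;
  socialp2 = SUM(social4, social5, social6) ;
RUN ;
```

One design choice to be aware of: the SUM function ignores missing values, so a respondent who skipped an item still gets a (smaller) parcel score. Writing social1 + social2 + social3 instead gives a missing parcel whenever any item is missing.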
Jeremy Anglim, in a post written several years ago, talks a bit about parceling and concludes that it is less of a problem in a case, like today, where one is trying to determine the number of factors. Actually, he was talking about confirmatory factor analysis but I just wanted you to see that I read other people’s blogs.
The very best article on parceling was called To Parcel or Not to Parcel and I don’t say that just because I took several statistics courses from one of the authors.
To recap this post and the last one:
- I have a small sample size and, due to the unique nature of a very small population, it is not feasible to increase it by much.
- I need to reduce the number of items to an acceptable subjects to variables ratio.
- The communality estimates are quite high (over .6) for the parcels.
- My primary interest is in the number of factors in the measure and finding an interpretable factor.
So… here we go. The person who provided me the data set went in and helpfully renamed the items that were supposed to measure socializing with people of the same culture ‘social1’, ‘social2’ etc, and renamed the items on language, spirituality, etc. similarly. I also had the original measure that gave me the actual text of each item.
Step 1: Correlation analysis
This was super-simple. All you need is a LIBNAME statement that references the location of your data and then:
PROC CORR DATA = mydataset ;
VAR firstvar -- lastvar ;
RUN ;
In my case, it looked like this
PROC CORR DATA = in.culture ;
VAR social1 -- art ;
RUN ;
The double dashes (two hyphens, typed as -- ) are interpreted as ‘all of the variables in the data set located from firstvar to lastvar’. This saves you typing if you know all of your variables of interest are in sequence. I could have just used a single dash if they were named the same, like item1 - item17, and then it would have used all of the variables named that regardless of their location in the data set. The problem I run into there is knowing what exactly item12 is supposed to measure. We could discuss this, but we won’t. Back to parcels.
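Here is a small sketch of the difference between the two kinds of variable list. The data set and variable names (demo, item1, score, item2) are made up for illustration and are not from the acculturation data.

```sas
/* Hypothetical data set with variables stored in the order item1, score, item2 */
data demo ;
   input item1 score item2 ;
   datalines ;
1 10 3
2 20 4
3 30 5
;
run ;

/* Double dash: everything positioned from item1 through item2, */
/* so score is included.                                        */
proc means data = demo ;
   var item1 -- item2 ;
run ;

/* Single dash: only the numbered variables item1 and item2, */
/* so score is left out.                                     */
proc means data = demo ;
   var item1 - item2 ;
run ;
```

The first PROC MEANS reports three variables and the second reports two, which is a quick way to confirm which list you actually asked for.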
Since you want to put together items that are related both conceptually and empirically (that is, the things you think should correlate actually do), you first want to look at the correlations.
Step 2: Create parcels
The items that were expected to assess similar factors tended to correlate from .42 to .67 with one another. I put these together in a very simple data step.
data parcels ;
set out.factors ;
socialp1 = social1 + social5 ;
socialp2 = social4 + social3 ;
socialp3 = social2 + social6 + social7 ;
languagep = language2 + language1 ;
spiritualp = spiritual1 + spiritual4 ;
culturep1 = social2 + dance + total;
culturep2 = language3 + art ;
run ;
There was one item that asked how often the respondent ate food from the culture, and there didn’t seem to be a justifiable reason for combining it with any other item in the measure.
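One thing worth checking when you build parcels with the + operator: in SAS, if any item in the sum is missing, the whole parcel comes out missing, which hurts when you only have 25 respondents. The sketch below uses made-up values (not the real data) to contrast + with the SUM function, which skips missing values.

```sas
/* Hypothetical values: the second respondent skipped social5 */
data parcel_check ;
   input social1 social5 ;
   plus_parcel = social1 + social5 ;      /* missing if either item is missing */
   sum_parcel  = sum(social1, social5) ;  /* adds only the non-missing items   */
   datalines ;
3 4
2 .
;
run ;

proc print data = parcel_check ;
run ;
```

Whether SUM is the better choice is a judgment call, since a parcel built from fewer items is not on the same scale as the others. Either way, it pays to know how many cases each approach leaves you with.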
Step 3: Conduct factor analysis
This was also super-simple to code. It is simply
proc factor data= parcels rotate= varimax scree ;
Var socialp1-socialp3 languagep spiritualp spiritual2 culturep1 culturep2 ;
run ;
I actually did this twice, once with and once without the food item. Since it loaded by itself on a separate factor, I did not include it in the second analysis. Both factor analyses yielded two factors that every item but the food item loaded on. It was a very nice simple structure.
Since I have to get back to work at my day job making video games, though, that will have to wait until the next post, probably on Monday.
Someone handed me a data set on acculturation that they had collected from a small sample of 25 people. There was a good reason that the sample was small – think African-American presidents of companies over $100 million in sales or Latina neurosurgeons. Anyway, small sample, can’t reasonably expect to get 500 or 1,000 people.
The first thing I thought about was whether there was a valid argument for a minimum sample size for factor analysis. I came across this very interesting post by Nathan Zhao where he reviews the research on both a minimum sample size and a minimum subjects to variables ratio.
Since I did the public service of reading it so you don’t have to (though seriously, it was an easy read and interesting), I will summarize:
- There is no evidence for any absolute minimum number, be it 100, 500 or 1,000.
- The minimum sample size depends on the number of variables and the communality estimates for those variables.
- “If components possess four or more variables with loadings above .60, the pattern may be interpreted whatever the sample size used.”
- There should be at least three measured variables per factor and preferably more.
This makes a lot of sense if you think about factor loadings in terms of what they are, correlations of an item with a factor. With correlations, if you have a very large correlation in the population, you’re going to find statistical significance even with a small sample size. It may not be precisely as large as your population correlation, but it is still going to be significantly different from zero.
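To put a rough number on that, here is a sketch that treats a loading like a simple Pearson correlation (an analogy only, not a formal test of factor loadings). It computes the smallest correlation that would be significant at the .05 level, two-tailed, with 25 respondents. The variable names are just for illustration.

```sas
/* Critical value of r for n = 25, alpha = .05 two-tailed,        */
/* using the t test for a correlation: t = r*sqrt(df)/sqrt(1-r**2) */
data _null_ ;
   n      = 25 ;
   df     = n - 2 ;
   t_crit = tinv(0.975, df) ;                  /* upper 2.5% point of t  */
   r_crit = t_crit / sqrt(t_crit**2 + df) ;    /* solve the t test for r */
   put "Smallest significant |r| with n = 25: " r_crit 6.3 ;
run ;
```

The value written to the log comes out just under .40, comfortably below the .60 loadings mentioned in the quote above.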
So … this data set of 25 respondents that I received originally had 17 items. That seemed clearly too many for me. I thought there were two factors, so I wanted to reduce the number of variables down to 8, if possible. I also suspected the communality estimates would be pretty high, just based on previous research with this measure.
Here is what I did next :
- Parallel analysis
- Factor Analysis
I can’t believe I haven’t written at all on parceling before and hardly any on the parallel analysis criterion, given the length of time I’ve been doing this blog. I will remedy that deficit this week. Not tonight, though. It’s past midnight, so that will have to wait until the next post.