# You have your survey data, now what?

“The problem with women isn’t that they leave you, but that they have to tell you WHY.”

Whether in teaching judo, statistics, programming or management, I’m a big fan of telling people why.

As I said before, the first step in programming is to THINK. This applies to statistical programming as well, which is what you are going to do with your survey data.

Okay, before you react as if you are being choked by Ronda and try to escape and run screaming from the room, hear me out.

In my example, the 500 Family Study, the data set I used had 460 different items.

Even if you break it down to only the ones that are of particular interest to your question, which were relationships between parents and children, you have 42 different questions. How are you planning to discuss these?

If you are like too many people, you’ll give 42 statements like:

9% of adolescents reported parents never checked their homework

23% said their parents checked it rarely  ….

blah blah blah for 42 variables.

There are three problems with doing it this way:

1. No one, unless they are on your dissertation committee, is ever going to read it.
2. God forbid you might want to actually look at RELATIONSHIPS among things, say, parental supervision, communication and positive views of parents. What are you going to do now? Look at how each of the 42 variables relates to each of the 42 variables? That’s 1,764 correlations, if you are keeping track (861 unique pairs once you drop the diagonal and the mirror-image half of the matrix). And don’t you DARE run a correlation matrix and just interpret the 88 that are significant; at p < .05, about 88 out of 1,764 would come up significant by chance alone. You will go to statistical hell right along with Sir Cyril Burt.
3. Individual variables are notoriously unreliable.
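The arithmetic behind the second point is easy to check. This is only a sketch: the .05 significance level is my assumption, and the "expected by chance" figures assume the worst case where none of the items is truly related to any other.

```python
from math import comb

n_items = 42
alpha = 0.05  # assumed significance level, not stated in the post

cells = n_items * n_items        # every cell of the 42 x 42 correlation matrix
unique_pairs = comb(n_items, 2)  # distinct item pairs, diagonal and duplicates removed

# Expected number of "significant" results if every true correlation were zero
expected_by_chance_cells = alpha * cells
expected_by_chance_pairs = alpha * unique_pairs

print(cells, unique_pairs)
print(round(expected_by_chance_cells), round(expected_by_chance_pairs))
```

Either way you count it, dozens of "significant" correlations would show up from noise alone, which is why cherry-picking them is a bad idea.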

You have studied for a final exam in Biology 101. There is one question, “What is the relationship between respiration and photosynthesis?”

Is this a fair test?

Your child is in fifth grade. Her weekly spelling test consists of one word.

Is this a fair test?

Okay, these aren’t trick questions, and don’t start with the “Well, it depends on what was covered in the class …” Unless your biology teacher was incompetent or your child’s teacher was an idiot, there was more taught than photosynthesis in your biology class and they learned more than one spelling word that week.

The plain fact is that people are complicated and whether it is their knowledge of biology, how well they can spell or their relationship with their mom, you aren’t going to get as much information from one single question as from several put together. This is why spelling tests aren’t just one word. People VARY. They vary a lot. If you give only one question then all of your students fall into one of two groups – they spelled it correctly (100%) or they spelled it incorrectly (0%). That is almost certainly not valid. Surely some of your students are better spellers than others, and they don’t fall into just two groups. Same with any number of other characteristics – some people wouldn’t cross the street to piss on their parents if they were on fire and others have practically never left the womb at age 14. Then there is everybody in between.

Every individual item has a “true variance” part (how much you know about biology or spelling, how loving your family is) and the statistician’s favorite whine, the totally random part, which is error variance. Maybe you know a lot about biology but you just happened to miss that day or skip that chapter on photosynthesis. It’s error in the sense that it doesn’t really represent the underlying construct (idea) you are trying to measure. Perhaps your family doesn’t ever have dinner together because Dad works the day shift and Mom works the night shift ….

So, other things being equal, the more questions you add together, the better. A spelling test with ten words is going to be more reliable, have more variance and be more accurate (valid) than a test with only one word. The same is true of a biology test or a measure of family functioning.

That “other things being equal” qualifier is important. If you add up the question on biology, the one on spelling and the one on your mom, you don’t have a more reliable or valid test of anything, even if you do have more variance. You’re just being stupid. Cut it out.
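If you want to put rough numbers on "more questions, more reliable," one standard tool is the Spearman-Brown formula. I'm bringing it in as an illustration, not as the only way to think about it. It assumes every item measures the same thing equally well, which is exactly the "other things being equal" part. The single-item reliability of 0.30 below is a made-up value for the example.

```python
def spearman_brown(single_item_reliability: float, n_items: int) -> float:
    """Predicted reliability of a scale built from n_items comparable items."""
    r = single_item_reliability
    return n_items * r / (1 + (n_items - 1) * r)

# 0.30 is a hypothetical reliability for one item, chosen only for illustration
for k in (1, 5, 10):
    print(k, round(spearman_brown(0.30, k), 2))
# 1 0.3
# 5 0.68
# 10 0.81
```

Ten so-so items that all tap the same idea give you a decent scale. Ten items about biology, spelling and your mom do not, because the formula's assumption no longer holds.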

SO … you should have done this in the first place, but if you didn’t, better late than never.  Sort your items into groups that might be related. In my example, I have parental supervision, decision-making and discussion. That just happens to be 42 questions. I may revisit this decision later, but that’s what I’m going to use for now.
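It helps to write the grouping down somewhere you can check it. Here is a minimal sketch using the three groups from my example; the item names are invented placeholders, so substitute the ones from your own codebook.

```python
from collections import Counter

# Hypothetical item names; the real ones come from your codebook
item_groups = {
    "supervision": ["checks_homework", "knows_whereabouts", "sets_curfew"],
    "decision_making": ["who_decides_classes", "who_decides_spending"],
    "discussion": ["talks_about_school", "talks_about_friends"],
}

# Count how many groups each item was assigned to
counts = Counter(item for items in item_groups.values() for item in items)
duplicates = sorted(item for item, n in counts.items() if n > 1)

print("items selected:", len(counts))
print("assigned to more than one group:", duplicates)
```

If an item lands in two groups, decide now where it belongs, or at least make a note that you may revisit it later.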

NOW do you get to do a factor analysis? No.

NEXT you want to make sure your data are accurate. For reasons totally beyond my comprehension, many people use codes for missing data, like 9. So, you have a bunch of people who have a score of 9 on a 1 to 5 scale and it completely throws off your results. Look at the descriptive statistics for your data and at a bare minimum see that you don’t have any variables out of range like that. This is just a reminder to always check your data before doing any even barely sophisticated statistics.
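A bare-minimum range check might look something like this. It is a sketch under assumptions: the 1 to 5 scale and the missing code of 9 come from the example above, while the item names and responses are invented.

```python
MISSING_CODES = {9}   # assumed missing-data code, as in the example above
VALID_RANGE = (1, 5)  # assumed response scale

# Hypothetical responses: item name -> answers from five respondents
responses = {
    "checks_homework": [1, 3, 9, 5, 2],
    "talks_about_school": [4, 9, 9, 2, 7],
}

def clean_item(values, valid_range=VALID_RANGE, missing_codes=MISSING_CODES):
    """Return (cleaned values with None for missing, unexplained out-of-range values)."""
    low, high = valid_range
    cleaned, out_of_range = [], []
    for v in values:
        if v in missing_codes:
            cleaned.append(None)      # known missing code: treat as missing
        elif low <= v <= high:
            cleaned.append(v)         # legitimate answer
        else:
            cleaned.append(None)      # not a valid answer, not a known code
            out_of_range.append(v)    # keep it so you can go find out why
    return cleaned, out_of_range

for name, values in responses.items():
    cleaned, bad = clean_item(values)
    valid = [v for v in cleaned if v is not None]
    print(name, "raw max:", max(values),
          "clean min/max:", min(valid), max(valid), "unexplained:", bad)
```

A raw maximum of 9 on a 1 to 5 item is the tip-off. Anything that shows up as "unexplained" is worth tracking back to the original data before you go any further.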

You’re set.

You understand WHY you are doing a factor analysis  – because you want to combine these many, many questions into a few reliable, valid scales that have some decent variance.

You understand WHAT you are doing a factor analysis with – you have selected out the appropriate questions you want to use.

You have checked to see that your data at least don’t totally suck. (If you’re not sure how to do this part, I’ll talk a little about that tomorrow, too.)

Well, congratulations, you are now all ready. Check back tomorrow for “Mama AnnMaria’s Point-y Click-y Guide to Factor Analysis” in as many easy lessons as I decide to do before I get bored.