### Mar

#### 14

# I was brilliant. Then I wasn’t. Then I was.

March 14, 2012 | 2 Comments

Programming is NOT mostly about writing code. It’s mostly about figuring out how to solve a problem. Here is an example from yesterday….

**HOW TO SCORE QUESTIONS WHEN THE ANSWERS ARE IN MULTIPLE VARIABLES**

I downloaded an SPSS file from surveymonkey which a client had used to collect data. I then output that as a SAS file (sas7bdat) which did not work, then as an xpt file, which did work, and I had my data – sort of.

If you have a matrix of N questions like this:

Question True False

Q1. Do you eat bugs?

Q2 . How about dirt?

Q3. Do you find horses attractive?

etc.

You get a data set with N*2 variables, where v1 = missing if not checked true for the first question, 1 if checked true, v2 = missing if not checked false for the first question, 1 if checked false

I would like to recode this into N variables, each coded 0 for false, 1 for true. You could do this with N*2 IF statements (that’s just crazy).

The first step to solving this problem is realizing that the question number has to be related to the variable number.

Q1 uses V1 & 2

Q2 uses V3 and 4

Q3 uses V5 and 6

Q4 uses V7 and 8

If you look at this as math problem you see that the answer “True” to question N will always be in V (2*N -1). That was my first brilliant insight. Then, I thought I would do two arrays came up with a solution, which although it works was just kind of stupid. There were a lot of statements that were unnecessary. I have omitted my less than brilliant solution because – well, what’s the point, really?

Finally I came up with this (which I later modified a little bit)

`array answers {*} q1 - q10 ;`

array questions {*} q0001_00 -- q0001_19 ;

Do i = 1 to DIM(answers) ;

j = (i *2)- 1 ;

answers{i} = sum( questions{j},0*questions{j+1},) ;

end ;

Using the SUM function is good because one of those two items for checked true or checked false will always be missing. The SUM function returns the sum of the non-missing data. Using

`questions{j} + 0*question{j+1} ;`

would have meant that every question would end up missing. Bad.

However, once I looked at this, I realized it looked like I was doing something really stupid. Why on earth was I multiplying the value of question{j+1} times zero? It was always going to be zero. Just change that to a constant, 0.

The reason I had included 0 * questions{j+1} that is I wanted to get missing if the respondent had checked NEITHER true nor false. If I just put in 0 as a constant, when the respondent checked neither the answer would be zero, or false. Giving a zero for an item left blank is okay if it is a test – if they don’t answer, they get no credit, but not for a survey.

Instead, I put an IF statement up front, just to make it more obvious what I was doing.

So these seven statements recode all of the N*2 variables into N questions coded 0 = false, 1 = true. I like this way of doing it better because I think it is more obvious I am setting the answer to missing when they have marked neither true nor false. It is also pretty obvious that the answer is true when the jth question is answered true.

`array answers {*} q1 - q10 ;`

array questions {*} q0001_00 -- q0001_19 ;

Do i = 1 to DIM(answers) ;

j = (i *2)- 1

if questions{j + 1} = . & questions{j} = . then answers{i} = . ;

Else answers{i} = sum(0, questions{j}) ;

end ;

What is not so obvious is why add zero to anything? What is the point of that? Why not just

`answers{i} = questions{j} ;`

Isn’t that the same?

No. If you do that, you’ll get a mean of 1 and a standard deviation of zero. Remember I said each variable is either 1 or missing. So, if both variables are missing (my IF statement) the answer is set to missing. ELSE the answer is 1 when the respondent has marked true and 0 if the respondent has not marked true.

Why use the DIM function? Isn’t that a little silly I mean, I know I have 10 answers so what’s with the DIM and the asterisks? Well, there are actually hundreds of variables in this data set. I used these first 10 to test my code but now I have to go back and add a bunch more variables to those two arrays and they are not in order, they’re like question 14, 26, 36, 43 and so on. I can just add these to the two arrays, not worry that I will miscount (which I probably will) and not worry about changing the numbers for the dimension of the arrays or the Do-loop.

My point about programming is that the major part of solving this problem is understanding the relation between the answer and the variable number where it is found (i*2 – 1) , that you need to account for when the respondent did not check either true or false, the concept of an array, why the answer can’t just be the answer to questions{j} , what’s a DIM function, the difference between a SUM function and adding two numbers .

Writing the seven statements to solve this problem took 10% of my time.

Figuring out how to solve it was the other 90% .

# Comments

2 Comments so far

## Blogroll

- Andrew Gelman's statistics blog - is far more interesting than the name
- Biological research made interesting
- Interesting economics blog
- Love Stats Blog - How can you not love a market research blog with a name like that?
- Me, twitter - Thoughts on stats
- SAS Blog for the rest of us - Not as funny as some, but twice as smart. If this is for the rest of us, who are those other people?
- Simply Statistics, simply interesting
- Tech News that Doesn’t Suck
- The Endeavor -John D Cook - Another statistics blog

[…] Last week I mentioned a problem with scoring questions when each of dozens of true/ false questions had not been scored true or false (as one might think) or 1 or 0 (as one might think in mathematical terms) but, no, in some bizarre Alice in Wonderland mushroom-eating logic, each question was recorded as the answer to TWO VARIABLES. The first was 1 if the person answered true, and missing otherwise. The second variable was coded 1 if the person responded false, and missing otherwise. […]

[…] A few days ago, I gave a solution when you have, for some bizarro reason, questions on a one to five scale coded into five different variables. Before that, I had a similar problem when true / false questions were coded into two variables. […]