I was brilliant. Then I wasn’t. Then I was.

Programming is NOT mostly about writing code. It’s mostly about figuring out how to solve a problem. Here is an example from yesterday….


I downloaded an SPSS file from surveymonkey which a client had used to collect data. I then output that as a SAS file (sas7bdat) which did not work, then as an xpt file, which did work, and I had my data – sort of.

If you have a matrix of  N questions like this:

Question                                                         True  False

Q1. Do you eat bugs?

Q2 . How about dirt?

Q3. Do you find horses attractive?


A man and his horse

You get a data set with N*2 variables, where v1 = missing if not checked true for the first question, 1 if checked true,  v2 = missing if not checked false for the first question, 1 if checked false

I would like to recode this into N variables, each coded 0 for false, 1 for true. You could do this with N*2 IF statements (that’s just crazy).

The first step to solving this problem is realizing that the question number has to be related to the variable number.

Q1  uses V1 & 2

Q2 uses V3 and 4

Q3 uses V5 and 6

Q4 uses V7 and 8

If you look at this as math problem you see that the answer “True” to question N will always be in V (2*N -1). That was my first brilliant insight. Then, I thought I would do two arrays came up with a solution, which although it works was just kind of stupid. There were a lot of statements that were unnecessary. I have omitted my less than brilliant solution because – well, what’s the point, really?

Finally I came up with this (which I later modified a little bit)

array answers {*} q1 - q10 ;
array questions {*} q0001_00 -- q0001_19 ;
Do i = 1 to DIM(answers) ;
j = (i *2)- 1 ;
answers{i} = sum( questions{j},0*questions{j+1},) ;
end ;

Using the SUM function is good because one of those two items for checked true  or checked false will always be missing. The SUM function returns the sum of the non-missing data. Using

questions{j} + 0*question{j+1} ;

would have meant that every question would end up missing. Bad.

However, once I looked at this, I realized it looked like I was doing something really stupid. Why on earth was I multiplying the value of   question{j+1} times zero? It was always going to be zero. Just change that to a constant, 0.

The reason I had included 0 * questions{j+1}  that is I wanted to get missing if the respondent had checked NEITHER true nor false. If I just put in 0 as a constant, when the respondent checked neither the answer would be zero, or false. Giving a zero for an item left blank is okay if it is a test – if they don’t answer, they get no credit, but not for a survey.

Instead, I put an IF statement up front, just to make it more obvious what I was doing.

So these seven statements recode all of the N*2 variables into N questions coded 0 = false, 1 = true. I like this way of doing it better because I think it is more obvious I am setting the answer to missing when they have marked neither true nor false. It is also pretty obvious that the answer is true when the jth question is answered true.


array answers {*} q1 - q10 ;
array questions {*} q0001_00 -- q0001_19 ;
Do i = 1 to DIM(answers) ;
j = (i *2)- 1
if questions{j + 1} = . & questions{j} = . then answers{i} = . ;
Else answers{i} = sum(0, questions{j}) ;
end ;

What is not so obvious is why add zero to anything? What is the point of that? Why not just

answers{i} = questions{j} ;

Isn’t that the same?

No.  If you do that, you’ll get a mean of 1 and a standard deviation of zero. Remember I said each variable is either 1 or missing. So,  if both variables are missing (my IF statement) the answer is set to missing. ELSE the answer is 1 when the respondent has marked true and 0 if the respondent has not marked true.

Why use the DIM function? Isn’t that a little silly I mean, I know I have 10 answers so what’s with the DIM and the asterisks? Well, there are actually hundreds of variables in this data set. I used these first 10 to test my code but now I have to go back and add a bunch more variables to those two arrays and they are not in order, they’re like question 14, 26, 36, 43 and so on. I can just add these to the two arrays, not worry that I will miscount (which I probably will) and not worry about changing the numbers for the dimension  of the arrays or the Do-loop.

My point about programming is that the major part of solving this problem is understanding the relation between the answer and the variable number where it is found (i*2 – 1) , that you need to account for when the respondent did not check either true or false, the concept of an array, why the answer can’t just be the answer to questions{j} , what’s a DIM function, the difference between a SUM function and adding two numbers .

Writing the seven statements to solve this problem took 10% of my time.

Figuring out how to solve it was the other 90% .



Similar Posts


Leave a Reply

Your email address will not be published. Required fields are marked *