When 3 = 15: Another annoying data problem

Last week I mentioned a problem with scoring questions when each of dozens of true/false questions had not been scored true or false (as one might think) or 1 or 0 (as one might think in mathematical terms) but, no, in some bizarre Alice in Wonderland mushroom-eating logic, each question was recorded as TWO VARIABLES. The first was coded 1 if the person answered true, and missing otherwise. The second was coded 1 if the person answered false, and missing otherwise.


Last week, I also gave the solution to scoring these in Normal World, where we ended up with variables scored 0 or 1.

Just because that way of recording data was not fucked up enough, this little data problem presented itself:

Respondents were asked to rate their ability to read, write and speak a second language on a scale from 1 = None to 5 = Native speaker.

You might assume that these would be scored on a scale of 1 to 5 for three variables. You think that, my friend, because you are not stupid.

You might assume that if these were for some unfathomable reason scored as fifteen variables, the first variable would be 1 if the respondent answered 1 for the first question and missing otherwise. The second variable would be 1 if the respondent answered 2 for the first question and missing otherwise. You think this because you noted a pattern above and are logical.

Neither of those assumptions is true. In this case, the data were coded like so:

V1 = 1 if answered 1 to question 1, missing otherwise

V2 = 1 if answered 1 to question 2, missing otherwise

V3 = 1 if answered 1 to question 3, missing otherwise

….. all the way down to  …..

V15 = 1 if answered 5 to question 3, otherwise V15 is a missing value.

If that wasn’t enough to make you pull your hair out, after I scored it, I found out that some people had a score of 9 on a 1 to 5 scale.

In examining the data, it turned out that a few people had checked both 4 = Advanced ability and 5 = Native speaker. While I understand how people could see those as not mutually exclusive categories and check both, the researcher wanted these people to have a score of 5.

Simply stated, the problem is this:

Take these 15 variables and code them into three questions. When respondents selected two choices, assign the larger value.

The solution is actually quite simple and it is another array:

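data test ;
   set newfile ;
   /* The three scored items, each on the 1 to 5 scale */
   array language {3} writing listening speaking ;
   /* The 15 check-box variables: q1-q3 = level 1, q4-q6 = level 2, and so on */
   array langq {15} q1 - q15 ;
   do L = 1 to 3 ;
      /* Weight each box by its level; MAX ignores missing values and keeps the highest level checked */
      language{L} = max(langq{L}, langq{L+3}*2, langq{L+6}*3, langq{L+9}*4, langq{L+12}*5) ;
   end ;
   drop L ;   /* the loop index is not needed in the output data set */
run ;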
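To try the scoring step without the real file, here is a minimal sketch on made-up data. The three respondents, the `id` variable and the values are hypothetical, invented only for illustration; the column order assumes the layout described above (q1-q3 are level 1 for the three questions, q4-q6 are level 2, and so on).

```sas
/* Hypothetical toy data: each q variable is 1 if the box was checked, missing (.) otherwise */
data newfile ;
   input id q1-q15 ;
   datalines ;
1 1 . . . 1 . . . 1 . . . . . .
2 . . . . . . . . . 1 . . 1 . .
3 . . . . . . . . . . . . . . .
;
run ;

/* Same scoring logic as in the post */
data test ;
   set newfile ;
   array language {3} writing listening speaking ;
   array langq {15} q1 - q15 ;
   do L = 1 to 3 ;
      language{L} = max(langq{L}, langq{L+3}*2, langq{L+6}*3, langq{L+9}*4, langq{L+12}*5) ;
   end ;
   drop L ;
run ;

proc print data = test noobs ;
   var id writing listening speaking ;
run ;
```

Respondent 1 should come out as 1, 2 and 3 on the three items. Respondent 2 checked both 4 and 5 on the first item and should get a 5 there. Respondent 3 checked nothing, so all three scores stay missing. Expect notes in the log about missing values being generated, since multiplying a missing value by a weight is still missing.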

So, for speaking (L = 3), for example, if the respondent checked:

  • q3, none, the score = 1
  • q6, basic, the score = 2
  • q9, intermediate, the score = 3

and so on.

I could have used the SUM function if it weren’t for the people who checked both 4 and 5; a sum would give them 4 + 5 = 9. Using the MAX function gives those people a score of 5. We also had a discussion with the research team about (hypothetically) people who checked both 2 and 3, for example, because they felt their reading ability fell between basic and intermediate. In that case, their score would be rounded up to the next whole number. The MAX function would give a 3, so it works in that case too. That didn’t actually occur in these data yet, but we like to be prepared.
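For anyone who wants to see the difference side by side, here is a rough sketch that scores each item with both SUM and MAX and uses the N function to count how many boxes were checked. The data set name `check` and the `sum_`, `max_` and `n_` variable names are made up for this example; `newfile` and q1-q15 are assumed to be laid out as in the code above.

```sas
data check ;
   set newfile ;
   array langq  {15} q1 - q15 ;
   array bysum  {3} sum_writing sum_listening sum_speaking ;
   array bymax  {3} max_writing max_listening max_speaking ;
   array nboxes {3} n_writing n_listening n_speaking ;
   do L = 1 to 3 ;
      /* SUM adds every level checked, so 4 and 5 together give 9 */
      bysum{L}  = sum(langq{L}, langq{L+3}*2, langq{L+6}*3, langq{L+9}*4, langq{L+12}*5) ;
      /* MAX keeps only the highest level checked, so 4 and 5 together give 5 */
      bymax{L}  = max(langq{L}, langq{L+3}*2, langq{L+6}*3, langq{L+9}*4, langq{L+12}*5) ;
      /* N counts the non-missing arguments, i.e. how many boxes were checked */
      nboxes{L} = n(langq{L}, langq{L+3}, langq{L+6}, langq{L+9}, langq{L+12}) ;
   end ;
   drop L ;
run ;

/* List only the respondents who checked more than one box on any item */
proc print data = check ;
   where n_writing > 1 or n_listening > 1 or n_speaking > 1 ;
run ;
```

For anyone who checked exactly one box the two scores agree, so the printed rows are the only ones where the choice between SUM and MAX matters.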

  1. I would have liked to be a fly on the wall in the room when you were talking with the team and realized their bizarre reinterpretation of the concept of variables and variable levels.
