Today’s post courtesy of Captain Obvious …
To do quintile matching, one must first match by quintiles. Hence, the name.
Quintiles divide your data set into five equal, um, fifths. Quint is Latin (or Greek or some other random language) for five. Hence, the name.
So, when I did this:
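/* percentile cut points for prob (input data set assumed to be AllPropen) */
PROC UNIVARIATE DATA = AllPropen ;
VAR prob ;
OUTPUT OUT = testquint PCTLPTS = 20 40 60 80
PCTLPRE = PCT ;
RUN ;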
I expected to get the values that divided the data set into quintiles.
When I did this, because I was too lazy to type in the numbers,
/* write the quintiles to macro variables */
data _null_ ;
set testquint ;
call symput('q1',pct20) ;
call symput('q2',pct40) ;
call symput('q3',pct60) ;
call symput('q4',pct80) ;
run ;
/* create the new variable in the main dataset */
data AllPropen ;
set AllPropen ;
if prob =. then quintile = .;
else if prob le &q1 then quintile=1;
else if prob le &q2 then quintile=2;
else if prob le &q3 then quintile=3;
else if prob le &q4 then quintile=4;
else quintile=5;
run ;
/* count the records in each quintile */
proc freq data = AllPropen ;
tables quintile ;
run ;
I expected to get five even groups.
I did not.
I got three groups with 1,088 records, but my first group had 1,075, which is, obviously, less than 1,088, and my second group had more than 1,088.
I considered several possibilities. Did I misremember the meaning of percentile? Should it be LESS than the 20th percentile point instead of less than or equal to? If that is the case, why did the rest of the groups come out perfectly?
Did the macro facility for some reason not compare down to enough decimal places to see that the 20th percentile value was, in fact, equal? To check for that, I multiplied the probability at the 20th percentile by 10, then by 100, and compared it to 10 times, and then 100 times, the 20th percentile value, thus requiring one or two fewer decimal places. Still, not equal.
I used %PUT to write the values of the macro variables &q1 to &q4 to the log.
They were correct.
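For anyone who wants to repeat that check, here is a minimal sketch. It assumes the four macro variables q1 to q4 were created by the CALL SYMPUT step above; the labels in front of each value are just my choice for readability.

```sas
/* Sketch: write the four cut points to the log for inspection. */
/* Assumes q1-q4 already exist from the CALL SYMPUT step.       */
%put q1=&q1 q2=&q2 q3=&q3 q4=&q4 ;
```

CALL SYMPUT converts a numeric value to text using the BEST12. format, so what shows up in the log is exactly the text the IF statements compared against. Checking it against the values in testquint is a quick way to see whether any digits were lost.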
I re-ran the program in SPSS. Same result.
Finally, it dawned on me. I did a PROC FREQ and realized that, duh, there was NOT a score at exactly the 20th percentile. While there was, for example, exactly one score at the 40th percentile, there were 11 people at the 19.76th percentile and 15 at the 20.04th percentile. Tied records have to land in the same group, and there was not a single score at the 20th percentile, so my SAS program could not give me an exact 20th percentile.
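The check looks something like the sketch below. It assumes the same AllPropen data set and prob variable as before, and the statements shown are the bare minimum rather than a record of exactly what I ran.

```sas
/* Sketch: one-way frequency table of the propensity scores.      */
/* Each distinct value of prob gets a row, with its Frequency and */
/* Cumulative Percent, so tied scores show up as Frequency > 1.   */
proc freq data = AllPropen ;
tables prob ;
run ;
```

The place to look is the Cumulative Percent column. If it jumps from just under 20 to just over 20 between two adjacent rows, no cut point can split the data at exactly 20%.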
Thank you, Captain Obvious.
I have no idea why the obvious answer did not occur to me immediately, maybe because with a smaller data set, I wouldn’t expect to have several records match down to the 12th decimal place.
On the other hand, this further reinforces what I already knew about myself, which is that I am never satisfied with “close enough”. If it is supposed to be 20% and I get 19.76%, I want to know why, damn it!
I think it also kind of shows how easy it is to get tunnel vision. I have spent the last few days focusing on some really, really complicated design problems, so when I got back to looking at these results from the PROC UNIVARIATE I had done last week, I began by assuming it must be something complicated, instead of starting with the most basic, obvious possibility first, which is what I am always telling other people to do.
As the hockey player in Slap Shot said about the penalty box,
“Then you must feel shame.”