Easiest Quintile Matching Ever

Upfront confession – I completely copied the first part of this from a macro Paul Dickman wrote in 1999 because it did ALMOST exactly what I wanted. I tested it several times with different data and on different computers to see that it worked the way I expected. I actually found it on some random page but since they had retained the comments that showed the author, I tracked down Paul Dickman’s page with the enormous effort of typing a few words into Google and found the original source of this macro.

I threw in an option to do a surveyfreq, a couple of positional parameters, and I was done. Life is good.

A few days ago, I discussed one method of propensity scores, which was to match on the exact score as closely as possible. I used a modification of Lori Parson’s 1-5 Greedy match macro to do that.

Another way is to sort the scores into quintiles – the bottom 20% percentile, 20-40th percentile and so on. Then, you can control for quintile. The first thing I did was a logistic regression to create propensity scores, like so:

PROC LOGISTIC DATA = indata DESCENDING ;

CLASS var3 ;

MODEL depend = var1 var2 var3 var4 var5 ;

OUTPUT OUT = example PROB = propensity ;

This first step creates an output data set that has all of the original data plus the propensity scores. Next, I call Paul Dickman’s macro, which I reproduced below. The last two parameters in the macro statement I added. These are positional parameters. That means they are optional and if you don’t specify them when you call the macro you get the exact same results as the original macro, which puts a new variable named whatever you specify for “quintvar” in your existing data set. That is the quintile score.

The positional parameters, num and depend allow you to specify the number of matches you want. If you want 1 match from the larger population to each in your smaller group, then you would enter a 1. Depend is the name of your dependent variable. This assumes that your smaller group is coded 0 and your larger group is coded 1. For example, if dependent variable is lived, 0= no and 1 = yes.

%macro quint(dsn,var,quintvar,num=, depend=);

/* calculate the cutpoints for the quintiles */
PROC UNIVARIATE NOPRINT DATA=&dsn;
VAR &var;
OUTPUT OUT=quintile PCTLPTS=20 40 60 80 PCTLPRE=pct;
RUN;

/* write the quintiles to macro variables */
DATA _NULL_;
SET quintile;
CALL SYMPUT(‘q1’,pct20) ;
CALL SYMPUT(‘q2’,pct40) ;
CALL SYMPUT(‘q3’,pct60) ;
CALL SYMPUT(‘q4’,pct80) ;
RUN;

/* create the new variable in the main dataset */
DATA &dsn;
SET &dsn;
IF &var =. THEN &quintvar = .;
ELSE IF &var LE &q1 THEN &quintvar=1;
ELSE IF &var LE &q2 THEN &quintvar=2;
ELSE IF &var LE &q3 THEN &quintvar=3;
ELSE IF&var LE &q4 THEN &quintvar=4;
ELSE &quintvar=5;
RUN ;

I added this part because I wanted to give the option to randomly select a specified number of subjects with which to match each subject, as well as to add the quintile value and keep the whole data set.

The first statement checks to see if the positional parameter for number (&num) was entered. If the length= 0 it was not entered and we just skip to the end. Note that the label, downhere, does NOT have colon after it in the first statement but it does when it is used as a label at the end.
%IF %LENGTH(&num) = 0 %THEN %GOTO downhere ;

/* This outputs the two groups on the dependent variable, 0 and 1, to two different data sets */

DATA small large ;
SET &dsn ;
IF &depend = 0 THEN OUTPUT small ;
ELSE IF &depend = 1 THEN OUTPUT large ;

/* The frequency distribution of quintile of the smaller group (0) is output to a dataset named samp_pct */
PROC FREQ DATA = small ;
TABLES &quintvar / OUT = samp_pct ;
RUN ;

/* This creates a sample size per strata to match whatever was specified by the num positional parameter */

/* That variable needs to be named _NSIZE_ because this dataset will be the secondary data set to PROC SURVEYSELECT */
DATA samp_pct ;
SET samp_pct ;
_NSIZE_ = &num ;
_NSIZE_ = _NSIZE_ * COUNT ;
DROP PERCENT ;

/* Sort by strata is required for PROC SURVEYSELECT */
PROC SORT = large ;
BY &quintvar ;

/* This will select the number per strata to match the smaller sample */
PROC SURVEYSELECT DATA= large SAMPSIZE = samp_pct OUT = largesamp ;
STRATA &quintvar ;
RUN ;

/* Notice there needs to be a period there to separate the &dsn macro variable from the _SAMPLE appended to the new data set name */
DATA &dsn._SAMPLE ;
SET largesamp small ;
RUN ;

/* This is a label so there is really supposed to be a colon there, not a semi-colon */
%downhere:
RUN ;
%MEND quint;
RUN ;
How to use this ….

Either

%quint(mydata.project, prob,quint,num=1, depend= lived) ;

%quint(mydata.project, prob,quint) ;

/**********************************************************
* Macro for grouping continuous variable into quintiles. *
* The macro REQUIRES three input variables:               *
* dsn: data set name                                                            *
* var: variable to be categorised                                      *
* quintvar: name of the new variable to be created        *
*                                                        *
* If only these are specified a quintiles variable          *
* specified in quintvar will be added to the dataset, dsn *
*                                                                                        *
* Two OPTIONAL (positional) parameters are                *
* num: number of matches.                                *
* depend: dependent variable                             *
**** The LARGER group has the dependent variable coded 1 *                                                       *
*    The SMALLER group has the dependent coded 0          *
*                                                                                             *
* Example usage 1                                          *
* %quint(mydata.project, prob,quint,num=1, depend =lived) ;     *
* will output a dataset mydata.project_SAMPLE                 *
* with one subject with a value of 1 on LIVED matched to *
* each person with a value of 0 on LIVED                 *
*                                                          *
* Sample usage 2:                                                        *
* %quint(mydata.project,prob,quint );                       *
*                                                                                    *
* After running the macro, the dataset mydata.project     *
* will contain a new variable called quint with values    *
* . (missing), 1, 2, 3, 4, and 5.                         *
*                                                         *
* The cutpoints for the quintiles are calculated based    *
* on all non-missing values of the variable in question. *
*                                                         *
* To base the cutpoints for the quintiles on, for example,*
* controls only, the code can be changed as follows:      *
* proc univariate noprint data=&dsn.(where=(control=1)); *
*   Modification of 1999                                                           *
* Macro by Paul Dickman                                                    *
* AnnMaria De Mars 2012                                                     *
**********************************************************/

First person to comment with ANOTHER use of colons in SAS (other than labels and not counting body parts), I’ll send you a Julia Group flashlight pen.

5 Comments

Michelle Homes says:

February 7, 2012 at 3:56 am

Another use of colons in SAS is in variable lists as a prefix notation. For example, specifying all variables that start with qtr on the drop statement then you state….

drop qtr: ;

A very neat way if you have a lot of variables such as qtr1 qtr2 qtr3 qtr4 or more to drop.

You can use this technique where you specify variables.
AnnMaria says:

February 7, 2012 at 4:17 am

Ooh. You win. Email me your address at

annmaria@thejuliagroup.com and I will mail you your prize.
Rick Wicklin says:

February 7, 2012 at 11:02 am

He took mine, but I have a few more:
1) In SAS/IML, you can use the colon operator to form a vector of consecutive values. For example, x=1:5; is the vector {1 2 3 4 5} and y=’x1′:’x3′; is the vector {“x1” “x2” “x3”}.

2) In SAS/IML, you can use the color operator in a matrix subscript to mean “take the mean of this dimension. So z[:] is the mean of all values, z[1,:] is the mean of the first row, and z[:,3] is the mean of the 3rd column.

3) In PROC SQL, the colon operator is used to specify a macro variable. If you say
SELECT sum(x) into :mysum
then &mysum contains the sum of the x variable.

4) The first use can also be used in the DATA step as an alternative declaration for the length of an array:
array x{1:3} x1-x3;
More often you see just x{3}. However, the alternate syntax is useful if you are using an offset other than 1 to index your array. For example, you can say
array x{2008:2012} x2008-x2012;
if your data are from the years 2008-2012.
AnnMaria says:

February 7, 2012 at 2:13 pm

The array one is interesting . I’d never thought of indexing arrays any other way.
Gabe says:

February 11, 2012 at 1:49 pm

You use a colon when specifying a numeric range for an IN condition, such as…

Where var in (1:5)

Easiest Quintile Matching Ever

SAS on an iPad in a Soccer Field: Testing Homework Limits

No Surprise, Children Don’t Understand Math

Survival analysis tip from Ogden Nash

Are we living longer – or not?

The Emperor’s New Statistics

When Data Misbehave: Categorical Data

5 Comments

Leave a Reply

Similar Posts

5 Comments

Leave a Reply