I have a confession to make. As a sixteen-year-old college freshman, I had no idea what people in business did. As a nineteen-year-old college graduate, with a BSBA, I didn’t know much more.  My more affluent classmates had internships in the summer. My scholarship covered my tuition and sometimes room and board but left a lot uncovered, like textbooks, clothes and rent and food during the school breaks. So, I worked all summer at one minimum wage job or another, waitressing, selling shoes. When I told my grandmother I was going to go to graduate school to get an MBA she said,

“Mija, what do MBAs do?”

I told her,

“I’m not sure, Nanny, but I think they make a lot of money.”

We both agreed that sounded good.

One reason I often speak at schools in low-income neighborhoods is that kids who attend those schools have very few ideas of career possibilities. They know their teachers, go to the doctor, see police officers in their neighborhoods – but they don’t really get exposed to a lot of other professions.

Inspired by posts at the blog Some Lies on a day in the life at a research organization and a Day in the Life of Staff Scientist at 27 and a Ph.D. , I thought I would give a Day in the Life of a Consultant.

10: Coffee! Lucia comes to clean office.

10:30 -11:00 Coffee! Marisol stops by. She runs a small business doing data entry. I give her a pile of surveys and tax receipts to enter (in two different files!) . Check twitter. Read comments on stepwise methods. Agree they suck.


11:00- 11:30 Coffee! Read email. Delete 50 email messages for services I don’t want. Respond to email from new potential client that yes we would be delighted to work with her . Read answers to yesterday’s data questions, revise report accordingly. Review and approve website modifications and email for client approval. Respond to request to reschedule meeting that no, Friday, I’m in Ohio, and Monday I am in Boston, how about the following week. Client decides to meet tomorrow.

11:30 – 2:30 Coffee! Do tables for monthly client evaluation report in Excel. Email client questions about data. Why are some people included in the data base even though their last service date was years ago?

2:30 – 4:00 Coffee! Email taxes to accountant. Talk to client about additional work on website. Write eight pages on book I have contract to finish by July.

4:00 – Coffee! Work on paper for SAS Global Forum. Take break to talk on phone with oldest daughter and discuss general perfection of granddaughters.

6:00  Help youngest daughter with Algebra homework. Eat dinner. Watch news with husband. Offend daughter by telling her to shut up. Daughter stomps upstairs and refuses to speak to me the rest of the night.

8:00 Work on report for client.

9:00 Chardonnay. Read RFP from a local government seeking small business bids. Skim 85-page PART ONE of RFP, look at other five parts, detailing requirements. Note that no dollar amount is given, only “Under $100,000”. Discuss with rocket scientist and agree there is no fucking way it is cost-effective for us to bid on this.

9:30 Work on PowerPoint for SAS Global Forum Paper

10:00 Ride exercise bike

12:00 Finish client draft report.

12:30 Write blog

1:30 Check survey data Marisol had entered. Decide to write program in the morning since it is past 1 a.m.

2:00 Write list of people I want to contact in the morning, after I have coffee

I wondered about posting this since it doesn’t sound like a very exciting day, and I wasn’t sure it was representative. I concluded that it was as representative as any other day.

About 80 days of the year I am in some other city. That is actually much less than it used to be. My children used to tell me that all of my stories of their childhood began, “Well, I was out of town at the time – “.

Some days I am chasing a problem and do nothing but bang out code for 14 hours. Other days I’m working on a grant proposal and just write.

In brief, being a consultant is a mix:

  1. the business side – like taxes and invoices, having other people work for you doing anything you can outsource, like data entry, accounting or cleaning
  2. doing business –  working with clients by email, in person and on the phone to get information on what they need, writing reports, analyzing data – and
  3. getting business – responding to RFPs or to potential clients who contact us and want to work with us.

The good part is that there is a lot of flexibility. If I want to exercise for an hour, write a blog post or conference paper, no one can tell me to work on something different. The other good part is that if I want to start work at 10:30 and work until 2 a.m. I can.  Another good part is the variety. I can use Excel, SPSS and SAS, go from writing a report on meeting federal targets to an analysis of survey data to re-designing a website. Another good part is there is no commute. I work in an office in my house.

The bad part – I’m tempted to say there is no bad part. The biggest hassle is probably when clients pay late. I have a line of credit with the bank. The balance is currently at zero, but having that is a big advantage when people do pay late. When I started, I had to do a lot  more accounting, invoicing, budgeting – the business side – that really bored me. Now I can afford to pay other people to do most of that.

So, yeah, that’s what I do all day.



Continuing to learn Advanced SAS from the Propensity Score Matching with Calipers macro from Feng, Yu & Xu , we take our data set we created by doing a principal components analysis on the cases (experimental) group, using the coefficients from that analysis to score every record in our control group, concatenating the cases and controls, performing a cluster analysis on the whole group that adds a variable “distance” which is the distance from the seed of the one cluster.  The data set we ended up with after all this statistical slam-dunking was named mahalanobis_to_point.

I was thinking about the FASTCLUS and I realized something. The PROC APPEND appends all of the records that fall within the calipers to the reference point data set. In FASTCLUS , the first complete record is going to be the seed . So, that first record, which is the seed is from one group and all of the other records are from your matching group.  (Yes, Feng et al. said exactly that in their paper but I needed to see how it worked for myself.)

So, what now ? First, we’re going  to use SQL to select all the records from the mahalanobis_to_point that are not in the &refdata data set. That is, we are going to pull out the case we’re trying to match so we don’t end up with the silly situation where something is matched with itself.

I’m not sure if I can call PROC SQL a nifty thing.

I guess it is a good thing in the sense that homework and canned spinach are a good thing. Anyway, here we throw in some practice with PROC SQL.

proc sql;

create table mahalanobis_to_point

as select a.*
from mahalanobis_to_point a, &refdata b
where a.&id^=b.&id;

Next, we sort all of the records by distance, we SET by distance, which will be in the sorted order, and we take the the first record, because that is the the one with the smallest distance from the seed, i.e., the case we’re trying to match.
proc sort data=mahalanobis_to_point;
by distance;
data mahalanobis_to_point;
set mahalanobis_to_point;
by distance;
if _n_=1;

We add those two records into a data set named case_ctril and give them both the value of &i (the loop counter) for their newid variable, which we can now use to link the case and it’s control group match. We don’t want to write over their old id variable because that is usually needed.  The need to apply a newid is why we can’t (as you might be tempted at first glance) combine these into the case_ctrl_together data set in one step.

At our next step we add these two records, the case and its matched control,  into the case_ctrl_together data set we are building. We pull the record for the matched case out of the temporary data set with cases. We are going to keep going around in the loop until all of the cases are pulled out, either because they were matched here, they were matched previously (because there was only one match) or because there is not a control record that matches within the defined caliper.
data case_ctrl;
set &refdata(keep=&id) mahalanobis_to_point(keep=&id);
data case_ctrl_together;
set case_ctrl_together case_ctrl;
if &id=. then delete;

proc sql;
create table case_temp
as select a.*
from case_temp a, mahalanobis_to_point b
where a.&id^=b.&id;
*** Exclude patients in case group which are selected in the Mahalanobis macro ***;
%mend Mahalanobis;

You may be excused for having forgotten by now that this macro was executed inside of a loop inside of another loop inside of a macro, but it was. Having cranked through the macro as many times as necessary we hit a %END statement, and go searching for another case-control match. When we have matched all of our cases, we’ll hit our second %END statement for the last time and exit the outer loop.

***Create test dataset that contains patients from case_control_together, and all
variables and propensity scores as well***;

This final PROC SQL is going to select out only the matched records into a data set (table) named test from the original data set we had created at the very beginning as output from the PROC LOGISTIC with our propensity scores (remember propensity scores? That’s what this macro was about, matching by propensity scores.)

It was a lot of work to get this and I wanted it saved as a permanent data set just in case there was some problem later in the program and I accidentally did something stupid like save over this data set after it took 40 minutes for the program to run and create it. So, I added a DATA step to save it off somewhere safely away in a permanent SAS data set.

proc sql;
create table test
as select a.*, b.newid
from propen a, case_ctrl_together b
where a.&id=b.&id
order by newid,&yvar;
data study.calipers ;
set test ;
run ;
%mend match;

Was it worth it?

So, there you have it, how to do propensity score matching with calipers in 5,000 words or more over four posts. Even given all of that, I am not so sure it is worth it, other than the educational value of going through the macro – and there IS considerable educational value. However, quintiles is far simpler and often works to reduce the bias a great deal. The GREEDY match macro, is also far simpler (though not as simple as quintiles) and has worked really well with the data I’ve been using. I’m not convinced that the improvement in precision is worth the tradeoff in complexity. Yeah, it’s a cool macro and all, but you need someone to understand it and maintain it. When disseminating the results of your research you’re going to need to explain it to people like a hospital board, a city council. Increasing the probability of difference between case and control from .76 to .91  and reducing the r-square from .002 to .00018 – well, I’m just not sure it is a non-trivial improvement in a lot of cases.

That, however, is a philosophical discussion for another day.



This is part 3 (hence the name) of my posts on how to learn advanced SAS from other people’s code. Yes, taking a class is a good idea – I have taken several – a couple from SAS Institute here and there, several before or after conferences like WUSS and SAS Global Forum.

For me, though, the two best ways to learn are:

1. Take something really complicated someone else has done, take it apart to really understand it and modify it to fit my needs.

2. Use the language to solve some problem that is bothering you (or someone paying you).


So, here are are continuing with the propensity score matching with calipers macro from Feng, Yu & Xu. This is the really interesting part, in my opinion, and the crux of the whole matter, where you are trying to find the best match.


%MACRO Mahalanobis(data, var, refdata);
*** This macro is to calculate Mahalanobis distance from each point to a reference
*** point. For example, reference point can be a patient from case group, other
*** points can be the patients from control group who meet the criteria within
*** 1/4 standard deviation of Logit of propensity score ***

Nifty thing number twelve – a principal components analysis

So, here we are doing a principal components analysis with the data set data, OUT = is going to be a data set with our data plus the first principal component (in this case there is only one principal component because there is only one variable, the propensity score. You COULD have multiple variables – the propensity score and additional variables). The OUTSTAT = data set is going to contain the statistics created using the first data set – which is your treatment or experimental group (the “cases”).

Personally, I don’t like the NOPRINT option. I always want to print everything and look at it. In this example, since you only have one variable, you won’t be getting a lot of output, or necessarily learning a lot about what the PRINCOMP procedure does, but at the least you can open the data sets from your explorer window and see what is in them.
VAR &var;

Nifty thing thirteen – PROC SCORE
PROC SCORE DATA = &refdata SCORE = outstat OUT = reference_point ;
VAR &var;

From the SAS manual

“The SCORE procedure multiplies values from two SAS data sets, one containing coefficients (for example, factor-scoring coefficients or regression coefficients) and the other containing raw data to be scored using the coefficients from the first data set. The result of this multiplication is a SAS data set containing linear combinations of the coefficients and the raw data values.”

In this case, it is going to take the coefficients output from the PRINCOMP procedure and apply these to your new (control) data set. In the simpler example I am using here, there is only one variable in the principal components step and thus only one principal component. Your new scored data set with the control cases, now named reference_point, is going to have a new variable added PRIN1 , which is the principal component for each case in the control group calculated using the values from the cases (experimental or treatment) group.

I said it’s a simpler example because really, if you’ve been following this post from

Learning Advanced SAS from a  Macro ,  and

Learning Advanced SAS from a Macro, Part 2

It’s become abundantly clear to you that this is not a simple example at all, but that’s the whole point – we’re now moving into more advanced programming techniques territory, either because they take a bit of statistics, like principal components, which I’ll probably discuss next week, or they use less common procedures or macros.

Nifty thing fourteen – PROC APPEND

PROC APPEND DATA =out BASE=reference_point;

Compared to a lot of this other stuff, PROC APPEND is easy. I just threw it in here as a nifty thing because not everyone knows about it. You could create a new data set using a DATA step and a SET statement. Or, you can use PROC APPEND and add the data from the DATA = step to the BASE = data set. As David Carr so helpfully explains in this SAS Global Forum paper this can save you processing time and as you will learn when you run this macro, reducing processing time is a desirable feature.

Nifty thing fifteen and sixteen – PROC FASTCLUS & VAR PRIN:
PROC FASTCLUS DATA=reference_point MAXC=1 REPLACE=none MAXITER=0 NOPRINT  OUT=mahalanobis_to_point(DROP = CLUSTER);

Note they used the PRIN: syntax.  Because there is no way for the programmers who wrote this to know how many variables you will be using and hence how many principal components you will have, they just used PRIN:  , that will use all of the variables beginning with PRIN , from PRIN1, PRIN2 to however many you have.

This does a cluster analysis on the control group data set. The variables used in the cluster analysis are the principal components generated from the second step above. Remember, these were generated using the coefficients for the cases.  (Think about this until it makes sense – we’re trying to generate the closest match to a case in the treatment group.)

Again, they used the NOPRINT option, which is not my preference because I like to look at everything.  They are creating one cluster and the new data set mahalanobis_to_point . That data set will have one additional variable, distance, which is the distance from the seed of the one cluster.

Now, FASTCLUS starts with the first complete observation as the seed. There is no iteration so you are going to end up with distance from that first observation.

As I look at this, I wonder if you would get better results if you did let it iterate.  You certainly would get different results (I know because I tried it with a sample data set).

Tomorrow-ish  (defined as tomorrow or whatever day I get around to it), I’ll finish off with the propensity score with calipers macro.

Later, I’ll try some different things with FASTCLUS, when I have a bit more time. One of the benefits of working through someone else’s code is it makes you think. At first you’re thinking about what they did and trying to understand the decisions they made. Then, you’re thinking about how you might do it differently and testing out your ideas.



Okay, where we left off on the propensity score macro from Feng, Wu and Xu and the nifty things you can learn from reading someone else’s code, in this case, their propensity score macro with calipers ….

We previously dealt with the situation where you had no matches and exactly one match. If you find more than one match within the caliper, you would then go on to the next step. This is where it gets really interesting.

Nifty things eight and nine

%if  %eval(&casen)=0 %then %do;

This shows you how to set up a loop that does certain DATA and PROC steps repeatedly. Once you start using macro %DO loops you will wonder how you ever survived without them. In this first loop, it is going to do these things if there is no matching record for our case, delete the record if there is no matching id, and set the new id equal to i (the counter variable defined way above in the previous post). Notice that you need %IF , %THEN and %DO. I always leave off at least one of those % signs.

This also uses the %EVAL function again, so it is a nice reminder if you are new to macros. This function evaluates (hence the name) the value in the parentheses as a number, not as text. That &casen is the result of a previous PROC SQL step that put the count of the number of records into that macro variable.

So … essentially, if we don’t find a match, we delete that record, re-set our newid value to be one more and loop back up to the top of the data set where we will get the next record and go again.

data case_ctrl_together;
set case_ctrl_together match_temp(in=a keep=&id);
if &id=. then delete;
if a then newid=&i;

kaleMinor Nifty Thing Ten  …. if you like SQL, which I don’t particularly, but it is probably good for you, like kale

So, here is the %ELSE, followed by the %IF,  %THEN and %DO which does the exact same thing as an ELSE IF  blah blah blah THEN DO ; in  non-macro SAS. There is our %EVAL again (repetition is good when you are first learning).
%else %if %eval(&casen)=1 %then %do;

So, yeah, what exactly do we do if we do have exactly one match? First, you temporarily create a data set of the controls and matches. Next you add the match you have found in to the case_ctrl_together data set you are building

We use a PROC SQL step to update our temporary data set of cases that do not have matches.

I have to say, this is not how I would have done this, but it is good in a way because it does show one way of solving this problem of updating the data set of matches you are building and updating the file to be matched using SQL if you are into that sort of thing. By the way, if you aren’t familiar with SQL, do notice that you need the QUIT ; statement at the end to exit out of it and go back to regular SAS syntax.
data match_temp1;
set ctrl_temp match_temp;
data case_ctrl_together;
set case_ctrl_together match_temp1(keep=&id newid);
if &id=. then delete;
proc sql;
create table case_temp
as select a.*
from case_temp a, match_temp b
where a.&id^=b.&id;

Nifty thing eleven and I really do like this …

If there is more than one match, the macro calls ANOTHER macro which finds the optimal match of those available within the calipers. It does not need the &&macrovar notation to resolve nested macro variables because none of the parameters passed to it are macro variables. They are all ‘hard-coded’ into the original macro. I thought this was rather clever.

I know I have said before that I am not a fan of the unconditional ELSE statement , even under the best of circumstances (and this is the best of circumstances), but, oh well.
%else %do;
***please specify the key variables (including propensity score as one key variable) which are used to calculate the mahalanobis distance ****;
%Mahalanobis(match_temp, prob, ctrl_temp);

There are two %END statements because the first one ends the %ELSE %DO that will happen when you have more than one match and the second ends the whole loop that you are doing from 1 to the total number of observations in the control group data set.

I really  liked the %Mahalanobis macro, so tune in next time and see if you do as well.

For now, having just met several deadlines in the past two weeks, I’m going to put my feet up, drink tea and read random blogs on Blogher just for the hell of it.



We had a cat named Beijing. “Had” is the key word in that sentence. This weekend I dropped Bejing off to live with my daughter, Jenn, who actually likes cats in general and this cat in particular.  The main reason (besides that Jenn wanted her) is that the cat always wanted to sit on my lap while I was trying to work. Since I had just finished writing several programs and the cat was sitting on my lap so I could not type, I decided to use cinchcast to explain to the cat what I had been doing.

The cat seemed unimpressed.

Beijing the Cat



I thought my day today would be easy. There is a FUZZY module that goes with the SPSS integration for Python.  All I needed to do was install the SPSS Python Integration technologies, then, maybe install the fuzzy plug in and it would all work.

Both the rocket scientist and I worked on this most of the afternoon, using three different computers, a Mac with SPSS 20, a Windows 7 machine with SPSS 20 and a Mac with SPSS 18. As of now, when we are about to give up temporarily and go out for some well-deserved wine, having already worked through sushi at sunset at Casa del Mar, we haven’t gotten it to work on anything, although we have gotten a whole bunch of interesting different errors.

First, there is FINDING the download for SPSS integration.

Go here and click on the link at the very end of the sentence

The Essentials and Plugins for IBM SPSS Statistics Version 20 for Python and .NET are available here.

I don’t see very well so when the only indication that is the link is a faint color on the last word, it is tough for me, but that’s just me. Okay, I download it. I install it. I am using a Mac so Python is already installed. No problem, right? I open the terminal window and verify that Python 2.5, 2.6 and 2.7 are all installed in usr/lib  right where SPSS thinks they should be.

I go to UTILITIES > EXTENSION BUNDLES > VIEW INSTALLED EXTENSION BUNDLES and it gives me the nice little screen below that tells me all of this stuff is installed, including FUZZY.

Installed Extensions Screen

So, I type in this little program and get the following output. So, in short, it appears that the Python call is working but the module named FUZZY can’t be found.

import spss
print “Hello, World!”
Hello, World!
import FUZZY
print FUZZY
Traceback (most recent call last):
File “<string>”, line 2, in <module>
ImportError: No module named FUZZY

I try using both upper case and lower case fuzzy and it doesn’t work either way.

So, I decide to download and install the fuzzy plug in separately and install it. I go to UTILITIES > EXTENSION BUNDLES > INSTALL EXTENSION BUNDLE and get this message that  tells me that this extension requires the Integration Plug-In which is not installed even though I installed it, I accepted all the defaults, there were no errors and the (admittedly minor) Python program above appears to run.

Missing Integration technologies.

I try installing the plug in anyway. The FIRST time when I did this, it seemed to install, SPSS could find the FUZZY module, but nothing ran, as in I got an error on the first word. So, that was when I uninstalled and re-installed the plug-in. It still didn’t work, so I went ahead and tried installing the FUZZY.spe again.

At that point, I get a message saying that components of this extension are already installed.

Hey, I think, maybe I will try it on the Windows 7 machine instead.  First, I need to get the rocket scientist to install Python 2.7 in the directory expected by SPSS, because it is not already installed on this machine. He does that while I uninstall and re-install the Python Integration Plug-in for SPSS on the Mac with no better results.

I try installing the Integration Plug-In for Python on the Windows computer, it appears to work and yet when I look under installed extensions, it shows nothing. I try my little test program above, just for the hell of it and it finds neither Python nor the fuzzy module.

I think maybe it is because both copies of SPSS 20  I have are the 14-day trial and maybe it won’t work with that. There is no reason I could think of that it wouldn’t, but maybe not.

So, I try it on my Macbook Pro, running SPSS 18 which is an honest-to-God paid for license. It doesn’t work on that, either, giving me a message about not finding the spss.pth.

I ask the rocket scientist to take a look at that while I try to get it to work on Lion. (SPSS 18 does not run on Lion, which is why I do not have a real licensed SPSS on my desktop.) He tries the thing men always do when they want to get out of something, asking stupid questions until you give up and decide it would be less trouble to do it yourself. (Men: We women are on to you. We have secret meetings where we compare notes.)

He helpfully suggests,

“Did you try looking up that error on Google?”

“Google? No, never heard of it. How do you spell that? What the fuck do you think I did? I checked that Python was installed, checked the path, read the first six links for this specific error …. “

By the sound of sighing coming from upstairs, I can conclude that it is still not working.

So … for now we are going to go out to dinner, drink wine, talk it over and come back with a plan B. I suspect Plan B will entail him reading through a couple hundred pages of documentation on Python and SPSS while I work on the other project I need to have done by next week.

Actually, I’d like to trade places with him, but unfortunately, that’s not an option.

My take away point so far- getting Python integration plug in installed on anything is a pain in the ass.



I was using a macro this week written by Feng, Yu & Xu and there were so many less-used and advanced techniques in it I thought it would be a great example to use for teaching SAS.

** ** Nifty thing one computing the caliper width  ***** ;

Like with other propensity score macros, it started with a logistic regression. This was to do a matching with calipers and since a lot of people use .25* the logit as the caliper value, I need to calculate that. Notice the xbeta = logit on the  OUTPUT statement in the PROC LOGISTIC step.

%macro match(data, class, yvar, xvar, id);
**Perform logistic regression, select independent variables, and create the propensity
PROC LOGISTIC descending data=&data noprint;
model &yvar=&xvar;
OUTPUT =propen(drop=_level_) prob=prob XBETA = LOGIT ;

****  Here they use PROC UNIVARIATE to output the standard deviation of the logit. The result is a file with one record  ;
/* Calculate standard deviation of the logit */

proc univariate data=propen noprint;
var logit;
output out=propen_sd std=sd;

*** This is pretty standard just outputting to case and control files, giving the control file a random number;

***create datasets of case and control, which contain patients from case and control groups ***;
data ctrl case;
set propen;
if &yvar=1 and prob ne . then do;
label rannum=’Uniform Randomization Score’;
output ctrl;
else if &yvar=0 and prob ne . then do;
output case;

**** Nifty thing two: Adding the standard deviation of the logit to every record.

data ctrl;
if _n_=1 then set propen_sd(keep=sd);
set ctrl;

I had to look this up. Did you know that using a conditional SET statement causes you to RETAIN every variable in that dataset. Notice that data set only has one variable, due to the KEEP option that follows it. The result is that the value of sd, that is, the SD of the logit, is added to every record in the ctrl file.

Thanks to Paul McDonald who wrote a paper for SAS Global Forum If _N_ = 1 then set . Yes, that was actually the name of it.

Then, they sort by the random number, not too  interesting.

***sort ctrl by random number generated ***;
proc sort data=ctrl;
by rannum;

*** Nifty thing three, if you are learning macros , an example of CALL SYMPUT, plus creating a macro variable that is the total N ;

/* The tot macro variable is created here */

data _null_;
set ctrl;
by sd;
if last.sd then call symput(‘tot’,put(_n_,8.));

This is nifty in the first place because it uses CALL SYMPUT to put the value of  _N_ into a newly created macro variable called tot . Notice they did not actually bother sorting by sd, since it is the same for every record. However, they used the BY sd and then LAST.SD . That means the IF statement will be true on the last record in the data set, since everyone has the same value for SD. On the last record, the value of _N_ is the last record in the data set. I found this very clever because there are many times when people would want to know the number of records and put that into a macro variable and there are multiple ways to do it. This is one I had not thought of, plus it demonstrates the CALL SYMPUT so double learning points.

*** Nifty macro learning things four, five and six

First, they just create a data set (boring!) then they use the %LET statement and the %EVAL function. The %EVAL function is pretty essential in SAS macro language. It evaluates the value of a variable as a number instead of text. By default, SAS is going to treat the value of any macro variable as text. So, for example, 1 +2   is not evaluated as 3.  %EVAL only works with integers but since the number of records is always going to be an integer, you’re good here.
data case_temp;
set case;
*** select one patient at a timee time from dataset ctrl, and search for matched-pair in  case group ***;
%let j=%eval(&tot);

Next, they use the value of j, which is now the total number of subjects in the data set, in a DO statement, so that they are going to do these statements for every person in the control group. A “caliper” is set up for each person so that is the logit for that subject plus or minus .25* the standard deviation of the logit. All the cases that fall within that range are collected in a data set named match_temp
%do i=1 %to &j;
data ctrl_temp;
set ctrl;
if _n_=&i;
low=logit – 0.25*sd;
up =logit + 0.25*sd;
data match_temp;
if _n_=1 then set ctrl_temp(keep=low up);
set case_temp(drop=rannum);
if low<=logit<=up;

*** Nifty thing seven – an illustration of a use of PROC SQL.

*** calculate number of patients who were included in the match dataset            ***  ;

*** 1) if there is no match found, then select next patient from case group        ***
*** 2) if there is one matched, select this patient and go to next round           ***
*** 3) if there are more than 1 patients, calculate Mahalanobis distance and       ***
*** select the patient with smallest distance, then go to next round ***

Here they use PROC SQL for another way of finding the number of records in a file (I told you there were multiple ways). They create a new variable that is the count of IDs, which is going to be  the number of people in the data set since each one has a unique ID.
proc sql;
create table casen
as select count(&id) as n
from match_temp;
data _null_;
set casen;
if _n_=1 then call symput(‘casen’,put(n,8.));

This is followed by some more %IF , %THEN and %DO loops that, if no match is found, goes on to the next case, and if only one matches is found, keeps that in the matched data set we are building.

If more than one match is found — well, that’s where it gets really interesting.


—–  Tune in next time for the rest of the story, as Paul Harvey used to say.



Recently, I had the need to write the exact same programs twice, once using SAS and once using SPSS syntax. Even though these aren’t the same language, having done it once made it much easier to do it the second time. Let’s start with quintile matching. I’ve been rambling on about propensity scores lately and quintiles are one way of matching propensity scores. I could go on at length about the different methods and probably will some day, but this is not that day.

No, what I’m going to ramble on about is how much transfer occurs between one language and another.

The first thing that transfers is the structure of the solution.

Listen up and do what I say, not what I do. This is one of my many failings that I don’t sit down and draft out a solution right away. Often, I’ll start coding and not actually draw out the program until I have spun my wheels for a while. Flow charts are under-rated. And no, you don’t need to bother with all that crap where they teach you in school that a diamond is a decision point and a trapezoid is a process and this other thing is input data. No. Just break it into logical, sequential steps.


Run Logistic Regression




Compute Quintiles of Propensity Score




Add Random Number & Quintiles  to original data set




Randomly select out control cases in proportion

to the quintile distribution of treatment cases




Create a data set of the treatment cases, plus

the control cases you pulled out in the previous step


Ta-da ta -da. Done!

Whether I am going to use SAS, SPSS or something else, I want to follow the same basic steps outlined above. Knowing what you need to do is half the battle. Okay, well, maybe not half, but it pretty much guarantees that in the battle between you and your computer that you’re going to win.


The second thing that transfers is the sense to re-use as much code as possible.

I was heartbroken that I could not find a simple quintile matching program with SPSS somewhere and had to actually write some statements myself. Still, as much as possible, I tried to re-use code I had already written. The first box above was already done for a different SPSS program.

Start with defining the path where your interim files and the final matched file will be stored. I changed from the program I had written previously using SPSS 20 on the Mac because the Mac version *** BLOWS *** . It crashed so often as to be worthless, so I switched to my Windows 7 machine.

/* Change file path here and only here */

DEFINE !pathd() ‘c:\Users\AnnMaria\Documents\Here’ !ENDDEFINE.

/* This is the data set with all of my original data */

DEFINE !readin() ‘:\Users\AnnMaria\Documents\Here\ItIs.sav’ !ENDDEFINE.

The third thing that transfers is your knowledge of concepts relevant to solving the problem.

For example, even if you’ve never done a line of SPSS syntax in your life, if you use SAS you recognize the usefulness of something like a LIBNAME statement which defines the directory for your files, so that you don’t have to do it over and over. Also, you can change that directory in one place, and it permeates throughout the program. That is exactly what I am doing in that DEFINE !Pathd() above.

Similarly, if you are familiar with a macro variable, you can understand that both the !pathd  and !readin are acting as macro variables, to be reused throughout the program.

IT IS ASSUMED that the dependent variable is named treatm and coded 0 for the control (larger) group and 1 for the treatment (smaller) group. I also need to perform a logistic regression to compute the propensity score . I covered how to do that in a previous blog. You should have been paying attention.

Next is computing quintiles. Here is how you happen to do it in SPSS. Again, even if you’ve never used SPSS before, if you have some experience with statistical programming, you know what a quintile is (it’s dividing the group into five equal parts – the lowest 20%, 20-40% etc.).

* Find the quintiles .
FILE=!pathd + ‘test.sav’.


This will give you output of percentiles that looks like this


20     .9345586

40    .9565614

60   .9690461

80   9789572


***** USING THE NUMBERS FROM THE PREVIOUS STEP, code the quintiles .

*Code the quintiles in the data set.

GET FILE= !pathd + “test.sav”.
RECODE PRE_1 (0 thru .9345586=1) (.9345586 thru .95656=2) (.9565614 thru .9690461=3) (.969046 thru .9789572=4) (.9789572 thru 1=5) INTO Quintile.
COMPUTE x = RV.UNIFORM(1,1000000) .
VARIABLE LABELS  Quintile ‘Quintile’.
SAVE OUTFILE=!pathd + “test.sav” .
COMPUTE filter_$=(treatm = 1).
VARIABLE LABELS filter_$ ‘treatm = 1 (FILTER)’.
VALUE LABELS filter_$ 0 ‘Not Selected’ 1 ‘Selected’.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.



The fourth thing that transfers is your knowledge of how computers work.

You COULD take the macro Levesque wrote that I discussed earlier that does nearest neighbor matching and run it with quintile substituted for propensity score. That would work. It would also take probably 50 times as long to  run. Somewhere on the order of an hour as opposed to a minute. Here is why. You DON’T need to do a 1 to 1 match. You have, say 315 people in the first quintile in your treatment group. You want to pick 315 people in the first quintile in your control group. Coincidentally (okay, it wasn’t a coincidence), there is a random number assigned to each record. In the next step, I output all the control subjects from Quintile 1 to a data set, sort them by that random number and then take the first 315. I do the same for the next four quintiles. Sorting a data set and selecting out the first N people is FAR quicker than taking every one of say, 900 people from the treatment group and comparing them to each one of 20,000 people from the control group.
* Create quintile 1 data set.

GET FILE= !pathd + “test.sav”.
SELECT IF (treatm = 0 & Quintile =1).
SAVE OUTFILE=!pathd + “quintile1.sav” .

GET FILE= !pathd + “quintile1.sav”.
SAVE OUTFILE=!pathd + “quintile1.sav” .

* Create quintile 2  data set.

GET FILE= !pathd + “test.sav”.
SELECT IF (treatm = 0 & Quintile =2).
SAVE OUTFILE=!pathd + “quintile2.sav” .
GET FILE= !pathd + “quintile2.sav”.
SAVE OUTFILE=!pathd + “quintile2.sav” .

* Create quintile 3  data set.
GET FILE= !pathd + “test.sav”.
SELECT IF (treatm = 0 & Quintile =3).
SAVE OUTFILE=!pathd + “quintile3.sav” .
GET FILE= !pathd + “quintile3.sav”.
SAVE OUTFILE=!pathd + “quintile3.sav” .

* Create quintile 4 data set .

GET FILE= !pathd + “test.sav”.
SELECT IF (treatm = 0 & Quintile =4).
SAVE OUTFILE=!pathd + “quintile4.sav” .
GET FILE= !pathd + “quintile4.sav”.
SAVE OUTFILE=!pathd + “quintile4.sav” .

* Create quintile 5 data set .

GET FILE= !pathd + “test.sav”.
SELECT IF (treatm = 0 & Quintile =5).
SAVE OUTFILE=!pathd + “quintile5.sav” .
GET FILE= !pathd + “quintile5.sav”.
SAVE OUTFILE=!pathd + “quintile5.sav” .

********  Data set with  cases .

GET FILE= !pathd + “test.sav”.
SELECT IF (treatm =1).
SAVE OUTFILE=!pathd + “cases.sav” .
ADD FILES /FILE= !pathd + “cases.sav”
/FILE= !pathd + “quintile1.sav”
/FILE= !pathd + “quintile2.sav”
/FILE= !pathd + “quintile3.sav”
/FILE= !pathd + “quintile4.sav”
/FILE= !pathd + “quintile5.sav”  .
SAVE OUTFILE=!pathd + “quintilematch.sav” .

The result will be a data set quintilematch.sav with treatment cases matched by quintile with controls.

The fifth thing that transfers is your knowledge of how people work with computers.

I get the occasional snarky comment from someone that they could code “better” than me, meaning more efficiently or using more cool things. Sometimes the snarky comment is implied, by the following said in a wondering  tone …

“I don’t understand why you sometimes are turning away clients while I am always looking for work ….”

For example, the above program requires selecting something, running it, typing in eight numbers, running something else and typing in another five numbers. If I were running this in batch, that would not be feasible, so this program is not as extensible as it could be. Also, if this were going to be used to run against a data set with millions of records, say, the census data, running it interactively wouldn’t work. It would be more impressive if I had the frequency data output as input to the test.sav data set and then the next frequency output create macro variables then had a loop that went through five times, created the five quintile data sets and then concatenated all of them.

All of that would be impressive, but it wouldn’t be helpful. The environment where this program will be used doesn’t have more than 50,000 records. Why on earth would they run a program in batch that runs in a minute? They aren’t primarily an SPSS shop, so it will be much easier for them to use and maintain a program that essentially requires only knowing where you saved your data set, what your dependent and independent variables are, and how to copy some numbers from the screen into a program.

Actually, I can do some fairly impressive coding when the situation calls for it, but  that’s a pretty important qualifying phrase there. Usually, it doesn’t. Here is the secret to success as a consultant, I will give it to you for free, since I am feeling generous.

“It turns out that people don’t want to be impressed nearly so much as they want to be helped.”





I wrote Part 1 a couple of years ago, so I guess I’m due for a part 2. In this case, I started with a data set in SAS but because it was going to be used by a group who had some SAS users and some SPSS users, they wanted to have the code for both SPSS and SAS.

Levesque wrote a lovely macro years ago to do propensity score matching and a few years later John Painter added to it a bit. Both of them did great work for which many people should be extremely grateful (I am!) .

However, I think for many people who use SPSS primarily by pointing and clicking, they still may have a bit of trouble with the pre-processing that this macro assumes you will do. They also may not be so sure about how to code things for a Mac or how to do the post-processing after the macro. So, as a public service, here you go, Part 2.

Start with defining the path where your interim files and the final matched file will be stored. If my path looks funny to you it is because you are probably using Windows and I did this on a Mac.

/* Change file path here and only here */

DEFINE !pathd() ‘/Volumes/Mystuff/SaveHere/’ !ENDDEFINE.

/* This is the data set with all of my original data */

DEFINE !readin() ‘/Volumes/Otherplace/HereIs/inputfile.sav’ !ENDDEFINE.

IT IS ASSUMED that your dependent variable is named treatm and coded 0 for the control (larger) group and 1 for the treatment (smaller) group. If that is NOT the case you will need to execute a GET FILE statement to readin your data and then do whatever you need to make it coded 0 or 1. In my example I have a variable, Alive, coded in the OPPOSITE direction of what I need, that is most people are alive (1) and a few people are dead (0). It is easy enough here to execute a COMPUTE command and make treatm = Alive – 1 .

* Get file and make sure .
* Dependent is named treatm and coded 0 or 1.

FILE= !readin .
COMPUTE treatm=1 – Alive .

******************** .
* Perform logistic regression to compute propensity score .

******************* .

I could have made the a variable list and dependent variable macro variables also but since they are only used this one time, that seemed kind of silly.

Note the RENAME VARIABLES – that is going to name the propensity score to propen. That is used throughout the macro, so don’t change that. Also, the output file from the logistic regression is going to be test.sav in whatever directory you specified above. That is also used throughout the macro, so don’t change that either. In fact, don’t change anything here other than the dependent variable name and the list of independent variables after ENTER .

/METHOD=ENTER V1 v2 othervar morevar1 morevar2
SAVE OUTFILE=!pathd + “test.sav” .


After this is the macro Levesque wrote. It works fine. Just copy and paste it into your syntax file. Yay, Levesque. He also has a really good book on data management and programming. I HIGHLY recommend it, as well as just perusing his site. You’ll learn a ton about SPSS.

********************* .
** End Preparation .
********************* .
GET FILE= !pathd + “test.sav”.
COMPUTE x = RV.UNIFORM(1,1000000) .
SORT CASES BY treatm(D) propen x.
SAVE OUTFILE=!pathd + “mydata.sav”.

* Erase the previous temporary result file, if any.
ERASE FILE=!pathd + “results.sav”.
COMPUTE key=1.
SELECT IF (1=0).
* Create an empty data file to receive results.
SAVE OUTFILE=!pathd + “results.sav”.

* Define a macro which will do the job.

DEFINE !match (nbtreat=!TOKENS(1))
!DO !cnt=1 !TO !nbtreat

GET FILE=!pathd + “mydata.sav”.
SELECT IF idx=!cnt OR treatm=0.
* Select one treatment case and all control .
COMPUTE #target=propen.
COMPUTE delta=propen-#target.
IF (delta<0) delta=-delta. SORT CASES BY delta. SELECT IF $CASENUM=1. COMPUTE key=!cnt . SAVE OUTFILE=!pathd + "used.sav". ADD FILES FILE=* /FILE=!pathd + "results.sav". SAVE OUTFILE=!pathd + "results.sav". ************************************************ Match back to original and drop case from original . GET FILE= !pathd + "mydata.sav". SORT CASES BY idx . MATCH FILES /FILE=* /IN=mydata /FILE=!pathd + "used.sav" /IN=used /BY idx . SELECT IF (used = 0). SAVE OUTFILE=!pathd + "mydata.sav" / DROP = used mydata key delta. EXECUTE. !DOEND !ENDDEFINE. *////////////////////////////////. SET MPRINT=yes. **************************. * MACRO CALL (first insert the number of cases after nbtreat below) . **************************. So much for the macro definition. Now you need to call it. Replace the ### here with the number in your treatment (smaller) group, the people that are coded treatm = 1 . !match nbtreat= ### . Here is more of Levesque's work. Just copy and paste it. * Sort results file to allow matching. GET FILE=!pathd + "results.sav". SORT CASES BY key. SAVE OUTFILE=!pathd + "results.sav". ******************. * Match each treatment cases with the most similar non treatment case. * To include additional variables from original file list them on the RENAME subcommand below . ******************. GET FILE=!pathd + "mydata.sav". MATCH FILES /FILE=* /FILE=!pathd + "results.sav" /RENAME (idx = d0) (id=id2) (propen=propen2) (treatm=treatm2) (key=idx) /BY idx /DROP= d0. FORMATS delta propen propen2 (F10.8). SAVE OUTFILE=!pathd + "mydata and results.sav". EXECUTE. * That's it!. He says that's it, but there is a little more to it than that. The original macro assumed you had four variables, a propensity score, ID, treatm and improve, that is, a variable that shows improvement. What if you have a lot more than that and you would like to merge this back with your original data set and have the matched and treatment subjects selected out by ID number with everything else you may have recorded on them? In the file the macro produces, it has N cases, where N is the number of people in the treatment group. Perfect if you want to do a dependent t-test sort of analysis, but that is not what I want. I want N*2 cases with each ID on a separate row. There are probably more beautiful ways to do this, but here is one that works. Get the file that the macro produced with all of the data. If the case has a value for propen2, it was one of the cases selected. Output that to the results2 file and keep id2 (this is the match ID). Rename that variable to ID. GET FILE= !pathd +'mydata and results.sav'. SELECT IF NOT (SYSMIS(propen2)). SAVE OUTFILE=!pathd + 'results2.sav' / KEEP = id2. EXECUTE. GET FILE=!pathd + 'results2.sav'. RENAME VARIABLES (ID2 =ID) . SAVE OUTFILE=!pathd + 'results2.sav' / KEEP = id . EXECUTE. Get the file with the results and keep the id variable. That is the case ID. Output that to the results1 file. GET FILE= !pathd + 'mydata and results.sav'. SELECT IF NOT (SYSMIS(propen2)). SAVE OUTFILE=!pathd + 'results1.sav' / KEEP = id. EXECUTE. Concatenate the results1 and results2 files. This is your file of all of the ids, the treatment cases and their nearest match. ADD FILES /FILE= !pathd + 'results1.sav' /FILE= !pathd + 'results2.sav' . SAVE OUTFILE=!pathd + 'resultsall.sav' . EXECUTE. This is just a quality check. On the final results file, all of the records should have a variable inmatch with a value of 1. GET FILE= !pathd + 'resultsall.sav'. COMPUTE inmatch = 1 . SAVE OUTFILE=!pathd + 'resultsall.sav' . EXECUTE. You absolutely have to sort the cases by ID and save the sorted file before you can merge them. This sorts the resultsall.sav file just created. SORT CASES BY ID(A). SAVE OUTFILE= !pathd + 'resultsall.sav' /COMPRESSED. This sorts the original data file (remember your original data file?) GET FILE= !readin . EXECUTE. SORT CASES BY ID(A). SAVE OUTFILE= !readin /COMPRESSED. Now, we will finally match the subjects from the treatment group and their matched controls back together and save them in a new file named matches.sav that is in that directory you specified at the very beginning. MATCH FILES /TABLE= !pathd + 'resultsall.sav' /FILE= !readin /BY ID. SAVE OUTFILE= !pathd + 'matches.sav' /COMPRESSED. EXECUTE. Now that you have your matched file, I strongly suggest you do some tests to see that your propensity score matching worked as hoped. Which I did and nothing was within shouting distance of being significant, so I was happy.



Upfront confession – I completely copied the first part of this from a macro Paul Dickman wrote in 1999 because it did ALMOST exactly what I wanted. I tested it several times with different data and on different computers to see that it worked the way I expected. I actually found it on some random page but since they had retained the comments that showed the author, I tracked down Paul Dickman’s page with the enormous effort of typing a few words into Google and found the original source of this macro.

I threw in an option to do a surveyfreq, a couple of positional parameters, and I was done. Life is good.

A few days ago, I discussed one method of propensity scores, which was to match on the exact score as closely as possible. I used a modification of Lori Parson’s 1-5 Greedy match macro to do that.

Another way is to sort the scores into quintiles – the bottom 20% percentile, 20-40th percentile and so on. Then, you can control for quintile. The first thing I did was a logistic regression to create propensity scores, like so:



CLASS var3 ;

MODEL depend = var1 var2 var3 var4 var5 ;

OUTPUT OUT = example PROB = propensity ;

This first step creates an output data set that has all of the original data plus the propensity scores. Next, I call Paul Dickman’s macro, which I reproduced below.  The last two parameters in the macro statement I added. These are positional parameters. That means they are optional and if you don’t specify them when you call the macro you get the exact same results as the original macro, which puts a new variable named whatever you specify for “quintvar” in your existing data set. That is the quintile score.

The positional parameters, num and depend allow you to specify the number of matches you want. If you want 1 match from the larger population to each in your smaller group, then you would enter a 1. Depend is the name of your dependent variable. This assumes that your smaller group is coded 0 and your larger group is coded 1.  For example, if dependent variable is lived, 0= no and 1 = yes.

%macro quint(dsn,var,quintvar,num=, depend=);

/* calculate the cutpoints for the quintiles */
VAR &var;
OUTPUT OUT=quintile PCTLPTS=20 40 60 80 PCTLPRE=pct;

/* write the quintiles to macro variables */
SET quintile;
CALL SYMPUT(‘q1’,pct20) ;
CALL SYMPUT(‘q2’,pct40) ;
CALL SYMPUT(‘q3’,pct60) ;
CALL SYMPUT(‘q4’,pct80) ;

/* create the new variable in the main dataset */
DATA &dsn;
SET &dsn;
IF &var =.  THEN &quintvar = .;
ELSE IF &var LE &q1 THEN &quintvar=1;
ELSE IF &var LE &q2  THEN &quintvar=2;
ELSE IF &var LE &q3 THEN &quintvar=3;
ELSE IF&var LE &q4 THEN &quintvar=4;
ELSE  &quintvar=5;

I added this part because I wanted to give the option to randomly select a specified number of subjects with which to match each subject, as well as to add the quintile value and keep the whole data set.

The first statement checks to see if the positional parameter for number (&num) was entered. If the length= 0 it was not entered and we just skip to the end. Note that the label, downhere, does NOT have colon after it in the first statement but it does when it is used as a label at the end.
%IF %LENGTH(&num) = 0 %THEN %GOTO downhere ;

/* This outputs the two groups on the dependent variable, 0 and 1, to two different data sets */

DATA small large ;
SET &dsn ;
IF &depend = 0 THEN OUTPUT small ;
ELSE IF &depend = 1 THEN OUTPUT large ;

/* The frequency distribution of quintile of the smaller group (0) is output to a dataset named samp_pct */
PROC FREQ DATA = small ;
TABLES &quintvar / OUT = samp_pct  ;

/* This creates a sample size per strata to match whatever was specified by the num positional parameter */

/* That variable needs to be named _NSIZE_ because this dataset will be the secondary data set to PROC SURVEYSELECT */
DATA samp_pct ;
SET samp_pct ;
_NSIZE_  = &num ;
_NSIZE_ =   _NSIZE_ * COUNT  ;

/* Sort by strata is required for PROC SURVEYSELECT */
PROC SORT  = large ;
BY &quintvar ;

/* This will select the number per strata to match the smaller sample */
PROC SURVEYSELECT  DATA= large SAMPSIZE = samp_pct OUT = largesamp ;
STRATA &quintvar ;

/* Notice there needs to be a period there to separate the &dsn macro variable from the _SAMPLE appended to the new data set name */
SET largesamp small ;

/* This is  a label so there is really supposed to be a colon there, not a semi-colon */
%MEND quint;
How to use this ….



%quint(mydata.project, prob,quint,num=1, depend= lived) ;


%quint(mydata.project, prob,quint) ;


* Macro for grouping continuous variable into quintiles.  *
* The macro REQUIRES three input variables:               *
* dsn: data set name                                                            *
* var: variable to be categorised                                      *
* quintvar: name of the new variable to be created        *
*                                                                                      *
* If only these are specified a quintiles variable                *
* specified in quintvar will be added to the dataset, dsn *
*                                                                                        *
* Two OPTIONAL (positional) parameters are                *
*  num: number of matches.                                *
*  depend: dependent variable                             *
**** The LARGER group has the dependent variable coded 1  *                                                       *
*    The SMALLER group has the dependent coded 0          *
*                                                                                             *
*  Example usage 1                                                              *
*  %quint(mydata.project, prob,quint,num=1, depend =lived) ;     *
*  will output a dataset mydata.project_SAMPLE                 *
*  with one subject with a value of 1 on LIVED  matched to *
*  each person with a value of 0 on LIVED                 *
*                                                                                      *
*  Sample usage 2:                                                        *
* %quint(mydata.project,prob,quint );                         *
*                                                                                    *
* After running the macro, the dataset mydata.project     *
* will contain a new variable called quint with values    *
* . (missing), 1, 2, 3, 4, and 5.                         *
*                                                         *
* The cutpoints for the quintiles are calculated based    *
* on all non-missing values of the variable in question.  *
*                                                         *
* To base the cutpoints for the quintiles on, for example,*
* controls only, the code can be changed as follows:      *
* proc univariate noprint data=&dsn.(where=(control=1));  *
*   Modification of 1999                                                           *
*  Macro by Paul Dickman                                                    *
* AnnMaria De Mars 2012                                                     *

First person to comment with ANOTHER use of colons in SAS (other than labels and not counting body parts), I’ll send you a Julia Group flashlight pen.

keep looking »


WP Themes