A few days ago I gave an example of multiple imputation but I glossed over a number of things, which is fine when you are just getting into a topic, or when the mad scientist part of your personality has taken over, but seriously now …
Our story continues … when last seen, our data were being multiply imputed while Ronda was downstairs not driving the car through the lobby and not running over the guard at the front desk….
So, let’s look at the reason people would do an imputation, seriously now. It is not to create data so I don’t have ugly looking cells and so I can get higher N’s for the means. No, au contraire !
Generally, it is because you have a number of variables, let’s say four, and you are missing data on some, but not all of the variables, for a number of people. You would like to do, say,a multiple regression. If you are missing 10% of the data for each subject, that is not too terrible, but if each of four different variables is missing 10% you end up losing 40% of your sample. That’s bad.
Let’s say, because the psychopathic half of my split personality happened to feel like it, that I randomly deleted 10% of four different variables using the code:
rd = round ((10*(ranuni(0))),1) ;
if rd = 9 then weekend = . ;
else if rd = 4 then svc_lvl_pct = . ;
and so on. Then, when the hard-working statistician part resumed control of our shared brain it went, “@!$*)+ ! ” or something like that realizing that now, if I wanted to perform a regression analysis, for example, 40% of my data were missing. In my case, the data are missing completely at random so at least that’s something to be happy about.
Stuff to know:
1. Code your variables correctly.
Now let’s look at our variables. If I use weekday as a variable, that is going to be coded 1 = Sunday through 7 = Saturday. It’s rather unlikely that sales, service or anything else is a linear function with Sunday being the lowest and Saturday being the highest (bar tabs, maybe). We can get to imputing class variables later, but for now let’s just assume we have numeric variables and we are going to do a plain vanilla regression. I am going to create a new variable, weekend (= 1 on the weekends, and 0 other days).
If you have a categorical variable, such as department, you can use that using the CLASS statement, unless you are using an ancient version of SAS. (I am using 9.2 & I highly recommend it). However, that does require doing things a little differently. For simplicity sake I am going to use all numeric variables, including one variable dummy-coded as 0,1 .
2. Choose the right variables for the imputation
This may seem rather obvious but just playing around, I first did this analysis with variables I selected off the top of my head, that “made sense”, some of which ended up being related to one another and others that didn’t. Next, I actually did a bit of research to find out what variables really ought to be related, based on previous analyses, and discussions with experienced people in the organization, took some time to make sure the data were coded as assumed (which they never are). I ended up with a similar, but not identical, set of variables. My first analysis was okay, but not nearly as close to the real model as the second one. The added work took me about eight hours, but it was well worth it. I make this obvious point because I have had too many conversations with people who say, “I have 47 years of experience, I know what is significant in blah-de-blah.”
Maybe they do and maybe they don’t. Even if they do, the data may not be coded the way they expect. For example, weekday is not a variable that tells if it was a weekday or not but rather, the day of the week from 1 = Sunday to 7 = Saturday.
An example (seriously now).
Have you ever been put on hold for tech support, making reservations or anything else and waited so long that you got disgusted and hung up? Well the service level percentage is 100% – the percentage of people who hung up. I want to predict the service level percentage and I think it is a function of whether it is a weekend or not (fewer people on staff), the average amount of time people spend on each call (which means other callers will have to wait) and the average time someone waits before hanging up .
Step 1: Proc MI
proc mi data = history out = histmi minimum = 0 round = .01 1 1 1 ;
var svc_lvl_pct weekend ab_time a_time ;
This will create a dataset name histmi which will have five times as many observations as my original dataset, as it will do five imputations. The minimum value I want imputed is 0, because you can’t have negative time or percentage. Although Paul Allison has written that rounding gives you biased estimates I did it anyway just to spite him (actually, I have never met the man and I am sure that he is way cool. I did do this with rounding to 1 second, rounding to .01 second and without rounding. Actually, the one with rounding to 1 gave the closest estimate but that was with an N of 3 trials so it doesn’t really mean anything. He did his study with random data and mega-iterations and I am sure he is right, rounding is bad. The truth is I had run this and then read his article the next day. How is that for bad timing? )
Step 2: Perform your analysis
This is simple. Sort your data by _imputation_ (which is the imputation number, a variable created by PROC MI) . Then perform your regression analysis by imputation. Output the parameter estimates and covariance matrix for estimates to a dataset, in my case I called it outhistmi .
proc sort data = histmi ;
by _imputation_ ;
proc reg data = histmi outest =outhistmi covout ;
model svc_lvl_pct = weekend A_time ab_time ;
by _imputation_ ;
Step 3: Use MI Analyze to get parameter estimates
proc mianalyze data = outhistmi edf = 1231 ;
var weekend A_time ab_time ;
Edf is the estimated degrees of freedom. Since I had 1,235 observations and four parameters (including the intercept)
edf = 1235 – 4
So, how well did it work?
You can go here and see the model for the complete data for all 1,235 records. The estimates from MIANALYZE are shown here.
Actual estimates vs results of MIAnalyze
Weekend -.044 -.029
Average call time .00021 .00020
Abandoned call time -.00068 -.00056
In all cases, the actual parameter estimates did fall within the 95% confidence intervals. However, because I am not that easily satisfied, I went back and performed the regression with the 767 records that had complete data. It also gave me pretty similar results to the full data, which is what I would expect since in this case the data were missing completely at random (probably not a good representation of real life) as I had deleted these out using a random number function.
I am still not wholly convinced of the benefit of multiple imputation versus just going with those that have complete data. I’ll have to think about this some more. In the meantime, I have one daughter telling me that normal people play with guitar hero instead of PROC MIANALYZE and another one saying I am going to be late to hear her play the saxophone in her Christmas concert, so I guess I’ll think about it later.