Multiple imputation, seriously now

A few days ago I gave an example of multiple imputation but I glossed over a number of things, which is fine when you are just getting into a topic, or when the mad scientist part of your personality has taken over, but seriously now …

Our story continues … when last seen, our data were being multiply imputed while Ronda was downstairs not driving the car through the lobby and not running over the guard at the front desk….

So, let’s look at the reason people would do an imputation, seriously now. It is not to create data so I don’t have ugly looking cells and so I can get higher N’s for the means. No, au contraire !

Generally, it is because you have a number of variables, let’s say four, and you are missing data on some, but not all of the variables, for a number of people. You would like to do, say,a multiple regression. If you are missing 10% of the data for each subject, that is not too terrible, but if each of four different variables is missing 10% you end up losing 40% of your sample. That’s bad.

Let’s say, because the psychopathic half of my split personality happened to feel like it, that I randomly deleted 10% of four different variables using the code:

rd = round ((10*(ranuni(0))),1) ;

if rd = 9 then weekend = . ;

else if rd = 4 then svc_lvl_pct = . ;

and so on. Then, when the hard-working statistician part resumed control of our shared brain it went, “@!$*)+ ! ” or something like that realizing that now, if I wanted to perform a regression analysis, for example, 40% of my data were missing. In my case, the data are missing completely at random so at least that’s something to be happy about.

Stuff to know:

1. Code your variables correctly.

Now let’s look at our variables. If I use weekday as a variable, that is going to be coded 1 = Sunday through 7 = Saturday. It’s rather unlikely that sales, service or anything else is a linear function with Sunday being the lowest and Saturday being the highest (bar tabs, maybe). We can get to imputing class variables later, but for now let’s just assume we have numeric variables and we are going to do a plain vanilla regression. I am going to create a new variable, weekend (= 1 on the weekends, and 0 other days).

If you have a categorical variable, such as department, you can use that using the CLASS statement, unless you are using an ancient version of SAS. (I am using 9.2 & I highly recommend it). However, that does require doing things a little differently. For simplicity sake I am going to use all numeric variables, including one variable dummy-coded as 0,1 .

2. Choose the right variables for the imputation

This may seem rather obvious but just playing around, I first did this analysis with variables I selected off the top of my head, that “made sense”, some of which ended up being related to one another and others that didn’t. Next, I actually did a bit of research to find out what variables really ought to be related, based on previous analyses, and discussions with experienced people in the organization, took some time to make sure the data were coded as assumed (which they never are). I ended up with a similar, but not identical, set of variables. My first analysis was okay, but not nearly as close to the real model as the second one. The added work took me about eight hours, but it was well worth it. I make this obvious point because I have had too many conversations with people who say, “I have 47 years of experience, I know what is significant in blah-de-blah.”

Maybe they do and maybe they don’t. Even if they do, the data may not be coded the way they expect. For example, weekday is not a variable that tells if it was a weekday or not but rather, the day of the week from 1 = Sunday to 7 = Saturday.

An example (seriously now).

Have you ever been put on hold for tech support, making reservations or anything else and waited so long that you got disgusted and hung up? Well the service level percentage is 100% – the percentage of people who hung up. I want to predict the service level percentage and I think it is a function of whether it is a weekend or not (fewer people on staff), the average amount of time people spend on each call (which means other callers will have to wait) and the average time someone waits before hanging up .

Step 1: Proc MI

proc mi data = history out = histmi minimum = 0 round = .01 1 1 1 ;
var svc_lvl_pct weekend ab_time a_time ;

This will create a dataset name histmi which will have five times as many observations as my original dataset, as it will do five imputations. The minimum value I want imputed is 0, because you can’t have negative time or percentage. Although Paul Allison has written that rounding gives you biased estimates I did it anyway just to spite him (actually, I have never met the man and I am sure that he is way cool. I did do this with rounding to 1 second, rounding to .01 second and without rounding. Actually, the one with rounding to 1 gave the closest estimate but that was with an N of 3 trials so it doesn’t really mean anything. He did his study with random data and mega-iterations and I am sure he is right, rounding is bad. The truth is I had run this and then read his article the next day. How is that for bad timing? )

Step 2: Perform your analysis

This is simple. Sort your data by _imputation_ (which is the imputation number, a variable created by PROC MI) . Then perform your regression analysis by imputation. Output the parameter estimates and covariance matrix for estimates to a dataset, in my case I called it outhistmi .

proc sort data = histmi ;
by _imputation_ ;
proc reg data = histmi outest =outhistmi covout ;
model svc_lvl_pct = weekend A_time ab_time ;
by _imputation_ ;

Step 3: Use MI Analyze to get parameter estimates

proc mianalyze data = outhistmi edf = 1231 ;
var weekend A_time ab_time ;

Edf is the estimated degrees of freedom. Since I had 1,235 observations and four parameters (including the intercept)

edf = 1235 – 4

So, how well did it work?

You can go here and see the model for the complete data for all 1,235 records. The estimates from MIANALYZE are shown here.

Actual estimates vs results of MIAnalyze

Weekend -.044 -.029

Average call time .00021 .00020

Abandoned call time -.00068 -.00056

In all cases, the actual parameter estimates did fall within the 95% confidence intervals. However, because I am not that easily satisfied, I went back and performed the regression with the 767 records that had complete data. It also gave me pretty similar results to the full data, which is what I would expect since in this case the data were missing completely at random (probably not a good representation of real life) as I had deleted these out using a random number function.

I am still not wholly convinced of the benefit of multiple imputation versus just going with those that have complete data. I’ll have to think about this some more. In the meantime, I have one daughter telling me that normal people play with guitar hero instead of PROC MIANALYZE and another one saying I am going to be late to hear her play the saxophone in her Christmas concert, so I guess I’ll think about it later.

6 Comments

Cody L. Custis says:

December 18, 2009 at 9:20 am

Code monkey comment: I have to point out that the code
rd = round ((10*(ranuni(0))),1) ;
if rd = 9 then weekend = . ;
else if rd = 4 then svc_lvl_pct = . ;

would not lead to data which is missing completely at random, since the probability that weekend is missing is not independent of the probability that svc_lvl_pct is missing. I’m not sure if that affects the validity of the analysis.
admin says:

December 18, 2009 at 11:08 am

Well, actually, you are correct and I coded it to show the way it is. I could have written

if rd = 9 then weekend = . ;

if rd = 4 then svc_lvl_pct = . ;

However, if rd is 9, it can’t be 4. Thus, in each observation, only one of the four variables will be missing, so I would be pretending independence exists when it actually doesn’t.

Whether weekend, svc_lvl_pct, etc. is missing, though is completely independent of the VALUE of any other variable and just determined by the value of the random number function. Generally the concern about non-random missing data is that it is related to some other variable of importance, for example, that people who are more severely depressed are less likely to complete a longitudinal study on effectiveness of treatment for depression.

I did it this way to maximize the amount of missing data and minimize the amount missing from each record, as I think the only reasonable thing to do with people who are missing half the data is to toss them out. Also, I think this is the usual case where people use MI, they have 5% or 10% missing on a few different variables.

Since the seed of ranuni(0) is just the time of day, if I was more ambitious, I guess I could have done this:

rd = round ((10*(ranuni(0))),1) ;

if rd = 9 then weekend = . ;

rd = round ((10*(ranuni(0))),1) ;

if rd = 4 then svc_lvl_pct = . ;
Pingback: Passion, data, math & bullshit: Part 1 : AnnMaria’s Blog
Pingback: Damn! There really IS a structural equation modeling for dummies! : AnnMaria’s Blog
Jennifer says:

April 18, 2012 at 1:31 pm

Thanks for writing this! I’m trying to learn PROC MI and MIANALYZE, and I didn’t know what the EDF should be (the SAS manual didn’t help, and I was confused that the default is infinity because usually SAS defaults work just fine for me, but in this case infinity is obviously wrong). Anyway, thanks for writing this in an easy-to-understand and funny style. I’m a new fan!
Peter says:

July 21, 2016 at 8:00 am

Hi, I am currently using MI for my project which is using longitudinal data. One of my study variables is depression score at age 10. Is it true that I can only use depression scores which precede this age in my multiple imputation? i.e. depression scores at age 9 and under.
The reason I am asking is because most of the depression scores are taken after age 10, but I don’t know whether I can include them.

Multiple imputation, seriously now

Data quality – getting ready to teach data mining

Statistics as social justice

Does your candidate have a prayer in hell of getting elected?

Sports equality, t-tests and standard error

Controlling for Damn Near Everything: Propensity Score Matching

Floor Effect, Ceiling Effect and Computing Internal Consistency Reliability at Post-test

6 Comments

Leave a Reply

Similar Posts

6 Comments

Leave a Reply