I knew it!

I am finally getting some time to finish reading The Black Swan and I came across this statement,

“Makridakis and Hibon reached the sad conclusion that ‘statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones’.”

I have thought this for years.

Like most of my peers, I started out in graduate school as a research assistant, collecting data under less than ideal circumstances. There are all sorts of reasons why the answers you get don’t connect too tightly to reality. People lie. They misunderstand your questions. You and your research are supremely unimportant to them so they just dash off the first thing that comes to mind. Data are entered incorrectly. Data aren’t collected at all on some variables, and the values that those respondents ‘would have given’ are estimated. It is a long and rather daunting list.

When it comes to human behavior, there is a very imperfect relationship between the measurements we take and the actual concept of interest. For example, gaining or losing weight may be a sign of depression in some people but it might also just be a sign of having been invited to a lot of good Christmas parties or having joined a gym.

Three points have popped up over and over in my decades of experience.

1. The simpler mistakes are often the ones that make the huge differences in prediction, for better or worse.

This became apparent to me early on when I found a significant correlation between marital adjustment and depression. The correlation between these two variables was .37, highly significant and certainly worthy of publication. One could easily find plenty of literature to document why there should be a correlation between how happy one was with one’s marriage and depression. Being fortunate enough to have had excellent statistics professors, I did the intelligent thing before publication and really analyzed the data. I looked at the distribution and the variance of each variable and noted that there was one noticeable outlier, the same observation, on each variable.

[Image: OUTPUT1]

Being the careful type, I deleted this outlier and the resulting correlation dropped from .37 to a very unimpressive and non-significant .07. Once the outlier was dropped, it was pretty easy to see the lack of relationship.

[Image: OUTPUT2]

This major improvement didn’t come about from a more sophisticated model, from using structural equation modeling to create a composite endogenous variable. It came from doing a simple model correctly.
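For anyone who wants to see how a single observation can manufacture a correlation, here is a minimal Python sketch (not the original analysis, and not SAS). The scores are made up with a random number generator; the variable names, sample size and outlier values are all assumptions chosen for illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Plain Pearson correlation, written out so nothing is hidden."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(1)  # fixed seed so the made-up data are reproducible

# Hypothetical scores for 60 people on two scales generated independently,
# so any correlation between them is chance.
adjustment = [random.gauss(50, 10) for _ in range(60)]
depression = [random.gauss(20, 5) for _ in range(60)]

r_clean = pearson_r(adjustment, depression)

# Add one respondent who is extreme on BOTH scales.
r_with_outlier = pearson_r(adjustment + [120], depression + [60])

print(f"without the outlier: r = {r_clean:.2f}")
print(f"with the outlier:    r = {r_with_outlier:.2f}")
```

The point of the sketch is only the direction of the change: one case far out on both variables pulls the correlation upward, which is why looking at the distributions first matters.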

2. The usual effect is very small.

I am often surprised when working with graduate students by their disappointed reaction to multiple correlations of .20 to .50. I wonder what they were expecting. I explain to them my view that human behavior, and the world in general, is extremely complex. Being able to explain 25% of the variance in anything by two or three variables is amazing and probably has taken advantage of chance to some extent. It is unlikely that the real correlation is that high. They always look at me as if I am slightly daft since some statistics book they read said that .20 is a low effect size and .50 is only moderate. I ask them how many variables came into play in determining the simple decision to contact a statistical consultant: first they had to be in graduate school, and at the university where I am employed, and that meant getting admitted in the first place, which meant having a certain GPA, etc. How much more complex is a decision like marriage, divorce or buying a house? How can they expect to explain 75% of the variance in that? Yet, that is exactly what they hoped for.
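The arithmetic behind those percentages is just squaring: the share of variance explained is the multiple correlation squared. A tiny Python sketch (the function names are mine, purely for illustration):

```python
from math import sqrt

def variance_explained(multiple_r):
    """Share of variance explained by a model with multiple correlation R."""
    return multiple_r ** 2

def r_needed(share):
    """Multiple correlation needed to explain a given share of variance."""
    return sqrt(share)

for r in (0.20, 0.50):
    print(f"R = {r:.2f} explains {variance_explained(r):.0%} of the variance")

print(f"Explaining 75% of the variance needs R of about {r_needed(0.75):.2f}")
```

So a multiple correlation of .50 is the 25% mentioned above, and the 75% the students hoped for would take a multiple correlation close to .87.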

3. You can’t really predict the outliers even though those are often what matter most.

Everyone wants to predict the next Microsoft, Apple, Google. Ordinary least squares models (regression, ANOVA, t-test) assume normally distributed errors, an assumption that is grossly violated when you have one observation 300 standard deviations from the mean. More than that, though, and this is one of the main points of The Black Swan, we don’t know what variables to include. Those outlier events are not a result of the same variables that drive the ordinary cases.

There were a couple of profound bits of knowledge I received in my MBA program that were worth the whole price. One of them was this advice during a lecture,

“Always remember, ladies and gentlemen, while Burroughs had all of its engineers hard at work making a better adding machine, Steve Wozniak was in his garage inventing the Apple computer.”

And yet, the variables in so many of our models are nothing but the different components that go into an adding machine.

A few days ago I gave an example of multiple imputation but I glossed over a number of things, which is fine when you are just getting into a topic, or when the mad scientist part of your personality has taken over, but seriously now …

Our story continues … when last seen, our data were being multiply imputed while Ronda was downstairs not driving the car through the lobby and not running over the guard at the front desk….

So, let’s look at the reason people would do an imputation, seriously now. It is not to create data so I don’t have ugly looking cells and so I can get higher N’s for the means. No, au contraire !

Generally, it is because you have a number of variables, let’s say four, and you are missing data on some, but not all, of the variables for a number of people. You would like to do, say, a multiple regression. If one variable is missing for 10% of your subjects, that is not too terrible, but if each of four different variables is missing for a different 10%, you end up losing 40% of your sample, because the regression drops any record with a missing value on any variable in the model. That’s bad.

Let’s say, because the psychopathic half of my split personality happened to feel like it, that I randomly deleted 10% of four different variables using the code:

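rd = round(10 * ranuni(0), 1) ;        /* random whole number from 0 to 10 */
if rd = 9 then weekend = . ;           /* blanks about 10% of records */
else if rd = 4 then svc_lvl_pct = . ;  /* a different 10% of records */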

and so on.  Then, when the hard-working statistician part resumed control of our shared brain it went, “@!$*)+ ! ” or something like that realizing that now, if I wanted to perform a regression analysis, for example, 40% of my data were missing. In my case, the data are missing completely at random so at least that’s something to be happy about.
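Why 40% and not something smaller? Because of the if / else-if pattern, no record loses more than one variable, so the four 10% slices never overlap. If each variable were blanked independently instead, some records would be hit twice and the loss would be a bit lower. Here is a rough Python sketch of the two cases (not SAS; the ten equal buckets and the bucket numbers 2 and 7 are my assumptions for illustration, since the post only shows buckets 9 and 4).

```python
import random

def share_lost_independent(p_missing, n_vars):
    """Share of records with at least one gap when each variable is
    blanked independently with probability p_missing."""
    return 1 - (1 - p_missing) ** n_vars

def share_lost_exclusive(p_missing, n_vars):
    """Share of records lost when a record can be blanked on at most one
    variable (the if / else-if pattern): the slices simply add up."""
    return p_missing * n_vars

def simulate_exclusive(n_rows, seed=0):
    """Simulate else-if deletion on four variables, one bucket each."""
    rng = random.Random(seed)
    lost = 0
    for _ in range(n_rows):
        rd = rng.randrange(10)      # ten equally likely buckets, 0 to 9
        if rd in (9, 4, 2, 7):      # 2 and 7 are made-up bucket choices
            lost += 1
    return lost / n_rows

print(f"independent gaps: {share_lost_independent(0.10, 4):.1%} of records lost")
print(f"exclusive gaps:   {share_lost_exclusive(0.10, 4):.1%} of records lost")
print(f"simulated:        {simulate_exclusive(100_000):.1%} of records lost")
```

Either way, a third or more of the sample disappears from a four-variable regression even though no single variable looks badly affected.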

Stuff to know:

1. Code your variables correctly.

Now let’s look at our variables. If I use weekday as a variable, that is going to be coded 1 = Sunday through 7 = Saturday. It’s rather unlikely that sales, service or anything else is a linear function with Sunday being the lowest and Saturday being the highest (bar tabs, maybe).   We can get to imputing class variables later, but for now let’s just assume we have numeric variables and we are going to do a plain vanilla regression. I am going to create a new variable, weekend (= 1 on the weekends, and 0 other days).

If you have a categorical variable, such as department, you can include it with the CLASS statement, unless you are using an ancient version of SAS. (I am using 9.2 and I highly recommend it.) However, that does require doing things a little differently. For simplicity’s sake I am going to use all numeric variables, including one variable dummy-coded as 0,1.
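The recoding itself is a one-liner in any language. As a sketch in Python (not the SAS I actually ran; the function name is made up), assuming weekday is coded 1 = Sunday through 7 = Saturday as described above:

```python
def weekend_flag(weekday):
    """Return 1 for Sunday (1) or Saturday (7), 0 for Monday to Friday,
    and None when the day itself is missing."""
    if weekday is None:
        return None
    if weekday not in range(1, 8):
        raise ValueError(f"weekday must be 1 to 7, got {weekday!r}")
    return 1 if weekday in (1, 7) else 0

# One flag per day of the week, Sunday first.
print([weekend_flag(day) for day in range(1, 8)])
```

Raising an error on anything outside 1 to 7 is a deliberate choice: a silent 0 for a miscoded day is exactly the kind of simple mistake that ruins a model.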

2. Choose the right variables for the imputation

This may seem rather obvious but, just playing around, I first did this analysis with variables I selected off the top of my head, that “made sense”, some of which ended up being related to one another and others of which weren’t. Next, I actually did a bit of research to find out what variables really ought to be related, based on previous analyses and discussions with experienced people in the organization, and took some time to make sure the data were coded as assumed (which they never are). I ended up with a similar, but not identical, set of variables. My first analysis was okay, but not nearly as close to the real model as the second one. The added work took me about eight hours, but it was well worth it. I make this obvious point because I have had too many conversations with people who say, “I have 47 years of experience, I know what is significant in blah-de-blah.”

Maybe they do and maybe they don’t. Even if they do, the data may not be coded the way they expect. For example, weekday is not a variable that tells if it was a weekday or not but rather, the day of the week from 1 = Sunday to 7 = Saturday.

An example (seriously now).

Have you ever been put on hold for tech support, making reservations or anything else and waited so long that you got disgusted and hung up? Well, the service level percentage is 100% minus the percentage of people who hung up. I want to predict the service level percentage and I think it is a function of whether it is a weekend or not (fewer people on staff), the average amount of time people spend on each call (which means other callers will have to wait) and the average time someone waits before hanging up.

Step 1: Proc MI

proc mi data = history out = histmi minimum = 0 round = .01 1 1 1 ;
var svc_lvl_pct weekend ab_time a_time ;
run ;

This will create a dataset named histmi which will have five times as many observations as my original dataset, as it will do five imputations. The minimum value I want imputed is 0, because you can’t have negative time or percentage. Although Paul Allison has written that rounding gives you biased estimates I did it anyway just to spite him (actually, I have never met the man and I am sure that he is way cool. I did do this with rounding to 1 second, rounding to .01 second and without rounding. Actually, the one with rounding to 1 gave the closest estimate but that was with an N of 3 trials so it doesn’t really mean anything. He did his study with random data and mega-iterations and I am sure he is right, rounding is bad. The truth is I had run this and then read his article the next day. How is that for bad timing?)

Step 2: Perform your analysis

This is simple. Sort your data by _imputation_  (which is the imputation number, a variable created by PROC MI) . Then perform your regression analysis by imputation. Output the parameter estimates and covariance matrix for estimates to a dataset, in my case I called it outhistmi .

proc sort data = histmi ;
by _imputation_ ;
run ;
proc reg data = histmi outest = outhistmi covout ;
model svc_lvl_pct = weekend A_time ab_time ;
by _imputation_ ;
run ;
quit ;

Step 3: Use MI Analyze to get parameter estimates

proc mianalyze data = outhistmi edf = 1231 ;
var weekend A_time ab_time ;
run ;

EDF is the complete-data degrees of freedom, that is, the error degrees of freedom the regression would have if nothing were missing. Since I had 1,235 observations and four parameters (including the intercept)

edf  =  1235 – 4
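If you are wondering what the combining step does with the five sets of estimates, the core of it is Rubin's rules: average the estimates, then add the average within-imputation variance to an inflated between-imputation variance. Here is a minimal Python sketch of just that part (not SAS, and not a replacement for PROC MIANALYZE, which also uses EDF to adjust the degrees of freedom). The five estimates and standard errors are made-up numbers for illustration, not my actual output.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Combine one parameter's estimates across m imputations.

    Returns (pooled estimate, pooled standard error)."""
    m = len(estimates)
    pooled = mean(estimates)                        # average point estimate
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                   # spread across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between          # total variance
    return pooled, sqrt(total)

# Hypothetical weekend coefficients from five imputed datasets.
made_up_estimates = [-0.031, -0.027, -0.030, -0.026, -0.031]
made_up_std_errors = [0.012, 0.011, 0.012, 0.012, 0.011]

estimate, std_error = pool_rubin(made_up_estimates, made_up_std_errors)
print(f"pooled estimate = {estimate:.4f}, pooled SE = {std_error:.4f}")
```

The between-imputation term is the whole reason for doing several imputations rather than one: it is how the uncertainty about the made-up values gets into the standard errors.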

So, how well did it work?

You can go here and see the model for the complete data for all 1,235 records.  The estimates from MIANALYZE are shown here.

                       Actual estimates    Results of MIANALYZE

Weekend                     -.044                -.029

Average call time            .00021               .00020

Abandoned call time         -.00068              -.00056

In all cases, the actual parameter estimates did fall within the 95% confidence intervals. However,  because I am not that easily satisfied, I went back and performed the regression with the 767 records that had complete data.  It also gave me pretty similar results to the full data, which is what I would expect since in this case the data were missing completely at random (probably not a good representation of real life) as I had deleted these out using a random number function.

I am still not wholly convinced of the benefit of multiple imputation versus just going with those that have complete data. I’ll have to think about this some more. In the meantime, I have one daughter telling me that normal people play with guitar hero instead of PROC MIANALYZE and another one saying I am going to be late to hear her play the saxophone in her Christmas concert, so  I guess I’ll think about it later.

(Or: Why a computer will not be replacing me any time soon)

[Originally, the second half of this was supposed to be an explanation of the real use of multiple imputation which is in estimating parameters but then my daughter came to pick me up early and I got the flu so I am now finishing the second half as a separate post and pouting that John Johnson  goes on Santa's nice list for saying it faster and shorter than me. However, since I was in the middle of writing it, I am going to do it as the next post anyway!]

I am a little hesitant to do multiple imputations or really substitutions for missing data of any type. I go through life with the “call a spade a spade” philosophy and what we are doing here is making up data where none exists. However, there is a difference in the quality of data one can make up. For example, if I asked my little granddaughter, Eva,

“What is your best estimate of the average number of clients the Arcadia branch served on Tuesday, January 19th, 2003?”

Her most likely estimate would be,

“Elmo!”

If I put the same into SAS (or SPSS)  it would be the equivalent of asking, since I do, contrary to appearances, know what I am doing,

“Given that the 19th was a Tuesday, in the first quarter of the year and we have data on the Arcadia branch for every other day in January and for the last seven years, they are in an urban area over 3 million in population, what’s the best estimate of the number they served?”

If I were to do this, I would include something like:

PROC MI DATA = in.banks OUT = banksmi ;
VAR qtr weekday urban branch customer ;
RUN ;

I might still get an answer that makes only a little more sense than “Elmo!”, say -5 or 146.783. Well, I really doubt that negative five customers came to the bank that day. What would a negative number of customers be, anyway? Bank robbers who withdrew money they did not put in?

You need to apply some constraints, which PROC MI lets you do. Here is an example of something I actually did today:

PROC MI DATA = fix2 OUT = fixmi MINIMUM = 0 ROUND = 1 .01 .01 .01 1 1 1 1 1 1 ;
VAR ACD_Calls A_speed A_time ab_time aban_calls qtr t_day year weekday skill_name ;
RUN ;

The MINIMUM = sets the minimum for all variables to be zero, since we can’t have a negative number of callers and negative time is one of those concepts that was probably covered in philosophy and other types of courses I did my best to weasel out of through one undergraduate and three graduate programs. The ROUND = sets the number of callers to be rounded to 1, the time for speed of answering and length of call and the time people wait before hanging up (less than 11 seconds) to be rounded to .01 and everything else to be rounded to 1.
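To make the two constraints concrete, here is a rough Python sketch of the idea (not SAS, and not PROC MI's actual algorithm, which draws imputed values from a model fitted to the other variables). It draws a value from a normal distribution, redraws if the value falls below the minimum, and then rounds to the chosen unit. The function names, the mean and the standard deviation are assumptions made up for illustration.

```python
import random

def round_to_unit(value, unit):
    """Round to the nearest multiple of unit (1 gives whole numbers,
    .01 gives hundredths)."""
    return round(round(value / unit) * unit, 10)

def constrained_draw(mean, sd, minimum, unit, rng, max_tries=100):
    """Draw from a normal distribution, redrawing until the value is at
    least `minimum`, then round it to `unit`."""
    for _ in range(max_tries):
        value = rng.gauss(mean, sd)
        if value >= minimum:
            return round_to_unit(value, unit)
    raise RuntimeError("no acceptable draw; is the minimum realistic?")

rng = random.Random(42)  # fixed seed so the example is reproducible

# Hypothetical call counts: whole numbers, never negative.
calls = [constrained_draw(2, 5, minimum=0, unit=1, rng=rng) for _ in range(5)]
# Hypothetical wait times in seconds: hundredths, never negative.
waits = [constrained_draw(6, 4, minimum=0, unit=0.01, rng=rng) for _ in range(5)]
print(calls)
print(waits)
```

With constraints like these, neither -5 nor 146.783 can come out as a number of callers.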

SPSS also offers multiple imputation as part of its missing values add-on. If you have it installed, you select from the Analyze menu Multiple Imputation and then Impute Missing Data Values. At this point a window pops up letting you select the variables and giving you the option of writing the output to a new dataset.

You can also click on the Constraints tab at the top and another window pops up which allows you to set constraints such as a minimum, maximum and rounding.

All of the stuff you can do with windows and the pointing and clicking you can do with SPSS syntax also but I have found that most people prefer to use SPSS precisely because they DON’T want to do the syntax business. [Note that I said "most" so if you are one of those other people there is no point in being snippy about it just because you are in the minority. Try being a Latina grandmother statistician for a while and you'll know what a minority feels like.]

Well, how well does multiple imputation do? Here is a really extreme test. I calculated the actual statistics for the month of October and then deleted about half of the month’s data for every group.

[Image: actual]

Below are the initial statistics with half of the values imputed:

[Image: imputed]

Okay, well normally you need something more extreme than multiple  imputation if  half your data are missing, like a new project, but this isn’t terrible. You might notice that the least error is in the last column. Why? Well, it turns out that this is the only department that is open on weekends. The others had data imputed for Saturday and Sunday but were never actually open on those days.

I could have done something more sophisticated but since one of my daughters borrowed the car and was downstairs waiting to pick me up, I greatly decreased the error from the actual data by one simple change:

in the imputed dataset, I added the statement

IF skill_name in(99,271,273) and weekday in (1,7) then delete ;

This deleted those records that had data for days the department was not open and increased the accuracy of the final dataset substantially. Here is the take away lesson: the SPSS people with the decision trees and the knowing what variables to constrain, they are on to something. You can do all of that in SAS also, but when you attend events like the SPSS Directions conference, that is their real emphasis, whereas at places like SAS conferences (more statistically-oriented) the emphasis is more on setting constraints and doing the actual procedures. What you want is both of those things, the former of which includes considerable expertise in the area you are analyzing, be it IQ scores, glucose levels or average number of hamburgers served, and the latter of which assumes some knowledge of statistics. If you add in a good ability to code, you might actually have a better answer in the end than

Elmo!
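For the non-SAS readers, the weekend fix described above amounts to a simple filter. Here is a minimal Python sketch of the same rule (the three skill codes and the 1 = Sunday, 7 = Saturday coding come from my IF statement; the sample records and skill 500 are made up for illustration).

```python
WEEKDAYS_ONLY_SKILLS = {99, 271, 273}   # departments closed on weekends
WEEKEND_DAYS = {1, 7}                   # 1 = Sunday, 7 = Saturday

def drop_closed_days(records):
    """Remove records for days when the department was not open."""
    return [
        rec for rec in records
        if not (rec["skill_name"] in WEEKDAYS_ONLY_SKILLS
                and rec["weekday"] in WEEKEND_DAYS)
    ]

# Made-up imputed records; skill 500 stands in for a department that is
# open every day.
imputed = [
    {"skill_name": 99, "weekday": 1, "ACD_Calls": 12},   # closed: dropped
    {"skill_name": 99, "weekday": 3, "ACD_Calls": 40},   # kept
    {"skill_name": 500, "weekday": 7, "ACD_Calls": 25},  # kept
    {"skill_name": 273, "weekday": 7, "ACD_Calls": 8},   # closed: dropped
]

for rec in drop_closed_days(imputed):
    print(rec)
```

No statistics involved at all, just knowing when the doors are locked.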

Now I am going to go downstairs and hope my daughter has not crashed my car into the lobby and run over the guard at the front desk. Wish me luck.