# Knowing what we don’t know

**I knew it!**

I am finally getting some time to finish reading The Black Swan and I came across this statement,

*“Makridakis and Hibon reached the sad conclusion that ‘statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones’.”*

I have thought this for years.

Like most of my peers, I started out in graduate school as a research assistant, collecting data under less than ideal circumstances. There are all sorts of reasons why the answers you get don’t connect too tightly to reality. People lie. They misunderstand your questions. You and your research are supremely unimportant to them so they just dash off the first thing that comes to mind. Data are entered incorrectly. Data aren’t collected at all on some variables, and the values that those respondents ‘would have given’ are estimated. It is a long and rather daunting list.

When it comes to human behavior, there is a very imperfect relationship between the measurements we take and the actual concept of interest. For example, gaining or losing weight *may* be a sign of depression in some people but it might also just be a sign of having been invited to a lot of good Christmas parties or having joined a gym.

Three points have popped up over and over in my decades of experience.

**1. The simpler mistakes are often the ones that make the huge differences in prediction, for better or worse.**

This became apparent to me early on when I found a significant correlation between marital adjustment and depression. The correlation between these two variables was .37, highly significant and certainly worthy of publication. One could certainly find plenty of literature to document why there should be a correlation between how happy one was with one’s marriage and depression. Being fortunate enough to have excellent statistics professors, I did the intelligent thing before publication and really analyzed the data. I looked at the distribution and variance of each variable, and noted that there was one noticeable outlier on each variable.

Being the careful type, I deleted this outlier and the resulting correlation dropped from .37 to a very unimpressive and non-significant .07. Once the outlier was dropped, it was pretty easy to see the lack of relationship.

This major improvement didn’t come about from a more sophisticated model, from using structural equation modeling to create a composite endogenous variable. It came from doing a simple model correctly.
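To see how much one case can do, here is a minimal sketch in Python. The numbers are made up for illustration and are not my original data: ten hypothetical respondents whose two scores are unrelated by construction, plus one hypothetical respondent who is extreme on both measures. The variable names and the helper `pearson_r` are my own choices for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for ten respondents, built so the two measures are unrelated.
adjustment = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
depression = [4, 6, 5, 7, 5, 5, 7, 5, 6, 4]

# One hypothetical respondent who is extreme on both measures.
adjustment_with = adjustment + [30]
depression_with = depression + [30]

r_with = pearson_r(adjustment_with, depression_with)
r_without = pearson_r(adjustment, depression)

print(f"r with the outlier:    {r_with:.2f}")
print(f"r without the outlier: {r_without:.2f}")
```

With these invented numbers the single extreme case manufactures a large correlation out of nothing. Real data will rarely be this clean, which is exactly why it pays to look at the distributions before running anything fancier.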

**2. The usual effect is very small.**

I am often surprised when working with graduate students by their disappointed reaction to multiple correlations of .20 to .50. I wonder what they were expecting. I explain to them my view that human behavior, and the world in general, is extremely complex. Being able to explain 25% of the variance in anything by two or three variables is amazing and probably has taken advantage of chance to some extent. It is unlikely that the real correlation is that high. They always look at me as if I am slightly daft since some statistics book they read said that .20 is a low effect size and .50 is only moderate. I ask them how many variables came into play in determining the simple decision to contact a statistical consultant: first they had to be in graduate school, and at the university where I am employed, and that meant getting admitted in the first place, which meant having a certain GPA, etc. How much more complex is a decision like marriage, divorce or buying a house? How can they expect to explain 75% of the variance in that? Yet, that is exactly what they hoped for.
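The arithmetic behind those percentages is just squaring: the share of variance explained is the square of the multiple correlation. A quick sketch, with function names that are my own for this example:

```python
import math


def variance_explained(r):
    """Share of variance explained by a multiple correlation of r."""
    return r ** 2


def correlation_needed(share):
    """Multiple correlation required to explain a given share of variance."""
    return math.sqrt(share)


for r in (0.20, 0.50):
    print(f"R = {r:.2f} explains {variance_explained(r):.0%} of the variance")

print(f"Explaining 75% needs R of about {correlation_needed(0.75):.2f}")
```

So a student hoping to explain 75% of the variance is really hoping for a multiple correlation close to .87, far above the .50 that already deserves suspicion.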

**3. You can’t really predict the outliers even though those are often what matter most.**

Everyone wants to predict the next Microsoft, Apple or Google. Ordinary least squares models (regression, ANOVA, t-test) assume normally distributed errors, an assumption that is grossly violated when you have one observation 300 standard deviations from the mean. More than that, though, and this is one of the main points of The Black Swan, we don’t know what variables to include. Those outlier events are not a result of the same variables.

There were a couple of profound bits of knowledge I received in my MBA program that were worth the whole price. One of them was this advice during a lecture,

“Always remember, ladies and gentlemen, while Burroughs had all of its engineers hard at work making a better adding machine, Steve Wozniak was in his garage inventing the Apple computer.”

And yet, the variables in so many of our models are nothing but the different components that go into an adding machine.

Ann,

As I am always enjoying your intelligent and pragmatic views on the problems you face, I get intrigued when you mention a book title – would “The Black Swan” happen to be this one:

The Black Swan: The Impact of the Highly Improbable by Nassim Nicholas Taleb?

Yes, that’s the one. It is very thought-provoking as well as amusing to read.