Since I have written about odds ratios and logs lately, I was going to write about the natural log of the odds ratio. However, random events have caused me to do otherwise.
I read an interesting blog by Adam Jackson lately, in which he is concerned that robots will take over the world. At the very least, he worries, all of the data we are submitting on ourselves via our friends on Facebook, our pictures on Flickr, our personal web pages on iGoogle, Travelocity accounts, grocery store frequent shopper cards and a thousand other sources from billions of people will be combined and inserted into a series of equations that will predict our every move.
I say it ain’t so. Billions and billions have been spent on data mining and we still cannot predict behavior with any degree of accuracy. Two obstacles are imperfect models and random error. The latter might be better described as “The State of Being Human”. For example, I hate SQL. I even wrote a blog about how much I hate it. That doesn’t mean I never use SQL, or even that, if offered an unimaginably enormous amount of money, I would not become an SQL programmer. In fact, I was going to write on my blog about types of sums of squares a few days ago (since it was Square Root Day), but instead, I have decided to write about how robots are not taking over the world.
Let’s go with the imperfect models, first. I have a tank of sea monkeys on my desk that I bought for the purpose of collecting random data. Among the variables I have entered into an SPSS dataset are the following:
- DAYS: Day since purchase
- MONKEYS: Number of sea monkeys
- SPECKS: Number of specks indistinguishable from sea monkeys
- EVENT: A binary variable, event occurred or not
- NAME_EVENT: A string variable, (e.g., added food, added water)
- ACTIVITY: an ordinal ranking of sea monkey activity from none to high
- WTF: Number of co-workers who walk by, look on my desk and say “WTF? Sea Monkeys?”
I have tried various models to predict the dependent variable WTF, which is my co-workers’ behavior. I have tried all of the independent variables, linear functions, quadratic functions, t-tests, multiple regression. To date, my attempts could only be characterized as a success if success has a completely different meaning than the one with which I am familiar. In fact, in one spectacular non-success event, I actually achieved an adjusted R-squared of -.232, meaning in statistical terms that, once penalized for the number of predictors I threw in, my equation was, in fact, WORSE than doing nothing in terms of predictability.
This can be interpreted as SPSS’s way of saying,
“Why don’t you just shut the hell up, because a better equation could be developed by, say, a sea monkey?!”
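For anyone wondering how an R-squared can go negative at all, here is a minimal sketch of the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function name and the numbers below are made up for illustration; they are not the sea monkey data, and I am not claiming this is exactly how SPSS arrives at its figure.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for a model with p predictors fit to n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    # The (n - 1) / (n - p - 1) factor is the penalty for each extra predictor.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical numbers: a weak fit (R-squared of 0.05) from 5 predictors on 20 days.
weak_fit = adjusted_r2(r2=0.05, n=20, p=5)
print(round(weak_fit, 3))  # -0.289
```

The plain R-squared here is still positive. It is the penalty for piling predictors onto a small sample that drags the adjusted version below zero.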
Part of my problem may be that I just don’t have enough data. It could be that I need more data, and once I have it, like Adam Jackson’s megalomaniac robots, I could take over the world. Or, my failure to attain predictive nirvana may be due to having an imperfect model. It could be that I have the wrong variables. In that case, I could have variables on a billion people on a billion days and run it on a super-computer and still end up with no better prediction than nothing.
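The wrong-variables point can be sketched with a small simulation. Everything here is invented for illustration: I assume a single predictor that is generated completely independently of the outcome, and the names and distributions are hypothetical, not my actual dataset.

```python
import random


def squared_correlation(x, y):
    """R-squared of a one-predictor linear regression (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def irrelevant_predictor_r2(n, seed=0):
    """Simulate n days where the predictor has nothing to do with the outcome."""
    rng = random.Random(seed)
    monkeys = [rng.gauss(50, 10) for _ in range(n)]  # made-up sea monkey counts
    wtf = [rng.gauss(3, 1) for _ in range(n)]        # made-up remarks, drawn independently
    return squared_correlation(monkeys, wtf)


# More days of data do not rescue a predictor that is unrelated to the outcome.
for n in (100, 10_000, 100_000):
    print(n, irrelevant_predictor_r2(n))
```

Under that assumption, the printed R-squared values should stay close to zero however many days are simulated. More rows only give a more precise estimate of "nothing".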
It is very likely that my dependent variable is not explained by the number of sea monkeys, the number of specks, the degree of sea monkey activity or anything at all external to my co-workers. Perhaps the variables I should have measured are these:
- BOREDOM: Total rating of each worker
- ATTEND: Number of people at work today on the third floor
- PERSONBORE: The product of the first two variables for a boredom factor for the floor
- WEATHER: This could be an ordinal scale from nasty, cold, rainy to perfect beach weather
(March 4 in Los Angeles – everyone else, eat your heart out!)
Even if I were to have the right variables, you cannot measure everything. Believe me, I have been trying most of my life. There is that random error factor. Co-worker A may have a boredom index of 4,000,000, the weather outside may be hailing bowling balls so she is not leaving the building, she may walk by my desk 762 times but she is NOT going to speak to me, not even if the sea monkeys get hit by radiation leaking out of the pipes, grow to the size of a dinosaur and swallow our new CIO whole, because once, three months ago, she arrived at the elevator shortly after me and I did not hold the door for her and she had to wait for the next elevator, thus getting to her meeting thirty seconds late and blowing her chance to make a great impression on the new CIO.
But, what if you had a billion billion data points… wouldn’t that work? No. If you had that many data points you would be uniquely identifying each person. That would work kind of like the way tests like the Myers-Briggs work, where you check off boxes saying that you’d rather go to a party than solve algebra problems, you’d rather exercise than paint pictures and you’d rather drink a martini than write a blog. You get results back that say you like parties, exercise and drinking better than math, computers and art. No new information is added.
Mathematically, if you use a billion different data points to predict a billion people’s outcomes, you get just what I got with a much smaller number – an adjusted explained variance of less than nothing.
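One way to read this is as a model with a separate parameter for every person, which simply memorizes what each person did. The sketch below takes the extreme case and assumes each person's outcome on any given day is pure random noise; that is my simplification for illustration, and the function names are hypothetical.

```python
import random


def r_squared(actual, predicted):
    """1 - SSE/SST, where SST is measured around the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst


def memorization_demo(n_people=10_000, seed=0):
    """Fit one parameter per person on day 1, then predict day 2 with it."""
    rng = random.Random(seed)
    day1 = [rng.gauss(0, 1) for _ in range(n_people)]  # assumed: pure noise
    day2 = [rng.gauss(0, 1) for _ in range(n_people)]  # assumed: independent of day 1
    fitted = list(day1)  # the "model" just repeats each person's day-1 outcome
    in_sample = r_squared(day1, fitted)
    out_of_sample = r_squared(day2, fitted)
    return in_sample, out_of_sample


in_sample, out_of_sample = memorization_demo()
print(in_sample, out_of_sample)
```

In-sample, the fit is perfect, because the model has only handed back what it was told. On the second day, the memorized values do worse than guessing the average for everyone, so the out-of-sample R-squared comes out below zero.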
According to the Myers-Briggs, I am probably one of the most introverted people on earth. Although I loathe meetings, I spent decades as a consultant flying around the country, because my husband died, I had three small children and the money was good. That is just one random event, but here is what I have noticed … for all of us, life is a whole series of random events strung together.
So, I don’t think we need to worry about the robots rising up against us just yet. On the other hand, maybe we ought to keep a closer eye on those sea monkeys.