Maybe I have been wrong.
It wouldn’t be the first time. In fact, most of the really great things in my life have come about when I realized I was on the wrong track and took a sharp right turn. (Uncharacteristically skipping the opportunity here to make a snarky comment about my first boyfriend, job or marriage.)
Four incidents made me reconsider what matters in statistics.
The first two were related. I attended a really fascinating lecture on statistics where the speakers discussed pretty much what morons 99.99% of the world were: how assumptions are violated, how variables are not normally distributed, and how as a result our standard errors are usually wrong, not just by a little bit but by a lot, so we should all be learning new statistical methods. Honestly, I found myself nodding my head in agreement at most of what they said, and I have to admit that I have been guilty of some of these same errors myself.
Shortly thereafter, I was at the SPSS Directions conference and I asked the equally fascinating Bill from SPSS about the research he was doing to predict shipments of contraband, violent incidents in Richmond and cyberattacks. He said much of what they were using was Decision Tree.
It dawned on me that both of these brilliant men were right. On the one hand, making the wrong assumptions can leave your true standard errors far larger than the ones you report. On a practical level, with 100 or even 300 subjects this can make a substantive difference. However, if you have 12,000 or 12,000,000 records then yes, your standard error may actually be three times what you incorrectly assumed, but if that means it is .00027 instead of .00009, it isn’t going to make much difference.
This harks back to that article in Wired a few years ago, The End of Theory. The gist of it was that with the amount of data we have, who needs science? At the time, I spotted some immediate flaws in the argument, and I have spotted similar ones since. If you don’t look at even some basic regression assumptions you may miss, for example, that your prediction is great overall but for your highest dollar customers, most violent offenders, or the extreme cases on whatever your dependent variable is, your error rate is much higher, and these are the exact people you most want to predict (i.e., you have violated the assumption of homoscedasticity).
I have to say, too, that the analogy with language translation was less than compelling. The article said that Google can translate from one language to the next without knowing the language. Well, not so well. I just typed into Babelfish a phrase in Spanish that I had said to my little daughter, lying in the next bed in the hotel. It means, “You are such a beautiful little girl.” The translation came back, “That pretty girl.”
… and yet … it was close. It didn’t come back with “Hey, buddy, you want to buy a goat?”
It wouldn’t take too much effort, really, to take the 1,000 or 10,000 most common phrases in each language, enter those into the software, have them checked and THEN go on to the word-by-word translation.
The same with statistics. You can build in the diagnostics, as SAS has done with ODS GRAPHICS, for example.
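To make "build in the diagnostics" a little more concrete, here is a minimal sketch of the kind of check I have in mind. It is not what ODS GRAPHICS does internally; it is just an illustration in plain Python. It fits a one-predictor regression, sorts the cases by predicted value, and compares the spread of the residuals in each band. The data are simulated, and everything about them (the spending variable, the noise that grows with spending, the function names) is made up for the example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_spread_by_band(xs, ys, bands=5):
    """Standard deviation of the residuals within equal-sized bands of
    predicted value, lowest band first."""
    intercept, slope = fit_line(xs, ys)
    rows = []
    for x, y in zip(xs, ys):
        predicted = intercept + slope * x
        rows.append((predicted, y - predicted))
    rows.sort()  # order cases from lowest to highest predicted value
    size = len(rows) // bands
    spreads = []
    for i in range(bands):
        residuals = [r for _, r in rows[i * size:(i + 1) * size]]
        spreads.append(statistics.pstdev(residuals))
    return spreads


# Simulated (hypothetical) data: the noise grows with x, so the biggest
# "customers" are the hardest to predict.
rng = random.Random(42)
xs = [rng.uniform(1, 100) for _ in range(5000)]
ys = [2 * x + rng.gauss(0, 0.2 * x) for x in xs]

spreads = residual_spread_by_band(xs, ys)
for band, spread in enumerate(spreads, start=1):
    print(f"band {band}: residual sd = {spread:.2f}")
```

If the spread in the top band is several times the spread in the bottom band, the overall fit is hiding exactly the problem described above: the model looks great on average and does worst on the cases you care about most.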
I think some knowledge of statistics is crucial, but I am not so convinced that minor departures from normality, small correlations among variables or some heteroscedasticity will damn us all to statistical hell when our datasets are approaching hundreds of gigabytes on a fairly regular basis. Yes, it will not be precisely correct. Yes, there IS danger in not understanding some of the basic underpinnings of statistics. The Chronicle of Higher Education forum has anonymous (more or less) users but some day I do hope to meet in person the person whose signature says,
“Being able to find SPSS in the start menu does not qualify you to run a multinomial logistic regression.”
Not only funny, but true. On the other hand, I don’t think you need to be able to calculate structural equations in your head to be qualified to design, conduct and interpret a statistical study.
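For what it's worth, the sample-size point from a few paragraphs back is easy to check with arithmetic. This is a rough sketch, assuming the simplest possible case (the standard error of a mean, which is the standard deviation divided by the square root of n), a made-up standard deviation of 15, a normal-approximation 95% interval, and taking the "three times too small" figure at face value.

```python
import math


def standard_error(sd, n):
    """Standard error of a mean under the usual textbook assumptions."""
    return sd / math.sqrt(n)


def ci_half_width(sd, n, inflation=1.0, z=1.96):
    """Half-width of an approximate 95% interval. `inflation` is how many
    times larger the true standard error is than the naive one."""
    return z * inflation * standard_error(sd, n)


SD = 15.0  # hypothetical standard deviation, chosen only for illustration

for n in (100, 300, 12_000, 12_000_000):
    naive = ci_half_width(SD, n)
    tripled = ci_half_width(SD, n, inflation=3.0)
    print(f"n = {n:>10,}: naive +/- {naive:.4f}, actual +/- {tripled:.4f}")
```

With 100 subjects, the difference between the naive and the tripled interval is several points on a scale where the standard deviation is 15, which can easily change your conclusion. With 12,000,000 records, both intervals are a tiny fraction of a point wide, and it is hard to imagine a decision that hinges on the difference.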
The third incident included the same Bill-the-I-think-he-was-a-vice-president from SPSS. (A person better than me at sucking up would have found out his last name. I did make a faint effort on Google, which yet again proved less than infallible, as Bill appears to be a popular name for SPSS vice-presidents and I am writing this at 1:30 a.m. on the east coast, as I have not yet adjusted to the time zone, so I am somewhat imprecise.) I ran into him after his presentation and asked him if he had published his results, as the improvements in prediction they had achieved were really quite remarkable. He said,
“Not in the sense that someone like you from a university means by publishing.”
He went on to say that he had presented at conferences like this one and discussed the results with customers but not published in an academic journal.
Finally, there was the SAS Tech Report, in which editor Waynette Tubbs, writing about finding a job and networking, asked, “Are you published?” But she meant having a blog and presenting papers at places like WUSS (Western Users of SAS Software) and SAS Global Forum. This is very far from the definition of publishing that I was taught (nearly brainwashed), as almost all doctoral students at research institutions are: that peer-reviewed academic journals were the gold standard, be-all, end-all and 90% of the measure of your worth as a human being.
So, is it just barely possible that having a very, very good understanding of statistics, albeit not to the point of producing pages of proofs off the top of your head, and writing about it in a comprehensible fashion is what really matters? To a regular person, this probably doesn’t sound too crazy, but to someone who has spent most of a lifetime in academia it borders on heresy.
I think that knock on the door is a group of inquisitors come to burn me at the stake.
On the other hand, it may just be room service.