Open Data Wikipedia or How many monkeys = 1 statistician?

Remember that old saying that 1,000,000 monkeys on a typewriter would eventually produce Shakespeare?

After the equivalent of more than 1,000,000 monkey-years of text published on the web, so far, no Shakespeare. (For a superb, in-depth discussion of this point, read Jaron Lanier’s book, “You Are Not a Gadget”.)

In very, very brief, Lanier says that crowd-sourcing is NOT a panacea, that comments from 250,000 Introductory Physics students do not even equal a mediocre physicist, much less Einstein.

Looking at the movement for Open Data, I strongly agree with his concerns. There are two reasons for my skepticism:

1. Even with considerable knowledge of statistics, programming and the field in question, analyzing data correctly takes a long time.
2. Without that knowledge, your results may be not only wrong but harmful.

Let’s start with the second problem. Some days I wonder if any of the people hyperventilating at the prospect of 100 million citizens doing their own analyses have ever heard of a Type I error. Let me burst their bubble …

Let’s take death from car accidents as a variable, since it is pretty easy to tell if a person is dead and it is pretty hard to misdiagnose getting run over by a car. We could hypothesize that people who live on the west side of town are more likely to die in car accidents than people who live on the east side. It would be unlikely for the percentage of people dying in a given year to be EXACTLY the same, with 1.67845% of the people on both sides of town dying. Maybe it will be 1.67845% on the west side and 1.70000% on the east side.

We understand that slight differences are expected, even when there really is no “real” difference in the population, but it would be unusual to get a large difference by chance. How unusual? By the usual convention, a statistically significant result is one that would occur by chance less than five times out of a hundred if there were no real difference.
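To put a number on “how unusual”, here is a minimal sketch of a pooled two-proportion z-test in plain Python. The population of 50,000 per side is my own assumption for illustration (the example above gives only the percentages), and the function name is made up for this sketch.

```python
from math import erf, sqrt

def two_proportion_p_value(rate_a, n_a, rate_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Standard normal CDF computed from the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - cdf)

# Assumed for illustration: 50,000 residents on each side of town.
p_value = two_proportion_p_value(0.0167845, 50_000, 0.0170000, 50_000)
print(f"p = {p_value:.2f}")
```

Under those assumed population sizes, the p-value comes out far above .05, so a gap of 1.67845% versus 1.70000% is exactly the kind of slight difference we expect by chance.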

Of course, if you have very small sample sizes, for example, looking at the people who live on the east and west sides of Zap, North Dakota, you might find substantial differences in percentages, with 25% of the east side of Zap passing away compared to 0% on the west side. (Last I knew, Zap had a population of 8.) That difference, though large in percentages, would not be unusual at all.
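The Zap example can be checked with an exact test. This is a sketch, assuming the 8 residents split 4 and 4 between the two sides (so 25% means one death); the helper function is hand-rolled for illustration rather than taken from any statistics package.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x events in row 1 with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Add up every possible table that is at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= observed * (1 + 1e-9))

# Assumed split: 4 residents per side; 1 death on the east side, 0 on the west.
p_zap = fisher_exact_two_sided(1, 3, 0, 4)

# Same 25% vs 0% split, but with a hypothetical 40 residents per side.
p_bigger_town = fisher_exact_two_sided(10, 30, 0, 40)
print(p_zap, p_bigger_town)
```

With one death among eight people, the death is equally likely to land on either side, so the test finds nothing unusual. The same percentages in the larger hypothetical town would be a very different story.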

EVEN IF everyone knew exactly what they were doing and EVEN IF every analysis was done exactly right… 25,000,000 analyses, each tested at the .05 level, would STILL give us about 1,250,000 statistically significant results JUST BY CHANCE. Somewhere in with those results that occurred by chance are the real, honest-to-goodness differences. But how do we know which are real?
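The arithmetic is just 25,000,000 × .05, and a small simulation shows the same thing. This is a sketch under a stated assumption: when there is no real difference, p-values are spread evenly between 0 and 1, so I draw them straight from a uniform random number generator instead of running real analyses. The function names are made up for illustration.

```python
import random

def expected_false_positives(n_analyses, alpha_percent=5):
    """Expected number of chance 'hits' when no real effect exists anywhere."""
    return n_analyses * alpha_percent // 100

def simulate_false_positive_rate(n_analyses, alpha=0.05, seed=0):
    """Draw one p-value per analysis under the null and return the share of 'hits'."""
    rng = random.Random(seed)
    # Assumption: with no real difference, p-values are uniform on [0, 1).
    hits = sum(rng.random() < alpha for _ in range(n_analyses))
    return hits / n_analyses

print(expected_false_positives(25_000_000))  # 1250000
rate = simulate_false_positive_rate(200_000)
print(rate)
```

The simulated share should land close to 5%, and nothing in the output marks which “significant” results are the flukes.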

There was that very, very important qualifier,

“Even IF everyone knew what they were doing”

which as my lovely daughter puts it

“has the same probability as flying monkeys coming out of my butt”.

Many of the people who are asking questions are just throwing up a bar chart … Let’s skip over the innumerable mistakes and over-generalizations that can occur and assume instead that we get thousands of sociologists, demographers, statisticians, knowledgeable people who majored in various fields or studied them as a hobby and now want to apply their knowledge. Way to go! Wikipedia of data! Woot! Woot!

What would THEY do? Maybe …

• Look for a pattern of results, to see if we find that in every city in the country the people on the east side are more likely to have a close encounter of the fatal kind with a Toyota.
• Or … see if we find the same pattern five years in a row.
• Or … split the data randomly into two datasets and look for the same pattern in each one.
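The last of those checks, the random split, can be sketched in a few lines. Everything here is made up for illustration: the records are simulated with NO real east/west difference and a death rate of roughly 1.7%, and the function names are my own.

```python
import random

def death_rate(records, side):
    """Share of people on `side` whose record is marked as a death."""
    group = [died for s, died in records if s == side]
    return sum(group) / len(group)

def east_minus_west(records):
    """Difference in death rates, east side minus west side."""
    return death_rate(records, "east") - death_rate(records, "west")

def split_half(records, seed=0):
    """Shuffle a copy of the records and cut it into two random halves."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Simulated data: (side, died) pairs where both sides share one true rate.
rng = random.Random(42)
records = [(rng.choice(["east", "west"]), rng.random() < 0.017)
           for _ in range(20_000)]

half_a, half_b = split_half(records, seed=1)
diff_a, diff_b = east_minus_west(half_a), east_minus_west(half_b)
same_direction = (diff_a > 0) == (diff_b > 0)
print(diff_a, diff_b, same_direction)
```

A pattern that shows up in one half and vanishes or flips in the other is a good candidate for a chance result. Agreement between the halves is encouraging but not proof, since two halves of pure noise will point the same way about half the time.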

I’m not saying it couldn’t happen. I love Wikipedia, and appearances to the contrary, I love the idea of open data. Here’s the thing … it will take a lot of time, effort and knowledge. When I download data from the data.gov site, it invariably takes many hours of my time to read the codebooks to understand what each variable represents and to read the technical summary on how the data were sampled, stratified and weighted. Then, I run the programs to read in the data, which often requires merging multiple datasets from data.gov, or from data.gov and other sites, to include additional variables I might want, say, economic variables for each census tract merged with test scores.
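To give a flavor of the merging step, here is a minimal sketch in plain Python. The tract IDs, column names and values are all invented for illustration and do not come from any real data.gov file. The one habit worth copying is keeping track of the rows that fail to match, because silently dropped records are a classic way to get a wrong answer.

```python
def merge_on_tract(scores, economics):
    """Inner-join two lists of dicts on 'tract'; also return unmatched tract ids."""
    econ_by_tract = {row["tract"]: row for row in economics}
    merged, unmatched = [], []
    for row in scores:
        match = econ_by_tract.get(row["tract"])
        if match is None:
            unmatched.append(row["tract"])  # keep a record of what was lost
        else:
            merged.append({**match, **row})
    return merged, unmatched

# Hypothetical rows standing in for two separately downloaded files.
test_scores = [
    {"tract": "0001", "mean_score": 712},
    {"tract": "0002", "mean_score": 655},
    {"tract": "0009", "mean_score": 690},  # no economic data for this tract
]
economic_vars = [
    {"tract": "0001", "median_income": 61000},
    {"tract": "0002", "median_income": 38000},
]

merged, unmatched = merge_on_tract(test_scores, economic_vars)
print(merged)
print(unmatched)
```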

If I’m lucky, reading what I need to know to understand the data, running the programs and doing some basic analyses will take me a day or two. Often it can take as much as a week. And that is after using SAS (which many of the programs use) and working in applied statistics for over 25 years. I have a Ph.D., and I am almost always looking at data in areas where I’ve done research for a decade or more.

If I had an interest in, say, global warming, and wanted to examine climate data, it would take me MUCH longer because there would be so much I would have to learn before I could begin to understand it.

Having said all of that, I think there may be a great possibility for open data, but perhaps not in the crowd-sourcing way. What I would love to see is NOT apps, NOT tweets but actual study. There is a model for this.

For years, the Inter-University Consortium for Political and Social Research (ICPSR) has offered data to anyone at those universities that paid an institutional fee to belong. It was often not used

a) because in many schools people didn’t even know it was available and

b) because it required someone with very good skills with SAS, SPSS or Stata (the syntax, not the pointing and clicking kind).

For those who did use it, the data were a gold mine. If you had to make the argument that in counties with higher rates of unemployment people were more likely to apply for disability benefits – bingo! You could find right there datasets on disability claims and unemployment.

I used the ICPSR data for many, many examples in courses. If a student was interested in a subject, I would help him or her obtain the data, put it into a dataset that was easily analyzed and provide assistance with the statistics. Most results were along the lines of showing that children from low-income families were more likely to attend schools that had low API scores. I can think of a couple of students who came up with results I thought were revelations.

Now, data.gov and other open data resources are offering up data similar to ICPSR on a grand scale. What I’d love to see (and I’ve mentioned it before) is a repository for the RESULTS. Just like Wikipedia has references and a format, I’d like to see that created for open data, with a link to the data, a link to the program, brief results and a contact person for more information.

I’d like it to be edited, so that if someone posts results showing that women are more likely to die in childbirth than men and asserts that this is proof that obstetricians are sexist, it gets taken down – no, not countered with a competing explanation of the results, but actually deleted. Yes, care would have to be taken to be sure this didn’t end up like just a very, very large refereed journal (don’t even get me STARTED ranting on that!).

I just went into her room and asked the world’s most spoiled twelve-year-old what she was up to. She said,

“I’m plotting world domination. I’m going to start with the violent overthrow of the U.S. government. Since they’d probably ground the flights if I brought down the government, I’m going to do it after you get back from your trips in March, so you don’t get stuck in an airport, but before April 15 so you don’t have to pay taxes. See how thoughtful I am? Now, can I have an iPhone 4 and go see the Justin Bieber movie with my friends tomorrow?”

(I think she was being sarcastic.) Despite having an actual quote, and the fact that her current events report for social studies was on Mubarak’s resignation, I don’t think Wikipedia would allow me to put up an article “Egypt youth protests spark pre-teen uprisings in America”. You NEED editing.

A good start for a Wikipedia of data might be to provide encouragement and support for universities and professionals to use open data for their course assignments, dissertations, theses, blogs and conference presentations, and then put up the results. Most of the work done by academics, students and other professionals never sees the light of day outside of a one-time presentation to a room of 30 people at a conference or class. Sometimes that’s just as well, but a lot of times what is presented is as good as most articles on Wikipedia, and we’ve seen how many people that has helped.

Yes, it will be hard and a lot of work, but I think an open data wikipedia can be created and would be extraordinarily useful if done right. I just wouldn’t leave it up to monkeys.

February 12, 2011

