True confessions: I just don’t get the data hackathon

I am old. If I was not aware of this by looking at my U.S. birth certificate (which, like President Obama, I do have), I have America’s most spoiled thirteen-year-old to tell me,

“You’re old.”

(sometimes followed by – “and stupid” – if I have just told her that no, I will not buy her an iPhone 4 because her current iPhone works just fine, or no, I will not buy her a $26 lip gloss at Sephora because THAT is, in fact, stupid, and no one who is thirteen and beautiful needs to wear cosmetics anyway.)

So, it has been established, by four daughters in succession turning thirteen, that I am old and out of it.

Perhaps this explains why the data hackathon and similar concepts mystify me. I just don’t see what of any use can be accomplished by people with no familiarity with the data throwing themselves at it for 24 or 48 hours without sleep.

  1. Have none of you read all of the research on how sleep deprivation has a negative effect on cognitive performance? Do you think for some reason it doesn’t apply to you? Seriously, the ability to go without sleep is not a super-power. It’s more often a sign you’re using coke, which, as we used to say in college, is God’s way of telling you that you’re making too much money.
  2. At a rock-bottom minimum, to understand any dataset you need to know how it was coded. What is variable M03216, exactly? If that is the question, “What is 3 cubed?”, then what do the answers A, B, C, D, 97, 98, and 99 mean? Which of A through D is 27, the correct answer, and what the hell are 97–99? These are usually some type of missing-value indicator, possibly showing whether the student was not administered that particular item, didn’t get that far in the test, or just didn’t answer it.
  3. Once you’ve figured out how the data are coded, you need to make some decisions about what to do with them AT THE ITEM LEVEL. What I mean by that is, for example, if an item wasn’t administered, I count the data as missing, because there was no way the student could answer it. If the student left it blank, I count it as wrong, because I’m assuming that, on a test, if you knew the right answer, you would have given it. What if the student didn’t make it that far in the test? I count it as wrong then, too, though I realize I have less certainty in that case. My point is, these are decisions that need to be made for each item. You can make them in a blanket way – as in, I’m doing this for all items – or on a case-by-case basis. But whether you do it knowingly or not, something is going to happen with those items.
  4. Often, sampling and weighting are issues. If you are doing anything more than sample descriptive statistics (which aren’t all that useful in and of themselves), you need to know something about how the data were sampled and weighted. Much of the government data available uses stratified sampling, and often certain strata are sampled disproportionately. AT A MINIMUM, you need to read some documentation to find out how the sampling was done. If you don’t, maybe you’ll be lucky and the actual sampling was simple random sampling, which is the default assumption for every procedure I can imagine. Listen carefully – hoping to get lucky is a very poor basis for an analysis plan.
  5. Data quality needs to be checked and data cleaning done. Unless you are a complete novice or a complete moron, you are going to start by checking for out-of-range values, reasonable means and standard deviations, and the expected distribution, just to make sure that each variable you are going to use in your analysis behaves the way it should. If there are no diseases recorded on Sundays in some zip codes, I doubt it’s because people don’t get sick on those days; more likely the labs are closed. This may make a difference in the measured incidence of a disease if people who get sick on weekends go to another area to be treated. You need to understand your data. It’s that simple.
  6. Somewhere in there is the knowledge-of-analytic-procedures part. For example, I wanted to do a factor analysis (just trust me, I had a good reason). Well, because of the way this particular dataset was put together, no one had complete data, so I couldn’t use PROC FACTOR. I thought I might be able to use PROC CALIS with METHOD=FIML – that is, the full information maximum likelihood method. I went to the SAS documentation and flipped through the first several pages, glanced at some model by Bentler illustrating indicators and latent constructs, said to myself, “Blah blah, I know what that is”, and skipped over the stuff on various types of covariance matrices, etc. I can guarantee you that the first time I looked at this (which was a long time ago), I did not flip through it. I probably spent 15 minutes looking at just that model, making sure that I knew what F1 was, what v1, v2, etc. were, and so on. That is one advantage of being old: you’ve seen a lot of this stuff and puzzled it out before.
  7. You need to actually have the software required. Although there is supposedly a version of SAS out there that allows the use of FIML with PROC CALIS, none of the clients I have at the moment have the latest version of SAS, so I don’t have access to it. I believe FIML is only available in SAS 9.22. You might think having the software available would be a benefit of a hackathon, but the ones I have seen advertised all say to bring your own laptop, so I am assuming not.
  8. All of this comes before knowledge of the actual content area, be it contributions to charity, earthquake severity, mathematics achievement, or whatever. I assume there has been previous research on the topic, and it might be helpful to have that knowledge. I will guess that the organizers of a hackathon are assuming whoever attends will already have #5, #6, and #7 here. This is again a bit puzzling to me, because acquiring that combination of knowledge took me years of experience, and I don’t think I am particularly slow-witted. It’s like all the job announcements you see that want someone twenty-five years old with at least ten years of experience. The hackathons seem to be aimed at young people. (The advertised free coffee and all the fast food you can eat is not really much of an enticement for most people in their fifties. If you can’t afford to buy your own coffee and fast food by that age, I think you took a wrong turn somewhere, and perhaps you should hang up that DATA ROCK STAR t-shirt, buy a suit, and go get a job.)
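To make points 2 and 3 concrete, here is a minimal sketch – in Python rather than SAS, and with the meanings of codes 97, 98, and 99 assumed for illustration, not taken from any actual codebook:

```python
# Toy scoring of one test item, following the "What is 3 cubed?" example.
# The correct answer (C) and the meanings of 97/98/99 are assumptions.

CORRECT = "C"  # assume C is coded as 27, the right answer

def score_item(response):
    """Return 1 for correct, 0 for wrong, None for missing."""
    if response == "97":          # not administered: student never saw it
        return None               # so count it as missing
    if response in ("98", "99"):  # not reached / left blank
        return 0                  # count as wrong (a judgment call)
    return 1 if response == CORRECT else 0

responses = ["C", "A", "97", "98", "99", "C"]
scores = [score_item(r) for r in responses]
print(scores)  # [1, 0, None, 0, 0, 1]
```

The point is not the code, which is trivial, but that each branch is a decision someone has to make deliberately.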
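Point 4 is easy to demonstrate with a toy example. The strata, scores, and weights below are invented; the only point is that ignoring disproportionate stratified sampling biases even a simple mean:

```python
# (score, weight) pairs: weight = population units each respondent represents.
# Stratum B was oversampled 4x, so its respondents get smaller weights.
sample = [
    (80, 1.0), (82, 1.0),              # stratum A, sampled proportionately
    (60, 0.25), (62, 0.25),            # stratum B, oversampled
    (61, 0.25), (63, 0.25),
]

unweighted = sum(s for s, _ in sample) / len(sample)
weighted = sum(s * w for s, w in sample) / sum(w for _, w in sample)

print(unweighted)  # 68.0 -- dragged down by the oversampled stratum
print(weighted)    # 74.5 -- closer to the population mean
```

If the sampling really was simple random sampling, the two numbers agree and you got lucky. Hoping for that is, again, a poor analysis plan.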
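And a minimal sketch of the kind of sanity checks point 5 describes, with an invented variable and an assumed valid range:

```python
# Basic data-quality pass: flag out-of-range values, then check that the
# cleaned variable has a plausible mean and standard deviation.
import statistics

ages = [34, 29, 41, 187, 52, -3, 45]   # toy data with two impossible values

VALID_RANGE = (0, 120)                 # assumed plausible range for age
out_of_range = [a for a in ages if not VALID_RANGE[0] <= a <= VALID_RANGE[1]]
print(out_of_range)  # [187, -3]

clean = [a for a in ages if a not in out_of_range]
print(round(statistics.mean(clean), 1))   # 40.2
print(round(statistics.stdev(clean), 1))  # 9.0
```

None of this is sophisticated; it is exactly the kind of thing that takes an hour when you know to do it and ruins an analysis when you don’t.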

So, I don’t get it. I don’t get what you are hoping to pull together in 48 hours, with little sleep, little time, and often limited knowledge of the data. PERHAPS the idea is that if enough people try enough different random things, we will just get lucky??

Luck is not a research strategy.

Oh, and there’s this little thing called Type I error.
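To put a number on that: when the null hypothesis is true, p-values are (roughly) uniformly distributed, so testing pure noise at the usual .05 level hands out “significant” findings about 5% of the time. A small simulation, with entirely invented data:

```python
# Simulate Type I error: run many tests on pure noise and count how many
# come out "significant" at alpha = .05 just by chance.
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n_tests = 100_000
alpha = 0.05

# Under a true null hypothesis, p-values are uniform on [0, 1],
# so each test has a 5% chance of being a false positive.
p_values = [random.random() for _ in range(n_tests)]
false_positives = sum(p < alpha for p in p_values)

print(false_positives)  # roughly n_tests * alpha = 5,000 "discoveries" from noise
```

Scale that up to a million tests and you have tens of thousands of shiny, significant, and completely meaningless results.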

Comments

6 Responses to “Luck is not a research strategy”

  1. Arthur Tabachneck on April 20th, 2011 9:39 am

    First, since you have a 13 year old daughter, you CAN’T be old, as that would make someone like me ancient and I simply can’t accept that.

    Second, I’m not going to defend hackathons and wouldn’t participate in one even if I was much younger.

    And, I don’t disagree with any of your comments about the utility of knowing one’s data, available analytics, the underlying assumptions of those analytics, and the conditions required to draw valid conclusions from combining the two.

    However, I do have to take exception to your discounting the potential usefulness of chance findings or, in your terms, luck as a research strategy. There is a reason why the word “serendipity” is so often used to describe the historical discovery of so many scientific advances.

    We don’t all have clear paradigms to provide direction and, for those who are more fortunate, the world would know much less if some of our colleagues hadn’t looked for deviations from the expected patterns and results.

    Similarly, if we would only work with perfect datasets, I doubt many of us would EVER reach the level needed to discover or confirm anything.

    In short, I have to disagree with the statement that “luck is not a research strategy.” Possibly not the optimal strategy for all types of research, but definitely one that I hope at least some researchers continue using.

  2. admin on April 20th, 2011 1:36 pm

    As far as serendipity is concerned, I would say that chance favors the prepared mind. That is, Alexander Fleming (not to be confused with Ian Fleming, the James Bond guy) discovered penicillin by serendipity but that was after many years of study and research in biology.

    If it would have been me in that laboratory, I would have said, “Damn! Half the things in this dish are dead and the rest are nasty!” and washed the whole thing down the drain – a scenario which no doubt occurred somewhere in history before Fleming.

    Certainly there are times when people start out to study one effect and find something else, but these are people that have some knowledge in that field, not people who randomly jam parts together and all of a sudden invent the integrated circuit.

  3. Rich Williams on April 22nd, 2011 2:12 pm

    My response to “You’re old and stupid” is, “Yes, and also holding the car keys and your allowance.”

    In defense of hackathons, they give an outlet for the less-than-athletic nerdy types to compete in an “extreme” environment, but without that pesky risk of death hanging over the event. I don’t know why, but give young men (in their twenties – with no family responsibilities) the opportunity to overdose on adrenaline, caffeine, sugar, trans fat, and sleep deprivation, and you’ll soon have an army ready to do whatever menial or idiotic tasks you set before them. This might be called the Scoobie Snack principle.

    When I was younger, my colleagues and I would stay up all night coding – mainly because we had an insane director who always promised that we’d have a substitute for Microsoft Office written up first thing in the morning. Typically we were familiar with our data, so that was not an issue, but just the sheer magnitude of the requirements squeezed into such tight deadlines kept the adrenaline flowing.

    At some point during the all-nighter, though, you reach a point where the miraculous happens: everything just seems to work out splendidly. This is not the serendipity that Arthur mentioned. No, this is where every data source has exactly what you need, every decision you make happens to be the correct one, every procedure gives exactly the output that you were expecting. You put the final report on your boss’s desk and go home for a couple hours sleep, knowing that the gods have smiled upon you and that you’ll probably get a huge bonus for being such a genius.

    The next morning, though, you relook at your code and start to realize that none of it makes any sense at all. In severe cases, you can’t even figure out what you were trying to accomplish. At about that point an older and wiser colleague will stop by to say that last night you had been listening to the Golden Monkey, the mythical beast that comes to coders late at night and tells them exactly what to type. Unfortunately, the monkey doesn’t speak English (or Spanish, or C, or SAS – but oddly enough he does know Perl), and nobody who has slept in the past two days can translate what he says.

    But on the bright side, at least with all the time you saved coding this up last night, you’ll have plenty of time to redo all of it today. Which leads to the saying that became almost a mantra with us: We never have time to do it right, but we always have time to do it over.

  4. admin on April 22nd, 2011 2:26 pm

    Rich,
    Thanks for the hilarious comment. And I am SO going to use that car keys and allowance line!

    There have been many times in my career when I have gone into the office one day and gone home the next. Some good work did get done, but there was always the point where I realized I was making more mistakes than progress and went home for a nap.

    Everyone makes mistakes, and if you realize that the analysis you were going to present on the National Institutes of Health campus next week was based on incorrectly coded data and there is no way you can present incorrect results (of course), I’m willing to step in, re-do everything at the last minute, and work insane hours (for a fee). Once or twice is being human. After that, it’s a habit, and unless that fee is very large, I may quit returning your calls.

    Just working at that pace for the hell of it strikes me as odd.

    I’ve found that most people who believe they are better speakers when they are off the cuff and unrehearsed are dead wrong. I wonder if the same is true of people who believe they work best under pressure?

    Your analogy with athletic competition fascinates me. Certainly, people tend to peak during major competitions.

    I think I may be going to a meeting next month where we are going to try to put together a design in a couple of days because that is all of the time I can be there. Will be interesting to see how that goes.

  5. Really the next big thing: SAS Global Forum Day 3 : AnnMaria’s Blog on April 24th, 2012 12:33 am

    [...] written many times before about how open data is a good idea but it is not as simple as running a million correlations and pulling out the 50,000 that are significant and throwing them all into a stepwise regression [...]

  6. More adventures with SAS web editor : AnnMaria’s Blog on June 4th, 2013 4:43 pm

    [...] As I’ve ranted on here before, don’t think that using open data for your class is going …. Plan in advance. Give yourself time to identify any problems [...]
