I am old. If I was not aware of this by looking at my U.S. birth certificate (which, like President Obama, I do have), I have America’s most spoiled thirteen-year-old to tell me,
(sometimes followed by “and stupid” if I have just told her that no, I will not buy her an iPhone 4 because her current iPhone works just fine, or no, I will not buy her a $26 lip gloss at Sephora because THAT is, in fact, stupid, and no one who is thirteen and beautiful needs to wear cosmetics anyway.)
So it has been established, by four daughters in succession turning thirteen, that I am old and out of it.
Perhaps this explains why the data hackathon and similar concepts mystify me. I just don’t see what of any use can be accomplished by people with no familiarity with a dataset throwing themselves at it for 24 or 48 hours without sleep.
- Have none of you read all of the research on how sleep deprivation has a negative effect on cognitive performance? Do you think for some reason it doesn’t apply to you? Seriously, the ability to go without sleep is not a super-power. It’s more often a sign you’re using coke, which as we used to say in college, is God’s way of telling you that you’re making too much money.
- At a rock-bottom minimum, to understand any dataset you need to know how it was coded. What is variable M03216, exactly? If that is the question “What is 3 cubed?”, then what do the answers A, B, C, D, 97, 98 and 99 mean? Which of A through D is 27, the correct answer, and what the hell are 97 through 99? These are usually some type of missing value indicator, possibly showing whether the student was not administered that particular item, didn’t get that far in the test or just didn’t answer it.
- Once you’ve figured out how the data are coded, you need to make some decisions about what to do with them AT THE ITEM LEVEL. What I mean by that is, for example, if an item wasn’t administered, I count the data as missing, because there was no way the student could answer it. If the student left it blank, I count it as wrong because I’m assuming, on a test, if you knew the right answer, you would have given it. What about if the student didn’t make it that far in the test? I count it as wrong then, also, but I do realize I have less certainty in that case. My point is, these are decisions that need to be made for each item. You can do it in a blanket way, as in, I’m doing this for all items, or you can do it on a case-by-case basis. Whether you do it knowingly or not, something is going to happen with those items.
- Often, sampling and weighting are issues. If you are doing anything more than just sample descriptive statistics (which aren’t all that useful in and of themselves), you need to know something about how the data are sampled and weighted. Much of the government data available uses stratified sampling, and often certain strata are sampled disproportionately. AT A MINIMUM, you need to read some documentation to find out how the sampling is done. If you don’t, maybe you’ll be lucky and the actual sampling was random, which is the default assumption for every procedure I can imagine. Listen carefully: hoping to get lucky is a very poor basis for an analysis plan.
- Data quality needs to be checked and data cleaning done. Unless you are a complete novice or a complete moron, you are going to start by doing things like checking for out-of-range values, reasonable means and standard deviations, the expected distribution, just to make sure that each variable you are going to be using in your analysis behaves the way it should. If there are no diseases recorded on Sundays in some zip codes, I doubt it’s because people don’t get sick those days but rather because the labs are closed. This may make a difference in the incidence of a disease if people who get sick on weekends go to another area to be treated. You need to understand your data. It’s that simple.
- Somewhere in there is the knowledge of analytic procedures part. For example, I wanted to do a factor analysis (just trust me, I had a good reason). Well, because of the way this particular dataset was put together, no one had complete data, so I couldn’t do PROC FACTOR. I thought possibly that I could use PROC CALIS with METHOD=FIML, that is, the full information maximum likelihood method. I went to the SAS documentation and flipped through the first several pages, glanced at some model by Bentler illustrating indicators and latent constructs, said to myself, “Blah blah, I know what that is”, skipped over the stuff on various types of covariance matrices, etc. I can guarantee you that the first time I looked at this (which was a long time ago), I did not flip through it. I probably spent 15 minutes looking at just that model, making sure that I knew what F1 was, what v1, v2, etc. were and so on. That is one advantage of being old: you’ve seen a lot of this stuff and puzzled it out before.
- You need to actually have the software required. Although there is supposedly a version of SAS out there that allows the use of FIML for PROC CALIS, none of the clients I have at the moment have the latest version of SAS, so I don’t have access to it. I believe FIML is only available in SAS 9.22 or later. You might think having the software available would be a benefit of a hackathon, but the ones I have seen advertised all say to bring your own laptop, so I am assuming not.
- All of this is before knowledge of the actual content area, be it contributions to charity, earthquake severity, mathematics achievement or whatever. I assume there has been research done previously on the topic and it might be helpful to have that knowledge. I will guess that the organizers of the hackathon are assuming whoever attends will already have #5, 6 and 7, here. This is again a bit puzzling to me because the combination of that knowledge took me years of experience and I don’t think I am particularly slow-witted. It’s like all the job announcements you see that want someone twenty-five years old with at least ten years of experience. The hackathons seem to be aimed at young people. (The advertised free coffee and all the fast food you can eat is not really much of an enticement for most people in their fifties. If you can’t afford to buy your own coffee and fast food by that age, I think you took a wrong turn somewhere and perhaps you should hang up that DATA ROCK STAR t-shirt, buy a suit and go get a job.)
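To make the coding and item-level decisions above a little more concrete, here is a minimal Python sketch of one way to score a single test item. Everything specific in it is an assumption for illustration only: I am pretending that C is the keyed answer, and that 97, 98 and 99 mean "not administered", "not reached" and "left blank" respectively. Your codebook may say something entirely different, which is rather the point.

```python
# Hypothetical missing-value codes. The real meanings must come from the
# dataset's documentation; these three assignments are assumptions.
NOT_ADMINISTERED = "97"
NOT_REACHED = "98"
OMITTED = "99"


def score_item(response, key):
    """Return 1 (right), 0 (wrong) or None (missing) for one response."""
    if response == NOT_ADMINISTERED:
        # The student never saw the item, so there is nothing to score.
        return None
    if response in (NOT_REACHED, OMITTED):
        # Blank or not reached is counted as wrong, the rule described above.
        return 0
    return 1 if response == key else 0


def percent_correct(responses, key):
    """Percent correct among students who were administered the item."""
    scores = [score_item(r, key) for r in responses]
    scored = [s for s in scores if s is not None]
    if not scored:
        return None  # nobody was administered this item
    return 100.0 * sum(scored) / len(scored)


# Six made-up students answering the "What is 3 cubed?" item.
responses = ["C", "A", "97", "99", "C", "98"]
print(percent_correct(responses, key="C"))
```

Change the rule for code 98 from wrong to missing and the percent correct for those same six students changes with it. Something is going to happen with those items either way, so it may as well be something you chose.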
So, I don’t get it. I don’t get what you are hoping to pull together in 48 hours, with little sleep, little time, and often limited knowledge of the data. PERHAPS the idea is that if enough people try enough different random things, we will just get lucky?
Luck is not a research strategy.
Oh, and there’s this little thing called Type I error.
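For anyone who wants the arithmetic behind that last remark, here is a rough Python sketch. It assumes every test is independent and run at the same alpha level, which real analyses of a single dataset rarely are, so treat it as an illustration and not a correction procedure. The function name is my own invention.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent
    tests, each run at significance level alpha, when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** n_tests


# How quickly "trying enough random things" produces a spurious finding.
for n_tests in (1, 10, 50, 100):
    rate = familywise_error_rate(0.05, n_tests)
    print(n_tests, "tests:", round(rate, 3))
```

With only ten tests at the usual .05 level, the chance of at least one "significant" result from pure noise is already about 40 percent.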