We are looking for data to use as an example of propensity score matching for a couple of upcoming workshops / classes. Since the data I have used previously belonged to other people, I needed to come up with an example that could be stated in a format something like:
Controlling for X, Y and Z, are people who received treatment A more likely to kerfluffle than people who did not receive treatment A? Here kerfluffle is the outcome of interest, that is, your dependent variable.
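For anyone who has not seen the matching step, here is a minimal sketch of one common version: greedy 1:1 nearest-neighbor matching without replacement. It assumes the propensity scores (the estimated probability of receiving treatment A given X, Y and Z) have already been computed, for example with a logistic regression. The function name `match_nearest`, the caliper of 0.05 and all of the scores are made up for illustration; none of this comes from our data.

```python
def match_nearest(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on propensity score.

    treated, controls: dicts mapping a subject id to an estimated
    propensity score. Each control is used at most once.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)  # copy, so the caller's dict is untouched
    pairs = []
    # Go through treated subjects from lowest to highest score
    # so the result is reproducible.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control by absolute score difference.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        # Only accept the match if it is within the caliper.
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Hypothetical scores, purely for illustration.
treated = {"t1": 0.30, "t2": 0.62, "t3": 0.90}
controls = {"c1": 0.28, "c2": 0.33, "c3": 0.60, "c4": 0.10}
pairs = match_nearest(treated, controls)
print(pairs)
```

In this toy example t3 goes unmatched because no remaining control is within the caliper. Once the pairs are formed, you compare the rate of kerfluffle between the matched treated and control subjects.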
Our new intern, Chelsea (yay, Chelsea!) had the brilliant idea of looking at whether drinking whole milk was bad for you (remember high-fat milk being evil before Coke became evil?). Drinking whole milk could be treatment A and being overweight could be the kerfluffle, I mean, outcome, of interest. We could control for other things like exercise and family income – except that when we started looking at this question with a large public data set of surveys of students we found diddly squat in the way of group differences.
Then, I tried it from a different perspective, having listened to The Daily Show enough times to know that soft drinks are the cause of obesity. I was also interested because I thought I was the only person in America who couldn’t stand Coke or Pepsi – tastes like melted sugar. So, I also ran some analyses with two groups: students who drank milk daily and never drank soft drinks, and students who drank soft drinks (non-diet) daily and never drank milk. Still no difference, not in weight, not in computed BMI based on age, weight and height.
Why, you may ask, didn’t we do something like compare people who drank low-fat milk with those who only drank sodas? Here is the thing: if you run enough analyses, the odds are great that you will eventually find significance just by chance. Even when there is no real difference at all, a result that is significant at .05 happens one time out of twenty.
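To put a rough number on "the odds are great", here is a small sketch of the arithmetic. It assumes the tests are independent and that there is truly no difference in any of them, which is a simplification (analyses run on the same data set are usually correlated). The function name is just something I made up for illustration.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests comes out
    significant at level alpha when no real effect exists in any of them."""
    return 1 - (1 - alpha) ** n_tests


# One test, five tests, twenty tests.
for n in (1, 5, 20):
    print(n, round(chance_of_false_positive(n), 3))
```

Under those assumptions, by the time you have run twenty comparisons you are more likely than not to have at least one "significant" result that means nothing.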
We started with reasonable hypotheses and they did not pan out. MY hypothesis is that people lie about their weight. This is supported by the fact that the distribution of reported weight did NOT resemble the population distribution at all. It was only slightly negatively skewed rather than showing a large percentage of obese people.
My conclusion was that people are big fat liars. This is also borne out by many years of coaching judo, a sport where people compete in weight divisions so at the end, they DO have to get on a scale and I know what they weigh.
It may be that whole milk or Coke or Pepsi causes obesity, but we just don’t know because our outcome variable is measured unreliably.
I think our plan B is to use a different data set with old people and use death as the outcome variable. After a year or two, it’s usually pretty easy to determine whether people are dead or not. They’re the buried ones.