Last week, a couple of really sharp cookies from JMP were on campus giving a presentation, and their academic program manager, Curt Hinrichs, commented that what is really needed is a course on statistical thinking. I think he is absolutely onto something.
I mentioned in my last post how there is a debate over whether fluid intelligence really declines with age or whether older people are just less inclined to put effort into a pointless task. This happened to me the other day when someone stopped by with a series of very complicated equations with the log of this and the log of that squared. Here is how our conversation went:
“Why do you have years of experience squared as an independent variable? Do you think that average earnings increase up to a certain number of years of experience and then have a negative relationship after that? Why do you think that would happen?”
“Oh, well we just have years of experience squared to check for a quadratic effect.”
“Do we really need to have a reason?”
“I think so. I would.”
“We want to test for the impact of an increase in education. So we would like to get the predicted value for population earnings if everyone who has less than a high school education graduated from high school.”
“That’s easy enough. You have a regression equation. You have a predicted value for people who have 12 years of education. Write some code to replace the predicted value for all those people who have less than 12 years to that value. Get the totals. Compare the two.”
“But we would have to decrease everyone’s years of experience.”
“Because if a person is in school for two more years, they would work for two less years.”
“Not at all. The unemployment rate is high for high school dropouts. Particularly youth. You are assuming that a person who drops out of school in the tenth grade will be employed continuously for the next two years.”
“But you know that is wrong.” (In fact, I looked it up: according to the Bureau of Labor Statistics, the unemployment rate for recent high school dropouts was 32.9%, not 0%.)
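For what it is worth, the substitution I suggested in that conversation takes only a few lines of code. Here is a minimal sketch in Python. The coefficients and the four people are invented purely for illustration; they are not from the visitor's model. It also re-scores each person at 12 years of education rather than pasting in a single predicted value, which is one reasonable reading of what I described, and it leaves years of experience alone.

```python
# Hypothetical regression coefficients, made up for illustration only.
# The quadratic term mirrors the "years of experience squared" variable.
COEF = {"intercept": 5000.0, "educ": 2000.0, "exper": 800.0, "exper_sq": -15.0}


def predict_earnings(educ, exper):
    """Predicted earnings from the (hypothetical) fitted equation."""
    return (COEF["intercept"]
            + COEF["educ"] * educ
            + COEF["exper"] * exper
            + COEF["exper_sq"] * exper ** 2)


def counterfactual_totals(people, min_educ=12):
    """Return (baseline total, total if everyone had at least min_educ years).

    Experience is deliberately left unchanged: the assumption being
    illustrated is that two more years of school does not necessarily
    mean two fewer years of work.
    """
    baseline = sum(predict_earnings(educ, exper) for educ, exper in people)
    raised = sum(predict_earnings(max(educ, min_educ), exper)
                 for educ, exper in people)
    return baseline, raised


# Invented sample: (years of education, years of experience).
people = [(10, 5), (12, 10), (16, 3), (9, 20)]

baseline, raised = counterfactual_totals(people)
print("baseline total:      ", baseline)
print("counterfactual total:", raised)
print("difference:          ", raised - baseline)
```

The point of keeping it this small is that the hard part is not the code. It is deciding which assumptions, such as what happens to experience, you are willing to defend.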
I was not much help to this person in solving the equations because I just could not see a lot of point to it. I try to help anyone that comes to see me, and they usually go away pretty happy. I am also very happy they come because it gives me a chance to steer them in the right direction. Here is another one:
“I am predicting the success of start-up firms but I have eliminated those that ended up making more than $10 million per year, and also those that weren’t in business five years later.”
“Because I am using income as my dependent variable and if I include the companies that made zero or over $10 million my data would be really skewed. Some of them made over $100 million.”
“But don’t you think that most investors really want to be able to predict which companies will make them a LOT of money? And if they apply your equation to predict which companies in which to invest, these won’t even be in your sample.”
“Yes, that is correct.”
“Have you thought about applying your equation to those companies and seeing if it is accurate? Maybe you want your dependent variable to be something different. Perhaps it could be categorical, something like the company failed, it was in business five years later with earnings less than $1,000,000 a year or more than $1,000,000 a year. I don’t know if that is exactly what you want to do, but I do think you need to revise your design.”
“Because investors want to predict which companies will produce very high returns and which will cause them to lose their investment. I am not sure your equations will do that.”
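To make the categorical idea from that conversation concrete, here is a rough sketch in Python. The category labels, the function name, and the six firms are all hypothetical, and treating exactly $1,000,000 as "high" is my own arbitrary choice. The thing to notice is that the failures and the runaway successes stay in the sample instead of being thrown out.

```python
from collections import Counter

# Threshold mentioned in the conversation; how to treat a firm sitting
# exactly on the boundary is an assumption made for this sketch.
HIGH_EARNINGS = 1_000_000


def outcome_category(annual_earnings, in_business_at_5_years):
    """Collapse a badly skewed earnings figure into three outcome groups."""
    if not in_business_at_5_years:
        return "failed"
    if annual_earnings >= HIGH_EARNINGS:
        return "high"
    return "low"


# Invented firms: (annual earnings, still in business five years later).
firms = [
    (0, False),
    (250_000, True),
    (3_000_000, True),
    (120_000_000, True),   # would have been dropped as "over $10 million"
    (900_000, True),
    (0, False),            # would have been dropped as "not in business"
]

counts = Counter(outcome_category(earnings, alive) for earnings, alive in firms)
print(counts)
```

A three-level outcome like this would call for a different kind of model than ordinary regression on income, which is part of what I meant by revising the design.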
And yet another one …
“Does this output look correct to you?”
“Because your R-squared is .99. That means you have explained 99% of the variance.”
“Are you familiar with my field?”
“I am familiar with reality. You have dummy-coded county. That gives you about 3,000 independent variables right there. Altogether, you have about 4,997 independent variables and a sample of 5,000. Also, you have one variable, MALE, that is coded 1 if the subject is male and 0 if not. You have a second variable, FEMALE, that is coded 1 if female and 0 if not.”
“What should I do?”
“Read this article on multicollinearity. Drop the FEMALE variable from your equation. Take a look at your counties and see if there is some way you can subdivide these that make sense in terms of your field, for example, rural persistent poverty counties, urban persistent poverty counties, urban middle income, urban upper income – whatever makes sense in the context of your study. Is there even a reason for having county as a variable? Think about that. Come back next week and we will rewrite your program.”
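Both problems in that conversation can be checked with a little arithmetic. The sketch below, in plain Python, does two things. First, it shows that a design matrix holding an intercept, MALE, and FEMALE has one redundant column, because MALE plus FEMALE always equals the intercept. Second, it applies the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), to the rough figures quoted above. The six subjects and the AGE variable are invented, and the predictor count is only the approximate one from the conversation.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column depends on the earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Invented subjects for illustration.
male = [1, 0, 1, 1, 0, 0]
female = [1 - v for v in male]
age = [23, 35, 41, 29, 52, 38]

# Columns: intercept, MALE, FEMALE, AGE versus intercept, MALE, AGE.
with_both = [[1, m, f, a] for m, f, a in zip(male, female, age)]
without_female = [[1, m, a] for m, a in zip(male, age)]

print("columns:", 4, "rank:", matrix_rank(with_both))
print("columns:", 3, "rank:", matrix_rank(without_female))

# Figures quoted in the conversation: R-squared .99, n = 5,000, p = 4,997.
print("adjusted R-squared:", adjusted_r2(0.99, 5000, 4997))
```

With both sex dummies the rank comes out one short of the number of columns, which is the multicollinearity problem in miniature. The adjusted R-squared for the quoted figures comes out far below zero, which is a strong hint that the .99 says more about the number of variables than about the field.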
I give all of these people credit for sensing something was not quite right and stopping by to talk with me. They come from different universities, all well respected, and they are all very intelligent people. Some of them have been taught mathematics very well. That is a good thing. Their equations are elegant and correct, as far as they go.
Dr. F.N. (Florence Nightingale) David was chair of the statistics department at the University of California, Riverside. She left the university about the time I started my doctoral program there, but I remember one of my professors telling this story.
“_____ did the defense of his dissertation and F.N. David was the outside member on his committee. He gave a very good description of all of the analyses he had done and why his hypotheses were supported. She was really the statistical expert and we all turned to her when the candidate was out of the room and asked her opinion. She took her cigar out of her mouth and said, ‘That young man believes his numbers too much.’”
Maybe it is true that the more things change the more they stay the same.
Somewhere in all of this we maybe have to get back to the idea of science we had when we were ten years old. Of not being so wedded to our ideas but just thinking, wondering about stuff, mucking around and seeing what works.