Logarithms
February 24, 2009 | 1 Comment
Logistic regression is based on logarithms. Ordinary Least Squares regression and analysis of variance use the actual values as the dependent and independent variables in an equation. Logistic regression does not.
What is a log, anyway?
Let’s start with the very basics. First we learned to add:
5+5+5+5 = 20
After about eight years of age, we realized that was pretty inefficient so we started multiplying:
5 x 4 = 20
We got a few years older, thought, why stop there, and got into exponents:
5 x 5 x 5 x 5 became 5 to the fourth power = 625
Then we get into logarithms, where the log to the base 5 of 625 = 4
Think about this. Really think about it. Go to the Wikipedia page that has a good explanation of logarithms and read that.
Calculate the logs of several numbers to different bases, just for the heck of it. I have noticed that, so often, students skip over topics like logarithms thinking, “I don’t need to know that.”
This is wrong on a whole lot of fronts, just one of those reasons being it is a really bad habit to get into. I don’t know how many reports I read in the newspaper of people losing their homes that included the statement,
“Mr & Mrs John Q Public said that they did not understand the mortgage papers, that they just trusted the real estate agent, the banks or the ad they watched on TV at 2 a.m.”
So, whose fault is that? Understand what you are doing! Start with logarithms. It’s as good a place as anywhere else.
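If you want to take up the suggestion of calculating logs to different bases, here is a minimal Python sketch. The helper name log_base is my own invention for illustration; it uses the change-of-base rule, so floating-point results may be off by a hair in the last decimal place.

```python
import math

def log_base(x, base):
    """Logarithm of x to the given base, using the change-of-base rule."""
    return math.log(x) / math.log(base)

# The chain from the post: addition, multiplication, exponents, logs.
print(5 + 5 + 5 + 5)   # repeated addition
print(5 * 4)           # multiplication does the same job
print(5 ** 4)          # 5 to the fourth power
print(f"{log_base(625, 5):.4f}")  # the log asks: 5 to what power is 625?

# A few more logs to different bases, just for the heck of it.
for x, base in [(8, 2), (1000, 10), (81, 3)]:
    print(f"log base {base} of {x} = {log_base(x, base):.4f}")
```

The log is simply the exponent question asked backwards, which is why the fourth line recovers the 4 from the third.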
Off-task – again
February 13, 2009
I was going to write about the log of odds ratios and explain logarithms. This is a very, very odd fact I have noticed in most of social science – people in doctoral programs are often thrust into statistics courses for which they really don’t have the basic mathematical foundation. This is because if they were really into mathematics they probably would have gone into that in the first place, but they didn’t. They majored in history or liberal studies or something else. They became teachers or social workers or counselors. In doing so, they took the one (count ’em, ONE) required mathematics course to get a college degree. Further dismaying news, the one mathematics course has been changed dramatically since I was in college – now, you can get a degree with a C- in College Algebra – whatever that is. It definitely does not involve logarithms. When I went to college, Algebra was something you were supposed to have had in high school. But I digress. This whole post is a digression so now I have digressed squared.
So, what did I do all day if not write about logarithms?
Downloaded a dataset from the Inter-university Consortium for Political and Social Research, which is a site I truly love. What a great idea! Finished with your data? Upload it to the Internet and let anyone else use it for whatever they can find.
Spent an hour (I am embarrassed to confess this) trying to find the error in my SPSS syntax and could not figure out why the HELL it kept saying “file not found” when I could clearly see it there. Finally, as a last resort, went to the c:\ prompt, listed the files and realized that Windows, designed by Machiavelli, hides part of the file name so that my file was actually named college.txt.txt . AAAGH !
Monir, the travel lady, stopped by and took care of my reservations and registration for SAS Global Forum.
Worked on the PowerPoint for my Enterprise Guide class next week. For once, did not waste time rotating the charts in space to see how my bar chart would look sideways (oh, don’t pretend you never did it!)
Enterprise Guide runs pretty slow on my old computer with only 1 G RAM, and EVERYTHING runs really slow with the size of some of the datasets I have been using. My Mac desktop only has 512M RAM (I know, I am deprived) and I was kind of tired of using my laptop and the Unix server for everything.
Today, Justin a.k.a., our hardware guy, came by and told me he had a new computer for me. Like everywhere else, we are watching the budget but somehow he came up with one. So far, I have installed SPSS 17, run a factor analysis with 191,000 subjects and it ran in less time than it took me to type this sentence. I am very happy. Installing the applications I need took a good bit of time, but it will be well worth it in the end in saved time and aggravation. I decided to try something different, so I installed Seamonkey as my browser and downloaded Open Office and Gimp instead of Microsoft Office and Photoshop.
Speaking of sea monkeys, I thought this would be a good example of how statistics can be applied to everything. Even though I work about a block from the museums and Exposition Park, I have only been there once in the last year. So, Tuesday, I walked over to the science museum shop and bought a sea monkey kit. My idea was that I would have it on my desk, collect data and use it in different statistical analyses.
This could also be an example of creativity in analysis in trying to come up with different variables. My original thought was perhaps I could begin with the number of sea monkeys hatched. Unfortunately, statistics for the day are :
Specks floating around that could possibly be sea monkeys – somewhat less than a zillion. I would give a close approximation as 1,000.
Matter which can be definitely distinguished as sea monkeys- 0.
I probably should re-read that code on break point analysis before the meeting tomorrow. I should finish editing the on-line ethics course. Instead, though, I am sitting here with Jenn watching episodes of Numbers on DVD.
She wanted to know, “Is that really true? Can you tell that someone cheated by random numbers? That doesn’t make sense.”
I told her that it was exactly true. It’s like the old Sherlock Holmes story “Silver Blaze,” with the curious incident of the dog in the night-time – what was significant was what you DIDN’T see. Sometimes, the telling evidence in an evaluation is that relationships don’t exist where they should, because the numbers were just made up and entered in the database, because, after all, who would ever know?
Me.
Baby Steps to Logistic regression
February 5, 2009
Going from the phi coefficient to odds ratios. Remember the numerator for the phi coefficient was (a*d) – (b*c)? Well, the odds ratio is the same two numbers DIVIDED rather than subtracted. You might think it is four numbers, but really it is not. The first number is the product of the diagonal cells, a*d. The second number is the product of the off-diagonal cells, b*c. Let’s take a look at our data again, first in symbolic form and then the actual numbers.
Gender \ Washed Dishes?    NO    YES
Female                      a      b
Male                        c      d
Gender \ Washed Dishes?    NO    YES
Female                     10     90
Male                       75     25
So the odds of a woman doing the dishes are 9:1 , that is for every one woman who doesn’t do the dishes, there are nine who do. The odds of a man doing the dishes are 1:3, that is, for every three men who don’t do the dishes, there is one who does.
Here is our formula for the odds ratio:
odds ratio = (a*d) / (b*c) = (10*25) / (75*90) = 250/6750 = 1/27
The odds of a man doing the dishes (1/3) are one-twenty-seventh the odds of a woman doing them (9/1).
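For anyone who would rather check the arithmetic than take my word for it, here is a small Python sketch. The variable names are mine, and the way I have assigned the four counts to cells (women 10 no and 90 yes, men 75 no and 25 yes) is an assumption read off the odds quoted above. Exact fractions are used so nothing gets lost in rounding.

```python
from fractions import Fraction

# Counts assumed from the odds quoted in the post.
women_no, women_yes = 10, 90
men_no, men_yes = 75, 25

def odds(yes, no):
    """Odds of 'yes' as an exact fraction: yes count over no count."""
    return Fraction(yes, no)

odds_women = odds(women_yes, women_no)   # 9 to 1
odds_men = odds(men_yes, men_no)         # 1 to 3

# The odds ratio computed two ways: as a ratio of the two odds,
# and as the product of the diagonal cells over the off-diagonal cells.
odds_ratio = odds_men / odds_women
cross_product = Fraction(women_no * men_yes, men_no * women_yes)

print(odds_women, odds_men, odds_ratio, cross_product)
```

Both routes land on the same fraction, which is the whole point: the cross-product formula is just the ratio of the two odds with the division already done.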
Tomorrow, I will try to find the time to explain how this is intimately related to logistic regression.
But for now, I am going to go home, and, no doubt, eventually do the dishes.
Phi coefficients, odds ratios and the F-word
February 4, 2009 | 1 Comment
Yes, I am the F-word – a feminist. I was at a faculty meeting this weekend and one of the presenters began by saying, pointing to a colleague in the audience,
“I am sure Dr. Y knows more about this than me.”
Several times in her presentation on analysis of assessment data she would pause and make comments such as,
“Well, I am not very good at statistics, but this is pretty easy to understand.”
I was a bit annoyed at her self-deprecating manner. I wanted to walk up to her and say,
“You understand this perfectly well and I know Dr. Y, who is very smart and competent, but no more so than you.”
Even more annoying was another presenter, also a woman, also very competent, who gave a very good presentation on assessment. Near the end of it, she said,
“You don’t have to use numbers. For those of you who don’t do math, you can put your students in categories as having exceeded criterion, met criterion or failed. You can just put it in bullet points.”
For those of you who don’t do math …. ????
What the hell? This is a university faculty meeting; 99% of the people in the room have graduate degrees and at least three-fourths of them have Ph.D.’s.
Since when has it become acceptable to not be competent, particularly in math??? Would that same presenter have started a sentence with,
“For those of you who can’t read, I have recorded this presentation as a podcast?”
There may be some people who can’t read because they are visually impaired or have a learning disability, but we consider this a disability, not a lifestyle choice.
This particular department is overwhelmingly female, and I could not help but wonder if the same sort of statements would be made in a predominantly male department? In my admittedly non-random and non-representative experience, the answer is, “No.”
So, first of all, for all of you women (and men), who say you aren’t good at math – cut it out! That’s a lot of nonsense that some people are naturally good at math and some aren’t. It’s a lot like swimming. You aren’t born knowing how to swim and, yes, very few people will become Olympic swimmers, but the vast majority of people can learn to dive in a pool and swim a few laps. It just takes time and effort to practice.
Let’s start with the phi coefficient. I blatantly stole this table from the Children’s Mercy Hospital website because I thought it was very well-explained and easy to understand – until I realized that it wasn’t and I only understood it because I already knew exactly how to calculate a phi coefficient. However, not one to let any act of larceny go to waste, I used it anyway.
The formula for Phi is
Phi = (a*d – b*c) / √((a+b)*(c+d)*(a+c)*(b+d))
Notice that Phi compares the product of the diagonal cells (a*d) to the product of the off-diagonal cells (b*c). The denominator is an adjustment that ensures that Phi is always between -1 and +1.
Let me explain this a little better. We have two categorical variables, gender – coded 1 = female, 2 = male, and “Did you eat today?” – coded 0 = no, 1 = yes.
In our table below, you can see that there is zero correlation between gender and if you ate today, as males and females are both equally likely to have had something to eat.
Gender \ Ate today?    NO    YES    TOTAL
Female                 10     90      100
Male                   10     90      100
Total                  20    180      200
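When we subtract (10*90) – (10*90), the numbers are obviously the same, so we get zero. There is zero relationship. In the formula above, a, b, c & d are the numbers in each cell: a and b are the top row, left to right, and c and d are the bottom row, so a and d sit on the diagonal and b and c sit off it.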
So, we have mathematically shown that there is no relationship between gender and whether one eats or not. Let’s try another question, “Did you do the dishes?” This time, we get the following results:
Gender \ Washed Dishes?    NO    YES    TOTAL
Female                     10     90      100
Male                       90     10      100
Total                     100    100      200
Let’s look at the phi coefficient again.
10*10 – 90*90 = 100 – 8100 = -8,000
100*100*100*100 = 100,000,000 and the square root of that is 10,000
So, our phi coefficient is -8,000/10,000 or -.80. That is a pretty high correlation, considering that the coefficient ranges from -1 to +1.0. A negative coefficient means that those who are lower on one variable (1 = female, 2 = male) are more likely to be higher on the other variable (0 = did not do the dishes, 1 = washed dishes).
So, our conclusion is that, while women are no more likely to eat each day than men, they are significantly more likely to do the dishes, and I have data that I just made up to prove it. My daughter, Maria, tells me that any married woman knows that without the need for statistics.
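Here is the same calculation as a short Python sketch, for both tables. The function name phi_coefficient is mine. It takes the four cell counts in reading order and divides the diagonal-minus-off-diagonal difference by the square root of the product of the two row totals and the two column totals, which is the arithmetic worked through above.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table with top row (a, b) and bottom row (c, d)."""
    numerator = a * d - b * c
    # Product of the two row totals and the two column totals.
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

# "Did you eat today?" table: female (10, 90), male (10, 90).
print(phi_coefficient(10, 90, 10, 90))

# "Did you do the dishes?" table: female (10, 90), male (90, 10).
print(phi_coefficient(10, 90, 90, 10))
```

The first table comes out at zero and the second at -0.8, matching the hand calculation.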
Why did I just go into this in such detail, all about one coefficient? Because I think a big part of the reason that many people don’t learn math is that there are so many assumptions that we can “just skip over this”. In fact, the reason I liked the Mercy Hospital site is that it did not start out with (n11n00 – n10n01) / √(n1+ n0+ n+1 n+0)
and assume that everyone knew what marginal distributions and array subscripts meant, because, I can guarantee you, they don’t.
Sheila Tobias wrote a really interesting book about teaching and learning science, the title of which is “They’re not dumb, they’re different”.
Maybe, but I guarantee you that part of the problem is that they’re not clairvoyant. No one was born knowing that n10 means the number in the cell where the row value =1 and the column value = 0. It doesn’t help that at other times that same cell would be represented as n11 as the first row and first column.
If you can make that switch in your mind easily, it is no doubt because you, like me, have looked at thousands of matrices and had that notation explained to you so long ago that it is probably like learning to swim, you can’t even remember it. The secret to being good at math is the same as being good at swimming – practice!
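To make the notation switch concrete, here is a small Python sketch using the made-up dishes data. Everything in it (the dictionary n, the helper cell_by_position) is my own naming for illustration. The same cell can be looked up by the coded values of the two variables or by its row and column position in the table.

```python
# The dishes table, keyed by the coded values (gender, washed_dishes).
# gender: 1 = female, 2 = male; washed_dishes: 0 = no, 1 = yes.
n = {
    (1, 0): 10, (1, 1): 90,
    (2, 0): 90, (2, 1): 10,
}

row_values = [1, 2]   # order of the rows in the table
col_values = [0, 1]   # order of the columns in the table

def cell_by_position(row_pos, col_pos):
    """Look up a cell by its 1-based row and column position."""
    return n[(row_values[row_pos - 1], col_values[col_pos - 1])]

# "n10" by coded values and "n11" by position are the same cell.
print(n[(1, 0)], cell_by_position(1, 1))

# Marginal totals: add across a row, or down a column.
row_total_female = sum(n[(1, c)] for c in col_values)
col_total_yes = sum(n[(r, 1)] for r in row_values)
print(row_total_female, col_total_yes)
```

Nothing about the data changes between the two lookups. Only the labeling convention does, and nobody is born knowing which one a textbook is using.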
Completely random fact – in my misspent youth, I was the first American to win the world championships in judo. If you type judo blog into google, the first of 3,000,000+ pages that comes up is mine. And my most recent judo blog was on outliers and practice. Rather unusual when the two halves of my split personality come together.
As to odds ratios, I have more to say about those, but it is 1:30 a.m. and I have to get up in 7 1/2 hours to go to work, so that will have to wait until another day.