No, this is not a post about politics or life-hacking, although the same title could apply in either case. I am talking about statistical power. People often ask me what the power of a test is, but the problem is that they are asking the wrong question. Power is not a single number. I understand where the confusion can occur.
What is power & how do you get it?
There are two errors people worry about, Type I and Type II. The probability of making a Type I error is set and it is called alpha (α). Alpha is usually set at .05. It is the probability of rejecting a true null hypothesis. Now, what is a null hypothesis? It is a hypothesis of ZERO difference between the means, ZERO relationship between X and Y. A Type I error can occur in the case of ONE number, zero. If the effect is zero and you say it isn’t, you have made a Type I error.
A Type II error is accepting a false null hypothesis. The probability of a Type II error is called beta (β). A Type II error can occur in an infinite number of cases, for any number other than zero. If the effect isn’t zero and you say it is, you have made a Type II error. Power = 1 – β. Depending on what the actual population value is, the power will be different.
Look at it logically. If in an infinite population your experimental group is a million times better than your control group, then, just logically, the probability of you pulling two samples and incorrectly deciding there was no difference is very low. Similarly, if your experimental group performs .01% better than your control group, then although the difference is not zero, a good percentage of the time you will conclude that there is zero difference, which is incorrect statistically, although perhaps not for practical purposes.
Dr. Park, at Indiana University, has a very nice explanation of hypothesis testing and power analysis. He says, assume that we are testing the hypothesis that the mean is 4 when in actuality the mean is 7.
Let’s just say we are hypothesizing that people feed their office guinea pigs hay an average of four times a month. (I had to do something with the office guinea pigs to make them feel part of the team, so I put them in here.)
This variable is normally distributed with a standard deviation of 1. (Just a reminder that the standard deviation OF THE MEAN is the standard error.)
The cutoff for rejecting our hypothesis is 5.96 because, computing a z-score, we get (5.96 – 4) / 1 = 1.96, and at 1.96 p is not less than .05; p = .05. So, 5.96 is the highest number at which we accept the hypothesis that the mean = 4.
This hypothesis is, in fact, wrong. People really feed their office guinea pigs hay a mean of 7 times per month.
We know this because God told us so in a spare moment when he was not busy telling Republicans they needed to become candidates for president.
Given that the true mean is 7, we can compute the z-score for the cutoff:
(5.96 – 7) / 1 = -1.04
We look up 1.04 in a z table because, although we have a direct line to God, we don’t have a calculator with statistical functions, and we find that about 15% of the distribution lies more than 1.04 standard errors below the mean, the same as the area above 1.04. (14.92%, actually, if you’re a precision freak.)
So, this tells us IF we hypothesize the mean is 4 but it really and truly honest-to-God is 7, and IF the standard error is 1, then our power is .85 because 15% of the time we’ll get a sample mean below 5.96 and wrongly accept the hypothesis that the mean is 4.
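For anyone who does have a calculator with statistical functions, here is a minimal sketch of the same arithmetic in Python, using only the standard library. The variable names are mine, and it looks only at the upper cutoff, since the chance of landing below the lower cutoff when the true mean is 7 is essentially nil.

```python
from statistics import NormalDist

null_mean = 4.0   # hypothesized feedings per month
true_mean = 7.0   # the value God told us
std_error = 1.0   # standard deviation of the mean
alpha = 0.05      # two-tailed significance level

z = NormalDist()  # standard normal distribution

# Critical z for a two-tailed test, and the matching cutoff on the raw scale.
z_crit = z.inv_cdf(1 - alpha / 2)        # roughly 1.96
cutoff = null_mean + z_crit * std_error  # roughly 5.96

# Beta: chance the sample mean falls below the cutoff when the true mean is 7.
beta = z.cdf((cutoff - true_mean) / std_error)
power = 1 - beta

print(f"cutoff={cutoff:.2f}  beta={beta:.4f}  power={power:.4f}")
```

Change `true_mean` and rerun it, and you get a different power, which is the whole point of this post.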
So, our power is .85, right?
Well, not so right. It is – IF the standard error is 1 and IF the “true” value is 7 and IF we were doing a z-test. But if we knew the true value, what was the point of doing any tests?
What if the true value is 6? Then z = (5.96 – 6) / 1 = -.04. The percentage of the z-distribution (which is normal) that is greater than -.04 is a little over 50%, so our power is around .50.
Important point number one – power depends on the true value, and you don’t know the true value
This is the first important point to keep in mind …. the power of a test is different based on what the true population value is. But you don’t really know what it is, since God is too busy worrying if people are having gay sex or eating pork to talk to you about guinea pig cuisine.
Generally what people do (if “what people do” means what I do), is enter a number of possible values into software like PROC POWER. So, I enter 6, 6.5, 6.75 and 7 and find that the values for power are .51, .70, .78 and .85. I can say that the power of the test is at least .85 if the true mean number of office guinea pig hay feedings is 7 per month or higher. That is, when the true figure is at least 7 we would reject the false null hypothesis at least 85% of the time. If it was a lot more than 7, we’d reject it a lot more than 85%.
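If you don’t have PROC POWER handy, a rough sketch of the same idea in Python looks like this. The function `z_test_power` is a hypothetical helper of mine, not anything from SAS, and it assumes a two-tailed z-test with a known standard error of 1. Its results can differ from the figures above in the second decimal place, possibly because the software used a t distribution rather than z.

```python
from statistics import NormalDist

def z_test_power(true_mean, null_mean=4.0, std_error=1.0, alpha=0.05):
    """Power of a two-tailed z-test of null_mean when the mean is really true_mean."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    upper = null_mean + z_crit * std_error
    lower = null_mean - z_crit * std_error
    # Distribution of the sample mean under the TRUE value, not the null.
    sampling = NormalDist(mu=true_mean, sigma=std_error)
    # Reject when the sample mean lands above the upper or below the lower cutoff.
    return (1 - sampling.cdf(upper)) + sampling.cdf(lower)

candidates = [6, 6.5, 6.75, 7]
powers = [z_test_power(m) for m in candidates]

for m, p in zip(candidates, powers):
    print(f"true mean {m}: power {p:.2f}")
```

One test, four candidate true values, four different answers for "the" power.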
Important point number two – power depends on the variability
In the example above, I forced the standard error to equal one by assuming my standard deviation was 10 and my sample size was 100. That isn’t very realistic, but I was just going with the example in Dr. Park’s paper. Let’s say instead that the standard deviation is 1, which is more reasonable, and the sample size is 10. Then my standard error is about .32 (1 divided by the square root of 10) and the power is going to be greater than .999.
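Here is a small sketch of how the variability feeds in, again assuming a z-test against a hypothesized mean of 4 when the true mean is 7. The helper `power_from_sd_and_n` is a made-up name for illustration; the only formula it adds is that the standard error is the standard deviation divided by the square root of the sample size.

```python
from math import sqrt
from statistics import NormalDist

def power_from_sd_and_n(sd, n, true_mean=7.0, null_mean=4.0, alpha=0.05):
    """Return (standard error, power) for a z-test, upper rejection region only."""
    std_error = sd / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    cutoff = null_mean + z_crit * std_error
    # Chance the sample mean exceeds the cutoff when the true mean is true_mean.
    power = 1 - NormalDist(true_mean, std_error).cdf(cutoff)
    return std_error, power

se_wide, power_wide = power_from_sd_and_n(sd=10, n=100)
se_tight, power_tight = power_from_sd_and_n(sd=1, n=10)

print(f"sd=10, n=100: standard error {se_wide:.2f}, power {power_wide:.3f}")
print(f"sd=1,  n=10:  standard error {se_tight:.2f}, power {power_tight:.3f}")
```

Same hypothesized mean, same true mean, and the power changes anyway because the standard error changed.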
Important point number three – power depends on the test statistic
The z-test above is a test where you compare the mean to a constant value. Generally, you don’t have a constant value. More likely, you have two groups. Say, we want to know if office guinea pigs get hay as often as home guinea pigs. My hypothesis is that the office guinea pigs will get it seven times a month, because they need more energy to keep up with their official duties, while the home guinea pigs will only get hay four times a month. I select a total sample of 10 with only 5 in each group, because I want an equal number and for some reason it is difficult to locate people who have office guinea pigs. The standard deviation of number of times of hay per month is still 1. When I compute the power of this test, it is .985.
SO …. even if you go with the standard .05 level of significance (level of significance ALSO affects power) and the standard two-tailed tests (whether you have a one or two-tailed test ALSO affects power) and you don’t have to bother about correlations between groups (the correlation between groups in a paired t-test ALSO affects power) you STILL can have a whole bunch of numbers that MAY be the power of the test depending on what the test statistic, variability and hypothesized value are.
The one thing that affects power people usually ask about is sample size
Yes, sample size also affects the power of a test. So, if I only had 4 guinea pigs per group, my power would be .939. If I had 10 guinea pigs in each group, it would be above .999.
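One way to sanity-check the two-group numbers without any power software is to simulate the experiment many times and count how often the t-test rejects. The sketch below is my own illustration, not how PROC POWER works. It assumes normally distributed data with means of 7 and 4 and a standard deviation of 1, and it hard-codes the two-tailed .05 critical t values from a standard t table. Because it is a simulation, expect the estimates to wobble in the third decimal place.

```python
import random

# Two-tailed .05 critical t values copied from a standard t table,
# keyed by degrees of freedom (2 * n_per_group - 2).
T_CRIT = {6: 2.447, 8: 2.306, 18: 2.101}

def mean_and_variance(xs):
    """Sample mean and sample variance (n - 1 in the denominator)."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def simulated_power(n_per_group, mean_a=7.0, mean_b=4.0, sd=1.0,
                    n_sims=20000, seed=1):
    """Share of simulated experiments where the pooled t-test rejects."""
    rng = random.Random(seed)
    t_crit = T_CRIT[2 * n_per_group - 2]
    rejections = 0
    for _ in range(n_sims):
        a = [rng.gauss(mean_a, sd) for _ in range(n_per_group)]
        b = [rng.gauss(mean_b, sd) for _ in range(n_per_group)]
        m_a, v_a = mean_and_variance(a)
        m_b, v_b = mean_and_variance(b)
        # With equal group sizes the pooled variance is the average of the two.
        pooled_var = (v_a + v_b) / 2
        std_error = (pooled_var * 2 / n_per_group) ** 0.5
        if abs((m_a - m_b) / std_error) > t_crit:
            rejections += 1
    return rejections / n_sims

powers = {n: simulated_power(n) for n in (4, 5, 10)}
for n, p in powers.items():
    print(f"{n} per group: estimated power {p:.3f}")
```

The estimates should land close to the figures quoted in this post for 4, 5 and 10 guinea pigs per group.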
However, if you ask me to tell you how many people you need in your sample to have a power of .80, you’re asking the wrong question. The answer depends on how large the effect size is (in these examples, the difference between means), how much variability there is, the specific statistical test you are doing and other factors like whether it is a one- or two-tailed test and correlations between your groups.
The best answer you are going to get from me is that if you have 128 people total in your sample you will have power of AT LEAST .80 IF you are doing an independent t-test, IF there is AT LEAST a half-standard deviation difference between the two groups, AND you are doing a two-tailed test AND your null hypothesis is that there is zero difference between the groups. However, if there is a smaller difference than that, the power will be less. Also, if you are doing a different test, say, a logistic regression, power calculation is more complicated.
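To see how much that answer hangs on the effect size, here is a rough sketch using the common normal-approximation formula for two equal groups, n per group = 2 × ((z for alpha + z for power) / effect size)². The function name `n_per_group` is mine. Because it uses z in place of t, it comes out slightly under the 64 per group (128 total) that an exact t-based calculation gives for a half-standard-deviation difference.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, power=0.80, alpha=0.05):
    """Approximate n per group for a two-tailed, two-group comparison.

    effect_size is the difference between means in standard deviation units.
    Normal approximation only, so treat the result as a ballpark figure.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = z.inv_cdf(power)          # z cutting off the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for d in (0.8, 0.5, 0.2):
    n = n_per_group(d)
    print(f"effect size {d}: about {n} per group, {2 * n} total")
```

Shrink the effect size from .5 to .2 and the required sample grows several times over, which is why "128" by itself is not an answer.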
But I know that you are going to nod knowingly, turn around and walk out the door saying,
“128. Got it. Thanks!”