Today, taking a break from writing the grant proposal that has no end, I found myself thinking about easy ways to explain and understand standard error.
To understand standard error, you have to have some statistic that you’re discussing the standard error of. As a random example, let’s just take the mean.
T-TEST PROCEDURE FOR TESTING FOR DIFFERENCE BETWEEN MEANS
Let’s just say we have a sports organization that is interested in knowing whether there is a significant difference between the numbers of competitors in the male and female divisions each year. Since Title IX passed and we are supposed to be all equal in sports, there should be no significant difference, right?
To test for the difference in means between two groups, we compute a t-test. A t-test done with SAS will give four tables of results. The first one is shown below.
First PROC TTEST Table
sex           N       Mean    Std Dev    Std Err    Minimum    Maximum
Female       22    97.0909    15.8201     3.3729    66.0000      119.0
Male         22      222.8    45.2253     9.6421      146.0      324.0
Diff (1-2)          -125.7    33.8792    10.2150
There were 22 records for males and 22 for females. The mean number of competitors each year was 97 for females, with a standard deviation of 15.8 and a range from 66 to 119. For males, the mean number of competitors was almost 223 per year, with a standard deviation of 45 and a range from 146 to 324.
What exactly is a standard deviation? Roughly speaking, a standard deviation is the average amount by which observations differ from the mean. So, if you pulled out a year at random, you wouldn’t expect it to necessarily have exactly 222.8 male competitors. In fact, it would be pretty tough on that last guy who was the .8! On the other hand, you would be surprised if that year there were only 150 competitors, or if there were 320. On the average, a year will be within 45 competitors of the 223 male athletes and, if the yearly counts are roughly normally distributed, about 95% of the years will be within two standard deviations, or from 133 to 313. That is, 223 – (2 x 45) to 223 + (2 x 45).
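The two-standard-deviation range is simple arithmetic. Here is a minimal sketch in Python (used here only for illustration; the analysis itself was done in SAS). The mean and standard deviation are copied from the Male row of the first table, and the variable names are my own. Because it uses the unrounded table values instead of 223 and 45, the endpoints come out a fraction of a competitor different from the rounded figures in the text.

```python
# Values copied from the Male row of the first PROC TTEST table.
mean_male = 222.8
sd_male = 45.2253

# Rule of thumb: about 95% of observations fall within two standard
# deviations of the mean, ASSUMING the yearly counts are roughly normal.
low = mean_male - 2 * sd_male
high = mean_male + 2 * sd_male

print(f"About 95% of years: {low:.1f} to {high:.1f} male competitors")
```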
WHAT IS THE STANDARD ERROR AND WHAT DETERMINES IT?
But what is the standard error? The standard error is the average amount by which we can expect our sample mean to differ from the population mean. If we take a different sample of years, say, 1988-2009, 1991-2012, all odd-numbered years for the last 30 years and so on, each time we’ll get a different mean. It won’t be exactly 97.09 for women and 222.8 for men. Each time, there will be some error in estimating the true population value. Sometimes we’ll have a higher value than the real mean. Sometimes we’ll underestimate it.
On the average, our error in estimate will be 9.64 for men, 3.37 for women.
Why is the standard error for women lower? Because the standard deviation is lower.
The standard error of the mean is the standard deviation divided by the square root of N, where N is your sample size. The square root of 22 is 4.69. If you divide 15.82 by 4.69, you get 3.37.
Why the N matters seems somewhat obvious. If you had sampled several hundred thousand tournaments, assuming you did an unbiased sample, you would expect to get a mean pretty close to the true population mean. If you sampled two tournaments, you wouldn’t be surprised if your mean was pretty far off. We all know this. We walk around with a fairly intuitive understanding of error. If a teacher gives a final exam with only one or two questions, students complain, and rightly so. With such a small sample of items, it’s likely that there is a large amount of error in the teacher’s estimate of the true mean number of items the student could answer correctly. If we hear a survey found that children of mothers who ate tofu during pregnancy scored .5 points higher on a standard mathematics achievement test, and then find out that this was based on a sample of only ten people, we are skeptical about the results.
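To check the formula against the SAS output, here is a small Python sketch (Python is my choice for illustration, and the function name standard_error is hypothetical). The standard deviations and N are copied from the first PROC TTEST table.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: standard deviation / sqrt(sample size)."""
    return sd / math.sqrt(n)

# Standard deviations and N copied from the first PROC TTEST table.
se_female = standard_error(15.8201, 22)
se_male = standard_error(45.2253, 22)

print(f"Female: {se_female:.4f}")  # compare with Std Err 3.3729 in the table
print(f"Male:   {se_male:.4f}")    # compare with Std Err 9.6421 in the table
```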
What about the standard deviation? Why does that matter? The smaller the variation in the population, the smaller error there is going to be in our estimate of the means. Let’s go back to our sample of mothers eating tofu during pregnancy. Let’s say that we found that children of those mothers had .5 more heads. So, the average child is born with one head, but half of these ten mothers had babies with two heads, bringing their mean number of heads to 1.5. I’ll bet if that was a true study, it would be enough for you never to eat tofu again. There is very, very little variation in the number of heads per baby, so even with a very small N, you’d expect a small standard error in estimating the mean.
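Both effects, sample size and variability, can be seen in a quick simulation. The sketch below is in Python and rests on assumptions of my own: a made-up normal population with a mean of 100, a fixed random seed, and a hypothetical helper named sd_of_sample_means. It draws many samples, takes the mean of each one, and compares the spread of those means with the standard deviation divided by the square root of N.

```python
import random
import statistics

def sd_of_sample_means(pop_sd, n, trials=2000, pop_mean=100.0, seed=1):
    """Draw `trials` samples of size n from a normal population
    (hypothetical: mean pop_mean, SD pop_sd) and return the standard
    deviation of the resulting sample means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(pop_mean, pop_sd) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# Small SD vs. large SD at the same N, then large SD with a bigger N.
for pop_sd, n in [(15, 22), (45, 22), (45, 200)]:
    simulated = sd_of_sample_means(pop_sd, n)
    formula = pop_sd / n ** 0.5
    print(f"SD={pop_sd}, N={n}: simulated {simulated:.2f}, formula {formula:.2f}")
```

The simulated spread of the sample means should land close to the formula value in each case: smaller when the population standard deviation is smaller, and smaller again when N is larger.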
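The second table produced by the TTEST procedure is shown in Table 2 below. Here we have an application of our standard error. We see that the mean for females is 97, with 95% confidence limits (CL) from 90.08 to 104.1. Those 95% limits are roughly the mean minus two times the standard error and the mean plus two times the standard error. That is, 97.09 – (2 x 3.37) to 97.09 + (2 x 3.37), or about 90.35 to 103.83. SAS gets slightly wider limits because, with only 22 observations, it uses the multiplier from the t distribution with 21 degrees of freedom (about 2.08) rather than exactly 2.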
Why does that sound familiar? Perhaps because it is the same two-standard-deviation rule we discussed 10 seconds ago? Yes, errors in estimation follow a (roughly) normal distribution. Errors in estimation should be equally likely to occur above the population mean or below it. We would not expect very large errors to occur very often. In fact, about 95% of the time, our sample mean should be within two standard errors of the population mean.
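Here is a hedged Python sketch of the confidence limits for the female mean, once with the "two standard errors" rule of thumb and once with a t multiplier. The mean and standard error come from the SAS output. The multiplier 2.0796 is the two-sided 95% critical value of the t distribution with 21 degrees of freedom, which I have typed in from a t table rather than computed, and the function name confidence_limits is my own.

```python
def confidence_limits(mean, se, multiplier):
    """Return (lower, upper) limits: mean -/+ multiplier * standard error."""
    return mean - multiplier * se, mean + multiplier * se

# Mean and standard error for females, copied from the SAS output.
mean_f, se_f = 97.0909, 3.3729

# Rule of thumb: two standard errors either side of the mean.
approx = confidence_limits(mean_f, se_f, 2.0)

# ASSUMPTION: 2.0796 is the 95% two-sided t critical value for
# N - 1 = 21 degrees of freedom, taken from a t table.
T_CRIT_21 = 2.0796
exact = confidence_limits(mean_f, se_f, T_CRIT_21)

print(f"Rule of thumb: {approx[0]:.2f} to {approx[1]:.2f}")
print(f"t multiplier:  {exact[0]:.2f} to {exact[1]:.2f}")
```

The second pair of limits should agree with the 95% CL Mean columns for females in the second PROC TTEST table.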
Second PROC TTEST Table
sex          Method             Mean       95% CL Mean       Std Dev      95% CL Std Dev
Female                       97.0909    90.0766    104.1     15.8201    12.1713    22.6080
Male                           222.8      202.7    242.8     45.2253    34.7941    64.6299
Diff (1-2)   Pooled           -125.7     -146.3   -105.1     33.8792    27.9348    43.0609
Diff (1-2)   Satterthwaite    -125.7     -146.7   -104.7
The last two lines of this table both say Diff (1-2) and both show the difference between the two means is -125.7. That is, if you subtract the mean number of male competitors from the mean number of female competitors, you get negative 125.7. So, there is a difference of 125.7 between the two means. Is that statistically significant? How often would a difference this large occur by chance? To answer this question we look at the next table. It gives us two answers. The first line (Pooled) is used when the variances are equal. If the variances are unequal, we would use the statistics shown on the second line (Satterthwaite). In this instance, both give us the same conclusion, that is, the probability of finding a difference between means this large if the population values were equal is less than 1 in 10,000. That is the value you see under Pr > |t|, the probability of a greater absolute value of t. If you were writing this up in a report, you would say,
“There were, on the average, 126 fewer female competitors each year than males. This difference was statistically significant (t = -12.30, p < .0001).”
Third PROC TTEST Table
Method           Variances        DF    t Value    Pr > |t|
Pooled           Equal            42     -12.30      <.0001
Satterthwaite    Unequal      26.064     -12.30      <.0001
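For the curious, both lines of this table can be approximated from the summary statistics. The Python sketch below uses the textbook formulas for the pooled t-test and for the Satterthwaite (unequal variance) approximation. The function names are my own, and because it starts from the rounded means printed by SAS, the t value may differ from -12.30 in the second decimal place. It does not compute the p-values, which would need a t distribution from a statistics library.

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """t statistic and degrees of freedom assuming equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se_diff, df

def satterthwaite_t(m1, s1, n1, m2, s2, n2):
    """t statistic and approximate degrees of freedom for unequal variances."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2  # squared standard errors
    se_diff = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return (m1 - m2) / se_diff, df

# Group 1 = Female, group 2 = Male; values copied from the first table.
female = (97.0909, 15.8201, 22)
male = (222.8, 45.2253, 22)

t_pooled, df_pooled = pooled_t(*female, *male)
t_satt, df_satt = satterthwaite_t(*female, *male)

print(f"Pooled:        t = {t_pooled:.1f}, DF = {df_pooled}")
print(f"Satterthwaite: t = {t_satt:.1f}, DF = {df_satt:.3f}")
```

With equal group sizes, as here, the two t values are identical and only the degrees of freedom differ.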
In this case the t-values and probability are the same, but what if they are not? How do we know which of those two methods to use?
This is where our fourth and final table from the TTEST procedure comes into use. This is the test for equality of variances. The test statistic in this case is the F value. We see the probability of a greater F is < .0001. This means that we would get an F-value larger than this less than 1 time in 10,000 if the variances were really equal in the population. Since that is a really small probability, a LOT less than the normal cut-off for statistical significance of p < .05, we would say that there is a statistically significant difference between the variances. That is, they are unequal. We would use the second line (Satterthwaite) in Table 3 above to make our decision about whether or not the differences in means are statistically significant.
Fourth PROC TTEST Table
Equality of Variances
Method      Num DF    Den DF    F Value    Pr > F
Folded F        21        21       8.17    <.0001
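The F value in this table is just the larger variance divided by the smaller one, where a variance is a standard deviation squared. Here is a minimal Python sketch under that definition. The function name folded_f is hypothetical, the standard deviations come from the first table, and the p-value is not computed because that would need an F distribution from a statistics library.

```python
def folded_f(s1, n1, s2, n2):
    """Folded F: larger sample variance over the smaller one.
    Returns (F, numerator DF, denominator DF)."""
    var1, var2 = s1 ** 2, s2 ** 2
    if var1 >= var2:
        return var1 / var2, n1 - 1, n2 - 1
    return var2 / var1, n2 - 1, n1 - 1

# Standard deviations and N for females and males, from the first table.
f_value, num_df, den_df = folded_f(15.8201, 22, 45.2253, 22)

print(f"F = {f_value:.2f} with {num_df} and {den_df} degrees of freedom")
```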