Sports equality, t-tests and standard error

Today, taking a break from writing the grant proposal that has no end, I found myself thinking about easy ways to explain and understand standard error.

To understand standard error, you have to have some statistic that you’re discussing the standard error of. As a random example, let’s just take the mean.

Let’s just say we have a sports organization that is interested in knowing whether there is a significant difference between the numbers of competitors in the male and female divisions each year. Since Title IX passed and we are supposed to be all equal in sports, there should be no significant difference, right?

To test for the difference in means between two groups, we compute a t-test. A t-test done with SAS will give four tables of results. The first one is shown below.

Table 1
First PROC TTEST Table
sex            N       Mean    Std Dev    Std Err    Minimum    Maximum
Female        22    97.0909    15.8201     3.3729    66.0000      119.0
Male          22      222.8    45.2253     9.6421      146.0      324.0
Diff (1-2)           -125.7    33.8792    10.2150

There were 22 records for males and 22 for females. The mean number of competitors each year was 97 for females, with a standard deviation of 15.8 and a range from 66 to 119. For males, the mean number of competitors was almost 223 per year, with a standard deviation of 45 and a range from 146 to 324.

What exactly is a standard deviation? Loosely speaking, a standard deviation is the average amount by which observations differ from the mean. So, if you pulled out a year at random, you wouldn’t expect it to necessarily have exactly 222.8 male competitors. In fact, it would be pretty tough on that last guy who was the .8! On the other hand, you would be surprised if that year there were only 150 competitors, or if there were 320. On the average, a year will be within 45 competitors of the 223 male athletes and, if the yearly counts are roughly normally distributed, 95% of the years will be within two standard deviations, or from 133 to 313. That is, 223 – (2 x 45) to 223 + (2 x 45).
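Here is that two-standard-deviation arithmetic as a quick sketch in Python, using the unrounded male mean and standard deviation from Table 1. The "95%" part assumes the yearly counts are roughly normally distributed, which is my assumption and not something the SAS output tells us. The variable names are just for illustration.

```python
from statistics import NormalDist

# Male mean and standard deviation, copied from Table 1.
mean_male = 222.8
sd_male = 45.2253

# Range covering two standard deviations on either side of the mean.
low = mean_male - 2 * sd_male
high = mean_male + 2 * sd_male

# Share of a normal distribution that falls within two standard deviations
# of its mean (assumes normality; this is where the "95%" rule comes from).
share_within_2sd = NormalDist().cdf(2) - NormalDist().cdf(-2)

print(f"Two-SD range: {low:.1f} to {high:.1f}")
print(f"Share of a normal distribution within 2 SD: {share_within_2sd:.3f}")
```

With the unrounded values the range comes out to a little over 132 up to a little over 313, and the normal-curve share is about 95.4%, which is where the "95% within two standard deviations" rule of thumb comes from.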

But what is the standard error? The standard error is the average amount by which we can expect our sample mean to differ from the population mean. If we take a different sample of years, say, 1988-2009, 1991-2012, all odd-numbered years for the last 30 years and so on, each time we’ll get a different mean. It won’t be exactly 97.09 for women and 222.8 for men. Each time, there will be some error in estimating the true population value. Sometimes we’ll have a higher value than the real mean. Sometimes we’ll underestimate it.
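One way to see this is to simulate it. The sketch below invents a population (normally distributed, mean 223, standard deviation 45; these are made-up values chosen to resemble the male division, not the real data), draws many samples of 22 "years" each, and looks at how much the sample means bounce around. The spread of those sample means is the standard error.

```python
import random
import statistics

# Hypothetical population, loosely modeled on the male division.
# These are illustrative assumptions, not the actual tournament data.
POP_MEAN = 223.0
POP_SD = 45.0
SAMPLE_SIZE = 22
N_SAMPLES = 5000

rng = random.Random(1)  # fixed seed so the run is repeatable


def sample_mean(n):
    """Draw n observations from the hypothetical population; return their mean."""
    return statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(n))


# Means from many different samples of the same size.
sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(N_SAMPLES)]

# The standard deviation of those means is the (empirical) standard error.
empirical_se = statistics.stdev(sample_means)

# For comparison: population SD divided by the square root of the sample size.
theoretical_se = POP_SD / SAMPLE_SIZE ** 0.5

print(f"Empirical standard error:   {empirical_se:.2f}")
print(f"Theoretical standard error: {theoretical_se:.2f}")
```

The two numbers should land close to each other, a bit under 10 competitors, which is in the same neighborhood as the 9.64 that SAS reports for men.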

On the average, our error in estimate will be 9.64 for men, 3.37 for women.

Why is the standard error for women lower? Because the standard deviation is lower.

The standard error of the mean is the standard deviation divided by the square root of N, where N is your sample size. The square root of 22 is 4.69. If you divide 15.82 by 4.69, you get 3.37.

Why the N matters seems somewhat obvious. If you had sampled several hundred thousand tournaments, assuming you did an unbiased sample, you would expect to get a mean pretty close to the true population mean. If you sampled two tournaments, you wouldn’t be surprised if your mean was pretty far off. We all know this. We walk around with a fairly intuitive understanding of error. If a teacher gives a final exam with only one or two questions, students complain, and rightly so. With such a small sample of items, it’s likely that there is a large amount of error in the teacher’s estimate of the true mean number of items the student could answer correctly. If we hear a survey found that children of mothers who ate tofu during pregnancy scored .5 points higher on a standard mathematics achievement test, and then find out that this was based on a sample of only ten people, we are skeptical about the results.
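As a quick check, the division can be done for both groups using the standard deviations and N from Table 1. This is only a sketch of the formula above; the function name is mine.

```python
import math


def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation over the square root of N."""
    return std_dev / math.sqrt(n)


# Standard deviations and sample sizes copied from Table 1.
se_female = standard_error(15.8201, 22)
se_male = standard_error(45.2253, 22)

print(f"Female standard error: {se_female:.4f}")
print(f"Male standard error:   {se_male:.4f}")
```

Both results agree with the Std Err column in Table 1 (3.3729 and 9.6421). Since N is the same for both groups, the only reason the female standard error is smaller is the smaller standard deviation.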
What about the standard deviation? Why does that matter? The smaller the variation in the population, the smaller the error is going to be in our estimate of the mean. Let’s go back to our sample of mothers eating tofu during pregnancy. Let’s say that we found that children of those mothers had .5 more heads. So, the average child is born with one head, but half of these ten mothers had babies with two heads, bringing their mean number of heads to 1.5. I’ll bet if that was a true study, it would be enough for you never to eat tofu again. There is very, very little variation in the number of heads per baby, so even with a very small N, you’d expect a small standard error in estimating the mean.

The second table produced by the TTEST procedure is shown in Table 2 below. Here we have an application of our standard error. We see that the mean for females is 97, with 95% confidence limits (CL) from 90.08 to 104.1. Those limits are the mean minus roughly two times the standard error and the mean plus roughly two times the standard error. That is, approximately 97.09 – (2 x 3.37) to 97.09 + (2 x 3.37). (SAS uses a multiplier slightly larger than two, the critical value of t for 21 degrees of freedom, which is why the limits in the table are a little wider than this quick calculation gives.)
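The female limits in Table 2 can be reproduced from the mean and standard error. A minimal sketch, assuming the multiplier is the two-tailed 95% critical value of t with N – 1 = 21 degrees of freedom, about 2.0796. I took that value from a standard t table; it does not appear anywhere in the SAS output.

```python
# Female mean and standard error, copied from Table 1.
mean_female = 97.0909
se_female = 3.3729

# Two-tailed 95% critical value of t for 21 degrees of freedom.
# Assumption: looked up in a standard t table, not printed by SAS.
T_CRIT_DF21 = 2.0796

# Quick rule of thumb: mean plus or minus two standard errors.
rough_low = mean_female - 2 * se_female
rough_high = mean_female + 2 * se_female

# Limits using the t critical value in place of "two".
lower = mean_female - T_CRIT_DF21 * se_female
upper = mean_female + T_CRIT_DF21 * se_female

print(f"Rule of thumb:    {rough_low:.2f} to {rough_high:.2f}")
print(f"Using t critical: {lower:.2f} to {upper:.2f}")
```

The rule of thumb gets you within a fraction of a competitor of the SAS limits; swapping in the t critical value closes the gap.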

Why does that sound familiar? Perhaps because it is the same two-standard-deviations rule we discussed 10 seconds ago, which comes from the normal distribution? Yes, errors in estimation follow an approximately normal distribution. Errors in estimation should be equally likely to occur above the mean or below the mean. We would not expect very large errors to occur very often. In fact, 95% of the time, our sample mean should be within about two standard errors of the population mean.
Table 2
Second PROC TTEST Table

sex           Method              Mean        95% CL Mean         Std Dev       95% CL Std Dev
Female                         97.0909    90.0766      104.1    15.8201    12.1713    22.6080
Male                             222.8      202.7      242.8    45.2253    34.7941    64.6299
Diff (1-2)    Pooled            -125.7     -146.3     -105.1    33.8792    27.9348    43.0609
Diff (1-2)    Satterthwaite     -125.7     -146.7     -104.7

The last two lines of Table 2 both say Diff (1-2) and both show the difference between the two means is -125.7. That is, if you subtract the mean number of male competitors from the mean number of female competitors, you get negative 125.7. So, there is a difference of 125.7 between the two means. Is that statistically significant? How often would a difference this large occur by chance? To answer this question we look at Table 3 below. It gives us two answers. The first method is used when the variances are equal. If the variances are unequal, we would use the statistics shown on the second line. In this instance, both give us the same conclusion, that is, the probability of finding a difference between means this large if the population values were equal is less than 1 in 10,000. That is the value you see under Pr > |t|, the probability of a greater absolute value of t. If you were writing this up in a report, you would say,

“There were, on average, 126 fewer female competitors each year than males. This difference was statistically significant (t = -12.30, p < .0001).”

Table 3
Third PROC TTEST Table
Method           Variances        DF    t Value    Pr > |t|
Pooled           Equal            42     -12.30      <.0001
Satterthwaite    Unequal      26.064     -12.30      <.0001
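The t values and degrees of freedom in Table 3 can be worked out by hand from the summary statistics in Table 1. Here is a sketch using the usual textbook formulas for the pooled and Satterthwaite (unequal variance) versions; I am assuming these match what SAS does internally, and the results do line up with the table. Because the male mean is printed rounded to 222.8, the t value comes out at about -12.31 rather than exactly -12.30. The p-value is left out because computing it needs the t distribution, which is not in Python's standard library.

```python
import math

# Summary statistics copied from Table 1 (the male mean is printed rounded).
n_f, mean_f, sd_f = 22, 97.0909, 15.8201
n_m, mean_m, sd_m = 22, 222.8, 45.2253

diff = mean_f - mean_m  # female minus male, as in Diff (1-2)

# Pooled method: assumes the two population variances are equal.
df_pooled = n_f + n_m - 2
pooled_var = ((n_f - 1) * sd_f**2 + (n_m - 1) * sd_m**2) / df_pooled
pooled_sd = math.sqrt(pooled_var)
se_pooled = pooled_sd * math.sqrt(1 / n_f + 1 / n_m)
t_pooled = diff / se_pooled

# Satterthwaite method: does not assume equal variances.
v_f = sd_f**2 / n_f  # squared standard error of the female mean
v_m = sd_m**2 / n_m  # squared standard error of the male mean
se_satt = math.sqrt(v_f + v_m)
t_satt = diff / se_satt
df_satt = (v_f + v_m) ** 2 / (v_f**2 / (n_f - 1) + v_m**2 / (n_m - 1))

print(f"Pooled:        SD = {pooled_sd:.4f}, SE = {se_pooled:.4f}, "
      f"t = {t_pooled:.2f}, DF = {df_pooled}")
print(f"Satterthwaite: SE = {se_satt:.4f}, t = {t_satt:.2f}, DF = {df_satt:.3f}")
```

Notice that the two standard errors, and so the two t values, are identical. That happens whenever the two groups have the same N; only the degrees of freedom differ between the methods.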

In this case the t-values and probability are the same, but what if they are not? How do we know which of those two methods to use?

This is where our fourth, and final, table from the TTEST procedure comes into use. This is the test for equality of variances. The test statistic in this case is the F value. We see the probability of a greater F is < .0001. This means that we would get an F-value larger than this less than 1 in 10,000 times if the variances were really equal in the population. Since that is a really small probability, a LOT less than the normal cut-off for statistical significance of p < .05, we would say that there is a statistically significant difference between the variances. That is, they are unequal. We would use the second line in Table 3 above to make our decision about whether or not the differences in means are statistically significant.

Table 4
Fourth PROC TTEST Table
Equality of Variances
Method      Num DF    Den DF    F Value    Pr > F
Folded F        21        21       8.17    <.0001
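The F value in Table 4 is simple to check by hand. As far as I know, the "folded" F is just the larger sample variance divided by the smaller one, with each group's N – 1 as the degrees of freedom. The sketch below uses the standard deviations from Table 1 and stops at the F value, since the p-value needs the F distribution.

```python
# Standard deviations and sample sizes copied from Table 1.
sd_female, n_female = 15.8201, 22
sd_male, n_male = 45.2253, 22

# The variance is the standard deviation squared.
var_female = sd_female**2
var_male = sd_male**2

# Folded F: larger variance over smaller variance, so F is always >= 1.
folded_f = max(var_female, var_male) / min(var_female, var_male)

# Degrees of freedom go with whichever group ended up on top and bottom.
if var_male >= var_female:
    num_df, den_df = n_male - 1, n_female - 1
else:
    num_df, den_df = n_female - 1, n_male - 1

print(f"Folded F = {folded_f:.2f} with {num_df} and {den_df} degrees of freedom")
```

In other words, the male variance is a bit more than eight times the female variance, which is why the test comes out so strongly against equal variances.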

  1. “A standard deviation is the average amount by which observations differ from the mean.”

    Is this strictly true? The standard deviation is the square root of the average of the squared distances from the mean, rather than the average distance from the mean. The main reason for using the standard deviation as a measure of deviation seems to be that it is the l^2 norm, which is easier to work with when using theory from linear algebra, which seems to be all over modern statistics. I think.

  2. Standard error has always been something of a mystery to me. After numerous stats classes I am able to use it in formulas when necessary, but I’ve never really understood what it represented. Until now. Thanks for such a clear explanation.

  3. Robby,
    If you mean, is dividing the sum of the absolute values of deviations from the mean by N or (N-1) the same as taking the square root of the sum of the squared deviations divided by N, then no, of course it isn’t.

    What I mean by the word average in this case is the same as when I say my sister is of average height – in other words, she is about the height one would expect. If you met her, you would not be struck by her being a lot taller or a lot shorter than most people who walk by.

    The standard deviation is the amount one would intuitively expect that a random observation would differ from the mean.

    Similarly, the standard error is the amount one would reasonably guess a sample mean (in this case) would differ from the population mean.

    One of my hobby horses is I think most people walk around with a fairly good intuitive knowledge of statistics until we convince them in school that they are too stupid to know what is going on.
