# Sports equality, t-tests and standard error

Today, taking a break from writing the grant proposal that has no end, I found myself thinking about easy ways to explain and understand standard error.

To understand standard error, you have to have some statistic that you’re discussing the standard error of. As a random example, let’s just take the mean.

**T-TEST PROCEDURE FOR TESTING FOR DIFFERENCE BETWEEN MEANS**

Let’s just say we have a sports organization that is interested in knowing whether there is a significant difference between the numbers of competitors in the male and female divisions each year. Since Title IX passed and we are supposed to be all equal in sports, there should be no significant difference, right?

To test for the difference in means between two groups, we compute a t-test. A t-test done with SAS will give four tables of results. The first one is shown below.

**Table 1**

**First PROC TTEST Table**

| sex | N | Mean | Std Dev | Std Err | Minimum | Maximum |
|---|---|---|---|---|---|---|
| Female | 22 | 97.0909 | 15.8201 | 3.3729 | 66.0000 | 119.0 |
| Male | 22 | 222.8 | 45.2253 | 9.6421 | 146.0 | 324.0 |
| Diff (1-2) | | -125.7 | 33.8792 | 10.2150 | | |

There were 22 records for males and 22 for females. The mean number of competitors each year was 97 for females, with a standard deviation of 15.8 and a range from 66 to 119. For males, the mean number of competitors was almost 223 per year, with a standard deviation of 45 and a range from 146 to 324.

What exactly is a standard deviation? A standard deviation is the average amount by which observations differ from the mean. So, if you pulled out a year at random, you wouldn’t expect it to necessarily have exactly 222.8 male competitors. In fact, it would be pretty tough on that last guy who was the .8! On the other hand, you would be surprised if that year there were only 150 competitors, or if there were 320. On the average, a year will be within 45 competitors of the 223 male athletes and, if the yearly counts are roughly normally distributed, about 95% of the years will be within two standard deviations, or, from 133 to 313. That is, 223 – (2 x 45) to 223 + (2 x 45).
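For anyone who wants to check that arithmetic, here is a minimal Python sketch using the male figures from Table 1. The variable names are mine, and the 95% reading assumes the yearly counts are roughly normal, which the data may or may not support.

```python
# Figures reported in Table 1 for the male division.
mean_male = 222.8
sd_male = 45.2253

# Range covered by two standard deviations either side of the mean.
low = mean_male - 2 * sd_male
high = mean_male + 2 * sd_male

print(f"{low:.1f} to {high:.1f}")  # 132.3 to 313.3
```

With the unrounded standard deviation, the range comes out a shade wider than the rounded 223 and 45 give.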

**WHAT IS THE STANDARD ERROR AND WHAT DETERMINES IT?**

But what is the standard error? The standard error is the average amount by which we can expect our sample mean to differ from the population mean. If we take a different sample of years, say, 1988-2009, 1991-2012, all odd-numbered years for the last 30 years and so on, each time we’ll get a different mean. It won’t be exactly 97.09 for women and 222.8 for men. Each time, there will be some error in estimating the true population value. Sometimes we’ll have a higher value than the real mean. Sometimes we’ll underestimate it.

On the average, our error in estimate will be 9.64 for men, 3.37 for women.
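One way to see what that means is to simulate it. The sketch below is an illustration under loud assumptions: it pretends the male mean and standard deviation from Table 1 are the true population values and that yearly counts are normally distributed, neither of which we actually know. It draws many hypothetical 22-year samples and looks at how much the sample means bounce around.

```python
import random
import statistics

# Assumption for illustration only: treat the male figures from Table 1
# as the true population mean and standard deviation, and treat yearly
# counts as normally distributed.
POP_MEAN = 222.8
POP_SD = 45.2253
YEARS_PER_SAMPLE = 22
N_SAMPLES = 20_000

rng = random.Random(2012)  # fixed seed so the run is repeatable

sample_means = []
for _ in range(N_SAMPLES):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(YEARS_PER_SAMPLE)]
    sample_means.append(statistics.fmean(sample))

# The spread of the sample means is what the standard error estimates.
spread_of_means = statistics.pstdev(sample_means)
print(f"spread of sample means: {spread_of_means:.2f}")
```

The spread of those simulated means should land close to the 9.64 reported as the male standard error.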

Why is the standard error for women lower? Because the standard deviation is lower, and both groups have the same N.

The standard error of the mean is the standard deviation divided by the square root of N, where N is your sample size. The square root of 22 is 4.69. If you divide 15.82 by 4.69, you get 3.37.

Why the N matters seems somewhat obvious. If you had sampled several hundred thousand tournaments, assuming you did an unbiased sample, you would expect to get a mean pretty close to the true population. If you sampled two tournaments, you wouldn’t be surprised if your mean was pretty far off.

We all know this. We walk around with a fairly intuitive understanding of error. If a teacher gives a final exam with only one or two questions, students complain, and rightly so. With such a small sample of items, it’s likely that there is a large amount of error in the teacher’s estimate of the true mean number of items the student could answer correctly. If we hear a survey found that children of mothers who ate tofu during pregnancy scored .5 points higher on a standard mathematics achievement test, and then find out that this was based on a sample of only ten people, we are skeptical about the results.
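The formula for the standard error of the mean is short enough to check by hand, but here it is as a small Python sketch. The function name `standard_error` is my own; the inputs are the standard deviations and N from Table 1.

```python
import math

def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation over sqrt(N)."""
    return std_dev / math.sqrt(n)

se_female = standard_error(15.8201, 22)
se_male = standard_error(45.2253, 22)
print(f"female: {se_female:.4f}, male: {se_male:.4f}")

# Same standard deviation, larger N: the standard error shrinks.
for n in (2, 22, 220, 2200):
    print(n, round(standard_error(15.8201, n), 2))
```

The loop at the end holds the standard deviation fixed and changes only N, which shows how quickly the error shrinks as the sample grows.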

What about the standard deviation? Why does that matter? The smaller the variation in the population, the smaller error there is going to be in our estimate of the means. Let’s go back to our sample of mothers eating tofu during pregnancy. Let’s say that we found that children of those mothers had .5 more heads. So, the average child is born with one head, but half of these ten mothers had babies with two heads, bringing their mean number of heads to 1.5. I’ll bet if that was a true study, it would be enough for you never to eat tofu again. There is very, very little variation in the number of heads per baby, so even with a very small N, you’d expect a small standard error in estimating the mean.

The second table produced by the TTEST procedure is shown in Table 2 below. Here we have an application of our standard error. We see that the mean for females is 97, with 95% confidence limits (CL) from 90.08 to 104.1. That 95% interval runs from roughly the mean minus two times the standard error to the mean plus two times the standard error. That is, about 97.09 – (2 x 3.37) to 97.09 + (2 x 3.37). With a sample of only 22, the exact multiplier is slightly larger than 2, which is why the limits in the table are a little wider.

Why does that sound familiar? Perhaps because it is exactly what we discussed 10 seconds ago about a normal distribution? Yes, errors follow a normal distribution. Errors in estimation should be equally likely to occur above the population mean or below it. We would not expect very large errors to occur very often. In fact, about 95% of the time, our sample mean should be within two standard errors of the population mean.
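Here is that calculation as a rough Python sketch, using the female mean and standard error from Table 1. The multiplier of 2 is the rule of thumb; the value 2.0796 is the 95% critical value of t with 21 degrees of freedom, copied from a standard t table rather than computed, on the assumption that this is what produces the printed limits.

```python
mean_female = 97.0909
se_female = 3.3729

# Rule of thumb: two standard errors either side of the mean.
rough = (mean_female - 2 * se_female, mean_female + 2 * se_female)

# t critical value for 95% confidence with 21 degrees of freedom,
# copied from a standard t table (not computed here).
T_CRIT_21 = 2.0796
exact = (mean_female - T_CRIT_21 * se_female,
         mean_female + T_CRIT_21 * se_female)

print(f"rule of thumb: {rough[0]:.2f} to {rough[1]:.2f}")
print(f"t-based:       {exact[0]:.2f} to {exact[1]:.2f}")
```

The rule of thumb gets you within a fraction of a competitor of the t-based limits, which is close enough for building intuition.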

**Table 2**

**Second PROC TTEST Table**

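| sex | Method | Mean | 95% CL Mean (lower) | 95% CL Mean (upper) | Std Dev | 95% CL Std Dev (lower) | 95% CL Std Dev (upper) |
|---|---|---|---|---|---|---|---|
| Female | | 97.0909 | 90.0766 | 104.1 | 15.8201 | 12.1713 | 22.6080 |
| Male | | 222.8 | 202.7 | 242.8 | 45.2253 | 34.7941 | 64.6299 |
| Diff (1-2) | Pooled | -125.7 | -146.3 | -105.1 | 33.8792 | 27.9348 | 43.0609 |
| Diff (1-2) | Satterthwaite | -125.7 | -146.7 | -104.7 | | | |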

The last two lines of Table 2 both say Diff (1-2) and both show the difference between the two means is -125.7. That is, if you subtract the mean number of male competitors from the mean number of female competitors, you get negative 125.7. So, there is a difference of 125.7 between the two means. Is that statistically significant? How often would a difference this large occur by chance? To answer this question we look at Table 3. It gives us two answers. The first method is used when the variances are equal. If the variances are unequal, we would use the statistics shown on the second line. In this instance, both give us the same conclusion, that is, the probability of finding a difference between means this large if the population values were equal is less than 1 in 10,000. That is the value you see under Pr > |t|, the probability of a greater absolute value of t. If you were writing this up in a report, you would say,

“There were, on the average, 126 fewer female competitors each year than males. This difference was statistically significant (t = -12.30, p < .0001).”

**Table 3**

**Third PROC TTEST Table**

| Method | Variances | DF | t Value | Pr > \|t\| |
|---|---|---|---|---|
| Pooled | Equal | 42 | -12.30 | <.0001 |
| Satterthwaite | Unequal | 26.064 | -12.30 | <.0001 |
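To see where the numbers in Table 3 come from, here is a hedged Python sketch that rebuilds both rows from the summary statistics in Table 1. It uses the textbook pooled-variance and Satterthwaite formulas; I am assuming these match what SAS computes, and the variable names are mine.

```python
import math

# Summary statistics from Table 1.
n_f, mean_f, sd_f = 22, 97.0909, 15.8201
n_m, mean_m, sd_m = 22, 222.8, 45.2253
diff = mean_f - mean_m

# Pooled method: assumes both groups share one variance.
pooled_var = ((n_f - 1) * sd_f**2 + (n_m - 1) * sd_m**2) / (n_f + n_m - 2)
se_pooled = math.sqrt(pooled_var * (1 / n_f + 1 / n_m))
t_pooled = diff / se_pooled
df_pooled = n_f + n_m - 2

# Satterthwaite method: lets each group keep its own variance.
v_f, v_m = sd_f**2 / n_f, sd_m**2 / n_m
se_unequal = math.sqrt(v_f + v_m)
t_unequal = diff / se_unequal
df_unequal = (v_f + v_m) ** 2 / (v_f**2 / (n_f - 1) + v_m**2 / (n_m - 1))

print(f"pooled:        t = {t_pooled:.2f}, df = {df_pooled}")
print(f"Satterthwaite: t = {t_unequal:.2f}, df = {df_unequal:.3f}")
```

Because the two groups are the same size, the pooled and unequal-variance standard errors work out to the same number, which is why both rows show the same t and only the degrees of freedom differ. The t computed from the rounded means in Table 1 can differ from the printed -12.30 in the second decimal place.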

In this case the t-values and probability are the same, but what if they are not? How do we know which of those two methods to use?

This is where our fourth and final table from the TTEST procedure comes into use. This is the test for equality of variances. The test statistic in this case is the F value. We see the probability of a greater F is < .0001. This means that we would get an F-value larger than this fewer than 1 in 10,000 times if the variances were really equal in the population. Since that is a really small probability, and the normal cut-off for statistical significance is p < .05 and .0001 is a LOT less than .05, we would say that there is a statistically significant difference between the variances. That is, they are unequal. We would use the second line in Table 3 above to make our decision about whether or not the differences in means are statistically significant.

**Table 4**

**Fourth PROC TTEST Table**

**Equality of Variances**

| Method | Num DF | Den DF | F Value | Pr > F |
|---|---|---|---|---|
| Folded F | 21 | 21 | 8.17 | <.0001 |
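The F value in Table 4 can be reproduced from the two standard deviations in Table 1. A minimal sketch, assuming the folded F is simply the larger sample variance divided by the smaller, with N - 1 = 21 degrees of freedom for each group:

```python
# Standard deviations from Table 1.
sd_female = 15.8201
sd_male = 45.2253

var_female = sd_female ** 2
var_male = sd_male ** 2

# Folded F: larger variance over smaller, so the ratio is at least 1.
f_value = max(var_female, var_male) / min(var_female, var_male)
print(f"F = {f_value:.2f}")  # F = 8.17
```

In other words, the variance for males is about eight times the variance for females.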

“A standard deviation is the average amount by which observations differ from the mean.”

Is this strictly true? The standard deviation is the square root of the average of the squared distances from the mean, rather than the average distance from the mean. The main reason for using the standard deviation as a measure of deviation seems to be that it is the L2 norm, which is easier to work with when using theory from linear algebra, which seems to be all over modern statistics. I think.

Standard error has always been something of a mystery to me. After numerous stats classes I am able to use it in formulas when necessary, but I’ve never really understood what it represented. Until now. Thanks for such a clear explanation.

Aaw, thank you. You made my day.

Robby,

If you are asking whether dividing the sum of the absolute values of deviations from the mean by N or (N-1) is the same as taking the square root of the sum of the squared deviations divided by N, no, of course it isn’t.

What I mean by the word average in this case is the same as when I say my sister is of average height – in other words, she is about the height one would expect. If you met her, you would not be surprised by her being a lot taller or a lot shorter than most people who walk by.

The standard deviation is the amount one would intuitively expect that a random observation would differ from the mean.

Similarly, the standard error is the amount one would reasonably guess a sample mean (in this case) would differ from the population mean.

One of my hobby horses is that I think most people walk around with a fairly good intuitive knowledge of statistics until we convince them in school that they are too stupid to know what is going on.