# Why chi-square is expecting the expected value

October 2, 2011

This is one of those things that is obvious after someone points it out to you and you smack your head saying, “Of course! I knew that.”

As I was going through everything I have to say about analyzing categorical data trying to winnow it down to a three-hour workshop for the WUSS conference (Western Users of SAS Software) next week, I wondered how many people ever THOUGHT about probability again once they had finished that chapter or two in their statistics course.

Professors are optimistic when they believe that students forget almost everything they have learned six months after the course. I have found that if you give chapter tests, students forget a lot of what they have learned by the next week. And I don’t blame them. Very seldom have I seen a real effort made in textbooks to draw connections back to what was learned previously. This is why I have a hatred, varying only in degree of venom, for all mathematics textbooks ever written.
So, as a public service, here is what the information you learned about probabilities has to do with expected value.

The probability of two independent events both occurring is the product of their individual probabilities. That is, under the assumption that the probability of event A occurring, P(A), is unrelated to the probability of event B occurring, P(B), the probability of A and B both occurring, which is written as P(A ∩ B) and read as “the probability of the intersection of A and B,” is equal to P(A) * P(B).

Let’s say that whether or not you have your own desk at home (yes or no) as a middle school student is unrelated to gender. Parents are equally likely to provide a desk for  a boy or a girl.

Let’s say we have a population of  7,286 eighth-graders that is almost exactly divided between girls (50.51%) and boys (49.49%).

We also find that, of those 7,286 eighth-graders, 85.08% have their own desk.

Then our EXPECTED frequency for girls having their own desk is 50.51% times 85.08% times 7,286:

.5051 * .8508 * 7286 ≈ 3,131
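If you want to check the arithmetic above yourself, here is a minimal Python sketch. The function name `expected_count` is made up for illustration, and the inputs are the rounded percentages quoted in this post, so the result is only approximate before rounding.

```python
def expected_count(row_proportion, column_proportion, n):
    """Expected cell frequency if the row and column variables are independent."""
    # P(A and B) = P(A) * P(B) under independence; multiply by n to get a count.
    return row_proportion * column_proportion * n


# Figures quoted in the post: 50.51% girls, 85.08% with a desk, 7,286 students.
girls_with_desk = expected_count(0.5051, 0.8508, 7286)
print(round(girls_with_desk))  # 3131
```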

What an amazing coincidence! That is exactly what the expected frequency is in this table.

Remember (and if you never knew, let it be a brand new surprise to you) how the chi-square is calculated: for each cell, take the observed frequency minus the expected frequency, square that difference (hence the name chi-square), and divide by the expected frequency. Then sum those values over all of the cells.

So, the further your observed frequency is from the frequency expected under the assumption the two variables are independent, the larger your chi-square value.
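To see the whole calculation in one place, here is a rough sketch in plain Python. The function `chi_square` and both small tables are invented for illustration; the counts are NOT the eighth-grader data from this post. The expected count for each cell is computed from the row and column totals, which is the same thing as multiplying the two probabilities by the total number of people.

```python
def chi_square(observed):
    """Pearson chi-square for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence:
            # (row total / n) * (column total / n) * n
            expected = row_totals[i] * col_totals[j] / n
            statistic += (obs - expected) ** 2 / expected
    return statistic


# Hypothetical counts, made up for illustration only.
independent = [[40, 10], [40, 10]]  # observed equals expected in every cell
related = [[45, 5], [35, 15]]       # same totals, cells moved away from expected

print(chi_square(independent))  # 0.0
print(chi_square(related))      # 6.25
```

Both tables have the same row and column totals, so the expected counts are identical. Only the distance between observed and expected changes, and the statistic grows with it.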

Why divide by the expected? Well, if your expected value is 10 and your observed value is 20, then 10 more than expected is a lot of difference: it is twice what was expected. On the other hand, if your expected value is 2,000 and your observed value is 2,010, then your observed is actually pretty close to the expected, percentage-wise.
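Those two cases can be compared directly with a small Python sketch. The helper name `cell_contribution` is made up for illustration; it computes what a single cell adds to the chi-square total.

```python
def cell_contribution(observed, expected):
    """One cell's contribution to chi-square: (observed - expected)^2 / expected."""
    return (observed - expected) ** 2 / expected


# Both cells are off by 10, but the expected values are very different.
small_cell = cell_contribution(20, 10)       # 100 / 10
large_cell = cell_contribution(2010, 2000)   # 100 / 2000

print(small_cell)  # 10.0
print(large_cell)  # 0.05
```

The same raw difference of 10 counts for 200 times as much when the expected value is 10 as when it is 2,000.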

How to get some tables….

I was feeling all pointy and clicky today so I produced the SAS table above using SAS Enterprise Guide. Go to the TASKS menu, select DESCRIBE and TABLE ANALYSIS. Under cells be sure to click on expected frequency and cell percentages. (If you are using a screen reader, click here for an html version of the table)

If you want to do the same thing in SPSS, you can use this syntax:

```
CROSSTABS
  /TABLES=ITSEX BY BS4GTH03
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT EXPECTED TOTAL
  /COUNT ROUND CELL.
```

Or, you can go to ANALYZE then DESCRIPTIVE STATISTICS then CROSSTABS then click on CELLS and click the button next to expected.

And now I was feeling guilty because even though we have four desks in the house, two are in my office, one is upstairs and one is in the living room so that anyone who wants to work on the computer while watching TV can. None of them belong to the world’s most spoiled 13-year-old personally.

But .. then I re-read the question and saw that it just asked if there was a study desk or table the student could use. So, we are off the hook. Which is a good thing, too, because her shopping list for today includes:

One Halloween costume

Zero Desk

All of the make-up sold by MAC and Sephora

# Fun studying deaths of old people – or not

October 1, 2011

I am probably going to hell for this … because today I was studying the death rate of older people using the data from Kaiser Permanente available on the Inter-university Consortium for Political and Social Research (ICPSR) website and really having a great time.

First funny thing: after I extracted it and noticed I had a .stc file, I remembered needing to do a PROC CIMPORT or something, but not exactly how to do it. I typed it into Google and the first page that came up was a post I wrote over two years ago!

This is all it takes to read in the file

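```
* Library where the imported data set and format catalog will be stored ;
libname in "C:\Users\AnnMaria\Documents\oldpeople\sasdata" ;
* The transport (.stc) file extracted from the ICPSR download ;
filename readit 'C:\Users\AnnMaria\Documents\oldpeople\ICPSR_04219\DS0002\04219-0002-Data.stc' ;
* Import the contents of the transport file into the IN library ;
proc cimport infile = readit library = in ;
run ;
```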

Check this out

```
NOTE: Proc CIMPORT begins to create/update data set IN.DA4219P2
NOTE: Data set contains 39 variables and 14730 observations.
      Logical record length is 248

NOTE: Proc CIMPORT begins to create/update catalog IN.FORMATS
NOTE: Entry DTHFLAG.FORMAT has been imported.
NOTE: Entry SEX.FORMAT has been imported.
NOTE: Entry DISP.FORMATC has been imported.
NOTE: Total number of entries processed in catalog IN.FORMATS: 3
```

So, SAS has now very nicely created my formats and stored them in a catalog.

Go, .stc files!

Using the formats created

`options fmtsearch = (in) ;`

The “in” is the libref for the folder where the formats were automatically stored, which is the same folder I specified in my LIBNAME statement.

Proc freq awesome options

I cannot believe I have never used the binomial options in PROC FREQ. How did that happen? I decided to overcome this lack in my life today. I decided to test whether your chances of living to the end of the study were 1 out of 3. Check this out ….

```
proc freq data = in.da4219p2 ;
  tables dthflag / binomial (exact equiv p = .333) alpha = .05 ;
run ;
```

The binomial option with p = .333 will produce a test that the population proportion is .333 for the first category. That is “No” for death. A Z-value will be produced, along with probabilities for one-tailed and two-tailed tests. The equiv keyword additionally requests an equivalence test around that proportion.

The exact keyword will produce exact confidence limits for the proportion and, since I have specified alpha = .05, these will be the 95% confidence limits.
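If you are curious what goes into that Z-value, here is a rough Python sketch of the textbook large-sample test of a single proportion. It is not a re-implementation of PROC FREQ: it leaves out the exact confidence limits and the equivalence test, the function name `proportion_z_test` is made up, and the counts in the example (40 out of 100) are hypothetical, not the Kaiser Permanente data.

```python
import math


def proportion_z_test(successes, n, p0):
    """Large-sample z-test of H0: population proportion equals p0."""
    p_hat = successes / n
    # Standard error computed under the null hypothesis.
    se_null = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se_null
    # Upper-tail area of the standard normal beyond |z|.
    one_tailed = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    two_tailed = min(1.0, 2 * one_tailed)
    return p_hat, z, one_tailed, two_tailed


# Hypothetical example: 40 "No" responses out of 100, tested against p0 = .333.
p_hat, z, one_tailed, two_tailed = proportion_z_test(40, 100, 0.333)
print(f"proportion = {p_hat:.3f}, Z = {z:.2f}")
print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
```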

You can see the output here. It is very cool.

Why am I going to hell?

Well, hopefully I am not, but I was having such a good time today and then I remembered something from decades ago. I was a graduate student and I was working for two different professors on two projects at the same time. One was a project interviewing parents of children with disabilities to understand family functioning. The other used a very large government records data set for estimating mortality among people with different disabilities. Because death rates are low, particularly for children, every time a record changed to show another death, we were happy because our sample  size increased, thus giving greater (statistical) power for our tests. We had another death show up in our data and we were quite pleased about that because we were getting close to a large enough sample size for some of the statistical analyses we had in mind.

A few days later, we went to interview a family. The mother came to the door, looking like she’d been crying for a solid week, which she probably had. She said dully,

“I’m sorry I forgot to call and cancel the appointment. Sammy died 12 days ago.”

Then it hit me. That six-year-old in our data who died was not a number. He was her son, Sammy.  (No, his name wasn’t really Sammy, but over twenty years later, I do remember his name.)

So, yeah, all the SAS stuff and the statistics were fun today, but at the end of it I tried to remember that every one of those 9,170 people who died was someone’s husband, wife, mother, father, grandmother or grandfather, and to be respectful of that.
