What Good is Cook’s D?

By AnnMaria De Mars, October 19, 2014

I’ve written here before about visual literacy, and Cook’s D is just my latest example.

Most people intuitively understand that any sample can have outliers: say, an 80-year-old man who is the father of a six-year-old child, or the new college graduate who is making $150,000 a year. We understand that those people may throw off our predictions, and perhaps we want to exclude those outliers from our models.

What if you have multiple variables, though? It’s possible that each individual value may not be very extreme but the combination is. Take this data set below that I totally made up, with mom’s age, dad’s age and child’s age.

Mom   Dad   Child

30    32    6
20    27    5
31    33    8
29    28    6
40    42    20
44    44    21
37    39    14
25    29    7
30    32    6
20    27    5
31    33    8
29    28    6
39    42    19
43    44    20
37    39    13
25    28    6
40    29    15

Look at our last record. The mother has an age of 40, the father an age of 29 and the child an age of 15. None of these is individually an extreme score. They aren’t even the minimum or maximum for any of the variables. There are mothers older (and younger) than 40 and fathers younger (and older) than 29, and 15 isn’t that extreme an age in our sample of children. The COMBINATION, however, of a 40-year-old mother, 29-year-old father and 15-year-old child is an extreme case.
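If you would rather check that than eyeball it, here is a minimal sketch in Python with NumPy. The variable names are made up for illustration, and using the dad-minus-mom age gap as a stand-in for "the combination" is my assumption; it is just one simple way to see it.

```python
import numpy as np

# Made-up ages from the table above: columns are mom, dad, child.
ages = np.array([
    [30, 32, 6], [20, 27, 5], [31, 33, 8], [29, 28, 6],
    [40, 42, 20], [44, 44, 21], [37, 39, 14], [25, 29, 7],
    [30, 32, 6], [20, 27, 5], [31, 33, 8], [29, 28, 6],
    [39, 42, 19], [43, 44, 20], [37, 39, 13], [25, 28, 6],
    [40, 29, 15],
], dtype=float)

last = ages[-1]

# One variable at a time: z-score of the last record on each column.
z_last = (last - ages.mean(axis=0)) / ages.std(axis=0, ddof=1)

# Is the last record strictly between the min and max of every column?
strictly_inside = (ages.min(axis=0) < last) & (last < ages.max(axis=0))

# The combination (assumed illustration): dad's age minus mom's age.
gap = ages[:, 1] - ages[:, 0]

print("z-scores of last record (mom, dad, child):", np.round(z_last, 2))
print("strictly inside the observed range:", strictly_inside)
print("dad - mom for last record:", gap[-1])
print("dad - mom for everyone else:", gap[:-1].min(), "to", gap[:-1].max())
```

Each variable on its own looks unremarkable, but the age gap between the parents in the last record is nothing like the gap in any other row.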

Enter Cook’s distance, a.k.a. Cook’s D, which measures how much the results of a regression change when a single observation is deleted. The larger the distance, the more influential that point is on the results. Take a look at my graph below.
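For anyone who wants to see the arithmetic behind the graph, here is a rough sketch of Cook’s D computed by hand with NumPy. The post does not say which variable was the outcome, so regressing child’s age on mom’s and dad’s ages is an assumption, and `cooks_distance` is a hypothetical helper name rather than a function from any particular package.

```python
import numpy as np

# Made-up ages from the table above: columns are mom, dad, child.
ages = np.array([
    [30, 32, 6], [20, 27, 5], [31, 33, 8], [29, 28, 6],
    [40, 42, 20], [44, 44, 21], [37, 39, 14], [25, 29, 7],
    [30, 32, 6], [20, 27, 5], [31, 33, 8], [29, 28, 6],
    [39, 42, 19], [43, 44, 20], [37, 39, 13], [25, 28, 6],
    [40, 29, 15],
], dtype=float)


def cooks_distance(X, y):
    """Cook's D for every row of an ordinary least squares fit of y on X."""
    design = np.column_stack([np.ones(len(y)), X])  # add the intercept
    n, p = design.shape
    # Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverages h_ii.
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    leverage = np.diag(hat)
    resid = y - hat @ y
    mse = resid @ resid / (n - p)
    # D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2
    return resid**2 / (p * mse) * leverage / (1 - leverage) ** 2


# Assumption: child's age is the outcome, parents' ages are the predictors.
d = cooks_distance(ages[:, :2], ages[:, 2])
most_influential = int(np.argmax(d))

print("Cook's D by observation:", np.round(d, 3))
print("Most influential observation:", most_influential + 1)
```

Under that assumed model, the last record has by far the largest Cook’s D, because it combines high leverage (an unusual pair of parent ages) with a child’s age the rest of the data would not predict.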

It is pretty clear that the last observation is very influential. Now, you might have guessed that if you had thought to look at the data. However, if you had 11 variables and 100 observations, it wouldn’t be so easy to see by looking at the data, and you might be really happy you had Cook around to help you out.

Let’s look at the data re-analyzed without that last observation. Here is what our plot of Cook’s D looks like now.

This gives you a very different picture. While a couple of points are higher than the others, it is certainly not the extreme case we saw before.

In fact, dropping that one point raised our explained variance (R-squared) from 89% to 93%.
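Here is a small sketch of that before-and-after comparison in Python with NumPy. As before, treating child’s age as the outcome and the parents’ ages as predictors is an assumption, since the post does not name the model, and `r_squared` is a hypothetical helper. With that assumption the two values should come out close to the 89% and 93% quoted above.

```python
import numpy as np

# Made-up ages from the table above: columns are mom, dad, child.
ages = np.array([
    [30, 32, 6], [20, 27, 5], [31, 33, 8], [29, 28, 6],
    [40, 42, 20], [44, 44, 21], [37, 39, 14], [25, 29, 7],
    [30, 32, 6], [20, 27, 5], [31, 33, 8], [29, 28, 6],
    [39, 42, 19], [43, 44, 20], [37, 39, 13], [25, 28, 6],
    [40, 29, 15],
], dtype=float)


def r_squared(X, y):
    """Proportion of variance in y explained by an OLS fit on X."""
    design = np.column_stack([np.ones(len(y)), X])  # add the intercept
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    total = y - y.mean()
    return 1 - (resid @ resid) / (total @ total)


# Assumption: child's age is the outcome, parents' ages are the predictors.
r2_all = r_squared(ages[:, :2], ages[:, 2])
r2_trimmed = r_squared(ages[:-1, :2], ages[:-1, 2])  # last record dropped

print(f"R-squared, all 17 records:      {r2_all:.2f}")
print(f"R-squared, last record dropped: {r2_trimmed:.2f}")
```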

So … knowing how to use Cook’s D for regression diagnostics is our latest lesson in visual literacy.

You’re welcome.
