I’ve written here before about visual literacy and Cook’s D is just my latest example.
Most people intuitively understand that any sample can have outliers, say, an 80-year-old man who is the father of a six-year-old child, or the new college graduate who is making $150,000 a year. We understand that those people may throw off our predictions and perhaps we want to exclude those outliers from our models.
What if you have multiple variables, though? It’s possible that each individual value may not be very extreme but the combination is. Take this data set below that I totally made up, with mom’s age, dad’s age and child’s age.
Mom Dad Child
30 32 6
20 27 5
31 33 8
29 28 6
40 42 20
44 44 21
37 39 14
25 29 7
30 32 6
20 27 5
31 33 8
29 28 6
39 42 19
43 44 20
37 39 13
25 28 6
40 29 15
Look at our last record. The mother has an age of 40, the father an age of 29 and the child an age of 15. None of these individually are extreme scores. These aren’t even the minimum or maximum for any of the variables. There are mothers older (and younger) than 40, fathers younger (and older) than 29; 15 isn’t that extreme an age in our sample of children. The COMBINATION, however, of a 40-year-old mother, 29-year-old father and 15-year-old child is an extreme case.
Enter Cook’s distance, a.k.a. Cook’s D, which measures the effect of deleting an observation. The larger the distance, the more influential that point is on the results. Take a look at my graph below.
It is pretty clear that the last observation is very influential. Now, you might have guessed that if you had thought to look at the data. However, if you had 11 variables and 100 observations it wouldn’t be so easy to see by looking at the data and you might be really happy you had Cook around to help you out.
Let’s look at the data re-analyzed without that last observation. Here is what our plot of Cook’s D looks like now.
In fact, dropping out that one point changed our explained variance from 89% to 93%.
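For anyone who wants to see the arithmetic behind the plots, here is a minimal sketch in plain Python. It assumes the regression predicts the child's age from mom's and dad's ages, which this post does not actually spell out, and the helper names `invert` and `fit` are made up for this example. Cook's D for each record is computed from its residual and its leverage. Under that assumption, the explained variance should come out close to the figures quoted above.

```python
# Ages from the made-up table above: (mom, dad, child).
rows = [
    (30, 32, 6), (20, 27, 5), (31, 33, 8), (29, 28, 6),
    (40, 42, 20), (44, 44, 21), (37, 39, 14), (25, 29, 7),
    (30, 32, 6), (20, 27, 5), (31, 33, 8), (29, 28, 6),
    (39, 42, 19), (43, 44, 20), (37, 39, 13), (25, 28, 6),
    (40, 29, 15),
]


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit(data):
    """Least squares of child on mom and dad (an assumed model).

    Returns (R squared, list of Cook's D values, one per record).
    """
    X = [[1.0, float(m), float(d)] for m, d, _ in data]
    y = [float(c) for _, _, c in data]
    n, p = len(X), 3  # p counts the intercept plus two slopes
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    # Leverage of each record: h = x' (X'X)^-1 x
    lev = [sum(r[i] * inv[i][j] * r[j] for i in range(p) for j in range(p))
           for r in X]
    sse = sum(e * e for e in resid)
    mse = sse / (n - p)
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)
    # Cook's D = (residual^2 / (p * MSE)) * h / (1 - h)^2
    cooks = [e * e / (p * mse) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]
    return 1 - sse / sst, cooks


r2_all, cooks_all = fit(rows)
r2_trimmed, _ = fit(rows[:-1])  # same model without the last record
most_influential = max(range(len(rows)), key=lambda i: cooks_all[i])
print("record with largest Cook's D:", most_influential + 1)
print("R squared with and without the last record:",
      round(r2_all, 2), round(r2_trimmed, 2))
```

In real work you would let your statistics package do this. The point of the sketch is only that Cook's D combines how badly a record is predicted with how unusual its combination of predictor values is.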
So … knowing how to use Cook’s D for regression diagnostics is our latest lesson in visual literacy.
Since it is the weekend, I decided to blog about weekend stuff. Look for more statistics tomorrow. For most of the past quarter-century, I have been roped into being a volunteer for one organization or another. Here is a very, very partial list:
- American Association on Mental Retardation
- National Council on Family Relations
- United States Judo Federation
- Community Outreach Medical Services
- American Youth Soccer Organization
I’ve been everything from Chair of the Board to chaperone. I’ve spoken at more conferences than I can count, certainly giving a few hundred presentations. I’ve raised hundreds of thousands of dollars.
Given that experience, I’ve concluded that volunteers fall into three broad categories. Recognizing that fact is probably key to having a successful non-profit organization, because for most non-profits, volunteers are essential.
Category 1: People who are very excited to be a volunteer. These individuals derive a lot of their self-esteem from their position in the organization. Their enthusiasm may stem from a genuine passion for the mission of the organization, be it youth sports, individuals with disabilities or health care. Alternatively, the volunteer position may be an exciting departure from a boring day job, an opportunity to use more of their talents. Generally, it is both reasons. They are willing to do a lot of work. They are also willing to put up with authoritarian and unprofessional interactions with the organizations, because they are so enthusiastic and often, they are accustomed to being bossed around and devalued on their “day jobs”. There is a limit to their tolerance, though.
Category 2: People who are not at all excited to volunteer but have skills and talents your organization needs. These individuals are there out of obligation – they have a child on the team, a friend on the staff, or they really care deeply about the mission of the organization. These people do valuable work for the organization like raising money, providing free legal or accounting services. They have very little tolerance for authoritarian and unprofessional interactions with the organizations, because they would rather be somewhere else in the first place and they are accustomed to being the boss or highly valued on their “day jobs”.
Category 3: People who show up and don’t do any real work.
It seems pretty clear to me that organizations need both of the first two categories, and the more the better.
Not everyone sees it that way, obviously. Let me give you just a few examples, and again, this is a very partial list. I have witnessed
- Volunteers told how to dress for an event,
- Week-long required continuing education classes costing several hundred dollars,
- Required training held hundreds of miles away,
- Required drug testing (this was after I had been asked to coach for free at an event and pay my own way. My response was “Are you fucking kidding me?”),
- “Two-hour” meetings running six hours late,
- Volunteers told to “Show up at 7 a.m., don’t be late and be sure you don’t leave early”,
- Volunteers chastised, yelled at, berated by a board member or staff member
Do’s and don’ts
Well, first of all, don’t do any of those things above.
Second, say “Thank you.” A lot.
Think of these individuals just like donors who are giving you thousands of dollars, because they are. It would cost you a lot of money to replace their services. Treat them as you would professionals providing services for you. Would you ask your accountant to take a drug test? Would you tell your attorney to be sure he dressed professionally when he represents you in court? Don’t assume just because someone is working for free that he is a degenerate or an idiot.
It’s funny that most organizations seem to think what volunteers want is an engraved plaque or a certificate printed out from PowerPoint. Really, a little common courtesy goes a long way.
Ever wonder why with goodness of fit tests non-significance is what you want?
Why is it that sometimes when you have a significant p-value it means your hypothesis is correct, there is a relationship between the price of honey and the number of bees, and in other cases, significance means your model is rejected? Well, if you are reading this blog, it’s possible you already know all of this, but I can guarantee you that students who start off in statistics learning that a significant p-value is a good thing often are confused to learn that with model fit statistics, non-significance is (usually) what you want.
You are hoping that you find non-significance when you are looking at model fit statistics because the hypothesis you are testing is that the full model – one that has as many parameters as there are observations – is different than this model you have postulated.
To understand model fit statistics, you should think about three models.
The null model contains only one parameter, the mean. Think of it this way: if all of your explanatory variables are useless then your best prediction for the dependent variable is the mean. If you knew nothing about the next woman likely to walk into the room, your best prediction of her height would be 5’4″, if you live in the U.S., because that is the average height.
The full model has one parameter per observation. With this model, you can predict the data perfectly. Wouldn’t that be great? No, it would be useless. Using the full model is a bad idea because it is non-replicable.
Here is an example data set where I predict IQ using gender, number of M & M’s in your pocket and hair color.
Gender M&M’s Hair IQ
Male 10 redhead 100
Female 0 blonde 70
Male 10 blonde 60
Female 30 brunette 100
IQ = 50 + M&M’s x 1 + female x 20 + redhead x 40
Is that replicable at all? If you selected another random sample of 4 people from the population do you think you could predict their scores perfectly using this equation?
Also, I do not know why that woman has so many M & M’s in her pocket.
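To make the point concrete, here is a small sketch that plugs the four made-up people into that four-parameter equation. The function name `full_model` and the fifth person are inventions for illustration only; nothing about that fifth person comes from real data.

```python
# The four made-up people: (gender, M&M's in pocket, hair color, IQ).
people = [
    ("male", 10, "redhead", 100),
    ("female", 0, "blonde", 70),
    ("male", 10, "blonde", 60),
    ("female", 30, "brunette", 100),
]


def full_model(gender, mms, hair):
    """IQ = 50 + 1 x M&M's + 20 x female + 40 x redhead (four parameters)."""
    return 50 + 1 * mms + 20 * (gender == "female") + 40 * (hair == "redhead")


# Four parameters, four people: every residual is exactly zero.
residuals = [iq - full_model(g, m, h) for g, m, h, iq in people]
print(residuals)

# A hypothetical fifth person, invented here purely for illustration.
newcomer = ("male", 25, "brunette", 110)
print(newcomer[3] - full_model(*newcomer[:3]))
```

A perfect fit on the four people it was built from tells you nothing about how the equation does on anyone else, which is the whole problem with the full model.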
In between these two useless models is your model. The hypothesis you are testing is that your model, whatever it is, is non-significantly different from the full model. If you throw out one of your parameters, your new model won’t be as good as the full model – that one extra parameter may explain one case – but the question is, does the model without that parameter differ significantly from the full model? If it doesn’t, then we can conclude that the parameters we have excluded from the model were unimportant.
We have a more parsimonious model and we are happy.
But WHY do more parsimonious models make us happy? Well, because that is kind of the whole point of model building. If you need a parameter for each person, why not just examine each person individually? The whole point of a model is dimension reduction, that is, reducing the number of dimensions you need to measure while still adequately explaining the data.
If, instead of needing 2,000 parameters to explain the data gathered from 2,000 people you can do just as well with 47 parameters, then you would have made some strides forward in understanding how the world works.
Coincidentally, I discussed dimension reduction on this blog almost exactly a year ago, in a post with the title “What’s all that factor analysis crap mean, anyway?”
(Prediction: At least one person who follows this link will be surprised at the title of the post.)
I was reading a book on PHP just to get some ideas for a re-design I’m doing for a client, when I thought of this.
Although I think of PHP as something you use to put stuff into a database and take it out – data entry of client records, reports of total sales – it is possible to use without any SQL intervention.
You can enter data in a form, call another file and use the data entered to determine what you show in that file. The basic examples used to teach are trivial – a page asks what’s your name and the next page that pops up says, “Howdy” + your name.
Generally, when a player answers a question, the answer is written to our database and the next step depends on whether the answer was correct. Correct, go on with the game. Incorrect, pick one of these choices to study the concept you missed. Because schools use our games, they want this database setup so they can get reports by student, classroom, grade or school.
I could also use PHP.
In the case where we drop out the database interface altogether, is there a benefit to keeping PHP? I couldn’t think of one.
Still thinking about this question.
I’ve been looking high and low for a supplemental text for a course on multivariate statistics and I found this one –
The Multivariate Social Scientist, by Graeme Hutcheson & Nick Sofroniou
They are big proponents of generalized linear models, in fact, the subtitle is “Introductory statistics using generalized linear models”, so if you don’t like generalized linear models, you won’t like this book.
I liked this book a lot. Because this is a random blog, here is day one of my random notes.
A generalized linear model has three components:
- The random component is the probability distribution assumed to underlie the response variable. (y)
- The systematic component is the fixed structure of the explanatory variables, usually linear. (x1, x2 … xn)
- The link function maps the systematic component on to the random component.
The systematic component takes the form
η = α + β1x1 + β2x2 + … + βnxn
They use η to designate the predicted variable instead of y-hat. I know you were dying to know that.
Obviously, since that IS a multiple regression equation (which could also be used for ANOVA), when you have linear regression, the link function is actually identity. With logistic regression, it is the logit function, which maps the log odds of the random component on to the systematic one.
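Here is a rough sketch of those pieces in Python. The coefficients and predictor values are hypothetical numbers picked only so the arithmetic is easy to follow, and the function names are mine, not the book's.

```python
import math


def linear_predictor(alpha, betas, xs):
    """Systematic component: eta = alpha + beta1*x1 + ... + betan*xn."""
    return alpha + sum(b * x for b, x in zip(betas, xs))


def identity_inverse(eta):
    """Identity link (linear regression): the expected value is eta itself."""
    return eta


def logit(p):
    """Logit link (logistic regression): the log odds of a probability p."""
    return math.log(p / (1 - p))


def logit_inverse(eta):
    """Turns a log-odds value back into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-eta))


# Hypothetical coefficients and predictor values, for illustration only.
eta = linear_predictor(alpha=-1.0, betas=[0.5, 0.25], xs=[2.0, 4.0])

print("linear regression prediction:", identity_inverse(eta))
print("logistic regression probability:", round(logit_inverse(eta), 3))
```

The same η feeds both models. Only the link changes how it is mapped onto the response.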
The reason I think this is such a good book for students taking a multivariate statistics course is that it relates to what they should know. They certainly should be familiar with multiple regression and logistic regression, and understand that the log of the odds is used in the latter.
The book also discusses the log link used in loglinear analyses, which I don’t necessarily assume every student will have used. I don’t say that as a criticism, merely an observation.
I find it weird when I make people nervous. I’ve had people shake and stutter so much that I thought they had some sort of disability, only to find out later that it was a reaction to meeting me!
My family and friends say I’m intimidating, which I also find bizarre. I am, literally, a little old grandma.
Are you kidding? You’re just amazing! Do you think we forget that you were the FIRST American to win a world judo championship, have a Ph.D., published a book last year, started a company that made a million dollars in less than two years, then started another company to make games, came out with your first game this year, published scientific articles? Oh, and you raised four successful kids, one of whom is also a world champion and making movies.
He went on to an embarrassing degree about a lot more stuff. I’m not one of those fake humble-brag people, like the super-models who claim to be “so fat”.
It’s just that …. it’s not like that when it’s happening. Even to me, if I stopped and piled it all up like that, it sounds impressive, but day to day, it’s not really like that at all.
Whether it’s winning a world championship, earning a Ph.D., building a company or making computer games, it’s not amazing when you’re in the middle of it.
For example, I spent the last week fixing up our next game, Fish Lake. I improved the graphics, added gravity so that when a character walks off a hill, it falls down instead of walking around on the air. I added artificial intelligence to make the animals run around at random instead of just stand there. I modified the css so that the input boxes for the math problems stand out better. All of those are minor fixes in the grand scheme of things. The purpose of the game I was working on is to teach fractions, which are a super important part of understanding math, but if it’s not a fun game, kids aren’t going to play it.
Tomorrow, my day starts with reviewing the quizzes one person wrote, followed by reviewing a PowerPoint and video clip someone else wrote to teach about reading graphs and then testing some software for podcasts.
Hopefully, enough days like this piled on top of one another and we’ll have an amazing game.
It’s just like in my judo competition time, when I trained three times a day, every day. Looking back, winning a gold medal and being best in the world was amazing.
In the middle of it, though, it’s just getting up and working hard all day. Repeat a few thousand times.
You might have gotten the misimpression from my previous post, where I said I don’t think students need to learn all that much matrix algebra, that I am a slacker as far as expecting students to come to courses with some prior knowledge. That’s not exactly the case. In fact, here are some things I just assume students coming into a multivariate statistics course should know. Even though some textbooks begin with these, all I can say is that if you have had three statistics courses and you still don’t know what a covariance is, I think something has gone awry in your education.
- Know the equation to compute variance – it’s pretty darn basic – and have a really good understanding of interpreting variance, like what 0 variance means, the statistical and practical interpretation of explained variance. I personally view science as the search for explained variance.
- REALLY understand covariance – that is, know how it is calculated, that it is a measure of linear relationship and that independence implies a covariance of 0 but a covariance of 0 does not always signify independence.
- Be able to interpret a correlation.
- Have a basic grasp of the Central Limit Theorem and the difference between population values and sample statistics.
- Understand what a chi-square is, how you get it and how you interpret it
- Remember the definition and interpretation of an F-test
- Understand the difference between statistical significance and effect size
- Know what a null hypothesis test is
- Realize that before you do ANYTHING with data, if you don’t check the data coding and quality you are an idiot. You should have some understanding of how to read a codebook and be able to compute a frequency distribution, descriptive statistics and data description (like a PROC CONTENTS with SAS). When I look at the scant attention many so-called researchers pay to issues like missing data, miscoded data and non-random sampling, I am surprised we’re ever able to replicate anything.
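As a quick refresher on the first few items, here is a sketch of variance, covariance and correlation written out by hand. The data are toy numbers I made up for this purpose. The `x` and `y` pair shows a covariance of 0 between two variables that are clearly not independent, since `y` is computed directly from `x`.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def covariance(xs, ys):
    """Sample covariance: cross-products of deviations, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance rescaled by the two standard deviations."""
    return covariance(xs, ys) / (variance(xs) ** 0.5 * variance(ys) ** 0.5)


# Toy data, made up for illustration.
constant = [7, 7, 7, 7, 7]          # everyone has the same score
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]             # y is completely determined by x
line = [2 * v + 1 for v in x]       # a perfect linear relationship with x

print(variance(constant))           # no variation at all
print(covariance(x, y))             # zero, yet y depends entirely on x
print(correlation(x, line))
```

Covariance only picks up the linear part of a relationship, which is why the curved relationship between `x` and `y` slips right past it.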
Diving into MANOVA was really what I wanted to blog about next, so maybe I will actually get to that in the context of analyzing missing data, but having failed already at my attempt to leave my desk before midnight, that will have to wait until next time.
Having found no significant differences in the missing and non-missing data, as I’d expected, I went on to do a couple of more analyses where I was quite surprised not to find differences, but that will also have to wait for next time. I’m really only mentioning it here so I don’t forget. Wouldn’t you think that there would be differences in hospital length of stay and age by race and region? Well, I would, but I was wrong.
On a random note, I have to say, I really do love this remote desktop set up for teaching. It solves the problem of whether students have Windows or Mac, having to get needed software installed. All the way around, I love it.
Every day, every week, I face the same question that all entrepreneurs ask themselves –
“How do you know when you are done?”
Most days, I start work around 10 am and finish about 14 hours later. Usually, I take off an hour for lunch and an hour for dinner, or take a few hours in the middle of the day to get away from the office. Sunday, it was taking my grandchildren to the Natural History Museum and the park. I average 10-11 hours a day, seven days a week. Even then, there is no end in sight to the tasks I want to accomplish, goals I want to achieve. When there’s no time clock to punch, no boss looking over your shoulder, how do you decide when it’s time to hang it up for the day?
One answer is when you are just exhausted and making more mistakes than you are progress. Frankly, the prospect of just working every night until I fall asleep from exhaustion isn’t very appealing. I did that in the year after my husband died, and even though it was probably a preferable (and more profitable) way of coping than drinking myself into a stupor every night, I can tell you that it’s not a lifestyle I would recommend. The reality is that there is never, ever going to be a day at the office when I say,
Okay, that’s it. No more work to do here. Time to head to the beach.
Some people (who are not me), would say that you should take off to celebrate achievements. For example, last week, I
- found out that a project we had worked on for a client had been wildly successful,
- submitted a grant proposal to create a game for English language learners, including receiving written agreement from teachers in three school districts in three different states to assist with development,
- finished 1/4 of the lectures for a course I will be teaching soon,
- made major improvements in one level of the Fish Lake game, which we will be able to use for Spirit Lake as well,
- found out that a huge school district is now using Spirit Lake,
- renewed a consulting contract,
- created css to improve our web pages in the Fish Lake game,
- did the usual stuff of meetings, approving payroll, answering email, reviewing staff tasks on basecamp, updating a few things in the company wiki, approved a couple of employment contracts.
And all of this was accomplished with having spent all of Monday in airports and on planes flying back from Kansas City where I had been as coach for a judo team of seven students from Gompers Middle School. So … did I take off early? No, because I still needed to
- submit a revised budget for a contract,
- submit another revised budget for a grant,
- rewrite the PHP for a client database,
- get ready for an investors’ meeting,
- figure out what is wrong with the gravity in one level where the player is literally walking on air.
My unhelpful point here is that I DON’T necessarily take off to celebrate and I definitely don’t take off when I have something that could be very important to our company, like a meeting with potential investors during which I want to get as much information as possible (and not look like an idiot) for that time down the road when we do need to bring in outside investors.
What I DO try to do is stop working by midnight every night. There just seems to be something dysfunctional about not leaving the office the same day you came in, even if you come in at 10 a.m. I don’t take off to celebrate so I can take off when I feel that I need a break.
One thing I can guarantee you for an absolute fact is that you will be less effective if you don’t get enough sleep. You’ll make mistakes you never would have made if you were not so tired. Knowing this, another reason that I try to quit working at midnight is so I can be asleep by 2 a.m. That gives me 8 hours to sleep before I get up and hit it again at 10 a.m.
Staying up until 5:30 a.m. as The Invisible Developer sometimes does strikes me as counter-productive. You’re just going to sleep later the next day, so why not just go to sleep now and start up again when you are rested enough to be more effective. Even if I do say so myself, this post I wrote about doing one more thing before you go to bed is worth reading. Often, that one more thing will be to make the list of the things that are a priority for tomorrow. I then can knock off with confidence that I’ll get on those things first thing the next day.
I work hard, I work a lot, but I have learned not to make myself crazy trying to get everything done, because … at the end of the day, there’s another day. That’s how time works.
Following a discussion using matrix algebra to show computation in a Multivariate Analysis of Variance, a doctoral student asked me,
“Professor, when will I ever use this? Why do I need to know this?”
He had a valid point. I’m always asking myself why I’m teaching something. Is it because it interests me personally, because it is in the textbook or because students really need to know it?
Let’s take some things about matrix algebra we always teach students in statistics.
What conformable means and why it might matter
Two matrices are conformable if they can be multiplied together, which requires the number of columns in the first matrix to equal the number of rows in the second. When you multiply two matrices, the first row of the first matrix is multiplied, element by element, by the first column of the second matrix. You sum the products and that is the first element in the product matrix. You repeat this until you have multiplied all of the rows in the first matrix by all of the columns in the second.
So, you can multiply a 2 x 3 matrix by a 3 x 4 matrix but not vice versa.
Multiplying a matrix of dimension a x b and a matrix of dimension c x d, where b = c, will give you a resulting matrix with a rows and d columns, that is, of dimensions a x d.
This can give you results that sometimes seem counter-intuitive, like that the product of a 3 x 1 matrix and a 1 x 3 matrix is a 3 x 3 matrix, while the product of a 1 x 3 matrix and a 3 x 1 matrix is a 1 x 1 matrix.
It may seem weird that the result of matrix multiplication can either be a larger matrix than both of the matrices you multiplied, or smaller than both of them, but there it is.
If both matrices are square, that is, of dimension n x n, then the resulting product will also be an n x n matrix.
And, of course, any matrix can be multiplied by its transpose because the transpose of an m x n matrix will always be n x m .
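If you would rather see the conformability rule run than read about it, here is a bare-bones sketch using lists of lists. The function names `shape`, `transpose` and `matmul` are just what I chose for the example, and in practice you would use a matrix library rather than write this yourself.

```python
def shape(m):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return len(m), len(m[0])


def transpose(m):
    """Swap rows and columns, so an m x n matrix becomes n x m."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply a (r x k) by b (k x c); refuse if they are not conformable."""
    (a_rows, a_cols), (b_rows, b_cols) = shape(a), shape(b)
    if a_cols != b_rows:
        raise ValueError(
            f"not conformable: {a_rows} x {a_cols} times {b_rows} x {b_cols}"
        )
    # Each element is a row of a times a column of b, summed.
    return [
        [sum(a[i][k] * b[k][j] for k in range(a_cols)) for j in range(b_cols)]
        for i in range(a_rows)
    ]


row = [[1, 2, 3]]        # 1 x 3
col = transpose(row)     # 3 x 1

print(shape(matmul(row, col)))   # (1 x 3)(3 x 1): smaller than either input
print(shape(matmul(col, row)))   # (3 x 1)(1 x 3): larger than either input
```

Calling `matmul(row, row)` raises a `ValueError`, because a 1 x 3 matrix cannot be multiplied by another 1 x 3 matrix.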
If a square matrix is of full rank, it means that none of the rows are linearly dependent. If you DO have linear dependence, it means you have redundant measures. Now, I could go on to prove this mathematically and all of it is very interesting to me.
I question, though, whether you really need to know anything about matrix algebra to understand that redundant measures are a bad thing.
Do you need matrix algebra to explain that we are going to apply coefficients (do you even need to refer to it as a vector?) to the values of each variable for each record and get a predicted score such that
predicted score = b0 + b1X1 + b2X2 + … + bnXn
When I was in graduate school, calculators that did statistical analyses, even as simple as regression, cost a few hundred dollars which was the equivalent of three months of my car payment. Computer time was charged to your department by the hour. So … my first few courses, I did all of my homework problems using a pencil and paper, transposing and inverting matrices – and it was a huge pain in the ass.
Then, I got a job as a research assistant and one of the perks was hours of computer time. I thought I’d died and gone to heaven. It took me less than half an hour to get all of my homework done using SAS (which ran on a mini-computer and spit out printouts that I had to walk across campus to pick up).
My students are learning in a completely different environment. So … do they need to learn the same things in the same way I did? This is a question I ponder a lot.
I’m pretty certain that I’m a woman in technology.
Last night, I was using SAS on a virtual machine through a remote desktop connection to prepare data from the National Hospital Discharge Survey for use in examples of MANOVA and multinomial logistic regression.
Next week, I will start on a contract to completely re-do the PHP/MySQL database for a client to bring it to something more secure and up to date.
Oh, and I also was reviewing my notes for the graduate courses in biostatistics and advanced multivariate statistics that I’m teaching this fall.
Pretty certain that by any standard – writing code, founding companies, graduate degrees, university appointment, successful Kickstarter – I am definitely a woman in tech/STEM, whatever the day’s buzzword.
I read SO many articles, blog posts, tweets about the need for women in tech, women-led start-ups, women entrepreneurs.
If you ask me, the U.S. Department of Agriculture is the greatest proponent of women in tech that there is, because they have actually put up money and funded us to do a prototype of an adventure game that teaches math.
When results from that were positive, they funded us again with a Phase II Small Business Innovation Research award to develop the games for commercialization.
I have written here before about the troubling nature of the Black Girls Code, Latina Girls Code emphasis that seems to completely overlook the grown women who are here now. I am NOT saying those aren’t good programs. I assume they are but I have no personal experience. What I am saying is pretty much what I said in January.
It seems to me that when people are looking at minorities or women to develop in their fields, they are much more interested in the hypothetical idea of that cute 11-year-old girl being a computer scientist some day than of that thirty-something competing with them for market share or jobs. If there are venture capitalists or conference organizers or others out there that are sincerely trying to promote WOMEN who code, not girls, I’ve never met any.
(Since then, I have met a couple of conference organizers.)
I suppose Ada Lovelace was cool – my two-year-old granddaughter has a shirt with her picture on it. Still, I don’t think a trending hashtag of #fuckyeahadalovelace did anything for me as a woman in tech.
You know what helped me as a woman in tech? Seed money from the USDA. You can see what we did with it here at our 7 Generation Games site.
One thing Sheryl Sandberg got right in her book, Lean In, was that women tend to be judged on their accomplishments where men are judged on their potential. Of course, you also don’t want to be “too old” to be an innovator so by the time women have those accomplishments, they are past their prime as entrepreneurs according to those VCs who believe that people over 30 are too old to do a start-up.
It’s hard for me to complain about my life when my morning starts out with reading technical books with lines like, “Figure 1 shows the sprite with the red and green blood particles for player and zombie”.
My point is that our company is in the situation we are in not because of any “help minorities code” program but because USDA and our backers on Kickstarter gave us cold, hard cash to develop our products.
Want to help women in tech? Back them on Kickstarter. Buy their products. Tweet about their products and companies to help their marketing. Invest in their companies.
USDA got it right.