I’m about to tear my hair out. I’ve been reading this statistics textbook which shall remain nameless, ostensibly a book for graduate students in computer science, engineering and similar subjects. The presumption is that at the end of the course students will be able to compute and interpret a factor analysis, MANOVA and other multivariate statistics. The text spends 90% of the space discussing the mathematics behind computing the results, 10% discussing the code to get these statistics and 0% discussing the decisions one makes in the selection of communality estimates, rotation, post hoc tests or anything else.
In short, the book devotes nearly all of its space to explaining the part that the computer does for you, which students will never need to do, and 10% or less to the decisions and interpretation that they will spend their careers doing. One might argue that it is good to understand what is going on “under the hood,” and I’m certainly not going to argue against that, but there is a limit to how much can be taught in any one course, and I would argue very strenuously that there needs to be a much greater emphasis on the part the computer cannot do for you.
There was an interesting article in Wired a few years ago on The End of Theory, saying that we now have immediate access to so much data that we can use “brute force”. We can throw the data into computers and “find patterns where science cannot.”
Um. Maybe not.
Let’s take an example I was working on today, from the California Health Interview Survey. There are 47,000+ subjects but it wouldn’t matter if there were 47 million. There are also over 500 variables measured on these 47,000 people. That’s over 23,000,000 pieces of data. Not huge by some standards, but not exactly chicken feed, either.
Let’s say that I want to do a factor analysis, which I do. By some theory – or whatever that word is we’re using instead of theory – I could just dump all of the variables into an analysis and magically factors would come out, if I did it often enough. So, I did that and came up with results that meant absolutely nothing because the whole premise was so stupid.
Here are a couple of problems:
1. The CHIS includes a lot of different types of variables, sample weights, coding for race and ethnic categories, dozens of items on treatment of asthma, diabetes or heart disease, dozens more items on access to health care. Theoretically (or computationally, I guess the new word is), one could run an analysis and we would get factors of asthma treatment, health care access, etc. Well, except I don’t really see that the variables that are not on a numeric scale are going to be anything but noise. What the heck does racesex coded as 1= “Latin male”, 10 = “African American male” etc. ever load on as a factor?
2. LOTS of the variables are coded with -1 as inapplicable. For example, “Have you had an asthma attack in the last 12 months?”
-1 = Inapplicable
1 = Yes
2 = No
While this may not be theory, these two problems do suggest that some knowledge of your data is essential.
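To make the -1 problem concrete, here is a minimal sketch in Python (the variable name and the values are made up to mirror the CHIS coding, not taken from the actual file) of recoding “inapplicable” to missing before any analysis:

```python
# Hypothetical sketch: the real CHIS variable names differ. The point is
# that -1 ("inapplicable") must become missing, or a factor analysis will
# treat "inapplicable" as a score even lower than "yes".
INAPPLICABLE = -1

asthma_attack = [-1, 1, 2, 1, -1, 2]   # -1 = inapplicable, 1 = yes, 2 = no

recoded = [None if v == INAPPLICABLE else v for v in asthma_attack]

print(recoded.count(None))   # 2 responses set to missing
```

In SAS, the equivalent is a one-line IF statement in a DATA step setting the variable to missing when it equals -1.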
Once you get results, how do you interpret them? Using the default minimum eigenvalue of 1 criterion (which if all you learned in school was how to factor analyze a matrix using a pencil and a pad of paper, I guess you’d use the defaults), you get 89 factors. Here is my scree plot.
What exactly is one supposed to do with 500 variables that load on 89 factors? Should we then factor analyze these factors to further reduce the dimensions? It would certainly be possible. All you’d need to do is output the factor scores on the 89 factors, and then do a factor analysis on that.
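Since I can’t post the CHIS data, here is a toy sketch in Python of what that eigenvalue-greater-than-one default is actually doing: four variables that form two correlated pairs give exactly two eigenvalues above 1.

```python
import numpy as np

# Toy correlation matrix: variables 1-2 correlate at .8, as do 3-4,
# with nothing across the pairs, so two factors is the sensible answer.
R = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
])

eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]   # descending order
n_factors = int((eigenvalues > 1).sum())             # the default criterion

print(eigenvalues.round(4))   # [1.8 1.8 0.2 0.2]
print(n_factors)              # 2
```

With 500 variables of wildly mixed types, the same mechanical rule happily handed back 89 factors.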
I would argue, though, and I would be right, that before you do any of that you need to actually put some thought into the selection of your variables and how they are coded.
Also, you should perhaps understand some of the implications of having variables measured on vastly different scales. As this handy page on item analysis points out,
“Bernstein (1988) states that the following simple examination should be mandatory: “When you have identified the salient items (variables) defining factors, compute the means and standard deviations of the items on each factor. If you find large differences in means, e.g., if you find one factor includes mostly items with high response levels, another with intermediate response levels, and a third with low response levels, there is strong reason to attribute the factors to statistical rather than to substantive bases” (p. 398).”
And hold that thought, because our analysis of the 517 or so variables provided a great example …. or would it be using some kind of theory to point that out? Stay tuned.
I’ve written here before about visual literacy, and Cook’s D is just my latest example.
Most people intuitively understand that any sample can have outliers – say, an 80-year-old man who is the father of a six-year-old child, or a new college graduate who is making $150,000 a year. We understand that those people may throw off our predictions and perhaps we want to exclude those outliers from our models.
What if you have multiple variables, though? It’s possible that each individual value may not be very extreme but the combination is. Take this data set below that I totally made up, with mom’s age, dad’s age and child’s age.
Mom Dad Child
30 32 6
20 27 5
31 33 8
29 28 6
40 42 20
44 44 21
37 39 14
25 29 7
30 32 6
20 27 5
31 33 8
29 28 6
39 42 19
43 44 20
37 39 13
25 28 6
40 29 15
Look at our last record. The mother has an age of 40, the father an age of 29 and the child an age of 15. None of these individually are extreme scores. These aren’t even the minimum or maximum for any of the variables. There are mothers older (and younger) than 40, fathers younger (and older) than 29; 15 isn’t that extreme an age in our sample of children. The COMBINATION, however, of a 40-year-old mother, 29-year-old father and 15-year-old child is an extreme case.
Enter Cook’s distance, a.k.a. Cook’s D, which measures the effect of deleting an observation. The larger the distance, the more influential that point is on the results. Take a look at my graph below.
It is pretty clear that the last observation is very influential. Now, you might have guessed that if you had thought to look at the data. However, if you had 11 variables and 100 observations it wouldn’t be so easy to see by looking at the data and you might be really happy you had Cook around to help you out.
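For the curious, Cook’s D is simple enough to compute from scratch. Here is a sketch in Python (numpy) on the made-up data above, using the standard formula D = [e² / (p·s²)] · h/(1 − h)², where h is the leverage from the hat matrix:

```python
import numpy as np

# The made-up mom/dad/child ages from the table above; we predict
# child's age from the two parents' ages with ordinary least squares.
mom   = np.array([30, 20, 31, 29, 40, 44, 37, 25, 30, 20, 31, 29, 39, 43, 37, 25, 40], float)
dad   = np.array([32, 27, 33, 28, 42, 44, 39, 29, 32, 27, 33, 28, 42, 44, 39, 28, 29], float)
child = np.array([ 6,  5,  8,  6, 20, 21, 14,  7,  6,  5,  8,  6, 19, 20, 13,  6, 15], float)

X = np.column_stack([np.ones(len(mom)), mom, dad])   # intercept, mom, dad
beta, *_ = np.linalg.lstsq(X, child, rcond=None)
residuals = child - X @ beta

# Leverage values are the diagonal of the hat matrix H = X(X'X)^-1 X'
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
p = X.shape[1]                                   # number of parameters
s2 = residuals @ residuals / (len(child) - p)    # residual variance

cooks_d = residuals**2 / (p * s2) * h / (1 - h)**2

print(int(np.argmax(cooks_d)))   # 16 -- the last record, as the plot shows
```

SAS will, of course, produce the same diagnostics for you, for example with the R and INFLUENCE options on the MODEL statement in PROC REG.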
Let’s look at the data re-analyzed without that last observation. Here is what our plot of Cook’s D looks like now.
In fact, dropping out that one point changed our explained variance from 89% to 93%.
So … knowing how to use Cook’s D for regression diagnostics is our latest lesson in visual literacy.
Ever wonder why, with goodness of fit tests, non-significance is what you want?
Why is it that sometimes a significant p-value means your hypothesis is correct – there is a relationship between the price of honey and the number of bees – and in other cases significance means your model is rejected? Well, if you are reading this blog, it’s possible you already know all of this, but I can guarantee you that students who start off in statistics learning that a significant p-value is a good thing are often confused to learn that with model fit statistics, non-significance is (usually) what you want.
You are hoping to find non-significance when you are looking at model fit statistics because the hypothesis you are testing is that the full model – one that has as many parameters as there are observations – is different from the model you have postulated.
To understand model fit statistics, you should think about three models.
The null model contains only one parameter, the mean. Think of it this way: if all of your explanatory variables are useless, then your best prediction for the dependent variable is the mean. If you knew nothing about the next woman likely to walk into the room, your best prediction of her height would be 5’4″, if you live in the U.S., because that is the average height.
The full model has one parameter per observation. With this model, you can predict the data perfectly. Wouldn’t that be great? No, it would be useless. Using the full model is a bad idea because it is non-replicable.
Here is an example data set where I predict IQ using gender, number of M & M’s in your pocket and hair color.
Gender  M&Ms  Hair      IQ
Male    10    redhead   100
Female   0    blonde     70
Male    10    blonde     60
Female  30    brunette  100

IQ = 50 + M&Ms x 1 + female x 20 + redhead x 40
Is that replicable at all? If you selected another random sample of 4 people from the population do you think you could predict their scores perfectly using this equation?
Also, I do not know why that woman has so many M & M’s in her pocket.
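You can verify that the full-model equation above reproduces all four records exactly; a quick check in Python, with female and redhead as 0/1 dummy variables:

```python
# Each record: (female, M&Ms, redhead, observed IQ). The full model has
# four parameters for four observations, so it fits perfectly.
records = [
    (0, 10, 1, 100),   # Male, 10 M&Ms, redhead
    (1,  0, 0,  70),   # Female, 0 M&Ms, blonde
    (0, 10, 0,  60),   # Male, 10 M&Ms, blonde
    (1, 30, 0, 100),   # Female, 30 M&Ms, brunette
]

predictions = [50 + mms * 1 + female * 20 + redhead * 40
               for female, mms, redhead, _ in records]

print(predictions)                                # [100, 70, 60, 100]
print(predictions == [iq for *_, iq in records])  # True -- a perfect fit
```

A perfect fit, and a perfectly useless one.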
In between these two useless models is your model. The hypothesis you are testing is that your model, whatever it is, is non-significantly different from the full model. If you throw out one of your parameters, your new model won’t be as good as the full model – that one extra parameter may explain one case – but the question is, does the model without that parameter differ significantly from the full model? If it doesn’t, then we can conclude that the parameters we have excluded from the model were unimportant.
We have a more parsimonious model and we are happy.
But WHY do more parsimonious models make us happy? Well, because that is kind of the whole point of model building. If you need a parameter for each person, why not just examine each person individually? The whole point of a model is dimension reduction, that is, reducing the number of dimensions you need to measure while still adequately explaining the data.
If, instead of needing 2,000 parameters to explain the data gathered from 2,000 people you can do just as well with 47 parameters, then you would have made some strides forward in understanding how the world works.
Coincidentally, I discussed dimension reduction on this blog almost exactly a year ago, in a post with the title “What’s all that factor analysis crap mean, anyway?”
(Prediction: At least one person who follows this link will be surprised at the title of the post.)
I’ve been looking high and low for a supplemental text for a course on multivariate statistics and I found this one -
The Multivariate Social Scientist, by Graeme Hutcheson & Nick Sofroniou
They are big proponents of generalized linear models, in fact, the subtitle is “Introductory statistics using generalized linear models”, so if you don’t like generalized models, you won’t like this book.
I liked this book a lot. Because this is a random blog, here is day one of my random notes:
A generalized linear model has three components:
- The random component is the probability distribution assumed to underlie the response variable. (y)
- The systematic component is the fixed structure of the explanatory variables, usually linear. (x1, x2 … xn)
- The link function maps the systematic component on to the random component.
The systematic component takes the form
η = α + β1x1 + β2x2 + … + βnxn
They use η to designate the predicted variable instead of y-hat. I know you were dying to know that.
Obviously, since that IS a multiple regression equation (which could also be used for ANOVA), with linear regression the link function is simply the identity. With logistic regression, it is the logit function, which maps the expected value of the random component – a probability – on to the systematic component as log odds.
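A quick numeric sketch of the logit link, since the log odds idea trips students up (plain Python, nothing from the book):

```python
import math

# The logit link maps a probability (the scale of the random component)
# to log odds (the scale of the systematic component); its inverse maps back.
def logit(p):
    return math.log(p / (1 - p))

def inverse_logit(eta):
    return 1 / (1 + math.exp(-eta))

p = 0.75                              # a probability of .75 is odds of 3 to 1
eta = logit(p)
print(round(eta, 4))                  # 1.0986, which is log(3)
print(round(inverse_logit(eta), 4))   # 0.75 -- the round trip back
```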
The reason I think this is such a good book for students taking a multivariate statistics course is that it relates to what they should know. They certainly should be familiar with multiple regression and logistic regression, and understand that the log of the odds is used in the latter.
The book also discusses the log link used in loglinear analyses, which I don’t necessarily assume every student will have used. I don’t say that as a criticism, merely an observation.
You might have gotten the misimpression from my previous post – because I don’t think students need to learn all that much matrix algebra – that I am a slacker as far as expecting students to come to courses with some prior knowledge. That’s not exactly the case. In fact, here are some things I just assume students coming into a multivariate statistics course should know, and even though some textbooks begin with these, well, all I can say is that if you have had three statistics courses and you still don’t know what a covariance is, I think something has gone awry in your education.
- Know the equation to compute variance – it’s pretty darn basic – and have a really good understanding of interpreting variance, like what 0 variance means, the statistical and practical interpretation of explained variance. I personally view science as the search for explained variance.
- REALLY understand covariance – that is, know how it is calculated, that it is a measure of linear relationship, and that a covariance of 0 usually, but not always, signifies independence.
- Be able to interpret a correlation.
- Have a basic grasp of the Central Limit Theorem and the difference between population values and sample statistics.
- Understand what a chi-square is, how you get it and how you interpret it.
- Remember the definition and interpretation of an F-test.
- Understand the difference between statistical significance and effect size.
- Know what the null hypothesis tests.
- Realize that before you do ANYTHING with data, if you don’t check the data coding and quality you are an idiot. You should have some understanding of how to read a codebook and be able to compute a frequency distribution, descriptive statistics and data description (like a PROC CONTENTS with SAS). When I look at the scant attention many so-called researchers pay to issues like missing data, miscoded data and non-random sampling, I am surprised we’re ever able to replicate anything.
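For the variance and covariance items on that list, the computations really are pretty darn basic; a sketch with made-up numbers, using the usual n − 1 formulas:

```python
# Tiny made-up sample; y is exactly 2x, a perfect linear relationship,
# so the correlation should come out to 1.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
cov_xy = sum((xi - mean_x) * (yi - mean_y)
             for xi, yi in zip(x, y)) / (n - 1)
corr = cov_xy / (var_x ** 0.5 * var_y ** 0.5)

print(round(cov_xy, 4))   # 3.3333
print(round(corr, 4))     # 1.0
```

Zero variance in x would make that correlation undefined, which is one practical reason “what does 0 variance mean” is on the list.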
Diving into MANOVA was really what I wanted to blog about next, so maybe I will actually get to that in the context of analyzing missing data, but having failed already at my attempt to leave my desk before midnight, that will have to wait until next time.
Having found no significant differences in the missing and non-missing data, as I’d expected, I went on to do a couple of more analyses where I was quite surprised not to find differences, but that will also have to wait for next time. I’m really only mentioning it here so I don’t forget. Wouldn’t you think that there would be differences in hospital length of stay and age by race and region? Well, I would, but I was wrong.
On a random note, I have to say, I really do love this remote desktop setup for teaching. It solves the problem of whether students have Windows or Mac, and of having to get the needed software installed. All the way around, I love it.
Following a discussion using matrix algebra to show computation in a Multivariate Analysis of Variance, a doctoral student asked me,
“Professor, when will I ever use this? Why do I need to know this?”
He had a valid point. I’m always asking myself why I’m teaching something. Is it because it interests me personally, because it is in the textbook or because students really need to know it?
Let’s take some things about matrix algebra we always teach students in statistics.
What conformable means and why it might matter
Two matrices are conformable if they can be multiplied together. When you multiply two matrices, each row of the first matrix is multiplied by each column of the second matrix. You sum the products, and that sum is one element of the resulting matrix. You repeat this until you have multiplied all of the rows in the first matrix by all of the columns in the second.
So — you can multiply a 2 x 3 matrix by a 3 x 2 matrix, but you cannot multiply a 2 x 3 matrix by another 2 x 3 matrix.
Multiplying a matrix of dimension a x b and a matrix of dimension c x d (where b = c) will give you a resulting matrix with a rows and d columns, that is, of dimensions a x d.
This can give you results that sometimes seem counter-intuitive, like that the product of a 3 x 1 matrix and a 1 x 3 matrix is a 3 x 3 matrix, while the product of the same two matrices in the other order is only 1 x 1.
It may seem weird that the result of matrix multiplication can either be a larger matrix than both of the matrices you multiplied, or smaller than both of them, but there it is.
If both matrices are square, that is, of dimension n x n, then the resulting product will also be an n x n matrix.
And, of course, any matrix can be multiplied by its transpose because the transpose of an m x n matrix will always be n x m .
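These shape rules are easy to play with in Python (numpy), which is worth a few minutes even if, like my students, you will never multiply a matrix by hand again:

```python
import numpy as np

# Conformability: an (a x b) @ (c x d) product requires b == c,
# and the result has shape (a, d).
A = np.arange(6).reshape(2, 3)    # 2 x 3
B = np.arange(6).reshape(3, 2)    # 3 x 2

print((A @ B).shape)       # (2, 2)
print((B @ A).shape)       # (3, 3)

col = np.ones((3, 1))
row = np.ones((1, 3))
print((col @ row).shape)   # (3, 3) -- larger than either factor
print((row @ col).shape)   # (1, 1) -- smaller than either factor

print((A @ A.T).shape)     # (2, 2) -- a matrix times its transpose always works
```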
If a square matrix is of full rank, it means that none of the rows can be written as a linear combination of the other rows. If you DO have linear dependence, it means you have redundant measures. Now, I could go on to prove this mathematically and all of it is very interesting to me.
I question, though, whether you really need to know anything about matrix algebra to understand that redundant measures are a bad thing.
Do you need matrix algebra to explain that we are going to apply coefficients (do you even need to refer to it as a vector?) to the values of each variable for each record and get a predicted score such that
predicted score = b0 + b1X1 + b2X2 + … + bnXn
When I was in graduate school, calculators that did statistical analyses, even as simple as regression, cost a few hundred dollars which was the equivalent of three months of my car payment. Computer time was charged to your department by the hour. So … my first few courses, I did all of my homework problems using a pencil and paper, transposing and inverting matrices – and it was a huge pain in the ass.
Then, I got a job as a research assistant and one of the perks was hours of computer time. I thought I’d died and gone to heaven. It took me less than half an hour to get all of my homework done using SAS (which ran on a mini-computer and spit out printouts that I had to walk across campus to pick up).
My students are learning in a completely different environment. So … do they need to learn the same things in the same way I did? This is a question I ponder a lot.
Today, I was thinking about using data from the National Hospital Discharge Survey to try to predict type of hospital admission. Is it true that some people use the emergency room as their primary method of care? Mostly, I wanted to poke around with the NHDS data and get to know it better for possible use for teaching statistics. Before I could do anything, though, I needed to get the data into a usable form.
I decided to use as my dependent variable the type of hospital admission. There were certain categories, though, that were not going to be dependent on much else, for example – if you are an infant born in a hospital, your admission type is newborn. I also deleted the people whose admission type was not given.
The next question was what would be interesting predictor variables. Unfortunately, some of what I thought would be useful had less than perfect data; for example, for discharge status, about 5% of the patients had a status of “Alive, disposition not stated”.
I also thought either diagnostic group or primary diagnosis would be a good variable for prediction. When I did a frequency distribution for each, it was ridiculously long, so I thought I would be clever and only select those diagnoses that occurred 0.05% of the time or more, which works out to over 60 people. Apparently, there is more variation in diagnosis than I thought, because in both cases that left over 330 different diagnoses.
Here is a handy little tip, by the way -
PROC FREQ DATA = analyze1 NOPRINT ;
TABLES dx1 / OUT = freqcnt ;
RUN ;

PROC PRINT DATA = freqcnt ;
WHERE percent > 0.05 ;
RUN ;

This will only print out the diagnoses that occurred more than the specified percentage of the time.
I thought what about the diagnoses that were at least .5% of the admissions? So, I re-ran the analyses with 0.5 and came up with 41 DRGs. I didn’t want to type in 41 separate DRGs, especially because I thought I might want to change the cut off point later, so I used a SAS format, like this. Note that in a CNTLIN dataset, which I am creating, the variables MUST have the names fmtname, label and start.
Also, note that the RENAME statement doesn’t take effect until you write out the new dataset, so your KEEP statement has to have the old variable name, in this case, drg.
Data fmtc ;
set freqcnt ;
if percent > 0.5 ;
retain fmtname 'drgf' ;
retain label 'in5' ;
rename drg = start ;
keep fmtname drg label ;
run ;
Okay, so, what I did here was create a dataset that assigns the formatted value of in5 to every one of my diagnosis-related groups that occurs in 0.5% of the discharges or more.
To actually create the format, I need one more step
proc format cntlin = fmtc ;
run ;
Then, I can use this format to pull out my sample
DATA analyze2 ;
SET nhds.nhds10 ;
IF admisstype in (1,2,3) ;
IF dischargestatus in (1,3,4,6) & PUT(drg,drgf.) = "in5" then insample = 1 ;
ELSE insample = 0 ;
RUN ;
I could have just selected the sample that met these criteria, but I wanted to have the option of comparing those I kept in and those I dropped out. Now, I have 71,869 people dropped from the sample and 59,743 that I kept. (I excluded the newborns from the beginning because we KNOW they are significantly different. They’re younger, for one thing.)
So, now I am perfectly well set up to do a MANOVA with age and days of care as dependent variables. (You’d think there would be more numeric variables in this data set than those two, but surprisingly, even though many variables in the data set are stored as numeric they are actually categorical or ordinal and not really suited to a MANOVA.)
Anyway …. I think that MANOVA will be one of the first analyses we do in my multivariate course. It’s going to be fun.
I was talking to a friend of mine today who had taken a test for a new job recently and he had a hard time with the math portion of it. We were in college about the same time and he did perfectly fine in math, but it had been a while. This got me to thinking that I should review things like matrix algebra from time to time, just because it has been a while since I had any need to multiply a matrix without a computer. Well, actually, I can’t imagine that I will ever have such a need but since I’m teaching multivariate statistics and the textbooks generally have a lot of matrix algebra, I thought I should brush up on it whether I ever need it or not.
I had the normal equations for regression drilled into my brain in graduate school, and there was a time in my life, when I actually had spare time, when I found solving systems of linear equations an amusing thing to do. All of that was a very long time ago.
So …. as I sit here thinking what do my students need to know, I run into the Goldilocks problem yet again. Nothing seems just right. Teaching multiplying a scalar by a matrix seems a waste of time, no matter how brief. All you do is multiply every number in the matrix by that value. Okay, got it.
They should know what an identity matrix is. This could actually have some useful implications in statistics. If your correlation matrix is close to an identity matrix, with 1s on the diagonal and 0s off the diagonal, it tells you that your variables are uncorrelated. If you analyzed a matrix of random data, this is exactly what you would expect to get.
If you multiply a matrix by the identity matrix, I, you are going to get the original matrix as a result, hence the name, identity matrix.
IA = A
This is analogous to the identity property of scalar (that is, regular numbers, not matrices) multiplication that 1X = X
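The matrix version is a one-liner to verify in Python (numpy):

```python
import numpy as np

# IA = A and AI = A: the identity matrix leaves any conformable matrix alone.
A = np.array([[2.0, 5.0],
              [7.0, 1.0]])
I = np.eye(2)

print(np.array_equal(I @ A, A))   # True
print(np.array_equal(A @ I, A))   # True
```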
The determinant of a 2 x 2 matrix

| a  b |
| c  d |

is equal to (ad – bc).
To find the inverse of that same 2 x 2 matrix, the reciprocal of the determinant, that is, 1 / (ad – bc), is multiplied by the matrix

| d  -b |
| -c  a |
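For a concrete check (Python/numpy): 1/(ad − bc) times the matrix with d and a swapped on the diagonal, and b and c negated, really does reproduce the inverse.

```python
import numpy as np

# 2 x 2 inverse by the determinant formula, checked against numpy.
a, b, c, d = 2.0, 5.0, 1.0, 3.0
M = np.array([[a, b],
              [c, d]])

det = a * d - b * c                      # 2*3 - 5*1 = 1
by_hand = (1 / det) * np.array([[ d, -b],
                                [-c,  a]])

print(np.allclose(by_hand, np.linalg.inv(M)))   # True
```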
This is particularly important in statistics because you will occasionally get a message on your output that the “determinant is zero” and it would be helpful to you if you understood what that meant and why it was important.
One important point here is that you need the determinant to find the inverse of a matrix. For example, to find the vector of regression coefficients you would use this equation:

b = (X'X)^-1 X'y
Notice here that you need to take the inverse of the product of the transpose of the X matrix and the X matrix. What if the determinant is zero? Well, you can’t divide by zero – SO THERE IS NO SOLUTION.
At this point, you want to start to chase down why the determinant is zero. Do you have redundant measures? Is there no variance in the sample?
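Here is what that failure looks like in practice, in a Python (numpy) sketch with a deliberately redundant measure; the second column is exactly twice the first, so the determinant of X'X is zero and the normal equations have no unique solution:

```python
import numpy as np

# Two "predictors" where the second is exactly twice the first:
# a redundant measure, so X'X is singular.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

xtx = X.T @ X
print(abs(np.linalg.det(xtx)))   # 0.0 -- the dreaded zero determinant

try:
    np.linalg.solve(xtx, X.T @ y)
except np.linalg.LinAlgError:
    print("Singular matrix: no unique solution")
```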
All of this is very interesting to me personally, but aside from that, I keep asking myself whether the students really need an in-depth understanding of matrix algebra when it is all done by a computer. I really don’t know the answer to that, which is why I keep thinking about it.
When I was running out to the airport, I said I would explain how to get a single plot of all of your scatter plots with SAS. Let’s see if I can get this posted before the plane lands. Not likely but you never know …
It’s super simple
proc sgscatter data=sashelp.heart ;
matrix ageatdeath ageatstart agechddiag mrw / group = sex ;
run ;
And there you go.
Statistical graphics from 10,000 feet. Is this a full service blog or what?
My life is upside down. All day, as my job, I wrote a program to get a little man to run around a maze, come out the other end and have a new screen come up with a math challenge question. Then, in the evening, I’m surfing the web for interesting bits to read on multivariate statistics.
I’m teaching a course this winter and could not find the Goldilocks textbook, you know, the one that is just right. They either had no details – just pointing and clicking to get results – or the wrong kind of details. One book had pages of equations, then code for producing some output with very little explanation and no interpretation of the output.
I finally did select a textbook but it was a little less thorough than I would like in some places. I decided to supplement it with required and optional readings from other sources. Thus, the websurfing.
One book I came across that is a good introduction for the beginning of a course is Applied Multivariate Statistical Analysis, by Hardle and Simar. You can buy the latest version for $61 but really, the introduction from 2003 is still applicable. I was delighted to see someone else start with the same point as I do – descriptive statistics.
Whether you have a multivariate question or univariate one, you should still begin with understanding your data. I really liked the way they used plots to visualize multiple variables. I knocked a few of these out in a minute using SAS Studio.
symbol1 v = squarefilled ;

proc gplot data=sashelp.heart ;
plot ageatdeath*agechddiag = sex ;
plot ageatdeath*ageatstart = sex ;
plot ageatdeath*mrw = sex ;
run ;

Title "Male" ;
proc gchart data=sashelp.heart ;
vbar mrw / midpoints = (70 to 270 by 10) ;
where sex = "Male" ;
run ;

Title "Female" ;
proc gchart data=sashelp.heart ;
vbar mrw / midpoints = (70 to 270 by 10) ;
where sex = "Female" ;
run ;
If I had more time, I would show all of these plots on one page – a draftsman’s plot – but I’m running out to the airport in a minute. Maybe next time. Yes, I do realize these charts are probably terrible for people who are color-blind. Will work on that at some point also.
You can see that the age at diagnosis and death is linearly related. It seems there are many more males than females and the age at death seems higher for females.
The picture with Metropolitan relative weight did not seem nearly as linear, which makes sense because, if you think about it, age at start and age at death HAVE to be related. You cannot be diagnosed at age 50 if you died when you were 30. It also seems as if there is more variance for women than men and the distribution is skewed positively for women.
The last two graphs seem to bear that out (you can see those here – click and scroll down), which makes me want to do a PROC UNIVARIATE and a t-test. It also makes me wonder if it’s possible that weight could be related to age at death for men but not for women. Or, it could just be that as you get older, you put on weight, and older people are more likely to die.
My point is that some simple graphs can allow you to look at your variables in 3 dimensions and then compare the relationships of multiple 3-dimensional graphs.
Gotta run – taking a group of middle school students to a judo tournament in Kansas City. Life is never boring.