From time to time I get asked, “Can you recommend a book like Structural Equation Modeling for Dummies?” My unspoken thought is always, “You’re f***ing kidding me, right?” SEM isn’t the sort of thing done by dummies. Well, ask no more. If you want a straightforward, basic treatment of CALIS, the SAS procedure for structural equation modeling, you should definitely check out Yiu-Fai Yung’s presentation on CALIS and missing data.
Of course, in a 50-minute or so presentation you can’t do a comprehensive discussion of anything. Well, except maybe mangoes. I don’t think there is more than 50 minutes’ worth of stuff to say about mangoes.
Dr. Yung did not talk about mangoes, though. He talked about missing data.
As you well know, SAS, along with most other common statistical packages, uses pairwise deletion for missing data when it creates a correlation or covariance matrix. So, if you are missing data for 3 people for question 1 and for 6 different people for question 2, then the N for the correlation of question 1 and question 2 will be (N-9). Let’s say you have no missing data for question 46. Then the N for the correlation of question 1 and question 46 is (N-3), and the N for the correlation of question 2 and question 46 is (N-6). You know this. Everyone knows this. I thought it was a very good idea to start an SEM presentation with information everyone knows.
One problem with pairwise deletion is that you may end up with a matrix that is not positive definite. This is a bad thing. I wrote a blog post a while back on the sadness of non-positive definite matrices. This page from Ed Rigdon’s structural equation modeling site explains a little more about why non-positive definite matrices involve division by zero to get the inverse, which is another thing that doesn’t require a huge amount of knowledge of advanced mathematics to know it won’t end well. Okay, pairwise deletion may give you a matrix with negative eigenvalues, which is kind of the same as negative variance, i.e., stupid.
If you try to use a matrix with unequal Ns as input to PROC FACTOR, it will give you a warning and use the minimum N for any pair as the N. It is a sad reflection on what I do in my free time that I know this. The next possibility is to do listwise deletion. In that case, every record that is missing even one of your variables will be deleted. You should then end up with a positive definite matrix, but you may have lost a huge proportion of your data. Let’s assume, says Dr. Yung, that you have a small number of people missing data for each variable. If you have a large number of people missing for one variable, say 50% didn’t answer question 3, that’s an issue and you should look into what’s wrong with question 3. What the heck are you asking that half the people didn’t answer, their bra size?
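If you want to see the two kinds of deletion side by side, something like the following should do it. This is only a sketch: the data set name survey and the variable names q1, q2 and q46 are made up to match the example above, not taken from Dr. Yung’s presentation.

```sas
/* Pairwise deletion (the PROC CORR default): each correlation uses  */
/* every record that has both variables, so N can differ by pair.    */
proc corr data=survey;
   var q1 q2 q46;
run;

/* Listwise deletion: NOMISS drops any record with a missing value   */
/* on any VAR variable, so every correlation has the same N.         */
proc corr data=survey nomiss;
   var q1 q2 q46;
run;
```

Compare the Ns reported in the two sets of output and you will see how much data listwise deletion costs you.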
Another possibility is to do mean imputation. For each missing value, you substitute the mean of that variable from all of the people who did answer. The problem with this is that it overstates your certainty because it understates your standard error. You are pretending you had those data but you didn’t. The standard error is a function of N and something else. For example, the standard error of the mean is the standard deviation divided by the square root of N. I know you already knew that, too. So, when you make N larger than it really is, you are dividing by a larger number, which means your resulting standard error will be smaller. [Think about this for a moment and you will realize it makes perfect sense. Take your time. I’ll wait.]
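If you would rather see numbers than wait, here is a worked example. The figures are invented purely for illustration: 100 people surveyed, 80 of whom answered, with a standard deviation of 10 among those 80.

```latex
% Honest standard error, using only the 80 real answers
SE_{\text{observed}} = \frac{s}{\sqrt{N}} = \frac{10}{\sqrt{80}} \approx 1.118

% After mean imputation the 20 filled-in values add nothing to the sum of
% squared deviations, so the standard deviation shrinks as well
s_{\text{imputed}} = \sqrt{\frac{(80-1)\cdot 10^{2}}{100-1}} \approx 8.93

% Standard error computed as if all 100 values were real
SE_{\text{imputed}} = \frac{s_{\text{imputed}}}{\sqrt{100}} \approx 0.893
```

So the standard error drops from about 1.12 to about 0.89 without a single new piece of information being collected. Note that it shrinks for two reasons: N in the denominator is bigger, and the standard deviation in the numerator is smaller.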
With PROC CALIS, if you do this
PROC CALIS METHOD = ML ;
MSTRUCT VAR = x1-x5 ;
then it will use the maximum likelihood method to arrive at a solution. The VAR= option on the MSTRUCT statement gives the variables for which you want it to estimate the means and covariances. The default is to use all variables. The maximum likelihood method uses listwise deletion.
If you do PROC CALIS method = FIML ; it will use the full information maximum likelihood method, which uses all of the information and does NOT do listwise deletion. You should read his paper to get a complete explanation. My best analogy is this. If you are familiar with multiple imputation, it is a lot like doing a multiple imputation using PROC MI and then running your analysis. [If you’re not familiar with multiple imputation, this multiple imputation FAQ page is a quick and easy way to get to know it better.] The multiple imputation route takes multiple steps, though.
So, if you wanted to do a regression and impute your variables, you would do the PROC MI, then PROC REG and then the PROC MIANALYZE. PROC CALIS does the same things all in one step. To prove this, you could do those three steps above and then go ahead and do the same thing with CALIS. You did know that CALIS does things besides SEM, right? Like regular path analysis models and regression. Confirmatory factor analysis, too.
Of course you did, because if you think about it, a structural equation model is just all of those pieces put together. Do the regression with MI and MIANALYZE and then, with the same dataset, try this, and you’ll see what I mean:
PROC CALIS method = FIML ;
path x1 <--- x2-x5 ;
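For comparison, the three-step multiple imputation route might look roughly like this. It is a sketch rather than anything from the presentation: the data set name mydata, the output data set names, the number of imputations and the seed are all assumptions, so swap in your own.

```sas
/* Step 1: create several imputed copies of the data. PROC MI stacks  */
/* them in one data set and indexes them with _Imputation_.           */
proc mi data=mydata nimpute=5 seed=12345 out=mi_out;
   var x1-x5;
run;

/* Step 2: run the same regression separately on each imputed copy,   */
/* saving the estimates and their covariance matrices.                */
proc reg data=mi_out outest=reg_est covout noprint;
   model x1 = x2-x5;
   by _Imputation_;
run;

/* Step 3: pool the estimates across imputations.                     */
proc mianalyze data=reg_est;
   modeleffects Intercept x2 x3 x4 x5;
run;
```

The pooled coefficients from the last step are the ones to set next to the PROC CALIS output. Because imputation involves random draws, expect the two sets of estimates to be close rather than identical.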
So FIML is very much like doing a whole bunch of multiple imputations and then running your model. It uses all of the information, so you do not delete any observations. So he says, and I believe him, but I am still going to try it when I get home. You should read the paper. It was really good. Seriously, there was a lot more to it and it was all extremely clear, with less discussion of mangoes than you found here. It is number one on my list from here on out when people ask me if such a thing as a clear explanation of at least one aspect of SEM exists. Also, full information maximum likelihood is a relatively new concept to most people, so having it explained clearly was most helpful.