Am I missing something here? All of the macros I have seen for the parallel analysis criterion for factor analysis look pretty complicated, but, unless I am missing something, it is a simple deal.

The presumption is this:

There isn’t a number like a t-value or F-value to use to test if an eigenvalue is significant. However, it makes sense that the eigenvalue should be larger than if you factor analyzed a set of random data.

Random data is, well, random, so it’s possible you might have gotten a really large or really small eigenvalue the one time you analyzed the random data. So, what you want to do is analyze a set of random data with the same number of variables and the same number of observations a whole bunch of times.

Horn, back in 1965, was proposing that the eigenvalue should be higher than the average of when you analyzed a set of random data. Now, people are suggesting it should be higher than 95% of the time you analyzed random data (which kind of makes sense to me).

Either way, it seems simple. Here is what I did and it seems right so I am not clear why other macros I see are much more complicated. Please chime in if you see what I’m missing.

  1. Randomly generate a set of random data with N variables and Y observations.
  2. Keep the eigenvalues.
  3. Repeat 500 times.
  4. Combine the 500 datasets  (each will only have 1 record with N variables)
  5. Find the 95th percentile

%macro para(numvars,numreps) ;
%DO k = 1 %TO 500 ;
data A;
array nums {&numvars} a1- a&numvars ;
do i = 1 to &numreps;
do j = 1 to &numvars ;
nums{j} = rand(“Normal”) ;
if j < 2 then nums{j} = round(100*nums{j}) ;
else nums{j} = round(nums{j}) ;
end ;
drop i j ;

proc factor data= a outstat = a&k  noprint;
var a1 – a&numvars ;
data a&k ;
set  a&k  ;
if trim(_type_) = “EIGENVAL” ;

%END ;
%mend ;

%para(30,1000) ;

data all ;
set a1-a500 ;

proc univariate data= all noprint ;
var a1 – a30 ;
output out = eigvals pctlpts =  95 pctlpre = pa1 – pa30;

*** You don’t need the transpose but I just find it easier to read ;
proc transpose data= eigvals out=eigsig ;
Title “95th Percentile of Eigenvalues ” ;
proc print data = eigsig ;
run ;

It runs fine and I have puzzled and puzzled over why a more complicated program would be necessary. I ran it 500 times with 1,000 observations and 30 variables and it took less than a minute on a remote desktop with 4GB RAM. Yes, I do see the possibility that if you had a much larger data set that you would want to optimize the speed in some way. Other than that, though, I can’t see why it needs to be any more complicated than this.

If you wanted to change the percentile, say, to 50, you would just change the 95 above. If you wanted to change the method from say, Principal Components Analysis (the default, with commonality of 1) to saying else, you could just do that in the PROC FACTOR step above.

The above assumes a normal distribution of your variables, but if that was not the case, you could change that in the RAND function above.

As I said, I am puzzled. Suggestions to my puzzlement welcome.


I was reading a book on PHP just to get some ideas for a re-design I’m doing for a client, when I thought of this.

Although I think of PHP as something you use to put stuff into a database and take it out –  data entry of client records, reports of total sales – it is possible to use without any SQL intervention.

You can enter data in a form, call another file and use the data entered to determine what you show in that file. The basic examples used to teach are trivial – a page asks what’s your name and the next page that pops up says, “Howdy”  + your name .

We make games that teach math using Unity 3D, Javascript, PHP, MySQL and C# .

Generally, when a player answers a question, the answer is written to our database and the next step depends on whether the answer was correct. Correct, go on with the game. Incorrect, pick one of these choices to study the concept you missed. Because schools use our games, they want this database setup so they can get reports by student, classroom, grade or school.

What about those individual users, though? They can tell where they/ their child is by just looking at the level in the game.  So, I could drop the PHP altogether in that case and just use Javascript to serve the next step.

I could also use PHP.

In the case where we drop out the database interface altogether, is there a benefit to keeping PHP? I couldn’t think of one.

Still thinking about this question.

Today, I was thinking about using data from the National Hospital Discharge Survey to try to predict type of hospital admission. Is it true that some people use the emergency room as their primary method of care? Mostly, I wanted to poke around wit the NHDS data and get to know it better for possible use for teaching statistics.  Before I could do anything, though, I needed to get the data into a usable form.

I decided to use as my dependent variable the type of hospital admission. There were certain categories, though, that were not going to be dependent on much else, for example – if you are an infant born in a hospital, your admission type is newborn. I also deleted the people whose admission type was not given.

The next question was what would be interesting predictor variables. Unfortunately, some of what I thought would be useful had less than perfect data, for example, discharge status, about 5% of the patients had a status of “Alive, disposition not stated”.

I also thought either diagnostic group or primary diagnosis would be a good variable for prediction. When I did a frequency distribution for each it was ridiculously long, so I thought I would be clever and only select those diagnoses where it was .05% or more, which is over 60 people. Apparently, there is more variation in diagnosis than I thought because in both cases that was over 330 different diagnoses.

Here is a handy little tip, by the way -


TABLES dx1  / OUT = freqcnt ;

PROC PRINT DATA = freqcnt ;


Will only print out the diagnoses that occurred over the specified percentage of the time.

I thought  what about the diagnoses that were at least .5% of the admissions?  So, I re-ran the analyses with 0.5  and came up with 41 DRGs. I didn’t want to type in 41 separate DRGs, especially because I thought I might want to change the cut off point later, so I used a SAS format, like this. Note that in a CNTLIN dataset, which I am creating, the variables MUST have the names fmtname, label and start.

Also, note that the RENAME statement doesn’t take effect until you write out the new dataset, so your KEEP statement has to have the old variable name, in this case, drg.

Data fmtc ;
set freqcnt ;
if percent > 0.5 ;
retain fmtname ‘drgf’  ;
retain label “in5″ ;
rename drg = start ;
keep fmtname drg label ;

Okay, so, what I did here was create a dataset that assigns the formatted value of in5 to everyone of my diagnosis related groups that occurs in .5% of the discharges or more.

To actually create the format, I need one more step

proc format cntlin = fmtc ;

Then,  I can use this format to pull out my sample

DATA analyze2 ;
SET nhds.nhds10 ;
IF admisstype in (1,2,3) ;
IF dischargestatus in (1,3,4,6) & PUT(drg,drgf.) = “in5″ then insample = 1 ;
ELSE insample = 0 ;

I could have just selected the sample that met these criteria, but I wanted to have the option of comparing those I kept in and those I dropped out. Now, I have 71,869 people dropped from the sample and 59,743 that I kept.  (I excluded the newborns from the beginning because we KNOW they are significantly different. They’re younger, for one thing.)

So, now I am perfectly well set up to do a MANOVA with age and days of care as dependent variables. (You’d think there would be more numeric variables in this data set than those two, but surprisingly, even though many variables in the data set are stored as numeric they are actually categorical or ordinal and not really suited to a MANOVA.)

Anyway ….  I think that MANOVA will be one of the first analyses we do in my multivariate course. It’s going to be fun.



One definition of insanity is doing the same thing over and over, expecting different results. One thing that can drive programmers insane is doing the same thing over again and GETTING different results.

In a past life, working in tech support, I learned that whenever anyone calls and says,

I did it exactly like your example and it didn’t work for me.

- they are lying.

In my experience, when you have the same programming statements but get different results, something else is always different and that something is often the demands put on the system.

How can that be if your statements are the same? Let me give two examples, one using javascript and one using SAS.


I had made a game using canvas and html5. The game had three layers. The background was the bottom layer, some game objects that mostly stayed in the same place were the middle layer and the top layer was a game object that changed with each move. The init function on load drew all the layers. On update, all three layers were updated by calling one function. All was well.

function drawAll() {


Then, I made another game the exact same way and I could not get rid of the screen flicker every time the player piece on the top layer moved. I tried clearing the canvas between each re-draw which had solved the problem in the past. Nope. What finally did work, in case you run into this problem yourself, is that I only drew the background in the init function and never re-drew it.

function init(){
layer3 = document.getElementById(“layer3″) ;
layer2 = document.getElementById(“layer2″) ;
layer1 = document.getElementById(“layer1″);
ctx = layer1.getContext(’2d’) ;
ctx2 = layer2.getContext(’2d’) ;
ctx3 = layer3.getContext(’2d’);
startwall() ;
draw1() ;
draw2() ;
draw3() ;


function drawall() {


Problem solved. My conclusion was that the second program involved a lot more complicated drawing of objects. Instead of just placing an image here or there, the program needed to compute  collisions, read lines from an array, draw objects and the time it took was noticeable.


Several times I have written a program that worked wonderfully on a high performance computing cluster but crashed on my laptop, or failed on SAS on demand but worked beautifully on my desktop . The difference in all of those cases was that the processing requirements exceeded the capabilities of the machine. All is not lost in those cases. One pretty obvious but not always feasible solution is to use a different machine. When that isn’t an option, there are workarounds. For example, if I wanted students to analyze an enormous dataset, I could have them analyze the correlation matrix instead of trying to load a 100gb dataset – but that is another post.

When I was running out to the airport, I said I would explain how to get a single plot of all of your scatter plots with SAS. Let’s see if I can get this posted before the plane lands. Not likely but you never know …

It’s super simple

proc sgscatter data=sashelp.heart ;
matrix ageatdeath ageatstart agechddiag mrw / group= sex ;

And there you go.



Statistical graphics from 10,000 feet. Is this a full service blog or what?

My life is upside down. All day, as my job, I spent writing a program to get a little man to run around a maze, come out the other end and have a new screen come up with a math challenge question. Then, in the evening, I’m surfing the web for interesting bits to read on multivariate statistics.

I’m teaching a course this winter and could not find the Goldilocks textbook, you know, the one that is just right. They either had no details – just pointing and clicking to get results – or the wrong kind of details. One book had pages of equations, then code for producing some output with very little explanation and no interpretation of the output.

I finally did select a textbook but it was a little less thorough than I would like in some places. I decided to supplement it with required and optional readings from other sources. Thus, the websurfing.

One book I came across that is a good introduction for the beginning of a course is Applied Multivariate Statistical Analysis, by Hardle and Simar. You can buy the latest version for $61 but really, the introduction from 2003 is still applicable. I was delighted to see someone else start with the same point as I do – descriptive statistics.

Whether you have a multivariate question or univariate one, you should still begin with understanding your data. I really liked the way they used plots to visualize multiple variables. I knocked a few of these out in a minute using SAS Studio.

symbol1 v= squarefilled ;
proc gplot data=sashelp.heart ;
plot ageatdeath*agechddiag = sex ;
plot ageatdeath*ageatstart = sex ;
plot ageatdeath*mrw = sex ;
Title “Male ” ;
proc gchart data=sashelp.heart ;
vbar mrw / midpoints = (70 to 270 by 10) ;
where sex = “Male” ;
run ;
Title “Female ” ;
proc gchart data=sashelp.heart ;
vbar mrw / midpoints = (70 to 270 by 10);
where sex = “Female” ;

If I had more time, I would show all of these plots in one page – a draftman’s plot - but I’m running out to the airport in a minute. Maybe next time. Yes, I do realize these charts are probably terrible for people who are color-blind. Will work on that at some point also.

You can see that the age at diagnosis and death is linearly related. It seems there are many more males than females and the age at death seems higher for females.

You can see a larger picture here.

age diagnosis by age at deathage at start by age at death
The picture with Metropolitan relative weight did not seem nearly as linear, which makes sense because if you think about it, age at start and age at death HAVE to be related. You cannot be diagnosed at age 50 if you died when you were 30.It also seems as if there is more variance for women than men and the distribution is skewed positively for women.
MRW by age at death by gender

The last two graphs seem to bear that out, ( You can see those here – click and scroll down). which makes me want to do a PROC UNIVARIATE and a t-test. It also makes me wonder if it’s possible that weight could be related to age at death for men but not for women. Or, it could just be that as you get older, you put on weight and older people are more likely to die.

My point is that some simple graphs can allow you to look at your variables in 3 dimensions and then compare the relationships of multiple 3-dimensional graphs.

Gotta run – taking a group of middle school students to a judo tournament in Kansas City. Life is never boring.

I’m pretty certain I did not deliberately hide these folders. When I opened up my new and improved SAS Studio, it had tasks but my programs were missing.


If this happens to you and you are full of sadness missing your programs, look to the top right of your screen where you see some horizontal lines. Click on those lines.


folders with programs

A menu will drop down that has the VIEW option. Select that and select FOLDERS. Now you can view your folders where your programs reside. Based on this, you might think I’m against using the tasks. You’d be wrong. I just like having the option to write SAS code if needed. The tasks are super easy to use and students are going to love this. Check my multiple regression  for an example




I selected the data set from the SASHELP library, the dependent and independent variables and there you are – ANOVA table, parameter estimates, plot of dependent observed vs predicted  and if you scrolled down – not here because this is just a screen shot, but in SAS Studio, you’d see all of your diagnostic plots. Awesome sauce.



I’ve been pretty pleased with SAS Studio (the product formerly known as SAS Web Editor), so when Jodi sent me an email with information about using a virtual machine for the multivariate statistics course, I was a bit skeptical. Every time I’ve had to use a remote desktop connection virtual machine for SAS it has been painfully slow. I’ve done it several times but it’s probably been like in 2001, 2003 and 2008 when I was at sites that tried, and generally failed, to use SAS on virtual machines.

Your mileage may vary and here is the danger of testing on a development machine – I have the second-best computer in the office. I have 16GB of RAM and a 3.5 GHz Intel Core i7 processor. Everything from available space (175 GB) to download speed (27Mbps) is probably better than the average student will have.

The previous occasions I was using SAS on a remote virtual machine I had pretty good computers, too,  for the time, but 6 -13 years is pretty dramatic differences in terms of technology.

That being said, the virtual machine offered levels of coolness not available with SAS Studio.

Firstly, size. I did a factor analysis with 206 variables and 51,000 observations because I’m weird like that. I wanted to see what would happen. It extracted 49 factors and performed a varimax rotation in 16.49 seconds. I don’t believe SAS Studio was created with this size of data set in mind.

Secondly, size again. The data sets on the virtual machine added up to several times more than the allowable space for  a course directory in SAS on-demand.
Thirdly, it looked exactly like SAS because it was.

Now, I do realize that the virtual machine with SAS is probably only allowable if your university has a site wide license from SAS.

SAS Studio remains as having the significant advantage of being free and easy.  It also seems to have morphed overnight. I don’t remember these tasks being on the left side, and while they look interesting and useful, they do NOT

  1. Encompass all of the statistics students need to compute in my classes, e.g. , population attributable risk.
  2. Explain where the heck my programs went that I wrote previously. I can still create a new program and save a program and it even shows the folders I had previously as choices to save the new program.

#1 is easily taken care of if I can just find out where the programs are saved, for statistics not available in the task selections, they can just write a program. I’ll look into that this weekend since I have had to get up THREE days this week before 9 a.m. I am thinking I need to get some sleep.


From my initial takes of the latest versions of each, I think I will:

  • Use SAS Studio for my biostatistics course because it is an easy, basic introduction AND, once I figure out where the programs are hidden, I can have students write some simple programs. (It may be in an obvious place but sleep deprivation does strange things to your brain.)
  • Use the virtual machine for multivariate statistics because it allows for larger data sets and, although I did not have a similar size data set in SAS Studio, I am assuming it will run much faster.


Sometimes the benefits of attending a conference aren’t so much the specific sessions you attend as the ideas they spark. One example was at the Western Users of SAS Software conference last week. I was sitting in a session on PROC PHREG and the presenter was talking about analyzing the covariance matrix when it hit me –

Earlier in the week, Rebecca Ottesen (from Cal Poly) and I had been discussing the limitations of directory size with SAS Studio. You can only have 2 GB of data in a course directory. Well, that’s not very big data, now, is it?

It’s a very reasonable limit for SAS to impose. They can’t go around hosting terabytes of data for each course.

If you, the professor, have a regular SAS license, which many professors do, you can create a covariance matrix for your students to analyze. Even if you include 500 variables, that’s going to be a pretty tiny dataset but it has the data you would need for a lot of analyses – factor analysis, structural equation models, regression.

Creating a covariance data set is a piece of cake. Just do this:

proc corr data=sashelp.heart cov outp=mydata.test2 ;
var ageatdeath ageatstart ageCHDdiag ;

The COV option requests the covariances and the OUTP option has those written to a SAS data set.

If you don’t have access to a high performance computer and have to run the analysis on your desktop, you are going to be somewhat limited, but far less than just using SAS Studio.

So — create a covariance matrix and have them analyze that. Pretty obvious and I don’t know why I haven’t been doing it all along.

What about means, frequencies and chi-square and all that, though?

Well, really, the output from a PROC FREQ can condense your data down dramatically. Say I have 10,000,000 people and I want age at death, blood pressure status, cholesterol status, cause of death and smoking status. I can create an output data set like this. (Not that the heart data set has 10,000,000 records but you get the idea.)

Proc freq data= sashelp.heart ;
Tables AgeAtDeath
*Smoking /noprint out=mydata.test1;

This creates a data set with a count variable, which you can use in your WEIGHT statement in just about any procedure, like

proc means data = test1 ;

weight count ;

var ageatdeath ;


Really, you can create “cubes” and analyze your big data on SAS Studio that way.

Yeah, obvious, I know, but I hadn’t been doing it with my students.

I’m just heading off to the Western Users of SAS Software meeting that starts tomorrow.  After the keynote, during which I have promised not to swear even once, I’m doing a SAS Essentials talk on Thursday, where I teach students 10 basic steps that allow them to complete an entire annual report project.

One of these is PROC DATASETS.  It is used twice in the project. First, they get a list of all of the datasets in the directory. We’re using SAS Studio which runs on the SAS server. Since students neither have access to issue Unix commands directly nor do they know any most likely, we use PROC DATASETS.

libname mydata  "/courses/u_mine.edu1/i_1234/c_7890/wuss14/";
proc datasets library= mydata ;

This gives me the output below.

# Name Member Type File Size Last Modified
1 SLPOST_SCORED DATA 208896 26Jun14:04:00:40
2 SLPRE_SCORED DATA 487424 26Jun14:04:00:41
3 SL_ANSWERS DATA 619520 26Jun14:03:59:42
4 SL_PRE_POST DATA 196608 26Jun14:04:00:03


Once we have cleaned up the data in every data set, we are not quite ready to start merging them together. A common problem is that data sets have different names, lengths or types for the same variable. You’d be wise to check the variable names, types and lengths of all the variables.  So, here is where we use PROC DATASETS a second time.

proc datasets library= work ;
contents data = _all_ ;

This time, we added another statement. The “contents data = _all_ “ will print the contents of all of the data sets. In perusing the contents, I see that grade is entered as character data in one – 5th, 4th and so on, while it is numeric data in another. This is the sort of thing you never run into in “back of the textbook” data, but that shows up often in real life.

Those are two super simple steps that allow you to do useful things.

You can do more with PROC DATASETS – append, compare – but my plane is boarding so more about that some other time.



Next Page →