Any time you learn anything new it can be intimidating. That is true of programming as well as anything else. It may be even more true of using statistical software because you combine the uneasiness many people have about learning statistics with learning a new language.

To a statistician, this error message makes perfect sense:

ERROR: Variable BP_Status in list does not match type prescribed for this list.

but to someone new to both statistics and SAS it may be clear as mud.

Here is your problem.

The procedure you are using, PROC UNIVARIATE , PROC MEANS is designed ONLY for numeric variables. You have tried to use it for a categorical variable.

This error means you’ve used a categorical variable in a list where only numeric variables are expected. For example, bp_status is “High”, “Normal” and “Optimal”

You cannot find the mean or standard deviation of words, so your procedure has an error.

So … what do you do if you need descriptive statistics?

Go back to your PROC UNIVARIATE or PROC MEANS and delete the offending variables. Re-run it with only numeric variables.

For your categorical variables, use a PROC FREQ  for a frequency distribution  and/ or PROC GCHART.

Problem solved.

You’re welcome.

In August, I attended a class at Unite 2014 (on Unity game development) and the presenter said,

“And some of you, your code won’t run and you’ll swear you did exactly what was shown in the examples. But, of course, all of the rest of us will know that is not true.”

This perfectly describes my experience teaching. For example, the problem with the LIBNAME.

I tell students,

Do not just copy and paste the LIBNAME from a Word document into your program. Often, this will cause problems because of extra formatting codes in the word processor. You may not see the code as any different from what you typed in, but it may not work. Type your LIBNAME statement into the program.

Apparently, students believe that when I say,

Do not just copy and paste the LIBNAME statement.

either, that what I really mean is,

Sure, go ahead and copy and paste the LIBNAME statement

or, that I did mean it but that is only because I want to force them to do extra typing, or because I am so old that I am against copying and pasting as a new-fangled invention and how the hell would I know if they copied and pasted it anyway.

Then their program does not work.

Very likely, their log looks something like this:

58         LIBNAME mydata “/courses/d1234455550/c_2222/” access=readonly;

59          run ;

NOTE: Library MYDATA does not exist.

All quotation marks are not created equal.

What you see above if you look very closely is that the end quote at the end of the path for the LIBNAME statement does not exactly match the beginning quote. Therefore, your reference for your library was not


but rather, something like

/courses/d1234455550/c_2222/  access=readonly run ;

Which is not what you had in mind, and, as SAS very reasonably told you, that directory does not exist.

The simplest fix: delete the quotation marks and TYPE in quotes.

LIBNAME mydata ‘/courses/d1234455550/c_2222/’ access=readonly;

If that doesn’t work, do what I said to begin with. Erase your whole LIBNAME statement and TYPE it into the program without copying and pasting.

Contrary to appearances, I don’t just make this shit up.

Computing confidence intervals is one of the areas where beginning statistics students have the most trouble. It is not as difficult if you break it down into steps, and if you use SAS or other statistical software.

Here are the steps:

1. Compute the statistic of interest– that is mean, proportion, difference between means
2. Compute the standard error of the statistic
3. Obtain critical value. Do you have 30 or more in your sample and are you interested in the 95% confidence interval?

  • If yes, multiply standard error by 1.96
  • If no (fewer people), look up t-value for your sample size for .95
  • If no (different alpha level) look up z-value for your alpha level
  • If no (different alpha level  AND less than 30) look up the t-value for your alpha level.

4. Multiply the critical value you obtained in step #3 by the standard error you obtained in #2

5. Subtract the result  you obtained in step #4 from the statistic you obtained in #1 . That is your lower confidence limit.

6. Add the result you obtained in step #4 to the statistic you obtained in #1. That is your upper confidence limit.
Simplifying it with SAS

Here is a homework problem:

The following data are collected as part of a study of coffee consumption among undergraduate students. The following reflect cups per day consumed:

3          4          6          8          2          1          0          2


A. Compute the sample mean.

B. Compute the sample standard deviation.

C. Construct a 95% confidence interval

I did this in SAS as so

data coffee ;
input cups ;
datalines ;
proc means mean std stderr;
var cups ;

I get the follow results.

Analysis Variable : cups
Mean Std Dev Std Error
3.2500000 2.6592158 0.9401748

These results give me A and B. Now, all I need to do to compute C is find the correct critical value.  I look it up and find that it is 2.365

3.25   – 2.365 * .94 = 1.03

3.25 + 2.365 * .94 = 5.47

That is my confidence interval (1.03, 5.47)


If you want to verify it, or just don’t want to do any computations at all, you can do this

Proc means clm mean stddev ;
var cups ;

You will end up with the same confidence intervals.

Prediction: At least one person who reads this won’t believe me, will run the analysis and be surprised when I am right.

SAS Enterprise Miner – Free, on a Mac – Bet You Didn’t See That Coming … but how the hell do you get your data on it?

I wanted to test SAS Text Miner and was surprised to find the university did not have a license. No problem – and it really was, astoundingly, no problem – I had SAS On-Demand Enterprise Miner on a virtual machine using VMware.

I had installed it thinking   – “This probably won’t work but what the hell.”

Here are all the links on this blog on getting SAS Enterprise Miner to work in all of its different flavors, because I am helpful like that.

Let me emphasize that you just better have the correct version of the Java Run Time Environment (jre), don’t say I didn’t warn you, and after you have it running whenever Java asks if you want to update, give it a resounding, “NO!”

So, surprisingly, running Windows 8.1 Pro on a 4GB virtual machine, it pops open no problem.

Okay, now how to find your data.

Turns out that even if you have SAS Enterprise Miner you need to use SAS Studio to upload your data. So, you go to SAS Studio, on the top left hand side of your screen, you see an UP arrow. Click on that arrow and you will be prompted to upload your data.

Not so fast …. where do you want to put your data?

You can only upload the data if you are a professor but since I am,  that should be no problem. There is also a note on my login page that

The directory for all of your courses will be “courses/lalal123/”   .

The LIBNAME for your courses should be

LIBNAME  mydata “courses/lalal123″ access = readonly ;

Except that it isn’t. In fact, my course directory is something like


I found that out only by calling tech support a few months ago where someone told me that. Now, when I look on the left window pane I see several directories, most of which I created, and a few I did not. One of the latter is named my_content. If I click on the my_content directory I see two subdirectories




These are the directories for my two courses. How would you have known to look there if you didn’t call SAS tech support or read this blog? Damned if I know, but hey, you did read it, so good for you.

If you leave off the subdirectory … say you actually followed the instructions on your login page and in your start code had this:

LIBNAME  mydata “courses/lalal123″ access = readonly ;
run ;

Why, it would run without an error but it would show your directory is empty of data sets, which is kind of true because they are all in those subdirectories whose name you needed to find out.

So …. to recap

1. Use SAS Studio to upload the data to the directory you are using for your SAS Enterprise Miner course. (Seems illogical but it works, so just go with it.)

2. In the start code for your SAS Enterprise Miner project, have the LIBNAME statement including the subdirectory which is under the my_content directory.

Once you know what to do, it runs fine. You can access your data, create a diagram, drag the desired nodes to it.

I’ve only been using this for testing purposes for use in a future course. For that it works fine. It is convenient to be able to pull it up on a virtual machine on my Mac. It is pretty slow but nowhere near as bad as the original version years ago, which was so slow as to be useless.

If you teach data mining – or want to – and your campus doesn’t have a SAS Enterprise Miner license, which I believe is equivalent to the cost of the provost’s first born and a kidney – you definitely want to check out SAS On-demand. It’s a little quirky, but so far, so good.



Am I missing something here? All of the macros I have seen for the parallel analysis criterion for factor analysis look pretty complicated, but, unless I am missing something, it is a simple deal.

The presumption is this:

There isn’t a number like a t-value or F-value to use to test if an eigenvalue is significant. However, it makes sense that the eigenvalue should be larger than if you factor analyzed a set of random data.

Random data is, well, random, so it’s possible you might have gotten a really large or really small eigenvalue the one time you analyzed the random data. So, what you want to do is analyze a set of random data with the same number of variables and the same number of observations a whole bunch of times.

Horn, back in 1965, was proposing that the eigenvalue should be higher than the average of when you analyzed a set of random data. Now, people are suggesting it should be higher than 95% of the time you analyzed random data (which kind of makes sense to me).

Either way, it seems simple. Here is what I did and it seems right so I am not clear why other macros I see are much more complicated. Please chime in if you see what I’m missing.

  1. Randomly generate a set of random data with N variables and Y observations.
  2. Keep the eigenvalues.
  3. Repeat 500 times.
  4. Combine the 500 datasets  (each will only have 1 record with N variables)
  5. Find the 95th percentile

%macro para(numvars,numreps) ;
%DO k = 1 %TO 500 ;
data A;
array nums {&numvars} a1- a&numvars ;
do i = 1 to &numreps;
do j = 1 to &numvars ;
nums{j} = rand(“Normal”) ;
if j < 2 then nums{j} = round(100*nums{j}) ;
else nums{j} = round(nums{j}) ;
end ;
drop i j ;

proc factor data= a outstat = a&k  noprint;
var a1 – a&numvars ;
data a&k ;
set  a&k  ;
if trim(_type_) = “EIGENVAL” ;

%END ;
%mend ;

%para(30,1000) ;

data all ;
set a1-a500 ;

proc univariate data= all noprint ;
var a1 – a30 ;
output out = eigvals pctlpts =  95 pctlpre = pa1 – pa30;

*** You don’t need the transpose but I just find it easier to read ;
proc transpose data= eigvals out=eigsig ;
Title “95th Percentile of Eigenvalues ” ;
proc print data = eigsig ;
run ;

It runs fine and I have puzzled and puzzled over why a more complicated program would be necessary. I ran it 500 times with 1,000 observations and 30 variables and it took less than a minute on a remote desktop with 4GB RAM. Yes, I do see the possibility that if you had a much larger data set that you would want to optimize the speed in some way. Other than that, though, I can’t see why it needs to be any more complicated than this.

If you wanted to change the percentile, say, to 50, you would just change the 95 above. If you wanted to change the method from say, Principal Components Analysis (the default, with commonality of 1) to saying else, you could just do that in the PROC FACTOR step above.

The above assumes a normal distribution of your variables, but if that was not the case, you could change that in the RAND function above.

As I said, I am puzzled. Suggestions to my puzzlement welcome.


I was reading a book on PHP just to get some ideas for a re-design I’m doing for a client, when I thought of this.

Although I think of PHP as something you use to put stuff into a database and take it out –  data entry of client records, reports of total sales – it is possible to use without any SQL intervention.

You can enter data in a form, call another file and use the data entered to determine what you show in that file. The basic examples used to teach are trivial – a page asks what’s your name and the next page that pops up says, “Howdy”  + your name .

We make games that teach math using Unity 3D, Javascript, PHP, MySQL and C# .

Generally, when a player answers a question, the answer is written to our database and the next step depends on whether the answer was correct. Correct, go on with the game. Incorrect, pick one of these choices to study the concept you missed. Because schools use our games, they want this database setup so they can get reports by student, classroom, grade or school.

What about those individual users, though? They can tell where they/ their child is by just looking at the level in the game.  So, I could drop the PHP altogether in that case and just use Javascript to serve the next step.

I could also use PHP.

In the case where we drop out the database interface altogether, is there a benefit to keeping PHP? I couldn’t think of one.

Still thinking about this question.

Today, I was thinking about using data from the National Hospital Discharge Survey to try to predict type of hospital admission. Is it true that some people use the emergency room as their primary method of care? Mostly, I wanted to poke around wit the NHDS data and get to know it better for possible use for teaching statistics.  Before I could do anything, though, I needed to get the data into a usable form.

I decided to use as my dependent variable the type of hospital admission. There were certain categories, though, that were not going to be dependent on much else, for example – if you are an infant born in a hospital, your admission type is newborn. I also deleted the people whose admission type was not given.

The next question was what would be interesting predictor variables. Unfortunately, some of what I thought would be useful had less than perfect data, for example, discharge status, about 5% of the patients had a status of “Alive, disposition not stated”.

I also thought either diagnostic group or primary diagnosis would be a good variable for prediction. When I did a frequency distribution for each it was ridiculously long, so I thought I would be clever and only select those diagnoses where it was .05% or more, which is over 60 people. Apparently, there is more variation in diagnosis than I thought because in both cases that was over 330 different diagnoses.

Here is a handy little tip, by the way –


TABLES dx1  / OUT = freqcnt ;

PROC PRINT DATA = freqcnt ;


Will only print out the diagnoses that occurred over the specified percentage of the time.

I thought  what about the diagnoses that were at least .5% of the admissions?  So, I re-ran the analyses with 0.5  and came up with 41 DRGs. I didn’t want to type in 41 separate DRGs, especially because I thought I might want to change the cut off point later, so I used a SAS format, like this. Note that in a CNTLIN dataset, which I am creating, the variables MUST have the names fmtname, label and start.

Also, note that the RENAME statement doesn’t take effect until you write out the new dataset, so your KEEP statement has to have the old variable name, in this case, drg.

Data fmtc ;
set freqcnt ;
if percent > 0.5 ;
retain fmtname ‘drgf’  ;
retain label “in5″ ;
rename drg = start ;
keep fmtname drg label ;

Okay, so, what I did here was create a dataset that assigns the formatted value of in5 to everyone of my diagnosis related groups that occurs in .5% of the discharges or more.

To actually create the format, I need one more step

proc format cntlin = fmtc ;

Then,  I can use this format to pull out my sample

DATA analyze2 ;
SET nhds.nhds10 ;
IF admisstype in (1,2,3) ;
IF dischargestatus in (1,3,4,6) & PUT(drg,drgf.) = “in5″ then insample = 1 ;
ELSE insample = 0 ;

I could have just selected the sample that met these criteria, but I wanted to have the option of comparing those I kept in and those I dropped out. Now, I have 71,869 people dropped from the sample and 59,743 that I kept.  (I excluded the newborns from the beginning because we KNOW they are significantly different. They’re younger, for one thing.)

So, now I am perfectly well set up to do a MANOVA with age and days of care as dependent variables. (You’d think there would be more numeric variables in this data set than those two, but surprisingly, even though many variables in the data set are stored as numeric they are actually categorical or ordinal and not really suited to a MANOVA.)

Anyway ….  I think that MANOVA will be one of the first analyses we do in my multivariate course. It’s going to be fun.



One definition of insanity is doing the same thing over and over, expecting different results. One thing that can drive programmers insane is doing the same thing over again and GETTING different results.

In a past life, working in tech support, I learned that whenever anyone calls and says,

I did it exactly like your example and it didn’t work for me.

- they are lying.

In my experience, when you have the same programming statements but get different results, something else is always different and that something is often the demands put on the system.

How can that be if your statements are the same? Let me give two examples, one using javascript and one using SAS.


I had made a game using canvas and html5. The game had three layers. The background was the bottom layer, some game objects that mostly stayed in the same place were the middle layer and the top layer was a game object that changed with each move. The init function on load drew all the layers. On update, all three layers were updated by calling one function. All was well.

function drawAll() {


Then, I made another game the exact same way and I could not get rid of the screen flicker every time the player piece on the top layer moved. I tried clearing the canvas between each re-draw which had solved the problem in the past. Nope. What finally did work, in case you run into this problem yourself, is that I only drew the background in the init function and never re-drew it.

function init(){
layer3 = document.getElementById(“layer3″) ;
layer2 = document.getElementById(“layer2″) ;
layer1 = document.getElementById(“layer1″);
ctx = layer1.getContext(‘2d’) ;
ctx2 = layer2.getContext(‘2d’) ;
ctx3 = layer3.getContext(‘2d’);
startwall() ;
draw1() ;
draw2() ;
draw3() ;


function drawall() {


Problem solved. My conclusion was that the second program involved a lot more complicated drawing of objects. Instead of just placing an image here or there, the program needed to compute  collisions, read lines from an array, draw objects and the time it took was noticeable.


Several times I have written a program that worked wonderfully on a high performance computing cluster but crashed on my laptop, or failed on SAS on demand but worked beautifully on my desktop . The difference in all of those cases was that the processing requirements exceeded the capabilities of the machine. All is not lost in those cases. One pretty obvious but not always feasible solution is to use a different machine. When that isn’t an option, there are workarounds. For example, if I wanted students to analyze an enormous dataset, I could have them analyze the correlation matrix instead of trying to load a 100gb dataset – but that is another post.

When I was running out to the airport, I said I would explain how to get a single plot of all of your scatter plots with SAS. Let’s see if I can get this posted before the plane lands. Not likely but you never know …

It’s super simple

proc sgscatter data=sashelp.heart ;
matrix ageatdeath ageatstart agechddiag mrw / group= sex ;

And there you go.



Statistical graphics from 10,000 feet. Is this a full service blog or what?

My life is upside down. All day, as my job, I spent writing a program to get a little man to run around a maze, come out the other end and have a new screen come up with a math challenge question. Then, in the evening, I’m surfing the web for interesting bits to read on multivariate statistics.

I’m teaching a course this winter and could not find the Goldilocks textbook, you know, the one that is just right. They either had no details – just pointing and clicking to get results – or the wrong kind of details. One book had pages of equations, then code for producing some output with very little explanation and no interpretation of the output.

I finally did select a textbook but it was a little less thorough than I would like in some places. I decided to supplement it with required and optional readings from other sources. Thus, the websurfing.

One book I came across that is a good introduction for the beginning of a course is Applied Multivariate Statistical Analysis, by Hardle and Simar. You can buy the latest version for $61 but really, the introduction from 2003 is still applicable. I was delighted to see someone else start with the same point as I do – descriptive statistics.

Whether you have a multivariate question or univariate one, you should still begin with understanding your data. I really liked the way they used plots to visualize multiple variables. I knocked a few of these out in a minute using SAS Studio.

symbol1 v= squarefilled ;
proc gplot data=sashelp.heart ;
plot ageatdeath*agechddiag = sex ;
plot ageatdeath*ageatstart = sex ;
plot ageatdeath*mrw = sex ;
Title “Male ” ;
proc gchart data=sashelp.heart ;
vbar mrw / midpoints = (70 to 270 by 10) ;
where sex = “Male” ;
run ;
Title “Female ” ;
proc gchart data=sashelp.heart ;
vbar mrw / midpoints = (70 to 270 by 10);
where sex = “Female” ;

If I had more time, I would show all of these plots in one page – a draftman’s plot - but I’m running out to the airport in a minute. Maybe next time. Yes, I do realize these charts are probably terrible for people who are color-blind. Will work on that at some point also.

You can see that the age at diagnosis and death is linearly related. It seems there are many more males than females and the age at death seems higher for females.

You can see a larger picture here.

age diagnosis by age at deathage at start by age at death
The picture with Metropolitan relative weight did not seem nearly as linear, which makes sense because if you think about it, age at start and age at death HAVE to be related. You cannot be diagnosed at age 50 if you died when you were 30.It also seems as if there is more variance for women than men and the distribution is skewed positively for women.
MRW by age at death by gender

The last two graphs seem to bear that out, ( You can see those here – click and scroll down). which makes me want to do a PROC UNIVARIATE and a t-test. It also makes me wonder if it’s possible that weight could be related to age at death for men but not for women. Or, it could just be that as you get older, you put on weight and older people are more likely to die.

My point is that some simple graphs can allow you to look at your variables in 3 dimensions and then compare the relationships of multiple 3-dimensional graphs.

Gotta run – taking a group of middle school students to a judo tournament in Kansas City. Life is never boring.

Next Page →