### Dec

#### 31

# My Year in Books: Technical Edition

Filed Under Dr. De Mars General Life Ramblings, Software, statistics | Leave a Comment

I read a lot. This year, I finished 308 books on my Kindle app, another dozen on iBooks, a half-dozen on ebrary and 15 or 20 around the house. I don’t read books on paper very often any more. It’s not too practical for me. I go through them at the rate of about a book a night, thanks to a very successful speed reading program when I was young (thank you, St. Mary’s Elementary School). Don’t be too impressed. I don’t watch TV and I read a lot of what I colleague called, “Junk food for the brain”. I read a bunch of Agatha Christie novels, three Skullduggery Pleasant books, several of the Percy Jackson and the Olympian books. Yes, when it comes to my fiction reading, I have the interests of a fourteen-year-old girl. Trying to read like a grown up, I also read a bunch of New York Times bestseller novels and didn’t like any of them.

So, I decided to do my own “best books list” based on a random sample of one, me, and make up my own categories.

Because I taught a course on multivariate statistics, I read a lot of books in that area, and while several of them were okay, there was only one that I really liked.

**The winner for best statistics book I read this year, **

*Applied logistic regression, 3rd Edition,* by David Hosmer, Stanley Lemmeshow and Rodney Sturdivant.

I really liked this book. I’m not new to logistic regression, but I’m always looking for new ideas, new ways to teach, and this book was chock full of them. What I liked most about it is that they used examples with real data, e.g., when discussing multinomial logistic regression, the dependent variable was type of placement for adolescents, and one of the predictor variables was how likely the youthful offender was to engage in violence against others. It is a very technical book and if you are put off by matrix multiplication and odds ratios, this isn’t the book for you. On the other hand, if you want any in depth understanding of logistic regression from a practical point of view, read it from the beginning to end.

**Best SAS book I read this year …**

Let me start with the caveat that I have been using SAS for over 30 years and I don’t teach undergraduates, so I have not read any basic books at all. I read a lot of books on a range of advanced topics and most of them I found to be just – meh. Maybe it is because I had read all the good books previously and so the only ones I had left unread lying around were just so-so. All that being said, the winner is …

*Applied statistics and the SAS programming language (5th Ed)*, by Ronald Cody and Jeffrey Smith

This book has been around for eight years and I had actually read parts of it a couple of years ago, but this was the first time I read through the whole book. It’s a very readable intermediate book. Very little mathematics is included. It’s all about how to write SAS code to produce a factor analysis, repeated measures ANOVA, etc. It has a lot of random stuff thrown in, like a review of functions, and working with date data. If you have a linear style of learning and teaching, you might hate that. Personally, I liked that about it. This book was published eight years ago, which is an eon in programming time, but a chi-square or ANOVA have been around 100 years, so that wasn’t an issue. While I don’t generally like the use of simulated data for problems in statistics, for teaching this was really helpful because when students were first exposed to a new concept they didn’t need to get a codebook, fix the data. For the purpose of teaching applied statistics, it’s a good book.

**Best Javascript programming book I read this year**

I read a lot of Javascript books and found many of them interesting and useful, so this was a hard choice.

*The jQuery cookbook, edited by Cody Lindley*

was my favorite. If you haven’t gathered by now, I’m fond of learning by example, and this book is pretty much nothing but elaborate examples along the lines of , “Say you wanted to make every other row in a table green”. There are some like that I can imagine wanting to do and others I cannot think of any need to use ever. However, those are famous last words. When I was in high school, I couldn’t imagine I would ever use the matrix algebra we were learning.

**Best game programming book I read this year**

Again, I read a lot of game programming books. I didn’t read a lot of mediocre game programming books. They all were either pretty good or sucked. The best of the good ones was difficult to choose, but I did

*Building HTML5 Games by Jesse Freeman*

This is a very hands-on approach to building 2-D games with impact, with, you guessed it, plenty of examples. I was excited to learn that he has several other books out. I’m going to read all of them next year.

So, there you have it …. my favorite technical books that I read this year. Feel free to make suggestions for what I ought to read next year.

### Dec

#### 2

What if you wanted to turn your PROC MIXED into a repeated measures ANOVA using PROC GLM. Why would you want to do this? Well, I don’t know why you would want to do it but I wanted to do it because I wanted to demonstrate for my class that both give you the same fixed effects F value and significance.

I started out with the Statin dataset from the Cody and Smith textbook. In this data set, each subject has three records,one each for drugs A, B and C. To do a mixed model with subject as a random effect and drug as a fixed effect, you would code it as so. Remember to include both the subject variable and your fixed effect in the CLASS statement.

Proc mixed data = statin ;

class subj drug ;

model ldl = drug ;

random subj ;

To do a repeated measures ANOVA with PROC GLM you need three variables for each subject, not three records.

**First, create three data sets** for Drug A, Drug B and Drug C.

Data one two three ;

set statin ;

if drug = ‘A’ then output one ;

else if drug = ‘B’ then output two ;

else if drug = ‘C’ then output three ;

**Second, sort these datasets and as you read in each one, rename LDL** to a new name so that when you merge the datasets you have three different names. Yes, I really only needed to rename two of them, but I figured it was just neater this way.

proc sort data = one (rename= (ldl =ldla)) ;

by subj ;

proc sort data= two (rename = (ldl = ldlb)) ;

by subj ;

proc sort data=three (rename =(ldl = ldlc)) ;

by subj ;

**Third, merge the three datasets by subject.**

data mrg ;

merge one two three ;

by subj ;

Fourth, run your repeated measures ANOVA .

Your three times measuring LDL are the dependent . It seems weird to not have an independent on the other side of the equation, but that’s the way it is. In your REPEATED statement you give a name for the repeated variable and the number of levels. I used “drug” here to be consistent but actually, this could be any name at all. I could have used “frog” or “rutabaga” instead and it would have worked just as well.

proc glm data = mrg ;

model ldla ldlb ldlc = /nouni ;

repeated drug 3 (1 2 3) ;

run ;

Now you can be happy.

### Nov

#### 19

# The most common error new SAS users make

Filed Under Software, statistics, Technology | 1 Comment

Any time you learn anything new it can be intimidating. That is true of programming as well as anything else. It may be even more true of using statistical software because you combine the uneasiness many people have about learning statistics with learning a new language.

To a statistician, this error message makes perfect sense:

`ERROR: Variable BP_Status in list does not match type prescribed for this list.`

but to someone new to both statistics and SAS it may be clear as mud.

Here is your problem.

The procedure you are using, PROC UNIVARIATE , PROC MEANS is designed ONLY for numeric variables. You have tried to use it for a categorical variable.

This error means you’ve used a categorical variable in a list where only numeric variables are expected. For example, bp_status is “High”, “Normal” and “Optimal”

You cannot find the mean or standard deviation of words, so your procedure has an error.

So … what do you do if you need descriptive statistics?

Go back to your PROC UNIVARIATE or PROC MEANS and delete the offending variables. Re-run it with only numeric variables.

For your categorical variables, use a PROC FREQ for a frequency distribution and/ or PROC GCHART.

Problem solved.

You’re welcome.

### Nov

#### 16

In August, I attended a class at Unite 2014 (on Unity game development) and the presenter said,

“And some of you, your code won’t run and you’ll swear you did exactly what was shown in the examples. But, of course, all of the rest of us will know that is not true.”

This perfectly describes my experience teaching. For example, the problem with the LIBNAME.

I tell students,

Do not just copy and paste the LIBNAME from a Word document into your program. Often, this will cause problems because of extra formatting codes in the word processor. You may not see the code as any different from what you typed in, but it may not work. Type your LIBNAME statement into the program.

Apparently, students believe that when I say,

Do not just copy and paste the LIBNAME statement.

either, that what I really mean is,

Sure, go ahead and copy and paste the LIBNAME statement

or, that I did mean it but that is only because I want to force them to do extra typing, or because I am so old that I am against copying and pasting as a new-fangled invention and how the hell would I know if they copied and pasted it anyway.

Then their program does not work.

Very likely, their log looks something like this:

58 LIBNAME mydata “/courses/d1234455550/c_2222/” access=readonly;

59 run ;

NOTE: Library MYDATA does not exist.

**All quotation marks are not created equal.**

What you see above if you look very closely is that the end quote at the end of the path for the LIBNAME statement does not exactly match the beginning quote. Therefore, your reference for your library was not

/courses/d1234455550/c_2222/

but rather, something like

/courses/d1234455550/c_2222/ access=readonly run ;

Which is not what you had in mind, and, as SAS very reasonably told you, that directory does not exist.

The simplest fix: delete the quotation marks and TYPE in quotes.

LIBNAME mydata ‘/courses/d1234455550/c_2222/’ access=readonly;

If that doesn’t work, do what I said to begin with. Erase your whole LIBNAME statement and TYPE it into the program without copying and pasting.

Contrary to appearances, I don’t just make this shit up.

### Nov

#### 13

# Doing your statistics homework with SAS – confidence intervals

Filed Under Software, statistics | Leave a Comment

Computing confidence intervals is one of the areas where beginning statistics students have the most trouble. It is not as difficult if you break it down into steps, and if you use SAS or other statistical software.

Here are the steps:

1. Compute the statistic of interest– that is mean, proportion, difference between means

2. Compute the standard error of the statistic

3. Obtain critical value. Do you have 30 or more in your sample and are you interested in the 95% confidence interval?

- If yes, multiply standard error by 1.96
- If no (fewer people), look up t-value for your sample size for .95
- If no (different alpha level) look up z-value for your alpha level
- If no (different alpha level AND less than 30) look up the t-value for your alpha level.

4. Multiply the critical value you obtained in step #3 by the standard error you obtained in #2

5. Subtract the result you obtained in step #4 from the statistic you obtained in #1 . That is your lower confidence limit.

6. Add the result you obtained in step #4 to the statistic you obtained in #1. That is your upper confidence limit.

**Simplifying it with SAS**

Here is a homework problem:

The following data are collected as part of a study of coffee consumption among undergraduate students. The following reflect cups per day consumed:

3 4 6 8 2 1 0 2

A. Compute the sample mean.

B. Compute the sample standard deviation.

C. Construct a 95% confidence interval

I did this in SAS as so

data coffee ;

input cups ;

datalines ;

3

4

6

8

2

1

0

2

;

proc means mean std stderr;

var cups ;

I get the follow results.

Analysis Variable : cups | ||
---|---|---|

Mean | Std Dev | Std Error |

3.2500000 | 2.6592158 | 0.9401748 |

These results give me A and B. Now, all I need to do to compute C is find the correct critical value. I look it up and find that it is 2.365

3.25 – 2.365 * .94 = 1.03

3.25 + 2.365 * .94 = 5.47

That is my confidence interval (1.03, 5.47)

=========================

If you want to verify it, or just don’t want to do any computations at all, you can do this

Proc means clm mean stddev ;

var cups ;

You will end up with the same confidence intervals.

Prediction: At least one person who reads this won’t believe me, will run the analysis and be surprised when I am right.

### Oct

#### 28

# SAS Enterprise Miner – Free, on a Mac – But can you find your data?

Filed Under Software, Technology | Leave a Comment

*SAS Enterprise Miner – Free, on a Mac – Bet You Didn’t See That Coming* … but how the hell do you get your data on it?

I wanted to test SAS Text Miner and was surprised to find the university did not have a license. No problem – and it really was, astoundingly, no problem – I had SAS On-Demand Enterprise Miner on a virtual machine using VMware.

I had installed it thinking – “This probably won’t work but what the hell.”

Here are all the links on this blog on getting SAS Enterprise Miner to work in all of its different flavors, because I am helpful like that.

Let me emphasize that you just better have the correct version of the Java Run Time Environment (jre), don’t say I didn’t warn you, and after you have it running whenever Java asks if you want to update, give it a resounding, “NO!”

So, surprisingly, running Windows 8.1 Pro on a 4GB virtual machine, it pops open no problem.

**Okay, now how to find your data.**

Turns out that even if you have SAS Enterprise Miner you need to use SAS Studio to upload your data. So, you go to SAS Studio, on the top left hand side of your screen, you see an UP arrow. Click on that arrow and you will be prompted to upload your data.

Not so fast …. where do you want to put your data?

You can only upload the data if you are a professor but since I am, that should be no problem. There is also a note on my login page that

The directory for all of your courses will be “courses/lalal123/” .

The LIBNAME for your courses should be

LIBNAME mydata “courses/lalal123″ access = readonly ;

Except that it isn’t. In fact, my course directory is something like

“courses/lalal123/c_1223″

I found that out only by calling tech support a few months ago where someone told me that. Now, when I look on the left window pane I see several directories, most of which I created, and a few I did not. One of the latter is named my_content. If I click on the my_content directory I see two subdirectories

c_1223

and

c_7845

These are the directories for my two courses. How would you have known to look there if you didn’t call SAS tech support or read this blog? Damned if I know, but hey, you did read it, so good for you.

If you leave off the subdirectory … say you actually followed the instructions on your login page and in your start code had this:

LIBNAME mydata “courses/lalal123″ access = readonly ;

run ;

Why, it would run without an error but it would show your directory is empty of data sets, which is kind of true because they are all in those subdirectories whose name you needed to find out.

So …. to recap

1. Use SAS Studio to upload the data to the directory you are using for your SAS Enterprise Miner course. (Seems illogical but it works, so just go with it.)

2. In the start code for your SAS Enterprise Miner project, have the LIBNAME statement *including the subdirectory* which is under the my_content directory.

Once you know what to do, it runs fine. You can access your data, create a diagram, drag the desired nodes to it.

I’ve only been using this for testing purposes for use in a future course. For that it works fine. It is convenient to be able to pull it up on a virtual machine on my Mac. It is pretty slow but nowhere near as bad as the original version years ago, which was so slow as to be useless.

If you teach data mining – or want to – and your campus doesn’t have a SAS Enterprise Miner license, which I believe is equivalent to the cost of the provost’s first born and a kidney – you definitely want to check out SAS On-demand. It’s a little quirky, but so far, so good.

### Oct

#### 23

# Parallel Analysis Criterion Simplified?

Filed Under Software, statistics | Leave a Comment

Am I missing something here? All of the macros I have seen for the parallel analysis criterion for factor analysis look pretty complicated, but, unless I am missing something, it is a simple deal.

The presumption is this:

There isn’t a number like a t-value or F-value to use to test if an eigenvalue is significant. However, it makes sense that the eigenvalue should be larger than if you factor analyzed a set of random data.

Random data is, well, random, so it’s possible you might have gotten a really large or really small eigenvalue the one time you analyzed the random data. So, what you want to do is analyze a set of random data with the same number of variables and the same number of observations a whole bunch of times.

Horn, back in 1965, was proposing that the eigenvalue should be higher than the average of when you analyzed a set of random data. Now, people are suggesting it should be higher than 95% of the time you analyzed random data (which kind of makes sense to me).

Either way, it seems simple. Here is what I did and it seems right so I am not clear why other macros I see are much more complicated. Please chime in if you see what I’m missing.

- Randomly generate a set of random data with N variables and Y observations.
- Keep the eigenvalues.
- Repeat 500 times.
- Combine the 500 datasets (each will only have 1 record with N variables)
- Find the 95th percentile

%macro para(numvars,numreps) ;

%DO k = 1 %TO 500 ;

data A;

array nums {&numvars} a1- a&numvars ;

do i = 1 to &numreps;

do j = 1 to &numvars ;

nums{j} = rand(“Normal”) ;

if j < 2 then nums{j} = round(100*nums{j}) ;

else nums{j} = round(nums{j}) ;

end ;

drop i j ;

output;

end;

proc factor data= a outstat = a&k noprint;

var a1 – a&numvars ;

data a&k ;

set a&k ;

if trim(_type_) = “EIGENVAL” ;

%END ;

%mend ;

%para(30,1000) ;

data all ;

set a1-a500 ;

proc univariate data= all noprint ;

var a1 – a30 ;

output out = eigvals pctlpts = 95 pctlpre = pa1 – pa30;

*** You don’t need the transpose but I just find it easier to read ;

proc transpose data= eigvals out=eigsig ;

Title “95th Percentile of Eigenvalues ” ;

proc print data = eigsig ;

run ;

It runs fine and I have puzzled and puzzled over why a more complicated program would be necessary. I ran it 500 times with 1,000 observations and 30 variables and it took less than a minute on a remote desktop with 4GB RAM. Yes, I do see the possibility that if you had a much larger data set that you would want to optimize the speed in some way. Other than that, though, I can’t see why it needs to be any more complicated than this.

If you wanted to change the percentile, say, to 50, you would just change the 95 above. If you wanted to change the method from say, Principal Components Analysis (the default, with commonality of 1) to saying else, you could just do that in the PROC FACTOR step above.

The above assumes a normal distribution of your variables, but if that was not the case, you could change that in the RAND function above.

As I said, I am puzzled. Suggestions to my puzzlement welcome.

### Oct

#### 12

# PHP Rambling

Filed Under computer games, Software | 5 Comments

I was reading a book on PHP just to get some ideas for a re-design I’m doing for a client, when I thought of this.

Although I think of PHP as something you use to put stuff into a database and take it out – data entry of client records, reports of total sales – it is possible to use without any SQL intervention.

You can enter data in a form, call another file and use the data entered to determine what you show in that file. The basic examples used to teach are trivial – a page asks what’s your name and the next page that pops up says, “Howdy” + your name .

We make games that teach math using Unity 3D, Javascript, PHP, MySQL and C# .

Generally, when a player answers a question, the answer is written to our database and the next step depends on whether the answer was correct. Correct, go on with the game. Incorrect, pick one of these choices to study the concept you missed. Because schools use our games, they want this database setup so they can get reports by student, classroom, grade or school.

What about those individual users, though? They can tell where they/ their child is by just looking at the level in the game. So, I could drop the PHP altogether in that case and just use Javascript to serve the next step.

I could also use PHP.

**In the case where we drop out the database interface altogether, is there a benefit to keeping PHP? I couldn’t think of one.**

Still thinking about this question.

### Oct

#### 3

# SAS Tricks for Massaging Data into Shape

Filed Under Software, statistics, Technology | Leave a Comment

Today, I was thinking about using data from the National Hospital Discharge Survey to try to predict type of hospital admission. Is it true that some people use the emergency room as their primary method of care? Mostly, I wanted to poke around wit the NHDS data and get to know it better for possible use for teaching statistics. Before I could do anything, though, I needed to get the data into a usable form.

I decided to use as my dependent variable the type of hospital admission. There were certain categories, though, that were not going to be dependent on much else, for example – if you are an infant born in a hospital, your admission type is newborn. I also deleted the people whose admission type was not given.

The next question was what would be interesting predictor variables. Unfortunately, some of what I thought would be useful had less than perfect data, for example, discharge status, about 5% of the patients had a status of “Alive, disposition not stated”.

I also thought either diagnostic group or primary diagnosis would be a good variable for prediction. When I did a frequency distribution for each it was ridiculously long, so I thought I would be clever and only select those diagnoses where it was .05% or more, which is over 60 people. Apparently, there is more variation in diagnosis than I thought because in both cases that was over 330 different diagnoses.

Here is a handy little tip, by the way –

PROC FREQ DATA = analyze1 NOPRINT ;

TABLES dx1 / OUT = freqcnt ;

PROC PRINT DATA = freqcnt ;

WHERE PERCENT > 0.05 ;

Will only print out the diagnoses that occurred over the specified percentage of the time.

I thought what about the diagnoses that were at least .5% of the admissions? So, I re-ran the analyses with 0.5 and came up with 41 DRGs. I didn’t want to type in 41 separate DRGs, especially because I thought I might want to change the cut off point later, so I used a SAS format, like this. Note that in a CNTLIN dataset, which I am creating, the variables MUST have the names fmtname, label and start.

Also, note that the RENAME statement doesn’t take effect until you write out the new dataset, so your KEEP statement has to have the old variable name, in this case, drg.

Data fmtc ;

set freqcnt ;

if percent > 0.5 ;

retain fmtname ‘drgf’ ;

retain label “in5″ ;

rename drg = start ;

keep fmtname drg label ;

Okay, so, what I did here was create a dataset that assigns the formatted value of in5 to everyone of my diagnosis related groups that occurs in .5% of the discharges or more.

To actually create the format, I need one more step

proc format cntlin = fmtc ;

Then, I can use this format to pull out my sample

DATA analyze2 ;

SET nhds.nhds10 ;

IF admisstype in (1,2,3) ;

IF dischargestatus in (1,3,4,6) & PUT(drg,drgf.) = “in5″ then insample = 1 ;

ELSE insample = 0 ;

I could have just selected the sample that met these criteria, but I wanted to have the option of comparing those I kept in and those I dropped out. Now, I have 71,869 people dropped from the sample and 59,743 that I kept. (I excluded the newborns from the beginning because we KNOW they are significantly different. They’re younger, for one thing.)

So, now I am perfectly well set up to do a MANOVA with age and days of care as dependent variables. (You’d think there would be more numeric variables in this data set than those two, but surprisingly, even though many variables in the data set are stored as numeric they are actually categorical or ordinal and not really suited to a MANOVA.)

Anyway …. I think that MANOVA will be one of the first analyses we do in my multivariate course. It’s going to be fun.

### Oct

#### 1

# Flickering screens, stalled machines and not working like it used to

Filed Under Software, Technology | Leave a Comment

One definition of insanity is doing the same thing over and over, expecting different results. One thing that can drive programmers insane is doing the same thing over again and GETTING different results.

In a past life, working in tech support, I learned that whenever anyone calls and says,

I did it exactly like your example and it didn’t work for me.

- they are lying.

In my experience, when you have the same programming statements but get different results, something else is always different and that something is often the demands put on the system.

How can that be if your statements are the same? Let me give two examples, one using javascript and one using SAS.

**Javascript**

I had made a game using canvas and html5. The game had three layers. The background was the bottom layer, some game objects that mostly stayed in the same place were the middle layer and the top layer was a game object that changed with each move. The init function on load drew all the layers. On update, all three layers were updated by calling one function. All was well.

function drawAll() {

draw1();

draw2();

draw3();

}

Then, I made another game the exact same way and I could not get rid of the screen flicker every time the player piece on the top layer moved. I tried clearing the canvas between each re-draw which had solved the problem in the past. Nope. What finally did work, in case you run into this problem yourself, is that I only drew the background in the init function and never re-drew it.

function init(){

layer3 = document.getElementById(“layer3″) ;

layer2 = document.getElementById(“layer2″) ;

layer1 = document.getElementById(“layer1″);

ctx = layer1.getContext(‘2d’) ;

ctx2 = layer2.getContext(‘2d’) ;

ctx3 = layer3.getContext(‘2d’);

window.addEventListener(‘keydown’,getkeyAndMove,false);

startwall() ;

draw1() ;

draw2() ;

draw3() ;

}

function drawall() {

draw2();

draw3();

}

Problem solved. My conclusion was that the second program involved a lot more complicated drawing of objects. Instead of just placing an image here or there, the program needed to compute collisions, read lines from an array, draw objects and the time it took was noticeable.

**SAS**

Several times I have written a program that worked wonderfully on a high performance computing cluster but crashed on my laptop, or failed on SAS on demand but worked beautifully on my desktop . The difference in all of those cases was that the processing requirements exceeded the capabilities of the machine. All is not lost in those cases. One pretty obvious but not always feasible solution is to use a different machine. When that isn’t an option, there are workarounds. For example, if I wanted students to analyze an enormous dataset, I could have them analyze the correlation matrix instead of trying to load a 100gb dataset – but that is another post.