### Feb

#### 8

# Computing Kappa is a Piece of Cake

Filed Under Software, statistics

Kappa is a useful measure of agreement between two raters. Say you have two radiologists looking at X-rays, rating them as normal or abnormal and you want to get a quantitative measure of how well they agree. Kappa is your go-to coefficient.

How do you compute it? Well, personally, I use SAS because this is the year 2015 and we have computers.

Let’s take this table, where 100 X-rays were rated by two different raters, as an example:

| Physician 1 (rows) / Physician 2 (columns) | Abnormal | Normal |
|---|---|---|
| Abnormal | 40 | 20 |
| Normal | 10 | 30 |

So ….. the first physician rated 60 X-rays as Abnormal. Of those 60, the second physician rated 40 abnormal and 20 normal, and so on.

If you received the data as a SAS data set like this, with an abnormal rating = 1 and normal = 0, then life is easy and you can just do the PROC FREQ.

Rater1 Rater2

1 1

1 1

and so on for 100 lines, one for each X-ray.

However, I very often get not an actual data set but a table like the one above. In this case, it is still relatively simple to code:

DATA compk ;

INPUT rater1 rater2 nums ;

DATALINES ;

1 1 40

1 0 20

0 1 10

0 0 30

;

So, there were 40 x-rays coded as abnormal by both rater1 and rater2. When rater1 = 1 (abnormal) and rater2 = 0 (normal), there were 20, and so on.

The next part is easy

PROC FREQ DATA = compk ;

TABLES rater1*rater2/ AGREE ;

WEIGHT nums ;

RUN ;

That’s it. The WEIGHT statement is necessary in this case because I did not have 100 individual records, I just had a table, so the WEIGHT variable gives the number in each category.

This will work fine for a 2 x 2 table. If you have a table that is larger than 2 x 2, the AGREE option also reports the weighted Kappa coefficient, and at the end you can add the statement

TEST WTKAP ;

This requests a significance test for the weighted Kappa coefficient. If you include this with a 2 x 2 table, nothing happens because the weighted Kappa coefficient and the simple Kappa coefficient are the same in this case.
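If you want to check the simple Kappa by hand, here is a minimal sketch in JavaScript. It is not what PROC FREQ does internally, just the textbook formula (observed agreement minus chance agreement, divided by one minus chance agreement) applied to the counts from the table above. The function name `cohenKappa` is my own invention for illustration.

```javascript
// Cohen's kappa from a square table of counts.
// Rows are rater 1, columns are rater 2, categories in the same order.
function cohenKappa(table) {
  const k = table.length;
  const rowTotals = new Array(k).fill(0);
  const colTotals = new Array(k).fill(0);
  let total = 0;
  let agree = 0;
  for (let i = 0; i < k; i++) {
    for (let j = 0; j < k; j++) {
      const n = table[i][j];
      total += n;
      rowTotals[i] += n;
      colTotals[j] += n;
      if (i === j) agree += n; // diagonal cells are agreements
    }
  }
  const observed = agree / total;
  // Agreement expected by chance, from the marginal proportions.
  let expected = 0;
  for (let i = 0; i < k; i++) {
    expected += (rowTotals[i] / total) * (colTotals[i] / total);
  }
  return (observed - expected) / (1 - expected);
}

const xrays = [
  [40, 20], // rater1 abnormal: rater2 abnormal, rater2 normal
  [10, 30], // rater1 normal:   rater2 abnormal, rater2 normal
];
const kappa = cohenKappa(xrays);
console.log(kappa.toFixed(2)); // 0.40
```

Here the raters agree on 70 of 100 X-rays, chance alone would give 50 of 100, and so Kappa is (.70 - .50) / (1 - .50) = .40.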

See, I told you it was simple.

### Jan

#### 23

# I am tired: In praise of details-oriented people

Filed Under computer games, Dr. De Mars General Life Ramblings, Software, The Julia Group

Our Project Manager, Jessica, made the very insightful comment at lunch the other day,

No one cares how hard it was for you to make. When people are looking to buy your product, all they want to know is what it will do for them.

That young woman has a bright future in marketing. Unfortunately for those who read this blog, I do not, so I am going to tell you how hard it is to make that last push to the finish line.

I quit counting the number of hours I worked this week when I got to 80. I’m sure The Invisible Developer had put in even more, because many nights (mornings?) I have gone to bed at 2 a.m. and when I wake up and check the latest build in the morning I find it was put up at 5 or 6 that morning. There hasn’t been much blogging going on lately and I only have a bit of a minute now because I’m waiting to get the latest latest latest build so that I can make the Windows installer.

I’ve blogged before on the great value I place on “details” people and this week is a prime example of the importance of details.

You’d think that down to and past the wire – the last build of the game was supposed to be today and we have negative 68 minutes left in today – that we would be moving forward pretty quickly. Um, not so much.

At the beginning of development, you can easily find the problems – the question is what fraction of the fish are over one foot long when you caught 125 fish last summer and 25 were over a foot long. The correct answer is 1/5. However, 25/125 is also a correct answer, as is 5/25 . Finding those problems is easy. You can check the answer while you are creating the pages, have it write to the console the correct answer, step through the logic. No problem.

Same thing with playing the 3-D part of the game. If you are at the part where you are supposed to be spearing the fish and there is no spear, then it is an easy enough fix.

HOWEVER, now we are supposedly at the end. So…

- We make a version of the build for Mac and another for Windows.
- We zip the Windows file because many systems block .exe files downloaded from the Internet to prevent malware installation.
- We upload the zipped file to our server.
- We download it.
- We play the game from beginning to end on Mac.
- We play the game from beginning to end on Windows.

That is, we go through every step that a user would — and somewhere along the way we find an error that we somehow missed in all of our earlier testing. Maybe something we fixed in a later stage of the game was a script that was used in an earlier level and now that doesn’t work.

So … we go through all of the steps all over again. Yes, we do have debugging capabilities where we can skip to level 6 and test that, for example, but at the very end, you NEED to go through all of the steps your users will. Trust me. You can put in every unit test you want but it will not let you know that Microsoft or Chrome or some other organization put on this earth to try my patience now has a security feature that blocks the game from installing. You won’t see that three problems and all of the accompanying instructional material were left out. If you start at level 6 you will miss the fact that there is a problem in the transition from level 5 to level 6. And so on ad infinitum until you go to speaking in Latin and wanting to tear out your eyeballs.

We go through all of the details so that when you play it all you see is a game that works.

My high school English teacher told me,

If something is easy to read, you can damn sure believe that it was hard to write.

I think this is also true,

Any kind of software that is easy to use, you can damn sure believe it was hard to make.

### Jan

#### 15

# Descriptives, Details and Death

Filed Under Software, statistics

I think descriptive statistics are under-rated. One reason I like Leon Gordis’ Epidemiology book is that he agrees with me. He says that sometimes statistics pass the “inter-ocular test”. That is, they hit you right between the eyes.

I’m a big fan of eye-balling statistics and SAS/GRAPH is good for that. Let’s take this example. It is fairly well established that women have a longer life span than men in the United States. In other words, men die at a younger age. Is that true of all causes?

To answer that question, I used a subset of the Framingham Heart Study and looked at two major causes of death, coronary heart disease and cancer. The first thing I did was round the age at death into five year intervals to smooth out some of the fluctuations from year to year.

data test2 ;

set sashelp.heart ;

ageatdeath5 = round(ageatdeath,5) ;

proc freq data=test2 noprint;

tables sex*ageatdeath5*deathcause / missing out= test3 ;

/* NOTE THAT THE MISSING OPTION IS IMPORTANT */

**THE DEVIL IS IN THE DETAILS**

Then I did a frequency distribution by sex, age at death and cause of death. Notice that I used the missing option. That is super-important. Without it, instead of getting what percentage of the entire population died of a specific cause at a certain age, I would get a percentage of those who died. However, as with many studies of survival, life expectancy, etc. a substantial proportion of the people were still alive at the time data were being collected. So, percentage of the population, and percentage of people who died were quite different numbers. I used the NOPRINT option on the PROC FREQ statement simply because I had no need to print out a long, convoluted frequency table I wasn’t going to use.
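To see why the denominator matters so much, here is a rough sketch in JavaScript with made-up counts. These are NOT the Framingham numbers, and the `percents` function is only a loose analogue of what the MISSING option changes, not a re-implementation of PROC FREQ.

```javascript
// Hypothetical counts: 1,000 people, 400 of whom have died.
const counts = [
  { deathcause: "Cancer", count: 100 },
  { deathcause: "Coronary Heart Disease", count: 200 },
  { deathcause: "Other", count: 100 },
  { deathcause: null, count: 600 }, // still alive, so cause of death is missing
];

// Percent in each category. When includeMissing is false, rows with a
// missing cause of death are dropped from the denominator.
function percents(rows, includeMissing) {
  const kept = rows.filter((r) => includeMissing || r.deathcause !== null);
  const total = kept.reduce((sum, r) => sum + r.count, 0);
  return kept.map((r) => ({
    deathcause: r.deathcause,
    percent: (100 * r.count) / total,
  }));
}

const ofEveryone = percents(counts, true);
const ofDeaths = percents(counts, false);
console.log(ofEveryone[0].percent); // 10 (cancer deaths as a percent of everyone)
console.log(ofDeaths[0].percent);   // 25 (cancer deaths as a percent of those who died)
```

Same 100 cancer deaths, but 10% of the population versus 25% of the deaths. Which one you want depends on your question, and here the question was about the whole population.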

I used the OUT = option to output the frequency distribution to a dataset that I could use for graphing.

**More details:** The symbol statements just make the graphs easier to read by putting an asterisk at each data point and by joining the points together. I have very bad eyesight so anything I can do to make graphics more readable, I try to do.

symbol1 value = star ;

symbol1 interpol = join ;

Here I am just sorting the data set by cause of death and only keeping those with Cancer or Coronary Heart Disease.

proc sort data=test3;

by deathcause ;

where deathcause in ("Cancer","Coronary Heart Disease");

**Even more details.** You always want to have the axes the same on your charts or you can’t really compare them. That is what the UNIFORM option in the PROC GPLOT statement does. The PLOT statement requests a plot of percent who died at each age by sex. The LABEL statement just gives reasonable labels to my variables.

proc gplot data = test3 uniform;

plot percent*ageatdeath5 = sex ;

by deathcause ;

Label percent = "%"

ageatdeath5 = "Age at Death" ;

When you look at these graphs, even if your eyes are as bad as mine you can see a few things. The top chart is of cancer and you can conclude a couple of things right away.

- There is not nearly the discrepancy in the death rates of men and women for cancer as there is for heart disease.
- Men are much more likely to die of heart disease than women at every age up until 80 years old. After that, I suspect that the percentage of men dying off has declined relative to women because a very large proportion of the men are already dead.

So, the answer to my question is “No.”

### Jan

#### 8

# You’ll never use what you don’t know

Filed Under Dr. De Mars General Life Ramblings, Software

Frequently, I hear adults who should know better argue against learning something, whether it is algebra, analysis of variance or learning a programming language. They say,

I’m 47 years old and I’ve never used (insert thing I didn’t learn here).

Yes, that is true. However, if you had learned it, there is a good chance that you would have used it. (For those of you protesting, “Hey, I learned algebra!”, maybe you did and maybe you didn’t. Read my post on number sense.)

Let’s take this morning as an example. In keeping with my New Year’s Literary resolution, I started out the day reading the jQuery cookbook. There were two things I learned that I expect to use this year. One of them was very simple, but I didn’t know it.

var t1 = +new Date ;

This returns the current date in milliseconds (counted from midnight UTC on January 1, 1970) *converted to a number.* Yes, you could use the JavaScript Number() function, but the unary plus saves you a step.
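A quick sketch to convince yourself that the different spellings give the same kind of value. The variable names are just for illustration.

```javascript
const viaPlus = +new Date();          // unary plus converts the Date to a number
const viaNumber = Number(new Date()); // the longer spelling of the same thing
const viaNow = Date.now();            // built-in shortcut for the current time

// A Date at the epoch converts to zero milliseconds.
const epoch = +new Date(0);

console.log(typeof viaPlus); // "number"
console.log(epoch);          // 0
```

All three are plain numbers, so subtracting one from another gives elapsed milliseconds.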

Now you can use this useful bit of code which I am planning on applying next week when I get done with what I’m working on now. I can use it to see how long a student worked on each individual problem and how long he or she took for the whole test.

(function() {

var log = [], first, last ;

time = function(message, since) {

var now= +new Date ;

var seconds = (now - (since || last)) /1000 ;

log.push(seconds.toFixed(3) + ': ' + message + '<br/>') ;

return last = +new Date ;

};

time.done = function(selector) {

time('total', first) ;

$(selector).html(log.join('')) ;

};

first = last = +new Date ;

})() ;

Now, the author’s interest was in seeing how long each bit of code took to run. However, I can see how this could be really useful in the pretest and posttests we use for our games to see how long the student spent on each problem. We could call this function each time the student clicks on the next arrow to go to the next problem.
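Here is a minimal sketch of how that might look for test problems. This is my own hypothetical variation, not the cookbook's code and not what is in our games: `createProblemTimer`, `next` and the problem ids are all made-up names, and the clock is passed in so the example can run with predictable numbers.

```javascript
// Hypothetical helper: records seconds spent on each problem.
// `clock` returns the current time in milliseconds. It defaults to
// Date.now and can be swapped for a fake clock when testing.
function createProblemTimer(clock = Date.now) {
  const start = clock();
  let last = start;
  const log = [];
  return {
    // Call this when the student clicks the next arrow.
    next(problemId) {
      const now = clock();
      log.push({ problemId, seconds: (now - last) / 1000 });
      last = now;
    },
    // Seconds elapsed since the timer was created.
    total() {
      return (clock() - start) / 1000;
    },
    // A copy of the per-problem log.
    entries() {
      return log.slice();
    },
  };
}

// Usage with a fake clock so the numbers are predictable.
const fakeTimes = [0, 12500, 20000, 20000];
let tick = 0;
const timer = createProblemTimer(() => fakeTimes[tick++]);
timer.next("problem1"); // 12.5 seconds on the first problem
timer.next("problem2"); // 7.5 seconds on the second
const totalSeconds = timer.total(); // 20 seconds for the whole test
```

In a real page you would leave off the fake clock and call `next` from the click handler on the arrow.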

One of the Common Core Standards for mathematical practice is “Make sense of problems and persevere in solving them.”

How do you know if a student is “persevering”? One way would be to measure how long he or she spent on a particular problem before going on to the next. We cannot know for a fact that the student spent time thinking about it rather than staring off into space, but we can at least put an upper bound on the amount of time the student spent thinking about it before going on to the next thing.

This takes me to the point Morton Jervens was making about not everything that counts can be counted and data does not always equal statistics.

While there is truth in that, I would say that much more that counts can be counted and much more of data can be turned into statistics if you know how to do it.

Learn how.

### Dec

#### 31

# My Year in Books: Technical Edition

Filed Under Dr. De Mars General Life Ramblings, Software, statistics

I read a lot. This year, I finished 308 books on my Kindle app, another dozen on iBooks, a half-dozen on ebrary and 15 or 20 around the house. I don’t read books on paper very often any more. It’s not too practical for me. I go through them at the rate of about a book a night, thanks to a very successful speed reading program when I was young (thank you, St. Mary’s Elementary School). Don’t be too impressed. I don’t watch TV and I read a lot of what a colleague called, “Junk food for the brain”. I read a bunch of Agatha Christie novels, three Skulduggery Pleasant books, several of the Percy Jackson and the Olympians books. Yes, when it comes to my fiction reading, I have the interests of a fourteen-year-old girl. Trying to read like a grown up, I also read a bunch of New York Times bestseller novels and didn’t like any of them.

So, I decided to do my own “best books list” based on a random sample of one, me, and make up my own categories.

Because I taught a course on multivariate statistics, I read a lot of books in that area, and while several of them were okay, there was only one that I really liked.

**The winner for best statistics book I read this year,**

*Applied Logistic Regression, 3rd Edition,* by David Hosmer, Stanley Lemeshow and Rodney Sturdivant.

I really liked this book. I’m not new to logistic regression, but I’m always looking for new ideas, new ways to teach, and this book was chock full of them. What I liked most about it is that they used examples with real data, e.g., when discussing multinomial logistic regression, the dependent variable was type of placement for adolescents, and one of the predictor variables was how likely the youthful offender was to engage in violence against others. It is a very technical book and if you are put off by matrix multiplication and odds ratios, this isn’t the book for you. On the other hand, if you want an in-depth understanding of logistic regression from a practical point of view, read it from beginning to end.

**Best SAS book I read this year …**

Let me start with the caveat that I have been using SAS for over 30 years and I don’t teach undergraduates, so I have not read any basic books at all. I read a lot of books on a range of advanced topics and most of them I found to be just – meh. Maybe it is because I had read all the good books previously and so the only ones I had left unread lying around were just so-so. All that being said, the winner is …

*Applied statistics and the SAS programming language (5th Ed)*, by Ronald Cody and Jeffrey Smith

This book has been around for eight years and I had actually read parts of it a couple of years ago, but this was the first time I read through the whole book. It’s a very readable intermediate book. Very little mathematics is included. It’s all about how to write SAS code to produce a factor analysis, repeated measures ANOVA, etc. It has a lot of random stuff thrown in, like a review of functions, and working with date data. If you have a linear style of learning and teaching, you might hate that. Personally, I liked that about it. This book was published eight years ago, which is an eon in programming time, but chi-square and ANOVA have been around for 100 years, so that wasn’t an issue. While I don’t generally like the use of simulated data for problems in statistics, for teaching this was really helpful because when students were first exposed to a new concept they didn’t need to get a codebook or fix the data. For the purpose of teaching applied statistics, it’s a good book.

**Best Javascript programming book I read this year**

I read a lot of Javascript books and found many of them interesting and useful, so this was a hard choice.

*The jQuery cookbook, edited by Cody Lindley*

was my favorite. If you haven’t gathered by now, I’m fond of learning by example, and this book is pretty much nothing but elaborate examples along the lines of, “Say you wanted to make every other row in a table green”. There are some like that I can imagine wanting to do and others I cannot think of any need to use ever. However, those are famous last words. When I was in high school, I couldn’t imagine I would ever use the matrix algebra we were learning.

**Best game programming book I read this year**

Again, I read a lot of game programming books. I didn’t read a lot of mediocre game programming books. They all were either pretty good or sucked. The best of the good ones was difficult to choose, but I did:

*Building HTML5 Games by Jesse Freeman*

This is a very hands-on approach to building 2-D games with Impact, with, you guessed it, plenty of examples. I was excited to learn that he has several other books out. I’m going to read all of them next year.

So, there you have it …. my favorite technical books that I read this year. Feel free to make suggestions for what I ought to read next year.

### Dec

#### 12

# A Dating Site for Data Science

Filed Under Technology

There may come a day (shudder) when I am called upon to find what my mother refers to as “a real job”. I’m not sure how I would go about it. For the past 30 years, here is how my career has gone.

- I was walking in the door at the university where I had dropped in to visit a friend. A professor I’d known as a student years ago was walking out the other door. She said, “Hey, AnnMaria, I need a statistical consultant. Are you available?”
- I was walking in the door of my obstetrician’s office when I ran into someone else I had known years ago. She had been seeing another doctor in the same office. She said, “Hey, I just got a grant and need a statistician.”
- I was bored one day and applied for a job on Dice.com. They called up, interviewed me and hired me.
- I was bored one day and applied for a job I saw advertised in the Chronicle of Higher Ed. They interviewed me and hired me.

I think one can clearly detect a pattern here, mainly that I should spend more time walking in doors to buildings.

When getting a new job, I’ve generally been in the work equivalent of “married but looking”. I know that sounds horrible but what I mean is that I have had a job that I was considering getting out of, but I didn’t necessarily want the people at the job to know that.

This problem is common to people in any field, but I think those in analytic jobs have another problem.

Most of us did not come here by the approved route that Human Resource offices think we should. Personally, I have a B.S. in Business Administration, an MBA, an M.A. and a Ph.D. in Educational Psychology, where I specialized in Applied Statistics and Psychometrics (Tests & Measurement). Along the way, I have had more courses in statistics than anyone with a master’s in the subject, I have worked for thirty years programming in multiple languages – SAS, PHP and JavaScript mostly, with a few early years of Fortran, Basic and some defunct languages. I’ve taught courses in statistics and programming at all levels from undergraduate through doctoral students. Yet …. I do not have a “B.S. in Computer Science” or whatever the requirement du jour is.

Enter analyst finder. This is the new company started by Art Tabachneck of SAS fame. If you’ve been using SAS for any length of time at all, you’ve run across his papers and if you live in Canada and drive a car you have been affected by his work. He uses SAS to set automobile insurance rates.

I checked out the site and it takes less than 15 minutes to fill out a form to be included in their database. The really cool thing is that it asks about so many areas of expertise – what industries you have experience in, are you familiar with SAS, SQL, ANCOVA … it is a very, very long list – but you can just check off the boxes that apply to you.

If I was actually looking for a job, I might have spent a little time filling in the “essay questions” that allow you to expand on your credentials as well.

**How it works**

Currently, Art is compiling a database of analysts. Once this is of reasonable size, employers will be able, for a very modest fee – around $300 – to submit position descriptions. Analysts who match those descriptions will be contacted and asked if they are interested. The 20 names with the closest match who have expressed interest will be sent to the employer with contact information.

As an employer, it sounds like a great service. If I’m ever in the market for a “real job”, as an employee, it’s the first place I would hit up.

So … go check it out. It’s totally free to analysts, which is very broadly defined. If you’re interested, download the form, fill it out and send it back.

It’s a more scientific method than running around the city walking through doors hoping you run into someone who offers you a job.

Speaking of which, I need to be walking in the door of my office in less than 8 hours, so I guess I’ll call it a night.

### Dec

#### 2

What if you wanted to turn your PROC MIXED into a repeated measures ANOVA using PROC GLM? Why would you want to do this? Well, I don’t know why you would want to do it but I wanted to do it because I wanted to demonstrate for my class that both give you the same fixed effects F value and significance.

I started out with the Statin dataset from the Cody and Smith textbook. In this data set, each subject has three records, one each for drugs A, B and C. To do a mixed model with subject as a random effect and drug as a fixed effect, you would code it like so. Remember to include both the subject variable and your fixed effect in the CLASS statement.

Proc mixed data = statin ;

class subj drug ;

model ldl = drug ;

random subj ;

To do a repeated measures ANOVA with PROC GLM you need three variables for each subject, not three records.

**First, create three data sets** for Drug A, Drug B and Drug C.

Data one two three ;

set statin ;

if drug = 'A' then output one ;

else if drug = 'B' then output two ;

else if drug = 'C' then output three ;

**Second, sort these datasets and as you read in each one, rename LDL** to a new name so that when you merge the datasets you have three different names. Yes, I really only needed to rename two of them, but I figured it was just neater this way.

proc sort data = one (rename= (ldl =ldla)) ;

by subj ;

proc sort data= two (rename = (ldl = ldlb)) ;

by subj ;

proc sort data=three (rename =(ldl = ldlc)) ;

by subj ;

**Third, merge the three datasets by subject.**

data mrg ;

merge one two three ;

by subj ;
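If the split, rename and merge steps feel like a lot of moving parts, it may help to see the same long-to-wide reshaping as a sketch in JavaScript. The LDL values below are made up (they are not the Statin data from the textbook) and `longToWide` is a hypothetical name of my own.

```javascript
// Made-up long-format data: three records per subject, one per drug.
const longRows = [
  { subj: 1, drug: "A", ldl: 120 },
  { subj: 1, drug: "B", ldl: 110 },
  { subj: 1, drug: "C", ldl: 105 },
  { subj: 2, drug: "A", ldl: 140 },
  { subj: 2, drug: "B", ldl: 135 },
  { subj: 2, drug: "C", ldl: 128 },
];

// One output record per subject, with ldla, ldlb and ldlc as separate fields.
function longToWide(rows) {
  const bySubject = new Map();
  for (const { subj, drug, ldl } of rows) {
    if (!bySubject.has(subj)) bySubject.set(subj, { subj });
    bySubject.get(subj)["ldl" + drug.toLowerCase()] = ldl;
  }
  // Sort by subject, as the merged SAS data set would be.
  return [...bySubject.values()].sort((a, b) => a.subj - b.subj);
}

const wide = longToWide(longRows);
console.log(wide[0]); // { subj: 1, ldla: 120, ldlb: 110, ldlc: 105 }
```

Six records become two, and each subject now has the three variables that the repeated measures ANOVA needs.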

**Fourth, run your repeated measures ANOVA.**

Your three measurements of LDL are the dependent variables. It seems weird to not have an independent variable on the other side of the equation, but that’s the way it is. In your REPEATED statement you give a name for the repeated variable and the number of levels. I used “drug” here to be consistent but actually, this could be any name at all. I could have used “frog” or “rutabaga” instead and it would have worked just as well.

proc glm data = mrg ;

model ldla ldlb ldlc = /nouni ;

repeated drug 3 (1 2 3) ;

run ;

Now you can be happy.

### Nov

#### 19

# The most common error new SAS users make

Filed Under Software, statistics, Technology

Any time you learn anything new it can be intimidating. That is true of programming as well as anything else. It may be even more true of using statistical software because you combine the uneasiness many people have about learning statistics with learning a new language.

To a statistician, this error message makes perfect sense:

`ERROR: Variable BP_Status in list does not match type prescribed for this list.`

but to someone new to both statistics and SAS it may be clear as mud.

Here is your problem.

The procedure you are using (PROC UNIVARIATE or PROC MEANS, for example) is designed ONLY for numeric variables. You have tried to use it for a categorical variable.

This error means you’ve used a categorical variable in a list where only numeric variables are expected. For example, bp_status takes the values “High”, “Normal” and “Optimal”.

You cannot find the mean or standard deviation of words, so your procedure has an error.

So … what do you do if you need descriptive statistics?

Go back to your PROC UNIVARIATE or PROC MEANS and delete the offending variables. Re-run it with only numeric variables.

For your categorical variables, use a PROC FREQ for a frequency distribution and/or PROC GCHART.
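The same split, a mean for the numbers and counts for the categories, looks like this as a small JavaScript sketch. The records are invented for illustration. Only the variable name bp_status comes from the error message above, while `systolic`, `mean` and `frequencies` are my own names.

```javascript
// Invented records: one numeric variable and one categorical variable.
const records = [
  { systolic: 118, bp_status: "Optimal" },
  { systolic: 126, bp_status: "Normal" },
  { systolic: 150, bp_status: "High" },
  { systolic: 162, bp_status: "High" },
];

// A mean only makes sense for numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// For categories, count how often each value occurs.
function frequencies(values) {
  const counts = {};
  for (const v of values) counts[v] = (counts[v] || 0) + 1;
  return counts;
}

const meanSystolic = mean(records.map((r) => r.systolic));
const statusCounts = frequencies(records.map((r) => r.bp_status));
console.log(meanSystolic); // 139
console.log(statusCounts); // { Optimal: 1, Normal: 1, High: 2 }
```

Ask for the mean of “High”, “Normal” and “Optimal” and you get nonsense in any language. SAS is just polite enough to tell you so.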

Problem solved.

You’re welcome.

### Nov

#### 16

In August, I attended a class at Unite 2014 (on Unity game development) and the presenter said,

“And some of you, your code won’t run and you’ll swear you did exactly what was shown in the examples. But, of course, all of the rest of us will know that is not true.”

This perfectly describes my experience teaching. For example, the problem with the LIBNAME.

I tell students,

Do not just copy and paste the LIBNAME from a Word document into your program. Often, this will cause problems because of extra formatting codes in the word processor. You may not see the code as any different from what you typed in, but it may not work. Type your LIBNAME statement into the program.

Apparently, students believe that when I say,

Do not just copy and paste the LIBNAME statement.

either, that what I really mean is,

Sure, go ahead and copy and paste the LIBNAME statement

or, that I did mean it but that is only because I want to force them to do extra typing, or because I am so old that I am against copying and pasting as a new-fangled invention and how the hell would I know if they copied and pasted it anyway.

Then their program does not work.

Very likely, their log looks something like this:

58 LIBNAME mydata “/courses/d1234455550/c_2222/” access=readonly;

59 run ;

NOTE: Library MYDATA does not exist.

**All quotation marks are not created equal.**

What you see above if you look very closely is that the end quote at the end of the path for the LIBNAME statement does not exactly match the beginning quote. Therefore, your reference for your library was not

/courses/d1234455550/c_2222/

but rather, something like

/courses/d1234455550/c_2222/ access=readonly run ;

Which is not what you had in mind, and, as SAS very reasonably told you, that directory does not exist.

The simplest fix: delete the quotation marks and TYPE in quotes.

LIBNAME mydata '/courses/d1234455550/c_2222/' access=readonly;

If that doesn’t work, do what I said to begin with. Erase your whole LIBNAME statement and TYPE it into the program without copying and pasting.
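If you would like proof that the pasted quotes really are different characters, here is a small sketch in JavaScript that detects the curly quotes a word processor inserts and swaps them for plain ones. It is only an illustration of the problem, with function names I made up, and it is not a substitute for typing the statement.

```javascript
// Matches the curly single and double quotes word processors like to insert.
const CURLY_QUOTES = /[\u2018\u2019\u201C\u201D]/;

function hasCurlyQuotes(text) {
  return CURLY_QUOTES.test(text);
}

// Replace curly quotes with their plain ASCII equivalents.
function straightenQuotes(text) {
  return text
    .replace(/[\u2018\u2019]/g, "'")
    .replace(/[\u201C\u201D]/g, '"');
}

// The statement as it might arrive from a Word document.
const pasted =
  "LIBNAME mydata \u201C/courses/d1234455550/c_2222/\u201D access=readonly;";
const fixed = straightenQuotes(pasted);

console.log(hasCurlyQuotes(pasted)); // true
console.log(hasCurlyQuotes(fixed));  // false
console.log(fixed); // LIBNAME mydata "/courses/d1234455550/c_2222/" access=readonly;
```

The two versions look nearly identical on screen, which is exactly why this error is so hard to spot.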

Contrary to appearances, I don’t just make this shit up.

### Nov

#### 13

# Doing your statistics homework with SAS – confidence intervals

Filed Under Software, statistics

Computing confidence intervals is one of the areas where beginning statistics students have the most trouble. It is not as difficult if you break it down into steps, and if you use SAS or other statistical software.

Here are the steps:

1. Compute the statistic of interest, that is, the mean, proportion, or difference between means

2. Compute the standard error of the statistic

3. Obtain the critical value. Do you have 30 or more in your sample and are you interested in the 95% confidence interval?

- If yes, use 1.96
- If no (fewer people), look up the t-value for your degrees of freedom (sample size minus 1) for .95
- If no (different alpha level), look up the z-value for your alpha level
- If no (different alpha level AND fewer than 30), look up the t-value for your alpha level and degrees of freedom.

4. Multiply the critical value you obtained in step #3 by the standard error you obtained in #2

5. Subtract the result you obtained in step #4 from the statistic you obtained in #1 . That is your lower confidence limit.

6. Add the result you obtained in step #4 to the statistic you obtained in #1. That is your upper confidence limit.

**Simplifying it with SAS**

Here is a homework problem:

The following data are collected as part of a study of coffee consumption among undergraduate students. The following reflect cups per day consumed:

3 4 6 8 2 1 0 2

A. Compute the sample mean.

B. Compute the sample standard deviation.

C. Construct a 95% confidence interval

I did this in SAS like so:

data coffee ;

input cups ;

datalines ;

3

4

6

8

2

1

0

2

;

proc means mean std stderr;

var cups ;

I get the following results.

Analysis Variable: cups

| Mean | Std Dev | Std Error |
|---|---|---|
| 3.2500000 | 2.6592158 | 0.9401748 |

These results give me A and B. Now, all I need to do to compute C is find the correct critical value. With 8 observations I have 7 degrees of freedom, so I look up the t-value and find that it is 2.365.

3.25 – 2.365 * .94 = 1.03

3.25 + 2.365 * .94 = 5.47

That is my confidence interval (1.03, 5.47)
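For anyone who wants to follow the six steps without SAS, here is a minimal sketch in JavaScript using the same coffee data. The function names are my own, and the critical value is still something you look up: the code takes 2.365 as given rather than computing it.

```javascript
const cups = [3, 4, 6, 8, 2, 1, 0, 2];

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Sample standard deviation (divides by n - 1).
function sampleStdDev(values) {
  const m = mean(values);
  const sumSquares = values.reduce((sum, v) => sum + (v - m) ** 2, 0);
  return Math.sqrt(sumSquares / (values.length - 1));
}

// tCritical must be looked up for n - 1 degrees of freedom.
function confidenceInterval(values, tCritical) {
  const m = mean(values);                                         // step 1
  const stdErr = sampleStdDev(values) / Math.sqrt(values.length); // step 2
  const margin = tCritical * stdErr;                              // steps 3 and 4
  return { mean: m, stdErr, lower: m - margin, upper: m + margin }; // steps 5 and 6
}

const ci = confidenceInterval(cups, 2.365); // 2.365 = t for 7 df, 95%
console.log(ci.mean);                                  // 3.25
console.log(ci.lower.toFixed(2), ci.upper.toFixed(2)); // 1.03 5.47
```

The standard error comes out to about 0.94, matching the PROC MEANS output, and the limits match the hand computation.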

=========================

If you want to verify it, or just don’t want to do any computations at all, you can do this

Proc means clm mean stddev ;

var cups ;

You will end up with the same confidence intervals.

Prediction: At least one person who reads this won’t believe me, will run the analysis and be surprised when I am right.