Physicians say that once a patient hears the word “cancer”, their brain shuts down and they don’t hear anything else. To be fair to the patients, understanding survival statistics isn’t always simple.

Let’s take just one example:

The three-year survival rate is different from the third-year survival rate. If you have been told that the three-year survival rate is 50% and it is now the third year since your diagnosis, your probability of surviving this year is likely to be much higher than 50%.

Let’s take a look at this example, with the number of patients diagnosed each year and how many were alive the 1st, 2nd and 3rd year after diagnosis:

Year | N  | Alive after 1st year | 2nd year | 3rd year
2012 | 75 | 60                   | 56       | 48
2013 | 63 | 55                   | 31       | ___
2014 | 42 | 37                   | ___      | ___

The probability of survival year 1 = 152/180 = .84

The probability of survival year 2 = 87/115 = .76

The probability of survival year 3 = 48/56 = .86

To find the probability of survival in the THIRD YEAR you divide the number of people alive at the end of three years, which is 48, by the number of people alive at the beginning of the third year, which is 56. (The number of people who survived the second year is the same as the number of people who were alive at the beginning of the third year.)

48/56 =  86% probability of survival the third year.

So, IF YOU HAVE SURVIVED TO THE BEGINNING OF THE THIRD YEAR, your probability of survival in that year is 86%.

However, if you asked me on day 1 what your probability of living three years is, I would say 55% (actually, 54.9024% if you want to be precise).

How can your three-year survival be lower than third-year survival? Here’s how:

We can only measure third-year survival on people who survived the first two years … 

We followed (75+63 +42) = 180 people for one year. At the end of that year, we had 152 survivors (60 +55 + 37).

So, first year survival rate = 152/180 = 84%

Of those 84%, only 76% survived the second year. Of the people who survived the second year, 86% survived the third. So, what percent survived all three years?

.84 x .76 x .86  = .549024   or,  54.9%

Sometimes people will look at the three-year survival rate and think, WRONGLY,

The three-year survival rate is only a little better than 50%, and I have already lived to the third year, so I must have a 50-50 chance of dying this year.

 

Actually, that is not correct. As the example shows, your chance of surviving the third year may be substantially greater than the three-year survival rate.
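If you want to check the arithmetic, here is a minimal SAS sketch (the data set and variable names are made up); it just chains together the rounded year-by-year rates from the example above.

data three_year ;
   /* year-by-year survival rates from the example above (rounded) */
   p1 = .84 ;   /* survived the first year                          */
   p2 = .76 ;   /* survived the second year, given alive at start   */
   p3 = .86 ;   /* survived the third year, given alive at start    */
   three_year = p1 * p2 * p3 ;   /* = .549024, i.e. about 55%       */
   put three_year= ;
run ;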

==========================

Want to exercise your brain while having fun? Play Fish Lake, canoe down rapids, escape your enemies and review fractions. If you are already smart enough, consider donating a copy to a low-income school or after-school program.


Kappa is a useful measure of agreement between two raters. Say you have two radiologists looking at X-rays, rating them as normal or abnormal and you want to get a quantitative measure of how well they agree. Kappa is your go-to coefficient.

How do you compute it? Well, personally, I use SAS because this is the year 2015 and we have computers.

Let’s take this table, where 100 X-rays were rated by two different raters, as an example:

                        Rating by Physician 2
                        Abnormal  |  Normal
Physician 1  Abnormal      40     |    20
Physician 1  Normal        10     |    30

 

So ….. the first physician rated 60 X-rays as Abnormal. Of those 60, the second physician rated 40 abnormal and 20 normal, and so on.
If you received the data as a SAS data set like this, with an abnormal rating = 1 and normal = 0, then life is easy and you can just do the PROC FREQ.

 

Rater1    Rater2
1         1
1         1

and so on, for all 100 X-rays.

 

However, I very often get not an actual data set but a table like the one above. In this case, it is still relatively simple to code:

DATA compk ;
   INPUT rater1 rater2 nums ;
DATALINES ;
1 1 40
1 0 20
0 1 10
0 0 30
;

 

So, there were 40 x-rays coded as abnormal by both rater1 and rater2.  When rater1 = 1 (abnormal) and rater2 = 0 (normal), there were 20,  and so on.

The next part is easy

PROC FREQ DATA = compk ;
   TABLES rater1*rater2 / AGREE ;
   WEIGHT nums ;
RUN ;

 

That’s it.  The WEIGHT statement is necessary in this case because I did not have 100 individual records, I just had a table, so the WEIGHT variable gives the number in each category.
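If you want to check SAS's answer by hand, kappa is the observed agreement minus the agreement expected by chance, divided by one minus the chance agreement. For this table:

Observed agreement = (40 + 30)/100 = .70
Chance agreement = (60/100)(50/100) + (40/100)(50/100) = .30 + .20 = .50
Kappa = (.70 - .50)/(1 - .50) = .40

which is the simple kappa that PROC FREQ reports for this table.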

This PROC FREQ code will work fine for a 2 x 2 table. If you have a table that is more than 2 x 2, at the end you can add the statement

TEST WTKAP ;

This will give you the weighted kappa coefficient. If you include it with a 2 x 2 table, nothing changes because the weighted kappa coefficient and the simple kappa coefficient are the same in that case.

See, I told you it was simple.

 

Listen, my dears, and you will learn the difference between incidence and prevalence and why effective treatment for a disease may mean that more people have it.

Prevalence is the number of people in the population who have a disease. It is computed as

Prevalence per 1,000 = (Number of people with the disease ÷ Number of people in the population) × 1,000

As I mentioned before, there is nothing magical about 1,000 and you could compute prevalence per 100, per 100,000 or per million.

The important point is that prevalence is the number who HAVE the disease.

Below, you see a graph of prevalence and incidence of HIV in the U.S. from 1997 to 2006.

 

[Graph: HIV prevalence and incidence in the U.S., 1997–2006]

 

The green line is prevalence, the number who have the disease. Clearly, prevalence went up after 1995.

So, did the AIDS epidemic get worse? No, actually, it stayed the same in one sense and got better in another.

The red dotted line is incidence. If you define an epidemic as the occurrence of a disease in excess of normal frequency, then there has been no change and no evidence of a worsening epidemic, even though more people HAVE the disease. How can that be?

INCIDENCE is the number of new cases occurring during a period of time. You compute it like this:

Incidence per 1,000 = (Number of new cases during a period of time ÷ Number in the population at risk during that period) × 1,000

Prevalence is affected by both incidence and duration. If a lot of people get the disease but the duration is short the prevalence won’t be as high as it would be if the duration was very long. That’s one reason you’ll find the prevalence of diabetes is higher than, say, chicken pox. Even though a lot of people get chicken pox, it doesn’t last very long. In contrast, once you have diabetes, you have it for years.
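A rough rule of thumb (for a reasonably stable, fairly uncommon disease) is that prevalence ≈ incidence × average duration. With some made-up numbers: if 10 people per 1,000 develop a disease each year and it lasts an average of 10 years, prevalence ≈ 10 × 10 = 100 per 1,000. If the same 10 per 1,000 get a disease that lasts an average of two weeks (about .04 years), prevalence ≈ 10 × .04 = 0.4 per 1,000.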

Another reason duration could be low is that mortality is high. If people die of a disease at a high rate, within a short period of time, prevalence would be low, but this would generally not be considered a good thing. Once effective treatment is found and people live longer, the prevalence would go UP, which is exactly what happened with HIV. In 1997, there was a 40% decline in AIDS deaths as a result of new, effective, anti-retroviral drugs. That lower death rate has been maintained so year after year the number of people who have HIV has increased even though the incidence has been pretty stable.

Diabetes is another example of a disease that has been affected by effective treatment. A friend of mine told me this story:

An elderly patient was against taking the medication he had been prescribed for his diabetes and high blood pressure. He demanded of his physician:

They didn’t have all of this medication back in the old days. What did they do then, huh?

To which Jake politely replied,

Well, sir, they died.

However, effective treatment is only one reason that diabetes prevalence is rising in modern times. Remember prevalence is affected both by incidence and duration. Well, the incidence of diabetes has been increasing dramatically.

Since it is nearly 3 am in North Dakota though, and I am flying home today, that will have to be a post for another day.

 

You’d think the ultimate example of simplicity in measurement would be mortality rate.


Count up the dead people  – they aren’t hard to catch. Divide by the total number of people you had when you started.

Done.

It’s not super complicated but it is slightly more complicated than that.

First of all, how do you figure the number of people in the population? The population changes all of the time. People are born, people die. So to compute the annual mortality rate you use this formula:

Annual mortality rate per 1,000 = (Total number of deaths from all causes in one year ÷ Number of persons in the population at midyear) × 1,000

(You can also use per 10,000,  100,000  or million. There is nothing magic about the number 1,000).
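For example (with made-up numbers), if a city had 500,000 people at midyear and 4,000 deaths from all causes during the year:

Annual mortality rate = (4,000 ÷ 500,000) × 1,000 = 8 per 1,000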

You also may want to compute mortality for certain groups, for example, by age. You would then calculate the age-specific mortality rate. For example, if we want to know the mortality rate of children under 16, it would be:

Age-specific mortality rate per 1,000 = (Number of deaths from all causes in one year among children under 16 ÷ Number of children under 16 in the population at midyear) × 1,000

There is also the cause-specific mortality rate:

Cause-specific mortality rate per 1,000 = (Number of deaths from a specific disease in one year ÷ Number of persons in the population at midyear) × 1,000

CASE FATALITY is a very different thing from the mortality rate. Case fatality is:

Case fatality (%) = (Number of individuals dying during a period after disease onset ÷ Number of individuals with the disease) × 100

For example, the annual mortality rate from measles in 1971-1975 was .17 per million

During that same period, the case fatality rate was about .1% or  .98 per 1,000.

Think about that. One person out of 1,000 who got measles died.
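Those two figures are consistent with each other. Roughly (this is just back-of-the-envelope arithmetic, not a number from the source data), the mortality rate equals the case fatality times the fraction of the population that caught the disease that year, so:

Cases per million per year ≈ .17 per million ÷ .00098 ≈ 173 per million

In other words, the mortality rate is tiny mainly because only a small fraction of the whole population caught measles in a given year, not because measles was harmless to the people who caught it.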

After a measles epidemic in 1989-91, increased vaccination rates resulted in a dramatic drop in measles mortality rates.

Unfortunately, measles is making a comeback in the United States due to stupid bastards who don’t vaccinate their children. Whether this will translate to an increase in the mortality rate remains to be seen. With only 644 cases in 2014, there hasn’t been a reported death – yet.

I think descriptive statistics are underrated. One reason I like Leon Gordis’ Epidemiology book is that he agrees with me. He says that sometimes statistics pass the “inter-ocular test”. That is, they hit you right between the eyes.


I’m a big fan of eyeballing statistics, and SAS/GRAPH is good for that. Let’s take this example. It is fairly well established that women have a longer life span than men in the United States. In other words, men die at a younger age. Is that true for all causes of death?

To answer that question, I used a subset of the Framingham Heart Study and looked at two major causes of death, coronary heart disease and cancer. The first thing I did was round the age at death into five-year intervals to smooth out some of the fluctuations from year to year.

data test2 ;
   set sashelp.heart ;
   ageatdeath5 = round(ageatdeath, 5) ;
run ;

proc freq data=test2 noprint ;
   tables sex*ageatdeath5*deathcause / missing out= test3 ;
   /* NOTE THAT THE MISSING OPTION IS IMPORTANT */
run ;

THE DEVIL IS IN THE DETAILS

Then I did a frequency distribution by sex, age at death and cause of death. Notice that I used the missing option. That is super-important. Without it, instead of getting what percentage of the entire population died of a specific cause at a certain age,  I would get a percentage of those who died. However, as with many studies of survival, life expectancy, etc. a substantial proportion of the people were still alive at the time data were being collected. So, percentage of the population, and percentage of people who died were quite different numbers. I used the NOPRINT option on the PROC FREQ statement simply because I had no need to print out a long, convoluted frequency table I wasn’t going to use.

I used the OUT = option to output the frequency distribution to a dataset that I could use for graphing.

More details: The symbol statements just make the graphs easier to read by putting an asterisk at each data point and by joining the points together. I have very bad eyesight so anything I can do to make graphics more readable, I try to do.
symbol1 value = star ;
symbol1 interpol = join ;

Here I am just sorting the data set by cause of death and only keeping those with Cancer or Coronary Heart Disease.
proc sort data=test3 ;
   by deathcause ;
   where deathcause in ("Cancer", "Coronary Heart Disease") ;
run ;

 

Even more details.  You always want to have the axes the same on your charts or you can’t really compare them. That is what the UNIFORM option in the PROC GPLOT statement does. The PLOT statement requests a plot of percent who died at each age by sex. The LABEL statement just gives reasonable labels to my variables.

proc gplot data = test3 uniform ;
   plot percent*ageatdeath5 = sex ;
   by deathcause ;
   label percent = "%"
         ageatdeath5 = "Age at Death" ;
run ;
quit ;

[Graphs: percent dying at each age, by sex, for cancer (top) and coronary heart disease (bottom)]

When you look at these graphs, even if your eyes are as bad as mine, you can see a few things. The top chart is for cancer, and you can conclude a couple of things right away.

  1. There is not nearly the discrepancy in the death rates of men and women for cancer as there is for heart disease.
  2. Men are much more likely to die of heart disease than women at every age up until 80 years old. After that, I suspect that the percentage of men dying off has declined relative to women because a very large proportion of the men are already dead.

So, the answer to my question is “No.”

I read a lot. This year, I finished 308 books on my Kindle app, another dozen on iBooks, a half-dozen on ebrary and 15 or 20 around the house. I don’t read books on paper very often any more. It’s not too practical for me. I go through them at the rate of about a book a night, thanks to a very successful speed reading program when I was young (thank you, St. Mary’s Elementary School). Don’t be too impressed. I don’t watch TV and I read a lot of what a colleague called “junk food for the brain”. I read a bunch of Agatha Christie novels, three Skulduggery Pleasant books, and several of the Percy Jackson and the Olympians books. Yes, when it comes to my fiction reading, I have the interests of a fourteen-year-old girl. Trying to read like a grown-up, I also read a bunch of New York Times bestseller novels and didn’t like any of them.

So, I decided to do my own “best books list” based on a random sample of one, me, and make up my own categories.

Because I taught a course on multivariate statistics,  I read a lot of books in that area, and while several of them were okay, there was only one that I really liked.

The winner for best statistics book I read this year:

Applied Logistic Regression, 3rd Edition, by David Hosmer, Stanley Lemeshow and Rodney Sturdivant.

I really liked this book. I’m not new to logistic regression, but I’m always looking for new ideas, new ways to teach, and this book was chock full of them. What I liked most about it is that they used examples with real data, e.g., when discussing multinomial logistic regression, the dependent variable was type of placement for adolescents, and one of the predictor variables was how likely the youthful offender was to engage in violence against others. It is a very technical book, and if you are put off by matrix multiplication and odds ratios, this isn’t the book for you. On the other hand, if you want an in-depth understanding of logistic regression from a practical point of view, read it from beginning to end.

Best SAS book  I read this year …

Let me start with the caveat that I have been using SAS for over 30 years and I don’t teach undergraduates, so I have not read any basic books at all. I read a lot of books on a range of advanced topics and most of them I found to be just – meh. Maybe it is because I had read all the good books previously and so the only ones I had left unread lying around were just so-so. All that being said, the winner is …

Applied Statistics and the SAS Programming Language (5th Ed.), by Ronald Cody and Jeffrey Smith

This book has been around for eight years, and I had actually read parts of it a couple of years ago, but this was the first time I read through the whole book. It’s a very readable intermediate book. Very little mathematics is included. It’s all about how to write SAS code to produce a factor analysis, repeated measures ANOVA, etc. It has a lot of random stuff thrown in, like a review of functions and working with date data. If you have a linear style of learning and teaching, you might hate that. Personally, I liked that about it. Eight years is an eon in programming time, but chi-square and ANOVA have been around for 100 years, so that wasn’t an issue. While I don’t generally like the use of simulated data for problems in statistics, for teaching this was really helpful because when students were first exposed to a new concept they didn’t need to get a codebook and fix the data first. For the purpose of teaching applied statistics, it’s a good book.

Best Javascript programming book I read this year

I read a lot of Javascript books and found many of them interesting and useful, so this was a hard choice.

The jQuery cookbook, edited by Cody Lindley

was my favorite. If you haven’t gathered by now, I’m fond of learning by example, and this book is pretty much nothing but elaborate examples along the lines of, “Say you wanted to make every other row in a table green.” There are some examples like that I can imagine wanting to use and others I cannot think of any need for, ever. However, those are famous last words. When I was in high school, I couldn’t imagine I would ever use the matrix algebra we were learning.

Best game programming book I read this year

Again, I read a lot of game programming books. I didn’t read a lot of mediocre game programming books. They all were either pretty good or sucked. The best of the good ones was difficult to choose, but I did

Building HTML5 Games by Jesse Freeman

This is a very hands-on approach to building 2-D games with Impact, with, you guessed it, plenty of examples. I was excited to learn that he has several other books out. I’m going to read all of them next year.

So, there you have it …. my favorite technical books that I read this year. Feel free to make suggestions for what I ought to read next year.

 


(There may even be a part two, if I get around to it.)

Let me ask you a couple of questions:

1. Do you have more than just one dependent variable and one independent variable?

2. If you said yes, do you have a CATEGORICAL or ORDINAL dependent variable? If so, use logistic regression. I have written several posts on it. You can find a list of them here. Some involve Euclid, marriage, SAS and SPSS. Alas, none involve a naked mole rat. I shall have to remedy that. (There is a minimal code sketch after this list.)

3. You said yes to #1, multiple variables, but no to #2, so I am assuming you have multiple variables in your design and your dependent variable is interval or continuous, something like sales for the month of December, average annual temperature or IQ. The next question is: do you have only ONE dependent variable and is it measured only ONCE per observation? For example, you have measured the average annual temperature of each city in 2013 or sales in December 2012. In this case, you would do either Analysis of Variance or multiple regression. It doesn’t matter much which you do if you code it correctly. Both are specific cases of the general linear model and will give you the same result. You may also want to do a general linear MIXED model, where you have city as a random effect and something else, say, whether the administration was Democratic or Republican, as a fixed effect (see the sketch after this list). In this case I assume that you have sales as your dependent variable because, contrary to the beliefs of some extremists, political parties do not determine the weather. Generally, whether you use a mixed model or an Ordinary Least Squares (OLS) plain vanilla ANOVA or regression will not have a dramatic impact on your results unless the result is a grade in a course where the professor REALLY wants you to show that you know that school is a random effect when comparing curricula.

4. Still here? I’m guessing you have one of two other common designs. That is, you have measured the same subjects, stores, cities, whatever, more than once. Most commonly, it is the good old pretest posttest design and you have an experimental and control group. You want to know if it works. If you have only tested your people twice, you are perfectly fine with a repeated measures ANOVA. If you have tested them more than twice, you are very likely to have grossly violated the assumption of compound symmetry and I would recommend a mixed model.

5. All righty then, you DO have multiple variables, they are NOT categorical or ordinal, your dependent variable is NOT repeated, so you must have multiple dependent variables. In that case, you would do a multivariate Analysis of Variance.
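Here are the minimal code sketches promised above. They are only sketches: the data set and variable names (outcomes, placement, violence_level, age, sales_data, sales, party, city) are made up for illustration, not taken from any real study.

/* Question 2: a categorical dependent variable calls for logistic regression.  */
/* Hypothetical data; placement is assumed binary (1 = residential, 0 = not).   */
proc logistic data = outcomes ;
   class violence_level ;
   model placement = violence_level age ;
run ;

/* Question 3: a continuous dependent variable (sales) with party as a fixed    */
/* effect and city as a random effect - a general linear mixed model.           */
proc mixed data = sales_data ;
   class party city ;
   model sales = party ;
   random city ;
run ;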

Some might argue that logistic regression is not a multivariate design. Other people would argue with them that, assuming your data are multinomial, you need multiple logit functions so that really is a type of multivariate design. A third group of people would say it is multivariate in the ordinal or multinomial case because there are multiple possible outcomes.

Personally, I wonder about all of those types of people. I wonder about the amount of time in higher education spent in forcing students to learn answers to questions that have no real use or purpose as far as I can see.

On the other hand, while knowing whether something falls in the multivariate category or not probably won’t impact your life or analyses, if you treat time as an independent variable and analyze your repeated measures ANOVA with experiment and condition as a 2 x 2 ANOVA, you’re screwed.

Know your research designs.

What if you wanted to turn your PROC MIXED into a repeated measures ANOVA using PROC GLM? Why would you want to do this? Well, I don’t know why you would want to do it, but I wanted to do it because I wanted to demonstrate for my class that both give you the same fixed effects F value and significance.

I started out with the Statin data set from the Cody and Smith textbook. In this data set, each subject has three records, one each for drugs A, B and C. To do a mixed model with subject as a random effect and drug as a fixed effect, you would code it as so. Remember to include both the subject variable and your fixed effect in the CLASS statement.

proc mixed data = statin ;
   class subj drug ;
   model ldl = drug ;
   random subj ;
run ;

To do a repeated measures ANOVA with PROC GLM you need three variables for each subject, not three records.

First, create three data sets for Drug A, Drug B and Drug C.

data one two three ;
   set statin ;
   if drug = 'A' then output one ;
   else if drug = 'B' then output two ;
   else if drug = 'C' then output three ;
run ;

Second, sort these datasets and as you read in each one, rename LDL to a new name so that when you merge the datasets you have three different names. Yes, I really only needed to rename two of them, but I figured it was just neater this way.

proc sort data = one (rename= (ldl =ldla)) ;
by subj ;

proc sort data= two (rename = (ldl = ldlb)) ;
by subj ;
proc sort data=three (rename =(ldl = ldlc)) ;
by subj ;

Third, merge the three datasets by subject.

data mrg ;
merge one two three ;
by subj ;

Fourth, run your repeated measures ANOVA .

Your three measurements of LDL are the dependent variables. It seems weird not to have an independent variable on the other side of the equation, but that’s the way it is. In your REPEATED statement you give a name for the repeated variable and the number of levels. I used “drug” here to be consistent but actually, this could be any name at all. I could have used “frog” or “rutabaga” instead and it would have worked just as well.

proc glm data = mrg ;
model ldla ldlb ldlc = /nouni ;
repeated drug 3 (1 2 3) ;
run ;

Compare the results and you will see that both give you the same numerator and denominator degrees of freedom, F-statistic and p-value for the fixed effect of drug.

Now you can be happy.

Any time you learn anything new it can be intimidating. That is true of programming as well as anything else. It may be even more true of using statistical software because you combine the uneasiness many people have about learning statistics with learning a new language.

To a statistician, this error message makes perfect sense:

ERROR: Variable BP_Status in list does not match type prescribed for this list.

but to someone new to both statistics and SAS it may be clear as mud.

Here is your problem.

The procedure you are using (PROC UNIVARIATE or PROC MEANS) is designed ONLY for numeric variables. You have tried to use it for a categorical variable.

This error means you’ve used a categorical variable in a list where only numeric variables are expected. For example, BP_Status takes the values “High”, “Normal” and “Optimal”.

You cannot find the mean or standard deviation of words, so your procedure has an error.

So … what do you do if you need descriptive statistics?

Go back to your PROC UNIVARIATE or PROC MEANS and delete the offending variables. Re-run it with only numeric variables.

For your categorical variables, use PROC FREQ for a frequency distribution and/or PROC GCHART.
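For example, BP_Status happens to be a variable in the sashelp.heart data set used in the Framingham example above, so assuming that is where your data came from, something like this would give you a frequency distribution (swap in your own data set and variable names):

/* frequency distribution for a categorical variable; sashelp.heart assumed */
proc freq data = sashelp.heart ;
   tables bp_status ;
run ;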

Problem solved.

You’re welcome.

Computing confidence intervals is one of the areas where beginning statistics students have the most trouble. It is not as difficult if you break it down into steps, and if you use SAS or other statistical software.

Here are the steps:

1. Compute the statistic of interest, that is, the mean, proportion, or difference between means
2. Compute the standard error of the statistic
3. Obtain critical value. Do you have 30 or more in your sample and are you interested in the 95% confidence interval?

  • If yes, multiply the standard error by 1.96
  • If no (fewer than 30 people), look up the t-value for your sample size (n - 1 degrees of freedom) for .95
  • If no (different alpha level), look up the z-value for your alpha level
  • If no (different alpha level AND fewer than 30), look up the t-value for your alpha level.

4. Multiply the critical value you obtained in step #3 by the standard error you obtained in #2

5. Subtract the result  you obtained in step #4 from the statistic you obtained in #1 . That is your lower confidence limit.

6. Add the result you obtained in step #4 to the statistic you obtained in #1. That is your upper confidence limit.
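Putting steps 4 through 6 into one line:

Confidence interval = statistic ± (critical value × standard error)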
Simplifying it with SAS

Here is a homework problem:

The following data are collected as part of a study of coffee consumption among undergraduate students. The following reflect cups per day consumed:

3          4          6          8          2          1          0          2

 

A. Compute the sample mean.

B. Compute the sample standard deviation.

C. Construct a 95% confidence interval

I did this in SAS as so

data coffee ;
input cups ;
datalines ;
3
4
6
8
2
1
0
2
;
proc means data = coffee mean std stderr ;
   var cups ;
run ;

I get the following results.

Analysis Variable : cups

Mean       Std Dev     Std Error
3.2500000  2.6592158   0.9401748

These results give me A and B. Now, all I need to do to compute C is find the correct critical value. With only 8 observations, that is the t-value for 7 degrees of freedom at the 95% level; I look it up and find that it is 2.365.

3.25   – 2.365 * .94 = 1.03

3.25 + 2.365 * .94 = 5.47

That is my confidence interval (1.03, 5.47)

=========================

If you want to verify it, or just don’t want to do any computations at all, you can do this

proc means data = coffee clm mean stddev ;
   var cups ;
run ;

You will end up with the same confidence intervals.

Prediction: At least one person who reads this won’t believe me, will run the analysis and be surprised when I am right.
