### Sep

#### 26

# A gentle introduction to survival analysis: I. The language of survival analysis


This month was my 14th wedding anniversary. For some reason, a number of my close friends and relatives chose this occasion to tell me that they had bet this marriage would not last more than five years. Which got me thinking about survival analysis … (whether or not it *should* have gotten me wondering about my friends is a different issue).

As I mentioned in a post months ago, logistic regression is what you would use if you wanted to predict whether or not someone would get divorced. Survival analysis is what you would use if you were interested in HOW LONG the marriage would last.

The first thing you need to know is that in survival analysis, we are interested in **time to an event**. The event can be anything – death, college graduation, diagnosis with a disease, marriage, arrest, divorce. Notice that the events are not always negative, as in the example of college graduation or marriage. Unlike statistical methods like logistic regression, where we are interested in a categorical dependent variable – did the event occur or not – with survival analysis we are interested in a numeric variable: HOW LONG until the event occurred.

The second thing you need to understand about survival analysis is **censoring**. All censoring means is that you don’t have data for the complete period. In most cases, censoring occurs on the right side of the curve. (Imagine plotting time on the X axis.) For example, the study ends and some of your people aren’t dead. In this case, you don’t know how many months they will survive; you just know it is more than the 36 months that your study lasted. Not surprisingly, this is called right-censored data. There is also “left-censored data”, when you don’t know the starting point. For example, you have a sample of people diagnosed with HIV, but for some of them you don’t know when they were diagnosed, so you don’t know the beginning of the time period. From what I have seen, right-censored data is a lot more common than left-censored data.
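To make right censoring concrete, here is a minimal sketch of how such data is commonly recorded: one observed time per subject plus a flag saying whether the event was actually seen. The `Subject` class, the `observe` helper and the five survival times are made up for illustration; only the 36-month study length comes from the example above.

```python
from dataclasses import dataclass

STUDY_LENGTH_MONTHS = 36  # the 36-month study from the example above


@dataclass
class Subject:
    months_observed: float  # time from study entry to death or to end of follow-up
    died: bool              # True = death observed, False = right-censored


def observe(true_survival_months: float) -> Subject:
    """Record what a 36-month study would actually see for one subject."""
    if true_survival_months <= STUDY_LENGTH_MONTHS:
        return Subject(true_survival_months, died=True)
    # Still alive when the study ended: all we know is "more than 36 months".
    return Subject(STUDY_LENGTH_MONTHS, died=False)


# Hypothetical true survival times, in months.
subjects = [observe(m) for m in (5, 20, 36, 48, 90)]
n_censored = sum(not s.died for s in subjects)
```

The last two subjects end up with `months_observed` of 36 and `died=False`, which is exactly the "more than 36 months" situation: the time is a lower bound, not the real answer.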

Finding **median survival time** is quite easy, which is nice, because it is a statistic that interests many people across many situations. How long does the average patient with X diagnosis live, how long do most marriages last, how long do people live who have been given treatment Y? It’s just a median. Order all of your subjects by survival time, from the subject who died the day after the study started to the one who died on the last day of the study, followed by all of the people who are still alive. At the fiftieth percentile is your median survival time, so you can say, for example, that the median survival time for patients receiving treatment Y was 39 months.

What if more than half of your people are surviving past the end of your study? Well, if your study was, say, five years long, you can only conclude that, “The median survival time for patients with a diagnosis of Ugly Nose Disease was in excess of sixty months.”
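The ordering procedure described above can be sketched in a few lines. This assumes the simplest case, where the only censoring is people still alive when the study ends; with dropouts along the way you would want an estimator built for censoring. The function name and the example numbers are hypothetical, and the median is taken as the first time by which at least half of the subjects have died.

```python
import math


def naive_median_survival(death_times, n_still_alive, study_length):
    """Median survival time when the only censoring is at the end of the study.

    death_times:   observed times to death (same units as study_length)
    n_still_alive: number of subjects alive when the study ended
    Returns (value, reached). If reached is False, the median lies beyond
    the study and value is only a lower bound.
    """
    n = len(death_times) + n_still_alive
    if n == 0:
        raise ValueError("no subjects")
    ordered = sorted(death_times)  # survivors conceptually come after all deaths
    k = math.ceil(n / 2)           # 1-based position of the fiftieth percentile
    if k <= len(ordered):
        return ordered[k - 1], True
    return study_length, False


# Hypothetical 60-month study: five deaths, two people still alive.
median, reached = naive_median_survival([12, 30, 35, 39, 44], 2, 60)

# Hypothetical 60-month study where most people outlive the study.
bound, bound_reached = naive_median_survival([12, 30], 5, 60)
```

In the first call the median is 39 months. In the second, more than half of the subjects are still alive, so all you can report is "in excess of sixty months".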

The median survival time is nice information to have but usually you want more than that.

Two functions that are important in survival analysis are the **survivor function** and the **hazard function**.

For most studies, and all of those where death is the event of interest, the **survivor function** traces a curve that begins at one and theoretically ends at zero. At the beginning of the study, everyone is alive – the probability of being alive is 100% – and at the end, if the study goes on long enough, everyone is dead, the probability of being alive is 0%. The survivor function gives you the probability that someone will survive at least a certain length of time.

So, want the answer to the question “How long are people with Ugly Nose Disease expected to live?” Compute the median survival time.

Want the answer to the question “What is the probability of a person living at least eleven years after having been diagnosed with Ugly Nose Disease?” Use the survivor function and compute the value for T = 11.
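If you want to see where a number like that comes from, here is a rough sketch that estimates the survivor function from right-censored data and reads off the value at T = 11. It uses the Kaplan-Meier estimator, a standard technique that this post has not covered, written out by hand rather than taken from a library. The data are invented, and the estimate is the probability of surviving beyond each time point.

```python
def kaplan_meier(times, died):
    """Return [(t, S(t))] at each distinct death time (Kaplan-Meier estimate)."""
    data = sorted(zip(times, died))
    at_risk = len(data)  # subjects still under observation just before time t
    surv = 1.0           # the curve starts at one: everyone is alive
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose observed time equals t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]  # True counts as 1, censored as 0
            leaving += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving  # both deaths and censored subjects leave the risk set
    return curve


def survival_at(curve, t):
    """S(t): the last step at or before t, or 1.0 before the first death."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s


# Hypothetical data: years since diagnosis, and whether death was observed.
years = [2, 3, 5, 5, 8, 11, 12, 12, 15, 15]
died = [True, True, True, False, True, False, True, False, False, False]

curve = kaplan_meier(years, died)
s_11 = survival_at(curve, 11)  # roughly 0.58 for this made-up sample
```

Censored subjects never pull the curve down themselves. They just shrink the group at risk, so later deaths count for more.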

The **hazard function** is a little more complicated mathematically. It answers a question like, “Given that they have lived for five years since diagnosis, what is the rate at which people are dying from Ugly Nose Disease?” Note two things about the hazard function. First, it is NOT a probability, so it can be greater than 1.0. Second, it is the failure rate conditional on having survived to this point already. Thus, not surprisingly, it is also called the **conditional failure rate**.
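A quick numeric sketch shows how a rate can exceed 1.0 even though no probability can. Over a short interval, the hazard is approximately the conditional probability of dying in that interval divided by the length of the interval. The function name and the counts below are hypothetical, and this is a discrete approximation rather than the exact definition.

```python
def interval_hazard(alive_at_start, deaths_in_interval, interval_length):
    """Approximate hazard rate over a short interval.

    This is the conditional probability of dying in the interval, given
    survival to its start, divided by the interval's length.
    """
    if alive_at_start <= 0 or interval_length <= 0:
        raise ValueError("need survivors at the start and a positive interval")
    conditional_probability = deaths_in_interval / alive_at_start
    return conditional_probability / interval_length


# Hypothetical numbers: 200 people are alive five years after diagnosis,
# and 30 of them die during the following month (1/12 of a year).
conditional_probability = 30 / 200                  # 0.15, a genuine probability
rate_per_year = interval_hazard(200, 30, 1 / 12)    # about 1.8 deaths per person-year
```

The conditional probability is 0.15, but because it is packed into one month, the rate works out to about 1.8 per year. Change the time unit and the number changes too, which is another sign that it is a rate and not a probability.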

I have a lot more to say about survival analysis. I was sure I had written on this blog about proportional hazards models before, but in over three years, nope, not one peep. I guess that was one of those things I meant to blog about and never got around to. But now I’m on a roll. Check back tomorrow for more on hazard functions and survivor functions.

P.S. If there actually is an Ugly Nose Syndrome and you are dying of it, I’m really, really sorry.

In the interest of not adding to your tragedy, I did some research on this but the closest I could come to my hopefully imaginary malady was an entry on Random Big Nose Syndrome from the Urban Dictionary, whom I believe it is safe to say are a bunch of liars.

# Comments

**3 Responses to “A gentle introduction to survival analysis: I. The language of survival analysis”**


Really nice intro to survival analysis.

Really don’t want to be an ANALyst, but shouldn’t it be:

survivor function traces a curve that begins at ONE and theoretically ends at ZERO.

Also, is the hazard rate not the conditional probability of failure in the next little time period, given that no failure has occurred up to the current time period? Thus it is a probability? I understand, though, that the cumulative hazard rate can exceed 1.

You’re funny! Thanks for catching that. I was thinking of the probability of death beginning at zero and ending at one, but you’re right, the survivor function, which plots the complement of those probabilities (one minus the probability of death), begins at one and ends at zero.

Some days, I just can’t write and think at the same time!

The hazard rate is the conditional probability DIVIDED BY the time interval, which is how it can end up being greater than 1. The numerator is the conditional probability. The denominator is delta t.