# A gentle introduction to survival analysis: I. The language of survival analysis


This month was my 14th wedding anniversary. For some reason, a number of my close friends and relatives chose this occasion to tell me that they had bet this marriage would not last more than five years. Which got me to thinking about survival analysis … (whether or not it should have gotten me to wondering about my friends is a different issue).

As I mentioned in a post months ago, logistic regression is what you would use if you wanted to predict whether or not someone would get divorced. Survival analysis is what you would use if you were interested in HOW LONG the marriage would last.

The first thing you need to know is that in survival analysis, we are interested in time to an event. The event can be anything – death, college graduation, diagnosis with a disease, marriage, arrest, divorce. Notice that the events are not always negative, as in the example of college graduation or marriage. Unlike statistical methods like logistic regression, where we are interested in a categorical dependent variable – did the event occur or not – with survival analysis we are interested in a numeric variable: HOW LONG until the event occurred.

The second thing you need to understand about survival analysis is censoring. All censoring means is that you don’t have data for the complete period. In most cases, censoring occurs on the right side of the curve. (Imagine plotting time on the X axis.)  For example, the study ends and some of your people aren’t dead. In this case, you don’t know how many months they will survive, you just know it is more than the 36 months that your study lasted. Not surprisingly, this is called right-censored data. There is also “left-censored data” when you don’t know the starting point, for example, you have a sample of people diagnosed with HIV but for some of them you don’t know when they were diagnosed, so you don’t know the beginning of the time period. From what I have seen, right censored data is a lot more common than left censored data.

Finding median survival time is quite easy, which is nice, because it is a statistic that interests many people across many situations. How long does the average patient with X diagnosis live, how long do most marriages last, how long do people live who have been given treatment Y?  It’s just a median. Order all of your subjects by survival time, from the subject who died the day after the study started to the one who died on the last day of the study, followed by all of the people who are still alive. At the fiftieth percentile is your median survival time, so you can say, for example, that the median survival time for patients receiving treatment Y was 39 months.

What if more than half of your people are surviving past the end of your study? Well, if your study was, say, five years long, you can only conclude that, “The median survival time for patients with a diagnosis of Ugly Nose Disease was in excess of sixty months.”
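Here is a minimal Python sketch of that ordering procedure, assuming the only censoring is at the end of the study (everyone still alive is censored at the same time). The function name and the example survival times are made up for illustration; they are not from any real data set.

```python
from typing import Optional, Sequence


def median_survival_time(event_times: Sequence[float],
                         n_alive_at_end: int) -> Optional[float]:
    """Median survival time when the only censoring is at study end.

    event_times: survival times of subjects who died during the study.
    n_alive_at_end: number of subjects still alive when the study ended.
    Returns None when more than half of the subjects outlived the study,
    because then all we know is that the median exceeds the study length.
    """
    n = len(event_times) + n_alive_at_end
    ordered = sorted(event_times)
    # 1-based position of the subject at the fiftieth percentile.
    k = (n + 1) // 2
    if k > len(ordered):
        return None
    return ordered[k - 1]


# Hypothetical study: 7 deaths (months since start) and 3 survivors.
print(median_survival_time([2, 5, 9, 14, 20, 27, 33], 3))
# Hypothetical study where 7 of 10 subjects are still alive at the end.
print(median_survival_time([2, 5, 9], 7))
```

In the second call the function returns `None`, which corresponds to the "in excess of sixty months" kind of statement: the median is somewhere past the end of the study.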

The median survival time is nice information to have but usually you want more than that.

Two functions that are important in survival analysis are the survivor function and the hazard function.

For most studies, and all of those where death is the event of interest, the survivor function traces a curve that begins at one and theoretically ends at zero. At the beginning of the study, everyone is alive – the probability of being alive is 100%  – and at the end, if the study goes on long enough, everyone is dead, the probability of being alive is 0%. The survivor function gives you the probability that someone will survive at least a certain length of time.

So, want the answer to the question “How long are people with Ugly Nose Disease expected to live?” Compute the median survival time.

Want the answer to the question “What is the probability of a person living at least eleven years after having been diagnosed with Ugly Nose Disease?”  Use the survivor function and compute the value for T = 11.
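One common way to estimate the survivor function from right-censored data is the Kaplan-Meier estimator. The post does not name an estimator, so treat the choice as my assumption; the sketch below is plain Python and the times and censoring flags are hypothetical.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return [(t, S(t))] at each distinct time where an event occurred.

    events[i] is True if subject i had the event at times[i],
    False if subject i was censored at times[i].
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose recorded time equals t.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += 1 if pairs[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the fraction of the risk set surviving past t.
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


def survival_at(curve: List[Tuple[float, float]], t: float) -> float:
    """Step-function lookup: estimated probability of surviving past t."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s


# Hypothetical data: years since diagnosis; False means still alive (censored).
years = [3, 5, 5, 8, 11, 12]
died = [True, True, False, True, False, True]
curve = kaplan_meier(years, died)
print(round(survival_at(curve, 11), 3))
```

Strictly speaking, Kaplan-Meier estimates the probability of surviving past time t rather than "at least" t; the two only differ at times where an event actually occurred.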

The hazard function is a little more complicated mathematically. It is what you want if your question is, “Given that they have lived for five years since diagnosis, what is the rate at which people are dying from Ugly Nose Disease?” Note two things about the hazard function. First, it is NOT a probability, so it can be greater than 1.0. Second, it is the failure rate conditional on having lived to this point already. Thus, not surprisingly, it is also called the conditional failure rate.
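For readers who like to see it written down, the usual textbook definition for a continuous survival time is below. The symbols are my notation, not anything defined earlier in the post: T is the survival time, S(t) is the survivor function and f(t) is the density of T.

$$
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}
$$

The division by Δt is what makes this a rate per unit of time rather than a probability, which is why it can exceed 1.0.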

I have a lot more to say about survival analysis. I was sure I had written on this blog about proportional hazards models before, but in over three years, nope, not one peep. I guess that was one of those things I meant to blog about and never got around to. But now I’m on  a roll. Check back tomorrow for more on hazard functions and survivor functions.

P.S. If there actually is an Ugly Nose Disease and you are dying of it, I’m really, really sorry.

In the interest of not adding to your tragedy, I did some research on this but the closest I could come to my hopefully imaginary malady was an entry on Random Big Nose Syndrome from the Urban Dictionary, whom I believe it is safe to say are a bunch of liars.

# Why I Don’t Have Minions

Admit it, more than once you have thought to yourself,

Wouldn’t it be convenient about now to have some mindless minions to do my bidding?

I’d always thought if this whole statistical consulting thing didn’t work out, I could be an evil scientist. I mean, I already went to the trouble to get a Ph.D. so I have got half of the Dr. Evil thing down, the Dr. part. You’d think that would be the more difficult part,  no?

According to Merriam-Webster, a minion is a servile dependent, follower, or underling. I’ve often been asked why I don’t have someone else write programs to delete all out-of-range data, reverse-code answers where needed, re-code answers such as “not applicable” to missing, check that the distributions meet assumptions of normality and so on.

I think that is a very bad idea.

The first reason having someone else do the “menial work” of data cleaning is a bad idea is that all of your analysis is going to rest on the assumption of that data cleaning having been done correctly. In cases where it is my reputation (or other parts of me) on the line, I want to be sure that data was coded and scored correctly. If not, everything after that was a waste of time. It’s not uncommon to find differences in conclusions from different studies. This can be due to a variety of reasons – differences in age, education, health, occupation and a thousand other factors in your sample, differences in type of measure – GPA versus standardized test versus high school graduation and on and on. I wonder, though, if sometimes those differences are not due to some of the studies’ data just being entered, coded and scored incorrectly.

The second reason for getting down and dirty with your own data is that you often start to have ideas as you play with variables. What is the relationship between mother’s education and school failure? What about father’s education? What about taking the maximum (or minimum) of the two? After all, there are programs for students who are “First-generation college students”, that is, students neither of whose parents went to college. As you play with your data, you can see new relationships, develop new questions.

Personally, if I’m going to delegate tasks, I prefer to do it at the end, having someone else make a powerpoint presentation or graphs of the data, maybe doing some of the final analyses. If any “minions” are doing this work after I have spent considerable time working with the data set, then I should be able to spot any mistakes and say, “You know, I don’t think your finding that mother’s education is unrelated to school failure fits with what I found in other analyses, let’s take a look at how you arrived at that.”

Besides, just look at what happens when evil scientists DO use minions. Think how Young Frankenstein would have come out differently if the doctor had gone and gotten the brain himself and assigned Igor a task of say, picking out the monster’s wardrobe instead.

# GUEST POST: Mental Health Outcome Statistics Saved my Life


I met Corinna when she was a member of the 1996 Olympic team. In the intervening 15 years she has had a battle with mental illness – or maybe I should say, the mental health system.  This article is re-posted, with permission, from her blog, where she writes about mental illness, mental health, motivational speaking, poetry and a lot more.

I’m a lot more into statistics than poetry – Corinna and I are not much alike there – but I, too, knew Brenda Day and was shocked when she died. More recently, a young woman who used to compete against my daughter, 2004 Olympic silver medalist Claudia Heil, died after jumping off a building.

Corinna makes a really important point that when we write about statistics we need to keep in mind the people those statistics represent and the people our published results will affect.

By Corinna West

I was really sick for a long time, and now I’m not. This is a story about how an encounter with mental health outcome statistics turned my life around. I am also a finalist for \$12,000 from the US Olympic Committee grant to share this information, and I need your vote here for Combat Arts for Recovery.

Corinna West fighting Hannalore Brown in the 1996 US Olympic Trials. Photo by Paul Hensley

I was on the 1996 Olympic Judo Team and worked out about four hours a day for about 6 years leading up to that. When I retired from Judo I finished my bachelor’s in Chemistry and started working on a Ph.D. program in Pharmaceutical Sciences. My research project was coming up with a cancer drug that blocked telomerase, an enzyme that is only active in cancer cells. At some point I had various things coming together in my life that built into an emotional crisis. The problems included a disconnect with my creator, a marriage to a low energy spouse, lack of meaningful hobbies and social connections, pot use, and insecurity about my career path.

At that point I thought that emotional distress meant mental illness so I entered the mental health system and after a 45 minute interview I was given a diagnosis and medications. The medications made me feel worse. What was really harmful, though, was the idea that there was something wrong with me for the rest of my life and that I might end up on disability. No one told me what I could expect for mental health outcome statistics but sites like this on schizophrenia statistics said how awful things were. Also I saw everyone around me in “treatment” doing poorly and came to my own conclusions. Eventually I couldn’t pull things together and I had to leave the Ph.D. program with only a master’s degree. I had a few jobs in a row that didn’t work out and then my marriage started looking like it would fall apart. I thought, “I have nothing left,” and decided to take my early exit from this planet. I had heard that 15% of people with serious mental illness die by suicide and I became convinced that it would happen to me. As an Olympian, I have a lot of willpower, and when I decide to do something, it usually happens.

I have a poem about that time and how I started to find my way out, honoring one of my Olympic Judo Team training partners who didn’t make it through, Brenda Day. It’s my second most watched video on my account.

I’d been hearing about the Glore Psychiatric  Museum about an hour drive from me where much of the old equipment was preserved that used to apply physical “treatment” for those of us with mental health labels. They had lobotomy tools, spinning chairs, dunk tanks, tiny boxes to lock people in, chains, and electroshock equipment. But the best thing for me was a link on their website to the National Empowerment Center.

### The National Empowerment Center has a list of journal articles on their website showing mental health outcome statistics that 58% of people with schizophrenia recover.

Dan Fisher, M.D., Ph.D. completely recovered from a schizophrenia diagnosis

At this point, I didn’t believe that, because in my grad school I thought I had learned how to search academic literature and my studies showed about a 13% recovery rate.  So I emailed the director of their organization, a doctor named Dan Fisher. Dan was diagnosed with schizophrenia as a grad student as well, back in the old days when the system would tell you outright that no one recovered. He told one of his therapists, “I’m going to get out of this hospital and go to med school and be a psychiatrist.”

The therapist said, “I’ll come to your graduation when you do.” And Dan did graduate and the therapist was there at the ceremony. Then Dan helped to found our mental health civil rights movement, and the National Empowerment Center is partly his organization.

So I emailed him and asked why he could say that 58% of people recovered when my studies were only showing 13%? His response was so great that I saved it and still have it to this day.
Quantum mental health outcome statistics email:

From: Dan Fisher
To: Corinna West
Sent: Monday, June 20, 2005 3:06 pm

Corinna, I notice that the first few studies you sent were relatively short-term compared to the results we put on our website. Also, it is very important to note that the conditions of treatment make a huge difference. Harding showed this when she compared the recovery rate in Vermont (with a positive set of expectations and programs) to Maine which had maintenance as the goal. I am going to discuss your question further with Dr. Harding, but in the meantime, don’t be overly influenced by science. Conventional science deals in statistics and aggregate results, whereas your life and mine are unique as snowflakes. We can change our world through our actions in it, and conventional science is dumb to explain that. However, quantum science can explain it. Modern science has the perspective I and most people who have recovered find much more beneficial. Modern science emphasizes the importance of our perceptions in modifying the world. We can be architects of our lives not mere passengers.

By this time I’d realized that mental health recovery was  a lot of work and I wasn’t willing to do that work for a 13% chance. But I could have faith in a 58% chance for good mental health outcome statistics.

That was a key turning point for me. Another turning point was when I realized I’d have to take responsibility, that no one was going to fix me. Also I realized that my creator didn’t want me gone or it would have happened already. So I joined a support group, Recovery International. Eventually I started my own support group which led to my job as the coordinator of a call-in support line. I learned how important daily exercise was to keep my head straight, and I learned that most of my “symptoms” were coming from trauma experiences. I did several entrepreneurship training programs and started developing as an artist, and two years ago I started my own business working to improve our current mental health outcome statistics.

I am creating a social entrepreneurship, a business that helps people while also making money so that it is scalable and sustainable. I want to design a mental health system using principles from The Fortune at the Bottom of the Pyramid by C.K. Prahalad. It should be so cheap that people in recovery can afford to pay for it themselves. I want to use peer support, volunteers, and lay people to help people that are struggling. A national cross disability rally in Washington, D.C. next week says that we save Medicaid funding by demedicalizing services like this. I also started an advocacy campaign called, “Please Cut our Budgets, We’ll Tell You How.” We can also save a ton of money by not labeling short term emotional crises as lifelong illnesses.  The Open Dialogue program in Finland has such good mental health outcome statistics that many of their hospital beds are empty.

Members of Authentic Boxing club in Kansas City use exercise, which is shown to highly improve mental health outcome statistics

I recently went to a national conference in Boston to discuss alternatives to medication, since long-term evidence is starting to show that  meds help some people, but many people with mental health labels do better off medications. One of the presenters, Suzanne Beachy, is a TEDx fellow and was a mom to someone who was diagnosed with schizophrenia. Because all the information she got was based on the medical model and fairly hopeless, both she and he gave up on recovery. The son then died in an accident while he was homeless and she undertook a year long search to find more information about the diagnoses. She finally found the mental health recovery movement, our organization of people who have come out the other side of our labels. Many of us are completely recovered, working full time, out of the mental health system, and totally off psych meds.

She said, “Everyone should know about you people. It shouldn’t be completely random who finds out there is hope beyond the medical model. We need to have a mental health outcome statistics public relations person.”

We are now on the brink of creating this program. I have linked with the Olympic Committee to fund a program where doctors prescribe exercise instead of medication for people with emotional distress. However, we need your votes to get this program funded. You can vote once per day until Sept 18 at:

# Why you care about conditional distributions: The MRS degree explained

1. Getting an MRS degree is an effective strategy.
2. You care about conditional distributions even if you don’t know what they are.

When I was an undergraduate at Washington University in St. Louis in the 1970s, nearly all of my friends (and the vast majority of the student body) could be sorted into one of three, non-exclusive subgroups:

• Jewish students
• Pre-medical students
• Students who intended to marry a Jewish doctor, also known as “Getting her Mrs. degree”. (Yes, back then people actually said flat out that is why they were in college. At least if they were under the influence of enough substances of which I never partook but read about in books that I checked out of the library.)

Fast-forward 20 years when I was an associate professor at a very nice, Christian private college reviewing applications of the freshman class we had just admitted. One of the questions was, “What are your goals in attending college?”  A majority of both males and females selected, “To find a spouse.” It should be noted that most chose other goals as well, showing progress, I guess.

Does this work? I don’t have data on what percentage of people met their spouse in college but the TIMSS data do provide information on assortative mating. Hence, the importance of conditional distributions.

First, let’s start with a marginal distribution.  A marginal distribution in a two-way contingency table is the row or column totals divided by the grand total. For example, look at the table below:

The marginal distributions are (rounded) 17% of fathers had less than a high school education, 41% were high school graduates and 43% were college graduates.

To see the conditional distribution of father’s education given mother’s education, look at the row percentages (the second from the bottom in each cell).

Given the CONDITION that the mother has a college degree, the distribution of father’s education is 4% less than high school, 25% high school graduates and 71% college graduates.

Given the CONDITION that the mother is a high school graduate, the distribution of father’s education is 13% less than high school, 66% high school and 21% college graduates.

SO … a woman who is a college graduate is about 3.4 times as likely (71% / 21%) to marry a college graduate as a woman who is a high school graduate. She has less than one-third the probability (4% / 13%) of marrying a man with less than a high school education compared to a woman who graduated from high school.
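If you want to check that arithmetic yourself, here is a small Python sketch that uses only the rounded row percentages quoted above. The dictionary layout and the function name are mine, chosen for illustration.

```python
# Row percentages quoted in the post: the distribution of father's education,
# conditional on mother's education (rounded to whole percents).
father_given_mother = {
    "college graduate": {
        "less than high school": 4,
        "high school graduate": 25,
        "college graduate": 71,
    },
    "high school graduate": {
        "less than high school": 13,
        "high school graduate": 66,
        "college graduate": 21,
    },
}


def relative_chance(father_level: str, mother_a: str, mother_b: str) -> float:
    """Ratio of P(father_level | mother_a) to P(father_level | mother_b)."""
    return (father_given_mother[mother_a][father_level]
            / father_given_mother[mother_b][father_level])


# Each conditional distribution should add up to 100%.
for mother_level, dist in father_given_mother.items():
    print(mother_level, sum(dist.values()))

college_ratio = relative_chance(
    "college graduate", "college graduate", "high school graduate")
less_than_hs_ratio = relative_chance(
    "less than high school", "college graduate", "high school graduate")
print(f"college-graduate husband: {college_ratio:.2f} times as likely")
print(f"husband with less than high school: {less_than_hs_ratio:.2f} times as likely")
```

Because the inputs are rounded to whole percents, the ratios are only good to about one decimal place.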

Why people care about conditional distributions in general … even if you are not married, married or hate marriage, Mrs degrees aside, a conditional distribution answers the question, “Given X, what are the odds that Y?”

Given that I finish my Ph.D. what are the odds that I will make over \$90,000 a year?

Given that I am over 45 years old, what are the odds that my baby will have Down syndrome?

Given that I live less than half a mile from the coast, what are the odds that my home will increase in value?

And a million similar questions.

DISCLAIMERS TO KEEP PEOPLE FROM SENDING ME HATE MAIL

1. The person in the wedding photo above did NOT go to college to get an MRS degree. She went to NYU, her husband went to Stanford and they met at a conference.
2. The relationship between mother’s education and father’s education may be due to some third variable unrelated to whether the mother (or father) actually went to college. For example, the resident rocket scientist when searching for a wife was specifically looking for someone with a graduate degree. He says this is because he could not imagine having conversations about topics that interest him, like compilers, with someone without a graduate education. Personally, I suspect it is because when he says things like, “Why don’t we name the baby after Gaston Julia?” I say, “What a nice idea!” as opposed to “What the hell are you talking about?”

Also, if you are at WUSS you can come to my class on categorical data analysis and I don’t want to hear any lame excuses like you live in Australia. San Francisco has an airport, you know, and things you can’t get in Australia like the Exploratorium, Ghirardelli chocolate and freezing cold, foggy wet weather. As Meat Loaf said, two out of three ain’t bad.

# Categorical Data & Bivariate Descriptive Statistics

The Agresti and Finlay book, Statistical Methods for the Social Sciences, has a nice section on bivariate descriptive statistics. (And thank you to the person on Twitter who recommended that book. I apologize that I can’t remember who it was.) I got to thinking about that today, especially with regard to categorical data. Often when working with categorical data our interest is in predicting which category a person falls into. Will you vote for Obama next November, for whoever the Republicans nominate, a third-party candidate or just stay home? Will you buy my fabulous widget or just click on to the next page?

I’m interested in something else with descriptive statistics today. Say I already know what category you are in, what does that tell me? The answer is maybe a lot and maybe nothing.

Let’s take two examples. The first uses two categorical variables, do you have a computer at home and were you born in the U.S. All respondents were eighth-grade students who were part of the 2007 Trends in International Mathematics and Science Study (TIMSS).

The table above shows the frequency for each cell, e.g. how many said YES or NO to Have a home computer & YES or NO to Born in the US. It also shows the percentage of the column giving one response or the other. So, 94.65% of those born in the U.S. have a computer at home versus 88.65% of those not born in the U.S. Out of our total 7,237 responses only 431 students said they did not have a computer at home. This was 5% of those who were born in the U.S. and 11% of those who were not. As a teacher, can this information help me? Maybe? Here is what I can conclude:

1. The great majority of students have a home computer, almost nine out of ten students who were not born in the U.S. do and more than 9 out of 10 of those who were born here.
2. Assuming your class is a mix of students who are and are not native born, which describes most classes in California, it is a good bet that over 90% of your students have a computer at home. (The average for the total sample was over 94%.)
3. If you know that a student was not born in the U.S., you know that he or she is twice as likely as U.S. born students to not have a computer at home (11% versus 5%) but on the other hand you also know that the odds are about 8 to 1 that the student does have a computer.
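The three conclusions above come from a few lines of arithmetic on the column percentages. Here is a sketch in Python using only the figures quoted in this post (94.65%, 88.65%, 7,237 and 431); the variable names are mine.

```python
# Column percentages quoted in the post.
pct_have_us_born = 94.65       # % with a home computer, born in the U.S.
pct_have_not_us_born = 88.65   # % with a home computer, not born in the U.S.
total_responses = 7237
no_computer = 431

# Complements: percent WITHOUT a home computer in each group.
pct_without_us_born = 100 - pct_have_us_born
pct_without_not_us_born = 100 - pct_have_not_us_born

# How many times as likely a student not born in the U.S. is to lack a computer.
relative_chance = pct_without_not_us_born / pct_without_us_born

# Odds that a student not born in the U.S. DOES have a computer.
odds_have = pct_have_not_us_born / pct_without_not_us_born

# Overall percent of the sample with a home computer.
pct_have_overall = 100 * (total_responses - no_computer) / total_responses

print(f"without a computer: {pct_without_us_born:.2f}% vs {pct_without_not_us_born:.2f}%")
print(f"relative chance of no computer: {relative_chance:.2f}")
print(f"odds of having a computer (not U.S. born): {odds_have:.1f} to 1")
print(f"overall with a computer: {pct_have_overall:.1f}%")
```

Note the difference between the two comparisons: "twice as likely" is a ratio of two percentages, while "8 to 1" is the odds within a single group.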

Does any of that do you any good? Maybe. Why not just ask the students if they have access to a computer? You could do that certainly. Some teachers feel uncomfortable asking questions like that because they feel like it is intrusive or it might make students feel uncomfortable if they are the only kid in the class who doesn’t have a computer. I think the information does some good because teachers often assume that students in low-income families or from immigrant families don’t have computers, email or Internet access and thus don’t assign homework that uses these resources. The data show that assumption is usually incorrect. If more than one out of ten students who are not born in this country do not have access to a home computer and I was working with a class of primarily immigrant children, I would certainly take that into account in my planning.

What other statistics would I like to have at this point? What other analyses might I do? There are several that spring to mind right away:

• I could re-run that table analysis to show row percentages and total percentages. I left them out because I thought the table was easier to read with only the column percentages, and it isn’t a big deal for me to estimate the other percentages, but other readers might prefer to have them given.
• Since plenty of children are born in the U.S. to immigrant parents, it might be more useful to re-run this analysis looking at whether the parent was born in the U.S. That may be more strongly correlated with socioeconomic factors than the child’s birthplace. Or maybe not, because the child’s birthplace undoubtedly relates to how long you’ve lived in this country – at least 13 years if your 13-year-old was born here.
• Speaking of SES, it might be more useful to use information like the family income or parent’s education instead of or in addition to where the parent or child was born.
• What else can we know about students who don’t have a computer at home? What else might they not have? Is this a proxy for having limited academic support materials in the home like books, a calculator, etc.?

The second uses the graphs I produced with SAS On-Demand to look at a categorical variable – do you have a computer at home – and an ordinal variable – the number of books you own. This is where I said MAYBE knowing the child does not have a computer at home will tell you nothing. If both of these distributions were the same, that would be the case (at least as far as books go) but the two distributions are not the same.

Curiously, knowing that the child has a computer tells you very little. Why is that?  Let’s look at the distribution.

Out of the total population, 94% of students have a computer and the median for those computer owners is 26-100 books in the home. They are almost exactly equally likely (34% vs 33%) to have less than that or more than that. So a child who has a computer is like the vast majority of the population and for that population access to books at home is all over the map.
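Finding the median of an ordinal variable just means walking up the ordered categories until the cumulative percentage reaches 50. Here is a small Python sketch of that, using the three-way split described above for computer owners. I am assuming "less than that" means 25 or fewer books and that the remaining 33% fall in the 26-100 category; the function name is mine.

```python
from typing import List, Tuple


def median_category(distribution: List[Tuple[str, float]]) -> str:
    """Return the first category at which the cumulative percent reaches 50.

    distribution: (label, percent) pairs listed from lowest to highest category.
    """
    cumulative = 0.0
    for label, percent in distribution:
        cumulative += percent
        if cumulative >= 50:
            return label
    raise ValueError("percentages add up to less than 50")


# Collapsed distribution for students WITH a home computer, from the text:
# 34% below the median category, 33% above it, so 33% inside it.
computer_owners = [
    ("25 or fewer", 34),
    ("26-100", 33),
    ("more than 100", 33),
]
print(median_category(computer_owners))
```

With roughly a third of students in each group, the median category tells you very little about any individual child, which is the point of the paragraph above.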

Now let’s look at whether a child does NOT have a home computer.

This is about 6% of the population. The median number of books for those children is 11-25, so it is less than the typical student. It is also a very skewed distribution. Children without a home computer are much more likely to have FEWER than 11 books than to have more. If we look back and forth between the two graphs, we can see that a child with a home computer has about a 1 in 6 chance of having more than 200 books at home. A child without a home computer has about a 1 in 20 chance of having more than 200 books at home.

As a teacher, if I knew that several of my students didn’t have a computer at home, I’d have a fair degree of certainty that they did not have a lot of books at home either (a lot being defined as more than 100).

I find examples like this one interesting because at first glance, it doesn’t seem possible that knowing a student DOES have a computer tells you very little while knowing that he/she DOES NOT have a computer can provide you useful information.

Which just goes to show you that you should always take a second glance.

P.S. If I were that teacher, I would make a major effort to introduce my students to the public library – organize a field trip, invite a librarian to visit.

# The DARK side of SAS On-Demand?


In case you don’t know – SAS On-Demand is the “cloud-based” version of SAS for teaching and research at universities. That’s a fancy way of saying it runs on the SAS servers and it’s free.

Lately I have been happily working with SAS On-Demand for academics so I was a bit surprised speaking with someone at a different university who said,

“I’m probably not the best person to talk to about it because I think it is an unbelievable pain. SPSS, yes, Stata, fine but SAS On-Demand, well it’s been a headache.”

We’re supposed to have lunch sometime in the next couple of weeks and catch up, so I don’t know the details, but since I am starting teaching in two days using SAS On-demand, this was enough to cause some concern. While I was on campus today, I went into one of the classrooms and tried SAS On-demand and it was exactly what I feared – it was so slow as to be almost useless. I had seen this before and it got me to thinking perhaps this semester won’t be as problem-free as I had hoped.

One possible reason SAS On-Demand is so slow in computer labs may be that you have 30 or 40 people all trying to access the server at the same time. I know that when I was using the wireless connection in the classroom where I tested SOD there were probably 200 students in the building using the wireless at the same time. Unfortunately for me, there is no computer lab available, and even if there was, I’m pretty sure that a lot of computer labs use wireless connections, even though the computers are literally locked down. Also, only one classroom in the building has an instructor station and I’m not scheduled for that classroom.

There is an ethernet connection in the classroom so I can bring my own ethernet cable and plug my computer into that and see if it is faster. It should be.

I tried to break down all of the differences between my office set-up, which works fine, and the classroom, which did not.

1. In the office, all of the computers have an ethernet connection, not wireless.

2. In the office, I primarily use SAS O-D on a Windows 7 computer.

3. In the classroom, I was running it on a virtual machine with Vista on VMware.

I ran a couple of procedures and timed how long they took on the Windows machine versus the virtual machine. I have a pretty old PC because about all I use it for is to run SAS. I did a cross-tabulation of two variables, computing the frequencies and column percentages for each cell and a chi-square. On the Windows machine, it took about 10 seconds for the table analysis wizard menu to come up and another 20 seconds for the procedure to run, 30 seconds in total. That is longer than I would like but not unacceptable.

On the Virtual Machine, it took a minute and a half. Ninety seconds is a long time to just stand in front of 20 or 30 people and stare at a screen doing nothing. I know that other procedures, like characterize data or factor analysis will take a lot longer.

…. &  so on.

The rocket scientist thinks Windows 7 plus the ethernet connection should solve it. I think that’s the right track but that creating a dual boot Windows system will work better.

BUT there are other problems:

Downloading SAS On-Demand took just a few minutes over my connection in the office but when I tried it over the home wireless network it took nearly an hour. This was in the evening when there were two other people in the house making heavy use of the network.

When I installed the client on my laptop and desktop it took just a few minutes. That was because the first step is to VERIFY SYSTEM REQUIREMENTS and on the two computers that I use very frequently everything was up to date. The same was not true of the test machines though, and there were several things that needed to be downloaded and updated, which took a good 15 minutes or so.

SO … what took me a few minutes in my office could take an hour or two in the classroom. Not good and that explains why my colleague may be so frustrated.

The rocket scientist doesn’t think I can install Windows 7 in Boot Camp, install SAS On-Demand and get it working in 2 days. I think he’s wrong, but since he is often right, I backed up everything just in case.

What I am going to try to do ….

• Install Windows 7 in boot camp and boot up as a Windows machine.
• Bring my own ethernet cable and use that to connect in the classroom
• Have all the screen shots and output just in case it is too slow to be usable. I actually had most of the PowerPoint done already, being the overprepared type that I am.
• In the first class have students get a SAS profile and register for SAS On-demand accounts. Both of those should happen very quickly.
• Walk them through the process of downloading and installing SAS on their computers so that they can (hopefully) do it at home
• Cover all of descriptive statistics

I think that’s enough for the first day.

Filed Under Software, Technology | 8 Comments

1. It’s free. Some people say this is just the evil corporate answer to R. Maybe. Probably. I don’t care.  I don’t see Microsoft giving me anything for free.

2. It’s pretty easy for an instructor to get an account. I presume SAS verifies your instructor account.

3. Registering a course is also easy and you have options for selecting the type of software.  I use SAS Enterprise Guide, which I suspect the vast majority of others do as well.

4. You do have to download client software. The BAD part about that is it only runs on Windows. I have a couple of Windows computers and two of the Macs run Windows under a virtual machine, but for students who don’t have Windows it’s a pain. I’m hoping if the computer lab doesn’t have SAS installed I can get at least SAS On-Demand installed so the students can use it there. The GOOD part about the client software is it literally takes less than 2 minutes to install and takes up very little space, completely unlike the typical SAS installation. Just click on the link on the login page to download it.

6. To make data available for your students, upload the SAS data set to the saslib subdirectory for your course. Your SAS data set will be available to use within a few minutes.

*** IF YOU GET AN ERROR, CHECK YOUR FTP SETTINGS. I was using FileZilla and the default was SFTP, which didn’t work. To use FileZilla easily, I suggest these settings:

Hostname: ftp://whateverhostname (there will be a hostname in the info for your account)

Port:  — leave blank —

Click QUICKCONNECT

If you upload a SAS data set, then you and your students will be able to access the data using the LIBNAME statement shown below. You’ll want to include the access=readonly parameter to prevent your students from modifying the data.

libname mydata "/courses/u_pepperdine.edu1/i_467600/c_2469/saslib" access=readonly;
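Once the library is assigned, it is worth checking that the data sets actually show up before you send students looking for them. Here is a minimal sketch, reusing my path from above. The data set name `survey` is just a placeholder for whatever you uploaded.

```sas
/* Assign the read-only course library (this is my path; yours will differ) */
libname mydata "/courses/u_pepperdine.edu1/i_467600/c_2469/saslib" access=readonly;

/* List every data set in the library without the variable-level detail */
proc contents data=mydata._all_ nods;
run;

/* Look at the first 10 records of one data set ("survey" is a hypothetical name) */
proc print data=mydata.survey (obs=10);
run;
```

If the PROC CONTENTS listing comes back empty, the upload probably went to the wrong directory.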

7.  What about programming? There is a program option. When you use EG you can just open up a program window and type away.

8. Only the instructor account has write permission to the class library.

9. You cannot save your data on the server. This really isn’t a big deal. Data sets you create are written to the work directory. To save those before exiting a session, select the data set and then under the TASKS menu select DATA and then DOWNLOAD FILE TO PC.

10. To save your output, say a graph or a chart, select it, right-click and choose EXPORT. You can save output as PDF, RTF or HTML.

I test everything I can before recommending using it for teaching, because it is pretty difficult if something goes wrong. Most professors don’t have a back-up lecture in case things don’t go as planned. Also, it is important to keep in mind that students who are learning statistics or writing a dissertation have that as their major focus. Using the software takes time away from that main goal (I know, I know it in the end helps you compute statistics or conduct your research but for students it is often one more thing added on to what at the time seems like an overwhelming task). So, any difficulty in getting the software to work can be really frustrating.

So far, having spent probably 40 hours with the latest version of SAS On-Demand running Enterprise Guide 4.3, I have run into very few problems. I know that I’m not a random sample and certainly not representative on every dimension, from having FileZilla already installed to having used SAS for 29 years. On the other hand, if I have trouble with it, it’s a safe bet most people will.

So, yeah, so far it has turned out to be more like free beer than a free puppy.

Mmm. Beer.

Wonder if the rocket scientist would want to walk down to some bar on the beach and have a beer. Who am I kidding? It involves beer. Of course he would. I’m outta here.

# Univariate statistics for categorical data? How weird!

Filed Under statistics | 1 Comment

PROC UNIVARIATE is for numeric data. I often use it as the first step in my categorical data analyses. How weird is that?

Okay, well, maybe it’s not leafy sea dragon level of strange but it does seem an odd thing to do. After all, much of the output that PROC UNIVARIATE gives you is completely nonsensical for categorical data – the mean, standard deviation, tests for mu = 0. Why would I do such an odd thing?

Eva, Supergenius Baby

Wait, I can  explain!

There are two reasons that I often use PROC UNIVARIATE for categorical data.

• Exhibit A: Character data are stored as numeric. Often I will be using data sets with hundreds, or even thousands, of variables. Even though the variables really do represent categorical data, they are stored in numeric format. That is, 1 = Democrat, 2 = Independent, 3 = Republican and so on.

The soapbox I get on again and again has to do with checking your data, getting to know your data. If there are many questions like “Which party do you belong to?” “Which party did you vote for in the 2008 national election?” “Which party did your spouse vote for?” etc. etc. or a very large number of Yes/No questions “Do you own a computer?” “Do you own a car?” “Do you own your own home?” “Do you own a kazoo?” then you know the number of categories that should exist.

I have written before at great length about checking data quality using a macro. Also, I’m doing a presentation on it at WUSS this year.

Or you could do this with two tasks in SAS Enterprise Guide:

One very quick way to check the data for data entry problems is to run the CHARACTERIZE DATA task with SAS Enterprise Guide. If you take a peek at the back end of what SAS EG is doing (check out the CODE or LOG windows) you can see it is running PROC UNIVARIATE, PROC FREQ and some macros.

The UNIVARIATE output will provide you the following information:

Number of records with non-missing data.

Number of records with missing data

Minimum

Maximum

Minimum and maximum seem odd statistics to use. What is the maximum for political party? What I am looking for here is data entry errors and missing data. If the only possible answers were 1 – 4 and our minimum is 0 and maximum is 9, we have a problem.
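If you would rather type than point and click, the same four numbers are easy to get in a program window. This is a minimal sketch, not the code that CHARACTERIZE DATA generates behind the scenes. It assumes the `mydata` libref is already assigned and that your data set is called `survey`, which is a made-up name.

```sas
/* Count of non-missing, count of missing, minimum and maximum
   for every numeric variable in the data set.
   "survey" is a hypothetical data set name. */
proc means data=mydata.survey n nmiss min max;
  var _numeric_;
run;
```

Scan the MIN and MAX columns for anything outside the range of legal codes, and the NMISS column for variables that are mostly empty.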

With the CHARACTERIZE DATA task I can get frequency distributions and graphs for the variables that are stored as character (in this particular example data set, there is only one), and all that univariate stuff for all of the other variables that are stored as numeric (even though they’re really character variables).

In this data set, which is SUPPOSED to contain nicely cleaned up data, all of the answers should be on a scale of 1 – 4. So, I create a FILTER just by pointing and clicking to pull out the out-of-range data. This identifies any variables with values out of range, any missing more than 5% of the data (there are about 7,300 records so 365 is about 5%) and any with a standard error of 0. Why the heck that last one? Because for some odd reason, CHARACTERIZE DATA does not give you a standard deviation or variance. Since the standard error is the standard deviation divided by the square root of N, a standard error of 0 means a standard deviation of 0, which means everyone gave the same answer. I want to see any variables where everyone gave the same answer because, in a data set of over 7,300 people, that’s just plain weird.
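For anyone who wants the standard error argument and the 5% cutoff written out, here is the arithmetic. The symbols are my own shorthand: SE is the standard error of the mean, s is the sample standard deviation and N is the number of non-missing records.

```latex
\[
SE = \frac{s}{\sqrt{N}}, \qquad N > 0
\quad\Longrightarrow\quad
\bigl( SE = 0 \iff s = 0 \bigr)
\]
\[
0.05 \times 7300 = 365
\]
```

Since N is never zero for a variable that has any data at all, the only way to get a standard error of 0 is for the standard deviation to be 0.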

This produces the following data set as output. There are more variables than this, but I hid the ones like mean, median and total that were irrelevant for categorical data.

You will also get graphs for all of your variables. The default is to graph the 30 most common categories. Normally, I would not request these graphs if I had hundreds of variables but if your variables are stored as character data, this is probably the quickest way of doing  a check for out-of-range values.

Personally, I find it easier to glance through frequency tables than graphs, but that’s me.  Also, if there is a way to set options for the CHARACTERIZE DATA task, it is well-hidden. As far as I can see, you get the frequency distributions for categorical variables only …

… and if you want any distributional information on the variables stored as numeric you have to go with the default which is to produce graphs. Since the default is to produce graphs, you don’t need to do anything extra to get them.

Of course with truly categorical data the only measure of central tendency you can discuss is the mode, and you can see here that it is the first category. Most eighth-graders say they spend no time playing computer games (they’re probably lying). You can also talk about the distribution. With truly categorical data saying it is positively or negatively skewed doesn’t make much sense because there is no positive or negative direction. You can, however, comment on whether the observations were relatively evenly distributed among categories or predominantly in one or two categories.
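If you prefer a table to a graph for spotting the mode, PROC FREQ will sort the categories for you. This is a minimal sketch that assumes the `mydata` libref is already assigned. The data set name `survey` and the variable name `gametime` are placeholders I made up for the computer games question.

```sas
/* ORDER=FREQ lists categories from most to least common,
   so the mode is the first row of the table.
   "survey" and "gametime" are hypothetical names. */
proc freq data=mydata.survey order=freq;
  tables gametime;
run;
```

The percent column also shows at a glance whether the answers are spread fairly evenly or piled up in one or two categories.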

In this example it is obvious but with other variables it may not be so clear. This is just the part where you are getting  a first look at your data. More on a detailed look tomorrow (maybe, if I have time).

• Exhibit B: Your data really are ordinal data in disguise. This is kind of obviously the case for the question above. Even if it were stored as character data, you can see that it really is on an ordinal scale. It makes sense in that case to talk about the median and say that half of the students played computer games less than one hour and half played for one hour or more. You can also say that your data are positively skewed. Don’t get all excited, though, and think that just because of that it is a good idea to use this as a dependent variable in a regression equation. You only have five possible answers and that really does not fit the idea of a normal distribution.
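For an ordinal variable stored as numeric codes, the median and quartiles are one short step away. This is a minimal sketch under the same assumptions as before: the `mydata` libref is assigned, and `survey` and `gametime` are hypothetical names, with `gametime` coded as ordered categories.

```sas
/* Median and quartiles of the category codes for an ordinal item.
   The results are codes, so look up which answer each code stands for.
   "survey" and "gametime" are hypothetical names. */
proc means data=mydata.survey n median q1 q3;
  var gametime;
run;
```

Remember that these are positions on an ordered scale, not hours, so interpret them as categories.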

And now I have somehow managed to spend three or four hours playing with SAS On-Demand with categorical data. (Yes, I did a lot more than what I wrote about in this blog.)

Maybe tomorrow I will write about bivariate descriptive statistics. Or maybe I will just sleep late and be a slug.

If you’re dying to know more, you can come to the class on categorical data analysis I’m doing at WUSS. Or you could just keep reading this blog. Or both.