It has been pretty well established that I am the worst soccer mom in the history of soccer moms. Most of the games I miss because I am somewhere else. My children have told me that my autobiography should be entitled, “I was out of town at the time” because most of the stories of their childhood begin this way.
Having come back in town shortly before the game this weekend, I was unaware that it was a two-day tournament 2 1/2 hours from home and that we were supposed to have reserved the hotel weeks ago. Hot tip: If you get your reservation last minute and have the choice of a close hotel or a nice hotel, get the nice one.
I fulfilled my obligation. I showed up. During the time The Spoiled One played, I watched. During half time and the breaks between the games I was able to write a couple of blog posts and test out SAS Studio.
If you look at the picture above you might see that I was working in a field surrounded by mountains. Not the best situation for Internet access, which I had via the hotspot on my iPhone.
I was able to log on to SAS Studio with no problem. When I logged in on my iPad I had the screen shown above where I could just start typing my program in the code window.
To see folders, libraries, etc., tap the BROWSE link in the top left corner, as shown.
You can tap any of the categories to bring down the list of folders, libraries, etc. You can tap on a file to open it.
The one problem I did have, and depending on your situation, it may be a severe one, was that I could not get any of the libraries to open. I wanted to open the sashelp library and see if I could run some tasks using an open data set. This did not work. It is very possibly related to poor Internet due to lying in a soccer field ringed by mountains. I tried it last year in a movie theater and I was able to access the libraries. In this case, as you might guess from the top photo, the Internet was barely accessible.
Next, I tried simulating a homework problem a student might have, just typing in some data and running the program.
I have a bluetooth keyboard I use with my iPad and it all worked fine. I typed in data, tapped on the little running guy and my program ran fine. You can see the results below.
To save it, I held down the home button and the power button simultaneously, just like any time you take a screenshot on an iPad. Then, I emailed that screenshot to myself, so here you have results.
My point is that a student could do their homework using SAS Studio in the middle of a soccer field on an iPad, as long as it did not require external files, which most of the homework I assign does not. They could then email the results to their professor, still from the (dis)comfort of the field.
This is useful to know for three reasons:
- I travel frequently to areas where there is very limited bandwidth,
- Many of the students in my online courses live in areas with limited bandwidth,
- The Spoiled One’s team won their bracket in the State Cup, so it turns out that means they have more soccer games next weekend as they advanced in the tournament. This is not at the same field surrounded by mountains. It’s at a different field at the edge of the desert. Sigh.
Your students should be able to use SAS Studio almost anywhere, even if all they have is an iPad.
This is doubly true if you don’t assign homework that requires accessing external datasets.
I’ll be able to review homework assignments for the course I am teaching next during the soccer tournament this weekend. (I really AM the worst soccer mom in the history of ever.)
—————— SHAMELESS PLUG
Our Kickstarter campaign is still going on, making adventure games to make math (and history and English) awesome.
We are 84% of the way to our goal!
Getting ready to teach biostatistics in a few weeks and it seems to me that the real confusion in most cases is not the calculations, which can be fairly simple, but rather that there can be several ways of looking at the same question. Let’s take “risk” as an example.
What is the “risk” of diabetes?
You could answer this by prevalence – 9.3% of Americans have diabetes. So you could say you have about a 1 in 11 chance of having diabetes. Is that your risk?
On the other hand, incidence, the number of new cases per year, is about 1.8 million, which comes out to around 0.6% in a population of 313 million. So, your chance of being newly diagnosed with diabetes in a given year is around 1 in 175. Is that your risk?
In discussing risk of a disease, it may be useful to consider the specific population. For example, the CONDITIONAL risk of having diabetes given the condition that your ethnicity is Asian-American of Chinese descent is 4.4%. (Conditional risk of a disease is defined here as the prevalence given a specific condition.)
Conditional risk given that you are Puerto Rican is 14.8%.
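If you want to check the prevalence and incidence arithmetic yourself, here is a minimal sketch of a DATA step you could paste into SAS Studio. It uses only the figures quoted above; the variable names are my own, made up for illustration.

```sas
/* Prevalence versus incidence for diabetes, using the figures quoted in the post */
DATA _NULL_ ;
  prevalence    = 0.093 ;       /* 9.3% of Americans have diabetes */
  new_cases     = 1800000 ;     /* about 1.8 million new cases per year */
  population    = 313000000 ;   /* population of about 313 million */
  incidence     = new_cases / population ;
  incidence_pct = 100 * incidence ;   /* incidence as a percent */
  one_in_prev   = 1 / prevalence ;    /* "1 in N" chance of having diabetes */
  one_in_inc    = 1 / incidence ;     /* "1 in N" chance of a new diagnosis this year */
  PUT incidence_pct= 6.2 one_in_prev= 6.1 one_in_inc= 6.1 ;
RUN ;
```

The log should show an incidence of a bit under 0.6 percent, which is roughly 1 in 174, against a prevalence of roughly 1 in 11. Same disease, two very different answers to "what is my risk?"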
What is the relationship between diabetes and ethnicity?
This is another simple-sounding question that can be answered in multiple ways. First of all, what is your reference group?
Is it, say, Puerto Ricans compared to the total prevalence of 9.3%? Is it Puerto Ricans compared to non-Hispanic whites? To all Hispanics? To Americans of Chinese descent? If the latter sounds silly, I’m not sure why it is any sillier than non-Hispanic whites, but perhaps someone can enlighten me.
Once you have a reference group, then what do you pick as the method of measuring relationship?
Risk difference is the absolute value of difference in probabilities between two groups.
So, if the probability of someone who is Puerto Rican having diabetes is 14.8%, which it happens to be, and the prevalence of diabetes among Central and South Americans in the U.S. is 8.5% (which it also is), the risk difference is 14.8 – 8.5 = 6.3%.
The relative risk is the risk of one group divided by the risk of the other group. So, the relative risk is 14.8 / 8.5 = 1.74. Rounding it up, you could say that Puerto Ricans are twice as likely to have diabetes as Central or South Americans – which sounds considerably different from saying that the difference between the two groups’ risks is .063.
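As a rough sketch, both measures can be computed in a couple of lines. The two prevalence figures come from the text above; the variable names are hypothetical ones I picked for the example.

```sas
/* Risk difference and relative risk from two group prevalences */
DATA _NULL_ ;
  risk_pr  = 0.148 ;   /* Puerto Ricans: 14.8% */
  risk_csa = 0.085 ;   /* Central and South Americans in the U.S.: 8.5% */
  risk_difference = ABS(risk_pr - risk_csa) ;   /* absolute difference in probabilities */
  relative_risk   = risk_pr / risk_csa ;        /* ratio of the two risks */
  PUT risk_difference= 6.3 relative_risk= 6.2 ;
RUN ;
```

The same two inputs give a difference of .063 and a ratio of about 1.74, which is the whole point: the number you report depends on the measure you pick.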
Then there are odds ratios, which I have written about extensively, including here. Proportional attributable fraction, proportional attributable risk.
Well, I can go on for weeks – and will, once class starts.
How to make it all less confusing
Start with this question, “What do you want to know and why do you want to know that?”
If you want to know what the probable demand for insulin will be in the next year, you might care most about prevalence + incidence. If you are interested in predicting diabetes 10 years from now, you might be very interested in differing probabilities within ethnic groups, as some have a much faster rate of growth than others.
If you are interested in screening or prevention, you would be very interested in which groups have the highest incidence.
I’m thinking a fun and useful thing to do for both biostatistics and epidemiology would be to have students make a flowchart with questions like: If you want to know this, then do that.
That’s a couple more posts, at least.
Feel smarter? Want to be even smarter?
Any time I hear someone brag,
“I’ve never used X in my life,”
I automatically assume that whatever it is, they haven’t learned it very well. Just about everything I’ve learned has come in useful, and the better I learned it, the more useful it is.
Take statistics, for example. There is nowhere in my life that knowledge of statistics isn’t helpful. Darling Daughter Number 3 competes in mixed martial arts and I’m the worrying type.
Whenever her next fight is announced, the very first thing I do is check the fight odds. For the one coming up in Brazil, she is a 15-1 favorite. Knowing that makes my stress level go down a little. I’ll still drop by her gym a time or two during camp just to reassure myself that all is going well. As I said, I’m a worrier.
The latest thing I’m worrying about is our Kickstarter campaign, but here again, statistics cheer me up. Two years ago, we did a Kickstarter campaign with a goal of $20,000. I should have researched a bit better in advance because even though Kickstarter touted the 44% success rate, that is an average (there’s that knowledge of statistics again). Things that were less likely to get funded were projects seeking over $10,000, game projects and projects not featured on Kickstarter. We fit all three. Pretty depressing. In fact, looking at the statistics after we had started our campaign last time, I found that less than 5% of campaigns raised over $20,000.
Well, we made it. You’d think we would have learned our lesson, but for a couple of reasons I’ll go into another day, we decided to do ANOTHER Kickstarter two years later. So, here we are today.
The bad news is that the success rate on Kickstarter has gone down. The overall success rate is now 39%. The semi-good news is that the success rate for games actually ticked up a bit – it was 33% two years ago and it is 34% now.
The really good news: success tends to be all or nothing – 79% of projects that raised 20% of their goal ended successfully funded. Of projects that raised 41% of their goal, 94% went on to be successfully funded. We’re at 42% and we still have two-thirds of our campaign to run, so I’m feeling somewhat less worried.
There is not nearly enough replication in scientific research. It’s unfortunate that funding agencies and academic journals always want to see a new twist – a different technique, a different population. Personally, I’m very interested in reading studies that say:
“I did the exact same study as Mary Lou Who and I found pretty much the same thing.”
One reason this is interesting is that it controls for the history effect. Maybe a specific event determined the outcome. A second reason I find replication interesting is that people are very quick to generate causal hypotheses to explain relationships after the fact. In a subsequent study, those hypotheses can be assessed. Do they still stand up?
Here is an example that comes up in my personal life a lot. People assume since Darling Daughter Number 3 is on TV and in the movies a lot that it helps my business.
Let’s take a look at the graph below:
This shows website statistics for The Julia Group site. Those lines are average daily visits to this site in months when my little pumpkin had UFC world title fights. I used average daily visits to control for the fact that some months have more days than others. Contrary to expectations, the months when she had fights I had stagnant or declining number of visitors. Hearing this, some of the same people who had suggested her career would have a positive effect on business, without blinking an eye reversed themselves and said it must be because I was distracted and away from the office during those months.
Let’s replicate that graph with data from 2012-2013. You see a pretty similar trend between the top and bottom lines. Over the past couple of years, visits have been rising, so the average daily visitors is higher than in 2012-13 but the pattern is the same – an increase during the months from September to December and fewer visits in the summer months. December 2012 was a little unusual compared to most years – usually there is a drop over the holidays.
Because I see these same trends year after year, I realize it’s not at all attributable to how much Ronda is in the media in a given month. It’s a seasonal trend. Since I write about statistics and programming a lot, I’m pretty sure more people come to this blog during the academic year when they are taking a class. Also, people can read my blog at work and pretend it is work-related, even if I’m just ranting about something that day, because, hey there is a possibility that it COULD be about something relevant.
This assumption is further supported by the fact that the lowest days of the week for website visits are Saturday and Sunday.
It’s also interesting when you don’t find the same thing
If one defines “interesting” as not getting what you want, I had an interesting experience with a research project recently. Replicating the project a second year, we ran into all kinds of technical difficulties and the results were far from significant. In short, the subjects did not receive the planned intervention so no effect of intervention was observed. Much swearing ensued. I’m now analyzing data from the third year of the same project.
Multi-year studies make so much more sense to me and it troubles me that there are not more of them. I understand the reasons. For one thing, there is so much pressure to publish in many institutions that people put out as many articles as they can as quickly as they can (everyone except for YOU, of course). For another, multi-year studies are expensive, and it is hard to justify funding to study something you supposedly already studied and reported on.
Yeah, I get it, but just like those people who confidently explain my website statistics, without replication it is too easy to be persuaded that one’s first, or completely contradictory second, hypothesis is correct.
I’m giving a talk on Preparing Students for the Real World of Data at SAS Global Forum next month.
You’d think 50 minutes would be long enough for me to talk, but that just goes to show you don’t know me as well as you think you do. One point made in the template for papers is that you should not try to tell every single thing you know about the DATA step, for example, because it will bore your audience to death.
Random Tips That Didn’t Make it Into the Paper
1. CATS removes blanks and concatenates
While I did give a few shout outs to character functions, it was not possible to put in every function that is worth mentioning. One that didn’t make the cut is the CATS function.
The CATS function concatenates strings, removing all leading and trailing blanks.
Let’s say that I want to have each category renamed with a leading “F” to distinguish all of the variables from the Fish Lake game. I also want to add a ‘_’ to problems 10-14 so that when I chart the variables 11 comes just before 12, not before 2 (which is what would happen in alphabetical order). So, I include these statements in my DATA step.
IF problem_num IN(11,10,12,13,14) THEN probname = CATS('F','_',probname);
ELSE probname = CATS(game,probname) ;
Now when I chart the results you can see the drop off in correct answers as the game gets more difficult.
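If you have not used CATS before, here is a small sketch comparing it with the plain concatenation operator. The values and lengths are made up for illustration and are not from the Fish Lake data.

```sas
/* CATS strips leading and trailing blanks before joining; || keeps them */
DATA _NULL_ ;
  LENGTH game $ 4 probname $ 6 ;
  game     = "F" ;      /* stored as F plus three trailing blanks */
  probname = " 11" ;    /* leading blank, then padded with trailing blanks */
  with_bars = game || probname ;            /* blanks inside the result are kept */
  with_cats = CATS(game, "_", probname) ;   /* blanks are removed before joining */
  PUT with_bars= ;
  PUT with_cats= ;
RUN ;
```

In the log, with_bars still has a gap between the F and the 11, while with_cats comes out as F_11.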
2. Not all export files are created equal
Nine of the ten datasets I needed I was able to download as an EXCEL file and open up in SAS Enterprise Guide. It was a piece of cake, as I mentioned last time. Unfortunately, the third file was downloaded from a different site and it had special characters in it, like division signs, and the data had commas in the middle of it. When I opened it up in SAS Studio it looked like this.
Fixing it was actually super simple. This was an Excel file. I simply did a Replace ALL and changed the division signs to “DIV” and the commas to spaces. The whole thing took FIVE lines to read in after that.
3. Listen to Michelle Homes and know your data
filename fred "/courses/abc123add/sgf15/sl_pretest.csv" ;
Data pretest keyed;
LENGTH item9 $ 38. ;
infile fred firstobs = 2 dlm=",";
input started $ ended $ username $ (item1 - item24) ($) ;
Thank you to the lovely Michelle Homes for catching this! As she pointed out in the comments, the input statement assumes that the variables are 8 characters in length and character data. This is true for 26 of the 27 variables. However, ONE of the 24 items on the test is a question that can be answered with something like Four million, four thousand and twelve.
That, as you can see, is over 8 characters. So, I added a LENGTH statement. That brought up another issue, but that is the next post …
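To see the truncation for yourself, here is a minimal sketch using in-line data instead of the real file. The data set names and the one record are hypothetical; only the 8-character default and the LENGTH fix come from the discussion above.

```sas
/* Without LENGTH: list input gives a character variable 8 bytes by default */
DATA short_item ;
  INFILE DATALINES DLM="," ;
  INPUT username $ item9 $ ;
  DATALINES ;
student1,Four million four thousand and twelve
;
RUN ;

/* With LENGTH declared before INPUT: the whole answer fits */
DATA long_item ;
  LENGTH item9 $ 38 ;
  INFILE DATALINES DLM="," ;
  INPUT username $ item9 $ ;
  DATALINES ;
student1,Four million four thousand and twelve
;
RUN ;

PROC PRINT DATA = short_item ; RUN ;
PROC PRINT DATA = long_item ; RUN ;
```

In the first listing item9 is cut off after 8 characters; in the second the full answer is there. Note that declaring LENGTH first also moves item9 to the front of the variable order.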
I’ll have a lot more to talk about in Dallas. Hope to see you there.
Want to be even smarter? Back us on Kickstarter! We make games that make you smarter. The latest one, Forgotten Trail, is going to be great! You can get cool prizes and great karma.
In case you don’t know, SAS On-Demand is the FREE , as in free beer, offering of SAS for academic use. How good is it? There really can’t be one answer to that.
First of all, there are multiple options – SAS Studio, SAS Enterprise Miner, SAS Enterprise Guide, JMP, etc. so some may be better than others. I have a fair bit of experience with two of them, so let’s just look at one of those today.
I mostly use SAS Studio with my students and over the past few courses I have been really pleased with the results. I selected SAS Studio over Enterprise Guide because I strongly believe it is useful for students to learn to code and many students, yes, even in an area like biostatistics, need a little encouragement to learn. While they don’t end up expert SAS programmers after two or three courses, they at least can code a DATA step, read in raw data, aggregate data and data from external files, produce a variety of statistics and graphics and interpret the results.
Let’s be frank about this … it’s going to require a bit of work up front. You need to create a course with SAS On-Demand. You need to notify your students that they need to create accounts. If you are not going to use solely the sashelp directory data sets, you’re going to have to upload your own data.
Please don’t tell me you plan on solely using the sashelp data sets! These are really helpful for the first assignment or two while students get their feet wet but unless you expect your students to have careers where all of their files to be analyzed are going to be shipped with the software they use, you’re going to move to reading in other types of data sooner or later.
Your data are going to be stored on the SAS server (so you can tell people who ask that yes, you are ‘computing in the cloud’ – instead of what I usually tell people who ask stupid questions like that, which is shut the hell up and quit bothering me – but I digress. Even more than usual.)
No matter what software you use, you’re going to have to select some data sets for students to analyze, have some sort of codebook and make sure your data is reasonably clean (but not so clean that students won’t learn something about data quality problems). So, the only real additional time is figuring out how to get it on the SAS server.
None of these steps take much time, but adding them all up – getting a SAS profile, creating a course, creating an email to send to all of your students, with the correct LIBNAME, uploading your data – it all maybe adds up to a couple of extra hours.
My challenge always is how I shoehorn additional content into the very limited class time I have with students. One tool I’ve been using lately is livebinders. This is an application that lets you put together an online binder of web pages, videos and material you write yourself.
Here is an example of a livebinder I use for my graduate course in epidemiology. It has SAS assignments that progress from simply copying code to modifying it. Links to the relevant SAS documentation are included, as are videos that show step by step how to use SAS Studio for computing relative risk, population attributable risk, etc. I have a similar livebinder for my biostatistics course.
You might think this is a bit of hand-holding to walk the students through it, but I would disagree. Every time I have found myself thinking,
“Well, this is a little too easy”,
I have been wrong.
If you have been doing something for a decade or, in my case, a few decades, it’s hard to remember how confusing concepts were the very first time. Even things that you do automatically, like downloading your results as an HTML file, were a mystery at one time in your life. Making the videos takes some time initially – you have to do a screencast, and then the voice over. Sometimes I do them at once, using QuickTime and GarageBand simultaneously. Other times, I import the screencast into iMovie and record a voiceover.
Either way, a 7-minute video usually takes me half an hour to record, when you add in screwing up the first time, editing out the part where The Spoiled One came in and asked for money to go shopping, etc. So, you’re adding maybe 3-4 hours to the time you spend on your course. On the other hand, you only have to do it once, so, if you teach the same course a few times, it pays off. I cannot tell you how many times students tell me that the videos were helpful. Unlike when I am lecturing in class, they can slow the video down, play it over. Students end the course with experience coding, using data from actual studies and interpreting data to answer problems that matter.
My point is that it is a little more work to teach using SAS Studio, but it is worth it.
Physicians say that once a patient hears the word “cancer”, their brain shuts down and they don’t hear anything else. To be fair to the patients, understanding survival statistics isn’t always simple.
Let’s take just one example:
The three-year survival rate is different from the third-year survival rate. If you have been told that the three-year survival rate is 50% and now it is the third year since your diagnosis, your probability of surviving the year is likely to be much higher than 50%.
Let’s take a look at this example, with the number of patients diagnosed each year and how many were alive the 1st, 2nd and 3rd year after diagnosis:
Year | N | 1st | 2nd | 3rd
2012 | 75 | 60 | 56 | 48
2013 | 63 | 55 | 31 | -
2014 | 42 | 37 | - | -
The probability of survival year 1 = 152/180 = .84
The probability of survival year 2 = 87/115 = .76
The probability of survival year 3 = 48/56 = .86
To find the probability of survival in the THIRD YEAR you divide the number of people alive at the end of three years, which is 48, by the number of people alive at the beginning of the third year, which is 56. (The number of people who survived the second year is the same as the number of people who were alive at the beginning of the third year.)
48/56 = 86% probability of survival the third year.
So, IF YOU HAVE SURVIVED TO THE BEGINNING OF THE THIRD YEAR, your probability of survival in that year is 86%.
However, if you asked me on day 1 what your probability of living three years is, I would say 55% (actually, 54.9024% if you want to be precise).
How can your three-year survival be lower than third-year survival? Here’s how:
We can only measure third-year survival on people who survived the first two years …
We followed (75+63 +42) = 180 people for one year. At the end of that year, we had 152 survivors (60 +55 + 37).
So, first year survival rate = 152/180 = 84%
Of those 84%, only 76% survived the second year. Of the people who survived the second year, 86% survived the third. So, what percent survived all three years?
.84 x .76 x .86 = .549024 or, 54.9%
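Here is one way you might reproduce those numbers in SAS, as a sketch rather than the only way to do it. The counts are the ones in the table above; the data set and variable names are ones I made up.

```sas
/* Counts from the table: patients diagnosed and alive after years 1, 2 and 3 */
DATA cohort ;
  INPUT year n alive1 alive2 alive3 ;
  DATALINES ;
2012 75 60 56 48
2013 63 55 31 .
2014 42 37 . .
;
RUN ;

DATA _NULL_ ;
  SET cohort END = last ;
  /* Year 1: everyone diagnosed is at risk */
  at_risk1 + n ;
  surv1 + alive1 ;
  /* Year 2: only cohorts followed for a second year count */
  IF alive2 NE . THEN DO ;
    at_risk2 + alive1 ;
    surv2 + alive2 ;
  END ;
  /* Year 3: only cohorts followed for a third year count */
  IF alive3 NE . THEN DO ;
    at_risk3 + alive2 ;
    surv3 + alive3 ;
  END ;
  IF last THEN DO ;
    p1 = surv1 / at_risk1 ;   /* 152/180 */
    p2 = surv2 / at_risk2 ;   /* 87/115  */
    p3 = surv3 / at_risk3 ;   /* 48/56   */
    three_year = p1 * p2 * p3 ;   /* probability of surviving all three years */
    PUT p1= 5.3 p2= 5.3 p3= 5.3 three_year= 5.3 ;
  END ;
RUN ;
```

Because this multiplies the unrounded proportions, the three-year figure comes out a hair under the .549 you get from the rounded values, but it still rounds to about 55%.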
Sometimes people will look at three-year survival rate and think, WRONGLY,
The three-year survival rate is only a little better than 50% and I have already lived to the third year, I must have a 50-50 chance of dying this year.
Actually, that is not correct. As the example shows, your chance of surviving the third year may be substantially greater than the three-year survival rate.
Want to exercise your brain while having fun? Play Fish Lake, canoe down rapids, escape your enemies and review fractions. If you are already smart enough, consider donating a copy to a low-income school or after-school program.
Kappa is a useful measure of agreement between two raters. Say you have two radiologists looking at X-rays, rating them as normal or abnormal and you want to get a quantitative measure of how well they agree. Kappa is your go-to coefficient.
How do you compute it? Well, personally, I use SAS because this is the year 2015 and we have computers.
Let’s take this table as an example, where 100 X-rays were rated by two different raters:
Rating by Physician 1 | Physician 2: Abnormal | Physician 2: Normal
Abnormal | 40 | 20
Normal | 10 | 30
So ….. the first physician rated 60 X-rays as Abnormal. Of those 60, the second physician rated 40 abnormal and 20 normal, and so on.
If you received the data as a SAS data set like this, with an abnormal rating = 1 and normal = 0, then life is easy and you can just do the PROC FREQ.
and so for 50 lines.
However, I very often get not an actual data set but a table like the one above. In this case, it is still relatively simple to code:
DATA compk ;
INPUT rater1 rater2 nums ;
DATALINES ;
1 1 40
1 0 20
0 1 10
0 0 30
;
RUN ;
So, there were 40 x-rays coded as abnormal by both rater1 and rater2. When rater1 = 1 (abnormal) and rater2 = 0 (normal), there were 20, and so on.
The next part is easy
PROC FREQ DATA = compk ;
TABLES rater1*rater2/ AGREE ;
WEIGHT nums ;
RUN ;
That’s it. The WEIGHT statement is necessary in this case because I did not have 100 individual records, I just had a table, so the WEIGHT variable gives the number in each category.
This will work fine for a 2 x 2 table. If you have a table that is more than 2 x 2, at the end, you can add the statement
TEST WTKAP ;
This will give you a significance test for the weighted Kappa coefficient, which the AGREE option reports for tables larger than 2 x 2. If you include this with a 2 x 2 table nothing happens because the weighted kappa coefficient and the simple Kappa coefficient are the same in this case.
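If you are curious what PROC FREQ is doing behind the scenes for the 2 x 2 case, here is a sketch of the simple kappa computed by hand from the four cell counts in the table. The variable names are hypothetical; the formula is the standard one, observed agreement minus chance agreement, divided by one minus chance agreement.

```sas
/* Simple kappa by hand for the 2 x 2 table of X-ray ratings */
DATA _NULL_ ;
  a = 40 ;   /* both raters: abnormal */
  b = 20 ;   /* rater1 abnormal, rater2 normal */
  c = 10 ;   /* rater1 normal, rater2 abnormal */
  d = 30 ;   /* both raters: normal */
  n = a + b + c + d ;
  observed = (a + d) / n ;   /* proportion of X-rays the raters agree on */
  /* agreement expected by chance, from the row and column totals */
  expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2 ;
  kappa = (observed - expected) / (1 - expected) ;
  PUT observed= expected= kappa= ;
RUN ;
```

With these counts the raters agree on 70 of 100 X-rays, half of that agreement would be expected by chance, and kappa works out to 0.4.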
See, I told you it was simple.
Listen my dears and you will learn the difference between incidence and prevalence and why effective treatment for a disease may mean that more people have it.
Prevalence is the number of people in the population who have a disease. It is computed as
(Number of people with disease / Number in population) x 1,000
As I mentioned before, there is nothing magical about 1,000 and you could compute prevalence per 100, per 100,000 or per million.
The important point is that prevalence is the number who HAVE the disease.
The green line is prevalence, the number who have the disease. Clearly, prevalence went up after 1995.
So, did the AIDS epidemic get worse? No, actually, it stayed the same in one sense and got better in another.
The red dotted line is incidence. If you define epidemic as occurrence of a disease in excess of normal frequency, then there has been no change and no evidence of a worsening epidemic, even though more people HAVE the disease. How can that be?
INCIDENCE is the number of new cases occurring. You compute it like this:
(Number of new cases during a period of time / Number in population at risk during period) x 1,000
Prevalence is affected by both incidence and duration. If a lot of people get the disease but the duration is short the prevalence won’t be as high as it would be if the duration was very long. That’s one reason you’ll find the prevalence of diabetes is higher than, say, chicken pox. Even though a lot of people get chicken pox, it doesn’t last very long. In contrast, once you have diabetes, you have it for years.
Another reason duration could be low is that mortality is high. If people die of a disease at a high rate, within a short period of time, prevalence would be low, but this would generally not be considered a good thing. Once effective treatment is found and people live longer, the prevalence would go UP, which is exactly what happened with HIV. In 1997, there was a 40% decline in AIDS deaths as a result of new, effective, anti-retroviral drugs. That lower death rate has been maintained so year after year the number of people who have HIV has increased even though the incidence has been pretty stable.
Diabetes is another example of a disease that has been affected by effective treatment. A friend of mine told me this story:
An elderly patient was against taking the medication he had been prescribed for his diabetes and high blood pressure. He demanded of his physician,
They didn’t have all of this medication back in the old days. What did they do then, huh?
To which Jake politely replied,
Well, sir, they died.
However, effective treatment is only one reason that diabetes prevalence is rising in modern times. Remember prevalence is affected both by incidence and duration. Well, the incidence of diabetes has been increasing dramatically.
Since it is nearly 3 am in North Dakota though, and I am flying home today, that will have to be a post for another day.
You’d think the ultimate example of simplicity in measurement would be mortality rate.
Count up the dead people – they aren’t hard to catch. Divide by the total number of people you had when you started.
It’s not super complicated but it is slightly more complicated than that.
First of all, how do you figure the number of people in the population? The population changes all of the time. People are born, people die. So to compute annual mortality rate you use this formula
(Total number of deaths from all causes in 1 year / Number of persons in the population at midyear) x 1,000
(You can also use per 10,000, 100,000 or million. There is nothing magic about the number 1,000).
You also may want to compute mortality for certain groups, for example, by age. You would then calculate the age-specific mortality rate. For example, if we want to know the mortality rate of children aged 15 and under, it would be
(Number of deaths from all causes in 1 year of children under 16 / Number of children under 16 in the population at midyear) x 1,000
There is also cause-specific mortality rate
(Number of deaths from specific disease in one year / Number of persons in the population at midyear) x 1,000
CASE-FATALITY is a very different thing than mortality rate. Case-fatality is
(Number of individuals dying during a period after disease onset / Number of individuals with the disease) x 100
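To make the difference between these rates concrete, here is a sketch with counts that are entirely hypothetical. I made them up so the arithmetic is easy to follow; they are not real data for any disease or place.

```sas
/* Hypothetical counts, made up purely to illustrate the formulas */
DATA _NULL_ ;
  midyear_pop    = 50000 ;   /* persons in the population at midyear */
  all_deaths     = 400 ;     /* deaths from all causes in one year */
  disease_cases  = 200 ;     /* individuals with the disease */
  disease_deaths = 10 ;      /* of those cases, how many died during the period */
  mortality_rate      = all_deaths / midyear_pop * 1000 ;       /* per 1,000 */
  cause_specific_rate = disease_deaths / midyear_pop * 1000 ;   /* per 1,000 */
  case_fatality       = disease_deaths / disease_cases * 100 ;  /* percent of cases */
  PUT mortality_rate= cause_specific_rate= case_fatality= ;
RUN ;
```

In this made-up town the cause-specific mortality rate is tiny (0.2 per 1,000) while the case-fatality is 5%, because the denominators are different: everyone in the population versus only the people who have the disease.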
During that same period, the case fatality rate was about .1% or .98 per 1,000.
Think about that. One person out of 1,000 who got measles died.
Unfortunately, measles is making a comeback in the United States due to stupid bastards who don’t vaccinate their children. Whether this will translate to an increase in mortality rate remains to be seen. With only 644 cases in 2014, there hasn’t been a reported death – yet.