We’ve looked at data on Body Mass Index (BMI) by race. Now let’s take a look at our sample another way. Instead of using BMI as a variable, let’s use obesity as a dichotomous variable, defined as a BMI greater than 30. It just so happened (really) that this variable was already in the data set so I didn’t even need to create it.

The code is super-simple and shown below. The reserved SAS keywords are capitalized just to make it easier to spot what must remain the same. Let’s look at this line by line.

LIBNAME mydata "/courses/some123/c_1234/" ACCESS=READONLY;
PROC FREQ DATA = mydata.coh602 ;
TABLES race*obese / CHISQ ;
WHERE race NE "" ;
RUN ;

LIBNAME mydata "/courses/some123/c_1234/" ACCESS=READONLY;

Identifies the directory where the data for your course are stored. As a student, you only have read access.

PROC FREQ DATA = mydata.coh602 ;

Begins the frequency procedure, using the data set in the directory linked with mydata in the previous statement.

TABLES race*obese / CHISQ ;

Creates a cross-tabulation of race by obesity. CHISQ, following the slash, is an option that requests the chi-square statistics.

WHERE race NE "" ;

Selects only those observations where we have a value for race (where race is not equal to missing).

RUN ;

Pretty obvious? Runs the program.

Cross-tabulation of race by obesity

 

Similar to our ANOVA results previously, we see that the obesity rates for the black and Hispanic samples are similar at 35% and 38%, while the proportion of the white sample that is obese is 25%. These numbers are the percentage for each row. As is standard practice, a 0 for obesity means no, the respondent is not obese, and a 1 means yes, the person is obese.
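If you only care about the row percentages, you can ask PROC FREQ to leave the other percentages out of each cell. This is just a sketch of one way to do it, not the code that produced the table above; NOCOL and NOPERCENT are standard options on the TABLES statement, and the library mydata is assumed to be assigned already.

```sas
PROC FREQ DATA = mydata.coh602 ;
  /* NOCOL drops column percentages, NOPERCENT drops overall percentages,
     leaving the frequency and the row percentage in each cell */
  TABLES race*obese / NOCOL NOPERCENT CHISQ ;
  WHERE race NE "" ;
RUN ;
```

The chi-square statistics are unchanged; only the layout of the cross-tabulation is simpler.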

The CHISQ option produces the table below. The first three statistics are all tests of statistical significance of the relationship between the two variables.

Table with chi-square statistics

You can see from this that there is a statistically significant relationship between race and obesity. Another way to phrase this might be that the distribution of obesity is not the same across races.

The next three statistics give you the size of the relationship. A value of 1.0 denotes a perfect association (be suspicious if you find that; it’s more often that you coded something wrong than that everyone of one race is different from everyone of another race). A value of 0 indicates no relationship whatsoever between the two variables. For a 2 x 2 table, SAS reports Phi and Cramer’s V on a scale from -1 to +1; for a larger table like this one, they cannot be negative. The contingency coefficient is never negative and always falls below 1. A non-negative scale seems more reasonable to me since what does a “negative” relationship between two categorical variables really mean? Nothing.

From this you can conclude that the relationship between obesity and race is not zero and that it is a fairly small relationship.

Next, I’d like to look at the odds ratios and also include some multivariate analyses. However, I’m still sick and some idiot hit my brand new car on the freeway yesterday and sped off, so I am both sick and annoyed.  So … I’m going back to bed and discussion of the next analyses will have to wait until tomorrow.

So far, we have looked at

  1. How to get the sample demographics and descriptive statistics for your dependent and independent variable.
  2. Computing descriptive statistics by category 

Now it’s time to dive into step 3, computing inferential statistics.

The code is quite simple. We need a LIBNAME statement. It will look something like this. The exact path to the data, which is between the quotation marks, will be different for every course. You get that path from your professor.

LIBNAME mydata "/courses/ab1234/c_0001/" access=readonly;

DATA example ;
SET mydata.coh602;
WHERE race ne "" ;
run ;

I’m creating a data set named example. The DATA statement does that.

It is being created as a subset from the coh602 dataset stored in the library referenced by mydata. The SET statement does that.

I’m only including those records where they have a non-missing value for race. The WHERE statement does that.

If you already did that earlier in your program, you don’t need to do it again. However, remember, example is a temporary data set (you can tell because it doesn’t have a two-level name like mydata.example). It resides in the WORK library, which is cleared when your SAS session ends. Think of it as if you were working on a document and didn’t save it. If you closed that application, your document would be gone. Okay, so much for the data set. Now we are on to ….. ta da da

Inferential Statistics Using SAS

Let’s start with Analysis of Variance.  We’re going to do PROC GLM. GLM stands for General Linear Model. There is a PROC ANOVA also and it works pretty much the same.

PROC GLM DATA = example ;

CLASS race ;

MODEL bmi_p = race ;

MEANS race / TUKEY ;

The CLASS statement is used to identify any categorical variables. Since with Analysis of Variance you are comparing the means of multiple groups, you need at least one CLASS statement with at least one variable that has multiple groups – in this case, race.

MODEL dependent = independent ;

Our model is of bmi_p  – that is body mass index, being dependent on race. Your dependent variable MUST be a numeric variable.

The model statement above will result in a test of significance of difference among means and produce an F-statistic.

What does an F-test test?

It tests the null hypothesis that there is NO difference among the means of the groups, in this case, among the three groups: White, Black and Hispanic. If you fail to reject the null hypothesis, you have no evidence that the group means differ and you can stop.

However, if the null hypothesis is rejected, you certainly also want to know which groups are different from which other groups. After that significant F-test, you need a post hoc test (Latin for “after this”. Never say all those years of Catholic school were wasted).

There are a lot to choose from but for this I used TUKEY. The MEANS statement, the last one in the step, requests the post hoc test.

Let’s take a look at our results.

I have an F-value of 300.10 with a probability < .0001 .

Assuming my alpha level was .05 (or .01, or .001, or .0001), this is statistically significant and I would reject my null hypothesis. The differences between means are probably not zero, based on my F-test, but are they anything substantial?

If I look at the R-square, and I should, it tells me that this model explains 1.55% of the variance in BMI – which is not a lot. The mean BMI for the whole sample is 27.56.

You can see complete results here. Also, that link will probably work better with screen readers, if you’re visually impaired (Yeah, Tina, I put this here for you!).

ANOVA table

 

Next, I want to look at the results of the TUKEY test.

table of post hoc comparisons

 

We can see that there was about a 2-point difference between Blacks and Whites, with the mean for Blacks 2 points higher. There was also about a 2-point difference between Whites and Hispanics. The difference in mean BMI between White and Black samples and White and Hispanic samples was statistically significant. The difference between Hispanic and Black sample means was near zero with the mean BMI for Blacks 0.06 points higher than for Hispanics.

This difference is not significant.
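As an aside, the same kind of pairwise comparison can also be requested with an LSMEANS statement. This is offered as a sketch of an alternative, not as the code that produced the results above. It assumes the temporary data set example and the variable names used earlier. In a one-way model like this one, the least-squares means are the same as the ordinary group means.

```sas
PROC GLM DATA = example ;
  CLASS race ;
  MODEL bmi_p = race ;
  /* PDIFF prints p-values for every pairwise difference between groups;
     ADJUST=TUKEY applies the Tukey adjustment for multiple comparisons */
  LSMEANS race / PDIFF ADJUST=TUKEY ;
RUN ;
QUIT ;
```

The QUIT statement is there because PROC GLM keeps running interactively until you end it.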

So …. we have looked at the difference in Body Mass Index, but is that the best indicator of obesity? According to the World Health Organization, who you’d think would know, obesity is defined as a BMI of 30 or greater.

The next step we might want to take is to examine our null hypothesis using a categorical variable, obese or not obese. That is our next analysis and next post.
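If your data set does not already include an obesity indicator, a DATA step along these lines could create one from BMI. This is only a sketch: it assumes the library mydata has been assigned and that BMI is stored in bmi_p. The name obese_new is made up for illustration. The cutoff used is a BMI of 30 or above; change >= to > if your course defines obesity as strictly greater than 30.

```sas
/* Sketch only: obese_new is a hypothetical variable name */
DATA example ;
  SET mydata.coh602 ;
  WHERE race NE "" ;
  /* SAS treats a missing numeric value as smaller than any number,
     so check for missing first; otherwise a missing BMI would be coded 0 */
  IF MISSING(bmi_p) THEN obese_new = . ;
  ELSE obese_new = (bmi_p >= 30) ;   /* 1 = obese, 0 = not obese */
RUN ;
```

A comparison in SAS returns 1 when true and 0 when false, which is why the single assignment is enough to build the 0/1 variable.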

In the last post, I posed the following null hypothesis as an example:

There is no difference in obesity among Caucasians, African-Americans and Latinos.

You can see the results from the statistical analyses here.

Since my question only pertains to those three groups, let’s begin by creating a data set with just those subjects.

libname mydata "/courses/ab1234/c_0001/" access=readonly;

Data example ;
set mydata.coh602;
where race ne "" ;
run ;

Don’t forget to run the program!

Now, let’s do something new and use something relatively new, the tasks in SAS Studio. On the left screen, click on TASKS, then on STATISTICS, then click DATA EXPLORATION.

 

tasks in left pane

Once you click on DATA EXPLORATION, in the right window pane you’ll see several boxes, but the first thing you need to do is select the correct data set. To do that, click on the thing that looks like a sort of spreadsheet.

When you do that, you’ll see the list of libraries available to you. You need to scroll all the way down to the WORK library. This is where temporary data sets that you create are stored. Click on the WORK library to see the list of data sets in it.

Dataset example in work library

Select the EXAMPLE data set and click on OK. Now that you have your data set, it is time to select your variables.

Window with task roles

Click on the + next to the variables and you’ll get a list of variables from which you can select. Scroll down and select the variable you want. First, as shown above, I selected RACE for the classification variable.

Selecting variables

This gives me a chart, and it appears that whites have a lower body mass index than black or Latino respondents in this survey.

bar chart of BMI

My next analysis is to do the summary statistics. I simply click on SUMMARY STATISTICS under the statistics tab (it’s right under data exploration) and select the same two variables. You can click here to see the results. Mean BMI for both the black and Hispanic samples was 29, while for whites it was 27. Standard deviations for the three groups ranged from 5.7 to 6.9, which was actually less than I expected.
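The task writes the code for you, but a hand-written step that should give comparable numbers might look something like the following. It is a sketch that assumes the temporary data set example created above and that the BMI variable is named bmi_p.

```sas
PROC MEANS DATA = example N MEAN STD ;
  CLASS race ;     /* one row of statistics per race group */
  VAR bmi_p ;      /* the variable being summarized */
RUN ;
```

N, MEAN and STD on the PROC MEANS statement limit the output to the sample size, mean and standard deviation for each group.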

So, there are differences in body mass index by race/ethnicity, but that leaves a few questions:

  1. Do those differences persist when you control for age and gender?
  2. While there are differences in body mass index, that doesn’t necessarily mean more people are obese. Maybe there are more underweight white people. Hey, it’s possible.
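For the first question, one possible approach is to add age and gender to a linear model. This is purely a sketch. It assumes the temporary data set example, that bmi_p is the BMI variable, and that srsex and srage_p hold gender and age. Check your PROC CONTENTS output before relying on those names.

```sas
PROC GLM DATA = example ;
  CLASS race srsex ;                               /* categorical predictors */
  MODEL bmi_p = race srsex srage_p / SOLUTION ;    /* age enters as a numeric covariate */
RUN ;
QUIT ;
```

The SOLUTION option prints the parameter estimates, so you can see how much of the race difference remains once age and gender are in the model.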

Well, now you have a chart and a table to add to the table you created in the first analyses. In the next post, we can move on to those other questions.

 

I get asked this question fairly often so I thought I would do a few posts on it. The most common problem is that a student who is new to statistics has no idea where to even start.

These examples use SAS but you could use any package you like.

My recommendation to students beginning to learn statistics is to start with some type of publicly available data set, getting some experience with real data.

1. IDENTIFY THE VARIABLES YOU HAVE AVAILABLE

The first thing to do is examine the contents of the dataset. Look at the variables you have available. With SAS, you would do this with PROC CONTENTS.

Your program at this point is super simple

LIBNAME mydata "path to where your data are" ;

PROC CONTENTS DATA = mydata.datasetname ;
RUN ;

Normally, you would come up with a hypothesis first and then collect the data. The advantage of working with public use data sets is you don’t have to go to the time and expense of interviewing 40,000 people. The disadvantage is that you are limited to the variables collected.

2. GENERATE A HYPOTHESIS

Looking at the California Health Interview Survey data, I came up with the following null hypothesis:

There is no difference in obesity among Caucasians, African-Americans and Latinos.

3. RUN DESCRIPTIVE STATISTICS

You need descriptive statistics for three reasons. First, if you don’t have enough variance on the variables of interest, you can’t test your null hypothesis. If everyone is white or no one is obese, you don’t have the right dataset for your study. Second, you are going to need to include a table of sample statistics in your paper. This should include standard demographic variables – age, sex, education, income and race are the main ones. Last, and not necessarily least, descriptive statistics will give you some insight into how your data are coded and distributed.

proc freq data = mydata.coh602 ;
tables race obese srsex aheduc ;
where race ne "" ;
run ;

proc means data= mydata.coh602 ;
var ak22_p srage_p ;
where race ne "" ;
run ;

You can see the results from the code above here.

Notice something about the code above: the WHERE statement. My hypothesis only mentioned three groups – Caucasians, African-Americans and Latinos. Those were the only three groups that had a value for the race variable. (This example uses a modified subset of the CHIS, if you are really into that sort of thing and want to know.) Since that is the population I will be analyzing, I do not want to include people who don’t fall into one of those three groups in my computation of the frequency distributions and means.

4. PUT TOGETHER YOUR FIRST TABLE

Using the results from your first analysis, you are all set to write up your sample section, like this

Subjects

The sample consisted of 38,081 adults who were part of the 2009 California Health Interview Survey. Sample demographics are shown in Table 1.

<Then you have a Table 1>

Variable          N        %

Race
  • Black       2,181    5.7
  • Hispanic    4,926   13.0
  • White      30,974   81.3

Gender
  • Male       15,751   41.4
  • Female     22,330   58.6

Variable     N        Mean      SD

Age          38,081   55.4      18.0
Income       37,686   $69,888   $63,586

 

I’ll try to write more soon, but for now The Invisible Developer is pointing out that it is past 1 a.m. and I should get off my computer.

For many students just learning statistics, the relationship of z-scores and probability is confusing.

Let’s try this concrete example. Here is a chart of the distribution of height in a sample of over 2,800 women.

distribution of height of women

 

Notice that the peak, the mode, is around 62-63 inches.

You can see the frequency table here, as well as a larger picture of the histogram. You’ll notice the median is between 62 and 63 inches.

The mean is 62.7 – between 62 and 63 inches.

Looks like a normal distribution in that mean = median = mode.

Let’s go back to that mean of 62.7 inches. The standard deviation in this sample is 2.46. What would 2 standard deviations above the mean be? Let’s round our 2 x 2.46 = 4.92 up to 5.

The mean + 2 standard deviations = 62.7 + 5 = 67.7

height distribution of women with 2 sd marked

If you look at the frequency distribution you’ll see that 97.3% of the people measured had a height of 67 inches or less.

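So, this is a nice demonstration of what we mean when we say that roughly 97.5% of the people fall below 2 standard deviations above the mean. (For an exactly normal distribution the figure is 97.7%; 97.5% corresponds to 1.96 standard deviations.)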
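If you would like SAS to do that arithmetic, a short DATA step will do it. This is a minimal sketch using the mean and standard deviation quoted above. PROBNORM and CDF are built-in SAS functions, and the variable names are just ones I picked for illustration.

```sas
DATA _NULL_ ;
  mean = 62.7 ;
  sd = 2.46 ;
  cutoff = mean + 2*sd ;                        /* 67.62 inches, before rounding */
  /* Proportion of a standard normal distribution at or below z = 2 */
  p_below = PROBNORM(2) ;                       /* about 0.977 */
  /* Same proportion from the general CDF function, in original units */
  p_check = CDF('NORMAL', cutoff, mean, sd) ;
  PUT cutoff= p_below= p_check= ;               /* writes the values to the log */
RUN ;
```

Both p_below and p_check describe the theoretical normal curve, so expect them to be close to, but not identical to, the percentage in the frequency table.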

First off, the good news. You can find all of the papers from SAS Global Forum 2015 online.  This is good news if you are anything like me (and you should be, because, let’s face it, I’m awesome) because even if you went to Dallas there were no doubt several papers you wanted to attend scheduled at the same time.

I liked everything I attended but there were two that stood out as really interesting. The first one was …

Taking the Path More Travelled – SAS Visual Analytics and Path Analysis
Falko Schulz
You can download it here
http://support.sas.com/resources/papers/proceedings15/SAS1444-2015.pdf

My idea of path analysis is a series of regression coefficients where you calculate direct and indirect effects. That is not the path analysis discussed in this paper.

He literally means: what path did the customer (critter, whatever) take?

For example, your path in using this website could be that you went to the home page, then the blog home page, then the previous entry on the blog.

While websites are an obvious use for this type of path analysis, there could be many others – customer experience in a call center, where people go in a huge department store, migration of humans or animals, path to achieving a job at a start-up.

Drop-off is often of interest in a path analysis – did they fall out of the path before the endpoint you wanted, e.g., sale, employment, customer support problem solved?

You can also look at weight in a path, not only whether they buy a widget but how much money did they spend?

Visual analytics allows for path segmentation. You can combine items or exclude items.

In the example, Schulz discussed using path analysis to see how effective your different online marketing methods are. Since many people will come from typing your name into a search engine, you may want to delete those paths and only include ads from Google AdWords, BlogHer, your corporate website and other paid marketing efforts.

You can click on events and select Exclude to filter out all paths beginning with those events that are not of interest to you.

Sankey diagrams are available in visual analytics. Although these have their origin in uses like energy flow, they are now being applied all over the place.

Here is a sample from Schulz’s paper

Sample Sankey Diagram

A Sankey diagram, FYI, shows the direction and quantity of the flow along a path. There is a blog devoted to Sankey diagrams here.
http://www.sankey-diagrams.com/sankey-definitions/

(This wasn’t mentioned in the paper, I just found that interesting. I’m sure there’s a blog out there devoted to Gantt charts that I could find if I looked, which I didn’t.)

Once the path analysis roles of interest are defined:

  • Event
  • Sequence
  • ID

… one of the first things to do would be drop the number of paths displayed. Just imagine the mess you would be looking at if you tried to visually display all of the paths someone took in navigating a website with even a few hundred pages.

You can edit the minimum path frequency, e.g., only show a path if at least 250 people took it.

This is just a brief, brief taste of what you can do with path analysis using SAS Visual Analytics and the coolness of SAS Global Forum. There was a lot, lot more and I’ll try to post about the second paper I really liked this week, but for now perhaps I should quit looking out the window and pay attention in this training session I’m sitting in at Fort Berthold (don’t tell Bruce I wrote my blog during it).

tractor in dirt field

If you missed out on SAS Global Forum, you don’t need to wait until next year for your fix of networking, instruction (and possibly drinking). You can go to the Western Users of SAS Software conference in San Diego in September.

I finally am getting around to something in SAS Studio that I think is really cool – the tasks.

Although they don’t look identical to SAS Enterprise Guide just because the screen layout is a little different, these are really, really similar to the tasks you would see in EG.

screens for SAS studio tasks

If you are using this with a real beginner class, you can start out with using the sashelp directory. Otherwise, the only programming they will need is to assign their directory at the beginning. Run that program and away you go.

Let’s take an example with SASHELP.

1. Log into your SAS on-demand account and select SAS Studio

2. Click on tasks in the left pane.

tasks pane

3. Select statistics

4. Select distribution analysis

Note that Studio will show you the type of variables required for your analysis.

Window with option for distribution analysis

It will also tell you in the right window pane whether you are unable to run your analysis and, if so, what you are missing.

code window telling what you are missing

To select a variable, just click the + sign and you’ll see a list of the variables in your data set

Select the desired variables for the roles and click your little running guy at the top. (This will be greyed out until you have all of the required fields filled in.)

running guy

Your results will appear in the right window pane.

charts of distribution by married or not

If you want to see the code, click on the code button.

 

proc univariate data=SASHELP.BWEIGHT noprint ;

class married ;

histogram Weight ;

run ;

To keep your program:

  • Copy the code
  • Click on folders
  • Click on the far left button and select new program
  • Paste in your code
  • Save your program

location of new program button

program pasted from tasks

You could do a whole bunch of analyses this way and come up with a lot of results and a long SAS program without knowing anything about programming SAS. Some people would find that horrifying but I think it’s really cool.

A really interesting assignment for students would be to take a data set in the sashelp library, do as many tasks as they can and see what interesting results and conclusions they can find.

view over the top of my ipad

It has been pretty well established that I am the worst soccer mom in the history of soccer moms. Most of the games I miss because I am somewhere else. My children have told me that my autobiography should be entitled, “I was out of town at the time” because most of the stories of their childhood begin this way.

Having come back in town shortly before the game this weekend, I was unaware that it was a two-day tournament 2 1/2 hours from home and that we were supposed to have reserved the hotel weeks ago. Hot tip: If you get your reservation last minute and have the choice of a close hotel or a nice hotel, get the nice one.

I fulfilled my obligation. I showed up. During the time The Spoiled One played, I watched. During half time and the breaks between the games I was able to write a couple of blog posts and test out SAS Studio.

If you look at the picture above you might see that I was working in a field surrounded by mountains. Not the best situation for Internet access, which I had via the hotspot on my iPhone.

SAS program screen

I was able to log on to SAS Studio with no problem. When I logged in on my iPad I had the screen shown above where I could just start typing my program in the code window.

To see folders, libraries, etc. tap the BROWSE link in the top left corner, as shown

list of folders and libraries

You can tap any of the categories to bring down the list of folders, libraries, etc. You can tap on a file to open it.

The one problem I did have, and depending on your situation, it may be a severe one, was that I could not get any of the libraries to open. I wanted to open the sashelp library and see if I could run some tasks using an open data set. This did not work. It is very possibly related to poor Internet due to lying in a soccer field ringed by mountains. I tried it last year in a movie theater and I was able to access the libraries. In this case, as you might guess from the top photo, the Internet was barely accessible.

Next, I tried simulating a homework problem a student might have, just typing in some data and running the program.
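A program of that kind might look something like the sketch below. The data set name, variable names and values are all made up for illustration; they are not the data in my screenshot.

```sas
/* Hypothetical homework data typed directly into the program */
DATA homework ;
  INPUT id score ;
  DATALINES ;
1 85
2 92
3 78
4 88
5 95
;
RUN ;

/* Descriptive statistics for the typed-in scores */
PROC MEANS DATA = homework N MEAN STD MIN MAX ;
  VAR score ;
RUN ;
```

Because the data are typed in after the DATALINES statement, the program needs no external files and no libraries.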

running a program

I have a bluetooth keyboard I use with my iPad and it all worked fine. I typed in data, tapped on the little running guy and my program ran fine. You can see the results below.

results of proc means

To save it, I held down the home button and the power button simultaneously, just like any time you take a screenshot on an iPad. Then, I emailed that screenshot to myself, so here you have results.

My point is that a student could do their homework using SAS Studio in the middle of a soccer field on an iPad, as long as it did not require external files, which most of the homework I assign does not. They could then email the results to their professor, still from the (dis)comfort of the field.

This is useful to know for three reasons:

  1. I travel frequently to areas where there is very limited bandwidth,
  2. Many of the students in my online courses live in areas with limited bandwidth,
  3. The Spoiled One’s team won their bracket in the State Cup, so it turns out that means they have more soccer games next weekend as they advanced in the tournament. This is not at the same field surrounded by mountains. It’s at a different field at the edge of the desert. Sigh.

Take-away points:

Your students should be able to use SAS Studio almost anywhere, even if all they have is an iPad.

This is doubly true if you don’t assign homework that requires accessing external datasets.

I’ll be able to review homework assignments for the course I am teaching next during the soccer tournament this weekend. (I really AM the worst soccer mom in the history of ever.)

girl playing soccer

—————— SHAMELESS PLUG

Our Kickstarter campaign is still going on, making adventure games to make math (and history and English) awesome.

We are 84% of the way to our goal!

Show your “blog appreciation” and help us get the last few steps by backing us now. 

 

Getting ready to teach biostatistics in a few weeks and it seems to me that the real confusion in most cases is not the calculations, which can be fairly simple, but rather that there can be several ways of looking at the same question. Let’s take “risk” as an example.

What is the “risk” of diabetes?

You could answer this by prevalence – 9.3% of Americans have diabetes. So you could say you have about a 1 in 11 chance of having diabetes. Is that your risk?

On the other hand, incidence, the number of new cases per year, is about 1.8 million, which comes out to around 0.6% in a population of 313 million. So, your chance of being newly diagnosed with diabetes is around 1 in 175. Is that your risk?
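The incidence arithmetic is easy to check in a DATA step. This is a minimal sketch using the two figures quoted above; the variable names are my own.

```sas
DATA _NULL_ ;
  population = 313e6 ;                    /* 313 million people */
  new_cases = 1.8e6 ;                     /* 1.8 million new cases per year */
  incidence = new_cases / population ;    /* about 0.0058, or 0.6% */
  one_in = 1 / incidence ;                /* about 174 */
  PUT incidence= one_in= ;                /* writes the values to the log */
RUN ;
```

The exact ratio comes out near 1 in 174, which shows how much rounding goes into a "1 in something" statement.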

In discussing risk of a disease, it may be useful to consider the specific population. For example, the CONDITIONAL risk of having diabetes given the condition that your ethnicity is Asian-American of Chinese descent is 4.4%. (Conditional risk of a disease is defined here as the prevalence given a specific condition.)

Conditional risk given that you are Puerto Rican is 14.8%.

What is the relationship between diabetes and ethnicity?

This is another simple-sounding question that can be answered in multiple ways. First of all, what is your reference group?

Is it, say, Puerto Ricans compared to the total prevalence of 9.3%? Is it Puerto Ricans compared to non-Hispanic whites? To all Hispanics? To Americans of Chinese descent? If the latter sounds silly, I’m not sure why it is any sillier than non-Hispanic whites, but perhaps someone can enlighten me.

Once you have a reference group, then what do you pick as the method of measuring relationship?

Risk difference is the absolute value of difference in probabilities between two groups.

So, if the probability of someone who is Puerto Rican having diabetes is 14.8%, which it happens to be, and the prevalence of diabetes among Central and South Americans in the U.S. is 8.5% (which it also is), the risk difference is 14.8 – 8.5 = 6.3%.

The relative risk is the risk of one group divided by the risk of the other group. So, the relative risk is 14.8 / 8.5 = 1.74. Rounding loosely, you could say that Puerto Ricans are nearly twice as likely to have diabetes as Central or South Americans, which sounds considerably different from saying that the difference between the two groups’ risks is .063.
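Both measures can be computed from the same two prevalence figures. Here is a minimal sketch in a DATA step; the variable names are my own and the two risks are the ones quoted above.

```sas
DATA _NULL_ ;
  risk_pr = 0.148 ;     /* prevalence among Puerto Ricans */
  risk_csa = 0.085 ;    /* prevalence among Central and South Americans */
  risk_difference = ABS(risk_pr - risk_csa) ;   /* 0.063 */
  relative_risk = risk_pr / risk_csa ;          /* about 1.74 */
  PUT risk_difference= relative_risk= ;         /* writes the values to the log */
RUN ;
```

Same two numbers in, two quite different-sounding answers out, which is the whole point about choosing your measure deliberately.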

Then there are odds ratios, which I have written about extensively, including here, as well as the proportional attributable fraction and proportional attributable risk.

Well, I can go on for weeks – and will, once class starts.

How to make it all less confusing

Start with this question, “What do you want to know and why do you want to know that?”

If you want to know what the probable demand for insulin will be in the next year, you might care most about prevalence + incidence. If you are interested in predicting diabetes 10 years from now, you might be very interested in differing probabilities within ethnic groups, as some have a much faster rate of growth than others.

If you are interested in screening or prevention, you would be very interested in which groups have the highest incidence.

I’m thinking a fun and useful thing to do for both biostatistics and epidemiology would be to have students make a flowchart with questions like: If you want to know this, then do that.

That’s a couple more posts, at least.

————————–

Feel smarter? Want to be even smarter?

Check out what I’m analyzing these days, data from games that make you smarter. You can back us on Kickstarter at $10 or more and get the newest game yourself.

Or, you can go to my session at SAS Global Forum and learn about preparing students for the real world of data with SAS Studio

Any time I hear someone brag,

“I’ve never used X in my life,”

I automatically assume that whatever it is, they haven’t learned it very well. Just about everything I’ve learned has come in useful, and the better I learned it, the more useful it is.

Take statistics, for example. There is nowhere in my life that knowledge of statistics isn’t helpful. Darling Daughter Number 3 competes in mixed martial arts and I’m the worrying type.

Ronda armbars

Darling Daughter 3 Defending World Title (Photo by @HansGutknecht )

Whenever her next fight is announced, the very first thing I do is check the fight odds. For the one coming up in Brazil, she is a 15-1 favorite. Knowing that makes my stress level go down a little. I’ll still drop by her gym a time or two during camp just to reassure myself that all is going well. As I said, I’m a worrier.

Player needing help

Back our Kickstarter game, Forgotten Trail!  Read more about what I’m worried about today here. Watch the video. It’s only 2 1/2 minutes!

The latest thing I’m worrying about is our Kickstarter campaign, but here again, statistics cheer me up. Two years ago, we did a Kickstarter campaign with a goal of $20,000. I should have researched a bit better in advance because, even though Kickstarter touted the 44% success rate, that is an average (there’s that knowledge of statistics again). Things that were less likely to get funded were projects seeking over $10,000, game projects and projects not featured on Kickstarter. We fit all three. Pretty depressing. In fact, looking at the statistics after we had started our campaign last time, I found that less than 5% of campaigns raised over $20,000.

Well, we made it. You’d think we would have learned our lesson, but for a couple of reasons I’ll go into another day, we decided to do ANOTHER Kickstarter two years later. So, here we are today.

tired girls on hill

The bad news is that the success rate on Kickstarter has gone down. The overall success rate is now 39%. The semi-good news is that the success rate for games actually ticked up a bit: it was 33% two years ago and it is 34% now.

The really good news: success tends to be all or nothing – 79% of projects that raised 20% of their goal ended successfully funded. Of projects that raised 41% of their goal, 94% went on to be successfully funded. We’re at 42% and we still have two-thirds of our campaign to run, so I’m feeling somewhat less worried.

So, now that you know what the odds are, help a sister out and go back us. I know you have access to the internet, because you’re reading this!

 

 

 
