Jun
26
Data Analysis by Example: That’s funny …
Filed Under Software, statistics, Technology | 2 Comments
In the last post, I used SAS Enterprise Guide to filter out a couple of ‘bad’ records that came from test data, then I created a summary table of the number of questions answered and the percentage correct. Then, I calculated the mean percentage correct, which came out to around 84%. That seemed a bit high to me.
Having (temporarily) answered the first question regarding the number of individual subjects and the average percent of correct answers from the 424 subjects, I turned to the next question:
Is there a correlation between percentage correct and the number of questions attempted? That is, do students who are getting the answers correct persist more often?
Since I had both variables, N and the mean correct (which, since this was scored 0 = incorrect, 1 = correct, gave me the percentage correct) from the summary tables I had created in the previous step, it was a simple procedure to compute the correlation.
I just went to the TASKS menu, selected MULTIVARIATE and then CORRELATIONS
Under ANALYSIS VARIABLES I dragged correct_N, the N for the ‘correct’ variable, which is a variable that holds whether the student answered correctly, 0 (= no) or 1 (= yes). Under CORRELATE WITH I dragged correct_mean, which has the percentage each student answered correctly.
Since it is just a bivariate correlation and the correlation of X with Y = the correlation of Y with X , it would make absolutely no difference if I switched the spots where I dragged the two variables.
I click RUN and I get a somewhat unexpected result, which you can see here: a correlation of -.07.
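For anyone who would rather type than click, the same correlation can be requested from a program window. This is only a sketch: the data set name work.user_summary is made up for illustration, and it assumes the saved summary table holds the two variables described above, correct_N and correct_Mean.

```sas
/* Sketch only: work.user_summary is a hypothetical name for the
   saved summary table, with one row per username. */
proc corr data=work.user_summary;
  var  correct_N;      /* number of questions attempted */
  with correct_Mean;   /* proportion answered correctly */
run;
```

Swapping the VAR and WITH variables gives the same coefficient, for the reason given above.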
I also note that the minimum number of answers attempted is 1. Now, I have done (and published) analyses of these data elsewhere, as this is an on-going project.
Other analyses from this same project can be found in:
Telling Stories with Your Data and
Because of these analyses of ‘Fidelity of Implementation’, that is the degree to which a project is implemented as planned, I am pretty sure that these data include a large proportion of students who only had the opportunity to play the game once.
So … I decided to run a scatter plot and check my suspicion. This is pretty simple. I just go to the TASKS menu and select GRAPH then SCATTER PLOT.
I selected 2-D Scatter Plot
Then, I clicked on the DATA tab, dragged correct_Mean under Horizontal and Correct_N under Vertical, then clicked RUN.
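A code version of that scatter plot might look like the sketch below. It is not necessarily the code the Enterprise Guide task generates, and work.user_summary is again a hypothetical name for the summary data set.

```sas
/* Sketch only: the data set name is an assumption. */
proc sgplot data=work.user_summary;
  /* mean correct on the horizontal axis, questions attempted on the vertical */
  scatter x=correct_Mean y=correct_N;
run;
```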
This produced the graph below.
Now, this graph isn’t fancy but it serves its purpose, which is to show me that there IS in fact a correlation of mean correct and the number of problems attempted. Look at that graph a minute and tell me that you don’t see a linear trend – but it is pulled off by the line of 1.0 at the far end.
This did NOT fit my preconceived notion, though, that the lack of correlation was due to the players who played once, and so there would be a bunch of people who had answered 1 or 2 questions and got 100% of them correct. Actually, those 100-percenters were all over the distribution in terms of number of problems attempted.
This reminds me of a great quote by Isaac Asimov,
The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ (I found it!) but ‘That’s funny …’
Well, we shall see, as our analysis continues …
Want to see these data at the source?
You can also follow the link above to donate a copy of the game to a school or give as a gift.
Jun
20
The Village Watchman and SAS Enterprise Guide Summary Tables
Filed Under Software, statistics, Technology | 1 Comment
The government is extremely fond of amassing great quantities of statistics. These are raised to the nth degree, the cube roots are extracted, and the results are arranged into elaborate and impressive displays. What must be kept ever in mind, however, is that in every case, the figures are first put down by a village watchman, and he puts down anything he damn well pleases.
Josiah Stamp
Any time you do anything with any data your first step is to consider the wisdom of Sir Josiah Stamp and check the validity of your data. One quick first step is using the Summary Tables task from SAS Enterprise Guide. If you are not familiar with SAS Enterprise Guide, it is a menu driven application for using SAS for data analysis. You can open a program window and write code if you like, and I do that every now and then but that’s another post. In my experience, SAS Enterprise Guide works much better with smaller data sets – defined by me, as the blog owner, as fewer than 400,000 records or so. Your mileage may vary depending upon your system.
How to do it:
1. Open SAS Enterprise Guide
2. Open your data set – (FILE > OPEN > DATA)
3. From the TASKS menu, select DESCRIBE and then SUMMARY TABLES. The window below will pop up.
4. Drag the variables to the roles you want for each. Since I have fewer than 450 usernames here, I just quickly want to see whether there are duplicates or errors (e.g. ‘gret bear’ is really the same kid as ‘grey bear’, with a typo). I also want to find out the number of problems each student attempted and the percent correct. So, I drag ‘username’ under CLASSIFICATION VARIABLES and ‘correct’ under ANALYSIS variables. You can have more than one of each but it just so happens I only have one classification and one analysis variable I’m interested in right now.
5. Next click on the tab at left that says SUMMARY TABLES and drag your variables and statistics where you want them. I want ‘username’ as the row, so I drag it to the side, ‘correct’ as the column, N is already filled in as a statistic if you drag your classification variable to the table first. I also want the mean, so I drag that next to the N. Then, click RUN.
Wait a minute! Didn’t I say I wanted the percent correct for each student? Why would I select mean instead of percent?
Because the pctN will simply tell me what percent of the total N the responses from this username make up. I don’t want that. Since the answers are scored 0 = wrong, 1 = right, the mean will tell me what percentage of the questions were answered correctly by each student. Hey, I know what I’m doing here.
6. Look at the data! In looking at the raw data, I see that there are two erroneous usernames that shouldn’t be there. These data have been cleaned pretty well already, so I don’t find much to fix. Now, I want to re-run the analysis deleting these two usernames.
7. At the top of your table, you’ll see an option that says “Modify Task”. Click that.
8. You’ll have the summary tables window pop up, this time with your data filled in. Click on the edit button at the top right of this window. You are about to create a task filter.
9. Under TASK FILTER pull down the first box to show the variable ‘username’. Pull down the second box to show the option NOT EQUAL TO and then click the three dots next to the third box. This will pull up a list of all of your values for usernames. You can select the one you want to exclude and click OK. Next to the three dots, pull down to select AND, then go through this to select the second username you want to delete. You can also just type in the values, but I tend to do it this way because I’m a bad typist with a bad short-term memory.
10. Create a SAS dataset of the output. It’s super easy. Click on the RESULTS tab to the left and in the window that pops up click SAVE RESULTS TO A DATA SET. Then, click RUN.
11. The most recently created data set should be your default data set for analysis but click on it in your process flow diagram to activate it just in case.
12. From the DESCRIBE menu again select SUMMARY STATISTICS
13. Drag ‘correct_mean’ under ANALYSIS VARIABLES and click RUN.
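If you would rather write code than click, the point-and-click steps above can be approximated with two PROC MEANS calls. This is a sketch under stated assumptions, not the code Enterprise Guide generates: work.answers is a hypothetical data set with one row per answer, and ‘baduser1’ and ‘baduser2’ are placeholders for the two erroneous usernames.

```sas
/* Sketch only: work.answers, baduser1 and baduser2 are hypothetical. */

/* One row per username: N answered and mean of the 0/1 'correct' variable.
   Because correct is scored 0/1, its mean is the proportion correct. */
proc means data=work.answers noprint nway;
  where username not in ('baduser1', 'baduser2');   /* the task filter */
  class username;
  var correct;
  output out=work.by_user n=correct_N mean=correct_Mean;
run;

/* Summary statistics across students */
proc means data=work.by_user n mean std;
  var correct_Mean;
run;
```

The NWAY option keeps only the per-username rows in the output data set, leaving out the overall total row.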
The resulting table gives me my answer – the mean is .838 with a standard deviation of .26 for N=424 subjects. So … the average subject answered 84% of the problems correctly. This, however, is just the first step. There are a couple more interesting questions to be answered with this data set before moving on. Read the next step here.
————–
Want to play the game that produced these data? Own a Mac or Windows computer? Have ten bucks?
May
20
Parceling Items in Factor Analysis
Filed Under Software, statistics | 3 Comments
First of all, what are parcels? Not the little packages your grandma left on the table in the hall when she came back from shopping. Well, not only that.
In factor analysis, parcels are simply the sum of a small number of items. I prefer using parcels when possible because both basic psychometric theory and common sense tell me that a combination of items will have greater variance and, c.p., greater reliability than a single item.
Just so you know that I learned my share of useless things in graduate school, c.p. is Latin for ceteris paribus which translates to “other things being equal”. The word “etcetera” meaning other things, has the same root.
Now you know. But I digress. Even more than usual. Back to parcels.
As parcels can be expected to have greater variance and greater reliability, harking back to our deep knowledge of both correlation and test theory we can assume that parcels would tend to have higher correlations than individual items. As factor loadings are simply correlations of a variable (be it item or parcel) with the factor, we would assume that – there’s that c.p. again – factor loadings of parcels would be higher.
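The variance and reliability claims can be written out. What follows is the standard textbook result rather than anything specific to this data set, and the reliability formula (Spearman-Brown) assumes parallel items, which real items only approximate. Here $X_1$ and $X_2$ are two items, $\rho$ is the reliability of a single item and $k$ is the number of items in the parcel.

```latex
\begin{align*}
\operatorname{Var}(X_1 + X_2) &= \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + 2\operatorname{Cov}(X_1, X_2) \\
\rho_k &= \frac{k\,\rho}{1 + (k - 1)\,\rho}
\end{align*}
```

For example, two parallel items that each have reliability $\rho = .50$ give a parcel with $\rho_2 = 1.0 / 1.5 \approx .67$. The covariance term is positive whenever the items correlate positively, which is why the parcel has more variance than the items taken separately.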
Jeremy Anglim, in a post written several years ago, talks a bit about parceling and concludes that it is less of a problem in a case, like today, where one is trying to determine the number of factors. Actually, he was talking about confirmatory factor analysis but I just wanted you to see that I read other people’s blogs.
The very best article on parceling was called To Parcel or Not to Parcel and I don’t say that just because I took several statistics courses from one of the authors.
To recap this post and the last one:
I have a small sample size and, due to the unique nature of a very small population, it is not feasible to increase it by much. I need to reduce the number of items to an acceptable subject-to-variables ratio. The communality estimates are quite high (over .6) for the parcels. My primary interest is in the number of factors in the measure and finding an interpretable factor.
So… here we go. The person who provided me the data set went in and helpfully renamed the items that were supposed to measure socializing with people of the same culture ‘social1’, ‘social2’ etc, and renamed the items on language, spirituality, etc. similarly. I also had the original measure that gave me the actual text of each item.
Step 1: Correlation analysis
This was super-simple. All you need is a LIBNAME statement that references the location of your data and then:
PROC CORR DATA = mydataset ;
VAR firstvar -- lastvar ;
RUN ;
In my case, it looked like this
PROC CORR DATA = in.culture ;
VAR social1 -- art ;
RUN ;
The double dashes (two hyphens, --) are interpreted as ‘all of the variables in the data set located from var1 to var2’. This saves you typing if you know all of your variables of interest are in sequence. I could have just used a single dash if they were named the same, like item1 - item17, and then it would have used all of the variables named that regardless of their location in the data set. The problem I run into there is knowing what exactly item12 is supposed to measure. We could discuss this, but we won’t. Back to parcels.
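Here is a small self-contained illustration of the two kinds of variable list. The data set and its values are invented purely for the demonstration.

```sas
/* Toy data, invented for illustration. Variable order in the data set
   is: item1 item2 item3 age social1 dance art */
data demo;
  input item1 item2 item3 age social1 dance art;
  datalines;
1 2 3 30 4 5 6
;

/* Double dash: every variable positioned from social1 through art,
   whatever they are named (here social1, dance, art) */
proc print data=demo;
  var social1 -- art;
run;

/* Single dash: the numbered series item1, item2, item3,
   wherever those variables sit in the data set */
proc print data=demo;
  var item1 - item3;
run;
```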
Since you want to put together items that are related both conceptually and empirically (that is, the things you think should correlate, do), you first want to look at the correlations.
Step 2: Create parcels
The items that were expected to assess similar factors tended to correlate from .42 to .67 with one another. I put these together in a very simple data step.
data parcels ;
set out.factors ;
socialp1 = social1 + social5 ;
socialp2 = social4 + social3 ;
socialp3 = social2 + social6 + social7 ;
languagep = language2 + language1 ;
spiritualp = spiritual1 + spiritual4 ;
culturep1 = social2 + dance + total;
culturep2 = language3 + art ;
run ;
There was one item that asked how often the respondent ate food from the culture, and that didn’t seem to have a justifiable reason for putting with any other item in the measure.
Step 3: Conduct factor analysis
This was also super-simple to code. It is simply
proc factor data= parcels rotate= varimax scree ;
var socialp1 - socialp3 languagep spiritualp spiritual2 culturep1 culturep2 ;
run ;
I actually did this twice, once with and once without the food item. Since it loaded by itself on a separate factor, I did not include it in the second analysis. Both factor analyses yielded two factors that every item but the food item loaded on. It was a very nice simple structure.
Since I have to get back to work at my day job making video games, though, that will have to wait until the next post, probably on Monday.
—–
Be more than ordinary. Take a break. Play Forgotten Trail. I bet you have a computer!
May
2
What I learned from my favorite paper at SAS Global Forum
Filed Under Software, statistics, Technology | Leave a Comment
At first, I was thinking it wasn’t right to have a favorite paper, but then I realized that was idiotic. It’s not like these papers (or their presenters) are my children.
My favorite paper was,
Statistical modeling for large complex data: Five new directions from SAS/STAT software
If you’re not a statistician, props to you for reading after that first sentence, especially since some of the lessons apply to any conference.
1. You don’t always have to present or attend presentations on whatever is shiny and new. The techniques he presented, like GLMSELECT, a method for selecting the best model, are not brand new. I remember when it was first added to SAS/STAT and thinking it was a way cool idea I should use – but, then, I didn’t. As you can see from the graph above, it can be pretty easy to select the best model. Looks a lot like a scree plot, doesn’t it? This also further supports my point that visual displays of data, like the one above, are everywhere and taking over. Now that I have been reminded of its existence, I’m looking for a use for it so I can really remember it. Unfortunately, this is a method for general linear models and what I am most interested in right now has a binomial outcome, whether a player finished a game or not.
2. Don’t stop learning when you go home. I remembered that there was also an example in this paper that used HPGENSELECT for generalized linear models, including binomial distributions. So, I am going to try that out with this dataset. One of the areas where I am improving is actually reading all of those papers I mean to get around to when I get home. Whether it is a paper you attended, but is now jumbled around in your brain with the other 25 sessions, or one you could not attend because it conflicted with something else, when you get home, you should read it. Conferences can be expensive and you want to get the most out of that time and money you spent.
3. Of course, I learned about sparse regression, quantile regression, classification and regression trees and more, which you can, too, if you follow my advice from #2.
Okay, well there is a lot more to say about SAS Global Forum and my adventures with HPGENSELECT but we have a new game, Forgotten Trail, coming out for sale tomorrow, so back to work.
———-
Apr
22
Visual Analytics are EVERYWHERE: SAS Global Forum Continued
Filed Under Software, statistics, Technology | 1 Comment
The nice thing about going to SAS Global Forum is that it’s the gift that keeps on giving. Long after I have gone home, there are still points to ponder.
Visual analytics is big, and not just in the sense that there is a product out by that name, which I have never used, but in the sense that every presentation, no matter how ‘tech-y’, now makes very effective use of graphics. If I were the type of person to say I told you so, I would mention that I predicted this six years ago after I went to SAS Global Forum in 2010.
In my last post, I mentioned the propensity score graphic with mustaches.
Richard Cutler’s presentation on PROC HPSPLIT, which was really excellent, made extensive use of graphics to illustrate fairly complex models.
You can create classification and regression trees (the model you can’t see in this tiny graphic on the left) and you can drill down into sub-trees for further analysis.
Sometimes your classification tree is very easily interpretable. For example, in this case here from the same presentation, each split represents a different type of vegetation/ land surface – water, two different species of tree, etc.
Speaking of classification, regression and PROC HPSPLIT ….
If you didn’t know, now you know
PROC HPSPLIT is a high performance procedure for fitting and classification now available in SAS/STAT which is useful for data sets where relationships are non-linear. It produces classification and regression trees, includes options for pruning trees and a whole lot more. It is now available on a single computer, not limited to high performance computing clusters. So, yay!
A regression tree is what you get when your dependent variable is continuous, and a classification tree when it is categorical, as in the vegetation example above.
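To make the distinction concrete, here is a minimal sketch using the sashelp.cars sample data set rather than anything from the presentation. The choice of predictors is arbitrary and only for illustration. As I understand the procedure, a response that is also listed in the CLASS statement is treated as categorical, and one that is not is treated as continuous.

```sas
/* Classification tree: origin is categorical, so it is listed in CLASS */
proc hpsplit data=sashelp.cars;
  class origin type drivetrain;
  model origin = type drivetrain enginesize horsepower mpg_city weight;
run;

/* Regression tree: msrp is continuous, so it is NOT listed in CLASS */
proc hpsplit data=sashelp.cars;
  class type drivetrain;
  model msrp = type drivetrain enginesize horsepower mpg_city weight;
run;
```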
On a semi-related note, graphics can even be used to show when a data set is not suited to a linear model as in the example below, also from Cutler’s presentation. You can see that all of the 1’s are in two quadrants and all of the 0’s in two other quadrants. Yes, you COULD use a regression line to fit this but that is not the best fit of the data.
Also, on the related topic that visualizing data, like all of statistics, really, is a process of iterations: I think this would be more obvious if the quadrants were color coded.
I have a lot more to say on this but I am in North Dakota speaking at the ND STEM conference this weekend and a kind soul gave me tickets to the hockey game in the president’s box, so, peace, I’m out.
Apr
19
SAS Global Forum Random Post 1: Statistics
Filed Under Software, statistics, Technology | Leave a Comment
If you did not go to SAS Global Forum this week, here are some things you missed:
Me, rambling on about the 13 techniques all biostatisticians should know, including the answer to:
If McNemar and Kappa are both statistics for handling correlated, categorical data, how can they give you completely different results?
The answer is that the two test different hypotheses, apply different formulas and are coded differently.
McNemar tests whether the marginal probabilities are the same. For example, when you switched your patients from drug one to drug two, was there a decrease in the number who experienced side effects? These are correlated data because they are the same people. Can’t get much more correlated than that.
Kappa tests whether the level of agreement of two raters is greater than would be expected by chance. I’ve rambled on it here before, using it to test the level of agreement that our 7 Generation Games raters have when scoring the pretest and post-test we use to assess whether kids are improving as a result of playing our games. Quick answer: Yes.
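One way to see the two statistics side by side is the AGREE option in PROC FREQ, which for a 2 x 2 table reports both McNemar’s test and the kappa coefficient. The counts below are invented for illustration and are not from the talk or from any real study.

```sas
/* Hypothetical paired data: did the same patients report side effects
   on drug one and on drug two? Counts are made up. */
data paired;
  input drug1 $ drug2 $ count;
  datalines;
yes yes 40
yes no  25
no  yes  5
no  no  30
;

proc freq data=paired;
  tables drug1*drug2 / agree;   /* McNemar's test and simple kappa */
  weight count;
run;
```

McNemar’s test looks only at the two discordant cells (25 and 5 here) to ask whether the marginal proportions changed, while kappa compares the agreement on the diagonal (40 and 30) with what chance alone would produce. That is why the same table can give quite different answers to the two questions.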
You also missed Lucy D’Agostino McGowan’s talk on propensity score matching integrating SAS and R.
Random notes from that presentation:
Why would you want to do this? Well, it would be lovely if you could do a randomized control trial and send your subjects randomly off to treatment or control group.
However, what if your subjects tell you to drop dead they’re not going to be in your stupid treatment group?
In my experience, propensity scores have been commonly used when evaluating special programs that do not randomly receive patients. For example, patients sent to an Intensive Care Unit tend to be sicker than non-ICU patients. How then, do you decide if an ICU has any benefit when people in it are more likely to die?
Observational studies can use propensity scores to get a more unbiased estimate of treatment effects.
Propensity score matching assumes
- That there are no unmeasured confounders
- Every subject has a non-zero probability of receiving treatment.
Propensity scores are simply predicted values from a logistic regression predicting treatment
Useful rule of thumb:
Use caliper of .2 * pooled standard deviation
Only match people from treatment group to control group if their distance is within the caliper.
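A rough sketch of those two pieces in SAS follows. Everything named here is an assumption made for illustration: work.patients, treated, age and severity are hypothetical, and pooling the standard deviation as the square root of the average of the two group variances is one common convention, not necessarily the one used in the talk. Some authors also compute the caliper on the logit of the propensity score rather than on the score itself.

```sas
/* Sketch only: data set and variable names are hypothetical. */

/* Propensity score = predicted probability of treatment */
proc logistic data=work.patients;
  model treated(event='1') = age severity;
  output out=work.ps predicted=pscore;
run;

/* Variance of the propensity score within each group */
proc means data=work.ps noprint nway;
  class treated;
  var pscore;
  output out=work.ps_var var=ps_var;
run;

/* Caliper = 0.2 * pooled SD, pooling the two group variances equally */
data work.caliper;
  set work.ps_var end=last;
  sum_var + ps_var;                      /* running total of variances */
  if last then do;
    caliper = 0.2 * sqrt(sum_var / 2);
    output;
  end;
  keep caliper;
run;
```

The matching step itself, pairing each treated subject with a control whose score is within the caliper, is left out of this sketch.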
Also, I have slide envy because she thought to use mustaches and fedoras in illustrating propensity scores.
Also with really cool slides, which I was not quick enough to take a picture of before he moved on …
Using Custom Tasks with In-memory statistics and SAS Studio by Steve Ludlow
I was able to find the slides from a related presentation he gave in the UK last year. I linked to that one because it gave a little more detail on what SAS in-memory statistics is, how to use it and examples. If you had gone to his presentation, you probably would have wanted to learn more about this proc imstat and custom tasks of which he speaks.
Three points you might have come away with:
- Creating custom tasks is really easy
- Custom tasks could be really useful for teams sharing a large data base. Say, for example, you are on a longitudinal project studying the development of at-risk youth from age 12-25. You might have all kinds of people doing similar analyses, maybe looking at predictors of high school dropout, say. You could save your task and re-run it with next year’s data, only for females or in a hundred other ways.
- Custom tasks could be super-useful for teaching. Have the students run and inspect tasks you create and then modify these for their own analyses.
Okay, off to more sessions. Just a reminder, if you are here and feeling guilt that you left your children/ grandchildren at home, you can buy Fish Lake or Spirit Lake for them to play while you are gone. They’ll get smarter and you will get brownie points from their mom / dad / teacher .
Apr
2
Statistics Guru Predicts Republican Sweep! With Proc GMAP
Filed Under Software, statistics | Leave a Comment
Esteemed statistics guru, Dr. Nathaniel Golden has some sobering news for Democrats. His latest models predict a Republican blow out. As can be seen by the map below, the Republican front-runner has tapped into the mood of resentment in the country’s non-elites. When the dust has settled, only the two highest earning states in the country will remain in the blue column, Maryland and New Jersey (seriously, New Jersey). Code used in creating this map and the statistics behind it can be found below.
Step 1: Create a data set
Oh, and April Fool’s ! I just made up these data. If you really do need a data set with state data aligned to SAS maps, though, you can do what I did and pull it from the UCLA Stats Site. If you had real data, say percent of people who use methamphetamine, or whatever, you could just replace the last column there with your data. Since I did not have actual data, I just created a variable that was 40,000 for everything less than 51,000, and 51,000 for everything over. I’m going to use that in the PROC FORMAT below.
Also, even though my data are not nicely aligned here, note that the statename variable has a width of 20 so make sure you align your data like that so that state comes in column 22 or after.
DATA income2000;
INPUT statename $20. state income ;
IF income < 51000 THEN vote = 40000 ;
ELSE vote = 51000 ;
DATALINES ;
Maryland 24 51695
Alaska 2 50746
New Jersey 34 51032
Connecticut 9 50360
— a bunch more data
;
Here’s how you set up a PROC FORMAT for the two categories.
PROC FORMAT ;
VALUE votfmt low-50000="Republican"
50001-high="Democrat";
RUN ;
*** Making the patterns red and blue ;
pattern1 value=msolid color=red;
pattern2 value=msolid color=blue;
*** Making the map ;
proc gmap data = income2000 map=maps.us;
id state;
choro vote;
format vote votfmt.;
run;
quit;
The important thing to keep in mind, if you want a U.S. map with the states, is that maps.us is in a SAS library named maps. Like the sashelp library, it’s already there. You don’t need to create it or assign it in a LIBNAME statement; you can just reference it. Go look under your libraries. See, I was right.
And don’t forget to vote. I don’t care how busy you are. You don’t want this, do you?
Mar
20
Plots of Relative Risk: A picture says 1,000 words
Filed Under Software, statistics, Technology | 1 Comment
I can’t believe I haven’t written about this before – I’m going to tell you an easy (yes, easy) way to find and communicate to a non-technical audience standardized mortality rates and relative risk by strata.
It all starts with PROC STDRATE . No, I take that back. It starts with this post I wrote on age-adjusted mortality rates which many cohorts of students have found to be – and this is a technical term here – “really hard”.
Here is the idea in a nutshell – you want to compare two populations, in my case, smokers and non-smokers, and see if one of them experiences an “event”, in my case, death from cancer, at a higher rate than the other. However, there is a problem. Your populations are not the same in age and – news flash from Captain Obvious here – old people are more likely to die of just about anything, including cancer, than are younger people. I say “just about anything” because I am pretty sure that there are more skydiving deaths and extreme sports-related deaths among younger people.
So, you compute the risk stratified by age. I happened to have this exact situation here, and if you want to follow along at home, tomorrow I will post how to create the data using the sashelp library’s heart data set.
The code is a piece of cake
PROC STDRATE DATA=std4
REFDATA=std4
METHOD=indirect(af)
STAT=RISK
PLOTS(STRATUM=HORIZONTAL);
POPULATION EVENT=event_e TOTAL=count_e;
REFERENCE EVENT=event_ne TOTAL=count_ne;
STRATA agegroup / STATS;
RUN;
The first statement gives the name of the data set that holds your exposed sample data, e.g., the smokers, and the reference data set of non-exposed records, in this example, the non-smokers. You don’t need these data to be in two different data sets and, in this example, they happen to be in the same one. The method used for standardization is indirect. If you’re interested in the different types of standardization, check out this 2013 SAS Global Forum paper by Yang Yuan.
STAT = RISK will actually produce many statistics, including both crude risk estimates and estimates by strata for the exposed and non-exposed groups, as well as standardized mortality rate – just, a bunch of stuff. Run it yourself and see. The PLOTS option is what is of interest to me right now. I want plots of the risk by stratum.
The POPULATION statement gives the variable that holds the value for the number of people in the exposed group who had the event, in this case, death by cancer, and the count is the total in the exposed group.
The REFERENCE statement names the variable that holds the value of the number in the non-exposed group who had the event, and the total count in the non-exposed group (both those who died and those who didn’t).
The STRATA statement gives the variable by which to stratify. If you don’t need your data set stratified because there are no confounding variables – lucky you – then just leave this statement out.
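For reference, the arithmetic behind indirect standardization can be written compactly. This is the standard textbook definition, using the variable names from the code above with $j$ indexing the age groups. It is not a transcription of the procedure’s output.

```latex
\begin{align*}
\text{Observed events} &= \sum_j \texttt{event\_e}_j \\
\text{Expected events} &= \sum_j \texttt{count\_e}_j \times \frac{\texttt{event\_ne}_j}{\texttt{count\_ne}_j} \\
\text{SMR} &= \frac{\text{Observed events}}{\text{Expected events}}
\end{align*}
```

In words: apply each age group’s risk in the reference (non-exposed) group to the number of exposed people in that age group, add those up to get the deaths you would expect if smokers died at non-smoker rates, and compare that with the deaths actually observed.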
Below is the graph
The PLOTS option produces plots of the crude estimate of the risk by strata, with the reference group risk as a single line. If you look at the graph above you can see several useful measures. First, the blue circles are the risk estimate for the exposed group at each age group and the vertical blue bars represent the 95% confidence limits for that risk. The red crosses are the risk for the reference group at each age group. The horizontal, solid blue line is the crude estimate for the study group, i.e., smokers, and the dashed, red line is the crude estimate of risk for the reference group, in this case, the non-smokers.
Several observations can be made at a glance.
- The crude risk for non-smokers is lower than for smokers.
- As expected, the younger age groups are below the overall risk of mortality from cancer.
- At every age group, the risk is lower for the non-exposed group.
- The exposed and non-exposed groups differ significantly for the two younger age groups only. For the other two groups, the non-smokers, although having a lower risk, do fall within the 95% confidence limits for the exposed group.
There are also a lot more statistics produced in tables but I have to get back to work so maybe more about that later.
I live in opposite world
Speaking of work — my day job is that I make games for 7 Generation Games and for fun I write a blog on statistics and teach courses in things like epidemiology. Actually, though, I really like making adventure games that teach math and since you are reading this, I assume you like math or at least find it useful.
Share the love! Get your child, grandchild, niece or nephew a game from 7 Generation Games.
One of my favorite emails was from the woman who said that after playing the games several times while visiting her house, her grandson asked her suspiciously,
Grandma, are these games on your computer a really sneaky way to teach me math?
You can check out the games here and if you have no children to visit you or to send one as a gift, you can give one to a school – good karma. (But, hey, what’s with the lack of children in your life? What’s going on?)
Mar
10
SENSITIVITY, SPECIFICITY AND SAS USAGE NOTES
Filed Under Software, statistics, Technology | Leave a Comment
SENSITIVITY AND SPECIFICITY – TWO ANSWERS TO “DO YOU HAVE A DISEASE?”
Both sensitivity and specificity address the same question – how accurate is a test for disease – but from opposite perspectives. Sensitivity is defined as the proportion of those who have the disease that are correctly identified as positive. Specificity is the proportion of those who do not have the disease who are correctly identified as negative.
Students and others new to biostatistics often confuse the two, perhaps because the names are somewhat similar. If I was in charge of naming things, I would have named one ‘sensitivity’ and the other something completely different like ‘unfabuloso’. Why I am never consulted on these issues is a mystery to me, too.
Specificity and sensitivity can be computed simultaneously, as shown in the example below using a hypothetical Disease Test. The results are in and the following table has been obtained:
| Disease | No Disease |
Test Positive | 240 | 40 |
Test Negative | 60 | 160 |
Results from Hypothetical Screening Test
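Worked by hand from the table above, the two definitions come out as follows. TP, FN, TN and FP are the usual shorthand for true positives, false negatives, true negatives and false positives. The shorthand is added here and is not used elsewhere in the post.

```latex
\begin{align*}
\text{Sensitivity} &= \frac{TP}{TP + FN} = \frac{240}{240 + 60} = \frac{240}{300} = 0.80 \\
\text{Specificity} &= \frac{TN}{TN + FP} = \frac{160}{160 + 40} = \frac{160}{200} = 0.80
\end{align*}
```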
COMPUTING SENSITIVITY AND SPECIFICITY USING SAS
Step 1 (optional): Reading the data into SAS. If you already have the data in a SAS data set, this step is unnecessary.
The example below demonstrates several SAS statements in reading data into a SAS dataset when only aggregate results are available. The ATTRIB statement sets the length of the result variable to be 10, rather than accepting the SAS default of 8 characters. The INPUT statement uses list input, with a $ signifying character variables.
DATALINES;
a statement on a line by itself, precedes the data. (Trivial pursuit fact : CARDS; will also work, dating back to the days when this statement was followed by cards with the data punched on them.) A semi-colon on a line by itself denotes the end of the data.
DATA diseasetest ;
ATTRIB result LENGTH= $10 ;
INPUT result $ disease $ weight ;
DATALINES ;
positive present 240
positive absent 40
negative present 60
negative absent 160
;
Step 2: PROC FREQ
PROC FREQ DATA= diseasetest ORDER=FREQ ;
TABLES result* disease;
WEIGHT weight ;
RUN ;
Yes, plain old boring PROC FREQ. The ORDER = FREQ option is not required but it makes the data more readable, in my opinion, because with these data the first column will now be those who had a positive result and did, in fact, have the disease. This is the numerator for the formula for sensitivity, which is:
Sensitivity = (Number with disease who tested positive) / (Total with disease).
TABLES variable1*variable2 will produce a cross-tabulation with variable1 as the row variable and variable2 as the column variable.
WEIGHT weightvariable will weight each record by the value of the weight variable. The variable was named ‘weight’ in the example above but any valid SAS name is acceptable. Leaving off this statement will result in a table that only has 4 subjects, 1 subject for each combination of result and disease, corresponding to the data lines above.
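If you want to convince yourself of what the WEIGHT statement is doing without opening SAS, here is a minimal Python sketch. It is not SAS code and not what PROC FREQ does internally; it just tallies the same four data lines twice, once counting each line by its weight and once counting each line as a single record. The names `rows`, `weighted` and `unweighted` are mine, made up for illustration.

```python
from collections import Counter

# The four aggregate rows from the DATALINES block: (result, disease, weight).
rows = [
    ("positive", "present", 240),
    ("positive", "absent", 40),
    ("negative", "present", 60),
    ("negative", "absent", 160),
]

weighted = Counter()    # each row counts 'weight' times, like WEIGHT weight ;
unweighted = Counter()  # each row counts once, like leaving WEIGHT off

for result, disease, weight in rows:
    weighted[(result, disease)] += weight
    unweighted[(result, disease)] += 1

print(sum(weighted.values()))    # 500 subjects
print(sum(unweighted.values()))  # 4 subjects, one per cell
```

The weighted tally gives the 500 subjects in the table, while the unweighted one gives only the 4 "subjects" you get when the statement is left off.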
Results of the PROC FREQ are shown below. The bottom value in each box is the column percent.
Because the first category happens to be the “tested positive” and the first column is “disease present”, the column percent for the first box in the cross-tabulation – positive test result, disease is present – is the sensitivity, 80%. This is the proportion of those who have the disease (the disease present column) who had a positive test result.
Table of result by disease
Each cell lists, in SAS's top-to-bottom order: Frequency / Percent / Row Pct / Col Pct
result | disease: present | disease: absent | Total |
positive | 240 / 48.00 / 85.71 / 80.00 | 40 / 8.00 / 14.29 / 20.00 | 280 / 56.00 |
negative | 60 / 12.00 / 27.27 / 20.00 | 160 / 32.00 / 72.73 / 80.00 | 220 / 44.00 |
Total | 300 / 60.00 | 200 / 40.00 | 500 / 100.00 |
Output from PROC FREQ for Sensitivity and Specificity
The column percentage for the box corresponding to a negative test result and absence of disease is the value for specificity. In this example, the two values, coincidentally, are both 80%.
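As a sanity check on reading the column percents, here is a small Python sketch (my own illustration, not part of the SAS output) that recomputes the column percent for every cell from the four counts. The dictionary layout and the names `counts`, `col_totals` and `col_pct` are assumptions I made for the example.

```python
# Counts from the PROC FREQ table: counts[result][disease].
counts = {
    "positive": {"present": 240, "absent": 40},
    "negative": {"present": 60, "absent": 160},
}

# Column totals: everyone with the disease, everyone without it.
col_totals = {
    disease: sum(row[disease] for row in counts.values())
    for disease in ("present", "absent")
}

# Column percent for every cell, i.e. the Col Pct value in each box.
col_pct = {
    result: {d: 100 * n / col_totals[d] for d, n in row.items()}
    for result, row in counts.items()
}

sensitivity = col_pct["positive"]["present"]  # positive test, disease present
specificity = col_pct["negative"]["absent"]   # negative test, disease absent
print(sensitivity, specificity)  # 80.0 80.0
```

The two values sit on the diagonal of `col_pct`, which is the same pattern you see in the SAS table.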
Three points are worthy of emphasis here:
- While the location of specificity and sensitivity in the table may vary based on how the data and PROC FREQ are coded, the values for sensitivity and specificity will always be diagonal to one another.
- This exact table produces four additional values of interest in evaluating screening and diagnostic tests: positive predictive value, negative predictive value, false positive probability and false negative probability. Further details on each of these, along with how to compute the confidence intervals for each, can be found in Usage Note 24170 (SAS Institute, 2015).
- The same exact procedure produces six different statistics used in evaluating the usefulness of a test. Yes, that is pretty much the same as point number 2, but it bears repeating.
Speaking of that SAS Usage Note, you should really check it out.
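For the curious, all six statistics can be recomputed from the four cell counts with a few lines of Python. This is a rough sketch of the arithmetic, not the Usage Note's code. I am assuming the usual definitions here, with false positive probability taken as 1 minus specificity and false negative probability as 1 minus sensitivity, so check the Usage Note for its exact definitions. The function name `screening_stats` is hypothetical.

```python
def screening_stats(tp, fp, fn, tn):
    """Return six screening-test statistics from the four cell counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive predictive value": tp / (tp + fp),
        "negative predictive value": tn / (tn + fn),
        # Assumed definitions: complements of specificity and sensitivity.
        "false positive probability": fp / (fp + tn),
        "false negative probability": fn / (fn + tp),
    }

# tp = positive/present, fp = positive/absent,
# fn = negative/present, tn = negative/absent
stats = screening_stats(tp=240, fp=40, fn=60, tn=160)
for name, value in stats.items():
    print(f"{name}: {100 * value:.2f}")
```

Under those assumptions, the predictive values match the row percents in the PROC FREQ table (85.71 and 72.73) and the false positive and false negative probabilities match the off-diagonal column percents (20.00 each).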
Feb
24
SAS Studio: Finding prevalence with pointing and clicking
Filed Under Software, statistics | 3 Comments
Policy makers have very good reason for wanting to know how common a condition or disease is. It allows them to plan and budget for treatment facilities, supplies of medication, and rehabilitation personnel. There are two broad answers to the question, “How common is condition X?” and, interestingly, both of these use the exact same SAS procedures. Prevalence is the number of persons with a condition divided by the number in the population. It’s often given as per thousand, or per 100,000, depending on how common the condition is. Prevalence is often referred to as a snapshot. It’s how many people have a condition at any given time.
Just for fun, let’s take a look at how to compute prevalence with SAS Studio.
Step 1: Access your data set
First, assign a libname so that you can access your data. To do that, you create a new SAS program by clicking on the first tab in the top menu and selecting SAS Program.
libname mydata "/courses/number/number/" access=readonly;
(Students only have readonly access to data sets in the course directory. This prevents them from accidentally deleting files shared by the whole class. As a professor with many years of experience, let me just tell you that this is a GREAT idea.)
Click on the little running guy at the top of your screen and, voila, your LIBNAME is assigned and the directory is now available for access.
(Didn’t believe me there is a little running guy that means “run”? Ha!)
Next, in the left window pane, click on Tasks and in the window to the right, click on the icon next to the data field.
From the drop down menu of directories, select the one with your data and then click on the file you need to analyze.
Step 2: Select the statistic that you want and then select the variable. In this case, I selected one-way frequencies, and one cool thing is that SAS will automatically show you ONLY the roles you need for a specific test. If you were doing a two-sample t-test, for example, it would ask for your groups variable and your analysis variable. Since I am doing a one-way frequency, there is only an analysis variable.
When you click on the plus next to Analysis Variables, all of the variables in your data set pop up and you can select which you want to use. Then, click on your little running guy again, and voila again, results.
So … the prevalence of diabetes is about 11% of the ADULT population in California, or about 110 per 1,000.
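Converting a percentage to a "per thousand" or "per 100,000" figure is just a change of denominator. Here is a tiny Python sketch of that conversion. The function `prevalence_per` is a made-up helper, and the counts are hypothetical numbers chosen only to reproduce the roughly 11% figure, not the actual survey counts.

```python
def prevalence_per(cases, population, per=1000):
    """Prevalence expressed as cases per `per` people."""
    return cases * per / population

# Hypothetical counts that give about 11%; not the real survey data.
print(prevalence_per(cases=11, population=100))               # 110.0
print(prevalence_per(cases=11, population=100, per=100_000))  # 11000.0
```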
You can also code it very simply if you would like:
libname mydata "/courses/number/number/" access=readonly;
PROC FREQ DATA = mydata.datasetname ;
TABLE variable ;
RUN ;
Of course, all of this assumes that your data is cleaned and you have a binary variable coded as has disease / doesn’t have disease, which is a pretty large assumption.
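To give a feel for what that cleaning step might involve, here is a minimal Python sketch that recodes messy responses into a binary variable and then produces a one-way frequency. Everything in it is hypothetical: the raw values, the `to_binary` helper and the coding rules are my assumptions, and real survey data will need its own rules. Missing values are dropped before computing percentages, which is also what PROC FREQ does by default.

```python
from collections import Counter

# Hypothetical raw responses; real survey coding will differ.
raw = ["Yes", "no", "NO", "yes", "", "No", "refused", "no", "No", "no"]

def to_binary(value):
    """Map a raw response to 1 (has disease), 0 (does not), or None (missing)."""
    cleaned = value.strip().lower()
    if cleaned == "yes":
        return 1
    if cleaned == "no":
        return 0
    return None  # blank, refused, or anything unrecognized

coded = [to_binary(v) for v in raw]
valid = [c for c in coded if c is not None]  # drop missing values

# One-way frequency of the cleaned binary variable.
freq = Counter(valid)
for level in (0, 1):
    print(level, freq[level], f"{100 * freq[level] / len(valid):.1f}%")
```

With these made-up responses, 2 of the 8 valid records are coded 1, so the "prevalence" in this toy data is 25%.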
Now, curiously, the code above is the exact SAME code we used to compute incidence of Down syndrome a few weeks ago. What’s up with that and how can you use the exact same code to compute two different statistics?
Patience, my dear. That is a post for another day.