Apr 22
Visual Analytics are EVERYWHERE: SAS Global Forum Continued
Filed Under Software, statistics, Technology
The nice thing about going to SAS Global Forum is that it’s the gift that keeps on giving. Long after I have gone home, there are still points to ponder.
Visual analytics is big, and not just in the sense that there is a product by that name (which I have never used). Every presentation, no matter how tech-y, now makes very effective use of graphics. If I were the type of person to say I told you so, I would mention that I predicted this six years ago after I went to SAS Global Forum in 2010.
In my last post, I mentioned the propensity score graphic with mustaches.
Richard Cutler’s presentation on PROC HPSPLIT, which was really excellent, made extensive use of graphics to illustrate fairly complex models.
You can create classification and regression trees (the model you can’t see in this tiny graphic on the left) and you can drill down into sub-trees for further analysis.
Sometimes your classification tree is very easily interpretable. For example, in this case here from the same presentation, each split represents a different type of vegetation/ land surface – water, two different species of tree, etc.
Speaking of classification, regression and PROC HPSPLIT ….
If you didn’t know, now you know
PROC HPSPLIT is a high-performance procedure, now available in SAS/STAT, that fits classification and regression trees and is useful for data sets where relationships are non-linear. It includes options for pruning trees and a whole lot more. It now runs on a single computer and is not limited to high-performance computing clusters. So, yay!
A regression tree is what you get when your dependent variable is continuous, and a classification tree when it is categorical, as in the vegetation example above.
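If you want to try it yourself, here is a minimal sketch of the procedure, using the sashelp.iris data set that ships with SAS. Since species is categorical, this fits a classification tree:

```sas
/* Minimal classification tree: species is categorical, so
   HPSPLIT grows a classification (not regression) tree */
PROC HPSPLIT DATA = sashelp.iris ;
CLASS species ;
MODEL species = petallength petalwidth sepallength sepalwidth ;
RUN ;
```

Swap in a continuous dependent variable (and leave it out of the CLASS statement) and you get a regression tree instead.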
On a semi-related note, graphics can even be used to show when a data set is not suited to a linear model as in the example below, also from Cutler’s presentation. You can see that all of the 1’s are in two quadrants and all of the 0’s in two other quadrants. Yes, you COULD use a regression line to fit this but that is not the best fit of the data.
Also, on a related note: visualizing data, like all of statistics, really, is a process of iteration. I think this plot would be even more obvious if the quadrants were color coded.
I have a lot more to say on this but I am in North Dakota speaking at the ND STEM conference this weekend and a kind soul gave me tickets to the hockey game in the president’s box, so, peace, I’m out.
Apr 19
SAS Global Forum Random Post 1: Statistics
Filed Under Software, statistics, Technology
If you did not go to SAS Global Forum this week, here are some things you missed:
Me, rambling on about the 13 techniques all biostatisticians should know, including the answer to:
If McNemar and Kappa are both statistics for handling correlated, categorical data, how can they give you completely different results?
The answer is that the two test different hypotheses, apply different formulas and are coded differently.
McNemar tests whether the marginal probabilities are the same. For example, when you switched your patients from drug one to drug two, was there a decrease in the number who experienced side effects? These are correlated data because they are the same people. Can’t get much more correlated than that.
Kappa tests whether the level of agreement of two raters is greater than would be expected by chance. I’ve rambled on it here before, using it to test the level of agreement that our 7 Generation Games raters have when scoring the pretest and post-test we use to assess whether kids are improving as a result of playing our games. Quick answer: Yes.
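If you want both statistics from the same table, the AGREE option on PROC FREQ produces McNemar’s test and kappa in one run. The counts below are made up for illustration; each row is a combination of side-effect status on drug one and drug two for the same patients:

```sas
/* Hypothetical paired data: side effects on drug one vs drug two.
   The AGREE option requests McNemar's test and the kappa
   coefficient from the same square table. */
DATA side_effects ;
INPUT drug1 $ drug2 $ count ;
DATALINES ;
yes yes 20
yes no  15
no  yes  5
no  no  60
;
PROC FREQ DATA = side_effects ;
TABLES drug1*drug2 / AGREE ;
WEIGHT count ;
RUN ;
```

Same table, two different hypotheses tested, which is exactly why the results can differ.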
You also missed Lucy D’Agostino McGowan’s talk on propensity score matching integrating SAS and R.
Random notes from that presentation:
Why would you want to do this? Well, it would be lovely if you could do a randomized controlled trial and send your subjects randomly off to a treatment or control group.
However, what if your subjects tell you to drop dead they’re not going to be in your stupid treatment group?
In my experience, propensity scores have been commonly used when evaluating special programs whose patients are not randomly assigned. For example, patients sent to an Intensive Care Unit tend to be sicker than non-ICU patients. How, then, do you decide if an ICU has any benefit when people in it are more likely to die?
Observational studies can use propensity scores to get a less biased estimate of treatment effects.
Propensity score matching assumes
- That there are no unmeasured confounders
- Every subject has a non-zero probability of receiving treatment.
Propensity scores are simply predicted values from a logistic regression predicting treatment.
Useful rule of thumb:
Use a caliper of .2 * the pooled standard deviation.
Only match people from the treatment group to the control group if their distance is within the caliper.
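As a sketch of that last point about logistic regression: the scores can come straight out of PROC LOGISTIC. The data set and covariate names below are hypothetical, just for illustration:

```sas
/* Propensity score = predicted probability of treatment,
   given the covariates */
PROC LOGISTIC DATA = study DESCENDING ;
MODEL treatment = age gender severity ;
OUTPUT OUT = pscores PRED = pscore ;
RUN ;
```

The pscore variable in the output data set is what gets matched on, within the caliper.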
Also, I have slide envy because she thought to use mustaches and fedoras in illustrating propensity scores.
Another presentation with really cool slides, which I was not quick enough to photograph before he moved on …
Using Custom Tasks with In-memory statistics and SAS Studio by Steve Ludlow
I was able to find the slides from a related presentation he gave in the UK last year. I linked to that one because it gives a little more detail on what SAS in-memory statistics is, how to use it and examples. If you had gone to his presentation, you probably would have wanted to learn more about this PROC IMSTAT and the custom tasks of which he speaks.
Three points you might have come away with:
- Creating custom tasks is really easy
- Custom tasks could be really useful for teams sharing a large database. Say, for example, you are on a longitudinal project studying development of at-risk youth from ages 12-25. You might have all kinds of people doing similar analyses, maybe looking at predictors of high school dropout, say. You could save your task and re-run it with next year’s data, only for females, or in a hundred other ways.
- Custom tasks could be super-useful for teaching. Have the students run and inspect tasks you create and then modify these for their own analyses.
Okay, off to more sessions. Just a reminder, if you are here and feeling guilty that you left your children/grandchildren at home, you can buy Fish Lake or Spirit Lake for them to play while you are gone. They’ll get smarter and you will get brownie points from their mom/dad/teacher.
Apr 2
Statistics Guru Predicts Republican Sweep! With Proc GMAP
Filed Under Software, statistics
Esteemed statistics guru, Dr. Nathaniel Golden has some sobering news for Democrats. His latest models predict a Republican blowout. As can be seen by the map below, the Republican front-runner has tapped into the mood of resentment in the country’s non-elites. When the dust has settled, only the two highest earning states in the country will remain in the blue column, Maryland and New Jersey (seriously, New Jersey). Code used in creating this map and the statistics behind it can be found below.
Step 1: Create a data set
Oh, and April Fool’s ! I just made up these data. If you really do need a data set with state data aligned to SAS maps, though, you can do what I did and pull it from the UCLA Stats Site. If you had real data, say percent of people who use methamphetamine, or whatever, you could just replace the last column there with your data. Since I did not have actual data, I just created a variable that was 40,000 for everything less than 51,000, and 51,000 for everything over. I’m going to use that in the PROC FORMAT below.
Also, even though my data are not nicely aligned here, note that the statename variable has a width of 20 so make sure you align your data like that so that state comes in column 22 or after.
DATA income2000;
INPUT statename $20. state income ;
IF income < 51000 THEN vote = 40000 ;
ELSE vote = 51000 ;
DATALINES ;
Maryland 24 51695
Alaska 2 50746
New Jersey 34 51032
Connecticut 9 50360
— a bunch more data
;
Here’s how you set up a PROC FORMAT for the two categories.
PROC FORMAT ;
VALUE votfmt low-50000="Republican"
50001-high="Democrat";
*** Making the patterns red and blue ;
pattern1 value=msolid color=red;
pattern2 value=msolid color=blue;
*** Making the map ;
proc gmap data = income2000 map=maps.us;
id state;
choro vote;
format vote votfmt.;
run;
quit;
The important thing to keep in mind is that if you want a U.S. map with the states, maps.us is in a SAS library named maps. Like the sashelp library, it’s already there; you don’t need to create it or assign it in a LIBNAME statement, you can just reference it. Go look under your libraries. See, I was right.
And don’t forget to vote. I don’t care how busy you are. You don’t want this, do you?
Mar 20
Plots of Relative Risk: A picture says 1,000 words
Filed Under Software, statistics, Technology
I can’t believe I haven’t written about this before – I’m going to tell you an easy (yes, easy) way to find and communicate to a non-technical audience standardized mortality rates and relative risk by strata.
It all starts with PROC STDRATE. No, I take that back. It starts with this post I wrote on age-adjusted mortality rates, which many cohorts of students have found to be – and this is a technical term here – “really hard”.
Here is the idea in a nutshell – you want to compare two populations, in my case, smokers and non-smokers, and see if one of them experiences an “event”, in my case, death from cancer, at a higher rate than the other. However, there is a problem. Your populations are not the same in age and – news flash from Captain Obvious here – old people are more likely to die of just about anything, including cancer, than are younger people. I say “just about anything” because I am pretty sure that there are more skydiving deaths and extreme sports-related deaths among younger people.
So, you compute the risk stratified by age. I happened to have this exact situation here, and if you want to follow along at home, tomorrow I will post how to create the data using the sashelp library’s heart data set.
The code is a piece of cake
PROC STDRATE DATA=std4
REFDATA=std4
METHOD=indirect(af)
STAT=RISK
PLOTS(STRATUM=HORIZONTAL);
POPULATION EVENT=event_e TOTAL=count_e;
REFERENCE EVENT=event_ne TOTAL=count_ne;
STRATA agegroup / STATS;
RUN;
The first statement gives the data set name that holds your exposed sample data, e.g., the smokers, and the reference data set of non-exposed records, in this example, the non-smokers. You don’t need these data to be in two different data sets and, in this example, they happen to be in the same one. The method used for standardization is indirect. If you’re interested in the different types of standardization, check out this 2013 SAS Global Forum paper by Yang Yuan.
STAT = RISK will actually produce many statistics, including both crude risk estimates and estimates by strata for the exposed and non-exposed groups, as well as standardized mortality rate – just, a bunch of stuff. Run it yourself and see. The PLOTS option is what is of interest to me right now. I want plots of the risk by stratum.
The POPULATION statement gives the variable that holds the value for the number of people in the exposed group who had the event, in this case, death by cancer, and the count is the total in the exposed group.
The REFERENCE statement names the variable that holds the value of the number in the non-exposed group who had the event, and the total count in the non-exposed group (both those who died and those who didn’t).
The STRATA statement gives the variable by which to stratify. If you don’t need your data set stratified because there are no confounding variables – lucky you – then just leave this statement out.
Below is the graph
The PLOTS statement produces plots of the crude estimate of the risk by strata, with the reference group risk as a single line. If you look at the graph above you can see several useful measures. First, the blue circles are the risk estimate for the exposed group at each age group and the vertical blue bars represent the 95% confidence limits for that risk. The red crosses are the risk for the reference group at each age group. The horizontal, solid blue line is the crude estimate for the study group, i.e., smokers, and the dashed, red line is the crude estimate of risk for the reference group, in this case, the non-smokers.
Several observations can be made at a glance.
- The crude risk for non-smokers is lower than for smokers.
- As expected, the younger age groups are below the overall risk of mortality from cancer.
- At every age group, the risk is lower for the non-exposed group.
- The differences between exposed and non-exposed are significantly different for the two younger age groups only; for the other two groups, the non-smokers, although they have a lower risk, fall within the 95% confidence limits for the exposed group.
There are also a lot more statistics produced in tables but I have to get back to work so maybe more about that later.
I live in opposite world
Speaking of work — my day job is that I make games for 7 Generation Games and for fun I write a blog on statistics and teach courses in things like epidemiology. Actually, though, I really like making adventure games that teach math and since you are reading this, I assume you like math or at least find it useful.
Share the love! Get your child, grandchild, niece or nephew a game from 7 Generation Games.
One of my favorite emails was from a woman who said that after her grandson played the games several times while visiting her house, he asked her suspiciously,
Grandma, are these games on your computer a really sneaky way to teach me math?
You can check out the games here and if you have no children to visit you or to send one as a gift, you can give one to a school – good karma. (But, hey, what’s with the lack of children in your life? What’s going on?)
Mar 10
SENSITIVITY, SPECIFICITY AND SAS USAGE NOTES
Filed Under Software, statistics, Technology
SENSITIVITY AND SPECIFICITY – TWO ANSWERS TO “DO YOU HAVE A DISEASE?”
Both sensitivity and specificity address the same question – how accurate is a test for disease – but from opposite perspectives. Sensitivity is defined as the proportion of those who have the disease that are correctly identified as positive. Specificity is the proportion of those who do not have the disease who are correctly identified as negative.
Students and others new to biostatistics often confuse the two, perhaps because the names are somewhat similar. If I were in charge of naming things, I would have named one ‘sensitivity’ and the other something completely different, like ‘unfabuloso’. Why I am never consulted on these issues is a mystery to me, too.
Specificity and sensitivity can be computed simultaneously, as shown in the example below using a hypothetical Disease Test. The results are in and the following table has been obtained:
|               | Disease | No Disease |
| Test Positive | 240     | 40         |
| Test Negative | 60      | 160        |
Results from Hypothetical Screening Test
COMPUTING SENSITIVITY AND SPECIFICITY USING SAS
Step 1 (optional): Reading the data into SAS. If you already have the data in a SAS data set, this step is unnecessary.
The example below demonstrates several SAS statements in reading data into a SAS dataset when only aggregate results are available. The ATTRIB statement sets the length of the result variable to be 10, rather than accepting the SAS default of 8 characters. The INPUT statement uses list input, with a $ signifying character variables.
DATALINES;, a statement on a line by itself, precedes the data. (Trivial pursuit fact: CARDS; will also work, dating back to the days when this statement was followed by cards with the data punched on them.) A semi-colon on a line by itself denotes the end of the data.
DATA diseasetest ;
ATTRIB result LENGTH= $10 ;
INPUT result $ disease $ weight ;
DATALINES ;
positive present 240
positive absent 40
negative present 60
negative absent 160
;
Step 2: PROC FREQ
PROC FREQ DATA= diseasetest ORDER=FREQ ;
TABLES result*disease ;
WEIGHT weight ;
RUN ;
Yes, plain old boring PROC FREQ. The ORDER=FREQ option is not required but it makes the data more readable, in my opinion, because with these data the first cell will now be those who had a positive result and did, in fact, have the disease. This is the numerator for the formula for sensitivity, which is:
Sensitivity = (Number with disease who tested positive)/ (Total with disease).
TABLES variable1*variable2 will produce a cross-tabulation with variable1 as the row variable and variable2 as the column variable.
WEIGHT weightvariable will weight each record by the value of the weight variable. The variable was named ‘weight’ in the example above, but any valid SAS name is acceptable. Leaving off this statement will result in a table that has only 4 subjects, 1 for each combination of result and disease, corresponding to the data lines above.
Results of the PROC FREQ are shown below. The bottom value in each box is the column percent.
Because the first category happens to be the “tested positive” and the first column is “disease present”, the column percent for the first box in the cross-tabulation – positive test result, disease is present – is the sensitivity, 80%. This is the proportion of those who have the disease (the disease present column) who had a positive test result.
Table of result by disease

Frequency
Percent
Row Pct
Col Pct    | present | absent |  Total
-----------+---------+--------+-------
positive   |     240 |     40 |    280
           |   48.00 |   8.00 |  56.00
           |   85.71 |  14.29 |
           |   80.00 |  20.00 |
-----------+---------+--------+-------
negative   |      60 |    160 |    220
           |   12.00 |  32.00 |  44.00
           |   27.27 |  72.73 |
           |   20.00 |  80.00 |
-----------+---------+--------+-------
Total          300       200      500
             60.00     40.00   100.00
Output from PROC FREQ for Sensitivity and Specificity
The column percentage for the box corresponding to a negative test result and absence of disease is the value for specificity. In this example, the two values, coincidentally, are both 80%.
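You can check this arithmetic with nothing fancier than a data step; the four counts come straight from the hypothetical screening table above, and the same counts also give the positive and negative predictive values:

```sas
/* Checking the sensitivity and specificity arithmetic by hand */
DATA check ;
tp = 240 ; fp = 40 ; fn = 60 ; tn = 160 ;
sensitivity = tp / (tp + fn) ; * 240/300 = .80 ;
specificity = tn / (tn + fp) ; * 160/200 = .80 ;
ppv = tp / (tp + fp) ; * positive predictive value ;
npv = tn / (tn + fn) ; * negative predictive value ;
RUN ;
PROC PRINT DATA = check ;
RUN ;
```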
Three points are worthy of emphasis here:
- While the location of specificity and sensitivity in the table may vary based on how the data and PROC FREQ are coded, the values for sensitivity and specificity will always be diagonal to one another.
- This exact table produces four additional values of interest in evaluating screening and diagnostic tests: positive predictive value, negative predictive value, false positive probability and false negative probability. Further details on each of these, along with how to compute the confidence intervals for each, can be found in Usage Note 24170 (SAS Institute, 2015).
- The same exact procedure produces six different statistics used in evaluating the usefulness of a test. Yes, that is pretty much the same as point number 2, but it bears repeating.
Speaking of that SAS Usage Note, you should really check it out.
Mar 3
Age-adjusted mortality
Filed Under statistics
In the early part of any epidemiology course, few things throw students as much as computing age-adjusted mortality. It seems really counter-intuitive that two populations could have the exact same mortality rate and yet one is significantly less healthy than the other.
Thinking about it for a moment, though, before diving into computation, makes it pretty clear.
Say you have a group of 30 members of a senior citizens center in Florida and a class of 30 second-graders in Santa Monica. In each group, 4 members died this year. Is the mortality rate the same?
This is where age-adjusted mortality comes in. It just so happens that in fact the CRUDE MORTALITY RATE is the same.
Crude mortality rate is simply (# of people who died)/(population at midyear)
We take midyear population because the denominator is the population at risk and if you have already died you cannot die again. Poets talk about dying a thousand deaths but statisticians don’t believe in that crap.
Since, we will assume, no one is joining your community center or class, mid-year population = 28.
How did I get 28? I assumed that people died randomly throughout the year so 1/2 of the year, 2 of your 4 people have died.
So, your crude mortality rate is 143 per 1,000.
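The same arithmetic in a one-observation data step, if you want to check it:

```sas
/* Crude mortality rate: deaths divided by mid-year population */
DATA crude ;
deaths = 4 ;
midyear_pop = 30 - (deaths / 2) ; * half the deaths have occurred by mid-year, so 28 ;
rate_per_1000 = (deaths / midyear_pop) * 1000 ; * about 143 per 1,000 ;
RUN ;
PROC PRINT DATA = crude ;
RUN ;
```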
Does it bother you that the second-graders died at the same rate as the senior citizens? Does that not seem right? That’s because you have some intuitive understanding of age-adjusted mortality.
Age-adjusted mortality is what you get when you apply ACTUAL age specific rates to a HYPOTHETICAL STANDARD POPULATION.
Let’s say we want to compare the mortality rates of two relatively small cities, each with the same size population, and in each city, 75 people died in the last year. We are arguing that pollution is causing increased mortality but the main polluter in town points to the fact that City B has no more deaths than City A, on the other side of the state.
To compute the age-adjusted mortality, we take the actual mortality rate for each age group in each city, as in the example below, and apply it to a standard population: let’s say each city had 10,000 children born that year, 20,000 people ages 1-5 and so on.
| Age Group | Standard Population | City A Rate per 100,000 | Expected Deaths A | City B Rate per 100,000 | Expected Deaths B |
| < 1       | 10,000              | 160                     | 16                | 70                      | 7                 |
| 1-5       | 20,000              | 20                      | 4                 | 12                      | 2.4               |
| 6-40      | 50,000              | 30                      | 15                | 14                      | 7                 |
| 41-65     | 10,000              | 50                      | 5                 | 45                      | 4.5               |
| Over 65   | 10,000              | 350                     | 35                | 300                     | 30                |
| TOTAL     | 100,000             | —                       | 75                | —                       | 50.9              |
The cities may have different age distributions, so city A, which is a college town, has a lot more young people than City B. Given the City A mortality rates for each age group, one would expect 75 deaths in a standard population – that is, with the age distribution given above.
However, given the mortality rates by age in City B, one would expect only 50.9 deaths in a year. So, yes, City A has the same number of people and the same number of deaths, but if the people in City A are much younger, they should have FEWER deaths.
The standardized mortality ratio is the observed number of deaths per year divided by the expected number of deaths.
Let’s say we use the rate for City B, without the polluter, as our expected number
SMR = 75/ 50.9 = 1.47
Usually, we multiply it by 100. So, this says the deaths in City A are 147% of what would be expected for this distribution of ages based on the mortality rate in a city with no polluting plant.
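If you want to check the expected-death arithmetic, a short data step will do it. The rates in the table are per 100,000, so expected deaths = standard population * rate / 100,000:

```sas
/* Expected deaths for each city, from the table above */
DATA expected ;
INPUT agegroup $ stdpop rate_a rate_b ;
exp_a = stdpop * rate_a / 100000 ;
exp_b = stdpop * rate_b / 100000 ;
DATALINES ;
under1 10000 160 70
1-5    20000 20  12
6-40   50000 30  14
41-65  10000 50  45
over65 10000 350 300
;
PROC MEANS DATA = expected SUM ;
VAR exp_a exp_b ;
RUN ;
* THE SUMS ARE 75 AND 50.9, SO SMR = 75 / 50.9 = 1.47 ;
```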
Feel smarter after reading this?
Check out my day job, making educational adventure games. Download one for Mac or Windows today.
Feb 24
SAS Studio: Finding prevalence with pointing and clicking
Filed Under Software, statistics
Policy makers have very good reason for wanting to know how common a condition or disease is. It allows them to plan and budget for treatment facilities, supplies of medication and rehabilitation personnel. There are two broad answers to the question, “How common is condition X?” and, interestingly, both of these use the exact same SAS procedures. Prevalence is the number of persons with a condition divided by the number in the population. It’s often given as per thousand, or per 100,000, depending on how common the condition is. Prevalence is often referred to as a snapshot. It’s how many people have a condition at any given time.
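The arithmetic itself is just a ratio. A quick sketch with made-up counts (2,750 cases in a population of 25,000 is an assumption for illustration, not real data):

```sas
/* Prevalence = cases / population, scaled to per 1,000 */
DATA prev ;
cases = 2750 ;
population = 25000 ;
prevalence_per_1000 = (cases / population) * 1000 ; * 110 per 1,000, i.e., 11% ;
RUN ;
PROC PRINT DATA = prev ;
RUN ;
```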
Just for fun, let’s take a look at how to compute prevalence with SAS Studio.
Step 1: Access your data set
First, assign a libname so that you can access your data. To do that, you create a new SAS program by clicking on the first tab in the top menu and selecting SAS Program.
libname mydata "/courses/number/number/" access=readonly;
(Students only have readonly access to data sets in the course directory. This prevents them from accidentally deleting files shared by the whole class. As a professor with many years of experience, let me just tell you that this is a GREAT idea.)
Click on the little running guy at the top of your screen and, voila, your LIBNAME is assigned and the directory is now available for access.
(Didn’t believe me there is a little running guy that means “run”? Ha!)
Next, in the left window pane, click on Tasks and in the window to the right, click on the icon next to the data field.
From the drop down menu of directories, select the one with your data and then click on the file you need to analyze.
Step 2: Select the statistic that you want and then select the variable. In this case, I selected one-way frequencies, and one cool thing is that SAS will automatically show you ONLY the roles you need for a specific test. If you were doing a two-sample t-test, for example, it would ask for your groups variable and your analysis variable. Since I am doing a one-way frequency, there is only an analysis variable.
When you click on the plus next to Analysis Variables, all of the variables in your data set pop up and you can select which you want to use. Then, click on your little running guy again, and voila again, results.
So … the prevalence of diabetes is about 11% of the ADULT population in California, or about 110 per 1,000.
You can also code it very simply if you would like:
libname mydata "/courses/number/number/" access=readonly;
PROC FREQ DATA = mydata.datasetname ;
TABLES variable ;
RUN ;
Of course, all of this assumes that your data are cleaned and you have a binary variable coded has disease/doesn’t have disease, which is a pretty large assumption.
Now, curiously, the code above is the exact SAME code we used to compute incidence of Down syndrome a few weeks ago. What’s up with that and how can you use the exact same code to compute two different statistics?
Patience, my dear. That is a post for another day.
Feb 16
Urban vs Rural Barriers to Ed Tech: An example of Fisher’s Exact Test
Filed Under computer games, Software, statistics, Technology
Who was it that said asking a statistician about sample size is like asking a jeweler about price? If you have to ask, you can’t afford it.
We all know that the validity of a chi-square test is questionable if the expected count in a cell is less than five. Well, what do you do when, as happened to me recently, ALL of your cells have expected counts less than five?
The standard answer might be to collect more data, and we are in the process of that, but having the patience of the average toddler, I wanted that data analyzed NOW because it was very interesting.
It was our hypothesis that rural schools were less likely to face obstacles in installing software than urban schools, due to the extra layers of administrative approval required in the latter (some might call it bureaucracy). On the other hand, we could be wrong (horrors!). Maybe rural schools had more problems because they had more difficulty finding qualified personnel to fill information technology positions. We had data from 17 schools, 9 from urban school districts and 8 from rural districts. To participate in our study, schools had to have a contact person who was willing to attempt to get the software installed on the school computers. This was not a survey asking them whether it would be difficult or how long it would take. We actually wanted them to get software ( 7 Generation Games ) not currently on their school computers installed. To make sure that cost was not an issue, all 17 schools received donated licenses.
You can see the full results here.
In short, 8 of the 9 urban schools had barriers to installation of the games which delayed their use in the classroom by a median of three months. I say median instead of mean because four of the schools STILL have not been able to get the games installed. The director of one after-school program that wanted to use the games decided it was easier for his program to go out and buy their own computers than to get through all of the layers of district approval to use the school computer labs, so that is what they did.
For the rural schools, 7 out of 8 reported no policy or administrative barriers to installation. The median length of time from when they received the software to installation was two weeks. In two of the schools, the software was installed the day it was received.
Here is a typical comment from an urban school staff member,
“I needed to get it approved by the math coach, and she was all on board. Then I got it approved at the building level. We had new administration this year so it took them a few weeks to get around to it, and then they were all for it. Then it got sent to the district level. Since your games had already been approved by the district, that was just a rubber stamp but it took a few weeks until it got back to us, then we had all of the approvals so we needed to get it installed but the person who had the administrator password had been laid off. Fortunately, I had his phone number and I got it from him. Then, we just needed to find someone who had the spare time to put the game on all of the computers. All told, it took us about three months, which was sad because that was a whole semester lost that the kids could have been playing the games. “
And here is a typical comment from a rural staff member.
“It took me, like, two minutes to get approval. I called the IT guy and he came over and installed it.”
The differences sound pretty dramatic, but are they different from what one would expect by chance, given the small sample size? Since we can’t use a chi-square, we’ll use Fisher’s exact test. Here is the SAS code to do just that:
PROC FREQ DATA = install ;
TABLES rural*install / CHISQ ;
RUN ;
Wait a minute! Isn’t that just a PROC FREQ and a chi-square? How the heck did I get a Fisher’s exact test from that?
Well, it turns out that if you have a 2 x 2 table, SAS automatically computes the Fisher exact test, as well as several others. I told you that you could see the full results here but you didn’t look, now, did you?
You can see the full results here.
In case you still didn’t look, the probability of obtaining this table under the null hypothesis that there is no difference in administrative barriers in urban versus rural districts is .0034.
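For the record, here is how the table itself could be entered from the counts reported above (8 of 9 urban schools had barriers; 7 of 8 rural schools did not). The variable name barrier is mine, for illustration, and the FISHER option requests the test by name:

```sas
/* 2 x 2 table entered from the counts in the post */
DATA install ;
INPUT rural $ barrier $ count ;
DATALINES ;
rural yes 1
rural no  7
urban yes 8
urban no  1
;
PROC FREQ DATA = install ;
TABLES rural*barrier / FISHER ;
WEIGHT count ;
RUN ;
```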
If you think these data suggest it is easier to adopt educational technology in rural districts than in urban ones, well, not exactly. Rural districts have their own set of challenges, but that is a post for another day.
Feb 3
Watch me work: Data Project
Filed Under Software, statistics, Technology
On twitter, there were a few comments from people who said they didn’t like to take interns because “More than doing work, they want to watch me work.”
I see both sides of that. You’re busy. You’re not Netflix. I get it. On the other hand, that’s a good way to learn.
So, here you go. I’m starting on a data analysis project today and I thought I’d give you the blow by blow.
It just so happens that the first several steps are all point-y and click-y. You could do it other ways but this is how I did it today. So, Step 1: I went to phpMyAdmin on the server where the data were saved and clicked Export.
Step 2: For the format to export, I selected CSV and then clicked on the Go button. Now I have it downloaded to my desktop.
Step 3: I opened SAS Enterprise Guide and selected Import Data. I could have done this with SAS and written code to read the file, but, well, I didn’t. Nope, no particular reason, just this isn’t a very big data set so I thought, what the heck, why not.
Step 4: DO NOT ACCEPT THE DEFAULTS! Since I have a comma-delimited file with no field names, I need to uncheck the box that says File contains field names on record number. SAS conveniently shows you the data below so I can see that it is comma-delimited. I know I selected CSV but it’s always good practice to check. I can also see that the data start at the first record, so I want to change the value in Data records start at record number to 1.
Step 5: Change the names. I am never going to remember what F1, F2, etc. are, so for the first 5 variables, I click on the row and edit the names to be the name and label I want.
That’s it. Now I click the next button on the bottom of the screen until SAS imports my data.
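If you ever want to do the same import outside of SAS, the wizard settings above boil down to a few lines of code. Here is a rough Python sketch of the same idea; the sample data and column names are made up for illustration, not from the real file:

```python
import csv
import io

# Hypothetical sample of a headerless CSV export (the real file came from phpMyAdmin).
raw = "amaya,tate_school,5,ms_smith,2016-01-15\nray,tate_school,5,ms_smith,2016-01-16\n"

# No header row, so data start at record 1 -- same as the wizard settings above.
rows = list(csv.reader(io.StringIO(raw)))

# Give the first five columns meaningful names, like renaming F1-F5 in the wizard.
names = ["username", "school", "grade", "teacher", "date_test"]
records = [dict(zip(names, row)) for row in rows]

print(records[0]["username"])
```

Same principle either way: never trust the defaults until you have looked at the data.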
I could have continued changing all of the variable names, because I KNOW down the line I am not going to remember that F6 is actually the first question or that F25 is question 28a. However, I wanted to do some other things that I thought would be easier to code, so I opened up a program file in SAS Enterprise guide and wrote some code.
/* THIS CREATES TWO ARRAYS BECAUSE I AM TOO LAZY
TO RENAME 32 QUESTIONS INDIVIDUALLY
THE PRETEST DATA SET WAS CREATED BY THE STEPS ABOVE USING IMPORT DATA */
data pretest2 ;
set pretest ;
** NOTE THAT THERE IS A $ AFTER THE NUMBER OF ELEMENTS IN THE ARRAY
** BECAUSE THIS IS A CHARACTER ARRAY ;
array ren{32} $ f6-f37 ;
array qs {32} $ q1-q27 q28a q28b q28c q29 q30;
do i = 1 to 32 ;
qs{i} = ren{i} ;
end ;
** YOU CAN ALSO USE A RENAME STATEMENT TO RENAME THE SAME VARIABLES ;
rename f38 = date_test ;
*** SINCE I NO LONGER NEED THE VARIABLES F6- F37 OR THE INDEX VARIABLE FOR THE
ARRAY, I DROP THEM HERE ;
drop f6- f37 i ;
run ;
*** SOME STUDENTS SAVED THE TEST MORE THAN ONCE BECAUSE THEY SAVED BEFORE THEY
WERE DONE AND AGAIN AT THE END. SO, I SORT BY USERNAME AND TEST DATE.
WE WILL ONLY KEEP THE LAST ONE ;
proc sort data=pretest2 ;
by username date_test ;
run ;
*** THIS KEEPS JUST THE LATEST TEST DATE. ALSO, WE TESTED THIS 45 TIMES IN
THE PROCESS OF GETTING READY FOR USE IN THE SCHOOLS. ALL OF OUR STAFF USED
USERNAMES WITH 'TEST' SO I USED THE INDEX FUNCTION TO FIND IF THERE WAS A
'TEST' IN THE USERNAME AND, IF SO, DELETED THAT RECORD ;
data pretest2 ;
set pretest2;
by username date_test ;
if last.username ;
if index(username,'TEST') > 0 then delete;
run;
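For anyone following along without SAS, the keep-the-last-record-and-drop-the-test-accounts logic translates pretty directly. Here is a Python sketch with hypothetical records standing in for the pretest2 data set:

```python
# Hypothetical (username, date_test) records, analogous to pretest2.
records = [
    ("amaya", "2016-01-10"),
    ("amaya", "2016-01-15"),          # same student saved twice; keep the later one
    ("TEST_ANNMARIA", "2016-01-12"),  # staff test account; drop it
    ("ray", "2016-01-11"),
]

# Sort by username then date; the last record seen per username wins,
# just like "if last.username" after a PROC SORT.
latest = {}
for username, date_test in sorted(records):
    latest[username] = date_test

# Drop anything with "TEST" in the username, like the INDEX function check.
cleaned = {u: d for u, d in latest.items() if "TEST" not in u}

print(sorted(cleaned))
```

Two passes in SAS, one loop and one dictionary comprehension in Python, same data cleaning either way.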
Okay, that’s it. Now I have my data all ready to analyze. Pretty painless, isn’t it?
Want to learn more about SAS?
Here is a good paper on arrays: Arrays Made Easy.
If you’re interested in character functions like index, here is a good paper by Ron Cody.
Jan
19
Incidence Rate: An example with Down Syndrome
Filed Under statistics | 1 Comment
How common is a particular disease or condition? It depends on what you mean by common. Do you mean how many people have a condition, or do you mean how many new cases of it are there each year?
In other words, are you asking about the probability of you HAVING a disease or of you GETTING a disease?
Yes, I mentioned Down syndrome in the title, and I am going to use it as an example, even though one could argue, and I would agree, that Down syndrome is not, strictly speaking, a disease. That relates to a second point: incidence and prevalence are terms and techniques that can be applied not just to diseases but also to chromosomal disorders and even to risk factors, such as smoking. But now I’m getting ahead of myself, which just shows that one should pipe down and pay attention.
INCIDENCE RATE is the rate at which new cases are occurring.
I downloaded a data set of 40,002 births that occurred in U.S. territories. Did you know that the U.S. administers 16 territories around the world? Well, now you do. Bet you can’t name more than four of them. I’m not sure whether this is a good deal for the people of the territories or not, but I am sure they had 40,000 babies.
Finding the incidence rate for Down syndrome was super-duper easy.
I made a birthweight data set, but unlike the last one, I selected an additional variable, ca_down, and then I recoded that variable.
FILENAME in "C:\Users\you\Downloads\Nat2014us\Nat2014PublicUS.c20150514.r20151022.txt" ;
LIBNAME out "C:\Users\me\mydir\" ;
DATA out.birth2014 ;
INFILE in ;
INPUT birthyr 9-12 bmonth 13-14 dob_tt 19-22 dob_wk 23 bfacil 32
mager 75-76 mrace6 107 mhisp_r 115 mar_p $ 119 dmar 120
meduc 124 bmi 283-286 cig1_r 262 cig2_r 263 cig3_r 264 dbwt 504-507 bwtr4 511
ca_down $ 552 ;
if ca_down in ("C","P") then down = "Y" ;
else down = "N" ;
RUN ;
You see, the Down syndrome variable at birth is coded as C for Confirmed, N for No and P for Pending. Well, the “pending” means that the medical personnel suspect Down syndrome but they are waiting for the results of a chromosomal analysis and the confirmed means the analysis has confirmed a diagnosis of Down syndrome. Since I presume that most experienced medical personnel recognize Down syndrome, I combined those two categories (which, if you read the codebook, it turns out that the NCHS folks did, too.)
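The recode is just a two-branch mapping, and it is the same idea in any language. Here is a quick Python sketch of that IF/ELSE (the function name is mine, not from the data file):

```python
def recode_down(ca_down):
    """Collapse Confirmed (C) and Pending (P) into Y; everything else is N."""
    return "Y" if ca_down in ("C", "P") else "N"

# Confirmed and pending cases both count as Down syndrome births.
print(recode_down("C"))
print(recode_down("P"))
print(recode_down("N"))
```

Combining C and P into one category here mirrors what the SAS code above does, and, as noted, what NCHS does in its own codebook.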
Then I did
PROC FREQ DATA = out.birth2014 ;
TABLES down ;
RUN ;
And the result was
down | Frequency | Percent | Cumulative Frequency | Cumulative Percent
-----|-----------|---------|----------------------|-------------------
N    | 39963     | 99.90   | 39963                | 99.90
Y    | 39        | 0.10    | 40002                | 100.00
The incidence rate is 0.10%, or about 1 per 1,000 births.
That’s all an incidence rate is: total number of new cases / number at risk.
The number at risk of Down syndrome would be all babies born, all 40,002 of them. The number of new cases was 39. According to the World Health Organization, the incidence of Down syndrome is between 1 in 1,000 and 1 in 1,100 live births.
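Since the arithmetic is the whole point here, it is worth spelling out. A few lines of Python make the computation explicit, using the counts from the PROC FREQ output above:

```python
# Counts from the PROC FREQ table above.
new_cases = 39   # births coded Y (confirmed or pending Down syndrome)
at_risk = 40002  # all live births in the data set

# Incidence rate = new cases / number at risk.
rate = new_cases / at_risk

# Expressed per 1,000 births, the usual way incidence is reported.
per_thousand = round(rate * 1000, 2)
print(per_thousand)
```

That works out to roughly 0.97 cases per 1,000 births, squarely inside the WHO range of 1 in 1,000 to 1 in 1,100.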
So, whatever other fallout there may be of living in a U.S. territory, it doesn’t seem to carry with it any greater incidence of Down syndrome births.
When discussing incidence, a condition like Down syndrome is easy to start with because you have it when you are born, you don’t develop it later. That is not the case with every health condition, though. That’s another blog for another day.
Here’s an interesting note: as often occurs, the most complicated part of this analysis was getting the data. After that, it was easy.
Feel smarter? Want smarter kids? Check out 7 Generation Games. It’s what I do when I’m not blogging.