Other people’s stuff: How not to get your sorry ass fired, part 6
January 23, 2016 | 1 Comment
It’s after midnight and officially the weekend so it’s time for the sixth installment of Mama AnnMaria’s Guide to Not Getting Your Sorry Ass Fired.
Click here for all of my advice on getting fired and quitting.
Click here to buy the games I make on my day job.
As I was saying … there are many reasons you can get your sorry ass fired and many of these have nothing to do with your ability to perform competently at your job. I used to think the same as you, that if I was a good programmer / cashier / secretary / accountant / dental assistant / teacher or whatever, my job should be safe. If I was a better than average whatever it was, they were lucky to have me.
No. Read my lips, or, in this case, type … no matter how great you are at your job, there is a point beyond which it is not worth the pain in the ass of putting up with you.
Let me give you a few examples in the “I can’t believe I have to explain this” category.
- The stuff at work is for you when you are at work. There are two parts to that sentence you should understand, “FOR YOU” and “WHEN YOU ARE AT WORK”. Maybe your job provides a nice office for you with a nice employee lounge. Your spouse/ mom/ roommate/ homeless guy you met on the street should not be in the employee lounge drinking the free coffee, watching the free cable and eating the free bagels. Now, I’m not saying if your roommate is in the neighborhood one day, he or she can’t relax and have a cup of coffee while waiting to go to lunch. What I am saying is that I shouldn’t see your boyfriend hanging out in the lounge more often than people who work here. There is NO circumstance under which I should find an adult who doesn’t work here sleeping on the couch, floor, across two chairs – either in your office or anywhere else in the workplace. If they are that sick, take them to the hospital. If they are that drunk, take them to rehab.
- Don’t take stuff home unless you need it for work. That includes filing cabinets, coffee pots, fax machines, boxes of – well, anything – and that package of printer paper you took home to print out your roommate’s wedding invitations.
- Your work cell phone, iPad, computer and car are for you, for work. I can’t tell you the number of times I have seen someone in trouble at work because their child or friend broke their employer’s equipment or had an accident in a company car. This is the face I make when I hear that.
Seriously, what the hell are you thinking? Why was your kid playing with your company phone in the first place, so that she could drop it in the toilet? Why did you let your friend drive a car that did not belong to you?
- Your mom shouldn’t be coming to your work place on a regular basis. The only exception I can imagine is if you work in a coffee shop and she comes there for coffee every morning. Otherwise, see face above.
Basically, get this straight, your work place is not your home and your office is not your living room. Wear pants. Wear shoes. Brush your teeth and bathe before you get there and don’t invite in your friends and family or share the place and the stuff in it with them. Let me explain why, because sometimes when I say this to people they think I am being mean, and they try to tell me that they are the only one of their social circle to have a job, have access to these nice things and what is wrong with sharing the wealth.
I will tell you what is wrong – generally, your employer budgets enough money for space, supplies, equipment to meet the needs of the people employed. A few times a year, when I submit budget justifications for granting agencies or investors, I include our estimated costs. Those extra people you are bringing in with you, whether it is space or flash drives they are taking, were not in the budget and as a result, there is not comfortably enough for the people who work here. This in itself may not be enough to get your sorry ass fired, but if your behavior in the workplace is annoying enough, you will find yourself the first one up against the wall when the revolution comes.
Incidence Rate: An example with Down Syndrome
January 19, 2016 | 1 Comment
How common is a particular disease or condition? It depends on what you mean by common. Do you mean how many people have a condition, or do you mean how many new cases of it are there each year?
In other words, are you asking about the probability of you HAVING a disease or of you GETTING a disease?
Yes, I mentioned Down syndrome in the title and I am going to use it as an example and one could argue, and I would agree, that Down syndrome is not, strictly speaking, a disease. That relates to a second point, though, which is that incidence and prevalence are terms and techniques that can be applied not just to disease but to chromosomal disorders and even to risk factors, such as smoking. Now, I’m getting ahead of myself which just shows that one should pipe down and pay attention.
INCIDENCE RATE is the rate at which new cases are occurring.
I downloaded a data set of 40,002 births that occurred in U.S. territories. Did you know that the U.S. administers 16 territories around the world? Well, now you do. Bet you can’t name more than four of them. I’m not sure whether this is a good deal for the people of the territories or not, but I am sure they had 40,000 babies.
Finding the incidence rate for Down syndrome was super-duper easy.
I made a birthweight data set, but unlike the last one, I selected an additional variable, ca_down, and then I recoded that variable.
* Point to the downloaded text file and to the folder for the permanent data set ;
FILENAME in "C:\Users\you\Downloads\Nat2014us\Nat2014PublicUS.c20150514.r20151022.txt" ;
LIBNAME out "C:\Users\me\mydir\" ;
DATA out.birth2014 ;
INFILE in ;
INPUT birthyr 9-12 bmonth 13-14 dob_tt 19-22 dob_wk 23 bfacil 32
mager 75-76 mrace6 107 mhisp_r 115 mar_p $ 119 dmar 120
meduc 124 bmi 283-286 cig1_r 262 cig2_r 263 cig3_r 264 dbwt 504-507 bwtr4 511
ca_down $ 552 ;
* C = confirmed and P = pending both count as Down syndrome ;
IF ca_down IN ("C","P") THEN down = "Y" ;
ELSE down = "N" ;
RUN ;
You see, the Down syndrome variable at birth is coded as C for Confirmed, N for No and P for Pending. Well, the “pending” means that the medical personnel suspect Down syndrome but they are waiting for the results of a chromosomal analysis and the confirmed means the analysis has confirmed a diagnosis of Down syndrome. Since I presume that most experienced medical personnel recognize Down syndrome, I combined those two categories (which, if you read the codebook, it turns out that the NCHS folks did, too.)
Then I did
PROC FREQ DATA = out.birth2014 ;
TABLES down ;
RUN ;
And the result was
down | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
N | 39963 | 99.90 | 39963 | 99.90 |
Y | 39 | 0.10 | 40002 | 100.00 |
The incidence rate is 0.10 percent, or about 1 per 1,000.
That’s all incidence rate is – total number of new cases / number at risk.
The number at risk of Down syndrome would be all babies born, all 40,002 of them. The number of new cases was 39. According to the World Health Organization, the incidence of Down syndrome is between 1 in 1,000 and 1 in 1,100 live births.
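If you want more decimal places than PROC FREQ prints, the arithmetic is easy to do by hand in a DATA step. This is just a minimal sketch using the two counts from the table above; the variable names (new_cases, at_risk, incidence, per_1000) are ones I made up for illustration.

```sas
* Incidence = new cases / number at risk, using the counts from PROC FREQ ;
DATA _NULL_ ;
new_cases = 39 ;
at_risk = 40002 ;
incidence = new_cases / at_risk ;
* Express the same proportion as a rate per 1,000 births ;
per_1000 = 1000 * incidence ;
* Write both values to the log ;
PUT incidence= 8.6 per_1000= 6.3 ;
RUN ;
```

The rate works out to roughly 0.975 per 1,000 births, which PROC FREQ rounds to 0.10 percent.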
So, whatever other fallout there may be of living in a U.S. territory, it doesn’t seem to carry with it any greater incidence of Down syndrome births.
When discussing incidence, a condition like Down syndrome is easy to start with because you have it when you are born, you don’t develop it later. That is not the case with every health condition, though. That’s another blog for another day.
Here’s an interesting note: as often occurs, the most complicated part of this analysis was getting the data. After that, it was easy.
Feel smarter? Want smarter kids? Check out 7 Generation Games. It’s what I do when I’m not blogging.
Make Yourself a Birthweight Data Set
January 11, 2016 | 1 Comment
Even though Rick Wicklin (buzzkill!) disabused me of the concern that SAS was communicating with aliens through the obscure coding in its sashelp data sets, I still wanted to roll my own.
If you, too, feel more comfortable with a data set you have produced yourself, let me give you a few tips.
- There is a wealth of data available online, much of it thanks to your federal government. For example, I downloaded the 2014 birth data from the Centers for Disease Control and Prevention site. They have a lot of interesting public use data sets.
- Read the documentation! The nice thing about the CDC data sets is that they include a codebook. This particular one is 183 pages long. Not exciting reading, I know, but I’m sitting here right now watching a documentary where some guy is up to his elbows in an elephant’s placenta (I’m seriously not making this up) and that doesn’t look like a bowl of cherries, either.
- Assuming you did not read all of the documentation even though I told you it was important (since I have raised four daughters and know all about people not paying attention to my very sound advice), at a MINIMUM you need to read three things: 1) sampling information to find out if there are any problems with selection bias or any sampling weights you need to know about, 2) the record layout – otherwise how the hell are you going to read in the data, and 3) the coding scheme.
Let’s take a look at the code to create the data set I want to use in examples for my class. Uncompressed, the 2014 birth data set is over 5 GB, which exceeds the limit for upload to SAS On-Demand for Academics. It also isn’t the best idea for a class data set for a bunch of reasons, a big one being that I teach students distributed around the world and on ships at sea (for real), and having them access a 5 GB data set isn’t really feasible.
I’m going to assume you downloaded the file into your downloads folder and then unzipped it.
STEP 1: CHECK OUT THE DATA SET
Since I READ the codebook (not being a big hypocrite), I saw that the record length is 775 and that there are nearly 4 million records in the data set. Opening it up in SAS Enterprise Guide or the Explorer window didn’t seem like a good plan. My first step, then, was to use a FILENAME statement to refer to the data I downloaded, which happens to be a text file.
I just want to take a look at the first 10 records to see that it is what it claims to be in the codebook. (No, I never DO trust anyone.)
The default length for a SAS variable is 8, so I’m going to use a LENGTH statement to give the variable a length of 775 characters, enough to hold a whole record.
Notice that the INFILE statement has to refer back to the reference used in the FILENAME statement, which I very uncreatively named “in” . Naming datasets, variables and file references is not the place for creativity. I once worked with a guy who named all of his data sets and variables after cartoon characters – until all of the other programmers got together and killed him.
Dead programmers aside, pay attention to that OBS=10 unless you really want to look at 3,998,175 records. The OBS =10 option will limit the number of records read to – you guessed it – 10.
With the INPUT statement, I read in from position 1-775 in the text file.
All of this just allows me to look at the first 10 records without having to open a file of almost 4 million records.
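FILENAME in "C:\Users\you\Downloads\Nat2014us\Nat2014PublicUS.c20150514.r20151022.txt" ;
DATA example ;
* One character variable wide enough to hold a whole record ;
LENGTH var1 $775 ;
* OBS = 10 stops reading after the first 10 records ;
INFILE in OBS = 10 ;
INPUT var1 1-775 ;
RUN ;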
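Reading the records is only half of "taking a look", since you still have to display them somewhere. Here is a minimal sketch of two ways to do that, assuming the same "in" file reference as above. The second option is my own alternative, not what the step above does: it skips creating a data set and simply echoes the raw records to the log.

```sas
* Option 1 - print the data set created by the step above ;
PROC PRINT DATA = example ;
RUN ;

* Option 2 - skip the data set and echo the raw records to the log ;
DATA _NULL_ ;
INFILE in OBS = 10 ;
* An INPUT statement with no variables just loads the next record ;
INPUT ;
* _INFILE_ holds the record that was just read ;
PUT _INFILE_ ;
RUN ;
```

If the lines come out truncated on your installation, adding LRECL = 775 to the INFILE statement is worth a try.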
STEP 2: READ IN A SUBSET OF THE DATA
Being satisfied with my first look, I went ahead to create a permanent data set. I needed a LIBNAME statement to specify where I wanted the data permanently stored.
The out in the LIBNAME and DATA statements needs to match. It could be anything, it could be komodo or kangaroo, as long as it’s the same word in both places. So … my birth2014 data set will be stored in the directory specified.
How do I know what columns hold the birth year, birth month, etc. ? Why, I read the codebook (look for “record layout”).
LIBNAME out "C:\Users\me\mydir\" ;
DATA out.birth2014 ;
INFILE in OBS= 10000 ;
INPUT birthyr 9-12 bmonth 13-14 dob_tt 19-22 dob_wk 23 bfacil 32
mager 75-76 mrace6 107 mhisp_r 115 mar_p $ 119 dmar 120
meduc 124 bmi 283-286 cig1_R 262 cig2_R 263 cig3_r 264 dbwt 504-507 bwtr4 511;
STEP 3: RECODE MISSING DATA FIELDS. Think how much it would screw up your results if you did not recode 9999 for birthweight in grams, which means not that the child weighed about 22 pounds at birth but that the birthweight was missing. In every one of the variables below, the “maximum” value was actually the flag for missing data. How did I know this? You guessed it, I read the codebook. NOTE: the statements below are included in the data step.
IF bmi > 99 THEN bmi = . ;
if cig1_r = 6 then cig1_r = . ;
if cig2_r = 6 then cig2_r = . ;
if cig3_r = 6 then cig3_r = . ;
if dbwt = 9999 then dbwt = . ;
if bwtr4 = 4 then bwtr4 = . ;
STEP 4: LABEL THE VARIABLES – Six months from now, you’ll never remember what dmar is.
NOTE: these statements below are also included in the data step.
LABEL mager = "Mom/age"
bfacil = "Birth/facility" mrace6 = "Mom/race" mhisp_r = "Mom/Hispanic"
dmar = "Marital/Status" meduc = "Mom/Educ"
cig1_r = "Smoke/Tri1" cig2_r = "Smoke/Tri2" cig3_r = "Smoke/Tri3" ;
RUN ;
So, that’s it. Now I have a data set with 10,000 records and 19 variables that I can upload for my students to analyze.
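Before handing a data set to students, it doesn’t hurt to check that the recodes and labels did what you expected. The sketch below is one way to do that, assuming the out.birth2014 data set created above. The use of SPLIT = '/' is my guess at why the labels contain slashes; the post itself doesn’t say.

```sas
* List the variables, their types and their labels ;
PROC CONTENTS DATA = out.birth2014 ;
RUN ;

* After recoding, the maximums should be real values, not 9999 or 6 ;
PROC MEANS DATA = out.birth2014 N NMISS MIN MAX ;
VAR bmi cig1_r cig2_r cig3_r dbwt bwtr4 ;
RUN ;

* SPLIT = breaks each label into two heading lines at the slash ;
PROC PRINT DATA = out.birth2014 (OBS = 5) SPLIT = '/' ;
RUN ;
```

If a maximum still shows the missing-data flag, the recode for that variable did not run or the flag value in the codebook is different from the one you used.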
Know Thy Data: The Most Important Commandment in Statistics
January 7, 2016 | 4 Comments
I was going to write about prevalence and incidence, and how so-called simple statistics can be much more important than people think and are vastly under-rated. It was going to be cool. Trust me.
In the process, I ran across two things even more important or cooler (I know, hard to believe, right?)
Here’s what happened … I thought I would use the sashelp library that comes with SAS On-Demand for Academics – and just about every other flavor of SAS – for examples of differences in prevalence. Since no documentation of all the data sets showed up on the first two pages of Google, and one is prohibited by international law from looking any further, I decided to use SAS to see something about the data sets.
Herein was revealed cool thing #1 – I know about the tasks in SAS Studio but I never really do much with these. However, since I’m teaching epidemiology this spring, I thought it would be good to take a look. You should do this any time you have a data set. I don’t care if it is your own or if it was handed down to you by God on Mount Sinai.
Okay, I totally take that back. If your data was handed down to you by God on Mount Sinai, you can skip this step, but only then.
At this point, Buddhists and Muslims are saying, “What the fuck?” and Christians are saying, “She just said, ‘fuck’! Right after a Biblical reference to Moses, too!”
This is why this blog should have some adult supervision.
But I digress. Even more than usual.
KNOW YOUR DATA! I don’t mean in the Biblical sense, because I’m not sure how that is even possible, but in the statistical sense. This is the important thing. I don’t care how amazingly impressive the statistical analyses are that you do, if you don’t know your data intimately (there’s that Biblical thing again) you may as well make your decisions by randomly throwing darts at a dartboard. I once told some people at the National Institutes of Health that’s how I arrived at the p-value for a test. For the record, the Men in Black have more of a sense of humor about these things than the NIH.
Ahem, so … if you are using SAS Studio, here is a lovely annotated image of what you are going to do.
1. Click in the left window pane where it says TASKS on the arrow to bring up a drop down menu
2. Click on the arrow next to Data and then select the Characterize Data task. (You might say that was 2 AND 3 if you were a smart ass and who asked you, anyway?)
3. Click the arrow next to the word DATA in the window pane second from the left and it will give you a choice of the available directories. (NOTE: If you are going to use directories not provided by SAS you’ll need a LIBNAME statement in an earlier step but we’re not and you don’t in this particular example.) Under the directory, you will be able to select your file, in this case, I want to look at birthweight.
4. Next to the word VARIABLES, click the + and it will show the variables in the data set. You can hold down the shift key and select more than one. You should do this for all of the variables of interest. In my case, I selected all of the variables – there aren’t many in this dataset.
5. To run your program, click the little running guy at the top of the window. This will give you – ta-da
RESULTS!
Let’s notice something here – the mother’s age ranges from -9 (seriously? What’s that all about?) to 18. Is this a study of teenage mothers or what? The answer seems to be “what” because the mean age is .416. Say, what? The mother’s educational level ranges from 0 to 3, which probably refers to something but I’ll bet it’s not years of education.
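If you would rather type than click, something like the following gives you roughly the same first look. This is a sketch, not what the task generates behind the scenes, and it assumes the variables are named MomAge and MomEdLevel in sashelp.bweight; run PROC CONTENTS first if you aren’t sure.

```sas
* Range and mean for two of the numeric variables ;
PROC MEANS DATA = sashelp.bweight N NMISS MEAN MIN MAX ;
VAR MomAge MomEdLevel ;
RUN ;

* Frequency counts make it obvious that MomEdLevel is a code, not years ;
PROC FREQ DATA = sashelp.bweight ;
TABLES MomEdLevel ;
RUN ;
```

A variable with only four distinct values is a category, which is a good hint that reporting its mean as years of education makes no sense.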
In a class of 50 students, inevitably, one or two will turn in a paper saying,
“The average mother had 1.22 years of education.”
WHAT? Are you even listening to yourself? Those students will defend themselves by saying that is what “SAS” told them.
According to the SAS documentation, these data are from the 1997 records of the National Center for Health Statistics.
I ran the following code:
proc print data=sashelp.bweight (obs=5) ;
run ;
And either it’s the same data set or there was an amazing coincidence that all of the data in the first five records was the same.
However, because I really need to get a hobby, I went and found the documentation for the Natality Data set from 1997 and it did not match up with the coding here. This led me to conclude that either:
a. SAS Institute re-coded these data for purposes of their own, probably instructional,
b. This is actually some other 1997 birthweight data set from NCHS, in which case, those people have far too much time on their hands.
c. SAS is probably using secretly coded messages in these data to communicate with aliens.
Not being willing to chance it, I went to the NCHS site and downloaded the 2014 birth statistics data set and codebook so I could make my own example data.
So … what have we learned today ?
- The TASKS in SAS Studio can be used to analyze your data with no programming required.
- It is crucial to understand how your data are coded before you go making stupid statements like the average mother is 3 months old.
- You can download birth statistics data and other cool stuff from the National Center for Health Statistics site.
- The Spoiled One uses any phone not protected by National Security Council level of encryption for photos of herself.
———
Want to make your children as smart as you?
Get them 7 Generation Games. Learn math. Learn social studies. Learn not to fall into holes.
Runs on Mac and Windows for under ten bucks.
What’s epidemiology? A definition with a side of SAS
January 5, 2016 | 3 Comments
I’ll be teaching a graduate course in epidemiology in the spring and giving a talk on biostatistics at SAS Global Forum in April, so I thought I’d jump ahead and start rambling on about it now.
When I tell people that I teach epidemiology, the first question I usually get is,
What’s epidemiology?
In short, epidemiology is quantifying disease. There are five ways (at least) statistics can be applied to the study of disease:
- How common is it? This is a question of prevalence (how likely you are to have it) and incidence (how likely you are to get it). If you think those two are the same, you should take a course in epidemiology. Or, if you are busy, you can just read my blog post tomorrow or this paper from the Western Users of SAS Software Conference by Chu and Xie (2013).
- What causes it? What are the factors that increase (or decrease) your risk of contracting a disease? My first thought here is PROC LOGISTIC. It’s not my only thought, but my first one.
- What pattern(s) does it follow? What is the prognosis? Are you likely to die of it quickly, eventually or never? To determine if a treatment is effective for cancer of the eyelashes, we need to first have an idea of what the probability of disability or death is when one is left untreated, and over how long a period of time. That is, what is the “natural progression” of a disease? PROC LIFETEST and PROC PHREG leap to mind here.
- How effective are attempts to prevent or treat a disease? Several options suggest themselves here – PHREG for comparing how long people survive under different conditions, LOGISTIC for testing for significant differences in the probability of death. You could even use ordinary least squares (OLS) regression methods if you were interested in a measure like quality of life scaled scores.
- Developing policies to minimize disease.
The last one might not sound like a strictly statistical task to you, but I would argue that it is, that a key feature, perhaps THE key feature of statistics, and what makes it different from pure mathematics, is the application to answer a question.
I would argue (and so would Leon Gordis, who, literally, wrote the book on Epidemiology ) that a major part of epidemiology is applying the results from those first four aspects to make decisions that benefit public health.
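To make the “what causes it?” question a bit more concrete, here is a hedged sketch of the kind of PROC LOGISTIC run I have in mind. It is an illustration only: it assumes the sashelp.bweight data set with variables Weight (in grams), MomSmoke and MomAge, and the lowbw flag and the work.bw data set are names I invented for this example.

```sas
* Hypothetical example - flag low birthweight, under 2500 grams ;
DATA work.bw ;
SET sashelp.bweight ;
lowbw = (Weight < 2500) ;
RUN ;

* Model the probability of low birthweight from smoking and mother age ;
PROC LOGISTIC DATA = work.bw ;
MODEL lowbw (EVENT = '1') = MomSmoke MomAge ;
RUN ;
```

The odds ratio for MomSmoke is the piece that speaks to risk: a value above 1 would suggest that smoking goes along with higher odds of low birthweight in these data.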
Why do developing countries have the types of public health problems that the U.S. and Western Europe had over 100 years ago? Because, due to public health programs like improved sanitation and vaccines taken by all the people who were not raised by morons, we have greatly reduced tuberculosis and diarrheal diseases.
Okay, back to work, and more later as I work on my class notes and assignments.
—–
Speaking of random amazingness … Jarrad Connor, who I follow on Twitter, tipped me off to the unintended virtual pandemic in World of Warcraft as a model of how disease spreads.