Let the buyer beware – that phrase certainly applies to open data, as does the less historical but equally true statement that students always want to work with real data until they get some.

Lately, I have had students working with two different data sets that have led me to the inevitable conclusion that self-report data is another way of proving that people are big fat liars. One data set is on campus crime and, according to the data provided by personnel at college campuses, the modal number of crimes committed per year – rape, burglary, vehicle theft, intimidation, assault, vandalism – is zero. Having taught at a very wide variety of campuses, from tribal colleges to extremely expensive private universities, and never seen one that was 100% crime free, I suspected  – and this is a technical term here – that these data were complete bullshit. I looked up a couple of campuses that were in high crime areas and where I knew faculty or staff members who verified crime had occurred and been reported to the campus and city police. These institutions conveniently had not reported their data, which is morally preferable to lying through their teeth, with the added benefit of requiring less effort.

Jumping from there to a second study on school bullying, we found, as reported by school administrators in a national sample of thousands of public elementary, middle and secondary schools, that bullying and sexual harassment never, or almost never, occur and there are no schools in the country where these occur on a daily basis. Are you fucking kidding me? Have you never walked down the hall at a middle school or high school? Have you never attended a school? What the administrators thought to gain or avoid by lying on these surveys, I don’t know, but it certainly reduces the value of the data for, well, for anything.

So … the students learn a valuable life lesson about not trusting their data too much. In fact, this may be the most valuable lesson they learn, Stamp’s Law:

The government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases.

From an analysis standpoint, this is my soapbox that I am ranting on every day. Before you do anything, do a reality check. If you use SAS Enterprise Guide, the Characterize Data task is good for this, but any statistical software, or any programming language, for that matter, will have options for descriptive statistics  – means, frequencies, standard deviations.
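For what it’s worth, a reality check along these lines might look something like the sketch below in SAS. The data set name (work.campus_crime) and the variable names are made up for illustration; substitute your own.

```sas
/* Reality check before any real analysis.
   work.campus_crime and the variable names are hypothetical. */

/* N, number missing, mean, standard deviation, minimum and maximum */
proc means data=work.campus_crime n nmiss mean std min max;
    var burglary vehicle_theft assault;
run;

/* Frequency distribution, counting missing values as their own category */
proc freq data=work.campus_crime;
    tables burglary / missing;
run;
```

If nearly every value is zero, or a large share of the records are missing, that is worth knowing before you fit anything.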

This isn’t to say that all open data sucks. On the contrary, I’m working with two other data sets at the moment that are brilliant. One used abstracts of medical records data over nine years plus state vital records to record data on medical care, diagnoses and death for patients over 65. Since Medicare doesn’t pay unless you have data on the care provided and diagnosis, and the state is kind of insistent on recording deaths, these data are beautifully complete and probably pretty accurate.

I’ve also been working with the TIMSS data. You can argue all you want about teaching to the test, but it’s not debatable that the TIMSS data is pretty complete and passes the reality test with flying colors. Distributions by race, gender, income and every other variable you can imagine are pretty much what you would expect based on U.S. Census data.

So, I am not saying open data = bad. What I am saying is let the buyer beware, and run some statistics to verify the data quality before you start believing your numbers too much.

It’s about time I got some recognition!


You can claim your own Love Stats award here.

Be careful of undeserved awards, though. The last person who falsely claimed a Love Stats award had multicollinearity in his measures, a high VIF and died of complications of homoscedasticity. You have been warned.

I am tired of hearing that men in technical fields cannot meet women. Some stupid company even went so far as to say that they had hired hot women to work at their company so now they were going to have no problem attracting good programmers and engineers. This is stupid on so many levels. As the house rocket scientist said,

“Right. That’s exactly what guys want, more good-looking women to turn them down. I think the way  to attract talent is have interesting problems and pay people well.”

To continue on the logic – some of the technical people you want to hire are women, a lot of those men are married, and explaining to your wife that you are taking a job at another company because there are hot women is probably not the kind of conversation conducive to marital happiness. I presume this company is looking for young men who are very bright with no social life so they will work all of the time. I do have to wonder, though, what the “hot women” hired by the company think of being used as recruiting bonuses.

As a public service, I have decided to explain the three steps to actually meeting a woman. This is based on the extensive research of being a woman, having been married three times (less exciting than it sounds, I was divorced at 25 and my second husband passed away after 10 years of marriage) and accomplished “till death do us part” 67% of the time – in excess of the national average I might add – being the mother of three daughters in their twenties and spending well over half my life as a programmer, statistician and engineer. I believe anthropologists call this “participant observation”. All of these same steps apply in on-line relationships as well.

Jenn after finishing her M.A.

1. Say “Hi”.  This is the first step where many men fail. Have you actually talked to that woman in accounting or who you see every day at the coffee shop? Next time you walk by her cubicle, see her in front of you in line or you pass each other in the halls in your apartment complex, say “Hello”.  That’s all. Just say, “Hi”.  The next time you see her, do it again. (I don’t mean walking by Janet’s cubicle to the water cooler every hour and saying hello. That puts you in the weirdo camp.) I mean, when you walk into the office in the morning or by her desk as you head out for lunch.

After this has gone on for a few days, if you are behind her in line or coming out of the elevator, say, “You know, I don’t think we’ve been introduced. My name is Jack. I work over in the systems administration group.” (Unless your name is actually Jack and you really are in systems administration, you should substitute the correct information in that sentence.) DON’T ASK HER OUT. She probably already knows who you are, but that doesn’t matter. She’ll tell you her name, as if you didn’t know, and something about herself. If she tells you she works in accounting or for Goldman Sachs, ask her about it: how she likes working there, how long she has been there, how she ended up there, whether she always planned on being an accountant. You might discover she is the most boring person on God’s green earth and you don’t want to go out with her, but that is okay, too. After you’ve walked to where her desk is or picked up your coffee, say “Nice talking to you.” And leave.

Exhibit A of how seldom men actually even say, “Hi” is darling daughter number two, shown above right after she finished her master’s degree. For four semesters, while she was getting her M.A., she would come over to the building where I worked and we would car pool home. Often, I was stuck in a meeting and she would be sitting downstairs waiting for me in a building where hundreds of the exact men who complain about meeting women worked. At the time, she wasn’t dating anyone, having just moved back to L.A. to start graduate school. How many people in that time actually stopped and said, “Hi” to her? About seven – four married men who worked with me and knew she was my daughter, and the three security guards who rotated at the front desk, one of whom was a woman and another old enough to be her grandfather.

An even better time to say hi is when someone new starts at work or moves into the neighborhood. When you pass the person in the hall, say, “Hi, I’m Jack. I know you’re new here – if you need any help, I’m in apartment 12 / I work in the Unix group – feel free to ask. I know it is hard getting used to everything in a new place.”

Here is a really key point – do this whenever someone new moves in/ comes in to the area. She may be married, old or not the type of woman you would date in a million years. You ought to offer to help out new men at work too. You don’t want to be seen as that creepy guy who hits on all the attractive women.

Why anyone thinks all of these suggestions go out the window when they are online is beyond me. Do you know how many women on on-line sites get messages that start out, “Show me your tits”? If you wouldn’t do that in the hallway, don’t do it on-line.

2. Don’t be a jerk. If Janet does come over and ask how to log in to the server, don’t start out with “Well, you see we have these things called computers…” I cannot count the number of men I have seen trying to impress women with their technical knowledge and instead coming off as a pompous ass (and I can count pretty high). If she asks a question, tell her the answer. Then, if you can, add something like, “Did anyone tell you where the company handbook/ procedures/ secret site of insider knowledge is?” and show her where she can find the answer for herself in the future. DON’T ASK HER OUT.

From talking to Janet in the elevator, you know that she works in accounting, is into fly fishing, whatever. The next time you see her in the hallway, ask her if she could recommend an accountant, or tell her you were interested in taking up fly fishing and wonder if Bass is a good place to buy bait. Don’t lie. If fly-fishing makes you want to puke, go with the accountant angle or you may be trying to find some way to get out of cutting bait. DON’T ASK HER OUT. On the other hand, if she says, “I’m going to Bass on Friday after work, do you want to come?” Don’t be an idiot. Say yes.

Keep in mind, though, that when a woman asks you if you want to go to lunch or go to Mandelbrot’s lecture (well, before he was dead) or go to Macworld, she may just think it would be nice to talk with you at lunch or to have company to hear Mandelbrot talk about fractals or go to Macworld.

Again, all of this applies on-line as well. Darling daughter number three (shown here) has thousands of twitter followers and Facebook friends. Fairly often, she’ll get a tweet along the lines of,

“I would like you to have my babies.”
and many, many more graphic ones. What the hell are these people thinking? If she ever in a psychotic break went out with one of these idiots, I’d smack them in the head the instant they came into the house. The only one that I did not think was super-creepy was the guy who asked her to the Marine Corps Ball. If she hadn’t been in the middle of training for a fight this Friday, it wouldn’t have surprised me at all  if she had taken him up on it.

 

So, when do you get to go out with her?

3. Find the right woman. You might think this would come first. You’re wrong. If you do steps 1 & 2 first, you’ll find out if Janet is married. If she is, DON’T EVER ASK HER OUT. What are you, a moron? If Janet has a boyfriend, don’t ask her out either. Seriously, do you want to date a woman who cheats on her boyfriend? If she breaks up with him, and she’s interested in you (and often even if she isn’t), she’ll bring it up in the conversation. By this point, you know if Janet is dumb as a rock, is a born-again Christian or spends her evenings kick-boxing. Any of those may be a turn-off for you or may be exactly what you have been looking for your whole life. You may find that you thought dating someone very religious was off the table but that Janet’s views on always being honest, the importance of family and the way she walks the walk by volunteering at a soup kitchen every weekend are pretty amazing. In short, you’ve gotten to know Janet as a person. Worst case scenario is you have made an acquaintance who you don’t want to date. However, that same person may have given you a new perspective, may refer you for a job some day or introduce you to someone you DO want to date. Women tend to have a lot more friends who are other women than men do. A second possible outcome is you and Janet become friends. She’s not your type, maybe she never was. She’s married, too old, too young, just not compatible. However, having a friend is a good thing and as a bonus, she may give you some good insight into starting a relationship with a woman.

Ronda trains two or three times a day and loses things at such a rate that it is almost a super-power. She has lost so many passports that I think the State Department has her on a watch list.  If hiking in the mountains for three hours is not your idea of a good time, the two of you are not going to get along. On the other hand, Jenn, lovely daughter number two above, had a minor in Film Studies and teaches history. If your idea of a good time is going to the gym at 5 a.m. on Sunday morning, she is probably going to quote some line from a movie I never saw to describe you. Whatever it means, it won’t be good. They are both massive Dr. Who fans, though. So, one day, when you are hanging out with your friend, Ronda, watching a Dr. Who marathon because she has trained for five hours that day and is too exhausted to move, you mention you really liked the Ken Burns Civil War documentary. She rolls her eyes at your boringness and then a light goes off, “You know, you really ought to meet my sister …”  and then you are off again at Step #1.

 

Giving a whole new meaning to “computing in the cloud”, I finished up my paper “A gentle introduction to survival analysis” for the Nevada SAS Users Group from 30,000 feet up using on-board wi-fi and SAS On-demand.  I was shocked to find that the performance was much better in-flight. Presumably, if you charge people $12.95 for three hours to access your wireless system on the plane there are fewer people using it than if everyone can use it as long as they want for free, say, at a university. I believe this was some principle about supply and demand I learned in microeconomics.

View coming into LAX airport

In case you did not know, SAS On-Demand is the *FREE* (as in free puppy, although occasionally as in free beer) offering from SAS. It comes in three flavors, Enterprise Guide (which I am using), Enterprise Miner and JMP.

Let’s say I wanted to do a survival analysis in SAS Enterprise Guide. More specifically, let’s say I wanted to do a proportional hazards regression model using the PHREG procedure.

Step 1: Go to TASKS in the top menu, select SURVIVAL ANALYSIS and then PROPORTIONAL HAZARDS, as shown below.

Menu for PH Reg

Next, you need to specify your variables.

Variable specification in survival analysis window

In this case,

  • I drag the variable Survival Time in Days under  “Survival Time”
  • I drag the variable named Status under “Censoring variable”
  • Finally, I drag the variable named Rx under “Explanatory variables”

For this analysis, I only have one explanatory variable, so I am done. But nothing happens. I would click on the RUN button but SAS On-Demand won’t let me. It’s greyed out. If this happens to you, it’s one of those simple things to fix once you know how  – so, hey, you came to the right blog. Click on that variable under Censoring variable. A pane on the right will appear.

Menu with pane for censoring variable added

Click on the value that denotes censoring. In this case, the value of 0 means there was no event (the patient survived to the end of the study). If nothing shows up in the box at the bottom to check, never fear. You can always enter the value that denotes censoring in the box above that says Enter Custom Value and click ADD. Then, you can click on the RUN button.

To see the results, exactly as produced, you can click here for the pdf file.

What about code? We like code !

Code Window

When your program runs, you get the pdf file as output, but if you look at the top of your screen, next to the output you can see three other tabs, for log, code and input data. Click on the Code button and you can see the code SAS Enterprise Guide created.
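I can’t reproduce the exact code Enterprise Guide generates here, but a hand-written sketch of the same model would look roughly like this. The data set name (work.trial) and the variable name days are assumptions for illustration; status and rx stand in for the Status and Rx variables from the task above, with 0 as the value that denotes censoring.

```sas
/* Proportional hazards regression, written by hand.
   work.trial and the variable name days are assumed for illustration;
   this is not the code Enterprise Guide generates. */
proc phreg data=work.trial;
    /* survival time * censoring variable (value that denotes censoring) */
    model days*status(0) = rx;
run;
```

If Rx is a character variable, or a numeric code for more than two groups, you would probably want a CLASS rx; statement before the MODEL statement.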


Maria

 

Maria Burns Ortiz, a sports writer for Fox News Latino and my darling daughter number one, has commented,

 

“You write a blog on statistics and other people read it? That must be nerd-squared.”

Well, now you have SAS computing in the cloud while in the clouds, so that must be cloud-squared. Does that make this blog nerd to the fourth?

 

I might as well give you my opinion of these two kinds of sin as long as, in a way, against each other we are pitting them,
And that is, don’t bother your head about the sins of commission because however sinful, they must at least be fun or else you wouldn’t be committing them.
It is the sin of omission, the second kind of sin,
That lays eggs under your skin.
The way you really get painfully bitten
Is by the insurance you haven’t taken out and the checks you haven’t added up the stubs of and the appointments you haven’t kept and the bills you haven’t paid and the letters you haven’t written.

I had to respond to the post by Mike Nemecek  and tweets by Rick Wicklin quoting Shakespeare with some culture of my own. Not having any degrees in liberal arts, though, the best I could do was this excerpt from the poem by Ogden Nash, Portrait of the Artist as a Prematurely Old Man.

Lately I’ve been playing around with the LIFETEST procedure and it occurred to me that a way to get painfully bitten with this, and other survival analysis procedures, is not to think about some obvious facts. I’m assuming you are new to these procedures, either that, or in a big hurry, and you don’t scrutinize your output carefully. In that case, you may misinterpret the mean survival time.

The mean survival time is the mean length of time people / bacteria / rats survived, right? Not necessarily.

Many procedures, say, factor analysis or regression, automatically drop observations with missing data. Survival analysis procedures don’t work exactly the same way.

I am telling you this because it is a mistake I have seen made by people who were familiar with other statistical procedures and, I can only presume, in a hurry. Their solution to not knowing the length of survival time for some of their subjects was to drop those for whom the data are unknown.

Let’s try this with a data set I have lying around. I only use those observations for which I have complete data, that is, I know the survival time. It gives me the following information on survival time in days.

Mean       Standard Error
360.934    22.183

Quartile Estimates
                                       95% Confidence Interval
Percent   Point Estimate   Transform   [Lower      Upper)
75        532.00           LOGLOG      489.00      612.00
50        308.00           LOGLOG      244.00      393.00
25        167.00           LOGLOG      117.00      193.00

Easy, right? The mean survival time is about 361 days. The median is 308 days.

However, this is only using those people for whom we have a survival time. What about the other people? When I include EVERYONE, whether they died or not, I get the following:

 

Mean Standard Error
431.466 22.506

So, is this the correct value then?  Are these the correct percentile points?

                                       95% Confidence Interval
Percent   Point Estimate   Transform   [Lower      Upper)
75        652.00           LOGLOG      560.00      755.00
50        428.00           LOGLOG      341.00      512.00
25        192.00           LOGLOG      157.00      237.00

Well, not exactly. In fact, if you are using SAS, you will see this helpful note in your log:

The mean survival time and its standard error were underestimated because the largest observation was censored and the estimation was restricted to the largest event time.

In my sample data set, 25% of the observations were censored, that is, we don’t know when they died.
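To make the difference between the two runs concrete, here is a minimal sketch of each. It assumes a data set named work.trial with survival time in a variable named days and a status variable where 1 means the event was observed and 0 means the observation was censored. Those names and that coding are my assumptions for illustration, not necessarily what produced the output above.

```sas
/* The mistake: throw away the censored observations first.
   work.trial, days and status are hypothetical names;
   status=1 is assumed to mean the event was observed, 0 censored. */
proc lifetest data=work.trial;
    where status = 1;       /* keeps only subjects with an observed event */
    time days;
run;

/* Keep everyone and tell LIFETEST which value denotes censoring */
proc lifetest data=work.trial;
    time days*status(0);
run;
```

The second version uses the censored subjects for as long as they were observed instead of pretending they were never in the study.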

Can we then say that the median survival time really is 428 days? Okay, the mean is not correct because some of those 25% may have died years later. What about the median? The answer is, “It depends.”

Depends on what, you might ask. Well, it depends on why you don’t know when they died. If they dropped out of your study and you have no idea what happened to them, then I would say that you want to be a bit cautious in your interpretation of both the mean and the median survival times.

If everyone whose survival time is unknown was censored because your study ended and they were still alive, then I would say, yes, the median survival time is accurate, assuming you had data on all of those people at the beginning and the end of the study.

Although survival analysis is generally thought of as predicting whether one survived in the sense of not dying, it is also used for other applications, such as predicting how long people last in a treatment program, without committing another crime, or without drinking or using drugs. In those cases, when people are lost by dropping out of your study and out of sight, I would suspect it is very possible that at least some of those people began drinking, committing crimes or whatever prior to the end of your study. So, when people are censored due to having missing data, I would be a bit skeptical of both the median and the mean, and, as with all data problems, the larger the percentage of your sample this affects, the more worried about it I would be.

An interesting point came up the other day when I was listening to a lecture. I’d assumed in animal studies that you would only have the problem of right-censored data due to the subjects living past the end of the study period. The speaker mentioned that a couple of her subjects were censored because the rats had died of other causes unrelated to the disease under study. Not sure what occupational hazards a rat faces, but it’s getting pretty bad when you can’t even count on a rat these days.

A rat

 

 

Completely unrelated to statistics, software or anything, really, my darling daughter number three is ranked in the top ten in the world in mixed martial arts.

She has been nominated for female fighter of the year, which is decided in a scientific, objective manner by how many votes the nominees get on the Internet. She said,

“Hey mom, you have a blog and I bet not too many people who read it have a favorite fighter  they’re going to vote for, so why don’t you ask them all to vote for me.”

The site is here; you can vote once per email address.

You need to click on the very large VOTE NOW button, and give your name, email and a password. Ronda is under female fighters, which is category #12.

If you need some actual statistics, her current record is 6-0. The longest fight that she has had to date is about 57 seconds. It has taken her a total of less than three minutes to win six fights. You can watch one of her fights on youtube here.

In an odd coincidence, her next fight is in Las Vegas on November 18, a few hours after I finish speaking to the Nevada SAS Users Group on, of all things, survival analysis. It has been suggested to me that one of the examples I could use is the number of seconds that the opponent will last fighting Ronda. When I went to her last fight, I was having a martini and the man in front of me at the bar asked why I looked so nervous. I told him that my daughter was fighting. He said, “Ronda Rousey? The bookies are giving six to one odds on her. It’s three to one odds against her opponent even getting out of the first round.”

See? Statistics are everywhere.

If you’re in Las Vegas, the fight is at The Palms, after the SAS Users Group meeting, and it will also be shown on Showtime (the fight, not the users group meeting).

So, yeah, vote for Ronda.

She’s a good kid, she’s funny and she has a big-ass white dog. What more could you ask?

Jon Peltier and I were going back and forth on twitter about why it is that people will post answers on a forum or mailing list that are completely incorrect. As Peter Flom says, “They are often in error but never in doubt.”

Jon suggested there are three types: those who don’t know and admit it, those who don’t know they don’t know and those who know they don’t know but won’t admit it.

I’m going to be in the first type today and admit several things I did not know.

I have always hated the idea of custom hypotheses using the ESTIMATE or CONTRAST statements because they are a pain in the ass to do, but sometimes they just make sense. Essentially, any time your interest is not in whether “any of these cell means are different from any other”, the standard test, but whether one or more specific cell means are different from the others, then you want a CONTRAST or ESTIMATE statement.

The first cool thing about the LSMESTIMATE statement is that it is easier to write. Here (from the SAS documentation) is an example of the ESTIMATE statement replaced by LSMESTIMATE:

    estimate 'AB12' intercept 1
                            a 1 0
                            b 0 1 0
                          a*b 0 1 0 0 0 0;
    estimate 'avg ABij' intercept 6
                                a 3 3
                                b 2 2 2
                              a*b 1 1 1 1 1 1 / divisor=6;
    estimate 'AB12 vs avg ABij' a 3 -3
                                b -2 4 -2
                              a*b -1 5 -1 -1 -1 -1 / divisor=6;

Is replaced by

       lsmestimate a*b 'AB12 vs avg ABij' -1 5 -1 -1 -1 -1 / divisor=6;

Not only is it a lot less trouble to write, and, I think, to interpret, but I have very bad vision. I wear nuclear-strength contacts to see past the end of my nose and glasses on top of them to read, and I enlarge everything on my screen 125% or more, so the odds of me typing something like the first several statements without making a mistake somewhere are very slim. It is very hard for me to tell if there is actually a space there or not, which is why I was very enthusiastic about the other syntax option for LSMESTIMATE.

Phil Gibbs, in a paper at the Denver SAS Users Group meeting, gave some very good examples of exactly when you would want to use custom hypotheses, for example if you thought one drug was more effective for disease A and a second drug was more effective for disease B, which seems a perfectly reasonable set of hypotheses, barring you are testing some incredible drug that cures all ills (according to my grandmother, one already exists and it is called rum).

Even better, he pointed out that you can use non-positional syntax where rather than listing all of the cells with zeroes for those you don’t want to contrast  you can just have the ones you are interested in, like this …

LSMESTIMATE drug "drug pair 1,2 vs drug pair 3,4"
[ 1,1] [ 1,2]
[-1,3] [-1,4] / divisor=2;
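LSMESTIMATE has to live inside a procedure that supports it. As a sketch, here is the statement above dropped into PROC GLIMMIX, which is my choice for illustration rather than anything from Phil’s paper. The data set name (work.drugtrial) and the response variable are hypothetical, and drug is assumed to be a classification variable with at least four levels.

```sas
/* Custom hypothesis with non-positional LSMESTIMATE syntax.
   work.drugtrial and response are hypothetical names; drug is assumed
   to be a classification variable with at least four levels. */
proc glimmix data=work.drugtrial;
    class drug;
    model response = drug;
    /* each bracket is [coefficient, level of drug];
       levels not listed get a coefficient of zero */
    lsmestimate drug "drug pair 1,2 vs drug pair 3,4"
        [ 1,1] [ 1,2]
        [-1,3] [-1,4] / divisor=2;
run;
```

With the divisor of 2, this compares the average of the first two drug means with the average of the third and fourth.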

This isn’t all that new of a statement and I don’t know how I overlooked it when it came out. I was probably in a session where it was mentioned, didn’t have any use for it at the moment and just remembered how much I hated ESTIMATE and CONTRAST statements.

Most interesting to me was the fact that this paper was originally presented at PharmaSUG, which I have never attended because I haven’t done anything with pharmaceuticals in years (and no, I’m not referring to those parties in college).

Right before Phil spoke, Dr. Patrick Thornton had given a talk on ODS that he had presented at PharmaSUG. He mentioned that there is way more than just information of interest to the pharmaceutical industry there and you should check it out. Although he was not speaking to me personally, I did check it out and found that there really are a LOT of interesting papers presented there, and it is in San Francisco next year, right close to home, so I just might head up that way.